Saving / loading a scipy sparse csr_matrix in a portable data format

Question 1

How do you save / load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially, I used pickle (with protocol=2 and fix_imports=True), but this didn't work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit); I got the error:

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.

Question 2

edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
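
As a minimal sketch of the file-like variant, the example below writes to an in-memory io.BytesIO buffer instead of a file on disk. The matrix contents and the variable names are made up for illustration.

```python
import io

import numpy as np
from scipy import sparse

# Hypothetical example matrix; any CSR matrix would do.
matrix = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0]]))

buffer = io.BytesIO()               # in-memory stand-in for open(..., "wb")
sparse.save_npz(buffer, matrix)
buffer.seek(0)                      # rewind before reading the bytes back
restored = sparse.load_npz(buffer)

print(restored.format, restored.shape)
```

The same pattern should apply to a real file opened in binary mode.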

I got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:
new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

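import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends ".npz" if filename does not already end with it
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # pass the full file name here, including the ".npz" extension
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])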
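
To make the three attributes concrete, here is a small sketch that prints them for a 3x3 example matrix (chosen only for illustration) and rebuilds the matrix from them, without touching the disk.

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]]))

print(m.data)     # stored values, row by row
print(m.indices)  # column index of each stored value
print(m.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# The three arrays plus the shape are enough to recreate the matrix.
rebuilt = csr_matrix((m.data, m.indices, m.indptr), shape=m.shape)
```

The shape has to be saved as well, because trailing empty columns cannot be inferred from the three arrays alone.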

Question 3

Though you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)
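
Part of what makes this route portable is that Matrix Market is a plain-text format. The sketch below, which uses a temporary directory and an import alias chosen for this example, writes the same matrix, reads the first line of the file, and converts the loaded COO result back to CSR.

```python
import os
import tempfile

from scipy import io as scipy_io, sparse

m = sparse.csr_matrix([[0, 0, 0], [1, 0, 0], [0, 1, 0]])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.mtx")
    scipy_io.mmwrite(path, m)
    with open(path) as handle:
        header = handle.readline().strip()    # first line names the format
    restored = scipy_io.mmread(path).tocsr()  # mmread returns COO; convert back

print(header)
print(restored.format)
```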

Question 4

Here is a performance comparison of the three most upvoted answers, using a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
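
The 100M figure is simply density x rows x columns (0.001 x 1,000,000 x 100,000). A scaled-down sketch of the same arithmetic, with dimensions picked only to keep it fast:

```python
from scipy.sparse import random

# Scaled-down stand-in for the 1M x 100K benchmark matrix.
rows, cols, density = 1000, 100, 0.01
small = random(rows, cols, density=density, format='csr')

expected_nnz = int(round(density * rows * cols))
print(small.nnz, expected_nnz)
```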

`io.mmwrite` / `io.mmread`

from scipy import io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(Note that the format was changed from csr to coo.)

`np.savez` / `np.load`

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

`cPickle`

import cPickle as pickle

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)
def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)    
    return matrix    

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.

Conclusion

(Based on this simple test for CSR matrices) cPickle is the fastest method, but it doesn't work with very large matrices; np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file, and restores the wrong format. So np.savez is the winner here.
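
For readers who want to repeat the comparison outside a notebook, here is a rough timing sketch for Python 3. The matrix is far smaller than the one benchmarked above, the helper names are invented for this example, and timings at this scale will not necessarily rank the methods the same way.

```python
import os
import pickle
import tempfile
import time

import numpy as np
from scipy import io as scipy_io
from scipy.sparse import random

matrix = random(2000, 1000, density=0.01, format='csr')

def save_npz_parts(path, m):
    # Same idea as save_sparse_csr: store the three CSR arrays plus the shape.
    np.savez(path, data=m.data, indices=m.indices, indptr=m.indptr, shape=m.shape)

def save_pickle(path, m):
    with open(path, 'wb') as outfile:
        pickle.dump(m, outfile, pickle.HIGHEST_PROTOCOL)

def timed(label, func, *args):
    # Return the label together with the wall-clock duration of one call.
    start = time.perf_counter()
    func(*args)
    return label, time.perf_counter() - start

results = []
with tempfile.TemporaryDirectory() as tmpdir:
    results.append(timed('np.savez', save_npz_parts, os.path.join(tmpdir, 'm.npz'), matrix))
    results.append(timed('pickle', save_pickle, os.path.join(tmpdir, 'm.pkl'), matrix))
    results.append(timed('io.mmwrite', scipy_io.mmwrite, os.path.join(tmpdir, 'm.mtx'), matrix))

for label, seconds in results:
    print(f'{label:12s} {seconds:.4f} s')
```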

Question 5

Now you can use scipy.sparse.save_npz: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html

Question 6

Assuming you have scipy on both machines, you can just use pickle.

However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you'll wind up with a huge file.

At any rate, you should be able to do this:

import cPickle as pickle
import numpy as np
import scipy.sparse

# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)

with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

You can then load it with:

import cPickle as pickle

with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
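
The advice about binary protocols can be checked in memory. This sketch uses Python 3's pickle module (not cPickle) and a random test matrix chosen for illustration; it compares the serialized size under protocol 0 with the size under the highest available protocol.

```python
import pickle

from scipy import sparse

x = sparse.random(200, 200, density=0.1, format='csr')

text_bytes = pickle.dumps(x, protocol=0)                          # oldest, text-oriented protocol
binary_bytes = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)  # binary protocol
print(len(text_bytes), len(binary_bytes))

restored = pickle.loads(binary_bytes)
```

Note that the highest protocol of a newer Python may not be readable by an older one, so a fixed protocol number is the safer choice when the reader runs a different version.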

Question 7

As of scipy 0.19.0, you can save and load sparse matrices this way:

from scipy import sparse

data = sparse.csr_matrix((3, 4))

#Save
sparse.save_npz('data_sparse.npz', data)

#Load
data = sparse.load_npz("data_sparse.npz")
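
A slightly longer sketch of the same functions, assuming a random CSC test matrix and a temporary directory: it shows the compressed flag of save_npz and that the sparse format survives the round trip.

```python
import os
import tempfile

from scipy import sparse

matrix = sparse.random(500, 500, density=0.01, format='csc')

with tempfile.TemporaryDirectory() as tmpdir:
    packed = os.path.join(tmpdir, 'packed.npz')
    plain = os.path.join(tmpdir, 'plain.npz')
    sparse.save_npz(packed, matrix)                   # compressed=True is the default
    sparse.save_npz(plain, matrix, compressed=False)  # faster to write, usually larger
    sizes = (os.path.getsize(packed), os.path.getsize(plain))
    restored = sparse.load_npz(plain)

print(sizes, restored.format)
```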

Question 8

EDIT Apparently it is simple enough to:

def sparse_matrix_tuples(m):
    yield from m.todok().items()

That yields ((i, j), value) tuples, which are easy to serialize and deserialize. I'm not sure how it compares performance-wise with the code below for csr_matrix, but it's definitely simpler. I'm leaving the original answer below, as I hope it's informative.

Adding my two cents: for me, npz is not portable, as I cannot use it to export my matrix easily to non-Python clients (e.g. PostgreSQL; glad to be corrected). So I would have liked to get CSV output for the sparse matrix (much like you would get it when you print() the sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code produces the CSV output. You can adapt it for other representations.

def csr_matrix_tuples(m):
    # Row i owns the slice indptr[i]:indptr[i + 1] of indices and data;
    # for an empty row the slice is empty and nothing is yielded.
    for i in range(m.shape[0]):
        start_index, end_index = m.indptr[i], m.indptr[i + 1]
        for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
            yield (i, j, data)

for i, j, data in csr_matrix_tuples(my_csr_matrix):
    print(i, j, data, sep=',')

From what I've tested, it is about 2 times slower than save_npz in the current implementation.
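
For completeness, here is a sketch of the way back: reading "row,column,value" lines into a matrix again. It uses an in-memory text buffer and a small example matrix, both chosen for illustration. The CSV lines do not carry the matrix shape, so the shape has to be passed along separately.

```python
import csv
import io

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

original = csr_matrix(np.array([[0.0, 2.5, 0.0],
                                [0.0, 0.0, 0.0],
                                [1.0, 0.0, 4.0]]))

# Write one "row,column,value" line per stored entry.
buffer = io.StringIO()
writer = csv.writer(buffer)
coo = original.tocoo()
for i, j, value in zip(coo.row, coo.col, coo.data):
    writer.writerow([int(i), int(j), float(value)])

# Read the triples back and rebuild the matrix from them.
buffer.seek(0)
rows, cols, values = [], [], []
for i, j, value in csv.reader(buffer):
    rows.append(int(i))
    cols.append(int(j))
    values.append(float(value))

restored = coo_matrix((values, (rows, cols)), shape=original.shape).tocsr()
```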

Question 9

This is what I used to save a lil_matrix.

import numpy as np
from scipy.sparse import lil_matrix

def save_sparse_lil(filename, array):
    # use np.savez_compressed(..) for compression
    np.savez(filename, dtype=array.dtype.str, data=array.data,
        rows=array.rows, shape=array.shape)

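def load_sparse_lil(filename):
    # data and rows are object arrays, so recent NumPy versions need allow_pickle=True
    loader = np.load(filename, allow_pickle=True)
    result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
    result.data = loader["data"]
    result.rows = loader["rows"]
    return result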

I must say I found NumPy's np.load(..) to be very slow. This is my current solution, which I feel runs much faster:

from scipy.sparse import lil_matrix
import numpy as np
import json

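def lil_matrix_to_dict(myarray):
    # .tolist() turns the object arrays of per-row lists into plain nested
    # lists, because json cannot serialize ndarrays directly
    result = {
        "dtype": myarray.dtype.str,
        "shape": myarray.shape,
        "data":  myarray.data.tolist(),
        "rows":  myarray.rows.tolist()
    }
    return result

def lil_matrix_from_dict(mydict):
    result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
    # assign row by row so that every entry stays a plain Python list
    for i, (row, data) in enumerate(zip(mydict["rows"], mydict["data"])):
        result.rows[i] = row
        result.data[i] = data
    return result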

def load_lil_matrix(filename):
    result = None
    with open(filename, "r", encoding="utf-8") as infile:
        mydict = json.load(infile)
        result = lil_matrix_from_dict(mydict)
    return result

def save_lil_matrix(filename, myarray):
    with open(filename, "w", encoding="utf-8") as outfile:
        mydict = lil_matrix_to_dict(myarray)
        json.dump(mydict, outfile)

Question 10

This works for me:

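import numpy as np
import scipy.sparse as sp
file = "matrices.npz"  # example file name (placeholder)
x = sp.csr_matrix([1,2,3])
y = sp.csr_matrix([2,3,4])
np.savez(file, x=x, y=y)
# each matrix is stored as a pickled object array, so recent NumPy versions need allow_pickle=True
npz = np.load(file, allow_pickle=True)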

>>> npz['x'].tolist()
<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

>>> npz['x'].tolist().toarray()
array([[1, 2, 3]], dtype=int64)

The trick was to call .tolist() to convert the shape-0 object array into the original object.
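
What comes back from the archive can be inspected directly. The sketch below repeats the round trip through an in-memory buffer (an assumption made to avoid touching the disk) and shows that each entry is a zero-dimensional object array wrapping the matrix. Because this route relies on pickle, the file is only as portable as pickle itself.

```python
import io

import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix([1, 2, 3])

buffer = io.BytesIO()
np.savez(buffer, x=x)
buffer.seek(0)

npz = np.load(buffer, allow_pickle=True)  # object arrays are stored via pickle
wrapped = npz['x']
print(wrapped.shape, wrapped.dtype)       # shape and dtype of the wrapper array
restored = wrapped.item()                 # .item() unwraps it, just like .tolist()
```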

Question 11

I was asked to send the matrix in a simple and generic format:

<x,y,value>

I ended up with this:

import numpy as np

def save_sparse_matrix(m, filename):
    nonZeros = np.array(m.nonzero())
    # the with block closes the file even if writing fails
    with open(filename, 'w') as thefile:
        for entry in range(nonZeros.shape[1]):
            thefile.write("%s,%s,%s\n" % (nonZeros[0, entry], nonZeros[1, entry], m[nonZeros[0, entry], nonZeros[1, entry]]))
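
An alternative sketch, not the code from the answer above: converting to COO first gives parallel row, column and value arrays, which avoids indexing the matrix once per entry. The function name write_triples and the example matrix are made up for illustration. One difference to be aware of is that COO lists every stored entry, including explicitly stored zeros, whereas nonzero() skips them.

```python
import io

import numpy as np
from scipy.sparse import csr_matrix

def write_triples(m, handle):
    """Write one "row,column,value" line per stored entry of m."""
    coo = m.tocoo()   # COO exposes parallel row / col / data arrays
    for x, y, value in zip(coo.row, coo.col, coo.data):
        handle.write("%s,%s,%s\n" % (x, y, value))

m = csr_matrix(np.array([[0, 7, 0], [0, 0, 0], [5, 0, 9]]))
buffer = io.StringIO()   # stands in for a file opened with open(filename, 'w')
write_triples(m, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```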
