Sparse Matrix1#


The fill rate of a matrix is the ratio between its non-zero and zero elements. If the latter significantly outnumber the former, we speak of sparse matrices. Depending on the sparsity pattern, some storage formats are more efficient than others. In every case, a sparse matrix is an object made up of multiple fields, as opposed to a single contiguous memory region of homogeneous type.
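As a small illustration of the fill rate, the following sketch counts the non-zero and zero elements of a dense matrix held as a list of rows. The helper name `fill_statistics` and the sample matrix are made up for this example; the density (non-zeros over all elements) is reported alongside the counts because it is the figure most libraries quote.

```python
def fill_statistics(rows):
    """Return (non-zero count, zero count, density) for a dense matrix.

    `rows` is a list of equally sized lists; this helper is hypothetical
    and only meant to illustrate the definition in the text.
    """
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for value in row if value != 0)
    zero = total - nonzero
    return nonzero, zero, nonzero / total


# Illustrative 4x3 matrix with only two non-zero entries.
matrix = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]
nonzero, zero, density = fill_statistics(matrix)
print(f"non-zero: {nonzero}, zero: {zero}, density: {density:.3f}")
```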

Netlib considers the following sparse storage formats:

| Description | h5::dapl_t |
|---|---|
| Compressed Sparse Row | h5::sparse::csr |
| Compressed Sparse Column | h5::sparse::csc |
| Block Compressed Sparse Storage | h5::sparse::bcrs |
| Compressed Diagonal Storage | h5::sparse::cds |
| Jagged Diagonal Storage | h5::sparse::jds |
| Skyline Storage | h5::sparse::ss |
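To make the "multiple fields" idea concrete for the first format in the list, here is a minimal pure-Python sketch that converts a dense matrix into the three arrays commonly used for Compressed Sparse Row storage. The function name `dense_to_csr` and the array names `data`, `indices`, `indptr` are assumptions borrowed from the SciPy naming convention; they are not part of the H5CPP API.

```python
def dense_to_csr(rows):
    """Convert a dense list-of-rows matrix into CSR arrays.

    Returns (data, indices, indptr) where:
      data    holds the non-zero values, row by row,
      indices holds the column index of each value in `data`,
      indptr  holds, for each row, the offset of its first value in `data`,
              plus one trailing entry equal to len(data).
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))  # end of this row, start of the next
    return data, indices, indptr


# Illustrative 4x4 matrix with four non-zero entries.
dense = [
    [5, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 3, 0],
    [0, 6, 0, 0],
]
data, indices, indptr = dense_to_csr(dense)
print(data, indices, indptr)
```

Compressed Sparse Column storage follows the same scheme with the roles of rows and columns exchanged.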

Multi Dataset Storage Format#

Single Dataset Storage Format#

TODO: write code and documentation

Interop With Other Systems#

Python#

Alex Wolf discusses HDF5 and sparse matrix formats; neither h5py nor PyTables supports sparse matrices directly.

PyTables#

PyTables has no direct support for saving or loading sparse matrices; the components have to be read as separate datasets and reassembled:

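import scipy.sparse as sp_sparse
import tables

filename = "matrix.h5"  # hypothetical path; replace with your own file

with tables.open_file(filename, 'r') as f:
    # the CSC components are stored as separate datasets under /matrix
    mat_group = f.get_node(f.root, 'matrix')
    data = getattr(mat_group, 'data').read()
    indices = getattr(mat_group, 'indices').read()
    indptr = getattr(mat_group, 'indptr').read()
    shape = getattr(mat_group, 'shape').read()
    # reassemble the Compressed Sparse Column matrix from its parts
    matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)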
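For completeness, the writing side could look roughly like the sketch below. It assumes the same layout the reader expects (a `matrix` group holding `data`, `indices`, `indptr` and `shape` arrays); the helper name `write_csc`, the temporary file name, and the sample values are made up for illustration.

```python
import os
import tempfile

import numpy as np
import tables


def write_csc(filename, data, indices, indptr, shape):
    """Store CSC components as separate arrays under /matrix (assumed layout)."""
    with tables.open_file(filename, "w") as f:
        group = f.create_group(f.root, "matrix")
        f.create_array(group, "data", obj=np.asarray(data))
        f.create_array(group, "indices", obj=np.asarray(indices))
        f.create_array(group, "indptr", obj=np.asarray(indptr))
        f.create_array(group, "shape", obj=np.asarray(shape))


# The 3x2 matrix [[1, 0], [0, 2], [3, 0]] in CSC form:
# column 0 holds rows 0 and 2, column 1 holds row 1.
path = os.path.join(tempfile.mkdtemp(), "sparse_example.h5")
write_csc(
    path,
    data=[1.0, 3.0, 2.0],
    indices=[0, 2, 1],
    indptr=[0, 2, 3],
    shape=(3, 2),
)
```

A file written this way can be read back with the PyTables snippet above by pointing `filename` at `path`.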

H5PY#

h5sparse#

Julia#

SparseArrays uses the Compressed Sparse Column format, and the official JLD format can save and load sparse matrices. Less fortunate is how the datasets are organized within the HDF5 container: the actual data is placed under the _refs group. The screenshot shows the sparse matrices A and B saved from Julia, next to a Python h5sparse file for comparison. On the bright side, the Julia HDF5 package is full-featured, so loading the sparse matrices into h5py is possible.

using JLD, SparseArrays

A = sprand(Float64, 10,20, 0.1)
B = sprand(Float64, 10,20, 0.1)
@save "interop.h5" "data-01/A" A "data-02/B" B 

R#

Bio Informatics#

Loompy#

Loom is an efficient file format for large omics datasets. Loom files contain a main matrix, optional additional layers, a variable number of row and column annotations, and sparse graph objects. Under the hood, Loom files are HDF5 and can be opened from many programming languages, including Python, R, C, C++, Java, MATLAB, Mathematica, and Julia.

10x Genomics#

The top level of the file contains a single HDF5 group, called matrix, and metadata stored as HDF5 attributes. Within the matrix group are datasets containing the dimensions of the matrix, the matrix entries, as well as the features and cell-barcodes associated with the matrix rows and columns, respectively.


| Column | Type | Description |
|---|---|---|
| barcodes | string | Barcode sequences and their corresponding GEM wells (e.g. AAACGGGCAGCTCGAC-1) |
| data | uint32 | Nonzero UMI counts in column-major order |
| indices | uint32 | Zero-based row index of corresponding element in data |
| indptr | uint32 | Zero-based index into data / indices of the start of each column, i.e., the data corresponding to each barcode sequence |
| shape | uint64 | Tuple of (# rows, # columns) indicating the matrix dimensions |
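How the `data`, `indices`, `indptr` and `shape` datasets fit together can be sketched in a few lines of pure Python. The function name `csc_to_dense` and the sample values are hypothetical; the sketch also assumes that `indptr` carries one trailing entry equal to the number of stored values, as in the SciPy CSC convention used by the PyTables snippet earlier.

```python
def csc_to_dense(data, indices, indptr, shape):
    """Expand column-major (CSC) arrays into a dense list of rows.

    Assumes len(indptr) == number of columns + 1, with the last entry
    equal to len(data).
    """
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # values of this column (one barcode) live in data[start:end]
        start, end = indptr[col], indptr[col + 1]
        for k in range(start, end):
            dense[indices[k]][col] = data[k]
    return dense


# Made-up counts for 3 features (rows) and 2 barcodes (columns).
data = [1, 3, 2]
indices = [0, 2, 1]
indptr = [0, 2, 3]
shape = (3, 2)

dense = csc_to_dense(data, indices, indptr, shape)
for row in dense:
    print(row)
```

For real datasets the dense expansion is rarely affordable; it is shown here only to clarify the meaning of each field.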

  1. Material based on Netlib Documentation