HDF5#

Within the HDF5 container, datasets may be stored in compact, chunked or contiguous layouts. Stored datasets are referenced by strings whose components are separated with the slash character: /. The directory entries (non-leaf nodes) are called groups h5::gr_t, and the leaf nodes are the datasets h5::ds_t and named types h5::dt_t. Groups, datasets and named types can have h5::att_t attributes attached. At first glance HDF5 appears as a regular file system with a rich set of API calls.

Layouts#

Chunked Layout and Partial IO#

An economical way to access massive datasets is to break them into smaller blocks, or chunks. While the CAPI supports complex selection of regions, for now H5CPP provides only the economical means of sub-setting: h5::block{} and h5::stride{}. (1)

Chunked layout may be requested by creating a dataset with h5::chunk{..} added to the dataset creation property list, which implicitly sets the h5::layout_chunked flag.

The contents of [..] are other optional dataset properties, fd is an open HDF5 file descriptor of type h5::fd_t, and ... denotes omitted size definitions:

h5::ds_t ds = h5::create<double>(fd, "dataset", ...,
        h5::chunk{4,8} [| h5::fill_value<double>{3} |  h5::gzip{9} ] );

Let M be a supported object type, or a raw memory region. For simplicity we pick an armadillo matrix: arma::mat M(20,16). In order to save its data into a larger dataset we need to pass the M object, the coordinates, and possibly strides and blocks: h5::write( ds, M, h5::offset{4,8}, h5::stride{2,2} ). The h5::write operator finds the memory location, the datatype and the size of the object, and passes these properties to the underlying IO calls.

When working with raw memory pointers, or when H5CPP doesn't yet know the object type, you need to specify the size of the object with h5::count{..}.

Example:

h5::write( ds,  M.memptr(), h5::count{5,10} [, ...] );

The above operations can be expressed in a single line. The following call creates a dataset of the appropriate size for partial IO with some filters, then writes the entire content of the M matrix into the dataset:

h5::write(fd, "dataset", M, h5::chunk{4,8} | h5::fill_value<double>{3} |  h5::gzip{9} );

To learn more through examples, click here.

(1) The rationale behind the decision is simplicity. Sub-setting requires loading data from disk into memory and then filtering out the selected data, which doesn't save IO bandwidth but adds complexity.

Contiguous Layout and IO Access#

The simplest form of IO is to read a dataset entirely into memory, or write it to disk in one operation. The upside is reduced overhead when working with a large number of small datasets. Indeed, when an object is saved in a single IO operation and no filtering is specified, H5CPP will choose this access pattern. The downside of this simplicity is the lack of filtering. This layout is handy for small datasets.

Example: in the simplest case, h5::write opens arma.h5 with write access, creates a dataset with the right dimensions, and commences the data transfer.

arma::vec V( {1.,2.,3.,4.,5.,6.,7.,8.});
h5::write( "arma.h5", "one shot create write",  V);

To force contiguous layout you need to pass the h5::contigous flag with h5::dcpl_t. Dumping the dataset created above shows the resulting layout:

DATASET "one shot create write" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 8 ) / ( 8 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 64
         OFFSET 5888
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
   }

Compact Layout#

Compact layout is meant for storing tiny datasets, perhaps the nodes of a very large graph.

Data Space and Dimensions#

A data space is a way to tell the system how in-memory data is mapped to the file (or the reverse). As an example, picture a block of data in consecutive memory locations that you wish to write to a cube shaped dataset. In addition, the data space may be of fixed size, or extendable to a definite or unlimited size along some dimension.

When working with supported objects, the in-memory dataspace is precomputed for you. When passing raw pointers to IO operators, the filespace determines the amount of memory used.

List to describe dimensions of a dataset:

List how to select from datasets for read or write:

Note: h5::stride, h5::block and scatter-gather operations don't work when h5::high_throughput is set, for performance reasons.
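The selection descriptors correspond to the HDF5 hyperslab model: along each dimension, count blocks of block consecutive elements are selected, with the blocks starting stride elements apart from offset. The arithmetic for a single dimension can be sketched in standalone C++ as below. `selected_indices` is a hypothetical helper written for this illustration and is not part of H5CPP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of H5CPP: lists the indices that a
// one-dimensional hyperslab (offset, stride, count, block) selects.
std::vector<std::size_t> selected_indices(std::size_t offset, std::size_t stride,
                                          std::size_t count, std::size_t block) {
    std::vector<std::size_t> idx;
    idx.reserve(count * block);
    for (std::size_t i = 0; i < count; ++i)        // one iteration per block
        for (std::size_t j = 0; j < block; ++j)    // elements inside a block
            idx.push_back(offset + i * stride + j);
    return idx;
}
```

For example, offset 3, stride 3, count 2 and block 2 select the indices 3, 4, 6 and 7. A multi-dimensional selection is the Cartesian product of such per-dimension index sets.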

IO Operators#

Modern C++ provides a rich set of features for creating variables and implementing program logic that runs inside the compiler. This compile-time mechanism, or template meta-programming, makes it possible not only to match types but also to pass arguments in arbitrary order, much like what we find in Python. The main difference is in the implementation: the C++ version has no runtime overhead.

In the next sections we guide you through H5CPP's CRUD-like operators: h5::create, h5::read, h5::write, h5::append and h5::open, h5::close. The function calls are given in EBNF notation, and we start with a few common tokens.
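To make the idea of order-independent arguments concrete, here is a minimal C++17 sketch of one way to pick an argument out of a parameter pack by its type at compile time. The `offset` and `count` structs, `pick` and `last_index` are hypothetical names invented for this illustration; this is not how H5CPP itself is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical tag types standing in for descriptors such as h5::offset or h5::count.
struct offset { std::size_t value; };
struct count  { std::size_t value; };

// No arguments left to inspect: return the fallback.
template <class T>
constexpr T pick(T fallback) { return fallback; }

// Returns the first argument whose type is T, otherwise the fallback.
template <class T, class Head, class... Tail>
constexpr T pick(T fallback, const Head& head, const Tail&... tail) {
    if constexpr (std::is_same_v<T, Head>)
        return head;
    else
        return pick(fallback, tail...);
}

// Accepts offset and count in any order; either one may be omitted.
template <class... Args>
constexpr std::size_t last_index(const Args&... args) {
    const offset o = pick(offset{0}, args...);
    const count  c = pick(count{1}, args...);
    return o.value + c.value - 1;
}
```

With these definitions `last_index(offset{4}, count{3})` and `last_index(count{3}, offset{4})` both evaluate to 6, and the type dispatch is resolved entirely by the compiler.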

Think of HDF5 as a container, or an image of a file system with a non-POSIX API to access its content. These containers/images may be passed around with standard file operations between computers, while the content may be retrieved with HDF5 specific IO calls. To reference a container within a file-system you either need to pass an open file descriptor h5::fd_t or the full path to the HDF5 file:

file ::= const h5::fd_t& fd | const std::string& file_path;

An HDF5 Dataset is an object within the container, and to uniquely identify one you either have to pass an open dataset-descriptor h5::ds_t or tell the system where to find the container, and the dataset within. In the latter case the necessary shim code is generated to obtain h5::fd_t descriptor at compile time.

dataset ::= (const h5::fd_t& fd | 
    const std::string& file_path, const std::string& dataset_path ) | const h5::ds_t& ds;

HDF5 datasets may take up various shapes and sizes in memory and on disk. A dataspace is a descriptor that specifies the current size of the object, and whether it is capable of growing:

dataspace ::= const h5::sp_t& dataspace 
    | const h5::current_dims& current_dim [, const h5::max_dims& max_dims ] 
    [,const h5::current_dims& current_dim] , const h5::max_dims& max_dims;

T is the template parameter of an object. In the underlying implementation the element type is deduced at compile time, bringing you a flexible, abstract approach. The objects may be categorized into ones with contiguous memory blocks, such as matrices, vectors and C style POD structures, and complex types such as C++ classes. The latter objects are not yet fully supported. A more detailed explanation is given in this section.

OPEN#

The previous section explained the EBNF tokens: file, dataspace. The behaviour of the objects is controlled through property lists, and the syntax is rather simple:

[file]
h5::fd_t h5::open( const std::string& path,  H5F_ACC_RDWR | H5F_ACC_RDONLY [, const h5::fapl_t& fapl] );

[dataset]
h5::ds_t h5::open( const  h5::fd_t& fd, const std::string& path [, const h5::dapl_t& dapl] )

Property lists are: h5::fapl_t, h5::dapl_t

CREATE#

[file]
h5::fd_t h5::create( const std::string& path, H5F_ACC_TRUNC | H5F_ACC_EXCL
            [, const h5::fcpl_t& fcpl] [, const h5::fapl_t& fapl]);
[dataset]
template <typename T> h5::ds_t h5::create<T>( file, const std::string& dataset_path, dataspace
    [, const h5::lcpl_t& lcpl] [, const h5::dcpl_t& dcpl] [, const h5::dapl_t& dapl]  );
[attribute]
    ..TBD..

Property lists are: h5::fcpl_t, h5::fapl_t, h5::lcpl_t, h5::dcpl_t, h5::dapl_t

Example: to create an HDF5 container, and a dataset within:

#include <h5cpp/all>
...
arma::mat M(2,3);
h5::fd_t fd = h5::create("arma.h5",H5F_ACC_TRUNC);
h5::ds_t ds = h5::create<short>(fd,"dataset/path/object.name"
                ,h5::current_dims{10,20}
                ,h5::max_dims{10,H5S_UNLIMITED}
                ,h5::chunk{2,3} | h5::fill_value<short>{3} |  h5::gzip{9}
        );
//attributes:
ds["attribute-name"] = std::vector<int>(10);
...

READ#

There are two kinds of operators:

Keep in mind that the underlying HDF5 system always reserves a chunk-sized buffer for data transfer, usually for filtering and/or data conversion. Nevertheless, this data transfer buffer is minimal, as under ideal conditions the chunks should be no larger than the level 3 cache of the processor.

template <typename T> T h5::read( dataset
    [, const h5::offset_t& offset]  [, const h5::stride_t& stride] [, const h5::count_t& count]
    [, const h5::dxpl_t& dxpl ] );
template <typename T> h5::err_t h5::read( dataset, T& ref 
    [, const h5::offset_t& offset]  [, const h5::stride_t& stride] [, const h5::count_t& count]
    [, const h5::dxpl_t& dxpl ] );

Property lists are: dxpl_t

Example: to read a 10x5 matrix from a 3D array from location {3,4,1}:

#include <armadillo>
#include <h5cpp/all>
...
auto fd = h5::open("some_file.h5", H5F_ACC_RDWR);
/* the RVO arma::Mat<double> object will have the size 10x5 filled*/
try {
    /* will drop extents of unit dimension returns a 2D object */
    auto M = h5::read<arma::mat>(fd,"path/to/object", 
            h5::offset{3,4,1}, h5::count{10,1,5}, h5::stride{3,1,1} ,h5::block{2,1,1} );
} catch (const std::runtime_error& ex ){
    ...
}

WRITE#

There are two kinds of operators:

Keep in mind that the underlying HDF5 system always reserves a chunk-sized buffer for data transfer, usually for filtering and/or data conversion. Nevertheless, this data transfer buffer is minimal, as under ideal conditions the chunks should be no larger than the level 3 cache of the processor.

template <typename T> h5::err_t h5::write( dataset,  const T& ref
    [,const h5::offset_t& offset] [,const h5::stride_t& stride]  [,const h5::dxpl_t& dxpl] );
template <typename T> h5::err_t h5::write( dataset, const T* ptr
    [,const hsize_t* offset] [,const hsize_t* stride] ,const hsize_t* count [, const h5::dxpl_t& dxpl ]);

Property lists are: dxpl_t

#include <Eigen/Dense>
#include <h5cpp/all>

Eigen::MatrixXd M = Eigen::MatrixXd::Zero(3,4); // the object to save
h5::fd_t fd = h5::create("some_file.h5",H5F_ACC_TRUNC);
h5::write(fd,"/result",M);

APPEND#

When receiving a stream of data, packet tables are the way to go. While this operator relies on its own h5::pt_t descriptor, the underlying dataset is just the same old one introduced in the previous section. h5::pt_t handles are seamlessly convertible to h5::ds_t and vice versa.

However, the similarity ends there. h5::pt_t internals are different from other H5CPP handles, as it has an internal buffer and a custom data transfer pipeline. This pipeline can also be used in regular data transfer operations by adding h5::experimental to data transfer property lists. The experimental pipeline is documented here.

#include <h5cpp/core>
    #include "your_data_definition.h"
#include <h5cpp/io>
template <typename T> void h5::append(h5::pt_t& ds, const T& ref);

example:

#include <h5cpp/core>
    #include "your_data_definition.h"
#include <h5cpp/io>
auto fd = h5::create("NYSE high freq dataset.h5", H5F_ACC_TRUNC);
h5::pt_t pt = h5::create<ns::nyse_stock_quote>( fd, 
        "price_quotes/2018-01-05.qte",h5::max_dims{H5S_UNLIMITED}, h5::chunk{1024} | h5::gzip{9} );
ns::nyse_stock_quote qu; // same record type as the packet table elements

bool having_a_good_day{true};
while( having_a_good_day ){
    try{
        receive_data_from_udp_stream( qu );
        h5::append(pt, qu);
    } catch ( ... ){
      if( cant_fix_connection() )
            having_a_good_day = false; 
    }
}
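The text above mentions that h5::pt_t keeps an internal buffer. The general idea of such buffering can be sketched without HDF5: records accumulate in memory and are handed to a writer one full chunk at a time. `chunk_buffer` and its `sink` callback are hypothetical names for this illustration, and the actual h5::pt_t pipeline may behave differently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a packet-table style writer: records are
// collected in memory and passed to `sink` one full chunk at a time.
template <class T>
class chunk_buffer {
public:
    using sink_t = std::function<void(const std::vector<T>&)>;

    chunk_buffer(std::size_t chunk_size, sink_t sink)
        : chunk_size_(chunk_size), sink_(std::move(sink)) {
        buffer_.reserve(chunk_size_);
    }
    // A partially filled chunk is written out when the buffer is destroyed.
    ~chunk_buffer() { flush(); }

    chunk_buffer(const chunk_buffer&) = delete;
    chunk_buffer& operator=(const chunk_buffer&) = delete;

    void append(const T& record) {
        buffer_.push_back(record);
        if (buffer_.size() == chunk_size_) flush();
    }
    void flush() {
        if (buffer_.empty()) return;
        sink_(buffer_);
        buffer_.clear();
    }
private:
    std::size_t chunk_size_;
    sink_t sink_;
    std::vector<T> buffer_;
};
```

Appending ten records with a chunk size of four triggers two full-chunk writes during the loop and one final write of the remaining two records on destruction, so the number of IO calls scales with the number of chunks rather than the number of records.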

Supported Objects#

Linear Algebra#

H5CPP simplifies object persistence by implementing CREATE, READ, WRITE, APPEND operations on fixed or variable length N dimensional arrays. This header-only implementation supports raw pointers | armadillo | eigen3 | blaze | blitz++ | it++ | dlib | uBlas by operating directly on the underlying data-store, avoiding intermediate/temporary memory allocations and using copy elision for returning objects:

arma::mat rvo = h5::read<arma::mat>(fd, "path_to_object"); //return value optimization:RVO

For high performance operations, e.g. within loops, update the content with a partial IO call:

h5::ds_t ds = h5::open( ... );      // open dataset
arma::mat M(n_rows,n_cols);         // create placeholder, data-space is reserved on the heap
h5::count_t  count{n_rows,n_cols};  // describe the memory region you are reading into
h5::offset_t offset{0,0};           // position we are reading data from
// high performance loop with minimal memory operations
for( auto i: column_indices )
    h5::read(ds, M, count, offset); // count, offset and other properties may be specified in any order

List of objects supported in EBNF:

T := ([unsigned] ( int8_t | int16_t | int32_t | int64_t )) | ( float | double  )
S := T | c/c++ struct | std::string
ref := std::vector<S> 
    | arma::Row<T> | arma::Col<T> | arma::Mat<T> | arma::Cube<T> 
    | Eigen::Matrix<T,Dynamic,Dynamic> | Eigen::Matrix<T,Dynamic,1> | Eigen::Matrix<T,1,Dynamic>
    | Eigen::Array<T,Dynamic,Dynamic>  | Eigen::Array<T,Dynamic,1>  | Eigen::Array<T,1,Dynamic>
    | blaze::DynamicVector<T,blaze::rowVector> |  blaze::DynamicVector<T,blaze::colVector>
    | blaze::DynamicMatrix<T,blaze::rowMajor>  |  blaze::DynamicMatrix<T,blaze::colMajor>
    | itpp::Mat<T> | itpp::Vec<T>
    | blitz::Array<T,1> | blitz::Array<T,2> | blitz::Array<T,3>
    | dlib::Matrix<T>   | dlib::Vector<T,1> 
    | ublas::matrix<T>  | ublas::vector<T>
ptr     := T* 
accept  := ref | ptr 

Here is a chart of how the supported linear algebra systems implement accessors and memory layout:

        data            num elements  vec   mat:rm                mat:cm                   cube
-------------------------------------------------------------------------------------------------------------------------
eigen {.data()}          {size()}          {rows():1,cols():0}    {cols():0,rows():1}     {n/a}
arma  {.memptr()}        {n_elem}                                 {n_rows:0,n_cols:1}     {n_slices:2,n_rows:0,n_cols:1}
blaze {.data()}          {n/a}             {columns():1,rows():0} {rows():0,columns():1}  {n/a}
blitz {.data()}          {size()}          {cols:1,  rows:0}                              {slices:2, cols:1,rows:0} 
itpp  {._data()}         {length()}        {cols():1,rows():0}
ublas {.data().begin()}  {n/a}             {size2():1, size1():0}
dlib  {&ref(0,0)}        {size()}          {nc():1,    nr():0}

Storage Layout: Row / Column ordering#

H5CPP guarantees zero copy, platform and system independent correct behaviour between supported linear algebra Matrices. In linear algebra the de-facto standard is column major ordering similarly to Fortran. However this is changing and many of the above listed linear algebra systems support row-major ordering as well.

Currently there is no easy way to automatically transpose a column major matrix such as arma::mat into row major storage. One solution would be to do the actual transpose operation when loading/saving the matrix, by means of a custom filter. The alternative is to mark the object as transposed, following the BLAS strategy. The latter approach has minimal impact on performance, but requires cooperation from other library writers. Unfortunately the HDF5 CAPI doesn't support either of them. Nevertheless, manual transpose always works, and is supported by most linear algebra systems.
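As a reminder of what a manual transpose amounts to at the buffer level, the following standalone sketch copies a column-major buffer into row-major order. `to_row_major` is a hypothetical helper for illustration only; in practice you would normally use the transpose facility of your linear algebra library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies a column-major rows x cols matrix into row-major order.
// Element (r, c) sits at c * rows + r in the input and at r * cols + c in the output.
// Precondition: col_major.size() == rows * cols.
template <class T>
std::vector<T> to_row_major(const std::vector<T>& col_major,
                            std::size_t rows, std::size_t cols) {
    std::vector<T> row_major(col_major.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            row_major[r * cols + c] = col_major[c * rows + r];
    return row_major;
}
```

For the 2x3 matrix with rows (1,2,3) and (4,5,6), the column-major buffer {1,4,2,5,3,6} becomes the row-major buffer {1,2,3,4,5,6}.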

Sparse Matrices/Vectors#

Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC) formats will be supported. The actual storage format may be multiple objects inside an h5::gr_t group, or a single compound data type as a placeholder for the indices and the actual data. Special structures such as block diagonal, tridiagonal and triangular matrices are not yet supported. Nevertheless, H5CPP will follow the BLAS/LAPACK storage layout whenever possible.

The STL#

There are three notable categories from storage perspective:

Raw Pointers#

Currently only memory blocks in consecutive/adjacent locations, holding elementary or POD types, are supported. This method comes in handy when an object type is not supported: find a way to grab a pointer to its internal datastore and its size, then pass these as arguments. For read operations make sure there is enough memory reserved; for write operations you must specify the data transfer size with h5::count.

Example: loading data from HDF5 dataset to a memory location

my_object obj(100);
h5::read("file.h5","dataset.dat",obj.ptr(), h5::count{10,10}, h5::offset{100,0});

Compound Datatypes#

POD Struct/Records#

Arbitrarily deep and complex Plain Old Data (POD) structs are supported, either through the h5cpp compiler or by manually writing the necessary shim code. The following example was generated with the h5cpp compiler. Note that in the first step you have to specialize template<class Type> hid_t inline register_struct<Type>(); for the type you want to use it with, and return an HDF5 CAPI hid_t type identifier. This hid_t object references a memory location inside the HDF5 system, and will be automatically released with H5Tclose when used with H5CPP templates. The final step is to register this new type with the H5CPP type system: H5CPP_REGISTER_STRUCT(Type);.

namespace sn {
    struct PODstruct {
        ... 
        bool _bool;
    };
}
namespace h5{
    template<> hid_t inline register_struct<sn::PODstruct>(){
        hid_t ct_00 = H5Tcreate(H5T_COMPOUND, sizeof (sn::PODstruct));
        ...
        H5Tinsert(ct_00, "_bool",   HOFFSET(sn::PODstruct,_bool),H5T_NATIVE_HBOOL);
        return ct_00;
    };
}
H5CPP_REGISTER_STRUCT(sn::PODstruct);
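The shim code above hands each member's name and byte offset to H5Tinsert. What those offsets look like for a concrete record can be inspected with plain offsetof, as in the sketch below. The `quote` struct, the `field` record and `describe_quote` are hypothetical and exist only for this illustration; the member sizes are listed for convenience, whereas the CAPI call takes a type identifier instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type used only for this illustration.
struct quote {
    double price;
    int    volume;
    bool   valid;
};

// Name, byte offset and size of one struct member.
struct field { std::string name; std::size_t offset; std::size_t size; };

// Lists the members of `quote` in declaration order.
std::vector<field> describe_quote() {
    return {
        {"price",  offsetof(quote, price),  sizeof(double)},
        {"volume", offsetof(quote, volume), sizeof(int)},
        {"valid",  offsetof(quote, valid),  sizeof(bool)},
    };
}
```

Because of padding, the offsets are not simply the running sum of the member sizes, which is why the shim code computes them from the struct definition rather than by hand.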

The internal typesystem for POD/Record types supports:

C++ Classes#

Work in progress. Requires modification of the compiler as well as a coordinated effort on how to store complex objects such that other platforms are capable of reading them.

Strings#

HDF5 supports variable and fixed length strings. The former is of interest, as the most common way of storing strings in a file is consecutively, with a separator. The current storage layout is a heap data structure, making it less suitable for massive, terabyte scale storage. In addition, the strings have to be copied during read operations. Both filtering, such as h5::gzip{0-9}, and the h5::utf8 feature are supported.

not supported: wchar_t _wchar char16_t _wchar16 char32_t _wchar32

TODO: work out a new efficient storage mechanism for strings.

High Throughput Pipeline#

HDF5 comes with complex mechanisms for type conversion, filtering, scatter-gather functions, etc. But what if you need to engineer a system down to bare metal, without frills? The h5::high_throughput data access property replaces the standard data processing mechanism with BLAS level 3 blocking and a CPU cache aware filter chain, and delegates all calls to the optimized H5DOwrite_chunk and H5DOread_chunk calls.

Example: to save an arma::mat M(16,32) into an HDF5 dataset using direct chunk write, first pass the h5::high_throughput data access property when opening/creating the dataset, and make certain to choose chunked layout by setting h5::chunk{...}. Optional standard filters and fill values may be set; however, the dataset element type must match the element type of M, because no type conversion takes place.

h5::ds_t ds = h5::create<double>(fd,"bare metal IO",
    h5::current_dims{43,57},     // need not be a multiple of the chunk size
    h5::high_throughput,         // request IO pipeline
    h5::chunk{4,8} | h5::fill_value<double>{3} |  h5::gzip{9} );

You must align all IO calls to chunk boundaries: h5::offset % h5::chunk = 0. However, the dataset may have a non-aligned size: h5::count % h5::chunk != 0 -> OK. Optionally define the amount of data transferred with h5::count{..}. When h5::count{...} is not specified, the dimensions will be computed from the object. Notice that h5::offset{4,16} is set to a chunk boundary.

Saving data near the edges matches the behaviour of standard CAPI IO calls: the part of the chunk within the dataset boundary has the correct content, and the part outside is undefined.

h5::write( ds,  M, h5::count{4,8}, h5::offset{4,16} );
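The alignment rule can be checked with simple integer arithmetic before issuing the IO call. The sketch below is standalone C++; `is_chunk_aligned` and `chunks_to_cover` are hypothetical helpers written for this illustration and are not provided by H5CPP.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// True when every coordinate of `offset` is a multiple of the chunk extent
// in the same dimension, i.e. the transfer starts on a chunk boundary.
template <std::size_t N>
bool is_chunk_aligned(const std::array<std::size_t, N>& offset,
                      const std::array<std::size_t, N>& chunk) {
    for (std::size_t i = 0; i < N; ++i)
        if (chunk[i] == 0 || offset[i] % chunk[i] != 0) return false;
    return true;
}

// Number of chunks needed to cover `extent` elements, counting a partial
// edge chunk as a whole one. Precondition: chunk > 0.
inline std::size_t chunks_to_cover(std::size_t extent, std::size_t chunk) {
    return (extent + chunk - 1) / chunk;
}
```

With the 4x8 chunks used above, an offset of {4,16} is aligned while {4,12} is not, and a 43x57 dataset is covered by 11 x 8 chunks, the last row and column of which are only partially inside the dataset.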

Pros:

Cons:

The data set indeed is compressed, and readable from other systems:

HDF5 "arma.h5" {
GROUP "/" {
   DATASET "bare metal IO" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 40, 40 ) / ( 40, H5S_UNLIMITED ) }
      STORAGE_LAYOUT {
         CHUNKED ( 4, 8 )
         SIZE 79 (162.025:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 9 }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  3
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
   }
}
}

Type System#

At the core of H5CPP lies the type mapping mechanism to HDF5 NATIVE types. All type requests are redirected to this segment in one way or another. That includes supported vectors, matrices, cubes, C-like structs, etc. While HDF5 internally supports type translation among various binary representations, H5CPP restricts type handling to the most common case: the platform on which the program is intended to run. This does not violate the HDF5 use-anywhere policy; type conversion is simply delegated to hosts with a different binary representation. Since the most common processors are Intel and AMD, this approach has the advantage of skipping any conversion.

integral        := [ unsigned | signed ] [int_8 | int_16 | int_32 | int_64 | float | double ] 
vectors         :=  *integral
rugged_arrays   := **integral
string          := **char
linalg          := armadillo | eigen | ... 
scalar          := integral | pod_struct | string

# not handled yet: long double, complex, specialty types

Here is the relevant part responsible for type mapping:

#define H5CPP_REGISTER_TYPE_( C_TYPE, H5_TYPE )                                           \
namespace h5 { namespace impl { namespace detail {                                        \
    template <> struct hid_t<C_TYPE,H5Tclose,true,true,hdf5::type> : public dt_p<C_TYPE> {\
        using parent = dt_p<C_TYPE>;                                                      \
        using parent::hid_t;                                                              \
        using hidtype = C_TYPE;                                                           \
        hid_t() : parent( H5Tcopy( H5_TYPE ) ) {                                          \
            hid_t id = static_cast<hid_t>( *this );                                       \
            if constexpr ( std::is_pointer<C_TYPE>::value )                               \
                    H5Tset_size (id,H5T_VARIABLE), H5Tset_cset(id, H5T_CSET_UTF8);        \
        }                                                                                 \
    };                                                                                    \
}}}                                                                                       \
namespace h5 {                                                                            \
    template <> struct name<C_TYPE> {                                                     \
        static constexpr char const * value = #C_TYPE;                                    \
    };                                                                                    \
}                                                                                         \

Arithmetic types are associated with their NATIVE HDF5 equivalent:

H5CPP_REGISTER_TYPE_(bool,H5T_NATIVE_HBOOL)

H5CPP_REGISTER_TYPE_(unsigned char, H5T_NATIVE_UCHAR)           H5CPP_REGISTER_TYPE_(char, H5T_NATIVE_CHAR)
H5CPP_REGISTER_TYPE_(unsigned short, H5T_NATIVE_USHORT)         H5CPP_REGISTER_TYPE_(short, H5T_NATIVE_SHORT)
H5CPP_REGISTER_TYPE_(unsigned int, H5T_NATIVE_UINT)             H5CPP_REGISTER_TYPE_(int, H5T_NATIVE_INT)
H5CPP_REGISTER_TYPE_(unsigned long int, H5T_NATIVE_ULONG)       H5CPP_REGISTER_TYPE_(long int, H5T_NATIVE_LONG)
H5CPP_REGISTER_TYPE_(unsigned long long int, H5T_NATIVE_ULLONG) H5CPP_REGISTER_TYPE_(long long int, H5T_NATIVE_LLONG)
H5CPP_REGISTER_TYPE_(float, H5T_NATIVE_FLOAT)                   H5CPP_REGISTER_TYPE_(double, H5T_NATIVE_DOUBLE)
H5CPP_REGISTER_TYPE_(long double,H5T_NATIVE_LDOUBLE)

H5CPP_REGISTER_TYPE_(char*, H5T_C_S1)

Record/POD struct types are registered through this macro:

#define H5CPP_REGISTER_STRUCT( POD_STRUCT ) \
    H5CPP_REGISTER_TYPE_( POD_STRUCT, h5::register_struct<POD_STRUCT>() )

FYI: H5CPP_REGISTER_STRUCT is the only public macro.
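The name part of the registration macro relies on two standard techniques: explicit template specialization and preprocessor stringification. A reduced, standalone sketch of just that part is shown below. `type_name` and `REGISTER_TYPE_NAME` are hypothetical names chosen for this illustration and leave out everything related to HDF5 type identifiers.

```cpp
#include <cassert>
#include <cstring>

// The primary template is left undefined, so asking for the name of an
// unregistered type fails at compile time instead of at run time.
template <class T> struct type_name;

// Hypothetical macro modelled on the pattern above: `#C_TYPE` turns the
// type as spelled in the source into a string literal.
#define REGISTER_TYPE_NAME(C_TYPE)                        \
    template <> struct type_name<C_TYPE> {                \
        static constexpr const char* value = #C_TYPE;     \
    };

REGISTER_TYPE_NAME(double)
REGISTER_TYPE_NAME(unsigned int)
```

After registration, `type_name<double>::value` is the string "double", while `type_name<float>::value` does not compile because no specialization was registered for float.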

Using CAPI Functions#

By default the hid_t type is automatically converted to / from H5CPP h5::hid_t<T> templated identifiers. All HDF5 CAPI types are wrapped into the h5::impl::hid_t<T> internal template, keeping binary compatibility, with the exception of the h5::pt_t packet table handle.

T := [ file_handles | property_lists ]
file_handles   := [ fd_t | ds_t | att_t | err_t | grp_t | id_t | obj_t ]
property_lists := [ file | dataset | attrib | group | link | string | type | object ]

#            create       access       transfer     copy 
file    := [ h5::fcpl_t | h5::fapl_t                            ] 
dataset := [ h5::dcpl_t | h5::dapl_t | h5::dxpl_t               ]
attrib  := [ h5::acpl_t                                         ] 
group   := [ h5::gcpl_t | h5::gapl_t                            ]
link    := [ h5::lcpl_t | h5::lapl_t                            ]
string  := [              h5::scpl_t                            ] 
type    := [              h5::tapl_t                            ]
object  := [ h5::ocpl_t                           | h5::ocpyl_t ]

Property Lists#

The functions, macros, and subroutines listed here are used to manipulate property list objects in various ways, including to reset property values. With property lists, HDF5 functions can be implemented and used in applications with fewer parameters than would be required otherwise; this mechanism is similar to POSIX fcntl. Properties are grouped into classes, and the members of a class may be daisy chained to obtain a property list.

To give you an example of how to obtain a dataset creation property list with chunk, fill value, shuffling, nbit, fletcher32 filters and gzip compression set:

h5::dcpl_t dcpl = h5::chunk{2,3} 
    | h5::fill_value<short>{42} | h5::fletcher32 | h5::shuffle | h5::nbit | h5::gzip{9};
auto ds = h5::create("container.h5","/my/dataset.dat", h5::create_path | h5::utf8, dcpl, h5::default_dapl);

Properties may be passed in arbitrary order, by reference, or directly by daisy chaining them. A list of property descriptors:

#            create       access       transfer     copy 
file    := [ h5::fcpl_t | h5::fapl_t                            ] 
dataset := [ h5::dcpl_t | h5::dapl_t | h5::dxpl_t               ]
attrib  := [ h5::acpl_t                                         ] 
group   := [ h5::gcpl_t | h5::gapl_t                            ]
link    := [ h5::lcpl_t | h5::lapl_t                            ]
string  := [              h5::scpl_t                            ] 
type    := [              h5::tapl_t                            ]
object  := [ h5::ocpl_t                           | h5::ocpyl_t ]
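Daisy chaining with the | operator can be modelled in a few lines of standalone C++. In the sketch below a single property converts implicitly into a list, and each further | appends one more property. The `chunk`, `deflate` and `dcpl` types here are hypothetical stand-ins that only record what was applied; the real H5CPP property types wrap CAPI property list calls and are implemented differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for individual properties.
struct chunk   { std::size_t rows, cols; };
struct deflate { unsigned level; };

// A property list that records which settings were applied, in order.
struct dcpl {
    std::vector<std::string> applied;
    dcpl() = default;
    dcpl(const chunk& c)   { apply(c); }   // implicit: one property is already a list
    dcpl(const deflate& d) { apply(d); }
    void apply(const chunk& c) {
        applied.push_back("chunk " + std::to_string(c.rows) + "x" + std::to_string(c.cols));
    }
    void apply(const deflate& d) {
        applied.push_back("deflate " + std::to_string(d.level));
    }
};

// Chaining: the left operand becomes (or already is) a list, the right one is appended.
inline dcpl operator|(dcpl list, const chunk& c)   { list.apply(c); return list; }
inline dcpl operator|(dcpl list, const deflate& d) { list.apply(d); return list; }
```

With these definitions `dcpl p = chunk{2,3} | deflate{9};` yields a list holding both settings in the order written, and a lone `deflate{4}` can be passed wherever a `dcpl` is expected.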

Default Properties:#

set to value (different from HDF5 CAPI):

set to zero (same as HDF5 CAPI):

File Operations#

File Creation Property List#

// you may pass CAPI property list descriptors daisy chained with '|' operator 
auto fd = h5::create("002.h5", H5F_ACC_TRUNC, 
        h5::file_space_page_size{4096} | h5::userblock{512},  // file creation properties
        h5::fclose_degree_weak | h5::fapl_core{2048,1} );     // file access properties

File Access Property List#

Example:

h5::fapl_t fapl = h5::fclose_degree_weak | h5::fapl_core{2048,1} | h5::core_write_tracking{false,1} 
            | h5::fapl_family{H5F_FAMILY_DEFAULT,0};

Group Operations#

Group Creation Property List#

Group Access Property List#

Dataset Operations#

Dataset Creation Property List#

Example:

h5::dcpl_t dcpl = h5::chunk{1,4,5} | h5::deflate{4} | h5::layout_compact | h5::dont_filter_partial_chunks
        | h5::fill_value<my_struct>{STR} | h5::fill_time_never | h5::alloc_time_early 
        | h5::fletcher32 | h5::shuffle | h5::nbit;

Dataset Access Property List#

In addition to the CAPI properties, a custom high_throughput property is added to request an alternative, simpler but more efficient pipeline.

Dataset Transfer Property List#

Misc Operations#

Object Creation Property List#

Object Copy Property List#

MPI / Parallel Operations#

C++ Idioms#

RAII#

There are C++ mappings for hid_t ids, which reference objects with std::shared_ptr-like behaviour built on the HDF5 CAPI's internal reference counting. For further details see H5Iinc_ref, H5Idec_ref and H5Iget_ref. The internal representation of these objects is binary compatible with the CAPI hid_t, and the two are interchangeable depending on the conversion policy: H5_some_function( static_cast<hid_t>( h5::hid_t id ), ... ). Direct initialization h5::ds_t{ some_hid } bypasses reference counting, and is intended for the use case where you have to take ownership of a CAPI hid_t object reference. This is equivalent to the behaviour of std::shared_ptr: when the object is destroyed, the reference count is decreased.

{
    h5::ds_t ds = h5::open( ... ); 
} // resources are guaranteed to be released
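The ownership rules described above can be illustrated without HDF5 by replacing the CAPI reference counter with a small in-memory table. Everything in the sketch below is hypothetical: `fake_capi` stands in for the H5I reference counting calls and `handle` for an H5CPP descriptor. It only demonstrates the pattern and is not the library's code.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-in for the CAPI reference counter.
namespace fake_capi {
    using id_type = long;
    inline std::map<id_type, int>& refs() { static std::map<id_type, int> r; return r; }
    inline id_type open()          { static id_type next = 1; refs()[next] = 1; return next++; }
    inline void inc_ref(id_type i) { ++refs()[i]; }
    inline void dec_ref(id_type i) { if (--refs()[i] == 0) refs().erase(i); }
}

// Owning handle: copies share the resource, the last one to go releases it.
class handle {
public:
    explicit handle(fake_capi::id_type id) : id_(id) {}   // takes ownership, no increment
    handle(const handle& other) : id_(other.id_) { if (id_) fake_capi::inc_ref(id_); }
    handle(handle&& other) noexcept : id_(std::exchange(other.id_, 0)) {}
    handle& operator=(handle other) noexcept { std::swap(id_, other.id_); return *this; }
    ~handle() { if (id_) fake_capi::dec_ref(id_); }
    // Mirrors the static_cast<hid_t>( ... ) conversion mentioned above.
    explicit operator fake_capi::id_type() const { return id_; }
private:
    fake_capi::id_type id_;
};
```

Copying a handle raises the count, leaving a scope lowers it, and once the last handle is gone the entry disappears from the table, which corresponds to the resource being released.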

Error handling#

Error handling follows the C++ Guidelines and the philosophy the H5CPP library is built around: to help you get started without reading much of the documentation, while providing ample room for more should you require it. The root of the exception tree is h5::error::any, derived from std::runtime_error in accordance with the C++ guidelines on custom exceptions. All HDF5 CAPI calls are considered resources, and in case of an error H5CPP aims to roll back to the last known stable state, cleaning up all resource allocations between the call entry and the thrown error. This mechanism is guaranteed by RAII.

For granularity, io::[file|dataset|attribute] exceptions are provided, with the pattern of capturing an entire subset by ::any. Exceptions are thrown with error messages that carry the __FILE__ and __LINE__ relevant to the H5CPP template library, with a brief description to help the developer investigate. This error reporting mechanism uses a macro found inside h5cpp/config.h, which may be redefined:

    ...
// redefine macro before including <h5cpp/ ... >
#define H5CPP_ERROR_MSG( msg ) "MY_ERROR: " \
    + std::string( __FILE__ ) + " this line: " + std::to_string( __LINE__ ) + " message-not-used"
#include <h5cpp/all> 
    ...

An example to capture and handle errors centrally:

    // some H5CPP IO routines used in your software
    void my_deeply_embedded_io_calls() {
        arma::mat M = arma::zeros(20,4);
        // compound IO operations in single call: 
        //     file create, dataset create, dataset write, dataset close, file close
        h5::write("report.h5","matrix.ds", M ); 
    }

    int main() {
        // capture errors centrally with the granularity you desire
        try {
            my_deeply_embedded_io_calls();      
        } catch ( const h5::error::io::dataset::create& e ){
            // handle dataset creation error
        } catch ( const h5::error::io::dataset::write& e ){
        } catch ( const h5::error::io::file::create& e ){
        } catch ( const h5::error::io::file::close& e ){
        } catch ( const h5::error::any& e ) {
            // print out internally generated error message, controlled by H5CPP_ERROR_MSG macro
            std::cerr << e.what() << std::endl;
        }
    }

Detailed CAPI error stack may be unrolled and dumped, muted, unmuted with provided methods:

usage:

    h5::mute();
     // ... prototyped part with annoying messages
     // or the entire application ...
    h5::unmute(); 

std::stack<std::string> h5::error_stack() - walks through underlying CAPI error handler

usage:

    int main( ... ) {
        h5::use_error_handler();
        try {
            ... rest of the [ single | multi ] threaded application
        } catch( const h5::read_error& e  ){
            // std::stack has no iterators: pop the messages one by one
            std::stack<std::string> msgs = h5::error_stack();
            while( !msgs.empty() ){
                std::cerr << msgs.top() << std::endl;
                msgs.pop();
            }
        } catch( const h5::write_error& e ){
        } catch( const h5::file_error& e){
        } catch( ... ){
            // some other errors
        }
    }
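Since std::stack exposes only top() and pop(), a collected error stack has to be drained rather than iterated. A minimal sketch, assuming a hypothetical helper sketch_format_stack that is not part of H5CPP:

```cpp
#include <cassert>
#include <sstream>
#include <stack>
#include <string>

// Drains a stack of messages into one string, most recent entry first.
// The stack is taken by value, so the caller's copy is left untouched.
inline std::string sketch_format_stack( std::stack<std::string> msgs ) {
    std::ostringstream out;
    while ( !msgs.empty() ) {
        out << msgs.top() << '\n';
        msgs.pop();
    }
    return out.str();
}
```

Passing the stack by value keeps the original available, for instance to log it a second time elsewhere.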

Design criteria

- All HDF5 CAPI calls are checked, with the single exception of H5Lexists, where failure carries information: the path does not exist yet.
- Callbacks of CAPI routines don't throw exceptions, honoring the HDF5 CAPI contract and allowing the CAPI call to clean up.
- Error messages are currently collected in H5Eall.hpp and may be customized.
- Thrown exceptions are hierarchical.
- Only RAII capable/wrapped objects are used, guaranteeing cleanup through stack unwinding.
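The last criterion, cleanup through stack unwinding, can be illustrated with a small stand-in. This is only a sketch: sketch_handle and the open-handle counter are hypothetical and merely imitate how a wrapped CAPI handle could release its resource in a destructor.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for the number of open CAPI handles; lets us observe the cleanup.
static int sketch_open_handles = 0;

// Hypothetical RAII wrapper: acquires in the constructor, releases in the destructor.
class sketch_handle {
public:
    sketch_handle()  { ++sketch_open_handles; }
    ~sketch_handle() { --sketch_open_handles; }
    sketch_handle( const sketch_handle& ) = delete;
    sketch_handle& operator=( const sketch_handle& ) = delete;
};

// Opens two handles, then fails; both destructors run while the stack unwinds.
inline void sketch_io_that_fails() {
    sketch_handle file;
    sketch_handle dataset;
    throw std::runtime_error( "write failed" );
}

// Returns the number of handles still open after the failure was caught.
inline int sketch_handles_after_failure() {
    try {
        sketch_io_that_fails();
    } catch ( const std::runtime_error& ) {
        // by the time the handler runs, both handles have been released
    }
    return sketch_open_handles;
}
```

No explicit close call appears in the failing routine; the destructors do the work, which is what makes a central try/catch safe.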

The exception hierarchy is embedded in namespaces, and the chart should be read as a tree; for instance, a file create exception is h5::error::io::file::create. Keep in mind that namespace aliasing allows customization should you find the long names inconvenient:

namespace file_error = h5::error::io::file;
try{
    // ... H5CPP IO calls
} catch ( const file_error::create& e ){
    // do-your-thing(tm)
}

h5::error : public std::runtime_error
  ::any               - captures ALL H5CPP runtime errors with the exception of `rollback`
  ::io::              - namespace: IO related error, see aliasing
  ::io::any           - collective capture of IO errors within this namespace/block recursively
      ::file::        - namespace: file related errors
            ::any     - captures all exceptions within this namespace/block
            ::create  - create failed
            ::open    - check if the object, in this case file exists at all, retry if networked resource
            ::close   - resource may have been removed since opened
            ::read    - may not be fixable; should the software crash?
            ::write   - is it read only? is the resource still available since opening?
            ::misc    - errors which are not covered otherwise: start investigating from reported file/line
      ::dataset::
            ::any
            ::create
            ::open
            ::close
            ::read
            ::write
            ::append
            ::misc
      ::attribute::
            ::any
            ::create
            ::open
            ::close
            ::read
            ::write
            ::misc
    ::property_list::
      ::rollback
      ::any
      ::misc
      ::argument
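How a namespace-embedded hierarchy lets a handler pick its granularity can be sketched with a miniature of the tree. The names mirror part of the chart, but everything lives in a hypothetical sketch namespace, and the inheritance wiring (each ::any deriving from the ::any one level up) is an assumption made for illustration, not a statement about the H5CPP sources.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch { namespace error {
    struct any : std::runtime_error {
        explicit any( const std::string& m ) : std::runtime_error( m ) {}
    };
    namespace io {
        struct any : error::any {
            explicit any( const std::string& m ) : error::any( m ) {}
        };
        namespace file {
            struct any : io::any {
                explicit any( const std::string& m ) : io::any( m ) {}
            };
            struct create : file::any {
                explicit create( const std::string& m ) : file::any( m ) {}
            };
        }
        namespace dataset {
            struct any : io::any {
                explicit any( const std::string& m ) : io::any( m ) {}
            };
            struct write : dataset::any {
                explicit write( const std::string& m ) : dataset::any( m ) {}
            };
        }
    }
}}

// A namespace alias shortens the long names.
namespace file_error = sketch::error::io::file;

// Reports which handler caught the exception thrown by `thrower`.
// Handlers are ordered from most specific to most general.
template <class F>
std::string sketch_classify( F thrower ) {
    try {
        thrower();
    } catch ( const file_error::create& )     { return "file::create"; }
      catch ( const sketch::error::io::any& ) { return "io::any"; }
      catch ( const sketch::error::any& )     { return "any"; }
    return "none";
}
```

The order of the handlers matters: a general ::any placed first would shadow every more specific handler after it.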

This is a work in progress. If for any reason you think it could be improved, or some real-life scenario is not represented, please shoot me an email with the use case and a brief working example.

Diagnostics#

On occasion it comes in handy to dump the internal state of objects. Currently only the h5::sp_t data-space descriptor and dimensions are supported; in time most HDF5 CAPI diagnostics/information calls will be added.

    h5::ds_t ds =  ... ;                // obtained by h5::open | h5::create call
    h5::sp_t sp = h5::get_space( ds );  // get the file space descriptor for hyperslab selection
    h5::select_hyperslab(sp,  ... );    // some complicated selection that may fail, and you want to debug
    std::cerr << sp << std::endl;       // prints out the available space
    try {
        H5Dwrite(ds, ... );            // direct CAPI call fails with an invalid selection
    } catch( ... ){
    }
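The idea of streaming a descriptor for debugging can be sketched with a stand-in type. The struct below is hypothetical and is not h5::sp_t; the output format is likewise an assumption and only shows how an operator<< overload could expose rank and dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a data-space descriptor; not the H5CPP h5::sp_t type.
struct sketch_space {
    std::vector<std::size_t> current_dims;
    std::vector<std::size_t> max_dims;
};

// Writes a dimension list as {a,b,c}.
inline void sketch_print_dims( std::ostream& os, const std::vector<std::size_t>& dims ) {
    os << '{';
    for ( std::size_t i = 0; i < dims.size(); ++i )
        os << ( i ? "," : "" ) << dims[i];
    os << '}';
}

// Stream operator: rank followed by current and maximum dimensions.
inline std::ostream& operator<<( std::ostream& os, const sketch_space& sp ) {
    os << "rank: " << sp.current_dims.size() << " current: ";
    sketch_print_dims( os, sp.current_dims );
    os << " max: ";
    sketch_print_dims( os, sp.max_dims );
    return os;
}

// Convenience helper that renders the descriptor into a string.
inline std::string sketch_to_string( const sketch_space& sp ) {
    std::ostringstream out;
    out << sp;
    return out.str();
}
```

Returning the stream from operator<< keeps the usual chaining, so the descriptor can be written in the middle of a longer std::cerr statement.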

stream operators#

Some objects implement operator<< to furnish you with diagnostics. In time all objects will have this functionality added; for now only the following objects:

Custom Filter Pipeline#

To Be Written

Performance#

| experiment                          | time (s) | trans/sec | Mbyte/sec |
|-------------------------------------|----------|-----------|-----------|
| append: 1E6 x 64byte struct         | 0.06     | 16.46E6   | 1053.87   |
| append: 10E6 x 64byte struct        | 0.63     | 15.86E6   | 1015.49   |
| append: 50E6 x 64byte struct        | 8.46     | 5.90E6    | 377.91    |
| append: 100E6 x 64byte struct       | 24.58    | 4.06E6    | 260.91    |
| write: Matrix [10e6 x 16] no-chunk  | 0.4      | 0.89E6    | 1597.74   |
| write: Matrix [10e6 x 100] no-chunk | 7.1      | 1.40E6    | 563.36    |

Measured on a Lenovo 230 laptop (i7, 8G RAM) running Linux Mint 18.1.
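The relation between the columns can be sketched as simple arithmetic. The helpers below are hypothetical, and the conversion assumes 1 Mbyte = 1e6 bytes, which appears consistent with the append rows (for example 16.46E6 records/sec x 64 bytes is roughly 1053 Mbyte/sec). Small differences from the published figures are expected, since the reported times are rounded.

```cpp
#include <cassert>
#include <cmath>

// Records per second for `count` records written in `seconds`.
inline double sketch_records_per_sec( double count, double seconds ) {
    return count / seconds;
}

// Throughput in Mbyte/sec, taking 1 Mbyte = 1e6 bytes (an assumption).
inline double sketch_mbyte_per_sec( double records_per_sec, double record_bytes ) {
    return records_per_sec * record_bytes / 1e6;
}
```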

The gprof directory contains gperftools-based profiling. make all compiles the files. To run them, install google-pprof and kcachegrind.