A few years ago, I was looking for a data format with low-latency block and stream support. Protocol Buffers offered streams but lacked indexed block access. I soon realized that I was looking for a container with file-system-like properties. When I examined HDF5, I found it came very close to what I needed to store massive financial engineering datasets. In 2011, HDF5 already had good support for full and partial read and write operations on high-dimensional, extendable datasets with optional compression. Scientific platforms such as Python, Julia, R and Matlab also supported HDF5 and, most importantly, it worked across operating systems.
Machine learning and data science form an emerging field where data storage is a necessary part but not the main attraction. Data science requires a general-purpose, fast store with block and sequential access, capable of holding the observations used for model building. HDF5 does provide the basic building blocks for this role, but there is a gap between what it offers and what is needed.
Researchers working directly with popular linear algebra libraries, the STL or time series can benefit from the H5CPP template library's CRUD-like, low-latency operations. Engineers who need a fast storage solution for arbitrarily complex POD struct types, often already available in C/C++ header files, benefit from H5CPP's clang-based compiler technology.
The current HDF5 C++ approach treats C++ as a different language from C and reproduces the C API calls, adding only marginal value. The existing C/C++ library also lacks high-performance packet writing, seamless transformation of POD structures to HDF5 compound types, and support for popular matrix algebra libraries and STL containers. In fact, the HDF5 C++ API does not use C++ templates at all, whereas modern C++ is largely about templates and template metaprogramming.
The original design goal was to implement an intuitive, easy-to-use, template-based library that supports most major linear algebra systems with create, read, write and append operations. This work may be freely downloaded from the h5cpp11 page. However, in the past few months, in cooperation with Gerd Heber of The HDF Group, I have been working on the design and implementation of a new, unique interface. It blends Gerd's idea of having something Python-like in flexibility with my own proposal: instead of a dictionary-based named-argument passing mechanism, an elegant EBNF grammar implemented with C++ template metaprogramming. This C++ API lets you start coding without any knowledge of the HDF5 API, yet it leaves ample room for the details when you need them.
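To make the idea of typed arguments that can be passed in any order more concrete, here is a minimal sketch in plain C++11. It is not the H5CPP API: the tag types `chunk_size`, `gzip_level` and `max_rows`, the `dataset_options` struct and `make_options` are hypothetical names invented for this illustration, and the sketch covers only the named-argument aspect, not the grammar itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tag types (not part of H5CPP): the type itself names the argument.
struct chunk_size { std::size_t value; };
struct gzip_level { int value; };
struct max_rows   { std::size_t value; };

// Resolved settings; anything the caller omits keeps its default.
struct dataset_options {
    std::size_t chunk = 0;  // 0 means "not chunked" in this sketch
    int gzip = 0;           // 0 means "no compression" in this sketch
    std::size_t rows = 0;   // 0 means "no limit given" in this sketch
};

// One overload per tag: overload resolution picks the right one at compile time.
inline void apply(dataset_options& o, chunk_size c) { o.chunk = c.value; }
inline void apply(dataset_options& o, gzip_level g) {
    assert(g.value >= 0 && g.value <= 9);  // gzip levels range from 0 to 9
    o.gzip = g.value;
}
inline void apply(dataset_options& o, max_rows m) { o.rows = m.value; }

// Accepts any subset of the tags, in any order.
template <class... Args>
dataset_options make_options(Args... args) {
    dataset_options o;
    // C++11 pack expansion: calls apply() once per argument, left to right.
    int expand[] = {0, (apply(o, args), 0)...};
    (void)expand;
    return o;
}
```

With this in place, `make_options(gzip_level{9}, chunk_size{1024})` and `make_options(chunk_size{1024}, gzip_level{9})` yield the same settings, and passing a type without an `apply` overload fails at compile time rather than at run time, which a dictionary of named arguments cannot offer.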
The type system is hidden behind templates, so IO calls do the right thing. In addition to the templates, an optional clang-based compiler scans your project source files, detects every C/C++ POD structure referenced by H5CPP calls, and then produces the HDF5 compound type transformations from the topologically sorted nodes. Operations on HDF5 compound datatypes require an HDF5 DDL (Data Description Language) description, and writing it the old-fashioned way is tedious in a large, complex project. DDL-to-source-code transformation has been around for decades, with Protocol Buffers and Apache Thrift as good examples. The H5CPP compiler does the exact opposite: it takes arbitrary C/C++ source code and produces the HDF5 compound type DDL. This mechanism works with arbitrarily deep nesting of fundamental types, array types and POD struct types.
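The topological ordering matters because a nested struct's type description has to exist before the struct that embeds it can be described. The sketch below shows only that ordering step, using the standard library. It is an assumption about how such a pass could look, not the H5CPP compiler's code; `dependency_map`, `visit` and `emission_order` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical input: struct name -> names of the struct types it embeds by value.
using dependency_map = std::map<std::string, std::vector<std::string>>;

// Post-order depth-first walk: embedded structs are emitted before their owner.
// POD structs cannot contain themselves by value, so no cycle handling is needed;
// the 'seen' set still guarantees each name is visited only once.
inline void visit(const std::string& name, const dependency_map& deps,
                  std::set<std::string>& seen, std::vector<std::string>& order) {
    if (!seen.insert(name).second) return;  // already handled
    auto it = deps.find(name);
    if (it != deps.end())
        for (const auto& inner : it->second) visit(inner, deps, seen, order);
    order.push_back(name);
}

// Returns the struct names in an order that is safe for emitting type descriptions.
inline std::vector<std::string> emission_order(const dependency_map& deps) {
    std::set<std::string> seen;
    std::vector<std::string> order;
    for (const auto& entry : deps) visit(entry.first, deps, seen, order);
    assert(order.size() >= deps.size());  // every declared struct was emitted
    return order;
}
```

For a `record` struct that embeds `header` and `sample`, where `header` in turn embeds `timestamp`, the returned order places `timestamp` before `header` and both `header` and `sample` before `record`.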
Other design considerations:
- RAII idiom for resource management, to prevent leaks
- conversion policy: how software writers reach the C API, from seamless integration to restricted explicit conversion
- error handling policy: exceptions or no exceptions
- static polymorphism (CRTP idiom) instead of runtime polymorphism, so no virtual table lookups
- compile-time expressions, copy elision and return value optimization
- polished design
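Two of these idioms, RAII with an explicit conversion policy and CRTP, can be sketched in a few lines of standard C++. This is an illustration under stated assumptions and not H5CPP's implementation: `raw_handle`, `fake_open`, `fake_close`, `scoped_handle`, `storable` and the two element types are hypothetical stand-ins, with the fake open and close calls playing the role of a C API.

```cpp
#include <cassert>
#include <cstddef>

// --- RAII over a C-style handle ---
// Stand-in for a C API with paired open/close calls (hypothetical).
using raw_handle = long;
inline int& open_count() { static int n = 0; return n; }
inline raw_handle fake_open() { ++open_count(); return 42; }
inline void fake_close(raw_handle) {
    assert(open_count() > 0);  // closing more often than opening is a bug
    --open_count();
}

class scoped_handle {
public:
    explicit scoped_handle(raw_handle h) : h_(h) {}
    ~scoped_handle() { if (h_ >= 0) fake_close(h_); }  // released exactly once
    scoped_handle(const scoped_handle&) = delete;
    scoped_handle& operator=(const scoped_handle&) = delete;
    scoped_handle(scoped_handle&& other) noexcept : h_(other.h_) { other.h_ = -1; }
    // Restricted conversion policy: the raw handle is reachable only through
    // an explicit cast, never by accident.
    explicit operator raw_handle() const { return h_; }
private:
    raw_handle h_;
};

// Returned by value; the move constructor (or copy elision) transfers ownership.
inline scoped_handle open_scoped() { return scoped_handle(fake_open()); }

// --- CRTP: static polymorphism, no virtual table ---
template <class Derived>
struct storable {
    std::size_t bytes() const {
        return static_cast<const Derived&>(*this).bytes_impl();
    }
};
struct scalar_f64 : storable<scalar_f64> {
    std::size_t bytes_impl() const { return sizeof(double); }
};
struct vec3_f32 : storable<vec3_f32> {
    std::size_t bytes_impl() const { return 3 * sizeof(float); }
};

// Resolved at compile time for each concrete element type.
template <class T>
std::size_t total_bytes(const storable<T>& item, std::size_t count) {
    return item.bytes() * count;
}
```

A `scoped_handle` created inside a block closes its handle when the block ends, even if an exception is thrown, and `static_cast<raw_handle>(h)` is the only way to reach the underlying handle. The `total_bytes` call dispatches to `bytes_impl` without any virtual function.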
Steven Varga is an independent researcher in machine learning and computational finance. He provides convex approximations for combinatorial problems, models sequential and categorical data, and writes software for high-performance computing in C++, Julia, Python or R.