Datasets, like NumPy arrays, are homogeneous collections of data elements, with a fixed datatype and (hyper)rectangular shape. Unlike NumPy arrays, they support a variety of transparent storage features such as compression, error detection, and chunked I/O.
They are represented in h5py by a thin proxy class which supports familiar NumPy operations like slicing, along with a variety of descriptive attributes.
Datasets are created using either Group.create_dataset() or Group.require_dataset(). Existing datasets should be retrieved using the group indexing syntax (dset = group["name"]).
Datasets implement the following parts of the NumPy-array user interface:
- Slicing: simple indexing and a subset of advanced indexing
- shape attribute
- dtype attribute
Unlike memory-resident NumPy arrays, HDF5 datasets support a number of optional features. These are enabled by the keywords provided to Group.create_dataset(). Some of the more useful are:
When using HDF5 1.8, datasets can be resized, up to a maximum value provided at creation time. You can specify this maximum size via the maxshape argument to create_dataset or require_dataset. Shape elements with the value None indicate unlimited dimensions.
Later calls to Dataset.resize() will modify the shape in-place:
>>> dset = grp.create_dataset("name", (10,10), '=f8', maxshape=(None, None))
>>> dset.shape
(10, 10)
>>> dset.resize((20,20))
>>> dset.shape
(20, 20)
Resizing an array with existing data works differently than in NumPy; if any axis shrinks, the data in the missing region is discarded. Data does not “rearrange” itself as it does when resizing a NumPy array.
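The behavior above can be sketched with a short round trip; the filename, dataset name, and the in-memory "core" file driver used here are illustrative:

```python
# Sketch: shrinking a resizable dataset discards data outside the new
# shape; growing it back does not restore what was discarded.
import numpy as np
import h5py

# In-memory file (no disk I/O); the name "resize_demo.h5" is arbitrary.
with h5py.File("resize_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("demo", (4, 4), "=f8", maxshape=(None, None))
    dset[...] = np.arange(16, dtype="f8").reshape((4, 4))

    dset.resize((2, 2))      # rows/columns 2-3 are discarded
    kept = dset[...]         # only the top-left 2x2 corner survives

    dset.resize((4, 4))      # growing back does NOT bring the data back
    grown = dset[...]
```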
The best way to access data is to use the traditional NumPy extended-slicing syntax. Slice specifications are translated directly to HDF5 hyperslab selections, and are a fast and efficient way to access data in the file. The following slicing arguments are recognized:
- Numbers: anything that can be converted to a Python long
- Slice objects: please note negative values are not allowed
- Field names, in the case of compound data
- At most one Ellipsis (...) object
Here are a few examples (output omitted):
>>> dset = f.create_dataset("MyDataset", (10,10,10), 'f')
>>> dset[0,0,0]
>>> dset[0,2:10,1:9:3]
>>> dset[:,::2,5]
>>> dset[0]
>>> dset[1,5]
>>> dset[0,...]
>>> dset[...,6]
For compound data, you can specify multiple field names alongside the numeric slices:
>>> dset["FieldA"]
>>> dset[0,:,4:5, "FieldA", "FieldB"]
>>> dset[0, ..., "FieldC"]
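The field-name examples above assume a compound dataset already exists. A minimal sketch of creating one follows; the dataset name, field names, and in-memory file driver are illustrative:

```python
# Sketch: build a small compound-type dataset, then read individual fields.
import numpy as np
import h5py

comp = np.dtype([("FieldA", "f4"), ("FieldB", "i4")])

with h5py.File("compound_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("MyCompound", (5,), dtype=comp)

    # Write full records, then read back one field at a time
    data = np.zeros(5, dtype=comp)
    data["FieldA"] = np.linspace(0.0, 1.0, 5)
    data["FieldB"] = np.arange(5)
    dset[...] = data

    a = dset["FieldA"]       # just that field, for every record
    b = dset[2, "FieldB"]    # one field of one record
```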
For simple slicing, broadcasting is supported:
>>> dset[0,:,:] = np.arange(10) # Broadcasts to (10,10)
Importantly, h5py does not use NumPy to do broadcasting before the write. Broadcasting is implemented using repeated hyperslab selections, and is safe to use with very large target selections. In the following example, a write from a (1000, 1000) array is broadcast to a (1000, 1000, 1000) target selection as a series of 1000 writes:
>>> dset2 = f.create_dataset("MyDataset", (1000,1000,1000), 'f')
>>> data = np.arange(1000*1000, dtype='f').reshape((1000,1000))
>>> dset2[:] = data # Does NOT allocate 3.8 G of memory
Broadcasting is supported for “simple” (integer, slice and ellipsis) slicing only.
For any axis, you can provide an explicit list of points you want; for a dataset with shape (10, 10):
>>> dset.shape
(10, 10)
>>> result = dset[0, [1,3,8]]
>>> result.shape
(3,)
>>> result = dset[1:6, [5,8,9]]
>>> result.shape
(5, 3)
The following restrictions exist:
- Only one axis at a time may be selected with a list
- List entries must be given in increasing order
- Duplicate entries are not allowed
Additional mechanisms exist for the case of scattered and/or sparse selection, for which slab or row-based techniques may not be appropriate.
NumPy boolean “mask” arrays can be used to specify a selection. The result of this operation is a 1-D array with elements arranged in the standard NumPy (C-style) order:
>>> arr = numpy.arange(100).reshape((10,10))
>>> dset = f.create_dataset("MyDataset", data=arr)
>>> result = dset[arr > 50]
>>> result.shape
(49,)
The selections module also contains classes which provide access to native HDF5 dataspace selection techniques. These include explicit point-based selection and hyperslab selections combined with logical operations (AND, OR, XOR, etc.). Any instance of a selections.Selection subclass can be used for indexing directly:
>>> dset = f.create_dataset("MyDS2", (100,100), 'i')
>>> dset[...] = np.arange(100*100).reshape((100,100))
>>> sel = h5py.selections.PointSelection((100,100))
>>> sel.append([(1,1), (57,82)])
>>> dset[sel]
array([ 101, 5782])
As with NumPy arrays, the len() of a dataset is the length of the first axis, and iterating over a dataset iterates over the first axis. However, modifications to the yielded data are not recorded in the file. Resizing a dataset while iterating has undefined results.
Note
Since Python’s len is limited by the size of a C long, it’s recommended you use the syntax dataset.len() instead of len(dataset) on 32-bit platforms, if you expect the length of the first axis to exceed 2**32.
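A short sketch of len() and iteration semantics; the filename and dataset name here are illustrative:

```python
# Sketch: len() gives the length of the first axis; iteration yields rows
# as plain NumPy arrays, so modifying them does not touch the file.
import numpy as np
import h5py

with h5py.File("iter_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("rows", data=np.arange(12).reshape((3, 4)))

    n = len(dset)             # length of the first axis: 3
    first = next(iter(dset))  # the first row, copied into memory
    first[:] = 0              # modifies the copy only...
    still = dset[0]           # ...the data in the file is unchanged
```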
High-level interface to an HDF5 dataset.
Datasets can be opened via the syntax Group[<dataset name>], and created with the method Group.create_dataset().
Datasets behave superficially like NumPy arrays. NumPy “simple” slicing is fully supported, along with a subset of fancy indexing and indexing by field names (dataset[0:10, "fieldname"]).
The standard NumPy properties “shape” and “dtype” are also available.
Dataset properties
Dataset methods
Read a slice from the HDF5 dataset.
Takes slices and recarray-style field names (more than one is allowed!) in any order. Obeys basic NumPy rules, including broadcasting.
Also supports boolean mask arrays and selection objects from the selections module.
Write to the HDF5 dataset from a NumPy array.
NumPy’s broadcasting rules are honored, for “simple” indexing (slices and integers). For advanced indexing, the shapes must match.
Classes from the “selections” module may also be used to index.
Read data directly from HDF5 into an existing NumPy array.
The destination array must be C-contiguous and writable. Selections may be any operator class (HyperSelection, etc.) in h5py.selections, or the output of numpy.s_[<args>].
Broadcasting is supported for simple indexing.
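A sketch of reading into a preallocated array, assuming read_direct accepts a source_sel keyword built with numpy.s_; the filename and dataset name are illustrative:

```python
# Sketch: read one column of a dataset directly into an existing,
# C-contiguous destination array.
import numpy as np
import h5py

with h5py.File("direct_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("grid", data=np.arange(20.0).reshape((4, 5)))

    out = np.empty((4,), dtype=dset.dtype)         # writable, C-contiguous
    dset.read_direct(out, source_sel=np.s_[:, 2])  # column 2 -> out
```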
Write data directly to HDF5 from a NumPy array.
The source array must be C-contiguous. Selections may be any operator class (HyperSelection, etc.) in h5py.selections, or the output of numpy.s_[<args>].
Broadcasting is supported for simple indexing.
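A matching sketch for the write direction, assuming write_direct accepts a dest_sel keyword built with numpy.s_; names here are illustrative:

```python
# Sketch: write a C-contiguous source array into one column of a dataset.
import numpy as np
import h5py

with h5py.File("wdirect_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("grid", (4, 5), dtype="f8")

    src = np.arange(4.0)                          # must be C-contiguous
    dset.write_direct(src, dest_sel=np.s_[:, 0])  # src -> column 0
    col = dset[:, 0]
```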
Resize the dataset, or the specified axis (HDF5 1.8 only).
The dataset must be stored in chunked format; it can be resized up to the “maximum shape” (keyword maxshape) specified at creation time. The rank of the dataset cannot be changed.
“Size” should be a shape tuple, or if an axis is specified, an integer.
BEWARE: This behaves differently from the NumPy resize() method! The data is not “reshuffled” to fit the new shape; each axis is grown or shrunk independently. The coordinates of existing data are fixed.
The size of the first axis. Raises TypeError if the dataset is scalar.
Use of this method is preferred to len(dset), as Python’s built-in len() cannot handle values greater than 2**32 on 32-bit systems.