Tutorial: Pandas Dataframe to Numpy Array and store in HDF5
Posted on Sat 06 September 2014 in Python
Convert a pandas DataFrame to a NumPy array, store the data in an HDF5 file, and read it back as a NumPy array or a DataFrame.
In [108]:
import pandas as pd
import numpy as np
import h5py
In [109]:
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(6,4),columns=list('ABCD'))
df
Out[109]:
In [110]:
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.as_matrix.html#pandas.DataFrame.as_matrix
# as_matrix() was deprecated and then removed in pandas 1.0; to_numpy() is its replacement
df.to_numpy()
Out[110]:
In [111]:
# http://stackoverflow.com/questions/13187778/pandas-dataframe-to-numpy-array-include-index
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_records.html?highlight=to_record#pandas.DataFrame.to_records
df_to_nparray = df.to_records(index=False)
df_to_nparray
Out[111]:
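The two conversions above return differently shaped objects, which matters once the data is written to HDF5. The sketch below is a minimal illustration, assuming a pandas version that provides `to_numpy()` (0.24 or later). The variable names `plain`, `records`, and `with_index` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

# Homogeneous 2-D array: the column names are lost
plain = df.to_numpy()
# 1-D record array: one named field per column
records = df.to_records(index=False)
# Same, but with the DataFrame index stored as an extra leading field
with_index = df.to_records(index=True)

print(plain.shape)             # (6, 4)
print(records.shape)           # (6,)
print(records.dtype.names)     # ('A', 'B', 'C', 'D')
print(with_index.dtype.names)  # ('index', 'A', 'B', 'C', 'D')

# A single column can be pulled out of the record array by field name
col_b = records['B']
```

Because the record array keeps the column names in its dtype, they end up in the HDF5 file as the members of a compound datatype.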
In [112]:
# http://docs.h5py.org/en/latest/high/file.html
# http://blog.tremily.us/posts/HDF5/
# http://www.sam.math.ethz.ch/~raoulb/teaching/PythonTutorial/data_storage.html
# initialize file
# 'a' -> Read/write if exists, create otherwise (the default before h5py 3.0; newer versions default to 'r')
f = h5py.File('tuto_myfile.hdf5','a')
# create dataset
f['dset'] = df_to_nparray
# close connection to file
f.close()
To inspect the HDF5 file from the command line, install 'hdf5-tools'.
On Ubuntu:
$ sudo apt-get install hdf5-tools
Then run:
$ h5dump tuto_myfile.hdf5
The output should look something like this:
$ h5dump tuto_myfile.hdf5
HDF5 "tuto_myfile.hdf5" {
GROUP "/" {
DATASET "dset" {
DATATYPE H5T_COMPOUND {
H5T_IEEE_F64LE "A";
H5T_IEEE_F64LE "B";
H5T_IEEE_F64LE "C";
H5T_IEEE_F64LE "D";
}
DATASPACE SIMPLE { ( 6 ) / ( 6 ) }
DATA {
(0): {
0.471435,
-1.19098,
1.43271,
-0.312652
},
(1): {
-0.720589,
0.887163,
0.859588,
-0.636524
},
(2): {
0.0156964,
-2.24268,
1.15004,
0.991946
},
(3): {
0.953324,
-2.02125,
-0.334077,
0.00211836
},
(4): {
0.405453,
0.289092,
1.32116,
-1.54691
},
(5): {
-0.202646,
-0.655969,
0.193421,
0.553439
}
}
}
}
}
In [113]:
# read from hdf5
# open file
# 'r' -> Readonly, file must exist
f = h5py.File('tuto_myfile.hdf5', 'r')
# load dataset: dset
dset = f['dset']
dset
Out[113]:
In [114]:
a = dset[...]
f.close()
In [115]:
a
Out[115]:
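The write and read steps can also be done with `with` blocks, so the file is closed even if an error occurs in between. This is a sketch rather than the method used in this post: the file name `roundtrip_demo.hdf5` and the temporary directory are assumptions made to keep the example from touching `tuto_myfile.hdf5`.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
records = df.to_records(index=False)

# Work in a temporary directory so nothing is left behind afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'roundtrip_demo.hdf5')  # hypothetical file name

    # 'w' -> create the file, truncating it if it already exists
    with h5py.File(path, 'w') as f:
        f.create_dataset('dset', data=records)

    # 'r' -> read-only; [...] copies the whole dataset into memory
    with h5py.File(path, 'r') as f:
        restored = f['dset'][...]

# Compare field by field: the names and the values should survive the round trip
same = all(np.array_equal(records[name], restored[name])
           for name in restored.dtype.names)
print(restored.dtype.names)
print(same)
```

Opening with 'w' instead of 'a' also makes the cell safe to re-run, since assigning to an existing dataset name in append mode raises an error.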
In [116]:
# http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables
# Reading hdf5 in pandas (requires the PyTables package, 'tables')
df2 = pd.read_hdf('tuto_myfile.hdf5', 'dset')
In [117]:
df2
Out[117]:
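If PyTables is not available, a DataFrame can also be rebuilt from the structured array itself. The sketch below assumes an array shaped like the `a` read earlier (one float64 field per column) and builds a stand-in for it in memory instead of reading the file. `DataFrame.from_records` is a regular pandas constructor; the name `df_from_records` is used here for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the array `a` loaded from the HDF5 file
np.random.seed(1234)
original = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
a = original.to_records(index=False)

# The field names of the structured array become the column names
df_from_records = pd.DataFrame.from_records(a)

print(df_from_records.columns.tolist())  # ['A', 'B', 'C', 'D']
print(df_from_records.shape)             # (6, 4)
```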
In [118]:
# clean up the generated file (comment out to keep it)
! rm -f tuto_myfile.hdf5
P.S.: I know, I know... pandas can store a DataFrame directly in HDF5: http://pandas.pydata.org/pandas-docs/dev/io.html#io-hdf5
;)