Pytables


General

  • Reading the Hints for SQL Users page is a great quick start.
  • "PyTables is a package for managing hierarchical datasets and designed to efficiently cope with EXTREMELY LARGE AMOUNTS OF DATA."
  • version 3.1.1 as of 10-27-2014
    • 3.4.4 as of 01-23-2019
  • binary container for on-disk (not in-memory) structured data
  • takes every measure to reduce memory and disk usage during its operation
  • flexible/well-tested
  • best used for WORM write once read many
  • concurrent reads are no problem, but it does not support locking at all
  • FAST:
    • Numexpr, a just-in-time compiler that evaluates expressions in a way that both optimizes CPU usage and avoids in-memory temporaries
    • Blosc, a compressor designed to transmit data from memory to cache (and back) at very high speeds. It does so by using the full capabilities of modern CPUs, including their SIMD instruction sets (SSE2 or higher), across any number of available cores.
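As a rough illustration of what Numexpr does on its own (outside PyTables), the sketch below evaluates a string expression without allocating full-size intermediate arrays; the array names a and b are arbitrary:

import numpy as np
import numexpr as ne
a = np.arange(int(1e6), dtype=np.float64)
b = np.arange(int(1e6), dtype=np.float64)
# the expression string is compiled and evaluated block by block,
# avoiding the large temporaries that plain numpy would create
result = ne.evaluate('0.25*a**3 + 0.75*b**2 + 1.5*a - 2')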

PyTables features

  • save native data containers (numpy arrays, tuples, lists); see the sketch after this list
  • endow clear structure to all data
  • the in-memory object tree replicates the file structure; access datasets by walking the tree
  • define table entities with a class that describes the record fields
  • indexing support for columns - includes query capability for table objects
  • multidimensional/nested tables
  • USER-DEFINED METADATA
  • supports a good range of compressors, Zlib, bzip2, blosc
    • high-performance IO with compressible data
  • support files bigger than 2GB
  • architecture independent, portable
  • can perform out-of-core operations very efficiently
  • goal is not to bloat memory
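A minimal sketch of the first point and of user-defined metadata: create_array accepts plain Python containers as well as numpy arrays, and arbitrary attributes can be attached to a node (the file and node names here are made up):

import numpy as np
import tables as tb
f = tb.open_file('containers.h5', 'w')
# native Python containers are converted on the fly
f.create_array(f.root, 'from_list', [1, 2, 3, 4])
f.create_array(f.root, 'from_tuple', (1.0, 2.0, 3.0))
arr = f.create_array(f.root, 'from_numpy', np.arange(10))
# user-defined metadata attached to a leaf
arr.attrs.description = 'ten integers'
f.close()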

what it is not

  • not a relational database replacement
  • not a distributed database
  • not extremely secure or safe
  • not merely an HDF5 wrapper

Data structures

  • scalar and string datatypes
  • every cell in a table can be multidimensional (see the sketch after this list)
  • variable length arrays
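A minimal sketch of a table description with a multidimensional cell; the class and column names are made up:

import tables as tb

class Record(tb.IsDescription):
  name = tb.StringCol(16)
  # every cell in this column holds a 2x3 array of floats
  grid = tb.Float64Col(shape=(2, 3))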

Implementation

  • HDF5 library, Python, and Numpy
  • performance critical parts generated using Cython
  • defaults are optimized for datasets of less than about 10 MB and files with fewer than about 100 nodes
    • you can inform PyTables about the expected size of your datasets for an efficiency boost via a better chunk size calculation (the expectedrows argument to the File.create_* methods); see the sketch below
    • use in-kernel queries to speed up selections: the expression is evaluated at C speed, and only the values that satisfy the condition are brought into Python space
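For example, passing expectedrows when the final size is roughly known lets PyTables pick a better chunkshape; a sketch, where the Particle description and file name are made up:

import tables as tb

class Particle(tb.IsDescription):
  id = tb.Int64Col()
  energy = tb.Float64Col()

f = tb.open_file('big.h5', 'w')
# hint that about 10 million rows will be appended, so a suitable chunkshape is chosen
t = f.create_table(f.root, 'particles', Particle, expectedrows=int(1e7))
print(t.chunkshape)
f.close()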

Indexing

  • build an index rather than keeping the data sequentially sorted
  • indices are themselves compressed in PyTables
  • Partially sorted indices
    • fast to make (O(n) creation time) and compressible, but O(n) query time and no support for complex queries
  • Optimized Partially Sorted Indexes used by PyTables >= 2.0
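A minimal sketch of creating an index on a column; it assumes a Table handle t with an integer column col2, as in the Table example later on this page:

# a default (optimized, partially sorted) index; where() queries can then use it automatically
t.cols.col2.create_index()
# alternatively (on a column that is not yet indexed), a completely sorted index
# gives the fastest queries:
# t.cols.col2.create_csindex()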

Chunks

  • Chunks treated as atomic objects; I/O done in complete chunks
  • HDF5 keeps B-tree in memory to map chunk structures on disk
    • small B-trees add I/O overhead; big B-trees lower access time but add (memory) overhead
  • PyTables Table objects
    • records whose values are stored in fixed-length fields
    • all records have same structure and all values in each field have same data type
    • Python records mapped to HDF5 C structs
    • records in tables = "compound data types"
  • PyTables Array objects
    • generic, enlargeable, variable length
  • the object tree is created dynamically, imitating the HDF5 structure on disk
  • In-memory representation of the file
    • if you never flush the buffers, nothing goes to disk; however, unless you use an in-memory file, a 0-sized file is still created
  • Lazy loading
    • metadata only loaded on user request
    • actual data is not read until the user requests it via a method on a given node
    • data gets unloaded/revived for low memory consumption
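A small sketch of this lazy behaviour: walking the object tree only touches metadata, and data is read only when a leaf is actually sliced (it assumes the atest.h5 file created in the Array example below):

import tables as tb
f = tb.open_file('atest.h5', 'r')
# walking the tree loads node metadata only, not the data
for node in f.walk_nodes(f.root):
  print(node._v_pathname)
# the data itself is read from disk only when the leaf is sliced
data = f.root.array1[:]
f.close()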

Array object - simplest array

  • file.create_array(mygroup, 'arrayname', numpyref)
import numpy as np
import tables as tb
f = tb.open_file('atest.h5', 'w')
a = np.arange(100).reshape(20, 5)
f.create_array(f.root, "array1", a)
# see the data
f.root.array1[:]
ta = f.root.array1
ta[1:10:3, 2:5] # rows 1, 4 and 7; columns 2 through 4 (numpy slicing semantics)
# only those elements are read from disk
# if you check the file size of atest.h5 right now it will be 0: nothing has been
# flushed to disk yet; all I/O is buffered, so it is very fast
f.flush()
f.close() # after this, all in-memory representations are removed
# e.g. ta now raises an error

Drawbacks of Array object

  • shape cannot change
  • cannot be compressed

CArray object - compressed array

  • data is stored in non-contiguous chunks on disk
    • each chunk can be compressed independently
  • Shape cannot change
import numpy as np
import tables as tb
f = tb.open_file('atest.h5', 'w')
f.create_carray(f.root, "carray", tb.Float64Atom(), (10000, 1000))
ca = f.root.carray
na = np.linspace(0, 1, int(1e7)).reshape(10000, 1000)
ca[:] = na
# in IPython you can use %time to get timings
# To compress, specify a filter with compression level 5
# (use a new node name, since "carray" already exists in this file):
f.create_carray(f.root, "carray2", tb.Float64Atom(), (10000, 1000),
               filters=tb.Filters(complevel=5))

EArray objects - chunked compressed enlargeable

  • Data is stored in chunks
  • can be compressed
  • shape CAN change, can append as you go, don't need to know final size of array objects
  • shape must be kept regular
  • can't have a row that has more elements than others
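A minimal sketch of an EArray: a 0 in the shape marks the enlargeable dimension, and rows are added with append() (the file and node names are made up):

import numpy as np
import tables as tb
f = tb.open_file('earray.h5', 'w')
# 0 in the first dimension marks it as the enlargeable axis
ea = f.create_earray(f.root, 'earray', tb.Float64Atom(), (0, 5),
                     filters=tb.Filters(complevel=5, complib='blosc'))
ea.append(np.zeros((10, 5)))    # append 10 rows of 5 elements each
ea.append(np.ones((1000, 5)))   # keep appending; the final size need not be known
f.close()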

VLArray object

  • data is stored in variable length rows
  • data cannot be compressed
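A minimal sketch of a VLArray, where each row can hold a different number of elements (the file and node names are made up):

import tables as tb
f = tb.open_file('vlarray.h5', 'w')
vla = f.create_vlarray(f.root, 'vla', tb.Int64Atom())
vla.append([1, 2, 3])          # rows can have different lengths
vla.append([4])
vla.append([5, 6, 7, 8, 9])
print(vla[0], vla[2])          # each row comes back as its own array
f.close()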

Table object

  • chunked, compressed, and can be enlarged or shrunk
  • limitation: fields cannot be of variable length
    • workaround: use compression within a field
import numpy as np
import tables as tb
# The description for the tabular data
class TabularData(tb.IsDescription):
  """IsDescription is the metaclass"""
  col1 = tb.StringCol(200)
  col2 = tb.IntCol()
  col3 = tb.FloatCol()
# Open a file and create the Table container
f = tb.open_file('atable.h5', 'w')
t = f.create_table(f.root, 'table', TabularData, 'table title',
                  filters=tb.Filters(5, 'blosc'))
# table initially comes with 0 rows
# chunkshape = number of rows in a chunk
# to fill row, get a handler for the row
# row object is very optimized to append new data
r = t.row
for i in range(1000*1000):
  r['col1'] = str(i)
  r['col2'] = i + 1
  r['col3'] = i * (i + 1)
  r.append()
t.flush()

Table query

%time [ r['col1'] for r in t if r['col2'] < 10 ]
# can be faster using a string query (the "in-kernel" method), evaluated by the Numexpr library
%time [ r['col1'] for r in t.where('col2 < 10') ]
# complex conditions
%time [ r['col1'] for r in t if r['col2'] < 10 and r['col3'] < 10 ]
# complex conditions with a string query: substitute & for "and"
# uses the same notation as numpy
%time [ r['col1'] for r in t.where( '(col2 < 10) & (col3 < 10)' ) ]
# Load a numpy structured array into memory from disk
sa = t[:]
# If you perform the query in the in-memory object, it's about 5 times faster
%time sa[ ( (sa['col2'] < 10) & (sa['col3'] < 10) ) ]['col1']
# Can create a "completely sorted" index
%time t.cols.col2.create_csindex()
# Then queries are 50 times faster than numpy
%timeit [ r['col1'] for r in t.where( '(col2 < 10) & (col3 < 10)' ) ]
f.close()

Dataset Hierarchy

  • datasets can be organized into a directory-like structure via "groups"
  • Attributes: metadata about data
    • e.g. date, number of observations
    • even array data, as long as it is not too big; for bigger data, use a proper data container (see the sketch after this list)
  • Node, Group and Leaf classes
    • Node is the base class; Group and Leaf descend from Node
    • a Group instance is a grouping structure containing zero or more leaves, plus metadata (like a directory)
    • a Leaf instance is a container for data; Table, Array, CArray, EArray, VLArray descend from Leaf (like a file)
  • Objects in the object tree are often referenced by giving their full (absolute) path
    • single string style with '/' delim
    • OO "natural name" schema, traverse the tree through dot-notation/instance attributes

Why compression

  • compressed data takes much less time to move through I/O
  • uses more CPU but CPU time is cheap compared with disk access
  • blosc
    • reduces RAM-to-CPU-cache transfer time
    • accelerates I/O between layers of the storage hierarchy, e.g. from solid state disk to main memory
    • can compete with memcpy (a plain read from memory)
    • the time to decompress can be less than the time to read the uncompressed data directly from memory
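A rough sketch comparing on-disk size with and without compression; the exact numbers depend on the data and the compressor (the file names are made up):

import os
import numpy as np
import tables as tb
data = np.linspace(0, 1, int(1e6)).reshape(1000, 1000)
for fname, filters in [('plain.h5', None),
                       ('blosc.h5', tb.Filters(complevel=5, complib='blosc'))]:
  f = tb.open_file(fname, 'w')
  ca = f.create_carray(f.root, 'data', tb.Float64Atom(), data.shape, filters=filters)
  ca[:] = data
  f.close()
  print(fname, os.path.getsize(fname), 'bytes')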

Operating with disk-based arrays

  • out of core computations
  • tables.Expr is an optimized evaluator for expressions of disk-based arrays
  • It is a combination of the Numexpr advanced computing capabilities with the high I/O performance of PyTables
  • as with Numexpr, temporaries (here on disk) are avoided and multi-threaded operation is preserved
"""OUT-OF-CORE OPERATION doesn't use a lot of memory, only does things in chunks"""
import numpy as np
import tables as tb
# Open file for performing out of core computation
f = tb.openFile('poly.h5', 'w' )
t0 = time()
x = f.createEArray( f.root, 'x', tb.Float64Atom(), (0,), filters=tb.Filters(5, 'blosc') )
# Don't want to create a big array in memory, so just create parts of the array and flush it to disk
for s in range( in range(10):
  x.append(np.linspace(s, s+1, 1e6) )
x.flush()
# time is on the order of 0.7 s; the file is 11 MB with compression
# Now, create the expression to compute
expr = tb.Expr("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")
if hasattr(f.root, 'y'):
  f.remove_node(f.root.y)
y = f.create_carray(f.root, 'y', tb.Float64Atom(), (len(x),), filters=tb.Filters(5, 'blosc'))
expr.set_output(y)
# perform the actual computation
%time expr.eval()
# takes 1.2 seconds; the output array is placed on disk, not in memory = "out-of-core computation"
f.flush()
# poly.h5 is now 49 MB
# now plot using matplotlib
import matplotlib.pyplot as plt
xm = x[:]
ym = f.root.y[:]
plt.plot(xm, ym)
plt.show()