Pytables


General

  • Reading the Hints for SQL Users page is a great quick start.
  • "PyTables is a package for managing hierarchical datasets and designed to efficiently cope with EXTREMELY LARGE AMOUNTS OF DATA."
  • version 3.1.1 as of 10-27-2014
    • 3.4.4 as of 01-23-2019
  • binary container for on-disk (not in-memory) structured data
  • takes every measure to reduce memory and disk usage during its operation
  • flexible/well-tested
  • best used for WORM write once read many
  • concurrent reads are no problem, but it does not support locking at all
  • FAST:
    • Numexpr, a just-in-time compiler that evaluates expressions in a way that both optimizes CPU usage and avoids in-memory temporaries
    • Blosc, a compressor designed to transmit data from memory to cache (and back) at very high speeds. It does so by using the full capabilities of modern CPUs, including their SIMD instruction sets (SSE2 or higher), across any number of available cores.
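As a rough illustration of what Numexpr does on its own (outside PyTables), the sketch below evaluates a string expression without allocating full-size intermediate arrays; the array names a and b are arbitrary:

import numpy as np
import numexpr as ne
a = np.arange(int(1e6), dtype=np.float64)
b = np.arange(int(1e6), dtype=np.float64)
# the expression string is compiled and evaluated block by block,
# avoiding the large temporaries that plain numpy would create
result = ne.evaluate('0.25*a**3 + 0.75*b**2 + 1.5*a - 2')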

PyTables features

  • save native data containers (numpy arrays, tuples, lists); see the sketch after this list
  • endow clear structure to all data
  • the in-memory object tree replicates the file structure; access datasets by walking the tree
  • define table entities with a class that describes the record fields
  • indexing support for columns - includes query capability for table objects
  • multidimensional/nested tables
  • USER-DEFINED METADATA
  • supports a good range of compressors, Zlib, bzip2, blosc
    • high-performance IO with compressible data
  • support files bigger than 2GB
  • architecture independent, portable
  • can perform out-of-core operations very efficiently
  • goal is not to bloat memory
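A minimal sketch of the first point and of user-defined metadata: create_array accepts plain Python containers as well as numpy arrays, and arbitrary attributes can be attached to a node (the file and node names here are made up):

import numpy as np
import tables as tb
f = tb.open_file('containers.h5', 'w')
# native Python containers are converted on the fly
f.create_array(f.root, 'from_list', [1, 2, 3, 4])
f.create_array(f.root, 'from_tuple', (1.0, 2.0, 3.0))
arr = f.create_array(f.root, 'from_numpy', np.arange(10))
# user-defined metadata attached to a leaf
arr.attrs.description = 'ten integers'
f.close()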

what it is not

  • not a relational database replacement
  • not a distributed database
  • not extremely secure or safe
  • not merely an HDF5 wrapper

Data structures

  • scalar and string datatypes
  • every cell in a table can be multidimensional (see the sketch after this list)
  • variable length arrays
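A minimal sketch of a table description with a multidimensional cell; the class and column names are made up:

import tables as tb

class Record(tb.IsDescription):
  name = tb.StringCol(16)
  # every cell in this column holds a 2x3 array of floats
  grid = tb.Float64Col(shape=(2, 3))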

Implementation

  • HDF5 library, Python, and Numpy
  • performance critical parts generated using Cython
  • defaults are optimized for datasets of less than about 10 MB and files with fewer than about 100 nodes
    • you can inform PyTables about the expected size of your datasets for an efficiency boost via a better chunk size calculation (the expectedrows argument to the File.create_* methods); see the sketch below
    • use in-kernel queries to speed up selections: the expression is evaluated at C speed, and only the values that satisfy the condition are brought into Python space
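For example, passing expectedrows when the final size is roughly known lets PyTables pick a better chunkshape; a sketch, where the Particle description and file name are made up:

import tables as tb

class Particle(tb.IsDescription):
  id = tb.Int64Col()
  energy = tb.Float64Col()

f = tb.open_file('big.h5', 'w')
# hint that about 10 million rows will be appended, so a suitable chunkshape is chosen
t = f.create_table(f.root, 'particles', Particle, expectedrows=int(1e7))
print(t.chunkshape)
f.close()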

Indexing

  • build an index rather than keeping the data sequentially sorted
  • indices are themselves compressed in PyTables
  • Partially sorted indices
    • fast to make (O(n) creation time) and compressible, but O(n) query time and no support for complex queries
  • Optimized Partially Sorted Indexes used by PyTables >= 2.0
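A minimal sketch of creating an index on a column; it assumes a Table handle t with an integer column col2, as in the Table example later on this page:

# a default (optimized, partially sorted) index; where() queries can then use it automatically
t.cols.col2.create_index()
# alternatively (on a column that is not yet indexed), a completely sorted index
# gives the fastest queries:
# t.cols.col2.create_csindex()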

Chunks

  • Chunks treated as atomic objects; I/O done in complete chunks
  • HDF5 keeps B-tree in memory to map chunk structures on disk
    • small B-trees add I/O overhead; big B-trees lower access time but add (memory) overhead
  • PyTables Table objects
    • records whose values are stored in fixed-length fields
    • all records have same structure and all values in each field have same data type
    • Python records mapped to HDF5 C structs
    • records in tables = "compound data types"
  • PyTables Array objects
    • generic, enlargeable, variable length
  • the object tree is created dynamically, imitating the HDF5 structure on disk
  • In-memory representation of the file
    • if you never flush the buffers, nothing goes to disk; however, unless you use an in-memory file, a 0-sized file is still created
  • Lazy loading
    • metadata only loaded on user request
    • actual data is not read until the user requests it via a method on a given node
    • data gets unloaded/revived for low memory consumption
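A small sketch of this lazy behaviour: walking the object tree only touches metadata, and data is read only when a leaf is actually sliced (it assumes the atest.h5 file created in the Array example below):

import tables as tb
f = tb.open_file('atest.h5', 'r')
# walking the tree loads node metadata only, not the data
for node in f.walk_nodes(f.root):
  print(node._v_pathname)
# the data itself is read from disk only when the leaf is sliced
data = f.root.array1[:]
f.close()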

Array object - simplest array

  • file.create_array(mygroup, 'arrayname', numpyref)
import numpy as np
import tables as tb
f = tb.open_file('atest.h5', 'w')
a = np.arange(100).reshape(20, 5)
f.create_array(f.root, "array1", a)
# see the data
f.root.array1[:]
ta = f.root.array1
ta[1:10:3, 2:5] # rows 1, 4 and 7; columns 2 through 4 (numpy slicing semantics)
# only those elements are read from disk
# if you check the file size of atest.h5 right now it will be 0: nothing has been
# flushed to disk yet; all I/O is buffered, so it is very fast
f.flush()
f.close() # after this, all in-memory representations are removed
# e.g. ta now raises an error

Drawbacks of Array object

  • shape cannot change
  • cannot be compressed

CArray object - compressed array

  • data is stored in non-contiguous chunks on disk
    • each chunk can be compressed independently
  • Shape cannot change
import numpy as np
import tables as tb
f = tb.open_file('atest.h5', 'w')
f.create_carray(f.root, "carray", tb.Float64Atom(), (10000, 1000))
ca = f.root.carray
na = np.linspace(0, 1, int(1e7)).reshape(10000, 1000)
ca[:] = na
# in IPython you can use %time to get timings
# To compress, specify a filter with compression level 5
# (use a new node name, since "carray" already exists in this file):
f.create_carray(f.root, "carray2", tb.Float64Atom(), (10000, 1000),
               filters=tb.Filters(complevel=5))

EArray objects - chunked compressed enlargeable

  • Data is stored in chunks
  • can be compressed
  • shape CAN change, can append as you go, don't need to know final size of array objects
  • shape must be kept regular
  • can't have a row that has more elements than others
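A minimal sketch of an EArray: a 0 in the shape marks the enlargeable dimension, and rows are added with append() (the file and node names are made up):

import numpy as np
import tables as tb
f = tb.open_file('earray.h5', 'w')
# 0 in the first dimension marks it as the enlargeable axis
ea = f.create_earray(f.root, 'earray', tb.Float64Atom(), (0, 5),
                     filters=tb.Filters(complevel=5, complib='blosc'))
ea.append(np.zeros((10, 5)))    # append 10 rows of 5 elements each
ea.append(np.ones((1000, 5)))   # keep appending; the final size need not be known
f.close()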

VLArray object

  • data is stored in variable length rows
  • data cannot be compressed
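A minimal sketch of a VLArray, where each row can hold a different number of elements (the file and node names are made up):

import tables as tb
f = tb.open_file('vlarray.h5', 'w')
vla = f.create_vlarray(f.root, 'vla', tb.Int64Atom())
vla.append([1, 2, 3])          # rows can have different lengths
vla.append([4])
vla.append([5, 6, 7, 8, 9])
print(vla[0], vla[2])          # each row comes back as its own array
f.close()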

Table object

  • chunked, compressed, and can be enlarged or shrunk
  • limitation: fields cannot be of variable length
    • workaround: use compression within a field
import numpy as np
import tables as tb
# The description for the tabular data
class TabularData(tb.IsDescription):
  """IsDescription is the metaclass"""
  col1 = tb.StringCol(200)
  col2 = tb.IntCol()
  col3 = tb.FloatCol()
# Open a file and create the Table container
f = tb.open_file('atable.h5', 'w')
t = f.create_table(f.root, 'table', TabularData, 'table title',
                  filters=tb.Filters(5, 'blosc'))
# table initially comes with 0 rows
# chunkshape = number of rows in a chunk
# to fill row, get a handler for the row
# row object is very optimized to append new data
r = t.row
for i in range(1000*1000):
  r['col1'] = str(i)
  r['col2'] = i + 1
  r['col3'] = i * (i + 1)
  r.append()
t.flush()

Table query

%time [ r['col1'] for r in t if r['col2'] < 10 ]
# can be faster using a string query (the "in-kernel" method), evaluated by the Numexpr library
%time [ r['col1'] for r in t.where('col2 < 10') ]
# complex conditions
%time [ r['col1'] for r in t if r['col2'] < 10 and r['col3'] < 10 ]
# complex conditions with a string query: substitute & for "and"
# uses the same notation as numpy
%time [ r['col1'] for r in t.where( '(col2 < 10) & (col3 < 10)' ) ]
# Load a numpy structured array into memory from disk
sa = t[:]
# If you perform the query in the in-memory object, it's about 5 times faster
%time sa[ ( (sa['col2'] < 10) & (sa['col3'] < 10) ) ]['col1']
# Can create a "completely sorted" index
%time t.cols.col2.create_csindex()
# Then queries are 50 times faster than numpy
%timeit [ r['col1'] for r in t.where( '(col2 < 10) & (col3 < 10)' ) ]
f.close()

Dataset Hierarchy

  • datasets can be organized into a directory-like structure via "groups"
  • Attributes: metadata about data
    • e.g. date, number of observations
    • even array data, as long as it is not too big; for bigger data, use a proper data container (see the sketch after this list)
  • Node, Group and Leaf classes
    • Node is the base class; Group and Leaf descend from Node
    • a Group instance is a grouping structure containing zero or more leaves, plus metadata (like a directory)
    • a Leaf instance is a container for data; Table, Array, CArray, EArray, VLArray descend from Leaf (like a file)
  • Objects in the object tree are often referenced by giving their full (absolute) path
    • single string style with '/' delim
    • OO "natural name" schema, traverse the tree through dot-notation/instance attributes

Why compression

  • compressed data takes much less time to move through I/O
  • uses more CPU but CPU time is cheap compared with disk access
  • blosc
    • reduces RAM-to-CPU-cache transfer time
    • accelerates I/O between layers of the storage hierarchy, e.g. from solid state disk to main memory
    • can compete with memcpy (a plain read from memory)
    • the time to decompress can be less than the time to read the uncompressed data directly from memory
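A rough sketch comparing on-disk size with and without compression; the exact numbers depend on the data and the compressor (the file names are made up):

import os
import numpy as np
import tables as tb
data = np.linspace(0, 1, int(1e6)).reshape(1000, 1000)
for fname, filters in [('plain.h5', None),
                       ('blosc.h5', tb.Filters(complevel=5, complib='blosc'))]:
  f = tb.open_file(fname, 'w')
  ca = f.create_carray(f.root, 'data', tb.Float64Atom(), data.shape, filters=filters)
  ca[:] = data
  f.close()
  print(fname, os.path.getsize(fname), 'bytes')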

Operating with disk-based arrays

  • out of core computations
  • tables.Expr is an optimized evaluator for expressions of disk-based arrays
  • It is a combination of the Numexpr advanced computing capabilities with the high I/O performance of PyTables
  • as with Numexpr, temporaries (here on disk) are avoided and multi-threaded operation is preserved
"""OUT-OF-CORE OPERATION doesn't use a lot of memory, only does things in chunks"""
import numpy as np
import tables as tb
# Open file for performing out of core computation
f = tb.openFile('poly.h5', 'w' )
t0 = time()
x = f.createEArray( f.root, 'x', tb.Float64Atom(), (0,), filters=tb.Filters(5, 'blosc') )
# Don't want to create a big array in memory, so just create parts of the array and flush it to disk
for s in range( in range(10):
  x.append(np.linspace(s, s+1, 1e6) )
x.flush()
# time is on the order of 0.7 s; the file is 11 MB with compression
# Now, create the expression to compute
expr = tb.Expr("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")
if hasattr(f.root, 'y'):
  f.remove_node(f.root.y)
y = f.create_carray(f.root, 'y', tb.Float64Atom(), (len(x),), filters=tb.Filters(5, 'blosc'))
expr.set_output(y)
# perform the actual computation
%time expr.eval()
# takes 1.2 seconds; the output array is placed on disk, not in memory = "out-of-core computation"
f.flush()
# poly.h5 is now 49 MB
# now plot using matplotlib
import matplotlib.pyplot as plt
xm = x[:]
ym = f.root.y[:]
plt.plot(xm, ym)
plt.show()