Pandas

"A powerful data analysis and manipulation library for Python."

Objects in the pandas namespace

  • Key names: Series, DataFrame, Panel, Index, DatetimeIndex, HDFStore, bdate_range, date_range, read_csv, read_fwf, read_table, ols (note: Panel and pd.ols have since been removed from pandas)
  • pd.read_table('/path/to/file', sep='::', names=list_of_names)
    • index_col = Column to use as the row labels of the DataFrame. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to _not_ use the first column as the index (row names)
    • sep = Delimiter to use. Regular expressions are accepted.
      • sep=r"\s+" - space separated
    • header - defaults to 'infer', which detects a header row automatically
    • parse_dates - True parses the index as dates; a list names the columns to parse individually
    • names = list of column names to use. If the file contains no header row, explicitly pass header=None
    • comment = character marking the start of a comment; the rest of the line isn't parsed
    • usecols = array_like, return a subset of the columns
    • infer_datetime_format - defaults to False; True can speed up parsing of consistently formatted dates
  • DataFrame - "Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure."
    • data_frame = pd.read_csv( path )
    • data_frame.index = pd.to_datetime( some_sub_frame )
    • pd.DataFrame( {'ColumnA': time_series1, 'ColumnB': time_series2} )
      • Does the data alignment for you, conforms the union of the indexes, creates a nice regularly indexed rectangular table
  • date_range
    • my_date_range = pd.date_range( '1/1/2000', periods=10 )
  • Series
    • time_series1 = pd.Series( some_numpy_1D_array, index=my_date_range )
  • pd.concat( list_of_dfs, axis=1, keys=list_of_df_names )
    • Alignment happens on the existing index - here, the dates
    • With keys and axis=1, the result gets hierarchical (MultiIndex) column labels
    • The operation yields a new data frame with all the columns of the component frames and at least as many rows as either input (more if each frame has index values the other lacks); see the sketch below
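
A minimal sketch tying these pieces together (the file path and column names are hypothetical):

  import numpy as np
  import pandas as pd

  # Hypothetical delimited file with no header row; a multi-character
  # sep is treated as a regular expression and needs the python engine
  # ratings = pd.read_table('/path/to/file', sep='::', engine='python',
  #                         names=['user_id', 'item_id', 'rating', 'timestamp'])

  # Two Series on overlapping date ranges
  idx1 = pd.date_range('1/1/2000', periods=10)
  idx2 = pd.date_range('1/5/2000', periods=10)
  time_series1 = pd.Series(np.random.randn(10), index=idx1)
  time_series2 = pd.Series(np.random.randn(10), index=idx2)

  # The constructor aligns on the union of the indexes; dates present
  # in only one Series get NaN in the other column
  df = pd.DataFrame({'ColumnA': time_series1, 'ColumnB': time_series2})

  # concat along axis=1 with keys labels each component frame,
  # producing hierarchical (MultiIndex) columns
  combined = pd.concat([time_series1.to_frame('val'), time_series2.to_frame('val')],
                       axis=1, keys=['first', 'second'])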

Time Series Data Frame object Attributes and Methods

  • df.size - an attribute, not a method; the total number of elements (rows × columns). Use len(df) or df.shape[0] for the number of rows
  • grab certain rows
    • Positional bracket selection goes through .iloc: even_rows = all_rows.iloc[ [0,2,4,6,8] ] (plain df[list] selects columns, not rows)
    • Can use data_frame.take( numpy_1D_array_of_indices )
    • boolean indexing: df[df['A'] > 0]
  • .index - the index of the dataframe, can be timestamp, can be string, like a stock ticker tag.
  • Column members - Column names become data frame object member attributes
    • Can access certain rows of certain columns by treating the column members like dicts
      • stock_prices.close_price['2009-10-15']
    • data_frame.some_column.resample( 'M' ).agg( ['mean','median','std'] ) - create a new data frame comprised of the mean, median, and standard deviation, resampling some time data to monthly (the 'M' in the function argument); the older resample( 'M', how=[...] ) form has since been removed. See the sketch after this list
    • 'B' = business daily
  • .head() - returns the first 5 rows by default
  • time_data.at_time( datetime.time( 15, 0 ) ) - select all the data points at a particular time of day across many days (a string like '15:00' also works)
  • data_frame.plot() - plots with matplotlib, using the (time) index as the x-axis
  • df.asfreq('4h') - for example, take daily data and resample to contain data for every 4 hours. Will result in missing data that may be forward filled, etc.
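
A sketch of these methods on a made-up hourly price frame (the frame and column names are hypothetical):

  import datetime
  import numpy as np
  import pandas as pd

  idx = pd.date_range('2009-10-01', periods=24 * 30, freq='h')
  stock_prices = pd.DataFrame({'close_price': np.random.randn(len(idx)).cumsum()},
                              index=idx)

  # Dict-style, label-based access on a column attribute
  oct15 = stock_prices.close_price['2009-10-15']

  # Monthly mean/median/std ('M' is spelled 'ME' in pandas >= 2.2)
  monthly = stock_prices['close_price'].resample('M').agg(['mean', 'median', 'std'])

  # All observations taken at 15:00, across every day in the frame
  at_three = stock_prices.at_time(datetime.time(15, 0))

  # Reindex to a 4-hour frequency; new timestamps are NaN unless filled
  every_4h = stock_prices.asfreq('4h', method='ffill')

  # Row selection by position and by boolean mask
  even_rows = stock_prices.take(np.arange(0, 10, 2))
  positive = stock_prices[stock_prices['close_price'] > 0]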

Missing Data

  • df.count() - returns number of non NaN entries for each column
  • df.sum() - doesn't choke if a column has missing data
  • df.mean(axis=1) - the axis argument selects the direction; axis=1 computes a row-wise mean across columns
  • df.dropna() - drop a row if any column is a NaN
  • df.dropna(how='all') - drop a row if all columns are NaN
  • df.fillna( some_scalar ) - fill NaNs with a scalar value, like 0
  • df.fillna( method='ffill' ) - "forward fill," fill with the most recent non-NaN value; newer pandas spells this df.ffill() (see the sketch after this list)
  • df['ColName'].unique() - unique vals in column
  • df.drop( 'ColName', inplace=True, axis=1 ) - delete a column
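
A small sketch of the missing-data helpers on a toy frame:

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                     'B': [np.nan, np.nan, 6.0, np.nan]})

  df.count()            # non-NaN entries per column: A=2, B=1
  df.sum()              # NaNs are skipped rather than propagated
  df.mean(axis=1)       # row-wise mean across columns
  df.dropna()           # drop rows containing any NaN
  df.dropna(how='all')  # drop only rows where every column is NaN
  df.fillna(0)          # replace NaNs with a scalar
  df.ffill()            # forward fill; newer spelling of fillna(method='ffill')
  df['A'].unique()      # unique values in the column
  df.drop('B', axis=1)  # returns a copy; add inplace=True to mutate df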

Groupby Operations

  • Can do SQL-style group-by operations
  • One way to do a SUMIF: df.groupby( pd_series_w_classifications ).sum()
    • pd_series_w_classifications is a Series whose index aligns with df's index and whose values are the classification labels
    • Returns a new data frame containing sums for each classification value
    • Can also group by two or more classifiers (pass a list) if the data in the main data frame has multiple categories; see the sketch after this list
  • stack() - pivot column labels into an inner index level, making the frame hierarchical
  • unstack() - pivot an index level back out into columns/rows
  • Merging, Joining, Pivot Tables
    • pd.merge( df1, df2 )
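
A sketch of the SUMIF-style groupby, stack/unstack, and merge (all names are made up):

  import pandas as pd

  sales = pd.DataFrame({'amount': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

  # Classification labels aligned to sales' index
  labels = pd.Series(['x', 'y', 'x', 'y'], index=['a', 'b', 'c', 'd'])

  # One row per label, summing the grouped rows -- like a spreadsheet SUMIF
  sales.groupby(labels).sum()

  # stack pivots columns into an inner index level; unstack reverses it
  stacked = sales.stack()   # Series with a (row, column) MultiIndex
  stacked.unstack()         # back to the original rectangular shape

  # merge joins on shared column names by default (inner join)
  left = pd.DataFrame({'key': ['x', 'y'], 'lval': [1, 2]})
  right = pd.DataFrame({'key': ['x', 'z'], 'rval': [3, 4]})
  pd.merge(left, right)     # keeps only key 'x', the intersection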

Statistics & Summary

  • data.some_column.value_counts() - counts of each unique value, most frequent first
  • data.some_column.describe() - gives count, mean, std, min, max, and quartiles (see below)
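
For example, on a hypothetical column:

  import pandas as pd

  data = pd.DataFrame({'some_column': [1, 2, 2, 3, 3, 3]})
  data.some_column.value_counts()  # counts per unique value, descending
  data.some_column.describe()      # count, mean, std, min, quartiles, max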

Tricks and Tips

  • Subsample rows by creating an array of random indices, then using frame.take() - sketched below
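
A sketch of that trick, plus the built-in shortcut, on a made-up frame:

  import numpy as np
  import pandas as pd

  frame = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

  # 10 distinct random row positions, then positional selection
  rand_idx = np.random.permutation(len(frame))[:10]
  subsample = frame.take(rand_idx)

  # Equivalent built-in
  subsample = frame.sample(n=10)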

Efficient Merge/Join

  • merge - a single function serving as the entry point for all standard database join operations
  • types of join operations
    • one-to-one - join two dataframes on their indexes (see the sketch below)
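
A sketch of merge as the single join entry point, including an index-on-index one-to-one join (the frames are made up):

  import pandas as pd

  left = pd.DataFrame({'key': ['a', 'b'], 'lval': [1, 2]})
  right = pd.DataFrame({'key': ['a', 'b'], 'rval': [3, 4]})

  pd.merge(left, right, on='key')               # inner join (the default)
  pd.merge(left, right, on='key', how='outer')  # 'left', 'right', 'outer' also work

  # one-to-one join on the indexes
  pd.merge(left.set_index('key'), right.set_index('key'),
           left_index=True, right_index=True)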
