Pandas

"A powerful data analysis and manipulation library for Python."

Objects in the pandas namespace

  • Key names: Series, DataFrame, Panel, Index, DatetimeIndex, HDFStore, bdate_range, date_range, read_csv, read_fwf, read_table, ols (note: Panel and pd.ols have since been removed from pandas)
  • pd.read_table('/path/to/file', sep='::', names=list_of_names)
    • index_col = Column to use as the row labels of the DataFrame. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to _not_ use the first column as the index (row names)
    • sep = Delimiter to use. Regular expressions are accepted.
      • sep=r"\s+" - space separated
    • header - defaults to 'infer', which detects a header row automatically
    • parse_dates - True parses the index as dates; a list names the columns to parse individually
    • names = list of column names to use. If the file contains no header row, explicitly pass header=None
    • comment = character marking the start of a comment; the rest of the line isn't parsed
    • usecols = array_like, return a subset of the columns
    • infer_datetime_format - defaults to False; True can speed up parsing of consistently formatted dates
  • DataFrame - "Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure."
    • data_frame = pd.read_csv( path )
    • data_frame.index = pd.to_datetime( some_sub_frame )
    • pd.DataFrame( {'ColumnA': time_series1, 'ColumnB': time_series2} )
      • Does the data alignment for you, conforms the union of the indexes, creates a nice regularly indexed rectangular table
  • date_range
    • my_date_range = pd.date_range( '1/1/2000', periods=10 )
  • Series
    • time_series1 = pd.Series( some_numpy_1D_array, index=my_date_range )
  • pd.concat( list_of_dfs, axis=1, keys=list_of_df_names )
    • Alignment happens on the existing index - here, the dates
    • With keys and axis=1, the result gets hierarchical (MultiIndex) column labels
    • The operation yields a new data frame with all the columns of the component frames and at least as many rows as either input (more if each frame has index values the other lacks); see the sketch below
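
A minimal sketch tying these pieces together (the file path and column names are hypothetical):

  import numpy as np
  import pandas as pd

  # Hypothetical delimited file with no header row; a multi-character
  # sep is treated as a regular expression and needs the python engine
  # ratings = pd.read_table('/path/to/file', sep='::', engine='python',
  #                         names=['user_id', 'item_id', 'rating', 'timestamp'])

  # Two Series on overlapping date ranges
  idx1 = pd.date_range('1/1/2000', periods=10)
  idx2 = pd.date_range('1/5/2000', periods=10)
  time_series1 = pd.Series(np.random.randn(10), index=idx1)
  time_series2 = pd.Series(np.random.randn(10), index=idx2)

  # The constructor aligns on the union of the indexes; dates present
  # in only one Series get NaN in the other column
  df = pd.DataFrame({'ColumnA': time_series1, 'ColumnB': time_series2})

  # concat along axis=1 with keys labels each component frame,
  # producing hierarchical (MultiIndex) columns
  combined = pd.concat([time_series1.to_frame('val'), time_series2.to_frame('val')],
                       axis=1, keys=['first', 'second'])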

Time Series Data Frame object Attributes and Methods

  • df.size - an attribute, not a method; the total number of elements (rows × columns). Use len(df) or df.shape[0] for the number of rows
  • grab certain rows
    • Positional bracket selection goes through .iloc: even_rows = all_rows.iloc[ [0,2,4,6,8] ] (plain df[list] selects columns, not rows)
    • Can use data_frame.take( numpy_1D_array_of_indices )
    • boolean indexing: df[df['A'] > 0]
  • .index - the index of the dataframe, can be timestamp, can be string, like a stock ticker tag.
  • Column members - Column names become data frame object member attributes
    • Can access certain rows of certain columns by treating the column members like dicts
      • stock_prices.close_price['2009-10-15']
    • data_frame.some_column.resample( 'M' ).agg( ['mean','median','std'] ) - create a new data frame comprised of the mean, median, and standard deviation, resampling some time data to monthly (the 'M' in the function argument); the older resample( 'M', how=[...] ) form has since been removed. See the sketch after this list
    • 'B' = business daily
  • .head() - returns the first 5 rows by default
  • time_data.at_time( datetime.time( 15, 0 ) ) - select all the data points at a particular time of day across many days (a string like '15:00' also works)
  • data_frame.plot() - plots with matplotlib, using the (time) index as the x-axis
  • df.asfreq('4h') - for example, take daily data and resample to contain data for every 4 hours. Will result in missing data that may be forward filled, etc.
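
A sketch of these methods on a made-up hourly price frame (the frame and column names are hypothetical):

  import datetime
  import numpy as np
  import pandas as pd

  idx = pd.date_range('2009-10-01', periods=24 * 30, freq='h')
  stock_prices = pd.DataFrame({'close_price': np.random.randn(len(idx)).cumsum()},
                              index=idx)

  # Dict-style, label-based access on a column attribute
  oct15 = stock_prices.close_price['2009-10-15']

  # Monthly mean/median/std ('M' is spelled 'ME' in pandas >= 2.2)
  monthly = stock_prices['close_price'].resample('M').agg(['mean', 'median', 'std'])

  # All observations taken at 15:00, across every day in the frame
  at_three = stock_prices.at_time(datetime.time(15, 0))

  # Reindex to a 4-hour frequency; new timestamps are NaN unless filled
  every_4h = stock_prices.asfreq('4h', method='ffill')

  # Row selection by position and by boolean mask
  even_rows = stock_prices.take(np.arange(0, 10, 2))
  positive = stock_prices[stock_prices['close_price'] > 0]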

Missing Data

  • df.count() - returns number of non NaN entries for each column
  • df.sum() - doesn't choke if a column has missing data
  • df.mean(axis=1) - the axis argument selects the direction; axis=1 computes a row-wise mean across columns
  • df.dropna() - drop a row if any column is a NaN
  • df.dropna(how='all') - drop a row if all columns are NaN
  • df.fillna( some_scalar ) - fill NaNs with a scalar value, like 0
  • df.fillna( method='ffill' ) - "forward fill," fill with the most recent non-NaN value; newer pandas spells this df.ffill() (see the sketch after this list)
  • df['ColName'].unique() - unique vals in column
  • df.drop( 'ColName', inplace=True, axis=1 ) - delete a column
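
A small sketch of the missing-data helpers on a toy frame:

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                     'B': [np.nan, np.nan, 6.0, np.nan]})

  df.count()            # non-NaN entries per column: A=2, B=1
  df.sum()              # NaNs are skipped rather than propagated
  df.mean(axis=1)       # row-wise mean across columns
  df.dropna()           # drop rows containing any NaN
  df.dropna(how='all')  # drop only rows where every column is NaN
  df.fillna(0)          # replace NaNs with a scalar
  df.ffill()            # forward fill; newer spelling of fillna(method='ffill')
  df['A'].unique()      # unique values in the column
  df.drop('B', axis=1)  # returns a copy; add inplace=True to mutate df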

Groupby Operations

  • Can do SQL-style group-by operations
  • One way to do a SUMIF: df.groupby( pd_series_w_classifications ).sum()
    • pd_series_w_classifications is a Series whose index aligns with df's index and whose values are the classification labels
    • Returns a new data frame containing sums for each classification value
    • Can also group by two or more classifiers (pass a list) if the data in the main data frame has multiple categories; see the sketch after this list
  • stack() - pivot column labels into an inner index level, making the frame hierarchical
  • unstack() - pivot an index level back out into columns/rows
  • Merging, Joining, Pivot Tables
    • pd.merge( df1, df2 )
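
A sketch of the SUMIF-style groupby, stack/unstack, and merge (all names are made up):

  import pandas as pd

  sales = pd.DataFrame({'amount': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

  # Classification labels aligned to sales' index
  labels = pd.Series(['x', 'y', 'x', 'y'], index=['a', 'b', 'c', 'd'])

  # One row per label, summing the grouped rows -- like a spreadsheet SUMIF
  sales.groupby(labels).sum()

  # stack pivots columns into an inner index level; unstack reverses it
  stacked = sales.stack()   # Series with a (row, column) MultiIndex
  stacked.unstack()         # back to the original rectangular shape

  # merge joins on shared column names by default (inner join)
  left = pd.DataFrame({'key': ['x', 'y'], 'lval': [1, 2]})
  right = pd.DataFrame({'key': ['x', 'z'], 'rval': [3, 4]})
  pd.merge(left, right)     # keeps only key 'x', the intersection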

Statistics & Summary

  • data.some_column.value_counts() - counts of each unique value, most frequent first
  • data.some_column.describe() - gives count, mean, std, min, max, and quartiles (see below)
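
For example, on a hypothetical column:

  import pandas as pd

  data = pd.DataFrame({'some_column': [1, 2, 2, 3, 3, 3]})
  data.some_column.value_counts()  # counts per unique value, descending
  data.some_column.describe()      # count, mean, std, min, quartiles, max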

Tricks and Tips

  • Subsample rows by creating an array of random indices, then using frame.take() - sketched below
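
A sketch of that trick, plus the built-in shortcut, on a made-up frame:

  import numpy as np
  import pandas as pd

  frame = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

  # 10 distinct random row positions, then positional selection
  rand_idx = np.random.permutation(len(frame))[:10]
  subsample = frame.take(rand_idx)

  # Equivalent built-in
  subsample = frame.sample(n=10)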

Efficient Merge/Join

  • merge - a single function serving as the entry point for all standard database join operations
  • types of join operations
    • one-to-one - join two dataframes on their indexes (see the sketch below)
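
A sketch of merge as the single join entry point, including an index-on-index one-to-one join (the frames are made up):

  import pandas as pd

  left = pd.DataFrame({'key': ['a', 'b'], 'lval': [1, 2]})
  right = pd.DataFrame({'key': ['a', 'b'], 'rval': [3, 4]})

  pd.merge(left, right, on='key')               # inner join (the default)
  pd.merge(left, right, on='key', how='outer')  # 'left', 'right', 'outer' also work

  # one-to-one join on the indexes
  pd.merge(left.set_index('key'), right.set_index('key'),
           left_index=True, right_index=True)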
