2013 PyData NYC

*feature engineering is 80-90% of machine learning
*all the models are picklable - save a model/estimator you already fit, though pickles can have problems across different versions of Python
*pickling = "serializing objects", i.e. saving objects for a later date

*******************************************************************
functional programming
Functional programmers claim higher productivity because they don't have to think about for loops
talk about what you're doing at a very high level
build small pieces that have no purpose on their own, then combine them
parallelizing map (pmap) is much easier

standard interfaces
standardization is great!
robust systems base themselves on standard interfaces
ex: the SciPy stack is all based on the numpy ndarray - one thing we all decide we're going to use
ex: JSON is very standardized, XML is custom
ex: the Unix operating system allows operation between disparate tools
ex: pandas DataFrame
ex: supra-national standards
ex: lego bricks
ex: car parts

FP rely on core data structures

today: core python data structures, eschew using classes; classes are customization
alternatively you can just use a dictionary - everyone knows how to use a dict
dict to JSON is easy, whereas with a custom class you would have to implement a .toJSON method
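As a small illustration (the names here are mine, not from the talk), a plain dict serializes to JSON directly:
<syntaxhighlight lang="python">
import json

# A plain dict serializes to JSON with no extra machinery.
record = {"name": "pydata", "year": 2013}
print(json.dumps(record))   # {"name": "pydata", "year": 2013}
# A custom class would need its own to-JSON method or a custom encoder.
</syntaxhighlight>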

don't make an object unless you absolutely need to.
use two things together, called composition

tools that were not designed to work together, but they do
standard interfaces have good existing standard libraries that have been well-tested

=MAP=
* functions as data: a function that consumes other functions, a function that produces other functions, iterate through a list of functions
* allows us to encapsulate a pattern
* a list comprehension is a map
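A minimal sketch of the map pattern (the example values are mine, not from the talk):
<syntaxhighlight lang="python">
# map applies a function to every element; a list comprehension expresses the same pattern.
def square(x):
    return x * x

print(list(map(square, [1, 2, 3, 4])))    # [1, 4, 9, 16]
print([square(x) for x in [1, 2, 3, 4]])  # the equivalent list comprehension
</syntaxhighlight>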
=FILTER=
only keep the elements for which the predicate is true
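A minimal sketch (the predicate and values are mine):
<syntaxhighlight lang="python">
# filter keeps only the elements for which the predicate is true.
is_even = lambda x: x % 2 == 0
print(list(filter(is_even, range(10))))   # [0, 2, 4, 6, 8]
</syntaxhighlight>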
=REDUCE=
an accumulator variable and a binary operator (takes two inputs); keep glomming each new item onto the accumulated value
ex: sum, min, max, product
reduce(binary op, collection of things, starting value)
numpy broadcasting
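A minimal sketch with Python's functools.reduce (the sum/product examples are mine):
<syntaxhighlight lang="python">
# reduce folds a binary operator across a collection, optionally from a starting value.
from functools import reduce
import operator

print(reduce(operator.add, [1, 2, 3, 4], 0))   # 10, i.e. ((((0 + 1) + 2) + 3) + 4)
print(reduce(operator.mul, [1, 2, 3, 4], 1))   # 24
</syntaxhighlight>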

==groupby==
* a better filter()
exists in SQL, itertools, pandas
ex: instead of filtering numbers for even or odd separately, get the evens and the odds at the same time
same interface as filter, but gives you all the results
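The groupby described here behaves like toolz.groupby; a minimal sketch assuming that library:
<syntaxhighlight lang="python">
# groupby partitions a sequence by a key function and returns every group at once,
# unlike filter, which returns only the matching elements.
from toolz import groupby

print(groupby(lambda x: x % 2 == 0, range(10)))
# {False: [1, 3, 5, 7, 9], True: [0, 2, 4, 6, 8]}
</syntaxhighlight>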

just like a power tool: it's not a drill, and on its own it is completely useless, but combined with a function, like a drill bit, it is very powerful
"compose" the power tool with different bits.
build general tools, compose them with small tools

==accumulate==
like reduce, but returns the history of all the binary ops
ex: cumulative sum, cumulative product
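A minimal sketch, assuming toolz.accumulate (itertools.accumulate behaves similarly):
<syntaxhighlight lang="python">
# accumulate is like reduce but yields every intermediate result, e.g. a cumulative sum.
from toolz import accumulate
import operator

print(list(accumulate(operator.add, [1, 2, 3, 4])))   # [1, 3, 6, 10]
print(list(accumulate(operator.mul, [1, 2, 3, 4])))   # [1, 2, 6, 24]
</syntaxhighlight>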

==curry==
named after Haskell Curry; the Haskell programming language is also named after him
metaphor: put the bit into the power tool.
cf. functools.partial
the results of currying are also partials
returns partially evaluated functions
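A minimal sketch of partial application with functools.partial and toolz.curry (the function names are mine):
<syntaxhighlight lang="python">
# "Put the bit into the power tool": pre-load a function with some of its arguments.
from functools import partial
from toolz import curry

def add(x, y):
    return x + y

add_one = partial(add, 1)   # a partially applied function
print(add_one(10))          # 11

@curry
def mul(x, y):
    return x * y

double = mul(2)             # calling a curried function with too few args returns another partial
print(double(21))           # 42
</syntaxhighlight>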

==frequencies is a reduction plus groupby==
cf. the collections.Counter object
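A minimal sketch, assuming toolz.frequencies, with collections.Counter as the standard-library analogue:
<syntaxhighlight lang="python">
# frequencies counts occurrences: a groupby on identity followed by a length reduction.
from collections import Counter
from toolz import frequencies

words = ["to", "be", "or", "not", "to", "be"]
print(frequencies(words))   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(Counter(words))       # the standard-library equivalent
</syntaxhighlight>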

==pipe is a reduce operation==
needs curry to pre-load each stage with its function
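A minimal sketch, assuming toolz.pipe and the curried map/filter from toolz.curried:
<syntaxhighlight lang="python">
# pipe threads a value through a sequence of functions; curried map/filter let us
# pre-load each stage with its function.
from toolz import pipe
from toolz.curried import filter, map

result = pipe(range(10),
              filter(lambda x: x % 2 == 0),   # keep the evens
              map(lambda x: x * x),           # square them
              sum)                            # 0 + 4 + 16 + 36 + 64
print(result)                                 # 120
</syntaxhighlight>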

==purity==
impure functions modify inputs
use tuples
exception: lists inside tuples CAN be modified!
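A small illustration of that exception (the values are mine):
<syntaxhighlight lang="python">
# Tuples are immutable, but a mutable object stored inside one can still change.
point = (1, 2)
# point[0] = 10        # TypeError: tuples cannot be modified

record = ("label", [1, 2, 3])
record[1].append(4)    # legal: the tuple still holds the same list object
print(record)          # ('label', [1, 2, 3, 4])
</syntaxhighlight>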

******************************************************************

WAKARI - an Amazon cloud instance with a web interface

1 GB of file storage space, 100 MB limit on file size
truthy.indiana.edu
predicting elections using twitter

systems programming, but no complex numbers, vectorized primitives

fortran is still where many scientists end up

the xkcd "Python" comic - import antigravity and you're flying

packages for data analysis and visualization
numpy
scipy, pandas, matplotlib
scikit-learn, numba, Biopython, PyTables

python syntax gets out of your way, natural language syntax.

python is community driven

Which is the better data analysis language? R or Python
http://www.quora.com/Data-Analysis/Which-is-the-better-Data-analysis-language-R-or-Python

python is built by computer scientists

wakari browser-based linux & python environment
packages, notebooks, files, folders
REPRODUCIBLE RESEARCH

Rule of thumb: half of published research cannot be replicated
Put data and code next to each other
collaborate

problem: "I don't have the packages you do"
you can create a bundle that's password protected, but that costs money.

anaconda is virtualenv + pip

******************************************************************

infoviz = information visualization

data ink ratio, strip ornamentation
cognitive style: trust the person to use their brain - leave the outlier there, don't point to it
chart junk

d3: data-driven-documents, create a mapping of data to elements on page, eg chart

visualizations have some explanatory power

not a PNG screen cap of your matplotlib plot

=peter wang=

* python was designed as a teaching language and is the product of evolution; C was a teaching language too - no other languages were designed like that
* post-PC era
* python ecosystem, full scientific stack
* minimum transistor voltage floor, so more cores on the chips
* cambrian explosion of technology, natural evolution of tools in the space
* python is the cortex
* the sun came out - who can evolve eyes the fastest, and not just eyes but retinas
* computers are terrible because they do what I tell them to do instead of what I want them to do.


+++++++++++++++++++++++++++++++++++++++++++++++++++

thomas wiecki bayesian

1. Maximum likelihood estimate: as N goes to infinity we get the right answer
2. Prior distribution
3. probability density vs. probability mass are different things
- AT MINIMUM THE ALGORITHM MUST RECOVER THE PARAMETERS WE USED TO CREATE THE DATA
- T distribution is like normal dist but with longer tails
- Random walk assumption

++++++++++++++++++++++++++++++++++++++++++++++++++++
generators

eager versus lazy

functions enter in one way and exit another way

generators pause and yield back 

generator can put multiple things back in when you resume execution


difference between iterator and iterable.
a list is iterable, but the list object doesn't know where you stopped
an iterator knows where you stopped
e.g. next() doesn't work directly on a list, only on an iterator
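A small illustration (the values are mine):
<syntaxhighlight lang="python">
nums = [1, 2, 3]
# next(nums) raises TypeError: a list is iterable but not an iterator.
it = iter(nums)   # ask the iterable for an iterator
print(next(it))   # 1 - the iterator remembers where you stopped
print(next(it))   # 2
</syntaxhighlight>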

generators iterate over blocks of code delimited by yield statements

lazy trumped eager

itertools allows you to cherry-pick values without allocating the whole thing
ex: a generator for the next business day.
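A minimal sketch of such a generator (my own illustration: it only skips weekends, with no holiday calendar):
<syntaxhighlight lang="python">
from datetime import date, timedelta
from itertools import islice

def business_days(start):
    """Lazily yield business days (Mon-Fri) forever, starting from `start`."""
    day = start
    while True:
        if day.weekday() < 5:      # 0-4 are Monday through Friday
            yield day
        day += timedelta(days=1)

# Cherry-pick the first five business days without allocating anything large.
print(list(islice(business_days(date(2013, 11, 8)), 5)))
</syntaxhighlight>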

++++++++++++++++++++++++++++++++++++++++++++++++++
oozie (XML), azkaban, luigi
query the central server via REST api, scalding, pig

makefile:luigi::maven:oozie

mission control

+++++++++++++++++++++++++++++++++++++++++++++++
can use hadoop
work queue, celery, shard data, pull data out of a database, data across a shared file system
multiprocessing, global platforms
multithreading in C++ is not for the faint-hearted because of memory allocation
tools for concurrency are poorly integrated in the language
parallelize map
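A minimal sketch of a parallelized map with the standard library's multiprocessing.Pool (the function and values are mine):
<syntaxhighlight lang="python">
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:                    # four worker processes
        print(pool.map(square, range(10)))   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
</syntaxhighlight>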


++++++++++++++++++++++++++++++++++++++++++++++++
embeddings james powell
download python source, CFLAGS='-g3 -ggdb -gdwarf-4 ... '
dis module

EvalFrameEx - a gigantic switch statement
main & Py_Main at the bottom of the stack
PyRun_AnyFileExFlags
PyRun_InteractiveLoopFlags

pure embedding vs high level embedding
using cython as a means for embedding


++++++++++++++++++++++++++++++++++++
map reduce with python

easier to collect data now, can't fit all the data on a single machine
divide dataset into sub groups, perform operations in parallel

use case: log processing, counting

hadoop is the most popular map reduce implementation
use AWS to run hadoop

amazon storage services, scalable, pay only for what you use
elastic map reduce, ability to resize your cluster, as many ec2 instances, spin up a hadoop cluster on demand

mrjob, prototype hadoop jobs locally
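A minimal word-count sketch with mrjob (my own example; it runs locally before being pointed at Elastic MapReduce):
<syntaxhighlight lang="python">
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts for each word across all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
</syntaxhighlight>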