2013 PyData NYC
- Efficient Numpy tutorial by Jake Vanderplas
- pandas tutorials
- look for the tutorial by Olivier Grisel
- python is a low-brow language, like the English language: use very small words; stating things too concisely loses people
- http://karissamck.com/blog/2013/10/30/big-data-is-big-because-it-doesnt-load-into-r/
- Yves Hilpisch slides
- https://www.wakari.io/gallery
- Kaggle competitions
- http://gekkoquant.com/2012/10/21/statistical-arbitrage-correlation-vs-cointegration/
- http://www.martinlaprise.info/pydata2013.html
- http://en.wikipedia.org/wiki/Idempotence
- https://github.com/azkaban/azkaban
- http://en.wikipedia.org/wiki/Deep_learning
* feature engineering is 80-90% of machine learning
* all the models are picklable - save a model/estimator you already fit; pickles can be a problem across different versions of python
* pickling = "serializing objects": saving objects for a later date
*******************************************************************
functional programming
* functional programmers claim more productivity because they don't have to think about for loops
* talk about what you're doing at a very high level
* build things that have no purpose on their own
* pmap is much easier
=standard interfaces=
* standardization is great! robust systems base themselves on standard interfaces
* ex: the scipy stack all bases itself on the numpy ndarray - there's this one thing we all decide we're going to use
* ex: JSON is very standardized; xml is custom
* ex: the unix operating system allows operations between disparate tools
* ex: the pandas dataframe
* ex: supra-national standards, lego bricks, car parts
=FP relies on core data structures=
* today: core python data structures; eschew classes - a class is customization
* alternatively just use a dictionary; everyone knows how to use a dict
* dict to JSON is exact, whereas with a class you would have to implement a .toJSON method
* don't make an object unless you absolutely need to
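The dict-vs-class point above can be sketched in a few lines: a plain dict serializes to JSON directly, while a custom class forces you to hand-write a hook (the `Record` class and its `to_json` method below are hypothetical, for illustration only):

```python
import json

# A dict maps to JSON directly - the standard interface just works:
record = {"name": "pandas", "stars": 5, "tags": ["data", "frames"]}
print(json.dumps(record))

# A custom class does not: json.dumps(Record(...)) raises TypeError,
# so you would have to implement your own .toJSON-style method:
class Record:
    def __init__(self, name, stars):
        self.name = name
        self.stars = stars

    def to_json(self):
        return json.dumps({"name": self.name, "stars": self.stars})

print(Record("numpy", 5).to_json())
```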
=composition=
* use two things together - tools that were not designed to work together, but they do because of standard interfaces
* build on good existing standard libraries that have been well-tested
=MAP=
* functions as data: a function that consumes other functions, an f that produces other functions, iterate through a list of functions
* allows us to encapsulate a pattern
* list comprehensions are map
=FILTER=
* only select the elements that are true
=REDUCE=
* an accumulated variable and a binary operator (takes two inputs); keep glomming the next item onto the accumulation
* ex: sum, min, max, product
* reduce(binary op, collection of things, starting value)
* cf. numpy broadcasting
==groupby==
* a better filter() - SQL, itertools, pandas
* ex: filter numbers for even/odd and get the evens and the odds at the same time
* same interface as filter, but gives you all the groups
* just like a power tool is not a drill: on its own it is completely useless, but combined with a function - the drill bit - it is very powerful
* "compose" the power tool with different bits: build general tools, compose them with small tools
==accumulate==
* like reduce, but returns the history of all the binary ops
* ex: cumulative sum, cumulative product
==curry==
* named after Haskell Curry (there's a programming language named after him too)
* metaphor: putting the bit into the power tool
* functools.partial; results of curry are also partials - it returns partially evaluated functions
==frequencies==
* a reduction plus a groupby; collections.Counter
==pipe==
* a reduce operation; needs curry
==purity==
* impure functions modify their inputs; use tuples
* exception: lists inside tuples CAN be modified!
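The stdlib already covers most of the tools above (the `toolz` library adds `pipe`, `curry`, and `frequencies` as first-class functions); a minimal stdlib-only sketch of map/filter/reduce, groupby, accumulate, curry-via-`partial`, and frequencies-via-`Counter`, with made-up sample data:

```python
from functools import reduce, partial
from itertools import groupby, accumulate
from collections import Counter

nums = [3, 1, 4, 1, 5, 9, 2, 6]

# MAP / FILTER / REDUCE: the core trio
doubled = list(map(lambda x: 2 * x, nums))
evens   = list(filter(lambda x: x % 2 == 0, nums))
total   = reduce(lambda acc, x: acc + x, nums, 0)   # binary op + starting value

# groupby: like filter, but you get *all* the groups at once
# (itertools.groupby needs sorted input to group globally)
by_parity = {k: list(g)
             for k, g in groupby(sorted(nums, key=lambda x: x % 2),
                                 key=lambda x: x % 2)}

# accumulate: reduce, but keeping the history of every binary op
running_sum = list(accumulate(nums))

# curry / partial: "put the bit into the power tool"
double = partial(lambda a, b: a * b, 2)

# frequencies = a reduction plus a groupby; the stdlib spelling is Counter
freqs = Counter(nums)

print(total, by_parity, running_sum, double(10), freqs[1])
```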
******************************************************************
WAKARI
* an amazon cloud instance with a web interface
* 1 GB of file storage space, 100 MB limit per file
* truthy.indiana.edu - predicting elections using twitter
* systems programming languages have no complex numbers or vectorized primitives; fortran is still where many scientists end up
* the flying-python xkcd
* packages for data analysis and visualization: numpy, scipy, pandas, matplotlib, scikit-learn, numba, biopython, PyTables
* python syntax gets out of your way - natural-language syntax; python is community driven
* which is the better data analysis language, R or Python?
  http://www.quora.com/Data-Analysis/Which-is-the-better-Data-analysis-language-R-or-Python
* python is built by computer scientists
* wakari: a browser-based linux & python environment - packages, notebooks, files, folders
REPRODUCIBLE RESEARCH
* rule of thumb: half of published research cannot be replicated
* put data and code next to each other; collaborate ("I don't have the packages you do")
* create a bundle that's password protected - but that costs money
* anaconda is virtualenv + pip
******************************************************************
infoviz = information visualization
* data-ink ratio: strip ornamentation, chart junk
* cognitive style: trust the person to use their brain - leave the outlier in, don't point an arrow at it
* d3 (data-driven documents): create a mapping from data to elements on the page, e.g. a chart
* visualizations should have some explanatory power - not a png screen cap of your matplotlib plot
=peter wang=
* python was designed as a teaching language and is the product of evolution (c was a teaching language too); no other languages were designed like that
* post-pc era
* the python ecosystem is a full scientific stack
* there is a minimum transistor voltage floor, so more cores go on the chips
* cambrian explosion of technology, natural evolution of tools in the space
* python is the cortex
* the sun came out - who can evolve eyes the fastest? not just eyes but retinas
* computers are terrible because they do what I tell them to do instead of what I want them to do
+++++++++++++++++++++++++++++++++++++++++++++++++++
thomas wiecki - bayesian
1. maximum likelihood estimate: as N goes to infinity we get the right answer
2. prior distribution
3. probability density and probability mass are different things
* AT MINIMUM THE ALGORITHM MUST RECOVER THE PARAMETERS WE USED TO CREATE THE DATA
* the t distribution is like the normal distribution but with longer tails
* random walk assumption
++++++++++++++++++++++++++++++++++++++++++++++++++++
generators
* eager versus lazy
* a normal function enters in one place and exits in one place; generators pause and yield back
* a generator can even have multiple things sent back in when you resume execution
* difference between iterator and iterable: a list is iterable, but the list object doesn't know where you stopped; an iterator knows where you stopped (e.g. next() doesn't work on lists)
* generators iterate over blocks of code delimited by yield statements
* laziness allows you to cherry-pick values without allocating the whole thing, using itertools
* ex: a generator for the next business day
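The "next business day" example can be written as a lazy generator; `itertools.islice` cherry-picks values without allocating an endless list of dates (the start date below is arbitrary):

```python
from datetime import date, timedelta
from itertools import islice

def business_days(start):
    """Lazily yield business days (Mon-Fri) from `start`, forever.

    The generator pauses at each yield and remembers where you stopped -
    no giant list of dates is ever allocated.
    """
    day = start
    while True:
        if day.weekday() < 5:        # 0-4 = Monday..Friday
            yield day
        day += timedelta(days=1)

# Cherry-pick a few values with itertools instead of materializing everything:
gen = business_days(date(2013, 11, 8))   # a Friday
first_three = list(islice(gen, 3))
print(first_three)   # Friday, then it skips the weekend to Monday, Tuesday
```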
++++++++++++++++++++++++++++++++++++++++++++++++++
* oozie (XML), azkaban, luigi
* query the central server via a REST api; scalding, pig
* makefile : luigi :: maven : oozie
* mission control
+++++++++++++++++++++++++++++++++++++++++++++++
* can use hadoop, a work queue (celery), shard the data, pull data out of a database, spread data across a shared file system
* multiprocessing, global platforms
* multithreading in c++ is not for the faint-hearted because the memory allocation tools for concurrency are poorly integrated into the language
* parallelize map
++++++++++++++++++++++++++++++++++++++++++++++++
embeddings - james powell
* download the python source and build with CFLAGS='-g3 -ggdb -gdwarf-4 ... '
* the dis module
* PyEval_EvalFrameEx - a gigantic switch statement
* main & Py_Main at the bottom of the stack; PyRun_AnyFileExFlags, PyRun_InteractiveLoopFlags
* pure embedding vs high-level embedding; using cython as a means of embedding
++++++++++++++++++++++++++++++++++++
map reduce with python
* it is easier to collect data now, and you can't fit all the data on a single machine
* divide the dataset into sub-groups and perform operations in parallel
* use case: log processing, counting
* hadoop is the most popular map reduce implementation
* use AWS to run hadoop: amazon's storage services are scalable and you pay only for what you use
* elastic map reduce: resize your cluster to as many ec2 instances as you need, spin up a hadoop cluster on demand
* mrjob: prototype hadoop jobs locally
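A local word-count sketch of the map/shuffle/reduce shape, stdlib only - mrjob wraps the same pattern in a class with mapper()/reducer() methods for running on Hadoop. The input lines are made up for illustration:

```python
from itertools import groupby
from functools import reduce

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# map: each line becomes (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: bring equal keys together (Hadoop does this between the phases)
shuffled = sorted(mapped)

# reduce: sum the counts for each key
counts = {key: reduce(lambda acc, pair: acc + pair[1], group, 0)
          for key, group in groupby(shuffled, key=lambda pair: pair[0])}

print(counts)
```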