Class 1: Introduction and Installation

From Colettapedia

Introduction and Installation (class 1)

Outline for Today's Class

  1. Housekeeping issues
  2. How to get help
  3. Discussion of programming language ecosystem and where Python fits in
  4. What makes Python distinctive
  5. The Python data analysis stack: core Python + essential 3rd party modules
  6. Setup of Environment
  7. Using Python interactively via IPython notebook
  8. Running a program
  9. Homework: Email me the magic number.

Course Description

Python is an easy-to-learn, powerful programming language whose capabilities overlap those of many proprietary software packages such as Excel, MATLAB, and Stata, but it is free and open source. This course is intended for non-programmers who want to learn how to write programs that expand the breadth and depth of their daily research. Most elementary concepts in modern software engineering will be covered, including basic syntax, object-oriented programming, regular expressions, reading from and writing to text files, use of the Python debugger, and creating reusable code modules that are distributable to peers. The end of the course will focus on potential applications of the Python language to bioinformatics, including sequence analysis and data visualization.

Course Description, ver. 2

This course is intended to teach research professionals without a background in programming to write programs to gain insight into data. In addition to covering tools and syntax that are specific to Python, the class will cover elementary concepts that are ubiquitous in modern software engineering, including object-oriented programming, regular expressions, reading from and writing to text files, recursion, use of the debugger, etc. The end of the course will focus on potential applications of the Python language to bioinformatics, including sequence analysis, machine learning, and data visualization.

Goals for BIOF 309

  1. To teach you how to write programs that provide insight into your data
    • organizing, searching, manipulating, visualizing
  2. To teach you how to think like a computer
    • Q. How did the programmer die in the shower? A. He read the shampoo bottle instructions: Lather. Rinse. Repeat.
    • Class assignment: Directions on how to make a peanut butter & jelly sandwich. Be careful what you say, I could take it literally.
  3. To teach you how to learn other programming languages in the future
    • Anatoly anecdote: "Though Russian school of science emphasizes general/abstract ideas more than US school"
    • Python is the 12th programming language I've had to teach myself since 1990 (BASIC, Pascal, C++, Visual Basic, VB.NET, Perl, Bash, XSLT, Javascript, PHP, SQL, Python) => Won't be the last either!
    • In the long run, in 5 years, we may be learning a different language, maybe "Julia", born in 2012, natively optimized for parallel computing; then again, maybe it's a fad?

BIOF 309 Meta Info

  • Target audience: Research professionals who have not programmed before
    • Used Matlab or Excel in the past
    • Grad students in the social sciences
  • Style of course is linear: one concept builds on the next
  • Lectures are recorded so you can pause and try things, watch a lecture multiple times, skip around, etc. The pace of each recorded lecture may be fast.
  • No required textbook, but
  • grades, homework
  • projects 20%
  • email list
    • ask questions on list, responders get extra credit
  • any submission to me in the form lastname_firstinitial_hw99_STUFF.py
    • student's name must appear first in the file name
    • the assignment must appear second in the file name

About Me

  • Computer Scientist, NIH NIA LG IICBU
  • Computer Vision, Machine Learning
  • Pychrm - Python implementation of WND-CHARM
  • Programming since uncle introduced me to Apple IIc BASIC
  • Mechanical Engineering UMASS
  • Software Engineer writing software professionally since 2006

About co-instructors

  • Matt Shirley
  • Ben Busby

Why take programming

  • something that's repetitive and you could theoretically train a monkey (grad student) to do
  • examples:
    • textscraping = hit a website multiple times (PubMed), take out a bit of text from that website, aggregate results
    • identify location of aa/nt sequence motifs
    • parsing text files to pull out what you want (normally have to import into excel, create columns based on.
    • generate many graphs at once, automate/eliminate excel/matlab from data analysis pipeline
  • you've grown out of excel
  • you're too poor to afford matlab
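As one illustration of the motif-location example above, here is a minimal sketch using only the standard library `re` module. The function name `find_motif` and the sequence are made up for illustration, and the motif is treated as a regular expression, so a plain string like "ATA" matches literally.

```python
import re

def find_motif(sequence, motif):
    """Return the 0-based start position of every match of `motif` in `sequence`."""
    # Wrapping the motif in a lookahead lets overlapping matches be reported too.
    pattern = re.compile("(?=(" + motif + "))")
    return [match.start() for match in pattern.finditer(sequence)]

dna = "ATGCGATATATGCC"  # made-up nucleotide sequence, for illustration only
print(find_motif(dna, "ATA"))  # overlapping hits are both reported
print(find_motif(dna, "ATG"))
```

The same function works on amino acid sequences, since it only sees a string of letters.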

Why Python

  • The PyTables FAQ gives three good reasons to use Python: Python is interactive, Python is productive for beginners and experts alike, and Python is data-handling friendly

How to Get Help

Python/Data Analysis SIGs

  • Pyladies - "mission is to promote, educate and advance a diverse Python community through outreach, education, conferences, events and social gatherings. PyLadies also aims to provide a friendly support network for women and a bridge to the larger Python world."
  • DC Python Meetup

Brief Survey of Programming Languages

Static Programming Languages

  • Examples: C, C++, Java
  • FAST to run, slow to develop, difficult to "try things out"
  • Compiled into machine code before you run the program

Dynamic Programming Languages

  • Dynamic programming languages - frequently referred to as scripting languages
    • Often used as "glue" to piece together different programs into one program
  • Code not compiled ahead of time but interpreted at run time
  • Determines the type-safety of operations at run time
    • Type errors cannot be automatically detected until a piece of code is actually executed
  • Slower than compiled languages
  • Dynamic languages have strong overlap with concept of "high-level programming language"
    • Strong abstraction from the details of the computer, e.g., memory management
      • E.g., "Give me an array of numbers from 1 to 100" - a lot of stuff has to happen behind the scenes
  • Pros: Easier to use interactively (type statements in one-at-a-time, examine outputs, try things out)
  • Cons: Abstraction penalty
    • "While high-level languages are intended to make complex programming simpler, low-level languages often produce more efficient code."
      • Programmer time is typically more valuable than CPU time, many are happy to make this tradeoff
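Two of the points above can be seen in a few lines of Python. This is a small sketch, not course material: the function name `add_one` is hypothetical. It shows a type error that only surfaces when the offending line runs, and the "numbers from 1 to 100" request expressed as a single high-level expression.

```python
def add_one(value):
    # Nothing checks the type of `value` until this line actually executes.
    return value + 1

print(add_one(41))  # int + int works

try:
    add_one("41")  # str + int is not defined, but we only find out now
except TypeError as err:
    print("Caught at run time:", err)

# High-level abstraction: one expression builds the numbers 1 to 100.
# Memory allocation and bookkeeping happen behind the scenes.
numbers = list(range(1, 101))
print(len(numbers), numbers[0], numbers[-1])
```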

Best of Both Worlds (Advanced)

  • Write the code in C and wrap it in Python
    • E.g., Python packages NumPy & SciPy
  • LLVM - Compiler infrastructure that does language-agnostic program optimization
    • Python package Numba does this
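The "C underneath, Python on top" idea can be felt without installing anything. The sketch below is an illustration under one assumption: it relies on the standard CPython interpreter, where the built-in `sum()` is implemented in C. It compares that built-in against an equivalent loop written in pure Python (the helper name `python_sum` is made up). NumPy and SciPy apply the same idea to whole arrays.

```python
import timeit

def python_sum(values):
    """Add up the values with an explicit loop executed by the interpreter."""
    total = 0
    for v in values:
        total += v
    return total

data = list(range(100_000))

# Both approaches give the same answer...
assert python_sum(data) == sum(data)

# ...but the C-backed built-in usually finishes noticeably faster.
loop_time = timeit.timeit(lambda: python_sum(data), number=20)
builtin_time = timeit.timeit(lambda: sum(data), number=20)
print(f"pure-Python loop: {loop_time:.4f} s")
print(f"built-in sum():   {builtin_time:.4f} s")
```

Exact timings depend on the machine, so none are quoted here.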

Brief history of Python

  • Guido van Rossum
  • Monty Python -
  • Published in 1991, 1.0 in 1994
  • Strongly influenced by ABC, a language designed for teaching; a low-brow language, low barriers to entry, natural language syntax
  • A person who programs in Python is a "Pythonista"
  • The collective noun of a group of Pythonistas is an "indentation."
  • Don't feel like you have to marry python, it's probably not the last programming language you'll ever learn.

Comparison of R vs. Python

What makes python distinctive

  • "Pythonic"
  • indentation style
    • result is uncluttered, very human readable
    • eliminates the need for explicit terminating keywords like end, endif
  • list comprehension
  • lambda functions
  • large core library with great built-in types: sets, dicts, etc.
  • everything is an object
  • Exceptions
  • import this
  • built to be interactive - read-evaluate-print - one command at a time
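Several items in the list above fit in one short sketch. The variable names and the example string are made up for illustration; everything used is core Python.

```python
# List comprehension: build a list in one readable expression.
squares = [n * n for n in range(1, 6)]
print(squares)

# Dict and lambda: a lookup table plus a small anonymous function.
pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
complement = lambda base: pairs[base]
print("".join(complement(b) for b in "GATTACA"))

# Set: duplicates disappear automatically.
unique_bases = set("GATTACA")
print(sorted(unique_bases))

# Everything is an object: even an integer literal has methods.
print((42).bit_length())

# Exceptions: errors are raised and can be handled, not silently ignored.
try:
    complement("N")  # "N" is not a key in the lookup table
except KeyError as err:
    print("No complement defined for", err)

# Indentation: the blocks above end where the indentation ends, no `end` keyword.
# Typing `import this` at the prompt prints the Zen of Python.
```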

Core Python vs addons

  • introduce full python data processing stack
  • core = Python Standard Library
  • PyPI - Python Package Index
  • NumPy - matrix operations, not part of core
  • survey the bioinformatics Python ecosystem of modules and where they connect
  • the killer is package dependencies - therefore use a package installer like pip

Dog and pony show

  • IPython notebook for demonstrations
    • Reproducible research
    • The unification of code and documentation
  • counting codons
  • making graphs
  • kernel smoothed density graphs
  • pulling certain columns out of a text file
  • get correlation of two arrays using scipy
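Two of the demonstrations listed above, counting codons and pulling columns out of a text file, can be sketched with the standard library alone. This is not the in-class demo itself: the function names, the sequence, and the tab-separated table are all invented for illustration, and an in-memory string stands in for a real file.

```python
import io
from collections import Counter

def count_codons(sequence):
    """Count non-overlapping 3-letter codons, reading from position 0."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    codons = (sequence[i:i + 3] for i in range(0, usable, 3))
    return Counter(codons)

def pull_columns(lines, wanted, sep="\t"):
    """Keep only the column indexes in `wanted` from each line."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        rows.append([fields[i] for i in wanted])
    return rows

seq = "ATGGCCATGGCGTAA"  # made-up coding sequence
print(count_codons(seq).most_common())

table = "gene\tlength\tgc\nabc1\t1200\t0.41\nxyz2\t950\t0.55\n"  # made-up data
for row in pull_columns(io.StringIO(table), [0, 2]):
    print(row)
```

With a real file, `open("somefile.txt")` can be passed in place of the `io.StringIO` object, since both yield lines when iterated.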

Setup of Environment

Preferred Method

  1. Use Anaconda Scientific Python Distribution

Manual Method

  1. installation of interpreter
  2. installation of extension installer - easy_install
  3. installation of extensions - numpy, scipy, matplotlib, biopython
  4. installation of shell - IDLE now, graduate to IPython
  5. installation of editor - IDLE now, graduate to dedicated code editor
  • Environment variables
    • On Windows add path to interpreter to %PATH%
    • add your working directory to $PYTHONPATH
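Whichever installation method is used, a quick sanity check from inside Python can confirm that the interpreter, the search path, and the extensions are in place. The sketch below is one possible check, not an official procedure; the helper name `is_installed` is made up. Note that Biopython is imported under the name `Bio`.

```python
import importlib.util
import os
import sys

# Which interpreter is actually running, and which version is it?
print("Interpreter:", sys.executable)
print("Version:", sys.version.split()[0])

# PYTHONPATH entries (if any) are added to the module search path, sys.path.
print("PYTHONPATH:", os.environ.get("PYTHONPATH", "(not set)"))
print("Number of search path entries:", len(sys.path))

def is_installed(module_name):
    """Return True if Python can locate the module without importing it."""
    return importlib.util.find_spec(module_name) is not None

for name in ["numpy", "scipy", "matplotlib", "Bio"]:
    print(name, "found" if is_installed(name) else "MISSING")
```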

Installation on Mac

Choice of Editor

  • "learn one editor, learn it well."
  • notepad++ on windows
  • xcode free on Mac 10.7
  • something that shows surrounding parentheses
  • use the ipython interactive shell to probe things
  • shell = command line interface = cli
  • choice of IDE
    • IPython
      • IPython Notebook Viewer - integration of code, data, documentation, and visualization. Useful for teaching, and for reproducible research
      • tab completion
      • storage of past commands and results
      • pretty colors
      • existing packages plug in to create a MATLAB-like environment
  • difference of code editors: highlight syntax
    • navigation within functions
    • code folding
    • basic: notepad "4 n00bz"
    • intermediate: komodo?
    • advanced: vim! only "4 haxorz"
  • easy_install - required

Using Python interactively via IPython Notebook

  • From the Python tutorial's chapter on modules: "If you quit from the Python interpreter and enter it again, the definitions you have made (functions and variables) are lost. Therefore, if you want to write a somewhat longer program, you are better off using a text editor to prepare the input for the interpreter and running it with that file as input instead. This is known as creating a script. As your program gets longer, you may want to split it into several files for easier maintenance."
  • Introducing IPython
  • vocab word: "The kernel"
  • shot of scotty talking into the mouse: youtube vid v9kTVZiJ3Uc
  • you talking directly to the computer!
  • the computer talking directly back to you
  • one command at a time, one statement per line
  • parts of the shell
  • prompt - >>>
  • The function and purpose of Ctrl-C
  • Exceptions
  • ideal for trying stuff out
  • %quickref
  • running a program within ipython means that created variables will remain in namespace so you can inspect them interactively
  • output cache Out[4] or _4 for example
  • input cache
  • semicolon suppresses output

Run someone else's program outside of IPython Notebook

  • how to run a program from the command line vs interactive mode
  • one statement per line, with a trailing backslash to continue a statement onto the next line
  • load a program into shell, use import w/o .py suffix
  • run one-liners from command line -- use -c
    • python -c "print('suck it')"
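The points above can be tied together in one small script. This is a sketch under assumptions: the file name `myscript.py` mentioned in the comments is hypothetical, and `subprocess` is used only to show from inside Python what typing `python -c "..."` at the command line does.

```python
import subprocess
import sys

# A trailing backslash continues one statement onto the next line.
total = 1 + 2 + \
        3 + 4
print(total)

def main():
    print("running as a script")

# If this file were saved as myscript.py (hypothetical name):
#   `python myscript.py` sets __name__ to "__main__", so main() runs;
#   `import myscript` (no .py suffix) sets __name__ to "myscript", so it does not.
if __name__ == "__main__":
    main()

# The equivalent of `python -c "print('hello from -c')"` at the command line.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from -c')"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```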

Homework

  • run the script I'm gonna email you and email me back your magic number which
  • includes numpy, check to make sure numpy is installed
  • Automated grading using Praktomat or some other
  • Test driven development - you get the test first, so you'll know if your program works or not
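To make the test-first idea concrete, here is a minimal sketch with invented names: it is not an actual assignment or the grader's format. The test is written first and states what a correct `gc_content` function must do; the implementation that follows only has to make the test pass.

```python
# Step 1: the test comes first and defines what "working" means.
def test_gc_content():
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("atgc") == 0.5   # lowercase input should work too
    assert gc_content("") == 0.0       # empty input must not crash

# Step 2: write the function until the test passes.
def gc_content(sequence):
    """Return the fraction of G and C letters in a nucleotide sequence."""
    if not sequence:
        return 0.0
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# Step 3: run the test. Silence (no AssertionError) means success.
test_gc_content()
print("all tests passed")
```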

Class Invite Email

  • Welcome to BIOF309 Introduction to Python. This message is to let you know that you've been added to the Google group email list. You should be able to access the group without having a prior Google email address if you create a Google account for your NIH email (yep, you read that right!). If you have a Google email already, you might find it easier to use that account for class correspondence. If so email me at christopher.coletta@gmail.com with your gmail address and I'll add you to the list that way. Cheers, Chris