Lecture 2

BIOF518
finding correspondences between amino acid "residues"
assessing similarity w/ substitution matrices
substitution: probability that one amino acid mutates into another
some amino acids convert to others easily (one-step mutations); others need two- or three-step changes (up to a full codon change)
valine, leucine, isoleucine can substitute for each other and it won't matter: evolutionarily speaking, it doesn't disrupt the function of the protein
phenylalanine, tryptophan, tyrosine (the aromatic group)
substitution matrix goes one-for-one, very simple, reward match, penalize mismatch
many mutations don't matter
add: insertion/deletion penalty (indel penalty)
- bigger penalty for gap open, less penalty to gap extend (insertion)
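A minimal sketch of the scoring scheme above: reward match, penalize mismatch, charge more to open a gap than to extend it. The function name and the default scores are hypothetical, picked for illustration only (not values from the lecture); the open cost is charged on the first gap column and the extend cost on each later one.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned, equal-length strings ('-' marks a gap).

    Assumed convention: the first column of a gap run costs gap_open,
    every further column costs gap_extend. Simplification: adjacent gap
    columns count as one run even if they switch sequences.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


one_long_gap = score_alignment("AC--GT", "ACTTGT")
two_short_gaps = score_alignment("A-C-GT", "ATCTGT")
```

With these assumed scores, one two-column gap comes out better than two separate one-column gaps, which is the point of splitting the penalty into open and extend.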
Most people use BLOSUM -
nucleotide vs. amino acid alignment
- amino acid (20) has more inferential statistical power than nucleotide (4)
- phenotype is expressed at the protein level; nucleotide sequence is far less conserved
- translate to protein, align, then map backwards to nucleotides
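The "map backwards" step can be sketched as threading the original coding sequence onto its row of the protein alignment, three nucleotides per residue. The function name is hypothetical, and the sketch assumes the coding sequence is in frame, has no stop codon, and is exactly three times the number of residues.

```python
def codon_align(aligned_protein, cds):
    """Thread an ungapped coding sequence onto its gapped protein row.

    Each residue takes the next codon from cds; each protein gap '-'
    becomes a three-nucleotide gap '---'.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be 3x the number of residues")
    out = []
    pos = 0  # index of the next unused nucleotide in cds
    for ch in aligned_protein:
        if ch == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


nucleotide_row = codon_align("M-K", "ATGAAA")
```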
BLOSUM62 - most important matrix - covers protein query lengths of 85-300 amino acids
for more divergent (human -> bacteria): BLOSUM45 - best for detecting long and weak alignments
human to another human -- BLOSUM90
example: hemoglobin - paralogs (in same organism)
if you ever run BLAST - look at "consensus"
Identical
stronger: invariant
similar - like the plus in consensus
60% identical is different than 60% similar (much weaker inference)
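A rough sketch of the identical vs. similar distinction for one pairwise alignment. The similarity groups are an assumption, taken only from the two residue groups mentioned earlier in these notes (V/L/I and F/W/Y); BLAST itself counts "positives" from the substitution matrix, not from a fixed group list. Gap columns are skipped here, which is also an assumption about the denominator.

```python
# Hypothetical similarity groups, for illustration only.
SIMILAR_GROUPS = [set("VLI"), set("FWY")]


def identity_and_similarity(a, b):
    """Return (% identical, % similar) over the non-gap columns of an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = similar = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gap columns are left out of the denominator
        compared += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in g and y in g for g in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100 * identical / compared, 100 * similar / compared


pct_identical, pct_similar = identity_and_similarity("VLIF", "VIIA")
```

Similarity is always at least as high as identity, so the same percentage quoted as "similar" is weaker evidence than when quoted as "identical".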
structure is conserved even when sequence is not
homology is like being pregnant: you are or you aren't

alignment algorithms

Needleman-Wunsch - dynamic programming - global alignment only, gets tripped up easily
most of the time you want local alignment
Smith-Waterman - use if at all possible. picks the best local path (doesn't have to start from the corners; finds the best "ribbon" in the grid, the ribbon path with the highest scores)
- a matrix M x K
- use it for pairwise sequence alignment
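A compact sketch of Smith-Waterman on the M x K grid, assuming a simple match/mismatch score and a single linear gap penalty (no separate open/extend cost, and no BLOSUM matrix); the default scores are illustrative, not from the lecture. Cell scores are floored at zero, which is what lets the best "ribbon" start and end anywhere instead of at the corners.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # (M+1) x (K+1) score grid
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # restart: local alignment
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTTT", "GGACGTGG")
```

Here the shared core "ACGT" is picked out of otherwise unrelated flanks, which a global alignment would be forced to align end to end.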

do whenever you can - accuracy is much better
use multiple species - get better statistical power
ClustalW - don't use this. guide tree - aligns the easy local stuff first
MUSCLE - use this. builds draft alignment, then improves. get to the multiple alignment as quickly as possible, have correspondence of positions.

neural network implies we don't know what's going on underneath: black box
secondary protein structure - tool "PSIPRED"
tertiary structure:
- ab initio
  - we can't even model water, electron is delocalized,
- if modeling as balls on a spring, can get pretty far though
better: homology modelling - IF YOU CAN FIND ALIGNMENT

identify structure of your protein by comparing to a homologous protein with known structure
multiple sequence alignment better than pairwise
make alignments as accurate as possible
paste sequence into Phyre - find me a structure, and it does!
align: best practice to align domains - a unit of vertical descent

domains: basic unit of homologs
may have different evolutionary history, not vertical descent from common ancestor
active site residues vs. catalytic site residues.

only a few key sequences
most pro
only a few proteins are absolutely necessary
some proteins just exist to pad the core
sequence motif - a regular expression for amino acids
sequence motif - confers the function of protein
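To make "a regular expression for amino acids" concrete, a small sketch using the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} (Asn, anything but Pro, Ser or Thr, anything but Pro) rewritten as a Python regex. The helper name and the test sequence are made up for illustration; the lookahead is there so overlapping hits are also reported.

```python
import re

# PROSITE-style N-{P}-[ST]-{P} as a regex; the lookahead allows overlapping hits.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(seq, pattern=N_GLYC):
    """Return (0-based start, matched text) for every hit of the motif."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


hits = find_motif("MKNASAGNPTQNGTV")  # made-up sequence
```

The "NPT" stretch in the made-up sequence is skipped because proline is excluded at the second position.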
even better: structural motifs - structural template
X arrangement will cause protein to be a peptidase, for example; you will never find certain structures that don't perform a certain function
could have two different proteins that have exact same active site ... may be functional analogs, but
you can tell the important amino acids by looking at what's conserved across species; if not conserved, probably not important
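A minimal sketch of reading conservation off a multiple alignment: for each column, take the most common residue and the fraction of sequences carrying it. The function names and the toy alignment are hypothetical; real tools weight sequences and use substitution-aware scores, so treat this as the simplest possible version.

```python
from collections import Counter


def column_conservation(msa):
    """Per column: (most common character, fraction of sequences that have it)."""
    if len({len(s) for s in msa}) != 1:
        raise ValueError("need a non-empty set of equal-length aligned sequences")
    result = []
    for column in zip(*msa):  # zip(*msa) walks the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(msa)))
    return result


def invariant_positions(msa):
    """0-based columns where every sequence has the same (non-gap) residue."""
    return [i for i, (res, frac) in enumerate(column_conservation(msa))
            if frac == 1.0 and res != "-"]


toy_msa = ["HDSG", "HESG", "HDTG"]  # made-up alignment, one row per species
conserved = invariant_positions(toy_msa)
```

Columns that come back invariant across distant species are the first candidates for active-site or otherwise important residues.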