Lecture 4
project review
- Don't use MegaBLAST: it is only meant for finding highly similar sequences
- use tblastn: do the search at the protein level with
- word size 2
- BLOSUM45
- sequence conservation at the DNA level: the simplest explanation is that it's a gene
- if you search a protein database the introns will be gone
HMMs: intro
- Markov property = memoryless (the next state depends only on the current state)
- Markov process = stochastic process with the Markov property
- Markov chain
- 2x2 matrix of the different kinds of Markov processes
- GeneMark
- Glimmer
- PSI-BLAST = uses a custom, position-specific matrix (PSSM); a lot of information is thrown out when using PSSMs in PSI-BLAST
- HMMs include position-specific gap penalties
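To make the Markov property concrete, here is a minimal Python sketch of a two-state Markov chain. The state names (coding, noncoding) and every probability are made-up illustration values, not parameters taken from GeneMark or Glimmer. The point is only that the probability of a path is a product of one-step transition probabilities, each depending on the previous state alone.

```python
import math

# Hypothetical two-state chain; all numbers are illustrative assumptions.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}


def path_log_prob(states):
    """Log-probability of a state path under the chain.

    Markov property: each step depends only on the previous state,
    so the path probability factorizes into one-step transitions.
    """
    logp = math.log(START[states[0]])
    for prev, cur in zip(states, states[1:]):
        logp += math.log(TRANS[prev][cur])
    return logp


path = ["coding", "coding", "noncoding", "noncoding"]
print(math.exp(path_log_prob(path)))
```

A hidden Markov model adds one more layer on top of this: the states are not observed directly, and each state emits observable symbols (nucleotides or amino acids) with its own probabilities.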
using HMMs
- profile: weight certain things more; build the profile iteratively
- if have representation of conserved motif
- identify the catalytic residues
- prebuilt conserved motif profiles (RPS-BLAST)
- NOW = match profiles: match conserved residues to conserved residues
- uses profile on both sides of search
- HHpred
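- more sensitive, more specific, better overall
- HMM scoring covers the 20 amino acids, plus insertions and deletions
- ROC curve = receiver operating characteristic curve: sensitivity vs. specificity (true positive rate plotted against false positive rate, i.e. 1 - specificity)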
- best - HHpred + SS (secondary structure)
- secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA).
- secondary structure added by PSIPRED
- HHpred blows the competition out of the water when the family is known; otherwise you are out of luck
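The idea of a profile that weights conserved positions more can be sketched in Python as a log-odds position-specific scoring matrix. The toy alignment, the uniform background frequency of 1/20, and the pseudocount of 1 are all assumptions for illustration. Real tools such as PSI-BLAST and HHpred use more elaborate sequence weighting, and profile HMMs additionally carry position-specific insertion and deletion states, which this sketch leaves out.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per column of a gap-free alignment."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, seq):
    """Sum the per-position log-odds scores of an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(profile, seq))


# Toy alignment: the first (H) and last (D) columns are perfectly conserved.
alignment = ["HKD", "HRD", "HAD", "HSD"]
profile = build_profile(alignment)
print(score(profile, "HKD"), score(profile, "AKA"))
```

In this sketch a match at a conserved column contributes more to the score than a match at the variable middle column, which is the sense in which a profile "weights certain things more" than a single substitution matrix does.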
caveats
- profile to profile is cool, but you need profiles to search
- not everything is in a known domain family
- very common to find sets of viral proteins, half of which have no homologs
hhpred
- protein folding problem has not been solved
- even works well at predicting tertiary structure
- ergo: sequence actually does beget structure, which begets function
- sequence is enough to define structure
- proteins (at least globular ones, the ones we can crystallize) have a defined structure, in contrast to RNA, which is floppy
hhblits
orthologs
- SCOP = Structural Classification of Proteins, a hierarchical classification scheme
- GO = Gene Ontology, hierarchical
- orthologs, evolutionary classification of genes
- orthologs are related by speciation
- paralogs are related by duplication, not orthologs! e.g. olfactory receptors: copies diverge and perform different functions
- alpha, beta, gamma (fetal) hemoglobin: fetal hemoglobin binds oxygen more tightly, so it can take oxygen away from the mother
- paralogs are co-orthologous to genes: individually they are not orthologous to the others, but as a set they are
- orthologs typically perform the same function
- xenology: a gene arises by horizontal gene transfer from another organism
- alpha proteobacteria - mitochondria
- in-paralogs: two genes that arose by speciation first, then duplication
- co-ortho - collectively orthologous
- orthologous group: the collection of all descendants of an ancestral gene; a simplifying principle
- FlyBase, WormBase, yeast genome browser
- pick the db that's the prettiest and works the best
tree reconciliation event
- reconciles a species tree with a gene tree
- Occam's razor: place the duplication event towards the leaves of the tree; it is less likely that there was a duplication event further up in the tree followed by loss events in other species
- other tree topologies can suggest duplication before speciation ("out-paralogs") ... two separate gene units
- in-paralogs: duplication AFTER the speciation event
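The reconciliation idea above can be sketched in Python using LCA mapping, one common approach and not necessarily the method presented in lecture. Each gene-tree node is mapped to the lowest common ancestor of its children's species; if that node is the same as the mapping of one of its children, the node is called a duplication, otherwise a speciation. The three-species tree, the gene names, and the tuple-based tree encoding are all hypothetical.

```python
# Species tree as a child -> parent map (hypothetical three-species example).
SPECIES_PARENT = {
    "human": "mammal",
    "mouse": "mammal",
    "mammal": "root",
    "fly": "root",
    "root": None,
}


def ancestors(species):
    """Path from a species-tree node up to the root, inclusive."""
    path = []
    while species is not None:
        path.append(species)
        species = SPECIES_PARENT[species]
    return path


def lca(a, b):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("no common ancestor")


def reconcile(gene_tree, events):
    """Map a gene-tree node onto the species tree and record its event type.

    Leaves are (gene_name, species) pairs of strings; internal nodes are
    (left_subtree, right_subtree) pairs.
    """
    left, right = gene_tree
    if isinstance(left, str):  # leaf: (gene_name, species)
        return right
    mapped_left = reconcile(left, events)
    mapped_right = reconcile(right, events)
    mapped = lca(mapped_left, mapped_right)
    kind = "duplication" if mapped in (mapped_left, mapped_right) else "speciation"
    events.append((mapped, kind))
    return mapped


# In-paralogs: duplication in human AFTER the human/mouse speciation.
in_paralog_tree = ((("hA", "human"), ("hB", "human")), ("mA", "mouse"))
in_events = []
reconcile(in_paralog_tree, in_events)

# Out-paralogs: duplication BEFORE the human/mouse speciation.
out_paralog_tree = (
    (("hA", "human"), ("mA", "mouse")),
    (("hB", "human"), ("mB", "mouse")),
)
out_events = []
reconcile(out_paralog_tree, out_events)

print(in_events)
print(out_events)
```

In the first tree the duplication maps to the human leaf, below the speciation node; in the second it maps to the ancestral mammal node, above both speciation events, which is the out-paralog topology.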
pitfalls of tree reconciliation
- >90% of bacterial genomes have undergone HGT
- 60:40::vertical inheritance:horizontal inheritance
- could you even define a species when 90% of genes come from HGT?
- so: heuristic, graph-based approaches
- ribosomal proteins shared in all 3 domains of life
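The notes do not name a specific graph-based method, so as an assumption the sketch below uses reciprocal best hits, a widely used heuristic of this kind: two genes from different genomes are linked as putative orthologs if each is the other's top-scoring hit. The gene names and scores are made up, and the dictionaries stand in for parsed similarity-search output.

```python
def best_hits(scores):
    """For each query gene, keep its highest-scoring hit in the other genome.

    `scores` maps (query, target) pairs to a similarity score.
    """
    best = {}
    for (query, target), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (target, value)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a and b are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up scores for genes a1..a3 in genome A and b1..b2 in genome B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90, ("a3", "b1"): 120}
b_vs_a = {("b1", "a1"): 210, ("b1", "a3"): 115, ("b2", "a1"): 140, ("b2", "a2"): 95}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Each reciprocal pair is an edge in a graph whose nodes are genes; clustering such edges across many genomes gives orthologous groups without requiring a trusted species tree, which is the appeal when horizontal transfer makes reconciliation unreliable.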