Lecture 4

From Colettapedia

project review

  • Don't use megablast - megablast is only for finding highly similar sequences
  • use tblastn - do the search at the protein level with
  • word size 2
  • BLOSUM45
  • sequence conservation at the DNA level - simplest explanation is that it's a gene
  • if you search a protein database the introns will be gone
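
The search settings above can be sketched as a small helper that assembles the tblastn command line. This is a minimal sketch, assuming the NCBI BLAST+ command-line tools; the file name `query_protein.faa` and database name `target_genome` are hypothetical placeholders, and the helper only builds the command without running it.

```python
import shlex

def build_tblastn_command(query_fasta, nucleotide_db,
                          matrix="BLOSUM45", word_size=2):
    """Return the argument list for a protein query searched against
    a translated nucleotide database."""
    return [
        "tblastn",
        "-query", query_fasta,          # protein sequence(s) in FASTA
        "-db", nucleotide_db,           # nucleotide BLAST database
        "-matrix", matrix,              # scoring matrix for divergent hits
        "-word_size", str(word_size),   # short words = more sensitive seeding
        "-outfmt", "6",                 # tabular output
    ]

# Hypothetical file and database names, for illustration only.
cmd = build_tblastn_command("query_protein.faa", "target_genome")
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and the database has been built.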

HMMs: intro

  • Markov property = memoryless (the next state depends only on the current state)
  • Markov process = stochastic process with the Markov property
  • Markov chain
  • 2x2 matrix of different kinds of Markov processes
  • GeneMark
  • Glimmer
  • PSI-BLAST = uses a custom (position-specific) scoring matrix - a lot of information is thrown out when using PSSMs in PSI-BLAST
  • HMMs include position-specific gap penalties
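
As a small illustration of the Markov property, the probability of a whole sequence under a first-order Markov chain is just the start probability times a product of one-step transition probabilities. This is a toy sketch: the two-letter alphabet (S = G or C, W = A or T) and all probabilities are made-up values, not parameters from GeneMark or Glimmer.

```python
import math

def sequence_log_prob(seq, initial, transition):
    """Log-probability of seq under a first-order Markov chain.

    Memoryless: each symbol is scored using only the symbol before it.
    """
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp

# Hypothetical toy parameters over a two-letter alphabet.
initial = {"S": 0.5, "W": 0.5}
transition = {
    "S": {"S": 0.9, "W": 0.1},
    "W": {"S": 0.5, "W": 0.5},
}

# P(SSW) = P(S) * P(S->S) * P(S->W)
p = math.exp(sequence_log_prob("SSW", initial, transition))
print(p)
```

Comparing such log-probabilities under two different chains (for example one trained on coding and one on non-coding sequence) gives a log-odds score for which model explains the sequence better.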

using HMMs

  • profile - weight certain positions more; build the profile iteratively
  • if you have a representation of a conserved motif, you can identify the catalytic residues
  • prebuilt conserved motif profiles (RPS-BLAST)
  • NOW = match profiles to profiles: conserved residues are matched to conserved residues
  • uses a profile on both sides of the search
  • HHpred
  • more sensitive, more specific, better overall
  • HMM scoring at each position covers the 20 amino acids, plus insertions and deletions
  • ROC curve (receiver operating characteristic curve) - sensitivity vs 1 - specificity
  • best - HHpred + SS (secondary structure)
  • secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA).
  • secondary structure added by PSIPRED
  • HHpred blows the competition out of the water when the family is known; when it is not, you are out of luck
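
To make the idea of a profile concrete, here is a minimal sketch of a position-specific scoring matrix built from an ungapped alignment. It assumes a uniform background frequency and a simple pseudocount, and the four-column motif alignment is invented for illustration; a real profile HMM would additionally carry position-specific insertion and deletion scores, which this sketch leaves out.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Per-column log-odds scores (base 2) against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the column scores for a segment the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Hypothetical ungapped alignment of a short conserved motif.
alignment = ["HDSG", "HDSA", "HESG"]
pssm = build_pssm(alignment)

print(score(pssm, "HDSG"))  # matches the conserved columns
print(score(pssm, "AAAA"))  # mostly unobserved residues
```

Fully conserved columns (here the first and third) contribute the largest scores, which is the sense in which a profile weights some positions more than others.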

caveats

  • profile to profile is cool, but you need profiles to search
  • not everything is in a known domain family
  • it is very common to find viruses where half of the proteins have no known homologs

hhpred

  • protein folding problem has not been solved
  • HHpred even works well at tertiary structure prediction
  • ergo: sequence actually does beget structure, which begets function
  • sequence is enough to define structure
  • proteins (at least globular ones, the ones we can crystallize) have a defined structure, in contrast to RNA, which is floppy

hhblits

orthologs

  • SCOP = Structural Classification of Proteins, a hierarchical classification scheme
  • GO - Gene Ontology, hierarchical
  • orthologs, evolutionary classification of genes
  • orthologs are related by speciation
  • paralogs are related by duplication - not orthologs! e.g. olfactory receptors: a copy diverges and performs a different function
  • alpha, beta, gamma (fetal) hemoglobin - fetal hemoglobin binds oxygen more tightly, so it can steal oxygen away from the mother
  • paralogs can be co-orthologous to a gene in another species - individually they are not orthologs of it, but as a set they are
  • orthologs typically perform the same function
  • xenology - a gene arises by horizontal gene transfer from another organism
  • example: mitochondria derive from alphaproteobacteria
  • in-paralogs - two genes in one species that arose by speciation first, then duplication
  • co-ortho - collectively orthologous
  • orthologous group - collection of all descendants of an ancestral gene - a simplifying principle
  • FlyBase, WormBase, yeast genome browser
  • pick the db that's the prettiest and works the best

tree reconciliation event

  • reconciles a species tree with a gene tree
  • Occam's razor: prefer a duplication event towards the leaves of the tree; it is less likely that there was a duplication event further up in the tree followed by loss events in other species
  • other tree topologies can suggest duplication before speciation: "out-paralogs" ... two separate gene units
  • in-paralogs: duplication AFTER the speciation event
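
One standard way to do this reconciliation is to map every gene-tree node to the smallest species-tree clade that contains all of its species, and to call a node a duplication when it maps to the same clade as one of its children. The following is a minimal sketch of that mapping, not any particular tool's implementation; trees are written as nested tuples, and the `species_gene` naming of leaves is a hypothetical convention used only here.

```python
def clades(tree):
    """Leaf sets of every node in a nested-tuple tree (own clade last)."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    result, leaves = [], frozenset()
    for child in tree:
        sub = clades(child)
        result.extend(sub)
        leaves |= sub[-1]  # last entry is the child's own clade
    result.append(leaves)
    return result

def lca(species_clades, species_set):
    """Smallest species-tree clade containing every species in the set."""
    return min((c for c in species_clades if species_set <= c), key=len)

def reconcile(gene_tree, species_clades, events):
    """Map a gene-tree node to a species clade and record node events."""
    if isinstance(gene_tree, str):
        # Hypothetical leaf naming convention: "<species>_<gene>".
        return frozenset([gene_tree.split("_")[0]])
    child_maps = [reconcile(c, species_clades, events) for c in gene_tree]
    mapped = lca(species_clades, frozenset().union(*child_maps))
    # Same clade as a child means the split happened inside one lineage.
    kind = "duplication" if mapped in child_maps else "speciation"
    events.append((gene_tree, kind))
    return mapped

def label_events(gene_tree, species_tree):
    """Return the event type of each internal gene-tree node, post-order."""
    events = []
    reconcile(gene_tree, clades(species_tree), events)
    return [kind for _node, kind in events]

species_tree = (("human", "mouse"), "fly")

# Duplication after speciation: human_a and human_b are in-paralogs.
in_paralog_tree = (("human_a", "human_b"), "mouse_a")
# Duplication before speciation: the a and b copies are out-paralogs.
out_paralog_tree = (("human_a", "mouse_a"), ("human_b", "mouse_b"))

print(label_events(in_paralog_tree, species_tree))
print(label_events(out_paralog_tree, species_tree))
```

In the first gene tree the duplication sits next to the leaves, below the speciation node; in the second it sits at the root, above both speciation nodes, which is the out-paralog topology.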

pitfalls of tree reconciliation

  • >90% of bacterial genomes have undergone HGT
  • vertical inheritance : horizontal inheritance is roughly 60:40
  • could you even define a species when 90% of genes come from HGT?
  • so use heuristic, graph-based approaches instead
  • ribosomal proteins are shared in all 3 domains of life
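
The notes do not say which graph-based method was meant. As one common example of the family, the sketch below links genes that are reciprocal best hits between two genomes and reads orthologous groups off as connected components of that graph. The gene names, the `species_gene` naming convention, and the best-hit table are all hypothetical; in practice the table would come from all-against-all similarity searches.

```python
from collections import defaultdict

def species_of(gene):
    """Hypothetical naming convention: '<species>_<gene>'."""
    return gene.split("_")[0]

def reciprocal_best_hits(best_hit):
    """Edges between genes that are each other's best hit across genomes."""
    edges = set()
    for (gene, _target_species), hit in best_hit.items():
        if hit != gene and best_hit.get((hit, species_of(gene))) == gene:
            edges.add(frozenset((gene, hit)))
    return edges

def ortholog_groups(edges):
    """Connected components of the reciprocal-best-hit graph."""
    neighbors = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Hypothetical table: (query gene, target species) -> best-scoring hit.
best_hit = {
    ("sp1_g1", "sp2"): "sp2_g1",
    ("sp2_g1", "sp1"): "sp1_g1",
    ("sp2_g1", "sp3"): "sp3_g1",
    ("sp3_g1", "sp2"): "sp2_g1",
    ("sp1_g2", "sp2"): "sp2_g1",  # one-way hit only, so no edge
}

groups = ortholog_groups(reciprocal_best_hits(best_hit))
print(groups)
```

Because this needs only pairwise similarity and no trusted species tree, it sidesteps the reconciliation problem when horizontal transfer is common, at the cost of being a heuristic.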