Lecture 3

From Colettapedia
  • BLAST - choose database
    • Swiss-Prot - manually curated
  • Change max target sequences - change to 20,000
  • change scoring params
  • leave filter options alone
  • program selection - optimize for
    • MEGABLAST - compare human to human mapped genome
    • not for human - bacteria searches
  • BLink - precomputed BLAST search for some proteins
    • taxonomy report - tax-BLAST
  • other reports: distance tree of results

BLAST Algorithm

  • BLAST uses a heuristic
  • looks for seeds, and extends the alignment around each seed
  • keeps extending as long as the score improves; stops once the score falls too far below the best score seen
  • report statistical significance of match
  1. start with a query word (proteins: 3 letters; nucleotides: 11)
  2. identify all nearly identical seed matches that score above threshold
    • ex: megablast has a default word size of 28, but for a sensitive search you want the word size to be small
  • BLAST Statistics: E-value
  • go to blast help and click on statistics of sequence similarity scores
  • the E-value doesn't tell you whether the match contains domains
  • try reciprocal BLAST to confirm - a match is extremely unlikely to occur by random chance in both directions
  • searching for something with the same function in BLAST
  • 35% identity or less - "the twilight zone"
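
The seed-and-extend idea and the E-value from the notes above can be sketched in a few lines of Python. This is a minimal illustration, not BLAST itself: it uses exact-match seeds only (real protein BLAST also accepts neighboring words that score above a threshold), ungapped extension with a simple drop-off cutoff, and the Karlin-Altschul form E = K * m * n * exp(-lambda * S). The function names, the scoring values (+1/-2), the `x_drop` cutoff, and the default `lam` and `k` values are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def build_word_index(subject, w):
    """Map every length-w word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def extend(query, subject, q, s, step, match, mismatch, x_drop):
    """Ungapped extension from (q, s) in direction step (+1 or -1).

    Returns (best score gained, number of positions kept). Extension stops
    when the running score drops more than x_drop below the best seen.
    """
    score = best = 0
    kept = n = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        n += 1
        if score > best:
            best, kept = score, n
        elif best - score > x_drop:
            break
        q += step
        s += step
    return best, kept


def seed_and_extend(query, subject, w=4, match=1, mismatch=-2, x_drop=5):
    """Return the best hit as (score, query_start, subject_start, length).

    Returns None when no exact seed of length w is shared.
    """
    index = build_word_index(subject, w)
    best_hit = None
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], []):
            right, rlen = extend(query, subject, qi + w, si + w, +1,
                                 match, mismatch, x_drop)
            left, llen = extend(query, subject, qi - 1, si - 1, -1,
                                match, mismatch, x_drop)
            hit = (w * match + left + right, qi - llen, si - llen,
                   w + llen + rlen)
            if best_hit is None or hit > best_hit:
                best_hit = hit
    return best_hit


def e_value(score, m, n, lam=1.28, k=0.46):
    """Expected number of chance hits scoring >= score.

    m and n are the query and database lengths. lam and k depend on the
    scoring system; the defaults here are illustrative placeholders.
    """
    return k * m * n * math.exp(-lam * score)


hit = seed_and_extend("TTGACCATGC", "AAAAATTGACCATGCAAAAA")
print(hit, e_value(hit[0], m=10, n=20))
```

A smaller word size `w` produces more seeds and therefore more sensitivity at the cost of speed, which is the trade-off behind the word-size remark in the notes.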

Going beyond sequence similarity

  • DNA sequence mutates faster than structure
  • identify similarities at the structural level
  • conserved motifs to the rescue
  • certain residue changes cause the protein to fold differently
  • e.g. it is sometimes OK to swap one positively charged residue for another; this won't change folding
  • if looking at things 60% identical - BLAST is OK
  • but in the twilight zone, plain BLAST is not sensitive enough
  • BLAST - 1990
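
To make the identity thresholds above concrete, here is a small sketch that computes percent identity from a pairwise alignment and flags the twilight zone. It assumes the two sequences are already aligned (same length, `-` for gaps) and divides by the number of gap-free columns; other tools use different denominators, so the numbers are not directly comparable. The function names and the 35% default cutoff (taken from the notes) are illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns where the residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def in_twilight_zone(identity, cutoff=35.0):
    """True when identity is at or below the cutoff from the notes."""
    return identity <= cutoff


pid = percent_identity("MSKRKAPQ", "MSRRKTPQ")
print(pid, in_twilight_zone(pid))
```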

PSI-BLAST - 1997

  • twilight zone is pushed back because it looks at conserved motifs
  • Position-Specific Iterated Basic Local Alignment Search Tool
  • profile corruption - once matches to other residues have been allowed into the profile, you can't disallow them
  • low-complexity
  • custom representation of sequence motifs in protein family
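
The "custom representation of sequence motifs" is a position-specific scoring matrix (PSSM). The sketch below builds one from a set of already-aligned, ungapped hits and scores a candidate segment against it. It is a simplification under stated assumptions: a uniform background frequency of 1/20, a flat pseudocount of 1, and no sequence weighting, none of which match what PSI-BLAST actually does. The function names and the example sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all hits must have the same aligned length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Start every residue at the pseudocount so unseen residues get a
        # finite (negative) score instead of minus infinity.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned_hits:
            if seq[col] in counts:
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment; unknown residues score 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))


hits = ["GDSL", "GDSV", "GDSI", "GESL"]  # hypothetical aligned motif hits
pssm = build_pssm(hits)
print(score_segment(pssm, "GDSL"), score_segment(pssm, "WWWW"))
```

Because every included hit adds counts to every column, a single false positive raises the scores of its residues in all later iterations, which is one way to picture the profile corruption mentioned above.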

variations

  • PHI blast - regular expression
    • start with regular expression matching motif
  • megablast - rapid alignment of very long DNA, but less sensitivity
    • intended for highly similar sequences
  • discontiguous megaBLAST - maybe better for cancer
  • BLAT - if your database doesn't change, build up a huge table of all seeds,
    • NR can't do that because it changes daily, but genomes won't change
  • bottom line: PSI-BLAST works based on iterations. The 1st iteration is matched with straight BLOSUM62 (or whichever matrix is selected); it then takes the conserved motifs (constrained amino acid variation that occurs in the protein family) from the top-matching sequences and builds a new position-specific scoring matrix (adds info to BLOSUM62) for the next iteration
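
PHI-BLAST patterns are written in PROSITE-style syntax, which maps closely onto ordinary regular expressions. The sketch below converts a subset of that syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` and `(n,m)` for repeats) into a Python regex and locates the motif in a sequence. The function name and the example sequence are illustrative, and the converter does not cover the full PROSITE grammar (for example the `<` and `>` anchors).

```python
import re

# One pattern element: a residue, x, [..] or {..}, with an optional repeat.
_ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as '[AG]-x(4)-G-K-[ST]'."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core in ("x", "X"):
            regex = "."                       # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            regex = core                      # single residue or [..] set
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
match = re.search(regex, "MTEYKLVVVGAGGVGKSALTIQ")  # illustrative sequence
print(regex, match.start(), match.group())
```

Note that a match to the pattern only says where the motif occurs; it does not by itself score the rest of the sequence.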

class exercises

  1. not all genomes have been pre-processed
  • human DNA polymerase Beta sequence:
>gi|544186|sp|P06746.3|DPOLB_HUMAN RecName: Full=DNA polymerase beta
MSKRKAPQETLNGGITDMLTELANFEKNVSQAIHKYNAYRKAASVIAKYPHKIKSGAEAKKLPGVGTKIA
EKIDEFLATGKLRKLEKIRQDDTSSSINFLTRVSGIGPSAARKFVDEGIKTLEDLRKNEDKLNHHQRIGL
KYFGDFEKRIPREEMLQMQDIVLNEVKKVDSEYIATVCGSFRRGAESSGDMDVLLTHPSFTSESTKQPKL
LHQVVEQLQKVHFITDTLSKGETKFMGVCQLPSKNDEKEYPHRRIDIRLIPKDQYYCGVLYFTGSDIFNK
NMRAHALEKGFTINEYTIRPLGVTGVAGEPLPVDSEKDIFDYIQWKYREPKDRSE
  • main idea:
  1. what other domain is commonly associated with this domain:
    1. do a CDART (conserved domain architecture retrieval tool)
    2. list is arranged in order of most commonly associated domains
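
For working with the exercise sequence in a script rather than in the web form, a FASTA record like the one above can be read with a few lines of standard-library Python. This is a minimal sketch: the function names are hypothetical, and the accession helper assumes the older NCBI header layout `gi|<id>|<db>|<accession>|<name>` seen in this record.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession_from_header(header):
    """Return the 4th '|'-separated field, or None if the layout differs."""
    fields = header.split("|")
    return fields[3] if len(fields) >= 4 else None


example = """>gi|544186|sp|P06746.3|DPOLB_HUMAN RecName: Full=DNA polymerase beta
MSKRKAPQETLNGGITDMLTELANFEKNVSQAIHKYNAYRKAASVIAKYPHKIKSGAEAKKLPGVGTKIA
EKIDEFLATGKLRKLEKIRQDDTSSSINFLTRVSGIGPSAARKFVDEGIKTLEDLRKNEDKLNHHQRIGL
"""
header, sequence = parse_fasta(example)[0]
print(accession_from_header(header), len(sequence))
```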