Module 2 Lecture 1 Homework

From Colettapedia

Steps for building UPGMA trees from a distance matrix

  • "Unweighted Pair Group Method with Arithmetic Mean"

Turn Similarity Matrix into Distance Matrix

  • Original similarity matrix
    A    B    C    D    E    F
A   10   5    1    8    3    5
B        10   5    4    8    5
C             10   2    8    6
D                  10   3    6
E                       10   4
F                            10
  • Similarity matrix normalized to 1
    A     B     C     D     E     F
A   1     0.5   0.1   0.8   0.3   0.5
B         1     0.5   0.4   0.8   0.5
C               1     0.2   0.8   0.6
D                     1     0.3   0.6
E                           1     0.4
F                                 1
  • Distance matrix, obtained from the normalized similarity matrix by subtracting 1 from each entry and taking the absolute value (d = |s - 1| = 1 - s)
    A     B     C     D     E     F
A   0     0.5   0.9   0.2   0.7   0.5
B         0     0.5   0.6   0.2   0.5
C               0     0.8   0.2   0.4
D                     0     0.7   0.4
E                           0     0.6
F                                 0
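
The conversion above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the dictionary layout and the names `similarity`, `MAX_SCORE` and `to_distance` are assumptions made for the example, not part of the homework.

```python
# Upper-triangle similarity scores copied from the original matrix
# (the diagonal, always 10, is left out).
similarity = {
    ("A", "B"): 5, ("A", "C"): 1, ("A", "D"): 8, ("A", "E"): 3, ("A", "F"): 5,
    ("B", "C"): 5, ("B", "D"): 4, ("B", "E"): 8, ("B", "F"): 5,
    ("C", "D"): 2, ("C", "E"): 8, ("C", "F"): 6,
    ("D", "E"): 3, ("D", "F"): 6,
    ("E", "F"): 4,
}
MAX_SCORE = 10  # self-similarity, used to normalize scores to 1


def to_distance(sim, max_score):
    """Normalize each score to [0, 1], then take |s - 1| as the distance."""
    # round() only hides floating-point noise such as 0.19999999999999996
    return {pair: round(abs(score / max_score - 1), 10)
            for pair, score in sim.items()}


distance = to_distance(similarity, MAX_SCORE)
print(distance[("A", "D")], distance[("A", "C")])
```

For the pairs printed here the result matches the distance matrix: A-D is 0.2 and A-C is 0.9.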

Matrix Reduction

  • Inspect the original matrix and find the smallest distance. Half that number is the branch length to each of those two taxa from the first Node.
    • There are three pairs that tie for shortest distance: AD, BE and CE all have distance of 0.2. Branch length is 0.2/2 = 0.1.
  • Create a new, reduced matrix with entries for the new Node and the "unpicked" taxa. The distance from each of the unpicked will be the average of the distance from the unpicked to each of the two taxa in Node 1.
    • Start with AD
      A,D    B      C      E      F
A,D   0      0.55   0.85   0.7    0.45
B            0      0.5    0.2    0.5
C                   0      0.2    0.4
E                          0      0.6
F                                 0

      A,D to B = (AB+DB)/2 = (0.5+0.6)/2 = 0.55
      A,D to C = (AC+DC)/2 = (0.9+0.8)/2 = 0.85
      A,D to E = (AE+DE)/2 = (0.7+0.7)/2 = 0.7
      A,D to F = (AF+DF)/2 = (0.5+0.4)/2 = 0.45
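
A single reduction step can be sketched as below. It follows the averaging rule exactly as stated on this page (a plain mean of the two merged rows). Be aware that implementations which weight the mean by cluster size give different values once clusters of unequal size are joined. The `frozenset` keys and the function name `reduce_once` are assumptions made for this example.

```python
from itertools import combinations

# Distance matrix from the previous section, upper triangle row by row.
labels = "ABCDEF"
rows = {
    "A": [0.5, 0.9, 0.2, 0.7, 0.5],
    "B": [0.5, 0.6, 0.2, 0.5],
    "C": [0.8, 0.2, 0.4],
    "D": [0.7, 0.4],
    "E": [0.6],
}
dist = {}
for i, a in enumerate(labels[:-1]):
    for b, d in zip(labels[i + 1:], rows[a]):
        dist[frozenset((a, b))] = d  # unordered pair -> distance


def reduce_once(dist, a, b):
    """Merge clusters a and b; distances to the new node are plain averages."""
    merged = a + "," + b
    others = sorted({t for pair in dist for t in pair} - {a, b})
    new = {}
    for k in others:
        new[frozenset((merged, k))] = (dist[frozenset((a, k))]
                                       + dist[frozenset((b, k))]) / 2
    for k, m in combinations(others, 2):
        new[frozenset((k, m))] = dist[frozenset((k, m))]  # untouched entries
    return new


smallest = min(dist.values())
ties = sorted("".join(sorted(p)) for p, d in dist.items() if d == smallest)
print(smallest / 2, ties)  # branch length and the tied candidate pairs

reduced = reduce_once(dist, "A", "D")
for k in "BCEF":
    print("A,D to", k, round(reduced[frozenset(("A,D", k))], 4))
```

Run on the matrix above, this reports the three tied pairs (AD, BE, CE) and reproduces the A,D row of the reduced matrix (0.55, 0.85, 0.7, 0.45).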
  • Inspect the reduced matrix and find the smallest distance. If it joins two "unpicked" taxa, they form a new Node with branch lengths of half that distance. If Node 2 consists of Node 1 plus an "unpicked" taxon, the branch to the unpicked taxon is half the distance, and the other branch (the one with a Node in it) has two segments that together add up to half the smallest distance in the matrix.
    • Now join B and E, since both are unpicked taxa; CE (also 0.2) would have been an equally valid choice.
    • The branch length is again 0.2/2 = 0.1.
      A,D    B,E    C      F
A,D   0      0.625  0.85   0.45
B,E          0      0.35   0.55
C                   0      0.4
F                          0

      A,D to B,E = (A,D->B + A,D->E)/2 = (0.55+0.7)/2 = 0.625
      B,E to C = (BC+EC)/2 = (0.5+0.2)/2 = 0.35
      B,E to F = (BF+EF)/2 = (0.5+0.6)/2 = 0.55
  • Continue these distance calculation and matrix reduction steps until all taxa have been picked.
    • Next shortest distance is BE to C at 0.35
    • Branch length from new node BEC to unpicked taxon C is 0.35/2 = 0.175
    • Branch length from new node BEC to node BE is 0.175 - 0.1 = 0.075
      AD     BEC     F
AD    0      0.7375  0.45
BEC          0       0.475
F                    0

      AD to BEC = (AD->BE + AD->C)/2 = (0.625+0.85)/2 = 0.7375
      BEC to F = (BE->F + CF)/2 = (0.55+0.4)/2 = 0.475
  • Repeat the same calculation and reduction on the new matrix.
    • Next shortest distance is AD to F at 0.45
    • Branch length from new node ADF to unpicked taxon F is 0.45/2 = 0.225
    • Branch length from new node ADF to node AD is 0.225 - 0.1 = 0.125
      ADF    BEC
ADF   0      0.60625
BEC          0

      ADF to BEC = (AD->BEC + F->BEC)/2 = (0.7375+0.475)/2 = 0.60625
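  • Only two nodes are left, so calculate the distance from each node to the root.
    • The root lies at half the value in the final reduced matrix above: 0.60625/2 = 0.303 (rounded). Each root branch, plus the height of the node it leads to, must add up to this value.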
    • Distance from root to node ADF = 0.303 - 0.225 = 0.078
    • Distance from root to node BEC = 0.303 - 0.175 = 0.128
  • The usual convention for UPGMA trees (in which the two lineages descending from any node are equally long) is to draw the boxy phylogram of horizontal and vertical lines used in the Excel spreadsheet. This makes it easy to read "time" along the X-axis.
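
The branch-length bookkeeping in the steps above comes down to one rule: each node sits at a height of half the distance at which it was formed, and every branch is the parent's height minus the child's height. A minimal sketch of that rule, using the joining order chosen on this page (the dictionary layout and function names are assumptions for the example):

```python
# Height of each node = half the matrix distance at which it was formed.
heights = {
    "A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 0.0, "F": 0.0,
    "AD": 0.2 / 2, "BE": 0.2 / 2,
    "BEC": 0.35 / 2, "ADF": 0.45 / 2,
    "root": 0.60625 / 2,
}
children = {
    "AD": ["A", "D"], "BE": ["B", "E"],
    "BEC": ["BE", "C"], "ADF": ["AD", "F"],
    "root": ["ADF", "BEC"],
}


def branch_lengths(heights, children):
    """Branch length = height of parent node minus height of child."""
    return {(parent, child): heights[parent] - heights[child]
            for parent, kids in children.items() for child in kids}


def root_to_tip(node, children, lengths):
    """Path lengths from `node` down to every tip below it."""
    if node not in children:  # a tip
        return [0.0]
    paths = []
    for child in children[node]:
        paths += [lengths[(node, child)] + p
                  for p in root_to_tip(child, children, lengths)]
    return paths


lengths = branch_lengths(heights, children)
for edge, value in lengths.items():
    print(edge, round(value, 4))
print({round(p, 6) for p in root_to_tip("root", children, lengths)})
```

The internal segments come out as 0.075 (BEC to BE), 0.125 (ADF to AD), 0.0781 (root to ADF) and 0.1281 (root to BEC), and every root-to-tip path has the same length, which is what lets the tree be drawn against a single "time" axis.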

Here's a MacPaint version of my tree; apologies for the crudeness of the drawing.

Steps for neighbor-joining (NJ) trees from a distance matrix

  1. From the original matrix calculate the ROW totals, r. The r for each taxon is the sum of the distance from it to each of the other taxa. Put these r-values in a column to the right of the distance matrix.
  2. Divide r by N-2, where N is the number of taxa in the working matrix.
  3. Use the r-values to create a transformed matrix. Each value is the original distance minus half the sum of the two r-values. For example, the transformed Hu-Ch value is the original Hu-Ch distance minus half the sum of the Hu and Ch r-values.
  4. Pick the smallest value in the transformed matrix as the first Clade.
  5. Calculate the branch length to one of the two taxa (X and Y) by using the following formula:
x = [(X to Y in untransformed matrix) + (r/(N-2) for X - r/(N-2) for Y)] / 2
    • It doesn't matter which taxon you pick first to compute this branch length.
    • But it does matter which taxon comes first in the r/(N-2) "difference bit": it is always the "target" taxon, the one whose new branch length we are calculating. If we are calculating x (X to Node 1), the first term is X's r/(N-2). If we are calculating c (C to Node 2), the first term is C's r/(N-2), and so on.
  6. Verbal summary of Step 5: the branch length from the new taxon to the new node is [(the distance from the new taxon to the old node in our current working matrix) plus (the new taxon's r/(N-2) value minus the old node's r/(N-2) value)], all divided by two. When we are building the first node, the main difference is that the "old node" is just the other taxon (Orang, or B, or whatever the case may be). E.g., in the great ape case ("HominTrees.XL") the first node is DE, orangutan and gibbon, so when we calculate e (the gibbon branch), the "old node" is just the single taxon D (orangutan).
  7. Now calculate y by simply subtracting x from the original X to Y distance. N.B. It works equally well to compute y = [(Y to X) + (Y's r/(N-2) - X's r/(N-2))] / 2. That is more complicated, but it obeys the Step 5 rule and gives the same answer.

Put differently: x + y should equal the original X to Y distance. Or c + 1' = C to Node 1.
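
As a rough illustration of the row totals, the transformed matrix and the two branch-length formulas, the sketch below applies them to the six-taxon distance matrix from the UPGMA section (reusing that matrix here is an assumption made for the example; the homework's own NJ data are not shown on this page). The helper `pick_pair` takes the per-taxon amounts to subtract as a parameter, because the widely published form of the NJ criterion subtracts the r/(N-2) values, whereas the steps above are worded as "half the sum of the two r-values". For this matrix both versions select the same pair.

```python
labels = ["A", "B", "C", "D", "E", "F"]
D = [
    [0.0, 0.5, 0.9, 0.2, 0.7, 0.5],
    [0.5, 0.0, 0.5, 0.6, 0.2, 0.5],
    [0.9, 0.5, 0.0, 0.8, 0.2, 0.4],
    [0.2, 0.6, 0.8, 0.0, 0.7, 0.4],
    [0.7, 0.2, 0.2, 0.7, 0.0, 0.6],
    [0.5, 0.5, 0.4, 0.4, 0.6, 0.0],
]
n = len(labels)
r = [sum(row) for row in D]        # row totals
u = [ri / (n - 2) for ri in r]     # r/(N-2)


def pick_pair(D, w):
    """Return (i, j), i < j, minimizing D[i][j] - (w[i] + w[j])."""
    pairs = [(i, j) for i in range(len(D)) for j in range(i + 1, len(D))]
    return min(pairs, key=lambda p: D[p[0]][p[1]] - (w[p[0]] + w[p[1]]))


# Transformed matrix as worded on this page: subtract half of each r-value.
i, j = pick_pair(D, [ri / 2 for ri in r])

x = (D[i][j] + (u[i] - u[j])) / 2      # branch to the first taxon of the pair
y = D[i][j] - x                        # branch to the second taxon
y_alt = (D[j][i] + (u[j] - u[i])) / 2  # same value via the longer formula

print(labels[i], labels[j], round(x, 4), round(y, 4))
print(pick_pair(D, u) == (i, j))       # criterion using r/(N-2) instead
```

Here the selected pair is A and D, with x = 0.1125 and y = 0.0875, so x + y recovers the original A to D distance of 0.2.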

  8. Create a new, reduced matrix of distances with Node 1 and the "unpicked" taxa. Compute the distances from each of the unpicked taxa to Node 1 by using the following algorithm.
	Say the unpicked are K, L and M.  
	K to Node 1 = [K to X original - x + K to Y original - y] / 2

That is, [original unpicked to Node 1 taxon 1 - newly computed branch length for Node 1 taxon 1 + original unpicked to Node 1 taxon 2 - newly computed branch length for Node 1 taxon 2], all divided by two.
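
A small sketch of this reduction formula follows. It assumes, purely for illustration, that taxa A and D from the six-taxon distance matrix in the UPGMA section form Node 1; the row totals 2.8 and 2.7 are the sums of the A and D rows of that matrix, and the function names are made up for the example.

```python
def branch_lengths(d_xy, r_x, r_y, n):
    """Branch lengths x and y from the new node to taxa X and Y."""
    x = (d_xy + (r_x - r_y) / (n - 2)) / 2
    return x, d_xy - x


def dist_to_new_node(d_kx, d_ky, x, y):
    """Distance from unpicked taxon K to the node joining X and Y."""
    return (d_kx - x + d_ky - y) / 2


# Original distances from each unpicked taxon to A and to D.
to_A = {"B": 0.5, "C": 0.9, "E": 0.7, "F": 0.5}
to_D = {"B": 0.6, "C": 0.8, "E": 0.7, "F": 0.4}

# d(A, D) = 0.2, r_A = 2.8, r_D = 2.7, N = 6 taxa in the working matrix.
x, y = branch_lengths(0.2, 2.8, 2.7, 6)

node1 = {k: dist_to_new_node(to_A[k], to_D[k], x, y) for k in to_A}
for k, d in node1.items():
    print(k, "to Node 1:", round(d, 4))
```

Because x + y always equals the original X to Y distance, the formula is the same as (K to X + K to Y - X to Y) / 2; with these numbers it gives 0.45, 0.75, 0.6 and 0.35 for B, C, E and F.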

  9. With the new matrix, go back and compute the r's and r/(N-2) values (remembering that N is now smaller), then repeat the steps above.
  • r-values are used to create the transformed matrix. r/(N-2) values are used for computing branch lengths.