Scipy.cluster
resources
- Phylogenetics (Wikipedia)
- computational phylogenetics (Wiki)
- PHYLIP (Wikipedia)
- Biopython Phylo cookbook
- Biopython documentation on unrooted trees
- Biopython documentation discussing draw_graphviz()
- NetworkX (Wikipedia) - Python library for studying graphs and networks. NetworkX is suitable for operation on large real-world graphs: e.g., graphs in excess of 10 million nodes and 100 million edges. Due to its dependence on a pure-Python "dictionary of dictionary" data structure, NetworkX is a reasonably efficient, very scalable, highly portable framework for network and social network analysis
- Graphviz (Wikipedia) - (short for Graph Visualization Software) is a package of open-source tools initiated by AT&T Labs Research for drawing graphs specified in DOT language scripts.
- DOT language (Wikipedia) - really cool!
- Newick format (Wikipedia)
- Distance matrices in phylogeny (Wikipedia)
- Fitch and Margoliash (ftp.cse.sc.edu/bioinformatics/notes/020307patel.doc)
- Clustering Example
- least squares method
hierarchical clustering
- Agglomerative - each observation starts in its own cluster, and pairs of clusters are merged step by step
- Divisive - all observations start in one cluster, and splits are formed recursively
- Both approaches need a way to measure how dissimilar one cluster is from another
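The agglomerative approach can be sketched in a few lines. This is a deliberately naive illustration, not how scipy implements it: the function name naive_agglomerative, the sample points, and the choice of single linkage with Euclidean distance are all assumptions made for the example.

```python
import numpy as np


def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Illustrative only: uses single linkage (closest pair of members)
    with Euclidean distance and a brute-force search.
    """
    # Every observation starts in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


# Made-up sample data: two tight pairs and one isolated point.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
result = naive_agglomerative(points, 3)
```

With these points the two tight pairs are merged first, leaving the isolated point in a cluster of its own.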
metric
- a measure of distance between pairs of observations
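- Euclidean distance
- squared Euclidean distance
- Mahalanobis distance = $\sqrt{(a - b)^{T} S^{-1} (a - b)}$ for observations $a$ and $b$, where $S$ is the covariance matrix
- cosine distance = $1 - \frac{a \cdot b}{\|a\| \, \|b\|}$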
- Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other
- Levenshtein distance = minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.
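A short sketch of the metrics listed above. The numeric metrics use functions from scipy.spatial.distance; the two string metrics are written by hand here, and the helper names hamming_count and levenshtein, the sample vectors, and the covariance matrix are assumptions made for illustration. Note that scipy's own distance.hamming returns the fraction of differing positions rather than the count, which is why a separate helper is used.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

d_euclid = distance.euclidean(a, b)      # straight-line distance
d_sqeuclid = distance.sqeuclidean(a, b)  # squared Euclidean distance
d_cosine = distance.cosine(a, b)         # 1 - cosine similarity

# mahalanobis() expects the INVERSE of the covariance matrix S.
S = np.array([[2.0, 0.0], [0.0, 2.0]])   # made-up covariance matrix
d_mahal = distance.mahalanobis(a, b, np.linalg.inv(S))


def hamming_count(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))


def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


h = hamming_count("karolin", "kathrin")
lev = levenshtein("kitten", "sitting")
```

The same metric names ("euclidean", "sqeuclidean", "mahalanobis", "cosine", "hamming") can be passed as the metric argument of scipy.spatial.distance.pdist() to compute all pairwise distances at once.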
linkage criterion
- specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets
- Complete-linkage clustering - maximum, or farthest neighbor
- Single-linkage clustering - minimum, or nearest neighbor
- Unweighted Pair Group Method with Arithmetic Mean (UPGMA) - average linkage: the mean distance over all pairs of observations, one taken from each cluster
- minimum energy clustering
- neighbor joining
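The difference between the first three criteria shows up in the distance at which the final merge happens. The sketch below assumes three made-up 1-D observations and uses the method names "single", "complete", and "average" (UPGMA) accepted by scipy.cluster.hierarchy.linkage().

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Made-up observations: 0 and 1 are close together, 5 is far away.
X = np.array([[0.0], [1.0], [5.0]])
y = pdist(X)  # pairwise distances: d(0,1)=1, d(0,5)=5, d(1,5)=4

# Column 2 of the last row is the distance of the final merge,
# i.e. the dissimilarity between cluster {0, 1} and cluster {5}.
final_merge = {
    method: linkage(y, method=method)[-1, 2]
    for method in ("single", "complete", "average")
}
```

Here the cluster {0, 1} joins {5} at distance 4 under single linkage (nearest pair), 5 under complete linkage (farthest pair), and 4.5 under average linkage (mean of the two pairs).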
scipy.cluster.hierarchy.dendrogram()
definitions
- cophenetic distance - the height in the dendrogram at which two objects are first joined into the same cluster, i.e. a measure of how similar two objects have to be in order to be grouped together
- cophenetic correlation coefficient - a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points
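- monotonic - a hierarchy is monotonic when the merge distances never decrease from one merge to the next
- linkage matrix - the (n - 1) x 4 array returned by scipy.cluster.hierarchy.linkage(); each row records the two clusters merged, the distance between them, and the number of observations in the new cluster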
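These definitions map onto functions in scipy.cluster.hierarchy. A minimal sketch, assuming four made-up 1-D observations and average linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, is_monotonic, linkage
from scipy.spatial.distance import pdist

# Made-up observations forming two tight pairs.
X = np.array([[0.0], [1.0], [5.0], [6.0]])
y = pdist(X)                      # original pairwise distances
Z = linkage(y, method="average")  # linkage matrix, shape (n - 1, 4)

# cophenet() returns the cophenetic correlation coefficient and the
# cophenetic distances in the same condensed order as y.
c, coph_dists = cophenet(Z, y)

# True when merge distances never decrease down the rows of Z.
monotonic = is_monotonic(Z)
```

A value of c close to 1 suggests the dendrogram preserves the original pairwise distances well.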
scipy.cluster.hierarchy.linkage()
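- takes as input a condensed distance matrix (the upper triangle of the distance matrix flattened into a 1-D array, as returned by scipy.spatial.distance.pdist()) or a 2-D array of raw observations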
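One way to get from a full square distance matrix to a dendrogram is sketched below. The matrix values and the labels A to D are made up; squareform() converts the square matrix to the condensed form, and no_plot=True makes dendrogram() return the tree layout as a dictionary without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Made-up symmetric distance matrix with a zero diagonal.
D = np.array([
    [0.0,  2.0, 6.0, 10.0],
    [2.0,  0.0, 5.0,  9.0],
    [6.0,  5.0, 0.0,  4.0],
    [10.0, 9.0, 4.0,  0.0],
])

y = squareform(D)                 # condensed form: upper triangle, row by row
Z = linkage(y, method="average")  # UPGMA on the condensed distances

# Compute the dendrogram layout only; "ivl" holds the leaf labels
# in the order they would appear along the axis.
tree = dendrogram(Z, no_plot=True, labels=["A", "B", "C", "D"])
leaf_order = tree["ivl"]
```

Dropping no_plot=True draws the tree with matplotlib instead, which then needs to be installed.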