Scipy.cluster

From Colettapedia


Contents

1 resources
- 1.1 hierarchical clustering
  - 1.1.1 metric
  - 1.1.2 linkage criterion
2 scipy.cluster.hierarchy.dendrogram()
- 2.1 definitions
3 scipy.cluster.hierarchy.linkage()

resources

Phylogenetics (Wikipedia)
computational phylogenetics (Wiki)
PHYLIP (Wikipedia)
biopython phylo cookbook
biopython documentation unrooted trees
biopython docs talking about draw_graphviz()
NetworkX (Wikipedia) - Python library for studying graphs and networks. NetworkX is suitable for operation on large real-world graphs: e.g., graphs in excess of 10 million nodes and 100 million edges. Due to its dependence on a pure-Python "dictionary of dictionary" data structure, NetworkX is a reasonably efficient, very scalable, highly portable framework for network and social network analysis
Graphviz (Wikipedia) - (short for Graph Visualization Software) is a package of open-source tools initiated by AT&T Labs Research for drawing graphs specified in DOT language scripts.
DOT language (Wikipedia) - really cool!
Newick format (Wikipedia)
Distance matrices in phylogeny (Wikipedia)
Fitch and Margoliash (ftp.cse.sc.edu/bioinformatics/notes/020307patel.doc)
Clustering Example
least squares method

hierarchical clustering

Agglomerative - each observation starts in its own cluster, and pairs of clusters are merged step by step (bottom-up)
Divisive - all observations start in one cluster, which is split recursively (top-down)
Both approaches need a way to measure how dissimilar one cluster is from another

metric

a measure of distance between pairs of observations
- Euclidean distance
- squared Euclidean distance
- Mahalanobis distance = ${\sqrt {(a-b)^{\top }S^{-1}(a-b)}}$ where S is the covariance matrix
- cosine similarity = ${\frac {a\cdot b}{\|a\|\|b\|}}$; the corresponding cosine distance is one minus this value
- Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other
- Levenshtein distance = minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another
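
The metrics listed above can be tried out on small inputs. The following is a minimal sketch, not a reference implementation: the vectors `a` and `b`, the example strings, and the helper `levenshtein()` are made up for illustration, and the identity matrix stands in for the inverse covariance matrix $S^{-1}$. The numeric metrics come from `scipy.spatial.distance`; the two string metrics are written in plain Python.

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative vectors (assumed data, not from any real dataset)
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])

d_euclid = distance.euclidean(a, b)      # straight-line distance
d_sqeuclid = distance.sqeuclidean(a, b)  # squared Euclidean distance
d_cosine = distance.cosine(a, b)         # 1 - cosine similarity

# mahalanobis() expects the INVERSE covariance matrix as its third argument.
# The identity matrix is used here as a stand-in, which reduces it to Euclidean.
inv_cov = np.eye(3)
d_mahal = distance.mahalanobis(a, b, inv_cov)

# Hamming distance as a raw count of differing positions (equal-length strings)
s1, s2 = "karolin", "kathrin"
d_hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


d_lev = levenshtein("kitten", "sitting")
```

Note that `scipy.spatial.distance.hamming` returns the fraction of positions that differ rather than the count, which is why the count is computed by hand here.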

linkage criterion

specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets
- complete-linkage clustering - maximum, or farthest neighbor
- single-linkage clustering - minimum, or nearest neighbor
- Unweighted Pair Group Method with Arithmetic Mean (UPGMA) - mean distance between all pairs of observations in the two sets (average linkage)
- minimum energy clustering
- neighbor joining
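
A small sketch of how the first three criteria above change the merge heights. It assumes four made-up points on a line and uses the `method` names `"single"`, `"complete"`, and `"average"` of `scipy.cluster.hierarchy.linkage()`; as far as I know, minimum energy clustering and neighbor joining are not among the methods that function offers, so they are not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Four illustrative 1-D observations (assumed data)
X = np.array([[0.0], [1.0], [3.0], [7.0]])
y = pdist(X)  # condensed Euclidean distance matrix

Z_single = linkage(y, method="single")      # nearest neighbor
Z_complete = linkage(y, method="complete")  # farthest neighbor
Z_average = linkage(y, method="average")    # UPGMA

# Column 2 of a linkage matrix holds the distance at which each merge happens
heights_single = Z_single[:, 2]
heights_complete = Z_complete[:, 2]
heights_average = Z_average[:, 2]
```

All three agree on the first merge (the two closest points), then diverge: single linkage joins the last point at the smallest pairwise distance to the existing cluster, complete linkage at the largest, and average linkage at the mean.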

scipy.cluster.hierarchy.dendrogram()

definitions

cophenetic distance - a measure of how similar two objects have to be in order to be grouped into the same cluster; in a dendrogram, the height at which the two objects are first joined
cophenetic correlation coefficient - a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points
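monotonic - a hierarchy is monotonic when the merge distances never decrease from one merge to the next
linkage matrix - the (n-1) x 4 array returned by linkage(); each row records the two clusters merged, the distance between them, and the number of observations in the new cluster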
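
The terms above map onto functions in `scipy.cluster.hierarchy`. The sketch below uses five made-up 2-D points and average linkage (both are assumptions for illustration), and passes `no_plot=True` so that `dendrogram()` only returns its layout data and no plotting backend is needed.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, dendrogram, is_monotonic, linkage
from scipy.spatial.distance import pdist

# Five illustrative 2-D observations (assumed data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0], [10.0, 10.0]])
y = pdist(X)                       # original pairwise distances (condensed)
Z = linkage(y, method="average")   # linkage matrix, shape (n - 1, 4)

# Cophenetic correlation coefficient and the condensed cophenetic distances
c, coph_dists = cophenet(Z, y)

# True when merge distances never decrease from one row of Z to the next
monotone = is_monotonic(Z)

# Compute the dendrogram layout without drawing it
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["leaves"]        # observation indices, left to right
```

A value of `c` close to 1 suggests the dendrogram preserves the original pairwise distances well.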

scipy.cluster.hierarchy.linkage()

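takes as input a condensed distance matrix (the upper triangle of the distance matrix flattened into a 1-D array, as returned by scipy.spatial.distance.pdist()) or a 2-D array of observation vectors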
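
A minimal sketch of preparing that input when starting from a full square distance matrix. The matrix `D` is made up for illustration; `scipy.spatial.distance.squareform()` converts it to the flattened upper-triangle form.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Illustrative symmetric distance matrix with a zero diagonal (assumed data)
D = np.array([
    [0.0,  2.0, 6.0, 10.0],
    [2.0,  0.0, 5.0,  9.0],
    [6.0,  5.0, 0.0,  4.0],
    [10.0, 9.0, 4.0,  0.0],
])

# Flatten the upper triangle row by row: d(0,1), d(0,2), d(0,3), d(1,2), ...
y = squareform(D)

Z = linkage(y, method="single")
merge_heights = Z[:, 2]   # distance at which each merge happens
cluster_sizes = Z[:, 3]   # number of observations in each new cluster
```

Passing the square matrix `D` directly would not do the same thing: `linkage()` interprets a 2-D array as observation vectors, one per row, and computes distances between the rows.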
