Scipy.cluster

From Colettapedia

resources

hierarchical clustering

  • Agglomerative - each observation starts in its own cluster, and pairs of clusters are merged step by step (bottom-up)
  • Divisive - all observations start in one cluster, which is split recursively (top-down)
  • Both approaches need a way to measure how dissimilar one cluster is from another
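
The agglomerative idea can be sketched in a few lines of plain Python. This is only an illustration, not how scipy implements it: `naive_agglomerative` is a hypothetical helper, the points are made-up 1-D values, and single linkage (closest pair of members) is assumed as the cluster dissimilarity.

```python
from itertools import combinations


def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    clusters = [[p] for p in points]   # every observation starts in its own cluster
    while len(clusters) > n_clusters:
        # Pick the pair of clusters whose closest members are nearest to each other.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(abs(a - b)
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]                           # j > i, so index i stays valid
    return clusters


result = naive_agglomerative([0.0, 1.0, 4.0, 6.0], 2)
print(result)
```

With these points the first merge joins 0 and 1, the second joins 4 and 6, leaving two clusters.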

metric

  • a measure of distance between pairs of observations
    • Euclidean distance
    • squared Euclidean distance
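    • Mahalanobis = $\sqrt{(a - b)^\top S^{-1} (a - b)}$, where S is the covariance matrix
    • cosine = $1 - \frac{a \cdot b}{\|a\| \, \|b\|}$ (the cosine distance as scipy computes it)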
    • Hamming distance = the number of positions at which two strings of equal length differ. Put another way, it is the minimum number of substitutions required to change one string into the other
    • Levenshtein distance = minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another
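
A rough sketch of how these metrics can be computed, assuming `scipy.spatial.distance` for the numeric ones. The data are made up. The two string helpers (`hamming_count`, `levenshtein`) are hypothetical and written here for illustration; scipy does not provide a Levenshtein function, and its own `hamming()` returns the fraction of differing positions rather than the count.

```python
import numpy as np
from scipy.spatial.distance import pdist, hamming

# Toy observations, one per row (made-up data).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# pdist returns one distance per pair, in the order (0,1), (0,2), (1,2).
d_euclidean = pdist(X, metric="euclidean")
d_sqeuclidean = pdist(X, metric="sqeuclidean")
d_cosine = pdist(X, metric="cosine")   # 1 - cosine of the angle between rows

# Mahalanobis needs the inverse of the covariance matrix S, passed as VI.
S = np.cov(X, rowvar=False)
d_mahalanobis = pdist(X, metric="mahalanobis", VI=np.linalg.inv(S))

# scipy's hamming() gives the fraction of positions that differ (2 of 4 here).
frac_differing = hamming([1, 0, 1, 1], [1, 1, 0, 1])


def hamming_count(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (a != b)))   # substitution (free if equal)
        prev = curr
    return prev[-1]
```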

linkage criterion

  • specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets
    • complete-linkage clustering - maximum distance, or farthest neighbor
    • single-linkage clustering - minimum distance, or nearest neighbor
    • Unweighted Pair Group Method with Arithmetic Mean (UPGMA) - average of all pairwise distances between the two clusters
    • minimum energy clustering
    • neighbor joining
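
To see how the criterion changes the result, the sketch below runs `scipy.cluster.hierarchy.linkage()` on the same made-up points with three of its `method` values: `"single"`, `"complete"`, and `"average"` (UPGMA). Minimum energy clustering and neighbor joining are not covered here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four made-up 1-D observations forming two obvious groups: {0, 1} and {4, 6}.
X = np.array([[0.0], [1.0], [4.0], [6.0]])

# The last row of each linkage matrix is the final merge of the two groups;
# column 2 holds the distance at which that merge happens.
final_merge = {
    method: float(linkage(X, method=method)[-1, 2])
    for method in ("single", "complete", "average")
}
print(final_merge)
```

The two groups are joined at 3.0 under single linkage (closest members 1 and 4), at 6.0 under complete linkage (farthest members 0 and 6), and at 4.5 under average linkage (mean of 4, 6, 3, and 5).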

scipy.cluster.hierarchy.dendrogram()
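
A minimal sketch of calling `dendrogram()` on a linkage matrix. The observations and labels are made up, and `no_plot=True` is used so the call only returns the layout dictionary instead of drawing with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0], [1.0], [4.0], [6.0]])   # made-up observations
Z = linkage(X, method="average")

# no_plot=True skips drawing and just returns the layout dictionary.
tree = dendrogram(Z, no_plot=True, labels=["a", "b", "c", "d"])

leaf_order = tree["ivl"]        # leaf labels, left to right
link_heights = tree["dcoord"]   # one list of four y-coordinates per merge
```

Dropping `no_plot=True` draws the tree on the current matplotlib axes.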

definitions

  • cophenetic distance - the dissimilarity level at which two objects are first grouped into the same cluster, i.e. the height of their join in the dendrogram
  • cophenetic correlation coefficient - a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points
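  • monotonic - a hierarchy in which merge distances never decrease from one merge to the next
  • linkage matrix - the (n-1) by 4 array returned by linkage(); each row records the two clusters merged at that step, the distance between them, and the number of original observations in the new cluster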
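
These terms map onto functions in `scipy.cluster.hierarchy`. The sketch below uses made-up data and average linkage (an arbitrary choice) to show the linkage matrix, the cophenetic distances and correlation from `cophenet()`, and the `is_monotonic()` check.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, is_monotonic
from scipy.spatial.distance import pdist

X = np.array([[0.0], [1.0], [4.0], [6.0]])   # made-up observations
Y = pdist(X)                                 # original pairwise distances
Z = linkage(Y, method="average")             # linkage matrix, shape (n-1, 4)

# Each row: [cluster index, cluster index, merge distance, size of new cluster]
for row in Z:
    print(row)

# c is the cophenetic correlation coefficient; coph_dists holds the
# cophenetic distance for every pair, in the same order as Y.
c, coph_dists = cophenet(Z, Y)

monotonic = is_monotonic(Z)   # True when merge distances never decrease
```

Here observations 0 and 1 have a cophenetic distance of 1.0 (they merge first), while any pair drawn from different groups has 4.5, the height of the final merge.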

scipy.cluster.hierarchy.linkage()

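  • takes as input a condensed distance matrix: the upper triangle of the pairwise distance matrix flattened into a 1-D array, as returned by scipy.spatial.distance.pdist(); a 2-D array of raw observations is also accepted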
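
A short sketch of the accepted input forms, using made-up observations and single linkage as an arbitrary choice. `squareform()` from `scipy.spatial.distance` converts between the condensed vector and the full square matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0], [1.0], [4.0], [6.0]])   # made-up observations

condensed = pdist(X)             # length n*(n-1)/2 = 6, upper triangle row by row
square = squareform(condensed)   # full symmetric 4x4 matrix with zero diagonal

Z_from_condensed = linkage(condensed, method="single")
Z_from_observations = linkage(X, method="single")   # distances computed internally

# A square matrix must be converted back before clustering; passed directly,
# linkage() would treat its rows as observation vectors.
Z_from_square = linkage(squareform(square), method="single")
```

All three calls should produce the same linkage matrix for this data.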