clusGap {cluster} | R Documentation |
clusGap()
calculates a goodness of clustering measure, the
“gap” statistic. For each number of clusters k, it
compares \log(W(k)) with E^*[\log(W(k))] where the latter
is defined via bootstrapping, i.e. simulating from a reference
distribution.
maxSE(f, SE.f)
determines the location of the maximum
of f
, taking a “1-SE rule” into account for the
*SE*
methods. The default method looks for the first (local)
maximum which is at least 1 standard error larger than the previous
values. This corresponds to Tibshirani et al's recommendation of
determining the number of clusters from the gap statistics and their
standard deviations.
clusGap(x, FUNcluster, K.max, B = 100, verbose = interactive(), ...) maxSE(f, SE.f, method = c("firstSEmax", "Tibs2001SEmax", "globalSEmax", "firstmax", "globalmax"), SE.factor = 1) ## S3 method for class 'clusGap' print(x, method = "firstSEmax", SE.factor = 1, ...)
x |
numeric matrix or |
FUNcluster |
a |
K.max |
the maximum number of clusters to consider, must be at least two. |
B |
integer, number of Monte Carlo (“bootstrap”) samples. |
verbose |
integer or logical, determining if “progress” output should be printed. The default prints one bit per bootstrap sample. |
... |
optionally further arguments for |
f |
numeric vector of ‘function values’, of length K, whose (“1 SE respected”) maximum we want. |
SE.f |
numeric vector of length K of standard errors of |
method |
character string indicating how the “optimal” number of clusters, k^, is computed from the gap statistics (and their standard deviations), or more generally how the location k^ of the maximum of f[k] should be determined.
|
SE.factor |
[When |
The main result <res>$Tab[,"gap"]
of course is from
bootstrapping aka Monte Carlo simulation and hence random, or
equivalently, depending on the initial random seed (see
set.seed()
).
On the other hand, in our experience, using B = 500
gives
quite precise results such that the gap plot is basically unchanged
after an another run.
an object of S3 class "clusGap"
, basically a list with
components
Tab |
a matrix with |
n |
number of observations, i.e., |
B |
input |
FUNcluster |
input function |
This function is originally based on the functions gap
of
(Bioconductor) package SAGx by Per Broberg,
gapStat()
from former package SLmisc by Matthias Kohl
and ideas from gap()
and its methods of package lga by
Justin Harrington.
The current implementation is by Martin Maechler.
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.
Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.
Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. http://home.swipnet.se/pibroberg/expression_hemsida1.html
silhouette
for a much simpler less sophisticated
goodness of clustering measure.
cluster.stats()
in package fpc for
alternative measures.
### --- maxSE() methods ------------------------------------------- (mets <- eval(formals(maxSE)$method)) fk <- c(2,3,5,4,7,8,5,4) sk <- c(1,1,2,1,1,3,1,1)/2 ## use plot.clusGap(): plot(structure(class="clusGap", list(Tab = cbind(gap=fk, SE.sim=sk)))) ## Note that 'firstmax' and 'globalmax' are always at 3 and 6 : sapply(c(1/4, 1,2,4), function(SEf) sapply(mets, function(M) maxSE(fk, sk, method = M, SE.factor = SEf))) ### --- clusGap() ------------------------------------------------- ## ridiculously nicely separated clusters in 3 D : x <- rbind(matrix(rnorm(150, sd = 0.1), ncol = 3), matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 3), matrix(rnorm(150, mean = 2, sd = 0.1), ncol = 3), matrix(rnorm(150, mean = 3, sd = 0.1), ncol = 3)) ## Note we use B = 60 in the following examples to keep them "speedy". ## ---- rather keep the default B = 500 for your analysis! ## note we can pass 'nstart = 20' to kmeans() : gskmn <- clusGap(x, FUN = kmeans, nstart = 20, K.max = 8, B = 60) gskmn #-> its print() method plot(gskmn, main = "clusGap(., FUN = kmeans, n.start=20, B= 60)") set.seed(12) system.time( gsPam0 <- clusGap(x, FUN = pam, K.max = 8, B = 60) ) ## Slightly faster way to use pam: pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE)) set.seed(12) system.time( gsPam1 <- clusGap(x, FUN = pam1, K.max = 8, B = 60) ) ## and show that it gives the same: stopifnot(identical(gsPam1[-4], gsPam0[-4])) gsPam1 print(gsPam1, method="globalSEmax") print(gsPam1, method="globalmax") gs.pam.RU <- clusGap(ruspini, FUN = pam1, K.max = 8, B = 60) gs.pam.RU plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data") mtext("k = 4 is best .. and k = 5 pretty close") ## This takes a minute.. ## No clustering ==> k = 1 ("one cluster") should be optimal: Z <- matrix(rnorm(256*3), 256,3) gsP.Z <- clusGap(Z, FUN = pam1, K.max = 8, B = 200) plot(gsP.Z) gsP.Z