I have run a sequence analysis using the Optimal Matching algorithm. Afterwards, I clustered the resulting distance matrix using the Ward algorithm and calculated silhouettes, both as a measure of cluster quality and to identify representative sequences.
Now I am curious whether it is possible to estimate the sequences of the cluster centroids, which, to my knowledge, need not be original data points. How can I estimate the sequence of a centroid?
To get an idea of the different steps of the analysis, consider this example from the manual [1]:
library(TraMineR)
library(WeightedCluster)
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
## Define sequence objects
mvad.seq <- seqdef(mvad[, 17:86], alphabet = mvad.alphabet, states = mvad.scodes, labels = mvad.labels, weights = mvad$weight, xtstep = 6)
## Computing dissimilarities (Hamming distance in this example; method = "OM" would use Optimal Matching)
mvad.dist <- seqdist(mvad.seq, method="HAM", sm="CONSTANT")
## Clustering
## Note: hclust() renamed "ward" to "ward.D" in R >= 3.1.0; the result is the same
wardCluster <- hclust(as.dist(mvad.dist), method = "ward.D", members = mvad$weight)
clust4 <- cutree(wardCluster, k = 4)
## Silhouettes
sil <- wcSilhouetteObs(mvad.dist, clust4, weights = mvad$weight, measure = "ASWw")
## Sequence index plots ordered by representativeness
seqIplot(mvad.seq, group = clust4, sortv = sil)
In this example, it would be interesting to see whether the sequence of the third cluster's centroid differs from the most representative original sequences in that cluster, which are printed at the very top of the sequence index plot. In other cases, the centroid sequence may even have a more ideal-typical character: it does not exist in the original dataset but reflects certain typical structures.
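For comparison, here is a minimal sketch of how I would retrieve the most representative observed sequence (the medoid) of each cluster. It assumes that TraMineR's disscenter() with medoids.index = "first" returns one row index per group. It rebuilds the objects from the example above in compact form, and the name medoids is my own. By construction this only returns sequences that exist in the data, so it is the baseline I would like to compare a centroid against, not the centroid itself.

```r
library(TraMineR)

## Rebuild the sequence object, distances and 4-cluster solution from the example
data(mvad)
mvad.seq <- seqdef(mvad[, 17:86], weights = mvad$weight, xtstep = 6)
mvad.dist <- seqdist(mvad.seq, method = "HAM", sm = "CONSTANT")
wardCluster <- hclust(as.dist(mvad.dist), method = "ward.D", members = mvad$weight)
clust4 <- cutree(wardCluster, k = 4)

## Assumption: with medoids.index = "first", disscenter() returns the row index
## of the first most central (weighted) sequence within each cluster
medoids <- disscenter(mvad.dist, group = clust4, medoids.index = "first",
                      weights = mvad$weight)

## Show the medoid sequences in compact state-permanence (SPS) notation
print(mvad.seq[medoids, ], format = "SPS")
```

Each printed sequence is an actual row of mvad.seq, so none of them can have the purely ideal-typical character described above.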
[1] The example follows Studer, Matthias (2013). WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R. LIVES Working Papers, 24.