Weighting in sequence analysis
So far, I have found few papers that address weighting in sequence analysis (for example, when using the optimal matching algorithm). Sequence analysis normally involves several steps:
- setting or calculation of substitution and insertion/deletion costs,
- computation of distance matrices and
- subsequent cluster analysis or discrepancy analysis[1].
At least the R package TraMineR (see Gabadinho et al. 2010 and Gabadinho et al. 2011, p. 11) and the Stata-ado SEQCOMP by Laurent Lesnard make it possible to include weights at steps 1 and 3.
Furthermore, Lesnard explicitly recommends using sample weights for steps 1 and 3:
"Sample weights should only be used to calculate transition matrices, and consequently substitution costs. Instead of counting the number of transitions, it is simply the weighted number of transitions that should be taken into account. The matching procedure in itself, namely, the comparison of pair of sequences, does not require any weights; it is by definition a one to one procedure. However, sample weights should be turned on to interpret results, for instance, if cluster analysis is used, the size of the clusters obtained must be weighted."
Lesnard (2010: 415, endnote 12)
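To make the quoted recommendation concrete, here is a minimal Python sketch of where the weights would enter. It is an illustration under stated assumptions, not Lesnard's or TraMineR's implementation: the toy sequences, weights and cluster labels are invented, all function names are hypothetical, and the substitution cost uses the common time-constant transition-rate formula 2 - p(a->b) - p(b->a), whereas Lesnard (2010) works with time-varying transition rates.

```python
# Toy data (assumed for illustration): three sequences over the states A and B,
# each with a sample weight and a cluster label from some earlier clustering.
sequences = ["AAB", "ABB", "BBA"]
weights = [1.0, 2.0, 1.0]
labels = [0, 0, 1]
states = ["A", "B"]


def weighted_transition_rates(sequences, weights, states):
    """Step 1: sum the weights of transitions instead of counting them."""
    counts = {a: {b: 0.0 for b in states} for a in states}
    for seq, w in zip(sequences, weights):
        # Each pair of adjacent positions is one transition a -> b.
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += w
    rates = {}
    for a in states:
        total = sum(counts[a].values())
        rates[a] = {
            b: (counts[a][b] / total if total > 0 else 0.0) for b in states
        }
    return rates


def substitution_costs(rates, states):
    """Transition-rate based costs: 2 - p(a->b) - p(b->a), zero on the diagonal."""
    return {
        a: {
            b: 0.0 if a == b else 2.0 - rates[a][b] - rates[b][a]
            for b in states
        }
        for a in states
    }


def weighted_cluster_shares(labels, weights):
    """Step 3: cluster sizes as shares of the total weight, not of the case count."""
    totals = {}
    for lab, w in zip(labels, weights):
        totals[lab] = totals.get(lab, 0.0) + w
    grand_total = sum(totals.values())
    return {lab: t / grand_total for lab, t in totals.items()}


rates = weighted_transition_rates(sequences, weights, states)
costs = substitution_costs(rates, states)
shares = weighted_cluster_shares(labels, weights)
```

The pairwise distance computation (step 2) is deliberately absent from the sketch: following the quote, weights only affect the costs that go into the matching and the interpretation of the resulting clusters.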
Open questions
Nonetheless, there does not seem to be a consensus in the literature on when weights are needed or useful, or on which weights to use.
- What do you think is the best rationale for applying weights in sequence analysis?
- When should sequences be weighted?
- Do you use cross-sectional sampling weights or longitudinal weights accounting for sampling probabilities as well as panel attrition?
- How do you apply weights if you have unbalanced panel data?
- The use of weights in TraMineR is well documented, but do you have examples of using weights with a Stata-ado?
References
- Gabadinho, Alexis, Gilbert Ritschard, Matthias Studer and Nicolas S. Müller (2010): Mining sequence data in R with the TraMineR package: A user's guide, University of Geneva.
- Gabadinho, Alexis, Gilbert Ritschard, Nicolas S. Müller and Matthias Studer (2011): Analyzing and visualizing state sequences in R with TraMineR, in: Journal of Statistical Software, Vol. 40, No. 4, pp. 1-37.
- Lesnard, Laurent (2010): Setting Cost in Optimal Matching to Uncover Contemporaneous Socio-Temporal Patterns, in: Sociological Methods and Research, Vol. 38, No. 3, pp. 389-419.
- Studer, Matthias, Gilbert Ritschard, Alexis Gabadinho and Nicolas S. Müller (2011): Discrepancy Analysis of State Sequences, in: Sociological Methods and Research, Vol. 40, No. 3, pp. 471-510.
[1] See Studer et al. (2011) for a presentation of discrepancy analysis, an ANOVA-like approach for distance matrices.