When and how to use weights for sequence analysis in social science?

Question

Weighting in sequence analysis

So far, I have found hardly any papers that address weighting in sequence analysis (using, for example, the optimal matching algorithm). Sequence analysis normally involves several steps:

1. setting or calculating substitution and insertion/deletion costs,
2. computing distance matrices, and
3. subsequent cluster analyses or discrepancy analyses[1].

At least the R package TraMineR (see Gabadinho et al. 2010 and Gabadinho et al. 2011, p. 11) and the Stata ado SEQCOMP by Laurent Lesnard make it possible to include weights at steps 1 and 3.
Furthermore, Lesnard explicitly recommends using sample weights for steps 1 and 3:

"Sample weights should only be used to calculate transition matrices, and consequently substitution costs. Instead of counting the number of transitions, it is simply the weighted number of transitions that should be taken into account. The matching procedure in itself, namely, the comparison of pair of sequences, does not require any weights; it is by definition a one to one procedure. However, sample weights should be turned on to interpret results, for instance, if cluster analysis is used, the size of the clusters obtained must be weighted."
Lesnard (2010: 415, endnote 12)
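To make the "weighted number of transitions" concrete, here is a minimal Python sketch with made-up toy sequences and weights. The function names are hypothetical and not part of TraMineR or SEQCOMP. It uses time-constant transition rates and the common cost definition 2 - p(a→b) - p(b→a); Lesnard's own proposal uses time-varying rates, so treat this as an illustration of the weighting idea only.

```python
from collections import defaultdict

def weighted_transition_rates(sequences, weights):
    """Weighted rate of a -> b: weighted count of a -> b transitions
    divided by the weighted count of all transitions leaving a."""
    trans = defaultdict(float)
    origin = defaultdict(float)
    for seq, w in zip(sequences, weights):
        for a, b in zip(seq, seq[1:]):
            trans[(a, b)] += w   # each transition counts w times, not once
            origin[a] += w
    return {pair: total / origin[pair[0]] for pair, total in trans.items()}

def substitution_costs(rates, states):
    """Assumed cost definition: 2 - p(a->b) - p(b->a), zero on the diagonal."""
    return {
        (a, b): 0.0 if a == b
        else 2.0 - rates.get((a, b), 0.0) - rates.get((b, a), 0.0)
        for a in states for b in states
    }

# Toy data (hypothetical): E = employed, U = unemployed.
sequences = ["EEUU", "EEEE", "UUEE"]
weights = [2.0, 1.0, 1.0]  # sample weights, one per sequence

rates = weighted_transition_rates(sequences, weights)
costs = substitution_costs(rates, "EU")
print(rates[("E", "U")], costs[("E", "U")])
```

With all weights set to 1, the same functions give the unweighted rates, which makes it easy to check how much the weights change the costs.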

Open questions

Nonetheless, there does not seem to be a consensus in the literature on when weights are needed or useful, or on which weights to use.

What do you think is the best rationale for applying weights in sequence analysis?
When should sequences be weighted?
Do you use cross-sectional sampling weights or longitudinal weights accounting for sampling probabilities as well as panel attrition?
How do you apply weights if you have unbalanced panel data?
The usage of weights in TraMineR is well documented; but do you have examples for the usage of weights with a Stata-ado?

References

Gabadinho, Alexis, Gilbert Ritschard, Matthias Studer and Nicolas S. Müller (2010): Mining sequence data in R with the TraMineR package: A user's guide, University of Geneva.
Gabadinho, Alexis, Gilbert Ritschard, Nicolas S. Müller and Matthias Studer (2011): Analyzing and visualizing state sequences in R with TraMineR, in: Journal of Statistical Software, Vol. 40, No. 4, pp. 1-37.
Lesnard, Laurent (2010): Setting Cost in Optimal Matching to Uncover Contemporaneous Socio-Temporal Patterns, in: Sociological Methods and Research, Vol. 38, No. 3, pp. 389-419.
Studer, Matthias, Gilbert Ritschard, Alexis Gabadinho and Nicolas S. Müller (2011): Discrepancy Analysis of State Sequences, in: Sociological Methods and Research, Vol. 40, No. 3, pp. 471-510.

[1] See Studer et al. (2011) for a presentation of discrepancy analysis, an ANOVA-like approach for distance matrices.

Just an observation: Since the non-weighted scenario is equivalent to setting all weights to 1, it is clear that by allowing different weights you will be working with a more complex model. This means that the model will be able to capture phenomena not captured by the original model. This will come at the cost of requiring more data to properly generalize or an increased chance of over-fitting. So, without knowing anything specific about the domain, I would suggest using a weighted version only if the unweighted one doesn't work for you. — Bitwise, Jun 18 '13 at 14:02

Matthias Studer · Accepted Answer · 2013-06-19T06:37:21.083

I assume that you are using sampling weights to correct for representativity bias. Please note that some "data providers" require you to use the weights in your publications.

In my opinion, you should always use weights for descriptive analysis in order to get unbiased results. I think that there is more consensus for this kind of analysis. Descriptive analysis includes, for instance, cluster analysis, sequence visualization, and the computation of transition rates (and hence of substitution costs based on them). For weighted cluster analysis, you can have a look at the WeightedCluster library and manual.
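As a small illustration of why weights matter at the descriptive stage, the following Python sketch compares weighted cluster shares with raw counts. The cluster labels, weights, and function name are made up for the example; this is not how the WeightedCluster library is called.

```python
from collections import defaultdict

def weighted_cluster_shares(labels, weights):
    """Share of each cluster in the weighted sample (sum of weights per cluster)."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    grand_total = sum(totals.values())
    return {label: total / grand_total for label, total in totals.items()}

# Hypothetical cluster assignment for four sequences and their sample weights.
labels = ["A", "A", "B", "B"]
weights = [1.0, 1.0, 3.0, 3.0]

shares = weighted_cluster_shares(labels, weights)
print(shares)
```

In this toy case both clusters contain half of the observed sequences, yet cluster B represents three quarters of the weighted population.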

Regarding the weights to use, I would recommend using longitudinal weights, since the sequences are defined for the whole period, but it depends on the exact weight definition. For a more general answer, you need to answer the following questions:

What sample do I have (at what time, and so on)?
To which population do I want to generalize?

In some panels, longitudinal weights use the sample defined by wave t and generalize it to the population at wave one. This is what you need if you want to follow the evolution of the wave-one population.

Which kind of weight would you use? Cross-sectional sampling weights from date/wave 1 or longitudinal weights for the whole observed period of time? — non-numeric_argument, Jun 18 '13 at 16:48
