Laurent Lesnard's home page

SEQCOMP, a sequence analysis Stata plug-in

Version 1.0. Available for Stata (version 9 and higher) on Mac (Intel and PPC) and Windows.

Wednesday 28 May 2008

This Stata plug-in implements a sequence analysis method which has been presented in a working paper and previously in an article published in the Electronic International Journal of Time Use Research, Vol. 1 No. 1, pp. 67-91.

The social sciences lack tools for sequence analysis. This paper presents the Stata plug-in I developed to implement a sequence analysis method that I devised to build a taxonomy of work schedules.

Warning! Prior to version 0.7, the plug-in did not implement exactly the formula proposed here [1]. Many thanks to Renzo Carriero, who pointed this out to me.

First version: 7 December 2006

A sequence comparison method based solely on substitution operations

Although this method can be seen as a particular case of Optimal Matching, it is only a distant relative, since only substitution operations are used. As a consequence, this method is only suitable for sequences of identical length. In a way, it is closer to the Hamming distance, which is usually considered the ancestor of the Levenshtein distance (OM). Hence, a possible name for this method could be “dynamic Hamming dissimilarity measure”.

Unlike in the Hamming distance, however, substitution costs are not equal to one unit: they are derived from the series of transition matrices which describe, between two consecutive episodes, the flows between the states considered in the analysis. More precisely, a sizable number of transitions between two states between t and t+1 means that these states are close in probabilistic terms: the chances of switching from one to the other are high. Conversely, if few transitions are observed between two states, these two states are distant.

Work schedules can be summarized by a two-state ("work" and "no work") process. At 9 AM, transitions between "work" and "no work" are presumably more frequent than at 9 PM; consequently, workers and non-workers will be considered close at 9 AM and very distant at 9 PM.
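The idea can be sketched in a few lines of Python (chosen only for illustration; the plug-in itself is written in C). The exact formula is given in the paper and is not reproduced on this page, so the form used here is an assumption: the cost of substituting one state for another at position t is taken to be 4 minus the sum of the cross-transition rates observed in the transitions into and out of t. The function names, the doubling of the single available transition at the two ends of the sequence, and the toy data are all hypothetical.

```python
from collections import Counter

def transition_rates(sequences, t, weights=None):
    """Rates p[(a, b)] of moving from state a at position t to state b at t+1.

    Optional weights count each sequence more or less heavily (assumption:
    this is how weights would enter the transition matrices)."""
    if weights is None:
        weights = [1.0] * len(sequences)
    pair_w = Counter()   # weighted count of each (from, to) pair
    from_w = Counter()   # weighted count of each origin state
    for seq, w in zip(sequences, weights):
        pair_w[(seq[t], seq[t + 1])] += w
        from_w[seq[t]] += w
    return {(a, b): w / from_w[a] for (a, b), w in pair_w.items()}

def substitution_cost(sequences, t, a, b, weights=None):
    """Assumed cost of substituting state a for state b at position t (0-based)."""
    if a == b:
        return 0.0
    last = len(sequences[0]) - 1
    # Transitions into t (starting at t-1) and out of t (starting at t).
    steps = [s for s in (t - 1, t) if 0 <= s < last]
    total = 0.0
    for s in steps:
        p = transition_rates(sequences, s, weights)
        total += p.get((a, b), 0.0) + p.get((b, a), 0.0)
    if len(steps) == 1:
        total *= 2  # assumption: at either end, count the only transition twice
    return 4.0 - total

# Toy work schedules: N = "no work", W = "work"; most people start work early.
schedules = ["NWW", "NWW", "NWW", "WWW", "NNN"]
early = substitution_cost(schedules, 0, "N", "W")  # many N -> W transitions
late = substitution_cost(schedules, 2, "N", "W")   # no transitions at all
```

With these toy schedules the early cost is lower than the late one, which mirrors the 9 AM versus 9 PM contrast described above: the more people switch between the two states at a given time, the cheaper the substitution.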

As with any sequence comparison method, the end result is a matrix containing the dissimilarity for every pair of sequences. A data reduction technique, such as cluster analysis or multidimensional scaling (MDS), is needed if these dissimilarities are to be exploited.
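Because only substitutions are allowed, the dissimilarity between two sequences reduces to a sum of position-specific costs over the positions where they differ. The following Python sketch illustrates this under stated assumptions: the cost function is passed in by the caller, and example_cost with its three values is purely hypothetical, not output from the plug-in.

```python
def dynamic_hamming(x, y, cost):
    """Sum of cost(t, a, b) over positions t where sequences x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have identical length")
    return sum(cost(t, a, b)
               for t, (a, b) in enumerate(zip(x, y)) if a != b)

def dissimilarity_matrix(sequences, cost):
    """Symmetric matrix of pairwise dissimilarities, with a zero diagonal."""
    n = len(sequences)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # lower triangle only, then mirror
            d[i][j] = d[j][i] = dynamic_hamming(sequences[i], sequences[j], cost)
    return d

def example_cost(t, a, b):
    """Hypothetical time-varying cost: cheap early, expensive late."""
    return [2.5, 3.25, 4.0][t]

sequences = ["NWW", "WWW", "NNN"]
d = dissimilarity_matrix(sequences, example_cost)

# With a constant unit cost the measure falls back to the plain Hamming distance.
hamming = dynamic_hamming("NWW", "NNN", lambda t, a, b: 1)
```

The number of pairs grows quadratically with the number of sequences, which is why memory use matters for this kind of analysis.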

Content of the zip file

A Stata plug-in is actually composed of two distinct files:

-  the plug-in itself, whose extension is simply plugin [2].
-  an ado file, here named seqcomp.ado, which serves as an interface to distseq.plugin

These two files must be unzipped into your personal ado folder, located somewhere on your computer. Once these two files are installed, the plug-in can be used through basic Stata syntax.

The first argument, varlist, is the list of variables that make up the sequences to be analyzed. The analysis can be restricted to certain sequences with the if qualifier, and weights can also be used [3].

Typical use is:

seqcomp episode1-episode100

The dissimilarities computed by the plug-in are available as a Stata dissimilarity matrix named dhamdist. Note that the size of this matrix does not depend on matsize and can therefore be well over 800 for Stata Intercooled users and well over 11,000 for Stata SE users. Returning the dissimilarities as a Stata matrix slows things down a little, so this feature can be disabled with the nodistmat option. In that case, the export option, which saves the results as a dissimilarity list, becomes compulsory (results have to be stored somewhere!). using is also compulsory when export is chosen, as it indicates where the results are to be stored. Note that the file path must end with the appropriate folder separator. For example

seqcomp episode1-episode100 using "C:\temp\", export nodistmat

will analyse all the sequences made of the variables episode1 to episode100 and will put the results in "C:\temp\". id() is optional but useful when export is chosen, as it helps match the internal ids used to compute dissimilarities with the original ids of the observations, if any.

Weights are taken into account when calculating the transition matrices but not for matching, which is by definition a one-to-one comparison. When weights are turned on, it is the users’ responsibility to use them again properly in the data reduction stage. Finally, it is possible to tell seqcomp which variable identifies observations: a file mapping this variable to the internal ids used will then be produced. The results consist of three files when the export option is chosen:

-  substitution.dat, which contains the series of the substitution cost matrices
-  distancelist.dat, which stores the dissimilarity matrix as a dissimilarity list with three columns: the first two columns hold the ids of each pair of sequences and the third column holds their dissimilarity.

2 1 x
3 1 x
3 2 x
4 2 x
1 3 x
...

-  idmapping.dat, made of two columns: the first one lists the internal ids of observations and the second gives their true id.
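For packages that expect a square matrix rather than a list, the three-column layout can be expanded as in the following Python sketch. It rests on assumptions not stated on this page: columns are separated by whitespace, ids are consecutive integers starting at 1, and each pair appears once. The function name read_dissimilarity_list and the sample values are hypothetical.

```python
def read_dissimilarity_list(lines):
    """Build a symmetric square matrix from 'id1 id2 dissimilarity' lines."""
    entries = []
    n = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        i, j, d = int(fields[0]), int(fields[1]), float(fields[2])
        entries.append((i, j, d))
        n = max(n, i, j)  # the largest id seen gives the matrix size
    matrix = [[0.0] * n for _ in range(n)]
    for i, j, d in entries:
        # ids are assumed to be 1-based; mirror each value across the diagonal
        matrix[i - 1][j - 1] = matrix[j - 1][i - 1] = d
    return matrix

# Hypothetical file content; with a real file, pass the open file object instead.
sample = ["2 1 2.5", "3 1 7.25", "3 2 9.75"]
matrix = read_dissimilarity_list(sample)
```

The internal ids in the first two columns can then be translated back to the original identifiers using the two columns of idmapping.dat.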

A dissimilarity list is the most efficient way of storing a dissimilarity matrix and is quite easy to use with standard statistical packages, in particular with the cluster package ClustanGraphics, which reads proximity lists without problem. Stata itself reads proximity lists but is restricted to small matrices [4]. Moreover, Stata is weak when it comes to cluster analysis: only a few (old) algorithms are available. SAS and ClustanGraphics are better in this field, but neither features the latest methods.

Why write a plug-in rather than a classical Stata ado file with Mata statements?

The principle of sequence analysis is quite simple, but it requires a lot of computer memory. Stata does not manage memory well for such procedures, and the only solution is to program these elements in C.

[1] Differences are likely to be minor but users are advised to check on their data.

[2] This extension hides a DLL.

[3] Since version 0.4, the keyword iw is used in place of aw: iw reflects the relative importance of observations (post-stratification, etc.), whereas aw is inversely proportional to some variance measure (and consequently has nothing to do with sampling considerations).

[4] The maximum matrix size is 800 for Stata Intercooled and 11,000 for Stata Special Edition (SE).