I have data that can be called demographic data.
Raw data
Person 0001
\begin{array}{|c|c|} \hline Feb\,1981- Apr\,85 & engaged\,\,in\,\,\underline{activity}\,\,\textit{A}\,\,of \,\,\underline{type}\,\,\textbf{square}\,\\ \hline Apr\,1985- July\,86 & engaged\,\,in\,\,\underline{activity}\,\,\textbf{$x_1$}\,\,at \,\,\underline{location}\,\,\textbf{beta}\,\,of \,\,\underline{kind}\,\,\textbf{red}\,\,\\ \hline July\,1986- Nov\,87 & engaged\,\,in\,\,\underline{activity}\,\,\textbf{$x_2$}\,\,at \,\,\underline{location}\,\,\textbf{beta}\,\,of \,\,\underline{kind}\,\,\textbf{red}\,\,\\ \hline Nov\,1987- Apr\,88 & engaged\,\,in\,\,\underline{activity}\,\,\textbf{$x_3$}\,\,at \,\,\underline{location}\,\,\textbf{beta}\,\,of \,\,\underline{kind}\,\,\textbf{red}\,\,\\ \hline Apr\,1988- June\,91 & engaged\,\,in\,\,\underline{activity}\,\,\textbf{$y_1$}\,\,at \,\,\underline{location}\,\,\textbf{gamma}\,\,of \,\,\underline{kind}\,\,\textbf{red}\,\,\\ \hline June\,1991- Sep\,92 & engaged\,\,in\,\,\underline{activity}\,\,\textbf{$y_2$}\,\,at \,\,\underline{location}\,\,\textbf{gamma}\,\,of \,\,\underline{kind}\,\,\textbf{red}\,\,\\ \hline ...\,....- ....\,.... &............\\ \hline Present\,\,time & engaged\,\,in\,\,\underline{activity}\,\,\textbf{$z_1$}\,\,at \,\,\underline{location}\,\,\textbf{kappa}\,\,of \,\,\underline{kind}\,\,\textbf{red}\,\,\\ \hline \end{array}
- The data is available for many thousands of persons.
- The start date is different for each person.
- The first activity, Activity A (which can itself be a series of chronological activities), is essentially different from the other activities.
- Activity A has a bearing on how the subsequent activities change.
- Activity A, the type and the kind can each take dozens of categorical values, each drawn from its own set.
- The subsequent activities $x_i, y_i, z_i, \ldots$ can take thousands of categorical values, all from the same set.
- The data for each person can be assumed to be iid:
  - that is, the data for person 1, person 2, person 3, ... arise from the same random process.
- The data within an individual person, however, is interdependent:
  - The value of activity $x_2$ depends on $x_1$ and on Activity A; $x_1$ is in turn influenced by Activity A, and so on.
  - That is to say, the process is not first-order Markovian.
Desired Outcome
While I would like to predict both when the location and the activity change,
- Predicting location change is more important at the moment.
Ideally the outcome will be in the form of probabilities:
- Given that in Sep 92 the activity was $y_2$, what is the probability that it will still be $y_2$ in Oct, Nov, Dec, ...?
- If the activity changes, can we predict what it will be?
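The first kind of probability can be sketched as a discrete-time survival estimate. This is only a minimal illustration, assuming each stay in a state has been reduced to a `(duration_in_months, ended)` pair (a hypothetical representation, not part of the data above), and it pools all persons without conditioning on Activity A, type or kind:

```python
from collections import Counter

def survival_curve(spells, horizon):
    """Discrete-time Kaplan-Meier estimate of P(a spell lasts more than k months).

    spells: list of (duration_in_months, ended) pairs; ended=False marks a
    spell that was still ongoing when observation stopped (right-censored).
    Returns [S(1), S(2), ...] up to `horizon`, or shorter if the data run out.
    """
    ended_at = Counter(d for d, ended in spells if ended)
    curve, surviving = [], 1.0
    for k in range(1, horizon + 1):
        # Spells that were observed at least through month k.
        at_risk = sum(1 for d, _ in spells if d >= k)
        if at_risk == 0:
            break  # nothing left to estimate from
        hazard = ended_at[k] / at_risk  # P(ends in month k | reached month k)
        surviving *= 1.0 - hazard
        curve.append(surviving)
    return curve

def prob_still_there(curve, elapsed, ahead):
    """P(still in the spell `ahead` months from now | it has lasted `elapsed` months).

    Assumes elapsed + ahead <= len(curve) and that S(elapsed) > 0.
    """
    def s(k):
        return 1.0 if k == 0 else curve[k - 1]
    return s(elapsed + ahead) / s(elapsed)

# Toy data: four spells, one of them still ongoing after 3 months.
toy_spells = [(2, True), (3, True), (3, False), (5, True)]
toy_curve = survival_curve(toy_spells, horizon=5)
```

Calling `prob_still_there(toy_curve, elapsed=2, ahead=1)` then answers "it has been in this state for 2 months, what is the chance it is still there next month" for the toy data.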
Training
I want to train a model on the data from all the many thousands of persons and then make predictions on new data from a new person.
Solution Proposed
Index the data by time in the following manner:
Let January 1980, $[m_{1980.1}]$, be the arbitrary starting point for all the data.
\begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline & .... & m_{1981.2} & m_{1981.3} & ... & m_{1985.4} & m_{1985.5} & ... & m_{1986.7} & .... \\ \hline person\,\,0001 & ... & A\,\,of\,\,type\,\,square & A\,\,of\,\,type\,\,square & ... & x_1\,\,at\,\,location\,\,beta\,\,of\,\,kind\,\,red & x_1\,\,at\,\,location\,\,beta\,\,of\,\,kind\,\,red & ... & x_2\,\,at\,\,location\,\,beta\,\,of\,\,kind\,\,red & ... \\ \hline \end{array}
- This turns it into time-indexed, ordered, sequential data.
- While the sequences (one per person) can be assumed to be iid,
- the sub-sequences within a sequence very clearly are not iid.

The problem then becomes one of:
- training on thousands of sequences
- predicting up to the next dozen or more sub-sequences of new sequences.
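The monthly indexing described above can be sketched as follows. This is a minimal illustration, assuming each raw record is a `(start, end, state)` triple with `(year, month)` tuples, and assuming the end month belongs to the next record (the raw data lists the boundary month in both rows). The names `month_index` and `expand_spells` are made up for this example:

```python
def month_index(year, month, origin=(1980, 1)):
    """Months elapsed since the origin month, so January 1980 maps to 0."""
    return (year - origin[0]) * 12 + (month - origin[1])

def expand_spells(spells):
    """Turn (start, end, state) intervals into one (month_index, state) pair per month.

    The end month is treated as exclusive (assumption): it is also the start
    month of the following record, so it is emitted only once.
    """
    sequence = []
    for start, end, state in spells:
        for m in range(month_index(*start), month_index(*end)):
            sequence.append((m, state))
    return sequence

# First three records of person 0001 from the raw data table.
person_0001 = [
    ((1981, 2), (1985, 4), ("A", "square")),
    ((1985, 4), (1986, 7), ("x1", "beta", "red")),
    ((1986, 7), (1987, 11), ("x2", "beta", "red")),
]
seq = expand_spells(person_0001)
```

Each person then becomes a list of monthly states whose length depends on that person's start date, which is where the variable-length issue comes from.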
Further comments:
- The very first activity A (or series of activities in other cases) is in a way different from the subsequent activities $x_i, y_i, \ldots$.
- Similarly, type, location and kind are different.
- At the moment the intention is to model them as part of the sub-sequence. Is there a different way? For instance, since activity A occurs only at the outset, maybe we can include it as a different kind of parameter?
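One possible reading of "a different kind of parameter" is to summarise the initial phase once per person and keep only the later months as the sequence. The sketch below is an assumption about how that split could look, not an established recipe; `split_static_dynamic` and the `is_initial` predicate are hypothetical names:

```python
def split_static_dynamic(monthly_states, is_initial):
    """Separate the leading initial-activity months from the rest of the sequence.

    Returns (static, dynamic): `static` summarises the initial phase once per
    person, `dynamic` is the remaining month-by-month sequence.
    """
    n_initial = 0
    while n_initial < len(monthly_states) and is_initial(monthly_states[n_initial]):
        n_initial += 1
    initial = monthly_states[:n_initial]
    static = {
        # Distinct initial states in order of first appearance.
        "initial_states": list(dict.fromkeys(initial)),
        "initial_months": n_initial,
    }
    return static, monthly_states[n_initial:]

# Toy sequence: three months of activity A, then two months of x1.
states = [("A", "square")] * 3 + [("x1", "beta", "red")] * 2
static, dynamic = split_static_dynamic(states, lambda s: s[0] == "A")
```

The `static` part could then be fed to a model as per-person covariates, while `dynamic` stays a sequence.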
Algorithms
- With the data modelled this way, the algorithms that look most suitable are the ones used in:
  - PoS tagging in NLP
  - predicting the next word in NLP: what will the next word be, given the previous sequence of words?
  - Object detection: where will the object move next, given its history?
- The following are the algorithms that I have been able to research. The application above is quite novel, so I seek help on how to adapt them for my purpose.
  - Conditional random fields: they permit dependence between sub-sequences within the sequence, but I haven't seen any practical implementation in this area.
  - n-order Markov model with n = number of months from the arbitrary starting point. I couldn't find any example.
  - Kalman filters: again, I couldn't find any practical example outside of signal processing.
  - Anomaly detection: in most months the location remains the same, so a change can be considered anomalous, although the system then remains in that anomalous state in the future!
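To make the Markov option concrete, here is a minimal count-based sketch of a fixed-order Markov model over monthly states. It uses a small fixed `order` (an assumption for illustration) rather than n equal to the full history length, and the function names are made up for this example:

```python
from collections import Counter, defaultdict

def fit_markov(sequences, order):
    """Count next-state frequencies for every context of `order` previous states.

    Assumes order >= 1. Sequences may have different lengths; a sequence
    shorter than or equal to `order` simply contributes no counts.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1
    return counts

def next_state_probs(counts, history, order):
    """Relative frequencies of the next state given the last `order` states."""
    seen = counts.get(tuple(history[-order:]))
    if not seen:
        return {}  # this context never occurred in the training sequences
    total = sum(seen.values())
    return {state: n / total for state, n in seen.items()}

# Toy monthly location sequences for two persons.
training = [
    ["beta", "beta", "gamma"],
    ["beta", "beta", "beta", "gamma", "gamma"],
]
model = fit_markov(training, order=2)
probs = next_state_probs(model, ["beta", "beta"], order=2)
```

With thousands of possible activity values, most contexts will be unseen at higher orders, so in practice some smoothing or back-off to shorter contexts would be needed; that part is not shown here.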
Requested help:
- Is this modelling of the data fit for the purpose?
- Do the proposed algorithms serve the purpose? In particular:
  - Do they suitably deal with the variable length of the sequences?