I have an order-processing dataset of 8 million rows with the following columns:

  • HistoryId - Identity column of the records
  • ItemId - Id of the item tracked
  • Previous status - status of Item before the history record created
  • New status - status of Item when history record created
  • Time till status change - the time elapsed between the item's previous history record and the creation of this history record.

I want to visualize these data with a graph / chart that displays the flow, similar to this:

Status 1 --> Status 2 --> Status 3 --> Status 4 --> Status 5
                                       Status 5
                          Status 4 --> Status 5
Status 2 --> Status 3 --> Status 5 --> Status 2 --> Status 4 --> Status 5
                                                    Status 5
                      --> Status 4 --> Status 5
             Status 4 --> Status 5

In the example above, assume that for all of the items:

  1. The first status is either Status 1 or Status 2.
  2. Every item that starts with Status 1 has Status 2 as its 2nd status, while items that start with Status 2 move on partly to Status 3 and partly to Status 4.
  3. The same logic as in (2.) continues, broken down by each flow.

I want to explore the data and visualize all possible flows in it, then summarize them by counting the number of items that go through each status of each flow. Currently, I am doing this manually. Is it possible to automate this and visualize it on a graph?

Sample Data:

HistoryID     ItemId     Previous_status     New_Status   
1             1          NA                  status_1     
2             2          NA                  status_2   
3             1          status_1            status_2   
4             1          status_3            status_4  
5             2          status_1            status_3   
6             1          status_4            status_5    
7             2          status_3            status_5    

And here is a sample output, even though it is not exactly what I want. Explanation:

  • X - the index of a status within an item's life cycle.
  • Y - the status name.
  • Size - the number of items that go through status Y at index X.
  • Ignore the color, as it was excluded from the data in the example.
  • As you can see, most items have the status Y as their beginning status (the first column on the left).
  • Then, moving to the right, they break down into other statuses (sometimes they come back to the beginning status, but at a later index).
  • The limitation is that it does not show the flows in detail, even though I can follow the flow through the change in point size over time.
  • What I really want is something like a decision tree, where you can see how items flow through the statuses and each flow is separate from the others. A sample output, though not exactly what I want.

I managed to use the riverplot package. However, I have some problems, as shown in the image below. Does anyone know how to display the labels at the side instead of right on the spots? I have about 30 statuses, and it is very confusing to have them displayed as in the image below.

For your reference, here is the link to the riverplot package tutorial


sinhnhn
  • It might just be me, but I can't discern if your sample data are supposed to correspond to the visualization concept as you've laid it out. Are the rows organized by `ItemId`? Is `Time_till_status_change` supposed to be reflected visually in any way? – Nick Stauner Jul 23 '14 at 07:00
  • The main point here is to see the general status change flow, for example from Status 1 -> ... -> Status 5, and how the status changes over time. The "Time_till_status_change" doesn't need to be visualized for each item, only as an average figure, e.g. the average time for Status 2 -> Status 3 is 2 – sinhnhn Jul 23 '14 at 07:05
  • Right, but what provides continuity to the "flow"? Are the rows organized by `ItemId`? I'm not clear on how you'd want to visualize those average times either...your example doesn't seem to include any such information. Please consider [editing](http://stats.stackexchange.com/posts/108974/edit) to clarify. – Nick Stauner Jul 23 '14 at 07:12
  • Added some clarification and removed the time, to keep the question focused on visualizing the flow of items' statuses. – sinhnhn Jul 23 '14 at 07:51
  • Do you want something like a Sankey diagram (see: http://stats.stackexchange.com/questions/26578/best-way-to-visualize-attrition-using-r)? – gung - Reinstate Monica Jul 23 '14 at 15:19
  • Thank you gung! From the articles you shared I found the package [riverplot](http://cran.r-project.org/web/packages/riverplot/index.html) by January, which looks interesting. I will try it out, see if it helps, and update this post later. Thank you very much – sinhnhn Jul 23 '14 at 15:54

1 Answer

Here is the answer to my own question. There are two packages to try out:

riverplot - this package is very flexible, but because of that flexibility it takes some time to learn its input format.

ggparallel - easier with regard to data format; the required format can easily be produced with the reshape2 package.
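
To illustrate the reshaping step, here is a minimal sketch in Python rather than R (so it is not the reshape2 API, only the same idea). It turns the long history table into one status sequence per item, pads the sequences into a wide layout with one column per step, and counts how many items follow each complete flow. The rows are the sample data from the question; ordering by HistoryID and using only New_Status are assumptions, and the helper names `item_sequences` and `to_wide` are made up for this example.

```python
from collections import Counter, defaultdict

# Rows copied from the question's sample data:
# (HistoryID, ItemId, Previous_status, New_Status); None stands for NA.
history = [
    (1, 1, None, "status_1"),
    (2, 2, None, "status_2"),
    (3, 1, "status_1", "status_2"),
    (4, 1, "status_3", "status_4"),
    (5, 2, "status_1", "status_3"),
    (6, 1, "status_4", "status_5"),
    (7, 2, "status_3", "status_5"),
]


def item_sequences(rows):
    """Return {ItemId: [status at step 1, status at step 2, ...]}.

    Assumption: HistoryID reflects chronological order, and New_Status
    alone is enough to rebuild each item's path.
    """
    sequences = defaultdict(list)
    for _history_id, item_id, _previous, new_status in sorted(rows, key=lambda r: r[0]):
        sequences[item_id].append(new_status)
    return dict(sequences)


def to_wide(sequences, fill=None):
    """Pad every sequence to the same length: one column per step index."""
    width = max((len(seq) for seq in sequences.values()), default=0)
    return {item: seq + [fill] * (width - len(seq)) for item, seq in sequences.items()}


sequences = item_sequences(history)
wide = to_wide(sequences)

# Number of items that follow each complete flow.
flow_counts = Counter(tuple(seq) for seq in sequences.values())

for flow, n_items in flow_counts.most_common():
    print(" --> ".join(flow), ":", n_items)
```

With 8 million rows the same grouping would more likely be done in a database or a data-frame library; the sketch only shows the shape of the result.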

Recently I learnt about d3.js, and I think it is a very good tool for data visualization, including for this purpose. Sample Sankey chart drawn in d3.js
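
A Sankey chart needs the per-item sequences aggregated into nodes and weighted links. The sketch below is one possible way to do that in Python, under stated assumptions: the input sequences are hypothetical (item 3 is invented so that one link carries more than one item), the function name `sankey_data` is made up, and the output uses the `nodes` / `links` layout with `source`, `target` and `value` keys that d3 Sankey examples commonly read. Check the input format of the plugin version you actually use.

```python
import json
from collections import Counter

# Hypothetical per-item status sequences (item 3 is invented for illustration).
sequences = {
    1: ["status_1", "status_2", "status_4", "status_5"],
    2: ["status_2", "status_3", "status_5"],
    3: ["status_2", "status_3", "status_5"],
}


def sankey_data(sequences):
    """Aggregate sequences into Sankey nodes and links.

    A node is a (step index, status) pair, so a status reached at step 2
    is kept apart from the same status reached at step 4.
    """
    link_counts = Counter()
    for seq in sequences.values():
        # Pair each status with its successor; steps are numbered from 1.
        for step, (src, dst) in enumerate(zip(seq, seq[1:]), start=1):
            link_counts[((step, src), (step + 1, dst))] += 1

    node_keys = sorted({key for link in link_counts for key in link})
    index = {key: i for i, key in enumerate(node_keys)}

    nodes = [{"name": f"{status} (step {step})"} for step, status in node_keys]
    links = [
        {"source": index[src], "target": index[dst], "value": n_items}
        for (src, dst), n_items in sorted(link_counts.items())
    ]
    return {"nodes": nodes, "links": links}


data = sankey_data(sequences)
print(json.dumps(data, indent=2))
```

Keying nodes by step merges flows that reach the same status at the same step. To keep every flow fully separate, as in a decision tree, the node key could be the whole prefix of statuses up to that step instead.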

sinhnhn