
I have a CSV file, file1.csv, in the following format:

serialNo, timestamp, visits, confirm, timeSec
1,  1:55:40, 3, 0, 198
2,  7:42:56, 2, 1, 102
3,  13:20:32, 3, 0, 181
4,  15:26:56, 0, 1, 101
5,  10:36:46, 1, 0, 198

timestamp is the timestamp, visits is the number of visits to a website, timeSec is the time spent in seconds, and confirm is an ordinal variable containing a 0/1 value. I have imported this into a pandas DataFrame.

I wish to see if there is any connection between
a) confirm and visits
b) confirm and timeSec
c) confirm and timestamp - e.g. whether there is a greater chance of a confirm=1 value between 2 time intervals.

I realize that there is a method in pandas to find a correlation

data['confirm'].corr(sessionData['visits'])

which uses the Pearson correlation by default and evaluates to -0.04981167717341486,
while data['confirm'].corr(sessionData['timeSec']) evaluates to 0.010440316272189443.

My question is -
Is Pearson correlation the correct inferential statistics tool to use in all three cases (a, b and c)? Also, what other strategies can I use to find a connection as described in a, b and c?

  • Welcome to CV; a bit more info would help us answer. You state you want to determine if there is a connection, and for us to give you the most suitable advice we'll need to make sure we understand what this means to you. What do you want to determine by "connection"? Do you need to build a model to predict the likelihood of confirm being true? Do you want to know which variables are statistically significant for a confirm event? Do you want to quantify the size of the interaction between the variables and confirm? If you don't know, perhaps share a bit more about the aims of your investigation. – ReneBt Oct 19 '18 at 08:25
  • @ReneBt - Hi Rene, thanks for your comment. The aim is to find a connection between the variables as I'd mentioned earlier. e.g. whether a higher time spent value directly corresponds to a confirm=1. Or a high correlation value between two variables suggests something. I'm not exactly sure how to go about this. Your input will be valuable. – inquisitiveProgrammer Oct 19 '18 at 09:07

1 Answer


You should probably start by visualizing the data. For confirm (0/1) and timeSec, you could make parallel boxplots, violin plots or dot charts, and then test, for instance, the equality of means of timeSec in the two groups with a t-test or a Wilcoxon test.
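As a rough illustration of the group comparison described above, here is a minimal Python sketch using only the standard library. It computes Welch's t statistic (the unequal-variance form of the t-test) for timeSec in the two confirm groups. The rows are just the five sample rows shown in the question, far too few for real inference, and the helper name welch_t is made up for this example.

```python
from math import sqrt
from statistics import mean, variance

# (confirm, timeSec) pairs copied from the five sample rows in the question.
# Illustration only: five rows are far too few for a meaningful test.
rows = [(0, 198), (1, 102), (0, 181), (1, 101), (0, 198)]


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom.

    Hypothetical helper for this sketch; it does not compute a p-value.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Split timeSec by the value of confirm.
time_0 = [t for c, t in rows if c == 0]
time_1 = [t for c, t in rows if c == 1]

t_stat, df = welch_t(time_0, time_1)
print(f"mean timeSec: confirm=0 -> {mean(time_0):.1f}, confirm=1 -> {mean(time_1):.1f}")
print(f"Welch t = {t_stat:.2f}, approx. df = {df:.2f}")
```

If SciPy is available, scipy.stats.ttest_ind(time_0, time_1, equal_var=False) should give the same statistic together with a p-value, and scipy.stats.mannwhitneyu is the rank-based (Wilcoxon rank-sum) alternative; neither requires the two groups to be the same size.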

A Pearson correlation is not the most informative statistic: see my answer here: Correlations between continuous and categorical (nominal) variables.

For confirm and visits I suggest much the same. Alternatively, as visits is count data, you could fit a Poisson model for visits in the two groups and compare the means. If you want better advice, maybe augment your post with some plots.

Nick Cox
kjetil b halvorsen
  • Thanks for your answer Kjetil. I have tested the hypothesis I had with t-tests. Wilcoxon test wasn't an option, since the size of the two groups is unequal. Also, I've modified the question for a connection between confirm and timestamp (c), is a t-test a good way to go about it? – inquisitiveProgrammer Oct 20 '18 at 05:23
  • Well, a t-test could work, but probably isn't the best. Some permutation test would be better. Can you show plots or link to data? – kjetil b halvorsen Oct 20 '18 at 08:02
  • I did try using the t-test, however the code errored out, because it was a datetime.time variable, for which a mean calculation isn't possible. Also, I'm not sure I can share the dataset. If you could recommend a few functions, I can try them out to see which fits best. Thanks! – inquisitiveProgrammer Oct 20 '18 at 12:33
  • I'm not using Python so cannot recommend a function. In R the mean of date/time variables is defined (but not the sum). But you could convert to numeric anyhow. – kjetil b halvorsen Oct 20 '18 at 13:22
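The two suggestions in these comments (convert the time to a number, then use a permutation test) might be sketched in Python as follows. This is a minimal sketch, assuming the timestamp values are datetime.time objects as described in the comments. The function names, the number of permutations and the seed are arbitrary choices for the example, and the rows are only the five sample rows from the question.

```python
import random
from datetime import time
from statistics import mean


def seconds_since_midnight(t):
    """Convert a datetime.time to a plain number so that means are defined."""
    return t.hour * 3600 + t.minute * 60 + t.second


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Hypothetical helper: group labels are shuffled n_perm times and we count
    how often the shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# (timestamp, confirm) pairs copied from the sample rows in the question.
rows = [
    (time(1, 55, 40), 0),
    (time(7, 42, 56), 1),
    (time(13, 20, 32), 0),
    (time(15, 26, 56), 1),
    (time(10, 36, 46), 0),
]

secs_0 = [seconds_since_midnight(t) for t, c in rows if c == 0]
secs_1 = [seconds_since_midnight(t) for t, c in rows if c == 1]

p_value = permutation_test(secs_0, secs_1)
print(f"permutation p-value for mean time of day: {p_value:.3f}")
```

With a pandas column of datetime.time objects, the same conversion could presumably be applied with data['timestamp'].map(seconds_since_midnight) before splitting by confirm. The identical permutation_test call also works for timeSec or visits, since it only needs two lists of numbers.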