Should I be running R to find the significance with such a large data set and if so how to obtain it for each category

Question

I have a very large data set with over 30 million records. My CSV file looks like this:

State  PointOfOrigin   DaysToInHome
CA     A               11.2
CA     A               10.8
CA     B               15.3
TX     B               10.5
TX     A               10.6

I am a noob to stats (and Stats.StackExchange) and I am trying to find out, using R, whether there is a statistically significant difference in DaysToInHome between the two PointOfOrigin values for each State (50 different values). I have already calculated the averages in Python for the general team, but we have a leader who always wants to see the hard stats. My junior stats experience leads me to believe that with such a large data set the significance tests are not very relevant (since the p-value will more than likely be close to zero regardless), and that we can rely on the calculated averages to decide whether there is any practically significant difference in DaysToInHome between the PointOfOrigin values for each State.

So I basically have two questions:

1) Is my assumption correct that with such large data sets the stats are not helpful and I can confidently rely on the averages to make an intelligent and informed business decision, or should I dig deeper and find the statistical significance to CYA?

2) If the answer to #1 is that my assumption is wrong and I should know the stats, how do I go about it?

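I have already run an ANOVA in R:

# Read the CSV (path shortened)
InHomeData <- read.csv("C:/...Input.csv")
# Two-way ANOVA with interaction; data= tells aov() where to find the columns
aov.InHomeData <- aov(DaysToInHome ~ PointOfOrigin * State, data = InHomeData)
is(aov.InHomeData)       # classes of the fitted object
summary(aov.InHomeData)  # ANOVA table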

This provides:

                            Df Sum Sq Mean Sq F value Pr(>F)    
PointOfOrigin                1    952   952.3  134.30 <2e-16 ***
State                       36  65907  1830.7  258.20 <2e-16 ***
PointOfOrigin:State         12   6724   560.3   79.02 <2e-16 ***
Residuals                 9950  70550     7.1

But I am looking for a statistic that gives some guidance on whether the average DaysToInHome differs significantly between the two PointOfOrigin values within the State of TX, and likewise for each of the other 49 states, since a decision will be made for each state based on the data.

Someone mentioned I should bootstrap the ANOVA, but I cannot figure out how to do that. I am also not sure it is worth it.

Just to add some context: a few years back a team brought PointOfOrigin B online to increase speed to market, and we are not sure they did their homework first. For example, if the average DaysToInHome for TX differs by less than X (say 1.1) between the two PointOfOrigin values, we might reconsider having Point B process any volume. But if the difference is greater than Y (say 1.1), we might keep Point B processing some volume.

The answer to your Q1 is yes exactly as you stated in your reasoning. — mdewey, Oct 27 '17 at 15:18

score 1 · Accepted Answer · edited Jun 11 '20 at 14:32

Your assumption is (IMO) correct. With large data sets, statistical significance gets less and less informative, because tiny differences will become statistically significant if your sample size is just large enough. And there will always be tiny differences, because a (point) null hypothesis is never correct.

The claim that a point null hypothesis is never correct is where people may have different opinions. See "Are large data sets inappropriate for hypothesis testing?" for a dynamic exchange.
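
The point about tiny differences becoming statistically significant once the sample is large enough can be illustrated with a small simulation. This is only a sketch on made-up data: the true difference of 0.02 days, the standard deviation of 2.7, and the sample sizes are arbitrary assumptions, not values from the question.

```r
# Simulated data only: two groups whose true means differ by 0.02 days.
# sd = 2.7 and the sample sizes are illustrative assumptions.
set.seed(42)
true_diff <- 0.02

p_value_for_n <- function(n) {
  a <- rnorm(n, mean = 10, sd = 2.7)
  b <- rnorm(n, mean = 10 + true_diff, sd = 2.7)
  t.test(a, b)$p.value  # Welch two-sample t-test (R's default)
}

sizes <- c(1e3, 1e5, 2e6)
p_values <- sapply(sizes, p_value_for_n)
print(data.frame(n_per_group = sizes, p_value = signif(p_values, 3)))
```

The difference being tested is the same in every row; only the sample size changes. At the largest size the test flags a gap of about half an hour as highly significant, which says nothing about whether half an hour matters to the business.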
To address your leader's concerns, I would recommend plotting your data (e.g., using density or bean plots, using a subsample if necessary). Take a look at variability within vs. between PointOfOrigin values. You may find that differences are statistically significant but don't matter from a business perspective. This latter concept is called "clinical significance" in medicine, and it differs from statistical significance.
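
The plotting suggestion could look roughly like the sketch below. It uses a simulated stand-in for the real file: the column names follow the question, but the group means, the spread, and the subsample size of 20,000 are assumptions made purely for illustration.

```r
# Simulated stand-in for the real data; all numbers are made up.
set.seed(1)
n <- 200000
InHomeData <- data.frame(
  PointOfOrigin = sample(c("A", "B"), n, replace = TRUE)
)
InHomeData$DaysToInHome <- rnorm(
  n,
  mean = ifelse(InHomeData$PointOfOrigin == "A", 10.8, 11.1),
  sd = 2.7
)

# Draw from a random subsample so plotting stays fast on millions of rows.
sub <- InHomeData[sample(nrow(InHomeData), 20000), ]
dens_a <- density(sub$DaysToInHome[sub$PointOfOrigin == "A"])
dens_b <- density(sub$DaysToInHome[sub$PointOfOrigin == "B"])

plot(dens_a, main = "DaysToInHome by PointOfOrigin (subsample)",
     xlab = "DaysToInHome", ylim = range(dens_a$y, dens_b$y))
lines(dens_b, lty = 2)
legend("topright", legend = c("A", "B"), lty = c(1, 2))

# Between-group gap next to the within-group spread
group_means <- tapply(sub$DaysToInHome, sub$PointOfOrigin, mean)
group_sds <- tapply(sub$DaysToInHome, sub$PointOfOrigin, sd)
gap <- unname(group_means["B"] - group_means["A"])
print(round(c(gap = gap,
              sd_A = unname(group_sds["A"]),
              sd_B = unname(group_sds["B"])), 2))
```

If the two curves overlap almost completely and the gap is small relative to the within-group standard deviations, the difference may be statistically detectable yet hard to act on. To do this per State, the same steps could be repeated on each State's rows.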

Thank you, having an unbiased opinion helps. As for #2, I may not be understanding, but I am not sure how they give me 'back up' for what we are looking for. To help clarify what business decision we are trying to make, I updated my question by adding an extra paragraph at the end. — John Minze, Oct 27 '17 at 15:30
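
One way to tie the averages to the per-state decision described in the question is to report, for each State, the difference in mean DaysToInHome between the two PointOfOrigin values together with a confidence interval, and to compare that interval with the business threshold rather than with zero. The following is a minimal sketch on simulated data, not the real file: the three states, the gaps, the sample sizes, and the symmetric use of the 1.1 threshold are all assumptions made for illustration.

```r
# Simulated data; states, gaps and sample sizes are made up.
set.seed(7)
states <- c("CA", "TX", "NY")                 # three states only, for brevity
true_gap <- c(CA = 3.0, TX = 0.1, NY = 1.1)   # hypothetical B-minus-A gaps (days)
n_per_cell <- 5000

InHomeData <- do.call(rbind, lapply(states, function(s) {
  data.frame(
    State = s,
    PointOfOrigin = rep(c("A", "B"), each = n_per_cell),
    DaysToInHome = c(rnorm(n_per_cell, 10.5, 2.7),
                     rnorm(n_per_cell, 10.5 + true_gap[[s]], 2.7))
  )
}))

threshold <- 1.1  # the example cut-off mentioned in the question

# One Welch t-test per State; keep the estimate and its 95% interval.
per_state <- do.call(rbind, lapply(split(InHomeData, InHomeData$State), function(d) {
  tt <- t.test(d$DaysToInHome[d$PointOfOrigin == "B"],
               d$DaysToInHome[d$PointOfOrigin == "A"])
  data.frame(
    State = d$State[1],
    diff_B_minus_A = unname(tt$estimate[1] - tt$estimate[2]),
    ci_low = tt$conf.int[1],
    ci_high = tt$conf.int[2]
  )
}))
rownames(per_state) <- NULL

# Classify by where the whole interval sits relative to +/- threshold.
per_state$verdict <- ifelse(
  per_state$ci_low > threshold | per_state$ci_high < -threshold,
  "gap exceeds threshold",
  ifelse(per_state$ci_low > -threshold & per_state$ci_high < threshold,
         "gap within threshold", "inconclusive"))
print(per_state)
```

With millions of rows the intervals will be very narrow, so most states should fall clearly on one side of the threshold; the interval mainly guards against over-reading states with few records.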
