I have a very large data set of over 30 million records. My CSV file looks like this:
State PointOfOrigin DaysToInHome
CA A 11.2
CA A 10.8
CA B 15.3
TX B 10.5
TX A 10.6
I am a noob to stats (and Stats.StackExchange) and I am trying to find out whether there is a statistically significant difference in DaysToInHome between the two PointOfOrigin values for each State (50 different values), using R. I have already calculated the averages in Python for the general team, but we have a leader who always wants to see the hard stats. My junior stats experience leads me to believe that with such a large data set the p-values are not very informative (they will more than likely be close to zero regardless), and that we can rely on the calculated averages to judge whether there is any practically significant difference in DaysToInHome between the PointOfOrigin values for each State.
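To make the concern about large samples concrete, here is a small simulation sketch. It uses made-up data, not my real file: the group size, the means, and the standard deviation of 2.7 (roughly the square root of a residual mean square of about 7) are all assumptions chosen for illustration.

```r
# Simulated data only: two groups whose true means differ by just 0.05 days
set.seed(1)
n <- 1e6                                  # rows per group (hypothetical)
a <- rnorm(n, mean = 11.00, sd = 2.7)
b <- rnorm(n, mean = 11.05, sd = 2.7)

tt <- t.test(b, a)                        # Welch two-sample t-test
tt$p.value                                # extremely small at this sample size
tt$conf.int                               # 95% CI for the difference in means (b minus a)
```

The p-value is tiny even though a 0.05-day gap is unlikely to matter for any business decision, which is why the confidence interval for the difference seems more useful to report than the p-value alone.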
So I basically have two questions:
1) Is my assumption correct that with such large data sets the stats are not helpful and I can confidently rely on the averages to make an intelligent and informed business decision, or should I dig deeper and find the statistical significance to CYA?
2) If the answer to #1 is that my assumption is wrong and I should know the stats, how do I go about it?
I have already run an ANOVA in R:
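InHomeData <- read.csv("C:/...Input.csv")
# Two-way ANOVA with interaction; data= tells aov() where to find the columns
aov.InHomeData <- aov(DaysToInHome ~ PointOfOrigin * State, data = InHomeData)
is(aov.InHomeData)       # class information for the fitted object
summary(aov.InHomeData)  # ANOVA table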
This provides:
                      Df Sum Sq Mean Sq F value Pr(>F)
PointOfOrigin          1    952   952.3  134.30 <2e-16 ***
State                 36  65907  1830.7  258.20 <2e-16 ***
PointOfOrigin:State   12   6724   560.3   79.02 <2e-16 ***
Residuals           9950  70550     7.1
But I am looking for a statistic that tells me whether the average DaysToInHome for the State of TX is statistically different between the two PointOfOrigin values, and the same for each of the other 49 states, since a decision will be made for each state based on the data.
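One possible shape for such a per-state comparison is sketched below. It is only an illustration under assumptions: the data frame `toy` is simulated to stand in for the real file, and the Welch t-test and Holm adjustment are one reasonable choice rather than the only valid approach.

```r
# Toy stand-in for InHomeData (simulated values, same column names as the CSV)
set.seed(42)
toy <- data.frame(
  State         = rep(c("CA", "TX"), each = 200),
  PointOfOrigin = rep(c("A", "B"), times = 200),
  DaysToInHome  = rnorm(400, mean = 11, sd = 2.7)
)
# Make B slower than A in CA only, so the two states differ
in_ca_b <- toy$State == "CA" & toy$PointOfOrigin == "B"
toy$DaysToInHome[in_ca_b] <- toy$DaysToInHome[in_ca_b] + 3

# One Welch t-test per state: difference in mean DaysToInHome, B minus A
per_state <- do.call(rbind, lapply(split(toy, toy$State), function(d) {
  tt <- t.test(d$DaysToInHome[d$PointOfOrigin == "B"],
               d$DaysToInHome[d$PointOfOrigin == "A"])
  data.frame(State   = d$State[1],
             diff    = unname(tt$estimate[1] - tt$estimate[2]),
             ci_low  = tt$conf.int[1],
             ci_high = tt$conf.int[2],
             p_value = tt$p.value)
}))
# Adjust the p-values because one test is run per state
per_state$p_adj <- p.adjust(per_state$p_value, method = "holm")
print(per_state)
```

Each row gives the estimated difference for one state with a 95% confidence interval, so the size of the gap can be read directly instead of relying on the p-value.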
Someone mentioned I should bootstrap the ANOVA, but I cannot figure out how to do that. I am also not sure it is worth it.
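For reference, my rough understanding of what a bootstrap could look like for a single state is sketched here. Note that it bootstraps the difference in means between the two origins rather than the full ANOVA, and the vectors `days_a` and `days_b` are simulated placeholders, not real data.

```r
set.seed(7)
# Simulated stand-in for one state's rows (hypothetical values)
days_a <- rnorm(300, mean = 10.6, sd = 2.7)
days_b <- rnorm(300, mean = 12.0, sd = 2.7)

# Resample each origin with replacement and recompute the difference in means
boot_diff <- replicate(2000, {
  mean(sample(days_b, replace = TRUE)) - mean(sample(days_a, replace = TRUE))
})

observed <- mean(days_b) - mean(days_a)
ci <- quantile(boot_diff, c(0.025, 0.975))   # percentile bootstrap 95% interval
observed
ci
```

The percentile interval makes no normality assumption, although with millions of rows it would probably come out very close to the ordinary t-based interval.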
Just to add some context: a few years back a team brought PointOfOrigin B online to increase speed to market, and we are not sure they did their homework first. For example, if the difference in average DaysToInHome for TX between the two PointOfOrigin values is less than X (say 1.1), we might reconsider having Point B process any volume. But if the difference is greater than Y (say 1.1), we might keep Point B processing some volume.
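The kind of decision rule I have in mind could be sketched as below. The interval values in `est` are invented for illustration, and the sketch assumes the intervals are for B minus A and that only a positive difference matters.

```r
# Hypothetical per-state summary: 95% CI for the difference (B minus A), in days
est <- data.frame(State   = c("CA", "TX"),
                  ci_low  = c(2.1, 0.2),
                  ci_high = c(3.4, 0.6))
threshold <- 1.1   # the example value for X and Y from this post

# Compare the whole interval with the threshold, not just the point estimate
est$verdict <- ifelse(est$ci_low > threshold, "difference exceeds threshold",
               ifelse(est$ci_high < threshold, "difference below threshold",
                      "inconclusive"))
est
```

An interval that straddles the threshold would be labeled inconclusive, which seems like a more honest answer than a bare p-value against zero.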