I have a skewed distribution where one data point (Google, see below) dominates the dataset (the Visits metric). I can intuitively see that the %CR for the entire dataset (calculated as Total Transactions/Total Visits) is driven by this data point. But what is the best statistical approach to prove that?

I've tried to phrase my question in different ways in Google to point me in the right direction but I've had no luck.

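[Image: dataset table listing Total Visits, Total Transactions, and %CR for each source, with a Grand Total row]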

Steffen Moritz
  • I think that stating that the Google data represent 79% of your data points is enough... – Elvis Sep 01 '13 at 15:40
  • You could simply present "results including Google" and then partition into "results excluding Google" and "Google alone"; allowing the reader to clearly see the extent to which Google drives the overall results. – Glen_b Sep 01 '13 at 23:26
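The partition suggested in the comment above can be sketched in a few lines of R. This is only an illustration: the figures are the ones that appear in the R code of the first answer, and the names `visits`, `trans`, `is_google`, and `cr` are made up for this sketch.

```r
# Visits and transactions per row (google/cpc first), taken from the first answer's code.
visits <- c(623123, 70941, 50821, 25818, 11742, 7538)
trans  <- c(15795, 104, 2415, 3789, 827, 695)
is_google <- c(TRUE, rep(FALSE, 5))

# Pooled conversion ratio over any subset of rows: total transactions / total visits.
cr <- function(idx) sum(trans[idx]) / sum(visits[idx])

results <- c(including_google = cr(rep(TRUE, 6)),
             google_alone     = cr(is_google),
             excluding_google = cr(!is_google))
round(results, 4)
```

Showing the three ratios side by side lets a reader see how close the overall figure sits to Google's own ratio.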

2 Answers

Your data are clear and obvious and tests can't actually "prove" anything anyway. What you really are asking is how to quantify how much the "Grand Total %CR" depends on the Google results.

This kind of dependence goes by various names: "sensitivity," "leverage," "influence," and so on. The idea uniting them all is that when you change one number that goes into a summary and the summary doesn't budge (much), the summary is not sensitive to the number you changed. When the summary changes a lot, it is sensitive to the change. We can measure this.

Sensitivities can be computed in various ways, depending on how we quantify amounts of change. Consider raw changes in the %CR column, for instance. Evidently increasing Google's value from $0.03$ to $0.04$ would affect the bottom line much more than a similar $+0.01$ increase in any of the five other %CR values. Let's work out how much. To keep the formulas brief, number the rows of the table $1$ through $6$ from top to bottom and refer to the Total Visits as $V$, Total Transactions as $T$, and %CR as $R$ (a mnemonic for "ratio"). Use the row numbers as subscripts; when no subscript appears, we mean the value on the bottom line. For instance, we may refer to the value $2415$ for the bing/cpc Total Transactions on line $3$ as $T_3$. The total number of transactions is $T = 23625$.

The table is constructed so that:

  • In each row $i$, $R_i = T_i/V_i$. (Notice this is not a percent! It's just the ratio.)

  • $T = T_1+T_2+\cdots+T_6$ and $V = V_1+V_2+\cdots+V_6$: these are true totals.

  • $R = T/V$. Notice this value, equal to $0.0299$, differs from the average $\bar{R} = \left(R_1+R_2+\cdots+R_6\right)/6 = 0.0640$.
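The three properties above can be checked numerically. The following is a minimal sketch using the figures from the code later in this answer; the names `visits`, `trans`, `R_i`, `R_pooled`, `R_mean`, and `R_wmean` are chosen here for illustration only.

```r
# Visits (V_i) and transactions (T_i) for rows 1 through 6.
visits <- c(623123, 70941, 50821, 25818, 11742, 7538)
trans  <- c(15795, 104, 2415, 3789, 827, 695)

R_i      <- trans / visits                   # per-row ratios R_i = T_i / V_i
R_pooled <- sum(trans) / sum(visits)         # bottom-line ratio R = T / V
R_mean   <- mean(R_i)                        # unweighted average of the six ratios
R_wmean  <- weighted.mean(R_i, w = visits)   # visit-weighted average of the ratios

round(c(pooled = R_pooled, unweighted = R_mean, weighted = R_wmean), 4)
```

The pooled ratio coincides with the visit-weighted average of the $R_i$, not with their plain average, which is why rows with many visits matter more.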

To find the sensitivities we employ rules of Calculus to differentiate $R$ with respect to the data. The wording of the question suggests that the sensitivity of $R$ with respect to the $R_i$ is of concern. When changing $R_i$ we imagine this being caused primarily by changes in the response $T_i$ rather than the underlying intensity $V_i$. Because $R$ is not directly expressed in terms of the $R_i$, begin with an algebraic manipulation to express it in terms of the total visits and CR values:

$$R = \frac{T_1+T_2+\cdots+T_6}{V_1+V_2+\cdots+V_6} = \frac{V_1R_1+V_2R_2+\cdots+V_6R_6}{V}.$$

Now we may compute

$$\frac{\partial R}{\partial R_i} = \frac{V_i}{V}.$$
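Spelling out that step, under the assumption already stated (every $V_j$, and hence $V$, is held fixed), only the $i$-th term of the numerator depends on $R_i$:

$$\frac{\partial R}{\partial R_i} = \frac{\partial}{\partial R_i}\left(\frac{1}{V}\sum_{j=1}^{6} V_j R_j\right) = \frac{V_i}{V}, \qquad\text{so}\qquad \Delta R = \frac{V_i}{V}\,\Delta R_i.$$

Because $R$ is linear in each $R_i$, the second relation holds exactly for finite changes, not just infinitesimal ones. Each sensitivity is simply that row's share of the total visits, so the six sensitivities sum to $1$.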

Let's make a table of these sensitivities:

Row              Sensitivity
---------------- -----------
google/cpc             0.789 
(direct)/(none)        0.090 
bing/cpc               0.064 
vantage/cpc            0.033 
yahoo/rich media       0.015 
gsp/banner             0.010

For instance, the bing/cpc value is $V_3/V = 50821/789983 = .06433...$, rounded to $0.064$ for easy comprehension.

To make Google's domination of CR obvious, a simple graphic will suffice:

[Figure: bar plot titled "CR Sensitivity", one bar per row]

An increase of $0.01$ (one percentage point) in Google's CR value has almost $0.79/0.09\approx 9$ times the effect on the overall CR value of the same increase in the next most influential row, and $79$ times the effect (actually $83$ times when computed more precisely) of the same increase in the least influential row.
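These ratios can also be confirmed by brute force rather than by calculus. The sketch below bumps each row's ratio by $0.01$ in turn, holds the visits fixed (the assumption made in the derivation), and recomputes the overall ratio; the helper `pooled` and the other names are hypothetical, introduced only for this check.

```r
visits <- c(623123, 70941, 50821, 25818, 11742, 7538)
trans  <- c(15795, 104, 2415, 3789, 827, 695)
names(visits) <- c("google/cpc", "(direct)/(none)", "bing/cpc", "vantage/cpc",
                   "yahoo/rich media", "gsp/banner")

rate   <- trans / visits
# Overall ratio written in terms of the per-row ratios, with visits held fixed.
pooled <- function(r) sum(visits * r) / sum(visits)

bump <- 0.01
# Change in the overall ratio when row i alone is raised by `bump`.
effect <- sapply(seq_along(rate), function(i) {
  r <- rate
  r[i] <- r[i] + bump
  pooled(r) - pooled(rate)
})
names(effect) <- names(visits)

round(effect / bump, 3)   # matches the sensitivity table, visits / sum(visits)
effect[1] / effect[2]     # Google relative to the next most influential row
effect[1] / effect[6]     # Google relative to the least influential row
```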


Calculations of this sort are readily carried out with statistical computing software such as R. Here is the code I used:

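# Total Visits (V) and Total Transactions (T) for the six rows of the table.
V <- c(623123, 70941, 50821, 25818, 11742, 7538)
names(V) <- c("google/cpc", "(direct)/(none)", "bing/cpc", "vantage/cpc", 
              "yahoo/rich media", "gsp/banner")
T <- c(15795, 104, 2415, 3789, 827, 695)  # note: masks R's shorthand T for TRUE
R <- T/V             # per-row ratios R_i
R.sens <- V/sum(V)   # sensitivities V_i/V

# Bar plot of the sensitivities, one color per row.
palette(terrain.colors(6))
b <- barplot(R.sens, main="CR Sensitivity", col=1:6)
# Label heights: just above each bar, except inside the tall Google bar.
h <- R.sens + 0.05; h[1] <- h[1]/2
text(b, h, labels=format(R.sens, digits=1))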
whuber

I don't think there is a statistic to prove exactly this (but you would have to be more precise about what "dominates" means).

If you want to show that Google is different from all the others, you could do a t-test on 15,795/623,123 vs. (23,625-15,795)/(789,983-623,123). If you wanted to see which sites were different (not combining any) you could do a logistic regression with % as the DV and "site" as the IV.
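Both comparisons can be sketched in R. Note two substitutions that are assumptions of this sketch, not part of the answer: `prop.test` (a chi-square test for equality of two proportions) stands in for the t-test, and the logistic regression is fit on counts of conversions out of visits rather than on the percentage itself. The variable names are illustrative.

```r
visits <- c(623123, 70941, 50821, 25818, 11742, 7538)
trans  <- c(15795, 104, 2415, 3789, 827, 695)
sites  <- c("google/cpc", "(direct)/(none)", "bing/cpc", "vantage/cpc",
            "yahoo/rich media", "gsp/banner")
site   <- factor(sites, levels = sites)   # google/cpc becomes the reference level

# Google vs. all other rows pooled: test of equal conversion proportions.
two_group <- prop.test(x = c(trans[1], sum(trans[-1])),
                       n = c(visits[1], sum(visits[-1])))
two_group$estimate   # the two conversion ratios being compared
two_group$p.value

# Logistic regression: conversions out of visits, one coefficient per site
# (each measured on the log-odds scale relative to Google).
fit <- glm(cbind(trans, visits - trans) ~ site, family = binomial)
summary(fit)$coefficients
```

Either analysis addresses whether conversion rates differ between sources; neither measures how much one source contributes to the overall figure.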

EDIT (per comments)

OK, if you want to show that the distribution of visits across sites is not uniform, you can do a one sample chi-square test. In R:

chisq.test(c(623123, 70941, 50821, 25818, 11742, 7538))

and R reports the resulting p-value as less than $2.2\times 10^{-16}$.

But that really is overkill.

Peter Flom
  • Thanks for the reply. By "dominates" I mean that Google has 78% of visits. Because of this, its conversion rate seems to be the key driver of the overall conversion rate (0.03, bottom right corner). In this case that seems intuitive from looking at the data, but in other cases it might not be quite so straightforward, so I'd like a more scientific approach. – needlesslosses Sep 01 '13 at 13:15
  • Hadn't quite finished. I've done the t-test. My understanding is that the t-test shows that A and B are different, not how much A contributes to the overall figures. Is that right? Many thanks – needlesslosses Sep 01 '13 at 13:17