Looking at Google Analytics, I can't help but think that the summary stats which present averages have the potential to be very misleading (and probably are, in practice). For some of these summary stats, there doesn't appear to be an easy way to determine the distribution of the numbers underlying the data. For other stats, there's apparently no way to determine the underlying distribution. It's clear enough that the underlying distributions for some of these stats are likely to be non-Gaussian.

So I'd like to know - if you have "average" summary stats without knowing the distribution of the underlying numbers, how do you interpret or derive meaning from them?

  • Possibly of interest: [should the mean be used when the data are skewed?](https://stats.stackexchange.com/questions/96371/should-the-mean-be-used-when-data-are-skewed). – Alexis Mar 13 '17 at 23:28
  • @Alexis, thanks -- I'd totally forgotten that thread (even though I posted an answer -- amusingly, I was reading an answer yesterday to which I thought "Gee I wish I knew as much about this" only to find it was my own answer). I think that post is probably sufficiently close to call a duplicate (in that answers to that question in effect respond to this question). If they're not sufficient for Dreadnaught's purposes, OP can either edit to distinguish (ie focus on what isn't answered there) and flag to reopen or post a new question aimed at whatever specific issues were not covered. ... ctd – Glen_b Mar 14 '17 at 00:00
  • ctd .. This one and the one you linked to are the sort of really fundamental question about which volumes could be written but solid practical advice is very difficult to offer unless we go into a great deal of figuring out what we're really trying to find out and studying a great deal about the variables of interest. Good advice is especially difficult to offer for people not already having a fairly solid background in statistical ideas (where it's most needed, really, since the overwhelming majority of people for whom such a question is relevant don't have more than a smattering of it). – Glen_b Mar 14 '17 at 00:00
  • There are some other good links in the sidebar (under "Related"). Of possibly some marginal relevance is this other question about [when the mean is more efficient than the median with symmetric distributions](http://stats.stackexchange.com/questions/136671/for-what-symmetric-distributions-is-sample-mean-a-more-efficient-estimator-tha). – Glen_b Mar 14 '17 at 00:02
  • @Glen_b - Above you stated, "studying a great deal about the variables of interest..." The dilemma here is that my ability to study the variables in question is constrained by the lack of availability of the data. The issue is not one of non-normal distributions (maybe I should delete that part of the question?), or of the fundamental meaning of what a *mean* is. I'm clear on those concepts. Part of my intent is to ask whether interpreting certain descriptive stats is an exercise in self-deception when the ability to determine the shape of the data one is dealing with, is severely limited. – Dreadnaught Mar 14 '17 at 01:00
  • The expression covers a variety of things - what is known about the variables (continuous? measurements? times? counts? categories? upper and lower bounds? etc) plus any existing theory, any previous studies that incorporate the same or similar variables, plausible reasoning, etc – Glen_b Mar 14 '17 at 01:04
  • @Dreadnaught Re: "to ask whether interpreting certain descriptive stats is an exercise in self-deception when it may not be possible to obtain a good idea of the shape of the data one's dealing with" -- that sounds like it would be a more radical change than a refocusing of the present question (I think it would completely change the question out from under what my answer responded to) ... so it looks like a new question. It also sounds like a rather broad and vague question in its present form, so you'd need to make it clearer before posting it. – Glen_b Mar 14 '17 at 01:15
  • @Glen_b - In the case of one variable, what's known is that it measures the duration of an activity (time spent on a website). Thus, the lower bound is zero. It's measured in seconds. It's not entirely clear if the variable has a theoretical upper bound. In the particular case I'm dealing with, it's probably not worth the effort to do extensive research, if it comes to that. – Dreadnaught Mar 14 '17 at 01:16
  • That sort of information is useful (not least because it suggests a model, and even if there's no clear upper bound, it does reduce the likelihood of an extreme tail -- even the website itself won't stay up uninterrupted forever) – Glen_b Mar 14 '17 at 01:20
  • @Glen_b - Either way, I don't think that the question I posed is comparable to the question I allegedly duplicated, because that one refers to skewed distributions, whereas the question I asked has to do with cases where one is not certain of what sort of distribution one is dealing with. – Dreadnaught Mar 14 '17 at 01:21
  • times with a lower bound but no upper bound will be skewed, but in any case what makes a question a duplicate is the extent to which the answers would be similar (as already indicated). If you're quite sure the answers there really don't serve, I can reopen. – Glen_b Mar 14 '17 at 01:24
  • @Glen_b - indeed, that sort of data may be useful, but part of my motivation in asking this question is to delve into the issue of a case where many people are unknowingly misusing statistical data that they're receiving from a trusted software vendor (which means that these people are unaware of the problem, much less of how to fix it). – Dreadnaught Mar 14 '17 at 01:24
  • Your motivation is well and good, but that's not what your question asks about; I've responded to that already a few comments ago (it's a new question but if that's what you want to get an answer to, it is both too broad and too vague as it stands). The issue here is not your motivation but whether the answers in the linked question *respond to the question you asked*. If they do, it's a duplicate as stackexchange define it, whether or not that fits your motivation for posting – Glen_b Mar 14 '17 at 01:26
  • @Glen_b - Fair enough. I believe that the question I posed is phrased in a general enough manner that it doesn't specifically reference times, or other data that's necessarily skewed, but I'll leave it up to your discretion as to whether to reopen it. – Dreadnaught Mar 14 '17 at 01:28
  • I have reopened this question -- the reasoning is that even though the answers almost entirely apply, it's not going to be clear to people looking for answers to a question like yours the extent to which those answers also answer this one. Better to just link it as Alexis did. – Glen_b Mar 14 '17 at 01:37
  • @Glen_b - thanks. I appreciate the input you've provided on this topic. – Dreadnaught Mar 14 '17 at 02:03

1 Answer

As Gertrude Stein might have written had she been a statistician, "Do we suppose that all she means is that a mean is a mean is a mean"$^{\,*}$.

You don't have to have normality for the mean to be meaningful. I use it quite happily when I think of an exponential or a Poisson model, or a binomial, or with a discrete uniform, for example, and even when I don't really have a good model for the distribution. (It's not necessarily the only statistic I care about, though, but it's a handy thing in a wide range of situations.)

The sample mean (aside from any interest in its own right as a kind of data-summary) is an unbiased estimator of the population mean (when it exists - but that will probably apply to all or at least nearly all of the measures you're looking at in practice), and converges to it.
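In symbols, as a brief sketch: assume independent, identically distributed observations $X_1, \dots, X_n$ with finite population mean $\mu$ (this notation and the i.i.d. assumption are introduced here for illustration and are not stated above). Then, whatever the shape of the distribution,

$$E\left[\bar{X}_n\right] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu, \qquad \bar{X}_n \to \mu \ \text{ as } n \to \infty,$$

where the first part is unbiasedness (by linearity of expectation) and the second is the law of large numbers. Neither step requires normality.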

Two big considerations when trying to estimate population quantities:

  1. What quantities are of interest to you?

  2. What's a good way to estimate them?

If you include "the mean" in 1. but don't know much about the distribution that you're sampling from, then you don't know a good way to estimate the mean (i.e. it's hard to say much about 2.); at least the sample mean has some useful properties, and should get there eventually, at least if the population you're sampling is the population of interest. In that case, you can still interpret the mean as, well, the mean, and as an estimate of the population mean.

Imagine, for example, I was sampling from a lognormal distribution (but didn't know that). The sample mean is going to "work" as an estimate of the population mean. [Though depending on how much skewness we're dealing with, it might be quite noisy, and if we want to give an interval for the mean, the situation is worse.]
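The lognormal example can be tried out with a small simulation. This is only a sketch using the Python standard library; the parameters `mu = 0`, `sigma = 1`, the sample sizes, and the helper name `lognormal_sample_means` are all assumptions chosen for illustration. It compares repeated sample means against the exact lognormal mean $e^{\mu + \sigma^2/2}$ and reports how much they scatter at each sample size.

```python
import math
import random
import statistics


def lognormal_sample_means(mu, sigma, n, reps, seed=0):
    """Return `reps` sample means, each computed from `n` lognormal draws.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.lognormvariate(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]


# Illustrative parameters (assumed, not taken from any real data set).
mu, sigma = 0.0, 1.0
population_mean = math.exp(mu + sigma**2 / 2)  # exact mean of a lognormal

print("population mean:", round(population_mean, 3))
for n in (10, 100, 1000):
    means = lognormal_sample_means(mu, sigma, n, reps=200)
    # Centre and spread of the 200 sample means at this sample size.
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

The sample means centre on the population mean at every sample size, but their spread at small `n` is large; increasing `sigma` (more skewness) should make that noise considerably worse.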

However, while the mean has a few nice properties when you're sampling from the population you want to make inferences about, you're right to approach it with caution: it's not very robust, so even a teeny bit of contamination is a problem for it, and that can certainly mislead us if we're interested in understanding something about the population absent the process of contamination$^\dagger$. By the same token, you needn't be overly focused on Gaussian distributions if you actually want to know about the population mean.
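To make the robustness point concrete, here is a rough sketch, again standard-library Python with made-up numbers. It assumes a hypothetical "time on site" variable that is exponential with a mean of 60 seconds, and contaminates 1000 such sessions with 5 sessions recorded as a full day (86,400 seconds), e.g. a tab left open. Both the model and the contamination are assumptions for illustration, not a description of how any analytics tool behaves.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical clean data: 1000 session durations in seconds,
# exponential with mean 60 s (an assumed model, for illustration).
clean = [rng.expovariate(1 / 60) for _ in range(1000)]

# Contamination: 5 sessions recorded as lasting a whole day.
contaminated = clean + [86_400.0] * 5

for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        label,
        "mean:", round(statistics.fmean(data), 1),
        "median:", round(statistics.median(data), 1),
    )
```

Under these assumptions, about half a percent of contaminated observations moves the mean several-fold while the median barely changes. Which of the two is more useful still depends on whether the mean of the uncontaminated population is the quantity of interest.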


$*$ which if she'd written it might have been in a work called Operas and averages

$\dagger$ it might be better to accept a potentially substantial bias induced by a slight robustification than a potentially unlimited bias if the contamination is wild enough

Glen_b
  • Thanks, Glen. I didn't mean to attribute excess importance to normality, in my question. My focus is more on the situation where you literally may not know what sort of distribution you're dealing with, which seems to be the case with a certain popular analytics tool. You did address that, and helped me clarify my thinking a bit about the topic. Thanks! I don't have enough rep to upvote, but I'll accept the answer in a couple of days, if there's no further input. – Dreadnaught Mar 13 '17 at 23:21
  • OMG +1 for G. Stein humor!!! :) Brief aside: when I visited [*Cimetière du Père-Lachaise*](https://en.wikipedia.org/wiki/P%C3%A8re_Lachaise_Cemetery), where Stein and Toklas are interred, their monument was covered in stacks of rocks and some flowers, whereas a few avenues over, Oscar Wilde's tomb was gifted with candy and poems or other written missives. :) – Alexis Mar 14 '17 at 00:38