
This is a very basic question, but I find myself constantly frustrated with the same questions as I review regression results and read statistics forums. The question is whether or not some sort of software or automated approach exists to determine:

1) Which type of regression to run. For example, based on the distribution of the dependent variable, x would work. There seem to be many types to run, and it's never clear cut.

2) As a result of regression, how to interpret the results. Statistics programs give you a p-value, R-squared, and sometimes other terms, which you then need to look up. It's never clear if what it says to be significant is actually significant, or the result of some flaw in the data that then needs to be uncovered.

3) Analysis of results. There are all sorts of charts and plots, but all I really ever want to know is what model yields the best results to explain or predict the data. There seem to be endless diagnostics, but I never know what to run or how to do it.

I realize this is a very generic question, and perhaps I'm looking at this all wrong, but I imagine there would be a standard approach to doing this, and you should be able to provide a few columns of data, run tests, and get results that can be interpreted. Instead, what I find is that results come out using terms that are hard to understand in plain English, and then there are further tests that then either validate or invalidate these results, which lead to more obscure terms. I get that this is why there are experts, but I'm looking for some explanation of how to approach an exercise where you are building a regression model to predict or explain something.

I'm a beginner here, but statistics seems like a world of never ending confusion, and often statisticians will dispute models and approaches, so that further compounds my problem. Typically when faced with these types of topics that are overly confusing, I arrive at "this is probably nonsense or something misleading is going on here." At least that's what I thought about complex derivatives in finance, and felt validated by 2008 (off topic, but this is a bit of a rant here). Apologies if this is not the appropriate forum. Please advise.

  • Medicine, when taken to include every caring parent, naturopath, chiropractor, shaman, nutritionalist, *etc.* also is a "world of never ending confusion." Simply consult the Internet for abundant evidence! Nevertheless, there are scientific principles to medicine and modern healthcare is remarkably effective in many ways. Should we complain because people who do not have medical training may be so confused? Should we wonder why the Web cannot completely replace physicians? Should we insist that people be able to tell us all we need to know about healthcare in one simple post? – whuber Jan 12 '17 at 17:34
  • Even if someone were to spoon-feed you, someone could still complain that it's too hot. – Sycorax Jan 12 '17 at 18:02
  • @whuber, doctors are necessary, but a parallel would be a blood test that provides diagnostics and comparisons vs. other people your age to interpret whether you are "normal," or have some sort of irregularity which could mean x, y, and z. When I get blood work done, I get a sheet back with fonts from the 80s with what amounts to useless data to me. – sqlnewbie1979 Jan 12 '17 at 19:00
  • On (1): There's perhaps an implicit assumption here that the "right" regression to run could be determined solely by looking at the data, regardless of subject-matter knowledge or the goals of analysis. The type of software you're asking for would be something like an expert system that quizzed you until it had enough information to suggest a suitable approach. Related: [What do statisticians do that can't be automated?](http://stats.stackexchange.com/q/22572/17230) – Scortchi - Reinstate Monica Jan 12 '17 at 19:42

2 Answers


There are many good reasons why what you want is mostly impossible, and not even a good idea in the first place.

1) Which type of regression to run. For example, based on the distribution of the dependent variable, x would work. There seem to be many types to run, and it's never clear cut.

No; it's rarely possible to tell from the distribution of the dependent variable whether a particular kind of regression will work, meaning work well. It's not even a formal assumption of any kind of regression that the dependent variable has a particular kind of distribution. You gave that as an example, but similar comments would apply, I assert, to any other kind of example. For anything stated as a modelling assumption -- in my view, usually better stated as an ideal condition for a particular model to work as well as possible -- one could concoct examples that are poorly behaved in the sense of that assumption, yet for which regression works fine (and likely vice versa too). This is one reason why regression texts are so often so long (and incomplete too, even when they are so long). Compound that with all possible assumptions and you have a branching tree of possible decisions and actions.
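To make this concrete, here is a small sketch of my own (not part of the original answer): the marginal distribution of the dependent variable below is strongly right-skewed, yet plain OLS recovers the slope and intercept essentially perfectly, because what matters is the behaviour of the errors around the line, not the shape of y itself.

```python
import random

random.seed(42)

# Skewed predictor, well-behaved normal errors: the *marginal*
# distribution of y is strongly right-skewed, but OLS is fine.
n = 5000
x = [random.expovariate(1.0) for _ in range(n)]           # skewed x
y = [2.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]  # true line: 2 + 3x

# Ordinary least squares by the textbook formulas.
mx = sum(x) / n
my = sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sxy / sxx
intercept = my - slope * mx

print(round(slope, 2), round(intercept, 2))  # close to the true 3 and 2
```

Looking at a histogram of y here would suggest "non-normal, so ordinary regression is wrong" -- and that conclusion would be mistaken.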

2) As a result of regression, how to interpret the results. Statistics programs give you a p-value, R-squared, and sometimes other terms, which you then need to look up. It's never clear if what it says to be significant is actually significant, or the result of some flaw in the data that then needs to be uncovered.

One could quibble a lot about the wording here -- from which Olympian stance is anyone allowed to determine what is "actually significant" versus anything else? -- but while the impulse to know how to think about results is admirable, this is a fiendishly difficult problem. You'd need to teach the program everything known about the data, difficulties in sampling and measurement, etc., etc. This is, in some sense, a goal of some statistical people: to build models that incorporate all kinds of uncertainty in a substantial project. Suffice it to say that those are usually multi-member, multi-year team projects.

3) Analysis of results. There are all sorts of charts and plots, but all I really ever want to know is what model yields the best results to explain or predict the data. There seem to be endless diagnostics, but I never know what to run or how to do it.

Specifically, my guideline is that most formal tests are misguided and the best approach is graphical, but I can hardly establish that with this sentence alone. Best to explain or predict? Bang on as a concisely stated goal with which most can agree as a starting point, but the detailed discussions start there. It's not even a matter of unanimity that models should explain! Some see every modelling exercise, especially with observational data, as essentially descriptive, holding that we are fooling ourselves if we pretend otherwise. There are numerous single-valued measures to guide model choice, except that no one much likes anyone's criterion except their own.
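As an illustration of one such single-valued measure (my example, not from the original answer), here is AIC in its standard Gaussian-likelihood form, AIC = n·ln(RSS/n) + 2k, used to compare an intercept-only model against a simple linear fit on simulated data with a genuine trend; lower is better:

```python
import math
import random

random.seed(7)

# Simulated data with a genuine linear trend plus noise.
n = 200
x = [i / 10 for i in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 2.0) for xi in x]

def aic(rss, n, k):
    # Gaussian-likelihood AIC, up to an additive constant;
    # k counts fitted parameters including the error variance.
    return n * math.log(rss / n) + 2 * k

# Model A: intercept only (k = 2: mean + variance).
my = sum(y) / n
rss_a = sum((yi - my) ** 2 for yi in y)

# Model B: simple linear regression by OLS (k = 3).
mx = sum(x) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
rss_b = sum((yi - (my + slope * (xi - mx))) ** 2 for xi, yi in zip(x, y))

print(aic(rss_a, n, 2), aic(rss_b, n, 3))  # the line should win here
```

The mechanics are trivial; the arguments are over which criterion (AIC, BIC, cross-validation error, and so on) to trust, and when.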

Note that your 2) and 3) contradict each other as the main point of the diagnostics in 3) is to help the thinking in 2) about what can be believed and what is trivial or artefactual.

A dark fact about regression, even in 2017, is that even so-called experts can disagree strongly about fundamentals, so the scope for a simple, unified, easily usable program that makes it all trivial is negligible. For example, my default position is to work on logarithmic scale, but I've seen people of similar or greater experience fight shy of any kind of transformation.
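To show what working on logarithmic scale can buy (a sketch of my own, under the assumption that the data-generating process is multiplicative), compare the skewness of residuals from a straight-line fit on the raw scale with residuals from the same fit on the log scale:

```python
import math
import random

random.seed(1)

# Multiplicative data: y = exp(1 + 0.5 x + noise), so log y is linear in x.
n = 1000
x = [random.uniform(0, 4) for _ in range(n)]
y = [math.exp(1.0 + 0.5 * xi + random.gauss(0.0, 0.5)) for xi in x]

def ols_residuals(xs, ys):
    # Residuals from a simple OLS straight-line fit.
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    b = (sum((a - mx) * (c - my) for a, c in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
    return [c - (my + b * (a - mx)) for a, c in zip(xs, ys)]

def skewness(r):
    # Moment-based sample skewness.
    m = len(r)
    mean = sum(r) / m
    m2 = sum((v - mean) ** 2 for v in r) / m
    m3 = sum((v - mean) ** 3 for v in r) / m
    return m3 / m2 ** 1.5

raw_skew = skewness(ols_residuals(x, y))
log_skew = skewness(ols_residuals(x, [math.log(v) for v in y]))
print(raw_skew, log_skew)  # raw scale markedly right-skewed; log scale roughly symmetric
```

For this kind of data the log scale clearly works better; the disagreements among experienced people are over how often real data are like this, and whether a transformation complicates interpretation more than it helps.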

Don't you think that others would love that program too, if it existed, and that its name would be known all over the internet? It's the statistical equivalent of the world peace desired by every pageant contestant.

Note that I have to add a rider about "layman's terms". The difficulty of solving these problems in ordinary English (or French, Chinese, Hindustani, or any other language) is precisely why we need special notation, terminology, and concepts. Any implication that experts are just obscure on purpose is, generally, unhelpful if not offensive. What are these layman's terms anyway? Many scientists, for example, know a lot of mathematics; they just may not know much modern statistics. The call for layman's terms is ultimately self-defeating, because there is always someone less educated who can shout that they don't understand it. Undesirable, unfortunate, but it can't be a limit on statistics any more than it is in any other field.

Nick Cox
  • Also predicting and explaining are different things. We often need to make a tradeoff between the two, which is something you cannot delegate to a computer. – Maarten Buis Jan 12 '17 at 18:45
  • @Nick Cox, let me expand a little with something more concrete. First, I was asked to run a regression by a teacher. Upon looking at the results, she explained that OLS regression would not work b/c there were so many zeroes in my dependent variable so I had to use some sort of Poisson regression, or negative binomial regression, or logistic regression. Now stop there for a moment, and think about how much information is out there that is wildly confusing. I have what on the surface is a standard problem, and am now down a new path. – sqlnewbie1979 Jan 12 '17 at 19:09
  • @Nick Cox, then I come here to post about this, and I'm told to try a ZIP model or a hurdle model. As for point 2, I don't agree that you'd need to teach a program everything. I don't see any contradiction in point 2 and 3, as I'm asking how to interpret both of these. One is an output of the regression, and the other is plots and things that serve to compare models. There should be a standard path for basic regression. If the purpose is to explain something with data, it should be explainable, and there is nothing offensive about that. Your comment re a graphical approach is helpful. – sqlnewbie1979 Jan 12 '17 at 19:17
  • I don't deny the difficulties underlying learning and understanding in the slightest -- in fact most of my post expands on their existence and some reasons for them. The main plea in your post appears to be (why) can't it all be explained more simply and that is answered. The shortest answer is No. I'd be happy to have easier news for you. Everything hinges on a really good answer (analysis) depending on a well-informed person making really good decisions about a project in the light of detailed, systematic information. No program can substitute for that. – Nick Cox Jan 12 '17 at 19:55
  • All learning is spiral. One first learns means, and then learns about medians and modes as alternatives. Then you dive deeper and learn about general ideas of measuring location, including why that might not be a good idea or even valid at all. That couldn't (easily) be wrapped up in a useful program to summarize a distribution that works at all possible levels. – Nick Cox Jan 12 '17 at 20:01
  • And alongside this is the need to interpret whether the result is meaningful or not. A drug company might develop a pill that cures cancer, but if each pill costs $5 billion, then they probably can't move forward. A general purpose, automated analysis can't substitute for content knowledge. – Ashe Jan 12 '17 at 20:14
  • Thanks for all the responses. I really didn't mean any offense. I really want to learn more and was wondering how to approach that, or if something existed to guide me. Any thoughts on this? Why was this deemed too broad by 5 users? I found the responses very helpful. Thanks, and I want to reiterate this was not meant to offend. – sqlnewbie1979 Jan 13 '17 at 02:11
  • I've edited the sentence about what may be offensive. My answer, like your question, was intended generically, rather than personally. – Nick Cox Jan 13 '17 at 07:23

Elements of this do exist, but it's extremely difficult to write down a detailed strategy for explaining what's going on with any possible data set and modeling approach.

That said, aspects of this could be quite valuable, and there have been some efforts to make it work. Two (very different) examples I'm aware of:

  • The explainr package for R, which provides a framework for automatically explaining statistical results from specific models to a lay audience
  • The "automatic statistician" (paper; code; example report), which performs very sophisticated analyses and writes detailed reports (in a small domain of problems).
David J. Harris