
I'm trying to predict a numeric value (based on different variables), and for that I want to use several methods and compare them (having a variety of methods is more important than finding the absolute best one). My problem is that I'm a beginner in the field of predictive analytics and have only basic knowledge of statistics in general, so I'm afraid of choosing methods that make no sense to use and of wasting time I don't have.

My data consists of about 1,000 observations of 10 variables, with only mild collinearity among them. My model looks like $y = a + bX + e$, with $X$ being the vector of my 10 variables, $b$ the coefficients of those variables, $e$ the error term, and $a$ a constant. I want to predict $y$ based on my training data and compare the methods on my test data.

The methods I want to use are:

  • OLS Regression: because it is the most basic thing to do
  • PLS Regression: to compare it with OLS regression, hoping that it will give better results. I'm not sure whether I should use it, because I have many observations for a rather small number of variables
  • LASSO Regression: another regression which promises to be better. Again, I'm not sure whether I should use it, because of the small number of variables. If there is an obvious reason not to use it, please let me know!
  • Decision Tree: decision trees are also a widely used method, so I think I should include one in the comparison
  • Random Forest: Random Forest as a logical next step after the Decision Tree
  • Support Vector Machine: A machine learning method which promises good results
  • Neural Networks: an interesting-sounding method for which I have some basic understanding of how it works.
  • Genetic Algorithms: a GA with the objective of minimizing the squared difference between the predicted and the actual values. Is there any advantage over ordinary OLS regression? Or should I choose another objective for better results?
  • Naive Bayes classifier: another machine-learning method which may give some good results

Again, my aim is to have a variety of methods to compare; having the best prediction comes only second. Nevertheless, I don't want to use methods which are obviously unsuitable for my task, and I don't have the time to gain deep knowledge of methods I won't use later. Of course, I will research all the methods I compare later, but for now it would be great if you could help me out!

Are these good (or at least suitable) choices, or are there some methods I shouldn't use because they make no sense here? Are there any other methods, which I didn't pay attention to, that you would suggest using?

Marquis de Carabas
Sven E.
  • I am voting to close this question because it is unclear: can you explain more about your data and how it should be modeled? You currently specify just a linear model $y = a + Xb + e$ with an unknown error distribution. It is also too broad: e.g., in relation to the linear model, you could better ask something more specific, like "is it better to use LASSO instead of OLS?". – Sextus Empiricus Oct 06 '18 at 19:58

1 Answer


Your model selection should be based on the type and distribution of your outcome variable. You say your DV is numeric but is it continuous, discrete, a scale, or some other type? Your answer to this question will determine to a large extent the best model to choose from.

For continuous variables such as income, it is customary to do a log transformation to get it as close to a normal distribution as possible. You can then employ OLS and run some diagnostics to check your model fit. For other types of continuous variables, get a histogram and check the distribution. If it is somewhat normal, you can run an OLS and check the diagnostics and model fit.

If you have a binary outcome, such as a Yes-No or 0-1 dichotomous variable, you can use logistic regression. For count variables, such as founding rates of new restaurants in a city, negative binomial or Poisson regression is suitable, depending on the nature of the variable. If you have lots of zeros, you can consider a zero-inflated negative binomial or Poisson model.

If your DV has more than two categories, multinomial regression is often the safest route, although it largely depends on the nature of the DV, your theory, and how you decide to recode it, if you do at all.

If you are new to regression and statistics in general, I would suggest you stay away from some of the highly-specialized models you mention in your post because they require a more advanced knowledge and understanding of statistical modeling than you probably possess right now.

monarque13
  • "For continuous variables such as income, it is customary to do a log transformation to get it as close to a normal distribution as possible" is incorrect. The assumption of normality or near-normality is about the residuals. For more on when and why to take transforms, see http://stats.stackexchange.com/questions/244950/lambda-value-for-boxcox-transformation-in-time-series-analysis/244951#244951 – IrishStat Dec 24 '16 at 11:31
  • Thank you for your input! Sorry, I meant to say that my variable is continuous. You are probably right that I shouldn't use highly specialized models, and I'm not sure which of these I want to use later on. But for now, is there an obvious reason not to use them? Or should I look into some other methods? – Sven E. Dec 24 '16 at 13:40
  • Again, what is the nature of your project and what are your variables? What are you trying to predict/model? How is your DV distributed? You could run all of those models on the same data set and probably get an output but that doesn't mean they are correct. I suggest you start with the simplest case. If the range of your DV is limited, you could treat it as a count (if it includes positive values) and run a NB model. There are hundreds of reasons to try different models but only one is enough to stay away from them, if that makes sense. – monarque13 Dec 24 '16 at 19:53