Machine Learning on Percent/Continuous Dependent Variable

Question

I have a large dataset of 30,000 cases with 150 variables. I am looking for a few possible machine learning solutions/methods that I could try and use for cross validation.

My dependent variable is a percent/continuous variable while all my independent variables are continuous or discrete categorical variables. I am only looking for one output which would be the prediction variable which provides a percent (continuous between 0 and 1).

Currently I have run linear, logit and probit models, with probit showing the most promise.

I believe Naïve Bayes is a simple one I have yet to try, but I am wondering what other types of machine learning models would give the desired output.

Update: I am modeling Election Turnout based on past turnout within an individual precinct. One precinct may have 33.33% turnout while another might have 57.65% turnout. The possible outputs could be any continuous percent between 0 and 1.

Any thoughts would be much appreciated! Thanks in advance!

If your target data are in (0,1) then beta regression is an option — , Jun 10 '14 at 15:16
If you want better advice, you should tell us more! One cannot really say much only based on the format of the data. We need to know what you are modelling and what you want to achieve. — kjetil b halvorsen, Jun 10 '14 at 15:56
This answer is relevant http://stats.stackexchange.com/a/29042/44764 — , Jun 10 '14 at 16:49

mike1886 · Answer 1 · 2014-06-10T16:56:15.497


There are a number of machine learning libraries that let you try out many different algorithms. If you are using R, look into the caret package: http://caret.r-forge.r-project.org/. It makes cross-validation and model selection straightforward and lets you benchmark the performance of your different algorithms. All the available algorithms are listed at http://caret.r-forge.r-project.org/bytag.html
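To make the cross-validation idea concrete, here is a minimal sketch in plain Python (not caret, and not the OP's data). It compares a mean-only baseline against a one-predictor linear fit using k-fold out-of-sample RMSE. The function names, the synthetic precinct data, and the choice of RMSE as the metric are all assumptions made for illustration.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_rmse(x, y, fit, k=5, seed=0):
    """Average out-of-fold RMSE of the model returned by fit(x_train, y_train)."""
    scores = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        sq_err = [(predict(x[i]) - y[i]) ** 2 for i in held_out]
        scores.append(statistics.fmean(sq_err) ** 0.5)
    return statistics.fmean(scores)

def fit_mean(x_train, y_train):
    """Baseline model: always predict the training mean."""
    m = statistics.fmean(y_train)
    return lambda x: m

def fit_line(x_train, y_train):
    """Ordinary least squares with a single predictor."""
    mx = statistics.fmean(x_train)
    my = statistics.fmean(y_train)
    sxx = sum((a - mx) ** 2 for a in x_train)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

# Synthetic (made-up) precincts: past turnout drives current turnout plus noise.
rng = random.Random(42)
past = [rng.uniform(0.2, 0.8) for _ in range(200)]
current = [min(max(0.1 + 0.8 * p + rng.gauss(0, 0.03), 0.0), 1.0) for p in past]

rmse_mean = cross_val_rmse(past, current, fit_mean)
rmse_line = cross_val_rmse(past, current, fit_line)
print(f"baseline RMSE: {rmse_mean:.3f}, linear RMSE: {rmse_line:.3f}")
```

Any model that can be wrapped as a `fit` function returning a `predict` function can be dropped into the same loop, which is roughly what a package like caret automates for you.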

You can start by trying

SVM
Neural Nets
Logistic Regression
Beta Regression
Nearest Neighbors

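Each of these can produce a continuous prediction. Beta regression and logistic-type models additionally keep the prediction inside the (0, 1) interval.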
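As a rough illustration of keeping predictions inside (0, 1), the sketch below regresses the logit of turnout on a single predictor and back-transforms with the inverse logit. This is a simple stand-in, not beta regression and not any particular library's implementation. The function names and the small data set are hypothetical; only the 33.33% and 57.65% figures echo the question.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of a proportion, clipped away from exactly 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def inv_logit(z):
    """Map any real number back into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def fit_logit_linear(x, y):
    """Least-squares line on the logit scale; returns a predict function."""
    z = [logit(v) for v in y]
    mx = sum(x) / len(x)
    mz = sum(z) / len(z)
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    slope = sxz / sxx
    intercept = mz - slope * mx
    return lambda new_x: inv_logit(intercept + slope * new_x)

# Hypothetical precincts: past turnout (predictor) and current turnout (target).
past_turnout = [0.30, 0.35, 0.42, 0.50, 0.58, 0.65]
current_turnout = [0.3333, 0.36, 0.45, 0.52, 0.5765, 0.70]

predict = fit_logit_linear(past_turnout, current_turnout)
for x in (0.40, 0.60, 3.0):  # 3.0 is deliberately far outside the training range
    print(f"x = {x:.2f} -> predicted turnout {predict(x):.4f}")
```

A plain linear model fed the same extreme input could return a value above 1, whereas the back-transformed prediction stays a valid proportion. The same trick extends to many predictors with any multiple-regression routine.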

A nice treatment of many of these methods can be found in the book Elements of Statistical Learning (http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf)

edited Jun 10 '14 at 16:56

answered Jun 10 '14 at 15:11

mike1886


You listed several good options for classification but it's not so clear that that is what the OP is looking for. – Jun 10 '14 at 15:31
Thanks, did not see the (0,1) restriction. – mike1886 Jun 10 '14 at 15:35
sorry I meant any value between 0 and 1, meaning .2333, .234,.567, .999, etc. my y variable is a percent, and I would like a percent as output. – CooperBuckeye05 Jun 10 '14 at 16:33
OK, well the answer above still holds. Nothing about these algorithms depends on the range of the dependent variable. – mike1886 Jun 10 '14 at 16:58
It seems some of the examples you posted above are mainly for classification problems (i.e., yes/no or a limited set of categories as the dependent variable) and not regression problems. How can these examples be applied to continuous data? – CooperBuckeye05 Jun 16 '14 at 15:48
SVM, Neural Net, Beta Regression, and Nearest Neighbors are also used for regression. They handle continuous data. – mike1886 Jun 16 '14 at 15:59
