I have a large dataset of 30,000 cases with 150 variables. I am looking for a few possible machine learning solutions/methods that I could try and use for cross validation.

My dependent variable is a continuous percentage, while all my independent variables are either continuous or discrete categorical variables. I need only one output: a predicted percentage (continuous between 0 and 1).

Currently I have run linear, logit and probit models, with probit showing the most promise.

I believe Naïve Bayes is a simple one I have yet to try, but I am wondering what other types of machine learning models would give the desired output.

Update: I am modeling Election Turnout based on past turnout within an individual precinct. One precinct may have 33.33% turnout while another might have 57.65% turnout. The possible outputs could be any continuous percent between 0 and 1.

Any thoughts would be much appreciated! Thanks in advance!

CooperBuckeye05

1 Answer

There are a number of machine learning libraries that will let you try out many different algorithms. If you are using R, look into the caret package (http://caret.r-forge.r-project.org/). It makes cross-validation and model selection straightforward and lets you benchmark the performance of your different algorithms. All the available algorithms are listed at http://caret.r-forge.r-project.org/bytag.html
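To make the cross-validation idea concrete, here is a minimal sketch in plain Python (standard library only, not caret). Everything in it is an assumption for illustration: the synthetic "past turnout" data, the helper names `k_fold_indices`, `fit_mean`, `fit_line` and `cv_rmse`, and the choice of RMSE as the benchmark metric. It compares a predict-the-mean baseline against a one-predictor least-squares line on held-out folds.

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = statistics.fmean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """One-predictor least squares, with predictions clipped to [0, 1]."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: min(1.0, max(0.0, intercept + slope * x))

def cv_rmse(fit, xs, ys, k=5, seed=0):
    """Average held-out RMSE over k folds."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        sq_err = [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
        scores.append(math.sqrt(statistics.fmean(sq_err)))
    return statistics.fmean(scores)

# Synthetic data (an assumption): current turnout tracks past turnout plus noise.
rng = random.Random(42)
past = [rng.uniform(0.2, 0.8) for _ in range(500)]
current = [min(1.0, max(0.0, 0.1 + 0.8 * p + rng.gauss(0, 0.05))) for p in past]

rmse_mean = cv_rmse(fit_mean, past, current)
rmse_line = cv_rmse(fit_line, past, current)
print(f"baseline RMSE: {rmse_mean:.3f}, linear RMSE: {rmse_line:.3f}")
```

Because both models are scored on the same folds, the two RMSE values are directly comparable; any other model can be benchmarked by passing a different `fit` function.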

You can start by trying

  • SVM
  • Neural Nets
  • Logistic Regression
  • Beta Regression
  • Nearest Neighbors

Each of these can be set up to predict a continuous value, which is the output you are after.
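As one illustration of a listed method used for regression, here is a rough nearest-neighbors sketch in plain Python. The two features, the synthetic data, and the function name `knn_predict` are all hypothetical. The prediction is the average of the neighbors' observed values, so when every training target lies in [0, 1] the output does too.

```python
import random
import statistics

def knn_predict(train_x, train_y, query, k=5):
    """Predict by averaging the targets of the k nearest training rows."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    order = sorted(range(len(train_x)), key=lambda i: sq_dist(train_x[i], query))
    return statistics.fmean([train_y[i] for i in order[:k]])

# Synthetic precincts (an assumption): two made-up features per precinct,
# e.g. past turnout and some second rate, both already on comparable scales.
rng = random.Random(7)
X = [[rng.uniform(0.2, 0.8), rng.uniform(0.0, 1.0)] for _ in range(300)]
y = [min(1.0, max(0.0, 0.15 + 0.6 * a + 0.2 * b + rng.gauss(0, 0.03)))
     for a, b in X]

# Predicted turnout for a new precinct; a continuous value, not a class label.
pred = knn_predict(X, y, [0.5, 0.3], k=5)
print(f"predicted turnout: {pred:.3f}")
```

With real data the features would normally be put on a common scale first, since the distance calculation treats every column equally.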

A nice treatment of many of these methods can be found in the book Elements of Statistical Learning (http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf)

mike1886
  • You listed several good options for classification but it's not so clear that that is what the OP is looking for. –  Jun 10 '14 at 15:31
  • Thanks, did not see the (0,1) restriction. – mike1886 Jun 10 '14 at 15:35
  • sorry I meant any value between 0 and 1, meaning .2333, .234,.567, .999, etc. my y variable is a percent, and I would like a percent as output. – CooperBuckeye05 Jun 10 '14 at 16:33
  • OK, well the answer above still holds. Nothing about these algorithms depends on the range of the dependent variable. – mike1886 Jun 10 '14 at 16:58
  • It seems some of the examples you posted above are mainly for classification problems (i.e., yes/no or a limited set of categories as the dependent variable) and not regression problems. How can these examples be applied to continuous data? – CooperBuckeye05 Jun 16 '14 at 15:48
  • SVM, Neural Net, Beta Regression, and Nearest Neighbors are also used for regression. They handle continuous data. – mike1886 Jun 16 '14 at 15:59