24

Is there an R random forest implementation that works well with very sparse data? I have thousands or millions of boolean input variables, but only hundreds or so will be TRUE for any given example.

I'm relatively new to R and noticed that there is a 'Matrix' package for dealing with sparse data, but the standard 'randomForest' package doesn't seem to recognize this data type. If it matters, the input data is going to be produced outside of R and imported.
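For reference, here is a minimal sketch of how externally produced data could end up in the 'Matrix' sparse format. It assumes the export is a list of (row, column) index pairs, one per TRUE cell; the `triplets` data frame and the sizes below are made up for illustration and stand in for whatever `read.table()` or similar would return.

```r
library(Matrix)

# Hypothetical import: one (row, column) pair per TRUE cell, 1-based indices.
# In practice these would be read from the exported file.
triplets <- data.frame(
  row = c(1, 1, 2, 3, 3, 3),
  col = c(2, 5, 1, 2, 4, 6)
)

n_examples <- 3  # assumed number of examples
n_features <- 6  # assumed number of boolean inputs

# Numeric 0/1 sparse matrix; only the nonzero cells are stored.
X <- sparseMatrix(
  i = triplets$row,
  j = triplets$col,
  x = rep(1, nrow(triplets)),
  dims = c(n_examples, n_features)
)

print(class(X))   # storage class chosen by Matrix
print(nnzero(X))  # number of stored TRUE cells
```

The result is a compressed column-oriented matrix, which is the sparse class that most R modelling packages with sparse support expect.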

Any advice? I can also look into using Weka, Mahout or other packages.

Archie
Eryn
  • As far as I can tell, there are no R packages for sparse decision trees. I believe there are algorithms out there for sparse decision trees which, if implemented in R, could be used to build random forests. – Zach Sep 04 '12 at 17:37
  • 2
    Here's a good candidate: http://www.cs.cornell.edu/~nk/fest/. If you can export your data in libsvm format, you can use this command line program. Would love to see an R port... – Zach Sep 04 '12 at 18:10
  • Zach - the link seems to be dead. – Benoit_Plante Jun 18 '13 at 14:39
  • Link worked fine for me – David Marx Jul 02 '13 at 16:42
  • 2
    @cmoibenlepro the link is http://lowrank.net/nikos/fest/ – seanv507 Jul 02 '13 at 16:38
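
The libsvm export mentioned in the comments could look roughly like the sketch below. It assumes the usual `label index:value ...` layout with 1-based feature indices; whether a given command-line tool expects exactly that should be checked against its own documentation. The toy matrix, labels, and the file name `train.libsvm` are placeholders.

```r
library(Matrix)

# Tiny stand-in for the real data: 3 examples, 6 boolean features.
X <- sparseMatrix(i = c(1, 1, 2, 3), j = c(2, 5, 1, 4),
                  x = rep(1, 4), dims = c(3, 6))
y <- c(1, 0, 1)  # hypothetical labels

# Triplet view: one row per stored nonzero, with 1-based row (i) and column (j).
trip <- as.data.frame(summary(X))
trip <- trip[order(trip$i, trip$j), ]

# Build "index:value" tokens and group them by example.
tokens <- paste0(trip$j, ":", trip$x)
by_row <- factor(trip$i, levels = seq_len(nrow(X)))
features <- tapply(tokens, by_row, paste, collapse = " ")
features[is.na(features)] <- ""  # examples with no TRUE features

libsvm_lines <- trimws(paste(y, features))
writeLines(libsvm_lines, "train.libsvm")
```

Only the stored nonzeros are touched, so the export never materialises the dense matrix.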

4 Answers

14

No, there is no RF implementation for sparse data in R. This is partly because RF is not a good fit for this type of problem: bagging and the suboptimal (randomized) selection of splits may spend most of the model's capacity on regions where every feature is zero.

Try a kernel method or, better, convert your data into a denser representation built from a smaller set of descriptors (or apply a dimensionality reduction method).
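
As one possible illustration of the dimensionality reduction route, the sketch below uses a random sign projection. That technique is my choice for the example and is not named in the answer; the synthetic data and the sizes `n`, `p`, `k` are assumptions.

```r
library(Matrix)

set.seed(1)
n <- 200   # examples (assumed)
p <- 5000  # sparse boolean inputs (assumed)
k <- 50    # target dimension (assumed)

# Synthetic sparse 0/1 data standing in for the real input.
X <- rsparsematrix(n, p, density = 0.01, rand.x = function(m) rep(1, m))

# Random +/-1 projection matrix, scaled by 1/sqrt(k) so that squared
# distances are preserved in expectation.
R <- matrix(sample(c(-1, 1), p * k, replace = TRUE), nrow = p, ncol = k) / sqrt(k)

# Dense n x k representation that any standard learner can consume.
Z <- as.matrix(X %*% R)
print(dim(Z))
```

The projected matrix `Z` is small and dense, so it can be passed to packages that do not understand sparse input.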

  • Hack-R's answer points out the xgboost package, which is perfectly able to do random forests with sparse matrices. – Edgar Jun 24 '19 at 09:50
8

Actually, yes there is.

It's xgboost, which was built for eXtreme gradient boosting. For a lot of folks it is currently the package of choice for fitting models on sparse matrices in R and, as the link above explains, you can use it as a random forest by adjusting its parameters.
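
A rough sketch of what "tweaking the parameters" can mean: grow many trees in a single boosting round, with row and column subsampling and no shrinkage. The data is synthetic and the specific values (forest size, sampling rates, depth) are illustrative assumptions, not recommendations from this answer.

```r
library(Matrix)
library(xgboost)

set.seed(1)
n <- 500
p <- 2000

# Synthetic sparse 0/1 inputs standing in for the real data.
X <- rsparsematrix(n, p, density = 0.01, rand.x = function(m) rep(1, m))

# Hypothetical label that depends on the first 20 columns, plus noise.
y <- as.numeric(rowSums(X[, 1:20]) + rnorm(n, sd = 0.3) > 0.2)

# xgb.DMatrix accepts the sparse matrix directly.
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(
  objective = "binary:logistic",
  eta = 1,                  # no shrinkage: the trees are averaged, not boosted
  num_parallel_tree = 200,  # forest size
  subsample = 0.63,         # row sampling per tree
  colsample_bynode = 0.2,   # column sampling at each split
  max_depth = 8
)

# A single boosting round containing the whole forest.
rf <- xgb.train(params = params, data = dtrain, nrounds = 1)
pred <- predict(rf, dtrain)  # predicted probabilities on the training data
```

Keeping `nrounds = 1` is what makes this a forest rather than a boosted ensemble; raising it would boost forests on top of each other.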

Hack-R
5

The R package "ranger" should do.

https://cran.r-project.org/web/packages/ranger/ranger.pdf

A fast implementation of Random Forests, particularly suited for high dimensional data.

It is probably the fastest RF implementation I have seen, and noticeably faster than randomForest. It also handles categorical variables natively.
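
For the sparse case in the question, a minimal sketch could look like the following. It assumes a ranger version whose `x`/`y` interface accepts a 'dgCMatrix' from the Matrix package, which is worth confirming in the manual for the installed version; the data and column names are synthetic.

```r
library(Matrix)
library(ranger)

set.seed(1)
n <- 300
p <- 1000

# Synthetic sparse 0/1 inputs standing in for the real data.
X <- rsparsematrix(n, p, density = 0.02, rand.x = function(m) rep(1, m))
colnames(X) <- paste0("f", seq_len(p))  # made-up feature names

# Hypothetical class label: TRUE if any of the first 20 features is set.
y <- factor(rowSums(X[, 1:20]) > 0)

# Pass the sparse matrix through the x/y interface (not the formula interface).
rf <- ranger(x = X, y = y, num.trees = 100)
print(rf$prediction.error)  # out-of-bag error estimate
```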

amitos
-4

There is a blog called Quick-R that should help you with the basics of R.

R works with packages, and each package does something different. There is a package called "randomForests" that should be just what you are asking for.

Be aware that sparse data will cause problems no matter what method you apply. To my knowledge it is very much an open problem, and data mining in general is more an art than a science. Random forests do very well overall, but they are not always the best method. You may want to try a neural network with many layers; that might help.

Vincent
  • 4
    No, randomForest is notoriously bad with sparse data, hence the whole question. The classwt parameter is not properly implemented throughout randomForest. Manual oversampling is one approach, but it messes up the OOB error. By the way, the package is not called 'randomForests'. – smci Jun 09 '15 at 21:06
  • 1
    The parts of this that are true are not answers to the question. – Sycorax Dec 18 '15 at 22:08