I have a lot of data that can be used to train a model, so much that I am not sure my computer (16 GB RAM) can handle all of it at once. Given this hardware limitation, what are some ways to build a gradient boosted tree model or a neural network?

Are there ways to train the model using subsets of the data? I can break my data into subsets if needed, and I can pull as much or as little data as I want at a time from my database (so the files themselves can be small). I have been using lightgbm for boosting and plan to use keras for the NN.

Thanks!

confused
  • It's not hard to use an iterator to randomly sample observations from a database or read data from a drive or S3 bucket, and then use that sampling to do SGD using a neural network. You can similarly use subsampling to build boosted trees, but I'm not certain of the convergence conditions when the batch size is small. Someone with deeper knowledge of boosting would know about that. – Sycorax Apr 06 '20 at 01:42
  • Thanks. I figured it would be easier to implement with NNs, since odds are we are optimizing on mini-batches anyway, but I like that boosted trees give me variable importance, which I am not sure how to extract from NNs. Eventually I want to remove unimportant features. I saw a post about determining variable importance from the first layer's weights on each feature, but that's about it. Could we perhaps use dropout as the first layer (before any hidden layer) and apply the same methodology we use in random forests to determine variable importance? – confused Apr 06 '20 at 03:51
  • That sounds like a different question than the one you ask here. Perhaps you could ask it in a new Question, but also search our archives: https://stats.stackexchange.com/questions/tagged/neural-networks+feature-selection – Sycorax Apr 06 '20 at 05:19
  • I know, but that's my rationale for using GBM: I need to determine which features actually matter, and I have heard that NNs are better for perceptual problems, while mine is just basic multi-class classification with quantitative inputs. I'll ask it once I start building NNs; I'm not ready yet. Thanks. – confused Apr 06 '20 at 05:51
  • Check if [Lightning Memory-Mapped Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) helps. – doubllle Apr 06 '20 at 20:36
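
The iterator idea from the first comment can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not anyone's actual setup: an in-memory SQLite table named `obs` stands in for the real database, the helper names (`make_demo_db`, `batch_iterator`, `sgd_logistic`) are hypothetical, and a hand-written logistic regression stands in for the neural network so the example runs with the standard library alone. The point is only that each SGD step touches one small batch, so the full table never has to fit in RAM.

```python
import math
import random
import sqlite3


def make_demo_db(n_rows=2000, seed=0):
    """Build a small in-memory table standing in for the real database."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE obs (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL, y INTEGER)"
    )
    rows = []
    for i in range(n_rows):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Synthetic binary label: a noisy linear rule in x1 and x2.
        y = 1 if 2.0 * x1 - 1.0 * x2 + rng.gauss(0, 0.5) > 0 else 0
        rows.append((i, x1, x2, y))
    conn.executemany("INSERT INTO obs VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def batch_iterator(conn, batch_size, n_batches, seed=0):
    """Yield random mini-batches; only one batch is held in memory at a time."""
    rng = random.Random(seed)
    n_rows = conn.execute("SELECT COUNT(*) FROM obs").fetchone()[0]
    placeholders = ",".join("?" * batch_size)
    query = f"SELECT x1, x2, y FROM obs WHERE id IN ({placeholders})"
    for _ in range(n_batches):
        ids = rng.sample(range(n_rows), batch_size)  # ids run from 0 to n_rows - 1
        yield conn.execute(query, ids).fetchall()


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(batches, lr=0.5):
    """Mini-batch SGD on the logistic loss, one gradient step per batch."""
    w, b = [0.0, 0.0], 0.0
    for batch in batches:
        gw, gb = [0.0, 0.0], 0.0
        for x1, x2, y in batch:
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - y
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        n = len(batch)
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


def accuracy(conn, w, b):
    """Score the fitted weights; rows are streamed from the cursor one by one."""
    correct = total = 0
    for x1, x2, y in conn.execute("SELECT x1, x2, y FROM obs"):
        pred = 1 if sigmoid(w[0] * x1 + w[1] * x2 + b) >= 0.5 else 0
        correct += int(pred == y)
        total += 1
    return correct / total


conn = make_demo_db()
w, b = sgd_logistic(batch_iterator(conn, batch_size=64, n_batches=300))
acc = accuracy(conn, w, b)
print("weights:", w, "bias:", b, "accuracy:", acc)
```

With a real database, only `batch_iterator` would need to change, and the same kind of generator could feed a Keras model instead of the hand-written update loop. Whether comparably small batches work well for boosted trees is the open point raised in that comment, so this sketch does not cover that case.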

0 Answers