Strategy to analyze a large dataset (20 million rows and 200 columns) to predict a single variable

Asked Jan 29 '19 at 05:31

Active Jan 29 '19 at 07:26

I am curious how data scientists approach very large datasets when building a regression model for a target variable y.

How does one decide where to start? How can a large number of columns be reduced without the benefit of domain knowledge? Aside from basic steps such as removing columns that are mostly null or that hold a single value, what other steps do data scientists usually take?
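For concreteness, the basic screening mentioned above might look like the following minimal sketch. It assumes pandas is available; the function name `screen_columns`, the 50% null threshold, and the toy data are all hypothetical choices for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd


def screen_columns(df, max_null_frac=0.5):
    """Drop columns that are mostly null or hold at most one distinct value.

    max_null_frac is an assumed cutoff; pick it to suit the data.
    Returns the reduced frame and the sorted list of dropped column names.
    """
    # Fraction of missing values per column.
    null_frac = df.isna().mean()
    mostly_null = null_frac[null_frac > max_null_frac].index.tolist()

    # Number of distinct non-null values per column.
    n_unique = df.nunique(dropna=True)
    constant = n_unique[n_unique <= 1].index.tolist()

    drop = sorted(set(mostly_null) | set(constant))
    return df.drop(columns=drop), drop


# Toy data standing in for the real table (hypothetical values).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% null
    "c": [7, 7, 7, 7],                   # single value
    "y": [0.1, 0.2, 0.3, 0.4],
})
kept, dropped = screen_columns(df)
print(dropped)
print(list(kept.columns))
```

With 20 million rows, the same checks could presumably be run on a random sample or chunk by chunk rather than on the full table at once.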

edited Jan 29 '19 at 07:26

asked Jan 29 '19 at 05:31

Adurthi Ashwin Swarup

Given your sample size, 200 variables is not that many. What do you want: an explanatory or a predictive model? You can simply use regularized regression. – Tim Jan 29 '19 at 05:57
How does regularized regression help reduce the manual work of going through 200 columns? – Adurthi Ashwin Swarup Jan 29 '19 at 07:27
Regularization with an L1 penalty (the lasso) does automatic feature selection: https://stats.stackexchange.com/q/4272/35989 – Tim Jan 29 '19 at 07:36
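To illustrate the point made in the comments, here is a minimal sketch of L1-regularized regression zeroing out uninformative columns. It assumes scikit-learn and NumPy; the synthetic data, the sizes, and the penalty `alpha=0.1` are made up for the example. In practice the penalty would normally be tuned, for instance by cross-validation with `LassoCV`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 20 columns, only the first two
# actually drive y (hypothetical setup for illustration).
rng = np.random.default_rng(0)
n_rows, n_cols = 2000, 20
X = rng.normal(size=(n_rows, n_cols))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_rows)

# Put columns on a common scale so the penalty treats them comparably.
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty shrinks coefficients and sets weak ones exactly to zero.
model = Lasso(alpha=0.1).fit(X_scaled, y)

# Indices of columns whose coefficient survived the penalty.
selected = np.flatnonzero(model.coef_).tolist()
print(selected)
```

The columns with non-zero coefficients are the ones the model kept, which is what replaces the manual pass over every column. A ridge (L2) penalty, by contrast, shrinks coefficients without setting them exactly to zero.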

0 Answers