I have a very large, very sparse data set: 79 features (74 categorical and 5 numerical) over 1,500,000 rows, i.e. a 1,500,000 x 79 matrix.
The data set is structured as follows:

phone model, carrier, day of the week, time range, time to landing page, time to external redirect, screen size, screen megapixels, user subscribed.

The categorical features are: phone model (iPhone 5, Samsung A6, ...), carrier (TIM, Vodafone, ...), day of the week (Monday ... Sunday), time range (00:00-08:00, 08:01-12:00, ...), screen size (several sizes), screen megapixels (several values) and user subscribed (yes or no).
All these categorical features have been one-hot encoded into separate columns, each containing 0 or 1.
Example:

iPhone 5, Samsung A6, ..., Vodafone, ..., Monday, ..., Time 00-08, ..., Screensize 3", ..., Megapixel 1M, ...
1, 0, ..., 1, ..., 1, ..., 1, ..., 1, ..., 1

This row means: an iPhone 5 on Vodafone, recorded on Monday between 00:00 and 08:00, with a 3" screen and a 1-megapixel screen.
There are 74 such 0/1 columns, and in any given row most of them are 0.
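To make the setup concrete, here is a minimal sketch of the kind of encoding I mean, using pandas get_dummies on a toy frame (the column names and values are placeholders, not the real ones):

```python
import pandas as pd

# Toy frame with placeholder values; the real data set has 74 categorical
# and 5 numerical features over 1,500,000 rows.
df = pd.DataFrame({
    "phone_model": ["iPhone 5", "Samsung A6", "iPhone 5"],
    "carrier":     ["Vodafone", "TIM", "Vodafone"],
    "day_of_week": ["Monday", "Sunday", "Monday"],
    "subscribed":  [1, 0, 1],
})

# One-hot encode the categorical columns into 0/1 indicator columns;
# sparse=True keeps the memory footprint small for a mostly-zero matrix.
encoded = pd.get_dummies(
    df.drop(columns="subscribed"),
    columns=["phone_model", "carrier", "day_of_week"],
    dtype="uint8",
    sparse=True,
)
print(encoded.head())
```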
The value to optimise is the subscription: we know whether each user subscribed to the service, and we want to know which features maximise the chance of a subscription.
Linear regression completes successfully, but the coefficients are tiny (on the order of 10^-9 or less) and the intercept is very close to 1. All the features have been normalized with min-max scaling. A chi-square test says it does not work.
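Schematically, what I am doing looks like the sketch below, on random placeholder data. I am assuming scikit-learn's MinMaxScaler, LinearRegression and feature_selection.chi2 here; the real pipeline may differ in the details:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import chi2

# Random placeholder data standing in for the real matrix
# (1,500,000 rows x ~79 mostly-binary columns) and the subscribed flag.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 10)).astype(float)
y = rng.integers(0, 2, size=1000)

# Min-max normalisation applied to all features.
X_scaled = MinMaxScaler().fit_transform(X)

lin = LinearRegression().fit(X_scaled, y)
print("intercept:", lin.intercept_)
print("coefficients:", lin.coef_)

# chi2 scores each (non-negative) feature against the binary target
# and returns the statistic and the p-value per column.
scores, p_values = chi2(X_scaled, y)
print("chi2 p-values:", p_values)
```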
It's a mess, and I'm starting to think this is not the right path to the result: the data set is huge and has so many categorical features.
Does it make sense to apply linear regression to this data set?
I'm thinking that the best option is a random forest, which can actually tell which features help to reach the target value.
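Something along these lines is what I have in mind, again just a sketch on placeholder data, assuming scikit-learn's RandomForestClassifier and its feature_importances_ attribute:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder 0/1 feature matrix and binary "subscribed" target;
# the real data has ~79 columns and 1,500,000 rows.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 10)).astype(float)
y = rng.integers(0, 2, size=1000)
feature_names = [f"col_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# One importance score per column; higher means the column contributed
# more to the trees' splits.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(name, round(importance, 4))
```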
Is it possible that linear regression is failing due to numerical approximation?

What is the best way to maximise subscriptions?