I have a very large, very sparse data set: 79 features (74 categorical and 5 numerical) over 1,500,000 rows, i.e. a 1,500,000 x 79 matrix.
The data set is structured as follows:

phone model, carrier, day of the week, time range, time to landing page, time to external redirect, screen size, screen megapixels, user subscribed.

The categorical features are: phone model (iPhone 5, Samsung A6, ...), carrier (TIM, Vodafone, ...), day of the week (Monday ... Sunday), time range (00:00-08:00, 08:01-12:00, ...), screen size (several sizes), screen megapixels (several values) and user subscribed (yes or no).
All these categorical features have been one-hot encoded into separate columns, each containing 0 or 1.
Example:

iPhone 5, Samsung A6, ..., Vodafone, ..., Monday, ..., Time 00-08, ..., Screensize 3", ..., Megapixel 1M, ...
1, 0, ..., 1, ..., 1, ..., 1, ..., 1, ..., 1

This row means: an iPhone 5 on Vodafone, recorded on Monday between 00:00 and 08:00, with a 3" screen and a 1-megapixel screen.
There are 74 such 0/1 columns, and in any given row most of them are 0.
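To make the setup concrete, here is a minimal sketch of the kind of encoding I mean, using pandas get_dummies on a toy frame (the column names and values are placeholders, not the real ones):

```python
import pandas as pd

# Toy frame with placeholder values; the real data set has 74 categorical
# and 5 numerical features over 1,500,000 rows.
df = pd.DataFrame({
    "phone_model": ["iPhone 5", "Samsung A6", "iPhone 5"],
    "carrier":     ["Vodafone", "TIM", "Vodafone"],
    "day_of_week": ["Monday", "Sunday", "Monday"],
    "subscribed":  [1, 0, 1],
})

# One-hot encode the categorical columns into 0/1 indicator columns;
# sparse=True keeps the memory footprint small for a mostly-zero matrix.
encoded = pd.get_dummies(
    df.drop(columns="subscribed"),
    columns=["phone_model", "carrier", "day_of_week"],
    dtype="uint8",
    sparse=True,
)
print(encoded.head())
```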
The value to optimise is the subscription: we know whether each user subscribed to the service, and we want to know which features maximise the chance of a subscription.
Linear regression completes successfully, but the coefficients are tiny (on the order of 10^-9 or less) and the intercept is very close to 1. All the features have been normalized with min-max scaling. A chi-square test says it does not work.
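Schematically, what I am doing looks like the sketch below, on random placeholder data. I am assuming scikit-learn's MinMaxScaler, LinearRegression and feature_selection.chi2 here; the real pipeline may differ in the details:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import chi2

# Random placeholder data standing in for the real matrix
# (1,500,000 rows x ~79 mostly-binary columns) and the subscribed flag.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 10)).astype(float)
y = rng.integers(0, 2, size=1000)

# Min-max normalisation applied to all features.
X_scaled = MinMaxScaler().fit_transform(X)

lin = LinearRegression().fit(X_scaled, y)
print("intercept:", lin.intercept_)
print("coefficients:", lin.coef_)

# chi2 scores each (non-negative) feature against the binary target
# and returns the statistic and the p-value per column.
scores, p_values = chi2(X_scaled, y)
print("chi2 p-values:", p_values)
```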
It's a mess, and I'm starting to think this is not the right path to the result: the data set is huge and has so many categorical features.
Does it make sense to apply linear regression to this data set?
I'm thinking that the best option is a random forest, which can actually tell which features help to reach the target value.
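Something along these lines is what I have in mind, again just a sketch on placeholder data, assuming scikit-learn's RandomForestClassifier and its feature_importances_ attribute:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder 0/1 feature matrix and binary "subscribed" target;
# the real data has ~79 columns and 1,500,000 rows.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 10)).astype(float)
y = rng.integers(0, 2, size=1000)
feature_names = [f"col_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# One importance score per column; higher means the column contributed
# more to the trees' splits.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(name, round(importance, 4))
```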
Is it possible that linear regression is failing due to numerical approximation?

What is the best way to maximise subscriptions?