Hello, I am working on a regression problem with a data set where my predictor variables are in the tens of thousands, but my response variable is a percentage.

I tried to fit a linear regression model, but it does not give a good R-squared value (R-squared is 0.29). Now I am wondering whether this needs to be tackled in a different way, or whether a linear model is simply not the right model to use here.

Can someone please suggest what I should do when my response variable is a percentage?

Below is what my data looks like. I have converted the percentages to decimals.

Month      VAR1         VAR2        VAR3        VAR4        VAR5            VAR6            VAR7        VAR8        VAR9            Top
2012-04-01  57145.35    10735.39    0           39930.91    0               12903.3         2.19        688504      475963          0.3
2012-05-01  56958.42    10741.27    0           8709.6      0               12767.3         2.18        1233748     109351          0.27
2012-06-01  0               0       0           17117.38    0                   0           2.06        830140      345306          0.31
2012-07-01  184470.1        0       0           5624.8      0                50386.7        2.04        1531915     250523          0.34
2012-08-01  93913.91    56048.32    0           0           26100           128562.61       2.18        99423       82818           0.31
2012-09-01  43623.7         0       0           0           2336.85         27101.62        2.18        7447        15842           0.3
2012-10-01  20314.8     50756       0           0           0               13950           2.15        556411      364217          0.32
2012-11-01  0               0       0           1618.39     1250                0           2.09        435190      119554          0.3
2012-12-01  74225.56    177199.56   0           0           350.4           12600           2.09        232469      34485           0.28
2013-01-01  107002.74   145564      0           0           10800           45000.2         2.11        176366      163140          0.32
2013-02-01  45692.08    1500        0           0           16500           67102.11        2.17        452578      226958          0.31
2013-03-01  167979      30418.16    0           0           48850           99286.75        2.18        728229      296780          0.28
2013-04-01  68040   6   2370.4      1420        1200        350             22500           2.09        880588      622676          0.31
2013-05-01  143796.3    58218       0           0           38418.3         51148.25        2.06        669520      14515           0.29
2013-06-01  54942.5     33519.34    0           0           23590.8         14255.12        2.13        439199      357560          0.35
2013-07-01  178708.5    62270.53    39214.430   0           19885           8549.86         2.23        395180      372332          0.27

Top is the response variable here

Thanks a lot in advance!!

Tushar Mehta
  • Logistic regression. Though you should be sure to check model assumptions before drawing inferences. – Demetri Pananos May 30 '19 at 04:30
  • @DemetriPananos If I am not wrong, logistic regression is used for binomial data. Are you suggesting that I convert the response variable into a binary class? Sorry, I am new to modelling and I am trying to learn on the go. – Tushar Mehta May 30 '19 at 05:06
  • Other related Q&A's exist on the site, if you do a search, e.g., https://stats.stackexchange.com/q/261790/241093 – AlexK May 30 '19 at 05:32
  • As presented Top is of the order of 0.3. Can you confirm that means 0.3% or do you really have a proportion 0.3 or 30%? It makes a difference! – Nick Cox May 30 '19 at 06:18
  • @NickCox This was actually 30%, which I converted to 0.3. For example, 30% of the population chose a certain product. Could you please explain what difference it makes? – Tushar Mehta May 30 '19 at 09:20
  • If the range is [0, 1] then values about 0.3 are near the middle and almost any link function will give similar results. If the range is [0, 100] then values about 0.3 are close to zero and results depend mightily on the link function (identity, logit, cloglog, whatever). If your scale is 0 to 1 then you should not, for clarity, describe that as a percentage. – Nick Cox May 30 '19 at 09:38
  • Are these all the data or just a token? – Nick Cox May 30 '19 at 09:39
  • @NickCox This is just a token. I can share the entire data if you would like to see it. The original response variable is a percentage. As I said, it is the percentage of people in the entire population who chose a specific product out of seven other products. The minimum value of the response is 26% and the maximum is 36%. – Tushar Mehta May 31 '19 at 00:33

1 Answer

I think this problem can be solved using beta regression, since your outcome is bounded on both sides.

Check this out: Beta-Regression-With-R
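
To make the suggestion a little more concrete, here is a minimal Python sketch of the likelihood that beta regression works with, using only the Top values from the snippet in the question. It is an illustration under stated assumptions, not a full fit: it is intercept-only, and it uses method-of-moments estimates rather than maximum likelihood. The names `beta_logpdf`, `b0` and `phi` are made up for this example. A real beta regression would replace the constant `b0` with a linear predictor in VAR1 to VAR9 and maximize the log-likelihood numerically, which is the job of dedicated software such as the `betareg` package in R.

```python
import math

# "Top" proportions copied from the data snippet in the question.
top = [0.30, 0.27, 0.31, 0.34, 0.31, 0.30, 0.32, 0.30,
       0.28, 0.32, 0.31, 0.28, 0.31, 0.29, 0.35, 0.27]


def logit(p):
    """Map a proportion in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Inverse of logit: map a real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def beta_logpdf(y, mu, phi):
    """Log density of Beta(mu * phi, (1 - mu) * phi).

    This is the mean-precision form: mu is the mean, phi the precision.
    """
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))


n = len(top)
sample_mean = sum(top) / n
sample_var = sum((y - sample_mean) ** 2 for y in top) / (n - 1)

# Assumption: intercept-only model with method-of-moments estimates.
# Var(Y) = mu * (1 - mu) / (1 + phi), solved for phi.
b0 = logit(sample_mean)
phi = sample_mean * (1.0 - sample_mean) / sample_var - 1.0

mu = expit(b0)  # the fitted mean always lands inside (0, 1)
loglik = sum(beta_logpdf(y, mu, phi) for y in top)

print(f"fitted mean = {mu:.4f}, precision phi = {phi:.1f}")
print(f"log-likelihood = {loglik:.2f}")
```

Because the mean passes through `expit`, predictions cannot leave the (0, 1) range, which is the property the answer relies on. Whether that improves the fit is a separate question.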

Nick Cox
Kunal
  • 1
    Depends what "this problem" is. I think the OP is mostly disappointed with fit. I agree that a regression that respects the range of the outcome is, with nothing else said, likely to be preferable to one that doesn't. But in this case the outcome is roughly constant and a glance at the data suggests that low predictive power is likely whatever the model used. – Nick Cox May 30 '19 at 10:55
  • The output may not be as constant as it looks, since this is just a snippet of data – Kunal May 31 '19 at 04:49
  • Since my comment the range has been explained as from 0.26 to 0.36. Still, my hunch remains that beta regression won't yield a much better fit. – Nick Cox May 31 '19 at 06:26
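
The points made in these comments can be checked roughly on the 16 rows shown in the question. The Python sketch below computes the spread of Top and compares the R-squared of a one-predictor least-squares fit on the raw scale with the same fit on the logit scale. Two assumptions to flag: VAR7 is picked arbitrarily as the single predictor for illustration, and the helper names (`r_squared`, `logit`) are made up here. It uses only the posted snippet, so it says nothing definite about the full data set.

```python
import math

# Columns copied from the data snippet in the question (16 months).
var7 = [2.19, 2.18, 2.06, 2.04, 2.18, 2.18, 2.15, 2.09,
        2.09, 2.11, 2.17, 2.18, 2.09, 2.06, 2.13, 2.23]
top = [0.30, 0.27, 0.31, 0.34, 0.31, 0.30, 0.32, 0.30,
       0.28, 0.32, 0.31, 0.28, 0.31, 0.29, 0.35, 0.27]


def mean(values):
    return sum(values) / len(values)


def r_squared(x, y):
    """R-squared of a one-predictor least-squares fit.

    For simple regression this equals the squared Pearson correlation.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def logit(p):
    return math.log(p / (1.0 - p))


# Sample standard deviation of the response: how much is there to explain?
top_mean = mean(top)
top_sd = math.sqrt(sum((t - top_mean) ** 2 for t in top) / (len(top) - 1))

# Same predictor, response on the raw scale vs. the logit scale.
r2_raw = r_squared(var7, top)
r2_logit = r_squared(var7, [logit(t) for t in top])

print(f"Top: mean = {top_mean:.3f}, sd = {top_sd:.3f}")
print(f"R-squared, raw scale:   {r2_raw:.3f}")
print(f"R-squared, logit scale: {r2_logit:.3f}")
```

Over a range as narrow as 0.26 to 0.36 the logit is close to a straight line, so the two R-squared values should come out nearly the same. That is consistent with the hunch that changing the link or the error distribution alone is unlikely to rescue a weak fit.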