I am currently fitting Poisson models with N=16,000. My study requires an $R^2$ for each model, which I compute with the 'rsq' package. When I add P12, the $R^2$ decreases, as shown below.

glm(DV ~ P1 + P2 + .. + P11, family=poisson)
R-squared = 0.7134144

glm(DV ~ P1 + P2 + .. + P12, family=poisson)
R-squared = 0.6956673

I read a similar question, which suggests checking for NA values. However, my dataset contains no NA values. Any idea why this happens and how to fix it? Thanks.

Related question: Adding a linear regression predictor decreases R squared

Edited:

@Robert Long P3 through P7 are dummy variables.

'data.frame':   16002 obs. of  14 variables:
 $ DV      : num  1527 1118 998 499 121 ...
 $ P1      : Factor w/ 135 levels "Alor Gajah            ",..: 123 60 108 101 43 95 82 116 132 125 ...
 $ P2      : num  49.7 62.1 52.1 124.7 258.3 ...
 $ P3      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ P4      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ P5      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ P6      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ P7      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ P8      : num  92525 92525 92525 92525 92525 ...
 $ P9      : num  -2.36 -2.36 -2.36 -2.36 -2.36 ...
 $ P10     : num  -1.17 -1.17 -1.17 -1.17 -1.17 ...
 $ P11     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ P12     : num  -3.51 -3.51 -3.51 -3.51 -3.51 ...


DV
Min.   :    0.0
1st Qu.:    0.0
Median :    7.0
Mean   :  120.7
3rd Qu.:   47.0
Max.   :43407.0

P1
A      :  126
B      :  126
C      :  126
D      :  126
E      :  126
F      :  126
(Other):15246

P2
Min.   :   6.559
1st Qu.: 240.023
Median : 764.723
Mean   : 831.344
3rd Qu.:1412.810
Max.   :2087.008

P3
Min.   :0.0000000
1st Qu.:0.0000000
Median :0.0000000
Mean   :0.0004999
3rd Qu.:0.0000000
Max.   :1.0000000

P4
Min.   :0.000000
1st Qu.:0.000000
Median :0.000000
Mean   :0.007374
3rd Qu.:0.000000
Max.   :1.000000

P5
Min.   :0.0000
1st Qu.:0.0000
Median :0.0000
Mean   :0.0035
3rd Qu.:0.0000
Max.   :1.0000

P6
Min.   :0.0000000
1st Qu.:0.0000000
Median :0.0000000
Mean   :0.0004999
3rd Qu.:0.0000000
Max.   :1.0000000

P7
Min.   :0.00000
1st Qu.:0.00000
Median :0.00000
Mean   :0.05899
3rd Qu.:0.00000
Max.   :1.00000

P8
Min.   :  8368
1st Qu.: 34914
Median : 65589
Mean   :103226
3rd Qu.:129808
Max.   :919610

P9
Min.   :-3.5429
1st Qu.:-1.5107
Median :-0.9412
Mean   :-1.0175
3rd Qu.:-0.6540
Max.   : 0.6282

P10
Min.   :-2.8600
1st Qu.:-1.7614
Median :-1.3867
Mean   :-1.3363
3rd Qu.:-0.9275
Max.   :-0.0803

P11
Min.   :-6.838
1st Qu.:-5.258
Median :-1.058
Mean   :-2.560
3rd Qu.: 0.000
Max.   : 0.000

P12
Min.   :-6.6442
1st Qu.:-4.1593
Median :-3.4253
Mean   :-3.2055
3rd Qu.:-2.1378
Max.   : 0.2526
Ferdi
  • Please post the output of `str(mydata)` – Robert Long Aug 29 '18 at 12:56
  • If this is a generalised linear model then there is no $R^2$ in the usual sense, so what you are showing us must be one of the many attempts at producing a pseudo-$R^2$. The answer may depend on which one is involved, although I am not an authority on them. – mdewey Aug 29 '18 at 13:01
  • @RobertLong I've included it in the post. – Hakim Danial Aug 29 '18 at 13:47
  • @mdewey If what you are saying is true, I will need to read more on that topic. – Hakim Danial Aug 29 '18 at 13:47
  • Please also post the output of `summary(mydata)` – Robert Long Aug 29 '18 at 13:51
  • It would also be useful to describe the data. What are the variables, and is it experimental or observational? – Robert Long Aug 29 '18 at 13:53
  • @RobertLong I've included the summary. I think it is an observational study. I am fitting a destination-constrained migration flow model. Flow is my DV; the predictors are destination (P1), distance (P2), change in settlement (P3 - P7), origin's population (P8), and origin's socio-economic characteristics (P9 - P12). – Hakim Danial Aug 29 '18 at 14:20
  • A useful source for pseudo-$R^2$ is https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/ although it does not seem to discuss the issue you raise in your question. – mdewey Aug 30 '18 at 09:44
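
To illustrate the point raised in the comments that a GLM has several competing pseudo-$R^2$ definitions, here is a minimal base-R sketch on simulated data (not the asker's dataset). The variables `x1`, `x2` and the helpers `r2_dev` and `r2_sse` are hypothetical names introduced for this example, and neither helper is claimed to be what the 'rsq' package computes by default.

```r
# Simulated data: hypothetical, only to compare two pseudo-R^2 definitions
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)                                   # unrelated to the response
y  <- rpois(n, lambda = exp(0.5 + 0.8 * x1))

m_small <- glm(y ~ x1,      family = poisson)    # nested in m_large
m_large <- glm(y ~ x1 + x2, family = poisson)

# Deviance-based pseudo-R^2: 1 - residual deviance / null deviance
r2_dev <- function(m) 1 - deviance(m) / m$null.deviance

# Sum-of-squares pseudo-R^2 on the response scale: 1 - SSE / SST
r2_sse <- function(m) {
  y_obs <- m$y
  1 - sum((y_obs - fitted(m))^2) / sum((y_obs - mean(y_obs))^2)
}

print(c(dev_small = r2_dev(m_small), dev_large = r2_dev(m_large)))
print(c(sse_small = r2_sse(m_small), sse_large = r2_sse(m_large)))
```

For nested models fitted by maximum likelihood, the deviance-based value cannot decrease when a predictor is added, because the residual deviance cannot increase. The sum-of-squares version carries no such guarantee, since a Poisson fit does not minimise squared error, so which definition is in use matters when comparing models.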

1 Answer

Avoid stepwise procedures (fitting a model and then adding or removing variables depending on model fit). If P3-P7 are dummy variables, then you should include them all. With 12 variables (5 of which are dummies, presumably for 1 categorical variable with 6 levels) you should be able to specify the model a priori.

Some initial exploratory data analysis (correlations, plots and other visualisations) may help to identify possible problems and will generally help to inform the modelling process.
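
A rough sketch of such an exploratory pass in base R might look like the following. The data frame here is a small simulated stand-in that borrows a few column names from the question; its values are made up for illustration and are not the real data.

```r
# Hypothetical stand-in for the real data frame (values are simulated)
set.seed(42)
mydata <- data.frame(
  DV = rpois(200, lambda = 20),
  P2 = runif(200, 5, 2000),
  P8 = runif(200, 8000, 900000),
  P9 = rnorm(200)
)

# Pairwise Pearson correlations among the numeric columns
num_cols <- sapply(mydata, is.numeric)
cor_mat  <- cor(mydata[, num_cols])
print(round(cor_mat, 2))

# Distribution of the response, then a scatterplot matrix
hist(mydata$DV, main = "Distribution of DV", xlab = "DV")
pairs(mydata[, num_cols])
```

With the real data, strongly correlated predictors or a heavily skewed response would show up at this stage, before any model is fitted.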

Your dependent variable does not seem like a typical count variable, especially with a maximum value of 43407 (is this a real/valid data point?). It may be better just to use lm:

m0 <- lm(DV~., data=mydata)

and from there inspect the residual plot and other diagnostic plots before proceeding.
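
As a minimal sketch of that inspection step, assuming a small simulated `mydata` in place of the real data frame (the columns and coefficients below are invented for illustration):

```r
# Hypothetical data: two predictors and a roughly linear response
set.seed(7)
mydata <- data.frame(P2 = runif(300, 5, 2000), P9 = rnorm(300))
mydata$DV <- 50 + 0.02 * mydata$P2 + 5 * mydata$P9 + rnorm(300, sd = 10)

m0 <- lm(DV ~ ., data = mydata)

# Default lm diagnostics in a 2 x 2 grid: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(m0)
par(op)   # restore the previous plotting layout
```

A funnel shape in the residuals-vs-fitted panel or strong curvature in the Q-Q panel would suggest that a transformation or a different model family is worth considering.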

Robert Long