How to create a reliable regression model with a large number of variables and a few observations in R

Question

I am a newbie with R and I am trying to create a model that explains sales value. In particular, I want to explain how this series of variables (downloaded from http://data.un.org/ and merged with Excel) impacts my sales value. To do this, I use linear regression (the *lm()* function) in R. My dataset is small relative to the number of variables I have:

frame_data --> 27 obs of 40 Variables

When I run the model:

linearMod <- lm(`Net Sales Value (Net of Inv. Disc.) - US Dollar` ~. ,data=frame_data)
summary(linearMod)

the result is this:

Call:
lm(formula = `Net Sales Value (Net of Inv. Disc.) - US Dollar` ~ 
    ., data = numeric_frame_data)

Residuals:
ALL 14 residuals are 0: no residual degrees of freedom!

Coefficients: (26 not defined because of singularities)
                                                                       Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                          -4.946e-09         NA      NA       NA
`Net Sales Quantity`                                                  2.411e-14         NA      NA       NA
`Net Sales Value (Net of Inv. Disc.) - Euro ACT`                      1.130e+00         NA      NA       NA
`Population aged 0 to 14 years old (percentage)`                     -3.121e-11         NA      NA       NA
`Population aged 60+ years old (percentage)`                          3.428e-11         NA      NA       NA
`Population density`                                                 -7.512e-13         NA      NA       NA
`Population mid-year estimates (millions)`                            4.933e-09         NA      NA       NA
`Population mid-year estimates for females (millions)`               -5.170e-09         NA      NA       NA
`Population mid-year estimates for males (millions)`                 -4.684e-09         NA      NA       NA
`Sex ratio (males per 100 females)`                                   2.526e-11         NA      NA       NA
`Surface area (thousand km2)`                                        -8.268e-15         NA      NA       NA
`Tourism expenditure (millions of US dollars)`                       -1.624e-14         NA      NA       NA
`Tourist/visitor arrivals (thousands)`                                5.703e-15         NA      NA       NA
`Gross enrollement ratio - Primary (male)`                            2.413e-11         NA      NA       NA
`Gross enrollment ratio - Primary (female)`                                  NA         NA      NA       NA
`Gross enrollment ratio - Secondary (female)`                                NA         NA      NA       NA
`Gross enrollment ratio - Secondary (male)`                                  NA         NA      NA       NA
`Gross enrollment ratio - Tertiary (female)`                                 NA         NA      NA       NA
`Gross enrollment ratio - Tertiary (male)`                                   NA         NA      NA       NA
`Students enrolled in primary education (thousands)`                         NA         NA      NA       NA
`Students enrolled in secondary education (thousands)`                       NA         NA      NA       NA
`Students enrolled in tertiary education (thousands)`                        NA         NA      NA       NA
`Assault rate per 100,000 population`                                        NA         NA      NA       NA
`Intentional homicide rates per 100,000`                                     NA         NA      NA       NA
`Kidnapping at the national level, rate per 100,000`                         NA         NA      NA       NA
`Percentage of male and female intentional homicide victims, Female`         NA         NA      NA       NA
`Percentage of male and female intentional homicide victims, Male`           NA         NA      NA       NA
`Robbery at the national level, rate per 100,000 population`                 NA         NA      NA       NA
`Theft at the national level, rate per 100,000 population`                   NA         NA      NA       NA
`Total Sexual Violence at the national level, rate per 100,000`              NA         NA      NA       NA
`GDP in constant 2010 prices (millions of US dollars)`                       NA         NA      NA       NA
`GDP in current prices (millions of US dollars)`                             NA         NA      NA       NA
`GDP per capita (US dollars)`                                                NA         NA      NA       NA
`GDP real rates of growth (percent)`                                         NA         NA      NA       NA
`Labour force participation - Female`                                        NA         NA      NA       NA
`Labour force participation - Male`                                          NA         NA      NA       NA
`Labour force participation - Total`                                         NA         NA      NA       NA
`Unemployment rate - Female`                                                 NA         NA      NA       NA
`Unemployment rate - Male`                                                   NA         NA      NA       NA
`Unemployment rate - Total`                                                  NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
  (13 observations deleted due to missingness)
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 13 and 0 DF,  p-value: NA

Reading online, I understood that my dataset is too small [1]. When I reduce the number of variables:

linearMod <- lm(`Net Sales Value (Net of Inv. Disc.) - US Dollar` ~        # Variable Y

                `Population aged 0 to 14 years old (percentage)` +        # Variable X
                `Population aged 60+ years old (percentage)`     +
                `Population density`     +
                `Population mid-year estimates (millions)`     +
                `Population mid-year estimates for females (millions)`     +
                `Population mid-year estimates for males (millions)`     +
                # `Sex ratio (males per 100 females)`     +
                # `Surface area (thousand km2)`     +
                # `Tourism expenditure (millions of US dollars)`     +
                # `Tourist/visitor arrivals (thousands)`     +
                # `Gross enrollement ratio - Primary (male)`     +
                # `Gross enrollment ratio - Primary (female)`     +
                # `Gross enrollment ratio - Secondary (female)`     +
                # `Gross enrollment ratio - Secondary (male)`     +
                # `Gross enrollment ratio - Tertiary (female)`     +
                # `Gross enrollment ratio - Tertiary (male)`     +
                 `Students enrolled in primary education (thousands)`     +
                 `Students enrolled in secondary education (thousands)`     +
                 `Students enrolled in tertiary education (thousands)`     +
                 `GDP in constant 2010 prices (millions of US dollars)`     +
                 `GDP in current prices (millions of US dollars)`     +
                 `GDP per capita (US dollars)`     +
                 `GDP real rates of growth (percent)`     +
                #  `Assault rate per 100,000 population`     +
                #  `Intentional homicide rates per 100,000`     +
                #  `Kidnapping at the national level, rate per 100,000`     +
                #  `Percentage of male and female intentional homicide victims, Female`     +
                #  `Percentage of male and female intentional homicide victims, Male`     +
                #  `Robbery at the national level, rate per 100,000 population`     +
                #  `Theft at the national level, rate per 100,000 population`     +
                #  `Total Sexual Violence at the national level, rate per 100,000`     +
                #  `Labour force participation - Female`     +
                #  `Labour force participation - Male`     +
                #  `Labour force participation - Total`     +
                #  `Unemployment rate - Female`     +
                #  `Unemployment rate - Male`     +
                 `Unemployment rate - Total`
                ,data=frame_data)                                         # My dataframe
summary(linearMod)

My new result is this:

Call:
lm(formula = `Net Sales Value (Net of Inv. Disc.) - US Dollar` ~ 
    `Population aged 0 to 14 years old (percentage)` + `Population aged 60+ years old (percentage)` + 
        `Population density` + `Population mid-year estimates (millions)` + 
        `Population mid-year estimates for females (millions)` + 
        `Population mid-year estimates for males (millions)` + 
        `Students enrolled in primary education (thousands)` + 
        `Students enrolled in secondary education (thousands)` + 
        `Students enrolled in tertiary education (thousands)` + 
        `GDP in constant 2010 prices (millions of US dollars)` + 
        `GDP in current prices (millions of US dollars)` + `GDP per capita (US dollars)` + 
        `GDP real rates of growth (percent)` + `Unemployment rate - Total`, 
    data = frame_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-377123 -127525   20489   95333  388344 

Coefficients:
                                                         Estimate Std. Error t value Pr(>|t|)  
(Intercept)                                             4.853e+06  3.148e+06   1.542   0.1671  
`Population aged 0 to 14 years old (percentage)`       -1.107e+05  7.591e+04  -1.459   0.1880  
`Population aged 60+ years old (percentage)`           -9.130e+04  7.583e+04  -1.204   0.2677  
`Population density`                                   -2.468e+02  7.284e+02  -0.339   0.7446  
`Population mid-year estimates (millions)`              3.789e+07  2.503e+07   1.513   0.1740  
`Population mid-year estimates for females (millions)` -3.698e+07  2.532e+07  -1.460   0.1876  
`Population mid-year estimates for males (millions)`   -3.875e+07  2.474e+07  -1.566   0.1613  
`Students enrolled in primary education (thousands)`    1.017e+03  4.584e+02   2.219   0.0620 .
`Students enrolled in secondary education (thousands)` -1.526e+03  5.463e+02  -2.793   0.0268 *
`Students enrolled in tertiary education (thousands)`   3.491e+02  6.538e+02   0.534   0.6099  
`GDP in constant 2010 prices (millions of US dollars)` -5.915e+00  3.390e+00  -1.745   0.1246  
`GDP in current prices (millions of US dollars)`        7.579e+00  3.463e+00   2.188   0.0649 .
`GDP per capita (US dollars)`                           5.528e+00  8.886e+00   0.622   0.5536  
`GDP real rates of growth (percent)`                   -1.470e+05  1.274e+05  -1.154   0.2863  
`Unemployment rate - Total`                            -7.130e+04  4.676e+04  -1.525   0.1711  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 320700 on 7 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.877, Adjusted R-squared:  0.6311 
F-statistic: 3.566 on 14 and 7 DF,  p-value: 0.0487

If I understand correctly, the model starts to find something. I don't know how to select the best variables for my model. Starting from my first model, I tried stepAIC and step [2][3], but I get:

AIC is -infinity for this model, so 'step' cannot proceed

Maybe I'm just making a big mess.

References:

[1] https://stackoverflow.com/questions/47386290/summary-of-model-returning-na

[2] http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/

[3] https://www.rdocumentation.org/packages/stats/versions/3.6.0/topics/step

You have more variables than observations, so a unique set of coefficients cannot be determined. To get around this, you can fit a regularized model (such as ridge regression or the LASSO). Inferences from these models (e.g. confidence intervals) are not to be trusted, since the models are biased. — Demetri Pananos, Jun 14 '19 at 16:20
Possible duplicate of [What is model identifiability?](https://stats.stackexchange.com/questions/20608/what-is-model-identifiability) — Sycorax, Jun 15 '19 at 13:20

score 1 · Answer 1 · answered Jun 14 '19 at 16:22

You have way too many predictors for this regression. You started off with 27 data points in your dependent variable, and missing data has reduced that number further. Regression with a sample this size is not unheard of, but you have to be careful about what an analysis like this can tell you. With that in mind, before anyone can tell you what the 'best' variables are for your model, you need to explain what you are trying to do with it. Are you using it primarily for prediction, or are you trying to build a model that explains sales value? If the former, you are probably stuck with such a small sample, but methods like penalized regression can work well when you have too many variables in your model. If the latter, you really want to be guided by theory, or by your own idea of how these variables relate to the outcome and which are the most important.
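To make "penalized regression" concrete, here is a minimal ridge regression sketch in base R. Everything in it is an assumption made for illustration: the data are simulated, the helper `ridge_coef()` is hypothetical, and the two penalty values are arbitrary (in practice the penalty would normally be chosen by cross-validation). It only shows that the penalty yields a unique, finite set of coefficients even with more predictors than observations.

```r
set.seed(42)
n <- 20                                   # observations
p <- 30                                   # predictors, deliberately more than n

# Simulated, standardised predictors; only the first two actually matter.
X <- scale(matrix(rnorm(n * p), n, p))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n, sd = 0.5)
y <- y - mean(y)                          # centre y so no intercept is needed

# Hypothetical helper: closed-form ridge solution (X'X + lambda * I)^-1 X'y.
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_small <- ridge_coef(X, y, lambda = 1)   # light shrinkage
b_large <- ridge_coef(X, y, lambda = 100) # heavy shrinkage

# Compare overall coefficient size under the two penalties.
c(small_penalty = sum(b_small^2), large_penalty = sum(b_large^2))
```

A larger penalty pulls the coefficients further towards zero. That shrinkage is what makes the problem solvable, and it is also why the usual standard errors and p-values no longer apply to the result.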

Yes, I want to explain the sales value. Thanks for the feedback. I have updated the question. — Antonio Faienza, Jun 14 '19 at 16:33

score 1 · Answer 2 · answered Jun 14 '19 at 16:57

You need to do feature selection -- you can't do linear regression with 27 datapoints for 40 variables. It's like asking if you can fit a line to a single datapoint -- doesn't make sense.
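The situation in the question can be reproduced with simulated data, which may help show where the "no residual degrees of freedom" message comes from. This is only a sketch: the data frame `toy` and its column names are made up, and the missing values are placed by hand so that 13 of the 27 rows are incomplete, as in the posted summary.

```r
set.seed(1)
n <- 27                                   # rows, as in the question
p <- 39                                   # 39 predictors + intercept = 40 coefficients

toy <- as.data.frame(matrix(rnorm(n * p), n, p))   # columns V1 ... V39
toy$sales <- rnorm(n)                     # made-up response
toy$V1[15:27] <- NA                       # 13 rows now contain a missing value

usable_rows <- sum(complete.cases(toy))   # rows lm() can actually use
fit <- lm(sales ~ ., data = toy)          # lm() silently drops incomplete rows

usable_rows
df.residual(fit)                          # residual degrees of freedom
sum(is.na(coef(fit)))                     # coefficients lm() could not estimate
```

With 14 usable rows and 40 coefficients to estimate, lm() can pin down at most 14 of them and fits those rows perfectly. The three numbers should therefore come out as 14, 0 and 26, the same pattern as in the first summary above.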

You CAN use the Lasso technique:

https://en.wikipedia.org/wiki/Lasso_(statistics)

where essentially you are penalizing the cost function for your regression to try to ensure it uses as few variables as possible BUT I don't think that you should -- 40 variables is a lot for 27 datapoints.
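For anyone who wants to see what that penalty does mechanically, here is a small hand-rolled Lasso using cyclic coordinate descent in base R. It is a teaching sketch on simulated data, not a recommendation for the real dataset: `soft_threshold()` and `lasso_cd()` are hypothetical helpers, the penalty is an arbitrary fraction of the largest useful value, and in practice a dedicated package such as glmnet would be the usual choice.

```r
# Soft-thresholding operator: shrinks z towards zero by gamma.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Lasso by cyclic coordinate descent, minimising
#   (1 / (2n)) * sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
# X is standardised and y is centred inside the function.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  X <- scale(X)
  y <- y - mean(y)
  beta <- numeric(ncol(X))
  col_ms <- colSums(X^2) / n              # mean square of each column
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Partial residual: the current fit with predictor j left out.
      r_j <- y - drop(X %*% beta) + X[, j] * beta[j]
      rho <- sum(X[, j] * r_j) / n
      beta[j] <- soft_threshold(rho, lambda) / col_ms[j]
    }
  }
  beta
}

set.seed(123)
n <- 27; p <- 39                          # same shape as the question, simulated values
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)

# Smallest penalty at which every coefficient is exactly zero.
lambda_max <- max(abs(crossprod(scale(X), y - mean(y)))) / n

beta_hat <- lasso_cd(X, y, lambda = 0.3 * lambda_max)
cat("non-zero coefficients:", sum(beta_hat != 0), "of", p, "\n")
```

The soft-threshold step is what sets coefficients to exactly zero, so the fit doubles as a crude variable selector. Which variables survive depends heavily on the penalty, and with so few rows the selection can change a lot from one sample to the next.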

The first thing I would do is try to figure out which variables you need (aka feature selection) and which ones you don't so you can reduce the number of variables. In R, if you have a dataframe you can use:

mydata.cor = cor(mydata)

to generate a correlation matrix, showing you which variables are correlated to which other variables. There is a package called corrplot you can use to easily visualize this:

install.packages("corrplot")
library(corrplot)
corrplot(mydata.cor)

It would not be surprising to find that some of your variables are correlated, in which case you can eliminate some of them and simplify your problem.
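With around 40 variables the plot can be hard to read, so it may help to list the strongly correlated pairs directly. The sketch below uses a made-up data frame `toy` with hypothetical column names, an arbitrary cutoff of 0.9, and `use = "pairwise.complete.obs"` so that missing values do not turn the whole correlation matrix into NA.

```r
set.seed(1)
n <- 27
pop_total <- runif(n, 1, 100)

# Made-up data: pop_female is almost a fixed share of pop_total.
toy <- data.frame(
  pop_total    = pop_total,
  pop_female   = 0.51 * pop_total + rnorm(n, sd = 0.05),
  unemployment = runif(n, 2, 15)
)
toy$unemployment[c(3, 8)] <- NA           # a couple of missing values

# Pairwise-complete correlations, so columns with NAs are still usable.
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the cutoff.
cutoff <- 0.9
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

For each flagged pair, one of the two variables is a candidate for removal. In the question's second model, the total, female and male population estimates all enter together, which is the kind of overlap a listing like this is meant to surface.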

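When you are using stepAIC you are doing stepwise selection (started from the full model with no scope given, as you did, it works backward, dropping one term at a time):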

https://en.wikipedia.org/wiki/Stepwise_regression

which is a good approach to try, but I think you still need to reduce the number of variables first -- get rid of any correlated predictors and then use any prior knowledge you have about the problem to eliminate any other unnecessary variables.

Bottom line is you have to reduce the number of variables here...I would start by eliminating correlated variables then try stepwise selection or eliminate variables based on prior knowledge.
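Once the variable list is short enough to leave some residual degrees of freedom, stepwise selection can run. The sketch below uses base R's step() on simulated data; the data frame `toy`, its columns, and the choice of backward direction are assumptions for illustration. Rows with missing values are dropped first, because step() needs every candidate model to be fitted on the same rows.

```r
set.seed(7)
n <- 27

# Made-up data: five candidate predictors, only x1 drives the response.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), x5 = rnorm(n))
toy$sales <- 3 * toy$x1 + rnorm(n)
toy$x4[c(2, 9)] <- NA                     # a couple of missing values

toy_cc <- na.omit(toy)                    # keep complete rows only
full <- lm(sales ~ ., data = toy_cc)      # 6 coefficients on 25 rows

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
selected <- step(full, direction = "backward", trace = 0)
formula(selected)
```

The "AIC is -infinity" message in the question is what a perfect fit with zero residual degrees of freedom leads to, so it should go away once the starting model has fewer coefficients than usable rows. Even then, the p-values of a model chosen this way are optimistic, because the same small sample was used both to pick the variables and to test them.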
