I am new to R and I am trying to build a model that explains sales value. In particular, I want to explain how a set of variables (downloaded from http://data.un.org/ and merged in Excel) affects my sales value. To do this, I use linear regression (the *lm()* function) in R. My dataset is small relative to the number of variables I have:
frame_data --> 27 obs of 40 Variables
When I run the model:
linearMod <- lm(`Net Sales Value (Net of Inv. Disc.) - US Dollar` ~ ., data = frame_data)
summary(linearMod)
the result is this:
Call:
lm(formula = `Net Sales Value (Net of Inv. Disc.) - US Dollar` ~
., data = numeric_frame_data)
Residuals:
ALL 14 residuals are 0: no residual degrees of freedom!
Coefficients: (26 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.946e-09 NA NA NA
`Net Sales Quantity` 2.411e-14 NA NA NA
`Net Sales Value (Net of Inv. Disc.) - Euro ACT` 1.130e+00 NA NA NA
`Population aged 0 to 14 years old (percentage)` -3.121e-11 NA NA NA
`Population aged 60+ years old (percentage)` 3.428e-11 NA NA NA
`Population density` -7.512e-13 NA NA NA
`Population mid-year estimates (millions)` 4.933e-09 NA NA NA
`Population mid-year estimates for females (millions)` -5.170e-09 NA NA NA
`Population mid-year estimates for males (millions)` -4.684e-09 NA NA NA
`Sex ratio (males per 100 females)` 2.526e-11 NA NA NA
`Surface area (thousand km2)` -8.268e-15 NA NA NA
`Tourism expenditure (millions of US dollars)` -1.624e-14 NA NA NA
`Tourist/visitor arrivals (thousands)` 5.703e-15 NA NA NA
`Gross enrollement ratio - Primary (male)` 2.413e-11 NA NA NA
`Gross enrollment ratio - Primary (female)` NA NA NA NA
`Gross enrollment ratio - Secondary (female)` NA NA NA NA
`Gross enrollment ratio - Secondary (male)` NA NA NA NA
`Gross enrollment ratio - Tertiary (female)` NA NA NA NA
`Gross enrollment ratio - Tertiary (male)` NA NA NA NA
`Students enrolled in primary education (thousands)` NA NA NA NA
`Students enrolled in secondary education (thousands)` NA NA NA NA
`Students enrolled in tertiary education (thousands)` NA NA NA NA
`Assault rate per 100,000 population` NA NA NA NA
`Intentional homicide rates per 100,000` NA NA NA NA
`Kidnapping at the national level, rate per 100,000` NA NA NA NA
`Percentage of male and female intentional homicide victims, Female` NA NA NA NA
`Percentage of male and female intentional homicide victims, Male` NA NA NA NA
`Robbery at the national level, rate per 100,000 population` NA NA NA NA
`Theft at the national level, rate per 100,000 population` NA NA NA NA
`Total Sexual Violence at the national level, rate per 100,000` NA NA NA NA
`GDP in constant 2010 prices (millions of US dollars)` NA NA NA NA
`GDP in current prices (millions of US dollars)` NA NA NA NA
`GDP per capita (US dollars)` NA NA NA NA
`GDP real rates of growth (percent)` NA NA NA NA
`Labour force participation - Female` NA NA NA NA
`Labour force participation - Male` NA NA NA NA
`Labour force participation - Total` NA NA NA NA
`Unemployment rate - Female` NA NA NA NA
`Unemployment rate - Male` NA NA NA NA
`Unemployment rate - Total` NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
(13 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 13 and 0 DF, p-value: NA
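The counts in this output can be reproduced on simulated data. The following is a minimal sketch, not the real data: `toy` is a hypothetical stand-in with the same shape as described above (27 rows, 1 response plus 39 predictors, 13 rows with a missing value), and all names in it are assumptions made for illustration. It shows that once incomplete rows are dropped, only as many coefficients as there are remaining rows can be estimated, which leaves no residual degrees of freedom.

```r
# Hypothetical stand-in for frame_data: 27 rows, 40 columns of random numbers.
set.seed(1)
toy <- as.data.frame(matrix(rnorm(27 * 40), nrow = 27))
names(toy)[1] <- "y"      # first column plays the role of the sales value
toy[1:13, "V2"] <- NA     # 13 rows contain a missing value

# lm() silently drops incomplete rows (default na.action is na.omit).
n_complete <- sum(complete.cases(toy))

fit <- lm(y ~ ., data = toy)

# With 14 usable rows, at most 14 coefficients (intercept included)
# can be estimated; the others are reported as NA.
n_estimated <- sum(!is.na(coef(fit)))
n_dropped   <- sum(is.na(coef(fit)))

c(complete_rows = n_complete,
  estimated     = n_estimated,
  not_defined   = n_dropped,
  residual_df   = df.residual(fit))
```

Running `sum(complete.cases(frame_data))` on the real data frame would show how many rows the full model can actually use.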
Reading online, I understood that my dataset is too small for this many variables [1]. When I reduce the number of variables:
linearMod <- lm(`Net Sales Value (Net of Inv. Disc.) - US Dollar` ~ # Variable Y
`Population aged 0 to 14 years old (percentage)` + # Variable X
`Population aged 60+ years old (percentage)` +
`Population density` +
`Population mid-year estimates (millions)` +
`Population mid-year estimates for females (millions)` +
`Population mid-year estimates for males (millions)` +
# `Sex ratio (males per 100 females)` +
# `Surface area (thousand km2)` +
# `Tourism expenditure (millions of US dollars)` +
# `Tourist/visitor arrivals (thousands)` +
# `Gross enrollement ratio - Primary (male)` +
# `Gross enrollment ratio - Primary (female)` +
# `Gross enrollment ratio - Secondary (female)` +
# `Gross enrollment ratio - Secondary (male)` +
# `Gross enrollment ratio - Tertiary (female)` +
# `Gross enrollment ratio - Tertiary (male)` +
`Students enrolled in primary education (thousands)` +
`Students enrolled in secondary education (thousands)` +
`Students enrolled in tertiary education (thousands)` +
`GDP in constant 2010 prices (millions of US dollars)` +
`GDP in current prices (millions of US dollars)` +
`GDP per capita (US dollars)` +
`GDP real rates of growth (percent)` +
# `Assault rate per 100,000 population` +
# `Intentional homicide rates per 100,000` +
# `Kidnapping at the national level, rate per 100,000` +
# `Percentage of male and female intentional homicide victims, Female` +
# `Percentage of male and female intentional homicide victims, Male` +
# `Robbery at the national level, rate per 100,000 population` +
# `Theft at the national level, rate per 100,000 population` +
# `Total Sexual Violence at the national level, rate per 100,000` +
# `Labour force participation - Female` +
# `Labour force participation - Male` +
# `Labour force participation - Total` +
# `Unemployment rate - Female` +
# `Unemployment rate - Male` +
`Unemployment rate - Total`
,data=frame_data) # My dataframe
summary(linearMod)
My new result is this:
Call:
lm(formula = `Net Sales Value (Net of Inv. Disc.) - US Dollar` ~
`Population aged 0 to 14 years old (percentage)` + `Population aged 60+ years old (percentage)` +
`Population density` + `Population mid-year estimates (millions)` +
`Population mid-year estimates for females (millions)` +
`Population mid-year estimates for males (millions)` +
`Students enrolled in primary education (thousands)` +
`Students enrolled in secondary education (thousands)` +
`Students enrolled in tertiary education (thousands)` +
`GDP in constant 2010 prices (millions of US dollars)` +
`GDP in current prices (millions of US dollars)` + `GDP per capita (US dollars)` +
`GDP real rates of growth (percent)` + `Unemployment rate - Total`,
data = frame_data)
Residuals:
Min 1Q Median 3Q Max
-377123 -127525 20489 95333 388344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.853e+06 3.148e+06 1.542 0.1671
`Population aged 0 to 14 years old (percentage)` -1.107e+05 7.591e+04 -1.459 0.1880
`Population aged 60+ years old (percentage)` -9.130e+04 7.583e+04 -1.204 0.2677
`Population density` -2.468e+02 7.284e+02 -0.339 0.7446
`Population mid-year estimates (millions)` 3.789e+07 2.503e+07 1.513 0.1740
`Population mid-year estimates for females (millions)` -3.698e+07 2.532e+07 -1.460 0.1876
`Population mid-year estimates for males (millions)` -3.875e+07 2.474e+07 -1.566 0.1613
`Students enrolled in primary education (thousands)` 1.017e+03 4.584e+02 2.219 0.0620 .
`Students enrolled in secondary education (thousands)` -1.526e+03 5.463e+02 -2.793 0.0268 *
`Students enrolled in tertiary education (thousands)` 3.491e+02 6.538e+02 0.534 0.6099
`GDP in constant 2010 prices (millions of US dollars)` -5.915e+00 3.390e+00 -1.745 0.1246
`GDP in current prices (millions of US dollars)` 7.579e+00 3.463e+00 2.188 0.0649 .
`GDP per capita (US dollars)` 5.528e+00 8.886e+00 0.622 0.5536
`GDP real rates of growth (percent)` -1.470e+05 1.274e+05 -1.154 0.2863
`Unemployment rate - Total` -7.130e+04 4.676e+04 -1.525 0.1711
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 320700 on 7 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.877, Adjusted R-squared: 0.6311
F-statistic: 3.566 on 14 and 7 DF, p-value: 0.0487
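In this output the three population estimates have very large coefficients of opposite sign and large standard errors. That pattern often appears when predictors are nearly collinear (total population is roughly females plus males), though this is an assumption about the data, not something the output proves. A minimal sketch for flagging such pairs, using simulated values and hypothetical column names:

```r
# Simulated predictors (hypothetical names, not the real frame_data columns).
set.seed(42)
n <- 22
females <- runif(n, 1, 50)
males   <- females * runif(n, 0.97, 1.03)   # close to females
total   <- round(females + males, 2)        # nearly the sum of the two
unemployment <- runif(n, 2, 15)             # unrelated predictor

preds <- data.frame(total, females, males, unemployment)

# Pairwise correlations, tolerating missing values.
r <- cor(preds, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) when |r| exceeds a chosen cutoff.
cutoff <- 0.95
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
flagged
```

The 0.95 cutoff is arbitrary. Keeping only one variable from each flagged pair is one common way to shorten the list of candidate predictors before fitting.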
If I understand correctly, the model is starting to find something, but I don't know how to select the best variables for my model. Starting from my first model, I tried stepAIC and step [2][3], but I get:
AIC is -infinity for this model, so 'step' cannot proceed
Maybe I'm just making a big mess.
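The error is consistent with the first model's output: a fit whose residuals are all exactly 0 has a residual sum of squares of 0, and the AIC that `step` computes for a linear model contains the log of that quantity. One possible workaround, sketched here on simulated data with hypothetical names (`toy`, `x1` to `x4`), is to start from the intercept-only model and add variables one at a time instead of starting from the saturated model. `step` also needs every candidate model to use the same rows, so the sketch drops incomplete rows first.

```r
# Simulated data (hypothetical): y depends on x1 and x2 only.
set.seed(7)
n <- 22
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

# Use the same complete rows for every model that step() compares.
toy_complete <- na.omit(toy)

# Start from the intercept-only model, whose AIC is finite.
null_fit <- lm(y ~ 1, data = toy_complete)

# Forward selection: add one predictor at a time while AIC improves.
selected <- step(
  null_fit,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
  direction = "forward",
  trace     = 0
)

formula(selected)
```

With only 14 to 22 usable rows, any automatic selection is fragile, so the chosen variables should be treated as a starting point and not as a final model.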
References:
[1] https://stackoverflow.com/questions/47386290/summary-of-model-returning-na
[3] https://www.rdocumentation.org/packages/stats/versions/3.6.0/topics/step