R - linear regression - good fit, but residuals aren't normally distributed and homoscedastic

Question

I'm trying to find a function which describes one data set.

I've tried to create a model by function lm() like this:

Call:
lm(formula = time ~ I(1/nprocs) + I(1/nprocs):ndoms + nDOF:ndoms + 
    I(nDOF^2):ndoms, data = dataFact)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33075 -0.05145  0.00782  0.05273  0.35107 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.260e-01  1.390e-02  -9.067  < 2e-16 ***
I(1/nprocs)        1.994e-01  4.103e-02   4.859 2.25e-06 ***
I(1/nprocs):ndoms  3.645e-04  5.560e-05   6.555 3.94e-10 ***
ndoms:nDOF         4.239e-07  7.004e-09  60.521  < 2e-16 ***
ndoms:I(nDOF^2)    4.717e-11  4.787e-13  98.530  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1054 on 219 degrees of freedom
Multiple R-squared:  0.9994,    Adjusted R-squared:  0.9994 
F-statistic: 8.76e+04 on 4 and 219 DF,  p-value: < 2.2e-16

As we can see from t-test, all coefficients are significant. Moreover $R^2$ and F-test are both good and fit looks ok even visually:

The problems arise, when I try to perform additional tests.

Multicollinearity

I check multicollinearity with vif() function and here is the result:

Multi-collinearity test: 
      I(1/nprocs) I(1/nprocs):ndoms        ndoms:nDOF   ndoms:I(nDOF^2) 
         2.949647          3.242263         14.640992         13.642409

As we can see, high multicollinearity is found with members ndoms:nDOF and ndoms:I(nDOF^2). But we I omit the member ndoms:nDOF, the fit is pretty poor in higher indices:

When I omit the member ndoms:I(nDOF^2), the fit is even worse.

How could I deal multicollinearity in this case?

Residuals are not normally distributed, autocorrelated and heteroscedastic

I try to check the normality of residuals with shapiro.test() function, auto-correlation with durbinWatsonTest() and heteroscedasticity with ncvTest().

Results are not good good at all:

    Shapiro-Wilk normality test

data:  residuals(currentFit)
W = 0.9822, p-value = 0.006407

Auto-correlation of residuals test (Durbin-Watson): 

 lag Autocorrelation D-W Statistic p-value
   1       0.5594217     0.8761061       0
 Alternative hypothesis: rho != 0

Residual homoscedasticity test: 

Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 35.17967    Df = 1     p = 3.006452e-09

I've tried to detect outliers with outlierTest() and it says there aren't any outliers:

No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
    rstudent unadjusted p-value Bonferonni p
209 3.607098         0.00038392     0.085999

Moreover, residual plot looks like this:

We can see heteroscedasticity, but not outliers (at least I don't see them).

I've found even this question, but I don't think it's relevant for me, because of this. Outliers were an important issue in it.

I've even tried to remove autocorrelation with cochrane.orcutt() and to remove heteroscedasticity with BoxCoxTrans() applied to the dependent variable time, both with no success.

Other residual plots look like this:

Could you, please, tell me, how am I supposed to correct my model, so that it fixes these issues?

I'd really appreciate answers with examples shown on this data set - I'm a total beginner when it comes to statistics, so the vague answers aren't very useful for me and I'd like to make this sort of a reference question for this type of problems. Thank you!

My complete R-code

#!/usr/bin/Rscript

##################
# Nacteni modulu #
##################
require(caret)
library(car)
library(orcutt)
library(QuantPsyc)

#############################
# Definice vlastnich funkci #
#############################
printf <- function(...) cat(sprintf(...))

###############
# Nacist data #
###############
dataFact = read.csv('factorizationKSorted.dat')

#################
# Tvorba modelu #
#################

# Linearni fit
#fit1 <- lm(formula=time ~ (nDOF + I(nDOF^2)):ndoms,
#           data=dataFact)

fit2 <- lm(formula=time ~ I(1/nprocs) +  I(1/nprocs):ndoms + nDOF:ndoms + I(nDOF^2):ndoms,
           data=dataFact)

fit3 <- lm(formula=time ~ I(1/nprocs) +  I(1/nprocs):ndoms + nDOF:ndoms,
           data=dataFact)

# Model, ktery se bude dal analyzovat
currentFit <- fit2

# Vykresleni grafu
X11()
plot(1:length(dataFact$time), dataFact$time, col="black", xlab="index")
lines(predict(currentFit), col="green")

# Vypis informaci o modelu
#
# S-value (standartni chyba) misto R^2: summary(fit)$sigma
summary(currentFit)

#currentFit.beta <- lm.beta(currentFit)
#currentFit.beta

# Test odlehlych hodnot
printf("\nOutliers test (Bonferroni):\n")
outlierTest(currentFit)
X11()
plot(1:length(currentFit$residuals), currentFit$residuals, xlab="index")

# Test multi-kolinearity
printf("\nMulti-collinearity test: \n")
vif(currentFit)

# Test nelinearity ??

###################
# Analyza rezidui #
###################

# Test normality rezidui
message("\nNormal distribution of residuals test (Shapiro-Wilk):\n")
shapiro.test(residuals(currentFit))
X11()
hist(residuals(currentFit), main="Residuals")

# Stredni hodnota rezidui
printf("\nMean of residuals is %f.\n", mean(currentFit$residuals))

# Test autokorelace - Durbin-Watson
message("\nAuto-correlation of residuals test (Durbin-Watson): \n")
durbinWatsonTest(currentFit)

# Odstraneni autokorelace - Cochrane-Orcutt
#currentFitCoch <- cochrane.orcutt(currentFit)
#currentFitCoch

# Test homoskedasticity
message("\nResidual homoscedasticity test: \n")
ncvTest(currentFit)
bptest(currentFit)

# Ostraneni heteroskedasticity - Box-cox transformace zavisle promenne
# https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
#head(dataFact)
#timeBCMod <- BoxCoxTrans(dataFact$time)
#dataFact <- cbind(dataFact, time_bc=predict(timeBCMod, dataFact$time))
#head(dataFact)
#currentFitBC <- lm(time_bc ~ nDOF + ndoms + (nDOF + I(nDOF^2)):ndoms,
#                   data=dataFact)
#ncvTest(currentFitBC)
#bptest(currentFitBC)

# Vykresleni grafu pro vizualizaci rezidui
X11()
par(mfrow = c(2, 2))
plot(currentFit)


##########################################################
# Cross-validace (over-fitting test) 10-fold, 10 iteraci #
##########################################################
printf("\nCross-validation\n")
stdErrs=c() # Vektor pro prumerne hodnoty s. erroru v jednotlivych iteracich
for (i in 1:10) {
    # 1) Vytvorim 10 skupin dat
    folds <- createFolds(dataFact$time, k=10, list=TRUE, returnTrain=FALSE)
    #folds <- createFolds(dataFact$z, k=10, list=TRUE, returnTrain=FALSE)
    names(folds)[1] <- "test" # Cast dat slouzici pro nasledny test

    # 2) 90% Dat bude slouzit pro trenink a 10% pro nasledny test
    dataFactTest <- dataFact[ folds$test, ]
    dataFactTrain <- dataFact[ c(folds[[2]], 
                                 folds[[3]], 
                                 folds[[4]], 
                                 folds[[5]], 
                                 folds[[6]], 
                                 folds[[7]], 
                                 folds[[8]], 
                                 folds[[9]], 
                                 folds[[10]]), ]

    # 3) Kalibrace funkce jen na trenovacich datech
    testFit <- lm(formula=formula(currentFit),
                  data=dataFactTrain)

    # 4) Graficke srovnani testovaci funkce se vsemi daty
    #    a zvlast s testovaci mnozinou
    tmp <- predict(testFit, dataFactTest, se.fit=TRUE)
    X11()
    plot(1:length(dataFactTest$time), dataFactTest$time)
    lines(1:length(dataFactTest$time),
          tmp$fit, 
          col="green")

    printf("%d iter: Prumerna standardni chyba (S-value) pro fit na testovaci mnozine je %f.\n", i, mean(tmp$se.fit))
    stdErrs[i] = mean(tmp$se.fit)
}
printf("Prumerna standardni chyba (S-value) za vsechny iterace je %f.\n", mean(stdErrs))

# Zprava pro "pozastaveni"grafu
message("\nPress Return To Continue")
invisible(readLines("stdin", n=1))

Data - factorizationKSorted.dat

nprocs,nthreads,ndoms,size,nDOF,time
1,24,288,4,375,0.17715325
2,12,288,4,375,0.0629865
3,8,288,4,375,0.057708125
4,6,288,4,375,0.04788125
6,4,288,4,375,0.04303875
8,3,288,4,375,0.04060275
12,2,288,4,375,0.038032625
24,1,288,4,375,0.034977375
1,24,288,6,1029,0.284978625
2,12,288,6,1029,0.13397875
3,8,288,6,1029,0.125101
4,6,288,6,1029,0.110273375
6,4,288,6,1029,0.105055625
8,3,288,6,1029,0.10273
12,2,288,6,1029,0.100959875
24,1,288,6,1029,0.099074375
1,24,288,8,2187,0.479892
2,12,288,8,2187,0.306054125
3,8,288,8,2187,0.295369625
4,6,288,8,2187,0.284210125
6,4,288,8,2187,0.273315125
8,3,288,8,2187,0.274951375
12,2,288,8,2187,0.266159875
24,1,288,8,2187,0.263813125
1,24,288,10,3993,0.89622025
2,12,288,10,3993,0.673204
3,8,288,10,3993,0.666580625
4,6,288,10,3993,0.6477035
6,4,288,10,3993,0.648032125
8,3,288,10,3993,0.64495625
12,2,288,10,3993,0.640551125
24,1,288,10,3993,0.638785875
1,24,288,12,6591,1.580801
2,12,288,12,6591,1.37637042857
3,8,288,12,6591,1.36346242857
4,6,288,12,6591,1.36495214286
6,4,288,12,6591,1.36805057143
8,3,288,12,6591,1.36849742857
12,2,288,12,6591,1.37384785714
24,1,288,12,6591,1.37941357143
1,24,288,14,10125,2.84967157143
2,12,288,14,10125,2.66627714286
3,8,288,14,10125,2.682575
4,6,288,14,10125,2.66889842857
6,4,288,14,10125,2.677353
8,3,288,14,10125,2.67224042857
12,2,288,14,10125,2.68130785714
24,1,288,14,10125,2.67892771429
1,24,288,16,14739,4.89139571429
2,12,288,16,14739,4.69489785714
3,8,288,16,14739,4.74105342857
4,6,288,16,14739,4.70664585714
6,4,288,16,14739,4.73917842857
8,3,288,16,14739,4.70832771429
12,2,288,16,14739,4.715311
24,1,288,16,14739,4.736754
1,24,384,4,375,0.23214775
2,12,384,4,375,0.085062
3,8,384,4,375,0.0773935
4,6,384,4,375,0.063308375
6,4,384,4,375,0.056720375
8,3,384,4,375,0.053515875
12,2,384,4,375,0.050725
24,1,384,4,375,0.047851875
1,24,384,6,1029,0.387123125
2,12,384,6,1029,0.177154
3,8,384,6,1029,0.164021375
4,6,384,6,1029,0.1471485
6,4,384,6,1029,0.13990975
8,3,384,6,1029,0.136657
12,2,384,6,1029,0.13419125
24,1,384,6,1029,0.13110525
1,24,384,8,2187,0.646564875
2,12,384,8,2187,0.403944
3,8,384,8,2187,0.393314375
4,6,384,8,2187,0.37626875
6,4,384,8,2187,0.368974375
8,3,384,8,2187,0.358301
12,2,384,8,2187,0.359731125
24,1,384,8,2187,0.3541595
1,24,384,10,3993,1.135393
2,12,384,10,3993,0.884951625
3,8,384,10,3993,0.879945
4,6,384,10,3993,0.855973375
6,4,384,10,3993,0.8522425
8,3,384,10,3993,0.852744375
12,2,384,10,3993,0.855075875
24,1,384,10,3993,0.8507125
1,24,384,12,6591,2.10579685714
2,12,384,12,6591,1.83842342857
3,8,384,12,6591,1.83237142857
4,6,384,12,6591,1.82270728571
6,4,384,12,6591,1.82788628571
8,3,384,12,6591,1.82526614286
12,2,384,12,6591,1.82717142857
24,1,384,12,6591,1.82618185714
1,24,384,14,10125,3.81616642857
2,12,384,14,10125,3.56664771429
3,8,384,14,10125,3.58948371429
4,6,384,14,10125,3.56408085714
6,4,384,14,10125,3.56926157143
8,3,384,14,10125,3.55938014286
12,2,384,14,10125,3.55807242857
24,1,384,14,10125,3.55977614286
1,24,384,16,14739,6.55986157143
2,12,384,16,14739,6.26693228571
3,8,384,16,14739,6.30919285714
4,6,384,16,14739,6.27896728571
6,4,384,16,14739,6.27579871429
8,3,384,16,14739,6.27007928571
12,2,384,16,14739,6.26270857143
24,1,384,16,14739,6.25499785714
1,24,576,4,375,0.346824
2,12,576,4,375,0.127150875
3,8,576,4,375,0.11508025
4,6,576,4,375,0.097335
6,4,576,4,375,0.083914125
8,3,576,4,375,0.080992125
12,2,576,4,375,0.075828625
24,1,576,4,375,0.070827625
1,24,576,6,1029,0.59015625
2,12,576,6,1029,0.26846175
3,8,576,6,1029,0.245241875
4,6,576,6,1029,0.2193675
6,4,576,6,1029,0.208428375
8,3,576,6,1029,0.207515
12,2,576,6,1029,0.204429
24,1,576,6,1029,0.197228
1,24,576,8,2187,0.98994575
2,12,576,8,2187,0.605623
3,8,576,8,2187,0.587526875
4,6,576,8,2187,0.55606225
6,4,576,8,2187,0.546602
8,3,576,8,2187,0.537873375
12,2,576,8,2187,0.536193625
24,1,576,8,2187,0.527466375
1,24,576,10,3993,1.725655375
2,12,576,10,3993,1.325127375
3,8,576,10,3993,1.317314625
4,6,576,10,3993,1.290353
6,4,576,10,3993,1.285149
8,3,576,10,3993,1.283298375
12,2,576,10,3993,1.283781375
24,1,576,10,3993,1.2788885
1,24,576,12,6591,3.15235257143
2,12,576,12,6591,2.765567
3,8,576,12,6591,2.76951185714
4,6,576,12,6591,2.74102728571
6,4,576,12,6591,2.737753
8,3,576,12,6591,2.73729985714
12,2,576,12,6591,2.73825257143
24,1,576,12,6591,2.72781542857
1,24,576,14,10125,5.71093614286
2,12,576,14,10125,5.35875985714
3,8,576,14,10125,5.36633285714
4,6,576,14,10125,5.34668342857
6,4,576,14,10125,5.32762214286
8,3,576,14,10125,5.33975814286
12,2,576,14,10125,5.350373
24,1,576,14,10125,5.31987557143
1,24,576,16,14739,9.78473642857
2,12,576,16,14739,9.44161628571
3,8,576,16,14739,9.42669171429
4,6,576,16,14739,9.41840128571
6,4,576,16,14739,9.37364014286
8,3,576,16,14739,9.41561828571
12,2,576,16,14739,9.41685542857
24,1,576,16,14739,9.39101542857
1,24,1152,4,375,0.710592875
2,12,1152,4,375,0.246370625
3,8,1152,4,375,0.2160605
4,6,1152,4,375,0.182422125
6,4,1152,4,375,0.164191875
8,3,1152,4,375,0.1583665
12,2,1152,4,375,0.1544865
24,1,1152,4,375,0.142128
1,24,1152,6,1029,1.2401235
2,12,1152,6,1029,0.52502025
3,8,1152,6,1029,0.490228
4,6,1152,6,1029,0.43791025
6,4,1152,6,1029,0.419299125
8,3,1152,6,1029,0.411805375
12,2,1152,6,1029,0.4083735
24,1,1152,6,1029,0.39699275
1,24,1152,8,2187,1.993820125
2,12,1152,8,2187,1.204231625
3,8,1152,8,2187,1.166492625
4,6,1152,8,2187,1.116419625
6,4,1152,8,2187,1.092001625
8,3,1152,8,2187,1.0832535
12,2,1152,8,2187,1.076723875
24,1,1152,8,2187,1.0744375
1,24,1152,10,3993,3.511545625
2,12,1152,10,3993,2.66910925
3,8,1152,10,3993,2.66473925
4,6,1152,10,3993,2.60924175
6,4,1152,10,3993,2.594369125
8,3,1152,10,3993,2.578603625
12,2,1152,10,3993,2.581412625
24,1,1152,10,3993,2.594349375
1,24,1152,12,6591,6.256258375
2,12,1152,12,6591,5.568122625
3,8,1152,12,6591,5.583774125
4,6,1152,12,6591,5.523111625
6,4,1152,12,6591,5.503487125
8,3,1152,12,6591,5.48894275
12,2,1152,12,6591,5.4785635
24,1,1152,12,6591,5.473963875
1,24,1152,14,10125,11.3590212857
2,12,1152,14,10125,10.7868857143
3,8,1152,14,10125,10.8603297143
4,6,1152,14,10125,10.7189155714
6,4,1152,14,10125,10.6728481429
8,3,1152,14,10125,10.695908
12,2,1152,14,10125,10.6502984286
24,1,1152,14,10125,10.6126634286
1,24,1152,16,14739,19.6414048571
2,12,1152,16,14739,18.9570958571
3,8,1152,16,14739,18.950933
4,6,1152,16,14739,18.8408438571
6,4,1152,16,14739,18.7679364286
8,3,1152,16,14739,18.848793
12,2,1152,16,14739,18.7847892857
24,1,1152,16,14739,18.8019534286

What's your goal with the model? Why are you concerned with VIF, homoskedacity, and normal residuals? — Matthew Drury, Mar 03 '17 at 15:54
@MatthewDrury I've tried to study regression and I found those to be the crucial conditions for the model to be correct. Otherwise, my goal is to predict the measured values as precisely as possible. — Eenoku, Mar 03 '17 at 15:59
Those assumptions are only necessary for specific purposes. If your goal is prediction then it is much more important for you to validate the predictive power of your model is sufficient (using, say, cross validation), than to validate these structural assumptions. — Matthew Drury, Mar 03 '17 at 16:15
@MatthewDrury I have performed cross-validation and it looks pretty well too. Could you, please, write a little more about the "specific purposes" you've mentioned? — Eenoku, Mar 03 '17 at 17:28

R - linear regression - good fit, but residuals aren't normally distributed and homoscedastic

Multicollinearity

Residuals are not normally distributed, autocorrelated and heteroscedastic

My complete R-code

Data - factorizationKSorted.dat

0 Answers0