These models exemplify an approach known as the "Box-Tidwell Transformation." Given explanatory variables $x_i$ and a response variable $y,$ the usual Ordinary Least Squares model can be written
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$$
and $\varepsilon$ (the "error") is modeled as a random variable centered at $0.$ It is either assumed or, by means of a suitable transformation of $y,$ forced to be the case that all the $\varepsilon$ (of which there is one for each observation) have the same (finite) variance. It is usually assumed all the $\varepsilon$ are independent, too.
In case $y$ does not appear to enjoy such a linear relationship with the $x_i,$ it often is possible to "linearize" it by transforming some of the $x_i.$ (See https://stats.stackexchange.com/a/4833/919 for a general discussion of this process.) When a variable is positive, the power transformations $x \to x^\gamma$ are among the simplest, best-understood, and most flexible possibilities.
Let us, then, identify a subset of the explanatory variables that might be so transformed. Numbering them $1$ through $k,$ the model is
$$y = \beta_0 + \beta_1 x_1^{\gamma_1} + \cdots + \beta_k x_k^{\gamma_k} \ + \ \beta_{k+1}x_{k+1} + \cdots + \beta_p x_p + \varepsilon.$$
This is precisely your model with $k=p=2.$
The Box-Tidwell method computes the least-squares solution: it seeks a set of parameter estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p;$ $\hat\gamma_1, \ldots, \hat\gamma_k$ that minimize the mean squared deviation between the observed values of $y$ and the values predicted by the model. (These deviations are the "residuals.") It finds these estimates in a two-stage process:
1. Given candidate values of the powers $\hat\gamma_i,$ the best possible values of the $\hat\beta$ are given by the Ordinary Least Squares solution, which has a simple, direct formula and can be computed efficiently.

2. Systematically search over the set of possible powers to minimize the mean squared deviation.
Thus, what looks like a problem of optimizing a nonlinear function of $1+p+k$ parameters is reduced to a problem of optimizing a nonlinear function of just $k$ parameters.
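To make that reduction concrete, here is a minimal R sketch of the two-stage idea. It is an illustration only, not the internal algorithm of the `boxTidwell` function discussed below, and the names `y`, `x1`, `x2`, and `d` are mine: it assumes a data frame `d` with positive predictors `x1` and `x2` and response `y`.

rss <- function(gamma, d) {
  # Stage 1: given the powers, OLS finds the optimal betas in closed form.
  fit <- lm(y ~ I(x1^gamma[1]) + I(x2^gamma[2]), data=d)
  sum(residuals(fit)^2)
}
# Stage 2: a k-dimensional (here, two-dimensional) search over the powers.
gamma.hat <- optim(c(1, 1), rss, d=d)$par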
For better interpretability, I recommend using a variation of the Box-Cox transformation. The Box-Cox transformation is the function
$$\operatorname{BC}(x;\gamma) = \int_1^x t^{\gamma-1} \, \mathrm{d}t.$$
It equals $(x^\gamma - 1)/\gamma$ when $\gamma\ne 0$ and is the natural logarithm when $\gamma=0.$ One distinct advantage it has over a pure power is that (unlike a power transformation with a possibly negative power) it preserves order: whenever $x_1 \gt x_2,$ $\operatorname{BC}(x_1;\gamma) \gt \operatorname{BC}(x_2;\gamma).$ Since we pay attention to the signs of regression coefficients $\hat\beta_i,$ it is useful to preserve order because that will tend to preserve the sign.
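To see this concretely, here is a one-line R implementation (my own sketch; the `car` package offers `bcPower` for the same computation) together with a check of the order-preservation claim for a negative power:

bc <- function(x, gamma) if (gamma == 0) log(x) else (x^gamma - 1) / gamma
bc(2, -1) > bc(1, -1)  # TRUE: order is preserved, even though 2^(-1) < 1^(-1)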
Going further -- this is a bit of an innovation in that I haven't seen anyone use it -- I would suggest modifying the Box-Cox transformation in the following way. For any batch of positive values $(x_1,x_2,\ldots, x_n),$ let $m$ be their mean and for any positive number $x$ set
$$\phi(x;\gamma, m) = m\left(1 + \operatorname{BC}(x/m; \gamma)\right).$$
Especially when $\gamma$ is not too "strong" -- that is, too far from $1$ -- this function barely changes values of $x$ near the middle of the $(x_i).$ As a result, values of $\phi$ tend to be comparable to the original values and therefore the corresponding parameter estimates tend also to be comparable to estimates using the original (untransformed) variables.
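A short calculation makes this precise. Because $\operatorname{BC}(1;\gamma) = 0$ for every $\gamma,$ we have $\phi(m;\gamma,m) = m,$ and differentiating gives
$$\frac{\partial}{\partial x}\phi(x;\gamma, m) = \left(\frac{x}{m}\right)^{\gamma-1},$$
which equals $1$ at $x = m.$ Thus $\phi(x;\gamma,m) \approx x$ to first order near the mean, whatever the value of $\gamma.$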
What are those estimates, by the way? Letting $m_i$ be the mean of variable $i$ (for $ 1\le i \le k$), simply rewrite the new model in terms of the original Box-Cox transformations (or power transformations) to discover the relationships:
$$\begin{aligned}
y &= \beta_0 + \beta_1 \phi(x_1;\gamma_1,m_1) + \cdots + \varepsilon \\
&= \beta_0 + \beta_1 m_1\left(1+ \operatorname{BC}(x_1/m_1;\gamma_1)\right) + \cdots + \varepsilon\\
&= (\beta_0 + \beta_1 m_1 + \cdots) + \beta_1 m_1\operatorname{BC}(x_1/m_1;\gamma_1) + \cdots + \varepsilon\\
&= (\beta_0 + \beta_1 m_1 + \cdots) + \beta_1m_1\left(\frac{\left(x_1/m_1\right)^{\gamma_1} - 1}{\gamma_1}\right) + \cdots + \varepsilon\\
&= \left(\beta_0 + \beta_1 m_1\left(1-\frac{1}{\gamma_1}\right) + \cdots\right) + \frac{\beta_1 m_1^{1-\gamma_1}}{\gamma_1}x_1^{\gamma_1} + \cdots + \varepsilon\\
&= \alpha_0 + \alpha_1 x_1^{\gamma_1} + \cdots + \alpha_k x_k^{\gamma_k}\ +\ \alpha_{k+1} x_{k+1} + \cdots + \alpha_p x_p + \varepsilon.
\end{aligned}$$
This is the model of the question with
$$\alpha_0 = \beta_0 + \beta_1 m_1\left(1-\frac{1}{\gamma_1}\right) + \cdots +\beta_k m_k\left(1-\frac{1}{\gamma_k}\right)$$
and
$$\alpha_i = \frac{\beta_i m_i^{1-\gamma_i}}{\gamma_i},\ i = 1, 2, \ldots, k;$$
$$\alpha_i = \beta_i,\ i = k+1, \ldots, p.$$
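In R, converting fitted values of the $\beta$ to the $\alpha$ parametrization is a one-liner apiece. A hypothetical sketch, assuming `beta0` holds the fitted intercept, `beta` the coefficients of the $k$ transformed variables, `m` their means, and `gamma` the powers:

alpha0 <- beta0 + sum(beta * m * (1 - 1/gamma))  # intercept of the power model
alpha  <- beta * m^(1 - gamma) / gamma           # coefficients of the x_i^gamma_i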
I will illustrate this with an example.
The `car` package for R includes a `boxTidwell` function (developed by John Fox of McMaster University) to estimate the $\gamma_i.$ Its documentation uses the `Prestige` dataset of 98 (non-missing) observations of occupations of Canadians in 1971. It proposes a model in which two variables, `income` ($x_1$) and `education` ($x_2$), may be transformed, while another four variables (a categorical variable `type` with three levels and a quadratic function of `women`) are not transformed. Thus, $k=2$ and $p=6$ in this example.
Here are the raw $(y,x_1,x_2)$ data (with point colors indicating the three possible values of `type`, which will be a covariate $x_3$ in the model):

[Figure: scatterplot matrix of `prestige`, `income`, and `education`.]
The relationship between `income` and `prestige` looks especially non-linear, suggesting the value of re-expressing `income`.
The Box-Tidwell method suggests values of $\hat\gamma_1 \approx -1/3$ and $\hat\gamma_2 \approx 5/4.$ Here is what $\phi$ does to these data with these powers:

[Figure: `income` and `education` plotted against their $\phi$-transformed values.]

The transformation of `education` has a negligible effect, but the transformation of `income` is strong. (The gray lines are the reference lines where $y=x$: points lying near them have had their values left essentially unchanged by the transformation.)
We obtain a new picture of the relationships with `prestige` using these re-expressed values:

[Figure: scatterplot matrix of `prestige` and the transformed predictors.]
The relationships now appear linear (bearing in mind we have not accounted for the effects of `type` and `women`).
We proceed to find the best fit by applying Ordinary Least Squares regression. Here is a summary of its results:
Residuals:
     Min       1Q   Median       3Q      Max 
-12.4683  -3.5879   0.2383   3.9615  16.2124 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      2.4514762  4.6707391   0.525   0.6010    
income.BC        0.0019379  0.0003016   6.425 5.93e-09 ***
education.BC     3.0130871  0.5757606   5.233 1.06e-06 ***
typeprof         5.9699887  3.4694199   1.721   0.0887 .  
typewc          -2.8419944  2.3066882  -1.232   0.2211    
poly(women, 2)1 25.1152072  9.7221536   2.583   0.0114 *  
poly(women, 2)2 14.2613548  6.3131982   2.259   0.0263 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.243 on 91 degrees of freedom
Multiple R-squared:  0.8749,    Adjusted R-squared:  0.8666 
F-statistic: 106.1 on 6 and 91 DF,  p-value: < 2.2e-16
It is usually of interest to know how much, if at all, this extra effort of estimating powers $\gamma_1$ and $\gamma_2$ has accomplished. Without them, the model results are these:
Residuals:
     Min       1Q   Median       3Q      Max 
-15.6046  -4.6437   0.3103   4.9961  18.7581 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.3124871  5.1687172  -0.060 0.951924    
income           0.0009747  0.0002600   3.748 0.000312 ***
education        3.6446694  0.6350495   5.739 1.24e-07 ***
typeprof         6.7172869  3.8919915   1.726 0.087755 .  
typewc          -2.5248200  2.6276942  -0.961 0.339174    
poly(women, 2)1  0.3381270  9.2670315   0.036 0.970974    
poly(women, 2)2 14.5245798  7.1146127   2.042 0.044095 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.012 on 91 degrees of freedom
Multiple R-squared:  0.8422,    Adjusted R-squared:  0.8318 
F-statistic: 80.93 on 6 and 91 DF,  p-value: < 2.2e-16
The improvement is subtle but real: the residual standard error (the root mean square) has decreased from $7.012$ to $6.243$ and the residuals are no longer as extreme as they were. (Some adjustment to the p-values and adjusted R-squared statistics should be made to account for the preliminary estimation of two powers, but that discussion would make this post too lengthy.) In the model with transformed variables, the linear component of the quadratic function of `women` looks significant ($p \approx 0.011$), but it was far from significant in the original least squares model ($p \approx 0.97$). That may be of fundamental interest in sociological research.
Notice how little the parameter estimates changed between the models: that is what using $\phi$ rather than the powers $x\to x^\gamma$ or the Box-Cox function $\operatorname{BC}$ has accomplished for us. To some extent we may still interpret the coefficients as we always would: namely, as marginal rates of change. For instance, the original income estimate $\hat\beta_1 = 0.0009747$ might be interpreted as "increases of one unit of income are associated with changes of $+0.00097$ units of prestige." For the new estimate we might say "increases of one unit of income for people with average incomes are associated with changes of $+0.001938$ units of prestige." It would be fair to conclude that the model with the power transformations estimates the `income` coefficient to be about $0.0019/0.00097 \approx 2$ times that of the model without the power transformations, at least for typical incomes. This simple interpretation is possible only when using $\phi$ for the transformations -- not with $\operatorname{BC}$ or pure powers of the variables.
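This interpretation rests on the first-order behavior of $\phi$ noted earlier: since $\phi'(m_1;\gamma_1,m_1) = 1,$ the marginal effect of $x_1$ at its mean is the fitted coefficient itself,
$$\left.\frac{\partial \hat y}{\partial x_1}\right|_{x_1=m_1} = \hat\beta_1\,\phi'(m_1;\gamma_1,m_1) = \hat\beta_1 \approx 0.0019.$$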
The following R code produced the figures and shows how to use the `boxTidwell` and `lm` functions to fit the power model of the question.
library(car) # Exports `boxTidwell` and `Prestige` (a data frame)
#
# Remove records with missing values. (If included, several of these would
# be outliers, btw.)
#
df <- subset(Prestige, subset=!is.na(type))
# df$type <- with(df, factor(ifelse(is.na(type), "NA", as.character(type))))
#
# Plot relevant data.
#
pairs(subset(df, select=c(prestige, income, education)),
      pch=21, bg=hsv(as.numeric(df$type)/5,.8,.9,.75),
      main="Data")
#
# A good way to study the relationships is to take out the effects of the
# remaining covariates.
#
x <- residuals(lm(cbind(prestige, income, education) ~ type + poly(women, 2), df))
colnames(x) <- paste0(colnames(x), ".R")
pairs(x, pch=21, bg=hsv(as.numeric(df$type)/5,.8,.9,.75),
      main="Residuals")
#
# Estimate the Box-Cox (power) parameters.
#
obj <- boxTidwell(prestige ~ income + education, ~ type + poly(women, 2),
                  data=df, verbose=TRUE)
lambda <- obj$result[, "MLE of lambda"]
# lambda <- round(12*lambda) / 12
#
# Compute `phi`, the normalized B-C transformation.
#
BC <- function(x, p=1) {
  m <- mean(x, na.rm=TRUE)
  x <- x / m
  if(isTRUE(p==0)) m * (1 + log(x)) else m * (1 + (x^p - 1)/p)
}
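# (Added) Sanity check of this implementation: with p = 1, `BC` reduces to
# the identity, so it should return its input unchanged.
stopifnot(isTRUE(all.equal(BC(df$income, 1), df$income)))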
#
# Apply the estimated transformations.
#
df$income.BC <- BC(df$income, lambda["income"])
df$education.BC <- BC(df$education, lambda["education"])
#
# Plot their effects.
# s <- c(income="-1/3", education="5/4")
s <- sprintf("%.2f", lambda); names(s) <- names(lambda)
par(mfrow=c(1,2))
with(df,
     {
       plot(income, income.BC, asp=1, pch=21, bg="#00000040",
            main=bquote(paste("Power = ", .(s["income"]))))
       abline(0:1, lwd=2, col="Gray")
       plot(education, education.BC, asp=1, pch=21, bg="#00000040",
            main=bquote(paste("Power = ", .(s["education"]))))
       abline(0:1, lwd=2, col="Gray")
     }
)
par(mfrow=c(1,1))
#
# Study the relationships among the transformed variables.
#
pairs(subset(df, select=c(prestige, income.BC, education.BC)),
      pch=21, bg=hsv(as.numeric(df$type)/5,.8,.9,.75),
      main="Data with Transformed Predictors")
#
# Fit and study the full model (with transformations).
#
fit.BC <- lm(prestige ~ income.BC + education.BC + type + poly(women, 2), data=df)
summary(fit.BC)
par(mfrow=c(2,2))
plot(fit.BC, sub.caption="Box-Tidwell Model")
par(mfrow=c(1,1))
#
# Fit and study the model with no power transformations.
#
fit <- lm(prestige ~ income + education + type + poly(women, 2), data=df)
summary(fit)
par(mfrow=c(2,2))
plot(fit, sub.caption="No Transformations")
par(mfrow=c(1,1))