17

In R, if I write

lm(a ~ b + c + b*c) 

would this still be a linear regression?

How do I do other kinds of regression in R? I would appreciate any recommendations for textbooks or tutorials.

kjetil b halvorsen
suprvisr
  • I tried to reword your question a little. I'm afraid it looks like you're asking two very different questions. For the second one, a lot of resources are available on this site, but also on [CRAN](http://cran.r-project.org/). – chl Mar 23 '11 at 20:11
  • @chl, yep, thanks, I wasn't clear. My question is really this: if I write `lm` in R, does R always understand it as linear, or does it try to fit any model, not necessarily a linear regression? – suprvisr Mar 30 '11 at 20:06
  • No, `lm()` stands for a linear regression. Your model includes three parameters (minus the intercept) for `b`, `c`, and their interaction `b:c`, which stands for `b + c + b:c` or `b*c` for short (R follows Wilkinson's notation for statistical models). Fitting a Generalized Linear Model (i.e., where the link function is not identity, as is the case for the linear model expressed above) is requested through `glm()`. – chl Mar 30 '11 at 20:16

5 Answers

31

Linear refers to the relationship between the parameters that you are estimating (e.g., $\beta$) and the outcome (e.g., $y_i$). Hence, $y=e^x\beta+\epsilon$ is linear, but $y=e^\beta x + \epsilon$ is not. A linear model means that your estimate of your parameter vector can be written $\hat{\beta} = \sum_i{w_iy_i}$, where the $\{w_i\}$ are weights determined by your estimation procedure. Linear models can be solved algebraically in closed form, while many non-linear models need to be solved by numerical maximization using a computer.
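This distinction can be demonstrated in R: $y=e^x\beta+\epsilon$ is linear in $\beta$, so `lm()` fits it once $e^x$ is supplied as the predictor (a sketch with simulated data; the true value $\beta = 2$ and the noise level are arbitrary choices):

```r
set.seed(1)
x <- runif(50)
y <- 2 * exp(x) + rnorm(50, sd = 0.1)  # y = exp(x)*beta + eps with beta = 2

# Linear in beta: exp(x) is just a transformed predictor
fit <- lm(y ~ exp(x) - 1)  # "- 1" drops the intercept to match the model
coef(fit)                  # estimate close to 2
```

By contrast, $y=e^{\beta x}+\epsilon$ has no such representation and would need `nls()` or another numerical optimizer.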

Charlie
  • +1 Specifically, in a "linear model" the dependent variable $y$ is a linear function of the *parameters* but not necessarily of the data. – whuber Mar 24 '11 at 21:14
  • The first one is linear? Really, the one with $e$ raised to the power $x$? – suprvisr Mar 25 '11 at 16:12
  • Yes, because $x$ is not the quantity of interest (the one you optimize for) but instead $\beta$ is. Thus, it is linear in $\beta$. – bayerj Apr 01 '11 at 15:34
  • +1, but this answer could be improved by commenting on the formula in the question. – naught101 Apr 17 '12 at 05:26
  • I notice, upon a second reading, that the second half of this reply confuses "linear model" with "linear estimator." The two concepts are separate and different. Nonlinear models often have linear estimators and linear models can have nonlinear estimators (consider GLMs, for instance). – whuber Apr 17 '12 at 06:23
  • Isn’t $y=e^x \beta + \epsilon$ an *affine* function of $\beta$? And likewise for several parameters $\beta_0, \beta_1, \dots, \beta_n$, i.e. $Y=\beta X +\epsilon$? – schn Dec 21 '20 at 17:54
  • nonlinear models need to be solved by a computer? What a hurdle :D – rep_ho Apr 16 '21 at 06:47
  • But in $y=e^x\beta+\epsilon$, $e^x$ is still a constant just like $x$ in $y=x\beta+\epsilon$. – MAC Oct 30 '21 at 08:24
7

This post at minitab.com provides a very clear explanation:

  • A model is linear when it can be written in this format:
    • Response = constant + parameter * predictor + ... + parameter * predictor
      • That is, when each term (in the model) is either a constant or the product of a parameter and a predictor variable.
    • So both of these are linear models:
      • $Y = B_0 + B_1X_1$ (This is a straight line)
      • $Y = B_0 + B_1X_1^2$ (This is a curve)
  • If the model cannot be expressed using the above format, it is non-linear.
    • Examples of non-linear models:
      • $Y = B_0 + X_1^{B_1}$
      • $Y = B_0 \cdot \cos (B_1 \cdot X_1)$
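In R, this difference shows up as `lm()` versus `nls()`. A sketch with simulated data (the parameter values and starting guesses are arbitrary choices):

```r
set.seed(42)
X1 <- runif(100, 0, 3)

# Linear model, even though the fitted curve is a parabola:
Y1 <- 1 + 2 * X1^2 + rnorm(100, sd = 0.2)
fit_lin <- lm(Y1 ~ I(X1^2))  # I() protects ^ from its special formula meaning

# Non-linear model: B1 sits inside cos(), so numerical fitting is needed:
Y2 <- 1 * cos(2 * X1) + rnorm(100, sd = 0.05)
fit_nl <- nls(Y2 ~ B0 * cos(B1 * X1), start = list(B0 = 1.5, B1 = 1.8))
```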
Silverfish
Patrick Ng
4

I would be careful in asking this as an "R linear regression" question versus a "linear regression" question. Formulas in R have rules that you may or may not be aware of. For example:

http://wiener.math.csi.cuny.edu/st/stRmanual/ModelFormula.html

Assuming you're asking if the following equation is linear:

a = coeff0 + (coeff1 * b) + (coeff2 * c) + (coeff3 * (b*c))

The answer is yes, if you assemble a new independent variable such as:

newv = b * c

Substituting the above newv equation into the original equation probably looks like what you're expecting for a linear equation:

a = coeff0 + (coeff1 * b) + (coeff2 * c) + (coeff3 * newv)
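You can check in R that the two formulations give identical fits (a sketch with simulated data; `newv` follows the naming above):

```r
set.seed(7)
b <- runif(100)
c <- runif(100)
a <- 1 + 2*b + 3*c - 4*b*c + rnorm(100)

newv <- b * c                  # precompute the interaction by hand
fit1 <- lm(a ~ b + c + b:c)    # R builds the product term itself
fit2 <- lm(a ~ b + c + newv)   # same model with the product as a new variable

all.equal(unname(coef(fit1)), unname(coef(fit2)))  # TRUE: identical estimates
```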

As far as references go, Google "r regression", or whatever you think might work for you.

bill_080
  • How does renaming something make it linear? I don't understand, if the identity newv = b * c holds, it's not linear at all. I am confused. – bayerj Mar 24 '11 at 08:21
  • @bayer: newv is a new variable. The new equation is a linear function of three variables (b, c, newv), where the coefficients provide a linear relationship. Neither equation is a linear combination of just two variables. – bill_080 Mar 24 '11 at 15:17
  • @bayer See the reply by @Charlie. In the present example, *both* models are linear (whether or not R views them as such) because in both of them `a` is a linear function of the four coefficients. – whuber Mar 24 '11 at 21:16
  • Thanks, it makes sense... can I simply add a new variable newv equal to b*c for each case in the database (medical) and then treat it as a linear regression? – suprvisr Mar 30 '11 at 20:05
3

You can write out the linear regression as a (linear) matrix equation.

$ \left[ \matrix{a_1 \\a_2 \\a_3 \\a_4 \\a_5 \\ \vdots \\ a_n} \right] = \left[ \matrix{b_1 & c_1 & b_1*c_1 \\ b_2 & c_2 & b_2*c_2 \\b_3 & c_3 & b_3*c_3 \\b_4 & c_4 & b_4*c_4 \\b_5 & c_5 & b_5*c_5 \\ & \vdots & \\ b_n & c_n & b_n*c_n } \right] \times \left[\matrix{\alpha_b \\ \alpha_c \\ \alpha_{b*c}} \right] + \left[ \matrix{\epsilon_1 \\\epsilon_2 \\\epsilon_3 \\\epsilon_4 \\\epsilon_5 \\ \vdots \\ \epsilon_n} \right] $

or if you collapse this:

$\mathbf{a} = \alpha_b \mathbf{b} + \alpha_c \mathbf{c} + \alpha_{b*c} \mathbf{b*c} + \mathbf{\epsilon} $

This linear regression is equivalent to finding the linear combination of vectors $\mathbf{b}$, $\mathbf{c}$ and $\mathbf{b*c}$ that is closest to the vector $\mathbf{a}$.

(This also has a geometric interpretation: finding the projection of $\mathbf{a}$ onto the span of the vectors $\mathbf{b}$, $\mathbf{c}$ and $\mathbf{b*c}$. For a problem with two column vectors and three measurements this can still be drawn as a figure, for instance as shown here: http://www.math.brown.edu/~banchoff/gc/linalg/linalg.html )
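The projection view can be verified numerically: the `lm()` coefficients coincide with the normal-equations solution $(X^\top X)^{-1}X^\top \mathbf{a}$ (a sketch with simulated data):

```r
set.seed(11)
b <- runif(30)
c <- runif(30)
a <- 2*b - c + 0.5*b*c + rnorm(30)

X <- cbind(b, c, bc = b * c)                # design matrix (no intercept here)
beta_proj <- solve(t(X) %*% X, t(X) %*% a)  # project a onto the span of X's columns
beta_lm   <- coef(lm(a ~ X - 1))            # the same fit via lm()

all.equal(unname(drop(beta_proj)), unname(beta_lm))  # TRUE
```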


Understanding this concept is also important in non-linear regression. For instance, it is much easier to solve $y=a e^{ct} + b e^{dt}$ than $y=u(e^{c(t-v)}+e^{d(t-v)})$, because the first parameterization allows the $a$ and $b$ coefficients to be solved for with linear-regression techniques once $c$ and $d$ are fixed.
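A sketch of why the first parameterization helps (the values of $c$ and $d$ are arbitrary choices, held fixed so that only the linear step is shown):

```r
set.seed(3)
t <- seq(0, 2, length.out = 50)
y <- 3 * exp(-1 * t) + 1.5 * exp(-4 * t) + rnorm(50, sd = 0.05)

# With c and d fixed, y = a*exp(c*t) + b*exp(d*t) is linear in a and b,
# so an outer search over (c, d) only needs an lm() call at each step:
c_fix <- -1
d_fix <- -4
fit <- lm(y ~ exp(c_fix * t) + exp(d_fix * t) - 1)
coef(fit)  # estimates of a and b, close to 3 and 1.5
```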

Sextus Empiricus
  • I feel this is the best answer, because it answers the question Why instead of just What. Answering with "What" does not lead to better intuition. – Hexatonic Oct 20 '18 at 16:32
0

The specific answer to the question is "yes, that is a linear model". In an R formula the `*` operator creates what is known as an interaction. If the two variables are both continuous, the new variable created is their mathematical product, but the operator also has a meaning when one or both variables are categorical (known as factors in R parlance). The model is called linear because the formula relates the left-hand side to the right-hand side through a parameter for each term, and these parameters enter as "linear" constants that are solved for to minimize the total deviation of the data from the model. The answer from Sextus Empiricus lays that out formally:

$\mathbf{a} = \alpha_b \mathbf{b} + \alpha_c \mathbf{c} + \alpha_{b*c} \mathbf{b*c} + \mathbf{\epsilon} $

In R the variables a, b, and c can be defined in a manner that will produce a "non-planar" interaction. (I chose that term because the phrase "non-linear" would conflict with its meaning in regression terminology.) The best-fit interaction model will be a twisted surface.

 c <- runif(100)
 b <- runif(100)
 a <- 3*b + 6*c - 8*b*c + rnorm(100)
 # higher combined values of b & c will be lower than without the interaction
 ls.fit <- lm(a ~ b + c + b*c)  # formula could have been just a ~ b*c
 summary(ls.fit)
#--------------------
Call:
lm(formula = a ~ b + c + b * c)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61259 -0.50276  0.09259  0.69230  2.11442 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.6965     0.3540  -1.967    0.052 .  
b             3.7147     0.6363   5.838 7.18e-08 ***
c             7.3041     0.6500  11.237  < 2e-16 ***
b:c          -9.5917     1.2091  -7.933 3.94e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9264 on 96 degrees of freedom
Multiple R-squared:  0.5973,    Adjusted R-squared:  0.5847 
F-statistic: 47.46 on 3 and 96 DF,  p-value: < 2.2e-16
#-------------------
 library(lattice)  # wireframe() comes from the lattice package
 y <- predict( lm(a~b+c+b*c), # predict idealized values from rectangular grid
                newdata=expand.grid(b=seq(0,1,length=20),
                                    c=seq(0,1,length=20)) )
png()
 print( wireframe( y~b+c, data=data.frame(
                              y,
                              expand.grid(b=seq(0,1,length=20),
                                          c=seq(0,1,length=20))) ,
                screen = list(z = 90, x = -60)) )  # lattice plots need print() inside scripts
dev.off()  # now insert it in answer

(Image: wireframe plot of the fitted twisted surface)

DWin