
I am trying to model pricing data where the price depends on two parameters: the profession and the city of the user.

The model is very simple: $Price = avgPrice_{profession} \cdot \beta_{city}$ : for each profession we have an average price, corrected by a coefficient for each city.

With R, I used `lm` in the following way: `lm(Price ~ factor(Profession):factor(City), data)`. But R converts the factors into dummy variables and creates all interaction combinations.

Example: say we have 4 cities (NYC, Boston, Chicago, Miami) and 3 professions (Doctor, Lawyer, Driver). R tries to estimate all the interactions: `factor(city)NYC:factor(profession)Doctor`, `factor(city)NYC:factor(profession)Lawyer`, `factor(city)NYC:factor(profession)Driver`, `factor(city)Boston:factor(profession)Doctor`, `factor(city)Boston:factor(profession)Lawyer`, etc.

Instead, I would like R to find the following coefficients: `factor(city)NYC`, `factor(city)Boston`, `factor(city)Chicago`, `factor(city)Miami` and `factor(profession)Doctor`, `factor(profession)Lawyer`, `factor(profession)Driver`.

Is it possible, and if so, how should I configure my formula and `lm` parameters?

Train data :

train_data = structure(list(Profession = c("Doctor", "Lawyer", "Driver",
"Doctor", "Doctor", "Doctor"), City = c("Miami ", "Miami ", "Miami ", "Boston", 
"Chicago", "NYC"), Tarif = c(25.48, 29.99, 33.23, 25.49, 24.24, 
28.08)), .Names = c("Profession", "City", "Tarif"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))

Test data :

test_data = structure(list(Profession = c("Doctor", "Lawyer", "Driver", "Doctor", 
"Lawyer", "Driver", "Doctor", "Lawyer", "Driver", "Doctor", "Lawyer", 
"Driver"), City = c("Miami ", "Miami ", "Miami ", "Boston", "Boston", 
"Boston", "Chicago", "Chicago", "Chicago", "NYC", "NYC", "NYC"
), Tarif = c(25.48, 29.99, 33.23, 25.49, 30, 33.23, 24.24, 28.53, 
31.61, 28.08, 33.13, 36.77)), .Names = c("Profession", "City", 
"Tarif"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-12L))
Etienne

2 Answers


If you just want the main effects, then use a `+` instead of a `:` (or `*`) in your formula. Then you won't get the interaction terms.
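With the question's training data, the main-effects formula gives exactly one coefficient per non-baseline profession and per non-baseline city, plus an intercept. A minimal sketch (restating the data as a plain data frame, without the trailing space in "Miami "):

```r
train_data <- data.frame(
  Profession = c("Doctor", "Lawyer", "Driver", "Doctor", "Doctor", "Doctor"),
  City       = c("Miami", "Miami", "Miami", "Boston", "Chicago", "NYC"),
  Tarif      = c(25.48, 29.99, 33.23, 25.49, 24.24, 28.08)
)

# Main effects only: an intercept, two profession offsets and three
# city offsets -- six coefficients in total, no interaction terms.
fit <- lm(Tarif ~ Profession + City, data = train_data)
coef(fit)
```

Note that this is an additive model on the price scale, so the six observations are fitted with six parameters.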

However, I would recommend that you don't factorize your age. Binning continuous predictors is almost always a bad idea. If you believe that the effect of age on prices is non-linear (which certainly sounds valid), then consider transforming age via splines.

Stephan Kolassa
  • Main effects are not enough; this is not a `plus` but a `times` relation between the 2 parameters. I will look into the `splines` solution, but I also have 2 more factors which are not continuous (I kept the question simple, but my problem is quite a bit more complex). – Etienne Nov 07 '17 at 10:40
  • If you have four cities, then you will get four parameters (or one intercept and three offset parameters) for the cities alone. If you run an interaction between cities and age (continuous, not factor), then you will get one intercept, three city offsets, one overall age effect and three interaction terms, for a total of eight parameters. I think I do not fully understand what you are aiming for. Can you clarify? – Stephan Kolassa Nov 07 '17 at 10:48
  • Solving the problem with one factor and one continuous parameter is - of course - not a problem. But in my situation the age is not continuous and I have 2 more factors. So let me update my question: I have 4 `cities` and 4 `professions`. For each profession I have an average price and for each city a correction coefficient: $Price = avgPrice_{profession} \cdot \beta_{city}$. For the model, I have a list of prices for combinations of profession/city, and I want to find the model with average prices and the city coefficients (and predict new combinations). – Etienne Nov 07 '17 at 10:56
  • Can you edit your question to include a sample of your data? – Stephan Kolassa Nov 07 '17 at 12:26
  • Train & Test data added – Etienne Nov 07 '17 at 13:47
  • OK, thank you for the data. What is your desired output? Judging from the paragraph starting "Instead, I would like R to find the following coefficients" in your question, `with(train_data,lm(Tarif~Profession+City))` should do what you want, so you seem to want something else. Are you looking for a multiplicative model? If so, you could take logarithms of your `Tarif`. – Stephan Kolassa Nov 07 '17 at 14:58
  • Thanks, the use of logarithms is brilliant. I'm indeed looking for a multiplicative model. `train_data %>% lm(log(Tarif) ~ Profession + City,.) %>% predict(test_data) %>% exp()` gives the correct values. But with a more complex model `lm` may not be the right tool... Maybe `optim` is indeed more flexible? – Etienne Nov 08 '17 at 08:45
  • `lm` is shorthand for "linear model", so it can deal with anything that is linear or can be linearized, e.g., by taking logs. If you need a non-linear model, then you can set up the equation and maximize the (log) likelihood directly using `optim`, or there may be specialized packages. – Stephan Kolassa Nov 08 '17 at 09:51
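The log-transform idea from the comments can be written out in full against the question's data (a sketch; the trailing space in "Miami " is kept as in the question, and is harmless as long as it is consistent between train and test):

```r
train_data <- data.frame(
  Profession = c("Doctor", "Lawyer", "Driver", "Doctor", "Doctor", "Doctor"),
  City       = c("Miami ", "Miami ", "Miami ", "Boston", "Chicago", "NYC"),
  Tarif      = c(25.48, 29.99, 33.23, 25.49, 24.24, 28.08)
)
test_data <- data.frame(
  Profession = rep(c("Doctor", "Lawyer", "Driver"), 4),
  City       = rep(c("Miami ", "Boston", "Chicago", "NYC"), each = 3),
  Tarif      = c(25.48, 29.99, 33.23, 25.49, 30, 33.23,
                 24.24, 28.53, 31.61, 28.08, 33.13, 36.77)
)

# Additive on the log scale = multiplicative on the price scale:
# log(Price) = log(avgPrice_profession) + log(beta_city)
fit  <- lm(log(Tarif) ~ Profession + City, data = train_data)
pred <- exp(predict(fit, newdata = test_data))
```

The back-transformed predictions come out close to the test `Tarif` values (small discrepancies remain for NYC, where the test data deviate slightly from an exactly multiplicative structure).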

I cannot think of a way to do this with `lm`, but you can do it with `optim`. I hard-coded the train_data set in my function `res.sum.squares`. There will most certainly be a more elegant way to code this, but for simplicity, and with this small a data set, this is my approach:

res.sum.squares <- function(coeff){
  # Profession coefficients (average prices)
  doc <- coeff[1]
  law <- coeff[2]
  dri <- coeff[3]
  # City coefficients (correction factors)
  mia <- coeff[4]
  bos <- coeff[5]
  chi <- coeff[6]
  nyc <- coeff[7]

  # Residual sum of squares over the six training observations
  sum((doc*mia - 25.48)^2,
      (law*mia - 29.99)^2,
      (dri*mia - 33.23)^2,
      (doc*bos - 25.49)^2,
      (doc*chi - 24.24)^2,
      (doc*nyc - 28.08)^2)
}

optim(c(doc=1, law=1, dri=1, mia=1, bos=1, chi=1, nyc=1),
  fn = res.sum.squares, method="BFGS")

The result for this training data is:

$par
      doc       law       dri       mia       bos       chi 
11.177929 13.156439 14.577809  2.279492  2.280387  2.168559 
      nyc 
 2.512093 

$value
[1] 9.467214e-25

$counts
function gradient 
      76       26 

$convergence
[1] 0

Please note that we searched for seven coefficients from only six observations. This will not yield reliable results, and it is probably a reason for the extremely low residual sum of squares as well. `lm` would have given us standard errors for the coefficients; `optim` does not do that. At least it is a way to solve your equations.
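Part of the over-parameterisation is a pure scale ambiguity: multiplying every profession coefficient by a constant $c$ and dividing every city coefficient by $c$ leaves all predicted prices unchanged, so only six of the seven parameters are identified. One way to handle this (my own rescaling, not part of the original answer) is to normalise the city coefficients to average 1 after the fit, so the profession coefficients can be read as average predicted prices across the four cities:

```r
# Residual sum of squares for the multiplicative model, with the six
# training observations hard-coded as above.
rss <- function(coeff) {
  prof <- coeff[c(1, 2, 3, 1, 1, 1)]  # Doctor, Lawyer, Driver, Doctor x3
  city <- coeff[c(4, 4, 4, 5, 6, 7)]  # Miami x3, Boston, Chicago, NYC
  sum((prof * city - c(25.48, 29.99, 33.23, 25.49, 24.24, 28.08))^2)
}
fit <- optim(rep(1, 7), rss, method = "BFGS")

# Fix the scale: divide the cities by their mean, multiply the
# professions by it. Predicted prices are unchanged.
s   <- mean(fit$par[4:7])
par <- c(fit$par[1:3] * s, fit$par[4:7] / s)
```

After this step the city coefficients scatter around 1 and the first three entries are the per-profession prices averaged over the four cities.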

Bernhard
  • Using `optim` to do a manual `lm`-style optimisation looks better. With constraints so that doctor, lawyer and driver get average prices, it's perfect: `t = constrOptim(c(doc=1, law=1, dri=1, mia=1, bos=1, chi=1, nyc=1), f = res.sum.squares, grad=NULL,ui = rbind(c(0,0,0,1,1,1,1) , c(0,0,0,-1,-1,-1,-1) ),ci = c(3.9999,-4.0001))` – Etienne Nov 08 '17 at 10:49