
I am performing a regression task on a relatively small dataset (4000 observations). The observations are grouped in such a way that the dependent variable takes only about 170 distinct values, so each value of the dependent variable is shared, on average, by roughly 20 distinct sets of independent variables.

My initial approach has been a simple linear regression. However, when I plot the out-of-sample predicted vs. actual values as a scatter plot, there seems to be no fit. Tree-based models such as RandomForest and XGBoost show similar results.

I have tried multiple approaches, including relaxing the hyperparameters of my models and using k-fold cross-validation to assess performance over multiple splits, but the performance of the models remains the same.

I cannot expand this dataset, and reducing it to only contain 170 values isn't an option either given how the problem is structured (we want to see the effect of the independent variables in each observation on the dependent variable). What other techniques and methods should I look at to improve my model's performance?

Also, when I look at the in-sample predictions of my XGBoost model, I see that it predicts the same value for all observations that share a value of the dependent variable, even though their independent variables are slightly different. Shouldn't the model predict slightly different values per observation, even in-sample?

More on the data: the independent variables are the price and specifications of products, and the dependent variable is the price of another company's product in the same category as the product whose features are used as independent variables.

For each product whose price I am predicting, I have several different comparable products. The assumption is that, given a comparable product, I want to see what the price would be for a product in the same space that is yet to be released. With different comparables for each product, I should ideally get a distribution of prices as the output of my regression.

kjetil b halvorsen
jitmanchan
  • Can you explain **why** there are some duplicate values, and give details (formula ...) of the models you are fitting? There is very little information to go on in your post ... Please add new information as an edit to the post, not only as comments. – kjetil b halvorsen Oct 11 '20 at 19:00
  • @kjetilbhalvorsen Sure, I have added more information about the data. Apologies for not doing so earlier; I should have been more specific. – jitmanchan Oct 11 '20 at 23:06
  • Does this mean that, for several of your company's products whose price you want to determine, the **same** competing product is used for the response? If so, then you certainly cannot use an independence assumption! – kjetil b halvorsen Oct 11 '20 at 23:10
  • @kjetilbhalvorsen Yep, exactly. What would you suggest to solve this kind of problem? I have tried two approaches so far: 1. Taking the mean of all the comparable products for a given product I am trying to find the price for. But then I have very little data to work with (about 170 observations), and this does not generalize well. 2. Adding noise to the training set (mean 0, std dev 1), thinking that it would help with the independence issue. But the out-of-sample fit is the same as before. – jitmanchan Oct 11 '20 at 23:17
  • What is the **ultimate goal** of this modeling exercise? Is it just to decide prices, or something else? Do you have some calculations of cost-based prices? – kjetil b halvorsen Oct 12 '20 at 02:28
  • @kjetilbhalvorsen the goal of the exercise is to deduce prices, yes. I do have cost-based estimates for a few products (about 30/170), but I could ask for estimates or calculate them myself for the others. This should give me a reasonable estimate of what the price should be purely based on the cost of production. Thanks for the suggestions! – jitmanchan Oct 12 '20 at 21:40

1 Answer


Let us write a simple linear regression model with a random effect $\alpha$ shared by all observations that have a common response (so it has 170 different realizations): $$ Y_i=\mu+\alpha+X_i^T\beta +\epsilon_i $$ where $\epsilon_i$ is the error term. Now consider one of the groups with a common response. Within that group we have $$\underbrace{Y_i-\mu-\alpha}_{\text{constant!}}=X_i^T\beta+\epsilon_i $$ so $\epsilon_i$ equals a constant minus $X_i^T\beta$, which means the covariables $X_i$ and the errors $\epsilon_i$ cannot be independent. That can explain the estimation problems you see, since such dependence can destroy the consistency of the least-squares estimator.

If you have, say, some independent calculation of product costs (not based on the regression modelling), you could use that as an instrument in instrumental variables (IV) estimation; search this site for the term. I have no experience with such methods, so take this only as a suggestion.
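To make the idea concrete, here is a toy simulation of the simplest IV estimator (one regressor, one instrument), written with the standard library only. It is a sketch under assumptions and not your data: `z` stands in for a hypothetical independent cost estimate, `x` for a regressor that is correlated with the error, and the true slope of 2.0 is made up. Whether a cost estimate really satisfies the instrument conditions (correlated with the regressor, uncorrelated with the error) would have to be argued for your problem.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def ols_slope(x, y):
    """Least-squares slope of y on x (single regressor with intercept)."""
    return cov(x, y) / cov(x, x)


def iv_slope(z, x, y):
    """Simple IV estimate of the slope of y on x, using z as the instrument."""
    return cov(z, y) / cov(z, x)


def simulate(n=20000, beta=2.0, seed=0):
    """Generate toy data in which x is correlated with the error term."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)        # instrument (hypothetical cost estimate)
        ui = rng.gauss(0, 1)        # unobserved part shared by x and the error
        xi = zi + ui                # regressor depends on z and on u
        ei = ui + rng.gauss(0, 1)   # error is correlated with x through u
        z.append(zi)
        x.append(xi)
        y.append(beta * xi + ei)
    return z, x, y


z, x, y = simulate()
print("OLS slope:", round(ols_slope(x, y), 3))
print("IV slope: ", round(iv_slope(z, x, y), 3))
```

In this setup the least-squares slope settles near 2.5 rather than the true 2.0, because the regressor and the error share the component `u`, while the IV estimate should land close to 2.0.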

Addition: Since you are using cross-validation, keep the groups with a common response together (all in the same fold) to get a realistic assessment on your data.
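A minimal sketch of such a grouped split, using only the standard library. The function name `group_k_fold` and the greedy balancing rule are my own choices for illustration; scikit-learn provides a `GroupKFold` splitter in `sklearn.model_selection` for the same purpose. Here `groups` is assumed to hold one label per observation, for example the shared response value or an id of the product being priced.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs so that no group is split across them."""
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    if n_splits > len(members):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, always into the
    # fold that currently holds the fewest observations.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(members, key=lambda g: len(members[g]), reverse=True):
        min(folds, key=len).extend(members[label])

    for k, fold in enumerate(folds):
        test_idx = sorted(fold)
        train_idx = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train_idx, test_idx


# Usage: observations that share a response value share a group label.
groups = [10.0, 10.0, 12.5, 12.5, 12.5, 9.9, 9.9, 14.0]
for train_idx, test_idx in group_k_fold(groups, n_splits=2):
    print("train:", train_idx, "test:", test_idx)
```

With a plain k-fold split, observations from the same group land in both the training and the test part, so the out-of-sample score is not a fair picture of how the model does on a new product.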

kjetil b halvorsen