I'm working on a regression problem with imbalanced data, and I would like to know if I'm weighting the errors correctly. I'll try to illustrate the concept with a simple example.
Imagine I'm building a model to predict house prices in New York and Los Angeles. I have many more training examples in NY than in LA, but I want the algorithm to perform equally well in both cities. To further complicate the issue, house prices in NY have a greater variance than those in LA.
Here is an example training dataset:
City N_rooms House_Price
NY 4 400
NY 7 1000
NY 5 800
NY 3 300
NY 7 600
NY 2 100
NY 4 500
LA 3 400
LA 5 500
LA 4 500
I have 7 training examples for NY and 3 training examples for LA. If my cost function is MSE, namely sum((y_pred - y_true)^2)/10, then to make sure that the algorithm performs equally well in both cities, I would need to apply different weights to the prediction errors, namely
sum(w * (y_pred - y_true)^2)/10
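Concretely, the weighted cost can be computed as below (the predictions are made up purely for illustration; with all weights set to 1 this reduces to the plain MSE):

```python
import numpy as np

# True prices (in $1000s) for the 10 training examples above:
# the first 7 rows are NY, the last 3 are LA.
y_true = np.array([400, 1000, 800, 300, 600, 100, 500, 400, 500, 500], dtype=float)
# Hypothetical model predictions, for illustration only.
y_pred = np.array([420, 950, 780, 310, 650, 120, 480, 390, 520, 510], dtype=float)

# Per-example weights w; all ones gives the unweighted MSE.
w = np.ones_like(y_true)

weighted_mse = np.sum(w * (y_pred - y_true) ** 2) / len(y_true)
print(weighted_mse)  # prints 730.0
```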
I would like to know which one of the following would be the correct way to define w and/or rescale the training data:

1. Do not use weights (i.e., w=1).
2. Define w as the inverse frequency of each class in the training set, namely w=1/3 for houses in LA and w=1/7 for houses in NY.
3. Standardize prices in NY and LA separately: subtract the average NY price from the price of every house in NY, then divide by the standard deviation of NY house prices; do the same for LA using the LA mean and standard deviation. Now train the regression model on the scaled data. To predict actual prices, apply the inverse scaling to the model predictions.
4. Apply both points 2 and 3.
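To make options 2 and 3 concrete, here is a sketch of what they could look like on the example data, using pandas (the column names and the combination of both options are my own illustration, not a recommendation):

```python
import numpy as np
import pandas as pd

# Training data from the example above (prices in $1000s).
df = pd.DataFrame({
    "City": ["NY"] * 7 + ["LA"] * 3,
    "N_rooms": [4, 7, 5, 3, 7, 2, 4, 3, 5, 4],
    "House_Price": [400, 1000, 800, 300, 600, 100, 500, 400, 500, 500],
})

# Option 2: inverse-frequency weights, w = 1 / n_city.
counts = df["City"].map(df["City"].value_counts())
df["w"] = 1.0 / counts

# Option 3: standardize prices per city, z = (price - city_mean) / city_std.
grp = df.groupby("City")["House_Price"]
df["mu"] = grp.transform("mean")
df["sigma"] = grp.transform("std")
df["z"] = (df["House_Price"] - df["mu"]) / df["sigma"]

# To recover an actual price from a model prediction z_hat for a given city:
# price_hat = z_hat * sigma_city + mu_city
print(df[["City", "w", "z"]])
```

Within each city the z column then has mean 0 and unit standard deviation, and the weights sum to 1 per city.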
Note: the goal is not only to minimize the overall error, but to build an algorithm that performs equally well in both cities.