I'm working on a regression problem with imbalanced data, and I would like to know if I'm weighting the errors correctly. I'll try to illustrate the concept with a simple example.
Imagine I'm building a model to predict house prices in New York and Los Angeles. I have many more training examples in NY than in LA, but I want the algorithm to perform equally well in both cities. To further complicate the issue, house prices in NY have a greater variance than those in LA.
Here is an example training dataset:
City N_rooms House_Price
NY 4 400
NY 7 1000
NY 5 800
NY 3 300
NY 7 600
NY 2 100
NY 4 500
LA 3 400
LA 5 500
LA 4 500
I have 7 training examples for NY and 3 training examples for LA. If my cost function is MSE, namely sum((y_pred - y_true)^2)/10, then to make sure that the algorithm performs equally well in both cities, I would need to apply different weights to the prediction errors, namely
sum(w * (y_pred - y_true)^2)/10
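Concretely, the weighted cost can be computed as below (the predictions are made up purely for illustration; with all weights set to 1 this reduces to the plain MSE):

```python
import numpy as np

# True prices (in $1000s) for the 10 training examples above:
# the first 7 rows are NY, the last 3 are LA.
y_true = np.array([400, 1000, 800, 300, 600, 100, 500, 400, 500, 500], dtype=float)
# Hypothetical model predictions, for illustration only.
y_pred = np.array([420, 950, 780, 310, 650, 120, 480, 390, 520, 510], dtype=float)

# Per-example weights w; all ones gives the unweighted MSE.
w = np.ones_like(y_true)

weighted_mse = np.sum(w * (y_pred - y_true) ** 2) / len(y_true)
print(weighted_mse)  # prints 730.0
```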
I would like to know which one of the following would be the correct way to define w and/or rescale the training data:

1. Do not use weights (i.e., w=1).
2. Define w as the inverse frequency of each class in the training set, namely w=1/3 for houses in LA and w=1/7 for houses in NY.
3. Standardize prices in NY and LA separately: subtract the average NY price from the price of every house in NY, then divide by the standard deviation of NY house prices; do the same for LA using the LA mean and standard deviation. Now train the regression model on the scaled data. To predict actual prices, apply the inverse scaling to the model predictions.
4. Apply both points 2 and 3.
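To make options 2 and 3 concrete, here is a sketch of what they could look like on the example data, using pandas (the column names and the combination of both options are my own illustration, not a recommendation):

```python
import numpy as np
import pandas as pd

# Training data from the example above (prices in $1000s).
df = pd.DataFrame({
    "City": ["NY"] * 7 + ["LA"] * 3,
    "N_rooms": [4, 7, 5, 3, 7, 2, 4, 3, 5, 4],
    "House_Price": [400, 1000, 800, 300, 600, 100, 500, 400, 500, 500],
})

# Option 2: inverse-frequency weights, w = 1 / n_city.
counts = df["City"].map(df["City"].value_counts())
df["w"] = 1.0 / counts

# Option 3: standardize prices per city, z = (price - city_mean) / city_std.
grp = df.groupby("City")["House_Price"]
df["mu"] = grp.transform("mean")
df["sigma"] = grp.transform("std")
df["z"] = (df["House_Price"] - df["mu"]) / df["sigma"]

# To recover an actual price from a model prediction z_hat for a given city:
# price_hat = z_hat * sigma_city + mu_city
print(df[["City", "w", "z"]])
```

Within each city the z column then has mean 0 and unit standard deviation, and the weights sum to 1 per city.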
Note: the goal is not only to minimize the overall error, but to build an algorithm that performs equally well in both cities.