
I am working on a non-linear multi-output regression problem, for which I have created a simple neural network. The network is meant to yield a single point estimate $\hat{\theta}_{MAP}$ of its parameters $\theta$ (the weights and biases). To get my feet wet, I am working on a small example system with 1 independent variable ($x$), 1 dependent variable ($y$) and 1000 training data points.

The network was created in PyTorch and is shown in the code sample at the end of the post. In the left plot below, we see the loss for every minibatch; my conclusion from this plot is that the net is learning. In the right plot, we see the training data in blue, the test data in red, and the neural network prediction $f_{NN}(x^*)=y^*$ in yellow. I would expect the net to find a moving average over the data, but this is clearly not the case, especially for $x>1.0$. Because the yellow line is smooth and overlaps with the predictions on the training data (not shown in the plot), I conclude that we are not overfitting.

[Figure: left panel shows the per-minibatch training loss; right panel shows training data (blue), test data (red), and the network prediction (yellow).]

I tried combinations of the following settings, non-systematically, to find a neural net that better fits my expected moving-average line:

  • Learning rates in [0.00001, 0.1]
  • Weight decay with optim.AdamW (though I don't think we are overfitting)
  • Activation function nn.ReLU
  • Batch-sizes [32,64,128]
  • Different layer configurations (e.g. fewer dropout layers)
  • Optimizer optim.RMSprop
  • Using a scheduler

None of the above combinations helped much, so I suspect the poor performance is largely due to the uneven density of the data: there are far more points around $x \approx -1$ than around $x \approx 2$. Do you agree with this hypothesis? If so, what should I do about it? If not, what other phenomenon could explain why this regression model performs so poorly, and how do I address it? My real-world (non-toy) problem has ~80 independent variables and ~6 dependent variables, so if data density is indeed the problem, I am looking for a general way to deal with uneven data density in regression.

My own solution would be something like this: compute a kernel density estimate of the data and sample data points with probability inversely proportional to that estimate. Essentially, we would be bootstrapping the data and creating minibatches according to some density (a sketch follows below). Would this be a 'valid way' of addressing the problem of uneven data density? Am I correct in saying that the resampled data is then not i.i.d.? In reality, I want to build a Bayesian neural net that captures the uncertainty in my estimates (as you can tell from the plot, the blue points are noisy and heteroskedastic). The problem I see with bootstrapping the data is that the network is not fully Bayesian, since it sees some data points more often than others. Another problem is that the parameters of the kernel density estimate (e.g. the bandwidth of a Gaussian kernel) will affect the regression model's performance quite a lot and will require tuning.
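
For concreteness, here is a minimal sketch of that resampling idea. It assumes scipy.stats.gaussian_kde for the density estimate and torch's WeightedRandomSampler for drawing the inverse-density minibatches; both choices, and the toy arrays, are mine rather than part of the setup above:

import numpy as np
import torch
from scipy.stats import gaussian_kde
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# toy stand-ins for the real training arrays
x = np.random.randn(1000, 1).astype(np.float32)
y = (x ** 2).astype(np.float32)

# estimate the data density over x and weight each point by 1/density
kde = gaussian_kde(x.ravel())      # the bandwidth is the knob that will need tuning
weights = 1.0 / kde(x.ravel())

dataset = TensorDataset(torch.from_numpy(x), torch.from_numpy(y))
# draw with replacement so minibatches are approximately uniform over x
sampler = WeightedRandomSampler(weights=torch.from_numpy(weights),
                                num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

Note that DataLoader's shuffle must stay off when a sampler is supplied, and because we draw with replacement, each minibatch is a weighted bootstrap sample rather than an i.i.d. draw from the original data.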

Thank you for thinking along!

CODE SAMPLE:

from torch import nn
from torch import optim
from torch.utils.data import DataLoader
from tqdm import tqdm

DEVICE = 'cpu'
hN = 5

net = nn.Sequential(
    nn.Linear(1, hN),
    nn.Tanh(),
    nn.Dropout(), # we regularize by activation dropout (default p=0.5)
    
    nn.Linear(hN, hN),
    nn.Tanh(),
    nn.Dropout(),
    
    nn.Linear(hN, hN),
    nn.Tanh(),
    nn.Dropout(),
    
    nn.Linear(hN, 1),
).to(DEVICE)

optimizer = optim.Adam(params=net.parameters(), lr=0.01)
# scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.90)
scheduler = None

# TRAIN_DATA is assumed to be a Dataset of (x, y) pairs, defined elsewhere
LOADER = DataLoader(TRAIN_DATA, batch_size=32, shuffle=True)

LOSSES = []
NUM_EPOCHS = 40
for i in tqdm(range(NUM_EPOCHS)): # tqdm shows a progress bar over epochs
    for j, (X, y) in enumerate(LOADER):
        X, y = X.to(DEVICE), y.to(DEVICE)
        y_hat = net(X)
        loss = ((y - y_hat) ** 2).mean() # mean squared error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        LOSSES.append(loss.item())
    if scheduler is not None:
        scheduler.step() # decay the learning rate once per epoch, not per minibatch
  • I don't think that the issue is with the data distribution *per se*; I just think you're not training the model long enough. It will learn to fit the most common values first (because that's the easiest way to drive down the loss fastest), but then the lack of fit to the larger values of $x$ will be the next-easiest place to reduce the loss. Right now, you're training for 40 epochs; what happens if you tune a learning rate scheduler and train for 400 epochs? More suggestions: https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn/352037#352037 – Sycorax Oct 12 '21 at 17:00
  • Another thing to keep in mind when you're just trying to fit the training data well is that you can turn off all regularization -- it will make it much easier to diagnose problems because fixing the problem in your post is purely about finding a network configuration (number of hidden units, learning rate, number of layers, etc) that can fit (or even overfit) the training data. Once you can do that, then add regularization back in. Also, dropout with only 5 units will be very noisy -- hard to optimize. – Sycorax Oct 12 '21 at 17:02
  • Thank you for your input! My reasoning was that since the loss plateaued, the network was 'done' learning, and therefore adding epochs would not change much. The tip about turning off regularization is useful! I had tried with 2 dropout layers, but not with none at all. I will focus on that next. My previous tweaks were indeed inspired by that stats.stackexchange post. Do you have an opinion on the use of KDE estimates of data density? – Patrickens Oct 12 '21 at 17:43
  • Only 5 hidden nodes is too few. Ramp it up to 64 or 256, and see if it overfits (it should). – Firebug Oct 12 '21 at 17:44
  • Similar to the comments above, I think the issue here might have to do with the toy nature of the problem, not the data distribution. NNs generally have thousands of parameters and many inputs. Oversampling data isn't a terrible idea, but I don't think it will be necessary once you move to the real problem. – Tanner Phillips Oct 12 '21 at 17:46
  • @Patrickens I don't think there's any need to resample the data. Are you scaling the data prior to handing it off to the network? That can make a big difference. – Sycorax Oct 12 '21 at 19:16
  • Thanks a lot for all the comments. It seems that increasing the model complexity (32 hidden neurons in 3 layers) and getting rid of regularization (no dropout layers) has partly resolved the issue; a sketch of this revised setup follows below. @Sycorax the domain of all input and output variables is [0,1]. I am only scaling the input features by subtracting the mean and dividing by the variance. I'm also not using dropout as much anymore. – Patrickens Oct 13 '21 at 15:50
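
For reference, here is a minimal sketch of the revised setup described in the last comment (three hidden layers of 32 units, no dropout, standardized inputs). The exact details are my reconstruction; in particular, scaling by the standard deviation rather than the variance is the usual convention:

import torch
from torch import nn

hN = 32 # wider hidden layers, per the comments above

# same architecture as before, but with all dropout layers removed
net = nn.Sequential(
    nn.Linear(1, hN),
    nn.Tanh(),
    nn.Linear(hN, hN),
    nn.Tanh(),
    nn.Linear(hN, hN),
    nn.Tanh(),
    nn.Linear(hN, 1),
)

# standardize inputs to zero mean and unit variance before the forward pass
x = torch.rand(1000, 1)          # toy stand-in for inputs in [0, 1]
x_std = (x - x.mean()) / x.std()
y_hat = net(x_std)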
