I am working on a non-linear multi-output regression problem and have created a simple neural network for it. Training the net is supposed to yield a point estimate $\hat{\theta}_{\text{MAP}}$ of the parameters $\theta$ (weights and biases) of the network. To get my feet wet, I am working on a small example system with 1 independent variable (x), 1 dependent variable (y) and 1000 training data points.
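(To be explicit about the notation: by $\hat{\theta}_{\text{MAP}}$ I mean the usual maximum a posteriori point estimate of the network parameters,
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta \mid \mathcal{D}) = \arg\max_{\theta} p(\mathcal{D} \mid \theta)\, p(\theta),$$
i.e. the single parameter vector that maximizes the posterior given the training data $\mathcal{D}$.)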
The network was created with torch and is shown in the code sample at the end of the post. In the left plot below, we see the loss for every minibatch. My conclusion from this plot is that the net is learning. In the right plot, we see the training data in blue, the test data in red and the neural network prediction $f_{NN}(x^*)=y^*$ in yellow. I would expect the net to find a moving average over the data, but this is clearly not the case, especially for x > 1.0. Because the yellow line is smooth and overlaps with the predictions on the training inputs (not shown in the plot), I conclude that the network is not overfitting.
I tried (non-systematically) combinations of the following settings to find a network that better fits my expected moving-average line; a sketch of one such variant follows the list:
- Learning rates in [0.00001, 0.1]
- Weight decay with `optim.AdamW` (though I don't think we are overfitting)
- Activation function `nn.ReLU` instead of `nn.Tanh`
- Batch sizes [32, 64, 128]
- Different layer configurations (e.g. fewer dropout layers)
- Optimizer `optim.RMSprop`
- Using a learning-rate scheduler
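For example, one of the variants I tried looked roughly like this (an illustrative sketch; the concrete values differed between runs and are not the exact ones I used):

```python
from torch import optim

# One illustrative variant: AdamW with weight decay plus an exponential LR schedule.
# The values here are examples, not the precise configuration.
optimizer = optim.AdamW(net.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
```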
None of the above combinations helped much, so I think this poor performance is largely due to the uneven density of the data: there are far more points around x ≈ -1 than around x ≈ 2. Do you agree with this hypothesis? If so, what should I do about it? If not, what other phenomenon could explain why this regression model performs so poorly, and how do I address that? My real-world (non-toy) problem has ~80 independent variables and ~6 dependent variables, so if data density is indeed the problem, I am looking for a general way of dealing with uneven data density in a regression.
My own solution would be something like this: compute a kernel density estimate (KDE) of the data and sample data points with probability inversely proportional to that estimate. Basically, we are bootstrapping the data and creating minibatches according to some density; a sketch of this idea follows below. Would this be a 'valid' way of addressing the problem of uneven data density? Am I correct in saying that the resampled data are then no longer IID? In reality, I want to build a Bayesian neural net that captures the uncertainty in my estimates (as you can tell from the plot, the blue points are noisy and heteroskedastic). One problem I see with bootstrapping the data is that the network is then not fully Bayesian, since it sees some data points more often than others. Another problem is that the parameters of the kernel density estimate (e.g. the bandwidth of a Gaussian kernel) will affect the regression performance quite a lot and will require tuning.
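To make the idea concrete, here is a minimal sketch of the resampling I have in mind. It assumes the training inputs are available as a 1-D NumPy array X_TRAIN (not shown elsewhere in this post) and uses SciPy's gaussian_kde with its default bandwidth; the weighting and sampler setup are illustrative, not a finished solution:

```python
import numpy as np
from scipy.stats import gaussian_kde
from torch.utils.data import DataLoader, WeightedRandomSampler

# Assumption: X_TRAIN is a 1-D numpy array of training inputs and
# TRAIN_DATA is the corresponding torch Dataset of (x, y) pairs.
kde = gaussian_kde(X_TRAIN)          # bandwidth set by Scott's rule by default
density = kde(X_TRAIN)               # estimated density at each training point
weights = 1.0 / density              # sparse regions get sampled more often
weights = weights / weights.sum()

# Draw minibatches with probability inversely proportional to the local density
# (sampling with replacement, i.e. a weighted bootstrap of the training set).
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
LOADER = DataLoader(TRAIN_DATA, batch_size=32, sampler=sampler)
```

The training loop in the code sample below would stay the same; only the LOADER changes.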
Thank you for thinking along!
CODE SAMPLE:

```python
from torch import nn
from torch import optim
from torch.utils.data import DataLoader
from tqdm import tqdm
DEVICE = 'cpu'
hN = 5  # number of hidden units per layer
net = nn.Sequential(
    nn.Linear(1, hN),
    nn.Tanh(),
    nn.Dropout(),  # we regularize by activation dropout
    nn.Linear(hN, hN),
    nn.Tanh(),
    nn.Dropout(),
    nn.Linear(hN, hN),
    nn.Tanh(),
    nn.Dropout(),
    nn.Linear(hN, 1),
).to(DEVICE)
optimizer = optim.Adam(params=net.parameters(), lr=0.01)
# scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.90)
scheduler = None
LOADER = DataLoader(TRAIN_DATA, batch_size=32, shuffle=True)  # TRAIN_DATA: torch Dataset of (x, y) pairs, defined elsewhere
LOSSES = []
NUM_EPOCHS = 40
for i in tqdm(range(NUM_EPOCHS)):  # progress bar over epochs
    for j, (X, y) in enumerate(LOADER):
        X, y = X.to(DEVICE), y.to(DEVICE)
        y_hat = net(X)
        loss = ((y - y_hat) ** 2).mean()  # mean squared error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        LOSSES.append(loss.item())
    if scheduler is not None:
        scheduler.step()  # step the learning-rate scheduler once per epoch
```