
Problem description

I have a clearly defined fourth-order polynomial function. I sample $n$ points and feed them to the function to get the corresponding $y$ values. I then train a fully connected (FC) neural network to learn this function, which I presume should be a fairly simple task.

When I sample randomly from the input space, the neural network learns just fine: the training loss shrinks and gets close to 0 within a few epochs. When I switch to a special sampling method of my own and repeat the same process, the NN suddenly seems not to learn the function at all. I don't mean that it generalizes poorly to the test set; it does not even fit the training set. The training loss stays stagnant from epoch to epoch.

This is extremely perplexing to me. The NN learns just fine when the inputs are sampled randomly, but fails completely on another set of inputs generated with the alternative method. When I examine the model's predictions, it outputs the same value for every input. This was not the case when it was trained on randomly generated inputs.
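
One way to quantify the "same output value" symptom, as a minimal sketch: compare the training MSE with the MSE of the best constant predictor, which equals the variance of y. If the stagnant loss sits near that value and the predictions have almost no spread, the network has likely collapsed to predicting the mean. The helper name `collapse_report` and the synthetic arrays are hypothetical, used only for illustration.

```python
import numpy as np

def collapse_report(y_true, y_pred):
    """Compare a model's MSE with that of the best constant predictor."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    model_mse = np.mean((y_true - y_pred) ** 2)
    # The constant that minimizes MSE is mean(y); its MSE is the variance of y.
    baseline_mse = np.var(y_true)
    return {
        "model_mse": model_mse,
        "baseline_mse": baseline_mse,
        "pred_std": np.std(y_pred),  # near zero when all predictions coincide
    }

# Synthetic stand-ins for y_train and the model's predictions (illustrative only).
rng = np.random.default_rng(0)
y_demo = rng.normal(size=1000)
constant_pred = np.full_like(y_demo, y_demo.mean())
report = collapse_report(y_demo, constant_pred)
print(report)
```

In practice, `y_true` would be `y_train` and `y_pred` the model's outputs on `X_train`.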

Properties of the problematic data

I ran tSNE on the problematic input data and colored the points by y value. The low and high values seem to separate pretty well in this space, so I don't see why the NN can't learn the function properly.

tSNE plot

Per Davidmh's suggestion, I looked at a histogram of the y values. The horizontal axis is the function (y) value and the vertical axis is the count. The left panel shows the function values for the problematic inputs; the right panel shows those for the random inputs.

histogram

Code

The parameters of the polynomial and the problematic inputs can be found here: https://drive.google.com/drive/folders/14xqPK7M8msCJpdJ6qsHuLggNt90gcPjC

  1. Original function that I sampled from. The input has 32 dimensions.
import json
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# 4th degree polynomial features
poly = PolynomialFeatures(4)

# regression parameters stored in json file
reg = LinearRegression()
with open('regression.txt', 'r') as file:
    reg_params = json.load(file)
reg.coef_ = np.array(reg_params['coef_'])
reg.intercept_ = reg_params['intercept_']
reg.rank_ = reg_params['rank_']
reg.singular_ = np.array(reg_params['singular_'])

def source_function(X):
    try:
        X = poly.transform(X)
        return reg.predict(X)
    except ValueError:
        X = poly.transform(X.reshape(1,-1))
        return reg.predict(X)[0]
  2. Neural network
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
from torch.nn import Linear, ReLU

# MLP class
class mlp(nn.Module):
  def __init__(self, **kwargs):
    super().__init__()
    self.layer1 = nn.Linear(in_features=kwargs["input_shape"], out_features=48)
    self.layer2 = nn.Linear(in_features=48, out_features=48)
    self.layer3 = nn.Linear(in_features=48, out_features=48)
    self.layer4 = nn.Linear(in_features=48, out_features=48)
    self.pred = nn.Linear(in_features=48, out_features=1)
  
  def forward(self, features):
    x = self.layer1(features)
    x = torch.relu(x)
    x = self.layer2(x)
    x = torch.relu(x)
    x = self.layer3(x)
    x = torch.relu(x)
    x = self.layer4(x)
    x = torch.relu(x)
    prediction = self.pred(x)
    return prediction
  3. Sampled data
import scipy.sparse

# load the datapoints; this is the problematic dataset
# input domain is limited to [-1, 1] for all dimensions
X_train = np.genfromtxt('X_train.csv', delimiter=',')
# get corresponding y values by feeding X_train to the defined function
y_train = source_function(X_train)

# a randomly generated dataset
X_train_random = scipy.sparse.random(10000, 32, density=0.25).toarray()
# rescale to [-1, 1]; entries left at zero by the sparse sampler map to -1
X_train_random = X_train_random * 2 - 1
y_train_random = source_function(X_train_random)
  4. Training code
train_dataset = Data.TensorDataset(torch.from_numpy(X_train).type(torch.DoubleTensor),
                                   torch.from_numpy(y_train).type(torch.DoubleTensor))

train_loader = Data.DataLoader(dataset=train_dataset, batch_size=32, 
                               shuffle=True, num_workers=2,)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dim = X_train.shape[1]  # input dimensionality (32 here)
model = mlp(input_shape=dim).to(device)
model = model.double()

optimizer = optim.Adam(model.parameters(), lr=5e-3)

criterion = nn.MSELoss()

# Training loop

epochs = 500
for epoch in range(epochs):
    loss = 0
    for batch_features, out in train_loader:
        batch_features = batch_features.to(device)
        out = out.to(device)  # targets must live on the same device as the outputs
        
        optimizer.zero_grad()
        
        outputs = model(batch_features.double())
        
        train_loss = criterion(outputs, out)
        train_loss.backward()
        
        optimizer.step()

        loss += train_loss.item()
    
    # compute the epoch training loss
    loss = loss / len(train_loader)
    
    # display the epoch training loss
    print("epoch : {}/{}, loss = {:.6f}".format(epoch + 1, epochs, loss))
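
One thing that may be worth ruling out in the loop above: `model(...)` returns a tensor of shape `(batch, 1)`, while `out` has shape `(batch,)`. Under broadcasting, subtracting the two yields a `(batch, batch)` matrix, so the averaged squared error compares every prediction with every target. `nn.MSELoss` follows the same broadcasting rule and emits a size-mismatch warning in that case. The NumPy sketch below illustrates only the arithmetic, with made-up arrays; whether this explains the stagnant loss here is an assumption to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 32
targets = rng.normal(size=batch)        # shape (32,), like `out`
preds = targets.reshape(-1, 1).copy()   # shape (32, 1), like `outputs`; a perfect fit

# Mismatched shapes broadcast to (32, 32): every prediction vs. every target.
diff_broadcast = preds - targets
mse_broadcast = np.mean(diff_broadcast ** 2)

# Matching shapes give the intended element-wise error.
diff_aligned = preds.ravel() - targets
mse_aligned = np.mean(diff_aligned ** 2)

print(diff_broadcast.shape, mse_broadcast)  # (32, 32), nonzero despite the perfect fit
print(diff_aligned.shape, mse_aligned)      # (32,), exactly 0.0
```

In PyTorch terms, aligning the shapes would look like `criterion(outputs.squeeze(1), out)` or `criterion(outputs, out.unsqueeze(1))`. With the broadcast version, the loss is minimized by a constant prediction equal to the batch mean of the targets, which would be consistent with identical outputs for every input.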

Hardware: Google Colab non-GPU

Any help would be greatly appreciated!

  • I edited your question to remove the link. Pickles are generally considered an unsafe format; opening untrusted pickles can launch arbitrary code on your computer. If you want to share the data and code, please use formats that are auditable (e.g. source code, data in csv files etc). – Tim Sep 10 '21 at 05:06
  • Can you describe your special sampling procedure? How does it work? What are its properties? – Sycorax Sep 10 '21 at 05:08
  • Hi @Tim, thanks for the warning, I didn't know that. What format do you suggest I use? – Tianxun Zhou Sep 10 '21 at 05:08
  • Use csv, JSON or any other format that stores the data in plaintext. – Tim Sep 10 '21 at 05:11
  • Hi @Sycorax, I have a stochastic population-based optimizer, which generates a bunch of candidate points per iteration. I record all the candidate points generated over this whole process as I try to find the minima of this function. To prevent a large proportion of my points from coming from the minima region, I run the same optimization process on the maxima as well. – Tianxun Zhou Sep 10 '21 at 05:11
  • Lots of suggestions here https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn/352037#352037 What things have you tried to improve the model fit? – Sycorax Sep 10 '21 at 05:21
  • Hi @Sycorax, 1. I think the code should be bug free, since it trains well on another dataset. 2. I tried adding more hidden layers and increased the number of nodes per hidden layer from 32 previously to 48 now. 3. I played around with the learning rate as well. – Tianxun Zhou Sep 10 '21 at 05:32
  • A couple of things to try: (a) your input is in a 32-dimensional space, right? Do you get the same behaviour in 1 or 2-D? That would make it easier to plot. (b) On the original data, can you plot a histogram of the values of your training data for both sampling procedures? – Davidmh Sep 10 '21 at 05:38
  • Hi @Davidmh, thanks for your suggestions. I've edited the question to include the histogram plot. Indeed, the problematic sample has y values that are less well distributed. But I'm still not sure why the model can't over-fit to that. At this point, I'm not concerned about overfitting; for now I would actually prefer the model to over-fit to the training data rather than not learn at all. – Tianxun Zhou Sep 10 '21 at 05:57
  • @TianxunZhou try making the network bigger, possibly ridiculously so (say, 256 neurons). My rule of thumb for layer sizes is to go by powers of two, because that is the step size at which you start to see some differences. You should also try a shallower network: networks with more than one or two hidden layers and no residual connections have a harder time converging. – Davidmh Sep 10 '21 at 06:05
  • Pretty much all of your data has a narrow range of values. Because you're using optimization to identify the candidate points, the inputs have a very different distribution from the uniform random one (by design). One thing you might try, and this is a recommendation from the linked thread, is whitening the inputs to the network first, because this can make optimization easier. Dramatically changing the number of neurons and/or layers (up or down) in the network might also help. And you'll need to tune the learning rate in addition to all of these proposals. Best of luck. – Sycorax Sep 10 '21 at 13:44
  • As an aside, I don't understand what you're trying to gain by sampling data in this way. If the network is a good approximation to the function on $[-1,1]^{32}$ when you train it using the random data, that seems like a perfectly adequate result. Why are you using this special strategy for the purposes of *training*, instead of achieving a good fit and then getting predictions for your special data? – Sycorax Sep 10 '21 at 13:48
  • @Sycorax Thanks for the advice. Regarding why I'm trying to sample data this way: I am testing the idea of using a NN as a surrogate function for optimization when evaluating the actual function is too expensive. When I generate randomly sampled data, very few datapoints fall close to the optima of the function, presumably because it's hard to stumble upon them at random. Hence I try to address this problem by purposely generating many datapoints that have high function values, so the network has more of those examples to learn from. – Tianxun Zhou Sep 10 '21 at 14:42
  • Thanks for that context. I appreciate that your intent in this question is to solve this problem using NNs; however, there are a number of alternatives. In case it's of interest, here's a thread: https://stats.stackexchange.com/questions/193306/optimization-when-cost-function-slow-to-evaluate/193310#193310 – Sycorax Sep 10 '21 at 15:12
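
The whitening suggestion from the comments can be sketched as follows, assuming PCA whitening of the training inputs (ZCA whitening or plain standardization are alternatives). The function name `fit_whitener`, the `eps` regularizer, and the synthetic data are hypothetical choices for illustration, not part of the original code.

```python
import numpy as np

def fit_whitener(X, eps=1e-8):
    """Return (mean, W) such that (X - mean) @ W has roughly identity covariance."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue); eps guards against
    # division by zero for nearly degenerate directions.
    W = eigvecs / np.sqrt(eigvals + eps)
    return mean, W

# Synthetic correlated inputs standing in for X_train (illustrative only).
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.5, 0.0],
                   [0.0, 0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0, 0.5]])
X_demo = rng.normal(size=(5000, 4)) @ mixing

mean, W = fit_whitener(X_demo)
X_white = (X_demo - mean) @ W
cov_white = np.cov(X_white, rowvar=False)
print(np.round(cov_white, 3))
```

The same `mean` and `W` fitted on the training inputs would then be applied to any inputs passed to the network later.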
