
I am trying to build a deep learning model on data with the following structure:

user      feature  binary_label
1         100      0
2         200      1
3         140      0
...       ...      ...
6000000   188      1

But the problem is that when I try to use all of the data I run out of memory.

Since some users have the same feature and label, I rewrote my SQL to

select
  count(user) users,
  feature,
  binary_label
from
  table
group by 2,3

which returns

users  feature  binary_label
1132   100      0
2      200      1
3435   140      0
...    ...      ...
3251   188      1

Since my users are aggregated, I cannot feed this in as-is: at prediction time the user count would always be one (I predict per user). But I think I'd lose information if I only used the feature, since the number of users falling in a particular segment adds weight to the prediction model. What would be the best approach to make use of this column in Python?

The current model that only uses the feature is

from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X_train, X_test, y_train, y_test = train_test_split(
    df.feature.values,
    df.binary_label.values,
    test_size=.2,
    shuffle=True)

model = Sequential()

model.add(Dense(4, input_shape=(1,), activation='tanh'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='sgd', loss='binary_crossentropy')

model.fit(X_train, y_train, epochs=100)

model.evaluate(X_test, y_test)

which, when evaluated, returns 0.7.


1 Answer


The loss function binary_crossentropy is equivalent to the negative log-likelihood of a Bernoulli model that estimates a success probability $p$. In this case, you're using a neural network $f$ to estimate $p$ as a function of the feature $x$, so we have $f(x_i)=p_i$. It's perfectly appropriate to use binary_crossentropy in the first case, where each observation is a single trial of the binary outcome $y\in \{0,1\}$.
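For reference, with binary targets $y_i \in \{0,1\}$ and predicted probabilities $p_i = f(x_i)$, that loss over $N$ observations is

$$-\log L = -\sum_{i=1}^N \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],$$

which is the quantity Keras computes (averaged over the batch) when you specify binary_crossentropy.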

And you are correct that we should be able to compress the data in some manner. This answer shows how to do so by exploiting the connection between the Bernoulli and binomial distributions.

We can compress that data by grouping the like observations together, which we can do by adjusting your SQL query.

select
  feature,
  count(*) as n,
  sum(binary_label) as k
from
  table_name
group by feature

This tells us how many trials $n$ were undertaken, and how many of those trials $k$ were a success, at each value of feature. When you repeat a Bernoulli trial a fixed number of times $n$, the trials are independent, and each has the same probability of success $p$, the number of successes $k$ is a binomial random variable.

A binomial random variable has cross entropy (negative log-likelihood) given by

$$\begin{align} L &= \prod_{i=1}^N \binom{n_i}{k_i} p_i^{k_i} (1-p_i)^{n_i-k_i} \\ -\log L &= -\sum_{i=1}^N \left[ \log\binom{n_i}{k_i} + k_i \log p_i + (n_i-k_i)\log(1-p_i) \right] \\ &= -\sum_{i=1}^N \left[ k_i \log p_i + (n_i - k_i)\log(1-p_i) \right] + C \end{align}$$ where $i$ indexes the rows in your SQL output and the constant $C$ absorbs the binomial coefficients.

This function should be immediately recognizable as the Bernoulli cross-entropy function, except we've weighted it by the number of successes $k$ and failures $n-k$. (We can drop the binomial coefficient from the expression because it is not a function of $p$, so it does not change the location of any optima.) We can use this loss function in place of binary_crossentropy.
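To see the equivalence concretely, take three rows of the uncompressed data that share the same feature value, one labeled $1$ and two labeled $0$. Their summed Bernoulli cross-entropy is

$$-\log p - 2\log(1-p),$$

which is exactly the binomial expression above with $n=3$ and $k=1$ (the dropped constant aside).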

Here's what this looks like in pseudocode.

f = SomeNeuralNetwork()
train_data = SqlQuery("select feature, count(*) as n, sum(binary_label) as k from table_name group by feature;")
for x, n, k in train_data:
    p = f(x)                                     # predicted success probability for this feature value
    loss = -k * log(p) - (n - k) * log(1 - p)    # binomial negative log-likelihood (constant dropped)
    backprop_update(loss)

I don't think Keras has a native implementation of this loss function, so if you use Keras you'll have to write it yourself (I explain this in a comment; a rough sketch is given below). PyTorch does have an implementation, but you'll need to do a little algebra to use it, as Jonny Lomond explains in a comment:

[T]his loss function is equivalent to passing $k_i/n_i$ as the target and passing $n_i$ as the sample weight with [BCELoss].
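Not part of the original answer, but here is a minimal sketch of that PyTorch route (using the functional form of BCELoss). The tensors x, n, and k are illustrative placeholders standing in for the output of the SQL query above, and the architecture simply mirrors the one in the question:

import torch
import torch.nn.functional as F

# Placeholder tensors; in practice these come from the aggregated SQL query.
x = torch.tensor([[100.0], [140.0], [188.0]])  # feature, one row per group
n = torch.tensor([1000.0, 3000.0, 3200.0])     # count(*) per group (placeholder values)
k = torch.tensor([300.0, 1200.0, 2900.0])      # sum(binary_label) per group (placeholder values)

model = torch.nn.Sequential(
    torch.nn.Linear(1, 4), torch.nn.Tanh(),
    torch.nn.Linear(4, 1), torch.nn.Sigmoid(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()
    p = model(x).squeeze(1)
    # Target k/n with per-row weight n reproduces
    # -k*log(p) - (n-k)*log(1-p), up to the dropped constant and an overall scaling.
    loss = F.binary_cross_entropy(p, k / n, weight=n)
    loss.backward()
    optimizer.step()

If you'd rather stay in Keras, one way to hand-roll the loss is to pack $n$ and $k$ into the target array and unpack them inside a custom loss function. This is only a sketch, assuming a TF 2.x-style Keras; the name binomial_nll and the two-column target layout are conventions introduced here, not a Keras API, and other Keras versions may need a different way of passing the extra column:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def binomial_nll(y_true, y_pred):
    # y_true packs [n, k] per row; y_pred is the predicted success probability.
    n = y_true[:, 0]
    k = y_true[:, 1]
    p = tf.clip_by_value(y_pred[:, 0], 1e-7, 1.0 - 1e-7)  # avoid log(0)
    return -(k * tf.math.log(p) + (n - k) * tf.math.log(1.0 - p))

# Placeholder arrays; in practice these come from the aggregated SQL query.
x = np.array([[100.0], [140.0], [188.0]])  # feature, one row per group
n = np.array([1000.0, 3000.0, 3200.0])     # count(*) per group (placeholder values)
k = np.array([300.0, 1200.0, 2900.0])      # sum(binary_label) per group (placeholder values)

model = Sequential([
    Dense(4, input_shape=(1,), activation='tanh'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss=binomial_nll)

# Pack n and k into the "target" so the custom loss can see both.
model.fit(x, np.stack([n, k], axis=1), epochs=100)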

In a comment, OP says

I'm gonna be honest with you I got lost with this response. I guess I am still a beginner. I've used your suggestion though to get sum(binary_label) and created a calculated label using sum(binary_label) / count(*). I've then fitted a poly to get the 'predicted' probability for a feature so any >=50% = 1 and any <50% = 0. Would using a deep learning model, in this case, be more suitable?

This relabeling scheme is a very bad idea! This has no relation to the original model that you want to fit, and will badly bias your model.

Sycorax
  • (+1) Want to point out that, I think, this loss function is equivalent to passing $k_i/n_i$ as the target and passing $n_i$ as the sample weight with `binary_crossentropy`, so there's no need to implement a new loss function. The resulting model will still be different than one trained on the full data because of differences unrelated to the loss function, for example the batches created with the uncompressed data will be different than those created with compressed data. – Jonny Lomond Nov 24 '21 at 21:14
  • @JonnyLomond I understand what you mean, but my reading of the Keras documentation does not appear to support this usage. https://keras.io/api/losses/probabilistic_losses/#binary_crossentropy-function "`y_true` (true label): This is either 0 or 1" -- so we can't use the fractional representation you suggest, and there's also no "weight" argument in this documentation page. On the other hand, there are `pytorch` functions that work the way you outline: https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html – Sycorax Nov 25 '21 at 02:12
  • I'm gonna be honest with you I got lost with this response. I guess I am still a beginner. I've used your suggestion though to get sum(binary_label) and created a calculated label using sum(binary_label) / count(*). I've then fitted a poly to get the 'predicted' probability for a feature so any >=50% = 1 and any <50% = 0. Would using a deep learning model, in this case, be more suitable? – Andrei Budaes Dec 06 '21 at 13:33
  • Relabeling your data in this way is a terrible idea, and fitting a deep neural network to it won't fix the problems that the relabeling creates. I've added some pseudocode to show you how this loss function actually works. The intuition should be obvious though: write down the binary cross entropy loss for 3 samples with the same value of feature (1 in one class, 2 in the other class) and then add them together. The result will look exactly like my loss expression. – Sycorax Dec 06 '21 at 14:22