The loss function `binary_crossentropy` is equivalent to the negative log-likelihood of a Bernoulli model that estimates a success probability $p$. In this case, you're using a neural network $f$ to estimate $p$ as a function of the feature $x$, so we have $f(x_i)=p_i$. It's perfectly appropriate to use `binary_crossentropy` in the first case, where each observation is a single trial of the binary outcome $y\in \{0,1\}$.
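For reference, with observations $y_i \in \{0,1\}$ and predicted probabilities $p_i = f(x_i)$, this loss is
$$-\log L = -\sum_{i=1}^N \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right],$$
which is the form we will generalize below.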
And you are correct that we should be able to compress the data in some manner. This answer shows how to do so by exploiting the connection between the Bernoulli distribution and the binomial distribution.
We can compress the data by grouping like observations together, which we can do by adjusting your SQL query.
```sql
select
    feature,
    count(*) as n,
    sum(binary_label) as k
from
    table_name
group by feature;
```
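If your raw data lives in a dataframe rather than a SQL table, here is a minimal pandas sketch of the same compression. The column names match the query above; the toy values are made up purely for illustration.

```python
import pandas as pd

# toy raw data: one row per Bernoulli trial (values made up for illustration)
df = pd.DataFrame({"feature":      [0.1, 0.1, 0.5, 0.5, 0.5],
                   "binary_label": [0,   1,   1,   0,   1]})

# same compression as the SQL query: one row per distinct feature value
grouped = (df.groupby("feature")["binary_label"]
             .agg(n="count", k="sum")
             .reset_index())
print(grouped)  # columns: feature, n, k
```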
This tells us how many trials $n$ were undertaken, and how many of those trials $k$ were successes, at each value of `feature`. When you repeat a Bernoulli trial a fixed number of times $n$, with a fixed probability of success $p$ and independent trials, the number of successes $k$ is a binomial random variable.
The negative log-likelihood (cross-entropy) for these grouped binomial observations is given by
$$\begin{align}
L &= \prod_{i=1}^N \binom{n_i}{k_i} p_i^{k_i} (1-p_i)^{n_i-k_i} \\
-\log L &= -\sum_{i=1}^N \left[ \log\binom{n_i}{k_i} + k_i \log p_i + (n_i-k_i)\log(1-p_i) \right] \\
&= \sum_{i=1}^N \left[ -k_i \log p_i - (n_i - k_i)\log(1-p_i) \right] + C
\end{align}$$
where $i$ indexes the rows in your SQL output.
This function should be immediately recognizable as the Bernoulli cross-entropy function, except we've weighted it by the number of successes $k$ and failures $n-k$. (We can drop the binomial coefficient from the expression because it is not a function of $p$, so it does not change the location of any optima.) We can use this loss function in place of `binary_crossentropy`.
Here's what this looks like in pseudocode.
```python
# Pseudocode: train on the grouped data with the weighted (binomial) loss
f = SomeNeuralNetwork()
train_data = SqlQuery("select feature, count(*) as n, sum(binary_label) as k from table_name group by feature;")

for x, n, k in train_data:
    p = f(x)                                    # predicted success probability
    loss = -k * log(p) - (n - k) * log(1 - p)   # binomial negative log-likelihood
    backprop_update(loss)
```
I don't think Keras has a native implementation of this loss function, so if you use Keras, you'll have to write it yourself (I explain this in a comment). PyTorch does have an implementation, but you'll need to do a little algebra to use it, as Jonny Lomond explains in a comment:

> [T]his loss function is equivalent to passing $k_i/n_i$ as the target and passing $n_i$ as the sample weight with [`BCELoss`].
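To make that concrete, here is a minimal PyTorch sketch of the trick from that comment: pass $k_i/n_i$ as the target and $n_i$ as the per-row weight to the built-in binary cross-entropy. The tiny network, the optimizer settings, and the toy grouped data are assumptions for illustration, not part of the question.

```python
import torch
import torch.nn.functional as F

def grouped_bce(p, n, k):
    # -k*log(p) - (n-k)*log(1-p), summed over the rows of the grouped table
    return F.binary_cross_entropy(p, k / n, weight=n, reduction="sum")

# toy grouped data: one row per distinct feature value
x = torch.tensor([[0.1], [0.5], [0.9]])   # feature
n = torch.tensor([120., 80., 45.])        # count(*) as n
k = torch.tensor([30., 41., 40.])         # sum(binary_label) as k

f = torch.nn.Sequential(torch.nn.Linear(1, 8), torch.nn.ReLU(),
                        torch.nn.Linear(8, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(f.parameters(), lr=1e-2)

for _ in range(100):
    opt.zero_grad()
    p = f(x).squeeze(-1)
    loss = grouped_bce(p, n, k)
    loss.backward()
    opt.step()
```

Equivalently, you can compute $-k\log p - (n-k)\log(1-p)$ by hand as in the pseudocode above; the `weight` argument just lets the built-in loss do the bookkeeping.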
In a comment, OP says

> I'm gonna be honest with you I got lost with this response. I guess I am still a beginner. I've used your suggestion though to get sum(binary_label) and created a calculated label using sum(binary_label) / count(*). I've then fitted a poly to get the 'predicted' probability for a feature so any >=50% = 1 and any <50% = 0. Would using a deep learning model, in this case, be more suitable?
This relabeling scheme is a very bad idea! Thresholding the empirical proportions at 50% discards both the number of trials behind each proportion and the size of the proportion itself; it has no relation to the original model that you want to fit, and will badly bias your model.