I'm not totally sure what you mean by "its value is itself a distribution," so let me say a few things and see if they help; feel free to ask more questions if not.
The network is predicting a discrete distribution over the three entries. Letting the predictive label be the random variable $\hat Y$ and naming its possible values $a$, $b$, and $c$, it says that $\Pr(\hat Y = a) = 0.4$, $\Pr(\hat Y = b) = 0.1$, and $\Pr(\hat Y = c) = 0.5$. Note that $\hat Y$ is a function of the network's parameters $\theta$ and the feature vector $x$: we can write it as $\hat Y_\theta(x)$ to denote its dependence on $\theta$ and $x$.
Now, we want to see if that predicted distribution is any good. We only have one data point to evaluate this with: the true observed value $y$, which in this case was observed as $a$. Taking a maximum-likelihood approach, we choose to evaluate the quality of a network $\theta$ by its likelihood: the probability of $a$ under the predictive distribution $\Pr(\hat Y_\theta(x) = \cdot)$, which we can evaluate as $\Pr(\hat Y_\theta(x) = a) = 0.4$. (If the labels were continuous, then we'd use the probability density.)
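If it helps to see that in code, here's a minimal sketch (plain NumPy, with the labels and probabilities hard-coded to match the example above; in practice the predicted distribution would typically come from the network's softmax output for the input $x$):

```python
import numpy as np

# Hypothetical predicted distribution over the three labels a, b, c
# (hard-coded here; normally the network produces this for input x).
predicted = {"a": 0.4, "b": 0.1, "c": 0.5}

y_observed = "a"                      # the single observed label
likelihood = predicted[y_observed]    # Pr(Y_hat_theta(x) = a) = 0.4
log_likelihood = np.log(likelihood)   # about -0.916

print(likelihood, log_likelihood)
```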
Now, the network actually predicts one of these distributions for each of the possible inputs $x^{(i)}$; our measure of the overall quality of the network as a predictor is the product of the likelihoods of the data samples (equivalently, the sum of their log-likelihoods). Because we assume the samples are iid, we get
$$
\log \Pr\left( \big( \hat Y_\theta(x^{(i)}) \big)_{i=1}^n = \big( y^{(i)} \big)_{i=1}^n \right)
= \log \prod_{i=1}^n \Pr\left( \hat Y_\theta(x^{(i)}) = y^{(i)} \right)
= \sum_{i=1}^n \log \Pr\left( \hat Y_\theta(x^{(i)}) = y^{(i)} \right)
.$$
The log-likelihood of a parameter value $\theta$ under the data $\{(x^{(i)}, y^{(i)}) \}_{i=1}^n$ is then
$$
\ell(\theta) = \sum_{i=1}^n \log \Pr(\hat Y_\theta(x^{(i)}) = y^{(i)})
,$$
which is what we want to maximize.
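In code, that sum looks something like this (a sketch with made-up predictions for three samples; in practice the probabilities come out of the network for each $x^{(i)}$, and maximizing $\ell(\theta)$ is done with gradient methods on $\theta$, i.e. minimizing the average negative log-likelihood, which is exactly the usual cross-entropy loss):

```python
import numpy as np

# Made-up example: predicted_probs[i] is the network's predicted distribution
# over the 3 labels for input x^(i); y[i] is the index of the observed y^(i).
predicted_probs = np.array([
    [0.4, 0.1, 0.5],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
])
y = np.array([0, 1, 2])   # observed labels a, b, c encoded as 0, 1, 2

# ell(theta) = sum_i log Pr(Y_hat_theta(x^(i)) = y^(i))
per_sample_log_lik = np.log(predicted_probs[np.arange(len(y)), y])
log_likelihood = per_sample_log_lik.sum()

# Minimizing the mean negative log-likelihood (cross-entropy) is the same
# as maximizing ell(theta).
cross_entropy = -per_sample_log_lik.mean()
print(log_likelihood, cross_entropy)
```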
Compare to the case of finding the maximum-likelihood estimator for a series of biased coin flips. There the model is $\mathrm{Bernoulli}(\theta)$, i.e. $\Pr(\hat Y_\theta = H) = \theta$, $\Pr(\hat Y_\theta = T) = 1 - \theta$. The log-likelihood is
$$\ell(\theta) = \sum_{i=1}^n \begin{cases}\log(\theta) & y^{(i)} = H \\ \log(1 - \theta) & y^{(i)} = T\end{cases},$$
and we can estimate $\theta$ by maximizing $\ell(\theta)$ given the $y^{(i)}$.
The only difference in the network case is that there are also feature vectors $x^{(i)}$, and we're maximizing the likelihood conditional on the $x^{(i)}$.
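And here's the coin-flip version in code, as a small numerical check (made-up flips; the grid search is only there to show that the maximizer of $\ell(\theta)$ is the sample fraction of heads, which you could also get in closed form):

```python
import numpy as np

# Made-up flips: H encoded as 1, T as 0 (7 heads, 3 tails).
flips = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def log_likelihood(theta, flips):
    # sum of log(theta) over heads and log(1 - theta) over tails
    return np.sum(flips * np.log(theta) + (1 - flips) * np.log(1 - theta))

# Evaluate ell(theta) on a grid of candidate values and take the best one.
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t, flips) for t in thetas])]
print(best, flips.mean())   # both about 0.7
```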