15

I don't have any background in math, but I understand how the simple perceptron works, and I think I grasp the concept of a hyperplane (I picture it geometrically as a plane in 3D space that separates two point clouds, just as a line separates two point clouds in 2D space).

But I don't understand how one plane or one line could separate three different point clouds in 3D space or in 2D space, respectively – this is geometrically not possible, is it?

I tried to understand the corresponding section in the Wikipedia article, but already failed miserably at the sentence “Here, the input x and the output y are drawn from arbitrary sets”. Could somebody explain the multiclass perceptron to me and how it goes with the idea of the hyperplane, or maybe point me to a not-so-mathematical explanation?

Gala
grssnbchr

2 Answers

9

Suppose we have data $(x_1, y_1), \dots, (x_k,y_k)$ where $x_i \in \mathbb{R}^n$ are input vectors and $y_i \in \{\text{red, blue, green} \}$ are the classifications.

We know how to build a classifier for binary outcomes, so we do this three times, each time grouping two of the outcomes together: red vs. $\{\text{blue or green}\}$, blue vs. $\{\text{red or green}\}$, and green vs. $\{\text{red or blue}\}$.

Each model takes the form of a function $f: \mathbb{R}^n \to \mathbb{R}$; call them $f_R, f_B, f_G$ respectively. Each one maps an input vector to its signed distance from the hyperplane associated with that model, where positive distance corresponds to a prediction of red for $f_R$, blue for $f_B$ and green for $f_G$. Basically, the more positive $f_G(x)$ is, the more the model thinks that $x$ is green, and vice versa. We don't need the output to be a probability; we just need to be able to measure how confident the model is.

Given an input $x$, we classify it according to $\text{argmax}_{c} \ f_c(x)$, so if $f_G(x)$ is the largest amongst $\{f_G(x), f_B(x), f_R(x) \}$ we would predict green for $x$.

This strategy is called "one vs all", and you can read about it here.
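In case a concrete sketch helps, here is a minimal Python version of the scheme above (the toy data, the `train_binary_perceptron` helper and all names are my own illustration, not part of the answer):

```python
import numpy as np

def train_binary_perceptron(X, targets, epochs=100, lr=1.0):
    """Learn w, b so that sign(w.x + b) separates +1 from -1 targets."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, targets):
            if t * (np.dot(w, x) + b) <= 0:   # misclassified: perceptron update
                w += lr * t * x
                b += lr * t
    return w, b

# Toy data: 2-D points labelled red, blue or green.
X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.1],
              [0.9, 0.2], [-1.0, -1.0], [-0.8, -1.1]])
y = np.array(["red", "red", "blue", "blue", "green", "green"])

# One binary classifier per class: that class is +1, everything else is -1.
models = {}
for c in ["red", "blue", "green"]:
    targets = np.where(y == c, 1.0, -1.0)
    models[c] = train_binary_perceptron(X, targets)

def predict(x):
    # f_c(x) = w_c . x + b_c is an unnormalised confidence, not a probability;
    # classify by the argmax over classes, as in the answer.
    scores = {c: np.dot(w, x) + b for c, (w, b) in models.items()}
    return max(scores, key=scores.get)

print(predict(np.array([0.1, 1.0])))   # expected: "red"
```

Note that the three scores are not calibrated probabilities; the argmax only needs them to be comparable confidences, which is exactly the point made above.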

gung - Reinstate Monica
Harri
3

I can't make sense of that Wiki article at all. Here's an alternative stab at explaining it.

A perceptron with one logistic output node is a classification network for 2 classes. It outputs $p$, the probability of being in one of the classes, with the probability of being in the other simply $1 - p$.

A perceptron with two output nodes is a classification network for 3 classes. Each of the two nodes outputs the probability $p_i$ of being in one class, and the probability of being in the third class is $1 - (p_1 + p_2)$.

And so on; a perceptron with $m$ output nodes is a classifier for $m + 1$ classes. Indeed, if there is no hidden layer, such a perceptron is basically the same as a multinomial logistic regression model, just as a simple perceptron is the same as a logistic regression.
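One common way to realise this is the softmax-with-reference-class parameterisation used in multinomial logistic regression; the sketch below (the weights, bias and input are made-up illustration values, and this particular parameterisation is my assumption, not code from the answer) shows how $m$ output nodes yield $m + 1$ class probabilities:

```python
import numpy as np

def multiclass_perceptron_probs(x, W, b):
    """W has one row per output node (m rows); returns m + 1 class probabilities."""
    scores = W @ x + b                            # one score per output node
    all_scores = np.append(scores, 0.0)           # reference class gets an implicit score of 0
    exp = np.exp(all_scores - all_scores.max())   # numerically stable softmax
    p = exp / exp.sum()
    return p                                      # p[:-1] from the m nodes, p[-1] = 1 - sum(p[:-1])

x = np.array([0.5, -1.2, 3.0])                    # example input
W = np.array([[0.1, 0.4, -0.2],                   # weights for output node 1
              [-0.3, 0.8, 0.05]])                 # weights for output node 2
b = np.array([0.0, 0.1])
print(multiclass_perceptron_probs(x, W, b))       # three probabilities summing to 1
```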

Hong Ooi
  • Are you sure that the output is an actual probability? Anyway, I don't know how multinomial logistic regression works, so I'll have to look into that. But isn't there an (algorithmic) way to explain how a perceptron with two or more output nodes is constructed? Are they chained together somehow? – grssnbchr Jul 29 '13 at 09:03