Mathematical formulation of correspondence analysis?

Question

You can think of Correspondence Analysis as a categorical-data version of PCA. But the main usage of Correspondence Analysis is different from that of PCA, and it is more like clustering or Factor Analysis. With Correspondence Analysis, we can analyze and visualize the relationships among your observed data, and see which parts of the data are associated with another part of the data.

Wikipedia also says

It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. In a similar manner to principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form.

But I don't understand the details of the process, because I don't know what kind of problem Correspondence Analysis is trying to solve. Is there a clear mathematical formulation of the problem?

The kinship of simple CA with PCA is considered quite thoroughly in https://stats.stackexchange.com/q/141754/3277. See also there a link to answer by @chl — ttnphns, Apr 15 '17 at 19:53

dimitriy · Answer 1 · 2013-06-12T04:29:32.137

Take a look at the Stata documentation for CA (even if you're not a Stata user):

Correspondence analysis offers a geometric representation of the rows and columns of a two-way frequency table that is helpful in understanding the similarities between the categories of variables and the association between the variables.

There are lots of examples and references chosen for clarity, and the math can be found on page 20.

score 1 · Answer 2 · answered Apr 15 '17 at 20:24

Consider an $I\times J$ contingency table $C$ with elements $C_{ij}$ and total number of observations $n=\sum_i\sum_j C_{ij}$.

Simple $K$-dimensional correspondence analysis models this as $$ \frac{C_{ij}}{n} = \alpha_i\beta_j\left(1 + \sum_{k=1}^{K}\mu_{ik}\sigma_k\nu_{jk}\right) $$ You can think of this as a particular geometrical decomposition of a table of proportions, as described in the Stata documentation linked to by @dimitriy-v-masterov. It is the final equation in that document, although it would probably have been the most useful one to start with.

Personally, I prefer to think of CA as a least squares approximation to the $K$-dimensional log-multiplicative 'association' model of $C$: \begin{align} C_{ij} \sim &~ \text{Poisson}(\mu_{ij})\\ \log\mu_{ij} = &~ \alpha_i^* + \beta_j^* + \sum_{k=1}^{K}\mu_{ik}^*\sigma_k^*\nu_{jk}^*. \end{align} (Here $\mu_{ij}$ is the Poisson mean of cell $(i,j)$, not to be confused with the row scores $\mu_{ik}^*$.) This makes it a bit clearer that the goal of both models is to create an interpretable low-dimensional model of the table's association structure, that is, the variation in counts that would not be expected under independence. Increasing $K$ moves the models' complexity from independence ($K=0$) to saturation.

In both models $\alpha$ and $\beta$ ensure the margin counts are captured whereas the elements of the sum exist to model the association structure. In CA the elements in the sum are essentially the first $K$ singular vectors and values of an SVD of the residuals from an independence model of $C$.
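The SVD construction just described can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a made-up $3\times 3$ table of counts; the function name `simple_ca` and the variable names are hypothetical choices, not taken from the Stata documentation. It takes $\alpha$ and $\beta$ to be the row and column proportions and scales the singular vectors by the inverse square roots of those proportions, which is one common convention.

```python
import numpy as np

def simple_ca(C, K):
    """Rank-K simple CA of a contingency table C (illustrative sketch).

    Assumes every row and column total of C is strictly positive.
    """
    C = np.asarray(C, dtype=float)
    P = C / C.sum()                       # proportions C_ij / n
    r = P.sum(axis=1)                     # row proportions (alpha_i)
    c = P.sum(axis=0)                     # column proportions (beta_j)
    E = np.outer(r, c)                    # fitted proportions under independence
    S = (P - E) / np.sqrt(E)              # standardized residuals from independence
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    mu = U[:, :K] / np.sqrt(r)[:, None]   # row scores, one column per dimension
    nu = Vt[:K].T / np.sqrt(c)[:, None]   # column scores, one column per dimension
    # alpha_i * beta_j * (1 + sum_k mu_ik * sigma_k * nu_jk)
    fitted = E * (1.0 + (mu * sigma[:K]) @ nu.T)
    return r, c, mu, sigma[:K], nu, fitted

# Made-up counts, purely for illustration.
C = np.array([[20, 10, 5],
              [10, 15, 10],
              [5, 10, 25]])

r, c, mu, sigma, nu, fitted = simple_ca(C, K=2)
print(np.round(sigma, 4))                 # singular values of the residual matrix
print(np.allclose(fitted, C / C.sum()))   # K = min(I, J) - 1 reproduces the table
```

With $K=0$ the fitted table is just the independence model, and with $K=\min(I,J)-1$ it reproduces the observed proportions exactly, which matches the range from independence to saturation mentioned above.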

Biplots plot variously scaled $\mu$ or $\mu^*$ and $\nu$ or $\nu^*$ in the same space. Confusingly, some people use 'CA' to refer to such plots rather than to the first model.

Comparisons of either model to factor analysis are basically unhelpful. From a measurement perspective, these models capture a proximity, or 'ideal point', item structure, rather than the dominance structure assumed by factor analysis.
