Problem
We have an array $\underline{X}$ of order $items \times items \times people$ (that is, $J \times J \times K$), where each cell is a Pearson correlation coefficient, computed across some observations of an item pair.
(See this related answer on why this correlation matrix is the only comparable data we have; we can't go back to rawer data.)
These correlation coefficients can be interpreted as similarities: high values mean very similar, low values dissimilar, and 0 unrelated.
We now want to reduce the dimensionality in all three modes, i.e., both $people$ and $items$, so that we end up with a few people dimensions, each of which has a few item dimensions.
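For concreteness, a toy version of such an array can be built by stacking one correlation matrix per person (names and sizes here are made up for illustration; `cora_toy` is a stand-in, not our actual data):

```r
set.seed(1)
J <- 4  # number of items (toy size)
K <- 3  # number of people (toy size)

# one J x J correlation matrix per person, stacked along a third dimension
cora_toy <- simplify2array(lapply(seq_len(K), function(k) {
  cor(matrix(rnorm(20 * J), ncol = J))  # stand-in for person k's item responses
}))

dim(cora_toy)  # J x J x K: items x items x people
```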
Being the Principal Components Analysis (PCA) fanboy that I am (mostly because I understand it somewhat, and, more to the point, because we really just want dimensionality reduction), I turned to its $n$-mode generalisations: Candecomp/Parafac (CP) and Tucker (a.k.a. multilinear SVD).
Which of the two (if any) is appropriate for our data?
We tried both of them (see below), but found them both a bit lacking / uncertain, so we're between that rock and hard place they talk about.
Attempt 1: Candecomp/Parafac
We're aware that our data isn't $I \times J \times K$, as would be the case for most multi-modal arrays, but only $J \times J \times K$; we actually have three-way, two-mode data.
I had naively assumed, by analogy to PCA, that this wouldn't be a problem, since you can always pass a correlation matrix to a PCA instead of a data matrix, as in princomp(covmat = cor(x)) giving the same loadings as princomp(x, cor = TRUE), except, of course, that the former can't give you scores.
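The analogy can be checked directly in R (toy data; the loadings coincide up to column signs, and only the data-based fit yields scores):

```r
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)  # toy data standing in for raw observations

# PCA on standardised data vs. eigendecomposition of the correlation matrix
p_data <- prcomp(x, scale. = TRUE)
e_cor  <- eigen(cor(x))

# loadings coincide up to column signs; scores only come from the data-based fit
max(abs(abs(p_data$rotation) - abs(e_cor$vectors)))  # effectively zero
```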
Turns out we're lucky here: Giordani and Kiers (2014: 17ff.) actually describe a similar use case in their vignette for ThreeWay:
> [...] The data array is such that the A- and B-mode entities (the kinship terms) coincide. It corresponds to a three-way proximity array (three-way, two-mode data), namely a collection of proximity data matrices, one for each group of students. This particular array can still be analyzed by means of CP. In this case, the CP model is usually referred to as the Indscal model to highlight that the model represents a three-way extension of the classical (two-way) multidimensional scaling. In their original paper, Carroll and Chang (1970) introduce such a model as a tool for summarizing a three-way array of scalar products through underlying components. Thus, the Indscal model can be understood as the CP model for arrays with symmetric slabs. When the CP model is fitted to proximity arrays, the symmetry of the slabs implies that two component matrices (in our case the matrices A and B) are the same. [...]
So, here we go.
Apologies for the not-quite-reprex:
- $\underline{X}$ aka `cora` is too large for `dput`, but can be found here.
- `ThreeWay::CP` can only be run interactively AFAIK, so I can only paste the console output.
Call:
npca_res <- CP(data = cora,
laba = dimnames(cora)[[1]],
labb = dimnames(cora)[[2]],
labc = dimnames(cora)[[3]])
Messy console interaction:
WELCOME to the interactive CANDECOMP/PARAFAC analysis program
Warning: If you insert an object of mode CHARACTER when not requested, an error occurs and the program stops!
To see ANOVA results, specify 1:
1:
Read 0 items
How do you want to center your array?
0 = none (default)
1 = across A-mode
2 = across B-mode
3 = across C-mode
12 = across A-mode and across B-mode
13 = across A-mode and across C-mode
23 = across B-mode and across C-mode
1: 0
Read 1 item
Data have not been centered
How do you want to normalize your array?
0 = none (default)
1 = within A-mode
2 = within B-mode
3 = within C-mode
1: 0
Read 1 item
Data have not been normalized
Note: The preprocessed data are now available in Xprep.
Xprep can be used for analyses outside the Candecomp/Parafac program
To see PCA of MEAN, specify '1':
1: 0
Read 1 item
You are now about to do a Candecomp/Parafac analysis.
You have to specify the dimensionality to use.
To search this, you are advised to run a series of (unconstrained)
Candecomp/Parafac analyses using different dimensionalities
The results can next be used to find useful dimensionalities by the scree test.
To reduce the computation time, note that the maximum allowed number of iterations is 5000,
the convergence criterion is 1e-6 and the rational start via eigendecompositions is considered.
If you want to do this for choosing your dimensionality, specify '1':
1: 0
Read 1 item
You can now do a CANDECOMP/PARAFAC ANALYSIS
How many components do you want to use?
1: 3
Read 1 item
Do you want to use constraints? If so, enter '1':
1:
Read 0 items
No constraints imposed
Specify convergence criterion (default=1e-6)
1:
Read 0 items
By default, only a rationally started analysis run will be carried out.
To decrease the chance of missing the optimal solution, you may use additional, randomly started runs.
If you want additional runs, specify how many (e.g., 4):
1: 4
Read 1 item
Specify the maximum number of iterations you allow (default=10000).
1: 50000
Read 1 item
Run no. 1
Candecomp/Parafac function value at Start is 14265.9792386173
Candecomp/Parafac function value is 11248.0847401712 after 15 iterations
Fit percentage is 20.8684707651605 %
Procedure used 0.14 seconds
Run no. 2
Candecomp/Parafac function value at Start is 14215.4293443188
Candecomp/Parafac function value is 11284.7457771764 after 26 iterations
Fit percentage is 20.6105562856232 %
Procedure used 0.23 seconds
Run no. 3
Candecomp/Parafac function value at Start is 14219.5279522587
f= 11229.5642472746 after 50 iters; diff.= 18.7124782152951
Candecomp/Parafac function value is 11208.8164256207 after 60 iterations
Fit percentage is 21.1447277326922 %
Procedure used 0.56 seconds
Run no. 4
Candecomp/Parafac function value at Start is 14219.946679173
Candecomp/Parafac function value is 11312.4618534379 after 16 iterations
Fit percentage is 20.4155706014269 %
Procedure used 0.11 seconds
Run no. 5
Candecomp/Parafac function value at Start is 14216.8408202143
Candecomp/Parafac function value is 11298.2532050982 after 21 iterations
Fit percentage is 20.5155300254045 %
Procedure used 0.14 seconds
Fit (%) values from all runs:
Start n.1 Start n.2 Start n.3 Start n.4 Start n.5
20.87 20.61 21.14 20.42 20.52
Candecomp/Parafac analysis with 3 components, gave a fit of 21.14 %
Simple check on degeneracy: inspect matrix of triple congruences
Comp.1 Comp.2 Comp.3
Comp.1 1.0000 0.0125 0.0028
Comp.2 0.0125 1.0000 0.0004
Comp.3 0.0028 0.0004 1.0000
It is sometimes useful to SCALE solution, e.g., scale two matrices so that
they have unit sum of squares compensating this scale in remaining matrix.
If you want to scale components, specify '1':
1: 1
Read 1 item
What modes do you want to scale?
1 = B- and C-modes (scaling compensated in A)
2 = A- and C-modes (scaling compensated in B)
3 = A- and B-modes (scaling compensated in C)
1: 3
Read 1 item
Solution for A, B and C in detail.
A
Comp.1 Comp.2 Comp.3
but-how -0.17 -0.12 -0.15
encyclopedia 0.17 -0.05 0.22
alien 0.17 -0.25 -0.21
language-of-bees -0.17 0.12 0.19
bad-hen 0.18 -0.26 -0.20
correspondence 0.16 0.13 -0.17
bamboozable -0.17 -0.17 0.14
inventions 0.16 0.27 0.19
gray-hair 0.17 -0.06 0.20
i-we -0.17 0.15 -0.12
the-same -0.17 0.20 -0.14
the-better -0.17 0.17 0.23
cockatoo -0.17 0.05 0.23
caravan 0.16 0.05 -0.08
comma -0.17 0.24 -0.13
headturner 0.17 -0.11 0.21
crocodile -0.16 -0.29 0.04
countries 0.17 -0.06 0.27
level 0.16 -0.26 -0.11
morning -0.17 0.29 -0.14
idiom 0.17 -0.29 -0.18
lullaby 0.17 -0.04 0.18
riddle 0.17 -0.07 0.23
eating-grandpa -0.17 0.16 -0.14
easter-bunny -0.17 -0.14 0.15
alphabet-of-swearing 0.17 0.17 -0.09
guilt 0.17 -0.10 -0.07
trouble -0.18 0.08 -0.11
eating-animals 0.17 0.05 -0.19
random-poetry -0.16 0.12 -0.08
comparatively -0.16 -0.15 -0.09
one-letter 0.17 -0.07 0.21
resistance -0.18 0.21 -0.17
conjugation 0.17 0.00 0.22
censored -0.17 -0.22 -0.12
B
Comp.1 Comp.2 Comp.3
but-how 0.17 -0.12 -0.15
encyclopedia -0.17 -0.05 0.22
alien -0.17 -0.25 -0.21
language-of-bees 0.17 0.12 0.19
bad-hen -0.18 -0.26 -0.20
correspondence -0.16 0.13 -0.17
bamboozable 0.17 -0.17 0.14
inventions -0.16 0.27 0.19
gray-hair -0.17 -0.06 0.20
i-we 0.17 0.15 -0.12
the-same 0.17 0.20 -0.14
the-better 0.17 0.17 0.23
cockatoo 0.17 0.05 0.23
caravan -0.16 0.05 -0.08
comma 0.17 0.24 -0.13
headturner -0.17 -0.11 0.21
crocodile 0.16 -0.29 0.04
countries -0.17 -0.06 0.27
level -0.16 -0.26 -0.11
morning 0.17 0.29 -0.14
idiom -0.17 -0.29 -0.18
lullaby -0.17 -0.04 0.18
riddle -0.17 -0.07 0.23
eating-grandpa 0.17 0.16 -0.14
easter-bunny 0.17 -0.14 0.15
alphabet-of-swearing -0.17 0.17 -0.09
guilt -0.17 -0.10 -0.07
trouble 0.18 0.08 -0.11
eating-animals -0.17 0.05 -0.19
random-poetry 0.16 0.12 -0.08
comparatively 0.16 -0.15 -0.09
one-letter -0.17 -0.07 0.21
resistance 0.18 0.21 -0.17
conjugation -0.17 0.00 0.22
censored 0.17 -0.22 -0.12
C
Comp.1 Comp.2 Comp.3
Willy -0.26 2.39 1.16
Karen -0.31 3.17 1.09
Kristina -0.41 9.25 0.53
Stefan -0.92 1.29 1.85
Niklas 0.02 1.70 3.01
Fiona -1.03 2.34 1.98
Finley -1.14 3.67 1.82
Zoe -0.54 3.12 2.04
Annika -1.27 0.53 2.27
Ian -1.18 2.06 1.12
Frank -0.54 2.29 0.99
William -1.23 2.12 0.76
Jana -0.92 3.42 2.24
Babsi -0.07 4.01 2.14
Collin -0.80 1.27 2.54
Patrizia -3.57 3.90 1.18
Valentine -0.33 1.19 0.96
Jennifer -34.99 -0.19 -0.01
Julius -2.46 2.65 1.75
Thomas -0.66 2.33 1.22
Kim -1.20 3.80 1.59
Tamara -1.16 3.89 2.20
Caroline -1.18 4.82 5.20
Dennis -1.43 3.71 1.78
Shae -0.43 1.50 1.04
Knut -0.27 2.31 6.99
Angelina -0.71 2.78 2.93
Victor -0.96 4.09 0.92
Fred -1.80 3.80 1.33
Nora -0.58 5.53 2.25
Carmen -1.06 2.89 3.04
Tashina -0.97 1.13 1.82
Tessa -0.69 9.30 0.65
Adrian -0.62 3.56 1.68
Constanze 0.66 8.07 2.88
Justin -0.32 5.63 1.89
Wanda -0.56 3.69 1.21
Irene -0.85 8.40 2.07
Anastasia -0.06 2.33 22.67
Sabrina -0.57 3.49 2.03
Jasmine -1.15 6.08 6.53
Dana -2.61 3.84 2.33
Amanda 0.04 0.84 0.18
Jalena -2.01 5.25 1.90
Marcel -0.28 4.14 1.30
Leandro -0.72 4.02 3.28
Justus -0.58 2.54 3.09
Hagen -1.57 6.63 1.47
Ralf -1.30 4.06 3.31
Shaana -1.07 0.48 2.18
Nhome -1.17 2.09 3.81
Azra -2.51 3.66 1.25
You can now manually PERMUTE and REFLECT columns of solution
If you want to reflect/permute columns, specify '1':
1:
Read 0 items
If you want to carry out a STABILITY CHECK on current or different solution, specify '1':
1:
Read 0 items
If you want to carry out a BOOTSTRAP procedure for
computing confidence intervals for the current solution, specify '1':
1:
Read 0 items
If you want to carry out a FITPARTITIONING on current solution, specify '1':
1:
Read 0 items
Press 'return' to conclude the analysis
1:
Read 0 items
The component matrices are normalized such that A and B have unit sums of squares
To see component matrices and fit and other results, digits: $Xprep, $A, $B, $C, $fit, $fitA, $fitB, $fitC, ...
The fit isn't great, but that might just be our crappy / preliminary / small dataset:
npca_res$fit
#> Fit (%)
#> 21.14473
Happily, the A-mode and B-mode matrices (both over $items$) are same-ish (a computational artefact, I'm guessing, plus random reflection?):
all(abs(round(x = npca_res$A, digits = 1)) == abs(round(x = npca_res$B, digits = 1)))
#> [1] TRUE
Worries and problems with Candecomp/Parafac
- Giordani and Kiers (ibid.) recommend this for an array of proximity matrices, not correlation matrices. Can we even do this? Is this OK?
- CP, by definition, as I understand it, always extracts the same number of components in each mode, but we have no reasonable expectation that this should be so empirically. In fact, it seems to us there should be more $items$ components than $people$ components.
- Any thoughts on what would be appropriate constraint, scaling, centering and transformation options? (We're thinking no on the latter three, because correlation coefficients are already centered.)
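For what it's worth, the Indscal reading quoted above at least requires the slabs to be symmetric, which correlation matrices satisfy by construction. A quick check on a toy slab:

```r
set.seed(1)
slab <- cor(matrix(rnorm(50 * 4), ncol = 4))  # one toy person's items x items slab

isSymmetric(slab)  # TRUE: correlation matrices are symmetric by construction
```

On the real array that would be `all(apply(cora, 3, isSymmetric))`.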
Attempt 2: Tucker
So, to allow us to extract different numbers of components for each mode, we turn to Tucker, in flagrant disregard of actually having only a $J \times J \times K$ array, not $I \times J \times K$. Weirdly, it works.
Same apologies for non-reprex apply.
Call:
tucker_res <- T3(data = cora,
laba = dimnames(cora)[[1]],
labb = dimnames(cora)[[2]],
labc = dimnames(cora)[[3]])
The messy terminal interaction is here, because it's too long for SO.
As advised by Giordani and Kiers (2014: 8), we then do some post-processing to make the core matrix an identity matrix (I think?):
tucker_res$A <- tucker_res$A %*% tucker_res$core
tucker_res$core <- solve(tucker_res$core) %*% tucker_res$core
This ominously fails with:
#> Error in solve.default(tucker_res$core) : 'a' (3 x 6) must be square
The A-mode and B-mode matrices (both over $items$) are, again, the same:
all(round(tucker_res$A, digits = 1) == round(tucker_res$B, digits = 1))
#> [1] TRUE
but the core ain't an identity matrix, making this weird/hard to interpret:
tucker_res$core
#> B1xC1 B2xC1 B3xC1 B1xC2 B2xC2 B3xC2
#> A1 -0.2351972 -1.59675721 6.47329544 36.2976853 -0.6195908 2.0033419
#> A2 -1.5995439 25.46542681 -0.05905521 -0.6205366 -0.4781704 -1.1461151
#> A3 6.4725537 -0.03896677 24.04316413 2.0032077 -1.1465854 -0.3047064
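If I read the `B1xC1 … B3xC2` column labels correctly, this 3 × 6 core is the matricized form of a 3 × 3 × 2 core array (A-mode in the rows, B varying fastest within C across the columns), which would also explain why solve() complains about a non-square matrix. Folding it back into a three-way array may make the slices easier to eyeball; a sketch on a stand-in matrix with the same assumed layout:

```r
# stand-in for tucker_res$core: 3 A-components x (3 B x 2 C), B varying fastest
core_mat <- matrix(1:18, nrow = 3,
                   dimnames = list(paste0("A", 1:3),
                                   c(outer(paste0("B", 1:3), paste0("C", 1:2),
                                           paste, sep = "x"))))

# fold the matricized core back into a P x Q x R array (column-major, so B stays fastest)
core_arr <- array(core_mat, dim = c(3, 3, 2),
                  dimnames = list(paste0("A", 1:3), paste0("B", 1:3), paste0("C", 1:2)))

core_arr["A1", "B2", "C1"] == core_mat["A1", "B2xC1"]  # the folding preserves entries
```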
Worries and problems with Tucker
- Plugging in an $items \times items \times people$ array: can we just do this?!? Are the results interpretable, or is this just garbage in, garbage out?
- How do we interpret a non-identity core matrix?
- Is this, overall, appropriate?