I have a multivariate compositional dataset in which each row describes a district. The data come mainly from census sources, and I would like to apply statistical clustering to them.

Many columns are expressed as percentages of the resident population. For each row I have, for example, the percentages of kids, young people, adults and the elderly. Other columns describe the same district and the same population, but partition it by a different variable: for example, the percentages of people seeking jobs, employed people and retired people.

| District | Age: Kids | Age: Young | Age: Adults | Age: Elderly | Employment: NEET | Employment: Employed | Employment: Seeking | Employment: Homeowner | Employment: Student | Employment: Retired |
|---|---|---|---|---|---|---|---|---|---|---|
| District A | 10% | 15% | 55% | 20% | 5% | 60% | 5% | 10% | 10% | 10% |
| District B | 5% | 5% | 60% | 30% | 2% | 65% | 7% | 10% | 3% | 13% |
| District C | 2% | 13% | 50% | 35% | 1% | 49% | 5% | 5% | 5% | 35% |
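
As a quick sanity check on the structure of these rows, here is a small sketch in Python with NumPy (my own choice of tool; the array names are only illustrative). It copies the three example rows and sums each block, which shows that every block closes to 100% on its own, so a full row sums to 200% and is not a single closed composition as it stands.

```python
import numpy as np

# Values copied from the example rows above, in percent.
# Columns: Kids, Young, Adults, Elderly
age = np.array([
    [10, 15, 55, 20],   # District A
    [5, 5, 60, 30],     # District B
    [2, 13, 50, 35],    # District C
], dtype=float)

# Columns: NEET, Employed, Seeking, Homeowner, Student, Retired
employment = np.array([
    [5, 60, 5, 10, 10, 10],   # District A
    [2, 65, 7, 10, 3, 13],    # District B
    [1, 49, 5, 5, 5, 35],     # District C
], dtype=float)

# Each block is summed separately, then the whole row.
age_totals = age.sum(axis=1)
employment_totals = employment.sum(axis=1)
row_totals = age_totals + employment_totals

print("age totals:       ", age_totals)
print("employment totals:", employment_totals)
print("row totals:       ", row_totals)
```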

From what I understand from the current literature on compositional data (mainly Statistical Analysis of Compositional Data by Aitchison, Compositional Data Analysis in Practice by Greenacre, and this thread on Cross Validated), ALR and CLR are easier to interpret, but ILR coordinates can be used directly with standard multivariate statistical tools. Is it legitimate, though, to apply a single transformation to the entire row when the row contains several groups of variables that each partition the same population?
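
To make the two options concrete, here is a minimal sketch in Python with NumPy. The helper names (`closure`, `clr`, `ilr_basis`, `ilr`) are hypothetical, the basis is one particular orthonormal (Helmert-type) choice among many, and I am assuming that all parts are strictly positive so that the logarithms are defined. It computes ILR coordinates once per block and once for the whole row after re-closing it. It only illustrates the two constructions and does not settle which one is legitimate.

```python
import numpy as np

def closure(x):
    """Rescale each row so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def clr(x):
    """Centred log-ratio: log parts minus the row mean of the log parts."""
    logx = np.log(closure(x))
    return logx - logx.mean(axis=-1, keepdims=True)

def ilr_basis(d):
    """(d-1) x d matrix whose rows are orthonormal and each sum to zero."""
    basis = np.zeros((d - 1, d))
    for i in range(1, d):
        basis[i - 1, :i] = 1.0 / i
        basis[i - 1, i] = -1.0
        basis[i - 1] *= np.sqrt(i / (i + 1.0))
    return basis

def ilr(x):
    """Isometric log-ratio coordinates with respect to ilr_basis."""
    x = np.asarray(x, dtype=float)
    return clr(x) @ ilr_basis(x.shape[-1]).T

# Example rows from the question, in percent (Districts A, B, C).
age = np.array([[10, 15, 55, 20],
                [5, 5, 60, 30],
                [2, 13, 50, 35]], dtype=float)
employment = np.array([[5, 60, 5, 10, 10, 10],
                       [2, 65, 7, 10, 3, 13],
                       [1, 49, 5, 5, 5, 35]], dtype=float)

# Option 1: transform each block separately, then concatenate.
per_block = np.hstack([ilr(age), ilr(employment)])

# Option 2: treat the whole row as one 10-part composition
# (closure() rescales the 200% row total to 1).
whole_row = ilr(np.hstack([age, employment]))

print(per_block.shape, whole_row.shape)
```

With this construction the per-block version yields 3 + 5 = 8 coordinates per district, and the whole-row version yields 9.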

I feel that there is something wrong with applying the ILR in this way, but I have found that Lloyd, in the paper "Analysing population characteristics using geographically weighted principal components analysis: A case study of Northern Ireland in 2001", does exactly this, regardless of the different domains of the data. Are ILR-transformed data that belong to different domains comparable in any sense? Can someone provide a mathematical justification for this?

Any help would be appreciated.

shoyip