Using compositional data analysis to represent chemical compounds

Question

I have recently been learning about compositional data analysis (CoDa), and I am wondering whether it is suitable for the problem I am currently working on.

I am interested in finding useful representations of chemical compounds that are suitable as input to machine learning algorithms.

This has led me to consider the field of compositional data analysis. My only doubt is that, in the definition of the simplex given in CoDa resources, all coordinates must be strictly greater than 0, whereas in my case a typical representation of a compound looks like this:

NaCl : (..., 0, 0, 0.5, 0, 0, ..., 0.5, ...)

Basically, given an enumeration of the periodic table, I take the fractional abundance of each element in the compound (in this case 1/2 Na and 1/2 Cl). Such a sparse representation makes me wonder whether CoDa is still a suitable choice, since I cannot apply the log transformations in a straightforward manner. What I would like to obtain is a dimensionality-reduced representation of my compounds that still retains a large portion of the information.
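
To make the difficulty concrete, here is a minimal sketch in Python. NumPy is assumed, and the five-element `ELEMENTS` list and the `fractional_vector` helper are hypothetical stand-ins for a full periodic-table enumeration, not part of any CoDa library.

```python
import numpy as np

# Hypothetical, truncated enumeration; a real one would list every element.
ELEMENTS = ["H", "C", "O", "Na", "Cl"]

def fractional_vector(counts):
    """Map {element: atom count} to fractional abundances over ELEMENTS."""
    v = np.array([counts.get(el, 0) for el in ELEMENTS], dtype=float)
    return v / v.sum()

nacl = fractional_vector({"Na": 1, "Cl": 1})  # 1/2 Na, 1/2 Cl, zeros elsewhere

# The zero entries send the logarithm to -inf, which is why log-ratio
# transforms cannot be applied to this vector directly.
with np.errstate(divide="ignore"):
    log_nacl = np.log(nacl)
```

The entries of `nacl` sum to 1, but every element absent from the compound contributes a zero, so `log_nacl` contains `-inf` in those positions.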

Many thanks,

James

Re the log transformation: see https://stats.stackexchange.com/questions/259208. But it looks like treating these as compositional data ("CoDa") is irrelevant: you can apply any dimension-reduction technique you like directly to these vectors without worrying about the fact their entries are non-negative and sum to unity. This opens up more possibilities, too: for instance, you could represent each compound as the vector of *counts* of its atomic constituents. That would distinguish, say, propyne (C3H4) from cyclohexadiene (C6H8). — whuber, Jun 22 '21 at 12:58
Thanks very much for your reply. Can you be more specific about why it would be irrelevant to treat these objects as compositional data? I was actually reading the following paper: https://hal.archives-ouvertes.fr/hal-01945508/file/7902-representation-learning-of-compositional-data.pdf, trying to understand whether any of the tools presented there (mostly dimensionality-reduction variants for CoDa) would be useful for my problem. — James Arten, Jun 22 '21 at 13:47
Treating them as compositional data merely narrows the procedures you might contemplate, yet there is no aspect of this problem that rules out any dimension-reducing procedure. Indeed, the whole idea is that your data are subject to one or more constraints: that is what reducing the dimensions means! Thus, knowing *a priori* that one constraint (sum to unity) does hold gives you no practical information. — whuber, Jun 22 '21 at 15:35
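
The point about counts made in the comments can be checked with a short sketch. It assumes Python with NumPy; the truncated `ELEMENTS` list, the `count_vector` helper, and the four example compounds are hypothetical, and PCA via SVD is used here only as one example of a general-purpose dimension-reduction technique, not as the method anyone in this thread proposed.

```python
import numpy as np

# Hypothetical, truncated enumeration; a real one would list every element.
ELEMENTS = ["H", "C", "O", "Na", "Cl"]

def count_vector(counts):
    """Map {element: atom count} to a vector of raw counts over ELEMENTS."""
    return np.array([counts.get(el, 0) for el in ELEMENTS], dtype=float)

propyne = count_vector({"C": 3, "H": 4})         # C3H4
cyclohexadiene = count_vector({"C": 6, "H": 8})  # C6H8

# Fractional abundances coincide, raw counts do not.
same_fractions = np.allclose(propyne / propyne.sum(),
                             cyclohexadiene / cyclohexadiene.sum())
same_counts = np.array_equal(propyne, cyclohexadiene)

# Plain PCA (via SVD of the centered data) applied directly to the vectors;
# no log transform is involved, so zeros cause no trouble.
X = np.vstack([propyne, cyclohexadiene,
               count_vector({"Na": 1, "Cl": 1}),   # NaCl
               count_vector({"H": 2, "O": 1})])    # H2O
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T  # two-dimensional scores, one row per compound
```

Here `same_fractions` is true while `same_counts` is false, so the count representation keeps information that the normalized one discards.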
