
I'm currently working on datasets derived from repeated measures over time (blood concentrations). The descriptors of these datasets describe the shape of the curves (concentration over time) for each individual. The purpose is to determine whether there is only one type of curve shape or two, indicating equivalence or non-equivalence between two products. The hypothesis is:

  • if there is only one type of shape the products are equivalent
  • if there is more than one type of shape, the products are not equivalent

Indeed, I had to generate synthetic data (which follow the tendency of the real data). I have generated datasets in which the products are equivalent and some in which they are not.

As a first attempt I tried PCA and k-means clustering, searching for the optimal number of clusters to decide whether the products are equivalent. The fact is that this technique doesn't work at all.

Now I plan to work with classifiers such as random forests, SVMs, or ANNs, trained on the synthetic data, in order to estimate the number of shape types in a real dataset. The problem is the labelling of the data: in this case I can't use labels from individuals, because the question is not to classify individuals but datasets in their entirety.

I've heard about SOMs and growing neural gas, but I don't really know how to use them in that way. I also tried extracting features from the dataset, such as means and standard deviations, but that doesn't seem relevant.

Has anyone already worked on this type of problem?

kjetil b halvorsen
S.Gradit
  • BTW, if you include more information, more opinion can be given, for example, the route of administration, metabolic products if any, routes of elimination, relationship of elimination to plasma concentration, pharmacodynamics and so forth. – Carl May 18 '19 at 14:30
  • @Carl Actually I'm working on essential amino acid absorption after oral consumption. I've been working only on NCA so far because of the lack of information about the constants Ke / Ka per acid. The purpose here is to assess the equivalence between protein products during a clinical trial. The fact is that classical indicators such as AUC (incremental technique) are biased considering the variation of the baseline (amino acid concentration after about 10 h fasting) in the same subject on different days. I've tried to work on curve-derived shape indicators .... – S.Gradit May 20 '19 at 14:56
  • ... and I've observed at least 2 groups of subjects. The idea is now to predict, given a dataset of derived parameters or raw PK data, the number of subject profiles, to eventually correct the experiment plan. – S.Gradit May 20 '19 at 14:59
  • Ke, Ka are constants from the [convolution of two exponential density functions](https://ejnmmiphys.springeropen.com/articles/10.1186/s40658-016-0166-z) (Eq. 1). There is no shape coefficient for such a model, so even if you had Ke, Ka information you would not be looking at shape. – Carl May 20 '19 at 15:26
  • You're right, but Ka and Ke would be useful if I had a reliable model from a compartmental analysis, which I haven't found yet (at the whole-body scale). – S.Gradit May 21 '19 at 08:16
  • @Carl: Given the density function $f(x) = \frac{\lambda_1 \lambda_2}{\lambda_2 -\lambda_1} [\exp(-\lambda_1 x) - \exp(-\lambda_2 x)], \lambda_1 \neq \lambda_2, x\geq0$, changing *either* of the parameters $\lambda_1$ or $\lambda_2$ while holding the other constant will change its shape. Increasing or decreasing both rate parameters in proportion, however, will merely rescale it; suggesting the convenient parametrization $f(x) = \frac{\lambda_1 \theta}{\theta -1} \cdot [\exp(-\lambda_1 x) - \exp(-\theta \lambda_1 x)]$ in which $\theta=\frac{\lambda_2}{\lambda_1}$ is a shape parameter ... – Scortchi - Reinstate Monica May 21 '19 at 09:35
  • ... & $\lambda_1$ a rate parameter. (The same goes for the SET models discussed in your answer.) – Scortchi - Reinstate Monica May 21 '19 at 09:37
  • @Scortchi Perhaps this will help. An explicit shape parameter is one that allows for fitting of $f'(t)$. An implicit shape parameter of the type you propose is inefficient for fitting $f'(t)$, and in addition is not generally robust with respect to real-number solutions to exactly determined systems of equations, for which any of $n$-tuple complex-field solutions occur in practice for biexponentials approximately 2% of the time. No one would suggest that polynomial fitting is shape efficient, and yet, $\Sigma_{i=0}^n\ln(a_it^i)$ or the logarithm of a collection of polynomial shapes will.... – Carl May 21 '19 at 15:32
  • con't... outperform sums of exponential term fitting by approximately one order of magnitude in terms of rrms error. – Carl May 21 '19 at 15:34
  • @Carl: Happy to bow to your expertise when you say these models are sub-par for certain applications; but I can't see any meaningful general distinction between "implicit" & "explicit" shape parameters - at any rate baldly stating that the models don't have shape parameters seems an inadequate account of the matter. – Scortchi - Reinstate Monica May 21 '19 at 16:27
  • Why restrict considerations to compartmental methods? The problem is not even vaguely compartmental. Consider that convolution modelling of the Bateman type produced the model you are considering and I would generalize the convolution as being much more open ended than compartmental modelling. For example see [this](https://ejnmmiphys.springeropen.com/articles/10.1186/s40658-016-0166-z). – Carl May 21 '19 at 16:31
  • @Scortchi There may be nothing else out there about the difference between implicit and explicit shape parameters. I was groping for concepts to explain observations. The key concept is not just "shape parameter" but "shape parameter that allows for $f'(t)$ fitting". Consider this, for example. We cannot believe half-life information if we do not explicitly allow for derivative fitting. One now occasionally sees PK spline fitting rather than SET fitting for precisely that reason. Personally, I prefer a more direct approach; the chaining of physiologic processes using convolution. – Carl May 21 '19 at 16:45

1 Answer


Several considerations here pertain to the background upon which the question is based rather than the text of the question itself. As they say in Maine, "You can't get there from here." In simplest terms, one must first have shape information in order to test for shape, and there is no shape information here to test, as follows.

1) Concentrations are not random variables, and the best method of fitting concentrations is to take a density function and scale it to fit the data. Although we typically call density functions pdfs, there is no 'p' for probability in one used to match concentration. A pdf can be $f(x)$ or $f(t)$ and can be used to model random variables, but is not itself a random variable. This can cause confusion, e.g., see What is a good name for a density function that does not relate to probability?.

2) Remembering this, one can use pdf notation anyway, and then $C(t)= \kappa\, \text{pdf}(t)$ becomes our fit equation, where $\kappa=\text{AUC}$, the area under the concentration curve from $t=0$ to $\infty$.
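As a minimal sketch of this scaled-density fit (simulated data; the gamma pdf here is only a placeholder density, and all parameter values are illustrative):

```python
# Sketch: fit C(t) = kappa * pdf(t) by scaling a density to concentration data.
# The gamma density is a placeholder pdf; data and parameters are simulated.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
t = np.linspace(0.25, 12.0, 30)                      # sampling times (h)
true_kappa = 80.0                                    # AUC, e.g. mg*h/L
c_obs = true_kappa * stats.gamma.pdf(t, a=2.5, scale=1.5)
c_obs *= 1 + 0.05 * rng.standard_normal(t.size)      # 5% proportional noise

def model(t, kappa, a, scale):
    # C(t) = kappa * pdf(t); kappa estimates the AUC directly
    return kappa * stats.gamma.pdf(t, a=a, scale=scale)

popt, _ = optimize.curve_fit(model, t, c_obs,
                             p0=[50.0, 2.0, 1.0], bounds=(1e-6, np.inf))
kappa_hat = popt[0]                                  # recovered AUC, near 80
```

Here $\kappa$ falls out of the fit directly as the AUC, with no trapezoidal integration of the raw samples.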

3) The pdfs used must have shape parameters in them; otherwise there will be no derivative fitting or shape fitting, and all the shapes fit will be indistinguishable.

4) The usual functions used for fitting concentrations are sums of exponential terms (SET). As pdfs these are mixture-distribution models (not of random variables, but of concentrations). Mixture models are always some $f(t)$ or $f(x)$, and this can cause confusion, as mixture models do not have to be models of random variables, e.g., see this answer; for modelling concentration the density functions do not model probability. SET models have no shape parameters, as the exponential distributions from which these models arise have none.
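To make the mixture view concrete, the biexponential (E2) density can be written as a weighted sum of two exponential densities whose weights sum to one; a small sketch (the rate constants are illustrative):

```python
# Sketch: the biexponential (E2) model as a signed mixture of two exponential
# densities. The weights sum to 1, so the mixture is itself a density
# (integrates to 1); the exponential components carry no shape parameter.
import numpy as np
from scipy import stats, integrate

lam1, lam2 = 0.3, 1.2                    # illustrative rate constants
w1 = lam2 / (lam2 - lam1)                # mixture weights; w1 + w2 = 1
w2 = -lam1 / (lam2 - lam1)

def biexp_pdf(t):
    return (w1 * stats.expon.pdf(t, scale=1 / lam1)
            + w2 * stats.expon.pdf(t, scale=1 / lam2))

area, _ = integrate.quad(biexp_pdf, 0, np.inf)   # total area: 1.0
```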

5) Much better derivative fitting is achieved using gamma distributions, or gamma distribution convolutions, where the latter have no fewer than two shape parameters and allow very high-precision concentration and shape fitting for data up to about four hours, for at least some drugs.

6) Once one has a model that actually follows the shape of the concentration curve, one should be able to just inspect the shape parameter(s) to see if they are statistically different for the two cases you are considering.

This is a topic that I am actively researching, and there are multiple conclusions that I cannot reveal at this time. However, I would also suggest reading this to develop a broader background on how these problems should be treated.

Edit: More information on fit procedures, shape determination, etc. Consider, for example, that fitting by minimizing proportional error (or indeed any error minimization) is not robust for biexponential (E2) and higher-order SET models: approximately 2% of those models converge to one of $n$-tuple solutions in the complex plane. Also, shape information from Tikhonov regularization that minimizes AUC error over the whole ill-posed curve, for gamma distribution fits to GFR markers, allows one to extract more exact clearance (CL) and volume of distribution (V$_{\text{d}}$) information, such that, unlike for SET functions, there is no correlation (i.e., no contamination) between fluid disturbance and CL. However, that is insufficient to separate the shape information from the total duration of sample collection, so any comparison of shapes would have to be undertaken under the same sample-time collection regimen.

Finally, a full model of concentration that may obviate the necessity of using regularization, that is, a full-bore well-posed model, must contend with the actual shape of the right-hand tail function, as well as eliminate all instant-mixing assumptions. Indeed, there is probably more evidence to support power-function right-hand tails than exponential ones. I am working on such a model, and it is not trivial. For one thing, it necessitates variable-volume modelling as well as half-life as a function of time. Most people have difficulty understanding that the half-life for a density function other than a memoryless exponential is not a constant, but 1) is a function of time and 2) is negative when concentration is increasing. Newton must be turning over in his grave that people do not generally understand the slope at a point in time of the logarithm of a nonexponential density function.

Maybe some other approach would be of interest to you, depending on what exactly it is you are doing. For example, take a look at this article.

Carl
  • Hello, thanks for these considerations and information. I must admit that I'm not really familiar with some concepts like gamma distribution convolutions, but I'm going to learn about that. I used to work in a classical pharmacokinetics context (I mean with C(0) = 0 and not with a variable baseline like now), using compartmental and non-compartmental analysis. I forgot to mention that the idea is to allow the end user to check, at the mid-point of the bioequivalence study, the probability of non-equivalence given multiple shape profiles, with half of the data. – S.Gradit May 17 '19 at 10:00
  • 1
    (+1) Have you anything to add on the errors (the stochastic part of the model) & how you fit it? – Scortchi - Reinstate Monica May 17 '19 at 17:03
  • @Scortchi Yes, but it is contingent on 1) the purpose of fitting, 2) the degree of ill-posedness of the match between data and model for that purpose of fitting, 3) the robustness of the data and model match for that purpose of fitting, 4) the statistical properties of the model with respect to a candidate fitting procedure 5) the residual structure from fitting a candidate fit procedure, and finally these prior points, including error calculations should be used for model selection in a context that includes proper physiological assumption, proper initial conditions, etc. – Carl May 17 '19 at 18:37
  • @Scortchi con't.... Such that any answer depends on so many factors that a proper answer would be of monograph length. I can be more specific if you posit the question more concretely. To do otherwise would be misleading. For example, I could describe what people do, but that would not be the same thing as what people should be doing. – Carl May 17 '19 at 18:47
  • You lost me at "concentrations are not random variables." Either this is a trivial ontological statement or it's a denial of the utility of probability models in this setting, neither of which is a constructive approach. It leaves me doubting that you are using any of the subsequent terminology in standard ways. – whuber May 17 '19 at 19:43
  • @whuber con't.... Perhaps read https://stats.stackexchange.com/q/404498/99274 like I asked you to before, and then comment. – Carl May 17 '19 at 20:48
  • @whuber OK, concentrations can have errors, which errors, and not the concentrations themselves, are random variables. Concentrations are almost always modeled using density functions, although this is, in fact, largely unrecognized. Consider this, concentration is a density, and in place of a probability, the definite integral from some time is a to b is a mass per clearance, not a probability. For example, D = CL AUC, where AUC is the total area under the concentration curve, CL is clearance and D is total dose of a single dosing. – Carl May 17 '19 at 20:54
  • Those issues--which concern *units of measurement* of concentrations--do not appear to be germane to any aspect of the question in this thread. – whuber May 17 '19 at 21:19
  • @whuber Agreed. And in this case, the appearance is deceiving. The underlying problem is that the models being used by the OP do not lend themselves to the measurements being attempted. OP asked for advice from anyone, so I gave what I considered helpful advice, which is, "do not do that that way". – Carl May 17 '19 at 21:37
  • @whuber I also agree that my answer is not what the OP expected, it does not answer the question, and as posited, the question cannot be answered, at all. So what would you have me do? – Carl May 17 '19 at 21:46
  • Post comments to ask the OP to clarify the points you think are important. – whuber May 17 '19 at 21:48
  • @whuber Why? I did not need clarification of the question to answer that it cannot be done. Shape is from derivative fitting. The models being used have no shape parameters, and it is useless to try to classify models based on shape using models that have no shape fitting. End. I may have saved the OP years of futile effort by saying that, and that is help. Now if you want me to answer the question under the assumption that the system of equations is entirely modified, then I would need clarification. One point failure is enough to suggest going back to the drawing board, is it not? – Carl May 17 '19 at 22:14
  • If you know you didn't answer the question, that means you believe the question should be reformulated. If you truly want to help in this forum, then first find out what help is really needed and address that issue directly. Anything else risks confusing many readers. At a minimum, when posting a reply that does not answer a question as stated, begin with a clear restatement of your interpretation of the question so that everyone will understand what you are trying to respond to. – whuber May 17 '19 at 22:29
  • @whuber OK, put in short preamble. I was in the original speaking to the OP and forgot about the other readers who likely do not have the OP's background knowledge, apologies. – Carl May 18 '19 at 00:09