
Suppose I have data $x$ and $y$, where $x$ is a count and $y$ is continuous. I would like to predict $x$ from $y$.

Specifically, for my research question, $x$ can be treated as measured without error (it is fixed by design), whereas $y$ is random.

Below are my data:

$x$:  [1] 1 2 3 4 5 6 7 8 9 10

$y$:  [1] 1.0000 1.8002 2.4383 2.9353 3.3641 3.6847 3.9578 4.1610 4.3139 4.4667

The data above come from a simulation that I developed to generate asymptotic "accumulation" curves. The simulation randomly samples, without replacement, from a pool of distinct character labels and computes the mean number of labels recovered for each number of individuals (the $x$ data). For these data there are 5 character labels, and I want to see whether I can recover all 5 of them. Based on the data above, only 4.4667 labels have been recovered on average at $x = 10$.

I am looking for a regression technique that I can apply to data of this kind.

Specifically, I would use the proposed regression method to answer a question such as "What value of $x$ corresponds to $y = 5$?" That is, in the context of my data, what $x$ is needed to observe exactly $y = 5$ labels?

Besides inverse regression, I am unaware of any existing alternatives that would be appropriate in this setting.
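
To make the question concrete, here is a minimal sketch of classical inverse prediction in Python: regress $y$ on $x$ (keeping $x$ as the fixed predictor), then solve the fitted curve for the $x$ that gives a target $y_0$. The curve form $y = a(1 - b^x)$ is an assumption chosen only for illustration and is not stated anywhere in the question; the function names `fit_asymptote`, `fit_curve`, and `inverse_predict` and the target $y_0 = 4.0$ are likewise hypothetical.

```python
import math

# Data copied from the question: x is fixed by design, y is the random response.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.0000, 1.8002, 2.4383, 2.9353, 3.3641, 3.6847, 3.9578, 4.1610, 4.3139, 4.4667]


def fit_asymptote(b):
    """For a fixed base b, return the least-squares asymptote a and the RSS.

    ASSUMED model (not from the question): y = a * (1 - b**x).
    For fixed b the model is linear in a, so a has a closed form.
    """
    g = [1.0 - b ** xi for xi in x]
    a = sum(yi * gi for yi, gi in zip(y, g)) / sum(gi * gi for gi in g)
    rss = sum((yi - a * gi) ** 2 for yi, gi in zip(y, g))
    return a, rss


def fit_curve(lo=0.01, hi=0.99, tol=1e-10):
    """Golden-section search over b in (lo, hi), profiling out a in closed form.

    Assumes the residual sum of squares is unimodal in b on this interval.
    """
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - phi * (hi - lo)
    d = lo + phi * (hi - lo)
    while hi - lo > tol:
        if fit_asymptote(c)[1] < fit_asymptote(d)[1]:
            hi = d  # minimum lies in [lo, d]
        else:
            lo = c  # minimum lies in [c, hi]
        c = hi - phi * (hi - lo)
        d = lo + phi * (hi - lo)
    b = (lo + hi) / 2.0
    a, _ = fit_asymptote(b)
    return a, b


def inverse_predict(y0, a, b):
    """Solve a * (1 - b**x0) = y0 for x0; no finite solution exists if y0 >= a."""
    if y0 >= a:
        return math.inf
    return math.log(1.0 - y0 / a) / math.log(b)


a_hat, b_hat = fit_curve()
x0_hat = inverse_predict(4.0, a_hat, b_hat)  # illustrative target y0 = 4.0
print(a_hat, b_hat, x0_hat)
```

Because the assumed curve approaches $a$ only asymptotically, a target such as $y = 5$ has a finite solution only if the fitted $a$ exceeds 5; otherwise the sketch returns infinity. It also produces a point estimate only, with no interval around the predicted $x$.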

compbiostats
  • What are "nonlinear data"? Why do you say they "don't fit any parametric model"? Is $n$ known/fixed? Could each $x$ be considered the sum of $n$ independent dichotomous observations? Explaining what $x$ & $y$ are would probably help make your question a lot clearer. – Scortchi - Reinstate Monica Oct 04 '17 at 15:52
  • @Scortchi I have added some information to my question post that hopefully should make things clearer. – compbiostats Oct 04 '17 at 16:07
  • You really need to tell us what $x$ and $y$ *represent*. – kjetil b halvorsen Oct 04 '17 at 16:36
  • @kjetilbhalvorsen Please see my revised post with sample data and plot. – compbiostats Oct 04 '17 at 17:14
  • With regard to the latest edit: is $x$ fixed *by design*? What's the sampling scheme? If you don't explain anything at all about what the data represent and how they were obtained how can a suitable regression model be suggested? – Scortchi - Reinstate Monica Oct 04 '17 at 20:49
  • The question, as it stands, is difficult to grasp. It does not help that $y$ seems to be the covariable and $x$ the response. Usually, these symbols are used the other way round. – Michael M Oct 04 '17 at 20:57
  • @MichaelM $x$ is not the response; rather $x$ is the predictor. $y$ is the response variable. Given a value of $y$ how can (should) I go about predicting $x$? – compbiostats Oct 04 '17 at 21:04
  • @Michael Pay no attention to the names of the variables: *read the question!* Jarrett, one approach is called [inverse regression](https://stats.stackexchange.com/a/206682/919). – whuber Oct 04 '17 at 21:09
  • @whuber I have considered inverse regression, though it is quite controversial in the statistics community. Do other approaches exist that you know of? I don't think that swapping $x$ and $y$ is feasible here since $x$ is fixed by design, even though my data produce a monotonic curve. – compbiostats Oct 04 '17 at 21:14
  • What is the controversy? You're right about keeping the current roles of $x$ and $y$, so inverting the regression of $y$ against $x$ looks like a reasonable thing to consider. – whuber Oct 04 '17 at 21:19
  • @whuber The "controversy" I speak of refers to predicting $x$ from $y$ using "classical" vs. inverse regression estimators. The literature by Brandon Greenwell is an excellent source here. – compbiostats Oct 04 '17 at 21:31
  • Thank you. Interestingly, the top hit in a Google search turned up Greenwell's [`investr` package for `R`](https://journal.r-project.org/archive/2014/RJ-2014-009/RJ-2014-009.pdf), which offers many different ways to carry out inverse regression! He refers to this as "classical and well-known," so it's hard to see where the controversy might be. – whuber Oct 04 '17 at 21:34
  • @whuber Yes, I am using `investr` indirectly. I am fitting Generalized Additive Models (GAMs) using Simon Wood's handy `mgcv` R package. GAMs are not included in the `investr` package, so I have to develop my own approach in the interim. – compbiostats Oct 04 '17 at 22:21
  • @whuber I found this excellent (fairly recent) source: A Comparison of Classical and Inverse Estimators in the Calibration Problem by Kannan et al. (2007) that you may find interesting and useful for future discussion... – compbiostats Oct 05 '17 at 04:37
