Let's say I have a dependent variable y (a rating of the pleasantness of shopping at a particular store from 0 to 100), and 10 independent variables X1 .. X10. X1 is the percent capacity of the store (between 0 and 100), and X2 ... X10 are other attributes of the store.

I know a priori that, with all other variables (X2 .. X10) controlled for, the relationship between y and X1 must be shaped like a bell curve: with too few customers in the store the experience is unpleasant, and with too many customers it is unpleasant as well. I also know that the tails of this curve are pinned at zero: when the store is empty or at maximum capacity, the pleasantness rating is 0. I don't know the peak pleasantness rating (the height of the curve), though. For example, with $x = X_1/100$ rescaled to $[0, 1]$, a suitable model might be

$$\mathbb{E}[y|X_1=x,X_2=x_2,\ldots,X_{10}=x_{10}]=f(x;x_2,\ldots,x_{10}) \propto x^a(1-x)^b$$

with Gaussian errors. How can I fit a regression $y = f(X_1, \ldots, X_{10}) + \epsilon$ such that the relationship constraint between y and X1 is forcibly maintained?

DeltaIV
user1566200
  • "Bell curve" is not a well-defined mathematical constraint. What do you mean? That $\mathbb{E}[y|X_1=x,X_2=x_2,\ldots,X_{10}=x_{10}]=f(x;x_2,\ldots,x_{10}) \propto x^a(1-x)^b$? That $f$ is nonnegative with a single maximum in $I=[0,1]$, 0 outside $I$ and continuous everywhere? Something else? Please explain – DeltaIV Nov 03 '17 at 13:13
  • @DeltaIV Indeed, you are spot on with your description. – user1566200 Nov 03 '17 at 13:29
  • Ok. Can you 1) add that into the question body and 2) add some sample data, a brief description of the problem, etc.? There are ways to do what you want, but the choice of the approach will depend also on the problem you're trying to solve. – DeltaIV Nov 03 '17 at 15:23
  • @DeltaIV 1) Not entirely sure how to add your mathematical notation formatted properly to the post. 2) I don't really have any sample data to provide, but I can make some up. – user1566200 Nov 03 '17 at 15:27
  • Ok, I'll do it for you, then you let me know if it works. The issue if you don't have data is that I have no idea which error distribution to assume. Is `y` always positive? Is it a ratio of positive real numbers? A ratio of integers? Should I assume Gaussian errors? – DeltaIV Nov 03 '17 at 15:31
  • I edited the post: `y` will always be positive, as will X1. Assume a Gaussian distribution of errors. – user1566200 Nov 03 '17 at 15:33
  • For responses that are constrained, and which approach the limiting values, a Gaussian distribution is a poor model--it just cannot be right and usually leads to poor fitting. More consideration and study should be devoted to understanding how the variations ought to be modeled and what conditional distribution might be a good choice. – whuber Nov 03 '17 at 18:27

1 Answer

You can accommodate a very general shape of the relation of a predictor to outcome (including such non-monotonic relations as you expect for X1) by modeling X1 as a restricted cubic spline. In R this is implemented by the rcs() function in the rms package. Instead of forcing your preconception of the shape upon the data, this allows the data to show you the actual shape of the relation. This might be the best way to proceed unless you are sure that you know the exact mathematical form of your "bell curve" relation between occupancy and "pleasantness."
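For readers not working in R, here is a minimal sketch of the same idea in Python with NumPy. It is not the `rcs()` implementation itself: the helper `rcs_basis` is a hypothetical hand-written restricted cubic spline basis (truncated power form, linear beyond the outer knots), the knot quantiles are one common choice, and the data are simulated purely for illustration, with a single extra covariate `x2` standing in for X2 .. X10.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis: k knots give k - 1 columns, linear in the tails."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    scale = (t[-1] - t[0]) ** 2          # keeps the columns on a scale similar to x
    width = t[-1] - t[-2]

    def pos3(u):
        return np.maximum(u, 0.0) ** 3   # truncated cubic (u)_+^3

    cols = [x]                           # the first basis column is x itself
    for j in range(len(t) - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / width
                + pos3(x - t[-1]) * (t[-2] - t[j]) / width)
        cols.append(term / scale)
    return np.column_stack(cols)

# Simulated data (an assumption, not from the question): a hump in x1 plus a linear x2 effect.
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(0, 100, n)              # percent capacity
x2 = rng.normal(size=n)                  # stand-in for the other store attributes
u = x1 / 100
y = 240 * u * (1 - u) + 3 * x2 + rng.normal(scale=2, size=n)

# Five knots placed at quantiles of x1 (one common choice; adjust to taste).
knots = np.percentile(x1, [5, 27.5, 50, 72.5, 95])

# Ordinary least squares on [intercept, spline(x1), x2].
design = np.column_stack([np.ones(n), rcs_basis(x1, knots), x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Estimated shape of y against x1 with x2 held at 0.
grid = np.linspace(0, 100, 101)
grid_design = np.column_stack(
    [np.ones(grid.size), rcs_basis(grid, knots), np.zeros(grid.size)]
)
fitted_curve = grid_design @ coef
peak_at = grid[np.argmax(fitted_curve)]
print("estimated peak location (% capacity):", peak_at)
```

Note that a spline fitted this way does not force the curve to be zero at 0% and 100% capacity; it only lets the data reveal the shape, so plotting `fitted_curve` against `grid` is a useful check on whether the assumed bell shape actually holds.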

EdM