Best model for regression of data

Question

I have a dataset where:

X        Y
123      123
141      151
424      525
12       12
90       90
24       25

The pattern is clear to me: if X contains any 4s, replace each of them with a 5; otherwise leave X unchanged. I know the pattern because I generated the dataset myself, but I want the model to learn it on its own. I have tried splitting X into individual digits, but I can't figure out which model would give accurate predictions. Any help is appreciated.
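For reference, the generating rule described above can be written as a short Python sketch. The function name `replace_fours` is illustrative only and does not come from the original post; the inputs are the six X values from the table.

```python
def replace_fours(x: int) -> int:
    """Return x with every decimal digit 4 replaced by 5 (hypothetical helper)."""
    return int(str(x).replace("4", "5"))


# The X values from the table in the question.
xs = [123, 141, 424, 12, 90, 24]
ys = [replace_fours(x) for x in xs]

for x, y in zip(xs, ys):
    print(x, y)
```

Running this reproduces the Y column of the table, which confirms the rule as stated.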

Ordinary least squares will work just fine, because the response is a linear function of X and its digit classes. — whuber, Jun 30 '18 at 18:12
Can you please explain how this is a linear function? And do I need to split the number into individual digits before feeding it into the model? — Anwesh Mohapatra, Jul 01 '18 at 05:53
Let the three digits of $X,$ from least to most significant, be $X_0,X_1,X_2.$ For $i=0,1,2$ and $j=0,1,\ldots,9$ let $I_{ij}$ be the indicator $$I_{ij}=\mathcal{I}(X_i=j).$$ Because $$Y=(1)X+(1)I_{04}+(10)I_{14}+(100)I_{24},$$ you can fit it *perfectly* (with no error) with the model $$Y=\beta X+\sum\beta_{ij}I_{ij}.$$ It is linear in the parameters $(\beta,\beta_{ij}):$ this is the usual sense of a "linear model." (See https://stats.stackexchange.com/a/148713/919 for a general discussion of what that means.) Ordinary least squares regression will easily find this fit: try it! — whuber, Jul 01 '18 at 15:02
I'm sorry for troubling you so much, but all this notation is going over my head. What's an indicator? Can you explain in simpler terms? So sorry for making you work so much. — Anwesh Mohapatra, Jul 01 '18 at 20:22
We have a search engine. See https://stats.stackexchange.com/search?q=indicator+variable. — whuber, Jul 01 '18 at 20:40
I used my brain a little bit and understood this. Thank you so much! — Anwesh Mohapatra, Jul 01 '18 at 21:43
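The indicator-variable model $Y=\beta X+\sum\beta_{ij}I_{ij}$ described in the comments above can be tried with a minimal sketch like the one below. It is not code from the thread. It assumes NumPy, inputs with at most three digits (0 to 999), and a random 700/300 train/test split; the names `replace_fours` and `design_matrix` are hypothetical. Because $X$ is itself a linear combination of the indicators, the design matrix is rank-deficient, so the individual coefficients are not unique. `np.linalg.lstsq` returns the minimum-norm solution, and the predictions should still be essentially exact.

```python
import numpy as np


def replace_fours(x: int) -> int:
    """Return x with every decimal digit 4 replaced by 5 (hypothetical helper)."""
    return int(str(x).replace("4", "5"))


def design_matrix(x: np.ndarray, n_digits: int = 3) -> np.ndarray:
    """Columns: X itself, then one 0/1 indicator I_ij per (digit position i, digit value j)."""
    cols = [x.astype(float)]
    for i in range(n_digits):          # i = 0 is the least significant digit
        digit = (x // 10**i) % 10
        for j in range(10):
            cols.append((digit == j).astype(float))
    return np.column_stack(cols)


# Assumed setup: all integers 0..999, shuffled, 700 for training and 300 held out.
rng = np.random.default_rng(0)
x_all = rng.permutation(1000)
y_all = np.array([replace_fours(int(v)) for v in x_all], dtype=float)
x_train, x_test = x_all[:700], x_all[700:]
y_train, y_test = y_all[:700], y_all[700:]

# Ordinary least squares; lstsq copes with the rank-deficient design.
coef, *_ = np.linalg.lstsq(design_matrix(x_train), y_train, rcond=None)

pred = design_matrix(x_test) @ coef
max_abs_error = float(np.max(np.abs(pred - y_test)))
print("largest held-out error:", max_abs_error)
```

The model is never told which digit is special. It only receives the digit indicators, and least squares assigns the extra weight to the indicators for the digit 4.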

score 0 · Answer 1 · answered Jun 25 '18 at 11:12

In my opinion, your problem looks more like a text-processing problem than a statistical-analysis problem. Still, linear regression can give fairly satisfying results on some datasets; in other words, the quality of the fit depends on the data used to train the model.

For example, I generated a dataset of 10,000 samples ranging over [0, 100]. Fitting a straight line gives more accurate predictions for the points that do not contain a 4. In the figure below, the straight line (the result of the linear regression) lies almost on top of those points, but the points that do contain a 4 shift the line slightly upward from where it should be. This becomes clearer in the next example.

Now consider a dataset of 1,000 points ranging over [0, 4999].

Because of the 4s in the thousands place, the output shows much more variance. This gives even worse results for the data points that do not contain a 4. (You can see this by looking at the fitted line in the graph above.)

So regression would not be the best choice for such problems.
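The comparison above can be approximated with the following sketch. It is an assumed reconstruction, not the code used to produce the figures in this answer: it draws uniformly random integers with NumPy, fits a straight line with `np.polyfit`, and reports the root-mean-square error for each range. The names `replace_fours` and `line_fit_rmse` are hypothetical.

```python
import numpy as np


def replace_fours(x: int) -> int:
    """Return x with every decimal digit 4 replaced by 5 (hypothetical helper)."""
    return int(str(x).replace("4", "5"))


def line_fit_rmse(x: np.ndarray) -> float:
    """Fit Y = intercept + slope * X by least squares and return the RMSE of the fit."""
    y = np.array([replace_fours(int(v)) for v in x], dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return float(np.sqrt(np.mean(residuals**2)))


rng = np.random.default_rng(0)
x_small = rng.integers(0, 101, size=10_000)   # 10,000 samples in [0, 100]
x_large = rng.integers(0, 5000, size=1_000)   # 1,000 samples in [0, 4999]

rmse_small = line_fit_rmse(x_small)
rmse_large = line_fit_rmse(x_large)
print("RMSE on [0, 100]: ", rmse_small)
print("RMSE on [0, 4999]:", rmse_large)
```

The second error is much larger because every value from 4000 to 4999 is shifted by 1000, a jump that a single straight line cannot follow.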

It depends on what you mean by "regression." Most people have a broader sense of this concept than fitting a straight line. In particular, OLS is quite capable of modeling these data using "dummy variables." — whuber, Jun 25 '18 at 12:38
Whuber is right. I know that linear regression most likely won't work; that's why I'm asking for a model that can achieve this. Furthermore, I want the computer to detect the pattern on its own. Text processing means I'll have to write the code myself to convert 4s to 5s. — Anwesh Mohapatra, Jun 25 '18 at 14:40
