
I have the following columns:

Level 1        240    12     7
Level 2        98     5      5
Level 3        46     4      6
Level 4        21     0      1

I am trying to show that there is a correlation between the "Level" and the number of people (represented by each column). I used the following formula (`Table` is the table above):

RS2 = r2_score(Table.iloc[1:5,0], Table.iloc[1:5,1], Table.iloc[1:5,2])

print(RS2)

The result is negative (-4....), which seems wrong.

I am undecided, but maybe I should combine the columns into one group, find the mean, and then the estimated values? If not, what should I do?

**Note:** my data is small; concatenating the columns will probably give a large confidence interval.

    I've taken the liberty of editing your title to be more informative. Take a look and see if that captures your main question, and feel free to improve on it by adding more specificity about your particular problem. – Sycorax Dec 09 '21 at 17:27
  • It does not make sense for there to be a negative correlation involving a nominal variable. Is there any order to the levels, or are they like "dog", "cat", "horse", "kangaroo"? // If you do want to find a negative correlation for variables where such a notion makes sense, $R^2$ is not a tool that will help you. That would be a correlation coefficient, which has a relationship to $R^2$ under certain circumstances. // Why do you want to show a negative correlation? – Dave Dec 09 '21 at 17:33
  • I edited my question. I want to show that as the level increases, the number of people declines. – TheUndecided Dec 09 '21 at 17:37
  • So there is an order to the levels? Do you know the difference between each of the levels? Is it constant in the sense that $L4 - L2 = L3 - L1$, etc? – Dave Dec 09 '21 at 17:39
  • The relations here are expressed through a "rating" from 1 -> 4: if certain people "pass" L1 they go to L2, and so on. My theory: L1 is the easiest stage, so there are more people; as the levels increase it gets harder, so fewer people are capable of reaching an advanced level. – TheUndecided Dec 09 '21 at 17:51
  • If the columns represent cohorts, then your hypothesis is disproved, as the last column features an increase from $5$ to $6$. Also, if you have cohorts, then the analysis should include that cohort variable, which makes the problem more complicated and would require a full regression and analysis of the regression coefficients. – Dave Dec 09 '21 at 17:54
  • Yes, I saw it, but can I claim that it happens "most of the time" in general? – TheUndecided Dec 09 '21 at 17:54
  • If I am correct that you have three cohorts, a graph that color-codes the cohorts will tell you a lot. This is a great example of how including another variable (cohort) in the regression decreases variance. (I'd also use an interaction term.) – Dave Dec 09 '21 at 20:51

1 Answer


$R^2$ has nothing to do with the sign of a correlation. While there are ways of getting $R^2<0$ in a regression model (an indication of a poor fit), the notation comes from the fact that $R^2 = r^2$, where $r$ is the sample correlation between two variables, when you fit a regression model $\hat y_i = \hat\beta_0 + \hat\beta_1x_i$ with the extremely common method of least squares.
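To illustrate how a (non-OLS) model can yield $R^2<0$, here is a toy sketch that computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand for a deliberately terrible model (the $\hat y_i = 5 - 10x_i$ form is made up for illustration); this is the same quantity sklearn's `r2_score` would return:

```python
# A deliberately bad "model": predictions anti-correlated with y.
y = [1, 2, 3, 4, 5]
preds = [5 - 10 * x for x in y]  # y_hat = 5 - 10x, using x = y

y_bar = sum(y) / len(y)
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))  # residual sum of squares
ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # -512.0: far below zero, worse than always guessing the mean
```

Because $SS_{res}$ dwarfs $SS_{tot}$ here, $R^2$ is very negative: the model is outperformed by simply predicting the mean of $y$ every time.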

However, it looks like you have one variable with the levels in it and one variable with the numerical observations:

$$ X = (1,2,3,4,1,2,3,4,1,2,3,4)\\ Y = (240, 98, 46, 21, 12, 5, 4, 0, 7, 5, 6, 1) $$

Since your levels appear to be ordinal---that is, ordered but with unclear differences between them---Spearman's rank correlation is appropriate here. In R, the line is `cor(x, y, method = "spearman")`.
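Since the question is in Python, the same computation can be sketched with `scipy.stats.spearmanr`, using the $X$ and $Y$ vectors above (levels repeated once per column, counts read off column by column):

```python
from scipy.stats import spearmanr

X = [1, 2, 3, 4] * 3                              # levels, once per cohort column
Y = [240, 98, 46, 21, 12, 5, 4, 0, 7, 5, 6, 1]    # counts, column by column

rho, p = spearmanr(X, Y)
print(rho, p)  # rho is about -0.50, p is about 0.10
```

Note that `spearmanr` reports a p-value from a large-sample approximation; with only $n = 12$ observations and tied values, treat it as rough, just as with R's warning about the exact p-value.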

This gives me a result of about $-0.5$. However, the plot is not so convincing. When I test whether the Spearman correlation is nonzero via `cor.test(x, y, method = "spearman")`, I get a p-value that tends to be considered inconclusive, $p = 0.099$, along with a warning that the exact p-value cannot be computed due to tied values. I am not sure how serious this is, but, combined with the graph, I am skeptical that increasing the level decreases the $Y$ variable.

Dave
  • Thanks. Probably due to the small population size ($n$)? – TheUndecided Dec 09 '21 at 18:01
  • "While there are ways of getting $R^2<0$"… Would you care to share what these ways are for a correlation between exactly two variables? – Alexis Dec 09 '21 at 18:21
  • @Alexis `set.seed(2021); N – Dave Dec 09 '21 at 18:33
  • Well, that's three variables, not 2. The correlation between $y$ and $preds$ is $-1$, and the $R^2$ is therefore 1: `summary(lm(y~preds))` (look at `R-squared`) and `cor(y,preds)`, so no: I do not think I am (yet) persuaded. (Alternately: `summary(lm(preds~y))` gives the same thing: $R^2 = 1$.) – Alexis Dec 09 '21 at 20:00
  • @Alexis Those predictions are generated by $\hat y_i = 5 - 10x_i$, a model that gives $R^2<0$. – Dave Dec 09 '21 at 20:05
  • Not on my version of R, nor, I think if you care to run the commands I just gave you, on yours. Not sure in what sense your `r2` variable is an $R^{2}$? (and specifically for what two variables?) – Alexis Dec 09 '21 at 20:08
  • Okay, I see my mistake, but the point remains that those predictions are generated by some model, one that has $R^2<0$. – Dave Dec 09 '21 at 20:09
  • `summary(lm(y~x))` gives `R-squared: 0.891`, which is not surprising given that `cor(y,x)` gives 0.9439313. $0.9439313^2 = 0.8910063$. Again: No, you have not provided an example of a negative $R^2$ between two variables. Similarly, `cor(preds,x)` gives `-0.9439313`, and `summary(lm(preds~x))` gives an $R^2 = 0.891$, since $-0.9439313^2 = 0.8910063$ also. – Alexis Dec 09 '21 at 20:11
  • An OLS regression with an intercept gives $R^2\ge 0$, so if you want such a regression model with $R^2<0$, you're not going to find one (except maybe some numerical technicalities). However, some model generates the `preds` I gave, and that model has $R^2<0$. If you want new `R` code, perhaps try this: `set.seed(2021); N – Dave Dec 09 '21 at 20:17
  • $$R^2 = 1-\dfrac{SSResiduals}{SSTotal}$$ In my model that gives `preds`, the $SSResiduals$ exceeds the $SSTotal$, resulting in $R^2<0$, as my code shows. // @Alexis Consider what it means for the predictions and residuals to have correlation $<0$: it means that low values of $y$ tend to be predicted as high values while high values of $y$ tend to be predicted as low values, an indication of a terrible model that is outperformed by always guessing the mean. – Dave Dec 09 '21 at 20:29
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/132201/discussion-between-dave-and-alexis). – Dave Dec 09 '21 at 20:39
  • @Alexis but also for the community: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative AND https://stats.stackexchange.com/questions/134167/is-there-any-difference-between-r2-and-r2?noredirect=1&lq=1 – Dave Dec 09 '21 at 20:43
  • Dave I totally overlooked the "not OLS". I stand persuaded. – Alexis Dec 09 '21 at 20:54