
I have test scores for the theoretical ($T_i$) and practical ($P_j$) exams of students ($s$), and the aim is to quantify the dependence between theoretical and practical success.

I calculated the sums for each student

$$T_s = \sum_i T_i\, W_i \qquad P_s = \sum_j P_j\, W_j$$

where $W_i$ and $W_j$ are the weights of corresponding exams.

As a starting point, let us give a weight of $1$ to all exams, plot $P_s$ vs $T_s$, and calculate the linear regression.

Now we can randomly change the weights of exams and re-calculate the regression to get the best correlation.
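For concreteness, here is a minimal sketch of this setup in R; the score matrices `Tscores` and `Pscores` (and the weights) are made up purely for illustration:

### made-up scores: 20 students, 3 theoretical and 3 practical exams
set.seed(42)
Tscores = matrix(rnorm(20 * 3), ncol = 3)
Pscores = matrix(rnorm(20 * 3), ncol = 3)

### starting point: all weights equal to 1
Ts = Tscores %*% rep(1, 3)
Ps = Pscores %*% rep(1, 3)
plot(Ts, Ps)
summary(lm(Ps ~ Ts))   ### linear regression of P_s on T_s

### naive random search over weights (the costly approach to avoid)
best = -Inf
for (k in 1:10^4) {
  w = runif(3); v = runif(3)
  r = cor(Tscores %*% w, Pscores %*% v)
  if (r > best) { best = r; w_best = w; v_best = v }
}
best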

The question is: what is the best way (e.g., a numerical algorithm with the least computational complexity) to find the weights giving the best correlation, instead of randomly testing millions of possible combinations?

Imagine we intuitively (or empirically) know that the best correlation between $P$ and $T$ is obtained with these weights:

$$W_1 = 1.0,\quad W_2 = 0.5,\quad W_3 = 0.6,\quad W_4 = 0.2,\quad W_5 = 1.0,\quad W_6 = 0.8$$

where $1-3$ are the theoretical exams, and $4-6$ are the practical exams.

How can we obtain these weights statistically?

Googlebot
  • Optimizing $R^2$ could backfire on you, because [it might not measure what you think it does.](https://stats.stackexchange.com/questions/13314) Could you explain what you are hoping to accomplish with the solution? How would you use and interpret the results? – whuber Nov 10 '21 at 18:43
  • @whuber I want to find the exams which contribute to the correlation the most. For example, consider these exams: $T(math)$, $T(psychology)$, $P(physics)$, $P(engineering)$. Most probably, we need to increase the weight of `math` in the theoretical exams to get a better correlation. – Googlebot Nov 10 '21 at 19:02
  • 1
    Correlation between what and what? It will be difficult to address your question without knowing what your $X$ and $Y$ are in $cor(X, Y)$. – Dave Nov 10 '21 at 19:09
  • @Dave correlation between $T$ and $P$ where $T$ is the sum of theoretical exam scores, and $P$ the sum of practical exam scores for each person. The aim is to introduce weights in the summation. – Googlebot Nov 10 '21 at 19:55
  • One really confusing thing about your question is the explicit use of the *same* weights for the theoretical and practical exams. Is this intended? – whuber Nov 10 '21 at 21:10
  • @whuber there is no *same* weight. I used $W=1$ for all exams, as the arbitrary starting point. Each exam (theoretical or practical) can have its own weight to get the best correlation between $P$ and $T$. – Googlebot Nov 11 '21 at 11:01
  • Unfortunately, in the formula you wrote, your notation *explicitly* makes the weights the same. Please edit your post to clarify what you're looking for. – whuber Nov 11 '21 at 13:18
  • @whuber is it clear now? – Googlebot Nov 11 '21 at 13:32
  • @SextusEmpiricus the weights should be adjusted for each and every exam rather than a group of theoretical or experimental. I updated the question with an example. – Googlebot Nov 11 '21 at 13:33
  • @SextusEmpiricus I agree that it should be resolved numerically, but I don't follow $(P_{ij} - T_{ij})$. Exam $i$ is either theoretical or practical. If we have $P_i$ then we do not have $T_i$. – Googlebot Nov 11 '21 at 13:41
  • @SextusEmpiricus I changed the notations to make it clearer. Are we on the same page now? :) – Googlebot Nov 11 '21 at 13:54
  • @Googlebot that makes it not clear. Now you have twice a subscript $_j$ in the equation for $T_j$. I would guess that you mean $$\begin{array}{rclcrcl} P_j &=& \sum_{i=1}^n P_{ij} V_i & \quad \text{and} \quad & T_j &=& \sum_{i=1}^n T_{ij} W_i \end{array}$$ or $$\begin{array}{rclcrcl} P_j &=& \sum_{i=1}^n P_{ij} W_i & \quad \text{and} \quad & T_j &=& \sum_{i=n+1}^{2n} T_{ij} W_i \end{array}$$ – Sextus Empiricus Nov 11 '21 at 13:56
  • @SextusEmpiricus the second one is correct. Exams $1-n$ are practical (belong to $P$) and exams $n+1$-$2n$ are theoretical and belong to $T$. $W$ can be different for each and every exam ($i$). – Googlebot Nov 11 '21 at 14:04

1 Answer


With slightly different notation you can make the problem easier to interpret. It seems you are looking to maximize the correlation between

$$\begin{array}{rclcrcl} P_j &=& \sum_{i=1}^n P_{ij} V_i & \quad \text{and} \quad & T_j &=& \sum_{i=1}^m T_{ij} W_i \end{array}$$
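In matrix notation (rows indexing students, columns indexing exams, and columns centered as assumed below), this objective reads

$$\max_{V,\,W} \; \frac{(PV)^t\,(TW)}{\lVert PV \rVert \; \lVert TW \rVert}$$

which is exactly the first canonical correlation between the two blocks of exam scores.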

The solution that maximizes the correlation is unchanged when the columns of $P_{ij}$ and $T_{ij}$ are translated to have zero mean, so let's assume that the columns have zero mean.
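As a quick sanity check of this invariance in R (with illustrative vectors, not the exam data):

### correlation is invariant under shifting either variable
x = c(3, 1, 4, 1, 5); y = c(2, 7, 1, 8, 2)
all.equal(cor(x, y), cor(x - mean(x), y - mean(y)))   ### TRUE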

The problem is similar to the question Fitting of a sum of vectors $Y \alpha = X \beta + \epsilon$. The difference is in the dimensions of $\alpha$ and $\beta$ (your $V_i$ and $W_i$): in that question the dimensions are equal, but the principle remains the same.

The answer to that question argues that $W$ is the eigenvector with the largest eigenvalue of the following matrix:

$$\begin{array}{rcl} M &=& (T^tT)^{-1}(QT)^t(QT) \\ \end{array}$$ with $Q$ the projection matrix $$\begin{array}{rcl} Q &=& P(P^tP)^{-1}P^t \end{array}$$

The correlation of the solution $\hat{P}_j,\hat{T}_j$ is related to the largest eigenvalue $\lambda$ by

$$\text{cor}(\hat{P}_j,\hat{T}_j) = \sqrt{\lambda}$$

Code example

Below we demonstrate the algorithm with some code and compare it with a direct numerical optimization.

library(MASS)   ### provides mvrnorm

### create exams: let's have
### 3 theoretical exams
### 6 practical exams
### 20 students
### the data are correlated normal distributed variables
set.seed(1)
ns = 20   ### number of students
nt = 3    ### number of theoretical exams
np = 6    ### number of practical exams
rho = 0.5
data = t(mvrnorm(nt+np, mu = rep(0,ns), Sigma = diag(rep(1-rho,ns))+rho))
P = data[,1:np]            ### practical scores (students in rows)
T = data[,(np+1):(np+nt)]  ### theoretical scores (students in rows)

vector_solution = function(P,T) {
  ### subtract the means (without loss of generality)
  Pm = P - rep(1,length(P[,1])) %*% t(colMeans(P))
  Tm = T - rep(1,length(T[,1])) %*% t(colMeans(T))
  
  ### compute matrices
  Q = Pm %*% solve(t(Pm) %*% Pm) %*% t(Pm)  ### projection matrix
  M =  solve(t(Tm) %*% Tm) %*% t(Q %*% Tm) %*%  (Q %*% Tm)
  
  ### solve with eigenvectors
  eig = eigen(M)
  eVec = eig$vectors
  eVal = eig$values
  
  ### the first eigenvalue is the highest and the solution
  W = eVec[,1]
  corr = eVal[1]^0.5
  Tsol = T %*% W
  V = as.numeric(lm(Tsol ~ 0 + Pm)$coef)  ### find V by OLS 
  
  return(list(W = W, V = V, correlation = corr))
}

### optimization of correlation
f <- function(par, P, T) {
  n  = length(P[1,]) 
  m = length(T[1,])
  v = par[1:n]
  w = par[(n+1):(n+m)]
  
  ### correlation
  rho = cor((P %*% v),(T  %*% w))
  
  ### extra term in the optimization 
  ### to ensure that a single solution is possible
  ### (multiples of the vectors have the same correlation)
  d = abs(sum(v^2)-1)^2
  return(-rho+d)
}


numerical_solution = function(P,T) {
  n  = length(P[1,]) 
  m = length(T[1,])
  mod <- optim(par = rep(1, n+m), 
               fn = f,
               method = "BFGS",control = list(maxit=1*10^5,trace=0,reltol=10^-16),
               P = P, T = T) 
  v = mod$par[1:n]
  w = mod$par[(n+1):(n+m)]
  rho = cor((P %*% v),(T  %*% w))
  return(list(W = w, V = v, correlation = rho))
}

Output

The output below shows that the method with the eigenvectors gives the same result as a numerical optimization. The difference is in the scale of the solutions for $W_i$ and $V_i$: you can multiply them by an arbitrary constant and get the same result, since $\text{cor}(aT,bP)$ is the same as $\text{cor}(T,P)$.

> numerical_solution(P,T)
$W
[1] -3.621469  5.383181 -2.667191

$V
[1]  0.2700293 -0.2930244  0.4962644  0.3950038  0.2474477
[6]  0.6145594

$correlation
          [,1]
[1,] 0.7103213

> vector_solution(P,T)
$W
[1]  0.5162597 -0.7674021  0.3802235

$V
[1] -0.09827352  0.17602598 -0.29498806 -0.41464957
[5] -0.23167924 -0.43388866

$correlation
[1] 0.7103213

> vector_solution(P,T)$W/numerical_solution(P,T)$W
[1] -0.1425553 -0.1425555 -0.1425558
> vector_solution(P,T)$V/numerical_solution(P,T)$V
[1] -0.7723890 -0.7723903 -0.7723889 -0.7723876 -0.7723897
[6] -0.7723882
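To compare the two solutions directly, you can normalize each weight vector to unit length; a small sketch reusing the functions above (the helper `unit` is ours, not part of the original answer):

### normalize weight vectors to unit length for comparison
### (the solutions are only identified up to scale and sign)
unit = function(x) x / sqrt(sum(x^2))
num = numerical_solution(P,T)
vec = vector_solution(P,T)
unit(num$W); unit(vec$W)   ### equal up to an overall sign
unit(num$V); unit(vec$V)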
Sextus Empiricus
  • That's a brilliant answer. The R code is quite helpful in implementing the mathematical foundation. Just a quick question. Here, we assume the weights of the practical exams $V_i = 1$, and then calculate the weights for the theoretical exams $W_i$, right? How do we define the range of allowed weights? As I run the code, the weight can be negative, which is not very useful in the real world. – Googlebot Nov 11 '21 at 17:16
  • @Googlebot you get negative weights in the example because in this example the data is generated as i.i.d. normal distributions. For 'real' data you should have less of these problems. (In the code I used `mvrnorm` to make a more realistic example but in the end I did not use it) – Sextus Empiricus Nov 11 '21 at 17:38
  • I have some updates to make. It'll be easier to use the matrix $$\begin{array}{rcl} M &=& A^{-1} A_{\parallel} \\ \end{array}$$ with $$\begin{array}{rcl} A_{\parallel} &=& (QT)^{t}(QT) \\ A &=& T^tT \\ Q &=& P(P^tP)^{-1}P^t \end{array}$$ I will make use of more realistic data in that next update. – Sextus Empiricus Nov 11 '21 at 17:44
  • The negative weights are not important. I used various inputs, and the code works perfectly. I just wonder if it is possible to have a simultaneous solution for both $W$ and $V$. Though I think we cannot have a solution for $V$ as $P$ should be greater than $T$. – Googlebot Nov 11 '21 at 18:32
  • @Googlebot I made some adjustments. The dimensions of P and T are not important anymore and P does not need to be greater than T anymore (it was an artifact from my computation where I had the inverse of a matrix that did not always work). I also changed the code such that it shows how a combined solution with W and V is computed. – Sextus Empiricus Nov 12 '21 at 12:07
  • Your revised answer is brilliant. However, I struggle to match the data: I use some input data (not random) and calculate the weights with your code, but when I calculate the linear regression with the given weights, the correlation differs from the one predicted by the code. I will post some sample data to show the difference (there's not enough space here). – Googlebot Nov 13 '21 at 12:30
  • @Googlebot you need to compute the linear regression with the data centered around the mean (or add an intercept to the regression). In the code above it is done with this line `V = as.numeric(lm(Tsol ~ 0 + Pm)$coef)` where this `Pm` is the matrix `P` with the column means subtracted from the columns. Alternatively you could use an intercept `V = as.numeric(lm(Tsol ~ 1 + P)$coef)` – Sextus Empiricus Nov 13 '21 at 14:10
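For completeness, a quick check of that equivalence, reusing `P`, `T`, and `vector_solution` from the answer (the side-by-side comparison is ours):

### both regressions recover the same slope coefficients
Tsol = T %*% vector_solution(P,T)$W
Pm   = scale(P, center = TRUE, scale = FALSE)   ### column-centered P
coef(lm(Tsol ~ 0 + Pm))        ### centered predictors, no intercept
coef(lm(Tsol ~ 1 + P))[-1]     ### intercept model, slopes only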