
My question is at the end of the post. I tried to give as much information as I can to clarify my understanding and to point out as precisely as possible where I am stuck.

Independent variables or features may be fixed or random

I have recently read that when performing a regression, the independent variables or features may be fixed or random [Independent variable = Random variable?]. For instance, in linear regression, "the values $x_{ij}$ may be viewed as either observed values of random variables $X_j$ or as fixed values chosen prior to observing the dependent variable" [https://en.wikipedia.org/wiki/Linear_regression].

Regularization and prior distribution

Besides, I know that regularization in a machine learning objective function corresponds to prior knowledge about the parameters. For instance, L2 regularization assumes that the parameters follow a centered Normal distribution, and L1 regularization assumes that they follow a Laplace distribution. This is clear to me when the independent variables are fixed, but I have trouble understanding it when they are random variables.

1. Fixed independent variables case

When the independent variables are fixed, we model the dependent variable with a chosen distribution. For instance, for linear regression, we use the following model: $$Y\sim\mathcal{N}(X\beta,\sigma^{2}I)$$
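To make the setup concrete, here is a minimal simulation sketch in Python (the values of `n`, `p`, `sigma` and `beta_true` are made up for illustration; the design matrix `X` is generated once and then treated as fixed):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3              # sample size and number of features (illustrative)
sigma = 0.5                # noise standard deviation, assumed known here
beta_true = np.array([1.0, -2.0, 0.5])

# Fixed design matrix: generated once, then held constant in everything below.
X = rng.normal(size=(n, p))

# Y ~ N(X beta, sigma^2 I): each response is a linear combination plus Gaussian noise.
y = X @ beta_true + rng.normal(scale=sigma, size=n)
```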

Frequentist approach

In the frequentist approach, $\hat{\beta}$ can then be obtained by maximizing the likelihood: $$\hat{\beta}=\underset{\beta}{\mathrm{argmax}}(f_Y(y;\beta))$$
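Continuing the sketch above, maximizing this Gaussian likelihood in $\beta$ is equivalent to minimizing $\lVert y-X\beta\rVert^{2}$, so the MLE coincides with the ordinary least-squares fit; a quick numerical check (the helper `neg_log_lik` is just for this sketch):

```python
from scipy.optimize import minimize

# Closed form: under the Gaussian model, the MLE is the least-squares solution.
beta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

def neg_log_lik(beta):
    # -log f(y | beta), dropping additive constants that do not depend on beta
    resid = y - X @ beta
    return 0.5 * resid @ resid / sigma**2

# Direct numerical maximization of the likelihood gives the same answer.
beta_opt = minimize(neg_log_lik, np.zeros(p)).x
print(np.allclose(beta_mle, beta_opt, atol=1e-4))
```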

Bayesian approach

When considering, for instance, a Normal prior on $\beta$, the previous model becomes: $$Y|\beta\sim\mathcal{N}(X\beta,\sigma^{2}I)$$ $$\beta\sim\mathcal{N}(0,\sigma^{2}_{\beta}I_{\beta})$$ In this case, $\hat{\beta}$ can be obtained as the maximum a posteriori (MAP) estimate: $$\hat{\beta}=\underset{\beta}{\mathrm{argmax}}(f_{\beta|Y}(\beta|y))$$ Thanks to Bayes' theorem, $f_{\beta|Y}(\beta|y)=\frac{f_{Y|\beta}(y|\beta)f_{\beta}(\beta)}{f_{Y}(y)}$, and since the denominator does not depend on $\beta$, we can obtain $\hat{\beta}$ by maximizing the following quantity, which corresponds to a machine learning objective function including a regularization term: $$\begin{align} \hat{\beta}&=\underset{\beta}{\mathrm{argmax}}(f_{Y|\beta}(y|\beta)f_{\beta}(\beta))\\ &=\underset{\beta}{\mathrm{argmin}}(-\log(f_{Y|\beta}(y|\beta)) -\log(f_{\beta}(\beta))) \end{align}$$
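With the centered Normal prior, $-\log f_{\beta}(\beta)$ is, up to a constant, $\lVert\beta\rVert^{2}/(2\sigma^{2}_{\beta})$, so the MAP problem is the ridge objective with penalty $\lambda=\sigma^{2}/\sigma^{2}_{\beta}$. Continuing the sketch (`sigma_beta` is an illustrative prior scale):

```python
sigma_beta = 1.0                         # prior standard deviation (illustrative)
lam = sigma**2 / sigma_beta**2           # ridge penalty implied by the prior

# Closed-form MAP / ridge estimate: (X'X + lam I)^{-1} X'y
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def neg_log_posterior(beta):
    # -log f(y | beta) - log f(beta), dropping terms constant in beta
    resid = y - X @ beta
    return 0.5 * resid @ resid / sigma**2 + 0.5 * beta @ beta / sigma_beta**2

# Numerical minimization of the regularized objective gives the same estimate.
beta_map_opt = minimize(neg_log_posterior, np.zeros(p)).x
print(np.allclose(beta_map, beta_map_opt, atol=1e-4))
```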

2. Random independent variables case

When the independent variables are random variables, the previous model changes to: $$Y|X,\beta\sim\mathcal{N}(X\beta,\sigma^{2}I)$$ $$\beta\sim\mathcal{N}(0,\sigma^{2}_{\beta}I_{\beta})$$ As before, $\hat{\beta}$ can be obtained as the maximum a posteriori estimate: $$\hat{\beta}=\underset{\beta}{\mathrm{argmax}}(f_{\beta|Y,X}(\beta|y,x))$$ I think the desired result should be the following, so as to recover the same objective function as before, including the regularization term (unless I am wrong): $$\begin{align} \hat{\beta}&=\underset{\beta}{\mathrm{argmax}}(f_{\beta|Y,X}(\beta|y,x))\\ &=\underset{\beta}{\mathrm{argmax}}(f_{Y|\beta,X}(y|\beta,x)f_{\beta}(\beta))\\ &=\underset{\beta}{\mathrm{argmin}}(-\log(f_{Y|\beta,X}(y|\beta,x))-\log(f_{\beta}(\beta))) \end{align}$$ If so, I cannot figure out why. In particular, I don't see why the first and second lines are equal. Thanks to Bayes' theorem, I know that: $$\begin{align} f_{\beta|Y,X}(\beta|y,x)&=\frac{f_{\beta,Y,X}(\beta,y,x)}{f_{Y,X}(y,x)}\\&=\frac{f_{Y|\beta,X}(y|\beta,x)f_{\beta,X}(\beta,x)}{f_{Y,X}(y,x)}\\&=\frac{f_{Y|\beta,X}(y|\beta,x)f_{X|\beta}(x|\beta)f_{\beta}(\beta)}{f_{Y,X}(y,x)} \end{align}$$ But I am stuck there. I could find the desired result if $X$ and $\beta$ were independent (I don't know whether this is true, or whether it is an assumption of the model ...).
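For what it is worth, here is a numerical illustration of that last remark (it only shows what happens *if* one assumes $X$ independent of $\beta$; it does not settle whether that assumption is warranted): if the density of $X$ does not involve $\beta$, then adding $-\log f_X(x)$ to the objective shifts it by a constant in $\beta$ and leaves the argmin unchanged. Continuing the sketch, where `X` happened to be drawn from a standard normal:

```python
from scipy.stats import norm

def neg_log_joint(beta):
    # -log f(y | beta, x) - log f(beta) - log f(x), assuming X independent of beta
    resid = y - X @ beta
    nll_y = 0.5 * resid @ resid / sigma**2
    nlp_beta = 0.5 * beta @ beta / sigma_beta**2
    nll_x = -norm.logpdf(X).sum()   # constant in beta under the independence assumption
    return nll_y + nlp_beta + nll_x

# Same argmin as the fixed-X MAP estimate from the previous sketch.
beta_joint = minimize(neg_log_joint, np.zeros(p)).x
print(np.allclose(beta_joint, beta_map, atol=1e-4))
```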

Could someone help me understand the maximum a posteriori/regularization when the independent variables / features are random variables? Can the previous formula be simplified? If so, why?

kapytaine
  • Could you change the title to something that better reflects the actual question? – kjetil b halvorsen Feb 12 '22 at 16:57
  • The expression for the posterior of $\beta$ given $X,Y$ you wrote seems to be correct - why do you think it should equal $p(y|\beta,X) p(\beta)$? This has no meaning. Also note that your notation is redundant: there is no need to write the conditional dependencies both as a subscript and as the function argument; it is conventional to simply write e.g. $p(y|\beta,X)$ – J. Delaney Feb 12 '22 at 17:35
  • Regarding the notation, I stick to the courses I followed in my academic background. – kapytaine Feb 13 '22 at 13:17
  • Regarding the result, I assume this is the desired result by comparing it to a machine learning objective function. Indeed, when using regularization, the objective function includes two terms: the traditional cost to minimize $-\log(f_{Y|\beta,X}(y|\beta,x))$ and the regularization term $-\log(f_{\beta}(\beta))$. I will edit the post to clarify the reason for my expectation. – kapytaine Feb 13 '22 at 13:27
