Imagine an underdetermined linear regression problem with N (continuous) labels and N samples, each having P features (with N < P):
$$\hat{\textbf{Y}}_{N \times 1} = \textbf{X}_{N \times P} \textbf{W}_{P\times 1} $$
and we are interested in finding the best weight vector $\textbf{W}$ for this regression problem.
Since the system is underdetermined, an ordinary least-squares model has infinitely many feasible solutions, and it is customary to pick the one that also minimizes the norm $||\textbf{W}||^2$ (the minimum-norm solution). The idea is that this ensures most unimportant features are assigned negligible weights.
Another approach to this problem is ridge regression, which adds a penalty $\lambda ||\textbf{W}||^2$ to the squared-error loss; here too there is a hyperparameter ($\lambda$) by which one can control the norm of the solution.
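For concreteness, here is a minimal sketch of the two estimators I mean, on synthetic data with an arbitrarily chosen $\lambda$: the minimum-norm solution via the Moore-Penrose pseudo-inverse, and the ridge closed form $(\textbf{X}^T\textbf{X} + \lambda \textbf{I})^{-1}\textbf{X}^T\textbf{Y}$:

```python
import numpy as np

# Synthetic underdetermined setup (hypothetical sizes): N samples, P features, N < P.
rng = np.random.default_rng(0)
N, P = 50, 200
X = rng.standard_normal((N, P))
Y = rng.standard_normal(N)

# Minimum-norm least-squares solution: W = pinv(X) @ Y.
# Among all W satisfying X W = Y, this one has the smallest ||W||^2.
W_min_norm = np.linalg.pinv(X) @ Y

# Ridge solution: W = (X^T X + lam * I)^{-1} X^T Y, with an assumed lam = 1.0.
lam = 1.0
W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

print(np.allclose(X @ W_min_norm, Y))                          # True: interpolates the training data
print(np.linalg.norm(W_min_norm), np.linalg.norm(W_ridge))     # compare solution norms
```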
However, when it comes to prediction, ridge regression apparently performs better in practice (the last comment here, as well as my personal experience training such a system, confirm this). I'm interested in knowing why that is the case.