In a course I am following, the professor stated that conducting 10,000 tests would be normal in a high-dimensional setting, referring directly to regression analysis.
What is meant by that phrase?
How is a hypothesis defined in this context? Is each different combination of parameters entertained -- e.g. each set of coefficients in a logistic regression -- considered a separate hypothesis that gets tested?
However, are we truly testing hypotheses in the process of fitting parameters to the data? In practice we simply solve an optimization problem, typically by minimizing some loss function via gradient descent or another method. We don't do any statistical testing of any kind while fitting the parameters -- if I understand the algorithms correctly (please correct me).
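To illustrate what I mean, here is a minimal sketch in Python of fitting a logistic regression by plain gradient descent (the toy data, learning rate, and iteration count are made up for illustration). Nothing in the fitting loop computes a p-value or performs a test; testing, if any, would happen afterwards on the fitted coefficients:

    import numpy as np

    # Toy data: 100 observations, 5 features (entirely synthetic, just for illustration)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    true_beta = np.array([1.0, 0.0, -0.5, 0.0, 0.0])
    y = (rng.random(100) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # Plain gradient descent on the negative log-likelihood: no p-values appear anywhere.
    beta = np.zeros(X.shape[1])
    learning_rate = 0.1
    for _ in range(2000):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y)
        beta -= learning_rate * grad

    print(beta)  # just point estimates of the coefficients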
To provide more context, the lecture is about Multiple Hypothesis Testing within the framework of Statistical Inference, and the exact segment from the lecture is the following:
So, first we're going to talk about controlling the false positive rate. If p-values are correctly calculated, you can actually just use the p-values that you've calculated directly and call all p-values less than some threshold alpha, where alpha is between zero and one, significant. That will actually control the false positive rate at level alpha on average. In other words, the expected rate of false positives is less than alpha. So here's the problem with that. Suppose that you perform, say, 10,000 hypothesis tests. This seems a little bit extreme -- a large number of tests, maybe, for people that are doing just one or two regressions -- but in many high-dimensional settings or signal processing settings this is actually a reasonably small number of hypothesis tests that might be performed. And if you call all p-values less than 0.05 significant, say we set alpha equal to 0.05, then the expected number of false positives is just the total number of tests that you've performed times the false positive rate that you're controlling the error rate at. And so you get 500 false positives. So if you perform this many hypothesis tests and you get 500 significant results, it's pretty likely that they're mostly going to be made up of false positive results. So a question that immediately comes to mind is: how do we control a different error rate so that we avoid so many false positives?
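To check my understanding of the arithmetic in that segment: if all m = 10,000 nulls are true and the p-values are correctly calculated, each p-value is Uniform(0, 1) under its null, so the expected number falling below alpha = 0.05 is m * alpha = 500. A quick Python sketch of that situation (the all-nulls-true assumption is mine, matching the worst case the lecture seems to describe):

    import numpy as np

    rng = np.random.default_rng(1)
    m, alpha = 10_000, 0.05

    # Under a true null, a correctly calculated p-value is Uniform(0, 1).
    p_values = rng.random(m)

    false_positives = np.sum(p_values < alpha)
    print(false_positives)  # roughly m * alpha = 500 on average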