
I have a dataset made up of 62 features and one label, all of which are expressed as percentiles. The signal-to-noise ratio is low: for a simple balanced binary classification, achieving 55% accuracy would be great.

I've made a few attempts at training a model using XGBoost and a random forest. Neither has achieved an accuracy that is statistically different from random guessing.
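
To make "statistically different from random guessing" concrete, here is a minimal sketch of the kind of check this refers to: a two-sided z-test of held-out accuracy against the 50% chance level of a balanced binary problem, using the normal approximation to the binomial. The function name `accuracy_vs_chance` and the counts are made up for illustration; they are not results from my data.

```python
import math


def accuracy_vs_chance(n_correct, n_total, chance=0.5):
    """Two-sided z-test of observed accuracy against a chance level.

    Uses the normal approximation to the binomial, which is reasonable
    when n_total is large (thousands of held-out rows).
    """
    acc = n_correct / n_total
    # Standard error of the accuracy under the null hypothesis (pure chance).
    se = math.sqrt(chance * (1 - chance) / n_total)
    z = (acc - chance) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return acc, z, p_value


# Hypothetical counts, for illustration only.
acc, z, p = accuracy_vs_chance(n_correct=5_100, n_total=10_000)
print(f"accuracy={acc:.3f}, z={z:.2f}, p={p:.4f}")
```

With these illustrative counts, 51% accuracy on 10,000 held-out rows is already about two standard errors above 50%, so on a large test set even small edges over chance are detectable. The test assumes the held-out rows are independent, which may not hold if the same entity appears in both training and test data.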

I admit I haven't done much research on which machine learning techniques suit low signal-to-noise data; I just used XGBoost because it wins all the competitions. Does anyone have advice on where to start? Are there machine learning algorithms designed specifically for low-signal data?

xxanissrxx
  • First of all, [accuracy is a poor metric](https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models), and if your labels are noisy as well, then it might be too "rough" to catch any meaningful differences between outcomes. – Tim Jun 19 '20 at 20:54
  • It's extremely difficult to select an ML algorithm a priori without actually looking at your data and forming hypotheses, based on the patterns you observe, about what might and might not work. If the data is not well separable, achieving high accuracy is going to be difficult regardless. – tchainzzz Jun 19 '20 at 20:56
  • Could you also clarify what you mean by low signal-to-noise ratio here? What exactly is the nature of the noise? Why is it noisy? What is your data? – Tim Jun 19 '20 at 20:56
  • The dataset is 62 socioeconomic status indicators for students at the beginning of a school year. There are ~100,000 students and the data spans 10 years, so roughly 1 million observations. The label is the student's grade average for the following year. My usage of "low signal to noise" is incorrect; I think a better way of saying it is that the predictive power of this model is not expected to be high. My goal is to identify students who will struggle in the next year. If I could predict whether a student will be above or below average with 55% accuracy, that would be a huge win. – xxanissrxx Jun 19 '20 at 20:58

0 Answers