
I have a cross-sectional data set with four predictors and one outcome. I ran bivariate analyses to check whether the relationship between the dependent variable and each independent variable is linear or not. Every functional form I tried (linear, inverse, quadratic, compound, growth, exponential, logistic) indicates that the relationship is very weak, and in some cases non-existent. The $R^2$ values I obtained for each independent variable are all below 5%.

I have already transformed my data with the natural logarithm, and I don't think other transformations would change the results. In addition, the output would be very hard to interpret if I used other transformations.

So, could machine learning techniques help in this case, and if so, which technique should I use? I have no prior experience with machine learning models, but it seems to be the only option I have left.

Ameer
  • Hard to tell, you should try and see what comes of it. Random forests would be a first choice, very simple to use and work well. If you cannot get better results here either then you are proverbially screwed. – user2974951 Jan 22 '19 at 10:32

2 Answers


If there is very little structure in your data, then no algorithm will magically find it - neither ML nor classical statistical ones. To detect weak signals, you need lots of data. Lots. Much more than 77 data points.
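To put a rough number on "much more than 77", here is a minimal sketch of the smallest correlation that a sample of a given size can reliably detect. It rests on my own assumptions, not on anything in the question: a two-sided test at the 5% level, 80% power, and the Fisher z approximation for a single predictor. The function name `min_detectable_r` is made up for illustration.

```python
from math import sqrt, tanh
from statistics import NormalDist


def min_detectable_r(n, alpha=0.05, power=0.80):
    """Approximate smallest |r| detectable with the given power.

    Uses the Fisher z approximation: atanh(r) is roughly normal
    with standard error 1 / sqrt(n - 3).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_power = z.inv_cdf(power)          # quantile for the target power
    return tanh((z_alpha + z_power) / sqrt(n - 3))


r_min = min_detectable_r(77)
print(f"n = 77: |r| >= {r_min:.2f}, R^2 >= {r_min ** 2:.2f}")
```

Under these assumptions, 77 observations only give you a good chance of detecting a correlation of roughly 0.31, which corresponds to an $R^2$ of about 10% for a single predictor. Anything in the "below 5%" range can easily go undetected at this sample size.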

And even if you have lots of data, enough to detect a weak signal, it may be that your signal is so weak that it's useless. For instance, you may have a die that is slightly more likely to come up a 6, maybe in 17% of throws rather than 16.666...%, with the other results correspondingly slightly more unlikely. If you have seen lots of throws, you can detect this. But this small signal will still be unlikely to make a practical difference for your winning percentage at craps in Las Vegas.
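For a sense of scale, the number of throws needed to detect such a die can be approximated with the usual sample size formula for a one-sample proportion test. This is only a sketch under assumptions I am adding (two-sided 5% level, 80% power, normal approximation), and `throws_needed` is a hypothetical helper name.

```python
from math import ceil, sqrt
from statistics import NormalDist


def throws_needed(p0, p1, alpha=0.05, power=0.80):
    """Approximate sample size to distinguish proportion p1 from p0."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_power = z.inv_cdf(power)          # quantile for the target power
    numerator = (z_alpha * sqrt(p0 * (1 - p0))
                 + z_power * sqrt(p1 * (1 - p1)))
    return ceil((numerator / (p1 - p0)) ** 2)


# Fair die: a 6 comes up 1/6 of the time; loaded die: 17% of the time.
n_throws = throws_needed(1 / 6, 0.17)
print(n_throws)
```

With these settings the answer comes out on the order of 100,000 throws, for an edge of a third of a percentage point.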

Your best bet is likely to try to collect more information and find some actual strong drivers. You will need domain knowledge for this. This is related: "How to know that your machine learning problem is hopeless?"

Stephan Kolassa

To me, the most probable scenario is that you simply need more predictors. There is variability in your outcome that your current predictors do not seem to explain or capture.

That being said, if your goal is only to assess the correlations, then this is a different question. For instance, $X_1$ can be significantly correlated with $Y$ while the $R^2$ is very low, and you shouldn't worry too much about that. It would mean that yes, they are correlated, but only a tiny fraction of the variation in $Y$ can be explained by the variation in $X_1$.
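A small simulation can illustrate the point. Everything in it is made up for illustration (the sample size of 5000, the slope of 0.1, the unit-variance noise, and the seed); it is not meant to resemble your data, and the p-value uses a normal approximation to the t distribution, which is reasonable at this sample size.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the sketch is reproducible

n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
# Weak true effect: most of the variation in y is pure noise.
y = [0.1 * xi + random.gauss(0, 1) for xi in x]

# Pearson correlation computed by hand.
mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / sqrt(sxx * syy)
r_squared = r ** 2

# Test statistic for H0: no correlation; normal approximation for the p-value.
t_stat = r * sqrt((n - 2) / (1 - r_squared))
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"r = {r:.3f}, R^2 = {r_squared:.3f}, p = {p_value:.2g}")
```

Here the correlation is clearly significant, yet $R^2$ stays around 1%, because the true slope is small relative to the noise. Note that this only works because the simulated sample is large.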

Vasilis Vasileiou
  • I have also fitted bivariate and multiple regression models. Besides the R squares being very small, the coefficients were not statistically significant. – Ameer Jan 22 '19 at 10:20
  • How big is your dataset? – Vasilis Vasileiou Jan 22 '19 at 10:24
  • My dataset contains 77 observations – Ameer Jan 22 '19 at 10:34
  • Hmm, that's not that much. It is possible that some of the predictors are actually significant and just didn't come up as significant because of the small sample size. Of course, that may not be the case either. If you could get some more observations, that would make things a bit more reliable. – Vasilis Vasileiou Jan 22 '19 at 10:45
  • My dependent variable has two tails: negative values, which indicate overpricing, and positive values, which indicate underpricing. I excluded the negative side because it does not serve my research objective, which only looks into the factors that contribute to underpricing. In addition, the two sides cannot be captured by one model because they have different determinants. – Ameer Jan 22 '19 at 10:56