Machine learning for causal inference

Question

I have a multiclass classification problem where the target variable is actually different categories of causes, and the dataset is observational. I know of causal inference, and I would like to learn more about it, but if I do I would need to justify it. So: is it justified to believe that a causal approach would yield more accurate classification results than classical machine learning (ML)?

EDIT FOR CLARIFICATION

Causal methods should not be used simply because the target variable is called "cause". However, causes are a special kind of target variable, because different causes might not be fully independent due to confounding or mediation. Do such structural considerations affect classification accuracy enough that methods that model them would be more accurate?

Causal inference is aimed at understanding causal relationships between variables. It sounds like you are only interested in classification accuracy. Those are different objectives. — J. Delaney, Feb 20 '22 at 10:56
ML can stand for *maximum likelihood* or for *machine learning*. Both are popular on Cross Validated. I have edited to disambiguate. — Richard Hardy, Feb 20 '22 at 17:55
I agree with @J.Delaney. You are not looking to estimate causal effects; rather, you need to classify data instances into different categories. The fact that those categories are "causes" (I do not know what you mean here) does not imply a causal analysis. — Plastic Man, Feb 20 '22 at 20:25
If the causal relationships were all independent, I would agree with you. But what makes me wonder is that confounding or mediation might make things more complicated. I'm not sure whether ML methods would be able to deal with those. — herman, Feb 21 '22 at 07:25

Tim · Answer 1 · 2022-02-21T10:43:35.043

Start with the "The Two Cultures: statistics vs. machine learning?" thread. Machine learning is about finding patterns or correlations in data. Causal inference, like statistics, is about inference. As others already noted in the comments, those are different problems. When your aim is to study whether smoking causes cancer, your aim is not classification accuracy, but confirming or rejecting the existence of the causal relationship. On the other hand, if you want to accurately predict that someone will get cancer, you may throw in many different variables that directly or indirectly relate to cancer to maximize predictive performance. Sure, causal reasoning could and should inspire your decisions on which features to consider, but you wouldn't make the prediction based only on the fact that somebody smokes cigarettes just because there is a causal relationship. Keep in mind that there could be multiple causal relationships (many things cause cancer), and we may not be able to measure all of them. A causal relationship also doesn't mean that something is certain (not everyone who smokes gets cancer), the relationships can be of varying strength, and the data will be noisy. So having features that are causally related to the target does not give you any guarantees about classification performance.

To answer your question with a metaphor: using a causal model for classification is like being asked to transport goods from A to B and approaching the task by designing your own lorry. Sure, the custom lorry may be faster and more efficient for your problem, but is it worth it? The same goes for causal inference: using it would mean spending at least twice as much time on the problem, since you would be solving two problems, (a) causal inference and (b) classification. It is also non-trivial how you would inject the causal knowledge into the classification model, so this might be a third research problem to solve. If you are working on a high-stakes problem (e.g. medicine) and have a budget that allows you to do the research, sure, it might be worth it. But in most cases, if your aim is only classification, it would be enough to have a machine learning model that finds by itself an approximate representation of the data that is good enough for classification.

Thanks. Indeed, Pearl likes the smoking example. What I'm wondering is whether structural considerations such as mediation or confounding would affect classifier performance. It is not the fact that we call them "causes" that affects the choice of the method, but the nature of the classes: they might not be fully independent. — herman, Feb 21 '22 at 07:32
@herman sure it may, but you have no guarantee that it would do any better than if you just gathered more data and stacked more layers. — Tim, Feb 21 '22 at 08:33
Stack more layers? What does that have to do with mediation and confounding? — herman, Feb 21 '22 at 08:59
@herman it has nothing to do with it. Instead of answering in a comment I edited my answer. — Tim, Feb 21 '22 at 10:44

score 0 · Answer 2 · answered Feb 21 '22 at 08:26

As another answer and some comments pointed out, you should keep in mind that the estimation of causal effects and prediction are not the same thing. The amount of chocolate consumed in a country might help you predict which country the next Nobel laureate in medicine will come from, but it will not tell you much about the causal "story" that produces Nobel laureates in some countries (that story probably has something to do with economic growth).
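The chocolate example can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variable names (`wealth`, `chocolate`, `nobel`), the linear relationships, and the noise levels are all hypothetical and not estimated from any real data. A common cause drives both variables, so chocolate predicts Nobel output well in observational data, yet the association disappears once chocolate consumption is set independently of wealth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder: national wealth drives both variables.
wealth = rng.normal(size=n)
chocolate = wealth + 0.5 * rng.normal(size=n)
nobel = wealth + 0.5 * rng.normal(size=n)  # chocolate does NOT enter here

# Observational data: chocolate is a useful predictor of Nobel output.
corr_obs = np.corrcoef(chocolate, nobel)[0, 1]

# Simulated intervention: chocolate is assigned independently of wealth,
# while the mechanism generating Nobel output is left unchanged.
chocolate_do = rng.normal(size=n)
nobel_do = wealth + 0.5 * rng.normal(size=n)
corr_do = np.corrcoef(chocolate_do, nobel_do)[0, 1]

print(f"observational correlation:  {corr_obs:.2f}")
print(f"interventional correlation: {corr_do:.2f}")
```

For pure prediction on data drawn from the same observational regime, the first correlation is all that matters; the second only becomes relevant if the regime changes or if you want to act on chocolate consumption.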

But the knowledge of causal mechanisms may help to improve prediction:

Let's say you want to generalize your results from a trained model to new environments $e$. Causal relationships of the kind $X^{e} \rightarrow Y^{e}$, with $X$ and $Y$ being some random variables, should go along with so-called invariant conditional distributions $P(Y^{e} | X^{e})$, which are the same for all environments $e \in \mathcal{E}$ from some environment space $\mathcal{E}$. So having $X^{e}$ in your model when predicting $Y^{e}$ should improve the robustness of your predictions when dealing with different environments. An example from this book Ref1: let $X^{e}$ be the height of a mountain and $Y^{e}$ the temperature measured at a certain height. You might deduce from some physical theory that there is something like a causal relation between height and temperature ("the higher, the colder"). This relationship should hold across different environments, such as the Swiss and the Austrian Alps. Hence a prediction model built on data from the Swiss Alps should also hold for the Austrian Alps (robust prediction). See this great article here: Ref2. Or this nice talk: Ref3.
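A minimal sketch of this invariance idea, with everything in it assumed for illustration: the two synthetic environments, the linear height-temperature mechanism and its coefficients, and the extra non-causal feature `spurious` are all hypothetical and not taken from Ref1. The temperature is generated from height by the same mechanism in both environments, while the spurious feature is generated from the temperature with a sign that flips between environments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env(n, spurious_sign):
    """One synthetic environment; P(temp | height) is the same in all of them."""
    height = rng.uniform(0.5, 4.0, size=n)                      # km, illustrative
    temp = 15.0 - 6.5 * height + rng.normal(scale=1.0, size=n)  # invariant mechanism
    # Non-causal feature: generated FROM temp, with an environment-specific sign.
    spurious = spurious_sign * temp + rng.normal(scale=1.0, size=n)
    return height, spurious, temp

def fit(X, y):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    A = np.column_stack([np.ones(len(y)), X])
    return float(np.mean((A @ coef - y) ** 2))

h_tr, s_tr, t_tr = make_env(2000, +1.0)  # training environment ("Swiss Alps")
h_te, s_te, t_te = make_env(2000, -1.0)  # new environment ("Austrian Alps")

X_causal_tr, X_causal_te = h_tr[:, None], h_te[:, None]
X_both_tr = np.column_stack([h_tr, s_tr])
X_both_te = np.column_stack([h_te, s_te])

causal_model = fit(X_causal_tr, t_tr)  # uses the causal feature only
both_model = fit(X_both_tr, t_tr)      # also uses the spurious feature

mse_causal_train = mse(causal_model, X_causal_tr, t_tr)
mse_both_train = mse(both_model, X_both_tr, t_tr)
mse_causal_test = mse(causal_model, X_causal_te, t_te)
mse_both_test = mse(both_model, X_both_te, t_te)

print("train MSE  causal-only:", mse_causal_train, " with spurious:", mse_both_train)
print("test MSE   causal-only:", mse_causal_test, " with spurious:", mse_both_test)
```

In this toy setup the model with the extra feature fits the training environment better, but only the height-only model transfers to the new environment. Real data will rarely separate this cleanly, and knowing which features are causal is itself the hard part.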

In a semi-supervised learning setting the exploitation of causal mechanisms may also be fruitful (see anticausal learning in Ref4).

Note also:

You can also improve the estimation of causal effects using machine learning (Ref5).
