Say one trains a machine learning model to classify emails as spam or non-spam. The adversary (or the collection of all adversaries) then represents the data-generating process (DGP) that produces the distribution of spam. When the adversary changes its strategy, for instance by starting to use the topic of coronavirus in spam emails in 2020, does the DGP change, or are the new spam variants just observations from the distribution of spam that simply had not occurred before 2020?
As I see it, these two views have different implications:

- If the DGP changes, then one accepts that the classifier will perform worse, because it has learned from a pre-2020 distribution that differs from the current distribution of interest, and one has to retrain the model on new spam examples (and drop very old or irrelevant spam mails).
- If, on the other hand, new spam is regarded as new observations from the same distribution, then one expects the classifier to detect this spam thanks to the generalization of the ML model.
How does one know, or determine in general, whether the DGP has changed or whether one is merely seeing observations from the same distribution that have never occurred until now (with the DGP remaining unchanged as they occur)?
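For concreteness, the kind of check I imagine one might run is a two-sample test comparing some observable between a pre-2020 reference window and a recent window. The sketch below uses a Kolmogorov-Smirnov test on hypothetical classifier scores; the simulated numbers are placeholders of my own, not real spam data.

```python
# A minimal sketch: compare the distribution of a single observable
# (here, a classifier score per email) between a pre-2020 reference
# window and a recent window with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical, simulated scores on old (reference) and new (recent) spam.
scores_pre_2020 = rng.beta(a=2.0, b=5.0, size=5_000)
scores_2020 = rng.beta(a=2.5, b=4.0, size=1_000)  # slightly shifted for illustration

res = ks_2samp(scores_pre_2020, scores_2020)
print(f"KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.3g}")
```

But even if such a test rejects, it only tells me that this particular marginal looks different now; it does not by itself distinguish "the DGP has changed" from "rare regions of the same DGP are only now being sampled", which is exactly what I am unsure how to decide.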