
The problem: infer the nationality of a person from a limited number of features (name, email, ...). I do not have enough "ground truth" to use ML techniques, so I'd like to try what a computer scientist would call a "heuristic model", i.e., a model based on empirical evidence and domain experience (if your email is something like foo@bar.fr and your name is Philippe, you are probably French).

My problem is: how do I assess the validity of this model? The question seems (and probably is) naive, but I do not have a strong background in statistics, and I'm used to the standard approaches in ML (n-fold cross-validation, etc.). I tried reading books and online resources, but I'm more confused than I was before.

Intuitively, I could take a random subset of the samples, manually assess the ground truth for this subset, and compare it with the outcomes of the model, as in the sketch below. If the subset is not too small and I have a high overlap in the results (modulo class imbalance and so on), that should give me confidence that the model is OK. Does it make sense? Do you have a reference showing how this kind of analysis should be performed? Thanks.
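For concreteness, here is a minimal sketch of what I have in mind; the heuristic `guess_nationality` and the hand-labeled records are made up for illustration:

```python
# Sketch: compare a heuristic classifier against a manually labeled subset.
# The heuristic and the labeled records below are invented for illustration.

def guess_nationality(record):
    """Toy heuristic: a .fr email or a typically French first name -> French."""
    if record["email"].endswith(".fr") or record["name"] in {"Philippe", "Marie"}:
        return "FR"
    return "OTHER"

# Random subset of the data, with the ground truth assessed by hand.
labeled_subset = [
    ({"name": "Philippe", "email": "foo@bar.fr"}, "FR"),
    ({"name": "John", "email": "john@example.com"}, "OTHER"),
    ({"name": "Marie", "email": "marie@mail.com"}, "FR"),
]

# Compare the model's output with the manual ground truth.
hits = sum(guess_nationality(x) == y for x, y in labeled_subset)
print(f"Agreement: {hits}/{len(labeled_subset)}")
```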

1 Answer


Your intuition is correct. In order to assess whether your model does what it is supposed to do (namely, correctly classify), you need to check whether it in fact does do what it is supposed to do (i.e., classify correctly). That is, you need some "ground truth" of classified samples, and then you can evaluate whether your model will classify these samples correctly.

So yes, go ahead and label some samples manually. The more, the better. Unbalanced classes are not a problem as long as you use appropriate evaluation measures, i.e., probabilistic predictions and proper scoring rules, and not accuracy. Also, use a holdout sample for evaluation, i.e., one that was not used in training your model (because otherwise you will overfit).
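For example, if your heuristic outputs a probability of class membership rather than a hard label, you could score it on your hand-labeled holdout with the Brier score, one common proper scoring rule. A minimal sketch (all the numbers below are invented):

```python
# Sketch: evaluating probabilistic predictions on a labeled holdout with the
# Brier score, a proper scoring rule (lower is better). Numbers are invented.

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Model's predicted probability that each holdout sample is French,
# alongside the manually assessed ground truth (1 = French, 0 = not).
predicted_probs = [0.9, 0.2, 0.7, 0.1]
ground_truth = [1, 0, 1, 0]

print(f"Model Brier score: {brier_score(predicted_probs, ground_truth):.3f}")

# Compare against an uninformative baseline that always predicts the base rate.
base_rate = sum(ground_truth) / len(ground_truth)
baseline = [base_rate] * len(ground_truth)
print(f"Baseline Brier score: {brier_score(baseline, ground_truth):.3f}")
```

Comparing against a base-rate baseline like this also sidesteps the class-imbalance worry: an imbalanced holdout just makes the baseline harder or easier to beat, rather than distorting the score itself.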

Stephan Kolassa
  • Thanks Stephan, good to know I was not that far off. A question, though: to better understand how to conduct this validation in a "scientific way", which resources do you suggest? From what you write, "probabilistic predictions" and "scoring rules" should be good starting keywords. Anything more specific? – user3687501 Feb 26 '21 at 09:28
  • Unfortunately, I don't know of a good introduction. For probabilistic predictions, you just need something that outputs class membership *probabilities*, not hard 0-1 classifications. The two links in my answer should be useful reading. Once you have these probabilistic predictions and the ground truth, you can assess how good the predictions were, and that is where proper scoring rules come in. [Our tag wiki](https://stats.stackexchange.com/tags/scoring-rules/info) has more information and pointers to literature. – Stephan Kolassa Feb 26 '21 at 10:59