We usually make the i.i.d. assumption in machine learning problems, but in active learning, the labeled examples acquired by querying the oracle are clearly not i.i.d. Would it be better to reweight the labeled examples so that they mimic the target distribution, and then retrain a model on the reweighted data? Has anyone done that before, and if not, why not?

Qcer

1 Answer

Indeed! In passive learning (PL), i.e. ordinary supervised ML, the training set $S_\text{Tr}$ and the test set $S_\text{Test}$ are assumed to be sampled i.i.d. from the true data distribution $P$, written $S_\text{Tr}, S_\text{Test} \sim P$. In active learning (AL), the training set $\hat{S}_\text{Tr}$ is chosen by a query algorithm. In general, you can treat $\hat{S}_\text{Tr}$ as sampled i.i.d. from a distribution $Q$ that does not necessarily equal $P$ (it equals $P$ when the query strategy is random sampling). Modelling AL as i.i.d. sampling from a distribution $Q$ is relatively new in the AL literature, and I think it was first proposed in this paper.

Would it be better to reweight the labeled examples so that they mimic the target distribution, and then retrain a model on the reweighted data? Has anyone done that before, and if not, why not?

That is not exactly how it is done, but your intuition is almost correct! The vast majority of query algorithms query the samples that are hardest to classify. The literature refers to this as querying informative samples. However, it may leave $Q$ very different from $P$. In that case, you shouldn't expect your ML model to perform well, especially since you measure performance on a test set sampled i.i.d. from $P$. To make the labeled data "mimic" $P$, though, you don't let the query algorithm choose points to label and then reweight them during training (as your question suggests). You can do much better: try to keep $Q$ similar to $P$ in the first place, while still querying informative samples. Please refer to the above-mentioned paper for more details on how to maintain such a balance.
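For reference, the reweighting idea from the question rests on the importance-weighting identity $\mathbb{E}_{x \sim P}[f(x)] = \mathbb{E}_{x \sim Q}\left[\frac{P(x)}{Q(x)} f(x)\right]$, which holds whenever $Q(x) > 0$ wherever $P(x) > 0$. Below is a minimal sketch of it, not the method from the paper. It assumes both densities are known, which is rarely true in real AL. The choices $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(1, 1)$, and the helper names `gaussian_log_pdf`, `importance_weighted_mean` and `demo`, are illustrative assumptions only.

```python
import numpy as np


def gaussian_log_pdf(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))


def importance_weighted_mean(values, log_p, log_q):
    """Self-normalized importance-weighted mean.

    `values` are function values at points sampled from Q; the result
    estimates their expectation under P.
    """
    log_w = log_p - log_q
    # Subtracting the max avoids overflow; the constant cancels after normalization.
    w = np.exp(log_w - log_w.max())
    return float(np.sum(w * values) / np.sum(w))


def demo(n=20000, seed=0):
    """Estimate E_P[x] (true value 0) from samples drawn under Q."""
    rng = np.random.default_rng(seed)
    # Assumed toy setup: target P = N(0, 1), query distribution Q = N(1, 1).
    x = rng.normal(loc=1.0, scale=1.0, size=n)
    log_p = gaussian_log_pdf(x, 0.0, 1.0)
    log_q = gaussian_log_pdf(x, 1.0, 1.0)
    naive = float(x.mean())  # ignores the P/Q mismatch, lands near 1
    weighted = importance_weighted_mean(x, log_p, log_q)  # lands near 0
    return naive, weighted


if __name__ == "__main__":
    naive, weighted = demo()
    print(f"unweighted mean: {naive:.3f}, importance-weighted mean: {weighted:.3f}")
```

In this toy case the weights are well behaved. When $Q$ puts very little mass where $P$ has a lot, a few weights dominate and the estimate becomes noisy, which is one reason to keep $Q$ close to $P$ in the first place.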

Saleh