Biased classification because of data from different sites?

Question

Working in neuroscience, we often classify data from different sites. I usually balance my data across sites: if, for instance, I have to classify illness vs. normal health, each site at which the data is recorded contributes an equal number of normal and ill data samples (subjects) to the final data set of the two-class classification problem.

Nevertheless, can the differences from having acquired the data at different sites still bias the classification performance despite the sets being balanced? If so, why?

There appears to be insufficient information: what are 'sites', and what is being classified (if 'an equal number of normal vs. ill' samples are collected from each site)? The latter seems to misconstrue 'balance': e.g., if a certain 'site' has a greater illness/normal ratio, this gets misrepresented according to the explanation above (and that would be the largest, and ignored, site effect, no?). Perhaps clarify it further. – katya, Nov 10 '15 at 20:30
A site = location/data collection center. In my case, fMRI images are collected at different hospitals, which use different scanning parameters, something that can lead to differences in the data samples solely due to these factors. We classify for illness, so we have patients with a neural disorder and healthy controls. Usually I have an equal number of healthy controls and subjects with a neural disorder, and additionally, each site contributes equally to both classes. Nevertheless, I wonder whether site can have a biasing effect on the classification. – Pugl, Nov 10 '15 at 21:34

Trisoloriansunscreen · Accepted Answer · 2015-11-10T22:26:13.343

3

Given that each site contributes an identical number of normal and ill samples, and that each cross-validation test fold is also balanced in this fashion, I don't see how having multiple sites can cause an upward (optimistic) performance bias.

Somewhat trivially, the classification performance is expected to be worse than in the single-site case, since using multiple sites adds class-irrelevant variability to your data. The extent of this negative effect depends on the nature and magnitude of the site-related variability and on its similarity to the diagnostic signal.

One more consideration: for permutation tests against chance, I would permute labels only within each site, not across sites. If you use unconstrained label permutation, you will include unbalanced cases that do not really belong to your null distribution.
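A minimal sketch of the within-site permutation described above, using only the Python standard library. The function names (`permute_within_site`, `class_counts_per_site`) and the toy data are hypothetical and not part of the original answer; in practice each permuted label vector would be fed through the same cross-validation pipeline to build the null distribution.

```python
import random
from collections import defaultdict


def permute_within_site(labels, sites, rng):
    """Return a copy of `labels` shuffled separately inside each site."""
    # Collect the sample indices that belong to each site.
    idx_by_site = defaultdict(list)
    for i, site in enumerate(sites):
        idx_by_site[site].append(i)

    permuted = list(labels)
    for idx in idx_by_site.values():
        # Shuffle only this site's labels and write them back in place.
        shuffled = [labels[i] for i in idx]
        rng.shuffle(shuffled)
        for i, lab in zip(idx, shuffled):
            permuted[i] = lab
    return permuted


def class_counts_per_site(labels, sites):
    """Count how many samples of each class every site contributes."""
    counts = defaultdict(lambda: defaultdict(int))
    for lab, site in zip(labels, sites):
        counts[site][lab] += 1
    return {site: dict(c) for site, c in counts.items()}


# Toy data (assumed): two sites, each balanced between class 0 and class 1.
sites = ["A"] * 6 + ["B"] * 4
labels = [0, 0, 0, 1, 1, 1] + [0, 0, 1, 1]

rng = random.Random(0)
perm = permute_within_site(labels, sites, rng)
```

Because labels never move between sites, every permuted data set keeps the same per-site class balance as the real one, which is the property an unconstrained shuffle would lose.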

edited Nov 10 '15 at 22:26

answered Nov 10 '15 at 22:21


But suppose we have data from site 1 and site 2, and the two groups (healthy vs. neural disorder) are very similar in site 1, while the scanning process in site 2 makes the two groups very distinguishable. That would imply that the disorder group from site 2 is also very well distinguishable from the healthy group of site 1. Wouldn't this be a site effect that leads to an optimistic bias, or not? – Pugl Nov 10 '15 at 23:05

For simplicity, let's assume the feature space is univariate, all of the samples are exactly on 0 except the site 2 positive samples, which are all on 1, and your classifier is optimal. When you test a site 1 sample, it can't have more than chance accuracy. When you test a site 2 sample, it will have perfect accuracy. The combined accuracy is an average weighted by the sample size of each site. How can the bias creep in? – Trisoloriansunscreen Nov 11 '15 at 08:08
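The toy case in the comment above can be checked numerically. This is only a sketch under stated assumptions: the site sizes (`n1`, `n2`) are made up, and the "optimal" classifier is taken to be a fixed threshold rule that predicts the positive class only when the feature equals 1 (at 0 the negatives are in the majority, so predicting negative is the best available guess).

```python
def predict(x):
    """Assumed optimal rule for the toy data: positive only when x is at 1."""
    return 1 if x >= 0.5 else 0


def accuracy(samples):
    """Fraction of (feature, label) pairs that the rule classifies correctly."""
    return sum(predict(x) == y for x, y in samples) / len(samples)


# Hypothetical site sizes; each site is balanced between the two classes.
n1, n2 = 40, 20

# Site 1: both classes sit exactly on 0, so they cannot be told apart.
site1 = [(0.0, 0)] * (n1 // 2) + [(0.0, 1)] * (n1 // 2)
# Site 2: negatives on 0, positives on 1, so they are perfectly separable.
site2 = [(0.0, 0)] * (n2 // 2) + [(1.0, 1)] * (n2 // 2)

acc1 = accuracy(site1)            # chance level for site 1
acc2 = accuracy(site2)            # perfect for site 2
pooled = accuracy(site1 + site2)  # accuracy over all samples together

# Sample-size-weighted average of the per-site accuracies.
weighted = (n1 * acc1 + n2 * acc2) / (n1 + n2)
```

Here `pooled` and `weighted` coincide, so the well-separated site raises the overall figure only in proportion to the samples it contributes.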

score 1 · Answer 2 · answered Nov 16 '15 at 19:19


You might be facing a problem of domain adaptation.

It is possible that the sites contain samples that come from different source distributions. In this case a classifier learned on one site (or on a mixture of all sites) might not perform well on some of the sites.

In your case it might be due to a different distribution of gender, age, etc. of the patients considered.

While you can try and cope with the problem by creating a classifier per site, typically the number of samples per site is quite small and this way you won't be able to utilize all your samples.

There are several methods for coping with domain adaptation, but I would need more details to know which one fits your problem best.

An easy method (as the title suggests) for the problem is described in the paper "Frustratingly Easy Domain Adaptation".
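As I understand that paper (Daumé III, 2007), the core idea is feature augmentation: each feature vector is copied into a shared block plus a block reserved for its own domain, with zeros in the blocks of all other domains. The sketch below treats each site as a domain; the helper name `augment` and the toy vectors are hypothetical, and the resulting vectors would be passed to whatever classifier you already use.

```python
def augment(x, site_index, n_sites):
    """Map x to [shared copy, site-0 block, ..., site-(n_sites-1) block].

    Only the block belonging to `site_index` holds a copy of x;
    the blocks of all other sites are filled with zeros.
    """
    d = len(x)
    out = list(x)  # shared block, active for every site
    for k in range(n_sites):
        out.extend(x if k == site_index else [0.0] * d)
    return out


# Toy example (assumed): two features, two sites.
x_site0 = augment([1.0, 2.0], site_index=0, n_sites=2)
x_site1 = augment([1.0, 2.0], site_index=1, n_sites=2)
```

A linear classifier trained on the augmented vectors can place weight on the shared block for effects common to all sites and on the site-specific blocks for effects that differ between scanners, while still using every sample.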


DaL


Many thanks. I will read the paper as soon as possible. What additional details do you need to know? The data in its entirety is balanced for age, IQ, and only contains males. – Pugl Nov 17 '15 at 06:12
First, it would be useful to know whether you need to deal with domain adaptation at all. Does the performance of your classifier vary significantly across the different sites? If it doesn't, you are lucky: you can use all the samples you have as one dataset. If it does, check whether a specific site behaves differently. How many sites do you have? How many samples per site? How many features? Is there a domain-based reason to suspect that the features should behave differently across sites? – DaL Nov 17 '15 at 06:31
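One simple way to run the check suggested in this comment is to break held-out accuracy down by site. The sketch below is illustrative only: `accuracy_by_site` is a hypothetical helper, and the labels, predictions, and site tags are made-up toy values standing in for the pooled output of your cross-validation test folds.

```python
from collections import defaultdict


def accuracy_by_site(y_true, y_pred, sites):
    """Return {site: accuracy} computed from held-out predictions."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, s in zip(y_true, y_pred, sites):
        totals[s] += 1
        hits[s] += int(t == p)
    return {s: hits[s] / totals[s] for s in totals}


# Toy held-out results (assumed): site A is classified well, site B is not.
y_true = [0, 1, 0, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 1, 0, 0]
sites = ["A"] * 4 + ["B"] * 4

per_site = accuracy_by_site(y_true, y_pred, sites)
```

With only a few subjects per site these per-site estimates are noisy, so a large gap between sites is a prompt for a closer look rather than proof of a site effect.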
