A method to separate classes while taking variable dependence into account

Question

I posted a question related to this problem over a year ago, and we still have not been able to figure it out.

We have two groups, A and B, and we want to train a model that separates them. Both contain numerous observations of "text", for example:

group A:

AAABBBC*CAAAAAAAC
CCCBBBC*CAAAAAAAB
CBBBBBC*CAAAAAAAB

group B:

AAACCCC*CAAAAAAAA
CCCCCCC*CAAAAAAAA
CBBCCCC*CAAAAAAAA

Notably, our original datasets are much larger, with around 4,000 observations for group A (with really specific patterns) and around 20,000 for group B. What we want is a model that picks up on things like:

If there is a C at position 1, we see a B at the end in group A (2/3), but not in group B (0/3).
The motif AAABBB occurs only in group A.
If we see AAABBB, we also see a C at the end in group A (1/3), but not in group B (0/3).
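
The counts in the list above can be reproduced with a few lines of Python. This is only a minimal sketch on the six toy sequences from this question; the helper `count_matching` and the predicate names are made up for illustration and are not part of any library.

```python
# Toy sequences copied from the question.
group_a = ["AAABBBC*CAAAAAAAC", "CCCBBBC*CAAAAAAAB", "CBBBBBC*CAAAAAAAB"]
group_b = ["AAACCCC*CAAAAAAAA", "CCCCCCC*CAAAAAAAA", "CBBCCCC*CAAAAAAAA"]

def count_matching(seqs, *conditions):
    """Count the sequences that satisfy every condition (hypothetical helper)."""
    return sum(all(cond(s) for cond in conditions) for s in seqs)

# Hypothetical predicates for the patterns described in the text.
def starts_with_c(s):
    return s[0] == "C"

def ends_with_b(s):
    return s[-1] == "B"

def ends_with_c(s):
    return s[-1] == "C"

def has_motif(s):
    return "AAABBB" in s

patterns = {
    "C at position 1 and B at the end": (starts_with_c, ends_with_b),
    "motif AAABBB": (has_motif,),
    "motif AAABBB and C at the end": (has_motif, ends_with_c),
}

# Map each pattern to (count in group A, count in group B).
results = {
    name: (count_matching(group_a, *conds), count_matching(group_b, *conds))
    for name, conds in patterns.items()
}

for name, (n_a, n_b) in results.items():
    print(f"{name}: A {n_a}/{len(group_a)}, B {n_b}/{len(group_b)}")
```

Counting joint patterns like this is straightforward once the patterns are known; the difficulty is that we do not know in advance which combinations of positions matter.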

We have now tried LDA (after converting the data to binary vectors); however, it scores each letter independently. To illustrate, suppose group A has two subgroups:

sub1: position1 = A + position10 = C
sub2: position2 = A + position15 = B

and neither is common in group B. A method like LDA would then also score position1 = A (from sub1) + position15 = B (from sub2) extremely high, even though they are actually part of different dependencies within group A. We are therefore looking for an alternative that takes such dependencies into account when differentiating the groups.
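
To make the problem concrete, here is a minimal sketch. It does not use LDA itself; it uses a simpler per-feature log-odds score as a stand-in for any model that sums independent per-position contributions. The sequences are invented toy data (filler character `-`, five copies per subgroup), and all function names are hypothetical.

```python
import math

def make_seq(assignments, length=15, filler="-"):
    """Build a toy sequence from {1-based position: letter}; the rest is filler."""
    chars = [filler] * length
    for pos, letter in assignments.items():
        chars[pos - 1] = letter
    return "".join(chars)

# Invented toy data mirroring sub1 and sub2 from the text.
sub1 = make_seq({1: "A", 10: "C"})
sub2 = make_seq({2: "A", 15: "B"})
background = make_seq({})
group_a = [sub1] * 5 + [sub2] * 5
group_b = [background] * 10

def features(seq):
    """One-hot style features: the set of (position, letter) pairs."""
    return {(i + 1, ch) for i, ch in enumerate(seq)}

def feature_log_odds(pos_group, neg_group, pseudo=0.5):
    """Smoothed log-odds per feature, each feature scored independently."""
    all_feats = set().union(*(features(s) for s in pos_group + neg_group))
    scores = {}
    for f in all_feats:
        p = (sum(f in features(s) for s in pos_group) + pseudo) / (len(pos_group) + 2 * pseudo)
        q = (sum(f in features(s) for s in neg_group) + pseudo) / (len(neg_group) + 2 * pseudo)
        scores[f] = math.log(p / q)
    return scores

def additive_score(seq, scores):
    """Sum of independent per-feature scores; unseen features contribute 0."""
    return sum(scores.get(f, 0.0) for f in features(seq))

def pair_count(seqs, feat_1, feat_2):
    """Number of sequences containing both features."""
    return sum(feat_1 in features(s) and feat_2 in features(s) for s in seqs)

scores = feature_log_odds(group_a, group_b)

# A sequence mixing half of sub1 with half of sub2.
mixed = make_seq({1: "A", 15: "B"})

print("genuine sub1:", round(additive_score(sub1, scores), 3))
print("mixed       :", round(additive_score(mixed, scores), 3))
print("background  :", round(additive_score(background, scores), 3))
print("A sequences with pos1=A and pos10=C:", pair_count(group_a, (1, "A"), (10, "C")))
print("A sequences with pos1=A and pos15=B:", pair_count(group_a, (1, "A"), (15, "B")))
```

In this toy setup the mixed sequence gets the same additive score as a genuine sub1 sequence, although the pair pos1 = A with pos15 = B never occurs in group A. That gap between per-position scores and joint occurrence is exactly what we want the method to account for.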

We really hope someone here can help us out!

EDIT

To explain the dependencies better, as asked by @carlo, I made a grossly simplified example. For pink we see that:

A at pos 1 is often associated with A at pos 3
A at pos 1 can also be associated with B at pos 2, which is often followed by A at pos 3 and less often by C at pos 3

For blue we see that:

A at pos 1 is often associated with A at pos 2, sometimes followed by C at pos 3.

Then, when we run a classifier, we want it to see that a sequence such as AXA is probably group I, that a sequence such as ABA is even more likely to be group I, and that a sequence such as AAB is group II.

Care to explain better what you mean by "dependence"? – carlo May 30 '20 at 09:39
@carlo please see my edit, hopefully it's clearer now! – KingBoomie May 31 '20 at 10:29
