I have posted a question related to this problem over a year ago and we still were not able to figure this out.
We have two groups, A and B that we want to train on to separate them. Both have numerous observations of "text" so for example:
group A:
- AAABBBC*CAAAAAAAC
- CCCBBBC*CAAAAAAAB
- CBBBBBC*CAAAAAAAB
group B:
- AAACCCC*CAAAAAAAA
- CCCCCCC*CAAAAAAAA
- CBBCCCC*CAAAAAAAA
Notably, our original datasets are much larger, with around 4,000 observations for A (with really specific patterns) and around 20,000 for group B. We want is a model that sees things like:
- if there is a
C
at position 1 we see aB
on the end in group A (2/3), and we do not see this in group B (0/3) - That we only find the motif
AAABBB
in group A - if we see
AAABBB
we also saw aC
at the end in group A (1/3) but we did not see this in group B (0/3)
We tried LDA now (after converting this data to binary vectors), however, this would score each letter independently. To illustrate if group A would have two subgroups:
sub1
: position1 = A + position10 = Csub2
: position2 = A + position15 = B
and both are not common in group B then a method like LDA would also score position1 = A (sub1
) + position15 = B (sub2
) extremely high even tho they are actually part of different dependencies within group A, so we are looking for an alternative taking care of such dependencies when differentiating groups.
We really hope someone here can help us out!
EDIT
To explain the dependencies better as asked by @carlo I made a grossly simplified example:
For pink we see that:
- A at pos
1
is often associated with A at pos3
- A at pos
1
can also be associated with B at pos2
where it is often followed by A at3
and less often by C at pos3
For blue we see that:
- A at pos
1
is often associated with A at pos2
and then sometimes followed by C at3
.
Then when we run a classifier we want it to see that a sequence such as AXA would be probably group I but that a sequence such as ABA is even more likely to be group I, and that a sequence such as AAB would be group II.