I've got an online classification problem where I predict a class label {+1, -1}
for an object and then show it to a user to get the real label. My task is to minimize the number of -1
objects shown to the user.
Obviously, the algorithm will not converge if it learns only from +1 -> -1
misclassifications, so a certain amount of -1
objects needs to be shown to users anyway. The problem is to minimize the number of -1
objects shown to the user by selecting only the ones most valuable for learning.
The above problem is addressed in the field of 'active learning', where the learner gets to decide whether or not to request a label for an object from the user. Searching for a solution that does not assume i.i.d. data, I found this work, where the learner decides whether to query the label of an object based on its margin. This is intuitive: a smaller margin means we are more likely to make an error, which makes the object more valuable for learning. The probability of querying an object depends on its margin and a parameter b
, whose optimal value is given for the linearly separable case, which applies to my problem.
The algorithm is simple and goes as follows:
initialize the weight vector w(0) with zeros
for every input vector x(t):
    calculate the margin p(t) = w(t-1) * x(t)
    predict a class with Hebb's rule: y(t) = sgn(p(t))
    toss a coin with P(heads) = b / (b + |p(t)|)
    if heads:
        query the true label y'(t) of x(t)
        if y'(t) != y(t):
            update the weights: w(t) = w(t-1) + y'(t) * x(t)
        else:
            w(t) = w(t-1)
    else:
        w(t) = w(t-1)
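To make the rule concrete, here is a minimal Python sketch of the algorithm above. The function name, the `stream` of (object, true label) pairs, and the choice of `b` are my own assumptions for illustration; in the real setting the true label would come from the user only when queried:

```python
import numpy as np

def selective_sampling_perceptron(stream, b, rng=None):
    """Margin-based selective sampling perceptron (sketch).

    stream: iterable of (x, true_label) pairs, labels in {+1, -1};
            true_label stands in for the user and is only looked at
            when the coin toss says to query.
    b:      exploration parameter; larger b queries more often.
    """
    rng = rng or np.random.default_rng(0)
    w = None
    queries = mistakes = 0
    for x, true_y in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)
        p = w @ x                      # margin p(t) = w(t-1) * x(t)
        y = 1 if p >= 0 else -1       # predicted class, sgn with ties -> +1
        # toss a coin with P(heads) = b / (b + |p(t)|)
        if rng.random() < b / (b + abs(p)):
            queries += 1               # label queried from the "user"
            if true_y != y:            # misclassified: perceptron update
                mistakes += 1
                w = w + true_y * x
    return w, queries, mistakes
```

Note that when w is still zero the margin is zero, so the first few objects are queried with probability 1; as w grows, confident (large-margin) predictions are queried less and less often.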
Basically, I want to use this rule for querying the user only if the predicted class is -1
; if it is +1
, I query the label anyway. My question is: would my algorithm converge in that setting? The perceptron algorithm itself and the modified version from the above article are symmetric with respect to class labels. In my setting there will surely be more errors on +1
objects than on -1
's. Does that hurt convergence? If it does, what modification to the algorithm can bring it back to 'symmetry' between positive and negative weight updates?
The modified algorithm, which queries +1
predictions in all cases, is as follows:
initialize the weight vector w(0) with zeros
for every input vector x(t):
    calculate the margin p(t) = w(t-1) * x(t)
    predict a class with Hebb's rule: y(t) = sgn(p(t))
    toss a coin with P(heads) = b / (b + |p(t)|)
    if heads or y(t) == +1:
        query the true label y'(t) of x(t)
        if y'(t) != y(t):
            update the weights: w(t) = w(t-1) + y'(t) * x(t)
        else:
            w(t) = w(t-1)
    else:
        w(t) = w(t-1)
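The same sketch with the asymmetric querying rule: a +1 prediction is always queried (the object is shown to the user regardless), and the margin-based coin toss gates only the -1 predictions. As before, the function name, the `stream`, and `b` are illustrative assumptions, not part of the referenced article:

```python
import numpy as np

def asymmetric_selective_perceptron(stream, b, rng=None):
    """Selective sampling perceptron that always queries +1 predictions
    and uses the margin-based coin toss only for -1 predictions (sketch)."""
    rng = rng or np.random.default_rng(0)
    w = None
    queries = 0
    for x, true_y in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)
        p = w @ x                      # margin
        y = 1 if p >= 0 else -1       # predicted class
        heads = rng.random() < b / (b + abs(p))
        if heads or y == 1:            # a -1 object is shown only on heads
            queries += 1
            if true_y != y:            # perceptron update on a mistake
                w = w + true_y * x
    return w, queries
```

Here `queries` counts all objects shown to the user; the cost the question cares about is the subset of those whose true label turns out to be -1.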
Is this algorithm going to converge? If not, what modifications can be made?
My first idea was to decrease the learning rate for +1's, but still, I don't know how to prove convergence. Do you have any ideas on how to design an algorithm so that it will surely find a linear separator in a finite number of iterations?
Thanks.
Update. If you don't know how to answer the question but have any thoughts on where to look for a solution, or any pointers to publications/books/etc., please let me know!