
I have two questions regarding KNN.

Q1. If the training data have 7 classes, should I consider a higher K (for example, should I start with k = 14)?

Q2. Since half of my training data is not labeled, I will be using self-training.

The self-training algorithm works as follows (a rough code sketch is given after the list):

• Let L be the set of labeled data and U be the set of unlabeled data.

• Repeat:

 – Train a classifier h with the training data L

 – Classify the data in U with h

 – Find a subset U’ of U with the most confident scores

 – L + U’ -> L

 – U – U’ -> U
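
In code, I understand the loop roughly like this (a sketch only; the kNN classifier, the 0.8 confidence threshold, and the function and variable names are my own assumptions for illustration, not part of the algorithm as stated):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def self_train(X_labeled, y_labeled, X_unlabeled, k=5, threshold=0.8):
    """Sketch of the loop above: L = (X_labeled, y_labeled), U = X_unlabeled.
    `threshold` is an assumed cut-off for what counts as 'confident'."""
    L_X, L_y, U = X_labeled, y_labeled, X_unlabeled
    while len(U) > 0:
        # Train a classifier h with the training data L
        h = KNeighborsClassifier(n_neighbors=k).fit(L_X, L_y)

        # Classify the data in U with h
        proba = h.predict_proba(U)               # one row of class probabilities per point
        conf = proba.max(axis=1)                 # confidence of the predicted class
        pred = h.classes_[proba.argmax(axis=1)]

        # Find the subset U' of U with the most confident scores
        mask = conf >= threshold
        if not mask.any():                       # nothing confident enough: stop
            break

        # L + U' -> L  and  U - U' -> U
        L_X = np.vstack([L_X, U[mask]])
        L_y = np.concatenate([L_y, pred[mask]])
        U = U[~mask]
    return L_X, L_y
```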

What is considered a confident score here? For each data point, is it the proportion of its k nearest neighbours that belong to the chosen class, compared to the other classes?


1 Answer


Q1. If the training data have 7 classes, should I consider a higher K (for example, should I start with k = 14)?

The parameter $k$ in nearest neighbours is usually picked via $n$-fold cross-validation. See this question for further details.
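
For instance, a minimal sketch of that selection procedure, assuming scikit-learn's KNeighborsClassifier and cross_val_score (the candidate range 1–30 and the helper name pick_k are arbitrary choices of mine):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X, y, candidate_ks=range(1, 31), n_folds=5):
    """Return the k with the best mean n-fold cross-validated accuracy."""
    mean_scores = [
        cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=n_folds).mean()
        for k in candidate_ks
    ]
    return list(candidate_ks)[int(np.argmax(mean_scores))]
```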

Q2. Since half of my training data is not labeled, I will be using self-training. What is considered a confident score here?

Although kNN predicts the label for a new datapoint $x$ by simple majority vote, as outlined here the majority vote also approximates the probability for each posterior label $l$ as given by $$ P(l \mid x) = \frac{\text{number of k-nearest neighbour labelled } l }{k}. $$ You can use this as your confidence score.
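
As a concrete illustration, here is a sketch assuming scikit-learn and toy data (the data, k = 5, and the top-5 cut-off are placeholders): with the default uniform weights, predict_proba returns exactly these neighbour-label fractions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_l = rng.normal(size=(70, 2))            # toy labeled points (L)
y_l = rng.integers(0, 7, size=70)         # 7 classes, as in the question
X_u = rng.normal(size=(30, 2))            # toy unlabeled points (U)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_l, y_l)

# Each row of predict_proba is P(l | x): the fraction of the k nearest
# neighbours of x carrying label l (with the default uniform weights).
proba = knn.predict_proba(X_u)
confidence = proba.max(axis=1)            # score of the winning (predicted) class
predicted = knn.classes_[proba.argmax(axis=1)]

# U' could then be, say, the 5 unlabeled points predicted most confidently
most_confident_idx = np.argsort(confidence)[-5:]
```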
