
For a machine learning class I am taking, our first homework assignment includes the following problem, which has me stuck:

Consider the following simulated data set:

set.seed(123)                                        # make the simulation reproducible
n <- 100
X <- runif(n)                                        # predictor: uniform on [0, 1]
Y <- rbinom(n, 1, exp(0.5 + X)/(1 + exp(0.5 + X)))   # P(Y = 1 | X) follows a logistic model

a) Find the Bayes' classifier
b) Construct an empirical version of the Bayes' classifier using MLE (you can use the glm function)

I don't understand how to find the Bayes' classifier using R. I can find it algebraically, but how do you implement Bayes' classifiers in R? When I search around, the only sources I can find are on "Naive Bayes' classifiers", which don't appear to be the same thing.

This:

http://en.wikipedia.org/wiki/Bayes_classifier

is the Bayes' classifier I want to find, but I can't find any sources on it for R.

Further, even if I did know how to find the Bayes' classifier, I don't understand what the difference would be between finding it and constructing an empirical version using MLE; the distinction doesn't make sense to me. How do I use the glm function to construct a classifier via MLE? I imagine it has something to do with fitting a logistic model, but I don't understand how to use the glm function in the way I am being asked to. I suspect I might just be getting caught up in the terminology/notation and confusing myself unnecessarily.

Anyone have any pointers for how to get started on this? I'm not asking anyone to code it for me, but it would be nice if someone could point me in the right direction.

Sextus Empiricus
Ryan Simmons
  • This question doesn't belong on StackOverflow since it's about statistics and machine learning rather than programming. I've flagged for it to be moved to a sister site. In any case, the Wikipedia article you link to gives the answer: `the Bayes classifier minimises the probability of misclassification`. At this point it's a *math and statistics* problem: you're not going to solve it by finding a function called `bayesClassifier` in R, but rather by working out mathematically what you need to do. – David Robinson Sep 01 '14 at 14:20
  • On the programming side I know that the [rstan](http://mc-stan.org/rstan.html) package implements Bayesian statistics and might be of use. – Martin Markov Sep 01 '14 at 14:22
  • @DavidRobinson: The question explicitly asks me to find an empirical version of the Bayes classifier using the glm function in R. This is a programming question. Why would I be asked specifically to use the glm function in R if this wasn't a programming question? – Ryan Simmons Sep 01 '14 at 14:34
  • @RyanSimmons a) to do that you'll need to first solve the math issue described in (a). b) The fact that you need to use a function doesn't make something a programming question. This is asking "which arguments do you need to give to the `glm` function to accomplish this goal," which is very much a question for a statistician rather than a programmer. – David Robinson Sep 01 '14 at 14:43
  • The problem arises because you don't understand the statistical issues. Once you understand how to make a decision rule with a logistic regression model it's quite easy, and is off-topic for SO. – DWin Sep 02 '14 at 03:24
  • This is how you can do logistic regression on Y: `model <- glm(Y ~ X, family = "binomial")` – Zhubarb Sep 02 '14 at 07:51
  • So, finding the Bayes classifier in this case amounts to finding the cutpoint on the X-scale where to one side you classify the Ys as 0 and the other side you classify the Ys as one? If I run `glm(Y ~ X, family="binomial")` I get that I should always classify the Ys as 1 (at least on the X range 0-1). – Rasmus Bååth Sep 02 '14 at 08:34
  • I have changed the title since it seems more about how to turn a logistic regression into a Bayes classifier, instead of a programming question in R. In this way the question can be useful to a larger public. @RyanSimmons Maybe I am mistaken and the task was to use MLE to find some expressions like p(x|y=1) and p(x|y=0)? – Sextus Empiricus Feb 20 '18 at 21:51

1 Answer


a) Consider the equivalent question: for which values of X is it 'best' to predict an (unknown) Y to belong to class 1, and for which values of X is it 'best' to predict Y to belong to class 0? Where is the boundary between the two?
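
A minimal R sketch of what (a) could look like, assuming the point is simply to evaluate the known true probability P(Y = 1 | X = x) and compare it with 1/2 (the object names here are illustrative, not part of the assignment):

# True conditional probability P(Y = 1 | X = x) used in the simulation
p_true <- function(x) exp(0.5 + x) / (1 + exp(0.5 + x))

# The Bayes classifier predicts class 1 whenever P(Y = 1 | X = x) > 1/2,
# i.e. whenever 0.5 + x > 0, which holds for every x in [0, 1].
x_grid <- seq(0, 1, by = 0.01)
range(p_true(x_grid))                          # roughly 0.62 to 0.82, always above 0.5
bayes_rule <- function(x) as.numeric(p_true(x) > 0.5)
table(bayes_rule(x_grid))                      # the Bayes rule predicts 1 everywhere on this range

In other words, on the support of X the Bayes classifier reduces to the constant rule "predict 1", which is consistent with what Rasmus Bååth observed in the comments.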

b) The same as the previous question, but pretending that you do not know the 'true' formula for the model $$Y \sim B\left(1,\frac{1}{1+e^{-0.5-X}}\right)$$ More specifically, use maximum likelihood estimation (MLE): do the fitting yourself (you can do that with the glm function) rather than use some standard, off-the-shelf function for generating a classifier.
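
For (b), a hedged sketch of the empirical version: fit the logistic regression by MLE with glm and plug the fitted probabilities into the same 1/2 threshold (the object names are illustrative, and it assumes X and Y from the simulation code in the question are in the workspace):

# MLE fit of the logistic regression P(Y = 1 | X) = 1 / (1 + exp(-(b0 + b1 * X)))
fit <- glm(Y ~ X, family = binomial)
coef(fit)                                  # compare the estimates to the true values (0.5, 1)

# Empirical (plug-in) Bayes classifier: predict 1 when the fitted probability exceeds 1/2
p_hat <- predict(fit, type = "response")
y_hat <- as.numeric(p_hat > 0.5)
mean(y_hat == Y)                           # in-sample accuracy of the plug-in rule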

There are several existing questions from which you can dig up examples of how the glm function works: https://stats.stackexchange.com/search?q=glm+logistic

Or go straight to the general documentation for the function: in R you can find help on a function by typing help('glm') or ?glm in a console, and if the function's package is not loaded you can use ??glm to search for the term across the help pages of installed packages.

Two graphical examples of a Bayes classifier:
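
A hedged sketch of one way to draw such a picture in R, plotting the true and glm-fitted P(Y = 1 | X) against X together with the 1/2 threshold (it reuses the illustrative fit object from the sketch above; this is an illustration, not the original figures):

# True vs. fitted class-1 probabilities, with the 0.5 decision threshold
x_grid <- seq(0, 1, by = 0.01)
plot(x_grid, exp(0.5 + x_grid) / (1 + exp(0.5 + x_grid)), type = "l",
     ylim = c(0, 1), xlab = "X", ylab = "P(Y = 1 | X)")
lines(x_grid, predict(fit, newdata = data.frame(X = x_grid), type = "response"), lty = 2)
abline(h = 0.5, lty = 3)                   # Bayes decision threshold
points(X, Y)                               # observed data
legend("bottomright", legend = c("true", "fitted (glm)"), lty = c(1, 2))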

Sextus Empiricus