
I know that logistic regression tells you the probability that a specific data point corresponds to True, which makes it a variation of a binary classification problem.

Let's say I have data points where each point is an n-dimensional vector. I only have data points where the binary dependent variable is True (there are zero data points where it is False).

I would like to calculate the probability of an arbitrary n-dimensional vector corresponding to True. How would I go about doing this? Is this even possible?

My only thought on this would be to somehow find the distance between this arbitrary vector and some other vector that corresponds to True, then weight it in a way that could be interpreted as a probability.

Would I even need to do this? Could I just do a logistic regression in Stata with the dataset I already have or would that not work?

Anthony
  • It is easy to try. If all observed responses are 1 (more generally, non-zero and not missing) Stata declines to try to fit a logistic regression. Otherwise this hinges on what is an arbitrary data vector, and I have no idea what that means in Stata terms. – Nick Cox Dec 18 '19 at 16:41
  • Does this answer your question? [Binary semi-supervised classification with positive only and unlabeled data set](https://stackoverflow.com/questions/25700724/binary-semi-supervised-classification-with-positive-only-and-unlabeled-data-set) – user2160809 Dec 18 '19 at 16:42
  • That is sorta similar to the premise of my question but I don't really have unlabeled data. I just have positive data and want to say "If I gave this new data point vector these values, what is the probability that it would correspond to True?" – Anthony Dec 18 '19 at 16:46
  • This isn't on all fours with "I tossed my coin and got heads every time. What's the probability of that?" That is an answerable question if you tell us the number of tosses and postulate a fair coin. Your question is more like "My coin is likely biased and I get heads every time. What is the probability of that?", which is unanswerable beyond saying that the data imply probability 1 (knowing other variables is irrelevant). – Nick Cox Dec 18 '19 at 17:27
  • @NickCox my question is more like "I only have data where a guy applied for a loan and was approved. What is the probability that the next loan application is approved?" which seems like an answerable question – Anthony Dec 18 '19 at 17:41
  • You could try recasting the question in Bayesian terms, but this seems like an occult problem to me. – Nick Cox Dec 18 '19 at 18:04
  • @Anthony From a frequentist point of view, if everyone you observed was approved for the loan, the best guess is that the next applicant will be approved as well. If you take the Bayesian approach and have some prior belief about the population, you can combine that prior belief with the observed approval rate to draw some non-trivial inference. Otherwise, the best guess from a 100% approval rate will always be "approved". – 9mat Dec 20 '19 at 16:12
  • Similar posts: https://stats.stackexchange.com/questions/149290/binary-logistic-regression-with-only-positive-training-examples-does-that-even, https://stats.stackexchange.com/questions/365582/how-to-choose-a-method-for-binary-classifier-based-on-only-positive-and-unlabell, https://stats.stackexchange.com/questions/73078/modeling-what-should-be-a-logistic-regression-but-has-no-negative-responses – kjetil b halvorsen Dec 24 '21 at 18:02

1 Answer


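Logistic regression programs will not fit such data. With no variation in the outcome, the maximum likelihood estimates do not exist (the fitted log-odds diverge to infinity), so the software will either refuse to run or fail to converge.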

However, there has been work done on estimating the true probability when you only have data at one level of the outcome. I have not looked at this literature in a while, but there was a paper by Agresti and Coull (1998), "Approximate is better than 'exact' for interval estimation of binomial proportions", The American Statistician, vol. 52, pp. 119-126. They give a relatively complicated formula and also a simple approximation: add 2 "successes" and 2 "failures", then proceed as usual.
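
As a rough illustration of the "add 2 successes and 2 failures" approximation, here is a minimal Python sketch. The function name `plus_four_interval`, the example of 50 approvals out of 50, and the choice of z = 1.96 for a 95% interval are my own assumptions for illustration; they are not taken from the paper.

```python
import math

def plus_four_interval(successes, n, z=1.96):
    """Approximate interval for a binomial proportion.

    Adds 2 successes and 2 failures to the data, then applies the
    usual normal-approximation (Wald) formula to the adjusted counts.
    z = 1.96 is an assumed value for a 95% interval.
    """
    n_adj = n + 4                      # 2 extra successes + 2 extra failures
    p_adj = (successes + 2) / n_adj    # adjusted point estimate
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to [0, 1], since a proportion cannot fall outside that range.
    lower = max(0.0, p_adj - half_width)
    upper = min(1.0, p_adj + half_width)
    return p_adj, lower, upper

# Hypothetical data: 50 observations, all of them True.
p_adj, lower, upper = plus_four_interval(successes=50, n=50)
print(f"adjusted estimate {p_adj:.3f}, interval ({lower:.3f}, {upper:.3f})")
```

Note that the adjusted estimate is pulled slightly below 1 even though every observed outcome was True, which is the point of the adjustment.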

Another option is the "rule of 3", which says that when all n observations are successes, an approximate 95% CI for p runs from 1 - 3/n to 1.
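
To make the rule of 3 concrete, here is a small Python sketch. The function names and the sample size of 100 are assumptions for illustration. The comparison value comes from solving p^n = 0.05 for p, which is the one-sided bound that the rule of 3 approximates (since -ln(0.05) is about 3).

```python
def rule_of_three_lower(n):
    """Approximate 95% lower bound for p when all n outcomes are successes."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return max(0.0, 1 - 3 / n)

def exact_lower(n, alpha=0.05):
    """Lower bound from solving p**n = alpha, i.e. the smallest p under
    which seeing n successes in a row still has probability alpha."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return alpha ** (1 / n)

# Hypothetical data: 100 observations, all of them True.
n = 100
approx_bound = rule_of_three_lower(n)
exact_bound = exact_lower(n)
print(f"rule of 3: ({approx_bound:.4f}, 1), solving p^n = 0.05: ({exact_bound:.4f}, 1)")
```

For moderately large n the two bounds agree closely; for very small n the rule of 3 is cruder.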

However, this does not really do what logistic regression does. Rather, either of these solutions winds up measuring the frequency of different vectors.

Peter Flom