Statistics for online dating sites

Question

I'm curious how an online dating system might use survey data to determine matches.

Suppose they have outcome data from past matches (e.g., 1 = happily married, 0 = no 2nd date).

Next, let's suppose they had 2 preference questions,

"How much do you enjoy outdoor activities? (1=strongly dislike, 5 = strongly like)"
"How optimistic are you about life? (1=strongly dislike, 5 = strongly like)"

Suppose also that for each preference question they have an indicator "How important is it that your spouse shares your preference? (1 = not important, 3 = very important)"

If they have those 4 questions for each pair and an outcome for whether the match was a success, what is a basic model that would use that information to predict future matches?

I thought a successful match happens when the girl is pretty or the male is rich. Everything else is secondary. – user4951 Nov 16 '11 at 02:54
Check http://blog.okcupid.com/ - somewhere they talk about the underlying matching models. – Felix S Nov 16 '11 at 10:08
Can you mention what kind of things you'd like more depth on? Michael's answer is a pretty solid overview. – Dan Nov 21 '11 at 22:46
If you read the eHarmony patent (patent 6,735,568 - https://www.google.com/#sclient=psy-ab&hl=en&safe=off&site=&source=hp&q=patent+6%2C735%2C568&pbx=1&oq=patent+6%2C735%2C568&aq=f&aqi=&aql=&gs_sm=e&gs_upl=1861l2351l0l2561l8l3l0l0l0l0l458l973l0.1.0.1.1l3l0&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=454e8cb381e664cb&biw=1440&bih=730 ), their system uses a combination of Principal Component Analysis and Factor Analysis together with a Neural Network. As others have mentioned, methods like K-NN, CART, and GLMs would work well also. – Chris Simokat Nov 22 '11 at 23:28
@ChrisSimokat -- WOW! Thanks so much for the amazing link. That's interesting though. I never thought you could "copyright" statistical methods and algorithms. – d_a_c321 Nov 24 '11 at 00:38
@dchandler you're welcome, I hope that helps. If you find that interesting, you might find the book "Math You Can't Use" ( http://www.amazon.com/Math-You-Cant-Use-Copyright/dp/0815749422 ) interesting as well. – Chris Simokat Nov 25 '11 at 03:34

score 4 · Answer 1 · answered Nov 16 '11 at 09:26

I once spoke to someone who works for one of the online dating sites that uses statistical techniques (they'd probably rather I didn't say who). It was quite interesting - to begin with they used very simple things, such as nearest neighbours with Euclidean or L_1 (city-block) distances between profile vectors, but there was a debate as to whether matching two people who were too similar was a good or bad thing. He then went on to say that now that they have gathered a lot of data (who was interested in whom, who dated whom, who got married, etc.), they use it to constantly retrain their models. They work in an incremental-batch framework, where they update their models periodically using batches of data, and then recalculate the match probabilities on the database. Quite interesting stuff, but I'd hazard a guess that most dating websites use pretty simple heuristics.
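To make the "simple things" concrete, here is a minimal sketch of nearest-neighbour matching on profile vectors in base R. The four users and their answers are invented for illustration, and `nearest()` is a hypothetical helper, not anything the site in question is known to use.

```r
# Hypothetical profile vectors: rows are users, columns are 1-5 survey answers.
profiles <- rbind(
  alice = c(outdoors = 5, optimism = 4),
  bob   = c(outdoors = 4, optimism = 4),
  carol = c(outdoors = 1, optimism = 2),
  dave  = c(outdoors = 2, optimism = 5)
)

# For each user, return the name of the closest other user.
# method is passed to stats::dist(): "euclidean" or "manhattan" (L_1 / city-block).
nearest <- function(profiles, method) {
  d <- as.matrix(dist(profiles, method = method))
  diag(d) <- Inf  # a user is not their own match
  setNames(colnames(d)[apply(d, 1, which.min)], rownames(d))
}

nearest(profiles, "euclidean")
nearest(profiles, "manhattan")
```

Note that this always pairs the most similar profiles, which is exactly the assumption the site was debating.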

score 3 · Answer 2 · Michael Bishop · 2011-11-21T21:27:55.787

You asked for a simple model. Here's how I would start with R code:

 # 'matches' is an assumed data frame with one row per attempted pair
 fit <- glm(match ~ outdoorDif*outdoorImport + optimistDif*optimistImport,
            family = binomial(link = "logit"), data = matches)

outdoorDif = the difference between the two people's answers about how much they enjoy outdoor activities.
outdoorImport = the average of the two people's answers about how important it is to match on enjoyment of outdoor activities.
optimistDif and optimistImport are defined the same way for the optimism question.

The * indicates that the preceding and following terms are interacted and also included separately; in R's formula syntax, a*b expands to a + b + a:b.
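For anyone who wants to try this, here is a rough end-to-end sketch on simulated data. Everything about the data is made up: the data frame name `matches`, the use of the absolute difference (0 to 4) for the Dif variables, and the "true" coefficients used to generate outcomes are all assumptions for illustration only.

```r
set.seed(1)
n <- 500

# Simulated pair-level data (invented, not real dating-site data)
matches <- data.frame(
  outdoorDif     = sample(0:4, n, replace = TRUE),                  # assumed absolute difference
  outdoorImport  = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE), # average of two 1-3 answers
  optimistDif    = sample(0:4, n, replace = TRUE),
  optimistImport = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE)
)

# Made-up truth: a gap hurts more when the pair rates the question as important
eta <- 1 - 0.3 * matches$outdoorDif * matches$outdoorImport -
           0.2 * matches$optimistDif * matches$optimistImport
matches$match <- rbinom(n, 1, plogis(eta))

fit <- glm(match ~ outdoorDif*outdoorImport + optimistDif*optimistImport,
           family = binomial(link = "logit"), data = matches)

# The * operator yields main effects plus the two ":" interaction terms
names(coef(fit))

# Predicted success probability for a hypothetical new pair
new_pair <- data.frame(outdoorDif = 1, outdoorImport = 3,
                       optimistDif = 0, optimistImport = 2)
predict(fit, newdata = new_pair, type = "response")
```

To rank candidate matches for one person, you would build one such row per candidate and sort by the predicted probability.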

You suggest that the match data is binary with the only two options being, "happily married" and "no second date," so that is what I assumed in choosing a logit model. This doesn't seem realistic. If you have more than two possible outcomes you'll need to switch to a multinomial or ordered logit or some such model.
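As a sketch of the ordered-logit option, `MASS::polr` fits a proportional-odds model. The three outcome levels below (including the middle "dated a while" category), the cut points, and the simulated data are all invented for illustration; the question only mentions two outcomes.

```r
library(MASS)  # recommended package, included in standard R installations

set.seed(2)
n <- 600
pair_data <- data.frame(
  outdoorDif    = sample(0:4, n, replace = TRUE),
  outdoorImport = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE)
)

# Made-up latent "relationship quality" with logistic noise, cut into three ordered levels
latent <- -0.4 * pair_data$outdoorDif * pair_data$outdoorImport + rlogis(n)
pair_data$outcome <- cut(latent, breaks = c(-Inf, -3, -1, Inf),
                         labels = c("no second date", "dated a while", "married"),
                         ordered_result = TRUE)

# Proportional-odds (ordered logit) model with the same interaction structure
ofit <- polr(outcome ~ outdoorDif*outdoorImport, data = pair_data)

# Predicted probability of each outcome level for a hypothetical pair
probs <- predict(ofit, newdata = data.frame(outdoorDif = 0, outdoorImport = 3),
                 type = "probs")
probs
```

If the outcome categories have no natural order, a multinomial model would be the analogous choice.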

If, as you suggest, some people have multiple attempted matches, then that would probably be a very important thing to try to account for in the model. One way to do it might be to have separate variables indicating the number of previous attempted matches for each person, and then interact the two.

edited Nov 21 '11 at 21:27

answered Nov 21 '11 at 20:16

Michael Bishop


Thanks for the great answer. I'm giving you the bounty! :) That seems like a good approach. Perhaps if you had N questions that fit into M categories of similar questions (e.g., athletics questions) you might enrich the model using an average of the importance and differences within that category and add it as an additional term. It's not perfect, but that may be a simple way to capture the interaction of several correlated variables. Thanks again, I'd be happy to hear any other thoughts that didn't make your answer ;). – d_a_c321 Nov 24 '11 at 00:44
Should you not normalise the answers first? If everybody enjoyed the outdoors, then the outdoor answer should become less relevant, because it would be a poor predictor of compatibility. – Sklivvz Dec 05 '11 at 22:41
@Sklivvz, I'm not sure how you would normalize a multiple choice (ordinal) answer. Also, remember that linear transformations of continuous predictor variables are sometimes desirable for reasons discussed here: http://stats.stackexchange.com/q/7112/3748 and here: http://stats.stackexchange.com/q/19216/3748 but they won't change the model's predictions barring some unusual computational issues. If everyone enjoys the outdoors equally, the outdoor answer is less relevant, but I don't think it's really a problem for the model as I specified it. (Not that my model is perfect) – Michael Bishop Dec 06 '11 at 03:22

score 1 · Answer 3 · answered Nov 21 '11 at 21:28

One simple approach would be as follows.

For the two preference questions, take the absolute difference between the two respondents' responses, giving two variables, say z1 and z2, instead of four.

For the importance questions, I might create a score that combines the two responses. If the responses were, say, (1,1), I'd give a 1, a (1,2) or (2,1) gets a 2, a (1,3) or (3,1) gets a 3, a (2,3) or (3,2) gets a 4, and a (3,3) gets a 5. Let's call that the "importance score." An alternative would be just to use max(response), giving 3 categories instead of 5, but I think the 5 category version is better.
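A small sketch of that scoring rule in R. The listed pairs all satisfy score = sum of the two responses minus 1. The text does not assign a score to (2,2); this sketch assumes it gets a 3 as well (same rule), which is my assumption and could instead be made its own category. `importance_score()` is a hypothetical helper name.

```r
# Importance score from two 1-3 responses: (1,1)->1, (1,2)->2, (1,3)->3, (2,3)->4, (3,3)->5.
# ASSUMPTION: (2,2) is not covered in the text; here it also maps to 3.
importance_score <- function(a, b) {
  stopifnot(all(a %in% 1:3), all(b %in% 1:3))
  a + b - 1
}

importance_score(c(1, 1, 3, 2, 3), c(1, 2, 1, 3, 3))  # 1 2 3 4 5

# The simpler alternative mentioned above: three categories via the maximum
pmax(c(1, 1, 3, 2, 3), c(1, 2, 1, 3, 3))              # 1 2 3 3 3
```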

I'd now create ten variables, x1 - x10 (for concreteness), all with default values of zero. For those observations with an importance score for the first question = 1, x1 = z1. If the importance score for the second question also = 1, x2 = z2. For those observations with an importance score for the first question = 2, x3 = z1 and if the importance score for the second question = 2, x4 = z2, and so on. For each observation, at most one of x1, x3, x5, x7, x9 is nonzero (none when the difference itself is zero), and similarly for x2, x4, x6, x8, x10.

Having done all that, I'd run a logistic regression with the binary outcome as the target variable and x1 - x10 as the regressors.
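Here is a rough sketch of the whole construction on simulated data. The data, the outcome-generating coefficients, and the rule "importance score = sum of the two responses minus 1" (which also covers the (2,2) case the text leaves open) are assumptions for illustration.

```r
set.seed(3)
n <- 1000

# Simulated survey answers for both people in each pair (made-up data)
z1 <- abs(sample(1:5, n, TRUE) - sample(1:5, n, TRUE))  # outdoor difference
z2 <- abs(sample(1:5, n, TRUE) - sample(1:5, n, TRUE))  # optimism difference
s1 <- sample(1:3, n, TRUE) + sample(1:3, n, TRUE) - 1   # importance score, question 1 (1-5)
s2 <- sample(1:3, n, TRUE) + sample(1:3, n, TRUE) - 1   # importance score, question 2 (1-5)

# x1..x10 start at zero; odd columns carry z1, even columns carry z2,
# placed in the column that corresponds to the importance score.
X <- matrix(0, n, 10, dimnames = list(NULL, paste0("x", 1:10)))
X[cbind(seq_len(n), 2 * s1 - 1)] <- z1   # x1, x3, x5, x7, x9
X[cbind(seq_len(n), 2 * s2)]     <- z2   # x2, x4, x6, x8, x10

# Made-up truth: differences hurt more as importance rises
eta <- 1 - 0.15 * s1 * z1 - 0.10 * s2 * z2
dat <- data.frame(success = rbinom(n, 1, plogis(eta)), X)

# Logistic regression of the binary outcome on x1..x10
fit10 <- glm(success ~ ., family = binomial, data = dat)
round(coef(fit10), 2)
```

Comparing the fitted coefficients of x1, x3, ..., x9 is a quick way to see whether the effect of a difference grows with importance, without having imposed that in the model.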

More sophisticated versions of this might create more importance scores by allowing male and female respondents' importance to be treated differently, e.g., a (1,2) != a (2,1), where we've ordered the responses by sex.

One shortfall of this model is that you might have multiple observations of the same person, which would mean the "errors", loosely speaking, are not independent across observations. However, with a lot of people in the sample, I'd probably just ignore this, for a first pass, or construct a sample where there were no duplicates.

Another shortfall is that it is plausible that as importance increases, the effect of a given difference between preferences on p(fail) would also increase, which implies a relationship between the coefficients of (x1, x3, x5, x7, x9) and also between the coefficients of (x2, x4, x6, x8, x10). (Probably not a complete ordering, as it's not a priori clear to me how a (2,2) importance score relates to a (1,3) importance score.) However, we have not imposed that in the model. I'd probably ignore that at first, and see if I'm surprised by the results.

The advantage of this approach is that it imposes no assumption about the functional form of the relationship between "importance" and the difference between preference responses. This is in tension with the previous shortfall comment, but I think the lack of an imposed functional form is likely more beneficial than the related failure to take into account the expected relationships between coefficients.
