54

I know, this may sound like it is off-topic, but hear me out.

At Stack Overflow and here we get votes on posts, this is all stored in a tabular form.

E.g.:

post id     voter id    vote type     datetime
-------     --------    ---------     --------
10          1           2             2000-1-1 10:00:01 
11          3           3             2000-1-1 10:00:01 
10          5           2             2000-1-1 10:00:01 

... and so on. Vote type 2 is an upvote; vote type 3 is a downvote. You can query an anonymized version of this data at http://data.stackexchange.com

There is a perception that if a post reaches the score of -1 or lower it is more likely to be upvoted. This may be simply confirmation bias or it may be rooted in fact.

How would we analyze this data to confirm or deny this hypothesis? How would we measure the effect of this bias?

Sam Saffron
  • 619
  • 4
  • 7
  • 2
    can we get an example of the query? Not everybody is well versed in writing SQL statements. Having sample data might encourage people to try to play with it. +1 for the question. – mpiktas Jun 01 '11 at 06:45
  • @Jeff votes are anonymized you can only get partial info from the data dump, it does include all transitions though here is a quick sample http://data.stackexchange.com/stackoverflow/q/101738/ full anonymized data is available in the public data dump – Sam Saffron Jun 01 '11 at 08:30
  • Why just upvotes? How the probability of up- or down-voting splits around each particular value would be interesting surely? – Bob Durrant Jun 01 '11 at 08:51
  • @Bob, sure agree they would – Sam Saffron Jun 01 '11 at 09:35
  • What'd be even more interesting would be to see if this effect (assuming it exists) still occurred if http://meta.stackexchange.com/questions/747/show-total-votes-or-up-down-votes was extended to all users who could down vote. – naught101 Mar 28 '12 at 08:15
  • 2
    I've seen other kinds of sites obfuscate votes (i.e. add noise before displaying them) and sometimes even completely hide up- and down-votes for a short period, in order to avoid various forms of bandwagonning, pity votes and other 'social' elements of voting. – Glen_b Jun 14 '13 at 01:55
  • http://www.eecs.harvard.edu/cs286r/papers/DW08.pdf Could this (BTS Bayesian Truth Serum) be something to consider in the long run to improve the quality of votes? It might be too complicated though. –  Jun 01 '11 at 21:14

3 Answers

35

You could use a multistate model or Markov chain (the msm package in R is one way to fit these). You could then look to see if the transition probability from -1 to 0 is greater than from 0 to 1, 1 to 2, etc. You can also look at the average time at -1 compared to the others to see if it is shorter.
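To make the transition-probability comparison concrete, here is a minimal sketch (in Python rather than R's msm, with a made-up `events` dict standing in for real vote histories) that estimates the empirical probability that the next vote is an upvote, conditional on the current score:

```python
from collections import defaultdict

def transition_rates(vote_events):
    """Estimate P(next vote is an upvote | current score) from a dict
    mapping post_id -> time-ordered list of vote deltas (+1 or -1)."""
    ups = defaultdict(int)     # score -> upvotes received while at that score
    totals = defaultdict(int)  # score -> total votes received while at that score
    for deltas in vote_events.values():
        score = 0
        for d in deltas:
            totals[score] += 1
            if d == +1:
                ups[score] += 1
            score += d
    return {s: ups[s] / totals[s] for s in totals}

# Toy data (hypothetical): two posts with their time-ordered vote histories.
events = {10: [+1, -1, -1, +1, +1], 11: [-1, +1]}
rates = transition_rates(events)
# For this toy data: rates[-1] == 1.0, rates[0] == 0.5, rates[1] == 0.0
```

With real data you would feed in each post's time-ordered vote deltas and compare the estimate at score -1 against those at 0, 1, etc.; a full multistate fit (e.g. with msm) additionally models the sojourn times.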

Greg Snow
  • 46,563
  • 2
  • 90
  • 159
  • 3
    +1 great reference. There is an [article](http://www.jstatsoft.org/v38/i08/paper) in Journal of Statistical Software about msm package. The model seems ideally fitted for this kind of task. – mpiktas Jun 01 '11 at 06:51
  • 3
    The Markov chain model idea looks like a good one, but the average time at -1 won't give the whole story. It's possible (and plausible - think bad questions) that one is more likely to get downvoted at -1 than elsewhere too. – Bob Durrant Jun 01 '11 at 08:51
  • I guess what one may want to do first is cluster the vote-trajectories - those that get (almost) only up/downvoted (very popular/very bad questions), and those that are more contentious. Then you can do Markov chains on the three classes. – Jonas Jun 01 '11 at 13:52
13

Summary of my answer. I like the Markov chain modelling, but it misses the "temporal" aspect. On the other hand, focusing only on the temporal aspect (e.g. the average time at $-1$) misses the "transition" aspect. I would go for the following general modelling (which, with suitable assumptions, leads to a [Markov process][1]). There is also a lot of "censored-data" statistics behind this problem (it is certainly a classical problem in software reliability). The last equation of my answer gives the maximum likelihood estimator of the voting intensities (up with "+" and down with "-") for a given vote state. As the equation shows, it is intermediate between the case where you only estimate transition probabilities and the case where you only measure the time spent in a given state. Hope this helps.

General Modelling (to restate the question and assumptions). Let $(VD_i)_{i\geq 1}$ and $(S_{i})_{i\geq 1}$ be random variables modelling respectively the voting dates and the associated vote sign (+1 for upvote, -1 for downvote). The voting process is simply

$$Y_{t}=Y^+_t-Y^-_t$$ where

$$Y^+_t=\sum_{i=0}^{\infty}1_{VD_i\leq t,S_i=1} \;\text{ and } \;Y^-_t=\sum_{i=0}^{\infty}1_{VD_i\leq t,S_i=-1}$$

The important quantity here is the intensity of the $\epsilon$-jump, $$\lambda^{\epsilon}_t=\lim_{dt\rightarrow 0} \frac{1}{dt} P(Y^{\epsilon}_{t+dt}-Y^{\epsilon}_t=1|\mathcal{F}_t), $$ where $\epsilon$ can be $-$ or $+$ and $\mathcal{F}_t$ is a suitable filtration; in the general case, without other knowledge, it would be $$\mathcal{F}_t=\sigma \left (Y^+_t,Y^-_t,VD_1,\dots,VD_{Y^+_t+Y^-_t},S_{1},\dots,S_{Y^+_t+Y^-_t} \right ).$$

Along the lines of your question, however, I think you implicitly assume that $$ P \left ( Y^{\epsilon}_{t+dt}-Y^{\epsilon}_t=1 | \mathcal{F}_t \right )= P \left (Y^{\epsilon}_{t+dt}-Y^{\epsilon}_t=1| Y_t \right ). $$ This means that for $\epsilon=+,-$ there exists a deterministic sequence $(\mu^{\epsilon}_i)_{i\in \mathbb{Z}}$ such that $\lambda^{\epsilon}_t=\mu^{\epsilon}_{Y_t}$.

Within this formalism, your question can be restated as: "is it likely that $ \mu^{+}_{-1} -\mu^{+}_{0}>0$?" (or at least, is the difference larger than a given threshold?).

Under this assumption, it is easy to show that $Y_t$ is a [homogeneous Markov process][3] on $\mathbb{Z}$ with generator $Q$ given by

$$\forall i,j \in \mathbb{Z}\;\;\; Q_{i,i+1}=\mu^{+}_{i},\;\; Q_{i,i-1}=\mu^{-}_{i},\;\; Q_{ii}=-(\mu^{+}_{i}+\mu^{-}_{i}), \;\; Q_{ij}=0 \text{ if } |i-j|>1.$$

Answering the question (by proposing a maximum likelihood estimator for the statistical problem). From this reformulation, solving the problem amounts to estimating $(\mu^{+}_i)$ and building a test upon its values. Let us fix $i$ and drop the index without loss of generality. Estimation of $\mu^+$ (and $\mu^-$) can be done from the observation of

$(T^{1},\eta^1),\dots,(T^{p},\eta^p)$, where $T^j$ is the length of the $j^{th}$ of the $p$ periods spent in state $i$ (i.e. successive times with $Y_t=i$) and $\eta^j$ is $+1$ if the period ended with an upvote, $-1$ if it ended with a downvote, and $0$ if it was the last observed state.

If you set aside the case of the last observed state, the couples above are iid draws from a distribution depending on $\mu_i^+$ and $\mu_i^-$: each is distributed as $(\min(X_+,X_-),\eta)$, where $X_+$ and $X_-$ are exponential random variables with parameters $\mu_i^+$ and $\mu_i^-$, and $\eta$ is $+1$ or $-1$ depending on which of the two realizes the minimum. Then you can use the following simple lemma (the proof is straightforward):

Lemma. If $X_+\leadsto Exp(\mu_+)$ and $X_{-} \leadsto Exp(\mu_{-})$ are independent, then $T=\min(X_+,X_-)\leadsto Exp(\mu_++\mu_-)$ and $P(X_+<X_-)=\frac{\mu_+}{\mu_++\mu_-}$.
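The lemma is easy to check by simulation; here is a quick sketch (Python, with arbitrarily chosen intensities) drawing many pairs of competing exponential clocks:

```python
import random

random.seed(0)
mu_plus, mu_minus = 2.0, 1.0   # arbitrary intensities for illustration
n = 100_000
wins = 0      # how often the "+" clock rings first
times = 0.0   # accumulated min(X_+, X_-)
for _ in range(n):
    x_up = random.expovariate(mu_plus)
    x_down = random.expovariate(mu_minus)
    times += min(x_up, x_down)
    wins += x_up < x_down

# wins / n should be close to mu_plus / (mu_plus + mu_minus) = 2/3,
# and times / n close to 1 / (mu_plus + mu_minus) = 1/3.
```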

This implies that the density $f(t,\epsilon)$ of $(T,\eta)$ is given by $$ f(t,\epsilon)=g_{\mu_++\mu_-}(t) \, \frac{1(\epsilon=+1)\,\mu_++1(\epsilon=-1)\,\mu_-}{\mu_++\mu_-},$$ where $g_a$ for $a>0$ is the density function of an exponential random variable with parameter $a$. From this expression, it is easy to derive the maximum likelihood estimators of $\mu_+$ and $\mu_-$:

$$(\hat{\mu}_+,\hat{\mu}_-)=\operatorname{argmin}_{\mu_+,\mu_-}\; (\mu_-+\mu_+)\sum_{j=1}^p T^j- p_-\ln\left (\mu_-\right ) -p_+ \ln \left (\mu_+\right ),$$ where $p_-=|\{j:\eta^j=-1\}|$ and $p_+=|\{j:\eta^j=+1\}|$. Setting the derivatives to zero gives the closed form $\hat{\mu}_+=p_+/\sum_{j=1}^p T^j$ and $\hat{\mu}_-=p_-/\sum_{j=1}^p T^j$.
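As a sanity check, the estimator can be computed on simulated sojourn data. The following sketch (Python, arbitrary true intensities, no censoring) uses the closed form $\hat{\mu}_{+}=p_+/\sum_j T^j$, $\hat{\mu}_{-}=p_-/\sum_j T^j$, which one obtains by setting the derivatives of the criterion to zero:

```python
import random

random.seed(1)
mu_plus, mu_minus = 2.0, 1.0   # true (unknown) intensities, chosen arbitrarily

# Simulate p sojourns in a fixed state i, each ending when the first of the
# two exponential "vote clocks" rings (no censoring, as in the text).
p = 50_000
total_time = 0.0
p_up = 0
for _ in range(p):
    t_up = random.expovariate(mu_plus)
    t_down = random.expovariate(mu_minus)
    total_time += min(t_up, t_down)
    p_up += t_up < t_down

# Closed-form maximum likelihood estimates.
mu_plus_hat = p_up / total_time
mu_minus_hat = (p - p_up) / total_time
# Both should land close to the true values 2.0 and 1.0.
```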

Comments for more advanced approaches

If you want to take into account the cases where $i$ is the last observed state (certainly smarter, because when a question goes through $-1$ it is often its last score...), you have to modify the reasoning a little. The corresponding censoring is relatively classical...

Other possible approaches include:

  • an intensity that decreases with time;
  • an intensity that decreases with the time elapsed since the last vote (I prefer this one; in this case there are classical ways of modelling how the intensity decreases);
  • assuming that $\mu_i^+$ is a smooth function of $i$;
  • ... you can propose other ideas!
robin girard
  • 6,335
  • 6
  • 46
  • 60
13

Conduct an experiment. Randomly downvote half of the new posts at a particular time every day.
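If such an A/B experiment were run (as suggested in the comments), the simplest analysis might be a two-proportion z-test on the share of posts in each arm that subsequently receive an upvote. Here is a sketch with made-up counts (the arm labels and numbers are purely hypothetical):

```python
from math import sqrt, erf

def two_proportion_z(up_a, n_a, up_b, n_b):
    """Two-sided z-test for a difference in upvote rates between two arms."""
    p_a, p_b = up_a / n_a, up_b / n_b
    pooled = (up_a + up_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value via the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: arm A displayed at -1, arm B displayed (masked) at 0.
z, p = two_proportion_z(up_a=120, n_a=400, up_b=90, n_b=400)
```

A significant positive `z` here would mean the arm shown at -1 was upvoted more often, which is exactly the bias the question asks about.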

charles.y.zheng
  • 7,346
  • 2
  • 28
  • 32
  • 5
    Cool, we should observe a significant increase in "critic" badges and probably a decrease in motivation for new users :-) Better to start with high-rep users, in this case (at risk of biasing the experiment!) – chl Jun 01 '11 at 20:09
  • 15
    Actually we could do better than this ... using AB testing we could pick to display half of the -1 voted question on the site as 0 and half as -1 ... and see if either of the groups is more likely to be upvoted! Ingenious. – Sam Saffron Jun 02 '11 at 07:41
  • 4
    The experiment idea controls the quality of the posts, but (1) those being downgraded should agree in advance to participate in the experiment, and (2) after a brief time, the downgrades should be removed. – zbicyclist Jun 04 '11 at 20:00
  • 2
    +1 (and +1 to all comments here, too): a controlled *reversible* experiment, communicated in advance to all users who might be affected and conducted with their approval, is one of the strongest ways to obtain this information. – whuber Jun 21 '12 at 20:21