74

PSY's music video "Gangnam Style" is popular: a little more than 2 months after its release it has about 540 million views. I learned this from my preteen children at dinner last week, and the discussion soon turned to whether it is possible to predict how many views there will be in 10-12 days, and when (or if) the video will pass 800 million or 1 billion views.

Here is a plot of the number of views since the video was posted: PSY OGS

Here are the corresponding plots for the No. 1 video, "Justin Bieber - Baby", and the No. 2 video, "Eminem - Love the Way You Lie", both of which have been around for much longer: Justin, Eminem

My first thought was that the model should be an S-curve, but this does not seem to fit the No. 1 and No. 2 songs. It also conflicts with the fact that there is no limit on how many views a music video can have, only slower growth.

So my question is: what kind of model should I use to predict the number of views of the music video?

mpiktas
FredrikD
  • 22
    +1 for managing to steer the dinner table conversation from Gangnam to statistics. We need people like you! – Stephan Kolassa Oct 27 '12 at 20:22
  • 4
    What I can add to the discussion that I hope will be useful to gui11aume or others who are writing equations to try to model this, is that in the KONY example, geographic clustering was a significant aspect of the viral spreading. The fact that PSY is a Korean and then Asian phenomenon first, is an important part of the story. Not sure exactly how that would be modeled, but it might be a clue. –  Oct 27 '12 at 23:49
  • Data regarding views, comments, likes and dislikes of the video during November 2012, can be found at https://docs.google.com/spreadsheet/ccc?key=0AstJzCCxOXH1dFlhX3F2Z3dBc0xQS01ZeUpHVUt4VkE – FredrikD Nov 05 '12 at 15:02

6 Answers

38

Aha, excellent question!!

I would also have naively proposed an S-shaped logistic curve, but this is obviously a poor fit. As far as I know, the constant increase is an approximation because YouTube counts the unique views (one per IP address), so there cannot be more views than computers.

We could use an epidemiological model in which people have different susceptibility. To keep it simple, we could divide the population into a high risk group (say, the children) and a low risk group (say, the adults). Let's call $x(t)$ the number of "infected" children and $y(t)$ the number of "infected" adults at time $t$. I will call $X$ the (unknown) number of individuals in the high risk group and $Y$ the (also unknown) number of individuals in the low risk group.

$$\dot{x}(t) = r_1(x(t)+y(t))(X-x(t))$$ $$\dot{y}(t) = r_2(x(t)+y(t))(Y-y(t)),$$

where $r_1 > r_2$. I don't know how to solve that system (maybe @EpiGrad would), but looking at your graphs, we could make a couple of simplifying assumptions. Because the growth does not saturate, we can assume that $Y$ is very large and $y$ is small, or

$$\dot{x}(t) = r_1x(t)(X-x(t))$$ $$\dot{y}(t) = r_2x(t),$$

which predicts linear growth once the high risk group is completely infected. Note that with this model there is no reason to assume $r_1 > r_2$, quite the contrary because the large term $Y-y(t)$ is now subsumed in $r_2$.

This system solves to

$$x(t) = X \frac{C_1e^{Xr_1t}}{1 + C_1e^{Xr_1t}}$$ $$y(t) = r_2 \int x(t)dt + C_2 = \frac{r_2}{r_1} \log(1+C_1e^{Xr_1t})+C_2,$$

where $C_1$ and $C_2$ are integration constants. The total "infected" population is then $x(t) + y(t)$, which has 3 parameters and 2 integration constants (initial conditions). I don't know how easy it would be to fit...
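As a quick sanity check on the closed-form solution above, the following R sketch compares finite-difference derivatives of $x(t)$ and $y(t)$ with the right-hand sides of the simplified system. The parameter values are small, made-up numbers chosen only for illustration; they are not estimates for any video, and the helper names `x_fun` and `y_fun` are hypothetical.

```r
# Illustrative (hypothetical) parameter values, not fitted to any data.
X  <- 100          # size of the high risk group
r1 <- 0.002        # infection rate within the high risk group
r2 <- 0.5          # rate at which the low risk group is infected
C1 <- 1 / (X - 1)  # integration constant that gives x(0) = 1
C2 <- 0

# Closed-form solutions from the text.
x_fun <- function(t) X * C1 * exp(X * r1 * t) / (1 + C1 * exp(X * r1 * t))
y_fun <- function(t) (r2 / r1) * log1p(C1 * exp(X * r1 * t)) + C2

# Central finite differences approximate the time derivatives.
tt <- seq(0, 60, by = 0.5)
h  <- 1e-4
dx <- (x_fun(tt + h) - x_fun(tt - h)) / (2 * h)
dy <- (y_fun(tt + h) - y_fun(tt - h)) / (2 * h)

# Residuals of the two differential equations; both should be close to zero.
res_x <- max(abs(dx - r1 * x_fun(tt) * (X - x_fun(tt))))
res_y <- max(abs(dy - r2 * x_fun(tt)))
print(c(res_x = res_x, res_y = res_y))
```

The same two functions could be summed and passed to a least-squares routine to attempt a fit, although the answer does not claim that this would be easy.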

Update: playing around with the parameters, I could not reproduce the shape of the top curve with this model, the transition from $0$ to $600,000,000$ is always sharper than above. Continuing with the same idea, we could again assume that there are two kinds of Internet users: the "sharers" $x(t)$ and the "loners" $y(t)$. The sharers infect each other, the loners bump into the video by chance. The model is

$$\dot{x}(t) = r_1x(t)(X-x(t))$$ $$\dot{y}(t) = r_2,$$

and solves to

$$x(t) = X \frac{C_1e^{Xr_1t}}{1 + C_1e^{Xr_1t}}$$ $$y(t) = r_2 t+C_2.$$

We could assume that $x(0) = 1$, i.e. that there is only patient 0 at $t=0$, which yields $C_1 = \frac{1}{X-1} \approx \frac{1}{X}$ because $X$ is a large number. $C_2 = y(0)$ so we can assume that $C_2 = 0$. Now only the 3 parameters $X$, $r_1$ and $r_2$ determine the dynamics.

Even with this model the inflection is very sharp. It is not a good fit, so the model must be wrong. That makes the problem very interesting actually. As an example, the figure below was built with $X = 600{,}000{,}000$, $r_1 = 3.667 \cdot 10^{-10}$ and $r_2 = 1{,}000{,}000$.

growth model of Gangnam style
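To put a number on how sharp the transition is, here is a rough R sketch of the sharers/loners solution with the parameter values quoted above. The helper names (`x_fun`, `total_fun`) are hypothetical, and `t_mid` and `width` are just the usual logistic midpoint and 10%-to-90% rise time, not quantities defined in the answer.

```r
# Parameter values quoted in the answer.
X  <- 6e8          # number of "sharers"
r1 <- 3.667e-10    # infection rate among sharers
r2 <- 1e6          # views per day contributed by "loners"
C1 <- 1 / (X - 1)  # from x(0) = 1
C2 <- 0            # from y(0) = 0

# Closed-form solution, written with exp(-...) to avoid overflow for large t.
x_fun     <- function(t) X / (1 + exp(-X * r1 * t) / C1)
total_fun <- function(t) x_fun(t) + r2 * t + C2

# Day on which half of the sharers are infected, and the time the logistic
# part needs to go from 10% to 90% of X.
t_mid <- log(1 / C1) / (X * r1)
width <- 2 * log(9) / (X * r1)

days <- 0:200
plot(days, total_fun(days), type = "l", lwd = 2,
     xlab = "days since upload", ylab = "views")
abline(v = t_mid + c(-1, 1) * width / 2, lty = 3)  # 10% and 90% marks
print(c(t_mid = t_mid, width = width))
```

With these values the logistic part climbs from 10% to 90% of $X$ in roughly 20 days, which is the sharp inflection discussed above.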

Update: From the comments I gathered that YouTube counts views (in its secret way) and not unique IPs, which makes a big difference. Back to the drawing board.

To keep it simple, let's assume that the viewers are "infected" by the video. They come back to watch it regularly, until they clear the infection. One of the simplest models is the SIR (Susceptible-Infected-Resistant) which is the following:

$$\dot{S}(t) = -\alpha S(t)I(t)$$ $$\dot{I}(t) = \alpha S(t)I(t) - \beta I(t)$$ $$\dot{R}(t) = \beta I(t)$$

where $\alpha$ is the rate of infection and $\beta$ is the rate of clearance. The total view count $x(t)$ is such that $\dot{x}(t) = kI(t)$, where $k$ is the average views per day per infected individual.

In this model, the view count starts increasing abruptly some time after the onset of the infection, which is not the case in the original data, perhaps because videos also spread in a non viral (or meme) way. I am no expert in estimating the parameters of the SIR model. Just playing with different values, here is what I came up with (in R).

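# SIR-type model iterated in daily steps (parameter values chosen by hand).
S0 <- 1e7   # initial number of susceptible viewers
a  <- 5e-8  # infection rate (alpha)
b  <- 0.01  # clearance rate (beta)
k  <- 1.2   # average views per day per infected viewer
views <- 0; S <- S0; I <- 1
# Extrapolate 1 year after the onset.
for (i in 1:365) {
  dS <- -a * I * S
  dI <- a * I * S - b * I
  S <- S + dS
  I <- I + dI
  views[i + 1] <- views[i] + k * I
}
par(mfrow = c(2, 1))
# Top panel: the first 95 days only.
plot(views[1:95], type = 'l', lwd = 2, ylim = c(0, 6e8))
# Bottom panel: the first 95 days (solid) and the extrapolation (dashed).
plot(views, type = 'n', lwd = 2)
lines(views[1:95], type = 'l', lwd = 2)
lines(96:365, views[96:365], type = 'l', lty = 2)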

Extrapolation of the views of the Gangnam style Youtube video

The model is obviously not perfect, and could be complemented in many sound ways. This very rough sketch predicts a billion views somewhere around March 2013, let's see...
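For reference, the day on which the simulated view count first passes one billion can be read off directly from the simulation. The sketch below reruns the same discrete-time recursion with the same parameter values. The onset date (15 July 2012, the upload date of the video) is an assumption used only to convert the day index into a calendar date; the answer does not state which day it takes as day 0.

```r
# Same parameter values as in the simulation above.
S0 <- 1e7; a <- 5e-8; b <- 0.01; k <- 1.2
n_days <- 365

views <- numeric(n_days + 1)  # views[1] is the count at onset (zero)
S <- S0
I <- 1
for (i in seq_len(n_days)) {
  dS <- -a * I * S
  dI <- a * I * S - b * I
  S <- S + dS
  I <- I + dI
  views[i + 1] <- views[i] + k * I
}

# First day (counted from the onset) with at least one billion views.
day_1e9 <- which(views >= 1e9)[1] - 1

# Assumed onset date, used only to turn the day index into a date.
onset <- as.Date("2012-07-15")
print(day_1e9)
print(onset + day_1e9)
```

Shifting `onset` moves the predicted date by the same number of days, so the calendar date is only as good as that assumption.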

gui11aume
  • 5
    (+1) As a first approach. Note that YouTube's policy for counting views is not well understood given that they have not made their algorithm public. They only say: "A view is counted whenever someone watches a video on YouTube. We do not get more specific than this to avoid attempts at artificially inflating view counts" [(see)](http://www.youtube.com/t/faq). –  Oct 27 '12 at 13:47
  • @Procrastinator Thanks for the tip. That makes it very hard to model then... – gui11aume Oct 27 '12 at 14:06
  • @gui11aume, I like the model of the high and low risk groups, but it seems like the model increases too sharply at the end. Inspired by your model, perhaps the "contagion" phase ends/phases out and then the number of views is proportional to general viewing – FredrikD Oct 27 '12 at 19:02
  • Thanks! Yes, I noticed. I think @Procrastinator raised an important issue. Both models assume that users can view the video (be infected) only once, but this is probably not correct. Your model is interesting, how would you write it? – gui11aume Oct 27 '12 at 19:09
  • FYI YouTube counts a video as watched if at least 90% is played. It doesn't go by IP address either, because sometimes whole companies or schools are behind a proxy server, and YouTube would only see that proxy's IP address. There are also services that sell YouTube views, artificially inflating view counts by making robots and botnets watch the videos. – Chloe Oct 28 '12 at 02:01
  • Using the input from @Procrastinator and others makes a better model. It probably lacks a non viral component. The model can be completed, but parameter estimation will become more and more difficult. – gui11aume Oct 28 '12 at 16:18
  • @gui11aume, agree on the complexity issue. Given the context of the question, your answer is accepted. Also, the team behind "OGS" seems to be doing their best to increase S and k (other marketing activities) and to increase the chance of mutations (dance instruction videos, their own "mutations"), so there seem to be things external to what we can observe and model that affect the real number of viewers. – FredrikD Oct 29 '12 at 08:05
  • 3
    @FredrikD thanks. You can still remove the 'accept' in March 2013 if I got it wrong :D – gui11aume Oct 29 '12 at 08:08
  • @gui11aume, I checked the value today, it was almost 630M. The model predicts around 620M which is good. However, if you look at the comment stream around the video, you see that there is a kind of "viewing" frenzy. Some of the "infected" are actively aiming for a billion, i.e. they increase the k-factor in your model. – FredrikD Nov 03 '12 at 07:59
  • @FredrikD yes, so far the model seems to hold up more or less. It is surprising given that parameter estimation was arbitrary. Do you know whether we can get a day-by-day view count in numeric format somewhere? – gui11aume Nov 03 '12 at 13:52
  • @gui11aume, found this on Stackexchange, http://stackoverflow.com/a/8199756/1494569 – FredrikD Nov 03 '12 at 18:38
  • Here is gui11aume's model rewritten in Mathematica, see http://mathematica.stackexchange.com/a/14047/1635 The difference equation itself is rewritten in this answer which might make it easier to estimate the parameters – FredrikD Nov 03 '12 at 20:37
  • 2
    SIR model parameter estimation, see http://rsfs.royalsocietypublishing.org/content/2/2/156.full – FredrikD Nov 05 '12 at 15:11
  • 1
    It seems I am going to lose this one! They may hit the billion even before 2013... – gui11aume Dec 19 '12 at 18:49
  • 2
    http://www.engadget.com/2012/12/21/gangnam-style-one-billion-views/ So the world didn't end but 1 Billion views was hit today. – DanTheMan Dec 21 '12 at 20:36
5

Probably the most common model for forecasting new product adoption is the Bass diffusion model, which - similar to @gui11aume's answer - models interactions between current and potential users. New product adoption is a pretty hot topic in forecasting, searching for this term should yield tons of info (which I unfortunately don't have the time to expand on here...).
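For concreteness, here is a minimal R sketch of the cumulative adoption curve of the standard Bass model, $N(t) = m\,\frac{1 - e^{-(p+q)t}}{1 + (q/p)\,e^{-(p+q)t}}$, where $p$ is the coefficient of innovation, $q$ the coefficient of imitation and $m$ the market potential. The function name `bass_cumulative` and the numerical values are placeholders picked for illustration, not estimates for this video.

```r
# Cumulative adopters under the Bass diffusion model.
# p: coefficient of innovation, q: coefficient of imitation, m: market potential.
bass_cumulative <- function(t, p, q, m) {
  e <- exp(-(p + q) * t)
  m * (1 - e) / (1 + (q / p) * e)
}

# Placeholder parameter values (hypothetical, for illustration only).
p <- 0.001
q <- 0.1
m <- 1e9

days  <- 0:200
adopt <- bass_cumulative(days, p, q, m)

plot(days, adopt, type = "l", lwd = 2,
     xlab = "days since launch", ylab = "cumulative adopters")
```

Note that this curve counts each adopter once, so on its own it saturates at $m$.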

Stephan Kolassa
  • yes, that is also a candidate model. However, it seems like it assumes that you only can be a user once. Here, you view the video a number of times if you are "infected". – FredrikD Oct 29 '12 at 10:01
  • 1
    @FredrikD: point taken. (Though I personally didn't manage to sit even through a single "use" of this "product"...) There should be generalizations of Bass to deal with this. (Shameless plug:) Next year's _International Symposium of Forecasting_ is in Seoul, so anyone should consider presenting his/her favorite Gangnam forecasting model there! ;-) – Stephan Kolassa Oct 29 '12 at 10:04
4

I would look at the Gompertz growth curve.

The Gompertz curve is a 3-parameter (a,b,c) double-exponential formula with time, t, as the independent variable.

R code:

gompertz_growth <- function(a, b, c, t) { a * exp(b * exp(c * t)) }

The Gompertz growth formula is known to be good at describing many life-cycle phenomena in which growth first accelerates and then tapers off, resulting in an asymmetric sigmoid curve whose derivative is steeper on the left of the peak than on the right. For example, the total number of articles on Wikipedia, which is also viral in nature, has been following a Gompertz growth curve (with certain a,b,c parameters) for many years with great accuracy.

Chart of the Gompertz curves: total size and its growth rate derivative
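The asymmetry can be checked numerically. The sketch below evaluates the curve for made-up parameter values (with $b < 0$ and $c < 0$, so that the curve rises towards the asymptote $a$) and locates the peak of the growth rate. For this parameterisation the peak falls at $t^* = \log(-1/b)/c$, where the curve has reached only $a/e \approx 37\%$ of its asymptote. The names `par_a`, `par_b` and `par_c` are hypothetical.

```r
# Gompertz curve: a is the asymptote, b < 0 sets the displacement,
# c < 0 sets the growth rate.
gompertz <- function(t, a, b, c) a * exp(b * exp(c * t))

# Hypothetical parameter values, chosen only for illustration.
par_a <- 1e9
par_b <- -5
par_c <- -0.05

tt   <- seq(0, 200, by = 0.01)
y    <- gompertz(tt, par_a, par_b, par_c)
rate <- diff(y) / diff(tt)  # numerical growth rate

t_peak_numeric  <- tt[which.max(rate)]
t_peak_analytic <- log(-1 / par_b) / par_c  # second derivative vanishes here

print(c(numeric = t_peak_numeric, analytic = t_peak_analytic))
print(gompertz(t_peak_analytic, par_a, par_b, par_c) / par_a)  # exp(-1)
```

Because the growth rate peaks well below half of the asymptote, the curve spends more time in the slow approach to $a$ than in the initial rise.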

Edit: If the Gompertz curve isn't enough to approximate the shape you're looking for, you may want to add parameters d & θ as described in The Exponentiated Generalized Weibull Gompertz Distribution. Note that this paper uses x instead of t for the independent time parameter. Interestingly, Wikipedia also modified their best approximation by adding a single 4th parameter d, to account for a prediction divergence from the actual value after 2012. The modified 4-param Gompertz curve formula is:

gompertz_2 <- function(a, b, c, d, t) { a * exp(b * exp(c * t) + d * t) }

The Gompertz function is named after Benjamin Gompertz (1779-1865), a Gauss contemporary (just 2 years Gauss' junior), the first mathematician to describe it.

arielf
  • Good point! However, what challenges the model is that there doesn't seem to be a limit (see the No1 and No2). That is, the factor a in the model is also increasing over time. – FredrikD Nov 04 '12 at 09:43
  • I would challenge the "There doesn't seem to be a limit." Can Gangnam Style reach 1B? 10B? 100B views? Eventually the growth rate gets to near zero and the curve plateaus. This is hard to see when you're in the high growth phase, like we are now with Gangnam, but just wait a few years and you'll see Gompertz win :) The trick is, of course, to figure out the right (a,b,c) parameters for this specific case. – arielf Nov 05 '12 at 02:21
  • 2
    Here is a reference for estimating the parameters of the Gompertz model, see http://www.weibull.com/RelGrowthWeb/Parameter_Estimation_for_the_Gompertz_Models_Using_Least_Squares_in_Nonlinear_Regression.htm – FredrikD Nov 05 '12 at 15:18
3

I think you need to separate phenomena like Gangnam Style, which owes much of its views to being a meme/viral thing, from Justin Bieber and Eminem, who are big artists in their own right and who would also spread widely in a traditional setting. JB or Eminem would sell a lot of singles too; I'm not sure that PSY would.

abaumann
  • good point. After reading & listening to interviews of PSY and the team behind "OGS" (Oppa Gangnam Style), it is clear that they are well aware of which buttons to press to create a viral thing. Through some image analysis of the views picture above, it seems like the number of views is linear up to about 90 days after launch; then PSY appears at the Korean Grand Prix and the number of views per time unit increases. – FredrikD Oct 28 '12 at 13:08
  • - and how do these two classes differ from "classics", i.e. songs that were presumably well-known when they were first uploaded to YouTube (I'm thinking David Bowie)? – abaumann Oct 28 '12 at 14:08
2

OK guys, we need some stylised facts about the diffusion of youtube videos, which turn out to suggest patterns rather different from the usual product diffusion literature. Good place to start is Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, and Sue Moon, 2007, I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System, Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, ISBN: 978-1-59593-908-1.

and

X Cheng, C Dale, J Liu, 2008, Statistics and social network of youtube videos, in proceedings of International Workshop on Quality of Service (IWQoS), Enschede, the Netherlands, June.

MichaelChirico
ProfRoy47
  • 5
    Welcome to the site, @ProfRoy47. Would you mind elaborating on this post somewhat? It's not clear that this is actually an answer to the OP's question yet / that it quite stands on its own. OTOH, it wouldn't fit as a comment, & I think it has the makings of a helpful contribution to this thread. Our [FAQ](http://stats.stackexchange.com/faq) has some discussion re providing answers on CV, which may be helpful to you. – gung - Reinstate Monica Dec 19 '12 at 17:58
1

> The model is obviously not perfect, and could be complemented in many sound ways. This very rough sketch predicts a billion views somewhere around March 2013, let's see...

Looking at the slowdown in views over the past week, the Mar-13 date looks like a decent bet. The majority of the new views appear to be already infected users that return multiple times per day.

With regards to complementing your model, one method that researchers use to track a virus' spread is to monitor its genome mutations - when and where it mutated can show researchers how fast a virus is transmitted and spread (see tracking West Nile Virus in USA).

In a practical sense, videos like Gangnam Style and Party Rock Anthem (by the group LMFAO) are more likely to 'mutate' into parodies, flash mobs, wedding dances, remixes and other video responses than say, Justin Bieber's Baby or Eminem's songs.

Researchers could analyse the number of video responses (and parodies in particular) as a proxy for mutations. Measuring the frequency and popularity of these mutations early in the life of the video could be useful in modelling its lifetime YouTube views.

gung - Reinstate Monica
lucasng
  • Welcome to the site, @lucasng. CV is intended for serious, factual answers to substantive questions (you may want to read our [faq](http://stats.stackexchange.com/faq)), & I think the OP has asked w/ this in mind. Your answer is on the borderline here; I think it should stay based on its ideas about mutations etc, but note that opinions about the merits of the videos aren't really germane. – gung - Reinstate Monica Oct 29 '12 at 03:01
  • I think the idea is good. @gung True that it is not an answer to the OP, but the second answer also isn't. – gui11aume Oct 29 '12 at 08:07
  • @gung: (A Google search suggests that) lucasng was not stating an opinion in the part you redacted but rather citing the name of the group that performs the song! – cardinal Nov 18 '12 at 03:48
  • 1
    @cardinal, thanks for the heads up. Lucasng, sorry about the confusion; I have put the group name back. – gung - Reinstate Monica Nov 18 '12 at 13:06