4

There is a pretty cool graph I would like to recreate just for illustration purposes. There are no vital inferences that are hanging in the balance, so some smudging of the numbers is perfectly fine. I basically just want to capture the general features of the data and present them in a very similar way. Here is the reference: enter image description here

Question: Can someone provide some pseudo-code or Python code for creating a graph that is pretty similar to the one above? It seems the mean is around 10^8, but the spread is very tricky (for me at least). The other tricky part is reproducing the large concentration of data points that lie under the diagonal line. Note that the spread is not symmetric about the diagonal line.

Further Clarifications

  • Observations: 500 (probably a smaller data set than the original, I don't need 1 billion dots)
  • Scale: log
  • Optional Components: diagonal line, labels and cluster ellipses are all optional, you may omit if desired
Arash Howaida
  • 3
    Is there statistical content in this question? If the question is about how to draw a diagram in Python, then it is off topic here. If there is some special property of the distribution that you want to recreate, please describe what this distributional property is. – Bernhard Jun 05 '18 at 13:21
  • If you just want to create a graph, other tools might be preferable -- maybe an image manipulation program? If you however want to create/simulate data that has a similar scatter plot, it is a bit more work. Please notice that the scale is double-logarithmic. Maybe you can get away with creating such a plot with linear scales and replace the labels afterwards. – cherub Jun 05 '18 at 13:23
  • 1
    To those who have voted to close this question as not being about statistics, all I can reply is "really??" A solution requires identifying appropriate multivariate descriptive statistics from a plot and then simulating a dataset from that description. It's hard to imagine an exercise that could be any more fundamentally statistical, IMHO. I do agree there is some vagueness in the formulation, because we aren't told *which* characteristics to reproduce--but we are pointed to several of them, enough to formulate reasonable answers. – whuber Jun 05 '18 at 20:00
  • 1
    A [similar question](https://stats.stackexchange.com/questions/114610/what-is-the-relationship-between-y-and-x-in-this-plot) with a much more statistical bent, and several high-quality answers, is already available. – AkselA Jun 05 '18 at 20:58

3 Answers

7

Use a tool like WebPlotDigitizer, which extracts points from images based on color and other variables. The tool lets you define the range of each axis (including log scales) so that every point gets proper coordinates. With a 5-minute attempt I extracted many points (1305, to be precise) and built the following plot using plotly (which the above tool can export to in one click!):

enter image description here

Data is here in csv format. Just import your data into R and then plot (e.g. using ggplot2).

There are tons of online tutorials on how to create scatterplots in R from imported data

PS: again, I just spent 5 minutes doing this. For proper replication you need to be more careful, like eliminating points in the middle, proper axes scale, etc.
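For readers who would rather stay in Python, the import step might look like the sketch below. It assumes the digitized file has two numeric columns (x, then y) in raw units and possibly a header row; that layout is an assumption, since the CSV itself is not reproduced here. The function name `load_log10_points` and the sample values are made up for illustration.

```python
import csv
import io
import math

def load_log10_points(handle):
    """Read two-column (x, y) rows and return their log10 values.

    Rows that are non-numeric (for example a header), too short, or
    non-positive are skipped, since log10 is undefined for them.
    """
    points = []
    for row in csv.reader(handle):
        try:
            x, y = float(row[0]), float(row[1])
        except (ValueError, IndexError):
            continue  # header line, blank line, or malformed row
        if x > 0 and y > 0:
            points.append((math.log10(x), math.log10(y)))
    return points

# Tiny made-up sample standing in for the digitized CSV.
sample = io.StringIO("x,y\n1e7,5e6\n1e8,4e7\n1e9,9e8\n")
log_points = load_log10_points(sample)
print(len(log_points), log_points[0])
```

With a real file you would pass `open("points.csv", newline="")` instead of the `StringIO` object (the file name is hypothetical). If you plot the raw values rather than the log10 values, remember to switch both axes to a log scale, for example with matplotlib's `plt.xscale("log")` and `plt.yscale("log")`; on linear axes most points pile up in the lower left corner.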

luchonacho
  • Very cool! Is that csv in log form already? I'm getting a different scatter plot than yours, I'm hoping it's just a data transformation thing. – Arash Howaida Jun 05 '18 at 17:07
  • After trying to plot in Python and d3 I got the same weird result from the ubuntu data you linked. It is 1305 observations long, like you said, but the scatter is very different than the one in your answer. It has a lot of values in the lower left corner and doesn't really resemble a scatter plot. Could you confirm the csv data are still the ones you used? – Arash Howaida Jun 05 '18 at 17:51
  • 2
    This answer misses the interesting point of the question, which is to produce a *similar* dataset, rather than the same one. See my comment to the question for why I think that distinction is so important. BTW, we already have a [thread about scraping graphics for data.](https://stats.stackexchange.com/questions/14437/software-needed-to-scrape-data-from-graph) If this question turns out to be just recreating a plot, then it should be closed as a duplicate. – whuber Jun 05 '18 at 20:03
  • @whuber The OP says: "There is a pretty cool graph I would like to recreate just for illustration purposes. There are no vital inferences that are hanging in the balance, so some smudging of the numbers is perfectly fine. I basically just want to capture the general features of the data and present them in a very similar way." Isn't that what I am addressing? – luchonacho Jun 06 '18 at 13:02
  • You have essentially photocopied the image. If your interpretation of the question is correct, then the question doesn't belong on this site. – whuber Jun 06 '18 at 13:47
  • @whuber I have essentially achieved the wanted result by providing an alternative approach (one the OP very likely had not thought of) to what the OP was asking. There are plenty of similar posts here saying "hey, instead of doing A, try B". – luchonacho Jun 06 '18 at 13:52
  • @ArashHowaida Have you log-scaled the axes and selected the range of the axes as in the above graphs? For the data, just copy into an excel and do text to columns. – luchonacho Jun 06 '18 at 13:53
  • Scraping the plot does not appear to respond to the explicit question, which is "I basically just want to capture the general features of the data and present them in a very similar way." Your answer does not analyze the "general features" and it presents them exactly as given. – whuber Jun 06 '18 at 13:54
  • @whuber That's something the OP has to judge, imo. – luchonacho Jun 06 '18 at 13:55
  • Certainly--but we get to judge it too. If the question turns out to be focused on scraping the data or the mechanics of reproducing the plot, then it will not be on-topic here and will be closed. – whuber Jun 06 '18 at 14:04
5

I can provide you with some R code that produces a similar picture. It generates points along the diagonal and then shifts the data on the lower end of the scale (x < 8.5 in log10 units) down by the constant log(2):

x <- rnorm(1000, mean = 7.5, sd = 1)   # 1000 random values for x (log10 units)

y <- x + rnorm(1000, sd = 0.25)        # add random noise to x; changing sd changes how clearly you see the 'shift' between the groups
# by construction these data lie along the diagonal line

y[x < 8.5] <- y[x < 8.5] - log(2)      # shift the data on the lower end of the scale down by log(2) to create two types of users

plot(10^x, 10^y, log = "xy")           # plot on the log-log scale

abline(a = 0, b = 1)                   # add the diagonal line

enter image description here
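Since the question asked for Python, here is a rough translation of the same idea using only the standard library. It is a sketch under the same assumptions as the R code above (values are generated in log10 units, with the same mean, standard deviations, and cutoff); the function name `simulate_shifted` and the seed are made up for illustration. One detail worth knowing: `log(2)` is the natural logarithm, about 0.693, so in log10 units the shifted points sit roughly a factor of 10^0.693 ≈ 4.9 below the diagonal, not a factor of 2.

```python
import math
import random

def simulate_shifted(n=500, mean=7.5, sd=1.0, noise_sd=0.25,
                     cutoff=8.5, shift=math.log(2), seed=1):
    """Return (x, y) pairs in log10 units.

    y follows the diagonal y = x plus Gaussian noise; points with
    x below `cutoff` are then moved down by `shift`.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(mean, sd)
        y = x + rng.gauss(0.0, noise_sd)
        if x < cutoff:
            y -= shift  # the "lower end" group sits below the diagonal
        pairs.append((x, y))
    return pairs

pairs = simulate_shifted()
# Average vertical offset from the diagonal in each group.
low = [y - x for x, y in pairs if x < 8.5]
high = [y - x for x, y in pairs if x >= 8.5]
print(sum(low) / len(low), sum(high) / len(high))
```

To draw it, convert back with `10 ** x` and `10 ** y` and use log-scaled axes in the plotting library of your choice. The two printed averages should land near -0.69 and near 0, which is the gap between the two groups.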

Alex2006
2

This does not feel like an answer, but I cannot use graphics in comments. This plot is not very refined yet. Is it roughly what you asked for, or is something important missing? I am still not sure what the problem is here.

enter image description here

x <- rnorm(500,8)
y <- x - runif(500)+rnorm(500,0,.3)

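# Note: this raises to the 10th power (x^10), not 10^x;
# with x near 8, x^10 is about 8^10, roughly 1e9
x <- x^10
y <- y^10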
plot(x, y, 
     log="xy", xlim = c(1e7, 5e10),
     xlab="Daily outbound traffic [bytes]",
     ylab="Daily inbound traffic [bytes]")

This is R but as you were ready to accept Pseudocode...
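The part of the R code above that creates the lopsided cloud is the `- runif(500)` term: uniform noise on [0, 1] is never negative, so subtracting it only ever pushes y down, while the small normal term scatters points in both directions. A minimal Python sketch of just that step, standard library only, might look like this. The function names and the seed are made up for illustration, and the sketch stops before the power transform, which does not change which side of the diagonal a point falls on as long as the values are positive.

```python
import random

def simulate_asymmetric(n=500, seed=1):
    """Return (x, y) pairs where y is x minus one-sided uniform noise
    plus a small symmetric Gaussian term."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(8.0, 1.0)
        # rng.random() is in [0, 1), so it only ever lowers y.
        y = x - rng.random() + rng.gauss(0.0, 0.3)
        pairs.append((x, y))
    return pairs

def fraction_below_diagonal(pairs):
    """Share of points with y < x."""
    return sum(1 for x, y in pairs if y < x) / len(pairs)

pairs = simulate_asymmetric()
print(round(fraction_below_diagonal(pairs), 2))
```

A symmetric noise model would put about half the points below the diagonal; here the large majority end up below it. Widening the uniform range or shrinking the Gaussian standard deviation makes the asymmetry stronger.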

Here is another flavour of R with a different look. The same question applies: does this capture the essence of what you were looking for?

enter image description here

library(ggplot2)
d <- data.frame(x=x, y=y)
p <- ggplot(d, aes(x=x, y=y)) + geom_point() +
     scale_x_continuous(trans = 'log10') +
     scale_y_continuous(trans = 'log10') +
     geom_smooth(method="lm") +
     xlab("Daily outbound traffic [bytes]") +
     ylab("Daily inbound traffic [bytes]")
print(p)

Yes, there is lots of room for plot improvement, but as long as that is a question of programming language rather than statistics, it is off topic.

Bernhard
  • Thank you. The main thing is how to create that high concentration of data points below the diagonal line. Note that it is not symmetric about the diagonal line, there is a big cluster below it, but not above it. I'll update my post for clarity. – Arash Howaida Jun 05 '18 at 13:49
  • That was my question when I asked for "special property of the distribution that you want to recreate". However, after seeing the answer by @luchonacho that question is not important anymore - take his data and any plotting software of your choice. – Bernhard Jun 05 '18 at 14:35