I have a very large, sparse dataset of spam Twitter accounts. To visualise the distribution (histogram, KDE, etc.) and the CDF of the various variables (tweets_count, number of followers/following, etc.), I need to rescale the x axis.

    > describe(spammers_class1$tweets_count)
      var       n   mean      sd median trimmed mad min    max  range  skew kurtosis   se
    1   1 1076817 443.47 3729.05     35   57.29  43   0 669873 669873 53.23  5974.73 3.59

In this dataset the value 0 is very important (in fact, 0 should have the highest density). However, on a logarithmic scale these values are dropped, because log(0) is undefined. I thought of changing the zeros to 0.1, for example, but it makes no sense to have spam accounts with 10^-1 followers.

P.S. I mainly use Python for the analysis, but I could use MATLAB or R if there are easy workarounds in those languages.

Alhayer
  • @amaatouq: I would have a look at the [IHS](http://stats.stackexchange.com/a/1630/603) transformation (instead of the log). – user603 May 04 '13 at 19:06
  • Andre, suppose I wanted to plot a histogram or a CDF of the dataset to show that 50% of spammers have between 0 and 10 tweets, 20% have between 11 and 100, fewer than that have between 101 and 1000, and so on up to 10^5, as my max value is 669873. Dividing by 100 or 1000 wouldn't let me convey this observation. – Alhayer May 04 '13 at 19:06
  • @user603: Thank you very much for the reference, but it seems to me that these transformations are applied to the values of x, rather than used to scale the axis that displays x in intervals corresponding to orders of magnitude. I am not sure of what I am saying, so please do correct me if I am wrong. What I am trying to convey will require me to use the actual values of x. – Alhayer May 04 '13 at 19:20
  • @amaatouq: you can apply this transformation to the axis (i.e. not to the data points themselves). If this is what you want (transform the axis, not the data) I can write a simple R example to do that. – user603 May 04 '13 at 19:24
  • @user603: I'd appreciate it if you could do this for me. – Alhayer May 05 '13 at 06:43
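
To make the IHS suggestion from the comments concrete, here is a minimal Python sketch (not user603's promised R example). It assumes the plain inverse hyperbolic sine, asinh(x) = log(x + sqrt(x^2 + 1)), which is defined at 0 and behaves like log(2x) for large x. The helper names `ihs` and `ihs_ticks` and the sample counts are made up for illustration. The idea is to plot the transformed values but label the ticks with the original counts, so the axis still reads in actual values of x.

```python
import math

def ihs(x, scale=1.0):
    """Inverse hyperbolic sine of x / scale; unlike log, it is defined at 0."""
    return math.asinh(x / scale)

def ihs_ticks(max_value):
    """Return (positions, labels) for the ticks 0, 1, 10, 100, ... up to max_value.

    Positions live on the transformed axis; labels are the original values.
    """
    labels = [0]
    power = 1
    while power <= max_value:
        labels.append(power)
        power *= 10
    positions = [ihs(v) for v in labels]
    return positions, labels

# Made-up counts, only to illustrate the shape of the question's data
counts = [0, 0, 0, 1, 5, 35, 443, 669873]
transformed = [ihs(c) for c in counts]        # values to pass to a histogram or KDE
positions, labels = ihs_ticks(max(counts))    # where to draw ticks, and what to print
```

If you plot with matplotlib, `ax.set_xscale('symlog')` is a built-in axis scale with a similar aim (linear near zero, logarithmic further out), so it may save you from transforming by hand.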

1 Answer

If you are just trying to visualize the distribution (and not using it in modeling), you can add 1 to all values, take logs, and then label the axis in the original units. For example, in R, for a density plot (the same idea works for other plots):

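    # Toy data: mostly zeros, plus a few larger values
    x <- c(rep(0, 100), rep(1, 30), rep(2, 20), rep(3, 10), rep(100, 2))
    xt <- log10(x + 1)             # 0 maps to log10(1) = 0 instead of -Inf
    plot(density(xt), xaxt = 'n')  # suppress the default x axis
    # Put each tick at its transformed position, labelled with the original value
    ticks <- c(0, 10, 100)
    axis(1, at = log10(ticks + 1), labels = ticks)
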
Peter Flom
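
Since the question mentions Python, here is a rough port of the same add-1-then-log idea. This is a sketch rather than part of the answer above: the helper names `log10p1`, `axis_ticks` and `plot_hist` are made up, and the plotting function assumes matplotlib is installed (it is imported only when that function is called). Ticks are placed at log10(label + 1), so each label sits exactly where its original value falls on the transformed axis.

```python
import math

def log10p1(values):
    """Shift by 1 and take log10, so that 0 maps to 0 instead of -inf."""
    return [math.log10(v + 1) for v in values]

def axis_ticks(max_value):
    """Return (positions, labels) for the ticks 0, 10, 100, ... up to max_value."""
    labels = [0]
    power = 10
    while power <= max_value:
        labels.append(power)
        power *= 10
    positions = [math.log10(v + 1) for v in labels]
    return positions, labels

def plot_hist(values, bins=30):
    """Histogram of log10(x + 1) with the x axis labelled in original units."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available
    positions, labels = axis_ticks(max(values))
    fig, ax = plt.subplots()
    ax.hist(log10p1(values), bins=bins)
    ax.set_xticks(positions)
    ax.set_xticklabels([str(v) for v in labels])
    ax.set_xlabel("tweets_count")
    ax.set_ylabel("accounts")
    return fig, ax

# Same toy data as the R example
x = [0] * 100 + [1] * 30 + [2] * 20 + [3] * 10 + [100] * 2
xt = log10p1(x)
positions, labels = axis_ticks(max(x))
```

Calling `plot_hist(x)` and then `matplotlib.pyplot.show()` should draw the histogram; for a CDF, the same `positions` and `labels` can be reused on a plot of the sorted `xt` against cumulative fractions.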