
I have a dataset of 3,270 news stories (DOWNLOAD HERE from Mega).

For each story I have calculated a basic sentiment score for the title, description and body texts (simply the count of positive words minus the count of negative words).

BUT this doesn't take into account the number of words in the text being analysed, so the body scores vary a LOT more than, say, the title scores. This makes plotting a real pain.

What I want is for the title, description and body a score that sits on the scale -1 (very negative) to +1 (very positive) where 0 is a neutral sentiment.

I suspect some sort of weighted average is the right approach, but i) is this the right thing to do, and ii) how does one do that in R for the three scores needed?

Can anyone help please?

BarneyC

2 Answers


You would like a function $f(x)$ which maps the range $[\min, \max]$ to $[a, b]$. As taken from this question on SO,

$f(\min) = a$

$f(\max) = b$

Following the intuition in that question, we end up here for arbitrary $a$ and $b$ with $b > a$:

$f(x) = \frac{(b-a)(x-\min)}{\max-\min} + a$

Moving this into R, we can construct a function for this, given an input vector r:

# Rescale a numeric vector to the interval [lower_bound, upper_bound]
scale <- function(vector = NULL, lower_bound = NULL, upper_bound = NULL){
  if(is.null(vector)){
    stop("Please provide input")
  } else if(is.null(lower_bound)){
    stop("Please provide lower bound")
  } else if(is.null(upper_bound)){
    stop("Please provide upper bound")
  }

  min <- min(vector)   # smallest observed value maps to lower_bound
  max <- max(vector)   # largest observed value maps to upper_bound
  a <- lower_bound
  b <- upper_bound

  # apply the linear map element by element, then flatten back to a vector
  new <- lapply(vector, function(x) ((b - a) * (x - min) / (max - min)) + a)
  return(unlist(new))
}

Testing:

r <- c(1,2,3,4,5,6,7,8,9,10)
scale(r, -1, 1)

[1] -1.0000000 -0.7777778 -0.5555556 -0.3333333 -0.1111111
 [6]  0.1111111  0.3333333  0.5555556  0.7777778  1.0000000

scale(r, 10, 100)

[1]  10  20  30  40  50  60  70  80  90 100

System timing:

For an input vector of length 10:

ptm <- proc.time()
scale(r, -1, 1)
proc.time() - ptm

 user  system elapsed 
  0.000   0.001   0.002 

For an input vector of length 10,000:

r <- runif(10000)

ptm <- proc.time()
scale(r, -1, 1)
proc.time() - ptm

 user  system elapsed 
  0.174   0.023   0.169 

And for an input vector of length 1,000,000:

r <- runif(1000000)

ptm <- proc.time()
scale(r, -1, 1)
proc.time() - ptm

   user  system elapsed 
  3.824   0.063   3.862 

So it slows down a little as the input gets larger, but it is still fairly speedy and accurate.
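Most of that time goes into the element-by-element lapply. Since the mapping is plain arithmetic, a fully vectorised version should compute the same result in one pass; here is a minimal sketch (the name scale_vec is just illustrative, and it assumes the input has at least two distinct values so the denominator is non-zero):

scale_vec <- function(x, lower_bound, upper_bound){
  lo <- min(x)
  hi <- max(x)
  # same linear map as above, applied to the whole vector at once
  (upper_bound - lower_bound) * (x - lo) / (hi - lo) + lower_bound
}

r <- c(1,2,3,4,5,6,7,8,9,10)
scale_vec(r, -1, 1)   # identical to scale(r, -1, 1)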

EDIT: Below is a similar function which first separates the data into positive and negative portions in order to keep 0 neutral.

# Rescale to [lower_bound, upper_bound] while keeping 0 fixed at 0:
# negatives are mapped onto [lower_bound, 0], positives onto [0, upper_bound]
scale2 <- function(vector = NULL, lower_bound = NULL, upper_bound = NULL){
  if(is.null(vector)){
    stop("Please provide input")
  } else if(is.null(lower_bound)){
    stop("Please provide lower bound")
  } else if(is.null(upper_bound)){
    stop("Please provide upper bound")
  }

  a <- lower_bound
  b <- upper_bound

  # zeros fall into both halves; the duplicates are dropped from the positive half below
  positive <- vector[vector >= 0]
  negative <- vector[vector <= 0]

  # map [min(positive), max(positive)] onto [0, b] and [min(negative), max(negative)] onto [a, 0]
  p <- lapply(positive, function(x) ((b - 0) * (x - min(positive)) / (max(positive) - min(positive))) + 0)
  n <- lapply(negative, function(x) ((0 - a) * (x - min(negative)) / (max(negative) - min(negative))) + a)

  # delete the duplicated zeros from the positive half
  p[p == 0] <- NULL

  # note: negatives are returned first, then positives, so the original order is not preserved
  return(unlist(list(n, p)))
}

Example:

t <- c(-100, -20, -5, 0, 0, 0, 10, 42, 904)
scale2(t, upper_bound = 1, lower_bound = -1)
[1] -1.00000000 -0.20000000 -0.05000000  0.00000000  0.00000000  0.00000000  0.01106195
[8]  0.04646018  1.00000000
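One caveat: scale2 returns the negative values first and then the positives, so the output no longer lines up element by element with the input. If that matters, here is a minimal order-preserving sketch (the name scale2_ordered is just illustrative) that scales each half proportionally through zero; it should reproduce the output above whenever the data contains zeros, but it is not an exact reimplementation of scale2:

scale2_ordered <- function(vector, lower_bound = -1, upper_bound = 1){
  # assumes the vector contains at least one positive and one negative value
  out <- numeric(length(vector))
  pos <- vector > 0
  neg <- vector < 0
  out[pos] <- upper_bound * vector[pos] / max(vector[pos])   # positives -> (0, upper_bound]
  out[neg] <- lower_bound * vector[neg] / min(vector[neg])   # negatives -> [lower_bound, 0)
  out                                                        # zeros stay exactly 0
}

t <- c(-100, -20, -5, 0, 0, 0, 10, 42, 904)
scale2_ordered(t)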
Chris C
  • Finally gotten around to dissecting this properly. The solution kind of works, BUT it scales the entire vector into the range -1:1, shifting 0 values up or down when a zero should still be a zero. This makes me think the input vector first needs to be split into positive and negative value sets, each of which gets scaled and then recombined with the zero values. – BarneyC Jun 08 '15 at 11:20
  • @BarneyC, Please see the edit for the new function doing what you would like. I hope this will work for you. – Chris C Jun 08 '15 at 20:06
  • That certainly does scale within the two ranges rather nicely. – BarneyC Jun 09 '15 at 11:59
  • I need to cite from a book or a journal article for this scaling formula. Would anyone point out where I can find this type of scaling system? – Faisal Mustafa May 30 '21 at 07:44

If I have read your question correctly, there is a simple way to rescale your variables into the interval $[-1,1]$. Let $M$ denote the mid-range, i.e. the average of the smallest and largest observations, and let $R$ denote the range, i.e. the difference between the largest and smallest observations. Then, for every $X_i$ define the new (rescaled) variable

$$Y_i=\frac{X_i-M}{R/2}$$

and it is readily verified that $Y$ is constrained to the interval $[-1,1]$: at $X_i = \max$ we get $Y_i = \frac{\max - M}{R/2} = 1$, and at $X_i = \min$ we get $Y_i = -1$. Here is an extremely simple example in R.

x <- rnorm(1000, 50, 1)
hist(x) # check the location

# subtract the mid-range, then divide by half the range
y <- (x - (min(x) + max(x)) / 2) / ((max(x) - min(x)) / 2)
hist(y)
# All between -1 and 1 now!

You can wrap it up into a function and apply it at will!
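For instance, a minimal sketch of such a wrapper (the name rescale_midrange is just illustrative):

rescale_midrange <- function(x){
  M <- (min(x) + max(x)) / 2   # mid-range
  R <- max(x) - min(x)         # range
  (x - M) / (R / 2)
}

y <- rescale_midrange(rnorm(1000, 50, 1))
range(y)   # endpoints are exactly -1 and 1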

JohnK