I want to analyze the effects of the Covid pandemic on online communication. My hypothesis is that user comments on discussion forums are significantly more negative in tone during the Covid pandemic period (i.e., Mar 01, 2020 - Apr 15, 2020) than before (Jan 2019 - Feb 2020). I am new to time series analysis and don't know how to approach this question.
I have text data for the period from January 2018 to April 15, 2020. Each record in my data has the following fields:
User_ID;
Comment_NegativeTone (number of words in the comment that belong to the negative tone dictionary);
Comment_Timestamp;
Comment_InResponseTo_UserID (null in case it is a new comment);
DiscussionForum_ID;
DiscussionForum_NumberofRegisteredMembers;
DiscussionForum_AverageCommentsPerDay;
DiscussionForum_AveragePageViewsPerDay
How do I test the hypothesis while controlling for the fact that comments are nested in users, parent comments, and discussion forums? How do I control for discussion forum level variables (i.e., number of registered members, average comments per day, and average page views per day)?
I was thinking of the following mixed-effects negative binomial model in R:
library(glmmTMB)

# Mixed-effects negative binomial model for the count of negative-tone words
glmmTMB(Comment_NegativeTone ~
    CovidTimePeriod +
    (1 | User_ID) +
    (1 | Comment_InResponseTo_UserID) +
    (1 | DiscussionForum_ID) +
    DiscussionForum_NumberofRegisteredMembers +
    DiscussionForum_AverageCommentsPerDay +
    DiscussionForum_AveragePageViewsPerDay +
    offset(log(wordCountInComment)),
  data = data, family = nbinom2)
where CovidTimePeriod = 1 for Comment_Timestamp between Mar-Apr, 2020 and 0 for earlier.
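To make the definition concrete, the indicator could be derived along these lines. This is only a minimal sketch on made-up rows: the data frame name comments, the timestamp format, and the UTC time zone are assumptions for illustration, not properties of the actual data.

```r
# Hypothetical toy rows; the real timestamp format and time zone may differ.
comments <- data.frame(
  Comment_Timestamp = c("2019-06-15 10:30:00", "2020-02-29 23:59:59",
                        "2020-03-01 00:00:00", "2020-04-15 18:00:00")
)

# Parse the timestamps and define the start of the Covid period (Mar 01, 2020).
ts <- as.POSIXct(comments$Comment_Timestamp, tz = "UTC")
covid_start <- as.POSIXct("2020-03-01 00:00:00", tz = "UTC")

# 1 for comments on or after Mar 01, 2020; 0 for earlier comments.
comments$CovidTimePeriod <- as.integer(ts >= covid_start)
```

Since the data end on April 15, 2020, no upper bound is needed in this sketch.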
I am not sure if this model is the right way to analyze differences between time periods. How do I address the concern that negative tone may already have been increasing or decreasing in the non-Covid period, in which case an observed increase during CovidTimePeriod would not be meaningful on its own? Should I be using an alternative model to test such a time-based hypothesis (e.g., a latent change score model)?
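For concreteness, one possibility is a segmented (interrupted time series style) specification that adds an overall time trend plus a post-onset slope to the model. The sketch below only builds the time terms and the formula on made-up rows. The names comments, WeeksSinceStart, and WeeksSinceCovid are hypothetical, the weekly time unit is an assumption, and whether this is an appropriate specification is exactly what is being asked.

```r
# Hypothetical toy rows standing in for the real comments table.
comments <- data.frame(
  Comment_Timestamp = as.POSIXct(
    c("2018-01-10", "2019-07-01", "2020-03-01", "2020-04-10"), tz = "UTC")
)
study_start <- as.POSIXct("2018-01-01", tz = "UTC")  # first month of data
covid_start <- as.POSIXct("2020-03-01", tz = "UTC")  # start of Covid period

# Weeks elapsed since the start of the data: absorbs any pre-existing trend.
comments$WeeksSinceStart <- as.numeric(
  difftime(comments$Comment_Timestamp, study_start, units = "weeks"))

# Level shift at the start of the Covid period.
comments$CovidTimePeriod <- as.integer(comments$Comment_Timestamp >= covid_start)

# Weeks since the Covid start (0 before it): allows the slope to change.
comments$WeeksSinceCovid <- pmax(0, as.numeric(
  difftime(comments$Comment_Timestamp, covid_start, units = "weeks")))

# Formula only; it would be passed to glmmTMB(..., family = nbinom2) with the
# remaining random effects and forum-level covariates from the model above.
segmented_formula <- Comment_NegativeTone ~
  WeeksSinceStart + CovidTimePeriod + WeeksSinceCovid +
  (1 | User_ID) + (1 | DiscussionForum_ID) +
  offset(log(wordCountInComment))
```

In such a specification the coefficient on CovidTimePeriod would be read as a shift relative to the pre-existing trend rather than relative to the overall pre-Covid mean.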
Thank you!