Update: if there is a genuine regression-to-the-mean effect, it will be confounded with the treatment effect, because both occur over time and both move scores in the same direction for people who need treatment. You will therefore not be able to estimate the "true" treatment effect.
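To illustrate why this matters, here is a minimal simulation sketch. Everything in it is an assumption for illustration (the normal distributions, the noise level, the cutoff and the function name `simulate_rtm`); it is not based on your data. Each simulated user has a stable underlying state and two noisy scores, with no treatment effect at all. Selecting users who scored low on their first post still produces an apparent improvement later.

```python
import random
import statistics


def simulate_rtm(n_users=10_000, noise_sd=1.0, cutoff=-1.0, seed=0):
    """Simulate two noisy scores per user with NO treatment effect.

    All parameters are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_users):
        true_state = rng.gauss(0.0, 1.0)                 # stable underlying state
        first = true_state + rng.gauss(0.0, noise_sd)    # score at first post
        later = true_state + rng.gauss(0.0, noise_sd)    # score at a later post
        pairs.append((first, later))

    # Keep only users who looked like they needed help at their first post.
    selected = [(a, b) for a, b in pairs if a < cutoff]
    mean_first = statistics.mean(a for a, _ in selected)
    mean_later = statistics.mean(b for _, b in selected)
    return mean_first, mean_later


mean_first, mean_later = simulate_rtm()
print(f"selected users, first post: {mean_first:.2f}")
print(f"selected users, later post: {mean_later:.2f}")
```

The later mean is closer to zero than the first mean even though nothing was done to these simulated users, which is the pattern a real treatment effect would also produce.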
This is an interesting set of data, and I think you can do some analyses with it. However, you will not be able to treat the method used to generate the data as an experiment. I think you have what Wikipedia describes as a natural experiment and, while useful, these types of studies have some issues not found in controlled experiments. In particular, natural experiments suffer from a lack of control over independent variables, so cause-and-effect relationships may be impossible to identify, although it is still possible to draw conclusions about correlations.
In your case, I would be worried about confounding variables. This is a list of possible factors that could influence the results:
- Possibly your largest confound is that you don't know what else is going on in users' lives away from the website. On the basis of what they write on the website, one user may realise how bad their situation is and draw on resources around them (family, friends) for support, so the help they get is not limited to that received on the website. Another user, perhaps due to their life issues, may be alienated from family and friends, and the website is all the support they have. We may expect that the time-to-positive-outcome will be different for these two users, but we can't distinguish between them.
- I'm assuming that the website users accessed the website when they wanted to (which is great for them), but this means that the results you have for their problems may not reflect the number and severity of their life issues, because I assume they didn't access the site regularly (unlike face-to-face counselling appointments, which tend to be scheduled regularly).
- The level of detail in their writing will reflect their written style, and is not likely to be equivalent to what they would express in a face-to-face counselling session. There are also no non-verbal cues, which a face-to-face counsellor would use to help assess the state of their client. Were the changes over time more pronounced in users who wrote less and had fewer tags applied to their content?
- If there were a number of low-score and high-score tags in the same post (e.g. someone is having problems with study and they're in a happy relationship), how was the proxy affected by this? For example, was a simple average score taken across all tag scores for each post? This could be affecting your scores if there is a particular very negative issue that the person is facing, but much of what else they mention is positive. In a face-to-face setting, the counsellor can focus on the negative and find out, e.g., why the person is so depressed even though much of their life seems to be going well, but in the website situation you only have what they write. So it is possible that the way users have written their posts means that taking an overall proxy may not work too well.
- If the website is for users with life problems, I'm not sure why you wish to include users who scored as being very (happy? successful?) in their first post. These people do not seem to be the target audience for the website, and I'm not sure why you would want to include them in the same group as people who had issues. For example, the happy(?) people do not seem to need treatment, so there is no reason I can think of why the website intervention would be suitable for them. I'm not sure if users were assigned to the website as a treatment after, for example, seeing a counsellor. If that was the case, I would wonder why people who were upset enough to see a counsellor would then write a very positive post on a website designed to help them improve their mental state. Assuming there was this pre-counselling stage, maybe all they needed was that one counselling appointment. Regardless, I think this is quite a different group from the users whose initial posts showed life issues, and for the moment I would omit them as they seem to be a "sampling error". Normally when assessing treatment effects, we only select people who appear to need treatment (e.g. we don't include happy, contented people in trials of antidepressants).
- There may be some social desirability bias in the user posts.
- Have you undertaken any inter-rater reliability testing with the tags? If not, could some of the issues with scoring be related to bias with some tags? In particular, there could be some quality issues when a counsellor has just started and is learning how to tag posts, just as there are quality issues when any of us learn something new. Also, did some counsellors tend to place more tags, and did some tend to place fewer tags? Your analysis requires tag consistency across all the posts.
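On the inter-rater reliability point, one common check is Cohen's kappa, which measures agreement between two raters after correcting for the agreement expected by chance. Below is a minimal sketch. It assumes a simplified setup that may not match yours: two counsellors each assign exactly one tag to the same set of posts. The tag names and the data are made up for illustration, and posts with several tags would need a different approach (for example, one kappa per tag, treating each tag as present or absent).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items, one label each."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical tags from two counsellors for the same ten posts.
counsellor_1 = ["study", "study", "family", "family", "work",
                "work", "study", "family", "work", "study"]
counsellor_2 = ["study", "family", "family", "family", "work",
                "study", "study", "family", "work", "study"]

kappa = cohens_kappa(counsellor_1, counsellor_2)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance. Computing it separately for new and experienced counsellors would be one way to look at the learning effect mentioned above.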
These are just suggestions based on your post, and I could well have misunderstood some of your study, or made some incorrect assumptions. I think that the factors you mention at the end of your post - users' language choices, details of website interaction, timing of counsellor response - are all very important.
Best wishes with your study.