Neural Network Predicting Subreddit Karma and Comments for Mental Health Topics

Introduction:

Problem Description:

This project analyses whether attributes of the content of subreddit posts are predictive of the social support (karma points and number of comments) that the posts receive.

Context of the Problem:

Mental health problems tend to go under-reported and under-addressed, which places a high socioeconomic cost on society. Research shows that social support is valuable for improving the quality of life of people with mental illnesses.

This project examines which content attributes of anonymous social media posts on the reddit platform elicit higher levels of social support in the form of karma points and comments.

Limitations of Other Approaches:

We examined the two most relevant papers on the topic, [1] and [2]. Neither Schrading, N. et al. [1] nor De Choudhury, M. & De, S. [2] uses subreddit indicator variables (i.e., indicators for schizophrenia, depression, anxiety, etc.) in their analysis. It is likely that posts are treated differently depending on the mental illness indicated (as per Mann, C. E. & Himelein, M. J. [3], “stigmatization of schizophrenia was significantly higher than stigmatization of depression”). Also, De Choudhury, M. & De, S. [2] used a resource-intensive manual labelling approach to arrive at keywords.

Solution:

In this project, the analysis includes subreddit indicators in the neural network model predicting social support for reddit posts. The figure below shows statistics for subreddit indicators for a sample dataset. The means of the target variables differ considerably between subreddits.

Additional inputs include counts of frequent bigrams and emotion labelling of keywords. Emotion labelling was done through an NLP approach, using an already existing emotions lexicon.

Background:

Reference: Schrading, N. et al. [1]
Explanation: They trained and compared multiple classifiers on the content of reddit posts to determine the top semantic and linguistic features for detecting abusive relationships.
Dataset/Input: Subreddit posts with comments that focus on domestic abuse, plus subreddit posts with comments unrelated to domestic abuse as a control set.
Weakness: Future studies could be implemented on datasets from multiple websites to compare online abuse patterns across forums.

Reference: De Choudhury, M. & De, S. [2]
Explanation: They trained a negative binomial regression model on the content of reddit posts (i.e., length, use of first-person pronouns, relationship words, emoticons, positive and negative words, etc.) to predict social support variables (karma points and number of responses).
Dataset/Input: Posts, comments and associated metadata from several mental health subreddits, including alcoholism, anxiety, bipolarreddit, depression, mentalhealth, MMFB (Make Me Feel Better), socialanxiety, SuicideWatch.
Weakness:
  • Out of the top 15 discussed predictor variables used in the regression model, the intercept and the use of first-person pronouns have the highest coefficients.
  • There is no discussion of correlations between predictor variables (for example, the study uses such variables as negative emotion, positive emotion and number of emoticons, which could be correlated).

Methodology

Schrading, N. et al. [1] reported that, out of the post features they analyzed, n-grams were the most predictive when detecting abusive relationships in reddit posts. De Choudhury, M. & De, S. [2] tried to predict social support variables for mental health related reddit posts using post length, emoticons, unigrams, variables built from the presence of emotionally charged unigrams, etc.

In this project, to predict social support variables (scores and number of comments) for mental health related reddit posts, the model was built using a neural network approach with the following predictive variables:

  • counts of emotionally charged unigrams, used as indicators of 10 different emotions
  • post length
  • part of speech frequencies (counts of verbs, pronouns, adverbs and adjectives)
  • count of first-person pronouns
  • number of question marks
  • count of frequent bigrams
  • subreddit indicators

Below is the list of the inputs used in the models for predicting the score and number of comments:

‘anger’, ‘anticipation’, ‘disgust’, ‘fear’, ‘joy’, ‘negative’, ‘positive’, ‘sadness’, ‘surprise’, ‘trust’, ‘len_post’, ‘len_post_orig’, ‘first_pronoun_count’, ‘freq_bigram_count’, ‘q_count’, ‘verb_count’, ‘pronoun_count’, ‘adverb_count’, ‘adjective_count’, Subreddit(display_name=‘BipolarReddit’), Subreddit(display_name=‘Anxiety’), Subreddit(display_name=‘depression’), Subreddit(display_name=‘schizophrenia’), Subreddit(display_name=‘bipolar’), Subreddit(display_name=‘mentalhealth’), Subreddit(display_name=‘depression_help’), Subreddit(display_name=‘BPD’), Subreddit(display_name=‘socialanxiety’), Subreddit(display_name=‘mentalillness’)
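As a rough illustration of how the simpler text inputs in this list might be derived, the sketch below computes a post length, a question mark count and a first-person pronoun count from raw text. The tokenisation rule, the pronoun set and the choice to measure length in tokens are assumptions made for this example; the project's actual definitions of ‘len_post’, ‘q_count’ and ‘first_pronoun_count’ are not specified here.

```python
import re

# Assumed set of first-person pronouns; the project's actual list is not given.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}


def surface_features(text):
    """Return simple count-based features for one post."""
    # Lowercase and keep alphabetic runs only (an assumed, very simple tokeniser).
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        "len_post": len(tokens),            # length measured in tokens (assumption)
        "q_count": text.count("?"),         # number of question marks
        "first_pronoun_count": sum(tok in FIRST_PERSON for tok in tokens),
    }


# Made-up example sentence, not taken from the dataset.
feats = surface_features("Why do I feel like this? My friends say I am fine.")
print(feats)
```

Counting on the raw text (before stop word removal) matters for the pronoun feature, since most stop word lists include pronouns.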

Emotion lexicon

A public lexicon dataset was used to determine counts of specific emotion words. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

https://nrc.canada.ca/en/research-development/products-services/technical-advisory-services/sentiment-emotion-lexicons
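The counting step can be sketched as a dictionary lookup per token. The tiny lexicon below is a hand-written stand-in for illustration only; its entries are assumptions and are not copied from the NRC Emotion Lexicon, which would need to be loaded from its own distribution file.

```python
from collections import Counter

# The ten labels used as model inputs in this project.
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy",
            "negative", "positive", "sadness", "surprise", "trust"]

# Toy stand-in lexicon (word -> associated labels). Illustrative assumption only.
TOY_LEXICON = {
    "afraid": {"fear", "negative"},
    "happy": {"joy", "positive", "trust"},
    "lonely": {"sadness", "negative"},
}


def emotion_counts(tokens, lexicon=TOY_LEXICON):
    """Count, per emotion label, how many tokens are associated with it."""
    counts = Counter()
    for tok in tokens:
        for label in lexicon.get(tok.lower(), ()):
            counts[label] += 1
    # Always return all ten labels so every post yields the same columns.
    return {label: counts.get(label, 0) for label in EMOTIONS}


# Made-up example tokens, not taken from the dataset.
counts = emotion_counts("I feel lonely and afraid but happy today".split())
print(counts)
```

A word can carry several labels at once, so the per-label counts for a post can sum to more than the number of matched words.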

Below are examples of posts with the most frequent bigrams highlighted. The frequent bigrams ‘feel like’ and ‘feels like’ are consistent with the finding by De Choudhury, M. & De, S. [2] of frequent unigrams related to emotional expression.

N-grams

For this project we identified the most frequent bigrams and trigrams. Counts of the most frequent bigrams and trigrams were tried while testing various models; the most useful input turned out to be the counts of the 16 most frequent bigrams, which were used as one of the inputs to the model.

Below is the list of the most popular bigrams used and a few examples of their usage in raw texts.
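One way to build such a feature is to count bigrams across the whole corpus, keep the k most frequent, and then count how many of a post's bigrams fall in that set. The sketch below follows that idea with made-up token lists; the function names are hypothetical, and whether the project counted occurrences or distinct bigrams per post is an assumption.

```python
from collections import Counter


def bigrams(tokens):
    """Return consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))


def top_bigrams(tokenized_posts, k=16):
    """Return the set of the k most frequent bigrams across all posts."""
    totals = Counter()
    for tokens in tokenized_posts:
        totals.update(bigrams(tokens))
    return {bg for bg, _ in totals.most_common(k)}


def freq_bigram_count(tokens, frequent):
    """Count occurrences of frequent bigrams in one post."""
    return sum(1 for bg in bigrams(tokens) if bg in frequent)


# Made-up tokenised posts (stop words already removed), for illustration only.
posts = [
    ["feel", "like", "nobody", "cares"],
    ["feel", "like", "giving", "up"],
    ["cannot", "sleep", "feel", "like", "crying"],
]
frequent = top_bigrams(posts, k=1)   # the project used k=16
counts = [freq_bigram_count(p, frequent) for p in posts]
print(frequent, counts)
```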


Implementation

Data Collection

We obtained data via a public API from 10 mental health subreddits: “depression”, “anxiety”, “bipolarreddit”, “mentalhealth”, “socialanxiety”, “depression_help”, “bipolar”, “BPD”, “schizophrenia”, and “mentalillness”.

  • Checked 10 hot posts for each subreddit
  • Collected the data (dimensions of the resulting tables are shown below)

top_posts dimensions: (9949, 9)

hot_posts dimensions: (9890, 9)

new_posts dimensions: (9896, 9)
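The collection loop could look roughly like the sketch below. It assumes the PRAW client library was used (the ‘Subreddit(display_name=...)’ input names suggest this, but the text does not confirm it). The chosen columns and the helper names `submission_to_row` and `collect` are hypothetical and do not necessarily match the 9 columns reported above.

```python
from types import SimpleNamespace

SUBREDDITS = ["depression", "anxiety", "bipolarreddit", "mentalhealth",
              "socialanxiety", "depression_help", "bipolar", "BPD",
              "schizophrenia", "mentalillness"]


def submission_to_row(post, category):
    """Flatten one submission-like object into a plain dict (assumed columns)."""
    return {
        "id": post.id,
        "title": post.title,
        "body": post.selftext,
        "score": post.score,
        "num_comments": post.num_comments,
        "created_utc": post.created_utc,
        "subreddit": str(post.subreddit),
        "category": category,
    }


def collect(reddit, names=SUBREDDITS, limit=1000):
    """Fetch hot, top and new listings for each subreddit."""
    rows = []
    for name in names:
        sub = reddit.subreddit(name)
        for category in ("hot", "top", "new"):
            listing = getattr(sub, category)(limit=limit)
            rows.extend(submission_to_row(p, category) for p in listing)
    return rows


# Offline demonstration with a stand-in object instead of a real submission.
fake_post = SimpleNamespace(id="abc", title="t", selftext="b", score=5,
                            num_comments=2, created_utc=0.0,
                            subreddit="depression")
example_row = submission_to_row(fake_post, "hot")

# Real usage needs API credentials (placeholders, not real values):
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   rows = collect(reddit)
```

Keeping the listing category on each row makes it possible to see later which posts appeared in more than one category.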


Preparing the Data

Reddit data scraping is limited to a maximum of 1000 records per subreddit for each of the 3 post categories (“hot”, “top” and “new”). To maximize the dataset size, we collected posts from all 3 categories and removed duplicate records of posts that appeared in more than one category. As mentioned by De Choudhury, M. & De, S. [2], reddit posts receive most of their comments within the first 3 days after being posted. Thus, we removed posts that were less than 3 days old at data collection time.

  • Removed stop words and punctuation
  • Created n-grams (bigrams, trigrams and fourgrams)
  • Applied smoothing for trigrams and removed extra words referring to posts unrelated to this analysis (e.g., moderators’ posts)
  • Created the emotions dataframe, counted POS (part of speech) tags, and created topic/subreddit dummies
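The deduplication and age filter described above might be expressed with pandas as in the following sketch. The column names ‘id’ and ‘created_utc’ (a Unix timestamp in seconds) and the function name are assumptions for this example, and the small frames hold made-up values.

```python
import pandas as pd


def combine_and_filter(frames, collected_at, min_age_days=3):
    """Stack listing tables, drop repeated posts, keep posts old enough."""
    posts = pd.concat(frames, ignore_index=True)
    # A post that appears in several listings keeps only its first record.
    posts = posts.drop_duplicates(subset="id")
    created = pd.to_datetime(posts["created_utc"], unit="s", utc=True)
    age = collected_at - created
    return posts[age >= pd.Timedelta(days=min_age_days)].reset_index(drop=True)


# Made-up example: post "a" is in both listings, post "b" is only 1 day old.
collected_at = pd.Timestamp("2021-01-10", tz="UTC")
day = 24 * 3600
t0 = int(pd.Timestamp("2021-01-01", tz="UTC").timestamp())
hot = pd.DataFrame({"id": ["a", "b"], "created_utc": [t0, t0 + 8 * day]})
top = pd.DataFrame({"id": ["a", "c"], "created_utc": [t0, t0 + 2 * day]})

clean = combine_and_filter([hot, top], collected_at)
print(clean)
```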
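The last step, creating subreddit dummies, can be sketched with pandas `get_dummies`. The column names and values below are made up for illustration. In the project the dummy columns appear to be named after subreddit objects (‘Subreddit(display_name=...)’), whereas this sketch uses plain strings.

```python
import pandas as pd

# Made-up rows, for illustration only.
posts = pd.DataFrame({
    "subreddit": ["depression", "Anxiety", "depression", "BPD"],
    "score": [12, 3, 7, 5],
})

# One 0/1 indicator column per subreddit.
dummies = pd.get_dummies(posts["subreddit"], dtype=int)

# Replace the original text column with the indicator columns.
features = pd.concat([posts.drop(columns="subreddit"), dummies], axis=1)
print(features)
```

Each row has exactly one indicator set to 1, so the indicators let the model learn a separate baseline level of score and comments per subreddit.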

Reddit score prediction model – results based on first layer weights:

In a multi-layer neural network it is hard to interpret raw internal weights, but it appears that mental health-specific variables (such as indicators for fear or surprise, or subreddit indicators) are more important than generic ones (such as verb count or the length of the post, which appears to be the least useful). In particular, most subreddit indicators (“depression_help”, “depression”, “schizophrenia”, etc.), which were not used in the other papers, are in the top 10 by total weight.
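A ranking of this kind can be computed by summing, for each input, the magnitudes of its first-layer weights. The sketch below assumes a weight matrix of shape (inputs, hidden units) and assumes that absolute values are summed; the text does not state exactly how the totals were formed. The weight values are placeholders, not the project's trained weights.

```python
import numpy as np


def rank_inputs_by_first_layer_weight(weights, feature_names):
    """Rank inputs by the sum of absolute first-layer weights (largest first)."""
    totals = np.abs(weights).sum(axis=1)   # one total per input row
    order = np.argsort(totals)[::-1]       # indices from largest to smallest
    return [(feature_names[i], float(totals[i])) for i in order]


# Placeholder weights: 3 inputs feeding 2 hidden units.
names = ["fear", "verb_count", "depression"]
W1 = np.array([[0.5, -0.4],
               [0.05, 0.02],
               [-0.9, 0.3]])

ranking = rank_inputs_by_first_layer_weight(W1, names)
print(ranking)
```

Such totals are only comparable across inputs when the inputs are on similar scales, for example after standardisation.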


Conclusion and Future Direction

In conclusion, the neural network results suggest that the model inputs have some predictive power for the social response variables ‘number of comments’ and ‘score’, as the sums of weights for the input variables were found to be greater than zero. Also, during model testing, starting with fewer input variables and then adding the remaining ones reduced the mean absolute errors.
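For reference, the error measure used in that comparison is the average of the absolute differences between predicted and actual values. The sketch below shows the calculation on placeholder numbers, which are not results from this project.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Placeholder values for illustration only.
actual = [3, 5, 2]
pred_fewer_inputs = [0, 0, 0]
pred_all_inputs = [2, 5, 4]

mae_fewer = mean_absolute_error(actual, pred_fewer_inputs)
mae_all = mean_absolute_error(actual, pred_all_inputs)
print(mae_fewer, mae_all)
```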

One of the future improvements for this analysis could be incorporating a variable that indicates whether the post is from a throwaway account or an existing long-term reddit account, as De Choudhury, M. & De, S. [2] mention that reddit’s throwaway accounts allow individuals to express themselves more honestly and to ‘discuss uninhibited feelings’.

Also, while the content and length of post titles and the way users act on posts (click, read, and reply) might have an impact on a post’s score, neither the research papers cited nor this analysis used title analysis as part of the model. As such, adding title attributes and post interaction statistics to the model could be a potential area for improvement.


References:

[1]: Schrading, N., Alm, C. O., Ptucha, R., & Homan, C. M. An Analysis of Domestic Abuse Discourse on Reddit. The 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015, pages 2577-2583.

[2]: De Choudhury, M. & De, S. Mental Health Discourse on reddit: Self-Disclosure, Social Support, and Anonymity. Eighth International AAAI Conference on Weblogs and Social Media, North America, May 2014, pages 71-80. Available at: https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8075/8107.

[3]: Mann, C. E. & Himelein, M. J. Factors Associated with Stigmatization of Persons with Mental Illness. Psychiatric Services, Vol. 55, No. 2, February 2004, pages 185-197. Available at: https://ps.psychiatryonline.org/doi/pdf/10.1176/appi.ps.55.2.185.