Reddit Depression Regimens – Topic Modeling

Text data (top posts and the top comments on those posts) was downloaded from the depressionregimens subreddit (https://www.reddit.com/r/depressionregimens/). The data was grouped by post id; there were 101 such ids in total, and therefore 101 text documents. After the data was collected, the following cleaning steps were performed:

  • any emails were removed from text
  • urls were removed (http and www)
  • common contractions were expanded (‘ain’t’ >> ‘is not’; ‘bday’ >> ‘birthday’; ‘don’t’ >> ‘do not’; etc.)
  • new line characters were removed
  • single quotes were removed
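The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the regular expressions, the order of operations, and the three-entry contraction map are assumptions, since the full mapping is not shown here.

```python
import re

# Hypothetical, abbreviated contraction map; the full mapping is not shown in the text.
CONTRACTIONS = {
    "ain't": "is not",
    "bday": "birthday",
    "don't": "do not",
}

EMAIL_RE = re.compile(r"\S+@\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def clean_text(text: str) -> str:
    """Apply the listed cleaning steps to one document."""
    text = text.replace("\u2019", "'")  # normalise curly apostrophes first
    text = EMAIL_RE.sub(" ", text)      # remove emails
    text = URL_RE.sub(" ", text)        # remove http(s) and www urls
    # Expand contractions on whole words only, ignoring case.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text, flags=re.IGNORECASE)
    text = text.replace("\n", " ")      # remove new line characters
    text = text.replace("'", "")        # remove single quotes
    return re.sub(r"\s+", " ", text).strip()
```

Expanding contractions before stripping single quotes matters here: once the apostrophe is gone, ‘don’t’ becomes ‘dont’ and no longer matches the map.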

After the data cleaning steps were complete, sentences were tokenized into words, and punctuation was removed. English stop words were removed from the documents. Python’s gensim.models.phrases.Phraser() was used to detect common phrases (bigrams). Lemmatization and part-of-speech (POS) tagging were then performed. Only lemmatized words with certain POS tags were kept: nouns, adjectives, verbs, adverbs, and proper nouns. Proper nouns were kept in case medication or supplement names were tagged as such. We are interested in how Reddit users describe their experiences with certain psychotropic medications and supplements, so the chosen POS tags are the ones relevant to descriptions.
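The POS filtering step can be illustrated with a small sketch. It assumes the tokens have already been lemmatized and tagged with Universal POS labels (the kind of output a tagger such as spaCy produces); the tag names, the tiny stop-word list, and the hand-tagged sample are illustrative assumptions rather than the original pipeline.

```python
# Universal POS tags for nouns, adjectives, verbs, adverbs and proper nouns.
KEPT_POS = {"NOUN", "ADJ", "VERB", "ADV", "PROPN"}

# Tiny illustrative stop-word list; a real run would use a full English list.
STOP_WORDS = {"i", "the", "a", "my", "was", "at", "to"}


def keep_content_lemmas(tagged_tokens):
    """Keep lowercased lemmas that have a kept POS tag and are not stop words.

    `tagged_tokens` is an iterable of (lemma, pos) pairs.
    """
    kept = []
    for lemma, pos in tagged_tokens:
        lemma = lemma.lower()
        if pos in KEPT_POS and lemma not in STOP_WORDS:
            kept.append(lemma)
    return kept


# Hand-tagged fragment ("I was sitting in the recliner at my local ketamine clinic").
tagged = [
    ("I", "PRON"), ("be", "AUX"), ("sit", "VERB"), ("in", "ADP"),
    ("the", "DET"), ("recliner", "NOUN"), ("at", "ADP"), ("my", "PRON"),
    ("local", "ADJ"), ("ketamine", "NOUN"), ("clinic", "NOUN"),
]
print(keep_content_lemmas(tagged))  # ['sit', 'recliner', 'local', 'ketamine', 'clinic']
```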

An example of an original post is presented below. A human reader can tell that the post is about ketamine and that the user had a positive experience with this treatment. We are interested in the verbs, adverbs, adjectives, and nouns that the author used to describe their experience.


I wanted to post this to give hope to those who need a little extra. I know how excruciating both having and battling treatment options for depression and anxiety can be. I’ve seen what I thought was rock bottom. I’ve been to my actual rock bottom, but I am free now.

One year ago, I was sitting in the recliner at my local ketamine clinic receiving my first infusion. The day before I had outlined my plan for suicide and had all my affairs in order, but a friend I had confided in about my depression had a “feeling” I was in a very dangerous place inside my head. I don’t know how she knew what I was planning, but thank goodness she had the foresight and the strength to push me to try one. more. thing. I had heard (and shared) quite a few podcasts from people who had been through treatment, administered the treatments, and even those who had been doing the research behind it all. had been chatting about ketamine as a potential treatment since nothing else had ever worked. She researched clinics, called them all to pick the best one, and made the appointment on an urgent basis getting me in that day.

She took me to the doctor, and after a while, I told him about my plan. I told him that I would give this a try, but this was my last try. After 25 years of my brain being a lab rat for every pill imaginable, years and years of therapy and everything else you can imagine, I was just so tired. He was sympathetic, caring and sat with me for quite a while. Then he started that first IV.

I won’t bore you with all the details of treatment (feel free to ask), but I can say that after the first treatment – one year ago today – I didn’t want to die anymore. I haven’t wanted to since. From time to time, the depression will creep in a little too much for comfort, but I have a lot of self-care tricks to help me get through it. And if It comes down to it, I go in for a booster treatment.
In the past year, I have had 11 infusions. The last 3 were to help me get off the last, and most difficult antidepressant that I ever took. Now, I’m on a very low dose of Lexapro, which I honestly doubt I even need. But I’m stable. I actually know what happiness feels like. And most importantly, I’m alive.
Thanks for reading.

After we perform the steps described above (data cleaning, stop word removal, lemmatization, keeping only terms with specific POS tags, and extracting common bigrams), the post above results in the following:

want post give hope need little extra know excruciating battle treatment option depression anxiety see think rock bottom actual rock bottom free year ago sit recliner local ketamine clinic receive first infusion day outline plan suicide affair order friend confide depression feel dangerous place head know know plan thank goodness foresight strength push try thing hear share quite podcast people treatment administered treatment even research chat ketamine potential treatment else ever work research clinic call pick good make appointment urgent basis get day take doctor tell plan tell would give try last try year brain lab rat pill imaginable year year therapy else imagine tired sympathetic caring sit quite start first bear detail treatment feel free ask say first treatment year ago today want die anymore want time time depression creep little much comfort lot self_care trick help come go booster treatment year infusion last help last difficult antidepressant ever take low_dose lexapro honestly doubt even need stable actually know happiness feel importantly alive thank read

For topic modeling, we are interested in the general topics discussed in this particular subreddit. Latent Dirichlet Allocation (LDA) is designed for this sort of task. LDA is an unsupervised method for finding topics in text data. Our text is composed of documents; in this case each document is a combination of a post and the top comments for a specific post id. LDA assumes that each document is a mixture of topics and each topic is a distribution over words. Documents can therefore share topics, and topics can share words, but the probabilities of those topics and words will differ.
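To make the mixture assumption concrete: under LDA, the probability of a word in a document is the sum over topics of (topic weight in the document) × (word probability in the topic). The sketch below uses made-up numbers and hypothetical topic names purely for illustration; they are not taken from the fitted models.

```python
def word_probability(doc_topics, topic_words, word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


# Illustrative values only: one document that is 75% topic "A" and 25% topic "B".
doc_topics = {"A": 0.75, "B": 0.25}
topic_words = {
    "A": {"depression": 0.02, "ketamine": 0.01},
    "B": {"depression": 0.04, "sleep": 0.05},
}

# "depression" appears in both topics, with a different probability in each.
print(word_probability(doc_topics, topic_words, "depression"))
```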

Since the problem is unsupervised, there are no labels, just text, and we don’t know how many topics there are in our subreddit. There is no exact formula for the optimal number of topics in an LDA model. One common approach, which we implement here, is to loop through different numbers of topics, calculate a coherence score for each model, and choose the model with the highest score. In this specific case, I fitted models for 2, 4, 8, …, 14 topics and plotted the corresponding coherence scores. As we can see from the chart, the highest value occurs when the number of topics is four; we also see peaks at 10 and 12 topics.
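A loop of this kind might look like the sketch below, using gensim’s LdaModel and CoherenceModel. The coherence measure ("c_v"), the number of passes, and the random seed are assumptions, since the settings actually used are not stated; the scores in the usage example are invented placeholders.

```python
def coherence_by_num_topics(texts, candidate_ks, random_state=0):
    """Fit one LDA model per candidate topic count and return {k: coherence}.

    `texts` is a list of token lists (one list per cleaned document).
    """
    # Imported lazily so that best_num_topics() works without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    scores = {}
    for k in candidate_ks:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       random_state=random_state, passes=10)  # passes is an assumption
        cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                            coherence="c_v")  # coherence measure is an assumption
        scores[k] = cm.get_coherence()
    return scores


def best_num_topics(scores):
    """Return the topic count with the highest coherence score."""
    return max(scores, key=scores.get)


# Placeholder scores for illustration; real values come from coherence_by_num_topics().
example_scores = {2: 0.30, 4: 0.41, 10: 0.39}
print(best_num_topics(example_scores))  # 4
```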

Now let’s see what the topics are.

Number of topics = 4

Topic 1 words   Word prob  Topic 2 words   Word prob  Topic 3 words   Word prob  Topic 4 words   Word prob
feel            0.037      day             0.042      depression      0.017      depression      0.019
thing           0.026      work            0.024      effect          0.014      people          0.017
depression      0.022      feel            0.020      antidepressant  0.013      year            0.016
make            0.021      time            0.020      ssris           0.012      give            0.016
life            0.019      sleep           0.018      doctor          0.012      treatment       0.015
time            0.013      good            0.014      anxiety         0.010      month           0.013
bad             0.013      thing           0.012      side_effect     0.010      start           0.013
good            0.012      start           0.010      mg              0.010      find            0.012
lot             0.010      bed             0.009      drug            0.009      hope            0.012
depressed       0.010      exercise        0.009      psychiatrist    0.009      ketamine        0.011

If we choose 10 topics:

Topic 1 words   Word prob  Topic 2 words   Word prob  Topic 3 words   Word prob  Topic 4 words   Word prob  Topic 5 words   Word prob
feel            0.039      people          0.044      day             0.037      ssris           0.027      treatment       0.035
year            0.026      depression      0.037      thing           0.035      antidepressant  0.024      ketamine        0.028
thing           0.022      doctor          0.028      feel            0.033      effect          0.024      year            0.022
symptom         0.020      psychiatrist    0.020      make            0.024      drug            0.022      work            0.021
brain           0.019      make            0.020      find            0.017      side_effect     0.020      drug            0.017
start           0.018      bad             0.016      good            0.016      depression      0.019      hope            0.015
time            0.017      therapy         0.016      exercise        0.016      serotonin       0.016      hear            0.012
make            0.015      therapist       0.015      eat             0.013      prescribe       0.014      lithium         0.011
issue           0.015      find            0.014      walk            0.013      treat           0.013      people          0.010
lot             0.014      problem         0.013      lot             0.013      ssri            0.012      infusion        0.009

Topic 6 words   Word prob  Topic 7 words   Word prob  Topic 8 words   Word prob  Topic 9 words   Word prob  Topic 10 words  Word prob
work            0.053      time            0.033      sleep           0.053      experience      0.039      life            0.062
anxiety         0.030      make            0.028      day             0.037      day             0.030      feel            0.030
mg              0.025      depression      0.015      time            0.030      feel            0.029      depression      0.029
bad             0.020      long            0.015      bed             0.024      depression      0.024      thing           0.020
high            0.020      call            0.014      start           0.024      mind            0.020      find            0.019
vitamin         0.018      depressed       0.014      feel            0.023      give            0.017      good            0.017
diet            0.015      feeling         0.013      morning         0.020      month           0.017      live            0.017
supplement      0.014      people          0.013      wake            0.018      good            0.015      bad             0.014
post            0.012      read            0.013      night           0.014      week            0.013      change          0.014
literally       0.011      focus           0.013      hour            0.013      back            0.012      year            0.013

I think that even with this small sample size (101 top posts and their corresponding top comments), the LDA results give us a good understanding of what users discuss in the depressionregimens subreddit. There are discussions about life, feeling depressed, and how long the depression has been going on (mentions of week/month/year). Using the numbering of the 10-topic model above, there are also mentions of how the day goes (Topic 8), specific treatments (Topic 5), supplements (Topic 6), SSRIs and side effects (Topic 4), and exercise (Topic 3).

It is then possible to apply the chosen model to each document to obtain the topic distribution for that document. For example, we can choose the model with 10 topics, obtain the topic distribution for each document, and determine the topic with the maximum probability for each document. Then we can select sample documents that have the highest probability for a given topic. Suppose we choose topic 2, which has the following word distribution:

(‘people’, 0.04), (‘depression’, 0.038), (‘doctor’, 0.028), (‘psychiatrist’, 0.020),
(‘make’, 0.020), (‘bad’, 0.016), (‘therapy’, 0.016), (‘therapist’, 0.015), (‘find’, 0.014),
(‘problem’, 0.013)

We can then find a document for which topic 2 has the maximum probability:


“This might be an unconventional treatment considering that many of us post about their experience with various drugs.
I myself struggled with mental health in the past. I can say my mental health issues in the past were to 90% biological (hormonal problems). Once I treated the causes, over time the upwards spiral in my personal wellbeing (and life in general) started again.
In early twenties, my life was starting to go down the gutter. My life started to fall apart in every domain. I was severely depressed. I found out that some of my hormones were very low. I started hormone replacement. Whereas before my life was a nightmare, it has been great ever since. I could even get off the SSRIs I was on.
I wrote an article about my journey. How Hormones Destroyed and Saved My Life.
My dream is to live in a world where no one is held back from living an at least decent life the way I was. Even though not my fault, it is my life. And thus my responsibility. Without accepting and acting on that I just don´t know where I would be today. For sure I wouldn´t be writing this. Hope you find value in it… “

(Can read full text at https://www.reddit.com/r/depressionregimens/comments/lef32x )

The topic distribution for this document is as follows:
[1: 0.074, 2: 0.338, 3: 0.032, 4: 0.069, 5: 0.083, 6: 0.084, 7: 0.052, 8: 0.054, 9: 0.073, 10: 0.153]
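Picking the dominant topic from such a distribution is a simple arg-max. The sketch below hard-codes the distribution shown above as a plain dictionary; how the distribution is obtained from the fitted model is not shown, and the helper name is hypothetical.

```python
# Topic distribution for the sample document, copied from the text above.
doc_topics = {1: 0.074, 2: 0.338, 3: 0.032, 4: 0.069, 5: 0.083,
              6: 0.084, 7: 0.052, 8: 0.054, 9: 0.073, 10: 0.153}


def dominant_topic(distribution):
    """Return (topic, probability) for the most probable topic."""
    topic = max(distribution, key=distribution.get)
    return topic, distribution[topic]


print(dominant_topic(doc_topics))  # (2, 0.338)
```

Applying the same helper to every document and sorting by the returned probability gives the most representative documents for each topic.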

Lemmatized text:

[‘may’, ‘unconventional’, ‘treatment’, ‘consider’, ‘many’, ‘us’, ‘post’, ‘experience’, ‘various’, ‘drug’, ‘struggle’, ‘mental_health’, ‘say’, ‘mental_health’, ‘issue’, ‘biological’, ‘hormonal’, ‘problem‘, ‘treat’, ’cause’, ‘time’, ‘upwards’, ‘spiral’, ‘personal’, ‘wellbeing’, ‘life’, ‘general’, ‘start’, ‘early’, ‘twenty’, ‘life’, ‘start’, ‘go’, ‘gutter’, ‘life’, ‘start’, ‘fall’, ‘apart’, ‘domain’, ‘severely_depresse’, ‘find‘, ‘hormone’, ‘low’, ‘start’, ‘hormone’, ‘replacement’, ‘life’, ‘nightmare’, ‘great’, ‘ever’, ‘since’, ‘could’, ‘even’, ‘ssris’, ‘write’, ‘article’, ‘journey’, ‘hormone’, ‘destroy’, ‘save’, ‘life’, ‘dream’, ‘live’, ‘world’, ‘hold’, ‘back’, ‘live’, ‘least’, ‘decent’, ‘life’, ‘way’, ‘even’, ‘fault’, ‘life’, ‘thus’, ‘responsibility’, ‘accept’, ‘act’, ‘know’, ‘would’, ‘today’, ‘sure’, ‘write’, ‘hope’, ‘find‘, ‘value’, ‘opinion’, ‘replace’, ‘hormone’, ‘deficient’, ‘far’, ‘natural’, ‘also’, ‘effective’, ‘artificial’, ‘med’, ‘however’, ‘believe’, ‘hormone’, ‘deficiency’, ‘may’, ‘much’, ‘common’, ‘assume’, ‘people‘, ‘never’, ‘get’, ‘hormone’, ‘check’, ‘often’, ‘even’, ‘life’, ‘want’, ‘give’, ‘head’, ‘other’, ‘pull’, ‘trigger’, ‘medication’, ‘claim’, ‘medication’, ‘work’, ‘hormone’, ‘check’, ‘opportunity’, ‘cost’, ‘high’, ‘similar’, ‘experience’, ‘hormone’, ‘hormone’, ‘dangerous’, ‘play’, ‘make‘, ‘sure’, ‘talk’, ‘doctor‘, ‘monitor’, ‘doctor‘, ‘lock’, ‘post’, ‘people‘, ‘would’, ‘see’, ‘unlocked’, ‘pm’, ‘otherwise’, ‘leave’, ‘lock’, ‘play’, ‘hormone’, ‘medical’, ‘supervision’, ‘highly’, ‘detrimental’, ‘health’, ‘thyroid’, ‘hormone’, ‘deficient’, ‘know’, ‘other’, ‘start’, ‘take’, ‘mcg’, ‘thyroxine’, ‘treat’, ‘hypothyroidism’, ‘run’, ‘family’, ‘fog’, ‘seem’, ‘lift’, ‘agree’, ‘hormone’, ‘underrated’, ‘come’, ‘depression‘, ‘thank’, ‘share’, ‘hormone’, ‘specifically’, ‘testosterone’, ‘direct’, ‘correlation’, ‘dopamine’, ‘high’, ‘test’, ‘high’, 
‘dopamine’, ‘vice’, ‘versa’, ‘generally’, ‘testerone’, ‘wellbutrin’, ‘increase’, ‘libido’, ‘endocrine’, ‘system’, ‘research’, ‘seem’, ‘lag’, ‘research’, ‘treatment’, ‘know’, ‘million’, ‘could’, ‘suffer’, ‘needlessly’, ‘ignore’, ‘op’, ‘entire’, ‘post’, ‘structure’, ‘sway’, ‘people‘, ‘way’, ‘link’, ‘closing’, ‘paragraph’, ‘also’, ‘spamme’, ‘numerous’, ‘time’, ‘different’, ‘thing’, ’cause’, ‘depression‘, ‘know’, ‘enough’, ‘dark’, ‘age’, ‘exclusive’, ‘seratonin’, ‘hormone’, ‘receptor’, ‘regulation’, ‘drug’, ‘abuse’, ‘dopamine’, ‘ach’, ‘brain’, ‘damage’, ‘gaba’, ‘glutamate’, ‘imbalance’, ‘relate’, ‘several’, ‘brain’, ‘region’, ‘receptor’, ‘site’, ‘together’, ‘hormone’, ‘conversion’, ‘chain’, ‘adhd’, ‘bp’, ‘level’, ‘bdnf’, ‘several’, ‘type’, ‘disease’, ‘additionally’, ‘low’, ‘end’, ‘hormone’, ‘scale’, ‘total’, ‘free’, ‘may’, ‘feel’, ‘symptom’, ‘other’, ‘would’, ‘conversely’, ‘man’, ‘may’, ‘almost’, ‘nil’, ‘estrogen’, ‘high’, ‘estrogen’, ‘side_effect’, ‘decent’, ‘doctor‘, ‘full’, ‘blood’, ‘panel’, ‘hormone’, ‘panel’, ‘include’, ‘ask’, ‘depend’, ‘free’, ‘go’, ‘private’, ‘cost’, ‘uk’, ‘take’, ‘important’, ‘relative’, ‘commit’, ‘find‘, ‘thyroid’, ‘level’, ‘way’, ‘back’, ‘thyroid’, ‘problem‘, ‘handle’, ‘psych’, ‘med’, ‘need’, ‘depression‘, ‘probably’, ‘lifelong’, ‘become’, ‘unmanageable’, ‘thyroid’, ‘cancer’, ‘luckily’, ‘old’, ‘easy’, ‘catch’, ‘get’, ‘point’, ‘hormone’, ‘low’, ‘find‘, ‘hormone’, ‘check’, ‘yearly’, ‘perfectly’, ‘normal’, ‘even’, ‘high’, ‘yet’, ‘still’, ‘depressed’, ‘hormone’, ‘may’, ‘help’, ‘people‘, ‘many’, ‘still’, ‘depress’, ‘physiological’, ‘duck’, ‘row’, ‘infuriate’, ‘many’, ‘doctor‘, ‘refuse’, ‘prescribe’, ‘hrt’, ‘guess’, ‘taboo’, ‘medical’, ‘school’, ‘pull’, ‘tooth’, ‘find‘, ‘decent’, ‘doctor‘, ‘even’, ‘consider’, ‘apparently’, ‘fear’, ‘cancer’, ‘induce’, ‘hormone’, ‘frankly’, ‘rather’, ‘live’, ‘good’, ‘life’, ‘even’, ‘mean’, ‘get’, ‘cancer’, ‘live’, ‘cancer’, ‘free’, ‘life’, ‘mentally’, ‘miserable’, ‘post’, ‘multiple’, ‘account’, ‘whole’, ‘time’, 
‘person’, ‘post’, ‘often’, ‘article’, ‘different’, ‘account’, ‘sometimes’, ‘claim’, ‘last’, ‘year’, ‘biology’, ‘student’, ‘other’, ‘last’, ‘year’, ‘medicine’, ‘student’, ‘post’, ‘lame’, ‘excuse’, ‘lure’, ‘costumer’, ‘hormetheu’, ‘thank’, ‘share’, ‘disregard’, ‘irrational’, ‘post’, ‘intelligent’, ‘enough’, ‘determine’, ‘right’, ‘see’, ‘sort’, ‘ground’, ‘swell’, ‘business’, ‘activity’, ‘even’, ‘touch’, ‘consultation’, ‘hormone’, ‘way’, ‘business’, ‘s’, ‘even’, ‘well’, ‘talk’, ‘get’, ‘free’, ‘professsional’, ‘guidance’, ‘think’, ‘people‘, ‘stick’, ‘depression‘, ’cause’, ‘people‘, ‘pursue’, ‘treatment’, ‘may’, ‘save’, ‘life’, ‘know’, ‘firsthand’, ‘appropriate’, ‘way’, ‘respond’, ‘tell’, ‘support’, ‘other’, ‘say’, ‘mother’, ‘first’, ‘tell’, ‘hit’, ‘would’, ‘sit’, ‘kitchen’, ‘table’, ‘cry’, ‘uncontrollably’, ‘start’, ‘hrt’, ‘right’, ‘take’, ‘nurse’, ‘year’, ‘tortuous’, ‘severe’, ‘depression‘, ‘ask’, ‘do’, ‘hormone’, ‘panel’, ‘flabbergast’, ‘go’, ‘lowt’, ‘men’, ‘health’, ‘center’, ‘addition’, ‘find‘, ‘severely’, ‘low’, ‘receive’, ‘great’, ‘man’, ‘health’, ‘care’, ‘know’, ‘funny’, ‘deduce’, ‘man’, ‘mid’, ‘life’, ‘crisis’, ‘hormone’, ‘imbalance’, ‘likely’, ‘low’, ‘get’, ‘ball’, ‘bust’, ‘buy’, ‘corvette’, ‘woman’, ‘get’, ‘sympathy’, ‘go’, ‘change’, ‘enough’, ‘question’, ‘come’, ‘first’, ‘opinion’, ‘testosterone’, ‘brain’, ‘get’, ‘testosterone’, ‘shot’, ‘help’, ‘put’, ‘dent’, ‘depression‘, ‘make‘, ‘feel’, ‘well’, ‘still’, ‘leave’, ‘pretty’, ‘severe’, ‘depression‘, ‘admittedly’, ‘hormone’, ‘vitamin’, ‘could’, ‘do’, ‘aggressively’, ‘recently’, ‘do’, ‘put’, ‘brain’, ‘glide’, ‘path’, ‘depression‘, ‘amazing’, ‘think’, ‘fix’, ‘fix’, ‘brain’, ‘still’, ‘aggressively’, ‘pursue’, ‘low’, ‘hear’, ‘cortisol’, ‘kill’, ‘testosterone’]



Neural Network Predicting Subreddit Karma and Comments for Mental Health Topics

Introduction:

Problem Description:

This project analyses whether attributes of the content of subreddit posts are predictive of the social support (karma points and number of comments) that the posts receive.

Context of the Problem:

Mental health problems tend to go under-reported and under-addressed, which places a high socio-economic cost on society. Research shows that social support is valuable for improving quality of life for people with mental health illnesses.

This project examines which content attributes of anonymous social media posts on the Reddit platform elicit higher levels of social support in the form of karma points and comments.

Limitations of Other Approaches:

We examined the two most relevant papers on the topic, [1] and [2]. Neither Schrading, N. et al. [1] nor De Choudhury, M. & De, S. [2] uses subreddit indicator variables (i.e., indicators for schizophrenia, depression, anxiety, etc.) in their analysis. It is likely that posts are treated differently depending on the mental illness indicated (as per Mann, C. E. & Himelein, M. J. [3], “stigmatization of schizophrenia was significantly higher than stigmatization of depression”). Also, De Choudhury, M. & De, S. [2] used a resource-intensive manual labelling approach to arrive at keywords.

Solution:

In this project, the analysis includes subreddit indicators in the neural network model predicting social support for Reddit posts. The figure below shows statistics for subreddit indicators for a sample dataset. It can be seen that the means of the target variables differ considerably between subreddits.

Additional inputs include counts of frequent bigrams and emotion labelling of keywords. Emotion labelling was done through an NLP approach, using an already existing emotions lexicon.

Background:

Reference: Schrading, N. et al. [1]
Explanation: They trained and compared multiple classifiers on the content of reddit posts to determine the top semantic and linguistic features for detecting abusive relationships.
Dataset/Input: Subreddit posts with comments that focus on domestic abuse, plus subreddit posts with comments unrelated to domestic abuse as a control set.
Weakness: Future studies could be implemented on datasets from multiple websites to compare online abuse patterns across forums.

Reference: De Choudhury, M. & De, S. [2]
Explanation: They trained a negative binomial regression model on the content of reddit posts (i.e., length, use of 1st pronoun, relationship words, emoticons, positive and negative words, etc.) to predict social support variables (karma points and number of responses).
Dataset/Input: Posts, comments and associated metadata from several mental health subreddits, including alcoholism, anxiety, bipolarreddit, depression, mentalhealth, MMFB (Make Me Feel Better), socialanxiety, SuicideWatch.
Weakness:
  • Out of the top 15 discussed predictor variables used in the regression model, the highest coefficients belong to the intercept and the use of the 1st pronoun.
  • There is no discussion of correlations between predictor variables (for example, the study uses such variables as negative emotion, positive emotion and number of emoticons, which could be correlated).

Methodology

Schrading, N. et al. [1] reported that, of the post features they analyzed, ngrams were the most predictive when detecting abusive relationships in reddit posts. De Choudhury, M. & De, S. [2] tried to predict social support variables for mental health related reddit posts using post length, emoticons, unigrams, variables built on the presence of emotionally charged unigrams, etc.

In this project, to predict social support variables (scores and number of comments) for mental health related reddit posts, the model was built using a neural network approach. The predictive variables were: emotionally charged unigrams as indicators of 10 different emotions, emotions count, post length, part of speech frequencies (counts of verbs, pronouns, adverbs and adjectives), count of first pronouns, number of question marks, count of frequent bigrams, and subreddit indicators.

Below is the list of the inputs used in the models for predicting the score and number of comments:

‘anger’, ‘anticipation’, ‘disgust’, ‘fear’, ‘joy’, ‘negative’, ‘positive’, ‘sadness’, ‘surprise’, ‘trust’, ‘len_post’, ‘len_post_orig’, ‘first_pronoun_count’, ‘freq_bigram_count’, ‘q_count’, ‘verb_count’, ‘pronoun_count’, ‘adverb_count’, ‘adjective_count’, Subreddit(display_name=‘BipolarReddit’), Subreddit(display_name=‘Anxiety’), Subreddit(display_name=‘depression’), Subreddit(display_name=‘schizophrenia’), Subreddit(display_name=‘bipolar’), Subreddit(display_name=‘mentalhealth’), Subreddit(display_name=‘depression_help’), Subreddit(display_name=‘BPD’), Subreddit(display_name=‘socialanxiety’), Subreddit(display_name=‘mentalillness’)
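A few of the simpler inputs in this list can be computed directly from the raw post text. The sketch below is an illustration under assumptions: the feature names come from the list above, but their exact definitions (length counted in whitespace-separated words, a singular first-person pronoun set) are guesses, not the project’s code.

```python
import re

# Assumed first-person pronoun set; the set actually used is not stated.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def surface_features(post: str) -> dict:
    """Compute a few text-surface inputs for one raw post."""
    # Splitting on letters only turns "I'm" into "i" and "m", so the pronoun is counted.
    words = re.findall(r"[a-z]+", post.lower())
    return {
        "len_post_orig": len(post.split()),  # assumed: length in words
        "first_pronoun_count": sum(w in FIRST_PERSON for w in words),
        "q_count": post.count("?"),
    }


sample = "I feel like my meds stopped working. Does anyone else feel this? I'm lost."
print(surface_features(sample))
```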

Emotion lexicon

A public lexicon dataset was used to determine counts of specific emotion words. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

https://nrc.canada.ca/en/research-development/products-services/technical-advisory-services/sentiment-emotion-lexicons
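Counting emotion words with such a lexicon amounts to a dictionary lookup per token. In the sketch below, the three-word lexicon is a hypothetical stand-in with the same word-to-labels shape; its entries are illustrative and are not copied from the NRC file.

```python
from collections import Counter

# Hypothetical miniature lexicon (word -> associated emotion/sentiment labels).
MINI_LEXICON = {
    "afraid": {"fear", "negative"},
    "hope": {"anticipation", "joy", "positive", "trust"},
    "alone": {"sadness", "negative"},
}


def emotion_counts(tokens, lexicon):
    """Count, per label, how many tokens in a post are associated with it."""
    counts = Counter()
    for token in tokens:
        for label in lexicon.get(token, ()):
            counts[label] += 1
    return counts


counts = emotion_counts(["afraid", "alone", "hope", "alone"], MINI_LEXICON)
print(counts["negative"], counts["sadness"], counts["anger"])  # 3 2 0
```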

Below are examples of posts with most frequent bigrams highlighted. Frequent bigrams ‘feel like’, ‘feels like’ are consistent with the finding by De Choudhury M. & De, S. [2] of frequent unigrams related to emotional expression.

N-grams

For this project we identified the most frequent bigrams and trigrams. Counts of the most frequent bigrams and trigrams were tried while testing various models, and the most useful turned out to be counts of the 16 most frequent bigrams, which were used as one of the inputs to the model.
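One way to derive such an input is to count bigrams over all posts, keep the top n, and then count how many of a post’s bigrams fall in that set. The sketch below assumes posts are already tokenized; the function names and the toy posts are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def top_bigrams(tokenized_posts, n=16):
    """Return the set of the n most frequent bigrams across all posts."""
    counts = Counter()
    for tokens in tokenized_posts:
        counts.update(bigrams(tokens))
    return {bigram for bigram, _ in counts.most_common(n)}


def freq_bigram_count(tokens, frequent):
    """Count how many bigrams in one post belong to the frequent set."""
    return sum(bigram in frequent for bigram in bigrams(tokens))


posts = [
    ["feel", "like", "nothing", "helps"],
    ["feel", "like", "giving", "up"],
    ["feels", "like", "home"],
]
frequent = top_bigrams(posts, n=1)
print(frequent)  # {('feel', 'like')}
```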

Below is the list of the most popular bigrams used and a few examples of their usage in raw texts.


Implementation

Data Collection

Data was obtained via a public API from 10 mental health subreddits: “depression”, “anxiety”, “bipolarreddit”, “mentalhealth”, “socialanxiety”, “depression_help”, “bipolar”, “BPD”, “schizophrenia”, and “mentalillness”.

  • First, checking 10 hot posts for each subreddit indicator
  • Collecting data

top_posts dimensions: (9949, 9)

hot_posts dimensions: (9890, 9)

new_posts dimensions: (9896, 9)


Preparing the Data

Reddit data scraping is limited to a maximum of 1000 records per subreddit for each of 3 post categories (“hot”, “top” and “new” posts). To maximize the dataset size, we collected posts from all 3 categories and removed duplicate records where the categories overlapped. As mentioned by De Choudhury, M. & De, S. [2], reddit posts receive most of their commentary within the first 3 days of being posted. Thus, we removed posts that were less than 3 days old at data collection time.
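The merge, de-duplication and age filter can be sketched as below. Posts are represented as plain dictionaries with an "id" and a "created_utc" Unix timestamp; this record shape and the function name are assumptions made for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def merge_and_filter(listings, collected_at, min_age_days=3):
    """Merge several post listings, drop duplicate ids and too-recent posts.

    `listings` is an iterable of lists of post dicts; `collected_at` is the
    Unix timestamp of the data collection.
    """
    min_age_seconds = min_age_days * SECONDS_PER_DAY
    kept = {}
    for listing in listings:
        for post in listing:
            if post["id"] in kept:
                continue  # already seen in another category
            if collected_at - post["created_utc"] >= min_age_seconds:
                kept[post["id"]] = post
    return list(kept.values())


now = 1_000_000
top = [{"id": "a", "created_utc": now - 4 * SECONDS_PER_DAY}]
hot = [{"id": "a", "created_utc": now - 4 * SECONDS_PER_DAY},
       {"id": "b", "created_utc": now - 1 * SECONDS_PER_DAY}]
print([p["id"] for p in merge_and_filter([top, hot], collected_at=now)])  # ['a']
```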

  • Removing stop words and punctuation
  • Creating ngrams (bigrams, trigrams and fourgrams)
  • Applying smoothing for trigrams and removing extra words that refer to posts unrelated to this analysis (e.g., moderators’ posts)
  • Creating the emotions dataframe, counting POS (part of speech) tags, and creating topic/subreddit dummies

Reddit score prediction model – results based on first layer weights:
In a multi-layer neural network it is hard to interpret raw internal weights, but mental health-specific variables (such as indicators for fear or surprise, or subreddit indicators) appear to be more important than generic ones (such as verb count or post length, which appears to be the least useful). In particular, most subreddit indicators (“depression_help”, “depression”, “schizophrenia”, etc.), which were not used in the other papers, are in the top 10 by total weight.
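A ranking by total first-layer weight could be computed along the following lines. This is a sketch under assumptions: it sums absolute weights per input (whether absolute values were used is not stated), and the weight matrix and input subset are invented for illustration.

```python
def input_importance(first_layer_weights, input_names):
    """Rank inputs by the sum of absolute first-layer weights.

    first_layer_weights[i][j] is the weight from input i to hidden unit j.
    """
    totals = {
        name: sum(abs(w) for w in row)
        for name, row in zip(input_names, first_layer_weights)
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented weights for three inputs and two hidden units.
weights = [[0.5, -0.25], [0.125, 0.125], [-1.0, 0.5]]
names = ["fear", "verb_count", "depression_help"]
print(input_importance(weights, names))
```

As the text notes, first-layer weights are only a rough signal in a multi-layer network, so a ranking like this is best read as indicative.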


Conclusion and Future Direction

In conclusion, the neural network results showed that the model inputs do have some predictive power for the social response variables ‘number of comments’ and ‘score’, as the sums of weights for input variables were found to be greater than zero. Also, during model testing, starting with fewer input variables and then adding the rest reduced the mean absolute errors.

One of the future improvements for this analysis could be incorporating a variable that indicates whether the post is from a throwaway account or an existing long-term reddit account, as De Choudhury, M. & De, S. [2] mention that reddit’s throwaway accounts allow individuals to express themselves more honestly and to ‘discuss uninhibited feelings’.

Also, while the content and length of post titles and how users act on posts (click, read, and reply) might have an impact on a post’s score, neither the research papers cited nor this analysis used title analysis as part of the model. Adding title attributes and post action statistics to the model could therefore be a potential area for improvement.


References:

[1]: Schrading, N., Alm, C. O., Ptucha, R., & Homan, C. M. An Analysis of Domestic Abuse Discourse on Reddit. The 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015, pages 2577-2583.

[2]: De Choudhury, M. & De, S. Mental Health Discourse on reddit: Self-Disclosure, Social Support, and Anonymity. Eighth International AAAI Conference on Weblogs and Social Media, North America, May 2014, pages 71-80. Available at: https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8075/8107.

[3]: Mann, C. E. & Himelein, M. J. Factors Associated with Stigmatization of Persons with Mental Illness. Psychiatric Services, Vol. 55, No. 2, February 2004, pages 185-197. Available at: https://ps.psychiatryonline.org/doi/pdf/10.1176/appi.ps.55.2.185.