I decided to do some natural language processing with RSS feeds. RSS (Really Simple Syndication) feeds let websites distribute their content in a standardized format that devices, applications, and services can consume and display. They are used primarily by news websites, blogs, and other online publishers to syndicate their content to readers. For example, we can take the URL of a feed on borderline personality disorder and pass it to feedparser in Python to obtain summaries of multiple articles. The URL that I chose for the RSS feed is below:
https://www.sciencedaily.com/rss/mind_brain/borderline_personality_disorder.xml
feedparser is a popular Python library designed to parse RSS and Atom feeds, both of which are XML-based formats for distributing website content updates. With feedparser, you can easily consume content from websites and process the structured information they provide. By using feedparser, we can read the feed from the URL and obtain the titles and summaries of all the articles contained there. Examples from the resulting feed.json file:
{'title': "Adolescents' personalities and coping habits affect social behaviors",
'summary': "A new study by a human development expert describes how adolescents' developing personalities and coping habits affect their behaviors toward others."},
{'title': 'Teenage mind: First time evidence links over interpretation of social situations to personality disorder',
'summary': 'Researchers have became interested in the way people think, how they organize thoughts, execute a decision, then determine whether a decision is good or bad.'},
{'title': 'Quitting smoking enhances personality change',
'summary': 'Researchers have found evidence that shows those who quit smoking show improvements in their overall personality.'},
{'title': 'Personality plays role in body weight: Impulsivity strongest predictor of obesity',
'summary': 'People with personality traits of high neuroticism and low conscientiousness are likely to go through cycles of gaining and losing weight throughout their lives, according to an examination of 50 years of data. Impulsivity was the strongest predictor of who would be overweight, the researchers found.'}
The summaries were aggregated into a single string, and the text was then preprocessed: converted to lowercase, tokenized, and stripped of English stopwords. The preprocessed content was then divided into individual sentences using a period ('.') as the delimiter, so that each sentence serves as one document for the TF-IDF calculation. I then set up a TfidfVectorizer restricted to trigrams (three-word sequences), treating each sequence of three words as a meaningful unit for the analysis. The max_df argument was set to 0.85, so trigrams that appeared in more than 85% of the sentences were ignored as too common to be informative.
For each trigram, I computed its average TF-IDF score across all sentences. The TF-IDF (Term Frequency-Inverse Document Frequency) score weighs how often a term occurs within a document against how many documents contain it, so a term that is frequent in one document but rare elsewhere scores highly. Trigrams were then ranked by their average TF-IDF scores in descending order, and the top 1% (a predefined percentage) were excluded from the word cloud. The idea is to focus on terms that are significant but not too common. For the word cloud generation, I first created a dictionary that maps trigrams to their respective average TF-IDF scores, restricted to the trigrams outside the top 1% identified earlier. From this list, I selected the 15 trigrams with the highest scores to be displayed prominently in the word cloud.