data science – Notes of a Neuropsychiatry Amateur

Percentile Confidence Interval Calculator

Percentile-Confidence-Interval-Calculation.ipynb

This Python script calculates the 95% confidence interval for a specified percentile (e.g., the 70th percentile) of a dataset. The confidence interval provides a range in which we expect the true percentile value to lie with 95% confidence.

The calculation makes use of the binomial distribution properties, making an assumption that our data can be modeled by a binomial distribution. This assumption may not always be accurate, especially for continuous data, but it provides an approximation for our purposes.

Assumptions

1. Binary Outcome: The fundamental assumption behind the binomial distribution is that there is a binary outcome, often termed as ‘success’ and ‘failure’. In the context of percentiles, you can think of ‘success’ as the instances below the percentile and ‘failure’ as the instances above.

2. Fixed Number of Trials: For the binomial distribution, there is a fixed number n of trials. In our case, n represents the total number of data points in our sample.

3. Independence: Each trial (or data point) is independent of others. This means the outcome of one trial does not affect the outcome of another.

4. Constant Probability of Success: The probability of success, q, is the same for each trial. Here, q represents the percentile value. For example, for the 70th percentile, q=0.7.

Why the Binomial Distribution?

The rationale behind using the binomial distribution for percentile confidence intervals is its direct applicability to cases where you’re looking at the proportion of observations below a certain threshold (i.e., a percentile).

When you’re asking about the 70th percentile, you’re essentially inquiring: “What’s the value below which 70% of my data falls?” This can be likened to asking about the number of successes in n trials, where a success is an observation below the desired threshold.

However, it’s important to note that this method provides an approximation. The binomial distribution is discrete and inherently based on counting successes in a set number of trials, while percentiles often come from continuous distributions and may not perfectly adhere to the assumptions above.

import numpy as np
from scipy.stats import binom
import seaborn as sns

Get some data

# Load the Iris dataset
iris = sns.load_dataset("iris")
# Use the 'sepal_length' feature
data = iris['sepal_length'].values

print(data[:50])

[5.1 4.9 4.7 4.6 5.  5.4 4.6 5.  4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.  5.  5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.
 5.5 4.9 4.4 5.1 5.  4.5 4.4 5.  5.1 4.8 5.1 4.6 5.3 5. ]

Calculate the 70th percentile

# Calculate the 70th percentile
percentile_70 = np.percentile(data, 70)
print("Min: %f, Max: %f, 70th percentile: %f" % (min(data), max(data), percentile_70))

Min: 4.300000, Max: 7.900000, 70th percentile: 6.300000

Convert the data to “success” (above the 70th percentile) and “failure”

successes = np.sum(data > percentile_70)
failures = len(data) - successes

# Now, `successes` is analogous to `q * n` in the binomial scenario.
# So, we can set:
n = len(data)
q = successes / n

print("n: %d, q: %f" % (n, q))

n: 150, q: 0.280000

Calculate the 95% confidence interval

The code calculates potential upper (u) and lower (l) bounds for a confidence interval using the binomial distribution’s percent-point function (ppf).

np.ceil(binom.ppf(1 – alpha / 2, n, q)) determines the approximate upper bound for the confidence interval and np.ceil(binom.ppf(alpha / 2, n, q)) for the lower bound.

+ np.arange(-2, 3) extends these bounds by adding an array of [-2, -1, 0, 1, 2], generating a set of potential boundaries around the original estimate.

u gives a sequence of indices in the dataset that demarcate the upper bound of the confidence interval. It starts from the calculated index for the 97.5th percentile and provides two more indices above and two below it.

l gives a sequence of indices in the dataset that demarcate the lower bound of the confidence interval. It starts from the calculated index for the 2.5th percentile and provides two more indices above and two below it.

alpha = 0.05
u = np.ceil(binom.ppf(1 - alpha / 2, n, q)) + np.arange(-2, 3)
u[u > n] = np.inf

l = np.ceil(binom.ppf(alpha / 2, n, q)) + np.arange(-2, 3)
l[l < 0] = -np.inf

print("u: " + ", ".join(map(str, u)))
print("l: " + ", ".join(map(str, l)))

u: 51.0, 52.0, 53.0, 54.0, 55.0
l: 29.0, 30.0, 31.0, 32.0, 33.0

sorted_data = np.sort(data)

# Extract values corresponding to the indices
# Correct way to interpret the u and l values
u_values = sorted_data[n - u.astype(int)]
l_values = sorted_data[l.astype(int) - 1]

print("Upper values:", u_values)
print("Lower values:", l_values)

Upper values: [6.3 6.2 6.2 6.2 6.2]
Lower values: [5.  5.  5.  5.  5.1]

Probability coverage

The code calculates the probability coverage of different combinations of potential confidence intervals formed by the lower bounds (l) and upper bounds (u). Coverage is a matrix of probabilities. The goal is to find the smallest confidence interval that guarantees coverage of at least 1−α.

coverage = np.zeros((len(l), len(u)))
for i, a in enumerate(l):
    for j, b in enumerate(u):
        coverage[i, j] = binom.cdf(b - 1, n, q) - binom.cdf(a - 1, n, q)

if np.max(coverage) < 1 - alpha:
    i = np.where(coverage == np.max(coverage))
else:
    i = np.where(coverage == np.min(coverage[coverage >= 1 - alpha]))

print("Coverage Matrix:")
print(coverage)

print("\nOptimal Indices (i_l, i_u):")
print(i)

Coverage Matrix:
[[0.93135214 0.95028522 0.96430299 0.97438285 0.98142424]
 [0.92730647 0.94623955 0.96025732 0.97033718 0.97737857]
 [0.92096076 0.93989385 0.95391161 0.96399148 0.97103286]
 [0.91140808 0.93034117 0.94435894 0.9544388  0.96148018]
 [0.89759319 0.91652627 0.93054404 0.9406239  0.94766529]]

Optimal Indices (i_l, i_u):
(array([0], dtype=int64), array([1], dtype=int64))

i_l = i[0][0]
i_u = i[1][0]
print("Chosen row of coverage matrix: %d, chosen column of coverage matrix: %d" % (i_l, i_u))

u_final = min(n, u[i_u])
u_final = max(0, int(u_final)-1)
        
l_final = min(n, l[i_l])
l_final = max(0, int(l_final)-1)

# Actual value corresponding to u_final and l_final
upper_value_threshold = n - u_final
lower_value_threshold = l_final

upper_value = sorted_data[upper_value_threshold]
lower_value = sorted_data[lower_value_threshold]

print("Lower bound value:", lower_value)
print("Upper bound value:", upper_value)

Chosen row of coverage matrix: 0, chosen column of coverage matrix: 1
Lower bound value: 5.0
Upper bound value: 6.3

import matplotlib.pyplot as plt

# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', edgecolor='black', alpha=0.7, label='Data')

# Adding vertical lines for lower_value and upper_value
plt.axvline(lower_value, color='red', linestyle='--', label='Lower bound')
plt.axvline(upper_value, color='green', linestyle='--', label='Upper bound')

# Adding vertical line for the 70th percentile
plt.axvline(percentile_70, color='purple', linestyle='-.', label='70th Percentile')

# Adding title and labels
plt.title('Histogram of Data with Confidence Bounds and 70th Percentile')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()

plt.show()

Bootstrap method

A commonly used alternative method to calculate confidence intervals for percentiles (also known as quantiles) is the Bootstrap method.

The Bootstrap method involves resampling the dataset multiple times with replacement and then computing the desired statistic (in this case, the 70th percentile) for each of these resampled datasets. This gives a distribution of the 70th percentiles from which we can compute the confidence interval.

lower: This represents the value below which the bottom 2.5% of your jotted down 70th percentiles fall. In other words, it’s like saying, “In 2.5% of our bootstrap ‘experiments,’ the 70th percentile was below this value.”

upper: This is the value below which the bottom 97.5% of your jotted down 70th percentiles fall. Put another way, “In 97.5% of our bootstrap ‘experiments,’ the 70th percentile was below this value.”

import numpy as np

def bootstrap_percentile_CI(data, percentile=70, alpha=0.05, B=10000):
    """Calculate the bootstrap confidence interval for a given percentile."""
    n = len(data)
    resampled_percentiles = []

    for _ in range(B):
        resample = np.random.choice(data, n, replace=True)
        resampled_percentiles.append(np.percentile(resample, percentile))

    lower = np.percentile(resampled_percentiles, 100 * alpha/2)
    upper = np.percentile(resampled_percentiles, 100 * (1-alpha/2))
    
    return lower, upper

# Calculate the bootstrap 70th percentile confidence interval
lower_bootstrap, upper_bootstrap = bootstrap_percentile_CI(data)
print("Bootstrap 70th percentile CI: (%.2f, %.2f)" % (lower_bootstrap, upper_bootstrap))

Bootstrap 70th percentile CI: (6.10, 6.43)

# Plotting
plt.hist(data, bins=30, color='lightblue', edgecolor='black', alpha=0.7)
plt.axvline(x=np.percentile(data, 70), color='green', linestyle='--', label="True 70th Percentile")
plt.axvline(x=lower_bootstrap, color='red', linestyle='--', label="Lower Bound of CI")
plt.axvline(x=upper_bootstrap, color='blue', linestyle='--', label="Upper Bound of CI")
plt.legend()
plt.title('Histogram of Sepal Length with Bootstrap CI for 70th Percentile')
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.show()

Discussion

The bootstrap method makes minimal assumptions about the distribution of the data, making it versatile for a wide variety of datasets. This flexibility allows the bootstrap to handle complex or unknown data distributions, whereas the binomial method assumes data follows a binomial distribution and is mainly suited for binary outcomes. While the binomial approach is computationally simpler and quicker, it might not always provide an accurate representation, especially if the underlying assumptions aren’t met. In contrast, the bootstrap can be more computationally intensive due to resampling but offers the advantage of being more adaptable and often provides a more accurate estimate for datasets that don’t strictly adhere to a binomial distribution.

Fructose Malabsorption – Applying the Luhn algorithm for text summarization

The Luhn algorithm is a text summarization technique that uses statistical properties of the text to identify and extract the most important sentences from a document. The algorithm was developed by H.P. Luhn in the 1950s, and is still widely used in various forms today.

The Luhn algorithm works by first analyzing the frequency of each word in the document, and then assigning a score to each sentence based on the frequency of the words it contains. Sentences that contain words that are more frequent in the document as a whole are considered to be more important, and are assigned higher scores. The algorithm then selects the top-scoring sentences and concatenates them together to form the summary. The length of the summary is usually determined in advance by the user, and the algorithm selects the most important sentences that fit within that length limit.

It works by identifying the most salient or important sentences in a document based on the frequency of important words and their distribution within each sentence. First, the algorithm removes stopwords, which are common words such as “the”, “and”, and “a” that do not carry much meaning. Additionally, one could apply stemming, which reduces words to their base or root form. For example, “likes” and “liked” are reduced to “like”. Then, the algorithm looks for important words in each sentence. These are typically nouns, verbs, and adjectives that carry the most meaning. The specific method for identifying important words may vary depending on the implementation of the algorithm, but in general, they are selected based on their frequency and relevance to the topic of the text.

The algorithm counts the number of important words in each sentence and divides it by the span, or the distance between the first and last occurrence of an important word. This gives a measure of how densely the important words are distributed within the sentence. Finally, the algorithm ranks the sentences based on their scores, with the highest scoring sentences considered the most important and selected for the summary.

Here are the step-by-step instructions for the Luhn algorithm:

Preprocess the text: Remove any stop words, punctuation, and other non-textual elements from the document, and convert all the remaining words to lowercase.
Calculate the word frequency: Count the number of occurrences of each word in the document, and store this information in a frequency table.
For each sentence, calculate the score by:
a. Identifying the significant words (excluding stop words) that occur in the sentence.
b. Ordering the significant words by their position in the sentence.
c. Determining the distance between adjacent significant words (the “span”).
d. Calculating a score for the sentence as the sum of the square of the number of significant words divided by the span for each adjacent pair of significant words.
Select the top-scoring sentences: Sort the sentences in the document by their score, and select the top-scoring sentences up to a maximum length L. The length L is typically chosen by the user in advance, and represents the maximum number of words or sentences that the summary can contain.
Generate the summary: Concatenate the selected sentences together to form the summary.

Below I summarize the topic of fructose malabsorption by generating a summary using the Luhn algorithm. To create the summary, I selected several articles from sources like Wikipedia and PubMed. The important words were selected based on their total frequency in all of the text. I chose the top 25 words to focus on, and then used the algorithm to identify the most important sentences based on the frequency and distribution of these words. The summary was generated using the top 15 sentences.

Symptoms and signs of Fructose malabsorption may cause gastrointestinal symptoms such as abdominal pain, bloating, flatulence or diarrhea. Although often assumed to be an acceptable alternative to wheat, spelt flour is not suitable for people with fructose malabsorption, just as it is not appropriate for those with wheat allergies or celiac disease. However, fructose malabsorbers do not need to avoid gluten, as those with celiac disease must. Many fructose malabsorbers can eat breads made from rye and corn flour. This can cause some surprises and pitfalls for fructose malabsorbers. Foods (such as bread) marked “gluten-free” are usually suitable for fructose malabsorbers, though they need to be careful of gluten-free foods that contain dried fruit or high fructose corn syrup or fructose itself in sugar form. Food-labeling Producers of processed food in most or all countries, including the US, are not currently required by law to mark foods containing “fructose in excess of glucose”.

Stone fruit: apricot, nectarine, peach, plum (caution – these fruits contain sorbitol);Berry fruit: blackberry, boysenberry, cranberry, raspberry, strawberry, loganberry; Citrus fruit: kumquat, grapefruit, lemon, lime, mandarin, orange, tangelo; Other fruits: ripe banana, jackfruit, passion fruit, pineapple, rhubarb, tamarillo.

The fructose and glucose contents of foods listed on the Australian food standards would appear to indicate that most of the listed foods have higher fructose levels.

Glucose enhances absorption of fructose, so fructose from foods with fructose-to-glucose ratio <1, like white potatoes, are readily absorbed, whereas foods with fructose-to-glucose ratio >1, like apples and pears, are often problematic regardless of the total amount of fructose in the food.

Foods with high fructose-to-glucose ratio.

Dietary supplements of xylose isomerase may improve some symptoms of fructose malabsorption, although there is currently only a single scientific study available.

This can lead to issues such as bloating, heartburn and constipation.

It is called a hydrogen breath test and is the method currently used for a clinical diagnosis.

People with fructose malabsorption absorb less than 25 g per sitting.

Fructose malabsorption should be considered in patients with symptoms of depression and disturbances of tryptophan metabolism. This prospective cross-sectional study aimed to investigate the potential association between primary-adult lactose malabsorption, fructose malabsorption, tryptophan (TRP) metabolism and the presence of depressive signs.

High intestinal fructose concentration seems to interfere with L-tryptophan metabolism, and it may reduce availability of tryptophan for the biosynthesis of serotonin (5-hydroxytryptamine).

After obtaining the top sentences above, I used them as input into ChatGPT to generate another summary. The summary ended up with some mistakes, such as “People with this condition should avoid foods like spelt flour that contain fructose.“, as spelt does not contain fructose, it contains fructans. But it was still a fun exercise.

Fructose malabsorption can cause gastrointestinal symptoms such as abdominal pain, bloating, flatulence, and diarrhea. People with this condition should avoid foods like spelt flour that contain fructose. However, they do not need to avoid gluten, and many can eat breads made from rye and corn flour. Foods marked “gluten-free” are generally suitable for fructose malabsorbers, but they need to be careful of gluten-free foods that contain dried fruit or high fructose corn syrup. Certain fruits such as apricots, nectarines, peaches, and plums can cause issues due to their sorbitol content. The fructose-to-glucose ratio of foods can also play a role in absorption, with foods with a higher ratio often being problematic. A hydrogen breath test is used for clinical diagnosis, and dietary supplements of xylose isomerase may help with symptoms. Fructose malabsorption may also be associated with depressive signs due to the interference with L-tryptophan metabolism.

Summarizing articles on PMDD treatments using TextRank

In this blog post, I want to share with you what I learned about treating PMDD using articles summarization through TextRank. TextRank is not really a summarization algorithm, it is used for extracting top sentences, but I decided to use it anyways and see the results. I started by using the googlesearch library in python to search for “PMDD treatments – calcium, hormones, SSRIs, scientific evidence”. The search resulted in a list of URLs to various articles on PMDD treatments. However, not all of them were useful for my purposes, as some were blocked due to access restrictions. I used BeautifulSoup to extract the text from the remaining articles.

In order to exclude irrelevant paragraphs, I used the library called Justext. This library is designed for removing boilerplate content and other non-relevant text from HTML pages. Justext uses a heuristics to determine which parts of the page are boilerplate and which are not, and then filters out the former. Justext tries to identify these sections by analyzing the length of the text, the density of links, and the presence of certain HTML tags.

Some examples of the kinds of content that Justext can remove include navigation menus, copyright statements, disclaimers, and other non-content-related text. It does not work perfectly, as I still ended up with sentences such as the following in the resulting articles: “This content is owned by the AAFP. A person viewing it online may make one printout of the material and may use that printout only for his or her personal, non-commercial reference.”

Next, I used existing code that implements the TextRank algorithm that I found online. I slightly improved it so that instead of bag of words method the algorithm would use sentence embeddings. Let’s go step by step through the algorithm. I defined a class called TextRank4Sentences. Here is a description of each line in the __init__ method of this class:

self.damping = 0.85: This sets the damping coefficient used in the TextRank algorithm to 0.85. In this case, it determines the probability of the algorithm to transition from one sentence to another.

self.min_diff = 1e-5: This sets the convergence threshold. The algorithm will stop iterating when the difference between the PageRank scores of two consecutive iterations is less than this value.

self.steps = 100: This sets the number of iterations to run the algorithm before stopping.

self.text_str = None: This initializes a variable to store the input text.

self.sentences = None: This initializes a variable to store the individual sentences of the input text.

self.pr_vector = None: This initializes a variable to store the TextRank scores for each sentence in the input text.

from nltk import sent_tokenize, word_tokenize
from nltk.cluster.util import cosine_distance
from sklearn.metrics.pairwise import cosine_similarity

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

MULTIPLE_WHITESPACE_PATTERN = re.compile(r"\s+", re.UNICODE)

class TextRank4Sentences():
    def __init__(self):
        self.damping = 0.85  # damping coefficient, usually is .85
        self.min_diff = 1e-5  # convergence threshold
        self.steps = 100  # iteration steps
        self.text_str = None
        self.sentences = None
        self.pr_vector = None

The next step is defining a private method _sentence_similarity() which takes in two sentences and returns their cosine similarity using a pre-trained model. The method encodes each sentence into a vector using the pre-trained model and then calculates the cosine similarity between the two vectors using another function core_cosine_similarity().

core_cosine_similarity() is a separate function that measures the cosine similarity between two vectors. It takes in two vectors as inputs and returns a similarity score between 0 and 1. The function uses the cosine_similarity() function from the sklearn library to calculate the similarity score. The cosine similarity is a measure of the similarity between two non-zero vectors of an inner product space. It is calculated as the cosine of the angle between the two vectors.

Mathematically, given two vectors u and v, the cosine similarity is defined as:

cosine_similarity(u, v) = (u . v) / (||u|| ||v||)

where u . v is the dot product of u and v, and ||u|| and ||v|| are the magnitudes of u and v respectively.

def core_cosine_similarity(vector1, vector2):
    """
    measure cosine similarity between two vectors
    :param vector1:
    :param vector2:
    :return: 0 < cosine similarity value < 1
    """
    sim_score = cosine_similarity(vector1, vector2)
    return sim_score

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        first_sent_embedding = model.encode([sent1])
        second_sent_embedding = model.encode([sent2])
        
        return core_cosine_similarity(first_sent_embedding, second_sent_embedding)

In the next function, the similarity matrix is built for the given sentences. The function _build_similarity_matrix takes a list of sentences as input and creates an empty similarity matrix sm with dimensions len(sentences) x len(sentences). Then, for each sentence in the list, the function computes its similarity with all other sentences in the list using the _sentence_similarity function. After calculating the similarity scores for all sentence pairs, the function get_symmetric_matrix is used to make the similarity matrix symmetric.

The function get_symmetric_matrix adds the transpose of the matrix to itself, and then subtracts the diagonal elements of the original matrix. In other words, for each element (i, j) of the input matrix, the corresponding element (j, i) is added to it to make it symmetric. However, the diagonal elements (i, i) of the original matrix are not added twice, so they need to be subtracted once from the sum of the corresponding elements in the upper and lower triangles. The resulting matrix has the same values in the upper and lower triangles, and is symmetric along its main diagonal. The similarity matrix is made symmetric in order to ensure that the similarity score between two sentences in the matrix is the same regardless of their order, and it also simplifies the computation.

def get_symmetric_matrix(matrix):
    """
    Get Symmetric matrix
    :param matrix:
    :return: matrix
    """
    return matrix + matrix.T - np.diag(matrix.diagonal())

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        # create an empty similarity matrix
        sm = np.zeros([len(sentences), len(sentences)])
    
        for idx, sentence in enumerate(sentences):
            print("Current location: %d" % idx)
            sm[idx] = self._sentence_similarity(sentence, sentences)
    
        # Get Symmeric matrix
        sm = get_symmetric_matrix(sm)
    
        # Normalize matrix by column
        norm = np.sum(sm, axis=0)
        sm_norm = np.divide(sm, norm, where=norm != 0)  # this is ignore the 0 element in norm
    
        return sm_norm

In the next function, the ranking algorithm PageRank is implemented to calculate the importance of each sentence in the document. The similarity matrix created in the previous step is used as the basis for the PageRank algorithm. The function takes the similarity matrix as input and initializes the pagerank vector with a value of 1 for each sentence.

In each iteration, the pagerank vector is updated based on the similarity matrix and damping coefficient. The damping coefficient represents the probability of continuing to another sentence at random, rather than following a link from the current sentence. The algorithm continues to iterate until either the maximum number of steps is reached or the difference between the current and previous pagerank vector is less than a threshold value. Finally, the function returns the pagerank vector, which represents the importance score for each sentence.

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        ...

    def _run_page_rank(self, similarity_matrix):

        pr_vector = np.array([1] * len(similarity_matrix))

        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr_vector = (1 - self.damping) + self.damping * np.matmul(similarity_matrix, pr_vector)
            if abs(previous_pr - sum(pr_vector)) < self.min_diff:
                break
            else:
                previous_pr = sum(pr_vector)

        return pr_vector

The _get_sentence function takes an index as input and returns the corresponding sentence from the list of sentences. If the index is out of range, it returns an empty string. This function is used later in the class to get the highest ranked sentences.

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        ...

    def _run_page_rank(self, similarity_matrix):
        ...

    def _get_sentence(self, index):

        try:
            return self.sentences[index]
        except IndexError:
            return ""

The code then defines a method called get_top_sentences which returns a summary of the most important sentences in a document. The method takes two optional arguments: number (default=5) specifies the maximum number of sentences to include in the summary, and similarity_threshold (default=0.5) specifies the minimum similarity score between two sentences that should be considered “too similar” to include in the summary.

The method first initializes an empty list called top_sentences to hold the selected sentences. It then checks if a pr_vector attribute has been computed for the document. If the pr_vector exists, it sorts the indices of the sentences in descending order based on their PageRank scores and saves them in the sorted_pr variable.

It then iterates through the sentences in sorted_pr, starting from the one with the highest PageRank score. For each sentence, it removes any extra whitespace, replaces newlines with spaces, and checks if it is too similar to any of the sentences already selected for the summary. If it is not too similar, it adds the sentence to top_sentences. Once the selected sentences are finalized, the method concatenates them into a single string separated by spaces, and returns the summary.

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        ...

    def _run_page_rank(self, similarity_matrix):
        ...

    def _get_sentence(self, index):
        ...
   
    def get_top_sentences(self, number=5, similarity_threshold=0.5):
        top_sentences = []
    
        if self.pr_vector is not None:
            sorted_pr = np.argsort(self.pr_vector)
            sorted_pr = list(sorted_pr)
            sorted_pr.reverse()
    
            index = 0
            while len(top_sentences) < number and index < len(sorted_pr):
                sent = self.sentences[sorted_pr[index]]
                sent = normalize_whitespace(sent)
                sent = sent.replace('\n', ' ')
    
                # Check if the sentence is too similar to any of the sentences already in top_sentences
                is_similar = False
                for s in top_sentences:
                    sim = self._sentence_similarity(sent, s)
                    if sim > similarity_threshold:
                        is_similar = True
                        break
    
                if not is_similar:
                    top_sentences.append(sent)
    
                index += 1
        
        summary = ' '.join(top_sentences)
        return summary

The _remove_duplicates method takes a list of sentences as input and returns a list of unique sentences, by removing any duplicates in the input list.

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        ...

    def _run_page_rank(self, similarity_matrix):
        ...

    def _get_sentence(self, index):
        ...
   
    def get_top_sentences(self, number=5, similarity_threshold=0.5):
        ...
    
    def _remove_duplicates(self, sentences):
        seen = set()
        unique_sentences = []
        for sentence in sentences:
            if sentence not in seen:
                seen.add(sentence)
                unique_sentences.append(sentence)
        return unique_sentences

The analyze method takes a string text and a list of stop words stop_words as input. It first creates a unique list of words from the input text by using the set() method and then joins these words into a single string self.full_text.

It then uses the sent_tokenize() method from the nltk library to tokenize the text into sentences and removes duplicate sentences using the _remove_duplicates() method. It also removes sentences that have a word count less than or equal to the fifth percentile of all sentence lengths.

After that, the method calculates a similarity matrix using the _build_similarity_matrix() method, passing in the preprocessed list of sentences and the stop_words list.

Finally, it runs the PageRank algorithm on the similarity matrix using the _run_page_rank() method to obtain a ranking of the sentences based on their importance in the text. This ranking is stored in self.pr_vector.

class TextRank4Sentences():
    ...

    def analyze(self, text, stop_words=None):
        self.text_unique = list(set(text))
        self.full_text = ' '.join(self.text_unique)
        #self.full_text = self.full_text.replace('\n', ' ')
        
        self.sentences = sent_tokenize(self.full_text)
        
        # for i in range(len(self.sentences)):
        #     self.sentences[i] = re.sub(r'[^\w\s$]', '', self.sentences[i])
    
        self.sentences = self._remove_duplicates(self.sentences)
        
        sent_lengths = [len(sent.split()) for sent in self.sentences]
        fifth_percentile = np.percentile(sent_lengths, 10)
        self.sentences = [sentence for sentence in self.sentences if len(sentence.split()) > fifth_percentile]

        print("Min length: %d, Total number of sentences: %d" % (fifth_percentile, len(self.sentences)) )

        similarity_matrix = self._build_similarity_matrix(self.sentences, stop_words)

        self.pr_vector = self._run_page_rank(similarity_matrix)

In order to find articles, I used the googlesearch library. The code below performs a Google search using the Google Search API provided by the library. It searches for the query “PMDD treatments – calcium, hormones, SSRIs, scientific evidence” and retrieves the top 7 search results.

# summarize articles
import requests
from bs4 import BeautifulSoup
from googlesearch import search
import justext
query = "PMDD treatments - calcium, hormones, SSRIs, scientific evidence"

# perform the google search and retrieve the top 5 search results
top_results = []
for url in search(query, num_results=7):
    top_results.append(url)

In the next part, the code extracts the article text for each of the top search results collected in the previous step. For each URL in the top_results list, the code sends an HTTP GET request to the URL using the requests library. It then uses the justext library to extract the main content of the webpage by removing any boilerplate text (i.e., non-content text).

article_texts = []

# extract the article text for each of the top search results
for url in top_results:
    response = requests.get(url)
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
    text = ''
    for paragraph in paragraphs:
        if not paragraph.is_boilerplate:
            text += paragraph.text + '\n'

    if "Your access to PubMed Central has been blocked" not in text:
        article_texts.append(text.strip())
        print(text)
    print('-' * 50)
    
print("Total articles collected: %d" % len(article_texts))

In the final step, the extracted article texts are passed to an instance of the TextRank4Sentences class, which is used to perform text summarization. The output of get_top_sentences() is a list of the top-ranked sentences in the input text, which are considered to be the most important and representative sentences for summarizing the content of the text. This list is stored in the variable summary_text.

# summarize
tr4sh = TextRank4Sentences()
tr4sh.analyze(article_texts)
summary_text = tr4sh.get_top_sentences(15)

Results:
(I did not list irrelevant sentences that appeared in the final results, such as “You will then receive an email that contains a secure link for resetting your password…“)

Total articles collected: 6

There have been at least 15 randomized controlled trials of the use of selective serotonin-reuptake inhibitors (SSRIs) for the treatment of severe premenstrual syndrome (PMS), also called premenstrual dysphoric disorder (PMDD).

It is possible that the irritability/anger/mood swings subtype of PMDD is differentially responsive to treatments that lead to a quick change in ALLO availability or function, for example, symptom-onset SSRI or dutasteride.
* My note: ALLO is allopregnanolone
* My note: Dutasteride is a synthetic 4-azasteroid compound that is a selective inhibitor of both the type 1 and type 2 isoforms of steroid 5 alpha-reductase

From 2 to 10 percent of women of reproductive age have severe distress and dysfunction caused by premenstrual dysphoric disorder, a severe form of premenstrual syndrome.

The rapid efficacy of selective serotonin reuptake inhibitors (SSRIs) in PMDD may be due in part to their ability to increase ALLO levels in the brain and enhance GABAA receptor function with a resulting decrease in anxiety.

Clomipramine, a serotoninergic tricyclic antidepressant that affects the noradrenergic system, in a dosage of 25 to 75 mg per day used during the full cycle or intermittently during the luteal phase, significantly reduced the total symptom complex of PMDD.

Relapse was more likely if a woman stopped sertraline after only 4 months versus 1 year, if she had more severe symptoms prior to treatment and if she had not achieved full symptom remission with sertraline prior to discontinuation.

Women with negative views of themselves and the future caused or exacerbated by PMDD may benefit from cognitive-behavioral therapy. This kind of therapy can enhance self-esteem and interpersonal effectiveness, as well as reduce other symptoms.

Educating patients and their families about the disorder can promote understanding of it and reduce conflict, stress, and symptoms.

Anovulation can also be achieved with the administration of estrogen (transdermal patch, gel, or implant).

In a recent meta-analysis of 15 randomized, placebo-controlled studies of the efficacy of SSRIs in PMDD, it was concluded that SSRIs are an effective and safe first-line therapy and that there is no significant difference in symptom reduction between continuous and intermittent dosing.

Preliminary confirmation of alleviation of PMDD with suppression of ovulation with a GnRH agonist should be obtained prior to hysterectomy.

Sexual side effects, such as reduced libido and inability to reach orgasm, can be troubling and persistent, however, even when dosing is intermittent. * My note: I think this sentence refers to the side-effects of SSRIs

Calculating Confidence Interval for a Percentile

Calculating the confidence interval for a percentile is a crucial step in understanding the variability and the uncertainty around the estimated value. In many real-world applications, the distribution of the data is unknown and this makes it difficult to determine the confidence intervals. In such scenarios, using a binomial distribution can be a viable alternative to estimate the confidence intervals for a percentile.

For instance, let’s consider a variable with 300 data points and we want to calculate the 70th and 90th percentiles and the corresponding confidence intervals for the variable. To do this, we can use a binomial distribution approach.

First, we need to choose an alpha level, which is a probability that determines the size of the confidence interval. A common choice for alpha is 0.05, which corresponds to a 95% confidence interval.

Next, we use the cumulative distribution function (CDF) of the binomial distribution to estimate the lower and upper bounds of the confidence interval. The CDF of the binomial distribution gives the probability of getting k or fewer successes in n independent Bernoulli trials, where the probability of success in each trial is p.

To calculate the 70th percentile and its confidence interval, we use the following steps:

Set n = 300, which is the number of data points.
Set p = 0.7, which corresponds to the 70th percentile.
Calculate the binomial quantile using the CDF, which is the smallest k such that P(X <= k) >= p, where X is a binomial random variable with parameters n and p.
Use the CDF to determine the lower and upper bounds of the confidence interval.

Below is the python code for calculating the confidence interval for the 70th percentile.

alpha – alpha is a parameter representing the significance level or confidence level for the calculation of the confidence interval. It is the probability that the confidence interval contains the true value of the parameter being estimated. The value of alpha is typically set to 0.05 or 0.01, meaning that there is a 95% or 99% chance, respectively, that the confidence interval contains the true value. In the code, alpha=0.05 is the default value for alpha, but it can be changed to a different value if desired.

n – number of observations

q – percentile value

from scipy.stats import binom
import numpy as np

alpha = 0.05
n = 300
q = 0.7

Below is the code for calculating the upper and lower bounds for the confidence interval. The u value is calculated as the ceiling of the binomial distribution’s quantile function (ppf) evaluated at 1 – alpha / 2 (1 – 0.05 / 2 = 0.975), and the value is shifted by adding an array of numbers from -2 to 2. Any values of u that are greater than n are set to infinity.

u = np.ceil(binom.ppf(1 - alpha / 2, n, q)) + np.arange(-2, 3)
u[u > n] = np.inf

l = np.ceil(binom.ppf(alpha / 2, n, q)) + np.arange(-2, 3)
l[l < 0] = -np.inf

# From the calculation of bounds, np.ceil(binom.ppf(1 - alpha / 2, n, q)) and np.ceil(binom.ppf(alpha / 2, n, q)), we obtain that
# the upper bound value is 225 and the lower bound value is 194. This means that given a sample of size 300, a binomial distribution, and # probability of success p=0.7, we are 95% certain that the number of successes will be between 194 and 225.

Next we calculate coverage of the percentiles that the bounds cover. The coverage represents a matrix of values that correspond to the probability of coverage of the confidence interval for each combination of lower and upper bounds of the interval.

The coverage calculation uses the binom.cdf function to calculate the cumulative distribution function (CDF) for the binomial distribution, which is then used to determine the coverage probability of each combination of u and l. Once the coverage matrix is calculated, the code finds the index i corresponding to the combination of u and l that gives the closest coverage probability to 1-alpha.

coverage = np.zeros((len(l), len(u)))

for i, a in enumerate(l):
    for j, b in enumerate(u):
        coverage[i, j] = binom.cdf(b - 1, n, q) - binom.cdf(a - 1, n, q)

Next we select the upper and lower bounds of the confidence interval based on the coverage of the interval. The code first checks if the maximum coverage is less than 1 minus the significance level alpha. If it is, the code selects the pair of bounds with the maximum coverage probability. Otherwise, the code selects the pair of bounds with the smallest coverage probability that is still greater than or equal to 1 minus alpha.

if np.max(coverage) < 1 - alpha:
    i = np.where(coverage == np.max(coverage))
else:
    i = np.where(coverage == np.min(coverage[coverage >= 1 - alpha]))

i_u = i[0][0]
i_l = i[1][0]

u_final = min(n, u[i_u])
u_final = max(0, int(u_final)-1)
        
l_final = min(n, l[i_l])
l_final = max(0, int(l_final)-1)

The resulting l and u are 192 and 223, respectively. Therefore if you have a sample of 300 and you want to calculate the confidence interval for a variable X, you would sort the values in ascending order, and then you would take the values of X that correspond to the 192nd and 223rd observations.

NLP – Word Embeddings – ELMo

ELMo (Embeddings from Language Models) is a deep learning approach for representing words as vectors (also called word embeddings). It was developed by researchers at Allen Institute for Artificial Intelligence and introduced in a paper published in 2018.

ELMo represents words as contextualized embeddings, meaning that the embedding for a word can change based on the context in which it is used. For example, the word “bank” could have different embeddings depending on whether it is used to refer to a financial institution or the edge of a river.

ELMo has been shown to improve the performance of a variety of natural language processing tasks, including language translation, question answering, and text classification. It has become a popular approach for representing words in NLP models, and the trained ELMo embeddings are freely available for researchers to use.

How does ELMo differ from Word2Vec or GloVe?

ELMo (Embeddings from Language Models) is a deep learning approach for representing words as vectors (also called word embeddings). It differs from other word embedding approaches, such as Word2Vec and GloVe, in several key ways:

Contextualized embeddings: ELMo represents words as contextualized embeddings, meaning that the embedding for a word can change based on the context in which it is used. In contrast, Word2Vec and GloVe represent words as static embeddings, which do not take into account the context in which the word is used.
Deep learning approach: ELMo uses a deep learning model, specifically a bidirectional language model, to generate word embeddings. Word2Vec and GloVe, on the other hand, use more traditional machine learning approaches based on a neural network (Word2Vec) and matrix factorization (GloVe).

To generate context-dependent embeddings, ELMo uses a bi-directional Long Short-Term Memory (LSTM) network trained on a specific task (such as language modeling or machine translation). The LSTM processes the input sentence in both directions (left to right and right to left) and generates an embedding for each word based on its context in the sentence.

Overall, ELMo is a newer approach for representing words as vectors that has been shown to improve the performance of a variety of natural language processing tasks. It has become a popular choice for representing words in NLP models.

What is the model for training ELMo word embeddings?

The model used to train ELMo word embeddings is a bidirectional language model, which is a type of neural network that is trained to predict the next word in a sentence given the context of the words that come before and after it. To train the ELMo model, researchers at Allen Institute for Artificial Intelligence used a large dataset of text, such as news articles, books, and websites. The model was trained to predict the next word in a sentence given the context of the words that come before and after it. During training, the model learns to represent words as vectors (also called word embeddings) that capture the meaning of the word in the context of the sentence.

Explain in details the bidirectional language model

A bidirectional language model is a type of neural network that is trained to predict the next word in a sentence given the context of the words that come before and after it. It is called a “bidirectional” model because it takes into account the context of words on both sides of the word being predicted.

To understand how a bidirectional language model works, it is helpful to first understand how a unidirectional language model works. A unidirectional language model is a type of neural network that is trained to predict the next word in a sentence given the context of the words that come before it.

A unidirectional language model can be represented by the following equation:

P(w[t] | w[1], w[2], …, w[t-1]) = f(w[t-1], w[t-2], …, w[1])

This equation says that the probability of a word w[t] at time t (where time is the position of the word in the sentence) is determined by a function f of the words that come before it (w[t-1], w[t-2], …, w[1]). The function f is learned by the model during training.

A bidirectional language model extends this equation by also taking into account the context of the words that come after the word being predicted:

P(w[t] | w[1], w[2], …, w[t-1], w[t+1], w[t+2], …, w[n]) = f(w[t-1], w[t-2], …, w[1], w[t+1], w[t+2], …, w[n])

This equation says that the probability of a word w[t] at time t is determined by a function f of the words that come before it and the words that come after it. The function f is learned by the model during training.

In practice, a bidirectional language model is implemented as a neural network with two layers: a forward layer that processes the input words from left to right (w[1], w[2], …, w[t-1]), and a backward layer that processes the input words from right to left (w[n], w[n-1], …, w[t+1]). The output of these two layers is then combined and used to predict the next word in the sentence (w[t]). The forward and backward layers are typically implemented as recurrent neural networks (RNNs) or long short-term memory (LSTM) networks, which are neural networks that are designed to process sequences of data.

During training, the bidirectional language model is fed a sequence of words and is trained to predict the next word in the sequence. The model uses the output of the forward and backward layers to generate a prediction, and this prediction is compared to the actual next word in the sequence. The model’s weights are then updated to minimize the difference between the prediction and the actual word, and this process is repeated for each word in the training dataset. After training, the bidirectional language model can be used to generate word embeddings by extracting the output of the forward and backward layers for each word in the input sequence.

ELMo model training algorithm

Initialize the word vectors:

The word vectors are usually initialized randomly using a Gaussian distribution.
Alternatively, you can use pre-trained word vectors such as Word2Vec or GloVe.

Process the input sequence:

Input the sequence of words w[1], w[2], ..., w[t-1] into the forward layer and the backward layer.
The forward layer processes the words from left to right, and the backward layer processes the words from right to left.
Each layer has its own set of weights and biases, which are updated during training.

Compute the output:

The output of the forward layer and the backward layer are combined to form the final output o[t].
The final output is used to predict the next word w[t].

Compute the loss:

The loss is computed as the difference between the predicted word w[t] and the true word w[t].
The loss function is usually the cross-entropy loss, which measures the difference between the predicted probability distribution and the true probability distribution.

Update the weights and biases:

The weights and biases of the forward and backward layers are updated using gradient descent and backpropagation.

Repeat steps 2-5 for all words in the input sequence.

ELMo generates contextualized word embeddings by combining the hidden states of a bi-directional language model (BLM) in a specific way.

The BLM consists of two layers: a forward layer that processes the input words from left to right, and a backward layer that processes the input words from right to left. The hidden state of the BLM at each position t is a vector h[t] that represents the context of the word at that position.

To generate the contextualized embedding for a word, ELMo concatenates the hidden states from the forward and backward layers and applies a weighted summation. The hidden states are combined using a task-specific weighting of all biLM layers. The weighting is controlled by a set of learned weights γ_task and a bias term s_task. The ELMo embeddings for a word at position k are computed as a weighted sum of the hidden states from all L layers of the biLM:

ELMo_task_k = E(R_k; Θtask) = γ_task_L * h_LM_k,L + γ_task{L-1} * h_LM_k,{L-1} + … + γ_task_0 * h_LM_k,0 + s_task

Here, h_LM_k,j represents the hidden state at position k and layer j of the biLM, and γ_task_j and s_task are the task-specific weights and bias term, respectively. The task-specific weights and bias term are learned during training, and are used to combine the hidden states in a way that is optimal for the downstream task.

Using ELMo for NLP tasks

ELMo can be used to improve the performance of supervised NLP tasks by providing context-dependent word embeddings that capture not only the meaning of the individual words, but also their context in the sentence.

To use a pre-trained bi-directional language model (biLM) for a supervised NLP task, the first step is to run the biLM and record the layer representations for each word in the input sequence. These layer representations capture the context-dependent information about the words in the sentence, and can be used to augment the context-independent token representation of each word.

In most supervised NLP models, the lowest layers are shared across different tasks, and the task-specific information is encoded in the higher layers. This allows ELMo to be added to the model in a consistent and unified manner, by simply concatenating the ELMo embeddings with the context-independent token representation of each word.

The model then combines the ELMo embeddings with the context-independent token representation to form a context-sensitive representation h_k, typically using either bidirectional RNNs, CNNs, or feed-forward networks. The context-sensitive representation h_k is then used as input to the higher layers of the model, which are task-specific and encode the information needed to perform the target NLP task. It can be helpful to add a moderate amount of dropout to ELMo and to regularize the ELMo weights by adding a regularization term to the loss function. This can help to prevent overfitting and improve the generalization ability of the model.

NLP – Word Embeddings – FastText

What is the FastText method for word embeddings?

FastText is a library for efficient learning of word representations and sentence classification. It was developed by Facebook AI Research (FAIR).

FastText represents each word in a document as a bag of character n-grams. For example, the word “apple” would be represented as the following character n-grams: “a”, “ap”, “app”, “appl”, “apple”, “p”, “pp”, “ppl”, “pple”, “p”, “pl”, “ple”, “l”, “le”. This representation has two advantages:

It can handle spelling mistakes and out-of-vocabulary words. For example, the model would still be able to understand the word “apple” even if it was misspelled as “appel” or “aple”.
It can handle words in different languages with the same script (e.g., English and French) without the need for a separate model for each language.

FastText uses a shallow neural network to learn the word representations from this character n-gram representation. It is trained using the skip-gram model with negative sampling, similar to word2vec.

FastText can also be used for sentence classification by averaging the word vectors for the words in the sentence and training a linear classifier on top of the averaged vector. It is particularly useful for languages with a large number of rare words, or in cases where using a word’s subwords (also known as substrings or character n-grams) as features can be helpful.

How are word embeddings trained in FastText?

Word embeddings in FastText can be trained using either the skip-gram model or the continuous bag-of-words (CBOW) model.

In the skip-gram model, the goal is to predict the context words given a target word. For example, given the input sequence “I have a dog”, the goal would be to predict “have” and “a” given the target word “I”, and to predict “I” given the target word “have”. The skip-gram model learns to predict the context words by minimizing the negative log likelihood of the context words given the target word.

In the CBOW model, the goal is to predict the target word given the context words. For example, given the input sequence “I have a dog”, the goal would be to predict “I” given the context words “have” and “a”, and to predict “have” given the context words “I” and “a”. The CBOW model learns to predict the target word by minimizing the negative log likelihood of the target word given the context words.

Both the skip-gram and CBOW models are trained using stochastic gradient descent (SGD) and backpropagation to update the model’s parameters. The model is trained by minimizing the negative log likelihood of the words in the training data, given the model’s parameters.

Explain how FastText represents each word in a document as a bag of character n-grams

To represent a word as a bag of character n-grams, FastText breaks the word down into overlapping substrings (also known as character n-grams). For example, the word “apple” could be represented as the following character 3-grams (trigrams): [“app”, “ppl”, “ple”]. The number of characters in each substring is specified by the user and is typically set to between 3 and 6 characters.

For example, consider the following sentence:

“I have a dog”

If we set the number of characters in each substring to 3, FastText would represent each word in the sentence as follows:

“I”: [“I”] “have”: [“hav”, “ave”] “a”: [“a”] “dog”: [“dog”]

The use of character n-grams allows FastText to learn good vector representations for rare words, as it can use the vector representations of the character n-grams that make up the rare word to compute its own vector representation. This is particularly useful for handling out-of-vocabulary words that may not have a pre-trained vector representation available.

How are vector representations for each word computed from n-gram vectors?

In FastText, the vector representation for each word is computed as the sum of the vector representations of the character n-grams (subwords) that make up the word. For example, consider the following sentence:

“I have a dog”

If we set the number of characters in each substring to 3, FastText would represent each word in the sentence as a bag of character 3-grams (trigrams) as follows:

“I”: [“I”] “have”: [“hav”, “ave”] “a”: [“a”] “dog”: [“dog”]

FastText would then learn a vector representation for each character n-gram and use these vector representations to compute the vector representation for each word. For example, the vector representation for the word “have” would be computed as the sum of the vector representations for the character n-grams [“hav”, “ave”].

Since there can be huge number of unique n-grams, how does FastText deal with the memory requirement?

One of the ways that FastText deals with the large number of unique character n-grams is by using hashing to map the character n-grams to a fixed-size hash table rather than storing them in a dictionary. This allows FastText to store the character n-grams in a compact form, which can save memory.

What is hashing? How are character sequences hashed to integer values?

Hashing is the process of converting a given input (called the ‘key’) into a fixed-size integer value (called the ‘hash value’ or ‘hash code’). The key is typically some sort of string or sequence of characters, but it can also be a number or other data type.

There are many different ways to hash a character sequence, but most algorithms work by taking the input key, performing some mathematical operations on it, and then returning the hash value as an integer. The specific mathematical operations used will depend on the specific hashing algorithm being used.

One simple example of a hashing algorithm is the ‘modulo’ method, which works as follows:

Take the input key and convert it into a numerical value, for example by assigning each character in the key a numerical value based on its ASCII code.
Divide this numerical value by the size of the hash table (the data structure in which the hashed keys will be stored).
The remainder of this division is the hash value for the key.

This method is simple and fast, but it is not very robust and can lead to a high number of collisions (when two different keys produce the same hash value). More sophisticated algorithms are typically used in practice to improve the performance and reliability of hash tables.

How is the Skip-gram with negative sampling applied in FastText?

Skip-gram with negative sampling (SGNS) algorithm is used to learn high-quality word embeddings (i.e., dense, low-dimensional representations of words that capture the meaning and context of the words). The Skip-gram with negative sampling algorithm works by training a predictive model to predict the context words (i.e., the words that appear near a target word in a given text) given the target word. During training, the model is given a sequence of word pairs (a target word and a context word) and tries to predict the context words given the target words.

To train the model, the SGNS algorithm uses a technique called negative sampling, which involves sampling a small number of negative examples (random words that are not the true context words) and using them to train the model along with the positive examples (the true context words). This helps the model to learn the relationship between the target and context words more efficiently by focusing on the most informative examples.

The SGNS algorithm steps are as following:

The embedding for a target word (also called the ‘center word’) is calculated by taking the sum of the embeddings for the word itself and the character n-grams that make up the word.
The context words are represented by their word embeddings, without adding the character n-grams.
Negative samples are selected randomly from the vocabulary during training, with the probability of selecting a word being proportional to the square root of its unigram frequency (i.e., the number of times it appears in the text).
The dot product of the embedding for the center word and the embedding for the context word is calculated. We then need to normalize the similarity scores over all of the context words in the vocabulary, so that the probabilities sum to 1 and form a valid probability distribution.
Compute the cross-entropy loss between the predicted and true context words. Use an optimization algorithm such as stochastic gradient descent (SGD) to update the embedding vectors in order to minimize this loss. This involves bringing the actual context words closer to the center word (i.e., the target word) and increasing the distance between the center word and the negative samples.

The cross-entropy loss function can be expressed as:
L = – ∑i(y_i log(p(w_i|c)) + (1 – y_i)log(1 – p(w_i|c)))
where:
L is the cross-entropy loss.
y_i is a binary variable indicating whether context word i is a positive example (y_i = 1) or a negative example (y_i = 0).
p(w_i|c) is the probability of context word i given the target word c and its embedding.
∑i indicates that the sum is taken over all context words i in the vocabulary.

FastText and hierarchical softmax

FastText can use a technique called hierarchical softmax to reduce the computation time during training. Hierarchical softmax works by organizing the vocabulary into a binary tree, with the word at the root of the tree and its descendant words arranged in a hierarchy according to their probability of occurrence.

During training, the model uses the hierarchical structure of the tree to compute the loss and update the model weights more efficiently. This is done by traversing the tree from the root to the appropriate leaf node for each word, rather than computing the loss and updating the weights for every word in the vocabulary separately.

The standard softmax function has a computational complexity of O(Kd), where K is the number of classes (i.e., the size of the vocabulary) and d is the number of dimensions in the hidden layer of the model. This complexity arises from the need to normalize the probabilities over all potential classes in order to obtain a valid probability distribution. The hierarchical softmax reduces the computational complexity to O(d*log(K)). Huffman coding can be used to construct a binary tree structure for the softmax function, where the lowest frequency classes are placed deeper into the tree and the highest frequency classes are placed near the root of the tree.

In the hierarchical softmax function, a probability is calculated for each path through the Huffman coding tree, based on the product of the output vector v_n_i of each inner node n and the output value of the hidden layer of the model, h. The sigmoid function is then applied to this product to obtain a probability between 0 and 1.

The idea of this method is to represent the output classes (i.e., the words in the vocabulary) as the leaves on the tree and to use a random walk through the tree to assign probabilities to the classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as the product of the probabilities along the path from the root to the leaf node corresponding to the class.

This allows the hierarchical softmax function to compute the probability of each class more efficiently, since it only needs to consider the path through the tree rather than the entire vocabulary. This can significantly reduce the computational complexity of the model, particularly for large vocabularies, making it practical to train word embeddings on very large datasets.

Hierarchical softmax and conditional probabilities

To compute the probability of each context word given the center word and its embedding using the hierarchical softmax function, we first organize the vocabulary into a binary tree, with the words at the nodes of the tree and their descendant words arranged in a hierarchy according to their probability of occurrence.

We then compute the probability of each context word by traversing the tree from the root to the appropriate leaf node for the word. For each inner node n in the tree, we compute the probability of traversing the left or right branch of the tree as follows:

p(left|n) = sigmoid(v_n_i · h) p(right|n) = 1 – p(left|n)

where:

v_n_i is the vector representation of inner node n
h is the output value of the hidden layer of the model

The probability of a context word w is then computed as the product of the probabilities of the branches along the path from the root to the leaf node corresponding to w.

NLP – Word Embeddings – GloVe

What are word embeddings?

Word embeddings are a type of representation for text data, which allows words with similar meaning to have a similar representation in a neural network model. Word embeddings are trained such that words that are used in similar contexts will have similar vectors in the embedding space. This is useful because it allows the model to generalize better and makes it easier to learn from smaller amounts of data. Word embeddings can be trained using a variety of techniques, such as word2vec and GloVe, and are commonly used as input to deep learning models for natural language processing tasks.

So are they represented as arrays of numbers?

Yes, word embeddings are typically represented as arrays of numbers. The length of the array will depend on the size of the embedding space, which is a parameter that is chosen when the word embeddings are created. For example, if the size of the embedding space is 50, each word will be represented as a vector of length 50, with each element of the vector representing a dimension in the embedding space.

In a neural network model, these word embedding vectors are typically fed into the input layer of the model, and the rest of the layers in the model are then trained to perform some task, such as language translation or sentiment analysis. The model learns to combine the various dimensions of the word embedding vectors in order to make predictions or decisions based on the input data.

How are word embeddings determined?

There are a few different techniques for determining word embeddings, but the most common method is to use a neural network to learn the embeddings from a large dataset of text. The basic idea is to train a neural network to predict a word given the words that come before and after it in a sentence, using the output of the network as the embedding for the input word. The network is trained on a large dataset of text, and the weights of the network are used to determine the embeddings for each word.

There are a few different variations on this basic approach, such as using a different objective function or incorporating additional information into the input to the network. The specific details of how word embeddings are determined will depend on the specific method being used.

What are the specific methods for generating word embeddings?

Word embeddings are a type of representation for natural language processing tasks in which words are represented as numerical vectors in a high-dimensional space. There are several algorithms for generating word embeddings, including:

Word2Vec: This algorithm uses a neural network to learn the vector representations of words. It can be trained using two different techniques: continuous bag-of-words (CBOW) and skip-gram.
GloVe (Global Vectors): This algorithm learns word embeddings by factorizing a matrix of word co-occurrence statistics.
FastText: This is an extension of Word2Vec that learns word embeddings for subwords (character n-grams) in addition to full words. This allows the model to better handle rare and out-of-vocabulary words.
ELMo (Embeddings from Language Models): This algorithm generates word embeddings by training a deep bi-directional language model on a large dataset. The word embeddings are then derived from the hidden state of the language model.
BERT (Bidirectional Encoder Representations from Transformers): This algorithm is a transformer-based language model that generates contextual word embeddings. It has achieved state-of-the-art results on a wide range of natural language processing tasks.

What is the word2vec CBOW model?

The continuous bag-of-words (CBOW) model is one of the two main techniques used to train the Word2Vec algorithm. It predicts a target word based on the context words, which are the words surrounding the target word in a text.

The CBOW model takes a window of context words as input and predicts the target word in the center of the window. The input to the model is a one-hot vector representation of the context words, and the output is a probability distribution over the words in the vocabulary. The model is trained to maximize the probability of predicting the correct target word given the context words.

During training, the model adjusts the weights of the input-to-output connections in order to minimize the prediction error. Once training is complete, the model can be used to generate word embeddings for the words in the vocabulary. These word embeddings capture the semantic relationships between words and can be used for various natural language processing tasks.

What is the word2vec skip-gram model?

The skip-gram model is the other main technique used to train the Word2Vec algorithm. It is the inverse of the continuous bag-of-words (CBOW) model, which predicts a target word based on the context words. In the skip-gram model, the target word is used to predict the context words.

Like the CBOW model, the skip-gram model takes a window of context words as input and predicts the target word in the center of the window. The input to the model is a one-hot vector representation of the target word, and the output is a probability distribution over the words in the vocabulary. The model is trained to maximize the probability of predicting the correct context words given the target word.

What are the steps for the GloVe algorithm?

GloVe learns word embeddings by factorizing a matrix of word co-occurrence statistics, which can be calculated from a large corpus of text.

The main steps of the GloVe algorithm are as follows:

Calculate the word co-occurrence matrix: Given a large corpus of text, the first step is to calculate the co-occurrence matrix, which is a symmetric matrix X where each element X_ij represents the number of times word i appears in the context of word j. The context of a word can be defined as a window of words around the word, or it can be the entire document.
Initialize the word vectors: The next step is to initialize the word vectors, which are the columns of the matrix W. The word vectors are initialized with random values.
Calculate the pointwise mutual information (PMI) matrix: The PMI matrix is calculated as follows:

PMI_ij = log(X_ij / (X_i * X_j))

where X_i is the sum of all the elements in the ith row of the co-occurrence matrix, and X_j is the sum of all the elements in the jth column of the co-occurrence matrix. The PMI matrix is a measure of the association between words and reflects the strength of the relationship between them.

Factorize the PMI matrix: The PMI matrix is then factorized using singular value decomposition (SVD) or another matrix factorization technique to obtain the word vectors. The word vectors are the columns of the matrix W.
Normalize the word vectors: Finally, the word vectors are normalized to have unit length.

Once the GloVe algorithm has been trained, the word vectors can be used to represent words in a high-dimensional space. The word vectors capture the semantic relationships between words and can be used for various natural language processing tasks.

How is the matrix factorization performed in GloVe? What is the goal?

The goal of matrix factorization in GloVe is to find two matrices, called the word matrix and the context matrix, such that the dot product of these matrices approximates the co-occurrence matrix. The word matrix contains the word vectors for each word in the vocabulary, and the context matrix contains the context vectors for each word in the vocabulary.

To find these matrices, GloVe minimizes the difference between the dot product of the word and context matrices and the co-occurrence matrix using a least-squares optimization method. This results in word vectors that capture the relationships between words in the corpus.

In GloVe, the objective function that is minimized during matrix factorization is the least-squares error between the dot product of the word and context matrices and the co-occurrence matrix. More specifically, the objective function is given by:

How is the objective function minimized?

In each iteration of SGD, a mini-batch of co-occurrence pairs (i, j) is selected from the co-occurrence matrix, and the gradients of the objective function with respect to the parameters are computed for each pair. The parameters are then updated using these gradients and a learning rate, which determines the step size of the updates.

This process is repeated until the objective function has converged to a minimum or a preset number of iterations has been reached. The process of selecting mini-batches and updating the parameters is often referred to as an epoch. SGD is an efficient method for minimizing the objective function in GloVe because it does not require computing the Hessian matrix, which is the matrix of second-order partial derivatives of the objective function.

When should GloVe be used instead of Word2Vec?

GloVe (Global Vectors) and Word2Vec are two widely used methods for learning word vectors from a large corpus of text. Both methods learn vector representations of words that capture the semantics of the words and the relationships between them, and they can be used in various natural language processing tasks, such as language modeling, information retrieval, and machine translation.

GloVe and Word2Vec differ in the way they learn word vectors. GloVe learns word vectors by factorizing a co-occurrence matrix, which is a matrix that contains information about how often words co-occur in a given corpus. Word2Vec, on the other hand, learns word vectors using a shallow neural network with a single hidden layer.

One advantage of GloVe is that it is computationally efficient, as it does not require training a neural network. This makes it well suited for use with large corpora. However, Word2Vec has been shown to perform better on some tasks, such as syntactic analogies and named entity recognition.

How is the co-occurrence matrix reduced to lower dimensions in GloVe?

In GloVe (Global Vectors), the co-occurrence matrix is not directly reduced to lower dimensions. Instead, the co-occurrence matrix is used to learn word vectors, which are then reduced to lower dimensions using dimensionality reduction techniques, such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE).

To learn word vectors from the co-occurrence matrix in GloVe, the matrix is factorized into two matrices, called the word matrix and the context matrix, using a least-squares optimization method. The word matrix contains the word vectors for each word in the vocabulary, and the context matrix contains the context vectors for each word in the vocabulary.

After the word vectors have been learned, they can be reduced to lower dimensions using dimensionality reduction techniques. For example, PCA can be used to project the word vectors onto a lower-dimensional space, while t-SNE can be used to embed the word vectors in a two-dimensional space for visualization.

It is worth noting that reducing the dimensionality of the word vectors may result in some loss of information, as some of the relationships between words may be lost in the lower-dimensional space. Therefore, it is important to consider the trade-off between the dimensionality of the word vectors and their representational power.

Interpreting GloVe from the Ratio of Co-occurrence Probabilities

GloVe uses the ratio of co-occurrence probabilities to learn the word vectors and context vectors. Specifically, it minimizes the difference between the dot product of the word and context vectors and the log of the ratio of co-occurrence probabilities. This allows GloVe to learn word vectors that capture the meanings and relationships between words in the language.

Chatting with GPT – Neural Networks Cont’d

The information provided about neural networks has been quite useful. I would like to learn more about loss functions. Please explain loss functions in the context of a neural networks – what are some examples of loss functions? How are they used?

Loss functions are a key component of neural networks, as they help to evaluate how well a model is performing. They do this by calculating the difference between the model’s predicted output and the true target output, and then generating a scalar value that represents the model’s overall error.

There are many different loss functions that can be used in neural networks, depending on the specific task at hand. Some common examples include:

Mean Squared Error (MSE) loss: This is a common loss function for regression tasks, where the goal is to predict a continuous numerical value. MSE loss is calculated by taking the average of the squared differences between the predicted output and the true output.
MSE Loss = (1/n) * ∑(ŷ – y)^2
A lower MSE loss indicates that the model is performing better and making more accurate predictions.
Cross-Entropy Loss (CEL): This loss function is commonly used for classification tasks, where the goal is to predict a class label. CEL is calculated by taking the negative log of the predicted probability of the true class.
CEL = – ∑(y * log(ŷ))
A lower CEL indicates that the model is performing better and making more accurate predictions.
Hinge Loss: This loss function is often used for support vector machines (SVMs) and is suitable for binary classification tasks. Hinge loss is calculated by taking the maximum of 0 and the difference between the predicted score for the true class and the predicted score for the false class, plus a margin.
Hinge Loss = max(0, 1 – y * (w * x + b))
where:
y is the true class label (1 if the example belongs to the true class, -1 otherwise)
w and b are the model’s weights and bias, respectively
x is the input example

The value 1 in the equation represents the margin. The margin is a hyperparameter that can be adjusted to control the influence of the loss function. A larger margin results in a more “relaxed” loss function, while a smaller margin results in a more “strict” loss function.
Kullback-Leibler Divergence (KLD) Loss: This loss function is often used in generative models, where the goal is to model the distribution of the data. KLD loss is calculated by taking the difference between the predicted distribution and the true distribution.
KLD Loss = ∑(p(x) * log(p(x)/q(x)))
where q(x) is the predicted distribution and p(x) is the true distribution

In general, loss functions are used to guide the training process in a neural network. During training, the model’s weights and biases are adjusted in an attempt to minimize the loss function. This process is known as backpropagation, and it helps the model to learn patterns in the data and make more accurate predictions.

So once I choose a loss function for my neural network, how will it be used to estimate the weights?

Once you have chosen a loss function for your neural network, it will be used to guide the training process and estimate the weights of the model.

During training, the neural network is presented with a series of input examples, along with the corresponding true output (also known as the target output). The model processes the input and generates a predicted output. The loss function is then used to calculate the difference between the predicted output and the true output, which is known as the error.

The error is then used to update the model’s weights and biases in an attempt to minimize the loss function. This process is known as backpropagation, and it helps the model to learn patterns in the data and make more accurate predictions.

The weights and biases are updated using an optimization algorithm, such as stochastic gradient descent (SGD). During each training iteration, the optimization algorithm adjusts the weights and biases based on the error calculated by the loss function. This process is repeated until the model reaches a satisfactory level of accuracy, or until a specified number of training iterations has been reached.

Provide the steps of how the loss function and errors are used in backpropagation, include equations.

Backpropagation is a process used to train neural networks by adjusting the model’s weights and biases in an attempt to minimize the loss function. The following are the steps involved in backpropagation:

Initialize the model’s weights and biases: Before training can begin, the model’s weights and biases must be initialized. This can be done using a variety of methods, such as setting all weights and biases to 0, or using random values drawn from a normal distribution.
Feed forward: During each training iteration, the model is presented with a series of input examples, along with the corresponding true output (also known as the target output). The model processes the input and generates a predicted output, ŷ, using the following equation:

ŷ = f(w * x + b)

where:

f is the activation function
w and b are the model’s weights and bias, respectively
x is the input example

Calculate the error: The loss function is then used to calculate the difference between the predicted output, ŷ, and the true output, y, which is known as the error. This can be expressed as follows:

Error = Loss(ŷ, y)

where Loss is the chosen loss function.

Backpropagate the error: The error is then used to update the model’s weights and biases in an attempt to minimize the loss function. This is done using an optimization algorithm, such as stochastic gradient descent (SGD).
Adjust the weights and biases: During each training iteration, the optimization algorithm adjusts the weights and biases based on the error calculated by the loss function. This can be expressed as follows:

w’ = w – α * ∂Loss/∂w b’
b’ = b – α * ∂Loss/∂b

where:

w’ and b’ are the updated weights and biases, respectively
α is the learning rate
∂Loss/∂w and ∂Loss/∂b are the partial derivatives of the loss function with respect to the weights and biases, respectively

Repeat until convergence: This process is repeated until the model reaches a satisfactory level of accuracy, or until a specified number of training iterations has been reached.

How do I choose the data on which the model should be trained? How do I split training vs. test data? And what is the difference between training, test, and validation data sets?

When training a neural network, it is important to choose a representative and diverse dataset to ensure that the model is able to learn patterns in the data that generalize well to new, unseen examples.

Once you have chosen a dataset, it is common practice to split it into three sets: training, validation, and test.

Training data is used to fit the model and learn the patterns in the data. The model is presented with a series of input examples, along with the corresponding true output (also known as the target output). The model processes the input and generates a predicted output, and the loss function is used to calculate the difference between the predicted output and the true output. This error is then used to update the model’s weights and biases in an attempt to minimize the loss function. This process is repeated until the model reaches a satisfactory level of accuracy, or until a specified number of training iterations has been reached.
Validation data is used to tune the model’s hyperparameters, such as the learning rate and the regularization coefficients. The model is presented with a series of input examples from the validation set, and the loss function is used to calculate the error between the predicted output and the true output. The hyperparameters are then adjusted in an attempt to minimize the validation loss.
Test data is used to evaluate the model’s generalization performance. The model is presented with a series of input examples from the test set, and the loss function is used to calculate the error between the predicted output and the true output. This error is used to measure the model’s accuracy on new, unseen data.

It is important to keep the training, validation, and test sets separate, as mixing them could result in a model that performs well on the training and validation sets, but poorly on the test set. This could lead to overfitting, which is when a model is overly optimized for the training and validation data, but fails to generalize well to new, unseen examples.

There are a few different strategies for splitting a dataset into training, validation, and test sets. One common approach is to use a 70/15/15 split, where 70% of the data is used for training, 15% is used for validation, and 15% is used for testing. Another approach is to use a 80/10/10 split, where 80% of the data is used for training, 10% is used for validation, and 10% is used for testing. The specific split will depend on the size and diversity of the dataset, as well as the specific requirements of the task at hand.

Can I use a neural network for both regression and classification problem types?

Yes, neural networks can be used for both regression and classification problems.

In a regression problem, the output of the model is a continuous value, such as a price or a probability. The goal is to predict the output value based on the input features. For example, you might use a neural network to predict the price of a house based on its size, location, and other features.

In a classification problem, the output of the model is a categorical value, such as a class label. The goal is to predict the class label of an input sample based on its features. For example, you might use a neural network to predict whether an email is spam or not based on the contents of the email.

Neural networks are very flexible and can be used to solve a wide range of problems. They can be used for both regression and classification, as well as other types of problems such as clustering, generation, and reinforcement learning.

What is meant by regularization in regards to a neural network model?

Regularization is a technique that is used to prevent overfitting in a neural network model. Overfitting occurs when a model is too complex and has too many parameters, which leads to poor generalization to new, unseen data.

There are several ways to regularize a neural network model, including:

L1 regularization: This involves adding a penalty term to the loss function that is proportional to the absolute value of the weights. This has the effect of pushing the weights towards 0, which can help reduce overfitting.
L2 regularization: This involves adding a penalty term to the loss function that is proportional to the square of the weights. This has the effect of pushing the weights towards 0, which can help reduce overfitting.
Dropout: This involves randomly dropping out units (neurons) from the network during training, which can help prevent overfitting by reducing the number of parameters in the model.
Early stopping: This involves stopping the training process early if the model starts to overfit, based on the performance on a validation set.

Regularization is an important technique for improving the generalization performance of a neural network model. It helps the model to generalize better to new, unseen data, rather than just memorizing the training data.

Off-topic, using NLP to summarize Russian propaganda articles on the invasion of Ukraine

I have a pretty simple article summarizer in python that I use for my blog. It’s not machine learning, it’s pretty basic – collecting N articles based on a search query, cleaning up the text, tokenizing, removing stop words. Then word frequencies are calculated and each sentence is scored based on those frequencies. Top N sentences are selected for the summary.

I usually use it to summarize articles on some specific supplements, but today for fun I decided to try a different search query. My query was “путин россия украина спецоперация”. I think that even though this summarizer is pretty simple, the summary turned out to be pretty meaningful. It gives a pretty good overview of what sort of info you would get if you were reading Russian news. I selected the top 15 sentences for the summary.

Что касается слов президента о спецоперации на Донбассе, то они показывают, что наши военнослужащие снова спасают мир от коричневой чумы нацизма, и все поставленные задачи, вне всякого сомнения, будут выполнены. Верховный главнокомандующий еще раз подтвердил, спецоперация на Украине будет доведена до своего логического завершения по тем задачам, которые он озвучил 24 февраля, в день начала СВО», — сказал полковник. И президент в своей речи указал на союзников из Европы, Азии, Африки и Латинской Америки, которые не прогибаются перед так называемым гегемоном, выбирают суверенный путь развития и хотят коллективно решать вопросы безопасности и сформировать многополярный мир.

Украинские издания утверждают, что СНБО одобрил санкции против патриарха Кирилла, а также Сергея Кириенко, Евгения Пригожина, Романа Абрамовича, Олега Дерипаски, Михаила Фридмана, Виктора Медведчука, а также Виктора Януковича. Наш долг перед памятью миллионов жертв Второй мировой войны — жестко реагировать на попытки фальсификации истории, противодействовать распространению любых форм неонацизма, русофобии и расизма», — призвал президент. Все указывает на то, что временной, человеческий, материальный, дипломатический ресурс «спецоперации» близится к исчерпанию, и Путин делает решительный шаг, чтобы закончить как можно скорее, зафиксировав прибыли и убытки. “Наша задача, наша миссия, солдат, ополченцев Донбасса – эту войну прекратить, защитить людей и, конечно, защитить саму Россию”, – подчеркнул Путин.

Выступление Путина демонстрирует готовность Вооруженных сил России довести специальную военную операцию на Украине до победного конца, пояснил URA.RU военный эксперт, ветеран разведки, полковник Анатолий Матвийчук. Война, развязанная Западом и киевской хунтой, будет закончена», — заключил Красов. Российское высокоточное оружие опережает иностранные аналоги на годы и десятилетия, при этом значительно превосходит их по тактико-техническим характеристикам, а оборонная промышленность лежит в основе суверенитета России, отметил президент.

Говоря о шагах со стороны Запада, ранее председатель комиссии Госдумы по расследованию фактов вмешательства иностранных государств во внутренние дела России Василий Пискарев заявлял о том, что зарубежные НКО насаждают радикальные идеологии в российском обществе, сообщает РАПСИ. «Сперва коллективный Запад изощрялся доказать, что якобы „разорвал в клочья“ российскую экономику, теперь там думают, что изолировали нашу страну от всего остального мира. Вооруженные силы России надежно защищают свою страну и несут свободу другим народам, заявил Путин, открывая работу форума «Армия-2022» и Армейских международных игр — 2022. «У России в запасе есть несколько военных вариантов, и для США и их союзников по НАТО эти сценарии — серьезный повод для беспокойства», — пишет журнал. Несмотря на то, что Запад был воодушевлен последними «успехами» ВСУ, генерал США Марк Милли призвал опасаться непредсказуемости России.

Yerba Mate (Ilex Paraguariensis) articles summary using NLP

The following summary was created using a google search for specific phrases and then performing natural language processing steps for sentence scoring. Yerba mate is an evergreen tree/shrub that grows in subtropical regions of South America. The leaves of the plant are used to make tea. Yerba mate tea contains caffeine and theobromine, which are known to affect the mood. I was interested in summarizing the existing articles in regards to research on this plant in psychiatry.

The first search phrase used was “yerba mate psychiatry depression research evidence“, and the number of collected articles for this phrase was 18. The text from all articles was combined, and relative word frequencies were calculated (after removing stop-words). These relative frequencies were then used to score each sentence. Sentence length distribution was checked, and the 90th percentile of 30 words was chosen to select sentences below the maximum length. Below are the 10 highest scoring sentences that summarize the text from the 18 articles.

We can infer from the summary that studies have been performed using the yerba mate extract on rats and tasks for chosen as proxies for the rats’ depression and anxiety levels. There are no mentions of human studies in the summary. Also the chosen sentences indicate that based on these studies, yerba mate has potential antidepressant activity, and it may improve memory as well. The results of the anxiety study were not mentioned and it’s not clear whether there were any side effects from yerba mate. These results are in line with descriptions of personal experiences of reddit users that I have reviewed, as many report better mood and improved focus after drinking yerba mate tea. Some users do report increased anxiety correlated with yerba mate consumption.

View abstract. J Agric.Food Chem. Vitamin C Levels Cerebral vitamin C (ascorbic acid (AA)) levels were determined as described by Jacques-Silva et al. Conclusion: In conclusion, the present study showed that Ilex paraguariensis presents an important effect on reducing immobility time on forced swimming test which could suggest an antidepressant-like effect of this extract. Despite previous some studies show the antidepressant-like activity of flavonoids [31, 32] which are present in the extract of I. paraguariensis, any study has evaluated the possible antidepressant-like activity of it. The presence of nine antioxidants compounds was investigated, namely, gallic acid, chlorogenic acid, caffeic acid, catechin, quercetin, rutin, kaempferol, caffeine, and theobromine. Abstract In this study, we investigated the possible antidepressant-like effect of I. paraguariensis in rats. Another study showed that an infusion of I. paraguariensis can improve the memory of rats treated with haloperidol and this effect was related to an indirect modulation of oxidative stress . In addition to flavonoids as quercetin and rutin and phenolic compounds as chlorogenic and caffeic acids, yerba mate is also rich in caffeine and saponins . After four weeks, behavioral analysis of locomotor activity and anxiety was evaluated in animals receiving water (n = 11) or I. paraguariensis (n = 9). In the same way, we evaluated if the presence of stimulants compounds like caffeine and theobromine in the extract of I. paraguariensis could cause anxiety. In the present study, we evaluated the possible antidepressant-like effect of I. paraguariensis by using forced swimming test (FST) in rats. Forced Swimming Test This experiment was performed using the FST according to the method previously published by Porsolt et al. In this context, Yerba mate (Ilex paraguariensis) is a beverage commonly consumed in South America especially in Argentina, Brazil, Uruguay, and Paraguay. I. paraguariensis reduced the immobility time on forced swimming test without significant changes in locomotor activity in the open field test.

I also tried several other search phrases, such as “yerba mate mood anxiety evidence” and “yerba mate side effects evidence“. In total of 17 articles were collected for the first query and 19 articles for the second query. The summaries are presented below. There was nothing in the summary directly discussing mood or anxiety, but there are mentions of neuroprotective effects and antioxidant effects. We can also learn that a cup of yerba mate tea has similar caffeine content as a cup of coffee, and that drinking yerba mate is not recommended while pregnant or breastfeeding. As in the previous summary, no human trials were mentioned, so it seems that all the summarized studies were performed on rats. The side effects query summary mentions the risk of transferring the caffeine from the tea to the fetus when pregnant, as well as a link to cancer for those who drink both alcohol and yerba mate. It also mentions and anxiety is a side effect of the tea.

Query 1:
View abstract. J Agric.Food Chem. On the other hand, studies conducted on an animal model showed chemopreventive effects of both pure mate saponin fraction and Yerba Mate tea in chemically induced colitis in rats. Yerba Mate Nutrition Facts The following nutrition information is provided by the USDA for one cup (12g) of a branded yerba mate beverage (Mate Revolution) that lists just organic yerba mate as an ingredient. Researchers found that steeping yerba mate (such as in yerba mate tea) may increase the level of absorption. Yerba mate beverages are not recommended for children and women who are pregnant or breastfeeding. Chlorogenic acid and theobromine tested individually also had neuroprotective effects, but slightly weaker than Yerba Mate extract as a whole, but stronger than known neuroprotective compounds, such as caffeine [ 83 ]. The caffeine content in a cup (about 150 mL) of Yerba Mate tea is comparable to that in a cup of coffee and is about 80 mg [ 1 , 11 , 20 ]. In aqueous and alcoholic extracts from green and roasted Yerba Mate, the presence of chlorogenic acid (caffeoylquinic acid), caffeic acid, quinic acid, dicaffeoylquinic acid, and feruloylquinic acid was confirmed. After consumption of Yerba Mate tea, antioxidant compounds are absorbed and appear in the circulating plasma where they exert antioxidant effects [ 55 ]. According to the cited studies, Yerba Mate tea consumption attenuates oxidative stress in patients with type 2 diabetes, which may prevent its complications.

Query 2:
View abstract. J Agric.Food Chem. Because yerba mate has a high concentration of caffeine, drinking mate tea while pregnant can increase the risk of transferring caffeine to the fetus. J Ethnopharmacol. South Med J 1988;81:1092-4.. View abstract. J Am Coll Nutr 2000;19:591-600.. View abstract. Am J Med 2005;118:998-1003.. View abstract. J Psychosom Res 2003;54:191-8.. View abstract. Yerba mate consumed by those who drink alcohol is linked to a higher risk of developing cancer. Anxiety and nervousness are a side effect of excessive yerba mate tea consumption.