coding – Notes of a Neuropsychiatry Amateur

Percentile Confidence Interval Calculator

Percentile-Confidence-Interval-Calculation.ipynb

This Python script calculates the 95% confidence interval for a specified percentile (e.g., the 70th percentile) of a dataset. The confidence interval provides a range in which we expect the true percentile value to lie with 95% confidence.

The calculation makes use of the binomial distribution properties, making an assumption that our data can be modeled by a binomial distribution. This assumption may not always be accurate, especially for continuous data, but it provides an approximation for our purposes.

Assumptions

1. Binary Outcome: The fundamental assumption behind the binomial distribution is that there is a binary outcome, often termed as ‘success’ and ‘failure’. In the context of percentiles, you can think of ‘success’ as the instances below the percentile and ‘failure’ as the instances above.

2. Fixed Number of Trials: For the binomial distribution, there is a fixed number n of trials. In our case, n represents the total number of data points in our sample.

3. Independence: Each trial (or data point) is independent of others. This means the outcome of one trial does not affect the outcome of another.

4. Constant Probability of Success: The probability of success, q, is the same for each trial. Here, q represents the percentile value. For example, for the 70th percentile, q=0.7.

Why the Binomial Distribution?

The rationale behind using the binomial distribution for percentile confidence intervals is its direct applicability to cases where you’re looking at the proportion of observations below a certain threshold (i.e., a percentile).

When you’re asking about the 70th percentile, you’re essentially inquiring: “What’s the value below which 70% of my data falls?” This can be likened to asking about the number of successes in n trials, where a success is an observation below the desired threshold.

However, it’s important to note that this method provides an approximation. The binomial distribution is discrete and inherently based on counting successes in a set number of trials, while percentiles often come from continuous distributions and may not perfectly adhere to the assumptions above.

import numpy as np
from scipy.stats import binom
import seaborn as sns

Get some data

# Load the Iris dataset
iris = sns.load_dataset("iris")
# Use the 'sepal_length' feature
data = iris['sepal_length'].values

print(data[:50])

[5.1 4.9 4.7 4.6 5.  5.4 4.6 5.  4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.  5.  5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.
 5.5 4.9 4.4 5.1 5.  4.5 4.4 5.  5.1 4.8 5.1 4.6 5.3 5. ]

Calculate the 70th percentile

# Calculate the 70th percentile
percentile_70 = np.percentile(data, 70)
print("Min: %f, Max: %f, 70th percentile: %f" % (min(data), max(data), percentile_70))

Min: 4.300000, Max: 7.900000, 70th percentile: 6.300000

Convert the data to “success” (above the 70th percentile) and “failure”

successes = np.sum(data > percentile_70)
failures = len(data) - successes

# Now, `successes` is analogous to `q * n` in the binomial scenario.
# So, we can set:
n = len(data)
q = successes / n

print("n: %d, q: %f" % (n, q))

n: 150, q: 0.280000

Calculate the 95% confidence interval

The code calculates potential upper (u) and lower (l) bounds for a confidence interval using the binomial distribution’s percent-point function (ppf).

np.ceil(binom.ppf(1 – alpha / 2, n, q)) determines the approximate upper bound for the confidence interval and np.ceil(binom.ppf(alpha / 2, n, q)) for the lower bound.

+ np.arange(-2, 3) extends these bounds by adding an array of [-2, -1, 0, 1, 2], generating a set of potential boundaries around the original estimate.

u gives a sequence of indices in the dataset that demarcate the upper bound of the confidence interval. It starts from the calculated index for the 97.5th percentile and provides two more indices above and two below it.

l gives a sequence of indices in the dataset that demarcate the lower bound of the confidence interval. It starts from the calculated index for the 2.5th percentile and provides two more indices above and two below it.

alpha = 0.05
u = np.ceil(binom.ppf(1 - alpha / 2, n, q)) + np.arange(-2, 3)
u[u > n] = np.inf

l = np.ceil(binom.ppf(alpha / 2, n, q)) + np.arange(-2, 3)
l[l < 0] = -np.inf

print("u: " + ", ".join(map(str, u)))
print("l: " + ", ".join(map(str, l)))

u: 51.0, 52.0, 53.0, 54.0, 55.0
l: 29.0, 30.0, 31.0, 32.0, 33.0

sorted_data = np.sort(data)

# Extract values corresponding to the indices
# Correct way to interpret the u and l values
u_values = sorted_data[n - u.astype(int)]
l_values = sorted_data[l.astype(int) - 1]

print("Upper values:", u_values)
print("Lower values:", l_values)

Upper values: [6.3 6.2 6.2 6.2 6.2]
Lower values: [5.  5.  5.  5.  5.1]

Probability coverage

The code calculates the probability coverage of different combinations of potential confidence intervals formed by the lower bounds (l) and upper bounds (u). Coverage is a matrix of probabilities. The goal is to find the smallest confidence interval that guarantees coverage of at least 1−α.

coverage = np.zeros((len(l), len(u)))
for i, a in enumerate(l):
    for j, b in enumerate(u):
        coverage[i, j] = binom.cdf(b - 1, n, q) - binom.cdf(a - 1, n, q)

if np.max(coverage) < 1 - alpha:
    i = np.where(coverage == np.max(coverage))
else:
    i = np.where(coverage == np.min(coverage[coverage >= 1 - alpha]))

print("Coverage Matrix:")
print(coverage)

print("\nOptimal Indices (i_l, i_u):")
print(i)

Coverage Matrix:
[[0.93135214 0.95028522 0.96430299 0.97438285 0.98142424]
 [0.92730647 0.94623955 0.96025732 0.97033718 0.97737857]
 [0.92096076 0.93989385 0.95391161 0.96399148 0.97103286]
 [0.91140808 0.93034117 0.94435894 0.9544388  0.96148018]
 [0.89759319 0.91652627 0.93054404 0.9406239  0.94766529]]

Optimal Indices (i_l, i_u):
(array([0], dtype=int64), array([1], dtype=int64))

i_l = i[0][0]
i_u = i[1][0]
print("Chosen row of coverage matrix: %d, chosen column of coverage matrix: %d" % (i_l, i_u))

u_final = min(n, u[i_u])
u_final = max(0, int(u_final)-1)
        
l_final = min(n, l[i_l])
l_final = max(0, int(l_final)-1)

# Actual value corresponding to u_final and l_final
upper_value_threshold = n - u_final
lower_value_threshold = l_final

upper_value = sorted_data[upper_value_threshold]
lower_value = sorted_data[lower_value_threshold]

print("Lower bound value:", lower_value)
print("Upper bound value:", upper_value)

Chosen row of coverage matrix: 0, chosen column of coverage matrix: 1
Lower bound value: 5.0
Upper bound value: 6.3

import matplotlib.pyplot as plt

# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', edgecolor='black', alpha=0.7, label='Data')

# Adding vertical lines for lower_value and upper_value
plt.axvline(lower_value, color='red', linestyle='--', label='Lower bound')
plt.axvline(upper_value, color='green', linestyle='--', label='Upper bound')

# Adding vertical line for the 70th percentile
plt.axvline(percentile_70, color='purple', linestyle='-.', label='70th Percentile')

# Adding title and labels
plt.title('Histogram of Data with Confidence Bounds and 70th Percentile')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()

plt.show()

Bootstrap method

A commonly used alternative method to calculate confidence intervals for percentiles (also known as quantiles) is the Bootstrap method.

The Bootstrap method involves resampling the dataset multiple times with replacement and then computing the desired statistic (in this case, the 70th percentile) for each of these resampled datasets. This gives a distribution of the 70th percentiles from which we can compute the confidence interval.

lower: This represents the value below which the bottom 2.5% of your jotted down 70th percentiles fall. In other words, it’s like saying, “In 2.5% of our bootstrap ‘experiments,’ the 70th percentile was below this value.”

upper: This is the value below which the bottom 97.5% of your jotted down 70th percentiles fall. Put another way, “In 97.5% of our bootstrap ‘experiments,’ the 70th percentile was below this value.”

import numpy as np

def bootstrap_percentile_CI(data, percentile=70, alpha=0.05, B=10000):
    """Calculate the bootstrap confidence interval for a given percentile."""
    n = len(data)
    resampled_percentiles = []

    for _ in range(B):
        resample = np.random.choice(data, n, replace=True)
        resampled_percentiles.append(np.percentile(resample, percentile))

    lower = np.percentile(resampled_percentiles, 100 * alpha/2)
    upper = np.percentile(resampled_percentiles, 100 * (1-alpha/2))
    
    return lower, upper

# Calculate the bootstrap 70th percentile confidence interval
lower_bootstrap, upper_bootstrap = bootstrap_percentile_CI(data)
print("Bootstrap 70th percentile CI: (%.2f, %.2f)" % (lower_bootstrap, upper_bootstrap))

Bootstrap 70th percentile CI: (6.10, 6.43)

# Plotting
plt.hist(data, bins=30, color='lightblue', edgecolor='black', alpha=0.7)
plt.axvline(x=np.percentile(data, 70), color='green', linestyle='--', label="True 70th Percentile")
plt.axvline(x=lower_bootstrap, color='red', linestyle='--', label="Lower Bound of CI")
plt.axvline(x=upper_bootstrap, color='blue', linestyle='--', label="Upper Bound of CI")
plt.legend()
plt.title('Histogram of Sepal Length with Bootstrap CI for 70th Percentile')
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.show()

Discussion

The bootstrap method makes minimal assumptions about the distribution of the data, making it versatile for a wide variety of datasets. This flexibility allows the bootstrap to handle complex or unknown data distributions, whereas the binomial method assumes data follows a binomial distribution and is mainly suited for binary outcomes. While the binomial approach is computationally simpler and quicker, it might not always provide an accurate representation, especially if the underlying assumptions aren’t met. In contrast, the bootstrap can be more computationally intensive due to resampling but offers the advantage of being more adaptable and often provides a more accurate estimate for datasets that don’t strictly adhere to a binomial distribution.

Fructose Malabsorption – Applying the Luhn algorithm for text summarization

The Luhn algorithm is a text summarization technique that uses statistical properties of the text to identify and extract the most important sentences from a document. The algorithm was developed by H.P. Luhn in the 1950s, and is still widely used in various forms today.

The Luhn algorithm works by first analyzing the frequency of each word in the document, and then assigning a score to each sentence based on the frequency of the words it contains. Sentences that contain words that are more frequent in the document as a whole are considered to be more important, and are assigned higher scores. The algorithm then selects the top-scoring sentences and concatenates them together to form the summary. The length of the summary is usually determined in advance by the user, and the algorithm selects the most important sentences that fit within that length limit.

It works by identifying the most salient or important sentences in a document based on the frequency of important words and their distribution within each sentence. First, the algorithm removes stopwords, which are common words such as “the”, “and”, and “a” that do not carry much meaning. Additionally, one could apply stemming, which reduces words to their base or root form. For example, “likes” and “liked” are reduced to “like”. Then, the algorithm looks for important words in each sentence. These are typically nouns, verbs, and adjectives that carry the most meaning. The specific method for identifying important words may vary depending on the implementation of the algorithm, but in general, they are selected based on their frequency and relevance to the topic of the text.

The algorithm counts the number of important words in each sentence and divides it by the span, or the distance between the first and last occurrence of an important word. This gives a measure of how densely the important words are distributed within the sentence. Finally, the algorithm ranks the sentences based on their scores, with the highest scoring sentences considered the most important and selected for the summary.

Here are the step-by-step instructions for the Luhn algorithm:

Preprocess the text: Remove any stop words, punctuation, and other non-textual elements from the document, and convert all the remaining words to lowercase.
Calculate the word frequency: Count the number of occurrences of each word in the document, and store this information in a frequency table.
For each sentence, calculate the score by:
a. Identifying the significant words (excluding stop words) that occur in the sentence.
b. Ordering the significant words by their position in the sentence.
c. Determining the distance between adjacent significant words (the “span”).
d. Calculating a score for the sentence as the sum of the square of the number of significant words divided by the span for each adjacent pair of significant words.
Select the top-scoring sentences: Sort the sentences in the document by their score, and select the top-scoring sentences up to a maximum length L. The length L is typically chosen by the user in advance, and represents the maximum number of words or sentences that the summary can contain.
Generate the summary: Concatenate the selected sentences together to form the summary.

Below I summarize the topic of fructose malabsorption by generating a summary using the Luhn algorithm. To create the summary, I selected several articles from sources like Wikipedia and PubMed. The important words were selected based on their total frequency in all of the text. I chose the top 25 words to focus on, and then used the algorithm to identify the most important sentences based on the frequency and distribution of these words. The summary was generated using the top 15 sentences.

Symptoms and signs of Fructose malabsorption may cause gastrointestinal symptoms such as abdominal pain, bloating, flatulence or diarrhea. Although often assumed to be an acceptable alternative to wheat, spelt flour is not suitable for people with fructose malabsorption, just as it is not appropriate for those with wheat allergies or celiac disease. However, fructose malabsorbers do not need to avoid gluten, as those with celiac disease must. Many fructose malabsorbers can eat breads made from rye and corn flour. This can cause some surprises and pitfalls for fructose malabsorbers. Foods (such as bread) marked “gluten-free” are usually suitable for fructose malabsorbers, though they need to be careful of gluten-free foods that contain dried fruit or high fructose corn syrup or fructose itself in sugar form. Food-labeling Producers of processed food in most or all countries, including the US, are not currently required by law to mark foods containing “fructose in excess of glucose”.

Stone fruit: apricot, nectarine, peach, plum (caution – these fruits contain sorbitol);Berry fruit: blackberry, boysenberry, cranberry, raspberry, strawberry, loganberry; Citrus fruit: kumquat, grapefruit, lemon, lime, mandarin, orange, tangelo; Other fruits: ripe banana, jackfruit, passion fruit, pineapple, rhubarb, tamarillo.

The fructose and glucose contents of foods listed on the Australian food standards would appear to indicate that most of the listed foods have higher fructose levels.

Glucose enhances absorption of fructose, so fructose from foods with fructose-to-glucose ratio <1, like white potatoes, are readily absorbed, whereas foods with fructose-to-glucose ratio >1, like apples and pears, are often problematic regardless of the total amount of fructose in the food.

Foods with high fructose-to-glucose ratio.

Dietary supplements of xylose isomerase may improve some symptoms of fructose malabsorption, although there is currently only a single scientific study available.

This can lead to issues such as bloating, heartburn and constipation.

It is called a hydrogen breath test and is the method currently used for a clinical diagnosis.

People with fructose malabsorption absorb less than 25 g per sitting.

Fructose malabsorption should be considered in patients with symptoms of depression and disturbances of tryptophan metabolism. This prospective cross-sectional study aimed to investigate the potential association between primary-adult lactose malabsorption, fructose malabsorption, tryptophan (TRP) metabolism and the presence of depressive signs.

High intestinal fructose concentration seems to interfere with L-tryptophan metabolism, and it may reduce availability of tryptophan for the biosynthesis of serotonin (5-hydroxytryptamine).

After obtaining the top sentences above, I used them as input into ChatGPT to generate another summary. The summary ended up with some mistakes, such as “People with this condition should avoid foods like spelt flour that contain fructose.“, as spelt does not contain fructose, it contains fructans. But it was still a fun exercise.

Fructose malabsorption can cause gastrointestinal symptoms such as abdominal pain, bloating, flatulence, and diarrhea. People with this condition should avoid foods like spelt flour that contain fructose. However, they do not need to avoid gluten, and many can eat breads made from rye and corn flour. Foods marked “gluten-free” are generally suitable for fructose malabsorbers, but they need to be careful of gluten-free foods that contain dried fruit or high fructose corn syrup. Certain fruits such as apricots, nectarines, peaches, and plums can cause issues due to their sorbitol content. The fructose-to-glucose ratio of foods can also play a role in absorption, with foods with a higher ratio often being problematic. A hydrogen breath test is used for clinical diagnosis, and dietary supplements of xylose isomerase may help with symptoms. Fructose malabsorption may also be associated with depressive signs due to the interference with L-tryptophan metabolism.

Summarizing articles on PMDD treatments using TextRank

In this blog post, I want to share with you what I learned about treating PMDD using articles summarization through TextRank. TextRank is not really a summarization algorithm, it is used for extracting top sentences, but I decided to use it anyways and see the results. I started by using the googlesearch library in python to search for “PMDD treatments – calcium, hormones, SSRIs, scientific evidence”. The search resulted in a list of URLs to various articles on PMDD treatments. However, not all of them were useful for my purposes, as some were blocked due to access restrictions. I used BeautifulSoup to extract the text from the remaining articles.

In order to exclude irrelevant paragraphs, I used the library called Justext. This library is designed for removing boilerplate content and other non-relevant text from HTML pages. Justext uses a heuristics to determine which parts of the page are boilerplate and which are not, and then filters out the former. Justext tries to identify these sections by analyzing the length of the text, the density of links, and the presence of certain HTML tags.

Some examples of the kinds of content that Justext can remove include navigation menus, copyright statements, disclaimers, and other non-content-related text. It does not work perfectly, as I still ended up with sentences such as the following in the resulting articles: “This content is owned by the AAFP. A person viewing it online may make one printout of the material and may use that printout only for his or her personal, non-commercial reference.”

Next, I used existing code that implements the TextRank algorithm that I found online. I slightly improved it so that instead of bag of words method the algorithm would use sentence embeddings. Let’s go step by step through the algorithm. I defined a class called TextRank4Sentences. Here is a description of each line in the __init__ method of this class:

self.damping = 0.85: This sets the damping coefficient used in the TextRank algorithm to 0.85. In this case, it determines the probability of the algorithm to transition from one sentence to another.

self.min_diff = 1e-5: This sets the convergence threshold. The algorithm will stop iterating when the difference between the PageRank scores of two consecutive iterations is less than this value.

self.steps = 100: This sets the number of iterations to run the algorithm before stopping.

self.text_str = None: This initializes a variable to store the input text.

self.sentences = None: This initializes a variable to store the individual sentences of the input text.

self.pr_vector = None: This initializes a variable to store the TextRank scores for each sentence in the input text.

from nltk import sent_tokenize, word_tokenize
from nltk.cluster.util import cosine_distance
from sklearn.metrics.pairwise import cosine_similarity

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

MULTIPLE_WHITESPACE_PATTERN = re.compile(r"\s+", re.UNICODE)

class TextRank4Sentences():
    def __init__(self):
        self.damping = 0.85  # damping coefficient, usually is .85
        self.min_diff = 1e-5  # convergence threshold
        self.steps = 100  # iteration steps
        self.text_str = None
        self.sentences = None
        self.pr_vector = None

The next step is defining a private method _sentence_similarity() which takes in two sentences and returns their cosine similarity using a pre-trained model. The method encodes each sentence into a vector using the pre-trained model and then calculates the cosine similarity between the two vectors using another function core_cosine_similarity().

core_cosine_similarity() is a separate function that measures the cosine similarity between two vectors. It takes in two vectors as inputs and returns a similarity score between 0 and 1. The function uses the cosine_similarity() function from the sklearn library to calculate the similarity score. The cosine similarity is a measure of the similarity between two non-zero vectors of an inner product space. It is calculated as the cosine of the angle between the two vectors.

Mathematically, given two vectors u and v, the cosine similarity is defined as:

cosine_similarity(u, v) = (u . v) / (||u|| ||v||)

where u . v is the dot product of u and v, and ||u|| and ||v|| are the magnitudes of u and v respectively.

def core_cosine_similarity(vector1, vector2):
    """
    measure cosine similarity between two vectors
    :param vector1:
    :param vector2:
    :return: 0 < cosine similarity value < 1
    """
    sim_score = cosine_similarity(vector1, vector2)
    return sim_score

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        first_sent_embedding = model.encode([sent1])
        second_sent_embedding = model.encode([sent2])
        
        return core_cosine_similarity(first_sent_embedding, second_sent_embedding)

In the next function, the similarity matrix is built for the given sentences. The function _build_similarity_matrix takes a list of sentences as input and creates an empty similarity matrix sm with dimensions len(sentences) x len(sentences). Then, for each sentence in the list, the function computes its similarity with all other sentences in the list using the _sentence_similarity function. After calculating the similarity scores for all sentence pairs, the function get_symmetric_matrix is used to make the similarity matrix symmetric.

The function get_symmetric_matrix adds the transpose of the matrix to itself, and then subtracts the diagonal elements of the original matrix. In other words, for each element (i, j) of the input matrix, the corresponding element (j, i) is added to it to make it symmetric. However, the diagonal elements (i, i) of the original matrix are not added twice, so they need to be subtracted once from the sum of the corresponding elements in the upper and lower triangles. The resulting matrix has the same values in the upper and lower triangles, and is symmetric along its main diagonal. The similarity matrix is made symmetric in order to ensure that the similarity score between two sentences in the matrix is the same regardless of their order, and it also simplifies the computation.

def get_symmetric_matrix(matrix):
    """
    Get Symmetric matrix
    :param matrix:
    :return: matrix
    """
    return matrix + matrix.T - np.diag(matrix.diagonal())

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        # create an empty similarity matrix
        sm = np.zeros([len(sentences), len(sentences)])
    
        for idx, sentence in enumerate(sentences):
            print("Current location: %d" % idx)
            sm[idx] = self._sentence_similarity(sentence, sentences)
    
        # Get Symmeric matrix
        sm = get_symmetric_matrix(sm)
    
        # Normalize matrix by column
        norm = np.sum(sm, axis=0)
        sm_norm = np.divide(sm, norm, where=norm != 0)  # this is ignore the 0 element in norm
    
        return sm_norm

In the next function, the ranking algorithm PageRank is implemented to calculate the importance of each sentence in the document. The similarity matrix created in the previous step is used as the basis for the PageRank algorithm. The function takes the similarity matrix as input and initializes the pagerank vector with a value of 1 for each sentence.

In each iteration, the pagerank vector is updated based on the similarity matrix and damping coefficient. The damping coefficient represents the probability of continuing to another sentence at random, rather than following a link from the current sentence. The algorithm continues to iterate until either the maximum number of steps is reached or the difference between the current and previous pagerank vector is less than a threshold value. Finally, the function returns the pagerank vector, which represents the importance score for each sentence.

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        ...

    def _run_page_rank(self, similarity_matrix):

        pr_vector = np.array([1] * len(similarity_matrix))

        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr_vector = (1 - self.damping) + self.damping * np.matmul(similarity_matrix, pr_vector)
            if abs(previous_pr - sum(pr_vector)) < self.min_diff:
                break
            else:
                previous_pr = sum(pr_vector)

        return pr_vector

The _get_sentence function takes an index as input and returns the corresponding sentence from the list of sentences. If the index is out of range, it returns an empty string. This function is used later in the class to get the highest ranked sentences.

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        ...

    def _run_page_rank(self, similarity_matrix):
        ...

    def _get_sentence(self, index):

        try:
            return self.sentences[index]
        except IndexError:
            return ""

The code then defines a method called get_top_sentences which returns a summary of the most important sentences in a document. The method takes two optional arguments: number (default=5) specifies the maximum number of sentences to include in the summary, and similarity_threshold (default=0.5) specifies the minimum similarity score between two sentences that should be considered “too similar” to include in the summary.

The method first initializes an empty list called top_sentences to hold the selected sentences. It then checks if a pr_vector attribute has been computed for the document. If the pr_vector exists, it sorts the indices of the sentences in descending order based on their PageRank scores and saves them in the sorted_pr variable.

It then iterates through the sentences in sorted_pr, starting from the one with the highest PageRank score. For each sentence, it removes any extra whitespace, replaces newlines with spaces, and checks if it is too similar to any of the sentences already selected for the summary. If it is not too similar, it adds the sentence to top_sentences. Once the selected sentences are finalized, the method concatenates them into a single string separated by spaces, and returns the summary.

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        ...

    def _run_page_rank(self, similarity_matrix):
        ...

    def _get_sentence(self, index):
        ...
   
    def get_top_sentences(self, number=5, similarity_threshold=0.5):
        top_sentences = []
    
        if self.pr_vector is not None:
            sorted_pr = np.argsort(self.pr_vector)
            sorted_pr = list(sorted_pr)
            sorted_pr.reverse()
    
            index = 0
            while len(top_sentences) < number and index < len(sorted_pr):
                sent = self.sentences[sorted_pr[index]]
                sent = normalize_whitespace(sent)
                sent = sent.replace('\n', ' ')
    
                # Check if the sentence is too similar to any of the sentences already in top_sentences
                is_similar = False
                for s in top_sentences:
                    sim = self._sentence_similarity(sent, s)
                    if sim > similarity_threshold:
                        is_similar = True
                        break
    
                if not is_similar:
                    top_sentences.append(sent)
    
                index += 1
        
        summary = ' '.join(top_sentences)
        return summary

The _remove_duplicates method takes a list of sentences as input and returns a list of unique sentences, by removing any duplicates in the input list.

class TextRank4Sentences():
    def __init__(self):
        ...

    def _sentence_similarity(self, sent1, sent2):
        ...
    
    def _build_similarity_matrix(self, sentences, stopwords=None):
        ...

    def _run_page_rank(self, similarity_matrix):
        ...

    def _get_sentence(self, index):
        ...
   
    def get_top_sentences(self, number=5, similarity_threshold=0.5):
        ...
    
    def _remove_duplicates(self, sentences):
        seen = set()
        unique_sentences = []
        for sentence in sentences:
            if sentence not in seen:
                seen.add(sentence)
                unique_sentences.append(sentence)
        return unique_sentences

The analyze method takes a string text and a list of stop words stop_words as input. It first creates a unique list of words from the input text by using the set() method and then joins these words into a single string self.full_text.

It then uses the sent_tokenize() method from the nltk library to tokenize the text into sentences and removes duplicate sentences using the _remove_duplicates() method. It also removes sentences that have a word count less than or equal to the fifth percentile of all sentence lengths.

After that, the method calculates a similarity matrix using the _build_similarity_matrix() method, passing in the preprocessed list of sentences and the stop_words list.

Finally, it runs the PageRank algorithm on the similarity matrix using the _run_page_rank() method to obtain a ranking of the sentences based on their importance in the text. This ranking is stored in self.pr_vector.

class TextRank4Sentences():
    ...

    def analyze(self, text, stop_words=None):
        self.text_unique = list(set(text))
        self.full_text = ' '.join(self.text_unique)
        #self.full_text = self.full_text.replace('\n', ' ')
        
        self.sentences = sent_tokenize(self.full_text)
        
        # for i in range(len(self.sentences)):
        #     self.sentences[i] = re.sub(r'[^\w\s$]', '', self.sentences[i])
    
        self.sentences = self._remove_duplicates(self.sentences)
        
        sent_lengths = [len(sent.split()) for sent in self.sentences]
        fifth_percentile = np.percentile(sent_lengths, 10)
        self.sentences = [sentence for sentence in self.sentences if len(sentence.split()) > fifth_percentile]

        print("Min length: %d, Total number of sentences: %d" % (fifth_percentile, len(self.sentences)) )

        similarity_matrix = self._build_similarity_matrix(self.sentences, stop_words)

        self.pr_vector = self._run_page_rank(similarity_matrix)

In order to find articles, I used the googlesearch library. The code below performs a Google search using the Google Search API provided by the library. It searches for the query “PMDD treatments – calcium, hormones, SSRIs, scientific evidence” and retrieves the top 7 search results.

# summarize articles
import requests
from bs4 import BeautifulSoup
from googlesearch import search
import justext
query = "PMDD treatments - calcium, hormones, SSRIs, scientific evidence"

# perform the google search and retrieve the top 5 search results
top_results = []
for url in search(query, num_results=7):
    top_results.append(url)

In the next part, the code extracts the article text for each of the top search results collected in the previous step. For each URL in the top_results list, the code sends an HTTP GET request to the URL using the requests library. It then uses the justext library to extract the main content of the webpage by removing any boilerplate text (i.e., non-content text).

article_texts = []

# extract the article text for each of the top search results
for url in top_results:
    response = requests.get(url)
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
    text = ''
    for paragraph in paragraphs:
        if not paragraph.is_boilerplate:
            text += paragraph.text + '\n'

    if "Your access to PubMed Central has been blocked" not in text:
        article_texts.append(text.strip())
        print(text)
    print('-' * 50)
    
print("Total articles collected: %d" % len(article_texts))

In the final step, the extracted article texts are passed to an instance of the TextRank4Sentences class, which is used to perform text summarization. The output of get_top_sentences() is a list of the top-ranked sentences in the input text, which are considered to be the most important and representative sentences for summarizing the content of the text. This list is stored in the variable summary_text.

# summarize
tr4sh = TextRank4Sentences()
tr4sh.analyze(article_texts)
summary_text = tr4sh.get_top_sentences(15)

Results:
(I did not list irrelevant sentences that appeared in the final results, such as “You will then receive an email that contains a secure link for resetting your password…“)

Total articles collected: 6

There have been at least 15 randomized controlled trials of the use of selective serotonin-reuptake inhibitors (SSRIs) for the treatment of severe premenstrual syndrome (PMS), also called premenstrual dysphoric disorder (PMDD).

It is possible that the irritability/anger/mood swings subtype of PMDD is differentially responsive to treatments that lead to a quick change in ALLO availability or function, for example, symptom-onset SSRI or dutasteride.
* My note: ALLO is allopregnanolone
* My note: Dutasteride is a synthetic 4-azasteroid compound that is a selective inhibitor of both the type 1 and type 2 isoforms of steroid 5 alpha-reductase

From 2 to 10 percent of women of reproductive age have severe distress and dysfunction caused by premenstrual dysphoric disorder, a severe form of premenstrual syndrome.

The rapid efficacy of selective serotonin reuptake inhibitors (SSRIs) in PMDD may be due in part to their ability to increase ALLO levels in the brain and enhance GABAA receptor function with a resulting decrease in anxiety.

Clomipramine, a serotoninergic tricyclic antidepressant that affects the noradrenergic system, in a dosage of 25 to 75 mg per day used during the full cycle or intermittently during the luteal phase, significantly reduced the total symptom complex of PMDD.

Relapse was more likely if a woman stopped sertraline after only 4 months versus 1 year, if she had more severe symptoms prior to treatment and if she had not achieved full symptom remission with sertraline prior to discontinuation.

Women with negative views of themselves and the future caused or exacerbated by PMDD may benefit from cognitive-behavioral therapy. This kind of therapy can enhance self-esteem and interpersonal effectiveness, as well as reduce other symptoms.

Educating patients and their families about the disorder can promote understanding of it and reduce conflict, stress, and symptoms.

Anovulation can also be achieved with the administration of estrogen (transdermal patch, gel, or implant).

In a recent meta-analysis of 15 randomized, placebo-controlled studies of the efficacy of SSRIs in PMDD, it was concluded that SSRIs are an effective and safe first-line therapy and that there is no significant difference in symptom reduction between continuous and intermittent dosing.

Preliminary confirmation of alleviation of PMDD with suppression of ovulation with a GnRH agonist should be obtained prior to hysterectomy.

Sexual side effects, such as reduced libido and inability to reach orgasm, can be troubling and persistent, however, even when dosing is intermittent. * My note: I think this sentence refers to the side-effects of SSRIs

Calculating Confidence Interval for a Percentile

Calculating the confidence interval for a percentile is a crucial step in understanding the variability and the uncertainty around the estimated value. In many real-world applications, the distribution of the data is unknown and this makes it difficult to determine the confidence intervals. In such scenarios, using a binomial distribution can be a viable alternative to estimate the confidence intervals for a percentile.

For instance, let’s consider a variable with 300 data points and we want to calculate the 70th and 90th percentiles and the corresponding confidence intervals for the variable. To do this, we can use a binomial distribution approach.

First, we need to choose an alpha level, which is a probability that determines the size of the confidence interval. A common choice for alpha is 0.05, which corresponds to a 95% confidence interval.

Next, we use the cumulative distribution function (CDF) of the binomial distribution to estimate the lower and upper bounds of the confidence interval. The CDF of the binomial distribution gives the probability of getting k or fewer successes in n independent Bernoulli trials, where the probability of success in each trial is p.

To calculate the 70th percentile and its confidence interval, we use the following steps:

Set n = 300, which is the number of data points.
Set p = 0.7, which corresponds to the 70th percentile.
Calculate the binomial quantile using the CDF, which is the smallest k such that P(X <= k) >= p, where X is a binomial random variable with parameters n and p.
Use the CDF to determine the lower and upper bounds of the confidence interval.

Below is the python code for calculating the confidence interval for the 70th percentile.

alpha – alpha is a parameter representing the significance level or confidence level for the calculation of the confidence interval. It is the probability that the confidence interval contains the true value of the parameter being estimated. The value of alpha is typically set to 0.05 or 0.01, meaning that there is a 95% or 99% chance, respectively, that the confidence interval contains the true value. In the code, alpha=0.05 is the default value for alpha, but it can be changed to a different value if desired.

n – number of observations

q – percentile value

from scipy.stats import binom
import numpy as np

alpha = 0.05
n = 300
q = 0.7

Below is the code for calculating the upper and lower bounds for the confidence interval. The u value is calculated as the ceiling of the binomial distribution’s quantile function (ppf) evaluated at 1 – alpha / 2 (1 – 0.05 / 2 = 0.975), and the value is shifted by adding an array of numbers from -2 to 2. Any values of u that are greater than n are set to infinity.

u = np.ceil(binom.ppf(1 - alpha / 2, n, q)) + np.arange(-2, 3)
u[u > n] = np.inf

l = np.ceil(binom.ppf(alpha / 2, n, q)) + np.arange(-2, 3)
l[l < 0] = -np.inf

# From the calculation of bounds, np.ceil(binom.ppf(1 - alpha / 2, n, q)) and np.ceil(binom.ppf(alpha / 2, n, q)), we obtain that
# the upper bound value is 225 and the lower bound value is 194. This means that given a sample of size 300, a binomial distribution, and # probability of success p=0.7, we are 95% certain that the number of successes will be between 194 and 225.

Next we calculate coverage of the percentiles that the bounds cover. The coverage represents a matrix of values that correspond to the probability of coverage of the confidence interval for each combination of lower and upper bounds of the interval.

The coverage calculation uses the binom.cdf function to calculate the cumulative distribution function (CDF) for the binomial distribution, which is then used to determine the coverage probability of each combination of u and l. Once the coverage matrix is calculated, the code finds the index i corresponding to the combination of u and l that gives the closest coverage probability to 1-alpha.

coverage = np.zeros((len(l), len(u)))

for i, a in enumerate(l):
    for j, b in enumerate(u):
        coverage[i, j] = binom.cdf(b - 1, n, q) - binom.cdf(a - 1, n, q)

Next we select the upper and lower bounds of the confidence interval based on the coverage of the interval. The code first checks if the maximum coverage is less than 1 minus the significance level alpha. If it is, the code selects the pair of bounds with the maximum coverage probability. Otherwise, the code selects the pair of bounds with the smallest coverage probability that is still greater than or equal to 1 minus alpha.

if np.max(coverage) < 1 - alpha:
    i = np.where(coverage == np.max(coverage))
else:
    i = np.where(coverage == np.min(coverage[coverage >= 1 - alpha]))

i_u = i[0][0]
i_l = i[1][0]

u_final = min(n, u[i_u])
u_final = max(0, int(u_final)-1)
        
l_final = min(n, l[i_l])
l_final = max(0, int(l_final)-1)

The resulting l and u are 192 and 223, respectively. Therefore if you have a sample of 300 and you want to calculate the confidence interval for a variable X, you would sort the values in ascending order, and then you would take the values of X that correspond to the 192nd and 223rd observations.

Chatting with GPT – Neural Networks

Chat GPT link

*NOTE: I have manually modified some of the equations produced by ChatGPT, such as adding a t-1 subscript

Hello, please tell me what is ChatGPT?

ChatGPT is a variant of the GPT (Generative Pre-training Transformer) language model that was specifically designed for use in chatbots and conversational AI systems. It is trained on a large dataset of human-human conversations and is able to generate human-like responses to user input. ChatGPT can be used to build chatbots that can hold natural language conversations with users, helping businesses and organizations automate customer service, sales, and other types of interactions.

Like other GPT models, ChatGPT is a transformer-based language model that uses attention mechanisms to process and generate text. It is trained using a combination of supervised learning, in which the model is given labeled examples of conversation, and unsupervised learning, in which the model is trained to predict the next word in a sequence based on the context of the words that come before it. This allows ChatGPT to learn the patterns and structure of human conversation and generate appropriate responses to user input.

I would like to learn more about machine learning, let’s start with the concept of neural networks. Please explain the concept of a neural network.

A neural network is a type of machine learning model that is inspired by the structure and function of the human brain. It is composed of a large number of interconnected “neurons,” which are simple processing units that receive input, transform it, and produce output.

In a neural network, the input data is fed through a series of hidden layers, each of which is made up of a large number of neurons. These hidden layers process the input data using weighted connections between the neurons. The weights of these connections are adjusted during the training process, allowing the neural network to “learn” from the input data.

The output of the neural network is produced by the output layer, which is also made up of a number of neurons. The output layer takes the processed data from the hidden layers and produces the final output of the neural network.

Neural networks can be used for a wide variety of machine learning tasks, including classification, regression, and clustering. They are particularly powerful for tasks that require the processing of large amounts of data or the detection of patterns and relationships in data that are not immediately apparent.

Overall, the concept of a neural network is a key element of machine learning and is used in many different applications and industries.

How does a neural network look in terms of mathematical equations? What is the architecture?

The architecture of a neural network refers to the way that the neurons are organized and connected within the network. A typical neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer receives the raw input data, and the output layer produces the final output of the network. The hidden layers are located between the input and output layers and are responsible for processing the input data and producing intermediate output that is passed on to the output layer.

In terms of mathematical equations, the computation performed by a neural network can be represented by a series of matrix multiplications and element-wise non-linear functions (also called activation functions).

For example, consider a neural network with one hidden layer and an input layer with m inputs and an output layer with n outputs. Let’s say that the input data is represented by a matrix X with dimensions m x N (where N is the number of data points) and the weights between the input layer and the hidden layer are represented by a matrix W1 with dimensions m x p (where p is the number of neurons in the hidden layer). The output of the hidden layer is then given by:

h = f(XW1 + b1)

where f is the activation function and b1 is a bias term.

The weights between the hidden layer and the output layer are represented by a matrix W2 with dimensions p x n, and the final output of the neural network is given by:

y = f(hW2 + b2)

where b2 is a bias term.

This is just a simple example, but it illustrates the basic idea of how a neural network can be represented mathematically. In practice, neural networks can have many more layers and a much larger number of neurons, and the equations can become quite complex. However, the basic idea of using matrix multiplications and non-linear activation functions to transform the input data and produce output remains the same.

What exactly do you mean by activation function? What are some examples of such functions?

An activation function is a mathematical function that is applied to the output of a neuron in a neural network. It determines whether the neuron will “fire” and produce an output, or whether it will remain inactive. Activation functions are a key component of neural networks because they allow the network to learn non-linear relationships in the data.

There are many different activation functions that can be used in neural networks, each with its own properties and characteristics. Some common examples include:

Sigmoid: The sigmoid function is a smooth, S-shaped curve that maps any input value to a value between 0 and 1. It is often used in the output layer of a binary classification neural network because it can produce a probability-like output.

Tanh: The tanh (hyperbolic tangent) function is also a smooth, S-shaped curve, but it maps input values to a range of -1 to 1. It is similar to the sigmoid function, but it allows for negative output values as well.

ReLU (Rectified Linear Unit): The ReLU function is a piecewise linear function that maps negative input values to 0 and positive input values to the same value. It is a simple and efficient activation function that has been widely adopted in many neural network architectures.

Leaky ReLU: The leaky ReLU function is a variant of the ReLU function that allows a small negative slope for negative input values. This can help prevent the “dying ReLU” problem, in which some neurons in the network become inactive and stop producing output.

Softmax: The softmax function is a generalization of the sigmoid function that is often used in the output layer of a neural network for multi-class classification. It maps the input values to a probability distribution over the different classes, such that the sum of the probabilities is 1.

These are just a few examples of the many activation functions that are used in neural networks. The choice of activation function can have a significant impact on the performance of the network, and different activation functions may be better suited for different types of tasks and data.

So how would I choose an activation function then? Would each layer have a different activation function?

The choice of activation function can have a significant impact on the performance of a neural network, and it is an important design decision that can affect the network’s ability to learn from the data and generalize to new examples.

There is no one-size-fits-all activation function that is best for all situations, and it is often necessary to experiment with different activation functions to find the one that works best for a particular task. In general, it can be helpful to consider the properties of the activation function and how they align with the characteristics of the data and the goals of the task.

For example, if you are building a neural network for a binary classification task, you may want to use a sigmoid or a tanh activation function in the output layer, since these functions can produce a probability-like output between 0 and 1. On the other hand, if you are building a neural network for a multi-class classification task, you may want to use a softmax activation function in the output layer, since it can produce a probability distribution over the different classes.

As for the hidden layers of the network, ReLU and its variants (such as Leaky ReLU) are often a good choice, since they are simple and efficient and have been shown to work well in a wide range of tasks. However, other activation functions may also be suitable, depending on the characteristics of the data and the specific goals of the task.

In general, it is possible to use different activation functions in different layers of a neural network, and this can sometimes be helpful to improve the network’s performance. However, it is also important to be mindful of the complexity of the network and to avoid adding unnecessary complexity that could harm the network’s ability to learn effectively.

So once I have my data inputs, my data outputs, and I choose my activation functions, how do I choose the number of layers in my neural network?

The number of layers in a neural network, as well as the number of neurons in each layer, are important design choices that can impact the network’s ability to learn from the data and generalize to new examples. In general, the more layers and neurons a network has, the more capacity it has to learn from the data. However, too many layers and neurons can also make the network more prone to overfitting, which is when the network performs well on the training data but poorly on new, unseen examples.

There is no hard and fast rule for choosing the number of layers and neurons in a neural network, and it is often necessary to experiment with different configurations to find the one that works best for a particular task. However, there are a few guidelines that you can follow to help you make informed decisions:

Start with a small number of layers and neurons and gradually increase them until you see improvements in the network’s performance. This can help you avoid the problem of overfitting, since a smaller network is less likely to overfit the data.
Consider the complexity of the task and the amount of data you have available. For more complex tasks or larger datasets, you may need a deeper network (i.e., one with more layers) to learn effectively. On the other hand, for simpler tasks or smaller datasets, a shallower network may be sufficient.
Experiment with different architectures and compare the performance of the different models. This can help you identify the best-performing configuration for your task.

Ultimately, the choice of the number of layers and neurons in a neural network is a trade-off between model capacity and generalization. You want to choose a configuration that has enough capacity to learn from the data, but not so much that it overfits the data. Finding the right balance can require some experimentation and trial and error, but with the right approach, you can build a neural network that performs well on your task.

Once I will choose the number of layers, the number of neurons in each layers, the activation functions, how do I train the model then? How will the weights be calculated?

To train a neural network, you need to adjust the weights of the connections between the neurons in such a way that the output of the network is as close as possible to the desired output for a given set of inputs. The process of adjusting the weights to minimize the difference between the predicted output and the desired output is known as optimization.

There are several optimization algorithms that can be used to train a neural network, each with its own pros and cons. Some common optimization algorithms include:

Stochastic Gradient Descent (SGD): SGD is a simple and widely-used optimization algorithm that involves updating the weights of the network based on the gradient of the loss function with respect to the weights. The loss function is a measure of how well the network is doing at predicting the desired output given the input data, and the gradient is a vector that points in the direction of the steepest increase in the loss function. SGD involves repeatedly updating the weights in the opposite direction of the gradient, in an attempt to minimize the loss function and improve the performance of the network.
Adam (Adaptive Moment Estimation): Adam is a variant of SGD that uses an adaptive learning rate, which means that the step size for updating the weights is adjusted based on the past gradient information. Adam tends to be more efficient and require fewer hyperparameter tuning compared to SGD.
Mini-batch gradient descent: Mini-batch gradient descent is a variant of SGD that involves dividing the training data into small batches and updating the weights based on the average of the gradients of the loss function for each batch. This can be more efficient than SGD, especially for large datasets, since it allows the network to learn from the data more quickly.
RProp (Resilient Backpropagation): RProp is a variant of SGD that uses a heuristic approach to adjust the learning rate for each weight individually, based on the past gradient information. It is generally faster and more efficient than SGD, and it is often used in large-scale neural network training.

Please explain stochastic gradient descent further, include equations

Stochastic Gradient Descent (SGD) is a widely-used optimization algorithm for training neural networks and other machine learning models. It involves updating the weights of the model based on the gradient of the loss function with respect to the weights. The loss function is a measure of how well the model is doing at predicting the desired output given the input data, and the gradient is a vector that points in the direction of the steepest increase in the loss function.

SGD works by iteratively updating the weights of the model in the opposite direction of the gradient, in an attempt to minimize the loss function and improve the performance of the model. The update rule for SGD can be expressed as follows:

w = w – alpha * gradient

where w is the weight of the model, alpha is the learning rate (a hyperparameter that controls the step size for updating the weights), and gradient is the gradient of the loss function with respect to the weight.

The learning rate is an important hyperparameter in SGD, and it determines the step size for updating the weights. A larger learning rate can lead to faster convergence, but it can also make the optimization process more unstable. On the other hand, a smaller learning rate can lead to slower convergence, but it can also make the optimization process more stable.

SGD can be implemented in a number of ways, and it is often used in conjunction with other techniques, such as mini-batch gradient descent or momentum, to improve the efficiency and stability of the optimization process.

You also mentioned ADAM – Adaptive Moment Estimation, how does this algorithm differ from SGD? Please provide equations.

Adam (Adaptive Moment Estimation) is a variant of Stochastic Gradient Descent (SGD) that uses an adaptive learning rate, which means that the step size for updating the weights is adjusted based on the past gradient information. Adam is a popular optimization algorithm that is often used to train neural networks and other machine learning models.

One of the key differences between Adam and SGD is that Adam uses an exponentially weighted average of the past gradients to scale the learning rate, rather than a fixed learning rate as in SGD. This allows Adam to automatically adjust the learning rate based on the characteristics of the data and the optimization process, which can lead to faster convergence and better performance.

The update rule for Adam can be expressed as follows:

m = beta1 * m_t-1 + (1 – beta1) * gradient

v = beta2 * v_t-1 + (1 – beta2) * gradient^2

w = w_t-1 – alpha * m / (sqrt(v) + epsilon)

where w is the weight of the model, alpha is the learning rate (a hyperparameter that controls the step size for updating the weights), m and v are the exponentially weighted averages of the past gradients and squared gradients, respectively, beta1 and beta2 are hyperparameters that control the decay rate of the moving averages, and epsilon is a small value that is used to prevent division by zero.

Adam uses two moving averages, one for the gradients and one for the squared gradients, which allows it to incorporate both the mean and the variance of the past gradients into the learning rate. This can help to stabilize the optimization process and prevent oscillations or divergences.

You also listed RProp (Resilient Backpropagation) as an optimization algorithms, please explain RProp with equations.

Resilient backpropagation (RProp) is an optimization algorithm used in artificial neural networks for training supervised learning models. It is an extension of the backpropagation algorithm that aims to improve the convergence speed of the training process by adapting the learning rate for each weight in the network during training.

Like backpropagation, RProp uses the gradient of the loss function to update the weights of the network. The gradient is calculated using the chain rule and the weights are updated using the following equation:

weight(t+1) = weight(t) – learning_rate * gradient(t)

In backpropagation, the learning rate is a fixed hyperparameter that is set before training. In RProp, the learning rate is not fixed and is instead adapted for each weight based on the sign of the gradient. In RProp, the learning rate for each weight in the network is updated individually, based on the sign of the gradient of the loss function with respect to that weight. This means that the learning rate for each weight is updated individually based on the convergence of that weight, rather than being updated based on the convergence of all weights in the network. This can improve the convergence speed of the training process, as the learning rate can be adjusted according to the convergence of each weight.

The learning rate is updated using the following equation:

learning_rate(i)_t+1 = learning_rate(i)_t – step_size(i)_t-1 * sng (dE_t-1 / dw(i)_t-1 )

I have also often heard the AdaGrad optimizer mentioned, please explain that to me

In AdaGrad, the learning rate is not fixed and is instead adapted for each weight based on the history of the gradients of the loss function with respect to that weight. Specifically, the learning rate is adjusted based on the sum of the squares of the gradients of the loss function with respect to the weight.

The learning rate for each weight is updated using the following equation:

learning_rate(t+1) = learning_rate(t) / sqrt(sum(gradient(t)^2))

where sum(gradient(t)^2) is the sum of the squares of the gradients of the loss function with respect to the weight up to time t.

This means that the learning rate for each weight is inversely proportional to the sum of the squares of the gradients of the loss function with respect to that weight. This has the effect of decreasing the learning rate for weights that have had large gradients in the past, and increasing the learning rate for weights that have had small gradients in the past.

I have also heard of gradient boosting, explain that

Gradient boosting is a machine learning technique for building predictive models by creating an ensemble of weak learners and combining their predictions to create a stronger, more accurate model. The general idea of gradient boosting is to train a sequence of weak learners (e.g. decision trees) in a stage-wise manner, with each subsequent learner trying to correct the errors made by the previous learner. At each stage, the gradient boosting algorithm fits a new weak learner to the residual errors made by the previous learner. The residual errors are the difference between the true values and the predicted values of the target variable.

In gradient boosting, weights are assigned to the incorrectly classified samples in order to give more importance to the samples that are difficult to classify. This helps the weak learners in the ensemble to focus more on the samples that are most likely to improve the overall performance of the model.

The steps of the gradient boosting algorithm can be summarized in the following equations:

Initialize the ensemble with a single weak learner, such as a decision tree, and make predictions using this learner:

prediction(1) = weak_learner_prediction(1)

Calculate the residual errors between the true values and the predicted values of the target variable:

residual_error(t) = true_value(t) – prediction(t)

Train a new weak learner to predict the residual errors from the previous step:

weak_learner_prediction(t+1) = train(weak_learner, residual_error(t))

Update the predictions of the ensemble by adding the predictions of the new weak learner to the previous predictions:

prediction(t+1) = prediction(t) + weak_learner_prediction(t+1)

Repeat steps 2-4 until the desired number of weak learners has been trained.

What is AdaBoost and what are the steps of the algorithm? Include equations.

AdaBoost works by weighting the observations in the training set and adjusting the weight of each observation at each iteration of the algorithm based on the performance of the previous models. The idea behind AdaBoost is to train a sequence of weak learners in an iterative manner, with each learner trying to correct the mistakes of the previous learner. The final model is a weighted average of all the weak models.

Here are the steps of the AdaBoost algorithm:

Initialize the weights of the observations in the training set:
- At the beginning, all the observations in the training set are given the same weight.
Train a weak learner:
- Train a weak learner (e.g., a decision tree) on the training set using the current weights of the observations (My note: sample with replacement).
Calculate the error of the weak learner:
- Calculate the error (my note: total error) on the training set using the following equation:error = sum(weights[i] * I[prediction[i] != y[i]])
where I is the indicator function that returns 1 if the prediction is incorrect and 0 if the prediction is correct, weights is the array of weights for the observations, prediction is the array of predictions made by the weak learner, and y is the array of true labels for the observations.
Calculate the weight of the weak learner:
- Calculate the weight of the weak learner using the following equation :
  alpha = 0.5 * log((1 – total error) / total error)
Adjust the weights of the observations:
- Adjust the weights of the observations using the following equation:
  weights[i+1] = weights[i] * exp(-alpha * I[prediction[i] == y[i]])
Normalize the weights:
- Normalize the weights so that they sum up to 1.
Repeat steps 2-6:
- Repeat steps 2-6 a predetermined number of times or until the error of the model becomes small enough.
Return the final model:
- Return the final model as a weighted average of all the weak models.

My note:

Short note on exp(-alpha * I[prediction[i] == y[i]]) i.e. for misclassification

When the performance is relatively large the last stump did a good job in classifying the records now the new sample weight will be much larger than the old one. When the performance is relatively low the last stump did not do a good job in classifying the records now the new sample weight will only be little larger than the old one.

No misclassification: When the performance is relatively large the last stump did a good job in classifying the records now the new sample weight will be very small than the old one. When the performance is relatively small the last stump did not do a good job in classifying the records now the new sample weight will only be little smaller than the old one.

What is the difference between gradient boosting algorithm and adaptive boosting?

Gradient boosting and adaptive boosting are both ensemble learning algorithms that can be used to improve the performance of a weak learner (a model that performs only slightly better than random guessing) by combining the predictions of multiple weak learners.

The main difference between gradient boosting and adaptive boosting is the way in which the weak learners are trained and combined.

In gradient boosting, the weak learners are trained sequentially, with each learner trying to correct the mistakes of the previous learner. In adaptive boosting (also known as AdaBoost), the weak learners are also trained sequentially, but the training process is different. Each weak learner is trained on a different subset of the training data, and the weights of the training examples are adjusted after each learner is trained to give more weight to the examples that were misclassified. The final prediction is made by combining the predictions of all of the learners using a weighted sum, where the weight of each learner is determined by its accuracy on the training data.

Advent of Code Day 5 – my bonus question

I am doing the Advent of Code. So far I have solved all the questions for the four previous days and part one of the question for day five. I have also created my own question for fun, the question is below:

After many hours of walking, the Elves come to a forest glade. They are quite tired and hungry, one of the elves suddenly notices that the glade is full of mushrooms. The Elves are familiar with this mushrooms species – they are edible and quite tasty. The Elves pick all of the mushrooms and are almost ready to make mushroom soup, when they remember about one tricky problem – there is a poisonous mushroom species that looks very similar and often a poisonous mushroom will grow right among the edible mushrooms.

At this point the elves have determined the molecular structure of each mushrooms that they picked. The structure always consists of five segments and each segment consists of a number and a letter.

Example: 0.9H 0.08G 0.27L 0.57M 0.84P

Each letter molecule (A – Z) has a corresponding weight, from 0 to 25. The numbers also represent additional weight units. It is therefore possible to calculate the molecular weight of each mushroom. In the above example the weight would be 0.9 + 7 + 0.08 + 6 + 0.27 + 11 + 0.57 + 12 + 0.84 + 15 = 53.66

If the structure had a negative number, such as if it would be 0.9H -0.08G 0.27L 0.57M 0.84P, then the negative segment would need to be subtracted. The weight then would be 0.9 + 7 – 0.08 – 6 + 0.27 + 11 + 0.57 + 12 + 0.84 + 15 = 41.5

The Elves are aware that the value of each segment of a mushroom comes from a process generated by ~N(12.5, 4.5) and there is no correlation between the segments. (The value of the segment is number + letter, for example 0.9H is 7.9, while -0.08G is -6.08).

The mushroom that is poisonous is definitely tricky to find for the Elves because it looks exactly the same as the edible mushrooms. BUT! The molecular structure of this mushroom gives it away! It is very unlikely that such structure would be generated by the same process as for the edible mushrooms. Find the poisonous mushroom from the input list so that the Elves can start cooking their soup.

The list of mushrooms is in the link below:

Advent of Code Day 5 bonus question input