Calculating the confidence interval for a percentile is a crucial step in understanding the variability and the uncertainty around the estimated value. In many real-world applications, the distribution of the data is unknown and this makes it difficult to determine the confidence intervals. In such scenarios, using a binomial distribution can be a viable alternative to estimate the confidence intervals for a percentile.
For instance, let’s consider a variable with 300 data points and we want to calculate the 70th and 90th percentiles and the corresponding confidence intervals for the variable. To do this, we can use a binomial distribution approach.
First, we need to choose an alpha level, which is a probability that determines the size of the confidence interval. A common choice for alpha is 0.05, which corresponds to a 95% confidence interval.
Next, we use the cumulative distribution function (CDF) of the binomial distribution to estimate the lower and upper bounds of the confidence interval. The CDF of the binomial distribution gives the probability of getting k or fewer successes in n independent Bernoulli trials, where the probability of success in each trial is p.
To calculate the 70th percentile and its confidence interval, we use the following steps:
- Set n = 300, which is the number of data points.
- Set p = 0.7, which corresponds to the 70th percentile.
- Calculate the binomial quantile using the CDF, which is the smallest k such that P(X <= k) >= p, where X is a binomial random variable with parameters n and p.
- Use the CDF to determine the lower and upper bounds of the confidence interval.
Below is the python code for calculating the confidence interval for the 70th percentile.
alpha – alpha is a parameter representing the significance level or confidence level for the calculation of the confidence interval. It is the probability that the confidence interval contains the true value of the parameter being estimated. The value of alpha is typically set to 0.05 or 0.01, meaning that there is a 95% or 99% chance, respectively, that the confidence interval contains the true value. In the code, alpha=0.05
is the default value for alpha, but it can be changed to a different value if desired.
n – number of observations
q – percentile value
from scipy.stats import binom
import numpy as np
alpha = 0.05
n = 300
q = 0.7
Below is the code for calculating the upper and lower bounds for the confidence interval. The u value is calculated as the ceiling of the binomial distribution’s quantile function (ppf) evaluated at 1 – alpha / 2 (1 – 0.05 / 2 = 0.975), and the value is shifted by adding an array of numbers from -2 to 2. Any values of u that are greater than n are set to infinity.
u = np.ceil(binom.ppf(1 - alpha / 2, n, q)) + np.arange(-2, 3)
u[u > n] = np.inf
l = np.ceil(binom.ppf(alpha / 2, n, q)) + np.arange(-2, 3)
l[l < 0] = -np.inf
# From the calculation of bounds, np.ceil(binom.ppf(1 - alpha / 2, n, q)) and np.ceil(binom.ppf(alpha / 2, n, q)), we obtain that
# the upper bound value is 225 and the lower bound value is 194. This means that given a sample of size 300, a binomial distribution, and # probability of success p=0.7, we are 95% certain that the number of successes will be between 194 and 225.
Next we calculate coverage of the percentiles that the bounds cover. The coverage represents a matrix of values that correspond to the probability of coverage of the confidence interval for each combination of lower and upper bounds of the interval.
The coverage calculation uses the binom.cdf function to calculate the cumulative distribution function (CDF) for the binomial distribution, which is then used to determine the coverage probability of each combination of u and l. Once the coverage matrix is calculated, the code finds the index i corresponding to the combination of u and l that gives the closest coverage probability to 1-alpha.
coverage = np.zeros((len(l), len(u)))
for i, a in enumerate(l):
for j, b in enumerate(u):
coverage[i, j] = binom.cdf(b - 1, n, q) - binom.cdf(a - 1, n, q)
Next we select the upper and lower bounds of the confidence interval based on the coverage of the interval. The code first checks if the maximum coverage is less than 1 minus the significance level alpha. If it is, the code selects the pair of bounds with the maximum coverage probability. Otherwise, the code selects the pair of bounds with the smallest coverage probability that is still greater than or equal to 1 minus alpha.
if np.max(coverage) < 1 - alpha:
i = np.where(coverage == np.max(coverage))
else:
i = np.where(coverage == np.min(coverage[coverage >= 1 - alpha]))
i_u = i[0][0]
i_l = i[1][0]
u_final = min(n, u[i_u])
u_final = max(0, int(u_final)-1)
l_final = min(n, l[i_l])
l_final = max(0, int(l_final)-1)
The resulting l and u are 192 and 223, respectively. Therefore if you have a sample of 300 and you want to calculate the confidence interval for a variable X, you would sort the values in ascending order, and then you would take the values of X that correspond to the 192nd and 223rd observations.