When t-Tests Fail in A/B Testing

A/B testing failures in production often share common mathematical roots. An experiment on revenue can look transformative when a single “whale” user makes a large purchase, only for the result to lose significance as more data arrives.

The standard t-test framework assumes your observations are independent and normally distributed. At massive scale, with tens of thousands of users, the normality part matters less because the Central Limit Theorem (CLT) makes the sample means approximately normal. But for many companies, especially startups, that scale is an unaffordable luxury: they may not have enough users, or they may not be able to run experiments long enough, to gather the large samples needed to trust the t-test blindly.

At these more moderate sample sizes, reality is messier. This is because business metrics are often plagued by issues like:

  • Skewed Distributions: Unlike height or weight, metrics like revenue are not symmetric; most users spend little, while a few spend a lot.
  • Extreme Outliers: The presence of “whale” users can dramatically pull the average, destabilizing the results.
  • Clustered Behavior: Users do not act independently. A single user can generate multiple sessions or purchases, so session-level observations violate the i.i.d. (independent and identically distributed) assumption.

The solution is not to abandon t-tests entirely, but to know when their assumptions break down and to have robust alternatives ready.

Why Welch’s t-Test Falters

Most A/B testing platforms default to Welch’s t-test, an adaptation of the Student’s t-test that does not assume equal variances between groups. It relies on the test statistic converging in distribution to a standard normal:

\[\frac{(\bar{X}_T - \bar{X}_C) - (\mu_T - \mu_C)}{\sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}} \xrightarrow{d} \mathcal{N}(0,1)\]

This convergence requires three conditions: independent observations, finite variance, and a sufficiently large sample size for the CLT to take effect. When these assumptions crumble, you should switch to methods that remain valid under skewness and heavy tails.

The t-test implementation:
import numpy as np
import warnings
from scipy import stats

def ttest_ab_test(control_data, treatment_data, alpha=0.05):
    """Performs a Welch's t-test for A/B testing."""
    n_c, n_t = len(control_data), len(treatment_data)
    mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
    var_c, var_t = np.var(control_data, ddof=1), np.var(treatment_data, ddof=1)

    diff_mean = mean_t - mean_c
    # Welch standard error of the difference (variances are not pooled)
    se_diff = np.sqrt(var_c / n_c + var_t / n_t)
    if se_diff == 0:
        return {'significant': False, 'p_value': 1.0, 'difference': diff_mean, 'ci_lower': diff_mean, 'ci_upper': diff_mean}

    t_stat = diff_mean / se_diff
    # Welch-Satterthwaite degrees of freedom
    df = (var_c / n_c + var_t / n_t)**2 / ((var_c / n_c)**2 / (n_c - 1) + (var_t / n_t)**2 / (n_t - 1))
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    margin = stats.t.ppf(1 - alpha / 2, df) * se_diff

    return {
        'difference': diff_mean,
        'ci_lower': diff_mean - margin,
        'ci_upper': diff_mean + margin,
        'significant': p_value < alpha,
        'p_value': p_value
    }

Bootstrap Methods for Distribution-Free Inference

The bootstrap, introduced by Efron in 1979, addresses a core limitation of the t-test. Instead of assuming normality, the bootstrap estimates the sampling distribution empirically by resampling from the observed data. The core principle is simple: if the sample approximates the population, then resampling from the sample approximates sampling from the population.
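
To make the resampling idea concrete before adding refinements, here is a minimal sketch of a plain percentile bootstrap for the difference in means (it reuses the numpy import from the t-test block above; percentile_bootstrap_ci is an illustrative helper, not part of the BCa implementation that follows):

def percentile_bootstrap_ci(control_data, treatment_data, alpha=0.05, n_bootstrap=2000, seed=None):
    """Plain percentile bootstrap CI for the difference in means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    control_data, treatment_data = np.asarray(control_data), np.asarray(treatment_data)
    diffs = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Resample each group with replacement, mimicking fresh draws from the population
        boot_c = rng.choice(control_data, size=len(control_data), replace=True)
        boot_t = rng.choice(treatment_data, size=len(treatment_data), replace=True)
        diffs[b] = boot_t.mean() - boot_c.mean()
    # The empirical quantiles of the resampled differences form the interval
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

The percentile interval is the simplest version; the BCa method below corrects it for bias and skewness.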

Bias-Corrected and Accelerated (BCa) Bootstrap

The Bias-Corrected and Accelerated (BCa) bootstrap is an advanced method that adjusts the confidence interval to account for both bias and skewness in the bootstrap distribution. It calculates adjusted percentile points (\(\alpha_1, \alpha_2\)) to provide a more accurate interval.

The adjustments are based on two parameters: a bias-correction factor (\(\hat{z}_0\)) and an acceleration parameter (\(\hat{a}\)).

\[\hat{z}_0 = \Phi^{-1}\left( \frac{\#\{\hat{\theta}^*_b < \hat{\theta}\}}{B} \right)\]

Where:

  • \(\hat{\theta}\) (theta-hat) is the statistic calculated from the original data (e.g., the observed difference in means).
  • \(B\) is the total number of bootstrap samples generated.
  • \(\hat{\theta}^*_b\) is the same statistic calculated for the b-th bootstrap sample. The fraction inside the formula is simply the proportion of bootstrap statistics that fall below the original statistic (a small worked calculation follows this list).
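
For an illustrative calculation (hypothetical numbers, not taken from the worked example later in this article): if 1,100 of \(B = 2000\) bootstrap differences fall below the observed difference, the proportion is 0.55 and \(\hat{z}_0 = \Phi^{-1}(0.55) \approx 0.13\). A positive \(\hat{z}_0\) indicates that the median of the bootstrap distribution sits below the point estimate, and the BCa interval shifts to compensate.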

The acceleration parameter \(\hat{a}\) is calculated using jackknife resampling.

\[\hat{a} = \frac{\sum_{i=1}^n (\bar{\theta} - \theta_{(i)})^3}{6\left[\sum_{i=1}^n (\bar{\theta} - \theta_{(i)})^2\right]^{3/2}}\]

Where:

  • \(\theta_{(i)}\) is the “jackknife” estimate, meaning the same statistic calculated on the original data but with the i-th observation removed.
  • \(\bar{\theta}\) is the average of all the jackknife estimates, \(\frac{1}{n}\sum_{i=1}^n \theta_{(i)}\).

These two parameters are then used to find the adjusted percentile levels for the confidence interval.

\[\alpha_1 = \Phi\left( \hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})} \right) \quad \text{and} \quad \alpha_2 = \Phi\left( \hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})} \right)\]

While the math is involved, the takeaway is simple. BCa provides a data-driven way to correct for imperfections that a standard bootstrap ignores.

Note that the jackknife step for the acceleration parameter can be computationally expensive (O(n²) in the straightforward implementation below). For production use on samples larger than a few thousand, consider an optimized implementation such as scipy.stats.bootstrap (a sketch follows below). For very large datasets, you can also use the fast approximation for \(\hat{a}\) given in Efron & Tibshirani (1994).
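
If you would rather not hand-roll the jackknife, scipy.stats.bootstrap can produce the same kind of interval. Here is a minimal usage sketch (scipy_bca_ci is an illustrative wrapper; older SciPy releases restrict method='BCa' to one-sample statistics, in which case method='percentile' is a fallback):

def scipy_bca_ci(control_data, treatment_data, alpha=0.05):
    """BCa interval for the difference in means via scipy.stats.bootstrap (sketch)."""
    def diff_in_means(treatment, control):
        return np.mean(treatment) - np.mean(control)

    res = stats.bootstrap(
        (treatment_data, control_data),  # each sample is resampled independently
        diff_in_means,
        vectorized=False,                # the statistic takes the samples as separate arguments
        n_resamples=5000,
        confidence_level=1 - alpha,
        method='BCa',
    )
    return res.confidence_interval.low, res.confidence_interval.high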

The BCa bootstrap implementation:
def bca_ab_test(control_data, treatment_data, alpha=0.05, n_bootstrap=2000):
    """Performs a BCa bootstrap for the difference in means."""
    rng = np.random.default_rng()
    n_c, n_t = len(control_data), len(treatment_data)
    observed_diff = np.mean(treatment_data) - np.mean(control_data)

    # Resample each group independently with replacement and record the difference in means
    bootstrap_diffs = []
    for _ in range(n_bootstrap):
        boot_control = rng.choice(control_data, size=n_c, replace=True)
        boot_treatment = rng.choice(treatment_data, size=n_t, replace=True)
        bootstrap_diffs.append(np.mean(boot_treatment) - np.mean(boot_control))
    bootstrap_diffs = np.array(bootstrap_diffs)

    # Bias correction: proportion of bootstrap statistics below the observed difference (feeds z_0)
    prop_less = np.mean(bootstrap_diffs < observed_diff)
    # Degenerate case: fall back to a plain percentile interval
    if prop_less == 0 or prop_less == 1:
        ci_lower = np.percentile(bootstrap_diffs, 100 * alpha/2)
        ci_upper = np.percentile(bootstrap_diffs, 100 * (1-alpha/2))
        return {'difference': observed_diff, 'ci_lower': ci_lower, 'ci_upper': ci_upper, 'significant': (ci_lower > 0) or (ci_upper < 0)}

    z_0 = stats.norm.ppf(prop_less)

    # Jackknife (leave-one-out over the pooled data, preserving group boundaries) to estimate the acceleration a_hat
    combined_data = np.concatenate([control_data, treatment_data])
    n_total = len(combined_data)
    jackknife_stats = []
    for i in range(n_total):
        jk_sample = np.delete(combined_data, i)
        if i < n_c:
            jk_control = jk_sample[:n_c-1]
            jk_treatment = jk_sample[n_c-1:]
        else:
            jk_control = jk_sample[:n_c]
            jk_treatment = jk_sample[n_c:]
        jackknife_stats.append(np.mean(jk_treatment) - np.mean(jk_control))
    jackknife_stats = np.array(jackknife_stats)
    jk_mean = np.mean(jackknife_stats)
    
    numerator = np.sum((jk_mean - jackknife_stats)**3)
    denominator = 6 * (np.sum((jk_mean - jackknife_stats)**2))**1.5
    if denominator == 0:
        ci_lower = np.percentile(bootstrap_diffs, 100 * alpha/2)
        ci_upper = np.percentile(bootstrap_diffs, 100 * (1-alpha/2))
        return {'difference': observed_diff, 'ci_lower': ci_lower, 'ci_upper': ci_upper, 'significant': (ci_lower > 0) or (ci_upper < 0)}

    a_hat = numerator / denominator

    # BCa-adjusted percentile levels alpha_1 and alpha_2
    z_alpha = stats.norm.ppf([alpha / 2, 1 - alpha / 2])
    z_term = z_0 + z_alpha
    alpha_adj_num = z_0 + z_term / (1 - a_hat * z_term)
    alpha_adj = stats.norm.cdf(alpha_adj_num)

    ci_lower = np.percentile(bootstrap_diffs, 100 * alpha_adj[0])
    ci_upper = np.percentile(bootstrap_diffs, 100 * alpha_adj[1])

    return {
        'difference': observed_diff,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'significant': (ci_lower > 0) or (ci_upper < 0)
    }

Concentration Inequalities for Heavy-Tailed Data

When data is extremely heavy-tailed or sample sizes are small, even the bootstrap can be unreliable. Concentration inequalities provide a powerful, non-asymptotic alternative: they give explicit, finite-sample bounds on how far a sample mean can deviate from the true mean, under only weak distributional assumptions.

A Deeper Look at Sub-Exponential Variables

Many metrics are sub-exponential, meaning their tails decay at least as fast as an exponential distribution. For a sum of such variables, Bernstein’s inequality provides a tight bound on the deviation of the sample mean \(\bar{X}\) from the true mean \(\mu\).

\[\mathbb{P}\left( \left|\bar{X} - \mu\right| \geq t \right) \leq 2\exp\left( -\frac{nt^2}{2v^2 + 2bt} \right)\]

The term \(v^2\) acts like a variance parameter, while \(b\) controls for the heavy tail.

The Practical Challenge and Solution

In practice, estimating the tail parameter \(b\) from data is difficult and often not robust. However, for a wide range of problems, we can use a simpler and more robust empirical Bernstein bound.

This approach plugs the sample variances (assuming independent observations) in for \(v^2\) and treats the tail term \(bt\) as negligible relative to the variance term, which is reasonable unless the deviation of interest is very large. For the difference between two means, this yields the following confidence interval.¹

\[CI = (\bar{X}_T - \bar{X}_C) \pm \sqrt{\left(\frac{s_C^2}{n_C} + \frac{s_T^2}{n_T}\right) 2 \log(2/\alpha)}\]

This formula provides a robust, conservative confidence interval with guaranteed coverage, making it ideal for high-stakes decisions or noisy data.
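
To see where this radius comes from, here is a sketch of the inversion under the simplifying assumption that the tail term \(2bt\) is negligible. Setting the right-hand side of Bernstein’s inequality equal to \(\alpha\) and solving for \(t\) gives

\[2\exp\left(-\frac{nt^2}{2v^2}\right) = \alpha \quad\Longrightarrow\quad t = \sqrt{\frac{2v^2\log(2/\alpha)}{n}},\]

and replacing \(v^2/n\) with the plug-in variance of the difference in means, \(\frac{s_C^2}{n_C} + \frac{s_T^2}{n_T}\), recovers the interval above.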


¹ Some variations of this bound use a slightly different constant, such as \(\log(3/\alpha)\). The \(\log(2/\alpha)\) form is a common and intuitive choice.

The concentration inequality implementation:
def concentration_ab_test(control_data, treatment_data, alpha=0.05):
    """Constructs a CI for the difference in means via a concentration inequality."""
    n_c, n_t = len(control_data), len(treatment_data)
    mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
    var_c, var_t = np.var(control_data, ddof=1), np.var(treatment_data, ddof=1)
    
    diff_mean = mean_t - mean_c
    diff_var = var_c / n_c + var_t / n_t
    # Deviation radius from the simplified Bernstein-style bound: sqrt(2 * Var(diff) * log(2/alpha))
    epsilon = np.sqrt(2 * diff_var * np.log(2 / alpha))
    
    return {
        'difference': diff_mean,
        'ci_lower': diff_mean - epsilon,
        'ci_upper': diff_mean + epsilon,
        'significant': (diff_mean - epsilon > 0) or (diff_mean + epsilon < 0)
    }

Practical Implementation

Here is a unified function that automatically selects a method based on data characteristics. The key diagnostic is the coefficient of variation (CV), the standard deviation divided by the mean, which measures relative variability. Revenue metrics often have a CV greater than 2.

The unified robust A/B test implementation:
def robust_ab_test(control_data, treatment_data, alpha=0.05, method='auto'):
    """Selects an A/B test based on the data's coefficient of variation."""
    n_c, n_t = len(control_data), len(treatment_data)
    mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
    std_c, std_t = np.std(control_data, ddof=1), np.std(treatment_data, ddof=1)
    
    cv_c = std_c / mean_c if mean_c > 0 else float('inf')
    cv_t = std_t / mean_t if mean_t > 0 else float('inf')
    max_cv = max(cv_c, cv_t)
    
    if method == 'auto':
        # Heuristic routing: low CV with large n -> t-test; very high CV -> concentration bound; otherwise bootstrap
        if max_cv < 1.5 and min(n_c, n_t) > 500:
            method = 'ttest'
        elif max_cv > 2.5:
            method = 'concentration'
        else:
            method = 'bootstrap'
    
    if method == 'ttest':
        result = ttest_ab_test(control_data, treatment_data, alpha)
    elif method == 'bootstrap':
        result = bca_ab_test(control_data, treatment_data, alpha, n_bootstrap=5000)
    elif method == 'concentration':
        result = concentration_ab_test(control_data, treatment_data, alpha)
    else:
        raise ValueError(f"Unknown method: {method}")
    
    result['method'] = method
    result['cv_control'] = cv_c
    result['cv_treatment'] = cv_t
    return result

Method Selection Guidelines

A method-selection cheat-sheet for experimenters (Table 1).

| Data Characteristics | Recommended Method | Reason |
| --- | --- | --- |
| CV < 1.5, n > 500, few outliers | t-test | Classical assumptions hold reasonably well. |
| CV > 2.5 or very heavy tails | concentration | Provides robust, guaranteed bounds for wild data. |
| Moderate skewness/outliers | bootstrap | Handles skewness well without distributional assumptions. |
| Small samples (n < 500) | concentration | Offers strong finite-sample guarantees. |
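
The 'auto' mode in robust_ab_test applies roughly these thresholds for you. If you already know how your metric behaves, you can bypass the heuristic and pin the method explicitly; a minimal usage sketch, assuming control_data and treatment_data are your per-user metric arrays:

# Force the concentration-inequality method regardless of the measured CV
result = robust_ab_test(control_data, treatment_data, alpha=0.05, method='concentration')
print(result['ci_lower'], result['ci_upper'], result['significant'])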

Worked Example with Heavy-Tailed Revenue Data

Let’s simulate a revenue test where data follows a Pareto distribution, commonly used to model wealth and user spending patterns in which a few “power users” contribute disproportionately. With a shape parameter of 1.8, the distribution has a finite mean but infinite variance, exactly the regime where the t-test’s finite-variance requirement breaks down.

The simulation and analysis code:
print("=== A/B Testing with Pareto Distribution ===")

rng = np.random.default_rng(seed=475)

control_revenue = rng.pareto(1.8, 150)
treatment_revenue = rng.pareto(1.8, 150)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = robust_ab_test(control_revenue, treatment_revenue)

print(f"Method automatically selected: {result['method'].title()}")
print(f"Control CV: {result['cv_control']:.2f}, Treatment CV: {result['cv_treatment']:.2f}")
print(f"Observed Difference: ${result['difference']:.2f}")
print(f"95% CI: [${result['ci_lower']:.2f}, ${result['ci_upper']:.2f}]")
print(f"Significant: {result['significant']}")

print("\n=== Method Comparison ===")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    ttest_result = ttest_ab_test(control_revenue, treatment_revenue)
    bootstrap_result = bca_ab_test(control_revenue, treatment_revenue, n_bootstrap=5000)
    concentration_result = concentration_ab_test(control_revenue, treatment_revenue)

print(f"T-test:        Diff=${ttest_result['difference']:.2f}, CI=[${ttest_result['ci_lower']:.2f}, ${ttest_result['ci_upper']:.2f}], Sig={ttest_result['significant']}")
print(f"Bootstrap:     Diff=${bootstrap_result['difference']:.2f}, CI=[${bootstrap_result['ci_lower']:.2f}, ${bootstrap_result['ci_upper']:.2f}], Sig={bootstrap_result['significant']}")
print(f"Concentration: Diff=${concentration_result['difference']:.2f}, CI=[${concentration_result['ci_lower']:.2f}, ${concentration_result['ci_upper']:.2f}], Sig={concentration_result['significant']}")

Example Output

=== A/B Testing with Pareto Distribution ===
Method automatically selected: Bootstrap
Control CV: 2.13, Treatment CV: 1.42
Observed Difference: $0.42
95% CI: [$-0.06, $0.79]
Significant: False

=== Method Comparison ===
T-test:        Diff=$0.42, CI=[$0.01, $0.83], Sig=True
Bootstrap:     Diff=$0.42, CI=[$-0.02, $0.80], Sig=False
Concentration: Diff=$0.42, CI=[$-0.14, $0.99], Sig=False

The robust_ab_test function flags moderate skewness (control CV = 2.13) and automatically selects the bootstrap method. Because both groups were drawn from the same Pareto distribution, the true difference is zero, so any “significant” result here is a false positive. This example demonstrates the crucial difference: the t-test incorrectly claims significance with CI = [$0.01, $0.83], while both robust methods (bootstrap and concentration) correctly conclude that the difference is not statistically significant. This is exactly the type of false positive that can lead to misguided business decisions when relying solely on t-tests with skewed data.

Key Takeaways

  1. T-tests Are Brittle. The assumptions of the t-test (like normally distributed data) are often violated by real-world business metrics like revenue, which are typically skewed and filled with outliers.
  2. This Fragility Leads to Costly Errors. A “significant” p-value from a t-test can be a statistical mirage. As the article’s final example shows, random outliers can easily trick a t-test into declaring a false positive, which can lead to misguided business decisions.
  3. Robust Methods Provide a More Honest Picture. Techniques like the bootstrap don’t assume your data is clean. They are designed to correctly identify the massive uncertainty caused by outliers, providing a crucial safeguard against being misled.
  4. Your Choice of Statistical Tool Is a Business Decision. The difference between a t-test and a bootstrap on the same messy dataset can be the difference between a confident but wrong product launch and a correctly cautious one.

References

  • Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
  • DiCiccio, T. J., & Efron, B. (1996). Bootstrap Confidence Intervals. Statistical Science, 11(3), 189-228.
  • Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1), 1-26.
  • Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Chapman & Hall/CRC.