A/B Test Significance Calculator – Check If Your Test Results Are Statistically Valid

Make confident marketing decisions with our A/B Test Significance Calculator. Enter your control and variant conversion rates along with sample sizes to determine statistical significance and confidence level — stop guessing and start testing smarter.

Understanding A/B Test Significance

Statistical significance tells you whether the difference between your variants is likely real or just due to random chance.

  • Confidence Level: Probability that the result is not due to chance
  • P-Value: Probability of seeing this result if there's no real difference
  • Z-Score: How many standard errors the observed difference is away from zero (no difference)
  • 95% Confidence: Industry standard (p-value < 0.05)

Best Practices: Run tests until you have at least 100 conversions per variant and reach 95%+ confidence before declaring a winner.
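
For readers who want to see the mechanics behind these numbers, here is a minimal sketch of the pooled two-proportion z-test that calculators like this one typically run. The function name and structure are illustrative only, not this tool's actual code, and it uses SciPy for the normal distribution.

    # Minimal sketch of a pooled two-proportion z-test (illustrative names,
    # not this calculator's actual implementation).
    from math import sqrt
    from scipy.stats import norm

    def ab_test_significance(visitors_a, conversions_a, visitors_b, conversions_b):
        p_a = conversions_a / visitors_a              # control conversion rate
        p_b = conversions_b / visitors_b              # variant conversion rate
        # Pooled rate under the null hypothesis of no real difference
        p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
        z = (p_b - p_a) / se                          # standard errors from zero
        p_value = 2 * norm.sf(abs(z))                 # two-sided p-value
        confidence = (1 - p_value) * 100              # confidence level, in percent
        return z, p_value, confidence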

How to Use This A/B Test Calculator

1. Enter visitors and conversions for variant A

Input the total number of visitors who saw your control version (A) and how many of them completed the desired action (conversions).

2. Enter visitors and conversions for variant B

Input the same metrics for your test version (B). Make sure both variants ran during the same time period for accurate comparison.

3. Get statistical significance results instantly

Click Calculate and see your confidence level, p-value, and whether the difference between variants is statistically significant.
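
As a concrete, entirely hypothetical example of step 3, the calculation below runs the same two-proportion z-test by hand for a made-up test with 5,000 visitors per variant.

    # Hypothetical inputs: A = 5,000 visitors / 400 conversions (8.0%),
    # B = 5,000 visitors / 460 conversions (9.2%). All numbers are made up.
    from math import sqrt
    from scipy.stats import norm

    p_a, p_b = 400 / 5000, 460 / 5000
    p_pool = (400 + 460) / (5000 + 5000)
    se = sqrt(p_pool * (1 - p_pool) * (1 / 5000 + 1 / 5000))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    print(f"z = {z:.2f}, p = {p_value:.3f}")   # roughly z = 2.1, p = 0.03: significant at 95%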

Understanding Statistical Significance

What p-value means

The p-value is the probability of seeing a difference at least as large as yours if there were actually no difference between your variants. A p-value of 0.03 means there's only a 3% chance you'd see results like these if the two versions truly performed the same. Lower p-values give you more confidence that the difference is real.
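
One way to build intuition for this definition is a quick simulation of the "no real difference" world: give two groups the same true conversion rate many times over and count how often a gap at least as large as yours appears purely by chance. All parameters below are made up for illustration.

    # Both groups share the same true rate; count how often a gap at least
    # this large appears by chance. Parameters are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)
    n, true_rate, observed_gap, trials = 2000, 0.10, 0.015, 10_000
    a = rng.binomial(n, true_rate, trials) / n     # simulated control rates
    b = rng.binomial(n, true_rate, trials) / n     # simulated variant rates (same true rate)
    share = np.mean(np.abs(b - a) >= observed_gap)
    print(f"{share:.1%} of identical-variant tests show a gap of 1.5 points or more")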

Confidence levels explained

Confidence level is simply 1 minus your p-value, expressed as a percentage. The three most common thresholds are:

  • 90% confidence (p < 0.10): Some evidence of a difference, but you'd want more data before making big changes
  • 95% confidence (p < 0.05): The standard threshold. Most teams are comfortable making decisions at this level
  • 99% confidence (p < 0.01): Very strong evidence. Use this when the cost of being wrong is high

What "statistically significant" actually means

When a result is statistically significant, it means the observed difference between your variants is unlikely to be due to random chance alone. It doesn't guarantee the result will hold forever, and it doesn't tell you whether the difference is large enough to matter for your business.

Why sample size matters

Small samples are noisy. With only 50 visitors per variant, random fluctuations can easily create the appearance of a difference that isn't real. Larger samples reduce this noise and give you more reliable results. That's why most experts recommend waiting until you have at least 100 conversions per variant before drawing conclusions.
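
A short simulation makes this noise visible: the sketch below (with made-up numbers) compares the spread of observed conversion rates at 50 visitors versus 5,000 visitors per group when the true rate is identical.

    # Same true 10% conversion rate everywhere; only the sample size changes.
    # Numbers are illustrative.
    import numpy as np

    rng = np.random.default_rng(1)
    true_rate, trials = 0.10, 10_000
    for n in (50, 5_000):
        observed = rng.binomial(n, true_rate, trials) / n
        low, high = np.percentile(observed, [2.5, 97.5])
        print(f"n = {n:>5}: 95% of observed rates fall between {low:.3f} and {high:.3f}")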

A/B Test Result Categories

Category | P-Value Range | What It Means
Highly Significant | p < 0.01 | Very confident in results. Less than 1% chance this is random noise.
Significant | p < 0.05 | Confident in results. The standard threshold for declaring a winner.
Marginally Significant | p < 0.10 | Some evidence of a difference, but needs more data before you can be sure.
Not Significant | p >= 0.10 | No clear winner. The observed difference could easily be random chance.
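
If you want to apply these categories in your own reporting, a small helper like the one below mirrors the table; the thresholds and names come from the table above, while the function itself is just an illustration.

    # Helper mirroring the categories above (thresholds from the table;
    # the function itself is illustrative).
    def significance_category(p_value: float) -> str:
        if p_value < 0.01:
            return "Highly Significant"
        if p_value < 0.05:
            return "Significant"
        if p_value < 0.10:
            return "Marginally Significant"
        return "Not Significant"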

Common A/B Testing Mistakes

Stopping tests too early (the peeking problem)

Checking results daily and stopping as soon as you hit significance dramatically increases false positives. Every time you peek, you inflate your error rate. Decide your sample size in advance and stick to it.
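
A rough simulation shows how badly peeking inflates errors: even when the two variants are identical, stopping at the first check that crosses p < 0.05 produces far more than 5% false positives. All parameters below are made up.

    # Two identical variants, checked for p < 0.05 after every batch of visitors.
    # Stopping at the first "significant" peek inflates the false positive rate
    # well above the nominal 5%.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    rate, batch, checks, sims = 0.10, 500, 20, 2_000
    false_positives = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        for _ in range(checks):
            conv_a += rng.binomial(batch, rate)
            conv_b += rng.binomial(batch, rate)
            n += batch
            pool = (conv_a + conv_b) / (2 * n)
            se = np.sqrt(pool * (1 - pool) * 2 / n)
            if se > 0 and 2 * norm.sf(abs(conv_b / n - conv_a / n) / se) < 0.05:
                false_positives += 1
                break
    print(f"False positive rate with peeking: {false_positives / sims:.1%}")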

Testing too many variables at once

If you change the headline, button color, and image all at once, you won't know which change drove the result. Test one variable at a time, or use proper multivariate testing methods.

Ignoring practical significance vs statistical significance

With a large enough sample, even tiny differences become statistically significant. A 0.1% improvement might be "real" but not worth the engineering effort to implement. Always ask: does this difference matter for the business?

Not accounting for novelty effects

Users often click on new things just because they're new. A variant might perform well in the first few days simply because it's different. Run tests long enough for the novelty to wear off.

Running tests for insufficient time

User behavior varies by day of week. A test that runs only on weekdays might miss important weekend patterns. Run tests for at least one full week, preferably two or more, to capture the full cycle.

Sample Size Guidelines

The table below shows approximate sample sizes needed per variant to detect different effect sizes at 95% confidence with 80% statistical power.

Baseline Conversion Rate | Minimum Detectable Effect | Required Sample Size (per variant)
5% | 20% improvement (to 6%) | ~3,200 visitors
10% | 20% improvement (to 12%) | ~800 visitors
10% | 10% improvement (to 11%) | ~3,100 visitors
20% | 25% improvement (to 25%) | ~600 visitors
50% | 20% improvement (to 60%) | ~200 visitors
50% | 10% improvement (to 55%) | ~800 visitors

Note: These are approximate values. Actual required sample sizes depend on your specific context, desired power level, and statistical method used.
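
For reference, the sketch below implements a standard two-proportion sample-size approximation at 95% confidence and 80% power. It is one common textbook formula among several, the function and parameter names are our own, and it tends to produce somewhat larger (more conservative) numbers than quick-reference estimates such as those in the table above.

    # Standard two-proportion sample-size approximation (two-sided test).
    # Defaults give 95% confidence and 80% power.
    from math import ceil
    from scipy.stats import norm

    def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
        p1 = baseline
        p2 = baseline * (1 + relative_lift)
        z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for 95% confidence
        z_beta = norm.ppf(power)            # 0.84 for 80% power
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

    # Example: detecting a 20% relative lift on a 5% baseline
    print(sample_size_per_variant(0.05, 0.20))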

Frequently Asked Questions

What is a good p-value for A/B testing?

Most teams use p < 0.05 (95% confidence) as the standard threshold. This means there's less than a 5% chance the observed difference is due to random chance. For high-stakes decisions, consider using p < 0.01 (99% confidence). For lower-risk tests where you're okay with more uncertainty, p < 0.10 (90% confidence) might be acceptable.

How long should I run an A/B test?

Run your test until you reach your predetermined sample size, which should be calculated before starting. In practical terms, this usually means running for at least 1-2 full weeks to capture day-of-week effects. Don't stop early just because you hit significance — that's the peeking problem and it leads to false positives.

What if my results aren't significant?

Inconclusive results are common and valuable. They tell you that the change you tested probably doesn't have an impact large enough to detect at your current sample size. You can either: (1) run the test longer to see if significance emerges, (2) accept that there's no meaningful difference and pick either variant based on other factors, or (3) test a more dramatic change that's more likely to show an effect.

Can I check results while the test is running?

You can look, but don't act on what you see until you've reached your predetermined sample size. If you stop the test as soon as you hit significance, you'll get a lot of false positives. Some teams use "sequential testing" methods that allow for early stopping, but these require more advanced statistical approaches.

What's the difference between statistical and practical significance?

Statistical significance tells you whether a difference is likely real (not due to chance). Practical significance asks whether the difference is large enough to matter. With a huge sample, even a 0.01% improvement can be statistically significant — but it might not justify the cost of implementing the change. Always consider both.
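
To see this effect numerically, the hypothetical calculation below shows a lift of just 0.01 percentage points clearing the p < 0.05 bar once each variant has on the order of a hundred million visitors. The numbers are invented purely to illustrate the point.

    # Hypothetical: a 0.01 percentage-point lift (10.00% -> 10.01%) with
    # 100 million visitors per variant. The lift is negligible for the
    # business, yet the p-value clears the 0.05 bar.
    from math import sqrt
    from scipy.stats import norm

    n = 100_000_000
    p_a, p_b = 0.1000, 0.1001
    p_pool = (p_a + p_b) / 2
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (p_b - p_a) / se
    print(f"p-value = {2 * norm.sf(abs(z)):.3f}")   # below 0.05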