
A/B Test Calculator: Determine If Results Are Statistically Significant

Updated Apr 10, 2026


Embed

Add this to your site

<iframe
  src="https://calc.cards/embed/business/ab-test-calculator"
  width="600"
  height="700"
  frameborder="0"
  loading="lazy"
  title="Calc.Cards calculator"
  style="border:1px solid #e0e0e0;border-radius:8px;max-width:100%;"
></iframe>

Free with attribution. The A/B Test Calculator runs entirely inside the iframe.

Branded

Customize & brand for your site

Get the A/B Test Calculator as a self-contained widget styled with your colors and logo. No iframe, no Calc.Cards branding.

  • Brand color palette (auto-extract from your URL)
  • Your logo, your typography
  • Clean HTML/CSS/JS you can drop on any page
  • Lifetime updates if the formula changes
Brand this calculator — $199

Need something different? Build a fully custom calc

You ran an A/B test. But is the result real?

You test a new checkout button. Version A (blue button) converts 100 out of 1,000 visitors (10%). Version B (red button) converts 115 out of 1,000 visitors (11.5%). You think red is better. But you only tested 2,000 people. That 1.5% difference could be random noise. This calculator tells you whether your test result is statistically significant (likely to repeat) or just chance. Testing without knowing significance is like flipping a coin and calling it skill. This calculator removes the guesswork.

What This Calculator Does

This tool performs a chi-squared statistical test on your A/B test results and tells you the probability that your result is real (not due to random chance). Feed in the number of people who saw each version, how many converted in each, and the calculator computes statistical significance. It shows p-value (probability the result is random), confidence level (how confident you can be the result is real), and whether the test reached the standard threshold of 95% confidence.
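For readers who want the computation spelled out, here is a minimal sketch in Python. It is an illustration of the method described above, not the calculator's actual source: it builds the 2x2 table of conversions and non-conversions, compares observed counts to the counts expected if both versions converted at the same overall rate, and converts the chi-squared statistic into a p-value.

import math

def ab_test_significance(visitors_a, conversions_a, visitors_b, conversions_b,
                         confidence=0.95):
    # Observed counts: conversions and non-conversions for each version.
    observed = [conversions_a, visitors_a - conversions_a,
                conversions_b, visitors_b - conversions_b]
    total = visitors_a + visitors_b
    total_conv = conversions_a + conversions_b
    # Expected counts if both versions truly converted at the overall rate.
    expected = [visitors_a * total_conv / total,
                visitors_a * (total - total_conv) / total,
                visitors_b * total_conv / total,
                visitors_b * (total - total_conv) / total]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the p-value has a closed form via erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return {"chi2": chi2, "p_value": p_value,
            "significant": p_value < (1 - confidence)}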

How to Use This Calculator

Step 1: Count visitors in version A (control). If your blue button page got 5,000 visits, enter 5000.

Step 2: Count conversions in version A. If 500 visitors clicked the button, enter 500.

Step 3: Count visitors in version B (variant). If your red button page got 5,000 visits, enter 5000.

Step 4: Count conversions in version B. If 575 visitors clicked the button, enter 575.

Step 5: Enter your desired confidence level. 95% is standard (5% chance you're wrong). 90% is faster but less certain. 99% is more rigorous but requires larger sample sizes.

Step 6: The calculator shows whether your test is significant. If the p-value is below 0.05 (5%), your result is likely real and you can implement it. If the p-value is above 0.05, your result is not significant: the difference might be random.
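Plugging the step-by-step numbers above into the ab_test_significance sketch from the previous section (a hypothetical helper, not the calculator's own code) reproduces the decision in Step 6:

result = ab_test_significance(5000, 500, 5000, 575, confidence=0.95)
print(round(result["p_value"], 3), result["significant"])   # prints: 0.015 True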

The Formula Behind the Math

Chi-Squared Test for A/B Testing


Chi² = Σ [(Observed - Expected)² / Expected]

For a simple two-variant test:


Chi² = (n1 × n2) / (n1 + n2) × [(p1 - p2)² / (p × (1 - p))]

Where:

n1 = Sample size for variant A
n2 = Sample size for variant B
p1 = Conversion rate for variant A
p2 = Conversion rate for variant B
p = Overall conversion rate
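As a sketch, the simplified formula translates almost line for line into code. The helper name chi_squared_from_rates is illustrative, and the numbers in the final line anticipate the worked example below:

def chi_squared_from_rates(n1, p1, n2, p2):
    p = (n1 * p1 + n2 * p2) / (n1 + n2)          # overall conversion rate
    return (n1 * n2) / (n1 + n2) * (p1 - p2) ** 2 / (p * (1 - p))

print(chi_squared_from_rates(5000, 0.10, 5000, 0.115))   # approx. 5.86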

Worked Example:

Version A: 5,000 visitors, 500 conversions (10% conversion rate)
Version B: 5,000 visitors, 575 conversions (11.5% conversion rate)

Overall conversion rate p = (500 + 575) / (5,000 + 5,000) = 10.75%

Chi² = (5,000 × 5,000) / (5,000 + 5,000) × [(0.10 - 0.115)² / (0.1075 × 0.8925)]
Chi² = 2,500 × [0.000225 / 0.0959] = 2,500 × 0.002346 = 5.87

Converting Chi² to P-Value:

Using a chi-squared distribution table with 1 degree of freedom, Chi² of 5.87 gives a p-value of approximately 0.015.
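If you would rather not reach for a distribution table, the conversion has a closed form for 1 degree of freedom. A small sketch:

import math

def chi2_to_p_value(chi2):
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

print(round(chi2_to_p_value(5.87), 4))   # prints 0.0154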

Interpreting p-value:

p < 0.05: Result is statistically significant at 95% confidence. You can implement the change.
p < 0.01: Result is highly significant at 99% confidence. Very confident in the change.
p > 0.05: Result is not significant. The difference might be random. Keep testing.

Our calculator does all of this instantly, but now you understand exactly what it's computing.
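If SciPy happens to be available, scipy.stats.chi2_contingency gives the same answer straight from the raw counts. Passing correction=False matches the plain Pearson chi-squared used in the hand calculation; the small gap from 5.87 is just rounding in the intermediate step above:

from scipy.stats import chi2_contingency

table = [[500, 4500],    # version A: conversions, non-conversions
         [575, 4425]]    # version B: conversions, non-conversions
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 2), round(p, 4))   # prints: 5.86 0.0154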

E-commerce Company Testing Landing Pages

You test a new headline. Control (original): 40,000 visitors, 1,600 conversions (4%). Variant (new headline): 40,000 visitors, 1,800 conversions (4.5%). Difference is 0.5%, which seems small. But with 40,000 visitors per variant, the test is highly significant (p < 0.01). You can confidently launch the new headline knowing the improvement will repeat. You'll gain 50 additional customers per 10,000 visitors, which compounds to thousands of additional customers annually.

SaaS Company Testing Pricing

You test pricing tiers. Control: 5,000 free trial users, 200 convert to paid (4%). Variant: 5,000 free trial users at a higher price tier, 180 convert to paid (3.6%). Difference is -0.4% (the higher price appears to lower conversion). But with 5,000 users per variant, the p-value is roughly 0.30 (not significant). You can't confidently say the higher price caused lower conversion. The result could be random. Test longer or try a different price point.

Email Campaign Testing Subject Lines

You test two subject lines. Control: 50,000 emails sent, 3,000 opens (6%). Variant: 50,000 emails sent, 3,200 opens (6.4%). Difference is 0.4%. With 50,000 emails per variant, the p-value is about 0.009 (highly significant). You can confidently say the new subject line is better and use it for future campaigns. With email's high volume, even small differences become statistically significant.

Ad Campaign Testing Creative

You test two ad creatives. Control: 2,000 clicks, 40 conversions (2%). Variant: 2,000 clicks, 52 conversions (2.6%). Difference is 0.6%. With only 2,000 clicks per variant, the p-value is about 0.21 (not significant). You can't confidently say the variant is better. You need to run the test longer, or test a bolder creative change (a bigger expected difference), to reach significance.
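As a rough cross-check, and assuming SciPy is installed, the four scenarios above can be run through the same test in a few lines; the printed p-values should land near the figures quoted:

from scipy.stats import chi2_contingency

scenarios = {
    "landing page": (1600, 40000, 1800, 40000),
    "pricing tier": (200, 5000, 180, 5000),
    "subject line": (3000, 50000, 3200, 50000),
    "ad creative":  (40, 2000, 52, 2000),
}
for name, (conv_a, n_a, conv_b, n_b) in scenarios.items():
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    _, p, _, _ = chi2_contingency(table, correction=False)
    print(f"{name}: p = {p:.3f}")
# landing page: p = 0.000; pricing tier: p = 0.296; subject line: p = 0.009; ad creative: p = 0.206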

Tips and Things to Watch Out For

Sample size matters enormously. Take the checkout-button example from the top of this page: 10% vs. 11.5% conversion. With 1,000 visitors per variant (100 vs. 115 conversions), the result is not significant (p ≈ 0.28). With 10,000 visitors per variant (1,000 vs. 1,150 conversions), the same lift is highly significant (p < 0.001). Always calculate required sample size before starting a test.

Don't stop a test early. If your test shows significance after reaching 50% of planned sample size, don't declare victory yet. Significance can flip as you collect more data. Run the test to completion. Early stopping biases results toward false positives.
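A rough simulation illustrates why. The sketch below assumes both variants share the same true 5% conversion rate (so any "winner" is a false positive), checks for significance after every batch of traffic, and compares that against checking only once at the planned sample size; the traffic numbers are made-up assumptions.

import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    # Pearson chi-squared for a 2x2 table, converted to a p-value (1 df).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0
    chi2 = (n_a * n_b) / (n_a + n_b) * (conv_a / n_a - conv_b / n_b) ** 2 / (pooled * (1 - pooled))
    return math.erfc(math.sqrt(chi2 / 2.0))

random.seed(1)
TRUE_RATE, BATCH, BATCHES, RUNS = 0.05, 250, 20, 1000
stopped_early = held_to_plan = 0
for _ in range(RUNS):
    conv_a = conv_b = n = 0
    ever_significant = False
    for _ in range(BATCHES):
        n += BATCH
        conv_a += sum(random.random() < TRUE_RATE for _ in range(BATCH))
        conv_b += sum(random.random() < TRUE_RATE for _ in range(BATCH))
        if p_value(conv_a, n, conv_b, n) < 0.05:
            ever_significant = True          # a peeker would stop and ship here
    stopped_early += ever_significant
    held_to_plan += p_value(conv_a, n, conv_b, n) < 0.05

print(f"false positives when peeking every batch: {stopped_early / RUNS:.0%}")   # well above 5%
print(f"false positives at the planned sample size: {held_to_plan / RUNS:.0%}")  # about 5%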

Multi-variant testing requires correction. If you test five variants simultaneously, you need a higher significance threshold (p < 0.01 instead of 0.05) to account for false positives. Use Bonferroni correction or simply test one variant at a time.
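A minimal sketch of the arithmetic (the variant names and p-values below are made up for illustration):

alpha = 0.05
p_values = {"variant B": 0.020, "variant C": 0.041, "variant D": 0.008}
threshold = alpha / len(p_values)       # Bonferroni: 0.05 / 3 ≈ 0.0167
for name, p in p_values.items():
    verdict = "significant" if p < threshold else "not significant after correction"
    print(f"{name}: p = {p} -> {verdict}")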

Statistical significance doesn't equal practical significance. A result might be statistically significant but practically tiny. A 0.1% conversion improvement on 1 million visitors is significant (p < 0.001) but adds only 1,000 customers. Was it worth the test? Probably not, if you were hoping for a bigger impact. Focus on effects that matter.

Novelty effects aren't permanent. An email subject line test shows the new subject line wins. But in week two of using it, engagement drops back to baseline (subscribers got used to it). Account for novelty effects by testing for 2-4 weeks, not 3 days.

Seasonality affects tests. Monday email conversion is different from Friday conversion. Holiday season conversion is different from regular season. Test for a full business cycle to avoid seasonal bias. If you must test during one season, note it and repeat during other seasons.

Beware of selection bias. If you test the red button on new users and the blue button on returning users, you're not testing button color; you're testing user type. Control all variables except the one you're testing. Randomize visitors evenly between control and variant.

*This A/B test calculator uses the chi-squared statistical test to determine significance. Results assume proper randomization, no selection bias, and adequate sample size. Statistical significance is set at the 0.05 level (95% confidence) by default, which is industry standard but arbitrary. Different industries and use cases may warrant different significance thresholds.*

Frequently Asked Questions

What's the difference between p-value and confidence level?

p-value is the probability your result happened by random chance. Confidence level is 1 minus p-value. A p-value of 0.05 means 5% chance it's random, or 95% confidence it's real. Use p-value < 0.05 as your threshold for implementation.

How many people do I need to test to get significant results?

Depends on your baseline conversion rate and expected improvement. Testing from 5% to 6% conversion requires more samples than testing from 5% to 10%. Use a sample size calculator before starting (search "A/B test sample size calculator"). Typical ranges: 1,000-10,000 per variant for most tests.
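For a rough estimate without hunting for another tool, the standard two-proportion sample-size formula (a normal approximation, here with two-sided 5% significance and 80% power) can be sketched in a few lines; the 5% to 6% example is illustrative:

import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    spread = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
              + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(spread ** 2 / (p1 - p2) ** 2)

print(sample_size_per_variant(0.05, 0.06))   # prints 8158 visitors per variant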

Should I always test at 95% confidence or can I use 90%?

95% is standard because it balances risk and sample size. 90% requires fewer samples but has a higher false positive risk (a 10% chance you're implementing a change that doesn't work). 99% requires huge samples but has a lower false positive risk. Start with 95%; adjust based on the cost of a false positive.

My test shows variant is significantly worse. Should I kill it immediately?

Yes. If control is statistically significantly better than variant, switch back. No reason to continue testing. Negative results are just as valuable as positive ones; they tell you what doesn't work.

What if my p-value is exactly 0.05?

Borderline. Technically significant at the 95% threshold, but barely. If p is 0.05-0.10, consider it "possibly significant" and test longer if the practical impact is important. If p is 0.05 and the effect is tiny, it might not be worth implementing.

Can I peek at results mid-test?

Technically, no: peeking at interim results biases the test. But many tools (Google Optimize, Optimizely) handle this mathematically. If you're peeking manually, stick to your planned sample size before making decisions.

What's the minimum sample size to test?

At least 100-200 conversions per variant (not 100-200 visitors; conversions). If your conversion rate is 2%, you need 5,000-10,000 visitors per variant. Lower-conversion-rate products need larger sample sizes.

Can I run multiple A/B tests simultaneously?

Yes, but you need to segment your audience so the tests don't overlap (no user should see both tests). And if running many tests, use a Bonferroni correction (divide your significance threshold by the number of tests). Otherwise, you increase false positive risk.

Related Calculators

Use the conversion rate calculator to understand your baseline conversion rates before running tests. Check the pricing calculator to model the revenue impact of different price tests. The email marketing ROI calculator helps you measure the business impact of email test winners.
