Why does sample size matter in A/B testing?
Imagine flipping a fair coin ten times and getting seven heads. You might conclude the coin is biased. It is not -- you just did not flip it enough times. With 10,000 flips, the ratio would converge toward 50/50. A/B testing works on exactly the same principle. With small samples, random variation dominates. With large samples, the real effect emerges.
In e-commerce, the stakes of getting this wrong are measured in revenue. If you stop a test after three days because variant B shows a 15% lift, you might deploy a change that was riding a weekend traffic spike or a coincidental influx of high-intent users. Two weeks later, the 'winning' variant is underperforming the original -- but you have already rolled it out to 100% of traffic.
The cost of a false positive is not just the development time spent implementing the variant. It is the ongoing revenue impact of a change that degrades performance, compounded by the fact that you believe it is helping. You might not revisit that element for months, trusting a result that was never real.
How do you calculate the required sample size before a test?
The calculation is straightforward once you understand the three inputs. Each one represents a decision you make before the test begins -- and each one has practical implications for how long the test needs to run.
Input 1: Baseline conversion rate
This is the current conversion rate of the page or flow you are testing. Pull it from your analytics for the same page, same traffic source, and same time period that the test will run. A 2% baseline requires more samples than a 10% baseline to detect the same relative improvement, because there is more noise relative to the signal.
Input 2: Minimum detectable effect (MDE)
MDE is the smallest improvement you care about detecting. If your baseline is 2% and you set MDE at 10% relative (a lift from 2.0% to 2.2%), you need a larger sample than if you set MDE at 20% relative (a lift from 2.0% to 2.4%). Smaller effects require more data to distinguish from noise.
Input 3: Confidence level and statistical power
Confidence level (typically 95%) sets your Type I error threshold (alpha). In practice, with alpha = 0.05, you tolerate a 5% chance of flagging a difference when none exists under the null hypothesis. Statistical power (typically 80%) is the probability of detecting a real effect at your chosen MDE. Together, they define your false-positive and false-negative risk.
| Baseline CR | MDE: 5% relative | MDE: 10% relative | MDE: 20% relative |
|---|---|---|---|
| 1% | ~630,000 | ~160,000 | ~40,000 |
| 2% | ~310,000 | ~78,000 | ~20,000 |
| 3% | ~200,000 | ~51,000 | ~13,000 |
| 5% | ~120,000 | ~31,000 | ~8,000 |
| 10% | ~57,000 | ~15,000 | ~4,000 |
These numbers are per variant. For a standard A/B test with one control and one variant, double the number to get the total required traffic. For an A/B/C test (one control and two variants), triple it.
The practical implication is clear: if your page receives 5,000 visitors per week and you need 80,000 per variant, the test needs to run for 32 weeks -- over seven months. At that point, you have three choices: increase traffic to the page, increase MDE (accept that you can only detect larger effects), or test on a higher-traffic page instead.
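The arithmetic behind the table can be sketched in a few lines of Python. This is the standard normal-approximation formula for a two-sided two-proportion test, not necessarily the exact formula any particular calculator uses, so the outputs land close to but not exactly on the table values; the function name and defaults are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cr, mde_relative, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_relative)          # conversion rate at the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# A 2% baseline with a 10% relative MDE needs roughly 80,000 visitors per variant
print(sample_size_per_variant(0.02, 0.10))
```

Note how the denominator is the squared absolute difference between the two rates: halving the MDE quadruples the required sample, which is why the table's columns fall off so steeply.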
Why is stopping a test early so dangerous?
This is the most common and most expensive mistake in A/B testing. The scenario is always the same: you launch a test on Monday. By Wednesday, the testing tool shows variant B with a 12% lift and 96% statistical significance. The green light is on. The temptation to call it a winner and move to the next test is overwhelming.
Do not do it. That 96% significance number is misleading because it does not account for the multiple times you have checked the result. Every time you peek at the data and decide whether to stop, you are effectively running an additional statistical test. This is called the peeking problem or the multiple comparison problem, and it inflates your actual false positive rate far beyond the stated 5%.
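The inflation is easy to demonstrate with a simulation. The sketch below runs many A/A tests -- both arms identical, so every "significant" result is by construction a false positive -- and peeks once per day with a fixed 1.96 threshold. All parameters (daily traffic, conversion rate, trial count) are illustrative, and daily conversion counts use a normal approximation to the binomial to keep the run fast.

```python
import math
import random

def peeking_false_positive_rate(n_trials=2000, peeks=14, daily_n=1000,
                                p=0.05, z_crit=1.96, seed=42):
    """Simulate A/A tests (no real difference) with one peek per day.

    Returns the fraction of trials that were (wrongly) stopped as significant
    at any peek. Daily conversions are drawn from a normal approximation to
    the binomial, which is accurate here since daily_n * p is large.
    """
    rng = random.Random(seed)
    sd = math.sqrt(daily_n * p * (1 - p))
    false_positives = 0
    for _ in range(n_trials):
        conv_a = conv_b = 0.0
        n = 0
        for _ in range(peeks):
            conv_a += rng.gauss(daily_n * p, sd)
            conv_b += rng.gauss(daily_n * p, sd)
            n += daily_n
            # pooled two-proportion z-test on the cumulative data so far
            pooled = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(2 * pooled * (1 - pooled) / n)
            if abs(conv_a / n - conv_b / n) / se > z_crit:
                false_positives += 1
                break
    return false_positives / n_trials

print(peeking_false_positive_rate())          # well above the nominal 5%
print(peeking_false_positive_rate(peeks=1))   # close to the nominal 5%
```

Each peek gets its own chance to cross the 1.96 threshold, so the cumulative false positive rate grows with the number of looks even though every individual test is run "at 95% confidence".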
The SNOCKS search bar test is instructive because of its duration: two full months. For a team eager to ship changes and move to the next test, two months feels like an eternity. But the traffic volume and baseline conversion rate demanded it. The test required that duration to separate signal from noise.
The peeking penalty in numbers
| Peeking frequency | Nominal alpha | Actual false positive rate |
|---|---|---|
| Once (at end only) | 5% | 5% |
| Daily for 2 weeks | 5% | ~15% |
| Daily for 4 weeks | 5% | ~25% |
| Hourly for 2 weeks | 5% | ~30% |
If you must monitor results during the test (and operationally, you should), use sequential testing methods that adjust the significance threshold for peeking. Tools like AB Tasty and VWO offer always-valid p-values or Bayesian approaches that account for interim analysis. But even with these adjustments, the test still needs the full calculated sample size to reach reliable conclusions.
What happens when traffic is too low for meaningful A/B testing?
Not every page has enough traffic for A/B testing. And pretending otherwise -- running tests on low-traffic pages and declaring winners based on insufficient data -- is worse than not testing at all, because it creates false confidence in changes that may be harmful.
The green vs. red discount label test illustrates this perfectly. We tested whether displaying discounts in green (associated with savings) versus red (associated with urgency) affected conversion. The measured difference was 0.18%. At the traffic levels available, this difference was nowhere near statistically significant. It could easily be noise.
The honest conclusion is not that color does not matter. It is that this test could not tell us whether color matters, because we did not have enough traffic to detect a small effect. That is a fundamentally different statement, and the distinction is critical for maintaining intellectual honesty in a testing program.
What to do when traffic is insufficient
- Accept larger MDE. If you can only detect 20% relative lifts, test only changes that you expect to produce at least 20% improvement. Small refinements cannot be validated at low traffic volumes.
- Test at a higher funnel stage. If the product page has 5,000 monthly visitors, the homepage or collection page may have 50,000. Test there instead, where you can reach significance.
- Aggregate across pages. If you are testing a design pattern (not a specific piece of content), apply it across multiple product pages simultaneously to increase the effective sample size.
- Use qualitative methods. Heatmaps, session recordings, user surveys, and expert reviews can identify high-conviction improvements that do not require statistical validation. Implement these as direct changes rather than A/B tests.
- Extend test duration. A test that needs 40,000 visitors per variant (80,000 total) on a page with 2,000 weekly visitors takes 40 weeks to complete. This is long, but if the alternative is no data, patience wins.
How do you translate sample size into test duration for your brand?
Sample size is a number of visitors. Duration is the time it takes to accumulate that number. The conversion from one to the other depends entirely on your traffic volume, which means two brands testing the same hypothesis may need radically different test durations.
| Weekly page traffic | Sample needed per variant | Est. duration |
|---|---|---|
| 5,000 | ~78,000 | ~32 weeks |
| 10,000 | ~78,000 | ~16 weeks |
| 25,000 | ~78,000 | ~7 weeks |
| 50,000 | ~78,000 | ~4 weeks |
| 100,000 | ~78,000 | ~2 weeks |
Two constraints apply beyond the raw math. First, every test must run for at least two full weeks (14 days) to capture the full range of day-of-week variation. E-commerce traffic patterns differ substantially between weekdays and weekends, and a test that runs Monday through Friday misses the weekend behavior entirely.
Second, always end tests at the end of a complete weekly cycle. If your test starts on a Wednesday, end it on a Tuesday (or let it run to the following Tuesday). This ensures both variants received equal exposure to every day of the week.
Practical guidance by brand size
- High-traffic brands (500K+ monthly sessions): You can run 2-3 concurrent tests on different pages with 2-3 week durations. Prioritize by expected revenue impact.
- Medium-traffic brands (100K-500K monthly sessions): Run one test at a time on your highest-traffic pages. Expect 3-6 week durations. Queue tests rather than running them in parallel on the same page.
- Lower-traffic brands (50K-100K monthly sessions): Focus on high-impact pages only (homepage, top product pages). Expect 6-12 week durations. Supplement with qualitative research between tests.
- Low-traffic brands (under 50K monthly sessions): A/B testing is possible but slow. Focus on large MDE tests (expecting 20%+ relative lift) or shift to qualitative optimization methods entirely.
What tools and methods help ensure sample size discipline?
Knowing the math is necessary but not sufficient. The real challenge is organizational: ensuring that the people with the authority to stop tests (product managers, marketing leads, executives) understand and respect the pre-calculated duration.
Before the test: lock the parameters
- Calculate the required sample size using a standard calculator (Evan Miller's calculator, or the one built into your testing tool). Document the inputs: baseline CR, MDE, confidence level, and power.
- Calculate the expected duration by dividing the required sample by weekly traffic. Add a buffer of 20% for traffic fluctuations.
- Document the end date before the test launches. Write it down. Share it with stakeholders. Treat it as a contract.
- Define the primary metric and success criteria before launch. 'We will declare a winner if the primary metric shows a statistically significant improvement at 95% confidence after the full duration.' No other conditions.
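One way to make the lockdown concrete is to record the parameters in an immutable structure at launch, with the end date derived rather than hand-picked. A sketch with illustrative field names, using the 20% traffic buffer and two-week minimum from the checklist above:

```python
import math
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)  # frozen: the parameters are a contract, not suggestions
class TestPlan:
    baseline_cr: float
    mde_relative: float
    alpha: float              # e.g. 0.05 for 95% confidence
    power: float              # e.g. 0.80
    sample_per_variant: int
    weekly_traffic: int
    start_date: date

    @property
    def end_date(self) -> date:
        # two variants, plus the 20% traffic buffer, rounded to whole weeks
        total = self.sample_per_variant * 2 * 1.20
        weeks = max(2, math.ceil(total / self.weekly_traffic))
        return self.start_date + timedelta(weeks=weeks)

plan = TestPlan(baseline_cr=0.02, mde_relative=0.10, alpha=0.05, power=0.80,
                sample_per_variant=78_000, weekly_traffic=25_000,
                start_date=date(2024, 3, 6))
print(plan.end_date)
```

Because the duration is rounded to whole weeks, the computed end date always falls on the same weekday as the start date, which automatically satisfies the complete-weekly-cycle rule.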
During the test: monitor health, not results
Check the test daily for technical issues: broken variants, tracking errors, extreme negative effects that suggest a bug. Do not check it for performance. If you must look at interim results, use them only to verify that the test is collecting data correctly, not to make stop/continue decisions.
After the test: analyze with discipline
- Check statistical significance at the pre-defined confidence level.
- Verify the result holds across segments (device type, traffic source, new vs. returning).
- If the result is close to the significance threshold, extend the test rather than rounding a near-miss up to a win.
- Document the result regardless of outcome. Losing tests and inconclusive tests are valuable data.
- If the test was inconclusive, explicitly state why (insufficient traffic, effect smaller than MDE, external confound) rather than interpreting noise as signal.
