Why does sample size matter in A/B testing?
Imagine flipping a fair coin ten times and getting seven heads. You might conclude the coin is biased. It is not -- you just did not flip it enough times. With 10,000 flips, the ratio would converge toward 50/50. A/B testing works on exactly the same principle. With small samples, random variation dominates. With large samples, the real effect emerges.
In e-commerce, the stakes of getting this wrong are measured in revenue. If you stop a test after three days because variant B shows a 15% lift, you might deploy a change that was riding a weekend traffic spike or a coincidental influx of high-intent users. Two weeks later, the 'winning' variant is underperforming the original -- but you have already rolled it out to 100% of traffic.
The cost of a false positive is not just the development time spent implementing the variant. It is the ongoing revenue impact of a change that degrades performance, compounded by the fact that you believe it is helping. You might not revisit that element for months, trusting a result that was never real.
How do you calculate the required sample size before a test?
The calculation is straightforward once you understand the three inputs. Each one represents a decision you make before the test begins -- and each one has practical implications for how long the test needs to run.
Input 1: Baseline conversion rate
This is the current conversion rate of the page or flow you are testing. Pull it from your analytics for the same page, same traffic source, and same time period that the test will run. A 2% baseline requires more samples than a 10% baseline to detect the same relative improvement, because there is more noise relative to the signal.
Input 2: Minimum detectable effect (MDE)
MDE is the smallest improvement you care about detecting. If your baseline is 2% and you set MDE at 10% relative (a lift from 2.0% to 2.2%), you need a larger sample than if you set MDE at 20% relative (a lift from 2.0% to 2.4%). Smaller effects require more data to distinguish from noise.
Input 3: Confidence level and statistical power
Confidence level (typically 95%) sets your Type I error threshold (alpha). In practice, with alpha = 0.05, you tolerate a 5% chance of flagging a difference when none exists under the null hypothesis. Statistical power (typically 80%) is the probability of detecting a real effect at your chosen MDE. Together, they define your false-positive and false-negative risk.
| Baseline CR | MDE: 5% relative | MDE: 10% relative | MDE: 20% relative |
|---|---|---|---|
| 1% | ~630,000 | ~160,000 | ~40,000 |
| 2% | ~310,000 | ~78,000 | ~20,000 |
| 3% | ~200,000 | ~51,000 | ~13,000 |
| 5% | ~120,000 | ~31,000 | ~8,000 |
| 10% | ~57,000 | ~15,000 | ~4,000 |
These numbers are per variant. For a standard A/B test with one control and one variant, double the number to get the total required traffic. For an A/B/C test (one control and two variants), triple it.
The practical implication is clear: if your page receives 5,000 visitors per week and you need 80,000 per variant, the test needs to run for 32 weeks -- over seven months. At that point, you have three choices: increase traffic to the page, increase MDE (accept that you can only detect larger effects), or test on a higher-traffic page instead.
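The arithmetic behind the table can be sketched in a few lines of Python. This is the standard normal-approximation formula for a two-sided two-proportion test, not necessarily the exact formula any particular calculator uses, so the outputs land close to but not exactly on the table values; the function name and defaults are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cr, mde_relative, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_relative)          # conversion rate at the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# A 2% baseline with a 10% relative MDE needs roughly 80,000 visitors per variant
print(sample_size_per_variant(0.02, 0.10))
```

Note how the denominator is the squared absolute difference between the two rates: halving the MDE quadruples the required sample, which is why the table's columns fall off so steeply.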
Why is stopping a test early so dangerous?
This is the most common and most expensive mistake in A/B testing. The scenario is always the same: you launch a test on Monday. By Wednesday, the testing tool shows variant B with a 12% lift and 96% statistical significance. The green light is on. The temptation to call it a winner and move to the next test is overwhelming.
Do not do it. That 96% significance number is misleading because it does not account for the multiple times you have checked the result. Every time you peek at the data and decide whether to stop, you are effectively running an additional statistical test. This is called the peeking problem or the multiple comparison problem, and it inflates your actual false positive rate far beyond the stated 5%.
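The inflation is easy to demonstrate with a simulation. The sketch below runs many A/A tests -- both arms identical, so every "significant" result is by construction a false positive -- and peeks once per day with a fixed 1.96 threshold. All parameters (daily traffic, conversion rate, trial count) are illustrative, and daily conversion counts use a normal approximation to the binomial to keep the run fast.

```python
import math
import random

def peeking_false_positive_rate(n_trials=2000, peeks=14, daily_n=1000,
                                p=0.05, z_crit=1.96, seed=42):
    """Simulate A/A tests (no real difference) with one peek per day.

    Returns the fraction of trials that were (wrongly) stopped as significant
    at any peek. Daily conversions are drawn from a normal approximation to
    the binomial, which is accurate here since daily_n * p is large.
    """
    rng = random.Random(seed)
    sd = math.sqrt(daily_n * p * (1 - p))
    false_positives = 0
    for _ in range(n_trials):
        conv_a = conv_b = 0.0
        n = 0
        for _ in range(peeks):
            conv_a += rng.gauss(daily_n * p, sd)
            conv_b += rng.gauss(daily_n * p, sd)
            n += daily_n
            # pooled two-proportion z-test on the cumulative data so far
            pooled = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(2 * pooled * (1 - pooled) / n)
            if abs(conv_a / n - conv_b / n) / se > z_crit:
                false_positives += 1
                break
    return false_positives / n_trials

print(peeking_false_positive_rate())          # well above the nominal 5%
print(peeking_false_positive_rate(peeks=1))   # close to the nominal 5%
```

Each peek gets its own chance to cross the 1.96 threshold, so the cumulative false positive rate grows with the number of looks even though every individual test is run "at 95% confidence".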
The SNOCKS search bar test is instructive because of its duration: two full months. For a team eager to ship changes and move to the next test, two months feels like an eternity. But the traffic volume and baseline conversion rate demanded it. The test required that duration to separate signal from noise.
The peeking penalty in numbers
| Peeking frequency | Nominal alpha | Actual false positive rate |
|---|---|---|
| Once (at end only) | 5% | 5% |
| Daily for 2 weeks | 5% | ~15% |
| Daily for 4 weeks | 5% | ~25% |
| Hourly for 2 weeks | 5% | ~30% |
If you must monitor results during the test (and operationally, you should), use sequential testing methods that adjust the significance threshold for peeking. Tools like AB Tasty and VWO offer always-valid p-values or Bayesian approaches that account for interim analysis. But even with these adjustments, the test still needs the full calculated sample size to reach reliable conclusions.
What happens when traffic is too low for meaningful A/B testing?
Not every page has enough traffic for A/B testing. And pretending otherwise -- running tests on low-traffic pages and declaring winners based on insufficient data -- is worse than not testing at all, because it creates false confidence in changes that may be harmful.
The green vs. red discount label test illustrates this perfectly. We tested whether displaying discounts in green (associated with savings) versus red (associated with urgency) affected conversion. The measured difference was 0.18%. At the traffic levels available, this difference was nowhere near statistically significant. It could easily be noise.
The honest conclusion is not that color does not matter. It is that this test could not tell us whether color matters, because we did not have enough traffic to detect a small effect. That is a fundamentally different statement, and the distinction is critical for maintaining intellectual honesty in a testing program.
What to do when traffic is insufficient
- Accept larger MDE. If you can only detect 20% relative lifts, test only changes that you expect to produce at least 20% improvement. Small refinements cannot be validated at low traffic volumes.
- Test at a higher funnel stage. If the product page has 5,000 monthly visitors, the homepage or collection page may have 50,000. Test there instead, where you can reach significance.
- Aggregate across pages. If you are testing a design pattern (not a specific piece of content), apply it across multiple product pages simultaneously to increase the effective sample size.
- Use qualitative methods. Heatmaps, session recordings, user surveys, and expert reviews can identify high-conviction improvements that do not require statistical validation. Implement these as direct changes rather than A/B tests.
- Extend test duration. A test that needs 40,000 visitors per variant (80,000 total) on a page with 2,000 weekly visitors takes 40 weeks to complete. This is long, but if the alternative is no data, patience wins.
How do you translate sample size into test duration for your brand?
Sample size is a number of visitors. Duration is the time it takes to accumulate that number. The conversion from one to the other depends entirely on your traffic volume, which means two brands testing the same hypothesis may need radically different test durations.
| Weekly page traffic | Sample needed per variant | Est. duration |
|---|---|---|
| 5,000 | ~78,000 | ~32 weeks |
| 10,000 | ~78,000 | ~16 weeks |
| 25,000 | ~78,000 | ~7 weeks |
| 50,000 | ~78,000 | ~4 weeks |
| 100,000 | ~78,000 | ~2 weeks |
Two constraints apply beyond the raw math. First, every test must run for at least two full weeks (14 days) to capture the full range of day-of-week variation. E-commerce traffic patterns differ substantially between weekdays and weekends, and a test that runs Monday through Friday misses the weekend behavior entirely.
Second, always end tests at the end of a complete weekly cycle. If your test starts on a Wednesday, end it on a Tuesday (or let it run to the following Tuesday). This ensures both variants received equal exposure to every day of the week.
Practical guidance by brand size
- High-traffic brands (500K+ monthly sessions): You can run 2-3 concurrent tests on different pages with 2-3 week durations. Prioritize by expected revenue impact.
- Medium-traffic brands (100K-500K monthly sessions): Run one test at a time on your highest-traffic pages. Expect 3-6 week durations. Queue tests rather than running them in parallel on the same page.
- Lower-traffic brands (50K-100K monthly sessions): Focus on high-impact pages only (homepage, top product pages). Expect 6-12 week durations. Supplement with qualitative research between tests.
- Low-traffic brands (under 50K monthly sessions): A/B testing is possible but slow. Focus on large MDE tests (expecting 20%+ relative lift) or shift to qualitative optimization methods entirely.
What tools and methods help ensure sample size discipline?
Knowing the math is necessary but not sufficient. The real challenge is organizational: ensuring that the people with the authority to stop tests (product managers, marketing leads, executives) understand and respect the pre-calculated duration.
Before the test: lock the parameters
- Calculate the required sample size using a standard calculator (Evan Miller's calculator, or the one built into your testing tool). Document the inputs: baseline CR, MDE, confidence level, and power.
- Calculate the expected duration by dividing the required sample by weekly traffic. Add a buffer of 20% for traffic fluctuations.
- Document the end date before the test launches. Write it down. Share it with stakeholders. Treat it as a contract.
- Define the primary metric and success criteria before launch. 'We will declare a winner if the primary metric shows a statistically significant improvement at 95% confidence after the full duration.' No other conditions.
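One way to make the lockdown concrete is to record the parameters in an immutable structure at launch, with the end date derived rather than hand-picked. A sketch with illustrative field names, using the 20% traffic buffer and two-week minimum from the checklist above:

```python
import math
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)  # frozen: the parameters are a contract, not suggestions
class TestPlan:
    baseline_cr: float
    mde_relative: float
    alpha: float              # e.g. 0.05 for 95% confidence
    power: float              # e.g. 0.80
    sample_per_variant: int
    weekly_traffic: int
    start_date: date

    @property
    def end_date(self) -> date:
        # two variants, plus the 20% traffic buffer, rounded to whole weeks
        total = self.sample_per_variant * 2 * 1.20
        weeks = max(2, math.ceil(total / self.weekly_traffic))
        return self.start_date + timedelta(weeks=weeks)

plan = TestPlan(baseline_cr=0.02, mde_relative=0.10, alpha=0.05, power=0.80,
                sample_per_variant=78_000, weekly_traffic=25_000,
                start_date=date(2024, 3, 6))
print(plan.end_date)
```

Because the duration is rounded to whole weeks, the computed end date always falls on the same weekday as the start date, which automatically satisfies the complete-weekly-cycle rule.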
During the test: monitor health, not results
Check the test daily for technical issues: broken variants, tracking errors, extreme negative effects that suggest a bug. Do not check it for performance. If you must look at interim results, use them only to verify that the test is collecting data correctly, not to make stop/continue decisions.
After the test: analyze with discipline
- Check statistical significance at the pre-defined confidence level.
- Verify the result holds across segments (device type, traffic source, new vs. returning).
- If the result is close to the significance threshold, extend the test rather than rounding a near-miss up to a win.
- Document the result regardless of outcome. Losing tests and inconclusive tests are valuable data.
- If the test was inconclusive, explicitly state why (insufficient traffic, effect smaller than MDE, external confound) rather than interpreting noise as signal.
