Effective A/B Testing (Part 4)

Peeking, Pushing, And When To Pull The Plug

Written by Niels Christian Laursen

Knowing when to stop a test is just as important as knowing when to start one. We’ve gotten the timing wrong in two ways: stopping tests too early, and refusing to stop them at all. This post is about timing. How long should you run your test? When can you trust the result? And how do you stop yourself from acting too soon, or waiting too long?

If you’ve followed the first three parts of this series, you know how easy it is to waste time on tests that were never set up to deliver real insight. But even the best-designed test can come up empty if you get the ending wrong.

The peeking problem

You launch a new test. A few hours later, curiosity wins, and you check the results. One variant is already outperforming the other with 95 percent confidence. Looks promising, right?

We used to think so, too. But early significance is rarely real significance.

The problem isn’t peeking. The problem is reacting to incomplete data.

Most A/B testing tools calculate confidence levels continuously. That means you’ll see high confidence at some point, even in a completely random test. If you stop based on that early signal, you increase the risk of making decisions based on noise.

💡 Acting on early significance can inflate your false positive rate to more than 25 percent. That’s one in four tests giving you a false win.
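
Here’s a rough simulation of our own that makes the effect concrete (the traffic volume and the number of peeks are made-up, and the exact inflation depends on how often you look). Both variants share the same true conversion rate, so every “winner” it finds is pure noise:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value of a two-proportion z-test."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (successes_b / n_b - successes_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

baseline = 0.05          # true conversion rate for BOTH variants (no real difference)
n_per_variant = 10_000   # planned sample size per variant
peeks = 20               # number of interim checks before the planned end
simulations = 2_000

peeking_false_wins = 0
fixed_false_wins = 0
for _ in range(simulations):
    a = rng.random(n_per_variant) < baseline
    b = rng.random(n_per_variant) < baseline
    # Peeking: stop the first time any interim check shows p < 0.05
    checkpoints = np.linspace(n_per_variant / peeks, n_per_variant, peeks, dtype=int)
    peeking_false_wins += any(
        p_value(a[:n].sum(), n, b[:n].sum(), n) < 0.05 for n in checkpoints
    )
    # Discipline: evaluate exactly once, at the planned sample size
    fixed_false_wins += p_value(a.sum(), n_per_variant, b.sum(), n_per_variant) < 0.05

print(f"False winners with peeking:        {peeking_false_wins / simulations:.1%}")
print(f"False winners at planned endpoint: {fixed_false_wins / simulations:.1%}")
```

In runs like this, the peeking strategy flags a false winner several times more often than a single evaluation at the planned sample size.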

ℹ️ What to do instead: Before you launch your test, calculate the required sample size based on your expected impact. Stick to it. Do not evaluate results until you reach that number.
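
If you’re curious what that calculation looks like, here’s a sketch using the standard two-proportion formula. The baseline and lift below are made-up numbers; plug in your own, or let your testing tool do the math for you:

```python
from scipy.stats import norm

def required_sample_size(baseline, relative_lift, alpha=0.05, power=0.80):
    """Sessions per variant for a two-sided, two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)      # smallest lift worth detecting
    z_alpha = norm.ppf(1 - alpha / 2)        # 1.96 for 95% confidence
    z_power = norm.ppf(power)                # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2) + 1

# Hypothetical numbers: 5% baseline conversion, 10% relative lift worth acting on
print(required_sample_size(baseline=0.05, relative_lift=0.10))  # roughly 31,000 per variant
```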

The “one more week” trap

After learning not to stop early, we made the opposite mistake. When a test didn’t reach significance, we just kept it running.

“If we give it one more week, maybe it will reach the threshold.”

We told ourselves it was responsible. In reality, it was another form of bias.

We were giving insignificant results more time, but acting quickly on anything that looked like a win. This double standard increased the chance of random effects turning into decisions.

If you only wait longer for the results you don’t like, you are letting your hopes decide what gets implemented.

💡 Significant and insignificant results should be treated the same. Set clear thresholds and follow them, no matter what the numbers say along the way.

ℹ️ Our fix: We now define both sample size and test duration up front. If a test ends without reaching significance, we log the result, review it, and decide what to do next. But we do not keep pushing.
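
As a small sketch of that fix (the function and the end-of-test numbers are hypothetical), the test is evaluated exactly once, at the planned sample size, and the verdict gets logged whether it flatters us or not:

```python
from scipy.stats import norm

def evaluate_once(conversions_a, n_a, conversions_b, n_b, alpha=0.05):
    """Single evaluation at the planned sample size; the verdict is logged either way."""
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conversions_b / n_b - conversions_a / n_a) / se
    p = 2 * (1 - norm.cdf(abs(z)))
    if p < alpha:
        return f"significant (p = {p:.3f}): act on the result"
    return f"not significant (p = {p:.3f}): log it, review it, move on"

# Hypothetical end-of-test numbers: 520 vs 565 conversions out of 10,000 sessions each
print(evaluate_once(conversions_a=520, n_a=10_000, conversions_b=565, n_b=10_000))
```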

What we do instead: guardrails, not guesswork

We stopped letting the data tempt us into bad habits. Instead, we built a process that removes the guesswork.

Here’s what that looks like now:

  1. Every test starts with a sample size calculation. We use our baseline conversion rate and expected minimum effect to estimate how many sessions we’ll need.

  2. We commit to a test duration before launching. This helps ensure we aren’t reacting emotionally to interim results (a quick sketch of how we size that duration follows this list).

  3. If we reach the threshold and the result is still insignificant, we stop. We don’t extend the test just to chase a number.
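
For step 2, the math is back-of-the-envelope (the traffic figure below is an assumption; use your own analytics):

```python
import math

required_per_variant = 31_000   # from the sample size calculation above (hypothetical)
daily_sessions = 4_000          # eligible sessions per day across all variants (assumption)
variants = 2

days = math.ceil(required_per_variant * variants / daily_sessions)
weeks = math.ceil(days / 7)     # round up to whole weeks to balance weekday/weekend traffic

print(f"Commit to at least {days} days (~{weeks} weeks), then evaluate once.")
```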

💡 It’s easier to trust your results when you’re not constantly moving the finish line.

ℹ️ Umbraco Engage includes built-in support for test guardrails. You can define sample sizes, set durations, and avoid early stopping errors with clear test rules and thresholds.

Summary: Stopping too early or running too long will cost you

If your A/B test ends before it reaches enough users, the result is unreliable. If you keep extending it past the planned threshold in the hope of significance, you give noise more chances to look like a win.

💡 Discipline beats intuition. A clear plan will outperform gut feeling, especially when your gut wants the test to win.

Need help planning smarter tests?

If you liked this series, you’ll love CRObot, our AI assistant built to help you test with more confidence and less guesswork.

It can help you:

  • Estimate sample sizes
  • Define meaningful thresholds
  • Avoid common timing mistakes
  • Prioritize test ideas based on impact and effort
  • Draft content variations

💡 We introduced CRObot in Part 2 as a way to take the math and planning off your plate. It’s available 24/7 and doesn’t get emotionally attached to Variant B.

ℹ️ PS: We’re also working on a full testing framework you can download, complete with templates and calculators.

Try CRObot now

From learning to doing: What’s next?

That’s it. Four posts, dozens of mistakes, and one big takeaway:

💡 A/B testing works best when you plan with purpose, test with discipline, and act on real insight.

Over the course of this series, we’ve covered a lot of ground: from setting up tests that can actually deliver insight to knowing when to stop them.

If you’ve made it this far, you’re not just testing for the sake of it. You’re optimizing with intent. That’s where the real gains happen.

Umbraco Engage: Built to help you test smarter

If you’re ready to put these learnings into action, Umbraco Engage is the perfect tool for the job.

With built-in guardrails for test duration and sample size, content variation tools, in-depth analytics, and full personalization support, Engage gives you everything you need for conversion optimization and for running faster, cleaner, and more impactful experiments.

👉 Take the Engage product tour
👉 Book a discovery call to learn how Umbraco can help you succeed

Happy testing!