Did you know that when you do an A/B test there is a point when it becomes “reliable enough”? And did you know that stopping the test before it gets to that point might give you misleading results?

The “reliability” of the results of an A/B test (or of any scientific experiment) is called statistical significance.

If you’re a mathematician, you might hate the way I am going to explain this, but if you’re not a mathematician it should help. :)

****

Think of it like this: if you had to predict the behaviour of 100 million people based on the first person that does the A/B test, would that be reliable?

Probably not. They could be anybody!

When the second person does it, the odds that those two people represent the majority are still low, but it’s a bit better than just one person.

After 10 million people have done the A/B test it is probably getting pretty reliable.
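
If you want to see that idea in action, here is a rough sketch in Python. It is a toy simulation, not how any particular analytics tool works. It assumes a made-up "true" conversion rate of 10%, runs the same pretend test over and over with different numbers of visitors, and measures how much the result wobbles each time. The function name and all the numbers are invented for illustration.

```python
import random
import statistics


def measured_rate_spread(true_rate, visitors, repeats=200, seed=0):
    """Repeat a pretend test many times; return how much the measured rate wobbles."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    measured = []
    for _ in range(repeats):
        # Each visitor converts with probability true_rate (an assumed value).
        conversions = sum(rng.random() < true_rate for _ in range(visitors))
        measured.append(conversions / visitors)
    # Standard deviation of the measured rates across all the pretend tests.
    return statistics.pstdev(measured)


for visitors in (10, 100, 1000):
    spread = measured_rate_spread(true_rate=0.10, visitors=visitors)
    print(f"{visitors:>5} visitors: measured rate varies by about {spread:.3f}")
```

The more visitors in each pretend test, the smaller the wobble, which is the "more people, more reliable" idea in numbers.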

Lots of things can affect the people you’re testing. Maybe they are having a bad day, maybe someone asks them a question while they are on your site, maybe they have used your app a hundred times before, maybe they aren’t in your target group (or maybe they are!).

Who knows?!

Statistical significance isn’t true or false. It’s a probability. So the A/B test is 20% reliable, or 60%, or 99%, or anything in between.
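
For the curious, here is a minimal sketch of one common textbook way to put a number on it: a two-proportion z-test. I am assuming this method for illustration; your analytics tool may well calculate things differently. It returns a p-value, which is roughly the chance of seeing a difference this big if the two versions were really the same. Tools often show something like 1 minus that number as a percentage, so a p-value of 0.05 lines up with "95%". The function name and the visitor counts are made up.

```python
import math


def two_proportion_p_value(conv_a, visitors_a, conv_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Pooled rate: what we'd expect if A and B were really the same.
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 1.0  # nobody (or everybody) converted: nothing to compare
    z = (rate_b - rate_a) / std_err
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Same 10% vs 12% conversion rates, with ten times more visitors the second time.
small = two_proportion_p_value(100, 1_000, 120, 1_000)
large = two_proportion_p_value(1_000, 10_000, 1_200, 10_000)
print(f"1,000 visitors per version:  p = {small:.3f}")
print(f"10,000 visitors per version: p = {large:.6f}")
```

The difference between the two versions is the same in both cases, but only the larger test gets below 0.05. Same result, more people, more reliable.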

Go for the big numbers (95+%). It takes longer. It requires more users. And it’s worth it.

Then you know (probably).
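
Here is a rough sketch of why stopping early can mislead you. It simulates a bunch of tests where version A and version B are secretly identical, so any "winner" is a false alarm. It then compares two habits: peeking every 100 visitors and stopping the first time the result looks significant, versus waiting until the planned end. Everything here (the function names, the 10% conversion rate, the number of visitors, the z-test itself) is an assumption for illustration, not a description of any real tool.

```python
import math
import random


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def false_alarm_rates(experiments=200, visitors=2000, peek_every=100,
                      rate=0.10, seed=1):
    """Simulate tests with no real difference and count false 'winners'."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    stopped_early = 0  # looked significant at some peek along the way
    waited = 0         # looked significant at the planned end
    for _ in range(experiments):
        conv_a = conv_b = 0
        peeked_significant = False
        for n in range(1, visitors + 1):
            # Both versions convert at the same rate: there is no real winner.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0 and p_value(conv_a, n, conv_b, n) < 0.05:
                peeked_significant = True
        stopped_early += peeked_significant
        waited += p_value(conv_a, visitors, conv_b, visitors) < 0.05
    return stopped_early / experiments, waited / experiments


early, final = false_alarm_rates()
print(f"False alarms if you stop at the first 'significant' peek: {early:.0%}")
print(f"False alarms if you wait until the end: {final:.0%}")
```

Peeking gives the test many chances to look significant by luck, so the first number should come out noticeably higher than the second. That is the cost of stopping too soon.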

The analytics tool usually does all the math for you, so don’t panic… but it is good if you understand the idea. I don’t calculate my own statistical significance, and you probably won’t either, but it helps to know how to think about it.

As they say in the article, just because an A/B test result is reliable doesn’t mean it is important. That’s a whole other conversation.

But it can’t be important if it isn’t reliable.