Notes on A/B testing and its statistical implications
Websites need to refresh their designs from time to time to stay current. A/B testing, also called split testing, is a method that shows your users two (or more) variants of your website during the same timeframe and then determines which variant is the most effective. It is a widely used form of experiment in web marketing, mainly to improve a website's performance, e.g., to increase the conversion rate of e-commerce sites.
Although all of this seems clear, the problem starts here: how do we decide what to change and, going forward, what is the most effective way to implement the change? Lean back and think about your questions and goals again. Typical questions are: What holds the users back? Which areas are in dire need of optimization? Looking at other success stories can help, but a change that worked in one place will not necessarily work in another. Admittedly, following a trial-and-error method can lead to a significant loss of time and money, and it could result in useless decisions in the end.
Work with metrics
We now move on to deciding on the experiment. Let’s say that we want the users to click on a “click me” button. How can we improve the design to get more clicks? Let’s assume we will get more clicks when we move the “click me” button from the end of the left-aligned text to the middle of the left-aligned text^{2}. When you decide to optimize the user experience on your website, you can analyze many metrics such as click-through rate, bounce rate, average order value, and so on. The statistics behind these metrics, e.g., their distributions and estimated variances, can differ slightly or completely from one metric to another.
A indicates the "control" variant, and B indicates the "experiment" variant
Such a change is usually evaluated with a common metric, the click-through rate (CTR), which is the number of unique visitors who clicked the “click me” button divided by the total number of unique visitors who landed on the page. I use the term unique here because we want individual clicks and page views, not duplicates.
We turn the rate into a probability to measure the impact of that button. A click-through probability (CTP) is the CTR multiplied by 100, i.e., expressed as a percentage. For instance, you have a CTP of 40% when you have 40 unique clicks from 100 unique visitors landing on your page. A simple formula for CTP is:
$$\text{CTP} = \frac{\text{unique visitors who clicked}}{\text{total unique visitors}} \times 100$$
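As a minimal sketch in R, the example from the text (40 unique clicks from 100 unique visitors) can be computed as follows. The function name `click_through_probability` is hypothetical and only used for illustration.

```r
# Hypothetical helper: click-through probability (as a percentage)
# from counts of unique clicks and unique visitors.
click_through_probability <- function(unique_clicks, unique_visitors) {
  stopifnot(unique_visitors > 0,
            unique_clicks >= 0,
            unique_clicks <= unique_visitors)
  ctr <- unique_clicks / unique_visitors  # click-through rate, between 0 and 1
  ctr * 100                               # expressed as a percentage
}

ctp <- click_through_probability(40, 100)
print(ctp)
```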
The definition of a metric can change the result of the entire experiment. How do we know that the users clicked the button more? How do we count the ‘clicks’? a) One click every ten minutes, or one per page refresh? b) One click per user, or per IP address? How do we track the user? You have to decide on the unit of diversion carefully; it can be, e.g., a unique login or a web browser cookie.
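To illustrate how much the counting rule matters, here is a small R sketch with a made-up click log. The column names `user_id` and `cookie_id` are assumptions for this example, not a required schema.

```r
# Hypothetical click log: one row per click event.
clicks <- data.frame(
  user_id   = c("u1", "u1", "u2", "u3", "u3", "u3"),
  cookie_id = c("c1", "c2", "c3", "c4", "c4", "c5"),
  stringsAsFactors = FALSE
)

raw_clicks       <- nrow(clicks)                      # every click event counts
clicks_by_user   <- length(unique(clicks$user_id))    # one click per login
clicks_by_cookie <- length(unique(clicks$cookie_id))  # one click per browser cookie

print(c(raw = raw_clicks, by_user = clicks_by_user, by_cookie = clicks_by_cookie))
```

The same log yields three different click counts, so the unit of diversion has to be fixed before the experiment starts.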
Define a research question
In A/B testing, you can change many things in your control variant A and present the changes as the experiment variant B: for example, the size of the header, the text color, the symbols on your buttons, the general layout, and so on. Always run a test, no matter how excellent you believe your idea or design to be. Reality can be totally different from your ideal “good.” Otherwise, you are taking the risk of users leaving your site because of new changes that they might not enjoy.
Keep in mind that testing different variables at the same time, a.k.a. multivariate testing, can be complex and may lead to biased results if not set up properly. A site also needs a relatively high level of traffic in order to produce statistically significant results. For most sites, independent A/B tests with different variants of a single variable are the way to go. As you cannot test everything at once, you must learn how to prioritize. Take the most important questions first and connect them. If you do not know where to start prioritizing, reconsider your goals. The most fundamental aspect of testing is to ask questions. A question you ask will yield many more questions, and an answer will do the same^{1}.
A/B testing is not limited to a single experimental variant. There can be several variants tested against the control, such as A vs. B, A vs. C, and A vs. D. If you have a small-scale website, testing one design at a time is preferable. Unless you have a site with lots of traffic, running more than four experiments at a time is not recommended because each of them will lack data. Likewise, trying many different designs in a single experiment is not advised, as it could lead to biased results.
Knowing your data is essential before starting the analysis. Be sure of what you track on your website and in what format it is stored in your database. Once you know your data, think about your goals, which should be covered by your measurement plan. A primary goal can be based on website or blog objectives, or on KPIs (key performance indicators) for the business. Then you can come to the first step of the experiment, which is defining your key metric(s) depending on your goal and questions.
Identifying hypotheses
Additional sources such as surveys, focus groups, interviews, and user experience experiments are invaluable when identifying hypotheses that match your goals. Stating a hypothesis makes your test less vague; that is to say, it makes the test specific and focused on what you are actually examining. For example, it makes clear whether we are testing the color of the button, the layout of the button, or the text on the button, instead of just saying that we are experimenting with the button.
According to our goal, the hypothesis becomes “Changing the position of the ‘click me’ button from the bottom to the middle will change the click-through probability of the button.” In mathematical notation, the null hypothesis $H_0$ states that there is no difference, and the alternative hypothesis $H_1$ states that there is a difference. More precisely, our null hypothesis is that the click-through probability of the control variant minus that of the experiment variant is equal to zero, $H_0: p_{cont} - p_{exp} = 0$. The alternative hypothesis is that the experiment variant differs from the control variant, $H_1: p_{cont} - p_{exp} \neq 0$.
Figure: Distribution in the control and experiment variants
Test statistics
Because we always work with probabilities, the important issue here is how we can be statistically confident about the results. In experimental design, we compute a test statistic that helps us quantify the chance that the observed difference occurred due to random variation.
Statistical significance describes how unlikely it is that the observed result occurred due to random variation alone. The p-value, also known as the calculated probability, is the probability of observing a result at least as extreme as the one we obtained, assuming that the null hypothesis is true. The most common confidence levels are 90%, 95% and 99% (the corresponding significance levels are .1, .05, and .01 respectively). At the 95% confidence level, we accept a 5% chance of rejecting the null hypothesis when it is actually true. In other words, if the p-value is below .05, you can reject the null with 95% confidence. (Well, error is always considered in statistics.)
While doing this, it is best to check variability, as it can affect many things. If your data lacks consistency (high variance), or the sample size is not big enough, you will end up with a wide confidence interval.
A test statistic is computed to characterize the difference between the null and alternative hypotheses; the z-test is commonly used in A/B testing, especially when the sample size is greater than 30 and the population standard deviation is known. We are going to use the binomial distribution to assess statistical significance for our hypothesis, since each visit is a binary outcome such as 1 (clicked) and 0 (not clicked). With a large enough sample, the binomial distribution is well approximated by a normal distribution, so the z-test can be used to determine statistical significance from the mean (M) and the population standard deviation (SD).
We need a two-tailed test to consider the possibility of a change in either direction. In a two-tailed test, we look at both sides of the normal curve to determine the critical region (also known as the decision rule). For a 95% confidence level, the two-tailed critical z-score is 1.96. Other values are z = 2.576 for a 99% level of confidence, and z = 1.645 for a 90% level of confidence.
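These critical values can be reproduced in R with `qnorm()`, the quantile function of the standard normal distribution. This is only a quick check of the numbers quoted above.

```r
# Two-tailed critical z-values for common confidence levels.
confidence <- c(0.90, 0.95, 0.99)
alpha      <- 1 - confidence

# In a two-tailed test, half of alpha goes into each tail.
z_critical <- qnorm(1 - alpha / 2)
print(round(z_critical, 3))
```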
Then we can reject the null ($H_0$) with 95% confidence if the z-score falls outside the range from -1.96 to 1.96, and conclude that our result is statistically significant. Keep in mind that statistically significant results are not necessarily practically significant. We support the alternative hypothesis because a result this extreme would be very unlikely if the null hypothesis were true. Nevertheless, it is worth checking practical significance, i.e., whether the detected difference is large enough to matter.
The pooled proportion is the total number of clicks in both variants divided by the total number of visitors in both variants. Besides that, if we repeated the experiment, how much would we expect the estimated probability to vary? The standard error of the binomial distribution answers this. It is important that the interval we construct around the sample mean covers the population value in most repetitions of the experiment.
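To make the standard error concrete, the following R sketch builds a normal-approximation confidence interval for a single proportion, using the binomial standard error sqrt(p(1 - p) / n). The helper name `proportion_ci` and the counts are made up for illustration.

```r
# Normal-approximation confidence interval for a single proportion.
# Hypothetical helper, for illustration only.
proportion_ci <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n                              # estimated probability
  se    <- sqrt(p_hat * (1 - p_hat) / n)      # binomial standard error
  z     <- qnorm(1 - (1 - conf_level) / 2)    # two-tailed critical value
  c(estimate = p_hat, se = se,
    lower = p_hat - z * se, upper = p_hat + z * se)
}

small <- proportion_ci(40, 100)     # 40 clicks from 100 visitors
large <- proportion_ci(400, 1000)   # same rate, ten times the sample
print(round(rbind(small, large), 4))
```

Both samples have the same estimated probability, but the larger sample has a smaller standard error and therefore a narrower interval.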
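We use a hat ($\hat{p}$) because it is an estimated probability. And we work with the pooled results as we are comparing two samples:
$$\hat{p}_{pool} = \frac{x_{cont} + x_{exp}}{n_{cont} + n_{exp}}$$
where n = the number of unique visitors in a variant, and x = the number of those visitors who clicked.
When our metric is a probability, we use the z-test for two proportions. Recall the z-test formula for two population proportions:
$$z = \frac{\hat{d}}{SE_{pool}}$$
where the numerator is the estimated difference between the variants:
$$\hat{d} = \hat{p}_{cont} - \hat{p}_{exp}$$
and the denominator is the standard error, i.e., the overall variation of the metric:
$$SE_{pool} = \sqrt{\hat{p}_{pool}(1 - \hat{p}_{pool})\left(\frac{1}{n_{cont}} + \frac{1}{n_{exp}}\right)}$$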
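The z-test for two proportions can be sketched in a few lines of R. The sketch assumes the usual pooled form: the pooled proportion is all clicks divided by all visitors, and the standard error is sqrt(p_pool (1 - p_pool) (1/n_cont + 1/n_exp)). The function name `two_prop_z_test` and the counts are hypothetical.

```r
# Two-proportion z-test with a pooled standard error.
# Hypothetical helper and made-up counts, for illustration only.
two_prop_z_test <- function(x_cont, n_cont, x_exp, n_exp) {
  p_cont  <- x_cont / n_cont
  p_exp   <- x_exp / n_exp
  p_pool  <- (x_cont + x_exp) / (n_cont + n_exp)
  se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
  d_hat   <- p_cont - p_exp          # control minus experiment
  z       <- d_hat / se_pool
  p_value <- 2 * pnorm(-abs(z))      # two-tailed p-value
  list(d_hat = d_hat, se_pool = se_pool, z = z, p_value = p_value)
}

result <- two_prop_z_test(x_cont = 100, n_cont = 1000,
                          x_exp = 130, n_exp = 1000)
print(result)

# Cross-check against base R: without continuity correction,
# prop.test() reports a chi-squared statistic equal to z^2.
check <- prop.test(x = c(100, 130), n = c(1000, 1000), correct = FALSE)
print(check$p.value)
```

With these made-up counts, the absolute z-score is about 2.10, which lies beyond 1.96, so the null would be rejected at the 95% confidence level.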
Sample size, population and power
As you may already know, there are many important questions to take into consideration before we start the statistical analysis of the experiment. How can we be confident that there is a difference between groups? What is the size of this change? How many subjects or page views do we need for our experiment? The first and second questions are covered above. For the third question, about sample size, we need to know about power calculation.
Power calculation tells us the probability of detecting an effect with a given sample size under a given level of confidence, so we can also use it to find out what sample size we need to run our test. If the probability does not meet the requirements, we can change the experiment design. A larger sample will give you a smaller standard error.
The significance level, shown as alpha ($\alpha$), is the probability of a Type I error. In other words, it is the probability of finding an effect when the null hypothesis is actually true. In addition, you must not decide on or change the hypothesis after your analysis, as doing so increases the probability of a Type I error ($\alpha$), i.e., falsely rejecting the null and falsely accepting the alternative.
The Type II error rate, shown as beta ($\beta$), is the probability of failing to reject the null hypothesis when it is false, while power, known as $1 - \beta$, is the probability of correctly rejecting a false null hypothesis. Effect size, which can be denoted as $d$, is the normalized mean difference between the null and alternative hypotheses. How the effect size is computed depends on the statistical test being used. The smaller the change you want to detect, the more page views you need in the experiment.
Among these four notions of power calculation (significance level, power, effect size, and sample size), any given three determine the fourth. While the researcher can choose the significance level and sample size, power and effect size cannot be entirely controlled. In standard, non-strict A/B testing approaches, a 5% significance level and 80% power ($1 - \beta$) are used.
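The "three determine the fourth" idea can be tried out with `power.prop.test()` from base R's stats package. In this sketch the sample size, the two probabilities, and the significance level are given, so the function solves for power. The probabilities (10% vs. 11%) and the sample sizes are made-up values.

```r
# power.prop.test() solves for whichever of n, p1, p2, power, or
# sig.level is left as NULL. Here n is given, so power is computed.
small_n <- power.prop.test(n = 5000,  p1 = 0.10, p2 = 0.11, sig.level = 0.05)
large_n <- power.prop.test(n = 20000, p1 = 0.10, p2 = 0.11, sig.level = 0.05)

print(round(c(n_5000 = small_n$power, n_20000 = large_n$power), 3))
```

With these inputs, 5,000 visitors per variant falls well short of 80% power, while 20,000 per variant exceeds it.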
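We can calculate the pooled difference ($\hat{d}$) to estimate the difference in proportions between the control and experiment groups^{3}:
$$\hat{d} = \hat{p}_{cont} - \hat{p}_{exp}$$
where $\hat{p}_{cont}$ and $\hat{p}_{exp}$ are the estimated click-through probabilities of the control and experiment variants. The difference between the two groups under $H_0$ is 0, so $H_0: d = 0$ becomes,
$$\hat{d} \sim N(0, SE_{pool})$$
that is, $\hat{d}$ is approximately normally distributed with mean 0 and standard deviation $SE_{pool}$, the pooled standard error. We reject the null when
$$\hat{d} > 1.96 \cdot SE_{pool}$$
or,
$$\hat{d} < -1.96 \cdot SE_{pool}$$
In other words, the estimated difference under $H_1$ should be greater than 1.96 (the z-score for a 95% confidence level) times the pooled standard error, or less than the negative of it, for us to be able to reject the null.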
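The decision rule can also be checked directly on the scale of the difference rather than the z-score. The following R sketch uses made-up counts and assumes the pooled standard error sqrt(p_pool (1 - p_pool) (1/n_cont + 1/n_exp)); all variable names are hypothetical.

```r
# Made-up counts, for illustration only.
x_cont <- 100; n_cont <- 1000   # clicks and visitors, control
x_exp  <- 130; n_exp  <- 1000   # clicks and visitors, experiment

p_pool  <- (x_cont + x_exp) / (n_cont + n_exp)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
d_hat   <- x_cont / n_cont - x_exp / n_exp   # control minus experiment

margin      <- qnorm(0.975) * se_pool   # about 1.96 * pooled standard error
reject_null <- abs(d_hat) > margin      # TRUE if d_hat lies in the critical region

print(c(d_hat = d_hat, margin = margin))
print(reject_null)
```

Because the test is two-tailed, only the absolute size of the estimated difference matters for the decision, not its sign.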
If you are working in the R environment, you can use the power.prop.test() function to compute the power of the two-sample test for proportions, or to determine the parameters needed to obtain a target power (similarly, you can use the pwr.2p.test() function from the pwr package).
# Arguments:
# n: number of observations, common sample size in 'each variant'
# p1: probability in one variant
# p2: probability in other variant
# alternative: one- or two-sided test
# sig.level: significance level (Type I error probability)
# power: power of test (1 minus Type II error probability)
# Call reconstructed from the output below; n is left out, so it is solved for.
power.prop.test(p1 = 0.1, p2 = 0.11, sig.level = 0.05, power = 0.2, alternative = "two.sided")
##
## Two-sample comparison of proportions power calculation
##
## n = 2351.143
## p1 = 0.1
## p2 = 0.11
## sig.level = 0.05
## power = 0.2
## alternative = two.sided
##
## NOTE: n is number in *each* variant
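Note that the output above was computed with power = 0.2. As a sketch, here is the same calculation with the conventional 80% power mentioned earlier; the probabilities 0.10 and 0.11 are taken from the output above, and the variable names are only illustrative.

```r
# Sample size per variant needed to detect a change from 10% to 11%
# with a 5% significance level and 80% power (two-sided test).
res80 <- power.prop.test(p1 = 0.10, p2 = 0.11, power = 0.80,
                         sig.level = 0.05, alternative = "two.sided")

# Round up, since we cannot observe a fraction of a visitor.
n_per_variant <- ceiling(res80$n)
print(n_per_variant)
```

The required sample size is several times larger than with power = 0.2 (roughly 14,750 visitors per variant), which shows how strongly the power requirement drives the traffic an experiment needs.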
Final thoughts
A/B testing is not very different from the classic statistical hypothesis testing and experimental design that have been used in academia for many years. Understanding your data, metrics, and dimensions is very important before starting the experiment. For more in-depth analyses, you can use segmentation by grouping your users/customers according to common attributes.
If your A/B test failed, or did not reach a significant result, it may be because your hypotheses were not ready to be tested, or because your website has content your audience does not want. Not every experiment will return a positive outcome; you may well obtain negative results, which is the nature of experimentation. When you receive negative results, you should consider yourself “lucky” that the problem was found before you implemented the variation. Moreover, if you run many short A/B tests without reaching the required sample size, you are unlikely to produce meaningful results.
Some important things to consider:

Always run the variants simultaneously. Showing different variants to users over different periods of time will bias your results.

You should wait some time before you start analyzing the results. A typical experiment duration is 2 weeks. (Anyhow, it depends on the website traffic and experiment design.)

Underlying variability, such as differences in user systems, population, etc., can affect the sensitivity of your metrics. Not everybody in the world has the same Internet speed, and the reaction times of devices vary. An A/A test can be beneficial if you need to be sure everything else is equal. An A/A (or A vs. A) test checks that two identical variants of your website function well and ensures there are no statistically significant differences between them. Although some big businesses use it to be sure about the change, small businesses can ignore this variability, as the difference might not be remarkable.

Keep a log/record of the tests. It lets you learn lessons from your past, so that you can see what worked, what did not work, and why.
What is “inferential” about inferential statistics is that we use a sample to draw inferences about the whole population. We should never forget the notion of chance in the experiment. The difference we observe between the versions can be pure chance. As we make inferences from a sample, we cannot really know that any increase or decrease in the rates is due to our change in design.
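The role of chance can be made visible with a small simulation of A/A tests, where both variants share the same true click-through probability. The following R sketch uses made-up parameters (1,000 visitors per variant, a true probability of 10%, 2,000 simulated tests) and an arbitrary seed.

```r
set.seed(42)          # arbitrary seed, for reproducibility
n_sims     <- 2000    # number of simulated A/A tests
n_visitors <- 1000    # visitors per variant (made up)
true_ctp   <- 0.10    # same true probability in both variants

# For each simulated test, draw clicks for both variants and
# run a two-sided test of equal proportions.
p_values <- replicate(n_sims, {
  x_a <- rbinom(1, n_visitors, true_ctp)
  x_b <- rbinom(1, n_visitors, true_ctp)
  prop.test(c(x_a, x_b), c(n_visitors, n_visitors), correct = FALSE)$p.value
})

# Share of tests that look "significant" although nothing changed.
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

Even though the two variants are identical, a share of the simulated tests close to the 5% significance level comes out as significant, which is exactly the Type I error the significance level allows for.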
Practice makes perfect! You should keep testing no matter what you get from the results, simply because what works in one experiment may not work in another. You learn more after every test, which helps you understand your results and optimize more efficiently in the future. Keep testing and believing (or not believing) in your data.
Remarks & References
If you want to run your own experiment in a free and easy way, you can use Google Analytics (GA). Choose your metrics according to your goals and then test page elements such as headers, images, text, page layout, etc. FYI, GA uses a different algorithm, called the multi-armed bandit, for its experiments.
[1]: For sure, this is an epistemological cycle.
[2]: To track the clicks on your page, you need to do backend development on your webpage (using, e.g., PHP or JS), or use a tag management system such as Google Tag Manager (GTM). It’s free!
[3]: Also check Cohen’s d, which defines the normalized mean difference as: $d = \frac{M_1 - M_2}{SD_{pooled}}$.
Buckey, C., Diane, T., Grimes, C., A/B Testing by Google. MOOC by Udacity. Retrieved on Oct 25, 2017 from, https://www.udacity.com/course/abtestingud257
Farmer, J. (2009, 19 January) An Introduction to A/B Testing. 20 bits. Retrieved from, http://20bits.com/article/anintroductiontoabtesting
Kabacoff, R. I. (2015). R in Action: Data analysis and graphics with R. New York: Manning.
Lehmann, E. L., & Romano, J. P. (2006). Testing statistical hypotheses (3rd edition). New York: Springer. ISBN 0387988645
Siroker, D., Koomen, P., & Harshman, C. (Eds.). (2013). A/B Testing: The Most Powerful Way to Turn Clicks into Customers. New Jersey: Wiley.