Online experiments (e.g., A/B tests) are fast becoming part of the standard toolkit of digital organisations for measuring the impact of their work and guiding business decisions. Big tech companies report running thousands of experiments at any given time, and entire companies exist solely to help other organisations manage their experiments.
Many online experiments are straightforward -- we randomise website users into control and treatment groups and compare the groups' average responses with a two-sample t-test. However, these procedures rely on statistical assumptions that can easily fall apart in real-life applications. For example, a vanilla t-test assumes i.i.d. samples, which does not hold when user responses are correlated. Making inferences on the treatment effect with a two-sample test also assumes exchangeability in the potential responses of the two samples, which does not hold when the experiment targets a different audience.
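To make the "vanilla" procedure concrete, here is a minimal sketch in Python using simulated data; the group sizes, effect size, and response distribution are purely illustrative, and the scipy call shown is just one common way to run a Welch two-sample t-test.

```python
# Sketch of a standard two-sample comparison on simulated data.
# All numbers below (group sizes, means, spread) are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated per-user responses (e.g., revenue per user) for each group.
control = rng.normal(loc=10.0, scale=2.0, size=1_000)
treatment = rng.normal(loc=10.2, scale=2.0, size=1_000)

# The t-test treats responses as i.i.d. -- an assumption that breaks
# down when, for example, the same user contributes multiple responses
# or users influence one another.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```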
In this session, we spell out these assumptions with more rigour, illustrate with past experiments how they can be violated, and outline some approaches that address these issues along with their practical tradeoffs. We also discuss cases where it is infeasible to perform randomisation and set up a proper control, presenting common designs used to estimate the treatment effect in such settings.