A/B Testing: Designs and Analysis
Three Essential Components of Statistics (Data Science):
Data + Computer + Analytics
1 Introduction
1.1 What is A/B testing?
An A/B test is shorthand for a simple controlled experiment. As the
name implies, two versions (A and B) of a single variable are
compared; the versions are identical except for one variation that might
affect a user’s behavior. A/B tests are widely considered the simplest
form of controlled experiment. Adding more variants to the test makes
it more complex.
A/B testing is the process of comparing two variations of a page
element, usually by testing users’ response to variant A vs variant B,
and concluding which of the two variants is more effective.
A/B tests are useful for understanding user engagement and
satisfaction with online offerings, such as a new feature or product.
Large social media sites like LinkedIn, Facebook, and Instagram use
A/B testing to improve user experiences and to streamline their
services.
Today, A/B tests are being used to run more complex experiments,
such as network effects when users are offline, how online services
affect user actions, and how users influence one another. Many roles
rely on data from A/B tests, including data engineers, marketers,
designers, software engineers, and entrepreneurs, because the tests
allow companies to understand growth, increase revenue, and optimize
customer satisfaction.
Version A might be the currently used version (control), while version
B is modified in some respect (treatment). For instance, on an
e-commerce website the purchase funnel is typically a good candidate
for A/B testing, as even marginal decreases in drop-off rates can
represent a significant gain in sales. Significant improvements can
sometimes be seen through testing elements like copy text, layouts,
images and colors, but not always. In these tests, users only see one of
two versions, as the goal is to discover which of the two versions is
preferable.
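To judge whether a difference in conversion between versions A and B is more than noise, one common choice is a two-proportion z-test. The sketch below is only an illustration of that idea, not a method prescribed by these notes; the function name and the visitor and conversion counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both versions convert at the same rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 10,000 users see each version.
z, p_value = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the two versions truly converted at the same rate.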
Controlled experiments have a long and fascinating history. They are
sometimes called A/B tests, A/B/C tests (multiple variants), field
experiments, randomized controlled experiments, split tests, bucket
tests, and flights.
1.2 Online experiments
Example 1. Online A/B testing. (Kohavi and Thomke, 2017,
Harvard Business Review) Microsoft, Amazon, Facebook and Google
conduct more than 10,000 online controlled experiments annually, with
many tests engaging millions of users.
Amazon’s experiment.
Treatment A: Credit card offers on front page.
Treatment B: Credit card offers on the shopping cart page.
This (change from A to B) boosted profits by tens of millions of US
Dollars annually.
1.2.1 A/B Testing in eCommerce Industry
Through A/B testing, online stores can increase the average order
value, optimize their checkout funnel, reduce cart abandonment rate,
and more. You may try testing how and where the shipping cost is
displayed; whether, where, and how the free-shipping feature is
highlighted; text and color tweaks on the payment or checkout page;
and the visibility of reviews or ratings.
In the eCommerce industry, Amazon is at the forefront in conversion
optimization partly due to the scale they operate at and partly due to
their immense dedication to providing the best customer experience.
Amongst the many revolutionary practices they brought to the
eCommerce industry, the most notable has been their ‘1-Click
Ordering’. Introduced in the late 1990s after much testing and
analysis, 1-Click Ordering lets users make purchases without having to
use the shopping cart at all. Once users enter their default billing card
details and shipping address, all they need to do is click on the button
and wait for the ordered products to get delivered. Users don’t have to
enter their billing and shipping details again while placing any orders.
With the 1-Click Ordering, it became impossible for users to ignore the
ease of purchase and go to another store. This change had such a
huge business impact that Amazon patented it in 1999 (the patent has
since expired). In 2000, Apple even bought a license to use it in its
online store.
People working to optimize Amazon’s website do not have sudden
‘Eureka’ moments for every change they make. It is through
continuous and structured A/B testing that Amazon is able to deliver
the kind of user experience that it does. Every change on the website
is first tested on their audience and then deployed. If you examine
Amazon’s purchase funnel, you will see that even though the funnel
more or less replicates other websites’ purchase funnels, each and
every element in it is fully optimized and matches the audience’s
expectations.
Every page, starting from the homepage to the payment page, only
contains the essential details and leads to the exact next step required
to push the users further into the conversion funnel. Additionally, using
extensive user insights and website data, each step is simplified as far
as possible to match users’ expectations.
Take their omnipresent shopping cart, for example. There is a small
cart icon at the top right of Amazon’s homepage that stays visible no
matter which page of the website you are on.
The icon is not just a shortcut to the cart or reminder for added
products. In its current version, it offers 5 options:
(i) Continue shopping (if there are no products added to the cart)
(ii) Learn about today’s deals (if there are no products added to the
cart)
(iii) Wish List (if there are no products added to the cart)
(iv) empty cart
(v) Proceed to checkout (when there are products in the cart). Sign in
to turn on 1-Click Checkout (when there are products in the cart).
With one click on the tiny icon offering so many options, the user’s
cognitive load is reduced, and they have a great user experience. As
can be seen in the above screenshot, the same cart page also suggests
similar products so that customers can navigate back into the website
and continue shopping. All this is achieved with one weapon: A/B
Testing.
1.2.2 A/B Testing in Travel Industry
Increase the number of successful bookings on your website or mobile
app, your revenue from ancillary purchases, and much more through
A/B testing. You may try testing your home page search modals,
search results page, ancillary product presentation, your checkout
progress bar, and so on.
In the travel industry, Booking.com easily surpasses all other
eCommerce businesses when it comes to using A/B testing for their
optimization needs. They test like it’s nobody’s business. From the
day of its inception, Booking.com has treated A/B testing as the
treadmill that introduces a flywheel effect for revenue. The scale at
which Booking.com A/B tests is unmatched, especially when it comes
to testing their copy. While you are reading this, there are nearly 1000
A/B tests running on Booking.com’s website.
Even though Booking.com has been A/B testing for more than a
decade now, they still think there is more that they can do to improve
user experience. This is what makes Booking.com the ace in the
game. Since the company started, Booking.com has incorporated A/B
testing into its everyday work process. It has increased its testing
velocity to the current rate by eliminating HiPPOs (decisions driven by
the highest paid person’s opinion) and giving priority to data before
anything else. To increase testing velocity even further, all of
Booking.com’s employees were allowed to run tests on ideas they
thought could help grow the business.
This example will demonstrate the lengths to which Booking.com can
go to optimize their users’ interaction with the website. Booking.com
decided to broaden its reach in 2017 by offering rental properties for
vacations alongside hotels. This led to Booking.com partnering with
Outbrain, a native advertising platform, to help grow their global
property owner registration.
Within the first few days of the launch, the team at Booking.com
realized that even though a lot of property owners completed the first
sign-up step, they got stuck in the next steps. At this time, pages built
for the paid search of their native campaigns were used for the sign-up
process.
Both teams decided to work together and created three versions of
landing page copy for Booking.com. Additional details such as social
proof, awards and recognitions, and user rewards were added to the
variations.
The test ran for two weeks and produced a 25% uplift in owner
registration. The test results also showed a significant decrease in the
cost of each registration.
1.2.3 A/B Testing in B2B/SaaS Industry
Generate high-quality leads for your sales team, increase the number of
free trial requests, attract your target buyers, and perform other such
actions by testing and polishing important elements of your demand
generation engine. To reach these goals, marketing teams put up the
most relevant content on their website, send out ads to prospective
buyers, conduct webinars, run special sales, and much more. But
all their effort goes to waste if the landing page that clients are
directed to is not fully optimized to give the best user experience. The
aim of SaaS (software as a service) A/B testing is to provide the best
user experience and to improve conversions. You can try testing your
lead form components, free trial sign-up flow, homepage messaging,
CTA text, social proof on the home page, and so on.
POSist, a leading SaaS-based restaurant management platform with
more than 5,000 customers at over 100 locations across six countries,
wanted to increase their demo requests. Their website homepage and
Contact Us page are the most important pages in their funnel. The
team at POSist wanted to reduce drop-off on these pages. To achieve
this, the team created two variations of the homepage as well as two
variations of the Contact Us page to be tested. Let’s take a look at the
changes made to the homepage. This is what the control looked like:
The team at POSist hypothesized that adding more relevant and
conversion-focused content to the website would improve user
experience, as well as generate higher conversions. So they created
two variations to be tested against the control.
The control was first tested against Variation 1, and the winner was
Variation 1. To further improve the page, Variation 1 was then
tested against Variation 2, and the winner was Variation 2. The new
variation increased page visits by about 5%.
1.3 Clinical trials
Example 2. HIV transmission. Connor et al. (1994, The New
England Journal of Medicine) report a clinical trial to evaluate the
drug AZT in reducing the risk of maternal-infant HIV transmission.
A 50-50 randomization scheme was used:
AZT group: 239 pregnant women (20 HIV-positive infants).
Placebo group: 238 pregnant women (60 HIV-positive infants).
Given the seriousness of the outcome of this study, it is reasonable to
argue that 50-50 allocation was unethical. As accruing information
favoring (albeit not conclusively) the AZT treatment became
available, allocation probabilities should have been shifted away from
50-50 toward an allocation proportional to the weight of evidence for
AZT. Designs that attempt to do this are called response-adaptive
designs (response-adaptive randomization).
If the treatment assignments had been made with the doubly adaptive
biased coin design (DBCD; Hu and Zhang, 2004, Annals of Statistics)
with urn target, the allocation would have been:
AZT group: 360 patients
Placebo group: 117 patients
Then only 60 (instead of 80) infants would have been HIV positive.
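The figure of 60 can be checked with simple arithmetic. The sketch below assumes (as a simplification, not a claim made by the trial) that the transmission rates observed in each arm would stay the same under a different allocation; the function name is hypothetical.

```python
# Observed transmission rates from the Connor et al. (1994) counts quoted above.
rate_azt = 20 / 239
rate_placebo = 60 / 238

def expected_positive(n_azt, n_placebo):
    """Expected HIV-positive infants if the observed rates held under this allocation."""
    return n_azt * rate_azt + n_placebo * rate_placebo

print(round(expected_positive(239, 238)))  # the actual 50-50 allocation
print(round(expected_positive(360, 117)))  # the allocation quoted for the DBCD
```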
Example 3: Remdesivir COVID-19 trial (China). The trial of remdesivir
in adults with severe COVID-19 (Wang et al. 2020) was a
randomized, double-blind, placebo-controlled, multicentre trial that
aimed to compare remdesivir with placebo. There were 236 patients
in the trial. There were about 20 baseline covariates for each patient,
including 10 continuous variables (e.g. age and white blood cell
count) and 10 discrete variables (e.g. gender and hypertension). A
stratified (according to the level of respiratory support) permuted
block (30 patients per block) randomization procedure was
implemented. At the end of this trial, some important imbalances
existed at enrollment between the groups, including more patients with
hypertension, diabetes, or coronary artery disease in the Remdesivir
group than the placebo group.
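The idea of stratified permuted block randomization can be sketched as follows. This is an illustration only: it assumes 1:1 allocation and a block size of 4 for readability, which may differ from the actual trial, and the function names, stratum labels, and arrival sequence are hypothetical.

```python
import random

def permuted_block(block_size, rng):
    """One block with equal numbers of treatment ("T") and control ("C"), in random order."""
    block = ["T"] * (block_size // 2) + ["C"] * (block_size // 2)
    rng.shuffle(block)
    return block

def stratified_block_randomization(strata, block_size, seed=0):
    """Assign patients in arrival order, keeping a separate block sequence per stratum."""
    rng = random.Random(seed)
    pending = {}       # stratum -> assignments left in its current block
    assignments = []
    for stratum in strata:
        # Start a fresh block when the stratum has none or has used one up.
        if not pending.get(stratum):
            pending[stratum] = permuted_block(block_size, rng)
        assignments.append(pending[stratum].pop())
    return assignments

# Hypothetical arrival sequence of 12 patients in two respiratory-support strata.
strata = ["high", "low", "low", "high", "low", "high",
          "low", "low", "high", "high", "low", "high"]
arms = stratified_block_randomization(strata, block_size=4)
for s, a in zip(strata, arms):
    print(s, a)
```

Blocking balances the arms only on the stratification variable; other covariates (such as hypertension or diabetes) can still end up imbalanced by chance.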
Example 4: Moderna COVID-19 vaccine trial (2020). The
trial began on July 27, 2020, and enrolled 30,420 adult volunteers at
clinical research sites across the United States. Volunteers were
randomly assigned 1:1 to receive either two 100 microgram (mcg)
doses of the investigational vaccine or two shots of saline placebo 28
days apart. The average age of volunteers is 51 years. Approximately
47% are female, 25% are 65 years or older and 17% are under the age
of 65 with medical conditions placing them at higher risk for severe
COVID-19. Approximately 79% of participants are white, 10% are
Black or African American, 5% are Asian, 0.8% are American Indian or
Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2%
are multiracial, and 21% (of any race) are Hispanic or Latino.
From the start of the trial through Nov. 25, 2020, investigators
recorded 196 cases of symptomatic COVID-19 occurring among
participants at least 14 days after they received their second shot. One
hundred and eighty-five cases (30 of which were classified as severe
COVID-19) occurred in the placebo group and 11 cases (0 of which
were classified as severe COVID-19) occurred in the group receiving
mRNA-1273. The incidence of symptomatic COVID-19 was 94.1%
lower in those participants who received mRNA-1273 as compared to
those receiving placebo.
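A rough check of the reported reduction: vaccine efficacy is commonly defined as one minus the ratio of attack rates in the two arms. The sketch below assumes equal arm sizes (consistent with 1:1 randomization), so the arm sizes cancel; published estimates are typically based on person-time at risk and may differ slightly. The function name is hypothetical.

```python
def vaccine_efficacy(cases_vaccine, cases_placebo, n_vaccine=1, n_placebo=1):
    """VE = 1 - (attack rate in vaccine arm) / (attack rate in placebo arm)."""
    return 1 - (cases_vaccine / n_vaccine) / (cases_placebo / n_placebo)

# Case counts from the primary analysis quoted above; equal arm sizes assumed.
ve = vaccine_efficacy(cases_vaccine=11, cases_placebo=185)
print(f"Approximate vaccine efficacy: {ve:.1%}")
```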
Investigators observed 236 cases of symptomatic COVID-19 among
participants at least 14 days after they received their first shot, with
225 cases in the placebo group and 11 cases in the group receiving
mRNA-1273. The vaccine efficacy was 95.2% for this secondary
analysis.
Long-term Treatment Effects?
1.4 Economics and Social Science
Political A/B testing
A/B tests are used not only by corporations; they also drive
political campaigns. In 2007, Barack Obama’s presidential campaign
used A/B testing to attract online attention and to understand
what voters wanted to see from the presidential candidate. For
example, Obama’s team tested four distinct buttons on their website
that led users to sign up for newsletters. Additionally, the team used
six different accompanying images to draw in users. Through A/B
testing, staffers were able to determine how to effectively draw in
voters and garner additional interest.
Example 5. Project GATE (Growing America Through
Entrepreneurship), sponsored by the U.S. Department of Labor, was
designed to evaluate the impact of offering tuition-free
entrepreneurship training services (GATE services) on helping clients
create, sustain or expand their own business.
(https://www.doleta.gov/reports/projectgate/)
The cornerstone is complete randomization. Members of the
treatment group were offered GATE services; members of the control
group were not.
n = 4,198 participants
p = 105 covariates
1.5 Biological, psychological, and agricultural research
Controlled experiments were mainly developed in these areas in
1900-1950.
Road Map of this course:
(i) The history of experiment design;
(ii) A/B testing in medical studies;
(iii) Online controlled experiments (A/B testing).
2 The history of experiment design
2.1 Experiment design before Fisher
Statistical experiments, following Charles S. Peirce:
A theory of statistical inference was developed by Charles S. Peirce in
“Illustrations of the Logic of Science” (1877–1878) and “A Theory of
Probable Inference” (1883), two publications that emphasized the
importance of randomization-based inference in statistics.
Randomized experiments: Charles S. Peirce randomly assigned
volunteers to a blinded, repeated-measures design to evaluate their
ability to discriminate weights. Peirce’s experiment inspired other
researchers in psychology and education, who developed a research
tradition of randomized experiments in laboratories and specialized
textbooks in the 1800s.
Optimal designs for regression models:
Charles S. Peirce also contributed the first English-language
publication on an optimal design for regression models in 1876. A
pioneering optimal design for polynomial regression was suggested by
Gergonne in 1815. In 1918, Kirstine Smith published optimal designs
for polynomials of degree six (and less).
2.2 Fisher’s principles
A methodology for designing experiments was proposed by Ronald
Fisher, in his innovative books: The Arrangement of Field Experiments
(1926) and The Design of Experiments (1935). Much of his pioneering
work dealt with agricultural applications of statistical methods. As a
mundane example, he described how to test the lady tasting tea
hypothesis, that a certain lady could distinguish by flavour alone
whether the milk or the tea was first placed in the cup. These
methods have been broadly adapted in biological, psychological, and
agricultural research.
2.2.1 Comparison
In some fields of study it is not possible to have independent
measurements to a traceable metrology standard. Comparisons
between treatments are then much more valuable and are usually
preferable; treatments are often compared against a scientific control
or traditional treatment that acts as a baseline.
2.2.2 Randomization
Random assignment is the process of assigning individuals at random
to groups, or to different conditions within a group, in an experiment, so
that each individual has the same chance of being assigned to any given
group. The random assignment of individuals to
groups (or conditions within a group) distinguishes a rigorous, “true”
experiment from an observational study or “quasi-experiment”. There
is an extensive body of mathematical theory that explores the
consequences of making the allocation of units to treatments by means
of some random mechanism (such as tables of random numbers, or the
use of randomization devices such as playing cards or dice). Assigning
units to treatments at random tends to mitigate confounding, which
makes effects due to factors other than the treatment appear to
result from the treatment.
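In software, a pseudo-random number generator usually plays the role of the random-number table or dice. A minimal sketch of complete randomization into two equal-sized groups follows; the function name and unit labels are hypothetical.

```python
import random

def randomize(units, seed=None):
    """Randomly split units into two groups: (treatment, control)."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Twenty hypothetical experimental units labelled 0..19.
treatment, control = randomize(range(20), seed=42)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Fixing the seed makes the allocation reproducible for auditing; in a real experiment the seed should be chosen before any outcome data are seen.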
The risks associated with random allocation (such as having a serious
imbalance in a key characteristic between a treatment group and a
control group) are calculable and hence can be managed down to an
acceptable level by using enough experimental units. However, if the
population is divided into several subpopulations that somehow differ,
and the research requires each subpopulation to be equal in size,
stratified sampling can be used. In that way, the units in each
subpopulation are randomized, but not the whole sample. The results
of an experiment can be generalized reliably from the experimental
units to a larger statistical population of units only if the experimental
units are a random sample from the larger population; the probable
error of such an extrapolation depends on the sample size, among
other things.
2.2.3 Statistical replication
Measurements are usually subject to variation and measurement
uncertainty; thus they are repeated and full experiments are replicated
to help identify the sources of variation, to better estimate the true
effects of treatments, to further strengthen the experiment’s reliability
and validity, and to add to the existing knowledge of the topic.
However, certain conditions must be met before the replication of the
experiment is commenced: the original research question has been
published in a peer-reviewed journal or widely cited, the researcher is
independent of the original experiment, the researcher must first try to
replicate the original findings using the original data, and the write-up
should state that the study conducted is a replication study that tried
to follow the original study as strictly as possible.
2.2.4 Blocking
Blocking is the non-random arrangement of experimental units into
groups (blocks) consisting of units that are similar to one another.
Blocking reduces known but irrelevant sources of variation between
units and thus allows greater precision in the estimation of the source
of variation under study.
2.2.5 Orthogonality
Orthogonality concerns the forms of comparison (contrasts) that can
be legitimately and efficiently carried out. Contrasts can be
represented by vectors and sets of orthogonal contrasts are
uncorrelated and independently distributed if the data are normal.
Because of this independence, each orthogonal contrast provides
information that the others do not. If there are T treatments and T − 1
orthogonal contrasts, all the information that can be captured from
the experiment is obtainable from the set of contrasts.
Example 2.1. Measurement Error: We would like to measure
the weight of a subject A by using a scale. We know that the scale has
a measurement error. Suppose that the error follows a normal
distribution with mean 0 and variance $\sigma^2$. Mathematically, we may write:
$w_1 = A + e_1$,
where $A$ is the true weight, $w_1$ is the observed weight, and $e_1$ is the
measurement error.
Figure 1: A scale to measure subject A
Now we would like to measure the weights of two subjects A and B by
using the same scale twice. What should we do?
Method 1:
$w_1 = A + e_1$ and $w_2 = B + e_2$.
Figure 2: Subject B
Method 2:
$w_3 = A + B + e_3$ and $w_4 = A - B + e_4$.
Figure 3: A + B
Figure 4: A - B
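The measurement errors:
Method 1 (estimate A by $w_1$ and B by $w_2$):
Subject A: $e_1 \sim N(0, \sigma^2)$.
Subject B: $e_2 \sim N(0, \sigma^2)$.
Method 2 (estimate A by $(w_3 + w_4)/2$ and B by $(w_3 - w_4)/2$):
Subject A: $(e_3 + e_4)/2 \sim N(0, \sigma^2/2)$.
Subject B: $(e_3 - e_4)/2 \sim N(0, \sigma^2/2)$.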
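The halving of the error variance under Method 2 can be checked by simulation. The sketch below assumes independent normal errors with standard deviation 1; the true weights (70 and 55) and the function name are hypothetical values chosen for illustration.

```python
import random
import statistics

def simulate(n_trials=20_000, a=70.0, b=55.0, sigma=1.0, seed=1):
    """Compare the variance of the estimate of A under Method 1 and Method 2."""
    rng = random.Random(seed)
    est1, est2 = [], []
    for _ in range(n_trials):
        # Method 1: weigh A alone; the reading itself is the estimate.
        est1.append(a + rng.gauss(0, sigma))
        # Method 2: weigh A+B and A-B, then average the two readings.
        w3 = a + b + rng.gauss(0, sigma)
        w4 = a - b + rng.gauss(0, sigma)
        est2.append((w3 + w4) / 2)
    return statistics.variance(est1), statistics.variance(est2)

var1, var2 = simulate()
print(f"Method 1 variance: {var1:.3f}")
print(f"Method 2 variance: {var2:.3f}")
```

Both methods use the scale twice, but in Method 2 every reading carries information about both subjects, which is why each estimate is more precise.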
Factorial experiments should be used instead of the one-factor-at-a-time
method. These are efficient at evaluating the effects and possible
interactions of several factors (independent variables). Analysis of
experiment design is built on the foundation of the analysis of
variance, a collection of models that partition the observed variance
into components, according to what factors the experiment must
estimate or test.
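As an illustration, a full factorial design simply enumerates every combination of factor levels. The factors and levels below are hypothetical, chosen to resemble a landing-page experiment.

```python
from itertools import product

# Hypothetical factors and their levels.
factors = {
    "button_color": ["blue", "green"],
    "headline": ["short", "long"],
    "image": ["photo", "illustration"],
}

# A full factorial design runs every combination of factor levels.
design = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for run in design:
    print(run)
print(len(design), "runs")
```

With three two-level factors this gives 2 × 2 × 2 = 8 runs, and because every level of one factor appears with every level of the others, interactions can be estimated as well as main effects.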
2.2.6 Avoiding false positives
False positive conclusions, often resulting from the pressure to publish
or the author’s own confirmation bias, are an inherent hazard in many
fields. A good way to prevent biases potentially leading to false
positives in the data collection phase is to use a double-blind design.
When a double-blind design is used, participants are randomly assigned
to experimental groups but the researcher is unaware of which
participants belong to which group. Therefore, the researcher cannot
affect the participants’ response to the intervention.
Experimental designs with undisclosed degrees of freedom are a
problem. They can lead to conscious or unconscious “p-hacking”:
trying multiple things until you get the desired result. It typically
involves the manipulation – perhaps unconsciously – of the process of
statistical analysis and the degrees of freedom until they return a
figure below the p < .05 level of statistical significance.
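To see why trying many analyses is dangerous, consider the chance of at least one false positive across several tests. The sketch below assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 3))
```

Under these assumptions, the chance of a spurious "significant" result grows quickly with the number of analyses tried, even though each individual test is run at the 5% level.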
So the design of the experiment should include a clear statement
proposing the analyses to be undertaken. P-hacking can be prevented
by preregistering research, in which researchers have to send their
data analysis plan to the journal they wish to publish their paper in
before they even start their data collection, so no data manipulation is
possible.
Another way to prevent this is to take the double-blind design to the
data-analysis phase, where the data are sent to a data analyst
unrelated to the research who scrambles the data so that there is no
way to know which group participants belong to before they are
potentially removed as outliers.
2.2.7 Causal attributions
In the pure experimental design, the independent (predictor) variable is
manipulated by the researcher; that is, every participant of the
research is chosen randomly from the population, and each participant
chosen is assigned randomly to conditions of the independent variable.
Only when this is done is it possible to certify with high probability
that the differences in the outcome variables are caused
by the different conditions. Therefore, researchers should choose the
experimental design over other design types whenever possible.
However, the nature of the independent variable does not always allow
for manipulation. In those cases, researchers must take care not to
claim causal attribution when their design doesn’t allow for
it. For example, in observational designs, participants are not assigned
randomly to conditions, and so if there are differences found in
outcome variables between conditions, it is likely that there is
something other than the differences between the conditions that
causes the differences in outcomes, that is – a third variable. The same
goes for studies with correlational design (Adér & Mellenbergh, 2008).
2.2.8 Statistical control
It is best that a process be in reasonable statistical control prior to
conducting designed experiments. When this is not possible, proper
blocking, replication, and randomization allow for the careful conduct
of designed experiments. To control for nuisance variables, researchers
institute control checks as additional measures. Investigators should
ensure that uncontrolled influences (e.g., source credibility perception)
do not skew the findings of the study. A manipulation check is one
example of a control check. Manipulation checks allow investigators to
isolate the chief variables to strengthen support that these variables
are operating as planned.
One of the most important requirements of experimental research
designs is the necessity of eliminating the effects of spurious,
intervening, and antecedent variables. In the most basic model, cause
(X) leads to effect (Y). But there could be a third variable (Z) that
influences (Y), and X might not be the true cause at all. Z is said to
be a spurious variable and must be controlled for. The same is true for
intervening variables (a variable in between the supposed cause (X)
and the effect (Y)), and anteceding variables (a variable prior to the
supposed cause (X) that is the true cause). When a third variable is
involved and has not been controlled for, the relation is said to be a
zero order relationship. In most practical applications of experimental
research designs there are several causes (X1, X2, X3). In most
designs, only one of these causes is manipulated at a time.
2.3 Experimental designs after Fisher
Some efficient designs for estimating several main effects were found
independently and in near succession by Raj Chandra Bose and K.
Kishen in 1940 at the Indian Statistical Institute, but remained little
known until the Plackett–Burman designs were published in
Biometrika in 1946. About the same time, C. R. Rao introduced the
concepts of orthogonal arrays as experimental designs. This concept
played a central role in the development of Taguchi methods by
Genichi Taguchi, which took place during his visit to the Indian Statistical
Institute in the early 1950s. His methods were successfully applied and
adopted by Japanese and Indian industries and subsequently were also
embraced by US industry albeit with some reservations.
In 1950, Gertrude Mary Cox and William Gemmell Cochran published
the book Experimental Designs, which became the major reference
work on the design of experiments for statisticians for years afterwards.
Developments of the theory of linear models have encompassed and
surpassed the cases that concerned early writers. Today, the theory
rests on advanced topics in linear algebra, algebra and combinatorics.
As with other branches of statistics, experimental design is pursued
using both frequentist and Bayesian approaches: In evaluating
statistical procedures like experimental designs, frequentist statistics
studies the sampling distribution while Bayesian statistics updates a
probability distribution on the parameter space.
Some important contributors to the field of experimental designs are
C. S. Peirce, R. A. Fisher, F. Yates, R. C. Bose, A. C. Atkinson, R. A.
Bailey, D. R. Cox, G. E. P. Box, W. G. Cochran, W. T. Federer, V. V.
Fedorov, A. S. Hedayat, J. Kiefer, O. Kempthorne, J. A. Nelder,
Andrej Pázman, Friedrich Pukelsheim, D. Raghavarao, C. R. Rao,
S. S. Shrikhande, J. N. Srivastava, William J. Studden, G. Taguchi,
and H. P. Wynn.
The textbooks of D. Montgomery, R. Myers, and G. Box/W.
Hunter/J.S. Hunter have reached generations of students and
practitioners.
Some discussion of experimental design in the context of system
identification (model building for static or dynamic models) is given
in [35] and [36].
2.4 Sequences of experiments
The use of a sequence of experiments, where the design of each may
depend on the results of previous experiments, including the possible
decision to stop experimenting, is within the scope of sequential
analysis, a field that was pioneered by Abraham Wald in the context of
sequential tests of statistical hypotheses. Herman Chernoff wrote an
overview of optimal sequential designs, while adaptive designs have
been surveyed by S. Zacks. One specific type of sequential design is
the “two-armed bandit”, generalized to the multi-armed bandit, on
which early work was done by Herbert Robbins in 1952.
2.5 Human participant constraints
Laws and ethical considerations preclude some carefully designed
experiments with human subjects. Legal constraints are dependent on
jurisdiction. Constraints may involve institutional review boards,
informed consent and confidentiality affecting both clinical (medical)
trials and behavioral and social science experiments.[37] In the field of
toxicology, for example, experimentation is performed on laboratory
animals with the goal of defining safe exposure limits for humans.
Balancing the constraints are views from the medical field.[39]
Regarding the randomization of patients, “... if no one knows which
therapy is better, there is no ethical imperative to use one therapy or
another.” (p 380) Regarding experimental design, “...it is clearly not
ethical to place subjects at risk to collect data in a poorly designed
study when this situation can be easily avoided...”. (p 393)
2.6 Some important issues in designing experiments
Clear and complete documentation of the experimental methodology is
also important in order to support replication of results.
Discussion topics when setting up an experimental design
An experimental design or randomized clinical trial requires careful
consideration of several factors before actually doing the experiment.
An experimental design is the laying out of a detailed experimental
plan in advance of doing the experiment. Some of the following topics
have already been discussed in the principles of experimental design
section:
1) How many factors does the design have, and are the levels of these
factors fixed or random?
2) Are control conditions needed, and what should they be?
3) Manipulation checks; did the manipulation really work?
4) What are the background variables?
5) What is the sample size? How many units must be collected for the
experiment to be generalisable and have enough power?
6) What is the relevance of interactions between factors?
7) What is the influence of delayed effects of substantive factors on
outcomes?
8) How do response shifts affect self-report measures?
9) How feasible is repeated administration of the same measurement
instruments to the same units at different occasions, with a post-test
and follow-up tests?
10) What about using a proxy pretest?
11) Are there lurking variables?
12) Should the client/patient, researcher or even the analyst of the
data be blind to conditions?
13) What is the feasibility of subsequent application of different
conditions to the same units?
14) How many of each control and noise factors should be taken into
account?
15) How to deal with missing values?
16) What are the good matrices?
........
The independent variable of a study often has many levels or different
groups. In a true experiment, researchers can have an experimental
group, in which the intervention testing the hypothesis is
implemented, and a control group, which has all the same elements as
the experimental group except the intervention. Thus, when everything
except the one intervention is held constant, researchers can conclude
with some confidence that this one element is what caused the observed
change. In some instances, having a control group is not ethical. This
is sometimes solved by using two different experimental groups. In
some cases, independent variables cannot be manipulated, for example
when testing the difference between two groups who have different
diseases, or testing the difference between genders (variables that
would be hard or unethical to assign participants to). In these cases,
a quasi-experimental design may be used.
3 A/B tests (Randomized Control
Studies) in clinical trials
3.1 Drug development
Drug development is a complex and lengthy process that takes 7 to 15
years for a single drug at a cost that may reach hundreds of millions of
dollars. There are three main parts of the drug development process:
Discovery and decision;
Preclinical studies;
Clinical studies.
Discovery and Decision
The process starts with the discovery of a new compound or of a new
potential application of an existing compound. Based on adequate
results, the decision whether to develop the drug is then made.
Preclinical Studies
The initial toxicology of the compound is studied in animals. Initial
formulation development of the drug and specific or comprehensive
pharmacological studies in animals are also performed at this stage. At
the end of the preclinical studies, the evidence of potential safety
and effectiveness of the drug is assessed by the company.
To proceed further, a US-based company needs to file a Notice of
Claimed Investigational New Drug Exemption (to allow the company
to conduct studies on human subjects).
Clinical Studies
Once there is sufficient evidence that the drug will be of benefit to
human subjects, testing the drug in human subjects is the next step.
Phase I clinical trial: To establish the initial safety information
about the effect of the drug on humans, such as the range of acceptable
dosages and the pharmacokinetics of the drug. These studies are
normally conducted with healthy volunteers. The number of subjects
typically varies between 4 and 20 per study, with up to 100 subjects in
total used over the course of Phase I trials.
Phase II clinical trial: These studies are conducted with patients
who will potentially benefit from the new drug. Effective dose ranges
and initial effects of the drug on these patients are assessed. Up to
several hundred patients are usually selected in Phase II trials.
Phase III clinical trial: Phase III studies provide assessment of
safety, efficacy, and optimum dosage. These studies are designed with
control and treatment groups. Usually hundreds or even thousands of
patients are involved in Phase III trials.
Based on successful results obtained from these studies, the company
can then submit an NDA (New Drug Application). The application
contains the results from all three stages (from discovery to Phase III)
and is reviewed by the FDA.
The FDA review panel of the NDA consists of reviewers in the
following areas: medicine, pharmacology, biopharmaceutics, chemistry,
and statistics.
Phase IV: Postmarket activities. Follow-up studies are conducted
to examine the long-term effects of the drug. The main purpose of
these studies is to ensure that all claims made by the company about
the new drug can be substantiated by so-called "clinical evidence". All
reported adverse effects must also be investigated by the company and,
in some cases, the drug may need to be withdrawn from the market.
Statistician’s Responsibilities:
Participate in the development plan for studying a drug.
Study design and protocol development. Randomization schemes.
Data cleaning and database construction format.
Analysis plan and program development for analysis.
Report preparation. Produce tables and figures.
Integrate clinical study results, safety and efficacy reports.
Communication and NDA defense to FDA review panel.
Publication support and consulting with other company personnel.
Example 3.1. HIV transmission. Connor et al. (1994, The New
England Journal of Medicine) report a clinical trial to evaluate the
drug AZT in reducing the risk of maternal-infant HIV transmission.
A 50-50 randomization scheme was used:
AZT group (A): 239 pregnant women (20 HIV-positive infants).
Placebo group (B): 238 pregnant women (60 HIV-positive infants).
Given the seriousness of the outcome of this study, it is reasonable to
argue that 50-50 allocation was unethical. As accruing information
favoring (albeit not conclusively) the AZT treatment became
available, allocation probabilities should have been shifted from
50-50 allocation to allocation proportional to the weight of evidence
for AZT. Designs that attempt to do this are called response-adaptive
designs (response-adaptive randomization).
If the treatment assignments had been made with the doubly adaptive
biased coin design (DBCD; Hu and Zhang, 2004, Annals of Statistics)
with the urn target:
AZT group: 360 patients
Placebo group: 117 patients
then only 60 (instead of 80) infants would be HIV positive.
Allocation rule   AZT   Placebo   Power    HIV+
EA                239   238       0.9996   80
DBCD              360   117       0.989    60
Neyman            186   291       0.9998   89
FPower            416   61        0.90     50
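The HIV+ column can be checked with a short calculation. The sketch below assumes that the event rates observed in Example 3.1 (20/239 on AZT, 60/238 on placebo) would have held under any allocation; the function name and the dictionary are illustrative only, and the Power column is not reproduced because the test behind it is not stated here.

```python
# Observed event rates from Example 3.1 (HIV-positive infants per arm).
RATE_AZT = 20 / 239      # AZT arm
RATE_PLACEBO = 60 / 238  # placebo arm

# Allocations (AZT, placebo) for each rule, copied from the table.
ALLOCATIONS = {
    "EA": (239, 238),
    "DBCD": (360, 117),
    "Neyman": (186, 291),
    "FPower": (416, 61),
}


def expected_hiv_positive(n_azt, n_placebo):
    """Expected HIV-positive infants, assuming the observed rates hold."""
    return n_azt * RATE_AZT + n_placebo * RATE_PLACEBO


for rule, (n_azt, n_placebo) in ALLOCATIONS.items():
    print(f"{rule:7s} {expected_hiv_positive(n_azt, n_placebo):6.1f}")
```

Rounded to whole infants, these expected counts agree with the HIV+ column of the table.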
Example 2 (ECMO Trial). Extracorporeal membrane oxygenation
(ECMO) is an external system for oxygenating the blood based on
techniques used in cardiopulmonary bypass technology developed for
cardiac surgery. In the literature, there are three well-documented
clinical trials evaluating the clinical effectiveness of ECMO:
(i) the Michigan ECMO study (Bartlett, et al. 1985);
(ii) the Boston ECMO study (Ware, 1989);
(iii) the UK ECMO study (UK Collaborative ECMO Trials Group, 1996).
Example 2 (Continued): Michigan ECMO trial using the RPW rule:
The randomized play-the-winner (RPW) rule was used in a clinical trial
of extracorporeal membrane oxygenation (ECMO; Bartlett, et al. 1985,
Pediatrics).
Total: 12 patients.
ECMO group: 11 patients, all survived.
Conventional therapy: 1 patient, died.
3.2 Determining the Sample Size
In the planning stages of a randomized clinical trial, it is necessary to
determine the number of subjects (sample size) to be randomized.
For two treatments (A and B), say $n = n_A + n_B$. We assume here
that the allocation proportions are known in advance, that is,
$n_A/n = \rho$ and $n_B/n = 1 - \rho$ are predetermined.
Examples of sample size (SS) calculations.
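As one possible illustration, the sketch below computes a total sample size for comparing two success probabilities with a fixed allocation proportion $\rho$. No formula is given in this section, so the standard normal-approximation formula for a two-sided test, $n = (z_{1-\alpha/2} + z_{1-\beta})^2 \, [p_A q_A/\rho + p_B q_B/(1-\rho)] / (p_A - p_B)^2$, is assumed here; the function name, the default $\alpha$ and power, and the example probabilities are also assumptions.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(p_a, p_b, rho=0.5, alpha=0.05, power=0.80):
    """Total n = n_A + n_B for a two-sided test of p_A = p_B.

    rho is the proportion allocated to A (n_A / n). Uses the usual
    normal approximation; this is an assumed formula, not one taken
    from the text.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Variance of the estimated difference, scaled by n.
    variance = p_a * (1 - p_a) / rho + p_b * (1 - p_b) / (1 - rho)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


# Hypothetical success probabilities, equal versus unequal allocation.
print(total_sample_size(0.6, 0.4, rho=0.5))
print(total_sample_size(0.6, 0.4, rho=0.8))
```

With equal variances in the two arms, moving $\rho$ away from 1/2 increases the total sample size needed for the same power.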
3.3 Mathematical Framework of Randomization Procedures
Suppose we compare two treatments A and B. Let $T_1, \ldots, T_n$ be a
sequence of random treatment assignments.
$T_i = 1$ if patient $i$ is assigned to treatment A;
$T_i = 0$ if patient $i$ is assigned to treatment B.
$N_A(n) = \sum_{i=1}^{n} T_i$ is the number of patients on A, and
$N_B(n) = n - N_A(n)$.
$X_1, \ldots, X_n$: response variables. Here $X_i$ represents the
responses that would be observed if each treatment were assigned to
the i-th patient independently.
$Z_1, \ldots, Z_n$: covariates. Here $Z_i$ represents the covariates of
the i-th patient.
When the (i+1)th patient is ready to be randomized in a clinical
trial, the following information is available:
patient assignments: $T_1, \ldots, T_i$;
responses: $X_1, \ldots, X_i$ (assuming immediate responses);
patient covariates: $Z_1, \ldots, Z_i$ and $Z_{i+1}$.
Let $\mathcal{T}_n = \sigma\{T_1, \ldots, T_n\}$ be the sigma-algebra
generated by the first n treatment assignments.
Let $\mathcal{X}_n = \sigma\{X_1, \ldots, X_n\}$ be the sigma-algebra
generated by the first n responses.
Let $\mathcal{Z}_n = \sigma\{Z_1, \ldots, Z_n\}$ be the sigma-algebra
generated by the first n covariate vectors.
Let $\mathcal{F}_n = \mathcal{T}_n \otimes \mathcal{X}_n \otimes \mathcal{Z}_{n+1}$.
A randomization procedure is defined by
$\phi_n = E(T_n \mid \mathcal{F}_{n-1})$,
where $\phi_{n+1}$ is $\mathcal{F}_n$-measurable. We can describe
$\phi_n$ as the conditional probability of assigning treatments
$1, \ldots, K$ to the n-th patient, conditional on the previous $n-1$
assignments, responses, and covariate vectors, and the current
patient's covariate vector.
We can describe five types of randomization procedures:
(i) complete randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n)$;
no information is used.
(ii) restricted randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1})$;
only information on patients' assignments is used.
(iii) response-adaptive randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1})$;
information on patients' assignments and responses is used.
(iv) covariate-adaptive randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{Z}_n)$;
information on patients' assignments and covariates is used.
(v) covariate-adjusted response-adaptive (CARA) randomization if
$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1}, \mathcal{Z}_n)$;
all available information is used.
3.4 Complete randomization
The simplest form of a randomization procedure is complete
randomization.
$E(T_i \mid T_1, \ldots, T_{i-1}) = P(T_i = 1 \mid T_1, \ldots, T_{i-1}) = 1/2$, $i = 1, \ldots, n$.
$N_A(n)$ has a binomial$(n, 1/2)$ distribution.
This procedure is rarely used in practice because of the nonnegligible
probability of treatment imbalances in moderate samples.
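The size of this imbalance probability can be computed directly from the binomial distribution of $N_A(n)$. The following is a minimal sketch; the function name and the chosen thresholds are illustrative assumptions.

```python
from math import comb


def prob_imbalance_at_least(n, d):
    """P(|N_A(n) - N_B(n)| >= d) when N_A(n) ~ binomial(n, 1/2)."""
    favourable = 0
    for k in range(n + 1):          # k = number of patients on A
        if abs(2 * k - n) >= d:     # N_A - N_B = k - (n - k) = 2k - n
            favourable += comb(n, k)
    return favourable / 2 ** n


# Chance of a difference of at least 10 patients for a few trial sizes.
for n in (20, 50, 100):
    print(n, round(prob_imbalance_at_least(n, 10), 4))
```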
3.5 Restricted randomization
Truncated binomial design: Complete randomization is used until n/2
patients have been assigned to A or B; the remainder is then filled
with the opposite treatment with probability 1. Here the procedure is
given by
$\phi_i = 1/2$, if $\max\{N_A(i-1), N_B(i-1)\} < n/2$,
$\phi_i = 0$, if $N_A(i-1) = n/2$,
$\phi_i = 1$, if $N_B(i-1) = n/2$.
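The truncated binomial design can be sketched in a few lines. The coding 1 = A, 0 = B follows the definition of $T_i$; the function name and the use of Python's `random.Random` as the source of randomness are assumptions made for illustration.

```python
import random


def truncated_binomial(n, rng):
    """Return n assignments (1 = A, 0 = B) with exactly n/2 on each arm."""
    if n % 2:
        raise ValueError("n must be even")
    half = n // 2
    n_a = n_b = 0
    sequence = []
    for _ in range(n):
        if n_a == half:        # A is full: assign B with probability 1
            t = 0
        elif n_b == half:      # B is full: assign A with probability 1
            t = 1
        else:                  # otherwise toss a fair coin
            t = 1 if rng.random() < 0.5 else 0
        n_a += t
        n_b += 1 - t
        sequence.append(t)
    return sequence


print(truncated_binomial(12, random.Random(2024)))
```

Every sequence ends perfectly balanced, but the tail of the sequence is deterministic once one arm is full.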
Blocked Procedures: Because we do not know n exactly in advance,
we typically require overrunning of the randomization sequence.
Forced balance designs are therefore typically used in blocks.
Permuted block design: Blocks of even size 2b are filled using
either a random allocation rule or a truncated binomial design.
The maximum imbalance is b and the only possibility of a terminal
imbalance occurs if the last block is unfilled. Every block has at
least one deterministic assignment.
Random block design: Blocks of size 2, 4, 6, ..., 2K are randomly
selected and equiprobable.
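A permuted block design that fills each block with the random allocation rule (a random ordering of b assignments to A and b to B) might look like the sketch below. The function names are hypothetical, and the truncated binomial variant and the random block design are not shown.

```python
import random


def permuted_block_sequence(n_blocks, b, rng):
    """Assignments (1 = A, 0 = B) from n_blocks blocks of size 2b."""
    sequence = []
    for _ in range(n_blocks):
        block = [1] * b + [0] * b   # b patients on A and b on B
        rng.shuffle(block)          # random allocation rule within the block
        sequence.extend(block)
    return sequence


def max_imbalance(sequence):
    """Largest |N_A(i) - N_B(i)| reached along the sequence."""
    d = worst = 0
    for t in sequence:
        d += 1 if t == 1 else -1
        worst = max(worst, abs(d))
    return worst


seq = permuted_block_sequence(n_blocks=5, b=2, rng=random.Random(7))
print(seq, max_imbalance(seq))
```

Because every completed block is balanced, the running imbalance never exceeds b.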
Efron's biased coin design (BCD): (Efron, 1971). Let
$D_i = N_A(i) - N_B(i)$ be the imbalance between treatments A and B.
Define a constant $p \in (0.5, 1]$. Then the procedure is given by
$\phi_i = 1/2$, if $D_{i-1} = 0$,
$\phi_i = p$, if $D_{i-1} < 0$,
$\phi_i = 1 - p$, if $D_{i-1} > 0$.
Efron suggested $p = 2/3$ might be a reasonable value (without
justification).
Many other designs have been proposed and studied in the literature
(Smith's design (1984), Wei's design (1978), Big Stick design (Soares
and Wu, 1982), etc.)
When n = 50, $Var(D_n) = 49.92$ (complete randomization);
$Var(D_n) = 4.36$ (Efron's BCD with $p = 2/3$). (Based on 100,000
replications.)
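The simulated variances can be cross-checked without simulation by propagating the exact distribution of the imbalance $D_n$ one patient at a time. The sketch below treats complete randomization as the special case of bias 1/2 in Efron's rule; the function names are illustrative assumptions.

```python
from collections import defaultdict


def imbalance_distribution(n, p):
    """Exact distribution of D_n = N_A(n) - N_B(n) under Efron's BCD.

    p is the bias of the coin; p = 0.5 reduces to complete randomization.
    """
    dist = {0: 1.0}                      # D_0 = 0 with probability 1
    for _ in range(n):
        nxt = defaultdict(float)
        for d, prob in dist.items():
            if d == 0:
                phi = 0.5                # balanced: fair coin
            elif d < 0:
                phi = p                  # A is behind: favour A
            else:
                phi = 1 - p              # A is ahead: favour B
            nxt[d + 1] += prob * phi         # next patient goes to A
            nxt[d - 1] += prob * (1 - phi)   # next patient goes to B
        dist = dict(nxt)
    return dist


def variance(dist):
    mean = sum(d * pr for d, pr in dist.items())
    return sum((d - mean) ** 2 * pr for d, pr in dist.items())


print(variance(imbalance_distribution(50, 0.5)))    # complete randomization
print(variance(imbalance_distribution(50, 2 / 3)))  # Efron's BCD
```

For complete randomization the exact variance is n, and the biased coin value should lie close to the simulated figure quoted above, up to Monte Carlo error in the simulation.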
3.6 Selection Bias
Selection Bias refers to biases that are introduced into an unmasked
study because an investigator may be able to guess the treatment
assignment of future patients based on knowing the treatments
assigned to the past patients. Patients usually enter a trial sequentially
over time.
The great clinical trialist Chalmers (1990) was convinced that the
elimination of selection bias is the most essential requirement for a
good clinical trial.
How to measure the Selection Bias?
Blackwell and Hodges (1957), Berger, Ivanova and Knoll (2003) and
others have suggested using the predictability of a randomization
sequence to measure the selection bias.
One measure of the predictability of a randomization
sequence is given by
$P_{pred} = \frac{1}{n} \sum_{i=1}^{n} |E\phi_i - 0.5|$.
Selection bias of different designs.
4 Response-adaptive randomization
procedures
4.1 Historical notes
Adaptive designs in the clinical trials context were first formulated as
solutions to optimal decision-making questions:
Which treatment is better?
What sample size should be used before determining a “better”
treatment to maximize the total number receiving the better
treatment?
How do we incorporate prior data or accruing data into these
decisions?
The preliminary ideas can be traced back to Thompson (1933,
Biometrika) and Robbins (1952, Bulletin of the American
Mathematical Society) and led to a flurry of work in the 1960s by
Anscombe (1963, JASA), Colton (1963, JASA), Zelen (1969, JASA)
and Cornfield, Halperin, and Greenhouse (1969, Annals of
Mathematical Statistics), among others.
4.2 Play-the-winner rule
Perhaps the simplest of these adaptive designs is the play-the-winner
rule originally explored by Robbins (1952, Bulletin of the American
Mathematical Society) and later by Zelen (1969, JASA).
Binary response: treatments A and B.
$p_A = P(\text{success} \mid A)$, $q_A = 1 - p_A$;
$p_B = P(\text{success} \mid B)$, $q_B = 1 - p_B$;
$N_A(n)$: number of patients on A;
$N_B(n)$: number of patients on B, $n = N_A(n) + N_B(n)$.
Play-the-winner rule:
a success on one treatment results in the next patient’s
assignment to the same treatment,
a failure on one treatment results in the next patient’s assignment
to the opposite treatment.
That is,
$\phi_n = 1$ if $T_{n-1} = 1$ and $X_{n-1}(A) = 1$, or $T_{n-1} = 0$ and
$X_{n-1}(B) = 0$;
$\phi_n = 0$ if $T_{n-1} = 1$ and $X_{n-1}(A) = 0$, or $T_{n-1} = 0$ and
$X_{n-1}(B) = 1$.
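The rule is simple enough to simulate directly. In the sketch below the first patient is assigned by a fair coin, which is an assumption because the rule as stated does not say how to start; the function name and the example success probabilities are also illustrative.

```python
import random


def play_the_winner(n, p_a, p_b, rng):
    """Simulate n patients under play-the-winner; 1 = A, 0 = B."""
    assignments = []
    t = 1 if rng.random() < 0.5 else 0   # assumption: first patient by fair coin
    for _ in range(n):
        assignments.append(t)
        success = rng.random() < (p_a if t == 1 else p_b)
        if not success:
            t = 1 - t                    # failure: switch treatment
        # success: the next patient stays on the same treatment
    return assignments


seq = play_the_winner(20000, p_a=0.7, p_b=0.5, rng=random.Random(1))
print(sum(seq) / len(seq))               # proportion of patients on A
```

For long sequences the simulated proportion on A settles near $q_B/(q_A + q_B)$, the known limiting allocation for this rule, so the better treatment receives more patients.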
What are the properties of the play-the-winner rule?
What is the proportion of patients in treatment A: