
Date: 2022-10-09 08:27

A/B Testing: Designs and Analysis


Three Essential Components of Statistics (Data Science): Data + Computer + Analytics


1 Introduction

1.1 What is A/B testing?

An A/B test is shorthand for a simple controlled experiment. As the name implies, two versions (A and B) of a single variable are compared; the versions are identical except for one variation that might affect a user’s behavior. A/B tests are widely considered the simplest form of controlled experiment, although adding more variants makes the test more complex.

A/B testing is the process of comparing two variations of a page element, usually by measuring users’ responses to variant A versus variant B, and concluding which of the two variants is more effective.

A/B tests are useful for understanding user engagement with, and satisfaction with, online features such as a new feature or product. Large social media sites like LinkedIn, Facebook, and Instagram use A/B testing to make user experiences more successful and to streamline their services.

Today, A/B tests are being used to run more complex experiments, such as studying network effects when users are offline, how online services affect user actions, and how users influence one another. Many roles rely on data from A/B tests, including data engineers, marketers, designers, software engineers, and entrepreneurs, because these tests allow companies to understand growth, increase revenue, and optimize customer satisfaction.

Version A might be the currently used version (control), while version B is modified in some respect (treatment). For instance, on an e-commerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal decreases in drop-off rates can represent a significant gain in sales. Significant improvements can sometimes, but not always, be seen by testing elements such as copy text, layouts, images, and colors. In these tests, each user sees only one of the two versions, as the goal is to discover which of the two versions is preferable.
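
Because each user should see only one of the two versions, and keep seeing the same one on every visit, online experiments usually assign users deterministically. The sketch below is one illustrative way to do this, not a method described in this course: it hashes a hypothetical experiment name together with a stable user ID (SHA-256 and the function name `assign_variant` are our own assumptions) and maps the result to control (A) or treatment (B).

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to "A" (control) or "B" (treatment).

    The same (experiment, user_id) pair always yields the same variant, so a
    returning user keeps seeing the same version of the page.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the top 53 bits of the first 8 bytes so that dividing by 2**53
    # gives an exactly representable pseudo-uniform number in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return "B" if bucket < treatment_share else "A"


if __name__ == "__main__":
    counts = {"A": 0, "B": 0}
    for i in range(10_000):
        counts[assign_variant(f"user{i}", "checkout-button-test")] += 1
    print(counts)  # close to a 50/50 split
```

Salting the hash with the experiment name keeps the split for one experiment independent of the split for any other experiment running at the same time.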

Controlled experiments have a long and fascinating history. They are sometimes called A/B tests, A/B/C tests (multiple variants), field experiments, randomized controlled experiments, split tests, bucket tests, and flights.

1.2 Online experiments

Example 1. Online A/B testing. (Kohavi and Thomke, 2017, Harvard Business Review) Microsoft, Amazon, Facebook, and Google conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users.

Amazon’s experiment.

Treatment A: Credit card offers on the front page.

Treatment B: Credit card offers on the shopping cart page.

This change (from A to B) boosted profits by tens of millions of US dollars annually.

1.2.1 A/B Testing in eCommerce Industry

Through A/B testing, online stores can increase the average order value, optimize their checkout funnel, reduce the cart abandonment rate, and so on. You might test how and where the shipping cost is displayed; whether and how the free-shipping feature is highlighted; text and color tweaks on the payment or checkout page; the visibility of reviews or ratings; etc.

In the eCommerce industry, Amazon is at the forefront of conversion optimization, partly due to the scale at which it operates and partly due to its immense dedication to providing the best customer experience. Among the many revolutionary practices it brought to the eCommerce industry, the most influential has been ‘1-Click Ordering’. Introduced in the late 1990s after much testing and analysis, 1-Click Ordering lets users make purchases without using the shopping cart at all. Once users enter their default billing card details and shipping address, all they need to do is click the button and wait for the ordered products to be delivered; they do not have to enter their billing and shipping details again when placing later orders. With 1-Click Ordering, it became hard for users to ignore the ease of purchase and go to another store. This change had such a large business impact that Amazon patented it (the patent has since expired) in 1999. In 2000, even Apple licensed it for use in its online store.

People working to optimize Amazon’s website do not have sudden ‘Eureka’ moments for every change they make. It is through continuous and structured A/B testing that Amazon is able to deliver the kind of user experience that it does. Every change on the website is first tested on its audience and then deployed. If you look closely at Amazon’s purchase funnel, you will notice that even though the funnel more or less replicates other websites’ purchase funnels, each and every element in it is fully optimized and matches the audience’s expectations.

Every page, from the homepage to the payment page, contains only the essential details and leads to the exact next step required to push users further into the conversion funnel. Additionally, using extensive user insights and website data, each step is simplified as far as possible to match users’ expectations.

Take their omnipresent shopping cart, for example. There is a small cart icon at the top right of Amazon’s homepage that stays visible no matter which page of the website you are on.

The icon is not just a shortcut to the cart or a reminder of added products. In its current version, it offers 5 options:

(i) Continue shopping (if there are no products added to the cart)

(ii) Learn about today’s deals (if there are no products added to the cart)

(iii) Wish List (if there are no products added to the cart)

(iv) Empty cart

(v) Proceed to checkout (when there are products in the cart). Sign in to turn on 1-Click Checkout (when there are products in the cart).

With one click on a tiny icon offering so many options, the user’s cognitive load is reduced and the user experience improves. As can be seen in the above screenshot, the same cart page also suggests similar products so that customers can navigate back into the website and continue shopping. All this is achieved with one weapon: A/B testing.

1.2.2 A/B Testing in Travel Industry

Through A/B testing, you can increase the number of successful bookings on your website or mobile app, your revenue from ancillary purchases, and much more. You might test your home page search modals, search results page, ancillary product presentation, checkout progress bar, and so on.

In the travel industry, Booking.com easily surpasses all other eCommerce businesses when it comes to using A/B testing for their optimization needs. They test like it’s nobody’s business. From the day of its inception, Booking.com has treated A/B testing as the treadmill that introduces a flywheel effect for revenue. The scale at which Booking.com A/B tests is unmatched, especially when it comes to testing their copy. While you are reading this, there are nearly 1000 A/B tests running on Booking.com’s website.

Even though Booking.com has been A/B testing for more than a decade now, it still thinks there is more it can do to improve user experience, and this is what makes Booking.com the ace in the game. Since the company started, Booking.com has incorporated A/B testing into its everyday work process. It increased its testing velocity to the current rate by eliminating HiPPOs (decisions driven by the highest-paid person’s opinion) and giving priority to data above everything else. To increase testing velocity even further, all of Booking.com’s employees were allowed to run tests on ideas they thought could help grow the business.

This example demonstrates the lengths to which Booking.com will go to optimize its users’ interaction with the website. Booking.com decided to broaden its reach in 2017 by offering vacation rental properties alongside hotels. This led Booking.com to partner with Outbrain, a native advertising platform, to help grow its global property owner registration.

Within the first few days of the launch, the team at Booking.com realized that even though a lot of property owners completed the first sign-up step, they got stuck in the next steps. At this time, pages built for the paid search of their native campaigns were used for the sign-up process.

The two teams decided to work together and created three versions of landing page copy for Booking.com. Additional details such as social proof, awards and recognitions, and user rewards were added to the variations.

The test ran for two weeks and produced a 25% uplift in owner registration. The test results also showed a significant decrease in the cost of each registration.

1.2.3 A/B Testing in B2B/SaaS Industry

Generate high-quality leads for your sales team, increase the number of free trial requests, attract your target buyers, and perform other such actions by testing and polishing important elements of your demand generation engine. To reach these goals, marketing teams put the most relevant content on their website, send out ads to prospective buyers, conduct webinars, run special sales, and much more. But all their effort would go to waste if the landing page to which clients are directed is not fully optimized to give the best user experience. The aim of SaaS (Software as a Service) A/B testing is to provide the best user experience and to improve conversions. You can try testing your lead form components, free trial sign-up flow, homepage messaging, CTA (call-to-action) text, social proof on the home page, and so on.

POSist, a leading SaaS-based restaurant management platform with more than 5,000 customers at over 100 locations across six countries, wanted to increase its demo requests. Its website homepage and Contact Us page are the most important pages in its funnel, and the team at POSist wanted to reduce drop-off on these pages. To achieve this, the team created two variations of the homepage as well as two variations of the Contact Us page to be tested. Let’s take a look at the changes made to the homepage. This is what the control looked like:

The team at POSist hypothesized that adding more relevant and conversion-focused content to the website would improve user experience as well as generate higher conversions. So they created two variations to be tested against the control.

Control was first tested against Variation 1, and the winner was Variation 1. To further improve the page, Variation 1 was then tested against Variation 2, and the winner was Variation 2. The new variation increased page visits by about 5%.

1.3 Clinical trials

Example 2. HIV transmission. Connor et al. (1994, The New England Journal of Medicine) report a clinical trial to evaluate the drug AZT in reducing the risk of maternal-infant HIV transmission. A 50-50 randomization scheme was used:

AZT group: 239 pregnant women (20 HIV-positive infants).

Placebo group: 238 pregnant women (60 HIV-positive infants).
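
As a rough illustration of how strong the evidence in Example 2 is, the sketch below applies a standard pooled two-proportion z-test to the counts above: 20/239 ≈ 8.4% versus 60/238 ≈ 25.2% HIV-positive infants. The normal-approximation test is our choice for illustration; the trial’s own analysis may have differed.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test of H0: p1 == p2 (normal approximation).

    Returns the two observed proportions, the z statistic, and the
    two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value


p_azt, p_placebo, z, p = two_proportion_z_test(20, 239, 60, 238)
print(f"AZT: {p_azt:.3f}  placebo: {p_placebo:.3f}  z = {z:.2f}  p = {p:.1e}")
```

The z statistic is close to -4.9, so under a fixed 50-50 design the trial produced strong evidence for AZT, which is exactly what motivates the ethical argument that follows.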

Given the seriousness of the outcome of this study, it is reasonable to argue that 50-50 allocation was unethical. As accruing information favoring (albeit not conclusively) the AZT treatment became available, allocation probabilities should have been shifted away from 50-50, in proportion to the weight of evidence for AZT. Designs that attempt to do this are called response-adaptive designs (response-adaptive randomization).

If the treatment assignments had been made with the doubly adaptive biased coin design (DBCD; Hu and Zhang, 2004, Annals of Statistics) with the urn target:

AZT group: 360 patients

Placebo group: 117 patients

then only about 60 (instead of 80) infants would have been HIV positive.
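
The numbers above can be approximately reproduced. Assuming, as in the randomized play-the-winner rule, that the urn target for the share of patients on AZT is ρ = q_B / (q_A + q_B), where q_A = 20/239 and q_B = 60/238 are the estimated probabilities of an HIV-positive infant, gives ρ ≈ 0.75, or about 358 of 477 patients (close to the 360 above). The sketch also includes an allocation function of the form we attribute to Hu and Zhang (2004), which steers each new assignment toward ρ; the tuning value γ = 2 and all function names are our own illustrative choices. Finally, it checks the arithmetic 360 × 20/239 + 117 × 60/238 ≈ 59.6, assuming the observed infection rates would stay the same under the new allocation.

```python
def urn_target(q_a: float, q_b: float) -> float:
    """Target share on treatment A: rho = q_B / (q_A + q_B), where q_k is the
    failure probability on treatment k (here: an HIV-positive infant)."""
    return q_b / (q_a + q_b)


def hu_zhang_g(x: float, rho: float, gamma: float = 2.0) -> float:
    """Probability that the next patient is assigned to A, given the current
    share x of patients on A and the estimated target rho.

    gamma >= 0 controls how strongly allocation is pulled back toward rho
    (gamma = 0 simply randomizes with probability rho).
    """
    if x <= 0.0:
        return 1.0
    if x >= 1.0:
        return 0.0
    a = rho * (rho / x) ** gamma
    b = (1 - rho) * ((1 - rho) / (1 - x)) ** gamma
    return a / (a + b)


q_azt, q_placebo = 20 / 239, 60 / 238  # observed HIV-positive rates
rho = urn_target(q_azt, q_placebo)
print(f"target share on AZT: {rho:.3f} (about {rho * 477:.0f} of 477 patients)")

# Expected HIV-positive infants, assuming the observed rates stay the same.
expected_50_50 = 239 * q_azt + 238 * q_placebo   # equals the observed 80
expected_dbcd = 360 * q_azt + 117 * q_placebo    # allocation from the slide
print(f"expected HIV-positive infants: {expected_50_50:.1f} vs {expected_dbcd:.1f}")

# When AZT is under-allocated (x < rho), the next assignment favors AZT.
print(f"g(0.60, rho) = {hu_zhang_g(0.60, rho):.3f}")
```

In an actual DBCD trial, ρ would be re-estimated from the responses observed so far before each assignment; the sketch shows only the building blocks.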

Example 3: Remdesivir COVID-19 trial (China). The trial of Remdesivir in adults with severe COVID-19 (Wang et al. 2020) is a randomized, double-blind, placebo-controlled, multicentre trial that aimed to compare Remdesivir with placebo. There were 236 patients in the trial, with about 20 baseline covariates for each patient, including 10 continuous variables (e.g. age and white blood cell count) and 10 discrete variables (e.g. gender and hypertension). A permuted block randomization procedure (30 patients per block), stratified according to the level of respiratory support, was implemented. At the end of this trial, some important imbalances existed at enrollment between the groups, including more patients with hypertension, diabetes, or coronary artery disease in the Remdesivir group than in the placebo group.
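
The sketch below illustrates stratified permuted block randomization of the kind described in Example 3: within each stratum, patients are allocated from blocks that each contain a fixed number of treatment and placebo slots in random order. The stratum labels, the 1:1 default ratio, and the function names are illustrative assumptions; the example above gives only the block size of 30 and the stratification variable.

```python
import random

def permuted_block_schedule(n_blocks, block_size, ratio=(1, 1),
                            arms=("treatment", "placebo"), rng=None):
    """Allocation list made of n_blocks randomly permuted blocks."""
    rng = rng or random.Random()
    total = sum(ratio)
    if block_size % total:
        raise ValueError("block_size must be a multiple of sum(ratio)")
    # One block: block_size * r / total slots for each arm, before shuffling.
    block = [arm for arm, r in zip(arms, ratio)
             for _ in range(block_size * r // total)]
    schedule = []
    for _ in range(n_blocks):
        shuffled = block[:]
        rng.shuffle(shuffled)   # random order within the block
        schedule.extend(shuffled)
    return schedule


def stratified_schedules(strata, n_blocks, block_size, ratio=(1, 1), seed=0):
    """One independent permuted-block list per stratum."""
    rng = random.Random(seed)
    return {s: permuted_block_schedule(n_blocks, block_size, ratio, rng=rng)
            for s in strata}


if __name__ == "__main__":
    # Hypothetical strata standing in for levels of respiratory support.
    lists = stratified_schedules(["support level 1", "support level 2"],
                                 n_blocks=2, block_size=30)
    for stratum, schedule in lists.items():
        print(stratum, schedule[:6], schedule.count("treatment"), "/", len(schedule))
```

Note that this balances only the number of patients per arm within each stratum and each completed block. It does nothing to balance other covariates such as hypertension or diabetes, which is why imbalances like those reported in Example 3 can still occur.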

Example 4: Moderna COVID-19 vaccine trial (2020). The trial began on July 27, 2020, and enrolled 30,420 adult volunteers at clinical research sites across the United States. Volunteers were randomly assigned 1:1 to receive either two 100 microgram (mcg) doses of the investigational vaccine or two shots of saline placebo 28 days apart. The average age of volunteers is 51 years. Approximately 47% are female, 25% are 65 years or older, and 17% are under the age of 65 with medical conditions placing them at higher risk for severe COVID-19. Approximately 79% of participants are white, 10% are Black or African American, 5% are Asian, 0.8% are American Indian or Alaska Native, 0.2% are Native Hawaiian or Other Pacific Islander, 2% are multiracial, and 21% (of any race) are Hispanic or Latino.

From the start of the trial through Nov. 25, 2020, investigators recorded 196 cases of symptomatic COVID-19 occurring among participants at least 14 days after they received their second shot. One hundred and eighty-five cases (30 of which were classified as severe COVID-19) occurred in the placebo group, and 11 cases (none of which were classified as severe COVID-19) occurred in the group receiving mRNA-1273. The incidence of symptomatic COVID-19 was 94.1% lower in participants who received mRNA-1273 than in those receiving placebo.

Investigators observed 236 cases of symptomatic COVID-19 among participants at least 14 days after they received their first shot, with 225 cases in the placebo group and 11 cases in the group receiving mRNA-1273. The vaccine efficacy was 95.2% for this secondary analysis.

Long-term Treatment Effects?
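
The efficacy figures in Example 4 can be approximately reproduced from the case counts. Vaccine efficacy (VE) is one minus the ratio of attack rates in the two arms; with 1:1 randomization, and assuming similar numbers at risk and similar follow-up in both arms (a simplification on our part), the arm sizes cancel and VE ≈ 1 − cases_vaccine / cases_placebo.

```python
def vaccine_efficacy(cases_vaccine, cases_placebo, n_vaccine=1.0, n_placebo=1.0):
    """VE = 1 - (attack rate in vaccine arm) / (attack rate in placebo arm).

    With the default n_vaccine == n_placebo this reduces to
    1 - cases_vaccine / cases_placebo (equal arms and follow-up assumed).
    """
    return 1 - (cases_vaccine / n_vaccine) / (cases_placebo / n_placebo)


print(f"{vaccine_efficacy(11, 185):.1%}")  # from second dose: 94.1%
print(f"{vaccine_efficacy(11, 225):.1%}")  # from first dose: 95.1%
```

The second value differs slightly from the reported 95.2%, presumably because the official analysis accounts for follow-up time rather than using raw case counts.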

1.4 Economics and Social Science

Political A/B testing

A/B tests are not limited to corporations; they also drive political campaigns. In 2007, Barack Obama’s presidential campaign used A/B testing to attract online attention and understand what voters wanted to see from the presidential candidate. For example, Obama’s team tested four distinct buttons on their website that led users to sign up for newsletters. Additionally, the team used six different accompanying images to draw in users. Through A/B testing, staffers were able to determine how to effectively draw in voters and garner additional interest.

Example 5. Project GATE (Growing America Through Entrepreneurship), sponsored by the U.S. Department of Labor, was designed to evaluate the impact of offering tuition-free entrepreneurship training services (GATE services) on helping clients create, sustain, or expand their own businesses.

(https://www.doleta.gov/reports/projectgate/)

The cornerstone is complete randomization. Members of the treatment group were offered GATE services; members of the control group were not.

n = 4,198 participants

p = 105 covariates

1.5 Biological, psychological, and agricultural research

Controlled experiments were mainly developed in these areas in 1900-1950.

Road Map of this course:

(i) The history of experiment design;

(ii) A/B testing in medical studies;

(iii) Online controlled experiments (A/B testing).


2 The history of experiment design

2.1 Experiment design before Fisher

Statistical experiments, following Charles S. Peirce

A theory of statistical inference was developed by Charles S. Peirce in “Illustrations of the Logic of Science” (1877–1878) and “A Theory of Probable Inference” (1883), two publications that emphasized the importance of randomization-based inference in statistics.

Randomized experiments: Charles S. Peirce randomly assigned volunteers to a blinded, repeated-measures design to evaluate their ability to discriminate weights. Peirce’s experiment inspired other researchers in psychology and education, who developed a research tradition of randomized experiments in laboratories and specialized textbooks in the 1800s.

Optimal designs for regression models:

Charles S. Peirce also contributed the first English-language publication on an optimal design for regression models in 1876. A pioneering optimal design for polynomial regression was suggested by Gergonne in 1815. In 1918, Kirstine Smith published optimal designs for polynomials of degree six (and less).

2.2 Fisher’s principles

A methodology for designing experiments was proposed by Ronald Fisher in his innovative books The Arrangement of Field Experiments (1926) and The Design of Experiments (1935). Much of his pioneering work dealt with agricultural applications of statistical methods. As a mundane example, he described how to test the lady tasting tea hypothesis, that a certain lady could distinguish by flavour alone whether the milk or the tea was first placed in the cup. These methods have been broadly adapted in biological, psychological, and agricultural research.

2.2.1 Comparison

In some fields of study it is not possible to have independent measurements to a traceable metrology standard. Comparisons between treatments are much more valuable and are usually preferable; treatments are often compared against a scientific control or a traditional treatment that acts as a baseline.

2.2.2 Randomization

Random assignment is the process of assigning individuals at random to different groups in an experiment, so that every individual has the same probability of receiving a given treatment. The random assignment of individuals to groups (or conditions within a group) distinguishes a rigorous, “true” experiment from an observational study or “quasi-experiment”. There is an extensive body of mathematical theory that explores the consequences of allocating units to treatments by means of some random mechanism (such as tables of random numbers, or randomization devices such as playing cards or dice). Assigning units to treatments at random tends to mitigate confounding, which makes effects due to factors other than the treatment appear to result from the treatment.

The risks associated with random allocation (such as having a serious imbalance in a key characteristic between a treatment group and a control group) are calculable and hence can be managed down to an acceptable level by using enough experimental units. However, if the population is divided into several subpopulations that somehow differ, and the research requires each subpopulation to be equal in size, stratified sampling can be used. In that way, the units in each subpopulation are randomized, but not the whole sample. The results of an experiment can be generalized reliably from the experimental units to a larger statistical population of units only if the experimental units are a random sample from the larger population; the probable error of such an extrapolation depends on the sample size, among other things.
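
The claim that the risk of a serious chance imbalance is calculable, and shrinks as units are added, can be made concrete. Under complete 1:1 randomization with n units per arm, the difference between arms in the proportion of units having a binary baseline characteristic of prevalence p has a standard deviation of roughly √(2p(1 − p)/n). The sketch uses a normal approximation (our assumption) to estimate the probability that the arms differ by more than a chosen threshold δ; the prevalence 0.3 and threshold 0.10 are illustrative.

```python
import math

def prob_imbalance(n_per_arm: int, prevalence: float, delta: float) -> float:
    """Approximate P(|p_treatment - p_control| > delta) for a binary baseline
    covariate under complete 1:1 randomization (normal approximation)."""
    sd = math.sqrt(2 * prevalence * (1 - prevalence) / n_per_arm)
    # Two-sided normal tail: P(|Z| > delta / sd) = erfc(delta / (sd * sqrt(2))).
    return math.erfc(delta / (sd * math.sqrt(2)))


for n in (25, 100, 400, 1600):
    # Covariate present in 30% of units; "serious" imbalance = 10 percentage points.
    print(n, f"{prob_imbalance(n, prevalence=0.3, delta=0.10):.4f}")
```

With these illustrative numbers the probability falls from roughly 0.44 at 25 units per arm to about 0.12 at 100 and well under 1% at 400.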

2.2.3 Statistical replication

Measurements are usually subject to variation and measurement uncertainty; thus they are repeated and full experiments are replicated to help identify the sources of variation, to better estimate the true effects of treatments, to further strengthen the experiment’s reliability and validity, and to add to the existing knowledge of the topic.

However, certain conditions must be met before the replication of the experiment is commenced: the original research question has been published in a peer-reviewed journal or widely cited, the researcher is independent of the original experiment, the researcher must first try to replicate the original findings using the original data, and the write-up should state that the study conducted is a replication study that tried to follow the original study as strictly as possible.

2.2.4 Blocking

Blocking is the non-random arrangement of experimental units into groups (blocks) consisting of units that are similar to one another. Blocking reduces known but irrelevant sources of variation between units and thus allows greater precision in the estimation of the source of variation under study.

2.2.5 Orthogonality

Orthogonality concerns the forms of comparison (contrasts) that can be legitimately and efficiently carried out. Contrasts can be represented by vectors, and sets of orthogonal contrasts are uncorrelated and independently distributed if the data are normal. Because of this independence, each orthogonal contrast provides different information from the others. If there are T treatments and T − 1 orthogonal contrasts, all the information that can be captured from the experiment is obtainable from the set of contrasts.
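
A small illustration with T = 4 treatments: the three Helmert-type contrasts below (an illustrative choice; many other orthogonal sets exist) each sum to zero and are mutually orthogonal, so T − 1 = 3 of them carry all the between-treatment information. The treatment means are hypothetical.

```python
from itertools import combinations

# Helmert-type contrasts for T = 4 treatments.
contrasts = [
    [1, -1, 0, 0],   # treatment 1 vs treatment 2
    [1, 1, -2, 0],   # mean of 1 and 2 vs treatment 3
    [1, 1, 1, -3],   # mean of 1, 2 and 3 vs treatment 4
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Each row is a contrast: its coefficients sum to zero.
assert all(sum(c) == 0 for c in contrasts)
# Pairwise orthogonality: with equal replication per treatment, the
# estimated contrasts are uncorrelated (independent if the data are normal).
assert all(dot(u, v) == 0 for u, v in combinations(contrasts, 2))

means = [10.0, 12.0, 11.0, 15.0]  # hypothetical treatment means
estimates = [dot(c, means) for c in contrasts]
print(estimates)
```

With n replicates per treatment and error variance σ², each estimate has variance σ²/n times the sum of its squared coefficients.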

Example 2.1. Measurement Error: We would like to measure the weight of a subject A by using a scale. We know that the scale has measurement error, and we suppose that the error follows a normal distribution with mean 0 and variance $\sigma^2$. Mathematically, we may write:

$$w_1 = A + e_1,$$

where $A$ is the true weight, $w_1$ is the observed weight, and $e_1 \sim N(0, \sigma^2)$ is the measurement error.

Figure 1: A scale to measure subject A

Now we would like to measure the weights of two subjects A and B by using the same scale twice. What should we do?

Method 1 (weigh each subject separately):

$$w_1 = A + e_1 \quad \text{and} \quad w_2 = B + e_2.$$

Figure 2: Subject B

Method 2 (weigh A and B together, then weigh A against B):

$$w_3 = A + B + e_3 \quad \text{and} \quad w_4 = A - B + e_4.$$

Figure 3: A + B

Figure 4: A - B

The measurement errors (assuming $e_1, \dots, e_4$ are independent):

Method 1:

Subject A: $\hat{A} = w_1$, with error $e_1 \sim N(0, \sigma^2)$.

Subject B: $\hat{B} = w_2$, with error $e_2 \sim N(0, \sigma^2)$.

Method 2:

Subject A: $\hat{A} = (w_3 + w_4)/2$, with error $(e_3 + e_4)/2 \sim N(0, \sigma^2/2)$.

Subject B: $\hat{B} = (w_3 - w_4)/2$, with error $(e_3 - e_4)/2 \sim N(0, \sigma^2/2)$.

Method 2 uses the same number of weighings but halves the variance of each estimate, because every weighing carries information about both subjects.
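
A quick Monte Carlo check of the variances in Example 2.1, under the same assumption of independent N(0, σ²) errors. The true weights A = 5 and B = 3, σ = 1, and the seed are arbitrary illustrative values.

```python
import random

def pvar(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def simulate(n_reps=100_000, A=5.0, B=3.0, sigma=1.0, seed=1):
    rng = random.Random(seed)
    m1_A, m2_A, m2_B = [], [], []
    for _ in range(n_reps):
        # Method 1: weigh A alone (weighing B alone behaves identically).
        w1 = A + rng.gauss(0, sigma)
        # Method 2: weigh A and B together, then A against B.
        w3 = A + B + rng.gauss(0, sigma)
        w4 = A - B + rng.gauss(0, sigma)
        m1_A.append(w1 - A)                 # error of A-hat = w1
        m2_A.append((w3 + w4) / 2 - A)      # error of A-hat = (w3 + w4) / 2
        m2_B.append((w3 - w4) / 2 - B)      # error of B-hat = (w3 - w4) / 2
    return pvar(m1_A), pvar(m2_A), pvar(m2_B)


v1_A, v2_A, v2_B = simulate()
print(f"method 1: {v1_A:.3f}   method 2: {v2_A:.3f} (A), {v2_B:.3f} (B)")
```

The simulated variances come out close to σ² = 1 for Method 1 and σ²/2 = 0.5 for both estimates under Method 2.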

Factorial experiments are used instead of the one-factor-at-a-time method. They are efficient at evaluating the effects and possible interactions of several factors (independent variables). Analysis of experiment design is built on the foundation of the analysis of variance, a collection of models that partition the observed variance into components according to the factors the experiment must estimate or test.
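
A minimal illustration of why factorial designs are efficient: in a 2 × 2 design, every observation contributes to the estimates of both main effects and of the interaction, whereas a one-factor-at-a-time experiment uses only part of the data for each effect and cannot estimate the interaction at all. The cell means below are hypothetical, and factor levels are coded −1/+1.

```python
# Hypothetical cell means; keys are (level of A, level of B), coded -1/+1.
cell_means = {(-1, -1): 20.0, (+1, -1): 30.0, (-1, +1): 25.0, (+1, +1): 45.0}

def effect(sign):
    """Mean response where sign(cell) > 0 minus mean response where it is < 0."""
    plus = [y for cell, y in cell_means.items() if sign(cell) > 0]
    minus = [y for cell, y in cell_means.items() if sign(cell) < 0]
    return sum(plus) / len(plus) - sum(minus) / len(minus)

main_A = effect(lambda c: c[0])                 # high A vs low A
main_B = effect(lambda c: c[1])                 # high B vs low B
interaction_AB = effect(lambda c: c[0] * c[1])  # does A's effect depend on B?
print(main_A, main_B, interaction_AB)
```

Here the effect of A is 10 when B is low and 20 when B is high; the nonzero interaction captures exactly that difference.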

2.2.6 Avoiding false positives

False positive conclusions, often resulting from the pressure to publish or the author’s own confirmation bias, are an inherent hazard in many fields. A good way to prevent biases that can lead to false positives in the data collection phase is to use a double-blind design. When a double-blind design is used, participants are randomly assigned to experimental groups, but neither the participants nor the researcher know which participants belong to which group. Therefore, the researcher cannot affect the participants’ response to the intervention.

Experimental designs with undisclosed degrees of freedom are a problem. They can lead to conscious or unconscious “p-hacking”: trying multiple things until you get the desired result. It typically involves the manipulation, perhaps unconsciously, of the process of statistical analysis and the degrees of freedom until they return a figure below the p < .05 level of statistical significance.
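
The cost of trying multiple things can be quantified under a simplifying assumption of ours: if k independent analyses are run on data where every null hypothesis is true, the chance that at least one gives p < .05 is 1 − 0.95^k. Real analysis choices are usually correlated, so this shows the direction of the effect rather than an exact rate. A Bonferroni correction (testing each analysis at α/k) is shown for comparison.

```python
def familywise_error(k: int, alpha: float = 0.05) -> float:
    """P(at least one p < alpha) among k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 10, 20):
    naive = familywise_error(k)
    bonferroni = familywise_error(k, alpha=0.05 / k)
    print(f"k={k:2d}  naive: {naive:.3f}  Bonferroni: {bonferroni:.3f}")
```

With 20 independent attempts, the chance of at least one "significant" result is about 64% even though there is no real effect anywhere.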

So the design of the experiment should include a clear statement proposing the analyses to be undertaken. P-hacking can be prevented by preregistering studies: researchers send their data analysis plan to the journal in which they wish to publish before they even start data collection, so no data manipulation is possible.

Another way to prevent this is to extend the double-blind design to the data-analysis phase: the data are sent to a data analyst unrelated to the research, who scrambles the data so that there is no way to know which group participants belong to before any of them are removed as outliers.

2.2.7 Causal attributions

In the pure experimental design, the independent (predictor) variable is manipulated by the researcher; that is, every participant of the research is chosen randomly from the population, and each participant chosen is assigned randomly to conditions of the independent variable. Only when this is done is it possible to certify with high probability that the differences in the outcome variables are caused by the different conditions. Therefore, researchers should choose the experimental design over other design types whenever possible.

However, the nature of the independent variable does not always allow for manipulation. In those cases, researchers must be careful not to make causal attributions when their design does not allow for them. For example, in observational designs, participants are not assigned randomly to conditions, so if differences are found in outcome variables between conditions, it is likely that something other than the differences between the conditions causes the differences in outcomes, that is, a third variable. The same goes for studies with a correlational design (Adèr & Mellenbergh, 2008).

2.2.8 Statistical control

It is best that a process be in reasonable statistical control prior to conducting designed experiments. When this is not possible, proper blocking, replication, and randomization allow for the careful conduct of designed experiments. To control for nuisance variables, researchers institute control checks as additional measures. Investigators should ensure that uncontrolled influences (e.g., source credibility perception) do not skew the findings of the study. A manipulation check is one example of a control check. Manipulation checks allow investigators to isolate the chief variables to strengthen support that these variables are operating as planned.

One of the most important requirements of experimental research designs is the necessity of eliminating the effects of spurious, intervening, and antecedent variables. In the most basic model, cause (X) leads to effect (Y). But there could be a third variable (Z) that influences (Y), and X might not be the true cause at all. Z is said to be a spurious variable and must be controlled for. The same is true for intervening variables (a variable in between the supposed cause (X) and the effect (Y)), and antecedent variables (a variable prior to the supposed cause (X) that is the true cause). When a third variable is involved and has not been controlled for, the relation is said to be a zero order relationship. In most practical applications of experimental research designs there are several causes (X1, X2, X3). In most designs, only one of these causes is manipulated at a time.

2.3 Experimental designs after Fisher

Some efficient designs for estimating several main effects were found independently and in near succession by Raj Chandra Bose and K. Kishen in 1940 at the Indian Statistical Institute, but remained little known until the Plackett–Burman designs were published in Biometrika in 1946. About the same time, C. R. Rao introduced the concepts of orthogonal arrays as experimental designs. This concept played a central role in the development of Taguchi methods by Genichi Taguchi, which took place during his visit to the Indian Statistical Institute in the early 1950s. His methods were successfully applied and adopted by Japanese and Indian industries and subsequently were also embraced by US industry, albeit with some reservations.

In 1950, Gertrude Mary Cox and William Gemmell Cochran published the book Experimental Designs, which became the major reference work on the design of experiments for statisticians for years afterwards. Developments of the theory of linear models have encompassed and surpassed the cases that concerned early writers. Today, the theory rests on advanced topics in linear algebra, algebra and combinatorics.

As with other branches of statistics, experimental design is pursued using both frequentist and Bayesian approaches: in evaluating statistical procedures like experimental designs, frequentist statistics studies the sampling distribution while Bayesian statistics updates a probability distribution on the parameter space.

Some important contributors to the field of experimental designs are C. S. Peirce, R. A. Fisher, F. Yates, R. C. Bose, A. C. Atkinson, R. A. Bailey, D. R. Cox, G. E. P. Box, W. G. Cochran, W. T. Federer, V. V. Fedorov, A. S. Hedayat, J. Kiefer, O. Kempthorne, J. A. Nelder, Andrej Pázman, Friedrich Pukelsheim, D. Raghavarao, C. R. Rao, S. S. Shrikhande, J. N. Srivastava, William J. Studden, G. Taguchi, and H. P. Wynn.

The textbooks of D. Montgomery, R. Myers, and G. Box/W. Hunter/J. S. Hunter have reached generations of students and practitioners.

Some discussion of experimental design in the context of system identification (model building for static or dynamic models) is given in [35] and [36].

2.4 Sequences of experiments

The use of a sequence of experiments, where the design of each may depend on the results of previous experiments, including the possible decision to stop experimenting, is within the scope of sequential analysis, a field that was pioneered by Abraham Wald in the context of sequential tests of statistical hypotheses. Herman Chernoff wrote an overview of optimal sequential designs, while adaptive designs have been surveyed by S. Zacks. One specific type of sequential design is the “two-armed bandit”, generalized to the multi-armed bandit, on which early work was done by Herbert Robbins in 1952.
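
To make sequential analysis concrete, here is a minimal sketch of Wald’s sequential probability ratio test (SPRT) for a success probability, using his approximate stopping boundaries log((1 − β)/α) and log(β/(1 − α)). This is a sequential test, not a bandit algorithm; the hypotheses p0 and p1, the error rates α and β, and the function name are illustrative inputs of ours.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for H0: p = p0 vs H1: p = p1 (p1 > p0) on 0/1 data.

    Returns (decision, n_used), where decision is "accept H1", "accept H0",
    or "continue" if the data run out before a boundary is crossed.
    """
    upper = math.log((1 - beta) / alpha)   # crossing above -> accept H1
    lower = math.log(beta / (1 - alpha))   # crossing below -> accept H0
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        # Log-likelihood ratio contribution of one Bernoulli observation.
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue", len(observations)


print(sprt_bernoulli([1, 1, 0, 1, 1, 1], p0=0.3, p1=0.6))
```

Unlike a fixed-sample test, the number of observations is itself random: the test stops as soon as the accumulated evidence crosses either boundary, which is what allows the "decision to stop experimenting" described above.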

2.5 Human participant constraints

Laws and ethical considerations preclude some carefully designed

experiments with human subjects. Legal constraints are dependent on

jurisdiction. Constraints may involve institutional review boards,

informed consent and confidentiality affecting both clinical (medical)

trials and behavioral and social science experiments.[37] In the field of

toxicology, for example, experimentation is performed on laboratory

animals with the goal of defining safe exposure limits for humans.

Balancing the constraints are views from the medical field.[39] Regarding the randomization of patients, “... if no one knows which therapy is better, there is no ethical imperative to use one therapy or another.” (p. 380) Regarding experimental design, “... it is clearly not ethical to place subjects at risk to collect data in a poorly designed study when this situation can be easily avoided ...” (p. 393)

2.6 Some important issues to design experiments

Clear and complete documentation of the experimental methodology is also important in order to support replication of results.

Discussion topics when setting up an experimental design

An experimental design or randomized clinical trial requires careful consideration of several factors before the experiment is actually carried out. An experimental design is a detailed experimental plan laid out in advance of doing the experiment. Some of the following topics have already been discussed in the principles of experimental design section:

1) How many factors does the design have, and are the levels of these factors fixed or random?

2) Are control conditions needed, and what should they be?

3) Manipulation checks: did the manipulation really work?

4) What are the background variables?

5) What is the sample size? How many units must be collected for the experiment to be generalisable and have enough power?

6) What is the relevance of interactions between factors?

7) What is the influence of delayed effects of substantive factors on

outcomes?

8) How do response shifts affect self-report measures?

9) How feasible is repeated administration of the same measurement

instruments to the same units at different occasions, with a post-test

and follow-up tests?

10) What about using a proxy pretest?

11) Are there lurking variables?

12) Should the client/patient, researcher or even the analyst of the

data be blind to conditions?

13) What is the feasibility of subsequent application of different

conditions to the same units?

14) How many control factors and how many noise factors should be taken into account?

15) How should missing values be handled?

16) What are good metrics?

........

The independent variable of a study often has many levels or different groups. In a true experiment, researchers can have an experimental group, in which the intervention testing the hypothesis is implemented, and a control group, which has all the same elements as the experimental group but without the interventional element. Thus, when everything except the one intervention is held constant, researchers can conclude with some confidence that this one element caused the observed change. In some instances, having a control group is not ethical; this is sometimes addressed by using two different experimental groups. In some cases, independent variables cannot be manipulated, for example when testing the difference between two groups who have different diseases, or testing the difference between genders (variables to which it would obviously be hard or unethical to assign participants). In these cases, a quasi-experimental design may be used.

3 A/B tests (Randomized Control Studies) in clinical trials

3.1 Drug development

Drug development is a complex and lengthy process that takes 7 to 15 years for a single drug, at a cost that may reach hundreds of millions of dollars. There are three main parts of the drug development process:

Discovery and decision;

Preclinical studies;

Clinical studies.

Discovery and Decision

The process starts with the discovery of a new compound or of a new potential application of an existing compound. If the results are adequate, a decision is then made on whether to develop the drug.

Preclinical Studies

The initial toxicology of the compound is studied in animals. Initial formulation development of the drug and specific or comprehensive pharmacological studies in animals are also performed at this stage. At the end of the preclinical study, the company assesses the evidence of potential safety and effectiveness of the drug.

To proceed further, a US-based company needs to file a Notice of Claimed Investigational New Drug Exemption (which allows the company to conduct studies on human subjects).

Clinical Studies

Once there is sufficient evidence that the drug will benefit human subjects, testing the drug in human subjects is the next step.

Phase I clinical trial: To establish initial safety information about the effect of the drug on humans, such as the range of acceptable dosages and the pharmacokinetics of the drug. These studies are normally conducted with healthy volunteers. The number of subjects typically ranges from 4 to 20 per study, with up to 100 subjects in total used over the course of Phase I trials.

Phase II clinical trial: These studies are conducted in patients who could potentially benefit from the new drug. Effective dose ranges and initial effects of the drug on these patients are assessed. Usually up to several hundred patients are enrolled in Phase II trials.

Phase III clinical trial: Phase III studies provide an assessment of safety, efficacy, and optimum dosage. These studies are designed with control and treatment groups. Usually hundreds or even thousands of patients are involved in Phase III trials.

Based on successful results obtained from these studies, the company can then submit an NDA (New Drug Application). The application contains the results from all three stages (from discovery to Phase III) and is reviewed by the FDA.

The FDA review panel for the NDA consists of reviewers in the following areas: medicine, pharmacology, biopharmaceutics, chemistry, and statistics.

Phase IV: Postmarketing activities. Follow-up studies are conducted to examine the long-term effects of the drug. The main purpose of these studies is to ensure that all claims made by the company about the new drug can be substantiated by so-called “clinical evidence”. All reported adverse effects must also be investigated by the company, and in some cases the drug may need to be withdrawn from the market.

Statistician’s Responsibilities:

Participate in the development plan for studying a drug.

Study design and protocol development. Randomization schemes.

Data cleaning and database construction format.

Analysis plan and program development for analysis.

Report preparation. Produce tables and figures.

Integrate clinical study results, safety and efficacy reports.

Communication and NDA defense to FDA review panel.

Publication support and consulting with other company personnel.

Example 3.1. HIV transmission. Connor et al. (1994, The New England Journal of Medicine) report a clinical trial to evaluate the drug AZT in reducing the risk of maternal-infant HIV transmission. A 50-50 randomization scheme was used:

AZT group (A): 239 pregnant women (20 HIV-positive infants).

Placebo group (B): 238 pregnant women (60 HIV-positive infants).
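As a quick check on how strong the evidence in Example 3.1 is, the sketch below runs a pooled two-sample z-test for equal transmission rates. The choice of test (normal approximation, two-sided, pooled variance) is an assumption made for illustration; it is not presented as the analysis reported by Connor et al.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Pooled two-sample z-test of H0: p_A = p_B (two-sided, normal approximation)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value

# Counts of HIV-positive infants in each arm of Example 3.1
rate_azt, rate_placebo, z, p = two_proportion_z_test(20, 239, 60, 238)
print(f"AZT: {rate_azt:.3f}, placebo: {rate_placebo:.3f}, z = {z:.2f}, p = {p:.2g}")
```

The transmission rate is roughly a third as high on AZT, and the difference is far beyond what chance alone would plausibly produce, which is what makes the ethics of continuing 50-50 allocation debatable.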

Given the seriousness of the outcome of this study, it is reasonable to argue that 50-50 allocation was unethical. As accruing information favoring (albeit not conclusively) the AZT treatment became available, allocation probabilities should have been shifted away from 50-50 toward allocation proportional to the weight of evidence for AZT. Designs that attempt to do this are called response-adaptive designs (response-adaptive randomization).

If the treatment assignments had been made with the doubly-adaptive biased coin design (DBCD; Hu and Zhang, 2004, Annals of Statistics) using the urn target:

AZT group: 360 patients;

placebo group: 117 patients;

then only 60 (instead of 80) infants would have been HIV positive.
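The DBCD numbers can be roughly reproduced by plugging the observed transmission rates into allocation targets. In this sketch the urn target is assumed to be the limiting allocation of a randomized play-the-winner urn, $\rho_A = q_B/(q_A+q_B)$, where $q$ is the probability of an HIV-positive infant (a failure); the Neyman target $\sqrt{p_Aq_A}/(\sqrt{p_Aq_A}+\sqrt{p_Bq_B})$ is included for comparison. Treating the observed rates as the true rates is a simplifying assumption.

```python
from math import sqrt

# Observed HIV-transmission (failure) rates in Example 3.1, treated as true rates
q_a = 20 / 239   # AZT
q_b = 60 / 238   # placebo
p_a, p_b = 1 - q_a, 1 - q_b
n = 477          # total number of pregnant women

# Urn (RPW-type) target: share of A proportional to B's failure rate
rho_urn = q_b / (q_a + q_b)
# Neyman allocation: minimizes Var(p_A_hat - p_B_hat) for a fixed total n
rho_neyman = sqrt(p_a * q_a) / (sqrt(p_a * q_a) + sqrt(p_b * q_b))

for name, rho in [("urn", rho_urn), ("Neyman", rho_neyman)]:
    n_a = round(n * rho)
    print(f"{name}: rho_A = {rho:.3f}, AZT = {n_a}, placebo = {n - n_a}")
```

With these rates the urn target puts about 358 of the 477 women on AZT, close to the 360 quoted above; the DBCD targets this proportion, but its realized allocation is random.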

Allocation rule   AZT   Placebo   Power    HIV+ infants
EA                239   238       0.9996   80
DBCD              360   117       0.989    60
Neyman            186   291       0.9998   89
FPower            416    61       0.90     50
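The HIV+ column can be checked under the assumption that each arm's transmission rate equals the rate observed in the equal-allocation (EA) trial, so the projected count is $n_A \hat q_A + n_B \hat q_B$. This is a back-of-the-envelope projection, not a re-analysis of the trial; the power column is not recomputed here.

```python
# Observed transmission rates from Example 3.1, assumed to hold under any allocation
q_azt, q_placebo = 20 / 239, 60 / 238

# (AZT, placebo) sample sizes for each allocation rule in the table
allocations = {"EA": (239, 238), "DBCD": (360, 117), "Neyman": (186, 291), "FPower": (416, 61)}

def expected_hiv_positive(n_azt, n_placebo):
    """Projected number of HIV-positive infants for a given allocation."""
    return n_azt * q_azt + n_placebo * q_placebo

for rule, (n_azt, n_placebo) in allocations.items():
    print(f"{rule:7s} {expected_hiv_positive(n_azt, n_placebo):5.1f}")
```

Rounded to whole infants, the four projections agree with the 80, 60, 89 and 50 in the table.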

Example 2 (ECMO Trial). Extracorporeal membrane oxygenation (ECMO) is an external system for oxygenating the blood, based on techniques used in cardiopulmonary bypass technology developed for cardiac surgery. In the literature, there are three well-documented clinical trials evaluating the clinical effectiveness of ECMO:

(i) the Michigan ECMO study (Bartlett et al., 1985);

(ii) the Boston ECMO study (Ware, 1989);

(iii) the UK collaborative ECMO trial (UK Collaborative ECMO Trials Group, 1996).

Example 2 (continued): Michigan ECMO trial using the RPW rule

The randomized play-the-winner (RPW) rule was used in a clinical trial of extracorporeal membrane oxygenation (ECMO; Bartlett et al. 1985, Pediatrics), with 12 patients in total:

ECMO group: 11 patients, all survived.

Conventional therapy: 1 patient, who died.
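The randomized play-the-winner (RPW) rule is usually described as an urn: a ball is drawn to assign the patient, a success adds balls of the same treatment, and a failure adds balls of the other treatment. The sketch below simulates this RPW($\alpha$, $\beta$) urn with immediate responses; the success probabilities are hypothetical, and $\alpha = \beta = 1$ is simply the default chosen here.

```python
import random

def rpw_trial(p_success, n_patients, alpha=1, beta=1, seed=None):
    """Simulate the RPW(alpha, beta) urn with immediate binary responses.

    p_success maps "A"/"B" to an assumed success probability.
    Returns a list of (treatment, success) pairs.
    """
    rng = random.Random(seed)
    urn = {"A": alpha, "B": alpha}          # initial balls of each type
    history = []
    for _ in range(n_patients):
        # Draw a ball: probability of A is proportional to the number of A balls
        total = urn["A"] + urn["B"]
        arm = "A" if rng.random() < urn["A"] / total else "B"
        success = rng.random() < p_success[arm]
        # Success rewards the same arm; failure rewards the other arm
        other = "B" if arm == "A" else "A"
        urn[arm if success else other] += beta
        history.append((arm, success))
    return history

hist = rpw_trial({"A": 0.9, "B": 0.3}, n_patients=2000, seed=1)
share_a = sum(arm == "A" for arm, _ in hist) / len(hist)
print(f"proportion on A: {share_a:.2f}")
```

In long runs the share on A should approach $q_B/(q_A+q_B)$, which is 0.875 for these hypothetical rates.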

3.2 Determining the Sample Size

In the planning stages of a randomized clinical trial, it is necessary to determine the number of subjects (sample size) to be randomized. For two treatments (A and B), let $n = n_A + n_B$. We assume here that the allocation proportions are known in advance, that is, $n_A/n = \rho$ and $n_B/n = 1 - \rho$ are predetermined.
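For binary endpoints, a common normal-approximation formula for the total sample size with a fixed allocation proportion $\rho$ is $n = (z_{1-\alpha/2}+z_{1-\beta})^2\,[p_Aq_A/\rho + p_Bq_B/(1-\rho)]/(p_A-p_B)^2$. The sketch below assumes a two-sided test with unpooled variance; the event rates, level, and power are hypothetical planning values, and this is not a validated calculator.

```python
from math import ceil
from statistics import NormalDist

def total_sample_size(p_a, p_b, rho=0.5, alpha=0.05, power=0.8):
    """Total n for a two-sided z-test of p_A = p_B with n_A = rho*n and n_B = (1-rho)*n."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)          # z_{1-alpha/2} + z_{1-beta}
    variance = p_a * (1 - p_a) / rho + p_b * (1 - p_b) / (1 - rho)
    return ceil(z_sum ** 2 * variance / (p_a - p_b) ** 2)

# Hypothetical planning values: 25% vs 10% event rates
for rho in (0.5, 0.75):
    print(rho, total_sample_size(0.25, 0.10, rho=rho))
```

Moving $\rho$ away from the variance-minimizing value raises the required $n$, which is the price paid for unequal allocation.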

Examples of sample size (SS) calculations.

3.3 Mathematical Framework of Randomization Procedures

Suppose we compare two treatments A and B. Let $T_1, \dots, T_n$ be a sequence of random treatment assignments:

$T_i = 1$ if patient $i$ is assigned to treatment A;

$T_i = 0$ if patient $i$ is assigned to treatment B.

$N_A(n) = \sum_{i=1}^{n} T_i$ is the number of patients on A, and $N_B(n) = n - N_A(n)$.

$X_1, \dots, X_n$: response variables, where $X_i$ represents the responses that would be observed if each treatment were assigned to the $i$-th patient independently.

$Z_1, \dots, Z_n$: covariates, where $Z_i$ represents the covariates of the $i$-th patient.

When the $(i+1)$-th patient is ready to be randomized in a clinical trial, the following information is available:

patients' assignments: $T_1, \dots, T_i$;

responses: $X_1, \dots, X_i$ (assuming responses are observed immediately);

patients' covariates: $Z_1, \dots, Z_i$ and $Z_{i+1}$.

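Let $\mathcal{T}_n = \sigma\{T_1, \dots, T_n\}$ be the sigma-algebra generated by the first $n$ treatment assignments.

Let $\mathcal{X}_n = \sigma\{X_1, \dots, X_n\}$ be the sigma-algebra generated by the first $n$ responses.

Let $\mathcal{Z}_n = \sigma\{Z_1, \dots, Z_n\}$ be the sigma-algebra generated by the first $n$ covariate vectors. Let $\mathcal{F}_n = \mathcal{T}_n \otimes \mathcal{X}_n \otimes \mathcal{Z}_{n+1}$.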

A randomization procedure is defined by

$$\phi_n = E(T_n \mid \mathcal{F}_{n-1}),$$

where $\phi_{n+1}$ is $\mathcal{F}_n$-measurable. We can describe $\phi_n$ as the conditional probability of assigning treatments $1, \dots, K$ to the $n$-th patient, conditional on the previous $n-1$ assignments, responses, and covariate vectors, and the current patient's covariate vector.

We can describe five types of randomization procedures:

(i) complete randomization if

$$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n);$$

no information is used.

(ii) restricted randomization if

$$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1});$$

only information on previous patients' assignments is used.

(iii) response-adaptive randomization if

$$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1});$$

information on previous patients' assignments and responses is used.

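(iv) covariate-adaptive randomization if

$$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{Z}_n);$$

information on patients' assignments and covariates is used.

(v) covariate-adjusted response-adaptive (CARA) randomization if

$$\phi_n = E(T_n \mid \mathcal{F}_{n-1}) = E(T_n \mid \mathcal{T}_{n-1}, \mathcal{X}_{n-1}, \mathcal{Z}_n);$$

all available information is used.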

3.4 Complete randomization

The simplest form of a randomization procedure is complete randomization:

$$E(T_i \mid T_1, \dots, T_{i-1}) = P(T_i = 1 \mid T_1, \dots, T_{i-1}) = 1/2, \quad i = 1, \dots, n,$$

so $N_A(n)$ has a Binomial$(n, 1/2)$ distribution. This procedure is rarely used in practice because of the nonnegligible probability of treatment imbalances in moderate samples.
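To see why imbalance is a concern, the probability that complete randomization leaves $|N_A(n) - N_B(n)| \ge d$ can be computed exactly from the Binomial$(n, 1/2)$ distribution. The sample size and threshold below are illustrative choices.

```python
from math import comb

def prob_imbalance_at_least(n, d):
    """P(|N_A(n) - N_B(n)| >= d) under complete randomization, N_A(n) ~ Binomial(n, 1/2)."""
    # N_A(n) - N_B(n) = 2 * N_A(n) - n
    hits = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= d)
    return hits / 2 ** n

print(round(prob_imbalance_at_least(50, 10), 3))
```

For $n = 50$, an imbalance of 10 or more (for example 30 vs. 20) occurs in roughly one trial in five.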

3.5 Restricted randomization

Truncated binomial design: Complete randomization is used until $n/2$ patients have been assigned to either A or B; the remaining patients are then assigned to the opposite treatment with probability 1. Here the procedure is given by

$$
\phi_i =
\begin{cases}
1/2, & \text{if } \max\{N_A(i-1), N_B(i-1)\} < n/2,\\
0, & \text{if } N_A(i-1) = n/2,\\
1, & \text{if } N_B(i-1) = n/2.
\end{cases}
$$
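A direct transcription of the truncated binomial design, assuming $n$ is even and coding A as 1 and B as 0; the function name is illustrative.

```python
import random

def truncated_binomial_sequence(n, seed=None):
    """Assignments T_1..T_n (1 = A, 0 = B) under the truncated binomial design; n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    rng = random.Random(seed)
    n_a = n_b = 0
    seq = []
    for _ in range(n):
        if n_a == n // 2:
            phi = 0.0          # A already has n/2 patients: assign B with probability 1
        elif n_b == n // 2:
            phi = 1.0          # B already has n/2 patients: assign A with probability 1
        else:
            phi = 0.5          # otherwise toss a fair coin
        t = 1 if rng.random() < phi else 0
        seq.append(t)
        n_a += t
        n_b += 1 - t
    return seq

seq = truncated_binomial_sequence(20, seed=3)
print(seq, sum(seq))
```

Every sequence ends exactly balanced, but once one arm is full the remaining assignments are fully predictable.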

Blocked Procedures: Because we do not know $n$ exactly in advance, the randomization sequence typically has to allow for overrunning. Forced-balance designs are therefore typically used in blocks.

Permuted block design: Blocks of even size $2b$ are filled using either a random allocation rule or a truncated binomial design. The maximum imbalance is $b$, and a terminal imbalance can occur only if the last block is unfilled. Every block has at least one deterministic assignment.

Random block design: Block sizes $2, 4, 6, \dots, 2K$ are selected at random, each with equal probability.
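A sketch of the permuted block design using the random allocation rule within each block (a random permutation of $b$ A's and $b$ B's), together with a random block design that draws each block size uniformly from $2, 4, \dots, 2K$. The function names are illustrative.

```python
import random

def permuted_block_sequence(n_blocks, b, seed=None):
    """Permuted block design: each block of size 2b is a random permutation of b A's and b B's."""
    rng = random.Random(seed)
    seq = []
    for _ in range(n_blocks):
        block = [1] * b + [0] * b   # random allocation rule within the block
        rng.shuffle(block)
        seq.extend(block)
    return seq

def random_block_sequence(n_blocks, K, seed=None):
    """Random block design: each block size 2b is drawn uniformly from 2, 4, ..., 2K."""
    rng = random.Random(seed)
    seq = []
    for _ in range(n_blocks):
        b = rng.randint(1, K)       # randint is inclusive, so b is uniform on 1..K
        block = [1] * b + [0] * b
        rng.shuffle(block)
        seq.extend(block)
    return seq

def max_abs_imbalance(seq):
    """Largest |N_A(i) - N_B(i)| along the sequence."""
    d, worst = 0, 0
    for t in seq:
        d += 1 if t == 1 else -1
        worst = max(worst, abs(d))
    return worst

seq = permuted_block_sequence(n_blocks=5, b=2, seed=7)
print(seq, max_abs_imbalance(seq))
```

The running imbalance never exceeds $b$, and every completed block is balanced; varying the block size makes the deterministic end-of-block assignments harder to anticipate.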

Efron's biased coin design (BCD) (Efron, 1971): Let $D_i = N_A(i) - N_B(i)$ be the imbalance between treatments A and B, and fix a constant $p \in (0.5, 1]$. Then the procedure is given by

$$
\phi_i =
\begin{cases}
1/2, & \text{if } D_{i-1} = 0,\\
p, & \text{if } D_{i-1} < 0,\\
1-p, & \text{if } D_{i-1} > 0.
\end{cases}
$$

Efron suggested that $p = 2/3$ might be a reasonable value (without justification).
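Efron's biased coin design only needs the current imbalance $D_{i-1}$. The sketch below implements the rule with $p$ as a parameter (default $2/3$); the function name is illustrative.

```python
import random

def efron_bcd_sequence(n, p=2/3, seed=None):
    """Efron's biased coin design: favor the under-represented arm with probability p."""
    rng = random.Random(seed)
    d = 0                      # D_{i-1} = N_A(i-1) - N_B(i-1)
    seq = []
    for _ in range(n):
        if d == 0:
            phi = 0.5          # balanced: fair coin
        elif d < 0:
            phi = p            # A is behind: favor A
        else:
            phi = 1 - p        # A is ahead: favor B
        t = 1 if rng.random() < phi else 0
        seq.append(t)
        d += 1 if t == 1 else -1
    return seq

print(efron_bcd_sequence(20, seed=4))
```

With $p = 1$ the imbalance never exceeds 1, because every assignment made after an imbalance is forced; $p = 1/2$ recovers complete randomization.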

Many other designs have been proposed and studied in the literature (Smith's design (1984), Wei's design (1978), the Big Stick design (Soares and Wu, 1982), etc.).

When $n = 50$, $\mathrm{Var}(D_n) = 49.92$ for complete randomization and $\mathrm{Var}(D_n) = 4.36$ for Efron's BCD with $p = 2/3$ (based on 100,000 replications).
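The variance comparison can be reproduced approximately by Monte Carlo. The sketch treats complete randomization as Efron's rule with $p = 0.5$; the number of replications and the seed are arbitrary choices, so the estimates will differ slightly from the values quoted.

```python
import random
from statistics import pvariance

def final_imbalance(n, p, rng):
    """D_n for Efron's BCD with bias p; p = 0.5 is complete randomization."""
    d = 0
    for _ in range(n):
        phi = 0.5 if d == 0 else (p if d < 0 else 1 - p)
        d += 1 if rng.random() < phi else -1
    return d

def var_dn(n=50, p=0.5, reps=10_000, seed=2022):
    """Monte Carlo estimate of Var(D_n)."""
    rng = random.Random(seed)
    return pvariance([final_imbalance(n, p, rng) for _ in range(reps)])

print(f"complete randomization: {var_dn(p=0.5):.2f}")
print(f"Efron's BCD, p = 2/3:   {var_dn(p=2/3):.2f}")
```

The first estimate should be near 50, since $D_n = 2N_A(n) - n$ has variance exactly $n$ under complete randomization; the second should be a small single-digit value close to the one quoted above.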

3.6 Selection Bias

Selection bias refers to biases that are introduced into an unmasked study because an investigator may be able to guess the treatment assignment of future patients based on knowledge of the treatments assigned to past patients. Patients usually enter a trial sequentially over time.

The great clinical trialist Chalmers (1990) was convinced that the

elimination of selection bias is the most essential requirement for a

good clinical trial.

How can selection bias be measured?

Blackwell and Hodges (1957), Berger, Ivanova and Knoll (2003) and others have suggested using the predictability of a randomization sequence as a measure of selection bias.

One measure of the predictability of a randomization sequence is given by

$$P_{\text{pred}} = \frac{1}{n}\sum_{i=1}^{n} E\left|\phi_i - 0.5\right|.$$
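A Monte Carlo sketch of the predictability measure, reading it as the average expected distance $E|\phi_i - 0.5|$ of the assignment probability from a fair coin. Each rule is written as a function of the current counts, which covers the restricted-randomization rules above; the function names, $n$, and the replication count are illustrative.

```python
import random

def predictability(phi_rule, n, reps=2_000, seed=11):
    """Monte Carlo estimate of (1/n) * sum_i E|phi_i - 0.5|.

    phi_rule(n_a, n_b, i, n) returns phi_i, the probability of assigning A.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        n_a = n_b = 0
        for i in range(1, n + 1):
            phi = phi_rule(n_a, n_b, i, n)
            total += abs(phi - 0.5)
            if rng.random() < phi:
                n_a += 1
            else:
                n_b += 1
    return total / (reps * n)

def complete(n_a, n_b, i, n):
    return 0.5

def efron(n_a, n_b, i, n, p=2/3):
    d = n_a - n_b
    return 0.5 if d == 0 else (p if d < 0 else 1 - p)

def truncated_binomial(n_a, n_b, i, n):
    if n_a == n // 2:
        return 0.0
    if n_b == n // 2:
        return 1.0
    return 0.5

for name, rule in [("complete", complete), ("Efron BCD", efron),
                   ("truncated binomial", truncated_binomial)]:
    print(f"{name:20s} {predictability(rule, n=50):.3f}")
```

Complete randomization scores exactly 0, while the biased coin and truncated binomial designs score higher because some of their assignments are partly or fully predictable.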

Selection bias of different designs.

4 Response-adaptive randomization procedures

4.1 Historical notes

Adaptive designs in the clinical trials context were first formulated as

solutions to optimal decision-making questions:

Which treatment is better?

What sample size should be used before determining a “better”

treatment to maximize the total number receiving the better

treatment?

How do we incorporate prior data or accruing data into these

decisions?

The preliminary ideas can be traced back to Thompson (1933,

Biometrika) and Robbins (1952, Bulletin of the American

Mathematical Society) and led to a flurry of work in the 1960s by

Anscombe (1963, JASA), Colton (1963, JASA), Zelen (1969, JASA)

and Cornfield, Halperin, and Greenhouse (1969, Annals of

Mathematical Statistics), among others.

4.2 Play-the-winner rule

Perhaps the simplest of these adaptive designs is the play-the-winner

rule originally explored by Robbins (1952, Bulletin of the American

Mathematical Society) and later by Zelen (1969, JASA).

Binary response: treatments A and B.

$p_A = P(\text{success} \mid A)$, $q_A = 1 - p_A$;

$p_B = P(\text{success} \mid B)$, $q_B = 1 - p_B$;

$N_A(n)$: number of patients on A;

$N_B(n)$: number of patients on B, $n = N_A(n) + N_B(n)$.

Play-the-winner rule:

a success on one treatment results in the next patient's assignment to the same treatment;

a failure on one treatment results in the next patient's assignment to the opposite treatment.

That is,

$\phi_n = 1$ if $T_{n-1} = 1$ and $X_{n-1}(A) = 1$, or $T_{n-1} = 0$ and $X_{n-1}(B) = 0$;

$\phi_n = 0$ if $T_{n-1} = 1$ and $X_{n-1}(A) = 0$, or $T_{n-1} = 0$ and $X_{n-1}(B) = 1$.
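The play-the-winner rule can be simulated directly. In the sketch the first patient is assigned by a fair coin (an assumption, since the rule needs a previous response), responses are immediate, and the success probabilities are hypothetical. For comparison it prints $q_B/(q_A+q_B)$, the well-known limiting proportion of patients on A under this rule.

```python
import random

def play_the_winner(p_a, p_b, n, seed=None):
    """Simulate the (deterministic) play-the-winner rule with immediate binary responses.

    Returns the list of assignments T_i (1 = A, 0 = B).
    """
    rng = random.Random(seed)
    t = 1 if rng.random() < 0.5 else 0     # first patient: fair coin (assumed)
    assignments = []
    for _ in range(n):
        assignments.append(t)
        success = rng.random() < (p_a if t == 1 else p_b)
        # Success: next patient stays on the same treatment; failure: switch
        if not success:
            t = 1 - t
    return assignments

p_a, p_b = 0.8, 0.5                        # hypothetical success probabilities
seq = play_the_winner(p_a, p_b, n=20_000, seed=5)
q_a, q_b = 1 - p_a, 1 - p_b
print(sum(seq) / len(seq), q_b / (q_a + q_b))
```

Because the assignment switches only after a failure, the sequence behaves like a two-state Markov chain that leaves A with probability $q_A$ and leaves B with probability $q_B$.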

What are the properties of the play-the-winner rule?

What is the proportion of patients in treatment A:

