
###### Date: 2022-06-10 10:46

A Power Calculation

Suppose, for illustration, that we are interested in testing the hypothesis

$$H_0: \theta_1 = \theta_2 \quad \text{vs.} \quad H_A: \theta_1 \neq \theta_2$$

Suppose, also for illustration, that the test statistic associated with this test has the form

It will be useful to define the notion of a rejection region $R$: all values of the observed test statistic $t$ that would lead to the rejection of $H_0$:

$$R = \{t \mid H_0 \text{ is rejected}\}$$

- If $t \in R$, we reject $H_0$

- If $t \in R^c$, we do not reject $H_0$

Defining Type I and Type II error rates in terms of a rejection region is also useful:

$$\alpha = \Pr(\text{Type I Error}) = \Pr(\text{Reject } H_0 \mid H_0 \text{ is true}) = \Pr(T \in R \mid H_0 \text{ is true})$$

$$\beta = \Pr(\text{Type II Error}) = \Pr(\text{Do Not Reject } H_0 \mid H_0 \text{ is false}) = \Pr(T \in R^c \mid H_0 \text{ is false})$$
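As a rough illustration of these definitions, the sketch below estimates the two error rates by simulation. The form of the test statistic is not given here, so the sketch assumes a two-sample z-type statistic with known unit variance and the two-sided rejection region $R = \{t : |t| > 1.96\}$. The function names, sample size, and effect size are all illustrative assumptions.

```python
import random
from statistics import fmean

def z_statistic(y1, y2, sigma=1.0):
    """Two-sample z-type statistic, assuming a known common standard deviation."""
    se = sigma * (1 / len(y1) + 1 / len(y2)) ** 0.5
    return (fmean(y1) - fmean(y2)) / se

def in_rejection_region(t, critical=1.96):
    """Assumed two-sided rejection region R = {t : |t| > critical}."""
    return abs(t) > critical

def rejection_rate(mu1, mu2, n=50, sims=4000, seed=1):
    """Fraction of simulated experiments whose statistic falls in R."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        y1 = [rng.gauss(mu1, 1.0) for _ in range(n)]
        y2 = [rng.gauss(mu2, 1.0) for _ in range(n)]
        if in_rejection_region(z_statistic(y1, y2)):
            rejections += 1
    return rejections / sims

# H0 true (equal means): estimates Pr(T in R | H0 is true)
alpha_hat = rejection_rate(0.0, 0.0)
# H0 false (means differ by 0.5): estimates Pr(T in R^c | H0 is false)
beta_hat = 1 - rejection_rate(0.0, 0.5)
print(alpha_hat, beta_hat)
```

Under these assumptions the estimated power of the test is `1 - beta_hat`; changing `n` or the difference between `mu1` and `mu2` shows how the Type II error rate responds to sample size and effect size.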


Permutation and Randomization Tests

All of the previous tests have made some kind of distributional assumption for the response measurements.

It would be preferable to have a test that does not rely on any such distributional assumptions.

This is precisely the purpose of permutation and randomization tests.

- These tests are nonparametric and rely on resampling.

- The motivation is that if $H_0: \theta_1 = \theta_2$ is true, any random rearrangement of the data is equally likely to have been observed.

- With $n_1$ and $n_2$ units in each condition, there are $\binom{n_1 + n_2}{n_1}$ arrangements of the $n_1 + n_2$ observations into two groups of size $n_1$ and $n_2$ respectively.
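To get a feel for how quickly the number of arrangements grows, here is a small sketch using Python's standard `math.comb`. The helper name `count_arrangements` is hypothetical.

```python
import math

def count_arrangements(n1, n2):
    """Number of ways to split n1 + n2 observations into groups of size n1 and n2."""
    return math.comb(n1 + n2, n1)

print(count_arrangements(3, 3))        # 20
print(count_arrangements(100, 100))    # a 59-digit number, far too many to enumerate
```

Even moderate sample sizes make full enumeration impractical, which is the practical reason for considering only a random subset of arrangements.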


A true permutation test considers all possible rearrangements of the original data.

- The test statistic $t$ is calculated on the original data and on every one of its rearrangements.

- This collection of test statistic values generates the empirical null distribution.
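For very small samples, a true permutation test can be sketched by enumerating every split of the pooled data. This is a minimal sketch, not a definitive implementation: the difference in sample means as the test statistic and the two-sided notion of "extreme" (comparison of absolute values) are assumptions made for illustration.

```python
from itertools import combinations
from statistics import fmean

def mean_difference(group1, group2):
    """Assumed test statistic: difference in sample means."""
    return fmean(group1) - fmean(group2)

def permutation_test(y1, y2, statistic=mean_difference):
    """Exact permutation test over all splits of the pooled data."""
    pooled = list(y1) + list(y2)
    n1 = len(y1)
    t_obs = statistic(y1, y2)
    null_values = []
    # Every choice of n1 positions defines one rearrangement into two groups.
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        null_values.append(statistic(g1, g2))
    # Two-sided comparison; the small tolerance guards against rounding.
    extreme = sum(abs(t) >= abs(t_obs) - 1e-12 for t in null_values)
    return t_obs, extreme / len(null_values)

t_obs, p_value = permutation_test([1, 2, 3], [4, 5, 6])
print(t_obs, p_value)
```

In this toy example only 2 of the 20 arrangements are as extreme as the observed one, so the p-value is 0.1.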

A randomization test is carried out similarly, except that we do not consider all possible rearrangements.

- We just consider a large number $N$ of them.

Randomization Test Algorithm

1. Collect response observations in each condition.

2. Calculate the test statistic t on the original data.


3. Pool all of the observations together and randomly sample (without replacement) $n_1$ observations, which are assigned to "Condition 1"; the remaining $n_2$ observations are assigned to "Condition 2". Repeat this $N$ times.

4. Calculate the test statistic $t^*_k$ on each of the "shuffled" datasets, $k = 1, 2, \ldots, N$.

5. Compare $t$ to $\{t^*_1, t^*_2, \ldots, t^*_N\}$, the empirical null distribution, and calculate the p-value:

$$\text{p-value} = \frac{\#\text{ of } t^*\text{'s that are at least as extreme as } t}{N}$$
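The algorithm above can be sketched in a few lines of Python. The choice of test statistic (difference in sample means), the two-sided definition of "at least as extreme", and the names `randomization_test` and `n_shuffles` are assumptions for illustration; any statistic computable from the two groups could be substituted.

```python
import random
from statistics import fmean

def randomization_test(y1, y2, n_shuffles=2000, seed=0):
    """Randomization test for two conditions using the difference in means."""
    rng = random.Random(seed)
    n1 = len(y1)
    t_obs = fmean(y1) - fmean(y2)                  # step 2: statistic on original data
    pooled = list(y1) + list(y2)                   # step 3: pool the observations
    extreme = 0
    for _ in range(n_shuffles):
        # Sampling all items without replacement gives a random reordering;
        # the first n1 go to "Condition 1", the rest to "Condition 2".
        shuffled = rng.sample(pooled, len(pooled))
        t_star = fmean(shuffled[:n1]) - fmean(shuffled[n1:])   # step 4
        if abs(t_star) >= abs(t_obs):              # step 5: two-sided comparison
            extreme += 1
    return t_obs, extreme / n_shuffles

# Illustrative, made-up data with well separated groups
group1 = [10.0 + 0.1 * i for i in range(20)]
group2 = [0.1 * i for i in range(20)]
t_obs, p_value = randomization_test(group1, group2)
print(t_obs, p_value)
```

Fixing the seed makes the result reproducible; a larger `n_shuffles` gives a finer-grained p-value at the cost of more computation.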

Example: Pokémon Go

Suppose that Niantic is experimenting with two different promotions within Pokémon Go:

- Condition 1: Give users nothing

- Condition 2: Give users 200 free Pokécoins

- Condition 3: Give users a 50% discount on Shop purchases

In a small pilot experiment, $n_1 = n_2 = n_3 = 100$ users are randomized to each condition.

For each user, the amount of real money (in USD) they spend in the 30 days following the experiment is recorded.

The data summaries are:

- $\bar{y}_1$ = \$10.74, $Q_1(0.5)$ = \$9

- $\bar{y}_2$ = \$9.53, $Q_2(0.5)$ = \$8

- $\bar{y}_3$ = \$13.41, $Q_3(0.5)$ = \$10


3 Experiments with More than Two Conditions

3.1 Anatomy of an A/B/m Test

We now consider the design and analysis of an experiment consisting of more than two experimental conditions, or what many data scientists broadly refer to as "A/B/m Testing".

- Canonical A/B/m test:

Figure 1: Button-Colour Experiment

Other, more tangible, examples:

- Netflix

- Etsy

Typically the goal of such an experiment is to decide which condition is optimal with respect to some metric of interest. This could be a

- mean

- proportion

- variance

- quantile

- technically any statistic that can be calculated from sample data

From a design standpoint, such an experiment is very similar to a two-condition experiment:

1. Choose a metric of interest $\theta$ which addresses the question you are trying to answer.

2. Determine the response variable $y$ that must be measured on each unit in order to estimate $\hat{\theta}$.

3. Choose the design factor $x$ and the $m$ levels you will experiment with.

4. Choose $n_1, n_2, \ldots, n_m$ and assign units to conditions at random.

5. Collect the data and estimate the metric of interest in each condition:

$$\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_m$$
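Steps 4 and 5 can be sketched as follows. This is a minimal illustration: the helper names `assign_units` and `estimate_metric` are hypothetical, and the sample mean is used as the default metric purely as an assumption.

```python
import random
from statistics import fmean

def assign_units(unit_ids, sizes, seed=0):
    """Randomly split unit_ids into len(sizes) conditions with the given sizes."""
    unit_ids = list(unit_ids)
    if sum(sizes) != len(unit_ids):
        raise ValueError("condition sizes must add up to the number of units")
    rng = random.Random(seed)
    shuffled = rng.sample(unit_ids, len(unit_ids))   # random order, no replacement
    groups, start = [], 0
    for n_j in sizes:
        groups.append(shuffled[start:start + n_j])
        start += n_j
    return groups

def estimate_metric(responses_by_condition, metric=fmean):
    """Apply the chosen metric to the responses observed in each condition."""
    return [metric(y) for y in responses_by_condition]

groups = assign_units(range(9), [3, 3, 3])
print(groups)
print(estimate_metric([[1, 2, 3], [4, 6]]))   # [2.0, 5.0]
```

Passing a different function as `metric` (for example `statistics.median` or `statistics.variance`) estimates a different metric of interest from the same responses.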


Determining which condition is optimal typically involves a series of pairwise comparisons.

But it is useful to begin such an investigation with a gatekeeper test, which serves to determine whether there is any difference between the $m$ experimental conditions. Formally, such a question is phrased as the following statistical hypothesis.

$$H_0: \theta_1 = \theta_2 = \cdots = \theta_m \quad \text{versus} \quad H_A: \theta_j \neq \theta_k \text{ for some } j \neq k \tag{1}$$

3.2 Comparing Multiple Means with an F-test

We assume that our response variable follows a normal distribution, that the mean of the distribution depends on the condition in which the measurements were taken, and that the variance is the same across all conditions.

The "gatekeeper" test for means is carried out using an F-test.

In particular, we use the F-test for overall significance in an appropriately defined linear regression model:

- The appropriately defined linear regression model in this situation is one in which the response variable depends on $m - 1$ indicator variables:

$$x_{ij} = \begin{cases} 1 & \text{if unit } i \text{ is in condition } j \\ 0 & \text{otherwise} \end{cases}$$

for $j = 1, 2, \ldots, m - 1$.

- For a particular unit $i$, we adopt the model

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{m-1} x_{i,m-1} + \varepsilon_i$$


- In this model the $\beta$'s are unknown parameters and may be interpreted in the context of the following expectations:

$$E[Y_i \mid x_{i1} = x_{i2} = \cdots = x_{i,m-1} = 0] = \beta_0$$

$$E[Y_i \mid x_{ij} = 1] = \beta_0 + \beta_j$$

- Based on these assumptions, $H_0$ in (1) is true if and only if $\beta_1 = \beta_2 = \cdots = \beta_{m-1} = 0$. Thus testing (1) is equivalent to testing

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{m-1} = 0 \quad \text{vs.} \quad H_A: \beta_j \neq 0 \text{ for some } j$$

- This hypothesis corresponds, as noted, to the F-test for overall significance in the model.
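The indicator coding and the interpretation of the coefficients can be illustrated with a short sketch. It relies on the fact that, in a model containing only an intercept and these $m - 1$ indicators, the least squares estimates reduce to sample means: the intercept estimate is the mean of the reference condition (condition $m$, the one with all indicators equal to 0) and each remaining estimate is a difference from that mean. The function names and the toy data are assumptions.

```python
from statistics import fmean

def indicator_row(condition, m):
    """Indicators x_1, ..., x_{m-1} for a unit in `condition` (1..m).

    Condition m is the reference: all of its indicators are 0.
    """
    return [1 if condition == j else 0 for j in range(1, m)]

def coefficients_from_means(groups):
    """Least squares estimates for the indicator model, computed via group means.

    groups[j - 1] holds the responses in condition j; the last group is the reference.
    """
    means = [fmean(g) for g in groups]
    beta0 = means[-1]                                   # mean of the reference condition
    betas = [mean_j - beta0 for mean_j in means[:-1]]   # differences from the reference
    return beta0, betas

print(indicator_row(1, 3))   # [1, 0]
print(indicator_row(3, 3))   # [0, 0]
print(coefficients_from_means([[1, 3], [4, 6], [10, 10]]))   # (10.0, [-8.0, -5.0])
```

All condition means are equal exactly when every difference from the reference is zero, which is why the gatekeeper hypothesis can be restated in terms of the $\beta_j$.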

In regression parlance, the test statistic is defined to be the ratio of the regression mean squares (MSR) to the mean squared error (MSE) in a standard regression-based analysis of variance (ANOVA):

$$t = \frac{MSR}{MSE}$$

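In our setting we can more intuitively think of the test statistic as comparing the response variability between conditions to the response variability within conditions:

$$t = \frac{\sum_{j=1}^{m} n_j (\bar{y}_j - \bar{y})^2 / (m - 1)}{\sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 / (N - m)}$$

where $\bar{y}_j$ is the sample mean in condition $j$, $\bar{y}$ is the overall sample mean, and $N = n_1 + n_2 + \cdots + n_m$.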


The null distribution for this test is $F_{(m-1,\,N-m)}$.

The p-value for this test is calculated by

$$\text{p-value} = P(T \geq t)$$

where $T \sim F_{(m-1,\,N-m)}$.
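As a rough sketch, the ratio of between-condition to within-condition variability can be computed directly from the grouped responses using only the standard library. The function name `f_statistic` and the toy data are assumptions for illustration.

```python
from statistics import fmean

def f_statistic(groups):
    """Ratio MSR / MSE computed from a list of per-condition response lists."""
    m = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([y for g in groups for y in g])
    group_means = [fmean(g) for g in groups]
    # Between-condition sum of squares, on m - 1 degrees of freedom
    ss_between = sum(len(g) * (mean_j - grand_mean) ** 2
                     for g, mean_j in zip(groups, group_means))
    # Within-condition sum of squares, on N - m degrees of freedom
    ss_within = sum((y - mean_j) ** 2
                    for g, mean_j in zip(groups, group_means) for y in g)
    msr = ss_between / (m - 1)
    mse = ss_within / (n_total - m)
    return msr / mse

t = f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(t)   # 13.0, up to floating-point rounding
```

The standard library has no F distribution, so the p-value is not computed here. If SciPy is available, `scipy.stats.f.sf(t, m - 1, N - m)` returns the upper-tail probability $P(T \geq t)$.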

Example: Candy Crush Boosters

- Candy Crush is experimenting with three different versions of in-game "boosters": the lollipop hammer, the jelly fish, and the color bomb.

Figure 2: Candy Crush Experiment

- Users are randomized to one of these three conditions ($n_1 = 121$, $n_2 = 135$, $n_3 = 117$) and they [...] of these different boosters on the length of time a user plays the game.

- Let $\mu_j$ represent the average length of game play (in minutes) associated with booster condition $j = 1, 2, 3$. While interest lies in finding the condition associated with the longest average length of game play, here we first rule out the possibility that booster type does not influence the length of game play (i.e., $\mu_1 = \mu_2 = \mu_3$).

- In order to do this we fit the linear regression model

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

where $x_1$ and $x_2$ are indicator variables indicating whether a particular value of the response was observed in the jelly fish or color bomb conditions, respectively. The lollipop hammer is therefore the reference condition.


Optional Exercises:

Calculations: 2, 7

Proofs: 1, 5, 6, 9, 10, 14, 17, 18

R Analysis: 2, 5, 6, 8, 13(g), 17 (not g,h), 22(h), 23(a-f)
