hw06-Copy1
November 16, 2018
1 Homework 6: Probability and Hypothesis Testing
1.1 Due Sunday November 18th, 11:59pm
Directly sharing answers is not okay, but discussing problems with the course staff or with other
students is encouraged.
You should start early so that you have time to get help if you’re stuck.
In [ ]: #: Don't change this cell; just run it.
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from client.api.notebook import Notebook
ok = Notebook('hw06.ok')
_ = ok.auth(inline=True)
Important: The ok tests don’t usually tell you that your answer is correct. More often, they
help catch careless mistakes. It’s up to you to ensure that your answer is correct. If you’re not
sure, ask someone (not for the answer, but for some guidance about your approach).
Once you’re finished, you must do two things:
1.1.1 a. Turn in to OK
Select "Save and Checkpoint" in the File menu and then execute the submit cell below. The result
will contain a link that you can use to check that your assignment has been submitted successfully.
If you submit more than once before the deadline, we will only grade your final submission.
In [ ]: #: turn in your notebook
_ = ok.submit()
1.1.2 b. Turn in PDF to Gradescope
Select File > Download As > PDF via LaTeX. Turn in this PDF file to the
respective assignment at https://gradescope.com/. If you submit more than once before the
deadline, we will only grade your final submission.
1.2 1. Numbers in a Slot Machine
You are in front of a slot machine with three slots. Each slot in the slot machine has 10 possible
outcomes: the numbers from 0-9. When you press the "Spin" button on the slot machine, each of
the three slots spins independently and stops at a number. Assume that the slot machine always
picks a number randomly.
Question 1. Suppose you win the jackpot if you are lucky enough to encounter the following
sequence of spins, in order:
Spin 1: You see a 777 in the slot machine.
Spin 2: You see a 999 in the slot machine.
What is the probability that you win the jackpot if you press the "Spin" button twice? Assign
your answer to jackpot_chance.
In [ ]: jackpot_chance = ...
jackpot_chance
In [ ]: #: grade 1.1
_ = ok.grade('q1_1')
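The multiplication rule for independent events can be sketched as follows. This is only an illustration under the problem's stated assumptions (10 equally likely digits per slot, independent slots and spins); the variable names are hypothetical and not part of the assignment.

```python
from fractions import Fraction

# Each slot shows one of 10 equally likely digits (stated in the problem).
p_digit = Fraction(1, 10)

# The three slots are independent, so the chance that one spin shows one
# specific three-digit pattern is the product of three single-digit chances.
p_specific_spin = p_digit ** 3

# Two spins are also independent, so the chance of one specific pattern on the
# first spin AND another specific pattern on the second is again a product.
p_two_specific_spins = p_specific_spin * p_specific_spin

print(p_specific_spin)       # 1/1000
print(p_two_specific_spins)  # 1/1000000
```

Exact fractions are used here only to avoid floating-point rounding; a plain Python expression such as (1/10) ** 3 works equally well.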
Question 2. What is the probability that you see a number greater than 700 when you press
"Spin" once? Assign your answer to greater_than_700.
In [ ]: greater_than_700 = ...
greater_than_700
In [ ]: #: grade 1.2
_ = ok.grade('q1_2')
Question 3. Write a function called simulate_one_spin. It should take no arguments, and it
should return a random number such that every number the slot machine can show is equally likely.
Note that since it is a number, leading zeros are ignored. For example, if the slot machine shows 009,
then the corresponding return value of your function should be 9.
In [ ]: # Place your answer here. It may contain several lines of code.
In [ ]: #: grade 1.3
_ = ok.grade('q1_3')
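One possible way to think about a single spin is sketched below. The helper name spin_sketch and its rng parameter are assumptions made for this illustration (the function requested in the question takes no arguments), and the seed is there only to make the sketch repeatable.

```python
import numpy as np

def spin_sketch(rng):
    """Return one spin as an integer from 0 to 999 (hypothetical helper)."""
    # Draw the three slot digits independently, each uniform on 0..9.
    hundreds, tens, ones = rng.integers(0, 10, size=3)
    # Combining the digits numerically drops leading zeros: (0, 0, 9) -> 9.
    return int(100 * hundreds + 10 * tens + ones)

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
sample_spins = [spin_sketch(rng) for _ in range(5)]
print(sample_spins)
```

Because each of the 1,000 digit combinations is equally likely, drawing a single integer uniformly from 0 to 999 would be an equivalent design.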
Question 4. Call the function simulate_one_spin 100,000 times. What proportion of times
does the slot machine output 777? Assign your answer to proportion_777. Your solution may
take more than one line.
In [ ]: proportion_777 = ...
proportion_777
In [ ]: #: grade 1.4
_ = ok.grade('q1_4')
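Estimating a probability by repeated simulation might look like the sketch below. It draws all spins at once with NumPy instead of calling a function in a loop, which is a simplification made for this illustration; the names and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the sketch is repeatable
num_spins = 100_000

# Each spin is an integer from 0 to 999, all equally likely.
spins = rng.integers(0, 1000, size=num_spins)

# The mean of a boolean array is the proportion of True values.
proportion_of_target = np.mean(spins == 777)
print(proportion_of_target)
```

The result varies from run to run without a seed, but it should land close to the exact chance of one specific outcome, 1/1000.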
Question 5. Compute the probability that at least one of the slots in the slot machine (out of the
three) gives out a 7. You can write it as an expression which can be evaluated by Python. Assign
your answer to at_least_one_7.
In [ ]: at_least_one_7 = ...
at_least_one_7
In [ ]: #: grade 1.5
_ = ok.grade('q1_5')
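"At least one" questions are often easiest through the complement: one minus the chance that no slot shows the digit. The sketch below, with hypothetical variable names, computes this and cross-checks it by listing all 1,000 possible outcomes.

```python
from itertools import product

# Chance that a single slot does NOT show a 7.
p_no_7_one_slot = 9 / 10

# Slots are independent, so "no 7 anywhere" is a product; take the complement.
p_at_least_one_7 = 1 - p_no_7_one_slot ** 3

# Cross-check by brute force: enumerate every (digit, digit, digit) outcome.
outcomes = list(product(range(10), repeat=3))
with_a_7 = sum(1 for digits in outcomes if 7 in digits)

print(p_at_least_one_7, with_a_7 / len(outcomes))
```

Both approaches agree: 271 of the 1,000 equally likely outcomes contain at least one 7.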
1.3 2. Apples and Oranges
Suppose you are given a huge farm that yields apples and oranges.
In [ ]: #: Don't change this cell, just run it
apples = ['Apple' for _ in range(400)]
oranges = ['Orange' for _ in range(600)]
farm_table = Table().with_column(
'Fruit Type', apples + oranges
)
farm_table
Question 1. Because you like apples more, you’re interested in the proportion of apples
in the farm. Calculate the true proportion of apples in the farm. Store it in the variable
apples_true_prop.
In [ ]: apples_true_prop = ...
apples_true_prop
In [ ]: #: grade 2.1
_ = ok.grade('q2_1')
Question 2. Which of the following would create a representative sample of fruits and why?
Explain your answer.
1. farm_table.take(np.arange(200))
2. farm_table.sample(200)
Option 2 would create a representative sample of fruits because .sample chooses 200
fruits at random, so each fruit has an equal chance of being selected, whereas
.take(np.arange(200)) always returns the first 200 rows of the table, which here are all apples.
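The contrast between taking the first rows and sampling at random can be illustrated with a plain NumPy array standing in for farm_table. The array, variable names, and seed below are assumptions for this sketch; slicing fruits[:200] plays the role of take(np.arange(200)).

```python
import numpy as np

# Same layout as the farm: 400 apples followed by 600 oranges.
fruits = np.array(['Apple'] * 400 + ['Orange'] * 600)

# Deterministic selection of the first 200 entries: every one is an apple.
first_200 = fruits[:200]
prop_first = np.mean(first_200 == 'Apple')

# Random selection of 200 entries without replacement.
rng = np.random.default_rng(1)  # seeded only so the sketch is repeatable
random_200 = rng.choice(fruits, size=200, replace=False)
prop_random = np.mean(random_200 == 'Apple')

print(prop_first, prop_random)
```

The first proportion is exactly 1.0 because of how the table is ordered, while the random sample's proportion should fall near the farm's 400/1000.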
Question 3. Let’s say we have a fruit basket that can contain at most 200 fruits. We pick 200
fruits (without replacement) from the farm and place them in our fruit basket using the sampling you
chose in Question 2 above. Write a function called pick_200_fruits that simulates this. Specifically,
the function should take no arguments and should return an array of 200 fruits selected as
per your choice in Question 2.
In [ ]: # Place your answer here. It may contain several lines of code.
In [ ]: #: grade 2.3
_ = ok.grade('q2_3')
Question 4. As we mentioned, we’re interested in knowing the true proportion of apples in the
farm. But we can pick only 200 fruits at a time in our fruit basket. Hence, we simulate this experiment
in 500 trials. For each trial, we decide to calculate the proportion of apples in our basket. Simulate
the experiment and store the array of proportions in the variable apples_empirical_props.
In [ ]: # Place your answer here. It may contain several lines of code.
In [ ]: #: grade 2.4
_ = ok.grade('q2_4')
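Repeating the basket experiment many times and collecting one proportion per trial could be sketched as below. A NumPy array stands in for the farm table, and all names and the seed are assumptions made for this illustration.

```python
import numpy as np

fruits = np.array(['Apple'] * 400 + ['Orange'] * 600)
rng = np.random.default_rng(2)  # seeded only so the sketch is repeatable

num_trials = 500
props = np.empty(num_trials)
for i in range(num_trials):
    # One trial: fill a basket with 200 fruits drawn without replacement.
    basket = rng.choice(fruits, size=200, replace=False)
    # Record the proportion of apples seen in this basket.
    props[i] = np.mean(basket == 'Apple')

# Averaging over trials gives an estimate; compare it with the farm's 400/1000.
average_prop = props.mean()
print(average_prop, abs(average_prop - 0.4))
```

Individual baskets vary by a few percentage points, but the average over 500 trials should sit very close to the farm's proportion.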
Question 5. Now, compute the average of apples_empirical_props. You claim that this
average is a good estimate of the proportion of apples in the farm. Store your proportion in
apples_claim_prop.
In [ ]: apples_claim_prop = ...
apples_claim_prop
In [ ]: #: grade 2.5
_ = ok.grade('q2_5')
Question 6. How far away is your claim from the true proportion of apples? Compute the absolute
difference between the two and store it in the variable error. Remember that you calculated
the true proportion of apples in Question 1.
In [ ]: error = ...
error
In [ ]: #: grade 2.6
_ = ok.grade('q2_6')
1.4 3. Broken Phones
A phone manufacturing company claims that it produces phones that are 99% non-faulty. In other
words, only 1% of the phones that they manufacture have some fault in them. They open a retail
shop in the friendly neighbourhood of La Jolla. Because the phones are cheap and nice, 100 UCSD
students have bought phones at this shop. However, it is soon discovered that four of the students
had faulty phones. You’re angry and argue that the company’s claim is wrong. But the company
is adamant that they are right. You decide to investigate.
Question 1. Assign null_probabilities to a two-item array such that the first element contains
the chance of a phone being non-faulty and the second element contains the chance that the
phone is faulty under the null hypothesis.
In [ ]: null_probabilities = ...
null_probabilities
In [ ]: #: grade 3.1
_ = ok.grade('q3_1')
Question 2. Simulate the buying of 100 phones 5,000 times, using the probabilities
that you assigned to null_probabilities. Create an array
simulations with the number of faulty phones in each simulation.
Note that the number of faulty phones in a simulation of sample size x is the proportion of
faulty phones in the simulation multiplied by x.
In [ ]: # Place your answer here. It may contain several lines of code.
In [ ]: #: Consider the resulting histogram of the simulations array
Table().with_column("Faulty Statistic", simulations).hist(bins=np.arange(8))
In [ ]: #: grade 3.2
_ = ok.grade('q3_2')
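A simulation under the null hypothesis could be sketched with NumPy's multinomial sampler, used here as a stand-in for datascience's sample_proportions (which returns proportions, so it would need multiplying by the sample size to get counts). The variable names and seed are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded only so the sketch is repeatable

# Null model from the company's claim: [non-faulty, faulty].
null_probs_sketch = np.array([0.99, 0.01])

# Each row is one simulated batch of 100 phones: [non-faulty count, faulty count].
counts = rng.multinomial(100, null_probs_sketch, size=5000)

# Keep only the number of faulty phones in each simulated batch.
faulty_counts = counts[:, 1]
print(faulty_counts[:10], faulty_counts.mean())
```

Under this model the average number of faulty phones per batch of 100 should be close to 1.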
Question 3. Using the results of your simulation, calculate an estimate of the p-value, i.e.,
the probability of observing four or more faulty phones under the null hypothesis. Assign your
answer to p_value_3_3.
In [ ]: p_value_3_3 = ...
p_value_3_3
In [ ]: #: grade 3.3
_ = ok.grade('q3_3')
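An empirical p-value is the share of simulated statistics that are at least as extreme as the observed one. The helper below is a hypothetical sketch, and the toy array is made-up data chosen only so the result can be checked by hand.

```python
import numpy as np

def empirical_p_value(simulated_stats, observed_stat):
    """Proportion of simulated statistics >= the observed statistic."""
    simulated_stats = np.asarray(simulated_stats)
    # The comparison yields booleans; their mean is the proportion of True.
    return np.mean(simulated_stats >= observed_stat)

# Toy data (not from a real simulation): two of the ten values are >= 4.
toy_counts = np.array([0, 1, 1, 2, 4, 0, 1, 5, 3, 1])
toy_p = empirical_p_value(toy_counts, 4)
print(toy_p)  # 0.2
```

The direction of the comparison (>=) matches a one-sided alternative in which larger values of the statistic count as more extreme.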
Question 4. Given the results of your above experiment, do you reject the null hypothesis?
Explain why.
Write your answer here.
1.5 4. Bias towards customers
The insurance company LivLife10 classifies its customers into three categories: low-income, mid-income,
and high-income. The company claims that it treats all of its customers equally and makes
no compromises on the quality of the products that it provides. You know that the company
has 10,000 customers, 40% of whom are low-income, 30% mid-income, and 30% high-income.
However, over the past year, 60% of the complaints that the company has
received are from low-income customers, 30% from mid-income customers, and 10% from high-income
customers.
In [ ]: #: Don't change the below three lines
type_of_customers = ["low-income", "mid-income", "high-income"]
proportion_of_customers = np.array([0.4, 0.3, 0.3])
proportion_of_complaints = np.array([0.6, 0.3, 0.1])
insurance_customers = Table().with_columns(
"Type of Customers", type_of_customers,
"Proportion of Customers", proportion_of_customers,
"Proportion of Complaints", proportion_of_complaints)
insurance_customers
You have a suspicion that the insurance company is biased towards its high-income customers.
That is, the insurance company is providing a better product to the high-income customers than
to others. A better product is one that generates fewer complaints. You decide to test your idea.
Your null hypothesis is:
Null hypothesis: The complaints are drawn from the population according to the proportion
of customers which are low-, mid-, and high-income.
Question 1. What is the expected proportion of complaints that should be heard
from the high-income customers under the null hypothesis? Assign your answer to
complaints_proportion_null.
In [ ]: complaints_proportion_null = ...
complaints_proportion_null
In [ ]: #: grade 4.1
_ = ok.grade('q4_1')
Question 2. You wish to check the bias in the insurance company towards different categories
of customers. However, there are three categories of customers: high-, mid-, and low-income.
Which of the following do you think is not a reasonable choice of test statistic for your
hypothesis? You may include more than one answer. Collect all your choices in a list called
unreasonable_test_statistics. For example, if you think statistics 1, 2, and 3 are unreasonable,
you should have unreasonable_test_statistics = [1,2,3].
1. Average of the absolute difference between proportion of customers and proportion of corresponding
complaints
2. Sum of the absolute difference between proportion of customers and proportion of corresponding
complaints
3. The total number of complaints that the company has received in the past year
4. The total variation distance between the probability distribution of customers and the distribution
of complaints
5. The absolute difference between the sum of proportion of customers and the sum of proportion
of corresponding complaints
6. Average of the sum of the proportion of customers and the proportion of corresponding
complaints
In [ ]: unreasonable_test_statistics = ...
unreasonable_test_statistics
In [ ]: #: grade 4.2
_ = ok.grade('q4_2')
Question 3. Say you went ahead with the total variation distance as your test statistic.
Write a function called total_variation_distance that takes in two probability distributions
as arrays and calculates the total variation distance between them.
In [ ]: # Place your answer here. It may contain several lines of code.
In [ ]: #: Use the below code to test your function
total_variation_distance(np.array([1,0,0]), np.array([0,0,1])) # Output should be 1.0
In [ ]: #: grade 4.3
_ = ok.grade('q4_3')
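The total variation distance between two distributions over the same categories is half the sum of the absolute differences of their probabilities. A minimal sketch, using the hypothetical name tvd_sketch, is:

```python
import numpy as np

def tvd_sketch(dist_a, dist_b):
    """Total variation distance between two distributions (hypothetical helper)."""
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    # Half the sum of absolute category-by-category differences.
    return np.sum(np.abs(dist_a - dist_b)) / 2

# Completely disjoint distributions are as far apart as possible.
print(tvd_sketch([1, 0, 0], [0, 0, 1]))              # 1.0

# Customer proportions versus complaint proportions from the table above.
print(tvd_sketch([0.4, 0.3, 0.3], [0.6, 0.3, 0.1]))  # about 0.2
```

Dividing by two keeps the statistic between 0 (identical distributions) and 1 (no overlap).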
Question 4. Write a simulation which computes the TVD statistic 5000 times on data generated
under the null hypothesis. Save the simulated statistics in an array called empirical_tvds.
Hint: Use sample_proportions.
In [ ]: # Place your answer here. It may contain several lines of code.
In [ ]: #: grade 4.4
_ = ok.grade('q4_4')
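Simulating the statistic under the null might be sketched as follows. Note the loud assumption: the problem text does not state how many complaints were received, so assumed_num_complaints = 1000 is a made-up sample size for illustration only. NumPy's multinomial sampler again stands in for sample_proportions, and the names and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)  # seeded only so the sketch is repeatable

customer_props = np.array([0.4, 0.3, 0.3])  # low-, mid-, high-income
assumed_num_complaints = 1000               # ASSUMPTION: not given in the problem
num_simulations = 5000

# Each row: complaint counts per category, drawn according to customer_props.
counts = rng.multinomial(assumed_num_complaints, customer_props,
                         size=num_simulations)

# Convert counts to proportions, then compute one TVD per simulated row.
simulated_props = counts / assumed_num_complaints
simulated_tvds = np.abs(simulated_props - customer_props).sum(axis=1) / 2

print(simulated_tvds.mean(), simulated_tvds.max())
```

The spread of the simulated distances depends strongly on the sample size, so a different number of complaints would give a wider or narrower histogram.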
Question 5. Calculate the total variation distance in the actual scenario, that is, the observed
scenario. Save the result in observed_tvd.
In [ ]: observed_tvd = ...
observed_tvd
Let us plot a histogram of empirical_tvds and compare that to our observed_tvd
In [ ]: #: Visualize
Table().with_column("Empirical TVDs", empirical_tvds).hist()
plt.scatter(observed_tvd, 0, color='red', s=30)
In [ ]: #: grade 4.5
_ = ok.grade('q4_5')
Question 6. Recall that the null hypothesis was that the complaints are drawn from the population
according to the proportion of customers which were low-, mid-, and high-income. Looking
at the histogram above, do you think it is likely that the null hypothesis is true? Write your answer
in the variable insurance_claim_true. The value of the boolean variable should be True if you
agree that the null hypothesis is true, and False otherwise.
In [ ]: insurance_claim_true = ...
insurance_claim_true
In [ ]: #: grade 4.6
_ = ok.grade('q4_6')
Question 7. Does rejecting the null hypothesis in this case prove (or otherwise highly suggest)
that the company is biased in its treatment of customers? Why or why not?
Write your answer here.
1.6 5. Loaded Die
... And we are back to rolling dice! A loaded die is one that is unfair, i.e., does not have equal
probability for each of the outcomes 1–6 (inclusive).
Question 1. Your friend Aby has a model that says that the die is loaded in a way such that
the probability of "1" coming up is 0.5 and all the other values have the same probabilities.
Write down Aby’s model’s distribution as an array. It should contain 6 elements, each describing
the probability of seeing the corresponding face of the die, and it should sum to 1.
In [ ]: aby_hypothesis_model_distribution = ...
aby_hypothesis_model_distribution
In [ ]: #: grade 5.1
_ = ok.grade('q5_1')
Question 2. Say we want to test Aby’s model. In particular, we wish to test if the probability
of "1" coming up is 0.5. We roll the die 10 times and get "1" a whopping 8 times. We claim that
Aby’s model is wrong. In order to substantiate our claim, we run a simulation of the die-roll.
Write a simulation and run it 5000 times, maintaining an array differences which keeps track
of the absolute difference between number of ’1’s that were seen and the expected number (5) in
each simulation.
In [ ]: # Place your answer here. It may contain several lines of code.
In [ ]: #: Visualize with a histogram
Table().with_column("Difference", differences).hist(bins=np.arange(8))
In [ ]: #: grade 5.2
_ = ok.grade('q5_2')
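A vectorized version of this kind of simulation could look like the sketch below: every row is one experiment of 10 rolls drawn from the hypothesized model, and the statistic is the distance between the count of ones and the expected count of 5. The names and seed are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)  # seeded only so the sketch is repeatable

# Hypothesized model: "1" has chance 0.5, the other five faces share the rest.
model = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
faces = np.arange(1, 7)

# 5000 experiments (rows), each consisting of 10 rolls (columns).
rolls = rng.choice(faces, size=(5000, 10), p=model)

# Count the ones in each experiment, then measure distance from the expected 5.
ones_per_trial = (rolls == 1).sum(axis=1)
abs_diffs = np.abs(ones_per_trial - 5)

print(abs_diffs[:10])
```

With 10 rolls the count of ones lies between 0 and 10, so the absolute difference can never exceed 5.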
Question 3. Recall that we saw the die come up "1" eight times. Set the variable
null_hypothesis_boolean below to True if you think Aby’s model is plausible or False if it
should be rejected.
In [ ]: null_hypothesis_boolean = ...
null_hypothesis_boolean
In [ ]: #: grade 5.3
_ = ok.grade('q5_3')
Question 4. Now, we check the p-value of our claim. That is, compute the proportion of
times in our simulation that we saw a difference of 3 or more between the number of ’1’s and the
expected number of ’1’s. Assign your result to p_value_5_4.
In [ ]: p_value_5_4 = ...
p_value_5_4
In [ ]: #: grade 5.4
_ = ok.grade('q5_4')
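As a cross-check on a simulated p-value, the exact tail probability can be computed from the binomial distribution, since the number of ones in 10 rolls under the model is binomial with success chance 0.5. This exact calculation is an addition for comparison, not the simulation-based method the question asks for; the names are hypothetical.

```python
from math import comb

n_rolls = 10
p_one = 0.5  # chance of rolling a "1" under the hypothesized model

def binom_pmf(k):
    """Chance of exactly k ones in n_rolls rolls."""
    return comb(n_rolls, k) * p_one ** k * (1 - p_one) ** (n_rolls - k)

# Sum over every count whose distance from the expected 5 is at least 3,
# i.e. 0, 1, 2, 8, 9, or 10 ones.
exact_tail = sum(binom_pmf(k) for k in range(n_rolls + 1) if abs(k - 5) >= 3)
print(exact_tail)  # 0.109375
```

A proportion from 5,000 simulations should land near this value, with some run-to-run variation.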
To submit:
1. Select Run All from the Cell menu to ensure that you have executed all cells, including the
test cells.
2. Read through the notebook to make sure everything is fine.
3. Submit using the cell below.
4. Save the PDF and submit it to Gradescope.
In [ ]: #: For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir('tests') if q.startswith('q')]
1.7 Before submitting, select "Kernel" -> "Restart & Run All" from the menu!
Then make sure that all of your cells ran without error.
In [ ]: #: submit your notebook
_ = ok.submit()
1.8 Don’t forget to submit to both OK and Gradescope!