FIT5197 2019 S1 Assignment 1 (25 marks)
18 March 2019
Contents
1 Details
2 Probabilities in Cards (2 marks)
2.1 A special flush (1 mark)
2.2 No repeats (1 mark)
3 PDF and Expectations (3 Marks)
3.1 Plot (1/2 mark)
3.2 Mean (1/2 mark)
3.3 Variance (1 mark)
3.4 Skewness (1 mark)
4 Distributions (2 marks)
4.1 Model (1 mark)
4.2 Checking (1 mark)
5 Entropy (3 Marks)
5.1 Conditional probabilities (1 mark)
5.2 Entropies (1 mark)
5.3 Coding (1 mark)
6 Maximum likelihood estimation of parameters (3 marks)
6.1 Model (1 mark)
6.2 Maximum likelihood fitting (2 marks)
7 Central Limit Theorem (7 marks)
7.1 Sampling distribution (2 marks)
7.2 Simulation (2 marks)
7.3 Plotting normality (3 marks)
Submission due date: by 11:59pm on Friday 12 April 2019 (end of Week 6)
1 Details
Marks
This assignment contains 6 questions. There are 25 marks in total for the assignment and it counts for 25%
of your overall grade for the unit. Of the 25 marks, 3 are awarded for code quality and 2 are
awarded for presentation of results, for instance use of plots. That leaves 20 marks for individual answers.
You must show all working, so we can see how you obtain your answer. Marks are assigned for working as
well as the correct answer.
Your solutions
Please put your name or student number on the first page of your solutions. Do not copy questions in your
solutions. Only include question numbers. If you use any external resources for developing your results,
please remember to provide the link to the source.
If an extension has been given then submission after the due date is allowed with no penalty being
incurred. If no extension has been given, assignments submitted after the due date will be penalised
5% per day, up to a maximum of 10 days late.
Submitting your assignment on Moodle
Please submit your assignment through Moodle by uploading a Word or PDF document as well as the R markdown
file you used to generate the results.
If you choose to knit the Word/PDF document directly from an R markdown file, you will need to type in
LaTeX equations for Questions 1, 2 and 5. Find more information about using LaTeX in R markdown files
here. You may also find the R markdown cheatsheet useful.
You can also work with Word and R markdown separately. In this case you would need to type your
answers in Word and also copy R code (using the format: Courier New), results and figures to the
Word document.
We will mark your submission mostly using your Word/PDF document. However, you need to make sure
your R markdown file is executable in case we need to check your code.
Code quality marks
Your R code will be reviewed for conciseness, efficiency, explainability and quality. Inline documentation, for
instance, should demarcate key sections and explain less obvious operations, but shouldn’t be overly verbose.
Out of the 25 marks, 3 will be awarded for code quality.
Presentation marks
Your presentation of results using R will be reviewed. How well do you use plots or other means of ordering
and conveying results? Out of the 25 marks, 2 will be awarded for presentation using R.
2 Probabilities in Cards (2 marks)
Consider a regular deck of cards with no jokers (13 cards per suit, 4 suits), giving 52 cards. Suppose we draw a
5 card hand, that is, 5 cards without replacement. For each answer write out the full calculation in R to show
your working.
Note there are $\frac{52!}{47!}$ different 5 card hands if ordering of the draw is considered, and each is equally likely. If
ordering of the draw is ignored, there are $\binom{52}{5} = \frac{52!}{47!\,5!}$ different 5 card hands.
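As a quick check of the two counts above, here is a minimal sketch in base R. The variable names are illustrative only, and `prod(52:48)` is used in place of dividing two very large factorials so that the arithmetic stays exact.

```r
# Ordered 5-card draws: 52 * 51 * 50 * 49 * 48 = 52!/47!
ordered_hands <- prod(52:48)

# Unordered 5-card hands: 52 choose 5
unordered_hands <- choose(52, 5)

# Each unordered hand corresponds to 5! orderings of the draw
orderings_per_hand <- prod(1:5)

print(ordered_hands)     # 311875200
print(unordered_hands)   # 2598960
print(ordered_hands / orderings_per_hand == unordered_hands)  # TRUE
```

Dividing a count of favourable hands by whichever of these two totals matches the way the event is defined gives the required probability.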
2.1 A special flush (1 mark)
What is the probability of getting a royal flush where the cards, ordered by rank, have alternating colour?
That is, order the cards as 10, J, Q, K, A and then check that the colours alternate. Note that in a proper
royal flush all cards are of the one suit, but here we have changed that to alternating colour. So, for example, “red 10,
black J, red Q, black K, red A” is OK but “red J, black 10, red Q, black K, red A” is not OK because once
reordered by rank the alternating colour no longer holds. Note the order in which the cards are drawn from the
pack is not considered.
HINT: This event is defined ignoring the order of the draw, so count out the number of such hands (ignoring
the order of the draw), and divide by $\binom{52}{5}$.
2.2 No repeats (1 mark)
What is the probability that in the sequence of cards, as they are drawn, no rank occurs twice in a row?
So ignoring the suit, the following are allowed: A, 10, 4, J, 10 or A, 10, A, 4, A, but the following are not
allowed: A, A, 10, 4, A (A repeated in positions 1 and 2), A, 4, 10, 10, J (10 repeated in positions 3 and 4).
HINT: This event is defined using the order of the draw, so count out the number of such hands, and divide
by 52!/47!.
3 PDF and Expectations (3 Marks)
Let X have the PDF given by a function with a different negative and positive part.
f(x) = 12
You can use Wolfram Alpha to do the definite integrals, for instance
https://www.wolframalpha.com/input/?i=integral+(1-x)%5E3+from+0+to+1
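The same kind of definite integral can also be checked numerically in R. This is a minimal sketch using base R's `integrate()` on the integrand from the Wolfram Alpha link above. The name `integrand` is illustrative only, and a numerical check of this kind does not replace showing the working.

```r
# Integrand from the Wolfram Alpha example: (1 - x)^3 on [0, 1]
integrand <- function(x) (1 - x)^3

# Numerical quadrature; the exact value of this integral is 1/4
result <- integrate(integrand, lower = 0, upper = 1)

print(result$value)
```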
3.1 Plot (1/2 mark)
Draw the plot in R.
3.2 Mean (1/2 mark)
Find E(X). Why is it not zero?
3.3 Variance (1 mark)
Find the variance, Var(X).
3.4 Skewness (1 mark)
Find skewness, using the formula in the lecture notes. Interpret the value.
4 Distributions (2 marks)
One study has evaluated a number of leukaemia records in a rural area. The population of the area was
35,000. In a year there were 16 leukaemia cases identified, of which 4 were not local residents but tourists
or new immigrants (of which there are not many). In the general population, the annual rate of leukaemia is
typically about one in 10,000.
4.1 Model (1 mark)
Describe the model you recommend to use for the counts, and estimate the parameters using suitable point
estimates.
4.2 Checking (1 mark)
Also, consider the hypothesis, “the annual rate of leukaemia in the area is 1/10,000”. Assume this is the rate
for the residents only. Plot the distribution over counts under this hypothesis. Where does your data lie, and
do you think it is consistent with the hypothesis?
5 Entropy (3 Marks)
In this question, we will use a modified version of the Titanic dataset from the Kaggle competition, Titanic:
Machine Learning from Disaster. The dataset includes information about passenger characteristics as well as
whether they survived the disaster.
Import the Titanic data using the following R code:
df <- read.csv("Titanic.csv",header=TRUE, sep=",")
Now Survived is Boolean so convert to a truth value with:
df[['Survived']] <- df[['Survived']]==1
5.1 Conditional probabilities (1 mark)
Compute tables for the frequency estimates of P(Survived), P(Survived|Pclass = val) and
P(Survived|Gender = val), for different vals. Do the computation in R. It is OK to present
the final table as a separate Word table (since it might be hard to lay out in R). What does this tell you
about survival?
5.2 Entropies (1 mark)
Calculate the entropy (using log2()) of Survived, H(Survived), the conditional entropy of Survived given
Pclass, H(Survived|Pclass), and of Survived given Gender, H(Survived|Gender). Do not use an entropy
function but write the code yourself. Use R functions table() and prop.table() to gather stats and form
probabilities from the data frame. What do these three entropies tell you about Survived?
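The mechanics of table() and prop.table() can be sketched on a tiny made-up data frame. Everything here is hypothetical: the `toy` data, its `group` and `outcome` columns, and the `entropy_bits` helper are illustrative names and are not part of the Titanic file. The sketch assumes the usual definition of conditional entropy as a weighted average of within-group entropies.

```r
# Toy data (hypothetical, not the Titanic file)
toy <- data.frame(
  group   = c("a", "a", "b", "b"),
  outcome = c(TRUE, FALSE, TRUE, TRUE)
)

# Entropy in bits of a probability vector; zero-probability cells contribute 0
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# H(outcome): marginal proportions, then the entropy formula
p_outcome <- prop.table(table(toy$outcome))
h_outcome <- entropy_bits(p_outcome)

# H(outcome | group) = sum over groups of P(group) * H(outcome | group)
joint   <- table(toy$group, toy$outcome)
p_group <- prop.table(rowSums(joint))
p_cond  <- prop.table(joint, margin = 1)   # each row sums to 1
h_cond  <- sum(p_group * apply(p_cond, 1, entropy_bits))

print(h_outcome)
print(h_cond)
```

In this toy case the conditional entropy is lower than the marginal entropy, because knowing the group removes some of the uncertainty about the outcome.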
5.3 Coding (1 mark)
Consider the joint space (Survived, Pclass) which has six outcomes, (True, 1), (True, 2), (True, 3), (False, 1),
(False, 2), (False, 3). Develop an efficient binary prefix code to transmit these outcomes. Would it be
adequate to just provide the codelengths, or is a code needed too? Justify your answer.
6 Maximum likelihood estimation of parameters (3 marks)
One of the central problems of sensory neuroscience is to separate the recordings of background physiological
processes that are irrelevant (noise) from neural responses that are of experimental interest (signal). This
is by no means an easy task, as the signals that neurons produce when they fire are extremely weak and
highly random. It is therefore of particular interest to examine the randomness of neural signals, as this allows
researchers to study the brain at a cellular level.
Let’s assume that we have conducted one experiment and recorded the spike signals from one particular
neuron for a duration of 15 seconds. After some data processing, we can obtain spike signals with data given
by a time in seconds and a spike size, similar to the following data, which is also shown in Figure 1.
n <- 30
times <- c(0.8670763, 1.2550631, 1.3463051, 2.6999393, 3.5238785, 4.8215638, 4.8502006,
5.2372364, 5.3201143, 6.2835730, 7.6961491, 8.0164785, 8.6279902, 9.1390150,
9.5136710, 9.9207854, 9.9795974, 10.0242579, 10.1622076, 10.5968354,
11.6766725, 12.3441424, 12.7731282, 12.8911034, 13.0458095, 13.4280567,
14.2443711, 14.4219672, 14.7461019, 14.7726211)
spikes <- c(0.220136914, 1.252061356, 0.943525370, 0.907732787, 1.157388806, 0.342485956,
0.291760012, 0.556866189, 0.738992636, 0.690779640, 0.425849738, 0.876344116,
1.248761245, 0.697514552, 0.174445203, 1.376500202, 0.731507303, 0.483036515,
0.650835440, 1.106788259, 0.587840538, 0.978983532, 1.179754064, 0.941462421,
0.749840071, 0.005994156, 0.664525928, 0.816033621, 0.483828371, 0.524253461)
6.1 Model (1 mark)
Let us assume that the rate of signals remains constant over time, and that the size of each signal is independent
of time too. If the rate of the signals remains constant over time, which distribution would be most suitable for
modelling the number of spike signals over 15 seconds? Why? Briefly answer this
question in a sentence or two. Also, while we don’t know enough to suggest a distribution for spike sizes,
what properties should it have?
Figure 1: Spike data.
6.2 Maximum likelihood fitting (2 marks)
Using the model above, what is the log-likelihood function for the number of spike signals over the duration of
the experiment, and what is the maximum likelihood estimate for its parameters?
You’re told that a candidate distribution for spike sizes is the Weibull with shape given by 0.7 and unknown
scale, between 0.5 and 2. This is supported in R using the [dpqr]weibull() functions. One can do maximum
likelihood fitting using the Weibull density on the unknown parameter. Use the optimize() function for that,
so something like
optimize(fn, c(minvalue, maxvalue), maximum = TRUE, tol = .Machine$double.eps^0.25)
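To show how the pieces of that call fit together, here is a minimal sketch on simulated stand-in data rather than the recorded spikes. The data `x`, the true scale of 1.2, the sample size of 200 and the function name `loglik_scale` are all assumptions made for illustration. The closed-form comparison relies on the standard result that, for a known shape k, the Weibull scale estimate is mean(x^k)^(1/k).

```r
set.seed(1)

# Simulated stand-in data (hypothetical): Weibull with shape 0.7 and scale 1.2
x <- rweibull(200, shape = 0.7, scale = 1.2)

# Log-likelihood as a function of the scale only, with the shape held at 0.7
loglik_scale <- function(scale) {
  sum(dweibull(x, shape = 0.7, scale = scale, log = TRUE))
}

# One-dimensional search over the allowed interval for the scale
fit <- optimize(loglik_scale, c(0.5, 2), maximum = TRUE,
                tol = .Machine$double.eps^0.25)

# Closed-form scale estimate for a known shape, used only as a cross-check
closed_form <- mean(x^0.7)^(1 / 0.7)

print(fit$maximum)   # scale value that maximises the log-likelihood
print(closed_form)   # should agree with fit$maximum up to the search tolerance
```

Summing log densities rather than multiplying densities keeps the objective numerically stable, which is why `log = TRUE` is passed to `dweibull()`.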
7 Central Limit Theorem (7 marks)
Assume that we draw random integers from a Poisson distribution whose rate is one of λ1 = 1, λ2 = 5, or λ3 = 20.
7.1 Sampling distribution (2 marks)
According to the Central Limit Theorem, what is the limiting distribution for the sample mean, for the three
rates λ1, λ2, λ3, when we have sample sizes of 10, 100, 1000 and 10000? Give the theory then compute the
parameter values in R.
Bonus question for HD students giving bonus 1 mark (added to final mark for Assignment 1 only if the final
mark is 24 or less): what is the limiting distribution for the sample variance? This is not really a solvable
problem, so approximate it for just λ3 = 20.
7.2 Simulation (2 marks)
Experimentally justify the result in the CLT that says the sample mean has a mean given by the population
mean and a variance given by the population variance divided by sample size. See the Central Limit Theorem in
Lecture 4. Use simulation with sample sizes of 10, 100 and 1000. For each sample size use 50000
simulations to compute samples and their means. From these means compute the mean and variance of the
sample means, and discuss how the results reflect the CLT. Plot the results (3 sample sizes and 3 rates with
mean and SD) to demonstrate any effects you want to discuss.
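The simulation idea can be sketched for a single rate and a single sample size. This is only an illustration under stated assumptions: the values of `lambda`, `n` and `sims` are chosen for speed (far fewer than the 50000 simulations required), and the variable names are hypothetical.

```r
set.seed(42)

lambda <- 5      # illustrative rate
n      <- 100    # illustrative sample size
sims   <- 2000   # far fewer than the 50000 the question asks for

# One row per simulated sample, then one mean per row
samples      <- matrix(rpois(sims * n, lambda), nrow = sims, ncol = n)
sample_means <- rowMeans(samples)

# For a Poisson, the population mean and variance both equal lambda
print(c(simulated = mean(sample_means), theory = lambda))
print(c(simulated = var(sample_means),  theory = lambda / n))
```

Filling a matrix in one call to `rpois()` and using `rowMeans()` avoids an explicit loop, which matters once the number of simulations grows to 50000.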
7.3 Plotting normality (3 marks)
For rates λ1 = 1 and λ2 = 5 and sample sizes 10 and 100, obtain the z scores of the sample means (from
50000 simulations). Plot their distributions in a histogram with the theoretical Gaussian curve overlaid.
Note that for sample size 100, the plots overlay very nicely. But what happens with sample size 10? Explain the
differences between the four plots.
For each simulation, the z score of the mean can be calculated as:
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
where $\bar{X}$ is the mean of the sample, μ is the population mean, σ is the population SD and n is the sample size.
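A minimal sketch of the standardisation and the overlay, assuming the usual form z = (sample mean − μ) / (σ / √n). It uses one illustrative combination of rate and sample size and a reduced number of simulations; the variable names and plot settings are hypothetical choices, not a prescribed layout.

```r
set.seed(7)

lambda <- 5      # illustrative rate
n      <- 100    # illustrative sample size
sims   <- 2000   # reduced from 50000 to keep the sketch fast

# One sample mean per simulated sample
sample_means <- rowMeans(matrix(rpois(sims * n, lambda), nrow = sims))

# Standardise: for a Poisson, mu = lambda and sigma = sqrt(lambda)
z <- (sample_means - lambda) / (sqrt(lambda) / sqrt(n))

# Density-scaled histogram with the standard normal curve drawn on top
hist(z, breaks = 40, freq = FALSE,
     main = "z scores of sample means", xlab = "z")
curve(dnorm(x), add = TRUE, lwd = 2)
```

Setting `freq = FALSE` puts the histogram on the density scale, so that its height is directly comparable with `dnorm()`.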