联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2021-02-08 11:10

DS 5230 Unsupervised Machine Learning and

Data Mining – Spring 2021 – Homework 1

Submission Instructions

• It is recommended that you complete this exercises in Python 3 and submit

your solutions as a Jupyter notebook.

• You may use any other language, as long as you include a README with

simple, clear instructions on how to run (and if necessary compile) your code.

• Please upload all files (code, README, written answers, etc.) to GradeScope

in a single zip file.

Exercise 1: Understanding Apriori and FP growth

1. Consider a dataset for frequent set mining as in the following table where we

have 6 binary features and each row represents a transaction.

TID Items

a. Illustrate the first three passes of the Apriori algorithm (set sizes 1, 2 and

3) for support threshold of 3 transactions. For each stage, list the candidate

sets Ck and the frequent sets Lk. What are the maximal frequent sets

discovered in the first 3 levels?

b. Pick one of the maximal sets that has more than 1 item, and check if any

of its subsets are association rules with support (i.e. frequency) at least

0.3 and confidence at least 0.6. Please explain your answer and show your

work.

2. Given the following transaction database, let the min support = 2, answer the

following questions.

a. Construct FP-tree from the transaction datasbase and draw it here.

b. Draw d’s conditional FP-tree, and find frequent patterns (i.e.itemsets)

based on d’s conditional FP-Tree.

Exercise 2: Probability Basics

1. Let X and Y be two independent random variables with densities p(x) and p(y),

respectively. Show the following two properties:

Ep(x,y)

[X + aY ] = Ep(x)

[X] + aEp(y)

[Y ] (1)

Varp(x,y)

[X + aY ] = Varp(x)

[X] + a

2Varp(y)

[Y ] (2)

for any scalar constant a ∈ R. Hint: use the definition of expectation and

variance,

Ep(x)

[X] = Z

x

p(x)xdx (3)

varp(x)

[X] = Ep(x)

[X

2

] − E

2

p(x)

[X] (4)

2. Let X be a random variable with Beta distribution,

p(x; α, β) = x

α−1

(1 − x)

β−1

B(α, β)

(5)

where B(α, β) is beta function. Prove that

E[X] = α

α + β

(6)

var[X] = αβ

(α + β)

2

(α + β + 1) (7)

2

3. Suppose we observe N i.i.d data points D = {x1, x2, ..., xN }, where each

xn ∈ {1, 2, ..., K} is a random variable with categorical (discrete) distribution

parameterized by θ = (θ1, θ2, ..., θK), i.e.,

xn ∼ Cat(θ1, θ2, ..., θK), n = 1, 2, ..., N (8)

In detail, this distribution means that for a specific n, the random variable xn

follows P(xn = k) = θk, k = 1, 2, ..., K.

Equivalently, we can also write the density function of a categorical distribution

as

p(xn) = Y

K

k=1

θ

I[xn=k]

k

(9)

where I[xn = k] is called identity function, and defined as

I[xn = k] = 

1, if xn = k

0, otherwise (10)

a. Now we want to prove that the joint distribution of multiple i.i.d categorical

variables is a multinomial distribution. Show that the density function of

D = {x1, x2, ..., xN } is

p(D|θ) = Y

K

k=1

θ

Nk

k

(11)

where Nk =

PN

n=1 I[xn = k] is the number of random variables belonging

to category k. In other word, D = {x1, x2, ..., xN } follows a multinomial

distribution.

b. We often call p(D|θ) likelihood function, since it indicates the possibility

we observe this dataset given the model parameters θ. By Bayes rule, we

can rewrite the posterior as

p(θ|D) = p(D|θ)p(θ)

p(D)

(12)

where p(θ) is piror distribution which indicates our preknowledge about

the model parameters. And p(D) is the distribution of the observations

(data), which is constant w.r.t. posterior. Thus we can write

p(θ|D) ∝ p(D|θ)p(θ) (13)

If we assume the Dirichlet prior on θ, i.e.,

p(θ; α1, α2, ..., αK) = Dir(θ; α1, α2, ..., αK) = 1

where B(α) is Beta function and α = (α1, α2, ..., αK).

3

Now try to derive the joint distribution p(D, θ) and ignore the constant

term w.r.t. α. Show that the posterior is actually also Dirichlet and

parameterized as follows:

p(θ|D) = Dir(θ; α1 + N1, α2 + N2, ..., αK + NK) (15)

[In fact, this nice property is called conjugacy in machine learning. A general

statement is : If the prior distribution is conjuagate to the likelihood, then

the posterior will be the same distribution as the prior distribution. Search

conjugate prior and exponential family for more detail if you are interested.]

Before you work on implementation, you need to install Jupyter and PySpark

by reading Instructions on PySpark Installation.pdf

Exercise 3: Exploratory Analysis and Data Visualization

In this exercise, we will be looking at a public citation dataset from Aminer (https:

//aminer.org/), a free online service used to index and search academic social

networks. You will perform some exploratory analysis and data visualization for this

dataset. The dataset is up to the year 2012 and can be downloaded in the attached

file called q3 dataset.txt. We show an example item format in README.txt.

The ArnetMiner public citation dataset is a real world dataset containing lots of

noise. For example, you may see a venue name like “The Truth About Managing

People...And Nothing But the Truth”. However, you are not expected to do data

cleaning in this phase.

1. Count the number of distinct authors, publication venues (conferences and

journals), and publications in the dataset.

a. List each of the counts.

b. Are these numbers likely to be accurate? As an example look up all the

publications venue names associated with the conference “Principles and

Practice of Knowledge Discovery in Databases”1

.

c. In addition to the problem in 1.b, what other problems arises when you

try to determine the number of distinct authors in a dataset?

2. We will now look at the publications associated with each author and venue.

a. For each author, construct the list of publications. Plot a histogram of the

number of publications per author (use a logarithmic scale on the y axis).

1https://en.wikipedia.org/wiki/ECML_PKDD

4

b. Calculate the mean and standard deviation of the number of publications

per author. Also calculate the Q1 (1st quartile2

), Q2 (2nd quartile, or

median) and Q3 (3rd quartile) values. Compare the median to the mean

and explain the difference between the two values based on the standard

deviation and the 1st and 3rd quartiles.

c. Now construct a list of publications for each venue. Plot a histogram of

the number of publications per venue. Also calculate the mean, standard

deviation, median, Q1 and Q3 values. What is the venue with the largest

number of publications in the dataset?

3. Now construct the list of references (that is, the cited publications) for each

publication. Then in turn use this set to calculate the number of citations for

each publication (that is, the number of times a publication is cited).

a. Plot a histogram of the number of references and citations per publication.

What is the publication with the largest number of references? What is

the publication with the largest number of citations?

b. Calculate the so called impact factor for each venue. To do so, calculate

the total number of citations for the publications in the venue, and then

divide this number by the number of publications for the venue. Plot a

histogram of the results.

c. What is the venue with the highest apparent impact factor? Do you believe

this number?

d. Now repeat the calculation from part b, but restrict the calculation to

venues with at least 10 publications. How does your histogram change?

List the citation counts for all publications from the venue with the highest

impact factor. How does the impact factor (mean number of citations)

compare to the median number of citations?

e. Finally, construct a list of publications for each publication year. Use

this list to plot the average number of references and average number of

citations per publication as a function of time. Explain the differences you

see in the trends.

Exercise 4: Market Basket Analysis of Academic Communities

In this problem, you will try to apply frequent pattern mining techniques to the real

world bibliographic dataset from Aminer (https://aminer.org/). One thing worth

noting is that you are required consider the whole dataset, instead of running with

part of the dataset. You may use any Apriori or FP-growth implementation that is

made available in existing libraries. We recommend that you use the implementation

in Spark (http://spark.apache.org/).

2https://en.wikipedia.org/wiki/Quartile

5

1. The dataset included with this problem is q4 dataset.txt. Parse this data, and

comment on how it differs from the previous file (q3 dataset.txt), in terms of

number of publications, authors, venues, references, and years of publication.

2. Coauthor discovery: Please use FP-Growth to analyze coauthor relationships,

treating each paper as a basket of authors.

a. What happens when you successively decrease the support threshold using

the values {1e-4, 1e-5, 0.5e-5, 1e-6}?

b. Keep threshold = 0.5e-5 and report the top 5 co-authors for these researchers:

Rakesh Agrawal, Jiawei Han, Zoubin Ghahramani and Christos Faloutsos

according to frequency.

3. Academic community discovery: In computer science communities tend to

organize around conferences. Here are 5 key conferences for areas of data science

• Machine learning: NIPS/NeurIPS (Neural Information Processing Systems)3

• Data mining: KDD (Conference on Knowledge Discovery and Data Mining)

• Databases: VLDB (Very Large Data Bases)

• Computer networks: INFOCOM (International Conference on Computer

Communications)

• Natural language processing: ACL (Association for Computational Linguistics)

a. We will now use FP growth to analyze academic communities. To do so,

represent each author as a basket in which the items are the venues in which

the author has at least one publication. What happens as you decrease the

support threshold using values {1e-3, 5e-4, 1e-4}?

b. Keep the threshold=5e-4 and report results. For each of those 5 key

conferences, please report the top 3 venues that authors also publish in.

3NIPS is today abbreviated as NeurIPS, but this dataset only contains references to NIPS.

6


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp