Date: 2020-05-28 10:09

INFS7450 Social Media Analytics

Project 2 – Link Prediction

Semester 1, 2020

Goal: The purpose of this project is to help students gain practical experience with, and understand the concepts behind, a popular data mining task: link prediction in social networks.

Dataset: In this project, you will be working with a co-author network. The dataset contains the following files:

*training.txt: This file contains the training data set for developing your prediction methods. Each line of this file is a link in the network during the training time period.

*val_positive.txt and val_negative.txt: These files form the validation set. They contain the validation links for tuning and validating your methods.

*test.txt: This is the test set, which contains the unlabeled edges to be ranked.

*example.txt: This is an example result file. You must follow the format of this file when you submit your results.

The dataset is available from UQ blackboard. See /Assessment/INFS7450 Project Two.
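As a starting point, the edge files can be read into an adjacency structure before any scoring is done. The sketch below uses only the Python standard library and assumes that each line holds two whitespace-separated node IDs; the brief does not spell out the file format, so check it against the actual files. The names `parse_edges` and `build_adjacency` are illustrative, not part of the assignment.

```python
from collections import defaultdict


def parse_edges(lines):
    """Parse an iterable of text lines into a list of (u, v) node-ID pairs.

    Assumption: each non-empty line holds two whitespace-separated node IDs.
    """
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        edges.append((parts[0], parts[1]))
    return edges


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj[u].add(v)
        adj[v].add(u)
    return adj


# Tiny in-memory stand-in for training.txt (made-up edges, not real data).
sample_lines = ["1 2", "2 3", "1 3", "", "3 4"]
train_edges = parse_edges(sample_lines)
adj = build_adjacency(train_edges)
```

With the real data, the same functions could be applied to an open file handle, for example `with open("training.txt") as f: train_edges = parse_edges(f)`.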

Task Description:

The provided co-author network has 5,242 nodes and 11,696 edges. The edges of the whole network are split into three parts:

*E_train: 11,496 edges.

*E_validation: 100 positive edges in val_positive.txt, which were randomly removed from the complete dataset, and 10,000 negative edges in val_negative.txt, which were generated at random and overlap neither E_train nor the 100 positive edges of E_validation.

*E_test: 100 positive edges and 10,000 negative edges, constructed in the same way but not overlapping E_validation. These edges are unlabeled.

Based on the given training and validation sets of the co-author network, you are required to write a program that ranks the unlabeled edges in the test set. For each pair of nodes in the test set, your program should compute a proximity score. Rank the 10,100 pairs of nodes by proximity score in descending order and output the Top-100 edges (or pairs of nodes) with the highest proximity scores.
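The score-then-rank procedure described above can be sketched as follows. The brief does not prescribe a proximity measure; this example uses the Adamic-Adar index (the sum of 1/log(degree) over common neighbours) purely as one possible choice. The function names and the toy graph are hypothetical, and the code uses only the standard library.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def adamic_adar(adj, u, v):
    """Adamic-Adar proximity: sum of 1/log(deg(w)) over common neighbours w."""
    common = adj.get(u, set()) & adj.get(v, set())
    # Guard against degree-1 nodes, for which log(1) == 0.
    return sum(1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1)


def rank_top_k(adj, candidate_pairs, k=100):
    """Score every candidate pair and return the k highest-scoring pairs."""
    scored = [(adamic_adar(adj, u, v), (u, v)) for u, v in candidate_pairs]
    # Sort by score only, descending; ties keep their input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:k]]


# Toy training graph and candidate pairs (made up for illustration).
toy_train = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
toy_candidates = [("a", "d"), ("b", "e"), ("c", "e"), ("a", "e")]

toy_adj = build_adjacency(toy_train)
top_pairs = rank_top_k(toy_adj, toy_candidates, k=2)
```

For the project itself, `candidate_pairs` would be the 10,100 pairs from test.txt and `k` would stay at 100. Other measures, such as common neighbours or the Jaccard coefficient, fit the same `rank_top_k` skeleton by swapping the scoring function.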

Input: The provided network datasets.

Output: The predicted Top-100 edges.

Programming Languages:

Python and NetworkX are recommended. However, you may use any programming language you prefer, including but not limited to Python, MATLAB, Java, C, and C++.

Deliverables:

Your submission must include the following:

1. A source code file.

2. A report (.pdf). See the given appendix for an example template.

3. A text file of the predicted Top-100 node pairs (edges). The format of this file must follow that of the provided example file, example.txt.

4. Name all the submitted files after your student ID. For example, 41234567.zip for the source code, 41234567.txt for your submitted results and 41234567.pdf for your report.

5. Submit one archive file with your student number as the file name (e.g. 41234567.zip). Make sure that all the files mentioned above are in the archive file.

Marking Criteria (Total marks: 15):

• 15 marks = 4 marks (code) + 7 marks (results) + 4 marks (report)

• Your results should be reproducible and your code should be readable. If your code cannot be executed or does not generate the results as reported, the corresponding marks for the code and results will be deducted.

• Results Mark = Accuracy * 7

where Accuracy = (number of correctly predicted edges) / 100. Accuracy is calculated by comparing your submitted results against the ground truth. Please note that the ground truth will not be released until the due date. You should evaluate your code using the validation data set.
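The accuracy formula above can be checked locally on the validation set by ranking the 100 positive and 10,000 negative validation pairs together and counting how many of the Top-100 are true edges. The sketch below is a minimal illustration with hypothetical function names. It assumes that a pair (u, v) matches (v, u), since co-authorship is undirected; the brief does not state how pairs are compared.

```python
def accuracy_at_k(predicted_pairs, positive_pairs, k=100):
    """Number of true edges among the first k predictions, divided by k."""
    # Assumption: edges are undirected, so compare pairs as unordered sets.
    truth = {frozenset(pair) for pair in positive_pairs}
    hits = sum(1 for pair in predicted_pairs[:k] if frozenset(pair) in truth)
    return hits / k


def results_mark(accuracy, max_marks=7):
    """Results Mark = Accuracy * 7, as in the marking criteria."""
    return accuracy * max_marks


# Made-up example with k=4 instead of 100 to keep it readable.
predicted = [("1", "2"), ("3", "4"), ("5", "6"), ("7", "8")]
positives = [("2", "1"), ("5", "6"), ("9", "10")]

acc = accuracy_at_k(predicted, positives, k=4)
mark = results_mark(acc)
```

On the validation data, `predicted` would be the Top-100 ranking over the combined contents of val_positive.txt and val_negative.txt, and `positives` would be the pairs in val_positive.txt.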

