联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Database作业Database作业

日期:2021-05-25 11:19

Cardiff School of Computer Science and Informatics

Module Code: CMT655

Module Title: Manipulating and Exploiting Data

Assessment Title: Course Portfolio

Assessment Number: 1

Date Set: Friday 26th March 2021

Submission Date and Time: May 31st 2021 9:30

Return Date: June 30th 2021

This assignment is worth 100% of the total marks available for this module. If coursework

is submitted late (and where there are no extenuating circumstances):

1 If the assessment is submitted no later than 24 hours after the deadline, the mark

for the assessment will be capped at the minimum pass mark;

2 If the assessment is submitted more than 24 hours after the deadline, a mark of 0

will be given for the assessment.

Your submission must include the official Coursework Submission Cover sheet, which can

be found here:

https://docs.cs.cf.ac.uk/downloads/coursework/Coversheet.pdf

Submission Instructions

Your coursework should be submitted via Learning Central by the above deadline. It

consists of a portfolio divided in three assessments.

Assessment (1) consists of a set of exercises. The final deliverable consists of two

jupyter notebooks.

Assessment (2) is a machine-learning powered web service which is able to

train, evaluate and run predictions on unseen data, as well as storing model

configuration and results in a database. The deliverable is a zip file with the

application source code, a README.txt file and (optionally) a requirements.txt

file which lists dependencies and versions the app would require to run.

Assessment (3) is a reflective report (up to 2,000 words) describing solutions,

design choices and a reflection on the main challenges and ethical considerations

addressed during the development of solutions for assessments (1) and (2).

You have to upload the following files:

Any deviation from the submission instructions above (including the number and types of

files submitted) may result in a mark of zero for the assessment or question part.

Assignment

In this portfolio, students demonstrate their familiarity with the topics covered in the module

via three separate assessments.

Deliverable

Assessment 1

The deliverable for Assessment 1 consists of 2 jupyter notebook (.ipynb) files. They will be

submitted with all the output cells executed in a fresh run (i.e., Kernel -> Restart and run

all). 20 marks.

Assessment 2

The deliverable for Assessment 2 will be a zip file containing the webapp code, a

README.txt and an optional requirements.txt file, which will list the dependencies the app

requires. 25 marks.

Assessment 3

The deliverable for Assessment 3 will be a PDF file based on the .docx template provided

for this assessment in the starter package, available at Learning Central. 55 marks.

Description Type Name

Cover sheet  Compulsory One PDF (.pdf) file [student number].pdf

Assessment

1

Compulsory One Jupyter notebook (.ipynb) assessment1_db_creation_[stude

nt number].ipynb

Assessment

1

Compulsory One Jupyter notebook (.ipynb) assessment1_queries_[student

number].ipynb

Assessment

2

Compulsory One zip file (.zip) assessment2_webapp_[student

number].zip

Assessment

3

Compulsory One PDF (.pdf) file assessment3_report_[student

number].pdf

Assessment 1

In Assessment 1, students solve two main types of challenges. These challenges are: (1)

data modeling and (2) database querying.

DATA MODELING AND QUERYING (20 Marks)

assessment1_db_creation_[student number].ipynb

assessment1_queries_[student number].ipynb

1. Data modeling (8 marks)

You are given an initial .csv dataset from Reddit (data_portfolio_21.csv,

available in the starter package in Learning Central). This data dump contains posts

extracted from Covid-related subreddits, as well as random subreddits. Your first

task is to process this dump and design, create and implement a relational

(MySQL) database, which you will then populate with all the posts and related data.

This dataset has information about three entities: posts, users and subreddits.

The column names are self-explanatory: columns starting with the prefix user_

describe users, those starting with the prefix subr_ describe subreddits, the column

subreddit is the subreddit name, and the rest of the columns are post attributes

(author, post date, post title and text, number of comments, score, favorited by,

etc.).

What to implement: Start from the notebook assessment1_db_creation_[student

number].ipynb, replacing [student number] with your student number. Implement

the following (not necessarily in this order):

-- Python logic for reading in the data. [2 marks]

-- SQL code for creating tables. [3 marks]

-- SQL code for populating tables. [3 marks]

Use comments or markdown along the way to explain how you dealt with

issues such as missing data, non-standard data types, multivalued columns, etc.

You are not required to explain the database design (normalization, integrity,

constraints, etc) process in this notebook, as there is a dedicated part of the report

in Assessment 3 for this. However, you can include pointers to design choices for

facilitating the understanding of your implementation.

All your code should be self-contained in Python code, and therefore you will have

to rely on a MySQL library for executing SQL statements and queries. Please use

pymysql, the one we have used in class.

You should submit your notebook with all the cells executed, from start to

finish, in a fresh run (i.e., first cell number should be [1], second [2], etc.). You can

achieve this by selecting Kernel -> Restart and run all. At the end of the run, your

notebook should have populated a database in the university server which you will

have created exclusively for this coursework.

2. Querying (12 marks)

You are given a set of questions in natural language, for which you must implement

queries to find the answer. While the queries will be answered in the provided

jupyter notebook, they will have to be written in SQL, i.e., you cannot use Python to

solve them.

What to implement: Start from the notebook assessment1_queries_[student

number].ipynb, replacing [student number] with your student number. All the logic

should be contained inside the provided (empty) functions. Then, a call to each

function should show the output of these queries. You are also required to submit

your notebook after a fresh run (Kernel -> Restart and run all).

The questions are:

1 - Users with highest scores over time [0.5 marks]

● Implement a query that returns the users with the highest aggregate scores (over all their

posts) for the whole dataset. You should restrict your results to only those whose aggregated

score is above 10,000 points, in descending order. Your query should return two columns:

username and aggr_scores.

2 - Favorite subreddits with numbers but not 19 [0.5 marks]

● Implement a query that returns the set of subreddit names who have been favorited at least

once and that contain any number in their name, but you should exclude those with the digit

'19', as we want to filter out COVID-19 subreddit names. Your query should only return one

column: subreddit.

3 - Most active users who add subreddits to their favorites. [0.5 marks]

● Implement a query that returns the top 20 users in terms of the number of subreddits they have

favorited. Since several users have favorited the same number of subreddits, you need to

order your results, first, by number of favourites per user, and secondly, alphabetically by user

name. The alphabetical order should be, first any number, then A-Z (irrespective of case). Your

query should return two columns: username and numb_favs.

4 - Awarded posts [0.5 marks]

● Implement a query that returns the number of posts who have received at least one award.

Your query should return only one value.

5 - Find Covid subreddits in name and description. [1 mark]

● Implement a query that retrieves the name and description of all subreddits where the name

starts with covid or corona and the description contains covid anywhere. The returned table

should have two columns: name and description.

6 - Find users in haystack [1 mark]

● Implement a query that retrieves only the names of those users who have at least 3 posts with

the same score as their number of comments, and their username contains the string meme

anywhere. Your returned table should contain only one column: username.

7 - Subreddits with the highest average upvote ratio [1 mark]

● Implement a query that shows the 10 top subreddits in terms of the average upvote ratio of the

users that posted in them. Your query should return two columns: subr_name and

avg_upv_ratio.

8 - What are the chances [1 mark]

● Implement a query that finds those posts whose length (in number of characters) is exactly the

same as the length of the description of the subreddit in which they were posted on. You

should retrieve the following columns: subreddit_name, posting_user, user_registered_at,

post_full_text, post_description and dif (which should show the difference in characters

between the subreddit description and the post.

9 - Most active December 2020 days. [1 mark]

● Write a query that retrieves only a ranked list of the most prolific days in December 2020,

prolific measured in number of posts per day. Your query should return those days in a

single-column table (column name post_day) in the format YYYY-MM-DD.

10 - Top 'covid'-mentioning users. [1 mark]

● Retrieve the top 5 users in terms of how often they have mentioned the term 'covid' in their

posts. Your query should return two columns: username and total_count. You will consider an

occurrence of the word 'covid' only when it appears before and after a whitespace (i.e.,

<space>covid<space>) and irrespective of case (both <space>Covid<space> and

<space>covid<space> would be valid hits).

11 - Top 10 users whose posts reached the most users, but only in their favorite subreddits. [2

marks]

● Write a query to retrieve a list of 10 users sorted in descending order by the number of users

their posts reached, considering only the subset of users belonging to their favourite

subreddits. Your query must return only one column: username.

12 - Users with high score for their posts. [2 marks]

● Retrieve the number of users with an average score for their posts which is higher than the

average score for the posts in our dataset. Your query should return only one result, under the

column result.

Assessment 2

In Assessment 2, you implement a Flask application which manages a machine-learning

based text classifier and speaks to both MongDB and MySQL databases.

WEBAPP (25 Marks)

assessment2_webapp_[student number].zip

In this assessment the goal is to set up a web service based on Flask which will sit on top

of the database you built in Assessment 1, and will have a machine learning component.

Specifically, the app will have several functionalities for training, evaluating and deploying

a covid-or-not classifier, which will take as input a message posted in social media (e.g.,

Reddit), and predicts whether it is about Covid-19 or not.

What to implement: You will pull your data from the MySQL database that you

implemented in Assessment 1. Then, your task is to develop a web application based on

Flask which will have the following functionalities:

a) Run a classification experiment and store results, models and configuration in a

MongoDB database;

b) Retrieve results for the experiments done so far, ranked based on a criterion of

your choice; and

c) Perform inference, i.e., given a piece of text provided by the user, predict whether

it is about Covid-19 or not.

You are provided with an ‘empty’ skeleton which contains starter HTML and Python code.

Your task is to implement the backend logic following the detailed instructions below. The

provided skeleton has an index.html page which has the layout shown in Figure 1:

Figure 1: Landing page of the ML-powered Flask application.

Then, your task will be to implement the backend logic that is triggered when each of the

three buttons shown in Figure 1 are clicked on. The logic corresponding to each of these

buttons is explained in detail below.

1. Run a classification experiment [15 marks]

In this exercise, you have to implement the following workflow:

a) Reset, create and verify two VIEWS, which you will call training_data and

test_data. These VIEWS should contain non-overlapping posts which you will use

to train and evaluate your classifier. The logic will be implemented in the following

(empty) functions, located in the main.py script, provided in the starter package:

- reset_views() - Drop (if they exist) and create the views. [1 mark]

- create_training_view() - Create an SQL VIEW for training data. This

view will have two columns, the text and the label. It is up to you to decide the size

of the dataset and the proportion of Covid vs. non-Covid posts. And also which part

of the post you take (title, body or both). You will justify these choices in

Assessment 3. We will make the strong assumption that any post submitted to a

Covid-related subreddit will be about Covid-19, and that it will be non-related to

Covid if it is submitted somewhere else. [3 marks]

- create_test_view() - Create view for test data. This view will have two

columns, the text and the label. The same principles as in the previous case apply.

[3 marks]

- check_views() - Retrieve all records in training_data and test_data, and

print to the console their size. This is a small sanity check. [1 mark]

b) Retrieve data from the views you created in step 1, train and evaluate a

classifier, and return the results as a JSON object that is rendered in the browser.

Implement this functionality in the experiment() method, again in the main.py

script. It is up to you to decide on the classifier and its configuration. There is an

opportunity to reflect on these choices in Assessment 3. [5 marks]

c) Take the model binaries, model configuration and the classification results you

obtained in step (b), as well as the time in which the experiment was performed,

and store this information in a dedicated collection. This exercise is open, i.e., there

is no suggested format on how to store this data, what information to store for your

models or the evaluation metrics you use. There is an opportunity to reflect on

these choices in Assessment 3. [2 marks]

2. Retrieve information on the experiments conducted so far (5 marks)

a) In this exercise, you query the collection you implemented in step 1c, and show

the top 3 experiments based on a certain criterion (best scoring according to

metric X, the most recent experiments, the fastest experiments in training time,

etc.). Your results will be returned as JSON objects and rendered in the browser. [5

marks]

3. Implement a ‘covid-or-not’ predictor (5 marks)

a) In this exercise, you implement a functionality for predicting ‘on the fly’ whether a

piece of text is Covid-19-related or not. To this end, you will use the top-ranked

model according to the ranking you implemented in step 2a. This model will then be

applied to the input text and the results will be rendered in the browser as a JSON

object with the format:

{

"input_text": some_input_text,

"prediction": the_prediction_of_your_classifier

}. [5 marks]

Assessment 3

Report (55 Marks)

assessment3_report_[student number].pdf

In Assessment 3, you write a technical report on Assessments 1 and 2, and discuss

ethical, legal and social implications of the development of this Covid-19 application in the

context of the UK Data Ethics Framework. You are strongly encouraged to follow a

scholarly approach, e.g., with peer-reviewed references as support, and with additional

empirical evidence (your own tests) for justifying your decisions (e.g., performance or

physical storage for DBMS, training time or accuracy for the ML webapp solution).

Maximum 2,000 words (not counting references, index and table and figure captions).

This report should cover the following aspects, discussing challenges and problems

encountered and the solutions implemented to overcome them. The mark will be divided

between 3 expected sections:

o [3a] Database Creation (DB choice, design, etc.), i.e., the research and

findings stemming from the development of Assessment 1. Specifically, you

should discuss any business rules that can be inferred from the dataset

(reverse-engineering), normalization (identifying partial and transitive

dependencies, if any, unnormalized relations, etc.), data integrity and

constraints, bad data, etc. Moreover, the expectation is that any design decision

(or lack thereof) will be empirically (e.g., with performance tests) and/or

theoretically (pointing to peer-reviewed publications) supported. [20 Marks]

o [3b] ML Application, explaining the implementation of the training and test

VIEWS; the ML algorithm chosen (based on main features, hyperparameters

used in the application, training speed as opposed to other alternatives, etc);

evaluation metrics; the overall logic followed by the app for storing and retrieving

experimental results; and finally any further details that may be relevant for the

‘covid-or-not’ inference functionality. You should also discuss the rationale

behind the MongoDB interaction with pointers both to the database and the code

that interacts with it. [20 Marks]

o [3c] Ethics and Bias in Data-driven Solutions in the specific context of this

dataset and the broader area of application of this project (automatic

categorization of social media content to enable easier screening of public

opinion). You should map your discussion to one of the five actions outlined

in the UK’s Data Ethics Framework. You should prioritize the action that, in your

opinion, is the weakest. Then, justify your choice by critically analyzing the three

key principles outlined in the Framework, namely transparency, accountability

and fairness. Finally, you should propose one solution that explicitly addresses

one point related to one of these three principles, reflecting on how your solution

would improve the data cycle in this particular use case. [15 Marks]

Learning Outcomes Assessed

This coursework covers the 7 LOs listed in the module description.

Criteria for assessment

Credit will be awarded against the following criteria.

Assessment 1 (20 Marks) Criteria for assessment are detailed below:

Data Modeling (8 marks)

Criteria Fail (0-50%) Pass

(50-59%)

Merit

(60-69%)

Distinction

(70-100%)

Functionality

and quality

Faulty

application,

with unclear

pipeline. Parts

of the original

dataset are

missing (either

whole entities

or their

attributes).

Data types

wrongly

defined in the

DDL

statements. No

implementation

of PK/FK

constraints.

Notebook

poorly

structured.

Inefficient table

population

code.

Full dataset

loaded into a

MySQL

database,

some missing

values are

handled and

there is an

attempt to deal

with

multi-valued

columns.

Some data

types in DDL

statements are

appropriate.

Minimal PK/FK

constraints

implemented.

Notebook is

overall

self-explanator

y with some

parts difficult to

understand.

Somewhat

efficient table

population

code.

Pass and all

data types and

multivalued

columns

correctly

handled,

column data

types are

appropriate to

their range and

domain. Some

PK/FK

constraints

implemented.

Efficient and

well-document

code.

Notebook well

structured.

Merit and

excellent

management

of missing

data, type

recasting and

multivalued

columns. Fully

maintainable

DB, with DDL

and DML

statements

embedded in

high quality

SQL code that

evidences

research

beyond the

concepts seen

in class (e.g.,

Transaction

management

or Triggers).

Excellent

notebook

presentation,

with elegant

mix of Python

and SQL

Querying (12 marks)

Assessment 2 (25 Marks) Criteria for assessment are detailed below.

comments and

Markdown.

Criteria Fail (0-50%) Pass

(50-59%)

Merit

(60-69%)

Distinction

(70-100%)

Correctness Query not

implemented

or evidencing

limited

progress

towards a

solution.

Query

implemented

with wrong

output, but

evidencing

substantial

progress

towards a

solution.

Query

implemented

with correct

output, but with

minor mistakes

such as wrong

column

arrangement.

Query

implemented

and correct,

returning the

exact value or

table as

requested in

the question.

Criteria Fail (0-50%) Pass

(50-59%)

Merit

(60-69%)

Distinction

(70-100%)

Functionality

and quality

Faulty

application with

most or all

functionality

not

implemented.

Functional

application with

some of the

requested

functionalities

implemented.

Code for

defining

training and

test VIEWs is

functional, but

with some

errors. Browser

rendering of

results and

Mongo

insertion is

minimally

implemented,

but with some

errors. Covid

detector works

with some

Pass and all

requested

functionalities

implemented,

some of them

with high

quality.

Training and

test VIEWS are

correctly

implemented,

and

Python+Mongo

handling of

models, results

and model

parameters is

very good.

Covid

detection

works and is

well justified,

with almost all

Merit and all

functionalities

showing

excellent

quality, with

evidence of

maintainability

and

customization

beyond what

we have seen

in class (e.g.,

user can select

how much

training/test

data is used in

each

experiment).

Excellent

handling of

Python and

Mongo

interaction,

Assessment 3 (55 Marks) Criteria for assessment are detailed below, with specific

benchmarks for each section.

Database creation (20 marks)

inputs, and

criterion for

model

selection is

unclear.

inputs. with elegant

implementation

. Covid

detection

works

flawlessly with

any input.

Criteria Fail (0-50%) Pass

(50-59%)

Merit

(60-69%)

Distinction

(70-100%)

Correctness

and reflection

Missing or

limited

discussion of

the different

modules of the

pipeline. No or

minimal

discussion on

reverse-engine

ering,

normalization,

data integrity,

and

missing/multiav

lued data

handling.

Some

discussion

involving the

modules of the

pipeline.

Limited

discussion on

reverse-engine

ering,

normalization

is mentioned

but not

developed in

detail, data

integrity

constraints are

mentioned but

miss a clear

mapping with

the

implementation

, handling of

bad data is

mentioned but

without proper

reflection on

the decisions

made.

Pass and all

significant

discussion of

all modules in

the pipeline.

Business rules

fully

developed,

most

normalization

steps well

justified. Data

integrity

constraints well

mapped to the

implementation

and with a

reflective

component on

how they make

the db better

(preventing

anomalies,

ensuring

consistency,

etc.). Handling

of bada data is

mentioned and

reflected upon,

discussing

different

Merit and

excellent

discussion of

all modules.

Business rules

fully developed

and all

normalization

steps show

excellent

reflection. Data

integrity

constraints and

modular/maint

ainable SQL

code is

discussed.

Excellent and

scholarly

reference-supp

orted reflection

on data

integrity and

bad data

management.

ML Application (20 marks)

Ethics and Bias in Data-driven Solutions (15 marks)

approaches.

Criteria Fail (0-50%) Pass

(50-59%)

Merit

(60-69%)

Distinction

(70-100%)

Correctness

and reflection

Missing or

limited

discussion of

the different

modules of the

pipeline. No

justification for

the algorithm

chosen,

evaluation

metrics,

vectorization

procedure and

interaction with

browser and

MongoDB

database. No

discussion on

the

‘covid-or-not’

functionality.

Some

discussion

involving the

modules of the

pipeline. Some

analysis of the

algorithm

chosen, with

some

experiments

reported on

held-out data

referencing the

training/test

VIEWS. Some

reflection on

the

Python-Mongo

DB-browser

interaction.

‘Covid-or-not’

functionality

briefly

described.

Pass and

extensive

analysis of the

ML model

chosen, with

meaningful

comparison

with other

models,

vectorization

methods,

training/test

data splits

(with reference

to the VIEWS)

and with

ablation tests.

Good reflection

on the

Python-Mongo

DB-browser

interaction.

Significant

reflection on

the

‘Covid-or-not’

functionality,

showing

evidence of

previous

attempts.

Merit and

excellent

analysis of the

ML

component.

Extensive

analyses

across models,

vectorizers and

training/test

data splits

which

reference the

application’s

implementation

. Excellent

reflection on

the

Python-Mongo

DB-browser

interaction, and

extensive

reflection on

the

‘Covid-or-not’

functionality,

showing

evidence of

previous

attempts under

different

experimental

setups.

Criteria Fail (0-50%) Pass

(50-59%)

Merit

(60-69%)

Distinction

(70-100%)

Correctness

and reflection

Fails to

address the

actions and

One action

addressed

without proper

One action

addressed and

critically

One action

addressed and

critically

The grade range is divided in:

o Distinction (70-100%)

o Merit (60-69%)

o Pass (50-59%)

o Fail (0-50)

Feedback and suggestion for future learning

Feedback on your coursework will address the above criteria. Feedback and marks will be

returned between June 30rd and July 2

th via Learning Central. There will be opportunity for

individual feedback during an agreed time.

Feedback for this assignment will be useful for subsequent skills development, such as

database design, SQL and NoSQL, data analysis, machine learning and client-motivated

deliverables involving predictive analysis.

principles in

the

UK Data Ethics

Framework. No

or trivial

solution

provided.

justification.

Some

principles

addressed.

Trivial solution

provided that

does not

clearly map

with a specific

action or

principle.

justified. All

principles

addressed.

Some

supporting

peerreviewed

references.

Justified

solution based

on

limited

literature

and/or

empirical

results.

justified and

all principles

addressed, in

both cases

with

extensive use

of supporting

peer-reviewed

references.

Solution is

justified

extensively

with

references and

empirical

results.


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp