联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2019-10-23 11:17

CITS1401 Computational Thinking with Python

Project 2 Semester 2 2019

Page 1 of 6

Project 2: Using Stylometry to Verify Authorship

Submission deadlines:

Stage 1: 5:00pm, Friday 18 October 2019 for the pseudocode

Stage 2: 5:00pm, Friday 25 October 2019 for the code.

Value: 20% of CITS1401.

To be done individually.

You should construct a Python 3 program containing your solution to the following

problem and submit your program electronically on LMS. No other method of

submission is allowed.

You are expected to have read and understood the University's guidelines on

academic conduct. In accordance with this policy, you may discuss with other

students the general principles required to understand this project, but the work

you submit must be the result of your own effort. Plagiarism detection, and other

systems for detecting potential malpractice, will therefore be used. Besides, if

what you submit is not your own work then you will have learnt little and will

therefore, likely, fail the final exam.

You must submit your project before the submission deadline listed above.

Following UWA policy, a late penalty of 10% will be deducted for each day (or part

day), after the deadline, that the assignment is submitted. However, in order to

facilitate marking of the assignments in a timely manner, no submissions will be

allowed after 7 days following the deadline.

Overview

UWA, like every university around the country (probably around the galaxy) is

very worried about ghost-written submissions for assignments. This is also known

as contract cheating. Whatever you call it, ghost-writing is about getting someone

else to do your work, but submitting it as if it was only your work. In this case we

are concerned with essays. The incidence is believed to be low, but it's clearly not

a good thing.

Coming from a different angle, debates have raged at various times about whether

different authors' works were actually by those authors. For example, were all the

works attributed to William Shakespeare actually by him? One approach to

examining both of these issues is to use stylometry. That is, rather than looking

directly at the content of texts, as one does when looking for suspected plagiarism,

stylometic looks for stylistic similarities. In other words, similarities in the ways a

particular author uses language, rather than similarities in the actual words on the

page, on the assumption that an author will use a similar style for similar sorts of

content, fiction, non-fiction, etc.

CITS1401 Computational Thinking with Python

Project 2 Semester 2 2019

Page 2 of 6

What you will do for this Project is write a program that reads in two text files

containing the works to be analysed and builds a profile for each. The two profiles

are compared and returned besides a score which reflects the distance between

the two works in terms of their style; low scores, down to 0, imply that the same

author is likely responsible for both works, while large scores imply different

authors.

Specification: What your program will need to do

Input:

Your program must define the function main with the following signature:

def main(textfile1, textfile2, feature)

The first and second arguments are the names of the text files with a work to be

analysed. The third argument is the type of feature that will be used to compare

the document profiles. The allowed feature names are: "punctuation", "unigrams",

"conjunctions" and "composite".

Output:

The function is required to return the following outputs in the order provided

below:

• the score from a pairwise comparison rounded to four decimal places,

• the dictionary containing the profile of first file (textfile1), and

• the dictionary containing the profile of second file (textfile2)

A more detailed specification

• For the purposes of this project, a sentence is a sequence of words followed by

either a full-stop, question mark or exclamation mark, which in turn must be

followed either by a quotation mark (so the sentence is the end of a quote or

spoken utterance), or white space (space, tab or new-line character). Thus:

This is some text. This is yet more text

contains one sentence followed by the start of another sentence.

• You are required to create the profile of input files using dictionaries. The profile

for each document will contain the number of occurrences of certain words

(case insensitive) and pieces of punctuation.

• The counted words or punctuations are dependent on the input feature which

can be: "punctuation", "unigrams", "conjunctions" and "composite".

• For conjunctions: your program is required to count the number of

occurrences of the following words:

"also", "although", "and", "as", "because", "before", "but", "for", "if", "nor", "of",

"or", "since", "that", "though", "until", "when", "whenever", "whereas",

"which", "while", "yet"

CITS1401 Computational Thinking with Python

Project 2 Semester 2 2019

Page 3 of 6

• For unigrams: your program is required to count the number of occurrences

of each word in the files. Consider the following three lines of text contained in

a file:

This is a Document.

This is only a document

A test should not cause problem

The word count will be: "a":3, "document":2, "this":2, "is":2, "only":1,

"should":1, "not""1, "cause":1, "problem":1

• For punctuation: your program should count certain pieces of punctuation:

comma and semicolon. In addition, your program should also count singlequote

and hyphen, but only under certain circumstances. Specifically, your

program should count single-quote marks, but only when they appear as

apostrophes surrounded by letters, i.e. indicating a contraction such as

"shouldn't" or "won't". (Apostrophe is being included as an indication of more

informal writing, perhaps direct speech.). Your program should count dash

(minus) signs, but only when they are surrounded by letters, indicating a

compound-word, such as "compound-word". Any other punctuation or letters,

e.g '.' when not at the end of a sentence, should be regarded as white space,

so serve to end words. For these purposes, strings of digits are also words as

they convey information. Therefore, in the unlikely event that a floating point

number, such as 3.142, appears, that is regarded as two words.

Note: Some of the texts we will use include double hyphen, i.e. "--". This is to

be regarded as a space character.

• For composite: your program should contain number of occurrences of

punctuations (as explained above) and conjunctions. In addition, your program

should also add to the profile two further parameters relating to the text: the

average number of words per sentence and the average number of sentences

per paragraph, where a paragraph is any number of sentences followed by a

blank line or by the end of the text.

• Each of the words and punctuation symbols should be placed, together with

their respective counts, in a dictionary, which is called a profile.

• The first output by the main function is the distance between the corresponding

profiles which should be computed using the standard distance formula:

• The second and third outputs returned by the main function are

the profiles corresponding to the first and second text files respectively. The

returned profiles as dictionaries in which each word is the key and value is the

number of occurrences of the key, such as {“also”:10, ”got”: 6} where

“also” and “got” are the keys and have occurred 10 and 6 times respectively.

CITS1401 Computational Thinking with Python

Project 2 Semester 2 2019

Page 4 of 6

Example:

Download the project2data.zip file from the folder of Project 2 on LMS. An example

interaction is provided as a sampleanswers.txt which you can find in

sampleresult.txt. The results are based on three files: sample1.txt and

sample2.txt, both excerpts taken from "Life on the Mississippi", by Mark Twain.

Some Text Files to Examine

Some text files are also included in the zip file for you to try out. All of the texts,

apart from "Kangaroo", were obtained from Project Gutenberg

(www.gutenberg.org). All the files have a long text at the end which contains

Project Gutenberg license and terms of use. I have removed the Gutenberg terms

and license in the files rather than left them in the texts because that may affect

the profiles.

Author Title Fiction/Non-fiction

Henry Lawson Children of the Bush Fiction

D. H. Lawrence Fantasia of the Unconscious Non Fiction

Mark Twain Life on the Mississippi Non Fiction

D. H. Lawrence Sea and Sardinia Non Fiction

D. H. Lawrence Kangaroo Fiction

Mark Twain Adventures of Hucklebery Finn Fiction

Andrew Barton

'Banjo' Paterson

Three Elephant Power Fiction

A small note of warning. If you decide to download your own texts from Project

Gutenberg, please be aware that many of the texts include spurious Unicode

characters. Unfortunately, the file input-output functions we use in CITS1401 (and

I use on a daily basis) only work with the standard ASCII character set, so will

cause an exception if Unicode characters are in the text. While Python is well able

to deal with Unicode, special input-output functions are needed, which are beyond

the scope of this unit. What I have done is use the Unix command: cat –vet

filename to make the Unicode characters visible in the ASCII character set, and

then use a text editor to remove them. (Tedious.)

Important:

You will have noticed that you have not been asked to write specific functions.

That has been left to you. However, as in Project 1, it is important that your

program defines the top-level function main() as described above. main()

should then call the other functions. (Of course, these may call further functions.)

CITS1401 Computational Thinking with Python

Project 2 Semester 2 2019

Page 5 of 6

The reason this is important is that when I test your program, my testing program

will call your main() function. So, if you fail to define main(), or define it with a

different signature, my program will not be able to test your program.

Things to avoid:

There are a few things for your program to avoid.

• You are not allowed to import any Python module except math or os. While

use of other modules are perfectly sensible thing to do (and the way I often

may do it), it takes away much of the point of different aspects of the project,

which is about getting practice creating code to accurately extract the parts of

strings that that you need, and use of basic Python structures, in this case

dictionaries.

• Please do not assume that the input file names will end in .txt. File name

suffixes such as .csv and .txt are not mandatory in systems other than Microsoft

Windows.

• Please make sure your program does NOT call the input() or print()

functions. That will cause your program to hang, waiting for input that my

automated testing system will not provide. In fact, what will happen is that the

marking program detects the call(s), and will not test your code at all.

Submission:

Stage 1:

Submit a single PDF file containing your approach and/or pseudocode for the

solution of the problem as per guidelines discussed in Lecture L2 Software

Development Process and Project 1 Stage 1 submission. You need to discuss the

document with lab demonstrator before submission. It is mandatory to submit this

file before 5:00pm 18 October 2019 on LMS to avoid 10% deduction in Project

2 grading. This will be a formative feedback of your problem solving skills

developed in the course. In case you do not submit the file, 10% of the total marks

of the project will be deducted from your obtained grade of the Stage 2

submission.

Stage 2:

Submit a single Python (.py) file containing all of your functions via LMS before

5:00pm 25 October 2019 on LMS

You need to contact unit coordinator if you have special considerations or making

late submission.

Marking Rubric:

Your program will be marked out of 30 (later scaled to be out of 20% of the final

mark).

22 out of 30 marks will be awarded based on how well your program completes a

number of tests, reflecting normal use of the program, and also how the program

handles various error states, such as the input file not being present. Other than

CITS1401 Computational Thinking with Python

Project 2 Semester 2 2019

Page 6 of 6

things that you were asked to assume, you need to think creatively about the

inputs your program may face.

8 out of 30 marks will be style (4/8) “the code is clear to read” and efficiency (4/8)

“your program is well constructed and runs efficiently”. For style, think about use

of comments, sensible variable names, your name at the top of the program, etc.

(Please look at your lecture notes, where this is discussed.)

Style Rubric:

0 Gibberish, impossible to understand

1-2 Style is really poor

3 Style is good or very good, with small lapses

4 Excellent style, really easy to read and follow

Your program will be traversing text files of various sizes (possibly including large

corpora) so try to minimise the number of times your program looks at the same

data items. You may wish to use dictionaries (or sets, if you are prepared to read

the documentation), rather than lists.

Efficiency Rubric:

0 Code too incomplete to judge efficiency, or wrong problem tackled

1 Very poor efficiency, additional loops, inappropriate use of readline()

2 Acceptable efficiency, one or more lapses

3 Good efficiency, small lapses

4 Excellent efficiency, should have no problem on large files

Automated testing is being used so that all submitted programs are being tested

the same way. Sometimes it happens that there is one mistake in the program

that means that no tests are passed. If the marker is able to spot the cause and

fix it readily, then they are allowed to do that and your - now fixed - program will

score whatever it scores from the tests, minus 2 marks, because other students

will not have had the benefit of marker intervention. Still, that's way better than

getting zero. On the other hand, if the bug is too hard to fix, the marker needs to

move on to other submissions.


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp