Date: 2021-10-13 10:04

Lab 1 - Regular Expressions

Due Monday by 11:59pm Points 100 Submitting on paper

Introduction

In this programming lab you will practice imperative programming in the language Raku (https://raku.org/resources/). You are provided an incomplete program and are tasked to complete its regular expressions so that it performs a number of text processing tasks. Each task requires a precise regular expression to match the line, word, and character counts expected by the tests.

Language: Raku

For this lab, you will use Raku. Download Raku from here: https://rakudo.org/downloads

The Comma IDE (https://commaide.com/) is built for Raku and has syntax highlighting: https://commaide.com/download

Download the Lab 1 Starter Pack: 381_lab1_starter_pack.zip (https://oregonstate.instructure.com/courses/1862887/files/89343738/download?download_frd=1)

Dataset

This lab will make use of a subset of a dataset containing a million song titles. This dataset is used in various machine learning experiments and is provided by the Laboratory for the Recognition and Organization of Speech and Audio at Columbia University (http://labrosa.ee.columbia.edu/).

The original file is available here (http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/unique_tracks.txt). This file contains one million lines and weighs in at 82 MB. You should probably avoid viewing it in a web browser.

Because this dataset is so large, we provide a smaller random sample of 100k tracks as the file tracks.txt, which is available in the starter pack (https://oregonstate.instructure.com/courses/1862887/files/89343738/download?download_frd=1). We will work with this smaller sample of data in this lab.

Getting Started

Unpack the starter pack (https://oregonstate.instructure.com/courses/1862887/files/89343738/download?download_frd=1) archive available on Canvas. First make a copy of the lab template (e.g., cp lab1_template.raku lab1.raku ). You will modify lab1.raku for this lab.

Open the file in your favorite text editor. At the top, add your name and email as instructed by the

comments. Set the variable $name to your name.

Begin by running the program.

$ raku lab1.raku

The template provides you with a working user menu, although not all features are enabled until you write them. It takes a command, optionally followed by an argument. The commands are listed in the Menu System section below. Try it out by typing commands, one per line. Afterwards, type the end-of-input character: CTRL+D on Linux and macOS, or CTRL+Z on Windows.

load tracks.txt

print 1000

^D

This will read in the file tracks.txt and print out the first 1000 lines to standard out. You will want to

kill the program or you will be waiting a long time. You can use CTRL+C to cancel execution of the

program.

Next, run the program again and count the number of lines in the input file.

load tracks.txt

count tracks

^D

You can verify that there are indeed one hundred thousand song titles.

Self-Check

Run the following commands to count tracks, words, and characters in the dataset.

debug off

load tracks.txt

count tracks

count words

count characters

^D

This results in the following counts.

100000

460449

8594540

The above is the same as test t01 .

Lab Tasks

As you can see, these song titles are incredibly messy. We need to clean them up. Regular expressions are the perfect tool for that task. And Raku (Perl) is the perfect language for regular expression wizardry.

Task 1: Extract song title

Each line contains a track id, song id, artist name, and the song title, such as:

TRWRJSX12903CD8446<SEP>SOBSFBU12AB018DBE1<SEP>Frank Sinatra<SEP>Everything Happens To Me

You are only concerned with the last field, the song title. As your first task, you will write a regular

expression that extracts the song title and discards all other information.

In the code, find the function extract_title . In the if statement, edit the regular expression to

capture only the song title. The print command is useful to print out the tracks as you develop

your regular expressions. The print command takes an optional integer argument which restricts

printing to only the first n tracks. For example, print 10 , prints only the first ten tracks.
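To illustrate the idea without giving away the Raku answer, here is a minimal sketch in Python (whose regex syntax differs from Raku's). It assumes the fields are always separated by the literal text <SEP> and that the title is whatever follows the last separator. The names TITLE_PATTERN and extract_title_sketch are hypothetical and are not part of the starter pack.

```python
import re

# Greedy ".*" consumes as much as possible, so the "<SEP>" that follows it
# is the LAST separator on the line; the capture group is the song title.
TITLE_PATTERN = re.compile(r"^.*<SEP>(.*)$")

def extract_title_sketch(line):
    """Return the text after the last <SEP>, or the line unchanged if none."""
    match = TITLE_PATTERN.match(line)
    return match.group(1) if match else line

line = ("TRWRJSX12903CD8446<SEP>SOBSFBU12AB018DBE1<SEP>"
        "Frank Sinatra<SEP>Everything Happens To Me")
print(extract_title_sketch(line))
```

The same capture-group idea carries over to Raku, although the way you quote the literal < and > characters will differ.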

On completion of Task 1, you can check yourself with test t02.

Task 2: Eliminate superfluous text

The song title, however, is quite noisy, often containing additional information beyond the song title.

Consider this example.

Strangers In The Night (Remastered Album Version) [The Frank Sinatra Collection]

You need to perform some pre-processing in order to clean up the song titles. You will write a series of regular expressions that match a block of text and replace it with nothing. Use a substitution regex that matches one string and replaces it with another.

Begin by writing a regular expression that matches a left parenthesis and any text that follows it. You need not match the right parenthesis explicitly. Replace the parenthesis and all text that follows it with nothing.

In the above Sinatra example, the modified title becomes:

Strangers In The Night

Repeat this for patterns beginning with the left bracket, the left curly brace, and all the other

characters listed below.

( [ { \ / _ - : " ` + = feat.

Most of these are single symbols. The last is the abbreviation feat. which is short for featuring. Make sure to remove the literal period which follows the t. For example, Sunbeam feat. Vishal Vaid becomes Sunbeam.

Also take note that the above list shows the backtick (on the tilde key above tab) and not the

apostrophe (located left of the enter key). This is a very important distinction. We do not want to

omit the apostrophe as it allows contractions.

Many of these characters have special meanings in Raku. Make sure you properly escape symbols

when necessary with a \ . Alternately, you can put characters to escape in quotations, such as

"#" . Failing to escape characters properly will be the most common mistake made in this lab. This

resource (http://www.dispersiondesign.com/articles/perl/perl_escape_characters) lists the escape

characters in Perl, which are the same in Raku.

In most cases, these symbols indicate additional information that need not concern us for this

exercise. The above steps will very occasionally corrupt a valid song title that actually contains, for

example, parentheses in the song title. Do not worry about these infrequent cases and uniformly

carry out the procedure listed above. These steps will catch and fix the vast majority of irregularities

in the song titles.

In the code, find the incomplete function comments . Add a series of regex substitutions to remove the superfluous information. You can now use the command filter comments .
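The "match a marker and everything after it" procedure can be sketched as follows. This is a Python illustration only, not the Raku code the lab expects. It leans on Python's re.escape to quote the special characters, whereas in Raku you must escape or quote them yourself. The names MARKERS and strip_comments_sketch are hypothetical.

```python
import re

# The markers from the list in this task. Everything from a marker to the
# end of the title is treated as superfluous.
MARKERS = ["(", "[", "{", "\\", "/", "_", "-", ":", '"', "`", "+", "=", "feat."]

def strip_comments_sketch(title):
    """Apply one substitution per marker, removing it and all text after it."""
    for marker in MARKERS:
        # re.escape quotes metacharacters such as ( [ \ + and the period
        # in "feat." so that they match literally.
        title = re.sub(re.escape(marker) + r".*$", "", title)
    return title

print(repr(strip_comments_sketch(
    "Strangers In The Night (Remastered Album Version) "
    "[The Frank Sinatra Collection]")))
print(repr(strip_comments_sketch("Sunbeam feat. Vishal Vaid")))
```

Note that the space in front of the removed marker is left behind at the end of the title in this sketch. Trailing whitespace is dealt with separately in Task 4.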

On completion of Task 2, you can check yourself with test t03.

Task 3: Eliminate punctuation

Next, find and delete the following typical punctuation marks.

? ¿ ! ¡ . ; : & $ * @ % # |

Unlike before, delete only the symbol itself and leave all of the text that follows. Be sure to do a 'global' match in order to replace all instances of the punctuation mark. Be careful to match the period itself as the symbol . has a special meaning in regular expressions. This is true for many of the symbols above. Again, refer to the list of escape characters (http://www.dispersiondesign.com/articles/perl/perl_escape_characters). For the symbols ¿ and ¡, use the Unicode representations, which are \x[00BF] and \x[00A1] respectively. Note the colon is intentionally repeated from Task 2.

In the code, find the incomplete function punctuation . Add a series of regex substitutions to remove the punctuation. You can now use the command filter punctuation .
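As a rough illustration of a global, symbol-only deletion, here is a Python sketch (not Raku). It folds the symbols into a single character class rather than the series of substitutions the lab asks for, and it uses the Unicode code points 00BF and 00A1 given above. The names PUNCTUATION, PUNCT_PATTERN and strip_punctuation_sketch are hypothetical.

```python
import re

# The punctuation marks to delete; the two inverted marks are written with
# their Unicode code points (U+00BF and U+00A1).
PUNCTUATION = ["?", "\u00BF", "!", "\u00A1", ".", ";", ":",
               "&", "$", "*", "@", "%", "#", "|"]

# One character class containing every symbol, each escaped so that it is
# matched literally rather than as a regex operator.
PUNCT_PATTERN = re.compile("[" + "".join(re.escape(c) for c in PUNCTUATION) + "]")

def strip_punctuation_sketch(title):
    """Delete every occurrence of the symbols, keeping the surrounding text."""
    # re.sub replaces all matches by default, i.e. it is already 'global'.
    return PUNCT_PATTERN.sub("", title)

print(strip_punctuation_sketch("Wait... What?!"))
```

In Raku a substitution is not global unless you ask for it, which is why the task stresses the 'global' match.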

On completion of Task 3, you can check yourself with test t04.

Task 4: Trim whitespace

All our replacements have left some blank titles or mangled titles. First, identify any titles that begin or end with an apostrophe. Replace those apostrophes with nothing. Be careful not to disturb the apostrophes in the interior of the sentence. Next, trim leading and trailing whitespace from each title. Be careful not to replace all whitespace as we need whitespace between words.

To match the tests and simplify debugging, do not combine the above two operations into one, but rather (1) handle apostrophes first and then (2) handle whitespace as a second line of code.

In the code, find the incomplete function clean and complete the instructions to achieve the above goal.
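The two-step order described above can be sketched in Python as follows. This is an illustration of the anchoring idea only, and the function name trim_title_sketch is hypothetical. The anchors ^ and $ restrict each substitution to the start or end of the title, so interior apostrophes and spaces are left alone.

```python
import re

def trim_title_sketch(title):
    """Strip edge apostrophes first, then edge whitespace, as two steps."""
    # Step 1: remove an apostrophe only at the very start or very end.
    title = re.sub(r"^'|'$", "", title)
    # Step 2: remove leading and trailing whitespace; spaces between
    # words are untouched because the pattern is anchored.
    title = re.sub(r"^\s+|\s+$", "", title)
    return title

print(repr(trim_title_sketch("'Round Midnight ")))
print(repr(trim_title_sketch("  Don't Stop'")))
```

Because the steps run in this order, a title such as " 'Tis" keeps its apostrophe: the apostrophe is not at the edge when step 1 runs.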

Task 5: Filter out non-English characters

We want to ignore all song titles that contain a non-English character (e.g., á or ì). Although this may eliminate some titles in languages like Spanish or Danish, we need to drop out the Unicode garbage characters found in many of the titles.

If the titles contain only valid English alphanumeric characters ('a' to 'z' and 0 to 9) or the

apostrophe or a space, we keep it. Otherwise, we discard the song title entirely. Use the Raku

command next; to skip to the next iteration of the loop, which bypasses pushing it to the array to

return. Make sure to accept both uppercase and lowercase letters.

In the code, continue editing the function clean to achieve the above goal.
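A whitelist check of this kind might look like the Python sketch below, where continue plays the role of Raku's next; statement. The names ENGLISH_ONLY and keep_english_sketch are hypothetical, and the loop structure is an assumption about how clean iterates over titles.

```python
import re

# A title is kept only if EVERY character is a letter (either case),
# a digit, an apostrophe, or a space.
ENGLISH_ONLY = re.compile(r"[A-Za-z0-9' ]*")

def keep_english_sketch(titles):
    """Return only the titles made up entirely of allowed characters."""
    kept = []
    for title in titles:
        if not ENGLISH_ONLY.fullmatch(title):
            continue  # skip this title, like `next;` in Raku
        kept.append(title)
    return kept

print(keep_english_sketch(["Hello World", "Caf\u00e9", "Don't Stop", "99 Luftballons"]))
```

Matching the whole title against a set of allowed characters is usually simpler than trying to enumerate every character you want to reject.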

Task 6: Skip blank titles

After all the replacements, some titles are left with nothing. Continue to edit the function clean to

throw out any empty titles. Use a regular expression to check if a title is empty or contains only

whitespace. If empty, do not retain it in the array you return. Again use the Raku command next;

to skip this title. Likewise, if the title contains only an apostrophe, discard it.

To match the tests, do not combine the above two operations into one, but (1) handle whitespace

first and then (2) handle apostrophes as a second line of code.

In the code, continue editing the function clean to achieve the above goal.
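The two checks, in the order the tests expect, can be sketched in Python as below. This sketch assumes that "only an apostrophe" means a title consisting of exactly one apostrophe character. The function name drop_blank_sketch is hypothetical.

```python
import re

def drop_blank_sketch(titles):
    """Discard empty or whitespace-only titles, then lone apostrophes."""
    kept = []
    for title in titles:
        # Check 1: empty, or nothing but whitespace.
        if re.fullmatch(r"\s*", title):
            continue
        # Check 2: the whole title is a single apostrophe (assumption).
        if re.fullmatch(r"'", title):
            continue
        kept.append(title)
    return kept

print(drop_blank_sketch(["", "   ", "'", "Hello", "Don't"]))
```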

Task 7: Set to lowercase

Convert all words in the sentence to lowercase. Raku has a simple function to do this for you. Go

back and edit the clean function one more time to convert all titles to lower case.

On completion of Tasks 4-7, you can check yourself with test t06.

Task 8: Filtering out common words

In the domain of Natural Language Processing (NLP), stop words

(https://en.wikipedia.org/wiki/Stop_word) are common words that are often filtered out in

preprocessing. Go edit the function stopwords to filter out the following common stop words from

the song title.

a an and by for from in of on or out the to with

Be careful to only replace entire words, not just any occurrence of these letters. Otherwise you will turn words like outstanding into stg .

The regex control character <|w> will be quite useful here, which defines a word boundary

(https://docs.raku.org/language/regexes#Word_boundary) . Additionally, remove a single whitespace

following the second word boundary. This prevents your titles being littered with double spaces

throughout because we also remove the single space that follows the word. This has the

unintended consequence that it will retain the stop word if it's the final word of the title (because

there is no space to match after), but for simplicity, let's not worry about that.

We may not want to always filter these words, but we want to have the option. You can now use the

command filter stopwords . You can toggle this mode using commands stopwords on and stopwords

off .
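The word-boundary idea can be sketched in Python, where \b plays a role similar to Raku's <|w>. This is an illustration only. It assumes titles are already lowercase (Task 7), and the names STOP_WORDS, STOP_PATTERN and remove_stopwords_sketch are hypothetical.

```python
import re

STOP_WORDS = ["a", "an", "and", "by", "for", "from", "in",
              "of", "on", "or", "out", "the", "to", "with"]

# A stop word between two word boundaries, followed by exactly one space.
# Requiring the space is what leaves a title-final stop word in place.
STOP_PATTERN = re.compile(r"\b(?:" + "|".join(STOP_WORDS) + r")\b ")

def remove_stopwords_sketch(title):
    """Delete whole stop words together with the single space after them."""
    return STOP_PATTERN.sub("", title)

print(remove_stopwords_sketch("the sound of silence"))
print(remove_stopwords_sketch("outstanding view"))
print(remove_stopwords_sketch("come on"))
```

The second call shows that the boundaries protect a word like outstanding, and the third shows the known limitation: a stop word at the very end of the title is retained.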

On completion of Task 8, you can check yourself with test t07.

Postlude: Putting it all together

You can now use the command preprocess which runs all the individual filter tasks.

On completion of Tasks 1-8, you can check yourself with test t08 (stopwords off) and test

t09 (stopwords on).

Menu System

Menu system for Labs 1 and 2

command      argument      description
build                      build the bigram model
debug        on            turn on debug mode
             off           turn off debug mode
count        tracks        counts tracks in @tracks
             words         counts words in @tracks
             characters    counts characters in @tracks
filter       title         extract title from original input
             comments      removes extra phrases
             punctuation   removes punctuation
             unicode       removes non-Unicode and whitespace
load         FILE          loads the given FILE
length       INT           update $SEQUENCE LENGTH=INT
mcw          WORD          print most common word to follow WORD
name                       print sequence based on your name
preprocess                 run all filter tasks and build
print                      prints all tracks to file tracks.out
             INT           prints only INT tracks to stdout
random                     print sequence based on random word
stopwords    on            filter stopwords on
             off           filter stopwords off
sequence     WORD          find sequence to follow WORD

Resources

This lab requires an independent study of the Raku language. You are encouraged to use any web

tutorials and resources to learn Raku.

Regular Expression Tester (https://regex101.com/)

Raku Guide (https://raku.guide/)

Raku Resources (https://raku.org/resources/)

Raku Tutorial (https://www.i-programmer.info/news/222-perl/13267-the-raku-beginner-tutorial.html)

Raku Documentation (https://docs.raku.org/language.html)

Allow yourself plenty of time, and use patience, perseverance, and the Internet to debug your code.

Testing

In the directory tests you will find a number of pairs of files representing the given input and

expected output of the test.

raku lab1.raku < ./tests/t01.in

The above line will execute lab1.raku piping in the commands given in file t01.in . You can

compare your result with the expected output given in file t01.out .
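If you prefer to see exactly which lines differ rather than compare by eye, a small helper such as the following Python sketch can be used. It assumes you have already captured your program's output as text, for example by redirecting it to a file. The function name diff_outputs_sketch is hypothetical.

```python
import difflib

def diff_outputs_sketch(actual, expected):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile="expected",
        tofile="actual",
        lineterm="",
    ))

# Example with the three counts from the self-check: the word count differs.
expected = "100000\n460449\n8594540\n"
actual = "100000\n460450\n8594540\n"
for diff_line in diff_outputs_sketch(actual, expected):
    print(diff_line)
```

Lines prefixed with - appear only in the expected output, and lines prefixed with + appear only in yours.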

Submission

Each student will complete and submit this assignment individually. Do not consult with others.

However, you are encouraged to use the Internet to learn Raku.

You will submit only your Raku code file.

lab1.raku

Do not submit lab1_template.raku , tracks.txt or 381_lab1_starter_pack.zip .

Grading criteria

This project is worth 100 points. Include your name in your file. Format your code neatly using expected conventions for indentation. Use descriptive variable names. Comment your program heavily. Thoughtful and unique comments help protect you against accusations of violations of academic integrity.

Your program will be evaluated against a series of automatic tests using Gradescope. The grading tests will be the same as the provided tests but will examine a different subset of the one million tracks. Your grade is determined by the number of public and hidden tests your program passes.

Lab 1 Rubric

Total Points: 100

Criteria                                                            Passed   Partial (2)   Partial (1)   Failed Test
Test 1  Credit for submission                                       16 pts   12 pts        8 pts         4 pts
Test 2  Task 1 - Extract title                                      12 pts   8 pts         4 pts         0 pts
Test 3  Task 2 - Filter Comments                                    12 pts   8 pts         4 pts         0 pts
Test 4  Task 3 - Filter Punctuation                                 12 pts   8 pts         4 pts         0 pts
Test 6  Tasks 4-7 - Clean whitespace, apostrophes, and Unicode      12 pts   8 pts         4 pts         0 pts
Test 7  Task 8 - Remove stopwords                                   12 pts   8 pts         4 pts         0 pts
Test 8  Full check, without stopwords                               12 pts   8 pts         4 pts         0 pts
Test 9  Full check, with stopwords                                  12 pts   8 pts         4 pts         0 pts

