Lab 1 - Regular Expressions
Due Monday by 11:59pm Points 100 Submitting on paper
Introduction
In this programming lab you will practice imperative programming in the language
Raku (https://raku.org/resources/) . You are provided an incomplete program and are
tasked with completing its regular expressions to perform a number of text processing tasks. Each task
requires a precise regular expression to match the line, word, and character counts expected by the
tests.
Language: Raku
For this lab, you will use Raku. Download Raku from here: https://rakudo.org/
(https://rakudo.org/downloads) .
The Comma IDE (https://commaide.com/) is built for Raku and has syntax highlighting:
https://commaide.com/download (https://commaide.com/download) .
Download the Lab 1 Starter Pack: 381_lab1_starter_pack.zip
(https://oregonstate.instructure.com/courses/1862887/files/89343738/download?download_frd=1)
Dataset
This lab will make use of a subset of a dataset containing a million song titles. This dataset is used
in various machine learning experiments and is provided by the Laboratory for the Recognition
and Organization of Speech and Audio at Columbia University
(http://labrosa.ee.columbia.edu/) .
The original file is available here
(http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/unique_tracks.txt) . This
file contains one million lines and weighs in at 82 MB. You should probably avoid viewing it in a
web browser.
Because this dataset is so large, we provide a smaller random sample of 100k tracks as the file
tracks.txt which is available in the starter pack
(https://oregonstate.instructure.com/courses/1862887/files/89343738/download?download_frd=1) . We
will work with this smaller sample of data in this lab.
Getting Started
Unpack the starter pack
(https://oregonstate.instructure.com/courses/1862887/files/89343738/download?download_frd=1)
archive available on Canvas. First make a copy of the lab template (e.g., cp lab1_template.raku
lab1.raku ). You will modify lab1.raku for this lab.
Open the file in your favorite text editor. At the top, add your name and email as instructed by the
comments. Set the variable $name to your name.
Begin by running the program.
$ raku lab1.raku
The template provides you with a working user menu, although not all features are enabled until
you write them. It takes commands, each optionally followed by an argument. The commands are listed
in the Menu System section below. Try it out by typing commands, one per line. Afterwards, type the end-of-input character
CTRL+D on Linux and MacOS or CTRL+Z on Windows.
load tracks.txt
print 1000
^D
This will read in the file tracks.txt and print out the first 1000 lines to standard out. You will want to
kill the program or you will be waiting a long time. You can use CTRL+C to cancel execution of the
program.
Next, run the program again and count the number of lines in the input file.
load tracks.txt
count tracks
^D
You can verify that there are indeed one hundred thousand song titles.
Self-Check
Run the following commands to count tracks, words, and characters in the dataset.
debug off
load tracks.txt
count tracks
count words
count characters
^D
This results in the following counts.
100000
460449
8594540
The above is the same as test t01 .
Lab Tasks
As you can see, these song titles are incredibly messy. We need to clean them up. Regular
expressions are the perfect tool for that task. And Raku (Perl) is the perfect language for regular
expression wizardry.
Task 1: Extract song title
Each line contains a track id, song id, artist name, and the song title, such as:
TRWRJSX12903CD8446<SEP>SOBSFBU12AB018DBE1<SEP>Frank Sinatra<SEP>Everything Happens To Me
You are only concerned with the last field, the song title. As your first task, you will write a regular
expression that extracts the song title and discards all other information.
In the code, find the function extract_title . In the if statement, edit the regular expression to
capture only the song title. The print command is useful to print out the tracks as you develop
your regular expressions. The print command takes an optional integer argument which restricts
printing to only the first n tracks. For example, print 10 , prints only the first ten tracks.
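As a rough illustration of the capture described above, the following sketch pulls the last field out of the sample line. The variable names $line and $title are placeholders for this example and are not taken from the starter template; the pattern is one possible approach, not necessarily the one the tests expect.

```raku
my $line = 'TRWRJSX12903CD8446<SEP>SOBSFBU12AB018DBE1<SEP>Frank Sinatra<SEP>Everything Happens To Me';

my $title = '';
# '<SEP>' is quoted so the angle brackets are matched literally.
# The first .* is greedy, so the capture starts after the LAST separator.
if $line ~~ / .* '<SEP>' (.*) $ / {
    $title = ~$0;   # $0 is the first capture group; ~ turns it into a string
}
say $title;
```

Because the leading .* is greedy, the captured group holds only the text after the final separator, which is the song title.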
On completion of Task 1, you can check yourself with test t02 .
Task 2: Eliminate superfluous text
The song title, however, is quite noisy, often containing additional information beyond the song title.
Consider this example.
Strangers In The Night (Remastered Album Version) [The Frank Sinatra Collection]
You need to perform some pre-processing in order to clean up the song titles. You will write a
series of regular expressions that match a block of text and replace it with nothing. Use a
substitution regex that matches one string and replaces it with another.
Begin by writing a regular expression that matches a left parenthesis and any text that follows it.
You need not match the right parenthesis explicitly. Replace the parenthesis and all text that follows
it with nothing.
In the above Sinatra example, the modified title becomes:
Strangers In The Night
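A minimal sketch of this kind of substitution, assuming the title is held in a hypothetical variable $title (the starter template may name things differently):

```raku
my $title = 'Strangers In The Night (Remastered Album Version) [The Frank Sinatra Collection]';

# Match a literal left parenthesis and everything after it, then replace it with nothing.
$title ~~ s/ \( .* //;
say $title;

# A multi-character marker such as feat. can be quoted so that the period is literal.
my $other = 'Sunbeam feat. Vishal Vaid';
$other ~~ s/ 'feat.' .* //;
say $other;
```

Both results keep the space that preceded the removed text; trimming whitespace is handled separately in Task 4.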
Repeat this for patterns beginning with the left bracket, the left curly brace, and all the other
characters listed below.
( [ { \ / _ - : " ` + = feat.
Most of these are single symbols. The last is the abbreviation feat. which is short for
featuring. Make sure to remove the literal period which follows the t . For example, Sunbeam feat.
Vishal Vaid becomes Sunbeam .
Also take note that the above list shows the backtick (on the tilde key above tab) and not the
apostrophe (located left of the enter key). This is a very important distinction. We do not want to
omit the apostrophe as it allows contractions.
Many of these characters have special meanings in Raku. Make sure you properly escape symbols
when necessary with a \ . Alternately, you can put the characters to escape in quotation marks, such as
"#" . Failing to escape characters properly will be the most common mistake made in this lab. This
resource (http://www.dispersiondesign.com/articles/perl/perl_escape_characters) lists the escape
characters in Perl, most of which carry over to Raku. Note that in a Raku regex, every character other
than a letter, digit, or underscore must be escaped or quoted to match literally.
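The two escaping styles can be compared side by side. This is only a sketch on a made-up title; $escaped and $quoted are hypothetical names chosen for the example.

```raku
my $escaped = 'Night Shift - Live';
my $quoted  = 'Night Shift - Live';

$escaped ~~ s/ \- .* //;    # backslash escape
$quoted  ~~ s/ '-' .* //;   # quoted literal

# Both forms match the same literal hyphen, so the results are identical.
say $escaped eq $quoted;
```

Either style works; quoting tends to be easier to read when several symbols appear in a row.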
In most cases, these symbols indicate additional information that need not concern us for this
exercise. The above steps will very occasionally corrupt a valid song title that actually contains, for
example, parentheses in the song title. Do not worry about these infrequent cases and uniformly
carry out the procedure listed above. These steps will catch and fix the vast majority of irregularities
in the song titles.
In the code, find the incomplete function comments . Add a series of regex substitutions to remove
the superfluous information. You can now use the command filter comments .
On completion of Task 2, you can check yourself with test t03 .
Task 3: Eliminate punctuation
Next, find and delete the following typical punctuation marks.
? ¿ ! ¡ . ; : & $ * @ % # |
Unlike before, delete only the symbol itself and leave all of the text that follows. Be sure to do a
'global' match in order to replace all instances of the punctuation mark. Be careful to match the
period itself as the symbol . has a special meaning in regular expressions. This is true for many of
the symbols above. Again, refer to the list of escape characters
(http://www.dispersiondesign.com/articles/perl/perl_escape_characters) . For the symbols ¿ and ¡ ,
use the Unicode representations, which are \x[00BF] and \x[00A1] respectively. Note the colon is
intentionally repeated from Task 2.
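A short sketch of global deletion on a made-up title, covering only a few of the listed symbols. The variable $title is a placeholder, and \x[00A1] is the inverted exclamation mark.

```raku
my $title = "\x[00A1]Hey! What? Rock & Roll.";

# The :g adverb replaces every occurrence, not just the first one.
$title ~~ s:g/ \? //;
$title ~~ s:g/ \! //;
$title ~~ s:g/ \. //;          # escaped, because a bare . matches any character
$title ~~ s:g/ \& //;
$title ~~ s:g/ \x[00A1] //;    # Unicode code point written in hexadecimal

say $title;
```

Only the symbols themselves are removed, so the two spaces that surrounded the ampersand remain next to each other.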
In the code, find the incomplete function punctuation . Add a series of regex substitutions to remove
the punctuation. You can now use the command filter punctuation .
On completion of Task 3, you can check yourself with test t04 .
Task 4: Trim whitespace
All our replacements have left some blank titles or mangled titles. First, identify any titles that begin
or end with an apostrophe. Replace those apostrophes with nothing. Be careful not to disturb the
apostrophes in the interior of the sentence. Next, trim leading and trailing whitespace from each
title. Be careful not to replace all whitespace as we need whitespace between word boundaries.
To match the tests and simplify debugging, do not combine the above two operations into one, but
rather (1) handle apostrophes first and then (2) handle whitespace as a second line of code.
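The two-step order can be sketched on a made-up title as follows. The variable $title is hypothetical, and the anchors ^ and $ restrict each substitution to the very start or very end of the string.

```raku
my $title = "' Don't Stop '";

# Step 1: remove an apostrophe only at the very start or very end.
$title ~~ s/ ^ \' //;
$title ~~ s/ \' $ //;

# Step 2: trim leading and trailing whitespace, leaving interior spaces alone.
$title ~~ s/ ^ \s+ //;
$title ~~ s/ \s+ $ //;

say $title;
```

The apostrophe inside the contraction survives because neither anchor can match in the middle of the string.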
In the code, find the incomplete function clean and complete the instructions to achieve the above
goal.
Task 5: Filter out non-English characters
We want to ignore all song titles that contain a non-English character (e.g., á, ì, etc.). Although
this may eliminate some titles in languages like Spanish or Danish, we need to drop out the
Unicode garbage characters found in many of the titles.
If the titles contain only valid English alphanumeric characters ('a' to 'z' and 0 to 9) or the
apostrophe or a space, we keep it. Otherwise, we discard the song title entirely. Use the Raku
command next; to skip to the next iteration of the loop, which bypasses pushing it to the array to
return. Make sure to accept both uppercase and lowercase letters.
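One way to picture the filter is a loop that skips any title containing a character outside the allowed set. The arrays @titles and @kept are invented for this sketch and do not mirror the template's variables; note also that \s admits tabs as well as spaces, which is an assumption beyond what the task states.

```raku
my @titles = "Don't Stop", "Caf\x[00E9] Del Mar", "99 Red Balloons";
my @kept;

for @titles -> $title {
    # Keep the title only if every character is a letter, digit, apostrophe, or whitespace.
    next unless $title ~~ / ^ <[ a..z A..Z 0..9 \' \s ]>* $ /;
    @kept.push($title);
}

say @kept.elems;
```

The second title contains an accented letter, so next skips it before it is pushed.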
In the code, continue editing the function clean to achieve the above goal.
Task 6: Skip blank titles
After all the replacements, some titles are left with nothing. Continue to edit the function clean to
throw out any empty titles. Use a regular expression to check if a title is empty or contains only
whitespace. If empty, do not retain it in the array you return. Again use the Raku command next;
to skip this title. Likewise, if the title contains only an apostrophe, discard it.
To match the tests, do not combine the above two operations into one, but (1) handle whitespace
first and then (2) handle apostrophes as a second line of code.
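The two checks might look like the following sketch, again using invented arrays @titles and @kept rather than the template's own variables.

```raku
my @titles = "Night Shift", "   ", "", "'", "Don't Stop";
my @kept;

for @titles -> $title {
    next if $title ~~ / ^ \s* $ /;   # (1) empty, or whitespace only
    next if $title ~~ / ^ \' $ /;    # (2) a lone apostrophe
    @kept.push($title);
}

say @kept.elems;
```

Using \s* rather than \s+ lets the first check cover the completely empty string as well.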
In the code, continue editing the function clean to achieve the above goal.
Task 7: Set to lowercase
Convert all words in the sentence to lowercase. Raku has a simple function to do this for you. Go
back and edit the clean function one more time to convert all titles to lower case.
On completion of Tasks 4-7, you can check yourself with test t06 .
Task 8: Filtering out common words
In the domain of Natural Language Processing (NLP), stop words
(https://en.wikipedia.org/wiki/Stop_word) are common words that are often filtered out in
preprocessing. Go edit the function stopwords to filter out the following common stop words from
the song title.
a an and by for from in of on or out the to with
Be careful to only replace entire words, not just any occurrence of these letters. Otherwise you will
turn words like outstanding into stg .
The regex control character <|w> will be quite useful here, which defines a word boundary
(https://docs.raku.org/language/regexes#Word_boundary) . Additionally, remove a single whitespace
following the second word boundary. This prevents your titles being littered with double spaces
throughout because we also remove the single space that follows the word. This has the
unintended consequence that it will retain the stop word if it's the final word of the title (because
there is no space to match after), but for simplicity, let's not worry about that.
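The word-boundary approach can be sketched on a made-up, already lowercased title. The loop and the variable names $title and $word are assumptions for illustration; the template may expect one substitution per stop word instead.

```raku
my $title = 'a day in the life of';

for <a an and by for from in of on or out the to with> -> $word {
    # <|w> is a word boundary; the trailing \s consumes the single space after the word.
    $title ~~ s:g/ <|w> $word <|w> \s //;
}

say $title;
```

The final word of stays in place because no whitespace follows it, which is the retained-stop-word behavior described above.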
We may not want to always filter these words, but we want to have the option. You can now use the
command filter stopwords . You can toggle this mode using commands stopwords on and stopwords
off .
On completion of Task 8, you can check yourself with test t07 .
Postlude: Putting it all together
You can now use the command preprocess which runs all the individual filter tasks.
On completion of Tasks 1-8, you can check yourself with test t08 (stopwords off) and test
t09 (stopwords on).
Menu System
Menu system for Labs 1 and 2
command      argument      description
build                      build the bigram model
debug        on            turn on debug mode
debug        off           turn off debug mode
count        tracks        counts tracks in @tracks
count        words         counts words in @tracks
count        characters    counts characters in @tracks
filter       title         extract title from original input
filter       comments      removes extra phrases
filter       punctuation   removes punctuation
filter       unicode       removes non-Unicode and whitespace
load         FILE          loads the given FILE
length       INT           update $SEQUENCE LENGTH=INT
mcw          WORD          print most common word to follow WORD
name                       print sequence based on your name
preprocess                 run all filter tasks and build
print                      prints all tracks to file tracks.out
print        INT           prints only INT tracks to stdout
random                     print sequence based on random word
stopwords    on            filter stopwords on
stopwords    off           filter stopwords off
sequence     WORD          find sequence to follow WORD
Resources
This lab requires an independent study of the Raku language. You are encouraged to use any web
tutorials and resources to learn Raku.
Regular Expression Tester (https://regex101.com/)
Raku Guide (https://raku.guide/)
Raku Resources (https://raku.org/resources/)
Raku Tutorial (https://www.i-programmer.info/news/222-perl/13267-the-raku-beginner-tutorial.html)
Raku Documentation (https://docs.raku.org/language.html)
Allow yourself plenty of time, and use patience, perseverance, and the Internet to debug your code.
Testing
In the directory tests you will find a number of pairs of files representing the given input and
expected output of the test.
raku lab1.raku < ./tests/t01.in
The above line will execute lab1.raku , redirecting the commands in file t01.in to its standard input. You can
compare your result with the expected output given in file t01.out .
Submission
Lab 1 Rubric
Each student will complete and submit this assignment individually. Do not consult with others.
However, you are encouraged to use the Internet to learn Raku.
You will submit only your Raku code file.
lab1.raku
Do not submit lab1_template.raku , tracks.txt or 381_lab1_starter_pack.zip .
Grading criteria
This project is worth 100 points. Include your name in your file. Format your code neatly using
expected conventions for indentations. Use descriptive variable names. Comment your program
heavily. Thoughtful and unique comments help protect you against accusations of violations of
academic integrity.
Your program will be evaluated against a series of automatic tests using Gradescope. The grading
tests will be the same as the provided tests but will examine a different subset of the one million
tracks. Your grade is determined by the number of public and hidden tests your program passes.
Criteria                                                         Passed   Partial (2)   Partial (1)   Failed Test
Test 1  Credit for submission                                    16 pts   12 pts        8 pts         4 pts
Test 2  Task 1 - Extract title                                   12 pts   8 pts         4 pts         0 pts
Test 3  Task 2 - Filter Comments                                 12 pts   8 pts         4 pts         0 pts
Test 4  Task 3 - Filter Punctuation                              12 pts   8 pts         4 pts         0 pts
Test 6  Tasks 4-7 - Clean whitespace, apostrophes, and Unicode   12 pts   8 pts         4 pts         0 pts
Test 7  Task 8 - Remove stopwords                                12 pts   8 pts         4 pts         0 pts
Test 8  Full check, without stopwords                            12 pts   8 pts         4 pts         0 pts
Test 9  Full check, with stopwords                               12 pts   8 pts         4 pts         0 pts
Total Points: 100