CS 335 2018 Page 1 of 5 A#1R v1
Word Frequency Table
Finding the most common words in a text file is a famous “programming pearl”,
originally posed by Jon Bentley (1984). Several interesting solutions have been
proposed by Knuth (an exquisite model of “literate programming”, 1986),
McIlroy (an engineering example of combining a timeless set of “UNIX tools”,
1986), Hanson (an alternate efficient solution, 1987).
This assignment requires a simple solution, comparable in style and performance
to McIlroy’s pipelined solution (though not as performant as the much more
complex solutions of Knuth and Hanson). For this, we will use a simple functional
programming style based on .NET LINQ framework.
Problem statement: Given a text file and an integer k, you are to print the
words (and their frequencies of occurrence) whose frequencies of occurrence are
the k largest (upper limit) in order of decreasing frequency; breaking any ties in
increasing words’ alphabetical order. Words are consecutive sequences of ASCII
alphabetical letters (A-Za-z), where letter case is ignored, and all other
characters are ignored. A short minimalistic example will clarify these specs.
Sample text file, called test1.txt:
五 πέντε пять
dd00 DD?!
ccc.
AA aa, Aa.
CCC bb aA?
!
Sample result, for k=3: The output must be a text file with two columns. Extra
white spaces are permitted but discarded. Please use this format, because your
output will be assessed by a script, using whitespace tolerant textual
comparisons.
4 AA
2 CCC
2 DD
CS 335 2018 Page 2 of 5 A#1R v1
To understand this, let us consider the following intermediate steps.
Word list extracted from this text file:
dd DD ccc AA aa Aa CCC bb aA
Canonical word list – upper case versions to ignore the case:
DD DD CCC AA AA AA CCC BB AA
An unsorted frequency table:
4 AA
1 BB
2 CCC
2 DD
A sorted frequency table: 1st by decreasing frequencies, 2nd by increasing
alphabetical word order:
4 AA
2 CCC
2 DD
1 BB
Taking the first k=3 rows only, we obtain the above shown result.
Programs: (1) a C# solution (required); (2) a Node JS solution (required); and
(3) an F# solution (optionally, bonus). Essentially, each program must be totally
contained in one single file and use only the standard libraries extant in the labs.
The C# solution must use the .NET LINQ framework and System.IO.
The Node JS solution must only use linq-es2015 (external) and fs (internal).
The F# solution must only use the F# core modules and System.IO.
These programs must be named after your upi, e.g. jbon007.cs, jbon007.js,
jbon007.fs. The dropbox will reject programs that do not follow this convention.
Compilation: Assuming the C# and F# compilers are on the current PATH, the
required C# and the optional F# programs must be compilable in the labs with
these simple commands (no external libraries are permitted):
csc jbon007.cs
fsc jbon007.fs
The required Node JS program will not be compiled, but executed directly by
Node.
CS 335 2018 Page 3 of 5 A#1R v1
Execution – C# and F#
After the above compilation, the C# and F# programs will have a name like
jbon007.exe. This program will be given (1) a mandatory parameter, the text
file name (path); and (2) an optional parameter, the integer k, with the default
value k=3. If errors occur, an error line must appear. Sample runs and outputs:
>jbon007.exe test1.txt 2 all arguments given
4 AA
2 CCC
>jbon007.exe test1.txt 10 all arguments given
4 AA
2 CCC
2 DD
1 BB
>jbon007.exe test1.txt default value k=3
4 AA
2 CCC
2 DD
>jbon007.exe test1.txt x incorrect number k
*** Error: ...
>jbon007.exe testx.txt assuming that testx.txt does not exist
*** Error: ...
>jbon007.exe missing file name
*** Error: ...
In the error texts, in place of ellipses (...), display either system created
messages, or your custom messages (presumably more adequate). However, we
do not prescribe these texts. The important part is that the programs must catch
and display something meaningful for all possible errors.
CS 335 2018 Page 4 of 5 A#1R v1
Execution – Node JS
The JS program will be executed under Node, so the command lines will be
longer and errors may look differently - but still all errors must be caught.
>node jbon007.js test1.txt 2
4 AA
2 CCC
>node jbon007.js test1.txt 10
4 AA
2 CCC
2 DD
1 BB
>node jbon007.js test1.txt
4 AA
2 CCC
2 DD
>node jbon007.js test1.txt x
*** Error: ...
>node jbon007.js testx.txt
*** Error: ...
>node jbon007.js
*** Error: ...
CS 335 2018 Page 5 of 5 A#1R v1
Functional style: Besides the results’ accuracy, your submission will also be
assessed on its high-level functional style. For this assignment, we will be
specifically looking at a consistent formatting, a reasonable size, and at the
number of low-level explicit loops used, i.e. for, foreach, while statements.
Generally, a functional style naturally results in crisper and shorter programs
(without artificially squeezing many expressions on one single line). The model
solutions have about 50-60 lines, including a few blank lines inserted for
readability, and no explicit loops.
As we will see in the lectures, functional programming offers high-level functions
that encapsulate and hide most required loops.
Submission
Please submit to the Assignment Dropbox (https://adb.auckland.ac.nz/) the
following items:
(1) Your C# source file, upi.cs - required
(2) Your Node JS source file, upi.js - required
(3) Your F# source file, upi.fs – allowed (optional bonus)
References (just in case you are interested to read more about this problem
and its very efficient solutions)
Bentley, J., Knuth, D., McIlroy, D.: Programming pearls: A literate program.
Commun. ACM 29(6), 471-483 (Jun 1986),
http://doi.acm.org/10.1145/5948.315654
Appendix
A sample bash script solution, based on McIlroy’s pipeline of UNIX tools. Again,
no explicit loops. Just in case, as you do not have to study this if are not familiar
with UNIX.
cat test1.txt |
tr -cs A-Za-z '\n' |
awk 'NF' |
tr a-z A-Z |
gsort |
uniq -c |
gsort -k1rn -k2 |
head -3
#sed 3q
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。