联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> C/C++编程C/C++编程

日期:2024-04-02 09:31

CSC 330 Programming Languages

Project

Note 1 This project is to be done individually

Note 2 Working with other people is prohibited.

Note 2 Sharing queries or files with other people is prohibited.

A note on Academic Integrity and Plagiarism

Please review the following documents:

• Standards for Professional Behaviour, Faculty of Engineering:

https://www.uvic.ca/engineering/assets/docs/professional-behaviour.pdf

• Policies Academic Integrity, UVic:

https://www.uvic.ca/students/academics/academic-integrity/

• Uvic’s Calendar section on Plagirism:

https://www.uvic.ca/calendar/undergrad/index.php#/policy/Sk_0xsM_V

Note specifically:

Plagiarism

Single or multiple instances of inadequate attribution of sources should result in a failing grade

for the work. A largely or fully plagiarized piece of work should result in a grade of F for the

course.

Submissions will be screened for plagiarism at the end of the term.

You are responsible for your own submission, but you could also be responsible if somebody plagiarizes

your submission.

1 Objectives

After completing this project, you will have experience:

• Programming Functionally

• Basket analysis in big datasets using the A-priori algorithm

2 The A-priori algorithm

The A-priori algorithm is used to identify commonly found groups of items in basket analysis. For example,

if you buy milk, what other item are you likely to buy?

The input file to this algorithm will be a CSV file. Each CSV file is a set of transactions, one per line. A

transaction is a set of items sold together (each item is separated by a delimiter character). A transaction is

guaranteed to have no duplicate items.

1

Our goal is to identify pairs of items that are sold together above certain minimum threshold. A minimum

support threshold is the minimum proportion of transactions an item should appear in order to be be part of

the output (we will call these popular items).

The A-priori algorithm is relatively simple. Assume that i is the number of different items in the dataset,

and n is the number of popular items. Using dictionaries as the main data structure, the memory requirements

for this algorithm are O(max(n

2

, i)) irrespectively of the size of the input (which can be much larger than

the available memory). To accomplish this, the algorithm does two passes on the input data.

• First pass. In the first pass of all transactions, a dictionary of popular items is computed. This dictionary will have as its key the item name, and as its value its frequency (number of transactions where

the item appears). After all transactions are read, compute the minimum support (an integer computed

by multiplying the minimum support threshold by the number of transactions, truncating this number,

i.e. floor). Return a dictionary that contains only the popular items. The result of the first pass is a

dictionary that contains only popular items.

• Second pass. Read the transactions again. This time keep a counter of popular pairs of items. The key

will be the pair of items (item1, item2). The frequency is symmetric, freq(item1, item2) = freq(item2,

item1). For this reason the dictionary should keep only one copy per pair (e.g. order them such

that given item1, item2, the key to the dictionary is always (min(item1,item2), max(item1,item2)).

Return a dictionary that contains only the popular pairs of items. The result of the second pass is a

dictionary that contains only popular pairs of items.

• Report the results. Using the two dictionaries, print a table with the results. See below for details.

Note that some CSV files use ’,’ and some ’;’ as a separator.

3 Your task, should you choose to accept it

Implement 3 functions that implement the 3 steps of the A-priori algorithm

3.1 Preliminaries: how to run your program

Your program is run with the following arguments. See the Makefile for the specifics of how to compile and

run the program. You will find in the Makefile 4 test cases.

<filename> <delimiter> <MinimumSuportThreshold> <linesToPrint>

• filename. A CSV delimited file. One record per line. You can assume there are no duplicates in a

record.

• delimiter. A one character string that indicates the separator between items in a record. Use quotes

around it to avoid the shell to interpret the character.

• MinimumSuportThreshold. The minimum support threshold that an pair should have to be part of the

output (between 0 and 1)

• linesToPrint. Print at most this number of items. If the result has more frequent items than this

number, print only this number of pairs.

2

For example, the command:

./apriori online.csv ’,’ .01 10

uses the file online.csv with delimiter , (comma) and minimum support threshold of 0.01 and requests to print at most the first 10 items.

3.2 Implementation details

Download from Gitlab the tar file. It contains several files. You will only modify and submit apriori.ml.

Note that this file defines a module that must match the signature provided.

Your job is to implement 3 functions:

3.3 First Pass

let do_first_pass ((threshold: float),

(lines: in_channel),

(delim: char)): first_pass_result =

This function takes three parameters:

1. threshold: The minimum threshold (between 0 and 1),

2. lines: a input stream from where to read the lines,

3. delim: the delimiter of the items in each line (a char).

Read the stream with the function input line. Note that this function generates an exception (that

you must handle) when the end of file is reached.

do first pass should return a record with 3 values (see apriori.ml for the definition of the record):

1. the number of transactions in the stream,

2. the minimum support (i.e. the minimum number of times a frequent item should appears), and

3. the dictionary of the frequent items with their corresponding frequency.

To compute the minimum support truncate the floating point number to an integer. The dictionary should

contain only items with at least this support.

3.3.1 Second Pass

let do_second_pass((support: int),

(popItems: items_dict),

(lines: in_channel),

(delim:char)): pairs_dict =

This function takes 4 parameters:

3

1. support: the minimum support (an integer),

2. popItems: the dictionary of frequent items,

3. lines: the input stream with the transactions,

4. delim: the delimiter (a char)

It should return a dictionary that contains the popular pairs equal or above the minimum support threshold and their corresponding frequency.

3.4 print table

let print_table((nTransactions: int),

(popItems: items_dict),

(popPairs: pairs_dict),

(toPrint:int)):int =

This function prints a table with the popular pairs, and some extra information. It takes 4 parameters:

1. nTransactions is the number of records in the file (result of first pass).

2. popItems is the popular items (result of the first pass),

3. freqPairs is the list of popular pairs (result of the second pass).

4. toPrint: print at most a given number of pairs.

The columns of these report are:

• Support of the pair: the proportion of transactions that have this pair.

• The item name.

• The frequency of this item. Number of transactions where this item appears.

• The support of this item. The proportion of transactions that include this item.

• The other item in the pair.

• The frequency of this pair. Number of transactions where this pair appears.

• Confidence: the support of the pair divided by the support of other item. This number represents the

proportion of transactions of the second item where the first item appears. When it one it means that

every time the second item appears, the first also appears.

• Lift. The support of the pair divided by the product of the support of both items.

Running with parameters: Filename [./data/online.csv] Separator [,] Minimum relative support threshold

[0.01] and Print at most 10 tuples.

4

SuppPair-Item - Freq- Support-With - FreqPair- Conf. - Lift

0.010887 regency tea plate green 386 0.014902 regency tea plate pink 282 0.898 60.265

0.010501 regency tea plate roses 457 0.017643 regency tea plate pink 272 0.866 49.097

0.012431 regency tea plate roses 457 0.017643 regency tea plate green 322 0.834 47.281

0.010887 wooden star christmas scandinavian 515 0.019883 wooden tree christmas scandinavian 282 0.832 41.838

0.013512 set/6 red spotty paper plates 527 0.020346 set/6 red spotty paper cups 350 0.818 40.193

0.024863 green regency teacup and saucer 1057 0.040808 pink regency teacup and saucer 644 0.804 19.702

0.010154 poppy’s playhouse kitchen 440 0.016987 poppy’s playhouse livingroom 263 0.797 46.916

0.010115 poppy’s playhouse bedroom 426 0.016447 poppy’s playhouse livingroom 262 0.794 48.274

0.013512 small dolly mix design orange bowl 528 0.020385 small marshmallows pink bowl 350 0.781 38.326

0.012354 bathroom metal sign 709 0.027372 toilet metal sign 320 0.780 28.514

Number items printed: 10

Number transactions: 25902

Number popular items: 592

Number popular pairs: 746

The results should be ordered according to (think order by in SQL):

1. confidence descending,

2. lift descending,

3. item frequency descending,

4. item name, ascending,

5. item name it appears with, ascending.

Use these bindings to format the output (they are defined in apriori.ml):

let lineTitleFormatString = format_of_string

" SuppPair-%s - Freq- Support-%s - FreqPair- Conf. - Lift\n"

let lineDataFormatString = format_of_string

"%9.6f %s %6d %8.6f %s %6d %9.3f %9.3f\n"

These work in the same way than printf in C. Use Note that the columns for both items (%s) do not

have a width specification. This is because their width will be dynamically computed. The width of each of

the columns is the maximum width of any of the corresponding items to be printed (but cannot be less than

10 characters).

3.5 Tests

The files you downloaded contain several tests. The directory expected contains the expected output. The

directory data contains four datasets. See the Makefile to understand how to run the program and generate

each output. Use diff to compare your output to the expected one (see Makefile).

Your program’s output should be identical to the expected output for the same command line and input

dataset.

3.6 Dictionaries

The dictionaries used in this assignment are non-mutable. As with our assignments, when we insert or delete

from a dictionary, a new dictionary is created. Unfortunately it means that to update a value, you have to

remove it and insert it again. The dictionaries are defined in the file dicts.ml.

To create a dictionary of type ItemMap use Dicts.ItemMap.empty.

To find a value in a dictionary use Dicts.ItemMap.add key value dictionary.

To remove a value use Dicts.ItemMap.remove key dictionary. Both functions are curried.

5

3.7 Grading

The grading is divided into 3 parts:

• First pass: 30%

• Second pass: 30%

• Report: 40%

Your functions will be tested one at a time, using the signature provided.

Your program will be tested in Linux.csc.uvic.ca. Make sure it compiles there. If your program does not

compile, it will receive a zero.

If your program generates any warnings, it will receive a zero.

3.8 Hints

1. The algorithm is relatively straightforward, most of the code you will write is to parse the input and

to format the output.

2. Do not read the stream more than once per pass, because you function might run out of memory.

This means you need to compute its length as you are processing it.

3. Be careful with the output. Your program’s output should match byte-by-byte the expected output.

4. For anything not precisely specified here, see expected output and check/ask in brightspace. There

will a forum to discuss the assignment.

3.9 Other information and restrictions

Violating any of the following restrictions will result in a grade of zero:

1. No mutation.

2. The input stream should not be processed more than once in each function.

3. Your functions should read only one line at a time. You cannot read the entire input stream to

memory.

You can share test cases with others but you are forbidden from sharing source code.

4 What to submit

Submit, via Brigthspace:

1. the file apriori.ml

6


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp