联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Java编程Java编程

日期:2023-03-16 09:04

University of Edinburgh School of Informatics


INFR11199 - Advanced Database Systems (Spring 2023)


Coursework Assignment


Due: Thursday, 19 March 2023 at 4:00pm


IMPORTANT:


Plagiarism: Every student has to work individually on this project assignment.


All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. You may not share your code with other students. You may not host your code on a public code repository.


Each submission will be checked using plagiarism detection software.


Plagiarism will be reported to School and College Academic Misconduct Officers. See the University’s page on Academic Misconduct for additional information.


Start early and proceed in steps. Read the assignment description carefully before you start programming.


The assignment is out of 100 points and counts for 40% of your final mark.


INFR11199 Coursework Assignment Page 2 of 17


1 Goals and Important Points


In this assignment, you will implement the minimization procedure for conjunctive queries and a lightweight database engine for evaluating queries called Minibase.


The goals of this assignment are threefold:


1. to understand the minimization procedure for conjunctive queries,


2. to teach you how to translate conjunctive queries into relational algebra query plans,


3. to learn the iterator model for relational operator evaluation and implement the most common operators (e.g., selection, projection, join, aggregation).


The assignment consists of three tasks:


Task 1 (30%): Implementation of the minimization procedure for conjunctive queries. Task 2 (40%): Implementation of the iterator model and common RA operators. Task 3 (20%): Optimisation of constructed query plans.


The remaining 10% of your mark is for code style and comments; more details later on. Task 3 requires elements of independent thinking and creativity, which are necessary for distinction marks according to the Common Marking Scheme.


You should consider the set semantics for each task, meaning that every relation consists of distinct tuples and the input and output of each operator contain no duplicates.


You will start from a skeleton project consisting of two main classes:  CQMinimizer, which should contain your code for Task 1


Minibase, which should contain your code for Tasks 2 & 3.


Both classes define the expected command line interface. You are free to modify these


classes but must preserve their command line interface as it is essential for marking.


The skeleton project also comes with a parser for our query language, so you do not have to write your own (unless you want to). The main classes show how to parse query strings into Java objects.


The rest of this document describes in detail how to complete the assignment. Take time to read it carefully. Note that some topics will be covered later in the course. Start working as early as possible. This is not a project that should be left to the last minute.


Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 3 of 17


2 Setting Up Local Development Environment


You are free to use any text editor or IDE to complete the project. We will use Maven to compile your project. We recommend setting up a local development environment by installing Java 8 or later and using an IDE such as IntelliJ or Eclipse. To import the project into IntelliJ or Eclipse, make sure that you import as a Maven project (select the pom.xml file when importing).


3 Task 1: CQ Minimization


In Task 1, you will implement the minimization algorithm for conjunctive queries dis- cussed in class. You will build a program that reads a query from the specified input file, minimizes the query, and writes a minimized query in the specified output file.


3.1 Query Language


Task 1 considers the language of conjunctive queries as presented in class, with queries expressed using the rule-based form:


Q(v) :- R1(t1),...,Rm(tm)


 head


body


where R1,...Rm are relation names, v is a tuple of variables, t1,...,tm are tuples of terms, and each variable from v appears in the body. A term can be a positive integer constant (e.g., 42), a string constant enclosed in single quotation marks (e.g., ’ADBS’), or a variable. A valid variable name consists of lowercase letters, e.g., a,x,abc are valid variable names; note that this is different from the notation used in class where a,b,c denote constants. A valid relation name consists of uppercase letters, and the relation name in the head does not appear in the body. In our query language, ’ADBS’ is a string constant while ADBS (without quotation marks) is a valid relation name.


Examples of valid conjunctive queries are:


Q(x) :- R(x, ’This is a test string’) Q(x, z) :- R(4, z), S(x, y), T (12, ’x’)


Q() :- R(y, z), R(z, x)


DBSTUDENTNAME (name) :- STUDENT (id, name), ENROLLED (id, ’ADBS’)


Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 4 of 17 Examples of invalid conjunctive queries in our language are:


Q(y) :- R(x, ’test’) Q() :-


Q(x, 3) :- R(4, z), S(x, y) Q(y) :- R(y), S()


– y does not appear in body – empty body – constant in head – empty relational atom in body


You can assume that only valid, correctly-typed conjunctive query will be provided as input. For instance, Q(x) :- R(x, 4), R(y, ’4’) is wrongly-typed as the second attribute of R does not have a unique type. Similarly, Q(x) :- R(x, y), R(x, y, z) is also wrongly-typed as the arity of R is not unique. Such queries will never appear as input.


For simplicity, you can assume that every relational atom appearing in the body or the head has no repeated variables. For instance, the atom R(x, x) is invalid according to this restriction, while R(x, xx) and R(3, 3) are valid; the query Q(x, x) :- R(x) is also invalid. But the same variable can appear in different relational atoms in the body.


You do not need to perform any type or naming checks on the input query.


3.2 Query Parser


The provided code contains a parser for our query language. The parser was gen- erated using ANTLR (https://www.antlr.org/) based on the grammar defined in parser/Minibase.g4. Have a look at the grammar to understand the structure of the language constructs. The base directory contains Java classes matching these constructs. You are free to modify and extend these classes.


The grammar also allows comparison atoms such as x > 3 and x != y to appear in the body and SUM aggregation to appear in the head. We will consider comparison atoms and aggregation in Tasks 2 & 3. For Task 1, we will assume that the body consists of relational atoms only and that the head consists of (non-repeated) variables only.


The parser/generated directory contains the parser classes generated by ANTLR. Do not modify these generated classes directly. The class from parser/QueryParser.java uses the visitor pattern1 to construct a Query object from ANTLR objects.


To get you started, we provide a simple method in CQMinimizer.java that uses the parser to read a query from a file, print it out, and access fields of the Query object.


1 https://en.wikipedia.org/wiki/Visitor_pattern


 Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 5 of 17


3.3 Input and Output


We provide a few sample conjunctive queries. The data/minimization directory con- tains input and expected output as subdirectories. The input directory contains files with example conjunctive queries. There is one query per file. The expected output directory contains the expected output files for the given sample queries. For exam- ple, the query in expected output/query1.txt is the minimized version of the query in input/query1.txt.


CQMinimizer.java provides the command-line interface that expects two arguments: the path to an input file and the path to an output file. You are free to modify the provided code and include additional classes and packages, but you must preserve the existing command-line interface. You can run the CQMinimizer class from your IDE and provide the required arguments.


3.4 Compile and Run


We will compile your code from the command line as follows:


     mvn clean compile assembly:single


This command will produce target/minibase-1.0.0-jar-with-dependencies.jar. We will run CQMinimizer as follows:


$ java -cp target/minibase-1.0.0-jar-with-dependencies.jar \ ed.inf.adbs.minibase.CQMinimizer


     Usage: CQMinimizer input_file output_file


CQMinimizer requires passing two mandatory arguments.


$ java -cp target/minibase-1.0.0-jar-with-dependencies.jar \ ed.inf.adbs.minibase.CQMinimizer \


          data/minimization/input/query1.txt \


          data/minimization/output/query1.txt


       Entire query: Q() :- R(x, ’z’), S(4, z, w)


       Head: Q()


       Body: [R(x, ’z’), S(4, z, w)]


The provided sample code parses the query from the input file into a Java object and prints different parts of the query. Your code should print the minimized query into the specified output file. You can assume that the output directory already exists.


After running your code, we will check whether the query in the output file is equiva- lent to ours up to variable renaming; e.g., Q(x) :- R(x) would be equivalent to Q(y) :- R(y). We will test your code on a set of conjunctive queries, including the sample queries.


      Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 6 of 17


4 Task 2: Query Evaluation


In Task 2, you will implement a simple interpreter for conjunctive queries. You will build a program that takes in a database (a set of files with data) and an input file containing one query. The program will process and evaluate the query on the database and write the query result in the specified output file. The evaluation should be based on the set semantics, meaning no tuple duplicates should appear in the input and the output.


4.1 Query Language


For Tasks 2 & 3, we extend the query language from Task 1 to allow comparison atoms in the body of a query. A comparison atom has the form:


term1 op term2


where term 1 and term 2 are constants or variables, and op ∈ {=, !=, <, <=, >, >=}. For instance, x = y, x < 4, ’ADBS’ >= z, 3 > 12 are valid comparison atoms. String comparison should be performed lexicographically.


Comparison atoms in the body of a query represent a conjunction of predicates that filter the query result, similarly to a conjunction of predicates in the WHERE clause in SQL.


For instance, the query


Q(name) :- STUDENT(id,name,age), age = 23, name != ’Jones’


returns the names of all students who are 23 years old and whose name is not ’Jones’. The equivalent SQL query is (remember that this assignment assumes the set semantics):


SELECT DISTINCT name FROM STUDENT WHERE age = 23 AND name != ’Jones’


You can assume that both terms in a comparison atom are always of the same type. Furthermore, every variable appearing in a comparison atom must also appear in at least one relational atom. No need to perform such checks.


We also extend the query language with the SUM aggregate; more details are in Section 4.5.5.


We make a few simplifying assumptions as below. When we say a query is valid, we mean it is a permitted input to your interpreter which you should be able to handle. When we talk about a base relation, we mean a real relation that exists in the database.


You may assume all valid queries only refer to relations that exist in the database.


You may assume there will be at least one relational atom in the body of the query.


Two relational atoms with the same name may appear in the body of the query.


The head atom contains at least one term. For instance, Q() :- R(x,y) is a valid Boolean conjunctive query but will not be considered in Tasks 2 & 3.


Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 7 of 17


4.2 Data and Output Formats


We have provided you some sample data and some sample queries. Take a look at the data/evaluation directory. It has db, input, and expected output as subdirectories.


The input directory has files with example queries. There is one query per file.


The db directory contains schema.txt specifying the schema for your database as well as the files subdirectory, where the data itself is stored. The names schema.txt and files are hard-coded and must exist in a valid database directory. Your program should support an arbitrary schema defined in schema.txt, not just the schema of the sample data.


The schema.txt file contains one line per relation (table) in the database. Every line contains several strings separated by spaces. The first string on each line is the relation name and all the remaining ones are attribute (column) types, in the order in which they appear in the relation. The possible names for attribute types are int and string.


The data subdirectory contains one file per database relation, and the name of the file is the same as the name of the database relation with the added .csv extension. Every file contains zero or more tuples without duplicates; a tuple is a line in the file with field (attribute) values separated by commas. The type of each value is defined in the schema file. You do not have to handle null values, but you do need to handle empty relations.


The expected output directory contains the expected output files for the queries we provided. For example, query1.csv contains the expected output for the query in query1.txt. The format for the output is the same as the format for the data.


4.3 Compile and Run


We will compile and run your code from the command line as in Task 1:


$ mvn clean compile assembly:single


$ java -cp target/minibase-1.0.0-jar-with-dependencies.jar \


         ed.inf.adbs.minibase.Minibase \


          data/evaluation/db \


          data/evaluation/input/query1.txt \


          data/evaluation/output/query1.csv


       Entire query: Q(x, y, z) :- R(x, y, z)


       Head: Q(x, y, z)


       Body: [R(x, y, z)]


  Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 8 of 17


The Minibase class requires passing three mandatory arguments: the path to a database directory, the path to an input file, and the path to an output file. Your code should handle these arguments appropriately (i.e., do not hardcode any paths).


We will test your code using our own test queries and databases with potentially different schemas, including different relation names. The database directory will have the same structure as described above, with files in the files directory named according to the database schema with .csv as the file extension. You may assume that prior to execution, a given output file does not exist but the output directory does exist.


After we run your code, we will compare your output files with ours. The order of tuples in the output file does not matter for this assignment. As you can imagine, it is very important for you to respect the expected input and output format.


Note: We will test your code on a DICE machine with Ubuntu Linux. Remember that Linux/MacOS environments use ’/’ as path separator. The database directory will be provided with no final ’/’ symbol, as above. If you use Windows, make sure that when you form file paths, you use File.separator instead of ’\’ as path separator.


4.4 Operators and the Iterator Model


A key abstraction in this project will be the iterator model for relational operators. You will implement several operators:


selection, projection, (tuple nested loop) join, and group-by aggregation.


a scan operator which is the leaf operator for any query plan. This is really a physical operator rather than something you would add to the relational algebra, but for now we will put it in the same category as the above.


The standard way to implement all relational operators is to use an iterator API. You should create an abstract class Operator, and all your operators will extend that. Certain operators may have one or two child operators. A scan operator has no children, a join has two children, and the remaining operators have one child. Your end goal is to build a query plan that is a tree of operators.


Every operator must implement the methods getNextTuple() and reset() (put these in your abstract Operator class). The idea is that once you create a relational operator, you can call getNextTuple() repeatedly to get the next tuple of the operator’s output. This is sometimes called “pulling tuples” from the operator. If the operator still has some available output, it will return the next tuple, otherwise it should return null.


The reset() method tells the operator to reset its state and start returning its output again from the beginning; that is, after calling reset() on an operator, a subsequent call to getNextTuple() will return the first tuple in that operator’s output, even though the


 Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 9 of 17


tuple may have been returned before. This functionality is useful if you need to process an operator’s output multiple times, e.g., for scanning the inner relation multiple times during a join.


For each of the above operators, you will implement both getNextTuple() and reset(). Remember that if your operator has a child operator, the getNextTuple() of your operator can - and probably will - call getNextTuple() on the child operator and do something useful with the output it receives from the child.


A big advantage of the iterator model, and one of the reasons it is popular, is that it supports pipelined evaluation of multi-operator plans, i.e., evaluation without material- ising (writing to disk) intermediate results.


The bulk of this task involves implementing each of the above operators, as well as writing code to translate a given query (i.e., a line of text) to a query plan (i.e., a suitable tree of operators). Once you have the query plan, you can actually compute the answer to the query by repeatedly calling getNextTuple() on the root operator and putting the tuples somewhere as they come out.


We suggest you add a dump() method to your abstract Operator class. This method repeatedly calls getNextTuple() until the next tuple is null (no more output) and writes each tuple to a suitable PrintStream. That way you can dump() the results of any operator – including the root of your query plan – to your favourite PrintStream, whether it leads to a file or whether it is System.out (for testing).


4.5 Implementation Instructions


We recommend that you implement and test one feature at a time. Our instructions below are given in suggested implementation order.


We also recommend (but do not require) you set up a test infrastructure early on. You should do two kinds of testing – unit tests for individual components and end-to-end tests where you run your interpreter on queries and look at the output files produced to see if they match a set of expected output files. As you add more features, rerun all your tests to check that you didn’t introduce bugs that affect earlier functionality.


After you implement and test each feature, make a copy of your code and save it so if you mess up later you still have a version that works (and that you can submit for partial credit if all else fails!).


4.5.1 Implement Scan


Your first goal is to support queries that are full relation scans, e.g., Q(x, y, z) :- R(x, y, z). To achieve this, you will need to implement your first operator – the scan operator.


Coursework Assignment – continues. . .

INFR11199 Coursework Assignment Page 10 of 17


Implement a ScanOperator that extends your Operator abstract class. Every in- stance of ScanOperator knows which base relation it is scanning. Upon initialisation, it opens a file scan on the appropriate data file; when getNextTuple() is called, it reads the next line from the file and returns the next tuple. You probably want to have a Tuple class to handle the tuples as objects.


The ScanOperator needs to know where to find the data file for its relation. It is recommended to handle this by implementing a database catalog in a separate class. The catalog can keep track of information such as where a file for a given relation is located, what the schema of different relations is, and so on. Because the catalog is a global entity that various components of your system may want to access, you should consider using the singleton pattern for the catalog; if unfamiliar with the singleton pattern, see many other online references.


Once you have written ScanOperator, test it thoroughly to be sure getNextTuple() and reset() both work as expected. Then, hook up your ScanOperator to your inter- preter. Assuming that all your queries are of the form Q(x, y, . . .) :- RELATION (x, y, . . .), write code that grabs RELATION from the body and constructs a ScanOperator for it.


In summary the top-level structure of your code at this point should be:


parse the query from the input file


construct a ScanOperator for the relation in the body of the query


call dump() on your ScanOperator to send the results somewhere helpful, like a file or your console.


4.5.2 Implement Selection


The next order of business is single-relation selection. That is, you are aiming to support queries like Q(x, y, z) :- R(x, y, z), y > 3.


This means you need to implement a second Operator, which is a SelectOperator. Your query plan will now have two operators – the SelectOperator as the root and the ScanOperator as its child. During evaluation, the SelectOperator’s getNextTuple() method will grab the next tuple from its child (i.e., from the scan), check if that tuple passes the selection condition, and if so output it. If the tuple does not pass the selection condition, the selection operator will continue pulling tuples from the scan until either it finds one that passes or it receives null (i.e., the scan runs out of output).


The tricky part will be implementing the logic to check if a tuple passes the selection condition. The selection condition is a list (conjunction) of ComparisonAtoms from the body of your query. The SelectOperator needs to know that selection condition.


Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 11 of 17


You will need to write a class to test whether a selection condition holds on a given tuple. For example, if you have a relational atom R(x,y,z), you may encounter a tuple (1, 9, ’adbs’) and a selection condition consisting of two atoms, x < y and z! = ’adbs’, and you need to determine if the condition is true or false on this tuple.


Your code needs some way to resolve variable references; i.e., if our input tuple is (1, 9, ’adbs’) coming from the atom R(x, y, z), your code needs a way to determine that x is 1, y is 9, etc. So, it also needs to take in some schema information and allow mapping from variable references like x to their value.


Once you have written your class for evaluating selection conditions, unit-test it thor- oughly. Start with simple expressions that have no variable references, like the conjunc- tion of 1 < 2 and 3 = 17. Then test it with variable references until you are 100% sure it works. Once your expression evaluation logic is solid, you can plug it into the getNextTuple() method of your SelectOperator.


Queries with constants in relational atoms are also valid, e.g., Q(x, y) :- R(x, y, 4) is valid and equivalent to Q(x, y) :- R(x, y, z), z = 4. Support the latter form with explicit comparisons first and then do the former form, which also requires projection.


4.5.3 Implement Projection


Your next task is to implement projection, i.e., you will be able to handle queries of the form Q(x) :- R(x, y, z), y > 10. You get the projection variables from the query head. For this task, you may assume that every term in the head is a variable and that constants cannot appear in the head.


For implementing projection, you need a third Operator that is a ProjectOperator. Recall that projection must return only distinct tuples. When getNextTuple() is called, it grabs the next tuple from its child, extracts only the desired values into a new tuple, and returns that tuple if it has not been reported before; otherwise, the operator fetches the next tuple from its child. Note that the child could be either a SelectOperator or a ScanOperator, depending on whether your query has a selection condition.


To ensure that only distinct tuples are reported, the operator can maintain an internal buffer containing already reported (distinct) tuples. Buffering all these tuples in memory may not work for very large outputs; for this project assignment, this is fine.


Note that the variable order in the head atom does not have to match the attribute order in the relation. The queries Q(x, y) :- R(x, y) and Q(y, x) :- R(x, y) are both valid and produce different output results.


By this point you should have code that takes in a query and produces a query plan containing:


an optional projection operator, having as a child Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 12 of 17 ? an optional selection operator, having as a child


a non-optional scan operator.


Thus, your query plan could have one, two or three operators. Make sure you are supporting all possibilities; try queries with/without a projection/selection. If the query does not project away any variables and does not reorder variables in the head, then do not create a projection operator, and if the query has no selection predicate, do not create a selection operator.


You are now producing relatively complex query plans; however, things are about to get much more exciting and messy as we add joins. This is a good time to pull out the logic for constructing the query plan into its own class, if you have not done so already. Thus, you should have a top-level interpreter class that reads the statement from the query file. You should also have a second class that knows how to construct a query plan for a Query object, and returns the query plan back to the interpreter so the interpreter can dump() the results of the query plan somewhere.


4.5.4 Implement Join


Next up, the star of the show: joins.


You need a JoinOperator that has both a left and right child Operator. It also has a selection predicate that captures the join condition. This selection predicate could be a single comparison such as x = y, a conjunction of comparisons, or it could be null if the join is a cross product.


Implement the simple (tuple) nested loop join algorithm: the join should scan the left (outer) child once, and for each tuple in the outer child, it should scan the inner child completely (finally a use for the reset() method!). Once the operator has obtained a tuple from the outer and a tuple from the inner, it glues them together. If there is a non-null join condition, the tuple is only returned if it matches the join condition (so you will be reusing your class for evaluating a selection condition from Section 4.5.2). If the join is a cross product, all pairs of tuples are returned.


Once you have implemented and unit-tested your JoinOperator, you need to figure out how to translate a query to a plan that includes joins.


For this project, we require that you construct a left-deep join tree that follows the order in the body of the query. That is, a query whose body includes relational atoms R(x, y), S(y, z), T (x, y) produces a plan with the structure shown below:


 Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 13 of 17


The tricky part will be processing the body of the query to extract join conditions. The body may contain both selections on a single relation as well as join conditions linking multiple relations. For example,


Q(x) :- R(x,y),S(z,u),x = y,z = 1,y > u


contains a selection condition on R, a selection condition on S, and a join condition on both R and S together. Obviously, it is most efficient to evaluate the selections as early as possible and to evaluate y > u during computation of the join, rather than computing a cross product and having a selection later.


While the focus of this task is not optimisation, you do not want to compute cross products unless you have to as this is grossly inefficient. Therefore, we require that you have some strategy for extracting join conditions from the body part and evaluating them as part of the join. You do not need to be very clever about this, but you may not simply compute the cross products (unless of course the query actually asks for a cross product). You must explain your strategy in comments in your code and in the README that you submit with your code.


You should also be aware of implicit equi-join conditions arising when the same vari- able appears in more relational atoms, e.g., Q(x) :- R(x, y), S(y, z) encodes the equality condition between the second variable of R and the first variable of S. An equivalent query would be Q(x) :- R(x, y), S(yy, z), y = yy.


A suggested way to do this is as follows. Identify all selection conditions (conjuncts) in the body, including both implicit and explicit conditions. For every condition, determine which relations are referenced and add the condition to an appropriate conjunction of conditions. If there are k relations being joined, there could be 2k ? 1 running conjunc- tions: k ? 1 join conjunctions and k selection conjunctions on the individual relations. Once all the identified conditions are processed, the 2k ? 1 conjunctions of conditions can be integrated into the appropriate selection and join operators in the query plan. Of course some of these conjunctions may turn out to be empty, depending on the query.


For example, if we have


Q(b) :- R(1,b),S(b,c),T(d,c,e),d < 5,e != 0


the above approach would give the following query plan (in the figure below, R. i refers to the i-th attribute of R):


Coursework Assignment – continues. . .

You don’t have to follow exactly this strategy for Task 2.


4.5.5 Implement Group-By Aggregation


The final operator to implement is group-by aggregation with the SUM aggregate func- tion. The SUM function is a term that may appear at most once in the head of a query, after all the head variables. The function can take as argument one term (positive inte- ger constant or variable) or a product of terms and return one aggregate value per group computed from the result of the body of a given query. We next explain the syntax and semantics of aggregate queries by means of examples.


The aggregate query


Q( SUM(x) ) :- R(x, y)


returns one integer value representing the sum of x-values in R. An equivalent SQL query would be SELECT SUM(R.x) FROM R. The output file for this query should contain one integer number. Similarly, Q( SUM(1) ) :- R(x, y) computes the number of tuples in R.


The group-by aggregate query


Q(x, SUM(z ? z)) :- R(x, y), S(y, z)


returns distinct x-values, each paired with the sum of squared z-values computed from the body for the given x-value. An equivalent SQL query would be SELECT R.x, SUM(S.z * S.z) FROM R JOIN S ON R.y = S.y GROUP BY R.x. There could be multiple (distinct) group-by variables in the head but at most one SUM aggregate function or none at all. Group-by variables can appear in the SUM function, e.g., the query Q(x, SUM(x)) is valid. You can assume that all variables in the SUM function will be of integer type.


You will implement one group-by aggregation operator, SumOperator, which reads all of the output from its child, extracts relevant values into tuples, organizes tuples into groups, and for each group computes an aggregate value. The SUM operator needs to see all of its input before producing any output (i.e., it is a blocking operator).


Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 15 of 17


5 Task 3: Query Optimisation


The final task is to enrich Minibase with query optimisation. The goal of query opti- misation is to transform query plans such that their operators process as few tuples as possible, thus reducing processing time and avoiding large intermediate results; yet such optimised query plans must still produce correct query results. To achieve this goal, you must keep the same join order but you are allowed to transform query plans, for example, swap operators or introduce new instances of the non-join operators discussed above. You should not implement any new relational algebra operator.


Properly optimised query plans should avoid processing large intermediate results whenever possible, thus allowing such plans to be executed under restricted memory budgets and time limits. We will evaluate the effectiveness of your optimisation rules by running a set of queries on large databases. Failing to properly optimise input queries will most likely lead to wrong results, out-of-memory exceptions, or timeouts.


Task 3 is not about improving the performance of individual relational algebra oper- ators but about coming up with optimisation rules for the class of group-by aggregate queries described in Section 4. Some optimisation rules will be covered in lectures but some will require a bit of thinking on your side.


6 Grading


We strongly suggest that you follow the architecture we described for Task 2. However, we will not penalize you for making different architectural decisions, with a few exceptions:


You must have relational Operators that implement getNextTuple() and reset() methods as outlined above. This is the standard relational algebra evaluation model and you need to learn it. Do not hard-wire combinations of operators, e.g., projec- tion should not assume selection as its child.


You must construct a tree of Operators and then evaluate it by repeatedly calling getNextTuple() on the root operator.


As explained in Section 4.5.4, you must build a left-deep join tree that follows the ordering of the relations in the body. Also, you must have a strategy to identify join conditions and evaluate them as part of the join rather than doing a selection after a cross product.


Disregarding any of the above three requirements will result in severe point deductions. Next we give the grading breakdown.


Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 16 of 17


6.1 Code Style and Comments (10 points)


Follow standard guidelines for writing clean and understandable code; e.g., use standard naming conventions for classes and methods, break up large monolithic blocks of code into smaller logical pieces, avoid code duplication if possible – the “rule of three” says if your code is copied more than twice, then it probably needs to be abstracted out.


You must provide comments for every method you implement. At minimum, the com- ment must include one sentence about the purpose of the method, and @params/@return annotations for every argument/return value respectively. In addition, every class must have a comment describing the class and the logic of any algorithm used in the class.


If you follow the above rules and write reasonably clean code that follows our overall architecture, you are likely to get the full 10 points for code style.


6.2 Test Queries (90 points)


We will run your code with our own queries and database. We will award you 2 points for every query that returns the correct output. The queries we provide with the assignment count for 24 out of the 90 points. You can expect that we will add additional relations to the database; of course the schema of these relations will be mentioned in schema.txt and the data files will be found in the files directory. We will test with basic queries as well as with complex queries that include any/all of the features you are to implement.


If you cannot implement one or more parts, that is fine, although obviously you won’t get full points. In that case, you must clearly tell us in the README what you have not been able to implement.


7 Submission Instructions


Double-check that your code compiles and runs as described in this document. If your code fails to compile or crashes during execution, we will invest a reasonable effort to fix the problem; this can cause (severe) point deduction.


You must keep CQMinimizer and Minibase as the top-level main classes of your code. Submit just one project directory for all three tasks.


Create a README file in the root of the project directory containing the following:


For Task 2, an explanation of your logic for extracting join conditions from the body of a query. If this logic is fully explained in comments in your code, your README does not need to repeat that; however, it must mention exactly where in the code/comments the description is, so the grader can find it easily.


Coursework Assignment – continues. . .


INFR11199 Coursework Assignment Page 17 of 17


For Task 3, an explanation of your optimisation rules, reasons why they are correct, and how they can reduce the size of intermediate results during query evaluation.


Any other information you want the grader to know, such as known bugs.


Create a .zip archive containing a README file and your entire project directory so that we can compile and run your code on the command line. Do not include any .class files nor large .csv files. Disable all debugging statements before submitting your code. Failure to do so may result in point deduction.


Upload the zip archive to Learn: Assessment (the left panel) → Assignment Submis- sion → Coursework: Minibase.


Make sure you start early! Good luck!


End of Coursework Assignment


相关文章

版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp