CSC 173, Fall 2002

Assignment 1: Relational Database

DUE DATE:

Monday, September 23, 11:59 pm

NO EXTENSIONS.

Project Overview

You are to design and build a simple database system in Java. Your system must support databases with a variety of schemes. We will make several simplifying assumptions to keep the project manageable:

All data will fit in main memory.
The columns of a relation will be numbered (and thus ordered), but not named. To make life simple in Java, column numbers will start at zero.
All data can be stored as character strings.

Relation abstract data type

You should begin by creating a Java class (ADT) to represent a relation. Your relation class should have two constructors. The first will create an empty relation with a given number of columns. The second will take a second argument that specifies a file name from which initial data for the relation should be read. You may assume that tuples in the file are separated by newline characters and that fields (attributes) within a tuple are separated by tabs. You should deal gracefully with invalid file contents (e.g. a line with too few or too many attributes), printing an appropriate error message and producing a well-defined result.
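As one way to picture the file-reading constructor, here is a minimal sketch of the parsing step. The class name `TupleLoader`, the method `load`, and the choice to skip malformed lines are all assumptions made for illustration, not requirements of the assignment. The sketch uses generics, which are newer than the javac 1.4.0 compiler named under Tools, so it would need raw collections and casts there.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the assignment): parses tab-separated tuples.
class TupleLoader {
    // Reads one tuple per line; a tuple is kept only if it has exactly `arity` fields.
    // Malformed lines are reported on stderr and skipped (one possible "well-defined result").
    static List<String[]> load(Reader in, int arity) {
        List<String[]> tuples = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        int lineNo = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                // Limit -1 keeps trailing empty fields, so "a\tb\t" counts as three fields.
                String[] fields = line.split("\t", -1);
                if (fields.length != arity) {
                    System.err.println("line " + lineNo + ": expected " + arity
                            + " fields, found " + fields.length + "; line skipped");
                    continue;
                }
                tuples.add(fields);
            }
        } catch (IOException e) {
            // Keep whatever was read before the failure.
            System.err.println("read error after line " + lineNo + ": " + e.getMessage());
        }
        return tuples;
    }
}
```

Taking a `Reader` rather than a file name keeps the parsing testable with a `StringReader`; the constructor itself could pass in a `java.io.FileReader` opened on the given file name.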

Internally, you may represent a relation as an array of Java strings.

In addition to the constructors, your ADT should provide the following methods:

print: print the relation as lines of tab-separated fields on the screen.
insert: insert a given tuple into the relation.
delete: remove from the relation all tuples matching a given predicate.
lookup: (a.k.a. select) return a new relation containing all tuples from the original relation that match a given predicate.
project: return a new relation that is the projection of the given relation onto a given list of fields. If the original relation has six fields, for example, and [1, 4, 2] is the list onto which to project, tuple (a, b, c, d, e, f) in the original relation would become tuple (b, e, c) in the new.
join: return the join of the original relation and a relation passed as an argument, equating the last field of the original relation with the first field of the argument, in such a way that the fields of the original relation precede those of the argument in the result. Note that this very simplistic notion of join (without attribute names) will generally need to be used in conjunction with projection operations that reorder the fields of the arguments and result.
union: return the union of the original relation and a relation passed as an argument.
intersect: return the intersection of the original relation and a relation passed as an argument.
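The project and join descriptions above can be sketched as follows. This is an illustration under stated assumptions, not the required design: tuples are modeled as `List<String>`, the class name `RelOps` is hypothetical, and dropping duplicate tuples after projection is an assumed set-style behavior that the assignment does not spell out. The joined tuple omits the repeated join field, giving n+m-1 fields. Generics and the for-each loop postdate javac 1.4.0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical static helpers; a relation is modeled as a list of tuples.
class RelOps {
    // Projection: pick the listed columns, in the listed order, from each tuple.
    static List<List<String>> project(List<List<String>> rel, int[] cols) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> tuple : rel) {
            List<String> projected = new ArrayList<>();
            for (int c : cols) {
                projected.add(tuple.get(c));
            }
            if (!out.contains(projected)) {   // assumption: duplicates are dropped
                out.add(projected);
            }
        }
        return out;
    }

    // Join: match the last field of each left tuple against the first field
    // of each right tuple; left fields come first, the shared field appears once.
    static List<List<String>> join(List<List<String>> left, List<List<String>> right) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> l : left) {
            String key = l.get(l.size() - 1);
            for (List<String> r : right) {
                if (key.equals(r.get(0))) {
                    List<String> joined = new ArrayList<>(l);
                    joined.addAll(r.subList(1, r.size()));
                    out.add(joined);
                }
            }
        }
        return out;
    }
}
```

With the six-field example above, projecting (a, b, c, d, e, f) onto [1, 4, 2] yields (b, e, c). The nested loop in `join` is the simple full-scan approach; an index on the join field would be one of the extra-credit refinements.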

Several of these methods will be easier to implement if your relation class supports the standard Enumeration or Iterator interface.

For the purposes of the delete and lookup methods, a predicate is a triple consisting of

attribute identifier (field number)
comparison test (<=, <, >=, >, ==, !=) — to be performed lexicographically on strings
attribute identifier or string constant

Note that lexicographic comparisons of the strings representing integers will do the "right" thing numerically if (and only if) you ensure that all tuples use the same number of digits to represent a given attribute. For example, "00123" < "01200", but "123" > "1200".
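A predicate of this form might be represented roughly as below. The class name `FieldPredicate` and its two factory methods are hypothetical; the only library behavior relied on is `String.compareTo`, which compares strings lexicographically. The `switch` on strings and the generic `List<String>` are newer than javac 1.4.0, where an if/else chain and casts would take their place.

```java
import java.util.List;

// Hypothetical predicate: <field> <op> <field or string constant>.
class FieldPredicate {
    private final int leftField;
    private final String op;
    private final int rightField;    // used only when constant == null
    private final String constant;   // non-null when the right operand is a constant

    private FieldPredicate(int leftField, String op, int rightField, String constant) {
        this.leftField = leftField;
        this.op = op;
        this.rightField = rightField;
        this.constant = constant;
    }

    static FieldPredicate fieldVsField(int left, String op, int right) {
        return new FieldPredicate(left, op, right, null);
    }

    static FieldPredicate fieldVsConstant(int left, String op, String constant) {
        return new FieldPredicate(left, op, -1, constant);
    }

    // True if the tuple satisfies the comparison, done lexicographically on strings.
    boolean matches(List<String> tuple) {
        String lhs = tuple.get(leftField);
        String rhs = (constant != null) ? constant : tuple.get(rightField);
        int cmp = lhs.compareTo(rhs);
        switch (op) {
            case "<=": return cmp <= 0;
            case "<":  return cmp < 0;
            case ">=": return cmp >= 0;
            case ">":  return cmp > 0;
            case "==": return cmp == 0;
            case "!=": return cmp != 0;
            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}
```

Under this sketch, a tuple ("00123", "01200") matches field 0 < field 1, while ("123", "1200") matches field 0 > field 1, which is exactly the padding pitfall described above.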

Query language

To allow a user to manipulate the database, you will need a command language and a way to refer to relations. You should maintain, internally, a mapping from single-character relation names (lower case letters will suffice) to the relation (if any) named by that character. This mapping can be a simple array. Your system should then read a sequence of commands from standard input, one command per line. Valid commands are as follows:

c: create a new relation
```
c a 4 filename
```
This command creates a new relation a with 4 fields, and reads the data for the relation from the (optional) file specified.
a: add a tuple to a relation
```
a b "foo" "3" "hi, mom"
```
This command adds a tuple into the existing relation b. The number of additional arguments after the b must equal the number of attributes in relation b.
o: output a relation
```
o a
```
This command will print relation a created above.
p: project a relation on a list of fields
```
p a 2 3 4 b
```
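This command will project relation a onto 2 fields (field numbers 3 and 4 of relation a), producing relation b. The first number after the relation name is the count of fields to keep; the field numbers themselves follow. Because field numbers start at zero, this example assumes that a has at least five fields.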
j: join two relations producing a third
```
j a b c
```
This command joins relations a and b on the final field of a and the initial field of b, producing relation c. If a has n fields and b has m, c will have n+m-1.
s: select a subset of the tuples in a relation, producing a new relation
```
s a 1 == 2 b
```
This command selects those tuples from relation a where field 1 == field 2, and produces a new relation b containing those tuples.
```
s a 1 != "foo" b
```
This command selects those tuples from relation a where field 1 != "foo", and produces a new relation b containing those tuples.
The operators allowed in the select command can be any of the comparison operators supported by your implementation of predicates.
d: delete tuples that match a predicate from a relation
```
d a 1 == 3
```
This command deletes those tuples in relation a where field 1 == field 3.
As with the select command, the second operand of the predicate may be a string, and all of the comparison operations must be supported.
u: create a new relation containing the union of the tuples of two existing relations.
```
u a b c
```
This command creates relation c, giving it all tuples found in a or b. Relations a and b must have the same number of fields.
i: create a new relation containing the intersection of the tuples of two existing relations.
```
i a b c
```
This command creates relation c, giving it all tuples found in both a and b. Relations a and b must have the same number of fields.
x: delete an entire relation.
```
x a
```
This command deletes relation a, making it unavailable for future operations. You should be certain that the Java garbage collector will remove the data (in other words, remove all references to this Relation).
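The mapping from single-character names to relations, including the x command's requirement to drop all references, could look something like the sketch below. `Catalog` and its methods are hypothetical names, and the type parameter `R` stands in for whatever relation class you write. Generics are newer than javac 1.4.0; there the array could simply hold your relation type directly.

```java
// Hypothetical catalog mapping the names 'a'..'z' to relations of type R.
class Catalog<R> {
    // One slot per lower-case letter; null means "no relation with that name".
    private final Object[] slots = new Object[26];

    private static int index(char name) {
        if (name < 'a' || name > 'z') {
            throw new IllegalArgumentException("invalid relation name: " + name);
        }
        return name - 'a';
    }

    void put(char name, R relation) {
        slots[index(name)] = relation;
    }

    @SuppressWarnings("unchecked")
    R get(char name) {
        return (R) slots[index(name)];   // only put() writes slots, so the cast is safe
    }

    // Clearing the slot removes the catalog's reference, so the relation can be
    // garbage collected once nothing else refers to it.
    void remove(char name) {
        slots[index(name)] = null;
    }
}
```

A command such as `x a` would then reduce to `catalog.remove('a')`, and a later `o a` would find `null` and report that no such relation exists.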

Your database need not necessarily be super efficient (but see the extra credit suggestions below). You may, if you wish, store the tuples of a relation in a list, and scan the entire list when necessary to implement relational operations.

In the interests of compact maintainable code, you are encouraged to make maximum use of the Java standard libraries, including the collections framework.
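As an example of leaning on the collections framework, union and intersection reduce to a few library calls if tuples are stored as `List<String>` in a `Set`. This is a sketch under that assumption; the class name `SetOps` is hypothetical. It relies on `List` having value-based `equals` and `hashCode` (a `String[]` would not behave this way in a hash-based set). `LinkedHashSet` is available in Java 1.4, but the generic type parameters are not.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helpers: a relation is modeled as a set of tuples.
class SetOps {
    // All tuples found in a or b; the inputs are left unchanged.
    static Set<List<String>> union(Set<List<String>> a, Set<List<String>> b) {
        Set<List<String>> out = new LinkedHashSet<>(a);
        out.addAll(b);
        return out;
    }

    // All tuples found in both a and b; the inputs are left unchanged.
    static Set<List<String>> intersect(Set<List<String>> a, Set<List<String>> b) {
        Set<List<String>> out = new LinkedHashSet<>(a);
        out.retainAll(b);
        return out;
    }
}
```

Neither method checks that the two relations have the same number of fields; the command handler would need to verify that before calling them.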

You should of course detect and handle invalid commands. You should print a helpful message, ignore the command, and continue.

If you are not familiar with the Java StringTokenizer class, take a look. It might be useful for parsing your command line or other strings.
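One wrinkle when tokenizing commands is that a quoted constant such as "hi, mom" contains a space. A possible approach, sketched below, asks `StringTokenizer` to return the delimiters as tokens and tracks whether the scan is inside quotes. The class `CommandLexer` and the decision to keep the surrounding quotes on string constants (so they can be told apart from field numbers) are assumptions of this sketch. `StringTokenizer` itself is available in Java 1.4, but the generic list and `StringBuilder` are not; a raw `List` and `StringBuffer` would serve there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical lexer for one command line.
class CommandLexer {
    // Splits on spaces and tabs; double-quoted text becomes one token, quotes included.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true: each space, tab, and quote comes back as its own token.
        StringTokenizer st = new StringTokenizer(line, " \t\"", true);
        StringBuilder quoted = null;   // non-null while inside a quoted constant
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals("\"")) {
                if (quoted == null) {
                    quoted = new StringBuilder();          // opening quote
                } else {
                    tokens.add("\"" + quoted + "\"");      // closing quote
                    quoted = null;
                }
            } else if (quoted != null) {
                quoted.append(tok);    // whitespace inside quotes is preserved
            } else if (!tok.equals(" ") && !tok.equals("\t")) {
                tokens.add(tok);
            }
        }
        if (quoted != null) {
            throw new IllegalArgumentException("unterminated string constant");
        }
        return tokens;
    }
}
```

The command handler can catch the `IllegalArgumentException`, print a message, and move on to the next line, in keeping with the requirement to ignore invalid commands and continue.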

Tools

You are encouraged to work on the CSUG systems (especially exploiting remote access) to become familiar with the environment.

That said, you may work on any platform you like, but (1) your final code must compile and run correctly using the javac compiler and java virtual machine on the csug Linux machines, and (2) you must hand in your work using the appropriate turnin script on those machines (watch the newsgroup for details).

javac is Sun's version 1.4.0 compiler for Linux. The jikes compiler from IBM is also available for the adventurous, but again, your code must compile and run with javac. (Note to the adventurous: to use jikes you must set your JIKESPATH environment variable to ".:/usr/staff/lib/java/latest/jre/lib/rt.jar".) Extensive documentation for javac is available on-line in HTML. On the csug machines, the root file is /usr/staff/lib/java/latest/docs/index.html.

What/how to turn in

The TAs will shortly be creating test data that you can use to exercise your code. Watch the newsgroup.

They will also be posting instructions on

how to document the behavior of your code on this test data
what sorts of additional test data you need to create yourself
what to include when you submit your project, specifically the README (reread the instructions for grading)
how to find and run the turnin script

Watch the newsgroup for details.

Extra Credit Suggestions

For extra credit, you might consider the following options:

Allow fields (attributes) to have different types, e.g. to support true numeric data.
Extend your query language to allow relation names of arbitrary length.
Support set difference operations.
Allow fields (attributes) to have character string names.
Extend the join operation to permit joins on specified pairs of fields.
Allow the user to request that indices be built on particular fields or sets of fields, and make use of these indices when implementing relational operations.
Allow complex queries consisting of combinations of simple queries, specified all together (hard).
Perform query optimization (hard).

If you do any extra credit, remember to document it. If you do not, you will likely not get credit for the work. You should also provide test data, and a description of how to use the test data to show that your extensions do what you claim.