Lecture notes for CS 173, Tues. Sept. 12, 2002
----------------
Projection: chooses a subset of the columns in a relation, and discards
the rest.

    R2 = project (R1, D1, D2, ..., Dn)

That is, from the tuples in R1 we create a new relation R2 containing
attributes from the domains D1, D2, ..., Dn only.

project (Cycle Servers, Name, Status) =

    Name    Status
    ==============
    cycle1    up 
    heart     up
    brain     up
    aorta     down
    ganga     down
    tapti     up

project (select (Users, Console = heart), Name, Status) =

    Name              Status
    =========================
    X. Gu             Grad      
    G. Altekar        UG        
    T. Pawlicki       Faculty   


----------------

Join: combines attributes of two relations into one.

    R3 = join (R1, D1, R2, D2)

Given a domain from each relation, join considers all possible pairs of
tuples from the two relations, and if their values for the chosen domains
are equal, it adds a tuple to the result containing all the attributes of
both tuples (discarding the duplicate domain D2).

Natural join: If the two relations being joined have exactly one attribute
(domain) name in common, then we assume that the single attribute in common
is the one being compared to see if a new tuple will be inserted in the
result.

Suppose we augment the domain names in our lab database so that columns
that really indicate different things have different names.
The Workstations scheme might be

    WorkstationName  Room  Mem  OS  WorkstationModel  Speed

The File Servers scheme might be

    FileServerName  Status  FreeDisk  OS

A natural join on the Workstations and File Servers relations would then
produce a relation of all workstation/file server attribute pairs that are
running the same OS:

    Work-                                         File-
    station-                                      Server-          Free
    Name      Room  Mem    OS      Model  Speed   Name     Status  Disk
    ===================================================================
    crabgrass  727  128MB  Solaris   U10  333MHz steward   up    8.5GB
    crabgrass  727  128MB  Solaris   U10  333MHz slat      up    452MB
    crabgrass  727  128MB  Solaris   U10  333MHz butler    down 12.6GB
    fusion     715  176MB  MacOS      G3  500MHz nuclear   up    5.8GB 
    chemical   735   96MB  MacOS      G3  300MHz nuclear   up    5.8GB

Note that entries for saturn do not appear, because there is no Win2000
file server in the database.  Note also that we have introduced 
redundant information.  Joins generally produce wider relations than
those chosen (presumably with care) by the database designer.  That's
ok, however, because we'll usually be performing joins only as an
intermediate step in answering a query; there will almost always be a
project that tosses columns we don't need.

For example: find all workstations for which there exists a file server,
running the same OS, that is currently up and has more than 5GB of disk
space free; return (Workstation, FileServer, Room) triples:

   R1 = join as above
   R2 = select (R1, Status = up and FreeDisk > 5GB)

    Work-                                         File-
    station-                                      Server-          Free
    Name      Room  Mem    OS      Model  Speed   Name     Status  Disk
    ===================================================================
    crabgrass  727  128MB  Solaris   U10  333MHz steward   up    8.5GB
    fusion     715  176MB  MacOS      G3  500MHz nuclear   up    5.8GB 
    chemical   735   96MB  MacOS      G3  300MHz nuclear   up    5.8GB


   R3 = project (R2, WorkstationName, FileServerName, Room)

    WorkstationName  FileServerName  Room
    =====================================
    crabgrass        steward         727
    fusion           nuclear         715
    chemical         nuclear         735

Note that this is not the only way to get the result (and in particular
it's not the best way!); more on this later.

-----------------------

Implementing Set Operations (largely review of 172 material)

Assume our relations are implemented as lists.
To implement R1 U R2 (while eliminating duplicates) we can

   * sort R1 in O(N lg N)
   * sort R2 in O(M lg M)
   * merge R1 and R2 in O(N+M)

This takes care of the data.  What about the indices?  Hash tables using
the same hash function can be merged in linear time.  Trees can be
merged in N lg N time.  So O(N lg N + M lg M) definitely covers the
whole thing.  If our indices are hash tables, we can actually do the
whole thing in O(N + M):

   * copy R1 to destination in O(N)
   * for each tuple in R2 (which is O(M))
        add to destination iff not already there

Intersection and set difference have corresponding implementations.

----------------

Implementing Projection

To implement projection we must

   * process every tuple in the relation
   * remove any duplicates that result

To avoid duplicates we can

   * sort the result and remove consecutive tuples that are equal
        - requires time O(N log N) where N is the size of the original
          relation

   * implement the result as a set
        - set insertion guarantees no duplicates
        - by using a hash table, insertion is O(1), so projection is O(N)

        Note that we have to have a key for the hash table.  If nothing
        else, the entire tuple can always be used as a key.

----------------

Implementing Selection

In the absence of an index we

   * apply the predicate to every tuple in the relation
   * insert matches in the resulting relation (duplicates can't occur)
   * take O(N) time

Given an index, and an (in)equality-based predicate that uses the index
key, we

   * Lookup tuples using the key
   * evaluate only those tuples with the predicate
   * take O(K) time, where K tuples match the key

    Note that hash tables won't support < or > predicates, and no index
    helps if it doesn't cover the fields used in the predicates.

Why not just build an index any time we have an operation that might
benefit from it?  (After all, asymptotically we can hide the cost of
index creation in any one such operation.)  Because then we have to
maintain the index in every future insert and delete.  Only worth
it if we need the index for a reasonable fraction of all operations.

----------------

Implementing Join with Nested Loops

A nested loop join on relations R1 (with N attributes) and R2 (with M
attributes), considers all |R1| x |R2| pairs of tuples.

    R3 := join (R1, Ai, R2, Bj):

    R3 := <empty>
    for each tuple t in R1 do
      for each tuple s in R2 do
        if t.Ai = s.Bj then
          insert (R3, t.A1, t.A2, ..., t.AN,
            s.B1, ..., s.B(j-1), s.B(j+1), ..., s.BM)

This implementation takes time O(|R1|*|R2|).

----------------

Index Join

An index join exploits the existence of an index for one of the domains used
in the join to find matching tuples more quickly.

    R3 := join (R1, Ai, R2, Bj):

    for each tuple t in R1 do
       for each tuple s in R2 at index(t.Ai) do
          insert (R3, t.A1, t.A2, ..., t.AN,
            s.B1, ..., s.B(j-1), s.B(j+1), ..., s.BM)

We could choose to use an index for R2, and reverse the order of the loops.
Presumably we'd choose to do the outer loop on the smaller relation, and
make use of the index on the larger relation.

As usual, if we want any secondary indices for the joined relation we
should update them along the way.

----------------

Sort Join

If we don't have an index for a domain in the join, we can still improve on
the nested-loop join using sort join.

    R3 := join (R1, Ai, R2, Bj):

   * Sort the tuples of each relation on the join attribute.

   * Peruse the two lists.  If the initial tuples of one relation have a
     join attribute less than that of the initial tuples of the other
     relation, toss them.  Else they're equal: pair all (consecutive)
     tuples from the two relations with the same value in the join
     attribute.

    NB: the book suggests merging the tuples of the relations into a
    single list and sorting that.  This works, and may be faster, but
    only by a small constant factor, and it's harder to understand.

----------------

Comparison of Join Implementations

Assumptions

   * Join R1 and R2 on some common attribute to produce R3
   * R1 has i tuples, R2 has j tuples
   * |R3| = m, 0 <= m <= i * j
   * Every implementation takes at least time O(m)

Comparison

   * Nested-loop join takes time O(i * j)
   * Index join (using R2 index) takes time O(i+m)
        - lookup is O(1) for each tuple in R1
        - at most O(m) tuples match
   * Sort join takes time O(i log i + j log j + m)
        - O(i log i + j log j) to sort the relations
        - O(i + j + m) = O(m) to produce the output (0 <= m <= i*j)
     [NB: Time is O((i+j) log (i+j) + m) if you do it the way the
          book suggests.]