Lecture notes for CS 173, Tues. Sept. 10, 2002

Trivia:
  Project 1 is out.  Due Sept 23
  HW1 will be handed out today.  Quiz next Monday.
  Lab sessions this week.  W 5-6, Fr 2-3.
  Some NFS problems, use cycle1, cycle2, cycle3 to compile/run programs
  If you are a freshman or new student and did not receive mail from
    me, see me after class
  
==============

Operations
  insert, delete, lookup
  
----------------

Implementation Tradeoffs

Insert requires that we not insert a tuple that is already present,
therefore it requires an efficient test of membership.

Delete and Lookup require only those tuples that match a predicate.
This is fast if the specified (non-wildcard) fields constitute a key for
which we have an index.  It's still reasonably fast if some subset of
the specified fields constitute a key.

    unsorted linked list
        insert (without lookup) O(N)        to avoid duplicates
        lookup, delete O(N)

    balanced search tree
        insert O(log N)
        lookup, delete O(k log N), when k tuples have a matching key,
            or O(log N + k) if the tree is threaded

    characteristic vector (limited domains)
        insert O(1), assuming characteristic field is a key or tuples
            are randomly distributed
        lookup, delete O(k), when k tuples have a matching key

    hash table
        insert O(1), assuming a good hash function and well
            distributed keys
        lookup, delete O(k), when k tuples have a matching key

Note that absent an index for at least some of the fields specified in a
query, lookup (and thus delete) will require that we search the entire
relation: O(N) time.


  ------------------------------------------------------------------------

Hash Table Example for Workstations relation

    Name       Room    Mem    OS      Model   Speed
    ================================================
    crabgrass   727    128MB  Solaris   U10   333MHz
    fusion      715    176MB  MacOS      G3   500MHz
    chemical    735     96MB  MacOS      G3   300MHz
    saturn      727     64MB  Win2000    P2   265MHz
    manila      711      1GB  RedHat   D340   686MHz


Assume a simple (too simple!) hash function

   * h(machine name) = first letter of name

The resulting table:

    a  --> NIL
    b  --> NIL
    c  --> [crabgrass, 727, 128MB, Solaris, U10, 333MHz]
           --> [chemical, 735, 96MB, MacOS, G3, 300MHz] --> NIL
    d  --> NIL
    e  --> NIL
    f  --> [fusion, 715, 176MB, MacOS, G3, 500MHz] --> NIL
    :
    m  --> [manila, 711, 1GB RedHat D340 686] --> NIL
    :
    s  --> [saturn, 727, 64MB, Win2000, P2, 265MHz] --> NIL
    :
    z  --> NIL

   * To insert a tuple, make sure it is not already on the list in the
     selected bucket, then add it to the list.

   * To delete a tuple, examine each entry on the list for the selected
     bucket, evaluate the predicate for each entry, and delete those that
     match.

   * To lookup tuples, examine each entry on the list for the selected
     bucket, evaluate the predicate for each entry, and return as a result
     those tuples that match.

Given this structure, lookups in which only machine name is specified
will take O(1) time.  If the relation were indexed by room, lookups in
which room and OS are specified would take O(number of tuples in the
room).  In either case, lookups in which only OS is specified would
take O(N).

----------------

Secondary Index Structures

Remember that we can create as many indices as we like.  They work
especially well when indexing by a key, but are possible even for
non-keys.

If a primary index determines the layout of the data in memory,
secondary indices contain pointers into the primary structure, rather
than the data itself.

----------------

Operations that Span Relations

All of the operations we've considered so far (insert, delete, lookup)
operate on a single relation.

Operations may also span relations in order to query relationships among
tuples in different relations.

   * lookup users whose server is down
   * lookup machines in a room with a printer

----------------

Multi-Relation Operations: Example

Find the name of all users who have been idle less than 1 hour, and whose
server is down.

for all tuples t in User relation do
   for all tuples s in Server relation do
      if (t.Idle < 1:00) and (s.Name = t.Server) then
         if s.Status = down then
            print t.Name

The running time of the algorithm is O(|Users| * |Servers|).

In general, any query that requires that we look at every item in one
relation for every item in another relation is inefficient.

To improve the algorithm, we can

   * reorder the steps taken
   * use an index

----------------

Reordering Operations: Example

We can improve the previous implementation by reordering operations so that
we select items from one relation before iterating over the other relation.

Find the name of all users who have been idle less than 1 hour, and whose
server is down.

for all tuples t in User relation do
   if (t.Idle < 1:00) then
      for all tuples s in Server relation do
         if (s.Name = t.Server) then
            if (s.Status = down) then
               print t.Name

Assume that there are k users idle less than 1 hour.

The running time of the algorithm is O(|Users| + k|Servers|).

This implementation is considerably more efficient than the previous one in
cases where k << |Users|.

----------------

Reordering Again: Example

Suppose we reorder the two loops, and select from the Server relation before
iterating over the User relation.

Find the name of all users who have been idle less than 1 hour, and whose
server is down.

for all tuples s in Server relation do
   if (s.Status = down) then
      for all tuples t in User relation do
         if (t.Idle < 1:00) then
             if (s.Name = t.Server) then
                print t.Name

Assume that j servers are down.

The running time of the algorithm is O(|Servers| + j|Users|).

Which is better?

   * Are there more servers or users?
   * Are there more idle users or down servers?

----------------

Indexes for Complex Queries

Rather than search among all tuples in a relation we can use an index to
quickly find the tuples that match a given predicate.

    To implement a query about idle time and server status, we would
    have to implement a secondary index for each of those domains.  Such
    queries are not likely to be common, and such an index may not be
    worthwhile.

    An index based on the room number may make sense for both the
    Workstation and Printer relations.

    An index based on User Name (the character string) may be valuable
    for the User relation.

In general, we would like to answer queries in time linear in the size
of the answer.  We can do this only if we have a (good) hash table for
every attribute in the query, and all comparisons in the query are based
on equality or inequality, rather than < or >, which hash tables don't
support.  (< and > queries can benefit from tree-structured indices, but
these are O(log N) instead of O(1) for == and != queries.)

Note the downside of extra indices:
    - space
    - time to update on insert and delete operations

------------------------------------------------------------------------

Database Design

A database designer must define
   * the database scheme
   * the keys for each relation
   * primary index structures for each relation
   * (optional) secondary index structures for each relation

These decisions depend on
   * the information to be supported
   * the relationships between pieces of information
   * the expected size of the relations
   * the types of queries expected

Two important principles guide scheme design

  1. Two pieces of information that are independent should reside in two
     different relations.

  2. Two pieces of information that are related should reside in the same
     relation.

    Violations of principle 1 result in a database full of redundant
    information.

       * Course prerequisite information is independent of the time and
         day the course is offered.

       * A server's status is independent of the users it serves

       * Company information is independent of the mutual fund that owns
         shares in the company.

    Violations of principle 2 result in a loss of information in the
    database.

       * A Course-Day-Hour relation cannot be split into Course-Day and
         Course-Hour relations without losing track of the meeting time
         of a course on a particular day.

Key (index) selection is also important, as it dictates what will be
fast.

   * Unnecessary indices slow down the database, because they have to
     be updated on every insert and delete.

   * Remember that you can't identify keys simply by looking at the
     values of a relation; only the database designer knows for sure

   * The best choice of indices depends on knowledge of the data in the
     relations, and the expected pattern of queries

   * If the common queries don't use a key (eg. User Name in the Users
     relation), you can still use a hash table based on this domain, but
     the distribution of entries in hash buckets may be less even than
     if you used a key (such as user login).

------------------------------------------------------------------------

Relational Algebra

An algebra is a formal structure consisting of sets and operations on those
sets.

Relational algebra is a formal system for manipulating relations.

   * Operands of this algebra are relations.

   * Operations of this algebra include the usual set operations (since
     relations are sets of tuples) -- union, intersection, and
     difference -- and special operations defined for relations:

        - selection
        - projection
        - join


  ------------------------------------------------------------------------

Selection: chooses tuples from a relation whose attributes meet the
selection criteria, which is normally expressed as a predicate.

    R2 = select (R1, P)

That is, from R1 we create a new relation R2 containing those tuples from R1
that satisfy (make true) the predicate P.  Clearly R2 is a subset of R1.

A predicate is a boolean expression whose operators are the logical
connectives (and, or, not) and arithmetic comparisons (LT, LE, GT, GE, EQ,
NE), and whose operands are either domain names or domain constants.

select (Workstations, Room = 727) =

    Name       Room    Mem    OS      Model   Speed
    ================================================
    crabgrass   727    128MB  Solaris   U10   333MHz
    saturn      727     64MB  Win2000    P2   265MHz

select (Users, Status=Staff and Idle < :10) =

    Id         Name              Status   File Server  Console     Idle
    ===================================================================
    kparkins   K. Parkins        Staff     butler      saturn         0