Lecture notes for CS 173, Thurs. Sept. 5, 2002

Trivia:
  Ted is looking for TAs for 108 and 171.  Contact pawlicki@cs
  
  Lab sessions next week
    Monday 2-3  Josh
    Tuesday 11-12 Gautam
    Wednesday 5-6 Xipeng
    Friday 2-3 Gautam 

  The Hylan lab is closed tomorrow for two hours in the morning 
    (carpet installation).

  You still have remote access!

  Get your account from Marty soon!  
     Marty's office CSB 735
     Warning: Marty is not on campus tomorrow!
  
  First assignment will be out by tomorrow.  
    It will be due Sept 23, 11:59pm
  
==============

Review from CSC 172

Data structures v. Abstract Data Types
    the former are ways of organizing information in a computer program
    the latter present a "black box" interface to information and the
        operations on it

    ADT examples:
        stack, queue, list, set, function, mapping, tree (embodying
        information about parents and children), graph
    DS examples:
        array, singly-linked list, doubly-linked list, hash table,
        AVL tree, 2-3 tree, splay tree, skip list, binary tree embedded
        in array, ...

One of the key understandings of modern CS is that it is very important
to maintain a clear distinction between ADTs and DSes.  That's what lets
us build reusable code (e.g. the Java and C++ standard libraries).

You presumably spent a lot of time in 172 on sets and their relatives.

    set: collection of items, no two the same
    multiset: collection of items, possibly with repeats
    function: mapping from domain elements to (unique) range elements
    binary relation: collection of (unique) (domain, range) pairs, not
        necessarily one-one

    Possible implementations: arrays, linked lists, balanced trees, 
        and hash tables.

relation: generalization of a binary relation
	  collection of unique n-tuples
	  useful in databases, since mide 1970s

------------------------------------------------------------------------

Example: A Simple Database

Consider a database that might be maintained by the Computer Science
Department of the people and machines in the laboratories. The database
contains information about the workstations, printers, servers, and
registered users. Here are tables that describe the various parts of the
database (info adapted from the CS research network):

Workstations

    Name       Room    Mem    OS      Model   Speed
    ================================================
    crabgrass   727    128MB  Solaris   U10   333MHz
    fusion      715    176MB  MacOS      G3   500MHz
    chemical    735     96MB  MacOS      G3   300MHz
    saturn      727     64MB  Win2000    P2   265MHz
    manila      711      1GB  RedHat   D340   686MHz

Printers

    Name    Room    Status    Model
    ===============================
    chaucer  737      up      LJ4
    shake    602      up      LJ4
    keats    706      up      LJ2
    poe      707     down     LJ2
    picasso  737      up      P560

Cycle Servers

    Name    Status    Free Mem   #Users    OS
    =============================================
    cycle1    up        880MB        6    RedHat
    heart     up         95MB       16    Solaris
    brain     up        1.2GB        2    Solaris
    aorta     down      3.2GB        0    Solaris
    ganga     down      2.0GB        0    True64
    tapti     up        850MB        2    True64

File Servers

    Name    Status   Free Disk   OS
    ==================================
    nuclear   up        5.8GB    MacOS
    steward   up        8.5GB    Solaris
    slate     up        452MB    Solaris
    butler    down     12.6GB    Solaris

Users

    Id         Name              Status   File Server  Console     Idle
    ===================================================================
    scott      M. Scott          Faculty   nuclear     fusion      1:13
    murphy     A. Murphy         Faculty   steward     manila         0
    xgu        X. Gu             Grad      steward     heart   12:06:14
    galtekar   G. Altekar        UG        butler      heart        :07
    pawlicki   T. Pawlicki       Faculty   butler      heart          0
    marty      M. Guenther       Staff     nuclear     chemical     :15
    kparkins   K. Parkins        Staff     butler      saturn         0


----------------

Notes

  information stored in tables
  *tuple*      : row of a table
  *attributes* : entries in a tuple (a field)
  *attribute*  : label of a column
  *relation*   : collection of tuples (a table)
	         same attributes, number and type
		 order of rows doesn't matter
		 order of columns doesn't matter
		 why called a relation?  
		   individual data field not interesting
		   relationship among fields is interesting
		   e.g. marty has been idle 15 minutes.
		 can be huge.  Our example is small!!!
		   imagine all workstations connected to Internet

  *key* : column (or set of columns) that uniquely identifies the tuple
	  e.g., user's login, printer name
	  worst case, all columns form key

  multiple tables (*Database*)
    can have similar information in more than one table to correlate
      e.g., murphy's console runs RedHat
    matching columns may not have the same name!!!
      e.g. file server name in "Users" (File Server) and 
           "File Servers" (Name) 
    columns with same attribute label need not have the same information
      e.g., Users:Status and FileServers:Status
    Forming separate tables
      try to avoid redundancy: storing same relationship in more than
          one place
        can lead to inconsistency
      avoid losing information: relationship between marty and idle

      
----------------

Basic operations:

   * insert
   * delete
   * lookup (simple query)

Lookup operations can
  - use a single entry: how long has murphy been idle?
  - use a whole table: how many servers are up?
  - cross tables:

    all users who have been idle less than an hour, 
      and whose file server is currently down.  

    Possibly implement as:

       * Extract from the server table a list of servers that are down.

       * Extract from user table a list of users idle less than an hour.

       * For each such user, extract the server field from the entry,
         and see if that server name appears on the list of servers that
         are currently down.

       * If so, output the name of the user.

    There are other ways to get the same information: we could, for
    example, iterate over each server that is down, looking for users of
    that server.

----------------

How can Relations and Relational Algebra Help?

  - Conceptual framework for organizing information -- helps you decide
    what information to keep together, based on expected relationships
    and queries.

  - Abstract language for expressing queries and other operations on
    databases against which implementations can be compared (e.g. for
    completeness).

  - Algebra for expression simplification, to optimize queries for
    performance.

----------------

Getting formal

Remember our example ADTs:
    function: mapping from domain elements to (unique) range elements
    binary relation: collection of (unique) (domain, range) pairs, not
        necessarily one-one

Relations are the N-ary extension of binary relations.

Informally, we can think of a relation as a table, where the columns
(attributes) correspond to the domains of the relation, and the rows
correspond to the tuples.

Formally, a relation on the set of domains D1, D2, ... Dn is a set of
n-tuples, each of which is an element of the Cartesian product D1 x D2 x D3
x ... x Dn.

   * Since a relation is a set, order of tuples (rows) doesn't matter

   * Since a relation is NOT a multiset, two rows cannot be the same.

   * Since each of the domains (attributes) has a distinct name, 
     column order doesn't matter

The *scheme* of a relation is the list of domain/attribute names D1...Dn.

   Workstations: Name-Room-Mem-OS-Model-Speed
   Printers: Name-Room-Status-Model
   Cycle Servers: Name-Status-Free Mem-#Users-OS
   File Servers: Name-Status-Free Disk-OS
   Users: Id-Name-Status-File Server-Console-Idle

*database* :  a collection of relations. 
	      *scheme* for a database is the set of schemes for the
                 relations in the database.

We require domain names within a relation to be unique, but do not
require this of all domain names in the database as a whole.

----------------

Keys

As noted informally above, a key K for relation R on domains D1...Dn is
a subset of the domains D1...Dn such that

  1. the value of K uniquely identifies each tuple in R

  2. no proper subset of K exhibits property 1

Since every tuple in R must be unique, it follows that the set
{D1,D2,...,Dn} obeys property 1. To find a key then, we need only look for
subsets that obey property 2.

IMPORTANT NOTE: we cannot in general tell whether an attribute or set of
attributes is a key simply by inspecting the current contents of the
relation, because some future inserts might introduce tuples that have
the same "key" but different other attributes.  "Keyness" depends on the
semantics (meaning) of the relation as determined by the human user;
it's an assertion about *all* data that might *ever* be put in the
relation.  (Consider the printers relation above, with and without
picasso.  Is "room number" a key?)

In our simple database

    The various Name domains can serve as keys for the Workstations,
    Printers, Cycle Servers, and File Servers relations.

    Id can serve as a key for the Users relation.

    Name does NOT work as a key for the Users relation: two users could
    have the same name.

If we assume that two people with the same name are never assigned to the
same file server

    {Name, File Server} is a key for the Users relation.
    (That may not be a safe assumption!)

If we assume no room has more than one printer of the same model, then

    {Room, Model} is a key for the Printer relation
    (Again, this may not be a safe assumption!)

------------------------------------------------------------------------

Implementing Relations

Tuples are basically records, one field per domain.
As an ADT, they're objects with put_field and get_field methods.
The natural data structure uses contiguous allocation, like a C struct.

A relation is a set of tuples.  It can be realized as
    an array
    a linked list
    a tree
    a hash table

How to sort (or hash) the relation?
  can use a *primary key*
  don't sort
  choose a *primary index*
    need to manage duplicates

  primary key determines how data is organized, so we want to be
    careful about the choice for efficiency reasons 
      (think about huge databases, possibly stored on tape)

What if we have two equally used queries?  
  How do we choose the primary key?  
  another option: use a *secondary index* : another efficient structure
    that contains pointers/references to the data, not the data itself.
  secondary key(s) makes inserts/deletes expensive

  
Ultimately the decision on how best to implement a relation depends on
    the size of the relation
    the keys
    the sorts of lookup operations (queries) that must be supported
    the expected frequency of each sort of query

  ------------------------------------------------------------------------

More on operations

   * insert(t,R) - Insert tuple t in the relation R (if not already
     present).

   * delete(P,R) - Delete every tuple from relation R that satisfies the
     predicate P.

   * lookup(P,R) - Return a relation consisting of the tuples in R that
     satisfy the predicate P.

  Options for predicates:
	  equality :  [room == 727] 
	  comparisons :  arithmetic [idle > 60]) 
     expressed as
          logical statements 
	    ["status == faculty && file_server == butler"] 
	  patterns with "wildcard" (don't care) indicators 
	     ["(*,*,faculty,butler,*,*)"].  
	     queries are restricted to conjunctions (ANDs) of equality tests.