Lecture notes for CS 173, Tues. Sept. 17, 2002

===============================

Expressing Queries in Relational Algebra

  relational algebra: unambiguous notation (or formalism) for queries
  queries:  expressions in relational algebra.

  use rel. alg. to simplify queries, easier to compute
    - early selection reduces the number of tuples
    - early projecion reduces the number of domains    


Notation:
    Infix operators are easier to manipulate than prefix (function-call)
    style operators.  Traditional to use lc sigma for selection, lc pi for
    projection, and bowtie for join.  Since I'm typing in ascii, I'll use
        + for union
        * for intersection
        - for difference
        $P for selection on predicate P
        #A,B,C for projection onto attributes A, B, and C
        >< for join, with stacked equate for non-natural join

    [Note: please DO NOT feed dollar signs and pound signs back at me on an
    exam.  The real symbols are lower-case sigma and pi!]
    
    remind yourself the the algebraic laws for set
    union/intersection/difference (e.g., associativitiy, commutativity, etc)
    See section 7.3 of the text

  ------------------------------------------------------------------------

Algebraic Laws for Join
  (merging multiple relations)

Commutativity
    yes, if order of columns does not matter
    NOTE: join in assignment is more simple!

    R1 >< R2  ==  R2 >< R1
      A=B            B=A

Associativity


    Not always associative
      ({AB} >< {BC}) >< {AD}             {AB} >< ({BC} >< {AD})
            B        A                        B        A

         {ABC}   >< {AD}                 {AB} ><  {???}

    ---------

    (R1 >< R2) >< R3  ==  R1 >< (R2 >< R3)
       A=B    C=D           A=B    C=D

       associative if B and C are distinct attributes of R2
         in above example, A (the second join att.) is not an element of R2

       must be careful with naming
         if keep attr. name of left (e.g., A), then second expr. may be undefined

----------------

Algebraic Laws for Selection

Commutativity

    $P1 ($P2 (R))  ==  $P2 ($P1 (R))

Selection pushing
    want to have early selection
    push selection below the join, fewer rows

    $P (R >< S)  ==  $P (R) >< S

        iff all attributes mentioned in P are members of R.
    Similarly

    $P (R >< S)  ==  R >< $P (S)

        iff all attributes mentioned in P are members of S.
    And

    $P (R >< S)  ==  $P (R) >< $P (S)

        iff P mentions only the attribute on which we are
        performing the (natural) join.
    Similarly,

        $P (R + S)  ==  $P (R) + $P (S)
        $P (R * S)  ==  $P (R) * $P (S)
        $P (R - S)  ==  $P (R) - $P (S)

Selection splitting, where P == A and B

    $P (R)  ==  $A ($B (R))  ==  $B ($A (R))

    the predicate must contain AND
    Can split into different arguments of an underlying join
      If A mentions only attr of R and B mentions only attr of S

        $P (R >< S)  ==  $A (R) >< $B (S)

----------------

Example from the book:

Consider the following 4 relation database

   * CSG: Course-StudentID-Grade
   * SNAP: StudentID-Name-Address-Phone
   * CDH: Course-Day-Hour
   * CR: Course-Room

Implement the query "Where is Chris at Noon on Monday?"

Let P be (Name="Chris" and Day="Monday" and Hour="Noon")

Brute force: join all data into one BIG relation
             select tuples that meet query
             project to get the answer

    #Room ($P (((CSG >< SNAP) >< CDH) >< CR))                   [fig 8.18]
  
  1:  Note: selection using only Name/Day/Hour
            right-most relation does not contain any of these
      Opt:  push selection inside outermost join         

    #Room (($P ((CSG >< SNAP) >< CDH)) >< CR)                   [fig 8.19a]

       cannot push selection further
  2:  Note: Predicate uses AND
            some attr in left/right joins
      Opt:  Spilt predicate into Name and Day/Hour
            P1=Name="Chris"  P2=Day="Monday" AND Hour="Noon"

    #Room (($P1 ($P2 ((CSG >< SNAP) >< CDH))) >< CR)            [fig 8.19b]

  3:  Note: P2 involves only attr. from CDH
      Opt: push selection of P2 inside the join

    #Room (($P1 ((CSG >< SNAP) >< $P2 (CDH))) >< CR)

  4:  Note: P1 involves no attr. from CDH
      Opt: Push selection of P1 inside join

    #Room (($P1 (CSG >< SNAP) >< $P2 (CDH)) >< CR)              [fig 8.19c]

  5. Note: P1 involves only attr of SNAP
     Opt: Push selection of P1 inside join

    #Room (((CSG >< $P1 (SNAP)) >< $P2 (CDH)) >< CR)            [fig 8.18d]

----------------

Algebraic Laws for Projection

Idempotence

    #SR (R) = R, where SR is the entire scheme of R

Projection pushing

    Always push down for UNION

        UNION:
        #A (R + S)  ==  #A (R) + #A (S)

    Sometimes push into Intersection and Difference
      Problem: projection removes columns, so some tuples that used to be
                different look the same after the projection
      Solution: need to keep the attributes used in the join
                AND, get rid of them in the final step
      
      Ex: given #W (R >< S)
                      A=B

        Let WR be the subset of W found in R, plus A.
        Let WS be the subset of W found in S, plus B.  Then

        #W (R >< S)  ==  #W (#WR (R) >< #WS (S))

        + gets rid of unnecessary columns early
        - adds many operations (two additional projections!)
        ? when to do projection pushing?
           - Do not do it if you eliminate an index which may be useful
           - Do it if you don't need the column later

----------------

Continuing the example from the book:

Originally we had

    #Room ($P (((CSG >< SNAP) >< CDH) >< CR))                   [fig 8.18]

After selection pushing, we had

    #Room (((CSG >< $P1 (SNAP)) >< $P2 (CDH)) >< CR)            [fig 8.19d]

    Prob: still carries many extra attributes 

 1.  Note: Need course and room from left and right of outermost join
     Opt:  right:  CR is course/room: push not necessary
           left: room does not appear, push only with course

    #Room (#Course ((CSG >< $P1 (SNAP)) >< $P2 (CDH)) >< CR)    [fig 8.20a]

 2. Opt: push below next join.
         No "higher level" clean-up join necessary:
             course is only attribute, and it is pushed in both directions

    #Room ((#Course (CSG >< $P1 (SNAP))
                >< #Course ($P2 (CDH))) >< CR)                  [fig 8.20b]

  3. Opt:  Push on left, course and student id
          do not need course on the right, not an attr

    #Room ((#Course (#Course,StudentID (CSG) >< #StudentID ($P1 (SNAP)))
                >< #Course ($P2 (CDH))) >< CR)
                                                                [fig 8.20c]

  4. Note: projection on CDH on Course and StudID requieres looking at whole
             relation 
           Immediately after this project, do a join on StudentID
             StudentID is probably an index, so we don't need to look at 
                all the tuples!!!
     Opt:  Do not get rid of the index, so do not do the project on left

    #Room ((#Course (CSG >< #StudentID ($P1 (SNAP)))
                >< #Course ($P2 (CDH))) >< CR)                  [fig 8.20d]


The book goes through both the selection and the projection
"optimizations" with tree-based diagrams.  There are pretty helpful.
Take a look at figures 8.19 and 8.20, pp. 445 and 447.