Lecture notes for CS 173, Tues. Sept. 12, 2002 ---------------- Projection: chooses a subset of the columns in a relation, and discards the rest. R2 = project (R1, D1, D2, ..., Dn) That is, from the tuples in R1 we create a new relation R2 containing attributes from the domains D1, D2, ..., Dn only. project (Cycle Servers, Name, Status) = Name Status ============== cycle1 up heart up brain up aorta down ganga down tapti up project (select (Users, Console = heart), Name, Status) = Name Status ========================= X. Gu Grad G. Altekar UG T. Pawlicki Faculty ---------------- Join: combines attributes of two relations into one. R3 = join (R1, D1, R2, D2) Given a domain from each relation, join considers all possible pairs of tuples from the two relations, and if their values for the chosen domains are equal, it adds a tuple to the result containing all the attributes of both tuples (discarding the duplicate domain D2). Natural join: If the two relations being joined have exactly one attribute (domain) name in common, then we assume that the single attribute in common is the one being compared to see if a new tuple will be inserted in the result. Suppose we augment the domain names in our lab database so that columns that really indicate different things have different names. The Workstations scheme might be WorkstationName Room Mem OS WorkstationModel Speed The File Servers scheme might be FileServerName Status FreeDisk OS A natural join on the Workstations and File Servers relations would then produce a relation of all workstation/file server attribute pairs that are running the same OS: Work- File- station- Server- Free Name Room Mem OS Model Speed Name Status Disk =================================================================== crabgrass 727 128MB Solaris U10 333MHz steward up 8.5GB crabgrass 727 128MB Solaris U10 333MHz slat up 452MB crabgrass 727 128MB Solaris U10 333MHz butler down 12.6GB fusion 715 176MB MacOS G3 500MHz nuclear up 5.8GB chemical 735 96MB MacOS G3 300MHz nuclear up 5.8GB Note that entries for saturn do not appear, because there is no Win2000 file server in the database. Note also that we have introduced redundant information. Joins generally produce wider relations than those chosen (presumably with care) by the database designer. That's ok, however, because we'll usually be performing joins only as an intermediate step in answering a query; there will almost always be a project that tosses columns we don't need. For example: find all workstations for which there exists a file server, running the same OS, that is currently up and has more than 5GB of disk space free; return (Workstation, FileServer, Room) triples: R1 = join as above R2 = select (R1, Status = up and FreeDisk > 5GB) Work- File- station- Server- Free Name Room Mem OS Model Speed Name Status Disk =================================================================== crabgrass 727 128MB Solaris U10 333MHz steward up 8.5GB fusion 715 176MB MacOS G3 500MHz nuclear up 5.8GB chemical 735 96MB MacOS G3 300MHz nuclear up 5.8GB R3 = project (R2, WorkstationName, FileServerName, Room) WorkstationName FileServerName Room ===================================== crabgrass steward 727 fusion nuclear 715 chemical nuclear 735 Note that this is not the only way to get the result (and in particular it's not the best way!); more on this later. ----------------------- Implementing Set Operations (largely review of 172 material) Assume our relations are implemented as lists. To implement R1 U R2 (while eliminating duplicates) we can * sort R1 in O(N lg N) * sort R2 in O(M lg M) * merge R1 and R2 in O(N+M) This takes care of the data. What about the indices? Hash tables using the same hash function can be merged in linear time. Trees can be merged in N lg N time. So O(N lg N + M lg M) definitely covers the whole thing. If our indices are hash tables, we can actually do the whole thing in O(N + M): * copy R1 to destination in O(N) * for each tuple in R2 (which is O(M)) add to destination iff not already there Intersection and set difference have corresponding implementations. ---------------- Implementing Projection To implement projection we must * process every tuple in the relation * remove any duplicates that result To avoid duplicates we can * sort the result and remove consecutive tuples that are equal - requires time O(N log N) where N is the size of the original relation * implement the result as a set - set insertion guarantees no duplicates - by using a hash table, insertion is O(1), so projection is O(N) Note that we have to have a key for the hash table. If nothing else, the entire tuple can always be used as a key. ---------------- Implementing Selection In the absence of an index we * apply the predicate to every tuple in the relation * insert matches in the resulting relation (duplicates can't occur) * take O(N) time Given an index, and an (in)equality-based predicate that uses the index key, we * Lookup tuples using the key * evaluate only those tuples with the predicate * take O(K) time, where K tuples match the key Note that hash tables won't support < or > predicates, and no index helps if it doesn't cover the fields used in the predicates. Why not just build an index any time we have an operation that might benefit from it? (After all, asymptotically we can hide the cost of index creation in any one such operation.) Because then we have to maintain the index in every future insert and delete. Only worth it if we need the index for a reasonable fraction of all operations. ---------------- Implementing Join with Nested Loops A nested loop join on relations R1 (with N attributes) and R2 (with M attributes), considers all |R1| x |R2| pairs of tuples. R3 := join (R1, Ai, R2, Bj): R3 := for each tuple t in R1 do for each tuple s in R2 do if t.Ai = s.Bj then insert (R3, t.A1, t.A2, ..., t.AN, s.B1, ..., s.B(j-1), s.B(j+1), ..., s.BM) This implementation takes time O(|R1|*|R2|). ---------------- Index Join An index join exploits the existence of an index for one of the domains used in the join to find matching tuples more quickly. R3 := join (R1, Ai, R2, Bj): for each tuple t in R1 do for each tuple s in R2 at index(t.Ai) do insert (R3, t.A1, t.A2, ..., t.AN, s.B1, ..., s.B(j-1), s.B(j+1), ..., s.BM) We could choose to use an index for R2, and reverse the order of the loops. Presumably we'd choose to do the outer loop on the smaller relation, and make use of the index on the larger relation. As usual, if we want any secondary indices for the joined relation we should update them along the way. ---------------- Sort Join If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join. R3 := join (R1, Ai, R2, Bj): * Sort the tuples of each relation on the join attribute. * Peruse the two lists. If the initial tuples of one relation have a join attribute less than that of the initial tuples of the other relation, toss them. Else they're equal: pair all (consecutive) tuples from the two relations with the same value in the join attribute. NB: the book suggests merging the tuples of the relations into a single list and sorting that. This works, and may be faster, but only by a small constant factor, and it's harder to understand. ---------------- Comparison of Join Implementations Assumptions * Join R1 and R2 on some common attribute to produce R3 * R1 has i tuples, R2 has j tuples * |R3| = m, 0 <= m <= i * j * Every implementation takes at least time O(m) Comparison * Nested-loop join takes time O(i * j) * Index join (using R2 index) takes time O(i+m) - lookup is O(1) for each tuple in R1 - at most O(m) tuples match * Sort join takes time O(i log i + j log j + m) - O(i log i + j log j) to sort the relations - O(i + j + m) = O(m) to produce the output (0 <= m <= i*j) [NB: Time is O((i+j) log (i+j) + m) if you do it the way the book suggests.]