Lecture notes for CS 173, Tues. Sept. 10, 2002 Trivia: Project 1 is out. Due Sept 23 HW1 will be handed out today. Quiz next Monday. Lab sessions this week. W 5-6, Fr 2-3. Some NFS problems, use cycle1, cycle2, cycle3 to compile/run programs If you are a freshman or new student and did not receive mail from me, see me after class ============== Operations insert, delete, lookup ---------------- Implementation Tradeoffs Insert requires that we not insert a tuple that is already present, therefore it requires an efficient test of membership. Delete and Lookup require only those tuples that match a predicate. This is fast if the specified (non-wildcard) fields constitute a key for which we have an index. It's still reasonably fast if some subset of the specified fields constitute a key. unsorted linked list insert (without lookup) O(N) to avoid duplicates lookup, delete O(N) balanced search tree insert O(log N) lookup, delete O(k log N), when k tuples have a matching key, or O(log N + k) if the tree is threaded characteristic vector (limited domains) insert O(1), assuming characteristic field is a key or tuples are randomly distributed lookup, delete O(k), when k tuples have a matching key hash table insert O(1), assuming a good hash function and well distributed keys lookup, delete O(k), when k tuples have a matching key Note that absent an index for at least some of the fields specified in a query, lookup (and thus delete) will require that we search the entire relation: O(N) time. ------------------------------------------------------------------------ Hash Table Example for Workstations relation Name Room Mem OS Model Speed ================================================ crabgrass 727 128MB Solaris U10 333MHz fusion 715 176MB MacOS G3 500MHz chemical 735 96MB MacOS G3 300MHz saturn 727 64MB Win2000 P2 265MHz manila 711 1GB RedHat D340 686MHz Assume a simple (too simple!) hash function * h(machine name) = first letter of name The resulting table: a --> NIL b --> NIL c --> [crabgrass, 727, 128MB, Solaris, U10, 333MHz] --> [chemical, 735, 96MB, MacOS, G3, 300MHz] --> NIL d --> NIL e --> NIL f --> [fusion, 715, 176MB, MacOS, G3, 500MHz] --> NIL : m --> [manila, 711, 1GB RedHat D340 686] --> NIL : s --> [saturn, 727, 64MB, Win2000, P2, 265MHz] --> NIL : z --> NIL * To insert a tuple, make sure it is not already on the list in the selected bucket, then add it to the list. * To delete a tuple, examine each entry on the list for the selected bucket, evaluate the predicate for each entry, and delete those that match. * To lookup tuples, examine each entry on the list for the selected bucket, evaluate the predicate for each entry, and return as a result those tuples that match. Given this structure, lookups in which only machine name is specified will take O(1) time. If the relation were indexed by room, lookups in which room and OS are specified would take O(number of tuples in the room). In either case, lookups in which only OS is specified would take O(N). ---------------- Secondary Index Structures Remember that we can create as many indices as we like. They work especially well when indexing by a key, but are possible even for non-keys. If a primary index determines the layout of the data in memory, secondary indices contain pointers into the primary structure, rather than the data itself. ---------------- Operations that Span Relations All of the operations we've considered so far (insert, delete, lookup) operate on a single relation. Operations may also span relations in order to query relationships among tuples in different relations. * lookup users whose server is down * lookup machines in a room with a printer ---------------- Multi-Relation Operations: Example Find the name of all users who have been idle less than 1 hour, and whose server is down. for all tuples t in User relation do for all tuples s in Server relation do if (t.Idle < 1:00) and (s.Name = t.Server) then if s.Status = down then print t.Name The running time of the algorithm is O(|Users| * |Servers|). In general, any query that requires that we look at every item in one relation for every item in another relation is inefficient. To improve the algorithm, we can * reorder the steps taken * use an index ---------------- Reordering Operations: Example We can improve the previous implementation by reordering operations so that we select items from one relation before iterating over the other relation. Find the name of all users who have been idle less than 1 hour, and whose server is down. for all tuples t in User relation do if (t.Idle < 1:00) then for all tuples s in Server relation do if (s.Name = t.Server) then if (s.Status = down) then print t.Name Assume that there are k users idle less than 1 hour. The running time of the algorithm is O(|Users| + k|Servers|). This implementation is considerably more efficient than the previous one in cases where k << |Users|. ---------------- Reordering Again: Example Suppose we reorder the two loops, and select from the Server relation before iterating over the User relation. Find the name of all users who have been idle less than 1 hour, and whose server is down. for all tuples s in Server relation do if (s.Status = down) then for all tuples t in User relation do if (t.Idle < 1:00) then if (s.Name = t.Server) then print t.Name Assume that j servers are down. The running time of the algorithm is O(|Servers| + j|Users|). Which is better? * Are there more servers or users? * Are there more idle users or down servers? ---------------- Indexes for Complex Queries Rather than search among all tuples in a relation we can use an index to quickly find the tuples that match a given predicate. To implement a query about idle time and server status, we would have to implement a secondary index for each of those domains. Such queries are not likely to be common, and such an index may not be worthwhile. An index based on the room number may make sense for both the Workstation and Printer relations. An index based on User Name (the character string) may be valuable for the User relation. In general, we would like to answer queries in time linear in the size of the answer. We can do this only if we have a (good) hash table for every attribute in the query, and all comparisons in the query are based on equality or inequality, rather than < or >, which hash tables don't support. (< and > queries can benefit from tree-structured indices, but these are O(log N) instead of O(1) for == and != queries.) Note the downside of extra indices: - space - time to update on insert and delete operations ------------------------------------------------------------------------ Database Design A database designer must define * the database scheme * the keys for each relation * primary index structures for each relation * (optional) secondary index structures for each relation These decisions depend on * the information to be supported * the relationships between pieces of information * the expected size of the relations * the types of queries expected Two important principles guide scheme design 1. Two pieces of information that are independent should reside in two different relations. 2. Two pieces of information that are related should reside in the same relation. Violations of principle 1 result in a database full of redundant information. * Course prerequisite information is independent of the time and day the course is offered. * A server's status is independent of the users it serves * Company information is independent of the mutual fund that owns shares in the company. Violations of principle 2 result in a loss of information in the database. * A Course-Day-Hour relation cannot be split into Course-Day and Course-Hour relations without losing track of the meeting time of a course on a particular day. Key (index) selection is also important, as it dictates what will be fast. * Unnecessary indices slow down the database, because they have to be updated on every insert and delete. * Remember that you can't identify keys simply by looking at the values of a relation; only the database designer knows for sure * The best choice of indices depends on knowledge of the data in the relations, and the expected pattern of queries * If the common queries don't use a key (eg. User Name in the Users relation), you can still use a hash table based on this domain, but the distribution of entries in hash buckets may be less even than if you used a key (such as user login). ------------------------------------------------------------------------ Relational Algebra An algebra is a formal structure consisting of sets and operations on those sets. Relational algebra is a formal system for manipulating relations. * Operands of this algebra are relations. * Operations of this algebra include the usual set operations (since relations are sets of tuples) -- union, intersection, and difference -- and special operations defined for relations: - selection - projection - join ------------------------------------------------------------------------ Selection: chooses tuples from a relation whose attributes meet the selection criteria, which is normally expressed as a predicate. R2 = select (R1, P) That is, from R1 we create a new relation R2 containing those tuples from R1 that satisfy (make true) the predicate P. Clearly R2 is a subset of R1. A predicate is a boolean expression whose operators are the logical connectives (and, or, not) and arithmetic comparisons (LT, LE, GT, GE, EQ, NE), and whose operands are either domain names or domain constants. select (Workstations, Room = 727) = Name Room Mem OS Model Speed ================================================ crabgrass 727 128MB Solaris U10 333MHz saturn 727 64MB Win2000 P2 265MHz select (Users, Status=Staff and Idle < :10) = Id Name Status File Server Console Idle =================================================================== kparkins K. Parkins Staff butler saturn 0