Lecture notes for CS 173, Thurs. Sept. 5, 2002 Trivia: Ted is looking for TAs for 108 and 171. Contact pawlicki@cs Lab sessions next week Monday 2-3 Josh Tuesday 11-12 Gautam Wednesday 5-6 Xipeng Friday 2-3 Gautam The Hylan lab is closed tomorrow for two hours in the morning (carpet installation). You still have remote access! Get your account from Marty soon! Marty's office CSB 735 Warning: Marty is not on campus tomorrow! First assignment will be out by tomorrow. It will be due Sept 23, 11:59pm ============== Review from CSC 172 Data structures v. Abstract Data Types the former are ways of organizing information in a computer program the latter present a "black box" interface to information and the operations on it ADT examples: stack, queue, list, set, function, mapping, tree (embodying information about parents and children), graph DS examples: array, singly-linked list, doubly-linked list, hash table, AVL tree, 2-3 tree, splay tree, skip list, binary tree embedded in array, ... One of the key understandings of modern CS is that it is very important to maintain a clear distinction between ADTs and DSes. That's what lets us build reusable code (e.g. the Java and C++ standard libraries). You presumably spent a lot of time in 172 on sets and their relatives. set: collection of items, no two the same multiset: collection of items, possibly with repeats function: mapping from domain elements to (unique) range elements binary relation: collection of (unique) (domain, range) pairs, not necessarily one-one Possible implementations: arrays, linked lists, balanced trees, and hash tables. relation: generalization of a binary relation collection of unique n-tuples useful in databases, since mide 1970s ------------------------------------------------------------------------ Example: A Simple Database Consider a database that might be maintained by the Computer Science Department of the people and machines in the laboratories. The database contains information about the workstations, printers, servers, and registered users. Here are tables that describe the various parts of the database (info adapted from the CS research network): Workstations Name Room Mem OS Model Speed ================================================ crabgrass 727 128MB Solaris U10 333MHz fusion 715 176MB MacOS G3 500MHz chemical 735 96MB MacOS G3 300MHz saturn 727 64MB Win2000 P2 265MHz manila 711 1GB RedHat D340 686MHz Printers Name Room Status Model =============================== chaucer 737 up LJ4 shake 602 up LJ4 keats 706 up LJ2 poe 707 down LJ2 picasso 737 up P560 Cycle Servers Name Status Free Mem #Users OS ============================================= cycle1 up 880MB 6 RedHat heart up 95MB 16 Solaris brain up 1.2GB 2 Solaris aorta down 3.2GB 0 Solaris ganga down 2.0GB 0 True64 tapti up 850MB 2 True64 File Servers Name Status Free Disk OS ================================== nuclear up 5.8GB MacOS steward up 8.5GB Solaris slate up 452MB Solaris butler down 12.6GB Solaris Users Id Name Status File Server Console Idle =================================================================== scott M. Scott Faculty nuclear fusion 1:13 murphy A. Murphy Faculty steward manila 0 xgu X. Gu Grad steward heart 12:06:14 galtekar G. Altekar UG butler heart :07 pawlicki T. Pawlicki Faculty butler heart 0 marty M. Guenther Staff nuclear chemical :15 kparkins K. Parkins Staff butler saturn 0 ---------------- Notes information stored in tables *tuple* : row of a table *attributes* : entries in a tuple (a field) *attribute* : label of a column *relation* : collection of tuples (a table) same attributes, number and type order of rows doesn't matter order of columns doesn't matter why called a relation? individual data field not interesting relationship among fields is interesting e.g. marty has been idle 15 minutes. can be huge. Our example is small!!! imagine all workstations connected to Internet *key* : column (or set of columns) that uniquely identifies the tuple e.g., user's login, printer name worst case, all columns form key multiple tables (*Database*) can have similar information in more than one table to correlate e.g., murphy's console runs RedHat matching columns may not have the same name!!! e.g. file server name in "Users" (File Server) and "File Servers" (Name) columns with same attribute label need not have the same information e.g., Users:Status and FileServers:Status Forming separate tables try to avoid redundancy: storing same relationship in more than one place can lead to inconsistency avoid losing information: relationship between marty and idle ---------------- Basic operations: * insert * delete * lookup (simple query) Lookup operations can - use a single entry: how long has murphy been idle? - use a whole table: how many servers are up? - cross tables: all users who have been idle less than an hour, and whose file server is currently down. Possibly implement as: * Extract from the server table a list of servers that are down. * Extract from user table a list of users idle less than an hour. * For each such user, extract the server field from the entry, and see if that server name appears on the list of servers that are currently down. * If so, output the name of the user. There are other ways to get the same information: we could, for example, iterate over each server that is down, looking for users of that server. ---------------- How can Relations and Relational Algebra Help? - Conceptual framework for organizing information -- helps you decide what information to keep together, based on expected relationships and queries. - Abstract language for expressing queries and other operations on databases against which implementations can be compared (e.g. for completeness). - Algebra for expression simplification, to optimize queries for performance. ---------------- Getting formal Remember our example ADTs: function: mapping from domain elements to (unique) range elements binary relation: collection of (unique) (domain, range) pairs, not necessarily one-one Relations are the N-ary extension of binary relations. Informally, we can think of a relation as a table, where the columns (attributes) correspond to the domains of the relation, and the rows correspond to the tuples. Formally, a relation on the set of domains D1, D2, ... Dn is a set of n-tuples, each of which is an element of the Cartesian product D1 x D2 x D3 x ... x Dn. * Since a relation is a set, order of tuples (rows) doesn't matter * Since a relation is NOT a multiset, two rows cannot be the same. * Since each of the domains (attributes) has a distinct name, column order doesn't matter The *scheme* of a relation is the list of domain/attribute names D1...Dn. Workstations: Name-Room-Mem-OS-Model-Speed Printers: Name-Room-Status-Model Cycle Servers: Name-Status-Free Mem-#Users-OS File Servers: Name-Status-Free Disk-OS Users: Id-Name-Status-File Server-Console-Idle *database* : a collection of relations. *scheme* for a database is the set of schemes for the relations in the database. We require domain names within a relation to be unique, but do not require this of all domain names in the database as a whole. ---------------- Keys As noted informally above, a key K for relation R on domains D1...Dn is a subset of the domains D1...Dn such that 1. the value of K uniquely identifies each tuple in R 2. no proper subset of K exhibits property 1 Since every tuple in R must be unique, it follows that the set {D1,D2,...,Dn} obeys property 1. To find a key then, we need only look for subsets that obey property 2. IMPORTANT NOTE: we cannot in general tell whether an attribute or set of attributes is a key simply by inspecting the current contents of the relation, because some future inserts might introduce tuples that have the same "key" but different other attributes. "Keyness" depends on the semantics (meaning) of the relation as determined by the human user; it's an assertion about *all* data that might *ever* be put in the relation. (Consider the printers relation above, with and without picasso. Is "room number" a key?) In our simple database The various Name domains can serve as keys for the Workstations, Printers, Cycle Servers, and File Servers relations. Id can serve as a key for the Users relation. Name does NOT work as a key for the Users relation: two users could have the same name. If we assume that two people with the same name are never assigned to the same file server {Name, File Server} is a key for the Users relation. (That may not be a safe assumption!) If we assume no room has more than one printer of the same model, then {Room, Model} is a key for the Printer relation (Again, this may not be a safe assumption!) ------------------------------------------------------------------------ Implementing Relations Tuples are basically records, one field per domain. As an ADT, they're objects with put_field and get_field methods. The natural data structure uses contiguous allocation, like a C struct. A relation is a set of tuples. It can be realized as an array a linked list a tree a hash table How to sort (or hash) the relation? can use a *primary key* don't sort choose a *primary index* need to manage duplicates primary key determines how data is organized, so we want to be careful about the choice for efficiency reasons (think about huge databases, possibly stored on tape) What if we have two equally used queries? How do we choose the primary key? another option: use a *secondary index* : another efficient structure that contains pointers/references to the data, not the data itself. secondary key(s) makes inserts/deletes expensive Ultimately the decision on how best to implement a relation depends on the size of the relation the keys the sorts of lookup operations (queries) that must be supported the expected frequency of each sort of query ------------------------------------------------------------------------ More on operations * insert(t,R) - Insert tuple t in the relation R (if not already present). * delete(P,R) - Delete every tuple from relation R that satisfies the predicate P. * lookup(P,R) - Return a relation consisting of the tuples in R that satisfy the predicate P. Options for predicates: equality : [room == 727] comparisons : arithmetic [idle > 60]) expressed as logical statements ["status == faculty && file_server == butler"] patterns with "wildcard" (don't care) indicators ["(*,*,faculty,butler,*,*)"]. queries are restricted to conjunctions (ANDs) of equality tests.