CSE 5331/4331 Projects
1 and 2 Fall 2009
In these projects, you will implement a program that does a simple query optimization for SELECT-PROJECT-JOIN queries. The actual optimizer will be built in project 2. In project 1, you will first implement a simple catalog to keep the information needed for the optimization. You will also implement the formulas for estimating the cost of each algorithm for a particular operation. The catalog can have three basic tables (plus additional ones as needed), and can be stored in the ORACLE DBMS (or any other system, such as ACCESS or MySQL; or it can be implemented as internal data structures (tables) within your optimization program). The three basic catalog tables needed are REL_INFO, ATTR_INFO, and INDEX_INFO:
1. REL_INFO will have one record (tuple) for each database relation (table), giving the TableName, RecordSize, NoOfRecords, BlockingFactor, NoOfFileBlocks, and any other information relevant to the whole relation and needed for optimization.
2. ATTR_INFO will have one record (tuple) for each database attribute, giving the AttributeName, the TableName (to which the attribute belongs), AttributeSize, KeyNo (this attribute will have a 0 (zero) value if the attribute is not part of any key; a 1 (one) value if the attribute is part of the primary key; and a 2 (or higher) value if the attribute is part of a unique (secondary) key). (This allows a table to have any number of keys, numbered 1, 2, 3, ¡¦; it allows an attribute to be member of at most one key, though, for simplicity). Selectivity (assume uniform distribution of values), and any other information relevant to an individual attribute, should also be included.
3. INDEX_INFO will have one record (tuple) for each database index, giving the IndexName, TableName, AttributeName (on which the index is constructed; assume for simplicity that all indexes are constructed on single attributes), NoOfIndexLevels, and any other information relevant to an index and needed for your optimization.
Your should do the following steps for this Project 1 assignment:
1. Design the catalog and create the catalog tables, then load the information on the COMPANY database schema into the catalog (see Figure 8.1). Assume all relations have fixed length record sizes with the following records sizes, number of records, and indexes (based on Figure 8.1(a), and assuming INT requires 4 bytes of storage. We have also added a 1-byte deletion marker to each record when calculating the record size):
¡¤ EMPLOYEE (105 bytes, 45000 records, file is organized as a (static) hash file using SSN as hash key, secondary index on DNO with 2 levels plus extra level of indirection, secondary index on SALARY with 3 levels plus extra level of indirection, secondary index on LNAME with 4 levels plus extra level of indirection). Assume the following domain (no of distinct values) sizes: FNAME 6000, MINIT 26, LNAME 9000, SSN 45000, BDATE 10000, ADDRESS 30000, SEX 2, SALARY 3000, SUPERSSN 4500, DNO 900
¡¤ DEPARTMENT (39 bytes, 900 records, file is organized as an unordered file, unique secondary index on DNUMBER with 2 levels, unique secondary index on DNAME with 2 levels, unique secondary index on MGR_SSN with 2 levels). Assume the following domain (no of distinct values) sizes: DNAME 900, DNUMBER 900, MGR_SSN 900, MGR_START_DATE 800
¡¤ DEPT_LOCATIONS (20 bytes, 1800 records, file is clustered (ordered) on DNUMBER, clustering index on DNUMBER with 2 levels, secondary index on DLOCATION with 2 levels plus extra level of indirection). Assume the following domain (no of distinct values) sizes: DNUMBER 900, DLOCATION 200
¡¤ PROJECT (39 bytes, 4500 records, file is organized as an unordered file, unique secondary index on PNUMBER with 3 levels, unique secondary index on PNAME with 3 levels, secondary index on PLOCATION with 2 levels plus extra level of indirection). Assume the following domain (no of distinct values) sizes: PNAME 4500, PNUMBER 4500, PLOCATION 200, DNUM 900
¡¤ WORKS_ON (17 bytes, 90000 records, file is clustered (ordered) on ESSN, secondary index on PNO with 3 levels plus extra level of indirection, clustering index on ESSN with 4 levels). Assume the following domain (no of distinct values) sizes: ESSN 45000, PNO 4500, HOURS 300
¡¤ DEPENDENT (44 bytes, 90000 records, file is clustered (ordered) on ESSN, clustering index on ESSN with 4 levels). Assume the following domain (no of distinct values) sizes: ESSN 30000, DEPENDENT_NAME 7000, SEX 2, BDATE 26250, RELATIONSHIP 10
2. Use the attribute sizes as in Figure 8.1(a). Assume INT is 4 bytes and DATE is 10 bytes.
3. For selectivities of attributes, assume uniform distribution where appropriate.
4. For join selectivities, calculate the join selectivity for each pair of attributes on which a foreign key to primary key referential integrity constraint is specified in Figure 8.1. This can be stored in a separate catalog table JOININFO(FKTABLE, FK ATTR, PKTABLE, PKATTR, JS).
5. Assume that the disk block size is B= 4096 bytes. Calculate the number of file blocks b, blocking factor Bfr, and other relevant information for each file. Assume unspanned (fixed-length) records.
6. Design and write program functions in JDBC or any other programming language that retrieve from the catalog appropriate data that will be needed by the optimizer. The following is a partial list of some of these functions. You should implement the seven functions listed below as part of Project 1. (Additional functions can be added for Project 2 as needed.):
¡¤ TableExists(TableName): Returns true if a table exists with name TableName.
¡¤ GetTableInfo(TableName): Returns all attribute information on a specific table.
¡¤ GetTableAttributes(TableName): Returns a list of attribute names for a table.
¡¤ HasAttribute(TableName, AttributeName): Returns true if AttributeName is an attribute of TableName.
¡¤ HasIndex(TableName, AttributeName): Returns the IndexName if TableName has an index on the attribute AttrbuteName; otherwise, returns ¡°NoIndex¡±.
¡¤ GetIndexInfo(IndexName): Returns all attribute information on a specific index.
¡¤ GetAttributeInfo(TableName, AttributeName): Returns all attribute information on a specific attribute.
In Project 2, you will do the following:
1. Write your optimization algorithm for SELECT-PROJECT-JOIN queries. The program should first generate an initial query tree similar to the one shown in Figure 15.5(a), then transform it step by step to an optimized query tree using a combination of heuristic rules and cost estimation.
1. For each select or join operation in the tree, your optimizer should compare cost estimates and choose the algorithm with lowest cost (e.g. index search (using which index), linear search, single-loop join (with which file used for looping and which access path to find the matching record), etc.). Assume the number of main memory buffers available is 30 when this is part of the cost estimation formulas.
2. Run your program on the following queries and turn in the results. For each query, you need to show the initial query tree and the intermediate and final query trees generated by your optimization algorithm. You can use one of the JAVA (or other programming language) tree packages, so that it is easy to print your trees using the print functions of the package or you can write your own tree-printing functions (it is easiest to use the indented style of tree display). The following queries should be optimized (additional queries may be used during grading):
a. SELECT E.LNAME, E.FNAME FROM EMPLOYEE E WHERE E.DNO=5 AND E.SALARY<30000 ;
b. SELECT E.LNAME, E.FNAME, D.DNAME FROM EMPLOYEE E, DEPARTMENT D WHERE E.DNO=D.DNUMBER ;
c. SELECT E.LNAME, E.FNAME, D.DNAME FROM EMPLOYEE E, DEPARTMENT D WHERE E.SSN=D.MGRSSN ;
d. SELECT E.LNAME, E.FNAME, D.DNAME FROM EMPLOYEE E, DEPARTMENT D WHERE E.DNO=D.DNUMBER AND E.SALARY>100000 ;
e. SELECT E.LNAME, E.FNAME, P.PNAME, W.HOURS FROM EMPLOYEE E, PROJECT P, WORKS_ON W WHERE E.SSN=W.ESSN AND W.PNO=P.PNUMBER ;
f. SELECT E.LNAME, E.FNAME, P.PNAME, W.HOURS FROM EMPLOYEE E, PROJECT P, WORKS_ON W WHERE E.SSN=W.ESSN AND W.PNO=P.PNUMBER AND E.DNO=5;
g. SELECT E.LNAME, E.FNAME, P.PNAME, W.HOURS FROM EMPLOYEE E, PROJECT P, WORKS_ON W WHERE E.SSN=W.ESSN AND W.PNO=P.PNUMBER AND E.DNO=5 AND P.PLOCATION=¡±HoustonDowntown¡±;
h. SELECT E.LNAME, E.FNAME, P.PNAME, W.HOURS, D.DNAME FROM EMPLOYEE E, PROJECT P, WORKS_ON W, DEPARTMENT D WHERE E.SSN=W.ESSN AND W.PNO=P.PNUMBER AND P.DNUM=D.DNUMBER;
i.
SELECT E.LNAME, E.FNAME, P.PNAME, W.HOURS,
D.DNAME FROM EMPLOYEE E, PROJECT P, WORKS_ON W, DEPARTMENT D WHERE E.SSN=W.ESSN
AND W.PNO=P.PNUMBER AND P.DNUM=D.DNUMBER AND E.DNO=5 AND P.PLOCATION=¡±HoustonDowntown¡±;
Due Dates: For Project 1, Wednesday, October 21, 2009. For Project 2, the due date is Wednesday November 18, 2009. Instructions on what to turn in and how to turn it in will be posted on the course Web site http://crystal.uta.edu/~elmasri/db2/. A project demo will be required for project 2.
Note: This project can be done individually or in groups 2 students per group. All students in a group will receive the same grade. The same group members must do both projects 1 and 2.
¡©Important Note: You do not need to write a
scanner/parser. Design some convenient format and interface for inputting the
queries to your optimizing program and convert the given queries manually to
your format.