Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Mineral 2K - Utility and Queries - Final Project #4 | EECS 484, Study Guides, Projects, Research of Database Management Systems (DBMS)

Material Type: Project; Professor: Lefevre; Class: Database Mgt Syst; Subject: Electrical Engineering And Computer Science; University: University of Michigan - Ann Arbor; Term: Winter 2009;

Typology: Study Guides, Projects, Research

Pre 2010

Uploaded on 09/02/2009

koofers-user-ijl
koofers-user-ijl 🇺🇸

10 documents

1 / 12

Toggle sidebar

Related documents


Partial preview of the text

Download Mineral 2K - Utility and Queries - Final Project #4 | EECS 484 and more Study Guides, Projects, Research Database Management Systems (DBMS) in PDF only on Docsity! EECS 484: Winter 09 Minirel2K: Utilities and Queries Page 1 EECS 484, Winter 2009 Project 4: Minirel 2K: Utility and Queries Due: April 15th at 10:30 AM Introduction For the final part of the project, you will implement various database utility and query operators in the Minirel2K DBMS. The Minirel2K database system consists of three programs which are described below: 1. minirel <dbname> [SQL-file] This is the main program in the Minirel2K system, and is used to access and query the Minirel2K database dbname. When this program is invoked with a dbname, the very first thing it does is to change its working UNIX directory (using chdir) to dbname. Each Minirel2K database corresponds to a UNIX directory and each relation within that database corresponds to a file within the directory. The minirel program starts off by creating a buffer manager, opening the relation and attribute catalogs and then calling the function parse(). With the call to parse(), an infinite loop will start prompting the user for SQL commands. The parser type-checks the queries and makes appropriate calls to the functions provided by the lower layers of the system to execute the command. In this assignment, you need to implement some of the methods for query evaluation and for inserting tuples into relations. If the optional SQL-file is specified, then the program starts reading SQL statements from that file. 2. dbcreate <dbname> This program creates the database dbname. You should extend the skeleton code provided in file dbcreate.cpp to insert tuples in the catalog relations that describe the relation and attribute catalogs themselves. Note these two relations are created by creating an instance of the RelCatalog and AttrCatalog classes. More details about this will be given in the section on Catalogs. 3. dbdestroy <dbname> This program deletes the database dbname. SQL Statements The SQL parser that we provide understands only a small subset of the standard SQL language. The following SQL statements are supported: 1. CREATE TABLE TBLNAME (ATTRLIST…) The SQL “create table command”. However, the parser does not support any key or table constraints. The data types that are supported are: INTEGER, DOUBLE and CHAR. For example the following statement creates a table with three attributes: CREATE TABLE DA (IKEY INTEGER, FILLER CHAR(80), DKEY DOUBLE) 2. CREATE INDEX RELNAME (ATTRNAME) As you may recall SQL has no standard command for creating an index. This statement creates a hash index in Minirel2K. The code for hash index is provided by us (see files index.h and index.cpp). The interface is similar to the interface defined for the B+-tree class. 3. DROP TABLE RELNAME Deletes the table RelName. 4. DROP INDEX RELNAME(ATTRNAME) EECS 484: Winter 09 Minirel2K: Utilities and Queries Page 2 Deletes the index on the AttrName attribute of the relation RelName. 5. INSERT INTO RELNAME(ATTRNAMELIST) VALUES (VALUELIST) The standard SQL “insert into” command. However, the only modification is that the AttrNameList is mandatory (not optional as in SQL). For example the following statement inserts a tuple into the DA table defined above: INSERT INTO DA (IKEY, FILLER, DKEY) VALUES (11, ‘SING A SONG’’, 111.0); When the parser detects an insert query, it calls the “Updates::Insert” method (see file query.h and insert.cpp). You need to implement the code to evaluate the insert operation. Literal values for doubles in all Minirel2K SQL must be specified in the form “Num.Num”. Note that in the insert into statement above the value for dkey is specified as 111.0 and not 111. 6. SELECT PROJLIST FROM RELLIST WHERE PREDICATE: The SQL command for querying table. Only a limited form of the general SQL statement is supported. First, the RelList can have at most two relations. Second, the Predicate can only be a single predicate, and not a compound predicate with multiple predicates connected by AND or ORs. Third, all attributes must be referred to using the form “RelName.AttrName”, and not just “AttrName”. Finally, all literal values for doubles must be in the form “Num.Num”. Put another way, the following two types of queries are supported: • SELECT PROJLIST FROM RELNAME WHERE RELNAME.ATTR OP LITERAL A simple select query on a single table. When the parser detects a select query, it calls the “Operators::Select” method (see file query.h and select.cpp). You need to implement the code to evaluate the select operation. • SELECT PROJLIST FROM RELNAME1, RELNAME2 WHERE RELNAME1.ATTR1 OP RELNAME2.ATTR2 A simple join query on two tables. When the parser detects this join query, it calls the “Operators::Join” method (see file query.h and join.cpp). You need to implement the code to evaluate the join operation. 7. QUIT: The quit command exits the Minirel2K system. If the parser detects an error in the SQL statement, it prints out an error message and exits the system. Before exiting the system, it cleans up the system calling destructors on the buffer manager and catalog objects. However the database may get corrupted if the system crashed at arbitrary points (which may happen when you are testing your code). Since we don’t have a recovery manager in Minirel2K, it implies you may have to rebuild the database if it is corrupted. To destroy a database run “dbdestroy” on the database. Then reload the database (using your SQL commands). Catalogs One of the neat things about a relational database is that in addition to the data, even the metadata that describe what relations exist in the database, the types of their attributes and so on are also stored in relations, called catalogs. Minirel2K has two heapfiles called relcat and attrcat that are used to store the catalog information. The relcat relation contains one tuple for every relation in the database (including itself). The attrcat relation contains one tuple for every attribute of every relation (including the catalog relations) and this tuple contains information about the attribute. Both attrcat and relcat are created by the dbcreate utility and together they contain the schema of the database. relcat and attrcat are instances of the EECS 484: Winter 09 Minirel2K: Utilities and Queries Page 5 The file sql/catalog.sql contains a few SQL statements for querying the catalog relations. These statements will only return valid results after you have implemented the bootstrapping process. When you query the catalog relations, you will see additional temporary relations that the SQL parser creates when executing queries. Typically when a query is presented to the parser, it creates a temporary relation for the result of the query. Since the output of a SQL query is a relation, the parser creates a table for the result. The result relation is deleted after the tuples in the relation are printed in response to the users query. Temporary relation names start with the prefix “Tmp_M2K”. Since SQL is case-insensitive, the parser converts all names to lowercase before calling the catalog functions or the query/utility functions. This implies that when you create records during the bootstrapping process, all the strings in those records must be in lowercase. Otherwise when a query on the catalog relations is presented to the parser, the parser will look for catalog information using relation and attribute names that are in the lowercase, which will not match the names in the record that you insert into the catalog heapfiles. Architecture of Minirel2K Query Processing Path Component Interaction Parser (libsql.a ) Query Optimizer (select.C , join.C ) C a ta lo g s (l ib c a t. a , c a ta lo g .h , d a ta ty p e s .h ) Query Operators & Utilities (query.h , utility.h , insert.C , select.C , scanselect.C , indexselect.C , join.C , snl.C , inl.C , smj.C , sort.h , sort.C print.C , libmisc.a ) SQL liblsm.a Storage Manager Page Manager (page.h ) File Manager (heapfile.h , heapfile.C ) Hash Index Manager (index.h , index.C ) Buffer Manager (buf.h , bufMap.h ) DB/IO Layer (db.h ) Figure 1: Architecture of Minirel2K Figure 1 shows the architecture of the Minirel2K system. An incoming query is parsed by the SQL parser. After checking for syntactical errors, the parser typechecks the queries by consulting the system catalogs and checking if the relations and attributes in the query actually exist in the database system. If the SQL statement is a utility, such as create table/index or an insert statement, the parser calls the appropriate function in the “Query Operators and Utilities” layer. For this assignment, you only need to implement one utility, namely the insert utility. The remaining utilities are provided by us. EECS 484: Winter 09 Minirel2K: Utilities and Queries Page 6 For query statements, the parser determines if the query is a select query or a join query. To execute a select query, the parser calls the Operator::Select function defined in select.cpp, and for join queries it calles the Operator::Join function defined in join.cpp You need to implement both the select and the join operators. Insert Operator The signature of the insert operator is defined in the file query.h and the code for this operator must be added by your group in the file insert.cpp. Here is the definition of this function: const Status Updates::Insert(const string & relation, const int attrCnt, const attrInfo attrList[]) This function must insert a tuple with the given attribute values (in attrList) into the specified relation. The type attrInfo is defined in the file catalog.h. The value of the attribute is in attrList[I].attrValue, and the name of the attribute is in attrList[I].attrName. For the SQL INTEGER (DOUBLE) data type, attrValue is a pointer to an integer (double). Similarly for an attribute of type STRING (SQL CHAR type), attrValue points to a character string. The attrList array may not list the attributes in the same order as that present in the relation, so you may have to rearrange the attribute values before inserting a tuple into the relation. If no value is specified for an attribute in attrList, you should reject the insertion. (In a real database system, missing attribute values are implemented using NULLs.) In addition to inserting the tuple, this operator must also update all the hash indices on the relation. Use the Index::insertEntry to insert an entry into the index. Look up the system catalogs to find out information about the relation and attributes. (See Section “Getting Started” for more details.) Select Operators const Status Operators:: Select(const string & result, const int projCnt, const attrInfo projNames[], const attrInfo* attr, const Operator op, const void *attrValue) This function must implement the select operator, selecting all tuples that match the predicate. The predicate is specified via the variables attr, op and attrValue. You must implement two access methods for the select operator: one using a HeapFileScan and the other using an IndexScan. These access methods must be implemented as separate function called ScanSelect and IndexSelect. Use the files seanselect.cpp and indexselect.cpp to implement these algorithms. The Operators::Select function must implement a very simple rule for “optimizing” select queries. It must check if an index exists on the attribute in the predicate, and if the predicate is an equality predicate. If both these conditions are met, it calls the IndexSelect function, else it calls ScanSelect. This criteria works since hash indices are generally very efficient for evaluating equality predicates. Note if we had a B+-tree index instead of a hash index, we would need detailed statistics from the catalogs before choosing the select algorithm. attr can be NULL which means that the selection is unconditional. The result of the selection is stored in the result relation called result, which will be created by the parser function which calls Select. The names of the attributes of this relation are derived from the corresponding attributes in projNames, and have as suffix the attribute number. You can assume that there are no indices defined on the result relation. Projection, defined by projCnt and projNames, should be done on the fly when each result tuple is being written out. Don’t worry about eliminating duplicates during the projection. EECS 484: Winter 09 Minirel2K: Utilities and Queries Page 7 Finally, the search value, attrValue, is a pointer to a value that has the same type as attr. For SQL data types of INTEGER, DOUBLE, CHAR, the attrValue points to the C++ types int&, double&, and char* respectively. Join Operator const Status Operators:: Join(const string & result, const int projCnt, const attrInfo projNames[], const attrInfo *attr1, const Operator op, const attrInfo *attr2) This function must implement the join operator that joins two relations based on the predicate specified using the variables “attr1, op and attr2”. You must implement three join algorithms: simple nested-loops (not the page oriented or block nested-loops), indexed-nested loops and sort-merge join. These three algorithms must be implemented as separate function called SNL, INL, and SMJ respectively, in the files snl.cpp, inl.cpp and smj.cpp respectively. Operators::Join chooses amongst these alternatives based on the join predicate and the index availability on the join attributes. The order of preference for the algorithms is first INL, then SMJ and finally SNL. Collectively these algorithms allow evaluating both equi-joins and non-equi-joins. Non-equi-join must be processed using SNL. If it is an equi-join and an index exists on either attr1 or attr2, you should use the INL join algorithm. If indices exist on both you can arbitrarily choose which index to use. Finally, if it is an equi-join and no indices exist on either of the join attributes, you should use the SMJ algorithm. For implementing sort-merge join, you can use the sorting code provided in files sort.h and sort.cpp . You need to implement a simplified version of the general sort-merge join algorithm. Start by calling the BufMgr::numUnpinnedPages to determine the number of pages that are unpinned in the buffer pool. Set the number of pages that you will use for the sort at 80% of this number (as with most resources performance starts going down because of thrashing if the resource is overcommitted). Now you will have a number, k, that tells you how many pages in the buffer pool you can use for the sort. Consult the system catalogs and determine the size of the tuple (in number of bytes). Since Minirel pages are 1KB (defined by PAGESIZE), you can calculate how many tuples, n, would be contained in k pages. Then start scanning the first relation using a HeapFileScan, and read upto n tuples. For each tuple, copy out the sort key value and the rid of the tuple into an array that you allocate in memory (outside the buffer pool). Sort this array and then write the tuples out to a new temporary heapfile in the sorted order. This temporary file has the first sorted run for the first relation. Then, go on to produce the second sorted run for the first relation. Store each run in a separate heapfile. At the end of this phase, you will have a number of sorted runs, each with approximately n tuples. Repeat the same process for the second relation and produce sorted runs for that relation too. Finally merge the sorted-runs by starting upto k HeapFileScans on the sorted runs. Keep merging until the sort-merge algorithm is completed. For this assignment, you are allowed to simplify the implementation of the sort-merge join in the following way. In the sorted runs that are written out to the disk, write the entire tuple and not just the projected attributes. You can also implement a simpler version of sort-merge join in which you sort both relations completely on the join attribute, and then perform the join (though if you think carefully, it is not too hard to implement the merge step that does both the sorting and the join). To handle duplicates you will need to move the scan backwards! The HeapFileScan that is provide to you has two new functions setMarker() which sets a marker at the current tuple being scanned. Call the setMarker function only after you have called getRecord or scanNext to get a record. A marker will be set on the record that was fetched by the last call to getRecord/scanNext. To go back to scanning from the mark point, use the method gotoMarker(RID& rid, Record & rec), which moves the scan back to the last mark point, and retrieves the record at the marker. After calling gotoMarker, you can call scanNext to keep scanning forward from the marker point. EECS 484: Winter 09 Minirel2K: Utilities and Queries Page 10 Getting Started Start by copying the files, as suggested in “Your Assignment”. You should be able to compile the code using the given Makefile on a CAEN linux machine. % make clean % make At this point, you should see the executables “dbcreate”, “dbdestroy”, and “minirel” created. Let’s try using them. % dbcreate testdb This should create a sample, empty database called “testdb”. % dbdestroy testdb This should delete the database. Let’s create it again and try using it. % dbcreate testdb Check the contents of the directory “testdb”, where the FILE, HEAPFILE, etc., for the testdb will reside. You will notice that there are only two files there: “attrcat” and “relcat”. This is the initial catalog data for the empty database. Let’s now create some user tables by executing some SQL commands. % minirel testdb sql/insert.sql We created an empty database “testdb”, and are now executing the SQL commands in sql/insert.sql. The first two commands in sql/insert.sql are CREATE TABLE soaps(soapid integer, name char(32), network char(4), rating double); CREATE TABLE stars(starid integer, real_name char(20), plays char(12), soapid integer) After the execution of the above, you should see two additional files in the testdb directory. % ls testdb attrcat relcat soaps stars We just created two relations: soaps and stars. The CREATE TABLE command is already implemented for you. The catalog data for the two tables is also added to attrcat and relcat files. The sql/insert.sql file also contains INSERT commands. Those commands are read by the parser (already implemented). The parser invokes the insert method in insert.cpp. At this point, the method does nothing. EECS 484: Winter 09 Minirel2K: Utilities and Queries Page 11 A good starting point is to implement the insert method first. That will orient you to using most of the other classes effectively, such as Catalog, Heapfile, and Index. Let’s follow what should happen for the first INSERT command in sql/insert.sql: INSERT INTO stars(starid, real_name, plays, soapid) VALUES (100, 'Posey, Parker', 'Tess', 6); To insert a record into “stars”, the insert method needs to:  Insert the record into the Heapfile corresponding to “stars”.  Insert the record ID into each index that exists on ‘stars”. But to insert the record, you first need to map the data from the arguments to insert into a Record object. The Record object is defined in page.h as: struct Record { void* data; int length; }; The data that is passed in is in the “attrList” array, which is an array of attrInfo in catalog.h: typedef struct { char relName[MAXNAME]; // relation name char attrName[MAXNAME]; // attribute name int attrType; // INTEGER, FLOAT, or STRING int attrLen; // not used for INSERT. Usually -1. void *attrValue; // ptr to binary value (used // by the parser for insert into // statements) // [In some versions of the SQL // parser, attrValue is also used // to hold the default value // specified during the create table // command] // } attrInfo; For the above INSERT command, the 4 elements of the array will be something like the following: [<”stars”, “starid”, INTEGER, -1, pointer to 4 bytes containing 100>, <”stars”, “real_name” STRING, -1, pointer to “Posey, Parker”> <”stars”, “plays”, STRING, -1, pointer to “Tess”> <stars”, “soapid”, INTEGER, -1, pointer to 4 bytes containing 6>] The attrCnt argument to insert.cpp will be 4. Note that strings are null-terminated C-strings. We have to pack the above attribute values into a record for the relation. The CREATE command for stars specified the order of fields, types, etc. So, we have to look up the System Catalog to determine the EECS 484: Winter 09 Minirel2K: Utilities and Queries Page 12 attributes of “stars”, their offset in the record, and their data type. Note that it is not necessary that the attrList array contains the attributes in the correct order. attrCat is a global variable that points to the attribute table for the Catalog. You can use attrCat->getRelInfo(const string &rName, int &attrCnt, RelDesc& relattrs) to retrieve information about the attributes of the “stars” relation. This will tell you the offset, data type, and the size of each attribute for packing into the record. In this case, we should expect to find that “starid” has offset 0 and length 4. “real_name” has offset 4 and length 20, “plays” has offset 24 and length 12, and “soapid” has offset 36 and length “4”. The total length of the record becomes 40. (If the record is packed in a different way by the CREATE implementation, you may see different values than these.) At this point, you have all the information to allocate the memory for a record, fill it with data, and then insert it in the heapfile. To allocate the memory for the record, you need to know the record size. Let’s say the size of the record is determined to be 40. We need to create a Record object and then allocate 40 bytes for the data field of the record. Then, the data field needs to be filled in with the data from the attrList. You will need to use “memcpy” to do that. You will need to rely on the data from the RelDesc object returned by the getInfo call to determine the offsets for each attribute and the # of bytes to copy. Once the record is created, you can call insertRecord on the heapfile for the relation. First the constructor for the “stars” heapfile (which creates a heapfile if it does not exist, else opens it) needs to be called. Then, the record you created needs to be inserted. That will return a Record ID. This record ID needs to be inserted in each index for the relation. To determine if there is one or more index for the relation, you need to look at the catalogs. Then, use the interfaces defined in index.h. Make sure you free up memory when you are done with various objects. Either keep the objects, such as the index object, cached or free them up when you are done with them. Try getting the insert() method working. Once you are able to insert records in your tables, you are on your way to going to the next step of implementing the SELECT operators. Implementing the JOIN will take substantial time as you also have to implement external sorting. Make sure to start early, and set intermediate goals to avoid falling behind.
Docsity logo



Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved