Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Understanding Files and Indexes in Database Systems, Schemes and Mind Maps of English

The concept of records identifiers (RID), file of records abstraction, and the role of indexes in optimizing retrieval operations. It covers primary and secondary indexes, clustered and unclustered indexes, and their impact on search and retrieval operations. The document also discusses hash indexes and B+-tree indexes, their differences, and their use cases.

Typology: Schemes and Mind Maps

2021/2022

Uploaded on 09/27/2022

dyanabel
dyanabel 🇺🇸

4.7

(22)

53 documents

1 / 11

Toggle sidebar

Related documents


Partial preview of the text

Download Understanding Files and Indexes in Database Systems and more Schemes and Mind Maps English in PDF only on Docsity! 8 OVERVIEW OF STORAGE AND INDEXING Exercise 8.1 Answer the following questions about data on external storage in a DBMS: 1. Why does a DBMS store data on external storage? 2. Why are I/O costs important in a DBMS? 3. What is a record id? Given a record’s id, how many I/Os are needed to fetch it into main memory? 4. What is the role of the buffer manager in a DBMS? What is the role of the disk space manager? How do these layers interact with the file and access methods layer? Answer 8.1 The answer to each question is given below. 1. A DBMS stores data on external storage because the quantity of data is vast, and must persist across program executions. 2. I/O costs are of primary important to a DBMS because these costs typically dominate the time it takes to run most database operations. Optimizing the amount of I/O’s for an operation can result in a substantial increase in speed in the time it takes to run that operation. 3. A record id, or rid for short, is a unique identifier for a particular record in a set of records. An rid has the property that we can identify the disk address of the page containing the record by using the rid. The number of I/O’s required to read a record, given a rid, is therefore 1 I/O. 4. In a DBMS, the buffer manager reads data from persistent storage into memory as well as writes data from memory into persistent storage. The disk space manager manages the available physical storage space of data for the DBMS. When the file 102 Overview of Storage and Indexing 103 and access methods layer needs to process a page, it asks the buffer manager to fetch the page and put it into memory if it is not all ready in memory. When the files and access methods layer needs additional space to hold new records in a file, it asks the disk space manager to allocate an additional disk page. Exercise 8.2 Answer the following questions about files and indexes: 1. What operations are supported by the file of records abstraction? 2. What is an index on a file of records? What is a search key for an index? Why do we need indexes? 3. What alternatives are available for the data entries in an index? 4. What is the difference between a primary index and a secondary index? What is a duplicate data entry in an index? Can a primary index contain duplicates? 5. What is the difference between a clustered index and an unclustered index? If an index contains data records as ‘data entries,’ can it be unclustered? 6. How many clustered indexes can you create on a file? Would you always create at least one clustered index for a file? 7. Consider Alternatives (1), (2) and (3) for ‘data entries’ in an index, as discussed in Section 8.2. Are all of them suitable for secondary indexes? Explain. Answer 8.2 The answer to each question is given below. 1. The file of records abstraction supports file creation and deletion, record creation and deletion, and scans of of individual records in a file one at a time. 2. An index is a data structure that organizes data records on disk to optimize certain kinds of retrieval operations. A search key for an index is the fields stored in the index that we can search on to efficiently retrieve all records satisfy the search conditions. Without indexes, every search would to a DBMS would require a scan of all records and be extremely costly. 3. The three main alternatives for what to store as a data entry in an index are as follows: (a) A data entry k* is an actual data record (with search key value k). (b) A data entry is a 〈 k, rid 〉 pair, where rid is the record id of a data record with search key value k. (c) A data entry is a 〈 k, rid-list 〉 pair, where rid-list is a list of record ids of data records with search key value k. 106 Chapter 8 5. 〈 11, (1,1) 〉, 〈 19, (2,1) 〉. The order of entries is significant since the order of the entries is the same as the order of data record. 6. 〈 11, (1,1) 〉, 〈 19, (2,1), (2,2) 〉. The order of entries is significant since the order of the entries is the same as the order of data record. 7. Contradiction. Cannot build unclustered index using Alternative (1) since method is inherently clustered. 8. 〈 1.8, (1,1) 〉, 〈 2.0, (1,2) 〉, 〈 3.2, (2,1) 〉, 〈 3.4, (1,3) 〉, 〈 3.8, (2,2) 〉. The order of entries is not significant. 9. 〈 1.8, (1,1) 〉, 〈 2.0, (1,2) 〉, 〈 3.2, (2,1) 〉, 〈 3.4, (1,3) 〉, 〈 3.8, (2,2) 〉. The order of entries is not significant. 10. Alternative (1) cannot be used to build a clustered index on gpa because the records in the file are not sorted in order of gpa. Only if the entries in (1,3) and (2,1) were switched would this possible, but then the data would no longer be sorted on age as previously defined. 11. Alternative (2) cannot be used to build a clustered index on gpa because the records in the file are not sorted in order of gpa. Only if the entries in (1,3) and (2,1) were switched would this possible, but then the data would no longer be sorted on age as previously defined. 12. Alternative (3) cannot be used to build a clustered index on gpa because the records in the file are not sorted in order of gpa. Only if the entries in (1,3) and (2,1) were switched would this possible, but then the data would no longer be sorted on age previously defined. Exercise 8.5 Explain the difference between Hash indexes and B+-tree indexes. In particular, discuss how equality and range searches work, using an example. Answer 8.5 A Hash index is constructed by using a hashing function that quickly maps an search key value to a specific location in an array-like list of elements called buckets. The buckets are often constructed such that there are more bucket locations than there are possible search key values, and the hashing function is chosen so that it is not often that two search key values hash to the same bucket. A B+-tree index is constructed by sorting the data on the search key and maintaining a hierarchical search data structure that directs searches to the correct page of data entries. Insertions and deletions in a hash based index are relatively simple. If two search values hash to the same bucket, called a collision, a linked list is formed connecting multiple records in a single bucket. In the case that too many of these collisions occur, the number of buckets is increased. Alternatively, maintaining a B+-tree’s hierarchical search data structure is considered more costly since it must be updated whenever there Overview of Storage and Indexing 107 are insertions and deletions in the data set. In general, most insertions and deletions will not modify the data structure severely, but every once in awhile large portions of the tree may need to be rewritten when they become over-filled or under-filled with data entries. Hash indexes are especially good at equality searches because they allow a record look up very quickly with an average cost of 1.2 I/Os. B+-tree indexes, on the other hand, have a cost of 3-4 I/Os per individual record lookup. Assume we have the employee relation with primary key eid and 10,000 records total. Looking up all the records individually would cost 12,000 I/Os for Hash indexes, but 30,000-40,000 I/Os for B+- tree indexes. For range queries, hash indexes perform terribly since they could conceivably read as many pages as there are records since the data is not sorted in any clear grouping or set. On the other hand, B+-tree indexes have a cost of 3-4 I/Os plus the number of qualifying pages or tuples, for clustered or unclustered B+-trees respectively. Assume we have the employees example again with 10,000 records and 10 records per page. Also assume that there is an index on sal and query of age ¿ 20,000, such that there are 5,000 qualifying tuples. The hash index could cost as much as 100,000 I/Os since every page could be read for every record. It is not clear with a hash index how we even go about searching for every possible number greater than 20,000 since decimals could be used. An unclustered B+-tree index would have a cost of 5,004 I/Os, while a clustered B+-tree index would have a cost of 504 I/Os. It helps to have the index clustered whenever possible. Exercise 8.6 Fill in the I/O costs in Figure 8.2. File Scan Equality Range Insert Delete T ype Search Search Heap file Sorted file Clustered file Unclustered tree index Unclustered hash index Figure 8.2 I/O Cost Comparison Answer 8.6 The answer to the question is given in Figure 8.3. We use B to denote the number of data pages total, R to denote the number of records per page, and D to denote the average time to read or write a page. 108 Chapter 8 File Scan Equality Range Insert Delete T ype Search Search Heap file BD 0.5BD BD 2D Search+ D Sorted file BD Dlog2B Dlog2B+# Search+ Search+ matching pages BD BD Clustered file 1.5BD DlogF 1.5B DlogF B+# Search+ Search+ matching pages D D Unclustered tree index BD(R+ D(1+ D(logF 0.15B+# D(3+ Search+ 0.15) logF 0.15B) matching records) logF 0.15B) 2D Unclustered hash index BD(R+ 2D BD 4D Search+ 0.125) 2D Figure 8.3 I/O Cost Comparison Exercise 8.7 If you were about to create an index on a relation, what considerations would guide your choice? Discuss: 1. The choice of primary index. 2. Clustered versus unclustered indexes. 3. Hash versus tree indexes. 4. The use of a sorted file rather than a tree-based index. 5. Choice of search key for the index. What is a composite search key, and what considerations are made in choosing composite search keys? What are index-only plans, and what is the influence of potential index-only evaluation plans on the choice of search key for an index? Answer 8.7 The answer to each question is given below. 1. The choice of the primary key is made based on the semantics of the data. If we need to retrieve records based on the value of the primary key, as is likely, we should build an index using this as the search key. If we need to retrieve records based on the values of fields that do not constitute the primary key, we build (by definition) a secondary index using (the combination of) these fields as the search key. 2. A clustered index offers much better range query performance, but essentially the same equality search performance (modulo duplicates) as an unclustered index. Overview of Storage and Indexing 111 Emp(eid: integer, sal: integer, age: real, did: integer) There is a clustered index on eid and an unclustered index on age. 1. How would you use the indexes to enforce the constraint that eid is a key? 2. Give an example of an update that is definitely speeded up because of the available indexes. (English description is sufficient.) 3. Give an example of an update that is definitely slowed down because of the indexes. (English description is sufficient.) 4. Can you give an example of an update that is neither speeded up nor slowed down by the indexes? Answer 8.10 The answer to each question is given below. 1. To enforce the constraint that eid is a key, all we need to do is make the clustered index on eid unique and dense. That is, there is at least one data entry for each eid value that appears in an Emp record (because the index is dense). Further, there should be exactly one data entry for each such eid value (because the index is unique), and this can be enforced on inserts and updates. 2. If we want to change the salaries of employees whose eid’s are in a particular range, it would be sped up by the index on eid. Since we could access the records that we want much quicker and we wouldn’t have to change any of the indexes. 3. If we were to add 1 to the ages of all employees then we would be slowed down, since we would have to update the index on age. 4. If we were to change the sal of those employees with a particular did then no advantage would result from the given indexes. Exercise 8.11 Consider the following relations: Emp(eid: integer, ename: varchar, sal: integer, age: integer, did: integer) Dept(did: integer, budget: integer, floor: integer, mgr eid: integer) Salaries range from $10,000 to $100,000, ages vary from 20 to 80, each department has about five employees on average, there are 10 floors, and budgets vary from $10,000 to $1 million. You can assume uniform distributions of values. For each of the following queries, which of the listed index choices would you choose to speed up the query? If your database system does not consider index-only plans (i.e., data records are always retrieved even if enough information is available in the index entry), how would your answer change? Explain briefly. 112 Chapter 8 1. Query: Print ename, age, and sal for all employees. (a) Clustered hash index on 〈ename, age, sal〉 fields of Emp. (b) Unclustered hash index on 〈ename, age, sal〉 fields of Emp. (c) Clustered B+ tree index on 〈ename, age, sal〉 fields of Emp. (d) Unclustered hash index on 〈eid, did〉 fields of Emp. (e) No index. 2. Query: Find the dids of departments that are on the 10th floor and have a budget of less than $15,000. (a) Clustered hash index on the floor field of Dept. (b) Unclustered hash index on the floor field of Dept. (c) Clustered B+ tree index on 〈floor, budget〉 fields of Dept. (d) Clustered B+ tree index on the budget field of Dept. (e) No index. Answer 8.11 The answer to each question is given below. 1. We should create an unclustered hash index on 〈ename, age, sal〉 fields of Emp (b) since then we could do an index only scan. If our system does not include index only plans then we shouldn’t create an index for this query (e). Since this query requires us to access all the Emp records, an index won’t help us any, and so should we access the records using a filescan. 2. We should create a clustered dense B+ tree index (c) on 〈floor, budget〉 fields of Dept, since the records would be ordered on these fields then. So when executing this query, the first record with floor = 10 must be retrieved, and then the other records with floor = 10 can be read in order of budget. Note that this plan, which is the best for this query, is not an index-only plan (must look up dids).
Docsity logo



Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved