CMSC 424 – Database Design
Lecture 17: Query Processing
Mihai Pop

Admin
• Homework 3 is on the website
• Project part 1 is due
• Midterm answers/results:
  – > 90: A (6)
  – 75–90: B (9)
  – 50–75: C (6)
  – < 50: D (1)
  – 10 improved, 10 got worse, 2 stayed about the same

Multiple Index Selection
GOAL: apply the most restrictive predicate first, and combine multiple predicates to reduce the intermediate results AS EARLY AS POSSIBLE.
• Conjunctive selection using one index on A: select using the index on A, then apply the remaining predicates to the retrieved tuples.
• Conjunctive selection using a composite-key index on (R.A, R.B): build a composite key or key range from the query values and search directly (a range search is possible on the first attribute only).
• Conjunctive selection using two indexes on A and B: search each separately and intersect the tuple identifiers (TIDs).
• Disjunctive selection using two indexes on A and B: search each separately and take the union of the TIDs.

Join Methods: Nested Loop
• tuple-oriented:

    for each tuple t(r) in r do begin
      for each tuple t(s) in s do begin
        join(t(r), t(s)) and append the result to the output
      end
    end

• block-oriented:

    for each block b(r) in r do begin
      for each block b(s) in s do begin
        join(b(r), b(s)) and append the result to the output
      end
    end

• reverse inner loop: similar to the above, but for even-numbered outer blocks the inner relation is scanned in reverse, so the inner block still in the buffer is reused first.

Cost of Block-Oriented Nested Loop
• The cost depends on the number of buffers and the buffer replacement strategy (buffer size is M + 1):
  – Pin 1 block for the outer relation and M for the inner, with LRU replacement:
    cost: b(r) + b(r) * b(s), assuming b(s) > M
  – Pin M blocks for the outer relation and 1 for the inner:
    1: read M blocks from the outer; cost: M blocks
    2: join each block of s with the M resident blocks; cost: b(s)
    3: repeat with the next M blocks of r until all are done; repeated ⌈b(r)/M⌉ times
    cost = (M + b(s)) * b(r)/M = b(r) + [b(r) * b(s)] / M
• Which relation should be the outer?

Hash-Join Algorithm Details
The hash-join of r and s is computed as follows; relation s is called the build input and r is called the probe input.
1. Partition relation s using hash function h. While partitioning, one block of memory is reserved as the output buffer for each partition.
2. Partition r similarly.
3. For each i:
  (a) Load si into memory and build an in-memory hash index on its join attribute. This index uses a different hash function than the earlier one, h.
  (b) Read the tuples in ri from disk one by one. For each tuple tr, locate each matching tuple ts in si using the in-memory hash index and output the concatenation of their attributes.

Example of Cost of Hash-Join
• M = 20 blocks
• bdepositor = 100
• bcustomer = 400
• depositor is the build input. Partition it into 5 partitions, each of size 20 blocks; this can be done in one pass.
• Partition customer into 5 partitions, each of size 80; this is also done in one pass.
• Do the partition joins: for each j, put the 20 blocks of depositor(j) in memory, build the hash index, and probe with the 80 blocks of customer(j).
• Total cost, ignoring the cost of writing partially filled blocks:
  3 * (100 + 400) = 1500 I/Os

Hash-Join Algorithm (Cont.)
• The number of partitions n and the hash function h are chosen so that each si fits in memory.
  – Typically n is chosen as ⌈bs/M⌉ * f, where f is a "fudge factor", typically around 1.2.
  – The probe-relation partitions ri need not fit in memory.
• Recursive partitioning is required if the number of partitions n is greater than the number of memory pages M.
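As a rough illustration of the partition and build/probe phases, here is a minimal in-memory Python sketch (not the slide's disk-based algorithm). It assumes relations are plain lists of (join_key, payload) tuples; Python's built-in dict stands in for the second, in-memory hash function, and the partition count n is an arbitrary illustrative choice:

```python
# Illustrative in-memory sketch of the hash-join phases above.
# Relations are plain lists of (join_key, payload) tuples.

def hash_join(r, s, n=5):
    # Steps 1-2: partition the build input s and the probe input r
    # with the same hash function h (here: hash() mod n).
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in s:
        s_parts[hash(t[0]) % n].append(t)
    for t in r:
        r_parts[hash(t[0]) % n].append(t)

    # Step 3: for each partition i, build an in-memory hash index on
    # s_i (a Python dict stands in for the second hash function), then
    # probe it with the tuples of r_i.
    out = []
    for i in range(n):
        index = {}
        for key, payload in s_parts[i]:
            index.setdefault(key, []).append(payload)
        for key, payload in r_parts[i]:
            for match in index.get(key, []):
                # output the concatenation of the matching tuples' attributes
                out.append((key, payload, match))
    return out
```

The sketch keeps everything in memory; the point of the real algorithm is that only one build partition si must fit in memory at a time.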
  – Instead of partitioning n ways, use M – 1 partitions for s.
  – Further partition the M – 1 partitions using a different hash function.
  – Use the same partitioning method on r.
  – Rarely required in practice: e.g., recursive partitioning is not needed for relations of 1 GB or less with 2 MB of memory and a 4 KB block size.

Other Operations
• Outer joins:
  – Left outer join: easy
  – Right/full outer join: may need some bookkeeping
• Duplicate elimination (hard):
  – Sort at the end and eliminate duplicates, or
  – Hash the output and eliminate duplicates
• Aggregates:
  – Sum, count, min, and max are easily maintained during execution
  – Avg = sum / count
  – Std can be kept via a running sum of squares (ssum): sqrt(ssum/count – avg²)

External Sorting with Sort-Merge
• External vs. internal sorting: the relation/file does not fit in memory.
• Create-runs phase:

    repeat until done
      read M blocks of the relation (or the rest, if <= M remain)
      internal sort using any sort method, e.g. QuickSort
      write the sorted tuples out as a run
    end

• Merge-runs phase:

    read one block from each run;
    repeat
      merge tuples onto the result;
      advance the pointer in the run you appended from last;
      if the block of a run is exhausted, read its next block
    until all blocks of all runs are done

• This assumes that one block from each run can be kept in main memory. If not, the same algorithm must be applied in multiple passes.

External Merge Sort Cost
• Cost analysis:
  – Initial number of runs: ⌈br/M⌉
  – Total number of merge passes required: ⌈logM–1(br/M)⌉
  – Block transfers for initial run creation: br + br = 2br (one read and one write of the relation)
  – For the final pass we do not count the write cost; we ignore the final write cost for all operations, since the output of an operation may be pipelined to the display or to a parent operation without being written to disk. If pipelined, it is counted in the cost of the follow-up operator.
• Thus the total number of block transfers for external sorting is:
  2br * ⌈logM–1(br/M)⌉ + br = br * (2⌈logM–1(br/M)⌉ + 1)
  – If M ≥ br/M, only one merge pass is required: ⌈logM–1(br/M)⌉ = 1, so total cost = 3br.
  – If M > br, the relation fits in memory, the log term evaluates to 0, and total cost = br only (just the initial read).
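The two phases and the cost formula above can be sketched in Python. This is an illustrative sketch under simplifying assumptions: list slices of size M stand in for M-block buffers, heapq.merge stands in for the (M – 1)-way merge, and sort_cost evaluates br * (2⌈logM–1(br/M)⌉ + 1):

```python
from math import ceil, log
import heapq

def external_sort(values, M):
    # Create-runs phase: take M items at a time (standing in for M
    # blocks), sort each chunk internally, and keep it as a sorted run.
    runs = [sorted(values[i:i + M]) for i in range(0, len(values), M)]
    # Merge-runs phase: repeatedly emit the smallest head element across
    # all runs. heapq.merge stands in for the (M - 1)-way merge; a real
    # implementation needs multiple passes if there are too many runs.
    return list(heapq.merge(*runs))

def sort_cost(b_r, M):
    # Total block transfers: b_r * (2 * ceil(log_{M-1}(b_r / M)) + 1),
    # i.e. 2*b_r for run creation plus 2*b_r per merge pass, with the
    # final write not counted.
    runs = ceil(b_r / M)                              # initial number of runs
    passes = ceil(log(runs, M - 1)) if runs > 1 else 0
    return b_r * (2 * passes + 1)
```

For the slide's one-pass case (M ≥ br/M), sort_cost gives 3br: sort_cost(100, 20) is 300; and when the relation fits in memory, sort_cost(10, 20) is just 10.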