Database Modeling and Design, 3rd Edition
Toby J. Teorey, University of Michigan

Lecture Notes

Contents

I. Database Systems and the Life Cycle (Chapter 1) ... 2
   Introductory concepts; objectives of database management ... 2
   Relational database life cycle ... 3
   Characteristics of a good database design process ... 7
II. Requirements Analysis (Chapter 3) ... 8
III. Entity-Relationship (ER) Modeling (Chapters 2-4) ... 11
   Basic ER modeling concepts ... 11
   Schema integration methods ... 22
   Entity-relationship clustering ... 26
   Transformations from ER diagrams to SQL tables ... 29
IV. Normalization and Normal Forms (Chapter 5) ... 35
   First normal form (1NF) to third normal form (3NF) and BCNF ... 35
   3NF synthesis algorithm (Bernstein) ... 42
   Fourth normal form (4NF) ... 47
V. Access Methods (Chapter 6) ... 50
   Sequential access methods ... 50
   Random access methods ... 52
   Secondary indexes ... 58
   Denormalization ... 62
   Join strategies ... 64
VI. Database Distribution Strategies (Chapter 8) ... 66
   Requirements of a generalized DDBMS: Date's 12 Rules ... 68
   Distributed database requirements ... 72
   The non-redundant "best fit" method ... 74
   The redundant "all beneficial sites" method ... 77
VII. Data Warehousing, OLAP, and Data Mining (Chapter 9) ... 79
   Data warehousing ... 79
   On-line analytical processing (OLAP) ... 86
   Data mining ... 93

Revised 11/18/98 - modify Section V
Revised 11/21/98 - insertions into Section VII
Revised 1/14/99 - modify Section VI
Revised 2/11/99 - modify Section IV, 4NF (p. 47 FD, MVD mix)
Revised 6/13/00 - modify Section V (secondary indexes)

I. Database Systems and the Life Cycle

Introductory Concepts

data -- a fact, something upon which an inference is based (information or knowledge has value, data has cost)

data item -- smallest named unit of data that has meaning in the real world (examples: last name, address, ssn, political party)

data aggregate (or group) -- a collection of related data items that form a whole concept; a simple group is a fixed collection, e.g. date (month, day, year); a repeating group is a variable-length collection, e.g. a set of aliases

record -- a group of related data items treated as a unit by an application program (examples: presidents, elections, congresses)

file -- a collection of records of a single type (examples: president, election)

database -- a computerized collection of interrelated stored data that serves the needs of multiple users within one or more organizations, i.e. interrelated collections of records of potentially many types. Motivation for databases over files: integration for easy access and update, non-redundancy, multi-access.

database management system (DBMS) -- a generalized software system for manipulating databases. Includes the logical view (schema, sub-schema), the physical view (access methods, clustering), a data manipulation language, a data definition language, and utilities (security, recovery, integrity, etc.)

database administrator (DBA) -- the person or group responsible for the effective use of database technology in an organization or enterprise. Motivation: control over all phases of the lifecycle.

Objectives of Database Management
1. Data availability -- make an integrated collection of data available to a wide variety of users
   * at reasonable cost -- performance in query and update, eliminate or control data redundancy
   * in meaningful format -- data definition language, data dictionary
   * easy access -- query language (4GL, SQL, forms, windows, menus); embedded SQL, etc.; utilities for editing, report generation, sorting
2. Data integrity -- insure correctness and validity
   * checkpoint/restart/recovery
   * concurrency control and multi-user updates
   * accounting, audit trail (financial, legal)
3. Privacy (the goal) and security (the means)
   * schema/sub-schema, passwords
4. Management control -- DBA: lifecycle control, training, maintenance

Step II.c Transformation of the ER diagram to SQL tables

[Figure: the ER entities Customer (cust-no, cust-name), Product (prod-no, prod-name, qty-in-stock), Salesperson (sales-name, addr, dept, job-level, vacation-days), Order (order-no, sales-name, cust-no), and Order-product (order-no, prod-no) are transformed into SQL tables. The sample DDL, reconstructed from the garbled figure text, is roughly:

create table customer
  (cust_no     integer,
   cust_name   char(15),
   cust_addr   char(30),
   sales_name  char(15),
   prod_no     integer,
   primary key (cust_no),
   foreign key (sales_name) references salesperson,
   foreign key (prod_no) references product); ]

Step II.d Normalization of SQL tables (3NF, BCNF, 4NF, 5NF)

Decomposition of tables and removal of update anomalies.

[Figure: the Salesperson table is decomposed into Salesperson (sales-name, addr, dept, job-level) and Sales-vacations (job-level, vacation-days); Customer (cust-no, cust-name) and Order (order-no, sales-name, cust-no) are unchanged.]

Step III Physical Design (including denormalization)

[Figure: Customer/refined (cust-no, cust-name, sales-name).]

Physical design parameters: indexing, access methods, clustering.

Step IV Data distribution

[Figure: tables are allocated to five sites -- S1 = Ann Arbor, S2 = Detroit, S3 = Chicago, S4 = Boston, S5 = New York. For example, one site holds all parts, all salespersons, and all customers; other sites hold all parts together with the salespersons and customers in Michigan and greater Detroit, in Illinois, or in the northeast. T1, T2, T3 are transactions; the figure shows all sites where they are initiated and their read/write activity.]

Decisions: fragmentation, replication, allocation.
Objectives: min. response time, min. communication cost, max. availability.

Characteristics of a Good Database Design Process

* iterative requirements analysis
  - interview top-down
  - use simple models for data flow and data relationships
  - verify model
* stepwise refinement and iterative re-design
* well-defined design review process to reduce development costs
  review team
  - database designers
  - DBMS software group
  - end users in the application areas
  when to review
  - after requirements analysis & conceptual design
  - after physical design
  - after implementation (tuning)
  meeting format
  - short documentation in advance
  - formal presentation
  - criticize product, not person
  - goal is to locate problems, do solutions off line
  - time limit is 1-2 hours

Interviews at different levels

Top management - business definition, plan/objectives, future plans
Middle management - functions in operational areas, technical areas, job titles, job functions
Employees - individual tasks, data needed, data out
Specific end-users of a DBMS - applications and data of interest

Basic rules in interviewing
1. Investigate the business first
2. Agree with the interviewee on format for documentation (ERD, DFD, etc.)
3. Define human tasks and known computer applications
4. Develop and verify the flow diagram(s) and ER diagram(s)
5.
Relate applications to data (this helps your programmers) Example: order entry clerk Function: Take customer orders and either fill them or make adjustments. Frequency: daily Task Def Volume Data Elements 1. Create order 2000 A, B, E, H 2. Validate order 2000 A, B, G, H, J 3. Fill out error form 25 A, C 4. Reserve item/price 6000 A, D, H 5. Request alternate items 75 A, E, I, K,M 6. Enter unit price 5925 A, F, J, N 11 III. Entity-Relationship (ER) Modeling Basic ER Modeling Concepts Entity - a class of real world objects having common characteristics and properties about which we wish to record information. Relationship - an association among two or more entities * occurrence - instance of a relationship is the collective instances of the related entities * degree - number of entities associated in the relationship (binary, ternary, other n-ary) * connectivity - one-to-one, one-to-many, many-to-many * existence dependency (constraint) - optional/mandatory Attribute - a characteristic of an entity or relationship * Identifier - uniquely determines an instance of an entity * Identity dependence - when a portion of an identifier is inherited from another entity * Multi-valued - same attribute having many values for one entity * Surrogate - system created and controlled unique key (e.g. Oracle’s “create sequence”) Concept Representation & Example Entity Employee Employee- Weak entit eek end job-history Relationship Attribute identifier (key) descriptor (nonkey} multivalued descriptor mc) Te) complex attribute address 12 15 Super-class (super-type)/subclass (subtype) relationship Generalization * similarities are generalized to a super-class entity, differences are specialized to a subclass entity, called an “ISA” relationship (“specialization” is the inverse relationship) * disjointness constraint - there is no overlap among subclasses * completeness constraint - constrains subclasses to be all-inclusive of the super-class or not (i.e. 
total or partial coverage of the superclass) * special property: hierarchical in nature * special property: inheritance - subclass inherits the primary key of the super-class, super-class has common nonkey attributes, each subclass has specialized non-key attributes Aggregation * “part-of” relationship among entities to a higher type aggregate entity (“contains” is the inverse relationship) * attributes within an entity, data aggregate (mo-day-year) * entity clustering variation: membership or “is-member-of” relationship a supertype Employee subtypes rionauer | [engineer | [Teste] [socrotory ] (a) Generalization with disjoint subtypes Individual (b) Generalization with overlapping subtypes and completeness constraint Constraints 16 17 Constraints in ER modeling * role - the function an entity plays in a relationship * existence constraint (existence dependency) - weak entity * exclusion constraint - restricts an entity to be related to only of several other * entities at a given point in time - mandatory/optional - specifies lower bound of connectivity of entity instances - participating in a relationship as 1 or 0 * uniqueness constraint – one-to-one functional dependency among key attributes in a relationship: binary, ternary, or higher n-ary ER diagram notation entity, attribute {no operation} Employee (a) Entity with attributes Object diagram using the Blaha and Premerlani variations of UML notation [BRUQ8, BIPr98] ClassName Employee attribute emp-id: string emp-name: string job-class: integer operation change-job-class change-name is-managed- Department Employee Department|c. by Employee (b) One-to-one Division Department Division|}———@ Department has (c) One-to-many, many side optional artice 1 in N [o-<Gfind>—Mfemorove Office is-occupied- by t+ IQs te Employee (d) Gne-to-many, one side optional Project M1 Employeey—O Employee warks-on io—_————@ Project (e) Many-to-many Emplayee Employee group-leader — is- (1) Recursive binary group-eader— of relationship 20 ER diagram Object diagram using notation the UML notation IBRI9S, BIPr9s] Individual Individual type-of Employee Customer Employee Customer (a) Generalization ("is-a") relationship with supertype and nondisjoint subtypes Group Individual Individual (b) Aggregation ("part-of") relationship Project Employee Employee emp-hired Skill Skill Project skill-used praiect- assigned tc) ternary relationship 21 22 Schema Integration Methods Goal in schema integration - to create a non-redundant unified (global) conceptual schema (1) completeness - all components must appear in the global schema (2) minimality - remove redundant concepts in the global schema (3) understandability - does global schema make sense? 1 . Comparing of schemas * look for correspondence (identity) among entities * detect possible conflicts - naming conflicts homonyms - same name for different concepts synonyms - different names for the same concept - structural conflicts type conflicts - different modeling construct for the same concept (e. g. “order” as an entity, attribute, relationship) - dependency conflicts - connectivity is different for different views (e.g. job-title vs. job-title-history) - key conflicts - same concept but different keys are assigned (e.g. ID-no vs. SSN) - behavioral conflicts - different integrity constraints (e.g. null rules for optional/mandatory: insert/delete rules) * determine inter-schema properties - possible new relationships to combine schemas - possible abstractions on existing entities or create new super-classes (super-types) 2. 
Conforming of schemas * resolve conflicts (often user interaction is required) * conform or align schemas to make compatible for integration * transform the schema via - renaming (homonyms, synonyms, key conflicts) - type transformations (type or dependency conflicts) - modify assertions (behavioral conflicts) 3. Merging and restructuring * superimpose entities * restructure result of superimposition M Publication 1 Topic-area Department] written-far, research ares Contractor (atiress) 25 26 Entity-Relationship Clustering Motivation * conceptual (ER) models are difficult to read and understand for large and complex databases, e.g. 10,000 or more data elements * there is a need for a tool to abstract the conceptual database schema (e. g. clustering of the ER diagram) * potential applications - end user communication - application design team communication - documentation of the database conceptual schema (in coordination with the data dictionary) Clustering Methodology Given an extended ER diagram for a database..... Step 1. Define points of grouping within functional areas. Step 2. Form entity clusters * group entities within the same functional area * resolve conflicts by combining at a higher functional grouping Step 3. Form higher entity clusters. Step 4. Validate the cluster diagram. * check for consistency of interfaces. * end-users must concur with each level. Department Contractor Project (a) ER model before clustering Gepartment Contractor 1 1 > Report DN NM] Report Report Suthor Pp t does rojec fentity cluster, 4 it rm I {b) ER model after clustering 27 Report. Abbreviation (a) one-to-one, both entities mandatory Department Every report has one abbreviation, and every abbreviation represents exactly one report. create table report (report_no integer, report_name varchar(256), primary key(report_na); create table abbreviation (abbrno char(6), report_no integer not null unique, primary key Cabbrina), foreign key (report_nojreferences report on delete cascade on update cascade); Every department must have a manager, but an employee can be a manager of at most ane department. create table department (dept-no integer, dept_name charf20}, rigr-id char€1) not nul] unique, primary key (dept_no), foreign key (mqr_id) references employee on delete set default on update cascade); Employee create table employee Cerp_id char¢1o), (b) one-to-one, one entity emp_name char(20), . optional, one mandatory primary key (emp-id)); Some desktop computers are allocated to engineers, Engineer but not necessarily to all engineers. Desktop (c) one-to-one, both entities optional Create table engineer Cemp_id char(10), desktop_no integer, primary key (emp_id)); create table desktop (desktop_no integer, emp-id char{10}, primary key (desktop_no), foreign key (emp_id) references engineer on delete set null on update cascade); 30 Every employee works in exactly one department, and Department each department has at least one employee create table department | (dept_no integer, dept_name char{20}, primary key (dept_no}); N create table employee (emp_id char(10), emp_name chart20}, dept_no integer not null, (qd) one-to-many, both primary key Cemp_id), entities mandatory foreign key (dept_na) references department on delete set default on update cascade); Employee Each department publishes one or more reports. A given report may not necessarily be published by a department. 
create table department (dept_no integer, publishes dept_name chart20), primary key (dept_no}); N create table report (report_no integer, Department Report dept_no integer, primary key (report_na), (e) one-to-many. ane entity foreign key (dept_no} references department optional, one unknown on delete set null on update cascade); Every professional association could have none, one, or Engineer thany engineer members. Each engineer could be a member of none, one, or many professional associations create table engineer Cermpid char(1o), primary key Cemp_id}); create table prof_assoc (assoc_name varchar(256), primary key (assoc_name)); Prof-agsoc create table belongs_to (emp_id char{ 10), assoc_name varchar(256), (f) many-to-many, both primary key (emp_id, assoc-name}, entities optional foreign key (emp_id) references engineer on delete cascade on update cascade, foreign key (assoc_name ) references prof-assoc on delete cascade on update cascade); 31 Employee (a) one-to-one, both sides optional Engineer (b) one-to-many, one side Any employee is allowed to be married to another employee in this campany. create table employee (emp_id char(10), emp_name chart20}, spouse_id char(10}, primary key (emp_id), foreign key (spouse _id) references employee on delete set null on update cascade); Engineers are divided into groups for certain projects. Each group has 4 leader. create table engineer (emp-id char(10), leader_id char(10) not null, primary key Cemp_id), foreign key (leader_id) references engineer on delete set default on update cascade); mandatory, many side optional Employee is coauthor wi th (c) many-to-many, both sides optional Each employee has the opportunity to coauthor a report with one or more other employees, or to write the report alone create table employee Cernp_id char(1o), emp_name char(20), primary key Cemp_id)), Create table coauthor (author_id char(1o), coauthorid char(1o), primary key (author_id, coauthor-id), foreign key (author_id) references employee on delete cascade on update cascade, foreign key (coauthor_id) references employee on delete cascade on update cascade); 32 35 IV. Normalization and Normal Forms First normal form (1NF) to third normal form (3NF) and BCNF Goals of normalization 1. Integrity 2. Maintainability Side effects of normalization * Reduced storage space required (usually, but it could increase) * Simpler queries (sometimes, but some could be more complex) * Simpler updates (sometimes, but some could be more complex) First normal form (1NF) -- a table R is in 1NF iff all underlying domains contain only atomic values, i.e. there are no repeating groups in a row. functional dependency —given a table R, a set of attributes B is functionally dependent on another set of attributes A if at each instant of time each A value is associated with only one B value. This is denoted by A -> B. A trivial FD is of the form XY --> X (subset). super-key -- a set of one or more attributes, which, when taken collectively, allows us to identify uniquely an entity or table. candidate key —any subset of the attributes of a super-key that is also a super-key, but not reducible. primary key -- arbitrarily selected from the set of candidate keys, as needed for indexing. Third normal form (3NF) A table is in 3NF if, for every nontrivial FD X --> A, either: (1) attribute X is a super-key, or (2) attribute A is a member of a candidate key (prime attribute) Boyce-Codd normal form (BCNF) A table is in BCNF if, for every nontrivial FD X --> A, (1) attribute X is a super-key. 
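Normalization applies to SQL tables like those produced by the ER-to-SQL transformation step. As a recap of that step, the sketch below restates two representative cases from the figures above: the many-to-many relationship, which becomes an intersection table whose key is the concatenation of the two foreign keys, and the recursive many-to-many relationship. Column names follow the figures; the exact column sizes and referential actions shown here are assumptions.

-- Many-to-many, both entities optional: engineers and professional
-- associations, linked by the intersection table belongs_to.
create table engineer
  (emp_id     char(10),
   primary key (emp_id));

create table prof_assoc
  (assoc_name varchar(256),
   primary key (assoc_name));

create table belongs_to
  (emp_id     char(10),
   assoc_name varchar(256),
   primary key (emp_id, assoc_name),
   foreign key (emp_id) references engineer
     on delete cascade on update cascade,
   foreign key (assoc_name) references prof_assoc
     on delete cascade on update cascade);

-- Recursive many-to-many: an employee may coauthor a report with
-- one or more other employees, or write it alone.
create table employee
  (emp_id     char(10),
   emp_name   char(20),
   primary key (emp_id));

create table coauthor
  (author_id   char(10),
   coauthor_id char(10),
   primary key (author_id, coauthor_id),
   foreign key (author_id) references employee
     on delete cascade on update cascade,
   foreign key (coauthor_id) references employee
     on delete cascade on update cascade);

In both cases the intersection table carries no attributes of its own beyond the two foreign keys, and the cascade rules simply remove the association rows when a referenced entity row is deleted.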
36 Tables, Functional Dependencies, and Normal Forms First Normal Form TABLE SUPPLIER_PART (100k rows, 73 bytes/row => 7.3 MB) SNUM SNAME STATUS CITY PNUM PNAME WT QTY S1 SMITH 20 LONDON P1 S1 SMITH 20 LONDON P2 S1 SMITH 20 LONDON P3 S1 SMITH 20 LONDON P4 S1 SMITH 20 LONDON P5 S1 SMITH 20 LONDON P6 S2 JONES 10 PARIS P1 S2 JONES 10 PARIS P2 S3 BLAKE 10 PARIS P3 S3 BLAKE 10 PARIS P5 S4 CLARK 20 LONDON P2 BOLT 22 2 10-31-89 S4 CLARK 20 LONDON P4 WRENCH 24 3 7-14-90 S4 CLARK 20 LONDON P5 CLAMP 22 7 8-20-90 S5 ADAMS 30 ATHENS P5 CLAMP 22 5 8-11-91 Functional dependencies SNUM --> SNAME, STATUS,CITY CITY --> STATUS PNUM --> PNAME, WT SNUM,PNUM,SHIPDATE --> QTY Attribute sizes (bytes) SNUM 5 PNAME 10 SNAME 20 WT 5 STATUS 2 QTY 5 CITY 10 SHIPDATE 8 PNUM 8 Total size 73 Third Normal Form TABLE PART (100 rows, 23 bytes/row => 2.3 KB) PNUM PNAME WT Functional dependencies P1 NUT 12 PNUM --> PNAME, WT P2 BOLT 17 P3 WRENCH 17 P4 WRENCH 24 P5 CLAMP 12 P6 LEVEL 19 TABLE SHIPMENT (100k rows, 26 bytes/row => 2.6 MB) SNUM PNUM QTY SHIPDATE Functional dependency S1 P1 3 1-4-90 SNUM, PNUM, SHIPDATE--> QTY S1 P2 2 2-17-90 37 S1 P3 6 11-5-89 S1 P4 2 6-30-90 S1 P5 1 8-12-91 S1 P6 5 4-21-91 S2 P1 3 5-3-90 S2 P2 4 12-31-90 S3 P3 4 3-25-91 S3 P5 2 3-27-91 S4 P2 2 10-31-89 S4 P4 3 7-14-90 S4 P5 7 8-20-90 S5 P5 5 8-11-91 NOT Third Normal Form TABLE SUPPLIER (200 rows, 37 bytes/row => 7.4 KB) SNUM SNAME STATUS CITY Functional dependencies S1 SMITH 20 LONDON SNUM --> SNAME, STATUS, CITY S2 JONES 10 PARIS CITY --> STATUS S3 BLAKE 10 PARIS S4 CLARK 20 LONDON S5 ADAMS 30 ATHENS Decomposition of Table Supplier into two Third Normal Form (3NF) Tables 40 Example of Table Design and Normalization (3NF) from a collection of FDs and an ER diagram Functional dependencies (FDs) given empid, startdate --> jobtitle, enddate empid --> empname, phoneno, officeno, projno, deptno phoneno --> officeno projno --> projname, startdate, enddate deptno --> deptname, mgrid mgrid --> deptno In general, the FDs can be derived from 1. Explicit assertions given 2. ER diagram (implied by ER constructs) 3. Intuition (your experience with the problem data) Table 1: empid, startdate --> jobtitle, enddate This table has a composite key that must be separated from functional dependencies (FDs) that involve any individual component of this key (e.g. empno) on the left side. Table 2 Let us start with the following set of FDs and then refine them, eliminating transitive dependencies within the same table. Given: empid --> empname, phoneno, officeno, projno, deptno phoneno --> officeno We need to eliminate the redundant right sides of the transitive dependencies (office_no) and put them into other tables. Thus we get: Table 2a: empid --> empname, phoneno, projno, deptno Table 2b: phoneno --> officeno Table 3: projno --> projname, startdate, enddate Table 4: deptno --> deptname, mgrid mgrid --> deptno 41 Functional Dependency Inference rules (Armstrong’s Axioms) 1. Reflexivity If Y is a subset of the attributes of X, then X->Y. X = ABCD, Y = ABC => X->Y X->X trivial case 2. Augmentation If X->Y and Z is a subset of table R (i.e. Z is any set of attributes in R), then XZ -> YZ . 3. Transitivity If X->Y and Y->Z then X->Z. 4. Pseudo-transitivity If X->Y and YW->Z then XW->Z. (transitivity is a special case of pseudo-transitivity when W is null) 5. Union If X->Y and X->Z then X->YZ. 6. Decomposition If X->YZ then X->Y and X->Z. Superkey Rule 1. Any FD involving all attributes of a table defines a super-key on the LHS of the FD. 
Given: any FD containing all attributes in the table R(W,X,Y,Z), i.e. XY -> WZ. Proof: (1) XY -> WZ given (2) XY -> XY by the reflexivity axiom (3) XY -> XYWZ by the union axiom (4) XY uniquely determines every attribute in table R, as shown in (3) (5) XY uniquely defines table R, by the definition of a table as having no duplicate rows (6) XY is therefore a super-key, by the definition of a super-key. Super-key Rule 2. Any attribute that functionally determines a Super-key of a table, is also a super-key for that table. Given: Attribute A is a super-key for table R(A,B,C,D,E), and E -> A. Proof: (1) Attribute A uniquely defines each row in table R, by the def. of a super-key (2) A -> ABCDE by the definition of a super-key and a relational table (3) E -> A given (4) E -> ABCDE by the transitivity axiom (5) E is a super-key for table R, by the definition of a super-key. 42 3NF Synthesis Algorithm (Bernstein) Basic definitions g e H set of FDs H+ closure of H - set of all FDs derivable from H using all the FD inference rules H’ cover of H - any set of FDs from which every FD in H+ can be derived H’(non-redundant) – non-redundant cover of H, i.e. a cover which contains no proper subset which is also a cover. Can be determined with quadratic complexity O(n2). Example Given a set of FDs H, determine a minimal set of tables in 3NF, while preserving all FDs and maintaining only lossless decomposition/joins. H: AB->C DM->NP D->KL A->DEFG D->M E->G L->D F->DJ PR->S G->DI PQR->ST Step 1: Eliminate any extraneous attributes in the left hand sides of the FDs. We want to reduce the left hand sides of as many FDs as possible. In general: XY->Z and X->Z => Y is extraneous (Reduction Rule 1) XYZ->W and X->Y => Y is extraneous (Reduction Rule 2) For this example we mix left side reduction with the union and decomposition axioms: DM->NP => D->NP => D -> MNP D->M D->M PQR->ST => PQR->S, PQR->T => PQR->.T PR->S PR->S PR->S Step 2: Find a non-redundant cover H’ of H, i.e. eliminate any FD derivable from others in H using the inference rules (most frequently the transitivity axiom). A->E->G => eliminate A->G from the cover A->F->D => eliminate A->D from the cover Step 3: Partition H’ into tables such that all FDs with the same left side are in one table, thus eliminating any non-fully functional FDs. (Note: creating tables at this point would be a feasible solution for 3NF, but not necessarily minimal.) R1: AB->C R4: G->DI R7: L->D R2: A->EF R5: F->DJ R8: PQR->T R3: E->G R6: D->KLMNP R9: PR->S 45 Example of a 3NF table that is not BCNF, i.e. it has further anomalies: S = student, C = course, I = instructor SC -> I For each course, each student is taught by only one instructor. A course may be taught by more than one instructor. I -> C Each instructor teaches only one course. This table is 3NF with a candidate key SC: SCI student course instructor Sutton Math Von Neumann Sutton Journalism Murrow Niven Math Von Neumann Niven Physics Fermi Wilson Physics Einstein Delete anomaly: If Sutton drops Journalism, then we have no record of Murrow teaching Journalism. How can we decompose this table into BCNF? Decomposition 1 (bad)........eliminates the delete anomaly SC (no FDs) and I -> C (two tables) Problems - 1. lossy join 2. 
dependency SC -> I is not preserved SC student course IC instructor course Sutton Math Von Neumann Math Sutton Journalism Murrow Journalism Niven Math Fermi Physics Niven Physics Einstein Physics Wilson Physics ----------------join SC and IC ------------------ SCI’ student course instructor Sutton Math Von Neumann Sutton Journalism Murrow Niven Math Von Neumann Niven Physics Fermi Niven Physics Einstein (spurious row) Wilson Physics Fermi (spurious row) Wilson Physics Einstein 46 Decomposition 2 (better).....eliminates the delete anomaly SI (no FD) and I -> C Advantages – eliminates the delete anomaly, lossless Disadvantage - dependency SC -> I is not preserved SI student instructor IC instructor course Sutton Von Neumann Von Neumann Math Sutton Murrow Murrow Journalism Niven Von Neumann Fermi Physics Niven Fermi Einstein Physics Wilson Einstein Dantzig Math (new) Sutton Dantzig (new) The new row is allowed in SI using unique(student,instructor) in the create table command, and the join of SI and IC is lossless. However, a join of SI and IC now produces the following two rows: student course instructor Sutton Math Von Neumann Sutton Math Dantzig which violates the FD SC -> I. Oracle, for instance, has no way to automatically check SC->I, although you could write a procedure to do this at the expense of a lot of overhead. Decomposition 3 (tradeoff between integrity and performance) SC -> I and I -> C (two tables with redundant data) Problems -extra updates and storage cost 47 Fourth Normal Form (4NF) Fourth normal form (4NF) -- a table R is in 4NF iff it is in BCNF and whenever there exists a nontrivial multi-valued dependency (MVD) in R, say X-->>Y, X is a super-key for R. Multi-valued dependency (MVD) X -->> Y holds whenever a valid instance of R(X,Y,Z) contains a pair of rows that contain duplicate values of X, then the instance also contains the pair of rows obtained by interchanging the Y values in the original pair. Multi-valued Dependency Inference rules (Berri, Fagin, Howard...1977 ACM SIGMOD Proc.) 1. Reflexivity X -->> X 2. Augmentation If X -->> Y, then XZ -->> Y. 3. Transitivity If X -->>Y and Y -->> Z then X -->> (Z-Y). 4. Pseudo-transitivity If X -->> Y and YW -->> Z then XW -->> (Z-YW). (transitivity is a special case of pseudo-transitivity when W is null) 5. Union If X -->> Y and X -->> Z then X -->> YZ. 6. Decomposition If X -->> Y and X -->> Z, then X -->> Y intersect Z and X -->> (Z-Y) 7. Complement If X -->> Y and Z=R-X-Y, then X -->> Z. 8. FD => MVD If X -> Y, then X -->> Y. 9. FD, MVD mix If X -->> Y and Y -> Z’ (where Z’ is contained in Z, and Y and Z are disjoint), then X->Z’. Why is 4NF useful? Avoids certain update anomalies/inefficiencies. 1. delete anomaly - two independent facts get tied together unnaturally so there may be bad side effects of certain deletes, e.g. in “skills required” the last record of a skill may be lost if employee is temporarily not working on any projects). 2. update inefficiency - adding a new project in “skills required” requires insertions for many records (rows) that to include all required skills for that new project. Likewise, loss of a project requires many deletes. 3. 4NF maintains smaller pieces of information with less redundancy. 50 V. 
Access Methods Types of Queries Query type 1: access all records of a given type “Increase everyone’s salary by 10%” access method: sequential processing Query type 2: access at most one record “Find the address of John Smith, whose id number is 333-44-5555” access methods: hashing, B+ tree index Query type 3: access a subset of records of a given type “Find all employees who have C programming experience and over three years with the company” access method: secondary indexing (Oracle uses B+trees for this) Sequential Access Methods lra = n logical record accesses sba = ceil(n/bf ) sequential block accesses rba = 0 random block accesses iotime = sba*Tsba + rba*Trba seconds where Tsba is the average disk i/o service time for a sequential block and Trba is the average disk i/o service time for a random block access Disk service time in a dedicated environment sequential block access: Tsba = rot/2 + bks/tr where rot is the disk rotation time (for a full rotation), bks is the block size in bytes (bf*record size), and tr is the disk transfer rate in bytes per second. Trba = seek(avg) + rot/2 + bks/tr where seek(avg) is the average seek time over the extent of the file on disk D isk service time in a shared environment Tsba = Trba = seek(avg) + rot/2 + bks/tr where seek(avg) is the average disk seek time over the extent of the entire disk. 51 Batch processing of k sequentially stored records read the transaction file: lra = k where k = number of transaction records sba = ceil(k/tfbf) where tfbf is the transaction file blocking factor read the master file: lra = n sba = ceil(n/bf) where bf is the master file blocking factor write a new master file: lra = n + adds - deletes sba = ceil((n+adds-deletes)/bf) where adds is the number of records added or inserted, and deletes is the number of records deleted. 52 Random Access Methods Hashing Basic mechanism – transformation of a primary key directly to a physical address, called a bucket (or indirectly via a logical address) Collisions – handled by variations of chained overflow techniques random access to a hashed file lra = 1 + overflow(avg) rba = 1 + overflow(avg) insertion into a hashed file lra = 1 + overflow(avg) + rewrite rba = 1 + overflow(avg) rba=1 for the rewrite 55 B+-tree index basic characteristics * eliminates data pointers from all nodes except the leaf nodes * each non-leaf index node has p pointers and p-1 key values * each pointer at level i is for an index block (of key/pointer pairs) at level i+1 * each leaf index has a key value/pointer pair to point to the actual data block (and record) containing that primary key value * leaf index nodes can be logically connected via pointers for ordered sequence search * hybrid method for efficient random access and sequential search Example: B+-tree To determine the order of a B+-tree, let us assume that the database has 500,000 records of 200 bytes each, the search key is 15 bytes, the tree and data pointers are 5 bytes, and the index node (and data block size) is 1024 bytes. 
For this configuration we have non-leaf index node size = 1024 bytes = p*5 + (p-1)*15 bytes p = floor((1024+15)/20) = floor(51.95) = 51 number of search key values in the leaf nodes = floor ((1024-5)/(15+5))=50 h = height of the B+-tree (number of index levels, including the leaf index nodes n = number of records in the database (or file); all must be pointed at from the next to last level, h-1 ph-1(p-1) > n (h-1)log p + log(p-1) > log n (h-1)log p > log n-log(p-1) h > 1 + (log n-log(p-1)) / log p h > 1 + (log 500,000-log 50)/log 51 = 3.34, h=4 (nearest higher integer) A good approximation can be made by assuming that the leaf index nodes are implemented with p pointers and p key values: ph > n h log p > log n h > log n/log p In this case, the result above becomes h > 3.35 or h = 4. 56 B+-tree performance read a single record (B+-tree) = h+1 rba update a single record (B+-tree) = search cost + rewrite data block = (h+1) rba + 1 rba general update cost for insertion (B+-tree) =search cost (i.e., h+1 reads) +simple rewrite of data block and leaf index node pointing to the data block (i.e., 2 rewrites) +nos*(write of new split index node + rewrite of the index node pointer to the new index node) + nosb*(write of new split data block) = (h+1) rba + 2 rba + nos*(2 rba) + nosb*(1 rba) where nos is the number of index split node operations required and nosb is the number of data split block operations required general update cost for deletion (B+-tree) = search cost (i.e., h+1 reads) + simple rewrite of data block and leaf index node pointing to the data block (i.e., 2 rewrites) + noc*(rewrite of the node pointer to the remaining node) = (h+1) rba + 2 rba + noc*(1 rba) where noc is the number of consolidations of index nodes required. As an example, consider the insertion of a node (with key value 77) to the B+-tree shown in Fig. 6.6. This insertion requires a search (query) phase and an insertion phase with one split node. The total insertion cost for height 3 is insertion cost = (3 + 1) rba search cost + (2 rba) rewrite cost + 1 split *(2 rba rewrite cost) = 8 rba “Deep s6mst gebs|seqsag Tip e472 80 eae rt byt wad (a) &*-tree before the insertion of record with key value 77 t Lolssle(le[ev)e Zi! : CEES a ere ae eS 07 8a oe. <4 <4 (b) B*-tree after the insertion and split block operation 57 60 Example: Mail Order Business Assume we have a file of 10,000,000 records of mail order customers for a large commercial business. Customer records have attributes for customer name, customer number, street address, city, state, zip code, phone number, employer, job title, credit rating, date of last purchase, and total amount of purchases. Assume that the record size is 250 bytes; block size is 5000 bytes (bf=20); and pointer size, including record offset, is 5 bytes (bfac=1000). 
The query to be analyzed is “Find all customers whose job title is ‘engineer’, city is ‘chicago’, and total amount of purchases is greater than $1,000.” For each AND condition we have the following hit rates, that is, records that satisfy each condition: job title is ‘engineer’: 84,000 records city is ‘chicago’: 210,000 records total amount of purchases > $1000: 350,000 records total number of target records that satisfy all three conditions = 750 query cost (inverted file) = merge of 3 accession lists + access 750 target records = [ceil(n1/bfac) + ceil(n2/bfac) + ceil(n3/bfac)] sba + 750 rba = [ceil(84,000/1000) + ceil(210,000/1000) + ceil(350,000/1000] sba + 750 rba = (84+210+350) sba + 750 rba = 644 sba + 750 rba If we assume Tsba is 10 milliseconds and Trba is 25 milliseconds, we obtain query iotime (secondary index) = 644 sba*10 ms + 750 rba*25 ms = 25190 ms = 25.19 sec (much more efficient than sequential scan, see below) query iotime (sequential scan) = ceil(n/bf) sba *Tsba = ceil(10,000,000/20)*10 ms = 5,000,000 ms = 5000 sec 61 Secondary Indexes using B+-trees * used by Oracle and many others for non-unique indexes * index nodes contain key/pointer pairs in the same way as a primary key index using a B+-tree * key at each level, leaf and non-leaf, is the concatenation of attributes used in the query , e.g. jobtitle, city, total_purchases (as attributes of consumer) * leaf node pointers are to the blocks containing records with the given combination of attribute values indicated in the concatenated keys * analysis of queries and updates for this type of index proceeds in the same way as a primary key (unique) index, keeping in mind that the key formats are different in the two cases query iotime (B+tree secondary index) = rba*Trba = [h + ceil(t/bfac) – 1 + t] * Trba where h is the height of the B+tree index, bfac is the blocking factor for the accession list (i.e. the number of pointer/key pairs in the leaf nodes in the B+tree), and t is the number of target records in the table that satisfies all the conditions in the query. 62 Denormalization * motivation – poor performance by normalized databases * search for potential denormalizations that avoid or minimize delete anomalies To illustrate the effect of denormalization, let us assume that the table review is associated with the tables employee and manager as the table that follows shows. The extension of the review table, review-ext, is shown as a means of reducing the number of joins required in the query shown below. This extension results in a real denormalization, that is, review_no -> emp_id -> emp_name, emp_address with the side effects of add and update anomalies. However, the delete anomaly cannot occur because the original data is redundant in the extended schema. Original Tables and Process (Query) Table Primary Key Nonkeys employee emp_id emp_name, emp_address, mgr_id manager mgr_id emp_name, emp_address review review_no emp_id, mgr_id Query: For a given review number, display the employee name and address. 
select e.emp_name, e.emp_addr from employee as e, review as r where r.review_no = ‘xxxx’ and e.emp_id = r.emp_id; Extended table review_ext is not in 3NF create table review_ext as select r.review_no, e.emp_id, e.emp_name, e.emp_addr, e.mgr_id from employee as e, review as r where e.emp_id = r.emp_id; total cost = [iotime(q) + iotime(u)]*cost(q) + volume(s)*cost(s) where cost(q) = unit cost per I/O second for query or update processes cost(s) = unit cost per byte for stored data iotime(q) = I/O service time (sec) for query processes iotime(u) = I/O service time (sec) for update processes volume(s) = total volume in bytes for stored data 65 Merge-Join Case 3: Neither project nor assigned_to are ordered by . project_name. join cost = sort time for both tables + merge time for both tables = (50,000*log2 50,000)/100 +(250*log2 250)/50 + 50,000/100 + 250/50 = 8000 + 40 + 500 + 5 = 8545 sequential block accesses (or 85.45 seconds) We see that the sort phase of the merge-join strategy is the costliest component, but it still significantly improves performance compared to the nested loop strategy. Low Selectivity Joins Let ntr=100 qualifying rows for the foreign key table (assigned_to) and let ntr=1 row for the primary key table (project) in the example below. Assume h=2 for the unique index to project, Tsba = 10 ms, and Trba = 40 ms. select p.project_name, p.project_leader, a.emp_id from project as p, assigned_to as a where p.project_name = a.project_name and p.project_name = 'financial analysis'; Indexed join Case 1: Scan foreign key table once and index to the primary key join cost = scan the entire foreign key table (assigned_to) + index to the primary key table (project) qualifying row = 50,000/100 sba + (h+1) rba = 500 sba + 3 rba (or 5.12 seconds) For the next case, assume the nonunique index height, hn = 3, index blocking factor bfac = 500, with ntr = 100 target foreign key rows as given above. Indexed join Case 2: Index to both the primary key table and the foreign key Join cost = index to the primary key table + index to the foreign key table = (h+1) rba + [hn + ceil(ntr/bfac) – 1 + ntr] rba = 3 rba + [3 + 0 + 100] rba = 106 rba (or 4.24 seconds) Indexed join Case 3: Nonunique indexes required for both tables due to join on two nonkeys. Join cost = index to the first table + index to the second table = [h1 + ceil(ntr1/bfac1) –1 + ntr1] rba + [h2 + ceil(ntr2/bfac2) –1 + ntr2] rba Hash join Case 1: join cost = scan first table (assigned_to) + scan second table (project) + access qualifying rows in the two tables = 50,000/100 sba + 250/50 sba + 100 rba + 1 rba = 505 sba + 101 rba (or 9.09 seconds) In the hash join strategy, the table scans may only have to be done infrequently as long as the hash file in RAM remains intact for a series of queries, so in Case 1 above, the incremental cost for the given query requires only 101 rba or 4.04 seconds. 66 VI. Database Distribution Strategies Overview of Distributed Databases Distributed database - a collection of multiple, logically interrelated databases distributed over a computer network [OzVa91]. Distributed Database Management System (DDBMS) - a software system that permits the management of a distributed database and makes the distribution transparent to the users. If heterogeneous, it may allow transparent simultaneous access to data on multiple dissimilar systems. Advantages 1. Improves performance, e.g. it saves communication costs and reduces query delays by providing data at the sites where it is most frequently accessed. 2. 
Improves the reliability and availability of a system by providing alternate sites from where the information can be accessed. 3. Increases the capacity of a system by increasing the number of sites where the data can be located. 4. Allows users to exercise control over their own data while allowing others to share some of the data from other sites. 5. Helps solve more complex database problems. Disadvantages 1. Increases the complexity of the system and introduces several technical as well as management challenges especially when geographical and organizational boundaries are crossed. 2. Makes central control more difficult and raises several security issues because a data item stored at a remote site can be always accessed by the users at the remote site. 3. Makes performance evaluation difficult because a process running at one node may impact the entire network. update time data Byreer awantability Cost data storage cost query time 1 2 3 4 5 6 Number of copies of the fragment 67 70 Modified Life Cycle for Data Distribution IV. Data distribution (allocation). Create a data allocation schema that indicates where each copy of each table is to be stored. The allocation schema defines at which site(s) a table is located. A one-to-one mapping in the allocation schema results in non-redundancy, while a one-to- many mapping defines a redundant distributed database. Fragmentation. Fragmentation is the process of taking subsets of rows and/or columns of tables as the smallest unit of data to be sent across the network. Unfortunately, very few commercial systems have implemented this feature, but we include a brief discussion for historical reasons. We could define a fragmentation schema of the database based on dominant applications’ “select” predicates (set of conditions for retrieval specified in a select statement). Horizontal fragmentation partitions the rows of a global fragment into subsets. A fragment r1 is a selection on the global fragment r using a predicate Pi, its qualification. The reconstruction of r is obtained by taking the union of all fragments. Vertical fragmentation subdivides the attributes of the global fragment into groups. The simplest form of vertical fragmentation is decomposition. A unique row-id may be included in each fragment to guarantee that the reconstruction through a join operation is possible. Mixed fragmentation is the result of the successive application of both fragmentation techniques. Rules for Fragmentation 1. Fragments are formed by the select predicates associated with dominant database transactions. The predicates specify attribute values used in the conjunctive (AND) and disjunctive (OR) form of select commands, and rows (records) containing the same values form fragments. 2. Fragments must be disjoint and their union must become the whole fragment. Overlapping fragments are too difficult to analyze and implement. 3. The largest fragment is the whole table. The smallest table is a single record. Fragments should be designed to maintain a balance between these extremes. 71 Data Distribution Data distribution defines the constraints under which data allocation strategies may operate. They are determined by the system architecture and the available network database management software. The four basic data distribution approaches are : * Centralized In the centralized database approach, all the data are located at a single site. The implementation of this approach is simple. 
However, the size of the database is limited by the availability of the secondary storage at the central site. Furthermore, the database may become unavailable from any of the remote sites when communication failures occur, and the database system fails totally when the central site fails. * Partitioned In this approach, the database is partitioned by tables, and each table is assigned to a particular site. This strategy is particularly appropriate where either local secondary storage is limited compared to the database size, the reliability of the centralized database is not sufficient, or operating efficiencies can be gained through the exploitation of the locality of references in database accesses. * Replicated The replicated data distribution strategy allocates a complete copy of the database to each site in the network. This completely redundant distributed data strategy is particularly appropriate when reliability is critical, the database is small, and update inefficiency can be tolerated. * Hybrid The hybrid data distribution strategy partitions the database into critical and non-critical tables. Non-critical tables need only be stored once, while critical tables are duplicated as desired to meet the required level of reliability. 72 Distributed Database Requirements Database Description 1. Conceptual schema (ER diagram) 2. Transactions: functions and data accessed Configuration Information 1. Sources of data—where data can be located. 2. Sinks of data—where user transactions can be initiated and data transferred. 3. Transaction rate (frequency) and volume (data flow). 4. Processing capability at each site—CPU and I/O capability (speed). 5. Security—data ownership (who can update) and access authorization (who can query) for each transaction. 6. Recovery—estimated frequency and volume of backup operations. 7. Integrity — referential integrity, concurrency control, journaling, overhead, etc. Constraints 1. Network topology: Ethernet, token ring, ATM 2. Processing capability needed at each site. 3. Channel (link) transmission capacity. 4. Availability—related to mean-time-between-failures (MTBF) and mean-time-to-repair (MTTR). Objective Functions 1. Response time as a function of transaction size. 2. Total system cost—communications, local I/O, cpu time, disk space. 75 Local Reference Computations Our goal is to compute the number of local references to each table residing at each site, one by one. The site that maximizes the local references to a given table is chosen as the site where that table should reside. Table Site Trans. T1(freq) T2(freq) T3(freq) Total local refs R1 S1 3 read,1 write(1) 0 0 4 S2 0 2 read(2) 0 4 S3 0 0 0 0 S4 3 read,1 write(1) 2 read(2) 0 8 (max.) S5 3 read,1 write(1) 0 0 4 R2 S1 2 read(1) 0 0 2 S2 0 0 0 0 S3 0 0 3 read,1 write(3) 12 S4 2 read(1) 0 0 2 S5 2 read(1) 0 3 read,1 write(3) 14 (max.) R3 S1 0 0 0 0 S2 0 3 read,1 write(2) 0 8 (max.) S3 0 0 2 read(3) 6 S4 0 3 read,1 write(2) 0 8 (max.) S5 0 0 2 read(3) 6 Local references for each table at each of five possible sites. Allocation Decision Allocate R1 at site S4. Allocate R2 at site S5. Allocate R3 at either site S2 or S4 Additional information is needed to choose this allocation. For instance, if maximum availability of data is a major consideration, then choose site S2 because site S4 already has table R1 allocated to it and putting R3 there as well would decrease the potential availability of data should site S4 crash. 
Advantages - simple algorithm Disadvantages - number of local references may not accurately characterize time or cost (reads and writes given equal weights) - no insights regarding replication Relations (tables): R1, R2, R3 Sites: $1, 52.53, 54,55 Transactions: T1, 72, T3 716 77 The Redundant “All Beneficial Sites” Method This method can be used for either the redundant or non-redundant case. It selects all sites for a table allocation where the benefit is greater than the cost for one additional copy of that table. You are assumed to start with zero copies. The benefit for table R at site S is measured by the difference in elapsed time to do a remote query to table R from site S (i.e. no replicated copy available locally) and a local query to table R at site S (i.e. replicated copy available locally). Total benefit for table R at site S is the weighted sum of benefit for each query times the frequency of queries. The cost for table R at site S is the total elapsed time for all the local updates of table R, plus the total elapsed time for all the remote updates for the given table at that site. Total cost for table R at site S is weighted sum of cost for each update transaction times the frequency of update transactions. Example Cost/Benefit Computations for “All Beneficial Sites” Table Site Remote updates (local updates) No. of writes*freq*time Cost R1 S1 T1 from S4 and S5 (T1 from S1) 2*1*600 ms +(1*1*150) 1350 ms S2 T1 from S1, S4, S5 3*1*600 ms 1800 ms S3 T1 from S1, S4, S5 3*1*600 ms 1800 ms S4 T1 from S1 and S5 (T1 from S4) 2*1*600 ms +(1*1*150) 1350 ms S5 T1 from S1 and S4 (T1 from S5) 2*1*600 ms +(1*1*150) 1350 ms R2 S1 T3 from S3 and S5 2*3*700 ms 4200 ms S2 T3 from S3 and S5 2*3*700 ms 4200 ms S3 T3 from S5 (T3 from S3) 1*3*700 ms +(1*3*200) 2700 ms S4 T3 from S3 and S5 2*3*700 ms 4200 ms S5 T3 from S3 (T3 from S5) 1*3*700 ms +(1*3*200) 2700 ms R3 S1 T2 from S2 and S4 2*2*1100 ms 4400 ms S2 T2 from S4 (T2 from S2) 1*2*1100 ms +(1*2*250) 2700 ms S3 T2 from S2 and S4 2*2*1100 ms 4400 ms S4 T2 from S2 (T2 from S4) 1*2*1100 ms +(1*2*250) 2700 ms S5 T2 from S2 and S4 2*2*1100 ms 4400 ms Table Site Query (read) sources No. of reads*freq*(remote-local time) Benefit R1 S1 T1 at S1 3*1*(500 - 100) 1200 ms S2 T2 at S2 2*2*(500 - 100) 1600 ms S3 None 0 0 S4 T1 and T2 at S4 (3*1 + 2*2)*(500 - 100) 2800 ms** S5 T1 at S5 3*1*(500 - 100) 1200 ms 80 Core Requirements for Data Warehousing 1. DWs are organized around subject areas. 2. DWs should have some integration capability. 3. The data is considered to be nonvolatile and should be mass loaded. 4. Data tends to exist at multiple levels of granularity. 5. The DW should be flexible enough to meet changing requirements rapidly. . 6. The DW should have a capability for rewriting history, that is, allowing “what-if” analysis. 7. A usable DW user interface should be selected. 8. Data should be either centralized or distributed physically. 81 Data Warehouse Life Cycle I. Requirements analysis and specification 1.1 Analyze the end-user requirements and develop a requirements specification. This step follows the practice used by conventional relational databases (see Chapter 1). 1.2 Define the DW architecture and do some initial capacity planning for servers and tools. Integrate the servers, storage elements, and client tools. 1.3 Use enterprise data modeling II. Logical database design Design the enterprise DW schema and views. 
Star schema is the most often used format -- good performance, ease of use.
Fact table (one) -- a very large table containing numeric and/or non-numeric attributes, including the primary keys from the dimension tables; similar to intersection tables between entities with many-to-many relationships.
Dimension tables (several) -- smaller tables containing mostly non-numeric attributes; similar to relational tables based on entities.
Snowflake schema -- similar to the star schema, except dimension tables are normalized.
Fact table family (constellation) -- multiple fact tables interact with dimension tables.

[Figure 9.4: Star schema for the "order" data warehouse. The Order fact table (order-no, reg-no, cust-id, sales-id, prod-no, quantity, total-price, ord-date, shipping-date, date-filled) is linked to the dimension tables Region (reg-no, reg-name, mgr-name, addr, phone), Customer (cust-id, cust-name, addr, phone, company), Salesperson (sales-id, sales-name, addr, phone), and Product (prod-no, prod-type, prod-name, price).]

5.3 Populate the repository with the schema and view definitions, scripts, and other metadata.
5.4 Design and implement end-user applications. Roll out the DW and applications.

On-Line Analytical Processing (OLAP)

Common Features of Multidimensional Databases (MDD)
1. Dimensions -- perspectives or entities about the real world
2. Hypercubes -- the basic structure for multidimensional databases
3. Hierarchies -- certain dimensions are hierarchical in nature
4. Formulas -- derived data values can be defined by formulas (sum, average, etc.)
5. Links -- links are needed to connect hypercubes and their data sources

OLAP Logical Design
Step 1 - Analyze the end-user requirements and environment
Step 2 - Define cubes, dimensions, hierarchies, and links (high level)
Step 3 - Define dimension members (low level)
Step 4 - Define aggregations and other formulas (derived data)

Aggregation Issues
1. Which data to aggregate
2. How to store aggregate data
3. When to pre-aggregate derived data
   * Pre-aggregate nothing
   * Pre-aggregate nothing, but save the materialized view (dynamic)
   * Pre-aggregate everything (static)
   * Pre-aggregate selectively, based on known statistics

Example of aggregation: 3-dimensional problem

Dimension     Levels (highest to lowest)                                No. of aggregate levels
Product       all-products, product-type, product-name                  2
Region        world-wide, country, state, city, store                   4
Time-period   10-year, 5-year, 3-year, year, quarter, month, week, day  7

Potential variables: quantity sold, quota, gross revenue
Potential further dimensions: customer, salesperson, type-of-sale

Subset of the 3-dimensional problem

Dimension     Levels                       No. of aggregate levels
Product       product-type, product-name   1
Region        state, store                 1
Time-period   month, day                   1

Statistics on aggregates and hierarchies of aggregates
Number of dimensions = d = 3
Number of possible views = 2^d - 1 = 7:
1. product
2. region
3. time-period
4. product, region
5. product, time-period
6. region, time-period
7. product, region, time-period
Number of aggregate levels for dimension i = n_i
Number of one-way aggregates = sum over i=1..d of n_i = 1 + 1 + 1 = 3 for the subset, and 2 + 4 + 7 = 13 for the full set.
[Figure 9.6: Sales-summary as an example of aggregation. The fact table (order-no, reg-no, cust-id, sales-id, prod-no, period-id, quantity, total-price) is connected to the dimension tables Order (order-no, ord-date, shipping-date, date-filled), Region (reg-no, reg-name, mgr-name, addr, phone), Salesperson (sales-id, sales-name, addr, phone), Customer (cust-id, cust-name, addr, phone, company), Product (prod-no, prod-type, prod-name, price), and Time-Period (period-id, day, week, month, year); a derived Sales-summary table (reg-no, month, revenue, profit) holds the aggregated monthly revenue and profit by region.]

[Figure 9.9: Display of multidimensional sales data. (a) A linear sequence of sample members from each of four dimensions: time-period (months, quarters, and years), region (Southwest, Northwest, North-central, South-central, Northeast, Midwest, Southeast), product (Ford-Mustang, Chrysler-Eagle, GM-Camero, Toyota-Camry), and variable (quantity-sold, total-revenue). (b) A two-dimensional layout of the same data: quantity sold and total revenue by month and product for the Southwest region.]

[Figure 9.10: Examples of mixing formulas for derived data values. (a) Pure operations (sums) compute the same result in any order: product totals and region totals of quantity sold agree. (b) Mixed sums and ratios give inconsistent results: for the Southwest monthly report of quota, quantity-sold, and quantity-sold/quota by product, the ratio of the sums is not the same as the sum of the per-product ratios.]
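To make the star schema concrete, here is a hedged SQL sketch of the "order" data warehouse of Figure 9.4: one fact table carrying the dimension keys and the numeric measures, and four dimension tables. Column names come from the figure; the data types are assumptions, and the fact table is called order_fact here only because ORDER is a reserved word in SQL.

create table region
  (reg_no    integer,
   reg_name  varchar(30),
   mgr_name  varchar(30),
   addr      varchar(50),
   phone     varchar(15),
   primary key (reg_no));

create table customer
  (cust_id   integer,
   cust_name varchar(30),
   addr      varchar(50),
   phone     varchar(15),
   company   varchar(30),
   primary key (cust_id));

create table salesperson
  (sales_id   integer,
   sales_name varchar(30),
   addr       varchar(50),
   phone      varchar(15),
   primary key (sales_id));

create table product
  (prod_no   integer,
   prod_type varchar(20),
   prod_name varchar(30),
   price     decimal(9,2),
   primary key (prod_no));

-- Fact table: one row per order, carrying the dimension keys and the
-- numeric measures (quantity, total_price) plus the order dates.
create table order_fact
  (order_no      integer,
   reg_no        integer not null,
   cust_id       integer not null,
   sales_id      integer not null,
   prod_no       integer not null,
   quantity      integer,
   total_price   decimal(9,2),
   ord_date      date,
   shipping_date date,
   date_filled   date,
   primary key (order_no),
   foreign key (reg_no)   references region,
   foreign key (cust_id)  references customer,
   foreign key (sales_id) references salesperson,
   foreign key (prod_no)  references product);

A typical OLAP query against such a schema joins the fact table to one or more dimension tables and aggregates the measures, for example total revenue by region and product type:

select r.reg_name, p.prod_type, sum(f.total_price) as revenue
from order_fact as f, region as r, product as p
where f.reg_no = r.reg_no and f.prod_no = p.prod_no
group by r.reg_name, p.prod_type;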