Fundamentals of Database Systems Reviewer Summative 3 (PL/SQL), Study notes of Principles of Database Management

A comprehensive review material that focuses on the practical application of PL/SQL programming in managing and manipulating data in a database system. PL/SQL is a procedural language designed specifically for managing and querying data in Oracle databases. This reviewer covers various topics such as control structures, functions, procedures, and triggers that are essential for developers and database administrators who are working with Oracle databases. The material includes sample codes and exercises that aim to reinforce the concepts and skills needed to create efficient and effective database systems. Additionally, this reviewer provides a step-by-step guide on how to design and implement PL/SQL programs that can improve the performance and functionality of database systems.



Fundamentals of Database Systems Reviewer

MODULE 8 Data Warehousing – Data Warehouse Concept

Data Warehousing
• An electronic storage of a large amount of information by a business or organization.
• A type of data management system designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are intended solely for queries and analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.
• Sources such as CRM, ERP, Billing, Operations, HR, and other data feed into the Data Warehouse.

Data Warehousing vs. Databases
• A database is a transactional system that monitors and updates real-time data so that only the most recent data is available.
• A data warehouse is programmed to aggregate structured data over a period of time.
• Database: Tables, Logs, Schema. Data Warehouse: CRM, ERP, Operations, Reporting, Data Mining, OLAP Cube.

Alternative Names for a Data Warehouse System
- Decision Support System
- Executive Information System
- Management Information System
- Business Intelligence Solution
- Analytic Application
- Data Warehouse

History of Data Warehouse
• 1960 - Dartmouth and General Mills, in a joint research project, develop the terms dimensions and facts.
• 1970 - Nielsen and IRI introduce dimensional data marts for retail sales.
• 1983 - Teradata Corporation introduces a database management system specifically designed for decision support.
• Data warehousing started in the late 1980s when IBM workers Paul Murphy and Barry Devlin developed the Business Data Warehouse.
• However, the real concept was given by Bill Inmon, who is considered the father of the data warehouse. He has written about a variety of topics on building, using, and maintaining the warehouse and the Corporate Information Factory.

Benefits of Data Warehouse
- Data Warehouse Enhances Business Intelligence
- Time Efficient
- Data Quality and Consistency
- Enhanced Performance
- High Return on Investment
- BI from Heterogeneous Sources

DATA WAREHOUSE ARCHITECTURE
Simple. - All data warehouses share a basic design in which metadata, summary data, and raw data are stored within the central repository of the warehouse. The repository is fed by data sources on one end and accessed by end users for analysis, reporting, and mining on the other end.
Simple with a staging area. - Operational data must be cleaned and processed before being put in the warehouse. Although this can be done programmatically, many data warehouses add a staging area for data before it enters the warehouse, to simplify data preparation.
Hub and spoke. - Adding data marts between the central repository and end users allows an organization to customize its data warehouse to serve various lines of business. When the data is ready for use, it is moved to the appropriate data mart.
Sandboxes. - Sandboxes are private, secure, safe areas that allow companies to quickly and informally explore new datasets or ways of analyzing data without having to conform to the formal rules and protocols of the data warehouse.

Data Warehouse Tools
3. Xplenty 4. CloverDX 5. 
Oracle 6. Amazon RedShift 7. Domo 8. Teradata 9. SAP 10. SAS 11. IBM DataStage 12. Informatica 13. MS SSIS 14. Talend Open Studio 15. The Ab Initio Software 16. Dundas 17. Sisense 18. TabLeau 19. MicroStrategy 20. Pentaho MODULE 8 Data Warehousing – Types of Data Warehouse Types of Data Warehouse There are three types of data warehouse: 1. Enterprise Data Warehouse 2. Operational Data Store 3. Data Mart Enterprise Data Warehouse (EDW) • Enterprise Data Warehouse is a centralized warehouse. • It provides decision support service across the enterprise. • It offers a unified approach for organizing and representing data. • It also provides the ability to classify data according to the subject and give access according to those divisions. Enterprise Data Warehouse Architecture One-tier Architecture ONE-TIER DATA WAREHOUSE ARCHITECTURE Source Source Source Source >> Data integration & ETL tools >> Enterprise data warehouse - Structured Data - Semi-structured data - Unstructured data >> Reporting layer - Reporting tools - Interactive dashboards - Ad hoc reports Two-tier Architecture TWO-TIER ARCHITECTURE (DATA WAREHOUSE + DATA MART LAYER Source Source Source Source >> Data integration & ETL tools >> Enterprise data warehouse - Structured data - Semi-structured data - Unstructured data >> Data mart layer - Sales Data Mart - Marketing Data Mart - HR Data Mart - PM Data Mart >> Reporting layer - Reporting tools - Interactive dashboards - Ad hoc reports Three-tier Architecture THREE-TIER ARCHITECTURE (WAREHOUSE + DATA MART LAYER + OLAP LAYER) Source Source Source Source >> Data integration & ETL tools >> Enterprise data warehouse - Structured data - Semi-structured data - Unstructured data >> Data mart layer - Sales Data Mart - PM Data Mart >> OLAP Layer - OLAP cube - OLAP cube - OLAP cube - OLAP cube >> End-user interface - Reporting tools - Interactive dashboards - Ad hoc reports One-tier architecture - for EDW means that you have a database directly connected with the analytical interfaces where the end user can make queries. Two-tier architecture - A data mart level is added between the user interface and EDW. A data mart is a low-level repository that contains domain- specific information. Simply put, it’s another, smaller-sized database that extends EDW with dedicated information for your sales/operational departments, marketing, etc. Three-tier architecture - On top of the data mart layer, enterprises also use online analytical processing (OLAP) cubes. An OLAP cube is a specific type of database that represents data from multiple dimensions. Enterprise Data Warehouse Components Conventional Data Warehouse Architecture Multi Data Sources Extract Extract Extract Extract Operational Data Stores >> Data Staging Area - Cleansed, Combined and Standardized Data >> Load >> Data Presentation Area - Consumable Summarized Data << Access << Access << Access << Access << Data Access & Visualization Tools - Reports 1. Sources • The databases where raw data is stored. 2. Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) layer. • These are the tools that perform actual connection with source data, its extraction, and loading to the place where it will be transformed. Transformation unifies data format. • ETL and ELT approaches differ in that in ETL the transformation is done before EDW, in a staging area. ELT is a more modern approach that handles all the transformation in a warehouse. 3. Staging Area • In the case of ETL, the staging area is the place data is loaded before EDW. 
Here, it will be cleaned and transformed to a given data model. 4. DW Database • The data is finally loaded into the storage space. 5. Reporting Layer • These are tools that give end users access to data. Operational Data Store Staging / Factory (Intermediate) Database >> Operational Data Store (ODS) >> Near Real Time Reporting Operational Data Store >> Data Warehouse(DW) >> Data Mart 1 – Data Mart 2 – Data Mart 3 – Data Mart n >> BI Reports – Other Detailed Reports Source DB 1 – Source DB 2 – Source DB 3 – Source DB 4 – Source DB n >> Staging / Factory (Intermediate) Database >> Operational Data Store(ODS) – Data Warehouse (DW) 3. Hybrid: - This type of data marts can take data from data warehouses or operational systems. Operational Sources >> Enterprise data warehouse >> Dependent departmental datamarts Steps in Implementing a Data Mart Designing >> Constructing >> Populating >> Accessing >> Managing Designing - It covers all the tasks between initiating the request for a data mart to gathering information about the requirements. Finally, we create the logical and physical design of the data mart. Constructing - It involves creating the physical database and the logical structures. Populating - data in populated in the data mart. Accessing - Involves putting the data to use querying the data, creating reports, charts, and publishing them. End-users submit queries to the database and display the results of the queries. Managing - This step covers management tasks such as- • Ongoing user access management. • System optimizations and fine-tuning to achieve the enhanced performance. • Adding and managing fresh data into the data mart. • Planning recovery scenarios and ensure system availability in the case when the system fails. Advantages of Data Mart • Data marts contain a subset of organization-wide data. This Data is valuable to a specific group of people in an organization. • It is cost-effective alternatives to a data warehouse, which can take high costs to build. • Data Mart allows faster access of Data. • Data Mart is easy to use as it is specifically designed for the needs of its users. Thus, a data mart can accelerate business processes. • Data Marts needs less implementation time compared to Data Warehouse systems. It is faster to implement Data Mart as you only need to concentrate on subset of data. • It contains historical data which enables the analyst to determine data trends. Disadvantages of Data Mart • Many times, enterprises create too many unrelated data marts without much benefit. It can become a big hurdle to maintain. • Data Mart cannot provide company-wide data analysis as their data set is limited. SUPPLEMENTARY ORACLE TRIGGER Summary: in this tutorial, you will learn about another named PL/SQL block called an Oracle trigger. You will also learn about different characters of triggers and their usage in the database. Oracle Trigger A trigger is a named PL/SQL block stored in the Oracle Database and executed automatically when a triggering event takes place. The event can be any of the following: • A data manipulation language (DML) statement executed against a table e.g., INSERT, UPDATE, or DELETE. For example, if you define a trigger that fires before an INSERT statement on the customers table, the trigger will fire once before a new row is inserted into the customers’ table. • A data definition language (DDL) statement executes e.g., CREATE or ALTER statement. These triggers are often used for auditing purposes to record changes of the schema. 
• A system event such as startup or shutdown of the Oracle Database. • A user event such as login or logout. The act of executing a trigger is also known as firing a trigger. We say that the trigger is fired. Oracle trigger usages Oracle triggers are useful in many cases such as the following: • Enforcing complex business rules that cannot be established using integrity constraint such as UNIQUE, NOT NULL, and CHECK. • Preventing invalid transactions. • Gathering statistical information on table accesses. • Generating value automatically for derived columns. • Auditing sensitive data. How to create a trigger in Oracle To create a new trigger in Oracle, you use the following CREATE TRIGGER statement: CREATE [OR REPLACE] TRIGGER trigger_name {BEFORE | AFTER } triggering_event ON table_name [FOR EACH ROW] [FOLLOWS | PRECEDES another_trigger] [ENABLE / DISABLE ] [WHEN condition] DECLARE declaration statements BEGIN executable statements EXCEPTION exception_handling statements END; Let’s examine the syntax of the CREATE TRIGGER statement in more detail. A trigger has two main parts: header and body. The following illustrates the trigger header: CREATE [OR REPLACE] TRIGGER trigger_name {BEFORE | AFTER } triggering_event ON table_name [FOR EACH ROW] [FOLLOWS | PRECEDES another_trigger] [ENABLE / DISABLE ] [WHEN condition] And this is the trigger body: DECLARE declaration statements BEGIN executable statements EXCEPTION exception_handling statements END; As you can see, the trigger body has the same structure as an anonymous PL/SQL block. 1. CREATE OR REPLACE - The CREATE keyword specifies that you are creating a new trigger. The OR REPLACE keywords are optional. They are used to modify an existing trigger. - Even though the OR REPLACE keywords are optional, they appear with the CREATE keyword in most cases. - For example, if today you define a new trigger named trigger_example: CREATE TRIGGER trigger_example ... - And on the next day, you decide to modify this trigger. - If you do not include the OR REPLACE keywords, you will receive an error message indicating that the name of your trigger is already used by another object: CREATE TRIGGER trigger_example ... Therefore, the CREATE OR REPLACE keywords will replace an existing trigger if it already exists and create a new trigger if the trigger does not: CREATE OR REPLACE trigger_example .... 2. Trigger name - Specify the name of the trigger that you want to create after the CREATE OR REPLACE keywords. 3. BEFORE | AFTER - The BEFORE or AFTER option specifies when the trigger fires, either before or after a triggering event e.g., INSERT, UPDATE, or DELETE 4. ON table_name The table_name is the name of the table associated with the trigger. 5. FOR EACH ROW - The clause FOR EACH ROW specifies that the trigger is a row-level trigger. A row-level trigger fires once for each row inserted, updated, or deleted. - Besides the row-level triggers, we have statement-level triggers. A statement- trigger fire once regardless of the number of rows affected by the triggering event. If you omit the FOR EACH ROW clause, the CREATE TRIGGER statement will create a statement-level trigger. 6. ENABLE / DISABLE - The ENABLE / DISABLE option specifies whether the trigger is created in the enabled or disabled state. Note that if a trigger is disabled, it is not fired when the triggering event occurs. By default, if you don’t specify the clause ENABLE / DISABLE, the trigger is created with the enabled state. 
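Putting the header and body together, the sketch below shows what a complete row-level trigger might look like. It is only an illustration: the customers table (with customer_id and credit_limit columns) and the customers_audit table are hypothetical names, not objects defined anywhere in this reviewer.

-- A hypothetical audit table that records every change to a customer's credit limit.
CREATE TABLE customers_audit (
    customer_id  NUMBER,
    old_credit   NUMBER,
    new_credit   NUMBER,
    changed_by   VARCHAR2(30),
    changed_at   TIMESTAMP
);

-- Row-level BEFORE UPDATE trigger: fires once for each updated row of customers.
CREATE OR REPLACE TRIGGER customers_credit_audit
    BEFORE UPDATE OF credit_limit ON customers
    FOR EACH ROW
    WHEN (NEW.credit_limit <> OLD.credit_limit)  -- fire only when the value actually changes
BEGIN
    INSERT INTO customers_audit (customer_id, old_credit, new_credit, changed_by, changed_at)
    VALUES (:OLD.customer_id, :OLD.credit_limit, :NEW.credit_limit, USER, SYSTIMESTAMP);
END;
/

An UPDATE statement such as UPDATE customers SET credit_limit = credit_limit * 1.1 would then fire this trigger once per affected row; omitting FOR EACH ROW would instead create a statement-level trigger that fires only once for the whole UPDATE.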
History of Datawarehouse The Datawarehouse benefits users to understand and enhance their organization’s performance. The need to warehouse data evolved as computer systems became more complex and needed to handle increasing amounts of Information. However, Data Warehousing is a not a new thing. Here are some key events in evolution of Data Warehouse- • 1960- Dartmouth and General Mills in a joint research project, develop the terms dimensions and facts. • 1970- A Nielsen and IRI introduces dimensional data marts for retail sales. • 1983- Tera Data Corporation introduces a database management system which is specifically designed for decision support • Data warehousing started in the late 1980s when IBM worker Paul Murphy and Barry Devlin developed the Business Data Warehouse. • However, the real concept was given by Inmon Bill. He was considered as a father of data warehouse. He had written about a variety of topics for building, usage, and maintenance of the warehouse & the Corporate Information Factory. How Datawarehouse works A Data Warehouse works as a central repository where information arrives from one or more data sources. Data flows into a data warehouse from the transactional system and other relational databases. Data may be: 1. Structured 2. Semi-structured 3. Unstructured data The data is processed, transformed, and ingested so that users can access the processed data in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data warehouse merges information coming from different sources into one comprehensive database. By merging all of this information in one place, an organization can analyze its customers more holistically. This helps to ensure that it has considered all the information available. Data warehousing makes data mining possible. Data mining is looking for patterns in the data that may lead to higher sales and profits. TYPES OF DATA WAREHOUSE Three main types of Data Warehouses (DWH) are: 1. Enterprise Data Warehouse (EDW): - Enterprise Data Warehouse (EDW) is a centralized warehouse. It provides decision support service across the enterprise. It offers a unified approach for organizing and representing data. It also provides the ability to classify data according to the subject and give access according to those divisions. 2. Operational Data Store: - Operational Data Store, which is also called ODS, are nothing but data store required when neither Data warehouse nor OLTP systems support organizations reporting needs. In ODS, Data warehouse is refreshed in real time. Hence, it is widely preferred for routine activities like storing records of the Employees. 3. Data Mart: - A data mart is a subset of the data warehouse. It specially designed for a particular line of business, such as sales, finance, sales, or finance. In an independent data mart, data can collect directly from sources. General stages of Data Warehouse Earlier, organizations started relatively simple use of data warehousing. However, over time, more sophisticated use of data warehousing begun. The following are general stages of use of the data warehouse (DWH): Offline Operational Database: - In this stage, data is just copied from an operational system to another server. In this way, loading, processing, and reporting of the copied data do not impact the operational system’s performance. Offline Data Warehouse: - Data in the Datawarehouse is regularly updated from the Operational Database. 
The data in Datawarehouse is mapped and transformed to meet the Datawarehouse objectives. Real time Data Warehouse: - In this stage, Data warehouses are updated whenever any transaction takes place in operational database. For example, Airline or railway booking system. Integrated Data Warehouse: - In this stage, Data Warehouses are updated continuously when the operational system performs a transaction. The Datawarehouse then generates transactions which are passed back to the operational system. COMPONENTS OF DATA WAREHOUSE Four components of Data Warehouses are: Load manager: - Load manager is also called the front component. It performs with all the operations associated with the extraction and load of data into the warehouse. These operations include transformations to prepare the data for entering into the Data warehouse. Warehouse Manager: - Warehouse manager performs operations associated with the management of the data in the warehouse. It performs operations like analysis of data to ensure consistency, creation of indexes and views, generation of denormalization and aggregations, transformation and merging of source data and archiving and baking-up data. Query Manager: - Query manager is also known as backend component. It performs all the operation operations related to the management of user queries. The operations of this Data warehouse components are direct queries to the appropriate tables for scheduling the execution of queries. End-user access tools: - This is categorized into five different groups like 1. Data Reporting 2. Query Tools 3. Application development tools 4. EIS tools, 5. OLAP tools and data mining tools. Who needs Data warehouse. DWH (Data warehouse) is needed for all types of users like: • Decision makers who rely on mass amount of data • Users who use customized, complex processes to obtain information from multiple data sources. • It is also used by the people who want simple technology to access the data. • It also essential for those people who want a systematic approach for making decisions. • If the user wants fast performance on a huge amount of data which is a necessity for reports, grids, or charts, then Data warehouse proves useful. • Data warehouse is a first step If you want to discover ‘hidden patterns’ of data-flows and groupings. Used For Data Warehouse Here, are most common sectors where Data warehouse is used: Airline: - In the Airline system, it is used for operation purpose like crew assignment, analyses of route profitability, frequent flyer program promotions, etc. Banking: - It is widely used in the banking sector to manage the resources available on desk effectively. Few banks also used for the market research, performance analysis of the product and operations. Healthcare: - Healthcare sector also used Data warehouse to strategize and predict outcomes, generate patient’s treatment reports, share data with tie-in insurance companies, medical aid services, etc. Public sector: - In the public sector, data warehouse is used for intelligence gathering. It helps government agencies to maintain and analyze tax records, health policy records, for every individual. Investment and Insurance sector: - In this sector, the warehouses are primarily used to analyze data patterns, customer trends, and to track market movements. Retail chain: - In retail chains, Data warehouse is widely used for distribution and marketing. It also helps to track items, customer buying pattern, promotions and also used for determining pricing policy. 
Telecommunication: - A data warehouse is used in this sector for product promotions, sales decisions and to make distribution decisions. Hospitality Industry: - This Industry utilizes warehouse services to design as well as estimate their advertising and promotion campaigns where they want to target clients based on their feedback and travel patterns. and in the cloud. It helps to optimize customer experiences by increasing operational efficiency. 3. Amazon RedShift: - Amazon Redshift is Data warehouse tool. It is a simple and cost-effective tool to analyze all types of data using standard SQL and existing BI tools. It also allows running complex queries against petabytes of structured data, using the technique of query optimization. KEY LEARNING • Data Warehouse (DWH) is also known as an Enterprise Data Warehouse (EDW). • A Data Warehouse is defined as a central repository where information is coming from one or more data sources. • Three main types of Data warehouses are Enterprise Data Warehouse (EDW), Operational Data Store, and Data Mart. • General state of a datawarehouse are Offline Operational Database, Offline Data Warehouse, Real time Data Warehouse and Integrated Data Warehouse. • Four main components of Datawarehouse are Load manager, Warehouse Manager, Query Manager, End-user access tools. • Datawarehouse is used in diverse industries like Airline, Banking, Healthcare, Insurance, Retail etc. • Implementing Datawarehouse is a 3 prong strategy viz. Enterprise strategy, Phased delivery and Iterative Prototyping. • Data warehouse allows business users to quickly access critical data from some sources all in one place. DATA WAREHOUSE ARCHITECTURE Introduction to Data Warehouse Architecture The Data Warehouse Architecture can be defined as a structural representation of the concrete functional arrangement based on which a Data Warehouse is constructed that should include all its major pragmatic components, which is typically enclosed with four refined layers, such as the Source layer where all the data from different sources are situated, the Staging layer where the data undergoes ETL processing, the Storage layer where the processed data are stored for future exercises, and the presentation layer where the front-end tools are employed as per the users’ convenience. Data Warehouse Architecture Top tier - Data Mining - Reporting Tool - Analysis / Other Tool Middle Tier - OLAP Server Bottom Tier - ETL (Data Source) - Data Warehouse - Data Marts ETL (Data Source) >> Data Warehouse >> Data Marts Data Warehouse >> OLAP Server >> Reporting Tool The Data Warehouse Architecture generally comprises of three tiers. • Top Tier • Middle Tier • Bottom Tier Top Tier • The Top Tier consists of the Client- side front end of the architecture. • The Transformed and Logic applied information stored in the Data Warehouse will be used and acquired for Business purposes in this Tier. • Several Tools for Report Generation and Analysis are present for the generation of desired information. • Data mining which has become a great trend these days is done here. • All Requirement Analysis documents, cost, and features that determine a profit-based Business deal are done based on these tools, which use the Data Warehouse information. Middle Tier • The Middle Tier consists of the OLAP Servers • OLAP is Online Analytical Processing Server • OLAP is used to provide information to business analysts and managers. 
• As it is located in the Middle Tier, it rightfully interacts with the information present in the Bottom Tier and passes on the insights to the Top Tier tools, which processes the available information. • Mostly Relational or MultiDimensional OLAP is used in Data warehouse architecture. Bottom Tier - The Bottom Tier mainly consists of the Data Sources, ETL Tool, and Data Warehouse. 1. Data Sources - The Data Sources consists of the Source Data that is acquired and provided to the Staging and ETL tools for further process. 2. ETL Tools • ETL tools are very important because they help in combining Logic, Raw Data, and Schema into one and loads the information to the Data Warehouse or Data Marts. • Sometimes, ETL loads the data into the Data Marts, and then information is stored in Data Warehouse. This approach is known as the Bottom-Up approach. • The approach where ETL loads information to the Data Warehouse directly is known as the Top-down Approach. Difference Between Top-down Approach and Bottom-up Approach Top-Down Approach Bottom-Up Approach Provides a definite and consistent view of information as information from data warehouse is used to create Data Marts Reports can be generated easily as Data marts are created first, and it is relatively easy to interact with data marts. Strong model and hence preferred by big companies Not as strong, but the data warehouse can be extended, and the number of data marts can be created. Time, Costa and Maintenance is high Time, Cost and Maintenance are low. Data Marts • Data Mart is also a storage component used to store data of a specific function or part related to a company by an individual authority. • Datamart gathers the information from Data Warehouse, and hence we can say data mart stores the subset of information in Data Warehouse. • Data Marts are flexible and small in size. 3. Data Warehouse • Data Warehouse is the central component of the whole Data Warehouse Architecture. • It acts as a repository to store information. • Big Amounts of data are stored in the Data Warehouse. • This information is used by several technologies like Big Data which require analyzing large subsets of information. • Data Mart is also a model of Data Warehouse. Different Layers of Data Warehouse Architecture Below are the different layers: Data Source - Data 1 - Data 2 >> ETL >> Data Staging - Staging & integration data >> Data Storage - summary data - Meta Data - Data Mart >> Data Presentation - Analysis Data - Reporting Tools - Data Mining There are four different types of layers which will always be present in Data Warehouse Architecture. 1. Data Source Layer • The Data Source Layer is the layer where the data from the source is encountered and subsequently sent to the other layers for desired operations. • The data can be of any type. • The Source Data can be a database, a Spreadsheet, or any other kind of text file. • The Source Data can be of any format. However, we cannot expect to get data with the same format considering the sources are vastly different. • In Real Life, some examples of Source Data can be. • Log Files of each specific application or job or entry of employers in a company. • Survey Data, Stock Exchange Data, etc. • Web Browser Data and many more. 2. Data Staging Layer The following steps take place in Data Staging Layer. 
Step #1: Data Extraction - The Data received by the Source Layer is feed into the Staging Layer, where the first process present in the Operational Data Store can be scrubbed and the redundancy which is present can be checked and resolved by checking the corresponding business rules. It also helps in integrating contrasting data from multiple sources so that business operations, analysis, and reporting can be easily carried out and help the business while the process is still in continuation. - Here most of the operations which are currently being performed are stored before they are moved to the data warehouse for a longer duration. It helps effectively on simple queries and small amounts of data. It acts as a short term or temporary memory which stores the recent information. The data warehouse stores the data for a comparatively long time and also stores relatively permanent information. - It helps in storing transactional data from one or more production systems and loosely integrates it. It is sometimes subject oriented and time variant. The integration is achieved by making use of EDW structures and contents. The integration of data can involve cleansing, resolving redundancy, checking business rules for integrity. It is usually designed to contain low-level atomic data that stores limited data. 3. Data Mart - Data Mart focuses on storing data for a particular functional area and it contains a subset of data that is stored in a data warehouse. Data Marts help in enhancing user responses and also reduces the volume of data for data analysis. It makes it easier to go ahead with the research. Data Mart being a subset of Datawarehouse is easy to implement. It is cost-effective when compared with a complete data warehouse. It is more open to change, and a single subject matter expert can define its structure and configuration. The data is partitioned, and the granularity can be easily controlled. Data Mart has three types. These types are: • Dependent • Independent • Hybrid Dependent Data Mart - By getting data from operational, external or both sources a dependent data mart can be created. It allows the sourcing organization’s data from a single data warehouse. All data is centralized and can help in developing more data marts. Independent Data Mart - This data mart does not require a central data warehouse. This is usually created for smaller groups which are present within an organization. It does not have any relationship with Enterprise Data Warehouse or any other data mart. All data is independent and can be used separately. Also, the analysis can be performed autonomously. To have a consistent and centralized store of data is very important so that multiple users can use it. Hybrid Data Mart - As the name suggests a hybrid data mart is used when inputs from different sources are a part of a data warehouse. It is useful when a user wants an ad hoc integration. Whenever an organization needs multiple database environments and fast implementation then this setup can be used. It requires the least data cleansing effort, and the data mart supports large storage structures. The best usage of a data mart is when smaller data-centric applications are being used. Conclusion A data warehouse is thus a very important component in the data industry. As database helps in storing and processing data, a data warehouse helps in analyzing it. 
Data warehouse thus helps in getting business trends and patterns which can later be presented in the form of reports which provide insight for how to go ahead in the process of business growth. Data warehouse thus plays a vital role in creating a touch base in the data industry. WHAT IS DATA MART IN DATA WAREHOUSE? TYPES & EXAMPLE Data Mart A Data Mart is focused on a single functional area of an organization and contains a subset of data stored in a Data Warehouse. A Data Mart is a condensed version of Data Warehouse and is designed for use by a specific department, unit or set of users in an organization. E.g., Marketing, Sales, HR, or finance. It is often controlled by a single department in an organization. Data Mart usually draws data from only a few sources compared to a Data warehouse. Data marts are small in size and are more flexible compared to a Datawarehouse. In this tutorial, you will learn- • Why do we need Data Mart? • Types of Data Mart o Dependent Data Mart o Independent Data Mart o Hybrid Data Mart • Steps in Implementing a Datamart o Designing o Constructing o Populating o Accessing o Managing • Best practices for Implementing Data Marts • Advantages and Disadvantages of a Data Mart Importance of Data Mart • Data Mart helps to enhance user’s response time due to reduction in volume of data. • It provides easy access to frequently requested data. • Data mart are simpler to implement when compared to corporate Datawarehouse. At the same time, the cost of implementing Data Mart is certainly lower compared with implementing a full data warehouse. • Compared to Data Warehouse, a datamart is agile. In case of change in model, datamart can be built quicker due to a smaller size. • A Datamart is defined by a single Subject Matter Expert. On the contrary data warehouse is defined by interdisciplinary SME from a variety of domains. Hence, Data mart is more open to change compared to Datawarehouse. • Data is partitioned and allows very granular access control privileges. • Data can be segmented and stored on different hardware/software platforms. Types of Data Mart There are three main types of data mart: 1. Dependent: - Dependent data marts are created by drawing data directly from operational, external or both sources. 2. Independent: - Independent data mart is created without the use of a central data warehouse. 3. Hybrid: - This type of data marts can take data from data warehouses or operational systems. Dependent Data Mart A dependent data mart allows sourcing organization’s data from a single Data Warehouse. It is one of the data mart examples which offers the benefit of centralization. If you need to develop one or more physical data marts, then you need to configure them as dependent data marts. Dependent Data Mart in data warehouse can be built in two different ways. Either where a user can access both the data mart and data warehouse, depending on need, or where access is limited only to the data mart. The second approach is not optimal as it produces sometimes referred to as a data junkyard. In the data junkyard, all data begins with a common source, but they are scrapped, and mostly junked. Operational Sources >> Enterprise Data Warehouse >> Dependent departmental - Datamarts Independent Data Mart An independent data mart is created without the use of central Data warehouse. This kind of Data Mart is an ideal option for smaller groups within an organization. 
An independent data mart has neither a relationship with the enterprise data warehouse nor with any other data mart. In Independent data mart, the data is input separately, and its analyses are also performed autonomously. Implementation of independent data marts is antithetical to the motivation for building a data warehouse. First of all, you need a consistent, centralized store of enterprise data which can be analyzed by multiple users with different interests who want widely varying information. Operational Sources >> Independent datamarts Hybrid Data Mart: A hybrid data mart combines input from sources apart from Data warehouse. This could be helpful when you want ad-hoc integration, like after a new group or product is added to the organization. Products and Technologies You Need You could use the GUI or command line for data mart management. Best practices for Implementing Data Marts Following are the best practices that you need to follow while in the Data Mart Implementation process: • The source of a Data Mart should be departmentally structured. • The implementation cycle of a Data Mart should be measured in short periods of time, i.e., in weeks instead of months or years. • It is important to involve all stakeholders in planning and designing phase as the data mart implementation could be complex. • Data Mart Hardware/Software, Networking and Implementation costs should be accurately budgeted in your plan. • Even though if the Data mart is created on the same hardware, they may need some different software to handle user queries. Additional processing power and disk storage requirements should be evaluated for fast user response. • A data mart may be on a different location from the data warehouse. That’s why it is important to ensure that they have enough networking capacity to handle the Data volumes needed to transfer data to the data mart. • Implementation cost should budget the time taken for Datamart loading process. Load time increases with increase in complexity of the transformations. ADVANTAGES AND DISADVANTAGES OF A DATA MART Advantages • Data marts contain a subset of organization-wide data. This Data is valuable to a specific group of people in an organization. • It is cost-effective alternatives to a data warehouse, which can take high costs to build. • Data Mart allows faster access of Data. • Data Mart is easy to use as it is specifically designed for the needs of its users. Thus, a data mart can accelerate business processes. • Data Marts needs less implementation time compared to Data Warehouse systems. It is faster to implement Data Mart as you only need to concentrate the only subset of the data. • It contains historical data which enables the analyst to determine data trends. Disadvantages • Many a times enterprises create too many disparate and unrelated data marts without much benefit. It can become a big hurdle to maintain. • Data Mart cannot provide company- wide data analysis as their data set is limited. Summary: • Define Data Mart: A Data Mart is defined as a subset of Data Warehouse that is focused on a single functional area of an organization. • Data Mart helps to enhance user’s response time due to a reduction in the volume of data. • Three types of data mart are. 1) Dependent 2) Independent 3) Hybrid • Important implementation steps of Data Mart are. 
1) Designing 2) Constructing 3 Populating 4) Accessing and 5)Managing • The implementation cycle of a Data Mart should be measured in short periods of time, i.e., in weeks instead of months or years. • Data mart is cost-effective alternatives to a data warehouse, which can take high costs to build. • Data Mart cannot provide company- wide data analysis as data set is limited. MODULE 9 Types of Data Warehouse Schema – Star Schema – Snowflake Schema – Galaxy Schema Types of Data Warehouse Schema • Star Schema • Snowflake Schema • Galaxy Schema Star Schema Snowflake Schema Galaxy Schema Schema • A schema is a logical description that describes the entire database. • In the data warehouse there includes the name and description of records. • It has all data items and also different aggregates associated with the data. Star Schema • It is known as star schema as its structure resembles a star. • The star schema is the simplest type of Data Warehouse schema. • It is also known as Star Join Schema and is optimized for querying large data sets. • The center of the star can have one fact table and a number of associated dimension tables. Fact Table Fact Table Key Column -Foreign keys column allows joins with dimension tables - Example: Product id, time_id, location_id Measures or fact data - Measures columns contain the data that is being analyzed. - Measures those that contain numeric facts - Example: Sales_income, No_of_unit_sold, Net_profit • A Fact table in a Data Warehouse system is nothing but the table that contains all the facts or the business information, which can be subjected to analysis and reporting activities when required. • These tables hold fields that represent the direct facts, as well as the foreign fields that are used to connect the fact table with other dimension tables in the Data Warehouse system. • A Data Warehouse system can have one or more fact tables, depending on the model type used to design the Data Warehouse. Dimension Table • Dimension is a collection of reference information about a measurable in the fact table. • The primary key column of the dimension table has uniquely identifies each dimension record or row. • The dimension tables are organized has descriptive attributes. For example, a customer dimension’s attributes could (also known as Fact Constellation Schema). A schema is a logical description that describes the entire database. In the data warehouse there includes the name and description of records. It has all data items and also different aggregates associated with the data. Like a database has a schema, it is required to maintain a schema for a data warehouse as well. There are different schemas based on the setup and data which are maintained in a data warehouse. Types of Data Warehouse Schema Following are the three major types of schemas: • Star Schema • Snowflake Schema • Galaxy Schema There are fact tables and dimension tables that form the basis of any schema in the data warehouse that are important to be understood. The fact tables should have data corresponding data to any business process. Every row represents any event that can be associated with any process. It stores quantitative information for analysis. A dimension table stores data about how the data in fact table is being analyzed. They facilitate the fact table in gathering different dimensions on the measures which are to be taken. Let us have a look at all these in detail. 1. 
Star Schema - Here are some of the basic points of star schema which are as follows: • In a star schema, as the structure of a star, there is one fact table in the middle and a number of associated dimension tables. This structure resembles a star and hence it is known as a star schema. • The fact table here consists of primary information in the data warehouse. It surrounds the smaller dimension lookup tables which will have details for different fact tables. The primary key which is present in each dimension is related to a foreign key which is present in the fact table. • This infers that fact table has two types of columns having foreign keys to dimension tables and measures which contain numeric facts. At the center of the star, there is a fact table, and the points of the star are the dimension tables. • The fact tables are in 3NF form, and the dimension tables are in denormalized form. Every dimension in star schema should be represented by the only one-dimensional table. The dimension table should be joined to a fact table. The fact table should have a key and measure. 2. Snowflake Schema - Here are some of the basic points of snowflake schema which are as follows: • Snowflake schema acts like an extended version of a star schema. There are additional dimensions added to Star schema. This schema is known as snowflake due to its structure. • In this schema, the centralized fact table will be connected to different multiple dimensions. The dimensions present are in normalized form from the multiple related tables which are present. The snowflake structure is detailed and structured when compared to star schema. • There are multiple levels of relationships and child tables involved that have multiple parent tables. In snowflake schema, the affected tables are only the dimension tables and not the fact tables. • The difference between star and snowflake schema is that the dimensions of snowflake schema are maintained in such a way that they reduce the redundancy of data. The tables are easy to manage and maintain. They also save storage space. • However, due to this, it is needed to have more joins in the query in order to execute the query. The further expansion of the tables leads to snowflaking. When a dimension table has a low cardinality attribute of dimensions then it is said to be snowflaked. • The dimension tables have been divided into segregated normalized tables. Once they are segregated, they are further joined with the original dimension table which has a referential constraint. This schema may hamper the performance as the number of tables that are required are more so that the joins are satisfied. • The advantage of snowflake schema is that it uses small disk space. The implementation of dimensions is easy when they are added to this schema. The same set of attributes are published by different sources. 3. Fact Constellation Schema or Galaxy Schema - Here are some of the basic points of fact constellation schema which are as follows: • A fact constellation can consist of multiple fact tables. These are more than two tables that share the same dimension tables. This schema is also known as galaxy schema. • It is viewed as a collection of stars and hence the name galaxy. The shared dimensions in this schema are known as conformed dimensions. The dimensions in this schema are separated into segregated dimensions which are having different levels of hierarchy. 
• As an example, we can consider the four levels of hierarchy taking geography into consideration as region, country, state, and city. This galaxy schema has four dimensions. Another way of creating a galaxy schema is by splitting one-star schema into more star schemas. • The dimensions created as large and built on the basis of hierarchy. This schema is useful when aggregation of fact tables is necessary. Fact constellations are considered to be more complex than star and snowflake schemas. These are considered to be more flexible but hard to implement and maintain. • This type of schema is usually used for sophisticated applications. The multiple number of tables present in this schema makes it difficult and complex. Implementing this schema is hence difficult. The architecture is thus more complex when compared to star and snowflake schema. Conclusion Like the databases have relational schemas where all data is saved and maintained in the form of schemas, the data warehouse also uses the same concept to maintain the data. Having schemas makes it easier to maintain the data. There are three main types of schemas as discussed above. They mainly operate on fact tables and dimension tables. The star schema is the easiest of all schemas. It consists of one fact table surrounded by multiple dimension tables. The snowflake schema has multiple dimension tables that are in normalized form. The last type consists of multiple fact tables. All three schemas segregate data and help in filtering and managing data in an efficient way. These schemas thus play a major role in setting up any environment. The performance of queries can also be enhanced by using these schemas. STAR SCHEMA VS SNOWFLAKES SCHEMA – DIFFERENCE BETWEEN THEM Key Difference Between Star Schema and Snowflake Schema • The star schema is the simplest type of Data Warehouse schema. It is known as star schema as its structure resembles a star. • Comparing Snowflake vs Star schema, a Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. It is called snowflake because its diagram resembles a Snowflake. • In a star schema, only single join defines the relationship between the fact table and any dimension tables. • Star schema contains a fact table surrounded by dimension tables. • Snowflake schema is surrounded by dimension table which are in turn surrounded by dimension table. • A snowflake schema requires many joins to fetch the data. • Comparing Star vs Snowflake schema, Star schema has simple DB design, while Snowflake schema has very complex DB design. Star Schema Star Schema in data warehouse, in which the center of the star can have one fact table and a number of associated dimension tables. It is known as star schema as its structure resembles a star. The Star Schema data model is the simplest type of Data Warehouse schema. It is also known as Star Join Schema and is optimized for querying large data sets. In the following Star Schema example, the fact table is at the center which contains keys to every dimension table like Dealer_ID, Model ID, Date_ID, Product_ID, Branch_ID & other attributes like Units sold and revenue. Example of Star Schema Diagram Snowflake Schema Snowflake Schema in data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. The dimension tables are normalized which splits data into additional tables. 
In the following Snowflake Schema example, Country is further normalized into an individual table. Example of Snowflake Schema Difference between Star Schema and Snowflake Schema Following is a key difference between Snowflake schema vs Star schema: • It is basically based on the huge data that is called data warehouse; it collects the required data from the data warehouse and perform the business required analysis to take some decision in the business to improve in profit, to improve sale, to improve brand, to improve marketing and so all. OLAP Architecture Three Tier Data Warehouse Architecture Data Sources - Other Sources - Operational DBs >> Extract Transform Load Refresh >> Data Warehouse >> Data Marts Data Storage - Metadata <<>> Monitor & Integrate - Data Warehouse >> Data Marts >> Output >> Front-End Tools OLAP Engine - OLAP Server Front-End Tools - Analysis - Query - Reports - Data mining How Does It Work? • A Data warehouse would extract information from multiple data sources and formats like text files, excel sheet, multimedia files, etc. • The extracted data is cleaned and transformed. Data is loaded into an OLAP server (or OLAP cube) where information is pre-calculated in advance for further analysis. OLAP Cube • An OLAP Cube is a data structure that allows fast analysis of data according to the multiple Dimensions that define a business problem. A multidimensional cube for reporting sales might be, for example, composed of 7 Dimensions: Salesperson, Sales Amount, Region, Product, Region, Month, Year. BASIC ANALYTICAL OPERATIONS OF OLAP Roll-up - Also known as drill-up or consolidation, use to summarize operation data along with the dimension. roll-up on location (from cities to countries) Drill-down - To perform the analysis in deeper among the dimensions of data. For example, drilling down from “time period” to “years” and “months” and to “days” and so on to plot sales growth for a product. Drill down on time (from quarters to month) Slice - To perform the analysis, take one level of information for display, such as “sales in 2019.” slice for time = “Q1” Dice - To perform the analysis, select data from multiple dimensions to analyze, such as “Sales of Laptop in Region 4 in 2019.” Dice for (location = ‘Perth’ or ‘Sydney’) and (time = Q1 or Q2” and (Item = Books or “Clothes) Pivot - To perform the analysis that can gain a new view of data by rotating the data axes of the cube. Pivot Types of OLAP System • Relational OLAP (ROLAP) • Multidimensional OLAP (MOLAP) • Hybrid OLAP (HOLAP) • Desktop OLAP (DOLAP) • Web OLAP (WOLAP) • Mobile OLAP • Spatial OLAP (SOLAP) OLAP ROLAP – MOLAP – HOLAP – Others WOLAP – DOLAP – MOLAP – SOLAP TYPE OF OLAP EXPLANATION ROLAP utilize a relational database management system to keep and control the data. These are the servers that exist between the database and the user. ROLAP systems work on the information that resides in a relational database. MOLAP This server utilized a multi- dimensional Database (MDDB) for storing and analyzing information. MDDB can proficiently store summaries, giving a method for quick questioning and recovering information from the database for processing. HOLAP It is a blend of MOLAP and ROLAP. By utilizing both ROLAP and MOLAP information stores, Hybrid OLAP offers the qualities of both techniques. HOLAP stores data summaries in the binary files or in the pre- calculated cubes. It leaves the quantities of fact and dimension information in the relational database. 
DOLAP Desktop On-Line Analytical Processing (DOLAP) is a single tier, desktop-based OLAP technology. WOLAP Web OLAP which is OLAP system accessible via the web browser. WOLAP is a three- tiered architecture. It consists of three components: client, middleware, and a database server. Mobile OLAP Mobile OLAP helps users to access and analyze OLAP data using their mobile devices SOLAP SOLAP is created to facilitate management of both spatial and non-spatial data in OLAP, and statistical databases and OLTP. Types of OLAP Servers We have four types of OLAP servers − • Relational OLAP (ROLAP) • Multidimensional OLAP (MOLAP) • Hybrid OLAP (HOLAP) • Specialized SQL Servers Relational OLAP - ROLAP servers are placed between relational back-end server and client front- end tools. To store and manage warehouse data, ROLAP uses relational or extended-relational DBMS. ROLAP includes the following − • Implementation of aggregation navigation logic. • Optimization for each DBMS back end. • Additional tools and services. Multidimensional OLAP - MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP server use two levels of data storage representation to handle dense and sparse data sets. Hybrid OLAP - Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers higher scalability of ROLAP and faster computation of MOLAP. HOLAP servers allows to store the large data volumes of detailed information. The aggregations are stored separately in MOLAP store. Specialized SQL Servers - Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read- only environment. OLAP Operations Since OLAP servers are based on multidimensional view of data, we will discuss OLAP operations in multidimensional data. Here is the list of OLAP operations − • Roll-up • Drill-down • Slice and dice • Pivot (rotate) Roll-up - Roll-up performs aggregation on a data cube in any of the following ways − • By climbing up a concept hierarchy for a dimension • By dimension reduction - The following diagram illustrates how roll-up works. roll-up on location (from cities to countries) • Roll-up is performed by climbing up a concept hierarchy for the dimension location. • Initially the concept hierarchy was "street < city < province < country". • On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country. • The data is grouped into cities rather than countries. • When roll-up is performed, one or more dimensions from the data cube are removed. Drill-down - Drill-down is the reverse operation of roll- up. It is performed by either of the following ways − • By stepping down a concept hierarchy for a dimension • By introducing a new dimension. - The following diagram illustrates how drill-down works – Drill down on time (from quarters to month) • Drill-down is performed by stepping down a concept hierarchy for the dimension time. • Initially the concept hierarchy was "day < month < quarter < year." • On drilling down, the time dimension is descended from the level of quarter to the level of month. • When drill-down is performed, one or more dimensions from the data cube are added. • It navigates the data from less detailed data to highly detailed data. 
Slice - The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows how slice works. slice for time = “Q1” • Here Slice is performed for the dimension "time" using the criterion time = "Q1". • It will form a new sub-cube by selecting one or more dimensions. Dice - Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation. Dice for (location = ‘Perth’ or ‘Sydney’) and (time = Q1 or Q2” and (Item = Books or “Clothes) The dice operation on the cube based on the following selection criteria involves three dimensions. • (location = "Toronto" or "Vancouver") • (time = "Q1" or "Q2") • (item = "Mobile" or "Modem") Pivot - The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of data. Consider the following diagram that shows the pivot operation. Pivot OLAP vs OLTP Sr. No. Data Warehouse (OLAP) Operational Database (OLTP) 1 Involves historical processing of information. Involves day- to-day processing. 2 OLAP systems are used by knowledge workers such as executives, managers and analysts. OLTP systems are used by clerks, DBAs, or database professionals. 3 Useful in analyzing the business. Useful in running the business. 4 It focuses on Information out. It focuses on Data in. 5 Based on Star Schema, Snowflake, Schema and Fact Constellation Schema Based on Entity Relationship Model. 6 Contains historical data. Contains current data. superior processing capability to fulfill the processing requirements. Advantages and Disadvantages of OLAP Given below are the advantages and disadvantages mentioned: Advantages: • It can be used for data extraction or mining, analysis, and reporting to find the relationships between data items. • We can use ODBC (Open Database Connectivity) to import data from an existing relational to create an OLAP multidimensional database. • All transactional data is not required for trend analysis, so an OLAP database does not need to be as large as a data warehouse. Disadvantage: • Some of the disadvantages of OLAP are pre-modeling, which is a must, great dependence on IT, poor computation capability, slow in reacting, short of Interactive analysis ability, abstract model, and great potential risk. Some of the analytic tools (OLAP) are IBM Cognos, Micro Strategy, Palo OLAP Server, Apache Kylin, Oracle OLAP, icCube, Pentaho BI, JsHypercube, etc. Conclusion OLAP (Online Analytical Processing) is powerful technology behind many Business Intelligence (BI) applications that discovers data, report viewing capabilities, complex analytical calculations, and predictive scenario, budget planning, and forecast planning. It works as it first collects the data from multiple data sources (like a spreadsheet, video, XML, etc.) and is stored in data warehouses, which are then cleansed and organized into data cubes on which can run the user’s queries. The five types of analytical operations against the multidimensional databases that can perform are Roll-up, Drill-down, Slice, Dice, and Pivot. There are three OLAP types: MOLAP, ROLAP, and HOLAP. Some of the analytic tools (OLAP) are IBM Cognos, Micro Strategy, Palo OLAP Server, Apache Kylin, Oracle OLAP, icCube, Pentaho BI, JsHypercube, etc. 
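Although these operations are defined on a multidimensional cube, they can be approximated with plain SQL against a relational star schema, which is roughly what a ROLAP engine generates behind the scenes. The sketch below assumes hypothetical sales_fact, time_dim, location_dim, and item_dim tables; the table and column names are illustrative only.

-- Roll-up on location (from cities to countries): aggregate one level higher in the hierarchy.
SELECT l.country, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   location_dim l ON l.location_id = f.location_id
GROUP  BY l.country;

-- Drill-down on time (from quarters to months): descend one level in the hierarchy.
SELECT t.quarter, t.month, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   time_dim t ON t.time_id = f.time_id
GROUP  BY t.quarter, t.month;

-- Slice: fix a single member of one dimension (time = 'Q1').
SELECT l.city, i.item_name, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   time_dim t     ON t.time_id     = f.time_id
JOIN   location_dim l ON l.location_id = f.location_id
JOIN   item_dim i     ON i.item_id     = f.item_id
WHERE  t.quarter = 'Q1'
GROUP  BY l.city, i.item_name;

-- Dice: restrict several dimensions at once (location, time, and item).
SELECT l.city, t.quarter, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   time_dim t     ON t.time_id     = f.time_id
JOIN   location_dim l ON l.location_id = f.location_id
JOIN   item_dim i     ON i.item_id     = f.item_id
WHERE  l.city IN ('Perth', 'Sydney')
AND    t.quarter IN ('Q1', 'Q2')
AND    i.item_name IN ('Books', 'Clothes')
GROUP  BY l.city, t.quarter;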
TYPES OF OLAP SYSTEMS The Types of OLAP Systems OLAP systems vary quite a lot, and they have generally been distinguished by a letter tagged onto the front of the acronym “OLAP,” for On-Line Analytical Processing. MOLAP and ROLAP have classically been the most established types, and the other distinctions represent little more than the marketing programs on the part of the vendors to distinguish themselves, for example, SOLAP and DOLAP. Here, we aim to give you an idea of what these distinctions have meant. The New Direction in OLAP Technology: The newest software in the OLAP and Business Intelligence world combines, in real-time, the benefits of both relational tables and multidimensional business data modeling. The latest technology removes the proprietary format of its MOLAP predecessors by living/saving in source relational tables, like SQL Server. Lastly, new OLAP technology maintains a constant connection with existing back- end systems and delivers immediately responsive reports/analytics in Excel and other front-end tools (dashboards, query tools, etc.) If you like the sound of that, check out Olation® from PARIS Tech, the sponsor of OLAP.com. MAJOR OLAP TECHNOLOGY TYPES: Hybrid Transaction / Analytical Processing (HTAP) - Gartner coined the term HTAP in a paper in the beginning of 2014 to describe new in-memory data systems that do both online transaction processing (OLTP) and online analytical processing (OLAP). - HTAP relies on newer and much more powerful, often distributed, processing: sometimes it involves a new hardware “appliance”, and it almost always requires a new software platform. Beyond this, the key point seems to be that all the technology is sited in the relational database. And so, there’s no more data replication, and new transactional information becomes part of an analytical model in as fast a time as is technologically possible. - HTAP represents a new way to tie data together in a way that hasn’t been possible before– a real uniting of relational data stored in tables with the data models that are used for decision making by the business leaders. - For an example of an HTAP product, check out Olation® from PARIS Tech, the sponsor of OLAP.com. Olation can be categorized as an HTAP product — even the name Olation implies the combination of “OLAP” and “relational” technologies. Multidimensional OLAP (MOLAP) – Cube based. - MOLAP products enable end-users to model data in a multidimensional environment, rather than providing a multidimensional view of relational data, as ROLAP products do (see next tab). - The structure of a multidimensional model is not a series of tables (as exists in a relational database) but what is generally referred to as a cube. Cubes modeled in a multidimensional database extend the concept associated with spreadsheets: just as a cell in a spreadsheet represents the intersection of two dimensions (sales of product by region), a cell in a cube represents the intersection of an infinite number of dimension members (e.g., Products, Customers, Regions, Months …nth dimension). As in a spreadsheet, a cell might be calculated by formulas involving other cells. - In short, multidimensional databases allow users to add extra dimensions, rather than additional tables, as in a relational model. And the MOLAP cube structure allows for particularly fast, flexible data-modeling and calculations. 
For one, locating cells is vastly simplified—an application can identify a cell location by name (at the intersection of dimension members) rather than by searching an index or the entire model (via SQL SELECT statements), as in a relational database. Further, multidimensional models incorporate advanced array-processing techniques and algorithms for managing data and calculations. As a result, multidimensional databases can store data very efficiently and process calculations in a fraction of the time required of relational-based products. - What are the perceived drawbacks of MOLAP tools? - For one, relevant data must be transferred from relational systems, which is aa potentially “redundant” re-creation of data in another (multidimensional) database. Once data has been transferred, there may be no simple means for updating the MOLAP “engine” as individual transactions are recorded by the RDBMS. Also, MOLAP products are typically proprietary systems. For some IT departments, introducing a new database system is an anathema, even if it means significantly greater productivity for the type of planning, analysis, and reporting that end-users rely on the (MOLAP) solution to perform. - For a good example of a fast, scalable MOLAP product, check out PowerOLAP® from PARIS Tech, the sponsor of OLAP.com. Relational OLAP (ROLAP) – Star Schema based. - ROLAP products (for Relational OLAP) are credited with being able to directly access data stored in relational databases. The notion is that they can readily retrieve transactional data, although this becomes suspect when very large data sets are in play, or if more complex calculations are to be delivered, based on the transactional data. ROLAP products enable organizations to leverage their existing investments in RDBMS (relational database management system) software. - ROLAP products access a relational database by using SQL (structured query language), which is the standard language that is used to define and manipulate data in an RDBMS. Subsequent processing may occur in the RDBMS or within a mid-tier server, which accepts requests from clients, translates them into SQL statements, and passes them on to the RDBMS. - ROLAP products provide GUIs and generate SQL execution plans that typically remove end-users from the SQL writing process. However, this over- reliance on processing via SQL statements—including processing for multidimensional analysis—is a drawback. Whether it is generated “transparently” or not, SQL is the language of relational tables: SQL’s vocabulary is limited and its grammar often inflexible, at least to accommodate the most sophisticated modeling required for multidimensional analyses. - There are further drawbacks to structuring a multidimensional model solely within relational tables: Before end- users can submit requests, the relevant dimension data must be extracted and reformatted in de-normalized structures known as star schema or snowflakes (so- called because of the way the tables are conjoined). These tabular structures are necessary to provide acceptable analytical performance. Sophisticated ROLAP applications also require that aggregate tables be pre-built and maintained, eliminating the need to process summary data at runtime - One advantage of ROLAP over the other styles of OLAP analytic tools is that it is deemed to be more scalable in handling huge amounts of data. ROLAP sits on top of relational databases therefore enabling it to leverage several functionalities that a relational database is capable of. 
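The pre-built aggregate tables mentioned above are commonly implemented in Oracle as materialized views, so that summary rows are maintained ahead of time instead of being recomputed from the detail at query time. A minimal sketch, again assuming the hypothetical sales_fact, location_dim, and time_dim tables used in the earlier examples:

```sql
-- Hypothetical pre-computed aggregate for a ROLAP deployment: sales summarized
-- by country and quarter, refreshed on demand from the detailed fact table.
CREATE MATERIALIZED VIEW sales_by_country_qtr
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT l.country,
       t.quarter,
       SUM(f.amount) AS total_sales,
       COUNT(*)      AS row_cnt
FROM   sales_fact f
JOIN   location_dim l ON l.location_id = f.location_id
JOIN   time_dim     t ON t.time_id     = f.time_id
GROUP  BY l.country, t.quarter;
```

With query rewrite enabled, the optimizer can answer country/quarter queries from the small summary instead of scanning the detailed fact table, which is the effect ROLAP tools aim for with their pre-built aggregates.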
Hybrid OLAP (HOLAP) - HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP into a single architecture. This kind of tool tries to bridge the technology gap of both products by enabling access to or use of both multidimensional database (MDDB) and Relational Database Management System (RDBMS) data stores. HOLAP systems store larger quantities of detailed data in the relational tables while the aggregations are stored in the pre- calculated cubes. HOLAP also has the capacity to “drill through” from the cube down to the relational tables for delineated data. Some of the advantages of this system are better scalability, quick data processing and flexibility in accessing of data sources. The issue with HOLAP systems lies precisely in the fact that they are hybrids: at best they partake of the strengths of other systems…but they also Anomaly Detection. - It is used to determine when something is noticeably different from the regular pattern. It is used to eliminate any database inconsistencies or anomalies at the source. Regression Analysis. - This technique is used to make predictions based on relationships within the data set. Classification. - This deals with the things which have labels on it. Note in cluster detection, the things did not have a label in it and by using data mining we had to label and form into clusters, but in classification, there is information existing that can be easily classified using an algorithm. Associative Learning. - It is used to analyze which things tend to occur together either in pairs or larger groups. Example of Data Mining Tools • Rapid Miner • Orange • Weka • Knime • Sisense • SQL Server Data Tools (SSDT) • Apache Mahout • Oracle Data Mining • Rattle • DataMelt • IBM Cognos • IBM SPSS Modeler • SAS Data Mining • Teradata • Board • Dundas BI Benefits of Data Mining • Data mining technique helps companies to get knowledge-based information. • Data mining helps organizations to make the profitable adjustments in operation and production. • The data mining is a cost-effective and efficient solution compared to other statistical data applications. • Data mining helps with the decision- making process. • Facilitates automated prediction of trends and behaviors as well as automated discovery of hidden patterns. • It can be implemented in new systems as well as existing platforms • It is the speedy process which makes it easy for the users to analyze huge amount of data in less time. Disadvantages of Data Mining • There are chances of companies may sell useful information of their customers to other companies for money. • Many data mining analytics software is difficult to operate and requires advance training to work on. • Different data mining tools work in different manners due to different algorithms employed in their design. Therefore, the selection of correct data mining tool is a very difficult task. • The data mining techniques are not accurate, and so it can cause serious consequences in certain conditions. Data Mining Applications • Communications • Insurance • Education • Manufacturing • Banking • Retail • Service Providers • E-Commerce • Super Markets • Crime Investigation • Bioinformatics Applications Usage Communications Data mining techniques are used in communication sector to predict customer behavior to offer highly targeted and relevant campaigns. Insurance Data mining helps insurance companies price their products profitable and promote new offers to their new or existing customers. 
Education: Data mining helps educators access student data, predict achievement levels, and find students or groups of students who need extra attention, for example students who are weak in math.
Manufacturing: With the help of data mining, manufacturers can predict the wear and tear of production assets. They can anticipate maintenance, which helps them minimize downtime.
Banking: Data mining helps the finance sector get a view of market risks and manage regulatory compliance. It helps banks identify probable defaulters and decide whether to issue credit cards, loans, etc.
Retail: Data mining techniques help retail malls and grocery stores identify and arrange their best-selling items in the most prominent positions. It also helps store owners come up with offers that encourage customers to increase their spending.
Service Providers: Service providers such as mobile phone and utility companies use data mining to predict the reasons why a customer leaves their company. They analyze billing details, customer service interactions, and complaints made to the company to assign each customer a probability score and offer incentives.
E-Commerce: E-commerce websites use data mining to offer cross-sells and up-sells through their websites. One of the most famous names is Amazon, which uses data mining techniques to get more customers into its e-commerce store.
Super Markets: Data mining allows supermarkets to develop rules that predict whether their shoppers are likely to be expecting. By evaluating buying patterns, they can find women customers who are most likely pregnant and start targeting products like baby powder, diapers, and so on.
Crime Investigation: Data mining helps crime investigation agencies decide where to deploy the police workforce (where is a crime most likely to happen, and when?), whom to search at a border crossing, etc.
Bioinformatics: Data mining helps extract biological knowledge from the massive datasets gathered in biology and medicine.
Data Mining Implementation Process (CRISP-DM)
1. Business Understanding - Focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan.
2. Data Understanding - Starts with an initial data collection and proceeds with activities to get familiar with the data, identify data quality problems, discover first insights into the data, or detect interesting subsets to form hypotheses about hidden information.
3. Data Preparation - Covers all activities needed to construct the final dataset from the initial raw data (a SQL sketch of typical preparation steps follows this list).
4. Modeling - Modeling techniques are selected and applied. Since some techniques, like neural nets, have specific requirements regarding the form of the data, there can be a loop back here to data preparation.
5. Evaluation - Once one or more models have been built that appear to have high quality based on whichever loss functions have been selected, they need to be tested to ensure they generalize against unseen data and that all key business issues have been sufficiently considered. The end result is the selection of the champion model(s).
6. Deployment - Consists of presenting the results in a useful and understandable manner; by achieving this, the project should achieve its goals. It is the only step that does not belong to the cycle.
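As a concrete illustration of the data preparation phase referenced above, simple cleansing steps can be written directly in SQL: profiling missing values, filling them with a default, and screening obvious outliers before modeling. The snippet below is a sketch against a hypothetical customer_txn staging table; the table, columns, and thresholds are illustrative assumptions, not part of the reviewer.

```sql
-- Hypothetical staging table: customer_txn(txn_id, customer_id, txn_amount, txn_date).

-- Profile missing values before deciding how to treat them.
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN txn_amount IS NULL THEN 1 ELSE 0 END) AS missing_amounts
FROM   customer_txn;

-- Build a cleaned copy: replace missing amounts with 0 (NVL) and drop rows
-- whose amount falls outside a plausible range (a crude outlier screen).
CREATE TABLE customer_txn_clean AS
SELECT txn_id,
       customer_id,
       NVL(txn_amount, 0) AS txn_amount,
       txn_date
FROM   customer_txn
WHERE  txn_amount IS NULL
   OR  txn_amount BETWEEN 0 AND 1000000;
```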
Knowledge Discovery in Databases (KDD)
• Some people don't differentiate data mining from knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery.
• The term knowledge discovery in databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the high-level application of particular data mining methods. It is of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization.
• The ultimate goal is to extract high-level knowledge from low-level data.
9-STEP KDD PROCESS, STARTING FROM THE MANAGERIAL STEP
Step 1. Developing an understanding of the application domain.
This is the preparatory step that sets the scene for understanding what should be done with transformation, algorithms, and representation. Those in charge of the KDD project need to understand and define the goals of the end user, the environment in which the knowledge discovery process will take place, and other relevant prior knowledge.
Step 2. Selecting and creating a data set on which discovery will be performed, based on the goals.
Determine what data will be used for the knowledge discovery: what data is available, what additional data must be obtained, and how to integrate all the data for the knowledge discovery into one data set, including the attributes that will be considered for the process. This step is very important because data mining learns and discovers from the available data; it is the evidence base for constructing the models. If some important attributes are missing, the entire study may fail. In this respect, the more attributes considered, the better.
Step 3. Pre-processing and cleansing.
Data reliability is enhanced in this stage. It includes data cleaning, such as handling missing values and removing outliers. It may involve complex statistical methods or the use of a data mining algorithm in this context. For example, if one suspects that a certain attribute is of insufficient reliability or has many missing values, then this attribute could become the goal of a supervised data mining algorithm: a prediction model for the attribute is developed, and the missing values can then be predicted.
Step 4. Data transformation.
In this stage, better data for the data mining is generated and prepared. Methods here include dimension reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformation).
Step 5. Choosing the appropriate data mining task.
We are now ready to decide which type of data mining to use, for example classification, regression, or clustering. This mostly depends on the KDD goals and on the previous steps. There are two major goals in data mining: prediction and description. Prediction is often referred to as supervised data mining, while descriptive data mining includes the unsupervised and visualization aspects of data mining.
Step 6. Choosing the data mining algorithm.
Now that we have the strategy, we can decide on the tactics. This stage includes selecting the specific method to be used for searching patterns, including multiple inducers. For example, when considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees.
Step 7. Employing the data mining algorithm.
Finally, you can implement the appropriate data mining algorithm. In this step we might need to employ the algorithm several times until a satisfying the result is obtained. For instance, by tuning the algorithms control parameters, such as the minimum number of instances in a single leaf of a decision tree. Step 8. Evaluation In this stage we evaluate and interpret the mined patterns with respect to the goals defined in the first step. Here we consider the pre-processing steps with respect to their effect on the data mining algorithm results. This step focuses on the comprehensible nature and usefulness of the induced model. In this step, the discovered knowledge is also documented for further usage. Step 9. Using the discovered knowledge. We’re now ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. Actually, the success of this step determines the effectiveness of the entire KDD process. SUPPLEMENTARY INTRODUCTION TO DATA MINING Introduction to Data Mining Here in this article, we will learn about the introduction to Data Mining as humans have been mining from the earth for centuries to get all sorts of valuable materials. Sometimes while mining, things are discovered from the ground, which no one expected to find in the first place. For example, in 1898, during the excavation of a tomb to find mummies in Saqqara, Egypt, a wooden artefact was found which exactly resembled an aeroplane. It was dated back to 200 BC, about 2200 years ago! But what possible information could we get from a large set of data? And even if we start mining it, are there any chances of getting any unexpected results from the data set? Before that, let’s go into what exactly is Data Mining. Data Mining • It is basically the extraction of vital information/knowledge from a large set of data. • Think of data as a large ground/rocky surface. We don’t know what is inside it; we don’t know if something useful is beneath the rocks. • In this introduction to data mining, we are looking for hidden information, but without any idea about what type of information we want to find and what we plan to use it for once we find it. • Just like in the concept of traditional mining, in data mining also there are various techniques and tools, which vary according to the type of data we are mining, so we have cleared that what is data mining through this topic of introduction to data mining. Example An example of data mining is as shown below: • So, there is a Mobile network operator. They consult a data-miner to dig into the call records of the operator. No specific targets are given to the Data Miner. • A quantitative target of finding at least 2 new patterns in a month is given. • As the data miner starts digging into the data, he finds a pattern that there are fewer international calls on Wednesday than on other days. • This information is shared with the management, and they come up with the plan to reduce the international call rates on Wednesdays and start a campaign. close to bread, lemons are placed alongside oranges and so on. Data Mining Ethical So, I plan on a weekend trip to Goa with a friend; I search the internet for right places to visit in Goa. The next time I open the internet, I find ads about various hotels in Goa for staying. Good Thing Yes, the internet has helped me simplify my trip. 
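The weekday pattern the data miner found in the story above is the kind of result a simple exploratory aggregate can surface. A minimal sketch, assuming a hypothetical call_records table with call_date and call_type columns (names are illustrative only):

```sql
-- Count international calls per day of the week to expose weekday patterns,
-- such as unusually few international calls on Wednesdays.
SELECT TO_CHAR(call_date, 'DY') AS day_of_week,
       COUNT(*)                 AS intl_calls
FROM   call_records
WHERE  call_type = 'INTERNATIONAL'
GROUP  BY TO_CHAR(call_date, 'DY')
ORDER  BY intl_calls;
```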
After all, if I decide to visit Goa, I would need to sleep somewhere and an ad showing me a hotel is much more useful than an ad showing me random clothes to buy. Bad Thing Yes! Why would a data mining company I have never heard of before, know where I am going on a vacation. What if I haven’t told anyone about this trip, but here the internet suddenly knows I am going there. The truth is the business model of the data mining company depends on this. They collect this data via cookies and scripts, then they sell it to advertisers who, in turn, try to sell me something else (In this case, a hotel room). So, it could be good or bad depending on the way we look at it. Also, we could always switch off the cookies or go incognito in the above case. Though whatever be the case, one thing is for sure. Data mining is here to stay. DATA MINING TUTORIAL: WHAT IS DATA MINING? TECHNIQUES, PROCESS Data Mining Data Mining is a process of finding potentially useful patterns from huge data sets. It is a multi-disciplinary skill that uses machine learning, statistics, and AI to extract information to evaluate future events probability. The insights derived from Data Mining are used for marketing, fraud detection, scientific discovery, etc. Data Mining is all about discovering hidden, unsuspected, and previously unknown yet valid relationships amongst the data. Data mining is also called Knowledge Discovery in Data (KDD), Knowledge extraction, data/pattern analysis, information harvesting, etc. In this Data Mining tutorial, you will learn the fundamentals of Data Mining like- • What is Data Mining? • Types of Data • Implementation Process of Data Mining o Business understanding: o Data understanding: o Data preparation: o Data transformation: o Modelling: • Data Mining Techniques • Challenges of Implementation of Data Mine: • Data Mining Examples: • Data Mining Tools • Benefits of Data Mining: • Disadvantages of Data Mining • Data Mining Applications Types of Data Data mining can be performed on following types of data. • Relational databases • Data warehouses • Advanced DB and information repositories • Object-oriented and object-relational databases • Transactional and Spatial databases • Heterogeneous and legacy databases • Multimedia and streaming database • Text databases • Text mining and Web mining Implementation Process of Data Mining Data Mining Implementation Process Business Understanding >> Data Understanding >> Data Preparation >> Modeling >> Evaluation >> Deployment Let’s study the Data Mining implementation process in detail.