Business Intelligence & Data Warehouses
Overview
Business Intelligence (BI) is a comprehensive, cohesive, and integrated set of tools and processes used to capture, collect, integrate, store, and analyze data with the purpose of generating and presenting information to support business decision making.
BI is a framework that allows a business to transform data into information, information into knowledge, and knowledge into wisdom. BI has the potential to positively affect a company's culture by creating continuous business performance improvement through active decision support at all levels in an organization.
The Need for Data Analysis
Organizations need to track daily transactions and analyze data to answer strategic questions such as:
- Are our sales promotions working?
- What market percentage are we controlling?
- Are we attracting new customers?
- What products are performing best?
- Which regions are most profitable?
Decision makers can no longer wait days for reports - they need quick decisions to remain competitive. The key is having the right data at the right time to support the decision-making process.
Business Intelligence
Business Intelligence Framework
BI provides a framework for:
- Collecting and storing operational data
- Aggregating operational data into decision support data
- Analyzing decision support data to generate information
- Presenting information to end users to support business decisions
- Monitoring results to evaluate outcomes
- Predicting future behaviors and outcomes
Business Intelligence Architecture
The BI architecture includes six basic components:
- ETL Tools: Data extraction, transformation, and loading tools that collect, filter, integrate, and aggregate internal and external data
- Data Store: Optimized for decision support, typically a data warehouse or data mart
- Query and Reporting: Tools for data selection, retrieval, and report creation
- Data Visualization: Presents data in meaningful ways (summary reports, maps, graphs, dashboards)
- Data Mining: Automated discovery of patterns and relationships in data
- OLAP Tools: Online Analytical Processing for multidimensional data analysis
Business Intelligence Benefits
- Improved decision making through better information
- Increased organizational efficiency
- Competitive advantage through data-driven insights
- Better customer understanding and service
- Identification of new business opportunities
- Reduced costs through better resource allocation
Operational Data vs Decision Support Data
Operational Data
Data used in day-to-day operations:
- Current and up-to-date
- Stored in normalized structures
- Optimized for transaction processing
- Detailed and atomic
- High update frequency
Decision Support Data
Data optimized for analysis and decision making:
- Historical and time-variant
- Stored in denormalized structures
- Optimized for query performance
- Aggregated and summarized
- Periodic updates (batch processing)
The Data Warehouse
A data warehouse is a specialized database that stores data in a format optimized for decision support. It contains historical data obtained from operational databases as well as data from other external sources.
Characteristics of Data Warehouses
- Subject-oriented: Organized around subjects (customers, products, sales)
- Integrated: Data from multiple sources is consistent and integrated
- Time-variant: Contains historical data for trend analysis
- Nonvolatile: Data is read-only and not updated frequently
- Summarized: Contains pre-aggregated data for fast queries
Twelve Rules That Define a Data Warehouse
- Data warehouse and operational environments are separated
- Data warehouse data is integrated
- Data warehouse contains historical data
- Data warehouse data is snapshot data captured at a point in time
- Data warehouse data is subject-oriented
- Data warehouse data is primarily read-only
- Data warehouse development life cycle differs from operational systems
- Data warehouse contains data with different levels of detail
- Data warehouse environment is characterized by read-only transactions
- Data warehouse environment has system of records
- Data warehouse environment has metadata
- Data warehouse contains a chargeback mechanism
Data Marts
A data mart is a small, single-subject data warehouse subset that provides decision support to a small group of people. Data marts can be:
- Independent: Created from operational data
- Dependent: Created from a data warehouse
- Hybrid: Combination of operational and warehouse data
Star Schemas
A star schema is a data modeling technique used in data warehouses. It consists of:
Facts
Facts are numeric measurements (values) that represent a specific business aspect or activity. Examples include:
- Sales amount
- Quantity sold
- Cost
- Profit margin
Dimensions
Dimensions are qualifying characteristics that provide additional perspectives to a given fact. Examples include:
- Time dimension (date, month, quarter, year)
- Location dimension (region, state, city)
- Product dimension (product category, product name)
- Customer dimension (customer type, customer segment)
Attributes
Attributes are characteristics of a dimension. For example, in a Time dimension, attributes might include:
- Date
- Month
- Quarter
- Year
- Day of week
Attribute Hierarchies
Attribute hierarchies provide top-down data organization. For example:
Time Dimension:
Year
└─ Quarter
└─ Month
└─ Week
└─ Day
Location Dimension:
Country
└─ Region
└─ State
└─ City
Star Schema Representation
In a star schema, facts are placed in a central fact table, and dimensions are placed in dimension tables. The fact table is typically very large, while dimension tables are smaller.
Fact Table (SALES):
- Time_ID (FK)
- Product_ID (FK)
- Location_ID (FK)
- Customer_ID (FK)
- Sales_Amount
- Quantity_Sold
- Cost
Dimension Tables:
- TIME (Time_ID, Date, Month, Quarter, Year)
- PRODUCT (Product_ID, Product_Name, Category)
- LOCATION (Location_ID, City, State, Region)
- CUSTOMER (Customer_ID, Customer_Name, Segment)
Online Analytical Processing (OLAP)
Online Analytical Processing (OLAP) is a set of tools that provide advanced data analysis for retrieving, processing, and modeling data from the data warehouse.
Multidimensional Data Analysis Techniques
- Slice: View data for a single dimension value
- Dice: View data for multiple dimension values
- Roll-up: Aggregate data at higher levels
- Drill-down: Navigate to more detailed levels
- Pivot: Rotate the view of data
OLAP Architecture
OLAP systems can be implemented using different architectures:
Relational OLAP (ROLAP)
Uses relational databases to store multidimensional data. Data is stored in star schema structures. Advantages include scalability and ability to handle large data volumes.
Multidimensional OLAP (MOLAP)
Uses multidimensional data structures (cubes) to store data. Advantages include fast query performance and intuitive data representation.
Hybrid OLAP (HOLAP)
Combines ROLAP and MOLAP approaches, using the best features of both.
ROLAP vs MOLAP
| Aspect | ROLAP | MOLAP |
|---|---|---|
| Storage | Relational database | Multidimensional cube |
| Scalability | High | Limited |
| Query Performance | Good | Excellent |
| Data Volume | Very large | Small to medium |
SQL Extensions for OLAP
ROLLUP Extension
The ROLLUP extension generates subtotals and grand totals:
SELECT Region, State, SUM(Sales_Amount) AS Total_Sales
FROM SALES
GROUP BY ROLLUP(Region, State);
-- Generates:
-- - Sales by Region and State
-- - Sales by Region (subtotal)
-- - Grand total
CUBE Extension
The CUBE extension generates all possible combinations of grouping columns:
SELECT Product, Region, SUM(Sales_Amount) AS Total_Sales
FROM SALES
GROUP BY CUBE(Product, Region);
-- Generates all combinations:
-- - Sales by Product and Region
-- - Sales by Product
-- - Sales by Region
-- - Grand total
Materialized Views
Materialized views store pre-computed query results to improve query performance:
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT
YEAR(Sale_Date) AS Year,
MONTH(Sale_Date) AS Month,
SUM(Sales_Amount) AS Total_Sales
FROM SALES
GROUP BY YEAR(Sale_Date), MONTH(Sale_Date);
-- Query uses pre-computed results
SELECT * FROM monthly_sales;
Best Practices
- Start with business requirements: Understand what decisions need to be made
- Design for performance: Optimize star schemas and materialized views
- Ensure data quality: Clean and validate data during ETL
- Use appropriate OLAP tools: Choose ROLAP, MOLAP, or HOLAP based on needs
- Monitor and tune: Continuously monitor query performance
- Document everything: Maintain metadata and documentation
Next Steps
Explore modern data challenges with Big Data & NoSQL, or learn about connecting databases to applications with Database Connectivity.