Distributed Database Management Systems
Overview
A Distributed Database Management System (DDBMS) manages a database that is distributed across multiple computer sites within a computer network. Unlike a centralized database system, where all data is stored at a single location, a distributed database allows data to be stored and accessed from multiple locations while appearing as a single logical database.
The Evolution of Distributed Database Management Systems
Distributed database systems evolved from centralized database systems as organizations grew and needed to:
- Access data from multiple geographic locations
- Improve performance by locating data closer to users
- Enhance reliability through data replication
- Support organizational decentralization
- Reduce communication costs
DDBMS Advantages and Disadvantages
Advantages
- Data are located near the site of greatest demand: Data sets can be placed close to the users most likely to need them
- Faster data access: Reduced network latency for local queries
- Faster data processing: Parallel processing across sites
- Growth facilitation: Easier to add new sites and expand
- Improved communications: Smaller local sites are closer to end users, which fosters better communication among departments and with customers
- Reduced operating costs: Lower communication costs
- User-friendly interface: Users interact with a single logical database
- Less danger of a single-point failure: System continues if one site fails
- Processor independence: Can use different hardware/software at each site
Disadvantages
- Complexity of management and control: More complex than centralized systems
- Security: More difficult to secure data across multiple sites
- Lack of standards: Fewer standards compared to centralized systems
- Increased storage requirements: Data replication increases storage needs
- Increased training costs: Personnel need training on distributed systems
- Higher software cost and complexity: More expensive and complex software
Distributed Processing vs Distributed Databases
It's important to distinguish between distributed processing and distributed databases:
Distributed Processing
Distributed processing shares the database's logical processing among multiple sites connected to a network. The database itself remains centralized, but processing is distributed.
Distributed Database
A distributed database is a single logically related database that is stored across two or more physically independent sites. Both data and processing are distributed.
Characteristics of Distributed Database Management Systems
A true DDBMS must have the following characteristics:
- Application Transparency: Applications don't need to know where data is located
- Transaction Transparency: Transactions can execute across multiple sites
- Distribution Transparency: Users see a single logical database
- Failure Transparency: System continues operating despite site failures
- Performance Transparency: System performs as if it were centralized
- Heterogeneity Transparency: Supports different hardware/software platforms
DDBMS Components
A distributed database system includes:
- Computer workstations or remote devices: End-user access points
- Network hardware and software: Communication infrastructure
- Communications media: Physical connections between sites
- Transaction processor (TP) or Transaction Manager (TM): Software component that processes transactions
- Data processor (DP) or Data Manager (DM): Software component that stores and retrieves data
Levels of Data and Process Distribution
Single-Site Processing, Single-Site Data (SPSD)
All processing is done on a single CPU or host computer, and all data are stored on the host computer's local disk. This is the simplest type of system - essentially a centralized system.
Multiple-Site Processing, Single-Site Data (MPSD)
Multiple processes run on different computers sharing a single data repository. The data are stored on a single computer but can be accessed from multiple computers. This is typical of client/server systems.
Multiple-Site Processing, Multiple-Site Data (MPMD)
Fully distributed database system with data and processing distributed across multiple sites. This is a true DDBMS where each site can process local transactions and participate in global transactions.
Distributed Database Transparency Features
Transparency features hide the complexity of distribution from users and applications:
Distribution Transparency
Users see the database as a single, logical entity. The system handles:
- Fragmentation transparency: Users don't need to know data is fragmented
- Location transparency: Users don't need to know where data is located
- Local mapping transparency: System automatically maps logical to physical locations
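Location and local mapping transparency can be pictured as a global catalog lookup that the transaction processor performs on the user's behalf. The following is a minimal Python sketch; the catalog contents and the names `CATALOG` and `resolve_site` are illustrative, not from any real DDBMS.

```python
# Minimal sketch of location transparency: a global catalog maps each
# logical fragment (table + optional predicate) to its physical site,
# so user queries name only the table, never a site.

CATALOG = {
    ("CUSTOMER", "REGION=1"): "site_A",
    ("CUSTOMER", "REGION=2"): "site_B",
    ("INVOICE", None): "site_B",
}

def resolve_site(table, predicate=None):
    """Return the site holding the fragment; users never call this directly."""
    return CATALOG[(table, predicate)]

# The user writes 'SELECT * FROM CUSTOMER WHERE REGION = 2';
# the transaction processor consults the catalog and routes to site_B.
print(resolve_site("CUSTOMER", "REGION=2"))  # site_B
```

A user query never mentions `site_A` or `site_B`; only the catalog changes when fragments are moved, which is exactly what location transparency requires.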
Transaction Transparency
Ensures that distributed transactions maintain ACID properties across all sites:
Distributed Requests
A single SQL statement can access data at multiple sites:
SELECT *
FROM CUSTOMER_AT_SITE1 C, INVOICE_AT_SITE2 I
WHERE C.CUS_CODE = I.CUS_CODE;
Distributed Transactions
Multiple SQL statements that access data at multiple sites:
BEGIN TRANSACTION;
UPDATE ACCOUNT_AT_SITE1 SET BALANCE = BALANCE - 100;
UPDATE ACCOUNT_AT_SITE2 SET BALANCE = BALANCE + 100;
COMMIT;
Two-Phase Commit Protocol
The two-phase commit protocol ensures atomicity in distributed transactions:
- Phase 1 - Preparation: The coordinator asks every participating site to prepare. Each site locks its resources, writes its transaction log entries to stable storage, and votes YES or NO.
- Phase 2 - Final Commit: If every site votes YES, the coordinator instructs all sites to commit. If any site votes NO or fails to respond, the transaction is rolled back at all sites.
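The two phases can be sketched as a coordinator loop over participant votes. This is a toy Python illustration of the protocol's control flow, with invented `Site` objects; a real implementation would also handle timeouts, logging, and coordinator failure.

```python
# Toy sketch of the two-phase commit protocol (coordinator side).

class Site:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "ACTIVE"
    def prepare(self):                 # Phase 1: lock, log, vote
        self.state = "PREPARED" if self.can_commit else "FAILED"
        return self.can_commit
    def commit(self):
        self.state = "COMMITTED"
    def rollback(self):
        self.state = "ROLLED BACK"

def two_phase_commit(participants):
    # Phase 1 - Preparation: collect a vote from every site.
    votes = [p.prepare() for p in participants]
    # Phase 2 - Final commit: commit only if ALL sites voted YES;
    # otherwise roll back everywhere, preserving atomicity.
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMITTED"
    for p in participants:
        p.rollback()
    return "ROLLED BACK"

print(two_phase_commit([Site(True), Site(True)]))   # COMMITTED
print(two_phase_commit([Site(True), Site(False)]))  # ROLLED BACK
```

Note that a single NO vote aborts the transaction at every site: partial commits are never visible.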
Distributed Concurrency Control
Distributed systems require sophisticated concurrency control mechanisms:
- Distributed lock manager: Manages locks across multiple sites
- Two-phase locking: Ensures serializability across sites
- Timestamp-based methods: Use timestamps to order transactions
- Replication control: Manages consistency across replicated data
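As one concrete example of these mechanisms, basic timestamp ordering can be sketched in a few lines of Python: each data item remembers the largest read and write timestamps it has seen, and any operation that arrives "too late" forces its transaction to roll back. This is a simplified single-item illustration, not a full scheduler.

```python
# Sketch of timestamp-based concurrency control (basic timestamp ordering).

class Item:
    def __init__(self):
        self.read_ts = 0    # largest timestamp that has read this item
        self.write_ts = 0   # largest timestamp that has written this item
        self.value = None

def write(item, ts, value):
    # Reject the write if a younger transaction already read or wrote
    # the item; the update would arrive out of timestamp order.
    if ts < item.read_ts or ts < item.write_ts:
        return "ROLLBACK"
    item.write_ts = ts
    item.value = value
    return "OK"

def read(item, ts):
    if ts < item.write_ts:
        return "ROLLBACK"   # the transaction would read from its future
    item.read_ts = max(item.read_ts, ts)
    return "OK"

x = Item()
print(write(x, ts=5, value=10))  # OK
print(read(x, ts=3))             # ROLLBACK: a younger write (ts=5) exists
```

Because transactions are ordered by timestamp rather than by locks, this scheme needs no distributed lock manager and cannot deadlock, at the cost of more rollbacks.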
Performance and Failure Transparency
Performance Transparency
The distributed database performs as if it were a centralized database. The system optimizes:
- Query optimization across sites
- Data allocation and replication
- Network traffic minimization
- Load balancing
Failure Transparency
The system continues operating even if one or more sites fail:
- Automatic detection of site failures
- Replication provides redundancy
- Transaction rollback and recovery
- Automatic failover mechanisms
Distributed Database Design
Data Fragmentation
Breaking a database into logical units (fragments) that can be stored at different sites:
Horizontal Fragmentation
Divides a table into subsets of rows. Each fragment contains rows that meet a specific condition:
-- Fragment 1: Customers in Region 1
SELECT * FROM CUSTOMER WHERE REGION = 1;
-- Fragment 2: Customers in Region 2
SELECT * FROM CUSTOMER WHERE REGION = 2;
Vertical Fragmentation
Divides a table into subsets of columns. Each fragment contains specific attributes:
-- Fragment 1: Employee basic info
SELECT EMP_ID, EMP_NAME, EMP_DEPT FROM EMPLOYEE;
-- Fragment 2: Employee salary info
SELECT EMP_ID, EMP_SALARY, EMP_BONUS FROM EMPLOYEE;
Mixed Fragmentation
Combines horizontal and vertical fragmentation techniques.
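The three fragmentation strategies above can be illustrated over in-memory rows: horizontal fragmentation filters rows, vertical fragmentation projects columns, and mixed fragmentation composes the two. The schema and values below are invented for illustration.

```python
# Illustration of fragmentation over in-memory rows.

EMPLOYEE = [
    {"EMP_ID": 1, "EMP_NAME": "Ann", "REGION": 1, "EMP_SALARY": 50000},
    {"EMP_ID": 2, "EMP_NAME": "Bob", "REGION": 2, "EMP_SALARY": 60000},
]

def horizontal(rows, pred):
    """Horizontal fragment: the subset of rows satisfying a condition."""
    return [r for r in rows if pred(r)]

def vertical(rows, cols):
    """Vertical fragment: a projection onto selected columns.
    The primary key is kept in every vertical fragment so the original
    table can be reconstructed by joining the fragments."""
    return [{c: r[c] for c in cols} for r in rows]

# Mixed fragmentation: Region-1 employees, salary columns only.
frag = vertical(horizontal(EMPLOYEE, lambda r: r["REGION"] == 1),
                ["EMP_ID", "EMP_SALARY"])
print(frag)  # [{'EMP_ID': 1, 'EMP_SALARY': 50000}]
```

The key design rule shown in `vertical` is completeness and reconstructability: every fragment must carry the key, and the union (or join) of the fragments must yield the original table.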
Data Replication
Storing copies of data at multiple sites to improve availability and performance:
- Full replication: Complete database copy at every site
- Partial replication: Selected fragments replicated at specific sites
- No replication: Each fragment stored at only one site
Replication requires mutual consistency: all copies of a fragment must eventually hold identical data. Common strategies for maintaining it:
- Synchronous (push) replication: Every update is applied to all copies before the transaction completes, keeping copies identical at all times
- Master-slave replication: One writable master copy propagates changes to multiple read-only slaves
- Peer-to-peer replication: Any copy can be updated; conflicting updates must be detected and reconciled
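Master-slave replication is the easiest of these strategies to sketch: writes go only to the master, which pushes each change to its read-only replicas. The class names below are invented for illustration, and propagation here is immediate rather than asynchronous as it often is in practice.

```python
# Toy sketch of master-slave replication.

class Replica:
    """Read-only copy; it only receives changes pushed by the master."""
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

class Master:
    """The single writable copy; every write is propagated to replicas."""
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
    def write(self, key, value):
        self.data[key] = value
        for r in self.replicas:      # push the change to every copy
            r.apply(key, value)

r1, r2 = Replica(), Replica()
m = Master([r1, r2])
m.write("BALANCE", 100)
print(r1.read("BALANCE"), r2.read("BALANCE"))  # 100 100
```

Because only the master accepts writes, update conflicts cannot occur; the price is that the master is a bottleneck and a single point of failure for writes.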
Data Allocation
Determining where to place data fragments:
- Centralized: All data at one site
- Partitioned: Each fragment at one site
- Replicated: Copies of fragments at multiple sites
Allocation decisions consider:
- Query frequency at each site
- Update frequency
- Network costs
- Storage costs
- Performance requirements
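These considerations can be combined into a simple cost model: place each fragment at the site minimizing total cost, where remote queries pay a network transfer cost and each site charges for storage. The cost figures, site names, and linear model below are invented to illustrate the idea, not a production allocation algorithm.

```python
# Sketch of a data-allocation decision using a toy cost model.

def best_site(query_freq, storage_cost, transfer_cost=1.0):
    """Pick the site that minimizes total access cost for one fragment.

    query_freq[site]   = queries per day issued from that site
    storage_cost[site] = cost of storing the fragment at that site
    Queries from every *other* site pay transfer_cost each.
    """
    total = sum(query_freq.values())
    costs = {
        site: (total - freq) * transfer_cost + storage_cost[site]
        for site, freq in query_freq.items()
    }
    return min(costs, key=costs.get)

# Storage is cheaper in LA, but NY issues 90% of the queries,
# so placing the fragment in NY minimizes network traffic overall.
print(best_site(query_freq={"NY": 900, "LA": 100},
                storage_cost={"NY": 10, "LA": 5}))  # NY
```

The example shows why allocation follows demand: the site issuing most of the queries usually wins even when its storage is more expensive.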
The CAP Theorem
The CAP theorem states that a distributed system can guarantee at most two of the following three properties:
- Consistency (C): All nodes see the same data simultaneously
- Availability (A): System remains operational
- Partition tolerance (P): System continues despite network partitions
In practice, network partitions are unavoidable in distributed environments, so partition tolerance is not optional: while a partition lasts, a system must choose between consistency (rejecting some requests) and availability (answering with possibly stale data).
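The trade-off during a partition can be made concrete with a toy simulation: a CP-leaning system refuses writes on a partitioned node (staying consistent but unavailable), while an AP-leaning system accepts them (staying available while replicas diverge). The `Node` class and `mode` flag below are invented purely for illustration.

```python
# Toy illustration of the CAP trade-off during a network partition.

class Node:
    def __init__(self):
        self.value = 0
        self.partitioned = False   # True while cut off from other nodes

def write(node, value, mode):
    if node.partitioned and mode == "CP":
        # Consistency over availability: reject rather than diverge.
        return "UNAVAILABLE"
    # Availability over consistency: accept now, reconcile later.
    node.value = value
    return "OK"

n = Node()
n.partitioned = True
print(write(n, 42, mode="CP"))  # UNAVAILABLE
print(write(n, 42, mode="AP"))  # OK
```

Neither choice is wrong in general; it depends on whether the application tolerates stale reads (AP) or refused requests (CP).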
C.J. Date's 12 Commandments for Distributed Databases
Just as E.F. Codd formulated rules for relational databases, C.J. Date formulated 12 rules for distributed databases, all flowing from one fundamental principle: to the user, a distributed system should look exactly like a nondistributed system:
- Local Autonomy: Local sites are independent
- No Central Site: No reliance on a central site for operations
- Continuous Operation: System never stops for upgrades or reconfigurations
- Location Independence: Users don't need to know where data is stored
- Fragmentation Independence: Users don't need to know data is fragmented
- Replication Independence: Users don't need to know data is replicated
- Distributed Query Processing: System optimizes distributed queries
- Distributed Transaction Management: System manages distributed transactions
- Hardware Independence: System works with different hardware
- Operating System Independence: System works with different operating systems
- Network Independence: System works with different network protocols
- DBMS Independence: System works with different DBMSs
Best Practices
- Design for distribution: Consider distribution from the start
- Minimize network traffic: Place data close to users
- Replicate strategically: Balance availability with consistency needs
- Monitor performance: Track query performance across sites
- Plan for failures: Design redundancy and recovery mechanisms
- Test thoroughly: Test distributed transaction scenarios
Next Steps
Learn about analyzing business data with Business Intelligence & Data Warehouses, or explore modern data challenges with Big Data & NoSQL.