Distributed Database Management Systems
Overview
A Distributed Database Management System (DDBMS) manages a database that is distributed across multiple computer sites within a computer network. Unlike a centralized database system, where all data is stored at a single location, a distributed database allows data to be stored and accessed from multiple locations while appearing as a single logical database.
The Evolution of Distributed Database Management Systems
Distributed database systems evolved from centralized database systems as organizations grew and needed to:
- Access data from multiple geographic locations
- Improve performance by locating data closer to users
- Enhance reliability through data replication
- Support organizational decentralization
- Reduce communication costs
DDBMS Advantages and Disadvantages
Advantages
- Data are located near the site of greatest demand: Data sets can be placed close to the users most likely to need them
- Faster data access: Reduced network latency for local queries
- Faster data processing: Parallel processing across sites
- Growth facilitation: Easier to add new sites and expand
- Improved communications: Smaller local sites are closer to end users, which fosters better communication among departments and with customers
- Reduced operating costs: Lower communication costs
- User-friendly interface: Users interact with a single logical database
- Less danger of a single-point failure: System continues if one site fails
- Processor independence: Can use different hardware/software at each site
Disadvantages
- Complexity of management and control: More complex than centralized systems
- Security: More difficult to secure data across multiple sites
- Lack of standards: Fewer standards compared to centralized systems
- Increased storage requirements: Data replication increases storage needs
- Increased training costs: Personnel need training on distributed systems
- Higher software cost and complexity: More expensive and complex software
Distributed Processing vs Distributed Databases
It's important to distinguish between distributed processing and distributed databases:
Distributed Processing
Distributed processing shares the database's logical processing among multiple sites connected to a network. The database itself remains centralized, but processing is distributed.
Distributed Database
A distributed database is a single logically related database that is stored across two or more physically independent sites. Both data and processing are distributed.
Characteristics of Distributed Database Management Systems
A true DDBMS must have the following characteristics:
- Application Transparency: Applications don't need to know where data is located
- Transaction Transparency: Transactions can execute across multiple sites
- Distribution Transparency: Users see a single logical database
- Failure Transparency: System continues operating despite site failures
- Performance Transparency: System performs as if it were centralized
- Heterogeneity Transparency: Supports different hardware/software platforms
DDBMS Components
A distributed database system includes:
- Computer workstations or remote devices: End-user access points
- Network hardware and software: Communication infrastructure
- Communications media: Physical connections between sites
- Transaction processor (TP) or Transaction Manager (TM): Software component that processes transactions
- Data processor (DP) or Data Manager (DM): Software component that stores and retrieves data
Levels of Data and Process Distribution
Single-Site Processing, Single-Site Data (SPSD)
All processing is done on a single CPU or host computer, and all data are stored on the host computer's local disk. This is the simplest type of system - essentially a centralized system.
Multiple-Site Processing, Single-Site Data (MPSD)
Multiple processes run on different computers sharing a single data repository. The data are stored on a single computer but can be accessed from multiple computers. This is typical of client/server systems.
Multiple-Site Processing, Multiple-Site Data (MPMD)
Fully distributed database system with data and processing distributed across multiple sites. This is a true DDBMS where each site can process local transactions and participate in global transactions.
Distributed Database Transparency Features
Transparency features hide the complexity of distribution from users and applications:
Distribution Transparency
Users see the database as a single, logical entity. The system handles:
- Fragmentation transparency: Users don't need to know data is fragmented
- Location transparency: Users don't need to know where data is located
- Local mapping transparency: System automatically maps logical to physical locations
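Location and local mapping transparency can be pictured as a global catalog lookup that the transaction processor performs on the user's behalf. The following is a minimal Python sketch; the catalog contents and the names `CATALOG` and `resolve_site` are illustrative, not from any real DDBMS.

```python
# Minimal sketch of location transparency: a global catalog maps each
# logical fragment (table + optional predicate) to its physical site,
# so user queries name only the table, never a site.

CATALOG = {
    ("CUSTOMER", "REGION=1"): "site_A",
    ("CUSTOMER", "REGION=2"): "site_B",
    ("INVOICE", None): "site_B",
}

def resolve_site(table, predicate=None):
    """Return the site holding the fragment; users never call this directly."""
    return CATALOG[(table, predicate)]

# The user writes 'SELECT * FROM CUSTOMER WHERE REGION = 2';
# the transaction processor consults the catalog and routes to site_B.
print(resolve_site("CUSTOMER", "REGION=2"))  # site_B
```

A user query never mentions `site_A` or `site_B`; only the catalog changes when fragments are moved, which is exactly what location transparency requires.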
Transaction Transparency
Ensures that distributed transactions maintain ACID properties across all sites:
Distributed Requests
A single SQL statement can access data at multiple sites:
SELECT *
FROM CUSTOMER_AT_SITE1 C, INVOICE_AT_SITE2 I
WHERE C.CUS_CODE = I.CUS_CODE;
Distributed Transactions
Multiple SQL statements that access data at multiple sites:
BEGIN TRANSACTION;
UPDATE ACCOUNT_AT_SITE1 SET BALANCE = BALANCE - 100;
UPDATE ACCOUNT_AT_SITE2 SET BALANCE = BALANCE + 100;
COMMIT;
Two-Phase Commit Protocol
The two-phase commit protocol ensures atomicity in distributed transactions:
- Phase 1 - Preparation: The coordinator asks every participating site to prepare. Each site locks its resources, writes its transaction log entries to stable storage, and votes YES or NO.
- Phase 2 - Final Commit: If every site votes YES, the coordinator instructs all sites to commit. If any site votes NO or fails to respond, the transaction is rolled back at all sites.
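The two phases can be sketched as a coordinator loop over participant votes. This is a toy Python illustration of the protocol's control flow, with invented `Site` objects; a real implementation would also handle timeouts, logging, and coordinator failure.

```python
# Toy sketch of the two-phase commit protocol (coordinator side).

class Site:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "ACTIVE"
    def prepare(self):                 # Phase 1: lock, log, vote
        self.state = "PREPARED" if self.can_commit else "FAILED"
        return self.can_commit
    def commit(self):
        self.state = "COMMITTED"
    def rollback(self):
        self.state = "ROLLED BACK"

def two_phase_commit(participants):
    # Phase 1 - Preparation: collect a vote from every site.
    votes = [p.prepare() for p in participants]
    # Phase 2 - Final commit: commit only if ALL sites voted YES;
    # otherwise roll back everywhere, preserving atomicity.
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMITTED"
    for p in participants:
        p.rollback()
    return "ROLLED BACK"

print(two_phase_commit([Site(True), Site(True)]))   # COMMITTED
print(two_phase_commit([Site(True), Site(False)]))  # ROLLED BACK
```

Note that a single NO vote aborts the transaction at every site: partial commits are never visible.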
Distributed Concurrency Control
Distributed systems require sophisticated concurrency control mechanisms:
- Distributed lock manager: Manages locks across multiple sites
- Two-phase locking: Ensures serializability across sites
- Timestamp-based methods: Use timestamps to order transactions
- Replication control: Manages consistency across replicated data
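As one concrete example of these mechanisms, basic timestamp ordering can be sketched in a few lines of Python: each data item remembers the largest read and write timestamps it has seen, and any operation that arrives "too late" forces its transaction to roll back. This is a simplified single-item illustration, not a full scheduler.

```python
# Sketch of timestamp-based concurrency control (basic timestamp ordering).

class Item:
    def __init__(self):
        self.read_ts = 0    # largest timestamp that has read this item
        self.write_ts = 0   # largest timestamp that has written this item
        self.value = None

def write(item, ts, value):
    # Reject the write if a younger transaction already read or wrote
    # the item; the update would arrive out of timestamp order.
    if ts < item.read_ts or ts < item.write_ts:
        return "ROLLBACK"
    item.write_ts = ts
    item.value = value
    return "OK"

def read(item, ts):
    if ts < item.write_ts:
        return "ROLLBACK"   # the transaction would read from its future
    item.read_ts = max(item.read_ts, ts)
    return "OK"

x = Item()
print(write(x, ts=5, value=10))  # OK
print(read(x, ts=3))             # ROLLBACK: a younger write (ts=5) exists
```

Because transactions are ordered by timestamp rather than by locks, this scheme needs no distributed lock manager and cannot deadlock, at the cost of more rollbacks.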
Performance and Failure Transparency
Performance Transparency
The distributed database performs as if it were a centralized database. The system optimizes:
- Query optimization across sites
- Data allocation and replication
- Network traffic minimization
- Load balancing
Failure Transparency
The system continues operating even if one or more sites fail:
- Automatic detection of site failures
- Replication provides redundancy
- Transaction rollback and recovery
- Automatic failover mechanisms
Distributed Database Design
Data Fragmentation
Breaking a database into logical units (fragments) that can be stored at different sites:
Horizontal Fragmentation
Divides a table into subsets of rows. Each fragment contains rows that meet a specific condition:
-- Fragment 1: Customers in Region 1
SELECT * FROM CUSTOMER WHERE REGION = 1;
-- Fragment 2: Customers in Region 2
SELECT * FROM CUSTOMER WHERE REGION = 2;
Vertical Fragmentation
Divides a table into subsets of columns. Each fragment contains specific attributes:
-- Fragment 1: Employee basic info
SELECT EMP_ID, EMP_NAME, EMP_DEPT FROM EMPLOYEE;
-- Fragment 2: Employee salary info
SELECT EMP_ID, EMP_SALARY, EMP_BONUS FROM EMPLOYEE;
Mixed Fragmentation
Combines horizontal and vertical fragmentation techniques.
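The three fragmentation strategies above can be illustrated over in-memory rows: horizontal fragmentation filters rows, vertical fragmentation projects columns, and mixed fragmentation composes the two. The schema and values below are invented for illustration.

```python
# Illustration of fragmentation over in-memory rows.

EMPLOYEE = [
    {"EMP_ID": 1, "EMP_NAME": "Ann", "REGION": 1, "EMP_SALARY": 50000},
    {"EMP_ID": 2, "EMP_NAME": "Bob", "REGION": 2, "EMP_SALARY": 60000},
]

def horizontal(rows, pred):
    """Horizontal fragment: the subset of rows satisfying a condition."""
    return [r for r in rows if pred(r)]

def vertical(rows, cols):
    """Vertical fragment: a projection onto selected columns.
    The primary key is kept in every vertical fragment so the original
    table can be reconstructed by joining the fragments."""
    return [{c: r[c] for c in cols} for r in rows]

# Mixed fragmentation: Region-1 employees, salary columns only.
frag = vertical(horizontal(EMPLOYEE, lambda r: r["REGION"] == 1),
                ["EMP_ID", "EMP_SALARY"])
print(frag)  # [{'EMP_ID': 1, 'EMP_SALARY': 50000}]
```

The key design rule shown in `vertical` is completeness and reconstructability: every fragment must carry the key, and the union (or join) of the fragments must yield the original table.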
Data Replication
Storing copies of data at multiple sites to improve availability and performance:
- Full replication: Complete database copy at every site
- Partial replication: Selected fragments replicated at specific sites
- No replication: Each fragment stored at only one site
Replication requires mutual consistency: all copies of a fragment must eventually hold identical data. Common strategies for maintaining it:
- Synchronous (push) replication: Every update is applied to all copies before the transaction completes, keeping copies identical at all times
- Master-slave replication: One writable master copy propagates changes to multiple read-only slaves
- Peer-to-peer replication: Any copy can be updated; conflicting updates must be detected and reconciled
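Master-slave replication is the easiest of these strategies to sketch: writes go only to the master, which pushes each change to its read-only replicas. The class names below are invented for illustration, and propagation here is immediate rather than asynchronous as it often is in practice.

```python
# Toy sketch of master-slave replication.

class Replica:
    """Read-only copy; it only receives changes pushed by the master."""
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

class Master:
    """The single writable copy; every write is propagated to replicas."""
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
    def write(self, key, value):
        self.data[key] = value
        for r in self.replicas:      # push the change to every copy
            r.apply(key, value)

r1, r2 = Replica(), Replica()
m = Master([r1, r2])
m.write("BALANCE", 100)
print(r1.read("BALANCE"), r2.read("BALANCE"))  # 100 100
```

Because only the master accepts writes, update conflicts cannot occur; the price is that the master is a bottleneck and a single point of failure for writes.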
Data Allocation
Determining where to place data fragments:
- Centralized: All data at one site
- Partitioned: Each fragment at one site
- Replicated: Copies of fragments at multiple sites
Allocation decisions consider:
- Query frequency at each site
- Update frequency
- Network costs
- Storage costs
- Performance requirements
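These considerations can be combined into a simple cost model: place each fragment at the site minimizing total cost, where remote queries pay a network transfer cost and each site charges for storage. The cost figures, site names, and linear model below are invented to illustrate the idea, not a production allocation algorithm.

```python
# Sketch of a data-allocation decision using a toy cost model.

def best_site(query_freq, storage_cost, transfer_cost=1.0):
    """Pick the site that minimizes total access cost for one fragment.

    query_freq[site]   = queries per day issued from that site
    storage_cost[site] = cost of storing the fragment at that site
    Queries from every *other* site pay transfer_cost each.
    """
    total = sum(query_freq.values())
    costs = {
        site: (total - freq) * transfer_cost + storage_cost[site]
        for site, freq in query_freq.items()
    }
    return min(costs, key=costs.get)

# Storage is cheaper in LA, but NY issues 90% of the queries,
# so placing the fragment in NY minimizes network traffic overall.
print(best_site(query_freq={"NY": 900, "LA": 100},
                storage_cost={"NY": 10, "LA": 5}))  # NY
```

The example shows why allocation follows demand: the site issuing most of the queries usually wins even when its storage is more expensive.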
The CAP Theorem
The CAP theorem states that a distributed system can guarantee at most two of the following three properties:
- Consistency (C): All nodes see the same data simultaneously
- Availability (A): System remains operational
- Partition tolerance (P): System continues despite network partitions
In practice, network partitions are unavoidable in distributed environments, so partition tolerance is not optional: while a partition lasts, a system must choose between consistency (rejecting some requests) and availability (answering with possibly stale data).
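The trade-off during a partition can be made concrete with a toy simulation: a CP-leaning system refuses writes on a partitioned node (staying consistent but unavailable), while an AP-leaning system accepts them (staying available while replicas diverge). The `Node` class and `mode` flag below are invented purely for illustration.

```python
# Toy illustration of the CAP trade-off during a network partition.

class Node:
    def __init__(self):
        self.value = 0
        self.partitioned = False   # True while cut off from other nodes

def write(node, value, mode):
    if node.partitioned and mode == "CP":
        # Consistency over availability: reject rather than diverge.
        return "UNAVAILABLE"
    # Availability over consistency: accept now, reconcile later.
    node.value = value
    return "OK"

n = Node()
n.partitioned = True
print(write(n, 42, mode="CP"))  # UNAVAILABLE
print(write(n, 42, mode="AP"))  # OK
```

Neither choice is wrong in general; it depends on whether the application tolerates stale reads (AP) or refused requests (CP).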
C.J. Date's 12 Commandments for Distributed Databases
Just as E.F. Codd formulated rules for relational databases, C.J. Date formulated 12 rules for distributed databases, all flowing from one fundamental principle: to the user, a distributed system should look exactly like a nondistributed system:
- Local Autonomy: Local sites are independent
- No Central Site: No reliance on a central site for operations
- Continuous Operation: System never stops for upgrades or reconfigurations
- Location Independence: Users don't need to know where data is stored
- Fragmentation Independence: Users don't need to know data is fragmented
- Replication Independence: Users don't need to know data is replicated
- Distributed Query Processing: System optimizes distributed queries
- Distributed Transaction Management: System manages distributed transactions
- Hardware Independence: System works with different hardware
- Operating System Independence: System works with different operating systems
- Network Independence: System works with different network protocols
- DBMS Independence: System works with different DBMSs
Best Practices
- Design for distribution: Consider distribution from the start
- Minimize network traffic: Place data close to users
- Replicate strategically: Balance availability with consistency needs
- Monitor performance: Track query performance across sites
- Plan for failures: Design redundancy and recovery mechanisms
- Test thoroughly: Test distributed transaction scenarios
Next Steps
Learn about analyzing business data with Business Intelligence & Data Warehouses, or explore modern data challenges with Big Data & NoSQL.