Big Data Analytics & NoSQL

Overview

As organizations handle ever-larger volumes of data, generated at high velocity and in many formats, traditional relational database systems struggle to store and process it all. Big Data technologies and NoSQL databases emerged to address these challenges.

Big Data

Big Data refers to datasets that are too large and complex for traditional data processing applications to handle. Big Data is characterized by the "3 Vs":

1. Volume

Volume refers to the enormous amount of data that is generated every second and must be stored.

2. Velocity

Velocity refers to the speed at which data is generated and must be processed.

3. Variety

Variety refers to the different types and formats of data, from structured tables to semi-structured and unstructured content.

Other Big Data Characteristics

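Beyond the original 3 Vs, later formulations commonly add further characteristics such as veracity (how trustworthy and accurate the data is) and value (how useful the data is once analyzed).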

Hadoop

Hadoop is an open-source framework for the distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines.

Hadoop Distributed File System (HDFS)

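HDFS is the primary storage system used by Hadoop. It splits files into large blocks and replicates each block across several machines, so data remains available when an individual machine fails.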

HDFS Architecture

MapReduce

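MapReduce is a programming model for processing large datasets in parallel. It consists of two phases, Map and Reduce; between them, the framework groups the intermediate values by key so that each key reaches a single Reduce call: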

Map Phase

The Map function processes input data and produces intermediate key-value pairs:

Map Function Example (Word Count):
Input: "Hello World Hello"
Output: 
  (Hello, 1)
  (World, 1)
  (Hello, 1)
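
The Map step above can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop's actual API; the function name `map_word_count` is hypothetical, and it assumes words are separated by whitespace.

```python
def map_word_count(line):
    """Emit one (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


pairs = map_word_count("Hello World Hello")
print(pairs)
```

Each word is emitted independently of the others, which is what lets a framework run many Map calls in parallel on different parts of the input.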

Reduce Phase

The Reduce function receives each key together with the list of all values emitted for that key, and combines them into the final output:

Reduce Function Example (Word Count):
Input:
  (Hello, [1, 1])
  (World, [1])
Output:
  (Hello, 2)
  (World, 1)
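
The grouping and Reduce steps can be sketched the same way. This is a single-process illustration under stated assumptions, not Hadoop's API: the helper names `shuffle` and `reduce_word_count` are hypothetical, and the grouping that a real cluster performs across machines is done here with an in-memory dictionary.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate values by key, e.g. ("Hello", 1) twice -> [1, 1]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_word_count(key, values):
    """Combine all counts for one word into a single total."""
    return (key, sum(values))


intermediate = [("Hello", 1), ("World", 1), ("Hello", 1)]
grouped = shuffle(intermediate)
results = [reduce_word_count(key, values) for key, values in grouped.items()]
print(results)
```

Because every value for a given key ends up in one list, each Reduce call can run without knowing anything about the other keys.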

Hadoop Ecosystem

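The Hadoop ecosystem includes many tools built on or alongside HDFS and MapReduce, such as Hive (SQL-like queries), Pig (data-flow scripting), and HBase (a wide-column store).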

NoSQL

NoSQL ("Not only SQL") refers to database management systems that are not based on the traditional relational model. NoSQL databases are designed to handle large data volumes, flexible or evolving schemas, and horizontal scaling across many machines.

NoSQL Database Types

1. Key-Value Databases

Key-value databases store each record as a value that is looked up by a unique key. The model is simple, and lookups by key are fast:

Key-Value Example
Key: "user:1001"
Value: {
  "name": "John Doe",
  "email": "john@example.com",
  "age": 30
}
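
The key-value model can be illustrated with a small in-memory sketch. The `KeyValueStore` class below is hypothetical and is not the API of any particular product; it assumes values are JSON-serializable and stores them as opaque strings, which mirrors how many key-value systems treat values.

```python
import json


class KeyValueStore:
    """A toy in-memory key-value store; values are kept as JSON strings."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not interpret the value; it only serializes it.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

    def delete(self, key):
        # Returns True if the key existed, False otherwise.
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.put("user:1001", {"name": "John Doe", "email": "john@example.com", "age": 30})
user = store.get("user:1001")
print(user["name"])
```

Note that the only access path is the key: finding all users aged 30 would require scanning every value, which is the main trade-off of this model.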

2. Document Databases

Document databases store data as self-describing documents (typically JSON, BSON, or XML) that can nest fields and arrays:

Document Database Example (MongoDB)
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "John Doe",
  "email": "john@example.com",
  "address": {
    "street": "123 Main St",
    "city": "New York",
    "zip": "10001"
  },
  "orders": [
    {"order_id": 1, "total": 99.99},
    {"order_id": 2, "total": 149.50}
  ]
}
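
To show how nested documents are queried, the sketch below models a collection as a plain Python list of dictionaries. It does not use a MongoDB driver: the helpers `get_path` and `find` are hypothetical, the dotted-path lookup only imitates the style of nested-field queries, and the `_id` is written as a plain string rather than an ObjectId.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as "address.city" into nested dictionaries."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, dotted_path, value):
    """Return every document whose field at dotted_path equals value."""
    return [doc for doc in collection if get_path(doc, dotted_path) == value]


customers = [
    {
        "_id": "507f1f77bcf86cd799439011",
        "name": "John Doe",
        "email": "john@example.com",
        "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
        "orders": [
            {"order_id": 1, "total": 99.99},
            {"order_id": 2, "total": 149.50},
        ],
    }
]

matches = find(customers, "address.city", "New York")
order_total = sum(order["total"] for order in matches[0]["orders"])
print(matches[0]["name"], round(order_total, 2))
```

Because the address and orders live inside the customer document, both are read in one lookup, with no join across tables.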

3. Column-Oriented Databases

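Column-oriented databases organize data by column rather than by row, so a query that touches only a few columns avoids reading the rest, which suits analytical workloads. In the wide-column layout shown below, each row key maps to column families, and the columns inside a family can differ from row to row: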

Column-Oriented Structure
Row Key: user_1001
  Column Family: Profile
    name: "John Doe"
    email: "john@example.com"
  Column Family: Activity
    2024-01-01: "login"
    2024-01-02: "purchase"
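
The structure above maps naturally onto nested dictionaries: row key, then column family, then column name. The sketch below is an in-memory illustration only; `read_family` and `scan_columns` are hypothetical helpers, not calls from any real wide-column database.

```python
table = {
    "user_1001": {
        "Profile": {"name": "John Doe", "email": "john@example.com"},
        "Activity": {"2024-01-01": "login", "2024-01-02": "purchase"},
    }
}


def read_family(table, row_key, family):
    """Return all columns of one family for one row (empty if missing)."""
    return table.get(row_key, {}).get(family, {})


def scan_columns(table, row_key, family, start, end):
    """Return (column, value) pairs with start <= column <= end, sorted by name."""
    columns = read_family(table, row_key, family)
    return [(name, columns[name]) for name in sorted(columns) if start <= name <= end]


print(read_family(table, "user_1001", "Profile"))
print(scan_columns(table, "user_1001", "Activity", "2024-01-02", "2024-01-31"))
```

Using dates as column names, as the Activity family does, keeps a row's events sorted, so a range of dates can be read as a contiguous slice.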

4. Graph Databases

Graph databases store data as nodes and relationships (edges), and are optimized for queries that traverse those relationships:

Graph Database Example
Nodes:
  (Person {name: "John"})
  (Person {name: "Jane"})
  (Product {name: "Book"})

Relationships:
  (John)-[:KNOWS]->(Jane)
  (John)-[:PURCHASED]->(Book)
  (Jane)-[:RECOMMENDS]->(Book)
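
A relationship query over this small graph can be sketched with a plain edge list. This is not a graph database or a query language; the helper names `neighbors` and `recommended_by_friends` are hypothetical, and nodes are identified by name only for brevity.

```python
# Each edge is (source node, relationship type, target node).
edges = [
    ("John", "KNOWS", "Jane"),
    ("John", "PURCHASED", "Book"),
    ("Jane", "RECOMMENDS", "Book"),
]


def neighbors(edges, source, relationship):
    """Return the targets reachable from source over one relationship type."""
    return [target for s, r, target in edges if s == source and r == relationship]


def recommended_by_friends(edges, person):
    """Two-hop traversal: person -KNOWS-> friend -RECOMMENDS-> product."""
    products = []
    for friend in neighbors(edges, person, "KNOWS"):
        for product in neighbors(edges, friend, "RECOMMENDS"):
            if product not in products:
                products.append(product)
    return products


print(recommended_by_friends(edges, "John"))
```

The sketch scans the whole edge list at every hop; a graph database avoids that by storing each node's relationships alongside the node, which is why multi-hop traversals stay fast as the graph grows.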

5. NewSQL Databases

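NewSQL databases are not NoSQL systems in the strict sense. They keep the SQL interface and transactional guarantees of relational databases while aiming for the horizontal scalability of NoSQL systems.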

Data Analytics

Data Mining

Data mining is the process of discovering patterns and relationships in large datasets using methods from statistics, machine learning, and database systems.

Predictive Analytics

Predictive analytics uses statistical techniques and machine learning to analyze current and historical data in order to make predictions about future events.
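
As one small illustration of the idea, the sketch below fits a straight line to past observations with ordinary least squares and extrapolates one step ahead. Linear regression is only one of many predictive techniques; the function name `fit_line` and the monthly sales figures are hypothetical and chosen purely for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: month number and units sold.
months = [1, 2, 3, 4]
units_sold = [2, 4, 6, 8]

slope, intercept = fit_line(months, units_sold)
forecast = slope * 5 + intercept  # prediction for month 5
print(forecast)
```

Real predictive models are validated against data held back from the fit before their forecasts are trusted.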

When to Use NoSQL vs Relational Databases

Use NoSQL When:

Use Relational Databases When:

Best Practices

Next Steps

Learn about connecting databases to applications with Database Connectivity, or explore distributed systems with Distributed Databases.