🎯 Exam Strategy
- V's: 3 core (Volume, Velocity, Variety) + Veracity = 4th. Full 10 in lecture slide. LECTURE
- Architectures: Shared-Everything (limits scalability; SMP=UMA, DSM=NUMA) vs Shared-Nothing (infinite scalability, each node independent). LECTURE
- Distribution: Sharding ≠ Replication. Can combine. Master-Slave vs Peer-to-Peer. LECTURE
- Processing: Batch (MapReduce) vs Real-time (Storm/S4/MillWheel) vs Parallel (1 machine) vs Distributed (cluster). LECTURE
- Databases: RDBMS → ACID. NoSQL → CAP + BASE. NewSQL → ACID + scalability + shared-nothing. LECTURE
- Scaling: Up=vertical (limited). Out=horizontal (commodity, unlimited). LECTURE
- Virtualization: Encapsulation, Partitioning, Isolation → memorize. LECTURE
- Clusters: High-avail vs Load-balancing. Symmetric vs Asymmetric. LECTURE
- Book-only watch: Cloud (SaaS/PaaS/IaaS), data lifecycle terms (aggregation, cleaning, etc.), failover. NOT in lectures but heavy in book MCQs. BOOK
- Hadoop architecture: Two layers: HDFS (storage) + MapReduce (processing). Ecosystem: 4 layers (storage, processing, access, management). LECTURE
- HDFS defaults: Block size = 64 MB. Replication factor = 3. Reads = parallel. Writes = pipelined. NOT suitable for low-latency access. BOTH
- NameNode vs DataNode: NameNode = metadata (single point of failure, NOT commodity hw). DataNode = actual data (commodity hw). BOTH
- Hadoop 1.0 vs 2.0: H1 = single NameNode (SPOF), batch only, JobTracker does everything. H2 = Active+Standby NameNode, YARN, real-time support. BOTH
- YARN components: ResourceManager, ApplicationMaster, NodeManager. Schedulers: FIFO, Capacity, Fair. BOTH
- MapReduce pipeline: Input → Split → Map → Combine (optional) → Partition → Shuffle & Sort → Reduce → Output. Know Mapper, Combiner, Reducer roles. BOTH
- Ecosystem tools: HBase (column DB, real-time), Cassandra (ring, no master), Hive (HQL/SQL), Pig (Pig Latin), SQOOP (RDBMS ↔ HDFS), Flume (streaming → HDFS), Oozie (workflow), Avro (serialization), Mahout (ML). LECTURE
- Block calculation: 500 MB / 64 MB = 7.8125 → 8 blocks. 800 MB / 128 MB = 6.25 → 7 blocks; × replication 3 = 21 total. BOOK
Why big data exists
Start with definitions, the V's, RDBMS limits, and the shift from transaction systems to real-time analytics.
How storage scales
Clusters, sharding, replication, and database models explain how big systems stay available and affordable.
How processing changes
Different architectures and processing models explain when to use batch, streaming, parallel, or distributed execution.
How Hadoop implements it
HDFS stores, MapReduce processes, YARN coordinates, and the ecosystem tools fill specific ingestion, query, workflow, and ML roles.
Best revision order: concept -> storage/distribution -> processing model -> Hadoop implementation.
Big Data and Data Science
A blanket term for data too large and complex to process with a traditional DBMS: structured, unstructured, and semi-structured, arriving at high velocity.
Challenges: capture, curation, storage, search, sharing, transfer, analysis, visualization.
Larger datasets → more insights (spot trends, prevent diseases, combat crime, monitor traffic conditions).
| Attribute | RDBMS | Big Data |
|---|---|---|
| Volume | GB–TB | PB–ZB |
| Organization | Centralized | Distributed |
| Data Type | Structured | Structured + Semi + Unstructured |
| Hardware | High-end | Commodity |
| Updates | R/W many times | Write once, read many (WORM) |
| Schema | Static | Dynamic |
Scaling: Vertical = increase speed/storage/memory of one machine. Horizontal = cluster of commodity resources as single system.
| Data Mining | Big Data |
|---|---|
| Discovering knowledge from datasets | Massive volume (V's) |
| Structured (spreadsheets, RDBMS) | All types + NoSQL |
| High processing costs | Lower cost at scale |
| GB–TB | PB–ZB |
3 V's:
1. Volume (Scale): size of data. TB–EB at rest.
2. Variety (Complexity): structured, semi-structured, unstructured.
3. Velocity (Speed): rate of generation AND processing.
4th V = Veracity: "Data in Doubt" (inconsistency, incompleteness, ambiguity).
Volume
How much data exists. Think TB to ZB scale.
Velocity
How fast data arrives and must be processed.
Variety
Structured, semi-structured, and unstructured formats together.
Veracity
Can you trust the data, or is it noisy, missing, or ambiguous?
10 V's (lecture slide): Volume, Velocity, Variety, Veracity, Value, Validity, Variability, Venue, Vocabulary, Vagueness.
| Type | Description | Examples |
|---|---|---|
| Structured | Relational tables (rows/columns) | Employee records, financial transactions |
| Unstructured | Raw, no RDBMS fit. ~80% of data | Video, audio, images, emails, social media |
| Semi-Structured | Has structure, not relational | JSON, XML |
1. OLTP β Online Transaction Processing (DBMSs, 1968/1970)
2. OLAP β Online Analytical Processing (Data Warehousing, 1983/1990)
3. RTAP β Real-Time Analytics Processing (Big Data, 2000/2010+)
Traditional DW (Exadata, Teradata) not suited. Shared-nothing, MPP, scale-out = ideal.
• Relevant product recommendations
• Counter customer switching in time
• Marketing effectiveness while running
• Fraud prevention as it occurs
• Friend invitations expanding business
Bottleneck: technology (new architectures needed) + skills (experts needed). Lack of software (30%), analytic skills (28%), budget (25%), already using (11%).
| Term | Definition (memorize for MCQs) |
|---|---|
| Data Aggregation | Collecting raw data, transmitting to storage, preprocessing |
| Data Preprocessing | Transform raw data into understandable format; ensure consistency |
| Data Integration | Combining data from different sources → unified view |
| Data Cleaning | Fill missing values, correct errors/inconsistencies, remove redundancy |
| Data Transformation | Transform into format acceptable by big data database |
| Data Reduction | Reduce volume/dimensions without losing integrity |
Chapter 1 – MCQs
Short Answer
Big Data Storage Concepts
Sources (machine/web/audio-video/external) → Hadoop Cluster (online archive, all data types) → Data Warehouse → Ad-hoc queries.
Raw data analyzed using MapReduce, Hadoop's programming paradigm.
Multiple standalone PCs via LANs → single integrated, highly available resource.
Benefits: high availability, load balancing, performance, reliability, fault tolerance, scalable (add/remove nodes), commodity hardware.
| High-Availability | Load-Balancing |
|---|---|
| Minimize downtime, uninterrupted service | Distribute workloads across nodes |
| Eliminate single points of failure | Identical data copies across all nodes |
| Zero transactional data loss | Optimize resources, minimize response time, maximize throughput |
Symmetric: Each node = independent computer running apps. User → any node.
Asymmetric: One head node = gateway. User → head node → workers.
Sharding = partition data into shards across nodes. No two shards of same file on same node. Improves fault tolerance.
Replication = copies of same data across servers. Fault tolerant.
Master-Slave (lecture): Writes → master. Reads → slaves. Master fails → writes stop.
Peer-to-Peer (lecture): All nodes equal. Workload partitioned.
Exam shortcut: different pieces across nodes = sharding. Same piece copied across nodes = replication.
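The shortcut can be made concrete in code. A minimal Python sketch (node names and hash-based placement are invented for illustration, not any real system's API):

```python
NODES = ["node0", "node1", "node2"]  # hypothetical 3-node cluster

def shard(key: str) -> str:
    """Sharding: each piece of data lands on exactly one node."""
    return NODES[hash(key) % len(NODES)]

def replicate(value: str) -> dict:
    """Replication: every node stores a copy of the same piece."""
    return {node: value for node in NODES}

# Sharded: one owner per record. Replicated: a copy on every node.
print(shard("user:42"))
print(replicate("user:42"))
```

Real systems often combine both: shard the keyspace, then replicate each shard.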
Failover = automatic switch to redundant node. Switch Over = requires human intervention.
File system: Organizes data on storage devices, tracks files. File = smallest unit of storage.
RDBMS: Vertical scaling, ACID, static schema. Can't handle big data.
NoSQL: CAP + BASE (eventually consistent). Types: Key-value, Document, Column, Graph.
NewSQL: NoSQL scalability + RDBMS ACID. Shared-nothing, SQL-compliant. 3 layers: admin, transactional, storage. Examples: VoltDB, NuoDB, Clustrix, MemSQL, TokuDB.
| Scale-Up (Vertical) | Scale-Out (Horizontal) |
|---|---|
| Add CPU/RAM/HDD to existing server | Add new nodes to cluster |
| Limited by max server capacity | Unlimited |
| Expensive | Commodity hardware |
| Ex: RAM 32 → 128 GB | Ex: +3 machines |
| Traditional RDBMS | Big data approach |
Chapter 2 – MCQs
Short Answer
Processing & Management Concepts
Collecting, processing, manipulating, managing data → meaningful information.
Stages (lecture diagram): Input (capture/collect/transmit) → Processing (classify/sort/math/transform) → Storage (store/retrieve/archive/govern) → Output (compute/format/present).
Types: Centralized (single location) vs Distributed (across locations).
Shared-Everything: shares ALL resources. Limits scalability.
1. SMP = UMA (Uniform Memory Access). Single shared memory. Bandwidth choking on shared bus.
2. DSM = NUMA (Non-Uniform). Multiple memory pools. Latency depends on distance.
Shared-Nothing: each node = own memory, storage, disks. Infinite scalability. Ideal for web/Internet apps.
Shared-Everything
- All processors share core resources.
- SMP = UMA and DSM = NUMA.
- Simpler to reason about, but scalability is limited.
Shared-Nothing
- Each node keeps its own CPU, memory, and storage.
- Scale-out comes from adding more independent nodes.
- Preferred for big web and Internet-scale applications.
| Type | Key Details (from lecture) |
|---|---|
| Batch | Jobs executed sequentially/parallel → combined output. TB/PB, response time not critical. Hadoop MapReduce. Cost-effective. Ex: payroll, billing, DW, log analysis. |
| Real-Time | Continual data flow. In-memory processing while streaming. Stored to disk AFTER. Low latency. Ex: ATM, POS, fraud detection. |
| Parallel | Subtasks on multiple CPUs, single machine. Shared memory. |
| Distributed | Subtasks on separate networked machines (cluster). |
Platforms (lecture table):
| Platform | Dev | Type |
|---|---|---|
| Storm | Twitter | Streaming |
| S4 | Yahoo | Streaming |
| MillWheel | Google | Streaming |
| Hadoop | Apache | Batch |
| Disco | Nokia | Batch |
Software layer on hardware. Multiple OS on one machine, each independent.
Attributes (lecture):
1. Encapsulation – VM = single file. Dedicated VM per app → no interference.
2. Partitioning – Hardware → logical partitions, each with separate OS.
3. Isolation – VMs isolated. One crash → others unaffected.
Server virtualization (lecture): Server → multiple VMs. CPU/memory virtualized. Enables large-volume Big Data analysis.
Deployment: Public (3rd-party, pay-as-you-go) | Private (owned by company, more secure) | Hybrid (public+private combined).
Services: SaaS (software subscription: Salesforce) | PaaS (dev platform: Heroku) | IaaS (infrastructure: AWS, IBM).
Virtualization types (book): Server, Desktop, Network, Storage, Application.
Chapter 4 – MCQs
Short Answer
Driving Big Data with Hadoop Tools & Technologies
Apache Hadoop = open-source Java framework for processing large data sets in streaming access pattern across clusters in a distributed computing environment.
Can store structured, semi-structured, and unstructured data in a DFS and process in parallel. Highly scalable and cost-effective.
Files are written once and read many times (WORM). Contents cannot be changed.
Architecture β Two layers: 1) HDFS (storage layer) 2) MapReduce engine (processing layer).
Four components: Hadoop Common (utilities), HDFS (distributed storage), YARN (resource negotiation), MapReduce (parallel processing).
| Layer | Components |
|---|---|
| Data Storage | HDFS (distributed file system) + HBase (column-oriented DB) |
| Data Processing | MapReduce (job processing) + YARN (resource allocation, scheduling, monitoring) |
| Data Access | Hive (HQL query), Pig (data analysis scripting), Mahout (ML), Avro (serialization), SQOOP (RDBMS ↔ HDFS transfer) |
| Data Management | Oozie (workflow scheduler), Chukwa (monitoring), Flume (streaming data flow → HDFS), ZooKeeper (coordination service) |
Designed to store large data sets with streaming access pattern on commodity hardware by partitioning data into small chunks.
Key facts:
• Default block size: 64 MB. A 200 MB file → four blocks total: three full 64 MB blocks + one partial 8 MB block.
• Default replication factor: 3 (three copies across different nodes/racks for fault tolerance).
• Reads: parallel. Writes: pipelined (not parallel, to avoid inconsistency).
• NOT suitable for low-latency access.
Features: Cost-effective (open-source + commodity hw), Distributed storage, Data replication (3 copies default).
Lecture example: reading 500 GB with four 100 MB/s I/O channels takes about 21 minutes on one machine; spread across 100 machines and read in parallel, the same data is read in seconds.
Move work close to the data and remove one-node bottlenecks.
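The arithmetic behind that lecture example, as a quick sanity check (decimal units assumed; only the 500 GB, 100 MB/s, and 100-machine figures come from the lecture):

```python
size_mb = 500 * 1000        # 500 GB expressed in MB
rate_mb_s = 4 * 100         # four I/O channels at 100 MB/s each

t_one = size_mb / rate_mb_s     # single machine: 1250 s
t_cluster = t_one / 100         # data spread over 100 machines: 12.5 s

print(f"one machine:  {t_one:.0f} s (~{t_one / 60:.0f} min)")
print(f"100 machines: {t_cluster:.1f} s")
```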
NameNode (Master):
• Manages namespace of entire file system
• Supervises health of DataNodes via Heartbeat signal
• Controls file access. Does NOT hold actual data; holds metadata (which blocks constitute a file + their locations)
• Single point of failure. Needs manual intervention in H1.0
• NOT commodity hardware; must be highly available system
DataNode (Slave):
• Where actual data resides, distributed across cluster
• Files split into blocks of 64 MB by default
• NameNode decides which block goes to which DataNode
• Serves read/write requests from clients
Secondary NameNode: Periodically backs up NameNode RAM data. NOT a substitute for NameNode; acts as recovery mechanism. Runs on separate machine (needs equivalent memory).
Rack: A collection of DataNodes stored at a single location. Replications placed across racks to overcome rack failure (1st replica on one rack, 2nd+3rd on a different rack).
Write: Client → request to NameNode → receives metadata → writes to first DataNode → pipelined replication to other DataNodes → ACKs flow back → Block Received ACK to NameNode.
Read: Client → File Read Request to NameNode → gets metadata (block locations) → parallel read directly from DataNodes.
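The two paths can be mimicked in a toy Python sketch (node names and event strings are invented; a real HDFS client talks to the NameNode over RPC):

```python
def pipelined_write(block, datanodes):
    """Write path: client sends to the first DataNode only; each DataNode
    forwards downstream, then ACKs travel back up the chain."""
    events = [f"client -> {datanodes[0]}: {block}"]
    for src, dst in zip(datanodes, datanodes[1:]):
        events.append(f"{src} -> {dst}: {block}")     # pipelined forward
    for dn in reversed(datanodes):
        events.append(f"{dn}: ACK {block}")           # ACKs flow back
    return events

def parallel_read(block_locations):
    """Read path: client fetches each block directly from a holder."""
    return {blk: f"read from {dns[0]}" for blk, dns in block_locations.items()}

print(pipelined_write("blk_1", ["dn1", "dn2", "dn3"]))
print(parallel_read({"blk_1": ["dn1", "dn2"], "blk_2": ["dn3", "dn1"]}))
```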
Write types (book): Posted (no ACK required) vs Non-posted (ACK required).
MapReduce = batch-processing programming model using divide-and-conquer. Highly scalable, reliable, fault tolerant. Supports only batch workloads.
Key advantage: moves processing TO the data (not data to processing).
Pipeline: Input → Split → Map → Combine (optional) → Partition (optional) → Shuffle & Sort → Reduce → Output
Mapper: Breaks large data into small blocks. Each resolved into key-value pairs (K1,V1). Executes user logic → produces intermediate (K2,V2). Processing done in parallel.
Combiner: Optimizes mapper output before sending to reducer. Essentially a "local reducer": groups repeated keys and lists their values. Reduces overhead of data transfer between mapper and reducer.
Reducer: Processes sorted key-value pairs. Each reducer runs in isolation (no inter-reducer communication). Input sorted by key. Output written back to DFS.
Limitations of MapReduce: reducers depend on mapper output, so the job has a phase barrier and is poor for low-latency or iterative workloads.
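The pipeline above can be simulated in a few lines. A single-process word-count sketch, just to make the map → combine → shuffle/sort → reduce phases concrete (not real Hadoop, which distributes each phase):

```python
from collections import defaultdict
from itertools import groupby

lines = ["big data big", "data is big"]           # toy input splits

# Map: each line -> (word, 1) pairs
mapped = [(w, 1) for line in lines for w in line.split()]

# Combine ("local reducer"): pre-sum repeated keys on the mapper side
combined = defaultdict(int)
for k, v in mapped:
    combined[k] += v

# Shuffle & Sort: group intermediate pairs by key, in sorted order
pairs = sorted(combined.items())

# Reduce: one output per key; reducers run in isolation
result = {k: sum(v for _, v in group)
          for k, group in groupby(pairs, key=lambda kv: kv[0])}
print(result)   # {'big': 3, 'data': 2, 'is': 1}
```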
Master/slave architecture: One JobTracker (master) + multiple TaskTrackers (slaves).
• One JobTracker per cluster. One TaskTracker per slave node.
• JobTracker: accepts job submissions, schedules tasks, monitors slave health, monitors task progress. Single point of failure.
• TaskTracker: executes tasks, sends Heartbeat signal to JobTracker (alive + task status). If Heartbeat not received → assumed dead.
• Communication via Remote Procedure Calls (RPC).
• If a task fails, JobTracker reschedules on same or another node. If a task fails 4 times (default) → job killed.
• If multiple tasks of the same job fail on one TaskTracker → that TaskTracker is blacklisted for that job. If tasks from different jobs fail on it → blacklisted for 24 hours.
| Hadoop 1.0 | Hadoop 2.0 |
|---|---|
| Single NameNode (SPOF) | Active + Standby NameNode |
| Only MapReduce jobs | MapReduce + non-MR apps (Spark, Giraph, Hama) via YARN |
| Batch-only | Batch + real-time / near-real-time |
| JobTracker does scheduling + resource mgmt | YARN splits: ResourceManager + ApplicationMaster |
| NameNode down → Secondary NameNode (recovery only) | NameNode down → Standby NameNode takes over |
Splits JobTracker responsibilities into two daemons: ResourceManager (global resources) + ApplicationMaster (per-job scheduling/monitoring).
Core Components:
1. ResourceManager (RM) – one per cluster. Has: ApplicationsManager (accepts/rejects/monitors apps) + Scheduler (allocates resources).
2. ApplicationMaster – per-application. Negotiates resource containers from RM.
3. NodeManager (NM) – per node. Components: NodeStatusUpdater, ContainerManager, ContainerExecutor, NodeHealthCheckerService, Security.
YARN Schedulers:
| Scheduler | Behavior |
|---|---|
| FIFO | Jobs executed in submission order. No priority. Long jobs block short ones. |
| Capacity | Multiple queues sharing cluster. Each queue gets calculated share. FIFO within queues. Idle resources can be assigned to other jobs. |
| Fair | All apps get fairly equal share of resources. Resources dynamically rebalanced as jobs come/go. Queues have weights (light/heavy). |
YARN Failures:
• ResourceManager failure: Earlier = SPOF (manual restart). Latest = active/passive RM (or ZooKeeper-based failover).
• ApplicationMaster failure: Detected by RM → new container + new AM instance. Recovers state if available externally, else starts from scratch.
• NodeManager failure: Heartbeat stops → RM removes NM from cluster → AM reruns that portion.
• Container failure: AM detects no response → re-executes task. Configurable retry count.
Column-oriented NoSQL DB built on top of HDFS. Horizontally scalable, open-source. Based on Google's Bigtable.
• Supports structured + unstructured data. Provides real-time access to HDFS data.
• Key-value pairs in column-oriented fashion.
• HBase vs Hadoop: Hadoop = flat files, WORM. HBase = read/write many times.
• Features: Automatic failover, Auto sharding, Horizontal scalability (commodity hw), Column-oriented, MapReduce support.
Highly available, linearly scalable, distributed database. Ring architecture: all nodes equal (no master/slave). Data accessed by partition key.
• Data replicated across nodes → highly available. Read/write on any node (coordinator node).
• Eventually consistent. No single point of failure.
• If node goes down → R/W continues on other nodes. Operations queued and updated when failed node recovers.
| Tool | Purpose | Key Detail |
|---|---|---|
| SQOOP | Transfer structured data between RDBMS ↔ HDFS/HBase/Hive | Bidirectional (import/export). Tables read in parallel. Output: text or binary Avro. |
| Flume | Collect streaming data from sources → HDFS/HBase | For sensors, social media, log files. SQOOP = structured; Flume = streaming. |
| Avro | Data serialization framework | Translates data to binary/text for transport. Multi-language (C, C++, Java, Python). Schema in JSON format. |
| Pig | Data analysis using Pig Latin scripting | Handles all data types. Reduces code vs writing Mapper/Reducer. Internally: Parser → Optimizer → Compiler → MapReduce execution. |
| Hive | Distributed data warehouse, HQL (SQL-like) queries | Built on HDFS. DDL like SQL. Data organized into Tables, Partitions, Buckets. Interfaces: ODBC/JDBC. |
| Oozie | Workflow scheduler | 3 job types: Workflow (DAG, on-demand), Coordinator (periodic/data-triggered), Bundle (group of coordinators). Supports MapReduce/Hive/Pig/SQOOP. |
| Mahout | Machine learning platform | Clustering, Classification, Recommendations. Scalable (differentiator vs R). |
| ZooKeeper | Coordination service | Manages coordination between distributed components. Facilitates reachability. |
Hive Layout
Tables are the main container. Partitions cut large tables by partition columns so queries scan only relevant parts. Buckets subdivide partitions by hash to improve manageability and query performance.
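A hedged HiveQL sketch of that layout (table, column, and partition names are invented for illustration):

```sql
-- Partition by a column so queries scan only relevant directories;
-- bucket by hash so each partition splits into a fixed number of files.
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)        -- one directory per date
CLUSTERED BY (id) INTO 4 BUCKETS         -- hash(id) % 4 picks the bucket
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```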
Oozie Workflow
Workflow jobs are DAG-based and on demand. Coordinator jobs run periodically or when input arrives. Bundle jobs group coordinators. Action nodes do work; control nodes decide order.
Pig Latin Example
Useful when you want data analysis without hand-writing mapper and reducer code.
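A classic word-count written in Pig Latin, as a sketch (file paths are placeholders):

```pig
-- Each statement compiles internally to MapReduce stages.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grp    = GROUP words BY word;
counts = FOREACH grp GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_out';
```

Five lines replace a hand-written Mapper and Reducer for the same job.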
Avro Schema Shape
Memorize the high-level fields: type, namespace, name, and fields.
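A minimal example showing those four fields (record and field names are invented), built in Python so the JSON shape is explicit — Avro schemas are plain JSON documents:

```python
import json

schema = {
    "type": "record",              # the schema kind
    "namespace": "example.avro",   # hypothetical package-style namespace
    "name": "Employee",            # hypothetical record name
    "fields": [                    # list of {name, type} field specs
        {"name": "id",   "type": "int"},
        {"name": "dept", "type": "string"},
    ],
}
print(json.dumps(schema, indent=2))
```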
Example 1: File = 500 MB, Block = 64 MB, Replication = 1
500 ÷ 64 = 7.8125 → 8 blocks (round up; last block is partial)
Example 2: File = 800 MB, Block = 128 MB, Replication = 3
800 ÷ 128 = 6.25 → 7 blocks. Size: 6 blocks × 128 = 768 MB. 7th block = 800 − 768 = 32 MB. Total with replication: 7 × 3 = 21 block copies.
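Both worked examples reduce to the same two-step calculation, sketched here as a helper (sizes in MB; `math.ceil` does the round-up):

```python
import math

def hdfs_blocks(file_mb: float, block_mb: float, replication: int = 1):
    """Return (logical blocks, total stored block copies)."""
    blocks = math.ceil(file_mb / block_mb)   # last block may be partial
    return blocks, blocks * replication

print(hdfs_blocks(500, 64))        # (8, 8)
print(hdfs_blocks(800, 128, 3))    # (7, 21)
```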