Fundamentals of Big Data

Exam Study Reference β€” HIAST Big Data Systems Master's
v2 β€” Cross-verified against professor's lecture PDFs. Book-only content clearly tagged.

Lect 1 — Big Data & Data Science
Lect 2 — Storage Concepts
Lect 3 — Processing & Management
Lect 4 — Hadoop Tools & Technologies
LECTURE — In professor's slides
BOOK — Book refresher only (not in slides)
BOTH — In lectures AND book refresher

🎯 Exam Strategy

Course Flow Map
CH 1

Why big data exists

Start with definitions, the V's, RDBMS limits, and the shift from transaction systems to real-time analytics.

CH 2

How storage scales

Clusters, sharding, replication, and database models explain how big systems stay available and affordable.

CH 4

How processing changes

Different architectures and processing models explain when to use batch, streaming, parallel, or distributed execution.

CH 5

How Hadoop implements it

HDFS stores, MapReduce processes, YARN coordinates, and the ecosystem tools fill specific ingestion, query, workflow, and ML roles.

Best revision order: concept -> storage/distribution -> processing model -> Hadoop implementation.

CH 1

Big Data and Data Science

What is Big Data? BOTHβ–Ά

A blanket term for data too large and complex — structured, unstructured, or semi-structured — arriving at high velocity, and difficult to process with traditional DBMS.

Challenges: capture, curation, storage, search, sharing, transfer, analysis, visualization.

Larger datasets β†’ more insights (spot trends, prevent diseases, combat crime, traffic conditions).

Lecture facts: ~2.5 quintillion bytes/day (2.5×10¹⁸). ~80% unstructured. 44× growth 2009→2020 (0.8→35 ZB).
RDBMS vs Big Data LECTUREβ–Ά
Attribute | RDBMS | Big Data
Volume | GB–TB | PB–ZB
Organization | Centralized | Distributed
Data Type | Structured | Structured + Semi + Unstructured
Hardware | High-end | Commodity
Updates | R/W many times | Write once, read many (WORM)
Schema | Static | Dynamic

Scaling: Vertical = increase speed/storage/memory of one machine. Horizontal = cluster of commodity resources as single system.

Big Data vs Data Mining LECTUREβ–Ά
Data Mining | Big Data
Discovering knowledge from datasets | Massive volume (V's)
Structured (spreadsheets, RDBMS) | All types + NoSQL
High processing costs | Lower cost at scale
GB–TB | PB–ZB
The V's of Big Data (3β†’4β†’10) BOTHβ–Ά

3 V's:

1. Volume (Scale) — size of data. TB→EB at rest.

2. Variety (Complexity) β€” structured, semi-structured, unstructured.

3. Velocity (Speed) β€” rate of generation AND processing.

Exam trap: Velocity = storing AND processing (not just generation). Answer D in MCQ.

4th V β€” Veracity: "Data in Doubt" β€” inconsistency, incompleteness, ambiguity.

Fast Memory Hook

Volume

How much data exists. Think TB to ZB scale.

Velocity

How fast data arrives and must be processed.

Variety

Structured, semi-structured, and unstructured formats together.

Veracity

Can you trust the data, or is it noisy, missing, or ambiguous?

10 V's (lecture slide): Volume, Velocity, Variety, Veracity, Value, Validity, Variability, Venue, Vocabulary, Vagueness.

Data Types BOTHβ–Ά
Type | Description | Examples
Structured | Relational tables (rows/columns) | Employee records, financial transactions
Unstructured | Raw, no RDBMS fit. ~80% of data | Video, audio, images, emails, social media
Semi-Structured | Has structure, not relational | JSON, XML
OLTP β†’ OLAP β†’ RTAP LECTUREβ–Ά

1. OLTP β€” Online Transaction Processing (DBMSs, 1968/1970)

2. OLAP β€” Online Analytical Processing (Data Warehousing, 1983/1990)

3. RTAP β€” Real-Time Analytics Processing (Big Data, 2000/2010+)

Traditional DW (Exadata, Teradata) not suited. Shared-nothing, MPP, scale-out = ideal.

Evolution
OLTP — Record day-to-day transactions in operational databases.
OLAP — Analyze historical data inside data warehouses.
RTAP — React while data is still arriving and opportunities still exist.
Real-Time Analytics Use Cases LECTUREβ–Ά

β€’ Relevant product recommendations β€’ Counter customer switching in time β€’ Marketing effectiveness while running β€’ Fraud prevention as it occurs β€’ Friend invitations expanding business

Challenges LECTUREβ–Ά

Bottleneck: technology (new architectures needed) + skills (experts needed). Lack of software (30%), analytic skills (28%), budget (25%), already using (11%).

Data Lifecycle Termsβ–Ά
⚠️ NOT in professor's lecture slides. Appears heavily in book MCQs (Q6-Q9).
Term | Definition (memorize for MCQs)
Data Aggregation | Collecting raw data, transmitting to storage, preprocessing
Data Preprocessing | Transform raw data into understandable format; ensure consistency
Data Integration | Combining data from different sources → unified view
Data Cleaning | Fill missing values, correct errors/inconsistencies, remove redundancy
Data Transformation | Transform into format acceptable by big data database
Data Reduction | Reduce volume/dimensions without losing integrity

Chapter 1 β€” MCQs

Q1. Big Data is _____. BOTH
A) Structured
B) Semi-structured
C) Unstructured
D) All of the above
Big Data can be structured, unstructured, or semi-structured.

Q2. Hardware used in big data: BOTH
A) High-performance PCs
B) Low-cost commodity hardware
C) Dumb terminal
D) None
Big data uses low-cost commodity hardware.

Q3. Commodity hardware means: BOOK
A) Very cheap hardware
B) Industry-standard
C) Discarded
D) Low-spec industry-grade
Low-cost, low-performance, low-spec with no distinctive features. NOT "very cheap"!

Q4. "Velocity" means: BOTH
A) Speed of generation
B) Speed of processors
C) Speed of ONLY storing
D) Speed of storing AND processing
Velocity = speed of storing AND processing.

Q5. JSON and XML are: BOTH
A) Structured
B) Unstructured
C) Semi-structured
D) None
Semi-structured: has structure but not relational.

Q6. _____ corrects errors and inconsistencies. BOOK
A) Data cleaning
B) Data Integration
C) Data transformation
D) Data reduction
Data cleaning corrects errors, fills missing values, removes redundancy.

Q7. _____ transforms data into acceptable format. BOOK
A) Cleaning
B) Integration
C) Transformation
D) Reduction
Data transformation converts to format acceptable by big data DB.

Q8. _____ combines data from different sources. BOOK
A) Cleaning
B) Integration
C) Transformation
D) Reduction
Data integration → unified view from different sources.

Q9. _____ collects raw data, transmits, preprocesses. BOOK
A) Cleaning
B) Integration
C) Aggregation
D) Reduction
Data aggregation = collect + transmit + preprocess.

Short Answer

Drawbacks of traditional DBs? BOOKπŸ‘
1) Volume (TB/PB) challenged RDBMS. 2) Adding processors increased cost. 3) ~80% data semi/unstructured. 4) Couldn't capture high-velocity data.
What is data reduction? BOOKπŸ‘
Reducing volume/dimensions without losing integrity. Makes massive data analysis feasible.
Big data examples? BOOKπŸ‘
Facebook ~500 TB/day. Airlines ~10 TB sensor/30 min. NYSE ~1 TB/day.
CH 2

Big Data Storage Concepts

Storage Architecture LECTUREβ–Ά

Sources (machine/web/audio-video/external) → Hadoop Cluster (online archive, all data types) → Data Warehouse → Ad-hoc queries.

Raw data analyzed using MapReduce β€” Hadoop's programming paradigm.

Storage Path
Data Sources — Machine, web, external, and media inputs arrive continuously.
Hadoop Cluster — Acts as the online archive for raw structured and unstructured data.
Warehouse / Marts — Prepared subsets move downstream for focused analytics.
Queries — Users consume results through reporting and ad-hoc analysis.
Cluster Computing BOTHβ–Ά

Multiple standalone PCs connected via LANs → a single integrated, highly available resource.

Benefits: high availability, load balancing, performance, reliability, fault tolerance, scalable (add/remove nodes), commodity hardware.

Cluster Types: High-Availability vs Load-Balancing BOTHβ–Ά
High-Availability | Load-Balancing
Minimize downtime, uninterrupted service | Distribute workloads across nodes
Eliminate single points of failure | Identical data copies across all nodes
Zero transactional data loss | Optimize resources, minimize response time, maximize throughput
Symmetric vs Asymmetric Cluster BOTHβ–Ά

Symmetric: Each node = independent computer running apps. User → any node.

Asymmetric: One head node = gateway. User → head node → workers.

Distribution: Sharding & Replication BOTHβ–Ά

Sharding = partition data into shards across nodes. No two shards of same file on same node. Improves fault tolerance.

Replication = copies of same data across servers. Fault tolerant.

Key: Sharding = different data on different nodes. Replication = same data on multiple nodes. Combined → fault tolerant + highly available.

Master-Slave (lecture): Writes → master. Reads → slaves. Master fails → writes stop.

Peer-to-Peer (lecture): All nodes equal. Workload partitioned.

Distribution Model Visual
Sharding:
Node A — Shard 1
Node B — Shard 2
Node C — Shard 3

Replication:
Node A — Block X copy
Node B — Block X copy
Node C — Block X copy

Exam shortcut: different pieces across nodes = sharding. Same piece copied across nodes = replication.
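The shortcut above can be sketched in a few lines of Python. The node names and the md5-based placement are illustrative assumptions, not HDFS's actual placement algorithm:

```python
import hashlib

NODES = ["node-A", "node-B", "node-C"]   # hypothetical 3-node cluster
REPLICATION = 3                          # copies of each block (HDFS-style default)

def stable_hash(key: str) -> int:
    # hashlib is deterministic across runs (Python's built-in hash() is salted)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def shard_node(key: str) -> str:
    """Sharding: each piece of data lives on exactly ONE node."""
    return NODES[stable_hash(key) % len(NODES)]

def replica_nodes(block_id: str) -> list:
    """Replication: the SAME block is copied to several distinct nodes."""
    start = stable_hash(block_id) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION)]
```

With three nodes and replication factor 3, `replica_nodes` returns all three nodes (same piece everywhere), while `shard_node` sends different keys to different single nodes.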

Failover & Switch Overβ–Ά
⚠️ Not in lecture slides.

Failover = automatic switch to redundant node. Switch Over = requires human intervention.

File System LECTUREβ–Ά

Organizes data on storage devices, tracks files. File = smallest unit of storage.

RDBMS β†’ NoSQL β†’ NewSQL BOTHβ–Ά

RDBMS: Vertical scaling, ACID, static schema. Can't handle big data.

NoSQL: CAP + BASE (eventually consistent). Types: Key-value, Document, Column, Graph.

NewSQL: NoSQL scalability + RDBMS ACID. Shared-nothing, SQL-compliant. 3 layers: admin, transactional, storage. Examples: VoltDB, NuoDB, Clustrix, MemSQL, TokuDB.

MCQ: "NoSQL exhibits ___" → BASE.
Scale-Up vs Scale-Out BOTHβ–Ά
Scale-Up (Vertical) | Scale-Out (Horizontal)
Add CPU/RAM/HDD to existing server | Add new nodes to cluster
Limited by max server capacity | Unlimited
Expensive | Commodity hardware
Ex: RAM 32→128 GB | Ex: +3 machines
Traditional RDBMS | Big data approach

Chapter 2 β€” MCQs

Q1. Loosely connected computers = BOOK
A) LAN
B) WAN
C) Workstation
D) Cluster
Cluster = computers working together.

Q2. Cluster classified into: BOTH
A) High-availability
B) Load-balancing
C) Both
D) None
Two types.

Q3. Cluster emerged from: BOOK
A) ISA
B) Workstation
C) Supercomputers
D) Distributed systems
From distributed systems.

Q4. Eliminate service interruptions: BOOK
A) Sharding
B) Replication
C) Failover
D) Partition
Failover = automatic switch to redundant node.

Q5. Adds more resources/CPU to increase capacity: BOTH
A) Horizontal
B) Vertical scaling
C) Partition
D) All
Vertical = add resources to existing server.

Q6. Copying same data across nodes: BOTH
A) Replication
B) Partition
C) Sharding
D) None
Replication = same data on multiple nodes.

Q7. Dividing data over multiple servers: BOTH
A) Vertical
B) Sharding
C) Partition
D) All
Sharding = different data on different nodes.

Q8. Sharded cluster is ____ for high availability: BOOK
A) Replicated
B) Partitioned
C) Clustered
D) None
Sharding + Replication = fault tolerant + available.

Q9. NoSQL exhibits ____ properties: BOTH
A) ACID
B) BASE
C) Both
D) None
NoSQL → CAP + BASE (eventually consistent).

Short Answer

Replication vs Sharding? BOTHπŸ‘
Replication = same data on multiple nodes. Sharding = different data on different nodes.
Master-slave model? BOTHπŸ‘
Master controls slaves. Writes β†’ master, reads β†’ slaves. Master fails β†’ writes stop until recovery/promotion.
Peer-to-peer model? BOTHπŸ‘
No master. All nodes equal responsibility. Workload partitioned.
What is NewSQL? BOTHπŸ‘
NoSQL scalability + RDBMS ACID. Distributed, horizontal, fault tolerant, shared-nothing, SQL-compliant. Examples: VoltDB, NuoDB, Clustrix, MemSQL.
Scale-up vs scale-out? BOTHπŸ‘
Up = add resources to existing server (limited). Out = add commodity nodes (unlimited). Big data uses scale-out.
Failover vs switchover? BOOKπŸ‘
Failover = automatic. Switchover = human intervention.
Distributed file system? BOOKπŸ‘
Stores files across cluster nodes. Physically distributed, logically appears local to client.
CH 4

Processing & Management Concepts

Data Processing BOTHβ–Ά

Collecting, processing, manipulating, managing data β†’ meaningful information.

Stages (lecture diagram): Input (capture/collect/transmit) → Processing (classify/sort/math/transform) → Storage (store/retrieve/archive/govern) → Output (compute/format/present).

Types: Centralized (single location) vs Distributed (across locations).

Shared-Everything Architecture BOTHβ–Ά

Shares ALL resources. Limits scalability.

1. SMP = UMA (Uniform Memory Access). Single shared memory. Bandwidth choking on shared bus.

2. DSM = NUMA (Non-Uniform). Multiple memory pools. Latency depends on distance.

MCQ: SMP = UMA. DSM = NUMA.
Shared-Nothing Architecture BOTHβ–Ά

Each node = own memory, storage, disks. Infinite scalability. Ideal for web/Internet apps.

Teradata = shared-nothing (MCQ).
Architecture Quick Compare

Shared-Everything

  • All processors share core resources.
  • SMP = UMA and DSM = NUMA.
  • Simpler to reason about, but scalability is limited.

Shared-Nothing

  • Each node keeps its own CPU, memory, and storage.
  • Scale-out comes from adding more independent nodes.
  • Preferred for big web and Internet-scale applications.
Processing Types BOTHβ–Ά
Type | Key Details (from lecture)
Batch | Jobs executed sequentially/parallel → combined output. TB/PB, response time not critical. Hadoop MapReduce. Cost-effective. Ex: payroll, billing, DW, log analysis.
Real-Time | Continual data flow. In-memory processing while streaming. Stored to disk AFTER. Low latency. Ex: ATM, POS, fraud detection.
Parallel | Subtasks on multiple CPUs, single machine. Shared memory.
Distributed | Subtasks on separate networked machines (cluster).

Key: Parallel = 1 machine. Distributed = multiple machines.
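The Parallel row above (subtasks on multiple CPUs of one machine, shared memory) can be sketched with Python's standard thread pool; the 4-way split and the summing subtask are arbitrary choices for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # each subtask processes its own slice of the shared data
    return sum(chunk)

data = list(range(1000))
chunks = [data[i::4] for i in range(4)]          # split the job into 4 subtasks

with ThreadPoolExecutor(max_workers=4) as pool:  # multiple workers, ONE machine
    partials = list(pool.map(subtask, chunks))

total = sum(partials)                            # combine partial results
assert total == sum(data)
```

Distributed processing would instead ship each chunk to a separate networked machine and combine results over the network.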

Platforms (lecture table):

Platform | Developer | Type
Storm | Twitter | Streaming
S4 | Yahoo | Streaming
MillWheel | Google | Streaming
Hadoop | Apache | Batch
Disco | Nokia | Batch
Virtualization BOTHβ–Ά

Software layer on hardware. Multiple OS on one machine, each independent.

Attributes (lecture):

1. Encapsulation β€” VM = single file. Dedicated VM per app β†’ no interference.

2. Partitioning β€” Hardware β†’ logical partitions, each with separate OS.

3. Isolation β€” VMs isolated. One crash β†’ others unaffected.

Server virtualization (lecture): Server β†’ multiple VMs. CPU/memory virtualized. Enables large-volume Big Data analysis.

Cloud Computingβ–Ά
⚠️ NOT in any lecture PDF. Heavy in book MCQs (~10 questions).

Deployment: Public (3rd-party, pay-as-you-go) | Private (owned by company, more secure) | Hybrid (public+private combined).

Services: SaaS (software subscription: Salesforce) | PaaS (dev platform: Heroku) | IaaS (infrastructure: AWS, IBM).

Virtualization types (book): Server, Desktop, Network, Storage, Application.

AWS = IaaS. Private cloud = enterprise data center. EUCALYPTUS = "Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems".

Chapter 4 β€” MCQs

Q1. If one site fails in distributed system: BOOK
A) Remaining continue
B) All stop
C) Connected stop
D) None
Others continue operating.

Q2. Teradata is: BOOK
A) Shared-nothing
B) Shared-everything
C) DSM
D) None
Teradata = shared-nothing.

Q3. Virtualization attributes: BOTH
A) Encapsulation
B) Partitioning
C) Isolation
D) All
All three.

Q4. Architecture sharing all resources: BOTH
A) Shared-everything
B) Shared-nothing
C) Shared-disk
D) None
Shared-everything, but limits scalability.

Q5. Splitting task simultaneously: BOTH
A) Parallel
B) Distributed
C) Both
D) None
Parallel computing (single machine, multiple CPUs).

Q6. ____ = uniform memory access: BOTH
A) Shared-nothing
B) SMP
C) DSM
D) Shared-everything
SMP = UMA.

Q7. Log analysis uses: BOOK
A) Batch
B) Real-time
C) Parallel
D) None
Batch = collect over time, process later.

Q8. Apps on distributed network + virtualized resources: BOOK
A) Cloud computing
B) Distributed
C) Parallel
D) Data processing
Cloud computing.

Q9. Resource sharing concept: BOOK
A) Abstraction
B) Virtualization
C) Reliability
D) Availability
Virtualization.

Q10. Server → multiple servers = BOOK
A) Server splitting
B) Server virtualization
C) Server partitioning
D) None
Server virtualization.

Q11. Cloud deployment models: BOOK
A) Public
B) Private
C) Hybrid
D) All
Public, Private, Hybrid.

Q12. Cloud services: BOOK
A) IaaS
B) PaaS
C) SaaS
D) All
IaaS, PaaS, SaaS.

Q13. AWS type: BOOK
A) SaaS
B) IaaS
C) PaaS
D) None
AWS = IaaS.

Q14. Cloud within enterprise = BOOK
A) Public
B) Private
C) Hybrid
D) None
Private cloud.

Short Answer

Shared-everything and its types? BOTHπŸ‘
Shares all resources, limits scalability. SMP (UMA: single memory pool) and DSM (NUMA: multiple pools, latency by distance).
Parallel vs distributed computing? BOTHπŸ‘
Parallel: multiple CPUs in ONE machine. Distributed: separate networked machines (cluster).
Virtualization and its attributes? BOTHπŸ‘
Software layer enabling multiple OS on one machine. Encapsulation (VM = file), Partitioning (hw → logical partitions), Isolation (VM crash doesn't affect others).
SaaS, PaaS, IaaS? BOOKπŸ‘
SaaS: software subscription (Salesforce). PaaS: dev platform (Heroku). IaaS: infrastructure (AWS).
EUCALYPTUS? BOOKπŸ‘
"Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems" β€” open-source cluster software.
CH 5

Driving Big Data with Hadoop Tools & Technologies

Apache Hadoop Overview BOTHβ–Ά

Apache Hadoop = open-source Java framework for processing large data sets in streaming access pattern across clusters in a distributed computing environment.

Can store structured, semi-structured, and unstructured data in a DFS and process in parallel. Highly scalable and cost-effective.

Files are written once and read many times (WORM). Contents cannot be changed.

Architecture β€” Two layers: 1) HDFS (storage layer) 2) MapReduce engine (processing layer).

Four components: Hadoop Common (utilities), HDFS (distributed storage), YARN (resource negotiation), MapReduce (parallel processing).

Hadoop Stack
Access and Management — Hive, Pig, Oozie, Flume, SQOOP, Mahout, ZooKeeper, Avro.
Processing Layer — MapReduce handles batch execution and YARN handles resource negotiation and scheduling.
Storage Layer — HDFS stores blocks across the cluster, while HBase adds real-time column-oriented access.
Hadoop Ecosystem β€” 4 Layers LECTUREβ–Ά
Layer | Components
Data Storage | HDFS (distributed file system) + HBase (column-oriented DB)
Data Processing | MapReduce (job processing) + YARN (resource allocation, scheduling, monitoring)
Data Access | Hive (HQL query), Pig (data analysis scripting), Mahout (ML), Avro (serialization), SQOOP (RDBMS ↔ HDFS transfer)
Data Management | Oozie (workflow scheduler), Chukwa (monitoring), Flume (streaming data flow → HDFS), ZooKeeper (coordination service)
HDFS β€” Hadoop Distributed File System BOTHβ–Ά

Designed to store large data sets with streaming access pattern on commodity hardware by partitioning data into small chunks.

Key facts:

• Default block size: 64 MB. A 200 MB file → four blocks total: three full 64 MB blocks + one partial 8 MB block.

β€’ Default replication factor: 3 (three copies across different nodes/racks for fault tolerance).

β€’ Reads: parallel. Writes: pipelined (not parallel β€” to avoid inconsistency).

β€’ NOT suitable for low-latency access.

Why not parallel writes? Two nodes writing in parallel β†’ neither knows what the other wrote β†’ data inconsistency.

Features: Cost-effective (open-source + commodity hw), Distributed storage, Data replication (3 copies default).

Why HDFS Wins
Single Machine: ~22.5 min
Lecture example: reading 500 GB with four 100 MB/s I/O channels.

Distributed FS: ~13.5 sec
Lecture example: spread across 100 machines and read in parallel.

Main Idea: parallel reads
Move work close to the data and remove one-node bottlenecks.

NameNode & DataNode BOTHβ–Ά

NameNode (Master):

β€’ Manages namespace of entire file system

β€’ Supervises health of DataNodes via Heartbeat signal

β€’ Controls file access. Does NOT hold actual data β€” holds metadata (which blocks constitute a file + their locations)

β€’ Single point of failure. Needs manual intervention in H1.0

β€’ NOT commodity hardware β€” must be highly available system

DataNode (Slave):

β€’ Where actual data resides, distributed across cluster

β€’ Files split into blocks of 64 MB by default

β€’ NameNode decides which block goes to which DataNode

β€’ Serves read/write requests from clients

Secondary NameNode: Periodically backs up NameNode RAM data. NOT a substitute for NameNode β€” acts as recovery mechanism. Runs on separate machine (needs equivalent memory).

Rack: A collection of DataNodes stored at a single location. Replications placed across racks to overcome rack failure (1st replica on one rack, 2nd+3rd on a different rack).

Master node = NameNode + JobTracker. Slave node = DataNode + TaskTracker.
HDFS Read & Write Operations LECTUREβ–Ά

Write: Client → request to NameNode → receives metadata → writes to first DataNode → pipelined replication to other DataNodes → ACKs flow back → Block Received ACK to NameNode.

Read: Client → File Read Request to NameNode → gets metadata (block locations) → parallel read directly from DataNodes.

Write types (book): Posted (no ACK required) vs Non-posted (ACK required).

Read vs Write Flow
Write:
1. NameNode — Client asks where blocks should go.
2. DataNode 1 — Client starts writing the block.
3. Pipeline — Replica copies are forwarded to other DataNodes.
4. ACK — Acknowledgements return backward to the client and NameNode.

Read:
1. NameNode — Client requests block locations.
2. Metadata — NameNode returns the nearest or best replicas.
3. DataNodes — Client reads blocks directly from storage nodes.
4. Parallelism — Multiple blocks can be fetched at the same time.
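A toy simulation of the pipelined write (plain dicts standing in for DataNodes; names are made up) shows why the client contacts only the first node while ACKs return in reverse:

```python
def pipelined_write(block_id, pipeline):
    """Client -> DataNode 1 -> DataNode 2 -> DataNode 3; ACKs flow backward."""
    for node in pipeline:                 # each node stores the block, then
        node["blocks"].append(block_id)   # forwards it to the next in line
    # ACKs travel in reverse: the last replica acknowledges first
    return [node["name"] for node in reversed(pipeline)]

datanodes = [{"name": n, "blocks": []} for n in ("DN1", "DN2", "DN3")]
acks = pipelined_write("blk_0001", datanodes)
# every replica now holds the block; acks == ["DN3", "DN2", "DN1"]
```

A parallel read, by contrast, would fetch different blocks from different DataNodes at the same time, which is safe because reading cannot create inconsistency.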
MapReduce β€” Pipeline & Components BOTHβ–Ά

MapReduce = batch-processing programming model using divide-and-conquer. Highly scalable, reliable, fault tolerant. Supports only batch workloads.

Key advantage: moves processing TO the data (not data to processing).

Pipeline: Input → Split → Map → Combine (optional) → Partition (optional) → Shuffle & Sort → Reduce → Output

Pipeline In One View
Input — Raw file enters HDFS.
Split — Logical input pieces are prepared.
Map — Each split emits intermediate key-value pairs.
Combine — Optional local reduction shrinks transfer volume.
Partition — Optional routing decides which reducer sees each key.
Shuffle / Sort — Framework groups equal keys together.
Reduce — User logic aggregates each key's values.
Output — Final result is written back to DFS.

Mapper: Breaks large data into small blocks. Each resolved into key-value pairs (K1,V1). Executes user logic β†’ produces intermediate (K2,V2). Processing done in parallel.

Combiner: Optimizes mapper output before sending to reducer. Essentially a "local reducer" β€” groups repeated keys and lists their values. Reduces overhead of data transfer between mapper and reducer.

Reducer: Processes sorted key-value pairs. Each reducer runs in isolation (no inter-reducer communication). Input sorted by key. Output written back to DFS.

Mini Example
Input rows: (Leeds,20) (Bexley,17) (Bradford,11) ...
Map output: Leeds -> 20, Bexley -> 17, Bradford -> 11
Reduce output example: Leeds -> 22, Bexley -> 21, Bradford -> 21
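The mini example can be reproduced with an in-memory map/shuffle/reduce in Python; the second batch of input rows is assumed, chosen so the totals match the slide's reduce output:

```python
from itertools import groupby
from operator import itemgetter

# (city, value) records; the last three rows are assumed extra input
records = [("Leeds", 20), ("Bexley", 17), ("Bradford", 11),
           ("Leeds", 2), ("Bexley", 4), ("Bradford", 10)]

# Map phase: emit intermediate key-value pairs (an identity map here)
intermediate = [(city, value) for city, value in records]

# Shuffle & sort: group equal keys together
intermediate.sort(key=itemgetter(0))

# Reduce phase: aggregate each key's values in isolation
output = {city: sum(v for _, v in group)
          for city, group in groupby(intermediate, key=itemgetter(0))}
# output: {'Bexley': 21, 'Bradford': 21, 'Leeds': 22}
```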

Limitations of MapReduce: reducers depend on mapper output, so the job has a phase barrier and is poor for low-latency or iterative workloads.

JobTracker & TaskTracker (Hadoop 1.0) BOTHβ–Ά

Master/slave architecture: One JobTracker (master) + multiple TaskTrackers (slaves).

β€’ One JobTracker per cluster. One TaskTracker per slave node.

β€’ JobTracker: accepts job submissions, schedules tasks, monitors slave health, monitors task progress. Single point of failure.

β€’ TaskTracker: executes tasks, sends Heartbeat signal to JobTracker (alive + task status). If Heartbeat not received β†’ assumed dead.

β€’ Communication via Remote Procedure Calls (RPC).

β€’ If a task fails, JobTracker reschedules on same or another node. If task fails 4 times (default) β†’ job killed.

• If multiple tasks of the same job fail on one TaskTracker, that TaskTracker is excluded from running that job again. If tasks from different jobs fail on it, it is excluded from the cluster for 24 hours.

Input split vs HDFS block: Input split = logical division. HDFS block = physical division.
Hadoop 1.0 vs 2.0 BOTHβ–Ά
Hadoop 1.0 | Hadoop 2.0
Single NameNode (SPOF) | Active + Standby NameNode
Only MapReduce jobs | MapReduce + non-MR apps (Spark, Giraph, Hama) via YARN
Batch-only | Batch + real-time / near-real-time
JobTracker does scheduling + resource mgmt | YARN splits: ResourceManager + ApplicationMaster
NameNode down → Secondary NameNode (recovery only) | NameNode down → Standby NameNode takes over
Accuracy note: Hadoop 2.0 still uses MapReduce for batch jobs, but YARN lets the same cluster run other frameworks that can support near-real-time processing.
YARN β€” Yet Another Resource Negotiator BOTHβ–Ά

Splits JobTracker responsibilities into two daemons: ResourceManager (global resources) + ApplicationMaster (per-job scheduling/monitoring).

Core Components:

1. ResourceManager (RM) β€” one per cluster. Has: ApplicationsManager (accepts/rejects/monitors apps) + Scheduler (allocates resources).

2. ApplicationMaster β€” per-application. Negotiates resource containers from RM.

3. NodeManager (NM) β€” per node. Components: NodeStatusUpdater, ContainerManager, ContainerExecutor, NodeHealthCheckerService, Security.

Execution Path
Client — Submits an application to the cluster.
RM — Accepts the application and assigns initial resources.
AM — Negotiates more containers and tracks that one job.
NM — Launches and monitors containers on each node.
Containers — Execute task work and report progress back up.

YARN Schedulers:

Scheduler | Behavior
FIFO | Jobs executed in submission order. No priority. Long jobs block short ones.
Capacity | Multiple queues sharing cluster. Each queue gets calculated share. FIFO within queues. Idle resources can be assigned to other jobs.
Fair | All apps get fairly equal share of resources. Resources dynamically rebalanced as jobs come/go. Queues have weights (light/heavy).
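The Fair scheduler's weighted queues reduce to simple proportions; a sketch (the queue names and the 100-container cluster capacity are made up for illustration):

```python
def fair_shares(cluster_capacity, queue_weights):
    """Fair scheduler sketch: capacity is split in proportion to queue weight."""
    total_weight = sum(queue_weights.values())
    return {queue: cluster_capacity * weight / total_weight
            for queue, weight in queue_weights.items()}

# a "heavy" queue weighted 3x a "light" one on a 100-container cluster
shares = fair_shares(100, {"light": 1, "heavy": 3})
# shares == {'light': 25.0, 'heavy': 75.0}
```

When a queue goes idle, a real Fair scheduler would rebalance: recomputing with only the remaining queues hands the freed capacity back out proportionally.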

YARN Failures:

β€’ ResourceManager failure: Earlier = SPOF (manual restart). Latest = active/passive RM (or ZooKeeper-based failover).

β€’ ApplicationMaster failure: Detected by RM β†’ new container + new AM instance. Recovers state if available externally, else starts from scratch.

β€’ NodeManager failure: Heartbeat stops β†’ RM removes NM from cluster β†’ AM reruns that portion.

β€’ Container failure: AM detects no response β†’ re-executes task. Configurable retry count.

HBase LECTUREβ–Ά

Column-oriented NoSQL DB built on top of HDFS. Horizontally scalable, open-source. Based on Google's Bigtable.

β€’ Supports structured + unstructured data. Provides real-time access to HDFS data.

β€’ Key-value pairs in column-oriented fashion.

β€’ HBase vs Hadoop: Hadoop = flat files, WORM. HBase = read/write many times.

β€’ Features: Automatic failover, Auto sharding, Horizontal scalability (commodity hw), Column-oriented, MapReduce support.

Apache Cassandra LECTUREβ–Ά

Highly available, linearly scalable, distributed database. Ring architecture β€” all nodes equal (no master/slave). Data accessed by partition key.

β€’ Data replicated across nodes β†’ highly available. Read/write on any node (coordinator node).

β€’ Eventually consistent. No single point of failure.

β€’ If node goes down β†’ R/W continues on other nodes. Operations queued and updated when failed node recovers.

Ecosystem Tools: SQOOP, Flume, Avro, Pig, Hive, Oozie, Mahout LECTUREβ–Ά
Tool | Purpose | Key Detail
SQOOP | Transfer structured data between RDBMS ↔ HDFS/HBase/Hive | Bidirectional (import/export). Tables read in parallel. Output: text or binary Avro.
Flume | Collect streaming data from sources → HDFS/HBase | For sensors, social media, log files. SQOOP = structured; Flume = streaming.
Avro | Data serialization framework | Translates data to binary/text for transport. Multi-language (C, C++, Java, Python). Schema in JSON format.
Pig | Data analysis using Pig Latin scripting | Handles all data types. Reduces code vs writing Mapper/Reducer. Internally: Parser → Optimizer → Compiler → MapReduce execution.
Hive | Distributed data warehouse, HQL (SQL-like) queries | Built on HDFS. DDL like SQL. Data organized into Tables, Partitions, Buckets. Interfaces: ODBC/JDBC.
Oozie | Workflow scheduler | 3 job types: Workflow (DAG, on-demand), Coordinator (periodic/data-triggered), Bundle (group of coordinators). Supports MapReduce/Hive/Pig/SQOOP.
Mahout | Machine learning platform | Clustering, Classification, Recommendations. Scalable (differentiator vs R).
ZooKeeper | Coordination service | Manages coordination between distributed components. Facilitates reachability.
Most Testable Tool Details

Hive Layout

Tables are the main container. Partitions cut large tables by partition columns so queries scan only relevant parts. Buckets subdivide partitions by hash to improve manageability and query performance.

Oozie Workflow

Workflow jobs are DAG-based and on demand. Coordinator jobs run periodically or when input arrives. Bundle jobs group coordinators. Action nodes do work; control nodes decide order.

Pig Latin Example

Useful when you want data analysis without hand-writing mapper and reducer code.

A = LOAD 'student.txt' USING PigStorage AS (name:chararray, age:int, gpa:float);
B = FILTER A BY age < 20;
C = FOREACH B GENERATE name, age;

Avro Schema Shape

Memorize the high-level fields: type, namespace, name, and fields.

{
  "type": "record",
  "namespace": "example",
  "name": "StudentName",
  "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"}
  ]
}
Block Size Calculations BOOKβ–Ά
⚠️ These calculation patterns appear in book refresher β€” high likelihood of appearing on exam.

Example 1: File = 500 MB, Block = 64 MB, Replication = 1

500 ÷ 64 = 7.8125 → 8 blocks (round up — the last block is partial)

Example 2: File = 800 MB, Block = 128 MB, Replication = 3

800 ÷ 128 = 6.25 → 7 blocks. Size: 6 blocks × 128 = 768 MB. 7th block = 800 - 768 = 32 MB. Total with replication: 7 × 3 = 21 block copies.
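Both examples follow the same arithmetic, which generalizes to a small helper (the function is ours for checking answers, not a Hadoop API):

```python
import math

def hdfs_block_layout(file_mb, block_mb, replication=3):
    """Blocks round UP (the last block may be partial); copies = blocks x replication."""
    blocks = math.ceil(file_mb / block_mb)
    last_block_mb = file_mb - (blocks - 1) * block_mb   # size of the final block
    return blocks, last_block_mb, blocks * replication

assert hdfs_block_layout(500, 64, 1) == (8, 52, 8)     # Example 1: 8 blocks
assert hdfs_block_layout(800, 128, 3) == (7, 32, 21)   # Example 2: 21 copies
```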

Chapter 5 β€” MCQs

Q1. Default block size of HDFS: BOTH
A) 32 MB
B) 64 MB
C) 128 MB
D) 16 MB
HDFS splits files into blocks of 64 MB by default.

Q2. Default replication factor of HDFS: BOTH
A) 4
B) 1
C) 3
D) 2
Default replication factor = 3 for fault tolerance.

Q3. Can HDFS data blocks be read in parallel? BOTH
A) Yes
B) No
Reads = parallel. Writes = pipelined (not parallel).

Q4. In Hadoop there exists _______: BOTH
A) One JobTracker per job
B) One JobTracker per Mapper
C) One JobTracker per node
D) One JobTracker per cluster
Master/slave: one JobTracker per cluster, TaskTrackers on slave nodes.

Q5. Tasks from JobTracker are executed by _______: BOTH
A) MapReduce
B) Mapper
C) TaskTracker
D) JobTracker
TaskTracker executes tasks assigned by JobTracker.

Q6. Default times a task can fail before job is killed: BOTH
A) 3
B) 4
C) 5
D) 6
4 is default. Can be modified.

Q7. Input key-value pairs mapped by ________ into intermediate pairs: BOTH
A) Mapper
B) Reducer
C) Both
D) None
Mapper transforms input records into intermediate key-value pairs.

Q8. ________ negotiates resources from ResourceManager: BOTH
A) NodeManager
B) ResourceManager
C) ApplicationMaster
D) All
ApplicationMaster negotiates resource containers from RM.

Q9. Hadoop YARN stands for: BOTH
A) Yet Another Resource Network
B) Yet Another Reserve Negotiator
C) Yet Another Resource Negotiator
D) All
YARN = Yet Another Resource Negotiator.

Q10. When NameNode goes down in Hadoop 1.0: BOTH
A) Rack
B) DataNode
C) Secondary NameNode
D) None
Secondary NameNode for recovery in H1.0 (not a replacement — just backup).

Q11. When active NameNode goes down in Hadoop 2.0: BOTH
A) Standby NameNode
B) DataNode
C) Secondary NameNode
D) None
In H2.0/YARN, Standby NameNode takes over from Active.

Short Answer

What is fault tolerance in Hadoop? BOTHπŸ‘
Ability to recover data even if a node fails. Achieved by replicating data across 3 nodes (default). If first crashes and second unavailable, third retrieves data.
Is NameNode commodity hardware? BOOKπŸ‘
No. NameNode is the single point of failure and the entire file system relies on it. Must be a highly available system.
Master vs slave node in Hadoop? BOTHπŸ‘
Master: NameNode + JobTracker (monitors storage + task status). Slave: DataNode + TaskTracker (stores actual data + processes MR jobs).
What is a combiner? BOTHπŸ‘
Essentially the local reducer of the map job. Groups mapper output β€” repeated keys combined with their values listed. Optimizes data transfer to reducer.
Why is HDFS not suited for many small files? BOOKπŸ‘
NameNode is expensive high-performance system. Many small files generate massive metadata filling NameNode space. Large files β†’ less metadata per file β†’ optimized.
What is Secondary NameNode? Is it a substitute? BOTHπŸ‘
Periodically backs up NameNode RAM data. NOT a substitute β€” acts as recovery mechanism only. Runs on separate machine with equivalent memory.
What is a Heartbeat signal? BOTHπŸ‘
TaskTracker sends to JobTracker to indicate it's alive + current task status or availability. If not received after specific interval β†’ assumed dead.
Input split vs HDFS block? BOOKπŸ‘
Input split = logical division of data. HDFS block = physical division of data.
Why reads parallel but writes not? BOOKπŸ‘
Parallel writes β†’ data inconsistency (neither node knows what the other wrote). Reads can safely occur in parallel for faster access.
Why replications in different racks? BOOKπŸ‘
1st replica on one rack, 2nd+3rd on a different rack. Overcomes entire rack failure.
What if all 3 DataNode replications fail? BOOKπŸ‘
Data cannot be recovered. For high-priority jobs, increase replication factor beyond 3 (configurable).
What happens when JobTracker goes down in H1.0? BOOKπŸ‘
All jobs will be restarted, interrupting overall execution.
Storage node vs compute node? BOOKπŸ‘
Storage node = where actual data resides. Compute node = where business logic is executed.