🎯 Exam Strategy
- V's: 3 core (Volume, Velocity, Variety) + Veracity = 4th. Full 10 in lecture slide. LECTURE
- Architectures: Shared-Everything (limits scalability; SMP=UMA, DSM=NUMA) vs Shared-Nothing (infinite scalability, each node independent). LECTURE
- Distribution: Sharding ≠ Replication. Can combine. Master-Slave vs Peer-to-Peer. LECTURE
- Processing: Batch (MapReduce) vs Real-time (Storm/S4/MillWheel) vs Parallel (1 machine) vs Distributed (cluster). LECTURE
- Databases: RDBMS → ACID. NoSQL → CAP + BASE. NewSQL → ACID + scalability + shared-nothing. LECTURE
- Scaling: Up=vertical (limited). Out=horizontal (commodity, unlimited). LECTURE
- Virtualization: Encapsulation, Partitioning, Isolation → memorize. LECTURE
- Clusters: High-avail vs Load-balancing. Symmetric vs Asymmetric. LECTURE
- Book-only watch: Cloud (SaaS/PaaS/IaaS), data lifecycle terms (aggregation, cleaning, etc.), failover. NOT in lectures but heavy in book MCQs. BOOK
- Hadoop architecture: Two layers: HDFS (storage) + MapReduce (processing). Ecosystem: 4 layers (storage, processing, access, management). LECTURE
- HDFS defaults: Block size = 64 MB. Replication factor = 3. Reads = parallel. Writes = pipelined. NOT suitable for low-latency access. BOTH
- NameNode vs DataNode: NameNode = metadata (single point of failure, NOT commodity hw). DataNode = actual data (commodity hw). BOTH
- Hadoop 1.0 vs 2.0: H1 = single NameNode (SPOF), batch only, JobTracker does everything. H2 = Active+Standby NameNode, YARN, real-time support. BOTH
- YARN components: ResourceManager, ApplicationMaster, NodeManager. Schedulers: FIFO, Capacity, Fair. BOTH
- MapReduce pipeline: Input → Split → Map → Combine (optional) → Partition → Shuffle & Sort → Reduce → Output. Know Mapper, Combiner, Reducer roles. BOTH
- Ecosystem tools: HBase (column DB, real-time), Cassandra (ring, no master), Hive (HQL/SQL), Pig (Pig Latin), SQOOP (RDBMS ↔ HDFS), Flume (streaming → HDFS), Oozie (workflow), Avro (serialization), Mahout (ML). LECTURE
- Block calculation: 500 MB / 64 MB = 7.8125 → 8 blocks. 800 MB / 128 MB = 6.25 → 7 blocks; × replication 3 = 21 total. BOOK
Why big data exists
Start with definitions, the V's, RDBMS limits, and the shift from transaction systems to real-time analytics.
How storage scales
Clusters, sharding, replication, and database models explain how big systems stay available and affordable.
How processing changes
Different architectures and processing models explain when to use batch, streaming, parallel, or distributed execution.
How Hadoop implements it
HDFS stores, MapReduce processes, YARN coordinates, and the ecosystem tools fill specific ingestion, query, workflow, and ML roles.
Best revision order: concept -> storage/distribution -> processing model -> Hadoop implementation.
Big Data and Data Science
A blanket term for data too large and complex to process with a traditional DBMS: structured, unstructured, and semi-structured, arriving at high velocity.
Challenges: capture, curation, storage, search, sharing, transfer, analysis, visualization.
Larger datasets → more insights (spot trends, prevent diseases, combat crime, monitor traffic conditions).
| Attribute | RDBMS | Big Data |
|---|---|---|
| Volume | GB–TB | PB–ZB |
| Organization | Centralized | Distributed |
| Data Type | Structured | Structured + Semi + Unstructured |
| Hardware | High-end | Commodity |
| Updates | R/W many times | Write once, read many (WORM) |
| Schema | Static | Dynamic |
Scaling: Vertical = increase speed/storage/memory of one machine. Horizontal = cluster of commodity resources as single system.
| Data Mining | Big Data |
|---|---|
| Discovering knowledge from datasets | Massive volume (V's) |
| Structured (spreadsheets, RDBMS) | All types + NoSQL |
| High processing costs | Lower cost at scale |
| GB–TB | PB–ZB |
3 V's:
1. Volume (Scale): size of data. TB–EB at rest.
2. Variety (Complexity): structured, semi-structured, unstructured.
3. Velocity (Speed): rate of generation AND processing.
4th V = Veracity: "Data in Doubt" (inconsistency, incompleteness, ambiguity).
Volume
How much data exists. Think TB to ZB scale.
Velocity
How fast data arrives and must be processed.
Variety
Structured, semi-structured, and unstructured formats together.
Veracity
Can you trust the data, or is it noisy, missing, or ambiguous?
10 V's (lecture slide): Volume, Velocity, Variety, Veracity, Value, Validity, Variability, Venue, Vocabulary, Vagueness.
| Type | Description | Examples |
|---|---|---|
| Structured | Relational tables (rows/columns) | Employee records, financial transactions |
| Unstructured | Raw, no RDBMS fit. ~80% of data | Video, audio, images, emails, social media |
| Semi-Structured | Has structure, not relational | JSON, XML |
1. OLTP β Online Transaction Processing (DBMSs, 1968/1970)
2. OLAP β Online Analytical Processing (Data Warehousing, 1983/1990)
3. RTAP β Real-Time Analytics Processing (Big Data, 2000/2010+)
Traditional DW (Exadata, Teradata) not suited. Shared-nothing, MPP, scale-out = ideal.
• Relevant product recommendations
• Counter customer switching in time
• Marketing effectiveness while running
• Fraud prevention as it occurs
• Friend invitations expanding business
Bottleneck: technology (new architectures needed) + skills (experts needed). Lack of software (30%), analytic skills (28%), budget (25%), already using (11%).
| Term | Definition (memorize for MCQs) |
|---|---|
| Data Aggregation | Collecting raw data, transmitting to storage, preprocessing |
| Data Preprocessing | Transform raw data into understandable format; ensure consistency |
| Data Integration | Combining data from different sources → unified view |
| Data Cleaning | Fill missing values, correct errors/inconsistencies, remove redundancy |
| Data Transformation | Transform into format acceptable by big data database |
| Data Reduction | Reduce volume/dimensions without losing integrity |
Chapter 1 – MCQs
Short Answer
Big Data Storage Concepts
Sources (machine/web/audio-video/external) → Hadoop Cluster (online archive, all data types) → Data Warehouse → Ad-hoc queries.
Raw data analyzed using MapReduce, Hadoop's programming paradigm.
Multiple standalone PCs via LANs → single integrated, highly available resource.
Benefits: high availability, load balancing, performance, reliability, fault tolerance, scalable (add/remove nodes), commodity hardware.
| High-Availability | Load-Balancing |
|---|---|
| Minimize downtime, uninterrupted service | Distribute workloads across nodes |
| Eliminate single points of failure | Identical data copies across all nodes |
| Zero transactional data loss | Optimize resources, minimize response time, maximize throughput |
Symmetric: Each node = independent computer running apps. User → any node.
Asymmetric: One head node = gateway. User → head node → workers.
Sharding = partition data into shards across nodes. No two shards of same file on same node. Improves fault tolerance.
Replication = copies of same data across servers. Fault tolerant.
Master-Slave (lecture): Writes → master. Reads → slaves. Master fails → writes stop.
Peer-to-Peer (lecture): All nodes equal. Workload partitioned.
Exam shortcut: different pieces across nodes = sharding. Same piece copied across nodes = replication.
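The shortcut can be made concrete in code. A minimal Python sketch (node names and hash-based placement are invented for illustration, not any real system's API):

```python
NODES = ["node0", "node1", "node2"]  # hypothetical 3-node cluster

def shard(key: str) -> str:
    """Sharding: each piece of data lands on exactly one node."""
    return NODES[hash(key) % len(NODES)]

def replicate(value: str) -> dict:
    """Replication: every node stores a copy of the same piece."""
    return {node: value for node in NODES}

# Sharded: one owner per record. Replicated: a copy on every node.
print(shard("user:42"))
print(replicate("user:42"))
```

Real systems often combine both: shard the keyspace, then replicate each shard.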
Failover = automatic switch to redundant node. Switch Over = requires human intervention.
File system: Organizes data on storage devices, tracks files. File = smallest unit of storage.
RDBMS: Vertical scaling, ACID, static schema. Can't handle big data.
NoSQL: CAP + BASE (eventually consistent). Types: Key-value, Document, Column, Graph.
NewSQL: NoSQL scalability + RDBMS ACID. Shared-nothing, SQL-compliant. 3 layers: admin, transactional, storage. Examples: VoltDB, NuoDB, Clustrix, MemSQL, TokuDB.
| Scale-Up (Vertical) | Scale-Out (Horizontal) |
|---|---|
| Add CPU/RAM/HDD to existing server | Add new nodes to cluster |
| Limited by max server capacity | Unlimited |
| Expensive | Commodity hardware |
| Ex: RAM 32 → 128 GB | Ex: +3 machines |
| Traditional RDBMS | Big data approach |
Chapter 2 – MCQs
Short Answer
Processing & Management Concepts
Collecting, processing, manipulating, managing data → meaningful information.
Stages (lecture diagram): Input (capture/collect/transmit) → Processing (classify/sort/math/transform) → Storage (store/retrieve/archive/govern) → Output (compute/format/present).
Types: Centralized (single location) vs Distributed (across locations).
Shared-Everything: shares ALL resources. Limits scalability.
1. SMP = UMA (Uniform Memory Access). Single shared memory. Bandwidth choking on shared bus.
2. DSM = NUMA (Non-Uniform). Multiple memory pools. Latency depends on distance.
Shared-Nothing: each node = own memory, storage, disks. Infinite scalability. Ideal for web/Internet apps.
Shared-Everything
- All processors share core resources.
- SMP = UMA and DSM = NUMA.
- Simpler to reason about, but scalability is limited.
Shared-Nothing
- Each node keeps its own CPU, memory, and storage.
- Scale-out comes from adding more independent nodes.
- Preferred for big web and Internet-scale applications.
| Type | Key Details (from lecture) |
|---|---|
| Batch | Jobs executed sequentially/parallel → combined output. TB/PB, response time not critical. Hadoop MapReduce. Cost-effective. Ex: payroll, billing, DW, log analysis. |
| Real-Time | Continual data flow. In-memory processing while streaming. Stored to disk AFTER. Low latency. Ex: ATM, POS, fraud detection. |
| Parallel | Subtasks on multiple CPUs, single machine. Shared memory. |
| Distributed | Subtasks on separate networked machines (cluster). |
Platforms (lecture table):
| Platform | Dev | Type |
|---|---|---|
| Storm | Twitter | Streaming |
| S4 | Yahoo | Streaming |
| MillWheel | Google | Streaming |
| Hadoop | Apache | Batch |
| Disco | Nokia | Batch |
Software layer on hardware. Multiple OS on one machine, each independent.
Attributes (lecture):
1. Encapsulation – VM = single file. Dedicated VM per app → no interference.
2. Partitioning – Hardware → logical partitions, each with separate OS.
3. Isolation – VMs isolated. One crash → others unaffected.
Server virtualization (lecture): Server → multiple VMs. CPU/memory virtualized. Enables large-volume Big Data analysis.
Deployment: Public (3rd-party, pay-as-you-go) | Private (owned by company, more secure) | Hybrid (public+private combined).
Services: SaaS (software subscription: Salesforce) | PaaS (dev platform: Heroku) | IaaS (infrastructure: AWS, IBM).
Virtualization types (book): Server, Desktop, Network, Storage, Application.
Chapter 4 – MCQs
Short Answer
Driving Big Data with Hadoop Tools & Technologies
Apache Hadoop = open-source Java framework for processing large data sets in streaming access pattern across clusters in a distributed computing environment.
Can store structured, semi-structured, and unstructured data in a DFS and process in parallel. Highly scalable and cost-effective.
Files are written once and read many times (WORM). Contents cannot be changed.
Architecture β Two layers: 1) HDFS (storage layer) 2) MapReduce engine (processing layer).
Four components: Hadoop Common (utilities), HDFS (distributed storage), YARN (resource negotiation), MapReduce (parallel processing).
| Layer | Components |
|---|---|
| Data Storage | HDFS (distributed file system) + HBase (column-oriented DB) |
| Data Processing | MapReduce (job processing) + YARN (resource allocation, scheduling, monitoring) |
| Data Access | Hive (HQL query), Pig (data analysis scripting), Mahout (ML), Avro (serialization), SQOOP (RDBMS ↔ HDFS transfer) |
| Data Management | Oozie (workflow scheduler), Chukwa (monitoring), Flume (streaming data flow → HDFS), ZooKeeper (coordination service) |
Designed to store large data sets with streaming access pattern on commodity hardware by partitioning data into small chunks.
Key facts:
• Default block size: 64 MB. A 200 MB file → four blocks total: three full 64 MB blocks + one partial 8 MB block.
• Default replication factor: 3 (three copies across different nodes/racks for fault tolerance).
• Reads: parallel. Writes: pipelined (not parallel, to avoid inconsistency).
• NOT suitable for low-latency access.
Features: Cost-effective (open-source + commodity hw), Distributed storage, Data replication (3 copies default).
Lecture example: reading 500 GB with four 100 MB/s I/O channels takes about 21 minutes on one machine; spread across 100 machines and read in parallel, the same data is read in seconds.
Move work close to the data and remove one-node bottlenecks.
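The arithmetic behind that lecture example, as a quick sanity check (decimal units assumed; only the 500 GB, 100 MB/s, and 100-machine figures come from the lecture):

```python
size_mb = 500 * 1000        # 500 GB expressed in MB
rate_mb_s = 4 * 100         # four I/O channels at 100 MB/s each

t_one = size_mb / rate_mb_s     # single machine: 1250 s
t_cluster = t_one / 100         # data spread over 100 machines: 12.5 s

print(f"one machine:  {t_one:.0f} s (~{t_one / 60:.0f} min)")
print(f"100 machines: {t_cluster:.1f} s")
```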
NameNode (Master):
• Manages namespace of entire file system
• Supervises health of DataNodes via Heartbeat signal
• Controls file access. Does NOT hold actual data; holds metadata (which blocks constitute a file + their locations)
• Single point of failure. Needs manual intervention in H1.0
• NOT commodity hardware; must be highly available system
DataNode (Slave):
• Where actual data resides, distributed across cluster
• Files split into blocks of 64 MB by default
• NameNode decides which block goes to which DataNode
• Serves read/write requests from clients
Secondary NameNode: Periodically backs up NameNode RAM data. NOT a substitute for NameNode; acts as recovery mechanism. Runs on separate machine (needs equivalent memory).
Rack: A collection of DataNodes stored at a single location. Replications placed across racks to overcome rack failure (1st replica on one rack, 2nd+3rd on a different rack).
Write: Client → request to NameNode → receives metadata → writes to first DataNode → pipelined replication to other DataNodes → ACKs flow back → Block Received ACK to NameNode.
Read: Client → File Read Request to NameNode → gets metadata (block locations) → parallel read directly from DataNodes.
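The two paths can be mimicked in a toy Python sketch (node names and event strings are invented; a real HDFS client talks to the NameNode over RPC):

```python
def pipelined_write(block, datanodes):
    """Write path: client sends to the first DataNode only; each DataNode
    forwards downstream, then ACKs travel back up the chain."""
    events = [f"client -> {datanodes[0]}: {block}"]
    for src, dst in zip(datanodes, datanodes[1:]):
        events.append(f"{src} -> {dst}: {block}")     # pipelined forward
    for dn in reversed(datanodes):
        events.append(f"{dn}: ACK {block}")           # ACKs flow back
    return events

def parallel_read(block_locations):
    """Read path: client fetches each block directly from a holder."""
    return {blk: f"read from {dns[0]}" for blk, dns in block_locations.items()}

print(pipelined_write("blk_1", ["dn1", "dn2", "dn3"]))
print(parallel_read({"blk_1": ["dn1", "dn2"], "blk_2": ["dn3", "dn1"]}))
```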
Write types (book): Posted (no ACK required) vs Non-posted (ACK required).
MapReduce = batch-processing programming model using divide-and-conquer. Highly scalable, reliable, fault tolerant. Supports only batch workloads.
Key advantage: moves processing TO the data (not data to processing).
Pipeline: Input → Split → Map → Combine (optional) → Partition (optional) → Shuffle & Sort → Reduce → Output
Mapper: Breaks large data into small blocks. Each resolved into key-value pairs (K1,V1). Executes user logic → produces intermediate (K2,V2). Processing done in parallel.
Combiner: Optimizes mapper output before sending to reducer. Essentially a "local reducer": groups repeated keys and lists their values. Reduces overhead of data transfer between mapper and reducer.
Reducer: Processes sorted key-value pairs. Each reducer runs in isolation (no inter-reducer communication). Input sorted by key. Output written back to DFS.
Limitations of MapReduce: reducers depend on mapper output, so the job has a phase barrier and is poor for low-latency or iterative workloads.
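The pipeline above can be simulated in a few lines. A single-process word-count sketch, just to make the map → combine → shuffle/sort → reduce phases concrete (not real Hadoop, which distributes each phase):

```python
from collections import defaultdict
from itertools import groupby

lines = ["big data big", "data is big"]           # toy input splits

# Map: each line -> (word, 1) pairs
mapped = [(w, 1) for line in lines for w in line.split()]

# Combine ("local reducer"): pre-sum repeated keys on the mapper side
combined = defaultdict(int)
for k, v in mapped:
    combined[k] += v

# Shuffle & Sort: group intermediate pairs by key, in sorted order
pairs = sorted(combined.items())

# Reduce: one output per key; reducers run in isolation
result = {k: sum(v for _, v in group)
          for k, group in groupby(pairs, key=lambda kv: kv[0])}
print(result)   # {'big': 3, 'data': 2, 'is': 1}
```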
Master/slave architecture: One JobTracker (master) + multiple TaskTrackers (slaves).
• One JobTracker per cluster. One TaskTracker per slave node.
• JobTracker: accepts job submissions, schedules tasks, monitors slave health, monitors task progress. Single point of failure.
• TaskTracker: executes tasks, sends Heartbeat signal to JobTracker (alive + task status). If Heartbeat not received → assumed dead.
• Communication via Remote Procedure Calls (RPC).
• If a task fails, JobTracker reschedules on same or another node. If a task fails 4 times (default) → job killed.
• If multiple tasks of the same job fail on one TaskTracker → that TaskTracker is blacklisted for that job. If tasks from different jobs fail on it → blacklisted for 24 hours.
| Hadoop 1.0 | Hadoop 2.0 |
|---|---|
| Single NameNode (SPOF) | Active + Standby NameNode |
| Only MapReduce jobs | MapReduce + non-MR apps (Spark, Giraph, Hama) via YARN |
| Batch-only | Batch + real-time / near-real-time |
| JobTracker does scheduling + resource mgmt | YARN splits: ResourceManager + ApplicationMaster |
| NameNode down → Secondary NameNode (recovery only) | NameNode down → Standby NameNode takes over |
Splits JobTracker responsibilities into two daemons: ResourceManager (global resources) + ApplicationMaster (per-job scheduling/monitoring).
Core Components:
1. ResourceManager (RM) – one per cluster. Has: ApplicationsManager (accepts/rejects/monitors apps) + Scheduler (allocates resources).
2. ApplicationMaster – per-application. Negotiates resource containers from RM.
3. NodeManager (NM) – per node. Components: NodeStatusUpdater, ContainerManager, ContainerExecutor, NodeHealthCheckerService, Security.
YARN Schedulers:
| Scheduler | Behavior |
|---|---|
| FIFO | Jobs executed in submission order. No priority. Long jobs block short ones. |
| Capacity | Multiple queues sharing cluster. Each queue gets calculated share. FIFO within queues. Idle resources can be assigned to other jobs. |
| Fair | All apps get fairly equal share of resources. Resources dynamically rebalanced as jobs come/go. Queues have weights (light/heavy). |
YARN Failures:
• ResourceManager failure: Earlier = SPOF (manual restart). Latest = active/passive RM (or ZooKeeper-based failover).
• ApplicationMaster failure: Detected by RM → new container + new AM instance. Recovers state if available externally, else starts from scratch.
• NodeManager failure: Heartbeat stops → RM removes NM from cluster → AM reruns that portion.
• Container failure: AM detects no response → re-executes task. Configurable retry count.
Column-oriented NoSQL DB built on top of HDFS. Horizontally scalable, open-source. Based on Google's Bigtable.
• Supports structured + unstructured data. Provides real-time access to HDFS data.
• Key-value pairs in column-oriented fashion.
• HBase vs Hadoop: Hadoop = flat files, WORM. HBase = read/write many times.
• Features: Automatic failover, Auto sharding, Horizontal scalability (commodity hw), Column-oriented, MapReduce support.
Highly available, linearly scalable, distributed database. Ring architecture: all nodes equal (no master/slave). Data accessed by partition key.
• Data replicated across nodes → highly available. Read/write on any node (coordinator node).
• Eventually consistent. No single point of failure.
• If node goes down → R/W continues on other nodes. Operations queued and updated when failed node recovers.
| Tool | Purpose | Key Detail |
|---|---|---|
| SQOOP | Transfer structured data between RDBMS ↔ HDFS/HBase/Hive | Bidirectional (import/export). Tables read in parallel. Output: text or binary Avro. |
| Flume | Collect streaming data from sources → HDFS/HBase | For sensors, social media, log files. SQOOP = structured; Flume = streaming. |
| Avro | Data serialization framework | Translates data to binary/text for transport. Multi-language (C, C++, Java, Python). Schema in JSON format. |
| Pig | Data analysis using Pig Latin scripting | Handles all data types. Reduces code vs writing Mapper/Reducer. Internally: Parser → Optimizer → Compiler → MapReduce execution. |
| Hive | Distributed data warehouse, HQL (SQL-like) queries | Built on HDFS. DDL like SQL. Data organized into Tables, Partitions, Buckets. Interfaces: ODBC/JDBC. |
| Oozie | Workflow scheduler | 3 job types: Workflow (DAG, on-demand), Coordinator (periodic/data-triggered), Bundle (group of coordinators). Supports MapReduce/Hive/Pig/SQOOP. |
| Mahout | Machine learning platform | Clustering, Classification, Recommendations. Scalable (differentiator vs R). |
| ZooKeeper | Coordination service | Manages coordination between distributed components. Facilitates reachability. |
Hive Layout
Tables are the main container. Partitions cut large tables by partition columns so queries scan only relevant parts. Buckets subdivide partitions by hash to improve manageability and query performance.
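A hedged HiveQL sketch of that layout (table, column, and partition names are invented for illustration):

```sql
-- Partition by a column so queries scan only relevant directories;
-- bucket by hash so each partition splits into a fixed number of files.
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)        -- one directory per date
CLUSTERED BY (id) INTO 4 BUCKETS         -- hash(id) % 4 picks the bucket
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```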
Oozie Workflow
Workflow jobs are DAG-based and on demand. Coordinator jobs run periodically or when input arrives. Bundle jobs group coordinators. Action nodes do work; control nodes decide order.
Pig Latin Example
Useful when you want data analysis without hand-writing mapper and reducer code.
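A classic word-count written in Pig Latin, as a sketch (file paths are placeholders):

```pig
-- Each statement compiles internally to MapReduce stages.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grp    = GROUP words BY word;
counts = FOREACH grp GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_out';
```

Five lines replace a hand-written Mapper and Reducer for the same job.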
Avro Schema Shape
Memorize the high-level fields: type, namespace, name, and fields.
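A minimal example showing those four fields (record and field names are invented), built in Python so the JSON shape is explicit — Avro schemas are plain JSON documents:

```python
import json

schema = {
    "type": "record",              # the schema kind
    "namespace": "example.avro",   # hypothetical package-style namespace
    "name": "Employee",            # hypothetical record name
    "fields": [                    # list of {name, type} field specs
        {"name": "id",   "type": "int"},
        {"name": "dept", "type": "string"},
    ],
}
print(json.dumps(schema, indent=2))
```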
Example 1: File = 500 MB, Block = 64 MB, Replication = 1
500 ÷ 64 = 7.8125 → 8 blocks (round up; last block is partial)
Example 2: File = 800 MB, Block = 128 MB, Replication = 3
800 ÷ 128 = 6.25 → 7 blocks. Size: 6 blocks × 128 = 768 MB. 7th block = 800 − 768 = 32 MB. Total with replication: 7 × 3 = 21 block copies.
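Both worked examples reduce to the same two-step calculation, sketched here as a helper (sizes in MB; `math.ceil` does the round-up):

```python
import math

def hdfs_blocks(file_mb: float, block_mb: float, replication: int = 1):
    """Return (logical blocks, total stored block copies)."""
    blocks = math.ceil(file_mb / block_mb)   # last block may be partial
    return blocks, blocks * replication

print(hdfs_blocks(500, 64))        # (8, 8)
print(hdfs_blocks(800, 128, 3))    # (7, 21)
```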