Resilient Distributed Datasets (RDDs) are the fundamental data abstraction in Apache Spark, providing fault-tolerant, parallel computation on large datasets. RDDs enable efficient distributed data processing while remaining resilient to failures.
Overview
RDDs are immutable, distributed collections of objects that can be processed in parallel. They are designed to optimize large-scale data processing by:
- Fault Tolerance: Automatically recovering lost data using lineage (recomputing from original data).
- In-Memory Processing: Storing intermediate results in memory to improve performance.
- Lazy Evaluation: Transformations are not executed immediately but only when an action is triggered.
- Partitioning: Data is split across nodes to allow parallel execution.
Key Features
- Immutability: Once created, RDDs cannot be modified; transformations create new RDDs.
- Lineage Tracking: Maintains a history of transformations to recompute lost partitions.
- Lazy Evaluation: Delays execution until an action (e.g., count, collect) is called.
- Fault Tolerance: Automatically recomputes lost partitions without replicating data.
- Parallel Computation: Distributes tasks across nodes in a Spark cluster.
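A minimal sketch of immutability and lazy evaluation in PySpark, assuming a SparkContext named `sc` is already available (one is created under Creating RDDs below):

```python
nums = sc.parallelize([1, 2, 3])
doubled = nums.map(lambda x: x * 2)  # returns a new RDD; nothing has executed yet
print(nums.collect())     # [1, 2, 3] -- the original RDD is unchanged
print(doubled.collect())  # [2, 4, 6] -- collect() is an action and triggers the job
```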
Creating RDDs
RDDs can be created in two main ways:
- From an existing collection:
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")  # local mode; app name is illustrative
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
```
- From an external data source:
```python
rdd = sc.textFile("hdfs://path/to/file.txt")
```
Transformations and Actions
RDDs support two types of operations:
Transformations (Lazy Evaluation)
Transformations produce new RDDs from existing ones but do not execute immediately (see the sketch after this list):
- map(func) – Applies a function to each element.
- filter(func) – Keeps elements that satisfy a condition.
- flatMap(func) – Similar to map but allows returning multiple values per input.
- union(rdd) – Merges two RDDs.
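A short sketch of these transformations, reusing the `sc` and `rdd` from the Creating RDDs section; each call returns a new RDD, and nothing executes yet:

```python
squared = rdd.map(lambda x: x * x)            # [1, 4, 9, 16, 25] once computed
evens = squared.filter(lambda x: x % 2 == 0)  # keeps 4 and 16
words = sc.parallelize(["hello world", "hi"]).flatMap(lambda s: s.split())
merged = evens.union(squared)                 # concatenates both RDDs
# No Spark job has run yet: only the lineage graph has been recorded.
```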
Actions (Trigger Execution)
Actions compute and return results or store data (a short example follows the list):
- collect() – Returns all elements to the driver.
- count() – Returns the number of elements in the RDD.
- reduce(func) – Aggregates elements using a function.
- saveAsTextFile(path) – Saves the RDD to a storage location.
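Calling an action on the RDDs from the previous sketch triggers the whole pipeline; the output path here is just a placeholder:

```python
print(evens.count())                       # 2
print(evens.collect())                     # [4, 16]
print(squared.reduce(lambda a, b: a + b))  # 55 = 1 + 4 + 9 + 16 + 25
evens.saveAsTextFile("/tmp/even-squares")  # one output part-file per partition
```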
RDD Lineage and Fault Tolerance
RDDs achieve fault tolerance through lineage tracking:
- Instead of replicating data, Spark logs the sequence of transformations.
- If a node fails, Spark recomputes lost partitions from the original dataset.
- This approach minimizes storage overhead while ensuring reliability.
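The recorded lineage can be inspected with `toDebugString()`; a small sketch reusing the `evens` RDD from above:

```python
# Each line of the debug string is one step in the lineage chain that Spark
# replays to rebuild a lost partition; in PySpark it is returned as bytes.
print(evens.toDebugString().decode("utf-8"))
```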
Comparison with Other Distributed Data Models
| Feature | RDDs (Spark) | MapReduce (Hadoop) | DataFrames (Spark) |
|---|---|---|---|
| Data Processing | In-memory | Disk-based | In-memory with optimized execution plans |
| Fault Tolerance | Lineage (recomputes lost data) | Replication (HDFS) | Lineage (like RDDs) |
| Performance | Fast (RAM-based) | Slower (disk I/O between stages) | Often faster (Catalyst optimizer, Tungsten) |
| Ease of Use | Lower (requires functional programming) | Low (requires custom Java/Python) | High (SQL-like API) |
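To make the ease-of-use row concrete, here is a hedged sketch of the same per-key sum in both Spark APIs (the SparkSession setup, sample data, and column names are illustrative assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# RDD API: hand-written functional aggregation, opaque to the optimizer.
rdd_sums = pairs.reduceByKey(lambda x, y: x + y).collect()  # [('a', 4), ('b', 2)], order not guaranteed

# DataFrame API: declarative query that Catalyst can plan and optimize.
df = spark.createDataFrame(pairs, ["key", "value"])
df_sums = df.groupBy("key").sum("value").collect()
```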
Advantages
- High Performance: In-memory computation reduces I/O overhead.
- Scalability: Designed to handle petabyte-scale data.
- Fault Tolerance: Efficient recovery via lineage tracking.
- Flexible API: Supports functional programming in Scala, Java, and Python.
Limitations
- Complex API: Requires functional programming knowledge.
- High Memory Usage: Stores data as deserialized JVM objects, which is less memory-efficient than the compact binary format used by DataFrames.
- No Schema Optimization: RDDs carry no schema, so Spark's Catalyst optimizer cannot automatically optimize their operations as it does for DataFrames.
Applications
- Big Data Processing: Used in large-scale ETL and analytics pipelines.
- Machine Learning: Supports distributed ML algorithms via MLlib.
- Graph Processing: Backbone of GraphX for scalable graph analytics.