
Resilient Distributed Datasets

Resilient Distributed Datasets (RDDs) are the fundamental data structure in Apache Spark, providing fault-tolerant, parallel computation on large datasets. RDDs enable efficient distributed data processing while remaining resilient to failures.

Overview

RDDs are immutable, distributed collections of objects that can be processed in parallel. They are designed to optimize large-scale data processing by:

  • Fault Tolerance: Automatically recovering lost data by replaying the recorded lineage of transformations.
  • In-Memory Processing: Storing intermediate results in memory to improve performance.
  • Lazy Evaluation: Transformations are not executed immediately but only when an action is triggered.
  • Partitioning: Data is split across nodes to allow parallel execution (see the sketch after this list).
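
A minimal sketch of partitioning, assuming a local PySpark installation (the master URL and app name below are illustrative, not prescribed by Spark):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-overview")  # illustrative local setup

rdd = sc.parallelize(range(100), 4)  # explicitly request 4 partitions
print(rdd.getNumPartitions())        # 4
print(rdd.glom().collect())          # elements grouped by partition

Each partition can be processed by an independent task, which is what enables parallel execution across the cluster.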

Key Features

  • Immutability: Once created, RDDs cannot be modified; transformations create new RDDs (illustrated in the sketch after this list).
  • Lineage Tracking: Maintains a history of transformations to recompute lost partitions.
  • Lazy Evaluation: Delays execution until an action (e.g., count, collect) is called.
  • Fault Tolerance: Automatically recomputes lost partitions without replicating data.
  • Parallel Computation: Distributes tasks across nodes in a Spark cluster.
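
Reusing the SparkContext sc from the sketch above, a minimal illustration of immutability and lazy evaluation:

nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2)  # returns a new RDD; nums itself is unchanged
# Nothing has executed yet: map is a lazy transformation.
print(doubled.collect())             # the action triggers execution -> [2, 4, 6, 8, 10]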

Creating RDDs

RDDs can be created in two main ways:

  1. From an existing collection (reusing the SparkContext sc created earlier):
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
  2. From an external data source:
rdd = sc.textFile("hdfs://path/to/file.txt")

Transformations and Actions

RDDs support two types of operations:

Transformations (Lazy Evaluation)

Transformations produce new RDDs from existing ones but do not execute immediately (see the sketch after this list):

  • map(func) – Applies a function to each element.
  • filter(func) – Keeps elements that satisfy a condition.
  • flatMap(func) – Similar to map but allows returning multiple values per input.
  • union(rdd) – Merges two RDDs.
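
A short sketch chaining these transformations (again assuming the SparkContext sc from above):

lines = sc.parallelize(["spark builds rdds", "rdds stay lazy"])
tokens = lines.flatMap(lambda line: line.split())  # one line -> many words
short = tokens.filter(lambda w: len(w) <= 4)       # keep only short words
upper = short.map(lambda w: w.upper())             # transform each element
combined = upper.union(tokens)                     # merge two RDDs
# Nothing has run yet: every call above is a lazy transformation.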

Actions (Trigger Execution)

Actions compute and return results or store data (see the sketch after this list):

  • collect() – Returns all elements to the driver.
  • count() – Returns the number of elements in the RDD.
  • reduce(func) – Aggregates elements using a function.
  • saveAsTextFile(path) – Saves the RDD to a storage location.
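
Continuing the transformation sketch above, these actions trigger actual execution:

print(combined.count())    # number of elements
print(combined.collect())  # all elements returned to the driver
total = sc.parallelize([1, 2, 3]).reduce(lambda a, b: a + b)  # 6
# combined.saveAsTextFile("hdfs://path/to/output")  # illustrative path; writes one file per partition

Note that collect() pulls the whole RDD into the driver's memory, so it is only safe on small results.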

RDD Lineage and Fault Tolerance

RDDs achieve fault tolerance through lineage tracking:

  • Instead of replicating data, Spark logs the sequence of transformations.
  • If a node fails, Spark recomputes the lost partitions by replaying the logged transformations (a sketch follows this list).
  • This approach minimizes storage overhead while ensuring reliability.
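
Spark exposes this lineage for inspection; as a small sketch (with sc as before), toDebugString() shows the chain of RDDs that would be replayed after a failure:

rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)
print(rdd.toDebugString().decode())  # prints the RDD's recorded lineage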

Comparison with Other Distributed Data Models

Feature          | RDDs (Spark)                           | MapReduce (Hadoop)                 | DataFrames (Spark)
Data Processing  | In-memory                              | Disk-based                         | Optimized execution plans
Fault Tolerance  | Lineage (recomputes lost data)         | Replication                        | Lineage (like RDDs)
Performance      | Fast (RAM-based)                       | Slow (disk I/O)                    | Faster (columnar storage)
Ease of Use      | Low (requires functional programming)  | Low (requires custom Java/Python)  | High (SQL-like API)

Advantages

  • High Performance: In-memory computation reduces I/O overhead.
  • Scalability: Designed to handle petabyte-scale data.
  • Fault Tolerance: Efficient recovery via lineage tracking.
  • Flexible API: Supports functional programming in Scala, Python, and Java.

Limitations

  • Complex API: Requires functional programming knowledge.
  • High Memory Usage: Inefficient for certain workloads compared to optimized data structures like DataFrames.
  • No Schema Optimization: Unlike DataFrames, RDDs do not optimize queries automatically (a conversion sketch follows this list).
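
When schema-based optimization matters, an RDD can be converted into a DataFrame; a minimal sketch (the SparkSession setup and column names here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
pairs = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
df = spark.createDataFrame(pairs, ["name", "age"])  # attaching a schema lets the optimizer work
df.filter(df.age > 30).show()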

Applications

  • Big Data Processing: Used in large-scale ETL and analytics pipelines.
  • Machine Learning: Supports distributed ML algorithms via MLlib.
  • Graph Processing: Backbone of GraphX for scalable graph analytics.
