Apache Spark RDD Operation

Apache Spark RDD Operation refers to the various transformations and actions that can be applied to Resilient Distributed Datasets (RDDs) in Apache Spark. RDD operations enable efficient parallel processing of large datasets in a distributed environment.

Types of RDD Operations

RDD operations are classified into two types:

  • Transformations: Lazy operations that return a new RDD without immediate execution.
  • Actions: Operations that trigger computation and return results or store data.
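
The difference is easiest to see in a short PySpark snippet. The sketch below is illustrative only: it creates a local SparkContext and assigns it to a variable named sparkContext, which is the name assumed by all examples on this page (the application name is arbitrary).

from pyspark import SparkContext

sparkContext = SparkContext("local[*]", "rdd-operation-examples")  # local context used by the examples below

rdd = sparkContext.parallelize([1, 2, 3, 4])   # source RDD
doubled = rdd.map(lambda x: x * 2)             # transformation: lazy, returns a new RDD
print(doubled.collect())                       # action: triggers execution and returns [2, 4, 6, 8]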

Transformations

Transformations are applied to RDDs to produce new RDDs. They are lazy, meaning they are not executed until an action is performed.

Common Transformations

map(func) – Applies a function to each element in the RDD.
rdd = sparkContext.parallelize([1, 2, 3, 4])
mapped_rdd = rdd.map(lambda x: x * 2)
print(mapped_rdd.collect())
Result: `[2, 4, 6, 8]`

filter(func) – Retains elements that satisfy a condition.
rdd = sparkContext.parallelize([1, 2, 3, 4, 5, 6])
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
print(filtered_rdd.collect())
Result: `[2, 4, 6]`

flatMap(func) – Similar to map but allows multiple outputs per input.
rdd = sparkContext.parallelize(["hello world", "spark rdd"])
flat_mapped_rdd = rdd.flatMap(lambda line: line.split(" "))
print(flat_mapped_rdd.collect())
Result: `["hello", "world", "spark", "rdd"]`

union(rdd) – Merges two RDDs.
rdd1 = sparkContext.parallelize([1, 2, 3])
rdd2 = sparkContext.parallelize([4, 5, 6])
union_rdd = rdd1.union(rdd2)
print(union_rdd.collect())
Result: `[1, 2, 3, 4, 5, 6]`

distinct() – Removes duplicate elements.
rdd = sparkContext.parallelize([1, 2, 2, 3, 4, 4, 5])
distinct_rdd = rdd.distinct()
print(distinct_rdd.collect())
Result: `[1, 2, 3, 4, 5]`

groupByKey() – Groups data by key (for (K, V) pairs).
rdd = sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped_rdd = rdd.groupByKey().mapValues(list)
print(grouped_rdd.collect())
Result: `[('a', [1, 3]), ('b', [2])]`

reduceByKey(func) – Merges values for each key using a function.
rdd = sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
print(reduced_rdd.collect())
Result: `[('a', 4), ('b', 2)]`

sortByKey() – Sorts the RDD by key.
rdd = sparkContext.parallelize([("b", 2), ("a", 1), ("c", 3)])
sorted_rdd = rdd.sortByKey()
print(sorted_rdd.collect())
Result: `[('a', 1), ('b', 2), ('c', 3)]`
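
Because transformations are lazy, they can be chained freely and nothing runs until an action is called. A minimal word-count sketch combining flatMap, map, and reduceByKey (using the sparkContext variable introduced above; the input strings are arbitrary examples):

lines = sparkContext.parallelize(["spark rdd spark", "rdd operation"])
words = lines.flatMap(lambda line: line.split(" "))   # split each line into words
pairs = words.map(lambda word: (word, 1))             # build (word, 1) pairs
counts = pairs.reduceByKey(lambda x, y: x + y)        # sum the counts per word
print(counts.collect())                               # e.g. [('spark', 2), ('rdd', 2), ('operation', 1)]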

Actions

Actions compute and return results or store RDD data. They trigger the execution of all previous transformations.

Common Actions

collect() – Returns all elements of the RDD to the driver.
rdd = sparkContext.parallelize([1, 2, 3, 4])
print(rdd.collect())
Result: `[1, 2, 3, 4]`

count() – Returns the number of elements.
rdd = sparkContext.parallelize([1, 2, 3, 4])
print(rdd.count())
Result: `4`

reduce(func) – Aggregates elements using a binary function.
rdd = sparkContext.parallelize([1, 2, 3, 4])
sum_result = rdd.reduce(lambda x, y: x + y)
print(sum_result)
Result: `10`

first() – Returns the first element.
rdd = sparkContext.parallelize([10, 20, 30])
print(rdd.first())
Result: `10`

take(n) – Returns the first n elements.
rdd = sparkContext.parallelize([10, 20, 30, 40])
print(rdd.take(2))
Result: `[10, 20]`

foreach(func) – Applies a function to each element.
rdd = sparkContext.parallelize([1, 2, 3])
rdd.foreach(lambda x: print(x))
Result: `1, 2, 3` (printed output; on a cluster this appears on the executors, not on the driver)
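
Besides returning results to the driver, actions can write RDD data to storage. A hedged sketch using saveAsTextFile (the output path /tmp/rdd-output is an arbitrary example and must not already exist):

rdd = sparkContext.parallelize([1, 2, 3, 4])
rdd.saveAsTextFile("/tmp/rdd-output")   # writes one part-NNNNN text file per partition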

Lazy Evaluation in RDDs

RDD transformations are lazy, meaning they do not execute immediately. Instead, Spark builds a DAG (Directed Acyclic Graph) representing operations, which is only executed when an action is called.

Example:

rdd = sparkContext.textFile("data.txt")  # No execution yet
words = rdd.flatMap(lambda line: line.split())  # Still not executed
word_count = words.count()  # Now execution starts

Persistence and Caching

RDDs can be cached in memory to speed up iterative computations:

  • cache() – Stores the RDD in memory.
  • persist(storage_level) – Stores the RDD using different storage levels (e.g., memory, disk).

Example:

rdd = sparkContext.textFile("data.txt").cache()
print(rdd.count())  # RDD is now cached in memory
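
The persist(storage_level) variant controls where the data is kept. A sketch using MEMORY_AND_DISK, which spills partitions that do not fit in memory to disk; unpersist() releases the cached data:

from pyspark import StorageLevel

rdd = sparkContext.textFile("data.txt")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions in memory, spill the rest to disk
print(rdd.count())                          # first action materializes and caches the RDD
print(rdd.count())                          # subsequent actions reuse the cached partitions
rdd.unpersist()                             # release the cached data when it is no longer needed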

Comparison with DataFrames

  • Abstraction level: RDD – low-level distributed collection of objects; DataFrame – high-level, table-like structure.
  • Performance: RDD – generally slower, with no built-in query optimizer; DataFrame – faster, optimized by the Catalyst query optimizer.
  • Storage format: RDD – unstructured; DataFrame – schema-based.
  • Ease of use: RDD – requires functional transformations; DataFrame – SQL-like API.
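
The two abstractions interoperate: an RDD of tuples can be turned into a DataFrame by supplying column names, and a DataFrame exposes its underlying RDD of Row objects via .rdd. A sketch assuming an active SparkSession named spark (the application name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
df = spark.createDataFrame(rdd, ["key", "value"])   # schema-based DataFrame, optimized by Catalyst
df.show()

rows_rdd = df.rdd                                    # back to the low-level RDD API (an RDD of Row objects)
print(rows_rdd.collect())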

Advantages of RDDs

  • Fault Tolerant: Uses lineage to recompute lost partitions (see the lineage sketch after this list).
  • Parallel Execution: Automatically distributes computations across nodes.
  • Immutable and Lazy Evaluation: Optimizes execution by avoiding unnecessary computations.
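
The lineage that makes recomputation possible can be inspected with toDebugString(), which prints the chain of transformations Spark would replay to rebuild a lost partition. A sketch (the pipeline is an arbitrary example; in PySpark the method returns bytes, hence the decode):

rdd = sparkContext.parallelize([1, 2, 3, 4])
pipeline = rdd.map(lambda x: x * 2).filter(lambda x: x > 2)
print(pipeline.toDebugString().decode("utf-8"))   # shows the RDD's lineage (dependency) graph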

Limitations of RDDs

  • Higher Memory Usage: No schema-based optimizations.
  • Verbose API: Requires functional programming.
  • Less Optimized than DataFrames: Lacks query optimizations found in Spark DataFrames.

Applications

  • Processing large-scale unstructured data.
  • Complex transformations requiring fine-grained control.
  • Iterative machine learning computations.
