The Way of Spark (Advanced Series) — From Beginner to Master, Section 5: The Spark Programming Model (II)



(8)cogroup(otherDataset, [numTasks])

If the input RDDs are of types (K, V) and (K, W), the returned RDD is of type (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)). This operation is equivalent to groupWith.
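To make these semantics concrete, here is a minimal plain-Python sketch (no Spark required); the `cogroup` function and the list-of-pairs representation are illustrative stand-ins, not Spark's API:

```python
def cogroup(left, right):
    """Group values from two (key, value) lists by key, cogroup-style.

    Returns a dict: key -> (values from left, values from right).
    Every key appearing in either input is present in the result.
    """
    result = {}
    for k, v in left:
        result.setdefault(k, ([], []))[0].append(v)
    for k, w in right:
        result.setdefault(k, ([], []))[1].append(w)
    return result

# "a" has values on both sides; "b" and "c" each appear on only one.
cg = cogroup([("a", 1), ("b", 2), ("a", 3)], [("a", "x"), ("c", "y")])
```

Note that keys present in only one input still appear in the result, paired with an empty collection on the other side.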





* Return a new RDD containing the distinct elements in this RDD.


def distinct(): RDD[T]



(4)groupByKey([numTasks])




The input is a dataset of (K, V) pairs and the result is (K, Iterable&lt;V&gt;). The optional numTasks parameter specifies the number of tasks. Shown below is the parameterless groupByKey method:


* Group the values for each key in the RDD into a single sequence. Hash-partitions the

* resulting RDD with the existing partitioner/parallelism level. The ordering of elements

* within each group is not guaranteed, and may even differ each time the resulting RDD is

* evaluated.


* Note: This operation may be very expensive. If you are grouping in order to perform an

* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]

* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.


def groupByKey(): RDD[(K, Iterable[V])]
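The grouping semantics can be sketched in plain Python (a hypothetical helper, not Spark's API), collecting every value for a key into one sequence:

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values for each key into a single list, groupByKey-style.

    This sketch preserves input order within a group; real groupByKey
    makes no ordering guarantee at all.
    """
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

grouped = group_by_key([("a", 1), ("b", 2), ("a", 3)])
```

As the note above warns, when the goal is an aggregate per key, folding values as they arrive (reduceByKey-style) avoids materializing these whole lists.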

When one RDD is unioned with another, the resulting dataset may contain duplicate elements; distinct removes them.
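The union-then-distinct behavior can be sketched in plain Python (an illustrative helper, not Spark's API): union concatenates and keeps duplicates, and distinct deduplicates afterwards:

```python
def union_distinct(rdd_a, rdd_b):
    """Union two datasets, then drop duplicates, like a.union(b).distinct()."""
    combined = rdd_a + rdd_b   # union: plain concatenation, duplicates kept
    seen, out = set(), []
    for x in combined:         # distinct: keep first occurrence of each element
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

deduped = union_distinct([1, 2, 3], [2, 3, 4])
```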


repartition(numPartitions) provides the same functionality as coalesce; internally it simply calls coalesce with shuffle = true, and may therefore incur heavy network overhead.



* Return a new RDD that has exactly numPartitions partitions.


* Can increase or decrease the level of parallelism in this RDD. Internally, this uses

* a shuffle to redistribute data.


* If you are decreasing the number of partitions in this RDD, consider using coalesce,

* which can avoid performing a shuffle.


def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {

coalesce(numPartitions, shuffle = true)

}
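Since repartition is coalesce with shuffle = true, its effect can be sketched in plain Python (a hypothetical model where an RDD's partitions are a list of lists, not Spark's API): a full shuffle redistributes every element by hash, so the partition count can grow as well as shrink:

```python
def repartition(partitions, num_partitions):
    """Redistribute all elements across num_partitions by hash, with a full
    shuffle, mirroring repartition's call to coalesce(n, shuffle = true)."""
    new_parts = [[] for _ in range(num_partitions)]
    for part in partitions:
        for x in part:
            new_parts[hash(x) % num_partitions].append(x)
    return new_parts

# One partition expanded to three: only possible with a shuffle.
parts = repartition([[1, 2, 3, 4, 5, 6]], 3)
```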





(5)reduceByKey(func, [numTasks])

reduceByKey takes a dataset of (K, V) pairs and returns a dataset of (K, V) pairs, where each V is the value aggregated for its key.


* Merge the values for each key using an associative reduce function. This will also perform

* the merging locally on each mapper before sending results to a reducer, similarly to a

* “combiner” in MapReduce. Output will be hash-partitioned with numPartitions partitions.


def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
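The map-side "combiner" behavior described above can be sketched in plain Python (an illustrative model with partitions as lists of pairs, not Spark's API): values are merged locally within each partition before the partial results are merged across partitions:

```python
def reduce_by_key(partitions, func):
    """Merge values per key with an associative func, combining locally on
    each 'mapper' partition first, then merging the partial results."""
    locally_combined = []
    for part in partitions:          # map-side combine within each partition
        local = {}
        for k, v in part:
            local[k] = func(local[k], v) if k in local else v
        locally_combined.append(local)
    merged = {}                      # reducer-side merge of partial results
    for local in locally_combined:
        for k, v in local.items():
            merged[k] = func(merged[k], v) if k in merged else v
    return merged

sums = reduce_by_key([[("a", 1), ("a", 2)], [("a", 3), ("b", 4)]],
                     lambda x, y: x + y)
```

Because each partition ships at most one partial value per key, far less data crosses the network than with groupByKey followed by a reduction.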




* Return a new RDD that is reduced into numPartitions partitions.


* This results in a narrow dependency, e.g. if you go from 1000 partitions

* to 100 partitions, there will not be a shuffle, instead each of the 100

* new partitions will claim 10 of the current partitions.


* However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1,

* this may result in your computation taking place on fewer nodes than

* you like (e.g. one node in the case of numPartitions = 1). To avoid this,

* you can pass shuffle = true. This will add a shuffle step, but means the

* current upstream partitions will be executed in parallel (per whatever

* the current partitioning is).


* Note: With shuffle = true, you can actually coalesce to a larger number

* of partitions. This is useful if you have a small number of partitions,

* say 100, potentially with a few partitions being abnormally large. Calling

* coalesce(1000, shuffle = true) will result in 1000 partitions with the

* data distributed using a hash partitioner.


def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)

: RDD[T]
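The shuffle-free path can be sketched in plain Python (a hypothetical model with partitions as lists of lists, not Spark's API): each new partition claims a contiguous slice of the current ones, which is exactly the narrow dependency the doc comment describes:

```python
def coalesce_narrow(partitions, num_partitions):
    """Shrink the partition list without a shuffle: each new partition
    claims a contiguous group of the current partitions."""
    if num_partitions > len(partitions):
        raise ValueError("growing the partition count requires shuffle=True")
    out = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # old partition i is claimed by new partition i*num // old_count
        out[i * num_partitions // len(partitions)].extend(part)
    return out

merged = coalesce_narrow([[1], [2], [3], [4]], 2)
```

Because elements never leave the slice of old partitions that feeds their new partition, no data moves between nodes; the trade-off is that a drastic coalesce concentrates the computation on few nodes, as the note above explains.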







* For each key k in this or other, return a resulting RDD that contains a tuple with the

* list of values for that key in this as well as other.


def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)

: RDD[(K, (Iterable[V], Iterable[W]))]





* Return the intersection of this RDD and another one. The output will not contain any duplicate

* elements, even if the input RDDs did.


* Note that this method performs a shuffle internally.


def intersection(other: RDD[T]): RDD[T]
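The deduplicating behavior is easy to sketch in plain Python (an illustrative helper, not Spark's API): only elements present in both inputs survive, and duplicates are dropped even when the inputs contained them:

```python
def intersection(rdd_a, rdd_b):
    """Elements present in both datasets, with duplicates removed."""
    return set(rdd_a) & set(rdd_b)

common = intersection([1, 2, 2, 3], [2, 3, 3, 4])
```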






* Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of

* elements (a, b) where a is in this and b is in other.


def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]
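The Cartesian product can be sketched in plain Python (an illustrative helper, not Spark's API); note the result size is |a| × |b|, which grows quickly:

```python
def cartesian(rdd_a, rdd_b):
    """All pairs (a, b) with a from rdd_a and b from rdd_b."""
    return [(a, b) for a in rdd_a for b in rdd_b]

pairs = cartesian([1, 2], ["x", "y"])
```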

(7)join(otherDataset, [numTasks])

For datasets of types (K, V) and (K, W), join returns a dataset of type (K, (V, W)). The join function comes in several variants:

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

def leftOuterJoin[W](

other: RDD[(K, W)],

partitioner: Partitioner): RDD[(K, (V, Option[W]))]

def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)

: RDD[(K, (Option[V], W))]
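The inner and outer variants can be sketched in plain Python (hypothetical helpers, not Spark's API); `None` stands in for Scala's `Option` when a key is missing on one side:

```python
def join(left, right):
    """Inner join: only keys present in both sides appear in the result."""
    rv = {}
    for k, w in right:
        rv.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in left for w in rv.get(k, [])]

def left_outer_join(left, right):
    """Left outer join: every key in left appears; a missing right value
    becomes None (standing in for Option[W])."""
    rv = {}
    for k, w in right:
        rv.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in left for w in rv.get(k, [None])]

inner = join([("a", 1), ("b", 2)], [("a", "x")])
outer = left_outer_join([("a", 1), ("b", 2)], [("a", "x")])
```

rightOuterJoin is symmetric: every key of the right side appears, with None on the left when the key is absent there.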