public class PairRDDFunctions<K,V> extends Object implements Logging, scala.Serializable

Extra functions available on RDDs of (key, value) pairs through an implicit conversion. Import org.apache.spark.SparkContext._ at the top of your program to use these functions.

Constructor:
PairRDDFunctions(RDD<scala.Tuple2<K,V>> self, scala.reflect.ClassTag<K> kt, scala.reflect.ClassTag<V> vt, scala.math.Ordering<K> ord)
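For example, the functions below become available on any RDD of pairs once the implicit conversion is imported. A minimal, self-contained sketch (app name, master URL, and sample data are illustrative, not from the original docs):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // brings the implicit conversion to PairRDDFunctions into scope

object PairRDDExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pairs-demo").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // reduceByKey is defined on PairRDDFunctions, reached via the implicit conversion
    val sums = pairs.reduceByKey(_ + _)
    sums.collect().foreach(println) // (a,4), (b,2)
    sc.stop()
  }
}
```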
Method Summary:
<U> RDD<scala.Tuple2<K,U>> aggregateByKey(U zeroValue, scala.Function2<U,V,U> seqOp, scala.Function2<U,U,U> combOp, scala.reflect.ClassTag<U> evidence$3)
    Aggregate the values of each key, using given combine functions and a neutral "zero value".
<U> RDD<scala.Tuple2<K,U>> aggregateByKey(U zeroValue, int numPartitions, scala.Function2<U,V,U> seqOp, scala.Function2<U,U,U> combOp, scala.reflect.ClassTag<U> evidence$2)
    Aggregate the values of each key, using given combine functions and a neutral "zero value".
<U> RDD<scala.Tuple2<K,U>> aggregateByKey(U zeroValue, Partitioner partitioner, scala.Function2<U,V,U> seqOp, scala.Function2<U,U,U> combOp, scala.reflect.ClassTag<U> evidence$1)
    Aggregate the values of each key, using given combine functions and a neutral "zero value".
<W> RDD<scala.Tuple2<K,scala.Tuple2<scala.collection.Iterable<V>,scala.collection.Iterable<W>>>> cogroup(RDD<scala.Tuple2<K,W>> other)
    For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the list of values for that key in `this` as well as `other`.
<W> RDD<scala.Tuple2<K,scala.Tuple2<scala.collection.Iterable<V>,scala.collection.Iterable<W>>>> cogroup(RDD<scala.Tuple2<K,W>> other, int numPartitions)
    For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the list of values for that key in `this` as well as `other`.
<W> RDD<scala.Tuple2<K,scala.Tuple2<scala.collection.Iterable<V>,scala.collection.Iterable<W>>>> cogroup(RDD<scala.Tuple2<K,W>> other, Partitioner partitioner)
<W1,W2> RDD<scala.Tuple2<K,scala.Tuple3<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2)
    For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1` and `other2`.
<W1,W2> RDD<scala.Tuple2<K,scala.Tuple3<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, int numPartitions)
    For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1` and `other2`.
<W1,W2> RDD<scala.Tuple2<K,scala.Tuple3<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, Partitioner partitioner)
<W1,W2,W3> RDD<scala.Tuple2<K,scala.Tuple4<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>,scala.collection.Iterable<W3>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, RDD<scala.Tuple2<K,W3>> other3)
<W1,W2,W3> RDD<scala.Tuple2<K,scala.Tuple4<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>,scala.collection.Iterable<W3>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, RDD<scala.Tuple2<K,W3>> other3, int numPartitions)
    For each key k in `this` or `other1` or `other2` or `other3`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1`, `other2` and `other3`.
<W1,W2,W3> RDD<scala.Tuple2<K,scala.Tuple4<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>,scala.collection.Iterable<W3>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, RDD<scala.Tuple2<K,W3>> other3, Partitioner partitioner)
    For each key k in `this` or `other1` or `other2` or `other3`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1`, `other2` and `other3`.
scala.collection.Map<K,V> collectAsMap()
    Return the key-value pairs in this RDD to the master as a Map.
<C> RDD<scala.Tuple2<K,C>> combineByKey(scala.Function1<V,C> createCombiner, scala.Function2<C,V,C> mergeValue, scala.Function2<C,C,C> mergeCombiners)
    Simplified version of combineByKey that hash-partitions the resulting RDD using the existing partitioner/parallelism level.
<C> RDD<scala.Tuple2<K,C>> combineByKey(scala.Function1<V,C> createCombiner, scala.Function2<C,V,C> mergeValue, scala.Function2<C,C,C> mergeCombiners, int numPartitions)
    Simplified version of combineByKey that hash-partitions the output RDD.
<C> RDD<scala.Tuple2<K,C>> combineByKey(scala.Function1<V,C> createCombiner, scala.Function2<C,V,C> mergeValue, scala.Function2<C,C,C> mergeCombiners, Partitioner partitioner, boolean mapSideCombine, Serializer serializer)
    Generic function to combine the elements for each key using a custom set of aggregation functions.
RDD<scala.Tuple2<K,Object>> countApproxDistinctByKey(double relativeSD)
    Return approximate number of distinct values for each key in this RDD.
RDD<scala.Tuple2<K,Object>> countApproxDistinctByKey(double relativeSD, int numPartitions)
    Return approximate number of distinct values for each key in this RDD.
RDD<scala.Tuple2<K,Object>> countApproxDistinctByKey(double relativeSD, Partitioner partitioner)
    Return approximate number of distinct values for each key in this RDD.
RDD<scala.Tuple2<K,Object>> countApproxDistinctByKey(int p, int sp, Partitioner partitioner)
    :: Experimental :: Return approximate number of distinct values for each key in this RDD.
scala.collection.Map<K,Object> countByKey()
    Count the number of elements for each key, and return the result to the master as a Map.
PartialResult<scala.collection.Map<K,BoundedDouble>> countByKeyApprox(long timeout, double confidence)
    :: Experimental :: Approximate version of countByKey that can return a partial result if it does not finish within a timeout.
<U> RDD<scala.Tuple2<K,U>> flatMapValues(scala.Function1<V,scala.collection.TraversableOnce<U>> f)
    Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.
RDD<scala.Tuple2<K,V>> foldByKey(V zeroValue, scala.Function2<V,V,V> func)
    Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication).
RDD<scala.Tuple2<K,V>> foldByKey(V zeroValue, int numPartitions, scala.Function2<V,V,V> func)
    Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication).
RDD<scala.Tuple2<K,V>> foldByKey(V zeroValue, Partitioner partitioner, scala.Function2<V,V,V> func)
    Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication).
RDD<scala.Tuple2<K,scala.collection.Iterable<V>>> groupByKey()
    Group the values for each key in the RDD into a single sequence.
RDD<scala.Tuple2<K,scala.collection.Iterable<V>>> groupByKey(int numPartitions)
    Group the values for each key in the RDD into a single sequence.
RDD<scala.Tuple2<K,scala.collection.Iterable<V>>> groupByKey(Partitioner partitioner)
    Group the values for each key in the RDD into a single sequence.
<W> RDD<scala.Tuple2<K,scala.Tuple2<scala.collection.Iterable<V>,scala.collection.Iterable<W>>>> groupWith(RDD<scala.Tuple2<K,W>> other)
    Alias for cogroup.
<W1,W2> RDD<scala.Tuple2<K,scala.Tuple3<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>>>> groupWith(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2)
    Alias for cogroup.
<W1,W2,W3> RDD<scala.Tuple2<K,scala.Tuple4<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>,scala.collection.Iterable<W3>>>> groupWith(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, RDD<scala.Tuple2<K,W3>> other3)
    Alias for cogroup.
<W> RDD<scala.Tuple2<K,scala.Tuple2<V,W>>> join(RDD<scala.Tuple2<K,W>> other)
    Return an RDD containing all pairs of elements with matching keys in `this` and `other`.
<W> RDD<scala.Tuple2<K,scala.Tuple2<V,W>>> join(RDD<scala.Tuple2<K,W>> other, int numPartitions)
    Return an RDD containing all pairs of elements with matching keys in `this` and `other`.
<W> RDD<scala.Tuple2<K,scala.Tuple2<V,W>>> join(RDD<scala.Tuple2<K,W>> other, Partitioner partitioner)
    Return an RDD containing all pairs of elements with matching keys in `this` and `other`.
RDD<K> keys()
    Return an RDD with the keys of each tuple.
<W> RDD<scala.Tuple2<K,scala.Tuple2<V,scala.Option<W>>>> leftOuterJoin(RDD<scala.Tuple2<K,W>> other)
    Perform a left outer join of `this` and `other`.
<W> RDD<scala.Tuple2<K,scala.Tuple2<V,scala.Option<W>>>> leftOuterJoin(RDD<scala.Tuple2<K,W>> other, int numPartitions)
    Perform a left outer join of `this` and `other`.
<W> RDD<scala.Tuple2<K,scala.Tuple2<V,scala.Option<W>>>> leftOuterJoin(RDD<scala.Tuple2<K,W>> other, Partitioner partitioner)
    Perform a left outer join of `this` and `other`.
scala.collection.Seq<V> lookup(K key)
    Return the list of values in the RDD for key `key`.
<U> RDD<scala.Tuple2<K,U>> mapValues(scala.Function1<V,U> f)
    Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
RDD<scala.Tuple2<K,V>> partitionBy(Partitioner partitioner)
    Return a copy of the RDD partitioned using the specified partitioner.
RDD<scala.Tuple2<K,V>> reduceByKey(scala.Function2<V,V,V> func)
    Merge the values for each key using an associative reduce function.
RDD<scala.Tuple2<K,V>> reduceByKey(scala.Function2<V,V,V> func, int numPartitions)
    Merge the values for each key using an associative reduce function.
RDD<scala.Tuple2<K,V>> reduceByKey(Partitioner partitioner, scala.Function2<V,V,V> func)
    Merge the values for each key using an associative reduce function.
scala.collection.Map<K,V> reduceByKeyLocally(scala.Function2<V,V,V> func)
    Merge the values for each key using an associative reduce function, but return the results immediately to the master as a Map.
scala.collection.Map<K,V> reduceByKeyToDriver(scala.Function2<V,V,V> func)
    Alias for reduceByKeyLocally.
<W> RDD<scala.Tuple2<K,scala.Tuple2<scala.Option<V>,W>>> rightOuterJoin(RDD<scala.Tuple2<K,W>> other)
    Perform a right outer join of `this` and `other`.
<W> RDD<scala.Tuple2<K,scala.Tuple2<scala.Option<V>,W>>> rightOuterJoin(RDD<scala.Tuple2<K,W>> other, int numPartitions)
    Perform a right outer join of `this` and `other`.
<W> RDD<scala.Tuple2<K,scala.Tuple2<scala.Option<V>,W>>> rightOuterJoin(RDD<scala.Tuple2<K,W>> other, Partitioner partitioner)
    Perform a right outer join of `this` and `other`.
RDD<scala.Tuple2<K,V>> sampleByKey(boolean withReplacement, scala.collection.Map<K,Object> fractions, long seed)
    Return a subset of this RDD sampled by key (via stratified sampling).
RDD<scala.Tuple2<K,V>> sampleByKeyExact(boolean withReplacement, scala.collection.Map<K,Object> fractions, long seed)
    :: Experimental :: Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
void saveAsHadoopDataset(org.apache.hadoop.mapred.JobConf conf)
    Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for that storage system.
void saveAsHadoopFile(String path, Class<?> keyClass, Class<?> valueClass, Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass, Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.
void saveAsHadoopFile(String path, Class<?> keyClass, Class<?> valueClass, Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass, org.apache.hadoop.mapred.JobConf conf, scala.Option<Class<? extends org.apache.hadoop.io.compress.CompressionCodec>> codec)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.
<F extends org.apache.hadoop.mapred.OutputFormat<K,V>> void saveAsHadoopFile(String path, Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec, scala.reflect.ClassTag<F> fm)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.
<F extends org.apache.hadoop.mapred.OutputFormat<K,V>> void saveAsHadoopFile(String path, scala.reflect.ClassTag<F> fm)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.
void saveAsNewAPIHadoopDataset(org.apache.hadoop.conf.Configuration conf)
    Output the RDD to any Hadoop-supported storage system with new Hadoop API, using a Hadoop Configuration object for that storage system.
void saveAsNewAPIHadoopFile(String path, Class<?> keyClass, Class<?> valueClass, Class<? extends org.apache.hadoop.mapreduce.OutputFormat<?,?>> outputFormatClass, org.apache.hadoop.conf.Configuration conf)
    Output the RDD to any Hadoop-supported file system, using a new Hadoop API OutputFormat (mapreduce.OutputFormat) object supporting the key and value types K and V in this RDD.
<F extends org.apache.hadoop.mapreduce.OutputFormat<K,V>> void saveAsNewAPIHadoopFile(String path, scala.reflect.ClassTag<F> fm)
    Output the RDD to any Hadoop-supported file system, using a new Hadoop API OutputFormat (mapreduce.OutputFormat) object supporting the key and value types K and V in this RDD.
<W> RDD<scala.Tuple2<K,V>> subtractByKey(RDD<scala.Tuple2<K,W>> other, scala.reflect.ClassTag<W> evidence$4)
    Return an RDD with the pairs from `this` whose keys are not in `other`.
<W> RDD<scala.Tuple2<K,V>> subtractByKey(RDD<scala.Tuple2<K,W>> other, int numPartitions, scala.reflect.ClassTag<W> evidence$5)
    Return an RDD with the pairs from `this` whose keys are not in `other`.
<W> RDD<scala.Tuple2<K,V>> subtractByKey(RDD<scala.Tuple2<K,W>> other, Partitioner p, scala.reflect.ClassTag<W> evidence$6)
    Return an RDD with the pairs from `this` whose keys are not in `other`.
RDD<V> values()
    Return an RDD with the values of each tuple.
Methods inherited from class Object:
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface Logging:
initialized, initializeIfNecessary, initializeLogging, initLock, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
Method Detail:

public <C> RDD<scala.Tuple2<K,C>> combineByKey(scala.Function1<V,C> createCombiner, scala.Function2<C,V,C> mergeValue, scala.Function2<C,C,C> mergeCombiners, Partitioner partitioner, boolean mapSideCombine, Serializer serializer)
    Generic function to combine the elements for each key using a custom set of aggregation functions. Users provide three functions:
    - createCombiner, which turns a V into a C (e.g., creates a one-element list)
    - mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
    - mergeCombiners, to combine two C's into a single one.
    In addition, users can control the partitioning of the output RDD, and whether to perform map-side aggregation (if a mapper can produce multiple items with the same key).
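As an illustration (not from the original docs; assumes a SparkContext `sc` as in the earlier sketch), a common use is a per-key average, where the combined type C is a (sum, count) pair:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 4)))
val sumCount = pairs.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: V => C
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: merge a V into a C
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge two C's
)
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
averages.collect() // Array((a,2.0), (b,4.0))
```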
public <C> RDD<scala.Tuple2<K,C>> combineByKey(scala.Function1<V,C> createCombiner, scala.Function2<C,V,C> mergeValue, scala.Function2<C,C,C> mergeCombiners, int numPartitions)
public <U> RDD<scala.Tuple2<K,U>> aggregateByKey(U zeroValue, Partitioner partitioner, scala.Function2<U,V,U> seqOp, scala.Function2<U,U,U> combOp, scala.reflect.ClassTag<U> evidence$1)
public <U> RDD<scala.Tuple2<K,U>> aggregateByKey(U zeroValue, int numPartitions, scala.Function2<U,V,U> seqOp, scala.Function2<U,U,U> combOp, scala.reflect.ClassTag<U> evidence$2)
public <U> RDD<scala.Tuple2<K,U>> aggregateByKey(U zeroValue, scala.Function2<U,V,U> seqOp, scala.Function2<U,U,U> combOp, scala.reflect.ClassTag<U> evidence$3)
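In the Scala API the zero value and the two functions are passed in separate argument lists. A sketch (sample data illustrative; assumes `sc`) computing a per-key maximum, with Int.MinValue as the neutral zero:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 4)))
// seqOp merges a value into the accumulator within a partition;
// combOp merges accumulators across partitions.
val maxPerKey = pairs.aggregateByKey(Int.MinValue)(
  (acc, v) => math.max(acc, v),
  (a, b) => math.max(a, b))
maxPerKey.collect() // Array((a,3), (b,4))
```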
public RDD<scala.Tuple2<K,V>> foldByKey(V zeroValue, Partitioner partitioner, scala.Function2<V,V,V> func)
public RDD<scala.Tuple2<K,V>> foldByKey(V zeroValue, int numPartitions, scala.Function2<V,V,V> func)
public RDD<scala.Tuple2<K,V>> foldByKey(V zeroValue, scala.Function2<V,V,V> func)
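A short sketch (sample data illustrative; assumes `sc`) summing values per key, with 0 as the zero value for addition:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 4)))
pairs.foldByKey(0)(_ + _).collect() // Array((a,4), (b,4))
```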
public RDD<scala.Tuple2<K,V>> sampleByKey(boolean withReplacement, scala.collection.Map<K,Object> fractions, long seed)
    Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map, via simple random sampling with one pass over the RDD, to produce a sample of size that's approximately equal to the sum of math.ceil(numItems * samplingRate) over all key values.
    Parameters:
    withReplacement - whether to sample with or without replacement
    fractions - map of specific keys to sampling rates
    seed - seed for the random number generator

public RDD<scala.Tuple2<K,V>> sampleByKeyExact(boolean withReplacement, scala.collection.Map<K,Object> fractions, long seed)
    This method differs from sampleByKey in that we make additional passes over the RDD to create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate) over all key values with 99.99% confidence. When sampling without replacement, we need one additional pass over the RDD to guarantee sample size; when sampling with replacement, we need two additional passes.
    Parameters:
    withReplacement - whether to sample with or without replacement
    fractions - map of specific keys to sampling rates
    seed - seed for the random number generator
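A sketch of stratified sampling (keys, fractions, and seed are illustrative; assumes `sc`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))
val fractions = Map("a" -> 0.5, "b" -> 1.0) // per-key sampling rates
val approx = pairs.sampleByKey(withReplacement = false, fractions, seed = 42L)      // one pass, approximate sizes
val exact  = pairs.sampleByKeyExact(withReplacement = false, fractions, seed = 42L) // extra passes, exact sizes
```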
public RDD<scala.Tuple2<K,V>> reduceByKey(Partitioner partitioner, scala.Function2<V,V,V> func)

public RDD<scala.Tuple2<K,V>> reduceByKey(scala.Function2<V,V,V> func, int numPartitions)
public RDD<scala.Tuple2<K,V>> reduceByKey(scala.Function2<V,V,V> func)
public scala.collection.Map<K,V> reduceByKeyLocally(scala.Function2<V,V,V> func)
public scala.collection.Map<K,V> reduceByKeyToDriver(scala.Function2<V,V,V> func)
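A word-count sketch (sample data illustrative; assumes `sc`) showing the distributed and driver-local variants:

```scala
val words = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1))
val counts = words.reduceByKey(_ + _)        // RDD[(String, Int)], stays distributed
val local  = words.reduceByKeyLocally(_ + _) // scala.collection.Map[String, Int] on the driver
```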
public scala.collection.Map<K,Object> countByKey()
public PartialResult<scala.collection.Map<K,BoundedDouble>> countByKeyApprox(long timeout, double confidence)
public RDD<scala.Tuple2<K,Object>> countApproxDistinctByKey(int p, int sp, Partitioner partitioner)
    :: Experimental ::
    Return approximate number of distinct values for each key in this RDD.
    The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".
    The relative accuracy is approximately 1.054 / sqrt(2^p). Setting a nonzero sp > p would trigger sparse representation of registers, which may reduce the memory consumption and increase accuracy when the cardinality is small.
    Parameters:
    p - The precision value for the normal set. p must be a value between 4 and sp if sp is not zero (32 max).
    sp - The precision value for the sparse set, between 0 and 32. If sp equals 0, the sparse representation is skipped.
    partitioner - Partitioner to use for the resulting RDD.

public RDD<scala.Tuple2<K,Object>> countApproxDistinctByKey(double relativeSD, Partitioner partitioner)
    Return approximate number of distinct values for each key in this RDD.
    The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".
    Parameters:
    relativeSD - Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
    partitioner - partitioner of the resulting RDD

public RDD<scala.Tuple2<K,Object>> countApproxDistinctByKey(double relativeSD, int numPartitions)
    Return approximate number of distinct values for each key in this RDD.
    The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".
    Parameters:
    relativeSD - Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
    numPartitions - number of partitions of the resulting RDD

public RDD<scala.Tuple2<K,Object>> countApproxDistinctByKey(double relativeSD)
    Return approximate number of distinct values for each key in this RDD.
    The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".
    Parameters:
    relativeSD - Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
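A sketch (sample data and accuracy illustrative; assumes `sc`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 7)))
// Counts are approximate; with such tiny data they will almost surely be exact.
pairs.countApproxDistinctByKey(0.05).collect() // ~ Array((a,2), (b,1))
```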
public RDD<scala.Tuple2<K,scala.collection.Iterable<V>>> groupByKey(Partitioner partitioner)
    Group the values for each key in the RDD into a single sequence.
    Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
public RDD<scala.Tuple2<K,scala.collection.Iterable<V>>> groupByKey(int numPartitions)
    Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD into numPartitions partitions.
    Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
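A sketch contrasting the two approaches for a per-key sum (sample data illustrative; assumes `sc`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 4)))
// Expensive: shuffles and materializes every value for a key before summing.
val viaGroup  = pairs.groupByKey().mapValues(_.sum)
// Preferred: combines values map-side, so far less data crosses the network.
val viaReduce = pairs.reduceByKey(_ + _)
```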
public RDD<scala.Tuple2<K,V>> partitionBy(Partitioner partitioner)
public <W> RDD<scala.Tuple2<K,scala.Tuple2<V,W>>> join(RDD<scala.Tuple2<K,W>> other, Partitioner partitioner)
    Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<V,scala.Option<W>>>> leftOuterJoin(RDD<scala.Tuple2<K,W>> other, Partitioner partitioner)
    Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to partition the output RDD.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<scala.Option<V>,W>>> rightOuterJoin(RDD<scala.Tuple2<K,W>> other, Partitioner partitioner)
    Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to partition the output RDD.
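A sketch of the three join variants on small illustrative inputs (assumes `sc`):

```scala
val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
left.join(right).collect()           // Array((a,(1,x)))
left.leftOuterJoin(right).collect()  // Array((a,(1,Some(x))), (b,(2,None)))
left.rightOuterJoin(right).collect() // Array((a,(Some(1),x)), (c,(None,y)))
```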
public <C> RDD<scala.Tuple2<K,C>> combineByKey(scala.Function1<V,C> createCombiner, scala.Function2<C,V,C> mergeValue, scala.Function2<C,C,C> mergeCombiners)
    Simplified version of combineByKey that hash-partitions the resulting RDD using the existing partitioner/parallelism level.

public RDD<scala.Tuple2<K,scala.collection.Iterable<V>>> groupByKey()
    Group the values for each key in the RDD into a single sequence.
    Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
public <W> RDD<scala.Tuple2<K,scala.Tuple2<V,W>>> join(RDD<scala.Tuple2<K,W>> other)
    Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and (k, v2) is in `other`. Performs a hash join across the cluster.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<V,W>>> join(RDD<scala.Tuple2<K,W>> other, int numPartitions)
    Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and (k, v2) is in `other`. Performs a hash join across the cluster.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<V,scala.Option<W>>>> leftOuterJoin(RDD<scala.Tuple2<K,W>> other)
    Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the pair (k, (v, None)) if no elements in `other` have key k. Hash-partitions the output using the existing partitioner/parallelism level.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<V,scala.Option<W>>>> leftOuterJoin(RDD<scala.Tuple2<K,W>> other, int numPartitions)
    Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the pair (k, (v, None)) if no elements in `other` have key k. Hash-partitions the output into numPartitions partitions.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<scala.Option<V>,W>>> rightOuterJoin(RDD<scala.Tuple2<K,W>> other)
    Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the pair (k, (None, w)) if no elements in `this` have key k. Hash-partitions the resulting RDD using the existing partitioner/parallelism level.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<scala.Option<V>,W>>> rightOuterJoin(RDD<scala.Tuple2<K,W>> other, int numPartitions)
    Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the pair (k, (None, w)) if no elements in `this` have key k. Hash-partitions the resulting RDD into the given number of partitions.

public scala.collection.Map<K,V> collectAsMap()
    Return the key-value pairs in this RDD to the master as a Map.
    Warning: this doesn't return a multimap, so if you have multiple values for the same key, only one value per key is preserved in the map that is returned.
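A sketch of that caveat (sample data illustrative; assumes `sc`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.collectAsMap() // e.g. Map(a -> 2, b -> 3): only one of the two "a" values survives
```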
public <U> RDD<scala.Tuple2<K,U>> mapValues(scala.Function1<V,U> f)
public <U> RDD<scala.Tuple2<K,U>> flatMapValues(scala.Function1<V,scala.collection.TraversableOnce<U>> f)
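A sketch (sample data illustrative; assumes `sc`); both operations leave keys, and therefore the partitioning, untouched:

```scala
val pairs = sc.parallelize(Seq(("a", "x y"), ("b", "z")))
pairs.mapValues(_.toUpperCase).collect()    // Array((a,X Y), (b,Z))
pairs.flatMapValues(_.split(" ")).collect() // Array((a,x), (a,y), (b,z))
```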
public <W1,W2,W3> RDD<scala.Tuple2<K,scala.Tuple4<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>,scala.collection.Iterable<W3>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, RDD<scala.Tuple2<K,W3>> other3, Partitioner partitioner)
    For each key k in `this` or `other1` or `other2` or `other3`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1`, `other2` and `other3`.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<scala.collection.Iterable<V>,scala.collection.Iterable<W>>>> cogroup(RDD<scala.Tuple2<K,W>> other, Partitioner partitioner)

public <W1,W2> RDD<scala.Tuple2<K,scala.Tuple3<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, Partitioner partitioner)

public <W1,W2,W3> RDD<scala.Tuple2<K,scala.Tuple4<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>,scala.collection.Iterable<W3>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, RDD<scala.Tuple2<K,W3>> other3)

public <W> RDD<scala.Tuple2<K,scala.Tuple2<scala.collection.Iterable<V>,scala.collection.Iterable<W>>>> cogroup(RDD<scala.Tuple2<K,W>> other)
    For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the list of values for that key in `this` as well as `other`.

public <W1,W2> RDD<scala.Tuple2<K,scala.Tuple3<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2)
    For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1` and `other2`.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<scala.collection.Iterable<V>,scala.collection.Iterable<W>>>> cogroup(RDD<scala.Tuple2<K,W>> other, int numPartitions)
    For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the list of values for that key in `this` as well as `other`.

public <W1,W2> RDD<scala.Tuple2<K,scala.Tuple3<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, int numPartitions)
    For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1` and `other2`.

public <W1,W2,W3> RDD<scala.Tuple2<K,scala.Tuple4<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>,scala.collection.Iterable<W3>>>> cogroup(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, RDD<scala.Tuple2<K,W3>> other3, int numPartitions)
    For each key k in `this` or `other1` or `other2` or `other3`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1`, `other2` and `other3`.

public <W> RDD<scala.Tuple2<K,scala.Tuple2<scala.collection.Iterable<V>,scala.collection.Iterable<W>>>> groupWith(RDD<scala.Tuple2<K,W>> other)
public <W1,W2> RDD<scala.Tuple2<K,scala.Tuple3<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>>>> groupWith(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2)
public <W1,W2,W3> RDD<scala.Tuple2<K,scala.Tuple4<scala.collection.Iterable<V>,scala.collection.Iterable<W1>,scala.collection.Iterable<W2>,scala.collection.Iterable<W3>>>> groupWith(RDD<scala.Tuple2<K,W1>> other1, RDD<scala.Tuple2<K,W2>> other2, RDD<scala.Tuple2<K,W3>> other3)
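A sketch of cogroup, for which the groupWith methods above are aliases (sample data illustrative; assumes `sc`; the concrete Iterable type printed may vary by Spark version):

```scala
val a = sc.parallelize(Seq(("k", 1), ("k", 2)))
val b = sc.parallelize(Seq(("k", "x"), ("m", "y")))
// Every key from either RDD appears once, paired with all of its values from each side.
a.cogroup(b).collect()
// e.g. Array((k,(CompactBuffer(1, 2),CompactBuffer(x))), (m,(CompactBuffer(),CompactBuffer(y))))
```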
public <W> RDD<scala.Tuple2<K,V>> subtractByKey(RDD<scala.Tuple2<K,W>> other, scala.reflect.ClassTag<W> evidence$4)
    Return an RDD with the pairs from `this` whose keys are not in `other`. Uses `this` partitioner/partition size, because even if `other` is huge, the resulting RDD will be <= us.
public <W> RDD<scala.Tuple2<K,V>> subtractByKey(RDD<scala.Tuple2<K,W>> other, int numPartitions, scala.reflect.ClassTag<W> evidence$5)
public <W> RDD<scala.Tuple2<K,V>> subtractByKey(RDD<scala.Tuple2<K,W>> other, Partitioner p, scala.reflect.ClassTag<W> evidence$6)
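A sketch (sample data illustrative; assumes `sc`):

```scala
val a = sc.parallelize(Seq(("a", 1), ("b", 2)))
val b = sc.parallelize(Seq(("b", 99)))
// Only the values of `a` matter; `b` contributes just its key set.
a.subtractByKey(b).collect() // Array((a,1))
```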
public scala.collection.Seq<V> lookup(K key)
    Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
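A sketch (sample data and partition count illustrative; assumes `sc`); with a known partitioner, only the one partition the key maps to is scanned:

```scala
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).partitionBy(new HashPartitioner(4))
pairs.lookup("a") // Seq(1, 2)
```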
public <F extends org.apache.hadoop.mapred.OutputFormat<K,V>> void saveAsHadoopFile(String path, scala.reflect.ClassTag<F> fm)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.

public <F extends org.apache.hadoop.mapred.OutputFormat<K,V>> void saveAsHadoopFile(String path, Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec, scala.reflect.ClassTag<F> fm)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD. Compress the result with the supplied codec.

public <F extends org.apache.hadoop.mapreduce.OutputFormat<K,V>> void saveAsNewAPIHadoopFile(String path, scala.reflect.ClassTag<F> fm)
    Output the RDD to any Hadoop-supported file system, using a new Hadoop API OutputFormat (mapreduce.OutputFormat) object supporting the key and value types K and V in this RDD.

public void saveAsNewAPIHadoopFile(String path, Class<?> keyClass, Class<?> valueClass, Class<? extends org.apache.hadoop.mapreduce.OutputFormat<?,?>> outputFormatClass, org.apache.hadoop.conf.Configuration conf)
    Output the RDD to any Hadoop-supported file system, using a new Hadoop API OutputFormat (mapreduce.OutputFormat) object supporting the key and value types K and V in this RDD.

public void saveAsHadoopFile(String path, Class<?> keyClass, Class<?> valueClass, Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass, Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD. Compress with the supplied codec.

public void saveAsHadoopFile(String path, Class<?> keyClass, Class<?> valueClass, Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass, org.apache.hadoop.mapred.JobConf conf, scala.Option<Class<? extends org.apache.hadoop.io.compress.CompressionCodec>> codec)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.

public void saveAsNewAPIHadoopDataset(org.apache.hadoop.conf.Configuration conf)
    Output the RDD to any Hadoop-supported storage system with new Hadoop API, using a Hadoop Configuration object for that storage system.

public void saveAsHadoopDataset(org.apache.hadoop.mapred.JobConf conf)
    Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for that storage system.
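A sketch (output path is hypothetical; assumes `sc`) writing Writable pairs with the new-API TextOutputFormat:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val records = sc.parallelize(Seq(("a", 1), ("b", 2)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }
records.saveAsNewAPIHadoopFile(
  "hdfs:///tmp/pairs-out", // hypothetical output directory
  classOf[Text],
  classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]])
```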