public final class DataFrameStatFunctions
extends Object
Statistic functions for DataFrames.
| Modifier and Type | Method and Description |
|---|---|
| double[][] | approxQuantile(String[] cols, double[] probabilities, double relativeError) Calculates the approximate quantiles of numerical columns of a DataFrame. |
| double[] | approxQuantile(String col, double[] probabilities, double relativeError) Calculates the approximate quantiles of a numerical column of a DataFrame. |
| BloomFilter | bloomFilter(Column col, long expectedNumItems, double fpp) Builds a Bloom filter over a specified column. |
| BloomFilter | bloomFilter(Column col, long expectedNumItems, long numBits) Builds a Bloom filter over a specified column. |
| BloomFilter | bloomFilter(String colName, long expectedNumItems, double fpp) Builds a Bloom filter over a specified column. |
| BloomFilter | bloomFilter(String colName, long expectedNumItems, long numBits) Builds a Bloom filter over a specified column. |
| double | corr(String col1, String col2) Calculates the Pearson Correlation Coefficient of two columns of a DataFrame. |
| double | corr(String col1, String col2, String method) Calculates the correlation of two columns of a DataFrame. |
| CountMinSketch | countMinSketch(Column col, double eps, double confidence, int seed) Builds a Count-min Sketch over a specified column. |
| CountMinSketch | countMinSketch(Column col, int depth, int width, int seed) Builds a Count-min Sketch over a specified column. |
| CountMinSketch | countMinSketch(String colName, double eps, double confidence, int seed) Builds a Count-min Sketch over a specified column. |
| CountMinSketch | countMinSketch(String colName, int depth, int width, int seed) Builds a Count-min Sketch over a specified column. |
| double | cov(String col1, String col2) Calculates the sample covariance of two numerical columns of a DataFrame. |
| Dataset<Row> | crosstab(String col1, String col2) Computes a pair-wise frequency table of the given columns. |
| Dataset<Row> | freqItems(scala.collection.Seq<String> cols) (Scala-specific) Finds frequent items for columns, possibly with false positives. |
| Dataset<Row> | freqItems(scala.collection.Seq<String> cols, double support) (Scala-specific) Finds frequent items for columns, possibly with false positives. |
| Dataset<Row> | freqItems(String[] cols) Finds frequent items for columns, possibly with false positives. |
| Dataset<Row> | freqItems(String[] cols, double support) Finds frequent items for columns, possibly with false positives. |
| <T> Dataset<Row> | sampleBy(Column col, java.util.Map<T,Double> fractions, long seed) (Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum. |
| <T> Dataset<Row> | sampleBy(Column col, scala.collection.immutable.Map<T,Object> fractions, long seed) Returns a stratified sample without replacement based on the fraction given on each stratum. |
| <T> Dataset<Row> | sampleBy(String col, java.util.Map<T,Double> fractions, long seed) Returns a stratified sample without replacement based on the fraction given on each stratum. |
| <T> Dataset<Row> | sampleBy(String col, scala.collection.immutable.Map<T,Object> fractions, long seed) Returns a stratified sample without replacement based on the fraction given on each stratum. |

public double[] approxQuantile(String col,
                               double[] probabilities,
                               double relativeError)
Calculates the approximate quantiles of a numerical column of a DataFrame.

The result of this algorithm has the following deterministic bound:
if the DataFrame has N elements and we request the quantile at probability p up to error
err, then the algorithm will return a sample x from the DataFrame so that the *exact* rank
of x is close to (p * N).
More precisely,

    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)

This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.

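To make the bound concrete, the following plain-Java sketch (no Spark dependency) computes the admissible rank range and checks whether a candidate value falls inside it. The class and method names are hypothetical, and rank(x) is assumed here to be the number of elements less than or equal to x, which the description above does not spell out.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating the rank bound; not part of Spark. */
final class QuantileRankBound {

    /** Lowest admissible rank: floor((p - err) * N). */
    static long lowerRank(double p, double err, long n) {
        return (long) Math.floor((p - err) * n);
    }

    /** Highest admissible rank: ceil((p + err) * N). */
    static long upperRank(double p, double err, long n) {
        return (long) Math.ceil((p + err) * n);
    }

    /**
     * True if x satisfies the bound for probability p and error err.
     * Assumption: rank(x) is the count of elements that are <= x.
     */
    static boolean withinBound(double[] data, double x, double p, double err) {
        long rank = Arrays.stream(data).filter(v -> v <= x).count();
        long n = data.length;
        return lowerRank(p, err, n) <= rank && rank <= upperRank(p, err, n);
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        // Median (p = 0.5) with err = 0.25 over 8 elements.
        System.out.println(lowerRank(0.5, 0.25, data.length) + " .. " + upperRank(0.5, 0.25, data.length));
        System.out.println(withinBound(data, 4.0, 0.5, 0.25));
        System.out.println(withinBound(data, 7.0, 0.5, 0.25));
    }
}
```

A smaller relativeError narrows the admissible rank range; with an error of zero only values at rank p * N (up to rounding) satisfy the bound, which is why the exact computation can be expensive.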
col - the name of the numerical column
probabilities - a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
relativeError - the relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

public double[][] approxQuantile(String[] cols,
                                 double[] probabilities,
                                 double relativeError)
Calculates the approximate quantiles of numerical columns of a DataFrame. See approxQuantile(String col, double[] probabilities, double relativeError) for a detailed description.

cols - the names of the numerical columns
probabilities - a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
relativeError - the relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

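The bloomFilter overloads that follow accept either a target false positive probability (fpp) or an explicit bit count (numBits). As a rough guide to how the two relate, the sketch below applies the textbook Bloom filter sizing formulas in plain Java. The class and method names are hypothetical, and whether Spark's BloomFilter sizes itself with exactly these formulas is an assumption.

```java
/** Hypothetical helper using textbook Bloom filter sizing; not Spark's implementation. */
final class BloomSizingSketch {

    /** Textbook optimal bit count: m = -n * ln(p) / (ln 2)^2. */
    static long optimalNumBits(long expectedNumItems, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedNumItems * Math.log(fpp) / (ln2 * ln2));
    }

    /** Textbook optimal hash function count: k = (m / n) * ln 2, at least 1. */
    static int optimalNumHashes(long expectedNumItems, long numBits) {
        return Math.max(1, (int) Math.round((double) numBits / expectedNumItems * Math.log(2)));
    }

    /** Approximate false positive probability: (1 - e^(-k * n / m))^k. */
    static double approxFpp(long expectedNumItems, long numBits, int numHashes) {
        return Math.pow(1 - Math.exp(-(double) numHashes * expectedNumItems / numBits), numHashes);
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        long m = optimalNumBits(n, 0.03);
        int k = optimalNumHashes(n, m);
        System.out.println("bits=" + m + " hashes=" + k + " approxFpp=" + approxFpp(n, m, k));
    }
}
```

Under these formulas, lowering fpp for the same expectedNumItems increases the required number of bits, so the numBits overloads trade accuracy for a fixed memory budget.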
 public BloomFilter bloomFilter(String colName, long expectedNumItems, double fpp)
Builds a Bloom filter over a specified column.

colName - name of the column over which the filter is built
expectedNumItems - expected number of items which will be put into the filter
fpp - expected false positive probability of the filter

public BloomFilter bloomFilter(Column col, long expectedNumItems, double fpp)
Builds a Bloom filter over a specified column.

col - the column over which the filter is built
expectedNumItems - expected number of items which will be put into the filter
fpp - expected false positive probability of the filter

public BloomFilter bloomFilter(String colName, long expectedNumItems, long numBits)
Builds a Bloom filter over a specified column.

colName - name of the column over which the filter is built
expectedNumItems - expected number of items which will be put into the filter
numBits - expected number of bits of the filter

public BloomFilter bloomFilter(Column col, long expectedNumItems, long numBits)
Builds a Bloom filter over a specified column.

col - the column over which the filter is built
expectedNumItems - expected number of items which will be put into the filter
numBits - expected number of bits of the filter

public double corr(String col1,
                   String col2,
                   String method)
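Calculates the correlation of two columns of a DataFrame. Currently only the Pearson Correlation Coefficient is supported.

col1 - the name of the column
col2 - the name of the column to calculate the correlation against
method - the name of the correlation method; currently only "pearson" is supported

    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.corr("rand1", "rand2", "pearson")
    res1: Double = 0.613...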
 public double corr(String col1,
                   String col2)
Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.

col1 - the name of the column
col2 - the name of the column to calculate the correlation against

    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.corr("rand1", "rand2")
    res1: Double = 0.613...
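
For readers who want to check a corr result by hand, here is a minimal plain-Java sketch of the Pearson Correlation Coefficient on two in-memory arrays. It uses the textbook two-pass formula; the class name is hypothetical and nothing here reflects how Spark computes the value internally.

```java
/** Hypothetical helper: textbook Pearson correlation on in-memory arrays. */
final class PearsonSketch {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equally long arrays with at least 2 values");
        }
        int n = x.length;

        // First pass: means of both columns.
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Second pass: sums of products of deviations from the means.
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8}; // perfectly linear in x
        System.out.println(pearson(x, y));
    }
}
```

The result always lies in [-1, 1]; a constant column makes the denominator zero, so the sketch returns NaN in that case.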
 public CountMinSketch countMinSketch(String colName, int depth, int width, int seed)
Builds a Count-min Sketch over a specified column.

colName - name of the column over which the sketch is built
depth - depth of the sketch
width - width of the sketch
seed - random seed
Returns: a CountMinSketch over column colName

public CountMinSketch countMinSketch(String colName, double eps, double confidence, int seed)
Builds a Count-min Sketch over a specified column.

colName - name of the column over which the sketch is built
eps - relative error of the sketch
confidence - confidence of the sketch
seed - random seed
Returns: a CountMinSketch over column colName

public CountMinSketch countMinSketch(Column col, int depth, int width, int seed)
Builds a Count-min Sketch over a specified column.

col - the column over which the sketch is built
depth - depth of the sketch
width - width of the sketch
seed - random seed
Returns: a CountMinSketch over column col

public CountMinSketch countMinSketch(Column col, double eps, double confidence, int seed)
Builds a Count-min Sketch over a specified column.

col - the column over which the sketch is built
eps - relative error of the sketch
confidence - confidence of the sketch
seed - random seed
Returns: a CountMinSketch over column col

public double cov(String col1,
                  String col2)
Calculates the sample covariance of two numerical columns of a DataFrame.

col1 - the name of the first column
col2 - the name of the second column

    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.cov("rand1", "rand2")
    res1: Double = 0.065...
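
As a reference point for what "sample covariance" means, the plain-Java sketch below uses the textbook definition, dividing the sum of products of deviations by n - 1 rather than n. The class name is hypothetical and the code does not mirror Spark's internal computation.

```java
/** Hypothetical helper: textbook sample covariance on in-memory arrays. */
final class SampleCovarianceSketch {

    static double sampleCovariance(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equally long arrays with at least 2 values");
        }
        int n = x.length;

        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        // Sample (not population) covariance: divide by n - 1.
        return sum / (n - 1);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8};
        System.out.println(sampleCovariance(x, y));
    }
}
```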
 public Dataset<Row> crosstab(String col1, String col2)
Computes a pair-wise frequency table of the given columns. The first column of each row will be
the distinct values of col1 and the column names will
be the distinct values of col2. The name of the first column will be col1_col2. Counts
will be returned as Longs. Pairs that have no occurrences will have zero as their counts.
Null elements will be replaced by "null", and back ticks will be dropped from elements if they
exist.

col1 - The name of the first column. Distinct items will make the first item of each row.
col2 - The name of the second column. Distinct items will make the column names of the DataFrame.

    val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
      .toDF("key", "value")
    val ct = df.stat.crosstab("key", "value")
    ct.show()
    +---------+---+---+---+
    |key_value|  1|  2|  3|
    +---------+---+---+---+
    |        2|  2|  0|  1|
    |        1|  1|  1|  0|
    |        3|  0|  1|  1|
    +---------+---+---+---+
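
The same pair-wise counting can be sketched in plain Java to show what each cell means. The code below builds a nested map from the (key, value) pairs used in the example above and fills in zeros for pairs that never occur. The class name is hypothetical, and the sketch handles only integer pairs, with no null or back tick handling.

```java
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: in-memory contingency table for integer pairs. */
final class CrosstabSketch {

    /** Returns first-column value -> (second-column value -> count), zero-filled. */
    static SortedMap<Integer, SortedMap<Integer, Long>> crosstab(int[][] pairs) {
        // Collect the distinct second-column values; they become the column names.
        SortedSet<Integer> columnValues = new TreeSet<>();
        for (int[] pair : pairs) {
            columnValues.add(pair[1]);
        }

        SortedMap<Integer, SortedMap<Integer, Long>> table = new TreeMap<>();
        for (int[] pair : pairs) {
            // Create a zero-filled row the first time a first-column value is seen.
            SortedMap<Integer, Long> row = table.computeIfAbsent(pair[0], k -> {
                SortedMap<Integer, Long> zeros = new TreeMap<>();
                for (Integer c : columnValues) {
                    zeros.put(c, 0L);
                }
                return zeros;
            });
            row.merge(pair[1], 1L, Long::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        int[][] pairs = {{1, 1}, {1, 2}, {2, 1}, {2, 1}, {2, 3}, {3, 2}, {3, 3}};
        crosstab(pairs).forEach((key, row) -> System.out.println(key + " -> " + row));
    }
}
```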
 public Dataset<Row> freqItems(String[] cols, double support)
Finds frequent items for columns, possibly with false positives. The support should be greater than 1e-4.

This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame.

cols - the names of the columns to search frequent items in
support - the minimum frequency for an item to be considered frequent; should be greater than 1e-4

    val rows = Seq.tabulate(100) { i =>
      if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
    }
    val df = spark.createDataFrame(rows).toDF("a", "b")
    // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
    // "a" and "b"
    val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
    freqSingles.show()
    +-----------+-------------+
    |a_freqItems|  b_freqItems|
    +-----------+-------------+
    |    [1, 99]|[-1.0, -99.0]|
    +-----------+-------------+
    // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
    val pairDf = df.select(struct("a", "b").as("a-b"))
    val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
    freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
    +----------+
    |   freq_ab|
    +----------+
    |  [1,-1.0]|
    |   ...    |
    +----------+
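
To illustrate how an approximate frequent-items search can report false positives, the plain-Java sketch below uses Misra-Gries style counting with a bounded number of counters. This technique is swapped in purely for illustration: the class name, the choice of ceil(1 / support) counters, and the assumption that this resembles what freqItems does internally are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: Misra-Gries style frequent item candidates. */
final class FrequentItemsSketch {

    /**
     * Returns a superset of the items whose relative frequency exceeds support.
     * Items that are not actually frequent may also be returned (false positives).
     */
    static <T> Set<T> candidates(List<T> items, double support) {
        // Assumption: keep at most ceil(1 / support) counters.
        int capacity = (int) Math.ceil(1.0 / support);
        Map<T, Long> counters = new HashMap<>();
        for (T item : items) {
            if (counters.containsKey(item)) {
                counters.merge(item, 1L, Long::sum);
            } else if (counters.size() < capacity) {
                counters.put(item, 1L);
            } else {
                // No free counter: decrement every counter and drop those that reach zero.
                counters.replaceAll((k, v) -> v - 1L);
                counters.values().removeIf(v -> v == 0L);
            }
        }
        return new HashSet<>(counters.keySet());
    }

    public static void main(String[] args) {
        // Same values as column "a" in the example above: 1 on even rows, i on odd rows.
        List<Integer> a = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            a.add(i % 2 == 0 ? 1 : i);
        }
        System.out.println(candidates(a, 0.4));
    }
}
```

Every item that occurs more often than the support threshold keeps a positive counter, while an infrequent item seen near the end of the input can survive as well. That is why the result should be treated as a set of candidates rather than an exact answer.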
 public Dataset<Row> freqItems(String[] cols)
Finds frequent items for columns, possibly with false positives. Uses a default support of 1%.

This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame.

cols - the names of the columns to search frequent items in

public Dataset<Row> freqItems(scala.collection.Seq<String> cols, double support)
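(Scala-specific) Finds frequent items for columns, possibly with false positives.

This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame.

cols - the names of the columns to search frequent items in
support - the minimum frequency for an item to be considered frequent; should be greater than 1e-4
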
    val rows = Seq.tabulate(100) { i =>
      if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
    }
    val df = spark.createDataFrame(rows).toDF("a", "b")
    // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
    // "a" and "b"
    val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
    freqSingles.show()
    +-----------+-------------+
    |a_freqItems|  b_freqItems|
    +-----------+-------------+
    |    [1, 99]|[-1.0, -99.0]|
    +-----------+-------------+
    // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
    val pairDf = df.select(struct("a", "b").as("a-b"))
    val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
    freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
    +----------+
    |   freq_ab|
    +----------+
    |  [1,-1.0]|
    |   ...    |
    +----------+
 public Dataset<Row> freqItems(scala.collection.Seq<String> cols)
(Scala-specific) Finds frequent items for columns, possibly with false positives. Uses a default support of 1%.

This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame.

cols - the names of the columns to search frequent items in

public <T> Dataset<Row> sampleBy(String col, scala.collection.immutable.Map<T,Object> fractions, long seed)
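Returns a stratified sample without replacement based on the fraction given on each stratum.

col - column that defines strata
fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed - random seed
Returns: a DataFrame that represents the stratified sample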
 
    val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
      (3, 3))).toDF("key", "value")
    val fractions = Map(1 -> 1.0, 3 -> 0.5)
    df.stat.sampleBy("key", fractions, 36L).show()
    +---+-----+
    |key|value|
    +---+-----+
    |  1|    1|
    |  1|    2|
    |  3|    2|
    +---+-----+
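
One way to picture stratified sampling without replacement is an independent coin flip per row, weighted by the fraction of that row's stratum. The plain-Java sketch below follows that idea for rows of (key, value) integers. The class name is hypothetical, the per-row Bernoulli draw is an assumption about the mechanism, and its random draws will not reproduce the rows Spark selects for a given seed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper: per-row Bernoulli sampling keyed by stratum. */
final class StratifiedSampleSketch {

    /** Each row is {key, value}; the key (index 0) defines the stratum. */
    static List<int[]> sampleBy(List<int[]> rows, Map<Integer, Double> fractions, long seed) {
        Random rng = new Random(seed);
        List<int[]> sample = new ArrayList<>();
        for (int[] row : rows) {
            // A stratum missing from the map is treated as fraction zero.
            double fraction = fractions.getOrDefault(row[0], 0.0);
            // Each row is considered once, so no row can appear twice in the sample.
            if (rng.nextDouble() < fraction) {
                sample.add(row);
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
            new int[] {1, 1}, new int[] {1, 2}, new int[] {2, 1}, new int[] {2, 1},
            new int[] {2, 3}, new int[] {3, 2}, new int[] {3, 3});
        Map<Integer, Double> fractions = new HashMap<>();
        fractions.put(1, 1.0);
        fractions.put(3, 0.5);
        for (int[] row : sampleBy(rows, fractions, 36L)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With a fraction of 1.0 every row of the stratum is kept, and with a missing or zero fraction none is. For values in between, the number of rows kept varies around fraction times the stratum size rather than matching it exactly.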
 public <T> Dataset<Row> sampleBy(String col, java.util.Map<T,Double> fractions, long seed)
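Returns a stratified sample without replacement based on the fraction given on each stratum.

col - column that defines strata
fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed - random seed
Returns: a DataFrame that represents the stratified sample
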
 public <T> Dataset<Row> sampleBy(Column col, scala.collection.immutable.Map<T,Object> fractions, long seed)
Returns a stratified sample without replacement based on the fraction given on each stratum.

col - column that defines strata
fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed - random seed
Returns: a DataFrame that represents the stratified sample

The stratified sample can be performed over multiple columns:
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.struct
    val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
      ("Alice", 10))).toDF("name", "age")
    val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
    df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
    +-----+---+
    | name|age|
    +-----+---+
    | Nico|  8|
    |Alice| 10|
    +-----+---+
 public <T> Dataset<Row> sampleBy(Column col, java.util.Map<T,Double> fractions, long seed)
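(Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.

col - column that defines strata
fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed - random seed
Returns: a DataFrame that represents the stratified sample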