Dataset

Instance Constructors

new Dataset(sqlContext: SQLContext, logicalPlan: LogicalPlan, encoder: Encoder[T])
new Dataset(sparkSession: SparkSession, logicalPlan: LogicalPlan, encoder: Encoder[T])

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def agg(expr: Column, exprs: Column*): DataFrame

Aggregates on the entire Dataset without groups.
```
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(max($"age"), avg($"salary"))
ds.groupBy().agg(max($"age"), avg($"salary"))
```
Annotations
@varargs()
Since
2.0.0
def agg(exprs: java.util.Map[String, String]): DataFrame

(Java-specific) Aggregates on the entire Dataset without groups.
```
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
```
Since
2.0.0
def agg(exprs: Map[String, String]): DataFrame

(Scala-specific) Aggregates on the entire Dataset without groups.
```
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
```
Since
2.0.0
def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

(Scala-specific) Aggregates on the entire Dataset without groups.
```
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg("age" -> "max", "salary" -> "avg")
ds.groupBy().agg("age" -> "max", "salary" -> "avg")
```
Since
2.0.0
def alias(alias: Symbol): Dataset[T]

(Scala-specific) Returns a new Dataset with an alias set. Same as as.

Since
2.0.0
def alias(alias: String): Dataset[T]

Returns a new Dataset with an alias set. Same as as.

Since
2.0.0
def apply(colName: String): Column

Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column like a.b.

Since
2.0.0
def as(alias: Symbol): Dataset[T]

(Scala-specific) Returns a new Dataset with an alias set.

Since
2.0.0
def as(alias: String): Dataset[T]

Returns a new Dataset with an alias set.

Since
1.6.0
def as[U](implicit arg0: Encoder[U]): Dataset[U]

:: Experimental :: Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:
- When U is a class, fields of the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
- When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
- When U is a primitive type (e.g. String, Int), the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
Annotations
@Experimental()
Since
1.6.0
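The three mapping rules for as[U] can be illustrated with a minimal sketch. The Person case class, the column names, and the local SparkSession are assumptions made for this illustration only; the statements are written as they would appear in a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, introduced only for this sketch.
case class Person(name: String, age: Long)

val spark = SparkSession.builder().master("local[*]").appName("as-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 31L), ("Bob", 45L)).toDF("name", "age")

// Class target: fields are matched to columns by name.
val people = df.as[Person].collect().toSeq

// Tuple target: columns are matched by position (_1, _2).
val pairs = df.as[(String, Long)].collect().toSeq

// Primitive target: select the wanted column first so that it is the first column.
val names = df.select($"name").as[String].collect().toSeq

spark.stop()
```

If the column names did not match the fields of Person, a select with renamed columns would be needed before calling as[Person].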
final def asInstanceOf[T0]: T0

Definition Classes
Any
def cache(): Dataset.this.type

Persist this Dataset with the default storage level (MEMORY_AND_DISK).

Since
1.6.0
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def coalesce(numPartitions: Int): Dataset[T]

Returns a new Dataset that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency: if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions.

Since
1.6.0
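The effect of coalesce on the partition count can be checked with a small sketch. The partition numbers and the use of spark.range to build the input are assumptions chosen for illustration, and only the case of reducing the number of partitions is shown.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("coalesce-sketch").getOrCreate()

// Start from 100 partitions (the fourth argument of range is the partition count).
val ds = spark.range(0L, 1000L, 1L, 100)
val before = ds.rdd.getNumPartitions

// Merge down to 10 partitions; existing partitions are combined rather than shuffled.
val merged = ds.coalesce(10)
val after = merged.rdd.getNumPartitions

// The rows themselves are unchanged.
val rowCount = merged.count()

spark.stop()
```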
def col(colName: String): Column

Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column like a.b.

Since
2.0.0
def collect(): Array[T]

Returns an array that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For the Java API, use collectAsList.

Since
1.6.0
def collectAsList(): java.util.List[T]

Returns a Java list that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

Since
1.6.0
def columns: Array[String]

Returns all column names as an array.

Since
1.6.0
def count(): Long

Returns the number of rows in the Dataset.

Since
1.6.0
def createOrReplaceTempView(viewName: String): Unit

Creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.

Since
2.0.0
def createTempView(viewName: String): Unit

Creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.

Annotations
@throws( ... )
Since
2.0.0
Exceptions thrown
AnalysisException if the view name already exists
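The difference between createOrReplaceTempView and createTempView can be sketched as follows. The view name, the sample rows, and the local SparkSession are assumptions made for illustration; the failure of the second registration is captured with scala.util.Try rather than allowed to propagate.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder().master("local[*]").appName("tempview-sketch").getOrCreate()
import spark.implicits._

val people = Seq(("Ann", 31), ("Bob", 45)).toDF("name", "age")

// Registers the view, replacing any existing temporary view named "people".
people.createOrReplaceTempView("people")
val over40 = spark.sql("SELECT name FROM people WHERE age > 40").as[String].collect().toSeq

// createTempView refuses to overwrite: the name is already taken, so this attempt fails.
val secondAttempt = Try(people.createTempView("people"))
val failedAsExpected = secondAttempt.isFailure

spark.stop()
```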
def cube(col1: String, cols: String*): RelationalGroupedDataset

Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
```
// Compute the average for all numeric columns cubed by department and group.
ds.cube("department", "group").avg()

// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
2.0.0
def cube(cols: Column*): RelationalGroupedDataset

Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
// Compute the average for all numeric columns cubed by department and group.
ds.cube($"department", $"group").avg()

// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
2.0.0
def describe(cols: String*): DataFrame

Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
```
ds.describe("age", "height").show()

// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// max     92.0  192.0
```
Annotations
@varargs()
Since
1.6.0
def distinct(): Dataset[T]

Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

Since
2.0.0
def drop(col: Column): DataFrame

Returns a new Dataset with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.

Since
2.0.0
def drop(colNames: String*): DataFrame

Returns a new Dataset with columns dropped. This is a no-op if the schema doesn't contain the given column name(s).
This method can only be used to drop top-level columns. The colName string is treated literally without further interpretation.

Annotations
@varargs()
Since
2.0.0
def drop(colName: String): DataFrame

Returns a new Dataset with a column dropped. This is a no-op if the schema doesn't contain the given column name.
This method can only be used to drop top-level columns. The colName string is treated literally without further interpretation.

Since
2.0.0
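The three drop variants, including the no-op behavior for an unknown column name, can be sketched as below. The column names and the sample row are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("drop-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", 2.5)).toDF("id", "label", "score")

// Drop a single top-level column by name.
val withoutLabel = df.drop("label").columns.toSeq

// An unknown name is silently ignored: the schema is unchanged.
val unchanged = df.drop("no_such_column").columns.toSeq

// Drop several columns at once (varargs variant).
val withoutTwo = df.drop("label", "score").columns.toSeq

// Drop by Column reference rather than by name.
val byColumn = df.drop(df("score")).columns.toSeq

spark.stop()
```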
def dropDuplicates(col1: String, cols: String*): Dataset[T]

Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

Annotations
@varargs()
Since
2.0.0
def dropDuplicates(colNames: Array[String]): Dataset[T]

Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

Since
2.0.0
def dropDuplicates(colNames: Seq[String]): Dataset[T]

(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

Since
2.0.0
def dropDuplicates(): Dataset[T]

Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.

Since
2.0.0
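A short sketch of how distinct and the dropDuplicates variants differ. The tuple data is an assumption for illustration; a Dataset of tuples has columns named _1 and _2. When deduplicating on a subset of columns, which of the matching rows is kept is not specified here, so only counts are compared.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dedup-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDS()

// Whole-row deduplication: ("a", 1) appears once afterwards.
val uniqueRows = ds.distinct().count()
val sameAsDistinct = ds.dropDuplicates().count()

// Deduplication on the first column only: one row per key survives.
val uniqueKeys = ds.dropDuplicates("_1").count()

spark.stop()
```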
def dtypes: Array[(String, String)]

Returns all column names and their data types as an array.

Since
1.6.0
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def except(other: Dataset[T]): Dataset[T]

Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT in SQL.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

Since
2.0.0
def explain(): Unit

Prints the physical plan to the console for debugging purposes.

Since
1.6.0
def explain(extended: Boolean): Unit

Prints the plans (logical and physical) to the console for debugging purposes.

Since
1.6.0
def filter(func: FilterFunction[T]): Dataset[T]

:: Experimental :: (Java-specific) Returns a new Dataset that only contains elements where func returns true.

Annotations
@Experimental()
Since
1.6.0
def filter(func: (T) ⇒ Boolean): Dataset[T]

:: Experimental :: (Scala-specific) Returns a new Dataset that only contains elements where func returns true.

Annotations
@Experimental()
Since
1.6.0
def filter(conditionExpr: String): Dataset[T]

Filters rows using the given SQL expression.
```
peopleDs.filter("age > 15")
```
Since
1.6.0
def filter(condition: Column): Dataset[T]

Filters rows using the given condition.
```
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
```
Since
1.6.0
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def first(): T

Returns the first row. Alias for head().

Since
1.6.0
def flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U]

:: Experimental :: (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

Annotations
@Experimental()
Since
1.6.0
def flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]

:: Experimental :: (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

Annotations
@Experimental()
Since
1.6.0
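The Scala-specific flatMap can be sketched with a word-splitting example. The input strings are an assumption for illustration; the result type Dataset[String] relies on the implicit encoders brought in by spark.implicits._.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("flatmap-sketch").getOrCreate()
import spark.implicits._

val lines = Seq("to be", "or not").toDS()

// Each line yields several words; the per-line results are flattened into one Dataset.
val words = lines.flatMap(line => line.split(" ").toSeq).collect().toSeq

spark.stop()
```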
def foreach(func: ForeachFunction[T]): Unit

(Java-specific) Runs func on each element of this Dataset.

Since
1.6.0
def foreach(f: (T) ⇒ Unit): Unit

Applies a function f to all rows.

Since
1.6.0
def foreachPartition(func: ForeachPartitionFunction[T]): Unit

(Java-specific) Runs func on each partition of this Dataset.

Since
1.6.0
def foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit

Applies a function f to each partition of this Dataset.

Since
1.6.0
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def groupBy(col1: String, cols: String*): RelationalGroupedDataset

Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
```
// Compute the average for all numeric columns grouped by department.
ds.groupBy("department").avg()

// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
2.0.0
def groupBy(cols: Column*): RelationalGroupedDataset

Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
// Compute the average for all numeric columns grouped by department.
ds.groupBy($"department").avg()

// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
2.0.0
def groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]

:: Experimental :: (Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

Annotations
@Experimental()
Since
2.0.0
def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]

:: Experimental :: (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

Annotations
@Experimental()
Since
2.0.0
def hashCode(): Int

Definition Classes
AnyRef → Any
def head(): T

Returns the first row.

Since
1.6.0
def head(n: Int): Array[T]

Returns the first n rows.

Since
1.6.0
Note
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
def inputFiles: Array[String]

Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.

Since
2.0.0
def intersect(other: Dataset[T]): Dataset[T]

Returns a new Dataset containing only the rows present in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

Since
1.6.0
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def isLocal: Boolean

Returns true if the collect and take methods can be run locally (without any Spark executors).

Since
1.6.0
def isStreaming: Boolean

Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.

Annotations
@Experimental()
Since
2.0.0
def javaRDD: JavaRDD[T]

Returns the content of the Dataset as a JavaRDD of Ts.

Since
1.6.0
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.
```
// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")

// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
```
right
Right side of the join.
joinExprs
Join expression.
joinType
One of: inner, outer, left_outer, right_outer, leftsemi.

Since
2.0.0
def join(right: Dataset[_], joinExprs: Column): DataFrame

Inner join with another DataFrame, using the given join expression.
```
// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
```
Since
2.0.0
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame

Equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
right
Right side of the join operation.
usingColumns
Names of the columns to join on. These columns must exist on both sides.
joinType
One of: inner, outer, left_outer, right_outer, leftsemi.

Since
2.0.0
def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame

Inner equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
```
// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
```
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
right
Right side of the join operation.
usingColumns
Names of the columns to join on. These columns must exist on both sides.

Since
2.0.0
def join(right: Dataset[_], usingColumn: String): DataFrame

Inner equi-join with another DataFrame using the given column.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
```
// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
```
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
right
Right side of the join operation.
usingColumn
Name of the column to join on. This column must exist on both sides.

Since
2.0.0
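The statement that the join column appears only once in the output can be checked by comparing the schemas of a column-name join and an expression join. The DataFrame contents and the column names other than user_id are assumptions made for this sketch.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "Ann"), (2, "Bob")).toDF("user_id", "name")
val df2 = Seq((1, 9.5), (3, 5.0)).toDF("user_id", "amount")

// Join on a column name: user_id appears once in the result.
val usingColumns = df1.join(df2, "user_id").columns.toSeq

// Join on an expression: both user_id columns are kept.
val exprColumns = df1.join(df2, df1("user_id") === df2("user_id")).columns.toSeq

spark.stop()
```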
def join(right: Dataset[_]): DataFrame

Cartesian join with another DataFrame.
Note that cartesian joins are very expensive without an extra filter that can be pushed down.
right
Right side of the join operation.

Since
2.0.0
def joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]

:: Experimental :: Joins this Dataset with other using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.
other
Right side of the join.
condition
Join expression.

Annotations
@Experimental()
Since
1.6.0
def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

:: Experimental :: Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.
This is similar to the relational join function, with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
other
Right side of the join.
condition
Join expression.
joinType
One of: inner, outer, left_outer, right_outer, leftsemi.

Annotations
@Experimental()
Since
1.6.0
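The nested result schema of joinWith can be seen in a minimal sketch. The User and Purchase case classes and their fields are hypothetical, introduced only for this illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record types, introduced only for this sketch.
case class User(id: Long, name: String)
case class Purchase(userId: Long, amount: Double)

val spark = SparkSession.builder().master("local[*]").appName("joinwith-sketch").getOrCreate()
import spark.implicits._

val users = Seq(User(1L, "Ann"), User(2L, "Bob")).toDS()
val purchases = Seq(Purchase(1L, 9.5), Purchase(1L, 20.0), Purchase(3L, 5.0)).toDS()

// Each result element is a (User, Purchase) pair, so the original objects stay intact.
val joined = users.joinWith(purchases, users("id") === purchases("userId"), "inner")

// The top-level columns are the two tuple slots, each holding a nested struct.
val topLevelColumns = joined.columns.toSeq
val pairs = joined.collect().toSeq

spark.stop()
```

Because both sides are kept as whole objects, identically named fields on the two sides would not collide in the result.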
def limit(n: Int): Dataset[T]

Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.

Since
2.0.0
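The distinction between limit (a transformation) and head (an action) can be sketched as follows. The use of spark.range as input is an assumption for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("limit-sketch").getOrCreate()

val ds = spark.range(100)

// limit returns another Dataset; nothing is computed until an action is called on it.
val limited = ds.limit(5)
val limitedCount = limited.count()

// head triggers execution immediately and returns a local array.
val firstRows = ds.head(5)
val firstRowsLength = firstRows.length

spark.stop()
```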
def map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]

:: Experimental :: (Java-specific) Returns a new Dataset that contains the result of applying func to each element.

Annotations
@Experimental()
Since
1.6.0
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]

:: Experimental :: (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.

Annotations
@Experimental()
Since
1.6.0
def mapPartitions[U](f: MapPartitionsFunction[T, U], encoder: Encoder[U]): Dataset[U]

:: Experimental :: (Java-specific) Returns a new Dataset that contains the result of applying f to each partition.

Annotations
@Experimental()
Since
1.6.0
def mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]

:: Experimental :: (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.

Annotations
@Experimental()
Since
1.6.0
def na: DataFrameNaFunctions

Returns a DataFrameNaFunctions for working with missing data.
```
// Dropping rows containing any null values.
ds.na.drop()
```
Since
1.6.0
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def orderBy(sortExprs: Column*): Dataset[T]

Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

Annotations
@varargs()
Since
2.0.0
def orderBy(sortCol: String, sortCols: String*): Dataset[T]

Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

Annotations
@varargs()
Since
2.0.0
def persist(newLevel: StorageLevel): Dataset.this.type

Persist this Dataset with the given storage level.
newLevel
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

Since
1.6.0
def persist(): Dataset.this.type

Persist this Dataset with the default storage level (MEMORY_AND_DISK).

Since
1.6.0
def printSchema(): Unit

Prints the schema to the console in a nice tree format.

Since
1.6.0
val queryExecution: QueryExecution
def randomSplit(weights: Array[Double]): Array[Dataset[T]]

Randomly splits this Dataset with the provided weights.
weights
Weights for splits; will be normalized if they don't sum to 1.

Since
2.0.0
def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]

Randomly splits this Dataset with the provided weights.
weights
Weights for splits; will be normalized if they don't sum to 1.
seed
Seed for sampling. For Java API, use randomSplitAsList.

Since
2.0.0
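A minimal sketch of randomSplit with weights that do not sum to 1. The weights, the seed, and the input built with spark.range are assumptions chosen for illustration; the sizes of the individual splits are random and are therefore not asserted here, only that the splits together cover the input.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("randomsplit-sketch").getOrCreate()

val ds = spark.range(1000)

// Weights 8.0 and 2.0 are normalized to 0.8 and 0.2; a fixed seed makes the split repeatable.
val splits = ds.randomSplit(Array(8.0, 2.0), 42L)
val numSplits = splits.length

// Every input row lands in exactly one of the splits.
val totalRows = splits.map(_.count()).sum

spark.stop()
```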
def randomSplitAsList(weights: Array[Double], seed: Long): java.util.List[Dataset[T]]

Returns a Java list that contains the Datasets produced by randomly splitting this Dataset with the provided weights.
weights
Weights for splits; will be normalized if they don't sum to 1.
seed
Seed for sampling.

Since
2.0.0
lazy val rdd: RDD[T]

Represents the content of the Dataset as an RDD of T.

Since
1.6.0
def reduce(func: ReduceFunction[T]): T

:: Experimental :: (Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

Annotations
@Experimental()
Since
1.6.0
def reduce(func: (T, T) ⇒ T): T

:: Experimental :: (Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

Annotations
@Experimental()
Since
1.6.0
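The Scala-specific reduce can be sketched with two functions that are both commutative and associative, as the description requires. The input values are an assumption for illustration; the lambda parameters are typed explicitly so that the Scala overload is selected unambiguously.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("reduce-sketch").getOrCreate()
import spark.implicits._

val numbers = Seq(1, 2, 3, 4).toDS()

// Addition is commutative and associative, so the result does not depend on evaluation order.
val sum = numbers.reduce((a: Int, b: Int) => a + b)

// max has the same properties.
val largest = numbers.reduce((a: Int, b: Int) => math.max(a, b))

spark.stop()
```

A function such as subtraction would not satisfy these requirements, and its result could vary with how the data is partitioned.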
def repartition(partitionExprs: Column*): Dataset[T]

Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

Annotations
@varargs()
Since
2.0.0
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]

Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

Annotations
@varargs()
Since
2.0.0
def repartition(numPartitions: Int): Dataset[T]

Returns a new Dataset that has exactly numPartitions partitions.

Since
1.6.0
def rollup(col1: String, cols: String*): RelationalGroupedDataset

Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
```
// Compute the average for all numeric columns, rolled up by department and group.
ds.rollup("department", "group").avg()

// Compute the max age and average salary, rolled up by department and gender.
ds.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
2.0.0
def rollup(cols: Column*): RelationalGroupedDataset

Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
// Compute the average for all numeric columns, rolled up by department and group.
ds.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolled up by department and gender.
ds.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
2.0.0
def sample(withReplacement: Boolean, fraction: Double): Dataset[T]

Returns a new Dataset by sampling a fraction of rows, using a random seed.
withReplacement
Sample with replacement or not.
fraction
Fraction of rows to generate.

Since
1.6.0
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]

Returns a new Dataset by sampling a fraction of rows, using the given seed.
withReplacement
Sample with replacement or not.
fraction
Fraction of rows to generate.
seed
Seed for sampling.
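A minimal sketch of seeded sampling, assuming a local SparkSession (the object name and the input range are made up for the example). With the same input, partitioning, and seed, two calls should return the same rows; note that fraction is an expected proportion, so the number of rows returned is not guaranteed to be exactly fraction times the input size.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object illustrating sample(withReplacement, fraction, seed).
object SampleSketch {
  def sampleTwice(): (Seq[Long], Seq[Long]) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-sketch")
      .getOrCreate()
    try {
      val ds = spark.range(0, 1000) // Dataset[java.lang.Long] holding 0..999

      // Same input and same seed: both samples are expected to be identical.
      val first = ds.sample(withReplacement = false, fraction = 0.1, seed = 42L)
        .collect().toSeq.map(_.longValue)
      val second = ds.sample(withReplacement = false, fraction = 0.1, seed = 42L)
        .collect().toSeq.map(_.longValue)
      (first, second)
    } finally spark.stop()
  }
}
```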

Since
1.6.0
def schema: StructType

Returns the schema of this Dataset.

Since
1.6.0
def select[U1, U2, U3, U4, U5](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4], c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]

:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.

Annotations
@Experimental()
Since
1.6.0
def select[U1, U2, U3, U4](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]

:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.

Annotations
@Experimental()
Since
1.6.0
def select[U1, U2, U3](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]

:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.

Annotations
@Experimental()
Since
1.6.0
def select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]

:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.

Annotations
@Experimental()
Since
1.6.0
def select[U1](c1: TypedColumn[T, U1])(implicit arg0: Encoder[U1]): Dataset[U1]

:: Experimental :: Returns a new Dataset by computing the given Column expression for each element.
```
val ds = Seq(1, 2, 3).toDS()
val newDS = ds.select(expr("value + 1").as[Int])
```
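The multi-column typed variants work the same way and return a Dataset of tuples. The sketch below assumes a local SparkSession; the object name and the expressions are illustrative only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical demo object for the two-column typed select.
object TypedSelectSketch {
  def pairs(): Seq[(Int, Int)] = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-select-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val ds = Seq(1, 2, 3).toDS()

      // Each argument is a TypedColumn, so the result type is Dataset[(Int, Int)].
      val both: Dataset[(Int, Int)] =
        ds.select(expr("value").as[Int], expr("value * 2").as[Int])
      both.collect().toSeq
    } finally spark.stop()
  }
}
```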
Annotations
@Experimental()
Since
1.6.0
def select(col: String, cols: String*): DataFrame

Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).
```
// The following two are equivalent:
ds.select("colA", "colB")
ds.select($"colA", $"colB")
```
Annotations
@varargs()
Since
2.0.0
def select(cols: Column*): DataFrame

Selects a set of column-based expressions.
```
ds.select($"colA", $"colB" + 1)
```
Annotations
@varargs()
Since
2.0.0
def selectExpr(exprs: String*): DataFrame

Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
```
// The following are equivalent:
ds.selectExpr("colA", "colB as newName", "abs(colC)")
ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
```
Annotations
@varargs()
Since
2.0.0
def selectUntyped(columns: TypedColumn[_, _]*): Dataset[_]

Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.

Attributes
protected
def show(numRows: Int, truncate: Boolean): Unit

Displays the Dataset in a tabular form. For example:
```
year  month AVG('Adj Close) MAX('Adj Close)
1980  12    0.503218        0.595103
1981  01    0.523289        0.570307
1982  02    0.436504        0.475256
1983  03    0.410516        0.442194
1984  04    0.450090        0.483521
```
numRows
Number of rows to show
truncate
Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be right-aligned.

Since
1.6.0
def show(truncate: Boolean): Unit

Displays the top 20 rows of Dataset in a tabular form.
truncate
Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be right-aligned.

Since
1.6.0
def show(): Unit

Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.

Since
1.6.0

def show(numRows: Int): Unit

Displays the Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:
```
year  month AVG('Adj Close) MAX('Adj Close)
1980  12    0.503218        0.595103
1981  01    0.523289        0.570307
1982  02    0.436504        0.475256
1983  03    0.410516        0.442194
1984  04    0.450090        0.483521
```
numRows
Number of rows to show

Since
1.6.0

def sort(sortExprs: Column*): Dataset[T]

Returns a new Dataset sorted by the given expressions. For example:
```
ds.sort($"col1", $"col2".desc)
```
Annotations
@varargs()
Since
2.0.0
def sort(sortCol: String, sortCols: String*): Dataset[T]

Returns a new Dataset sorted by the specified columns, all in ascending order.
```
// The following 3 are equivalent
ds.sort("sortcol")
ds.sort($"sortcol")
ds.sort($"sortcol".asc)
```
Annotations
@varargs()
Since
2.0.0
def sortWithinPartitions(sortExprs: Column*): Dataset[T]

Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).

Annotations
@varargs()
Since
2.0.0
def sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]

Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).

Annotations
@varargs()
Since
2.0.0
val sparkSession: SparkSession
lazy val sqlContext: SQLContext
def stat: DataFrameStatFunctions

Returns a DataFrameStatFunctions for working with statistical functions.
```
// Finding frequent items in column with name 'a'.
ds.stat.freqItems(Seq("a"))
```
Since
1.6.0
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def take(n: Int): Array[T]

Returns the first n rows in the Dataset.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

Since
1.6.0
def takeAsList(n: Int): List[T]

Returns the first n rows in the Dataset as a list.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

Since
1.6.0
def toDF(colNames: String*): DataFrame

Converts this strongly typed collection of data to a generic DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:
```
val rdd: RDD[(Int, String)] = ...
rdd.toDF()  // this implicit conversion creates a DataFrame with column names `_1` and `_2`
rdd.toDF("id", "name")  // this creates a DataFrame with column names "id" and "name"
```
Annotations
@varargs()
Since
2.0.0
def toDF(): DataFrame

Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.

Since
1.6.0
def toJSON: Dataset[String]

Returns the content of the Dataset as a Dataset of JSON strings.

Since
2.0.0
def toJavaRDD: JavaRDD[T]

Returns the content of the Dataset as a JavaRDD of Ts.

Since
1.6.0
def toLocalIterator(): Iterator[T]

Returns an iterator that contains all rows in this Dataset.
The iterator will consume as much memory as the largest partition in this Dataset.
Note: this results in multiple Spark jobs. If the input Dataset is the result of a wide transformation (e.g. a join with different partitioners), cache it first to avoid recomputation.

Since
2.0.0
def toString(): String

Definition Classes
Dataset → AnyRef → Any
def transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U]

Concise syntax for chaining custom transformations.
```
def featurize(ds: Dataset[T]): Dataset[U] = ...

ds
  .transform(featurize)
  .transform(...)
```
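A concrete version of the pattern above might look like the following sketch. It assumes a local SparkSession, and the helper names onlyEven and asLabel are hypothetical stand-ins for custom transformations.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical demo object chaining two custom transformations.
object TransformSketch {
  def labels(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transform-sketch")
      .getOrCreate()
    import spark.implicits._

    // Each step is an ordinary method from one Dataset to another.
    def onlyEven(ds: Dataset[Int]): Dataset[Int] = ds.filter($"value" % 2 === 0)
    def asLabel(ds: Dataset[Int]): Dataset[String] = ds.map(n => s"n=$n")

    try {
      Seq(1, 2, 3, 4).toDS()
        .transform(onlyEven)
        .transform(asLabel)
        .collect()
        .toSeq
    } finally spark.stop()
  }
}
```

Compared with nesting the calls as asLabel(onlyEven(ds)), the chained form reads in the order the steps are applied.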
Since
1.6.0
def union(other: Dataset[T]): Dataset[T]

Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL.
To do a SQL-style set union (which deduplicates elements), use this function followed by a distinct.
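The difference between the two behaviors can be seen in a small sketch, assuming a local SparkSession (the object name and the values are made up for the example):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object contrasting union with union followed by distinct.
object UnionSketch {
  def counts(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val a = Seq(1, 2, 2).toDS()
      val b = Seq(2, 3).toDS()

      val all = a.union(b)          // keeps duplicates, like UNION ALL
      val deduped = all.distinct()  // set-style union, like UNION
      (all.count(), deduped.count())
    } finally spark.stop()
  }
}
```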

Since
2.0.0
def unpersist(): Dataset.this.type

Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

Since
1.6.0
def unpersist(blocking: Boolean): Dataset.this.type

Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
blocking
Whether to block until all blocks are deleted.

Since
1.6.0
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
def where(conditionExpr: String): Dataset[T]

Filters rows using the given SQL expression.
```
peopleDs.where("age > 15")
```
Since
1.6.0
def where(condition: Column): Dataset[T]

Filters rows using the given condition. This is an alias for filter.
```
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
```
Since
1.6.0
def withColumn(colName: String, col: Column): DataFrame

Returns a new Dataset by adding a column or replacing the existing column that has the same name.

Since
2.0.0
def withColumnRenamed(existingName: String, newName: String): DataFrame

Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.
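The add, replace, and no-op behaviors of withColumn and withColumnRenamed can be checked with a short sketch. It assumes a local SparkSession; the object name, column names, and values are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object for withColumn and withColumnRenamed.
object WithColumnSketch {
  def columnLists(): (Seq[String], Seq[String], Seq[String]) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("with-column-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

      // A new name appends a column.
      val added = df.withColumn("idPlusOne", $"id" + 1)
      // An existing name replaces that column.
      val replaced = df.withColumn("id", $"id" * 10)
      // Renaming a column that does not exist leaves the schema unchanged.
      val unchanged = df.withColumnRenamed("missing", "other")

      (added.columns.toSeq, replaced.columns.toSeq, unchanged.columns.toSeq)
    } finally spark.stop()
  }
}
```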

Since
2.0.0
def write: DataFrameWriter[T]

:: Experimental :: Interface for saving the content of the non-streaming Dataset out into external storage.

Annotations
@Experimental()
Since
1.6.0
def writeStream: DataStreamWriter[T]

:: Experimental :: Interface for saving the content of the streaming Dataset out into external storage.

Annotations
@Experimental()
Since
2.0.0

Deprecated Value Members

def explode[A, B](inputColumn: String, outputColumn: String)(f: (A) ⇒ TraversableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame

(Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
Given that this is deprecated, as an alternative, you can explode columns either using functions.explode():
```
ds.select(explode(split('words, " ")).as("word"))
```
or flatMap():
```
ds.flatMap(_.words.split(" "))
```
Annotations
@deprecated
Deprecated
(Since version 2.0.0) use flatMap() or select() with functions.explode() instead
Since
2.0.0
def explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

(Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.
Given that this is deprecated, as an alternative, you can explode columns either using functions.explode() or flatMap(). The following example uses these alternatives to count the number of books that contain a given word:
```
case class Book(title: String, words: String)
val ds: Dataset[Book] = ...

val allWords = ds.select('title, explode(split('words, " ")).as("word"))

val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
```
Using flatMap() this can similarly be exploded as:
```
ds.flatMap(_.words.split(" "))
```
Annotations
@deprecated
Deprecated
(Since version 2.0.0) use flatMap() or select() with functions.explode() instead
Since
2.0.0
def registerTempTable(tableName: String): Unit

Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this Dataset.

Annotations
@deprecated
Deprecated
(Since version 2.0.0) Use createOrReplaceTempView(viewName) instead.
Since
1.6.0
def unionAll(other: Dataset[T]): Dataset[T]

Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL.
To do a SQL-style set union (which deduplicates elements), use this function followed by a distinct.

Annotations
@deprecated
Deprecated
(Since version 2.0.0) use union()
Since
2.0.0

Related Doc: package sql

class Dataset[T] extends Serializable

Instance Constructors

new Dataset(sqlContext: SQLContext, logicalPlan: LogicalPlan, encoder: Encoder[T])

new Dataset(sparkSession: SparkSession, logicalPlan: LogicalPlan, encoder: Encoder[T])

Value Members

final def !=(arg0: Any): Boolean

final def ##(): Int

final def ==(arg0: Any): Boolean

def agg(expr: Column, exprs: Column*): DataFrame

def agg(exprs: Map[String, String]): DataFrame

def agg(exprs: Map[String, String]): DataFrame

def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

def alias(alias: Symbol): Dataset[T]

def alias(alias: String): Dataset[T]

def apply(colName: String): Column

def as(alias: Symbol): Dataset[T]

def as(alias: String): Dataset[T]

def as[U](implicit arg0: Encoder[U]): Dataset[U]

final def asInstanceOf[T0]: T0

def cache(): Dataset.this.type

def clone(): AnyRef

def coalesce(numPartitions: Int): Dataset[T]

def col(colName: String): Column

def collect(): Array[T]

def collectAsList(): List[T]

def columns: Array[String]

def count(): Long

def createOrReplaceTempView(viewName: String): Unit

def createTempView(viewName: String): Unit

def cube(col1: String, cols: String*): RelationalGroupedDataset

def cube(cols: Column*): RelationalGroupedDataset

def describe(cols: String*): DataFrame

def distinct(): Dataset[T]

def drop(col: Column): DataFrame

def drop(colNames: String*): DataFrame

def drop(colName: String): DataFrame

def dropDuplicates(col1: String, cols: String*): Dataset[T]

def dropDuplicates(colNames: Array[String]): Dataset[T]

def dropDuplicates(colNames: Seq[String]): Dataset[T]

def dropDuplicates(): Dataset[T]

def dtypes: Array[(String, String)]

final def eq(arg0: AnyRef): Boolean

def equals(arg0: Any): Boolean

def except(other: Dataset[T]): Dataset[T]

def explain(): Unit

def explain(extended: Boolean): Unit

def filter(func: FilterFunction[T]): Dataset[T]

def filter(func: (T) ⇒ Boolean): Dataset[T]

def filter(conditionExpr: String): Dataset[T]

def filter(condition: Column): Dataset[T]

def finalize(): Unit

def first(): T

def flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U]

def flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]

def foreach(func: ForeachFunction[T]): Unit

def foreach(f: (T) ⇒ Unit): Unit

def foreachPartition(func: ForeachPartitionFunction[T]): Unit

def foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit

final def getClass(): Class[_]

def groupBy(col1: String, cols: String*): RelationalGroupedDataset

def groupBy(cols: Column*): RelationalGroupedDataset

def groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]

def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]

def hashCode(): Int

def head(): T

def head(n: Int): Array[T]

def inputFiles: Array[String]

def intersect(other: Dataset[T]): Dataset[T]

final def isInstanceOf[T0]: Boolean

def isLocal: Boolean

def isStreaming: Boolean

def javaRDD: JavaRDD[T]

def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

def join(right: Dataset[_], joinExprs: Column): DataFrame

def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame

def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame

def join(right: Dataset[_], usingColumn: String): DataFrame

def join(right: Dataset[_]): DataFrame