Specifies the schema of the actual data files. For partitioned relations, if one or more partition columns are contained in the data files, they should also appear in dataSchema.
1.4.0
Base paths of this relation. For partitioned relations, these should be the root directories of all partition directories.
1.4.0
Prepares a write job and returns an OutputWriterFactory. Client-side job preparation can be put here. For example, a user-defined output committer can be configured by setting the output committer class under the spark.sql.sources.outputCommitterClass key in the job configuration.
Note that the only side effect expected here is mutating job via its setters. In particular, since Spark SQL caches BaseRelation instances for performance, mutating the relation's internal state may cause unexpected behavior.
1.4.0
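A minimal sketch of such an override, assuming a hypothetical committer class com.example.MyOutputCommitter and a hypothetical MyOutputWriterFactory; the only side effect is on the job configuration, as recommended above.

  import org.apache.hadoop.mapreduce.Job

  // Sketch of prepareJobForWrite inside a HadoopFsRelation subclass.
  override def prepareJobForWrite(job: Job): OutputWriterFactory = {
    // The only expected side effect: mutate the Job via its configuration.
    job.getConfiguration.set(
      "spark.sql.sources.outputCommitterClass",
      "com.example.MyOutputCommitter") // hypothetical committer class
    new MyOutputWriterFactory(dataSchema) // hypothetical factory
  }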
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called once for each selected partition and builds an RDD[Row] containing all rows within that single partition.
requiredColumns - Required columns.
filters - Candidate filters to be pushed down. The actual filter should be the conjunction of all filters. The pushed-down filters are currently purely an optimization, as they will all be evaluated again; this means it is safe to use them with methods that produce false positives, such as filtering partitions based on a Bloom filter.
inputFiles - For a non-partitioned relation, this contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.
1.4.0
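A hedged sketch of how a subclass might implement this overload; canSkipFile and readFile are hypothetical helpers of an imagined file format, and the filters are treated purely as an optimization, matching the contract above.

  import org.apache.hadoop.fs.FileStatus
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.sources.Filter

  // Sketch of the most general buildScan overload in a HadoopFsRelation subclass.
  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter],
      inputFiles: Array[FileStatus]): RDD[Row] = {
    // Filters are a best-effort hint (e.g. skip whole files using statistics);
    // correctness does not depend on them, since Spark re-evaluates them later.
    val selected = inputFiles.filterNot(file => canSkipFile(file, filters))
    val paths = selected.map(_.getPath.toString)
    sqlContext.sparkContext
      .parallelize(paths, math.max(paths.length, 1))
      .flatMap(path => readFile(path, requiredColumns)) // emits only the requested columns
  }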
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called once for each selected partition and builds an RDD[Row] containing all rows within that single partition.
requiredColumns - Required columns.
inputFiles - For a non-partitioned relation, this contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.
1.4.0
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called once for each selected partition and builds an RDD[Row] containing all rows within that single partition.
inputFiles - For a non-partitioned relation, this contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.
1.4.0
Whether the objects in a Row need to be converted to the internal representation, for example: java.lang.String -> UTF8String, java.math.BigDecimal -> Decimal.
Note: The internal representation is not stable across releases, so data sources outside of Spark SQL should leave this as true.
1.4.0
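For instance, a source implemented outside of Spark SQL would simply keep the default:

  // External data sources should leave this as true: the internal row format
  // (e.g. UTF8String for strings) is not a stable API across Spark releases.
  override def needConversion: Boolean = true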
Partition columns. These can either be defined by userDefinedPartitionColumns or automatically discovered. Note that they should always be nullable.
1.4.0
Schema of this relation. It consists of the columns appearing in dataSchema plus all partition columns not already appearing in dataSchema.
1.4.0
Returns an estimated size of this relation in bytes. This information is used by the planner to decide when it is safe to broadcast a relation, and it can be overridden by sources that know the size ahead of time. By default, the system assumes that tables are too large to broadcast. This method will be called multiple times during query planning, so it should not perform expensive operations on each invocation.
Note that it is always better to overestimate the size than to underestimate it, because underestimation could lead to suboptimal execution plans (e.g. broadcasting a very large table).
1.3.0
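As a sketch, a file-based source that knows its input files up front could sum their lengths once and cache the result so that repeated planner calls stay cheap; inputFileStatuses is an assumed field holding the relation's FileStatus objects.

  // Computed once and cached, since the planner may ask several times.
  // Summing file lengths tends to overestimate, which is the safe direction.
  private lazy val totalInputSize: Long = inputFileStatuses.map(_.getLen).sum

  override def sizeInBytes: Long = totalInputSize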
Optional user-defined partition columns.
1.4.0
::Experimental:: A BaseRelation that provides much of the common code required for formats that store their data in an HDFS-compatible filesystem.
For the read path, similar to PrunedFilteredScan, it can eliminate unneeded columns and filter using selected predicates before producing an RDD containing all matching tuples as Row objects. In addition, when reading from Hive-style partitioned tables stored in file systems, it is able to discover partitioning information from the paths of input directories and perform partition pruning before it starts reading the data. Subclasses of HadoopFsRelation must override one of the three buildScan methods to implement the read path.
For the write path, it provides the ability to write to both non-partitioned and partitioned tables. The directory layout of partitioned tables is compatible with Hive.
1.4.0
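Putting the pieces together, a minimal read-only subclass might look like the sketch below. The HadoopFsRelation and OutputWriterFactory surface matches the descriptions above; the CSV-like format, the fixed schema, and the parseLine helper are assumptions made for illustration, and the write path is left unimplemented.

  import org.apache.hadoop.fs.FileStatus
  import org.apache.hadoop.mapreduce.Job
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{HadoopFsRelation, OutputWriterFactory}
  import org.apache.spark.sql.types._

  // Hypothetical relation over simple comma-separated text files.
  class SimpleTextRelation(
      override val paths: Array[String])(
      @transient val sqlContext: SQLContext)
    extends HadoopFsRelation {

    // Schema of the actual data files; partition columns are discovered from paths.
    override def dataSchema: StructType =
      StructType(
        StructField("key", IntegerType, nullable = true) ::
        StructField("value", StringType, nullable = true) :: Nil)

    // One of the three buildScan overloads; this one only sees the input files.
    override def buildScan(inputFiles: Array[FileStatus]): RDD[Row] = {
      val files = inputFiles.map(_.getPath.toString)
      sqlContext.sparkContext
        .textFile(files.mkString(","))
        .map(parseLine)
    }

    // Assumed two-column "key,value" layout.
    private def parseLine(line: String): Row = {
      val Array(k, v) = line.split(",", 2)
      Row(k.toInt, v)
    }

    // Write path omitted; a real source would return a working factory here.
    override def prepareJobForWrite(job: Job): OutputWriterFactory =
      throw new UnsupportedOperationException("write path not implemented in this sketch")
  }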