FileInputDStream (Spark 1.2.1 JavaDoc)

Object
- org.apache.spark.streaming.dstream.DStream<T>
  - org.apache.spark.streaming.dstream.InputDStream<scala.Tuple2<K,V>>
    - org.apache.spark.streaming.dstream.FileInputDStream<K,V,F>

All Implemented Interfaces:

java.io.Serializable, Logging
```
public class FileInputDStream<K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
extends InputDStream<scala.Tuple2<K,V>>
```
This class represents an input stream that monitors a Hadoop-compatible filesystem for new files and creates a stream out of them. It works as follows.
At each batch interval, the file system is queried for files in the given directory, and the new files detected are selected for that batch. Here "new" means files that became visible to readers during that time period. Some extra care is needed because files may become visible some time after they are created. For this purpose, this class remembers information about the files selected in past batches for a certain duration (the "remember window"), as shown in the figure below.

                         |<----- remember window ----->|
    ignore threshold --->|                             |<--- current batch time
                         |____.____.____.____.____.____|
                         |    |    |    |    |    |    |
    ---------------------|----|----|----|----|----|----|-----------------------> Time
                         |____|____|____|____|____|____|
                                remembered batches

The trailing end of the window is the "ignore threshold" and all files whose mod times are less than this threshold are assumed to have already been selected and are therefore ignored. Files whose mod times are within the "remember window" are checked against files that have already been selected. At a high level, this is how new files are identified in each batch - files whose mod times are greater than the ignore threshold and have not been considered within the remember window. See the documentation on the method isNewFile for more details.
This makes some assumptions about the underlying file system being monitored:
- The clock of the file system is assumed to be synchronized with the clock of the machine running the streaming app.
- If a file is to be visible in the directory listings, it must become visible within a certain duration of its mod time. This duration is the "remember window", which is set to 1 minute (see FileInputDStream.MIN_REMEMBER_DURATION). Otherwise, the file will never be selected, because its mod time will already be less than the ignore threshold when it becomes visible.
- Once a file is visible, its mod time cannot change. If it does change due to appends, the processing semantics are undefined.
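
The selection rule described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the class's actual code: `NewFileSelector` and its members are hypothetical names, times are plain millisecond values instead of Spark's `Time` and `Duration`, and paths are plain strings. It encodes the two checks stated in the description: a file's mod time must be greater than the ignore threshold, and the file must not have been selected already.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the "ignore threshold" and "remember window" checks.
// It is NOT the Spark implementation; it only mirrors the prose description.
final class NewFileSelector {
    private final long rememberWindowMs;
    // Paths selected in earlier batches (assumption: never pruned in this sketch).
    private final Set<String> recentlySelected = new HashSet<>();

    NewFileSelector(long rememberWindowMs) {
        this.rememberWindowMs = rememberWindowMs;
    }

    // Trailing end of the remember window for the given batch time.
    long ignoreThreshold(long batchTimeMs) {
        return batchTimeMs - rememberWindowMs;
    }

    // A file is new if its mod time is greater than the ignore threshold
    // and it has not been selected before.
    boolean isNewFile(String path, long modTimeMs, long batchTimeMs) {
        if (modTimeMs <= ignoreThreshold(batchTimeMs)) {
            return false; // assumed to have been selected already
        }
        return !recentlySelected.contains(path);
    }

    // Selects the file for this batch if it is new, and remembers it.
    boolean select(String path, long modTimeMs, long batchTimeMs) {
        if (!isNewFile(path, modTimeMs, batchTimeMs)) {
            return false;
        }
        recentlySelected.add(path);
        return true;
    }
}
```

With a 60-second window and a batch time of 120000 ms, the ignore threshold is 60000 ms, so a file with mod time 50000 is ignored while a file with mod time 100000 is selected once and rejected if seen again in a later batch. A fuller model would also forget selections that fall out of the remember window.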

See Also:
Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`class`	`FileInputDStream.FileInputDStreamCheckpointData` A custom version of the DStreamCheckpointData that stores names of Hadoop files as checkpoint data.

Constructor Summary

Constructors
Constructor and Description
`FileInputDStream(StreamingContext ssc_, String directory, scala.Function1<org.apache.hadoop.fs.Path,Object> filter, boolean newFilesOnly, scala.reflect.ClassTag<K> evidence$1, scala.reflect.ClassTag<V> evidence$2, scala.reflect.ClassTag<F> evidence$3)`

Method Summary

Methods
Modifier and Type	Method and Description
`scala.collection.mutable.HashMap<Time,String[]>`	`batchTimeToSelectedFiles()`
`static int`	`calculateNumBatchesToRemember(Duration batchDuration)` Calculates the number of most recent batches to remember, such that all files selected within at least the last MIN_REMEMBER_DURATION can be remembered.
`scala.Option<RDD<scala.Tuple2<K,V>>>`	`compute(Time validTime)` Finds the files that were modified since the last time this method was called and makes a union RDD out of them.
`static boolean`	`defaultFilter(org.apache.hadoop.fs.Path path)`
`void`	`start()` Method called to start receiving data.
`void`	`stop()` Method called to stop receiving data.

Methods inherited from class org.apache.spark.streaming.dstream.InputDStream
dependencies, isTimeValid, lastValidTime, slideDuration

Methods inherited from class org.apache.spark.streaming.dstream.DStream
cache, checkpoint, checkpointDuration, clearCheckpointData, context, count, countByValue, countByValueAndWindow, countByWindow, creationSite, filter, flatMap, foreach, foreach, foreachRDD, foreachRDD, generatedRDDs, generateJob, getCreationSite, getOrCompute, glom, graph, initialize, isInitialized, map, mapPartitions, mustCheckpoint, parentRememberDuration, persist, persist, print, reduce, reduceByWindow, reduceByWindow, register, remember, rememberDuration, repartition, restoreCheckpointData, saveAsObjectFiles, saveAsTextFiles, setContext, setGraph, slice, slice, ssc, storageLevel, transform, transform, transformWith, transformWith, union, updateCheckpointData, validate, window, window, zeroTime

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning

- Constructor Detail
  - FileInputDStream
```
public FileInputDStream(StreamingContext ssc_,
                String directory,
                scala.Function1<org.apache.hadoop.fs.Path,Object> filter,
                boolean newFilesOnly,
                scala.reflect.ClassTag<K> evidence$1,
                scala.reflect.ClassTag<V> evidence$2,
                scala.reflect.ClassTag<F> evidence$3)
```
- Method Detail
  - defaultFilter
```
public static boolean defaultFilter(org.apache.hadoop.fs.Path path)
```
  - calculateNumBatchesToRemember
```
public static int calculateNumBatchesToRemember(Duration batchDuration)
```
    Calculates the number of most recent batches to remember, such that all files selected within at least the last MIN_REMEMBER_DURATION can be remembered.
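
    The arithmetic implied by this description can be sketched as a ceiling division. This is an illustration under assumptions, not the method's actual body: `RememberWindowSketch` is a hypothetical class, durations are plain millisecond values rather than `Duration` objects, and the 1 minute value is taken from the class description above. Rounding up is assumed because the batches must cover "at least" the minimum duration.

```java
// Hypothetical sketch of the batch-count calculation; not the Spark source.
final class RememberWindowSketch {
    // Assumption: 1 minute, as stated in the class description.
    static final long MIN_REMEMBER_DURATION_MS = 60_000L;

    // Smallest number of batches whose total duration is at least
    // MIN_REMEMBER_DURATION_MS (integer ceiling division).
    static int calculateNumBatchesToRemember(long batchDurationMs) {
        if (batchDurationMs <= 0) {
            throw new IllegalArgumentException("batch duration must be positive");
        }
        return (int) ((MIN_REMEMBER_DURATION_MS + batchDurationMs - 1) / batchDurationMs);
    }
}
```

    Under these assumptions, a 1-second batch interval gives 60 batches, a 7-second interval gives 9 (63 seconds covered), and any interval of a minute or longer gives 1.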
  - batchTimeToSelectedFiles
```
public scala.collection.mutable.HashMap<Time,String[]> batchTimeToSelectedFiles()
```
  - start
```
public void start()
```
    Description copied from class: InputDStream
    
    Method called to start receiving data. Subclasses must implement this method.
    
    Specified by:
    
    start in class InputDStream<scala.Tuple2<K,V>>
  - stop
```
public void stop()
```
    Description copied from class: InputDStream
    
    Method called to stop receiving data. Subclasses must implement this method.
    
    Specified by:
    
    stop in class InputDStream<scala.Tuple2<K,V>>
  - compute
```
public scala.Option<RDD<scala.Tuple2<K,V>>> compute(Time validTime)
```
    Finds the files that were modified since the last time this method was called and makes a union RDD out of them. Note that this maintains the list of files that were processed at the latest modification time seen in the previous call to this method. This is necessary because the modification time returned by the FileStatus API appears to have only second-level granularity, so a new file may have the same modification time as the latest one seen in the previous call and yet not have been reported in that call.
    
    Specified by:
    
    compute in class DStream<scala.Tuple2<K,V>>
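
    The bookkeeping described for compute (remembering which files were already processed at the latest modification time) can be sketched as follows. This is a simplified model under assumptions, not the method's actual code: `LatestModTimeTracker` and `selectNewFiles` are hypothetical names, a directory listing is modeled as a map from path to modification time in milliseconds, and no RDDs are built.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of tracking files seen at the latest modification time.
final class LatestModTimeTracker {
    private long latestModTime = Long.MIN_VALUE;
    private Set<String> filesAtLatestModTime = new HashSet<>();

    // Returns the paths not yet reported, in sorted path order.
    List<String> selectNewFiles(Map<String, Long> listing) {
        List<String> selected = new ArrayList<>();
        long newLatest = latestModTime;
        Set<String> newAtLatest = new HashSet<>(filesAtLatestModTime);
        // TreeMap gives a deterministic iteration order for the sketch.
        for (Map.Entry<String, Long> e : new TreeMap<>(listing).entrySet()) {
            String path = e.getKey();
            long modTime = e.getValue();
            boolean newer = modTime > latestModTime;
            // Same timestamp as the previous latest, but not reported back then.
            boolean sameTimeUnseen =
                modTime == latestModTime && !filesAtLatestModTime.contains(path);
            if (!newer && !sameTimeUnseen) {
                continue;
            }
            selected.add(path);
            if (modTime > newLatest) {
                // A later timestamp starts a fresh "seen at latest time" set.
                newLatest = modTime;
                newAtLatest = new HashSet<>();
            }
            if (modTime == newLatest) {
                newAtLatest.add(path);
            }
        }
        latestModTime = newLatest;
        filesAtLatestModTime = newAtLatest;
        return selected;
    }
}
```

    In this model, if a first listing contains `a` (mod time 1000) and `b` (mod time 2000), both are returned. If a later listing also contains `c` with mod time 2000, only `c` is returned, even though its timestamp is not greater than the latest one already seen.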
