RegexTokenizer

- java.lang.Object
  - org.apache.spark.ml.PipelineStage
    - org.apache.spark.ml.Transformer
      - org.apache.spark.ml.UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
        - org.apache.spark.ml.feature.RegexTokenizer

All Implemented Interfaces:

java.io.Serializable, Logging, Params, Identifiable
```
public class RegexTokenizer
extends UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
```
:: Experimental :: A regex-based tokenizer that extracts tokens either by splitting the text on the provided regex pattern (the default) or by repeatedly matching the pattern against the text (if `gaps` is false). Optional parameters also allow filtering out tokens shorter than a minimum length. It returns an array of strings, which can be empty.
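
The behavior described above can be sketched in plain Java with `java.util.regex`. This is an illustration of the documented semantics, not Spark's source code: the class `RegexTokenizerSketch` and its `tokenize` helper are hypothetical names, and the arguments simply mirror the `pattern`, `gaps`, `toLowercase`, and `minTokenLength` params.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of the documented behavior; not Spark's implementation.
final class RegexTokenizerSketch {

    // Hypothetical helper: the arguments mirror the RegexTokenizer params.
    static List<String> tokenize(String text, String pattern, boolean gaps,
                                 boolean toLowercase, int minTokenLength) {
        // Lowercasing happens before the regex is applied.
        // Locale.ROOT is an assumption made here for deterministic output.
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        Pattern regex = Pattern.compile(pattern);
        List<String> tokens = new ArrayList<>();
        if (gaps) {
            // gaps = true: the pattern matches delimiters, so split on it.
            for (String piece : regex.split(input)) {
                tokens.add(piece);
            }
        } else {
            // gaps = false: the pattern matches the tokens themselves.
            Matcher matcher = regex.matcher(input);
            while (matcher.find()) {
                tokens.add(matcher.group());
            }
        }
        // Drop tokens shorter than the minimum length.
        tokens.removeIf(token -> token.length() < minTokenLength);
        return tokens;
    }
}
```

With the documented defaults, `RegexTokenizerSketch.tokenize("Hello  Spark ML", "\\s+", true, true, 1)` yields `hello`, `spark`, `ml`, while an empty input string yields an empty list, which is the "can be empty" case mentioned above.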

See Also:
Serialized Form

Constructor Summary

Constructors
Constructor and Description
`RegexTokenizer()`
`RegexTokenizer(java.lang.String uid)`

Method Summary

Methods
| Modifier and Type | Method and Description |
| --- | --- |
| `RegexTokenizer` | `copy(ParamMap extra)` Creates a copy of this instance with the same UID and some extra params. |
| `protected scala.Function1<java.lang.String,scala.collection.Seq<java.lang.String>>` | `createTransformFunc()` Creates the transform function using the given param map. |
| `BooleanParam` | `gaps()` Indicates whether regex splits on gaps (true) or matches tokens (false). |
| `boolean` | `getGaps()` |
| `int` | `getMinTokenLength()` |
| `java.lang.String` | `getPattern()` |
| `boolean` | `getToLowercase()` |
| `static RegexTokenizer` | `load(java.lang.String path)` |
| `IntParam` | `minTokenLength()` Minimum token length, >= 0. |
| `protected DataType` | `outputDataType()` Returns the data type of the output column. |
| `Param<java.lang.String>` | `pattern()` Regex pattern used to match delimiters if `gaps` is true or tokens if `gaps` is false. |
| `RegexTokenizer` | `setGaps(boolean value)` |
| `RegexTokenizer` | `setMinTokenLength(int value)` |
| `RegexTokenizer` | `setPattern(java.lang.String value)` |
| `RegexTokenizer` | `setToLowercase(boolean value)` |
| `BooleanParam` | `toLowercase()` Indicates whether to convert all characters to lowercase before tokenizing. |
| `java.lang.String` | `uid()` An immutable unique ID for the object and its derivatives. |
| `protected void` | `validateInputType(DataType inputType)` Validates the input type. |

Methods inherited from class org.apache.spark.ml.UnaryTransformer
setInputCol, setOutputCol, transform, transformSchema

Methods inherited from class org.apache.spark.ml.Transformer
transform, transform, transform

Methods inherited from class org.apache.spark.ml.PipelineStage
transformSchema

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning

Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn, validateParams

Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString

- Constructor Detail
  - RegexTokenizer
```
public RegexTokenizer(java.lang.String uid)
```
  - RegexTokenizer
```
public RegexTokenizer()
```
- Method Detail
  - load
```
public static RegexTokenizer load(java.lang.String path)
```
  - uid
```
public java.lang.String uid()
```
    Description copied from interface: Identifiable
    
    An immutable unique ID for the object and its derivatives.
    
    Returns:
    (undocumented)
  - minTokenLength
```
public IntParam minTokenLength()
```
    Minimum token length, >= 0. Default: 1, so that empty strings are not returned as tokens.
    
    Returns:
    (undocumented)
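
    A small plain-Java sketch of why a minimum length of 1 avoids empty strings: splitting text that starts with a delimiter produces a leading empty token. The class `MinTokenLengthSketch` and its helper are hypothetical and only mimic the documented filter; they are not part of Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java sketch of the length filter; not Spark's implementation.
final class MinTokenLengthSketch {

    // Hypothetical helper: splits on whitespace runs, then keeps only
    // tokens whose length is at least minTokenLength.
    static List<String> splitAndFilter(String text, int minTokenLength) {
        return Arrays.stream(text.split("\\s+"))
                .filter(token -> token.length() >= minTokenLength)
                .collect(Collectors.toList());
    }
}
```

    For the input `" a b"`, a minimum length of 0 keeps the leading empty string (three tokens), a minimum length of 1 keeps only `a` and `b`, and a minimum length of 2 leaves nothing.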
  - setMinTokenLength
```
public RegexTokenizer setMinTokenLength(int value)
```
  - getMinTokenLength
```
public int getMinTokenLength()
```
  - gaps
```
public BooleanParam gaps()
```
    Indicates whether regex splits on gaps (true) or matches tokens (false). Default: true
    
    Returns:
    (undocumented)
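
    The two modes give the pattern opposite roles, so a pattern written for one mode usually needs to change for the other. The sketch below contrasts them using plain `java.util.regex`; `GapsSketch`, `splitOn`, and `matchAll` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch contrasting the two gaps modes; not Spark's code.
final class GapsSketch {

    // Corresponds to gaps = true: the pattern describes the delimiters.
    static List<String> splitOn(String pattern, String text) {
        return new ArrayList<>(Arrays.asList(Pattern.compile(pattern).split(text)));
    }

    // Corresponds to gaps = false: the pattern describes the tokens.
    static List<String> matchAll(String pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(pattern).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

    For the text `"a-b c"`, splitting on `\s+` gives `a-b` and `c`, matching `\w+` gives `a`, `b`, and `c`, and matching `\s+` returns only the single space, which is rarely what is wanted.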
  - setGaps
```
public RegexTokenizer setGaps(boolean value)
```
  - getGaps
```
public boolean getGaps()
```
  - pattern
```
public Param<java.lang.String> pattern()
```
    Regex pattern used to match delimiters if `gaps` is true or tokens if `gaps` is false. Default: "\\s+" (the string literal for the regex `\s+`, one or more whitespace characters)
    
    Returns:
    (undocumented)
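
    Because the default is written as a string literal, the doubled backslash is source-code escaping rather than part of the regex. The following plain-Java sketch shows this; `PatternLiteralSketch` and its members are hypothetical names, not Spark API.

```java
import java.util.regex.Pattern;

// Plain-Java sketch of the default pattern literal; not Spark's code.
final class PatternLiteralSketch {

    // The literal "\\s+" is the three-character regex \s+ at runtime.
    static final String DEFAULT_PATTERN = "\\s+";

    // Hypothetical helper: splits on runs of any whitespace character.
    static String[] splitOnDefault(String text) {
        return Pattern.compile(DEFAULT_PATTERN).split(text);
    }
}
```

    Here `DEFAULT_PATTERN.length()` is 3, and `splitOnDefault("a\tb\n c")` returns `a`, `b`, and `c`, since tabs, newlines, and spaces all count as whitespace.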
  - setPattern
```
public RegexTokenizer setPattern(java.lang.String value)
```
  - getPattern
```
public java.lang.String getPattern()
```
  - toLowercase
```
public final BooleanParam toLowercase()
```
    Indicates whether to convert all characters to lowercase before tokenizing. Default: true
    
    Returns:
    (undocumented)
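
    Since lowercasing is applied before tokenizing, a case-sensitive pattern sees only the lowercased text. The plain-Java sketch below illustrates the consequence for a pattern that matches tokens; `ToLowercaseSketch` and `matchTokens` are hypothetical names, and the use of `Locale.ROOT` is an assumption made for deterministic output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of lowercasing before matching; not Spark's code.
final class ToLowercaseSketch {

    // Hypothetical helper: optionally lowercases, then collects all matches.
    static List<String> matchTokens(String text, String pattern, boolean toLowercase) {
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(pattern).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

    With the pattern `[A-Z]+` and the text `"Spark ML"`, lowercasing first leaves nothing to match, whereas disabling it yields `S` and `ML`.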
  - setToLowercase
```
public RegexTokenizer setToLowercase(boolean value)
```
  - getToLowercase
```
public boolean getToLowercase()
```
  - createTransformFunc
```
protected scala.Function1<java.lang.String,scala.collection.Seq<java.lang.String>> createTransformFunc()
```
    Description copied from class: UnaryTransformer
    
    Creates the transform function using the given param map. The input param map already accounts for the embedded param map, so the param values should be determined solely by the input param map.
    
    Specified by:
    
    createTransformFunc in class UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
    
    Returns:
    (undocumented)
  - validateInputType
```
protected void validateInputType(DataType inputType)
```
    Description copied from class: UnaryTransformer
    
    Validates the input type and throws an exception if it is invalid.
    
    Overrides:
    
    validateInputType in class UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
    
    Parameters:
    inputType - (undocumented)
  - outputDataType
```
protected DataType outputDataType()
```
    Description copied from class: UnaryTransformer
    
    Returns the data type of the output column.
    
    Specified by:
    
    outputDataType in class UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
    
    Returns:
    (undocumented)
  - copy
```
public RegexTokenizer copy(ParamMap extra)
```
    Description copied from interface: Params
    
    Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly.
    
    Specified by:
    
    copy in interface Params
    
    Overrides:
    
    copy in class UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
    
    Parameters:
    extra - (undocumented)
    
    Returns:
    (undocumented)
    See Also:
    defaultCopy()
