text

Value Members

object docs

Stub object necessary due to https://issues.scala-lang.org/browse/SI-8124
Documentation for ops.core.dataframe.text can be found at software.uncharted.sparkpipe.ops.core.dataframe.text

Attributes
protected[this]
See also
software.uncharted.sparkpipe.ops.core.dataframe.text
def includeRowTermFilter(stringCol: String, pattern: Regex)(input: DataFrame): DataFrame

Applies regex matching to text in a field, and includes rows with hits.
stringCol
Column containing text to match against.
pattern
Regex pattern describing text matches.
input
DataFrame from previous stage
returns
Filtered DataFrame
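
The row-level behaviour can be illustrated without Spark. This is a minimal sketch on a plain Scala collection, not the library's implementation. The `Row` case class, the `includeRows` helper, and the sample data are hypothetical, and it assumes a "hit" means the pattern is found anywhere in the text.

```scala
import scala.util.matching.Regex

object IncludeRowRegexSketch {
  // Hypothetical stand-in for a DataFrame row with one text column.
  final case class Row(id: Int, text: String)

  // Keep rows whose text contains at least one match of the pattern
  // (assumption: a partial match anywhere in the string counts as a hit).
  def includeRows(pattern: Regex)(rows: Seq[Row]): Seq[Row] =
    rows.filter(r => pattern.findFirstIn(r.text).isDefined)

  val rows = Seq(Row(1, "spark pipeline"), Row(2, "plain text"), Row(3, "sparkpipe ops"))
  val kept = includeRows("spark\\w*".r)(rows)
}
```

Like the documented op, the helper takes its configuration in the first parameter list and the data in the second, so it can be partially applied and passed along as a pipeline stage.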
def includeRowTermFilter(stringCol: String, terms: Seq[String], caseSensitive: Boolean = false)(input: DataFrame): DataFrame

Checks for keyword matches in a text field and keeps rows with hits.
stringCol
Column containing text to match against.
terms
Keywords to match.
caseSensitive
True if matching should be case sensitive, False otherwise.
input
DataFrame from previous stage
returns
Filtered DataFrame
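
The effect of `caseSensitive` can be sketched on plain strings. This is an illustration only: it assumes a keyword matches when it occurs as a substring of the text, which the documentation above does not confirm, and the `includeRows` helper is hypothetical.

```scala
object IncludeRowTermsSketch {
  // Assumption: a row is a hit when its text contains any term as a substring.
  def includeRows(terms: Seq[String], caseSensitive: Boolean = false)(texts: Seq[String]): Seq[String] = {
    // With caseSensitive = false, both the text and the terms are lower-cased first.
    def normalize(s: String): String = if (caseSensitive) s else s.toLowerCase
    val needles = terms.map(normalize)
    texts.filter { t =>
      val haystack = normalize(t)
      needles.exists(n => haystack.contains(n))
    }
  }

  val texts = Seq("Spark is fast", "spark pipelines", "Plain text")
  val insensitive = includeRows(Seq("spark"))(texts)
  val sensitive = includeRows(Seq("spark"), caseSensitive = true)(texts)
}
```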
def includeTermFilter(arrayCol: String, includePattern: Regex)(input: DataFrame): DataFrame

Pipeline op to filter an ArrayType(StringType) column down to the terms that match a given pattern.
arrayCol
The name of an ArrayType(StringType) column in the input DataFrame
includePattern
A Regex pattern describing words to include
input
Input pipeline data to filter.
returns
Transformed pipeline data, with non-matching words removed from the specified column
def includeTermFilter(arrayCol: String, includeTerms: Set[String])(input: DataFrame): DataFrame

Pipeline op to filter an ArrayType(StringType) column down to a set of terms of interest.
arrayCol
The name of an ArrayType(StringType) column in the input DataFrame
includeTerms
A Set[String] of words to filter to
input
Input pipeline data to filter.
returns
Transformed pipeline data, with the specified column filtered down to terms of interest
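
Unlike the row filters, the term filters keep every row and trim the array inside it. A minimal sketch on nested Scala collections follows. The helper names are hypothetical, and the pattern variant assumes a term must match the pattern in full.

```scala
import scala.util.matching.Regex

object IncludeTermSketch {
  // Each inner Seq stands in for one row's Array[String] cell.
  def includeByPattern(pattern: Regex)(rows: Seq[Seq[String]]): Seq[Seq[String]] =
    // Assumption: the whole term must match, not just a part of it.
    rows.map(_.filter(term => pattern.pattern.matcher(term).matches()))

  def includeByTerms(terms: Set[String])(rows: Seq[Seq[String]]): Seq[Seq[String]] =
    rows.map(_.filter(terms.contains))

  val rows = Seq(Seq("data", "the", "dataset"), Seq("a", "model"))
  val byPattern = includeByPattern("data.*".r)(rows)
  val byTerms = includeByTerms(Set("data", "model"))(rows)
}
```

A row whose terms are all filtered out is left with an empty array rather than being dropped.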
def mapTerms[O](arrayCol: String, mapFcn: (String) ⇒ O)(input: DataFrame)(implicit tag: scala.reflect.api.JavaUniverse.TypeTag[O]): DataFrame

Apply a transformation to every String in an Array[String] column.
arrayCol
The name of an ArrayType(StringType) column in the input DataFrame
mapFcn
A transformation function String => O
input
Input pipeline data to transform
returns
Transformed pipeline data, with the mapFcn applied to every term in every row of the Array[String] column
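
The shape of the transformation can be sketched on nested collections. This hypothetical helper is not the library's code, and it omits the implicit `TypeTag` because plain collections do not need runtime type information for the output element type.

```scala
object MapTermsSketch {
  // Apply mapFcn to every term in every row; the element type changes from String to O.
  def mapTerms[O](mapFcn: String => O)(rows: Seq[Seq[String]]): Seq[Seq[O]] =
    rows.map(_.map(mapFcn))

  val rows = Seq(Seq("alpha", "be"), Seq("gamma"))
  val lengths = mapTerms(_.length)(rows)
  val upper = mapTerms(_.toUpperCase)(rows)
}
```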
def removeAll(stringCol: String, pattern: Regex)(input: DataFrame): DataFrame

Removes all occurrences of pattern in a String column
stringCol
the name of a String column in the input DataFrame
pattern
a regular expression
input
Input pipeline data to transform
returns
Transformed pipeline data, with instances of the given pattern in input removed
def replaceAll(stringCol: String, pattern: Regex, sub: String)(input: DataFrame): DataFrame

Replaces all occurrences of pattern in a String column with sub
stringCol
the name of a String column in the input DataFrame
pattern
a regular expression
sub
the string to substitute for the pattern
input
Input pipeline data to transform
returns
Transformed pipeline data, with instances of the given pattern in input replaced with sub
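
On plain strings, `removeAll` can be read as `replaceAll` with an empty substitution. The sketch below uses hypothetical helpers and treats `sub` as a literal string. Whether the library interprets group references such as `$1` in `sub` is not stated above.

```scala
import scala.util.matching.Regex

object ReplaceSketch {
  // quoteReplacement makes `sub` literal, so "$" and "\" carry no special meaning
  // (an assumption of this sketch, not a documented property of the op).
  def replaceAll(pattern: Regex, sub: String)(texts: Seq[String]): Seq[String] =
    texts.map(t => pattern.replaceAllIn(t, Regex.quoteReplacement(sub)))

  // Removing is replacing every match with the empty string.
  def removeAll(pattern: Regex)(texts: Seq[String]): Seq[String] =
    replaceAll(pattern, "")(texts)

  val texts = Seq("call 555-1234 now", "no digits")
  val masked = replaceAll("\\d".r, "#")(texts)
  val stripped = removeAll("\\d".r)(texts)
}
```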
def split(stringCol: String, delimiter: String = "\\s+")(input: DataFrame): DataFrame

Splits a String column into an Array[String] column using a delimiter (whitespace, by default)
stringCol
the name of a String column in the input DataFrame
delimiter
a delimiter to split the String column on
input
Input pipeline data to transform
returns
Transformed pipeline data, with the given string column split on the delimiter
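
The default `"\\s+"` suggests the delimiter is a regular expression. Under that assumption, the splitting can be sketched with `String.split` on plain strings. The helper is hypothetical.

```scala
object SplitSketch {
  // Assumption: the delimiter is interpreted as a regex, as String.split does.
  def split(delimiter: String = "\\s+")(texts: Seq[String]): Seq[Seq[String]] =
    texts.map(_.split(delimiter).toSeq)

  val texts = Seq("a  b\tc", "x,y")
  val byWhitespace = split()(texts) // default: runs of whitespace
  val byComma = split(",")(texts)
}
```

The resulting array column is the input the term-level ops (`includeTermFilter`, `stopTermFilter`, `mapTerms`, `uniqueTerms`) expect.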
def stopRowTermFilter(stringCol: String, pattern: Regex)(input: DataFrame): DataFrame

Applies regex matching to text in a field, and excludes rows with hits.
stringCol
Column containing text to match against.
pattern
Regex pattern describing text matches.
input
DataFrame from previous stage
returns
Filtered DataFrame
def stopRowTermFilter(stringCol: String, terms: Seq[String], caseSensitive: Boolean = false)(input: DataFrame): DataFrame

Checks for keyword matches in a text field and removes rows with hits.
stringCol
Column containing text to match against.
terms
Keywords to match.
caseSensitive
True if matching should be case sensitive, False otherwise.
input
DataFrame from previous stage
returns
Filtered DataFrame
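
`stopRowTermFilter` is the complement of `includeRowTermFilter`: rows with hits are dropped instead of kept. One way to sketch both overloads is to turn the keyword list into a single pattern. The `termsPattern` and `stopRows` helpers are hypothetical, and matching keywords as literal substrings is an assumption.

```scala
import scala.util.matching.Regex

object StopRowSketch {
  // Build one alternation pattern from literal keywords (hypothetical helper).
  def termsPattern(terms: Seq[String], caseSensitive: Boolean = false): Regex = {
    val alternation = terms.map(t => Regex.quote(t)).mkString("(?:", "|", ")")
    (if (caseSensitive) alternation else "(?i)" + alternation).r
  }

  // Drop rows whose text contains at least one match of the pattern.
  def stopRows(pattern: Regex)(texts: Seq[String]): Seq[String] =
    texts.filterNot(t => pattern.findFirstIn(t).isDefined)

  val texts = Seq("buy cheap pills", "meeting at noon", "CHEAP flights")
  val kept = stopRows(termsPattern(Seq("cheap", "pills")))(texts)
  val keptCaseSensitive = stopRows(termsPattern(Seq("cheap"), caseSensitive = true))(texts)
}
```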
def stopTermFilter(arrayCol: String, stopPattern: Regex)(input: DataFrame): DataFrame

Pipeline op to remove terms matching a stop pattern from an ArrayType(StringType) column.
arrayCol
The name of an ArrayType(StringType) column in the input DataFrame
stopPattern
A Regex pattern describing words to remove
input
Input pipeline data to filter.
returns
Transformed pipeline data, with matching words removed from the specified column
def stopTermFilter(arrayCol: String, stopTerms: Set[String])(input: DataFrame): DataFrame

Pipeline op to remove stop words from an ArrayType(StringType) column.
arrayCol
The name of an ArrayType(StringType) column in the input DataFrame
stopTerms
A Set[String] of words to remove
input
Input pipeline data to filter.
returns
Transformed pipeline data, with stop words removed from the specified column
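
Stop-term filtering can be sketched on nested collections in the same way as the include filters, with the predicate inverted. The helpers are hypothetical, and the pattern variant assumes a term must match the pattern in full.

```scala
import scala.util.matching.Regex

object StopTermSketch {
  // Each inner Seq stands in for one row's Array[String] cell.
  def stopByTerms(stopTerms: Set[String])(rows: Seq[Seq[String]]): Seq[Seq[String]] =
    rows.map(_.filterNot(stopTerms.contains))

  // Assumption: a term is removed only when the whole term matches the pattern.
  def stopByPattern(pattern: Regex)(rows: Seq[Seq[String]]): Seq[Seq[String]] =
    rows.map(_.filterNot(term => pattern.pattern.matcher(term).matches()))

  val rows = Seq(Seq("the", "quick", "fox"), Seq("a", "dog42"))
  val noStopWords = stopByTerms(Set("the", "a"))(rows)
  val noDigitTerms = stopByPattern(".*\\d.*".r)(rows)
}
```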
def uniqueTerms(arrayCol: String)(input: DataFrame): Map[String, Int]

Produces a Map[String,Int] of unique terms from an Array[String] column along with associated counts
arrayCol
The name of an ArrayType(StringType) column in the input DataFrame
input
Input pipeline data to analyze
returns
the Map[String, Int] of unique terms and their counts
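
Because the result is a `Map[String, Int]` rather than a `DataFrame`, this op is a natural final stage of a pipeline. The counting can be sketched on nested collections. The helper is hypothetical, and it assumes every occurrence of a term is counted, including repeats within a single row.

```scala
object UniqueTermsSketch {
  // Flatten all rows into one term stream, then count occurrences per term.
  def uniqueTerms(rows: Seq[Seq[String]]): Map[String, Int] =
    rows.flatten.groupBy(identity).map { case (term, hits) => term -> hits.size }

  val rows = Seq(Seq("a", "b", "a"), Seq("b", "c"))
  val counts = uniqueTerms(rows)
}
```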

