Package

software.uncharted.sparkpipe.ops.core.dataframe

text

package text

Common pipeline operations for dealing with textual data

Linear Supertypes
AnyRef, Any

Value Members

  1. object docs

    Stub object necessary due to https://issues.scala-lang.org/browse/SI-8124

    Documentation for ops.core.dataframe.text can be found at software.uncharted.sparkpipe.ops.core.dataframe.text

    Attributes

    protected[this]

    See also

    software.uncharted.sparkpipe.ops.core.dataframe.text

  2. def includeRowTermFilter(stringCol: String, pattern: Regex)(input: DataFrame): DataFrame

    Applies regex matching to text in a field, and includes rows with hits.

    stringCol

    Column containing text to match against.

    pattern

    Regex pattern describing text matches.

    input

    DataFrame from previous stage

    returns

    Filtered DataFrame
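
    The row-level behaviour described above can be sketched on plain Scala collections. This is an illustration of the documented semantics, not the library's implementation: a Seq[String] stands in for the text column, the object and helper names are hypothetical, and it assumes a "hit" means the pattern is found anywhere in the text.

    ```scala
    import scala.util.matching.Regex

    object RowRegexFilterSketch {
      // Hypothetical stand-in for the values of a text column, one entry per row.
      val rows: Seq[String] = Seq("spark pipeline", "hello world", "Spark streaming")

      // Keep only the rows whose text contains at least one match of the pattern.
      // Assumption: a hit is a match anywhere in the text, not a full-string match.
      def includeRows(pattern: Regex)(input: Seq[String]): Seq[String] =
        input.filter(text => pattern.findFirstIn(text).isDefined)

      // (?i) makes the pattern case-insensitive, so both "spark" and "Spark" hit.
      val kept: Seq[String] = includeRows("(?i)spark".r)(rows)
    }
    ```

    Per the signature above, the equivalent library call would take the shape includeRowTermFilter("text", pattern)(df), where "text" is a hypothetical column name.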

  3. def includeRowTermFilter(stringCol: String, terms: Seq[String], caseSensitive: Boolean = false)(input: DataFrame): DataFrame

    Checks for keyword matches in a text field and keeps rows with hits.

    stringCol

    Column containing text to match against.

    terms

    Keywords to match.

    caseSensitive

    true if matching should be case-sensitive, false otherwise. Defaults to false.

    input

    DataFrame from previous stage

    returns

    Filtered DataFrame
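
    The effect of the caseSensitive flag can be illustrated with a small sketch on plain collections. The helper below is hypothetical and assumes that a keyword "matches" when it occurs as a substring of the text; the library may tokenize or match differently.

    ```scala
    object KeywordRowFilterSketch {
      // Hypothetical stand-in for the values of a text column, one entry per row.
      val rows: Seq[String] = Seq("Spark is fast", "pandas is popular", "I use SPARK daily")

      // Keep rows containing at least one of the terms.
      // Assumption: matching is plain substring containment.
      def includeRows(terms: Seq[String], caseSensitive: Boolean = false)(
          input: Seq[String]): Seq[String] = {
        // When matching is case-insensitive, compare lower-cased terms and text.
        val needles = if (caseSensitive) terms else terms.map(_.toLowerCase)
        input.filter { text =>
          val haystack = if (caseSensitive) text else text.toLowerCase
          needles.exists(n => haystack.contains(n))
        }
      }

      // Default (case-insensitive): "Spark" and "SPARK" both count as hits.
      val insensitive: Seq[String] = includeRows(Seq("spark"))(rows)

      // Case-sensitive: no row contains the exact lower-case string "spark".
      val sensitive: Seq[String] = includeRows(Seq("spark"), caseSensitive = true)(rows)
    }
    ```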

  4. def includeTermFilter(arrayCol: String, includePattern: Regex)(input: DataFrame): DataFrame

    Pipeline op to filter an Array[String] column down to the terms that match a given pattern

    arrayCol

    The name of an ArrayType(StringType) column in the input DataFrame

    includePattern

    A Regex pattern describing words to include

    input

    Input pipeline data to filter.

    returns

    Transformed pipeline data, with non-matching words removed from the specified column

  5. def includeTermFilter(arrayCol: String, includeTerms: Set[String])(input: DataFrame): DataFrame

    Pipeline op to filter an Array[String] column down to terms of interest

    arrayCol

    The name of an ArrayType(StringType) column in the input DataFrame

    includeTerms

    A Set[String] of words to filter to

    input

    Input pipeline data to filter.

    returns

    Transformed pipeline data, with the specified column filtered down to terms of interest
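
    Unlike the row filters, this op keeps every row and narrows the terms inside each array. A minimal sketch on plain collections, assuming exact, case-sensitive term comparison (the names and sample data are hypothetical):

    ```scala
    object IncludeTermsSketch {
      // Hypothetical stand-in for an ArrayType(StringType) column: one Seq per row.
      val rows: Seq[Seq[String]] = Seq(Seq("the", "quick", "fox"), Seq("lazy", "dog"))

      // Within each row, keep only the terms that appear in the include set.
      // The number of rows is unchanged; only the array contents shrink.
      def includeTerms(keep: Set[String])(input: Seq[Seq[String]]): Seq[Seq[String]] =
        input.map(terms => terms.filter(keep.contains))

      val filtered: Seq[Seq[String]] = includeTerms(Set("fox", "dog", "cat"))(rows)
    }
    ```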

  6. def mapTerms[O](arrayCol: String, mapFcn: (String) ⇒ O)(input: DataFrame)(implicit tag: scala.reflect.api.JavaUniverse.TypeTag[O]): DataFrame

    Apply a transformation to every String in an Array[String] column.

    arrayCol

    The name of an ArrayType(StringType) column in the input DataFrame

    mapFcn

    A transformation function String => O

    input

    Input pipeline data to transform

    returns

    Transformed pipeline data, with the mapFcn applied to every term in every row of the Array[String] column
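
    Because the output type O is generic, the resulting column need not contain strings. The sketch below models this on plain collections with a hypothetical helper; the implicit TypeTag from the real signature is omitted because plain collections do not need it.

    ```scala
    object MapTermsSketch {
      // Hypothetical stand-in for an ArrayType(StringType) column: one Seq per row.
      val rows: Seq[Seq[String]] = Seq(Seq("a", "bb"), Seq("ccc"))

      // Apply mapFcn to every term of every row, preserving row and term order.
      def mapTerms[O](mapFcn: String => O)(input: Seq[Seq[String]]): Seq[Seq[O]] =
        input.map(terms => terms.map(mapFcn))

      // String => Int: each term is replaced by its length.
      val lengths: Seq[Seq[Int]] = mapTerms[Int](_.length)(rows)

      // String => String: each term is upper-cased.
      val upper: Seq[Seq[String]] = mapTerms[String](_.toUpperCase)(rows)
    }
    ```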

  7. def removeAll(stringCol: String, pattern: Regex)(input: DataFrame): DataFrame

    Removes all occurrences of pattern in a String column

    stringCol

    the name of a String column in the input DataFrame

    pattern

    a regular expression

    input

    Input pipeline data to transform

    returns

    Transformed pipeline data, with instances of the given pattern in input removed

  8. def replaceAll(stringCol: String, pattern: Regex, sub: String)(input: DataFrame): DataFrame

    Replaces all occurrences of pattern in a String column with sub

    stringCol

    the name of a String column in the input DataFrame

    pattern

    a regular expression

    sub

    the string to substitute for the pattern

    input

    Input pipeline data to transform

    returns

    Transformed pipeline data, with instances of the given pattern in input replaced with sub
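
    A minimal sketch of the substitution on plain collections, using scala.util.matching.Regex directly. The helper and sample data are hypothetical, and whether the library handles special characters in sub the same way is an assumption. Passing an empty sub gives the same result that removeAll describes.

    ```scala
    import scala.util.matching.Regex

    object ReplaceAllSketch {
      // Hypothetical stand-in for the values of a String column, one entry per row.
      val rows: Seq[String] = Seq("call 555-1234 now", "no digits here")

      // Replace every match of the pattern in every row with sub.
      // Note: Regex.replaceAllIn interprets `$` and `\` in the replacement specially.
      def replaceAll(pattern: Regex, sub: String)(input: Seq[String]): Seq[String] =
        input.map(text => pattern.replaceAllIn(text, sub))

      // Each run of digits becomes a single '#'.
      val masked: Seq[String] = replaceAll("[0-9]+".r, "#")(rows)

      // An empty substitution removes the matches entirely.
      val stripped: Seq[String] = replaceAll("[0-9]+".r, "")(rows)
    }
    ```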

  9. def split(stringCol: String, delimiter: String = "\\s+")(input: DataFrame): DataFrame

    Splits a String column into an Array[String] column using a delimiter (whitespace, by default)

    stringCol

    the name of a String column in the input DataFrame

    delimiter

    a delimiter to split the String column on. Defaults to "\\s+", i.e. runs of whitespace.

    input

    Input pipeline data to transform

    returns

    Transformed pipeline data, with the given string column split on the delimiter
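
    The default delimiter "\\s+" is a regular expression, which suggests the delimiter is interpreted the way java.lang.String.split interprets it. The sketch below makes that assumption explicit on plain collections; the helper and sample data are hypothetical.

    ```scala
    object SplitSketch {
      // Hypothetical stand-in for the values of a String column, one entry per row.
      val rows: Seq[String] = Seq("the quick  brown fox", "hello\tworld")

      // Split each row into terms.
      // Assumption: the delimiter is treated as a regex, as in String.split.
      def split(delimiter: String = "\\s+")(input: Seq[String]): Seq[Seq[String]] =
        input.map(text => text.split(delimiter).toSeq)

      // Default delimiter: consecutive spaces and tabs act as one separator.
      val tokens: Seq[Seq[String]] = split()(rows)

      // Explicit delimiter: split on commas instead.
      val csv: Seq[Seq[String]] = split(",")(Seq("a,b,c"))
    }
    ```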

  10. def stopRowTermFilter(stringCol: String, pattern: Regex)(input: DataFrame): DataFrame

    Applies regex matching to text in a field, and excludes rows with hits.

    stringCol

    Column containing text to match against.

    pattern

    Regex pattern describing text matches.

    input

    DataFrame from previous stage

    returns

    Filtered DataFrame

  11. def stopRowTermFilter(stringCol: String, terms: Seq[String], caseSensitive: Boolean = false)(input: DataFrame): DataFrame

    Checks for keyword matches in a text field and removes rows with hits.

    stringCol

    Column containing text to match against.

    terms

    Keywords to match.

    caseSensitive

    true if matching should be case-sensitive, false otherwise. Defaults to false.

    input

    DataFrame from previous stage

    returns

    Filtered DataFrame

  12. def stopTermFilter(arrayCol: String, stopPattern: Regex)(input: DataFrame): DataFrame

    Pipeline op to remove terms matching a stop pattern from an Array[String] column

    arrayCol

    The name of an ArrayType(StringType) column in the input DataFrame

    stopPattern

    A Regex pattern describing words to remove

    input

    Input pipeline data to filter.

    returns

    Transformed pipeline data, with matching words removed from the specified column
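
    As a rough illustration on plain collections, the sketch below drops purely numeric terms. The helper and sample data are hypothetical. The pattern is anchored with ^ and $ so that it behaves the same whether terms are matched in full or searched for a partial match, since this page does not say which the library does.

    ```scala
    import scala.util.matching.Regex

    object StopPatternSketch {
      // Hypothetical stand-in for an ArrayType(StringType) column: one Seq per row.
      val rows: Seq[Seq[String]] = Seq(Seq("room", "101", "floor", "3"), Seq("2024", "report"))

      // Within each row, drop the terms that the stop pattern matches.
      def stopTerms(stopPattern: Regex)(input: Seq[Seq[String]]): Seq[Seq[String]] =
        input.map(terms => terms.filterNot(t => stopPattern.findFirstIn(t).isDefined))

      // Anchored pattern: only terms made up entirely of digits are removed.
      val cleaned: Seq[Seq[String]] = stopTerms("^[0-9]+$".r)(rows)
    }
    ```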

  13. def stopTermFilter(arrayCol: String, stopTerms: Set[String])(input: DataFrame): DataFrame

    Pipeline op to remove stop words from an Array[String] column

    arrayCol

    The name of an ArrayType(StringType) column in the input DataFrame

    stopTerms

    A Set[String] of words to remove

    input

    Input pipeline data to filter.

    returns

    Transformed pipeline data, with stop words removed from the specified column

  14. def uniqueTerms(arrayCol: String)(input: DataFrame): Map[String, Int]

    Produces a Map[String,Int] of unique terms from an Array[String] column along with associated counts

    arrayCol

    The name of an ArrayType(StringType) column in the input DataFrame

    input

    Input pipeline data to analyze

    returns

    the Map[String, Int] of unique terms and their counts
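
    Unlike the other ops in this package, uniqueTerms returns a Map rather than a DataFrame, so it would normally be the last stage of a pipeline. A sketch of the counting on plain collections follows, with a hypothetical helper and sample data. It assumes each count is the total number of occurrences across all rows, not the number of rows containing the term.

    ```scala
    object UniqueTermsSketch {
      // Hypothetical stand-in for an ArrayType(StringType) column: one Seq per row.
      val rows: Seq[Seq[String]] = Seq(Seq("a", "b", "a"), Seq("b", "c"))

      // Flatten all rows into one sequence of terms, then count each distinct term.
      // Assumption: repeated terms within a single row are each counted.
      def uniqueTerms(input: Seq[Seq[String]]): Map[String, Int] =
        input.flatten.groupBy(identity).map { case (term, occurrences) =>
          term -> occurrences.size
        }

      val counts: Map[String, Int] = uniqueTerms(rows)
    }
    ```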
