
RumbleDB ML

RumbleDB ML is a Machine Learning library built on top of the RumbleDB engine. Thanks to the abstraction layer provided by JSONiq, it makes performing ML tasks easier and more productive.

The machine learning capabilities are exposed through JSONiq function items. The concepts of "estimator" and "transformer", which are core to Machine Learning, are naturally function items and fit seamlessly in the JSONiq data model.

Training sets, test sets, and validation sets, which contain features and labels, are exposed through JSONiq sequences of object items: the keys of these objects are the features and labels.

The names of the estimators and transformers, as well as the functionality they encapsulate, are directly inherited from the SparkML library on which RumbleDB ML is based: we chose not to reinvent the wheel.

Transformers

A transformer is a function item that maps a sequence of objects to a sequence of objects.

It is an abstraction that either performs a feature transformation or generates predictions based on trained models. For example:

  • Tokenizer is a feature transformer that receives textual input data and splits it into individual terms (usually words), which are called tokens.

  • KMeansModel is a trained model and a transformer that can read a dataset containing features and generate predictions as its output.
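
A transformer is retrieved by name with get-transformer and then invoked like any other function item. A minimal sketch, assuming $annotated-data is a placeholder for a sequence of objects with a "sentence" field, validated as described under Type Annotations below:

(: $annotated-data is a placeholder for a validated input sequence :)
let $tokenizer := get-transformer("Tokenizer")
for $row in $tokenizer(
    $annotated-data,
    {"inputCol": "sentence", "outputCol": "tokens"}
)
return $row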

Estimators

An estimator is a function item that maps a sequence of objects to a transformer (yes, you got it right: that's a function item returned by a function item. This is why they are also called higher-order functions!).

Estimators abstract the concept of a Machine Learning algorithm or any algorithm that fits or trains on data. For example, a learning algorithm such as KMeans is implemented as an Estimator. Calling this estimator on data essentially trains a KMeansModel, which is a Model and hence a Transformer.
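
In code, this is a two-step process: get-estimator retrieves the estimator by name, and calling the estimator on data returns the trained model, which can then be applied like any other transformer. A minimal sketch, assuming $training-data and $test-data are placeholders for validated sequences of objects with a "features" field:

let $kmeans := get-estimator("KMeans")
(: the result of calling the estimator on data is itself a function item:
   a trained KMeansModel transformer :)
let $model := $kmeans($training-data, {"featuresCol": "features"})
for $row in $model($test-data, {"featuresCol": "features"})
return $row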

Parameters

Transformers and estimators are function items in the RumbleDB Data Model. Their first argument is the sequence of objects representing, for example, the training set or the test set. Parameters can be provided as their second argument, which is expected to be an object item: the machine learning parameters form its fields, as key-value pairs.
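
Because transformers and estimators are function items, the parameter object can also be bound in advance with the argument placeholder ?, which yields a one-argument function expecting only the data. A minimal sketch, assuming $my-data is a placeholder for a validated sequence of objects with "col1" and "col2" fields:

(: binding the second (parameters) argument ahead of time :)
let $assemble := get-transformer("VectorAssembler")(
    ?,
    { "inputCols" : [ "col1", "col2" ], "outputCol" : "features" }
)
return $assemble($my-data)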

Type Annotations

RumbleDB ML works on highly structured data: it requires full type information for all the fields in the training set or test set. Automating the detection of these types when the sequence of objects is created on the fly is on our development plan.

RumbleDB supports a user-defined type system with which you can validate and annotate datasets against a JSound schema.

This annotation is required for any dataset used as input to RumbleDB ML; it is superfluous, however, if the data was read directly from a structured input format such as Parquet, CSV, Avro, SVM, or ROOT.
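
Concretely, the schema is declared in the prolog and the dataset is annotated with a validate expression; the schema and field names below are purely illustrative, and the examples in the next section show the same pattern end-to-end:

declare type local:point as {
  "id": "integer",
  "x": "decimal",
  "y": "decimal"
};

let $local-data := (
    {"id": 1, "x": 1.0, "y": 2.0},
    {"id": 2, "x": 3.0, "y": 4.0}
)
return validate type local:point* { $local-data }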

Examples

  • Tokenizer Example:

declare type local:id-and-sentence as {
  "id": "integer",
  "sentence": "string"
};


let $local-data := (
    {"id": 1, "sentence": "Hi I heard about Spark"},
    {"id": 2, "sentence": "I wish Java could use case classes"},
    {"id": 3, "sentence": "Logistic regression models are neat"}
)
let $df-data := validate type local:id-and-sentence* { $local-data }

let $transformer := get-transformer("Tokenizer")
for $i in $transformer(
    $df-data,
    {"inputCol": "sentence", "outputCol": "output"}
)
return $i

// returns
// { "id" : 1, "sentence" : "Hi I heard about Spark", "output" : [ "hi", "i", "heard", "about", "spark" ] }
// { "id" : 2, "sentence" : "I wish Java could use case classes", "output" : [ "i", "wish", "java", "could", "use", "case", "classes" ] }
// { "id" : 3, "sentence" : "Logistic regression models are neat", "output" : [ "logistic", "regression", "models", "are", "neat" ] }

  • KMeans Example:

declare type local:col-1-2-3 as {
  "id": "integer",
  "col1": "decimal",
  "col2": "decimal",
  "col3": "decimal"
};

let $vector-assembler := get-transformer("VectorAssembler")(
  ?,
  { "inputCols" : [ "col1", "col2", "col3" ], "outputCol" : "features" }
)

let $local-data := (
    {"id": 0, "col1": 0.0, "col2": 0.0, "col3": 0.0},
    {"id": 1, "col1": 0.1, "col2": 0.1, "col3": 0.1},
    {"id": 2, "col1": 0.2, "col2": 0.2, "col3": 0.2},
    {"id": 3, "col1": 9.0, "col2": 9.0, "col3": 9.0},
    {"id": 4, "col1": 9.1, "col2": 9.1, "col3": 9.1},
    {"id": 5, "col1": 9.2, "col2": 9.2, "col3": 9.2}
)
let $df-data := validate type local:col-1-2-3* { $local-data }
let $df-data := $vector-assembler($df-data)

let $est := get-estimator("KMeans")
let $tra := $est(
    $df-data,
    {"featuresCol": "features"}
)

for $i in $tra(
    $df-data,
    {"featuresCol": "features"}
)
return $i

// returns
// { "id" : 0, "col1" : 0, "col2" : 0, "col3" : 0, "prediction" : 0 }
// { "id" : 1, "col1" : 0.1, "col2" : 0.1, "col3" : 0.1, "prediction" : 0 }
// { "id" : 2, "col1" : 0.2, "col2" : 0.2, "col3" : 0.2, "prediction" : 0 }
// { "id" : 3, "col1" : 9, "col2" : 9, "col3" : 9, "prediction" : 1 }
// { "id" : 4, "col1" : 9.1, "col2" : 9.1, "col3" : 9.1, "prediction" : 1 }
// { "id" : 5, "col1" : 9.2, "col2" : 9.2, "col3" : 9.2, "prediction" : 1 }

RumbleDB ML Functionality Overview:

RumbleDB ML - Catalogue of Estimators:

AFTSurvivalRegression

Parameters:
- aggregationDepth: integer
- censorCol: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- maxIter: integer
- predictionCol: string
- quantileProbabilities: array (of double)
- quantilesCol: string
- tol: double

ALS

Parameters:
- alpha: double
- checkpointInterval: integer
- coldStartStrategy: string
- finalStorageLevel: string
- implicitPrefs: boolean
- intermediateStorageLevel: string
- itemCol: string
- maxIter: integer
- nonnegative: boolean
- numBlocks: integer
- numItemBlocks: integer
- numUserBlocks: integer
- predictionCol: string
- rank: integer
- ratingCol: string
- regParam: double
- seed: double
- userCol: string

BisectingKMeans

Parameters:
- distanceMeasure: string
- featuresCol: string
- k: integer
- maxIter: integer
- minDivisibleClusterSize: double
- predictionCol: string
- seed: double

BucketedRandomProjectionLSH

Parameters:
- bucketLength: double
- inputCol: string
- numHashTables: integer
- outputCol: string
- seed: double

ChiSqSelector

Parameters:
- fdr: double
- featuresCol: string
- fpr: double
- fwe: double
- labelCol: string
- numTopFeatures: integer
- outputCol: string
- percentile: double
- selectorType: string

CountVectorizer

Parameters:
- binary: boolean
- inputCol: string
- maxDF: double
- minDF: double
- minTF: double
- outputCol: string
- vocabSize: integer

CrossValidator

Parameters:
- collectSubModels: boolean
- estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- numFolds: integer
- parallelism: integer
- seed: double

DecisionTreeClassifier

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- thresholds: array (of double)

DecisionTreeRegressor

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- seed: double
- varianceCol: string

FPGrowth

Parameters:
- itemsCol: string
- minConfidence: double
- minSupport: double
- numPartitions: integer
- predictionCol: string

GBTClassifier

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- lossType: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- thresholds: array (of double)
- validationIndicatorCol: string

GBTRegressor

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- lossType: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- validationIndicatorCol: string

GaussianMixture

Parameters:
- featuresCol: string
- k: integer
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- seed: double
- tol: double

GeneralizedLinearRegression

Parameters:
- family: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- link: string
- linkPower: double
- linkPredictionCol: string
- maxIter: integer
- offsetCol: string
- predictionCol: string
- regParam: double
- solver: string
- tol: double
- variancePower: double
- weightCol: string

IDF

Parameters:
- inputCol: string
- minDocFreq: integer
- outputCol: string

Imputer

Parameters:
- inputCols: array (of string)
- missingValue: double
- outputCols: array (of string)
- strategy: string

IsotonicRegression

Parameters:
- featureIndex: integer
- featuresCol: string
- isotonic: boolean
- labelCol: string
- predictionCol: string
- weightCol: string

KMeans

Parameters:
- distanceMeasure: string
- featuresCol: string
- initMode: string
- initSteps: integer
- k: integer
- maxIter: integer
- predictionCol: string
- seed: double
- tol: double

LDA

Parameters:
- checkpointInterval: integer
- docConcentration: double
- docConcentration: array (of double)
- featuresCol: string
- k: integer
- keepLastCheckpoint: boolean
- learningDecay: double
- learningOffset: double
- maxIter: integer
- optimizeDocConcentration: boolean
- optimizer: string
- seed: double
- subsamplingRate: double
- topicConcentration: double
- topicDistributionCol: string

LinearRegression

Parameters:
- aggregationDepth: integer
- elasticNetParam: double
- epsilon: double
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- loss: string
- maxIter: integer
- predictionCol: string
- regParam: double
- solver: string
- standardization: boolean
- tol: double
- weightCol: string

LinearSVC

Parameters:
- aggregationDepth: integer
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- maxIter: integer
- predictionCol: string
- rawPredictionCol: string
- regParam: double
- standardization: boolean
- threshold: double
- tol: double
- weightCol: string

LogisticRegression

Parameters:
- aggregationDepth: integer
- elasticNetParam: double
- family: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- lowerBoundsOnCoefficients: object (of object of double)
- lowerBoundsOnIntercepts: object (of double)
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- regParam: double
- standardization: boolean
- threshold: double
- thresholds: array (of double)
- tol: double
- upperBoundsOnCoefficients: object (of object of double)
- upperBoundsOnIntercepts: object (of double)
- weightCol: string

MaxAbsScaler

Parameters:
- inputCol: string
- outputCol: string

MinHashLSH

Parameters:
- inputCol: string
- numHashTables: integer
- outputCol: string
- seed: double

MinMaxScaler

Parameters:
- inputCol: string
- max: double
- min: double
- outputCol: string

MultilayerPerceptronClassifier

Parameters:
- blockSize: integer
- featuresCol: string
- initialWeights: object (of double)
- labelCol: string
- layers: array (of integer)
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- solver: string
- stepSize: double
- thresholds: array (of double)
- tol: double

NaiveBayes

Parameters:
- featuresCol: string
- labelCol: string
- modelType: string
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- smoothing: double
- thresholds: array (of double)
- weightCol: string

OneHotEncoder

Parameters:
- dropLast: boolean
- handleInvalid: string
- inputCols: array (of string)
- outputCols: array (of string)

OneVsRest

Parameters:
- featuresCol: string
- labelCol: string
- parallelism: integer
- predictionCol: string
- rawPredictionCol: string
- weightCol: string

PCA

Parameters:
- inputCol: string
- k: integer
- outputCol: string

Pipeline

Parameters:

QuantileDiscretizer

Parameters:
- handleInvalid: string
- inputCol: string
- inputCols: array (of string)
- numBuckets: integer
- numBucketsArray: array (of integer)
- outputCol: string
- outputCols: array (of string)
- relativeError: double

RFormula

Parameters:
- featuresCol: string
- forceIndexLabel: boolean
- formula: string
- handleInvalid: string
- labelCol: string
- stringIndexerOrderType: string

RandomForestClassifier

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- subsamplingRate: double
- thresholds: array (of double)

RandomForestRegressor

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- predictionCol: string
- seed: double
- subsamplingRate: double

StandardScaler

Parameters:
- inputCol: string
- outputCol: string
- withMean: boolean
- withStd: boolean

StringIndexer

Parameters:
- handleInvalid: string
- inputCol: string
- outputCol: string
- stringOrderType: string

TrainValidationSplit

Parameters:
- collectSubModels: boolean
- estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- parallelism: integer
- seed: double
- trainRatio: double

VectorIndexer

Parameters:
- handleInvalid: string
- inputCol: string
- maxCategories: integer
- outputCol: string

Word2Vec

Parameters:
- inputCol: string
- maxIter: integer
- maxSentenceLength: integer
- minCount: integer
- numPartitions: integer
- outputCol: string
- seed: double
- stepSize: double
- vectorSize: integer
- windowSize: integer

RumbleDB ML - Catalogue of Transformers:

AFTSurvivalRegressionModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- quantileProbabilities: array (of double)
- quantilesCol: string

ALSModel

Parameters:
- coldStartStrategy: string
- itemCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- userCol: string

Binarizer

Parameters:
- inputCol: string
- outputCol: string
- threshold: double

BisectingKMeansModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string

BucketedRandomProjectionLSHModel

Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

Bucketizer

Parameters:
- handleInvalid: string
- inputCol: string
- inputCols: array (of string)
- outputCol: string
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- splits: array (of double)
- splitsArray: array (of array of double)

ChiSqSelectorModel

Parameters:
- featuresCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

ColumnPruner

Parameters:

CountVectorizerModel

Parameters:
- binary: boolean
- inputCol: string
- minTF: double
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

CrossValidatorModel

Parameters:
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

DCT

Parameters:
- inputCol: string
- inverse: boolean
- outputCol: string

DecisionTreeClassificationModel

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- thresholds: array (of double)

DecisionTreeRegressionModel

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- varianceCol: string

DistributedLDAModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- seed: double
- topicDistributionCol: string

ElementwiseProduct

Parameters:
- inputCol: string
- outputCol: string
- scalingVec: object (of double)

FPGrowthModel

Parameters:
- itemsCol: string
- minConfidence: double
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string

FeatureHasher

Parameters:
- categoricalCols: array (of string)
- inputCols: array (of string)
- numFeatures: integer
- outputCol: string

GBTClassificationModel

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- thresholds: array (of double)

GBTRegressionModel

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double

GaussianMixtureModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string

GeneralizedLinearRegressionModel

Parameters:
- featuresCol: string
- linkPredictionCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string

HashingTF

Parameters:
- binary: boolean
- inputCol: string
- numFeatures: integer
- outputCol: string

IDFModel

Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

ImputerModel

Parameters:
- inputCols: array (of string)
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

IndexToString

Parameters:
- inputCol: string
- labels: array (of string)
- outputCol: string

Interaction

Parameters:
- inputCols: array (of string)
- outputCol: string

IsotonicRegressionModel

Parameters:
- featureIndex: integer
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string

KMeansModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string

LinearRegressionModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string

LinearSVCModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- rawPredictionCol: string
- threshold: double
- weightCol: double

LocalLDAModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- seed: double
- topicDistributionCol: string

LogisticRegressionModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- threshold: double
- thresholds: array (of double)

MaxAbsScalerModel

Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

MinHashLSHModel

Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

MinMaxScalerModel

Parameters:
- inputCol: string
- max: double
- min: double
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

MultilayerPerceptronClassificationModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- thresholds: array (of double)

NGram

Parameters:
- inputCol: string
- n: integer
- outputCol: string

NaiveBayesModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- thresholds: array (of double)

Normalizer

Parameters:
- inputCol: string
- outputCol: string
- p: double

OneHotEncoder

Parameters:
- dropLast: boolean
- inputCol: string
- outputCol: string

OneHotEncoderModel

Parameters:
- dropLast: boolean
- handleInvalid: string
- inputCols: array (of string)
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

OneVsRestModel

Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- rawPredictionCol: string

PCAModel

Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

PipelineModel

Parameters:
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

PolynomialExpansion

Parameters:
- degree: integer
- inputCol: string
- outputCol: string

RFormulaModel

Parameters:
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

RandomForestClassificationModel

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- subsamplingRate: double
- thresholds: array (of double)

RandomForestRegressionModel

Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- subsamplingRate: double

RegexTokenizer

Parameters:
- gaps: boolean
- inputCol: string
- minTokenLength: integer
- outputCol: string
- pattern: string
- toLowercase: boolean

SQLTransformer

Parameters:
- statement: string

StandardScalerModel

Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

StopWordsRemover

Parameters:
- caseSensitive: boolean
- inputCol: string
- locale: string
- outputCol: string
- stopWords: array (of string)

StringIndexerModel

Parameters:
- handleInvalid: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

Tokenizer

Parameters:
- inputCol: string
- outputCol: string

TrainValidationSplitModel

Parameters:
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

VectorAssembler

Parameters:
- handleInvalid: string
- inputCols: array (of string)
- outputCol: string

VectorAttributeRewriter

Parameters:

VectorIndexerModel

Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

VectorSizeHint

Parameters:
- handleInvalid: string
- inputCol: string
- size: integer

VectorSlicer

Parameters:
- indices: array (of integer)
- inputCol: string
- names: array (of string)
- outputCol: string

Word2VecModel

Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)