User-defined types

RumbleDB now supports user-defined array and object types both with the JSound compact syntax and the JSound verbose syntax.

JSound Schema Compact syntax

RumbleDB user-defined types can be defined with the JSound syntax. A tutorial for the JSound syntax can be found here.

For now, RumbleDB only allows the definition of user-defined types for objects and arrays. User-defined atomic types and union types will follow soon. The @ (primary key) and ? (nullable) shortcuts are supported as of version 2.0.5. The behavior of nulls with absent vs. nullable fields can be tweaked in the configuration (e.g., if a null is present in an optional, non-nullable field, RumbleDB can be lenient and simply remove it instead of throwing an error).
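
For instance (a minimal sketch, assuming the JSound compact convention that ? is appended to the type of a nullable field), a type with a required field and a nullable field could be declared as:

declare type local:my-nullable-type as {
  "!foo" : "string",
  "bar" : "integer?"
};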

The implementation is still experimental and bugs are to be expected; we appreciate being informed of any that you find.

Type declaration

A new type can be declared in the prolog, at the same location where you also define global variables and user-defined functions.

declare type local:my-type as {
  "foo" : "string",
  "bar" : "integer"
};

{ "foo" : "this is a string", "bar" : 42 }

In the above query, although the type is defined, the query returns an object that was not validated against this type.

Type validation

To validate and annotate a sequence of objects, you need to use the validate-type expression, like so:

declare type local:my-type as {
  "foo" : "string",
  "bar" : "integer"
};

validate type local:my-type* {
  { "foo" : "this is a string", "bar" : 42 }
}

You can use user-defined types wherever other types can appear: as type annotation for FLWOR variables or global variables, as function parameter or return types, in instance-of or treat-as expressions, etc.

declare type local:my-type as {
  "foo" : "string",
  "bar" : "integer"
};

declare function local:proj($x as local:my-type+) as string*
{
  $x.foo
};

let $a as local:my-type* := validate type local:my-type* {
  { "foo" : "this is a string", "bar" : 42 }
}
return if($a instance of local:my-type*)
       then local:proj($a)
       else "Not an instance."

You can validate larger sequences

declare type local:my-type as {
  "foo" : "string",
  "bar" : "integer"
};

validate type local:my-type* {
  { "foo" : "this is a string", "bar" : 42 },
  { "foo" : "this is another string", "bar" : 1 },
  { "foo" : "this is yet another string", "bar" : 2 },
  { "foo" : "this is a string", "bar" : 12 },
  { "foo" : "this is a string", "bar" : 42345 },
  { "foo" : "this is a string", "bar" : 42 }
}

You can also validate, in parallel, an entire JSON Lines file, like so:

declare type local:my-type as {
  "foo" : "string",
  "bar" : "integer"
};

validate type local:my-type* {
  json-lines("hdfs:///directory-file.json")
}

Optional vs. required fields

By default, fields are optional:

declare type local:my-type as {
  "foo" : "string",
  "bar" : "integer"
};

validate type local:my-type* {
  { "foo" : "this is a string", "bar" : 42 },
  { "bar" : 1 },
  { "foo" : "this is yet another string", "bar" : 2 },
  { "foo" : "this is a string" },
  { "foo" : "this is a string", "bar" : 42345 },
  { "foo" : "this is a string", "bar" : 42 }
}

You can, however, make a field required by adding a ! in front of its name:

declare type local:my-type as {
  "foo" : "string",
  "!bar" : "integer"
};

validate type local:my-type* {
  { "foo" : "this is a string", "bar" : 42 },
  { "bar" : 1 },
  { "foo" : "this is yet another string", "bar" : 2 },
  { "foo" : "this is a string", "bar" : 1234 },
  { "foo" : "this is a string", "bar" : 42345 },
  { "foo" : "this is a string", "bar" : 42 }
}

Or you can provide a default value with the equal sign:

declare type local:my-type as {
  "foo" : "string=foobar",
  "!bar" : "integer"
};

validate type local:my-type* {
  { "foo" : "this is a string", "bar" : 42 },
  { "bar" : 1 },
  { "foo" : "this is yet another string", "bar" : 2 },
  { "foo" : "this is a string", "bar" : 1234 },
  { "foo" : "this is a string", "bar" : 42345 },
  { "foo" : "this is a string", "bar" : 42 }
}

Extra fields

Extra fields will be rejected. However, the verbose flavor of JSound supports open object types, which allow extra fields; this will be supported in a future version of RumbleDB.
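
For example (an illustrative query; the extra key name is arbitrary), validating an object that carries a field not declared in the type raises a validation error:

declare type local:my-type as {
  "foo" : "string"
};

validate type local:my-type* {
  { "foo" : "this is a string", "extra" : 42 }  (: error: extra is not declared :)
}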

Nested arrays

With the JSound compact syntax, you can easily define nested array structures:

declare type local:my-type as {
  "foo" : "string",
  "!bar" : [ "integer" ]
};

validate type local:my-type* {
  { "foo" : "this is a string", "bar" : [ 42, 1234 ] },
  { "bar" : [ 1 ] },
  { "foo" : "this is yet another string", "bar" : [ 2 ] },
  { "foo" : "this is a string", "bar" : [ ] },
  { "foo" : "this is a string", "bar" : [ 1, 2, 3, 4, 5, 6 ] },
  { "foo" : "this is a string", "bar" : [ 42 ] }
}

You can even further nest objects:

declare type local:my-type as {
  "foo" : { "bar" : "integer" },
  "!bar" : [ { "first" : "string", "last" : "string" } ]
};

validate type local:my-type* {
  {
    "foo" : { "bar" : 1 },
    "bar" : [
      { "first" : "Albert", "last" : "Einstein" },
      { "first" : "Erwin", "last" : "Schrodinger" }
    ]
  },
  {
    "foo" : { "bar" : 2 },
    "bar" : [
      { "first" : "Alan", "last" : "Turing" },
      { "first" : "John", "last" : "Von Neumann" }
    ]
  },
  {
    "foo" : { "bar" : 3 },
    "bar" : [
    ]
  }
}

Or split your definitions into several types that refer to each other:

declare type local:person as {
  "first" : "string",
  "last" : "string"
};

declare type local:my-type as {
  "foo" : { "bar" : "integer" },
  "!bar" : [ "local:person" ]
};

validate type local:my-type* {
  {
    "foo" : { "bar" : 1 },
    "bar" : [
      { "first" : "Albert", "last" : "Einstein" },
      { "first" : "Erwin", "last" : "Schrodinger" }
    ]
  },
  {
    "foo" : { "bar" : 2 },
    "bar" : [
      { "first" : "Alan", "last" : "Turing" },
      { "first" : "John", "last" : "Von Neumann" }
    ]
  },
  {
    "foo" : { "bar" : 3 },
    "bar" : [
    ]
  }
}

DataFrames

In fact, RumbleDB will internally convert the sequence of objects to a Spark DataFrame, leading to faster execution times.

In other words, the JSound compact schema syntax is perfect for defining DataFrame schemas!

Verbose syntax

For advanced JSound features, such as open object types or subtypes, the verbose syntax must be used, like so:

declare type local:x as jsound verbose {
  "kind" : "object",
  "baseType" : "object",
  "content" : [
    { "name" : "foo", "type" : "integer" }
  ],
  "closed" : false
};

declare type local:y as jsound verbose {
  "kind" : "object",
  "baseType" : "local:x",
  "content" : [
    { "name" : "bar", "type" : "date" }
  ],
  "closed" : true
};

The JSound type system, as its name indicates, is sound: you can only make subtypes more restrictive than the supertype. The complete specification of both syntaxes is available on the JSound website.

In the future, RumbleDB will support user-defined atomic types and union types via the verbose syntax.

What's next?

Once you have validated your data as a DataFrame with a user-defined type, you are all set to use RumbleDB ML, the Machine Learning library, and feed your data through ML pipelines!

Function library

We list here the most important functions supported by RumbleDB, and introduce them by means of examples. Highly detailed specifications can be found in the underlying W3C standard, unless the function is marked as specific to JSONiq or RumbleDB, in which case it can be found in the JSONiq specification. JSONiq and RumbleDB intentionally do not support builtin functions on XML nodes, NOTATION or QNames. RumbleDB supports almost all other W3C-standardized functions; please contact us if you are still missing one.

For the sake of ease of use, all W3C standard builtin functions and JSONiq builtin functions are in the RumbleDB namespace, which is the default function namespace and does not require any prefix in front of function names.

It is recommended that user-defined functions are put in the local namespace, i.e., their name should have the local: prefix (which is predefined). Otherwise, there is the risk that your code becomes incompatible with subsequent releases if new (unprefixed) builtin functions are introduced.
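
For example (a minimal sketch):

declare function local:twice($x as integer) as integer
{
  2 * $x
};

local:twice(21)  (: returns 42 :)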

Errors and diagnostics

Diagnostic tracing

trace

Fully implemented

trace(1 to 3) returns (1, 2, 3) and logs it in the log-path if specified.

Functions and operators on numerics

Functions on numeric values

abs

Fully implemented

abs(-2) returns 2.0.

ceiling

Fully implemented

ceiling(2.3) returns 3.0.

floor

Fully implemented

floor(2.3) returns 2.0.

round

Fully implemented

round(2.3) returns 2.0.

round(2.2345, 2) returns 2.23.

round-half-to-even

Fully implemented

round-half-to-even(2.2345, 2) returns 2.23; round-half-to-even(2.2345) returns 2.

Parsing numbers

number

Fully implemented

number("15") returns 15 as a double.

number("foo") returns NaN as a double.

number(15) returns 15 as a double.

Formatting integers

format-integer

Not implemented

Formatting numbers

format-number

Not implemented

Trigonometric and exponential functions

pi

Fully implemented

pi() returns 3.141592653589793.

exp

Fully implemented

Example: exp(10)

exp10

Fully implemented

Example: exp10(10)

log

Fully implemented

Example: log(100)

log10

Fully implemented

Example: log10(100)

pow

Fully implemented

Example: pow(10, 2)

sqrt

Fully implemented

sqrt(4) returns 2.

sin

Fully implemented

Example: sin(pi())

cos

Fully implemented

Example: cos(pi())

cosh

JSONiq-specific. Fully implemented

Example: cosh(pi())

sinh

JSONiq-specific. Fully implemented

Example: sinh(pi())

tan

Fully implemented

Example: tan(pi())

asin

Fully implemented

Example: asin(1)

acos

Fully implemented

Example: acos(1)

atan

Fully implemented

Example: atan(1)

atan2

Fully implemented

Example: atan2(1, 1)

Random numbers

random-number-generator

Not implemented

Functions on strings

Functions to assemble and disassemble strings

string-to-codepoints

Fully implemented

string-to-codepoints("Thérèse") returns (84, 104, 233, 114, 232, 115, 101).

string-to-codepoints("") returns ().

codepoints-to-string

Fully implemented

codepoints-to-string((2309, 2358, 2378, 2325)) returns "अशॊक".

codepoints-to-string(()) returns "".

Comparison of strings

compare

Fully implemented

compare("aa", "bb") returns -1.

codepoint-equal

Fully implemented

codepoint-equal("abcd", "abcd") returns true.

codepoint-equal("", ()) returns ().

collation-key

Not implemented

contains-token

Not implemented

Functions on string values

concat

Fully implemented

concat("foo", "bar", "foobar") returns "foobarfoobar".

string-join

Fully implemented

string-join(("foo", "bar", "foobar")) returns "foobarfoobar".

string-join(("foo", "bar", "foobar"), "-") returns "foo-bar-foobar".

substring

Fully implemented

substring("foobar", 4) returns "bar".

substring("foobar", 4, 2) returns "ba".

string-length

Fully implemented

Returns the length of the supplied string, or 0 if the empty sequence is supplied.

string-length("foo") returns 3.

string-length(()) returns 0.

normalize-space

Fully implemented

Normalizes whitespace in a string.

normalize-space(" The    wealthy curled darlings                                         of    our    nation. ") returns "The wealthy curled darlings of our nation.".

normalize-unicode

Fully implemented

Returns the value of the input after applying Unicode normalization.

normalize-unicode("hello world", "NFC") returns the Unicode-normalized version of the input string. Normalization forms NFC, NFD, NFKC, and NFKD are supported. "FULLY-NORMALIZED" is also supported, but should be used with caution: only the composition exclusion characters that are uncommented in the corresponding Unicode data file are supported.

upper-case

Fully implemented

upper-case("abCd0") returns "ABCD0".

lower-case

Fully implemented

lower-case("ABc!D") returns "abc!d".

translate

Fully implemented

translate("bar", "abc", "ABC") returns "BAr".

translate("--aaa--", "abc-", "ABC") returns "AAA".

Functions based on substring matching

contains

Fully implemented

contains("foobar", "ob") returns true.

starts-with

Fully implemented

starts-with("foobar", "foo") returns true.

ends-with

Fully implemented

ends-with("foobar", "bar") returns true.

substring-before

Fully implemented

substring-before("foobar", "bar") returns "foo".

substring-before("foobar", "o") returns "f".

substring-after

Fully implemented

substring-after("foobar", "foo") returns "bar".

substring-after("foobar", "r") returns "".

String functions that use regular expressions

matches

Arity 2 implemented, arity 3 is not.

Regular expression matching. The semantics of regular expressions are those of Java's Pattern class.

matches("foobar", "o+") returns true.

matches("foobar", "^fo+.*") returns true.

replace

Arity 3 implemented, arity 4 is not.

Regular expression matching and replacing. The semantics of regular expressions are those of Java's Pattern class.

replace("abracadabra", "bra", "*") returns "a*cada*".

replace("abracadabra", "a(.)", "a$1$1") returns "abbraccaddabbra".

tokenize

Arity 2 implemented, arity 3 is not.

tokenize("aa bb cc dd") returns ("aa", "bb", "cc", "dd").

tokenize("aa;bb;cc;dd", ";") returns ("aa", "bb", "cc", "dd").

analyze-string

Not implemented

Functions that manipulate URIs

resolve-uri

Fully implemented

string(resolve-uri("examples", "http://www.examples.com/")) returns http://www.examples.com/examples.

encode-for-uri

Fully implemented

encode-for-uri("100% organic") returns 100%25%20organic.

iri-to-uri

Not implemented

escape-html-uri

Not implemented

Functions and operators on Boolean values

Boolean constant functions

true

Fully implemented

fn:true() returns true.

false

Fully implemented

fn:false() returns false.

boolean

Fully implemented

boolean(9) returns true.

boolean("") returns false.

not

Fully implemented

not(9) returns false.

not("") returns true.

Functions and operators on durations

Component extraction functions on durations

years-from-duration

Fully implemented

years-from-duration(duration("P2021Y6M")) returns 2021.

months-from-duration

Fully implemented

months-from-duration(duration("P2021Y6M")) returns 6.

days-from-duration

Fully implemented

days-from-duration(duration("P2021Y6M17D")) returns 17.

hours-from-duration

Fully implemented

hours-from-duration(duration("P2021Y6M17DT12H35M30S")) returns 12.

minutes-from-duration

Fully implemented

minutes-from-duration(duration("P2021Y6M17DT12H35M30S")) returns 35.

seconds-from-duration

Fully implemented

seconds-from-duration(duration("P2021Y6M17DT12H35M30S")) returns 30.

Functions and operators on dates and times

Constructing a DateTime

dateTime

Fully implemented

dateTime("2004-04-12T13:20:00+14:00") returns 2004-04-12T13:20:00+14:00.

Component extraction functions on dates and times

year-from-dateTime

Fully implemented

year-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00")) returns 2021.

month-from-dateTime

Fully implemented

month-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00")) returns 4.

day-from-dateTime

Fully implemented

day-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00")) returns 12.

hours-from-dateTime

Fully implemented

hours-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00")) returns 13.

minutes-from-dateTime

Fully implemented

minutes-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00")) returns 20.

seconds-from-dateTime

Fully implemented

seconds-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00")) returns 32.123.

timezone-from-dateTime

Fully implemented

timezone-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00")) returns PT2H.

year-from-date

Fully implemented

year-from-date(date("2021-06-04")) returns 2021.

month-from-date

Fully implemented

month-from-date(date("2021-06-04")) returns 6.

day-from-date

Fully implemented

day-from-date(date("2021-06-04")) returns 4.

timezone-from-date

Fully implemented

timezone-from-date(date("2021-06-04-14:00")) returns -PT14H.

hours-from-time

Fully implemented

hours-from-time(time("13:20:32.123+02:00")) returns 13.

minutes-from-time

Fully implemented

minutes-from-time(time("13:20:32.123+02:00")) returns 20.

seconds-from-time

Fully implemented

seconds-from-time(time("13:20:32.123+02:00")) returns 32.123.

timezone-from-time

Fully implemented

timezone-from-time(time("13:20:32.123+02:00")) returns PT2H.

Timezone adjustment functions on dates and time values

adjust-dateTime-to-timezone

Fully implemented

adjust-dateTime-to-timezone(dateTime("2004-04-12T13:20:15+14:00"), dayTimeDuration("PT4H5M")) returns 2004-04-12T03:25:15+04:05.

adjust-date-to-timezone

Fully implemented

adjust-date-to-timezone(date("2014-03-12"), dayTimeDuration("PT4H")) returns 2014-03-12+04:00.

adjust-time-to-timezone

Fully implemented

adjust-time-to-timezone(time("13:20:00-05:00"), dayTimeDuration("-PT14H")) returns 04:20:00-14:00.

Formatting dates and times functions

The functions in this section accept a simplified version of the picture string, in which a variable marker accepts only:

  • One of the following component specifiers: Y, M, d, D, F, H, m, s, P

  • A first presentation modifier, for which the value can be:

    • Nn, for all supported component specifiers besides P

    • N, if the component specifier is P

    • a format token that indicates a numbering sequence of the following form: '0001'

  • A second presentation modifier, for which the value can be t or c, which are also the default values

  • A width modifier, with both minimum and maximum values

format-dateTime

Fully implemented

format-dateTime(dateTime("2004-04-12T13:20:00"), "[m]-[H]-[D]-[M]-[Y]") returns 20-13-12-4-2004.

format-date

Fully implemented

format-date(date("2004-04-12"), "[D]-[M]-[Y]") returns 12-4-2004.

format-time

Fully implemented

format-time(time("13:20:00"), "[H]-[m]-[s]") returns 13-20-0.

Functions related to QNames

Not implemented

Functions and operators on sequences

General functions and operators on sequences

empty

Fully implemented

Returns a boolean indicating whether the input sequence is empty.

empty(1 to 10) returns false.

exists

Fully implemented

Returns a boolean indicating whether the input sequence has at least one item.

exists(1 to 10) returns true.

exists(()) returns false.

This is pushed down to Spark and works on big sequences: exists(json-lines("file.json"))

head

Fully implemented

Returns the first item of a sequence, or the empty sequence if it is empty.

head(1 to 10) returns 1.

head(()) returns ().

This is pushed down to Spark and works on big sequences: head(json-lines("file.json"))

tail

Fully implemented

Returns all but the first item of a sequence, or the empty sequence if it is empty.

tail(1 to 5) returns (2, 3, 4, 5).

tail(()) returns ().

This is pushed down to Spark and works on big sequences: tail(json-lines("file.json"))

insert-before

Fully implemented

insert-before((3, 4, 5), 0, (1, 2)) returns (1, 2, 3, 4, 5).

remove

Fully implemented

remove((1, 2, 10), 3) returns (1, 2).

reverse

Fully implemented

reverse((1, 2, 3)) returns (3, 2, 1).

subsequence

Fully implemented

subsequence((1, 2, 3), 2, 5) returns (2, 3).

unordered

Fully implemented

unordered((1, 2, 3)) returns (1, 2, 3).

Functions that compare values in sequences

distinct-values

Fully implemented

Eliminates duplicates from a sequence of atomic items.

distinct-values((1, 1, 4, 3, 1, 1, "foo", 4, "foo", true, 3, 1, true, 5, 3, 1, 1)) returns (1, 4, 3, "foo", true, 5).

This is pushed down to Spark and works on big sequences:

distinct-values(json-lines("file.json").foo)
distinct-values(text-file("file.txt"))

index-of

Fully implemented

index-of((10, 20, 30, 40), 30) returns 3.

index-of((10, 20, 30, 40), 35) returns ().

deep-equal

Fully implemented

deep-equal((10, 20, "a"), (10, 20, "a")) returns true.

deep-equal(("b", "0"), ("b", 0)) returns false.

Functions that test the cardinality of sequences

zero-or-one

Fully implemented

zero-or-one(("a")) returns "a".

zero-or-one(("a", "b")) returns an error.

one-or-more

Fully implemented

one-or-more(("a")) returns "a".

one-or-more(()) returns an error.

exactly-one

Fully implemented

exactly-one(("a")) returns "a".

exactly-one(("a", "b")) returns an error.

Aggregate functions

count

Fully implemented

let $x := (1, 2, 3, 4)
return count($x)

returns 4.

Count calls are pushed down to Spark, so this works on billions of items as well:

count(json-lines("file.json"))

count(
  for $i in json-lines("file.json")
  where $i.foo eq "bar"
  return $i
)

avg

Fully implemented

let $x := (1, 2, 3, 4)
return avg($x)

returns 2.5.

Avg calls are pushed down to Spark, so this works on billions of items as well:

avg(json-lines("file.json").foo)

max

Fully implemented

let $x := (1, 2, 3, 4)
return max($x)

returns 4.

for $i in 1 to 3
return max($i)

returns (1, 2, 3).

Max calls are pushed down to Spark, so this works on billions of items as well:

max(json-lines("file.json").foo)

min

Fully implemented

let $x := (1, 2, 3, 4)
return min($x)

returns 1.

for $i in 1 to 3
return min($i)

returns (1, 2, 3).

Min calls are pushed down to Spark, so this works on billions of items as well:

min(json-lines("file.json").foo)

sum

Fully implemented

let $x := (1, 2, 3, 4)
return sum($x)

returns 10.

Sum calls are pushed down to Spark, so this works on billions of items as well:

sum(json-lines("file.json").foo)

Functions giving access to external information

doc

Fully implemented

doc("path/to/file.xml") returns the corresponding document node.

collection

Not implemented

Parsing and serializing

serialize

Fully implemented

Serializes the supplied input sequence, returning the serialized representation of the sequence as a string.

serialize({hello: "world"}) returns { "hello" : "world" }.

Context functions

position

Fully implemented

(1 to 10)[position() eq 5] returns 5.

last

Fully implemented

(1 to 10)[position() eq last()] returns 10.

(1 to 10)[last()] returns 10.

current-dateTime

Fully implemented

current-dateTime() returns, for example, 2020-02-26T11:22:48.423+01:00.

current-date

Fully implemented

current-date() returns, for example, 2020-02-26Europe/Zurich.

current-time

Fully implemented

current-time() returns, for example, 11:24:10.064+01:00.

implicit-timezone

Fully implemented

implicit-timezone() returns, for example, PT1H.

default-collation

Fully implemented

default-collation() returns http://www.w3.org/2005/xpath-functions/collation/codepoint.

Higher-order functions

Functions on functions

function-lookup

Not implemented

function-name

Not implemented

function-arity

Not implemented

Basic higher-order functions

for-each

Not implemented

filter

Not implemented

fold-left

Not implemented

fold-right

Not implemented

for-each-pair

Not implemented

JSONiq functions

keys

Fully implemented

keys({"foo" : "bar", "bar" : "foobar"}) returns ("foo", "bar"). Also works on an input sequence, eliminating duplicates: keys(({"foo" : "bar", "bar" : "foobar"}, {"foo": "bar2"})).

Keys calls are pushed down to Spark, so this works on billions of items as well:

keys(json-lines("file.json"))

members

Fully implemented

This function returns the members of an array, but not recursively, i.e., nested arrays are not unboxed.

members([1 to 100]) returns the first 100 integers as a sequence. Also works on an input sequence, in a distributive way: members(([1 to 100], [ 300 to 1000 ])).

null

Fully implemented

null() returns a JSON null (also available as the literal null).

parse-json

Fully implemented

size

Fully implemented

size([1 to 100]) returns 100. Also works if the empty sequence is supplied, in which case it returns the empty sequence: size(()).

accumulate

Fully implemented

accumulate(({ "b" : 2 }, { "c" : 3 }, { "b" : [1, "abc"] }, {"c" : {"d" : 0.17}})) returns

{ "b" : [ 2, [ 1, "abc" ] ], "c" : [ 3, { "d" : 0.17 } ] }

descendant-arrays

Fully implemented

descendant-arrays(([0, "x", { "a" : [1, {"b" : 2}, [2.5]], "o" : {"c" : 3} }])) returns

[ 0, "x", { "a" : [ 1, { "b" : 2 }, [ 2.5 ] ], "o" : {"c" : 3} } ]
[ 1, { "b" : 2 }, [ 2.5 ] ]
[ 2.5 ]

descendant-objects

Fully implemented

descendant-objects(([0, "x", { "a" : [1, {"b" : 2}, [2.5]], "o" : {"c" : 3} }])) returns

{ "a" : [ 1, { "b" : 2 }, [ 2.5 ] ], "o" : { "c" : 3 } }
{ "b" : 2 }
{ "c" : 3 }

descendant-pairs

Fully implemented

descendant-pairs(({ "a" : [1, {"b" : 2}], "d" : {"c" : 3} })) returns

{ "a" : [ 1, { "b" : 2 } ] }
{ "b" : 2 }
{ "d" : { "c" : 3 } }
{ "c" : 3 }

flatten

Fully implemented

Unboxes arrays recursively, stopping the recursion when any other item is reached (object or atomic). Also works on an input sequence, in a distributive way.

flatten(([1, 2], [[3, 4], [5, 6]], [7, [8, 9]])) returns (1, 2, 3, 4, 5, 6, 7, 8, 9).

intersect

Fully implemented

intersect(({"a" : "abc", "b" : 2, "c" : [1, 2], "d" : "0"}, { "a" : 2, "b" : "ab", "c" : "foo" })) returns

{ "a" : [ "abc", 2 ], "b" : [ 2, "ab" ], "c" : [ [ 1, 2 ], "foo" ] }

project

Fully implemented

project({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, ("foo", "bar")) returns the object {"foo" : "bar", "bar" : "foobar"}. Also works on an input sequence, in a distributive way: project(({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, {"foo": "bar2"}), ("foo", "bar")).

remove-keys

Fully implemented

remove-keys({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, ("foo", "bar")) returns the object {"foobar" : "foo"}. Also works on an input sequence, in a distributive way: remove-keys(({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, {"foo": "bar2"}), ("foo", "bar")).

values

Fully implemented

values({"foo" : "bar", "bar" : "foobar"}) returns ("bar", "foobar"). Also works on an input sequence, in a distributive way: values(({"foo" : "bar", "bar" : "foobar"}, {"foo" : "bar2"})).

Values calls are pushed down to Spark, so this works on billions of items as well:

values(json-lines("file.json"))

encode-for-roundtrip

Not implemented

decode-from-roundtrip

Not implemented

json-doc

json-doc("/Users/sheldon/object.json") returns the (unique) JSON value parsed from a local JSON (but not necessarily JSON Lines) file, where this value may be spread over multiple lines.



    JSONiq coverage

    RumbleDB relies on the JSONiq language.

    JSONiq reference

The complete specification can be found on the JSONiq.org website. The implementation is now in a very advanced stage, and only a few unsupported core JSONiq features remain.

    JSONiq tutorial

A tutorial can be found on the JSONiq.org website. All queries in this tutorial will work with RumbleDB.

    JSONiq tutorial for Python users

A tutorial aimed at Python users is also available. Please keep in mind, though, that examples using unsupported features may not work (see below).

    Nested FLWOR expressions

    FLWOR expressions now support nestedness, for example like so:

let $x := for $x in json-lines("file.json")
          where $x.field eq "foo"
          return $x
return count($x)

    However, keep in mind that parallelization cannot be nested in Spark (there cannot be a job within a job), that is, the following will not work:

for $x in json-lines("file1.json")
let $z := for $y in json-lines("file2.json")
          where $y.foo eq $x.fbar
          return $y
return count($z)

    Expressions pushed down to Spark

    Many expressions are pushed down to Spark out of the box. For example, this will work on a large file leveraging the parallelism of Spark:

count(json-lines("file.json")[$$.field eq "foo"].bar[].foo[[1]])

    What is pushed down so far is:

    • FLWOR expressions (as soon as a for clause is encountered, binding a variable to a sequence generated with json-lines() or parallelize())

    • aggregation functions such as count

    • JSON navigation expressions: object lookup (as well as keys() call), array lookup, array unboxing, filtering predicates

• predicates on positions, including use of the context-dependent functions position() and last(), e.g., json-lines("file.json")[position() ge 10 and position() le last() - 2]

• type checking (instance of, treat as)

• many builtin function calls (head, tail, exists, etc.)

    More expressions working on sequences will be pushed down in the future, prioritized on the feedback we receive.

We also started to push down some expressions to DataFrames and Spark SQL (obtained via structured-json-lines, csv-file and parquet-file calls). In particular, keys() pushes down the schema lookup if used on parquet-file() and structured-json-lines(). Likewise, count() as well as object lookup, array unboxing and array lookup are also pushed down on DataFrames.
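
For example (an illustrative call, with a hypothetical file name), the following reads only the Parquet schema rather than scanning the data:

keys(parquet-file("file.parquet"))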

When an expression does not support pushdown, it will materialize automatically. To avoid issues, the materialization is capped by default at 200 items, but this can be changed on the command line with --materialization-cap. A warning is issued if a materialization happened and the sequence was truncated on screen. An error is thrown if this happens within a query.
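
For example (an illustrative query, assuming file.json is large enough that RumbleDB stores the sequence as an RDD or DataFrame), nesting the sequence into an array constructor forces a materialization that is subject to this cap:

[ json-lines("file.json") ]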

External global variables

Prologs with user-defined functions and global variables are supported. Global external variables are supported (use "--variable:foo bar" on the command line to assign values to them). If the declared type is not string, then the literal supplied on the command line is cast to that type. If the declared type is anyURI, the path supplied on the command line is also resolved against the working directory to an absolute URI. Thus, anyURI should be used to supply paths dynamically through an external variable.
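
For example (a minimal sketch; the variable name is arbitrary):

declare variable $input as anyURI external;

count(json-lines($input))

which can then be invoked with --variable:input file.json on the command line.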

    Context item declarations are supported and a global context item value can be passed with the "--context-item" or "-I" parameter on the command line.
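
For example (a minimal sketch):

declare context item external;

$$

returns "bar" when run with --context-item bar.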

    Library modules

    Library modules are now supported (experimental, please report bugs), and their namespace URI is used for resolution. If it is relative, it is resolved against the importing module location.

    The same schemes are supported as for reading queries and data: file, hdfs, and so on. HTTP is also supported: you can import modules from the Web!

    Example of library module (the file name is library-module.jq):

module namespace m = "library-module.jq";

declare variable $m:x := 2;

declare function m:func($v) {
  $m:x + $v
};

    Example of importing module (assuming it is in the same directory):

import module namespace mod = "library-module.jq";

mod:func($mod:x)

    Try/catch

    Try/catch expressions are supported. Error codes are in the default, RumbleDB namespace and do not need prefixes.

try { 1 div 0 } catch FOAR0001 { "Division by zero!" }

    Supported types

The JSONiq type system is fully supported. Below is a complete list of JSONiq types and their support status. All builtin types are in the default type namespace, so that no prefix is needed. These types are defined in the XML Schema standard. Note that some types specific to XML (e.g., NOTATION, NMTOKENS, NMTOKEN, ID, IDREF, ENTITY, etc.) are not part of the JSONiq standard and not supported by RumbleDB.

| Type | Status |
| --- | --- |
| anyAtomicType | supported |
| anyURI | supported |
| atomic | JSONiq 1.0 only |
| base64Binary | supported |
| boolean | supported |
| byte | supported |
| date | supported |
| dateTime | supported |
| dateTimeStamp | supported |
| dayTimeDuration | supported |
| decimal | supported |
| double | supported |
| duration | supported |
| float | supported |
| gDay | supported |
| gMonth | supported |
| gYear | supported |
| gYearMonth | supported |
| hexBinary | supported |
| int | supported |
| integer | supported |
| long | supported |
| negativeInteger | supported |
| nonNegativeInteger | supported |
| nonPositiveInteger | supported |
| numeric | supported |
| positiveInteger | supported |
| short | supported |
| string | supported |
| time | supported |
| unsignedByte | supported |
| unsignedInt | supported |
| unsignedLong | supported |
| unsignedShort | supported |
| yearMonthDuration | supported |

    Unsupported/Unimplemented features (beta release)

Most core features of JSONiq are now in place, and we are working on getting the last (less used) ones into RumbleDB as well. We prioritize their implementation based on user requests.

    Prolog

    Some prolog settings (base URI, ordering mode, decimal format, namespace declarations) are not supported yet.

    Location hints for the resolution of modules are not supported yet.

    FLWOR features

    Window clauses are not supported, because they are not compatible with the Spark execution model.

    Function types

    Function type syntax is supported.

    Function annotations are not supported (%public, %private...), but this is planned.

    Builtin functions

Most JSONiq and XQuery builtin functions are now supported (see function documentation), except XML-specific functions. A few are still missing; do not hesitate to reach out if you need them.

    Constructors for atomic types are fully supported.
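
For example (constructor calls in the default type namespace):

integer("42"), date("2021-06-04"), dayTimeDuration("PT4H5M")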

Builtin functions cannot yet be used with named function reference expressions (example: concat#2).

    Error variables

Error variables ($err:code, ...) inside catch blocks are not supported.

    Updates and scripting

    There are future plans to support JSONiq updates and scripting.


    Configuration parameters

    The parameters that can be used on the command line as well as on the planned HTTP server are shown below. They are also accessible via the Java API and via Python through the RumbleRuntimeConfiguration class.

RumbleDB runs in three modes. You can select the mode by passing a verb as the first parameter. For example:

spark-submit rumbledb.jar run file.jq -o output-dir -P 1
spark-submit rumbledb.jar run -q '1+1'
spark-submit rumbledb.jar serve -p 8001
spark-submit rumbledb.jar repl -c 10

Previous parameters (--shell, --query-path, --server) still work in a backward-compatible fashion; however, we recommend switching to the new verb-based format.

| Shell parameter | Shortcut | HTTP parameter | Example values | Semantics |
| --- | --- | --- | --- | --- |
| --shell | repl (verb) | N/A | yes, no | yes runs the interactive shell; no executes a query specified with --query-path |
| --shell-filter | N/A | N/A | jq . | Post-processes the output of JSONiq queries on the shell with the specified command (reading the RumbleDB output via stdin) |
| --query | -q | query | 1+1 | A JSONiq query directly provided as a string |
| --query-path | (any text without -- or - is recognized as a query path) | query-path | file:///folder/file.jq | A JSONiq query file to read from (from any file system, even the Web!) |
| --output-path | -o | output-path | file:///folder/output | Where to output to (if the output is large, it will create a sharded directory, otherwise it will create a file) |
| --output-format | -f | N/A | json, csv, avro, parquet, or any other format supported by Spark | An output format to use for the output. Formats other than json can only be output if the query outputs a highly structured sequence of objects (you can nest your query in an annotate() call to specify a schema if it does not) |
| --output-format-option:foo | N/A | N/A | bar | Options to further specify the output format (example: separator character for CSV, compression format...) |
| --overwrite | -O (meaning --overwrite yes) | overwrite | yes, no | Whether to overwrite --output-path; no throws an error if the output file/folder exists |
| --materialization-cap | -c | materialization-cap | 100000 | A cap on the maximum number of items to materialize during query execution for large sequences within a query, for example when nesting an expression producing a large sequence of items (that RumbleDB chose to physically store as an RDD or DataFrame) into an array constructor |
| --result-size | N/A | result-size | 10 | A cap on the maximum number of items to output on the screen or to a local list |
| --number-of-output-partitions | -P | N/A | ad hoc | How many partitions to create in the output, i.e., the number of files that will be created in the output path directory |
| --log-path | N/A | log-path | file:///folder/log.txt | Where to output log information |
| --print-iterator-tree | N/A | N/A | yes, no | For debugging purposes; prints out the expression tree and runtime iterator tree |
| --show-error-info | -v (meaning --show-error-info yes) | show-error-info | yes, no | For debugging purposes. If you want to report a bug, you can use this to get the full exception stack; if no, only a short message is shown in case of error |
| --static-typing | -t (meaning --static-typing yes) | static-typing | yes, no | Activates static type analysis, which annotates the expression tree with inferred types at compile time and enables more optimizations (experimental; deactivated by default) |
| --server | serve (verb) | N/A | yes, no | yes runs RumbleDB as a server on port 8001; run queries with http://localhost:8001/jsoniq?query-path=/folder/foo.json |
| --port | -p | N/A | 8001 (default) | Changes the port of the RumbleDB HTTP server to any of your liking |
| --host | -h | N/A | localhost (default) | Changes the host of the RumbleDB HTTP server to any of your liking |
| --variable:foo | N/A | variable:foo | bar | Initializes the global variable $foo to "bar". The query must contain the corresponding global variable declaration, e.g., "declare variable $foo external;" |
| --context-item | -I | context-item | bar | Initializes the global context item $$ to "bar". The query must contain the corresponding declaration, e.g., "declare context item external;" |
| --context-item-input | -i | context-item-input | - | Reads the context item value from the standard input |
| --context-item-input-format | N/A | context-item-input-format | text or json | Sets the input format to use for parsing the standard input (as text or as a serialized JSON value) |
| --dates-with-timezone | N/A | dates-with-timezone | yes or no | Activates timezone support for the type xs:date (deactivated by default) |
| --lax-json-null-validation | N/A | lax-json-null-validation | yes or no | Allows conflating JSON nulls with absent values when validating nillable object fields, for more flexibility (activated by default) |
| --optimize-general-comparison-to-value-comparison | N/A | optimize-general-comparison-to-value-comparison | yes or no | Activates automatic conversion of general comparisons to value comparisons when applicable (activated by default) |
| --function-inlining | N/A | function-inlining | yes or no | Activates function inlining for non-recursive functions (activated by default) |
| --parallel-execution | N/A | parallel-execution | yes or no | Activates parallel execution when possible (activated by default) |
| --native-execution | N/A | native-execution | yes or no | Activates native (Spark SQL) execution when possible (activated by default) |
| --default-language | N/A | N/A | jsoniq10, jsoniq31, xquery31 | Specifies the query language to be used |
| --optimize-steps | N/A | N/A | yes or no | Allows RumbleDB to optimize steps; might violate stability of document order (activated by default) |
| --optimize-steps-experimental | N/A | N/A | yes or no | Experimentally optimizes steps further by skipping uniqueness and sorting checks in some cases; correctness is not yet verified (deactivated by default) |
| --optimize-parent-pointers | N/A | N/A | yes or no | Allows RumbleDB to remove parent pointers from items if no steps requiring parent pointers are detected statically (activated by default) |
| --static-base-uri | N/A | N/A | "../data/" | Sets the static base URI for the execution. This option overwrites the module location but is overwritten by a declaration inside the query |

RumbleDB ML

RumbleDB ML is a Machine Learning library built on top of the RumbleDB engine that makes ML tasks easier and more productive to perform, thanks to the abstraction layer provided by JSONiq.

    The machine learning capabilities are exposed through JSONiq function items. The concepts of "estimator" and "transformer", which are core to Machine Learning, are naturally function items and fit seamlessly in the JSONiq data model.

    Training sets, test sets, and validation sets, which contain features and labels, are exposed through JSONiq sequences of object items: the keys of these objects are the features and labels.

    The names of the estimators and of the transformers, as well as the functionality they encapsulate, are directly inherited from the SparkML library which RumbleDB ML is based on: we chose not to reinvent the wheel.

    Transformers

    A transformer is a function item that maps a sequence of objects to a sequence of objects.

    It is an abstraction that either performs a feature transformation or generates predictions based on trained models. For example:

    • Tokenizer is a feature transformer that receives textual input data and splits it into individual terms (usually words), which are called tokens.

    • KMeansModel is a trained model and a transformer that can read a dataset containing features and generate predictions as its output.

    Estimators

    An estimator is a function item that maps a sequence of objects to a transformer (yes, you got it right: that's a function item returned by a function item. This is why they are also called higher-order functions!).

    Estimators abstract the concept of a Machine Learning algorithm or any algorithm that fits or trains on data. For example, a learning algorithm such as KMeans is implemented as an Estimator. Calling this estimator on data essentially trains a KMeansModel, which is a Model and hence a Transformer.

    Parameters

    Transformers and estimators are function items in the RumbleDB Data Model. Their first argument is the sequence of objects that represents, for example, the training set or test set. Parameters can be provided as their second argument. This second argument is expected to be an object item. The machine learning parameters form the fields of the said object item as key-value pairs.

    Type Annotations

RumbleDB ML works on highly structured data, because it requires full type information for all the fields in the training set or test set. It is on our development plan to automate the detection of these types when the sequence of objects gets created on the fly.

    RumbleDB supports a user-defined type system with which you can validate and annotate datasets against a JSound schema.

This annotation is required for any dataset used as input to RumbleDB ML, but it is superfluous if the data was directly read from a structured input format such as Parquet, CSV, Avro, SVM or ROOT.

    Examples

    • Tokenizer Example:

declare type local:id-and-sentence as {
  "id": "integer",
  "sentence": "string"
};

let $local-data := (
    {"id": 1, "sentence": "Hi I heard about Spark"},
    {"id": 2, "sentence": "I wish Java could use case classes"},
    {"id": 3, "sentence": "Logistic regression models are neat"}
)
let $df-data := validate type local:id-and-sentence* { $local-data }

let $transformer := get-transformer("Tokenizer")
for $i in $transformer(
    $df-data,
    {"inputCol": "sentence", "outputCol": "output"}
)
return $i

(: returns:
{ "id" : 1, "sentence" : "Hi I heard about Spark", "output" : [ "hi", "i", "heard", "about", "spark" ] }
{ "id" : 2, "sentence" : "I wish Java could use case classes", "output" : [ "i", "wish", "java", "could", "use", "case", "classes" ] }
{ "id" : 3, "sentence" : "Logistic regression models are neat", "output" : [ "logistic", "regression", "models", "are", "neat" ] }
:)

    • KMeans Example:

declare type local:col-1-2-3 as {
  "id": "integer",
  "col1": "decimal",
  "col2": "decimal",
  "col3": "decimal"
};

let $vector-assembler := get-transformer("VectorAssembler")(
  ?,
  { "inputCols" : [ "col1", "col2", "col3" ], "outputCol" : "features" }
)

let $local-data := (
    {"id": 0, "col1": 0.0, "col2": 0.0, "col3": 0.0},
    {"id": 1, "col1": 0.1, "col2": 0.1, "col3": 0.1},
    {"id": 2, "col1": 0.2, "col2": 0.2, "col3": 0.2},
    {"id": 3, "col1": 9.0, "col2": 9.0, "col3": 9.0},
    {"id": 4, "col1": 9.1, "col2": 9.1, "col3": 9.1},
    {"id": 5, "col1": 9.2, "col2": 9.2, "col3": 9.2}
)
let $df-data := validate type local:col-1-2-3* { $local-data }
let $df-data := $vector-assembler($df-data)

let $est := get-estimator("KMeans")
let $tra := $est(
    $df-data,
    {"featuresCol": "features"}
)

for $i in $tra(
    $df-data,
    {"featuresCol": "features"}
)
return $i

(: returns:
{ "id" : 0, "col1" : 0, "col2" : 0, "col3" : 0, "prediction" : 0 }
{ "id" : 1, "col1" : 0.1, "col2" : 0.1, "col3" : 0.1, "prediction" : 0 }
{ "id" : 2, "col1" : 0.2, "col2" : 0.2, "col3" : 0.2, "prediction" : 0 }
{ "id" : 3, "col1" : 9, "col2" : 9, "col3" : 9, "prediction" : 1 }
{ "id" : 4, "col1" : 9.1, "col2" : 9.1, "col3" : 9.1, "prediction" : 1 }
{ "id" : 5, "col1" : 9.2, "col2" : 9.2, "col3" : 9.2, "prediction" : 1 }
:)

    RumbleDB ML Functionality Overview:

    RumblDB eML - Catalogue of Estimators:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    RumbleDB ML - Catalogue of Transformers:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    AFTSurvivalRegression
    ALS
    BisectingKMeans
    BucketedRandomProjectionLSH
    ChiSqSelector
    CountVectorizer
    CrossValidator
    DecisionTreeClassifier
    DecisionTreeRegressor
    FPGrowth
    GBTClassifier
    GBTRegressor
    GaussianMixture
    GeneralizedLinearRegression
    IDF
    Imputer
    IsotonicRegression
    KMeans
    LDA
    LinearRegression
    LinearSVC
    LogisticRegression
    MaxAbsScaler
    MinHashLSH
    MinMaxScaler
    MultilayerPerceptronClassifier
    NaiveBayes
    OneHotEncoder
    OneVsRest
    PCA
    Pipeline
    QuantileDiscretizer
    RFormula
    RandomForestClassifier
    RandomForestRegressor
    StandardScaler
    StringIndexer
    TrainValidationSplit
    VectorIndexer
    Word2Vec
    AFTSurvivalRegressionModel
    ALSModel
    Binarizer
    BisectingKMeansModel
    BucketedRandomProjectionLSHModel
    Bucketizer
    ChiSqSelectorModel
    ColumnPruner
    CountVectorizerModel
    CrossValidatorModel
    DCT
    DecisionTreeClassificationModel
    DecisionTreeRegressionModel
    DistributedLDAModel
    ElementwiseProduct
    FPGrowthModel
    FeatureHasher
    GBTClassificationModel
    GBTRegressionModel
    GaussianMixtureModel
    GeneralizedLinearRegressionModel
    HashingTF
    IDFModel
    ImputerModel
    IndexToString
    Interaction
    IsotonicRegressionModel
    KMeansModel
    LinearRegressionModel
    LinearSVCModel
    LocalLDAModel
    LogisticRegressionModel
    MaxAbsScalerModel
    MinHashLSHModel
    MinMaxScalerModel
    MultilayerPerceptronClassificationModel
    NGram
    NaiveBayesModel
    Normalizer
    OneHotEncoder
    OneHotEncoderModel
    OneVsRestModel
    PCAModel
    PipelineModel
    PolynomialExpansion
    RFormulaModel
    RandomForestClassificationModel
    RandomForestRegressionModel
    RegexTokenizer
    SQLTransformer
    StandardScalerModel
    StopWordsRemover
    StringIndexerModel
    Tokenizer
    TrainValidationSplitModel
    VectorAssembler
    VectorAttributeRewriter
    VectorIndexerModel
    VectorSizeHint
    VectorSlicer
    Word2VecModel
    
    declare type local:id-and-sentence as {
      "id": "integer",
      "sentence": "string"
    };
    
    
    let $local-data := (
        {"id": 1, "sentence": "Hi I heard about Spark"},
        {"id": 2, "sentence": "I wish Java could use case classes"},
        {"id": 3, "sentence": "Logistic regression models are neat"}
    )
    let $df-data := validate type local:id-and-sentence* { $local-data }
    
    let $transformer := get-transformer("Tokenizer")
    for $i in $transformer(
        $df-data,
        {"inputCol": "sentence", "outputCol": "output"}
    )
    return $i
    
    // returns
    // { "id" : 1, "sentence" : "Hi I heard about Spark", "output" : [ "hi", "i", "heard", "about", "spark" ] }
    // { "id" : 2, "sentence" : "I wish Java could use case classes", "output" : [ "i", "wish", "java", "could", "use", "case", "classes" ] }
    // { "id" : 3, "sentence" : "Logistic regression models are neat", "output" : [ "logistic", "regression", "models", "are", "neat" ] }
    declare type local:col-1-2-3 as {
      "id": "integer",
      "col1": "decimal",
      "col2": "decimal",
      "col3": "decimal"
    };
    
    let $vector-assembler := get-transformer("VectorAssembler")(
      ?,
      { "inputCols" : [ "col1", "col2", "col3" ], "outputCol" : "features" }
    )
    
    let $local-data := (
        {"id": 0, "col1": 0.0, "col2": 0.0, "col3": 0.0},
        {"id": 1, "col1": 0.1, "col2": 0.1, "col3": 0.1},
        {"id": 2, "col1": 0.2, "col2": 0.2, "col3": 0.2},
        {"id": 3, "col1": 9.0, "col2": 9.0, "col3": 9.0},
        {"id": 4, "col1": 9.1, "col2": 9.1, "col3": 9.1},
        {"id": 5, "col1": 9.2, "col2": 9.2, "col3": 9.2}
    )
    let $df-data := validate type local:col-1-2-3* {$local-data }
    let $df-data := $vector-assembler($df-data)
    
    let $est := get-estimator("KMeans")
    let $tra := $est(
        $df-data,
        {"featuresCol": "features"}
    )
    
    for $i in $tra(
        $df-data,
        {"featuresCol": "features"}
    )
    return $i
    
    // returns
    // { "id" : 0, "col1" : 0, "col2" : 0, "col3" : 0, "prediction" : 0 }
    // { "id" : 1, "col1" : 0.1, "col2" : 0.1, "col3" : 0.1, "prediction" : 0 }
    // { "id" : 2, "col1" : 0.2, "col2" : 0.2, "col3" : 0.2, "prediction" : 0 }
    // { "id" : 3, "col1" : 9, "col2" : 9, "col3" : 9, "prediction" : 1 }
    // { "id" : 4, "col1" : 9.1, "col2" : 9.1, "col3" : 9.1, "prediction" : 1 }
    // { "id" : 5, "col1" : 9.2, "col2" : 9.2, "col3" : 9.2, "prediction" : 1 }
    - aggregationDepth: integer
    - censorCol: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - maxIter: integer
    - predictionCol: string
    - quantileProbabilities: array (of double)
    - quantilesCol: string
    - tol: double
    - alpha: double
    - checkpointInterval: integer
    - coldStartStrategy: string
    - finalStorageLevel: string
    - implicitPrefs: boolean
    - intermediateStorageLevel: string
    - itemCol: string
    - maxIter: integer
    - nonnegative: boolean
    - numBlocks: integer
    - numItemBlocks: integer
    - numUserBlocks: integer
    - predictionCol: string
    - rank: integer
    - ratingCol: string
    - regParam: double
    - seed: double
    - userCol: string
    - distanceMeasure: string
    - featuresCol: string
    - k: integer
    - maxIter: integer
    - minDivisibleClusterSize: double
    - predictionCol: string
    - seed: double
    - bucketLength: double
    - inputCol: string
    - numHashTables: integer
    - outputCol: string
    - seed: double
    - fdr: double
    - featuresCol: string
    - fpr: double
    - fwe: double
    - labelCol: string
    - numTopFeatures: integer
    - outputCol: string
    - percentile: double
    - selectorType: string
    - binary: boolean
    - inputCol: string
    - maxDF: double
    - minDF: double
    - minTF: double
    - outputCol: string
    - vocabSize: integer
    - collectSubModels: boolean
    - estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - numFolds: integer
    - parallelism: integer
    - seed: double
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - seed: double
    - varianceCol: string
    - itemsCol: string
    - minConfidence: double
    - minSupport: double
    - numPartitions: integer
    - predictionCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - lossType: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - validationIndicatorCol: string

GBTRegressor
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - lossType: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - validationIndicatorCol: string

GaussianMixture
    - featuresCol: string
    - k: integer
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - seed: double
    - tol: double

GeneralizedLinearRegression
    - family: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - link: string
    - linkPower: double
    - linkPredictionCol: string
    - maxIter: integer
    - offsetCol: string
    - predictionCol: string
    - regParam: double
    - solver: string
    - tol: double
    - variancePower: double
    - weightCol: string

IDF
    - inputCol: string
    - minDocFreq: integer
    - outputCol: string

Imputer
    - inputCols: array (of string)
    - missingValue: double
    - outputCols: array (of string)
    - strategy: string

IsotonicRegression
    - featureIndex: integer
    - featuresCol: string
    - isotonic: boolean
    - labelCol: string
    - predictionCol: string
    - weightCol: string

KMeans
    - distanceMeasure: string
    - featuresCol: string
    - initMode: string
    - initSteps: integer
    - k: integer
    - maxIter: integer
    - predictionCol: string
    - seed: double
    - tol: double

LDA
    - checkpointInterval: integer
    - docConcentration: double
    - docConcentration: array (of double)
    - featuresCol: string
    - k: integer
    - keepLastCheckpoint: boolean
    - learningDecay: double
    - learningOffset: double
    - maxIter: integer
    - optimizeDocConcentration: boolean
    - optimizer: string
    - seed: double
    - subsamplingRate: double
    - topicConcentration: double
    - topicDistributionCol: string

LinearRegression
    - aggregationDepth: integer
    - elasticNetParam: double
    - epsilon: double
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - loss: string
    - maxIter: integer
    - predictionCol: string
    - regParam: double
    - solver: string
    - standardization: boolean
    - tol: double
    - weightCol: string

LinearSVC
    - aggregationDepth: integer
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - maxIter: integer
    - predictionCol: string
    - rawPredictionCol: string
    - regParam: double
    - standardization: boolean
    - threshold: double
    - tol: double
    - weightCol: string

LogisticRegression
    - aggregationDepth: integer
    - elasticNetParam: double
    - family: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - lowerBoundsOnCoefficients: object (of object of double)
    - lowerBoundsOnIntercepts: object (of double)
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - regParam: double
    - standardization: boolean
    - threshold: double
    - thresholds: array (of double)
    - tol: double
    - upperBoundsOnCoefficients: object (of object of double)
    - upperBoundsOnIntercepts: object (of double)
    - weightCol: string
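For instance, the LogisticRegression parameters above map one-to-one to fields of the parameter object passed when the estimator function is applied. A sketch, assuming get-estimator; $training and $test are placeholders for sequences that already contain a features vector column and a label column:

let $logistic := get-estimator("LogisticRegression")
(: $training and $test are placeholder sequences of objects :)
let $model := $logistic($training, {
  "labelCol" : "label",
  "maxIter" : 10,
  "regParam" : 0.01,
  "elasticNetParam" : 0.5
})
for $result in $model($test, {})
return { "id" : $result.id, "prediction" : $result.prediction }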

MaxAbsScaler
    - inputCol: string
    - outputCol: string

MinHashLSH
    - inputCol: string
    - numHashTables: integer
    - outputCol: string
    - seed: double

MinMaxScaler
    - inputCol: string
    - max: double
    - min: double
    - outputCol: string

MultilayerPerceptronClassifier
    - blockSize: integer
    - featuresCol: string
    - initialWeights: object (of double)
    - labelCol: string
    - layers: array (of integer)
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - solver: string
    - stepSize: double
    - thresholds: array (of double)
    - tol: double

NaiveBayes
    - featuresCol: string
    - labelCol: string
    - modelType: string
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - smoothing: double
    - thresholds: array (of double)
    - weightCol: string

OneHotEncoderEstimator
    - dropLast: boolean
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCols: array (of string)

OneVsRest
    - featuresCol: string
    - labelCol: string
    - parallelism: integer
    - predictionCol: string
    - rawPredictionCol: string
    - weightCol: string

PCA
    - inputCol: string
    - k: integer
    - outputCol: string

QuantileDiscretizer
    - handleInvalid: string
    - inputCol: string
    - inputCols: array (of string)
    - numBuckets: integer
    - numBucketsArray: array (of integer)
    - outputCol: string
    - outputCols: array (of string)
    - relativeError: double

RFormula
    - featuresCol: string
    - forceIndexLabel: boolean
    - formula: string
    - handleInvalid: string
    - labelCol: string
    - stringIndexerOrderType: string

RandomForestClassifier
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - subsamplingRate: double
    - thresholds: array (of double)

RandomForestRegressor
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - predictionCol: string
    - seed: double
    - subsamplingRate: double

StandardScaler
    - inputCol: string
    - outputCol: string
    - withMean: boolean
    - withStd: boolean

StringIndexer
    - handleInvalid: string
    - inputCol: string
    - outputCol: string
    - stringOrderType: string

TrainValidationSplit
    - collectSubModels: boolean
    - estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - parallelism: integer
    - seed: double
    - trainRatio: double

VectorIndexer
    - handleInvalid: string
    - inputCol: string
    - maxCategories: integer
    - outputCol: string

Word2Vec
    - inputCol: string
    - maxIter: integer
    - maxSentenceLength: integer
    - minCount: integer
    - numPartitions: integer
    - outputCol: string
    - seed: double
    - stepSize: double
    - vectorSize: integer
    - windowSize: integer
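Estimators compose naturally with the transformers listed in the next section; for example, text can be tokenized with Tokenizer before fitting Word2Vec on the resulting word arrays. A sketch, assuming get-transformer, get-estimator, and annotate, with sentences.json as a placeholder:

let $docs := annotate(
  json-file("sentences.json"),  (: placeholder path :)
  { "id" : "integer", "text" : "string" }
)
let $tokenizer := get-transformer("Tokenizer")
let $tokens := $tokenizer($docs, { "inputCol" : "text", "outputCol" : "words" })
let $w2v := get-estimator("Word2Vec")
let $model := $w2v($tokens, { "inputCol" : "words", "outputCol" : "embedding", "vectorSize" : 8, "minCount" : 1 })
for $result in $model($tokens, {})
return $result.embedding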

Transformers

Each transformer below is listed with the parameters that it supports. Fitted models additionally expose a parent parameter: the estimator function that produced them.

AFTSurvivalRegressionModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - quantileProbabilities: array (of double)
    - quantilesCol: string

ALSModel
    - coldStartStrategy: string
    - itemCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - userCol: string

Binarizer
    - inputCol: string
    - outputCol: string
    - threshold: double

BisectingKMeansModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string

BucketedRandomProjectionLSHModel
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

Bucketizer
    - handleInvalid: string
    - inputCol: string
    - inputCols: array (of string)
    - outputCol: string
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - splits: array (of double)
    - splitsArray: array (of array of double)

ChiSqSelectorModel
    - featuresCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

CountVectorizerModel
    - binary: boolean
    - inputCol: string
    - minTF: double
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

CrossValidatorModel
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

DCT
    - inputCol: string
    - inverse: boolean
    - outputCol: string

DecisionTreeClassificationModel
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - thresholds: array (of double)

DecisionTreeRegressionModel
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - varianceCol: string

DistributedLDAModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - seed: double
    - topicDistributionCol: string

ElementwiseProduct
    - inputCol: string
    - outputCol: string
    - scalingVec: object (of double)

FPGrowthModel
    - itemsCol: string
    - minConfidence: double
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string

FeatureHasher
    - categoricalCols: array (of string)
    - inputCols: array (of string)
    - numFeatures: integer
    - outputCol: string

GBTClassificationModel
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - thresholds: array (of double)

GBTRegressionModel
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double

GaussianMixtureModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string

GeneralizedLinearRegressionModel
    - featuresCol: string
    - linkPredictionCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string

HashingTF
    - binary: boolean
    - inputCol: string
    - numFeatures: integer
    - outputCol: string

IDFModel
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

ImputerModel
    - inputCols: array (of string)
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

IndexToString
    - inputCol: string
    - labels: array (of string)
    - outputCol: string

Interaction
    - inputCols: array (of string)
    - outputCol: string

IsotonicRegressionModel
    - featureIndex: integer
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string

KMeansModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string

LinearRegressionModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string

LinearSVCModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - rawPredictionCol: string
    - threshold: double
    - weightCol: string

LocalLDAModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - seed: double
    - topicDistributionCol: string

LogisticRegressionModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - threshold: double
    - thresholds: array (of double)

MaxAbsScalerModel
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

MinHashLSHModel
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

MinMaxScalerModel
    - inputCol: string
    - max: double
    - min: double
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

MultilayerPerceptronClassificationModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - thresholds: array (of double)

NGram
    - inputCol: string
    - n: integer
    - outputCol: string

NaiveBayesModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - thresholds: array (of double)

Normalizer
    - inputCol: string
    - outputCol: string
    - p: double

OneHotEncoder
    - dropLast: boolean
    - inputCol: string
    - outputCol: string

OneHotEncoderModel
    - dropLast: boolean
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

OneVsRestModel
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - rawPredictionCol: string

PCAModel
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

PipelineModel
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

PolynomialExpansion
    - degree: integer
    - inputCol: string
    - outputCol: string

RFormulaModel
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

RandomForestClassificationModel
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - subsamplingRate: double
    - thresholds: array (of double)

RandomForestRegressionModel
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - subsamplingRate: double

RegexTokenizer
    - gaps: boolean
    - inputCol: string
    - minTokenLength: integer
    - outputCol: string
    - pattern: string
    - toLowercase: boolean

SQLTransformer
    - statement: string

StandardScalerModel
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

StopWordsRemover
    - caseSensitive: boolean
    - inputCol: string
    - locale: string
    - outputCol: string
    - stopWords: array (of string)

StringIndexerModel
    - handleInvalid: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

Tokenizer
    - inputCol: string
    - outputCol: string

TrainValidationSplitModel
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

VectorAssembler
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCol: string

VectorIndexerModel
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

VectorSizeHint
    - handleInvalid: string
    - inputCol: string
    - size: integer

VectorSlicer
    - indices: array (of integer)
    - inputCol: string
    - names: array (of string)
    - outputCol: string

Word2VecModel
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
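Pure transformers in this list, i.e., those without a parent parameter, can be applied directly, with no fitting step. A final sketch, assuming get-transformer and annotate, with measurements.json as a placeholder, using Binarizer's inputCol, outputCol, and threshold parameters from above:

let $data := annotate(
  json-file("measurements.json"),  (: placeholder path :)
  { "id" : "integer", "value" : "double" }
)
let $binarizer := get-transformer("Binarizer")
for $result in $binarizer($data, { "inputCol" : "value", "outputCol" : "flag", "threshold" : 0.5 })
return { "id" : $result.id, "flag" : $result.flag }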
