Only this pageAll pages
Powered by GitBook
1 of 55

Documentation

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

On the online sandbox

If you really want to start writing queries right now, there is a public sandbox here that will just work and guide you. You only need to have a Google account to be able to execute them, as this exposes our Jupyter notebook via the Colab environment. You are also free to download and use this notebook with any other provider or even your own local Jupyter and it will work just the same: the queries are all shipped to our own, small public backend no matter what. However, this may require a bit of configuration (JAVA_HOME pointing to Java 17 or 21, and if you have conflicting Spark installations in addition to pyspark, SPARK_HOME pointing to a Spark 4.0 installation).

If you do not have a Google account, you can also use our simpler sandbox page without Jupyter, here where you can type small queries and see the results.

With the sandboxes above, you can only inline your data in the query or access a dataset with an HTTP URL.

Once you want to take it to the next level and query your own data on your laptop, you will find instructions below to use RumbleDB on your own computer manually, which among others will allow you to query any files stored on your local disk. And then, you can take a leap of faith and use RumbleDB on a large cluster (Amazon EMR, your company's cluster, etc).

With homebrew

It is also possible to use RumbleDB with brew, however there is currently no way to adjust memory usage. To install RumbleDB with brew, type the commands:

You can test that it works with:

Then, launch a JSONiq shell with:

The RumbleDB shell appears:

You can now start typing simple queries like the following few examples. Press three times the return key to execute a query.

Type mapping

Any expression in JSONiq returns a sequence of items. Any variable in JSONiq is bound to a sequence of items. Items can be objects, arrays, or atomic values (strings, integers, booleans, nulls, dates, binary, durations, doubles, decimal numbers, etc). A sequence of items can be a sequence of just one item, but it can also be empty, or it can be as large as to contain millions, billions or even trillions of items. Obviously, for sequence longer than a billion items, it is a better idea to use a cluster than a laptop. A relational table (or more generally a data frame) corresponds to a sequence of object items sharing the same schema. However, sequences of items are more general than tables or data frames and support heterogeneity seamlessly.

When passing Python values to JSONiq or getting them from a JSONiq queries, the mapping to and from Python is as follows:

Python
JSONiq

As a pip package

You can use RumbleDB from within Python programmes by running

Java version

Important note: since the jsoniq package depends on pyspark 4, Java 17 or Java 21 is a requirement. If another version of Java is installed, the execution of a Python program attempting to create a RumbleSession will lead to an error message on stderr that contains explanations.

You can control your Java version with:

Information about how this package is used can be found .

list

array item

str

string item

int

integer item

bool

boolean item

None

null item

Furthermore, other JSONiq types will be mapped to string literals. Users who want to preserve JSONiq types can use the Item API instead.

JSONiq is very powerful and expressive. You will find tutorials as well as a reference on JSONiq.org.

tuple

sequence of items

dict

object item

Through the Java API with Maven

RumbleDB can also be used as a maven dependency. You can find it here.

The JavaDoc documentation is accessible here.

JSONiq 1.0

JSONiq 1.0 is the first version of the JSONiq language, currently in use.

It is a cousin of the XQuery 3.0 language and was developed by W3C XML Query Working Group members as a proposal of how to integrate JSON support into the language, while making it appealing to the JSON community, and making it easy for an existing XQuery engine to implement.

or

or

brew tap rumbledb/rumble
brew install --build-from-source rumble
rumbledb run -q '1+1'
rumbledb repl
"Hello, World"
Common issue: colliding Spark version

Some users who have already configured a Spark installation on their machine may encounter a version issue if SPARK_HOME points to this alternate installation, and it is a different version of Spark (e.g., 3.5 or 3.4). The jsoniq package requires Spark 4.0.

If this happens, RumbleDB should output an informative error message. They are two ways to fix such conflicts:

  • The easiest is remove the SPARK_HOME environment variable completely. This will have RumbleDB fall back to the Spark 4.0 installation that ships with its pyspark dependency.

  • Or you can instead change the value of SPARK_HOME to point to a Spark 4.0 installation, if you have one. This would be for more advanced users who know what they are doing.

If you have another working Spark installation on your machine, you can see which version it is with

The above command is of course expected not to work for first-time users who only installed the jsoniq package and never installed Spark additionally on their machine.

pip install jsoniq
java -version
in this section
    ____                  __    __     ____  ____ 
   / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
  / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  


Master: local[*]
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -

rumble$
 1 + 1
 
 (3 * 4) div 5
 
spark-submit --version

Writing JSONiq queries in Python

You can use RumbleDB from within Python programmes by running

Java version

Important note: since the jsoniq package depends on pyspark 4, Java 17 or Java 21 is a requirement. If another version of Java is installed, the execution of a Python program attempting to create a RumbleSession will lead to an error message on stderr that contains explanations.

You can control your Java version with:

Information about how this package is used can be found in this section.

Common issue: colliding Spark version

Some advanced users who have already configured a Spark installation on their machine may encounter a version issue if SPARK_HOME points to this alternate installation, and it is a different version of Spark (e.g., 3.5 or 3.4). The jsoniq package requires Spark 4.0.

If this happens, RumbleDB should output an informative error message. They are two ways to fix such conflicts:

  • The easiest is remove the SPARK_HOME environment variable completely. This will have RumbleDB fall back to the Spark 4.0 installation that ships with its pyspark dependency.

  • Or you can instead change the value of SPARK_HOME to point to a Spark 4.0 installation, if you have one. This would be for more advanced users who know what they are doing.

If you have another working Spark installation on your machine, you can see which version it is with

The above command is of course expected not to work for first-time users who only installed the jsoniq package and never installed Spark additionally on their machine.

High-level information on the library

A RumbleSession is a wrapper around a SparkSession that additionally makes sure the RumbleDB environment is in scope.

JSONiq queries are invoked with rumble.jsoniq() in a way similar to the way Spark SQL queries are invoked with spark.sql().

JSONiq variables can be bound to lists of JSON values (str, int, float, True, False, None, dict, list) or to Pyspark DataFrames. A JSONiq query can use as many variables as needed (for example, it can join between different collections).

It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple such as json-lines(), text-file(), parquet-file(), csv-file(), etc.

The resulting sequence of items can be retrieved as a list of JSON values, as a Pyspark DataFrame, or, for advanced users, as an RDD or with a streaming iteration over the items using the .

It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc in a way similar to how DataFrames are written back by Pyspark.

The design goal is that it is possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.

Any feedback or error reports are very welcome.

Ways to install and use

There are many ways to install and use RumbleDB. For example:

  • By simply using one of our online sandboxes (Jupyter notebook or simple sandbox page)

  • Our newest library: by installing a pip package (pip install jsoniq)

  • By running the standalone RumbleDB jar with Java on your laptop

  • By installing with homebrew

  • By installing Spark yourself on your laptop (for more control on Spark parameters) and use a small RumbleDB jar with spark-submit

  • By using our docker image on your laptop (go to the "Run with docker" section on the left menu)

  • By uploading the small RumbleDB jar to an existing Spark cluster (such as AWS EMR)

  • By running RumbleDB as an HTTP server in the background and connecting to it in a Jupyter notebook with the %%jsoniq magic.

  • By installing it manually on your machine.

Further steps

After installing RumbleDB, further steps could involve:

  • Learning JSONiq. More details can be found in the JSONiq section of this documentation and in the and .

  • Storing some data on S3, creating a Spark cluster on Amazon EMR (or Azure blob storage and Azure, etc), and querying the data with RumbleDB. More details are found in the cluster section of this documentation.

  • Using RumbleDB with Jupyter notebooks. For this, you can run RumbleDB as a server with a simple command, and get started by downloading the and just clicking your way through it. More details are found in the Jupyter notebook section of this documentation. Jupyter notebooks work both locally and on a cluster.

Command line (java -jar)

Java version (important)

You need to make sure that you have Java 11 or 17 and that, if you have several versions installed, JAVA_HOME correctly points to Java 11 or 17.

RumbleDB works with both Java 11 and Java 17. You can check the Java version that is configured on your machine with:

If you do not have Java, you can download version 11 or 17 from .

Do make sure it is not Java 8, which will not work.

In jupyter notebooks

The Python edition of Rumble can be used to directly write JSONiq queries in Jupyter notebook cells. This is explained . You first need to install the library as described .

pip install jsoniq
java -version
Write JSONiq code, and share it on the Web, as others can import it from HTTP in just one line from within their queries (no package publication or installation required) or specify an HTTP URL as an input query to RumbleDB!
JSONiq specification
tutorials
main JSONiq tutorial as a Jupyter notebook
here
here

The JSONiq language

JSONiq is a query and processing language specifically designed for the popular JSON data model. The main ideas behind JSONiq are based on lessons learned in more than 30 years of relational query systems and more than 15 years of experience with designing and implementing query languages for semi-structured data like XML and RDF.

The main source of inspiration behind JSONiq is XQuery, which has been proven so far a successful and productive query language for semi-structured data (in particular XML). JSONiq borrowed a large numbers of ideas from XQuery, like the structure and semantics of a FLWOR construct, the functional aspect of the language, the semantics of comparisons in the face of data heterogeneity, the declarative, snapshot-based updates. However, unlike XQuery, JSON is not concerned with the peculiarities of XML, like mixed content, ordered children, the confusion between attributes and elements, the complexities of namespaces and QNames, or the complexities of XML Schema, and so on.

The power of the XQuery's FLWOR construct and the functional aspect, combined with the simplicity of the JSON data model result in a clean, sleek and easy to understand data processing language. As a matter of fact, JSONiq is a language that can do more than queries: it can describe powerful data processing programs, from transformations, selections, joins of heterogeneous data sets, data enrichment, information extraction, information cleaning, and so on.

Technically, the main characteristics of JSONiq (and XQuery) are the following:

  • It is a set-oriented language. While most programming languages are designed to manipulate one object at a time, JSONiq is designed to process sets (actually, sequences) of data objects.

  • It is a functional language. A JSONiq program is an expression; the result of the program is the result of the evaluation of the expression. Expressions have fundamental role in the language: every language construct is an expression, and expressions are fully composable.

  • It is a declarative language. A program specifies what is the result being calculated, and does not specify low level algorithms like the sort algorithm, the fact that an algorithm is executed in main memory or is external, on a single machine or parallelized on several machines, or what access patterns (aka indexes) are being used during the evaluation of the program. Such implementation decisions should be taken automatically, by an optimizer, based on the physical characteristics of the data, and of the hardware environment. Just like a traditional database would do. The language has been designed from day one with optimizability in mind.

  • It is designed for nested, heterogeneous, semi-structured data. Data structures in JSON can be nested with arbitrary depth, do not have a specific type pattern (i.e. are heterogeneous), and may or may not have one or more schemas that describe the data. Even in the case of a schema, such a schema can be open, and/or simply partially describe the data. Unlike SQL, which is designed to query tabular, flat, homogeneous structures. JSONiq has been designed from scratch as a query for nested and heterogeneous data.

RumbleDB 2.0 "Lemon Ironwood"

RumbleDB is a querying engine that allows you to query your large, messy datasets with ease and productivity. It covers the entire data pipeline: clean up, structure, normalize, validate, convert to an efficient binary format, and feed it right into Machine Learning estimators and models, all within the JSONiq language.

RumbleDB supports JSON-like datasets including JSON, JSON Lines, Parquet, Avro, SVM, CSV, ROOT as well as text files, of any size from kB to at least the two-digit TB range (we have not found the limit yet).

RumbleDB is both good at handling small amounts of data on your laptop (in which case it simply runs locally and efficiently in a single-thread) as well as large amounts of data by spreading computations on your laptop cores, or onto a large cluster (in which case it leverages Spark automagically).

RumbleDB can also be used to easily and efficiently convert data from a format to another, including from JSON to Parquet thanks to JSound validation.

It runs on many local or distributed filesystems such as HDFS, S3, Azure blob storage, and HTTP (read-only), and of course your local drive as well. You can use any of these file systems to store your datasets, but also to store and share your queries and functions as library modules with other users, worldwide or within your institution, who can import them with just one line of code. You can also output the results of your query or the log to these filesystems (as long as you have write access).

With RumbleDB, queries can be written in the tailor-made and expressive JSONiq language. Users can write their queries declaratively and start with just a few lines. No need for complex JSON parsing machinery as JSONiq supports the JSON data model natively.

The core of RumbleDB lies in JSONiq's FLWOR expressions, the semantics of which map beautifully to DataFrames and Spark SQL. Likewise expression semantics is seamlessly translated to transformations on RDDs or DataFrames, depending on whether a structure is recognized or not. Transformations are not exposed as function calls, but are completely hidden behind JSONiq queries, giving the user the simplicity of an SQL-like language and the flexibility needed to query heterogeneous, tree-like data that does not fit in DataFrames.

This documentation provides you with instructions on how to get started, examples of data sets and queries that can be executed locally or on a cluster, links to JSONiq reference and tutorials, notes on the function library implemented so far, and instructions on how to compile RumbleDB from scratch.

Please note that this is a (maturing) beta version. We welcome bug reports in the GitHub issues section.

RumbleDB Reference

Applying updates

At the end of an updating program, the resulting PUL is applied with upd:applyUpdates (part of the XQuery Update Facility standard), which is extended as follows:

  • First, before applying any update, each update primitive (except the jupd:insert-into-object primitives, which do not have any target) locks onto its target by resolving the selector on the object or array it updates. If the selector is resolved to the empty sequence, the update primitive is ignored in step 2. After this operation, each of these update primitives will contain a reference to either the pair (for an object) or the value (for an array) on or relatively to which it operates.

  • Then each update primitive is applied, using the target references that were resolved at step 1. The order in which they are applied is not relevant and does not affect the final instance of the data model. After applying all updates, an error jerr:JNUP0006 is raised upon pair name collision within the same object.

builtin function calls
RumbleDB Item API
Download RumbleDB

RumbleDB is just a download and no installation is required.

In order to run RumbleDB, you simply need to download rumbledb-2.0.0-standalone.jar from the download page and put it in a directory of your choice, for example, right besides your data.

Make sure to use the corresponding jar name accordingly in all our instructions in lieu of rumbledb.jar.

You can test that it works with:

or launch a JSONiq shell with:

If you run out of memory, you can set allocate more memory to Java with an additional Java parameter, e.g., -Xmx10g

The RumbleDB shell appears:

You can now start typing simple queries like the following few examples. Press three times the return key to execute a query.

or

or

Javadoc

If you plan to add the jar to your Java environment to use RumbleDB in your Java programs, the JavaDoc documentation can be found here. The entry point is the class org.rumbledb.api.Rumble.

java -version
AdoptOpenJDK
spark-submit --version
java -jar rumbledb-2.0.0-standalone.jar run -q '1+1'
java -jar rumbledb-2.0.0-standalone.jar repl
    ____                  __    __     ____  ____ 
   / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
  / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  


Master: local[*]
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -

rumble$
"Hello, World"
 1 + 1
 
 (3 * 4) div 5
 

Command line (with spark-submit and an existing Spark installation)

This method gives you more control about the Spark configuration than the experimental standalone jar, in particular you can increase the memory used, change the number of cores, and so on.

If you use Linux, Florian Kellner also kindly contributed an installation script for Linux users that roughly takes care of what is described below for you.

Users of the Python edition (pip install jsoniq) should not have to install Spark manually because the pip package automatically installs pyspark and this contains a Spark 4 installation. However, advanced users who have multiple Spark installations or encounter a Spark version conflict in Python may find the information below useful.

Install Spark (if you do not have installed already)

RumbleDB requires an Apache Spark installation on Linux, Mac or Windows. Important note: it needs to be either Spark 4, or the Scala 2.13 build of Spark 3.5.

It is straightforward to directly , unpack it and put it at a location of your choosing. We recommend to pick Spark 4.0.0.

SPARK_HOME and PATH (you need to check even if you already have an existing installation)

You then need to point the SPARK_HOME environment variable to this directory, and to additionally add the subdirectory "bin" within the unpacked directory to the PATH variable. On macOS this is done by adding.

Users of the Python edition who have additional Spark installations must ensure that SPARK_HOME and PATH point to a Spark 4 installation. The Python edition does not work with Spark 3.5.

(with SPARK_HOME appropriately set to match your unzipped Spark directory) to the file .zshrc in your home directory, then making sure to force the change with

in the shell. In Windows, changing the PATH variable is done in the control panel. In Linux, it is similar to macOS.

As an alternative, users who love the command line can also install Spark with a package management system instead, such as brew (on macOS) or apt-get (on Ubuntu). However, these might be less predictable than a raw download.

You can test that Spark was correctly installed with:

Java version (important)

You need to make sure that you have Java 11 (for Spark 3.5) or 17 (for Spark 3.5 or 4.0) or 21 (for Spark 4.0) and that, if you have several versions installed, JAVA_HOME correctly points to the correct Java installation. Spark only supports Java 11 or 17 or 21 depending on the version.

Spark 4+ is documented to work with both Java 17 and Java 21. If there is an issue with the Java version, RumbleDB will inform you with an appropriate error message. You can check the Java version that is configured on your machine with:

Download the small version of the RumbleDB jar

Like Spark, RumbleDB is just a download and no installation is required.

In order to run RumbleDB, you simply need to download one of the small .jar files from the and put it in a directory of your choice, for example, right besides your data.

If you use Spark 3.5, use rumbledb-2.0.0-for-spark-3.5-scala-2.13.jar.

If you use Spark 4.0, use rumbledb-2.0.0-for-spark-4.0.jar.

These jars do not embed Spark, since you chose to set it up separately. They will work with your Spark installation with the spark-submit command.

Make sure to use the corresponding jar name accordingly in all our instructions in lieu of rumbledb.jar, replacing rumbledb.jar with the actual name of the jar file you downloaded.

In a shell, from the directory where the RumbleDB .jar lies, type, all on one line:

replacing rumbledb.jar with the actual name of the jar file you downloaded.

The RumbleDB shell appears:

You can now start typing simple queries like the following few examples. Press three times the return key to execute a query.

or

or

As an HTTP server

Now that there is a pip package available, using it may appeal more to some users than this older approach based on running RumbleDB as a server (you can put your JSONiq queries in rumble.jsoniq() calls). We keep this documentation for any users interested in the server capabilities of RumbleDB.

Starting the HTTP server

RumbleDB can be run as an HTTP server that listens for queries. In order to do so, you can use the --server and --port parameters:

This command will not return until you force it to (Ctrl+C on Linux and Mac). This is because the server has to run permanently to listen to incoming requests.

Most users will not have to do anything beyond running the above command. For most of them, the next step would be to open a Jupyter notebook that connects to this server automatically.

This HTTP server is built as a basic server for the single user use case, i.e., the user runs their own RumbleDB server on their laptop or cluster, and connects to it via their Jupyter notebook, one query at a time. Some of our users have more advanced needs, or have a larger user base, and typically prefer to implement their own HTTP server, lauching RumbleDB queries either via the public RumbleDB Java API (like the basic HTTP server does -- so its code can serve as a demo of the Java API) or via the RumbleDB CLI.

Caution! Launching a server always has consequences on security, especially as RumbleDB can read from and write to your disk; So make sure you activate your firewall. In later versions, we may support authentication tokens.

Testing that it works (not necessary for most end users)

The HTTP server is meant not to be used directly by end users, but instead to make it possible to integrate RumbleDB in other languages and environments, such as Python and Jupyter notebooks.

To test that the server is running, you can try the following address in your browser, assuming you have a query stored locally at /tmp/query.jq. All queries have to go to the /jsoniq path.

The request returns a JSON object, and the resulting sequence of items is in the values array.

Almost all parameters from the command line are exposed as HTTP parameters.

A query can also be submitted in the request body:

Use with Jupyter notebooks

With the HTTP server running, if you have installed Python and Jupyter notebooks (for example with the Anaconda data science package that does all of it automatically), you can create a RumbleDB magic by just executing the following code in a cell:

Where, of course, you need to adapt the port (8001) to the one you picked previously.

Then, you can execute queries in subsequent cells with:

or on multiple lines:

Use with clusters

You can also let RumbleDB run as an HTTP server on the master node of a cluster, e.g. on Amazon EMR or Azure. You just need to:

  • Create the cluster (it is usually just the push of a few buttons in Amazon or Azure)

  • Wait for a few minutes

  • Make sure that your own IP has incoming access to EMR machines by configuring the security group properly. You usually only need to do so the first time you set up a cluster (if your IP address remains the same), because the security group configuration will be reused for future EMR clusters.

Then there are two options

With SSH tunneling

  • Connect to the master with SSH with an extra parameter for securely tunneling the HTTP connection (for example -L 8001:localhost:8001 or any port of your choosing)

  • Download the RumbleDB jar to the master node

    wget https://github.com/RumbleDB/rumble/releases/download/v1.24.0/rumbledb-1.24.0.jar

  • Launch the HTTP server on the master node (it will be accessible under http://localhost:8001/jsoniq).

With the EC2 hostname

There is also another way that does not need any tunnelling: you can specify the hostname of your EC2 machine (copied over from the EC2 dashboard) with the --host parameter. For example, with the placeholder :

You also need to make sure in your EMR security group that the chosen port (e.g., 8001) is accessible from the machine in which you run your Jupyter notebook. Then, you can point your Jupyter notebook on this machine to http://<ec2-hostname>:8001/jsoniq.

Be careful not to open this port to the whole world, as queries can be sent that read and write to the EC2 machine and anything it has access to (like S3).

Ways to get and process the output of a JSONiq query

There are several ways to get back the output of the JSONiq query. There are many examples of use further down this page.

Method
Description
Requirement in availableOutputs()
Scale

availableOutputs()

Returns a list that helps you understand which output methods you can call. The strings in this list can be Local, RDD, DataFrame, or PUL.

-

json()

Returns the results as a tuple containing dicts, lists, strs, ints, floats, booleans, Nones.

Local

Interacting with pandas DataFrames

RumbleDB can work out of the box with pandas DataFrames, both as input and (when the output has a schema) as output.

Binding JSONiq variables to pandas DataFrames

bind() also accepts pandas dataframes

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [30,25,35]};
pdf = pd.DataFrame(data);

rumble.bind('$a',pdf);
seq = rumble.jsoniq('$a.Name')

The same goes for extra named parameters.

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [30,25,35]};
pdf = pd.DataFrame(data);

seq = rumble.jsoniq('$a.Name', a=pdf)

Getting the results as a pandas DataFrame

It is also possible to get the results back as a pandas dataframe with pdf() (if the output has a schema, which you can check by calling availableOutputs() and seeing if "DataFrame" is in the returned list).

Installing from source (for the adventurous)

We show here how to install RumbleDB from the GitHub repository and build it yourself if you wish to do so (for example, to use the latest master). However, the easiest way to use RumbleDB is to simply download the already compiled .jar files.

Requirements

The following software is required:

  • : the version of Java is important, as RumbleDB only works with Java 11 (Standalone or Spark 3.5), 17 (Standalone or Spark 3.5 or Spark 4 or Python) or 21 (Spark 4 or Python). The current master branch corresponds to Spark 4.0, meaning that Java 17 or 21 is required.

Your first programs

The syntax to start a session is similar to that of Spark. A RumbleSession is a SparkSession that additionally knows about RumbleDB. All attributes and methods of SparkSession are also available on RumbleSession.

Even though RumbleDB uses Spark internally, it can be used without any knowledge of Spark.

Executing a query is done with rumble.jsoniq() like so.

A query returns a sequence of items, here the sequence with just the integer item 2.

There are several ways to retrieve the results of the query. Calling the json() is just one of them. It retrieves the sequence of as a tuple of JSON values that Python can process. The detailed . Other methods for .

Interacting with pyspark DataFrames

RumbleDB can work out of the box with pyspark DataFrames, both as input and (when the output has a schema) as output.

Using Pyspark DataFrames with JSONiq

The power users can also interface our library with pyspark DataFrames. JSONiq sequences of items can have billions of items, and our library supports this out of the box: it can also run on clusters on AWS Elastic MapReduce for example. But your laptop is just fine, too: it will spread the computations on your cores. You can bind a DataFrame to a JSONiq variable. JSONiq will recognize this DataFrame as a sequence of object items.

Creating a data frame also similar to Spark (but using the rumble object).

This is how to bind a JSONiq variable to a dataframe. You can bind as many variables as you want.

Binding JSONiq variables to Python values

It is possible to bind a JSONiq variable to a tuple of native Python values and then use it in a query. JSONiq, variables are bound to sequences of items, just like the results of JSONiq queries are sequence of items. A Python tuple will be seamlessly converted to a sequence of items by the library. Currently we only support strs, ints, floats, booleans, None, and (recursively) lists and dicts. But if you need more (like date, bytes, etc) we will add them without any problem. JSONiq has a rich type system.

Values can be passed with extra named parameters, like so.

It is also possible to bind variables more durably (across multiple jsoniq() calls) with bind().

It is possible to bind only one value. The it must be provided as a singleton tuple. This is because in JSONiq, an item is the same a sequence of one item.

For convenience and code readability, you can also use bindOne().

A variable that was durably bound with bind() or bindOne() can be unbound with unbind().

Writing queries directly in Jupyter notebook cells

The Python edition of RumbleDB comes out of the box with a JSONiq magic.

If you are in a Jupyter notebook and have installed the jsoniq pip package, you can activate the jsoniq magic with:

Then, you can run JSONiq in standalone cells and see the results:

Of course, you can still continue to use rumble.jsoniq() calls and process the outputs as you see fit.

An example of the magic in action is available in our .

Note: This is a different magic than the magic that works with the RumbleDB HTTP server. It is more modern and running a server is no longer needed with this different magic. It suffices to install the jsoniq Python package.

Write back to the disk (or data lake)

Generally, it is possible to write output by to disk using the pandas DataFrame API, the pyspark DataFrame API, or Python's library to write JSON values to disk.

For convenience, we provide a way to also directly do so with the sequence object output by the query.

it is possible to write the output to a file locally or on a cluster. The API is similar to that of Spark dataframes. Note that it creates a directory and stores the (potentially very large) output in a sharded directory. RumbleDB was already tested with up to 64 AWS machines and 100s of TBs of data.

Of course the examples below are so small that it makes more sense to process the results locally with Python, but this shows how GBs or TBs of data obtained from JSONiq can be written back to disk.

The transform expression

Updates can be applied to a clone of an existing instance with the expression.

The content of the modify clause may build a complex Pending Update List with multiple updates. Remember that, with snapshot semantics, each update is applied against the initial snapshot, and updates do not see each other's effects.

Updating expression can also be combined with conditional expressions (in the then and else clauses), switch expressions (in the return clauses), FLWOR expressions (in the return clause), etc for more powerful queries based on patterns in the available data (from any source visible to the JSONiq query).

The updates generated inside the modify clause may only target the cloned object, i.e., the variable specified in the copy clause.

Example 191. JSON copy-modify-return expression

Result: { "foo" : true, "bar" : 123, "foobar" : [ true, false ] }

Advanced configuration

RumbleDB's specific configuration

It is possible to access RumbleDB's advanced configuration parameters with

Then, you can change the value of some parameters. For example, you can increase the number of JSON values that you can retrieve with a json() call:

You can also configure RumbleDB to output verbose information about the internal query plan, type and mode detection, and optimizations. This can be of interest to data engineers or researchers to understand how RumbleDB works.

The complete API for configuring RumbleDB is accessible in our pages. These methods are also callable in Python.

Warning: some of the configuration methods do not make sense in Python and are specific to the command line edition of RumbleDB (such as setting the query content or an output path and input/output format). Also, setting external variables in Python should not be done via the configuration, but with the bind() and unbind() functions or extra parameters in jsoniq() calls.

Licenses

RumbleDB uses the following software:

  • ANTLR v4 Framework - BSD License

  • Apache Commons Text - Apache License

  • Apache Commons Lang - Apache License

JSONiq Update Facility

JSONiq follows the and introduces update primitives and update expressions specific to JSON data.

In JSONiq, updates are not immediately applied. Rather, a snapshot of the current data is taken, and a list of updates, called the Pending Update List, is collected. Then, upon explicit request by the user (via specific expressions), the Pending Update List is applied atomically, leading to a new snapshot. It is also possible for an engine to persist (to the local disk, to a database management system, to a data lake...) the resulting Pending Update List after a query has been completed.

Merging updates

In the middle of a program, several PULs can be produced against the same snapshot. They are then merged with upd:mergeUpdates (part of the XQuery Update Facility standard), which is extended as follows.

  • Several deletes on the same object are replaced with a unique delete on that object, with a list of all selectors (names) to be deleted, where duplicates have been eliminated.

  • Several deletes on the same array and selector (position) are replaced with a unique delete on that array and with that selector.

  • Several inserts on the same array and selector (position) are equivalent to a unique insert on that array and selector with the content of those original inserts appended in an implementation-dependent order (like XQUF).

Apache Commons IO - Apache License
  • Apache HTTP client - Apache License

  • gson - Apache License

  • JLine terminal framework - BSD License

  • Kryo serialization framework - BSD License

  • Laurelin (ROOT parser) - BSD-3

  • Spark Libraries - Apache License

  • As well as the JSONiq language - CC BY-SA 3.0 License

  • Several inserts on the same object are equivalent to a unique insert where the objects containing the pairs to insert are merged. An error jerr:JNUP0005 is raised if a collision occurs.

  • Several replaces on the same object or array and with the same selector raise an error jerr:JNUP0009.

  • Several renames on the same object and with the same selector raise an error jerr:JNUP0010.

  • If there is a replace and a delete on the same object or array and with the same selector, the replace is omitted in the merged PUL.

  • If there is a rename and a delete on the same object or array and with the same selector, the rename is omitted in the merged PUL.

  • Sequence length below the materialization cap. The default is 200 but it can be increased in the RumbleDB configuration.

    df()

    Returns the results as a pyspark data frame

    DataFrame (i.e., RumbleDB was able to infer an output schema)

    No limitation, but beyond a billion items, you should use a Spark cluster.

    pdf()

    Returns the results as a pandas data frame

    DataFrame (i.e., RumbleDB was able to infer an output schema)

    Should fit in your computer's memory.

    rdd()

    Returns the results as an RDD containing dicts, lists, strs, ints, floats, booleans, Nones (experimental)

    RDD

    No limitation, but beyond a billion items, you should use a Spark cluster.

    items()

    Returns the results as a list containing Java Item objects that can be queried with the RumbleDB Item API. Will contain more information and more accurate typing.

    Local

    Sequence length below the materialization cap. The default is 200 but it can be increased in the RumbleDB configuration.

    open(), hasNext(), nextJSON(), close()

    Allows streaming (with no limitation of length) through individuals items as dicts, lists, strs, ints, floats, booleans, Nones.

    Local

    No limitation, as long as you go through the stream without saving all past items.

    open(), hasNext(), next(), close()

    Allows streaming (with no limitation of length) through individuals items as Java Item objects that can be queried with the RumbleDB Item API. Will contain more information and more accurate typing.

    Local

    No limitation, as long as you go through the stream without saving all past items.

    applyUpdates()

    Persists the Pending Update List produced by the query (to the Delta Lake or a table registered in the Hive metastore).

    PUL

    -

    XQuery Update Facility standard

    JSONiq 3.1

    JSONiq 3.1 is an initiative of the RumbleDB team that aligns JSONiq more closely with XQuery 3.1, which has now become a W3C recommendation, but keeping what makes it JSONiq: the flagship feature being the ability to copy-paste JSON into a JSONiq query and with a navigation syntax that appeals to the JSON community.

    JSONiq 3.1 does not require a distinct data model (JDM) since XQuery 3.1 support maps and arrays. As a result, JSONiq 3.1 objects are the same as XQuery 3.1 maps and JSONiq 3.1 arrays are the same as XQuery 3.1 arrays.

    JSONiq 3.1 does not require a separate serialization mechanism, since XQuery 3.1 supports the JSON output method.

    JSONiq 3.1 benefits from all the map and object builtin functions defined in XQuery 3.1.

    JSONiq 3.1 is fully interoperable with XQuery 3.1 and can execute on the same virtual machine (similar to Scala and Java).

    This also paves the way for JSONiq 4.0 which will also be aligned with XQuery 4.0 as much as is technically possible.

    As a result, the specification for JSONiq 3.1 is even more minimal than that of JSONiq 1.0. This makes it easy to support for any existing XQuery engine to step into the JSON community.

    RumbleDB is slowly deploying the use of JSONiq 3.1 but it will take some time as we make sure to sweep in all corners.

    How JSONiq 3.1 amends XQuery 3.1

    Context item

    In JSONiq 3.1, the context item is obtained through $$ and not through a dot.

    Escaping in strings

    String literals use JSON escaping instead of XML escaping (backslash, not ampersand).

    Map constructors

    In map (object) constructors, the "map" keyword in front is optional.

    Constraints on XPath

    A name test must be prefixed with $$/ and cannot stand on its own.

    True, null, and false literals

    true and false exist as literals and do not have to be obtained through function calls (true(), false()).

    null exists as a literal and stands for the empty sequence.

    Navigation

    The dot . and double square brackets [[ ]] act as syntactic sugars for ? lookup.

    How JSONiq 3.1 differs from JSONiq 1.0

    The data model standardized by the W3C working group is more generic and allows for atomic object keys that are not necessarily strings (dates, etc). Also, an object value or an array value can be a sequence of items and does not need to be a single item. The particular case in which object keys are strings and values are single items (or empty) corresponds to the JSON use.

    Null does not exist as its own type in JSONiq 3.1, instead it is mapped to the empty sequence.

    There are other minor changes in semantics that correspond to the alignment with XQuery 3.1 such as Effective Boolean Values, comparison, etc.

    Open Issues

    The JSON update syntax was not integrated yet into the core language. This is planned, and the syntax will be simplified (no json keyword, dot lookup allowed here as well).

    The semantics for the JSON serialization method is the same as in the JSONiq Extension to XQuery. It is still under discussion how to escape special characters with the Text output method.

    download it
    download page
    spark-submit rumbledb-1.24.0.jar serve -p 8001
  • And then use Jupyter notebooks in the same way you would do it locally (it magically works because of the tunneling)

  • print(seq.pdf())
    print(rumble.jsoniq("""
    for $v in $c
    let $parity := $v mod 2
    group by $parity
    return { switch($parity)
             case 0 return "even"
             case 1 return "odd"
             default return "?" : $v
    }
    """, c=(1,2,3,4, 5, 6)).json())
    
    print(rumble.jsoniq("""
    for $i in $c
    return [
      for $j in $i
      return { "foo" : $j }
    ]
    """, c=([1,2,3],[4,5,6])).json())
    
    
    print(rumble.jsoniq('{ "results" : $c.foo[[2]] }',
        c=({"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]})).json())
    rumble.bind('$c', (1,2,3,4, 5, 6))
    print(rumble.jsoniq("""
    for $v in $c
    let $parity := $v mod 2
    group by $parity
    return { switch($parity)
             case 0 return "even"
             case 1 return "odd"
             default return "?" : $v
    }
    """).json())
    
    print(rumble.jsoniq("""
    for $v in $c
    let $parity := $v mod 2
    group by $parity
    return { switch($parity)
             case 0 return "gerade"
             case 1 return "ungerade"
             default return "?" : $v
    }
    """).json())
    
    rumble.bind('$c', ([1,2,3],[4,5,6]))
    print(rumble.jsoniq("""
    for $i in $c
    return [
      for $j in $i
      return { "foo" : $j }
    ]
    """).json())
    
    rumble.bind('$c', ({"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]}))
    print(rumble.jsoniq('{ "results" : $c.foo[[2]] }').json())
    seq = rumble.jsoniq("$a.Name");
    seq.write().mode("overwrite").json("outputjson");
    seq.write().mode("overwrite").parquet("outputparquet");
    
    seq = rumble.jsoniq("1+1");
    seq.write().mode("overwrite").text("outputtext");
  • Spark, version 4.0.0 (for example)

  • Ant, version 1.10

  • Maven 3.9.9

  • Checking the requirements

    Type the following commands to check that the necessary commands are available. If not, you may need to either install the software, or make sure that it is on the PATH.

    Checkout

    You first need to download the rumble code to your local machine.

    In the shell, go to the desired location:

    Clone the github repository:

    Go to the root of this repository:

    Compile

    You can compile the entire project like so:

    After successful completion, you can check the target directory, which should contain the compiled classes as well as the JAR file rumbledb-2.0.0-jar-with-dependencies.jar.

    Running locally

    The most straightforward to test if the above steps were successful is to run the RumbleDB shell locally, like so:

    The RumbleDB shell should start:

    You can now start typing interactive queries. Queries can span over multiple lines. You need to press return 3 times to confirm.

    This produces the following results (>>> show the extra, empty lines that appear on the first two presses of the return key).

    You can try a few more queries.

    This is it. RumbleDB is setup and ready to go locally. You can now move on to a JSONiq tutorial. A RumbleDB tutorial will also follow soon.

    Running on a cluster

    You can also try to run the RumbleDB shell on a cluster if you have one available and configured -- this is done with the same command, as the master and deployment mode are usually already set up in cloud-managed clusters. More details are provided in the rest of the documentation.

    Java SE
    More complex, standalone queries

    Below are a few examples showing what is possible with JSONiq. You can learn JSONiq with our interactive tutorial. You will also find a full language reference here as well as a list of builtin functions.

    For complex queries, you can use Python's ability to spread strings over multiple lines, and with no need to escape special characters.

    from jsoniq import RumbleSession
    
    rumble = RumbleSession.builder.getOrCreate();
    
    items = rumble.jsoniq('1+1')
    python_tup = items.json()
    print(python_tup)
    type mapping for this is described here
    retrieving the output of a query are described here
    This is how to run a query. This is similar to spark.sql(). Since variable $a was bound to a DataFrame, it is automatically declared as an external variable and can be used in the query. In JSONiq, it is logically a sequence of objects.

    You can also, instead of the bind() call, pass the pyspark DataFrame directly in jsoniq() with an extra named parameter:

    There are several ways to collect the outputs, depending on the user needs but also on the query supplied. The following method returns a list containing one or several of "DataFrame", "RDD", "PUL", "Local".

    If DataFrame is in the list, df() can be invoked.

    If RDD is in the list, rdd() can be invoked.

    If Local is the list, items() or json() can be invokved, as well as the local iterator API.

    Manipulating DataFrames with SQL and JSONiq

    If the output of the JSONiq query is structured (i.e., RumbleDB was able to detect a schema), then we can extract a regular data frame that can be further processed with spark.sql() or rumble.jsoniq().

    We are continuously working on the detection of schemas and RumbleDB will get better at it with them. JSONiq is a very powerful language and can also produce heterogeneous output "by design". Then you need to use rdd() instead of df(), or to collect the list of JSON values (see further down). Remember that availableOutputs() tells you what is at your disposal.

    A DataFrame output by JSONiq can be reused as input to a Spark SQL query.

    (Remember that rumble is a wrapper around a SparkSession object, so you can use rumble.sql() just like spark.sql())

    A DataFrame output by Spark SQL can be reused as input to a JSONiq query.

    And a DataFrame output by JSONiq can be reused as input to another JSONiq query.

    data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)];
    columns = ["Name", "Age"];
    df = spark.createDataFrame(data, columns);
    rumble.bind('$a', df);
    Change the behavior to output DataFrames

    By default, the output will be in the form of serialized JSON values. If the output is structured, then you can change this default behavior to show it in the form of a DataFrame instead.

    For a pandas DataFrame:

    For a pyspark DataFrame:

    Note that it will not work in all cases. If the output is not fully structured or RumbleDB is unable to infer a DataFrame schema, you can specify the schema yourself. The schema language is called JSound and you will find a tutorial here.

    It is possible to measure the response time with the -t parameter:

    %load_ext jsoniqmagic
    %%jsoniq
    {"foobar":1} 
    upgraded online sandbox

    Allocating more memory

    If you get an out-of-memory error, it is possible to allocate memory when you build the Rumble session with a config() call. This is exactly the same way it is done when building a Spark session. The config() call can of course be used in combination with any other method calls that are part of the builder chain (withDelta(), appName(), config(), etc).

    This will only have an effect when done the first time the session is created. Spark and Java are not able to adjust the memory automatically after the session has been created. Subsequent calls of getOrCreate() do not create a new session, they only get the existing one.

    In Jupyter, you can restart the Kernel before you create the session, to force a new session.

    For example:

    conf = rumble.getRumbleConf()
    conf.setResultSizeCap(1000)
    JavaDoc
    export SPARK_HOME=/path/to/spark-4.0.0-bin-hadoop3
    export PATH=$SPARK_HOME/bin:$PATH
    . ~/.zshrc
    spark-submit --version
    java -version
    spark-submit rumbledb.jar repl
        ____                  __    __     ____  ____ 
       / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
      / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
     / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
    /_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  
    
    
    Master: local[*]
    Item Display Limit: 200
    Output Path: -
    Log Path: -
    Query Path : -
    
    rumble$
    "Hello, World"
     1 + 1
     
     (3 * 4) div 5
     
    spark-submit rumbledb.jar serve -p 8001
    http://localhost:8001/jsoniq?query-path=/tmp/query.jq
    { "values" : [ "foo", "bar" ] }
    curl -X POST --data '1+1' http://localhost:8001/jsoniq
    !pip install rumbledb
    %load_ext rumbledb
    %env RUMBLEDB_SERVER=http://localhost:8001/jsoniq
    %jsoniq 1 + 1
    %%jsoniq
    for $doc in json-lines("my-file")
    where $doc.foo eq "bar"
    return $doc
    
    spark-submit rumbledb.jar serve -p 8001 -h <ec2-hostname>
    rumble.bind('$c', (42,))
    print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())
    rumble.bindOne('$c', 42)
    print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())
    rumble.unbind('$c')
    $ java -version
    
    $ mvn --version
    
    $ ant -version
    
    $ spark-submit --version
    $ cd some_directory
    $ git clone https://github.com/RumbleDB/rumble.git
    $ cd rumble
    $ mvn clean compile assembly:single
    $ spark-submit target/rumbledb-2.0.0-with-dependencies.jar repl
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    
        ____                  __    __     ____  ____ 
       / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
      / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
     / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
    /_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  
    
    Master: local[2]
    Item Display Limit: 1000
    Output Path: -
    Log Path: -
    Query Path : -
    
    rumble$
    rumble$ "Hello, world!"
    rumble$ "Hello, world!"
    >>> 
    >>> 
    Hello, world
    rumble$ 2 + 2
    >>> 
    >>> 
    4
    
    rumble$ 1 to 10
    >>> 
    >>> 
    ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    seq = rumble.jsoniq("""
    
    let $stores :=
    [
      { "store number" : 1, "state" : "MA" },
      { "store number" : 2, "state" : "MA" },
      { "store number" : 3, "state" : "CA" },
      { "store number" : 4, "state" : "CA" }
    ]
    let $sales := [
       { "product" : "broiler", "store number" : 1, "quantity" : 20  },
       { "product" : "toaster", "store number" : 2, "quantity" : 100 },
       { "product" : "toaster", "store number" : 2, "quantity" : 50 },
       { "product" : "toaster", "store number" : 3, "quantity" : 50 },
       { "product" : "blender", "store number" : 3, "quantity" : 100 },
       { "product" : "blender", "store number" : 3, "quantity" : 150 },
       { "product" : "socks", "store number" : 1, "quantity" : 500 },
       { "product" : "socks", "store number" : 2, "quantity" : 10 },
       { "product" : "shirt", "store number" : 3, "quantity" : 10 }
    ]
    let $join :=
      for $store in $stores[], $sale in $sales[]
      where $store."store number" = $sale."store number"
      return {
        "nb" : $store."store number",
        "state" : $store.state,
        "sold" : $sale.product
      }
    return [$join]
    """);
    
    print(seq.json());
    
    seq = rumble.jsoniq("""
    for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
    group by $store-number := $product.store-number
    order by $store-number ascending
    return {
        "store" : $store-number,
        "products" : [ distinct-values($product.product) ]
    }
    """);
    print(seq.json());
    res = rumble.jsoniq('$a.Name');
    res = rumble.jsoniq('$a.Name', a=df);
    modes = res.availableOutputs();
    for mode in modes:
        print(mode)
    df = res.df();
    df.show();
    df.createTempView("myview")
    df2 = spark.sql("SELECT * FROM myview").toDF("name");
    df2.show();
    rumble.bind('$b', df2);
    seq2 = rumble.jsoniq("for $i in 1 to 5 return $b");
    df3 = seq2.df();
    df3.show();
    rumble.bind('$b', df3);
    seq3 = rumble.jsoniq("$b[position() lt 3]");
    df4 = seq3.df();
    df4.show();
    %%jsoniq -pdf
    for $i in 1 to 10000000
    return { "foobar" : $i}
    %%jsoniq -df
    for $i in 1 to 10000000
    return { "foobar" : $i}
    %%jsoniq -pdf
    declare type local:mytype as {
        "product" : "string",
        "store-number" : "int",
        "quantity" : "decimal"
    };
    validate type local:mytype* { 
        for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
        where $product.quantity ge 995
        return $product
    }
    %%jsoniq -t
    for $i in 1 to 10000000
    return { "foobar" : $i}
    conf.setPrintIteratorTree(True)
    rumble = RumbleSession.builder
    .config("spark.driver.memory", "10g")
    .getOrCreate()
    In the remainder of this chapter, we showcase the individual updating expressions one by one, inside a copy-modify-return expression.

    Update expressions can also appear outside of a copy-modify-return expression, in which case they propagate and/or persist directly to their targets, to the extent that the context makes it meaningful and possible.

    copy $obj := { "foo" : "bar", "bar" : [ 1,2,3 ] }
    modify (
      insert json { "bar" : 123, "foobar" : [ true, false ] } into $obj,
      delete json $obj.bar,
      replace value of json $obj.foo with true
    )
    return $obj
          
    copy-modify-return

    First queries

    This section assumes that you have installed RumbleDB with one of the proposed ways, and guides you through your first queries.

    Create some data set

    Create, in the same directory as RumbleDB to keep it simple, a file data.json and put the following content inside. This is a small list of JSON objects in the JSON Lines format.

    If you want to later try a bigger version of this data, you can also download a larger version with 100,000 objects from here. Wait, no, in fact you do not even need to download it: you can simply replace the file path in the queries below with "https://rumbledb.org/samples/products-small.json" and it will just work! RumbleDB feels just at home on the Web.

    RumbleDB also scales without any problems to datasets that have millions or (on a cluster) billions of objects, although of course, for billions of objects HDFS or S3 are a better idea than the Web to store your data, for obvious reasons.

    In the JSON Lines format that this simple dataset uses, you just need to make sure you have one object on each line (this is different from a plain JSON file, which has a single JSON value and can be indented). Of course, RumbleDB can read plain JSON files, too (with json-doc()), but below we will show you how to read JSON Line files, which is how JSON data scales.

    Running simple queries locally

    Depending on your installation method, the JSONiq queries will go to:

    • A cell in a jupyter notebook and with the %%jsoniq magic: a simple click is sufficient to execute.

    • The shell: type the query, and finish by pressing Enter twice.

    • In a Python program, inside a rumble.jsoniq() call of which you can exploit the output with more Python code.

    • A JSONiq query file, which you can execute with the RumbleDB CLI interface.

    Either way, the meaning of the queries is the same.

    or

    or

    The above queries do not actually use Spark. Spark is used when the I/O workload can be parallelized. The following query should output the file created above.

    json-lines() reads its input in parallel, and thus will also work on your machine with MB or GB files (for TB files, a cluster will be preferable). You should specify a minimum number of partitions, here 10 (note that this is a bit ridiculous for our tiny example, but it is very relevant for larger files), as locally no parallelization will happen if you do not specify this number.

    The above creates a very simple Spark job and executes it. More complex queries will create several Spark jobs. But you will not see anything of it: this is all done behind the scenes. If you are curious, you can go to in your browser while your query is running (it will not be available once the job is complete) and look at what is going on behind the scenes.

    Data can be filtered with the where clause. Again, below the hood, a Spark transformation will be used:

    RumbleDB also supports grouping and aggregation, like so:

    RumbleDB also supports ordering. Note that clauses (where, let, group by, order by) can appear in any order. The only constraint is that the first clause should be a for or a let clause.

    Finally, RumbleDB can also parallelize data provided within the query, exactly like Sparks' parallelize() creation:

    Mind the double parenthesis, as parallelize is a unary function to which we pass a sequence of objects.

    With docker

    The docker installation is kindly contributed by Dr. Ingo Müller (Google).

    Known issue

    On occasion, the docker version of RumbleDB used to throw a Kryo NoSuchMethodError on some systems. This should be fixed with version 2.0.0, let us know if this is not the case.

    You can upgrade to the newest version with

    Running simple queries with Docker

    Docker is the easiest way to get a standard environment that just works.

    You can download Docker from .

    Then, in a shell, type, all on one line:

    The first time, it might take some time to download everything, but this is all done automatically. Subsequent commands will run immediately.

    When there are new RumbleDB versions, you can upgrade with:

    The RumbleDB shell appears:

    You can now start typing simple queries like the following few examples. Press three times the return key to execute a query.

    or

    or

    The above queries do not actually use Spark. Spark is used when the I/O workload can be parallelized. The following query should output the file created above.

    json-lines() reads its input in parallel, and thus will also work on your machine with MB or GB files (for TB files, a cluster will be preferable). You should specify a minimum number of partitions, here 10 (note that this is a bit ridiculous for our tiny example, but it is very relevant for larger files), as locally no parallelization will happen if you do not specify this number.

    The above creates a very simple Spark job and executes it. More complex queries will create several Spark jobs. But you will not see anything of it: this is all done behind the scenes. If you are curious, you can go to in your browser while your query is running (it will not be available once the job is complete) and look at what is going on behind the scenes.

    Data can be filtered with the where clause. Again, below the hood, a Spark transformation will be used:

    RumbleDB also supports grouping and aggregation, like so:

    RumbleDB also supports ordering. Note that clauses (where, let, group by, order by) can appear in any order. The only constraint is that the first clause should be a for or a let clause.

    Finally, RumbleDB can also parallelize data provided within the query, exactly like Sparks' parallelize() creation:

    Mind the double parenthesis, as parallelize is a unary function to which we pass a sequence of objects.

    Running the RumbleDB docker as a server

    You can also run the docker as a server like so:

    You can change the port to something else than 8001 at all three places it appears. Do not forget -p 8001:8001 that forwards the port to the outside of the docker. Then, you can use a connected to the RumbleDB docker server to write queries in it. Point the notebook to http://localhost:8001/jsoniq in the appropriate cell (or any other port).

    Querying local files with the docker version of RumbleDB

    In order to query your local files, you need to mount a local directory to a directory within the docker. This is done with the --mount option, and the source path must be absolute. For the target, you can pick anything that makes sense to you.

    For example, imagine you have a file products-small.json in the directory /path/to/my/directory. Then you need to run RumbleDB with:

    Then you can go ahead and use absolute paths in the target directory in input functions, like so:

    You can also mount a local directory in this way running it as a server rather than a shell.

    On a Spark cluster (e.g., AWS EMR)

    Running RumbleDB on a cluster

    After you have tried RumbleDB locally as explained in the getting started section, you can take RumbleDB to a real cluster simply by modifying the command line parameters as documented here for spark-submit.

    Warning: EMR as of version 7.10 does not support Spark 4.0 yet, but we expect this will happen soon. In the meantime, you should use RumbleDB 1.22.

    Creating a cluster

    Creating a cluster is the easiest part, as most cloud providers today offer that with just a few clicks: Amazon EMR, Azure HDInsight, etc. You can start with 4-5 machines with a few CPUs each and a bit of memory, and increase later when you want to get serious on larger scales.

    Maybe sure to select a cluster that has Apache Spark. On Amazon EMR, this is not the default and you need to make sure that you check the box that has Spark below the cluster version dropdown. We recommend taking the latest EMR version 6.5.0 and then picking Spark 3.1 in the software configuration. You will also need to create a public/private key pair if you do not already have one.

    Wait for 5 or 6 minutes, and the cluster is ready.

    Do not forget to terminate the cluster when you are done!

    How to tune the RumbleDB command

    Next, you need to use ssh to connect to the master node of your cluster as the hadoop user and specifying your private key file. You will find the hostname of the machine on the EMR cluster page. The command looks like:

    ssh -i ~/.ssh/yourkey.pem [email protected]

    If ssh hangs, then you may need to authorize your IP for incoming connections in the security group of your cluster.

    And once you have connected with ssh and are on the shell, you can start using RumbleDB in a way similar to what you do on your laptop.

    First you need to download it with wget (which is usually available by default on cloud virtual machines):

    This is all you need to do, since Apache Spark is already installed. If spark-submit does not work, you might want to wait for a few more minutes as it might be that the cluster is not fully prepared yet.

    Often, the Spark cluster is running on yarn. The --master option can be changed from local[*] (which was for running on your laptop) to yarn compared to the getting started guide.

    Most of the time, though (e.g., on Amazon EMR), it needs not be specified, as this is already set up in the environment. So the same command will do:

    When you are on a cluster, you can also adapt the number of executors, how many cores you want per executor, etc. It is recommended to use sqrt(n) cores per executor if a node has n cores. For the executor memory, it is just primary school math: you need to divide the memory on a machine with the number of executors per machine (which is also roughly sqrt(n)).

    For example, if we have 6 worker nodes with each 16 cores and 64 GB, we can use 5 executores on each machine, with 3 cores and 10 GB per executor. This leaves a core and a bit of memory free for other cluster tasks.

    If necesasry, the size limit for materialization can be made higher with --materialization-cap or its shortcut -c (the default is 200). This affects the number of items displayed on the shells as an answer to a query. It also affects the maximum number of items that can be materialized from a large sequence into, say, an array. Warnings are issued if the cap is reached.

    Creation functions

    json-lines() then takes an HDFS path and the host and port are optional if Spark is configured properly. A second parameter controls the minimum number of splits. By default, each HDFS block is a split if executed on a clustter. In a local execution, there is only one split by default.

    The same goes for parallelize(). It is also possible to read text with text-file(), parquet files with parquet-file(), and it is also possible to read data on S3 rather than HDFS for all three functions json-lines(), text-file() and parquet-file().

    Bigger data sets

    If you need a bigger data set out of the box, we recommend the , which has 16 million objects. On Amazon EMR, we could even read several billion of objects on less than ten machines.

    We tested this with each new release, and suggest the following queries to start with (we assume HDFS is the default file system, and that you copied over this dataset to HDFS with hadoop fs copyFromLocal):

    Note that by default only the first 200 items in the output will be displayed on the shell, but you can change it with the --materialization-cap parameter on the CLI.

    Execution of single queries and output to HDFS

    RumbleDB also supports executing a single query from the command line, reading from HDFS and outputting the results to HDFS, with the query file being either local or on HDFS. For this, use the --query-path (optional as any text without parameter is recognized as a path in any case), --output-path (shortcut -o) and --log-path parameters.

    The query path, output path and log path can be any of the supported schemes (HDFS, file, S3, WASB...) and can be relative or absolute.

    Equality and identity

    As in most language, one can distinguish between physical equality and logical equality.

    Atomics can only be compared logically. Their physically identity is totally opaque to you.

    Logical comparison of two atomics

    Result (run with Zorba):true

    Logical comparison of two atomics

    Result (run with Zorba):false

    Logical comparison of two atomics

    Result (run with Zorba):false

    Logical comparison of two atomics

    Result (run with Zorba):true

    Two objects or arrays can be tested for logical equality as well, using deep-equal(), which performs a recursive comparison.

    Logical comparison of two JSON items

    Result (run with Zorba):true

    Logical comparison of two JSON items

    Result (run with Zorba):false

    The physical identity of objects and arrays is not exposed to the user in the core JSONiq language itself. Some library modules might be able to reveal it, though.

    Modules

    Module

    You can group functions and variables in separate library modules.

    MainModule

    Up to now, everything we encountered were main modules, i.e., a prolog followed by a main query.

    LibraryModule

    A library module does not contain any query - just functions and variables that can be imported by other modules.

    A library module must be assigned to a namespace. For convenience, this namespace is bound to an alias in the module declaration. All variables and functions in a library module must be prefixed with this alias.

    A library module

    ModuleImport

    Here is a main module which imports the former library module. An alias is given to the module namespace (my). Variables and functions from that module can be accessed by prefixing their names with this alias. The alias may be different than the internal alias defined in the imported module.

    An importing main module

    Result (run with Zorba):1764

    The JSONiq data model

    JSONiq is a query language that was specifically designed for querying JSON, although its data model is powerful enough to handle more similar formats.

    As stated on json.org, JSON is a "lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate."

    A JSON document is made of the following building blocks: objects, arrays, strings, numbers, booleans and nulls.

    JSONiq manipulates sequences of these building blocks, which are called items. Hence, a JSONiq value is a sequence of items.

    Any JSONiq expression takes and returns sequences of items.

    Comma-separated JSON-like building blocks is all you need to begin building your own sequences. You can mix and match, as JSONiq supports heterogeneous sequences seamlessly.

    Frequently asked questions and common issues

    Out of memory error

    By default, the memory allocated is limited. This depends on whether you run RumbleDB with the standalone jar or as the thin jar in a Spark environment.

    If you run RumbleDB with a standalone jar, then your laptop will allocate by default one quarter of your total working memory. You can check this with:

    In order to increase the memory, you can use -Xmx10g (for 10 GB, but you can use any other value):

    If you run RumbleDB on your laptop (or a single machine) with the thin jar, then by default this is limited to around 2 GB, and you can change this with --driver-memory

    More advanced output retrieval methods

    This section shares more techniques for advanced users who want to make the most of RumbleDB in Python.

    RumbleDB Item API

    JSONiq has a rich type system, and the can lose type information.

    An alternative consists of retrieving the sequence as a tuple of native items, which can be accessed with the .

    Input datasets (examples)

    Even though you can build your own JSON values with JSONiq by copying-and-pasting JSON documents, most of the time, your JSON data will come from an external input dataset.

    How this dataset is access depends on the JSONiq implementation and of the context. Some engines can read the data from a file located on a file system, local or distributed (HDFS, S3); some others get data from the Web; some others are full-fledged datastores and have collections that can be created, queried, modified and persisted.

    It is up to each engine to document which functions should be used, and how, in order to read datasets into a JSONiq Data Model instance. These functions will take implementation-defined parameters and typically return sequences of objects, or sequences of strings, or sequences of items, etc.

    For the purpose of examples given in this specification, we assume that a hypothetical engine has collections that are sequences of objects, identified by a name which is a string. We assume that there is a collection() function that returns all objects associated with the provided collection name.

    We assume in particular that there are three example collections, shown below.

    { "product" : "broiler", "store number" : 1, "quantity" : 20  }
    { "product" : "toaster", "store number" : 2, "quantity" : 100 }
    { "product" : "toaster", "store number" : 2, "quantity" : 50 }
    { "product" : "toaster", "store number" : 3, "quantity" : 50 }
    { "product" : "blender", "store number" : 3, "quantity" : 100 }
    { "product" : "blender", "store number" : 3, "quantity" : 150 }
    { "product" : "socks", "store number" : 1, "quantity" : 500 }
    { "product" : "socks", "store number" : 2, "quantity" : 10 }
    { "product" : "shirt", "store number" : 3, "quantity" : 10 }
    docker pull rumbledb/rumble
    
    1 eq 1
        
    localhost:4040
    here
    localhost:4040
    jupyter notebook
    great language game
    A sequence

    Result:foo 2 true foo bar null [ 1, 2, 3 ]

    Sequences are flat and cannot be nested. This makes streaming possible, which is very powerful.

    Sequences are flat

    Result:foo 2 true 4 null 6

    A sequence can be empty. The empty sequence can be constructed with empty parentheses.

    The empty sequence

    Result:

    A sequence of just one item is considered the same as just this item. Whenever we say that an expression returns or takes one item, we really mean that it takes a singleton sequence of one item.

    A sequence of one item

    Result:foo

    JSONiq classifies the items mentioned above in three categories:

    • Atomic items: strings, numbers, booleans and nulls, but also many other supported atomic values such as dates, binary, etc.

    • Structured items: objects and arrays.

    • Function items: items that can take parameters and, upon evaluation, return sequences of items.

    The JSONiq data model follows the W3C specification, but, in core JSONiq, does not include XML nodes, and includes instead JSON objects and arrays. Engines are free, however, to optionally support XML nodes in addition to JSON objects and arrays.

    Atomic items

    An atomic is a non-structured value that is annotated with a type.

    JSONiq atomic values follow the W3C specification.

    JSONiq supports most atomic values available in the W3C specification. They are described in Chapter The JSONiq type system. JSONiq furthermore defines an additional atomic value, null, with a type of its own, jn:null, which does not exist in the W3C specification.

    In particular, JSONiq supports all core JSON values. Note that JSON numbers correspond to three different types in JSONiq.

    • string: all JSON strings.

    • integer: all JSON numbers that are integers (no dot, no exponent), infinite range.

    • decimal: all JSON numbers that are decimals (no exponent), infinite range.

    • double: IEEE double-precision 64-bit floating point numbers (corresponds to JSON numbers with an exponent).

    • boolean: the JSON booleans true and false.

    • null: the JSON null.

    Structured items

    Structured items in JSONiq do not follow the XQuery 3.1 standard but are specific to JSONiq.

    In JSONiq, an object represents a JSON object, i.e., a collection of string/items pairs.

    Objects have the following property:

    • pairs. A set of pairs. Each pair consists of an atomic value of type xs:string and of an item.

      [ Consistency constraint: no two pairs have the same name (using fn:codepoint-equal). ]

    The XQuery data model uses accessors to explain the data model. Accessors are not exposed to the user and are only used for convenience in this specification. Objects have the following accessors:

    • jdm:object-keys($o as js:object) as xs:string*: returns all keys in the object $o.

    • jdm:object-value($o as js:object, $s as xs:string) as js:item: returns the value associated with $s in the object $o.

    An object does not have a typed value.

    In JSONiq, an array represents a JSON array, i.e., a ordered list of items.

    Objects have the following property:

    • members. An ordered list of items.

    Arrays have the following accessors:

    • jdm:array-size($a as js:array) as xs:nonNegativeInteger: returns the number of values in the array $a.

    • jdm:array-value($a as js:array, $i as xs:positiveInteger) as js:item: returns the value at position $i in the array $a..

    An array does not have a typed value.

    Unlike in the XQuery 3.1 standard, the values in arrays and objects are single items (which disallows the empty sequence or a sequence of more than one item). Also, object keys must be strings (which disallows any other atomic value).

    Function items

    JSONiq also supports function items, also known as higher-order functions. A function item can be passed parameters and evaluated.

    A function item has an optional name and an arity. It also has a signature, which consists of the sequence type of each one of its parameters (as many as its arity), and the sequence type of the values it returns.

    The fact that functions are items means that they can be returned by expressions, and passed as parameters to other functions. This is why they are also often called higher-order functions.

    JSONiq function items follow the W3C specification.

    :

    If you run RumbleDB on a cluster, then the memory needs to be allocated to the executors, not the driver:

    Setting things up on a cluster requires more thinking because setting the executor memory should be done in conjunction with setting the total number of executors and the number of cores per executor. This highly depends on your cluster hardware.

    Paths with whitespaces

    RumbleDB does not currently support paths with a whitespace. Make sure to put your data and modules at paths without whitespaces.

    "Hadoop bin directory does not exist" on Windows

    If this happens, you can download winutils.exe to solve the issue as explained here.

    "java.lang.NoSuchMethodError: com.esotericsoftware.kryo.serializers. FieldSerializer.setIgnoreSyntheticFields" with docker

    This is a known issue under investigation. It is related to a version conflict between Kryo 4 and Kryo 5 that occasionally happens on some docker installations. We recommend trying a local installation instead, as described in the Getting Started section.

    Java version

    A very common issue leading to some errors is using the wrong Java version. With Spark 3.5, only Java 11 or 17 is supported. With Spark 4, Java 17 or 21 are supported.

    You should make sure in particular you are not using a more recent Java version. Multiple Java versions can normally co-habit on the same machine but you need to make sure to set the JAVA_HOME variable appropriately.

    It is easy to check the Java version with:

    Streaming through the items

    Sometimes, a sequence can be very long, and materializing it to a tuple of JSON values or a tuple of native items can fail because of the materialization cap. While it can be changed in the configuration to allow for larger tuples, this does not scale.

    Another way to retrieve a sequence of arbitrary length is to use the iterator API to stream through the items one by one. If you do not keep previous values in memory, there is no limit to the sequence size than can be retrieved in this way (but it may take more time than using RDDs or DataFrames, which benefit from parallelism).

    This is how to stream through the items converted to JSON one by one:

    This is how to stream through the native items, using the Item API:

    Getting unstructured output as an RDD

    Sometimes, it is not possible to retrieve the output sequence as a (pandas or pyspark) DataFrame because no schema could be inferred. This is notably the case if the output sequence is heterogeneous (such as a sequence of items mixing atomics, objects of various structures, arrays, etc).

    And materializing or streaming may not be an option either if there are billions of items.

    In this case, it is possible to obtain the output as an RDD instead. This gets an RDD of JSON values that can be processed by Python (using the type mapping).

    The rdd() method is experimental because we had to reverse-engineer how pyspark encodes RDDs for the Java Virtual Machine (pickling).

    list = res.items();
    for result in list:
        print(result.getStringValue())
    conversion to JSON values
    RumbleDB Item API
    Collection 1

    Result

    Collection 2

    Result

    Collection 3

    Result

    
    collection("one-object")
        
    { "foo" : "bar" }
    "Hello, World"
     1 + 1
     
     (3 * 4) div 5
     
     json-lines("data.json")
     
    for $i in json-lines("data.json", 10)
    return $i
    for $i in json-lines("data.json", 10)
    where $i.quantity gt 99
    return $i
    for $i in json-lines("data.json", 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    return { "product" : $product, "total-quantity" : sum($quantity) }
    for $i in json-lines("data.json", 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    let $sum := sum($quantity)
    order by $sum descending
    return { "product" : $product, "total-quantity" : $sum }
    for $i in parallelize((
     { "product" : "broiler", "store number" : 1, "quantity" : 20  },
     { "product" : "toaster", "store number" : 2, "quantity" : 100 },
     { "product" : "toaster", "store number" : 2, "quantity" : 50 },
     { "product" : "toaster", "store number" : 3, "quantity" : 50 },
     { "product" : "blender", "store number" : 3, "quantity" : 100 },
     { "product" : "blender", "store number" : 3, "quantity" : 150 },
     { "product" : "socks", "store number" : 1, "quantity" : 500 },
     { "product" : "socks", "store number" : 2, "quantity" : 10 },
     { "product" : "shirt", "store number" : 3, "quantity" : 10 }
    ), 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    let $sum := sum($quantity)
    order by $sum descending
    return { "product" : $product, "total-quantity" : $sum }
    docker run -i rumbledb/rumble repl
                 
    docker pull rumbledb/rumble
        ____                  __    __     ____  ____ 
       / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
      / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
     / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
    /_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  
    
    
    App name: spark-rumble-jar-with-dependencies.jar
    Master: local[*]
    Driver's memory: (not set)
    Number of executors (only applies if running on a cluster): (not set)
    Cores per executor (only applies if running on a cluster): (not set)
    Memory per executor (only applies if running on a cluster): (not set)
    Dynamic allocation: (not set)
    Item Display Limit: 200
    Output Path: -
    Log Path: -
    Query Path : -
    
    RumbleDB$
    "Hello, World"
     1 + 1
     
     (3 * 4) div 5
     
     json-lines("https://rumbledb.org/samples/products-small.json")
     
    for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
    return $i
    for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
    where $i.quantity gt 99
    return $i
    for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    return { "product" : $product, "total-quantity" : sum($quantity) }
    for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    let $sum := sum($quantity)
    order by $sum descending
    return { "product" : $product, "total-quantity" : $sum }
    for $i in parallelize((
     { "product" : "broiler", "store number" : 1, "quantity" : 20  },
     { "product" : "toaster", "store number" : 2, "quantity" : 100 },
     { "product" : "toaster", "store number" : 2, "quantity" : 50 },
     { "product" : "toaster", "store number" : 3, "quantity" : 50 },
     { "product" : "blender", "store number" : 3, "quantity" : 100 },
     { "product" : "blender", "store number" : 3, "quantity" : 150 },
     { "product" : "socks", "store number" : 1, "quantity" : 500 },
     { "product" : "socks", "store number" : 2, "quantity" : 10 },
     { "product" : "shirt", "store number" : 3, "quantity" : 10 }
    ), 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    let $sum := sum($quantity)
    order by $sum descending
    return { "product" : $product, "total-quantity" : $sum }
    docker run -p 8001:8001 --rm rumbledb/rumble serve -p 8001 -h 0.0.0.0
    docker run -t -i --mount type=bind,source=/path/to/my/directory,target=/home rumbledb/rumble repl
    for $i in json-lines("/home/products-small.json", 10)
    where $i.quantity gt 99
    return $i
    wget https://github.com/RumbleDB/rumble/releases/download/v1.22.0/rumbledb-1.22.0-for-spark-3.5.jar
    spark-submit --master yarn --deploy-mode client rumbledb-1.22.0-for-spark-3.5.jar repl
                 
    spark-submit rumbledb-1.22.0-for-spark-3.5.jar repl
                 
    spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
                 rumbledb-1.22.0-for-spark-3.5.jar repl
    spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
                 rumbledb-1.22.0-for-spark-3.5.jar repl -c 10000
    for $i in json-lines(”/user/you/confusion−2014−03−02.json”, 300)
    let $guess := $i.guess
    let $target := $i.target
    where $guess eq $target
    where $target eq ”Russian”
    return $i
    
    for $i in json-lines(”/user/you/confusion−2014−03−02.json”, 300)
    let $guess := $i.guess, $target := $i.target
    where $guess eq $target
    order by $target, $i.country descending, $i.date descending return $i
    
    for $i in json-lines(”/user/you/confusion−2014−03−02.json”, 300)
    let $country := $i.country, $target := $i.target
    group by $target , $country
    return {”Language”: $target ,
    ”Country” : $country , ”Guesses”: length($i)}
    spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
                 rumbledb-1.22.0-for-spark-3.5.jar run "hdfs:///user/me/query.jq"
                 -o "hdfs:///user/me/results/output"
                 --log-path "hdfs:///user/me/logging/mylog"
    spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
                 rumbledb-1.22.0-for-spark-3.5.jar run "/home/me/my-local-machine/query.jq"
                 -o "/user/me/results/output"
                 --log-path "hdfs:///user/me/logging/mylog"
    
    1 eq 2
        
    
    "foo" eq "bar"
        
    
    "foo" ne "bar"
        
    
    deep-equal({ "foo" : "bar" }, { "foo" : "bar" })
        
    
    deep-equal({ "foo" : "bar" }, { "bar" : "foo" })
        
    
    module namespace my = "http://www.example.com/my-module";
    declare variable $my:variable := { "foo" : "bar" };
    declare variable $my:n := 42;
    declare function my:function($i as integer) { $i * $i };
        
    
    import module namespace other= "http://www.example.com/my-module";
    other:function($other:n)
        
    
    "foo", 2, true, { "foo", "bar" }, null, [ 1, 2, 3 ]
          
    
    ( ("foo", 2), ( (true, 4, null), 6 ) )
          
    
    ()
          
    
    ("foo")
          
    java -XX:+PrintFlagsFinal -version | grep -iE 'MaxHeapSize'   
    java -jar -Xmx10g rumbledb-2.0.0-standalone.jar ...
    spark-submit --driver-memory 10G rumbledb-2.0.0-for-spark-4.0.jar ...
    spark-submit --executor-memory 10G rumbledb-2.0.0-for-spark-4.0.jar ...
    java -version
    res.open();
    while(res.hasNext()):
        print(res.nextJSON());
    res.close();
    res.open();
    while (res.hasNext()):
        print(res.next().getStringValue());
    res.close();
    rdd = res.rdd();
    print(rdd.count());
    for str in rdd.take(10):
        print(str);
    
    collection("captains")
        
    
    { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 }
    { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 }
    { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 }
    { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24  }
    { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 }
    { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }
    { "name" : "Samantha Carter", "series" : [ ], "century" : 21 }
          
    collection("films")
    
    { "id" : "I", "name" : "The Motion Picture", "captain" : "James T. Kirk" }
    { "id" : "II", "name" : "The Wrath of Kahn", "captain" : "James T. Kirk" }
    { "id" : "III", "name" : "The Search for Spock", "captain" : "James T. Kirk" }
    { "id" : "IV", "name" : "The Voyage Home", "captain" : "James T. Kirk" }
    { "id" : "V", "name" : "The Final Frontier", "captain" : "James T. Kirk" }
    { "id" : "VI", "name" : "The Undiscovered Country", "captain" : "James T. Kirk" }
    { "id" : "VII", "name" : "Generations", "captain" : [ "James T. Kirk", "Jean-Luc Picard" ] }
    { "id" : "VIII", "name" : "First Contact", "captain" : "Jean-Luc Picard" }
    { "id" : "IX", "name" : "Insurrection", "captain" : "Jean-Luc Picard" }
    { "id" : "X", "name" : "Nemesis", "captain" : "Jean-Luc Picard" }
    { "id" : "XI", "name" : "Star Trek", "captain" : "Spock" }
    { "id" : "XII", "name" : "Star Trek Into Darkness", "captain" : "Spock" }
          

    JSON update primitives

    A Pending Update List is an unordered list of update primitives. Update primitives are internal and do not appear in the syntax. Each kind of update primitive models one individual update to an object or an array.

    A Pending Update List can by analogy be seen as the diff between two git revisions, and a single update primitive can be seen, with this same analogy, as the difference between two single lines of code. Thus, the JSONiq Update Facility is to trees what git is to lines of text: a "tree diff" language.

    JSONiq adds the following new update primitives, specific to JSON. They are similar to those defined by the XQuery Update Facility for XML.

    Update primitives within a PUL are applied with strict snapshot semantics. For examples, the positions are resolved against the array before the updates. Names are resolved on the object before the updates.

    Update primitive for objects and arrays (in collections or in memory)

    Update primitive
    Description

    Update primitives at the collection level

    Credits: Dwij Dixit/Ghislain Fourny (student project at ETH)

    Update primitive
    Description

    Primary updating expressions

    Update expressions are the visible part of JSONiq Updates in the language. Each primary updating expression contributes an update primitive to the Pending Update List being built.

    Nested updates (memory or persistent)

    These expressions may appear in a copy-modify-return (transform) expression (for in-memory updates on cloned values), or outside (for persistent updates to an underlying storage).

    Prologs

    This section introduces prologs, which allows declaring functions and global variables that can then be used in the main query. A prolog also allows setting some default behaviour.

    MainModule

    Prolog

    The prolog appears before the main query and is optional. It can contain setters and module imports, followed by function and variable declarations.

    Module imports are explained in the next chapter.

    jupd:rename-in-object(

    $target as object(),

    $key as xs:string,

    $content as xs:string)

    Renames the pair originally named $key in the object $target as $content (do nothing if there is no such pair).

    jupd:insert-before-into-collection(

    $target as item,

    $content as item()*)

    Inserts the provided items before the specified item in its collection.

    jupd:insert-last-into-collection(

    $target as item,

    $content as item()*)

    Inserts the provided items after the specified item in its collection.

    jupd:insert-into-object(

    $target as object(),

    $content as object())

    Inserts all pairs of the object $content into the object $target.

    jupd:insert-into-array(

    $target as array(),

    $position as xs:integer,

    $content as item()*)

    Inserts all items in the sequence $content before position $position into the array $target.

    jupd:delete-from-object(

    $target as object(),

    $keys as xs:string*)

    Removes the pairs the names of which appear in $keys from the object $target.

    jupd:delete-from-array(

    $target as array(),

    $position as xs:integer)

    Removes the item at position $position from the array $target (causes all following items in the array to move one position to the left).

    jupd:replace-in-array(

    $target as array(),

    $position as xs:integer,

    $content as item())

    Replaces the item at position $position in the array $target with the item $content (do nothing if $position is not comprised between 1 and jdm:size($target)).

    jupd:replace-in-object(

    $target as object(),

    $key as xs:string,

    $content as item())

    Replaces the value of the pair named $key in the object $target with the item $content (do nothing if there is no such pair).

    jupd:create-collection(

    $name as string,

    $mode as string,

    $content as item()*)

    Creates a collection initialized with the provided items. Mode determines the kind of collection (e.g., a Hive metastore table, a delta lake file, etc).

    jupd:truncate-collection(

    $name as string,

    $mode as string)

    Deletes the specified collection.

    jupd:edit(

    $target as item(),

    $content as item())

    Modifies an item in a collection into another item, preserving its identity and location.

    jupd:delete-in-collection(

    $target as item())

    Deletes the provided item from its collection.

    jupd:insert-first-into-collection(

    $name as string,

    $mode as string,

    $content as item()*)

    Inserts the provided items at the very beginning of the specified collection.

    jupd:insert-last-into-collection(

    $name as string,

    $mode as string,

    $content as item()*)

    Inserts the provided items at the very end of the specified collection.

    Error codes

    • [FOAR0001] - Division by zero.

    • [FOAR0002] - Numeric operation overflow/underflow

    • [FOCA0002] - A value that is not lexically valid for a particular type has been encountered.

    • [FOCH0001] - Raised by fn:codepoints-to-string if the input contains an integer that is not the codepoint of a valid XML character.

    • [FOCH0003] - Raised by fn:normalize-unicode if the requested normalization form is not supported by the implementation.

    • [FODC0002] - Error retrieving resource.

    • [FODT0001] - Overflow/underflow in date/time operation.

    • [FODT0002] - Overflow/underflow in duration operation.

    • [FOFD1340] -This error is raised if the picture string or calendar supplied to fn:format-date, fn:format-time, or fn:format-dateTime has invalid syntax.

    • [FOFD1350] - This error is raised if the picture string supplied to fn:format-date selects a component that is not present in a date, or if the picture string supplied to fn:format-time selects a component that is not present in a time.

    • [FOTY0012] - The argument has no typed value (objects, arrays, functions cannot be atomized).

    • [JNTY0004] - Unexpected non-atomic element. Raised when objects or arrays are supplied where an atomic element is expected.

    • [JNTY0024] - Error getting the string value for array and object items

    • [JNTY0018] - Invalid selector error code. It is a type error if there is not exactly one supplied parameter for an object or array selector.

    • [RBDY0005] - Materialization Error: the sequence is too big to be materialized. Use --materialization-cap to increase the maximum materialization size, or add an output path to write to.

    • [RBML0001] - Unrecognized RumbleDB ML Class Reference An unrecognized classname is used in query while accessing the RumbleDB ML API.

    • [RBML0002] - Unrecognized RumbleDB ML Param Reference An unrecognized parameter is used in query while operating with a RumbleDB ML class.

    • [RBML0003] - Invalid RumbleDB ML Param Provided parameter does not match the expected type or value for the referenced RumbleDB ML class.

    • [RBML0004] - Input is not a DataFrame Provided input of items does not form a DataFrame as expected by RumbleDB ML.

    • [RBML0005] - Invalid schema for DataFrame in annotate() The provided schema can not be applied to the item data while converting the data to a DataFrame

    • [RBST0001] - CLI error. Raised when invalid parameters are supplied at launch.

    • [RBST0002] - Unimplemented feature error. Raised when a JSONiq feature that is not yet implemented in RumbleDB is used.

    • [RBST0003] - Invalid for clause expression error. Raised when an expression produces a different, big sequence of items for each binding within a big tuple, which would lead to a data flow explosion and to a nesting of jobs on the Spark cluster.

    • [RBST0004] - Implementation Error.

    • [SENR0001] - Serialization error. Function items can not be serialized.

    • [XPDY0002] - It is a dynamic error if evaluation of an expression relies on some part of the dynamic context that is absent.

    • [XPDY0050] - Dynamic type treat error. It is a dynamic error if the dynamic type of the operand of a treat expression does not match the sequence type specified by the treat expression. This error might also be raised by a path expression beginning with "/" or "//" if the context node is not in a tree that is rooted at a document node. This is because a leading "/" or "//" in a path expression is an abbreviation for an initial step that includes the clause treat as document-node().

    • [XPDY0130] - Generic runtime exception [check error message].

    • [XPST0003] - Parsing error. Invalid syntax or unsupported feature in query.

    • [XPST0008] - Undefined element reference. It is a static error if an expression refers to an element name, attribute name, schema type name, namespace prefix, or variable name that is not defined in the static context, except for an ElementName in an ElementTest or an AttributeName in an AttributeTest.

    • [XPST0017] - Invalid function call error. It is a static error if the expanded QName and number of arguments in a static function call do not match the name and arity of a function signature in the static context.

    • [XPST0080] - Invalid cast error - It is a static error if the target type of a cast or castable expression is NOTATION anySimpleType, or anyAtomicType.

    • [XPST0081] - Unknown namespace prefix - It is a static error if a QName used in a query contains a namespace prefix that cannot be expanded into a namespace URI by using the statically known namespaces.

    • [XPTY0004] - Unexpected Type Error. It is a type error if, during the static analysis phase, an expression is found to have a static type that is not appropriate for the context in which the expression occurs, or during the dynamic evaluation phase, the dynamic type of a value does not match a required type. Example: using subtraction on strings.

    • [XQDY0054] - It is a dynamic error if a cycle is encountered in the definition of a module's dynamic context components, for example because of a cycle in variable declarations.

    • [XQTY0024] - Attribute After Non Attribute Error - It is a type error if the content sequence in an element constructor contains an attribute node following a node that is not an attribute node.

    • [XQDY0025] - Duplicate Attribute Error - It is a dynamic error if any attribute of a constructed element does not have a name that is distinct from the names of all other attributes of the constructed element.

    • [XQDY0074] - Invalid Element Name Error - It is a dynamic error if the value of the name expression in a computed element or attribute constructor cannot be converted to an expanded QName (for example, because it contains a namespace prefix not found in statically known namespaces.)

    • [XQDY0096] - Invalid Node Name Error - It is a dynamic error if the node-name of a node constructed by a computed element constructor has any of the following properties: 1. Its namespace prefix is xmlns. 2. Its namespace URI is http://www.w3.org/2000/xmlns/. 3. Its namespace prefix is xml and its namespace URI is not http://www.w3.org/XML/1998/namespace. 4. Its namespace prefix is other than xml and its namespace URI is http://www.w3.org/XML/1998/namespace.

    • [XQDY0137] - Duplicate pair name. It is a dynamic error if two pairs in an object constructor or in a simple object union have the same name.

    • [XQST0016] - Module declaration error. Current implementation does not support the Module Feature raises a static error if it encounters a module declaration or a module import.

    • [XQST0031] - Invalid JSONiq version. It is a static error if the version number specified in a version declaration is not supported by the implementation. For now, only version 1.0 is supported.

    • [XQST0033] - Namespace prefix bound twice. It is a static error if a module contains multiple bindings for the same namespace prefix.

    • [XQST0034] - Function already exists. It is a static error if multiple functions declared or imported by a module have the same number of arguments and their expanded QNames are equal (as defined by the eq operator).

    • [XQST0038] - It is a static error if a Prolog contains more than one default collation declaration, or the value specified by a default collation declaration is not present in statically known collations.

    • [XQST0039] - Duplicate parameter name. It is a static error for a function declaration or an inline function expression to have more than one parameter with the same name.

    • [XQST0047] - It is a static error if multiple module imports in the same Prolog specify the same target namespace.

    • [XQST0048] - It is a static error if a function or variable declared in a library module is not in the target namespace of the library module.

    • [XQST0049] - It is a static error if two or more variables declared or imported by a module have the same name.

    • [XQST0052] - Simple type error. The type must be the name of a type defined in the in-scope schema types, and the {variety} of the type must be simple.

    • [XQST0059] - It is a static error if an implementation is unable to process a schema or module import by finding a schema or module with the specified target namespace.

    • [XQST0069] - A static error is raised if a Prolog contains more than one empty order declaration.

    • [XQST0088] - It is a static error if the literal that specifies the target namespace in a module import or a module declaration is of zero length.

    • [XQST0089] - It is a static error if a variable bound in a for or window clause of a FLWOR expression, and its associated positional variable, do not have distinct names (expanded QNames).

    • [XQST0094] - Invalid variable in group-by clause. The name of each grouping variable must be equal (by the eq operator on expanded QNames) to the name of a variable in the input tuple stream.

    • [XQST0118] - In a direct element constructor, the name used in the end tag must exactly match the name used in the corresponding start tag, including its prefix or absence of a prefix.

    Inserting values into an object or array

    A JSON insert expression is used to insert new pairs into an object. It produces a jupd:insert-into-object update primitive. If the target is not an object, JNUP0008 is raised. If the content is not a sequence of objects, JNUP0019 is raised. These objects are merged prior to inserting the pairs into the target, and JNDY0003 is raised if the content to be inserted has colliding keys.

    Example

    Result: { "foo" : "bar", "bar" : 123, "foobar" : [ true, false ] }

    A JSON insert expression is also used to insert a new member into an array. It produces a jupd:insert-into-array update primitive. If the target is not an array, JNUP0008 is raised. If the position is not an integer, JNUP0007 is raised.

    Example

    Result: { "foo" : [ 1, 2, 5, 3, 4 ] }

    Deleting values in an object or array

    A JSON delete expression is used to remove a pair from an object. It produces a jupd:delete-from-object update primitive. If the key is not a string, JNUP0007 is raised. If the key does not exist, JNUP0016 is raised.

    Example

    Result: { "bar" : 123 }

    A JSON delete expression is also used to remove a member from an array. It produces a jupd:insert-from-array update primitive. If the position is not an integer, JNUP0007 is raised. If the position is out of range, JNUP0016 is raised.

    Example

    Result: [ 1, 2, 4, 5, 6 ]

    Renaming a key

    A JSON rename expression is used to rename a key in an object. It produces a jupd:rename-in-object update primitive. If the sequence on the left of the dot is not a single object, JNUP0008 is raised. If the new name is not a single string, JNUP0007 is raised. If the old key does not exist, JNUP0016 is raised.

    Example 196. JSON rename expression

    Result: { "bar" : 123, "foobar" : "bar" }

    Appending values to an array

    A JSON append expression is used to add a new member at the end of an array. It produces a jupd:insert-into-array update primitive. JNUP0008 is raised if the target is not an array.

    Example 197. JSON append expression

    Result: { "foo" : "bar", "bar" : [ 1, 2, 3, 4 ] }

    Replacing a value in an object or array.

    A JSON replace expression is used to replace the value associated with a certain key in an object. It produces a jupd:replace-in-object update primitive. JNUP0007 is raised if the selector is not a single string. If the selector key does not exist, JNUP0016 is raised.

    Example

    Result: { "bar" : [ 1, 2, 3 ], "foo" : { "nested" : true } }

    A JSON replace expression is also used to replace a member in an array. It produces a jupd:insert-in-array update primitive. JNUP0007 is raised if the selector is not a single position. If the selector position is out of range, JNUP0016 is raised.

    Example

    Result: { "foo" : "bar", "bar" : [ 1, "two", 3 ] }

    Update expressions at the collection top-level (persistent only)

    These expressions may not appear in a copy-modify-return (transform) expression because they can only be used for persistent updates to an underlying storage (document store, data lakehouse, etc).

    Creating a collection

    This expression creates an update primitive that creates a collection.

    Example

    Deleting a collection

    This expression creates an update primitive that deletes a collection.

    Example

    Inserting into a collection

    This expression creates an update primitive that inserts values at the beginning or end of a collection, or before or after specific values in that collection.

    Example

    Editing a value in a collection

    This expression creates an update primitive that modifies a value in a collection into the other supplied value.

    Example

    Deleting a value from a collection

    This expression creates an update primitive that deletes a specified value from its collection.

    Example

    Setters

    Setters allow to specify a default behaviour for various aspects of the language.

    Default collation

    DefaultCollationDecl

    This specifies the default collation used for grouping and ordering clauses in FLWOR expressions. It can be overriden with a collation directive in these clauses.

    Default ordering mode

    OrderingModeDecl

    This specifies the default behaviour of from clauses, i.e., if they bind tuples in the order in which items occur in the binding sequence. It can be overriden with ordered and unordered expressions.

    Default ordering behaviour for empty sequences

    EmptyOrderDecl

    This specifies whether empty sequences come first or last in an ordering clause. It can be overriden by the corresponding directives in such clauses.

    Default decimal format

    DecimalFormatDecl

    DFPropertyName

    This specifies a default decimal format for the builtin function format-number().

    Global variables

    VarDecl

    Variables can be declared global. Global variables are declared in the prolog.

    Global variable

    Result (run with Zorba):{ "foo" : "bar" }

    Global variable

    Result (run with Zorba):[ 1, 2, 3, 4, 5 ]

    You can specify a type for a variable. If the type does not match, an error is raised. Types will be explained later. In general, you do not need to worry too much about variable types except if you want to make sure that what you bind to a variable is really what you want. In most cases, the engine will take care of types for you.

    Global variable with a type

    Result (run with Zorba):{ "foo" : "bar" }

    An external variable allows you to pass a value from the outside environment, which can be very useful. Each implementation can choose their own way of passing a value to an external variable. A default value for an external variable can also be supplied in case none is provided outside.

    An external global variable

    Result (run with Zorba):An error was raised: "obj": variable has no value

    An external global variable with a default value

    Result (run with Zorba):{ "foo" : "bar" }

    Functions

    VarDecl

    You can define your own functions in the prolog. These user-defined functions must be prefixed with local:, both in the declaration and when called.

    Remember than types are optional, and if you do not specify any, item* is assumed, both for parameters and for the return type.

    An external global variable with a default value

    Result (run with Zorba):Hello, Mister Spock!

    An external global variable with a default value

    Result (run with Zorba):Hello, Mister Spock!

    An external global variable with a default value

    Result (run with Zorba):Hello, Mister Spock!

    If you do specify types, an error is raised in case of a mismatch

    An external global variable with a default value

    Result (run with Zorba):Hello, 1!

    The JSONiq type system

    This section describes JSONiq types as well as the sequence type syntax.

    JSONiq manipulates semi-structured data: in general, JSONiq allows you, but does not require you to specify types. So you have as much or as little type verification as you wish.

    JSONiq is still strongly typed, so that you will be told if there is a type inconsistency or mismatch in your programs.

    Whenever you do not specify the type of a variable or the type signature of a function, the most general type for any sequence of items, item*, is assumed.

    Section Expressions dealing with types introduces expressions which work with values of these types, as well as type operations (variable types, casts, ...).

    Sequence types

    JSONiq follows the regarding sequence occurrence indicators. The following explanations, provided as an informal summary for convenience, are non-normative.

    A sequence is an ordered list of items.

    All sequences match the sequence type js:item*.

    A sequence type is made of an item type followed by an occurrence indicator:

    • The symbol * (star) stands for a sequence of any length (zero or more)

    • The symbol + (plus) stands for a non-empty sequence (one or more)

    • The symbol ? (question mark) stands for an empty or a singleton sequence (zero or one)

    • The absence of indicator stands for a singleton sequence (one).

    Examples:

    • string matches any singleton sequence containing a string.

    • item+ matches any non-empty sequence.

    • object? matches the empty sequence and any sequence containing one object.

    JSONiq defines the syntax () for the empty sequence, rather than empty-sequence().

    SequenceType

    Item types

    Item types are the first component of a sequence type, together with the cardinality indicator. Thus, an item type matches (or not) a single item. For example, "foo" matches the item type xs:string.

    There are three categories of item types:

    • Atomic types (W3C-conformant, additional js:null and js:atomic)

    • Structured types (JSONiq-specific)

    • Function types (W3C-conformant)

    JSONiq uses a JSONiq-specific, implementation-defined default type namespace that acts as a proxy namespace to all types (xs: or js:). As a consequence, buitin atomic types do not need to be prefixed in the JSONiq syntax (integer instead of xs:integer, null instead of js:null).

    All items match the item type js:item, which is a JSONiq-specific synonym for the W3C-confirmant item().

    ItemType

    Atomic types

    JSONiq follows the for atomic types except for modifications in the list of available atomic types and a simplified syntax for xs:anyAtomicType. The following explanations, provided as an informal summary for convenience, are non-normative.

    Atomic types are organized in a tree hierarchy.

    JSONiq defines the following build-in types that have a direct relation with JSON:

    • xs:string: the value space is all strings made of Unicode characters.

      All string literals build an atomic which matches string.

    • xs:integer (W3C-conformant): the value space is that of all mathematical integral numbers (N), with an infinite range. This is a subtype of decimal, so that all integers also match the item type decimal.

      All integer literals build an atomic which matches integer.

    • xs:decimal (W3C-conformant): the value space is that of all mathematical decimal numbers (D), with an infinite range.

    JSONiq also supports further atomic types, which are conformant with .

    These datatypes are already used as a set of atomic datatypes by the other two semi-structured data formats of the Web: XML and RDF, as well as by the corresponding query languages: XQuery and SPARQL, so it is natural for a complete JSON data model to reuse them.

    • Further number types: xs:float, xs:long, xs:int, xs:short, xs:byte, xs:float, xs:positiveInteger, xs:negativeInteger, xs:nonPositiveInteger, xs:nonNegativeInteger, xs:unsignedLong, xs:unsignedInt, xs:unsignedShort, xs:unsignedByte.

    • Date or time types: xs:date, xs:dateTime, xs:dateTimeStamp, xs:gDay, xs:gMonth, xs:gMonthDay, xs:gYear, xs:xs:gYearMonth, xs:time.

    • Duration types: xs:duration, xs:dayTimeDuration, xs:yearMonthDuration.

    • Binary types: xs:base64Binary, xs:hexBinary.

    The support of xs:ID, xs:IDREF, xs:IDREFS, xs:NOTATION, xs:Name, xs:NCName, xs:NMTOKEN, xs:NMTOKENS, xs:ENTITY, xs:ENTITIES is not required by JSONiq, although engines that also support XML can support them.

    AtomicType

    Structured types

    JSONiq introduces four more types for matching objects and arrays. Like atomic types, they do not need the js: prefix in the syntax (object instead of js:object, etc.).

    All objects match the item type js:object.

    All arrays match the item type js:array.

    All objects and arrays match the item type js:json-item.

    For engines that also support optionally XML, js:structured-item matches both XML nodes and JSON objects and arrays.

    StructuredType

    Function types

    JSONiq follows the regarding function types. The following explanations are non-normative.

    FunctionType

    AnyFunctionType

    TypedFunctionType

    User-defined types

    RumbleDB now supports user-defined array and object types both with the JSound compact syntax and the JSound verbose syntax.

    JSound Schema Compact syntax

    RumbleDB user-defined types can be defined with the JSound syntax. A tutorial for the JSound syntax can be found .

    For now, RumbleDB only allows the definition of user-defined types for objects and arrays. User-defined atomic types and union types will follow soon. The @ (primary key) and ? (nullable) shortcuts are supported as of version 2.0.5. The behavior of nulls with absent vs. nullable fields can be tweaked in the configuration (e.g., if a null is present in an optional, non-nullable field, RumbleBD can be lenient and simply remove it instead of throwing an error).

    The implementation is still experimental and bugs are still expected, which we will appreciate to be informed of.

    Notes

    Sequences vs. Arrays

    Even though JSON supports arrays, JSONiq uses a different construct as its first class citizens: sequences. Any value returned by or passed to an expression is a sequence.

    The main difference between sequences and arrays is that sequences are completely flat, meaning they cannot contain other sequences.

    Since sequences are flat, expressions of the JSONiq language just concatenate them to form bigger sequences.

    This is crucial to allow streaming results, for example through an HTTP session.

    copy $obj := { "foo" : "bar" }
    modify insert json { "bar" : 123, "foobar" : [ true, false ] } into $obj
    return $obj
          
    copy $arr := { "foo" : [1,2,3,4] }
    modify insert json 5 into $arr.foo at position 3
    return $arr
          
    copy $obj := { "foo" : "bar", "bar" : 123 }
    modify delete json $obj.foo
    return $obj
          
    copy $arr := [1,2,3,4,5,6]
    modify delete json $arr[[3]]
    return $arr
          
    copy $obj := { "foo" : "bar", "bar" : 123 }
    modify rename json $obj.foo as "foobar"
    return $obj
          
    copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
    modify append json 4 into $obj.bar
    return $obj
          
    copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
    modify replace value of json $obj.foo with { "nested" : true }
    return $obj
          
    copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
    modify replace value of json $obj.bar[[2]] with "two"
    return $obj
          
    create collection table("mytable") with ({"foo":1},{"foo":2}),
    create collection delta-file("/path/to/file.delta") with ({"foo":1},{"foo":2})
    delete collection table("mytable"),
    delete collection delta-file("/path/to/file.delta")
    insert {"foo":3} first into collection table("mytable"),
    insert {"foo":4} last into collection delta-file("/path/to/file.delta"),
    insert {"foo":3} before table("mytable")[3] into collection,
    insert {"foo":3} after delta-file("/path/to/file.delta")[3] into collection
    edit table("mytable")[1] into {"foo":3} in collection
    delete table("mytable")[1] from collection
    
      declare variable $obj := { "foo" : "bar" };
      $obj
          
    
      declare variable $numbers := (1, 2, 3, 4, 5);
      [ $numbers ]
          
    
      declare variable $obj as object := { "foo" : "bar" };
      $obj
          
    
      declare variable $obj external;
      $obj
          
    
      declare variable $obj external := { "foo" : "bar" };
      $obj
          
    
    declare function local:say-hello($x) { "Hello, " || $x || "!" };
    local:say-hello("Mister Spock")
          
    
    declare function local:say-hello($x as string) { "Hello, " || $x || "!" };
    local:say-hello("Mister Spock")
          
    
    declare function local:say-hello($x as string) as string { "Hello, " || $x || "!" };
    local:say-hello("Mister Spock")
          
    
    declare function local:say-hello($x) { "Hello, " || $x || "!" }; 
    local:say-hello(1)
          

    All decimal literals build an atomic which matches decimal.

  • xs:double (W3C-conformant): the value space is that of all IEEE double-precision 64-bit floating point numbers.

    All double literals build an atomic which matches double.

  • xs:boolean (W3C-conformant): the value space contains the booleans true and false.

    All boolean literals build an atomic which matches boolean.

  • js:null (JSONiq-specific): the value space is a singleton and only contains null.

    All null literals build an atomic which matches null.

  • js:atomic (JSONiq-specific synonym of, and W3C-conformant with, xs:anyAtomicType): all atomic types.

    All literals build an atomic which matches atomic.

  • An URI type: xs:anyURI.

    W3C standard
    W3C standard
    XML Schema
    W3C standard
    Type declaration

    A new type can be declared in the prolog, at the same location where you also define global variables and user-defined functions.

    In the above query, although the type is defined, the query returns an object that was not validated against this type.

    Type declaration

    To validate and annotate a sequence of objects, you need to use the validate-type expression, like so:

    You can use user-defined types wherever other types can appear: as type annotation for FLWOR variables or global variables, as function parameter or return types, in instance-of or treat-as expressions, etc.

    You can validate larger sequences

    You can also validate, in parallel, an entire JSON Lines file, like so:

    Optional vs. required fields

    By defaults, fields are optional:

    You can, however, make a field required by adding a ! in front of its name:

    Or you can provide a default value with the equal sign:

    Extra fields

    Extra fields will be rejected. However, the verbose version of JSound supports allowing extra fields (open objects) and will be supported in a future version of RumbleDB.

    Nested arrays

    With the JSound comptact syntax, you can easily define nested array structures:

    You can even further nest objects:

    Or split your definitions into several types that refer to each other:

    DataFrames

    In fact, RumbleDB will internally convert the sequence of objects to a Spark DataFrame, leading to faster execution times.

    In other words, the JSound Compact Schema Syntax is perfect for defining DataFrames schema!

    Verbose syntax

    For advanced JSound features, such as open object types or subtypes, the verbose syntax must be used, like so:

    The JSound type system, as its name indicates, is sound: you can only make subtypes more restrictive than the super type. The complete specification of both syntaxes is available on the JSound website.

    In the feature, RumbleDB will support user-defined atomic types and union types via the verbose syntax.

    What's next?

    Once you have validated your data as a dataframe with a user-defined type, you are all set to use the RumbleDB ML Machine Learning library and feed it through ML pipelines!

    here
    Flat sequences

    Result (run with Zorba):1 2 3 4

    Arrays on the other side can contain nested arrays, like in JSON.

    Nesting arrays

    Result (run with Zorba):[ [ 1, 2 ], [ 3, 4 ] ]

    Many expressions return single items - actually, they really return a singleton sequence, but a singleton sequence of one item is considered the same as this item.

    Singleton sequences

    Result (run with Zorba):2

    This is different for arrays: a singleton array is distinct from its unique member, like in JSON.

    Singleton sequences

    Result (run with Zorba):[ 2 ]

    An array is a single item. A (non-singleton) sequence is not. This can be observed by counting the number of items in a sequence.

    count() on an array

    Result (run with Zorba):1

    count() on a sequence

    Result (run with Zorba):4

    Other than that, arrays and sequences can contain exactly the same members (atomics, arrays, objects).

    Members of an array

    Result (run with Zorba):[ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]

    Members of an sequence

    Result (run with Zorba):1 foo [ 1, 2, 3, 4 ] { "foo" : "bar" }

    Arrays can be converted to sequences, and vice-versa.

    Converting an array to a sequence

    Result (run with Zorba):1 foo [ 1, 2, 3, 4 ] { "foo" : "bar" }

    Converting a sequence to an array

    Result (run with Zorba):[ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]

    Null vs. empty sequence

    Null and the empty sequence are two different concepts.

    Null is an item (an atomic value), and can be a member of an array or of a sequence, or the value associated with a key in an object. Sequences cannot, as they represent the absence of any item.

    Null values in an array

    Result (run with Zorba):[ null, 1, null, 2 ]

    Null values in an object

    Result (run with Zorba):{ "foo" : null }

    Null values in a sequence

    Result (run with Zorba):null 1 null 2

    If an empty sequence is found as an object value, it is automatically converted to null.

    Automatic conversion to null.

    Result (run with Zorba):{ "foo" : null }

    In an arithmetic opration or a comparison, if an operand is an empty sequence, an empty sequence is returned. If an operand is a null, an error is raised except for equality and inequality.

    Empty sequence in an arithmetic operation.

    Result (run with Zorba):

    Null in an arithmetic operation.

    Result (run with Zorba):An error was raised: arithmetic operation not defined between types "js:null" and "xs:integer"

    Null and empty sequence in an arithmetic operation.

    Result (run with Zorba):

    Empty sequence in a comparison.

    Result (run with Zorba):

    Null in a comparison.

    Result (run with Zorba):false

    Null in a comparison.

    Result (run with Zorba):true

    Null and the empty sequence in a comparison.

    Result (run with Zorba):

    Null and the empty sequence in a comparison.

    Result (run with Zorba):

    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    { "foo" : "this is a string", "bar" : 42 }
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    declare function local:proj($x as local:my-type+) as string*
    {
      $x.foo
    };
    
    let $a as local:my-type* := validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 }
    }
    return if($a instance of local:my-type*)
           then local:proj($a)
           else "Not an instance."
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 },
      { "foo" : "this is another string", "bar" : 1 },
      { "foo" : "this is yet another string", "bar" : 2 },
      { "foo" : "this is a string", "bar" : 12 },
      { "foo" : "this is a string", "bar" : 42345 },
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    validate type local:my-type* {
      json-lines("hdfs:///directory-file.json")
    }
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 },
      { "bar" : 1 },
      { "foo" : "this is yet another string", "bar" : 2 },
      { "foo" : "this is a string" },
      { "foo" : "this is a string", "bar" : 42345 },
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string",
      "!bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 },
      { "bar" : 1 },
      { "foo" : "this is yet another string", "bar" : 2 },
      { "foo" : "this is a string", "bar" : 1234 },
      { "foo" : "this is a string", "bar" : 42345 },
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string=foobar",
      "!bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 },
      { "bar" : 1 },
      { "foo" : "this is yet another string", "bar" : 2 },
      { "foo" : "this is a string", "bar" : 1234 },
      { "foo" : "this is a string", "bar" : 42345 },
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string",
      "!bar" : [ "integer" ]
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : [ 42, 1234 ] },
      { "bar" : [ 1 ] },
      { "foo" : "this is yet another string", "bar" : [ 2 ] },
      { "foo" : "this is a string", "bar" : [ ] },
      { "foo" : "this is a string", "bar" : [ 1, 2, 3, 4, 5, 6 ] },
      { "foo" : "this is a string", "bar" : [ 42 ] }
    }
    declare type local:my-type as {
      "foo" : { "bar" : "integer" },
      "!bar" : [ { "first" : "string", "last" : "string" } ]
    };
    
    validate type local:my-type* {
      {
        "foo" : { "bar" : 1 },
        "bar" : [
          { "first" : "Albert", "last" : "Einstein" },
          { "first" : "Erwin", "last" : "Schrodinger" }
        ]
      },
      {
        "foo" : { "bar" : 2 },
        "bar" : [
          { "first" : "Alan", "last" : "Turing" },
          { "first" : "John", "last" : "Von Neumann" }
        ]
      },
      {
        "foo" : { "bar" : 3 },
        "bar" : [
        ]
      }
    }
    declare type local:person as {
      "first" : "string",
      "last" : "string"
    };
    
    declare type local:my-type as {
      "foo" : { "bar" : "integer" },
      "!bar" : [ "local:person" ]
    };
    
    validate type local:my-type* {
      {
        "foo" : { "bar" : 1 },
        "bar" : [
          { "first" : "Albert", "last" : "Einstein" },
          { "first" : "Erwin", "last" : "Schrodinger" }
        ]
      },
      {
        "foo" : { "bar" : 2 },
        "bar" : [
          { "first" : "Alan", "last" : "Turing" },
          { "first" : "John", "last" : "Von Neumann" }
        ]
      },
      {
        "foo" : { "bar" : 3 },
        "bar" : [
        ]
      }
    }
    declare type local:x as jsound verbose {
      "kind" : "object",
      "baseType" : "object",
      "content" : [
        { "name" : "foo", "type" : "integer" }
      ],
      "closed" : false
    };
    
    declare type local:y as jsound verbose {
      "kind" : "object",
      "baseType" : "local:x",
      "content" : [
        { "name" : "bar", "type" : "date" }
      ],
      "closed" : true
    };
    
    ( (1, 2), (3, 4) )
          
    
    [ [ 1, 2 ], [ 3, 4 ] ]
          
    
    1 + 1
          
    
    [ 1 + 1 ]
          
    
    count([ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ])
          
    
    count( ( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ) )
          
    
    [ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]
          
    
    ( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } )
          
    
    [ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ] []
          
    
    [ ( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ) ]
          
    
    [ null, 1, null, 2 ]
          
    
    { "foo" : null }
          
    
    (null, 1, null, 2)
          
    
    { "foo" : () }
          
    
    () + 2
          
    
    null + 2
          
    
    null + ()
          
    
    () eq 2
          
    
    null eq 2
          
    
    null lt 2
          
    
    null eq ()
          
    
    null lt ()
          

    JSONiq coverage

    RumbleDB relies on the JSONiq language.

    JSONiq reference

    The complete specification can be found here and on the JSONiq.org website. The implementation is now in a very advanced stage and there remain only few unsupported core JSONiq features.

    JSONiq tutorial

    A tutorial can be found . All queries in this tutorial will work with RumbleDB.

    JSONiq tutorial for Python users

    A tutorial aimed at Python users can be found . Please keep in mind, though, that examples using not supported features may not work (see below).

    Nested FLWOR expressions

    FLWOR expressions now support nestedness, for example like so:

    However, keep in mind that parallelization cannot be nested in Spark (there cannot be a job within a job), that is, the following will not work:

    Expressions pushed down to Spark

    Many expressions are pushed down to Spark out of the box. For example, this will work on a large file leveraging the parallelism of Spark:

    What is pushed down so far is:

    • FLWOR expressions (as soon as a for clause is encountered, binding a variable to a sequence generated with json-lines() or parallelize())

    • aggregation functions such as count

    • JSON navigation expressions: object lookup (as well as keys() call), array lookup, array unboxing, filtering predicates

    • predicates on positions, include use of context-dependent functions position() and last(), e.g.,

    More expressions working on sequences will be pushed down in the future, prioritized on the feedback we receive.

    We also started to push down some expressions to DataFrames and Spark SQL (obtained via structured-json-lines, csv-file and parquet-file calls). In particular, keys() pushes down the schema lookup if used on parquet-file() and structured-json-lines(). Likewise, count() as well as object lookup, array unboxing and array lookup is also pushed down on DataFrames.

    When an expression does not support pushdown, it will materialize automaticaly. To avoid issues, the materializion is capped by default at 200 items, but this can be changed on the command line with --materialization-cap. A warning is issued if a materialization happened and the sequence was truncated on screen. An error is thrown if this happens within a query.

    External global variables.

    Prologs with user-defined functions and global variables are supported. Global external variables are supported (use "--variable:foo bar" on the command line to assign values to them). If the declared type is not string, then the literal supplied on the command line is cast. If the declared type is anyURI, the path supplied on the command line is also resolved against the working directory to an absolute URI. Thus, anyURI should be used to supply paths dynamically through an external variable.

    Context item declarations are supported and a global context item value can be passed with the "--context-item" or "-I" parameter on the command line.

    Library modules

    Library modules are now supported (experimental, please report bugs), and their namespace URI is used for resolution. If it is relative, it is resolved against the importing module location.

    The same schemes are supported as for reading queries and data: file, hdfs, and so on. HTTP is also supported: you can import modules from the Web!

    Example of library module (the file name is library-module.jq):

    Example of importing module (assuming it is in the same directory):

    Try/catch

    Try/catch expressions are supported. Error codes are in the default, RumbleDB namespace and do not need prefixes.

    Supported types

    The JSONiq type system is fully supported. Below is a complete list of JSONiq types and their support status. All builtin types are in the default type namespace, so that no prefix is needed. These types are defined in the XML Schema standard. Note that some types specific to XML (e.g., NOTATION, NMTOKENS, NMTOKEN, ID, IDREF, ENTITY, etc) are not part of the JSONiq standard and not supported by RumbleDB.

    Type
    Status

    Unsupported/Unimplemented features (beta release)

    Most core features of JSONiq are now in place, and we are working on getting the last (less used) ones into RumbleDB as well. We prioritize their implementation on user requests.

    Prolog

    Some prolog settings (base URI, ordering mode, decimal format, namespace declarations) are not supported yet.

    Location hints for the resolution of modules are not supported yet.

    FLWOR features

    Window clauses are not supported, because they are not compatible with the Spark execution model.

    Function types

    Function type syntax is supported.

    Function annotations are not supported (%public, %private...), but this is planned.

    Builtin functions

    Most JSONiq and XQuery builtin functions are now supported (see function documentation), except XML-specific functions. A few are still missing, do not hesitate to reach out if you need them.

    Constructors for atomic types are fully supported.

    Buitin functions cannot yet be used with named function reference expressions (example: concat#2).

    Error variables

    Error variables ($err:code, ...) for inside catch blocks are not supported.

    Updates and scripting

    There are future plans to support JSONiq updates and scripting.

    type checking (instance of, treat as)

  • many builtin function calls (head, tail, exist, etc)

  • supported

    dateTime

    supported

    dateTimeStamp

    supported

    dayTimeDuration

    supported

    decimal

    supported

    double

    supported

    duration

    supported

    float

    supported

    gDay

    supported

    gMonth

    supported

    gYear

    supported

    gYearMonth

    supported

    hexBinary

    supported

    int

    supported

    integer

    supported

    long

    supported

    negativeInteger

    supported

    nonPositiveInteger

    supported

    nonNegativeInteger

    supported

    numeric

    supported

    positiveInteger

    supported

    short

    supported

    string

    supported

    time

    supported

    unsignedByte

    supported

    unsignedInt

    supported

    unsignedLong

    supported

    unsignedShort

    supported

    yearMonthDuration

    supported

    atomic

    JSONiq 1.0 only

    anyAtomicType

    supported

    anyURI

    supported

    base64Binary

    supported

    boolean

    supported

    byte

    supported

    here
    here

    date

    let $x := for $x in json-lines("file.json")
              where $x.field eq "foo"
              return $x
    return count($x)
    for $x in json-lines("file1.json")
    let $z := for $y in json-lines("file2.json")
              where $y.foo eq $x.fbar
              return $y
    return count($z)
    count(json-lines("file.json")[$$.field eq "foo"].bar[].foo[[1]])
    json-lines("file.json")[position() ge 10 and position() le last() - 2]
    module namespace m = "library-module.jq";
    
    declare variable $m:x := 2;
    
    declare function mod:func($v) {
      $m:x + $v
    );
    import module namespace mod = "library-module.jq";
    
    mod:func($mod:x)
    try { 1 div 0 } catch FOAR0001 { "Division by zero!" }

    Configuration parameters

    The parameters that can be used on the command line as well as on the planned HTTP server are shown below. They are also accessible via the Java API and via Python through the RumbleRuntimeConfiguration class.

    RumbleDB runs in three modes. You can select the mode passing a verb as the first parameter. For example:

    Previous parameters (--shell, --query-path, --server) work in a backward compatible fashion, however we do recommend to start using the new verb-based format.

    Shell parameter
    Shortcut
    HTTP parameter
    example values
    Semantics

    --shell

    repl

    Data sources and formats

    RumbleDB is able to read a variety of formats from a variety of file systems and database management systems.

    We support functions to read JSON, JSON Lines, XML, Parquet, CSV, Text, ROOT, Delta files from various storage layers such as S3 and HDFS, Azure blob storage. We run most of our tests on Amazon EMR with S3 or HDFS, as well as locally on the local file system, but we welcome feedback on other setups.

    We also support some ETL-based systems such as PostgreSQL, MongoDB and the Hive metastore.

    Supported formats

       spark-submit rumbledb.jar run file.jq -o output-dir -P 1
       spark-submit rumbledb.jar run -q '1+1'
       spark-submit rumbledb.jar serve -p 8001
       spark-submit rumbledb.jar repl -c 10

    N/A

    yes, no

    yes runs the interactive shell. No executes a query specified with --query-path

    --shell-filter

    N/A

    N/A

    jq .

    Post-processes the output of JSONiq queries on the shell with the specified command (reading the RumbleDB output via stdin)

    --query

    -q

    query

    1+1

    A JSONiq query directly provided as a string.

    --query-path

    (any text without -- or - is recognized as a query path)

    query-path

    file:///folder/file.jq

    A JSONiq query file to read from (from any file system, even the Web!).

    --output-path

    -o

    output-path

    file:///folder/output

    Where to output to (if the output is large, it will create a sharded directory, otherwise it will create a file)

    --output-format

    -f

    N/A

    json, csv, avro, parquet, or any other format supported by Spark

    An output format to use for the output. Formats other than json can only be output if the query outputs a highly structured sequence of objects (you can nest your query in an annotate() call to specify a schema if it does not).

    --output-format-option:foo

    N/A

    N/A

    bar

    Options to further specify the output format (example: separator character for CSV, compression format...)

    --overwrite

    -O (meaning --overwrite yes)

    overwrite

    yes, no

    Whether to overwrite to --output-path. No throws an error if the output file/folder exists.

    --materialization-cap

    -c

    materialization-cap

    100000

    A cap on the maximum number of items to materialize during the query execution for large sequences within a query. For example, when nesting an expression producing a large sequence of items (and that RumbleDB chose to physically store as an RDD or DataFrame) into an array constructor.

    --result-size

    result-size

    10

    A cap on the maximum number of items to output on the screen or to a local list.

    --number-of-output-partitions

    -P

    N/A

    ad hoc

    How many partitions to create in the output, i.e., the number of files that will be created in the output path directory.

    --log-path

    N/A

    log-path

    file:///folder/log.txt

    Where to output log information

    --print-iterator-tree

    N/A

    N/A

    yes, no

    For debugging purposes, prints out the expression tree and runtime interator tree.

    --show-error-info

    -v (meaning --show-error-info yes)

    show-error-info

    yes, no

    For debugging purposes. If you want to report a bug, you can use this to get the full exception stack. If no, then only a short message is shown in case of error.

    --static-typing

    -t (meaning --static-typing yes)

    static-typing

    yes, no

    Activates static type analysis, which annotates the expression tree with inferred types at compile time and enables more optimizations (experimental). Deactivated by default.

    --server

    serve

    N/A

    yes, no

    yes runs RumbleDB as a server on port 8001. Run queries with http://localhost:8001/jsoniq?query-path=/folder/foo.json

    --port

    -p

    N/A

    8001 (default)

    Changes the port of the RumbleDB HTTP server to any of your liking

    --host

    -h

    N/A

    localhost (default)

    Changes the host of the RumbleDB HTTP server to any of your liking

    --variable:foo

    N/A

    variable:foo

    bar

    --variable:foo bar initialize the global variable $foo to "bar". The query must contain the corresponding global variable declaration, e.g., "declare variable $foo external;"

    --context-item

    -I

    context-item

    bar

    initializes the global context item $$ to "bar". The query must contain the corresponding global variable declaration, e.g., "declare context item external;"

    --context-item-input

    -i

    context-item-input

    -

    reads the context item value from the standard input

    --context-item-input-format

    N/A

    context-item-input-format

    text or json

    sets the input format to use for parsing the standard input (as text or as a serialized json value)

    --dates-with-timezone

    N/A

    dates-with-timezone

    yes or no

    activates timezone support for the type xs:date (deactivated by default)

    --lax-json-null-valication

    N/A

    lax-json-null-validation

    yes or no

    Allows conflating JSON nulls with absent values when validating nillable object fields for more flexibility (activated by default).

    --optimize-general-comparison-to-value-comparison

    N/A

    optimize-general-comparison-to-value-comparison

    yes or no

    activates automatic conversion of general comparisons to value comparisons when applicable (activated by default)

    --function-inlining

    N/A

    function-inlining

    yes or no

    activates function inlining for non-recursive functions (activated by default)

    --parallel-execution

    N/A

    parallel-execution

    yes or no

    activates parallel execution when possible (activated by default)

    --native-execution

    N/A

    native-execution

    yes or no

    activates native (Spark SQL) execution when possible (activated by default)

    --default-language

    N/A

    N/A

    jsoniq10, jsoniq31, xquery31

    specifies the query language to be used

    --optimize-steps

    N/A

    N/A

    yes or no

    allows RumbleDB to optimize steps, might violate stability of document order (activated by default)

    --optimize-steps-experimental

    N/A

    N/A

    yes or no

    experimentally optimizes steps more by skipping uniqueness and sorting in some cases. correctness is not yet verified (disabled by default)

    --optimize-parent-pointers

    N/A

    N/A

    yes or no

    allows RumbleDB to remove parent pointers from items if no steps requiring parent pointers are detected statically (activated by default)

    --static-base-uri

    N/A

    N/A

    "../data/"

    sets the static base uri for the execution. This option overwrites module location but is overwritten by declaration inside query

    JSON

    A JSON file containing a single JSON object (or value) can be read with json-doc(). It will not spread access in any way, so that the files should be reasonably small. json-doc() can read JSON files even if the object or value is spread over multiple lines.

    returns the (single) JSON value read from the supplied JSON file. This will also work for structures spread over multiple lines, as the read is local and not sharded.

    json-doc() also works with an HTTP URI.

    JSON Lines

    JSON Lines files are files that have one JSON object (or value) per line. Such files can thus become very large, up to billions or even trillions of JSON objects.

    JSON Lines files are read with the json-lines() function (formerly called json-file()). json-lines() exists in unary and binary. The first parameter specifies the JSON file (or set of JSON files) to read. The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition, which does not fully use the parallelism. If the input is on HDFS, then blocks are taken as splits by default. This is also similar to Spark's textFile().

    json-lines() also works with an HTTP URI, however, it will download the file completely and then parallelize, because HTTP does not support blocks. As a consequence, it can only be used for reasonable sizes.

    Example of usage:

    If a default host and port are set in the Hadoop configuration, you can directly specify an absolute path without host and port:

    For a set of files:

    If a working directory is set:

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    In some cases, JSON Lines files are highly structured, meaning that all objects have the same fields and these fields are associated with values with the same types. In this case, RumbleDB will be faster navigating such files if you open them with the function structured-json-lines().

    structured-json-lines() parses one or more json files that follow JSON-lines format and returns a sequence of objects. This enables better performance with fully structured data and is recommended to use only when such data is available.

    Warning: when the data has multiple types for the same field, this field and contained values will be treated as strings. This is also similar to Spark's spark.read.json().

    Example of usage:

    XML

    XML files can be read into RumbleDB using the doc() function. The parameter specifies the XML file to read and return as a document node.

    Example of usage:

    Additionally, RumbeDB provides the xml-files() function to read many XML files at once. xml-files() exists in unary and binary. The first parameter specifies the directory of XML files to read. The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition.

    Example of usage:

    Text

    Text files can be read into a sequence of string items, one string per line. RumbleDB can open files that have billions or potentially even trillions of lines with the function text-file().

    text-file() exists in unary and binary. The first parameter specifies the text file (or set of text files) to read and return as a sequence of strings.

    The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition, which does not fully use the parallelism. If the input is on HDFS, then blocks are taken as splits by default. This is also similar to Spark's textFile().

    Example of usage:

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    (Also see examples for json-lines for host and port, sets of files and working directory).

    There is also a function local-text-file() that reads locally, without parallelism. RumbleDB can stream through the file efficiently.

    RumbleDB supports also the W3C-standard functions unparsed-text and unparsed-text-lines. The output of the latter is automatically parallelized as a potentially large sequence of strings.

    Parquet

    Parquet files can be opened with the function parquet-file().

    Parses one or more parquet files and returns a sequence of objects. This is also similar to Spark's spark.read.parquet()

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    CSV

    CSV files can be opened with the function csv-file().

    Parses one or more csv files and returns a sequence of objects. This is also similar to Spark's spark.read.csv()

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    Options can be given in the form of a JSON object. All available options can be found in the Spark documentation

    PostgreSQL

    This functionality is currently only available in the Python edition (pip install jsoniq) as of 2.0.1+.

    PostgreSQL tables can be opened with the function postgresql-table().

    PostgreSQL is an OLTP system with its own storage system. Thus, unlike most other functions on this page, it uses a connection string rather than a path on a data lake.

    It opens one table and returns it as a sequence of objects. The first argument is the connection string in the JDBC format, containing host, port, username, password, and database. The second argument is the name of the table to read.

    The third parameter can be used to control the number of partitions.

    MongoDB

    This functionality is currently only available in the Python edition (pip install jsoniq) as of 2.0.2+.

    MongoDB collections can be opened with the function mongodb-collection().

    MongoDB is an OLTP system with its own storage system. Thus, unlike most other functions on this page, it uses a connection string rather than a path on a data lake.

    It opens one collection and returns it as a sequence of objects. The first argument is the connection string in the MongoDB format, containing host, port, database, collection, username, password. The second argument is the name of the collection to read.

    The third parameter can be used to control the number of partitions.

    MongoDB does not work "out of the box" but requires some configuration as indicated on the MongoDB Spark connector website. In the Python edition, we simplified the process and all that is needed is to add withMongo() on the session building chain:

    Hive metastore

    RumbleDB can connect to a table registered in the Hive metastore with the function table().

    The Hive metastore manages its own storage system. Thus, unlike most other functions on this page, it uses a simple name rather than a path on a data lake.

    RumbleDB can also modify data in a Hive metastore table with the JSONiq Update Facility.

    Delta files

    Delta files, part of the Delta Lake framework, can be opened with the function delta-file().

    RumbleDB can also modify data in a delta file with the JSONiq Update Facility.

    Delta files do not work "out of the box" but require some configuration as indicated on the Delta Lake website (importing packages, configuring some parameters). In the Python edition, we simplified the process and all that is needed is to add withDelta() on the session building chain:

    AVRO

    Avro files can be opened with the function avro-file().

    Parses one or more avro files and returns a sequence of objects. This is similar to Spark's spark.read().format("avro").load()

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    Options can be given in the form of a JSON object. All available options relevant for reading in avro data can be found in the Spark documentation

    libSVM

    libSVM files can be opened with the function libsvm-file().

    Parses one or more libsvm files and returns a sequence of objects. This is similar to Spark's spark.read().format("libsvm").load()

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    ROOT

    ROOT files can be open with the function root-file(). The second parameter specifies the path within the ROOT files (a ROOT file is like a mini-file system of its own). It is often Events or tree.

    Creating your own big sequence

    The function parallelize() can be used to create, on the fly, a big sequence of items in such a way that RumbleDB can spread its querying across cores and machines.

    This function behaves like the Spark parallelize() you are familiar with and sends a large sequence to the cluster. The rest of the FLWOR expression is then evaluated with Spark transformations on the cluster.

    There is also be a second, optional parameter that specifies the minimum number of partitions.

    Supported file systems

    As a general rule of thumb, RumbleDB can read from any file system that Spark can read from. The file system is inferred from the scheme used in the path used in any of the functions described above, with the exception of MongoDB, the Hive metastore, and PostgreSQL, which are ETL-based.

    Note that the scheme is optional, in which case the default file system as configured in Hadoop and Spark is used. A relative path can also be provided, in which case the working directory (including its file system) as configured is used.

    Local file system

    The scheme for the local file system is file://. Pay attention to the fact that for reading an absolute path, a third slash will follow the scheme.

    Example:

    Warning! If you try to open a file from the local file system on a cluster of several machines, this might fail as the file is only on the machine that you are connected to. You need to pass additional parameters to spark-submit to make sure that any files read locally will be copied over to all machines.

    If you use spark-submit locally, however, this will work out of the box, but we recommend specifying a number of partitions to avoid reading the file as a single partition.

    For Windows, you need to use forward slashes, and if the local file system is set up as the default and you omit the file scheme, you still need a forward slash in front of the drive letter to not confuse it with a URI scheme:

    In particular, the following will not work:

    HDFS

    The scheme for the Hadoop Distributed File System is hdfs://. A host and port should also be specified, as this is required by Hadoop.

    Example:

    If HDFS is already set up as the default file system as is often the case in managed Spark clusters, an absolute path suffices:

    The following will not work:

    S3

    There are three schemes for reading from S3: s3://, s3n:// and s3a://.

    Examples:

    If you are on an Amazon EMR cluster, s3:// is straightforward to use and will automatically authenticate. For more details on how to set up your environment to read from S3 and which scheme is most appropriate, we refer to the Amazon S3 documentation.

    Azure blob storage

    The scheme for Azure blob storage is wasb://.

    Example:

    json-doc("file.json")
    for $my-json in json-lines("hdfs://host:port/directory/file.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-json in json-lines("/absolute/directory/file.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-json in json-lines("/absolute/directory/file-*.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-json in json-lines("file.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-json in json-lines("*.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-structured-json in structured-json-lines("hdfs://host:port/directory/structured-file.json")
    where $my-structured-json.property eq "some value"
    return $my-structured-json
    doc("path/to/file.xml")
    xml-files("path/to/directory/*.xml", 10)
    count(
      for $my-string in text-file("hdfs://host:port/directory/file.txt")
      for $token in tokenize($my-string, ";")
      where $token eq "some value"
      return $token
    )
    count(
      for $my-string in local-text-file("file:///home/me/file.txt")
      for $token in tokenize($my-string, ";")
      where $token eq "some value"
      return $token
    )
    count(
      for $my-string in unparsed-text-lines("file:///home/me/file.txt")
      for $token in tokenize($my-string, ";")
      where $token eq "some value"
      return $token
    )
    count(
      let $text := unparsed-text("file:///home/me/file.txt")
      for $my-string in tokenize($text, "\n")
      for $token in tokenize($my-string, ";")
      where $token eq "some value"
      return $token
    )
    for $my-object in parquet-file("file.parquet")
    where $my-object.property eq "some value"
    return $my-json
    for $my-object in parquet-file("*.parquet")
    where $my-object.property eq "some value"
    return $my-json
    for $i in csv-file("file.csv")
    where $i._c0 eq "some value"
    return $i
    for $i in csv-file("*.csv")
    where $i._c0 eq "some value"
    return $i
    for $i in csv-file("file.csv", {"header": true, "inferSchema": true})
    where $i.key eq "some value"
    return $i
    for $i in postgresql-table("jdbc:postgresql://servername/dbname?user=postgres&password=example", "tablename")
    where $i.attribute eq "some value"
    return $i
    for $i in postgresql-table("jdbc:postgresql://servername/dbname?user=postgres&password=example", "tablename", 10)
    where $i.attribute eq "some value"
    return $i
    for $i in mongodb-collection("mongodb://servername/dbname", "collection")
    where $i.attribute eq "some value"
    return $i
    for $i in mongodb-collection("mongodb://servername/dbname", "collection", 10)
    where $i.attribute eq "some value"
    return $i
    RumbleSession.builder.withMongo().getOrCreate();
    for $i in table("mytable")
    where $i.attribute eq "some value"
    return $i
    for $i in delta-file("hdfs://path/to/my/delta-file")
    where $i.attribute eq "some value"
    return $i
    RumbleSession.builder.withDelta().getOrCreate();
    for $i in avro-file("file.avro")
    where $i._col1 eq "some value"
    return $i
    for $i in avro-file("*.avro")
    where $i._col1 eq "some value"
    return $i
    for $i in avro-file("file.avro", {"ignoreExtension": true, "avroSchema": "/path/to/schema.avsc"})
    where $i._col1 eq "some value"
    return $i
    for $i in libsvm-file("file.txt")
    where $i._col1 eq "some value"
    return $i
    for $i in libsvm-file("*.txt")
    where $i._col1 eq "some value"
    return $i
    for $i in root-file("events.root", "Events")
    where $i._c0 eq "some value"
    return $i
    for $i in parallelize(1 to 1000000)
    where $i mod 1000 eq 0
    return $i
    for $i in parallelize(1 to 1000000, 100)
    where $i mod 1000 eq 0
    return $i
    file:///home/user/file.json
    file:///C:/Users/hadoop/file.json
    file:/C:/Users/hadoop/file.json
    /C:/Users/hadoop/file.json
    file://C:/Users/hadoop/file.json
    C:/Users/hadoop/file.json
    C:\Users\hadoop\file.json
    file://C:\Users\hadoop\file.json
    hdfs://www.example.com:8021/user/hadoop/file.json
    /user/hadoop/file.json
    hdfs:///user/hadoop/file.json
    hdfs://user/hadoop/file.json
    hdfs:/user/hadoop/file.json
    s3://my-bucket/directory/file.json
    s3n://my-bucket/directory/file.json
    s3a://my-bucket/directory/file.json
    wasb://[email protected]/directory/file.json

    Introduction

    In this specification, we detail the JSONiq language in version 1.0. Historically, JSONiq was first created as an extension to XQuery. Later, a separate core syntax was created which makes it 100% tailored for JSON. It is the JSONiq core syntax that is detailed in this document.

    The functionality directly inherited from XQuery is described on a higher level and we explicitly refer for more in-depth details to the .

    Structure of a JSONiq program.

    A JSONiq program can either be a main module, which contains a query that can be executed, or a library module, which defines functions and variables that can be used in other modules.

    A main or library module can be optionally prefixed with a JSONiq declaration with a version (currently 1.0) and an encoding.

    Function library

    Function Library

    JSONiq provides a rich set of functions.

    JSON specific functions.

    Some functions are specific to JSON.

    Module

    Main modules

    A JSONiq main module is made of two parts: an optional prolog, and an expression, which is the main query.

    MainModule

    The result of the main JSONiq program is the result of its main query.

    In the prolog, it is possible to declare global variables and functions. Mostly, you will recognize a prolog declaration by the semi-colon it ends with. The main query does not contain semi-colons (at least in core JSONiq).

    Global variables and functions can use and call each other arbitrarily, even if the dependency is further down in the prolog. If there a cycle, an error is thrown.

    JSONiq largely follows the W3C standard regarding modules. The detailed specification is found here.

    Library modules

    Library modules do not contain any main query, just global variables and functions. They can be imported by other modules.

    A library module is introduced with a module declaration, followed by the prolog containing its variables and functions.

    LibraryModule

    Feature matrix

    JSONiq is 99% reliant on XQuery, a W3C standard. For everything taken over from the W3C standard, a brief, non-normative explanation is provided with a link to the corresponding part in the W3C specification.

    Feature
    Specification status

    JSONiq Data Model

    Atomic items

    W3C-conformant

    Structured items

    JSONiq-specific

    Function items

    W3C-conformant

    Node items (XML)

    Omitted (optional support by some engines)

    JSONiq Type System

    Namespaces

    The namespace http://jsoniq.org/functions is used for JSONiq builtin functions defined by this specification. This namespace is exposed to the user and is bound by default to the prefix jn. For instance, the function name jn:keys() is in this namespace.

    The namespace http://jsoniq.org/types is used for JSONiq builtin types defined by this specification (including synonyms for some XQuery types). This namespace is exposed to the user and is bound by default to the prefix js. For instance, the type name js:null is in this namespace.

    The namespace http://jsoniq.org/default-function-namespace is a proxy namespace that maps to the jn: (JSONiq), fn: (XQuery) and math: (XQuery) namespaces. It is the default function namespace, allowing to call all these functions with no prefix.

    The namespace http://jsoniq.org/default-type-namespace is a proxy namespace that maps to the js: (JSONiq) and xs: (XQuery) namespaces. It is the default type namespace, allowing to use all builtin types with no prefix.

    Accessors used in JSONiq Data Model use the jdm: prefix. These functions are not exposed to the user and are for explanatory purposes of the data model within this document only. The jdm: prefix is not associated with a namespace.

    W3C specification
    keys

    This function returns the distinct keys of all objects in the supplied sequence, in an implementation-dependent order.

    keys($o as item*) as string*

    Getting all distinct key names in the supplied objects, ignoring non-objects.

    Result (run with Zorba):a b c

    Retrieving all Pairs from an Object:

    Result (run with Zorba):{ "eyes" : "blue" } { "hair" : "fuchsia" }

    members

    This functions returns all members of all arrays of the supplied sequence.

    members($a as item*) as item*

    Retrieving the members of all supplied arrays, ignoring non-arrays.

    Result (run with Zorba):mercury venus earth mars 1 2 3

    null

    This function returns the JSON null.

    null() as null

    parse-json

    This function parses its first parameter (a string) as JSON, and returns the resulting sequence of objects and arrays.

    parse-json($arg as string?) as json-item*

    parse-json($arg as string?, $options as object) as json-item*

    The object optionally supplied as the second parameter may contain additional options:

    • jsoniq-multiple-top-level-items (boolean): indicates whether parsing to zero, or several objects is allowed. An error is raised if this value is false and there is not exactly one object that was parsed.

    If parsing is not successful, an error is raised. Parsing is considered in particular to be non-successful if the boolean associated with "jsoniq-multiple-top-level-items" in the additional parameters is false and there is extra content after parsing a single abject or array.

    Parsing a JSON document

    Result (run with Zorba):{ "foo" : "bar" }

    Parsing multiple, whitespace-separated JSON documents

    Result (run with Zorba):{ "foo" : "bar" } { "bar" : "foo" }

    size

    This function returns the size of the supplied array, or the empty sequence if the empty sequence is provided.

    size($a as array?) as integer?

    Retrieving the size of an array

    Result (run with Zorba):10

    accumulate

    This function dynamically builds an object, like the {| |} syntax, except that it does not throw an error upon pair collision. Instead, it accumulates them, wrapping into an array if necessary. Non-objects are ignored.

    descendant-arrays

    This function returns all arrays contained within the supplied items, regardless of depth.

    descendant-objects

    This function returns all objects contained within the supplied items, regardless of depth.

    descendant-pairs

    This function returns all descendant pairs within the supplied items.

    Accessing all descendant pairs

    Result (run with Zorba):An error was raised: "descendant-pairs": function with arity 1 not declared

    flatten

    This function recursively flattens arrays in the input sequence, leaving non-arrays intact.

    intersect

    This function returns the intersection of the supplied objects, and aggregates values corresponding to the same name into an array. Non-objects are ignored.

    project

    This function iterates on the input sequence. It projects objects by filtering their pairs and leaves non-objects intact.

    Projecting an object 1

    Result (run with Zorba):{ "Captain" : "Kirk", "First Officer" : "Spock" }

    Projecting an object 2

    Result (run with Zorba):{ }

    remove-keys

    This function iterates on the input sequence. It removes the pairs with the given keys from all objects and leaves non-objects intact.

    Removing keys from an object (not implemented yet)

    Result (run with Zorba):An error was raised: "remove-keys": function with arity 2 not declared

    values

    This function returns all values in the supplied objects. Non-objects are ignored.

    encode-for-roundtrip

    This function encodes any sequence of items, even containing non-JSON types, to a sequence of JSON items that can be serialized as pure JSON, in a way that it can be parsed and decoded back using decode-from-roundtrip. JSON features are left intact, while atomic items annotated with a non-JSON type are converted to objects embedding all necessary information.

    encode-for-roundtrip($items as item*) as json-item*

    decode-from-roundtrip

    This function decodes a sequence previously encoded with encode-for-roundtrip.

    decode-from-roundtrip($items as json-item*) as item*

    Functions taken from XQuery

    • Access to the external environment: collection#1

    • Function to turn atomics into booleans for use in two-valued logics: boolean#1

    • Raising errors: error#0, error#1, error#2, error#3.

    • Functions on numeric values: abs#1, ceilingabs#1, floorabs#1, roundabs#1,

    • Parsing numbers: ,

    • Formatting integers: ,

    • Formatting numbers: ,

    • Trigonometric and exponential functions: , , , , , , , , , , , , ,

    • Functions to assemble and disassemble strings: ,

    • Comparison of strings: , ,

    • Functions on string values: , , , , , , , , , , , , ,

    • Functions based on substring matching: , , , , , , , , ,

    • String functions that use regular expressions: , , , , ,

    • Functions that manipulate URIs: , , , ,

    • General functions on sequences: , , , , , , , , ,

    • Function that compare values in sequences: , , , , .

    • Functions that test the cardinality of sequences: , ,

    • Aggregate functions: , , , ,

    • Serializing functions: (unary)

    • Context information: , , , ,

    • Constructor functions: for all builtin types, with the name of the builtin type and unary. Equivalent to a cast expression.

    
    let $o := ("foo", [ 1, 2, 3 ], { "a" : 1, "b" : 2 }, { "a" : 3, "c" : 4 })
    return keys($o)
            
    
    let $map := { "eyes" : "blue", "hair" : "fuchsia" }
    for $key in keys($map)
    return { $key : $map.$key }
            
    
    let $planets :=  ( "foo", { "foo" : "bar "}, [ "mercury", "venus", "earth", "mars" ], [ 1, 2, 3 ])
    return members($planets)
            
    
    parse-json("{ \"foo\" : \"bar\" }", { "jsoniq-multiple-top-level-items" : false })
            
    
    parse-json("{ \"foo\" : \"bar\" } { \"bar\" : \"foo\" }")
            
    
    let $a := [1 to 10]
    return size($a)
            
    
    declare function accumulate($seq as item*) as object
    {
      {|
        keys($seq) ! { $$ : $seq.$$ }
      |}
    };
          
    
    declare function descendant-arrays($seq as item*) as array*
    {
      for $i in $seq
      return typeswitch ($i)
      case array return ($i, descendant-arrays($i[])
      case object return descendant-arrays(values($i))
      default return ()
    };
          
    
    declare function descendant-objects($seq as item*) as object*
    {
      for $i in $seq
      return typeswitch ($i)
      case object return ($i, descendant-objects(values($i)))
      case array return descendant-objects($i[])
      default return ()
    };
          
    
    declare function descendant-pairs($seq as item*)
    {
      for $i in $seq
      return typeswitch ($i)
      case object return
        for $k in keys($o)
        let $v := $o.$k
        return ({ $k : $v }, descendant-pairs($v))
      case array return descendant-pairs($i[])
      default return ()
    };
          
    
    let $o := 
    {
      "first" : 1,
      "second" : { 
        "first" : "a", 
        "second" : "b" 
      }
    }
    return descendant-pairs($o)
            
    
    declare function flatten($seq as item*) as item*
    {
      for $value in $seq
      return typeswitch ($value)
             case array return flatten($value[])
             default return $value
    };
    	  
    
    declare function intersect($seq as item*)
    {
      {|
        let $objects := $seq[. instance of object()]
        for $key in keys(head($objects))
        where every $object in tail($objects)
              satisfies exists(index-of(keys($object), $key))
        return { $key : $objects.$key }
      |}
    };
          
    
    declare function project($seq as item*, $keys as string*) as item*
    {
      for $item in $seq
      return typeswitch ($item)
             case $object as object return
             {|
               for $key in keys($object)
               where some $to-project in $keys satisfies $to-project eq $key
               let $value := $object.$key
               return { $key : $value }
             |}
             default return $item
    };
          
    
    let $o := {
      "Captain" : "Kirk",
      "First Officer" : "Spock",
      "Engineer" : "Scott"
      }
    return project($o, ("Captain", "First Officer"))
            
    
    let $o := {
      "Captain" : "Kirk",
      "First Officer" : "Spock",
      "Engineer" : "Scott"
      }
    return project($o, "XQuery Evangelist")
            
    
    declare function remove-keys($seq as item*, $keys as string*) as item*
    {
      for $item in $seq
      return typeswitch ($item)
             case $object as object return
             {|
               for $key in keys($object)
               where every $to-remove in $keys satisfies $to-remove ne $key
               let $value := $object.$key
               return { $key : $value }
             |}
             default return $item
    };
          
    
    let $o := {
      "Captain" : "Kirk",
      "First Officer" : "Spock",
      "Engineer" : "Scott"
      }
    return remove-keys($o, ("Captain", "First Officer"))
            
    
    declare function values($seq as item*) as item* {
      for $i in $seq
      for $k in jn:keys($i)
      return $i($k)
    };
          
    round-half-to-even#1
    number#0
    number#1
    format-integer#2
    format-integer#3
    format-numberreturn r#2
    format-number#3
    pi#0
    exp#1
    exp10#1
    log#1
    log10#1
    pow#2
    sqrt#1
    sin#1
    cos#1
    tan#1
    asin#1
    acos#1
    atan#1
    atan2#1
    codepoints-to-string#1
    string-to-codepoints#1
    compare#2
    compare#3
    codepoint-equal#2
    concat#2
    string-join#1
    string-join#2
    substring#2
    substring#3
    string-length#0
    string-length#1
    normalize-space#0
    normalize-space#1
    normalize-unicode#1
    normalize-unicode#2
    upper-case#1
    lower-case#1
    translate#3
    contains#2
    contains#3
    starts-with#2
    starts-with#3
    ends-with#2
    ends-with#3
    substring-before#2
    substring-before#3
    substring-after#2
    substring-after#3
    matches#2
    matches#3
    replace#3
    replace#4
    tokenize#2
    tokenize#3
    resolve-uri#1
    resolve-uri#2
    encode-for-uri#1
    iri-to-uri#1
    escape-html-uri#1
    empty#1
    exists#1
    head#1
    tail#1
    insert-before#3
    remove#2
    reverse#1
    subsequence#2
    subsequence#3
    unordered#1
    distinct-values#1
    distinct-values#2
    index-of#2
    index-of#3
    deep-equal#2
    deep-equal#3
    zero-or-one#1
    one-or-more#1
    exactly-one#1
    count#1
    avg#1
    max#1
    min#1
    sum#1
    serialize#1
    current-dateTime#1
    current-date#1
    current-time#1
    implicit-timezone#1
    default-collation#1

    Atomic types

    W3C-conformant, but support for xs:ID, xs:IDREF, xs:IDREFS, xs:Name, xs:NCName, xs:ENTITY, xs:ENTITIES, xs:NOTATION omitted (except for engines also supporting XML)

    js:null type

    JSONiq-specific

    js:item, js:atomic types

    JSONiq-specific synonyms for item() and xs:anyAtomicType

    Structured types

    JSONiq-specific

    Function types

    W3C-conformant

    Empty sequence type

    JSONiq-specific notation () for empty-sequence()

    XML node types

    Omitted (optional support by engines supporting XML)

    Concepts

    Effective boolean value

    W3C-conformant, extended with object, array and null semantics

    Atomization

    Omitted (optional support by engines supporting XML)

    Expressions

    Numeric literals

    W3C-conformant

    String literals

    W3C-conformant, but escape is done with \ not with &

    Boolean and null literals

    JSONiq-specific

    Variable reference

    W3C-conformant

    Parenthesized expressions

    W3C-conformant

    Context item expressions

    W3C-conformant but $$ syntax instead of .

    Static function calls

    W3C-conformant

    Named function reference

    W3C-conformant

    Inline function expressions

    W3C-conformant

    Filter expressions

    W3C-conformant

    Dynamic function calls

    W3C-conformant

    Path expressions (XML)

    Omitted (optional support by engines supporting XML, but relative paths must start with ./)

    Object lookup

    JSONiq-specific

    Array lookup

    JSONiq-specific

    Array unboxing

    JSONiq-specific

    Sequence expressions

    W3C-conformant

    Arithmetic expressions

    W3C-conformant, no atomization needed (except for engines also supporting XML)

    String concatenation expressions

    W3C-conformant

    Comparison expressions

    W3C-conformant, no need to atomize or convert from untyped and untypedAtomic (except for engines also supporting XML)

    Logical expressions

    W3C-conformant

    XML constructors

    Omitted (optional support by engines supporting XML)

    JSON (object and array) constructors

    JSONiq-specific

    FLWOR expressions

    W3C-conformant

    Unordered and ordered expressions

    W3C-conformant

    Conditional expressions

    W3C-conformant

    Switch expressions

    W3C-conformant

    Quantified expressions

    W3C-conformant

    Try-catch expressions

    W3C-conformant

    Instance-of expressions

    W3C-conformant

    Typeswitch expressions

    W3C-conformant

    Cast expressions

    W3C-conformant

    Castable expressions

    W3C-conformant

    Constructor functions

    W3C-conformant, additional constructor function for null()

    Treat expressions

    W3C-conformant

    Simple map operator

    W3C-conformant

    Validate expressions

    Omitted (optional support by engines supporting XML)

    Extension expressions

    W3C-conformant

    Static context

    XPath 1.0 compatibility mode

    Omitted (optional support by engines supporting XML)

    Statically known namespaces

    W3C-conformant

    Default element/type namespace

    W3C-conformant, strong recommendation for implementations to overwrite with the proxy namespace http://jsoniq.org/default-type-namespace to omit prefixes.

    Default function namespace

    W3C-conformant, strong recommendation for implementations to overwrite with http://jsoniq.org/default-function-namespace to omit prefixes.

    In-scope schema definitions

    Omitted (optional support by engines supporting XML)

    In-scope variables

    W3C-conformant

    Context item static type

    W3C-conformant

    Statically known function signatures

    W3C-conformant, augmented with all JSONiq builtin functions

    Statically known collations

    W3C-conformant

    Default collation

    W3C-conformant

    Construction mode

    Omitted (optional support by engines supporting XML)

    Ordering mode

    W3C-conformant

    Default order for empty sequences

    W3C-conformant

    Boundary-space policy

    Omitted (optional support by engines supporting XML)

    Copy-namespaces mode

    Omitted (optional support by engines supporting XML)

    Static Base URI

    W3C-conformant

    Statically known documents

    Omitted (optional support by engines supporting XML)

    Statically known collections

    Omitted (optional support by engines supporting XML)

    Statically known default collection type

    Omitted (optional support by engines supporting XML)

    Statically known decimal formats

    W3C-conformant

    Dynamic context

    Context item

    W3C-conformant (but with syntax $$ not .)

    Initial context item

    W3C-conformant

    Context position

    W3C-conformant

    Context size

    W3C-conformant

    Variable values

    W3C-conformant

    Named functions

    W3C-conformant

    Current dateTime

    W3C-conformant

    Implicit timezone

    W3C-conformant

    Default language

    W3C-conformant

    Default calendar

    W3C-conformant

    Default place

    W3C-conformant

    Available documents

    Omitted (optional support by engines supporting XML)

    Available text resources

    W3C-conformant

    Available node collections

    Omitted (optional support by engines supporting XML)

    Default node collection

    Omitted (optional support by engines supporting XML)

    Available resource collections

    Omitted (optional support by engines supporting XML)

    Default resource collection

    Omitted (optional support by engines supporting XML)

    Environment variables

    W3C-conformant

    RumbleML

    RumbleDB ML

    RumbleDB ML is a Machine Learning library built on top of the RumbleDB engine that makes it more productive and easier to perform ML tasks thanks to the abstraction layer provided by JSONiq.

    The machine learning capabilities are exposed through JSONiq function items. The concepts of "estimator" and "transformer", which are core to Machine Learning, are naturally function items and fit seamlessly in the JSONiq data model.

    Training sets, test sets, and validation sets, which contain features and labels, are exposed through JSONiq sequences of object items: the keys of these objects are the features and labels.

    The names of the estimators and of the transformers, as well as the functionality they encapsulate, are directly inherited from the SparkML library which RumbleDB ML is based on: we chose not to reinvent the wheel.

    Transformers

    A transformer is a function item that maps a sequence of objects to a sequence of objects.

    It is an abstraction that either performs a feature transformation or generates predictions based on trained models. For example:

    • Tokenizer is a feature transformer that receives textual input data and splits it into individual terms (usually words), which are called tokens.

    • KMeansModel is a trained model and a transformer that can read a dataset containing features and generate predictions as its output.

    Estimators

    An estimator is a function item that maps a sequence of objects to a transformer (yes, you got it right: that's a function item returned by a function item. This is why they are also called higher-order functions!).

    Estimators abstract the concept of a Machine Learning algorithm or any algorithm that fits or trains on data. For example, a learning algorithm such as KMeans is implemented as an Estimator. Calling this estimator on data essentially trains a KMeansModel, which is a Model and hence a Transformer.

    Parameters

    Transformers and estimators are function items in the RumbleDB Data Model. Their first argument is the sequence of objects that represents, for example, the training set or test set. Parameters can be provided as their second argument. This second argument is expected to be an object item. The machine learning parameters form the fields of the said object item as key-value pairs.

    Type Annotations

    RumbleDB ML works on highly structured data, because it requires full type information for all the fields in the training set or test set. It is on our development plan to automate the detection of these types when the sequence of objects gets created in the fly.

    RumbleDB supports a user-defined type system with which you can validate and annotate datasets against a JSound schema.

    This annotation is required to be applied on any dataset that must be used as input to RumbleDB ML, but it is superfluous if the data was directly read from a structured input format such as Parquet, CSV, Avro, SVM or ROOT.

    Examples

    • Tokenizer Example:

    • KMeans Example:

    RumbleDB ML Functionality Overview:

    RumblDB eML - Catalogue of Estimators:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    RumbleDB ML - Catalogue of Transformers:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    Parameters:

    AFTSurvivalRegression
    ALS
    BisectingKMeans
    BucketedRandomProjectionLSH
    ChiSqSelector
    CountVectorizer
    CrossValidator
    DecisionTreeClassifier
    DecisionTreeRegressor
    FPGrowth
    GBTClassifier
    GBTRegressor
    GaussianMixture
    GeneralizedLinearRegression
    IDF
    Imputer
    IsotonicRegression
    KMeans
    LDA
    LinearRegression
    LinearSVC
    LogisticRegression
    MaxAbsScaler
    MinHashLSH
    MinMaxScaler
    MultilayerPerceptronClassifier
    NaiveBayes
    OneHotEncoder
    OneVsRest
    PCA
    Pipeline
    QuantileDiscretizer
    RFormula
    RandomForestClassifier
    RandomForestRegressor
    StandardScaler
    StringIndexer
    TrainValidationSplit
    VectorIndexer
    Word2Vec
    AFTSurvivalRegressionModel
    ALSModel
    Binarizer
    BisectingKMeansModel
    BucketedRandomProjectionLSHModel
    Bucketizer
    ChiSqSelectorModel
    ColumnPruner
    CountVectorizerModel
    CrossValidatorModel
    DCT
    DecisionTreeClassificationModel
    DecisionTreeRegressionModel
    DistributedLDAModel
    ElementwiseProduct
    FPGrowthModel
    FeatureHasher
    GBTClassificationModel
    GBTRegressionModel
    GaussianMixtureModel
    GeneralizedLinearRegressionModel
    HashingTF
    IDFModel
    ImputerModel
    IndexToString
    Interaction
    IsotonicRegressionModel
    KMeansModel
    LinearRegressionModel
    LinearSVCModel
    LocalLDAModel
    LogisticRegressionModel
    MaxAbsScalerModel
    MinHashLSHModel
    MinMaxScalerModel
    MultilayerPerceptronClassificationModel
    NGram
    NaiveBayesModel
    Normalizer
    OneHotEncoder
    OneHotEncoderModel
    OneVsRestModel
    PCAModel
    PipelineModel
    PolynomialExpansion
    RFormulaModel
    RandomForestClassificationModel
    RandomForestRegressionModel
    RegexTokenizer
    SQLTransformer
    StandardScalerModel
    StopWordsRemover
    StringIndexerModel
    Tokenizer
    TrainValidationSplitModel
    VectorAssembler
    VectorAttributeRewriter
    VectorIndexerModel
    VectorSizeHint
    VectorSlicer
    Word2VecModel
    
    declare type local:id-and-sentence as {
      "id": "integer",
      "sentence": "string"
    };
    
    
    let $local-data := (
        {"id": 1, "sentence": "Hi I heard about Spark"},
        {"id": 2, "sentence": "I wish Java could use case classes"},
        {"id": 3, "sentence": "Logistic regression models are neat"}
    )
    let $df-data := validate type local:id-and-sentence* { $local-data }
    
    let $transformer := get-transformer("Tokenizer")
    for $i in $transformer(
        $df-data,
        {"inputCol": "sentence", "outputCol": "output"}
    )
    return $i
    
    // returns
    // { "id" : 1, "sentence" : "Hi I heard about Spark", "output" : [ "hi", "i", "heard", "about", "spark" ] }
    // { "id" : 2, "sentence" : "I wish Java could use case classes", "output" : [ "i", "wish", "java", "could", "use", "case", "classes" ] }
    // { "id" : 3, "sentence" : "Logistic regression models are neat", "output" : [ "logistic", "regression", "models", "are", "neat" ] }
    declare type local:col-1-2-3 as {
      "id": "integer",
      "col1": "decimal",
      "col2": "decimal",
      "col3": "decimal"
    };
    
    let $vector-assembler := get-transformer("VectorAssembler")(
      ?,
      { "inputCols" : [ "col1", "col2", "col3" ], "outputCol" : "features" }
    )
    
    let $local-data := (
        {"id": 0, "col1": 0.0, "col2": 0.0, "col3": 0.0},
        {"id": 1, "col1": 0.1, "col2": 0.1, "col3": 0.1},
        {"id": 2, "col1": 0.2, "col2": 0.2, "col3": 0.2},
        {"id": 3, "col1": 9.0, "col2": 9.0, "col3": 9.0},
        {"id": 4, "col1": 9.1, "col2": 9.1, "col3": 9.1},
        {"id": 5, "col1": 9.2, "col2": 9.2, "col3": 9.2}
    )
    let $df-data := validate type local:col-1-2-3* {$local-data }
    let $df-data := $vector-assembler($df-data)
    
    let $est := get-estimator("KMeans")
    let $tra := $est(
        $df-data,
        {"featuresCol": "features"}
    )
    
    for $i in $tra(
        $df-data,
        {"featuresCol": "features"}
    )
    return $i
    
    // returns
    // { "id" : 0, "col1" : 0, "col2" : 0, "col3" : 0, "prediction" : 0 }
    // { "id" : 1, "col1" : 0.1, "col2" : 0.1, "col3" : 0.1, "prediction" : 0 }
    // { "id" : 2, "col1" : 0.2, "col2" : 0.2, "col3" : 0.2, "prediction" : 0 }
    // { "id" : 3, "col1" : 9, "col2" : 9, "col3" : 9, "prediction" : 1 }
    // { "id" : 4, "col1" : 9.1, "col2" : 9.1, "col3" : 9.1, "prediction" : 1 }
    // { "id" : 5, "col1" : 9.2, "col2" : 9.2, "col3" : 9.2, "prediction" : 1 }
    - aggregationDepth: integer
    - censorCol: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - maxIter: integer
    - predictionCol: string
    - quantileProbabilities: array (of double)
    - quantilesCol: string
    - tol: double
    - alpha: double
    - checkpointInterval: integer
    - coldStartStrategy: string
    - finalStorageLevel: string
    - implicitPrefs: boolean
    - intermediateStorageLevel: string
    - itemCol: string
    - maxIter: integer
    - nonnegative: boolean
    - numBlocks: integer
    - numItemBlocks: integer
    - numUserBlocks: integer
    - predictionCol: string
    - rank: integer
    - ratingCol: string
    - regParam: double
    - seed: double
    - userCol: string
    - distanceMeasure: string
    - featuresCol: string
    - k: integer
    - maxIter: integer
    - minDivisibleClusterSize: double
    - predictionCol: string
    - seed: double
    - bucketLength: double
    - inputCol: string
    - numHashTables: integer
    - outputCol: string
    - seed: double
    - fdr: double
    - featuresCol: string
    - fpr: double
    - fwe: double
    - labelCol: string
    - numTopFeatures: integer
    - outputCol: string
    - percentile: double
    - selectorType: string
    - binary: boolean
    - inputCol: string
    - maxDF: double
    - minDF: double
    - minTF: double
    - outputCol: string
    - vocabSize: integer
    - collectSubModels: boolean
    - estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - numFolds: integer
    - parallelism: integer
    - seed: double
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - seed: double
    - varianceCol: string
    - itemsCol: string
    - minConfidence: double
    - minSupport: double
    - numPartitions: integer
    - predictionCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - lossType: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - validationIndicatorCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - lossType: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - validationIndicatorCol: string
    - featuresCol: string
    - k: integer
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - seed: double
    - tol: double
    - family: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - link: string
    - linkPower: double
    - linkPredictionCol: string
    - maxIter: integer
    - offsetCol: string
    - predictionCol: string
    - regParam: double
    - solver: string
    - tol: double
    - variancePower: double
    - weightCol: string
    - inputCol: string
    - minDocFreq: integer
    - outputCol: string
    - inputCols: array (of string)
    - missingValue: double
    - outputCols: array (of string)
    - strategy: string
    - featureIndex: integer
    - featuresCol: string
    - isotonic: boolean
    - labelCol: string
    - predictionCol: string
    - weightCol: string
    - distanceMeasure: string
    - featuresCol: string
    - initMode: string
    - initSteps: integer
    - k: integer
    - maxIter: integer
    - predictionCol: string
    - seed: double
    - tol: double
    - checkpointInterval: integer
    - docConcentration: double
    - docConcentration: array (of double)
    - featuresCol: string
    - k: integer
    - keepLastCheckpoint: boolean
    - learningDecay: double
    - learningOffset: double
    - maxIter: integer
    - optimizeDocConcentration: boolean
    - optimizer: string
    - seed: double
    - subsamplingRate: double
    - topicConcentration: double
    - topicDistributionCol: string
    - aggregationDepth: integer
    - elasticNetParam: double
    - epsilon: double
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - loss: string
    - maxIter: integer
    - predictionCol: string
    - regParam: double
    - solver: string
    - standardization: boolean
    - tol: double
    - weightCol: string
    - aggregationDepth: integer
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - maxIter: integer
    - predictionCol: string
    - rawPredictionCol: string
    - regParam: double
    - standardization: boolean
    - threshold: double
    - tol: double
    - weightCol: string
    - aggregationDepth: integer
    - elasticNetParam: double
    - family: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - lowerBoundsOnCoefficients: object (of object of double)
    - lowerBoundsOnIntercepts: object (of double)
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - regParam: double
    - standardization: boolean
    - threshold: double
    - thresholds: array (of double)
    - tol: double
    - upperBoundsOnCoefficients: object (of object of double)
    - upperBoundsOnIntercepts: object (of double)
    - weightCol: string
    - inputCol: string
    - outputCol: string
    - inputCol: string
    - numHashTables: integer
    - outputCol: string
    - seed: double
    - inputCol: string
    - max: double
    - min: double
    - outputCol: string
    - blockSize: integer
    - featuresCol: string
    - initialWeights: object (of double)
    - labelCol: string
    - layers: array (of integer)
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - solver: string
    - stepSize: double
    - thresholds: array (of double)
    - tol: double
    - featuresCol: string
    - labelCol: string
    - modelType: string
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - smoothing: double
    - thresholds: array (of double)
    - weightCol: string
    - dropLast: boolean
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCols: array (of string)
    - featuresCol: string
    - labelCol: string
    - parallelism: integer
    - predictionCol: string
    - rawPredictionCol: string
    - weightCol: string
    - inputCol: string
    - k: integer
    - outputCol: string
    - handleInvalid: string
    - inputCol: string
    - inputCols: array (of string)
    - numBuckets: integer
    - numBucketsArray: array (of integer)
    - outputCol: string
    - outputCols: array (of string)
    - relativeError: double
    - featuresCol: string
    - forceIndexLabel: boolean
    - formula: string
    - handleInvalid: string
    - labelCol: string
    - stringIndexerOrderType: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - predictionCol: string
    - seed: double
    - subsamplingRate: double
    - inputCol: string
    - outputCol: string
    - withMean: boolean
    - withStd: boolean
    - handleInvalid: string
    - inputCol: string
    - outputCol: string
    - stringOrderType: string
    - collectSubModels: boolean
    - estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - parallelism: integer
    - seed: double
    - trainRatio: double
    - handleInvalid: string
    - inputCol: string
    - maxCategories: integer
    - outputCol: string
    - inputCol: string
    - maxIter: integer
    - maxSentenceLength: integer
    - minCount: integer
    - numPartitions: integer
    - outputCol: string
    - seed: double
    - stepSize: double
    - vectorSize: integer
    - windowSize: integer
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - quantileProbabilities: array (of double)
    - quantilesCol: string
    - coldStartStrategy: string
    - itemCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - userCol: string
    - inputCol: string
    - outputCol: string
    - threshold: double
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - handleInvalid: string
    - inputCol: string
    - inputCols: array (of string)
    - outputCol: string
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - splits: array (of double)
    - splitsArray: array (of array of double)
    - featuresCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - binary: boolean
    - inputCol: string
    - minTF: double
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - inverse: boolean
    - outputCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - varianceCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - seed: double
    - topicDistributionCol: string
    - inputCol: string
    - outputCol: string
    - scalingVec: object (of double)
    - itemsCol: string
    - minConfidence: double
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - categoricalCols: array (of string)
    - inputCols: array (of string)
    - numFeatures: integer
    - outputCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - featuresCol: string
    - linkPredictionCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - binary: boolean
    - inputCol: string
    - numFeatures: integer
    - outputCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCols: array (of string)
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - labels: array (of string)
    - outputCol: string
    - inputCols: array (of string)
    - outputCol: string
    - featureIndex: integer
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - rawPredictionCol: string
    - threshold: double
    - weightCol: double
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - seed: double
    - topicDistributionCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - threshold: double
    - thresholds: array (of double)
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - max: double
    - min: double
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - thresholds: array (of double)
    - inputCol: string
    - n: integer
    - outputCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - thresholds: array (of double)
    - inputCol: string
    - outputCol: string
    - p: double
    - dropLast: boolean
    - inputCol: string
    - outputCol: string
    - dropLast: boolean
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - rawPredictionCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - degree: integer
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - subsamplingRate: double
    - gaps: boolean
    - inputCol: string
    - minTokenLength: integer
    - outputCol: string
    - pattern: string
    - toLowercase: boolean
    - statement: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - caseSensitive: boolean
    - inputCol: string
    - locale: string
    - outputCol: string
    - stopWords: array (of string)
    - handleInvalid: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - handleInvalid: string
    - inputCol: string
    - size: integer
    - indices: array (of integer)
    - inputCol: string
    - names: array (of string)
    - outputCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

    Function library

    We list here the most important functions supported by RumbleDB, and introduce them by means of examples. Highly detailed specifications can be found in the underlying W3C standard, unless the function is marked as specific to JSON or RumbleDB, in which case it can be found here. JSONiq and RumbleDB intentionally do not support builtin functions on XML nodes, NOTATION or QNames. RumbleDB supports almost all other W3C-standardized functions, please contact us if you are still missing one.

    For the sake of ease of use, all W3C standard builtin functions and JSONiq builtin functions are in the RumbleDB namespace, which is the default function namespace and does not require any prefix in front of function names.

    It is recommended that user-defined functions are put in the local namespace, i.e., their name should have the local: prefix (which is predefined). Otherwise, there is the risk that your code becomes incompatible with subsequent releases if new (unprefixed) builtin functions are introduced.

    Errors and diagnostics

    Diagnostic tracing

    trace

    Fully implemented

    returns (1, 2, 3) and logs it in the log-path if specified

    Functions and operators on numerics

    Functions on numeric values

    abs

    Fully implemented

    returns 2.0

    ceiling

    Fully implemented

    returns 3.0

    floor

    Fully implemented

    returns 2.0

    round

    Fully implemented

    returns 2.0

    returns 2.23

    round-half-to-even

    Fully implemented

    Parsing numbers

    number

    Fully implemented

    returns 15 as a double

    returns NaN as a double

    returns 15 as a double

    Formatting integers

    format-integer

    Not implemented

    ##Formatting numbers

    format-number

    Not implemented

    ##Trigonometric and exponential functions

    ###pi

    Fully implemented

    returns 3.141592653589793

    ###exp

    Fully implemented

    ###exp10

    Fully implemented

    log

    Fully implemented

    log10

    Fully implemented

    pow

    Fully implemented

    sqrt

    Fully implemented

    returns 2

    sin

    Fully implemented

    cos

    Fully implemented

    cosh

    JSONiq-specific. Fully implemented

    sinh

    JSONiq-specific. Fully implemented

    tan

    Fully implemented

    asin

    Fully implemented

    acos

    Fully implemented

    atan

    Fully implemented

    atan2

    Fully implemented

    Random numbers

    random-number-generator

    Not implemented

    Functions on strings

    Functions to assemble and disassemble strings

    string-to-codepoint

    Fully implemented

    returns (84, 104, 233, 114, 232, 115, 101)

    returns ()

    codepoints-to-string

    Fully implemented

    returns "अशॊक"

    returns ""

    Comparison of strings

    compare

    Fully implemented

    returns -1

    codepoint-equal

    Fully implemented

    returns true

    returns ()

    collation-key

    Not implemented

    contains-token

    Not implemented

    Functions on string values

    concat

    Fully implemented

    returns "foobarfoobar"

    string-join

    Fully implemented

    returns "foobarfoobar"

    returns "foo-bar-foobar"

    substring

    Fully implemented

    returns "bar"

    returns "ba"

    string-length

    Fully implemented

    Returns the length of the supplied string, or 0 if the empty sequence is supplied.

    returns 3.

    returns 0.

    ###normalize-space

    Fully implemented

    Normalization of spaces in a string.

    returns "The wealthy curled darlings of our nation."

    normalize-unicode

    Fully implemented

    Returns the value of the input after applying Unicode normalization.

    returns the unicode-normalized version of the input string. Normalization forms NFC, NFD, NFKC, and NFKD are supported. "FULLY-NORMALIZED" though supported, should be used with caution as only the composition exclusion characters supported FULLY-NORMALIZED are which are uncommented in the .

    upper-case

    Fully implemented

    returns "ABCD0"

    lower-case

    Fully implemented

    returns "abc!d"

    translate

    Fully implemented

    returns "BAr"

    returns "AAA"

    Functions based on substring matching

    contains

    Fully implemented

    returns true.

    starts-with

    Fully implemented

    returns true

    ends-with

    Fully implemented

    returns true.

    substring-before

    Fully implemented

    returns "foo"

    returns "f"

    substring-after

    Fully implemented

    returns "bar"

    returns ""

    String functions that use regular expressions

    matches

    Arity 2 implemented, arity 3 is not.

    Regular expression matching. The semantics of regular expressions are those of Java's Pattern class.

    returns true.

    returns true.

    replace

    Arity 3 implemented, arity 4 is not.

    Regular expression matching and replacing. The semantics of regular expressions are those of Java's Pattern class.

    returns "a*cada*"

    returns "abbraccaddabbra"

    tokenize

    Arity 2 implemented, arity 3 is not.

    returns ("aa", "bb", "cc", "dd")

    returns ("aa", "bb", "cc", "dd")

    analyze-string

    Not implemented

    Functions that manipulate URIs

    resolve-uri

    Fully implemented

    returns http://www.examples.com/examples

    encode-for-uri

    Fully implemented

    returns 100%25%20organic

    iri-to-uri

    Not implemented

    escape-html-uri

    Not implemented

    Functions and operators on Boolean values

    Boolean constant functions

    true

    Fully implemented

    returns true

    false

    Fully implemented

    returns false

    boolean

    Fully implemented

    returns true

    returns false

    not

    Fully implemented

    returns false

    returns true

    Functions and operators on durations

    Component extraction functions on durations

    years-from-duration

    Fully implemented

    returns 2021.

    months-from-duration

    Fully implemented

    returns 6.

    days-from-duration

    Fully implemented

    returns 17.

    hours-from-duration

    Fully implemented

    returns 12.

    minutes-from-duration

    Fully implemented

    returns 35.

    seconds-from-duration

    Fully implemented

    returns 30.

    Functions and operators on dates and times

    Constructing a DateTime

    dateTime

    Fully implemented

    returns 2004-04-12T13:20:00+14:00

    Component extraction functions on dates and times

    year-from-dateTime

    Fully implemented

    returns 2021.

    month-from-dateTime

    Fully implemented

    returns 04.

    day-from-dateTime

    Fully implemented

    returns 12.

    hours-from-dateTime

    Fully implemented

    returns 13.

    minutes-from-dateTime

    Fully implemented

    returns 20.

    seconds-from-dateTime

    Fully implemented

    returns 32.

    timezone-from-dateTime

    Fully implemented

    returns PT2H.

    year-from-date

    Fully implemented

    returns 2021.

    month-from-date

    Fully implemented

    returns 6.

    day-from-date

    Fully implemented

    returns 4.

    timezone-from-date

    Fully implemented

    returns -PT14H.

    hours-from-time

    Fully implemented

    returns 13.

    minutes-from-time

    Fully implemented

    returns 20.

    seconds-from-time

    Fully implemented

    returns 32.123.

    timezone-from-time

    Fully implemented

    returns PT2H.

    Timezone adjustment functions on dates and time values

    adjust-dateTime-to-timezone

    Fully implemented

    returns 2004-04-12T03:25:15+04:05.

    adjust-date-to-timezone

    Fully implemented

    returns 2014-03-12+04:00.

    adjust-time-to-timezone

    Fully implemented

    returns 04:20:00-14:00.

    Formatting dates and times functions

    The functions in this section accept a simplified version of the picture string, in which a variable marker accepts only:

    • One of the following component specifiers: Y, M, d, D, F, H, m, s, P

    • A first presentation modifier, for which the value can be:

      • Nn, for all supported component specifiers, besides P

      • N, if the component specifier is P

    format-dateTime

    Fully implemented

    returns 20-13-12-4-2004

    format-date

    Fully implemented

    returns 12-4-2004

    format-time

    Fully implemented

    returns 13-20-0

    Functions related to QNames

    Not implemented

    Functions and operators on sequences

    General functions and operators on sequences

    empty

    Fully implemented

    Returns a boolean whether the input sequence is empty or not.

    returns false.

    exists

    Fully implemented

    Returns a boolean whether the input sequence has at least one item or not.

    returns true.

    returns false.

    This is pushed down to Spark and works on big sequences.

    head

    Fully implemented

    Returns the first item of a sequence, or the empty sequence if it is empty.

    returns 1.

    returns ().

    This is pushed down to Spark and works on big sequences.

    tail

    Fully implemented

    Returns all but the last item of a sequence, or the empty sequence if it is empty.

    returns (2, 3, 4, 5).

    returns ().

    This is pushed down to Spark and works on big sequences.

    insert-before

    Fully implemented

    returns (1, 2, 3, 4, 5).

    remove

    Fully implemented

    returns (1, 2).

    reverse

    Fully implemented

    returns (3, 2, 1).

    subsequence

    Fully implemented

    returns (2, 3).

    unordered

    Fully implemented

    returns (1, 2, 3).

    Functions that compare values in sequences

    distinct-values

    Fully implemented

    Eliminates duplicates from a sequence of atomic items.

    returns (1, 4, 3, "foo", true, 5).

    This is pushed down to Spark and works on big sequences.

    index-of

    Fully implemented

    returns 3.

    returns "".

    deep-equal

    Fully implemented

    returns true.

    returns false.

    Functions that test the cardinality of sequences

    zero-or-one

    Fully implemented

    returns "a".

    returns an error.

    one-or-more

    Fully implemented

    returns "a".

    returns an error.

    exactly-one

    Fully implemented

    returns "a".

    returns an error.

    Aggregate functions

    count

    Fully implemented

    returns 4.

    Count calls are pushed down to Spark, so this works on billions of items as well:

    avg

    Fully implemented

    returns 2.5.

    Avg calls are pushed down to Spark, so this works on billions of items as well:

    max

    Fully implemented

    returns 4.

    returns (1, 2, 3).

    Max calls are pushed down to Spark, so this works on billions of items as well:

    min

    Fully implemented

    returns 1.

    returns (1, 2, 3).

    Min calls are pushed down to Spark, so this works on billions of items as well:

    sum

    Fully implemented

    returns 10.

    Sum calls are pushed down to Spark, so this works on billions of items as well:

    Functions giving access to external information

    doc

    Fully implemented

    Returns the corresponding document node

    collection

    Not implemented

    Parsing and serializing

    serialize

    Fully implemented

    Serializes the supplied input sequence, returning the serialized representation of the sequence as a string

    returns { "hello" : "world" }

    Context Functions

    position

    Fully implemented

    returns 5

    last

    Fully implemented

    returns 10

    returns 10

    current-dateTime

    Fully implemented

    returns 2020-02-26T11:22:48.423+01:00

    current-date

    Fully implemented

    returns 2020-02-26Europe/Zurich

    current-time

    Fully implemented

    returns 11:24:10.064+01:00

    implicit-timezone

    Fully implemented

    returns PT1H.

    default-collation

    Fully implemented

    returns http://www.w3.org/2005/xpath-functions/collation/codepoint.

    High order functions

    Functions on functions

    function-lookup

    Not implemented

    function-name

    Not implemented

    function-arity

    Not implemented

    Basic higher-order functions

    for-each

    Not implemented

    filter

    Not implemented

    fold-left

    Not implemented

    fold-right

    Not implemented

    for-each-pair

    Not implemented

    JSONiq functions

    keys

    Fully implemented

    returns ("foo", "bar"). Also works on an input sequence, eliminating duplicates

    Keys calls are pushed down to Spark, so this works on billions of items as well:

    members

    Fully implemented

    This function returns the members as an array, but not recursively, i.e., nested arrays are not unboxed.

    Returns the first 100 integers as a sequence. Also works on an input sequence, in a distributive way.

    null

    Fully implemented

    Returns a JSON null (also available as the literal null).

    parse-json

    Fully implemented

    size

    Fully implemented

    returns 100. Also works if the empty sequence is supplied, in which case it returns the empty sequence.

    accumulate

    Fully implemented

    returns

    descendant-arrays

    Fully implemented

    returns

    descendant-objects

    Fully implemented

    returns

    descendant-pairs

    Fully implemented

    returns

    flatten

    Fully implemented

    Unboxes arrays recursively, stopping the recursion when any other item is reached (object or atomic). Also works on an input sequence, in a distributive way.

    Returns (1, 2, 3, 4, 5, 6, 7, 8, 9).

    intersect

    Fully implemented

    returns

    project

    Fully implemented

    returns the object {"foo" : "bar", "bar" : "foobar"}. Also works on an input sequence, in a distributive way.

    remove-keys

    Fully implemented

    returns the object {"foobar" : "foo"}. Also works on an input sequence, in a distributive way.

    values

    Fully implemented

    returns ("bar", "foobar"). Also works on an input sequence, in a distributive way.

    Values calls are pushed down to Spark, so this works on billions of items as well:

    encode-for-roundtrip

    Not implemented

    decode-from-roundtrip

    Not implemented

    json-doc

    returns the (unique) JSON value parsed from a local JSON (but not necessarily JSON Lines) file where this value may be spread over multiple lines.

    a format token that indicates a numbering sequence of the the following form: '0001'

  • A second presentation modifier, for which the value can be t or c, which are also the default values

  • A width modifier, both minimum and maximum values

  • W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    following file
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    W3C specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    JSONiq specification
    trace(1 to 3)
    abs(-2)
    ceiling(2.3)
    floor(2.3)
    round(2.3)
    round(2.2345, 2)
    round-half-to-even(2.2345, 2), round-half-to-even(2.2345)
    number("15")
    number("foo")
    number(15)
    pi()
    exp(10)
    exp10(10)
    log(100)
    log10(100)
    pow(10, 2)
    sqrt(4)
    sin(pi())
    cos(pi())
    cosh(pi())
    sinh(pi())
    tan(pi())
    asin(1)
    acos(1)
    atan(1)
    atan2(1)
    string-to-codepoints("Thérèse")
    string-to-codepoints("")
    codepoints-to-string((2309, 2358, 2378, 2325))
    codepoints-to-string(())
    compare("aa", "bb")
    codepoint-equal("abcd", "abcd")
    codepoint-equal("", ())
    concat("foo", "bar", "foobar")
    string-join(("foo", "bar", "foobar"))
    string-join(("foo", "bar", "foobar"), "-")
    substring("foobar", 4)
    substring("foobar", 4, 2)
    string-length("foo")
    string-length(())
    normalize-space(" The    wealthy curled darlings                                         of    our    nation. "),
    normalize-unicode("hello world", "NFC")
    upper-case("abCd0")
    lower-case("ABc!D")
    translate("bar","abc","ABC")
    translate("--aaa--","abc-","ABC")
    contains("foobar", "ob")
    starts-with("foobar", "foo")
    ends-with("foobar", "bar")
    substring-before("foobar", "bar")
    substring-before("foobar", "o")
    substring-after("foobar", "foo")
    substring-after("foobar", "r")
    matches("foobar", "o+")
    matches("foobar", "^fo+.*")
    replace("abracadabra", "bra", "*")
    replace("abracadabra", "a(.)", "a$1$1")
    tokenize("aa bb cc dd")
    tokenize("aa;bb;cc;dd", ";")
    string(resolve-uri("examples","http://www.examples.com/"))
    encode-for-uri("100% organic")
    fn:true()
    fn:false()
    boolean(9)
    boolean("")
    not(9)
    boolean("")
    years-from-duration(duration("P2021Y6M"))
    months-from-duration(duration("P2021Y6M"))
    days-from-duration(duration("P2021Y6M17D"))
    hours-from-duration(duration("P2021Y6M17DT12H35M30S"))
    minutes-from-duration(duration("P2021Y6M17DT12H35M30S"))
    minutes-from-duration(duration("P2021Y6M17DT12H35M30S"))
    dateTime("2004-04-12T13:20:00+14:00")
    year-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    month-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    day-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    hours-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    minutes-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    seconds-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    timezone-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    year-from-date(date("2021-06-04"))
    month-from-date(date("2021-06-04"))
    day-from-date(date("2021-06-04"))
    timezone-from-date(date("2021-06-04-14:00"))
    hours-from-time(time("13:20:32.123+02:00"))
    minutes-from-time(time("13:20:32.123+02:00"))
    seconds-from-time(time("13:20:32.123+02:00"))
    timezone-from-time(time("13:20:32.123+02:00"))
    adjust-dateTime-to-timezone(dateTime("2004-04-12T13:20:15+14:00"), dayTimeDuration("PT4H5M"))
    adjust-date-to-timezone(date("2014-03-12"), dayTimeDuration("PT4H"))
    adjust-time-to-timezone(time("13:20:00-05:00"), dayTimeDuration("-PT14H"))
    format-dateTime(dateTime("2004-04-12T13:20:00"), "[m]-[H]-[D]-[M]-[Y]")
    format-date(date("2004-04-12"), "[D]-[M]-[Y]")
    format-time(time("13:20:00"), "[H]-[m]-[s]")
    empty(1 to 10)
    exists(1 to 10)
    exists(())
    exists(json-lines("file.json"))
    head(1 to 10)
    head(())
    head(json-lines("file.json"))
    tail(1 to 5)
    tail(())
    tail(json-lines("file.json"))
    insert-before((3, 4, 5), 0, (1, 2))
    remove((1, 2, 10), 3)
    remove((1, 2, 3))
    subsequence((1, 2, 3), 2, 5)
    unordered((1, 2, 3))
    distinct-values((1, 1, 4, 3, 1, 1, "foo", 4, "foo", true, 3, 1, true, 5, 3, 1, 1))
    distinct-values(json-lines("file.json").foo)
    distinct-values(text-file("file.txt"))
    index-of((10, 20, 30, 40), 30)
    index-of((10, 20, 30, 40), 35)
    deep-equal((10, 20, "a"), (10, 20, "a"))
    deep-equal(("b", "0"), ("b", 0))
    zero-or-one(("a"))
    zero-or-one(("a", "b"))
    one-or-more(("a"))
    one-or-more(())
    exactly-one(("a"))
    exactly-one(("a", "b"))
    let $x := (1, 2, 3, 4)
    return count($x)
    count(json-lines("file.json"))
    count(
      for $i in json-lines("file.json")
      where $i.foo eq "bar"
      return $i
    )
    let $x := (1, 2, 3, 4)
    return avg($x)
    avg(json-lines("file.json").foo)
    let $x := (1, 2, 3, 4)
    return max($x)
    for $i in 1 to 3
    return max($i)
    max(json-lines("file.json").foo)
    let $x := (1, 2, 3, 4)
    return min($x)
    for $i in 1 to 3
    return min($i)
    min(json-lines("file.json").foo)
    let $x := (1, 2, 3, 4)
    return sum($x)
    sum(json-lines("file.json").foo)
    doc("path/to/file.xml")
    serialize({hello: "world"})
    (1 to 10)[position() eq 5]
    (1 to 10)[position() eq last()]
    (1 to 10)[last()]
    current-dateTime()
    current-date()
    current-time()
    implicit-timezone()
    default-collation()
    keys({"foo" : "bar", "bar" : "foobar"})
    keys(({"foo" : "bar", "bar" : "foobar"}, {"foo": "bar2"}))
    keys(json-lines("file.json"))
    members([1 to 100])
    members(([1 to 100], [ 300 to 1000 ]))
    null()
    size([1 to 100])
    size(())
    accumulate(({ "b" : 2 }, { "c" : 3 }, { "b" : [1, "abc"] }, {"c" : {"d" : 0.17}}))
    { "b" : [ 2, [ 1, "abc" ] ], "c" : [ 3, { "d" : 0.17 } ] }
    descendant-arrays(([0, "x", { "a" : [1, {"b" : 2}, [2.5]], "o" : {"c" : 3} }]))
    [ 0, "x", { "a" : [ 1, { "b" : 2 }, [ 2.5 ] ], "o" : {"c" : 3} } ]
    [ 1, { "b" : 2 }, [ 2.5 ] ]
    [ 2.5 ]
    descendant-objects(([0, "x", { "a" : [1, {"b" : 2}, [2.5]], "o" : {"c" : 3} }]))
    { "a" : [ 1, { "b" : 2 }, [ 2.5 ] ], "o" : { "c" : 3 } }
    { "b" : 2 }
    { "c" : 3 }
    descendant-pairs(({ "a" : [1, {"b" : 2}], "d" : {"c" : 3} }))
    { "a" : [ 1, { "b" : 2 } ] }
    { "b" : 2 }
    { "d" : { "c" : 3 } }
    { "c" : 3 }
    flatten(([1, 2], [[3, 4], [5, 6]], [7, [8, 9]]))
    intersect(({"a" : "abc", "b" : 2, "c" : [1, 2], "d" : "0"}, { "a" : 2, "b" : "ab", "c" : "foo" }))
    { "a" : [ "abc", 2 ], "b" : [ 2, "ab" ], "c" : [ [ 1, 2 ], "foo" ] }
    project({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, ("foo", "bar"))
    project(({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, {"foo": "bar2"}), ("foo", "bar"))
    remove-keys({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, ("foo", "bar"))
    remove-keys(({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, {"foo": "bar2"}), ("foo", "bar"))
    values({"foo" : "bar", "bar" : "foobar"})
    values(({"foo" : "bar", "bar" : "foobar"}, {"foo" : "bar2"}))
    values(json-lines("file.json"))
    json-doc("/Users/sheldon/object.json")

    Expressions

    Construction of items

    In JSONiq, objects, arrays and basic atomic values (string, number, boolean, null) are constructed exactly as they are constructed in JSON. Any JSON document is also a valid JSONiq query which just "returns itself".

    Because JSONiq expressions are fully composable, however, in objects and arrays constructors, it is possible to put any JSONiq expression and not only atomic literals, object constructors and array constructors. Furthermore, JSONiq supports the construction of other W3C-standardized builtin types (date, hexBinary, etc).

    The following examples are a few of many operators available in JSONiq: "to" for creating arithmetic sequences, "||" for concatenating strings, "+" for adding numbers, "," for appending sequences.

    In an array, the operand expression will evaluated to a sequence of items, and these items will be copied and become members of the newly created array.

    Composable array constructors

    Result:[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]

    In an object, the expression you use for the key must evaluate to an atomic - if it is not a string, it will just be cast to it.

    Composable object keys

    Result:{ "foobar" : true }

    An error is raised if the key expressions is not an atomic.

    Non-atomic object keys

    Result:An error was raised: can not atomize an array item: an array has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)

    If the value expression is empty, null will be used as a value, and if it contains two items or more, they will be wrapped into an array.

    If the colon is preceded with a question mark, then the pair will be omitted if the value expression evaluates to the empty sequence.

    Composable object values

    Result:{ "foo" : 2 }

    Composable object values and automatic conversion

    Result:{ "foo" : null, "bar" : [ 1, 2 ] }

    Optional pair (not implemented yet in Zorba)

    Result:An error was raised: invalid expression: syntax error, unexpected "?", expecting "end of file" or "," or "}"

    The {| |} syntax can be used to merge several objects.

    Merging object constructor

    Result:{ "foo" : "bar", "bar" : "foo" }

    An error is raised if the operand expression does not evaluate to a sequence of objects.

    Merging object constructor with a type error

    Result:An error was raised: xs:integer can not be treated as type object()*

    Numbers

    JSONiq follows the for constructing numbers. The following explanations, provided as an informal summary for convenience, are non-normative.

    Literal

    NumericLiteral

    IntegerLiteral

    DecimalLiteral

    DoubleLiteral

    The syntax for creating numbers is identical to that of JSON (it is actually a more flexible superset, for example leading 0s are allowed, and a decimal literal can begin with a dot). Note that JSONiq distinguishes between integers (no dot, no scientific notation), decimals (dot but no scientific notation) and doubles (scientific notation). As expected, an integer literal creates an atomic of type integer, and so on.

    Integer literals

    Result:42

    Decimal literals

    Result:3.14

    Double literals

    Result:6.022E23

    Strings

    The syntax for creating string items is conformant to rather than to the W3C standard for string literals. This means concretely that escaping is done with backslashes and not with ampersands. Also, like in JSON, double quotes are required and single quotes are forbidden.

    StringLiteral

    String literals

    Result:foo

    String literals with escaping

    Result:This is a line and this is a new line

    String literals with Unicode character escaping

    Result:&#x1;

    String literals with a nested quote

    Result:This is a nested "quote"

    Booleans and null

    JSONiq also introduces three more literals for constructing booleans and nulls: true, false and null. This makes in particular the functions true() and false() superfluous.

    BooleanLiteral

    NullLiteral

    Boolean literals (true)

    Result:true

    Boolean literals (false)

    Result:false

    Null literals

    Result:null

    Other atomic values

    JSONiq follows the for constructing most atomic values with constructors. In JSONiq, the xs prefix is optional.

    Objects

    Expressions constructing objects are JSONiq-specific and introduced in this specification.

    ObjectConstructor

    PairConstructor

    The syntax for creating objects is identical to that of JSON. You can use for an object key any string literal, and for an object value any literal, object constructor or array constructor.

    Empty object constructors

    Result:{ }

    Object constructors 1

    Result:{ "foo" : "bar" }

    Object constructors 2

    Result:{ "foo" : [ 1, 2, 3, 4, 5, 6 ] }

    Object constructors 3

    Result:{ "foo" : true, "bar" : false }

    Nested object constructors

    Result:{ "this is a key" : { "value" : "a value" } }

    As in JavaScript, if your key is simple enough (like alphanumerics, underscores, dashes, this kind of things), the quotes can be omitted. The strings for which quotes are not mandatory are called NCNames. This class of strings can be used for unquoted keys, for variable and function names, and for module aliases.

    Object constructors with unquoted key 1

    Result:{ "foo" : "bar" }

    Object constructors with unquoted key 2

    Result:{ "foo" : [ 1, 2, 3, 4, 5, 6 ] }

    Object constructors with unquoted key 3

    Result:{ "foo" : "bar", "bar" : "foo" }

    Object constructors with needed quotes around the key

    Result:{ "but you need the quotes here" : null }

    Objects can be constructed more dynamically (e.g., dynamic keys) by constructing and merging smaller objects. Duplicate key names throw an error.

    Object constructors with needed quotes around the key

    Result:{ "foo1" : 1, "foo2" : 2, "foo3" : 3 }

    Arrays

    Expressions constructing arrays are JSONiq-specific and introduced in this specification.

    ArrayConstructor

    Expr

    The syntax for creating arrays is identical to that of JSON: square brackets, comma separated literals, object constructors and arrays constructors.

    Empty array constructors

    Result:[ ]

    Array constructors

    Result:[ 1, 2, 3, 4, 5, 6 ]

    Nested array constructors

    Result:[ "foo", 3.14, [ "Go", "Boldly", "When", "No", "Man", "Has", "Gone", "Before" ], { "foo" : "bar" }, true, false, null ]

    Square brackets are mandatory. Do not push it.

    Functions

    JSONiq follows the for constructing function items with inline expressions or . The following explanations, provided as an informal summary for convenience, are non-normative.

    Function items can be constructed in two ways: by definining its body directly (inline function expression), or by referring by name to a function declared in a prolog.

    FunctionItemExpr

    Inline function expression

    JSONiq follows the for constructing function items with inline expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    A function can be built directly by specifying its parameters and its body as expression. Types are optional and by default, assumed to be item*.

    Function items can also be produced with a partial function application.

    Inline function expression

    Result(two function items)

    InlineFunctionExpr

    ParamList

    Named function reference

    JSONiq follows the for constructing function items with named function references. The following explanations, provided as an informal summary for convenience, are non-normative.

    If a function is builtin or declared in a prolog, in the same module or imported, then it is also possible to build a function item by referring to its name and arity.

    Named function reference

    Result(a function items)

    NamedFunctionRef

    Manipulating atomic values

    We now introduce the expressions that manipulate atomic values: arithmetics, logics, comparison, string concatenation.

    Arithmetics

    JSONiq follows the for arithmetic expressions, and naturally extends to return errors for null values. The following explanations, provided as an informal summary for convenience, are non-normative.

    JSONiq supports the basic four operations, integer division and modulo.

    Multiplicative operations have precedence over additive operations. Parentheses can override it.

    Basic arithmetic operations with precedence override

    Result (run with Zorba):8

    Dates, times and durations are also supported in a natural way.

    Using basic operations with dates.

    Result (run with Zorba):P29D

    If any of the operands is a sequence of more than one item, an error is raised.

    Sequence of more than one number in an addition

    Result (run with Zorba):An error was raised: sequence of more than one item can not be promoted to parameter type xs:anyAtomicType? of function add()

    If any of the operands is not a number, a date, a time or a duration, an error is raised, which seamlessly includes raising errors for null with no need to extend the specification.

    Null in an addition

    Result (run with Zorba):An error was raised: arithmetic operation not defined between types "xs:integer" and "js:null"

    If one of the operands evaluates to the empty sequence, then the operation results in the empty sequence.

    If the two operands do not have the same number type, JSONiq will do the adequate conversions.

    Basic arithmetic operations with an empty sequence

    Result (run with Zorba):

    AdditiveExpr

    MultiplicativeExpr

    UnaryExpr

    String concatenation

    JSONiq follows the for string concatenation. The following explanations, provided as an informal summary for convenience, are non-normative.

    Two strings or more can be concatenated using the concatenation operator.

    String concatenation

    Result (run with Zorba):Captain Kirk

    An empty sequence is treated like an empty string.

    String concatenation with the empty sequence

    Result (run with Zorba):CaptainKirk

    StringConcatExpr

    Comparison

    JSONiq follows the for comparison, and only extends its semantics to null values as follows.

    null can be compared for equality or inequality to anything - it is only equal to itself so that false is returned when comparing if for equality with any non-null atomic. True is returned when comparing it with non-equality with any non-null atomic.

    Equality and non-equality comparison with null

    Result (run with Zorba):false true true

    For ordering operators (lt, le, gt, ge), null is considered the smallest possible value (like in JavaScript).

    Ordering comparison with null

    Result (run with Zorba):false

    The following explanations, provided as an informal summary for convenience, are non-normative.

    ComparisonExpr

    Atomics can be compared with the usual six comparison operators (equality, non-equality, lower-than, greater-than, lower-or-equal, greater-or-equal), and with the same two-letter symbols as in MongoDB.

    Equality comparison

    Result (run with Zorba):true true

    Comparison is only possible between two compatible types, otherwise, an error is raised.

    Comparisons with a type mismatch

    Result (run with Zorba):An error was raised: "xs:string": invalid type: can not compare for equality to type "xs:integer"

    Like for arithmetic operations, if an operand is the empty sequence, the empty sequence is returned as well.

    Comparison with the empty sequence

    Result (run with Zorba):

    Comparisons and logic operators are fundamental for a query language and for the implementation of a query processor as they impact query optimization greatly. The current comparison semantics for them is carefully chosen to have the right characteristics as to enable optimization.

    Logics

    JSONiq follows the for logical expressions; it introduces a prefix unary not operator as a synonym for fn:not, and extends the semantics of effective boolean values to objects, arrays and nulls. The following explanations, provided as an informal summary for convenience, are non-normative.

    OrExpr

    AndExpr

    NotExpr

    JSONiq logics support is based on two-valued logics: just true and false.

    Non-boolean operands get automatically converted to either true or false, or an error is raised. The boolean() function performs a manual conversion.

    • An empty sequence is converted to false.

    • A singleton sequence of one null is converted to false.

    • A singleton sequence of one string is converted to true except the empty string which is converted to false.

    • A singleton sequence of one number is converted to true except zero or NaN which are converted to false.

    JSONiq supports the most famous three boolean operations: conjunction, disjunction and negation. Negation has the highest precedence, then conjunction, then disjunction. Parentheses can override.

    Logics with booleans

    Result (run with Zorba):true

    Logics with comparing operands

    Result (run with Zorba):true

    Conversion of the empty sequence to false

    Result (run with Zorba):false

    Conversion of null to false

    Result (run with Zorba):false

    Conversion of a string to true

    Result (run with Zorba):true false

    Conversion of a number to false

    Result (run with Zorba):false true

    Conversion of an object to a boolean (not implemented in Zorba at this point)

    Result (run with Zorba):true

    If the input sequence has more than one item, and the first item is not an object or array, an error is raised.

    Error upon conversion of a sequence of more than one item, not beginning with a JSON item, to a boolean

    Result (run with Zorba):An error was raised: invalid argument type for function fn:boolean(): effective boolean value not defined for sequence of more than one item that starts with "xs:integer"

    Unlike in C++ or Java, you cannot rely on the order of evaluation of the operands of a boolean operation. The following query may return true or may return an error.

    Non-determinism in presence of errors.

    Result (run with Zorba):true

    JSONiq follows the for quantified expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    QuantifiedExpr

    It is possible to perform a conjunction or a disjunction on a predicate for each item in a sequence.

    Universal quantifier

    Result (run with Zorba):true

    Existential quantifier on several variables

    Result (run with Zorba):true

    Variables can be annotated with a type. If no type is specified, item* is assumed. If the type does not match, an error is raised.

    Existential quantifier with type checking

    Result (run with Zorba):true

    Manipulating sequences

    JSONiq can create sequences with concatenation (comma) or with a range. Parentheses can be used for overriding precedence.

    Comma operator

    JSONiq follows the for the concatenation of sequences with commas. The following explanations, provided as an informal summary for convenience, are non-normative.

    Expr

    Use a comma to concatenate two sequences, or even single items. This operator has the lowest precedence of all.

    Comma

    Result (run with Zorba):1 2 3 4 5 6 7 8 9 10

    Comma

    Result (run with Zorba):{ "foo" : "bar" } [ 1 ]

    Sequences do not nest. You need to use arrays in order to nest.

    Range operator

    JSONiq follows the for range expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    RangeExpr

    With the binary operator "to", you can generate larger sequences with just two integer operands.

    Range operator

    Result (run with Zorba):1 2 3 4 5 6 7 8 9 10

    If one operand evaluates to the empty sequence, then the range operator returns the empty sequence.

    Range operator with the empty sequence

    Result (run with Zorba):

    Otherwise, if an operand evaluates to something else than a single integer or an empty sequence, an error is raised.

    Range operator with a type inconsistency

    Result (run with Zorba):An error was raised: sequence of more than one item can not be promoted to parameter type xs:integer? of function to()

    Parenthesized expression

    JSONiq follows the for parenthesized expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    ParenthesizedExpr

    Use parentheses to override the precedence of expressions.

    If the parentheses are empty, the empty sequence is produced.

    Empty sequence

    Result (run with Zorba):

    Calling functions

    JSONiq follows the for function calls. The following explanations, provided as an informal summary for convenience, are non-normative.

    Function calls in JSONiq can either be made statically, with a named function, or dynamically, by passing a function item on the fly.

    The syntax for function calls is similar to many other languages. JSONiq supports four sorts of functions:

    • Builtin functions: these have no prefix and can be called without any import.

    • Local functions: they are defined in the prolog, to be used in the main query. They have the prefix local:. Chapter describes how to define your own local functions.

    • Imported functions: they are defined in a library module. They have the prefix corresponding to the alias to which the imported module has been bound to. Chapter describes how to define your own modules.

    The first three are named functions and can be called statictically. All four can be called dynamically, as a named function can be also passed as an item with a named function reference.

    Static function calls

    JSONiq follows the for static function calls. The following explanations, provided as an informal summary for convenience, are non-normative.

    A static function call consists of the name of the function and of expressions returning its parameters. An error is thrown if no function with the corresponding name and arity is found.

    A builtin function call.

    Result:foo bar

    A builtin function call.

    Result:foobar

    An error is raised if the actual types do not match the expected types.

    A type error in a function call.

    Result:An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)

    JSONiq static function calls follow the .

    FunctionCall

    Dynamic function calls

    JSONiq follows the for dynamic function calls. The following explanations, provided as an informal summary for convenience, are non-normative.

    A dynamic function call is a postfix expression. Its left-hand-side is an expression that must return a single function item (see in the data model ). Its right-hand side is a list of parameters, each one of which is an arbitrary expression providing a sequence of items, one such sequence for each parameter.

    A dynamic function call.

    Result:3

    If the number of parameters does not match the arity of the function, an error is raised. An error is also raised if an argument value does not match the corresponding type in the function signature.

    Otherwise, the function is evaluated with the supplied parameters. If the result matches the return type of the function, it is returned, otherwise an error is raised.

    A dynamic function call with signature

    Result:3

    JSONiq dynamic function calls follow the .

    PostfixExpr

    ArgumentList

    Argument

    Partial application

    JSONiq follows the for partial application. The following explanations, provided as an informal summary for convenience, are non-normative.

    A static or dynamic function call also have placeholder parameters, represented with a question mark in the syntax. When this is the case, the function call returns a function item that is the partial application of the original function, and its arity is the number of remaining placeholders.

    A partial application.

    Result:4

    JSONiq dynamic function calls follow the .

    Navigating objects

    Like in JavaScript, it is possible to navigate through objects and arrays. This is a specific JSONiq extension.

    JSONiq also allows to filter sequences with a predicate and predicates are fully W3C-conformant.

    JSONiq supports filtering items from a sequence, looking up the value associated with a given key in an object, looking up the item at a given position in an array, and looking up all items in an array.

    PostfixExpr

    Object field selector

    ObjectLookup

    The simplest way to navigate in an object is similar to JavaScript, using a dot. This will work as soon as you do not push it too much: alphanumerical characters, dashes, underscores - just like unquoted keys in object constructors, any NCName is allowed.

    Object lookup

    Result (run with Zorba):bar

    Since JSONiq expressions are composable, you can also use any expression for the left-hand side. You might need parentheses depending on the precedence.

    Lookup on a single-object collection.

    Result (run with Zorba):bar

    The dot operator does an implicit mapping on the left-hand-side, i.e., it applies the lookup in turn on each item. Lookup on an object returns the value associated with the supplied key, or the empty sequence if there is none. Lookup on any item which is not an object (arrays and atomics) results in the empty sequence.

    Object lookup with an iteration on several objects

    Result (run with Zorba):bar bar2

    Object lookup with an iteration on a collection

    Result (run with Zorba):James T. Kirk Jean-Luc Picard Benjamin Sisko Kathryn Janeway Jonathan Archer Samantha Carter

    Object lookup on a mixed sequence

    Result (run with Zorba):bar1 bar2

    Of course, unquoted keys will not work for strings that are not NCNames, e.g., if the field contains a dot or begins with a digit. Then you will need quotes.

    Quotes for object lookup

    Result (run with Zorba):bar

    If you use an expression on the right side of the dot, it must always have parentheses. The result of the right-hand-side expression is cast to a string. An error is raised if the cast fails.

    Object lookup with a nested expression

    Result (run with Zorba):bar

    Object lookup with a nested expression

    Result (run with Zorba):An error was raised: sequence of more than one item can not be treated as type xs:string

    Object lookup with a nested expression

    Result (run with Zorba):bar

    Variables, or a context item reference, do not need parentheses. Variables are introduced later, but here is a sneak peek:

    Object lookup with a variable

    Result (run with Zorba):bar

    Array member selector

    ArrayLookup

    Array lookup uses double square brackets.

    Array lookup

    Result (run with Zorba):bar

    Since JSONiq expressions are composable, you can also use any expression for the left-hand side. You might need parentheses depending on the precedence.

    Array lookup after an object lookup

    Result (run with Zorba):bar

    The array lookup operator does an implicit mapping on the left-hand-side, i.e., it applies the lookup in turn on each item. Lookup on an array returns the item at that position in the array, or the empty sequence if there is none (position larger than size or smaller than 1). Lookup on any item which is not an array (objects and atomics) results in the empty sequence.

    Array lookup with an iteration on several arrays

    Result (run with Zorba):2 5

    Array lookup with an iteration on a collection

    Result (run with Zorba):The original series The next generation The next generation The next generation Entreprise Voyager

    Array lookup on a mixed sequence

    Result (run with Zorba):3 6

    The expression inside the double-square brackets may be any expression. The result of evaluating this expression is cast to an integer. An error is raised if the cast fails.

    Array lookup with a right-hand-side expression

    Result (run with Zorba):bar

    ArrayUnboxing

    You can also extract all items from an array (i.e., as a sequence) with the [] syntax. The [] operator also implicitly iterates on the left-hand-side, returning the empty sequence for non-arrays.

    Extracting all items from an array

    Result (run with Zorba):foo bar

    Extracting all items from arrays in a mixed sequence

    Result (run with Zorba):foo bar 1 2 3

    Sequence predicates

    Predicate

    A predicate allows filtering a sequence, keeping only items that fulfill it.

    The predicate is evaluated once for each item in the left-hand-side sequence, with the context item set to that item. The predicate expression can use $$ to access this context item.

    ContextItemExpr

    If the predicate evaluates to an integer, it is matched against the item position in the left-hand side sequence automatically

    Predicate expression

    Result (run with Zorba):2

    Otherwise, the result of the predicate is converted to a boolean.

    All items for which the converted predicate result evaluates to true are then output.

    Predicate expression

    Result (run with Zorba):2 4 6 8 10

    Control flow expressions

    JSONiq supports control flow expressions such as if-then-else, switch and typeswitch following the W3C standard.

    Conditional expressions

    JSONiq follows the for conditional expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    IfExpr

    A conditional expressions allows you to pick one or another value depending on a boolean value.

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    The behavior of the expression inside the if is similar to that of logical operations (two-valued logics), meaning that non-boolean values get converted to a boolean.

    A conditional expression

    Result (run with Zorba):{ "foo" : "no" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "no" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "no" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "no" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    Note that the else clause is mandatory (but can be the empty sequence)

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    Switch expressions

    JSONiq follows the for switch expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    SwitchExpr

    SwitchCaseClause

    A switch expression evaluates the expression inside the switch. If it is an atomic, it compares it in turn to the provided atomic values (with the semantics of the eq operator) and returns the value associated with the first matching case clause.

    Note that if there is an object or array in the base switch expression or any case expression, a JSONiq-specific type error JNTY0004 will be raised, because objects and arrays cannot be atomized and the W3C standard requires atomization of the base and case expressions.

    A switch expression

    Result (run with Zorba):bar

    If it is not an atomic, an error is raised.

    A switch expression

    Result (run with Zorba):An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)

    If no value matches, the default is used.

    A switch expression

    Result (run with Zorba):none

    The case clauses support composability of expressions as well.

    A switch expression

    Result (run with Zorba):foo

    A switch expression

    Result (run with Zorba):1 + 1 is 2

    Try-catch expressions

    JSONiq follows the for try-catch expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    TryCatchExpr

    A try catch expression evaluates the expression inside the try block and returns its resulting value.

    However, if an error is raised dynamically, the catch clause is evaluated and its result value returned.

    A try catch expression

    Result (run with Zorba):division by zero!

    Only errors raised within the lexical scope of the try block are caught.

    A try catch expression

    Result (run with Zorba):An error was raised: division by zero

    Errors that are detected statically within the try block are still reported statically.

    A try catch expression

    Result (run with Zorba):syntax error

    FLWOR expressions

    JSONiq follows the for FLWOR expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    FLWORExpr

    FLWOR expressions are probably the most powerful JSONiq construct and correspond to SQL's SELECT-FROM-WHERE statements, but they are more general and more flexible. In particular, clauses can almost appear in any order (apart that it must begin with a for or let clause, and end with a return clause).

    Here is a bit of theory on how it works.

    A clause binds values to some variables according to its own semantics, possibly several times. Each time, a tuple of variable bindings (mapping variable names to sequences) is passed on to the next clause.

    This goes all the way down, until the return clause. The return clause is eventually evaluated for each tuple of variable bindings, resulting in a sequence of items for each tuple.

    These sequences of items are concatenated, in the order of the incoming tuples, and the obtained sequence is returned by the FLWOR expression.

    We are now giving practical examples with a hint on how it maps to SQL.

    For clauses

    JSONiq follows the for for clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    ForClause

    For clauses allow iteration on a sequence.

    For each incoming tuple, the expression in the for clause is evaluated to a sequence. Each item in this sequence is in turn bound to the for variable. A tuple is hence produced for each incoming tuple, and for each item in the sequence produced by the for clause for this tuple.

    The order in which items are bound by the for clause can be relaxed with unordered expressions, as described later in this section.

    The following query, using a for and a return clause, is the counterpart of SQL's "SELECT name FROM captains". $x is bound in turn to each item in the captains collection.

    A for clause.

    Result (run with Zorba):James T. Kirk Jean-Luc Picard Benjamin Sisko Kathryn Janeway Jonathan Archer Samantha Carter

    For clause expressions are composable, there can be several of them.

    Two for clauses.

    Result (run with Zorba):11 12 13 21 22 23 31 32 33

    A for clause.

    Result (run with Zorba):11 12 13 21 22 23 31 32 33

    A for variable is visible to subsequence bindings.

    A for clause.

    Result (run with Zorba):1 2 3 4 5 6 7 8 9

    A for clause.

    Result (run with Zorba):{ "captain" : "James T. Kirk", "series" : "The original series" } { "captain" : "Jean-Luc Picard", "series" : "The next generation" } { "captain" : "Benjamin Sisko", "series" : "The next generation" } { "captain" : "Benjamin Sisko", "series" : "Deep Space 9" } { "captain" : "Kathryn Janeway", "series" : "The next generation" } { "captain" : "Kathryn Janeway", "series" : "Voyager" } { "captain" : "Jonathan Archer", "series" : "Entreprise" } { "captain" : null, "series" : "Voyager" }

    It is also possible to bind the position of the current item in the sequence to a variable.

    A for clause.

    Result (run with Zorba):{ "captain" : "James T. Kirk", "id" : 1 } { "captain" : "Jean-Luc Picard", "id" : 2 } { "captain" : "Benjamin Sisko", "id" : 3 } { "captain" : "Kathryn Janeway", "id" : 4 } { "captain" : "Jonathan Archer", "id" : 5 } { "captain" : null, "id" : 6 } { "captain" : "Samantha Carter", "id" : 7 }

    JSONiq supports joins. For example, the counterpart of "SELECT c.name AS captain, m.name AS movie FROM captains c JOIN movies m ON c.name = m.name" is:

    A join

    Result (run with Zorba):{ "captain" : "James T. Kirk", "movie" : "The Motion Picture" } { "captain" : "James T. Kirk", "movie" : "The Wrath of Kahn" } { "captain" : "James T. Kirk", "movie" : "The Search for Spock" } { "captain" : "James T. Kirk", "movie" : "The Voyage Home" } { "captain" : "James T. Kirk", "movie" : "The Final Frontier" } { "captain" : "James T. Kirk", "movie" : "The Undiscovered Country" } { "captain" : "Jean-Luc Picard", "movie" : "First Contact" } { "captain" : "Jean-Luc Picard", "movie" : "Insurrection" } { "captain" : "Jean-Luc Picard", "movie" : "Nemesis" }

    Note how JSONiq handles semi-structured data in a flexible way.

    Outer joins are also possible with "allowing empty", i.e., output will also be produced if there is no matching movie for a captain. The following query is the counterpart of "SELECT c.name AS captain, m.name AS movie FROM captains c LEFT JOIN movies m ON c.name = m.captain".

    A join

    Result (run with Zorba):{ "captain" : "James T. Kirk", "movie" : "The Motion Picture" } { "captain" : "James T. Kirk", "movie" : "The Wrath of Kahn" } { "captain" : "James T. Kirk", "movie" : "The Search for Spock" } { "captain" : "James T. Kirk", "movie" : "The Voyage Home" } { "captain" : "James T. Kirk", "movie" : "The Final Frontier" } { "captain" : "James T. Kirk", "movie" : "The Undiscovered Country" } { "captain" : "Jean-Luc Picard", "movie" : "First Contact" } { "captain" : "Jean-Luc Picard", "movie" : "Insurrection" } { "captain" : "Jean-Luc Picard", "movie" : "Nemesis" } { "captain" : "Benjamin Sisko", "movie" : null } { "captain" : "Kathryn Janeway", "movie" : null } { "captain" : "Jonathan Archer", "movie" : null } { "captain" : null, "movie" : null } { "captain" : "Samantha Carter", "movie" : null }

    Where clauses

    JSONiq follows the for where clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    WhereClause

    Where clauses are used for filtering (selection operator in the relational algebra).

    For each incoming tuple, the expression in the where clause is evaluated to a boolean (possibly converting an atomic to a boolean). if this boolean is true, the tuple is forwarded to the next clause, otherwise it is dropped.

    The following query corresponds to "SELECT series FROM captains WHERE name = 'Kathryn Janeway'".

    A where clause.

    Result (run with Zorba):[ "The next generation", "Voyager" ]

    Order clauses

    JSONiq follows the for order by clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    OrderByClause

    Order clauses are for reordering tuples.

    For each incoming tuple, the expression in the where clause is evaluated to an atomic. The tuples are then sorted based on the atomics they are associated with, and then forwarded to the next clause.

    Like for ordering comparisons, null values are always considered the smallest.

    The following query is the counterpart of SQL's "SELECT * FROM captains ORDER BY name".

    An order by clause.

    Result (run with Zorba):{ "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }

    Multiple sorting criteria can be given - they are treated like a lexicographic order (most important criterium first).

    An order by clause.

    Result (run with Zorba):{ "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 }

    It can be specified whether the order is ascending or descending. Empty sequences are allowed and it can be chosen whether to put them first or last.

    An order by clause.

    Result (run with Zorba):{ "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 }

    An error is raised if the expression does not evaluate to an atomic or the empty sequence.

    An order by clause.

    Result (run with Zorba):An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)

    Collations can be used to give a specific way of how strings are to be ordered. A collation is identified by a URI.

    Use of a collation in an order by clause.

    Result (run with Zorba):Benjamin Sisko James T. Kirk Jean-Luc Picard Jonathan Archer Kathryn Janeway Samantha Carter

    Group clauses

    JSONiq follows the for group by clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    GroupByClause

    Grouping is also supported, like in SQL.

    For each incoming tuple, the expression in the group clause is evaluated to an atomic (a grouping key). The incoming tuples are then grouped according to the key they are associated with.

    For each group, a tuple is output, with a binding from the grouping variable to the key of the group.

    A group by clause.

    Result (run with Zorba):{ "century" : 21 } { "century" : 22 } { "century" : 23 } { "century" : 24 }

    As for the other (non-grouping) variables, their values within one group are all concatenated, keeping the same name. Aggregations can be done on these variables.

    The following query is equivalent to "SELECT century, COUNT(*) FROM captains GROUP BY century".

    A group by clause.

    Result (run with Zorba):{ "century" : 21, "count" : 1 } { "century" : 22, "count" : 1 } { "century" : 23, "count" : 1 } { "century" : 24, "count" : 4 }

    JSONiq's group by is more flexible than SQL and is fully composable.

    A group by clause.

    Result (run with Zorba):{ "century" : 21, "captains" : [ "Samantha Carter" ] } { "century" : 22, "captains" : [ "Jonathan Archer" ] } { "century" : 23, "captains" : [ "James T. Kirk" ] } { "century" : 24, "captains" : [ "Jean-Luc Picard", "Benjamin Sisko", "Kathryn Janeway" ] }

    Unlike SQL, JSONiq does not need a having clause, because a where clause works perfectly after grouping as well.

    The following query is the counterpart of "SELECT century, COUNT(*) FROM captains GROUP BY century HAVING COUNT(*) > 1"

    A group by clause.

    Result (run with Zorba):{ "century" : 24, "count" : 4 }

    Let clauses

    JSONiq follows the for let clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    LetClause

    Let bindings can be used to define aliases for any sequence, for convenience.

    For each incoming tuple, the expression in the let clause is evaluated to a sequence. A binding is added from this sequence to the let variable in each tuple. A tuple is hence produced for each incoming tuple.

    A let clause.

    Result (run with Zorba):{ "century" : 24, "count" : 4 }

    Note that it is perfectly fine to reuse a variable name and hide a variable binding.

    A let clause.

    Result (run with Zorba):{ "century" : 24, "number of series" : 3 }

    Count clauses

    JSONiq follows the for count clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    CountClause

    For each incoming tuple, a binding from the position of this tuple in the tuple stream to the count variable is added. The new tuple is then forwarded to the next clause.

    A count clause.

    Result (run with Zorba):{ "id" : 1, "captain" : { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } } { "id" : 2, "captain" : { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } } { "id" : 3, "captain" : { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } } { "id" : 4, "captain" : { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } } { "id" : 5, "captain" : { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } } { "id" : 6, "captain" : { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } } { "id" : 7, "captain" : { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } }

    Map operator

    JSONiq follows the for the map operator, except that it changes the syntax for the context item to $$ instead of the . syntax.

    The following explanations, provided as an informal summary for convenience, are non-normative.

    SimpleMapExpr

    ContextItemExpr

    JSONiq provides a shortcut for a for-return construct, automatically binding each item in the left-hand-side sequence to the context item.

    A simple map

    Result (run with Zorba):2 4 6 8 10 12 14 16 18 20

    An equivalent query

    Result (run with Zorba):2 4 6 8 10 12 14 16 18 20

    Variable references

    JSONiq follows the for variable references, except that it disallows the character . in variable names, which is instead used for object lookup.

    Composing FLWOR expressions

    Like all other expressions, FLWOR expressions can be composed. In the following examples, a FLWOR is nested in a function call, nested in a FLWOR, nested in an array constructor:

    Nested FLWORs

    Result (run with Zorba):[ "James T. Kirk", "Jean-Luc Picard" ]

    Ordered and Unordered expressions

    JSONiq follows the for ordered and unordered expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    OrderedExpr

    UnorderedExpr

    By default, the order in which a for clause binds its items is important.

    This behaviour can be relaxed in order give the optimizer more leeway. An unordered expression relaxes ordering by for clauses within its operand scope:

    An unordered expression.

    Result (run with Zorba):{ "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }

    An ordered expression can be used to reactivate ordering behaviour in a subscope.

    An ordered expression.

    Result (run with Zorba):{ "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 }

    Expressions dealing with types

    This section describes JSONiq types as well as the sequence type syntax.

    Instance-of expressions

    JSONiq follows the for ordered and unordered expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    InstanceofExpr

    An instance expression can be used to tell whether a JSONiq value matches a given sequence type.

    Instance of expression

    Result (run with Zorba):true

    Instance of expression

    Result (run with Zorba):false

    Instance of expression

    Result (run with Zorba):true

    Instance of expression

    Result (run with Zorba):true

    Instance of expression

    Result (run with Zorba):true

    Instance of expression

    Result (run with Zorba):true

    Instance of expression

    Result (run with Zorba):true

    Treat expressions

    JSONiq follows the for ordered and unordered expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    TreatExpr

    A treat expression checks that a JSONiq value matches a given sequence type. If it is not the case, an error is raised.

    Treat as expression

    Result (run with Zorba):1

    Treat as expression

    Result (run with Zorba):An error was raised: "xs:integer" cannot be treated as type xs:string

    Treat as expression

    Result (run with Zorba):foo

    Treat as expression

    Result (run with Zorba):{ "foo" : "bar" }

    Treat as expression

    Result (run with Zorba):{ "foo" : "bar" } { "bar" : "foo" }

    Treat as expression

    Result (run with Zorba):[ 1, 2, 3 ]

    Treat as expression

    Result (run with Zorba):

    Castable expressions

    JSONiq follows the for ordered and unordered expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    CastableExpr

    A castable expression checks whether a JSONiq value can be cast to a given atomic type and returns true or false accordingly. It can be used before actually casting to that type.

    Castable as expression

    Result (run with Zorba):true

    Castable as expression

    Result (run with Zorba):false

    Castable as expression

    Result (run with Zorba):true

    Castable as expression

    Result (run with Zorba):false

    Castable as expression

    Result (run with Zorba):false

    The question mark allows for an empty sequence.

    Castable as expression

    Result (run with Zorba):true

    Cast expressions

    JSONiq follows the for ordered and unordered expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    CastExpr

    A cast expression casts a JSONiq value to a given atomic type. The resulting value is annotated with this type.

    Cast as expression

    Result (run with Zorba):1

    Cast as expression

    Result (run with Zorba):An error was raised: "foo": value of type xs:string is not castable to type xs:integer

    Cast as expression

    Result (run with Zorba):2013-04-02

    Cast as expression

    Result (run with Zorba):An error was raised: empty sequence can not be cast to type with quantifier '1'

    Cast as expression

    Result (run with Zorba):An error was raised: sequence of more than one item can not be cast to type with quantifier '1' or '?'

    The question mark allows for an empty sequence.

    Cast as expression

    Result (run with Zorba):

    Cast as expression

    Result (run with Zorba):2013-04-02

    Typeswitch expressions

    JSONiq follows the for ordered and unordered expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    TypeswitchExpr

    CaseClause

    A typeswitch expressions tests if the value resulting from the first operand matches a given list of types. The expression corresponding to the first matching case is finally evaluated. If there is no match, the expression in the default clause is evaluated.

    Typeswitch expression

    Result (run with Zorba):string

    In each clause, it is possible to bind the value of the first operand to a variable.

    Typeswitch expression

    Result (run with Zorba):foofoo

    The vertical bar can be used to allow several types in the same case clause.

    Typeswitch expression

    Result (run with Zorba):{ "integer or string" : "foo" }

    An operand singleton sequence whose first item is an object or array is converted to true.

  • Other operand sequences cannot be converted and an error is raised.

  • Anonymous functions: they are defined on the fly, by inline function expressions or partial evaluation.
    W3C standard
    JSON
    W3C standard
    W3C standard
    named function references
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    Prologs
    Modules
    W3C standard
    W3C specification
    W3C standard
    Function items
    W3C specification
    W3C standard
    W3C specification
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    W3C standard
    
      [ 1 to 10 ]
        
    
      { "foo" || "bar" : true }
          
    
      { [ 1, 2 ] : true }
          
    
      { "foo" : 1 + 1 }
          
    
      { "foo" : (), "bar" : (1, 2) }
          
    
      { "foo" ?: (), "bar" : (1, 2) }
          
    
      {| { "foo" : "bar" }, { "bar" : "foo" } |}
          
    
      {| 1 |}
          
    
    42
          
    
    3.14
          
    
    +6.022E23
          
    
      "foo"
            
    
      "This is a line\nand this is a new line"
            
    
      "\u0001"
            
    
      "This is a nested \"quote\""
            
    
    true
          
    
    false
          
    
    null
          
    
    {}
          
    
    { "foo" : "bar" }
          
    
    { "foo" : [ 1, 2, 3, 4, 5, 6 ] }
          
    
    { "foo" : true, "bar" : false }
          
    
    { "this is a key" : { "value" : "a value" } }
          
    
    { foo : "bar" }
          
    
    { foo : [ 1, 2, 3, 4, 5, 6 ] }
          
    
    { foo : "bar", bar : "foo" }
          
    
    { "but you need the quotes here" : null }
        
    
    {|
      for $i in 1 to 3
      return { "foo" || $i : $i }
    |}
        
    
    []
          
    
    [ 1, 2, 3, 4, 5, 6 ]
          
    
    [ "foo", 3.14, [ "Go", "Boldly", "When", "No", "Man", "Has", "Gone", "Before" ], { "foo" : "bar" }, true, false, null ]
          
    
               function ($x as integer, $y as integer) as integer { $x + 2 },
               function ($x) { $x + 2 }
           
    
               declare function local:sum($x as integer, $y as integer) as integer
               {
                 $x + 2
               };
               local:sum#2
           
    
    1 * ( 2 + 3 ) + 7 idiv 2 - (-8) mod 2
          
    
    date("2013-05-01") - date("2013-04-02")
          
    
    (1, 2) + 3
          
    
    1 + null
          
    
    () + 2
          
    
    "Captain" || " " || "Kirk"
          
    
    "Captain" || () || "Kirk"
          
    
    1 eq null, "foo" ne null, null eq null
          
    
    1 lt null
          
    
    1 + 1 eq 2, 1 lt 2
          
    
    "foo" eq 1
          
    
    () eq 1
          
    
    true and ( true or not true )
          
    
    1 + 1 eq 2 or 1 + 1 eq 3
          
    
    boolean(())
          
    
    boolean(null)
          
    
    boolean("foo"), boolean("")
          
    
    0 and true, not (not 1e42)
          
    
    { "foo" : "bar" } or false
          
    
    ( 1, 2, 3 ) or false
          
    
    true or (1 div 0)
          
    
    every $i in 1 to 10 satisfies $i gt 0
          
    
    some $i in -5 to 5, $j in 1 to 10 satisfies $i eq $j
          
    
    some $i as integer in -5 to 5, $j as integer in 1 to 10 satisfies $i eq $j
          
    
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10
      
    
    { "foo" : "bar" }, [ 1 ]
      
    
    1 to 10
      
    
    () to 10, 1 to ()
      
    
    (1, 2) to 10
      
    
    ()
          
    
           keys({ "foo" : "bar", "bar" : "foo" })
         
    
           concat("foo", "bar")
         
    
           sum({ "foo" : "bar" })
         
    
           let $f := function($x) { $x + 1 }
           return $f(2)
         
    
           let $f := function($x as integer) as integer { $x + 1 }
           return $f(2)
         
    
           let $f := function($x as integer, $y as integer) as integer { $x + $y }
           let $g := $f(?, 2)
           return $g(2)
         
    
    { "foo" : "bar" }.foo
          
    
    collection("one-object").foo
          
    
    ({ "foo" : "bar" }, { "foo" : "bar2" }, { "bar" : "foo" }).foo
            
    
    collection("captains").name
          
    
    ({ "foo" : "bar1" }, [ "foo", "bar" ], { "foo" : "bar2" }, "foo").foo
          
    
    { "foo bar" : "bar" }."foo bar"
          
    
    { "foobar" : "bar" }.("foo" || "bar")
          
    
    { "foobar" : "bar" }.("foo", "bar")
          
    
    { "1" : "bar" }.(1)
          
    
    let $field := "foo" || "bar"
    return { "foobar" : "bar" }.$field
          
    
    [ "foo", "bar" ] [[2]]
          
    
    { field : [ "one",  { "foo" : "bar" } ] }.field[[2]].foo
          
    
    ([ 1, 2, 3 ], [ 4, 5, 6 ])[[2]]
            
    
    collection("captains").series[[1]]
          
    
    ([ 1, 2, 3 ], [ 4, 5, 6 ], { "foo" : "bar" }, true)[[3]]
          
    
    [ "foo", "bar" ] [[ 1 + 1 ]]
          
    
    [ "foo", "bar" ][]
          
    
    ([ "foo", "bar" ], { "foo" : "bar" }, true, [ 1, 2, 3 ] )[]
          
    
    (1 to 10)[2]
          
    
    (1 to 10)[$$ mod 2 eq 0]
          
    
    if (1 + 1 eq 2) then { "foo" : "yes" } else { "foo" : "false" }
          
    
    if (null) then { "foo" : "yes" } else { "foo" : "no" }
          
    
    if (1) then { "foo" : "yes" } else { "foo" : "no" }
          
    
    if (0) then { "foo" : "yes" } else { "foo" : "no" }
            
    
    if ("foo") then { "foo" : "yes" } else { "foo" : "no" }
          
    
    if ("") then { "foo" : "yes" } else { "foo" : "no" }
            
    
    if (()) then { "foo" : "yes" } else { "foo" : "no" }
            
    
    if (({ "foo" : "bar" }, [ 1, 2, 3, 4])) then { "foo" : "yes" } else { "foo" : "no" }
            
    
    if (1+1 eq 2) then { "foo" : "yes" } else ()
            
    
    switch ("foo")
    case "bar" return "foo"
    case "foo" return "bar"
    default return "none"
            
    
    switch ({ "foo" : "bar" })
    case "bar" return "foo"
    case "foo" return "bar"
    default return "none"
            
    
    switch ("no-match")
    case "bar" return "foo"
    case "foo" return "bar"
    default return "none"
            
    
    switch (2)
    case 1 + 1 return "foo"
    case 2 + 2 return "bar"
    default return "none"
            
    
    switch (true)
    case 1 + 1 eq 2 return "1 + 1 is 2"
    case 2 + 2 eq 5 return "2 + 2 is 5"
    default return "none of the above is true"
            
    
    try { 1 div 0 } catch * { "division by zero!" } 
          
    
    let $x := 1 div 0
    return try { $x }
           catch * { "division by zero!" } 
          
    
    try { x } catch * { "syntax error" } 
          
    
    for $x in collection("captains")
    return $x.name
          
    
    for $x in ( 1, 2, 3 )
    for $y in ( 1, 2, 3 )
    return 10 * $x + $y
          
    
    for $x in ( 1, 2, 3 ), $y in ( 1, 2, 3 )
    return 10 * $x + $y
          
    
    for $x in ( [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] ), $y in $x[]
    return $y
          
    
    for $x in collection("captains"), $y in $x.series[]
    return { "captain" : $x.name, "series" : $y }
          
    
    for $x at $position in collection("captains")
    return { "captain" : $x.name, "id" : $position }
            
    
    for $captain in collection("captains"), $movie in collection("movies")[ try { $$.captain eq $captain.name } catch * { false } ]
    return { "captain" : $captain.name, "movie" : $movie.name }
            
    
    for $captain in collection("captains"), $movie allowing empty in collection("movies")[ try { $$.captain eq $captain.name } catch * { false } ]
    return { "captain" : $captain.name, "movie" : $movie.name }
            
    
    for $x in collection("captains")
    where $x.name eq "Kathryn Janeway"
    return $x.series
          
    
    for $x in collection("captains")
    order by $x.name
    return $x
          
    
    for $x in collection("captains")
    order by size($x.series), $x.name
    return $x
          
    
    for $x in collection("captains")
    order by $x.name descending empty greatest
    return $x
          
    
    for $x in collection("captains")
    order by $x
    return $x.name
          
    
    for $x in collection("captains")
    order by $x.name collation "http://www.w3.org/2005/xpath-functions/collation/codepoint"
    return $x.name
          
    
    for $x in collection("captains")
    group by $century := $x.century
    return { "century" : $century  }
          
    
    for $x in collection("captains")
    group by $century := $x.century
    return { "century" : $century, "count" : count($x) }
          
    
    for $x in collection("captains")
    group by $century := $x.century
    return { "century" : $century, "captains" : [ $x.name ] }
          
    
    for $x in collection("captains")
    group by $century := $x.century
    where count($x) gt 1
    return { "century" : $century, "count" : count($x) }
          
    
    for $x in collection("captains")
    let $century := $x.century
    group by $century
    let $number := count($x)
    where $number gt 1
    return { "century" : $century, "count" : $number }
          
    
    for $x in collection("captains")
    let $century := $x.century
    group by $century
    let $number := count($x)
    let $number := count(distinct-values(for $series in $x.series
                                         return typeswitch($series)
                                                case array return $series()
                                                default return $series ))
    where $number gt 1
    return { "century" : $century, "number of series" : $number }
          
    
    for $x in collection("captains")
    order by $x.name
    count $c
    return { "id" : $c, "captain" : $x }
          
    
    (1 to 10) ! ($$ * 2)
          
    
    for $i in 1 to 10
    return $i * 2
          
    
            [
              for $c in collection("captains")
              where exists(for $m in collection("movies")
                           where some $moviecaptain in let $captain := $m.captain
                                                       return typeswitch ($captain)
                                                              case array return $captain()
                                                              default return $captain
                                 satisfies
                                 $moviecaptain eq $c.name
                           return $m)
              return $c.name
            ]
          
    
    unordered {
      for $captain in collection("captains")
      where $captain.century eq 24
      return $captain
    }
          
    
    unordered {
      for $captain in collection("captains")
      where ordered { exists(for $movie at $i in collection("movies")
                             where $i eq 5
                             where $movie.captain eq $captain.name
                             return $movie) }
      return $captain
    }
          
    
    1 instance of integer
          
    
    1 instance of string
          
    
    "foo" instance of string
          
    
    { "foo" : "bar" } instance of object
          
    
    ({ "foo" : "bar" }, { "bar" : "foo" }) instance of json-item+
          
    
    [ 1, 2, 3 ] instance of array?
          
    
    () instance of ()
          
    
    1 treat as integer
          
    
    1 treat as string
          
    
    "foo" treat as string
          
    
    { "foo" : "bar" } treat as object
          
    
    ({ "foo" : "bar" }, { "bar" : "foo" }) treat as json-item+
          
    
    [ 1, 2, 3 ] treat as array?
          
    
    () treat as ()
          
    
    "1" castable as integer
          
    
    "foo" castable as integer
          
    
    "2013-04-02" castable as date
          
    
    () castable as date
          
    
    ("2013-04-02", "2013-04-03") castable as date
          
    
    () castable as date?
          
    
    "1" cast as integer
          
    
    "foo" cast as integer
          
    
    "2013-04-02" cast as date
          
    
    () cast as date
          
    
    ("2013-04-02", "2013-04-03") cast as date
          
    
    () cast as date?
          
    
    "2013-04-02" cast as date?
          
    
    typeswitch("foo")
    case integer return "integer"
    case string return "string"
    case object return "object"
    default return "other"
          
    
    typeswitch("foo")
    case $i as integer return $i + 1
    case $s as string return $s || "foo"
    case $o as object return [ $o ]
    default $d return $d
          
    
    typeswitch("foo")
    case $a as integer | string return { "integer or string" : $a }
    case $o as object return [ $o ]
    default $d return $d