
On the online sandbox

If you want to start writing queries right away, there is a public sandbox here that works out of the box and guides you. You only need a Google account to execute queries, as the sandbox exposes our Jupyter notebook via the Colab environment. You are also free to download this notebook and use it with any other provider, or even your own local Jupyter installation; it will work just the same, because the queries are all shipped to our own small public backend in every case. However, this may require a bit of configuration (JAVA_HOME pointing to Java 17 or 21, and, if you have conflicting Spark installations in addition to pyspark, SPARK_HOME pointing to a Spark 4.0 installation).

If you do not have a Google account, you can also use our simpler sandbox page (without Jupyter) here, where you can type small queries and see the results.

With the sandboxes above, you can only inline your data in the query or access a dataset with an HTTP URL.

Once you want to take it to the next level and query your own data on your laptop, you will find instructions below for using RumbleDB on your own computer, which, among other things, allows you to query files stored on your local disk. And then you can take a leap of faith and use RumbleDB on a large cluster (Amazon EMR, your company's cluster, etc.).

With homebrew

It is also possible to use RumbleDB with brew; note, however, that there is currently no way to adjust memory usage with this method. To install RumbleDB with brew, type the commands:

brew tap rumbledb/rumble
brew install --build-from-source rumble

You can test that it works with:

rumbledb run -q '1+1'

Then, launch a JSONiq shell with:

rumbledb repl

The RumbleDB shell appears:

    ____                  __    __     ____  ____ 
   / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
  / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  


Master: local[*]
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -

rumble$

You can now start typing simple queries like the following few examples. Press the return key three times to execute a query.

"Hello, World"

or

 1 + 1

or

 (3 * 4) div 5

Type mapping

Any expression in JSONiq returns a sequence of items. Any variable in JSONiq is bound to a sequence of items. Items can be objects, arrays, or atomic values (strings, integers, booleans, nulls, dates, binaries, durations, doubles, decimal numbers, etc.). A sequence can consist of just one item, but it can also be empty, or large enough to contain millions, billions, or even trillions of items. Obviously, for sequences longer than a billion items, it is a better idea to use a cluster than a laptop. A relational table (or, more generally, a data frame) corresponds to a sequence of object items sharing the same schema. However, sequences of items are more general than tables or data frames and support heterogeneity seamlessly.

When passing Python values to JSONiq or getting them back from a JSONiq query, the mapping to and from Python is as follows:

Python    JSONiq
list      array item
str       string item
int       integer item
bool      boolean item
None      null item
tuple     sequence of items
dict      object item

Furthermore, other JSONiq types will be mapped to string literals. Users who want to preserve JSONiq types can use the Item API instead.

JSONiq is very powerful and expressive. You will find tutorials as well as a reference on JSONiq.org.
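To make the mapping concrete, here is a minimal sketch (assuming the jsoniq pip package is installed and a suitable Java version is configured, as described below):

from jsoniq import RumbleSession

# Start (or reuse) a RumbleDB session.
rumble = RumbleSession.builder.getOrCreate()

# A Python tuple is bound as a JSONiq sequence of items; dicts become
# objects, lists become arrays, and None becomes null.
seq = rumble.jsoniq('$x', x=({"a": [1, 2]}, True, None))

# json() converts the results back: objects to dicts, arrays to lists,
# null to None.
print(seq.json())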

As a pip package

You can use RumbleDB from within Python programs by running

pip install jsoniq

Java version

Important note: since the jsoniq package depends on pyspark 4, Java 17 or Java 21 is required. If another version of Java is installed, any Python program attempting to create a RumbleSession will fail with an error message on stderr that contains explanations.

You can check your Java version with:

java -version

Information about how this package is used can be found in this section.


Through the Java API with Maven

RumbleDB can also be used as a Maven dependency. You can find it here.

The JavaDoc documentation is accessible here.

JSONiq 1.0

JSONiq 1.0 is the first version of the JSONiq language, currently in use.

It is a cousin of the XQuery 3.0 language and was developed by members of the W3C XML Query Working Group as a proposal for integrating JSON support into the language, while making it appealing to the JSON community and easy for existing XQuery engines to implement.



Writing JSONiq queries in Python

You can use RumbleDB from within Python programs by running

pip install jsoniq

Java version

Important note: since the jsoniq package depends on pyspark 4, Java 17 or Java 21 is required. If another version of Java is installed, any Python program attempting to create a RumbleSession will fail with an error message on stderr that contains explanations.

You can check your Java version with:

java -version

Information about how this package is used can be found in this section.

Common issue: colliding Spark version

Some advanced users who have already configured a Spark installation on their machine may encounter a version issue if SPARK_HOME points to this alternate installation and it is a different version of Spark (e.g., 3.5 or 3.4). The jsoniq package requires Spark 4.0.

If this happens, RumbleDB should output an informative error message. There are two ways to fix such conflicts:

  • The easiest is to remove the SPARK_HOME environment variable completely. This makes RumbleDB fall back to the Spark 4.0 installation that ships with its pyspark dependency.

  • Alternatively, you can change the value of SPARK_HOME to point to a Spark 4.0 installation, if you have one. This is for more advanced users who know what they are doing.

If you have another working Spark installation on your machine, you can see which version it is with

spark-submit --version

This command is of course expected not to work for first-time users who only installed the jsoniq package and never installed Spark separately on their machine.

High-level information on the library

A RumbleSession is a wrapper around a SparkSession that additionally makes sure the RumbleDB environment is in scope.

JSONiq queries are invoked with rumble.jsoniq() in a way similar to the way Spark SQL queries are invoked with spark.sql().

JSONiq variables can be bound to lists of JSON values (str, int, float, True, False, None, dict, list) or to Pyspark DataFrames. A JSONiq query can use as many variables as needed (for example, it can join between different collections).
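For instance, here is a minimal sketch (with made-up collections) of a query that joins two collections bound to two variables:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# Two small collections bound to two JSONiq variables.
people = ({"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"})
orders = ({"pid": 1, "item": "book"}, {"pid": 2, "item": "mug"})

# A single JSONiq query can use both variables and join between them.
seq = rumble.jsoniq("""
for $p in $people, $o in $orders
where $p.id eq $o.pid
return { "name" : $p.name, "item" : $o.item }
""", people=people, orders=orders)

print(seq.json())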

It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple builtin function calls such as json-lines(), text-file(), parquet-file(), csv-file(), etc.
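As a small illustration, here is a sketch that reads a JSON Lines file over HTTP, using the sample dataset referenced elsewhere in this documentation:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# json-lines() reads one JSON object per line; the file can live on the
# local drive, HTTP, S3, HDFS, etc.
seq = rumble.jsoniq("""
for $product in json-lines("http://rumbledb.org/samples/products-small.json")
where $product.quantity ge 995
return $product.product
""")

print(seq.json())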

The resulting sequence of items can be retrieved as a list of JSON values, as a Pyspark DataFrame or, for advanced users, as an RDD or with a streaming iteration over the items using the RumbleDB Item API.

It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc in a way similar to how DataFrames are written back by Pyspark.

The design goal is that it is possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
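A minimal sketch of such a chain, assuming the query output is structured enough for RumbleDB to infer a schema:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# Clean up messy input with JSONiq and get a structured DataFrame back.
seq = rumble.jsoniq("""
for $r in $raw
return { "name" : $r.name, "age" : ($r.age cast as integer) }
""", raw=({"name": "Alice", "age": "30"}, {"name": "Bob", "age": "25"}))

df = seq.df()  # available because the output has a schema

# Continue with Spark SQL on the same session.
df.createTempView("people")
rumble.sql("SELECT name FROM people WHERE age > 27").show()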

Any feedback or error reports are very welcome.

Ways to install and use

There are many ways to install and use RumbleDB. For example:

  • By simply using one of our online sandboxes (Jupyter notebook or simple sandbox page)

  • Our newest library: by installing a pip package (pip install jsoniq)

  • By running the standalone RumbleDB jar with Java on your laptop

  • By installing with homebrew

  • By installing Spark yourself on your laptop (for more control over Spark parameters) and using a small RumbleDB jar with spark-submit

  • By using our docker image on your laptop (go to the "Run with docker" section on the left menu)

  • By uploading the small RumbleDB jar to an existing Spark cluster (such as AWS EMR)

  • By running RumbleDB as an HTTP server in the background and connecting to it in a Jupyter notebook with the %%jsoniq magic.

  • By installing it manually on your machine.

Further steps

After installing RumbleDB, further steps could involve:

  • Learning JSONiq. More details can be found in the JSONiq section of this documentation, in the JSONiq specification, and in the tutorials.

  • Storing some data on S3, creating a Spark cluster on Amazon EMR (or Azure blob storage and Azure, etc), and querying the data with RumbleDB. More details are found in the cluster section of this documentation.

  • Using RumbleDB with Jupyter notebooks. For this, you can run RumbleDB as a server with a simple command, and get started by downloading the main JSONiq tutorial as a Jupyter notebook and just clicking your way through it. More details are found in the Jupyter notebook section of this documentation. Jupyter notebooks work both locally and on a cluster.

Command line (java -jar)

Java version (important)

You need to make sure that you have Java 11 or 17 and that, if you have several versions installed, JAVA_HOME correctly points to Java 11 or 17.

RumbleDB works with both Java 11 and Java 17. You can check the Java version that is configured on your machine with:

java -version

If you do not have Java, you can download version 11 or 17 from AdoptOpenJDK.

Do make sure it is not Java 8, which will not work.

In Jupyter notebooks

The Python edition of RumbleDB can be used to directly write JSONiq queries in Jupyter notebook cells. This is explained here. You first need to install the library as described here.


The JSONiq language

JSONiq is a query and processing language specifically designed for the popular JSON data model. The main ideas behind JSONiq are based on lessons learned in more than 30 years of relational query systems and more than 15 years of experience with designing and implementing query languages for semi-structured data like XML and RDF.

The main source of inspiration behind JSONiq is XQuery, which has so far proven to be a successful and productive query language for semi-structured data (in particular XML). JSONiq borrowed a large number of ideas from XQuery, like the structure and semantics of a FLWOR construct, the functional aspect of the language, the semantics of comparisons in the face of data heterogeneity, and the declarative, snapshot-based updates. However, unlike XQuery, JSONiq is not concerned with the peculiarities of XML, like mixed content, ordered children, the confusion between attributes and elements, the complexities of namespaces and QNames, or the complexities of XML Schema, and so on.

The power of XQuery's FLWOR construct and the functional aspect, combined with the simplicity of the JSON data model, result in a clean, sleek, and easy-to-understand data processing language. As a matter of fact, JSONiq is a language that can do more than queries: it can describe powerful data processing programs, from transformations, selections, and joins of heterogeneous data sets to data enrichment, information extraction, data cleaning, and so on.

Technically, the main characteristics of JSONiq (and XQuery) are the following:

  • It is a set-oriented language. While most programming languages are designed to manipulate one object at a time, JSONiq is designed to process sets (actually, sequences) of data objects.

  • It is a functional language. A JSONiq program is an expression; the result of the program is the result of the evaluation of the expression. Expressions have a fundamental role in the language: every language construct is an expression, and expressions are fully composable.

  • It is a declarative language. A program specifies what result is being calculated, and does not specify low-level algorithms: which sort algorithm is used, whether an algorithm is executed in main memory or externally, whether it runs on a single machine or is parallelized over several machines, or what access patterns (aka indexes) are used during the evaluation of the program. Such implementation decisions should be taken automatically by an optimizer, based on the physical characteristics of the data and of the hardware environment, just like a traditional database would do. The language has been designed from day one with optimizability in mind.

  • It is designed for nested, heterogeneous, semi-structured data. Data structures in JSON can be nested with arbitrary depth, do not have a specific type pattern (i.e., are heterogeneous), and may or may not have one or more schemas that describe the data. Even when there is a schema, it can be open and/or describe the data only partially. This is unlike SQL, which is designed to query tabular, flat, homogeneous structures. JSONiq has been designed from scratch as a query language for nested and heterogeneous data.

RumbleDB 2.0 "Lemon Ironwood"

RumbleDB is a querying engine that allows you to query your large, messy datasets with ease and productivity. It covers the entire data pipeline: clean up, structure, normalize, validate, convert to an efficient binary format, and feed it right into Machine Learning estimators and models, all within the JSONiq language.

RumbleDB supports JSON-like datasets including JSON, JSON Lines, Parquet, Avro, SVM, CSV, ROOT as well as text files, of any size from kB to at least the two-digit TB range (we have not found the limit yet).

RumbleDB is both good at handling small amounts of data on your laptop (in which case it simply runs locally and efficiently in a single thread) and large amounts of data, by spreading computations over your laptop's cores or onto a large cluster (in which case it leverages Spark automagically).

RumbleDB can also be used to easily and efficiently convert data from one format to another, including from JSON to Parquet thanks to JSound validation.

It runs on many local or distributed file systems such as HDFS, S3, Azure blob storage, and HTTP (read-only), and of course your local drive as well. You can use any of these file systems to store your datasets, but also to store and share your queries and functions as library modules with other users, worldwide or within your institution, who can import them with just one line of code. You can also output the results of your query or the log to these file systems (as long as you have write access). Write JSONiq code and share it on the Web: others can import it from HTTP in just one line from within their queries (no package publication or installation required), and you can even specify an HTTP URL as an input query to RumbleDB!

With RumbleDB, queries can be written in the tailor-made and expressive JSONiq language. Users can write their queries declaratively and start with just a few lines. No need for complex JSON parsing machinery as JSONiq supports the JSON data model natively.

The core of RumbleDB lies in JSONiq's FLWOR expressions, the semantics of which map beautifully to DataFrames and Spark SQL. Likewise, expression semantics are seamlessly translated to transformations on RDDs or DataFrames, depending on whether a structure is recognized or not. Transformations are not exposed as function calls; they are completely hidden behind JSONiq queries, giving the user the simplicity of an SQL-like language and the flexibility needed to query heterogeneous, tree-like data that does not fit in DataFrames.

This documentation provides you with instructions on how to get started, examples of data sets and queries that can be executed locally or on a cluster, links to JSONiq reference and tutorials, notes on the function library implemented so far, and instructions on how to compile RumbleDB from scratch.

Please note that this is a (maturing) beta version. We welcome bug reports in the GitHub issues section.

RumbleDB Reference

Applying updates

At the end of an updating program, the resulting PUL is applied with upd:applyUpdates (part of the XQuery Update Facility standard), which is extended as follows:

  • First, before applying any update, each update primitive (except the jupd:insert-into-object primitives, which do not have any target) locks onto its target by resolving the selector on the object or array it updates. If the selector is resolved to the empty sequence, the update primitive is ignored in step 2. After this operation, each of these update primitives will contain a reference to either the pair (for an object) or the value (for an array) on or relatively to which it operates.

  • Then each update primitive is applied, using the target references that were resolved at step 1. The order in which they are applied is not relevant and does not affect the final instance of the data model. After applying all updates, an error jerr:JNUP0006 is raised upon pair name collision within the same object.

Download RumbleDB

RumbleDB is just a download; no installation is required.

In order to run RumbleDB, you simply need to download rumbledb-2.0.0-standalone.jar from the download page and put it in a directory of your choice, for example right beside your data.

Make sure to use this jar name in lieu of rumbledb.jar in all our instructions.

You can test that it works with:

java -jar rumbledb-2.0.0-standalone.jar run -q '1+1'

or launch a JSONiq shell with:

java -jar rumbledb-2.0.0-standalone.jar repl

If you run out of memory, you can allocate more memory to Java with an additional Java parameter, e.g., -Xmx10g.

The RumbleDB shell appears:

    ____                  __    __     ____  ____ 
   / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
  / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  


Master: local[*]
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -

rumble$

You can now start typing simple queries like the following few examples. Press the return key three times to execute a query.

"Hello, World"

or

 1 + 1

or

 (3 * 4) div 5

Javadoc

If you plan to add the jar to your Java environment to use RumbleDB in your Java programs, the JavaDoc documentation can be found here. The entry point is the class org.rumbledb.api.Rumble.


Command line (with spark-submit and an existing Spark installation)

This method gives you more control over the Spark configuration than the experimental standalone jar; in particular, you can increase the memory used, change the number of cores, and so on.

If you use Linux, Florian Kellner has also kindly contributed an installation script that roughly takes care of what is described below for you.

Users of the Python edition (pip install jsoniq) should not have to install Spark manually, because the pip package automatically installs pyspark, which contains a Spark 4 installation. However, advanced users who have multiple Spark installations or encounter a Spark version conflict in Python may find the information below useful.

Install Spark (if you do not have it installed already)

RumbleDB requires an Apache Spark installation on Linux, Mac or Windows. Important note: it needs to be either Spark 4, or the Scala 2.13 build of Spark 3.5.

It is straightforward to directly download it, unpack it, and put it at a location of your choosing. We recommend picking Spark 4.0.0.

SPARK_HOME and PATH (you need to check even if you already have an existing installation)

You then need to point the SPARK_HOME environment variable to this directory, and to additionally add the subdirectory "bin" within the unpacked directory to the PATH variable. On macOS this is done by adding

export SPARK_HOME=/path/to/spark-4.0.0-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH

(with SPARK_HOME appropriately set to match your unzipped Spark directory) to the file .zshrc in your home directory, then making sure to force the change with

. ~/.zshrc

in the shell. In Windows, changing the PATH variable is done in the control panel. In Linux, it is similar to macOS.

Users of the Python edition who have additional Spark installations must ensure that SPARK_HOME and PATH point to a Spark 4 installation. The Python edition does not work with Spark 3.5.

As an alternative, users who love the command line can also install Spark with a package management system instead, such as brew (on macOS) or apt-get (on Ubuntu). However, these might be less predictable than a raw download.

You can test that Spark was correctly installed with:

spark-submit --version

Java version (important)

You need to make sure that you have Java 11 (for Spark 3.5), 17 (for Spark 3.5 or 4.0), or 21 (for Spark 4.0) and that, if you have several versions installed, JAVA_HOME points to the correct Java installation. Spark only supports Java 11, 17, or 21, depending on the Spark version.

Spark 4+ is documented to work with both Java 17 and Java 21. If there is an issue with the Java version, RumbleDB will inform you with an appropriate error message. You can check the Java version that is configured on your machine with:

java -version

Download the small version of the RumbleDB jar

Like Spark, RumbleDB is just a download and no installation is required.

In order to run RumbleDB, you simply need to download one of the small .jar files from the download page and put it in a directory of your choice, for example right beside your data.

If you use Spark 3.5, use rumbledb-2.0.0-for-spark-3.5-scala-2.13.jar.

If you use Spark 4.0, use rumbledb-2.0.0-for-spark-4.0.jar.

These jars do not embed Spark, since you chose to set it up separately. They will work with your Spark installation with the spark-submit command.

In all our instructions, replace rumbledb.jar with the actual name of the jar file you downloaded.

In a shell, from the directory where the RumbleDB .jar lies, type, all on one line:

spark-submit rumbledb.jar repl

replacing rumbledb.jar with the actual name of the jar file you downloaded.

The RumbleDB shell appears:

    ____                  __    __     ____  ____ 
   / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
  / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  


Master: local[*]
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -

rumble$

You can now start typing simple queries like the following few examples. Press the return key three times to execute a query.

"Hello, World"

or

 1 + 1

or

 (3 * 4) div 5

As an HTTP server

Now that a pip package is available, using it may appeal more to some users than this older approach of running RumbleDB as a server (with the pip package, you can put your JSONiq queries in rumble.jsoniq() calls). We keep this documentation for any users interested in the server capabilities of RumbleDB.

Starting the HTTP server

RumbleDB can be run as an HTTP server that listens for queries. In order to do so, you can use the serve command with the --port (-p) parameter:

spark-submit rumbledb.jar serve -p 8001

This command will not return until you force it to (Ctrl+C on Linux and Mac). This is because the server has to run permanently to listen to incoming requests.

Most users will not have to do anything beyond running the above command. For most of them, the next step would be to open a Jupyter notebook that connects to this server automatically.

This HTTP server is built as a basic server for the single-user use case, i.e., the user runs their own RumbleDB server on their laptop or cluster and connects to it via their Jupyter notebook, one query at a time. Some of our users have more advanced needs, or have a larger user base, and typically prefer to implement their own HTTP server, launching RumbleDB queries either via the public RumbleDB Java API (like the basic HTTP server does -- so its code can serve as a demo of the Java API) or via the RumbleDB CLI.

Caution! Launching a server always has consequences for security, especially as RumbleDB can read from and write to your disk, so make sure you activate your firewall. In later versions, we may support authentication tokens.

Testing that it works (not necessary for most end users)

The HTTP server is not meant to be used directly by end users; instead, it makes it possible to integrate RumbleDB into other languages and environments, such as Python and Jupyter notebooks.

To test that the server is running, you can try the following address in your browser, assuming you have a query stored locally at /tmp/query.jq. All queries have to go to the /jsoniq path.

http://localhost:8001/jsoniq?query-path=/tmp/query.jq

The request returns a JSON object, and the resulting sequence of items is in the values array:

{ "values" : [ "foo", "bar" ] }

Almost all parameters from the command line are exposed as HTTP parameters.

A query can also be submitted in the request body:

curl -X POST --data '1+1' http://localhost:8001/jsoniq
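For illustration, the same request can be issued from Python; a minimal sketch, assuming the requests library is installed and the server runs on port 8001:

import requests

# POST the query text in the request body to the /jsoniq path.
response = requests.post("http://localhost:8001/jsoniq", data="1+1")

# The response is a JSON object; the result items are in the "values" array.
print(response.json()["values"])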

Use with Jupyter notebooks

With the HTTP server running, if you have installed Python and Jupyter notebooks (for example with the Anaconda data science package, which does all of it automatically), you can create a RumbleDB magic by just executing the following code in a cell:

!pip install rumbledb
%load_ext rumbledb
%env RUMBLEDB_SERVER=http://localhost:8001/jsoniq

Where, of course, you need to adapt the port (8001) to the one you picked previously.

Then, you can execute queries in subsequent cells with:

%jsoniq 1 + 1

or on multiple lines:

%%jsoniq
for $doc in json-lines("my-file")
where $doc.foo eq "bar"
return $doc

Use with clusters

You can also let RumbleDB run as an HTTP server on the master node of a cluster, e.g. on Amazon EMR or Azure. You just need to:

  • Create the cluster (it is usually just the push of a few buttons in Amazon or Azure)

  • Wait for a few minutes

  • Make sure that your own IP has incoming access to EMR machines by configuring the security group properly. You usually only need to do so the first time you set up a cluster (if your IP address remains the same), because the security group configuration will be reused for future EMR clusters.

Then there are two options:

With SSH tunneling

  • Connect to the master with SSH with an extra parameter for securely tunneling the HTTP connection (for example -L 8001:localhost:8001 or any port of your choosing)

  • Download the RumbleDB jar to the master node

    wget https://github.com/RumbleDB/rumble/releases/download/v1.24.0/rumbledb-1.24.0.jar

  • Launch the HTTP server on the master node (it will be accessible under http://localhost:8001/jsoniq):

    spark-submit rumbledb-1.24.0.jar serve -p 8001

  • And then use Jupyter notebooks in the same way you would do it locally (it magically works because of the tunneling)

With the EC2 hostname

There is also another way that does not need any tunnelling: you can specify the hostname of your EC2 machine (copied over from the EC2 dashboard) with the --host parameter. For example, with the placeholder <ec2-hostname>:

spark-submit rumbledb.jar serve -p 8001 -h <ec2-hostname>

You also need to make sure in your EMR security group that the chosen port (e.g., 8001) is accessible from the machine in which you run your Jupyter notebook. Then, you can point your Jupyter notebook on this machine to http://<ec2-hostname>:8001/jsoniq.

Be careful not to open this port to the whole world, as queries can be sent that read and write to the EC2 machine and anything it has access to (like S3).

Ways to get and process the output of a JSONiq query

There are several ways to get back the output of the JSONiq query. There are many examples of use further down this page.

For each method below, the requirement is the string that must appear in the list returned by availableOutputs(), and the scale indicates how large the resulting sequence may be.

availableOutputs()
  Returns a list that helps you understand which output methods you can call. The strings in this list can be Local, RDD, DataFrame, or PUL.
  Requirement: - | Scale: -

json()
  Returns the results as a tuple containing dicts, lists, strs, ints, floats, booleans, and Nones.
  Requirement: Local | Scale: sequence length below the materialization cap (the default is 200, but it can be increased in the RumbleDB configuration).

df()
  Returns the results as a pyspark data frame.
  Requirement: DataFrame (i.e., RumbleDB was able to infer an output schema) | Scale: no limitation, but beyond a billion items, you should use a Spark cluster.

pdf()
  Returns the results as a pandas data frame.
  Requirement: DataFrame (i.e., RumbleDB was able to infer an output schema) | Scale: should fit in your computer's memory.

rdd()
  Returns the results as an RDD containing dicts, lists, strs, ints, floats, booleans, and Nones (experimental).
  Requirement: RDD | Scale: no limitation, but beyond a billion items, you should use a Spark cluster.

items()
  Returns the results as a list containing Java Item objects that can be queried with the RumbleDB Item API. These contain more information and more accurate typing.
  Requirement: Local | Scale: sequence length below the materialization cap (the default is 200, but it can be increased in the RumbleDB configuration).

open(), hasNext(), nextJSON(), close()
  Allows streaming (with no limitation of length) through individual items as dicts, lists, strs, ints, floats, booleans, and Nones.
  Requirement: Local | Scale: no limitation, as long as you go through the stream without saving all past items.

open(), hasNext(), next(), close()
  Allows streaming (with no limitation of length) through individual items as Java Item objects that can be queried with the RumbleDB Item API. These contain more information and more accurate typing.
  Requirement: Local | Scale: no limitation, as long as you go through the stream without saving all past items.

applyUpdates()
  Persists the Pending Update List produced by the query (to the Delta Lake or a table registered in the Hive metastore).
  Requirement: PUL | Scale: -
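A minimal sketch of how these methods are typically combined (the query is a made-up example):

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()
seq = rumble.jsoniq('for $i in 1 to 3 return { "n" : $i }')

# Check which output methods are available for this result.
outputs = seq.availableOutputs()

if "DataFrame" in outputs:
    seq.df().show()    # structured output as a pyspark DataFrame
if "Local" in outputs:
    print(seq.json())  # materialized output as native Python values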

Interacting with pandas DataFrames

RumbleDB can work out of the box with pandas DataFrames, both as input and (when the output has a schema) as output.

Binding JSONiq variables to pandas DataFrames

bind() also accepts pandas dataframes:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [30, 25, 35]}
pdf = pd.DataFrame(data)

rumble.bind('$a', pdf)
seq = rumble.jsoniq('$a.Name')

The same goes for extra named parameters.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [30, 25, 35]}
pdf = pd.DataFrame(data)

seq = rumble.jsoniq('$a.Name', a=pdf)

Getting the results as a pandas DataFrame

It is also possible to get the results back as a pandas dataframe with pdf() (if the output has a schema, which you can check by calling availableOutputs() and seeing whether "DataFrame" is in the returned list):

print(seq.pdf())

Installing from source (for the adventurous)

We show here how to install RumbleDB from the GitHub repository and build it yourself if you wish to do so (for example, to use the latest master). However, the easiest way to use RumbleDB is to simply download the already compiled .jar files.

Requirements

The following software is required:

  • Java SE: the version of Java is important, as RumbleDB only works with Java 11 (standalone or Spark 3.5), 17 (standalone, Spark 3.5, Spark 4, or Python), or 21 (Spark 4 or Python). The current master branch corresponds to Spark 4.0, meaning that Java 17 or 21 is required.

  • Spark, version 4.0.0 (for example)

  • Ant, version 1.10

  • Maven 3.9.9

Checking the requirements

Type the following commands to check that the necessary commands are available. If not, you may need to either install the software or make sure that it is on the PATH.

$ java -version

$ mvn --version

$ ant -version

$ spark-submit --version

Checkout

You first need to download the RumbleDB code to your local machine.

In the shell, go to the desired location:

$ cd some_directory

Clone the github repository:

$ git clone https://github.com/RumbleDB/rumble.git

Go to the root of this repository:

$ cd rumble

Compile

You can compile the entire project like so:

$ mvn clean compile assembly:single

After successful completion, you can check the target directory, which should contain the compiled classes as well as the JAR file rumbledb-2.0.0-jar-with-dependencies.jar.

Running locally

The most straightforward way to test whether the above steps were successful is to run the RumbleDB shell locally, like so:

$ spark-submit target/rumbledb-2.0.0-with-dependencies.jar repl

The RumbleDB shell should start:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

    ____                  __    __     ____  ____ 
   / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
  / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  

Master: local[2]
Item Display Limit: 1000
Output Path: -
Log Path: -
Query Path : -

rumble$

You can now start typing interactive queries. Queries can span multiple lines. You need to press return 3 times to confirm:

rumble$ "Hello, world!"

This produces the following results (>>> marks the extra, empty lines that appear on the first two presses of the return key):

rumble$ "Hello, world!"
>>> 
>>> 
Hello, world

You can try a few more queries:

rumble$ 2 + 2
>>> 
>>> 
4

rumble$ 1 to 10
>>> 
>>> 
( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

This is it. RumbleDB is set up and ready to go locally. You can now move on to a JSONiq tutorial. A RumbleDB tutorial will also follow soon.

Running on a cluster

You can also try to run the RumbleDB shell on a cluster if you have one available and configured -- this is done with the same command, as the master and deployment mode are usually already set up in cloud-managed clusters. More details are provided in the rest of the documentation.

Your first programs

The syntax to start a session is similar to that of Spark. A RumbleSession is a SparkSession that additionally knows about RumbleDB. All attributes and methods of SparkSession are also available on RumbleSession.

Even though RumbleDB uses Spark internally, it can be used without any knowledge of Spark.

Executing a query is done with rumble.jsoniq(), like so:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

items = rumble.jsoniq('1+1')
python_tup = items.json()
print(python_tup)

A query returns a sequence of items, here the sequence with just the integer item 2.

There are several ways to retrieve the results of the query; calling json() is just one of them. It retrieves the sequence as a tuple of JSON values that Python can process. The detailed type mapping for this is described here. Other methods for retrieving the output of a query are described here.

Interacting with pyspark DataFrames

RumbleDB can work out of the box with pyspark DataFrames, both as input and (when the output has a schema) as output.

Using Pyspark DataFrames with JSONiq

Power users can also interface our library with pyspark DataFrames. JSONiq sequences can have billions of items, and our library supports this out of the box: it can also run on clusters, for example on AWS Elastic MapReduce. But your laptop is just fine, too: it will spread the computations over your cores. You can bind a DataFrame to a JSONiq variable, and JSONiq will recognize this DataFrame as a sequence of object items.

Creating a data frame is also similar to Spark (but using the rumble object):

data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
columns = ["Name", "Age"]
df = rumble.createDataFrame(data, columns)

This is how to bind a JSONiq variable to a dataframe. You can bind as many variables as you want:

rumble.bind('$a', df)

This is how to run a query. It is similar to spark.sql(). Since variable $a was bound to a DataFrame, it is automatically declared as an external variable and can be used in the query. In JSONiq, it is logically a sequence of objects.

res = rumble.jsoniq('$a.Name')

You can also, instead of the bind() call, pass the pyspark DataFrame directly in jsoniq() with an extra named parameter:

res = rumble.jsoniq('$a.Name', a=df)

There are several ways to collect the outputs, depending on the user's needs but also on the query supplied. The following method returns a list containing one or several of "DataFrame", "RDD", "PUL", and "Local":

modes = res.availableOutputs()
for mode in modes:
    print(mode)

If DataFrame is in the list, df() can be invoked:

df = res.df()
df.show()

If RDD is in the list, rdd() can be invoked.

If Local is in the list, items() or json() can be invoked, as well as the local iterator API.

Manipulating DataFrames with SQL and JSONiq

If the output of the JSONiq query is structured (i.e., RumbleDB was able to detect a schema), then we can extract a regular data frame that can be further processed with spark.sql() or rumble.jsoniq().

We are continuously working on the detection of schemas, and RumbleDB will get better at it with time. JSONiq is a very powerful language and can also produce heterogeneous output "by design". In that case, you need to use rdd() instead of df(), or collect the list of JSON values (see further down). Remember that availableOutputs() tells you what is at your disposal.

A DataFrame output by JSONiq can be reused as input to a Spark SQL query. (Remember that rumble is a wrapper around a SparkSession object, so you can use rumble.sql() just like spark.sql()):

df.createTempView("myview")
df2 = rumble.sql("SELECT * FROM myview").toDF("name")
df2.show()

A DataFrame output by Spark SQL can be reused as input to a JSONiq query:

rumble.bind('$b', df2)
seq2 = rumble.jsoniq("for $i in 1 to 5 return $b")
df3 = seq2.df()
df3.show()

And a DataFrame output by JSONiq can be reused as input to another JSONiq query:

rumble.bind('$b', df3)
seq3 = rumble.jsoniq("$b[position() lt 3]")
df4 = seq3.df()
df4.show()

Binding JSONiq variables to Python values

It is possible to bind a JSONiq variable to a tuple of native Python values and then use it in a query. In JSONiq, variables are bound to sequences of items, just like the results of JSONiq queries are sequences of items. A Python tuple will be seamlessly converted to a sequence of items by the library. Currently we only support strs, ints, floats, booleans, None, and (recursively) lists and dicts. But if you need more (like dates, bytes, etc.), we will add them without any problem; JSONiq has a rich type system.

Values can be passed with extra named parameters, like so:

print(rumble.jsoniq("""
for $v in $c
let $parity := $v mod 2
group by $parity
return { switch($parity)
         case 0 return "even"
         case 1 return "odd"
         default return "?" : $v
}
""", c=(1, 2, 3, 4, 5, 6)).json())

print(rumble.jsoniq("""
for $i in $c
return [
  for $j in $i
  return { "foo" : $j }
]
""", c=([1,2,3],[4,5,6])).json())

print(rumble.jsoniq('{ "results" : $c.foo[[2]] }',
    c=({"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]})).json())

It is also possible to bind variables more durably (across multiple jsoniq() calls) with bind():

rumble.bind('$c', (1, 2, 3, 4, 5, 6))
print(rumble.jsoniq("""
for $v in $c
let $parity := $v mod 2
group by $parity
return { switch($parity)
         case 0 return "even"
         case 1 return "odd"
         default return "?" : $v
}
""").json())

print(rumble.jsoniq("""
for $v in $c
let $parity := $v mod 2
group by $parity
return { switch($parity)
         case 0 return "gerade"
         case 1 return "ungerade"
         default return "?" : $v
}
""").json())

rumble.bind('$c', ([1,2,3],[4,5,6]))
print(rumble.jsoniq("""
for $i in $c
return [
  for $j in $i
  return { "foo" : $j }
]
""").json())

rumble.bind('$c', ({"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]}))
print(rumble.jsoniq('{ "results" : $c.foo[[2]] }').json())

It is also possible to bind only one value. Then it must be provided as a singleton tuple, because in JSONiq an item is the same as a sequence of one item:

rumble.bind('$c', (42,))
print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())

For convenience and code readability, you can also use bindOne():

rumble.bindOne('$c', 42)
print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())

A variable that was durably bound with bind() or bindOne() can be unbound with unbind():

rumble.unbind('$c')

Writing queries directly in Jupyter notebook cells

The Python edition of RumbleDB comes out of the box with a JSONiq magic.

If you are in a Jupyter notebook and have installed the jsoniq pip package, you can activate the jsoniq magic with:

%load_ext jsoniqmagic

Then, you can run JSONiq in standalone cells and see the results:

%%jsoniq
{"foobar":1}

Of course, you can still continue to use rumble.jsoniq() calls and process the outputs as you see fit.

An example of the magic in action is available in our upgraded online sandbox.

Note: this is a different magic from the one that works with the RumbleDB HTTP server. It is more modern, and running a server is no longer needed; it suffices to install the jsoniq Python package.

Change the behavior to output DataFrames

By default, the output will be in the form of serialized JSON values. If the output is structured, you can change this default behavior to show it in the form of a DataFrame instead.

For a pandas DataFrame:

%%jsoniq -pdf
for $i in 1 to 10000000
return { "foobar" : $i}

For a pyspark DataFrame:

%%jsoniq -df
for $i in 1 to 10000000
return { "foobar" : $i}

Note that this will not work in all cases. If the output is not fully structured or RumbleDB is unable to infer a DataFrame schema, you can specify the schema yourself. The schema language is called JSound and you will find a tutorial here.

%%jsoniq -pdf
declare type local:mytype as {
    "product" : "string",
    "store-number" : "int",
    "quantity" : "decimal"
};
validate type local:mytype* { 
    for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
    where $product.quantity ge 995
    return $product
}

It is possible to measure the response time with the -t parameter:

%%jsoniq -t
for $i in 1 to 10000000
return { "foobar" : $i}

Write back to the disk (or data lake)

Generally, it is possible to write output to disk using the pandas DataFrame API, the pyspark DataFrame API, or Python's own facilities for writing JSON values to disk.
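For instance, a minimal sketch of the plain-Python route (using the standard json module; the query is a made-up example):

import json

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()
seq = rumble.jsoniq('for $i in 1 to 3 return { "n" : $i }')

# Write one JSON object per line (JSON Lines) with Python's json module.
with open("output.jsonl", "w") as f:
    for item in seq.json():
        f.write(json.dumps(item) + "\n")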

For convenience, we provide a way to also directly do so with the sequence object output by the query.

It is possible to write the output to a file locally or on a cluster. The API is similar to that of Spark dataframes. Note that it creates a directory and stores the (potentially very large) output in a sharded directory. RumbleDB has already been tested with up to 64 AWS machines and hundreds of TBs of data.

Of course, the examples below are so small that it makes more sense to process the results locally with Python, but this shows how GBs or TBs of data obtained from JSONiq can be written back to disk.

seq = rumble.jsoniq("$a.Name")
seq.write().mode("overwrite").json("outputjson")
seq.write().mode("overwrite").parquet("outputparquet")

seq = rumble.jsoniq("1+1")
seq.write().mode("overwrite").text("outputtext")

The transform expression

Updates can be applied to a clone of an existing instance with the copy-modify-return expression.

The content of the modify clause may build a complex Pending Update List with multiple updates. Remember that, with snapshot semantics, each update is applied against the initial snapshot, and updates do not see each other's effects.

Updating expressions can also be combined with conditional expressions (in the then and else clauses), switch expressions (in the return clauses), FLWOR expressions (in the return clause), etc., for more powerful queries based on patterns in the available data (from any source visible to the JSONiq query).

The updates generated inside the modify clause may only target the cloned object, i.e., the variable specified in the copy clause.

Example 191. JSON copy-modify-return expression

copy $obj := { "foo" : "bar", "bar" : [ 1,2,3 ] }
modify (
  insert json { "bar" : 123, "foobar" : [ true, false ] } into $obj,
  delete json $obj.bar,
  replace value of json $obj.foo with true
)
return $obj

Result: { "foo" : true, "bar" : 123, "foobar" : [ true, false ] }

In the remainder of this chapter, we showcase the individual updating expressions one by one, inside a copy-modify-return expression.

Update expressions can also appear outside of a copy-modify-return expression, in which case they propagate and/or persist directly to their targets, to the extent that the context makes it meaningful and possible.

Advanced configuration

RumbleDB's specific configuration

It is possible to access RumbleDB's advanced configuration parameters with

conf = rumble.getRumbleConf()

Then, you can change the value of some parameters. For example, you can increase the number of JSON values that you can retrieve with a json() call:

conf.setResultSizeCap(1000)

You can also configure RumbleDB to output verbose information about the internal query plan, type and mode detection, and optimizations. This can be of interest to data engineers or researchers who want to understand how RumbleDB works:

conf.setPrintIteratorTree(True)

The complete API for configuring RumbleDB is accessible in our JavaDoc pages. These methods are also callable in Python.

Warning: some of the configuration methods do not make sense in Python and are specific to the command line edition of RumbleDB (such as setting the query content or an output path and input/output format). Also, setting external variables in Python should not be done via the configuration, but with the bind() and unbind() functions or extra parameters in jsoniq() calls.

Licenses

RumbleDB uses the following software:

  • ANTLR v4 Framework - BSD License

  • Apache Commons Text - Apache License

  • Apache Commons Lang - Apache License

  • Apache Commons IO - Apache License

  • Apache HTTP client - Apache License

  • gson - Apache License

  • JLine terminal framework - BSD License

  • Kryo serialization framework - BSD License

  • Laurelin (ROOT parser) - BSD-3

  • Spark Libraries - Apache License

  • As well as the JSONiq language - CC BY-SA 3.0 License

JSONiq Update Facility

JSONiq follows the XQuery Update Facility standard and introduces update primitives and update expressions specific to JSON data.

In JSONiq, updates are not immediately applied. Rather, a snapshot of the current data is taken, and a list of updates, called the Pending Update List, is collected. Then, upon explicit request by the user (via specific expressions), the Pending Update List is applied atomically, leading to a new snapshot. It is also possible for an engine to persist (to the local disk, to a database management system, to a data lake...) the resulting Pending Update List after a query has been completed.

Merging updates

In the middle of a program, several PULs can be produced against the same snapshot. They are then merged with upd:mergeUpdates (part of the XQuery Update Facility standard), which is extended as follows.

  • Several deletes on the same object are replaced with a unique delete on that object, with a list of all selectors (names) to be deleted, where duplicates have been eliminated.

  • Several deletes on the same array and selector (position) are replaced with a unique delete on that array and with that selector.

  • Several inserts on the same array and selector (position) are equivalent to a unique insert on that array and selector with the content of those original inserts appended in an implementation-dependent order (like XQUF).

  • Several inserts on the same object are equivalent to a unique insert where the objects containing the pairs to insert are merged. An error jerr:JNUP0005 is raised if a collision occurs.

  • Several replaces on the same object or array and with the same selector raise an error jerr:JNUP0009.

  • Several renames on the same object and with the same selector raise an error jerr:JNUP0010.

  • If there is a replace and a delete on the same object or array and with the same selector, the replace is omitted in the merged PUL.

  • If there is a rename and a delete on the same object or array and with the same selector, the rename is omitted in the merged PUL.
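To illustrate the first rule, here is a small sketch using the copy-modify-return expression documented elsewhere in this reference (the data is made up):

copy $o := { "a" : 1, "b" : 2, "c" : 3 }
modify (
  delete json $o.a,
  delete json $o.a,
  delete json $o.b
)
return $o

The two deletes targeting the pair a are merged into a single delete with the duplicate selector eliminated, and the result is { "c" : 3 }.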


JSONiq 3.1

JSONiq 3.1 is an initiative of the RumbleDB team that aligns JSONiq more closely with XQuery 3.1, which has now become a W3C Recommendation, while keeping what makes it JSONiq: the flagship feature is the ability to copy-paste JSON into a JSONiq query, together with a navigation syntax that appeals to the JSON community.

JSONiq 3.1 does not require a distinct data model (JDM), since XQuery 3.1 supports maps and arrays. As a result, JSONiq 3.1 objects are the same as XQuery 3.1 maps, and JSONiq 3.1 arrays are the same as XQuery 3.1 arrays.

JSONiq 3.1 does not require a separate serialization mechanism, since XQuery 3.1 supports the JSON output method.

JSONiq 3.1 benefits from all the map and array builtin functions defined in XQuery 3.1.

JSONiq 3.1 is fully interoperable with XQuery 3.1 and can execute on the same virtual machine (similar to Scala and Java).

This also paves the way for JSONiq 4.0, which will be aligned with XQuery 4.0 as much as is technically possible.

As a result, the specification of JSONiq 3.1 is even more minimal than that of JSONiq 1.0. This makes it easy for any existing XQuery engine to support, as a way to step into the JSON community.

RumbleDB is gradually rolling out JSONiq 3.1, but this will take some time as we make sure to sweep all corners.

How JSONiq 3.1 amends XQuery 3.1

Context item

In JSONiq 3.1, the context item is obtained through $$ and not through a dot.

Escaping in strings

String literals use JSON escaping instead of XML escaping (backslash, not ampersand).

Map constructors

In map (object) constructors, the "map" keyword in front is optional.

Constraints on XPath

A name test must be prefixed with $$/ and cannot stand on its own.

True, null, and false literals

true and false exist as literals and do not have to be obtained through function calls (true(), false()).

null exists as a literal and stands for the empty sequence.

Navigation

The dot . and double square brackets [[ ]] act as syntactic sugar for the ? lookup.

How JSONiq 3.1 differs from JSONiq 1.0

The data model standardized by the W3C working group is more generic and allows for atomic object keys that are not necessarily strings (dates, etc.). Also, an object value or an array value can be a sequence of items and does not need to be a single item. The particular case in which object keys are strings and values are single items (or empty) corresponds to the JSON use case.

Null does not exist as its own type in JSONiq 3.1; instead, it is mapped to the empty sequence.

There are other minor changes in semantics that correspond to the alignment with XQuery 3.1, such as Effective Boolean Values, comparison, etc.

Open Issues

The JSON update syntax has not yet been integrated into the core language. This is planned, and the syntax will be simplified (no json keyword, and dot lookup allowed here as well).

The semantics of the JSON serialization method is the same as in the JSONiq Extension to XQuery. It is still under discussion how to escape special characters with the Text output method.



Allocating more memory

If you get an out-of-memory error, it is possible to allocate more memory when you build the Rumble session, with a config() call. This is exactly the same way it is done when building a Spark session. The config() call can of course be combined with any other method calls that are part of the builder chain (withDelta(), appName(), config(), etc.).

This will only have an effect the first time the session is created. Spark and Java are not able to adjust the memory automatically after the session has been created; subsequent calls of getOrCreate() do not create a new session, they only get the existing one.

In Jupyter, you can restart the kernel before you create the session, to force a new session.

For example:

rumble = RumbleSession.builder \
    .config("spark.driver.memory", "10g") \
    .getOrCreate()

More complex, standalone queries

Below are a few examples showing what is possible with JSONiq. You can learn JSONiq with our interactive tutorial. You will also find a full language reference here, as well as a list of builtin functions.

For complex queries, you can use Python's ability to spread strings over multiple lines, with no need to escape special characters:
seq = rumble.jsoniq("""

let $stores :=
[
  { "store number" : 1, "state" : "MA" },
  { "store number" : 2, "state" : "MA" },
  { "store number" : 3, "state" : "CA" },
  { "store number" : 4, "state" : "CA" }
]
let $sales := [
   { "product" : "broiler", "store number" : 1, "quantity" : 20  },
   { "product" : "toaster", "store number" : 2, "quantity" : 100 },
   { "product" : "toaster", "store number" : 2, "quantity" : 50 },
   { "product" : "toaster", "store number" : 3, "quantity" : 50 },
   { "product" : "blender", "store number" : 3, "quantity" : 100 },
   { "product" : "blender", "store number" : 3, "quantity" : 150 },
   { "product" : "socks", "store number" : 1, "quantity" : 500 },
   { "product" : "socks", "store number" : 2, "quantity" : 10 },
   { "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
  for $store in $stores[], $sale in $sales[]
  where $store."store number" = $sale."store number"
  return {
    "nb" : $store."store number",
    "state" : $store.state,
    "sold" : $sale.product
  }
return [$join]
""")

print(seq.json())

seq = rumble.jsoniq("""
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
    "store" : $store-number,
    "products" : [ distinct-values($product.product) ]
}
""")

print(seq.json())
    res = rumble.jsoniq('$a.Name');
    res = rumble.jsoniq('$a.Name', a=df);
    modes = res.availableOutputs();
    for mode in modes:
        print(mode)
    df = res.df();
    df.show();
    df.createTempView("myview")
    df2 = spark.sql("SELECT * FROM myview").toDF("name");
    df2.show();
    rumble.bind('$b', df2);
    seq2 = rumble.jsoniq("for $i in 1 to 5 return $b");
    df3 = seq2.df();
    df3.show();
    rumble.bind('$b', df3);
    seq3 = rumble.jsoniq("$b[position() lt 3]");
    df4 = seq3.df();
    df4.show();
    %%jsoniq -pdf
    for $i in 1 to 10000000
    return { "foobar" : $i}
    %%jsoniq -df
    for $i in 1 to 10000000
    return { "foobar" : $i}
    %%jsoniq -pdf
    declare type local:mytype as {
        "product" : "string",
        "store-number" : "int",
        "quantity" : "decimal"
    };
    validate type local:mytype* { 
        for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
        where $product.quantity ge 995
        return $product
    }
    %%jsoniq -t
    for $i in 1 to 10000000
    return { "foobar" : $i}
conf.setPrintIteratorTree(True)
rumble = RumbleSession.builder \
    .config("spark.driver.memory", "10g") \
    .getOrCreate()
    In the remainder of this chapter, we showcase the individual updating expressions one by one, inside a copy-modify-return expression.

    Update expressions can also appear outside of a copy-modify-return expression, in which case they propagate and/or persist directly to their targets, to the extent that the context makes it meaningful and possible.

    copy $obj := { "foo" : "bar", "bar" : [ 1,2,3 ] }
    modify (
      insert json { "bar" : 123, "foobar" : [ true, false ] } into $obj,
      delete json $obj.bar,
      replace value of json $obj.foo with true
    )
    return $obj
          
    copy-modify-return

    First queries

    This section assumes that you have installed RumbleDB with one of the proposed ways, and guides you through your first queries.

    Create some data set

    Create, in the same directory as RumbleDB to keep it simple, a file data.json and put the following content inside. This is a small list of JSON objects in the JSON Lines format.
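
For example, you can use the following nine lines, which describe the same small products as the other samples used throughout this documentation:

{ "product" : "broiler", "store number" : 1, "quantity" : 20 }
{ "product" : "toaster", "store number" : 2, "quantity" : 100 }
{ "product" : "toaster", "store number" : 2, "quantity" : 50 }
{ "product" : "toaster", "store number" : 3, "quantity" : 50 }
{ "product" : "blender", "store number" : 3, "quantity" : 100 }
{ "product" : "blender", "store number" : 3, "quantity" : 150 }
{ "product" : "socks", "store number" : 1, "quantity" : 500 }
{ "product" : "socks", "store number" : 2, "quantity" : 10 }
{ "product" : "shirt", "store number" : 3, "quantity" : 10 }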

    If you want to later try a bigger version of this data, you can also download a larger version with 100,000 objects from here. Wait, no, in fact you do not even need to download it: you can simply replace the file path in the queries below with "https://rumbledb.org/samples/products-small.json" and it will just work! RumbleDB feels just at home on the Web.

    RumbleDB also scales without any problems to datasets that have millions or (on a cluster) billions of objects, although of course, for billions of objects HDFS or S3 are a better idea than the Web to store your data, for obvious reasons.

    In the JSON Lines format that this simple dataset uses, you just need to make sure you have one object on each line (this is different from a plain JSON file, which has a single JSON value and can be indented). Of course, RumbleDB can read plain JSON files, too (with json-doc()), but below we will show you how to read JSON Line files, which is how JSON data scales.
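
As a sketch (the file names are placeholders), json-doc() reads a whole document as a single item, while json-lines() reads one item per line:

(: a plain, possibly indented JSON file containing a single object :)
json-doc("single-object.json").product

(: a JSON Lines file with one object per line :)
for $o in json-lines("data.json")
return $o.product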

    Running simple queries locally

    Depending on your installation method, the JSONiq queries will go to:

    • A cell in a jupyter notebook with the %%jsoniq magic: a simple click is sufficient to execute the query.

    • The shell: type the query, and finish by pressing Enter twice.

    • In a Python program, inside a rumble.jsoniq() call, whose output you can then exploit with more Python code.

    • A JSONiq query file, which you can execute with the RumbleDB CLI interface.

    In all cases, the meaning of the queries is the same.

    or

    or

    The above queries do not actually use Spark. Spark is used when the I/O workload can be parallelized. The following query should output the file created above.

    json-lines() reads its input in parallel, and thus will also work on your machine with MB or GB files (for TB files, a cluster will be preferable). You should specify a minimum number of partitions, here 10 (note that this is a bit ridiculous for our tiny example, but it is very relevant for larger files), as locally no parallelization will happen if you do not specify this number.

    The above creates a very simple Spark job and executes it. More complex queries will create several Spark jobs. But you will not see anything of it: this is all done behind the scenes. If you are curious, you can go to localhost:4040 in your browser while your query is running (it will not be available once the job is complete) and look at what is going on behind the scenes.

    Data can be filtered with the where clause. Again, under the hood, a Spark transformation will be used:

    RumbleDB also supports grouping and aggregation, like so:

    RumbleDB also supports ordering. Note that clauses (where, let, group by, order by) can appear in any order. The only constraint is that the first clause should be a for or a let clause.

    Finally, RumbleDB can also parallelize data provided within the query, exactly like Spark's parallelize() function:

    Mind the double parentheses, as parallelize() is a unary function to which we pass a sequence of objects.

    With docker

    The docker installation is kindly contributed by Dr. Ingo Müller (Google).

    Known issue

    On occasion, the docker version of RumbleDB used to throw a Kryo NoSuchMethodError on some systems. This should be fixed with version 2.0.0; let us know if this is not the case.

    You can upgrade to the newest version with

    Running simple queries with Docker

    Docker is the easiest way to get a standard environment that just works.

    You can download Docker from the official Docker website.

    Then, in a shell, type, all on one line:

    The first time, it might take some time to download everything, but this is all done automatically. Subsequent commands will run immediately.

    When there are new RumbleDB versions, you can upgrade with:

    The RumbleDB shell appears:

    You can now start typing simple queries like the following few examples. Press three times the return key to execute a query.

    or

    or

    The above queries do not actually use Spark. Spark is used when the I/O workload can be parallelized. The following query should output the file created above.

    json-lines() reads its input in parallel, and thus will also work on your machine with MB or GB files (for TB files, a cluster will be preferable). You should specify a minimum number of partitions, here 10 (note that this is a bit ridiculous for our tiny example, but it is very relevant for larger files), as locally no parallelization will happen if you do not specify this number.

    The above creates a very simple Spark job and executes it. More complex queries will create several Spark jobs. But you will not see anything of it: this is all done behind the scenes. If you are curious, you can go to localhost:4040 in your browser while your query is running (it will not be available once the job is complete) and look at what is going on behind the scenes.

    Data can be filtered with the where clause. Again, under the hood, a Spark transformation will be used:

    RumbleDB also supports grouping and aggregation, like so:

    RumbleDB also supports ordering. Note that clauses (where, let, group by, order by) can appear in any order. The only constraint is that the first clause should be a for or a let clause.

    Finally, RumbleDB can also parallelize data provided within the query, exactly like Spark's parallelize() function:

    Mind the double parentheses, as parallelize() is a unary function to which we pass a sequence of objects.

    Running the RumbleDB docker as a server

    You can also run the docker as a server like so:

    You can change the port to something other than 8001 at all three places it appears. Do not forget -p 8001:8001, which forwards the port to the outside of the docker container. Then, you can use a jupyter notebook connected to the RumbleDB docker server to write queries in it. Point the notebook to http://localhost:8001/jsoniq in the appropriate cell (or any other port).

    Querying local files with the docker version of RumbleDB

    In order to query your local files, you need to mount a local directory to a directory within the docker. This is done with the --mount option, and the source path must be absolute. For the target, you can pick anything that makes sense to you.

    For example, imagine you have a file products-small.json in the directory /path/to/my/directory. Then you need to run RumbleDB with:

    Then you can go ahead and use absolute paths in the target directory in input functions, like so:

    You can also mount a local directory in this way running it as a server rather than a shell.

    On a Spark cluster (e.g., AWS EMR)

    Running RumbleDB on a cluster

    After you have tried RumbleDB locally as explained in the getting started section, you can take RumbleDB to a real cluster simply by modifying the command line parameters as documented here for spark-submit.

    Warning: EMR as of version 7.10 does not support Spark 4.0 yet, but we expect this will happen soon. In the meantime, you should use RumbleDB 1.22.

    Creating a cluster

    Creating a cluster is the easiest part, as most cloud providers today offer that with just a few clicks: Amazon EMR, Azure HDInsight, etc. You can start with 4-5 machines with a few CPUs each and a bit of memory, and increase later when you want to get serious on larger scales.

    Make sure to select a cluster that has Apache Spark. On Amazon EMR, this is not the default and you need to make sure that you check the box that has Spark below the cluster version dropdown. We recommend taking the latest EMR version 6.5.0 and then picking Spark 3.1 in the software configuration. You will also need to create a public/private key pair if you do not already have one.

    Wait for 5 or 6 minutes, and the cluster is ready.

    Do not forget to terminate the cluster when you are done!

    How to tune the RumbleDB command

    Next, you need to use ssh to connect to the master node of your cluster as the hadoop user, specifying your private key file. You will find the hostname of the machine on the EMR cluster page. The command looks like:

    ssh -i ~/.ssh/yourkey.pem hadoop@<master-node-hostname>

    If ssh hangs, then you may need to authorize your IP for incoming connections in the security group of your cluster.

    And once you have connected with ssh and are on the shell, you can start using RumbleDB in a way similar to what you do on your laptop.

    First you need to download it with wget (which is usually available by default on cloud virtual machines):

    This is all you need to do, since Apache Spark is already installed. If spark-submit does not work, you might want to wait for a few more minutes as it might be that the cluster is not fully prepared yet.

    Often, the Spark cluster runs on yarn. Compared to the getting started guide, the --master option can then be changed from local[*] (which was for running on your laptop) to yarn.

    Most of the time, though (e.g., on Amazon EMR), it need not be specified, as this is already set up in the environment. So the same command will do:

    When you are on a cluster, you can also adapt the number of executors, how many cores you want per executor, etc. It is recommended to use sqrt(n) cores per executor if a node has n cores. For the executor memory, it is just primary school math: you need to divide the memory on a machine by the number of executors per machine (which is also roughly sqrt(n)).

    For example, if we have 6 worker nodes with 16 cores and 64 GB each, we can use 5 executors on each machine, with 3 cores and 10 GB per executor. This leaves a core and a bit of memory free for other cluster tasks.

    If necessary, the size limit for materialization can be increased with --materialization-cap or its shortcut -c (the default is 200). This affects the number of items displayed on the shell as an answer to a query. It also affects the maximum number of items that can be materialized from a large sequence into, say, an array. Warnings are issued if the cap is reached.

    Creation functions

    json-lines() then takes an HDFS path, and the host and port are optional if Spark is configured properly. A second parameter controls the minimum number of splits. By default, each HDFS block is a split if executed on a cluster. In a local execution, there is only one split by default.

    The same goes for parallelize(). It is also possible to read text with text-file(), parquet files with parquet-file(), and it is also possible to read data on S3 rather than HDFS for all three functions json-lines(), text-file() and parquet-file().
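
A few sketches, with hypothetical host, bucket and file names:

json-lines("hdfs://namenode:8020/user/me/data.json", 100)
text-file("s3://my-bucket/logs/log.txt")
parquet-file("/user/me/data.parquet")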

    Bigger data sets

    If you need a bigger data set out of the box, we recommend the great language game dataset, which has 16 million objects. On Amazon EMR, we could even read several billion objects on fewer than ten machines.

    We tested this with each new release, and suggest the following queries to start with (we assume HDFS is the default file system, and that you copied over this dataset to HDFS with hadoop fs -copyFromLocal):

    Note that by default only the first 200 items in the output will be displayed on the shell, but you can change it with the --materialization-cap parameter on the CLI.

    Execution of single queries and output to HDFS

    RumbleDB also supports executing a single query from the command line, reading from HDFS and outputting the results to HDFS, with the query file being either local or on HDFS. For this, use the --query-path (optional, as any text without a parameter name is recognized as the query path anyway), --output-path (shortcut -o) and --log-path parameters.

    The query path, output path and log path can be any of the supported schemes (HDFS, file, S3, WASB...) and can be relative or absolute.

    Equality and identity

    As in most languages, one can distinguish between physical equality and logical equality.

    Atomics can only be compared logically. Their physical identity is totally opaque to you.

    Logical comparison of two atomics

    Result (run with Zorba):true

    Logical comparison of two atomics

    Result (run with Zorba):false

    Logical comparison of two atomics

    Result (run with Zorba):false

    Logical comparison of two atomics

    Result (run with Zorba):true

    Two objects or arrays can be tested for logical equality as well, using deep-equal(), which performs a recursive comparison.

    Logical comparison of two JSON items

    Result (run with Zorba):true

    Logical comparison of two JSON items

    Result (run with Zorba):false

    The physical identity of objects and arrays is not exposed to the user in the core JSONiq language itself. Some library modules might be able to reveal it, though.

    Modules

    Module

    You can group functions and variables in separate library modules.

    MainModule

    Up to now, everything we encountered was a main module, i.e., a prolog followed by a main query.

    LibraryModule

    A library module does not contain any query - just functions and variables that can be imported by other modules.

    A library module must be assigned to a namespace. For convenience, this namespace is bound to an alias in the module declaration. All variables and functions in a library module must be prefixed with this alias.

    A library module

    ModuleImport

    Here is a main module which imports the former library module. An alias is given to the module namespace (my). Variables and functions from that module can be accessed by prefixing their names with this alias. The alias may be different from the internal alias defined in the imported module.

    An importing main module

    Result (run with Zorba):1764

    The JSONiq data model

    JSONiq is a query language that was specifically designed for querying JSON, although its data model is powerful enough to handle other similar formats.

    As stated on json.org, JSON is a "lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate."

    A JSON document is made of the following building blocks: objects, arrays, strings, numbers, booleans and nulls.

    JSONiq manipulates sequences of these building blocks, which are called items. Hence, a JSONiq value is a sequence of items.

    Any JSONiq expression takes and returns sequences of items.

    Comma-separated JSON-like building blocks is all you need to begin building your own sequences. You can mix and match, as JSONiq supports heterogeneous sequences seamlessly.

    Frequently asked questions and common issues

    Out of memory error

    By default, the memory allocated is limited. This depends on whether you run RumbleDB with the standalone jar or as the thin jar in a Spark environment.

    If you run RumbleDB with a standalone jar, then your laptop will allocate by default one quarter of your total working memory. You can check this with:

    In order to increase the memory, you can use -Xmx10g (for 10 GB, but you can use any other value):

    If you run RumbleDB on your laptop (or a single machine) with the thin jar, then by default this is limited to around 2 GB, and you can change this with --driver-memory:

    More advanced output retrieval methods

    This section shares more techniques for advanced users who want to make the most of RumbleDB in Python.

    RumbleDB Item API

    JSONiq has a rich type system, and the conversion to JSON values can lose type information.

    An alternative consists of retrieving the sequence as a tuple of native items, which can be accessed with the RumbleDB Item API.

    Input datasets (examples)

    Even though you can build your own JSON values with JSONiq by copying-and-pasting JSON documents, most of the time, your JSON data will come from an external input dataset.

    How this dataset is accessed depends on the JSONiq implementation and on the context. Some engines can read the data from a file located on a file system, local or distributed (HDFS, S3); some others get data from the Web; some others are full-fledged datastores and have collections that can be created, queried, modified and persisted.

    It is up to each engine to document which functions should be used, and how, in order to read datasets into a JSONiq Data Model instance. These functions will take implementation-defined parameters and typically return sequences of objects, or sequences of strings, or sequences of items, etc.

    For the purpose of examples given in this specification, we assume that a hypothetical engine has collections that are sequences of objects, identified by a name which is a string. We assume that there is a collection() function that returns all objects associated with the provided collection name.

    We assume in particular that there are three example collections, shown below.

    { "product" : "broiler", "store number" : 1, "quantity" : 20  }
    { "product" : "toaster", "store number" : 2, "quantity" : 100 }
    { "product" : "toaster", "store number" : 2, "quantity" : 50 }
    { "product" : "toaster", "store number" : 3, "quantity" : 50 }
    { "product" : "blender", "store number" : 3, "quantity" : 100 }
    { "product" : "blender", "store number" : 3, "quantity" : 150 }
    { "product" : "socks", "store number" : 1, "quantity" : 500 }
    { "product" : "socks", "store number" : 2, "quantity" : 10 }
    { "product" : "shirt", "store number" : 3, "quantity" : 10 }
    docker pull rumbledb/rumble
    
    1 eq 1
        
    A sequence

    Result:foo 2 true { "foo" : "bar" } null [ 1, 2, 3 ]

    Sequences are flat and cannot be nested. This makes streaming possible, which is very powerful.

    Sequences are flat

    Result:foo 2 true 4 null 6

    A sequence can be empty. The empty sequence can be constructed with empty parentheses.

    The empty sequence

    Result:

    A sequence of just one item is considered the same as just this item. Whenever we say that an expression returns or takes one item, we really mean that it takes a singleton sequence of one item.

    A sequence of one item

    Result:foo

    JSONiq classifies the items mentioned above in three categories:

    • Atomic items: strings, numbers, booleans and nulls, but also many other supported atomic values such as dates, binary, etc.

    • Structured items: objects and arrays.

    • Function items: items that can take parameters and, upon evaluation, return sequences of items.

    The JSONiq data model follows the W3C specification, but, in core JSONiq, does not include XML nodes, and includes instead JSON objects and arrays. Engines are free, however, to optionally support XML nodes in addition to JSON objects and arrays.

    Atomic items

    An atomic is a non-structured value that is annotated with a type.

    JSONiq atomic values follow the W3C specification.

    JSONiq supports most atomic values available in the W3C specification. They are described in Chapter The JSONiq type system. JSONiq furthermore defines an additional atomic value, null, with a type of its own, jn:null, which does not exist in the W3C specification.

    In particular, JSONiq supports all core JSON values. Note that JSON numbers correspond to three different types in JSONiq, as the sketch after this list illustrates.

    • string: all JSON strings.

    • integer: all JSON numbers that are integers (no dot, no exponent), infinite range.

    • decimal: all JSON numbers that are decimals (no exponent), infinite range.

    • double: IEEE double-precision 64-bit floating point numbers (corresponds to JSON numbers with an exponent).

    • boolean: the JSON booleans true and false.

    • null: the JSON null.
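
    For instance, the following sketch, using the instance of operation described in the type system chapter, returns true four times:

    42 instance of integer,
    3.14 instance of decimal,
    6.022e23 instance of double,
    "abc" instance of string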

    Structured items

    Structured items in JSONiq do not follow the XQuery 3.1 standard but are specific to JSONiq.

    In JSONiq, an object represents a JSON object, i.e., a collection of string/item pairs.

    Objects have the following property:

    • pairs. A set of pairs. Each pair consists of an atomic value of type xs:string and of an item.

      [ Consistency constraint: no two pairs have the same name (using fn:codepoint-equal). ]

    The XQuery data model uses accessors to explain the data model. Accessors are not exposed to the user and are only used for convenience in this specification. Objects have the following accessors:

    • jdm:object-keys($o as js:object) as xs:string*: returns all keys in the object $o.

    • jdm:object-value($o as js:object, $s as xs:string) as js:item: returns the value associated with $s in the object $o.

    An object does not have a typed value.

    In JSONiq, an array represents a JSON array, i.e., an ordered list of items.

    Arrays have the following property:

    • members. An ordered list of items.

    Arrays have the following accessors:

    • jdm:array-size($a as js:array) as xs:nonNegativeInteger: returns the number of values in the array $a.

    • jdm:array-value($a as js:array, $i as xs:positiveInteger) as js:item: returns the value at position $i in the array $a.

    An array does not have a typed value.

    Unlike in the XQuery 3.1 standard, the values in arrays and objects are single items (which disallows the empty sequence or a sequence of more than one item). Also, object keys must be strings (which disallows any other atomic value).

    Function items

    JSONiq also supports function items, also known as higher-order functions. A function item can be passed parameters and evaluated.

    A function item has an optional name and an arity. It also has a signature, which consists of the sequence type of each one of its parameters (as many as its arity), and the sequence type of the values it returns.

    The fact that functions are items means that they can be returned by expressions, and passed as parameters to other functions. This is why they are also often called higher-order functions.

    JSONiq function items follow the W3C specification.
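
    As a sketch, the following query builds an anonymous function item, binds it to a variable and calls it, returning 42:

    let $f := function($x as integer) as integer { 2 * $x }
    return $f(21)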


    If you run RumbleDB on a cluster, then the memory needs to be allocated to the executors, not the driver:

    Setting things up on a cluster requires more thinking because setting the executor memory should be done in conjunction with setting the total number of executors and the number of cores per executor. This highly depends on your cluster hardware.

    Paths with whitespaces

    RumbleDB does not currently support paths containing whitespace. Make sure to put your data and modules at paths without whitespace.

    "Hadoop bin directory does not exist" on Windows

    If this happens, you can download winutils.exe to solve the issue as explained here.

    "java.lang.NoSuchMethodError: com.esotericsoftware.kryo.serializers. FieldSerializer.setIgnoreSyntheticFields" with docker

    This is a known issue under investigation. It is related to a version conflict between Kryo 4 and Kryo 5 that occasionally happens on some docker installations. We recommend trying a local installation instead, as described in the Getting Started section.

    Java version

    A very common issue leading to some errors is using the wrong Java version. With Spark 3.5, only Java 11 or 17 is supported. With Spark 4, Java 17 or 21 are supported.

    You should make sure in particular that you are not using a more recent Java version. Multiple Java versions can normally coexist on the same machine, but you need to make sure to set the JAVA_HOME variable appropriately.

    It is easy to check the Java version with:

    Streaming through the items

    Sometimes, a sequence can be very long, and materializing it to a tuple of JSON values or a tuple of native items can fail because of the materialization cap. While it can be changed in the configuration to allow for larger tuples, this does not scale.

    Another way to retrieve a sequence of arbitrary length is to use the iterator API to stream through the items one by one. If you do not keep previous values in memory, there is no limit to the sequence size that can be retrieved in this way (but it may take more time than using RDDs or DataFrames, which benefit from parallelism).

    This is how to stream through the items converted to JSON one by one:

    This is how to stream through the native items, using the Item API:

    Getting unstructured output as an RDD

    Sometimes, it is not possible to retrieve the output sequence as a (pandas or pyspark) DataFrame because no schema could be inferred. This is notably the case if the output sequence is heterogeneous (such as a sequence of items mixing atomics, objects of various structures, arrays, etc).

    And materializing or streaming may not be an option either if there are billions of items.

    In this case, it is possible to obtain the output as an RDD instead. This gets an RDD of JSON values that can be processed by Python (using the type mapping).

    The rdd() method is experimental because we had to reverse-engineer how pyspark encodes RDDs for the Java Virtual Machine (pickling).

    results = res.items();
    for result in results:
        print(result.getStringValue())
    Collection 1

    Result

    Collection 2

    Result

    Collection 3

    Result

    
    collection("one-object")
        
    { "foo" : "bar" }
    "Hello, World"
     1 + 1
     
     (3 * 4) div 5
     
     json-lines("data.json")
     
    for $i in json-lines("data.json", 10)
    return $i
    for $i in json-lines("data.json", 10)
    where $i.quantity gt 99
    return $i
    for $i in json-lines("data.json", 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    return { "product" : $product, "total-quantity" : sum($quantity) }
    for $i in json-lines("data.json", 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    let $sum := sum($quantity)
    order by $sum descending
    return { "product" : $product, "total-quantity" : $sum }
    for $i in parallelize((
     { "product" : "broiler", "store number" : 1, "quantity" : 20  },
     { "product" : "toaster", "store number" : 2, "quantity" : 100 },
     { "product" : "toaster", "store number" : 2, "quantity" : 50 },
     { "product" : "toaster", "store number" : 3, "quantity" : 50 },
     { "product" : "blender", "store number" : 3, "quantity" : 100 },
     { "product" : "blender", "store number" : 3, "quantity" : 150 },
     { "product" : "socks", "store number" : 1, "quantity" : 500 },
     { "product" : "socks", "store number" : 2, "quantity" : 10 },
     { "product" : "shirt", "store number" : 3, "quantity" : 10 }
    ), 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    let $sum := sum($quantity)
    order by $sum descending
    return { "product" : $product, "total-quantity" : $sum }
    docker run -i rumbledb/rumble repl
                 
    docker pull rumbledb/rumble
        ____                  __    __     ____  ____ 
       / __ \__  ______ ___  / /_  / /__  / __ \/ __ )
      / /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __  |  The distributed JSONiq engine
     / _, _/ /_/ / / / / / / /_/ / /  __/ /_/ / /_/ /   2.0.0 "Lemon Ironwood" beta
    /_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/  
    
    
    App name: spark-rumble-jar-with-dependencies.jar
    Master: local[*]
    Driver's memory: (not set)
    Number of executors (only applies if running on a cluster): (not set)
    Cores per executor (only applies if running on a cluster): (not set)
    Memory per executor (only applies if running on a cluster): (not set)
    Dynamic allocation: (not set)
    Item Display Limit: 200
    Output Path: -
    Log Path: -
    Query Path : -
    
    RumbleDB$
    "Hello, World"
     1 + 1
     
     (3 * 4) div 5
     
     json-lines("https://rumbledb.org/samples/products-small.json")
     
    for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
    return $i
    for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
    where $i.quantity gt 99
    return $i
    for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    return { "product" : $product, "total-quantity" : sum($quantity) }
    for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    let $sum := sum($quantity)
    order by $sum descending
    return { "product" : $product, "total-quantity" : $sum }
    for $i in parallelize((
     { "product" : "broiler", "store number" : 1, "quantity" : 20  },
     { "product" : "toaster", "store number" : 2, "quantity" : 100 },
     { "product" : "toaster", "store number" : 2, "quantity" : 50 },
     { "product" : "toaster", "store number" : 3, "quantity" : 50 },
     { "product" : "blender", "store number" : 3, "quantity" : 100 },
     { "product" : "blender", "store number" : 3, "quantity" : 150 },
     { "product" : "socks", "store number" : 1, "quantity" : 500 },
     { "product" : "socks", "store number" : 2, "quantity" : 10 },
     { "product" : "shirt", "store number" : 3, "quantity" : 10 }
    ), 10)
    let $quantity := $i.quantity
    group by $product := $i.product
    let $sum := sum($quantity)
    order by $sum descending
    return { "product" : $product, "total-quantity" : $sum }
    docker run -p 8001:8001 --rm rumbledb/rumble serve -p 8001 -h 0.0.0.0
    docker run -t -i --mount type=bind,source=/path/to/my/directory,target=/home rumbledb/rumble repl
    for $i in json-lines("/home/products-small.json", 10)
    where $i.quantity gt 99
    return $i
    wget https://github.com/RumbleDB/rumble/releases/download/v1.22.0/rumbledb-1.22.0-for-spark-3.5.jar
    spark-submit --master yarn --deploy-mode client rumbledb-1.22.0-for-spark-3.5.jar repl
                 
    spark-submit rumbledb-1.22.0-for-spark-3.5.jar repl
                 
    spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
                 rumbledb-1.22.0-for-spark-3.5.jar repl
    spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
                 rumbledb-1.22.0-for-spark-3.5.jar repl -c 10000
    for $i in json-lines("/user/you/confusion-2014-03-02.json", 300)
    let $guess := $i.guess
    let $target := $i.target
    where $guess eq $target
    where $target eq "Russian"
    return $i
    
    for $i in json-lines("/user/you/confusion-2014-03-02.json", 300)
    let $guess := $i.guess, $target := $i.target
    where $guess eq $target
    order by $target, $i.country descending, $i.date descending
    return $i
    
    for $i in json-lines("/user/you/confusion-2014-03-02.json", 300)
    let $country := $i.country, $target := $i.target
    group by $target, $country
    return { "Language" : $target, "Country" : $country, "Guesses" : count($i) }
    spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
                 rumbledb-1.22.0-for-spark-3.5.jar run "hdfs:///user/me/query.jq"
                 -o "hdfs:///user/me/results/output"
                 --log-path "hdfs:///user/me/logging/mylog"
    spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
                 rumbledb-1.22.0-for-spark-3.5.jar run "/home/me/my-local-machine/query.jq"
                 -o "/user/me/results/output"
                 --log-path "hdfs:///user/me/logging/mylog"
    
    1 eq 2
        
    
    "foo" eq "bar"
        
    
    "foo" ne "bar"
        
    
    deep-equal({ "foo" : "bar" }, { "foo" : "bar" })
        
    
    deep-equal({ "foo" : "bar" }, { "bar" : "foo" })
        
    
    module namespace my = "http://www.example.com/my-module";
    declare variable $my:variable := { "foo" : "bar" };
    declare variable $my:n := 42;
    declare function my:function($i as integer) { $i * $i };
        
    
    import module namespace other = "http://www.example.com/my-module";
    other:function($other:n)
        
    
    "foo", 2, true, { "foo", "bar" }, null, [ 1, 2, 3 ]
          
    
    ( ("foo", 2), ( (true, 4, null), 6 ) )
          
    
    ()
          
    
    ("foo")
          
    java -XX:+PrintFlagsFinal -version | grep -iE 'MaxHeapSize'   
    java -jar -Xmx10g rumbledb-2.0.0-standalone.jar ...
    spark-submit --driver-memory 10G rumbledb-2.0.0-for-spark-4.0.jar ...
    spark-submit --executor-memory 10G rumbledb-2.0.0-for-spark-4.0.jar ...
    java -version
    res.open();
    while res.hasNext():
        print(res.nextJSON());
    res.close();
    res.open();
    while res.hasNext():
        print(res.next().getStringValue());
    res.close();
    rdd = res.rdd();
    print(rdd.count());
    for s in rdd.take(10):
        print(s);
    
    collection("captains")
        
    
    { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 }
    { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 }
    { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 }
    { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24  }
    { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 }
    { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }
    { "name" : "Samantha Carter", "series" : [ ], "century" : 21 }
          
    collection("films")
    
    { "id" : "I", "name" : "The Motion Picture", "captain" : "James T. Kirk" }
    { "id" : "II", "name" : "The Wrath of Kahn", "captain" : "James T. Kirk" }
    { "id" : "III", "name" : "The Search for Spock", "captain" : "James T. Kirk" }
    { "id" : "IV", "name" : "The Voyage Home", "captain" : "James T. Kirk" }
    { "id" : "V", "name" : "The Final Frontier", "captain" : "James T. Kirk" }
    { "id" : "VI", "name" : "The Undiscovered Country", "captain" : "James T. Kirk" }
    { "id" : "VII", "name" : "Generations", "captain" : [ "James T. Kirk", "Jean-Luc Picard" ] }
    { "id" : "VIII", "name" : "First Contact", "captain" : "Jean-Luc Picard" }
    { "id" : "IX", "name" : "Insurrection", "captain" : "Jean-Luc Picard" }
    { "id" : "X", "name" : "Nemesis", "captain" : "Jean-Luc Picard" }
    { "id" : "XI", "name" : "Star Trek", "captain" : "Spock" }
    { "id" : "XII", "name" : "Star Trek Into Darkness", "captain" : "Spock" }
          

    JSON update primitives

    A Pending Update List is an unordered list of update primitives. Update primitives are internal and do not appear in the syntax. Each kind of update primitive models one individual update to an object or an array.

    A Pending Update List can by analogy be seen as the diff between two git revisions, and a single update primitive can be seen, with this same analogy, as the difference between two single lines of code. Thus, the JSONiq Update Facility is to trees what git is to lines of text: a "tree diff" language.

    JSONiq adds the following new update primitives, specific to JSON. They are similar to those defined by the XQuery Update Facility for XML.

    Update primitives within a PUL are applied with strict snapshot semantics. For example, positions are resolved against the array as it was before the updates, and names are resolved on the object as it was before the updates.
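
    For instance, in the following sketch (the initial array is our assumption), both positions refer to the original array [ 1, 2, 3, 4 ], so the result is [ 3, 4 ] and not [ 2, 4 ]:

    copy $arr := [ 1, 2, 3, 4 ]
    modify (
      delete json $arr[[1]],
      delete json $arr[[2]]
    )
    return $arr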

    Update primitive for objects and arrays (in collections or in memory)

    Update primitive
    Description

    Update primitives at the collection level

    Credits: Dwij Dixit/Ghislain Fourny (student project at ETH)

    Update primitive
    Description

    Primary updating expressions

    Update expressions are the visible part of JSONiq Updates in the language. Each primary updating expression contributes an update primitive to the Pending Update List being built.

    Nested updates (memory or persistent)

    These expressions may appear in a copy-modify-return (transform) expression (for in-memory updates on cloned values), or outside (for persistent updates to an underlying storage).

    Prologs

    This section introduces prologs, which allow declaring functions and global variables that can then be used in the main query. A prolog also allows setting some default behaviour.

    MainModule

    Prolog

    The prolog appears before the main query and is optional. It can contain setters and module imports, followed by function and variable declarations.
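
    A small sketch of a prolog with a setter, a global variable and a function declaration, followed by the main query (all names are our choice):

    declare ordering mode ordered;
    declare variable $greeting as string := "Hello";
    declare function local:greet($name as string) as string {
      $greeting || ", " || $name || "!"
    };
    local:greet("World")

    This sketch returns Hello, World!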

    Module imports are explained in the next chapter.

    jupd:rename-in-object($target as object(), $key as xs:string, $content as xs:string)

    Renames the pair originally named $key in the object $target as $content (does nothing if there is no such pair).

    jupd:insert-before-into-collection($target as item(), $content as item()*)

    Inserts the provided items before the specified item in its collection.

    jupd:insert-after-into-collection($target as item(), $content as item()*)

    Inserts the provided items after the specified item in its collection.

    jupd:insert-into-object($target as object(), $content as object())

    Inserts all pairs of the object $content into the object $target.

    jupd:insert-into-array($target as array(), $position as xs:integer, $content as item()*)

    Inserts all items in the sequence $content before position $position into the array $target.

    jupd:delete-from-object($target as object(), $keys as xs:string*)

    Removes the pairs whose names appear in $keys from the object $target.

    jupd:delete-from-array($target as array(), $position as xs:integer)

    Removes the item at position $position from the array $target (all following items in the array move one position to the left).

    jupd:replace-in-array($target as array(), $position as xs:integer, $content as item())

    Replaces the item at position $position in the array $target with the item $content (does nothing if $position is not between 1 and jdm:size($target)).

    jupd:replace-in-object($target as object(), $key as xs:string, $content as item())

    Replaces the value of the pair named $key in the object $target with the item $content (does nothing if there is no such pair).

    jupd:create-collection($name as string, $mode as string, $content as item()*)

    Creates a collection initialized with the provided items. The mode determines the kind of collection (e.g., a Hive metastore table, a delta lake file, etc).

    jupd:truncate-collection($name as string, $mode as string)

    Deletes the specified collection.

    jupd:edit($target as item(), $content as item())

    Modifies an item in a collection into another item, preserving its identity and location.

    jupd:delete-in-collection($target as item())

    Deletes the provided item from its collection.

    jupd:insert-first-into-collection($name as string, $mode as string, $content as item()*)

    Inserts the provided items at the very beginning of the specified collection.

    jupd:insert-last-into-collection($name as string, $mode as string, $content as item()*)

    Inserts the provided items at the very end of the specified collection.

    Error codes

    • [FOAR0001] - Division by zero.

    • [FOAR0002] - Numeric operation overflow/underflow.

    • [FOCA0002] - A value that is not lexically valid for a particular type has been encountered.

    • [FOCH0001] - Raised by fn:codepoints-to-string if the input contains an integer that is not the codepoint of a valid XML character.

    • [FOCH0003] - Raised by fn:normalize-unicode if the requested normalization form is not supported by the implementation.

    • [FODC0002] - Error retrieving resource.

    • [FODT0001] - Overflow/underflow in date/time operation.

    • [FODT0002] - Overflow/underflow in duration operation.

    • [FOFD1340] - This error is raised if the picture string or calendar supplied to fn:format-date, fn:format-time, or fn:format-dateTime has invalid syntax.

    • [FOFD1350] - This error is raised if the picture string supplied to fn:format-date selects a component that is not present in a date, or if the picture string supplied to fn:format-time selects a component that is not present in a time.

    • [FOTY0012] - The argument has no typed value (objects, arrays, functions cannot be atomized).

    • [JNTY0004] - Unexpected non-atomic element. Raised when objects or arrays are supplied where an atomic element is expected.

    • [JNTY0024] - Error getting the string value for array and object items.

    • [JNTY0018] - Invalid selector error code. It is a type error if there is not exactly one supplied parameter for an object or array selector.

    • [RBDY0005] - Materialization Error: the sequence is too big to be materialized. Use --materialization-cap to increase the maximum materialization size, or add an output path to write to.

    • [RBML0001] - Unrecognized RumbleDB ML class reference. An unrecognized class name was used in a query while accessing the RumbleDB ML API.

    • [RBML0002] - Unrecognized RumbleDB ML parameter reference. An unrecognized parameter was used in a query while operating with a RumbleDB ML class.

    • [RBML0003] - Invalid RumbleDB ML parameter. The provided parameter does not match the expected type or value for the referenced RumbleDB ML class.

    • [RBML0004] - Input is not a DataFrame. The provided input of items does not form a DataFrame as expected by RumbleDB ML.

    • [RBML0005] - Invalid schema for DataFrame in annotate(). The provided schema cannot be applied to the item data while converting the data to a DataFrame.

    • [RBST0001] - CLI error. Raised when invalid parameters are supplied at launch.

    • [RBST0002] - Unimplemented feature error. Raised when a JSONiq feature that is not yet implemented in RumbleDB is used.

    • [RBST0003] - Invalid for clause expression error. Raised when an expression produces a different, big sequence of items for each binding within a big tuple, which would lead to a data flow explosion and to a nesting of jobs on the Spark cluster.

    • [RBST0004] - Implementation Error.

    • [SENR0001] - Serialization error. Function items can not be serialized.

    • [XPDY0002] - It is a dynamic error if evaluation of an expression relies on some part of the dynamic context that is absent.

    • [XPDY0050] - Dynamic type treat error. It is a dynamic error if the dynamic type of the operand of a treat expression does not match the sequence type specified by the treat expression. This error might also be raised by a path expression beginning with "/" or "//" if the context node is not in a tree that is rooted at a document node. This is because a leading "/" or "//" in a path expression is an abbreviation for an initial step that includes the clause treat as document-node().

    • [XPDY0130] - Generic runtime exception [check error message].

    • [XPST0003] - Parsing error. Invalid syntax or unsupported feature in query.

    • [XPST0008] - Undefined element reference. It is a static error if an expression refers to an element name, attribute name, schema type name, namespace prefix, or variable name that is not defined in the static context, except for an ElementName in an ElementTest or an AttributeName in an AttributeTest.

    • [XPST0017] - Invalid function call error. It is a static error if the expanded QName and number of arguments in a static function call do not match the name and arity of a function signature in the static context.

    • [XPST0080] - Invalid cast error - It is a static error if the target type of a cast or castable expression is NOTATION, anySimpleType, or anyAtomicType.

    • [XPST0081] - Unknown namespace prefix - It is a static error if a QName used in a query contains a namespace prefix that cannot be expanded into a namespace URI by using the statically known namespaces.

    • [XPTY0004] - Unexpected Type Error. It is a type error if, during the static analysis phase, an expression is found to have a static type that is not appropriate for the context in which the expression occurs, or during the dynamic evaluation phase, the dynamic type of a value does not match a required type. Example: using subtraction on strings.

    • [XQDY0054] - It is a dynamic error if a cycle is encountered in the definition of a module's dynamic context components, for example because of a cycle in variable declarations.

    • [XQTY0024] - Attribute After Non Attribute Error - It is a type error if the content sequence in an element constructor contains an attribute node following a node that is not an attribute node.

    • [XQDY0025] - Duplicate Attribute Error - It is a dynamic error if any attribute of a constructed element does not have a name that is distinct from the names of all other attributes of the constructed element.

    • [XQDY0074] - Invalid Element Name Error - It is a dynamic error if the value of the name expression in a computed element or attribute constructor cannot be converted to an expanded QName (for example, because it contains a namespace prefix not found in statically known namespaces.)

    • [XQDY0096] - Invalid Node Name Error - It is a dynamic error if the node-name of a node constructed by a computed element constructor has any of the following properties: 1. Its namespace prefix is xmlns. 2. Its namespace URI is http://www.w3.org/2000/xmlns/. 3. Its namespace prefix is xml and its namespace URI is not http://www.w3.org/XML/1998/namespace. 4. Its namespace prefix is other than xml and its namespace URI is http://www.w3.org/XML/1998/namespace.

    • [XQDY0137] - Duplicate pair name. It is a dynamic error if two pairs in an object constructor or in a simple object union have the same name.

    • [XQST0016] - Module declaration error. The current implementation does not support the Module Feature and raises a static error if it encounters a module declaration or a module import.

    • [XQST0031] - Invalid JSONiq version. It is a static error if the version number specified in a version declaration is not supported by the implementation. For now, only version 1.0 is supported.

    • [XQST0033] - Namespace prefix bound twice. It is a static error if a module contains multiple bindings for the same namespace prefix.

    • [XQST0034] - Function already exists. It is a static error if multiple functions declared or imported by a module have the same number of arguments and their expanded QNames are equal (as defined by the eq operator).

    • [XQST0038] - It is a static error if a Prolog contains more than one default collation declaration, or the value specified by a default collation declaration is not present in statically known collations.

    • [XQST0039] - Duplicate parameter name. It is a static error for a function declaration or an inline function expression to have more than one parameter with the same name.

    • [XQST0047] - It is a static error if multiple module imports in the same Prolog specify the same target namespace.

    • [XQST0048] - It is a static error if a function or variable declared in a library module is not in the target namespace of the library module.

    • [XQST0049] - It is a static error if two or more variables declared or imported by a module have the same name.

    • [XQST0052] - Simple type error. The type must be the name of a type defined in the in-scope schema types, and the {variety} of the type must be simple.

    • [XQST0059] - It is a static error if an implementation is unable to process a schema or module import by finding a schema or module with the specified target namespace.

    • [XQST0069] - A static error is raised if a Prolog contains more than one empty order declaration.

    • [XQST0088] - It is a static error if the literal that specifies the target namespace in a module import or a module declaration is of zero length.

    • [XQST0089] - It is a static error if a variable bound in a for or window clause of a FLWOR expression, and its associated positional variable, do not have distinct names (expanded QNames).

    • [XQST0094] - Invalid variable in group-by clause. The name of each grouping variable must be equal (by the eq operator on expanded QNames) to the name of a variable in the input tuple stream.

    • [XQST0118] - In a direct element constructor, the name used in the end tag must exactly match the name used in the corresponding start tag, including its prefix or absence of a prefix.

    Inserting values into an object or array

    A JSON insert expression is used to insert new pairs into an object. It produces a jupd:insert-into-object update primitive. If the target is not an object, JNUP0008 is raised. If the content is not a sequence of objects, JNUP0019 is raised. These objects are merged prior to inserting the pairs into the target, and JNDY0003 is raised if the content to be inserted has colliding keys.

    Example
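
    A sketch consistent with the result below (the initial object is our assumption):

    copy $obj := { "foo" : "bar" }
    modify insert json { "bar" : 123, "foobar" : [ true, false ] } into $obj
    return $obj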

    Result: { "foo" : "bar", "bar" : 123, "foobar" : [ true, false ] }

    A JSON insert expression is also used to insert a new member into an array. It produces a jupd:insert-into-array update primitive. If the target is not an array, JNUP0008 is raised. If the position is not an integer, JNUP0007 is raised.

    Example
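
    A sketch consistent with the result below (initial array assumed); the member 5 lands before the original third member:

    copy $obj := { "foo" : [ 1, 2, 3, 4 ] }
    modify insert json 5 into $obj.foo at position 3
    return $obj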

    Result: { "foo" : [ 1, 2, 5, 3, 4 ] }

    Deleting values in an object or array

    A JSON delete expression is used to remove a pair from an object. It produces a jupd:delete-from-object update primitive. If the key is not a string, JNUP0007 is raised. If the key does not exist, JNUP0016 is raised.

    Example
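
    A sketch consistent with the result below (the initial object is our assumption):

    copy $obj := { "foo" : "bar", "bar" : 123 }
    modify delete json $obj.foo
    return $obj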

    Result: { "bar" : 123 }

    A JSON delete expression is also used to remove a member from an array. It produces a jupd:delete-from-array update primitive. If the position is not an integer, JNUP0007 is raised. If the position is out of range, JNUP0016 is raised.

    Example
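
    A sketch consistent with the result below (initial array assumed):

    copy $arr := [ 1, 2, 3, 4, 5, 6 ]
    modify delete json $arr[[3]]
    return $arr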

    Result: [ 1, 2, 4, 5, 6 ]

    Renaming a key

    A JSON rename expression is used to rename a key in an object. It produces a jupd:rename-in-object update primitive. If the sequence on the left of the dot is not a single object, JNUP0008 is raised. If the new name is not a single string, JNUP0007 is raised. If the old key does not exist, JNUP0016 is raised.

    Example
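
    A sketch consistent with the result below (the initial object is our assumption; the serialized order of keys may vary):

    copy $obj := { "foo" : "bar", "bar" : 123 }
    modify rename json $obj.foo as "foobar"
    return $obj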

    Result: { "bar" : 123, "foobar" : "bar" }

    Appending values to an array

    A JSON append expression is used to add a new member at the end of an array. It produces a jupd:insert-into-array update primitive. JNUP0008 is raised if the target is not an array.

    Example
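
    A sketch consistent with the result below (the initial object is our assumption):

    copy $obj := { "foo" : "bar", "bar" : [ 1, 2, 3 ] }
    modify append json 4 into $obj.bar
    return $obj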

    Result: { "foo" : "bar", "bar" : [ 1, 2, 3, 4 ] }

    Replacing a value in an object or array.

    A JSON replace expression is used to replace the value associated with a certain key in an object. It produces a jupd:replace-in-object update primitive. JNUP0007 is raised if the selector is not a single string. If the selector key does not exist, JNUP0016 is raised.

    Example
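
    A sketch consistent with the result below (the initial object is our assumption):

    copy $obj := { "bar" : [ 1, 2, 3 ], "foo" : "bar" }
    modify replace value of json $obj.foo with { "nested" : true }
    return $obj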

    Result: { "bar" : [ 1, 2, 3 ], "foo" : { "nested" : true } }

    A JSON replace expression is also used to replace a member in an array. It produces a jupd:replace-in-array update primitive. JNUP0007 is raised if the selector is not a single position. If the selector position is out of range, JNUP0016 is raised.

    Example
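
    A sketch consistent with the result below (the initial object is our assumption):

    copy $obj := { "foo" : "bar", "bar" : [ 1, 2, 3 ] }
    modify replace value of json $obj.bar[[2]] with "two"
    return $obj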

    Result: { "foo" : "bar", "bar" : [ 1, "two", 3 ] }

    Update expressions at the collection top-level (persistent only)

    These expressions may not appear in a copy-modify-return (transform) expression because they can only be used for persistent updates to an underlying storage (document store, data lakehouse, etc).

    Creating a collection

    This expression creates an update primitive that creates a collection.

    Example

    Deleting a collection

    This expression creates an update primitive that deletes a collection.

    Example

    Inserting into a collection

    This expression creates an update primitive that inserts values at the beginning or end of a collection, or before or after specific values in that collection.

    Example

    Editing a value in a collection

    This expression creates an update primitive that modifies a value in a collection into the other supplied value.

    Example

    Deleting a value from a collection

    This expression creates an update primitive that deletes a specified value from its collection.

    Example

    Setters

    Setters allow specifying a default behaviour for various aspects of the language.

    Default collation

    DefaultCollationDecl

    This specifies the default collation used for grouping and ordering clauses in FLWOR expressions. It can be overridden with a collation directive in these clauses.

    Default ordering mode

    OrderingModeDecl

    This specifies the default behaviour of for clauses, i.e., whether they bind tuples in the order in which items occur in the binding sequence. It can be overridden with ordered and unordered expressions.

    Default ordering behaviour for empty sequences

    EmptyOrderDecl

    This specifies whether empty sequences come first or last in an ordering clause. It can be overridden by the corresponding directives in such clauses.

    Default decimal format

    DecimalFormatDecl

    DFPropertyName

    This specifies a default decimal format for the builtin function format-number().

    Global variables

    VarDecl

    Variables can be declared global. Global variables are declared in the prolog.

    Global variable
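
    A sketch consistent with the result below (the variable name is our choice):

    declare variable $obj := { "foo" : "bar" };
    $obj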

    Result (run with Zorba):{ "foo" : "bar" }

    Global variable
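
    A sketch consistent with the result below (the variable name is our choice):

    declare variable $arr := [ 1, 2, 3, 4, 5 ];
    $arr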

    Result (run with Zorba):[ 1, 2, 3, 4, 5 ]

    You can specify a type for a variable. If the type does not match, an error is raised. Types will be explained later. In general, you do not need to worry too much about variable types except if you want to make sure that what you bind to a variable is really what you want. In most cases, the engine will take care of types for you.

    Global variable with a type
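
    A sketch consistent with the result below, with a declared type (the variable name is our choice):

    declare variable $obj as object := { "foo" : "bar" };
    $obj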

    Result (run with Zorba):{ "foo" : "bar" }

    An external variable allows you to pass a value from the outside environment, which can be very useful. Each implementation can choose their own way of passing a value to an external variable. A default value for an external variable can also be supplied in case none is provided outside.

    An external global variable
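
    A sketch consistent with the error below: the external variable is given no value by the environment:

    declare variable $obj external;
    $obj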

    Result (run with Zorba):An error was raised: "obj": variable has no value

    An external global variable with a default value
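
    A sketch consistent with the result below, where the default value kicks in:

    declare variable $obj external := { "foo" : "bar" };
    $obj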

    Result (run with Zorba):{ "foo" : "bar" }

    Functions

    FunctionDecl

    You can define your own functions in the prolog. These user-defined functions must be prefixed with local:, both in the declaration and when called.

    Remember that types are optional, and if you do not specify any, item* is assumed, both for parameters and for the return type.

    A user-defined function
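
    A sketch consistent with the result below (the function and parameter names are our choice):

    declare function local:say-hello($x) {
      "Hello, " || $x || "!"
    };
    local:say-hello("Mister Spock")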

    Result (run with Zorba):Hello, Mister Spock!

    A user-defined function with a typed parameter
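
    A sketch consistent with the result below, this time with a declared parameter type:

    declare function local:say-hello($x as string) {
      "Hello, " || $x || "!"
    };
    local:say-hello("Mister Spock")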

    Result (run with Zorba):Hello, Mister Spock!

    A user-defined function with typed parameter and return type
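
    A sketch consistent with the result below, with both a parameter type and a return type:

    declare function local:say-hello($x as string) as string {
      "Hello, " || $x || "!"
    };
    local:say-hello("Mister Spock")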

    Result (run with Zorba):Hello, Mister Spock!

    If you do specify types, an error is raised in case of a mismatch. Without type annotations, however, any item is accepted:

    A user-defined function without type annotations
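
    A sketch consistent with the result below: with no declared parameter type, an integer argument is accepted and converted to a string by the concatenation operator:

    declare function local:say-hello($x) {
      "Hello, " || $x || "!"
    };
    local:say-hello(1)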

    Result (run with Zorba):Hello, 1!

    The JSONiq type system

    This section describes JSONiq types as well as the sequence type syntax.

    JSONiq manipulates semi-structured data: in general, JSONiq allows you, but does not require you to specify types. So you have as much or as little type verification as you wish.

    JSONiq is still strongly typed, so that you will be told if there is a type inconsistency or mismatch in your programs.

    Whenever you do not specify the type of a variable or the type signature of a function, the most general type for any sequence of items, item*, is assumed.

    Section Expressions dealing with types introduces expressions which work with values of these types, as well as type operations (variable types, casts, ...).

    Sequence types

JSONiq follows the W3C standard regarding sequence occurrence indicators. The following explanations, provided as an informal summary for convenience, are non-normative.

    A sequence is an ordered list of items.

    All sequences match the sequence type js:item*.

    A sequence type is made of an item type followed by an occurrence indicator:

    • The symbol * (star) stands for a sequence of any length (zero or more)

    • The symbol + (plus) stands for a non-empty sequence (one or more)

    • The symbol ? (question mark) stands for an empty or a singleton sequence (zero or one)

    • The absence of indicator stands for a singleton sequence (one).

    Examples:

    • string matches any singleton sequence containing a string.

    • item+ matches any non-empty sequence.

    • object? matches the empty sequence and any sequence containing one object.

    JSONiq defines the syntax () for the empty sequence, rather than empty-sequence().
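As an informal illustration, each of the following expressions (a small sketch) checks a sequence against a sequence type with the instance of expression, which is introduced later in this document:

    (1, 2, 3) instance of integer+,
    () instance of string?,
    ("foo", "bar") instance of string

This returns true true false: a two-string sequence does not match the singleton type string.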

    SequenceType

    Item types

    Item types are the first component of a sequence type, together with the cardinality indicator. Thus, an item type matches (or not) a single item. For example, "foo" matches the item type xs:string.

    There are three categories of item types:

    • Atomic types (W3C-conformant, additional js:null and js:atomic)

    • Structured types (JSONiq-specific)

    • Function types (W3C-conformant)

JSONiq uses a JSONiq-specific, implementation-defined default type namespace that acts as a proxy namespace to all types (xs: or js:). As a consequence, builtin atomic types do not need to be prefixed in the JSONiq syntax (integer instead of xs:integer, null instead of js:null).

All items match the item type js:item, which is a JSONiq-specific synonym for the W3C-conformant item().

    ItemType

    Atomic types

JSONiq follows the W3C standard for atomic types except for modifications in the list of available atomic types and a simplified syntax for xs:anyAtomicType. The following explanations, provided as an informal summary for convenience, are non-normative.

    Atomic types are organized in a tree hierarchy.

JSONiq defines the following builtin types that have a direct relation with JSON:

    • xs:string: the value space is all strings made of Unicode characters.

      All string literals build an atomic which matches string.

    • xs:integer (W3C-conformant): the value space is that of all mathematical integral numbers (N), with an infinite range. This is a subtype of decimal, so that all integers also match the item type decimal.

      All integer literals build an atomic which matches integer.

    • xs:decimal (W3C-conformant): the value space is that of all mathematical decimal numbers (D), with an infinite range.

JSONiq also supports further atomic types, which are conformant with XML Schema.

    These datatypes are already used as a set of atomic datatypes by the other two semi-structured data formats of the Web: XML and RDF, as well as by the corresponding query languages: XQuery and SPARQL, so it is natural for a complete JSON data model to reuse them.

• Further number types: xs:float, xs:long, xs:int, xs:short, xs:byte, xs:positiveInteger, xs:negativeInteger, xs:nonPositiveInteger, xs:nonNegativeInteger, xs:unsignedLong, xs:unsignedInt, xs:unsignedShort, xs:unsignedByte.

• Date or time types: xs:date, xs:dateTime, xs:dateTimeStamp, xs:gDay, xs:gMonth, xs:gMonthDay, xs:gYear, xs:gYearMonth, xs:time.

    • Duration types: xs:duration, xs:dayTimeDuration, xs:yearMonthDuration.

    • Binary types: xs:base64Binary, xs:hexBinary.

    The support of xs:ID, xs:IDREF, xs:IDREFS, xs:NOTATION, xs:Name, xs:NCName, xs:NMTOKEN, xs:NMTOKENS, xs:ENTITY, xs:ENTITIES is not required by JSONiq, although engines that also support XML can support them.
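As a quick, informal illustration of this hierarchy (a sketch; type expressions and constructors are covered later), items match the item types of their supertypes as well:

    42 instance of decimal,
    3.14e0 instance of double,
    date("2015-02-03") instance of date,
    dayTimeDuration("PT1H30M") instance of duration

All four expressions return true: in particular, the integer 42 also matches decimal, and a dayTimeDuration also matches duration.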

    AtomicType

    Structured types

    JSONiq introduces four more types for matching objects and arrays. Like atomic types, they do not need the js: prefix in the syntax (object instead of js:object, etc.).

    All objects match the item type js:object.

    All arrays match the item type js:array.

    All objects and arrays match the item type js:json-item.

For engines that also optionally support XML, js:structured-item matches both XML nodes and JSON objects and arrays.
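For instance, the following sketch illustrates these item types:

    { "foo" : "bar" } instance of object,
    [ 1, 2, 3 ] instance of array,
    [ 1, 2, 3 ] instance of json-item,
    "foo" instance of json-item

This returns true true true false, as a string is an atomic, not a JSON item.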

    StructuredType

    Function types

JSONiq follows the W3C standard regarding function types. The following explanations are non-normative.

    FunctionType

    AnyFunctionType

    TypedFunctionType
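As an informal sketch (assuming support for inline function expressions and higher-order functions, as in XQuery 3.1), a variable can be annotated with a function type:

    let $f as function(integer) as integer := function($x as integer) as integer { $x + 1 }
    return ($f instance of function(*), $f(41))

This is expected to return true 42.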

    User-defined types

    RumbleDB now supports user-defined array and object types both with the JSound compact syntax and the JSound verbose syntax.

    JSound Schema Compact syntax

RumbleDB user-defined types can be defined with the JSound syntax. A tutorial for the JSound syntax can be found here.

For now, RumbleDB only allows the definition of user-defined types for objects and arrays. User-defined atomic types and union types will follow soon. The @ (primary key) and ? (nullable) shortcuts are supported as of version 2.0.5. The behavior of nulls with absent vs. nullable fields can be tweaked in the configuration (e.g., if a null is present in an optional, non-nullable field, RumbleDB can be lenient and simply remove it instead of throwing an error).

The implementation is still experimental and bugs are to be expected; we would appreciate being informed of any you encounter.

    Notes

    Sequences vs. Arrays

Even though JSON supports arrays, JSONiq uses a different construct as its first-class citizen: sequences. Any value returned by or passed to an expression is a sequence.

    The main difference between sequences and arrays is that sequences are completely flat, meaning they cannot contain other sequences.

    Since sequences are flat, expressions of the JSONiq language just concatenate them to form bigger sequences.

    This is crucial to allow streaming results, for example through an HTTP session.

    copy $obj := { "foo" : "bar" }
    modify insert json { "bar" : 123, "foobar" : [ true, false ] } into $obj
    return $obj
          
    copy $arr := { "foo" : [1,2,3,4] }
    modify insert json 5 into $arr.foo at position 3
    return $arr
          
    copy $obj := { "foo" : "bar", "bar" : 123 }
    modify delete json $obj.foo
    return $obj
          
    copy $arr := [1,2,3,4,5,6]
    modify delete json $arr[[3]]
    return $arr
          
    copy $obj := { "foo" : "bar", "bar" : 123 }
    modify rename json $obj.foo as "foobar"
    return $obj
          
    copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
    modify append json 4 into $obj.bar
    return $obj
          
    copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
    modify replace value of json $obj.foo with { "nested" : true }
    return $obj
          
    copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
    modify replace value of json $obj.bar[[2]] with "two"
    return $obj
          
    create collection table("mytable") with ({"foo":1},{"foo":2}),
    create collection delta-file("/path/to/file.delta") with ({"foo":1},{"foo":2})
    delete collection table("mytable"),
    delete collection delta-file("/path/to/file.delta")
    insert {"foo":3} first into collection table("mytable"),
    insert {"foo":4} last into collection delta-file("/path/to/file.delta"),
    insert {"foo":3} before table("mytable")[3] into collection,
    insert {"foo":3} after delta-file("/path/to/file.delta")[3] into collection
    edit table("mytable")[1] into {"foo":3} in collection
    delete table("mytable")[1] from collection
    
      declare variable $obj := { "foo" : "bar" };
      $obj
          
    
      declare variable $numbers := (1, 2, 3, 4, 5);
      [ $numbers ]
          
    
      declare variable $obj as object := { "foo" : "bar" };
      $obj
          
    
      declare variable $obj external;
      $obj
          
    
      declare variable $obj external := { "foo" : "bar" };
      $obj
          
    
    declare function local:say-hello($x) { "Hello, " || $x || "!" };
    local:say-hello("Mister Spock")
          
    
    declare function local:say-hello($x as string) { "Hello, " || $x || "!" };
    local:say-hello("Mister Spock")
          
    
    declare function local:say-hello($x as string) as string { "Hello, " || $x || "!" };
    local:say-hello("Mister Spock")
          
    
    declare function local:say-hello($x) { "Hello, " || $x || "!" }; 
    local:say-hello(1)
          

    All decimal literals build an atomic which matches decimal.

  • xs:double (W3C-conformant): the value space is that of all IEEE double-precision 64-bit floating point numbers.

    All double literals build an atomic which matches double.

  • xs:boolean (W3C-conformant): the value space contains the booleans true and false.

    All boolean literals build an atomic which matches boolean.

  • js:null (JSONiq-specific): the value space is a singleton and only contains null.

    All null literals build an atomic which matches null.

  • js:atomic (JSONiq-specific synonym of, and W3C-conformant with, xs:anyAtomicType): all atomic types.

    All literals build an atomic which matches atomic.

• A URI type: xs:anyURI.

    Type declaration

    A new type can be declared in the prolog, at the same location where you also define global variables and user-defined functions.

    In the above query, although the type is defined, the query returns an object that was not validated against this type.

    Type declaration

    To validate and annotate a sequence of objects, you need to use the validate-type expression, like so:

    You can use user-defined types wherever other types can appear: as type annotation for FLWOR variables or global variables, as function parameter or return types, in instance-of or treat-as expressions, etc.

    You can validate larger sequences

    You can also validate, in parallel, an entire JSON Lines file, like so:

    Optional vs. required fields

By default, fields are optional:

    You can, however, make a field required by adding a ! in front of its name:

    Or you can provide a default value with the equal sign:

    Extra fields

Extra fields are rejected. The verbose JSound syntax, however, allows extra fields (open objects); this will be supported in a future version of RumbleDB.

    Nested arrays

With the JSound compact syntax, you can easily define nested array structures:

    You can even further nest objects:

    Or split your definitions into several types that refer to each other:

    DataFrames

    In fact, RumbleDB will internally convert the sequence of objects to a Spark DataFrame, leading to faster execution times.

In other words, the JSound Compact Schema Syntax is perfect for defining DataFrame schemas!

    Verbose syntax

    For advanced JSound features, such as open object types or subtypes, the verbose syntax must be used, like so:

    The JSound type system, as its name indicates, is sound: you can only make subtypes more restrictive than the super type. The complete specification of both syntaxes is available on the JSound website.

In the future, RumbleDB will support user-defined atomic types and union types via the verbose syntax.

    What's next?

Once you have validated your data as a DataFrame with a user-defined type, you are all set to use the RumbleDB Machine Learning library and feed it through ML pipelines!

    Flat sequences

    Result (run with Zorba):1 2 3 4

Arrays, on the other hand, can contain nested arrays, like in JSON.

    Nesting arrays

    Result (run with Zorba):[ [ 1, 2 ], [ 3, 4 ] ]

Many expressions return single items. Actually, they return a singleton sequence, but a singleton sequence is considered the same as the single item it contains.

    Singleton sequences

    Result (run with Zorba):2

    This is different for arrays: a singleton array is distinct from its unique member, like in JSON.

Singleton arrays

    Result (run with Zorba):[ 2 ]

    An array is a single item. A (non-singleton) sequence is not. This can be observed by counting the number of items in a sequence.

    count() on an array

    Result (run with Zorba):1

    count() on a sequence

    Result (run with Zorba):4

    Other than that, arrays and sequences can contain exactly the same members (atomics, arrays, objects).

    Members of an array

    Result (run with Zorba):[ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]

Members of a sequence

    Result (run with Zorba):1 foo [ 1, 2, 3, 4 ] { "foo" : "bar" }

    Arrays can be converted to sequences, and vice-versa.

    Converting an array to a sequence

    Result (run with Zorba):1 foo [ 1, 2, 3, 4 ] { "foo" : "bar" }

    Converting a sequence to an array

    Result (run with Zorba):[ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]

    Null vs. empty sequence

    Null and the empty sequence are two different concepts.

Null is an item (an atomic value), and can be a member of an array or of a sequence, or the value associated with a key in an object. The empty sequence cannot, as it represents the absence of any item.

    Null values in an array

    Result (run with Zorba):[ null, 1, null, 2 ]

    Null values in an object

    Result (run with Zorba):{ "foo" : null }

    Null values in a sequence

    Result (run with Zorba):null 1 null 2

    If an empty sequence is found as an object value, it is automatically converted to null.

    Automatic conversion to null.

    Result (run with Zorba):{ "foo" : null }

In an arithmetic operation or a comparison, if an operand is an empty sequence, an empty sequence is returned. If an operand is null, an error is raised except for equality and inequality.

    Empty sequence in an arithmetic operation.

    Result (run with Zorba):

    Null in an arithmetic operation.

    Result (run with Zorba):An error was raised: arithmetic operation not defined between types "js:null" and "xs:integer"

    Null and empty sequence in an arithmetic operation.

    Result (run with Zorba):

    Empty sequence in a comparison.

    Result (run with Zorba):

    Null in a comparison.

    Result (run with Zorba):false

    Null in a comparison.

    Result (run with Zorba):true

    Null and the empty sequence in a comparison.

    Result (run with Zorba):

    Null and the empty sequence in a comparison.

    Result (run with Zorba):

    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    { "foo" : "this is a string", "bar" : 42 }
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    declare function local:proj($x as local:my-type+) as string*
    {
      $x.foo
    };
    
    let $a as local:my-type* := validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 }
    }
    return if($a instance of local:my-type*)
           then local:proj($a)
           else "Not an instance."
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 },
      { "foo" : "this is another string", "bar" : 1 },
      { "foo" : "this is yet another string", "bar" : 2 },
      { "foo" : "this is a string", "bar" : 12 },
      { "foo" : "this is a string", "bar" : 42345 },
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    validate type local:my-type* {
      json-lines("hdfs:///directory-file.json")
    }
    declare type local:my-type as {
      "foo" : "string",
      "bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 },
      { "bar" : 1 },
      { "foo" : "this is yet another string", "bar" : 2 },
      { "foo" : "this is a string" },
      { "foo" : "this is a string", "bar" : 42345 },
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string",
      "!bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 },
      { "bar" : 1 },
      { "foo" : "this is yet another string", "bar" : 2 },
      { "foo" : "this is a string", "bar" : 1234 },
      { "foo" : "this is a string", "bar" : 42345 },
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string=foobar",
      "!bar" : "integer"
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : 42 },
      { "bar" : 1 },
      { "foo" : "this is yet another string", "bar" : 2 },
      { "foo" : "this is a string", "bar" : 1234 },
      { "foo" : "this is a string", "bar" : 42345 },
      { "foo" : "this is a string", "bar" : 42 }
    }
    declare type local:my-type as {
      "foo" : "string",
      "!bar" : [ "integer" ]
    };
    
    validate type local:my-type* {
      { "foo" : "this is a string", "bar" : [ 42, 1234 ] },
      { "bar" : [ 1 ] },
      { "foo" : "this is yet another string", "bar" : [ 2 ] },
      { "foo" : "this is a string", "bar" : [ ] },
      { "foo" : "this is a string", "bar" : [ 1, 2, 3, 4, 5, 6 ] },
      { "foo" : "this is a string", "bar" : [ 42 ] }
    }
    declare type local:my-type as {
      "foo" : { "bar" : "integer" },
      "!bar" : [ { "first" : "string", "last" : "string" } ]
    };
    
    validate type local:my-type* {
      {
        "foo" : { "bar" : 1 },
        "bar" : [
          { "first" : "Albert", "last" : "Einstein" },
          { "first" : "Erwin", "last" : "Schrodinger" }
        ]
      },
      {
        "foo" : { "bar" : 2 },
        "bar" : [
          { "first" : "Alan", "last" : "Turing" },
          { "first" : "John", "last" : "Von Neumann" }
        ]
      },
      {
        "foo" : { "bar" : 3 },
        "bar" : [
        ]
      }
    }
    declare type local:person as {
      "first" : "string",
      "last" : "string"
    };
    
    declare type local:my-type as {
      "foo" : { "bar" : "integer" },
      "!bar" : [ "local:person" ]
    };
    
    validate type local:my-type* {
      {
        "foo" : { "bar" : 1 },
        "bar" : [
          { "first" : "Albert", "last" : "Einstein" },
          { "first" : "Erwin", "last" : "Schrodinger" }
        ]
      },
      {
        "foo" : { "bar" : 2 },
        "bar" : [
          { "first" : "Alan", "last" : "Turing" },
          { "first" : "John", "last" : "Von Neumann" }
        ]
      },
      {
        "foo" : { "bar" : 3 },
        "bar" : [
        ]
      }
    }
    declare type local:x as jsound verbose {
      "kind" : "object",
      "baseType" : "object",
      "content" : [
        { "name" : "foo", "type" : "integer" }
      ],
      "closed" : false
    };
    
    declare type local:y as jsound verbose {
      "kind" : "object",
      "baseType" : "local:x",
      "content" : [
        { "name" : "bar", "type" : "date" }
      ],
      "closed" : true
    };
    
    ( (1, 2), (3, 4) )
          
    
    [ [ 1, 2 ], [ 3, 4 ] ]
          
    
    1 + 1
          
    
    [ 1 + 1 ]
          
    
    count([ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ])
          
    
    count( ( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ) )
          
    
    [ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]
          
    
    ( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } )
          
    
    [ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ] []
          
    
    [ ( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ) ]
          
    
    [ null, 1, null, 2 ]
          
    
    { "foo" : null }
          
    
    (null, 1, null, 2)
          
    
    { "foo" : () }
          
    
    () + 2
          
    
    null + 2
          
    
    null + ()
          
    
    () eq 2
          
    
    null eq 2
          
    
    null lt 2
          
    
    null eq ()
          
    
    null lt ()
          

    JSONiq coverage

    RumbleDB relies on the JSONiq language.

    JSONiq reference

The complete specification can be found here and on the JSONiq.org website. The implementation is now in a very advanced stage and only a few core JSONiq features remain unsupported.

    JSONiq tutorial

A tutorial can be found here. All queries in this tutorial will work with RumbleDB.

    JSONiq tutorial for Python users

A tutorial aimed at Python users can be found here. Please keep in mind, though, that examples using unsupported features may not work (see below).

    Nested FLWOR expressions

    FLWOR expressions now support nestedness, for example like so:

    However, keep in mind that parallelization cannot be nested in Spark (there cannot be a job within a job), that is, the following will not work:

    Expressions pushed down to Spark

    Many expressions are pushed down to Spark out of the box. For example, this will work on a large file leveraging the parallelism of Spark:

    What is pushed down so far is:

    • FLWOR expressions (as soon as a for clause is encountered, binding a variable to a sequence generated with json-lines() or parallelize())

    • aggregation functions such as count

    • JSON navigation expressions: object lookup (as well as keys() call), array lookup, array unboxing, filtering predicates

• type checking (instance of, treat as)

    • many builtin function calls (head, tail, exists, etc.)

    • predicates on positions, including use of the context-dependent functions position() and last(), e.g.,

    More expressions working on sequences will be pushed down in the future, prioritized on the feedback we receive.

We also started to push down some expressions to DataFrames and Spark SQL (obtained via structured-json-lines, csv-file and parquet-file calls). In particular, keys() pushes down the schema lookup if used on parquet-file() and structured-json-lines(). Likewise, count() as well as object lookup, array unboxing, and array lookup are also pushed down on DataFrames.

When an expression does not support pushdown, it will materialize automatically. To avoid issues, the materialization is capped by default at 200 items, but this can be changed on the command line with --materialization-cap. A warning is issued if a materialization happened and the sequence was truncated on screen. An error is thrown if this happens within a query.

    External global variables.

    Prologs with user-defined functions and global variables are supported. Global external variables are supported (use "--variable:foo bar" on the command line to assign values to them). If the declared type is not string, then the literal supplied on the command line is cast. If the declared type is anyURI, the path supplied on the command line is also resolved against the working directory to an absolute URI. Thus, anyURI should be used to supply paths dynamically through an external variable.

    Context item declarations are supported and a global context item value can be passed with the "--context-item" or "-I" parameter on the command line.
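For example, a query like the following sketch can be run with --variable:foo bar on the command line; $limit falls back to its default value if none is supplied:

    declare variable $foo external;
    declare variable $limit as integer external := 100;

    "foo is " || $foo || ", limit is " || $limit

With --variable:foo bar, this returns foo is bar, limit is 100.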

    Library modules

    Library modules are now supported (experimental, please report bugs), and their namespace URI is used for resolution. If it is relative, it is resolved against the importing module location.

    The same schemes are supported as for reading queries and data: file, hdfs, and so on. HTTP is also supported: you can import modules from the Web!

    Example of library module (the file name is library-module.jq):

    Example of importing module (assuming it is in the same directory):

    Try/catch

Try/catch expressions are supported. Error codes are in the default RumbleDB namespace and do not need prefixes.

    Supported types

    The JSONiq type system is fully supported. Below is a complete list of JSONiq types and their support status. All builtin types are in the default type namespace, so that no prefix is needed. These types are defined in the XML Schema standard. Note that some types specific to XML (e.g., NOTATION, NMTOKENS, NMTOKEN, ID, IDREF, ENTITY, etc) are not part of the JSONiq standard and not supported by RumbleDB.

All of the following builtin types are supported: anyAtomicType, anyURI, base64Binary, boolean, byte, date, dateTime, dateTimeStamp, dayTimeDuration, decimal, double, duration, float, gDay, gMonth, gYear, gYearMonth, hexBinary, int, integer, long, negativeInteger, nonPositiveInteger, nonNegativeInteger, numeric, positiveInteger, short, string, time, unsignedByte, unsignedInt, unsignedLong, unsignedShort, yearMonthDuration.

    The type atomic is available in JSONiq 1.0 only.

    Unsupported/Unimplemented features (beta release)

    Most core features of JSONiq are now in place, and we are working on getting the last (less used) ones into RumbleDB as well. We prioritize their implementation on user requests.

    Prolog

    Some prolog settings (base URI, ordering mode, decimal format, namespace declarations) are not supported yet.

    Location hints for the resolution of modules are not supported yet.

    FLWOR features

    Window clauses are not supported, because they are not compatible with the Spark execution model.

    Function types

    Function type syntax is supported.

    Function annotations are not supported (%public, %private...), but this is planned.

    Builtin functions

    Most JSONiq and XQuery builtin functions are now supported (see function documentation), except XML-specific functions. A few are still missing, do not hesitate to reach out if you need them.

    Constructors for atomic types are fully supported.

Builtin functions cannot yet be used with named function reference expressions (example: concat#2).

    Error variables

Error variables ($err:code, ...) inside catch blocks are not supported.

    Updates and scripting

    There are future plans to support JSONiq updates and scripting.


    let $x := for $x in json-lines("file.json")
              where $x.field eq "foo"
              return $x
    return count($x)
    for $x in json-lines("file1.json")
    let $z := for $y in json-lines("file2.json")
              where $y.foo eq $x.fbar
              return $y
    return count($z)
    count(json-lines("file.json")[$$.field eq "foo"].bar[].foo[[1]])
    json-lines("file.json")[position() ge 10 and position() le last() - 2]
    module namespace m = "library-module.jq";
    
    declare variable $m:x := 2;
    
declare function m:func($v) {
      $m:x + $v
    };
    import module namespace mod = "library-module.jq";
    
    mod:func($mod:x)
    try { 1 div 0 } catch FOAR0001 { "Division by zero!" }

    Configuration parameters

    The parameters that can be used on the command line as well as on the planned HTTP server are shown below. They are also accessible via the Java API and via Python through the RumbleRuntimeConfiguration class.

RumbleDB runs in three modes: run, serve, and repl. You can select the mode by passing the corresponding verb as the first parameter. For example:

       spark-submit rumbledb.jar run file.jq -o output-dir -P 1
       spark-submit rumbledb.jar run -q '1+1'
       spark-submit rumbledb.jar serve -p 8001
       spark-submit rumbledb.jar repl -c 10

    Previous parameters (--shell, --query-path, --server) work in a backward-compatible fashion; however, we recommend starting to use the new verb-based format.

| Shell parameter | Shortcut | HTTP parameter | Example values | Semantics |
    | --- | --- | --- | --- | --- |
    | --shell | repl | N/A | yes, no | yes runs the interactive shell. no executes a query specified with --query-path |
    | --shell-filter | N/A | N/A | jq . | Post-processes the output of JSONiq queries on the shell with the specified command (reading the RumbleDB output via stdin) |
    | --query | -q | query | 1+1 | A JSONiq query directly provided as a string. |
    | --query-path | (any text without -- or - is recognized as a query path) | query-path | file:///folder/file.jq | A JSONiq query file to read from (from any file system, even the Web!). |
    | --output-path | -o | output-path | file:///folder/output | Where to output to (if the output is large, it will create a sharded directory, otherwise it will create a file) |
    | --output-format | -f | N/A | json, csv, avro, parquet, or any other format supported by Spark | An output format to use for the output. Formats other than json can only be output if the query outputs a highly structured sequence of objects (you can nest your query in an annotate() call to specify a schema if it does not). |
    | --output-format-option:foo | N/A | N/A | bar | Options to further specify the output format (example: separator character for CSV, compression format...) |
    | --overwrite | -O (meaning --overwrite yes) | overwrite | yes, no | Whether to overwrite --output-path. no throws an error if the output file/folder exists. |
    | --materialization-cap | -c | materialization-cap | 100000 | A cap on the maximum number of items to materialize during the query execution for large sequences within a query. For example, when nesting an expression producing a large sequence of items (and that RumbleDB chose to physically store as an RDD or DataFrame) into an array constructor. |
    | --result-size | | result-size | 10 | A cap on the maximum number of items to output on the screen or to a local list. |
    | --number-of-output-partitions | -P | N/A | ad hoc | How many partitions to create in the output, i.e., the number of files that will be created in the output path directory. |
    | --log-path | N/A | log-path | file:///folder/log.txt | Where to output log information |
    | --print-iterator-tree | N/A | N/A | yes, no | For debugging purposes, prints out the expression tree and runtime iterator tree. |
    | --show-error-info | -v (meaning --show-error-info yes) | show-error-info | yes, no | For debugging purposes. If you want to report a bug, you can use this to get the full exception stack. If no, then only a short message is shown in case of error. |
    | --static-typing | -t (meaning --static-typing yes) | static-typing | yes, no | Activates static type analysis, which annotates the expression tree with inferred types at compile time and enables more optimizations (experimental). Deactivated by default. |
    | --server | serve | N/A | yes, no | yes runs RumbleDB as a server on port 8001. Run queries with http://localhost:8001/jsoniq?query-path=/folder/foo.json |
    | --port | -p | N/A | 8001 (default) | Changes the port of the RumbleDB HTTP server to any of your liking |
    | --host | -h | N/A | localhost (default) | Changes the host of the RumbleDB HTTP server to any of your liking |
    | --variable:foo | N/A | variable:foo | bar | --variable:foo bar initializes the global variable $foo to "bar". The query must contain the corresponding global variable declaration, e.g., declare variable $foo external; |
    | --context-item | -I | context-item | bar | Initializes the global context item $$ to "bar". The query must contain the corresponding declaration, e.g., declare context item external; |
    | --context-item-input | -i | context-item-input | - | Reads the context item value from the standard input |
    | --context-item-input-format | N/A | context-item-input-format | text or json | Sets the input format to use for parsing the standard input (as text or as a serialized JSON value) |
    | --dates-with-timezone | N/A | dates-with-timezone | yes or no | Activates timezone support for the type xs:date (deactivated by default) |
    | --lax-json-null-validation | N/A | lax-json-null-validation | yes or no | Allows conflating JSON nulls with absent values when validating nillable object fields for more flexibility (activated by default). |
    | --optimize-general-comparison-to-value-comparison | N/A | optimize-general-comparison-to-value-comparison | yes or no | Activates automatic conversion of general comparisons to value comparisons when applicable (activated by default) |
    | --function-inlining | N/A | function-inlining | yes or no | Activates function inlining for non-recursive functions (activated by default) |
    | --parallel-execution | N/A | parallel-execution | yes or no | Activates parallel execution when possible (activated by default) |
    | --native-execution | N/A | native-execution | yes or no | Activates native (Spark SQL) execution when possible (activated by default) |
    | --default-language | N/A | N/A | jsoniq10, jsoniq31, xquery31 | Specifies the query language to be used |
    | --optimize-steps | N/A | N/A | yes or no | Allows RumbleDB to optimize steps; might violate stability of document order (activated by default) |
    | --optimize-steps-experimental | N/A | N/A | yes or no | Experimentally optimizes steps further by skipping uniqueness and sorting in some cases; correctness is not yet verified (deactivated by default) |
    | --optimize-parent-pointers | N/A | N/A | yes or no | Allows RumbleDB to remove parent pointers from items if no steps requiring parent pointers are detected statically (activated by default) |
    | --static-base-uri | N/A | N/A | "../data/" | Sets the static base URI for the execution. This option overrides the module location but is overridden by a declaration inside the query. |

    Data sources and formats

    RumbleDB is able to read a variety of formats from a variety of file systems and database management systems.

    We support functions to read JSON, JSON Lines, XML, Parquet, CSV, Text, ROOT, Delta files from various storage layers such as S3 and HDFS, Azure blob storage. We run most of our tests on Amazon EMR with S3 or HDFS, as well as locally on the local file system, but we welcome feedback on other setups.

    We also support some ETL-based systems such as PostgreSQL, MongoDB and the Hive metastore.

    Supported formats

    JSON

A JSON file containing a single JSON object (or value) can be read with json-doc(). The access is not spread in any way, so the files should be reasonably small. json-doc() can read JSON files even if the object or value is spread over multiple lines.

    json-doc() returns the (single) JSON value read from the supplied JSON file. This also works for structures spread over multiple lines, as the read is local and not sharded.

    json-doc() also works with an HTTP URI.

    JSON Lines

    JSON Lines files are files that have one JSON object (or value) per line. Such files can thus become very large, up to billions or even trillions of JSON objects.

JSON Lines files are read with the json-lines() function (formerly called json-file()). json-lines() exists in unary and binary versions. The first parameter specifies the JSON file (or set of JSON files) to read. The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition, which does not fully use the parallelism. If the input is on HDFS, then blocks are taken as splits by default. This is also similar to Spark's textFile().

json-lines() also works with an HTTP URI; however, it will download the file completely and then parallelize, because HTTP does not support blocks. As a consequence, it can only be used for reasonable sizes.

    Example of usage:

    If a default host and port are set in the Hadoop configuration, you can directly specify an absolute path without host and port:

    For a set of files:

    If a working directory is set:

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    In some cases, JSON Lines files are highly structured, meaning that all objects have the same fields and these fields are associated with values with the same types. In this case, RumbleDB will be faster navigating such files if you open them with the function structured-json-lines().

structured-json-lines() parses one or more JSON files that follow the JSON Lines format and returns a sequence of objects. This enables better performance with fully structured data and is recommended only when such data is available.

    Warning: when the data has multiple types for the same field, this field and contained values will be treated as strings. This is also similar to Spark's spark.read.json().

    Example of usage:

    XML

    XML files can be read into RumbleDB using the doc() function. The parameter specifies the XML file to read and return as a document node.

    Example of usage:

Additionally, RumbleDB provides the xml-files() function to read many XML files at once. xml-files() exists in unary and binary versions. The first parameter specifies the directory of XML files to read. The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition.

    Example of usage:

    Text

    Text files can be read into a sequence of string items, one string per line. RumbleDB can open files that have billions or potentially even trillions of lines with the function text-file().

text-file() exists in unary and binary versions. The first parameter specifies the text file (or set of text files) to read and return as a sequence of strings.

    The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition, which does not fully use the parallelism. If the input is on HDFS, then blocks are taken as splits by default. This is also similar to Spark's textFile().

    Example of usage:

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    (Also see examples for json-lines for host and port, sets of files and working directory).

    There is also a function local-text-file() that reads locally, without parallelism. RumbleDB can stream through the file efficiently.

RumbleDB also supports the W3C-standard functions unparsed-text and unparsed-text-lines. The output of the latter is automatically parallelized as a potentially large sequence of strings.

    Parquet

    Parquet files can be opened with the function parquet-file().

    Parses one or more parquet files and returns a sequence of objects. This is also similar to Spark's spark.read.parquet()

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    CSV

    CSV files can be opened with the function csv-file().

    Parses one or more csv files and returns a sequence of objects. This is also similar to Spark's spark.read.csv()

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    Options can be given in the form of a JSON object. All available options can be found in the Spark documentation

    PostgreSQL

    This functionality is currently only available in the Python edition (pip install jsoniq) as of 2.0.1+.

    PostgreSQL tables can be opened with the function postgresql-table().

    PostgreSQL is an OLTP system with its own storage system. Thus, unlike most other functions on this page, it uses a connection string rather than a path on a data lake.

    It opens one table and returns it as a sequence of objects. The first argument is the connection string in the JDBC format, containing host, port, username, password, and database. The second argument is the name of the table to read.

    The third parameter can be used to control the number of partitions.

    MongoDB

    This functionality is currently only available in the Python edition (pip install jsoniq) as of 2.0.2+.

    MongoDB collections can be opened with the function mongodb-collection().

    MongoDB is an OLTP system with its own storage system. Thus, unlike most other functions on this page, it uses a connection string rather than a path on a data lake.

    It opens one collection and returns it as a sequence of objects. The first argument is the connection string in the MongoDB format, containing host, port, database, collection, username, password. The second argument is the name of the collection to read.

    The third parameter can be used to control the number of partitions.

    MongoDB does not work "out of the box" but requires some configuration as indicated on the MongoDB Spark connector website. In the Python edition, we simplified the process and all that is needed is to add withMongo() on the session building chain:

    Hive metastore

    RumbleDB can connect to a table registered in the Hive metastore with the function table().

    The Hive metastore manages its own storage system. Thus, unlike most other functions on this page, it uses a simple name rather than a path on a data lake.

    RumbleDB can also modify data in a Hive metastore table with the JSONiq Update Facility.

    Delta files

    Delta files, part of the Delta Lake framework, can be opened with the function delta-file().

    RumbleDB can also modify data in a delta file with the JSONiq Update Facility.

    Delta files do not work "out of the box" but require some configuration as indicated on the Delta Lake website (importing packages, configuring some parameters). In the Python edition, we simplified the process and all that is needed is to add withDelta() on the session building chain:

    AVRO

    Avro files can be opened with the function avro-file().

    Parses one or more avro files and returns a sequence of objects. This is similar to Spark's spark.read().format("avro").load()

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    Options can be given in the form of a JSON object. All available options relevant for reading in avro data can be found in the Spark documentation

    libSVM

    libSVM files can be opened with the function libsvm-file().

    Parses one or more libsvm files and returns a sequence of objects. This is similar to Spark's spark.read().format("libsvm").load()

    Several files or whole directories can be read with the same pattern syntax as in Spark.

    ROOT

ROOT files can be opened with the function root-file(). The second parameter specifies the path within the ROOT file (a ROOT file is like a mini-file system of its own). It is often Events or tree.

    Creating your own big sequence

    The function parallelize() can be used to create, on the fly, a big sequence of items in such a way that RumbleDB can spread its querying across cores and machines.

    This function behaves like the Spark parallelize() you are familiar with and sends a large sequence to the cluster. The rest of the FLWOR expression is then evaluated with Spark transformations on the cluster.

There is also a second, optional parameter that specifies the minimum number of partitions.

    Supported file systems

    As a general rule of thumb, RumbleDB can read from any file system that Spark can read from. The file system is inferred from the scheme used in the path used in any of the functions described above, with the exception of MongoDB, the Hive metastore, and PostgreSQL, which are ETL-based.

    Note that the scheme is optional, in which case the default file system as configured in Hadoop and Spark is used. A relative path can also be provided, in which case the working directory (including its file system) as configured is used.

    Local file system

    The scheme for the local file system is file://. Pay attention to the fact that for reading an absolute path, a third slash will follow the scheme.

    Example:

    Warning! If you try to open a file from the local file system on a cluster of several machines, this might fail as the file is only on the machine that you are connected to. You need to pass additional parameters to spark-submit to make sure that any files read locally will be copied over to all machines.

    If you use spark-submit locally, however, this will work out of the box, but we recommend specifying a number of partitions to avoid reading the file as a single partition.

    For Windows, you need to use forward slashes, and if the local file system is set up as the default and you omit the file scheme, you still need a forward slash in front of the drive letter to not confuse it with a URI scheme:

    In particular, the following will not work:

    HDFS

    The scheme for the Hadoop Distributed File System is hdfs://. A host and port should also be specified, as this is required by Hadoop.

    Example:

    If HDFS is already set up as the default file system as is often the case in managed Spark clusters, an absolute path suffices:

    The following will not work:

    S3

    There are three schemes for reading from S3: s3://, s3n:// and s3a://.

    Examples:

    If you are on an Amazon EMR cluster, s3:// is straightforward to use and will automatically authenticate. For more details on how to set up your environment to read from S3 and which scheme is most appropriate, we refer to the Amazon S3 documentation.

    Azure blob storage

    The scheme for Azure blob storage is wasb://.

    Example:

    json-doc("file.json")
    for $my-json in json-lines("hdfs://host:port/directory/file.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-json in json-lines("/absolute/directory/file.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-json in json-lines("/absolute/directory/file-*.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-json in json-lines("file.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-json in json-lines("*.json")
    where $my-json.property eq "some value"
    return $my-json
    for $my-structured-json in structured-json-lines("hdfs://host:port/directory/structured-file.json")
    where $my-structured-json.property eq "some value"
    return $my-structured-json
    doc("path/to/file.xml")
    xml-files("path/to/directory/*.xml", 10)
    count(
      for $my-string in text-file("hdfs://host:port/directory/file.txt")
      for $token in tokenize($my-string, ";")
      where $token eq "some value"
      return $token
    )
    count(
      for $my-string in local-text-file("file:///home/me/file.txt")
      for $token in tokenize($my-string, ";")
      where $token eq "some value"
      return $token
    )
    count(
      for $my-string in unparsed-text-lines("file:///home/me/file.txt")
      for $token in tokenize($my-string, ";")
      where $token eq "some value"
      return $token
    )
    count(
      let $text := unparsed-text("file:///home/me/file.txt")
      for $my-string in tokenize($text, "\n")
      for $token in tokenize($my-string, ";")
      where $token eq "some value"
      return $token
    )
    for $my-object in parquet-file("file.parquet")
    where $my-object.property eq "some value"
    return $my-json
    for $my-object in parquet-file("*.parquet")
    where $my-object.property eq "some value"
    return $my-json
    for $i in csv-file("file.csv")
    where $i._c0 eq "some value"
    return $i
    for $i in csv-file("*.csv")
    where $i._c0 eq "some value"
    return $i
    for $i in csv-file("file.csv", {"header": true, "inferSchema": true})
    where $i.key eq "some value"
    return $i
    for $i in postgresql-table("jdbc:postgresql://servername/dbname?user=postgres&password=example", "tablename")
    where $i.attribute eq "some value"
    return $i
    for $i in postgresql-table("jdbc:postgresql://servername/dbname?user=postgres&password=example", "tablename", 10)
    where $i.attribute eq "some value"
    return $i
    for $i in mongodb-collection("mongodb://servername/dbname", "collection")
    where $i.attribute eq "some value"
    return $i
    for $i in mongodb-collection("mongodb://servername/dbname", "collection", 10)
    where $i.attribute eq "some value"
    return $i
    RumbleSession.builder.withMongo().getOrCreate();
    for $i in table("mytable")
    where $i.attribute eq "some value"
    return $i
    for $i in delta-file("hdfs://path/to/my/delta-file")
    where $i.attribute eq "some value"
    return $i
    RumbleSession.builder.withDelta().getOrCreate();
    for $i in avro-file("file.avro")
    where $i._col1 eq "some value"
    return $i
    for $i in avro-file("*.avro")
    where $i._col1 eq "some value"
    return $i
    for $i in avro-file("file.avro", {"ignoreExtension": true, "avroSchema": "/path/to/schema.avsc"})
    where $i._col1 eq "some value"
    return $i
    for $i in libsvm-file("file.txt")
    where $i._col1 eq "some value"
    return $i
    for $i in libsvm-file("*.txt")
    where $i._col1 eq "some value"
    return $i
    for $i in root-file("events.root", "Events")
    where $i._c0 eq "some value"
    return $i
    for $i in parallelize(1 to 1000000)
    where $i mod 1000 eq 0
    return $i
    for $i in parallelize(1 to 1000000, 100)
    where $i mod 1000 eq 0
    return $i
    file:///home/user/file.json
    file:///C:/Users/hadoop/file.json
    file:/C:/Users/hadoop/file.json
    /C:/Users/hadoop/file.json
    file://C:/Users/hadoop/file.json
    C:/Users/hadoop/file.json
    C:\Users\hadoop\file.json
    file://C:\Users\hadoop\file.json
    hdfs://www.example.com:8021/user/hadoop/file.json
    /user/hadoop/file.json
    hdfs:///user/hadoop/file.json
    hdfs://user/hadoop/file.json
    hdfs:/user/hadoop/file.json
    s3://my-bucket/directory/file.json
    s3n://my-bucket/directory/file.json
    s3a://my-bucket/directory/file.json
    wasb://[email protected]/directory/file.json

    Introduction

    In this specification, we detail the JSONiq language in version 1.0. Historically, JSONiq was first created as an extension to XQuery. Later, a separate core syntax was created which makes it 100% tailored for JSON. It is the JSONiq core syntax that is detailed in this document.

The functionality directly inherited from XQuery is described on a higher level, and we explicitly refer to the W3C specification for more in-depth details.

    Structure of a JSONiq program.

    A JSONiq program can either be a main module, which contains a query that can be executed, or a library module, which defines functions and variables that can be used in other modules.

    A main or library module can be optionally prefixed with a JSONiq declaration with a version (currently 1.0) and an encoding.

Function Library

    JSONiq provides a rich set of functions.

JSON-specific functions.

    Some functions are specific to JSON.

    Module

    Main modules

    A JSONiq main module is made of two parts: an optional prolog, and an expression, which is the main query.

    MainModule

    The result of the main JSONiq program is the result of its main query.

    In the prolog, it is possible to declare global variables and functions. Mostly, you will recognize a prolog declaration by the semi-colon it ends with. The main query does not contain semi-colons (at least in core JSONiq).

Global variables and functions can use and call each other arbitrarily, even if the dependency is further down in the prolog. If there is a cycle, an error is thrown.
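As a minimal sketch, here is a main module whose prolog declares a global variable and a function (each declaration ending with a semicolon), followed by the main query:

    declare variable $greeting := "Hello";

    declare function local:greet($name) {
      $greeting || ", " || $name || "!"
    };

    local:greet("world")

The result of this program is Hello, world!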

    JSONiq largely follows the W3C standard regarding modules. The detailed specification is found here.

    Library modules

    Library modules do not contain any main query, just global variables and functions. They can be imported by other modules.

    A library module is introduced with a module declaration, followed by the prolog containing its variables and functions.

    LibraryModule

    Feature matrix

    JSONiq is 99% reliant on XQuery, a W3C standard. For everything taken over from the W3C standard, a brief, non-normative explanation is provided with a link to the corresponding part in the W3C specification.

| Feature | Specification status |
    | --- | --- |
    | JSONiq Data Model | |
    | Atomic items | W3C-conformant |
    | Structured items | JSONiq-specific |
    | Function items | W3C-conformant |
    | Node items (XML) | Omitted (optional support by some engines) |
    | JSONiq Type System | |

    Namespaces

    The namespace http://jsoniq.org/functions is used for JSONiq builtin functions defined by this specification. This namespace is exposed to the user and is bound by default to the prefix jn. For instance, the function name jn:keys() is in this namespace.

    The namespace http://jsoniq.org/types is used for JSONiq builtin types defined by this specification (including synonyms for some XQuery types). This namespace is exposed to the user and is bound by default to the prefix js. For instance, the type name js:null is in this namespace.

The namespace http://jsoniq.org/default-function-namespace is a proxy namespace that maps to the jn: (JSONiq), fn: (XQuery) and math: (XQuery) namespaces. It is the default function namespace, allowing all these functions to be called with no prefix.

    The namespace http://jsoniq.org/default-type-namespace is a proxy namespace that maps to the js: (JSONiq) and xs: (XQuery) namespaces. It is the default type namespace, allowing all builtin types to be used with no prefix.

    Accessors used in JSONiq Data Model use the jdm: prefix. These functions are not exposed to the user and are for explanatory purposes of the data model within this document only. The jdm: prefix is not associated with a namespace.

    keys

    This function returns the distinct keys of all objects in the supplied sequence, in an implementation-dependent order.

    keys($o as item*) as string*

    Getting all distinct key names in the supplied objects, ignoring non-objects.

    Result (run with Zorba):a b c
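A query of the following shape (a sketch) produces the result above; the string is ignored because it is not an object:

    keys(({ "a" : 1, "b" : 2 }, { "a" : 3, "c" : 4 }, "not an object"))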

    Retrieving all Pairs from an Object:

    Result (run with Zorba):{ "eyes" : "blue" } { "hair" : "fuchsia" }

    members

This function returns all members of all arrays of the supplied sequence.

    members($a as item*) as item*

    Retrieving the members of all supplied arrays, ignoring non-arrays.

    Result (run with Zorba):mercury venus earth mars 1 2 3
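A query of the following shape (a sketch) produces the result above; the atomic in the middle is ignored because it is not an array:

    members(([ "mercury", "venus", "earth", "mars" ], "not an array", [ 1, 2, 3 ]))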

    null

    This function returns the JSON null.

    null() as null

    parse-json

    This function parses its first parameter (a string) as JSON, and returns the resulting sequence of objects and arrays.

    parse-json($arg as string?) as json-item*

    parse-json($arg as string?, $options as object) as json-item*

    The object optionally supplied as the second parameter may contain additional options:

• jsoniq-multiple-top-level-items (boolean): indicates whether parsing zero or several top-level items is allowed. An error is raised if this value is false and there is not exactly one item that was parsed.

    If parsing is not successful, an error is raised. In particular, parsing is considered non-successful if the boolean associated with "jsoniq-multiple-top-level-items" in the additional parameters is false and there is extra content after parsing a single object or array.

    Parsing a JSON document

    Result (run with Zorba):{ "foo" : "bar" }

    Parsing multiple, whitespace-separated JSON documents

    Result (run with Zorba):{ "foo" : "bar" } { "bar" : "foo" }

    size

    This function returns the size of the supplied array, or the empty sequence if the empty sequence is provided.

    size($a as array?) as integer?

    Retrieving the size of an array

    Result (run with Zorba):10

    accumulate

    This function dynamically builds an object, like the {| |} syntax, except that it does not throw an error upon pair collision. Instead, it accumulates them, wrapping into an array if necessary. Non-objects are ignored.

    descendant-arrays

    This function returns all arrays contained within the supplied items, regardless of depth.

    descendant-objects

    This function returns all objects contained within the supplied items, regardless of depth.

    descendant-pairs

    This function returns all descendant pairs within the supplied items.

    Accessing all descendant pairs

    Result (run with Zorba):An error was raised: "descendant-pairs": function with arity 1 not declared

    flatten

    This function recursively flattens arrays in the input sequence, leaving non-arrays intact.

    intersect

    This function returns the intersection of the supplied objects, and aggregates values corresponding to the same name into an array. Non-objects are ignored.

    project

    This function iterates on the input sequence. It projects objects by filtering their pairs and leaves non-objects intact.

    Projecting an object 1

    Result (run with Zorba):{ "Captain" : "Kirk", "First Officer" : "Spock" }

    Projecting an object 2

    Result (run with Zorba):{ }

    remove-keys

    This function iterates on the input sequence. It removes the pairs with the given keys from all objects and leaves non-objects intact.

    Removing keys from an object (not implemented yet)

    Result (run with Zorba):An error was raised: "remove-keys": function with arity 2 not declared

    values

    This function returns all values in the supplied objects. Non-objects are ignored.

    encode-for-roundtrip

    This function encodes any sequence of items, even containing non-JSON types, to a sequence of JSON items that can be serialized as pure JSON, in a way that it can be parsed and decoded back using decode-from-roundtrip. JSON features are left intact, while atomic items annotated with a non-JSON type are converted to objects embedding all necessary information.

    encode-for-roundtrip($items as item*) as json-item*

    decode-from-roundtrip

    This function decodes a sequence previously encoded with encode-for-roundtrip.

    decode-from-roundtrip($items as json-item*) as item*

    Functions taken from XQuery

    • Access to the external environment: collection#1

    • Function to turn atomics into booleans for use in two-valued logics: boolean#1

    • Raising errors: error#0, error#1, error#2, error#3.

    • Functions on numeric values: abs#1, ceilingabs#1, floorabs#1, roundabs#1,

    • Parsing numbers: ,

    • Formatting integers: ,

    • Formatting numbers: ,

    • Trigonometric and exponential functions: , , , , , , , , , , , , ,

    • Functions to assemble and disassemble strings: ,

    • Comparison of strings: , ,

    • Functions on string values: , , , , , , , , , , , , ,

    • Functions based on substring matching: , , , , , , , , ,

    • String functions that use regular expressions: , , , , ,

    • Functions that manipulate URIs: , , , ,

    • General functions on sequences: , , , , , , , , ,

    • Function that compare values in sequences: , , , , .

    • Functions that test the cardinality of sequences: , ,

    • Aggregate functions: , , , ,

    • Serializing functions: (unary)

    • Context information: , , , ,

    • Constructor functions: for all builtin types, with the name of the builtin type and unary. Equivalent to a cast expression.

    
    let $o := ("foo", [ 1, 2, 3 ], { "a" : 1, "b" : 2 }, { "a" : 3, "c" : 4 })
    return keys($o)
            
    
    let $map := { "eyes" : "blue", "hair" : "fuchsia" }
    for $key in keys($map)
    return { $key : $map.$key }
            
    
    let $planets :=  ( "foo", { "foo" : "bar "}, [ "mercury", "venus", "earth", "mars" ], [ 1, 2, 3 ])
    return members($planets)
            
    
    parse-json("{ \"foo\" : \"bar\" }", { "jsoniq-multiple-top-level-items" : false })
            
    
    parse-json("{ \"foo\" : \"bar\" } { \"bar\" : \"foo\" }")
            
    
    let $a := [1 to 10]
    return size($a)
            
    
    declare function accumulate($seq as item*) as object
    {
      {|
        keys($seq) ! { $$ : $seq.$$ }
      |}
    };
          
    
    declare function descendant-arrays($seq as item*) as array*
    {
      for $i in $seq
      return typeswitch ($i)
      case array return ($i, descendant-arrays($i[])
      case object return descendant-arrays(values($i))
      default return ()
    };
          
    
    declare function descendant-objects($seq as item*) as object*
    {
      for $i in $seq
      return typeswitch ($i)
      case object return ($i, descendant-objects(values($i)))
      case array return descendant-objects($i[])
      default return ()
    };
          
    
    declare function descendant-pairs($seq as item*)
    {
      for $i in $seq
      return typeswitch ($i)
      case object return
        for $k in keys($o)
        let $v := $o.$k
        return ({ $k : $v }, descendant-pairs($v))
      case array return descendant-pairs($i[])
      default return ()
    };
          
    
    let $o := 
    {
      "first" : 1,
      "second" : { 
        "first" : "a", 
        "second" : "b" 
      }
    }
    return descendant-pairs($o)
            
    
    declare function flatten($seq as item*) as item*
    {
      for $value in $seq
      return typeswitch ($value)
             case array return flatten($value[])
             default return $value
    };
    	  
    
    declare function intersect($seq as item*)
    {
      {|
        let $objects := $seq[. instance of object()]
        for $key in keys(head($objects))
        where every $object in tail($objects)
              satisfies exists(index-of(keys($object), $key))
        return { $key : $objects.$key }
      |}
    };
          
    
    declare function project($seq as item*, $keys as string*) as item*
    {
      for $item in $seq
      return typeswitch ($item)
             case $object as object return
             {|
               for $key in keys($object)
               where some $to-project in $keys satisfies $to-project eq $key
               let $value := $object.$key
               return { $key : $value }
             |}
             default return $item
    };
          
    
    let $o := {
      "Captain" : "Kirk",
      "First Officer" : "Spock",
      "Engineer" : "Scott"
      }
    return project($o, ("Captain", "First Officer"))
            
    
    let $o := {
      "Captain" : "Kirk",
      "First Officer" : "Spock",
      "Engineer" : "Scott"
      }
    return project($o, "XQuery Evangelist")
            
    
    declare function remove-keys($seq as item*, $keys as string*) as item*
    {
      for $item in $seq
      return typeswitch ($item)
             case $object as object return
             {|
               for $key in keys($object)
               where every $to-remove in $keys satisfies $to-remove ne $key
               let $value := $object.$key
               return { $key : $value }
             |}
             default return $item
    };
          
    
    let $o := {
      "Captain" : "Kirk",
      "First Officer" : "Spock",
      "Engineer" : "Scott"
      }
    return remove-keys($o, ("Captain", "First Officer"))
            
    
    declare function values($seq as item*) as item* {
      for $i in $seq
      for $k in jn:keys($i)
      return $i($k)
    };
          

JSONiq Type System

Atomic types: W3C-conformant, but support for xs:ID, xs:IDREF, xs:IDREFS, xs:Name, xs:NCName, xs:ENTITY, xs:ENTITIES, xs:NOTATION omitted (except for engines also supporting XML)
js:null type: JSONiq-specific
js:item, js:atomic types: JSONiq-specific synonyms for item() and xs:anyAtomicType
Structured types: JSONiq-specific
Function types: W3C-conformant
Empty sequence type: JSONiq-specific notation () for empty-sequence()
XML node types: Omitted (optional support by engines supporting XML)

Concepts

Effective boolean value: W3C-conformant, extended with object, array and null semantics
Atomization: Omitted (optional support by engines supporting XML)

Expressions

Numeric literals: W3C-conformant
String literals: W3C-conformant, but escape is done with \ not with &
Boolean and null literals: JSONiq-specific
Variable reference: W3C-conformant
Parenthesized expressions: W3C-conformant
Context item expressions: W3C-conformant, but $$ syntax instead of .
Static function calls: W3C-conformant
Named function reference: W3C-conformant
Inline function expressions: W3C-conformant
Filter expressions: W3C-conformant
Dynamic function calls: W3C-conformant
Path expressions (XML): Omitted (optional support by engines supporting XML, but relative paths must start with ./)
Object lookup: JSONiq-specific
Array lookup: JSONiq-specific
Array unboxing: JSONiq-specific
Sequence expressions: W3C-conformant
Arithmetic expressions: W3C-conformant, no atomization needed (except for engines also supporting XML)
String concatenation expressions: W3C-conformant
Comparison expressions: W3C-conformant, no need to atomize or convert from untyped and untypedAtomic (except for engines also supporting XML)
Logical expressions: W3C-conformant
XML constructors: Omitted (optional support by engines supporting XML)
JSON (object and array) constructors: JSONiq-specific
FLWOR expressions: W3C-conformant
Unordered and ordered expressions: W3C-conformant
Conditional expressions: W3C-conformant
Switch expressions: W3C-conformant
Quantified expressions: W3C-conformant
Try-catch expressions: W3C-conformant
Instance-of expressions: W3C-conformant
Typeswitch expressions: W3C-conformant
Cast expressions: W3C-conformant
Castable expressions: W3C-conformant
Constructor functions: W3C-conformant, additional constructor function for null()
Treat expressions: W3C-conformant
Simple map operator: W3C-conformant
Validate expressions: Omitted (optional support by engines supporting XML)
Extension expressions: W3C-conformant

Static context

XPath 1.0 compatibility mode: Omitted (optional support by engines supporting XML)
Statically known namespaces: W3C-conformant
Default element/type namespace: W3C-conformant, strong recommendation for implementations to overwrite with the proxy namespace http://jsoniq.org/default-type-namespace to omit prefixes
Default function namespace: W3C-conformant, strong recommendation for implementations to overwrite with http://jsoniq.org/default-function-namespace to omit prefixes
In-scope schema definitions: Omitted (optional support by engines supporting XML)
In-scope variables: W3C-conformant
Context item static type: W3C-conformant
Statically known function signatures: W3C-conformant, augmented with all JSONiq builtin functions
Statically known collations: W3C-conformant
Default collation: W3C-conformant
Construction mode: Omitted (optional support by engines supporting XML)
Ordering mode: W3C-conformant
Default order for empty sequences: W3C-conformant
Boundary-space policy: Omitted (optional support by engines supporting XML)
Copy-namespaces mode: Omitted (optional support by engines supporting XML)
Static Base URI: W3C-conformant
Statically known documents: Omitted (optional support by engines supporting XML)
Statically known collections: Omitted (optional support by engines supporting XML)
Statically known default collection type: Omitted (optional support by engines supporting XML)
Statically known decimal formats: W3C-conformant

Dynamic context

Context item: W3C-conformant (but with syntax $$ not .)
Initial context item: W3C-conformant
Context position: W3C-conformant
Context size: W3C-conformant
Variable values: W3C-conformant
Named functions: W3C-conformant
Current dateTime: W3C-conformant
Implicit timezone: W3C-conformant
Default language: W3C-conformant
Default calendar: W3C-conformant
Default place: W3C-conformant
Available documents: Omitted (optional support by engines supporting XML)
Available text resources: W3C-conformant
Available node collections: Omitted (optional support by engines supporting XML)
Default node collection: Omitted (optional support by engines supporting XML)
Available resource collections: Omitted (optional support by engines supporting XML)
Default resource collection: Omitted (optional support by engines supporting XML)
Environment variables: W3C-conformant

    RumbleML

    RumbleDB ML

RumbleDB ML is a Machine Learning library built on top of the RumbleDB engine; the abstraction layer provided by JSONiq makes ML tasks easier and more productive to perform.

    The machine learning capabilities are exposed through JSONiq function items. The concepts of "estimator" and "transformer", which are core to Machine Learning, are naturally function items and fit seamlessly in the JSONiq data model.

    Training sets, test sets, and validation sets, which contain features and labels, are exposed through JSONiq sequences of object items: the keys of these objects are the features and labels.

    The names of the estimators and of the transformers, as well as the functionality they encapsulate, are directly inherited from the SparkML library which RumbleDB ML is based on: we chose not to reinvent the wheel.

    Transformers

    A transformer is a function item that maps a sequence of objects to a sequence of objects.

    It is an abstraction that either performs a feature transformation or generates predictions based on trained models. For example:

    • Tokenizer is a feature transformer that receives textual input data and splits it into individual terms (usually words), which are called tokens.

    • KMeansModel is a trained model and a transformer that can read a dataset containing features and generate predictions as its output.

    Estimators

    An estimator is a function item that maps a sequence of objects to a transformer (yes, you got it right: that's a function item returned by a function item. This is why they are also called higher-order functions!).

    Estimators abstract the concept of a Machine Learning algorithm or any algorithm that fits or trains on data. For example, a learning algorithm such as KMeans is implemented as an Estimator. Calling this estimator on data essentially trains a KMeansModel, which is a Model and hence a Transformer.

    Parameters

    Transformers and estimators are function items in the RumbleDB Data Model. Their first argument is the sequence of objects that represents, for example, the training set or test set. Parameters can be provided as their second argument. This second argument is expected to be an object item. The machine learning parameters form the fields of the said object item as key-value pairs.
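As a sketch of this calling convention, the following condensed variant of the Tokenizer example further below passes the training data as the first argument and the parameters object as the second (the type name local:doc and the output column name "tokens" are illustrative choices, not part of the documented example):

declare type local:doc as {
  "id": "integer",
  "sentence": "string"
};

let $data := validate type local:doc* {(
    { "id" : 1, "sentence" : "Hi I heard about Spark" }
)}
let $tokenize := get-transformer("Tokenizer")
return $tokenize($data, { "inputCol" : "sentence", "outputCol" : "tokens" })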

    Type Annotations

RumbleDB ML works on highly structured data, because it requires full type information for all the fields in the training set or test set. It is on our development plan to automate the detection of these types when the sequence of objects is created on the fly.

    RumbleDB supports a user-defined type system with which you can validate and annotate datasets against a JSound schema.

    This annotation is required to be applied on any dataset that must be used as input to RumbleDB ML, but it is superfluous if the data was directly read from a structured input format such as Parquet, CSV, Avro, SVM or ROOT.

Examples

• Tokenizer Example:

declare type local:id-and-sentence as {
  "id": "integer",
  "sentence": "string"
};

let $local-data := (
    {"id": 1, "sentence": "Hi I heard about Spark"},
    {"id": 2, "sentence": "I wish Java could use case classes"},
    {"id": 3, "sentence": "Logistic regression models are neat"}
)
let $df-data := validate type local:id-and-sentence* { $local-data }

let $transformer := get-transformer("Tokenizer")
for $i in $transformer(
    $df-data,
    {"inputCol": "sentence", "outputCol": "output"}
)
return $i

// returns
// { "id" : 1, "sentence" : "Hi I heard about Spark", "output" : [ "hi", "i", "heard", "about", "spark" ] }
// { "id" : 2, "sentence" : "I wish Java could use case classes", "output" : [ "i", "wish", "java", "could", "use", "case", "classes" ] }
// { "id" : 3, "sentence" : "Logistic regression models are neat", "output" : [ "logistic", "regression", "models", "are", "neat" ] }

• KMeans Example:

declare type local:col-1-2-3 as {
  "id": "integer",
  "col1": "decimal",
  "col2": "decimal",
  "col3": "decimal"
};

let $vector-assembler := get-transformer("VectorAssembler")(
  ?,
  { "inputCols" : [ "col1", "col2", "col3" ], "outputCol" : "features" }
)

let $local-data := (
    {"id": 0, "col1": 0.0, "col2": 0.0, "col3": 0.0},
    {"id": 1, "col1": 0.1, "col2": 0.1, "col3": 0.1},
    {"id": 2, "col1": 0.2, "col2": 0.2, "col3": 0.2},
    {"id": 3, "col1": 9.0, "col2": 9.0, "col3": 9.0},
    {"id": 4, "col1": 9.1, "col2": 9.1, "col3": 9.1},
    {"id": 5, "col1": 9.2, "col2": 9.2, "col3": 9.2}
)
let $df-data := validate type local:col-1-2-3* { $local-data }
let $df-data := $vector-assembler($df-data)

let $est := get-estimator("KMeans")
let $tra := $est(
    $df-data,
    {"featuresCol": "features"}
)

for $i in $tra(
    $df-data,
    {"featuresCol": "features"}
)
return $i

// returns
// { "id" : 0, "col1" : 0, "col2" : 0, "col3" : 0, "prediction" : 0 }
// { "id" : 1, "col1" : 0.1, "col2" : 0.1, "col3" : 0.1, "prediction" : 0 }
// { "id" : 2, "col1" : 0.2, "col2" : 0.2, "col3" : 0.2, "prediction" : 0 }
// { "id" : 3, "col1" : 9, "col2" : 9, "col3" : 9, "prediction" : 1 }
// { "id" : 4, "col1" : 9.1, "col2" : 9.1, "col3" : 9.1, "prediction" : 1 }
// { "id" : 5, "col1" : 9.2, "col2" : 9.2, "col3" : 9.2, "prediction" : 1 }

RumbleDB ML Functionality Overview:

RumbleDB ML - Catalogue of Estimators:

AFTSurvivalRegression
ALS
BisectingKMeans
BucketedRandomProjectionLSH
ChiSqSelector
CountVectorizer
CrossValidator
DecisionTreeClassifier
DecisionTreeRegressor
FPGrowth
GBTClassifier
GBTRegressor
GaussianMixture
GeneralizedLinearRegression
IDF
Imputer
IsotonicRegression
KMeans
LDA
LinearRegression
LinearSVC
LogisticRegression
MaxAbsScaler
MinHashLSH
MinMaxScaler
MultilayerPerceptronClassifier
NaiveBayes
OneHotEncoder
OneVsRest
PCA
Pipeline
QuantileDiscretizer
RFormula
RandomForestClassifier
RandomForestRegressor
StandardScaler
StringIndexer
TrainValidationSplit
VectorIndexer
Word2Vec

RumbleDB ML - Catalogue of Transformers:

AFTSurvivalRegressionModel
ALSModel
Binarizer
BisectingKMeansModel
BucketedRandomProjectionLSHModel
Bucketizer
ChiSqSelectorModel
ColumnPruner
CountVectorizerModel
CrossValidatorModel
DCT
DecisionTreeClassificationModel
DecisionTreeRegressionModel
DistributedLDAModel
ElementwiseProduct
FPGrowthModel
FeatureHasher
GBTClassificationModel
GBTRegressionModel
GaussianMixtureModel
GeneralizedLinearRegressionModel
HashingTF
IDFModel
ImputerModel
IndexToString
Interaction
IsotonicRegressionModel
KMeansModel
LinearRegressionModel
LinearSVCModel
LocalLDAModel
LogisticRegressionModel
MaxAbsScalerModel
MinHashLSHModel
MinMaxScalerModel
MultilayerPerceptronClassificationModel
NGram
NaiveBayesModel
Normalizer
OneHotEncoder
OneHotEncoderModel
OneVsRestModel
PCAModel
PipelineModel
PolynomialExpansion
RFormulaModel
RandomForestClassificationModel
RandomForestRegressionModel
RegexTokenizer
SQLTransformer
StandardScalerModel
StopWordsRemover
StringIndexerModel
Tokenizer
TrainValidationSplitModel
VectorAssembler
VectorAttributeRewriter
VectorIndexerModel
VectorSizeHint
VectorSlicer
Word2VecModel

The parameters accepted by these estimators and transformers are listed below, following the order of the catalogues above:
    - aggregationDepth: integer
    - censorCol: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - maxIter: integer
    - predictionCol: string
    - quantileProbabilities: array (of double)
    - quantilesCol: string
    - tol: double
    - alpha: double
    - checkpointInterval: integer
    - coldStartStrategy: string
    - finalStorageLevel: string
    - implicitPrefs: boolean
    - intermediateStorageLevel: string
    - itemCol: string
    - maxIter: integer
    - nonnegative: boolean
    - numBlocks: integer
    - numItemBlocks: integer
    - numUserBlocks: integer
    - predictionCol: string
    - rank: integer
    - ratingCol: string
    - regParam: double
    - seed: double
    - userCol: string
    - distanceMeasure: string
    - featuresCol: string
    - k: integer
    - maxIter: integer
    - minDivisibleClusterSize: double
    - predictionCol: string
    - seed: double
    - bucketLength: double
    - inputCol: string
    - numHashTables: integer
    - outputCol: string
    - seed: double
    - fdr: double
    - featuresCol: string
    - fpr: double
    - fwe: double
    - labelCol: string
    - numTopFeatures: integer
    - outputCol: string
    - percentile: double
    - selectorType: string
    - binary: boolean
    - inputCol: string
    - maxDF: double
    - minDF: double
    - minTF: double
    - outputCol: string
    - vocabSize: integer
    - collectSubModels: boolean
    - estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - numFolds: integer
    - parallelism: integer
    - seed: double
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - seed: double
    - varianceCol: string
    - itemsCol: string
    - minConfidence: double
    - minSupport: double
    - numPartitions: integer
    - predictionCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - lossType: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - validationIndicatorCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - lossType: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - predictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - validationIndicatorCol: string
    - featuresCol: string
    - k: integer
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - seed: double
    - tol: double
    - family: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - link: string
    - linkPower: double
    - linkPredictionCol: string
    - maxIter: integer
    - offsetCol: string
    - predictionCol: string
    - regParam: double
    - solver: string
    - tol: double
    - variancePower: double
    - weightCol: string
    - inputCol: string
    - minDocFreq: integer
    - outputCol: string
    - inputCols: array (of string)
    - missingValue: double
    - outputCols: array (of string)
    - strategy: string
    - featureIndex: integer
    - featuresCol: string
    - isotonic: boolean
    - labelCol: string
    - predictionCol: string
    - weightCol: string
    - distanceMeasure: string
    - featuresCol: string
    - initMode: string
    - initSteps: integer
    - k: integer
    - maxIter: integer
    - predictionCol: string
    - seed: double
    - tol: double
    - checkpointInterval: integer
    - docConcentration: double
    - docConcentration: array (of double)
    - featuresCol: string
    - k: integer
    - keepLastCheckpoint: boolean
    - learningDecay: double
    - learningOffset: double
    - maxIter: integer
    - optimizeDocConcentration: boolean
    - optimizer: string
    - seed: double
    - subsamplingRate: double
    - topicConcentration: double
    - topicDistributionCol: string
    - aggregationDepth: integer
    - elasticNetParam: double
    - epsilon: double
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - loss: string
    - maxIter: integer
    - predictionCol: string
    - regParam: double
    - solver: string
    - standardization: boolean
    - tol: double
    - weightCol: string
    - aggregationDepth: integer
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - maxIter: integer
    - predictionCol: string
    - rawPredictionCol: string
    - regParam: double
    - standardization: boolean
    - threshold: double
    - tol: double
    - weightCol: string
    - aggregationDepth: integer
    - elasticNetParam: double
    - family: string
    - featuresCol: string
    - fitIntercept: boolean
    - labelCol: string
    - lowerBoundsOnCoefficients: object (of object of double)
    - lowerBoundsOnIntercepts: object (of double)
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - regParam: double
    - standardization: boolean
    - threshold: double
    - thresholds: array (of double)
    - tol: double
    - upperBoundsOnCoefficients: object (of object of double)
    - upperBoundsOnIntercepts: object (of double)
    - weightCol: string
    - inputCol: string
    - outputCol: string
    - inputCol: string
    - numHashTables: integer
    - outputCol: string
    - seed: double
    - inputCol: string
    - max: double
    - min: double
    - outputCol: string
    - blockSize: integer
    - featuresCol: string
    - initialWeights: object (of double)
    - labelCol: string
    - layers: array (of integer)
    - maxIter: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - solver: string
    - stepSize: double
    - thresholds: array (of double)
    - tol: double
    - featuresCol: string
    - labelCol: string
    - modelType: string
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - smoothing: double
    - thresholds: array (of double)
    - weightCol: string
    - dropLast: boolean
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCols: array (of string)
    - featuresCol: string
    - labelCol: string
    - parallelism: integer
    - predictionCol: string
    - rawPredictionCol: string
    - weightCol: string
    - inputCol: string
    - k: integer
    - outputCol: string
    - handleInvalid: string
    - inputCol: string
    - inputCols: array (of string)
    - numBuckets: integer
    - numBucketsArray: array (of integer)
    - outputCol: string
    - outputCols: array (of string)
    - relativeError: double
    - featuresCol: string
    - forceIndexLabel: boolean
    - formula: string
    - handleInvalid: string
    - labelCol: string
    - stringIndexerOrderType: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - labelCol: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - predictionCol: string
    - seed: double
    - subsamplingRate: double
    - inputCol: string
    - outputCol: string
    - withMean: boolean
    - withStd: boolean
    - handleInvalid: string
    - inputCol: string
    - outputCol: string
    - stringOrderType: string
    - collectSubModels: boolean
    - estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - parallelism: integer
    - seed: double
    - trainRatio: double
    - handleInvalid: string
    - inputCol: string
    - maxCategories: integer
    - outputCol: string
    - inputCol: string
    - maxIter: integer
    - maxSentenceLength: integer
    - minCount: integer
    - numPartitions: integer
    - outputCol: string
    - seed: double
    - stepSize: double
    - vectorSize: integer
    - windowSize: integer
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - quantileProbabilities: array (of double)
    - quantilesCol: string
    - coldStartStrategy: string
    - itemCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - userCol: string
    - inputCol: string
    - outputCol: string
    - threshold: double
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - handleInvalid: string
    - inputCol: string
    - inputCols: array (of string)
    - outputCol: string
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - splits: array (of double)
    - splitsArray: array (of array of double)
    - featuresCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - binary: boolean
    - inputCol: string
    - minTF: double
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - inverse: boolean
    - outputCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - varianceCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - seed: double
    - topicDistributionCol: string
    - inputCol: string
    - outputCol: string
    - scalingVec: object (of double)
    - itemsCol: string
    - minConfidence: double
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - categoricalCols: array (of string)
    - inputCols: array (of string)
    - numFeatures: integer
    - outputCol: string
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxIter: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - stepSize: double
    - subsamplingRate: double
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - featuresCol: string
    - linkPredictionCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - binary: boolean
    - inputCol: string
    - numFeatures: integer
    - outputCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCols: array (of string)
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - labels: array (of string)
    - outputCol: string
    - inputCols: array (of string)
    - outputCol: string
    - featureIndex: integer
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - rawPredictionCol: string
    - threshold: double
    - weightCol: double
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - seed: double
    - topicDistributionCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - threshold: double
    - thresholds: array (of double)
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - max: double
    - min: double
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - thresholds: array (of double)
    - inputCol: string
    - n: integer
    - outputCol: string
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - thresholds: array (of double)
    - inputCol: string
    - outputCol: string
    - p: double
    - dropLast: boolean
    - inputCol: string
    - outputCol: string
    - dropLast: boolean
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCols: array (of string)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - featuresCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - rawPredictionCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - degree: integer
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - probabilityCol: string
    - rawPredictionCol: string
    - seed: double
    - subsamplingRate: double
    - thresholds: array (of double)
    - cacheNodeIds: boolean
    - checkpointInterval: integer
    - featuresCol: string
    - featureSubsetStrategy: string
    - impurity: string
    - maxBins: integer
    - maxDepth: integer
    - maxMemoryInMB: integer
    - minInfoGain: double
    - minInstancesPerNode: integer
    - numTrees: integer
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - predictionCol: string
    - seed: double
    - subsamplingRate: double
    - gaps: boolean
    - inputCol: string
    - minTokenLength: integer
    - outputCol: string
    - pattern: string
    - toLowercase: boolean
    - statement: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - caseSensitive: boolean
    - inputCol: string
    - locale: string
    - outputCol: string
    - stopWords: array (of string)
    - handleInvalid: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - handleInvalid: string
    - inputCols: array (of string)
    - outputCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
    - handleInvalid: string
    - inputCol: string
    - size: integer
    - indices: array (of integer)
    - inputCol: string
    - names: array (of string)
    - outputCol: string
    - inputCol: string
    - outputCol: string
    - parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

    Function library

We list here the most important functions supported by RumbleDB, and introduce them by means of examples. Highly detailed specifications can be found in the underlying W3C standard, unless the function is marked as specific to JSONiq or RumbleDB, in which case it can be found here. JSONiq and RumbleDB intentionally do not support builtin functions on XML nodes, NOTATION or QNames. RumbleDB supports almost all other W3C-standardized functions; please contact us if you are missing one.

    For the sake of ease of use, all W3C standard builtin functions and JSONiq builtin functions are in the RumbleDB namespace, which is the default function namespace and does not require any prefix in front of function names.

    It is recommended that user-defined functions are put in the local namespace, i.e., their name should have the local: prefix (which is predefined). Otherwise, there is the risk that your code becomes incompatible with subsequent releases if new (unprefixed) builtin functions are introduced.

    Errors and diagnostics

    Diagnostic tracing

    trace

    Fully implemented

    returns (1, 2, 3) and logs it in the log-path if specified
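The inline query for this example was lost in conversion; a sketch consistent with the described output (the label string is an assumption):

trace((1, 2, 3), "my sequence")
(: returns (1, 2, 3) and logs it under the label "my sequence" :)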

    Functions and operators on numerics

    Functions on numeric values

    abs

    Fully implemented

    returns 2.0

    ceiling

    Fully implemented

    returns 3.0

    floor

    Fully implemented

    returns 2.0

    round

    Fully implemented

    returns 2.0

    returns 2.23

    round-half-to-even

    Fully implemented

    Parsing numbers

    number

    Fully implemented

    returns 15 as a double

    returns NaN as a double

    returns 15 as a double

    Formatting integers

    format-integer

    Not implemented

Formatting numbers

    format-number

    Not implemented

Trigonometric and exponential functions

pi

    Fully implemented

    returns 3.141592653589793

exp

    Fully implemented

exp10

    Fully implemented

    log

    Fully implemented

    log10

    Fully implemented

    pow

    Fully implemented

    sqrt

    Fully implemented

    returns 2

    sin

    Fully implemented

    cos

    Fully implemented

    cosh

    JSONiq-specific. Fully implemented

    sinh

    JSONiq-specific. Fully implemented

    tan

    Fully implemented

    asin

    Fully implemented

    acos

    Fully implemented

    atan

    Fully implemented

    atan2

    Fully implemented

    Random numbers

    random-number-generator

    Not implemented

    Functions on strings

    Functions to assemble and disassemble strings

    string-to-codepoint

    Fully implemented

    returns (84, 104, 233, 114, 232, 115, 101)

    returns ()

    codepoints-to-string

    Fully implemented

    returns "अशॊक"

    returns ""

    Comparison of strings

    compare

    Fully implemented

    returns -1

    codepoint-equal

    Fully implemented

    returns true

    returns ()

    collation-key

    Not implemented

    contains-token

    Not implemented

    Functions on string values

    concat

    Fully implemented

    returns "foobarfoobar"

    string-join

    Fully implemented

    returns "foobarfoobar"

    returns "foo-bar-foobar"

    substring

    Fully implemented

    returns "bar"

    returns "ba"

    string-length

    Fully implemented

    Returns the length of the supplied string, or 0 if the empty sequence is supplied.

    returns 3.

    returns 0.

normalize-space

    Fully implemented

    Normalization of spaces in a string.

    returns "The wealthy curled darlings of our nation."

    normalize-unicode

    Fully implemented

    Returns the value of the input after applying Unicode normalization.

returns the Unicode-normalized version of the input string. Normalization forms NFC, NFD, NFKC, and NFKD are supported. "FULLY-NORMALIZED", though supported, should be used with caution: only the composition exclusion characters that are uncommented in the corresponding Unicode composition-exclusions file are supported.

    upper-case

    Fully implemented

    returns "ABCD0"

    lower-case

    Fully implemented

    returns "abc!d"

    translate

    Fully implemented

    returns "BAr"

    returns "AAA"

    Functions based on substring matching

    contains

    Fully implemented

    returns true.

    starts-with

    Fully implemented

    returns true

    ends-with

    Fully implemented

    returns true.

    substring-before

    Fully implemented

    returns "foo"

    returns "f"

    substring-after

    Fully implemented

    returns "bar"

    returns ""

    String functions that use regular expressions

    matches

    Arity 2 implemented, arity 3 is not.

    Regular expression matching. The semantics of regular expressions are those of Java's Pattern class.

    returns true.

    returns true.
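The original inline queries were lost here; the two results above are consistent with the classic examples from the W3C specification, which presumably were:

matches("abracadabra", "bra")
(: returns true :)
matches("abracadabra", "^a.*a$")
(: returns true :)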

    replace

    Arity 3 implemented, arity 4 is not.

    Regular expression matching and replacing. The semantics of regular expressions are those of Java's Pattern class.

    returns "a*cada*"

    returns "abbraccaddabbra"

    tokenize

    Arity 2 implemented, arity 3 is not.

    returns ("aa", "bb", "cc", "dd")

    returns ("aa", "bb", "cc", "dd")

    analyze-string

    Not implemented

    Functions that manipulate URIs

    resolve-uri

    Fully implemented

    returns http://www.examples.com/examples

    encode-for-uri

    Fully implemented

    returns 100%25%20organic

    iri-to-uri

    Not implemented

    escape-html-uri

    Not implemented

    Functions and operators on Boolean values

    Boolean constant functions

    true

    Fully implemented

    returns true

    false

    Fully implemented

    returns false

    boolean

    Fully implemented

    returns true

    returns false

    not

    Fully implemented

    returns false

    returns true

    Functions and operators on durations

    Component extraction functions on durations

    years-from-duration

    Fully implemented

    returns 2021.

    months-from-duration

    Fully implemented

    returns 6.

    days-from-duration

    Fully implemented

    returns 17.

    hours-from-duration

    Fully implemented

    returns 12.

    minutes-from-duration

    Fully implemented

    returns 35.

    seconds-from-duration

    Fully implemented

    returns 30.

    Functions and operators on dates and times

    Constructing a DateTime

    dateTime

    Fully implemented

    returns 2004-04-12T13:20:00+14:00

    Component extraction functions on dates and times

    year-from-dateTime

    Fully implemented

    returns 2021.

    month-from-dateTime

    Fully implemented

    returns 04.

    day-from-dateTime

    Fully implemented

    returns 12.

    hours-from-dateTime

    Fully implemented

    returns 13.

    minutes-from-dateTime

    Fully implemented

    returns 20.

    seconds-from-dateTime

    Fully implemented

    returns 32.

    timezone-from-dateTime

    Fully implemented

    returns PT2H.

    year-from-date

    Fully implemented

    returns 2021.

    month-from-date

    Fully implemented

    returns 6.

    day-from-date

    Fully implemented

    returns 4.

    timezone-from-date

    Fully implemented

    returns -PT14H.

    hours-from-time

    Fully implemented

    returns 13.

    minutes-from-time

    Fully implemented

    returns 20.

    seconds-from-time

    Fully implemented

    returns 32.123.

    timezone-from-time

    Fully implemented

    returns PT2H.

    Timezone adjustment functions on dates and time values

    adjust-dateTime-to-timezone

    Fully implemented

    returns 2004-04-12T03:25:15+04:05.

    adjust-date-to-timezone

    Fully implemented

    returns 2014-03-12+04:00.

    adjust-time-to-timezone

    Fully implemented

    returns 04:20:00-14:00.

    Formatting dates and times functions

The functions in this section accept a simplified version of the picture string, in which a variable marker accepts only:

• One of the following component specifiers: Y, M, d, D, F, H, m, s, P

• A first presentation modifier, for which the value can be:

  • Nn, for all supported component specifiers, besides P

  • N, if the component specifier is P

  • a format token that indicates a numbering sequence of the following form: '0001'

• A second presentation modifier, for which the value can be t or c, which are also the default values

• A width modifier, both minimum and maximum values

    format-dateTime

    Fully implemented

    returns 20-13-12-4-2004

    format-date

    Fully implemented

    returns 12-4-2004

    format-time

    Fully implemented

    returns 13-20-0
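The inline queries for the three formatting examples above were lost in conversion; sketches consistent with the shown results (the picture strings are inferred from the outputs, not taken from the original):

format-dateTime(dateTime("2004-04-12T13:20:00"), "[m]-[H]-[D]-[M]-[Y]")
(: returns 20-13-12-4-2004 :)
format-date(date("2004-04-12"), "[D]-[M]-[Y]")
(: returns 12-4-2004 :)
format-time(time("13:20:00"), "[H]-[m]-[s]")
(: returns 13-20-0 :)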

    Functions related to QNames

    Not implemented

    Functions and operators on sequences

    General functions and operators on sequences

    empty

    Fully implemented

Returns a boolean indicating whether the input sequence is empty.

    returns false.

    exists

    Fully implemented

Returns a boolean indicating whether the input sequence has at least one item.

    returns true.

    returns false.

    This is pushed down to Spark and works on big sequences.

    head

    Fully implemented

    Returns the first item of a sequence, or the empty sequence if it is empty.

    returns 1.

    returns ().

    This is pushed down to Spark and works on big sequences.

    tail

    Fully implemented

Returns all but the first item of a sequence, or the empty sequence if it is empty.

    returns (2, 3, 4, 5).

    returns ().

    This is pushed down to Spark and works on big sequences.

    insert-before

    Fully implemented

    returns (1, 2, 3, 4, 5).

    remove

    Fully implemented

    returns (1, 2).

    reverse

    Fully implemented

    returns (3, 2, 1).

    subsequence

    Fully implemented

    returns (2, 3).

    unordered

    Fully implemented

    returns (1, 2, 3).

    Functions that compare values in sequences

    distinct-values

    Fully implemented

    Eliminates duplicates from a sequence of atomic items.

    returns (1, 4, 3, "foo", true, 5).

    This is pushed down to Spark and works on big sequences.

    index-of

    Fully implemented

    returns 3.

    returns "".

    deep-equal

    Fully implemented

    returns true.

    returns false.

    Functions that test the cardinality of sequences

    zero-or-one

    Fully implemented

    returns "a".

    returns an error.

    one-or-more

    Fully implemented

    returns "a".

    returns an error.

    exactly-one

    Fully implemented

    returns "a".

    returns an error.

    Aggregate functions

    count

    Fully implemented

    returns 4.

    Count calls are pushed down to Spark, so this works on billions of items as well:
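The inline query was lost here; a sketch of a pushed-down count over a large dataset read with json-file (the path is hypothetical):

count(json-file("hdfs:///path/to/large-dataset.json"))
(: counted by Spark, without materializing the sequence :)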

    avg

    Fully implemented

    returns 2.5.

    Avg calls are pushed down to Spark, so this works on billions of items as well:

    max

    Fully implemented

    returns 4.

    returns (1, 2, 3).

    Max calls are pushed down to Spark, so this works on billions of items as well:

    min

    Fully implemented

    returns 1.

    returns (1, 2, 3).

    Min calls are pushed down to Spark, so this works on billions of items as well:

    sum

    Fully implemented

    returns 10.

    Sum calls are pushed down to Spark, so this works on billions of items as well:

    Functions giving access to external information

    doc

    Fully implemented

    Returns the corresponding document node

    collection

    Not implemented

    Parsing and serializing

    serialize

    Fully implemented

    Serializes the supplied input sequence, returning the serialized representation of the sequence as a string

    returns { "hello" : "world" }

    Context Functions

    position

    Fully implemented

    returns 5

    last

    Fully implemented

    returns 10

    returns 10

    current-dateTime

    Fully implemented

    returns 2020-02-26T11:22:48.423+01:00

    current-date

    Fully implemented

    returns 2020-02-26Europe/Zurich

    current-time

    Fully implemented

    returns 11:24:10.064+01:00

    implicit-timezone

    Fully implemented

    returns PT1H.

    default-collation

    Fully implemented

    returns http://www.w3.org/2005/xpath-functions/collation/codepoint.

    High order functions

    Functions on functions

    function-lookup

    Not implemented

    function-name

    Not implemented

    function-arity

    Not implemented

    Basic higher-order functions

    for-each

    Not implemented

    filter

    Not implemented

    fold-left

    Not implemented

    fold-right

    Not implemented

    for-each-pair

    Not implemented

    JSONiq functions

    keys

    Fully implemented

    returns ("foo", "bar"). Also works on an input sequence, eliminating duplicates

    Keys calls are pushed down to Spark, so this works on billions of items as well:

    members

    Fully implemented

    This function returns the members of an array, but not recursively, i.e., nested arrays are not unboxed.

    members([1 to 100])

    returns the first 100 integers as a sequence. Also works on an input sequence, in a distributive way:

    members(([1 to 100], [ 300 to 1000 ]))

    null

    Fully implemented

    Returns a JSON null (also available as the literal null).

    null()

    parse-json

    Fully implemented
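
    For example, a minimal sketch, assuming the standard JSONiq semantics of parsing a JSON string into the corresponding item:

    parse-json("{ \"foo\" : \"bar\" }")

    returns the object { "foo" : "bar" }.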

    size

    Fully implemented

    size([1 to 100])

    returns 100. Also works if the empty sequence is supplied, in which case it returns the empty sequence:

    size(())

    accumulate

    Fully implemented

    accumulate(({ "b" : 2 }, { "c" : 3 }, { "b" : [1, "abc"] }, {"c" : {"d" : 0.17}}))

    returns

    { "b" : [ 2, [ 1, "abc" ] ], "c" : [ 3, { "d" : 0.17 } ] }

    descendant-arrays

    Fully implemented

    descendant-arrays(([0, "x", { "a" : [1, {"b" : 2}, [2.5]], "o" : {"c" : 3} }]))

    returns

    [ 0, "x", { "a" : [ 1, { "b" : 2 }, [ 2.5 ] ], "o" : {"c" : 3} } ]

    [ 1, { "b" : 2 }, [ 2.5 ] ]

    [ 2.5 ]

    descendant-objects

    Fully implemented

    descendant-objects(([0, "x", { "a" : [1, {"b" : 2}, [2.5]], "o" : {"c" : 3} }]))

    returns

    { "a" : [ 1, { "b" : 2 }, [ 2.5 ] ], "o" : { "c" : 3 } }

    { "b" : 2 }

    { "c" : 3 }

    descendant-pairs

    Fully implemented

    descendant-pairs(({ "a" : [1, {"b" : 2}], "d" : {"c" : 3} }))

    returns

    { "a" : [ 1, { "b" : 2 } ] }

    { "b" : 2 }

    { "d" : { "c" : 3 } }

    { "c" : 3 }

    flatten

    Fully implemented

    Unboxes arrays recursively, stopping the recursion when any other item is reached (object or atomic). Also works on an input sequence, in a distributive way.

    flatten(([1, 2], [[3, 4], [5, 6]], [7, [8, 9]]))

    returns (1, 2, 3, 4, 5, 6, 7, 8, 9).

    intersect

    Fully implemented

    intersect(({"a" : "abc", "b" : 2, "c" : [1, 2], "d" : "0"}, { "a" : 2, "b" : "ab", "c" : "foo" }))

    returns

    { "a" : [ "abc", 2 ], "b" : [ 2, "ab" ], "c" : [ [ 1, 2 ], "foo" ] }

    project

    Fully implemented

    returns the object {"foo" : "bar", "bar" : "foobar"}. Also works on an input sequence, in a distributive way.

    remove-keys

    Fully implemented

    returns the object {"foobar" : "foo"}. Also works on an input sequence, in a distributive way.

    values

    Fully implemented

    returns ("bar", "foobar"). Also works on an input sequence, in a distributive way.

    Values calls are pushed down to Spark, so this works on billions of items as well:

    encode-for-roundtrip

    Not implemented

    decode-from-roundtrip

    Not implemented

    json-doc

    json-doc("/Users/sheldon/object.json")

    returns the (unique) JSON value parsed from a local JSON (but not necessarily JSON Lines) file, where this value may be spread over multiple lines.

    In the picture strings of format-dateTime, format-date and format-time, each component marker supports:

  • A format token that indicates a numbering sequence of the following form: '0001'

  • A second presentation modifier, for which the value can be t or c, which are also the default values

  • A width modifier, with both minimum and maximum values

    trace(1 to 3)
    abs(-2)
    ceiling(2.3)
    floor(2.3)
    round(2.3)
    round(2.2345, 2)
    round-half-to-even(2.2345, 2), round-half-to-even(2.2345)
    number("15")
    number("foo")
    number(15)
    pi()
    exp(10)
    exp10(10)
    log(100)
    log10(100)
    pow(10, 2)
    sqrt(4)
    sin(pi())
    cos(pi())
    cosh(pi())
    sinh(pi())
    tan(pi())
    asin(1)
    acos(1)
    atan(1)
    atan2(1)
    string-to-codepoints("Thérèse")
    string-to-codepoints("")
    codepoints-to-string((2309, 2358, 2378, 2325))
    codepoints-to-string(())
    compare("aa", "bb")
    codepoint-equal("abcd", "abcd")
    codepoint-equal("", ())
    concat("foo", "bar", "foobar")
    string-join(("foo", "bar", "foobar"))
    string-join(("foo", "bar", "foobar"), "-")
    substring("foobar", 4)
    substring("foobar", 4, 2)
    string-length("foo")
    string-length(())
    normalize-space(" The    wealthy curled darlings                                         of    our    nation. "),
    normalize-unicode("hello world", "NFC")
    upper-case("abCd0")
    lower-case("ABc!D")
    translate("bar","abc","ABC")
    translate("--aaa--","abc-","ABC")
    contains("foobar", "ob")
    starts-with("foobar", "foo")
    ends-with("foobar", "bar")
    substring-before("foobar", "bar")
    substring-before("foobar", "o")
    substring-after("foobar", "foo")
    substring-after("foobar", "r")
    matches("foobar", "o+")
    matches("foobar", "^fo+.*")
    replace("abracadabra", "bra", "*")
    replace("abracadabra", "a(.)", "a$1$1")
    tokenize("aa bb cc dd")
    tokenize("aa;bb;cc;dd", ";")
    string(resolve-uri("examples","http://www.examples.com/"))
    encode-for-uri("100% organic")
    fn:true()
    fn:false()
    boolean(9)
    boolean("")
    not(9)
    boolean("")
    years-from-duration(duration("P2021Y6M"))
    months-from-duration(duration("P2021Y6M"))
    days-from-duration(duration("P2021Y6M17D"))
    hours-from-duration(duration("P2021Y6M17DT12H35M30S"))
    minutes-from-duration(duration("P2021Y6M17DT12H35M30S"))
    minutes-from-duration(duration("P2021Y6M17DT12H35M30S"))
    dateTime("2004-04-12T13:20:00+14:00")
    year-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    month-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    day-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    hours-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    minutes-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    seconds-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    timezone-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
    year-from-date(date("2021-06-04"))
    month-from-date(date("2021-06-04"))
    day-from-date(date("2021-06-04"))
    timezone-from-date(date("2021-06-04-14:00"))
    hours-from-time(time("13:20:32.123+02:00"))
    minutes-from-time(time("13:20:32.123+02:00"))
    seconds-from-time(time("13:20:32.123+02:00"))
    timezone-from-time(time("13:20:32.123+02:00"))
    adjust-dateTime-to-timezone(dateTime("2004-04-12T13:20:15+14:00"), dayTimeDuration("PT4H5M"))
    adjust-date-to-timezone(date("2014-03-12"), dayTimeDuration("PT4H"))
    adjust-time-to-timezone(time("13:20:00-05:00"), dayTimeDuration("-PT14H"))
    format-dateTime(dateTime("2004-04-12T13:20:00"), "[m]-[H]-[D]-[M]-[Y]")
    format-date(date("2004-04-12"), "[D]-[M]-[Y]")
    format-time(time("13:20:00"), "[H]-[m]-[s]")

    Expressions

    Construction of items

    In JSONiq, objects, arrays and basic atomic values (string, number, boolean, null) are constructed exactly as they are constructed in JSON. Any JSON document is also a valid JSONiq query which just "returns itself".

    Because JSONiq expressions are fully composable, however, in objects and arrays constructors, it is possible to put any JSONiq expression and not only atomic literals, object constructors and array constructors. Furthermore, JSONiq supports the construction of other W3C-standardized builtin types (date, hexBinary, etc).

    The following examples are a few of many operators available in JSONiq: "to" for creating arithmetic sequences, "||" for concatenating strings, "+" for adding numbers, "," for appending sequences.

    In an array constructor, the operand expression will be evaluated to a sequence of items, and these items will be copied and become members of the newly created array.

    Composable array constructors
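
    For example, a query consistent with the result below, building an array from a range expression:

    [ 1 to 10 ]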

    Result:[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]

    In an object, the expression you use for the key must evaluate to an atomic - if it is not a string, it will just be cast to a string.

    Composable object keys
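
    For example, concatenating two strings to form the key:

    { "foo" || "bar" : true }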

    Result:{ "foobar" : true }

    An error is raised if the key expression is not an atomic.

    Non-atomic object keys
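
    A query consistent with this error, using an array as a key:

    { [ "foo", "bar" ] : true }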

    Result:An error was raised: can not atomize an array item: an array has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)

    If the value expression is empty, null will be used as a value, and if it contains two items or more, they will be wrapped into an array.

    If the colon is preceded with a question mark, then the pair will be omitted if the value expression evaluates to the empty sequence.

    Composable object values
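
    For example:

    { "foo" : 1 + 1 }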

    Result:{ "foo" : 2 }

    Composable object values and automatic conversion
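
    For example, with an empty sequence and a two-item sequence as values:

    { "foo" : (), "bar" : (1, 2) }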

    Result:{ "foo" : null, "bar" : [ 1, 2 ] }

    Optional pair (not implemented yet in Zorba)

    Result:An error was raised: invalid expression: syntax error, unexpected "?", expecting "end of file" or "," or "}"

    The {| |} syntax can be used to merge several objects.

    Merging object constructor
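
    For example:

    {| { "foo" : "bar" }, { "bar" : "foo" } |}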

    Result:{ "foo" : "bar", "bar" : "foo" }

    An error is raised if the operand expression does not evaluate to a sequence of objects.

    Merging object constructor with a type error

    Result:An error was raised: xs:integer can not be treated as type object()*

    Numbers

    JSONiq follows the W3C specification for constructing numbers. The following explanations, provided as an informal summary for convenience, are non-normative.

    Literal

    NumericLiteral

    IntegerLiteral

    DecimalLiteral

    DoubleLiteral

    The syntax for creating numbers is identical to that of JSON (it is actually a more flexible superset, for example leading 0s are allowed, and a decimal literal can begin with a dot). Note that JSONiq distinguishes between integers (no dot, no scientific notation), decimals (dot but no scientific notation) and doubles (scientific notation). As expected, an integer literal creates an atomic of type integer, and so on.

    Integer literals

    Result:42

    Decimal literals

    Result:3.14

    Double literals

    Result:6.022E23

    Strings

    The syntax for creating string items is conformant to JSON rather than to the W3C standard for string literals. This means concretely that escaping is done with backslashes and not with ampersands. Also, like in JSON, double quotes are required and single quotes are forbidden.

    StringLiteral

    String literals

    Result:foo

    String literals with escaping
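
    For example, with an escaped newline:

    "This is a line\nand this is a new line"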

    Result:This is a line and this is a new line

    String literals with Unicode character escaping

    Result:&#x1;

    String literals with a nested quote

    Result:This is a nested "quote"

    Booleans and null

    JSONiq also introduces three more literals for constructing booleans and nulls: true, false and null. This makes in particular the functions true() and false() superfluous.

    BooleanLiteral

    NullLiteral

    Boolean literals (true)

    Result:true

    Boolean literals (false)

    Result:false

    Null literals

    Result:null

    Other atomic values

    JSONiq follows the W3C specification for constructing most atomic values with constructors. In JSONiq, the xs prefix is optional.

    Objects

    Expressions constructing objects are JSONiq-specific and introduced in this specification.

    ObjectConstructor

    PairConstructor

    The syntax for creating objects is identical to that of JSON. You can use for an object key any string literal, and for an object value any literal, object constructor or array constructor.

    Empty object constructors

    Result:{ }

    Object constructors 1

    Result:{ "foo" : "bar" }

    Object constructors 2

    Result:{ "foo" : [ 1, 2, 3, 4, 5, 6 ] }

    Object constructors 3

    Result:{ "foo" : true, "bar" : false }

    Nested object constructors

    Result:{ "this is a key" : { "value" : "a value" } }

    As in JavaScript, if your key is simple enough (like alphanumerics, underscores, dashes, this kind of things), the quotes can be omitted. The strings for which quotes are not mandatory are called NCNames. This class of strings can be used for unquoted keys, for variable and function names, and for module aliases.

    Object constructors with unquoted key 1
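
    For example:

    { foo : "bar" }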

    Result:{ "foo" : "bar" }

    Object constructors with unquoted key 2

    Result:{ "foo" : [ 1, 2, 3, 4, 5, 6 ] }

    Object constructors with unquoted key 3

    Result:{ "foo" : "bar", "bar" : "foo" }

    Object constructors with needed quotes around the key

    Result:{ "but you need the quotes here" : null }

    Objects can be constructed more dynamically (e.g., dynamic keys) by constructing and merging smaller objects. Duplicate key names throw an error.

    Object constructors with dynamic keys
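
    A sketch consistent with this result, building keys dynamically and merging:

    {| for $i in 1 to 3 return { "foo" || $i : $i } |}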

    Result:{ "foo1" : 1, "foo2" : 2, "foo3" : 3 }

    Arrays

    Expressions constructing arrays are JSONiq-specific and introduced in this specification.

    ArrayConstructor

    Expr

    The syntax for creating arrays is identical to that of JSON: square brackets, comma-separated literals, object constructors and array constructors.

    Empty array constructors

    Result:[ ]

    Array constructors

    Result:[ 1, 2, 3, 4, 5, 6 ]

    Nested array constructors

    Result:[ "foo", 3.14, [ "Go", "Boldly", "When", "No", "Man", "Has", "Gone", "Before" ], { "foo" : "bar" }, true, false, null ]

    Square brackets are mandatory. Do not push it.

    Functions

    JSONiq follows the W3C specification for constructing function items with inline function expressions or named function references. The following explanations, provided as an informal summary for convenience, are non-normative.

    Function items can be constructed in two ways: by defining the body directly (inline function expression), or by referring by name to a function declared in a prolog.

    FunctionItemExpr

    Inline function expression

    JSONiq follows the W3C specification for constructing function items with inline expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    A function item can be built directly by specifying its parameters and its body as an expression. Types are optional and, by default, assumed to be item*.

    Function items can also be produced with a partial function application.

    Inline function expression
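
    For example, a sketch building two function items (the types in the second one are optional):

    ( function($x) { $x + 1 }, function($x as integer, $y as integer) as integer { $x + $y } )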

    Result: (two function items)

    InlineFunctionExpr

    ParamList

    Named function reference

    JSONiq follows the W3C specification for constructing function items with named function references. The following explanations, provided as an informal summary for convenience, are non-normative.

    If a function is builtin or declared in a prolog, in the same module or imported, then it is also possible to build a function item by referring to its name and arity.

    Named function reference
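
    For example, a sketch referring to a builtin function by its name and arity (here, substring with two parameters):

    substring#2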

    Result: (a function item)

    NamedFunctionRef

    Manipulating atomic values

    We now introduce the expressions that manipulate atomic values: arithmetics, logics, comparison, string concatenation.

    Arithmetics

    JSONiq follows the W3C specification for arithmetic expressions, and naturally extends it to return errors for null values. The following explanations, provided as an informal summary for convenience, are non-normative.

    JSONiq supports the basic four operations, integer division and modulo.

    Multiplicative operations have precedence over additive operations. Parentheses can override it.

    Basic arithmetic operations with precedence override
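
    One query consistent with this result, with parentheses overriding the precedence:

    (2 + 2) * 2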

    Result (run with Zorba):8

    Dates, times and durations are also supported in a natural way.

    Using basic operations with dates.
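
    For example, subtracting two dates yields a duration; the following sketch returns P29D:

    date("2013-08-21") - date("2013-07-23")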

    Result (run with Zorba):P29D

    If any of the operands is a sequence of more than one item, an error is raised.

    Sequence of more than one number in an addition

    Result (run with Zorba):An error was raised: sequence of more than one item can not be promoted to parameter type xs:anyAtomicType? of function add()

    If any of the operands is not a number, a date, a time or a duration, an error is raised, which seamlessly includes raising errors for null with no need to extend the specification.

    Null in an addition

    Result (run with Zorba):An error was raised: arithmetic operation not defined between types "xs:integer" and "js:null"

    If one of the operands evaluates to the empty sequence, then the operation results in the empty sequence.

    If the two operands do not have the same number type, JSONiq will perform the appropriate conversions.

    Basic arithmetic operations with an empty sequence
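
    For example:

    () + 2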

    Result (run with Zorba):

    AdditiveExpr

    MultiplicativeExpr

    UnaryExpr

    String concatenation

    JSONiq follows the W3C specification for string concatenation. The following explanations, provided as an informal summary for convenience, are non-normative.

    Two strings or more can be concatenated using the concatenation operator.

    String concatenation
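
    For example:

    "Captain " || "Kirk"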

    Result (run with Zorba):Captain Kirk

    An empty sequence is treated like an empty string.

    String concatenation with the empty sequence

    Result (run with Zorba):CaptainKirk

    StringConcatExpr

    Comparison

    JSONiq follows the W3C specification for comparison, and only extends its semantics to null values as follows.

    null can be compared for equality or inequality to anything - it is only equal to itself, so that false is returned when comparing it for equality with any non-null atomic, and true is returned when comparing it for non-equality with any non-null atomic.

    Equality and non-equality comparison with null
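
    A query consistent with the three results below:

    1 eq null, 1 ne null, null eq null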

    Result (run with Zorba):false true true

    For ordering operators (lt, le, gt, ge), null is considered the smallest possible value (like in JavaScript).

    Ordering comparison with null

    Result (run with Zorba):false

    The following explanations, provided as an informal summary for convenience, are non-normative.

    ComparisonExpr

    Atomics can be compared with the usual six comparison operators (equality, non-equality, lower-than, greater-than, lower-or-equal, greater-or-equal), and with the same two-letter symbols as in MongoDB.

    Equality comparison
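
    For example:

    1 + 1 eq 2, 1 lt 2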

    Result (run with Zorba):true true

    Comparison is only possible between two compatible types, otherwise, an error is raised.

    Comparisons with a type mismatch

    Result (run with Zorba):An error was raised: "xs:string": invalid type: can not compare for equality to type "xs:integer"

    Like for arithmetic operations, if an operand is the empty sequence, the empty sequence is returned as well.

    Comparison with the empty sequence

    Result (run with Zorba):

    Comparisons and logic operators are fundamental for a query language and for the implementation of a query processor, as they greatly impact query optimization. The comparison semantics was carefully chosen to have the right characteristics to enable optimization.

    Logics

    JSONiq follows the W3C specification for logical expressions; it introduces a prefix unary not operator as a synonym for fn:not, and extends the semantics of effective boolean values to objects, arrays and nulls. The following explanations, provided as an informal summary for convenience, are non-normative.

    OrExpr

    AndExpr

    NotExpr

    JSONiq logics support is based on two-valued logics: just true and false.

    Non-boolean operands get automatically converted to either true or false, or an error is raised. The boolean() function performs a manual conversion.

    • An empty sequence is converted to false.

    • A singleton sequence of one null is converted to false.

    • A singleton sequence of one string is converted to true except the empty string which is converted to false.

    • A singleton sequence of one number is converted to true except zero or NaN which are converted to false.

    JSONiq supports the three most famous boolean operations: conjunction, disjunction and negation. Negation has the highest precedence, then conjunction, then disjunction. Parentheses can override this.

    Logics with booleans
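
    For example:

    true and ( true or not true )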

    Result (run with Zorba):true

    Logics with comparing operands

    Result (run with Zorba):true

    Conversion of the empty sequence to false

    Result (run with Zorba):false

    Conversion of null to false

    Result (run with Zorba):false

    Conversion of a string to true
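
    A query consistent with the two results below:

    boolean("foo"), boolean("")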

    Result (run with Zorba):true false

    Conversion of a number to false

    Result (run with Zorba):false true

    Conversion of an object to a boolean (not implemented in Zorba at this point)

    Result (run with Zorba):true

    If the input sequence has more than one item, and the first item is not an object or array, an error is raised.

    Error upon conversion of a sequence of more than one item, not beginning with a JSON item, to a boolean

    Result (run with Zorba):An error was raised: invalid argument type for function fn:boolean(): effective boolean value not defined for sequence of more than one item that starts with "xs:integer"

    Unlike in C++ or Java, you cannot rely on the order of evaluation of the operands of a boolean operation. The following query may return true or may return an error.

    Non-determinism in presence of errors.

    Result (run with Zorba):true

    JSONiq follows the W3C specification for quantified expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    QuantifiedExpr

    It is possible to perform a conjunction or a disjunction on a predicate for each item in a sequence.

    Universal quantifier
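
    For example:

    every $i in 1 to 10 satisfies $i gt 0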

    Result (run with Zorba):true

    Existential quantifier on several variables

    Result (run with Zorba):true

    Variables can be annotated with a type. If no type is specified, item* is assumed. If the type does not match, an error is raised.

    Existential quantifier with type checking

    Result (run with Zorba):true

    Manipulating sequences

    JSONiq can create sequences with concatenation (comma) or with a range. Parentheses can be used for overriding precedence.

    Comma operator

    JSONiq follows the W3C specification for the concatenation of sequences with commas. The following explanations, provided as an informal summary for convenience, are non-normative.

    Expr

    Use a comma to concatenate two sequences, or even single items. This operator has the lowest precedence of all.

    Comma

    Result (run with Zorba):1 2 3 4 5 6 7 8 9 10

    Comma
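
    For example:

    { "foo" : "bar" }, [ 1 ]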

    Result (run with Zorba):{ "foo" : "bar" } [ 1 ]

    Sequences do not nest. You need to use arrays in order to nest.

    Range operator

    JSONiq follows the W3C specification for range expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    RangeExpr

    With the binary operator "to", you can generate larger sequences with just two integer operands.

    Range operator
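
    For example:

    1 to 10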

    Result (run with Zorba):1 2 3 4 5 6 7 8 9 10

    If one operand evaluates to the empty sequence, then the range operator returns the empty sequence.

    Range operator with the empty sequence

    Result (run with Zorba):

    Otherwise, if an operand evaluates to something other than a single integer or an empty sequence, an error is raised.

    Range operator with a type inconsistency

    Result (run with Zorba):An error was raised: sequence of more than one item can not be promoted to parameter type xs:integer? of function to()

    Parenthesized expression

    JSONiq follows the W3C specification for parenthesized expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    ParenthesizedExpr

    Use parentheses to override the precedence of expressions.

    If the parentheses are empty, the empty sequence is produced.

    Empty sequence

    Result (run with Zorba):

    Calling functions

    JSONiq follows the W3C specification for function calls. The following explanations, provided as an informal summary for convenience, are non-normative.

    Function calls in JSONiq can either be made statically, with a named function, or dynamically, by passing a function item on the fly.

    The syntax for function calls is similar to many other languages. JSONiq supports four sorts of functions:

    • Builtin functions: these have no prefix and can be called without any import.

    • Local functions: they are defined in the prolog, to be used in the main query. They have the prefix local:. A dedicated chapter describes how to define your own local functions.

    • Imported functions: they are defined in a library module. They have the prefix corresponding to the alias to which the imported module has been bound. A dedicated chapter describes how to define your own modules.

    • Anonymous functions: they are constructed with an inline function expression and have no name.

    The first three sorts are named functions and can be called statically. All four can be called dynamically, as a named function can also be passed as an item with a named function reference.

    Static function calls

    JSONiq follows the W3C specification for static function calls. The following explanations, provided as an informal summary for convenience, are non-normative.

    A static function call consists of the name of the function and of expressions returning its parameters. An error is thrown if no function with the corresponding name and arity is found.

    A builtin function call.

    Result:foo bar

    A builtin function call.

    Result:foobar

    An error is raised if the actual types do not match the expected types.

    A type error in a function call.

    Result:An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)

    JSONiq static function calls follow the W3C specification.

    FunctionCall

    Dynamic function calls

    JSONiq follows the W3C specification for dynamic function calls. The following explanations, provided as an informal summary for convenience, are non-normative.

    A dynamic function call is a postfix expression. Its left-hand side is an expression that must return a single function item (see function items in the data model). Its right-hand side is a list of parameters, each one of which is an arbitrary expression providing a sequence of items, one such sequence for each parameter.

    A dynamic function call.
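
    A sketch of such a call:

    let $f := function($x, $y) { $x + $y }
    return $f(1, 2)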

    Result:3

    If the number of parameters does not match the arity of the function, an error is raised. An error is also raised if an argument value does not match the corresponding type in the function signature.

    Otherwise, the function is evaluated with the supplied parameters. If the result matches the return type of the function, it is returned, otherwise an error is raised.

    A dynamic function call with signature

    Result:3

    JSONiq dynamic function calls follow the W3C specification.

    PostfixExpr

    ArgumentList

    Argument

    Partial application

    JSONiq follows the W3C specification for partial application. The following explanations, provided as an informal summary for convenience, are non-normative.

    A static or dynamic function call can also have placeholder parameters, represented with a question mark in the syntax. When this is the case, the function call returns a function item that is the partial application of the original function, and its arity is the number of remaining placeholders.

    A partial application.
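
    A sketch of a partial application:

    let $f := function($x, $y) { $x + $y }
    let $g := $f(1, ?)
    return $g(3)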

    Result:4

    JSONiq partial applications follow the W3C specification.

    Navigating objects

    Like in JavaScript, it is possible to navigate through objects and arrays. This is a specific JSONiq extension.

    JSONiq also allows filtering sequences with predicates, and predicates are fully W3C-conformant.

    JSONiq supports filtering items from a sequence, looking up the value associated with a given key in an object, looking up the item at a given position in an array, and looking up all items in an array.

    PostfixExpr

    Object field selector

    ObjectLookup

    The simplest way to navigate in an object is similar to JavaScript, using a dot. This will work as long as you do not push it too much: alphanumerical characters, dashes, underscores - just like unquoted keys in object constructors, any NCName is allowed.

    Object lookup
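
    For example:

    { "foo" : "bar" }.foo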

    Result (run with Zorba):bar

    Since JSONiq expressions are composable, you can also use any expression for the left-hand side. You might need parentheses depending on the precedence.

    Lookup on a single-object collection.

    Result (run with Zorba):bar

    The dot operator does an implicit mapping on the left-hand-side, i.e., it applies the lookup in turn on each item. Lookup on an object returns the value associated with the supplied key, or the empty sequence if there is none. Lookup on any item which is not an object (arrays and atomics) results in the empty sequence.

    Object lookup with an iteration on several objects

    Result (run with Zorba):bar bar2

    Object lookup with an iteration on a collection

    Result (run with Zorba):James T. Kirk Jean-Luc Picard Benjamin Sisko Kathryn Janeway Jonathan Archer Samantha Carter

    Object lookup on a mixed sequence

    Result (run with Zorba):bar1 bar2

    Of course, unquoted keys will not work for strings that are not NCNames, e.g., if the field contains a dot or begins with a digit. Then you will need quotes.

    Quotes for object lookup

    Result (run with Zorba):bar

    If you use an expression on the right side of the dot, it must always have parentheses. The result of the right-hand-side expression is cast to a string. An error is raised if the cast fails.

    Object lookup with a nested expression
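
    For example, with a parenthesized expression on the right-hand side:

    { "foo" : "bar" }.("f" || "oo")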

    Result (run with Zorba):bar

    Object lookup with a nested expression

    Result (run with Zorba):An error was raised: sequence of more than one item can not be treated as type xs:string

    Object lookup with a nested expression

    Result (run with Zorba):bar

    Variables, or a context item reference, do not need parentheses. Variables are introduced later, but here is a sneak peek:

    Object lookup with a variable

    Result (run with Zorba):bar

    Array member selector

    ArrayLookup

    Array lookup uses double square brackets.

    Array lookup
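
    For example:

    [ "foo", "bar" ][[2]]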

    Result (run with Zorba):bar

    Since JSONiq expressions are composable, you can also use any expression for the left-hand side. You might need parentheses depending on the precedence.

    Array lookup after an object lookup

    Result (run with Zorba):bar

    The array lookup operator does an implicit mapping on the left-hand-side, i.e., it applies the lookup in turn on each item. Lookup on an array returns the item at that position in the array, or the empty sequence if there is none (position larger than size or smaller than 1). Lookup on any item which is not an array (objects and atomics) results in the empty sequence.

    Array lookup with an iteration on several arrays

    Result (run with Zorba):2 5

    Array lookup with an iteration on a collection

    Result (run with Zorba):The original series The next generation The next generation The next generation Entreprise Voyager

    Array lookup on a mixed sequence

    Result (run with Zorba):3 6

    The expression inside the double-square brackets may be any expression. The result of evaluating this expression is cast to an integer. An error is raised if the cast fails.

    Array lookup with a right-hand-side expression

    Result (run with Zorba):bar

    ArrayUnboxing

    You can also extract all items from an array (i.e., as a sequence) with the [] syntax. The [] operator also implicitly iterates on the left-hand-side, returning the empty sequence for non-arrays.

    Extracting all items from an array
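
    For example:

    [ "foo", "bar" ][]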

    Result (run with Zorba):foo bar

    Extracting all items from arrays in a mixed sequence

    Result (run with Zorba):foo bar 1 2 3

    Sequence predicates

    Predicate

    A predicate allows filtering a sequence, keeping only items that fulfill it.

    The predicate is evaluated once for each item in the left-hand-side sequence, with the context item set to that item. The predicate expression can use $$ to access this context item.

    ContextItemExpr

    If the predicate evaluates to an integer, it is automatically matched against the item position in the left-hand-side sequence.

    Predicate expression
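
    For example:

    (1 to 10)[2]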

    Result (run with Zorba):2

    Otherwise, the result of the predicate is converted to a boolean.

    All items for which the converted predicate result evaluates to true are then output.

    Predicate expression
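
    For example, keeping the even numbers:

    (1 to 10)[$$ mod 2 eq 0]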

    Result (run with Zorba):2 4 6 8 10

    Control flow expressions

    JSONiq supports control flow expressions such as if-then-else, switch and typeswitch following the W3C standard.

    Conditional expressions

    JSONiq follows the W3C specification for conditional expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    IfExpr

    A conditional expression allows you to pick one or another value depending on a boolean value.

    A conditional expression
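
    For example:

    if (1 + 1 eq 2) then { "foo" : "yes" } else { "foo" : "no" }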

    Result (run with Zorba):{ "foo" : "yes" }

    The behavior of the expression inside the if is similar to that of logical operations (two-valued logics), meaning that non-boolean values get converted to a boolean.

    A conditional expression

    Result (run with Zorba):{ "foo" : "no" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "no" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "no" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "no" }

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    Note that the else clause is mandatory (but can be the empty sequence).

    A conditional expression

    Result (run with Zorba):{ "foo" : "yes" }

    Switch expressions

    JSONiq follows the W3C specification for switch expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    SwitchExpr

    SwitchCaseClause

    A switch expression evaluates the expression inside the switch. If it is an atomic, it compares it in turn to the provided atomic values (with the semantics of the eq operator) and returns the value associated with the first matching case clause.

    Note that if there is an object or array in the base switch expression or any case expression, a JSONiq-specific type error JNTY0004 will be raised, because objects and arrays cannot be atomized and the W3C standard requires atomization of the base and case expressions.

    A switch expression
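
    A query consistent with this result:

    switch ("foo")
    case "bar" return "foo"
    case "foo" return "bar"
    default return "none"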

    Result (run with Zorba):bar

    If it is not an atomic, an error is raised.

    A switch expression

    Result (run with Zorba):An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)

    If no value matches, the default is used.

    A switch expression

    Result (run with Zorba):none

    The case clauses support composability of expressions as well.

    A switch expression

    Result (run with Zorba):foo

    A switch expression

    Result (run with Zorba):1 + 1 is 2

    Try-catch expressions

    JSONiq follows the W3C specification for try-catch expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    TryCatchExpr

    A try catch expression evaluates the expression inside the try block and returns its resulting value.

    However, if an error is raised dynamically, the catch clause is evaluated and its result value returned.

    A try catch expression
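
    A sketch of such an expression:

    try { 1 div 0 } catch * { "division by zero!" }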

    Result (run with Zorba):division by zero!

    Only errors raised within the lexical scope of the try block are caught.

    A try catch expression

    Result (run with Zorba):An error was raised: division by zero

    Errors that are detected statically within the try block are still reported statically.

    A try catch expression

    Result (run with Zorba):syntax error

    FLWOR expressions

    JSONiq follows the W3C specification for FLWOR expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    FLWORExpr

    FLWOR expressions are probably the most powerful JSONiq construct and correspond to SQL's SELECT-FROM-WHERE statements, but they are more general and more flexible. In particular, clauses can appear in almost any order (except that a FLWOR expression must begin with a for or let clause and end with a return clause).

    Here is a bit of theory on how it works.

    A clause binds values to some variables according to its own semantics, possibly several times. Each time, a tuple of variable bindings (mapping variable names to sequences) is passed on to the next clause.

    This goes all the way down, until the return clause. The return clause is eventually evaluated for each tuple of variable bindings, resulting in a sequence of items for each tuple.

    These sequences of items are concatenated, in the order of the incoming tuples, and the obtained sequence is returned by the FLWOR expression.

    We are now giving practical examples with a hint on how it maps to SQL.

    For clauses

    JSONiq follows the W3C specification for for clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    ForClause

    For clauses allow iteration on a sequence.

    For each incoming tuple, the expression in the for clause is evaluated to a sequence. Each item in this sequence is in turn bound to the for variable. A tuple is hence produced for each incoming tuple, and for each item in the sequence produced by the for clause for this tuple.

    The order in which items are bound by the for clause can be relaxed with unordered expressions, as described later in this section.

    The following query, using a for and a return clause, is the counterpart of SQL's "SELECT name FROM captains". $x is bound in turn to each item in the captains collection.

    A for clause.

    Result (run with Zorba):James T. Kirk Jean-Luc Picard Benjamin Sisko Kathryn Janeway Jonathan Archer Samantha Carter

    For clause expressions are composable; there can be several of them.

    Two for clauses.
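
    For example:

    for $x in 1 to 3
    for $y in 1 to 3
    return 10 * $x + $y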

    Result (run with Zorba):11 12 13 21 22 23 31 32 33

    A for clause.

    Result (run with Zorba):11 12 13 21 22 23 31 32 33

    A for variable is visible to subsequent bindings.

    A for clause.

    Result (run with Zorba):1 2 3 4 5 6 7 8 9

    A for clause.

    Result (run with Zorba):{ "captain" : "James T. Kirk", "series" : "The original series" } { "captain" : "Jean-Luc Picard", "series" : "The next generation" } { "captain" : "Benjamin Sisko", "series" : "The next generation" } { "captain" : "Benjamin Sisko", "series" : "Deep Space 9" } { "captain" : "Kathryn Janeway", "series" : "The next generation" } { "captain" : "Kathryn Janeway", "series" : "Voyager" } { "captain" : "Jonathan Archer", "series" : "Entreprise" } { "captain" : null, "series" : "Voyager" }

    It is also possible to bind the position of the current item in the sequence to a variable.

    A for clause.
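
    A sketch, assuming the captains collection used throughout this section:

    for $x at $i in collection("captains")
    return { "captain" : $x.name, "id" : $i }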

    Result (run with Zorba):{ "captain" : "James T. Kirk", "id" : 1 } { "captain" : "Jean-Luc Picard", "id" : 2 } { "captain" : "Benjamin Sisko", "id" : 3 } { "captain" : "Kathryn Janeway", "id" : 4 } { "captain" : "Jonathan Archer", "id" : 5 } { "captain" : null, "id" : 6 } { "captain" : "Samantha Carter", "id" : 7 }

    JSONiq supports joins. For example, the counterpart of "SELECT c.name AS captain, m.name AS movie FROM captains c JOIN movies m ON c.name = m.captain" is:

    A join

    Result (run with Zorba):{ "captain" : "James T. Kirk", "movie" : "The Motion Picture" } { "captain" : "James T. Kirk", "movie" : "The Wrath of Kahn" } { "captain" : "James T. Kirk", "movie" : "The Search for Spock" } { "captain" : "James T. Kirk", "movie" : "The Voyage Home" } { "captain" : "James T. Kirk", "movie" : "The Final Frontier" } { "captain" : "James T. Kirk", "movie" : "The Undiscovered Country" } { "captain" : "Jean-Luc Picard", "movie" : "First Contact" } { "captain" : "Jean-Luc Picard", "movie" : "Insurrection" } { "captain" : "Jean-Luc Picard", "movie" : "Nemesis" }

    Note how JSONiq handles semi-structured data in a flexible way.

    Outer joins are also possible with "allowing empty", i.e., output will also be produced if there is no matching movie for a captain. The following query is the counterpart of "SELECT c.name AS captain, m.name AS movie FROM captains c LEFT JOIN movies m ON c.name = m.captain".

    A join

    Result (run with Zorba):{ "captain" : "James T. Kirk", "movie" : "The Motion Picture" } { "captain" : "James T. Kirk", "movie" : "The Wrath of Kahn" } { "captain" : "James T. Kirk", "movie" : "The Search for Spock" } { "captain" : "James T. Kirk", "movie" : "The Voyage Home" } { "captain" : "James T. Kirk", "movie" : "The Final Frontier" } { "captain" : "James T. Kirk", "movie" : "The Undiscovered Country" } { "captain" : "Jean-Luc Picard", "movie" : "First Contact" } { "captain" : "Jean-Luc Picard", "movie" : "Insurrection" } { "captain" : "Jean-Luc Picard", "movie" : "Nemesis" } { "captain" : "Benjamin Sisko", "movie" : null } { "captain" : "Kathryn Janeway", "movie" : null } { "captain" : "Jonathan Archer", "movie" : null } { "captain" : null, "movie" : null } { "captain" : "Samantha Carter", "movie" : null }

    Where clauses

    JSONiq follows the W3C specification for where clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    WhereClause

    Where clauses are used for filtering (selection operator in the relational algebra).

    For each incoming tuple, the expression in the where clause is evaluated to a boolean (possibly converting an atomic to a boolean). If this boolean is true, the tuple is forwarded to the next clause, otherwise it is dropped.

    The following query corresponds to "SELECT series FROM captains WHERE name = 'Kathryn Janeway'".

    A where clause.
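
    A sketch, assuming the captains collection:

    for $x in collection("captains")
    where $x.name eq "Kathryn Janeway"
    return $x.series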

    Result (run with Zorba):[ "The next generation", "Voyager" ]

    Order clauses

    JSONiq follows the W3C specification for order by clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    OrderByClause

    Order clauses are for reordering tuples.

    For each incoming tuple, the expression in the order by clause is evaluated to an atomic. The tuples are then sorted based on the atomics they are associated with, and then forwarded to the next clause.

    Like for ordering comparisons, null values are always considered the smallest.

    The following query is the counterpart of SQL's "SELECT * FROM captains ORDER BY name".

    An order by clause.
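
    A sketch, assuming the captains collection (the exact treatment of missing names may differ):

    for $x in collection("captains")
    order by $x.name
    return $x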

    Result (run with Zorba):{ "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }

    Multiple sorting criteria can be given - they are treated like a lexicographic order (most important criterion first).

    An order by clause.

    Result (run with Zorba):{ "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 }

    It can be specified whether the order is ascending or descending. Empty sequences are allowed and it can be chosen whether to put them first or last.

    An order by clause.

    Result (run with Zorba):{ "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 }

    An error is raised if the expression does not evaluate to an atomic or the empty sequence.

    An order by clause.

    Result (run with Zorba):An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)

    Collations can be used to give a specific way of how strings are to be ordered. A collation is identified by a URI.

    Use of a collation in an order by clause.

    Result (run with Zorba):Benjamin Sisko James T. Kirk Jean-Luc Picard Jonathan Archer Kathryn Janeway Samantha Carter

    Group clauses

    JSONiq follows the W3C specification for group by clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    GroupByClause

    Grouping is also supported, like in SQL.

    For each incoming tuple, the expression in the group clause is evaluated to an atomic (a grouping key). The incoming tuples are then grouped according to the key they are associated with.

    For each group, a tuple is output, with a binding from the grouping variable to the key of the group.

    A group by clause.

    Result (run with Zorba):{ "century" : 21 } { "century" : 22 } { "century" : 23 } { "century" : 24 }

    As for the other (non-grouping) variables, their values within one group are all concatenated, keeping the same name. Aggregations can be done on these variables.

    The following query is equivalent to "SELECT century, COUNT(*) FROM captains GROUP BY century".

    A group by clause.
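
    A sketch, assuming the captains collection; the order of the groups may vary:

    for $x in collection("captains")
    group by $century := $x.century
    return { "century" : $century, "count" : count($x) }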

    Result (run with Zorba):{ "century" : 21, "count" : 1 } { "century" : 22, "count" : 1 } { "century" : 23, "count" : 1 } { "century" : 24, "count" : 4 }

    JSONiq's group by is more flexible than SQL and is fully composable.

    A group by clause.

    Result (run with Zorba):{ "century" : 21, "captains" : [ "Samantha Carter" ] } { "century" : 22, "captains" : [ "Jonathan Archer" ] } { "century" : 23, "captains" : [ "James T. Kirk" ] } { "century" : 24, "captains" : [ "Jean-Luc Picard", "Benjamin Sisko", "Kathryn Janeway" ] }

    Unlike SQL, JSONiq does not need a having clause, because a where clause works perfectly after grouping as well.

    The following query is the counterpart of "SELECT century, COUNT(*) FROM captains GROUP BY century HAVING COUNT(*) > 1"

    A group by clause.

    Result (run with Zorba):{ "century" : 24, "count" : 4 }

    Let clauses

    JSONiq follows the W3C specification for let clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    LetClause

    Let bindings can be used to define aliases for any sequence, for convenience.

    For each incoming tuple, the expression in the let clause is evaluated to a sequence. A binding is added from this sequence to the let variable in each tuple. A tuple is hence produced for each incoming tuple.

    A let clause.

    Result (run with Zorba):{ "century" : 24, "count" : 4 }

    Note that it is perfectly fine to reuse a variable name and hide a variable binding.

    A let clause.

    Result (run with Zorba):{ "century" : 24, "number of series" : 3 }

    Count clauses

    JSONiq follows the W3C specification for count clauses. The following explanations, provided as an informal summary for convenience, are non-normative.

    CountClause

    For each incoming tuple, a binding from the position of this tuple in the tuple stream to the count variable is added. The new tuple is then forwarded to the next clause.

    A count clause.

    Result (run with Zorba):{ "id" : 1, "captain" : { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } } { "id" : 2, "captain" : { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } } { "id" : 3, "captain" : { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } } { "id" : 4, "captain" : { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } } { "id" : 5, "captain" : { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } } { "id" : 6, "captain" : { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } } { "id" : 7, "captain" : { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } }

    Map operator

    JSONiq follows the W3C specification for the map operator, except that it changes the syntax for the context item to $$ instead of the . syntax.

    The following explanations, provided as an informal summary for convenience, are non-normative.

    SimpleMapExpr

    ContextItemExpr

    JSONiq provides a shortcut for a for-return construct, automatically binding each item in the left-hand-side sequence to the context item.

    A simple map
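
    For example:

    (1 to 10) ! ($$ * 2)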

    Result (run with Zorba):2 4 6 8 10 12 14 16 18 20

    An equivalent query
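
    For example:

    for $i in 1 to 10
    return $i * 2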

    Result (run with Zorba):2 4 6 8 10 12 14 16 18 20

    Variable references

    JSONiq follows the W3C specification for variable references, except that it disallows the character . in variable names, which is instead used for object lookup.

    Composing FLWOR expressions

    Like all other expressions, FLWOR expressions can be composed. In the following examples, a FLWOR is nested in a function call, nested in a FLWOR, nested in an array constructor:

    Nested FLWORs

    Result (run with Zorba):[ "James T. Kirk", "Jean-Luc Picard" ]

    Ordered and Unordered expressions

    JSONiq follows the W3C specification for ordered and unordered expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    OrderedExpr

    UnorderedExpr

    By default, the order in which a for clause binds its items is important.

    This behaviour can be relaxed in order to give the optimizer more leeway. An unordered expression relaxes ordering by for clauses within its operand scope:

    An unordered expression.

    Result (run with Zorba):{ "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }

    An ordered expression can be used to reactivate ordering behaviour in a subscope.

    An ordered expression.
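
    unordered {
      for $captain in collection("captains")
      where ordered { exists(for $movie at $i in collection("movies")
                             where $i eq 5
                             where $movie.captain eq $captain.name
                             return $movie) }
      return $captain
    }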

    Result (run with Zorba):{ "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 }

    Expressions dealing with types

    This section describes JSONiq types as well as the sequence type syntax.

    Instance-of expressions

    JSONiq follows the W3C standard for instance-of expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    InstanceofExpr

    An instance-of expression can be used to tell whether a JSONiq value matches a given sequence type.

    Instance of expression
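
    1 instance of integer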

    Result (run with Zorba):true

    Instance of expression
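
    1 instance of string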

    Result (run with Zorba):false

    Instance of expression
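
    "foo" instance of string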

    Result (run with Zorba):true

    Instance of expression
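
    { "foo" : "bar" } instance of object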

    Result (run with Zorba):true

    Instance of expression
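
    ({ "foo" : "bar" }, { "bar" : "foo" }) instance of json-item+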

    Result (run with Zorba):true

    Instance of expression
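
    [ 1, 2, 3 ] instance of array?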

    Result (run with Zorba):true

    Instance of expression
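
    () instance of ()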

    Result (run with Zorba):true

    Treat expressions

    JSONiq follows the W3C standard for treat expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    TreatExpr

    A treat expression checks that a JSONiq value matches a given sequence type. If it is not the case, an error is raised.

    Treat as expression
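
    1 treat as integer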

    Result (run with Zorba):1

    Treat as expression
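
    1 treat as string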

    Result (run with Zorba):An error was raised: "xs:integer" cannot be treated as type xs:string

    Treat as expression
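
    "foo" treat as string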

    Result (run with Zorba):foo

    Treat as expression
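
    { "foo" : "bar" } treat as object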

    Result (run with Zorba):{ "foo" : "bar" }

    Treat as expression
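
    ({ "foo" : "bar" }, { "bar" : "foo" }) treat as json-item+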

    Result (run with Zorba):{ "foo" : "bar" } { "bar" : "foo" }

    Treat as expression
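
    [ 1, 2, 3 ] treat as array?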

    Result (run with Zorba):[ 1, 2, 3 ]

    Treat as expression
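
    () treat as ()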

    Result (run with Zorba):

    Castable expressions

    JSONiq follows the W3C standard for castable expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    CastableExpr

    A castable expression checks whether a JSONiq value can be cast to a given atomic type and returns true or false accordingly. It can be used before actually casting to that type.

    Castable as expression
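
    "1" castable as integer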

    Result (run with Zorba):true

    Castable as expression
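
    "foo" castable as integer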

    Result (run with Zorba):false

    Castable as expression
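
    "2013-04-02" castable as date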

    Result (run with Zorba):true

    Castable as expression
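
    () castable as date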

    Result (run with Zorba):false

    Castable as expression
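
    ("2013-04-02", "2013-04-03") castable as date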

    Result (run with Zorba):false

    The question mark allows for an empty sequence.

    Castable as expression
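
    () castable as date?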

    Result (run with Zorba):true
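
    As mentioned above, a castable expression is typically used as a guard before casting. A minimal sketch (with made-up input values):

    for $value in ("1", "foo", "42")
    return if ($value castable as integer)
           then $value cast as integer
           else 0

    This returns 1, 0 and 42 rather than raising an error on "foo".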

    Cast expressions

    JSONiq follows the W3C standard for cast expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    CastExpr

    A cast expression casts a JSONiq value to a given atomic type. The resulting value is annotated with this type.

    Cast as expression
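
    "1" cast as integer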

    Result (run with Zorba):1

    Cast as expression
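
    "foo" cast as integer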

    Result (run with Zorba):An error was raised: "foo": value of type xs:string is not castable to type xs:integer

    Cast as expression
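
    "2013-04-02" cast as date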

    Result (run with Zorba):2013-04-02

    Cast as expression
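
    () cast as date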

    Result (run with Zorba):An error was raised: empty sequence can not be cast to type with quantifier '1'

    Cast as expression
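
    ("2013-04-02", "2013-04-03") cast as date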

    Result (run with Zorba):An error was raised: sequence of more than one item can not be cast to type with quantifier '1' or '?'

    The question mark allows for an empty sequence.

    Cast as expression
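
    () cast as date?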

    Result (run with Zorba):

    Cast as expression
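
    "2013-04-02" cast as date?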

    Result (run with Zorba):2013-04-02
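
    Because the result of a cast is annotated with the target type, it behaves accordingly in later type tests. For instance, a small sketch combining a cast expression with an instance-of expression:

    ("1" cast as integer) instance of integer

    This returns true, whereas "1" instance of integer returns false.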

    Typeswitch expressions

    JSONiq follows the W3C standard for typeswitch expressions. The following explanations, provided as an informal summary for convenience, are non-normative.

    TypeswitchExpr

    CaseClause

    A typeswitch expression tests whether the value resulting from the first operand matches a given list of types. The expression corresponding to the first matching case clause is then evaluated. If there is no match, the expression in the default clause is evaluated.

    Typeswitch expression
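
    typeswitch("foo")
    case integer return "integer"
    case string return "string"
    case object return "object"
    default return "other"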

    Result (run with Zorba):string

    In each clause, it is possible to bind the value of the first operand to a variable.

    Typeswitch expression
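
    typeswitch("foo")
    case $i as integer return $i + 1
    case $s as string return $s || "foo"
    case $o as object return [ $o ]
    default $d return $d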

    Result (run with Zorba):foofoo

    The vertical bar can be used to allow several types in the same case clause.

    Typeswitch expression
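
    typeswitch("foo")
    case $a as integer | string return { "integer or string" : $a }
    case $o as object return [ $o ]
    default $d return $d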

    Result (run with Zorba):{ "integer or string" : "foo" }

    
      [ 1 to 10 ]
        
    
      { "foo" || "bar" : true }
          
    
      { [ 1, 2 ] : true }
          
    
      { "foo" : 1 + 1 }
          
    
      { "foo" : (), "bar" : (1, 2) }
          
    
      { "foo" ?: (), "bar" : (1, 2) }
          
    
      {| { "foo" : "bar" }, { "bar" : "foo" } |}
          
    
      {| 1 |}
          
    
    42
          
    
    3.14
          
    
    +6.022E23
          
    
      "foo"
            
    
      "This is a line\nand this is a new line"
            
    
      "\u0001"
            
    
      "This is a nested \"quote\""
            
    
    true
          
    
    false
          
    
    null
          
    
    {}
          
    
    { "foo" : "bar" }
          
    
    { "foo" : [ 1, 2, 3, 4, 5, 6 ] }
          
    
    { "foo" : true, "bar" : false }
          
    
    { "this is a key" : { "value" : "a value" } }
          
    
    { foo : "bar" }
          
    
    { foo : [ 1, 2, 3, 4, 5, 6 ] }
          
    
    { foo : "bar", bar : "foo" }
          
    
    { "but you need the quotes here" : null }
        
    
    {|
      for $i in 1 to 3
      return { "foo" || $i : $i }
    |}
        
    
    []
          
    
    [ 1, 2, 3, 4, 5, 6 ]
          
    
    [ "foo", 3.14, [ "Go", "Boldly", "When", "No", "Man", "Has", "Gone", "Before" ], { "foo" : "bar" }, true, false, null ]
          
    
               function ($x as integer, $y as integer) as integer { $x + 2 },
               function ($x) { $x + 2 }
           
    
               declare function local:sum($x as integer, $y as integer) as integer
               {
                 $x + 2
               };
               local:sum#2
           
    
    1 * ( 2 + 3 ) + 7 idiv 2 - (-8) mod 2
          
    
    date("2013-05-01") - date("2013-04-02")
          
    
    (1, 2) + 3
          
    
    1 + null
          
    
    () + 2
          
    
    "Captain" || " " || "Kirk"
          
    
    "Captain" || () || "Kirk"
          
    
    1 eq null, "foo" ne null, null eq null
          
    
    1 lt null
          
    
    1 + 1 eq 2, 1 lt 2
          
    
    "foo" eq 1
          
    
    () eq 1
          
    
    true and ( true or not true )
          
    
    1 + 1 eq 2 or 1 + 1 eq 3
          
    
    boolean(())
          
    
    boolean(null)
          
    
    boolean("foo"), boolean("")
          
    
    0 and true, not (not 1e42)
          
    
    { "foo" : "bar" } or false
          
    
    ( 1, 2, 3 ) or false
          
    
    true or (1 div 0)
          
    
    every $i in 1 to 10 satisfies $i gt 0
          
    
    some $i in -5 to 5, $j in 1 to 10 satisfies $i eq $j
          
    
    some $i as integer in -5 to 5, $j as integer in 1 to 10 satisfies $i eq $j
          
    
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10
      
    
    { "foo" : "bar" }, [ 1 ]
      
    
    1 to 10
      
    
    () to 10, 1 to ()
      
    
    (1, 2) to 10
      
    
    ()
          
    
           keys({ "foo" : "bar", "bar" : "foo" })
         
    
           concat("foo", "bar")
         
    
           sum({ "foo" : "bar" })
         
    
           let $f := function($x) { $x + 1 }
           return $f(2)
         
    
           let $f := function($x as integer) as integer { $x + 1 }
           return $f(2)
         
    
           let $f := function($x as integer, $y as integer) as integer { $x + $y }
           let $g := $f(?, 2)
           return $g(2)
         
    
    { "foo" : "bar" }.foo
          
    
    collection("one-object").foo
          
    
    ({ "foo" : "bar" }, { "foo" : "bar2" }, { "bar" : "foo" }).foo
            
    
    collection("captains").name
          
    
    ({ "foo" : "bar1" }, [ "foo", "bar" ], { "foo" : "bar2" }, "foo").foo
          
    
    { "foo bar" : "bar" }."foo bar"
          
    
    { "foobar" : "bar" }.("foo" || "bar")
          
    
    { "foobar" : "bar" }.("foo", "bar")
          
    
    { "1" : "bar" }.(1)
          
    
    let $field := "foo" || "bar"
    return { "foobar" : "bar" }.$field
          
    
    [ "foo", "bar" ] [[2]]
          
    
    { field : [ "one",  { "foo" : "bar" } ] }.field[[2]].foo
          
    
    ([ 1, 2, 3 ], [ 4, 5, 6 ])[[2]]
            
    
    collection("captains").series[[1]]
          
    
    ([ 1, 2, 3 ], [ 4, 5, 6 ], { "foo" : "bar" }, true)[[3]]
          
    
    [ "foo", "bar" ] [[ 1 + 1 ]]
          
    
    [ "foo", "bar" ][]
          
    
    ([ "foo", "bar" ], { "foo" : "bar" }, true, [ 1, 2, 3 ] )[]
          
    
    (1 to 10)[2]
          
    
    (1 to 10)[$$ mod 2 eq 0]
          
    
    if (1 + 1 eq 2) then { "foo" : "yes" } else { "foo" : "false" }
          
    
    if (null) then { "foo" : "yes" } else { "foo" : "no" }
          
    
    if (1) then { "foo" : "yes" } else { "foo" : "no" }
          
    
    if (0) then { "foo" : "yes" } else { "foo" : "no" }
            
    
    if ("foo") then { "foo" : "yes" } else { "foo" : "no" }
          
    
    if ("") then { "foo" : "yes" } else { "foo" : "no" }
            
    
    if (()) then { "foo" : "yes" } else { "foo" : "no" }
            
    
    if (({ "foo" : "bar" }, [ 1, 2, 3, 4])) then { "foo" : "yes" } else { "foo" : "no" }
            
    
    if (1+1 eq 2) then { "foo" : "yes" } else ()
            
    
    switch ("foo")
    case "bar" return "foo"
    case "foo" return "bar"
    default return "none"
            
    
    switch ({ "foo" : "bar" })
    case "bar" return "foo"
    case "foo" return "bar"
    default return "none"
            
    
    switch ("no-match")
    case "bar" return "foo"
    case "foo" return "bar"
    default return "none"
            
    
    switch (2)
    case 1 + 1 return "foo"
    case 2 + 2 return "bar"
    default return "none"
            
    
    switch (true)
    case 1 + 1 eq 2 return "1 + 1 is 2"
    case 2 + 2 eq 5 return "2 + 2 is 5"
    default return "none of the above is true"
            
    
    try { 1 div 0 } catch * { "division by zero!" } 
          
    
    let $x := 1 div 0
    return try { $x }
           catch * { "division by zero!" } 
          
    
    try { x } catch * { "syntax error" } 
          
    
    for $x in collection("captains")
    return $x.name
          
    
    for $x in ( 1, 2, 3 )
    for $y in ( 1, 2, 3 )
    return 10 * $x + $y
          
    
    for $x in ( 1, 2, 3 ), $y in ( 1, 2, 3 )
    return 10 * $x + $y
          
    
    for $x in ( [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] ), $y in $x[]
    return $y
          
    
    for $x in collection("captains"), $y in $x.series[]
    return { "captain" : $x.name, "series" : $y }
          
    
    for $x at $position in collection("captains")
    return { "captain" : $x.name, "id" : $position }
            
    
    for $captain in collection("captains"), $movie in collection("movies")[ try { $$.captain eq $captain.name } catch * { false } ]
    return { "captain" : $captain.name, "movie" : $movie.name }
            
    
    for $captain in collection("captains"), $movie allowing empty in collection("movies")[ try { $$.captain eq $captain.name } catch * { false } ]
    return { "captain" : $captain.name, "movie" : $movie.name }
            
    
    for $x in collection("captains")
    where $x.name eq "Kathryn Janeway"
    return $x.series
          
    
    for $x in collection("captains")
    order by $x.name
    return $x
          
    
    for $x in collection("captains")
    order by size($x.series), $x.name
    return $x
          
    
    for $x in collection("captains")
    order by $x.name descending empty greatest
    return $x
          
    
    for $x in collection("captains")
    order by $x
    return $x.name
          
    
    for $x in collection("captains")
    order by $x.name collation "http://www.w3.org/2005/xpath-functions/collation/codepoint"
    return $x.name
          
    
    for $x in collection("captains")
    group by $century := $x.century
    return { "century" : $century  }
          
    
    for $x in collection("captains")
    group by $century := $x.century
    return { "century" : $century, "count" : count($x) }
          
    
    for $x in collection("captains")
    group by $century := $x.century
    return { "century" : $century, "captains" : [ $x.name ] }
          
    
    for $x in collection("captains")
    group by $century := $x.century
    where count($x) gt 1
    return { "century" : $century, "count" : count($x) }
          
    