If you really want to start writing queries right now, there is a public sandbox here that will just work and guide you. You only need to have a Google account to be able to execute them, as this exposes our Jupyter notebook via the Colab environment. You are also free to download and use this notebook with any other provider or even your own local Jupyter and it will work just the same: the queries are all shipped to our own, small public backend no matter what. However, this may require a bit of configuration (JAVA_HOME pointing to Java 17 or 21, and if you have conflicting Spark installations in addition to pyspark, SPARK_HOME pointing to a Spark 4.0 installation).
If you do not have a Google account, you can also use our simpler sandbox page without Jupyter, here where you can type small queries and see the results.
With the sandboxes above, you can only inline your data in the query or access a dataset with an HTTP URL.
Once you want to take it to the next level and query your own data on your laptop, you will find instructions below to use RumbleDB on your own computer manually, which among others will allow you to query any files stored on your local disk. And then, you can take a leap of faith and use RumbleDB on a large cluster (Amazon EMR, your company's cluster, etc).
It is also possible to install RumbleDB with brew; however, there is currently no way to adjust memory usage with this method. To install RumbleDB with brew, type the following commands:
You can test that it works with:
Then, launch a JSONiq shell with:
The RumbleDB shell appears:
You can now start typing simple queries like the following few examples. Press the return key three times to execute a query.
Any expression in JSONiq returns a sequence of items. Any variable in JSONiq is bound to a sequence of items. Items can be objects, arrays, or atomic values (strings, integers, booleans, nulls, dates, binary values, durations, doubles, decimal numbers, etc.). A sequence of items can consist of just one item, it can be empty, or it can contain millions, billions or even trillions of items. Obviously, for sequences longer than a billion items, it is a better idea to use a cluster than a laptop. A relational table (or, more generally, a data frame) corresponds to a sequence of object items sharing the same schema. However, sequences of items are more general than tables or data frames and support heterogeneity seamlessly.
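To make this concrete, here is a small sketch (assuming the jsoniq Python package described further down is installed; the query text itself is plain JSONiq) of a single expression that returns one sequence of six items of mixed types:

from jsoniq import RumbleSession

# Start (or reuse) a RumbleDB session.
rumble = RumbleSession.builder.getOrCreate()

# One expression, one sequence: three integers, a string, an object and an array.
seq = rumble.jsoniq('(1 to 3, "four", { "five" : 5 }, [ 6, 7 ])')
print(seq.json())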
When passing Python values to JSONiq or getting them back from JSONiq queries, the mapping to and from Python is as follows:
You can use RumbleDB from within Python programs by installing the jsoniq pip package (pip install jsoniq).
Important note: since the jsoniq package depends on pyspark 4, Java 17 or Java 21 is a requirement. If another version of Java is installed, the execution of a Python program attempting to create a RumbleSession will lead to an error message on stderr that contains explanations.
You can control your Java version with:
Information about how this package is used can be found in this section.
Python value    JSONiq item
list            array item
dict            object item
tuple           sequence of items
str             string item
int             integer item
bool            boolean item
None            null item

Furthermore, other JSONiq types will be mapped to string literals. Users who want to preserve JSONiq types can use the Item API instead.
JSONiq is very powerful and expressive. You will find tutorials as well as a reference on JSONiq.org.
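As a small illustrative sketch of this mapping (written for this page, not taken from the original text), a Python tuple bound to a JSONiq variable comes back through json() with each value converted according to the table above:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# $x is bound to a Python tuple: each element becomes one JSONiq item
# (string, integer, boolean, null, array, object), and json() maps them back.
result = rumble.jsoniq('$x', x=("hello", 42, True, None, [1, 2], {"a": 1}))
print(result.json())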
JSONiq 1.0 is the first version of the JSONiq language, currently in use.
It is a cousin of the XQuery 3.0 language and was developed by W3C XML Query Working Group members as a proposal of how to integrate JSON support into the language, while making it appealing to the JSON community, and making it easy for an existing XQuery engine to implement.
brew tap rumbledb/rumble

brew install --build-from-source rumble

rumbledb run -q '1+1'

rumbledb repl

"Hello, World"

Some users who have already configured a Spark installation on their machine may encounter a version issue if SPARK_HOME points to this alternate installation and it is a different version of Spark (e.g., 3.5 or 3.4). The jsoniq package requires Spark 4.0.
If this happens, RumbleDB should output an informative error message. There are two ways to fix such conflicts:
The easiest is to remove the SPARK_HOME environment variable completely. This will have RumbleDB fall back to the Spark 4.0 installation that ships with its pyspark dependency.
Or you can instead change the value of SPARK_HOME to point to a Spark 4.0 installation, if you have one. This would be for more advanced users who know what they are doing.
If you have another working Spark installation on your machine, you can see which version it is with
The above command is of course expected not to work for first-time users who only installed the jsoniq package and never installed Spark additionally on their machine.
pip install jsoniq

java -version

 ____ __ __ ____ ____
/ __ \__ ______ ___ / /_ / /__ / __ \/ __ )
/ /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __ | The distributed JSONiq engine
/ _, _/ /_/ / / / / / / /_/ / / __/ /_/ / /_/ / 2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/
Master: local[*]
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -
rumble$ 1 + 1
(3 * 4) div 5
spark-submit --version

You can use RumbleDB from within Python programs by installing the jsoniq pip package (pip install jsoniq).
Important note: since the jsoniq package depends on pyspark 4, Java 17 or Java 21 is a requirement. If another version of Java is installed, the execution of a Python program attempting to create a RumbleSession will lead to an error message on stderr that contains explanations.
You can control your Java version with:
Information about how this package is used can be found in this section.
Some advanced users who have already configured a Spark installation on their machine may encounter a version issue if SPARK_HOME points to this alternate installation, and it is a different version of Spark (e.g., 3.5 or 3.4). The jsoniq package requires Spark 4.0.
If this happens, RumbleDB should output an informative error message. There are two ways to fix such conflicts:
The easiest is to remove the SPARK_HOME environment variable completely. This will have RumbleDB fall back to the Spark 4.0 installation that ships with its pyspark dependency.
Or you can instead change the value of SPARK_HOME to point to a Spark 4.0 installation, if you have one. This would be for more advanced users who know what they are doing.
If you have another working Spark installation on your machine, you can see which version it is with
The above command is of course expected not to work for first-time users who only installed the jsoniq package and never installed Spark additionally on their machine.
A RumbleSession is a wrapper around a SparkSession that additionally makes sure the RumbleDB environment is in scope.
JSONiq queries are invoked with rumble.jsoniq() in a way similar to the way Spark SQL queries are invoked with spark.sql().
JSONiq variables can be bound to lists of JSON values (str, int, float, True, False, None, dict, list) or to Pyspark DataFrames. A JSONiq query can use as many variables as needed (for example, it can join between different collections).
It will later also be possible to read tables registered in the Hive metastore, similar to spark.sql(). Alternatively, the JSONiq query can also read many files of many different formats from many places (local drive, HTTP, S3, HDFS, ...) directly with simple builtin functions such as json-lines(), text-file(), parquet-file(), csv-file(), etc.
The resulting sequence of items can be retrieved as a list of JSON values, as a Pyspark DataFrame, or, for advanced users, as an RDD or with a streaming iteration over the items using the local iterator API.
It is also possible to write the sequence of items to the local disk, to HDFS, to S3, etc in a way similar to how DataFrames are written back by Pyspark.
The design goal is that it is possible to chain DataFrames between JSONiq and Spark SQL queries seamlessly. For example, JSONiq can be used to clean up very messy data and turn it into a clean DataFrame, which can then be processed with Spark SQL, spark.ml, etc.
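As an illustration of this chaining (a sketch only, using a hypothetical input file messy-data.json rather than a dataset from the original documentation), a JSONiq query can normalize messy records into a DataFrame that Spark SQL then consumes:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# Hypothetical messy JSON Lines input: records with missing fields and ages stored as strings.
seq = rumble.jsoniq("""
for $r in json-lines("messy-data.json")
where exists($r.name) and exists($r.age)
return { "name" : $r.name, "age" : $r.age cast as integer }
""")

# If RumbleDB inferred a schema, the cleaned result can flow straight into Spark SQL.
if "DataFrame" in seq.availableOutputs():
    df = seq.df()
    df.createTempView("people")
    rumble.sql("SELECT name FROM people WHERE age >= 18").show()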
Any feedback or error reports are very welcome.
There are many ways to install and use RumbleDB. For example:
By simply using one of our online sandboxes (Jupyter notebook or simple sandbox page)
Our newest library: by installing a pip package (pip install jsoniq)
By running the standalone RumbleDB jar with Java on your laptop
By installing with homebrew
By installing Spark yourself on your laptop (for more control on Spark parameters) and use a small RumbleDB jar with spark-submit
By using our docker image on your laptop (go to the "Run with docker" section on the left menu)
By uploading the small RumbleDB jar to an existing Spark cluster (such as AWS EMR)
By running RumbleDB as an HTTP server in the background and connecting to it in a Jupyter notebook with the %%jsoniq magic.
By installing it manually on your machine.
After installing RumbleDB, further steps could involve:
Learning JSONiq. More details can be found in the JSONiq section of this documentation, as well as in the tutorials and reference available on JSONiq.org.
Storing some data on S3, creating a Spark cluster on Amazon EMR (or Azure blob storage and Azure, etc), and querying the data with RumbleDB. More details are found in the cluster section of this documentation.
Using RumbleDB with Jupyter notebooks. For this, you can run RumbleDB as a server with a simple command, and get started by downloading the example Jupyter notebook and just clicking your way through it. More details are found in the Jupyter notebook section of this documentation. Jupyter notebooks work both locally and on a cluster.
You need to make sure that you have Java 11 or 17 and that, if you have several versions installed, JAVA_HOME correctly points to Java 11 or 17.
RumbleDB works with both Java 11 and Java 17. You can check the Java version that is configured on your machine with:
If you do not have Java, you can download and install version 11 or 17.
Do make sure it is not Java 8, which will not work.
The Python edition of RumbleDB can be used to directly write JSONiq queries in Jupyter notebook cells. This is explained further down in this documentation. You first need to install the jsoniq library as described earlier on this page.
pip install jsoniq

java -version

JSONiq is a query and processing language specifically designed for the popular JSON data model. The main ideas behind JSONiq are based on lessons learned in more than 30 years of relational query systems and more than 15 years of experience with designing and implementing query languages for semi-structured data like XML and RDF.
The main source of inspiration behind JSONiq is XQuery, which has so far proven to be a successful and productive query language for semi-structured data (in particular XML). JSONiq borrowed a large number of ideas from XQuery, such as the structure and semantics of a FLWOR construct, the functional aspect of the language, the semantics of comparisons in the face of data heterogeneity, and declarative, snapshot-based updates. However, unlike XQuery, JSONiq is not concerned with the peculiarities of XML, like mixed content, ordered children, the confusion between attributes and elements, the complexities of namespaces and QNames, or the complexities of XML Schema, and so on.
The power of XQuery's FLWOR construct and its functional aspect, combined with the simplicity of the JSON data model, result in a clean, sleek and easy-to-understand data processing language. As a matter of fact, JSONiq is a language that can do more than queries: it can describe powerful data processing programs, from transformations, selections and joins of heterogeneous data sets to data enrichment, information extraction, data cleaning, and so on.
Technically, the main characteristics of JSONiq (and XQuery) are the following:
It is a set-oriented language. While most programming languages are designed to manipulate one object at a time, JSONiq is designed to process sets (actually, sequences) of data objects.
It is a functional language. A JSONiq program is an expression; the result of the program is the result of the evaluation of the expression. Expressions have a fundamental role in the language: every language construct is an expression, and expressions are fully composable.
It is a declarative language. A program specifies what result is being calculated and does not specify low-level algorithms such as the sort algorithm, whether an algorithm is executed in main memory or externally, on a single machine or parallelized on several machines, or which access patterns (aka indexes) are used during the evaluation of the program. Such implementation decisions should be taken automatically by an optimizer, based on the physical characteristics of the data and of the hardware environment, just like a traditional database would do. The language has been designed from day one with optimizability in mind.
It is designed for nested, heterogeneous, semi-structured data. Data structures in JSON can be nested with arbitrary depth, do not have a specific type pattern (i.e., they are heterogeneous), and may or may not have one or more schemas that describe the data. Even when there is a schema, it can be open and/or only partially describe the data. Unlike SQL, which is designed to query flat, homogeneous, tabular structures, JSONiq has been designed from scratch as a query language for nested and heterogeneous data (see the short example below).
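As a short example of these characteristics (a sketch written for this page, not taken from the JSONiq specification; it assumes the jsoniq Python package, and the query string itself is plain JSONiq), the following FLWOR expression declaratively processes a small, heterogeneous sequence of objects in which not every object has the same fields:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# The objects do not share a schema: the query simply skips those without a city.
print(rumble.jsoniq("""
let $people := (
  { "name" : "Alice", "city" : "Zurich" },
  { "name" : "Bob" },
  { "name" : "Charlie", "city" : "Zurich", "age" : 35 }
)
for $p in $people
where exists($p.city)
order by $p.name
return { "name" : $p.name, "city" : $p.city }
""").json())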
RumbleDB is a querying engine that allows you to query your large, messy datasets with ease and productivity. It covers the entire data pipeline: clean up, structure, normalize, validate, convert to an efficient binary format, and feed it right into Machine Learning estimators and models, all within the JSONiq language.
RumbleDB supports JSON-like datasets including JSON, JSON Lines, Parquet, Avro, SVM, CSV, ROOT as well as text files, of any size from kB to at least the two-digit TB range (we have not found the limit yet).
RumbleDB is both good at handling small amounts of data on your laptop (in which case it simply runs locally and efficiently in a single thread) and at handling large amounts of data by spreading computations over your laptop's cores, or onto a large cluster (in which case it leverages Spark automagically).
RumbleDB can also be used to easily and efficiently convert data from one format to another, including from JSON to Parquet thanks to JSound validation.
It runs on many local or distributed filesystems such as HDFS, S3, Azure blob storage, and HTTP (read-only), and of course your local drive as well. You can use any of these file systems to store your datasets, but also to store and share your queries and functions as library modules with other users, worldwide or within your institution, who can import them with just one line of code. You can also output the results of your query or the log to these filesystems (as long as you have write access).
With RumbleDB, queries can be written in the tailor-made and expressive JSONiq language. Users can write their queries declaratively and start with just a few lines. No need for complex JSON parsing machinery as JSONiq supports the JSON data model natively.
The core of RumbleDB lies in JSONiq's FLWOR expressions, the semantics of which map beautifully to DataFrames and Spark SQL. Likewise, expression semantics are seamlessly translated to transformations on RDDs or DataFrames, depending on whether a structure is recognized or not. Transformations are not exposed as function calls, but are completely hidden behind JSONiq queries, giving the user the simplicity of an SQL-like language and the flexibility needed to query heterogeneous, tree-like data that does not fit in DataFrames.
This documentation provides you with instructions on how to get started, examples of data sets and queries that can be executed locally or on a cluster, links to JSONiq reference and tutorials, notes on the function library implemented so far, and instructions on how to compile RumbleDB from scratch.
Please note that this is a (maturing) beta version. We welcome bug reports in the GitHub issues section.
At the end of an updating program, the resulting PUL is applied with upd:applyUpdates (part of the XQuery Update Facility standard), which is extended as follows:
First, before applying any update, each update primitive (except the jupd:insert-into-object primitives, which do not have any target) locks onto its target by resolving the selector on the object or array it updates. If the selector is resolved to the empty sequence, the update primitive is ignored in step 2. After this operation, each of these update primitives will contain a reference to either the pair (for an object) or the value (for an array) on or relatively to which it operates.
Then each update primitive is applied, using the target references that were resolved at step 1. The order in which they are applied is not relevant and does not affect the final instance of the data model. After applying all updates, an error jerr:JNUP0006 is raised upon pair name collision within the same object.
RumbleDB is just a download and no installation is required.
In order to run RumbleDB, you simply need to download rumbledb-2.0.0-standalone.jar from the download page and put it in a directory of your choice, for example, right besides your data.
Make sure to use the corresponding jar name accordingly in all our instructions in lieu of rumbledb.jar.
You can test that it works with:
or launch a JSONiq shell with:
If you run out of memory, you can allocate more memory to Java with an additional Java parameter, e.g., -Xmx10g.
The RumbleDB shell appears:
You can now start typing simple queries like the following few examples. Press the return key three times to execute a query.
Javadoc
If you plan to add the jar to your Java environment to use RumbleDB in your Java programs, the JavaDoc documentation can be found here. The entry point is the class org.rumbledb.api.Rumble.
java -version

spark-submit --version

java -jar rumbledb-2.0.0-standalone.jar run -q '1+1'

java -jar rumbledb-2.0.0-standalone.jar repl

 ____ __ __ ____ ____
/ __ \__ ______ ___ / /_ / /__ / __ \/ __ )
/ /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __ | The distributed JSONiq engine
/ _, _/ /_/ / / / / / / /_/ / / __/ /_/ / /_/ / 2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/
Master: local[*]
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -
rumble$

"Hello, World"

1 + 1
(3 * 4) div 5
This method gives you more control over the Spark configuration than the experimental standalone jar; in particular, you can increase the memory used, change the number of cores, and so on.
Florian Kellner also kindly contributed an installation script for Linux users that roughly takes care of what is described below for you.
RumbleDB requires an Apache Spark installation on Linux, Mac or Windows. Important note: it needs to be either Spark 4, or the Scala 2.13 build of Spark 3.5.
It is straightforward to download Spark directly, unpack it, and put it at a location of your choosing. We recommend picking Spark 4.0.0.
You then need to point the SPARK_HOME environment variable to this directory, and to additionally add the subdirectory "bin" within the unpacked directory to the PATH variable. On macOS, this is done by adding two export lines to the file .zshrc in your home directory: one setting SPARK_HOME appropriately to match your unzipped Spark directory, and one prepending $SPARK_HOME/bin to PATH. You then force the change by sourcing .zshrc in the shell. On Windows, changing the PATH variable is done in the control panel. On Linux, it is similar to macOS.
As an alternative, users who love the command line can also install Spark with a package management system instead, such as brew (on macOS) or apt-get (on Ubuntu). However, these might be less predictable than a raw download.
You can test that Spark was correctly installed with:
You need to make sure that you have Java 11 (for Spark 3.5) or 17 (for Spark 3.5 or 4.0) or 21 (for Spark 4.0) and that, if you have several versions installed, JAVA_HOME correctly points to the correct Java installation. Spark only supports Java 11 or 17 or 21 depending on the version.
Spark 4+ is documented to work with both Java 17 and Java 21. If there is an issue with the Java version, RumbleDB will inform you with an appropriate error message. You can check the Java version that is configured on your machine with:
Like Spark, RumbleDB is just a download and no installation is required.
In order to run RumbleDB, you simply need to download one of the small .jar files from the and put it in a directory of your choice, for example, right besides your data.
If you use Spark 3.5, use rumbledb-2.0.0-for-spark-3.5-scala-2.13.jar.
If you use Spark 4.0, use rumbledb-2.0.0-for-spark-4.0.jar.
These jars do not embed Spark, since you chose to set it up separately. They will work with your Spark installation with the spark-submit command.
In all our instructions, replace rumbledb.jar with the actual name of the jar file you downloaded.
In a shell, from the directory where the RumbleDB .jar lies, type, all on one line:
replacing rumbledb.jar with the actual name of the jar file you downloaded.
The RumbleDB shell appears:
You can now start typing simple queries like the following few examples. Press the return key three times to execute a query.
RumbleDB can be run as an HTTP server that listens for queries. In order to do so, you can use the --server and --port parameters:
This command will not return until you force it to (Ctrl+C on Linux and Mac). This is because the server has to run permanently to listen to incoming requests.
Most users will not have to do anything beyond running the above command. For most of them, the next step would be to open a Jupyter notebook that connects to this server automatically.
This HTTP server is built as a basic server for the single-user use case, i.e., the user runs their own RumbleDB server on their laptop or cluster and connects to it via their Jupyter notebook, one query at a time. Some of our users have more advanced needs, or have a larger user base, and typically prefer to implement their own HTTP server, launching RumbleDB queries either via the public RumbleDB Java API (like the basic HTTP server does -- so its code can serve as a demo of the Java API) or via the RumbleDB CLI.
Caution! Launching a server always has consequences for security, especially as RumbleDB can read from and write to your disk, so make sure you activate your firewall. In later versions, we may support authentication tokens.
The HTTP server is meant not to be used directly by end users, but instead to make it possible to integrate RumbleDB in other languages and environments, such as Python and Jupyter notebooks.
To test that the server is running, you can try the following address in your browser, assuming you have a query stored locally at /tmp/query.jq. All queries have to go to the /jsoniq path.
The request returns a JSON object, and the resulting sequence of items is in the values array.
Almost all parameters from the command line are exposed as HTTP parameters.
A query can also be submitted in the request body, as in the curl example shown further down this page, or programmatically as sketched below.
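For illustration, here is a sketch of submitting a query in the request body from Python using only the standard library (the endpoint, port and response shape follow the curl example and JSON response shown elsewhere on this page):

import json
import urllib.request

# Assumes a RumbleDB HTTP server is listening on port 8001
# (for example: spark-submit rumbledb.jar serve -p 8001).
request = urllib.request.Request(
    "http://localhost:8001/jsoniq",
    data="1+1".encode("utf-8"),
    method="POST",
)
with urllib.request.urlopen(request) as response:
    body = json.loads(response.read().decode("utf-8"))

# The resulting sequence of items is in the "values" array of the returned JSON object.
print(body["values"])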
With the HTTP server running, if you have installed Python and Jupyter notebooks (for example with the Anaconda data science package that does all of it automatically), you can create a RumbleDB magic by just executing the following code in a cell:
Where, of course, you need to adapt the port (8001) to the one you picked previously.
Then, you can execute queries in subsequent cells with:
or on multiple lines:
You can also let RumbleDB run as an HTTP server on the master node of a cluster, e.g. on Amazon EMR or Azure. You just need to:
Create the cluster (it is usually just the push of a few buttons in Amazon or Azure)
Wait for a few minutes
Make sure that your own IP has incoming access to EMR machines by configuring the security group properly. You usually only need to do so the first time you set up a cluster (if your IP address remains the same), because the security group configuration will be reused for future EMR clusters.
Then there are two options
Connect to the master with SSH with an extra parameter for securely tunneling the HTTP connection (for example -L 8001:localhost:8001 or any port of your choosing)
Download the RumbleDB jar to the master node
wget https://github.com/RumbleDB/rumble/releases/download/v1.24.0/rumbledb-1.24.0.jar
Launch the HTTP server on the master node (it will be accessible under http://localhost:8001/jsoniq).
There is also another way that does not need any tunnelling: you can specify the hostname of your EC2 machine (copied over from the EC2 dashboard) with the --host parameter. For example, with the placeholder <ec2-hostname>:
You also need to make sure in your EMR security group that the chosen port (e.g., 8001) is accessible from the machine in which you run your Jupyter notebook. Then, you can point your Jupyter notebook on this machine to http://<ec2-hostname>:8001/jsoniq.
Be careful not to open this port to the whole world, as queries can be sent that read and write to the EC2 machine and anything it has access to (like S3).
There are several ways to get back the output of the JSONiq query. There are many examples of use further down this page.
- availableOutputs(): returns a list that helps you understand which output methods you can call. The strings in this list can be Local, RDD, DataFrame, or PUL.
- json() (requires Local): returns the results as a tuple containing dicts, lists, strs, ints, floats, booleans, Nones. Limitation: sequence length below the materialization cap (the default is 200, but it can be increased in the RumbleDB configuration).
- df() (requires DataFrame, i.e., RumbleDB was able to infer an output schema): returns the results as a pyspark data frame. No limitation, but beyond a billion items, you should use a Spark cluster.
- pdf() (requires DataFrame, i.e., RumbleDB was able to infer an output schema): returns the results as a pandas data frame. The results should fit in your computer's memory.
- rdd() (requires RDD): returns the results as an RDD containing dicts, lists, strs, ints, floats, booleans, Nones (experimental). No limitation, but beyond a billion items, you should use a Spark cluster.
- items() (requires Local): returns the results as a list containing Java Item objects that can be queried with the RumbleDB Item API; these contain more information and more accurate typing. Limitation: sequence length below the materialization cap (the default is 200, but it can be increased in the RumbleDB configuration).
- open(), hasNext(), nextJSON(), close() (requires Local): allows streaming (with no limitation of length) through individual items as dicts, lists, strs, ints, floats, booleans, Nones. No limitation, as long as you go through the stream without saving all past items.
- open(), hasNext(), next(), close() (requires Local): allows streaming (with no limitation of length) through individual items as Java Item objects that can be queried with the RumbleDB Item API; these contain more information and more accurate typing. No limitation, as long as you go through the stream without saving all past items.
- applyUpdates() (requires PUL): persists the Pending Update List produced by the query (to the Delta Lake or a table registered in the Hive metastore).
RumbleDB can work out of the box with pandas DataFrames, both as input and (when the output has a schema) as output.
bind() also accepts pandas dataframes:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [30, 25, 35]};
pdf = pd.DataFrame(data);
rumble.bind('$a', pdf);
seq = rumble.jsoniq('$a.Name')

The same goes for extra named parameters:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [30, 25, 35]};
pdf = pd.DataFrame(data);
seq = rumble.jsoniq('$a.Name', a=pdf)

It is also possible to get the results back as a pandas dataframe with pdf() (if the output has a schema, which you can check by calling availableOutputs() and seeing if "DataFrame" is in the returned list).
We show here how to install RumbleDB from the GitHub repository and build it yourself if you wish to do so (for example, to use the latest master). However, the easiest way to use RumbleDB is to simply download the already compiled .jar files.
The following software is required:
Java: the version of Java is important, as RumbleDB only works with Java 11 (Standalone or Spark 3.5), 17 (Standalone or Spark 3.5 or Spark 4 or Python) or 21 (Spark 4 or Python). The current master branch corresponds to Spark 4.0, meaning that Java 17 or 21 is required.
The syntax to start a session is similar to that of Spark. A RumbleSession is a SparkSession that additionally knows about RumbleDB. All attributes and methods of SparkSession are also available on RumbleSession.
Even though RumbleDB uses Spark internally, it can be used without any knowledge of Spark.
Executing a query is done with rumble.jsoniq() like so.
A query returns a sequence of items, here the sequence with just the integer item 2.
There are several ways to retrieve the results of the query. Calling json() is just one of them: it retrieves the sequence as a tuple of JSON values that Python can process. The detailed mapping between JSONiq items and Python values is documented earlier on this page. Other methods for retrieving the output are listed further down.
RumbleDB can work out of the box with pyspark DataFrames, both as input and (when the output has a schema) as output.
The power users can also interface our library with pyspark DataFrames. JSONiq sequences of items can have billions of items, and our library supports this out of the box: it can also run on clusters on AWS Elastic MapReduce for example. But your laptop is just fine, too: it will spread the computations on your cores. You can bind a DataFrame to a JSONiq variable. JSONiq will recognize this DataFrame as a sequence of object items.
Creating a data frame is also similar to Spark (but using the rumble object).
This is how to bind a JSONiq variable to a dataframe. You can bind as many variables as you want.
It is possible to bind a JSONiq variable to a tuple of native Python values and then use it in a query. In JSONiq, variables are bound to sequences of items, just like the results of JSONiq queries are sequences of items. A Python tuple will be seamlessly converted to a sequence of items by the library. Currently we only support strs, ints, floats, booleans, None, and (recursively) lists and dicts. But if you need more (like dates, bytes, etc.), we will add them without any problem; JSONiq has a rich type system.
Values can be passed with extra named parameters, like so.
It is also possible to bind variables more durably (across multiple jsoniq() calls) with bind().
It is possible to bind only one value; then it must be provided as a singleton tuple. This is because, in JSONiq, an item is the same as a sequence of one item.
For convenience and code readability, you can also use bindOne().
A variable that was durably bound with bind() or bindOne() can be unbound with unbind().
The Python edition of RumbleDB comes out of the box with a JSONiq magic.
If you are in a Jupyter notebook and have installed the jsoniq pip package, you can activate the jsoniq magic with:
Then, you can run JSONiq in standalone cells and see the results:
Of course, you can still continue to use rumble.jsoniq() calls and process the outputs as you see fit.
An example of the magic in action is available in our example Jupyter notebooks.
Note: This is a different magic than the magic that works with the RumbleDB HTTP server. It is more modern and running a server is no longer needed with this different magic. It suffices to install the jsoniq Python package.
Generally, it is possible to write output to disk using the pandas DataFrame API, the pyspark DataFrame API, or Python's own facilities for writing JSON values to disk.
For convenience, we provide a way to also directly do so with the sequence object output by the query.
It is possible to write the output to a file locally or on a cluster. The API is similar to that of Spark dataframes. Note that it creates a directory and stores the (potentially very large) output in a sharded directory. RumbleDB was already tested with up to 64 AWS machines and 100s of TBs of data.
Of course the examples below are so small that it makes more sense to process the results locally with Python, but this shows how GBs or TBs of data obtained from JSONiq can be written back to disk.
Updates can be applied to a clone of an existing instance with the copy-modify-return expression.
The content of the modify clause may build a complex Pending Update List with multiple updates. Remember that, with snapshot semantics, each update is applied against the initial snapshot, and updates do not see each other's effects.
Updating expressions can also be combined with conditional expressions (in the then and else clauses), switch expressions (in the return clauses), FLWOR expressions (in the return clause), etc., for more powerful queries based on patterns in the available data (from any source visible to the JSONiq query).
The updates generated inside the modify clause may only target the cloned object, i.e., the variable specified in the copy clause.
Example 191. JSON copy-modify-return expression

copy $obj := { "foo" : "bar", "bar" : [ 1,2,3 ] }
modify (
insert json { "bar" : 123, "foobar" : [ true, false ] } into $obj,
delete json $obj.bar,
replace value of json $obj.foo with true
)
return $obj

Result: { "foo" : true, "bar" : 123, "foobar" : [ true, false ] }
It is possible to access RumbleDB's advanced configuration parameters with rumble.getRumbleConf().
Then, you can change the value of some parameters. For example, you can increase the number of JSON values that you can retrieve with a json() call:
You can also configure RumbleDB to output verbose information about the internal query plan, type and mode detection, and optimizations. This can be of interest to data engineers or researchers to understand how RumbleDB works.
The complete API for configuring RumbleDB is accessible in our API documentation pages. These methods are also callable in Python.
Warning: some of the configuration methods do not make sense in Python and are specific to the command line edition of RumbleDB (such as setting the query content or an output path and input/output format). Also, setting external variables in Python should not be done via the configuration, but with the bind() and unbind() functions or extra parameters in jsoniq() calls.
RumbleDB uses the following software:
ANTLR v4 Framework - BSD License
Apache Commons Text - Apache License
Apache Commons Lang - Apache License
JSONiq follows the XQuery Update Facility and introduces update primitives and update expressions specific to JSON data.
In JSONiq, updates are not immediately applied. Rather, a snapshot of the current data is taken, and a list of updates, called the Pending Update List, is collected. Then, upon explicit request by the user (via specific expressions), the Pending Update List is applied atomically, leading to a new snapshot. It is also possible for an engine to persist (to the local disk, to a database management system, to a data lake...) the resulting Pending Update List after a query has been completed.
In the middle of a program, several PULs can be produced against the same snapshot. They are then merged with upd:mergeUpdates (part of the XQuery Update Facility standard), which is extended as follows.
Several deletes on the same object are replaced with a unique delete on that object, with a list of all selectors (names) to be deleted, where duplicates have been eliminated.
Several deletes on the same array and selector (position) are replaced with a unique delete on that array and with that selector.
Several inserts on the same array and selector (position) are equivalent to a unique insert on that array and selector with the content of those original inserts appended in an implementation-dependent order (like XQUF).
Apache HTTP client - Apache License
gson - Apache License
JLine terminal framework - BSD License
Kryo serialization framework - BSD License
Laurelin (ROOT parser) - BSD-3
Spark Libraries - Apache License
As well as the JSONiq language - CC BY-SA 3.0 License
Several inserts on the same object are equivalent to a unique insert where the objects containing the pairs to insert are merged. An error jerr:JNUP0005 is raised if a collision occurs.
Several replaces on the same object or array and with the same selector raise an error jerr:JNUP0009.
Several renames on the same object and with the same selector raise an error jerr:JNUP0010.
If there is a replace and a delete on the same object or array and with the same selector, the replace is omitted in the merged PUL.
If there is a rename and a delete on the same object or array and with the same selector, the rename is omitted in the merged PUL.
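To illustrate the merge rules above with a runnable sketch (assuming a RumbleDB build that supports JSON updates, via the jsoniq Python package; this example is not taken from the specification), two delete primitives targeting the same object within one modify clause behave like a single delete with both selectors:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# Both deletes are collected against the same snapshot and merged before being applied.
print(rumble.jsoniq("""
copy $o := { "a" : 1, "b" : 2, "c" : 3 }
modify (delete json $o.a, delete json $o.b)
return $o
""").json())
# This should leave only { "c" : 3 }.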
JSONiq 3.1 is an initiative of the RumbleDB team that aligns JSONiq more closely with XQuery 3.1, which has now become a W3C recommendation, while keeping what makes it JSONiq: the flagship feature is the ability to copy-paste JSON into a JSONiq query, together with a navigation syntax that appeals to the JSON community.
JSONiq 3.1 does not require a distinct data model (JDM), since XQuery 3.1 supports maps and arrays. As a result, JSONiq 3.1 objects are the same as XQuery 3.1 maps, and JSONiq 3.1 arrays are the same as XQuery 3.1 arrays.
JSONiq 3.1 does not require a separate serialization mechanism, since XQuery 3.1 supports the JSON output method.
JSONiq 3.1 benefits from all the map and object builtin functions defined in XQuery 3.1.
JSONiq 3.1 is fully interoperable with XQuery 3.1 and can execute on the same virtual machine (similar to Scala and Java).
This also paves the way for JSONiq 4.0 which will also be aligned with XQuery 4.0 as much as is technically possible.
As a result, the specification for JSONiq 3.1 is even more minimal than that of JSONiq 1.0. This makes it easy for any existing XQuery engine to support it and step into the JSON community.
RumbleDB is slowly deploying the use of JSONiq 3.1, but it will take some time as we make sure to sweep through all corners of the engine.
In JSONiq 3.1, the context item is obtained through $$ and not through a dot.
String literals use JSON escaping instead of XML escaping (backslash, not ampersand).
In map (object) constructors, the "map" keyword in front is optional.
A name test must be prefixed with $$/ and cannot stand on its own.
true and false exist as literals and do not have to be obtained through function calls (true(), false()).
null exists as a literal and stands for the empty sequence.
The dot . and double square brackets [[ ]] act as syntactic sugars for ? lookup.
The data model standardized by the W3C working group is more generic and allows for atomic object keys that are not necessarily strings (dates, etc). Also, an object value or an array value can be a sequence of items and does not need to be a single item. The particular case in which object keys are strings and values are single items (or empty) corresponds to the JSON use.
Null does not exist as its own type in JSONiq 3.1, instead it is mapped to the empty sequence.
There are other minor changes in semantics that correspond to the alignment with XQuery 3.1 such as Effective Boolean Values, comparison, etc.
The JSON update syntax was not integrated yet into the core language. This is planned, and the syntax will be simplified (no json keyword, dot lookup allowed here as well).
The semantics for the JSON serialization method is the same as in the JSONiq Extension to XQuery. It is still under discussion how to escape special characters with the Text output method.
And then use Jupyter notebooks in the same way you would do it locally (it magically works because of the tunneling)
print(seq.pdf())

print(rumble.jsoniq("""
for $v in $c
let $parity := $v mod 2
group by $parity
return { switch($parity)
case 0 return "even"
case 1 return "odd"
default return "?" : $v
}
""", c=(1,2,3,4, 5, 6)).json())
print(rumble.jsoniq("""
for $i in $c
return [
for $j in $i
return { "foo" : $j }
]
""", c=([1,2,3],[4,5,6])).json())
print(rumble.jsoniq('{ "results" : $c.foo[[2]] }',
c=({"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]})).json())

rumble.bind('$c', (1,2,3,4, 5, 6))
print(rumble.jsoniq("""
for $v in $c
let $parity := $v mod 2
group by $parity
return { switch($parity)
case 0 return "even"
case 1 return "odd"
default return "?" : $v
}
""").json())
print(rumble.jsoniq("""
for $v in $c
let $parity := $v mod 2
group by $parity
return { switch($parity)
case 0 return "gerade"
case 1 return "ungerade"
default return "?" : $v
}
""").json())
rumble.bind('$c', ([1,2,3],[4,5,6]))
print(rumble.jsoniq("""
for $i in $c
return [
for $j in $i
return { "foo" : $j }
]
""").json())
rumble.bind('$c', ({"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]}))
print(rumble.jsoniq('{ "results" : $c.foo[[2]] }').json())

seq = rumble.jsoniq("$a.Name");
seq.write().mode("overwrite").json("outputjson");
seq.write().mode("overwrite").parquet("outputparquet");
seq = rumble.jsoniq("1+1");
seq.write().mode("overwrite").text("outputtext");

Spark, version 4.0.0 (for example)
Ant, version 1.10
Maven 3.9.9
Type the following commands to check that the necessary commands are available. If not, you may need to either install the software, or make sure that it is on the PATH.
You first need to download the rumble code to your local machine.
In the shell, go to the desired location:
Clone the github repository:
Go to the root of this repository:
You can compile the entire project like so:
After successful completion, you can check the target directory, which should contain the compiled classes as well as the JAR file rumbledb-2.0.0-jar-with-dependencies.jar.
The most straightforward way to test whether the above steps were successful is to run the RumbleDB shell locally, like so:
The RumbleDB shell should start:
You can now start typing interactive queries. Queries can span over multiple lines. You need to press return 3 times to confirm.
This produces the following results (>>> show the extra, empty lines that appear on the first two presses of the return key).
You can try a few more queries.
This is it: RumbleDB is set up and ready to go locally. You can now move on to a JSONiq tutorial. A RumbleDB tutorial will also follow soon.
You can also try to run the RumbleDB shell on a cluster if you have one available and configured -- this is done with the same command, as the master and deployment mode are usually already set up in cloud-managed clusters. More details are provided in the rest of the documentation.
Below are a few examples showing what is possible with JSONiq. You can learn JSONiq with our interactive tutorial. You will also find a full language reference here as well as a list of builtin functions.
For complex queries, you can use Python's ability to spread strings over multiple lines, and with no need to escape special characters.
from jsoniq import RumbleSession
rumble = RumbleSession.builder.getOrCreate();
items = rumble.jsoniq('1+1')
python_tup = items.json()
print(python_tup)

You can also, instead of the bind() call, pass the pyspark DataFrame directly in jsoniq() with an extra named parameter:
There are several ways to collect the outputs, depending on the user needs but also on the query supplied. The following method returns a list containing one or several of "DataFrame", "RDD", "PUL", "Local".
If DataFrame is in the list, df() can be invoked.
If RDD is in the list, rdd() can be invoked.
If Local is in the list, items() or json() can be invoked, as well as the local iterator API.
If the output of the JSONiq query is structured (i.e., RumbleDB was able to detect a schema), then we can extract a regular data frame that can be further processed with spark.sql() or rumble.jsoniq().
We are continuously working on the detection of schemas, and RumbleDB will get better at it over time. JSONiq is a very powerful language and can also produce heterogeneous output "by design". Then you need to use rdd() instead of df(), or to collect the list of JSON values (see further down). Remember that availableOutputs() tells you what is at your disposal.
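As a small sketch of this (written for illustration, not taken from the original page), a deliberately heterogeneous query result typically offers Local (and possibly RDD) outputs rather than a DataFrame:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# Mixed result: an integer, a string and an object; no common schema to infer.
seq = rumble.jsoniq('(1, "two", { "three" : 3 })')
print(seq.availableOutputs())  # check what is at your disposal before choosing a method
print(seq.json())              # this small local result can be materialized as JSON values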
A DataFrame output by JSONiq can be reused as input to a Spark SQL query.
(Remember that rumble is a wrapper around a SparkSession object, so you can use rumble.sql() just like spark.sql())
A DataFrame output by Spark SQL can be reused as input to a JSONiq query.
And a DataFrame output by JSONiq can be reused as input to another JSONiq query.
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)];
columns = ["Name", "Age"];
df = spark.createDataFrame(data, columns);

rumble.bind('$a', df);

By default, the output will be in the form of serialized JSON values. If the output is structured, then you can change this default behavior to show it in the form of a DataFrame instead.
For a pandas DataFrame:
For a pyspark DataFrame:
Note that it will not work in all cases. If the output is not fully structured or RumbleDB is unable to infer a DataFrame schema, you can specify the schema yourself. The schema language is called JSound and you will find a tutorial here.
It is possible to measure the response time with the -t parameter:
%load_ext jsoniqmagic

%%jsoniq
{"foobar":1}

If you get an out-of-memory error, it is possible to allocate memory when you build the Rumble session with a config() call. This is exactly the same way it is done when building a Spark session. The config() call can of course be used in combination with any other method calls that are part of the builder chain (withDelta(), appName(), config(), etc).
For example:

conf = rumble.getRumbleConf()
conf.setResultSizeCap(1000)

export SPARK_HOME=/path/to/spark-4.0.0-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH

. ~/.zshrc

spark-submit --version

java -version

spark-submit rumbledb.jar repl

 ____ __ __ ____ ____
/ __ \__ ______ ___ / /_ / /__ / __ \/ __ )
/ /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __ | The distributed JSONiq engine
/ _, _/ /_/ / / / / / / /_/ / / __/ /_/ / /_/ / 2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/
Master: local[*]
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -
rumble$

"Hello, World"

1 + 1
(3 * 4) div 5
spark-submit rumbledb.jar serve -p 8001

http://localhost:8001/jsoniq?query-path=/tmp/query.jq

{ "values" : [ "foo", "bar" ] }

curl -X POST --data '1+1' http://localhost:8001/jsoniq

!pip install rumbledb
%load_ext rumbledb
%env RUMBLEDB_SERVER=http://localhost:8001/jsoniq

%jsoniq 1 + 1

%%jsoniq
for $doc in json-lines("my-file")
where $doc.foo eq "bar"
return $doc
spark-submit rumbledb.jar serve -p 8001 -h <ec2-hostname>

rumble.bind('$c', (42,))
print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())

rumble.bindOne('$c', 42)
print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())

rumble.unbind('$c')

$ java -version
$ mvn --version
$ ant -version
$ spark-submit --version

$ cd some_directory

$ git clone https://github.com/RumbleDB/rumble.git

$ cd rumble

$ mvn clean compile assembly:single

$ spark-submit target/rumbledb-2.0.0-jar-with-dependencies.jar repl

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
____ __ __ ____ ____
/ __ \__ ______ ___ / /_ / /__ / __ \/ __ )
/ /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __ | The distributed JSONiq engine
/ _, _/ /_/ / / / / / / /_/ / / __/ /_/ / /_/ / 2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/
Master: local[2]
Item Display Limit: 1000
Output Path: -
Log Path: -
Query Path : -
rumble$

rumble$ "Hello, world!"

rumble$ "Hello, world!"
>>>
>>>
Hello, world

rumble$ 2 + 2
>>>
>>>
4
rumble$ 1 to 10
>>>
>>>
( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

seq = rumble.jsoniq("""
let $stores :=
[
{ "store number" : 1, "state" : "MA" },
{ "store number" : 2, "state" : "MA" },
{ "store number" : 3, "state" : "CA" },
{ "store number" : 4, "state" : "CA" }
]
let $sales := [
{ "product" : "broiler", "store number" : 1, "quantity" : 20 },
{ "product" : "toaster", "store number" : 2, "quantity" : 100 },
{ "product" : "toaster", "store number" : 2, "quantity" : 50 },
{ "product" : "toaster", "store number" : 3, "quantity" : 50 },
{ "product" : "blender", "store number" : 3, "quantity" : 100 },
{ "product" : "blender", "store number" : 3, "quantity" : 150 },
{ "product" : "socks", "store number" : 1, "quantity" : 500 },
{ "product" : "socks", "store number" : 2, "quantity" : 10 },
{ "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
for $store in $stores[], $sale in $sales[]
where $store."store number" = $sale."store number"
return {
"nb" : $store."store number",
"state" : $store.state,
"sold" : $sale.product
}
return [$join]
""");
print(seq.json());
seq = rumble.jsoniq("""
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
"store" : $store-number,
"products" : [ distinct-values($product.product) ]
}
""");
print(seq.json());

res = rumble.jsoniq('$a.Name');

res = rumble.jsoniq('$a.Name', a=df);

modes = res.availableOutputs();
for mode in modes:
    print(mode)

df = res.df();
df.show();

df.createTempView("myview")
df2 = spark.sql("SELECT * FROM myview").toDF("name");
df2.show();

rumble.bind('$b', df2);
seq2 = rumble.jsoniq("for $i in 1 to 5 return $b");
df3 = seq2.df();
df3.show();

rumble.bind('$b', df3);
seq3 = rumble.jsoniq("$b[position() lt 3]");
df4 = seq3.df();
df4.show();

%%jsoniq -pdf
for $i in 1 to 10000000
return { "foobar" : $i}

%%jsoniq -df
for $i in 1 to 10000000
return { "foobar" : $i}

%%jsoniq -pdf
declare type local:mytype as {
"product" : "string",
"store-number" : "int",
"quantity" : "decimal"
};
validate type local:mytype* {
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
where $product.quantity ge 995
return $product
}

%%jsoniq -t
for $i in 1 to 10000000
return { "foobar" : $i}

conf.setPrintIteratorTree(True)

rumble = RumbleSession.builder \
    .config("spark.driver.memory", "10g") \
    .getOrCreate()

Update expressions can also appear outside of a copy-modify-return expression, in which case they propagate and/or persist directly to their targets, to the extent that the context makes it meaningful and possible.
This section assumes that you have installed RumbleDB with one of the proposed ways, and guides you through your first queries.
Create, in the same directory as RumbleDB to keep it simple, a file data.json and put the following content inside. This is a small list of JSON objects in the JSON Lines format.
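The exact sample content is not reproduced on this page; as a stand-in, a few JSON Lines objects in the same shape as the products-small.json dataset used elsewhere in this documentation would look like this (one object per line):

{ "product" : "broiler", "store-number" : 1, "quantity" : 20 }
{ "product" : "toaster", "store-number" : 2, "quantity" : 100 }
{ "product" : "socks", "store-number" : 1, "quantity" : 500 }
{ "product" : "blender", "store-number" : 3, "quantity" : 100 }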
If you want to later try a bigger version of this data, you can also download a larger version with 100,000 objects from here. Wait, no, in fact you do not even need to download it: you can simply replace the file path in the queries below with "https://rumbledb.org/samples/products-small.json" and it will just work! RumbleDB feels just at home on the Web.
RumbleDB also scales without any problems to datasets that have millions or (on a cluster) billions of objects, although of course, for billions of objects HDFS or S3 are a better idea than the Web to store your data, for obvious reasons.
In the JSON Lines format that this simple dataset uses, you just need to make sure you have one object on each line (this is different from a plain JSON file, which has a single JSON value and can be indented). Of course, RumbleDB can read plain JSON files, too (with json-doc()), but below we will show you how to read JSON Line files, which is how JSON data scales.
Depending on your installation method, the JSONiq queries will go to:
A cell in a jupyter notebook and with the %%jsoniq magic: a simple click is sufficient to execute.
The shell: type the query, and finish by pressing Enter twice.
In a Python program, inside a rumble.jsoniq() call of which you can exploit the output with more Python code.
A JSONiq query file, which you can execute with the RumbleDB CLI interface.
Either way, the meaning of the queries is the same.
"Hello, World"

or

1 + 1

or

(3 * 4) div 5
The above queries do not actually use Spark. Spark is used when the I/O workload can be parallelized. The following query should output the file created above.
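As a sketch (not necessarily the exact query from the original page), assuming the stand-in data.json shown above; the triple-quoted query text is what you would type in the shell or in a %%jsoniq cell:

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# Read the JSON Lines file with a minimum of 10 partitions and return every object.
print(rumble.jsoniq("""
for $product in json-lines("data.json", 10)
return $product
""").json())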
json-lines() reads its input in parallel, and thus will also work on your machine with MB or GB files (for TB files, a cluster will be preferable). You should specify a minimum number of partitions, here 10 (note that this is a bit ridiculous for our tiny example, but it is very relevant for larger files), as locally no parallelization will happen if you do not specify this number.
The above creates a very simple Spark job and executes it. More complex queries will create several Spark jobs. But you will not see anything of it: this is all done behind the scenes. If you are curious, you can open the Spark UI at http://localhost:4040 in your browser while your query is running (it will not be available once the job is complete) and look at what is going on behind the scenes.
Data can be filtered with the where clause; again, under the hood, a Spark transformation will be used. RumbleDB also supports grouping and aggregation, as well as ordering. Note that clauses (where, let, group by, order by) can appear in any order; the only constraint is that the first clause must be a for or a let clause. A combined example is sketched below.
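Here is an illustrative sketch combining the three clauses on the stand-in data.json (again, the query string itself is plain JSONiq):

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

# Filter with where, aggregate with group by, and sort with order by.
print(rumble.jsoniq("""
for $product in json-lines("data.json", 10)
where $product.quantity ge 50
group by $store-number := $product.store-number
order by $store-number ascending
return { "store" : $store-number, "total-quantity" : sum($product.quantity) }
""").json())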
Finally, RumbleDB can also parallelize data provided within the query, exactly like Spark's parallelize() creation; mind the double parentheses, as parallelize is a unary function to which we pass a sequence of objects. An illustrative sketch follows.
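A sketch of parallelize() with inline data (the outer pair of parentheses is the function call, the inner pair builds the sequence of objects):

from jsoniq import RumbleSession

rumble = RumbleSession.builder.getOrCreate()

print(rumble.jsoniq("""
for $i in parallelize((
  { "product" : "broiler", "quantity" : 20 },
  { "product" : "toaster", "quantity" : 100 }
))
where $i.quantity gt 50
return $i.product
""").json())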
The docker installation is kindly contributed by Dr. Ingo Müller (Google).
On occasion, the docker version of RumbleDB used to throw a Kryo NoSuchMethodError on some systems. This should be fixed with version 2.0.0, let us know if this is not the case.
You can upgrade to the newest version with
Docker is the easiest way to get a standard environment that just works.
You can download Docker from the Docker website.
Then, in a shell, type, all on one line:
The first time, it might take some time to download everything, but this is all done automatically. Subsequent commands will run immediately.
When there are new RumbleDB versions, you can upgrade with:
The RumbleDB shell appears:
You can now start typing simple queries like the following few examples. Press the return key three times to execute a query.

"Hello, World"

or

1 + 1

or

(3 * 4) div 5
The above queries do not actually use Spark. Spark is used when the I/O workload can be parallelized. The following query should output the file created above.
json-lines() reads its input in parallel, and thus will also work on your machine with MB or GB files (for TB files, a cluster will be preferable). You should specify a minimum number of partitions, here 10 (note that this is a bit ridiculous for our tiny example, but it is very relevant for larger files), as locally no parallelization will happen if you do not specify this number.
The above creates a very simple Spark job and executes it. More complex queries will create several Spark jobs. But you will not see anything of it: this is all done behind the scenes. If you are curious, you can open the Spark UI at http://localhost:4040 in your browser while your query is running (it will not be available once the job is complete) and look at what is going on behind the scenes.
Data can be filtered with the where clause; again, under the hood, a Spark transformation will be used. RumbleDB also supports grouping and aggregation, as well as ordering. Note that clauses (where, let, group by, order by) can appear in any order; the only constraint is that the first clause must be a for or a let clause.
Finally, RumbleDB can also parallelize data provided within the query, exactly like Spark's parallelize() creation; mind the double parentheses, as parallelize is a unary function to which we pass a sequence of objects.
You can also run the docker as a server like so:
You can change the port to something other than 8001 at all three places it appears. Do not forget -p 8001:8001, which forwards the port to the outside of the docker container. Then, you can use a Jupyter notebook connected to the RumbleDB docker server to write queries in it. Point the notebook to http://localhost:8001/jsoniq in the appropriate cell (or any other port).
In order to query your local files, you need to mount a local directory to a directory within the docker. This is done with the --mount option, and the source path must be absolute. For the target, you can pick anything that makes sense to you.
For example, imagine you have a file products-small.json in the directory /path/to/my/directory. Then you need to run RumbleDB with:
Then you can go ahead and use absolute paths in the target directory in input functions, like so:
You can also mount a local directory in this way running it as a server rather than a shell.
After you have tried RumbleDB locally as explained in the getting started section, you can take RumbleDB to a real cluster simply by modifying the command line parameters as documented here for spark-submit.
Creating a cluster is the easiest part, as most cloud providers today offer that with just a few clicks: Amazon EMR, Azure HDInsight, etc. You can start with 4-5 machines with a few CPUs each and a bit of memory, and increase later when you want to get serious on larger scales.
Make sure to select a cluster that has Apache Spark. On Amazon EMR, this is not the default and you need to make sure that you check the box that has Spark below the cluster version dropdown. We recommend taking the latest EMR version 6.5.0 and then picking Spark 3.1 in the software configuration. You will also need to create a public/private key pair if you do not already have one.
Wait for 5 or 6 minutes, and the cluster is ready.
Do not forget to terminate the cluster when you are done!
Next, you need to use ssh to connect to the master node of your cluster as the hadoop user, specifying your private key file. You will find the hostname of the machine on the EMR cluster page. The command looks like:
ssh -i ~/.ssh/yourkey.pem [email protected]
If ssh hangs, then you may need to authorize your IP for incoming connections in the security group of your cluster.
And once you have connected with ssh and are on the shell, you can start using RumbleDB in a way similar to what you do on your laptop.
First you need to download it with wget (which is usually available by default on cloud virtual machines):
This is all you need to do, since Apache Spark is already installed. If spark-submit does not work, you might want to wait for a few more minutes as it might be that the cluster is not fully prepared yet.
Often, the Spark cluster runs on YARN. Compared to the getting started guide, the --master option changes from local[*] (used for running on your laptop) to yarn.
Most of the time, though (e.g., on Amazon EMR), it need not be specified, as this is already set up in the environment. So the same command will do:
When you are on a cluster, you can also adapt the number of executors, how many cores you want per executor, etc. It is recommended to use roughly sqrt(n) cores per executor if a node has n cores. For the executor memory, it is just primary school math: you divide the memory available on a machine by the number of executors per machine (which is also roughly sqrt(n)).
For example, if we have 6 worker nodes, each with 16 cores and 64 GB, we can use 5 executors on each machine, with 3 cores and 10 GB per executor. This leaves a core and a bit of memory free for other cluster tasks.
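As a sanity check, this arithmetic can even be written as a small JSONiq query; this is only an illustrative sketch using the numbers from the example above (the 10 GB figure simply rounds the 12 GB result down to leave headroom for the operating system and other cluster tasks):

(: back-of-the-envelope executor sizing for a 16-core, 64 GB node :)
let $cores-per-node := 16
let $memory-per-node-gb := 64
let $executors-per-node := 5
let $cores-per-executor := 3
return {
  "cores used per node" : $executors-per-node * $cores-per-executor,
  "cores left free per node" : $cores-per-node - ($executors-per-node * $cores-per-executor),
  "memory per executor in GB, before headroom" : $memory-per-node-gb idiv $executors-per-node
}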
If necessary, the size limit for materialization can be increased with --materialization-cap or its shortcut -c (the default is 200). This affects the number of items displayed in the shell as the answer to a query. It also affects the maximum number of items that can be materialized from a large sequence into, say, an array. Warnings are issued if the cap is reached.
json-lines() then takes an HDFS path, and the host and port are optional if Spark is configured properly. A second parameter controls the minimum number of splits. By default, each HDFS block is a split if executed on a cluster. In a local execution, there is only one split by default.
The same goes for parallelize(). It is also possible to read text with text-file(), parquet files with parquet-file(), and it is also possible to read data on S3 rather than HDFS for all three functions json-lines(), text-file() and parquet-file().
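For illustration, here is a hedged sketch combining these functions; the bucket name and paths are placeholders, not real datasets:

(: count the lines of a text file stored on S3 (placeholder bucket) :)
count(text-file("s3://my-bucket/logs/access.log"))

(: filter objects read from a Parquet file on HDFS (placeholder path) :)
for $i in parquet-file("hdfs:///user/me/products.parquet")
where $i.quantity gt 99
return $i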
If you need a bigger data set out of the box, we recommend the , which has 16 million objects. On Amazon EMR, we could even read several billion objects on less than ten machines.
We test this with each new release and suggest the following queries to start with (we assume HDFS is the default file system, and that you copied this dataset over to HDFS with hadoop fs -copyFromLocal):
Note that by default only the first 200 items in the output will be displayed on the shell, but you can change it with the --materialization-cap parameter on the CLI.
RumbleDB also supports executing a single query from the command line, reading from HDFS and outputting the results to HDFS, with the query file being either local or on HDFS. For this, use the --query-path (optional as any text without parameter is recognized as a path in any case), --output-path (shortcut -o) and --log-path parameters.
The query path, output path and log path can be any of the supported schemes (HDFS, file, S3, WASB...) and can be relative or absolute.
As in most languages, one can distinguish between physical equality and logical equality.
Atomics can only be compared logically. Their physical identity is completely opaque to you.
Result (run with Zorba):true
Result (run with Zorba):false
Result (run with Zorba):false
Result (run with Zorba):true
Two objects or arrays can be tested for logical equality as well, using deep-equal(), which performs a recursive comparison.
Result (run with Zorba):true
Result (run with Zorba):false
The physical identity of objects and arrays is not exposed to the user in the core JSONiq language itself. Some library modules might be able to reveal it, though.
Module
You can group functions and variables in separate library modules.
MainModule
Up to now, everything we encountered were main modules, i.e., a prolog followed by a main query.
LibraryModule
A library module does not contain any query - just functions and variables that can be imported by other modules.
A library module must be assigned to a namespace. For convenience, this namespace is bound to an alias in the module declaration. All variables and functions in a library module must be prefixed with this alias.
ModuleImport
Here is a main module which imports the former library module. An alias is given to the module namespace (my). Variables and functions from that module can be accessed by prefixing their names with this alias. The alias may be different than the internal alias defined in the imported module.
Result (run with Zorba):1764
JSONiq is a query language that was specifically designed for querying JSON, although its data model is powerful enough to handle other, similar formats.
As stated on json.org, JSON is a "lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate."
A JSON document is made of the following building blocks: objects, arrays, strings, numbers, booleans and nulls.
JSONiq manipulates sequences of these building blocks, which are called items. Hence, a JSONiq value is a sequence of items.
Any JSONiq expression takes and returns sequences of items.
Comma-separated JSON-like building blocks are all you need to begin building your own sequences. You can mix and match, as JSONiq supports heterogeneous sequences seamlessly.
By default, the memory allocated is limited. This depends on whether you run RumbleDB with the standalone jar or as the thin jar in a Spark environment.
If you run RumbleDB with a standalone jar, then your laptop will allocate by default one quarter of your total working memory. You can check this with:
In order to increase the memory, you can use -Xmx10g (for 10 GB, but you can use any other value):
If you run RumbleDB on your laptop (or a single machine) with the thin jar, then by default this is limited to around 2 GB, and you can change this with --driver-memory
Even though you can build your own JSON values with JSONiq by copying-and-pasting JSON documents, most of the time, your JSON data will come from an external input dataset.
How this dataset is accessed depends on the JSONiq implementation and on the context. Some engines can read the data from a file located on a file system, local or distributed (HDFS, S3); some others get data from the Web; some others are full-fledged datastores and have collections that can be created, queried, modified and persisted.
It is up to each engine to document which functions should be used, and how, in order to read datasets into a JSONiq Data Model instance. These functions will take implementation-defined parameters and typically return sequences of objects, or sequences of strings, or sequences of items, etc.
For the purpose of examples given in this specification, we assume that a hypothetical engine has collections that are sequences of objects, identified by a name which is a string. We assume that there is a collection() function that returns all objects associated with the provided collection name.
We assume in particular that there are three example collections, shown below.
{ "product" : "broiler", "store number" : 1, "quantity" : 20 }
{ "product" : "toaster", "store number" : 2, "quantity" : 100 }
{ "product" : "toaster", "store number" : 2, "quantity" : 50 }
{ "product" : "toaster", "store number" : 3, "quantity" : 50 }
{ "product" : "blender", "store number" : 3, "quantity" : 100 }
{ "product" : "blender", "store number" : 3, "quantity" : 150 }
{ "product" : "socks", "store number" : 1, "quantity" : 500 }
{ "product" : "socks", "store number" : 2, "quantity" : 10 }
{ "product" : "shirt", "store number" : 3, "quantity" : 10 }docker pull rumbledb/rumble
1 eq 1


Result:foo 2 true foo bar null [ 1, 2, 3 ]
Sequences are flat and cannot be nested. This makes streaming possible, which is very powerful.
Result:foo 2 true 4 null 6
A sequence can be empty. The empty sequence can be constructed with empty parentheses.
Result:
A sequence of just one item is considered the same as just this item. Whenever we say that an expression returns or takes one item, we really mean that it takes a singleton sequence of one item.
Result:foo
JSONiq classifies the items mentioned above in three categories:
Atomic items: strings, numbers, booleans and nulls, but also many other supported atomic values such as dates, binary, etc.
Structured items: objects and arrays.
Function items: items that can take parameters and, upon evaluation, return sequences of items.
The JSONiq data model follows the W3C specification, but, in core JSONiq, does not include XML nodes, and includes instead JSON objects and arrays. Engines are free, however, to optionally support XML nodes in addition to JSON objects and arrays.
An atomic is a non-structured value that is annotated with a type.
JSONiq atomic values follow the W3C specification.
JSONiq supports most atomic values available in the W3C specification. They are described in Chapter The JSONiq type system. JSONiq furthermore defines an additional atomic value, null, with a type of its own, jn:null, which does not exist in the W3C specification.
In particular, JSONiq supports all core JSON values. Note that JSON numbers correspond to three different types in JSONiq.
string: all JSON strings.
integer: all JSON numbers that are integers (no dot, no exponent), infinite range.
decimal: all JSON numbers that are decimals (no exponent), infinite range.
double: IEEE double-precision 64-bit floating point numbers (corresponds to JSON numbers with an exponent).
boolean: the JSON booleans true and false.
null: the JSON null.
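As a small sketch, the following expressions each return true and show which of the three number types a few literals match (builtin atomic types can be used unprefixed, as explained in the chapter on types):

42 instance of integer,
3.14 instance of decimal,
314e-2 instance of double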
Structured items in JSONiq do not follow the XQuery 3.1 standard but are specific to JSONiq.
In JSONiq, an object represents a JSON object, i.e., a collection of string/item pairs.
Objects have the following property:
pairs. A set of pairs. Each pair consists of an atomic value of type xs:string and of an item.
[ Consistency constraint: no two pairs have the same name (using fn:codepoint-equal). ]
The XQuery data model uses accessors to explain the data model. Accessors are not exposed to the user and are only used for convenience in this specification. Objects have the following accessors:
jdm:object-keys($o as js:object) as xs:string*: returns all keys in the object $o.
jdm:object-value($o as js:object, $s as xs:string) as js:item: returns the value associated with $s in the object $o.
An object does not have a typed value.
In JSONiq, an array represents a JSON array, i.e., an ordered list of items.
Arrays have the following property:
members. An ordered list of items.
Arrays have the following accessors:
jdm:array-size($a as js:array) as xs:nonNegativeInteger: returns the number of values in the array $a.
jdm:array-value($a as js:array, $i as xs:positiveInteger) as js:item: returns the value at position $i in the array $a.
An array does not have a typed value.
Unlike in the XQuery 3.1 standard, the values in arrays and objects are single items (which disallows the empty sequence or a sequence of more than one item). Also, object keys must be strings (which disallows any other atomic value).
JSONiq also supports function items, also known as higher-order functions. A function item can be passed parameters and evaluated.
A function item has an optional name and an arity. It also has a signature, which consists of the sequence type of each one of its parameters (as many as its arity), and the sequence type of the values it returns.
The fact that functions are items means that they can be returned by expressions, and passed as parameters to other functions. This is why they are also often called higher-order functions.
JSONiq function items follow the W3C specification.
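As a small illustration (a hedged sketch, not taken from the specification), an inline function item can be bound to a variable and then invoked dynamically:

(: bind an anonymous function item to a variable, then call it; returns 36 :)
let $square := function($x as integer) as integer { $x * $x }
return $square(6)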
If you run RumbleDB on a cluster, then the memory needs to be allocated to the executors, not the driver:
Setting things up on a cluster requires more thinking because setting the executor memory should be done in conjunction with setting the total number of executors and the number of cores per executor. This highly depends on your cluster hardware.
RumbleDB does not currently support paths containing whitespace. Make sure to put your data and modules at paths without whitespace.
If this happens, you can download winutils.exe to solve the issue as explained here.
This is a known issue under investigation. It is related to a version conflict between Kryo 4 and Kryo 5 that occasionally happens on some docker installations. We recommend trying a local installation instead, as described in the Getting Started section.
A very common issue leading to some errors is using the wrong Java version. With Spark 3.5, only Java 11 or 17 is supported. With Spark 4, Java 17 or 21 are supported.
You should make sure in particular you are not using a more recent Java version. Multiple Java versions can normally co-habit on the same machine but you need to make sure to set the JAVA_HOME variable appropriately.
It is easy to check the Java version with:
Sometimes, a sequence can be very long, and materializing it to a tuple of JSON values or a tuple of native items can fail because of the materialization cap. While it can be changed in the configuration to allow for larger tuples, this does not scale.
Another way to retrieve a sequence of arbitrary length is to use the iterator API to stream through the items one by one. If you do not keep previous values in memory, there is no limit to the sequence size that can be retrieved in this way (but it may take more time than using RDDs or DataFrames, which benefit from parallelism).
This is how to stream through the items converted to JSON one by one:
This is how to stream through the native items, using the Item API:
Sometimes, it is not possible to retrieve the output sequence as a (pandas or pyspark) DataFrame because no schema could be inferred. This is notably the case if the output sequence is heterogeneous (such as a sequence of items mixing atomics, objects of various structures, arrays, etc).
And materializing or streaming may not be an option either if there are billions of items.
In this case, it is possible to obtain the output as an RDD instead. This gets an RDD of JSON values that can be processed by Python (using the type mapping).
The rdd() method is experimental because we had to reverse-engineer how pyspark encodes RDDs for the Java Virtual Machine (pickling).
list = res.items();
for result in list:
    print(result.getStringValue())
Result
Result
Result
collection("one-object")
{ "foo" : "bar" }"Hello, World" 1 + 1
(3 * 4) div 5
json-lines("data.json")
for $i in json-lines("data.json", 10)
return $ifor $i in json-lines("data.json", 10)
where $i.quantity gt 99
return $ifor $i in json-lines("data.json", 10)
let $quantity := $i.quantity
group by $product := $i.product
return { "product" : $product, "total-quantity" : sum($quantity) }for $i in json-lines("data.json", 10)
let $quantity := $i.quantity
group by $product := $i.product
let $sum := sum($quantity)
order by $sum descending
return { "product" : $product, "total-quantity" : $sum }for $i in parallelize((
{ "product" : "broiler", "store number" : 1, "quantity" : 20 },
{ "product" : "toaster", "store number" : 2, "quantity" : 100 },
{ "product" : "toaster", "store number" : 2, "quantity" : 50 },
{ "product" : "toaster", "store number" : 3, "quantity" : 50 },
{ "product" : "blender", "store number" : 3, "quantity" : 100 },
{ "product" : "blender", "store number" : 3, "quantity" : 150 },
{ "product" : "socks", "store number" : 1, "quantity" : 500 },
{ "product" : "socks", "store number" : 2, "quantity" : 10 },
{ "product" : "shirt", "store number" : 3, "quantity" : 10 }
), 10)
let $quantity := $i.quantity
group by $product := $i.product
let $sum := sum($quantity)
order by $sum descending
return { "product" : $product, "total-quantity" : $sum }docker run -i rumbledb/rumble repl
docker pull rumbledb/rumble

____ __ __ ____ ____
/ __ \__ ______ ___ / /_ / /__ / __ \/ __ )
/ /_/ / / / / __ `__ \/ __ \/ / _ \/ / / / __ | The distributed JSONiq engine
/ _, _/ /_/ / / / / / / /_/ / / __/ /_/ / /_/ / 2.0.0 "Lemon Ironwood" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/_____/_____/
App name: spark-rumble-jar-with-dependencies.jar
Master: local[*]
Driver's memory: (not set)
Number of executors (only applies if running on a cluster): (not set)
Cores per executor (only applies if running on a cluster): (not set)
Memory per executor (only applies if running on a cluster): (not set)
Dynamic allocation: (not set)
Item Display Limit: 200
Output Path: -
Log Path: -
Query Path : -
RumbleDB$"Hello, World" 1 + 1
(3 * 4) div 5
json-lines("https://rumbledb.org/samples/products-small.json")
for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
return $ifor $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
where $i.quantity gt 99
return $ifor $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
let $quantity := $i.quantity
group by $product := $i.product
return { "product" : $product, "total-quantity" : sum($quantity) }for $i in json-lines("https://rumbledb.org/samples/products-small.json", 10)
let $quantity := $i.quantity
group by $product := $i.product
let $sum := sum($quantity)
order by $sum descending
return { "product" : $product, "total-quantity" : $sum }for $i in parallelize((
{ "product" : "broiler", "store number" : 1, "quantity" : 20 },
{ "product" : "toaster", "store number" : 2, "quantity" : 100 },
{ "product" : "toaster", "store number" : 2, "quantity" : 50 },
{ "product" : "toaster", "store number" : 3, "quantity" : 50 },
{ "product" : "blender", "store number" : 3, "quantity" : 100 },
{ "product" : "blender", "store number" : 3, "quantity" : 150 },
{ "product" : "socks", "store number" : 1, "quantity" : 500 },
{ "product" : "socks", "store number" : 2, "quantity" : 10 },
{ "product" : "shirt", "store number" : 3, "quantity" : 10 }
), 10)
let $quantity := $i.quantity
group by $product := $i.product
let $sum := sum($quantity)
order by $sum descending
return { "product" : $product, "total-quantity" : $sum }docker run -p 8001:8001 --rm rumbledb/rumble serve -p 8001 -h 0.0.0.0docker run -t -i --mount type=bind,source=/path/to/my/directory,target=/home rumbledb/rumble replfor $i in json-lines("/home/products-small.json", 10)
where $i.quantity gt 99
return $iwget https://github.com/RumbleDB/rumble/releases/download/v1.22.0/rumbledb-1.22.0-for-spark-3.5.jarspark-submit --master yarn --deploy-mode client rumbledb-1.22.0-for-spark-3.5.jar repl
spark-submit rumbledb-1.22.0-for-spark-3.5.jar repl
spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
rumbledb-1.22.0-for-spark-3.5.jar replspark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
rumbledb-1.22.0-for-spark-3.5.jar repl -c 10000for $i in json-lines(”/user/you/confusion−2014−03−02.json”, 300)
let $guess := $i.guess
let $target := $i.target
where $guess eq $target
where $target eq ”Russian”
return $i
for $i in json-lines("/user/you/confusion-2014-03-02.json", 300)
let $guess := $i.guess, $target := $i.target
where $guess eq $target
order by $target, $i.country descending, $i.date descending
return $i

for $i in json-lines("/user/you/confusion-2014-03-02.json", 300)
let $country := $i.country, $target := $i.target
group by $target, $country
return { "Language": $target,
"Country" : $country, "Guesses": length($i) }

spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
rumbledb-1.22.0-for-spark-3.5.jar run "hdfs:///user/me/query.jq"
-o "hdfs:///user/me/results/output"
--log-path "hdfs:///user/me/logging/mylog"

spark-submit --num-executors 30 --executor-cores 3 --executor-memory 10g
rumbledb-1.22.0-for-spark-3.5.jar run "/home/me/my-local-machine/query.jq"
-o "/user/me/results/output"
--log-path "hdfs:///user/me/logging/mylog"
1 eq 2
"foo" eq "bar"
"foo" ne "bar"
deep-equal({ "foo" : "bar" }, { "foo" : "bar" })
deep-equal({ "foo" : "bar" }, { "bar" : "foo" })
module namespace my = "http://www.example.com/my-module";
declare variable $my:variable := { "foo" : "bar" };
declare variable $my:n := 42;
declare function my:function($i as integer) { $i * $i };
import module namespace other= "http://www.example.com/my-module";
other:function($other:n)
"foo", 2, true, { "foo", "bar" }, null, [ 1, 2, 3 ]
( ("foo", 2), ( (true, 4, null), 6 ) )
()
("foo")
java -XX:+PrintFlagsFinal -version | grep -iE 'MaxHeapSize'

java -jar -Xmx10g rumbledb-2.0.0-standalone.jar ...

spark-submit --driver-memory 10G rumbledb-2.0.0-for-spark-4.0.jar ...

spark-submit --executor-memory 10G rumbledb-2.0.0-for-spark-4.0.jar ...

java -version

res.open();
while (res.hasNext()):
    print(res.nextJSON());
res.close();

res.open();
while (res.hasNext()):
    print(res.next().getStringValue());
res.close();

rdd = res.rdd();
print(rdd.count());
for str in rdd.take(10):
    print(str);
collection("captains")
{ "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 }
{ "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 }
{ "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 }
{ "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 }
{ "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 }
{ "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }
{ "name" : "Samantha Carter", "series" : [ ], "century" : 21 }
collection("films")
{ "id" : "I", "name" : "The Motion Picture", "captain" : "James T. Kirk" }
{ "id" : "II", "name" : "The Wrath of Kahn", "captain" : "James T. Kirk" }
{ "id" : "III", "name" : "The Search for Spock", "captain" : "James T. Kirk" }
{ "id" : "IV", "name" : "The Voyage Home", "captain" : "James T. Kirk" }
{ "id" : "V", "name" : "The Final Frontier", "captain" : "James T. Kirk" }
{ "id" : "VI", "name" : "The Undiscovered Country", "captain" : "James T. Kirk" }
{ "id" : "VII", "name" : "Generations", "captain" : [ "James T. Kirk", "Jean-Luc Picard" ] }
{ "id" : "VIII", "name" : "First Contact", "captain" : "Jean-Luc Picard" }
{ "id" : "IX", "name" : "Insurrection", "captain" : "Jean-Luc Picard" }
{ "id" : "X", "name" : "Nemesis", "captain" : "Jean-Luc Picard" }
{ "id" : "XI", "name" : "Star Trek", "captain" : "Spock" }
{ "id" : "XII", "name" : "Star Trek Into Darkness", "captain" : "Spock" }
A Pending Update List is an unordered list of update primitives. Update primitives are internal and do not appear in the syntax. Each kind of update primitive models one individual update to an object or an array.
A Pending Update List can by analogy be seen as the diff between two git revisions, and a single update primitive can be seen, with this same analogy, as the difference between two single lines of code. Thus, the JSONiq Update Facility is to trees what git is to lines of text: a "tree diff" language.
JSONiq adds the following new update primitives, specific to JSON. They are similar to those defined by the XQuery Update Facility for XML.
Update primitives within a PUL are applied with strict snapshot semantics. For example, positions are resolved against the array before the updates. Names are resolved on the object before the updates.
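As a small, hedged sketch of these snapshot semantics, consider the following copy-modify-return expression (described further below): both deletions refer to positions in the original array, so members 1 and 3 of the original array are removed and the result is [ "b", "d" ], rather than the second position being re-evaluated after the first deletion.

copy $arr := [ "a", "b", "c", "d" ]
modify (delete json $arr[[1]], delete json $arr[[3]])
return $arr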
Credits: Dwij Dixit/Ghislain Fourny (student project at ETH)
Update expressions are the visible part of JSONiq Updates in the language. Each primary updating expression contributes an update primitive to the Pending Update List being built.
These expressions may appear in a copy-modify-return (transform) expression (for in-memory updates on cloned values), or outside (for persistent updates to an underlying storage).
This section introduces prologs, which allow declaring functions and global variables that can then be used in the main query. A prolog also allows setting some default behaviour.
MainModule
Prolog
The prolog appears before the main query and is optional. It can contain setters and module imports, followed by function and variable declarations.
Module imports are explained in the next chapter.
jupd:rename-in-object(
$target as object(),
$key as xs:string,
$content as xs:string)
Renames the pair originally named $key in the object $target as $content (do nothing if there is no such pair).
jupd:insert-before-into-collection(
$target as item(),
$content as item()*)
Inserts the provided items before the specified item in its collection.
jupd:insert-after-into-collection(
$target as item(),
$content as item()*)
Inserts the provided items after the specified item in its collection.
jupd:insert-into-object(
$target as object(),
$content as object())
Inserts all pairs of the object $content into the object $target.
jupd:insert-into-array(
$target as array(),
$position as xs:integer,
$content as item()*)
Inserts all items in the sequence $content before position $position into the array $target.
jupd:delete-from-object(
$target as object(),
$keys as xs:string*)
Removes the pairs the names of which appear in $keys from the object $target.
jupd:delete-from-array(
$target as array(),
$position as xs:integer)
Removes the item at position $position from the array $target (causes all following items in the array to move one position to the left).
jupd:replace-in-array(
$target as array(),
$position as xs:integer,
$content as item())
Replaces the item at position $position in the array $target with the item $content (do nothing if $position is not between 1 and jdm:size($target)).
jupd:replace-in-object(
$target as object(),
$key as xs:string,
$content as item())
Replaces the value of the pair named $key in the object $target with the item $content (do nothing if there is no such pair).
jupd:create-collection(
$name as string,
$mode as string,
$content as item()*)
Creates a collection initialized with the provided items. Mode determines the kind of collection (e.g., a Hive metastore table, a delta lake file, etc).
jupd:truncate-collection(
$name as string,
$mode as string)
Deletes the specified collection.
jupd:edit(
$target as item(),
$content as item())
Modifies an item in a collection into another item, preserving its identity and location.
jupd:delete-in-collection(
$target as item())
Deletes the provided item from its collection.
jupd:insert-first-into-collection(
$name as string,
$mode as string,
$content as item()*)
Inserts the provided items at the very beginning of the specified collection.
jupd:insert-last-into-collection(
$name as string,
$mode as string,
$content as item()*)
Inserts the provided items at the very end of the specified collection.
[FOAR0001] - Division by zero.
[FOAR0002] - Numeric operation overflow/underflow
[FOCA0002] - A value that is not lexically valid for a particular type has been encountered.
[FOCH0001] - Raised by fn:codepoints-to-string if the input contains an integer that is not the codepoint of a valid XML character.
[FOCH0003] - Raised by fn:normalize-unicode if the requested normalization form is not supported by the implementation.
[FODC0002] - Error retrieving resource.
[FODT0001] - Overflow/underflow in date/time operation.
[FODT0002] - Overflow/underflow in duration operation.
[FOFD1340] - This error is raised if the picture string or calendar supplied to fn:format-date, fn:format-time, or fn:format-dateTime has invalid syntax.
[FOFD1350] - This error is raised if the picture string supplied to fn:format-date selects a component that is not present in a date, or if the picture string supplied to fn:format-time selects a component that is not present in a time.
[FOTY0012] - The argument has no typed value (objects, arrays, functions cannot be atomized).
[JNTY0004] - Unexpected non-atomic element. Raised when objects or arrays are supplied where an atomic element is expected.
[JNTY0024] - Error getting the string value for array and object items
[JNTY0018] - Invalid selector error code. It is a type error if there is not exactly one supplied parameter for an object or array selector.
[RBDY0005] - Materialization Error: the sequence is too big to be materialized. Use --materialization-cap to increase the maximum materialization size, or add an output path to write to.
[RBML0001] - Unrecognized RumbleDB ML Class Reference An unrecognized classname is used in query while accessing the RumbleDB ML API.
[RBML0002] - Unrecognized RumbleDB ML Param Reference An unrecognized parameter is used in query while operating with a RumbleDB ML class.
[RBML0003] - Invalid RumbleDB ML Param Provided parameter does not match the expected type or value for the referenced RumbleDB ML class.
[RBML0004] - Input is not a DataFrame Provided input of items does not form a DataFrame as expected by RumbleDB ML.
[RBML0005] - Invalid schema for DataFrame in annotate() The provided schema can not be applied to the item data while converting the data to a DataFrame
[RBST0001] - CLI error. Raised when invalid parameters are supplied at launch.
[RBST0002] - Unimplemented feature error. Raised when a JSONiq feature that is not yet implemented in RumbleDB is used.
[RBST0003] - Invalid for clause expression error. Raised when an expression produces a different, big sequence of items for each binding within a big tuple, which would lead to a data flow explosion and to a nesting of jobs on the Spark cluster.
[RBST0004] - Implementation Error.
[SENR0001] - Serialization error. Function items can not be serialized.
[XPDY0002] - It is a dynamic error if evaluation of an expression relies on some part of the dynamic context that is absent.
[XPDY0050] - Dynamic type treat error. It is a dynamic error if the dynamic type of the operand of a treat expression does not match the sequence type specified by the treat expression. This error might also be raised by a path expression beginning with "/" or "//" if the context node is not in a tree that is rooted at a document node. This is because a leading "/" or "//" in a path expression is an abbreviation for an initial step that includes the clause treat as document-node().
[XPDY0130] - Generic runtime exception [check error message].
[XPST0003] - Parsing error. Invalid syntax or unsupported feature in query.
[XPST0008] - Undefined element reference. It is a static error if an expression refers to an element name, attribute name, schema type name, namespace prefix, or variable name that is not defined in the static context, except for an ElementName in an ElementTest or an AttributeName in an AttributeTest.
[XPST0017] - Invalid function call error. It is a static error if the expanded QName and number of arguments in a static function call do not match the name and arity of a function signature in the static context.
[XPST0080] - Invalid cast error - It is a static error if the target type of a cast or castable expression is NOTATION anySimpleType, or anyAtomicType.
[XPST0081] - Unknown namespace prefix - It is a static error if a QName used in a query contains a namespace prefix that cannot be expanded into a namespace URI by using the statically known namespaces.
[XPTY0004] - Unexpected Type Error. It is a type error if, during the static analysis phase, an expression is found to have a static type that is not appropriate for the context in which the expression occurs, or during the dynamic evaluation phase, the dynamic type of a value does not match a required type. Example: using subtraction on strings.
[XQDY0054] - It is a dynamic error if a cycle is encountered in the definition of a module's dynamic context components, for example because of a cycle in variable declarations.
[XQTY0024] - Attribute After Non Attribute Error - It is a type error if the content sequence in an element constructor contains an attribute node following a node that is not an attribute node.
[XQDY0025] - Duplicate Attribute Error - It is a dynamic error if any attribute of a constructed element does not have a name that is distinct from the names of all other attributes of the constructed element.
[XQDY0074] - Invalid Element Name Error - It is a dynamic error if the value of the name expression in a computed element or attribute constructor cannot be converted to an expanded QName (for example, because it contains a namespace prefix not found in statically known namespaces.)
[XQDY0096] - Invalid Node Name Error - It is a dynamic error if the node-name of a node constructed by a computed element constructor has any of the following properties: 1. Its namespace prefix is xmlns. 2. Its namespace URI is http://www.w3.org/2000/xmlns/. 3. Its namespace prefix is xml and its namespace URI is not http://www.w3.org/XML/1998/namespace. 4. Its namespace prefix is other than xml and its namespace URI is http://www.w3.org/XML/1998/namespace.
[XQDY0137] - Duplicate pair name. It is a dynamic error if two pairs in an object constructor or in a simple object union have the same name.
[XQST0016] - Module declaration error. The current implementation does not support the Module Feature and raises a static error if it encounters a module declaration or a module import.
[XQST0031] - Invalid JSONiq version. It is a static error if the version number specified in a version declaration is not supported by the implementation. For now, only version 1.0 is supported.
[XQST0033] - Namespace prefix bound twice. It is a static error if a module contains multiple bindings for the same namespace prefix.
[XQST0034] - Function already exists. It is a static error if multiple functions declared or imported by a module have the same number of arguments and their expanded QNames are equal (as defined by the eq operator).
[XQST0038] - It is a static error if a Prolog contains more than one default collation declaration, or the value specified by a default collation declaration is not present in statically known collations.
[XQST0039] - Duplicate parameter name. It is a static error for a function declaration or an inline function expression to have more than one parameter with the same name.
[XQST0047] - It is a static error if multiple module imports in the same Prolog specify the same target namespace.
[XQST0048] - It is a static error if a function or variable declared in a library module is not in the target namespace of the library module.
[XQST0049] - It is a static error if two or more variables declared or imported by a module have the same name.
[XQST0052] - Simple type error. The type must be the name of a type defined in the in-scope schema types, and the {variety} of the type must be simple.
[XQST0059] - It is a static error if an implementation is unable to process a schema or module import by finding a schema or module with the specified target namespace.
[XQST0069] - A static error is raised if a Prolog contains more than one empty order declaration.
[XQST0088] - It is a static error if the literal that specifies the target namespace in a module import or a module declaration is of zero length.
[XQST0089] - It is a static error if a variable bound in a for or window clause of a FLWOR expression, and its associated positional variable, do not have distinct names (expanded QNames).
[XQST0094] - Invalid variable in group-by clause. The name of each grouping variable must be equal (by the eq operator on expanded QNames) to the name of a variable in the input tuple stream.
[XQST0118] - In a direct element constructor, the name used in the end tag must exactly match the name used in the corresponding start tag, including its prefix or absence of a prefix.
A JSON insert expression is used to insert new pairs into an object. It produces a jupd:insert-into-object update primitive. If the target is not an object, JNUP0008 is raised. If the content is not a sequence of objects, JNUP0019 is raised. These objects are merged prior to inserting the pairs into the target, and JNDY0003 is raised if the content to be inserted has colliding keys.
Example
Result: { "foo" : "bar", "bar" : 123, "foobar" : [ true, false ] }
A JSON insert expression is also used to insert a new member into an array. It produces a jupd:insert-into-array update primitive. If the target is not an array, JNUP0008 is raised. If the position is not an integer, JNUP0007 is raised.
Example
Result: { "foo" : [ 1, 2, 5, 3, 4 ] }
A JSON delete expression is used to remove a pair from an object. It produces a jupd:delete-from-object update primitive. If the key is not a string, JNUP0007 is raised. If the key does not exist, JNUP0016 is raised.
Example
Result: { "bar" : 123 }
A JSON delete expression is also used to remove a member from an array. It produces a jupd:delete-from-array update primitive. If the position is not an integer, JNUP0007 is raised. If the position is out of range, JNUP0016 is raised.
Example
Result: [ 1, 2, 4, 5, 6 ]
A JSON rename expression is used to rename a key in an object. It produces a jupd:rename-in-object update primitive. If the sequence on the left of the dot is not a single object, JNUP0008 is raised. If the new name is not a single string, JNUP0007 is raised. If the old key does not exist, JNUP0016 is raised.
Example 196. JSON rename expression
Result: { "bar" : 123, "foobar" : "bar" }
A JSON append expression is used to add a new member at the end of an array. It produces a jupd:insert-into-array update primitive. JNUP0008 is raised if the target is not an array.
Example 197. JSON append expression
Result: { "foo" : "bar", "bar" : [ 1, 2, 3, 4 ] }
A JSON replace expression is used to replace the value associated with a certain key in an object. It produces a jupd:replace-in-object update primitive. JNUP0007 is raised if the selector is not a single string. If the selector key does not exist, JNUP0016 is raised.
Example
Result: { "bar" : [ 1, 2, 3 ], "foo" : { "nested" : true } }
A JSON replace expression is also used to replace a member in an array. It produces a jupd:replace-in-array update primitive. JNUP0007 is raised if the selector is not a single position. If the selector position is out of range, JNUP0016 is raised.
Example
Result: { "foo" : "bar", "bar" : [ 1, "two", 3 ] }
These expressions may not appear in a copy-modify-return (transform) expression because they can only be used for persistent updates to an underlying storage (document store, data lakehouse, etc).
This expression creates an update primitive that creates a collection.
Example
This expression creates an update primitive that deletes a collection.
Example
This expression creates an update primitive that inserts values at the beginning or end of a collection, or before or after specific values in that collection.
Example
This expression creates an update primitive that modifies a value in a collection into the other supplied value.
Example
This expression creates an update primitive that deletes a specified value from its collection.
Example
Setters allow you to specify a default behaviour for various aspects of the language.
DefaultCollationDecl
This specifies the default collation used for grouping and ordering clauses in FLWOR expressions. It can be overridden with a collation directive in these clauses.
OrderingModeDecl
This specifies the default behaviour of for clauses, i.e., whether they bind tuples in the order in which items occur in the binding sequence. It can be overridden with ordered and unordered expressions.
EmptyOrderDecl
This specifies whether empty sequences come first or last in an ordering clause. It can be overridden by the corresponding directives in such clauses.
DecimalFormatDecl
DFPropertyName
This specifies a default decimal format for the builtin function format-number().
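As a minimal, hedged sketch of what such setters look like in a prolog (the collation URI shown is the standard codepoint collation; the exact set of supported setters may vary by engine):

declare default collation "http://www.w3.org/2005/xpath-functions/collation/codepoint";
declare ordering unordered;
declare default order empty least;

for $x in (3, 1, 2)
order by $x
return $x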
VarDecl
Variables can be declared global. Global variables are declared in the prolog.
Result (run with Zorba):{ "foo" : "bar" }
Result (run with Zorba):[ 1, 2, 3, 4, 5 ]
You can specify a type for a variable. If the type does not match, an error is raised. Types will be explained later. In general, you do not need to worry too much about variable types except if you want to make sure that what you bind to a variable is really what you want. In most cases, the engine will take care of types for you.
Result (run with Zorba):{ "foo" : "bar" }
An external variable allows you to pass a value from the outside environment, which can be very useful. Each implementation can choose their own way of passing a value to an external variable. A default value for an external variable can also be supplied in case none is provided outside.
Result (run with Zorba):An error was raised: "obj": variable has no value
Result (run with Zorba):{ "foo" : "bar" }
FunctionDecl
You can define your own functions in the prolog. These user-defined functions must be prefixed with local:, both in the declaration and when called.
Remember that types are optional, and if you do not specify any, item* is assumed, both for parameters and for the return type.
Result (run with Zorba):Hello, Mister Spock!
Result (run with Zorba):Hello, Mister Spock!
Result (run with Zorba):Hello, Mister Spock!
If you do specify types, an error is raised in case of a mismatch
Result (run with Zorba):Hello, 1!

This section describes JSONiq types as well as the sequence type syntax.
JSONiq manipulates semi-structured data: in general, JSONiq allows you, but does not require you to specify types. So you have as much or as little type verification as you wish.
JSONiq is still strongly typed, so that you will be told if there is a type inconsistency or mismatch in your programs.
Whenever you do not specify the type of a variable or the type signature of a function, the most general type for any sequence of items, item*, is assumed.
Section Expressions dealing with types introduces expressions which work with values of these types, as well as type operations (variable types, casts, ...).
JSONiq follows the W3C specification regarding sequence occurrence indicators. The following explanations, provided as an informal summary for convenience, are non-normative.
A sequence is an ordered list of items.
All sequences match the sequence type js:item*.
A sequence type is made of an item type followed by an occurrence indicator:
The symbol * (star) stands for a sequence of any length (zero or more)
The symbol + (plus) stands for a non-empty sequence (one or more)
The symbol ? (question mark) stands for an empty or a singleton sequence (zero or one)
The absence of indicator stands for a singleton sequence (one).
Examples:
string matches any singleton sequence containing a string.
item+ matches any non-empty sequence.
object? matches the empty sequence and any sequence containing one object.
JSONiq defines the syntax () for the empty sequence, rather than empty-sequence().
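The occurrence indicators above can be checked with an instance-of expression (covered in the chapter on expressions dealing with types); in this small sketch, each expression returns true:

("foo", "bar") instance of string+,
() instance of object?,
(1, 2, 3) instance of integer*,
[ 1, 2 ] instance of array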
SequenceType
Item types are the first component of a sequence type, together with the cardinality indicator. Thus, an item type matches (or not) a single item. For example, "foo" matches the item type xs:string.
There are three categories of item types:
Atomic types (W3C-conformant, additional js:null and js:atomic)
Structured types (JSONiq-specific)
Function types (W3C-conformant)
JSONiq uses a JSONiq-specific, implementation-defined default type namespace that acts as a proxy namespace to all types (xs: or js:). As a consequence, builtin atomic types do not need to be prefixed in the JSONiq syntax (integer instead of xs:integer, null instead of js:null).
All items match the item type js:item, which is a JSONiq-specific synonym for the W3C-conformant item().
ItemType
JSONiq follows the W3C specification for atomic types, except for modifications in the list of available atomic types and a simplified syntax for xs:anyAtomicType. The following explanations, provided as an informal summary for convenience, are non-normative.
Atomic types are organized in a tree hierarchy.
JSONiq defines the following builtin types that have a direct relation with JSON:
xs:string: the value space is all strings made of Unicode characters.
All string literals build an atomic which matches string.
xs:integer (W3C-conformant): the value space is that of all mathematical integral numbers (N), with an infinite range. This is a subtype of decimal, so that all integers also match the item type decimal.
All integer literals build an atomic which matches integer.
xs:decimal (W3C-conformant): the value space is that of all mathematical decimal numbers (D), with an infinite range.
JSONiq also supports further atomic types, which are conformant with the W3C specification.
These datatypes are already used as a set of atomic datatypes by the other two semi-structured data formats of the Web: XML and RDF, as well as by the corresponding query languages: XQuery and SPARQL, so it is natural for a complete JSON data model to reuse them.
Further number types: xs:float, xs:long, xs:int, xs:short, xs:byte, xs:positiveInteger, xs:negativeInteger, xs:nonPositiveInteger, xs:nonNegativeInteger, xs:unsignedLong, xs:unsignedInt, xs:unsignedShort, xs:unsignedByte.
Date or time types: xs:date, xs:dateTime, xs:dateTimeStamp, xs:gDay, xs:gMonth, xs:gMonthDay, xs:gYear, xs:gYearMonth, xs:time.
Duration types: xs:duration, xs:dayTimeDuration, xs:yearMonthDuration.
Binary types: xs:base64Binary, xs:hexBinary.
The support of xs:ID, xs:IDREF, xs:IDREFS, xs:NOTATION, xs:Name, xs:NCName, xs:NMTOKEN, xs:NMTOKENS, xs:ENTITY, xs:ENTITIES is not required by JSONiq, although engines that also support XML can support them.
AtomicType
JSONiq introduces four more types for matching objects and arrays. Like atomic types, they do not need the js: prefix in the syntax (object instead of js:object, etc.).
All objects match the item type js:object.
All arrays match the item type js:array.
All objects and arrays match the item type js:json-item.
For engines that also optionally support XML, js:structured-item matches both XML nodes and JSON objects and arrays.
StructuredType
JSONiq follows the W3C specification regarding function types. The following explanations are non-normative.
FunctionType
AnyFunctionType
TypedFunctionType
RumbleDB now supports user-defined array and object types both with the JSound compact syntax and the JSound verbose syntax.
RumbleDB user-defined types can be defined with the JSound syntax. A tutorial for the JSound syntax can be found .
For now, RumbleDB only allows the definition of user-defined types for objects and arrays. User-defined atomic types and union types will follow soon. The @ (primary key) and ? (nullable) shortcuts are supported as of version 2.0.5. The behavior of nulls with absent vs. nullable fields can be tweaked in the configuration (e.g., if a null is present in an optional, non-nullable field, RumbleDB can be lenient and simply remove it instead of throwing an error).
The implementation is still experimental and bugs are still expected; we would appreciate being informed of any that you find.
Even though JSON supports arrays, JSONiq uses a different construct as its first class citizens: sequences. Any value returned by or passed to an expression is a sequence.
The main difference between sequences and arrays is that sequences are completely flat, meaning they cannot contain other sequences.
Since sequences are flat, expressions of the JSONiq language just concatenate them to form bigger sequences.
This is crucial to allow streaming results, for example through an HTTP session.
copy $obj := { "foo" : "bar" }
modify insert json { "bar" : 123, "foobar" : [ true, false ] } into $obj
return $obj
copy $arr := { "foo" : [1,2,3,4] }
modify insert json 5 into $arr.foo at position 3
return $arr
copy $obj := { "foo" : "bar", "bar" : 123 }
modify delete json $obj.foo
return $obj
copy $arr := [1,2,3,4,5,6]
modify delete json $arr[[3]]
return $arr
copy $obj := { "foo" : "bar", "bar" : 123 }
modify rename json $obj.foo as "foobar"
return $obj
copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
modify append json 4 into $obj.bar
return $obj
copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
modify replace value of json $obj.foo with { "nested" : true }
return $obj
copy $obj := { "foo" : "bar", "bar" : [1,2,3] }
modify replace value of json $obj.bar[[2]] with "two"
return $obj
create collection table("mytable") with ({"foo":1},{"foo":2}),
create collection delta-file("/path/to/file.delta") with ({"foo":1},{"foo":2})

delete collection table("mytable"),
delete collection delta-file("/path/to/file.delta")

insert {"foo":3} first into collection table("mytable"),
insert {"foo":4} last into collection delta-file("/path/to/file.delta"),
insert {"foo":3} before table("mytable")[3] into collection,
insert {"foo":3} after delta-file("/path/to/file.delta")[3] into collection

edit table("mytable")[1] into {"foo":3} in collection

delete table("mytable")[1] from collection
declare variable $obj := { "foo" : "bar" };
$obj
declare variable $numbers := (1, 2, 3, 4, 5);
[ $numbers ]
declare variable $obj as object := { "foo" : "bar" };
$obj
declare variable $obj external;
$obj
declare variable $obj external := { "foo" : "bar" };
$obj
declare function local:say-hello($x) { "Hello, " || $x || "!" };
local:say-hello("Mister Spock")
declare function local:say-hello($x as string) { "Hello, " || $x || "!" };
local:say-hello("Mister Spock")
declare function local:say-hello($x as string) as string { "Hello, " || $x || "!" };
local:say-hello("Mister Spock")
declare function local:say-hello($x) { "Hello, " || $x || "!" };
local:say-hello(1)
All decimal literals build an atomic which matches decimal.
xs:double (W3C-conformant): the value space is that of all IEEE double-precision 64-bit floating point numbers.
All double literals build an atomic which matches double.
xs:boolean (W3C-conformant): the value space contains the booleans true and false.
All boolean literals build an atomic which matches boolean.
js:null (JSONiq-specific): the value space is a singleton and only contains null.
All null literals build an atomic which matches null.
js:atomic (JSONiq-specific synonym of, and W3C-conformant with, xs:anyAtomicType): all atomic types.
All literals build an atomic which matches atomic.
An URI type: xs:anyURI.
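As an illustrative, hedged sketch of how these atomic types interact (casts are covered in the chapter on expressions dealing with types):

(: each line is an independent expression; the last one is true because integer is a subtype of decimal :)
"3.14" cast as decimal,
42 cast as string,
"2015-01-01" cast as date,
42 instance of decimal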













A new type can be declared in the prolog, at the same location where you also define global variables and user-defined functions.
In the above query, although the type is defined, the query returns an object that was not validated against this type.
To validate and annotate a sequence of objects, you need to use the validate-type expression, like so:
You can use user-defined types wherever other types can appear: as type annotation for FLWOR variables or global variables, as function parameter or return types, in instance-of or treat-as expressions, etc.
You can validate larger sequences
You can also validate, in parallel, an entire JSON Lines file, like so:
By default, fields are optional:
You can, however, make a field required by adding a ! in front of its name:
Or you can provide a default value with the equal sign:
Extra fields will be rejected. The verbose version of JSound, however, supports open objects (allowing extra fields), and this will be supported in a future version of RumbleDB.
With the JSound compact syntax, you can easily define nested array structures:
You can even further nest objects:
Or split your definitions into several types that refer to each other:
In fact, RumbleDB will internally convert the sequence of objects to a Spark DataFrame, leading to faster execution times.
In other words, the JSound Compact Schema Syntax is perfect for defining DataFrames schema!
For advanced JSound features, such as open object types or subtypes, the verbose syntax must be used, like so:
The JSound type system, as its name indicates, is sound: you can only make subtypes more restrictive than the super type. The complete specification of both syntaxes is available on the JSound website.
In the future, RumbleDB will support user-defined atomic types and union types via the verbose syntax.
Once you have validated your data as a dataframe with a user-defined type, you are all set to use the RumbleDB ML Machine Learning library and feed it through ML pipelines!
Result (run with Zorba):1 2 3 4
Arrays on the other side can contain nested arrays, like in JSON.
Result (run with Zorba):[ [ 1, 2 ], [ 3, 4 ] ]
Many expressions return single items - actually, they really return a singleton sequence, but a singleton sequence of one item is considered the same as this item.
Result (run with Zorba):2
This is different for arrays: a singleton array is distinct from its unique member, like in JSON.
Result (run with Zorba):[ 2 ]
An array is a single item. A (non-singleton) sequence is not. This can be observed by counting the number of items in a sequence.
Result (run with Zorba):1
Result (run with Zorba):4
Other than that, arrays and sequences can contain exactly the same members (atomics, arrays, objects).
Result (run with Zorba):[ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]
Result (run with Zorba):1 foo [ 1, 2, 3, 4 ] { "foo" : "bar" }
Arrays can be converted to sequences, and vice-versa.
Result (run with Zorba):1 foo [ 1, 2, 3, 4 ] { "foo" : "bar" }
Result (run with Zorba):[ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]
Null and the empty sequence are two different concepts.
Null is an item (an atomic value), and can be a member of an array or of a sequence, or the value associated with a key in an object. The empty sequence cannot, as it represents the absence of any item.
Result (run with Zorba):[ null, 1, null, 2 ]
Result (run with Zorba):{ "foo" : null }
Result (run with Zorba):null 1 null 2
If an empty sequence is found as an object value, it is automatically converted to null.
Result (run with Zorba):{ "foo" : null }
In an arithmetic operation or a comparison, if an operand is an empty sequence, an empty sequence is returned. If an operand is a null, an error is raised, except for equality and inequality.
Result (run with Zorba):
Result (run with Zorba):An error was raised: arithmetic operation not defined between types "js:null" and "xs:integer"
Result (run with Zorba):
Result (run with Zorba):
Result (run with Zorba):false
Result (run with Zorba):true
Result (run with Zorba):
Result (run with Zorba):
declare type local:my-type as {
"foo" : "string",
"bar" : "integer"
};
{ "foo" : "this is a string", "bar" : 42 }declare type local:my-type as {
"foo" : "string",
"bar" : "integer"
};
validate type local:my-type* {
{ "foo" : "this is a string", "bar" : 42 }
}

declare type local:my-type as {
"foo" : "string",
"bar" : "integer"
};
declare function local:proj($x as local:my-type+) as string*
{
$x.foo
};
let $a as local:my-type* := validate type local:my-type* {
{ "foo" : "this is a string", "bar" : 42 }
}
return if($a instance of local:my-type*)
then local:proj($a)
else "Not an instance."declare type local:my-type as {
"foo" : "string",
"bar" : "integer"
};
validate type local:my-type* {
{ "foo" : "this is a string", "bar" : 42 },
{ "foo" : "this is another string", "bar" : 1 },
{ "foo" : "this is yet another string", "bar" : 2 },
{ "foo" : "this is a string", "bar" : 12 },
{ "foo" : "this is a string", "bar" : 42345 },
{ "foo" : "this is a string", "bar" : 42 }
}

declare type local:my-type as {
"foo" : "string",
"bar" : "integer"
};
validate type local:my-type* {
json-lines("hdfs:///directory-file.json")
}

declare type local:my-type as {
"foo" : "string",
"bar" : "integer"
};
validate type local:my-type* {
{ "foo" : "this is a string", "bar" : 42 },
{ "bar" : 1 },
{ "foo" : "this is yet another string", "bar" : 2 },
{ "foo" : "this is a string" },
{ "foo" : "this is a string", "bar" : 42345 },
{ "foo" : "this is a string", "bar" : 42 }
}
declare type local:my-type as {
"foo" : "string",
"!bar" : "integer"
};
validate type local:my-type* {
{ "foo" : "this is a string", "bar" : 42 },
{ "bar" : 1 },
{ "foo" : "this is yet another string", "bar" : 2 },
{ "foo" : "this is a string", "bar" : 1234 },
{ "foo" : "this is a string", "bar" : 42345 },
{ "foo" : "this is a string", "bar" : 42 }
}
declare type local:my-type as {
"foo" : "string=foobar",
"!bar" : "integer"
};
validate type local:my-type* {
{ "foo" : "this is a string", "bar" : 42 },
{ "bar" : 1 },
{ "foo" : "this is yet another string", "bar" : 2 },
{ "foo" : "this is a string", "bar" : 1234 },
{ "foo" : "this is a string", "bar" : 42345 },
{ "foo" : "this is a string", "bar" : 42 }
}
declare type local:my-type as {
"foo" : "string",
"!bar" : [ "integer" ]
};
validate type local:my-type* {
{ "foo" : "this is a string", "bar" : [ 42, 1234 ] },
{ "bar" : [ 1 ] },
{ "foo" : "this is yet another string", "bar" : [ 2 ] },
{ "foo" : "this is a string", "bar" : [ ] },
{ "foo" : "this is a string", "bar" : [ 1, 2, 3, 4, 5, 6 ] },
{ "foo" : "this is a string", "bar" : [ 42 ] }
}
declare type local:my-type as {
"foo" : { "bar" : "integer" },
"!bar" : [ { "first" : "string", "last" : "string" } ]
};
validate type local:my-type* {
{
"foo" : { "bar" : 1 },
"bar" : [
{ "first" : "Albert", "last" : "Einstein" },
{ "first" : "Erwin", "last" : "Schrodinger" }
]
},
{
"foo" : { "bar" : 2 },
"bar" : [
{ "first" : "Alan", "last" : "Turing" },
{ "first" : "John", "last" : "Von Neumann" }
]
},
{
"foo" : { "bar" : 3 },
"bar" : [
]
}
}
declare type local:person as {
"first" : "string",
"last" : "string"
};
declare type local:my-type as {
"foo" : { "bar" : "integer" },
"!bar" : [ "local:person" ]
};
validate type local:my-type* {
{
"foo" : { "bar" : 1 },
"bar" : [
{ "first" : "Albert", "last" : "Einstein" },
{ "first" : "Erwin", "last" : "Schrodinger" }
]
},
{
"foo" : { "bar" : 2 },
"bar" : [
{ "first" : "Alan", "last" : "Turing" },
{ "first" : "John", "last" : "Von Neumann" }
]
},
{
"foo" : { "bar" : 3 },
"bar" : [
]
}
}
declare type local:x as jsound verbose {
"kind" : "object",
"baseType" : "object",
"content" : [
{ "name" : "foo", "type" : "integer" }
],
"closed" : false
};
declare type local:y as jsound verbose {
"kind" : "object",
"baseType" : "local:x",
"content" : [
{ "name" : "bar", "type" : "date" }
],
"closed" : true
};
( (1, 2), (3, 4) )
[ [ 1, 2 ], [ 3, 4 ] ]
1 + 1
[ 1 + 1 ]
count([ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ])
count( ( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ) )
[ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ]
( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } )
[ 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ] []
[ ( 1, "foo", [ 1, 2, 3, 4 ], { "foo" : "bar" } ) ]
[ null, 1, null, 2 ]
{ "foo" : null }
(null, 1, null, 2)
{ "foo" : () }
() + 2
null + 2
null + ()
() eq 2
null eq 2
null lt 2
null eq ()
null lt ()

RumbleDB relies on the JSONiq language.
The complete specification can be found here and on the JSONiq.org website. The implementation is now at a very advanced stage and only a few core JSONiq features remain unsupported.
A tutorial can be found . All queries in this tutorial will work with RumbleDB.
A tutorial aimed at Python users can be found . Please keep in mind, though, that examples using unsupported features may not work (see below).
FLWOR expressions now support nesting, for example like so:
However, keep in mind that parallelization cannot be nested in Spark (there cannot be a job within a job), that is, the following will not work:
Many expressions are pushed down to Spark out of the box. For example, this will work on a large file leveraging the parallelism of Spark:
What is pushed down so far is:
FLWOR expressions (as soon as a for clause is encountered, binding a variable to a sequence generated with json-lines() or parallelize())
aggregation functions such as count
JSON navigation expressions: object lookup (as well as keys() call), array lookup, array unboxing, filtering predicates
predicates on positions, including use of the context-dependent functions position() and last(), e.g.,
More expressions working on sequences will be pushed down in the future, prioritized on the feedback we receive.
We also started to push down some expressions to DataFrames and Spark SQL (obtained via structured-json-lines, csv-file and parquet-file calls). In particular, keys() pushes down the schema lookup if used on parquet-file() and structured-json-lines(). Likewise, count() as well as object lookup, array unboxing and array lookup are also pushed down on DataFrames.
When an expression does not support pushdown, it will materialize automatically. To avoid issues, the materialization is capped by default at 200 items, but this can be changed on the command line with --materialization-cap. A warning is issued if a materialization happened and the sequence was truncated on screen. An error is thrown if this happens within a query.
Prologs with user-defined functions and global variables are supported. Global external variables are supported (use "--variable:foo bar" on the command line to assign values to them). If the declared type is not string, then the literal supplied on the command line is cast. If the declared type is anyURI, the path supplied on the command line is also resolved against the working directory to an absolute URI. Thus, anyURI should be used to supply paths dynamically through an external variable.
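For instance, a minimal sketch (the variable names, the field name and the values are illustrative):
declare variable $threshold as integer external;
declare variable $input as anyURI external;
for $o in json-lines($input)
where $o.value ge $threshold
return $o
Running such a query with --variable:threshold 42 --variable:input input.json would cast the threshold literal to an integer and resolve the input path against the working directory, because it is declared as anyURI.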
Context item declarations are supported and a global context item value can be passed with the "--context-item" or "-I" parameter on the command line.
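For instance, a minimal sketch (the value is illustrative):
declare context item external;
concat("Hello, ", $$)
Started with --context-item World (or -I World), this returns the string "Hello, World".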
Library modules are now supported (experimental, please report bugs), and their namespace URI is used for resolution. If it is relative, it is resolved against the importing module location.
The same schemes are supported as for reading queries and data: file, hdfs, and so on. HTTP is also supported: you can import modules from the Web!
Example of library module (the file name is library-module.jq):
Example of importing module (assuming it is in the same directory):
Try/catch expressions are supported. Error codes are in the default, RumbleDB namespace and do not need prefixes.
The JSONiq type system is fully supported. Below is a complete list of JSONiq types and their support status. All builtin types are in the default type namespace, so that no prefix is needed. These types are defined in the XML Schema standard. Note that some types specific to XML (e.g., NOTATION, NMTOKENS, NMTOKEN, ID, IDREF, ENTITY, etc) are not part of the JSONiq standard and not supported by RumbleDB.
Most core features of JSONiq are now in place, and we are working on getting the last (less used) ones into RumbleDB as well. We prioritize their implementation on user requests.
Some prolog settings (base URI, ordering mode, decimal format, namespace declarations) are not supported yet.
Location hints for the resolution of modules are not supported yet.
Window clauses are not supported, because they are not compatible with the Spark execution model.
Function type syntax is supported.
Function annotations are not supported (%public, %private...), but this is planned.
Most JSONiq and XQuery builtin functions are now supported (see function documentation), except XML-specific functions. A few are still missing, do not hesitate to reach out if you need them.
Constructors for atomic types are fully supported.
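For instance (a small sketch), each constructor call below is equivalent to the corresponding cast expression:
integer("42")
date("2015-06-30")
dayTimeDuration("PT5M")
"42" cast as integer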
Builtin functions cannot yet be used with named function reference expressions (example: concat#2).
Error variables ($err:code, ...) inside catch blocks are not supported.
There are future plans to support JSONiq updates and scripting.
type checking (instance of, treat as)
many builtin function calls (head, tail, exists, etc)
dateTime
supported
dateTimeStamp
supported
dayTimeDuration
supported
decimal
supported
double
supported
duration
supported
float
supported
gDay
supported
gMonth
supported
gYear
supported
gYearMonth
supported
hexBinary
supported
int
supported
integer
supported
long
supported
negativeInteger
supported
nonPositiveInteger
supported
nonNegativeInteger
supported
numeric
supported
positiveInteger
supported
short
supported
string
supported
time
supported
unsignedByte
supported
unsignedInt
supported
unsignedLong
supported
unsignedShort
supported
yearMonthDuration
supported
atomic
JSONiq 1.0 only
anyAtomicType
supported
anyURI
supported
base64Binary
supported
boolean
supported
byte
supported
date
supported
let $x := for $x in json-lines("file.json")
where $x.field eq "foo"
return $x
return count($x)
for $x in json-lines("file1.json")
let $z := for $y in json-lines("file2.json")
where $y.foo eq $x.fbar
return $y
return count($z)
count(json-lines("file.json")[$$.field eq "foo"].bar[].foo[[1]])
json-lines("file.json")[position() ge 10 and position() le last() - 2]
module namespace m = "library-module.jq";
declare variable $m:x := 2;
declare function m:func($v) {
$m:x + $v
};
import module namespace mod = "library-module.jq";
mod:func($mod:x)
try { 1 div 0 } catch FOAR0001 { "Division by zero!" }
The parameters that can be used on the command line as well as on the planned HTTP server are shown below. They are also accessible via the Java API and via Python through the RumbleRuntimeConfiguration class.
RumbleDB runs in three modes. You can select the mode passing a verb as the first parameter. For example:
Previous parameters (--shell, --query-path, --server) work in a backward-compatible fashion; however, we recommend switching to the new verb-based format.
--shell
repl
RumbleDB is able to read a variety of formats from a variety of file systems and database management systems.
We support functions to read JSON, JSON Lines, XML, Parquet, CSV, Text, ROOT, Delta files from various storage layers such as S3 and HDFS, Azure blob storage. We run most of our tests on Amazon EMR with S3 or HDFS, as well as locally on the local file system, but we welcome feedback on other setups.
We also support some ETL-based systems such as PostgreSQL, MongoDB and the Hive metastore.
spark-submit rumbledb.jar run file.jq -o output-dir -P 1
spark-submit rumbledb.jar run -q '1+1'
spark-submit rumbledb.jar serve -p 8001
spark-submit rumbledb.jar repl -c 10
N/A
yes, no
yes runs the interactive shell. No executes a query specified with --query-path
--shell-filter
N/A
N/A
jq .
Post-processes the output of JSONiq queries on the shell with the specified command (reading the RumbleDB output via stdin)
--query
-q
query
1+1
A JSONiq query directly provided as a string.
--query-path
(any text without -- or - is recognized as a query path)
query-path
file:///folder/file.jq
A JSONiq query file to read from (from any file system, even the Web!).
--output-path
-o
output-path
file:///folder/output
Where to output to (if the output is large, it will create a sharded directory, otherwise it will create a file)
--output-format
-f
N/A
json, csv, avro, parquet, or any other format supported by Spark
An output format to use for the output. Formats other than json can only be output if the query outputs a highly structured sequence of objects (you can nest your query in an annotate() call to specify a schema if it does not).
--output-format-option:foo
N/A
N/A
bar
Options to further specify the output format (example: separator character for CSV, compression format...)
--overwrite
-O (meaning --overwrite yes)
overwrite
yes, no
Whether to overwrite the --output-path. If no, an error is thrown if the output file/folder already exists.
--materialization-cap
-c
materialization-cap
100000
A cap on the maximum number of items to materialize during the query execution for large sequences within a query. For example, when nesting an expression producing a large sequence of items (and that RumbleDB chose to physically store as an RDD or DataFrame) into an array constructor.
--result-size
result-size
10
A cap on the maximum number of items to output on the screen or to a local list.
--number-of-output-partitions
-P
N/A
ad hoc
How many partitions to create in the output, i.e., the number of files that will be created in the output path directory.
--log-path
N/A
log-path
file:///folder/log.txt
Where to output log information
--print-iterator-tree
N/A
N/A
yes, no
For debugging purposes, prints out the expression tree and runtime iterator tree.
--show-error-info
-v (meaning --show-error-info yes)
show-error-info
yes, no
For debugging purposes. If you want to report a bug, you can use this to get the full exception stack. If no, then only a short message is shown in case of error.
--static-typing
-t (meaning --static-typing yes)
static-typing
yes, no
Activates static type analysis, which annotates the expression tree with inferred types at compile time and enables more optimizations (experimental). Deactivated by default.
--server
serve
N/A
yes, no
yes runs RumbleDB as a server on port 8001. Run queries with http://localhost:8001/jsoniq?query-path=/folder/foo.json
--port
-p
N/A
8001 (default)
Changes the port of the RumbleDB HTTP server to any of your liking
--host
-h
N/A
localhost (default)
Changes the host of the RumbleDB HTTP server to any of your liking
--variable:foo
N/A
variable:foo
bar
--variable:foo bar initializes the global variable $foo to "bar". The query must contain the corresponding global variable declaration, e.g., "declare variable $foo external;"
--context-item
-I
context-item
bar
initializes the global context item $$ to "bar". The query must contain the corresponding context item declaration, e.g., "declare context item external;"
--context-item-input
-i
context-item-input
-
reads the context item value from the standard input
--context-item-input-format
N/A
context-item-input-format
text or json
sets the input format to use for parsing the standard input (as text or as a serialized json value)
--dates-with-timezone
N/A
dates-with-timezone
yes or no
activates timezone support for the type xs:date (deactivated by default)
--lax-json-null-validation
N/A
lax-json-null-validation
yes or no
Allows conflating JSON nulls with absent values when validating nillable object fields for more flexibility (activated by default).
--optimize-general-comparison-to-value-comparison
N/A
optimize-general-comparison-to-value-comparison
yes or no
activates automatic conversion of general comparisons to value comparisons when applicable (activated by default)
--function-inlining
N/A
function-inlining
yes or no
activates function inlining for non-recursive functions (activated by default)
--parallel-execution
N/A
parallel-execution
yes or no
activates parallel execution when possible (activated by default)
--native-execution
N/A
native-execution
yes or no
activates native (Spark SQL) execution when possible (activated by default)
--default-language
N/A
N/A
jsoniq10, jsoniq31, xquery31
specifies the query language to be used
--optimize-steps
N/A
N/A
yes or no
allows RumbleDB to optimize steps, which might violate stability of document order (activated by default)
--optimize-steps-experimental
N/A
N/A
yes or no
experimentally optimizes steps further by skipping uniqueness and sorting in some cases. Correctness is not yet verified (disabled by default)
--optimize-parent-pointers
N/A
N/A
yes or no
allows RumbleDB to remove parent pointers from items if no steps requiring parent pointers are detected statically (activated by default)
--static-base-uri
N/A
N/A
"../data/"
sets the static base uri for the execution. This option overwrites module location but is overwritten by declaration inside query
A JSON file containing a single JSON object (or value) can be read with json-doc(). The access is not parallelized in any way, so the file should be reasonably small. json-doc() can read JSON files even if the object or value is spread over multiple lines.
returns the (single) JSON value read from the supplied JSON file. This will also work for structures spread over multiple lines, as the read is local and not sharded.
json-doc() also works with an HTTP URI.
JSON Lines files are files that have one JSON object (or value) per line. Such files can thus become very large, up to billions or even trillions of JSON objects.
JSON Lines files are read with the json-lines() function (formerly called json-file()). json-lines() exists in unary and binary versions. The first parameter specifies the JSON file (or set of JSON files) to read. The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition, which does not fully use the parallelism. If the input is on HDFS, then blocks are taken as splits by default. This is also similar to Spark's textFile().
json-lines() also works with an HTTP URI, however, it will download the file completely and then parallelize, because HTTP does not support blocks. As a consequence, it can only be used for reasonable sizes.
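For instance, a small sketch of the binary version with an explicit minimum number of partitions (the file name and the partition count are illustrative):
for $my-json in json-lines("file.json", 16)
where $my-json.property eq "some value"
return $my-json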
Example of usage:
If a default host and port are set in the Hadoop configuration, you can directly specify an absolute path without host and port:
For a set of files:
If a working directory is set:
Several files or whole directories can be read with the same pattern syntax as in Spark.
In some cases, JSON Lines files are highly structured, meaning that all objects have the same fields and these fields are associated with values with the same types. In this case, RumbleDB will be faster navigating such files if you open them with the function structured-json-lines().
structured-json-lines() parses one or more JSON files that follow the JSON Lines format and returns a sequence of objects. This enables better performance with fully structured data, and it is recommended to use it only when such data is available.
Warning: when the data has multiple types for the same field, this field and contained values will be treated as strings. This is also similar to Spark's spark.read.json().
Example of usage:
XML files can be read into RumbleDB using the doc() function. The parameter specifies the XML file to read and return as a document node.
Example of usage:
Additionally, RumbleDB provides the xml-files() function to read many XML files at once. xml-files() exists in unary and binary versions. The first parameter specifies the directory of XML files to read. The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition.
Example of usage:
Text files can be read into a sequence of string items, one string per line. RumbleDB can open files that have billions or potentially even trillions of lines with the function text-file().
text-file() exists in unary and binary versions. The first parameter specifies the text file (or set of text files) to read and return as a sequence of strings.
The second, optional parameter specifies the minimum number of partitions. It is recommended to use it in a local setup, as the default is only one partition, which does not fully use the parallelism. If the input is on HDFS, then blocks are taken as splits by default. This is also similar to Spark's textFile().
Example of usage:
Several files or whole directories can be read with the same pattern syntax as in Spark.
(Also see examples for json-lines for host and port, sets of files and working directory).
There is also a function local-text-file() that reads locally, without parallelism. RumbleDB can stream through the file efficiently.
RumbleDB supports also the W3C-standard functions unparsed-text and unparsed-text-lines. The output of the latter is automatically parallelized as a potentially large sequence of strings.
Parquet files can be opened with the function parquet-file().
Parses one or more parquet files and returns a sequence of objects. This is also similar to Spark's spark.read.parquet()
Several files or whole directories can be read with the same pattern syntax as in Spark.
CSV files can be opened with the function csv-file().
Parses one or more csv files and returns a sequence of objects. This is also similar to Spark's spark.read.csv()
Several files or whole directories can be read with the same pattern syntax as in Spark.
Options can be given in the form of a JSON object. All available options can be found in the Spark documentation
PostgreSQL tables can be opened with the function postgresql-table().
PostgreSQL is an OLTP system with its own storage system. Thus, unlike most other functions on this page, it uses a connection string rather than a path on a data lake.
It opens one table and returns it as a sequence of objects. The first argument is the connection string in the JDBC format, containing host, port, username, password, and database. The second argument is the name of the table to read.
The third parameter can be used to control the number of partitions.
MongoDB collections can be opened with the function mongodb-collection().
MongoDB is an OLTP system with its own storage system. Thus, unlike most other functions on this page, it uses a connection string rather than a path on a data lake.
It opens one collection and returns it as a sequence of objects. The first argument is the connection string in the MongoDB format, containing host, port, database, collection, username, password. The second argument is the name of the collection to read.
The third parameter can be used to control the number of partitions.
MongoDB does not work "out of the box" but requires some configuration as indicated on the MongoDB Spark connector website. In the Python edition, we simplified the process and all that is needed is to add withMongo() on the session building chain:
RumbleDB can connect to a table registered in the Hive metastore with the function table().
The Hive metastore manages its own storage system. Thus, unlike most other functions on this page, it uses a simple name rather than a path on a data lake.
RumbleDB can also modify data in a Hive metastore table with the JSONiq Update Facility.
Delta files, part of the Delta Lake framework, can be opened with the function delta-file().
RumbleDB can also modify data in a delta file with the JSONiq Update Facility.
Delta files do not work "out of the box" but require some configuration as indicated on the Delta Lake website (importing packages, configuring some parameters). In the Python edition, we simplified the process and all that is needed is to add withDelta() on the session building chain:
Avro files can be opened with the function avro-file().
Parses one or more avro files and returns a sequence of objects. This is similar to Spark's spark.read().format("avro").load()
Several files or whole directories can be read with the same pattern syntax as in Spark.
Options can be given in the form of a JSON object. All available options relevant for reading in avro data can be found in the Spark documentation
libSVM files can be opened with the function libsvm-file().
Parses one or more libsvm files and returns a sequence of objects. This is similar to Spark's spark.read().format("libsvm").load()
Several files or whole directories can be read with the same pattern syntax as in Spark.
ROOT files can be opened with the function root-file(). The second parameter specifies the path within the ROOT file (a ROOT file is like a mini file system of its own). It is often Events or tree.
The function parallelize() can be used to create, on the fly, a big sequence of items in such a way that RumbleDB can spread its querying across cores and machines.
This function behaves like the Spark parallelize() you are familiar with and sends a large sequence to the cluster. The rest of the FLWOR expression is then evaluated with Spark transformations on the cluster.
There is also a second, optional parameter that specifies the minimum number of partitions.
As a general rule of thumb, RumbleDB can read from any file system that Spark can read from. The file system is inferred from the scheme used in the path used in any of the functions described above, with the exception of MongoDB, the Hive metastore, and PostgreSQL, which are ETL-based.
Note that the scheme is optional, in which case the default file system as configured in Hadoop and Spark is used. A relative path can also be provided, in which case the working directory (including its file system) as configured is used.
The scheme for the local file system is file://. Pay attention to the fact that for reading an absolute path, a third slash will follow the scheme.
Example:
Warning! If you try to open a file from the local file system on a cluster of several machines, this might fail as the file is only on the machine that you are connected to. You need to pass additional parameters to spark-submit to make sure that any files read locally will be copied over to all machines.
If you use spark-submit locally, however, this will work out of the box, but we recommend specifying a number of partitions to avoid reading the file as a single partition.
For Windows, you need to use forward slashes, and if the local file system is set up as the default and you omit the file scheme, you still need a forward slash in front of the drive letter to not confuse it with a URI scheme:
In particular, the following will not work:
The scheme for the Hadoop Distributed File System is hdfs://. A host and port should also be specified, as this is required by Hadoop.
Example:
If HDFS is already set up as the default file system as is often the case in managed Spark clusters, an absolute path suffices:
The following will not work:
There are three schemes for reading from S3: s3://, s3n:// and s3a://.
Examples:
If you are on an Amazon EMR cluster, s3:// is straightforward to use and will automatically authenticate. For more details on how to set up your environment to read from S3 and which scheme is most appropriate, we refer to the Amazon S3 documentation.
The scheme for Azure blob storage is wasb://.
Example:
json-doc("file.json")
for $my-json in json-lines("hdfs://host:port/directory/file.json")
where $my-json.property eq "some value"
return $my-json
for $my-json in json-lines("/absolute/directory/file.json")
where $my-json.property eq "some value"
return $my-json
for $my-json in json-lines("/absolute/directory/file-*.json")
where $my-json.property eq "some value"
return $my-json
for $my-json in json-lines("file.json")
where $my-json.property eq "some value"
return $my-json
for $my-json in json-lines("*.json")
where $my-json.property eq "some value"
return $my-json
for $my-structured-json in structured-json-lines("hdfs://host:port/directory/structured-file.json")
where $my-structured-json.property eq "some value"
return $my-structured-json
doc("path/to/file.xml")
xml-files("path/to/directory/*.xml", 10)
count(
for $my-string in text-file("hdfs://host:port/directory/file.txt")
for $token in tokenize($my-string, ";")
where $token eq "some value"
return $token
)
count(
for $my-string in local-text-file("file:///home/me/file.txt")
for $token in tokenize($my-string, ";")
where $token eq "some value"
return $token
)
count(
for $my-string in unparsed-text-lines("file:///home/me/file.txt")
for $token in tokenize($my-string, ";")
where $token eq "some value"
return $token
)
count(
let $text := unparsed-text("file:///home/me/file.txt")
for $my-string in tokenize($text, "\n")
for $token in tokenize($my-string, ";")
where $token eq "some value"
return $token
)
for $my-object in parquet-file("file.parquet")
where $my-object.property eq "some value"
return $my-object
for $my-object in parquet-file("*.parquet")
where $my-object.property eq "some value"
return $my-object
for $i in csv-file("file.csv")
where $i._c0 eq "some value"
return $i
for $i in csv-file("*.csv")
where $i._c0 eq "some value"
return $i
for $i in csv-file("file.csv", {"header": true, "inferSchema": true})
where $i.key eq "some value"
return $i
for $i in postgresql-table("jdbc:postgresql://servername/dbname?user=postgres&password=example", "tablename")
where $i.attribute eq "some value"
return $i
for $i in postgresql-table("jdbc:postgresql://servername/dbname?user=postgres&password=example", "tablename", 10)
where $i.attribute eq "some value"
return $i
for $i in mongodb-collection("mongodb://servername/dbname", "collection")
where $i.attribute eq "some value"
return $i
for $i in mongodb-collection("mongodb://servername/dbname", "collection", 10)
where $i.attribute eq "some value"
return $i
RumbleSession.builder.withMongo().getOrCreate();
for $i in table("mytable")
where $i.attribute eq "some value"
return $i
for $i in delta-file("hdfs://path/to/my/delta-file")
where $i.attribute eq "some value"
return $i
RumbleSession.builder.withDelta().getOrCreate();
for $i in avro-file("file.avro")
where $i._col1 eq "some value"
return $i
for $i in avro-file("*.avro")
where $i._col1 eq "some value"
return $i
for $i in avro-file("file.avro", {"ignoreExtension": true, "avroSchema": "/path/to/schema.avsc"})
where $i._col1 eq "some value"
return $i
for $i in libsvm-file("file.txt")
where $i._col1 eq "some value"
return $i
for $i in libsvm-file("*.txt")
where $i._col1 eq "some value"
return $i
for $i in root-file("events.root", "Events")
where $i._c0 eq "some value"
return $i
for $i in parallelize(1 to 1000000)
where $i mod 1000 eq 0
return $i
for $i in parallelize(1 to 1000000, 100)
where $i mod 1000 eq 0
return $i
file:///home/user/file.json
file:///C:/Users/hadoop/file.json
file:/C:/Users/hadoop/file.json
/C:/Users/hadoop/file.json
file://C:/Users/hadoop/file.json
C:/Users/hadoop/file.json
C:\Users\hadoop\file.json
file://C:\Users\hadoop\file.json
hdfs://www.example.com:8021/user/hadoop/file.json
/user/hadoop/file.json
hdfs:///user/hadoop/file.json
hdfs://user/hadoop/file.json
hdfs:/user/hadoop/file.json
s3://my-bucket/directory/file.json
s3n://my-bucket/directory/file.json
s3a://my-bucket/directory/file.json
wasb://[email protected]/directory/file.json
In this specification, we detail the JSONiq language in version 1.0. Historically, JSONiq was first created as an extension to XQuery. Later, a separate core syntax was created which makes it 100% tailored for JSON. It is the JSONiq core syntax that is detailed in this document.
The functionality directly inherited from XQuery is described on a higher level and we explicitly refer for more in-depth details to the .
A JSONiq program can either be a main module, which contains a query that can be executed, or a library module, which defines functions and variables that can be used in other modules.
A main or library module can be optionally prefixed with a JSONiq declaration with a version (currently 1.0) and an encoding.
A JSONiq main module is made of two parts: an optional prolog, and an expression, which is the main query.
MainModule
The result of the main JSONiq program is the result of its main query.
In the prolog, it is possible to declare global variables and functions. Mostly, you will recognize a prolog declaration by the semi-colon it ends with. The main query does not contain semi-colons (at least in core JSONiq).
Global variables and functions can use and call each other arbitrarily, even if the dependency is further down in the prolog. If there is a cycle, an error is thrown.
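As a minimal sketch of a main module (the declarations are illustrative), note that each prolog declaration ends with a semi-colon while the main query does not:
jsoniq version "1.0";
declare variable $threshold := 10;
declare function local:double($x as integer) as integer {
2 * $x
};
local:double($threshold)
The result of this program is 20.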
JSONiq largely follows the W3C standard regarding modules. The detailed specification is found here.
Library modules do not contain any main query, just global variables and functions. They can be imported by other modules.
A library module is introduced with a module declaration, followed by the prolog containing its variables and functions.
LibraryModule
JSONiq is 99% reliant on XQuery, a W3C standard. For everything taken over from the W3C standard, a brief, non-normative explanation is provided with a link to the corresponding part in the W3C specification.
JSONiq Data Model
Atomic items
W3C-conformant
Structured items
JSONiq-specific
Function items
W3C-conformant
Node items (XML)
Omitted (optional support by some engines)
JSONiq Type System
The namespace http://jsoniq.org/functions is used for JSONiq builtin functions defined by this specification. This namespace is exposed to the user and is bound by default to the prefix jn. For instance, the function name jn:keys() is in this namespace.
The namespace http://jsoniq.org/types is used for JSONiq builtin types defined by this specification (including synonyms for some XQuery types). This namespace is exposed to the user and is bound by default to the prefix js. For instance, the type name js:null is in this namespace.
The namespace http://jsoniq.org/default-function-namespace is a proxy namespace that maps to the jn: (JSONiq), fn: (XQuery) and math: (XQuery) namespaces. It is the default function namespace, allowing all these functions to be called with no prefix.
The namespace http://jsoniq.org/default-type-namespace is a proxy namespace that maps to the js: (JSONiq) and xs: (XQuery) namespaces. It is the default type namespace, allowing all builtin types to be used with no prefix.
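For instance (a small sketch), thanks to the default function namespace the following two calls resolve to the same builtin function:
keys({ "foo" : 1 })
jn:keys({ "foo" : 1 })
Both return the string "foo".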
Accessors used in JSONiq Data Model use the jdm: prefix. These functions are not exposed to the user and are for explanatory purposes of the data model within this document only. The jdm: prefix is not associated with a namespace.
This function returns the distinct keys of all objects in the supplied sequence, in an implementation-dependent order.
keys($o as item*) as string*
Getting all distinct key names in the supplied objects, ignoring non-objects.
Result (run with Zorba):a b c
Retrieving all Pairs from an Object:
Result (run with Zorba):{ "eyes" : "blue" } { "hair" : "fuchsia" }
This function returns all members of all arrays in the supplied sequence.
members($a as item*) as item*
Retrieving the members of all supplied arrays, ignoring non-arrays.
Result (run with Zorba):mercury venus earth mars 1 2 3
This function returns the JSON null.
null() as null
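For example (a minimal sketch), the call can be embedded in an object constructor:
{ "foo" : null() }
which returns { "foo" : null }.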
This function parses its first parameter (a string) as JSON, and returns the resulting sequence of objects and arrays.
parse-json($arg as string?) as json-item*
parse-json($arg as string?, $options as object) as json-item*
The object optionally supplied as the second parameter may contain additional options:
jsoniq-multiple-top-level-items (boolean): indicates whether parsing zero or several top-level objects is allowed. An error is raised if this value is false and not exactly one object was parsed.
If parsing is not successful, an error is raised. Parsing is considered non-successful in particular if the boolean associated with "jsoniq-multiple-top-level-items" in the additional parameters is false and there is extra content after parsing a single object or array.
Parsing a JSON document
Result (run with Zorba):{ "foo" : "bar" }
Parsing multiple, whitespace-separated JSON documents
Result (run with Zorba):{ "foo" : "bar" } { "bar" : "foo" }
This function returns the size of the supplied array, or the empty sequence if the empty sequence is provided.
size($a as array?) as integer?
Retrieving the size of an array
Result (run with Zorba):10
This function dynamically builds an object, like the {| |} syntax, except that it does not throw an error upon pair collision. Instead, it accumulates them, wrapping into an array if necessary. Non-objects are ignored.
This function returns all arrays contained within the supplied items, regardless of depth.
This function returns all objects contained within the supplied items, regardless of depth.
This function returns all descendant pairs within the supplied items.
Accessing all descendant pairs
Result (run with Zorba):An error was raised: "descendant-pairs": function with arity 1 not declared
This function recursively flattens arrays in the input sequence, leaving non-arrays intact.
This function returns the intersection of the supplied objects, and aggregates values corresponding to the same name into an array. Non-objects are ignored.
This function iterates on the input sequence. It projects objects by filtering their pairs and leaves non-objects intact.
Projecting an object 1
Result (run with Zorba):{ "Captain" : "Kirk", "First Officer" : "Spock" }
Projecting an object 2
Result (run with Zorba):{ }
This function iterates on the input sequence. It removes the pairs with the given keys from all objects and leaves non-objects intact.
Removing keys from an object (not implemented yet)
Result (run with Zorba):An error was raised: "remove-keys": function with arity 2 not declared
This function returns all values in the supplied objects. Non-objects are ignored.
This function encodes any sequence of items, even containing non-JSON types, to a sequence of JSON items that can be serialized as pure JSON, in a way that it can be parsed and decoded back using decode-from-roundtrip. JSON features are left intact, while atomic items annotated with a non-JSON type are converted to objects embedding all necessary information.
encode-for-roundtrip($items as item*) as json-item*
This function decodes a sequence previously encoded with encode-for-roundtrip.
decode-from-roundtrip($items as json-item*) as item*
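As a small sketch (the exact shape of the intermediate encoded objects is engine-dependent), a value with a non-JSON type survives a round trip:
decode-from-roundtrip(
encode-for-roundtrip(date("2013-08-21"))
)
This returns the original date item, whereas serializing it as plain JSON would lose the type annotation.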
Access to the external environment: collection#1
Function to turn atomics into booleans for use in two-valued logics: boolean#1
Functions on numeric values: abs#1, ceiling#1, floor#1, round#1,
Parsing numbers: ,
Formatting integers: ,
Formatting numbers: ,
Trigonometric and exponential functions: , , , , , , , , , , , , ,
Functions to assemble and disassemble strings: ,
Comparison of strings: , ,
Functions on string values: , , , , , , , , , , , , ,
Functions based on substring matching: , , , , , , , , ,
String functions that use regular expressions: , , , , ,
Functions that manipulate URIs: , , , ,
General functions on sequences: , , , , , , , , ,
Function that compare values in sequences: , , , , .
Functions that test the cardinality of sequences: , ,
Aggregate functions: , , , ,
Serializing functions: (unary)
Context information: , , , ,
Constructor functions: for all builtin types, with the name of the builtin type and unary. Equivalent to a cast expression.
let $o := ("foo", [ 1, 2, 3 ], { "a" : 1, "b" : 2 }, { "a" : 3, "c" : 4 })
return keys($o)
let $map := { "eyes" : "blue", "hair" : "fuchsia" }
for $key in keys($map)
return { $key : $map.$key }
let $planets := ( "foo", { "foo" : "bar "}, [ "mercury", "venus", "earth", "mars" ], [ 1, 2, 3 ])
return members($planets)
parse-json("{ \"foo\" : \"bar\" }", { "jsoniq-multiple-top-level-items" : false })
parse-json("{ \"foo\" : \"bar\" } { \"bar\" : \"foo\" }")
let $a := [1 to 10]
return size($a)
declare function accumulate($seq as item*) as object
{
{|
keys($seq) ! { $$ : $seq.$$ }
|}
};
declare function descendant-arrays($seq as item*) as array*
{
for $i in $seq
return typeswitch ($i)
case array return ($i, descendant-arrays($i[]))
case object return descendant-arrays(values($i))
default return ()
};
declare function descendant-objects($seq as item*) as object*
{
for $i in $seq
return typeswitch ($i)
case object return ($i, descendant-objects(values($i)))
case array return descendant-objects($i[])
default return ()
};
declare function descendant-pairs($seq as item*)
{
for $i in $seq
return typeswitch ($i)
case object return
for $k in keys($i)
let $v := $i.$k
return ({ $k : $v }, descendant-pairs($v))
case array return descendant-pairs($i[])
default return ()
};
let $o :=
{
"first" : 1,
"second" : {
"first" : "a",
"second" : "b"
}
}
return descendant-pairs($o)
declare function flatten($seq as item*) as item*
{
for $value in $seq
return typeswitch ($value)
case array return flatten($value[])
default return $value
};
declare function intersect($seq as item*)
{
{|
let $objects := $seq[. instance of object()]
for $key in keys(head($objects))
where every $object in tail($objects)
satisfies exists(index-of(keys($object), $key))
return { $key : $objects.$key }
|}
};
declare function project($seq as item*, $keys as string*) as item*
{
for $item in $seq
return typeswitch ($item)
case $object as object return
{|
for $key in keys($object)
where some $to-project in $keys satisfies $to-project eq $key
let $value := $object.$key
return { $key : $value }
|}
default return $item
};
let $o := {
"Captain" : "Kirk",
"First Officer" : "Spock",
"Engineer" : "Scott"
}
return project($o, ("Captain", "First Officer"))
let $o := {
"Captain" : "Kirk",
"First Officer" : "Spock",
"Engineer" : "Scott"
}
return project($o, "XQuery Evangelist")
declare function remove-keys($seq as item*, $keys as string*) as item*
{
for $item in $seq
return typeswitch ($item)
case $object as object return
{|
for $key in keys($object)
where every $to-remove in $keys satisfies $to-remove ne $key
let $value := $object.$key
return { $key : $value }
|}
default return $item
};
let $o := {
"Captain" : "Kirk",
"First Officer" : "Spock",
"Engineer" : "Scott"
}
return remove-keys($o, ("Captain", "First Officer"))
declare function values($seq as item*) as item* {
for $i in $seq
for $k in jn:keys($i)
return $i($k)
};
Atomic types
W3C-conformant, but support for xs:ID, xs:IDREF, xs:IDREFS, xs:Name, xs:NCName, xs:ENTITY, xs:ENTITIES, xs:NOTATION omitted (except for engines also supporting XML)
js:null type
JSONiq-specific
js:item, js:atomic types
JSONiq-specific synonyms for item() and xs:anyAtomicType
Structured types
JSONiq-specific
Function types
W3C-conformant
Empty sequence type
JSONiq-specific notation () for empty-sequence()
XML node types
Omitted (optional support by engines supporting XML)
Concepts
Effective boolean value
W3C-conformant, extended with object, array and null semantics
Atomization
Omitted (optional support by engines supporting XML)
Expressions
Numeric literals
W3C-conformant
String literals
W3C-conformant, but escape is done with \ not with &
Boolean and null literals
JSONiq-specific
Variable reference
W3C-conformant
Parenthesized expressions
W3C-conformant
Context item expressions
W3C-conformant but $$ syntax instead of .
Static function calls
W3C-conformant
Named function reference
W3C-conformant
Inline function expressions
W3C-conformant
Filter expressions
W3C-conformant
Dynamic function calls
W3C-conformant
Path expressions (XML)
Omitted (optional support by engines supporting XML, but relative paths must start with ./)
Object lookup
JSONiq-specific
Array lookup
JSONiq-specific
Array unboxing
JSONiq-specific
Sequence expressions
W3C-conformant
Arithmetic expressions
W3C-conformant, no atomization needed (except for engines also supporting XML)
String concatenation expressions
W3C-conformant
Comparison expressions
W3C-conformant, no need to atomize or convert from untyped and untypedAtomic (except for engines also supporting XML)
Logical expressions
W3C-conformant
XML constructors
Omitted (optional support by engines supporting XML)
JSON (object and array) constructors
JSONiq-specific
FLWOR expressions
W3C-conformant
Unordered and ordered expressions
W3C-conformant
Conditional expressions
W3C-conformant
Switch expressions
W3C-conformant
Quantified expressions
W3C-conformant
Try-catch expressions
W3C-conformant
Instance-of expressions
W3C-conformant
Typeswitch expressions
W3C-conformant
Cast expressions
W3C-conformant
Castable expressions
W3C-conformant
Constructor functions
W3C-conformant, additional constructor function for null()
Treat expressions
W3C-conformant
Simple map operator
W3C-conformant
Validate expressions
Omitted (optional support by engines supporting XML)
Extension expressions
W3C-conformant
Static context
XPath 1.0 compatibility mode
Omitted (optional support by engines supporting XML)
Statically known namespaces
W3C-conformant
Default element/type namespace
W3C-conformant, strong recommendation for implementations to overwrite with the proxy namespace http://jsoniq.org/default-type-namespace to omit prefixes.
Default function namespace
W3C-conformant, strong recommendation for implementations to overwrite with http://jsoniq.org/default-function-namespace to omit prefixes.
In-scope schema definitions
Omitted (optional support by engines supporting XML)
In-scope variables
W3C-conformant
Context item static type
W3C-conformant
Statically known function signatures
W3C-conformant, augmented with all JSONiq builtin functions
Statically known collations
W3C-conformant
Default collation
W3C-conformant
Construction mode
Omitted (optional support by engines supporting XML)
Ordering mode
W3C-conformant
Default order for empty sequences
W3C-conformant
Boundary-space policy
Omitted (optional support by engines supporting XML)
Copy-namespaces mode
Omitted (optional support by engines supporting XML)
Static Base URI
W3C-conformant
Statically known documents
Omitted (optional support by engines supporting XML)
Statically known collections
Omitted (optional support by engines supporting XML)
Statically known default collection type
Omitted (optional support by engines supporting XML)
Statically known decimal formats
W3C-conformant
Dynamic context
Context item
W3C-conformant (but with syntax $$ not .)
Initial context item
W3C-conformant
Context position
W3C-conformant
Context size
W3C-conformant
Variable values
W3C-conformant
Named functions
W3C-conformant
Current dateTime
W3C-conformant
Implicit timezone
W3C-conformant
Default language
W3C-conformant
Default calendar
W3C-conformant
Default place
W3C-conformant
Available documents
Omitted (optional support by engines supporting XML)
Available text resources
W3C-conformant
Available node collections
Omitted (optional support by engines supporting XML)
Default node collection
Omitted (optional support by engines supporting XML)
Available resource collections
Omitted (optional support by engines supporting XML)
Default resource collection
Omitted (optional support by engines supporting XML)
Environment variables
W3C-conformant


RumbleDB ML is a Machine Learning library built on top of the RumbleDB engine that makes it more productive and easier to perform ML tasks thanks to the abstraction layer provided by JSONiq.
The machine learning capabilities are exposed through JSONiq function items. The concepts of "estimator" and "transformer", which are core to Machine Learning, are naturally function items and fit seamlessly in the JSONiq data model.
Training sets, test sets, and validation sets, which contain features and labels, are exposed through JSONiq sequences of object items: the keys of these objects are the features and labels.
The names of the estimators and of the transformers, as well as the functionality they encapsulate, are directly inherited from the SparkML library which RumbleDB ML is based on: we chose not to reinvent the wheel.
A transformer is a function item that maps a sequence of objects to a sequence of objects.
It is an abstraction that either performs a feature transformation or generates predictions based on trained models. For example:
Tokenizer is a feature transformer that receives textual input data and splits it into individual terms (usually words), which are called tokens.
KMeansModel is a trained model and a transformer that can read a dataset containing features and generate predictions as its output.
An estimator is a function item that maps a sequence of objects to a transformer (yes, you got it right: that's a function item returned by a function item. This is why they are also called higher-order functions!).
Estimators abstract the concept of a Machine Learning algorithm or any algorithm that fits or trains on data. For example, a learning algorithm such as KMeans is implemented as an Estimator. Calling this estimator on data essentially trains a KMeansModel, which is a Model and hence a Transformer.
Transformers and estimators are function items in the RumbleDB Data Model. Their first argument is the sequence of objects that represents, for example, the training set or test set. Parameters can be provided as their second argument. This second argument is expected to be an object item. The machine learning parameters form the fields of the said object item as key-value pairs.
RumbleDB ML works on highly structured data, because it requires full type information for all the fields in the training set or test set. It is on our development plan to automate the detection of these types when the sequence of objects gets created on the fly.
RumbleDB supports a user-defined type system with which you can validate and annotate datasets against a JSound schema.
This annotation is required for any dataset used as input to RumbleDB ML, but it is superfluous if the data was directly read from a structured input format such as Parquet, CSV, Avro, SVM or ROOT.
Tokenizer Example:
KMeans Example:
declare type local:id-and-sentence as {
"id": "integer",
"sentence": "string"
};
let $local-data := (
{"id": 1, "sentence": "Hi I heard about Spark"},
{"id": 2, "sentence": "I wish Java could use case classes"},
{"id": 3, "sentence": "Logistic regression models are neat"}
)
let $df-data := validate type local:id-and-sentence* { $local-data }
let $transformer := get-transformer("Tokenizer")
for $i in $transformer(
$df-data,
{"inputCol": "sentence", "outputCol": "output"}
)
return $i
// returns
// { "id" : 1, "sentence" : "Hi I heard about Spark", "output" : [ "hi", "i", "heard", "about", "spark" ] }
// { "id" : 2, "sentence" : "I wish Java could use case classes", "output" : [ "i", "wish", "java", "could", "use", "case", "classes" ] }
// { "id" : 3, "sentence" : "Logistic regression models are neat", "output" : [ "logistic", "regression", "models", "are", "neat" ] }
declare type local:col-1-2-3 as {
"id": "integer",
"col1": "decimal",
"col2": "decimal",
"col3": "decimal"
};
let $vector-assembler := get-transformer("VectorAssembler")(
?,
{ "inputCols" : [ "col1", "col2", "col3" ], "outputCol" : "features" }
)
let $local-data := (
{"id": 0, "col1": 0.0, "col2": 0.0, "col3": 0.0},
{"id": 1, "col1": 0.1, "col2": 0.1, "col3": 0.1},
{"id": 2, "col1": 0.2, "col2": 0.2, "col3": 0.2},
{"id": 3, "col1": 9.0, "col2": 9.0, "col3": 9.0},
{"id": 4, "col1": 9.1, "col2": 9.1, "col3": 9.1},
{"id": 5, "col1": 9.2, "col2": 9.2, "col3": 9.2}
)
let $df-data := validate type local:col-1-2-3* {$local-data }
let $df-data := $vector-assembler($df-data)
let $est := get-estimator("KMeans")
let $tra := $est(
$df-data,
{"featuresCol": "features"}
)
for $i in $tra(
$df-data,
{"featuresCol": "features"}
)
return $i
// returns
// { "id" : 0, "col1" : 0, "col2" : 0, "col3" : 0, "prediction" : 0 }
// { "id" : 1, "col1" : 0.1, "col2" : 0.1, "col3" : 0.1, "prediction" : 0 }
// { "id" : 2, "col1" : 0.2, "col2" : 0.2, "col3" : 0.2, "prediction" : 0 }
// { "id" : 3, "col1" : 9, "col2" : 9, "col3" : 9, "prediction" : 1 }
// { "id" : 4, "col1" : 9.1, "col2" : 9.1, "col3" : 9.1, "prediction" : 1 }
// { "id" : 5, "col1" : 9.2, "col2" : 9.2, "col3" : 9.2, "prediction" : 1 }
- aggregationDepth: integer
- censorCol: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- maxIter: integer
- predictionCol: string
- quantileProbabilities: array (of double)
- quantilesCol: string
- tol: double
- alpha: double
- checkpointInterval: integer
- coldStartStrategy: string
- finalStorageLevel: string
- implicitPrefs: boolean
- intermediateStorageLevel: string
- itemCol: string
- maxIter: integer
- nonnegative: boolean
- numBlocks: integer
- numItemBlocks: integer
- numUserBlocks: integer
- predictionCol: string
- rank: integer
- ratingCol: string
- regParam: double
- seed: double
- userCol: string
- distanceMeasure: string
- featuresCol: string
- k: integer
- maxIter: integer
- minDivisibleClusterSize: double
- predictionCol: string
- seed: double
- bucketLength: double
- inputCol: string
- numHashTables: integer
- outputCol: string
- seed: double
- fdr: double
- featuresCol: string
- fpr: double
- fwe: double
- labelCol: string
- numTopFeatures: integer
- outputCol: string
- percentile: double
- selectorType: string
- binary: boolean
- inputCol: string
- maxDF: double
- minDF: double
- minTF: double
- outputCol: string
- vocabSize: integer
- collectSubModels: boolean
- estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- numFolds: integer
- parallelism: integer
- seed: double
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- thresholds: array (of double)
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- seed: double
- varianceCol: string
- itemsCol: string
- minConfidence: double
- minSupport: double
- numPartitions: integer
- predictionCol: string
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- lossType: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- thresholds: array (of double)
- validationIndicatorCol: string
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- lossType: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- validationIndicatorCol: string
- featuresCol: string
- k: integer
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- seed: double
- tol: double
- family: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- link: string
- linkPower: double
- linkPredictionCol: string
- maxIter: integer
- offsetCol: string
- predictionCol: string
- regParam: double
- solver: string
- tol: double
- variancePower: double
- weightCol: string
- inputCol: string
- minDocFreq: integer
- outputCol: string
- inputCols: array (of string)
- missingValue: double
- outputCols: array (of string)
- strategy: string
- featureIndex: integer
- featuresCol: string
- isotonic: boolean
- labelCol: string
- predictionCol: string
- weightCol: string
- distanceMeasure: string
- featuresCol: string
- initMode: string
- initSteps: integer
- k: integer
- maxIter: integer
- predictionCol: string
- seed: double
- tol: double
- checkpointInterval: integer
- docConcentration: double
- docConcentration: array (of double)
- featuresCol: string
- k: integer
- keepLastCheckpoint: boolean
- learningDecay: double
- learningOffset: double
- maxIter: integer
- optimizeDocConcentration: boolean
- optimizer: string
- seed: double
- subsamplingRate: double
- topicConcentration: double
- topicDistributionCol: string
- aggregationDepth: integer
- elasticNetParam: double
- epsilon: double
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- loss: string
- maxIter: integer
- predictionCol: string
- regParam: double
- solver: string
- standardization: boolean
- tol: double
- weightCol: string
- aggregationDepth: integer
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- maxIter: integer
- predictionCol: string
- rawPredictionCol: string
- regParam: double
- standardization: boolean
- threshold: double
- tol: double
- weightCol: string
- aggregationDepth: integer
- elasticNetParam: double
- family: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- lowerBoundsOnCoefficients: object (of object of double)
- lowerBoundsOnIntercepts: object (of double)
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- regParam: double
- standardization: boolean
- threshold: double
- thresholds: array (of double)
- tol: double
- upperBoundsOnCoefficients: object (of object of double)
- upperBoundsOnIntercepts: object (of double)
- weightCol: string
- inputCol: string
- outputCol: string
- inputCol: string
- numHashTables: integer
- outputCol: string
- seed: double
- inputCol: string
- max: double
- min: double
- outputCol: string
- blockSize: integer
- featuresCol: string
- initialWeights: object (of double)
- labelCol: string
- layers: array (of integer)
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- solver: string
- stepSize: double
- thresholds: array (of double)
- tol: double
- featuresCol: string
- labelCol: string
- modelType: string
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- smoothing: double
- thresholds: array (of double)
- weightCol: string
- dropLast: boolean
- handleInvalid: string
- inputCols: array (of string)
- outputCols: array (of string)
- featuresCol: string
- labelCol: string
- parallelism: integer
- predictionCol: string
- rawPredictionCol: string
- weightCol: string
- inputCol: string
- k: integer
- outputCol: string
- handleInvalid: string
- inputCol: string
- inputCols: array (of string)
- numBuckets: integer
- numBucketsArray: array (of integer)
- outputCol: string
- outputCols: array (of string)
- relativeError: double
- featuresCol: string
- forceIndexLabel: boolean
- formula: string
- handleInvalid: string
- labelCol: string
- stringIndexerOrderType: string
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- subsamplingRate: double
- thresholds: array (of double)
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- predictionCol: string
- seed: double
- subsamplingRate: double
- inputCol: string
- outputCol: string
- withMean: boolean
- withStd: boolean
- handleInvalid: string
- inputCol: string
- outputCol: string
- stringOrderType: string
- collectSubModels: boolean
- estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- parallelism: integer
- seed: double
- trainRatio: double
- handleInvalid: string
- inputCol: string
- maxCategories: integer
- outputCol: string
- inputCol: string
- maxIter: integer
- maxSentenceLength: integer
- minCount: integer
- numPartitions: integer
- outputCol: string
- seed: double
- stepSize: double
- vectorSize: integer
- windowSize: integer
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- quantileProbabilities: array (of double)
- quantilesCol: string
- coldStartStrategy: string
- itemCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- userCol: string
- inputCol: string
- outputCol: string
- threshold: double
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- handleInvalid: string
- inputCol: string
- inputCols: array (of string)
- outputCol: string
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- splits: array (of double)
- splitsArray: array (of array of double)
- featuresCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- binary: boolean
- inputCol: string
- minTF: double
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- inputCol: string
- inverse: boolean
- outputCol: string
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- thresholds: array (of double)
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- varianceCol: string
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- seed: double
- topicDistributionCol: string
- inputCol: string
- outputCol: string
- scalingVec: object (of double)
- itemsCol: string
- minConfidence: double
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- categoricalCols: array (of string)
- inputCols: array (of string)
- numFeatures: integer
- outputCol: string
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- thresholds: array (of double)
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- featuresCol: string
- linkPredictionCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- binary: boolean
- inputCol: string
- numFeatures: integer
- outputCol: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- inputCols: array (of string)
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- inputCol: string
- labels: array (of string)
- outputCol: string
- inputCols: array (of string)
- outputCol: string
- featureIndex: integer
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- rawPredictionCol: string
- threshold: double
- weightCol: double
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- seed: double
- topicDistributionCol: string
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- threshold: double
- thresholds: array (of double)
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- inputCol: string
- max: double
- min: double
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- thresholds: array (of double)
- inputCol: string
- n: integer
- outputCol: string
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- thresholds: array (of double)
- inputCol: string
- outputCol: string
- p: double
- dropLast: boolean
- inputCol: string
- outputCol: string
- dropLast: boolean
- handleInvalid: string
- inputCols: array (of string)
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- rawPredictionCol: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- degree: integer
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- subsamplingRate: double
- thresholds: array (of double)
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- subsamplingRate: double
- gaps: boolean
- inputCol: string
- minTokenLength: integer
- outputCol: string
- pattern: string
- toLowercase: boolean
- statement: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- caseSensitive: boolean
- inputCol: string
- locale: string
- outputCol: string
- stopWords: array (of string)
- handleInvalid: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- handleInvalid: string
- inputCols: array (of string)
- outputCol: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- handleInvalid: string
- inputCol: string
- size: integer
- indices: array (of integer)
- inputCol: string
- names: array (of string)
- outputCol: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)

We list here the most important functions supported by RumbleDB, and introduce them by means of examples. Highly detailed specifications can be found in the underlying W3C standard, unless the function is marked as specific to JSON or RumbleDB, in which case it can be found here. JSONiq and RumbleDB intentionally do not support builtin functions on XML nodes, NOTATION or QNames. RumbleDB supports almost all other W3C-standardized functions; please contact us if you are still missing one.
For the sake of ease of use, all W3C standard builtin functions and JSONiq builtin functions are in the RumbleDB namespace, which is the default function namespace and does not require any prefix in front of function names.
It is recommended that user-defined functions are put in the local namespace, i.e., their name should have the local: prefix (which is predefined). Otherwise, there is the risk that your code becomes incompatible with subsequent releases if new (unprefixed) builtin functions are introduced.
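For example, a main query with a user-defined function in the local namespace might look as follows (a minimal sketch; the function name and body are invented for illustration):

```
declare function local:add-one($x as integer) as integer {
  $x + 1
};
local:add-one(41)
```

which returns 42.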
Fully implemented
returns (1, 2, 3) and logs it in the log-path if specified
Fully implemented
returns 2.0
Fully implemented
returns 3.0
Fully implemented
returns 2.0
Fully implemented
returns 2.0
returns 2.23
Fully implemented
Fully implemented
returns 15 as a double
returns NaN as a double
returns 15 as a double
Not implemented
##Formatting numbers
Not implemented
##Trigonometric and exponential functions
###pi
Fully implemented
returns 3.141592653589793
###exp
Fully implemented
###exp10
Fully implemented
Fully implemented
Fully implemented
Fully implemented
Fully implemented
returns 2
Fully implemented
Fully implemented
JSONiq-specific. Fully implemented
JSONiq-specific. Fully implemented
Fully implemented
Fully implemented
Fully implemented
Fully implemented
Fully implemented
Not implemented
Fully implemented
returns (84, 104, 233, 114, 232, 115, 101)
returns ()
Fully implemented
returns "अशॊक"
returns ""
Fully implemented
returns -1
Fully implemented
returns true
returns ()
Not implemented
Not implemented
Fully implemented
returns "foobarfoobar"
Fully implemented
returns "foobarfoobar"
returns "foo-bar-foobar"
Fully implemented
returns "bar"
returns "ba"
Fully implemented
Returns the length of the supplied string, or 0 if the empty sequence is supplied.
returns 3.
returns 0.
###normalize-space
Fully implemented
Normalization of spaces in a string.
returns "The wealthy curled darlings of our nation."
Fully implemented
Returns the value of the input after applying Unicode normalization.
returns the Unicode-normalized version of the input string. Normalization forms NFC, NFD, NFKC, and NFKD are supported. "FULLY-NORMALIZED", though supported, should be used with caution: only the composition exclusion characters that are uncommented in the corresponding Unicode exclusion list are supported.
Fully implemented
returns "ABCD0"
Fully implemented
returns "abc!d"
Fully implemented
returns "BAr"
returns "AAA"
Fully implemented
returns true.
Fully implemented
returns true
Fully implemented
returns true.
Fully implemented
returns "foo"
returns "f"
Fully implemented
returns "bar"
returns ""
Arity 2 implemented, arity 3 is not.
Regular expression matching. The semantics of regular expressions are those of Java's Pattern class.
returns true.
returns true.
Arity 3 implemented, arity 4 is not.
Regular expression matching and replacing. The semantics of regular expressions are those of Java's Pattern class.
returns "a*cada*"
returns "abbraccaddabbra"
Arity 2 implemented, arity 3 is not.
returns ("aa", "bb", "cc", "dd")
returns ("aa", "bb", "cc", "dd")
Not implemented
Fully implemented
returns http://www.examples.com/examples
Fully implemented
returns 100%25%20organic
Not implemented
Not implemented
Fully implemented
returns true
Fully implemented
returns false
Fully implemented
returns true
returns false
Fully implemented
returns false
returns true
Fully implemented
returns 2021.
Fully implemented
returns 6.
Fully implemented
returns 17.
Fully implemented
returns 12.
Fully implemented
returns 35.
Fully implemented
returns 30.
Fully implemented
returns 2004-04-12T13:20:00+14:00
Fully implemented
returns 2021.
Fully implemented
returns 04.
Fully implemented
returns 12.
Fully implemented
returns 13.
Fully implemented
returns 20.
Fully implemented
returns 32.
Fully implemented
returns PT2H.
Fully implemented
returns 2021.
Fully implemented
returns 6.
Fully implemented
returns 4.
Fully implemented
returns -PT14H.
Fully implemented
returns 13.
Fully implemented
returns 20.
Fully implemented
returns 32.123.
Fully implemented
returns PT2H.
Fully implemented
returns 2004-04-12T03:25:15+04:05.
Fully implemented
returns 2014-03-12+04:00.
Fully implemented
returns 04:20:00-14:00.
The functions in this section accept a simplified version of the picture string, in which a variable marker accepts only:
One of the following component specifiers: Y, M, d, D, F, H, m, s, P
A first presentation modifier, for which the value can be:
Nn, for all supported component specifiers, besides P
N, if the component specifier is P
a format token that indicates a numbering sequence of the following form: '0001'
A second presentation modifier, for which the value can be t or c, which are also the default values
A width modifier, both minimum and maximum values
Fully implemented
returns 20-13-12-4-2004
Fully implemented
returns 12-4-2004
Fully implemented
returns 13-20-0
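For reference, the first of the results above corresponds to a call along these lines, using only the supported component specifiers:

```
format-dateTime(dateTime("2004-04-12T13:20:00"), "[m]-[H]-[D]-[M]-[Y]")
```

which returns 20-13-12-4-2004.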
Not implemented
Fully implemented
Returns a boolean indicating whether the input sequence is empty.
returns false.
Fully implemented
Returns a boolean indicating whether the input sequence has at least one item.
returns true.
returns false.
This is pushed down to Spark and works on big sequences.
Fully implemented
Returns the first item of a sequence, or the empty sequence if it is empty.
returns 1.
returns ().
This is pushed down to Spark and works on big sequences.
Fully implemented
Returns all but the last item of a sequence, or the empty sequence if it is empty.
returns (2, 3, 4, 5).
returns ().
This is pushed down to Spark and works on big sequences.
Fully implemented
returns (1, 2, 3, 4, 5).
Fully implemented
returns (1, 2).
Fully implemented
returns (3, 2, 1).
Fully implemented
returns (2, 3).
Fully implemented
returns (1, 2, 3).
Fully implemented
Eliminates duplicates from a sequence of atomic items.
returns (1, 4, 3, "foo", true, 5).
This is pushed down to Spark and works on big sequences.
Fully implemented
returns 3.
returns "".
Fully implemented
returns true.
returns false.
Fully implemented
returns "a".
returns an error.
Fully implemented
returns "a".
returns an error.
Fully implemented
returns "a".
returns an error.
Fully implemented
returns 4.
Count calls are pushed down to Spark, so this works on billions of items as well:
Fully implemented
returns 2.5.
Avg calls are pushed down to Spark, so this works on billions of items as well:
Fully implemented
returns 4.
returns (1, 2, 3).
Max calls are pushed down to Spark, so this works on billions of items as well:
Fully implemented
returns 1.
returns (1, 2, 3).
Min calls are pushed down to Spark, so this works on billions of items as well:
Fully implemented
returns 10.
Sum calls are pushed down to Spark, so this works on billions of items as well:
Fully implemented
Returns the corresponding document node
Not implemented
Fully implemented
Serializes the supplied input sequence, returning the serialized representation of the sequence as a string
returns { "hello" : "world" }
Fully implemented
returns 5
Fully implemented
returns 10
returns 10
Fully implemented
returns 2020-02-26T11:22:48.423+01:00
Fully implemented
returns 2020-02-26Europe/Zurich
Fully implemented
returns 11:24:10.064+01:00
Fully implemented
returns PT1H.
Fully implemented
returns http://www.w3.org/2005/xpath-functions/collation/codepoint.
Not implemented
Not implemented
Not implemented
Not implemented
Not implemented
Not implemented
Not implemented
Not implemented
Fully implemented
returns ("foo", "bar"). Also works on an input sequence, eliminating duplicates
Keys calls are pushed down to Spark, so this works on billions of items as well:
Fully implemented
This function returns the members of an array, but not recursively, i.e., nested arrays are not unboxed.
Returns the first 100 integers as a sequence. Also works on an input sequence, in a distributive way.
Fully implemented
Returns a JSON null (also available as the literal null).
Fully implemented
Fully implemented
returns 100. Also works if the empty sequence is supplied, in which case it returns the empty sequence.
Fully implemented
returns
Fully implemented
returns
Fully implemented
returns
Fully implemented
returns
Fully implemented
Unboxes arrays recursively, stopping the recursion when any other item is reached (object or atomic). Also works on an input sequence, in a distributive way.
Returns (1, 2, 3, 4, 5, 6, 7, 8, 9).
Fully implemented
returns
Fully implemented
returns the object {"foo" : "bar", "bar" : "foobar"}. Also works on an input sequence, in a distributive way.
Fully implemented
returns the object {"foobar" : "foo"}. Also works on an input sequence, in a distributive way.
Fully implemented
returns ("bar", "foobar"). Also works on an input sequence, in a distributive way.
Values calls are pushed down to Spark, so this works on billions of items as well:
Not implemented
Not implemented
returns the (unique) JSON value parsed from a local JSON (but not necessarily JSON Lines) file where this value may be spread over multiple lines.



trace(1 to 3)
abs(-2)
ceiling(2.3)
floor(2.3)
round(2.3)
round(2.2345, 2)
round-half-to-even(2.2345, 2)
round-half-to-even(2.2345)
number("15")
number("foo")
number(15)
pi()
exp(10)
exp10(10)
log(100)
log10(100)
pow(10, 2)
sqrt(4)
sin(pi())
cos(pi())
cosh(pi())
sinh(pi())
tan(pi())
asin(1)
acos(1)
atan(1)
atan2(1)
string-to-codepoints("Thérèse")
string-to-codepoints("")
codepoints-to-string((2309, 2358, 2378, 2325))
codepoints-to-string(())
compare("aa", "bb")
codepoint-equal("abcd", "abcd")
codepoint-equal("", ())
concat("foo", "bar", "foobar")
string-join(("foo", "bar", "foobar"))
string-join(("foo", "bar", "foobar"), "-")
substring("foobar", 4)
substring("foobar", 4, 2)
string-length("foo")
string-length(())
normalize-space(" The wealthy curled darlings of our nation. ")
normalize-unicode("hello world", "NFC")
upper-case("abCd0")
lower-case("ABc!D")
translate("bar","abc","ABC")
translate("--aaa--","abc-","ABC")
contains("foobar", "ob")
starts-with("foobar", "foo")
ends-with("foobar", "bar")
substring-before("foobar", "bar")
substring-before("foobar", "o")
substring-after("foobar", "foo")
substring-after("foobar", "r")
matches("foobar", "o+")
matches("foobar", "^fo+.*")
replace("abracadabra", "bra", "*")
replace("abracadabra", "a(.)", "a$1$1")
tokenize("aa bb cc dd")
tokenize("aa;bb;cc;dd", ";")
string(resolve-uri("examples","http://www.examples.com/"))
encode-for-uri("100% organic")
fn:true()
fn:false()
boolean(9)
boolean("")
not(9)
boolean("")
years-from-duration(duration("P2021Y6M"))
months-from-duration(duration("P2021Y6M"))
days-from-duration(duration("P2021Y6M17D"))
hours-from-duration(duration("P2021Y6M17DT12H35M30S"))
minutes-from-duration(duration("P2021Y6M17DT12H35M30S"))
minutes-from-duration(duration("P2021Y6M17DT12H35M30S"))
dateTime("2004-04-12T13:20:00+14:00")
year-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
month-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
day-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
hours-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
minutes-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
seconds-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
timezone-from-dateTime(dateTime("2021-04-12T13:20:32.123+02:00"))
year-from-date(date("2021-06-04"))
month-from-date(date("2021-06-04"))
day-from-date(date("2021-06-04"))
timezone-from-date(date("2021-06-04-14:00"))
hours-from-time(time("13:20:32.123+02:00"))
minutes-from-time(time("13:20:32.123+02:00"))
seconds-from-time(time("13:20:32.123+02:00"))
timezone-from-time(time("13:20:32.123+02:00"))
adjust-dateTime-to-timezone(dateTime("2004-04-12T13:20:15+14:00"), dayTimeDuration("PT4H5M"))
adjust-date-to-timezone(date("2014-03-12"), dayTimeDuration("PT4H"))
adjust-time-to-timezone(time("13:20:00-05:00"), dayTimeDuration("-PT14H"))
format-dateTime(dateTime("2004-04-12T13:20:00"), "[m]-[H]-[D]-[M]-[Y]")
format-date(date("2004-04-12"), "[D]-[M]-[Y]")
format-time(time("13:20:00"), "[H]-[m]-[s]")
empty(1 to 10)
exists(1 to 10)
exists(())
exists(json-lines("file.json"))
head(1 to 10)
head(())
head(json-lines("file.json"))
tail(1 to 5)
tail(())
tail(json-lines("file.json"))
insert-before((3, 4, 5), 0, (1, 2))
remove((1, 2, 10), 3)
remove((1, 2, 3))
subsequence((1, 2, 3), 2, 5)
unordered((1, 2, 3))
distinct-values((1, 1, 4, 3, 1, 1, "foo", 4, "foo", true, 3, 1, true, 5, 3, 1, 1))
distinct-values(json-lines("file.json").foo)
distinct-values(text-file("file.txt"))
index-of((10, 20, 30, 40), 30)
index-of((10, 20, 30, 40), 35)
deep-equal((10, 20, "a"), (10, 20, "a"))
deep-equal(("b", "0"), ("b", 0))
zero-or-one(("a"))
zero-or-one(("a", "b"))
one-or-more(("a"))
one-or-more(())
exactly-one(("a"))
exactly-one(("a", "b"))
let $x := (1, 2, 3, 4)
return count($x)
count(json-lines("file.json"))
count(
for $i in json-lines("file.json")
where $i.foo eq "bar"
return $i
)
let $x := (1, 2, 3, 4)
return avg($x)
avg(json-lines("file.json").foo)
let $x := (1, 2, 3, 4)
return max($x)
for $i in 1 to 3
return max($i)
max(json-lines("file.json").foo)
let $x := (1, 2, 3, 4)
return min($x)
for $i in 1 to 3
return min($i)
min(json-lines("file.json").foo)
let $x := (1, 2, 3, 4)
return sum($x)
sum(json-lines("file.json").foo)
doc("path/to/file.xml")
serialize({hello: "world"})
(1 to 10)[position() eq 5]
(1 to 10)[position() eq last()]
(1 to 10)[last()]
current-dateTime()
current-date()
current-time()
implicit-timezone()
default-collation()
keys({"foo" : "bar", "bar" : "foobar"})
keys(({"foo" : "bar", "bar" : "foobar"}, {"foo": "bar2"}))
keys(json-lines("file.json"))
members([1 to 100])
members(([1 to 100], [ 300 to 1000 ]))
null()
size([1 to 100])
size(())
accumulate(({ "b" : 2 }, { "c" : 3 }, { "b" : [1, "abc"] }, {"c" : {"d" : 0.17}}))
{ "b" : [ 2, [ 1, "abc" ] ], "c" : [ 3, { "d" : 0.17 } ] }
descendant-arrays(([0, "x", { "a" : [1, {"b" : 2}, [2.5]], "o" : {"c" : 3} }]))
[ 0, "x", { "a" : [ 1, { "b" : 2 }, [ 2.5 ] ], "o" : {"c" : 3} } ]
[ 1, { "b" : 2 }, [ 2.5 ] ]
[ 2.5 ]
descendant-objects(([0, "x", { "a" : [1, {"b" : 2}, [2.5]], "o" : {"c" : 3} }]))
{ "a" : [ 1, { "b" : 2 }, [ 2.5 ] ], "o" : { "c" : 3 } }
{ "b" : 2 }
{ "c" : 3 }
descendant-pairs(({ "a" : [1, {"b" : 2}], "d" : {"c" : 3} }))
{ "a" : [ 1, { "b" : 2 } ] }
{ "b" : 2 }
{ "d" : { "c" : 3 } }
{ "c" : 3 }
flatten(([1, 2], [[3, 4], [5, 6]], [7, [8, 9]]))
intersect(({"a" : "abc", "b" : 2, "c" : [1, 2], "d" : "0"}, { "a" : 2, "b" : "ab", "c" : "foo" }))
{ "a" : [ "abc", 2 ], "b" : [ 2, "ab" ], "c" : [ [ 1, 2 ], "foo" ] }
project({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, ("foo", "bar"))
project(({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, {"foo": "bar2"}), ("foo", "bar"))
remove-keys({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, ("foo", "bar"))
remove-keys(({"foo" : "bar", "bar" : "foobar", "foobar" : "foo" }, {"foo": "bar2"}), ("foo", "bar"))
values({"foo" : "bar", "bar" : "foobar"})
values(({"foo" : "bar", "bar" : "foobar"}, {"foo" : "bar2"}))
values(json-lines("file.json"))
json-doc("/Users/sheldon/object.json")
In JSONiq, objects, arrays and basic atomic values (string, number, boolean, null) are constructed exactly as they are constructed in JSON. Any JSON document is also a valid JSONiq query which just "returns itself".
Because JSONiq expressions are fully composable, however, in objects and arrays constructors, it is possible to put any JSONiq expression and not only atomic literals, object constructors and array constructors. Furthermore, JSONiq supports the construction of other W3C-standardized builtin types (date, hexBinary, etc).
The following examples are a few of many operators available in JSONiq: "to" for creating arithmetic sequences, "||" for concatenating strings, "+" for adding numbers, "," for appending sequences.
In an array constructor, the operand expression will be evaluated to a sequence of items, and these items will be copied and become members of the newly created array.
Result:[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
In an object constructor, the expression you use for the key must evaluate to an atomic; if it is not a string, it is cast to a string.
Result:{ "foobar" : true }
An error is raised if the key expression is not an atomic.
Result:An error was raised: can not atomize an array item: an array has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)
If the value expression is empty, null will be used as a value, and if it contains two items or more, they will be wrapped into an array.
If the colon is preceded with a question mark, then the pair will be omitted if the value expression evaluates to the empty sequence.
Result:{ "foo" : 2 }
Result:{ "foo" : null, "bar" : [ 1, 2 ] }
Result:An error was raised: invalid expression: syntax error, unexpected "?", expecting "end of file" or "," or "}"
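As an illustration of the two behaviours described above (the key names are arbitrary), a constructor along these lines:

```
{ "foo" ? : (), "bar" : (1, 2) }
```

omits the "foo" pair entirely and wraps the two items bound to "bar" into an array, returning { "bar" : [ 1, 2 ] }.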
The {| |} syntax can be used to merge several objects.
Result:{ "foo" : "bar", "bar" : "foo" }
An error is raised if the operand expression does not evaluate to a sequence of objects.
Result:An error was raised: xs:integer can not be treated as type object()*
JSONiq follows the W3C standard for constructing numbers. The following explanations, provided as an informal summary for convenience, are non-normative.
Literal
NumericLiteral
IntegerLiteral
DecimalLiteral
DoubleLiteral
The syntax for creating numbers is identical to that of JSON (it is actually a more flexible superset, for example leading 0s are allowed, and a decimal literal can begin with a dot). Note that JSONiq distinguishes between integers (no dot, no scientific notation), decimals (dot but no scientific notation) and doubles (scientific notation). As expected, an integer literal creates an atomic of type integer, and so on.
Integer literals
Result:42
Decimal literals
Result:3.14
Double literals
Result:6.022E23
The syntax for creating string items is conformant to JSON rather than to the W3C standard for string literals. This means, concretely, that escaping is done with backslashes and not with ampersands. Also, like in JSON, double quotes are required and single quotes are forbidden.
StringLiteral
String literals
Result:foo
String literals with escaping
Result:This is a line and this is a new line
String literals with Unicode character escaping
Result:
String literals with a nested quote
Result:This is a nested "quote"
JSONiq also introduces three more literals for constructing booleans and nulls: true, false and null. In particular, this makes the functions true() and false() superfluous.
BooleanLiteral
NullLiteral
Boolean literals (true)
Result:true
Boolean literals (false)
Result:false
Null literals
Result:null
JSONiq follows the W3C standard for constructing most atomic values with constructors. In JSONiq, the xs prefix is optional.
Expressions constructing objects are JSONiq-specific and introduced in this specification.
ObjectConstructor
PairConstructor
The syntax for creating objects is identical to that of JSON. You can use for an object key any string literal, and for an object value any literal, object constructor or array constructor.
Empty object constructors
Result:{ }
Object constructors 1
Result:{ "foo" : "bar" }
Object constructors 2
Result:{ "foo" : [ 1, 2, 3, 4, 5, 6 ] }
Object constructors 3
Result:{ "foo" : true, "bar" : false }
Nested object constructors
Result:{ "this is a key" : { "value" : "a value" } }
As in JavaScript, if your key is simple enough (alphanumeric characters, underscores, dashes, and the like), the quotes can be omitted. The strings for which quotes are not mandatory are called NCNames. This class of strings can be used for unquoted keys, for variable and function names, and for module aliases.
Object constructors with unquoted key 1
Result:{ "foo" : "bar" }
Object constructors with unquoted key 2
Result:{ "foo" : [ 1, 2, 3, 4, 5, 6 ] }
Object constructors with unquoted key 3
Result:{ "foo" : "bar", "bar" : "foo" }
Object constructors with needed quotes around the key
Result:{ "but you need the quotes here" : null }
Objects can be constructed more dynamically (e.g., dynamic keys) by constructing and merging smaller objects. Duplicate key names throw an error.
Object constructors with needed quotes around the key
Result:{ "foo1" : 1, "foo2" : 2, "foo3" : 3 }
Expressions constructing arrays are JSONiq-specific and introduced in this specification.
ArrayConstructor
Expr
The syntax for creating arrays is identical to that of JSON: square brackets, comma separated literals, object constructors and arrays constructors.
Empty array constructors
Result:[ ]
Array constructors
Result:[ 1, 2, 3, 4, 5, 6 ]
Nested array constructors
Result:[ "foo", 3.14, [ "Go", "Boldly", "When", "No", "Man", "Has", "Gone", "Before" ], { "foo" : "bar" }, true, false, null ]
Square brackets are mandatory. Do not push it.
JSONiq follows the W3C standard for constructing function items, with inline function expressions or named function references. The following explanations, provided as an informal summary for convenience, are non-normative.
Function items can be constructed in two ways: by defining their body directly (inline function expression), or by referring by name to a function declared in a prolog.
FunctionItemExpr
Inline function expression
JSONiq follows the W3C standard for constructing function items with inline function expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
A function can be built directly by specifying its parameters and its body as expression. Types are optional and by default, assumed to be item*.
Function items can also be produced with a partial function application.
Inline function expression
Result:(two function items)
InlineFunctionExpr
ParamList
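A minimal sketch of an inline function expression with explicit types (the names are invented for illustration):

```
let $f := function($x as integer, $y as integer) as integer { $x + $y }
return $f(40, 2)
```

which returns 42.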
Named function reference
JSONiq follows the W3C standard for constructing function items with named function references. The following explanations, provided as an informal summary for convenience, are non-normative.
If a function is builtin or declared in a prolog, in the same module or imported, then it is also possible to build a function item by referring to its name and arity.
Named function reference
Result:(a function item)
NamedFunctionRef
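For instance, a named function reference to a builtin function with a fixed arity might be used along these lines (a sketch):

```
let $f := substring#2
return $f("foobar", 4)
```

which returns bar.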
We now introduce the expressions that manipulate atomic values: arithmetics, logics, comparison, string concatenation.
JSONiq follows the W3C standard for arithmetic expressions, and naturally extends it to return errors for null values. The following explanations, provided as an informal summary for convenience, are non-normative.
JSONiq supports the basic four operations, integer division and modulo.
Multiplicative operations have precedence over additive operations. Parentheses can override it.
Basic arithmetic operations with precedence override
Result (run with Zorba):8
Dates, times and durations are also supported in a natural way.
Using basic operations with dates.
Result (run with Zorba):P29D
If any of the operands is a sequence of more than one item, an error is raised.
Sequence of more than one number in an addition
Result (run with Zorba):An error was raised: sequence of more than one item can not be promoted to parameter type xs:anyAtomicType? of function add()
If any of the operands is not a number, a date, a time or a duration, an error is raised, which seamlessly includes raising errors for null with no need to extend the specification.
Null in an addition
Result (run with Zorba):An error was raised: arithmetic operation not defined between types "xs:integer" and "js:null"
If one of the operands evaluates to the empty sequence, then the operation results in the empty sequence.
If the two operands do not have the same number type, JSONiq will do the adequate conversions.
Basic arithmetic operations with an empty sequence
Result (run with Zorba):
AdditiveExpr
MultiplicativeExpr
UnaryExpr
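A small sketch combining the rules above (precedence, type promotion, and the empty sequence):

```
2 + 2 * 2,
(2 + 2) * 2,
1 + 1.5,
() + 2
```

which returns 6 8 2.5: the parentheses override the precedence, the integer is promoted to a decimal in the mixed-type addition, and the last addition contributes nothing because one operand is the empty sequence.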
JSONiq follows the W3C standard for string concatenation. The following explanations, provided as an informal summary for convenience, are non-normative.
Two strings or more can be concatenated using the concatenation operator.
String concatenation
Result (run with Zorba):Captain Kirk
An empty sequence is treated like an empty string.
String concatenation with the empty sequence
Result (run with Zorba):CaptainKirk
StringConcatExpr
JSONiq follows the W3C standard for comparison, and only extends its semantics to null values as follows.
null can be compared for equality or inequality to anything: it is only equal to itself, so that false is returned when comparing it for equality with any non-null atomic, and true is returned when comparing it for non-equality with any non-null atomic.
Equality and non-equality comparison with null
Result (run with Zorba):false true true
For ordering operators (lt, le, gt, ge), null is considered the smallest possible value (like in JavaScript).
Ordering comparison with null
Result (run with Zorba):false
The following explanations, provided as an informal summary for convenience, are non-normative.
ComparisonExpr
Atomics can be compared with the usual six comparison operators (equality, non-equality, lower-than, greater-than, lower-or-equal, greater-or-equal), and with the same two-letter symbols as in MongoDB.
Equality comparison
Result (run with Zorba):true true
Comparison is only possible between two compatible types; otherwise, an error is raised.
Comparisons with a type mismatch
Result (run with Zorba):An error was raised: "xs:string": invalid type: can not compare for equality to type "xs:integer"
Like for arithmetic operations, if an operand is the empty sequence, the empty sequence is returned as well.
Comparison with the empty sequence
Result (run with Zorba):
Comparisons and logic operators are fundamental for a query language and for the implementation of a query processor as they impact query optimization greatly. The current comparison semantics for them is carefully chosen to have the right characteristics as to enable optimization.
JSONiq follows the W3C standard for logical expressions; it introduces a prefix unary not operator as a synonym for fn:not, and extends the semantics of effective boolean values to objects, arrays and nulls. The following explanations, provided as an informal summary for convenience, are non-normative.
OrExpr
AndExpr
NotExpr
JSONiq logics support is based on two-valued logics: just true and false.
Non-boolean operands get automatically converted to either true or false, or an error is raised. The boolean() function performs a manual conversion.
An empty sequence is converted to false.
A singleton sequence of one null is converted to false.
A singleton sequence of one string is converted to true except the empty string which is converted to false.
A singleton sequence of one number is converted to true except zero or NaN which are converted to false.
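These conversion rules can be observed directly with the boolean() function (a sketch):

```
boolean(()), boolean(null), boolean(""), boolean("foo"), boolean(0), boolean(42)
```

which returns false false false true false true.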
JSONiq supports the most famous three boolean operations: conjunction, disjunction and negation. Negation has the highest precedence, then conjunction, then disjunction. Parentheses can override.
Logics with booleans
Result (run with Zorba):true
Logics with comparing operands
Result (run with Zorba):true
Conversion of the empty sequence to false
Result (run with Zorba):false
Conversion of null to false
Result (run with Zorba):false
Conversion of a string to true
Result (run with Zorba):true false
Conversion of a number to false
Result (run with Zorba):false true
Conversion of an object to a boolean (not implemented in Zorba at this point)
Result (run with Zorba):true
If the input sequence has more than one item, and the first item is not an object or array, an error is raised.
Error upon conversion of a sequence of more than one item, not beginning with a JSON item, to a boolean
Result (run with Zorba):An error was raised: invalid argument type for function fn:boolean(): effective boolean value not defined for sequence of more than one item that starts with "xs:integer"
Unlike in C++ or Java, you cannot rely on the order of evaluation of the operands of a boolean operation. The following query may return true or may return an error.
Non-determinism in presence of errors.
Result (run with Zorba):true
JSONiq follows the W3C standard for quantified expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
QuantifiedExpr
It is possible to perform a conjunction or a disjunction on a predicate for each item in a sequence.
Universal quantifier
Result (run with Zorba):true
Existential quantifier on several variables
Result (run with Zorba):true
Variables can be annotated with a type. If no type is specified, item* is assumed. If the type does not match, an error is raised.
Existential quantifier with type checking
Result (run with Zorba):true
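Quantified expressions along these lines illustrate both quantifiers (the data is invented for illustration):

```
every $x in 1 to 10 satisfies $x gt 0,
some $x in (1, 2, 3), $y in (4, 5, 6) satisfies $x + $y eq 9
```

which returns true true.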
JSONiq can create sequences with concatenation (comma) or with a range. Parentheses can be used for overriding precedence.
JSONiq follows the W3C standard for the concatenation of sequences with commas. The following explanations, provided as an informal summary for convenience, are non-normative.
Expr
Use a comma to concatenate two sequences, or even single items. This operator has the lowest precedence of all.
Comma
Result (run with Zorba):1 2 3 4 5 6 7 8 9 10
Comma
Result (run with Zorba):{ "foo" : "bar" } [ 1 ]
Sequences do not nest. You need to use arrays in order to nest.
JSONiq follows the W3C standard for range expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
RangeExpr
With the binary operator "to", you can generate larger sequences with just two integer operands.
Range operator
Result (run with Zorba):1 2 3 4 5 6 7 8 9 10
If one operand evaluates to the empty sequence, then the range operator returns the empty sequence.
Range operator with the empty sequence
Result (run with Zorba):
Otherwise, if an operand evaluates to something other than a single integer or an empty sequence, an error is raised.
Range operator with a type inconsistency
Result (run with Zorba):An error was raised: sequence of more than one item can not be promoted to parameter type xs:integer? of function to()
JSONiq follows the W3C standard for parenthesized expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
ParenthesizedExpr
Use parentheses to override the precedence of expressions.
If the parentheses are empty, the empty sequence is produced.
Empty sequence
Result (run with Zorba):
JSONiq follows the W3C standard for function calls. The following explanations, provided as an informal summary for convenience, are non-normative.
Function calls in JSONiq can either be made statically, with a named function, or dynamically, by passing a function item on the fly.
The syntax for function calls is similar to many other languages. JSONiq supports four sorts of functions:
Builtin functions: these have no prefix and can be called without any import.
Local functions: they are defined in the prolog, to be used in the main query. They have the prefix local:. A later chapter describes how to define your own local functions.
Imported functions: they are defined in a library module. They have the prefix corresponding to the alias to which the imported module has been bound. A later chapter describes how to define your own modules.
Anonymous functions: they are constructed directly as function items, for example with inline function expressions, and have no name.
The first three are named functions and can be called statically. All four can be called dynamically, as a named function can also be passed as an item with a named function reference.
JSONiq follows the W3C standard for static function calls. The following explanations, provided as an informal summary for convenience, are non-normative.
A static function call consists of the name of the function and of expressions returning its parameters. An error is thrown if no function with the corresponding name and arity is found.
A builtin function call.
Result:foo bar
A builtin function call.
Result:foobar
An error is raised if the actual types do not match the expected types.
A type error in a function call.
Result:An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)
JSONiq static function calls follow the W3C standard.
FunctionCall
JSONiq follows the W3C standard for dynamic function calls. The following explanations, provided as an informal summary for convenience, are non-normative.
A dynamic function call is a postfix expression. Its left-hand side is an expression that must return a single function item (see the section on function items in the data model). Its right-hand side is a list of parameters, each one of which is an arbitrary expression providing a sequence of items, one such sequence for each parameter.
A dynamic function call.
Result:3
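A dynamic function call of this kind might look as follows (a sketch):

```
let $f := function($x, $y) { $x + $y }
return $f(1, 2)
```

which returns 3.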
If the number of parameters does not match the arity of the function, an error is raised. An error is also raised if an argument value does not match the corresponding type in the function signature.
Otherwise, the function is evaluated with the supplied parameters. If the result matches the return type of the function, it is returned, otherwise an error is raised.
A dynamic function call with signature
Result:3
JSONiq dynamic function calls follow the W3C standard.
PostfixExpr
ArgumentList
Argument
JSONiq follows the W3C standard for partial application. The following explanations, provided as an informal summary for convenience, are non-normative.
A static or dynamic function call can also have placeholder parameters, represented with a question mark in the syntax. When this is the case, the function call returns a function item that is the partial application of the original function, and its arity is the number of remaining placeholders.
A partial application.
Result:4
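A partial application along these lines produces the result shown above (a sketch; the names are invented for illustration):

```
let $add := function($x, $y) { $x + $y }
let $add-three := $add(?, 3)
return $add-three(1)
```

which returns 4.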
JSONiq partial applications follow the W3C standard.
Like in JavaScript, it is possible to navigate through objects and arrays. This is a specific JSONiq extension.
JSONiq also allows filtering sequences with predicates, and predicates are fully W3C-conformant.
JSONiq supports filtering items from a sequence, looking up the value associated with a given key in an object, looking up the item at a given position in an array, and looking up all items in an array.
PostfixExpr
ObjectLookup
The simplest way to navigate in an object is similar to JavaScript, using a dot. This will work as soon as you do not push it too much: alphanumerical characters, dashes, underscores - just like unquoted keys in object constructors, any NCName is allowed.
Object lookup
Result (run with Zorba):bar
Since JSONiq expressions are composable, you can also use any expression for the left-hand side. You might need parentheses depending on the precedence.
Lookup on a single-object collection.
Result (run with Zorba):bar
The dot operator does an implicit mapping on the left-hand-side, i.e., it applies the lookup in turn on each item. Lookup on an object returns the value associated with the supplied key, or the empty sequence if there is none. Lookup on any item which is not an object (arrays and atomics) results in the empty sequence.
Object lookup with an iteration on several objects
Result (run with Zorba):bar bar2
Object lookup with an iteration on a collection
Result (run with Zorba):James T. Kirk Jean-Luc Picard Benjamin Sisko Kathryn Janeway Jonathan Archer Samantha Carter
Object lookup on a mixed sequence
Result (run with Zorba):bar1 bar2
Of course, unquoted keys will not work for strings that are not NCNames, e.g., if the field contains a dot or begins with a digit. Then you will need quotes.
Quotes for object lookup
Result (run with Zorba):bar
If you use an expression on the right side of the dot, it must always have parentheses. The result of the right-hand-side expression is cast to a string. An error is raised if the cast fails.
Object lookup with a nested expression
Result (run with Zorba):bar
Object lookup with a nested expression
Result (run with Zorba):An error was raised: sequence of more than one item can not be treated as type xs:string
Object lookup with a nested expression
Result (run with Zorba):bar
Variables, or a context item reference, do not need parentheses. Variables are introduced later, but here is a sneak peek:
Object lookup with a variable
Result (run with Zorba):bar
ArrayLookup
Array lookup uses double square brackets.
Array lookup
Result (run with Zorba):bar
Since JSONiq expressions are composable, you can also use any expression for the left-hand side. You might need parentheses depending on the precedence.
Array lookup after an object lookup
Result (run with Zorba):bar
The array lookup operator does an implicit mapping on the left-hand-side, i.e., it applies the lookup in turn on each item. Lookup on an array returns the item at that position in the array, or the empty sequence if there is none (position larger than size or smaller than 1). Lookup on any item which is not an array (objects and atomics) results in the empty sequence.
Array lookup with an iteration on several arrays
Result (run with Zorba):2 5
Array lookup with an iteration on a collection
Result (run with Zorba):The original series The next generation The next generation The next generation Entreprise Voyager
Array lookup on a mixed sequence
Result (run with Zorba):3 6
The expression inside the double-square brackets may be any expression. The result of evaluating this expression is cast to an integer. An error is raised if the cast fails.
Array lookup with a right-hand-side expression
Result (run with Zorba):bar
ArrayUnboxing
You can also extract all items from an array (i.e., as a sequence) with the [] syntax. The [] operator also implicitly iterates on the left-hand-side, returning the empty sequence for non-arrays.
Extracting all items from an array
Result (run with Zorba):foo bar
Extracting all items from arrays in a mixed sequence
Result (run with Zorba):foo bar 1 2 3
Predicate
A predicate allows filtering a sequence, keeping only items that fulfill it.
The predicate is evaluated once for each item in the left-hand-side sequence, with the context item set to that item. The predicate expression can use $$ to access this context item.
ContextItemExpr
If the predicate evaluates to an integer, it is matched against the item position in the left-hand side sequence automatically.
Predicate expression
Result (run with Zorba):2
Otherwise, the result of the predicate is converted to a boolean.
All items for which the converted predicate result evaluates to true are then output.
Predicate expression
Result (run with Zorba):2 4 6 8 10
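Queries along these lines produce the two results shown above (a sketch):

```
(1 to 10)[2]
```

```
(1 to 10)[$$ mod 2 eq 0]
```

The first returns 2, because an integer predicate selects by position; the second returns 2 4 6 8 10, because the predicate is converted to a boolean for each context item.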
JSONiq supports control flow expressions such as if-then-else, switch and typeswitch following the W3C standard.
JSONiq follows the W3C standard for conditional expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
IfExpr
A conditional expressions allows you to pick one or another value depending on a boolean value.
A conditional expression
Result (run with Zorba):{ "foo" : "yes" }
The behavior of the expression inside the if is similar to that of logical operations (two-valued logics), meaning that non-boolean values get converted to a boolean.
A conditional expression
Result (run with Zorba):{ "foo" : "no" }
A conditional expression
Result (run with Zorba):{ "foo" : "yes" }
A conditional expression
Result (run with Zorba):{ "foo" : "no" }
A conditional expression
Result (run with Zorba):{ "foo" : "yes" }
A conditional expression
Result (run with Zorba):{ "foo" : "no" }
A conditional expression
Result (run with Zorba):{ "foo" : "no" }
A conditional expression
Result (run with Zorba):{ "foo" : "yes" }
Note that the else clause is mandatory (but can be the empty sequence).
A conditional expression
Result (run with Zorba):{ "foo" : "yes" }
JSONiq follows the W3C standard for switch expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
SwitchExpr
SwitchCaseClause
A switch expression evaluates the expression inside the switch. If it is an atomic, it compares it in turn to the provided atomic values (with the semantics of the eq operator) and returns the value associated with the first matching case clause.
Note that if there is an object or array in the base switch expression or any case expression, a JSONiq-specific type error JNTY0004 will be raised, because objects and arrays cannot be atomized and the W3C standard requires atomization of the base and case expressions.
A switch expression
Result (run with Zorba):bar
If it is not an atomic, an error is raised.
A switch expression
Result (run with Zorba):An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)
If no value matches, the default is used.
A switch expression
Result (run with Zorba):none
The case clauses support composability of expressions as well.
A switch expression
Result (run with Zorba):foo
A switch expression
Result (run with Zorba):1 + 1 is 2
JSONiq follows the W3C standard for try-catch expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
TryCatchExpr
A try catch expression evaluates the expression inside the try block and returns its resulting value.
However, if an error is raised dynamically, the catch clause is evaluated and its result value returned.
A try catch expression
Result (run with Zorba):division by zero!
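A try-catch expression of this kind might look as follows (a sketch):

```
try { 1 div 0 }
catch * { "division by zero!" }
```

which returns the string shown above.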
Only errors raised within the lexical scope of the try block are caught.
A try catch expression
Result (run with Zorba):An error was raised: division by zero
Errors that are detected statically within the try block are still reported statically.
A try catch expression
Result (run with Zorba):syntax error
JSONiq follows the W3C standard for FLWOR expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
FLWORExpr
FLWOR expressions are probably the most powerful JSONiq construct and correspond to SQL's SELECT-FROM-WHERE statements, but they are more general and more flexible. In particular, clauses can appear in almost any order (except that the expression must begin with a for or let clause and end with a return clause).
Here is a bit of theory on how it works.
A clause binds values to some variables according to its own semantics, possibly several times. Each time, a tuple of variable bindings (mapping variable names to sequences) is passed on to the next clause.
This goes all the way down, until the return clause. The return clause is eventually evaluated for each tuple of variable bindings, resulting in a sequence of items for each tuple.
These sequences of items are concatenated, in the order of the incoming tuples, and the obtained sequence is returned by the FLWOR expression.
We are now giving practical examples with a hint on how it maps to SQL.
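As a first overall sketch (the data is inlined and invented for illustration), a complete FLWOR expression might look like this:

```
for $c in (
  { "name" : "Kirk", "century" : 23 },
  { "name" : "Picard", "century" : 24 }
)
where $c.century eq 24
order by $c.name
return $c.name
```

which returns Picard.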
JSONiq follows the W3C standard for for clauses. The following explanations, provided as an informal summary for convenience, are non-normative.
ForClause
For clauses allow iteration on a sequence.
For each incoming tuple, the expression in the for clause is evaluated to a sequence. Each item in this sequence is in turn bound to the for variable. A tuple is hence produced for each incoming tuple, and for each item in the sequence produced by the for clause for this tuple.
The order in which items are bound by the for clause can be relaxed with unordered expressions, as described later in this section.
The following query, using a for and a return clause, is the counterpart of SQL's "SELECT name FROM captains". $x is bound in turn to each item in the captains collection.
A for clause.
Result (run with Zorba):James T. Kirk Jean-Luc Picard Benjamin Sisko Kathryn Janeway Jonathan Archer Samantha Carter
For clause expressions are composable; there can be several of them.
Two for clauses.
Result (run with Zorba):11 12 13 21 22 23 31 32 33
A for clause.
Result (run with Zorba):11 12 13 21 22 23 31 32 33
A for variable is visible to subsequent bindings.
A for clause.
Result (run with Zorba):1 2 3 4 5 6 7 8 9
A for clause.
Result (run with Zorba):{ "captain" : "James T. Kirk", "series" : "The original series" } { "captain" : "Jean-Luc Picard", "series" : "The next generation" } { "captain" : "Benjamin Sisko", "series" : "The next generation" } { "captain" : "Benjamin Sisko", "series" : "Deep Space 9" } { "captain" : "Kathryn Janeway", "series" : "The next generation" } { "captain" : "Kathryn Janeway", "series" : "Voyager" } { "captain" : "Jonathan Archer", "series" : "Entreprise" } { "captain" : null, "series" : "Voyager" }
It is also possible to bind the position of the current item in the sequence to a variable.
A for clause.
for $x at $position in collection("captains")
return { "captain" : $x.name, "id" : $position }
Result (run with Zorba): { "captain" : "James T. Kirk", "id" : 1 } { "captain" : "Jean-Luc Picard", "id" : 2 } { "captain" : "Benjamin Sisko", "id" : 3 } { "captain" : "Kathryn Janeway", "id" : 4 } { "captain" : "Jonathan Archer", "id" : 5 } { "captain" : null, "id" : 6 } { "captain" : "Samantha Carter", "id" : 7 }
JSONiq supports joins. For example, the counterpart of "SELECT c.name AS captain, m.name AS movie FROM captains c JOIN movies m ON m.captain = c.name" is:
A join
for $captain in collection("captains"), $movie in collection("movies")[ try { $$.captain eq $captain.name } catch * { false } ]
return { "captain" : $captain.name, "movie" : $movie.name }
Result (run with Zorba): { "captain" : "James T. Kirk", "movie" : "The Motion Picture" } { "captain" : "James T. Kirk", "movie" : "The Wrath of Kahn" } { "captain" : "James T. Kirk", "movie" : "The Search for Spock" } { "captain" : "James T. Kirk", "movie" : "The Voyage Home" } { "captain" : "James T. Kirk", "movie" : "The Final Frontier" } { "captain" : "James T. Kirk", "movie" : "The Undiscovered Country" } { "captain" : "Jean-Luc Picard", "movie" : "First Contact" } { "captain" : "Jean-Luc Picard", "movie" : "Insurrection" } { "captain" : "Jean-Luc Picard", "movie" : "Nemesis" }
Note how JSONiq handles semi-structured data in a flexible way.
Outer joins are also possible with "allowing empty", i.e., output will also be produced if there is no matching movie for a captain. The following query is the counterpart of "SELECT c.name AS captain, m.name AS movie FROM captains c LEFT JOIN movies m ON c.name = m.captain".
A join
for $captain in collection("captains"), $movie allowing empty in collection("movies")[ try { $$.captain eq $captain.name } catch * { false } ]
return { "captain" : $captain.name, "movie" : $movie.name }
Result (run with Zorba): { "captain" : "James T. Kirk", "movie" : "The Motion Picture" } { "captain" : "James T. Kirk", "movie" : "The Wrath of Kahn" } { "captain" : "James T. Kirk", "movie" : "The Search for Spock" } { "captain" : "James T. Kirk", "movie" : "The Voyage Home" } { "captain" : "James T. Kirk", "movie" : "The Final Frontier" } { "captain" : "James T. Kirk", "movie" : "The Undiscovered Country" } { "captain" : "Jean-Luc Picard", "movie" : "First Contact" } { "captain" : "Jean-Luc Picard", "movie" : "Insurrection" } { "captain" : "Jean-Luc Picard", "movie" : "Nemesis" } { "captain" : "Benjamin Sisko", "movie" : null } { "captain" : "Kathryn Janeway", "movie" : null } { "captain" : "Jonathan Archer", "movie" : null } { "captain" : null, "movie" : null } { "captain" : "Samantha Carter", "movie" : null }
JSONiq follows the standard XQuery semantics for where clauses. The following explanations, provided as an informal summary for convenience, are non-normative.
WhereClause
Where clauses are used for filtering (selection operator in the relational algebra).
For each incoming tuple, the expression in the where clause is evaluated to a boolean (possibly converting an atomic to a boolean). If this boolean is true, the tuple is forwarded to the next clause; otherwise, it is dropped.
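As a small sketch of this conversion (using inline numbers rather than the captains collection): 0 converts to false and is dropped, while the non-zero numbers convert to true and are kept.
for $x in (0, 1, 2)
where $x
return $x
This would be expected to return 1 2.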
The following query corresponds to "SELECT series FROM captains WHERE name = 'Kathryn Janeway'".
A where clause.
for $x in collection("captains")
where $x.name eq "Kathryn Janeway"
return $x.series
Result (run with Zorba): [ "The next generation", "Voyager" ]
JSONiq follows the standard XQuery semantics for order by clauses. The following explanations, provided as an informal summary for convenience, are non-normative.
OrderByClause
Order by clauses reorder tuples.
For each incoming tuple, the expression in the order by clause is evaluated to an atomic. The tuples are then sorted based on the atomics they are associated with, and forwarded to the next clause.
Like for ordering comparisons, null values are always considered the smallest.
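As a small, hedged sketch of this behaviour (with an inline sequence rather than a collection), a null value would be expected to come first when sorting in ascending order:
for $x in (2, null, 1)
order by $x
return $x
The expected output is null 1 2.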
The following query is the counterpart of SQL's "SELECT * FROM captains ORDER BY name".
An order by clause.
for $x in collection("captains")
order by $x.name
return $x
Result (run with Zorba): { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }
Multiple sorting criteria can be given; they are treated like a lexicographic order (most important criterion first).
An order by clause.
for $x in collection("captains")
order by size($x.series), $x.name
return $x
Result (run with Zorba): { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 }
It can be specified whether the order is ascending or descending. Empty sequences are allowed and it can be chosen whether to put them first or last.
An order by clause.
for $x in collection("captains")
order by $x.name descending empty greatest
return $x
Result (run with Zorba): { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 }
An error is raised if the expression does not evaluate to an atomic or the empty sequence.
An order by clause.
for $x in collection("captains")
order by $x
return $x.name
Result (run with Zorba): An error was raised: can not atomize an object item: an object has probably been passed where an atomic value is expected (e.g., as a key, or to a function expecting an atomic item)
Collations can be used to specify how strings are to be ordered. A collation is identified by a URI.
Use of a collation in an order by clause.
for $x in collection("captains")
order by $x.name collation "http://www.w3.org/2005/xpath-functions/collation/codepoint"
return $x.name
Result (run with Zorba): Benjamin Sisko James T. Kirk Jean-Luc Picard Jonathan Archer Kathryn Janeway Samantha Carter
JSONiq follows the standard XQuery semantics for group by clauses. The following explanations, provided as an informal summary for convenience, are non-normative.
GroupByClause
Grouping is also supported, like in SQL.
For each incoming tuple, the expression in the group clause is evaluated to an atomic (a grouping key). The incoming tuples are then grouped according to the key they are associated with.
For each group, a tuple is output, with a binding from the grouping variable to the key of the group.
A group by clause.
for $x in collection("captains")
group by $century := $x.century
return { "century" : $century }
Result (run with Zorba): { "century" : 21 } { "century" : 22 } { "century" : 23 } { "century" : 24 }
As for the other (non-grouping) variables, their values within one group are all concatenated, keeping the same name. Aggregations can be done on these variables.
The following query is equivalent to "SELECT century, COUNT(*) FROM captains GROUP BY century".
A group by clause.
for $x in collection("captains")
group by $century := $x.century
return { "century" : $century, "count" : count($x) }
Result (run with Zorba): { "century" : 21, "count" : 1 } { "century" : 22, "count" : 1 } { "century" : 23, "count" : 1 } { "century" : 24, "count" : 4 }
JSONiq's group by is more flexible than SQL's and is fully composable.
A group by clause.
for $x in collection("captains")
group by $century := $x.century
return { "century" : $century, "captains" : [ $x.name ] }
Result (run with Zorba): { "century" : 21, "captains" : [ "Samantha Carter" ] } { "century" : 22, "captains" : [ "Jonathan Archer" ] } { "century" : 23, "captains" : [ "James T. Kirk" ] } { "century" : 24, "captains" : [ "Jean-Luc Picard", "Benjamin Sisko", "Kathryn Janeway" ] }
Unlike SQL, JSONiq does not need a having clause, because a where clause works perfectly after grouping as well.
The following query is the counterpart of "SELECT century, COUNT(*) FROM captains GROUP BY century HAVING COUNT(*) > 1"
A group by clause.
for $x in collection("captains")
group by $century := $x.century
where count($x) gt 1
return { "century" : $century, "count" : count($x) }
Result (run with Zorba): { "century" : 24, "count" : 4 }
JSONiq follows the standard XQuery semantics for let clauses. The following explanations, provided as an informal summary for convenience, are non-normative.
LetClause
Let bindings can be used to define aliases for any sequence, for convenience.
For each incoming tuple, the expression in the let clause is evaluated to a sequence. A binding is added from this sequence to the let variable in each tuple. A tuple is hence produced for each incoming tuple.
A let clause.
for $x in collection("captains")
let $century := $x.century
group by $century
let $number := count($x)
where $number gt 1
return { "century" : $century, "count" : $number }
Result (run with Zorba): { "century" : 24, "count" : 4 }
Note that it is perfectly fine to reuse a variable name and hide a variable binding.
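A minimal sketch of hiding a binding (not using the captains collection): the second let rebinds $x, hiding the first binding in the rest of the FLWOR expression.
let $x := 1
let $x := $x + 1
return $x
This would be expected to return 2.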
A let clause.
for $x in collection("captains")
let $century := $x.century
group by $century
let $number := count($x)
let $number := count(distinct-values(for $series in $x.series
return typeswitch($series)
case array return $series()
default return $series ))
where $number gt 1
return { "century" : $century, "number of series" : $number }
Result (run with Zorba): { "century" : 24, "number of series" : 3 }
JSONiq follows the standard XQuery semantics for count clauses. The following explanations, provided as an informal summary for convenience, are non-normative.
CountClause
For each incoming tuple, a binding from the position of this tuple in the tuple stream to the count variable is added. The new tuple is then forwarded to the next clause.
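As a minimal sketch (with an inline sequence rather than a collection), the count clause below numbers the tuples produced by the for clause:
for $x in ("a", "b", "c")
count $c
return { "position" : $c, "value" : $x }
This would be expected to return { "position" : 1, "value" : "a" } { "position" : 2, "value" : "b" } { "position" : 3, "value" : "c" }.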
A count clause.
for $x in collection("captains")
order by $x.name
count $c
return { "id" : $c, "captain" : $x }
Result (run with Zorba): { "id" : 1, "captain" : { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } } { "id" : 2, "captain" : { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 } } { "id" : 3, "captain" : { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } } { "id" : 4, "captain" : { "name" : "Jonathan Archer", "series" : [ "Entreprise" ], "century" : 22 } } { "id" : 5, "captain" : { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } } { "id" : 6, "captain" : { "name" : "Samantha Carter", "series" : [ ], "century" : 21 } } { "id" : 7, "captain" : { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 } }
JSONiq follows the standard XQuery semantics for the simple map operator, except that the context item is written $$ instead of . (the dot being used for object lookup).
The following explanations, provided as an informal summary for convenience, are non-normative.
SimpleMapExpr
ContextItemExpr
JSONiq provides a shortcut for a for-return construct, automatically binding each item in the left-hand-side sequence to the context item.
A simple map
(1 to 10) ! ($$ * 2)
Result (run with Zorba): 2 4 6 8 10 12 14 16 18 20
An equivalent query
for $i in 1 to 10
return $i * 2
Result (run with Zorba): 2 4 6 8 10 12 14 16 18 20
JSONiq follows the standard XQuery semantics for variable references, except that the character . is disallowed in variable names, because it is used for object lookup instead.
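For example, because the dot cannot appear in a variable name, $captain.name below is unambiguously parsed as an object lookup on the variable $captain (a small, self-contained sketch):
let $captain := { "name" : "Kathryn Janeway" }
return $captain.name
This would be expected to return Kathryn Janeway.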
Like all other expressions, FLWOR expressions can be composed. In the following example, a FLWOR is nested in a function call, which is nested in a FLWOR, which is nested in an array constructor:
Nested FLWORs
[
for $c in collection("captains")
where exists(for $m in collection("movies")
where some $moviecaptain in let $captain := $m.captain
return typeswitch ($captain)
case array return $captain()
default return $captain
satisfies
$moviecaptain eq $c.name
return $m)
return $c.name
]
Result (run with Zorba): [ "James T. Kirk", "Jean-Luc Picard" ]
JSONiq follows the standard XQuery semantics for ordered and unordered expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
OrderedExpr
UnorderedExpr
By default, the order in which a for clause binds its items is important.
This behaviour can be relaxed in order to give the optimizer more leeway. An unordered expression relaxes ordering by for clauses within its operand scope:
An unordered expression.
unordered {
for $captain in collection("captains")
where $captain.century eq 24
return $captain
}
Result (run with Zorba): { "name" : "Jean-Luc Picard", "series" : [ "The next generation" ], "century" : 24 } { "name" : "Benjamin Sisko", "series" : [ "The next generation", "Deep Space 9" ], "century" : 24 } { "name" : "Kathryn Janeway", "series" : [ "The next generation", "Voyager" ], "century" : 24 } { "codename" : "Emergency Command Hologram", "surname" : "The Doctor", "series" : [ "Voyager" ], "century" : 24 }
An ordered expression can be used to reactivate ordering behaviour in a subscope.
An ordered expression.
unordered {
for $captain in collection("captains")
where ordered { exists(for $movie at $i in collection("movies")
where $i eq 5
where $movie.captain eq $captain.name
return $movie) }
return $captain
}
Result (run with Zorba): { "name" : "James T. Kirk", "series" : [ "The original series" ], "century" : 23 }
This section describes JSONiq types as well as the sequence type syntax.
JSONiq follows the standard XQuery semantics for instance of expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
InstanceofExpr
An instance of expression can be used to tell whether a JSONiq value matches a given sequence type.
Instance of expression
1 instance of integer
Result (run with Zorba): true
Instance of expression
1 instance of string
Result (run with Zorba): false
Instance of expression
"foo" instance of string
Result (run with Zorba): true
Instance of expression
{ "foo" : "bar" } instance of object
Result (run with Zorba): true
Instance of expression
({ "foo" : "bar" }, { "bar" : "foo" }) instance of json-item+
Result (run with Zorba): true
Instance of expression
[ 1, 2, 3 ] instance of array?
Result (run with Zorba): true
Instance of expression
() instance of ()
Result (run with Zorba): true
JSONiq follows the standard XQuery semantics for treat expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
TreatExpr
A treat expression checks that a JSONiq value matches a given sequence type; if it does not, an error is raised.
Treat as expression
1 treat as integer
Result (run with Zorba): 1
Treat as expression
1 treat as string
Result (run with Zorba): An error was raised: "xs:integer" cannot be treated as type xs:string
Treat as expression
"foo" treat as string
Result (run with Zorba): foo
Treat as expression
{ "foo" : "bar" } treat as object
Result (run with Zorba): { "foo" : "bar" }
Treat as expression
({ "foo" : "bar" }, { "bar" : "foo" }) treat as json-item+
Result (run with Zorba): { "foo" : "bar" } { "bar" : "foo" }
Treat as expression
[ 1, 2, 3 ] treat as array?
Result (run with Zorba): [ 1, 2, 3 ]
Treat as expression
() treat as ()
Result (run with Zorba):
JSONiq follows the standard XQuery semantics for castable expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
CastableExpr
A castable expression checks whether a JSONiq value can be cast to a given atomic type and returns true or false accordingly. It can be used before actually casting to that type.
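A typical pattern (a sketch, with a hypothetical variable $input) is to guard a cast with castable, so that invalid input yields a fallback value instead of an error:
let $input := "42"
return if ($input castable as integer)
       then $input cast as integer
       else 0
This would be expected to return 42.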
Castable as expression
"1" castable as integer
Result (run with Zorba): true
Castable as expression
"foo" castable as integer
Result (run with Zorba): false
Castable as expression
"2013-04-02" castable as date
Result (run with Zorba): true
Castable as expression
() castable as date
Result (run with Zorba): false
Castable as expression
("2013-04-02", "2013-04-03") castable as date
Result (run with Zorba): false
The question mark allows for an empty sequence.
Castable as expression
() castable as date?
Result (run with Zorba): true
JSONiq follows the standard XQuery semantics for cast expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
CastExpr
A cast expression casts a JSONiq value to a given atomic type. The resulting value is annotated with this type.
Cast as expression
"1" cast as integer
Result (run with Zorba): 1
Cast as expression
"foo" cast as integer
Result (run with Zorba): An error was raised: "foo": value of type xs:string is not castable to type xs:integer
Cast as expression
"2013-04-02" cast as date
Result (run with Zorba): 2013-04-02
Cast as expression
() cast as date
Result (run with Zorba): An error was raised: empty sequence can not be cast to type with quantifier '1'
Cast as expression
("2013-04-02", "2013-04-03") cast as date
Result (run with Zorba): An error was raised: sequence of more than one item can not be cast to type with quantifier '1' or '?'
The question mark allows for an empty sequence.
Cast as expression
() cast as date?
Result (run with Zorba):
Cast as expression
"2013-04-02" cast as date?
Result (run with Zorba): 2013-04-02
JSONiq follows the standard XQuery semantics for typeswitch expressions. The following explanations, provided as an informal summary for convenience, are non-normative.
TypeswitchExpr
CaseClause
A typeswitch expression tests whether the value resulting from the first operand matches a given list of types. The expression corresponding to the first matching case is then evaluated. If there is no match, the expression in the default clause is evaluated.
Typeswitch expression
typeswitch("foo")
case integer return "integer"
case string return "string"
case object return "object"
default return "other"
Result (run with Zorba): string
In each clause, it is possible to bind the value of the first operand to a variable.
Typeswitch expression
typeswitch("foo")
case $i as integer return $i + 1
case $s as string return $s || "foo"
case $o as object return [ $o ]
default $d return $d
Result (run with Zorba): foofoo
The vertical bar can be used to allow several types in the same case clause.
Typeswitch expression
typeswitch("foo")
case $a as integer | string return { "integer or string" : $a }
case $o as object return [ $o ]
default $d return $d
Result (run with Zorba): { "integer or string" : "foo" }
An operand singleton sequence whose first item is an object or array is converted to true.
Other operand sequences cannot be converted and an error is raised.
[ 1 to 10 ]
{ "foo" || "bar" : true }
{ [ 1, 2 ] : true }
{ "foo" : 1 + 1 }
{ "foo" : (), "bar" : (1, 2) }
{ "foo" ?: (), "bar" : (1, 2) }
{| { "foo" : "bar" }, { "bar" : "foo" } |}
{| 1 |}
42
3.14
+6.022E23
"foo"
"This is a line\nand this is a new line"
"\u0001"
"This is a nested \"quote\""
true
false
null
{}
{ "foo" : "bar" }
{ "foo" : [ 1, 2, 3, 4, 5, 6 ] }
{ "foo" : true, "bar" : false }
{ "this is a key" : { "value" : "a value" } }
{ foo : "bar" }
{ foo : [ 1, 2, 3, 4, 5, 6 ] }
{ foo : "bar", bar : "foo" }
{ "but you need the quotes here" : null }
{|
for $i in 1 to 3
return { "foo" || $i : $i }
|}
[]
[ 1, 2, 3, 4, 5, 6 ]
[ "foo", 3.14, [ "Go", "Boldly", "When", "No", "Man", "Has", "Gone", "Before" ], { "foo" : "bar" }, true, false, null ]
function ($x as integer, $y as integer) as integer { $x + 2 },
function ($x) { $x + 2 }
declare function local:sum($x as integer, $y as integer) as integer
{
$x + 2
};
local:sum#2
1 * ( 2 + 3 ) + 7 idiv 2 - (-8) mod 2
date("2013-05-01") - date("2013-04-02")
(1, 2) + 3
1 + null
() + 2
"Captain" || " " || "Kirk"
"Captain" || () || "Kirk"
1 eq null, "foo" ne null, null eq null
1 lt null
1 + 1 eq 2, 1 lt 2
"foo" eq 1
() eq 1
true and ( true or not true )
1 + 1 eq 2 or 1 + 1 eq 3
boolean(())
boolean(null)
boolean("foo"), boolean("")
0 and true, not (not 1e42)
{ "foo" : "bar" } or false
( 1, 2, 3 ) or false
true or (1 div 0)
every $i in 1 to 10 satisfies $i gt 0
some $i in -5 to 5, $j in 1 to 10 satisfies $i eq $j
some $i as integer in -5 to 5, $j as integer in 1 to 10 satisfies $i eq $j
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
{ "foo" : "bar" }, [ 1 ]
1 to 10
() to 10, 1 to ()
(1, 2) to 10
()
keys({ "foo" : "bar", "bar" : "foo" })
concat("foo", "bar")
sum({ "foo" : "bar" })
let $f := function($x) { $x + 1 }
return $f(2)
let $f := function($x as integer) as integer { $x + 1 }
return $f(2)
let $f := function($x as integer, $y as integer) as integer { $x + $y }
let $g := $f(?, 2)
return $g(2)
{ "foo" : "bar" }.foo
collection("one-object").foo
({ "foo" : "bar" }, { "foo" : "bar2" }, { "bar" : "foo" }).foo
collection("captains").name
({ "foo" : "bar1" }, [ "foo", "bar" ], { "foo" : "bar2" }, "foo").foo
{ "foo bar" : "bar" }."foo bar"
{ "foobar" : "bar" }.("foo" || "bar")
{ "foobar" : "bar" }.("foo", "bar")
{ "1" : "bar" }.(1)
let $field := "foo" || "bar"
return { "foobar" : "bar" }.$field
[ "foo", "bar" ] [[2]]
{ field : [ "one", { "foo" : "bar" } ] }.field[[2]].foo
([ 1, 2, 3 ], [ 4, 5, 6 ])[[2]]
collection("captains").series[[1]]
([ 1, 2, 3 ], [ 4, 5, 6 ], { "foo" : "bar" }, true)[[3]]
[ "foo", "bar" ] [[ 1 + 1 ]]
[ "foo", "bar" ][]
([ "foo", "bar" ], { "foo" : "bar" }, true, [ 1, 2, 3 ] )[]
(1 to 10)[2]
(1 to 10)[$$ mod 2 eq 0]
if (1 + 1 eq 2) then { "foo" : "yes" } else { "foo" : "false" }
if (null) then { "foo" : "yes" } else { "foo" : "no" }
if (1) then { "foo" : "yes" } else { "foo" : "no" }
if (0) then { "foo" : "yes" } else { "foo" : "no" }
if ("foo") then { "foo" : "yes" } else { "foo" : "no" }
if ("") then { "foo" : "yes" } else { "foo" : "no" }
if (()) then { "foo" : "yes" } else { "foo" : "no" }
if (({ "foo" : "bar" }, [ 1, 2, 3, 4])) then { "foo" : "yes" } else { "foo" : "no" }
if (1+1 eq 2) then { "foo" : "yes" } else ()
switch ("foo")
case "bar" return "foo"
case "foo" return "bar"
default return "none"
switch ({ "foo" : "bar" })
case "bar" return "foo"
case "foo" return "bar"
default return "none"
switch ("no-match")
case "bar" return "foo"
case "foo" return "bar"
default return "none"
switch (2)
case 1 + 1 return "foo"
case 2 + 2 return "bar"
default return "none"
switch (true)
case 1 + 1 eq 2 return "1 + 1 is 2"
case 2 + 2 eq 5 return "2 + 2 is 5"
default return "none of the above is true"