Importing data

Problem

I want to import data from a file into ArangoDB.

Solution

ArangoDB comes with a command-line utility named arangoimp. This utility can import JSON-encoded, CSV, and tab-separated files into ArangoDB.

arangoimp needs to be invoked from the command line once for each import file. The target collection can either exist already or be created by the import run.

Importing JSON-encoded data

Input formats

There are two supported input formats for importing JSON-encoded data into ArangoDB:

  • line-by-line format: Expects each line in the input file to be a valid JSON object. Line breaks must not occur within a JSON object.

  • array format: Expects a file containing a single array of JSON objects. Whitespace is allowed for formatting inside the JSON array and the JSON objects.

Here's an example of the line-by-line format:

{"author":"Frank Celler","time":"2011-10-26 08:42:49 +0200","sha":"c413859392a45873936cbe40797970f8eed93ff9","message":"first commit","user":"f.celler"}
{"author":"Frank Celler","time":"2011-10-26 21:32:36 +0200","sha":"10bb77b8cc839201ff59a778f0c740994083c96e","message":"initial release","user":"f.celler"}
...

Here's an example for the same data in array format:

[
  {
    "author": "Frank Celler",
    "time": "2011-10-26 08:42:49 +0200",
    "sha": "c413859392a45873936cbe40797970f8eed93ff9",
    "message": "first commit",
    "user": "f.celler"
  },
  {
    "author": "Frank Celler",
    "time": "2011-10-26 21:32:36 +0200",
    "sha": "10bb77b8cc839201ff59a778f0c740994083c96e",
    "message": "initial release",
    "user": "f.celler"
  },
  ...
]
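
Before importing, it can be useful to check that a file really follows the line-by-line format. The following is a minimal Python sketch, not part of arangoimp: the function name check_line_by_line and the sample data are made up for illustration, and skipping blank lines is an assumption of this sketch rather than documented arangoimp behavior.

```python
import json

def check_line_by_line(text):
    """Return (line_number, problem) pairs for lines that are not JSON objects."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption of this sketch: blank lines are ignored
        try:
            value = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, "invalid JSON: " + err.msg))
            continue
        if not isinstance(value, dict):
            problems.append((number, "not a JSON object"))
    return problems

# Hypothetical sample: line 1 is fine, line 2 is an array, line 3 is cut off.
sample = '{"author": "Frank Celler", "message": "first commit"}\n[1, 2, 3]\n{"broken": \n'
for number, problem in check_line_by_line(sample):
    print(number, problem)
```

An object that was pretty-printed across several lines shows up here as a run of invalid lines, which is the usual sign that the file is in array format or needs to be flattened first.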

Importing JSON data in line-by-line format

An example data file in line-by-line format can be downloaded here. The example file contains all the commits to the ArangoDB repository as shown by git log --reverse.

The following commands will import the data from the file into a collection named commits:

# download file
wget http://jsteemann.github.io/downloads/code/git-commits-single-line.json

# actually import data
arangoimp --file git-commits-single-line.json --collection commits --create-collection true

Note that no file type has been specified when arangoimp was invoked. This is because json is its default input format.

The other parameters used have the following meanings:

  • file: input filename
  • collection: name of the target collection
  • create-collection: whether or not the collection should be created if it does not exist

The result of the import printed by arangoimp should be:

created:          20039
warnings/errors:  0
total:            20039

The collection commits should now contain the example commit data as present in the input file.

Importing JSON data in array format

An example input file for the array format can be found here.

The command for importing JSON data in array format is similar to what we've done before:

# download file
wget http://jsteemann.github.io/downloads/code/git-commits-array.json

# actually import data
arangoimp --file git-commits-array.json --collection commits --create-collection true

Though the import command is the same (except the filename), there is a notable difference between the two JSON formats: for the array format, arangoimp will read and parse the JSON in its entirety before it sends any data to the ArangoDB server. That means the whole input file must fit into arangoimp's buffer. By default, arangoimp will allocate a 16 MiB internal buffer, and input files bigger than that will be rejected with the following message:

import file is too big. please increase the value of --batch-size (currently 16777216).

So for JSON input files in array format it might be necessary to increase the value of --batch-size in order to have the file imported. Alternatively, the input file can be converted to line-by-line format manually.
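
The manual conversion to line-by-line format can be sketched in a few lines of Python. This is an illustration only: array_to_lines is a hypothetical helper, not an arangoimp feature, and it loads the whole array into memory, so very large files would need a streaming JSON parser instead.

```python
import json

def array_to_lines(array_text):
    """Convert a JSON array of objects into line-by-line format."""
    documents = json.loads(array_text)
    if not isinstance(documents, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without indent emits no raw line breaks; a newline inside a
    # string value is written as the two-character escape \n.
    return "".join(json.dumps(doc, ensure_ascii=False) + "\n" for doc in documents)

array_text = """[
  {"author": "Frank Celler", "message": "first commit"},
  {"author": "Frank Celler", "message": "initial release"}
]"""
print(array_to_lines(array_text), end="")
```

For real files, the same function can be applied to the contents of the array file and the result written to a new file, which can then be imported without raising --batch-size.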

Importing CSV data

Data can also be imported from a CSV file. An example file can be found here.

The --type parameter for the import command must now be set to csv:

# download file
wget http://jsteemann.github.io/downloads/code/git-commits.csv

# actually import data
arangoimp --file git-commits.csv --type csv --collection commits --create-collection true

For the CSV import, the first line in the input file has a special meaning: every value listed in the first line will be treated as an attribute name for the values in all following lines. All following lines should also have the same number of "columns".

"Columns" inside the CSV input file can be left empty, though. If a "column" is left empty in a line, its value is omitted from the import, so the respective attribute will not be set in the imported document.

Note that values from the input file that are enclosed in double quotes will always be imported as strings. To import numeric values, boolean values or the null value, don't enclose these values in quotes in the input file. Leading zeros in numeric values will be removed, so a number with leading zeros is only preserved when it is put into a quoted string.

Here is an example CSV file:

"author","time","sha","message"
"Frank Celler","2011-10-26 08:42:49 +0200","c413859392a45873936cbe40797970f8eed93ff9","first commit"
"Frank Celler","2011-10-26 21:32:36 +0200","10bb77b8cc839201ff59a778f0c740994083c96e","initial release"
...
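
The rules for empty, quoted and unquoted values described above can be illustrated with a small Python sketch. It only mimics the described behavior and is not arangoimp's actual parser: interpret_csv_field, the extra columns and the sample row are hypothetical, and treating any other unquoted text as a string is an assumption of this sketch.

```python
import re

def interpret_csv_field(raw):
    """Map one raw CSV field to (keep, value) following the rules in the text."""
    if raw == "":
        return (False, None)  # empty "column": the attribute is not set
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return (True, raw[1:-1].replace('""', '"'))  # quoted: always a string
    if raw == "true":
        return (True, True)
    if raw == "false":
        return (True, False)
    if raw == "null":
        return (True, None)
    if re.fullmatch(r"-?\d+", raw):
        return (True, int(raw))  # int() drops leading zeros: 0042 becomes 42
    if re.fullmatch(r"-?\d+(\.\d+)?([eE][+-]?\d+)?", raw):
        return (True, float(raw))
    return (True, raw)  # assumption: other unquoted text stays a string

# Hypothetical header and data row, given as raw fields including their quotes.
header = ["author", "commits", "code", "active", "note"]
row = ['"Frank Celler"', "0042", '"0042"', "true", ""]

document = {}
for name, raw in zip(header, row):
    keep, value = interpret_csv_field(raw)
    if keep:
        document[name] = value
print(document)
```

In this sketch the unquoted 0042 ends up as the number 42, the quoted "0042" keeps its leading zeros as a string, and the empty note column does not appear in the resulting document at all.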

arangoimp supports Windows (CRLF) and Unix (LF) line breaks. Line breaks might also occur inside values that are enclosed with the quote character.

The default separator for CSV files is the comma. It can be changed using the --separator parameter when invoking arangoimp. The quote character defaults to the double quote ("). To use a literal double quote inside a "column" in the import data, use two double quotes. To change the quote character, use the --quote parameter. To use a backslash for escaping quote characters, please set the option --backslash-escape to true.
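
When generating an import file programmatically, the default conventions (comma separator, double quote as quote character, doubled quotes for literal quotes) match what Python's standard csv module produces. The following sketch assumes made-up sample rows; it writes the data to an in-memory buffer instead of a file to stay self-contained.

```python
import csv
import io

# Hypothetical rows; the second value contains literal double quotes.
rows = [
    ["author", "message"],
    ["Frank Celler", 'said "first commit"'],
]

buffer = io.StringIO()
# QUOTE_ALL wraps every value in the quote character. doublequote=True (the csv
# module's default) writes a literal " inside a value as "".
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_ALL,
                    doublequote=True, lineterminator="\n")
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text, end="")
```

Because QUOTE_ALL quotes every value, all attributes written this way would be imported as strings. For columns that should arrive as numbers, booleans or null, csv.QUOTE_MINIMAL or csv.QUOTE_NONNUMERIC may be a better fit.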

Changing the database and server endpoint

By default, arangoimp will connect to the default database on 127.0.0.1:8529 with a user named root. To change this, use the following parameters:

  • server.database: name of the database to use when importing (default: _system)
  • server.endpoint: address of the ArangoDB server (default: tcp://127.0.0.1:8529)

Using authentication

By default, arangoimp will send the username root and an empty password to the ArangoDB server. This is ArangoDB's default configuration, and it should be changed. To make arangoimp use a different username or password, the following command-line arguments can be used:

  • server.username: username, used if authentication is enabled on server
  • server.password: password for user, used if authentication is enabled on server

The password argument can also be omitted in order to avoid having it saved in the shell's command-line history. When specifying a username but omitting the password parameter, arangoimp will prompt for a password.
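
When imports are scripted, the connection and authentication parameters can be assembled into an argument list. The following Python sketch assumes that the parameters listed above are passed with a leading --, in the same way as --file and --collection. The helper build_import_command and the database name, endpoint address and username are hypothetical. The password is deliberately left out so that arangoimp prompts for it.

```python
import shlex

def build_import_command(file, collection, database="_system",
                         endpoint="tcp://127.0.0.1:8529", username="root"):
    """Return an arangoimp argument list without a password argument."""
    return [
        "arangoimp",
        "--file", file,
        "--collection", collection,
        "--server.database", database,
        "--server.endpoint", endpoint,
        "--server.username", username,
    ]

# Hypothetical database, endpoint and user; replace with real values.
args = build_import_command("git-commits-single-line.json", "commits",
                            database="mydb",
                            endpoint="tcp://192.168.1.10:8529",
                            username="importer")
print(shlex.join(args))
```

The sketch only prints the command. Passing the list to subprocess.run(args) from an interactive terminal would start the import and leave the password prompt to arangoimp.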

Additional parameters

By default, arangoimp will import data into the specified collection but will not touch existing data. Often it is convenient to first remove all data from a collection and then run the import. arangoimp supports this with the optional --overwrite flag. When it is set to true, all documents in the collection will be removed prior to the import.

Author: Jan Steemann

Tags: #arangoimp #import