Migration from ArangoDB 2.8 to 3.0
Problem
I want to use ArangoDB 3.0 from now on but I still have data in ArangoDB 2.8. I need to migrate my data. I am running an ArangoDB 3.0 cluster (and possibly a cluster with ArangoDB 2.8 as well).
Solution
The internal data format changed completely from ArangoDB 2.8 to 3.0, so you have to dump all data using arangodump and then restore it to the new ArangoDB instance using arangorestore.
General instructions for this procedure can be found in the manual. Here, we cover some additional details about the cluster case.
Dumping the data in ArangoDB 2.8
Basically, dumping the data works with the following command (use arangodump from your ArangoDB 2.8 distribution!):
arangodump --server.endpoint tcp://localhost:8530 --output-directory dump
or a variation of it. For details, see the manual page mentioned above and this section.
If your ArangoDB 2.8 instance is a cluster, simply use one of the coordinator endpoints as the --server.endpoint above.
Restoring the data in ArangoDB 3.0
The output consists of JSON files in the output directory, two for each collection: one for the structure and one for the data. The data format is 100% compatible with ArangoDB 3.0, except that ArangoDB 3.0 has an additional option in the structure files for synchronous replication, namely the attribute replicationFactor, which specifies how many copies of the data for each shard are kept in the cluster.
Therefore, you can simply use this command (use the arangorestore from your ArangoDB 3.0 distribution!):
arangorestore --server.endpoint tcp://localhost:8530 --input-directory dump
to import your data into your new ArangoDB 3.0 instance. See this page for details on the available command line options. If your ArangoDB 3.0 instance is a cluster, then simply use one of the coordinators as --server.endpoint.
That is it, your data is migrated.
Controlling the number of shards and the replication factor
This procedure works for all four combinations of single server and cluster as source and destination. If the target is a single server, everything simply works.
It remains to explain how to control the number of shards and the replication factor if the destination is a cluster.
If the source was a cluster, arangorestore will use the same number of shards as before unless you tell it otherwise. Since ArangoDB 2.8 does not have synchronous replication, it does not produce dumps with the replicationFactor attribute, so arangorestore will use replication factor 1 for all collections. If the source was a single server, the same happens, and in addition arangorestore will always create collections with just a single shard.
There are essentially 3 ways to change this behaviour:
- The first is to create the collections explicitly on the ArangoDB 3.0 cluster and then pass the --create-collection false flag to arangorestore. In this case you can control the number of shards and the replication factor for each collection individually when you create them.
- The second is to use arangorestore's options --default-number-of-shards and --default-replication-factor (this option was introduced in version 3.0.2) to specify default values, which are used if the dump files do not specify numbers. This means that all such restored collections will have the same number of shards and replication factor.
- If you need more control, you can edit the structure files in the dump. They are plain JSON files, so you can first run them through a JSON pretty printer to make editing easier. For the replication factor, add a replicationFactor attribute with a numerical value to the parameters subobject. For the number of shards, locate the shards subattribute of the parameters attribute and edit it so that it has the right number of attributes. The actual names of the attributes as well as their values do not matter. Alternatively, add a numberOfShards attribute to the parameters subobject; this overrides the shards attribute (this possibility was introduced in version 3.0.2).
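Editing the structure files by hand becomes tedious with many collections. The following is a minimal Python sketch of the third approach, not an official tool. It assumes that the structure files are named <collection>.structure.json and that each contains a parameters subobject, as described above; the function name set_sharding and the layout of the settings argument are made up for this illustration.

```python
import json
from pathlib import Path

STRUCTURE_SUFFIX = ".structure.json"  # assumed naming scheme of arangodump
ALLOWED = {"replicationFactor", "numberOfShards"}


def set_sharding(dump_dir, settings):
    """Write sharding attributes into the structure files of a dump.

    settings maps a collection name to a dict such as
    {"replicationFactor": 2, "numberOfShards": 3}.
    Returns the sorted names of the collections that were rewritten.
    """
    changed = []
    for path in sorted(Path(dump_dir).glob("*" + STRUCTURE_SUFFIX)):
        name = path.name[: -len(STRUCTURE_SUFFIX)]
        wanted = settings.get(name)
        if not wanted:
            continue  # leave collections without settings untouched
        unknown = set(wanted) - ALLOWED
        if unknown:
            raise ValueError(f"unsupported attributes for {name}: {sorted(unknown)}")
        structure = json.loads(path.read_text(encoding="utf-8"))
        # Both attributes live in the "parameters" subobject.
        structure.setdefault("parameters", {}).update(wanted)
        # Pretty-print the result so later manual edits are easier.
        path.write_text(json.dumps(structure, indent=2), encoding="utf-8")
        changed.append(name)
    return changed
```

For example, set_sharding("dump", {"users": {"replicationFactor": 2, "numberOfShards": 3}}) would rewrite only dump/users.structure.json. Keep in mind that numberOfShards is only honoured by arangorestore 3.0.2 or later; with older versions you have to edit the shards attribute instead.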
Note that you can remove individual collections from your dump by deleting their pair of structure and data files in the dump directory. In this way you can restore your data in several steps or even parallelise the restore operation by running multiple arangorestore processes concurrently on different dump directories. You should consider using different coordinators for the different arangorestore processes in this case.
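One possible way to prepare such a parallel restore is sketched below in Python. It is an illustration under assumptions rather than a tested procedure: it assumes the file names <collection>.structure.json and <collection>.data.json, and the helper names split_dump and restore_commands are hypothetical. The sketch copies each collection's file pair into one of several directories in round-robin order and builds one arangorestore command line per directory, using only the options shown earlier.

```python
import shutil
from pathlib import Path

SUFFIXES = (".structure.json", ".data.json")  # assumed naming scheme


def collection_name(path):
    """Return the collection a dump file belongs to, or None for other files."""
    for suffix in SUFFIXES:
        if path.name.endswith(suffix):
            return path.name[: -len(suffix)]
    return None


def split_dump(dump_dir, target_dirs):
    """Copy each collection's file pair into one target directory, round-robin.

    Returns a dict mapping collection name to the directory it was copied to.
    """
    dump_dir = Path(dump_dir)
    targets = [Path(t) for t in target_dirs]
    for target in targets:
        target.mkdir(parents=True, exist_ok=True)
    names = sorted({n for n in map(collection_name, dump_dir.iterdir()) if n})
    layout = {}
    for index, name in enumerate(names):
        target = targets[index % len(targets)]
        for suffix in SUFFIXES:
            source = dump_dir / (name + suffix)
            if source.exists():
                shutil.copy2(source, target / source.name)
        layout[name] = target
    return layout


def restore_commands(target_dirs, coordinators):
    """Build one arangorestore argument list per directory and coordinator."""
    return [
        ["arangorestore", "--server.endpoint", endpoint,
         "--input-directory", str(directory)]
        for directory, endpoint in zip(target_dirs, coordinators)
    ]
```

Each returned argument list could then be started with subprocess.Popen and waited on, so that the restores run concurrently against different coordinators. The endpoints passed in are placeholders for your own coordinator addresses.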
All these possibilities together give you full control over the sharding layout of your data in the new ArangoDB 3.0 cluster.