6.2.3. Joins With Views

Linked Documents

If your map function emits an object value which has {'_id': XXX} and you query view with include_docs=true parameter, then CouchDB will fetch the document with id XXX rather than the document which was processed to emit the key/value pair.

This means that if one document contains the ids of other documents, it can cause those documents to be fetched in the view too, adjacent to the same key if required.

For example, if you have the following hierarchically-linked documents:

[
    { "_id": "11111" },
    { "_id": "22222", "ancestors": ["11111"], "value": "hello" },
    { "_id": "33333", "ancestors": ["22222","11111"], "value": "world" }
]

You can emit the values with the ancestor documents adjacent to them in the view like this:

function(doc) {
    if (doc.value) {
        emit([doc.value, 0], null);
        if (doc.ancestors) {
            for (var i in doc.ancestors) {
                emit([doc.value, Number(i)+1], {_id: doc.ancestors[i]});
            }
        }
    }
}

The result you get is:

{
    "total_rows": 5,
    "offset": 0,
    "rows": [
        {
            "id": "22222",
            "key": [
                "hello",
                0
            ],
            "value": null,
            "doc": {
                "_id": "22222",
                "_rev": "1-0eee81fecb5aa4f51e285c621271ff02",
                "ancestors": [
                    "11111"
                ],
                "value": "hello"
            }
        },
        {
            "id": "22222",
            "key": [
                "hello",
                1
            ],
            "value": {
                "_id": "11111"
            },
            "doc": {
                "_id": "11111",
                "_rev": "1-967a00dff5e02add41819138abb3284d"
            }
        },
        {
            "id": "33333",
            "key": [
                "world",
                0
            ],
            "value": null,
            "doc": {
                "_id": "33333",
                "_rev": "1-11e42b44fdb3d3784602eca7c0332a43",
                "ancestors": [
                    "22222",
                    "11111"
                ],
                "value": "world"
            }
        },
        {
            "id": "33333",
            "key": [
                "world",
                1
            ],
            "value": {
                "_id": "22222"
            },
            "doc": {
                "_id": "22222",
                "_rev": "1-0eee81fecb5aa4f51e285c621271ff02",
                "ancestors": [
                    "11111"
                ],
                "value": "hello"
            }
        },
        {
            "id": "33333",
            "key": [
                "world",
                2
            ],
            "value": {
                "_id": "11111"
            },
            "doc": {
                "_id": "11111",
                "_rev": "1-967a00dff5e02add41819138abb3284d"
            }
        }
    ]
}

which makes it very cheap to fetch a document plus all its ancestors in one query.

Note that the "id" in the row is still that of the originating document. The only difference is that include_docs fetches a different doc.

The current revision of the document is resolved at query time, not at the time the view is generated. This means that if a new revision of the linked document is added later, it will appear in view queries even though the view itself hasn’t changed. To force a specific revision of a linked document to be used, emit a "_rev" property as well as "_id".

Using View Collation

Author:Christopher Lenz
Date:2007-10-05
Source:http://www.cmlenz.net/archives/2007/10/couchdb-joins

Just today, there was a discussion on IRC on how you’d go about modeling a simple blogging system with “post” and “comment” entities, where any blog post might have N comments. If you’d be using an SQL database, you’d obviously have two tables with foreign keys and you’d be using joins. (At least until you needed to add some denormalization).

But what would the “obvious” approach in CouchDB look like?

Approach #1: Comments Inlined

A simple approach would be to have one document per blog post, and store the comments inside that document:

{
    "_id": "myslug",
    "_rev": "123456",
    "author": "john",
    "title": "My blog post",
    "content": "Bla bla bla …",
    "comments": [
        {"author": "jack", "content": "…"},
        {"author": "jane", "content": "…"}
    ]
}

Note

Of course the model of an actual blogging system would be more extensive, you’d have tags, timestamps, etc, etc. This is just to demonstrate the basics.

The obvious advantage of this approach is that the data that belongs together is stored in one place. Delete the post, and you automatically delete the corresponding comments, and so on.

You may be thinking that putting the comments inside the blog post document would not allow us to query for the comments themselves, but you’d be wrong. You could trivially write a CouchDB view that would return all comments across all blog posts, keyed by author:

function(doc) {
    for (var i in doc.comments) {
        emit(doc.comments[i].author, doc.comments[i].content);
    }
}

Now you could list all comments by a particular user by invoking the view and passing it a ?key="username" query string parameter.

However, this approach has a drawback that can be quite significant for many applications: To add a comment to a post, you need to:

  • Fetch the blog post document
  • Add the new comment to the JSON structure
  • Send the updated document to the server

Now if you have multiple client processes adding comments at roughly the same time, some of them will get a HTTP 409 Conflict error on step 3 (that’s optimistic concurrency in action). For some applications this makes sense, but in many other apps, you’d want to append new related data regardless of whether other data has been added in the meantime.

The only way to allow non-conflicting addition of related data is by putting that related data into separate documents.

Approach #2: Comments Separate

Using this approach you’d have one document per blog post, and one document per comment. The comment documents would have a “backlink” to the post they belong to.

The blog post document would look similar to the above, minus the comments property. Also, we’d now have a type property on all our documents so that we can tell the difference between posts and comments:

{
    "_id": "myslug",
    "_rev": "123456",
    "type": "post",
    "author": "john",
    "title": "My blog post",
    "content": "Bla bla bla …"
}

The comments themselves are stored in separate documents, which also have a type property (this time with the value “comment”), and additionally feature a post property containing the ID of the post document they belong to:

{
    "_id": "ABCDEF",
    "_rev": "123456",
    "type": "comment",
    "post": "myslug",
    "author": "jack",
    "content": "…"
}
{
    "_id": "DEFABC",
    "_rev": "123456",
    "type": "comment",
    "post": "myslug",
    "author": "jane",
    "content": "…"
}

To list all comments per blog post, you’d add a simple view, keyed by blog post ID:

function(doc) {
    if (doc.type == "comment") {
        emit(doc.post, {author: doc.author, content: doc.content});
    }
}

And you’d invoke that view passing it a ?key="post_id" query string parameter.

Viewing all comments by author is just as easy as before:

function(doc) {
    if (doc.type == "comment") {
        emit(doc.author, {post: doc.post, content: doc.content});
    }
}

So this is better in some ways, but it also has a disadvantage. Imagine you want to display a blog post with all the associated comments on the same web page. With our first approach, we needed just a single request to the CouchDB server, namely a GET request to the document. With this second approach, we need two requests: a GET request to the post document, and a GET request to the view that returns all comments for the post.

That is okay, but not quite satisfactory. Just imagine you wanted to add threaded comments: you’d now need an additional fetch per comment. What we’d probably want then would be a way to join the blog post and the various comments together to be able to retrieve them with a single HTTP request.

This was when Damien Katz, the author of CouchDB, chimed in to the discussion on IRC to show us the way.

Optimization: Using the Power of View Collation

Obvious to Damien, but not at all obvious to the rest of us: it’s fairly simple to make a view that includes both the content of the blog post document, and the content of all the comments associated with that post. The way you do that is by using complex keys. Until now we’ve been using simple string values for the view keys, but in fact they can be arbitrary JSON values, so let’s make some use of that:

function(doc) {
    if (doc.type == "post") {
        emit([doc._id, 0], doc);
    } else if (doc.type == "comment") {
        emit([doc.post, 1], doc);
    }
}

Okay, this may be confusing at first. Let’s take a step back and look at what views in CouchDB are really about.

CouchDB views are basically highly efficient on-disk dictionaries that map keys to values, where the key is automatically indexed and can be used to filter and/or sort the results you get back from your views. When you “invoke” a view, you can say that you’re only interested in a subset of the view rows by specifying a ?key=foo query string parameter. Or you can specify ?startkey=foo and/or ?endkey=bar query string parameters to fetch rows over a range of keys.

It’s also important to note that keys are always used for collating (i.e. sorting) the rows. CouchDB has well defined (but as of yet undocumented) rules for comparing arbitrary JSON objects for collation. For example, the JSON value ["foo", 2] is sorted after (considered “greater than”) the values ["foo"] or ["foo", 1, "bar"], but before e.g. ["foo", 2, "bar"]. This feature enables a whole class of tricks that are rather non-obvious...

See also

Views Collation

With that in mind, let’s return to the view function above. First note that, unlike the previous view functions we’ve used here, this view handles both “post” and “comment” documents, and both of them end up as rows in the same view. Also, the key in this view is not just a simple string, but an array. The first element in that array is always the ID of the post, regardless of whether we’re processing an actual post document, or a comment associated with a post. The second element is 0 for post documents, and 1 for comment documents.

Let’s assume we have two blog posts in our database. Without limiting the view results via key, startkey, or endkey, we’d get back something like the following:

{
    "total_rows": 5, "offset": 0, "rows": [{
            "id": "myslug",
            "key": ["myslug", 0],
            "value": {...}
        }, {
            "id": "ABCDEF",
            "key": ["myslug", 1],
            "value": {...}
        }, {
            "id": "DEFABC",
            "key": ["myslug", 1],
            "value": {...}
        }, {
            "id": "other_slug",
            "key": ["other_slug", 0],
            "value": {...}
        }, {
            "id": "CDEFAB",
            "key": ["other_slug", 1],
            "value": {...}
        },
    ]
}

Note

The ... placeholders here would contain the complete JSON encoding of the corresponding documents

Now, to get a specific blog post and all associated comments, we’d invoke that view with the query string:

?startkey=["myslug"]&endkey;=["myslug", 2]

We’d get back the first three rows, those that belong to the myslug post, but not the others. Et voila, we now have the data we need to display a post with all associated comments, retrieved via a single GET request.

You may be asking what the 0 and 1 parts of the keys are for. They’re simply to ensure that the post document is always sorted before the the associated comment documents. So when you get back the results from this view for a specific post, you’ll know that the first row contains the data for the blog post itself, and the remaining rows contain the comment data.

One remaining problem with this model is that comments are not ordered, but that’s simply because we don’t have date/time information associated with them. If we had, we’d add the timestamp as third element of the key array, probably as ISO date/time strings. Now we would continue using the query string ?startkey=["myslug"]&endkey=["myslug", 2] to fetch the blog post and all associated comments, only now they’d be in chronological order.