6.2.3. Joins With Views¶
Linked Documents¶
If your map function emits an object value which has
{'_id': XXX}
and you query view with
include_docs=true
parameter, then CouchDB will fetch the document with id
XXX
rather than the document which was processed to emit the key/value pair.
This means that if one document contains the ids of other documents, it can cause those documents to be fetched in the view too, adjacent to the same key if required.
For example, if you have the following hierarchically-linked documents:
[
{ "_id": "11111" },
{ "_id": "22222", "ancestors": ["11111"], "value": "hello" },
{ "_id": "33333", "ancestors": ["22222","11111"], "value": "world" }
]
You can emit the values with the ancestor documents adjacent to them in the view like this:
function(doc) {
if (doc.value) {
emit([doc.value, 0], null);
if (doc.ancestors) {
for (var i in doc.ancestors) {
emit([doc.value, Number(i)+1], {_id: doc.ancestors[i]});
}
}
}
}
The result you get is:
{
"total_rows": 5,
"offset": 0,
"rows": [
{
"id": "22222",
"key": [
"hello",
0
],
"value": null,
"doc": {
"_id": "22222",
"_rev": "1-0eee81fecb5aa4f51e285c621271ff02",
"ancestors": [
"11111"
],
"value": "hello"
}
},
{
"id": "22222",
"key": [
"hello",
1
],
"value": {
"_id": "11111"
},
"doc": {
"_id": "11111",
"_rev": "1-967a00dff5e02add41819138abb3284d"
}
},
{
"id": "33333",
"key": [
"world",
0
],
"value": null,
"doc": {
"_id": "33333",
"_rev": "1-11e42b44fdb3d3784602eca7c0332a43",
"ancestors": [
"22222",
"11111"
],
"value": "world"
}
},
{
"id": "33333",
"key": [
"world",
1
],
"value": {
"_id": "22222"
},
"doc": {
"_id": "22222",
"_rev": "1-0eee81fecb5aa4f51e285c621271ff02",
"ancestors": [
"11111"
],
"value": "hello"
}
},
{
"id": "33333",
"key": [
"world",
2
],
"value": {
"_id": "11111"
},
"doc": {
"_id": "11111",
"_rev": "1-967a00dff5e02add41819138abb3284d"
}
}
]
}
which makes it very cheap to fetch a document plus all its ancestors in one query.
Note that the "id"
in the row is still that of the originating document.
The only difference is that include_docs
fetches a different doc.
The current revision of the document is resolved at query time, not at the time
the view is generated. This means that if a new revision of the linked document
is added later, it will appear in view queries even though the view itself
hasn’t changed. To force a specific revision of a linked document to be used,
emit a "_rev"
property as well as "_id"
.
Using View Collation¶
Author: | Christopher Lenz |
---|---|
Date: | 2007-10-05 |
Source: | http://www.cmlenz.net/archives/2007/10/couchdb-joins |
Just today, there was a discussion on IRC on how you’d go about modeling a simple blogging system with “post” and “comment” entities, where any blog post might have N comments. If you’d be using an SQL database, you’d obviously have two tables with foreign keys and you’d be using joins. (At least until you needed to add some denormalization).
But what would the “obvious” approach in CouchDB look like?
Approach #1: Comments Inlined¶
A simple approach would be to have one document per blog post, and store the comments inside that document:
{
"_id": "myslug",
"_rev": "123456",
"author": "john",
"title": "My blog post",
"content": "Bla bla bla …",
"comments": [
{"author": "jack", "content": "…"},
{"author": "jane", "content": "…"}
]
}
Note
Of course the model of an actual blogging system would be more extensive, you’d have tags, timestamps, etc, etc. This is just to demonstrate the basics.
The obvious advantage of this approach is that the data that belongs together is stored in one place. Delete the post, and you automatically delete the corresponding comments, and so on.
You may be thinking that putting the comments inside the blog post document would not allow us to query for the comments themselves, but you’d be wrong. You could trivially write a CouchDB view that would return all comments across all blog posts, keyed by author:
function(doc) {
for (var i in doc.comments) {
emit(doc.comments[i].author, doc.comments[i].content);
}
}
Now you could list all comments by a particular user by invoking the view and
passing it a ?key="username"
query string parameter.
However, this approach has a drawback that can be quite significant for many applications: To add a comment to a post, you need to:
- Fetch the blog post document
- Add the new comment to the JSON structure
- Send the updated document to the server
Now if you have multiple client processes adding comments at roughly the same time, some of them will get a HTTP 409 Conflict error on step 3 (that’s optimistic concurrency in action). For some applications this makes sense, but in many other apps, you’d want to append new related data regardless of whether other data has been added in the meantime.
The only way to allow non-conflicting addition of related data is by putting that related data into separate documents.
Approach #2: Comments Separate¶
Using this approach you’d have one document per blog post, and one document per comment. The comment documents would have a “backlink” to the post they belong to.
The blog post document would look similar to the above, minus the comments property. Also, we’d now have a type property on all our documents so that we can tell the difference between posts and comments:
{
"_id": "myslug",
"_rev": "123456",
"type": "post",
"author": "john",
"title": "My blog post",
"content": "Bla bla bla …"
}
The comments themselves are stored in separate documents, which also have a type property (this time with the value “comment”), and additionally feature a post property containing the ID of the post document they belong to:
{
"_id": "ABCDEF",
"_rev": "123456",
"type": "comment",
"post": "myslug",
"author": "jack",
"content": "…"
}
{
"_id": "DEFABC",
"_rev": "123456",
"type": "comment",
"post": "myslug",
"author": "jane",
"content": "…"
}
To list all comments per blog post, you’d add a simple view, keyed by blog post ID:
function(doc) {
if (doc.type == "comment") {
emit(doc.post, {author: doc.author, content: doc.content});
}
}
And you’d invoke that view passing it a ?key="post_id"
query string
parameter.
Viewing all comments by author is just as easy as before:
function(doc) {
if (doc.type == "comment") {
emit(doc.author, {post: doc.post, content: doc.content});
}
}
So this is better in some ways, but it also has a disadvantage.
Imagine you want to display a blog post with all the associated comments on the
same web page. With our first approach, we needed just a single request to the
CouchDB server, namely a GET
request to the document. With this second
approach, we need two requests: a GET
request to the post document, and a
GET
request to the view that returns all comments for the post.
That is okay, but not quite satisfactory. Just imagine you wanted to add threaded comments: you’d now need an additional fetch per comment. What we’d probably want then would be a way to join the blog post and the various comments together to be able to retrieve them with a single HTTP request.
This was when Damien Katz, the author of CouchDB, chimed in to the discussion on IRC to show us the way.
Optimization: Using the Power of View Collation¶
Obvious to Damien, but not at all obvious to the rest of us: it’s fairly simple to make a view that includes both the content of the blog post document, and the content of all the comments associated with that post. The way you do that is by using complex keys. Until now we’ve been using simple string values for the view keys, but in fact they can be arbitrary JSON values, so let’s make some use of that:
function(doc) {
if (doc.type == "post") {
emit([doc._id, 0], doc);
} else if (doc.type == "comment") {
emit([doc.post, 1], doc);
}
}
Okay, this may be confusing at first. Let’s take a step back and look at what views in CouchDB are really about.
CouchDB views are basically highly efficient on-disk dictionaries that map keys
to values, where the key is automatically indexed and can be used to filter
and/or sort the results you get back from your views. When you “invoke” a view,
you can say that you’re only interested in a subset of the view rows by
specifying a ?key=foo
query string parameter. Or you can specify
?startkey=foo
and/or ?endkey=bar
query string parameters to fetch rows
over a range of keys.
It’s also important to note that keys are always used for collating (i.e.
sorting) the rows. CouchDB has well defined (but as of yet undocumented) rules
for comparing arbitrary JSON objects for collation. For example, the JSON value
["foo", 2]
is sorted after (considered “greater than”) the values
["foo"]
or ["foo", 1, "bar"]
, but before e.g. ["foo", 2, "bar"]
.
This feature enables a whole class of tricks that are rather non-obvious...
See also
With that in mind, let’s return to the view function above. First note that, unlike the previous view functions we’ve used here, this view handles both “post” and “comment” documents, and both of them end up as rows in the same view. Also, the key in this view is not just a simple string, but an array. The first element in that array is always the ID of the post, regardless of whether we’re processing an actual post document, or a comment associated with a post. The second element is 0 for post documents, and 1 for comment documents.
Let’s assume we have two blog posts in our database. Without limiting the view
results via key
, startkey
, or endkey
, we’d get back something like
the following:
{
"total_rows": 5, "offset": 0, "rows": [{
"id": "myslug",
"key": ["myslug", 0],
"value": {...}
}, {
"id": "ABCDEF",
"key": ["myslug", 1],
"value": {...}
}, {
"id": "DEFABC",
"key": ["myslug", 1],
"value": {...}
}, {
"id": "other_slug",
"key": ["other_slug", 0],
"value": {...}
}, {
"id": "CDEFAB",
"key": ["other_slug", 1],
"value": {...}
},
]
}
Note
The ...
placeholders here would contain the complete JSON encoding of the
corresponding documents
Now, to get a specific blog post and all associated comments, we’d invoke that view with the query string:
?startkey=["myslug"]&endkey;=["myslug", 2]
We’d get back the first three rows, those that belong to the myslug
post,
but not the others. Et voila, we now have the data we need to display a post
with all associated comments, retrieved via a single GET
request.
You may be asking what the 0 and 1 parts of the keys are for. They’re simply to ensure that the post document is always sorted before the the associated comment documents. So when you get back the results from this view for a specific post, you’ll know that the first row contains the data for the blog post itself, and the remaining rows contain the comment data.
One remaining problem with this model is that comments are not ordered, but
that’s simply because we don’t have date/time information associated with them.
If we had, we’d add the timestamp as third element of the key array, probably
as ISO date/time strings. Now we would continue using the query string
?startkey=["myslug"]&endkey=["myslug", 2]
to fetch the blog post and all
associated comments, only now they’d be in chronological order.