Finding Your Data With Views

What is a View?

We mentioned in the introduction that views are good for many things. Let’s go through the different use-cases. First: Extracting data that you might need for a special purpose in a specific order. For the front page we want a list of blog post titles sorted by date. We’ll work with a set of example documents as we walk though how views work. These are abridged versions of the documents we been used in the previous chapter, but they really could be the same.

Example: Example Documents

{
  "_id":"biking",
  "_rev":"AE19EBC7654",

  "title":"Biking",
  "body":"My biggest hobby is mountainbiking. The other day...",
  "date":"2009/01/30 18:04:11"
}

{
  "_id":"bought-a-cat",
  "_rev":"4A3BBEE711",

  "title":"Bought a Cat",
  "body":"I went to the the pet store earlier and brought home a little kitty...",
  "date":"2009/02/17 21:13:39"
}

{
  "_id":"hello-world",
  "_rev":"43FBA4E7AB",

  "title":"Hello World",
  "body":"Well hello and welcome to my new blog...",
  "date":"2009/01/15 15:52:20"
}

Three will do for the example. Note that the documents are sorted by "_id", which is how they are stored in the database. Now we define a view. Chapter 2: Getting Started showed you how to create a view in Futon, the CouchDB administration client. If you can’t remember how to do it, go back to page XY. Bear with us without an explanation while we show you some code.

Example: A Basic Map Function

function(doc) {
  if(doc.date && doc.title) {
    emit(doc.date, doc.title);
  }
}

This is a map function and it is written in JavaScript. If you are not familiar with JavaScript but have used C or any other C-like language such as Java, PHP or C#, this should look familiar. It is a simple function definition.

You provide CouchDB with view functions as strings stored inside the views field of a design document. You don’t run it yourself. Instead, when you query your view, CouchDB takes the source code and runs it for you on every document in the database your view was defined in. You query your view to retrieve the view result.

All map functions have a single parameter doc. This is a single document in your database. Our map function checks if our document has a date and a title attribute — luckily all of our documents have them — and then calls the built-in emit() function with these two attributes as arguments.

The emit() function always takes two arguments: The first is key and and the second is value. The emit(key, value) function creates an entry in our view result. One more thing, and we are not doing that just yet, but the emit() function can be called multiple times in the map function to create multiple entries in the view results from a single document.

Example: View Results

| key                  | value
|----------------------------------------------
|"2009/01/15 15:52:20" | "Hello World"
|"2009/01/30 18:04:11" | "Biking"
|"2009/02/17 21:13:39" | "Bought a Cat"

CouchDB takes whatever you pass into the emit() function and puts it into a list. Each row in that list includes they key and our value. More importantly, the list is sorted by key, by doc.date in our case. The most important feature of a view result, is that it is sorted by key. We will come back to that over and over again to do neat things. Stay tuned.

If you read carefully over the last few paragraphs, one clause stands out: “when you query your view, CouchDB takes the source code and runs it for you on every document in the database”. If you have a lot of documents, that takes quite a bit of time and you might wonder if it is not horribly inefficient to do this. Yes it would be, but CouchDB is designed to avoid any extra costs: it only runs through all documents once, when you first query your view. If a document is changed, the map function is only run once, to recompute the keys and values for that single document.

The view result is stored in a B-tree, just like the structure which is responsible for holding your documents. View B-trees are stored in their own file, so that for high-performance CouchDB usage, you can keep views on their own disk. The B-tree provides very fast lookups of rows by key, as well as efficient streaming of rows in a key range. In our example, a single view can answer all questions that involve time: “Give me all the blog posts from last week” or “last month” or “this year”. Pretty neat. Read more about how CouchDB’s B-trees work in Appendix G: The Power of B-Trees.

When we query our view, we get back a list of all documents sorted by date, each row also includes the post title so we can construct links to posts. The listing/figure above is just a graphical representation of the view result. The actual result is JSON-encoded, and contains a little more metadata.

Example: Actual View Result

Now, we lied again, the actual result is not as nicely formatted and doesn’t include any superfluous whitespace or newlines, but this is better for you (and us!) to read and understand. And hey, where does that "id" member in the result rows comes from, that wasn’t there before. Well spotted again, we omitted this earlier to avoid confusion. CouchDB automatically includes the document id of the document that created the entry in the view result. We’ll use this as well, when constructing links to the blog post pages.

Efficient Lookups

Let’s move on to the second use-case for views: “building efficient indexes to find documents by any value or structure that resides in them”. We already explained the efficient indexing but we skipped a few details. This is a good time to finish this discussion as we are looking at map functions that are a little more complex.

First, back to the B-trees! We explained that the B-tree that backs the key-sorted view result is only built once, when you first query a view and all subsequent queries will just read the B-tree instead of executing the map function for all documents again. What happens though, when you change a document, or add a new one or delete one? Easy: CouchDB is smart enough to find the rows in the view result that were created by a specific document. It marks them invalid to have them no longer show up in view results. If the document was deleted, we’re good, the resulting B-tree reflects the state of the database. If a doc got updated, the new doc is run through the map function and the resulting new lines are inserted into the B-tree at the correct spots; new documents are handled in the same way. We already saw in in the previous chapter Storing Documents that a B-tree is a very efficient data structure for our needs and the crash-only design is carried over to the view indexes as well.

To add one more to the efficiency discussion, usually multiple documents get updated between view queries. The mechanism explained in the previous paragraph gets applied to all changes in the database since the last time the view got query in a batch operation which makes things even faster and is generally better use of your resources.

Find One

On to more complex map functions. We said “find documents by any value or structure that resides ins them”. We already explained how to extract a value to sort a list of views by (our date field). The same mechanism is used for fast lookups. The URI to query to get a view’s result is /database/_design/designdocname/_view/viewname. This gives you a list of all rows in the view. We only have three documents so things are small, but with thousands of documents, this can get long. You can add view parameters to the URI to constrain the result set. To find a single document, say we know the date of a blog post would be /blog/_design/docs/_view/by_date?key="2009/01/30 18:04:11" to get the “Biking” blog post. Remember that you can place whatever you like in the key parameter to the emit() function. Whatever you put in there, we can now use to look up exactly — and fast.

Note that in the case where multiple rows have the same key (perhaps we design a view where the key is the name of the post’s author), key queries can return more than one row.

Find Many

We talked about ``getting all posts for last month'' (it’s February now), this is as easy as `/blog/_design/docs/_view/by_date?startkey="2009/01/01 00:00:00"&endkey="2009/01/31 00:00:00". The `startkey and and endkey parameters specify an inclusive range on which we can search.

To make things a little nicer and to prepare for a future example, we are going to change the format of our date field. Instead of a string, we are going to use an array, where individual members are part of a timestamp in decreasing significance. This sounds fancy, but it is rather easy. Instead of

{
  "date": "2009/01/31 00:00:00"
}

we use

  "date": [2009, 1, 31, 0, 0, 0]

Our map function does not have to change for this, but our view result looks a little different.

Example: New View Results

| key                      | value
|----------------------------------------------
|[2009, 01, 15, 15, 52, 20] | "Hello World"
|[2009, 02, 17, 21, 13, 39] | "Bought a Cat"
|[2009, 01, 30, 18, 04, 11] | "Biking"

And our queries change to /blog/_design/docs/_view/by_date?key=[2009, 01, 01, 00, 00, 00] and /blog/_design/docs/_viewby_date?key=[2009, 01, 31, 00, 00, 00] For all you care, this is just a change in syntax, not meaning. But it shows you the power of views. Not only can you construct an index with scalar values like strings and integers, no you can also use JSON structures to put in there. Say we tag our documents with a list of tags and want to see all tags, but we don’t care for documents that have not been tagged.

Example: A Document Snippet With Tags

{
  ...
  tags: ["cool", "freak", "plankton"],
  ...
}

Example: A Document Snippet Without Tags

{
  ...
  tags: [],
  ...
}

Example: A Contrived Map Function

function(doc) {
  if(doc.tags.length > 0) {
    for(var idx in doc.tags) {
      emit(doc.tags[idx], null);
    }
  }
}

This shows a few new things. You can have conditions on structure (if(doc.tags.length > 0)) instead of just values. This is also an example of how a map function call emit() multiple times per document. And finally, you can pass null instead of a value to the value parameter; and the same is true for the key parameter. We’ll see in a bit how that is useful.

The view to get comments for posts

Figure: Comments map function

We use an array key here to support the group_level reduce query parameter. CouchDB’s views are stored in the Btree file-structure (which will be described in more detail in the advanced views section). Because of the way Btree’s are structured, we can cache the intermediate reduce results in the non-leaf nodes of the tree, so that reduce queries can be computed along arbitrary key ranges in logarithmic time.

In the blog app, we use group_level reduce queries to compute the count of comments both on a per-post and total basis, achieved by querying the same view index with different methods. With some array keys, and assuming each key has the value 1:

["a","b","c"]
["a","b","e"]
["a","c","m"]
["b","a","c"]
["b","a","g"]

The reduce view:

function(keys, values, rereduce) {
  return sum(values)
}

returns the total number of rows between the start and end key. So with startkey=["a","b"]&endkey=["b"] (which includes the first 3 of the above keys) the result would equal 3. The effect is to count rows. If you’d like to count rows without depending on the row value, you can switch on rereduce parameter:

function(keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  } else {
    return values.length;
  }
}

This is the reduce view used by the example app to count comments, while utilizing the map to output the comments, which are more useful than just 1 over and over. It pays to spend some time playing around with Map and Reduce functions. Futon is alright for this, but doesn’t give full access to all the query parameters. Writing your own test code for views in your language of choice is a great way to explore the nuances and capabilities of CouchDB’s incremental Map Reduce system.

Anyway… with a group_level query you’re basically running a series of reduce range queries. One for each group that shows up at the level you query. Let’s reprint the key list from above, grouped at level 1:

["a"]   3
["b"]   2

And at group_level=2:

["a","b"]   2
["a","c"]   1
["b","a"]   2

Using the parameter group=true behaves as though it were group_level=Exact, so in the case of our current example, it would give the number 1 for each key, as there are no exactly duplicated keys.

Setup comment view query code in post.html

Figure: Comment display Javascript

Create a comment.

[Screenshot]

Surprise! Gravatar

Summary

Map functions are side-effect-free functions which take a document as argument and emit key/value pairs. CouchDB stores the emitted rows by constructing a sorted B-Tree index, so row lookups by key, as well as streaming operations across a range of rows, can be accomplished in a small memory and processing footprint, while writes avoid seeks. Generating a view takes O(N), where N is the total number of rows in the view. However, querying a view is very quick, as the Btree remains shallow even when it contains many many keys.

Figure: A basic reduce view

Reduce functions operate on the sorted rows emitted by map view functions. CouchDB’s reduce functionality takes advantage of one of the fundamental properties of Btree indexes: for every leaf node (a sorted row), there is a chain of internal nodes reaching back to the root. Each leaf node in the Btree carries a few rows (on the order of tens, depending on row size), and each internal node may link to a few leaf nodes or other internal nodes.

btree figure about reduce

The reduce function is run on every node in the tree, in order to calculate the final reduce value. The end result is a reduce function which can be incrementally updated upon changes to the map function, while recalculating the reduction values for a minimum number of nodes. The initial reduction is calculated once per each node (inner and leaf) in the tree.

When run on leaf nodes (which contain actual map rows), the reduce function’s third parameter, rereduce, is false. The arguments in this case are the keys and values as output by the map function. The function has a single returned reduction value, which is stored on the inner node that working set of leaf nodes has in common, and used as a cache in future reduce calculations.

When the reduce function is run on inner nodes, the rereduce flag is true. This allows the function to account for the fact that it will be receiving its own prior output. When rereduce is true, the values passed to the function are intermediate reduction values as cached from previous calculations. When the tree is more than 2 levels deep, the rereduce phase is repeated, consuming chunks of the previous level’s output until the final reduce value is calculated at the root node.

A common mistake new CouchDB users make, is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, not, for instance, a JSON hash with the set of unique keys, and the count of each. The problem with this approach is that you’ll end up with a very very large final value. The number of unique keys can be nearly as large as the number of total keys, even for a large set. It is fine to combine a few scalar calculations into one reduce function, for instance to find the total, average, and standard deviation of a set of numbers in a single function.

If you’re interested in pushing the edge of CouchDB’s incremental reduce functionality, have a look at Google’s Sawzall paper, which gives examples of some of the more exotic reductions that can be accomplished in a system with similar constraints.

Performance… Generation: O(N) where N is the number of total nodes in the tree. Querying…

Writing a view

Let’s look at the JavaScript listing of our recent-posts view. We see that it is a single function, which takes one argument, doc, and processes it, emitting the title field, a short

from it as a key-value pair.

Figure: Annotated recent-posts view listing

In our view, we first calculate a short summary of the blogpost, which has had all HTML tags stripped from it, and been limited to 350 characters in length. Cutting down the post body in this way makes the view queries faster and easier for the client to integrate into the page. The post summaries are then emitted, along with the post title, into a view where they are kept sorted by creation timestamp. The goal of this view is to sort posts summaries in temporal order, as thats the list we’ll need to display the blog’s index page.

Querying a view

We query the view for latest posts, by supplying the descending=true parameter, and limit the maximum # of returned rows with the limit=7

The HTML is static blah blah… Calling views from html

We render the blog post to an HTML string using B.postToEntry, and insert it in the DOM with jQuery’s append method. (This method is provisional, when Couch’s document templates become available, this work will be simpler.)

Figure: The index page HTML (with JavaScript)

At this point, we’ve got a working blog with posts and an index page. We could stop here, and have perfectly useful software, but by adding a few bells and whistles, we’ll get the opportunity to explore CouchDB’s views in depth. The next chapter will add comments to the blog (and introduce complex view keys - which is the way CouchDB does joins).

Reduce / Rereduce

when your map view could have an unbounded number of rows, you must be able to reduce recursively, because the intermediate reduction values are stored on the inner nodes of the btree, so they can be incrementally updated.

once you understand why rereduce isn’t optional (but you can write functions that don’t make the distinction between it and reduce, it starts to seem like it couldn’t be otherwise. I think Hadoop style Map Reduce has an issue reducing when you have very many maps with the same key. CouchDB can reduce that just fine, because our key ranges can be arbitrarily picked at query time, so even a whole db emitting the key null will be reduceable to a single value.

it would be better to think of the initial reduce as the special case, and the rereduce as the normal case.

the initial reduce is tasked with converting the data from the map row format to a format suitable for reduction. when you have the option, it often simplifies things to output data from your map suitable for direct injection into the rereduce phase, this way you can ignore the parameter.

we could definitely make that clearer, maybe by inverting the bit and renaming the parameter to initial_reduce.

hence the simplicity of:

map:

function(doc) {
  doc.tags && doc.tags.forEach(function(tag) {
    emit(tag, 1);
  });
}

with the standard sum() reduce.

but say you wanted to repurpose the map index also for a browsable tag index in the client. you might emit the title as well, and sort by date.

function(doc) {
  doc.date && doc.title && doc.tags && doc.tags.forEach(function(tag) {
    emit([tag, doc.date], doc.title);
  });
}

sum of the titles is nonsense, and you still want to count the tags, hence:

function(keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  } else {
    return values.length;
  }
}

will convert any value to a format suitable to reduction, namely, the number of map rows, so this view is equivalent to the emit(key, 1) strategy, but allows other row values.

in this case (because the date) is in the key array, you’d end up querying the reduce with group_level=1 so that you can get the counts of each tag, ignoring the dates in the key.

hope that helps your understanding of rereduce.