Section {name} {number} {maxlen} [when] [format] [cloneflag] [separator] [{expression} {replacement}]
When used in search.htm, the "Section" command requires only the first three parameters and activates recognition of section name references in search queries. See the section "Restrict searched words to a section" in Chapter 8 for details. The "Section" command serves no other purpose in search.htm. The rest of this article applies mostly to indexer.conf.
"name" is the section name and "number" is the section ID, between 0 and 255. Use 0 for sections you don't want to index. It is better to use different section IDs for different document parts: at search time you can then give a different weight to each part, or even exclude some sections. The "maxlen" argument is the maximum length of the section value that will be stored in the database.
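The rules for the three required parameters can be illustrated with a small Python sketch. It is not part of mnoGoSearch: `SectionDef` and `parse_required` are hypothetical names, and treating "maxlen" as a character count is an assumption made only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionDef:
    """Hypothetical model of the three required Section parameters."""
    name: str
    number: int  # section ID, 0..255; 0 means the section is not indexed
    maxlen: int  # maximum stored length (assumed here to be characters)

    @property
    def indexed(self) -> bool:
        return self.number != 0

    def stored_value(self, text: str) -> str:
        # Keep at most maxlen characters for storage.
        return text[: self.maxlen]


def parse_required(line: str) -> SectionDef:
    """Parse only the name, number and maxlen fields of a Section line."""
    parts = line.split()
    if len(parts) < 4 or parts[0] != "Section":
        raise ValueError(f"not a Section line: {line!r}")
    name, number, maxlen = parts[1], int(parts[2]), int(parts[3])
    if not 0 <= number <= 255:
        raise ValueError(f"section ID out of range 0..255: {number}")
    if maxlen < 0:
        raise ValueError(f"negative maxlen: {maxlen}")
    return SectionDef(name, number, maxlen)


body = parse_required("Section body 1 256")
hidden = parse_required("Section HTTP.LocalCharsetContent 0 0")
print(body, body.indexed)
print(hidden, hidden.indexed)
```

Optional parameters after "maxlen" are deliberately ignored here; the sketch only shows the ID range check and the meaning of ID 0.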
"when" is an optional parameter defining when the section should be created. Three values are possible:
"format" is a flag telling indexer which parser to use for the section. Two values are understood:
text - use text parser
html - use HTML parser
"cloneflag" is a flag describing whether the section should affect clone detection. It can be "DetectClone" or "cdon", or "NoDetectClone" or "cdoff". By default, url.* section values are not taken into account for clone detection, while all other sections take part in clone detection.
"separator" is a string that separates the parts of a section. This is useful for attribute sections.
"expression" and "replacement" can be used to extract user defined sections.
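As a rough illustration of how an "expression" and "replacement" pair yields a section value, the Python sketch below applies a regular expression to a document and expands "$1"-style references. It only approximates the idea: `extract_section` is a hypothetical helper, and Python's `re` syntax may differ in details from the regex engine used by indexer.

```python
import re


def extract_section(document, expression, replacement):
    """Return the extracted section value, or None if nothing matches."""
    match = re.search(expression, document)
    if match is None:
        return None
    # Translate "$1"-style references into Python's "\g<1>" form.
    template = re.sub(r"\$(\d)", r"\\g<\1>", replacement)
    return match.expand(template)


html = "<html><body><h1>Release notes</h1><p>text</p></body></html>"
print(extract_section(html, "<h1>(.*)</h1>", "$1"))

meta = '<META NAME="Date" CONTENT="1994-11-06">'
print(extract_section(meta, '<META NAME="Date" +CONTENT="([^"]*)">', "$1"))
```

The two calls correspond to the "h1" and "User.Date" definitions used in this article's examples.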
There is a special "User.Date" section. It makes it possible to use a user-defined meta tag (or any other document part) as an alternative "Last-Modified" value. A number of widespread formats are understood:
Sun, 06 Nov 1994 08:49:37 GMT
Sun, 6 Nov 1994 08:49:37 GMT
Sunday, 06-Nov-94 08:49:37 GMT
Sun Nov 6 08:49:37 1994
1994-11-06
06.11.1994
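To check in advance whether a date string falls into one of the formats listed above, a sketch like the following can help. It is not how indexer parses dates: the `FORMATS` table and `parse_user_date` are assumptions written for this illustration, they presume an English (C) locale for day and month names, and the numeric form 06.11.1994 is read as day.month.year.

```python
from datetime import datetime

# strptime patterns assumed to correspond to the formats listed above.
FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # Sun, 06 Nov 1994 08:49:37 GMT (also "6 Nov")
    "%A, %d-%b-%y %H:%M:%S GMT",  # Sunday, 06-Nov-94 08:49:37 GMT
    "%a %b %d %H:%M:%S %Y",       # Sun Nov 6 08:49:37 1994
    "%Y-%m-%d",                   # 1994-11-06
    "%d.%m.%Y",                   # 06.11.1994
)


def parse_user_date(value):
    """Return a datetime for the first matching format, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None


samples = [
    "Sun, 06 Nov 1994 08:49:37 GMT",
    "Sun, 6 Nov 1994 08:49:37 GMT",
    "Sunday, 06-Nov-94 08:49:37 GMT",
    "Sun Nov 6 08:49:37 1994",
    "1994-11-06",
    "06.11.1994",
]
for sample in samples:
    print(sample, "->", parse_user_date(sample))
```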
"nobody" is another section with a special meaning. When parsing HTML documents, indexer by default ignores words outside the <body> and </body> tags. To index these words, define a special section "nobody" with the same ID and length as the section "body". This is useful when indexing a remote site with broken HTML pages that you cannot modify, or when indexing local HTML pages with SSI (server side include) directives directly from disk using the file:/// scheme, where the <body> and </body> tags are not in the pages themselves but in shared files included through SSI directives such as <!--#include virtual="../include/top.html"-->. For example:
Section body 1 256
Section nobody 1 256
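The effect of the "nobody" section can be pictured with the following Python sketch, which separates the words of a page into those inside and those outside the body tags. It uses the standard `html.parser` module rather than indexer's own HTML parser, and `BodySplitter` and `split_words` are hypothetical names chosen for the illustration.

```python
from html.parser import HTMLParser


class BodySplitter(HTMLParser):
    """Collect words found inside and outside <body>...</body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.inside = []
        self.outside = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        target = self.inside if self.in_body else self.outside
        target.extend(data.split())


def split_words(html):
    parser = BodySplitter()
    parser.feed(html)
    parser.close()
    return parser.inside, parser.outside


# A page fragment whose <body> tags live in an SSI-included file.
fragment = '<!--#include virtual="../include/top.html"--><p>Blue widget</p>'
print(split_words(fragment))
```

For such a fragment every word ends up in the "outside" list, which is the text that would be skipped unless a "nobody" section is defined.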
Section body 1 256
Section title 2 128
Section meta.keywords 3 128
Section meta.description 4 128
Section header.server 5 64
Section url.file 6 0
Section url.path 7 0
Section url.host 8 0
Section url.proto 9 0
Section crosswords 10 0
Section Charset 11 32
Section Content-Type 12 64
Section Content-Language 13 16
Section attribute.alt 14 128
Section attribute.label 15 128
Section attribute.summary 16 128
Section attribute.title 17 128
Section References 18 0
Section Message-ID 19 0
Section Parent-ID 20 0
Section MP3.Song 21 128
Section MP3.Album 22 128
Section MP3.Artist 23 128
Section MP3.Year 24 128
Section CachedCopy 25 64000
Section attribute.face 27 0
Section attribute.title 28 0 "."

# A user-defined section
Section h1 29 128 "<h1>(.*)</h1>" $1

# User-defined date extracted from the "Date" meta-tag
Section User.Date 0 10 '<META NAME="Date" +CONTENT="([^"]*)">' "$1"

# Replacing Content-Type with application/msword
Section Content-Type 0 64 afterheaders cdoff "" "${URL}" "http://site/*.doc" "application/msword"

# Using "afterguesser" in conjunction with ${HTTP.LocalCharsetContent}
Section HTTP.LocalCharsetContent 0 0
Section h1lcs 30 128 afterguesser cdoff "" "${HTTP.LocalCharsetContent}" "<h1>(.*)</h1>" $1

# Using a simple HTDBDoc query for a SQL table with text and HTML columns
Section column1 1 256 text
Section column2 2 256 html