Content Indexing and Querying using interMedia and Lucene


This document outlines how CMS content is indexed and how a search string is translated into a set of results, using both Lucene and Oracle interMedia (also known as Oracle Text).

Indexing

Basic Attributes

Both interMedia and Lucene store the same set of basic attributes about a piece of content that are used for filtering a search query or displaying search results. These attributes are also used to match search query terms when using Lucene, but not when using interMedia. Attributes include:

Full Text

The content that is full-text indexed is created by traversing the object graph and generating a representation that is usable by the indexer. This is handled by an implementation of DomainObjectTraversal. The set of objects visited by the traversal is configured in the file named by the com.arsdigita.cms.item_adapters configuration parameter, which defaults to WEB-INF/resources/cms-item-adapters.xml.

This file contains a set of contexts, which are used to group adapter directives for different purposes. The context used for indexing is com.arsdigita.cms.search.ContentPageMetadataProvider. Within this context there are a number of adapters. Each adapter specifies a list of attributes and a list of associations that should be handled by the DomainObjectTraversal for the given object type. Each of these lists also has an adapter rule. If the rule equals include, then the list is an inclusion list, and only the attributes (or associations) listed will be processed by the DomainObjectTraversal. If the rule equals exclude, then it is an exclusion list, and all attributes (or associations) will be processed except for those listed.
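The include and exclude rules can be illustrated with a minimal sketch. It is written in Python for brevity and is not the CMS's own Java code; the function name select_properties and the attribute names are hypothetical, chosen only for illustration.

```python
def select_properties(all_properties, rule, listed):
    """Return the properties a traversal would process under an adapter rule.

    rule == "include": only the listed properties are processed.
    rule == "exclude": every property except the listed ones is processed.
    """
    listed = set(listed)
    if rule == "include":
        return [p for p in all_properties if p in listed]
    if rule == "exclude":
        return [p for p in all_properties if p not in listed]
    raise ValueError(f"unknown adapter rule: {rule!r}")


# Hypothetical attribute names, for illustration only.
article = ["id", "title", "lead", "body", "version"]
print(select_properties(article, "include", ["title", "body"]))
print(select_properties(article, "exclude", ["id", "version"]))
```

With the inclusion list only title and body are kept; with the exclusion list everything except id and version is kept.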

When indexing an object for Lucene, the DomainObjectTraversal converts all processed attributes into their string values, concatenates them into one long space-separated string, and passes that string to Lucene for indexing. The details of how Lucene handles this data are beyond the scope of this document, but the documentation available from the official Lucene project page provides more information.
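As a rough illustration of the flattening step, the sketch below walks a nested structure and joins every value into one space-separated string. It is a Python approximation, not the actual DomainObjectTraversal implementation: the dictionary layout, the function names, and the choice to skip empty values are all assumptions.

```python
def collect_values(obj):
    """Yield the string value of every attribute, depth-first, in traversal order."""
    for value in obj.values():
        if isinstance(value, dict):
            # An associated object: descend into it.
            yield from collect_values(value)
        elif isinstance(value, (list, tuple)):
            # A multi-valued association: visit each member in turn.
            for item in value:
                if isinstance(item, dict):
                    yield from collect_values(item)
                else:
                    yield str(item)
        elif value is not None:
            # Assumption: attributes without a value contribute nothing.
            yield str(value)


def to_index_text(obj):
    """Concatenate all collected values into one space-separated string."""
    return " ".join(collect_values(obj))


# Hypothetical content item, for illustration only.
item = {
    "title": "Annual report",
    "year": 2004,
    "author": {"name": "J. Smith"},
    "sections": [{"heading": "Summary"}, {"heading": "Outlook"}],
}
print(to_index_text(item))
```

The resulting string is what would be handed to the indexer as a single block of text.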

When indexing an object for interMedia, the set of traversed objects is converted to an XML document. The object being indexed is processed recursively, with the root object creating the first element, and all attributes and associated objects creating new child elements. This XML document is then inserted into a database table that has been indexed to allow for efficient searching of XML data.
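The recursive conversion can be sketched as follows, using Python's standard xml.etree.ElementTree module. The element names and the input structure are hypothetical; the actual element layout produced by the CMS is not described here, so this only shows the general shape of the recursion.

```python
import xml.etree.ElementTree as ET


def to_element(name, value):
    """Build an element for `value`; nested dicts and lists become child elements."""
    element = ET.Element(name)
    if isinstance(value, dict):
        for child_name, child_value in value.items():
            if isinstance(child_value, list):
                # A multi-valued association: one child element per member.
                for member in child_value:
                    element.append(to_element(child_name, member))
            else:
                element.append(to_element(child_name, child_value))
    else:
        # A plain attribute: store its string value as element text.
        element.text = str(value)
    return element


# Hypothetical content item, for illustration only.
item = {"title": "Annual report", "author": {"name": "J. Smith"}}
xml_text = ET.tostring(to_element("article", item), encoding="unicode")
print(xml_text)
```

The root object yields the outermost element, and each attribute or associated object yields a child element beneath it.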

interMedia also allows for indexing of raw content: plain text and binary data (Word, PDF, etc.). When using interMedia, an additional traversal is performed to collect any associated text or binary assets. The rules for which attributes and associations are visited during this traversal are contained in the com.arsdigita.cms.search.AssetExtractor context. The content of the first asset found is inserted into another column that is indexed using interMedia. If that content is plain text or one of the interMedia-supported binary file types, it will also be used when performing searches. Oracle provides a list of supported document formats.
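The "first asset found" behaviour amounts to a depth-first search that stops at the first match. The sketch below assumes, purely for illustration, that an asset is any dictionary carrying mime_type and content keys; how the CMS actually recognizes an asset is not specified in this document.

```python
def find_first_asset(obj):
    """Return the first asset met in a depth-first traversal, or None."""
    if isinstance(obj, dict):
        # Assumption: an asset is marked by these two hypothetical keys.
        if "mime_type" in obj and "content" in obj:
            return obj
        children = obj.values()
    elif isinstance(obj, (list, tuple)):
        children = obj
    else:
        return None
    for child in children:
        found = find_first_asset(child)
        if found is not None:
            return found
    return None


# Hypothetical content item with two attachments.
item = {
    "title": "Report",
    "attachments": [
        {"mime_type": "application/pdf", "content": b"%PDF-1.4"},
        {"mime_type": "text/plain", "content": b"notes"},
    ],
}
asset = find_first_asset(item)
print(asset["mime_type"])
```

Only the first attachment would be stored in the raw-content column; later assets are ignored by this sketch, as they are in the description above.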

Querying

Lucene

Human-entered search terms are passed directly to the Lucene APIs to be parsed and matched against the search index. The Lucene documentation explains the query syntax and matching rules in detail.

interMedia

interMedia is more restrictive about the queries it will accept, and an incorrectly formatted query can cause an error. To avoid this, the human-entered search terms are first "cleaned" by removing illegal words and characters and separating the remaining words and phrases (words surrounded by quotes, which will be searched for as a group) with "and". Illegal characters are: |&,-*;{}%_$?!()\:@.<>#^+=[]~` and illegal words are: the, of, to, with, and, or, for, this.

This set of cleaned search terms is then used to query the columns containing the XML and raw content generated during indexing. The query is performed using the interMedia contains() operator, which uses the interMedia indexes on these columns to perform efficient full-text searching.

The rank for each search result is obtained by multiplying the value of the score() operator for the XML and raw content columns by the weights specified in the waf.search.intermedia.xml_content_weight and waf.search.intermedia.raw_content_weight config parameters respectively, and summing the resulting values. Both config parameters default to 1. Detailed information on the use of the contains() and score() operators is provided by Oracle.

Note that, as a result of the "cleaning" of the search terms, advanced interMedia search features are not supported. Only simple matching of all specified words and phrases is currently supported.
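The cleaning and ranking steps can be approximated with the sketch below. It is a Python illustration under several assumptions, not the CMS's own code: illegal characters are deleted rather than replaced by spaces, illegal words are also dropped inside quoted phrases, and phrase words are simply left adjacent in the output. The function names are hypothetical.

```python
import re

ILLEGAL_CHARS = set("|&,-*;{}%_$?!()\\:@.<>#^+=[]~`")
ILLEGAL_WORDS = {"the", "of", "to", "with", "and", "or", "for", "this"}


def clean_query(raw):
    """Strip illegal characters and words, then join the remaining terms with 'and'."""
    terms = []
    # A quoted phrase is captured as one unit; anything else is split on whitespace.
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', raw):
        text = phrase or word
        # Assumption: illegal characters (and stray quotes) are deleted outright.
        text = "".join(c for c in text if c not in ILLEGAL_CHARS and c != '"')
        words = [w for w in text.split() if w.lower() not in ILLEGAL_WORDS]
        if words:
            terms.append(" ".join(words))
    return " and ".join(terms)


def rank(xml_score, raw_score, xml_weight=1, raw_weight=1):
    """Weighted sum of the two column scores; both weights default to 1."""
    return xml_score * xml_weight + raw_score * raw_weight


print(clean_query('the "annual report" for 2004 (draft)'))
print(rank(40, 10))
print(rank(40, 10, xml_weight=2))
```

In this sketch the weight arguments stand in for the waf.search.intermedia.xml_content_weight and waf.search.intermedia.raw_content_weight parameters, and the two score arguments stand in for the score() values of the XML and raw content columns.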

Because full-text searching with interMedia is done against a table in the same schema as the rest of the CMS data (as opposed to the external files used by Lucene), it enables filtering based on permissions and categories, which is not available with Lucene. When filtering based on permissions, the search query is modified to join with the permissions-denormalization tables and filter out any objects for which the specified user (generally the currently logged-in user) does not have the specified privilege. For category-based filtering, the search query joins with the categorization tables to filter out any objects not in the specified categories.