Content Indexing and Querying using interMedia and Lucene

This document outlines how CMS content is indexed, and how a search string is translated into a set of results, with both Lucene and Oracle interMedia (also known as Oracle Text).

Indexing

Basic Attributes

Both interMedia and Lucene store the same set of basic attributes about a piece of content that are used for filtering a search query or displaying search results. These attributes are also used to match search query terms when using Lucene, but not when using interMedia. Attributes include:

Object ID
Object type
Title
Summary
Language
Content creator
Creation date
Last user to modify content
Last modification date

Full Text

The content that is full-text indexed is created by traversing the object graph and generating a representation that is usable by the indexer. This is handled by an implementation of DomainObjectTraversal. The set of objects to be visited by the traversal is configurable by editing the file indicated by the com.arsdigita.cms.item_adapters configuration parameter (it defaults to: WEB-INF/resources/cms-item-adapters.xml).

This file contains a set of contexts, which are used to group adapter directives for different purposes. The context used for indexing is com.arsdigita.cms.search.ContentPageMetadataProvider. Within this context there are a number of adapters. Each adapter specifies a list of attributes and a list of associations that should be handled by the DomainObjectTraversal for the given object type. Each of these lists also has an adapter rule. If the rule equals include, then the list is an inclusion list, and only the attributes (or associations) listed will be processed by the DomainObjectTraversal. If the rule equals exclude, then it is an exclusion list, and all attributes (or associations) will be processed except for those listed.
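
The include/exclude rule described above can be sketched as follows. This is only an illustration of the rule semantics, not the actual DomainObjectTraversal adapter code: the class AdapterRuleSketch, the Rule enum, and the property names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the adapter rule; not the real CMS classes.
class AdapterRuleSketch {

    enum Rule { INCLUDE, EXCLUDE }

    /** Returns the properties a traversal would process under the given rule. */
    static List<String> propertiesToProcess(List<String> allProperties,
                                            Rule rule,
                                            Set<String> listed) {
        List<String> result = new ArrayList<>();
        for (String property : allProperties) {
            boolean isListed = listed.contains(property);
            // include: keep only listed properties;
            // exclude: keep every property except the listed ones.
            if ((rule == Rule.INCLUDE) == isListed) {
                result.add(property);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("title", "summary", "body", "internalNotes");
        Set<String> listed = new HashSet<>(Arrays.asList("title", "body"));
        System.out.println(propertiesToProcess(all, Rule.INCLUDE, listed));
        System.out.println(propertiesToProcess(all, Rule.EXCLUDE, listed));
    }
}
```

With the same list, the include rule yields only title and body, while the exclude rule yields summary and internalNotes.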

When indexing an object for Lucene, the DomainObjectTraversal converts all processed attributes into their string values, concatenates them into one long space-separated string, and passes that string to Lucene for indexing. The details of how Lucene handles this data are beyond the scope of this document; see the documentation available from the official Lucene project page for more information.
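
As a rough sketch of the flattening step, assuming the processed attribute values are already available as a plain list (the real traversal works on domain objects), the concatenation might look like this. The class name LuceneTextSketch and the decision to skip null values are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the real traversal visits domain objects, not plain lists.
class LuceneTextSketch {

    /** Converts each attribute value to a string and joins them with single spaces. */
    static String toIndexableText(List<Object> attributeValues) {
        List<String> parts = new ArrayList<>();
        for (Object value : attributeValues) {
            if (value != null) {              // assumption: null attributes are skipped
                parts.add(String.valueOf(value));
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<Object> values = new ArrayList<>(Arrays.asList("Annual Report", 2004, "Finance team"));
        System.out.println(toIndexableText(values));
    }
}
```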

When indexing an object for interMedia, the set of traversed objects is converted to an XML document. The object being indexed is processed recursively, with the root object creating the first element, and all attributes and associated objects creating new child elements. This XML document is then inserted into a database table that has been indexed to allow for efficient searching of XML data.
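
The recursive conversion can be pictured with the sketch below, in which nested maps stand in for domain objects and their associations. The class XmlTraversalSketch, the element names, and the use of map keys as element names are assumptions for illustration; the actual element layout produced by the CMS is not described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: nested maps stand in for domain objects and associations.
class XmlTraversalSketch {

    /** Escapes the characters that may not appear literally in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes one element for the value; map entries become child elements. */
    static void write(String name, Object value, StringBuilder out) {
        out.append('<').append(name).append('>');
        if (value instanceof Map) {
            // An associated object: recurse into each of its properties.
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                write(String.valueOf(entry.getKey()), entry.getValue(), out);
            }
        } else {
            // A simple attribute: emit its string value as text content.
            out.append(escape(String.valueOf(value)));
        }
        out.append("</").append(name).append('>');
    }

    static String toXml(String rootName, Map<String, Object> root) {
        StringBuilder out = new StringBuilder();
        write(rootName, root, out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> author = new LinkedHashMap<>();
        author.put("name", "Pat");
        Map<String, Object> article = new LinkedHashMap<>();
        article.put("title", "Q&A");
        article.put("author", author);
        System.out.println(toXml("article", article));
    }
}
```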

interMedia also allows for indexing of raw content - plain text and binary data (Word, PDF, etc.). When using interMedia an additional traversal is performed, collecting any associated text or binary assets. The rules for which attributes and associations are visited during this traversal are contained in the com.arsdigita.cms.search.AssetExtractor context. The content of the first asset found is inserted into another column that is indexed using interMedia. If that content is plain text or one of the interMedia-supported binary file types, it will also be used when performing searches. Oracle provides a list of supported document formats.

Querying



Lucene

Human-entered search terms are passed directly to the Lucene APIs to be parsed and matched against the search index. The Lucene documentation explains the query syntax and matching rules in detail.

interMedia

interMedia is more restrictive about the queries it will accept, and an incorrectly formatted query can cause an error. To avoid this, the human-entered search terms are first "cleaned" by removing illegal words and characters and separating the remaining words and phrases (words surrounded by quotes, which will be searched for as a group) with "and". Illegal characters are: |&,-*;{}%_$?!()\:@.<>#^+=[]~` and illegal words are: the, of, to, with, and, or, for, this. This set of cleaned search terms is then used to query the columns containing the XML and raw content generated during indexing. The query is performed using the interMedia contains() operator. This operator uses the interMedia indexes on these columns to perform efficient full-text searching. The rank for each search result is obtained by multiplying the value of the score() operator for the XML and raw content columns by the weights specified in the waf.search.intermedia.xml_content_weight and waf.search.intermedia.raw_content_weight config parameters respectively, and summing the resulting values. Both config parameters default to 1. Detailed information on the use of the contains() and score() operators is provided by Oracle. Note that, as a result of the "cleaning" of the search terms, advanced interMedia search features are not supported. Only simple matching of all specified words and phrases is currently supported.
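
The cleaning and ranking steps described above can be sketched as follows. This is not the CMS implementation: the class InterMediaQuerySketch and its methods are hypothetical, and several details are assumptions, namely that illegal characters are replaced by spaces, that illegal words are matched case-insensitively and also dropped from inside phrases, and that a phrase is emitted as its words separated by single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "cleaning" and ranking steps; not the real CMS code.
class InterMediaQuerySketch {

    static final String ILLEGAL_CHARS = "|&,-*;{}%_$?!()\\:@.<>#^+=[]~`";
    static final Set<String> ILLEGAL_WORDS = new HashSet<>(
            Arrays.asList("the", "of", "to", "with", "and", "or", "for", "this"));

    // A token is either a double-quoted phrase or a run of non-whitespace characters.
    static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    /** Replaces illegal characters (and stray quotes) with spaces. */
    static String stripIllegalChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(ILLEGAL_CHARS.indexOf(c) >= 0 || c == '"' ? ' ' : c);
        }
        return sb.toString();
    }

    /** Cleans raw user input and joins the remaining words and phrases with "and". */
    static String clean(String raw) {
        List<String> terms = new ArrayList<>();
        Matcher m = TOKEN.matcher(raw);
        while (m.find()) {
            boolean isPhrase = m.group(1) != null;
            String text = stripIllegalChars(isPhrase ? m.group(1) : m.group(2));
            List<String> words = new ArrayList<>();
            for (String w : text.trim().split("\\s+")) {
                if (!w.isEmpty() && !ILLEGAL_WORDS.contains(w.toLowerCase())) {
                    words.add(w);
                }
            }
            if (words.isEmpty()) {
                continue;
            }
            if (isPhrase) {
                terms.add(String.join(" ", words));   // phrase stays together as one term
            } else {
                terms.addAll(words);                  // each word becomes its own term
            }
        }
        return String.join(" and ", terms);
    }

    /** Weighted sum of the score() values for the XML and raw content columns. */
    static double rank(double xmlScore, double rawScore, double xmlWeight, double rawWeight) {
        return xmlScore * xmlWeight + rawScore * rawWeight;
    }

    public static void main(String[] args) {
        System.out.println(clean("the \"annual report\" for budget-2004!"));
        System.out.println(rank(40, 10, 1, 1)); // both weights at their default of 1
    }
}
```

Under these assumptions, the input the "annual report" for budget-2004! becomes annual report and budget and 2004, and with both weights at their default of 1 the rank is simply the sum of the two scores.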

      
Because full-text searching with interMedia is done against a table in the same schema as the rest of the CMS data (as opposed to the external files used by Lucene), it enables filtering based on permissions and categories, which is not available to Lucene. When filtering based on permissions, the search query is modified to join with the permissions-denormalization tables and filter out any objects for which the specified user (generally the currently logged-in user) does not have the specified privilege. For category-based filtering, the search query joins with the categorization tables to filter out any objects not in the specified categories.