<html>
<head>
<title>Content Indexing and Querying using interMedia and Lucene</title>
</head>
<body>
<h2>Content Indexing and Querying using interMedia and Lucene</h2>
<hr/>
This document outlines how CMS content is indexed, and how a
search string is translated into a set of results, for both Lucene
and Oracle interMedia (also known as Oracle Text).
<h3>Indexing</h3>
<h4>Basic Attributes</h4>
Both interMedia and Lucene store the same set of basic attributes
for each piece of content. These attributes are used to filter a
search query and to display search results. Lucene also matches
search query terms against them; interMedia does not. The
attributes are:
<ul>
<li>Object ID</li>
<li>Object type</li>
<li>Title</li>
<li>Summary</li>
<li>Language</li>
<li>Content creator</li>
<li>Creation date</li>
<li>Last user to modify content</li>
<li>Last modification date</li>
</ul>
<h4>Full Text</h4>
The content to be full-text indexed is produced by traversing the
object graph and generating a representation that the indexer
can use. This
is handled by an implementation of <a
href="http://ccm.redhat.com/documentation/api/ccm-core-6.0.0/com/arsdigita/domain/DomainObjectTraversal.html">
DomainObjectTraversal</a>. The set of objects to be visited by
the traversal is configurable by editing the file indicated by
the <code>com.arsdigita.cms.item_adapters</code> configuration
parameter (it defaults to:
<code>WEB-INF/resources/cms-item-adapters.xml</code>).
<p/>
This file contains a set of <code>contexts</code>, which are
used to group adapter directives for different purposes. The
<code>context</code> used for indexing is
<code>com.arsdigita.cms.search.ContentPageMetadataProvider</code>.
Within this <code>context</code> there are a number of
<code>adapters</code>. Each <code>adapter</code> specifies a
list of attributes and associations that should be handled by
the DomainObjectTraversal for the given <code>object
type</code>. Each of these lists also has an adapter
<code>rule</code>. If the <code>rule</code> equals
<code>include</code>, then the list is an inclusion list, and
only attributes (or associations) listed will be processed by
the DomainObjectTraversal. If the rule equals
<code>exclude</code>, then it is an exclusion list, and all
attributes (or associations) will be processed except for those
listed.
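<p/>
The include/exclude semantics described above can be sketched as
follows. This is a minimal illustration, not the CCM
implementation: the class <code>AdapterRuleSketch</code>, its
methods, and the property names used in <code>main</code> are all
hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of adapter rule semantics; not the actual CCM classes.
final class AdapterRuleSketch {
    enum Rule { INCLUDE, EXCLUDE }

    // INCLUDE: only listed properties are processed.
    // EXCLUDE: every property except the listed ones is processed.
    static boolean isProcessed(Rule rule, Set<String> listed, String property) {
        boolean inList = listed.contains(property);
        return rule == Rule.INCLUDE ? inList : !inList;
    }

    // Returns the subset of properties that a traversal would visit.
    static Map<String, Object> filter(Rule rule, Set<String> listed,
                                      Map<String, Object> properties) {
        Map<String, Object> visited = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            if (isProcessed(rule, listed, entry.getKey())) {
                visited.put(entry.getKey(), entry.getValue());
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Assumed property names, for illustration only.
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("title", "Annual Report");
        properties.put("summary", "Figures for the year");
        properties.put("internalNotes", "draft");
        Set<String> listed = new HashSet<>(Arrays.asList("title", "summary"));

        System.out.println(filter(Rule.INCLUDE, listed, properties).keySet());
        System.out.println(filter(Rule.EXCLUDE, listed, properties).keySet());
    }
}
```

With the same list, the two rules select complementary sets of
properties, which is why a short exclusion list is often more
convenient for object types with many attributes.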
<p/>
When indexing an object for Lucene, the
<code>DomainObjectTraversal</code> converts all processed
attributes into their string values, concatenates them into
one long space-separated string, and passes that string to
Lucene for indexing. The details of how Lucene handles this
data are beyond the scope of this document, but you can read the
documentation available from the official <a
href="http://jakarta.apache.org/lucene/docs/index.html">Lucene</a>
project page for more information.
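<p/>
The concatenation step can be illustrated with a short sketch. It
assumes the traversal has already produced a flat list of
attribute values; the class name <code>LuceneTextSketch</code>
and the decision to skip <code>null</code> values are assumptions
made for this example, not documented behaviour.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch: flattens attribute values into one indexable string.
final class LuceneTextSketch {
    // Converts each value to its string form and joins them with single spaces.
    // Null values are skipped (an assumption for this sketch).
    static String concatenate(List<?> attributeValues) {
        StringJoiner joined = new StringJoiner(" ");
        for (Object value : attributeValues) {
            if (value != null) {
                joined.add(value.toString());
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Illustrative values only; the resulting string is what would be
        // handed to the indexer.
        List<Object> values = Arrays.asList("Annual Report", 2004, null, "en");
        System.out.println(concatenate(values));
    }
}
```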
<p/>
When indexing an object for interMedia, the set of traversed
objects is converted to an XML document. The object being
indexed is processed recursively, with the root object
creating the first element, and all attributes and associated objects
creating new child elements. This XML document is then inserted
into a database table that has been indexed to allow for
efficient searching of XML data.
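<p/>
The recursive conversion can be sketched as below. The sketch
models an object as a nested <code>Map</code>, where a scalar
value stands for an attribute and a nested map stands for an
associated object. The class <code>XmlTraversalSketch</code>, the
element names, and the escaping rules are assumptions for
illustration; the actual element layout produced by CCM may
differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turns a nested map into nested XML elements.
final class XmlTraversalSketch {
    // The root object creates the first element; children are added recursively.
    static String toXml(String rootName, Map<String, Object> root) {
        StringBuilder xml = new StringBuilder();
        append(xml, rootName, root);
        return xml.toString();
    }

    @SuppressWarnings("unchecked")
    private static void append(StringBuilder xml, String name, Object value) {
        xml.append('<').append(name).append('>');
        if (value instanceof Map) {
            // Associated object: each of its properties becomes a child element.
            for (Map.Entry<String, Object> child : ((Map<String, Object>) value).entrySet()) {
                append(xml, child.getKey(), child.getValue());
            }
        } else if (value != null) {
            // Plain attribute: emit its escaped string value as element text.
            xml.append(escape(value.toString()));
        }
        xml.append("</").append(name).append('>');
    }

    // Minimal escaping for element text.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, Object> author = new LinkedHashMap<>();
        author.put("name", "Ann");
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("title", "Tom & Jerry");
        item.put("author", author);
        System.out.println(toXml("item", item));
    }
}
```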
<p/>
interMedia also allows for indexing of raw content: plain text and binary
data (Word, PDF, etc.). When using interMedia, an additional
traversal is performed, collecting any associated text or binary
assets. The rules for which attributes and associations are
visited during this traversal are contained in the
<code>com.arsdigita.cms.search.AssetExtractor</code>
<code>context</code>. The content of the first asset found is
inserted into another column that is indexed using interMedia.
If that content is plain text or one of the interMedia-supported
binary file types, it will also be used when performing
searches. Oracle provides a <a
href="http://download-west.oracle.com/docs/cd/B10501_01/text.920/a96518/afilsupt.htm#CCREF1300">list</a>
of supported document formats.
<h3>Querying</h3>
<h4>Lucene</h4>
Human-entered search terms are passed directly to the Lucene
APIs to be parsed and matched against the search index. The
Lucene documentation explains the <a
href="http://jakarta.apache.org/lucene/docs/queryparsersyntax.html">query
syntax</a> and matching rules in detail.
<h4>interMedia</h4>
interMedia is more restrictive about the queries it will accept,
and an incorrectly formatted query can cause an error. To avoid
this, the human-entered search terms are first "cleaned" by
removing illegal words and characters and separating the
remaining words and phrases (words surrounded by quotes, which
will be searched for as a group) with "and". Illegal characters
are: <code>|&amp;,-*;{}%_$?!()\:@.&lt;&gt;#^+=[]~`</code> and
illegal words are: <code>the, of, to, with, and, or, for,
this</code>. This set of cleaned search terms is then used to
query the columns containing the XML and raw content we
generated earlier. The query is performed using the interMedia
<code>contains()</code> operator. This operator uses the
interMedia indexes on these columns to perform efficient
full-text searching. The rank for each search result is
obtained by multiplying the value of the <code>score()</code>
operator for the XML and raw content columns by the weights
specified in the
<code>waf.search.intermedia.xml_content_weight</code> and
<code>waf.search.intermedia.raw_content_weight</code> configuration
parameters, respectively, and summing the resulting values. Both
config parameters default to <code>1</code>. Detailed
information on the use of the <code>contains()</code> and
<code>score()</code> operators is <a
href="http://download-west.oracle.com/docs/cd/B10501_01/text.920/a96518/csql.htm#21732">provided</a>
by Oracle. Note that, as a result of the "cleaning" of the
search terms, advanced interMedia search features are not
supported. Only simple matching of all specified words and
phrases is currently supported.
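<p/>
The cleaning and ranking steps can be sketched as follows. This is
an approximation built from the description above, not the CCM
source. Several details are assumptions: illegal characters are
deleted outright, illegal words are matched case-insensitively
and only outside quoted phrases, and phrases are emitted without
their quotes. The class <code>QueryCleanerSketch</code> and its
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of search-term cleaning and rank calculation.
final class QueryCleanerSketch {
    // Illegal characters and words as listed in this document.
    private static final String ILLEGAL_CHARS = "|&,-*;{}%_$?!()\\:@.<>#^+=[]~`";
    private static final Set<String> ILLEGAL_WORDS = new HashSet<>(
            Arrays.asList("the", "of", "to", "with", "and", "or", "for", "this"));
    // Group 1: a double-quoted phrase. Group 2: a bare word.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|([^\\s\"]+)");

    // Strips illegal characters and words, then joins what is left with "and".
    static String clean(String rawQuery) {
        List<String> terms = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(rawQuery);
        while (matcher.find()) {
            boolean isPhrase = matcher.group(1) != null;
            String term = stripIllegalChars(isPhrase ? matcher.group(1) : matcher.group(2));
            term = term.trim().replaceAll("\\s+", " ");
            if (term.isEmpty()) {
                continue; // nothing left after stripping
            }
            if (!isPhrase && ILLEGAL_WORDS.contains(term.toLowerCase(Locale.ROOT))) {
                continue; // drop illegal words (assumed: bare words only)
            }
            terms.add(term);
        }
        return String.join(" and ", terms);
    }

    private static String stripIllegalChars(String text) {
        StringBuilder kept = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (ILLEGAL_CHARS.indexOf(c) < 0) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    // Weighted sum of the two score() values, as described above.
    static double rank(double xmlScore, double rawScore,
                       double xmlWeight, double rawWeight) {
        return xmlScore * xmlWeight + rawScore * rawWeight;
    }

    public static void main(String[] args) {
        System.out.println(clean("the \"annual report\" for 2004!"));
        System.out.println(rank(10.0, 20.0, 1.0, 1.0));
    }
}
```

With both weights left at their default of <code>1</code>, the rank
is simply the sum of the two scores; raising one weight favours
matches in the corresponding column.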
<p/>
Because full-text searching with interMedia is done against a
table in the same schema as the rest of the CMS data (as opposed
to the external files used by Lucene), it enables filtering
based on permissions and categories, which is not available with
Lucene. When filtering based on permissions, the search query
is modified to join with the permissions-denormalization tables
and filter out any objects for which the specified user
(generally the currently logged-in user) does not have the
specified privilege. For category-based filtering, the search
query joins with the categorization tables to filter out any
objects not in the specified categories.
</body>
</html>