XQuery/Lucene Search

Motivation

You want to perform a full text keyword search on one or more XML documents. This is done using the Lucene index extensions to eXist.

Background

The Apache Lucene full text search framework was added to eXist 1.4 as a full text index, replacing the previous native full text index. The new Lucene full text search framework is faster, more configurable, and more feature-rich than eXist's legacy full text index. It will also be the basis for an implementation of the W3C's full text extensions for XQuery.

eXist assigns a distinct node-id to each node in an XML document. This node-id is used as the Lucene document ID in the Lucene index files; in other words, each indexed XML node becomes a Lucene document. As a result, you can control the search weight of keyword matches at the level of individual nodes. For example, a keyword match within a title can be given a higher score than a match in the body of a document, so that, when searching a large number of documents, hits in document titles are more likely to be ranked first. Because document structure is retained, searches can achieve higher precision and recall than in search systems that discard it.

eXist and Lucene Documentation

The following is the eXist documentation on how to use Lucene:

eXist-db Lucene Documentation

eXist supports the full Lucene Query Parser Syntax (with the exception of "fielded search"):

Lucene Query Parser Syntax

Sample XML File

<test>
    <p n="1">this paragraph tests the things made up by
      <name>ron</name>.</p>
    <p n="2">this paragraph tests the other issues made up by
      <name>edward</name>.</p>
</test>

Setting up a Lucene Index

In order to perform Lucene-indexed, full text searching of this document, we need to create an index configuration file, collection.xconf, describing which elements and attributes should be indexed, and the various details of that indexing:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
      <!-- Enable the legacy full text index for comparison with Lucene -->
      <fulltext default="all" attributes="no"/>
      <!-- Lucene index is configured below -->
      <lucene>
        <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
        <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
        <text match="//test"/>
      </lucene>
    </index>
</collection>

Notes:

If your test data are saved in /db/test, you should save collection.xconf in /db/system/config/db/test. Index configuration files are always saved in a collection hierarchy under /db/system/config/db that mirrors the collection hierarchy under /db.
After you create or update this index configuration file, you will need to reindex the data. You can do this either by using the eXist Java-based admin client, selecting the test collection and choosing "Reindex collection", or by using the xmldb:reindex() function, supplying xmldb:reindex('/db/test') in eXide or in the XQuery Sandbox.
Although the legacy full text index is not needed for Lucene-based search, we have explicitly enabled it here for this example configuration in order to point out the expressive similarities between the Lucene and legacy search functions/operators (i.e. Lucene's ft:query() vs. the legacy full text index's &=, |=, near(), text:match-all(), text:match-any()).
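
The storing and reindexing steps described in these notes can also be scripted. The following is a minimal sketch, assuming it is run by a user with write access to /db/system/config (typically a DBA) and that the xmldb module functions xmldb:create-collection() and xmldb:store() are available in your eXist version. The configuration shown is a reduced variant of the one above, without the legacy full text index.

```xquery
xquery version "1.0";

(: Sketch: store an index configuration for /db/test and reindex.
   Assumption: the current user may write to /db/system/config. :)
let $config :=
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <lucene>
                <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
                <text match="//test"/>
            </lucene>
        </index>
    </collection>
(: create /db/system/config/db/test step by step, mirroring /db/test :)
let $db-col := xmldb:create-collection('/db/system/config', 'db')
let $config-col := xmldb:create-collection($db-col, 'test')
let $stored := xmldb:store($config-col, 'collection.xconf', $config)
(: rebuild the indexes for the data collection :)
return xmldb:reindex('/db/test')
```

The two-step collection creation is deliberate: it avoids relying on nested paths being created in a single call.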

Indexing Strategies

You can either define a Lucene index on a single element or attribute name (qname="...") or on a node path (match="...").

If you define an index on a qname, such as <text qname="test"/>, an index is created on <test> alone. What is passed to Lucene is the string value of <test>, which includes the text of all its descendant text nodes. With such an index, one cannot search for the nodes below <test>, e.g. for <p> or <name>, since such nodes have all been collapsed. If you want to be able to query descendant nodes, you should set up additional indexes on these, such as <text qname="p"/> or <text qname="name"/>.
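
As an illustration of the qname approach, a configuration along the following lines would index <test>, <p> and <name> separately, so that each of them can be queried with ft:query(). This is a sketch for the sample document above, not a recommended general setup.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
      <lucene>
        <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
        <!-- one index per element name; each indexes the element's full string value -->
        <text qname="test"/>
        <text qname="p"/>
        <text qname="name"/>
      </lucene>
    </index>
</collection>
```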

If you define an index on a node path, as above with <text match="//test"/>, the node structure below <test> is maintained in the index and you can still query descendant nodes, such as <p> or <name>. This can be seen as a shortcut to establishing an index on all elements below <test>. Be aware that, according to the documentation, this feature is "subject to change" [1].

When deciding which approach to use, you should consider which parts of your document will be of interest as context for full text query. How narrow or broad to make it is best decided when considering concrete search scenarios.

Standard Lucene query syntax

eXist can process Lucene searches expressed in two kinds of query syntax, Lucene's standard query syntax and an XML syntax specific to eXist. In this section the standard query syntax is presented. This is the syntax one can expect a user to input in a search field.

A search for "Ron" in the current context will be expressed as [ft:query(., 'ron')]. The first argument holds the nodes to be searched, here ".", the current context node. The second argument supplies the query string, here simply the word "ron".

The ft:query() function allows the use of Lucene wildcards.

"?" can be used for a single character and "*" for zero or more characters: "edward" is found with "ed?ard" and "e*d". Lucene standard query syntax does not allow "*" or "?" at the beginning of a word. In eXist, however, it is possible to pass an option to the query that allows leading wildcards in searches; see eXist Lucene Documentation.

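A leading wildcard query might look like the sketch below. It assumes that your eXist version supports the three-argument form of ft:query() with an <options> element containing <leading-wildcard>, as described in the eXist Lucene documentation; check that documentation for the options available in your release.

```xquery
xquery version "1.0";

(: Assumption: ft:query() accepts an options element as its third argument :)
let $options :=
    <options>
        <leading-wildcard>yes</leading-wildcard>
    </options>
return
    (: "*ward" starts with a wildcard, which standard Lucene syntax rejects by default :)
    collection('/db/test')//name[ft:query(., '*ward', $options)]
```

Against the sample document, this is intended to find the <name> element containing "edward".
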
Fuzzy searches, with "~" at the end of a word, make it possible to retrieve "ron" through "don~". One can quantify the fuzziness by appending a minimum similarity between 0.0 and 1.0, making it possible to retrieve "ron" by [ft:query(., 'don~0.6')], but not by [ft:query(., 'don~0.7')]. The amount of fuzziness is based on the Levenshtein distance, or edit distance, algorithm [2]. The default is 0.5.
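
To see why 0.6 matches and 0.7 does not, it can help to compute the similarity by hand. The sketch below assumes the formula used by older Lucene versions, similarity = 1 - (edit distance / length of the shorter word), with no required prefix. Both local: functions are illustrative helpers written for this example; they are not part of eXist or Lucene.

```xquery
xquery version "1.0";

(: Naive recursive Levenshtein distance; suitable for short words only :)
declare function local:lev($a as xs:string, $b as xs:string) as xs:integer {
    if ($a = '') then string-length($b)
    else if ($b = '') then string-length($a)
    else
        let $cost := if (substring($a, 1, 1) = substring($b, 1, 1)) then 0 else 1
        return min((
            local:lev(substring($a, 2), $b) + 1,                     (: deletion :)
            local:lev($a, substring($b, 2)) + 1,                     (: insertion :)
            local:lev(substring($a, 2), substring($b, 2)) + $cost    (: substitution or match :)
        ))
};

(: Assumed formula: 1 - distance / length of the shorter word.
   Both arguments must be non-empty. :)
declare function local:similarity($a as xs:string, $b as xs:string) as xs:decimal {
    1 - local:lev($a, $b) div min((string-length($a), string-length($b)))
};

local:similarity('don', 'ron')
```

"don" and "ron" differ by one substitution and are three characters long, so the similarity is 1 - 1/3, roughly 0.67. Under this assumed formula that lies above a threshold of 0.6 and below one of 0.7, which is consistent with the behaviour described above.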

The boolean operators "AND" and "OR" can be used, with the expected semantics. There is a variant notation for this: [ft:query(., 'edward AND ron')] can also be written [ft:query(., '+edward +ron')]. [ft:query(., '+edward ron')] would require "edward", but not "ron", to be present. "NOT" can also be used: [ft:query(., 'edward NOT ron')] finds "edward" without "ron". "NOT" can also be represented with "-": [ft:query(., '+edward -ron')]. Operators can be grouped with parentheses, as in [ft:query(., '(edward OR ron) NOT things')].

Phrases can be searched for by putting them in quotation marks: [ft:query(., '"other issues"')].

Fields, proximity searches, range searches, boosting, and escaped reserved characters are not supported in eXist with queries using Lucene's standard query syntax. Boosting can be effected during indexing: eXist Lucene Documentation.

See Lucene Query Parser Syntax

Indexing

Since we have indexed the <test> element as a path, the index includes descendant nodes, and queries for nested elements therefore also return hits:

    collection('/db/test')/test/p/name[ft:query(., 'edward')]
    collection('/db/test')/test/p[ft:query(name, 'edward')]

If we had indexed the qname test with <text qname="test"/>, we would not be able to do so.

Stopwords

The standard Lucene analyzer, activated in the above collection.xconf file with <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>, applies the Lucene default list of English stop words and removes the following words from the index: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.

If you wish to customize the list of stopwords, list the stopwords you want to apply in a text file, one per line, and specify an analyzer that points to the absolute location of that file on your file system. Then reindex the collection.

<analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
	<param name="stopwords" type="java.io.File" value="/tmp/stop.txt"/>
</analyzer>

If you wish to make all words searchable, you can leave the stop.txt empty or omit the reference to stop.txt:

<analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
	<param name="stopwords" type="java.io.File"/>
</analyzer>

After making these changes, restart eXist and reindex.

Ranking

Lucene assigns a relevance score or rank to each match. The more frequently a word occurs in a document, the higher the score. This score is preserved by eXist and can be accessed through the ft:score() function, which returns a decimal value.

 for $m in collection('/db/test')//p[ft:query(., 'tests ron')]
   let $score := ft:score($m)
   order by $score descending
   return 
   <hit score="{$score}">{$m}</hit>

The higher the score, the more relevant is the hit.

Boosting Values

The configuration file can be set up to apply higher search weights to specific elements within your document. So, for example, a keyword match in the title of a book will rank that hit higher than a match in the body of the book.
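
As a sketch of what such a configuration might look like for the sample document, the fragment below gives matches in <name> a higher weight than matches in <p>. It assumes that the <text> element accepts a boost attribute, as described in the eXist Lucene documentation; the value 2.0 is an arbitrary choice for illustration.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
      <lucene>
        <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
        <text qname="p"/>
        <!-- assumption: boost raises the score of matches found in <name> -->
        <text qname="name" boost="2.0"/>
      </lucene>
    </index>
</collection>
```

Because boosting is applied at indexing time, the collection has to be reindexed after changing the boost value.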

Legacy Full Text Search Vs. Lucene XML Search

The following queries are equivalent (apart from the index used):

Matching any terms

To express the "match any" (|=) legacy style full text query using the new Lucene query function:

 collection('/db/test')//p[. |= 'tests edward']

you would use the following:

 collection('/db/test')//p[ft:query(.,
   <query>
     <bool>
       <term occur="should">tests</term>
       <term occur="should">edward</term>
     </bool>
   </query>)]

Matching all terms

To express the "match all" (&=) legacy full text query using the new Lucene query function:

 collection('/db/test')//p[. &= 'tests edward']

you would use the following:

 collection('/db/test')//p[ft:query(., 
   <query>
     <bool>
       <term occur="must">tests</term>
       <term occur="must">edward</term>
     </bool>
   </query>)]

Matching no terms

To express the "match none" (not + |=) legacy full text query using the new Lucene query function:

 collection('/db/test')//p[not(. |= 'issues edward')]

you would use the following:

 collection('/db/test')//p[not(ft:query(., 
   <query>
     <bool>
       <term occur="should">issues</term>
       <term occur="should">edward</term>
     </bool>
   </query>))]

Note that the last one could not be expressed as:

 collection('/db/test')//p[ft:query(., 
   <query>
     <bool>
       <term occur="not">issues</term>
       <term occur="not">edward</term>
     </bool>
   </query>)]

because Lucene's NOT operator can't be used on its own, without the presence of a 'positive' search term.
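
By contrast, a "not" term is accepted when it is combined with at least one positive term. The following sketch, written against the sample document above, is meant to select paragraphs that contain "tests" but not "edward":

```xquery
xquery version "1.0";

collection('/db/test')//p[ft:query(.,
   <query>
     <bool>
       <!-- positive term: required for the "not" term to be usable -->
       <term occur="must">tests</term>
       <term occur="not">edward</term>
     </bool>
   </query>)]
```

In standard Lucene syntax the same query can be written as '+tests -edward'.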

XML Query Syntax vs. Default Lucene Syntax

The following queries are equivalent and can be tested against the Shakespeare examples shipped with eXist by supplying them as the value of $query in this XQuery snippet:

declare option exist:serialize "highlight-matches=both";
let $query := 'query'
return //SPEECH[ft:query(., $query)]

search type	Lucene syntax	XML syntax
'atomic', match any term	fillet snake	<query> <bool> <term>fillet</term> <term>snake</term> </bool> </query>
'atomic', match all terms	+fillet +snake	<query> <bool> <term occur="must">fillet</term> <term occur="must">snake</term> </bool> </query>
'atomic', match only some terms	-fillet +snake	<query> <bool> <term occur="not">fillet</term> <term occur="must">snake</term> </bool> </query>
'atomic', with wildcards	+fillet +sn*e	<query> <bool> <term occur="must">fillet</term> <wildcard occur="must">sn*e</wildcard> </bool> </query>
'atomic', with regex		<query> <bool> <term occur="must">fillet</term> <regex occur="must">sn.*e</regex> </bool> </query>
phrase search	"fillet snake"	<query> <phrase>fillet snake</phrase> </query> or <query> <near>fillet snake</near> </query>
proximity search	"fillet snake"~1	<query> <near slop="3"> <term>fillet</term> <term>snake</term> </near> </query>
proximity search, unordered		<query> <near slop="1" ordered="no"> <term>snake</term> <term>fillet</term> </near> </query>
fuzzy search, no similarity parameter	snake~	<query> <fuzzy>snake</fuzzy> </query>
fuzzy search, with similarity parameter	snake~0.3	<query> <fuzzy min-similarity="0.3">snake</fuzzy> </query>

Mind the gaps in the table above! In standard Lucene syntax you can't express:

regular expressions: this is a unique feature of eXist's XML query syntax, by means of the <regex> element
ordering of proximity search terms: this is a unique feature of eXist's XML query syntax, by means of the @ordered attribute on <near>

Finally, a more complex case, in which boolean operators are grouped to override the default priority rules:

search type	Lucene syntax	XML syntax
groups of boolean search operators	(fillet OR malice) AND snake	<query> <bool> <bool occur="must"> <term occur="should">fillet</term> <term occur="should">malice</term> </bool> <term occur="must">snake</term> </bool> </query>

Note how:

grouping in standard Lucene syntax can be expressed with nesting in XML syntax
for nested <bool> operators, the @occur attribute can be specified as well

Notes on Using Wildcards

Note that if your query string contains a wildcard, the XML syntax requires the word containing the wildcard to be enclosed in a <wildcard> element rather than a <term> element.

The following:

  //SPEECH[ft:query(., 'fenny sna*')]

is equivalent to:

xquery version "1.0";

let $query :=
<query>
  <term>fenny</term>
  <wildcard>sna*</wildcard>
</query>
return
   //SPEECH[ft:query(., $query)]

References

eXist Lucene XML Syntax blog posting by Ron Van den Branden