Search index configuration file

Jackrabbit allows you to control which properties of a node are indexed and how much they affect that node’s jcr:score value in the result. You also have the option to configure different analyzers on a property-by-property basis. The index configuration file defines how Lucene indexes the content of a workspace.

Making changes to the JCR search indexing configuration impacts modules that depend on it, including the User Result Ranker module.

Summary

Apache Lucene is a Java-based indexing and search technology with spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. This page is based on:

Indexing configuration

The configuration parameter indexingConfiguration isn’t set by default. This means all properties of a node are indexed. To configure the indexing behavior, add a parameter to the SearchIndex element in your repository or workspace configuration file.

For more, see IndexingConfiguration.
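
As a minimal sketch, the parameter could be added to the SearchIndex element as shown here. The class name and the path parameter follow a typical Jackrabbit setup. The file name indexing_configuration.xml is an assumption for illustration, so replace it with the name of your own file.

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Assumed file name; the package path matches info.magnolia.jackrabbit. -->
  <param name="indexingConfiguration" value="/info/magnolia/jackrabbit/indexing_configuration.xml"/>
</SearchIndex>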

Configuration files

The indexing configuration file should be located in the package info.magnolia.jackrabbit.

  • Index Rules

  • Aggregates

  • Analyzers

To optimize the index size, you can index only certain properties of a node type. Index rules are processed top-down, and the first matching rule is applied while all remaining ones are ignored.

As of Jackrabbit 2.0, you can also use the match-all regular expression (.*) for the namespace prefix part of a property name. This is currently the only regular expression supported for the prefix. Note that you must declare, in the configuration element, every namespace prefix that you use in the XML file.
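
The following sketch shows where the namespace declarations go. The namespace URIs are the ones commonly registered for these prefixes and are stated here as an assumption, so verify them against the namespaces registered in your repository.

<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:mgnl="http://www.magnolia.info/jcr/mgnl">
  <!-- Every prefix used in nodeType attributes or property names must be declared above. -->
  <index-rule nodeType="nt:base">
    <!-- Match-all regex in the prefix part: any namespace, local name "title". -->
    <property isRegexp="true">.*:title</property>
  </index-rule>
</configuration>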

Node scope and excluding nodes

With the nodeScopeIndex attribute set to false, the property isn’t included in the node-scope full-text index. The property is still available to all searches except those that use contains(…) in SQL and SQL2, or jcr:contains(…) in XPath.

In the example below, an index rule is applied against nodes of type nt:base. The rule also applies to nodes whose types extend nt:base, so it applies everywhere, because nt:base is the base node type of all primary node types. To minimize the index and speed up search, all properties starting with jcr: or mgnl: are excluded from the full-text index. This means you get fewer results, but those results are more relevant.

<index-rule nodeType="nt:base">
  <property isRegexp="true" nodeScopeIndex="false">mgnl:.*</property> <!-- Exclude Magnolia metadata from the full-text index. -->
  <property isRegexp="true" nodeScopeIndex="false">jcr:.*</property> <!-- Exclude JCR metadata from the full-text index. -->
  <property isRegexp="true">.*:.*</property> <!-- Include all properties from any namespace, even the empty namespace. -->
</index-rule>

Conditions

You may also add a condition to the index rule and have multiple rules with the same node type.

For example, you only want to boost page titles when the page is marked with a priority property. Furthermore, you are required to provide three priority levels: low, medium, and high.

<!-- Since the default boost is 1.0, we don't need to specify it. Anything not medium or high is considered low. -->
<index-rule nodeType="mgnl:page"
            condition="@priority = 'medium'">
  <property boost="3.0">title</property>
</index-rule>
<index-rule nodeType="mgnl:page"
            condition="@priority = 'high'">
  <property boost="5.0">title</property>
</index-rule>

Boost

You can configure a boost value on nodes or properties that match an index rule. The default boost value is 1.0. Higher boost values (a reasonable range is 1.0 to 5.0) yield a higher score, so the matching nodes appear more relevant in the results.

A boost value of 3.0 is added to the title property on nodes of type mgnl:page in the example below.

<index-rule nodeType="mgnl:page">
  <property boost="3.0">title</property>
</index-rule>

Generally, it can be helpful to include the contents of descendant nodes in a single node to facilitate searching content scattered across multiple nodes.

Including areas and components

The configuration uses index aggregates to ensure area and component content is included in the index. The properties of mgnl:area and mgnl:component comprise most of the page content and must be included explicitly. Nested areas are also included using the recursive flag.

The example below creates an index aggregate on mgnl:page that includes the content of mgnl:area and mgnl:component. This makes searching content on a page in one of its area or component subnodes easier.

<aggregate primaryType="mgnl:page">
  <include primaryType="mgnl:area">*</include>
  <include primaryType="mgnl:component">*</include>
</aggregate>

<!-- areas can be nested -->
<aggregate primaryType="mgnl:area" recursive="true">
    <include primaryType="mgnl:component">*</include>
    <include primaryType="mgnl:area">*</include>
</aggregate>

With the analyzers part of the configuration, you can define how a property should be analyzed.

For example, you can analyze properties that store German-language content with a German-language analyzer.

<analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
   <property>text_de</property>
</analyzer>

Custom configuration file

You can create a custom indexing configuration for any workspace. Once created, the file can be configured in the workspace.xml file of the workspace you wish to target. Changes to this configuration require reindexing the workspace.

Examples include the website-specific configuration above and the DAM-specific configuration.

This DAM example shows node data aggregation. Since the Magnolia metadata is stored in the mgnl:asset node and the image metadata/data is stored in an mgnl:resource subnode, you can aggregate it into one Lucene document.
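
The aggregation described above could be sketched as follows. This is an illustration built from the node types named in the text, not a copy of the shipped DAM configuration, which may differ.

<!-- Assumed sketch: fold the mgnl:resource subnode into the Lucene document of its mgnl:asset parent. -->
<aggregate primaryType="mgnl:asset">
  <include primaryType="mgnl:resource">*</include>
</aggregate>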

Index configuration parameters

Performance

You can tune indexing performance with the following parameters.

For more performance ideas, see Improve Indexing Speed in the Lucene documentation.

Parameter Description

useCompoundFile

optional

All files belonging to a segment have the same name with varying extensions. When using the Compound File format, these files are collapsed into a single .cfs file. Useful for systems that frequently run out of file handles.

minMergeDocs

optional

This setting no longer exists in Lucene 3.x.

volatileIdleTime

optional

The Lucene indexer doesn’t write changes to the permanent index immediately. At first, the indexer writes the changes to a volatile index. Once the volatile index reaches a certain size, it’s persisted to the permanent index. This parameter sets an idle time, in seconds, after which the volatile index is persisted even if it hasn’t reached that size.

maxMergeDocs

optional

While merging segments, Lucene ensures that no segment with more than maxMergeDocs documents is created.

mergeFactor

optional

This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene stores 10 documents in memory before writing them to a single segment on the disk.

maxFieldLength

optional

Deprecated in Lucene 3.x.

bufferSize

optional

Maximum number of documents that are held in a pending queue until added to the index.

cacheSize

optional

Size of the document number cache. This cache maps UUIDs to Lucene document numbers. If the cache hit rate is poor, increasing this number could help.

maxVolatileIndexSize

optional

The maximum volatile index size in bytes until it’s written to disk. The default value is 1MB.

maxHistoryAge

optional

The maximum age (in seconds) of the index history. The default value is 0, which means that index commits are deleted as soon as they’re not used anymore.

initializeHierarchyCache

optional

With the default value of true, the hierarchy cache is initialized on startup and control is only given back when the initialization has completed.

When set to false, the cache is populated during regular use.
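
As a rough sketch, the performance parameters are set as param elements inside the SearchIndex element. The values for mergeFactor, maxVolatileIndexSize, maxHistoryAge, and initializeHierarchyCache repeat the defaults described above. The remaining values are illustrative assumptions, not recommendations.

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <param name="useCompoundFile" value="true"/>          <!-- illustrative value -->
  <param name="volatileIdleTime" value="3"/>            <!-- seconds; illustrative value -->
  <param name="mergeFactor" value="10"/>
  <param name="cacheSize" value="1000"/>                <!-- illustrative value -->
  <param name="maxVolatileIndexSize" value="1048576"/>  <!-- 1 MB in bytes -->
  <param name="maxHistoryAge" value="0"/>
  <param name="initializeHierarchyCache" value="true"/>
</SearchIndex>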

Consistency

Repository consistency settings are covered in more detail in the Troubleshooting section.

Parameter Description

forceConsistencyCheck

optional

Runs a consistency check on every startup.

If false, a consistency check is performed only when the search index detects a prior forced shutdown. Because a consistency check can delay system startup, force it only when you suspect a search index inconsistency, for example after a "node not found" error. Such an error occurs when a UUID exists in the search index but the corresponding node isn’t found. Conversely, a node may exist without being recorded in the index. In both cases, the index is inconsistent with the data.

autoRepair

optional

If true, errors detected by a consistency check are automatically repaired. If false, errors are only written to the log.

enableConsistencyCheck

optional

If set to true, a consistency check is performed depending on the parameter forceConsistencyCheck.

If set to false, no consistency check is performed on startup, even if a redo log was applied.

redoLogFactoryClass

optional

The name of the class that implements RedoLogFactory.

A redo log keeps track of changes that haven’t been committed to disk. While nodes are added to and removed from the volatile index (held in memory), the redo log records those changes. If the Jackrabbit process terminates unexpectedly, the redo log is applied the next time Jackrabbit starts.

The default value is DefaultRedoLogFactory.
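
For a one-off repair run, the consistency parameters could be combined as in the sketch below. The values are an assumption chosen for illustration. Because a forced check can delay startup, you would typically revert forceConsistencyCheck to false afterwards.

<!-- Allow consistency checks at all. -->
<param name="enableConsistencyCheck" value="true"/>
<!-- Run the check on every startup, not only after a forced shutdown. -->
<param name="forceConsistencyCheck" value="true"/>
<!-- Repair detected errors instead of only logging them. -->
<param name="autoRepair" value="true"/>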

Queries

Parameter Description

queryClass

optional

Class used to perform JCR Queries. QueryImpl provides the default implementation for a JCR query. Raising the log level on QueryImpl to DEBUG prints query execution times to the log.

respectDocumentOrder

optional

If true and the query doesn’t contain an order by clause, result nodes are in document order (the order in which they were indexed by the system).

resultFetchSize

optional

The number of results the query handler should initially fetch when a query is executed. Keep in mind that ACL checks must be performed on the result set. The larger the set, the more time it takes to load and check.

termInfosIndexDivisor

optional

An index divisor for TermInfosReader. When opening a reader, you can use it to sub-sample the termIndexInterval further and use less RAM. The default value of 1 loads all terms into RAM. A value of 2 loads every other term into RAM, with the trade-off that you might have to scan twice. See LUCENE-1052.
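
The query-related parameters above could be set as in this sketch. Only the termInfosIndexDivisor value reflects the default stated in the text. The other values are assumptions for illustration.

<!-- Return results in document order when the query has no order by clause. -->
<param name="respectDocumentOrder" value="true"/>
<!-- Illustrative value: a smaller initial fetch means fewer ACL checks up front. -->
<param name="resultFetchSize" value="100"/>
<!-- 1 loads all terms into RAM; 2 would load every other term. -->
<param name="termInfosIndexDivisor" value="1"/>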

Extraction

Parameter Description

extractorPoolSize

optional

Defines the maximum number of background threads that are used to extract text from binary properties. If set to 0, no background threads are allocated and text extractors run in the current thread. In Jackrabbit 1.5 and later, the default is twice the number of available processors.

extractorTimeout

optional

A text extractor is executed using a background thread if it doesn’t finish within this timeout (defined in milliseconds). This parameter has no effect if extractorPoolSize is 0.

extractorBackLogSize

optional

The size of the extractor pool back log. If all threads in the pool are busy, incoming work is put into a wait queue. If the wait queue reaches the back log size, incoming extractor work isn’t queued anymore but is executed with the current thread.

maxExtractLength

optional

The maximum length of the extracted text. Positive values are used as they are; negative values are interpreted as factors of the maxFieldLength parameter.

forkJavaCommand

optional

Java command used to fork external parser processes, or null (the default) for in-process text extraction. Use this to better control system stability and reliability by forcing indexing of binary documents into separate JVM processes. Any problems caused by parsing large or malformed documents don’t affect the main process.

  • Linux: nice java -Xmx512m

  • Windows: cmd /c start /low /wait /b java -Xmx512m
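
In workspace.xml, the extraction parameters could look like the sketch below. The forkJavaCommand value is the Linux command given above. The pool size, timeout, and backlog values are assumptions for illustration.

<!-- Illustrative values; tune them for your hardware and content. -->
<param name="extractorPoolSize" value="2"/>
<param name="extractorTimeout" value="100"/>       <!-- milliseconds -->
<param name="extractorBackLogSize" value="100"/>
<!-- Fork a low-priority JVM for parsing so that bad documents can't affect the main process. -->
<param name="forkJavaCommand" value="nice java -Xmx512m"/>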

Term identification

You can configure the Lucene index to provide excerpts and highlighting in the search results.

For example, the workspace.xml file in each workspace enables highlighting in search results. The workspace.xml files are in /<CATALINA_HOME>/webapps/<contextPath>/repositories/magnolia/workspaces/<workspace name>. Below is the relevant extract from workspace.xml in the contacts workspace.

<!-- needed to highlight the searched term -->
<param name="supportHighlighting" value="true"/>
<!-- custom provider for getting an HTML excerpt in a query result with rep:excerpt() -->
<param name="excerptProviderClass" value="info.magnolia.jackrabbit.lucene.SearchHTMLExcerpt"/>

If you have configured your own app that operates on its own workspace and provides content for the website, add these parameters to the SearchIndex element of that workspace’s workspace.xml file to show excerpts and highlighting in web search results.

If you use fields that allow HTML to be stored, that HTML is indexed along with the content. As a result, an excerpt may contain HTML tags that aren’t closed.

Parameter Description

supportHighlighting

optional

If set to true, additional information is stored in the index to support highlighting using the rep:excerpt() function.

excerptProviderClass

optional

The name of the class that implements ExcerptProvider and should be used for the rep:excerpt() function in a query. By default, this is set to SearchHTMLExcerpt.

Parsing

Parameter Description

textFilterClasses

optional

Deprecated in Jackrabbit 2.x. Jackrabbit 2.x introduced Apache Tika as the default parser for binaries. Jackrabbit ships with a default tika-config.xml file that defines which MIME types are parsed and extracted.

tikaConfigPath

optional

Sets the location of the tika-config.xml. For example, ${wsp.home}/tika-config.xml.

See Configuring Tika for some example configurations, such as using the DefaultParser to exclude PDFs and other files.
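
A minimal sketch of the parameter, using the example location mentioned above:

<!-- Points Jackrabbit at a workspace-specific Tika configuration instead of the bundled default. -->
<param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>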

Synonym provider

A synonym provider lets users search with generalized, language-dependent synonyms and, more importantly, domain-specific synonyms such as abbreviations or product names.

Parameter Description

synonymProviderClass

optional

The name of a class that implements SynonymProvider. The default value is null, which means no class set. Jackrabbit provides the PropertiesSynonymProvider which implements a synonym provider based on a properties file. The location of the properties file is specified by the synonymProviderConfigPath.

synonymProviderConfigPath

optional

The path to the synonym provider configuration file. This path is interpreted relative to the path parameter. If there is a FileSystem element inside the SearchIndex element, this path is interpreted relative to the root path of the FileSystem. Whether this parameter is mandatory depends on the synonym provider implementation. The default value is null, which means no configuration path is set.
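
As a sketch, the two synonym parameters could be combined as follows. The fully qualified class name assumes that PropertiesSynonymProvider lives in Jackrabbit’s org.apache.jackrabbit.core.query.lucene package, and the file name synonyms.properties is hypothetical.

<param name="synonymProviderClass" value="org.apache.jackrabbit.core.query.lucene.PropertiesSynonymProvider"/>
<!-- Hypothetical file name; resolved relative to the path parameter or the FileSystem root. -->
<param name="synonymProviderConfigPath" value="/synonyms.properties"/>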

Spellchecking

Parameter Description

spellCheckerClass

optional

The name of a class that implements SpellChecker. No known implementation exists.

Scoring

Parameter Description

similarityClass

optional

The name of a class that extends Similarity. Similarity defines the components of Lucene scoring.
