
Search index configuration file

Jackrabbit allows you to control which properties of a node are indexed and how much they affect that node’s jcr:score value in the result. You also have the option to configure different analyzers on a property-by-property basis. The index configuration file defines how Lucene indexes the content of a workspace.

Making changes to the JCR search indexing configuration impacts modules that depend on it, including the User Result Ranker module.

Summary

Apache Lucene is a Java-based indexing and search technology with spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. This page is based on:

Indexing configuration

The configuration parameter indexingConfiguration isn’t set by default, which means that all properties of a node are indexed. To configure the indexing behavior, add the indexingConfiguration parameter to the SearchIndex element in your repository or workspace configuration file.

For more, see IndexingConfiguration.
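
As a minimal sketch, the parameter could be added to the SearchIndex element as shown below. The class name and the path parameter follow a stock Jackrabbit setup. The file name my_indexing_configuration.xml is hypothetical, and the path assumes the file sits on the classpath in the info.magnolia.jackrabbit package.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Hypothetical file name; point this at your own indexing configuration file. -->
  <param name="indexingConfiguration" value="/info/magnolia/jackrabbit/my_indexing_configuration.xml"/>
</SearchIndex>
```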

Configuration files

The indexing configuration file should be located in the package info.magnolia.jackrabbit.

The file can contain three kinds of elements: index rules, aggregates, and analyzers.

Index rules

To optimize the index size, you can index only certain properties of a node type. Index rules are processed top-down, and the first matching rule is applied while all remaining ones are ignored.

As of Jackrabbit 2.0, you can also use the match-all regular expression for the namespace prefix part of a property name. However, that’s currently the only regular expression supported for the namespace prefix. Note that every namespace prefix you use in the XML file must be declared on the configuration element.
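
To illustrate the namespace declarations, here is a sketch of a configuration root element that declares the nt, jcr, and mgnl prefixes used in the examples on this page. The namespace URIs are the standard JCR ones and the Magnolia one as commonly registered. Verify them against your repository before relying on this.

```xml
<?xml version="1.0"?>
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:mgnl="http://www.magnolia.info/jcr/mgnl">
  <!-- Index rules, aggregates, and analyzers go here. -->
</configuration>
```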

Node scope and excluding nodes

With the nodeScopeIndex attribute set to false, the property is excluded from the full-text index and can’t be queried with contains(…) in SQL or jcr:contains(…) in XPath. Setting nodeScopeIndex to true includes the property in the full-text index. In both cases, direct property constraints such as LIKE or equality remain applicable.

Here, an index rule is applied to nodes of type nt:base. Because the rule also applies to node types that extend nt:base, and nt:base is the base node type of all primary node types, it applies everywhere. To minimize the index and speed up search, all properties starting with jcr: or mgnl: are excluded from the full-text index. This means you get fewer results, but those results are more relevant.

<index-rule nodeType="nt:base">
  <property isRegexp="true" nodeScopeIndex="false">mgnl:.*</property> <!-- Exclude Magnolia metadata from the full-text index. -->
  <property isRegexp="true" nodeScopeIndex="false">jcr:.*</property> <!-- Exclude JCR metadata from the full-text index. -->
  <property isRegexp="true">.*:.*</property> <!-- Include all properties from any namespace, even the empty namespace. -->
</index-rule>

Conditions

You may also add a condition to the index rule and have multiple rules with the same node type.

For example, suppose you want to boost page titles only when the page is marked with a priority property, and you need to support three priority levels: low, medium, and high.

<!-- Since the default boost is 1.0, we don't need to specify it. Anything not medium or high is considered low. -->
<index-rule nodeType="mgnl:page"
            condition="@priority = 'medium'">
  <property boost="3.0">title</property>
</index-rule>
<index-rule nodeType="mgnl:page"
            condition="@priority = 'high'">
  <property boost="5.0">title</property>
</index-rule>

Boost

You can configure a boost value on nodes or on properties that match an index rule. The default boost value is 1.0. Higher boost values (a reasonable range is 1.0 to 5.0) yield a higher score, so matching nodes appear more relevant.

In the example below, a boost value of 3.0 is assigned to the title property on nodes of type mgnl:page.

<index-rule nodeType="mgnl:page">
  <property boost="3.0">title</property>
</index-rule>
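
A boost can also be set at the node level. The sketch below assumes the boost attribute on the index-rule element as described in the Jackrabbit indexing configuration documentation. The value 2.0 is illustrative only.

```xml
<!-- Boost every matching mgnl:page node, and boost its title property further. -->
<index-rule nodeType="mgnl:page" boost="2.0">
  <property boost="3.0">title</property>
</index-rule>
```

Keep in mind that when a node matches an index rule, only the properties listed in that rule are indexed for it, so a rule like this one would typically list every property you still want to be searchable.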

Aggregates

It can be helpful to include the contents of descendant nodes in the index entry of a single node. This makes it easier to search content that is scattered across multiple nodes.

Including areas and components

The configuration uses index aggregates to ensure area and component content is included in the index. The properties of mgnl:area and mgnl:component comprise most of the page content and must be included explicitly. Nested areas are also included using the recursive flag.

The example below creates an index aggregate on mgnl:page that includes the content of mgnl:area and mgnl:component. This makes searching content on a page in one of its area or component subnodes easier.

<aggregate primaryType="mgnl:page">
  <include primaryType="mgnl:area">*</include>
  <include primaryType="mgnl:component">*</include>
</aggregate>

<!-- Areas can be nested. -->
<aggregate primaryType="mgnl:area" recursive="true">
  <include primaryType="mgnl:component">*</include>
  <include primaryType="mgnl:area">*</include>
</aggregate>

Analyzers

With analyzers, you can define how a property should be analyzed.

For example, you can target properties that store German-language content with a German-language analyzer.

<analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
   <property>text_de</property>
</analyzer>
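
In a complete Jackrabbit indexing configuration, analyzer elements are grouped inside an analyzers element. The sketch below shows that wrapper with two language-specific analyzers. The text_fr property is a hypothetical name added for illustration, and FrenchAnalyzer is assumed to be available on the classpath with your Lucene version.

```xml
<analyzers>
  <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
    <property>text_de</property>
  </analyzer>
  <!-- text_fr is a hypothetical property name. -->
  <analyzer class="org.apache.lucene.analysis.fr.FrenchAnalyzer">
    <property>text_fr</property>
  </analyzer>
</analyzers>
```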

Custom configuration file

You can create a custom indexing configuration for any workspace. Once created, reference the file from the workspace.xml file of the workspace you want to target. Changes to this configuration require reindexing the workspace.

Examples of this are the website-specific configuration shown above and the DAM-specific configuration.

This DAM example shows node data aggregation. Since the Magnolia metadata is stored in the mgnl:asset node and the image metadata/data is stored in an mgnl:resource subnode, you can aggregate it into one Lucene document.
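
The DAM configuration itself isn’t reproduced here. As a rough sketch only, an aggregate of this kind could look as follows. It reuses the include pattern from the page example above and assumes the resource is a direct child of the asset node. The actual DAM configuration may differ.

```xml
<!-- Sketch: index each mgnl:resource child together with its parent mgnl:asset. -->
<aggregate primaryType="mgnl:asset">
  <include primaryType="mgnl:resource">*</include>
</aggregate>
```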

Index configuration parameters

Performance

You can tune indexing performance with the following parameters.

For more performance ideas, see Improve Indexing Speed in the Lucene documentation.

Parameter	Description
`useCompoundFile`	Optional. All files belonging to a segment have the same name with varying extensions. When using the compound file format, these files are collapsed into a single `.cfs` file. Useful for systems that frequently run out of file handles.
`minMergeDocs`	Optional. This setting no longer exists in Lucene 3.x.
`volatileIdleTime`	Optional. The Lucene indexer doesn’t write changes to the permanent index immediately. It first writes them to a volatile index, which is persisted to the permanent index once it reaches a certain size. This parameter sets an idle time, in seconds, after which the volatile index is persisted even if that size hasn’t been reached.
`maxMergeDocs`	Optional. While merging segments, Lucene ensures that no segment with more than `maxMergeDocs` documents is created.
`mergeFactor`	Optional. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene stores 10 documents in memory before writing them to a single segment on the disk.
`maxFieldLength`	Optional. Deprecated in Lucene 3.x.
`bufferSize`	Optional. Maximum number of documents that are held in a pending queue until they’re added to the index.
`cacheSize`	Optional. Size of the document number cache. This cache maps UUIDs to Lucene document numbers. If the hit rate of the document number cache is poor, increasing this number could help.
`maxVolatileIndexSize`	Optional. The maximum size of the volatile index, in bytes, before it’s written to disk. The default value is 1 MB.
`maxHistoryAge`	Optional. The maximum age (in seconds) of the index history. The default value is 0, which means that index commits are deleted as soon as they’re no longer used.
`initializeHierarchyCache`	Optional. With the default value of `true`, the hierarchy cache is initialized on startup and control is only given back when the initialization has completed. When set to `false`, the cache is populated during regular use.
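
As a sketch of how such parameters are set, the fragment below adds a few of them to a SearchIndex element. The values are illustrative assumptions, not recommendations. Measure on your own system before changing them.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Illustrative values only. -->
  <param name="useCompoundFile" value="true"/>
  <param name="mergeFactor" value="10"/>
  <param name="volatileIdleTime" value="3"/>
  <param name="cacheSize" value="1000"/>
  <!-- 1048576 bytes = 1 MB -->
  <param name="maxVolatileIndexSize" value="1048576"/>
  <param name="initializeHierarchyCache" value="false"/>
</SearchIndex>
```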

Consistency

Repository consistency settings are covered in more detail in the Troubleshooting section.

Parameter	Description
`forceConsistencyCheck`	Optional. If `true`, a consistency check runs on every startup. If `false`, a consistency check is only performed when the search index detects a prior forced shutdown. Because a consistency check can delay the start of the system, force it only when you suspect a search index inconsistency, for example after a "node not found" error. The index is inconsistent with the data when a UUID exists in the search index but the corresponding node isn’t found, or when a node exists but isn’t recorded in the index.
`autoRepair`	Optional. If `true`, errors detected by a consistency check are automatically repaired. If `false`, errors are only written to the log.
`enableConsistencyCheck`	Optional. If set to `true`, a consistency check is performed depending on the parameter `forceConsistencyCheck`. If set to `false`, no consistency check is performed on startup, even if a redo log was applied.
`redoLogFactoryClass`	Optional. The name of the class that implements `RedoLogFactory`. A redo log keeps track of changes that haven’t been committed to disk. While nodes are added to and removed from the volatile index (held in memory), a redo log is maintained to keep track of the changes. If the Jackrabbit process terminates unexpectedly, the redo log is applied the next time Jackrabbit is restarted. The default value is `DefaultRedoLogFactory`.
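
For example, to run a one-off check after a suspected inconsistency, the parameters could be combined as sketched below. This is an illustration under the descriptions above. Setting `forceConsistencyCheck` back to `false` afterwards avoids the startup delay on subsequent restarts.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Allow consistency checks at startup. -->
  <param name="enableConsistencyCheck" value="true"/>
  <!-- Force a check on the next startup; revert to false once the index is repaired. -->
  <param name="forceConsistencyCheck" value="true"/>
  <!-- Repair detected errors instead of only logging them. -->
  <param name="autoRepair" value="true"/>
</SearchIndex>
```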

Search

Parameter	Description
`queryClass`	Optional. Class used to perform JCR queries. `QueryImpl` provides the default implementation for a JCR query. Raising the log level on `QueryImpl` to `DEBUG` prints query execution times to the log.
`respectDocumentOrder`	Optional. If `true` and the query doesn’t contain an `order by` clause, result nodes are returned in document order (the order in which they were indexed by the system).
`resultFetchSize`	Optional. The number of results the query handler should initially fetch when a query is executed. Keep in mind that ACL checks must be performed on the result set. The larger the set, the more time it takes to load and check.
`termInfosIndexDivisor`	Optional. An `indexDivisor` for `TermInfosReader` that lets you further sub-sample the `termIndexInterval` when a reader is opened, so that it uses less RAM. Set to `1` by default, meaning all terms are loaded into RAM. A value of `2` loads every other term into RAM; the trade-off is that term lookups might have to scan more terms. See LUCENE-1052.
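
A sketch of search-related settings follows. The values are assumptions chosen for illustration. A small `resultFetchSize` limits how many results are loaded and ACL-checked up front.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Don't force document order when the query has no "order by" clause. -->
  <param name="respectDocumentOrder" value="false"/>
  <!-- Illustrative value: fetch the first 100 results initially. -->
  <param name="resultFetchSize" value="100"/>
</SearchIndex>
```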

Extraction

Parameter	Description
`extractorPoolSize`	Optional. Defines the maximum number of background threads that are used to extract text from binary properties. If set to `0`, no background threads are allocated and text extractors run in the current thread. If you are using Jackrabbit version 1.5 or later, the default is twice the number of available processors.
`extractorTimeout`	Optional. A text extractor is executed using a background thread if it doesn’t finish within this timeout (defined in milliseconds). This parameter has no effect if `extractorPoolSize` is `0`.
`extractorBackLogSize`	Optional. The size of the extractor pool backlog. If all threads in the pool are busy, incoming work is put into a wait queue. If the wait queue reaches the backlog size, incoming extractor work is no longer queued but is executed in the current thread.
`maxExtractLength`	Optional. The maximum number of characters extracted from a binary property. Positive values are used as they are; negative values are interpreted as factors of the `maxFieldLength` parameter.
`forkJavaCommand`	Optional. Java command used to fork external parser processes, or `null` (the default) for in-process text extraction. Use this to better control system stability and reliability by forcing indexing of binary documents into separate JVM processes. Any problems caused by parsing large or malformed documents then don’t affect the main process. Example for Linux: `nice java -Xmx512m`. Example for Windows: `cmd /c start /low /wait /b java -Xmx512m`.
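
The fragment below sketches how the extraction parameters might be combined. The pool size, timeout, and backlog values are illustrative assumptions. The `forkJavaCommand` value is the Linux example from the table.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Illustrative values only. -->
  <param name="extractorPoolSize" value="2"/>
  <!-- Milliseconds before extraction continues in a background thread. -->
  <param name="extractorTimeout" value="100"/>
  <param name="extractorBackLogSize" value="100"/>
  <!-- Run parsers in a separate, low-priority JVM (Linux example). -->
  <param name="forkJavaCommand" value="nice java -Xmx512m"/>
</SearchIndex>
```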

Term identification

You can configure the Lucene index to provide excerpts and highlighting in the search results.

For example, the workspace.xml file of each workspace enables highlighting in search results. The workspace.xml files are located in /<CATALINA_HOME>/webapps/<contextPath>/repositories/magnolia/workspaces/<workspace name>. Below is the relevant extract from workspace.xml in the contacts workspace.

<!-- needed to highlight the searched term -->
<param name="supportHighlighting" value="true"/>
<!-- custom provider for getting an HTML excerpt in a query result with rep:excerpt() -->
<param name="excerptProviderClass" value="info.magnolia.jackrabbit.lucene.SearchHTMLExcerpt"/>

If you have configured your own app that operates on its own workspace and provides content for the website, add these parameters to the SearchIndex element of that workspace’s workspace.xml file to show excerpts and highlighting in web search results.

If you use fields that allow HTML to be stored, that HTML is indexed along with the content. As a result, an excerpt may contain HTML tags that aren’t closed.

Parameter	Description
`supportHighlighting`	Optional. If set to `true`, additional information is stored in the index to support highlighting using the `rep:excerpt()` function.
`excerptProviderClass`	Optional. The name of the class that implements `ExcerptProvider` and should be used for the `rep:excerpt()` function in a query. By default, this is set to `SearchHTMLExcerpt`.

Parsing

Parameter	Description
`textFilterClasses`	Optional. Deprecated in Jackrabbit 2.x. With Jackrabbit 2.x, Apache Tika was introduced as the default parser for binaries. Jackrabbit ships with a default `tika-config.xml` file that contains the configuration for the MIME types to parse and extract.
`tikaConfigPath`	Optional. Sets the location of the `tika-config.xml` file. For example, `${wsp.home}/tika-config.xml`. See Configuring Tika for some example configurations, such as using the `DefaultParser` to exclude PDFs and other files.
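
As a hedged sketch of such a Tika configuration, the file below keeps the `DefaultParser` but excludes the PDF parser. It assumes a Tika version that supports the `parser-exclude` element. Check the Tika release bundled with your Jackrabbit version before using it. The file would be referenced from the SearchIndex element with a `tikaConfigPath` parameter, for example with the value `${wsp.home}/tika-config.xml`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Use all default parsers except the PDF parser. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
  </parsers>
</properties>
```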

Synonym provider

This allows users to use generalized language-dependent synonyms and, more importantly, domain-specific synonyms like abbreviations or product names.

Parameter	Description
`synonymProviderClass`	Optional. The name of a class that implements `SynonymProvider`. The default value is `null`, which means no class is set. Jackrabbit provides the `PropertiesSynonymProvider`, which implements a synonym provider based on a properties file. The location of the properties file is specified by the `synonymProviderConfigPath` parameter.
`synonymProviderConfigPath`	Optional. The path to the synonym provider configuration file. This path is interpreted relative to the `path` parameter. If there is a `FileSystem` element inside the `SearchIndex` element, then this path is interpreted relative to the root path of the `FileSystem`. Whether this parameter is mandatory depends on the synonym provider implementation. The default value is `null`, which means no path is set.
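
A minimal sketch using the properties-based provider that ships with Jackrabbit is shown below. The file name synonyms.properties is an assumption. Its location is resolved as described for `synonymProviderConfigPath`.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <param name="synonymProviderClass"
         value="org.apache.jackrabbit.core.query.lucene.PropertiesSynonymProvider"/>
  <!-- Assumed file name; resolved relative to the "path" parameter or the FileSystem root. -->
  <param name="synonymProviderConfigPath" value="/synonyms.properties"/>
</SearchIndex>
```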

Spellchecking

Parameter	Description
`spellCheckerClass`	Optional. The name of a class that implements `SpellChecker`. No known implementation exists.

Scoring

Parameter	Description
`similarityClass`	Optional. The name of a class that extends `Similarity`. Similarity defines the components of Lucene scoring.
