Search index configuration file
Jackrabbit allows you to control which properties of a node are indexed and how much they affect that node’s jcr:score value in the result.
You also have the option to configure different analyzers on a property-by-property basis.
The index configuration file defines how Lucene indexes the content of a workspace.
Making changes to the JCR search indexing configuration impacts modules that depend on it, including the User Result Ranker module.
Summary
Apache Lucene is a Java-based indexing and search technology with spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. This page is based on:
Indexing configuration
The configuration parameter indexingConfiguration isn’t set by default.
This means all properties of a node are indexed.
To configure the indexing behavior, add a parameter to the SearchIndex element in your repository or workspace configuration file.
For more, see IndexingConfiguration.
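As a rough sketch, the parameter could be added to a workspace's SearchIndex element as shown below. The handler class is Jackrabbit's stock Lucene query handler. The index path and the configuration file name are illustrative assumptions, not values taken from this page, so substitute the ones your installation uses.

```xml
<!-- Sketch only: the file name and index path are assumptions. -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Directory where the Lucene index of this workspace is stored. -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- Location of the indexing configuration file (hypothetical name). -->
  <param name="indexingConfiguration" value="/info/magnolia/jackrabbit/indexing_configuration.xml"/>
</SearchIndex>
```

Leaving the indexingConfiguration parameter out keeps the default behavior, in which every property is indexed.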
Configuration files
The indexing configuration file should be located in the package info.magnolia.jackrabbit.
To optimize the index size, you can index only certain properties of a node type. Index rules are processed top-down, and the first matching rule is applied while all remaining ones are ignored.
As of Jackrabbit 2.0, you can also use the match-all regular expression for the namespace prefix part of a property name. This is currently the only supported regular expression for the prefix. Note that every namespace prefix you use anywhere in the XML file must be declared on the configuration element.
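The skeleton below is a minimal sketch of how the namespace declarations and a match-all prefix might look together. The nt and jcr namespace URIs are the standard JCR ones. The mgnl URI, the node type, and the property name `text` are assumptions made for illustration and should be checked against your repository.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Every prefix used in rules below must be declared on this element. -->
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:mgnl="http://www.magnolia.info/jcr/mgnl"> <!-- assumed URI -->
  <index-rule nodeType="nt:unstructured">
    <!-- ".*" in the prefix position matches the local name "text" in any namespace. -->
    <property isRegexp="true">.*:text</property>
  </index-rule>
</configuration>
```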
Node scope and excluding nodes
With the nodeScopeIndex attribute set to false, the property isn’t included in the node-scope full-text index.
This means it remains available to all searches except those using contains(…) in sql and sql2 or jcr:contains(…) in xpath.
Here, an index rule is applied against nodes of type nt:base.
The rule also applies to nodes with types that extend from nt:base, and so it applies everywhere since nt:base is the base node type of all primary node types.
To minimize the index and speed up search, all properties starting with jcr: or mgnl: are excluded from the full-text index.
This means you get fewer results but those results are more relevant.
<index-rule nodeType="nt:base">
<property isRegexp="true" nodeScopeIndex="false">mgnl:.*</property> <!-- Exclude Magnolia metadata from the full-text index. -->
<property isRegexp="true" nodeScopeIndex="false">jcr:.*</property> <!-- Exclude JCR metadata from the full-text index. -->
<property isRegexp="true">.*:.*</property> <!-- Include all properties from any namespace, even the empty namespace. -->
</index-rule>
Conditions
You may also add a condition to the index rule and have multiple rules with the same node type.
For example, suppose you want to boost page titles only when the page is marked with a priority property, and you need to support three priority levels: low, medium, and high.
<!-- Since the default boost is 1.0, we don't need to specify it. Anything not medium or high is treated as low. -->
<index-rule nodeType="mgnl:page"
condition="@priority = 'medium'">
<property boost="3.0">title</property>
</index-rule>
<index-rule nodeType="mgnl:page"
condition="@priority = 'high'">
<property boost="5.0">title</property>
</index-rule>
Boost
You can configure a boost value on nodes or properties that match an index rule.
The default boost value is 1.0. Higher boost values (a reasonable range is 1.0 to 5.0) yield a higher score value, so matching nodes appear more relevant.
In the example below, a boost value of 3.0 is applied to the title property on nodes of type mgnl:page.
<index-rule nodeType="mgnl:page">
<property boost="3.0">title</property>
</index-rule>
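Since a boost can also be set on the node itself, a node-level variant might look like the sketch below. The boost attribute on the index-rule element follows the Jackrabbit indexing configuration format. The value 2.0 and the property name `abstract` are hypothetical and only serve the illustration.

```xml
<!-- Sketch: boosts every mgnl:page node that matches this rule. -->
<index-rule nodeType="mgnl:page" boost="2.0">
  <!-- Only the properties listed in a rule are indexed for matching nodes. -->
  <property>title</property>
  <property>abstract</property> <!-- hypothetical property name -->
</index-rule>
```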
Generally, it can be helpful to include the contents of descendant nodes in a single node to facilitate searching content scattered across multiple nodes.
Including areas and components
The configuration uses index aggregates to ensure area and component content is included in the index.
The properties of mgnl:area and mgnl:component comprise most of the page content and must be included explicitly.
Nested areas are also included using the recursive flag.
The example below creates an index aggregate on mgnl:page that includes the content of mgnl:area and mgnl:component.
This makes searching content on a page in one of its area or component subnodes easier.
<aggregate primaryType="mgnl:page">
<include primaryType="mgnl:area">*</include>
<include primaryType="mgnl:component">*</include>
</aggregate>
<!-- areas can be nested -->
<aggregate primaryType="mgnl:area" recursive="true">
<include primaryType="mgnl:component">*</include>
<include primaryType="mgnl:area">*</include>
</aggregate>
With this part of the configuration, you can define how a property should be analyzed.
For example, you can target properties that store German-language content with a German-language analyzer.
<analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
<property>text_de</property>
</analyzer>
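In the Jackrabbit indexing configuration format, analyzer definitions are grouped inside an analyzers element. The sketch below shows the German analyzer next to a second one. Both analyzer classes ship with Lucene 3.x. The property name `sku` is a hypothetical example of a value that should be matched as a single token.

```xml
<analyzers>
  <!-- German stemming and stop words for German-language text. -->
  <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
    <property>text_de</property>
  </analyzer>
  <!-- Treats the whole value as one token; "sku" is a hypothetical property. -->
  <analyzer class="org.apache.lucene.analysis.KeywordAnalyzer">
    <property>sku</property>
  </analyzer>
</analyzers>
```

Properties that are not listed under any analyzer continue to use the default analyzer of the search index.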
Custom configuration file
You can create a custom indexing configuration for any workspace.
Once created, the file can be configured in the workspace.xml file of the workspace you wish to target. Changes to this configuration require reindexing the workspace.
Examples of this are the website-specific configuration above and the DAM-specific configuration.
This DAM example shows node data aggregation.
Since the Magnolia metadata is stored in the |
Index configuration parameters
Performance
You can tune indexing performance with the following parameters.
For more performance ideas, see Improve Indexing Speed in the Lucene documentation.
Parameter | Description |
---|---|
| Optional. All files belonging to a segment have the same name with varying extensions. When using the Compound File format, these files are collapsed into a single |
| Optional. This setting no longer exists in Lucene 3.x. |
| Optional. The Lucene indexer doesn’t write changes to the permanent index immediately. At first, the indexer writes the changes to a volatile index. Once the volatile index reaches a certain size, it’s persisted to the permanent index. Also there is the option to set a timer, in seconds, to control how often changes are written. |
| Optional. While merging segments, Lucene ensures that no segment with more than |
| Optional. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene stores 10 documents in memory before writing them to a single segment on the disk. |
| Optional. Deprecated in Lucene 3.x. |
| Optional. Maximum number of documents that are held in a pending queue until added to the index. |
| Optional. Size of the document number cache. This cache maps UUIDs to Lucene document numbers. If the doc number cache hits are poor, then increasing this number could help. |
| Optional. The maximum volatile index size in bytes until it’s written to disk. The default value is 1 MB. |
| Optional. The maximum age (in seconds) of the index history. The default value is 0, which means that index commits are deleted as soon as they’re not used anymore. |
| Optional. With the default value of When set to |
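Performance parameters are set as param children of the SearchIndex element. The fragment below is a sketch only: the names mergeFactor and maxVolatileIndexSize are taken from Jackrabbit's SearchIndex configuration rather than from the table above, and are assumed to correspond to the merge and volatile index size settings it describes. The values repeat the defaults stated in the descriptions (10 documents, 1 MB).

```xml
<!-- Sketch: parameter names are assumed from Jackrabbit's SearchIndex options. -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Documents buffered in memory per segment and how often segments merge. -->
  <param name="mergeFactor" value="10"/>
  <!-- Volatile index size in bytes before it is written to disk (1 MB). -->
  <param name="maxVolatileIndexSize" value="1048576"/>
</SearchIndex>
```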
Consistency
Repository consistency settings are covered in more detail in the Troubleshooting section.
Parameter | Description |
---|---|
| Optional. Runs a consistency check on every startup. If |
| Optional. Errors detected by a consistency check are automatically repaired. If |
| Optional. If set to If set to |
| Optional. The name of the class that implements A redo log keeps track of changes that haven’t been committed to disk. While nodes are added and removed from the volatile index (held in memory), a redo log is maintained to keep track of the changes. If the Jackrabbit process terminates unexpectedly, the redo log is applied when Jackrabbit is restarted the next time. The default value is |
Search
Parameter | Description |
---|---|
| Optional. Class used to perform JCR Queries. |
| Optional. If |
| Optional. The number of results the query handler should initially fetch when a query is executed. Keep in mind that ACL checks must be performed on the result set. The larger the set, the more time to load and check. |
| Optional. An |
Extraction
Parameter | Description |
---|---|
| Optional. Defines the maximum number of background threads that are used to extract text from binary properties. If set to |
| Optional. A text extractor is executed using a background thread if it doesn’t finish within this timeout (defined in milliseconds). This parameter has no effect if |
| Optional. The size of the extractor pool back log. If all threads in the pool are busy, incoming work is put into a wait queue. If the wait queue reaches the back log size, incoming extractor work isn’t queued anymore but is executed with the current thread. |
| Optional. Positive values are used as they are, negative values are interpreted as factors of the |
| Optional. Java command used to fork external parser processes, or |
Term identification
You can configure the Lucene index to provide excerpts and highlighting in the search results.
For example, the workspace.xml file in each workspace enables highlighting in search results.
The workspace.xml files are in /<CATALINA_HOME>/webapps/<contextPath>/repositories/magnolia/workspaces/<workspace name>.
Below is the relevant extract from workspace.xml in the contacts workspace.
<!-- needed to highlight the searched term -->
<param name="supportHighlighting" value="true"/>
<!-- custom provider for getting an HTML excerpt in a query result with rep:excerpt() -->
<param name="excerptProviderClass" value="info.magnolia.jackrabbit.lucene.SearchHTMLExcerpt"/>
If you have configured your own app
that operates on its own workspace and provides content for the website,
you need to add these parameters to the |
If you use fields that allow HTML to be stored, that HTML is indexed along with the content. As a result, the excerpt may contain HTML tags that are not closed.
Parameter | Description |
---|---|
| Optional. If set to |
| Optional. The name of the class that implements |
Parsing
Parameter | Description |
---|---|
| Optional. Deprecated in Jackrabbit 2.x. With Jackrabbit 2.x, Apache Tika was introduced as the default binaries parser. By default, Jackrabbit comes with a default |
| Optional. Sets the location of the See Configuring Tika for some example configurations, such as using the |
Synonym provider
This allows users to use generalized language-dependent synonyms and, more importantly, domain-specific synonyms like abbreviations or product names.
Parameter | Description |
---|---|
| Optional. The name of a class that implements |
| Optional. The path to the synonym provider configuration file. This path is interpreted relative to the |
Spellchecking
Parameter | Description |
---|---|
| Optional. The name of a class that implements |
Scoring
Parameter | Description |
---|---|
| Optional. The name of a class that extends |