Workspace configuration

Once a workspace is created, you can adjust the workspace configuration on a workspace-by-workspace basis. For each new workspace, there is a corresponding workspace.xml file for fine-tuning individual performance. By default, the file is located in the file system inside the corresponding workspace folder.

To modify the configuration of an existing workspace, you need to change the workspace.xml file for that workspace. Changing the <Workspace/> element in the repository configuration file does not affect existing workspaces.
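As a rough orientation, a workspace.xml file wraps the elements described in the following sections in a single `Workspace` element. The sketch below reuses the class names shown on this page. The workspace name `website` and the stock Jackrabbit `SearchIndex` class are assumptions for illustration, so check the file in your own workspace folder for the classes your installation actually uses.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a workspace.xml layout; names and classes are illustrative. -->
<Workspace name="website">
  <!-- Virtual file system of this workspace -->
  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
    <param name="path" value="${wsp.home}"/>
  </FileSystem>
  <!-- Stores the content of this workspace -->
  <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
    <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
    <param name="schemaObjectPrefix" value="${wsp.name}_"/>
  </PersistenceManager>
  <!-- Lucene-based search index -->
  <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${wsp.home}/index"/>
  </SearchIndex>
</Workspace>
```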

File system

The FileSystem element defines the virtual file system that is passed to the persistence manager and the search index.

<FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
  <param name="path" value="${rep.home}/repository" />
</FileSystem>

Jackrabbit provides several FileSystem implementations. Choose the class that best fits your use case.
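As one possible alternative to the local file system, Jackrabbit also ships an in-memory implementation. The sketch below assumes that a non-persistent file system is acceptable, for example in a test setup; nothing written to it survives a restart.

```xml
<!-- Sketch: in-memory virtual file system, contents are lost on shutdown -->
<FileSystem class="org.apache.jackrabbit.core.fs.mem.MemoryFileSystem"/>
```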

Persistence manager

Each workspace in a Jackrabbit content repository uses its own persistence manager to store the content in that workspace.

<PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
  <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
  <param name="schemaObjectPrefix" value="${wsp.name}_"/>
</PersistenceManager>

Jackrabbit provides several PersistenceManager implementations. Choose the class that best fits your use case.
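For comparison with the pooled Derby example, an in-memory persistence manager could be configured along the following lines. This is a sketch under the assumption that losing all content on shutdown is acceptable, as in automated tests; the `persistent` parameter value shown here disables writing the in-memory state to disk.

```xml
<!-- Sketch: content is held in memory only and is not written to disk -->
<PersistenceManager class="org.apache.jackrabbit.core.persistence.mem.InMemPersistenceManager">
  <param name="persistent" value="false"/>
</PersistenceManager>
```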

Search index

Node names and property values are indexed as soon as the data is saved or as soon as the transaction is committed.

Text extraction is done asynchronously in a background thread. That means text that’s changed or added isn’t available immediately, but rather after a short delay. You can configure the exact behaviour using the extractor settings.

Jackrabbit provides the following options in the class SearchIndex. All parameters (except path) have default values and you can omit them and use the default value instead.

See Jackrabbit Search for more details.

Basic configuration

Parameter	Description
`path`	required The location of the index directory. A reasonable value is: `${wsp.home}/index`
`indexingConfiguration`	optional When not set, all properties of a node are indexed. Magnolia provides default indexing configuration files in the Core module: `indexing_configuration_default.xml` and `indexing_configuration_website.xml`. You can also create a custom indexing configuration file on a per-workspace basis to adjust the indexing behavior. For example, the `magnolia-dam-jcr` module provides a DAM-specific configuration, `indexing_configuration_dam.xml`, for aggregating asset metadata with its binary data.
`indexingConfigurationClass`	optional The name of the class that implements `IndexingConfiguration`. `IndexingConfigurationImpl` implements a concrete indexing configuration.
`analyzer`	optional Sets the default analyzer in use for indexing. The default value is the `StandardAnalyzer`. This analyzer uses an English-language stop word set. Lucene provides language-specific analyzers, which you can configure property-by-property in the indexing configuration file.
`directoryManager`	optional The name of the class that implements `DirectoryManager`. `FSDirectoryManager` implements a directory manager for `FSDirectory` instances. `RAMDirectoryManager` implements a directory manager for `RAMDirectory` instances.
`useSimpleFSDirectory`	optional Indicates whether the `DirectoryManager` should use the `SimpleFSDirectory` instead of letting Lucene automatically pick an implementation based on the platform you’re running on. Default is `false`.
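A minimal sketch of how the basic parameters might be combined in a `SearchIndex` element is shown below. The class names are the stock Jackrabbit and Lucene implementations, and the location of the indexing configuration file is a hypothetical example; only `path` is required.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Required: location of the index directory -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- Optional: hypothetical location of a custom indexing configuration -->
  <param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>
  <!-- Optional: default analyzer, shown here with its default value -->
  <param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <!-- Optional: keep the index in file system directories -->
  <param name="directoryManager" value="org.apache.jackrabbit.core.query.lucene.directory.FSDirectoryManager"/>
  <!-- Optional: let Lucene pick the directory implementation (default) -->
  <param name="useSimpleFSDirectory" value="false"/>
</SearchIndex>
```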

Performance

You can tune indexing performance with the following parameters.

For more performance ideas, see Improve Indexing Speed in the Lucene documentation.

Parameter	Description
`useCompoundFile`	optional All files belonging to a segment have the same name with varying extensions. When using the Compound File format, these files are collapsed into a single `.cfs` file. Useful for systems that frequently run out of file handles.
`minMergeDocs`	optional This setting no longer exists in Lucene 3.x.
`volatileIdleTime`	optional The Lucene indexer doesn’t write changes to the permanent index immediately. It first writes them to a volatile index, which is persisted to the permanent index once it reaches a certain size. This parameter sets the idle time, in seconds, after which the volatile index is written to the permanent index even if that size hasn’t been reached.
`maxMergeDocs`	optional While merging segments, Lucene ensures that no segment with more than `maxMergeDocs` documents is created.
`mergeFactor`	optional This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene stores 10 documents in memory before writing them to a single segment on the disk.
`maxFieldLength`	optional Deprecated in Lucene 3.x.
`bufferSize`	optional Maximum number of documents that are held in a pending queue until added to the index.
`cacheSize`	optional Size of the document number cache. This cache maps UUIDs to Lucene document numbers. If the doc number cache hits are poor, then increasing this number could help.
`maxVolatileIndexSize`	optional The maximum volatile index size in bytes until it’s written to disk. The default value is 1MB.
`maxHistoryAge`	optional The maximum age (in seconds) of the index history. The default value is 0, which means that index commits are deleted as soon as they’re not used anymore.
`initializeHierarchyCache`	optional With the default value of `true`, the hierarchy cache is initialized on startup and control is only given back when the initialization has completed. When set to `false`, the cache is populated during regular use.
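The performance parameters are set as `param` children of the `SearchIndex` element. The fragment below is only a sketch: the class name is the stock Jackrabbit implementation, and the values are illustrative starting points rather than recommendations, so measure before and after changing them.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Collapse the files of a segment into one .cfs file to save file handles -->
  <param name="useCompoundFile" value="true"/>
  <!-- Idle time in seconds before the volatile index is persisted -->
  <param name="volatileIdleTime" value="3"/>
  <!-- Merge behaviour of Lucene segments -->
  <param name="mergeFactor" value="10"/>
  <!-- Documents held in the pending queue before they are indexed -->
  <param name="bufferSize" value="10"/>
  <!-- Entries in the UUID to document number cache -->
  <param name="cacheSize" value="1000"/>
  <!-- Volatile index size in bytes (1 MB) before it is written to disk -->
  <param name="maxVolatileIndexSize" value="1048576"/>
  <!-- Populate the hierarchy cache lazily instead of at startup -->
  <param name="initializeHierarchyCache" value="false"/>
</SearchIndex>
```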

Consistency

Repository consistency settings are covered in more detail in the Troubleshooting section.

Parameter	Description
`forceConsistencyCheck`	optional If `true`, a consistency check runs on every startup. If `false`, a consistency check is only performed when the search index detects a prior forced shutdown. Because a consistency check can delay system startup, enable this only when you suspect a search index inconsistency, such as a "node not found" error. Two kinds of inconsistency are possible: a UUID exists in the search index but the corresponding node isn’t found, or a node exists but isn’t recorded in the index. In both cases, the index is inconsistent with the data.
`autoRepair`	optional Errors detected by a consistency check are automatically repaired. If `false`, errors are only written to the log.
`enableConsistencyCheck`	optional If set to `true`, a consistency check is performed depending on the parameter `forceConsistencyCheck`. If set to `false` no consistency check is performed on startup, even if a redo log was applied.
`redoLogFactoryClass`	optional The name of the class that implements `RedoLogFactory`. A redo log keeps track of changes that haven’t been committed to disk. While nodes are added and removed from the volatile index (held in memory), a redo log is maintained to keep track of the changes. If the Jackrabbit process terminates unexpectedly, the redo log is applied when Jackrabbit is restarted the next time. The default value is `DefaultRedoLogFactory`.
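When an inconsistency is suspected, the consistency parameters could be combined for a one-off check and repair on the next startup, roughly as sketched below. The `SearchIndex` class name is the stock Jackrabbit implementation and is an assumption here. Because the check delays startup, you would normally set `forceConsistencyCheck` back to `false` afterwards.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Allow consistency checks on startup -->
  <param name="enableConsistencyCheck" value="true"/>
  <!-- Run the check even without a prior forced shutdown -->
  <param name="forceConsistencyCheck" value="true"/>
  <!-- Repair detected errors instead of only logging them -->
  <param name="autoRepair" value="true"/>
</SearchIndex>
```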

Search

Parameter	Description
`queryClass`	optional Class used to perform JCR Queries. `QueryImpl` provides the default implementation for a JCR query. Raising the log level on `QueryImpl` to `DEBUG` prints query execution times to the log.
`respectDocumentOrder`	optional If `true` and the query doesn’t contain an `order by` clause, result nodes are in document order (the order in which they were indexed by the system).
`resultFetchSize`	optional The number of results the query handler should initially fetch when a query is executed. Keep in mind that ACL checks must be performed on the result set. The larger the set, the more time to load and check.
`termInfosIndexDivisor`	optional An `indexDivisor` for `TermInfosReader` that lets you further sub-sample the `termIndexInterval` when opening a reader, so that the reader uses less RAM. The default value is `1`, meaning all terms are loaded into RAM. A value of `2` loads every other term into RAM; the trade-off is that a lookup might have to scan twice as far. See LUCENE-1052.
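A sketch of the search parameters in a `SearchIndex` element follows. The class name is the stock Jackrabbit implementation and the values are illustrative assumptions: a modest fetch size limits the number of ACL checks per query, and a divisor of 2 trades some lookup speed for lower memory use.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Do not force document order for queries without an order by clause -->
  <param name="respectDocumentOrder" value="false"/>
  <!-- Number of results fetched initially per query -->
  <param name="resultFetchSize" value="100"/>
  <!-- Load every other term into RAM -->
  <param name="termInfosIndexDivisor" value="2"/>
</SearchIndex>
```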

Extraction

Parameter	Description
`extractorPoolSize`	optional Defines the maximum number of background threads that are used to extract text from binary properties. If set to `0`, no background threads are allocated and text extractors run in the current thread. In Jackrabbit 1.5 and later, the default is twice the number of available processors.
`extractorTimeout`	optional A text extractor is executed using a background thread if it doesn’t finish within this timeout (defined in milliseconds). This parameter has no effect if `extractorPoolSize` is `0`.
`extractorBackLogSize`	optional The size of the extractor pool back log. If all threads in the pool are busy, incoming work is put into a wait queue. If the wait queue reaches the back log size, incoming extractor work isn’t queued anymore but is executed with the current thread.
`maxExtractLength`	optional Positive values are used as they are, negative values are interpreted as factors of the `maxFieldLength` parameter.
`forkJavaCommand`	optional Java command used to fork external parser processes, or `null` (the default) for in-process text extraction. Use this to better control system stability and reliability by forcing indexing of binary documents into separate JVM processes. Any problems caused by parsing large or malformed documents don’t affect the main process. Example values: `nice java -Xmx512m` on Linux, or `cmd /c start /low /wait /b java -Xmx512m` on Windows.
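The extraction parameters could be combined as in the following sketch. The class name is the stock Jackrabbit implementation, the numeric values are illustrative assumptions, and the `forkJavaCommand` value is the Linux example from the table.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- At most four background threads for text extraction -->
  <param name="extractorPoolSize" value="4"/>
  <!-- Move an extraction to a background thread after 100 ms -->
  <param name="extractorTimeout" value="100"/>
  <!-- Queue up to 100 extraction jobs when all threads are busy -->
  <param name="extractorBackLogSize" value="100"/>
  <!-- Parse binaries in a separate, low-priority JVM -->
  <param name="forkJavaCommand" value="nice java -Xmx512m"/>
</SearchIndex>
```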

Term identification

You can configure the Lucene index to provide excerpts and highlighting in the search results.

For example, the workspace.xml file in each workspace enables highlighting in search results. The workspace.xml files are in /<CATALINA_HOME>/webapps/<contextPath>/repositories/magnolia/workspaces/<workspace name>. The following extract is from the workspace.xml file of the contacts workspace.

<!-- needed to highlight the searched term -->
<param name="supportHighlighting" value="true"/>
<!-- custom provider for getting an HTML excerpt in a query result with rep:excerpt() -->
<param name="excerptProviderClass" value="info.magnolia.jackrabbit.lucene.SearchHTMLExcerpt"/>

If you have configured your own app that operates on its own workspace and provides content for the website, you need to add these parameters to the SearchIndex element of your workspace.xml file to show excerpts and highlighting in web search results.

If you use fields that can store HTML, that HTML is indexed along with the content. As a result, an excerpt can contain HTML tags that are not closed.

Parameter	Description
`supportHighlighting`	optional If set to `true`, additional information is stored in the index to support highlighting using the `rep:excerpt()` function.
`excerptProviderClass`	optional The name of the class that implements `ExcerptProvider` and should be used for the `rep:excerpt()` function in a query. By default, this is set to `SearchHTMLExcerpt`.

Parsing

Parameter	Description
`textFilterClasses`	optional Deprecated in Jackrabbit 2.x. With Jackrabbit 2.x, Apache Tika was introduced as the default binaries parser. By default, Jackrabbit comes with a default `tika-config.xml` file that contains the configuration for the mime-types to parse and extract.
`tikaConfigPath`	optional Sets the location of the `tika-config.xml`. For example, `${wsp.home}/tika-config.xml`. See Configuring Tika for some example configurations, such as using the `DefaultParser` to exclude PDFs and other files.
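A custom Tika configuration that excludes PDFs might look like the sketch below. It assumes a Tika version that supports `mime-exclude` on the `DefaultParser`, and that the file is saved at the example location `${wsp.home}/tika-config.xml` referenced by `tikaConfigPath`. Verify the syntax against the Tika version bundled with your Jackrabbit release.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a tika-config.xml, referenced from workspace.xml with
     param name="tikaConfigPath" value="${wsp.home}/tika-config.xml" -->
<properties>
  <parsers>
    <!-- Use the default parser for everything except PDF documents -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
  </parsers>
</properties>
```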

Synonym provider

A synonym provider lets users search with general language-dependent synonyms and, more importantly, domain-specific synonyms such as abbreviations or product names.

Parameter	Description
`synonymProviderClass`	optional The name of a class that implements `SynonymProvider`. The default value is `null`, which means no class set. Jackrabbit provides the `PropertiesSynonymProvider` which implements a synonym provider based on a properties file. The location of the properties file is specified by the `synonymProviderConfigPath`.
`synonymProviderConfigPath`	optional The path to the synonym provider configuration file. This path is interpreted relative to the `path` parameter. If there is a `FileSystem` element inside the `SearchIndex` element, then this path is interpreted relative to the root path of the `FileSystem`. Whether this parameter is mandatory or not depends on the synonym provider implementation. The default value is `null`, which means no path is set.
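A properties-based synonym provider could be wired up as sketched below. The fully qualified class names are the stock Jackrabbit ones, and the file name `synonyms.properties` and the nested `FileSystem` path are hypothetical examples. The properties file would contain one `term=synonym` entry per line.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <param name="synonymProviderClass"
         value="org.apache.jackrabbit.core.query.lucene.PropertiesSynonymProvider"/>
  <!-- Resolved relative to the root of the nested FileSystem below -->
  <param name="synonymProviderConfigPath" value="/synonyms.properties"/>
  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
    <param name="path" value="${wsp.home}/index"/>
  </FileSystem>
</SearchIndex>
```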

Spellchecking

Parameter	Description
`spellCheckerClass`	optional The name of a class that implements `SpellChecker`. No known implementation exists.

Scoring

Parameter	Description
`similarityClass`	optional The name of a class that extends `Similarity`. Similarity defines the components of Lucene scoring.
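If you implement your own scoring, the class is registered by name, roughly as follows. `com.example.search.CustomSimilarity` is a hypothetical class that you would have to write yourself, extending Lucene's `Similarity`, and place on the classpath.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Hypothetical custom class that extends Lucene's Similarity -->
  <param name="similarityClass" value="com.example.search.CustomSimilarity"/>
</SearchIndex>
```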

Workspace security

Workspace security is handled by the MagnoliaAccessProvider.

See the workspace security section for more details.
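In workspace.xml, the access provider is typically declared inside a `WorkspaceSecurity` element. The sketch below assumes the package name `info.magnolia.cms.core` for `MagnoliaAccessProvider`; confirm it against the workspace.xml files of your own installation.

```xml
<WorkspaceSecurity>
  <!-- Package name is an assumption; check your existing workspace.xml -->
  <AccessControlProvider class="info.magnolia.cms.core.MagnoliaAccessProvider"/>
</WorkspaceSecurity>
```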

Synchronize workspaces between Magnolia instances

When using Magnolia, you often store content in a variety of workspaces. Typically, workspaces are kept under the directory set by magnolia.repositories.home in the WEB-INF/config/default/magnolia.properties file. The Content Types module creates node types, workspaces, and namespaces on-the-fly. If you use this module, make sure your repository configuration and workspaces are properly synchronized, because this on-the-fly feature changes the repository configuration files.

The following should be considered when creating a new content type:

📁 repo

📁 magnolia

📁 repository

📁 datastore

📁 meta

⸬ rootUUID

📁 namespaces

⬩ ns_idx.properties

⬩ ns_reg.properties

📁 nodetypes

⬩ custom_nodetypes.xml

⬩ db.mv.db

📁 workspaces

📁 config

⬩ db.mv.db

⬩ workspace.xml

Item Notes

Namespace definitions

Your custom namespace registry and its index are stored as text files, ns_reg.properties and ns_idx.properties, in the repository/namespaces folder.

Copy your custom namespace registry and index to the target environment to synchronize them.

Node type definitions

Custom node type definitions are stored in the repository/nodetypes folder in the custom_nodetypes.xml file.

This file is not automatically generated unless you start up a clean Magnolia instance, so you also have to merge the existing definition in the target environment with the one from the source environment. Otherwise, an invalid node type error may occur when accessing content of that node type after data synchronization.

Workspace configuration

The detailed configuration of each workspace is stored in the workspaces/<your_workspace_name>/workspace.xml file.

If you want to change the PersistenceManager class, SearchIndex class, excerptProviderClass or AccessControlProvider class, change it in each of your workspace configuration files. The changes take effect at the next system startup.

Index and lock

You can remove all files and folders under the index folder. At the next restart, the system regenerates the index by reindexing the entire workspace.

Why is this important?

This ensures repository consistency and cleans up all unsynchronized indexes.

For content synchronization, do not copy this folder between instances. Clean it up in the target instance instead.

The actual content

The actual content is typically stored in your configured database, in tables whose names start with the prefix set by the schemaObjectPrefix parameter, where ${wsp.name} is the workspace name.

For example:

pm_${wsp.name}_NODE (table)
pm_${wsp.name}_NODE_IDX (index)
pm_${wsp.name}_PROP (table)
pm_${wsp.name}_PROP_IDX (index)
pm_${wsp.name}_REFS (table)
pm_${wsp.name}_REFS_IDX (index)
pm_${wsp.name}_BINVAL (table)
pm_${wsp.name}_BINVAL_IDX (index)
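Table names such as those listed above would result from a prefix of `pm_${wsp.name}_`. The sketch below shows that setting on the Derby persistence manager used earlier on this page; your installation may use a different persistence manager class and database URL.

```xml
<PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
  <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
  <!-- For a workspace named "website", tables are named pm_website_NODE, pm_website_PROP, and so on -->
  <param name="schemaObjectPrefix" value="pm_${wsp.name}_"/>
</PersistenceManager>
```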
