This page describes how to configure the Content Indexer submodule of the Magnolia Solr module to index Magnolia workspaces and crawl a website.
The Solr module allows you to use Apache Solr, a standalone enterprise-grade search server with a REST-like API, for indexing and crawling Magnolia content, especially if you need to manage assets in high volumes (100,000+ DAM assets).
Configuring Solr clients
The Solr module supports multiple Solr servers/cores.
You can configure a client for each server/core under Configuration > /modules/solr-search-provider/config/solrClientConfigs.
It’s recommended to have one client named default. This default client is used when no specific client is defined for the indexer, crawler, or search result page template.
If you need to have more servers/cores, duplicate the default client and change the baseURL property to point to another server/core.
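For example, the client configuration could look like this as YAML in a light module (a minimal sketch; the second client name, core names, and URLs are illustrative):

```yaml
# Sketch: solrClientConfigs with the default client and one extra core
# (client names, core names, and URLs are illustrative)
solrClientConfigs:
  default:
    baseURL: http://localhost:8983/solr/magnolia
  assets:
    baseURL: http://localhost:8983/solr/magnolia-assets
```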
Since Solr module version 9.0, Magnolia supports light module and decoration-based Solr indexer configurations.
The Content Indexer module is a recursive repository indexer and an event-based indexer.
You can configure multiple indexers for different sites and document types.
The content indexer also allows you to crawl external websites using jsoup and CSS selectors.
You then define field mappings that determine which properties are read from each node and how they are indexed in Solr.
IndexerService
Both the indexer and the crawler use the IndexerService to handle content indexing.
A basic implementation, configured by default, is info.magnolia.search.solrsearchprovider.logic.indexer.SolrIndexerService.
You can define and configure your own IndexService for specific needs.
For a globally configured indexing service, register the IndexerService in the configuration of the Content Indexer module.
For a custom indexing service, use the indexServiceClass property (see the properties table below).
Since Solr module version 7.0, the indexer service supports asynchronous indexing in multiple threads.
All indexers and crawlers share a pool of threads for asynchronous indexing.
The number of threads can also be configured in the Content Indexer module config; the default is 10.
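As a rough sketch, the module-level configuration could look like this (only the IndexerService class is taken from above; the thread-pool property name is hypothetical):

```yaml
# Sketch: /modules/content-indexer/config (names partly hypothetical)
config:
  indexerService:
    class: info.magnolia.search.solrsearchprovider.logic.indexer.SolrIndexerService
  nbThreads: 10   # hypothetical property name for the shared asynchronous pool size
```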
Indexer configuration
You can configure indexers in Configuration > /modules/content-indexer/config/indexers.
Property
Description
enabled
required
true enables the indexer configuration, false disables it.
indexed
required
Since Solr module version 9.0, this property is managed outside the definition, see Indexer status.
Indicates whether the initial indexing has been done. When Solr finishes indexing, the content-indexer sets this property to true.
You can set it to false to trigger re-indexing.
nodeType
optional, default is mgnl:page
JCR node type to index.
For example, if you were indexing assets in the Magnolia DAM you would set this to mgnl:asset.
assetProviderId
optional
When assetProviderId is specified, the assets are obtained from that provider and Solr uses Tika to extract the information from a document, for instance a PDF.
rootNode
required
Node in the workspace where indexing starts.
Use this property to limit indexing to a particular site branch.
type
required
Sets the type of the indexed content, such as website or documents.
When you search the index, you can filter results by type.
If configured, the indexer pre-creates fields in Solr according to the content-type definition during the initial (re)indexing.
nestedIndexing
optional (Solr module version 7.0+), default is false
If set to true, the subcomponents and submodels are indexed as nested documents.
indexSubHierarchy
optional (Solr module version 7.0+), default is true
If nestedIndexing is set to false and indexSubHierarchy to true, then the subcomponents and submodels are indexed into a parent index.
If set to false, the subcomponents and submodels aren’t indexed at all.
ignorePropertiesRegEx
optional (Solr module version 7.0+)
When no field mapping is set for JcrIndexer, every property is indexed except jcr:* properties and properties matching this regex pattern.
indexServiceClass
optional
Custom IndexerService used by this indexer.
If not defined, the global one is used.
batchSize
optional (Solr module version 7.0+), default is 1000
Size of a batch of documents sent to Solr in one request.
commit
optional (Solr module version 7.0+), default is true
Determines whether the commit parameter is included in the update request.
changeListenerDelay
optional (Solr module version 7.0+), default is 1000 milliseconds
Delay, in milliseconds, for the event listener that collects and indexes content changes for this indexer.
fieldMappings
optional
Defines how fields in Magnolia content are mapped to Solr fields.
The left side in the mapping is Magnolia, the right side is Solr.
If not set, see the ignorePropertiesRegEx property for the behavior.
<Magnolia_field>
<Solr_field>
You can use any field available in the Solr schema.
If a field does not exist in Solr's schema, you can use a dynamic field mgnlmeta_*.
For instance, if you have information nested in a deep leaf of your page stored with the property specComponentAbstract, you can map this field with mgnlmeta_specComponentAbstract.
The indexer contains a recursive call which explores the node's children until it finds the property.
To include language variants in the indexing task, include the respective i18n-marked property key such as title_de (name of the property in Magnolia) and pair it with the field name in Solr.
Example:

Node name            Value
fieldMappings
  abstract           abstract
  abstract_de        abstract_de
  title              title
  title_de           title_de
clients
optional, default is default
Solr clients which will be used by this indexer.
Allows indexing content into multiple Solr instances.
<client-name>
required
Name of the client.
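Pulling these properties together, an indexer definition could look like the following sketch (the indexer name, root node, and mappings are illustrative):

```yaml
# Sketch: one indexer under /modules/content-indexer/config/indexers
pagesIndexer:
  enabled: true
  nodeType: mgnl:page
  rootNode: /travel            # limit indexing to one site branch
  type: website
  batchSize: 1000
  commit: true
  fieldMappings:
    abstract: abstract
    title: title
    title_de: title_de         # i18n variant mapped explicitly
  clients:
    default: default
```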
Indexer status
The status of an indexer is stored under the Configuration > /modules/solr-search-provider/indexers-status/<indexer-name>@indexed property and indicates whether the initial indexing has been done for the specific indexer.
When Solr finishes indexing, the content-indexer sets the indexed property to true.
You can set it to false to trigger a re-indexing.
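For example, setting /modules/solr-search-provider/indexers-status/pagesIndexer@indexed (indexer name illustrative) back to false re-indexes that indexer's content on the next run.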
Crawler configuration
The crawler mechanism uses the Scheduler to crawl a site periodically.
You can configure crawlers in Configuration > /modules/content-indexer/config/crawlers.
Property
Description
enabled
required
true enables the crawler, false disables it.
When a crawler is enabled, the info.magnolia.module.indexer.CrawlerIndexerFactory registers a new scheduler job for the crawler automatically.
depth
required
The maximum depth of a page in terms of click distance from the root page.
This shouldn't be set too high; ideally 2 or 3 at most.
nbrCrawlers
required
The maximum number of simultaneous crawler threads crawling a site; 2 or 3 is enough.
maxOutgoingLinksToFollow
optional, default value is 5000
Maximum number of outgoing links which are processed from a page.
politenessDelay
optional, default value is 200
Delay in milliseconds between sending two requests to the same host.
socketTimeout
optional, default value is 20000
Socket timeout in milliseconds.
connectionTimeout
optional, default value is 30000
Connection timeout in milliseconds.
userAgent
optional, default value is crawler4j
A User-Agent string used to identify your crawler (and the clean command) to web servers.
cron
required
A CRON expression that specifies how often the site is crawled, for example 0 0 3 * * ? for every day at 3:00 AM.
See also the CronMaker, a useful tool for building expressions.
type
optional
Sets the type of the crawled content such as news.
When you search the index, you can filter results by type.
indexServiceClass
optional
Custom IndexerService used by this crawler.
If not defined, the global one is used.
batchSize
optional (Solr module version 7.0+), default is 1000
Size of a batch of documents sent to Solr in one request.
commit
optional (Solr module version 7.0+), default is true
Determines whether the commit parameter is included in the update request.
clients
optional, default is default client
Solr clients which will be used by this indexer.
Allows indexing content into multiple Solr instances.
<client-name>
required
Name of the client.
sites
required
Sites which will be crawled.
<site-name>
required
Site name, an arbitrary node value.
url
required
The URL of the site which will be crawled by this crawler.
fieldMappings
required
Defines how fields parsed from the site’s pages are mapped to Solr fields.
The left side is the Solr field; the right side defines what is extracted from the crawled page.
<site_field>
required
You can use any CSS selector to target an element on the page.
For example, #story_continues_1 targets an element by ID.
You can also use a custom syntax to get content inside attributes.
For example, meta keywords are extracted using meta[name=keywords] attr(0,content).
This extracts the first value of the keywords meta element.
If you don’t specify anything after the CSS selector, the text contained in the element is indexed.
meta[name=keywords] alone would return an empty string because a meta element doesn't contain any text; the keywords are in its attributes.
To get the value of a specific attribute, specify attr(<index>,<attribute_name>).
If you set index=-1, all attributes are extracted, separated by semicolons (;).
jcrItems
optional
List of JCR items.
If any of these items is activated, a crawler will be triggered.
<item_name>
optional
Name of the JCR item.
workspace
required
Workspace where the JCR item is stored.
path
required
Path of the JCR item.
siteAuthenticationConfig
optional
Authentication information to allow crawling a password-restricted area.
username
required
Username which is used as a login for a restricted area.
password
required
User password used to log in to a restricted area.
loginUrl
required
A URL of a page with a login form.
usernameField
required, default is mgnlUserID
Name of the input field for entering the username in a login form.
passwordField
required, default is mgnlUserPSWD
Name of the input field for entering the password in a login form.
logoutUrlIdentifier
required, default is mgnlLogout
String which identifies the logout URL.
The crawler doesn’t crawl over the URLs that contain the logoutUrlIdentifier to avoid a logout.
Example: Configuration to crawl https://www.bbc.co.uk.
Node name              Value
bbc_co_uk
  clients
    default            default
  sites
    bbc
      url              http://www.bbc.co.uk/
      fieldMappings
        abstract       #story_continues_1
        keywords       meta[name=Description] attr(0,content)
  depth                2
  enabled              false
  nbrCrawlers          2
  type                 news
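Expressed as YAML in a light module, the same crawler could look roughly like this (a sketch; only the values from the table above are used):

```yaml
# Sketch: the bbc_co_uk crawler from the table above as YAML
bbc_co_uk:
  enabled: false
  depth: 2
  nbrCrawlers: 2
  type: news
  clients:
    default: default
  sites:
    bbc:
      url: http://www.bbc.co.uk/
      fieldMappings:
        abstract: '#story_continues_1'             # CSS ID selector
        keywords: meta[name=Description] attr(0,content)
```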
Configuration of crawler commands
You can configure crawler commands in Configuration > /modules/content-indexer/commands/.
By default, the crawler mechanism is connected with the CleanSolrIndexCommand, which removes outdated pages from the index.
The CleanSolrIndexCommand is chained before the CrawlerIndexerCommand.
Node name                Value
modules
  content-indexer
    config
      indexers
      crawlers
    commands
      content-indexer    Note: The name of this folder is referenced by the crawler's catalog property.
        <crawler-name>   Note: The name of this node is referenced by the crawler's command property.
If set to true, only the head of the page is requested instead of fetching the whole page.
If the deleteNoIndex property is set to true, then this configuration is ignored, because the robots meta tag cannot be resolved from the head request.
followRedirects
optional, default is false
If set to true, the redirects are followed and the status code of the last page is evaluated.
statusCodes
optional
List of status codes.
If a page returns any of the status codes listed, then the page is removed from the index.
By default, there is no list but if a page returns 404 at any time, the page is removed from the index.
deleteNoIndex
optional, default is false
If set to true, then also pages with the robots meta tag set to noindex are removed from the index.
skipIfAlreadyRunning
optional, default is false
If set to true and a CleanSolrIndexCommand is already running for the crawler, the running command is left to finish and the new one is skipped.
By default, if the CleanSolrIndexCommand is already running for the crawler, it is stopped and a new one is started.
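As a sketch, tuning these properties for one crawler's command could look like this (whether the properties sit directly on the crawler's command node or on a chained subcommand depends on your setup; the values are illustrative):

```yaml
# Sketch: clean-up tuning for one crawler's command chain
# (node placement, status codes, and values are illustrative)
commands:
  content-indexer:
    bbc_co_uk:
      followRedirects: true
      deleteNoIndex: true
      skipIfAlreadyRunning: true
      statusCodes:
        - 404
        - 410
```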
Crawling triggered by publishing
Crawlers can also be connected with the publishing process by adding info.magnolia.module.indexer.crawler.commands.CrawlerIndexerActivationCommand into the command chain of the publishing command.
By default, this is done for the following commands:
catalog: default, command: publish
configured under /modules/publishing-core/commands/default/publish
catalog: default, command: unpublish
configured under /modules/publishing-core/commands/default/unpublish
catalog: default, command: personalizationActivation
configured under /modules/personalization-integration/commands/default/personalizationActivation
If you are using a custom publishing command and want to connect it with the crawler mechanism, use the info.magnolia.module.indexer.setup.AddCrawlerIntoCommandChainTask install/update task.