Indexing and crawling a website with Solr
This page describes how to configure the Content Indexer submodule of the Magnolia Solr module to index Magnolia workspaces and crawl a website.
The Solr module allows you to use Apache Solr, a standalone enterprise-grade search server with a REST-like API, for indexing and crawling Magnolia content, especially if you need to manage assets in high volumes (100,000+ DAM assets).
Configuring Solr clients
The Solr module supports multiple Solr servers/cores.
You can configure a client for each server/core under Configuration > /modules/solr-search-provider/config/solrClientConfigs.
It’s recommended to have one client named default. This default client is used when no specific client is defined for the indexer, crawler, or search result page template.
If you need to have more servers/cores, duplicate the default client and change the baseURL property to point to another server/core.
Node name | Value |
---|---|
solr-search-provider | |
config | |
solrClientConfigs | |
default | |
allowCompression | false |
baseURL | |
connectionTimeout | 100 |
soTimeout | 1000 |
The value entered for the baseURL property must comply with the following syntax:
<protocol>://<domain_name>:<port>/solr/<solr_core_name>
If the Solr server is installed as described in Installing Apache Solr, then the value is http://localhost:8983/solr/magnolia.
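The baseURL syntax can be checked mechanically. The following is a minimal Java sketch, not part of the Solr module: the class SolrBaseUrl and its two methods are hypothetical helpers that assemble a baseURL from its parts and test whether a string follows the pattern described above.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the Magnolia Solr module.
final class SolrBaseUrl {
    private SolrBaseUrl() {}

    // Assembles <protocol>://<domain_name>:<port>/solr/<solr_core_name>.
    static String build(String protocol, String domainName, int port, String coreName) {
        return protocol + "://" + domainName + ":" + port + "/solr/" + coreName;
    }

    // Returns true when the string has a scheme, a host, an explicit port
    // and a path of the form /solr/<core>.
    static boolean matchesSyntax(String baseUrl) {
        try {
            URI uri = URI.create(baseUrl);
            String path = uri.getPath();
            return uri.getScheme() != null
                    && uri.getHost() != null
                    && uri.getPort() != -1
                    && path != null
                    && path.matches("/solr/[^/]+");
        } catch (IllegalArgumentException e) {
            // Not a parseable URI at all.
            return false;
        }
    }
}
```

With the values from the installation described above, SolrBaseUrl.build("http", "localhost", 8983, "magnolia") yields the same string as the documented default.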
For a description of the other properties, see:
Indexing Magnolia workspaces
Since Solr version 9.0, Magnolia supports light module or decoration-based Solr indexer configurations.
The Content Indexer module is both a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The content indexer also allows you to crawl external websites using JSoup and CSS selectors. For each indexer or crawler, you define field mappings; the mapped values are obtained for each node and added to the Solr index.
IndexerService
Both the indexer and the crawler use the IndexerService to handle content indexing.
A basic implementation, configured by default, is info.magnolia.search.solrsearchprovider.logic.indexer.SolrIndexerService.
You can define and configure your own IndexerService for specific needs.
Implement the IndexerService interface:
import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexerService {
    private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

    @Override
    public Future<Boolean> addUpdate(List<Map<String, Object>> dataToIndex, boolean commit) {
        // ...
    }
    // ...
}
For a globally configured indexing service, register the IndexerService in the configuration of the Content Indexer module.
For a custom indexing service used by a single indexer or crawler, set the indexServiceClass property in that indexer's or crawler's configuration (see the property tables on this page).
Since Solr version 7.0, the indexer service supports asynchronous indexing in multiple threads.
All indexers and crawlers share a single pool of threads for asynchronous indexing.
The pool size can also be configured in the Content Indexer module configuration (indexersThreadPoolSize).
The default number of threads is 10.
Node name | Value |
---|---|
modules | |
content-indexer | |
config | |
indexServiceClass | info.magnolia.search.solrsearchprovider.logic.indexer.SolrIndexerService |
indexersThreadPoolSize | 10 |
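To illustrate what a shared pool of indexing threads means in practice, here is a minimal Java sketch using only the standard library. It is not the module's implementation: SharedIndexingPool and its methods are hypothetical, and the task body only stands in for a real request to Solr. It assumes that every caller submits work to the same fixed-size pool and receives a Future<Boolean>, mirroring the addUpdate signature shown earlier.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; the real SolrIndexerService may be organised differently.
final class SharedIndexingPool {
    private final ExecutorService pool;

    // One fixed-size pool shared by every indexer and crawler.
    SharedIndexingPool(int size) {
        pool = Executors.newFixedThreadPool(size, runnable -> {
            Thread thread = new Thread(runnable, "indexing-sketch");
            thread.setDaemon(true); // do not keep the JVM alive in this demo
            return thread;
        });
    }

    // Queues a batch and returns immediately; the Future reports the outcome.
    Future<Boolean> addUpdate(List<Map<String, Object>> dataToIndex) {
        Callable<Boolean> task = () -> {
            // A real service would send dataToIndex to Solr here.
            return !dataToIndex.isEmpty();
        };
        return pool.submit(task);
    }

    // Blocks until the batch is processed; failures are reported as false.
    static boolean await(Future<Boolean> result) {
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Because the pool is shared, a pool size of 10 caps the total number of concurrent indexing tasks across all indexers and crawlers, not per indexer.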
Indexer configuration
You can configure an indexer in Configuration > /modules/content-indexer/config/indexers.
See an example configuration for indexing assets and folders in the DAM workspace (with the older Magnolia 5 UI) or the example configuration below for indexing content in the website workspace:
Node name | Value |
---|---|
modules | |
content-indexer | |
config | |
indexers | |
websiteIndexer | |
clients | |
default | default |
fieldMappings | |
abstract | abstract |
author | author |
date | date |
teaserAbstract | mgnlmeta_teaserAbstract |
text | content |
title | title |
enabled | false |
rootNode | / |
type | website |
workspace | website |
Property | Description |
---|---|
| | required |
| | required. Indicates whether indexing was done. When Solr finishes indexing, content-indexer sets this property to true. |
| | optional, default is JCR node type to index. For example, if you were indexing assets in the Magnolia DAM you would set this to |
| | optional When |
| | required. Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch. |
| | required. Sets the type of the indexed content such as When you search the index, you can filter results by type. |
| | required. A workspace to index. |
| | optional (Solr module version 7.0+). If configured, the indexer pre-creates fields in Solr according to the content-type definition during the initial (re)indexing. |
| | optional (Solr module version 7.0+), default is If set to |
| | optional (Solr module version 7.0+), default is If set to |
| | optional (Solr module version 7.0+). When no field mapping is set for |
| | optional. Custom IndexerService used by this indexer. If not defined, the global one is used. |
| | optional (Solr module version 7.0+), default is Size of a batch of documents sent to Solr in one request. |
| | optional (Solr module version 7.0+), default is Determines whether the commit parameter is included in the update request. |
| | optional (Solr module version 7.0+), default is Delay for an event listener which collects and indexes the changes of content of this indexer. |
| | optional. Defines how fields in Magnolia content are mapped to Solr fields. The left side in the mapping is Magnolia, the right side is Solr. If not set, see the |
| | You can use the fields available in the schema. If a field does not exist in Solr’s schema you can use a dynamic field For instance, if you have information nested in a deep leaf of your page stored with property The indexer contains a recursive call which explores the node’s children until it finds the property. |
| | optional, default is Solr clients which will be used by this indexer. Allows indexing content into multiple Solr instances. |
| | required. Name of the client. |
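The field mapping rules in the table (Magnolia property on the left, Solr field on the right, with a recursive search through child nodes) can be sketched in plain Java. This is an illustration under stated assumptions, not the indexer's actual code: a node is modelled as a nested Map in which a Map value is a child node and any other value is a property, and FieldMappingSketch with its methods is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the real indexer works on JCR nodes, not on maps.
final class FieldMappingSketch {
    private FieldMappingSketch() {}

    // Looks for the property on the node itself, then depth-first in its children.
    static Object findProperty(Map<String, Object> node, String propertyName) {
        Object direct = node.get(propertyName);
        if (direct != null && !(direct instanceof Map)) {
            return direct;
        }
        for (Object value : node.values()) {
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                Object found = findProperty(child, propertyName);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }

    // Builds a Solr document: keys are Solr field names, values come from Magnolia.
    static Map<String, Object> toSolrDocument(Map<String, Object> node,
                                              Map<String, String> fieldMappings) {
        Map<String, Object> document = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : fieldMappings.entrySet()) {
            Object value = findProperty(node, mapping.getKey());
            if (value != null) {
                document.put(mapping.getValue(), value);
            }
        }
        return document;
    }
}
```

With the example mapping text → content from the website indexer configuration, a text property stored several levels below the page node ends up in the Solr field content.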
Indexer status
The status of an indexer is stored under the Configuration > /modules/solr-search-provider/indexers-status/<indexer-name>@indexed property and indicates whether the initial indexing has been done for the specific indexer.
When Solr finishes indexing, the content-indexer sets the indexed property to true.
You can set it to false to trigger a re-indexing.
Crawler configuration
The crawler mechanism uses the Scheduler to crawl a site periodically.
You can configure crawlers in Configuration > /modules/content-indexer/config/crawlers.
|
Property | Description |
---|---|
| | required. When a crawler is enabled, the info.magnolia.module.indexer.CrawlerIndexerFactory registers a new scheduler job for the crawler automatically. |
| | required. The max depth of a page in terms of distance in clicks from the root page. This shouldn’t be too high, ideally 2 or 3 max. |
| | required. The max number of simultaneous crawler threads that crawl a site. |
| | optional, default value is Maximum number of outgoing links which are processed from a page. |
| | optional, default value is Delay in milliseconds between sending two requests to the same host. |
| | optional, default value is Socket timeout in milliseconds. |
| | optional, default value is Connection timeout in milliseconds. |
| | optional, default value is A User-Agent string used to represent your For more details, see en.wikipedia.org/wiki/User_agent. |
| | optional, default is An implementation of edu.uci.ics.crawler4j.crawler.WebCrawler used by the Crawler to crawl sites. |
| | optional, default is Name of the catalog where the command resides. |
| | optional, default is Command which is used to instantiate and trigger the Crawler. |
| | optional. If set to |
| | optional, default is 5. Defines the delay (in seconds) after which the crawler starts once activation is done. |
| | optional, default is "every hour": A CRON expression that specifies how often the site will be crawled. See also the CronMaker, a useful tool for building expressions. |
| | optional. Sets the type of the crawled content such as |
| | optional. Custom IndexerService used by this crawler. If not defined, the global one is used. |
| | optional (Solr module version 7.0+). Size of a batch of documents sent to Solr in one request. |
| | optional (Solr module version 7.0+) |
| | optional, default is Solr clients which will be used by this crawler. Allows indexing content into multiple Solr instances. |
| | required. Name of the client. |
| | required. Sites which will be crawled. |
| | required. Site name, an arbitrary node value. |
| | required. The URL of the site which will be crawled by this crawler. |
| | required. Defines how fields parsed from the site’s pages are mapped to Solr fields. The left side represents Solr fields, the right side the crawled pages. |
| | required. You can use any CSS selector to target an element on the page. For example, You can also use a custom syntax to get content inside attributes. For example, meta keywords are extracted using This will extract the first value of keywords meta element. If you don’t specify anything after the CSS selector, the text contained in the element is indexed. To get the value of a specific attribute, specify If you set |
| | optional. List of JCR items. If any of these items is activated, a crawler will be triggered. |
| | optional. Name of the JCR item. |
| | required. Workspace where the JCR item is stored. |
| | required. Path of the JCR item. |
| | optional. Authentication information to allow crawling a password-restricted area. |
| | required. Username which is used as a login for a restricted area. |
| | required. User password used to log in to a restricted area. |
| | required. A URL of a page with a login form. |
| | required, default is Name of the input field for entering the username in a login form. |
| | required, default is Name of the input field for entering the password in a login form. |
| | required, default is String which identifies the logout URL. The crawler doesn’t crawl over the URLs that contain the |
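The crawler field mapping value combines a CSS selector with an optional attribute suffix, as in meta[name=Description] attr(0,content) from the example configuration on this page. The Java sketch below shows one way such a value could be split into its parts. It is an assumption-laden illustration: the exact grammar accepted by the module is not documented here, the reading of attr(index,name) as "element index, attribute name" is inferred from the description above, and CrawlerFieldSelector is a hypothetical class. The selector part could then be handed to a CSS selector engine such as JSoup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for illustration; the module's own parsing may differ.
final class CrawlerFieldSelector {
    // "<css selector> attr(<index>,<attribute>)", with the suffix being optional.
    private static final Pattern ATTR_SUFFIX =
            Pattern.compile("(.*?)\\s+attr\\(\\s*(\\d+)\\s*,\\s*([^)\\s]+)\\s*\\)");

    final String cssSelector;
    final int elementIndex;
    final String attribute; // null means: index the text contained in the element

    private CrawlerFieldSelector(String cssSelector, int elementIndex, String attribute) {
        this.cssSelector = cssSelector;
        this.elementIndex = elementIndex;
        this.attribute = attribute;
    }

    static CrawlerFieldSelector parse(String mappingValue) {
        String trimmed = mappingValue.trim();
        Matcher matcher = ATTR_SUFFIX.matcher(trimmed);
        if (matcher.matches()) {
            return new CrawlerFieldSelector(
                    matcher.group(1).trim(),
                    Integer.parseInt(matcher.group(2)),
                    matcher.group(3));
        }
        // No attr(...) suffix: the whole value is the CSS selector.
        return new CrawlerFieldSelector(trimmed, 0, null);
    }
}
```

A value without the suffix, such as #story_continues_1, is treated as a plain selector whose element text is indexed.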
Example: Configuration to crawl https://www.bbc.co.uk.
Node name | Value |
---|---|
bbc_co_uk | |
clients | |
default | default |
sites | |
bbc | |
url | http://www.bbc.co.uk/ |
fieldMappings | |
abstract | #story_continues_1 |
keywords | meta[name=Description] attr(0,content) |
depth | 2 |
enabled | false |
nbrCrawlers | 2 |
type | news |
Configuration of crawler commands
You can configure crawler commands in Configuration > /modules/content-indexer/commands/.
By default, the crawler mechanism is connected with the CleanSolrIndexCommand to remove outdated documents (pages) from the index.
The CleanSolrIndexCommand is chained before the CrawlerIndexerCommand.
Node name | Value |
---|---|
modules | |
content-indexer | |
config | |
indexers | |
crawlers | |
commands | |
content-indexer | Note: Name of the folder is referenced by the crawler catalog property. |
<crawler-name> | Note: Name of the node is referenced by the crawler command property. |
cleanSolr | |
class | info.magnolia.search.solrsearchprovider.logic.commands.CleanSolrIndexCommand |
<crawler-name> | |
class | info.magnolia.module.indexer.crawler.commands.CrawlerIndexerCommand |
Property | Description |
---|---|
| | optional, default is Maximum number of documents to be checked. |
| | optional, default is If set to If the |
| | optional, default is If set to |
| | optional. List of status codes. If a page returns any of the status codes listed, then the page is removed from the index. By default, there is no list but if a page returns |
| | optional, default is If set to |
| | optional, default is If set to Normally, if the |
Crawling triggered by publishing
Crawlers can also be connected with the publishing process by adding info.magnolia.module.indexer.crawler.commands.CrawlerIndexerActivationCommand to the command chain of the publishing command.
By default, this is done for the following commands:
- catalog: default, command: publish
  - configured under /modules/publishing-core/commands/default/publish
- catalog: default, command: unpublish
  - configured under /modules/publishing-core/commands/default/unpublish
- catalog: default, command: personalizationActivation
  - configured under /modules/personalization-integration/commands/default/personalizationActivation
If you are using a custom publishing command and you wish to connect it with the crawler mechanism, you can use the info.magnolia.module.indexer.setup.AddCrawlerIntoCommandChainTask install/update task for it.
Creating a sitemap with Solr
To create a custom sitemap with Solr, please see Generating Custom sitemap with SOLR (restricted access).