Indexing and crawling a website with Solr

This page describes how to configure the Content Indexer submodule of the Magnolia Solr module to index Magnolia workspaces and crawl a website.

The Solr module allows you to use Apache Solr, a standalone enterprise-grade search server with a REST-like API, for indexing and crawling Magnolia content, especially if you need to manage assets in high volumes (100,000+ DAM assets).

Configuring Solr clients

The Solr module supports multiple Solr servers/cores.

You can configure a client for each server/core under Configuration > /modules/solr-search-provider/config/solrClientConfigs.

It’s recommended to have one client named default. This default client is used when no specific client is defined for the indexer, crawler, or search result page template.

If you need to have more servers/cores, duplicate the default client and change the baseURL property to point to another server/core.
Node name Value

solr-search-provider

     config

         solrClientConfigs

             default

                 allowCompression

false

                 baseURL

http://localhost:8983/solr/magnolia

                 connectionTimeout

100

                 soTimeout

1000

The value entered for the baseURL property must comply with the following syntax:

<protocol>://<domain_name>:<port>/solr/<solr_core_name>

If the Solr server is installed as described in Installing Apache Solr, then the value is http://localhost:8983/solr/magnolia.
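To check that the value points at a reachable core, you can ping it with SolrJ. The following is a minimal sketch, assuming SolrJ 8.x and the core name from the example above; it is not part of the module itself.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class SolrPingCheck {
    public static void main(String[] args) throws Exception {
        // Same value as the baseURL property of the client configuration.
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/magnolia").build()) {
            SolrPingResponse ping = client.ping();
            // Status 0 means the core answered successfully.
            System.out.println("Solr ping status: " + ping.getStatus());
        }
    }
}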

For a description of the other properties, see:

Indexing Magnolia workspaces

Since Solr module version 9.0, Magnolia supports light module or decoration-based Solr indexer configurations.

The Content Indexer module is a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The Content Indexer module also lets you crawl external websites using JSoup and CSS selectors. For each indexer or crawler you define field mappings that determine which properties are extracted from each node and how they are indexed in Solr.

IndexerService

Both the indexer and the crawler use the IndexerService to handle content indexing. A basic implementation, configured by default, is info.magnolia.search.solrsearchprovider.logic.indexer.SolrIndexerService.

You can define and configure your own IndexerService for specific needs.

Implement the IndexerService interface:

IndexerService
import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexerService {

   private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

   // Interface methods must be implemented as public.
   @Override
   public Future<Boolean> addUpdate(List<Map<String, Object>> dataToIndex, boolean commit) {
      // ...
   }

   // ...
}

For a globally configured indexing service, register the IndexerService in the configuration of the Content Indexer module. For a custom indexing service on a specific indexer or crawler, use the indexServiceClass property (see the property tables below).

Since Solr module version 7.0, the indexer service supports asynchronous indexing in multiple threads.

All indexers and crawlers share a pool of threads for asynchronous indexing. The pool size can be configured in the Content Indexer module configuration; the default number of threads is 10.

Node name Value

modules

     content-indexer

         config

             indexServiceClass

info.magnolia.search.solrsearchprovider.logic.indexer.SolrIndexerService

             indexersThreadPoolSize

10

Indexer configuration

You can configure an indexer in Configuration > /modules/content-indexer/config/indexers.

See an example configuration for indexing assets and folders in the DAM workspace (with the older Magnolia 5 UI), or the example configuration below for indexing content in the website workspace:
Node name Value

modules

     content-indexer

         config

             indexers

                 websiteIndexer

                     clients

                         default

default

                     fieldMappings

                         abstract

abstract

                         author

author

                         date

date

                         teaserAbstract

mgnlmeta_teaserAbstract

                         text

content

                         title

title

                     enabled

false

                     rootNode

/

                     type

website

                     workspace

website

Property Description

enabled

required

true enables the indexer configuration, false disables it.

indexed

required

Since Solr module version 9.0, this property is managed outside the definition, see Indexer status.

Indicates whether indexing was done. When Solr finishes indexing, the content-indexer sets this property to true. You can set it to false to trigger re-indexing.

nodeType

optional, default is mgnl:page

JCR node type to index. For example, if you were indexing assets in the Magnolia DAM you would set this to mgnl:asset.

assetProviderId

optional

When assetProviderId is specified, the assets are obtained from the provider and Solr will use Tika to extract the information from a document, for instance a PDF.

rootNode

required

Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch.

type

required

Sets the type of the indexed content such as website or documents.

When you search the index, you can filter results by type (see the query sketch after this table).

workspace

required

A workspace to index.

contentTypeName

optional (Solr module version 7.0+)

If configured, the indexer pre-creates fields in Solr according to content-type definition during the initial (re)indexing.

nestedIndexing

optional (Solr module version 7.0+), default is false

If set to true, the subcomponents and submodels are indexed as nested documents.

indexSubHierarchy

optional (Solr module version 7.0+), default is true

If nestedIndexing is set to false and indexSubHierarchy to true, then the subcomponents and submodels are indexed into a parent index.

If set to false, the subcomponents and submodels aren’t indexed at all.

ignorePropertiesRegEx

optional (Solr module version 7.0+)

When no field mapping is set for JcrIndexer, every property is indexed except jcr:* properties and properties matching this regex pattern.

indexServiceClass

optional

Custom IndexerService used by this indexer. If not defined, the global one is used.

batchSize

optional (Solr module version 7.0+), default is 1000

Size of a batch of documents sent to Solr in one request.

commit

optional (Solr module version 7.0+), default is true

Determines whether the commit parameter is included in the update request.

changeListenerDelay

optional (Solr module version 7.0+), default is 1000 milliseconds

Delay for the event listener that collects and indexes content changes for this indexer.

fieldMappings

optional

Defines how fields in Magnolia content are mapped to Solr fields. The left side of the mapping is the Magnolia property name, the right side the Solr field name.

If not set, see the ignorePropertiesRegEx property for the behavior.

     <Magnolia_field>

<Solr_field>

You can use the fields available in the schema. If a field does not exist in Solr’s schema, you can use a dynamic field mgnlmeta_*.

For instance, if you have information nested in a deep leaf of your page stored with the property specComponentAbstract, you can map this field with mgnlmeta_specComponentAbstract.

The indexer contains a recursive call that explores the node’s children until it finds the property (see the lookup sketch after this table).

clients

optional, default is default

Solr clients used by this indexer. Allows indexing content into multiple Solr instances.

     <client-name>

required

Name of the client.
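The fieldMappings lookup described above relies on a recursive search through a node’s subtree. The following is a minimal sketch of such a lookup using the plain JCR API, not Magnolia’s actual implementation; the property name is only an example.

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;

public class PropertyLookup {

    /**
     * Returns the value of the named property, checking the node itself
     * first and then its descendants depth-first; null if not found.
     */
    public static String findProperty(Node node, String name) throws RepositoryException {
        if (node.hasProperty(name)) {
            return node.getProperty(name).getString();
        }
        for (NodeIterator it = node.getNodes(); it.hasNext(); ) {
            String value = findProperty(it.nextNode(), name);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}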
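As noted for the type property above, the indexed type can be used to narrow search results. The following is a minimal SolrJ sketch, assuming the default client URL from the configuration at the top of this page and a type field populated by the indexer.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TypeFilteredSearch {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/magnolia").build()) {
            SolrQuery query = new SolrQuery("title:*");
            // Restrict results to documents indexed with type=website.
            query.addFilterQuery("type:website");
            QueryResponse response = client.query(query);
            System.out.println("Hits: " + response.getResults().getNumFound());
        }
    }
}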

Indexer status

The status of an indexer is stored under the Configuration > /modules/solr-search-provider/indexers-status/<indexer-name>@indexed property and indicates whether the initial indexing has been done for the specific indexer.

When Solr finishes indexing, the content-indexer sets the indexed property to true. You can set it to false to trigger a re-indexing.
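The flag can also be reset programmatically, for example from a script or task. The following is a minimal sketch using the Magnolia and JCR APIs, assuming an indexer named websiteIndexer as in the example above.

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

import info.magnolia.context.MgnlContext;

public class ReindexTrigger {
    public static void triggerReindex() throws RepositoryException {
        // The config workspace holds the indexer status nodes.
        Session session = MgnlContext.getJCRSession("config");
        Node status = session.getNode("/modules/solr-search-provider/indexers-status/websiteIndexer");
        // Resetting the flag makes the content indexer re-index from scratch.
        status.setProperty("indexed", false);
        session.save();
    }
}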

Crawler configuration

The crawler mechanism uses the Scheduler to crawl a site periodically.

You can configure crawlers in Configuration > /modules/content-indexer/config/crawlers.
Property Description

enabled

required

true enables the crawler, false disables it.

When a crawler is enabled, the info.magnolia.module.indexer.CrawlerIndexerFactory registers a new scheduler job for the crawler automatically.

depth

required

The max depth of a page in terms of distance in clicks from the root page. This shouldn’t be too high, ideally 2 or 3 max.

nbrCrawlers

required

The max number of simultaneous crawler threads that crawl a site, 2 or 3 is enough.

maxOutgoingLinksToFollow

optional, default value is 5000

Maximum number of outgoing links which are processed from a page.

politenessDelay

optional, default value is 200

Delay in milliseconds between sending two requests to the same host.

socketTimeout

optional, default value is 20000

Socket timeout in milliseconds.

connectionTimeout

optional, default value is 30000

Connection timeout in milliseconds.

userAgent

optional, default value is crawler4j

A User-Agent string that identifies your crawler (and the cleanSolr command) to web servers.

For more details, see en.wikipedia.org/wiki/User_agent.

crawlerClass

optional, default is info.magnolia.module.indexer.crawler.DefaultCrawler

An implementation of edu.uci.ics.crawler4j.crawler.WebCrawler used by the Crawler to crawl sites.

catalog

optional, default is content-indexer

Name of the catalog where the command resides.

command

optional, default is crawlerIndexer

Command which is used to instantiate and trigger the Crawler.

activationOnly

optional

If set to true, the crawler is triggered only during publication. No scheduler job is registered for this crawler.

The jcrItems property (see below) must be configured as well for this feature to work.

delayAfterActivation

optional, default is 5 seconds

Defines the delay (in seconds) after which the crawler starts once publication is done.

cron

optional, default is "every hour": 0 0 0/1 1/1 * ? *

A CRON expression that specifies how often the site will be crawled.

See also the CronMaker, a useful tool for building expressions.

type

optional

Sets the type of the crawled content such as news. When you search the index, you can filter results by type.

indexServiceClass

optional

Custom IndexerService used by this crawler. If not defined, the global one is used.

batchSize

optional (Solr module version 7.0+), default is 1000

Size of a batch of documents sent to Solr in one request.

commit

optional (Solr module version 7.0+), default is true

Determines whether the commit parameter is included in the update request.

clients

optional, default is default

Solr clients used by this crawler. Allows indexing content into multiple Solr instances.

     <client-name>

required

Name of the client.

sites

required

Sites which will be crawled.

     <site-name>

required

Site name, an arbitrary node value.

         url

required

The URL of the site which will be crawled by this crawler.

fieldMappings

required

Defines how fields parsed from the site’s pages are mapped to Solr fields. The left side of the mapping is the Solr field name, the right side a CSS selector that targets content in the crawled page.

     <site_field>

required

You can use any CSS selector to target an element on the page.

For example, #story_continues_1 targets an element by ID.

You can also use a custom syntax to get content from attributes.

For example, meta keywords are extracted using meta[name=keywords] attr(0,content).

This extracts the value of the content attribute from the first matching keywords meta element. If you don’t specify anything after the CSS selector, the text contained in the element is indexed.

meta[name=keywords] alone would return an empty string because a meta element does not contain any text; the keywords are in its attributes.

To get the value of a specific attribute, specify attr(<index>,<attribute_name>).

If you set index=-1, the attribute values of all matching elements are extracted, separated by a semicolon (;). See the JSoup sketch after this table.

jcrItems

optional

List of JCR items. If any of these items is activated, a crawler will be triggered.

     <item_name>

optional

Name of the JCR item.

         workspace

required

Workspace where the JCR item is stored.

         path

required

Path of the JCR item.

siteAuthenticationConfig

optional

Authentication information to allow crawling a password-restricted area.

     username

required

Username which is used as a login for a restricted area.

     password

required

User password used to log in to a restricted area.

     loginUrl

required

A URL of a page with a login form.

     usernameField

required, default is mgnlUserID

Name of the input field for entering the username in a login form.

     passwordField

required, default is mgnlUserPSWD

Name of the input field for entering the password in a login form.

     logoutUrlIdentifier

required, default is mgnlLogout

String which identifies the logout URL. The crawler skips URLs containing the logoutUrlIdentifier to avoid logging itself out.
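The crawler parses pages with JSoup, so the selector syntax above maps directly onto the JSoup API. The following is a minimal standalone sketch, not the crawler’s actual code, showing the equivalents of the two mapping styles; the URL and selectors are examples only.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.bbc.co.uk/").get();

        // Equivalent of the mapping value "#story_continues_1":
        // index the text contained in the selected element.
        Element story = doc.selectFirst("#story_continues_1");
        String abstractText = story != null ? story.text() : "";

        // Equivalent of "meta[name=Description] attr(0,content)":
        // take the content attribute of the first matching element.
        Elements metas = doc.select("meta[name=Description]");
        String description = metas.isEmpty() ? "" : metas.first().attr("content");

        System.out.println(abstractText);
        System.out.println(description);
    }
}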

Example: Configuration to crawl https://www.bbc.co.uk.

Node name Value

bbc_co_uk

     clients

         default

default

     sites

         bbc

             url

http://www.bbc.co.uk/

         fieldMappings

             abstract

#story_continues_1

             keywords

meta[name=Description] attr(0,content)

         depth

2

         enabled

false

         nbrCrawlers

2

         type

news

Configuration of crawler commands

You can configure crawler commands in Configuration > /modules/content-indexer/commands/.

By default, the crawler mechanism is connected with the CleanSolrIndexCommand, which removes outdated documents (pages) from the index.

The CleanSolrIndexCommand is chained before the CrawlerIndexerCommand.
Node name Value

modules

     content-indexer

         config

             indexers

             crawlers

         commands

             content-indexer

Note: Name of the folder is referenced by the crawler catalog property.

                 <crawler-name>

Note: Name of the node is referenced by the crawler command property.

                     cleanSolr

                         class

info.magnolia.search.solrsearchprovider.logic.commands.CleanSolrIndexCommand

                     <crawler-name>

                         class

info.magnolia.module.indexer.crawler.commands.CrawlerIndexerCommand

Table 1. cleanSolr command properties
Property Description

max

optional, default is 1000

Maximum number of documents to be checked.

onlyHead

optional, default is false

If set to true, only an HTTP HEAD request is sent for each page instead of fetching the whole page.

If the deleteNoIndex property is set to true, this configuration is ignored because the robots meta tag cannot be resolved from a HEAD request.

followRedirects

optional, default is false

If set to true, the redirects are followed and the status code of the last page is evaluated.

statusCodes

optional

List of status codes.

If a page returns any of the status codes listed, then the page is removed from the index.

By default, the list is empty, but a page that returns 404 at any time is always removed from the index.

deleteNoIndex

optional, default is false

If set to true, then also pages with the robots meta tag set to noindex are removed from the index.

skipIfAlreadyRunning

optional, default is false

If set to true and a CleanSolrIndexCommand is already running for the crawler, the running command is allowed to finish and the new one is skipped.

By default (false), the already running CleanSolrIndexCommand is stopped and a new one is started.

Crawling triggered by publishing

Crawlers can also be connected with the publishing process by adding info.magnolia.module.indexer.crawler.commands.CrawlerIndexerActivationCommand into the command chain of the publishing command.

By default, this is done for the following commands:

  • catalog: default, command: publish

    • configured under /modules/publishing-core/commands/default/publish

  • catalog: default, command: unpublish

    • configured under /modules/publishing-core/commands/default/unpublish

  • catalog: default, command: personalizationActivation

    • configured under /modules/personalization-integration/commands/default/personalizationActivation

If you are using a custom publishing command and wish to connect it with the crawler mechanism, use the info.magnolia.module.indexer.setup.AddCrawlerIntoCommandChainTask install/update task.

Creating a sitemap with Solr

To create a custom sitemap with Solr, please see Generating Custom sitemap with SOLR (restricted access).
