Indexing and crawling a website with Solr

This page describes how to configure the Content Indexer submodule of the Magnolia Solr module to index Magnolia workspaces and crawl a website.

The Solr module lets you use Apache Solr, a standalone enterprise-grade search server with a REST-like API, to index and crawl Magnolia content. It is particularly useful when you manage assets in high volumes (100,000+ DAM assets).

Configuring Solr clients

From version 5.2, the Solr module supports multiple Solr servers/cores.

You can configure a client for every server/core under Configuration > /modules/solr-search-provider/config/solrClientConfigs. It’s recommended to have one client named default. This default client is used when no specific client is defined for the indexer, crawler, or search result page template.
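The fallback to the default client can be pictured with a small sketch. The class and method names (ClientResolver, resolve) are hypothetical and not part of the Solr module. The map only stands in for the solrClientConfigs node.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Magnolia Solr module.
final class ClientResolver {

    static final String DEFAULT_CLIENT = "default";

    // Client name -> baseURL, mirroring the entries under solrClientConfigs.
    private final Map<String, String> clientBaseUrls;

    ClientResolver(Map<String, String> clientBaseUrls) {
        this.clientBaseUrls = new LinkedHashMap<>(clientBaseUrls);
    }

    // Returns the baseURL of the requested client. When no client name is
    // given, the client named "default" is used. Returns null if the
    // resolved client name is not configured.
    String resolve(String requestedClient) {
        boolean unspecified = requestedClient == null || requestedClient.trim().isEmpty();
        String name = unspecified ? DEFAULT_CLIENT : requestedClient;
        return clientBaseUrls.get(name);
    }
}
```

An indexer, crawler, or search result page template that names no client would resolve to the default entry in this model.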

If you need to have more servers/cores, duplicate the default client and change the baseURL property to point to another server/core.

Node name                               Value

solr-search-provider
     config
         solrClientConfigs
             default
                 allowCompression       false
                 baseURL                http://localhost:8983/solr/magnolia
                 connectionTimeout      100
                 soTimeout              1000

The value entered for the baseURL property should conform with the following syntax:

<protocol>://<domain_name>:<port>/solr/<solr_core_name>

If the Solr server is installed as described in Installing Apache Solr, then the value is http://localhost:8983/solr/magnolia. For a description of the other properties see the HttpSolrClient.Builder Javadoc and Using SolrJ - Common Configuration Options.
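To check a baseURL value against the syntax above before saving it, a sketch along these lines may help. The class BaseUrlCheck is hypothetical and reads the syntax strictly, so it assumes an explicit port and exactly one core name after /solr/.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical validation helper; not part of the Magnolia Solr module.
final class BaseUrlCheck {

    // Returns true when the value has the shape
    // <protocol>://<domain_name>:<port>/solr/<solr_core_name>.
    static boolean matchesSolrSyntax(String baseUrl) {
        try {
            URI uri = new URI(baseUrl);
            String path = uri.getPath();
            return uri.getScheme() != null
                    && uri.getHost() != null
                    && uri.getPort() != -1          // -1 means no explicit port
                    && path != null
                    && path.matches("/solr/[^/]+/?"); // exactly one core segment
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

With this reading, http://localhost:8983/solr/magnolia passes, while a URL without a port or without a core name does not.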

Indexing Magnolia workspaces

The Content Indexer module is both a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The Content Indexer also lets you crawl external websites using JSoup and CSS selectors. For each indexer or crawler, you define field mappings that determine which values are read from each node or page and stored in the Solr index.

IndexService

Both the indexer and the crawler use the IndexService to handle the indexing of content. A basic implementation is configured by default: info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService.

You can define and configure your own IndexService for specific needs.

Implement the IndexService interface:

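import javax.jcr.Node;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// IndexerConfig is provided by the Content Indexer module; add the import that matches your module version.
public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexService {

   private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

   @Override
   public boolean index(Node node, IndexerConfig config) {
      // ...
   }
}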

To use a custom indexing service for a single indexer or crawler, set its indexServiceClass property (see the indexer and crawler property tables below). For a globally configured indexing service, register the IndexService in the configuration of the Content Indexer module:

Node name                       Value

modules
     content-indexer
         config
             indexServiceClass  info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService

Indexer configuration

You can configure an indexer in Configuration > /modules/content-indexer/config/indexers.

See an example configuration for indexing assets and folders in the DAM workspace. The example configuration below indexes content in the website workspace:

Node name                                   Value

modules
     content-indexer
         config
             indexers
                 websiteIndexer
                     clients
                         default            default
                     fieldMappings
                         abstract           abstract
                         author             author
                         date               date
                         teaserAbstract     mgnlmeta_teaserAbstract
                         text               content
                         title              title
                     enabled                false
                     pull                   false
                     rootNode               /
                     type                   website
                     workspace              website


Property Description

enabled

required

true enables the indexer configuration. false disables the indexer configuration.

indexed

required

Indicates whether indexing was done. When Solr finishes indexing, the Content Indexer sets this property to true. Set it to false to trigger re-indexing.

nodeType

optional, default is mgnl:page

JCR node type to index. For example, if you were indexing assets in the Magnolia DAM you would set this to mgnl:asset.

pull

optional, default is false (push)

Pull URLs instead of pushing. When true, Solr uses Tika to extract information from a document, for instance a PDF. When false, the indexer pushes the collected information to Solr as a Solr document.

assetProviderId

optional, default is `jcr`

If pull is set to true, specify an assetProviderId to obtain an asset correctly.

rootNode

required

Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch.

type

required

Sets the type of the indexed content such as website or documents. When you search the index you can filter results by type.

workspace

required

Workspace to index.

indexServiceClass

optional (Solr module version 5.2+)

Custom IndexService used by this indexer. If not defined, the global one is used.

fieldMappings

required

Field mappings define how fields in Magnolia content are mapped to Solr fields. The left side is the Magnolia field, the right side is the Solr field.

     <Magnolia_field>

<Solr_field>

You can use the fields available in the schema. If a field does not exist in Solr’s schema, you can use a dynamic field mgnlmeta_*. For instance, if you have information nested in a deep leaf of your page and stored in the property specComponentAbstract, you can map this field to mgnlmeta_specComponentAbstract. The indexer makes a recursive call that explores the node’s child leaves until it finds the property.

clients

optional, default is `default` (Solr module version 5.2+)

Solr clients used by this indexer. Lets you index content into multiple Solr instances.

     <client-name>

required

Name of the client.
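The recursive lookup described for fieldMappings can be sketched as a depth-first search. This is only an illustration of the idea, not the indexer's actual code: ContentNode is a hypothetical stand-in for a JCR node, and the order in which children are visited is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a JCR node: a bag of properties plus child nodes.
final class ContentNode {

    final Map<String, String> properties = new HashMap<>();
    final List<ContentNode> children = new ArrayList<>();

    ContentNode with(String name, String value) {
        properties.put(name, value);
        return this;
    }

    ContentNode child(ContentNode node) {
        children.add(node);
        return this;
    }

    // Depth-first search: check the node itself, then descend into its
    // children until the first node carrying the property is found.
    static Optional<String> findProperty(ContentNode node, String name) {
        String own = node.properties.get(name);
        if (own != null) {
            return Optional.of(own);
        }
        for (ContentNode childNode : node.children) {
            Optional<String> found = findProperty(childNode, name);
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty();
    }
}
```

In this model, a mapping from specComponentAbstract to mgnlmeta_specComponentAbstract would take the value found by findProperty(page, "specComponentAbstract") and store it in the dynamic Solr field.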

Crawler configuration

The crawler mechanism uses the Scheduler to crawl a site periodically.

You can configure crawlers in Configuration > /modules/content-indexer/config/crawlers.
Property Description

enabled

required

true enables the crawler. false disables the crawler.

When a crawler is enabled, info.magnolia.module.indexer.CrawlerIndexerFactory automatically registers a new scheduler job for it.

depth

required

The maximum depth of a page, measured as the number of clicks from the root page. Keep this low, ideally 2 or 3 at most.

nbrCrawlers

required

The max number of simultaneous crawler threads that crawl a site. 2 or 3 is enough.

maxOutgoingLinksToFollow

optional, default value is 5000

Maximum number of outgoing links which are processed from a page.

politenessDelay

optional, default value is 200

Delay in milliseconds between sending two requests to the same host.

socketTimeout

optional, default value is 20000

Socket timeout in milliseconds.

connectionTimeout

optional, default value is 30000

Connection timeout in milliseconds.

userAgent

optional, default value is "crawler4j (https://github.com/yasserg/crawler4j/)"

A User-Agent string used to represent your crawler and the clean command to web servers.

For more details, see en.wikipedia.org/wiki/User_agent.

crawlerClass

optional, since version 3.0, default value is info.magnolia.module.indexer.crawler.DefaultCrawler

Implementation of edu.uci.ics.crawler4j.crawler.WebCrawler which is used by the Crawler to crawl sites.

catalog

optional, since version 3.0, default value is content-indexer

Name of the catalog where the command resides.

command

optional, since version 3.0, default value is crawlerIndexer

Command which is used to instantiate and trigger the Crawler.

activationOnly

optional, since version 3.0

If set to true, the crawler is triggered only during publication. No scheduler job is registered for this crawler.

The jcrItems property (see below) must also be configured for this feature to work.

delayAfterActivation

optional, since version 3.0, default value is 5s

Defines the delay (in seconds) between the end of activation and the start of the crawler.

cron

optional, default is every hour `0 0 0/1 1/1 * ? *`

A CRON expression that specifies how often the site will be crawled. CronMaker is a useful tool for building expressions.

type

optional

Sets the type of the crawled content such as news. When you search the index you can filter results by type.

indexServiceClass

optional, since version 5.2

Custom IndexService used by this crawler. If not defined, the global one is used.

clients

optional, since version 5.2, default is default client

Solr clients used by this crawler. Lets you index content into multiple Solr instances.

     <client-name>

required

Name of the client.

fieldMappings

required

Field mappings define how fields parsed from the site pages are mapped to Solr fields. The left side is the Solr field, the right side is the selector applied to the crawled site.

     <site_field>

required

You can use any CSS selector to target an element on the page. For example, #story_continues_1 targets an element by ID.

You can also use custom syntax to get content inside attributes. For example, meta keywords are extracted using meta[name=keywords] attr(0,content). This extracts the first value of the keywords meta element. If you don’t specify anything after the CSS selector, the text contained in the element is indexed. meta[name=keywords] alone would return an empty string because a meta element does not contain any text; the keywords are in its attributes. To get the value of a specific attribute, specify attr(<index>,<Solr_field_name>). If you set index=-1, all attributes are extracted and separated by a semicolon (;).

jcrItems

optional, since version 3.0

List of JCR items. If any of these items is activated, the crawler is triggered.

     <item_name>

optional, since version 3.0

Name of the JCR item.

         workspace

required, since version 3.0

Workspace where JCR item is stored.

         path

required, since version 3.0

Path of the JCR item.

siteAuthenticationConfig

optional, since version 5.0.2

Authentication information that allows the crawler to access a password-restricted area.

     username

required, since version 5.0.2

Username used to log in to the restricted area.

     password

required, since version 5.0.2

Password used to log in to the restricted area.

     loginUrl

required, since version 5.0.2

URL of the page with the login form.

     usernameField

required, since version 5.0.2, default value is mgnlUserID

Name of the input field for the username in the login form.

     passwordField

required, since version 5.0.2, default value is mgnlUserPSWD

Name of the input field for the password in the login form.

     logoutUrlIdentifier

required, since version 5.0.2, default value is mgnlLogout

String that identifies the logout URL. To avoid being logged out, the crawler does not visit URLs that contain the logoutUrlIdentifier.
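The default cron value `0 0 0/1 1/1 * ? *` is easier to read when split into its fields. The sketch below assumes the Scheduler accepts Quartz-style expressions, with six mandatory fields and an optional year. The class CronFields is hypothetical and only labels the fields; it does not validate their contents.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that labels the fields of a Quartz-style cron expression.
final class CronFields {

    private static final String[] NAMES = {
        "seconds", "minutes", "hours", "dayOfMonth", "month", "dayOfWeek", "year"
    };

    // Splits the expression on whitespace and pairs each part with its field name.
    static Map<String, String> describe(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException("Expected 6 or 7 fields but got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }
}
```

For the default value, the hours field is 0/1 (every hour, starting at hour 0) and the seconds and minutes fields are 0, which matches the documented "every hour" schedule.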

Example: Configuration to crawl www.bbc.co.uk

Node name                       Value

bbc_co_uk
     clients
         default                default
     sites
         bbc
             url                http://www.bbc.co.uk/
         fieldMappings
             abstract           #story_continues_1
             keywords           meta[name=Description] attr(0,content)
         depth                  2
         enabled                false
         nbrCrawlers            2
         type                   news
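The crawler field mapping values combine a CSS selector with an optional attr(...) suffix. The sketch below shows one way such a value could be split into its parts. It is not the crawler's own parser: the class SelectorMapping and the regular expression are assumptions, and the second attr argument is treated as the attribute name, as in attr(0,content).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for values such as "meta[name=Description] attr(0,content)".
final class SelectorMapping {

    // Group 1: CSS selector. Groups 2 and 3: optional attr(<index>,<name>) suffix.
    private static final Pattern SYNTAX =
            Pattern.compile("^(.*?)(?:\\s+attr\\((-?\\d+)\\s*,\\s*([^)]+)\\))?$");

    final String cssSelector;
    final Integer attrIndex; // null when the element text is indexed
    final String attrName;   // null when the element text is indexed

    private SelectorMapping(String cssSelector, Integer attrIndex, String attrName) {
        this.cssSelector = cssSelector;
        this.attrIndex = attrIndex;
        this.attrName = attrName;
    }

    static SelectorMapping parse(String expression) {
        Matcher matcher = SYNTAX.matcher(expression.trim());
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unrecognized mapping: " + expression);
        }
        Integer index = matcher.group(2) == null ? null : Integer.valueOf(matcher.group(2));
        String name = matcher.group(3) == null ? null : matcher.group(3).trim();
        return new SelectorMapping(matcher.group(1).trim(), index, name);
    }

    // True when the mapping reads an attribute instead of the element text.
    boolean usesAttribute() {
        return attrIndex != null;
    }
}
```

Parsing #story_continues_1 yields a selector with no attribute part, so the element text would be indexed. Parsing meta[name=Description] attr(0,content) yields the selector plus index 0 and the name content.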

Configuration of crawler commands

You can configure crawler commands in Configuration > /modules/content-indexer/commands/.

By default, the crawler mechanism is connected to the CleanSolrIndexCommand, which removes outdated entries (pages) from the index. The CleanSolrIndexCommand is chained before the CrawlerIndexerCommand.

Node name                               Value

modules
     content-indexer
         config
             indexers
             crawlers
         commands
             content-indexer            Note: Name of the folder is referenced by the crawler catalog property.
                 <crawler-name>         Note: Name of the node is referenced by the crawler command property.
                     cleanSolr
                         class          info.magnolia.search.solrsearchprovider.logic.commands.CleanSolrIndexCommand
                     <crawler-name>
                         class          info.magnolia.module.indexer.crawler.commands.CrawlerIndexerCommand

Table 1. cleanSolr command properties
Property Description

max

optional, since version 5.0.1, default value is 500

Maximum number of documents to be checked.

onlyHead

optional, since version 5.5.1, default value is false

If set to true, only the head of the page is requested instead of fetching the whole page.

If the deleteNoIndex property is set to true, then this configuration is ignored, because the robots meta tag cannot be resolved from the head request.

followRedirects

optional, since version 5.5.2, default value is false

If set to true, redirects are followed and the status code of the last page is evaluated.

statusCodes

optional, since version 5.0.1

List of status codes. If a page returns any of the status codes listed, then the page will be removed from the index.

By default, there is no list but if a page returns 404 at any time, the page is removed from the index.

deleteNoIndex

optional, since version 5.5.4, default value is false

If set to true, then also pages with the robots meta tag set to noindex will be removed from the index.

skipIfAlreadyRunning

optional, since version 5.5.1, default value is false

If set to true, a CleanSolrIndexCommand that is already running for the crawler is left to finish and the new one is skipped.

By default, an already running CleanSolrIndexCommand is stopped and a new one is started.
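How the cleanSolr properties interact can be summarized in a small decision sketch. It follows the property descriptions above literally and is not taken from CleanSolrIndexCommand itself. The class and method names are hypothetical, and treating 404 as always removable is one reading of the statusCodes description.

```java
import java.util.Set;

// Hypothetical model of the cleanSolr rules; not the module's implementation.
final class CleanDecision {

    // A page is removed when it returns 404, when its status code is in the
    // configured statusCodes list, or when deleteNoIndex is on and the page
    // carries a robots meta tag set to noindex.
    static boolean shouldRemove(int status, Set<Integer> statusCodes,
                                boolean deleteNoIndex, boolean robotsNoIndex) {
        if (status == 404) {
            return true;
        }
        if (statusCodes.contains(status)) {
            return true;
        }
        return deleteNoIndex && robotsNoIndex;
    }

    // onlyHead is ignored when deleteNoIndex is true, because the robots
    // meta tag cannot be read from a HEAD response.
    static boolean useHeadRequest(boolean onlyHead, boolean deleteNoIndex) {
        return onlyHead && !deleteNoIndex;
    }
}
```

Under these assumptions, a page returning 200 with a noindex robots tag stays in the index unless deleteNoIndex is true.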

Crawling triggered by publishing

From version 3.0, crawlers can also be connected to the publishing (activation) process by adding info.magnolia.module.indexer.crawler.commands.CrawlerIndexerActivationCommand to the command chain that contains the publishing (activation) command. By default, this is done for these commands:

  • If you are using the Publishing module:

    • catalog: default, command: publish - configured under /modules/publishing-core/commands/default/publish

    • catalog: default, command: unpublish - configured under /modules/publishing-core/commands/default/unpublish

  • catalog: default, command: personalizationActivation - configured under /modules/personalization-integration/commands/default/personalizationActivation

If you are using a custom publishing (activation) command and want to connect it to the crawler mechanism, you can use the info.magnolia.module.indexer.setup.AddCrawlerIntoCommandChainTask install/update task to do so.

Creating a sitemap with Solr

To create a custom sitemap with Solr, please refer to the page called Generating Sitemap with SOLR - SOLR Module available on the Magnolia Community Wiki.
