Major Elasticsearch Package Releases

Version 7 of the Elasticsearch ContentRepositoryAdaptor is a major overhaul of the package that adds compatibility with modern Elasticsearch versions, improved concepts and a ton of new features. You really should check it out!


With nearly 100,000 installations, the Flowpack.ElasticSearch.ContentRepositoryAdaptor is one of the most commonly used third-party packages for Neos.

The package is not only used to implement feature-rich full-text search for web projects; it also provides a powerful API to find, filter and sort Nodes. Thanks to the speed and scalability of Elasticsearch, results are delivered within a few milliseconds, independent of the amount of content stored. This makes the package crucial for large Neos instances.

Over the last few months I worked intensively on the Elasticsearch packages, merged and improved long-pending pull requests, added some great new features and made the package compatible with Elasticsearch versions 6 and 7. I am very excited that the packages are finally released and that I can present the new features to you today.

The work was partly funded by members of the Neos community. Thank you very much! If you enjoy Neos and use it in your projects, you should really consider becoming a sponsor as well.

Be aware that some of the information in the remainder of this post is rather technical and can be hard to follow if you are not familiar with Elasticsearch concepts.

But this should not scare you off, as most of the techniques and APIs you need to integrate search into your project are described thoroughly in the restructured Readme. And if you have further questions, don't hesitate to ask in the forum or the #guild-search channel on Slack.

Elasticsearch 6.x and 7.x compatibility

Up to version 6 of the adaptor, we only supported Elasticsearch version 5, which has been unmaintained for over a year now. With every major version, Elastic introduces breaking changes that have more or less impact on our use case. Among other things, the ability to have multiple mapping types per index was removed in 6.0. We had used these mapping types to map the NodeTypes.

Now the NodeTypes are mapped to an explicit field named neos_type. This also makes it easy to filter by an exact type instead of taking the complete NodeType inheritance into account.

With this version, support for Elasticsearch 5.x is dropped. You can still use version 6 of the adaptor, which is compatible with the latest Neos version and Elasticsearch 5.x.
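
To illustrate the difference, here is a minimal sketch of two raw Elasticsearch request bodies, written as Python dictionaries. It assumes that neos_type is mapped so that a term query matches the exact NodeType name, and that neos_type_and_supertypes (listed in the field table further down) holds the inheritance chain. The helper names are hypothetical, and the requests the adaptor actually sends may look different.

```python
import json


def exact_type_query(node_type: str) -> dict:
    """Match only documents whose own NodeType equals node_type."""
    return {"query": {"bool": {"filter": [{"term": {"neos_type": node_type}}]}}}


def type_or_subtype_query(node_type: str) -> dict:
    """Match documents of node_type or of any NodeType inheriting from it."""
    return {
        "query": {
            "bool": {"filter": [{"term": {"neos_type_and_supertypes": node_type}}]}
        }
    }


if __name__ == "__main__":
    # The NodeType name is only a placeholder for this illustration.
    print(json.dumps(exact_type_query("Neos.Neos:Document"), indent=2))
```

The first body only hits nodes of exactly that type; the second also returns nodes whose type merely inherits from it.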

Kibana Visualisation of the NodeTypes used in the Neos Demo page.

One Index per dimension combination

Previously, all nodes were ingested into a single index, regardless of the dimension combination they belonged to. Now there is an index for every dimension combination. This change is huge when it comes to language dimensions: you can now configure filters and analyzers specifically for every language dimension.

In order to address the index configuration for a dimension combination, you add the dimension-combination-hash to the index name:

Flowpack:
  ElasticSearch:
    indexes:
      default:
        'neoscontentrepository-0359ed5c416567b8bc2e5ade0f277b36': # The hash specifies the dimension combination
          settings:
            analysis:
              filter:
                elision:
                  type: 'elision'
                  articles: [ 'l', 'm', 't', 'qu', 'n', 's', 'j', 'd' ]
              analyzer:
                custom_french_analyzer:
                  tokenizer: 'letter'
                  filter: [ 'asciifolding', 'lowercase', 'french_stem', 'elision', 'stop' ]
                tag_analyzer:
                  tokenizer: 'keyword'
                  filter: [ 'asciifolding', 'lowercase' ]

Specific configuration for a dimension combination.

If you don't care about dimension-combination-specific index configuration, you can still add configuration that applies to all indices:

Flowpack:
  ElasticSearch:
    indexes:
      default:
        neoscontentrepository:
          settings:
            index:
              number_of_shards: 1
              number_of_replicas: 0

Default configuration without content dimension specifics

The separate indices are referenced via aliases, so in the example above, requests against neoscontentrepository address documents in all the separate dimension indices.

Most of this feature was contributed by Dominique Feyer.

Exclude NodeTypes from indexing

Nodes that don't contribute to the fulltext index and are not searched directly don't need to be indexed in the first place.
These nodes can now be excluded from indexing, which may speed up reindexing tremendously.

The mechanism is kept rather simple. You can define a default, a configuration per package and a configuration per node type. The most specific setting takes precedence.

Neos:
  ContentRepository:
    Search:
      defaultConfigurationPerNodeType:

        # default
        '*':
          indexed: true

        # Exclude a complete package
        'My.Package:*':
          indexed: false

        # Neos
        'Neos.Neos:FallbackNode':
          indexed: false
        'Neos.Neos:Shortcut':
          indexed: false
        'Neos.Neos:ContentCollection':
          indexed: false
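
The precedence rule can be sketched in a few lines of Python. This is not the package's implementation, only an illustration of "most specific wins" under the assumption that an exact NodeType name beats the package wildcard, which in turn beats the global '*' default. The function name and the sample NodeTypes are hypothetical.

```python
def is_indexed(node_type: str, config: dict) -> bool:
    """Resolve the `indexed` flag: exact name, then package wildcard, then '*'."""
    package = node_type.split(":", 1)[0]
    for key in (node_type, f"{package}:*", "*"):
        if key in config and "indexed" in config[key]:
            return config[key]["indexed"]
    return True  # assumption: index the node when nothing is configured


CONFIG = {
    "*": {"indexed": True},
    "My.Package:*": {"indexed": False},
    "My.Package:Article": {"indexed": True},
    "Neos.Neos:Shortcut": {"indexed": False},
}

if __name__ == "__main__":
    for name in ("My.Package:Teaser", "My.Package:Article", "Neos.Neos:Shortcut"):
        print(name, is_indexed(name, CONFIG))
```

With this lookup order, a single NodeType can be re-enabled even though its whole package is excluded.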

Query Time Boosting

For fulltext search, we have the concept of assigning content to different buckets and then boosting these buckets individually according to their relevance.
Previously boosting was defined on fields and applied at index-time. This is a deprecated approach and rather cumbersome, as you have to rebuild the index for every adjustment.

Now boosting is defined at search-time. It can be configured among a variety of other options by parametrizing the query object in the settings.

    queryStringParameters:
      default_operator: or
      fields:
        - neos_fulltext.h1^20
        - neos_fulltext.h2^12
        - neos_fulltext.h3^10
        - neos_fulltext.h4^5
        - neos_fulltext.h5^3
        - neos_fulltext.h6^2
        - neos_fulltext.text^1
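
For readers less familiar with the Elasticsearch side, the following sketch shows how such settings map onto a plain query_string query. It builds the request body as a Python dictionary; the bucket weights are copied from the configuration above, while the function name is hypothetical and the body the package really generates may contain additional clauses.

```python
import json


def fulltext_query(search_term: str) -> dict:
    """Build a query_string request body with per-bucket boosts."""
    boosts = {"h1": 20, "h2": 12, "h3": 10, "h4": 5, "h5": 3, "h6": 2, "text": 1}
    return {
        "query": {
            "query_string": {
                "query": search_term,
                "default_operator": "or",
                # "field^boost" is the standard Elasticsearch boost syntax
                "fields": [f"neos_fulltext.{bucket}^{weight}"
                           for bucket, weight in boosts.items()],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(fulltext_query("neos"), indent=2))
```

Because the boosts live in the query, changing a weight only changes the request and does not require rebuilding the index.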

Exclude properties from being mapped and indexed

A small feature that can have a big impact. If you set the indexing parameter of a NodeType property explicitly to false, the property is not mapped to the index at all and also not indexed.

You may run into mapping conflicts when two properties are defined with different mapping types, which can happen if third-party packages define a property with the same name but a different type. You can now deactivate the indexing of that property completely and instead map and index it to a custom field.

'Neos.Neos:Node':
  properties:
    'conflictingProperty':
      search:
        indexing: false

    'es_customProperty':
      search:
        indexing: '${some.compatibility.eel.code}'
        elasticSearchMapping:
          type: keyword

Better indexing of asset content

Previously, there was an option to ingest attachments by using the deprecated elasticsearch-mapper-attachments plugin and setting the field type to attachment. This plugin is no longer available for Elasticsearch versions 6 and 7.

Now there is an Indexing helper extractAssetContent(asset), which takes an Asset object or an array of Assets and returns the assets' content. Internally it uses an ingest pipeline and the Ingest Attachment Processor Plugin to extract the indexable data. The asset content can thus be indexed to a field or to one of the fulltext buckets.

The Ingest Attachment Processor Plugin not only returns the pure content, but can also provide different types of metadata. With the second parameter `field` in extractAssetContent(asset, field) you can return this additional metadata.

properties:
  file:
    type: 'Neos\Media\Domain\Model\Asset'
    search:
      fulltextExtractor: "${Array.concat(Indexing.extractInto('text', Indexing.extractAssetContent(value)), Indexing.extractInto('h2', Indexing.extractAssetContent(value, 'keywords')))}"

The field parameter can be set to one of the following values: content, title, name, author, keywords, date, content_type, content_length and language.
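
If you want to see what the attachment processor extracts from a file, you can try it outside of Neos with the Elasticsearch simulate API (POST /_ingest/pipeline/_simulate). The sketch below only builds the request body in Python. The source field name "data" and the function name are assumptions made for this example, and the pipeline the package sets up internally may be configured differently.

```python
import base64
import json


def simulate_attachment_request(file_bytes: bytes) -> dict:
    """Body for POST /_ingest/pipeline/_simulate using the attachment processor."""
    return {
        "pipeline": {
            "description": "Extract text and metadata from a binary file",
            # "data" is an assumed field name holding the base64-encoded file
            "processors": [{"attachment": {"field": "data"}}],
        },
        "docs": [
            {"_source": {"data": base64.b64encode(file_bytes).decode("ascii")}}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(simulate_attachment_request(b"hello"), indent=2))
```

The response of such a request contains the extracted values under an attachment object, which is where fields like content, title or keywords come from.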

Indexer consumes less memory

Indexing lots of nodes with the ./flow nodeindex:build command tended to hit memory limits rather quickly due to memory leaks. Now the indexing is split up into several sub-commands for applying the mapping and for indexing every workspace and dimension combination. With that, a lot more nodes can be indexed with the same amount of RAM available.

Internal field names now comply with the Beats naming convention

Neos-internal meta properties of nodes were previously indexed with a leading underscore. Meta properties created by the Elasticsearch ContentRepositoryAdaptor were indexed with two leading underscores.

Elasticsearch also uses underscore-prefixed fields for its own internal metadata. That is why these fields could not be analyzed with Kibana, which sometimes made it hard to use the tool to analyze and debug your index.

Now all internal properties are prefixed with "neos_" and use snake_case, complying with the Beats naming convention.

Mapping table from old to new field names:

Old                          New
__identifier                 neos_node_identifier
__parentPath                 neos_parent_path
__path                       neos_path
__typeAndSupertypes          neos_type_and_supertypes
__workspace                  neos_workspace
_creationDateTime            neos_creationdate_time
_hidden                      neos_hidden
_hiddenBeforeDateTime        neos_hidden_before_datetime
_hiddenAfterDateTime         neos_hidden_after_datetime
_hiddenInIndex               neos_hidden_in_index
_lastModificationDateTime    neos_last_modification_datetime
_lastPublicationDateTime     neos_last_publication_datetime
__fulltextParts              neos_fulltext_parts
__fulltext                   neos_fulltext

 

Examine the indexed data in Kibana with correct field-type declarations.

There is also a node migration provided. To update the packages in your project run:

./flow core:migrate YOUR.PACKAGE --version 20200513223401

New Querybuilder Operations

There are two new Eel operations for the query builder that build Elasticsearch filters:

Prefix Filter

prefix('propertyName', 'prefix', [clauseType])

does a prefix match, for example to search for the first n characters of a keyword.

Geodistance Filter

geoDistance(propertyName, geoPoint, distance, [clauseType])

can be used to filter records by geographical distance to a given geoPoint.
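
As a rough orientation, this is what the corresponding filter clauses look like in the plain Elasticsearch query DSL, built here as Python dictionaries. The property names "tag" and "location" and the helper functions are hypothetical, the geo property is assumed to be mapped as geo_point, and the exact request emitted by the query builder may differ.

```python
import json


def prefix_filter(property_name: str, prefix: str) -> dict:
    """Elasticsearch prefix query on a keyword-like field."""
    return {"prefix": {property_name: prefix}}


def geo_distance_filter(property_name: str, lat: float, lon: float,
                        distance: str) -> dict:
    """Elasticsearch geo_distance query; the field must be a geo_point."""
    return {
        "geo_distance": {
            "distance": distance,
            property_name: {"lat": lat, "lon": lon},
        }
    }


def filtered_query(*filters: dict) -> dict:
    """Wrap the given clauses in a bool query's filter context."""
    return {"query": {"bool": {"filter": list(filters)}}}


if __name__ == "__main__":
    body = filtered_query(
        prefix_filter("tag", "neo"),
        geo_distance_filter("location", 51.05, 13.74, "20km"),
    )
    print(json.dumps(body, indent=2))
```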

New Flow Commands

There are some new Flow CLI commands that helped me a lot while implementing and testing the new features, and I am pretty sure they are very handy when integrating search into projects.

Show dimension to index mapping

./flow nodeindexmapping:indices

Shows the mapping between the project's dimension presets and the resulting index names.

flow_indices.png

Show indexable nodes

./flow nodetype:showIndexableConfiguration

The command evaluates the indexed section of the defaultConfigurationPerNodeType configuration for all available NodeTypes and shows the result.

flow_show_indexable_configuration-946x731.png

Fulltext search via CLI

./flow search:fulltext <searchWord>

Performs a fulltext search and displays the results. This can be very handy for debugging the indexing configuration and the fulltext analysis independently of how your frontend search is rendered.

flow_search_fulltext.png

Examine the indexed data

./flow search:viewnode <nodeIdentifier> [<dimensionCombinationAsJson>] [<field>]

Now that you have the node identifier of your search result, you can use ./flow search:viewnode to get all contents that are indexed for a given node. The optional field parameter can be used to display only a certain field.

flow_search_viewnode.png

Specify length / size of bulk requests

A limit for the number of bulk request parts, as well as for the content length of a bulk request, can now be specified to fit your Elasticsearch setup.

Flowpack:
  ElasticSearch:
    ContentRepositoryAdaptor:
      indexing:
        batchSize:
          elements: 50
          octets: 40000000
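
The idea behind the two limits can be sketched as a small batching function. This is an illustration and not the package's code: it assumes that a batch is flushed as soon as adding another part would exceed either the element count or the byte size, and that a single oversized part is still sent on its own. The function and parameter names are hypothetical.

```python
from typing import Iterable, Iterator, List


def batch_bulk_parts(parts: Iterable[str], max_elements: int = 50,
                     max_octets: int = 40_000_000) -> Iterator[List[str]]:
    """Group serialized bulk parts so no batch exceeds either limit."""
    batch: List[str] = []
    size = 0
    for part in parts:
        part_size = len(part.encode("utf-8"))  # size in octets, not characters
        if batch and (len(batch) >= max_elements or size + part_size > max_octets):
            yield batch
            batch, size = [], 0
        batch.append(part)
        size += part_size
    if batch:
        yield batch


if __name__ == "__main__":
    docs = ['{"title": "node %d"}' % i for i in range(120)]
    print([len(b) for b in batch_bulk_parts(docs)])
```

Smaller limits mean more but lighter requests, which can help if your Elasticsearch setup restricts the maximum request size.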

Better error reporting

Elasticsearch error messages often contain a lot of information and, like stack traces, are not suitable to be written to a log file. That's why the detailed error messages are now written to a file in FLOW_PATH_DATA/Logs/Elasticsearch, and only a reference is written to the log file.

Additional Package releases

Besides the Elasticsearch ContentRepositoryAdaptor package, a bunch of other packages also received updated versions:

Flowpack.ElasticSearch

Flowpack.ElasticSearch is the base package which establishes the connection to the Elasticsearch backend and also brings support for indexing domain records.

This package is now also compatible with the latest Elasticsearch versions. For that, the _type meta field was replaced with the neos_type field.

Additionally, smaller improvements to code style and error output were included.

Neos.ContentRepository.Search

Neos.ContentRepository.Search defines the interfaces to adapt the ContentRepository to a search solution.

The package now contains an interface and the DTO definition for asset content extraction, as well as the code needed for the feature, mentioned above, that excludes properties from indexing.
Several code quality improvements have also been made.

Flowpack.ElasticSearch.ContentRepositoryQueueIndexer

Along with cleanup and code improvements, the NodeType exclusion feature has also been added to the QueueIndexer package. No jobs are created for NodeTypes that are marked as excluded from indexing, which brings an additional performance boost when indexing large ContentRepositories.

Flowpack.SearchPlugin

SearchAsYouType.gif

Autocompletion and suggestions in action.

The SearchPlugin offers a controller action that returns autocompletions and suggestions instantly as you type the search term. The feature was implemented to work with Elasticsearch 2, but never worked with Elasticsearch 5. Now the code refactorings done by Aske Ertmann for Elasticsearch 5 were finally merged and released as 4.1.0.

The released version 5 makes the package, together with completions and suggestions, work with ESCRA 7.

Additionally, I restructured the configuration and Fusion code to better fit our best practices.

Those are the main changes to the packages. Besides that, a lot of code changes and refactorings were made to make the code less error-prone and easier to maintain.

I am really looking forward to your thoughts and feedback on these changes, and I would like to hear about your use cases that may take the packages even further in upcoming versions. If you have questions, just stop by the forum or the #guild-search channel on Slack.