Advanced Neo4j to Elasticsearch Replication

27 July 2016

During GraphConnect San Francisco 2015, we introduced the concept of Graph-Aided Search and released the first module providing Neo4j data replication to Elasticsearch.

Some months later, we released the second part: an Elasticsearch plugin providing advanced personalized search that uses Neo4j as a source of external knowledge. Combined with the first module, it forms a complete bidirectional integration between Neo4j and Elasticsearch that takes advantage of the strengths of both technologies.

The first version of the neo4j-to-elasticsearch plugin took a simple approach to defining which nodes should be indexed. After a while, the need for more flexibility arose.

Based on our experience using the plugin and valuable feedback from the community, we are pleased to announce a completely revamped version, making more complex replication mapping definitions possible.

The concept

Reminder: It is important to have a common identifier for both Elasticsearch documents and Neo4j nodes and relationships. Since Neo4j reuses internal ids, these should be avoided. We recommend using our neo4j-uuid plugin to provide seamless integration with all of the GraphAware modules, but it is not mandatory.
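
To see why a reused internal id makes a poor document key, here is a toy sketch in plain Java. It is only an illustration: the IdReuseSketch class is hypothetical, its id recycling merely imitates the reuse mentioned above, and it is neither how Neo4j allocates ids nor how the neo4j-uuid plugin is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy model only: a store that recycles freed internal ids and gives
// every node its own UUID on creation.
final class IdReuseSketch {
    private final Deque<Long> freedIds = new ArrayDeque<>();
    private final Map<Long, String> uuidByInternalId = new HashMap<>();
    private long nextId = 0;

    long createNode() {
        // Prefer a freed id over a fresh one, which is what makes ids ambiguous over time.
        long id = freedIds.isEmpty() ? nextId++ : freedIds.pop();
        uuidByInternalId.put(id, UUID.randomUUID().toString());
        return id;
    }

    void deleteNode(long id) {
        uuidByInternalId.remove(id);
        freedIds.push(id);
    }

    String uuidOf(long id) {
        return uuidByInternalId.get(id);
    }

    public static void main(String[] args) {
        IdReuseSketch store = new IdReuseSketch();
        long first = store.createNode();
        String firstUuid = store.uuidOf(first);
        store.deleteNode(first);
        long second = store.createNode();

        System.out.println("internal id reused: " + (first == second));
        System.out.println("uuid reused: " + firstUuid.equals(store.uuidOf(second)));
    }
}
```

In this model the second node inherits the internal id of the deleted one, so a document keyed by that id could silently end up describing a different node, while the UUID stays unique.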

This new version overcomes some of the limitations of the previous one, such as weak type support and the lack of relationship replication. One of its nicest features is the JSON-based mapping definition, which allows a high degree of customization in the mapping between nodes/relationships and documents. This makes the configuration more user-friendly and flexible, and also solves the synchronization issues that occurred when the same node had more than one label.

An example replication mapping file would look like the following:

{
  "defaults": {
    "nodes_index": "default-nodes-index",
    "relationships_index": "default-relationships-index",
    "keyProperty": "uuid",
    "include_remaining_properties": true,
    "blacklisted_node_properties": ["password"]
  },
  "node_mappings": [
    {
      "condition": "hasLabel('User')",
      "type": "persons",
      "properties": {
        "name": "getProperty('firstName') + ' ' + getProperty('lastName')"
      }
    },
    {
      "condition": "hasLabel('Event') && hasProperty('timestamp')",
      "index": "'events-' + formatTime('timestamp', 'yyyy-MM-dd')",
      "type": "events"
    }
  ],
  "relationship_mappings": [
    {
      "condition": "hasType('FOLLOWS')",
      "type": "follow_events",
      "properties": {
        "since": "getProperty('since')"
      }
    }
  ]
}

The JSON contains three distinct parts:

defaults, for default settings such as the node/relationship property key to use as the identifier, or properties that shouldn’t be indexed, like user passwords.
node_mappings, representing your replication logic for nodes.
relationship_mappings, representing your replication logic for relationships.
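
To illustrate how the defaults could interact with a mapping, here is a minimal sketch in plain Java. It is not the module's code: DocumentDefaultsSketch and buildDocument are hypothetical names, and the semantics assumed for include_remaining_properties (copy every property that is not explicitly mapped) and blacklisted_node_properties (never copy these) are an interpretation of the setting names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the module: one plausible way the
// "defaults" section could shape the document sent to Elasticsearch.
final class DocumentDefaultsSketch {

    static Map<String, Object> buildDocument(Map<String, Object> nodeProperties,
                                             Map<String, Object> mappedProperties,
                                             boolean includeRemainingProperties,
                                             Set<String> blacklistedProperties) {
        // Explicitly mapped properties always go in first.
        Map<String, Object> document = new LinkedHashMap<>(mappedProperties);
        if (includeRemainingProperties) {
            for (Map.Entry<String, Object> entry : nodeProperties.entrySet()) {
                // Assumption: blacklisted keys are skipped and mapped keys are never overwritten.
                if (!blacklistedProperties.contains(entry.getKey())) {
                    document.putIfAbsent(entry.getKey(), entry.getValue());
                }
            }
        }
        return document;
    }

    public static void main(String[] args) {
        Map<String, Object> node = new LinkedHashMap<>();
        node.put("uuid", "123e4567-e89b-12d3-a456-426614174000");
        node.put("firstName", "Ada");
        node.put("lastName", "Lovelace");
        node.put("password", "secret");

        // Equivalent of the "name" mapping in the example above.
        Map<String, Object> mapped = new LinkedHashMap<>();
        mapped.put("name", node.get("firstName") + " " + node.get("lastName"));

        System.out.println(buildDocument(node, mapped, true, Set.of("password")));
    }
}
```

Under these assumptions the resulting document carries the computed name and the remaining properties, while the password never leaves the graph.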

The module uses the powerful Spring Expression Language (SpEL) library, which lets you access node and relationship properties and define conditions in an expressive manner. This makes it possible to handle cases such as:

time-based indices
filtering nodes based on properties or labels
indexing nodes or relationships in more than one index
and many more

Workflow

[Figure: replication workflow]

The Neo4j TransactionData is translated into a set of objects such as NodeCreated, RelationshipUpdated and NodeDeleted. Every object passes through the chain of replication mapping definitions. When a definition’s condition evaluates to true, the corresponding Elasticsearch action and its JSON document content are added to a stack of actions, which is then executed in bulk against the Elasticsearch cluster.

Every mapping definition is responsible for evaluating the condition expression as well as the index and type of the document. Finally, a JSON representation of the document is built from the graph object properties, labels, relationship types etc.
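
The workflow described above can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions, not the module's implementation: the NodeCreated, Mapping and IndexAction classes here are hypothetical stand-ins, conditions are ordinary Java predicates rather than SpEL expressions, and the "all-nodes" index is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical, simplified model of the replication chain.
final class ReplicationChainSketch {

    // A created node as the chain would see it.
    static final class NodeCreated {
        final Set<String> labels;
        final Map<String, Object> properties;
        NodeCreated(Set<String> labels, Map<String, Object> properties) {
            this.labels = labels;
            this.properties = properties;
        }
    }

    // One mapping definition: a condition plus index and type resolution.
    static final class Mapping {
        final Predicate<NodeCreated> condition;
        final Function<NodeCreated, String> index;
        final String type;
        Mapping(Predicate<NodeCreated> condition, Function<NodeCreated, String> index, String type) {
            this.condition = condition;
            this.index = index;
            this.type = type;
        }
    }

    // An index action waiting to be sent in a bulk request.
    static final class IndexAction {
        final String index, type, id;
        final Map<String, Object> source;
        IndexAction(String index, String type, String id, Map<String, Object> source) {
            this.index = index; this.type = type; this.id = id; this.source = source;
        }
        @Override public String toString() {
            return "index " + index + "/" + type + "/" + id + " " + source;
        }
    }

    static List<IndexAction> toActions(List<NodeCreated> events, List<Mapping> mappings, String keyProperty) {
        List<IndexAction> actions = new ArrayList<>();
        for (NodeCreated event : events) {
            for (Mapping mapping : mappings) {
                // Every event visits every mapping; each match contributes one action.
                if (mapping.condition.test(event)) {
                    actions.add(new IndexAction(mapping.index.apply(event), mapping.type,
                            String.valueOf(event.properties.get(keyProperty)), event.properties));
                }
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        List<Mapping> mappings = List.of(
                new Mapping(n -> n.labels.contains("User"), n -> "default-nodes-index", "persons"),
                // Catch-all mapping with a made-up index name.
                new Mapping(n -> true, n -> "all-nodes", "nodes"));
        NodeCreated user = new NodeCreated(Set.of("User"), Map.of("uuid", "u-1", "firstName", "Ada"));
        NodeCreated city = new NodeCreated(Set.of("City"), Map.of("uuid", "c-1"));
        toActions(List.of(user, city), mappings, "uuid").forEach(System.out::println);
    }
}
```

Because every event is tested against every definition, the User node in this sketch matches two mappings and therefore ends up in two indices, which is how a single node can be replicated to more than one index.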

Some tips

Time-based indices

Looking at the second node mapping definition from the earlier example:

{
      "condition": "hasLabel('Event') && hasProperty('timestamp')",
      "index": "'events-' + formatTime('timestamp', 'yyyy-MM-dd')",
      "type": "events"
}

we can see how an expression builds a time-based index name such as events-2016-07-26. The formatTime function is a convenience method we added to the module for this specific use case.

The same expression flexibility can be used to define your document type.
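
To make the index naming concrete, here is a small sketch using only java.time. It assumes that the timestamp property holds epoch milliseconds and that the date is rendered in UTC with Java's standard pattern letters; both are assumptions about the module, and indexName is a hypothetical helper rather than the module's formatTime. Keep in mind that Java pattern letters are case-sensitive: MM is the month and dd the day of the month, whereas mm means minutes and DD the day of the year.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for a time-based index name expression.
final class TimeBasedIndexSketch {

    // Assumption: the timestamp is in epoch milliseconds and formatted in UTC.
    static String indexName(String prefix, long epochMillis, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        return prefix + formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long timestamp = Instant.parse("2016-07-26T10:15:30Z").toEpochMilli();
        System.out.println(indexName("events-", timestamp, "yyyy-MM-dd")); // events-2016-07-26
    }
}
```

Daily indices like this make it cheap to drop or archive old events, since a whole index can be removed at once.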

Indexing all nodes

If you had to write a condition for every node label existing in the database, this wouldn’t be very friendly! Instead, you can make use of the allNodes() or allRelationships() condition expressions.

Java code evaluation

An interesting aspect of these expressions is that they can call Java methods. You can concatenate properties, as the name property of the persons mapping in the example JSON shows, or sanitize the data before indexing, for example:

{
      "condition": "hasLabel('Country')",
      "index": "countries",
      "type": "events",
      "properties": {
        "name": "getProperty('countryName').toLowerCase().replace(' ', '')"
      }
}
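
Since the expression above simply chains methods of java.lang.String, its effect can be checked in plain Java before putting it into a mapping. The sketch below is such a check; the SanitizeSketch class is hypothetical, and passing Locale.ROOT is an addition of this example, not something the expression above does.

```java
import java.util.Locale;

// Plain-Java equivalent of the sanitizing expression, for quick experimentation.
final class SanitizeSketch {

    static String sanitize(String countryName) {
        // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
        return countryName.toLowerCase(Locale.ROOT).replace(" ", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("United Kingdom")); // unitedkingdom
    }
}
```

Note that the method is toLowerCase(); java.lang.String has no toLower() method, so an expression using that name would fail to resolve when evaluated.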

Next steps

In the near future, we would like to spend some time on reliability under concurrent load, and we are considering implementing a fully persistent queue mechanism.

Conclusion

We are very excited about this release and are awaiting your feedback. We would like to thank David Rapin, CTO of Linkurious, for his very valuable feedback and contributions.

Our journey into the mixing of amazing technologies like Neo4j and Elasticsearch is far from being complete.

Want to discover more or speak with Neo4j experts about use cases for this integration? Meet our team at the next GraphConnect San Francisco on October 13, 2016.