Faster Recommendations with Neo4j 2.3 Triadic Selection

· 8 min read

Recently, Neo Technology announced the 2.3.0-RC1 release of their Neo4j graph database. One of the key new features is Triadic Selection built into Cypher’s Cost Based Planner. In this blog post, we will explore the Triadic Selection in detail and demonstrate how significantly it can speed up recommendations computed in Neo4j.

What is Triadic Selection?

A Bit of Theory: Triadic Closure

Networks or graphs can rarely be considered static structures. On the contrary, often they seem to be ever-evolving objects. Any social network, for example, is often the most dynamic of graphs: at any moment, new relationships are created between existing nodes, other relationships vanish, new nodes appear and others are deleted. It is therefore crucial to observe how the network evolves over time.

The forces that operate on each network vary over various network types, but one of the most basic principles is the following:

“If two people in a social network have a friend in common, then there is an increased likelihood that they will become friends themselves at some point in the future. [Anatole Rapoport. Spread of information through a population with socio-structural bias I: Assumption of transitivity. Bulletin of Mathematical Biophysics, December 1953 ]”

We refer to this principle as triadic closure. It is intuitive and really simple to find examples of in everyone’s private network. Moreover, experience suggests three basic principles of how triadic closures operate:

  • Opportunity: One reason why B and C are more likely to become friends, when they have a common friend A, is simply based on opportunity for B and C to meet.
  • Trust: In the process of forming a friendship, the fact that both B and C are friends with A gives them a basis for trusting each other. Such trust may be lacking in an arbitrary pair of unconnected people.
  • Incentive: If A is friend with B and C, then it becomes a source of later stress in these relationships if B and C are not of friends with each other.

Furthermore, the basic role of triadic closure in a social network has motivated the formulation of simple social network measures to capture its prevalence. One of these measures is the clustering coefficient.

The clustering coefficient of a node A is defined as the probability that two randomly selected connections of A are connected each other [Easley D., Kleinberg J. - Networks, Crowds and Markets. Reasoning about a Highly Connected World. Cambridge Press, 2010].

Usage in Recommendation Engines

In recommendations engines, triadic closure represents an interesting opportunity for evolution of the network, because in most applications, closing a triangle has a positive impact on the state of the world.

In a social network like Facebook or LinkedIn, a triadic closure represents a new friendship or a professional relationship, where these connections are the raison d’être of these applications. In an interest or purchase graph, a triadic closure may represent a new view or a new buy. Therefore, in recommendation engines, the triadic closure principle can be very helpful in recommending relationships that close a triangle.

TriadicSelection in Neo4j 2.3

Triadic Sample

Recommending user C in Cypher is easy:

MATCH (a:User {name:"A"})-[:KNOWS]->(b)-[:KNOWS]->(c)
WHERE a <> c
AND NOT (a)-[:KNOWS]->(c)
RETURN c

+------------------------+
| c                      |
+------------------------+
| Node[656706]{name:"C"} |
+------------------------+
1 row
14 ms

This does not seem to be a complicated query nor does there seem to by anything special about it. So what is this all about then?

Well, the good news is that in Neo4j 2.3, Cypher is optimized to make use of TriadicSelection just by recognizing the closure pattern in your query. Let’s take a look at the PROFILE of this query :

Compiler CYPHER 2.3

Planner COST

Runtime INTERPRETED

+-------------------+----------------+------+---------+-----------------------------+--------------------------------------------------------+
| Operator          | Estimated Rows | Rows | DB Hits | Identifiers                 | Other                                                  |
+-------------------+----------------+------+---------+-----------------------------+--------------------------------------------------------+
| +ProduceResults   |              0 |    1 |       0 | c                           | c                                                      |
| |                 +----------------+------+---------+-----------------------------+--------------------------------------------------------+
| +TriadicSelection |              0 |    1 |       0 | anon[26], anon[40], a, b, c | a, b, c                                                |
| |\                +----------------+------+---------+-----------------------------+--------------------------------------------------------+
| | +Filter         |              0 |    1 |       0 | anon[26], anon[40], a, b, c | Ands(Fby(NOT(a == c),Last(NOT(anon[26] == anon[40])))) |
| | |               +----------------+------+---------+-----------------------------+--------------------------------------------------------+
| | +Expand(All)    |              0 |    1 |       2 | anon[26], anon[40], a, b, c | (b)-[:KNOWS]->(c)                                      |
| | |               +----------------+------+---------+-----------------------------+--------------------------------------------------------+
| | +Argument       |              1 |    1 |       0 | anon[26], a, b              |                                                        |
| |                 +----------------+------+---------+-----------------------------+--------------------------------------------------------+
| +Expand(All)      |              1 |    1 |       2 | anon[26], a, b              | (a)-[:KNOWS]->(b)                                      |
| |                 +----------------+------+---------+-----------------------------+--------------------------------------------------------+
| +Filter           |              1 |    1 |       3 | a                           | a.name == {  AUTOSTRING0}                              |
| |                 +----------------+------+---------+-----------------------------+--------------------------------------------------------+
| +NodeByLabelScan  |              3 |    3 |       4 | a                           | :User                                                  |
+-------------------+----------------+------+---------+-----------------------------+--------------------------------------------------------+

Positive and Negative Predicates

Our simple Cypher example was a negative predicate meaning that we are looking for triangles that are not yet closed.

Cypher also allows you to look for positive predicates, meaning triangles that are already closed. A slight modification to the WHERE clause will do just that:

MATCH (a:User)-[:FOLLOWS]->(b)-[:FOLLOWS]->(c)
WHERE (a)-[:FOLLOWS]->(c)
AND (a) <> (c)
RETURN c

Valid Triadic Selection patterns

The key point of TriadicSelection is that the predicate pattern WHERE NOT (a)-[:FOLLOWS]->(c) needs to have the same direction, type and labels as the first part of the MATCH pattern MATCH (a:User)-[:FOLLOWS]->(b).

So b and c need to have the same labels. Triadic re-expresses the problem as find all (c) that are not in the set of (b) which is only a reasonable re-expression if the set of found (b) is a superset of possible (c).

Performance matters

As Neo4j experts and enthusiasts, we could not resist the temptation to test the performance impact TriadicSelection has on more representative data sets. If the performance improvements are significant, this might be a feature to take into account when it comes to data modeling or building new recommendation engines in the future.

We have tested the performance impact by comparing Neo4j2.2.5 to Neo4j2.3.0-RC1 on 3 data sets :

  • A small data set created with Graphgen (the default example on the web application, only with more nodes).
  • The pokec social network relationships dataset available on Stanford Educational Website, representing 1632803 user nodes and 30622564 edges.
  • An internal data set representing the Github Activity of more than 1 million users during 1 month.

The tests were run on an AWS EC2 m3.xlarge instance.

Graphgen data set results

| Version | Total Time (ms) | Occurrences | Mean (ms) | Min (ms) | Max (ms) | Variance (ms) |
|---------|-----------------|-------------|-----------|----------|----------|---------------|
| 2.2.5   | 1386            | 200         | 6.93      | 2        | 44       | 53.05         |
| 2.3-RC1 | 1192            | 200         | 5.97      | 1        | 44       | 45            |

Here, the difference is not significant because there are only 200 Person nodes.

Pokec Social Network data set results

| Version            | Total Time (ms) | Occurrences | Mean (ms) | Min (ms) | Max (ms) | Variance (ms) | Results Min | Results Max | Results Mean |
|--------------------|-----------------|-------------|-----------|----------|----------|---------------|-------------|-------------|--------------|
| 2.2.5              | 3.45E+07        | 100000      | 345.4     | 0        | 331332   | 2062909.27    | 0           | 373301      | 3564.28      |
| 2.3-RC1            | 859166          | 100000      | 8.59      | 0        | 1013     | 191.85        | 0           | 373301      | 3564.28      |

We ran this test on 100k different users. As you can see in the table, the improvement in the query response time is really significant : from 331 seconds to 1010ms.

The query times are generally proportional to the number of suggestions the TriadicSelection will find.

Internal Github Activity Database results

Here we tested each valid TriadicSelection pattern.

| Pattern                                                                                     | Total time (ms) | Query Performed | Mean          | Min | Max (ms)    | Variance      | Results Min | Result Max | Result Mean   |
|---------------------------------------------------------------------------------------------|-----------------|-----------------|---------------|-----|-------------|---------------|-------------|------------|---------------|
| MATCH (a:User {id:{param}})-[r1]->(b)-[r2]->(c) WHERE NOT (a)-->(c)                         | 2566811         | 1395233         | 1.839         | 0   | 761162      | 474476.78     | 0           | 365352     | 30.739        |
| MATCH (a:User {id:{param}})-[:FOLLOWS]->(b)-[:FOLLOWS]->(c) WHERE NOT (a)-[:FOLLOWS]->(c)   | 173511          | 1395233         | 0.1243598739  | 0   | 2248        | 10.43         | 0           | 22789      | 1.89          |
| MATCH (a:X)-[:A]->(b)-[:B]->(c) WHERE NOT (a:X)-[:A]->(c)                                   | 99554           | 1395233         | 0.07135295682 | 0   | 49          | 0.07580282543 | 0           | 357        | 0.3749875469  |
| MATCH (a:User {id:{param}})-[:WATCH]->(b)-[:FORK_OF]->(c) WHERE NOT (a)-[:WATCH]->(c)       | 229511          | 1395233         | 0.1644965393  | 0   | 100         | 0.3880639455  | 0           | 8485       | 7.06          |
| MATCH (a:X)-->(b)-->(c) WHERE (a)-->(c)                                                     | 2663348         | 1395233         | 1.9           | 0   | 818127      | 546845.41     | 0           | 31204      | 0.5179823012  |
| MATCH (a:User {id:{param}})-[:FOLLOWS]->(b)-[:FOLLOWS]->(c) WHERE (a)-[:FOLLOWS]->(c)       | 170278          | 1395233         | 0.1220426982  | 0   | 2206        | 9.92          | 0           | 9644       | 0.1187027543  |
| MATCH (a:User {id:{param}})-[:WATCH]->(b)-[:FORK_OF]->(c) WHERE (a)-[:WATCH]->(c)           | 98553           | 1395233         | 0.07063551392 | 0   | 40          | 0.07915648308 | 0           | 11         | 8.14E-04      |

Again, the performance impact is very significant and will certainly benefit the Neo4j users.

Conclusion

TriadicSelection in Neo4j 2.3 is a real improvement, bringing a lot of benefits for people building recommendations engines with Cypher. We are looking forward to building new recommendation engines with the GraphAware Reco extension.

You can find the code used for the tests in our stress-test repository.

Other resources :

  • Neo4j Cypher test case for TriadicSelction : link

  • Cypher plan code for TriadicSelection : link

We’re looking forward to meeting you at GraphConnect in San Francisco. If you want to know more about the Github dataset, Christophe will do a lightning talk titled : Mining the Github Activity in Neo4j.

Also, do not miss Michal’s talk : Real Time Recommendations with Graphs and the Future of Search.

NB: Thanks to Craig Taverner from Neo Technology for useful information about the TriadicSelection internals in Neo4j.

Alessandro Negro

Christophe Willemsen

Technology & Infrastructure | Neo4j certification

Christophe Willemsen brings almost 15 years of experience as a Communication Systems Engineer for the Belgian Navy to the table. Highly skilled in software engineering, he is one of the primary authors of Hume and an active participant in the Neo4j community, dedicating his efforts towards driving the development of cutting-edge technologies in the field.