Reverse Engineering Book Stories with Neo4j and GraphAware NLP

24 July 2017 · 10 min read

A book tells us a story, but for a computer it is a wall of text. How can we use graphs and NLP to help our machines make more sense of a story?

Our example comes from the A Song of Ice and Fire books, aka Game of Thrones. We converted the e-books (epub) to text-files and used a small python program to split them into chapters, paragraphs, and sentences.

So a book turned into this model :

model

GraphAware NLP

GraphAware NLP Framework is a project that integrates NLP processing capabilities available in several software packages like Stanford NLP and OpenNLP, existing data sources, such as ConceptNet5 and WordNet, and GraphAware’s knowledge about search, graphs, and Recommendation Engines. GraphAware NLP is developed as plugin for Neo4j and an external frontend for interacting with a Spark Cluster. It provides a set of tools, by means of procedures, background process, and APIs, that all together provide a Domain Specific Language for Natural Language Processing on top of Cypher. The available interesting features are:

Information Extraction (IE) - processing textual information for extracting main components and relationships
Extracting sentiments
Enriching basic data with ontologies and concepts (ConceptNet 5)
Computing similarities between text elements in a corpus using base data and ontology information
Enriching knowledge using external sources (Alchemy)
Providing basic search capabilities
Providing complex search capabilities leveraging enriched knowledge like ontology, sentiments, and similarity
Providing recommendations based on a combination of content/ontology-based recommendations, social tags, and collaborative filtering
Unsupervised corpus clustering using LDA
Semi-supervised corpus clustering using Label Propagation
Word2Vec computation and importing

You can read more about it on the NLP section of our blog.

Text processing phase

Custom text processing pipeline with custom stopwords

CALL ga.nlp.addPipeline({textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', name: 'customStopwords', stopWords: '+,i,me,my,myself,we,our,ours,ourselves,you,your,yours,yourself,yourselves,he,him,his,himself,she,her,hers,herself,it,its,itself,they,them,their,theirs,themselves,what,which,who,whom,this,that,these,those,am,is,are,was,were,be,been,being,have,has,had,having,do,does,did,doing,a,an,the,and,but,if,or,because,as,until,while,of,at,by,for,with,about,against,between,into,through,during,before,after,above,below,to,from,up,down,in,out,on,off,over,under,again,further,then,once,here,there,when,where,why,how,all,any,both,each,few,more,most,other,some,such,no,nor,not,only,own,same,so,than,too,very,s,t,can,will,just,don,should,now', threadNumber: 20})

Text extraction, tokenization, lemmification and named entity recognition

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

The two above steps lead us to the following graph :

model

The original graph and its NLP processed representation are related with the HAS_ANNOTATED_TEXT relationship. Every paragraph is split into sentences which themselves are extracted to Tag nodes. Optionally, tags have an additional Named Entity Recognition label like NER_Person, NER_Location, etc..

First insights from text

Which tags/concepts were found ?

MATCH (n:Tag) WHERE n:NER_Person OR n:NER_Location OR n:NER_Organization RETURN n.value, n.ne, size((n)<-[:HAS_TAG]-()) AS f ORDER BY f DESC

╒════════════╤════════════╤═════╕
│"n.value"   │"n.ne"      │"f"  │
╞════════════╪════════════╪═════╡
│"Jon"       │["PERSON"]  │"510"│
├────────────┼────────────┼─────┤
│"Jaime"     │["PERSON"]  │"384"│
├────────────┼────────────┼─────┤
│"Tyrion"    │["PERSON"]  │"372"│
├────────────┼────────────┼─────┤
│"Arya"      │["PERSON"]  │"358"│
├────────────┼────────────┼─────┤
│"Robb"      │["PERSON"]  │"299"│
├────────────┼────────────┼─────┤
│"Dany"      │["PERSON"]  │"280"│
├────────────┼────────────┼─────┤
│"Sam"       │["PERSON"]  │"240"│
├────────────┼────────────┼─────┤
│"Davos"     │["PERSON"]  │"229"│
├────────────┼────────────┼─────┤
│"Joffrey"   │["PERSON"]  │"210"│
├────────────┼────────────┼─────┤
│"Catelyn"   │["PERSON"]  │"206"│
├────────────┼────────────┼─────┤
│"Sansa"     │["LOCATION"]│"193"│
├────────────┼────────────┼─────┤
│"Cersei"    │["PERSON"]  │"162"│
├────────────┼────────────┼─────┤
│"Tywin"     │["PERSON"]  │"161"│
├────────────┼────────────┼─────┤
│"Brienne"   │["PERSON"]  │"138"│
├────────────┼────────────┼─────┤
│"Robert"    │["PERSON"]  │"126"│
├────────────┼────────────┼─────┤
│"Winterfell"│["LOCATION"]│"101"│
├────────────┼────────────┼─────┤
│"Ygritte"   │["PERSON"]  │"101"│
├────────────┼────────────┼─────┤
│"Meera"     │["PERSON"]  │"101"│
├────────────┼────────────┼─────┤
│"Beric"     │["PERSON"]  │"98" │
├────────────┼────────────┼─────┤
│"Jojen"     │["PERSON"]  │"94" │
└────────────┴────────────┴─────┘

It is very impressive that on this artificial text, with no real people, locations or organizations, the NLP approach still yields such correct results. (Michael Hunger - Neo4j, Inc).

Which chapter has the most different tags?

MATCH (c:Chapter)-[:HAS_PARAGRAPH]->(p)-[:HAS_ANNOTATED_TEXT]->(at)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(t)
WHERE size(t.ne) > 0
RETURN c.number, c.title, count(*) as f ORDER BY f DESC LIMIT 5

╒══════════╤══════════╤══════╕
│"c.number"│"c.title" │"f"   │
╞══════════╪══════════╪══════╡
│"61"      │"TYRION"  │"4453"│
├──────────┼──────────┼──────┤
│"5"       │"TYRION"  │"3693"│
├──────────┼──────────┼──────┤
│"24"      │"DAENERYS"│"3678"│
├──────────┼──────────┼──────┤
│"34"      │"SAMWELL" │"3659"│
├──────────┼──────────┼──────┤
│"57"      │"BRAN"    │"3564"│
└──────────┴──────────┴──────┘

Making sense of thrones

We can now use this base model to make more advanced (fun) queries on the data

First occurrences of persons in the book

MATCH (t:Tag) WHERE t:NER_Person
WITH t, size((t)<-[:HAS_TAG]-()) AS f
ORDER BY f DESC LIMIT 10
MATCH (t)<-[:HAS_TAG]-(s)<-[:CONTAINS_SENTENCE]-(at)<-[:HAS_ANNOTATED_TEXT]-(para)<-[:HAS_PARAGRAPH]-(c)
WITH t.value as person,s.sentenceNumber as num, para.position as paraNum, c.number as chapterNum, c.title as chapter, s.text as sentence, apoc.text.format('C%03d-P%03d-S%03d',[c.number,para.position,s.sentenceNumber]) as position
WHERE chapterNum > 1
WITH * ORDER BY position ASC 
RETURN person, collect({chapter: chapter, sentence: sentence, position: position})[0]

While the query might look complex, it is just traversing the graph. Sentence nodes contain their position in the paragraph which make it easy to retrieve apparitions in the right order.

╒═════════╤══════════════════════════════╕
│"person" │"collect({chapter: chapter, se│
│         │ntence: sentence, position: po│
│         │sition})[0]"                  │
╞═════════╪══════════════════════════════╡
│"Jaime"  │{"chapter":"JAIME","sentence":│
│         │"Not that Jaime had ever seen │
│         │her smiling.","position":"C002│
│         │-P003-S002"}                  │
├─────────┼──────────────────────────────┤
│"Robb"   │{"chapter":"JAIME","sentence":│
│         │"She stood at Robb’s left hand│
│         │ beside the high seat, and for│
│         │ a moment felt almost as if sh│
│         │e were looking down at her own│
│         │ dead, at Bran and Rickon.","p│
│         │osition":"C002-P003-S000"}    │
├─────────┼──────────────────────────────┤
│"Tyrion" │{"chapter":"CATELYN","sentence│
│         │":"Tyrion felt their eyes on h│
│         │im as he rode past; chilly eye│
│         │s, angry and unsympathetic.","│
│         │position":"C003-P002-S001"}   │
├─────────┼──────────────────────────────┤
│"Joffrey"│{"chapter":"JAIME","sentence":│
│         │"The moonstones Joffrey gave h│
│         │er.”","position":"C002-P005-S0│
│         │02"}                          │
├─────────┼──────────────────────────────┤
│"Jon"    │{"chapter":"JAIME","sentence":│
│         │"Jon had never met anyone so s│
│         │tubborn, except maybe for his │
│         │little sister Arya.","position│
│         │":"C002-P005-S001"}           │
├─────────┼──────────────────────────────┤
│"Catelyn"│{"chapter":"JAIME","sentence":│
│         │"A silence fell across the tor│
│         │chlit hall, and in the quiet C│
│         │atelyn could hear Grey Wind ho│
│         │wling half a castle away.","po│
│         │sition":"C002-P002-S001"}     │
├─────────┼──────────────────────────────┤
│"Arya"   │{"chapter":"JAIME","sentence":│
│         │"Jon had never met anyone so s│
│         │tubborn, except maybe for his │
│         │little sister Arya.","position│
│         │":"C002-P005-S001"}           │
├─────────┼──────────────────────────────┤
│"Davos"  │{"chapter":"JAIME","sentence":│
│         │"But Davos could not complain │
│         │of chill.","position":"C002-P0│
│         │04-S000"}                     │
├─────────┼──────────────────────────────┤
│"Dany"   │{"chapter":"JAIME","sentence":│
│         │"The harpy of Ghis, Dany thoug│
│         │ht.","position":"C002-P003-S00│
│         │0"}                           │
├─────────┼──────────────────────────────┤
│"Sam"    │{"chapter":"CATELYN","sentence│
│         │":"Sam was trying to feed him │
│         │onion broth, but he could not │
│         │swallow.","position":"C003-P00│
│         │3-S003"}                      │
└─────────┴──────────────────────────────┘

Interaction graph

The idea here is that we are looking for who is interacting with another person in the same sentence and create a new OCCURS_WITH relationship between the entities

MATCH (a:AnnotatedText)-[:CONTAINS_SENTENCE]->(s:Sentence)-[:SENTENCE_TAG_OCCURRENCE]->(to:TagOccurrence)-[:TAG_OCCURRENCE_TAG]->(tag)
WHERE tag:NER_Person 
WITH a, to, tag
ORDER BY s.id, to.startPosition
WITH a, collect(tag) as tags
UNWIND range(0, size(tags) - 2) as i
WITH a, tags[i] as tag1, tags[i+1] as tag2 WHERE tag1 <> tag2
MERGE (tag1)-[r:OCCURS_WITH]-(tag2)
ON CREATE SET r.freq = 1
ON MATCH SET r.freq = r.freq + 1

co-occurences

Tag Occurrence nodes are an intermediate representation between a sentence and a tag found in it. Because Tag nodes are unique, this offers the possibility to have extra information about the occurrence of a tag in a specific sentence.

Find the queen/lord/king/lady

MATCH (n:Tag) WHERE n.value IN ['lord','queen','king','lady']
MATCH (n)<-[:TAG_OCCURRENCE_TAG]-(to)<-[:SENTENCE_TAG_OCCURRENCE]-(s)-[:SENTENCE_TAG_OCCURRENCE]->(to2)-[:TAG_OCCURRENCE_TAG]->(t2) WHERE to2.startPosition = to.endPosition + 1
AND t2:NER_Person
WITH n.value AS v, toLower(t2.value) AS person, count(*) AS f ORDER BY f DESC
RETURN v, collect(person)[0..20]

╒═══════╤══════════════════════════════╕
│"v"    │"collect(person)[0..20]"      │
╞═══════╪══════════════════════════════╡
│"lady" │["lysa","smallwood","catelyn",│
│       │"brienne","frey","leonette","o│
│       │lenna","bulwer","melisandre","│
│       │ashara dayne","roslin","roslin│
│       │ frey","lynesse","jeyne wester│
│       │ling","janna fossoway","jeyne │
│       │swann","joanna","whent","walda│
│       │ frey","shae"]                │
├───────┼──────────────────────────────┤
│"king" │["robert","joffrey","robb","ba│
│       │lon","tommen","aerys","stannis│
│       │","daeron","maegor","joffrey b│
│       │aratheon","robert i baratheon"│
│       │,"andahar","renly","aegon v ta│
│       │rgaryen","tristifer","harren t│
│       │he black","harren","baelor","t│
│       │orrhen"]                      │
├───────┼──────────────────────────────┤
│"queen"│["cersei","jeyne","selyse","ma│
│       │rgaery","alysanne","myrcella",│
│       │"rhaella"]                    │
├───────┼──────────────────────────────┤
│"lord" │["tywin","beric","walder","bol│
│       │ton","tyrell","hoster","petyr"│
│       │,"alester","rickard","eddard",│
│       │"frey","vargo","karstark","ren│
│       │ly","redwyne","mormont","edmur│
│       │e","stannis","davos","tully"] │
└───────┴──────────────────────────────┘

Which organizations and locations were found

MATCH (n:Tag) WHERE n:NER_Organization OR n:NER_Location
WITH n.value AS entity, n.ne AS types, size((n)<-[:HAS_TAG]-()) AS f
ORDER BY f DESC
RETURN types, collect(entity)[0..25] AS entities

│"types"         │"entities"                    │
╞════════════════╪══════════════════════════════╡
│["PERSON"]      │["Tyrion","Arya","Robb","Davos│
│                │","Joffrey","Catelyn","Cersei"│
│                │,"Tywin","Brienne","Ygritte","│
│                │Jojen","Edmure","Grenn","Mormo│
│                │nt","Stannis","Gendry","Riverr│
│                │un","Oberyn","Shae","Thoros","│
│                │Bronn","Margaery","Clegane","P│
│                │etyr","Craster"]              │
├────────────────┼──────────────────────────────┤
│["MISC"]        │["Dornish","Valyrian","Eyrie",│
│                │"Thenn","Dywen","Westerosi","T│
│                │argaryen","Greyjoys","Cerwyn",│
│                │"Baratheon","Golden Tooth","Qa│
│                │rtheen","Queen of Thorns","Iro│
│                │n Islands"]                   │
├────────────────┼──────────────────────────────┤
│["ORGANIZATION"]│["King","Castle Black","Tyrell│
│                │","Aegon","House","Salladhor S│
│                │aan","Blackwater","Mace Tyrell│
│                │","Twins","Kraznys","Irri","Fr│
│                │eys","Jinglebell","Pentos","Wh│
│                │itetree","Vale","Hobb","Red Vi│
│                │per","Myr","Ser Addam","Mummer│
│                │s","Acorn Hall","Kingswood Bro│
│                │therhood","Bloody Mummers","Fo│
│                │rd"]                          │
├────────────────┼──────────────────────────────┤
│["LOCATION"]    │["Sansa","Winterfell","Dorne",│
│                │"Highgarden","Casterly Rock","│
│                │Greenbeard","Oldtown","Dragons│
│                │tone","Pylos","Lannister","Tar│
│                │th","Duskendale","Qarth","Gran│
│                │d Maester Pycelle","Willas","W│
│                │hite Harbor","Longclaw","Eastw│
│                │atch","Green Fork","Nightfort"│
│                │,"Braavos","Baelor","Khaleesi"│
│                │,"Tyroshi","Shadow Tower"]    │
└────────────────┴──────────────────────────────┘

Entity merging

Finding names and merging them together for creating person nodes.

This is done in 2 steps, the first step is to extract name parts into their own nodes and create a hierarchy of name parts :

match (t:Tag:NER_Person)
merge (n1:Name {name:toLower(t.value)}) set n1:FromTag
with n1
unwind split(n1.name,' ') as part
with n1, part where length(part) > 2
merge (n2:Name {name:part}) set n2:Single
with n1,n2 where n1 <> n2
merge (n2)-[:PART_OF]->(n1)

name extraction

The second step is to traverse from a single name entity up to its top PART_OF relationship and traverse its own cluster in order to determine the entities to merge together.

Let’s first check what the results will look like :

MATCH (n:Single) WITH n, size( (n)-[:PART_OF]->() ) as degree WHERE degree < 5 
MATCH path = (n)-[:PART_OF*..3]->(m)
WITH n, collect(m) as ms
WHERE size(ms) > 1
UNWIND ms as m
WITH n, m WHERE size([ (m)<-[:PART_OF*..3]-(o)-[:PART_OF*..3]->(e:FromTag) WHERE e IN ms | e]) = size(ms)
WITH n, collect(m.name) as names
return n.name, names, reduce(x = '', name in names | case when size(name) > size(x) then name else x end)

╒═════════════╤══════════════════════════════╤══════════════════════════════╕
│"n.name"     │"names"                       │"reduce(x = '', name in names │
│             │                              │| case when size(name) > size(│
│             │                              │x) then name else x end)"     │
╞═════════════╪══════════════════════════════╪══════════════════════════════╡
│"tallad"     │["rowan ser tallad","ser talla│"rowan ser tallad"            │
│             │d"]                           │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"grace"      │["his grace","his grace king j│"his grace king joffrey"      │
│             │offrey"]                      │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"walder"     │["walder frey","theon greyjoy │"theon greyjoy walder frey tyw│
│             │walder frey tywin lannister"] │in lannister"                 │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"kettleblack"│["oswald kettleblack","osmund │"oswald kettleblack"          │
│             │kettleblack"]                 │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"dolorous"   │["dolorous edd","dolorous edd │"dolorous edd dywen"          │
│             │dywen"]                       │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"tallhart"   │["ser helman tallhart","helman│"ser helman tallhart"         │
│             │ tallhart"]                   │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"bronn"      │["ser bronn","philip foote ser│"philip foote ser bronn jalabh│
│             │ bronn jalabhar xho"]         │ar xho"                       │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"hempen"     │["bodger hempen dan","hempen d│"bodger hempen dan"           │
│             │an"]                          │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"rhaegar"    │["prince rhaegar","rhaegar pri│"rhaegar prince of dragonstone│
│             │nce of dragonstone"]          │"                             │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"ryman"      │["ser ryman","ser ryman frey"]│"ser ryman frey"              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"mandon"     │["ser mandon moore","ser mando│"ser mandon moore"            │
│             │n"]                           │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"lorch"      │["ser amory lorch","amory lorc│"ser amory lorch"             │
│             │h"]                           │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"red"        │["red alyn","dornish dilly red│"dornish dilly red alyn"      │
│             │ alyn"]                       │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"donal"      │["clydas donal noye","donal no│"clydas donal noye"           │
│             │ye"]                          │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤
│"thorne"     │["alliser thorne","ser alliser│"ser alliser thorne"          │
│             │ thorne"]                     │                              │
├─────────────┼──────────────────────────────┼──────────────────────────────┤

We can now create the Person entities and relate them to their corresponding Tag nodes :

MATCH (n:Single) WITH n, size( (n)-[:PART_OF]->() ) as degree WHERE degree < 5 
MATCH path = (n)-[:PART_OF*..3]->(m)
WITH n, collect(m) as ms
WHERE size(ms) > 1
UNWIND ms as m
WITH n, m WHERE size([ (m)<-[:PART_OF*..3]-(o)-[:PART_OF*..3]->(e:FromTag) WHERE e IN ms | e]) = size(ms)
WITH n, collect(m.name) as names
with n.name as single, names, reduce(x = '', name in names | case when size(name) > size(x) then name else x end) as longestName
MERGE (p:Person {name: longestName})
WITH p, names
UNWIND names as name
MATCH (n:Name {name: name})
MERGE (p)-[:KNOWN_AS]->(n)

entities

And clean the temporary Name nodes

MATCH (n:Name) DETACH DELETE n

Conclusion

The last step is a very interesting step, because it offers us the possibility to compute again the interaction graph but with more precise Person entities than what we have done previously.

MATCH (a:AnnotatedText)-[:CONTAINS_SENTENCE]->(s:Sentence)-[:SENTENCE_TAG_OCCURRENCE]->(to:TagOccurrence)-[:TAG_OCCURRENCE_TAG]->(tag)<-[:KNOWN_AS]-(p:Person)
WITH a, to, p
ORDER BY s.id, to.startPosition
WITH a, collect(p) as persons
UNWIND range(0, size(persons) - 2) as i
WITH a, persons[i] as person1, persons[i+1] as person2 WHERE person1 <> person2
MERGE (person1)-[r:OCCURS_WITH_PERSON]-(person2)
ON CREATE SET r.freq = 1
ON MATCH SET r.freq = r.freq + 1

Imgur

We’ve made the database available (without the original text) in read-only mode here and a backup can be downloaded from Dropbox here. The Neo4j version used is 3.1.3.

Graph Structures are ideal for storing enriched representations of textual data. Combine this with NLP capabilities and a powerful query language for graphs and you are able to discover really interesting insights from books.

GraphAware is a Gold Sponsor of GraphConnect New York, feel free to pass by our booth to learn more about the GraphAware NLP Framework.