Counting Many Paths Between Nodes In NEO4J


A parent node in a graph database can have many child nodes.

In addition to the many children, there can be many distinct and different paths to each single child node. An example of multiple paths to the same destination is to look at a subsection of Manhattan.

Small Part of Manhattan

How many ways are there to get from the intersection of West 23rd Street and Tenth Avene, to East 34th Street and Park Avenue South? You could do only one turn: go north, and turn west. Or go west and turn north. Or you could do many turns, zig-zagging your way through the streets on the grid. And those would just be the shortest paths. You could also take the long way, zig zag all the way to the south of Manhattan, and take a site seeing tour back north.

By contrast, if there is only one country road, there is often only one way to get from point A to point B.

The more nodes and connections there are, the more the possible paths there are between nodes. As I pointed out in my last post, https://rodgersnotes.wordpress.com/2013/08/12/using-neo4j-to-find-all-parents-child-paths/ the number of paths in a graph, can be many multiples more than the number of nodes, or relationships.

So, given a parent node, finding the distinct set of children can give odd results. Here are some pitfalls to be aware of.

———–

Listing All Child Generations of SYS.STANDARD:

start p = node:node_auto_index ( object_id = ‘1219’  )
return p    

+————————————————————————————–+
| p                                                                                    |
+————————————————————————————–+
| Node[40583]{owner:”SYS”,object_name:”STANDARD”,object_type:”PACKAGE”,object_id:1219} |
+————————————————————————————–+

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [ : PARENT_OF* ] -> c
return  distinct  c.object_id
order by c.object_id


| 74700       |
| 74701       |
| 74702       |
| 74714       |
| 74719       |
| 74962       |
| 74963       |
| 74964       |
| 74965       |
| 74969       |
+————-+
9857 rows

In my database, there are 9857 distinct children of SYS.STANDARD, all generations.

———–

Over 1/2 Million Child Paths From One Object:

Count of all paths, to all generations of children:

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [ : PARENT_OF* ] -> c
return   count(*)

+———-+
| count(*) |
+———-+
| 543827   |
+———-+
1 row

From one single node, there are 9857 children, but over 543,827 possible paths to all these children. That is an average of over 55 paths per node. Yet, there are only 48,690 nodes in the entire NEO4J graph.

———–

Leaf Nodes That Have No Children:

What are the children of SYS.STANDARD, that have no children of their own?

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [  : PARENT_OF* ] -> c
where  not ( c- [ : PARENT_OF ]-> () )    // leaf only
return  distinct  c.object_id
order by c.object_id

….
| 74696       |
| 74697       |
| 74698       |
| 74699       |
| 74700       |
| 74702       |
| 74714       |
| 74719       |
| 74962       |
| 74964       |
| 74965       |
| 74969       |
+————-+
4623 rows

Of the 9857 children of SYS.STANDARD, 4623 are leaf nodes, that have no children of their own.

———–

First Generation Children of SYS.Standard:

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [ : PARENT_OF*1 ] -> c
return  distinct
length ( path )  AS PATH_LENGTH
, extract ( n in nodes ( path ) : n.object_id ) as the_path
order by length ( path )


| 1           | [1219,10003] |
| 1           | [1219,10002] |
| 1           | [1219,10001] |
| 1           | [1219,10000] |
+—————————-+
7128 rows

———-

Second Generation Children, and Count The Distinct Paths To Each Child Node:

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [ : PARENT_OF*2 ] -> c
return c.object_id
, count(*)
order by count(*), c.object_id

+————————+
| c.object_id | count(*) |
+————————+
| 1222        | 1        |
| 1245        | 1        |
| 2675        | 1        |

| 69920       | 65       |
| 57515       | 66       |
| 57593       | 73       |
| 57671       | 73       |
| 57749       | 73       |
| 11636       | 76       |
| 69200       | 77       |
| 69247       | 96       |
+————————+
7151 rows

———–

Third Generation Children, And Count Of The Distinct Paths To Each Child Node:

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [ : PARENT_OF*3 ] -> c
return length ( path)
, c.object_id
, count(*)
order by count(*), c.object_id

+—————————————–+
| length ( path) | c.object_id | count(*) |
+—————————————–+
| 3              | 2676        | 1        |
| 3              | 2678        | 1        |
| 3              | 3191        | 1        |
| 3              | 3199        | 1        |


| 3              | 9911        | 190      |
| 3              | 9916        | 190      |
| 3              | 69247       | 193      |
| 3              | 69200       | 196      |
+—————————————–+
5708 rows

——-

If you start to add up the counts, (7128 + 7151 + 5708), 19987 is already over twice as much as 9857, the distinct number of child nodes.

And, notice also the duplicate data with object_id 69247 and 69200? They are both are a second generation, and a third generation child.

Obviously, there are multiple paths to each object, each with a different number of nodes in the paths.

———–

Distribution Of All Possible Path Lengths Of All Child Nodes:

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [ : PARENT_OF* ] -> c
return  length ( path)
, count(*)
order by length ( path) , count(*)

+—————————+
| length ( path) | count(*) |
+—————————+
| 1              | 7128     |
| 2              | 25311    |
| 3              | 46005    |
| 4              | 59409    |
| 5              | 63570    |
| 6              | 64155    |
| 7              | 62259    |
| 8              | 55579    |
| 9              | 47804    |
| 10             | 37932    |
| 11             | 28461    |
| 12             | 19368    |
| 13             | 12141    |
| 14             | 7198     |
| 15             | 3943     |
| 16             | 2010     |
| 17             | 932      |
| 18             | 388      |
| 19             | 166      |
| 20             | 50       |
| 21             | 14       |
| 22             | 4        |
+—————————+
22 rows

Total of Counts: 543,827.

———–

Count All Possible Paths To Each Node:

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [ : PARENT_OF* ] -> c
return c.object_id
, count(*)
order by count(*), c.object_id

+————————+
| c.object_id | count(*) |
+————————+

| 9908        | 11794    |
| 9921        | 11828    |
| 10202       | 11953    |
| 9911        | 12504    |
| 9916        | 12504    |
| 9933        | 12755    |
| 10179       | 13238    |
| 10183       | 13272    |
| 9913        | 13638    |
| 10204       | 13799    |
| 10185       | 14201    |
| 10181       | 15084    |
+————————+
9857 rows

Between object_id 1219 and 10181, there are 15,084 possible paths! Wow!

———–

List Every Distinct Path Between Two Specific Nodes:

Working with object_id 1219 and 10181, list all the possible paths between them.

start p = node:node_auto_index ( object_id = ‘1219’  )
MATCH path =  p – [ : PARENT_OF* ] -> c
where  has ( c.object_id )
and   c.object_id = 10181
return  length ( path )  AS PATH_LENGTH
, extract ( n in nodes ( path ) : n.object_id ) as the_path
order by length ( path )

+—————————————————————————————————————–+
| PATH_LENGTH | the_path                                                                                          |
+—————————————————————————————————————–+
| 1           | [1219,10181]                                                                                      |
| 2           | [1219,9911,10181]                                                                                 |
| 2           | [1219,9910,10181]                                                                                 |
| 2           | [1219,9887,10181]                                                                                 |
| 2           | [1219,9885,10181]                                                                                 |
| 2           | [1219,9883,10181]                                                                                 |

| 17          | [1219,4900,4902,4908,9608,9704,9705,9724,9743,9757,9758,9861, 9867,9868,9869,9910,9911,10181]      |
| 17          | [1219,4900,4902,4908,9608,9704,9705,9724,9743,9757,9758,9861, 9867,9868,9869,9910,10170,10181]     |
| 17          | [1219,4900,4902,4908,9608,9704,9705,9724,9743,9757,9758,9861, 9867,9868,9869,9870,9911,10181]      |
| 18          | [1219,8197,8199,8201,8203,8205,8207,8243,9547,9698,9702,9725, 9745,9747,9763,9865,9870,9911,10181] |
| 18          | [1219,8197,8199,8201,8203,8205,8207,8243,9547,9698,9702,9725, 9745,9747,9763,9862,9870,9911,10181] |
+—————————————————————————————————————–+
15084 rows

Note that the path can be direct, with a path length of one.  Or, a path length of 18 nodes long!

Using the NYC analogy, this is the equivalent of taking the scenic tour all over Manhattan, before arriving at the destination.

———–

Querying DBA_DEPENDENCIES:

Returning to where I got the data from, the Oracle data dictionary, and querying DBA_DEPENDENCIES, we get:

Select
TYPE  || ‘ ‘ ||
OWNER || ‘.’ ||    NAME  || ‘ references ‘ ||
REFERENCED_TYPE  || ‘ ‘ ||
REFERENCED_OWNER || ‘.’ ||   REFERENCED_NAME
as DEPENDENCIES
From all_dependencies
Where referenced_name = UPPER ( LTRIM ( RTRIM( ‘STANDARD’  )))
AND   name = UPPER ( LTRIM ( RTRIM ( ‘KU$_M_VIEW_PFH_VIEW’ )))
order by 1

DEPENDENCIES
————————————————————————————————————————
VIEW SYS.KU$_M_VIEW_PFH_VIEW references PACKAGE SYS.STANDARD

This means that the source code for SYS.KU$_M_VIEW_PFH_VIEW, refers to PACKAGE SYS.STANDARD. Therefore, SYS.KU$_M_VIEW_PFH_VIEW is a child of SYS.STANDARD.

Which confirms the Cypher query. There is a direct link between the two objects.

———–

Think About What You’re Doing:

One takeaway: the more complex and interconnected the graph, the number of paths between nodes goes up exponentially.

Even though an object can be directly connected to another, there can be many other paths between the same two objects.  A similar idea is Linked In. You can be connecteddirectly to your friend. But you are also connected via a mutual friend.  And friends of friends.

Again, the duplicates in these Cypher queries remind me of Cartesian products in SQL queries.  As I wrote years ago, to avoid duplicates in SQL, you need to be very aware of what you can realistically expect from a query. The same idea also applies to NEO4J and Cypher.

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: