Getting Started With NEO4J: For Database Professionals


Why Use the NEO4J Graph Database?

About 10 years ago, I went to a BIO-IT conference and looked at a spiralling 3D model of a very large molecule made up of hundreds of atoms. I thought, “That does not look like the rows and columns in a relational database.”

If I were to store that collection of atoms in SQL, how would I design the schema? It could be done in an RDBMS with SQL, but it would take a fair amount of thought, code, and recursion. And what shape would the data have after retrieval: rows and columns again? With all the reads involved, how fast would retrieval be for each molecule?

Lately I’ve been looking at the graph database NEO4J. It looks like it can do many things more easily than a traditional RDBMS. A graph database seems better suited than an RDBMS, or hand-built linked lists, to some applications: for example, molecules, tree structures, social networks, or paths.

One interesting contrast between an RDBMS and a graph database is how each handles relationships. In an RDBMS, the relationships are predefined by CREATE TABLE and PK/FK constraints, and populated by inserting rows. The structure of the relationships does not change until the architect or DBA changes the schema and moves the data. For this reason, I think schemas and table structures should be given a lot of thought up front, then refined and changed over time.

In a graph database, the relationships are dynamic and constantly changing. When you model a social network in a graph database, each new connection between people becomes a new relationship between nodes. It’s a new paradigm.

Here, I’ll show some initial explorations with NEO4J.

Install NEO4J and the Cineasts Movie Database:

NEO4J comes with a small database based on the movie The Matrix. However, that database has only about 7 nodes. The downloadable movie database has over 60,000 nodes, which is a more realistic set of data. Go to:

http://info.neotechnology.com/0110-cypher.html

Download NEO4J and install it.

Download the movie database and overwrite the old database.

There are even some canned queries there for you.

NEO4J Installs Easily:

The NEO4J software just unzips into a directory. No RPMs. No running an executable.

One thing I have to give NEO4J a lot of credit for is how easy it is to install. If you have looked at some of my other posts, such as VM Ware Player vs Virtual Box, you know that these days I demand that software install intuitively and easily, without requiring a 150-page installation document as a prerequisite.

Java Prerequisite:

One point to note is that Java 1.6 is required to use NEO4J. If your $PATH defaults to an older version of Java, or you don’t have Java 1.6 installed, NEO4J won’t work.

On Linux, start NEO4J like this:

cd /neo4j/neo4j-community-1.8.1/bin
./neo4j start &

Then go to the web based data browser at:

http://localhost:7474/webadmin/

Interestingly enough, when I substituted the server name or the server’s local IP address for localhost in the URL, I got the error: “Could not connect to remote server.” But administration comes later.

Cypher Compared To SQL:

Cypher     SQL Clause

RETURN     SELECT
START      FROM
WHERE      WHERE
MATCH      JOIN
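To make the mapping concrete, here is a rough sketch of one query that uses all four clauses, with the SQL counterpart noted beside each line. It assumes things not established in this post: an index named Person (it appears in the reader response at the end), that ACTS_IN points from the actor to the movie, and that movie nodes carry a title property. Treat those names as assumptions to check against your copy of the data. The // comments are annotations for the reader; remove them if your version of Cypher rejects them.

```cypher
START actor = node:Person(name = "Lucy Liu")   // FROM: where the query begins (assumes a Person index)
MATCH actor-[:ACTS_IN]->movie                   // JOIN: follow relationships (direction is an assumption)
WHERE has(movie.title)                          // WHERE: filter (title is an assumed property name)
RETURN movie.title                              // SELECT: what comes back
```

Unlike a SQL join, the MATCH line names no key columns; the relationship itself is the link between the two nodes.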

Navigating the NEO4J Schemas With Cypher Commands:

When I started working with Oracle, before I could create any queries, I first needed to know information about the tables and schema that I was working with. If I didn’t know the table name, I couldn’t query it.

Two key things helped me. One was the Oracle data dictionary: if I wanted to know which tables I had access to, I ran the command: Select table_name from All_tables. The other was the Describe command in SQL*Plus, which showed me a table’s structure, so I could cut and paste the fields into a query.

This is what I call navigating. Here are some useful queries to navigate the NEO4J movie database.

On the local machine, go to:

http://localhost:7474/webadmin/

One fine point about running queries in the NEO4J Data Browser: if you leave a blank line, or even a blank character, before the Start command, the query won’t run, and this message comes back:
Query not executed yet. Press the search button or hit CTRL+ENTER inside the query editor to execute it.

Remove any blanks, and place Start in the upper left corner.

Query To Get A Count Of All Nodes:

START n=node(*)
return count(*)

Returned 1 row.
Query took 2060ms

count(*)
63075

Another place to find the node count is in Overview:dashboard. There, you can also see the number of relationships, properties, and relationship types.
http://localhost:7474/webadmin/

Properties Can Vary By Node, and Node Type:

Select * queries are different in NEO4J. Consider a SQL query:

Select count(*)
from Person
where name = 'Lucy Liu'

In an RDBMS table, Person, every row will have every column in its structure, including name. If Person.name is allowed to be NULL, some rows may have NULL in that field.

There are some differences with NEO4J. If you run the following query, you will get an error:

START n = node(*)
WHERE (n.name="Lucy Liu")
return n

The property ‘name’ does not exist on Node[0]

The error occurs because there are different kinds of nodes, and not every kind has a name property.

Filtering Using WHERE and HAS:

One way to make the above query work is to use the HAS() function to keep only those nodes that have the property name.

START n = node(*)
WHERE has(n.name)
and (n.name="Lucy Liu")
return n.name

Returned 1 row.
Query took 695ms

n.name
"Lucy Liu"

Using GROUP BY To Determine All Relationship Types:

START n=node(*)
MATCH n-[r]-m
RETURN type(r), count(*)
ORDER BY count(*) desc

Returned 5 rows.
Query took 1525ms

type(r)       count(*)

"ACTS_IN"     189412
"DIRECTED"    23836
"RATED"       68
"FRIEND"      30
"ROOT"        2

This is a very useful query. To write queries to filter for the different kinds of relationships, you first need to know what the relationships are, and their exact spelling.
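One caveat about the counts above: the pattern n-[r]-m has no direction, so each relationship is most likely matched once from each of its two end nodes, which would double every count (all five counts are even, which is consistent with that). A sketch that should count each relationship once uses a directed pattern instead. The arrow is the only change, and no output is shown because this variant has not been run against the movie data here.

```cypher
START n=node(*)
MATCH n-[r]->m              // directed: each relationship is matched only from its start node
RETURN type(r), count(*)
ORDER BY count(*) desc
```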

One point to note is that in Cypher, the phrase “GROUP BY” is not actually used in the query. The grouping is implied by the aggregate function, count(*): the non-aggregated column, type(r), becomes the grouping key.

ORDER BY is used in the same way as in SQL, except that, unlike in Oracle, you cannot order by column position. The following query executes, but the result is not ordered by either type(r) or count(*).

START n=node(*)
MATCH n-[r]-m
RETURN type(r), count(*)
ORDER BY 2, 1

Returned 5 rows.
Query took 1458ms

type(r)       count(*)

"RATED"       68
"FRIEND"      30
"DIRECTED"    23836
"ROOT"        2
"ACTS_IN"     189412
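The reader response at the end of this post suggests an alternative to column positions: give the returned columns aliases with AS and order by those. A minimal sketch, where rel_type and total are alias names made up for this example:

```cypher
START n=node(*)
MATCH n-[r]-m
RETURN type(r) AS rel_type, count(*) AS total   // rel_type and total are illustrative aliases
ORDER BY total DESC, rel_type
```

Ordering by name rather than position also keeps the query correct if columns are later added or rearranged in the RETURN clause.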

Returning Either A Property, Or An Entire Object:

NEO4J can return just individual properties (columns in SQL) or entire objects.

The previous query for “Lucy Liu” returns just the text of name. The following query returns the entire object.

START n = node(*)
WHERE has(n.name)
and (n.name="Lucy Liu")
return n

In this case, the node is identified internally as Node 1000.

Notes On Node Numbers:

One thing that seems odd is that for Node 1000, the value of the id property is actually “140”. n.id (“140”) is not the same as the internal identifier of the node (1000).

This is different from using sequences as unique identifiers in an RDBMS. I sense that it is a Java thing that runs under the covers.

I’ve been trying to get a query to return these Node IDs, but so far I haven’t figured this out. If anyone knows a query that will work, please let me know.
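The reader response at the end of this post answers this: the internal identifier is available through the ID() function. A sketch that returns it next to the id property for comparison, reusing the HAS filter from the earlier Lucy Liu queries:

```cypher
START n = node(*)
WHERE has(n.name) and n.name = "Lucy Liu"
RETURN ID(n), n.id, n.name   // ID(n) is the internal node identifier; n.id is the stored property
```

For the Lucy Liu node, the first column would be expected to show the internal identifier and the second the id property value, which makes the difference between the two easy to see side by side.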

Useful Links:

http://structr.org/cypher-cheat-sheet.html
http://www.neo4j.org/learn/cypher
http://watch.neo4j.org/video/57174859
http://docs.neo4j.org

http://info.neotechnology.com/0110-cypher.html

Giving credit, some of these queries came straight from these links.

All for now. More later.

One Response to Getting Started With NEO4J: For Database Professionals

  1. Thanks a lot for your feedback, good article.

    I submitted a PR taking care of the leading whitespace in the databrowser edit window. (https://github.com/neo4j/neo4j/pull/490)

    The “id” properties are business-level IDs, in this case the id values from themoviedb.org, where the data comes from. The internal neo4j IDs are independent of that. You can access them with ID(n).

    Regarding ordering by column number, I don’t think that is very maintainable. But you can give your columns aliases with AS and order by them.

    What you would usually do for looking up a starting-node by a property is NOT this very expensive query:

    START n = node(*)
    WHERE (n.name="Lucy Liu")
    return n

    but use an index lookup

    START n = node:Person(name="Lucy Liu")
    return n

    Cheers

    Michael
