Recommendation Engines: RDBMS and SQL, Versus Graph Database


For years, I’ve worked with Oracle, doing complex SQL. Recently I’ve been looking at the graph database, NEO4J.

Last night I was watching a NEO4J webinar about Graphs for Gaming.
It makes some interesting Cypher (NEO4J query language) queries for recommendation engines for a gaming company. The point being that certain tasks are much easier in Graph DB/Cypher than in RDBMS/SQL.

The more complex query was to take an individual gamer, Rik, and find other users/gamers, who:
– had worked at one the same companies as Rik
– spoke one of the same languages as Rik
– had NOT gamed with Rik yet

The 12 line Cypher query was:  

Start
rik=node:node_auto_index(name=”Rik”)
Match
rik-[ :SPEAKS ]-> language,
rik-  [:PARTICIPATES  ] -> session,
rik- [:WORKED_FOR|WORKS_FOR ] -> company,
person2- [:WORKED_FOR|WORKS_FOR ] -> company,
person2- [:SPEAKS ]-> language ,
person2- [:PARTICIPATES ] -> session ]
Where
Not ( person2-  [:PARTICIPATES ] -> session)
Return distinct person2.name

Having wrote complex queries for years, this didn’t seem too difficult for me. So how would you do it in SQL? Not having access to the actual tables, I could only imagine what the tables were. But the solution I envisioned was to use inline views, one of the incredible aspects of SQL, and an outer join.

( /* outer inline view, first half */
select person_id, name, company, language, session_id
from
( /* inline view for work history */
Select person_id, name, wh.company
from person p
join work_history wh
on p.person_id = wh.person_id
where p.name = ‘Rik’
) rik_work
join
( /* inline view for languages */
Select name, language
from person p
join language_person lp
on p.person_id = lp.person_id
join lanuage l
on l.language_id = lp.language_id
where p.name = ‘Rik’
) rik_lang
on rik_work.person_id = rik_lang.person_id
Join
( /* inline view for sessions */
Select person_id, name, s.session_id
from person p
join session s
on p.person_id = s.person_id
where p.name = ‘Rik’
) rik_sess
on rik_work.person_id = rik_sess.person_id
) first_half

That completed the inline view for Rik.

To find other gamers: make another inline view, and do an outer join to the one for Rik.

Select distinct person.person_id, person.name
from
( Select …

) first_half
Left Outer join
( Select …
/* Same as above,
where person.name/id <> rik */
) second_half
on first_half.language = second_half.language
and second_half.language is not null
on first_half.company = second_half.company
and second_half.company is not null
on first_half.session_id = second_half.session_id
and second_half.session_id is NULL

My first thought was to use INTERSECT. But this would only work, if the sessions were the same. This query is looking for similar people, but different sessions.

It took me about 15 minutes to write the basic query. The hard part was making assumptions on what the tables and fields were.

Cartesian Products:

An interesting point to note with this kind of a SQL query, is that each time you add multiple properties, such as language spoken, it multiplies the number of rows returned.

So, in the first half, if Rik:
had worked for 5 companies,
spoke 3 languages, and
had done 100 game sessions

There would be 1500 rows returned just for Rik.
5 companies, 3 languages – 15 row sets. Multiplied for each of the 100 sessions.

rik, company1, language1, session1
rik, company2, language1, session1
rik, company3, language1, session1

rik, company1, language2, session1

rik, company5, language3, session100

The second inline view, for all the other users, each with multiple companies, languages, and sessions, would become many multiples bigger than the first.

Each inline view becomes a cartesian product. And, put together, both become yet another cartesian product.

Cartesian products are either misunderstood, or, completely unknown to many database practitioners. And especially non database programmers. See the presentation I made on cartesian products at the Oracle User’s Group.

—-

The SQL query is complex, but doable. All in a single SQL statement! Without using SQL, how would you do it in a 3GL language such as java?

For many very cool tricks with SQL, see a tuning presentation that I did at the Oracle user group.

—-

The Cypher query assumes that the data for language, session, and companies worked at, are all in individual nodes. Not properties in a node.

They are certainly not in a denormalized string of “English, French, Swedish, German”. Or should it be, “English, American, Australian, Canadian, Irish, Scottish”? 🙂

—-

However, given the available data structure, the query is much shorter in Cypher, as you can see. The graph database is very useful in some areas.

—-

There was another query in the webinar which was easier. The question was: For those gamers who had played “Puzzle Pirates”, find the other games they had played. Count the times the other games had been played. Rank the highest played games for use in a recommendation.

Watch the webinar and see if you can do it in SQL.

Permalink: https://rodgersnotes.wordpress.com/2013/05/30/recommendation-engines-rdbms-versus-graph-db/

4 Responses to Recommendation Engines: RDBMS and SQL, Versus Graph Database

  1. Wow. Very well done. Thanks for making the comparison – it just rocks.

    Rik

  2. Nice post, but do you have performance results? Which DB performed better with same data for recommendation engine?

    Thanks!

  3. rodgersnotes says:

    Hello Alejandro,

    It’s a good question, but for a comparison, you really need to have an actual use case and realistic data.

    I’d wrote:

    “Not having access to the actual tables, I could only imagine what the tables were.”
    and
    “It took me about 15 minutes to write the basic query. The hard part was making assumptions on what the tables and fields were.”

    And on design, I noted: “The Cypher query assumes that the data for language, session, and companies worked at, are all in individual nodes. Not properties in a node.” If the NEO4J attributes were fields/properties in a node, the Cypher query would be different. Longer perhaps.

    The reason that I noted that was that in an internal company system, you may not have a field for language at all. If it does exist, it could in fact just be a denormalized string of multiple languages: “English, French, Swedish, German”, as I pointed out. If that was the case, the query would have to stop and do a sideways linear search through the list of languages, for every row/node. That’s really slow in RDBMS, and I know it would slow down the graph DB too.

    For more info on RDBMS tuning, do check out the links in the post on database tuning, and cartesian products. As well as the series I wrote: Database Design Mistakes To Avoid.

    How you do your design in RDMBS will have a very direct effect on performance. Bad designs can have: poor performance, inaccurate results, cartesian products, dupe data, etc.

    Design concepts will also apply to the Graph DB. As well as design of: java objects, architecture, programs, algorithms, user interface, etc.

    HTH

Leave a comment