Analyzing Enron Email Metadata Using SQL


enron-logo

enron-logo

What can we discover analyzing Enron emails using SQL? Quite a bit actually.

The Enron scandal in 2001 was huge.  As part of the discovery process, prosecutors started looking at emails to find evidence to convict the guilty. These email sets have since been made public.

Recently, I downloaded a set of Enron emails from USC:
http://www.isi.edu/~adibi/Enron/Enron.htm
and installed them into a MYSQL database.

There were over 252K email messages, sent to over 2 million recipients.

+-----------------+
| Tables_in_enron |      Count
+-----------------+
| employeelist    |        151
| message         |    252,759
| recipientinfo   |  2,064,442
| referenceinfo   |     54,778
+-----------------+

Analyzing the emails produced some very interesting findings about what went on inside Enron!

MetaData:

The table, Employeelist, contains the first name, last name, and email of 151 Enron employees. The table, Message, contains the actual message, and the sender’s email. The table, Recipientinfo, is the 1:M relationship with Message. It contains all the TO/CC/BCC recipients of the message. There are often a lot of recipients for every email. On average, there are about 8.17 recipients for every email.

Message Table:

Of the 1/4 million messages, there were about 17.5K distinct senders.

+------------------------+----------+
| count(distinct sender) | count(*) |
+------------------------+----------+
|                  17568 |  252,759 |
+------------------------+----------+

Recipientinfo Table:

The over 2 million messages were received by over 68K people.

+----------------+----------+
| cnt_recipients | count(*) |
+----------------+----------+
|          68214 |2,064,442 |
+----------------+----------+

How Many Recipients Were On Each Email?

The greatest number of email messages, 8245, were sent to just one person.

+------------------+----------+
| No_Of_Recipients | count(*) |
+------------------+----------+
|              212 |        1 |
|              234 |        1 |
|              255 |        1 |
|              266 |        1 |
...

|                9 |      379 |
|                8 |      626 |
|                7 |      765 |
|                6 |      837 |
|                5 |     1017 |
|                4 |     1616 |
|                3 |     1670 |
|                2 |     4264 |
|                1 |     8254 |
+------------------+----------+
1097 rows in set (2.56 sec)

However, the maximum number of recipients on a single email was: 19198.

No PK/FK In This Enron Dataset:

If you have a well thought out data model, the model gives you structure and meaning. Someone with database expertise can then determine pretty quickly what the relationships between the data are, what they mean, and how to work with them. While there are shared fields between the Enron email tables, there are no Primary Key/Foreign Key relationships. So, it took a little longer to figure out meaningful results.

Which Organizations Did Enron Communicate With?

From All Domains, To All Domains:

+------------------------------+-------------------------+----------+
| Sender_Domain                | Recipient_Domain        | count(*) |
+------------------------------+-------------------------+----------+

...
| @hoegh.no                    | @enron.com              |     1460 |
| @hotmail.com                 | @enron.com              |     1475 |
| @enron.com                   | @bracepatt.com          |     1525 |
| @epsa.org                    | @enron.com              |     1577 |
| @aol.com                     | @enron.com              |     1624 |
| @duke-energy.com             | @duke-energy.com        |     1673 |
| @bracepatt.com               | @enron.com              |     1731 |
| @gmssr.com                   | @enron.com              |     1921 |
| @columbiaenergygroup.com     | @enron.com              |     2070 |
| @txu.com                     | @txu.com                |     2130 |
| @yahoo.com                   | @enron.com              |     2139 |
| @enron.com                   | @hotmail.com            |     2189 |
| @amerexenergy.com            | @enron.com              |     2543 |
| @txu.com                     | @enron.com              |     2882 |
| @govadv.com                  | @enron.com              |     4955 |
| @enron.com                   | @aol.com                |     5261 |
| @enron.com                   | @ENRON                  |     5500 |
| @haas.berkeley.edu           | @haas.berkeley.edu      |     5608 |
| @enron.com                   | @enron.com              |  1701614 |
+------------------------------+-------------------------+----------+
34786 rows in set (5.58 sec)

It looks like there was a lot of communication with personal accounts. And that there were a number of employees taking online courses from the HAAS Business School at UC Berkeley.

| bortman@haas.berkeley.edu | sidbury@haas.berkeley.edu    | This is a reminder to all 3rd year students that you should be bidding for classes today. The first round of bidding ends tomorrow Thursday, November 8, at 12 noon. If you have any questions please give us a call at 510-642-1406Good Luck Josh Bortman EWMBA Program

Data Quality:

Turns out there were a lot of variations in Jeff Skilling’s and Ken Lay’s email address. And, these addresses didn’t correspond to the official email address in the table, employeelist.

+------------------------------+----------+
| recipient                    | COUNT(*) |
+------------------------------+----------+
| effrey_skilling@enron.com    |        1 |
| jskilling@enron.com          |        1 |
| jeffrey.k.skilling@enron.com |        1 |
| skillingj@enron.com          |        1 |
| k.l.lay@enron.com            |        1 |
| ssskenneth.lay@enron.com     |        1 |
| kennethlay@enron.com         |        1 |
| effrey.skilling@enron.com    |        1 |
| jeffrey_skilling@enron.com   |        1 |
| klay.enron@enron.com         |        1 |
| eff_skilling@enron.com       |        1 |
| k.lay@enron.com              |        1 |
| layk@enron.com               |        1 |
| kllay@enron.com              |        2 |
| skilling@enron.com           |        2 |
| k_lay@enron.com              |        2 |
| kenlay@enron.com             |        3 |
| ken_lay@enron.com            |        3 |
| kenneth.l.lay@enron.com      |        3 |
| lay@enron.com                |        4 |
| jeffrey.skilling@enron.com   |        7 |
| jeff_skilling@enron.com      |        9 |
| lay.kenneth@enron.com        |       10 |
| kenneth_lay@enron.com        |       50 |
| klay@enron.com               |     1553 |
| jeff.skilling@enron.com      |     2071 |
| kenneth.lay@enron.com        |     2761 |
+------------------------------+----------+

So, I looked at who sent these emails to the most common misspelling, klay@enron.com.

+------------------------------- -----------+
| sender                           count(*) |
+------------------------------- -----------+
| 110165.74@compuserve.com                1 |
...

| tiptonhr@yahoo.com                      1 |
| tlittleman@aol.com                      1 |
| tmartini@glencove.com                   1 |
| tobrien8@depaul.edu                     1 |
| tom.siekman@compaq.com                  4 |
| tomicamachouston@aol.com                2 |
| tomrip1@netscape.net                    1 |
| tony_eng@netcel360.com                  1 |
| trash@16visions.com                     1 |
| tricebn@pacbell.net                     1 |
...
| wewerlacy@aol.com                       2 |
| weyoung@earthlink.net                   1 |
| whurt@cvalley.net                       1 |
| wickedbill@yahoo.com                    1 |
| wieman@cooper.edu                       1 |
| wild@sfo.com                            1 |
| will.reed@techforall.org                4 |
...
+-------------------------------------------+
1346 rows in set (0.60 sec)

Most of these emails came from outside Enron, from personal email accounts. I assume that they are spammers. Or strangers who were trying to contact the executive.

From Within Enron, Who Sent The Most Messages?

+----------------------------------------------+----------+
| sender                                       | COUNT(*) |
+----------------------------------------------+----------+

...
| pete.davis@enron.com                         |     2500 |
| debra.perlingiere@enron.com                  |     2549 |
| matthew.lenhart@enron.com                    |     2784 |
| enron.announcements@enron.com                |     2838 |
| no.address@enron.com                         |     3511 |
| chris.germany@enron.com                      |     3686 |
| vince.kaminski@enron.com                     |     4366 |
| tana.jones@enron.com                         |     4437 |
| sara.shackleton@enron.com                    |     4797 |
| kay.mann@enron.com                           |     5100 |
| jeff.dasovich@enron.com                      |     6272 |  <-
+----------------------------------------------+----------+
6102 rows in set (0.55 sec)

Jeff Dasovich, Government Relations Executive, sent the most emails, and appeared continuously in the queries.

Who Received The Most Emails?

+---------------------------------------------------------+----------+
| rvalue                                                  | COUNT(*) |
+---------------------------------------------------------+----------+

...
| mark.guzman@enron.com                                   |     5907 |
| susan.mara@enron.com                                    |     5910 |
| james.steffes@enron.com                                 |     6212 |
| steven.kean@enron.com                                   |     6397 |
| pete.davis@enron.com                                    |     7509 |
| louise.kitchen@enron.com                                |     7593 |
| mark.taylor@enron.com                                   |     8530 |
| sara.shackleton@enron.com                               |     9632 |
| tana.jones@enron.com                                    |     9777 |
| jeff.dasovich@enron.com                                 |    10644 | <- 
| richard.shapiro@enron.com                               |    11015 | <-
| no.address@enron.com                                    |    19198 |
+---------------------------------------------------------+----------+
29042 rows in set (2.54 sec)

Richard Shapiro, Vice President of Regulatory Affairs, actually received more emails than Jeff Dasovich.

Email Recipient Type:

+-------+----------+
| rtype | count(*) |
+-------+----------+
| TO    |  1556994 |
| CC    |   253735 |
| BCC   |   253713 |
+-------+----------+

In the recipientinfo table, there are approximately as many BCC as there are CC. I’m sure that many of the BCC messages were done via email aliases. If an email was sent the “Trading Group”, it was sent to all those on that distribution list.

That gives us a good idea of what our data set is, and looks like.  If you read my blog, you’ll notice that this is the basic data analysis that I always do.  Stay tuned for more.

2 Responses to Analyzing Enron Email Metadata Using SQL

  1. Rose Dl says:

    Recently, I downloaded a set of Enron emails from USC:
    http://www.isi.edu/~adibi/Enron/Enron.htm
    and installed them into a MYSQL database…
    the url that shown above doesnot exit..can u give the new one…i wanna get the set of Enron eamil and install them into MYSQL database…ty

  2. rodgersnotes says:

    Try these links:

    http://www.cs.cmu.edu/~enron/

    which has a link to download the Enron email dataset:
    http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz

    Although, It’s not the link that I used.

Leave a comment