MUCK – Massively Unified Code-Key, Generic Three Table Data Model


Another Database Design Mistake To Avoid. In fact, this is one of the worst DB design mistakes you can make.

One of the worst things you can do to a database system is to make a generic data model of three tables for the whole system. A variation is to use three tables to combine multiple lookup tables into one.

Instead of standard tables and columns, three tables are used for everything.

One table each for:
entity/table
attribute/column/field
values of the data


Instead of a number of columns in a row, you now have a single column and multiple rows.  One writer, Don Peterson, referred to this method as MUCK, for Massively Unified Code-Key.

This has proved to be such an incredibly bad idea that it has already been discussed online in several places.  The articles unanimously explain why you should never do this.  A few well-written articles can be found here:

http://www.simple-talk.com/sql/database-administration/five-simple-database-design-errors-you-should-avoid/

http://www.simple-talk.com/sql/database-administration/ten-common-database-design-mistakes/

http://www.projectdmx.com/dbdesign/lookup.aspx

Tom Kyte also wrote about this in “Effective Oracle By Design”, on page 35: “But how does this model perform?  Miserably, terribly, and horribly.”

All these articles reach the same conclusion, and give even more (different) reasons why you shouldn’t use this idea.  Do check them out.

The results of MUCK:

Awful Performance:

This kind of system is actually DESIGNED to go slow.  For even the simplest of queries, you now have to join at least three tables.
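To make the join cost concrete, here is a minimal sketch of a MUCK/EAV schema in SQLite (via Python). All table and column names here are hypothetical, invented purely for illustration:

```python
import sqlite3

# Hypothetical minimal MUCK/EAV schema: one table each for
# entities, attributes, and values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities   (entity_id INTEGER PRIMARY KEY, entity_name TEXT);
CREATE TABLE attributes (attr_id   INTEGER PRIMARY KEY, attr_name   TEXT);
CREATE TABLE vals       (entity_id INTEGER, attr_id INTEGER, val TEXT);

INSERT INTO entities   VALUES (1, 'customer');
INSERT INTO attributes VALUES (10, 'last_name');
INSERT INTO vals       VALUES (1, 10, 'Smith');
""")

# Even "what is entity 1's last name?" needs a three-table join:
row = conn.execute("""
    SELECT v.val
      FROM entities e
      JOIN vals v       ON v.entity_id = e.entity_id
      JOIN attributes a ON a.attr_id   = v.attr_id
     WHERE e.entity_id = 1 AND a.attr_name = 'last_name'
""").fetchone()
print(row[0])  # Smith
```

Fetching one value for one entity already touches three tables; a conventional design answers the same question with a single-table lookup.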

Pivots:

Another reason for the bad performance is that each value for each attribute is stored in a separate row.  So, if you have five attributes, the data will be stored in five different rows.  To get them all back on a single row, you have to pivot the data.  This is, of course, more work than if all the data were on a single row.
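The pivot described above can be sketched with conditional aggregation. The schema and names below are hypothetical, and three attributes stand in for five:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vals (entity_id INTEGER, attr_name TEXT, val TEXT);
INSERT INTO vals VALUES
  (1, 'first_name', 'Ann'),
  (1, 'last_name',  'Smith'),
  (1, 'city',       'Toronto');
""")

# Three attributes = three rows; to get them back on one row,
# the data must be pivoted (here with MAX/CASE aggregation):
row = conn.execute("""
    SELECT entity_id,
           MAX(CASE WHEN attr_name = 'first_name' THEN val END) AS first_name,
           MAX(CASE WHEN attr_name = 'last_name'  THEN val END) AS last_name,
           MAX(CASE WHEN attr_name = 'city'       THEN val END) AS city
      FROM vals
     GROUP BY entity_id
""").fetchone()
print(row)  # (1, 'Ann', 'Smith', 'Toronto')
```

Every attribute added to the model means another CASE branch in every query that needs the data back in row form.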

Multiple Inline Views:

Since each piece of data is on a different row, if you have more than one field to retrieve, you have to create an inline view for each field!  So, to retrieve five fields with their appropriate values on the same row, you actually have to create five inline views!  The queries get really big and complex, fast.
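The one-inline-view-per-field pattern looks like this in a hypothetical two-field case; five fields would need five of these subqueries, each joined back on the entity id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vals (entity_id INTEGER, attr_name TEXT, val TEXT);
INSERT INTO vals VALUES (1, 'first_name', 'Ann'), (1, 'last_name', 'Smith');
""")

# One inline view per field, joined on entity_id:
row = conn.execute("""
    SELECT fn.val AS first_name, ln.val AS last_name
      FROM (SELECT entity_id, val FROM vals WHERE attr_name = 'first_name') fn
      JOIN (SELECT entity_id, val FROM vals WHERE attr_name = 'last_name')  ln
        ON ln.entity_id = fn.entity_id
""").fetchone()
print(row)  # ('Ann', 'Smith')
```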

Data Type Problems:

To make this system work, the field that holds the value is made of type VARCHAR.  It holds a representation of other data types, such as a number.  But the number will be stored as the character string “8”, not as the number 8.  At some point, you will need to convert it from a string back to a number.  Eventually, a non-numeric value such as “A” will sneak in, one that can’t be converted back into a number.  The query will fail, and you will be debugging and fixing it manually.  More people time.
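The conversion problem can be demonstrated with a small sketch (hypothetical names, SQLite for runnability). Note that where Oracle would raise an "invalid number" error on the bad row, SQLite silently casts it to 0, which is arguably even worse for data quality:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vals (entity_id INTEGER, attr_name TEXT, val TEXT)")
conn.execute("INSERT INTO vals VALUES (1, 'age', '8')")  # a number stored as text
conn.execute("INSERT INTO vals VALUES (2, 'age', 'A')")  # a bad value sneaks in

# Converting the strings back to numbers: the non-numeric row
# silently becomes 0 in SQLite; stricter databases fail outright.
rows = conn.execute(
    "SELECT entity_id, CAST(val AS INTEGER) FROM vals WHERE attr_name = 'age'"
).fetchall()
print(rows)  # [(1, 8), (2, 0)]
```

Either way, the database can no longer enforce the type for you; a real INTEGER column would have rejected 'A' at insert time.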

Really Complex Inserts:

For an insert on a normal system, you would insert one row with the values for (say) five different fields.  To get the equivalent in the MUCK system, you would need five separate insert statements!  More people time.
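Here is that comparison as a sketch, with hypothetical names and three fields standing in for five:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER, first_name TEXT, last_name TEXT, city TEXT);
CREATE TABLE vals   (entity_id INTEGER, attr_name TEXT, val TEXT);
""")

# Normal design: one row, one statement.
conn.execute("INSERT INTO person VALUES (1, 'Ann', 'Smith', 'Toronto')")

# EAV/MUCK equivalent: one INSERT per attribute.
for attr, val in [("first_name", "Ann"), ("last_name", "Smith"), ("city", "Toronto")]:
    conn.execute("INSERT INTO vals VALUES (?, ?, ?)", (1, attr, val))

person_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
vals_count   = conn.execute("SELECT COUNT(*) FROM vals").fetchone()[0]
print(person_count, vals_count)  # 1 3
```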

People Time:

Lots more people time is necessary for even the simplest of queries.  For a simple query, you will sometimes have to query the lookup tables first, before you even compose your SQL statement, so that you can find the codes to filter on.  “What is the code for ???”

Too Confusing For Even The Brightest:

Once my department had a task to double-check a lot of data from these tables and compare it to another system.  With regular tables and columns, we would have just selected the fields and been done in a few minutes.  Instead, we spent a number of hours trying to ensure these complex queries were correct.

And yet, the three of us came up with three different answers!  Which one was correct?  This was in spite of the fact that all three of us knew SQL really well.  How’s that for accuracy?

Cost:

What is the dollar value of the people time to use and maintain this system?  Compared to whatever advantages there were for doing it in the first place?

It’s Cobol:

If you have to look at other fields, or run processing to know what a particular row or field represents, you are basically running a Cobol system.  More on Cobol systems in another post.

Correction:

JUST SAY NO!     Standard database design practice please.

11 Responses to MUCK – Massively Unified Code-Key, Generic Three Table Data Model

  1. rodgersnotes says:

    I have since found out that MUCK is also known as the Entity Attribute Value (EAV) Model, or Open Schema.

    This wikipedia entry describes it.
    http://en.wikipedia.org/wiki/Entity–attribute–value_model

    But remarkably, the article comes off as if EAV is actually a model to consider! It does mention the problem of pivoting, and offers three very costly solutions to it: a separate warehouse, materialized views, and in-memory data structures. More complexity upon complexity.

    But still no real added benefit to just doing it right in the first place.

    Amazing how such an incredibly bad idea is made palatable and sold.

  2. michiel says:

    “doing it right in the first place”… Have you read the article? Medical symptoms come in the range of 300,000 items, each required to be a column in a table describing all symptoms. So to track possible symptoms of a patient, we’d have a table structure like
    PATIENT(PatientID Primary Key, S1, S2, S3…….)
    Most modern databases would collapse on simple queries or inserts.

  3. rodgersnotes says:

    Uh, yes I read the article.

    Actually, I wrote both the article, and the response.

  4. michiel was referring to whether you have read wikipedia’s article, not yours. EAV and other metadata-based techniques can indeed be overkill and become an anti-pattern, but in some cases they provide solid trade-offs: flexibility, backwards compatibility, etc.

    • rodgersnotes says:

      Re: clinical example on wiki.
      An interesting problem. A few thoughts.

      One idea is that perhaps RDBMS is not the best tool
      for the clinical job. Objects perhaps? Or some of the NewSQL/NoSQL databases? I’ve never tackled 100K+ columns.

      But if EAV was so difficult for a few dozen columns,
      how much more difficult will it be for 100K?

      —-

      An RDBMS idea would be to not have columns at all.
      Certainly not 100K or more.

      Instead, have an intersecting table for the M:N relationship

      Patient
      Patient_Symptom
      Symptom

      In this way, you use the least amount of storage.
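The intersecting-table idea above can be sketched as follows (all names hypothetical); only the symptoms a patient actually has consume storage, and no 100K-column table is needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient         (patient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE symptom         (symptom_id INTEGER PRIMARY KEY, description TEXT);
-- The intersecting table resolves the M:N relationship:
CREATE TABLE patient_symptom (patient_id INTEGER REFERENCES patient,
                              symptom_id INTEGER REFERENCES symptom,
                              PRIMARY KEY (patient_id, symptom_id));

INSERT INTO patient VALUES (1, 'Doe');
INSERT INTO symptom VALUES (100, 'fever'), (101, 'cough');
INSERT INTO patient_symptom VALUES (1, 100), (1, 101);
""")

# A patient's symptoms come back as a simple two-table join:
rows = conn.execute("""
    SELECT s.description
      FROM patient_symptom ps
      JOIN symptom s ON s.symptom_id = ps.symptom_id
     WHERE ps.patient_id = 1
     ORDER BY s.symptom_id
""").fetchall()
print([r[0] for r in rows])  # ['fever', 'cough']
```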

      The rows might need to be pivoted from rows into columns.
      What could be problematic: after pivoting, what is the column name to search on?
      Temp tables perhaps?

      —-

      The 100K columns are 1:1 relationships.

      This could be a good case to split up the many columns into multiple 1:1 tables.

      So, symptoms specific to cancer, or columns that ruled out others, would go into their own tables, still preserving the 1:1 relationship. They would be only queried, inserted, updated, or joined when occasionally needed.

      I do recommend similar for BLOBs:
      https://rodgersnotes.wordpress.com/2010/09/19/combining-identifying-data-transaction-data-and-lobs-in-a-single-table/

      I would like to see comparisons of EAV versus different methods.

      As I’ve said in other posts, it’s very unusual for companies to test multiple designs and prototypes first. Usually, once they have “the design”, they start throwing more and more code on top to make the awful beginner prototype work.

      See my series on Database Design Mistakes To Avoid

      Also, see the technical limits as to the number of columns in a table.

      Discussion at DBMS2.COM, on various databases
      http://www.dbms2.com/2011/03/13/so-how-many-columns-can-a-single-table-have-anyway/

      Oracle 10g has a 1000 column maximum
      http://docs.oracle.com/cd/B19306_01/server.102/b14237/limits003.htm

  5. rodgersnotes says:

    That’s a really strange post.
    http://sqlblog.com/blogs/aaron_bertrand/archive/2009/11/19/what-is-so-bad-about-eav-anyway.aspx

    The guy actually recommends going to an EAV model because he does not want to add a column to a table. Huh?

    He should read my series here on DB Design Mistakes to Avoid.
    Starting with the first one:
    https://rodgersnotes.wordpress.com/2010/09/14/database-design-mistakes-to-avoid/

  6. R S says:

    A bit late to the game, but I still have a few points to comment on.

    [Awful Performance] and [Pivots]
    Yes, EAV is slower than a regular RDBMS design. But this is the planned and expected price for the flexibility. The difference in performance can be reduced by using some techniques.

    One of the approaches is to use updatable views, so that for the app developer, database access looks like working with regular tables. All the EAV logic is wrapped in the view definition and triggers. Why might this be needed? Mostly for cases where the DB structure is not known at development time, when the end user wants to create new columns (attributes) or even new tables (entities).
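As a runnable sketch of this updatable-view idea: SQLite supports INSTEAD OF triggers on views, so one logical INSERT can be fanned out into EAV rows behind the scenes (all names hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vals (entity_id INTEGER, attr_name TEXT, val TEXT);

-- The view presents the EAV rows as if they were an ordinary table...
CREATE VIEW person_v AS
    SELECT entity_id,
           MAX(CASE WHEN attr_name = 'first_name' THEN val END) AS first_name,
           MAX(CASE WHEN attr_name = 'last_name'  THEN val END) AS last_name
      FROM vals GROUP BY entity_id;

-- ...and the trigger fans one logical INSERT out into EAV rows.
CREATE TRIGGER person_v_ins INSTEAD OF INSERT ON person_v
BEGIN
    INSERT INTO vals VALUES (NEW.entity_id, 'first_name', NEW.first_name);
    INSERT INTO vals VALUES (NEW.entity_id, 'last_name',  NEW.last_name);
END;
""")

conn.execute("INSERT INTO person_v VALUES (1, 'Ann', 'Smith')")
row = conn.execute("SELECT * FROM person_v WHERE entity_id = 1").fetchone()
print(row)  # (1, 'Ann', 'Smith')
```

The wrapper hides the EAV plumbing from the application, but note that every SELECT through the view still pays the pivot cost underneath.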

    About performance improvements: values can be stored in a table with a clustered primary index on (entity id, attribute id), so that they are physically stored close together, on the same file page.

    [Multiple Inline Views]

    There are standard SQL techniques to avoid multiple joins. For example, use one subquery that retrieves all values and then extract specific attributes by aggregation. Not the prettiest SQL syntax, but still OK to get the work done.

    [Data Type Problems]
    For DB systems which support type affinity, it is not an issue. E.g. SQLite allows storing any type of data in any column.

    [Really Complex Inserts]
    It is not only inserts, but also updates and deletes. :) Again, with updatable views the problem goes away.

  7. R S says:

    And to mention some pros that EAV gives:
    1) easy to implement change tracking and other similar actions, based on the fact that you are dealing with a uniform set of data (row- and cell-based access, e.g.)
    2) easy to change your schema (adding or dropping columns, applying rules) – just need to re-create view, instead of altering table

    Yes, I know that you can do all this using standard RDBMS techniques. But there are cases when this type of data modelling does make sense. As mentioned before, think about medical database with 300K different attributes for the patient’s diagnosis. Another example – online store, with products grouped into multiple categories, with feature compare matrix.

  8. rodgersnotes says:

    Sigh.

    Ok, let me summarize your recommendations. See the posts I’ve linked for expanded explanations.

    Use triggers, and put logic in triggers, over standard PK/FK (very bad)
    https://rodgersnotes.wordpress.com/2010/09/18/referential-integrity-and-performance/

    Use complex MUCK/EAV, and then put more complexity on top of it (views) to try to make it work the way it should have been designed in the first place.
    https://rodgersnotes.wordpress.com/2010/09/20/using-the-wrong-object/
    https://rodgersnotes.wordpress.com/2010/09/21/binary-trees-as-the-database-design/

    “DB structure is not known at development time”
    Huh? So you would charge ahead and code, then design your table structures after the fact? Very, very bad!
    https://rodgersnotes.wordpress.com/2010/09/14/database-design-mistakes-to-avoid/

    Allow the “end user … to create new columns or tables”.
    Really? Not pass the table design off to the DB techs who should know better how to design it?

    Speculate that using “primary clustered indexes” will make everything ok.
    Sorry. No index will save you if you have designed something to go slow, even if all the data is on the same data blocks.
    https://rodgersnotes.wordpress.com/2010/09/14/oracle-performance-tuning/

    In fact, with a bad design, indexes can actually make performance slower.
    https://rodgersnotes.wordpress.com/2010/09/15/stamping-out-cartesian-products/

    Use EAV (with up to 300K fields), even though you know performance will always be slower than standard RDBMS design.

    “There are standard SQL techniques to avoid multiple joins.”
    I’ve pioneered some serious tuning using views and inline views. Sounds like you want to use lots of 3GL code. That is, for any query, use multiple hits to the database instead of one simple query. See my performance tuning presentation for some advanced techniques using views.

    You talk of using SQLite, so that a String can slip into a field that MUST always be an Integer, or Date.
    That contradicts one of the very reasons I say NOT to use EAV/MUCK. To repeat, it ensures that data quality will eventually suffer.

    Could SQLite also give you a table with 300K columns? No. Nor could any other relational database product on the market! Why do people still speculate about it?

    You think that by using updatable views, “problem goes away”. But they won’t help the performance issues at all. See my tuning presentation.

    “easy to implement change tracking”
    Sorry, I fail to see how using MUCK/EAV makes it any easier to record “before” and “after” states of the DB design, data, etc. Change tracking is a discipline in itself.

    “easy to change your schema (adding or dropping columns, applying rules) – just need to re-create view, instead of altering table”

    I’ve used lots of views with standard RDBMS design. Views or procedures on top of bad designs, while useful, can become very complex. The more inappropriate the design, the more complex the view or procedure. They are not always “easy” to change. See my tuning presentation.

    Use Case – add a column:
    In standard DB Design, using the Alter table command is pretty easy to do, and very useful. If a view, you just add the column to the view, easy.

    In EAV/MUCK, you have to create another inline view with a SELECT and WHERE clause, and join it to the original view, expanding it. But you missed the fact that with EAV, you also have to insert the row for the column.

    Now, take the same case, and assume you have to subsequently update every row with that new column with a standard default value, to say, “Active”.
    In standard RDBMS, you just use a simple update command to update every row, say 100K rows, and commit. Two simple statements.
    In EAV/MUCK, you would have to insert 100K rows. Because there is an overhead for every row, EAV/MUCK will take up more disk space.
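Here is that use case as a sketch (hypothetical names, three rows standing in for 100K): the standard design needs two statements regardless of row count, while EAV/MUCK needs one new row per existing entity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);
CREATE TABLE vals   (entity_id INTEGER, attr_name TEXT, val TEXT);
INSERT INTO person VALUES (1), (2), (3);
""")

# Standard design: two statements, no matter how many rows exist.
conn.execute("ALTER TABLE person ADD COLUMN status TEXT")
conn.execute("UPDATE person SET status = 'Active'")

# EAV/MUCK: one brand-new row must be inserted per existing entity.
ids = [pid for (pid,) in conn.execute("SELECT person_id FROM person").fetchall()]
conn.executemany("INSERT INTO vals VALUES (?, 'status', 'Active')",
                 [(pid,) for pid in ids])

eav_rows = conn.execute("SELECT COUNT(*) FROM vals").fetchone()[0]
print(eav_rows)  # 3
```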

    If there are 5 different values, based on different decision rules, it gets really difficult to update with MUCK/EAV.

    One more thought, MUCK/EAV is similar to another bad design cult, Cobol Systems Inside the Database
    https://rodgersnotes.wordpress.com/2010/09/29/creating-cobol-systems-inside-the-database/

    —-

    Dude, if you said statements like these in an interview with experienced DB developers, and DBAs, you would very soon fail the interview.

    You should read the links I have listed, and the rest of my series on DB Design Mistakes To Avoid.

    Also read Fabian Pascal
    http://www.dbdebunk.com/

    and books by Tom Kyte
    https://asktom.oracle.com

    Hope this helps.

  9. billkarwin says:

    Great blog post! You left a comment on my related blog warning of the ills of EAV (http://karwin.blogspot.com/2009/05/eav-fail.html) so I’ll offer one here in turn.

    Here’s an analogy to object-oriented languages: sure, we *could* use Reflection APIs for every class instantiation or every method call, but 99% of the time that’s needlessly complex, error-prone, and defeats compile-time checking. So it would be totally bone-headed to try that.

    All the suggestions of managing MUCK or EAV using views, triggers, or application code are just dancing around the problem that when you store your own metadata, you end up having to write more code from scratch to manage all that metadata. Eventually, you’ve implemented a “schemaless” NoSQL solution on top of an RDBMS.

    Wikipedia defines the “Inner Platform Effect” (https://en.wikipedia.org/wiki/Inner-platform_effect) as:
    “a system so customizable as to become a replica, and often a poor replica, of the software development platform they are using.” It uses EAV as an example of this.

    Nevertheless, we sometimes are given a requirement to be arbitrarily flexible.

    The least bad solution for this requirement is to extend the RDBMS with a semi-structured datatype, like a JSON document. PostgreSQL has it, and MySQL 5.7 will soon have it. Ideally, one can use function indexes to help make it efficient to search for specific fields within the semi-structured data.

    If your RDBMS supports a data type for semi-structured data, then you can happily put all your other data in traditional schemas. Best of both worlds!
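A small sketch of that "best of both worlds" pattern, using SQLite's JSON functions (included in most modern builds) as a stand-in for the PostgreSQL/MySQL JSON types; all names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY,
                      name       TEXT,    -- fixed, well-known columns...
                      features   TEXT);   -- ...plus one JSON column for the rest
INSERT INTO product VALUES (1, 'laptop', '{"ram_gb": 16, "color": "grey"}');
""")

# Known attributes stay as real typed columns; the open-ended
# ones live inside the JSON document.
row = conn.execute(
    "SELECT name, json_extract(features, '$.ram_gb') "
    "FROM product WHERE product_id = 1"
).fetchone()
print(row)  # ('laptop', 16)
```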
