Author Topic: Data Normalization and Stackoverflow  (Read 298 times)

Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Data Normalization and Stackoverflow
« on: November 12, 2011, 06:33:14 pm »
I could say a lot about good database design.  But I've been posting too much in here lately, so I'm going to limit myself, a little.

A lot of people with high reps on SO charge in to database design questions, and lead off with "first thing, you have to normalize your data."  Then they proceed to give specific advice that tells me that the "expert" has a hazy idea of what data normalization really is, at best.

The normal forms are very specific, and come with very specific anomalies that can be obviated by adhering to the form.  A full understanding of at least the first few normal forms can be indispensable to carrying out a database design.

But good designers very often intentionally disregard the rules of normalization in pursuit of other benefits that they can obtain with a different design principle.  Good design usually involves conscious management of trade-offs rather than blind adherence to "best practices".  This subtlety seems to elude a lot of people.

Richardk

  • Global Moderator
  • Wise Sage
  • *****
  • Posts: 3820
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #1 on: November 12, 2011, 08:02:28 pm »
But good designers very often intentionally disregard the rules of normalization in pursuit of other benefits that they can obtain with a different design principle.  Good design usually involves conscious management of trade-offs rather than blind adherence to "best practices".  This subtlety seems to elude a lot of people.

You can't manage what you don't understand but you sure can try to fake it.

Origisaurus

  • Wise Sage
  • Wise Sage
  • *****
  • Posts: 1678
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #2 on: November 12, 2011, 08:27:00 pm »
Yes, to optimize a specific application it may seem OK to deviate from the canonical normal forms.  As long as it is understood that this may suboptimize or even prohibit subsequent extensions to the application.  And try to get some PHB to sign his/her name a priori to that!

Oh, BTW, one does not normalize the data, rather one normalizes the data model.  Getting the data from where we find it to a normalized model is one of the great challenges of database administration.
Avatar is from the cover of the November 2007 National Geographic.  Fair use is assumed.

Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #3 on: November 13, 2011, 06:36:36 am »
It depends on what you mean by "the data model".  I think of the data model at three different levels: conceptual, logical, and physical.  One does not normalize the conceptual model.  Normalizing the conceptual model is one of the misconceptions held by the majority of the industry.

"Normalization" as carried out in the classical era of database design was really a bottom up approach.  One took the files design of a non database implementation, and corrected the deviations from normal forms.  There is a top down approach that leads to  data that is nearly always as normalized as it needs to be, without having to think about it.

"Normalization" in the narrow sense, only applies to data in the relational model.  If the organizing principle is the hierarchical, network, object, or multidimensional model,  normalization does not really apply.  Having said that, one can apply normalization concepts to data in other models.  For example, if you take a multidimensional model and transpose it to the world of SQL tables, you end up with a star schema or similar.  and it's sometimes useful to ask how normalized or denormalized a star schema is.
« Last Edit: November 13, 2011, 08:45:01 am by Walter Mitty »

Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #4 on: November 13, 2011, 06:42:48 am »
I agree with the substance of your reply.  In particular, people who deviate from normalization without understanding the consequences of their actions are often painting themselves into a corner.  Or more likely, painting their successor into a corner.

But good designers sometimes deivate from normalization fully cognizant of the downsides, and willing to live with them.

I also agree with you that "future proofing"  a database design is one of the most challenging aspect of this business.  However, that has to be understood in the light of Gorn's comment in another thread, where the ease of using DB technology in single applications has given rise to the embedded database.  If a database is embedded inside an application that is not to be future proofed,  then future proofing the database is a pointless exercise.


Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #5 on: November 13, 2011, 06:51:05 am »

Oh, BTW, one does not normalize the data, rather one normalizes the data model. 
That's a quibble.
Quote
Getting the data from where we find it to a normalized model is one of the great challenges of database administration.

I'm going to quibble here.  It's one of the great challenges of database design.   Contrary to popular belief, database design and database administration are not the same job function. 

Some of the best DBAs I've ever seen are disparaged by the people around them as mere "database babysitters".  They aren't given to "thinking outside of the box".  They are obsessive about weeding out imperfection to the point where their colleagues all think they have Asperger's.  And nobody likes them.  Except the people who have used a database managed by a more "likeable" DBA.

And they don't construct new data models.

Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #6 on: November 13, 2011, 07:02:18 am »
You can't manage what you don't understand but you sure can try to fake it.

An awful lot of the fakers get by with "doing what the smart people do".  But the really smart people don't "do what the other smart people do".  They do what their intelligence tells them to do.  Sure. smart people can learn from other smart people.  But they don't learn a set of rules for "best practices".

And here's the strange part:  smart people probably learn more from stupid people than stupid people learn from smart people.
 
« Last Edit: November 13, 2011, 07:55:59 am by Walter Mitty »

The Gorn

  • Your agonizer, please. And be sure to keep the batteries charged!
  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 14182
  • Gornix user
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #7 on: November 13, 2011, 11:27:15 am »
I agree with the substance of your reply.  In particular, people who deviate from normalization without understanding the consequences of their actions are often painting themselves into a corner.  Or more likely, painting their successor into a corner.

But good designers sometimes deivate from normalization fully cognizant of the downsides, and willing to live with them.

It once was pointed out to me a specific case in which this happened, and the guy who did it explained exactly what issue he was addressing in doing so.

The programmer-cum-DBA in question here was a self educated redneck with absolute disdain for all acedemic terminology and tools that were brought out after about 1982. He didn't even do SQL - he used an ISAM record management library for all of his stuff.

He built line of business applications for a few local manufacturers and distribution companies.

We were talking about data normalization. He walked through his table design for customers, invoicing and billing. He showed me one table where he captured every detail of the "header" for an invoice, which included the name, address, and contact information fields for the destination customer, as literal data, not as foreign keys.

Note that he already had this information in the customer table(s). I challenged him on this sloppy approach of smearing the same data across two or more tables.

He "corrected" me as follows. The table of generated invoices reflected every customer's then-current location. If he embedded a foreign key to take the place of the expanded customer information in every invoice, the customer would have the problem in the future of not being able to recall or reconstruct specific invoices, in the event that the customer moved or changed their information. Also it was a legal thing - this table was proof that an invoice was generated, to that customer at that address and with that contact info. As well as being a tax law thing. Again, the table of invoices could be treated like an archival view of the data.

I think he also expanded the list of invoice items and snapshotted them at the time of invoice generation instead of relying on foreign keys for inventory items, too.

I suppose that if you wanted to elaborate such a design to take this into account, you could have versioning of any fields that were subject to change. But that is a much higher order of complexity than just snapshotting the data in a non normal way. I suggested this to him and he scorned the added complexity. His design had been proven by deployments in half a dozen companies and every customer had been absolutely satisfied with his approach.

I was pretty impressed at the time with how such an elegant academic and applied concept - normalization - falls apart in certain real world cases. At least unless you wish to create an enormous additional layer of complexity to cope with the "non normal in practice" data to make it normal, too.

I also agree with you that "future proofing"  a database design is one of the most challenging aspect of this business.  However, that has to be understood in the light of Gorn's comment in another thread, where the ease of using DB technology in single applications has given rise to the embedded database.  If a database is embedded inside an application that is not to be future proofed,  then future proofing the database is a pointless exercise.

This is a good statement of facts. Every application is not the rock of ages and its data doesn't have to integrate.
Gornix is protected by the GPL. *

* Gorn Public License. Duplication by inferior sentient species prohibited.


John Masterson

  • Administrator
  • Wise Sage
  • *****
  • Posts: 7981
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #8 on: November 13, 2011, 12:18:11 pm »
I have heard it said: normalize 'til it hurts, then denormalize 'til it works.

Richardk

  • Global Moderator
  • Wise Sage
  • *****
  • Posts: 3820
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #9 on: November 13, 2011, 01:51:24 pm »
We were talking about data normalization. He walked through his table design for customers, invoicing and billing. He showed me one table where he captured every detail of the "header" for an invoice, which included the name, address, and contact information fields for the destination customer, as literal data, not as foreign keys.

Note that he already had this information in the customer table(s). I challenged him on this sloppy approach of smearing the same data across two or more tables.
His approach makes a lot of sense once you understand the problem he's trying to solve. This is similar to data warehousing, where you often denormalize the data as well.


I suppose that if you wanted to elaborate such a design to take this into account, you could have versioning of any fields that were subject to change. But that is a much higher order of complexity than just snapshotting the data in a non normal way. I suggested this to him and he scorned the added complexity. His design had been proven by deployments in half a dozen companies and every customer had been absolutely satisfied with his approach.
I've done this as well and versioning is quite different. It's more useful in a 'living document' that keeps changing. The invoices mentioned above are more 'set in stone'.

I recall one client that wanted versioning and it was a royal pain since there was no database support for this and everything had to be coded in the application. Not such a great idea if you have to make it happen.

The Gorn

  • Your agonizer, please. And be sure to keep the batteries charged!
  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 14182
  • Gornix user
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #10 on: November 13, 2011, 01:54:25 pm »
Essentially, this guy was writing an almost - ready - to - print version of every invoice to that set of tables. I suspect that actually printing an old invoice would consist of locating the invoice record, finding the associated line items on the invoice, and printing everything without reference to external tables like product inventory.

I got the impression that in spirit it was more like a PDF of a invoice than it was "data".
Gornix is protected by the GPL. *

* Gorn Public License. Duplication by inferior sentient species prohibited.


Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #11 on: November 13, 2011, 02:49:33 pm »
I was pretty impressed at the time with how such an elegant academic and applied concept - normalization - falls apart in certain real world cases. At least unless you wish to create an enormous additional layer of complexity to cope with the "non normal in practice" data to make it normal, too.


"Temporal" databases and normalization are not nearly as much at odds as you make it sound.   http://en.wikipedia.org/wiki/Temporal_database

Although systematic solutions such as the ones described in the article didn't come along until the 1990s,  the problem as such was recognized with the very first deployment of relational DBMSes.  And solutions such as the one you outlined were well understood.

The sample database distributed with DEC Rdb/VMS and used with all the courses had two tables, called Job_History and Salary_History.  Each table had a Start_Date and an End_Date.  The row with a missing End_Date would be the current row for the referenced employee.  When an employee changed salary,  the action would be to include the End_Date in the old row,  and then insert a new row.  Everybody who learned databases learned this technique, at least in rudimentary form.

« Last Edit: November 13, 2011, 03:04:23 pm by Walter Mitty »

Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #12 on: November 13, 2011, 02:56:05 pm »
I have heard it said: normalize 'til it hurts, then denormalize 'til it works.

I'm sorry, but I think this is one of those facile slogans that oversimplifies matters.

In particular the term "denormalize" is a very handwaving description of the design pattern.  It doesn't tell you what to do.  It's like telling a person to get to Los Angeles by driving away from Boston. 

I much prefer adopting an alternative design principle such as star schema for example,  when the situation calls for it,  instead of normalization followed by denormalization.  Then just design according to that principle.  It can still be useful to  know the deviations from normalization, because these will help you pinpoint the issues that are going to arise when you start to do ETL. 

In practice,  I have never had to push a data model beyond Boyce-Codd Normal Form.

Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #13 on: November 13, 2011, 03:43:06 pm »
My own history with databases is not nearly as "academic" as some of you may have concluded.  I had learned databases from a friend and mentor in the instructor group where I had once worked.  And I took a one week intensive course, and read a couple of magazine articles.  And, especially, I got to look at a major database that had been built by somebody who knew what he was doing, and could explain his design in terms of the subject matter.

But I owned no books on Database theory until I had been teaching DB courses for about two years, and a student showed me his book.  It was good enough for me to buy a copy.  And I never got around to reading Codd's original 1970 paper until I got my ears burned off by a flame warrior in some usenet newsgroup. 

A few years back, when Gorn asked me for a good reference book, I took the unusual step of recommending a book I had never read.  How did that work out for you, Gorn?

I've never read a book by CJ Date.  They are too long.  But I absolutely love his slogan:  "Theory is practical".

Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Data Normalization and Stackoverflow
« Reply #14 on: November 13, 2011, 03:54:21 pm »
He didn't even do SQL - he used an ISAM record management library for all of his stuff.

By an odd coincidence, I started "doing things relationally"  using a VMS feature called RMS indexed files myself.  In addition to the files,  there was tool called "Datatrieve" that was capable of doing relational joins between files,  although Datatrieve called it a "CROSS" instead of a join.  Datatrieve was my real introduction to "relational think".  I did just enough in Pascal to put the data into a form that Datatrieve could work with,  and then the rest of the work in Datatrieve. 

When I first started working with Datatrieve, I was bad at it.  As bad as I am right now at Java.  It took me a long time to stop trying to make it be like the languages I already knew,  and start exploiting the tool in the way it was meant to be used.  It had a great report writer built in.  I don't think I've ever seen a better one.


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf