viernes, 31 de agosto de 2012

What is MySQL denormalization and should you be doing it?

Fuente: linux.org

Introduction

Originally, I was going to write an tutorial over MySQL Performance Tuning Tips & Tricks. I quickly realized that this tutorial would be a bullet list of Tips & Tricks, and thus would be too short and require users to go searching elsewhere for terms they did not understand, or would lose users because it would become far too long. Instead I decided to hash out a series of tutorials to publish individually. While none of these tutorials will build off of each other, some terms you learn in one may be referenced in others.

In this tutorial we will take a look at what Database Denormalization is and how it is different from Normalization. Many DBAs probably use the Normalization rules in designing databases, either because they have never heard of Denormalization or they do not fully understand it.

What is Normalization?

We will start off by looking at Database Normalization. The primary reason for this is because many people were taugh that they should always "Normalize" their databases. Normalized data means that fields and tables in a relational database have been organized to reduce redundancy and dependency by dividing up larger tables and linking them through the uses of Indexes. While this makes for a cleaner overall layout, and seems to make sense, it is not always the best course of action.

Why do people Normalize?

In my opinion, the answer to this question is because that is how they were taught. For some reason the most widely accepted method of teaching is the Normalized method described above. It seems to make sense to populate, modify and delete data in one table and have it propagate through the entire database through the use of defined relationships.

One example of where this would make sense would be a table called students. The denormalized layout for this table would have the following columns: student_id, student_name, student_email, student_address, course_number. In this example, a student could easily be enrolled in 10-15 different courses. If that students address changed for any reason, that would mean that a query would have to perform 10-15 updates on the address column for that student.

The normalized approach for the same example would have a students table (student_id, student_name, student_email, student_address) and a enrolled_courses table (enrolled_course_id, student_id, course_name, course_number). We might even take this one step further and move course_name and course_number to a table called courses (course_id, course_name, course_number). That would mean our new enrolled_courses table would look like: enrolled_course_id, course_id, student_id.

In the denormalized approach lets assume that it is a new semester and student A did not re-enroll for that semester because he needed planned to take a semester break. Since the semester has ended we would empty all the rows from the database since students will no longer be enrolled in those courses. Since student A is not going to re-enroll his data vanishes along with the course data. When he does decide to enroll in new course all his information must be re-entered into the system.

What is Denormalization?

Now lets take a look at Database Denormalization. This is the process by with DBAs attempt to optimize the performance of the database by adding in redundant, or grouped, data. The advantages to this is that the data is stored in one table, as opposed to multiple smaller tables, which cuts back on the need for JOINS in your queries, resulting in faster SELECTS. It is also good to note that a single table allows for a more efficient index usage. With properly indexed columns, results can be filtered and sorted by utilizing the same index.

What are the downsides?

The downsides really aren't big, and for the most part can be easily solved. One thing you must consider is that in a denormalized database, data can be duplicated across many tables. While this will result in better indexing and faster SELECTS as explained above, it does make INSERTS and UPDATES a bit more complicated. This is where it comes down to weighing the pros and cons. It all comes down to a ratio of reads VS. writes. If you have high reads and low writes, then denormalized is the sure way to go. If you are about even, or if you have higher writes than reads, you can still go with the denormalized approach but it may require a bit more thinking and planning.

Which should I use?

By now you're probably wondering which one to use. We've looked at just a few pros and cons to both approaches, and really it comes down to an almost 50/50 scale. The solution to this is simple: The correct way is to utilize both the normalized and denormalized approaches depending on the situations. Most of you are probably disappointed by the lack of a climax ending, and you were probably thinking that I was about to advocate a denormalized database over a normalized one, but that simply isn't the point of this tutorial. The point was to introduce you to denormalization, and to get you thinking and approaching database design in a new way. Many people spend all their time tweaking configuration settings and trying to optimize queries (mainly JOIN statements, which at their heart is really a complex nested-loop), when in fact they could get a huge bump in performance by taking a moment to ask themselves, "Should this be normalized or denormalized?"

As a side note, I will say that knowing your database engines can be a big help in boosting performance too. Knowing when to use MyISAM or InnoDB, and knowing when you should use something like ARCHIVE which only allows INSERTS and READS. At the end of the day the best performance boost you can give yourself is properly planning out your database layout upfront. See my tutorial An Introduction To MySQL Storage Engines.

Ronald Steelman