Jump to content
Larry Ullman's Book Forums

Database Optimization And Normalization


Recommended Posts

You provide several tips for database optimization (pag 193):

1. chose the best storage engine. If I understand correctly, we basically have to chose between MyISAM and InnoDB. If it is necessary or better to use transactions we have no choice than InnoDB. If it is necessary to use FULLTEXT indexing than we have no choice than MyISAM. Other factors, as you presented, are speed - but in some other thread you said that speed depends of other factors as well, for example the OS. So I guess FULLTEXT indexing and transactions are the dealbrakers, is this correct?

 

2. Use the smallest datatype possible for each column. This one I do not find it explained in your book. What do you mean? Something along the idea that bit is faster than integer and integer faster than varchar?

 

Let's say we have a table where one of the columns is "language". Let's say we have two values: "English" and "Spanish". Is it correct to assume (from 2 above) that it is better to have this column defined in the database as integer, where 0 will be default and will stay for English and 1 will stay for Spanish and new numbers will stay for new languages. If we have to display the name of the language we will use a conditional or a switch (later on when we have more languages). Is this way of thinking correct? When do you consider this approach innapropriate, for example when the number of languages will be too big so the PHP code will be way longer than MySQL or there are other factors involved?

 

Let's say we would like to create a small application that allows users to submit their preferences about a subject, like what programming language do they love and we want to allow them to select more than one languages. Let's assume we have to index these columns. Considering (1) above I may think as follow: because fulltext index not work with InnoDB i have to go with MyISAM, but MyISAM do not let me use transactions. Indexing is the most important thing about this table, considering that I cannot use transactions I may be willing to make a trade-off and instead to allow users to select any number of programming languages, I will chose a number that may sound reasonable to me, let's say 3, and I will create 3 columns in my table "language1", "language2" and "language3" (instead of a new table to hold languages linked to the first one).

 

I encountered this exemple in all "jobs" sites I visit. I assume because they need FULLTEXT indexing for that particular table they made the trade-off I was talking about. Is this a good example of what you call "overrulling normalization" (pag 167)?.

Link to comment
Share on other sites

1. chose the best storage engine. If I understand correctly, we basically have to chose between MyISAM and InnoDB. If it is necessary or better to use transactions we have no choice than InnoDB. If it is necessary to use FULLTEXT indexing than we have no choice than MyISAM. Other factors, as you presented, are speed - but in some other thread you said that speed depends of other factors as well, for example the OS. So I guess FULLTEXT indexing and transactions are the dealbrakers, is this correct?

 

Generally speaking, yes.

 

2. Use the smallest datatype possible for each column. This one I do not find it explained in your book. What do you mean? Something along the idea that bit is faster than integer and integer faster than varchar?

 

There's really not much to explain here. I'm not suggesting you change types, like from an integer to a varchar or vice versa. You always want to choose the proper general type for the data. But choose a varchar instead of text, as appropriate, or SMALLINT instead of MEDIUMINT.

 

Let's say we have a table where one of the columns is "language". Let's say we have two values: "English" and "Spanish". Is it correct to assume (from 2 above) that it is better to have this column defined in the database as integer, where 0 will be default and will stay for English and 1 will stay for Spanish and new numbers will stay for new languages. If we have to display the name of the language we will use a conditional or a switch (later on when we have more languages). Is this way of thinking correct? When do you consider this approach innapropriate, for example when the number of languages will be too big so the PHP code will be way longer than MySQL or there are other factors involved?

 

I personally only use numbers to represent non-numerical values in very obvious situations: TRUE/FALSE, on/off, yes/no. If you use numbers to represent words or options, then you've separated the meaning into two places (MySQL and PHP), which is bad on a number of levels. One alternative would be to use ENUM, which MySQL internally treats as numbers but allows you to use words. Or you could use a separate table and relate it in. For something like gender, with a small number of unchanging options, ENUM would make sense. For something like language, with a lot of options that do change, you'd use a separate table.

 

Let's say we would like to create a small application that allows users to submit their preferences about a subject, like what programming language do they love and we want to allow them to select more than one languages. Let's assume we have to index these columns. Considering (1) above I may think as follow: because fulltext index not work with InnoDB i have to go with MyISAM, but MyISAM do not let me use transactions. Indexing is the most important thing about this table, considering that I cannot use transactions I may be willing to make a trade-off and instead to allow users to select any number of programming languages, I will chose a number that may sound reasonable to me, let's say 3, and I will create 3 columns in my table "language1", "language2" and "language3" (instead of a new table to hold languages linked to the first one).

 

I encountered this exemple in all "jobs" sites I visit. I assume because they need FULLTEXT indexing for that particular table they made the trade-off I was talking about. Is this a good example of what you call "overrulling normalization" (pag 167)?.

 

No, actually that's a terrible break in normalization rules. I would create a separate table of languages and another of users_languages. That way users can select any number of languages, including none, without having all sorts of NULL values and other potential problems.

  • Upvote 1
Link to comment
Share on other sites

Having so many related tables to a main table will not decrease performance? If I have things columns like "countries", "states", "gender", "language" and several others in separate tables related to a main table, I will have to use joins and will be quite a few to display the information with meaning (1 is English, 2 is Spanish...etc., I want displayed the label from the related table, not the numeric value from the main table). Having 5-6 joins of this kind isn't too much? Is there a point where it will be better if instead of having this arrangement with related tables and joins to actually write the information in the main table as varchar (i.e. "English" or "Spanish" instead of '1' and '2')?

 

If i have 3 options it sounds good ENUM, but how about if I have 5 or 10 options (for example a column keeping the level of education: no education, primary school, high school, professional school, college not finished, college finished, masterate, doctorate). These will not change, but there are quite a few. Will still make sense to use ENUM? Is the number that counts or there are other factors?

 

What would be the rule of thumb here? Thank you.

Link to comment
Share on other sites

Yes, having many related tables WILL decrease performance when you're doing SELECTs with JOINs across them. But performance is only one criteria, and, really, performance is one of the easiest things to fix. it's not very easy to fix a poor database design and you can't fix corrupted data. And, in this case, you have to consider the logic that you're putting into the PHP scripts. That will decrease performance, too. And factor in how much time you, the developer, will have to spend should a change come up.

 

I don't think it's the number of options that dictates whether you use ENUM or not, but the likelihood of the options changing. And, of course, how often that list would be reused. For example, you might think an employee would only have one title, but commonly in companies employees get multiple titles. Even your level of education example isn't good for ENUM. What if you later decided to distinguish between doctorate and ABD (all but dissertation)? How much work would be required to change the entire system to reflect the one new option?

 

My rule of thumb is that if it's a binary choice (true/false, yes/no), use a TINYINT. I'll also use a CHAR(1) for something like M/F. If it's a small number of choices that won't ever change or repeat, use ENUM. Generally, though, use a related table. I expect I probably use a related table maybe 65% of the time, a small column type 30% of the time, and ENUM 5%.

Link to comment
Share on other sites

You suggest CHAR(1) for gender as M/F.

 

(1)

If I want to display the information as "Male" / "Female" do I still have to make a linked table to maintain "the logic in MySQL" as you suggested? Or it is ok to make the adjustments in PHP and print "Male" instead of "M" (I guess this means separating the meaning between PHP and MySQL). I take your advise in his full meaning as not making a related table and not separating the meaning between PHP MySQL, so I guess my options will be: (a) I have to print "M" where the application should print "gender" or (b ) create a related table if I want to print "Male". Any of those is better than (c ) printing "Male" instead of "M" with the help of PHP. Is this the correct reading?

 

(2)

In a related table should (or it is recommended) the foreign key to be the primary key? I can anticipate the disadvantages as if you delete a record and add a new one the primary key will change/increase so the queries will have to be modified to reflect the change. I also guess that using the primary key of the related table as foreign key of the main table has its advantages as probably the database treat preferential these keys and it is probably faster than having another column as primary key (that probably should be indexed). What do you advise here?

 

products

-----------

id (PK)

product

manufacturer

 

manufacturer

----------------

id (PK) (FK)

manufacturer

 

instead of:

 

manufacturer

----------------

id (PK)

manufacturer

productid (FK)

 

(3)

Does it make sense here is to use in the lookout table the "productid" as fk and give up to the "id" as primary key? Does all tables need to have a primary key?

 

(4)

Related to (1) above, let say we have a table "employees" where users can chose either "m" or "f" (per "male" and "female"). Now let say we have a "recrutingAgent" table where the recruting agent have in fact three options: "male", "female" and "both". If the recruting agent search on the "emplyees" table than the solution is simple. But how about the recruting agen has to post a job, so now you have to put in the "recrutingAgent" table the selection and you have three. What will be the solution here? Char(1) where we put "m/f" and we add some other letter like "b" for "both"?

 

(5)

Regarding the (1) above there are other columns that may be defined as char and they are quite frequent, like "countries" and "states". Both can be abbreviated with 2 characters, countries have a standard with 3 characters as well. What is the best approach here? A lookout table where the foreign key is the abbreviation from the main table so we can print the full country name or state name? Or we chose not to print the full name of the country and give up to the lookout table?

 

If a main table contain several such lookout tables we will have to do several queries with the sole purpose, like in the example with states and countries, to print a dropdown with options where the label will be the country names and state names and the values will be country and state values respectively? Isn't this a waste? I have a main table with 6 such lookout tables, it is still true that the meaning will have to remain in MySQL with the result of making 6 queries for the sake of displaying the drop-downs?

 

What is the best practice?

 

main table (anything like a registration form....)

------------

...

CountryAbbr

StateAbbr

...

 

lookout table1

---------------

CountryAbbr

CountryName

 

lookout table2

---------------

StateAbbr

StateName

 

Thank you for your time. I'm delighted I can learn from your books and get answer to my questions in this forum. I regret that I'm not in college anymore to take your classes.

Link to comment
Share on other sites

  • 2 weeks later...
You suggest CHAR(1) for gender as M/F.

 

(1)

If I want to display the information as "Male" / "Female" do I still have to make a linked table to maintain "the logic in MySQL" as you suggested? Or it is ok to make the adjustments in PHP and print "Male" instead of "M" (I guess this means separating the meaning between PHP and MySQL). I take your advise in his full meaning as not making a related table and not separating the meaning between PHP MySQL, so I guess my options will be: (a) I have to print "M" where the application should print "gender" or (b ) create a related table if I want to print "Male". Any of those is better than (c ) printing "Male" instead of "M" with the help of PHP. Is this the correct reading?

 

No, you wouldn't make a linked table. If you did a linked table, you'd just link the primary key in as a foreign key, not use a CHAR. True, full normalization would suggest a second, linked table is the right solution, but that's an overnormalization in my opinion. Using a CHAR(1) column and having PHP expand the logic is a reasonable solution. If you'd rather not put that much logic into the PHP code (switching 'M' to 'Male'), you could use an ENUM column type instead, storing the full Male and Female.

 

(2)In a related table should (or it is recommended) the foreign key to be the primary key? I can anticipate the disadvantages as if you delete a record and add a new one the primary key will change/increase so the queries will have to be modified to reflect the change. I also guess that using the primary key of the related table as foreign key of the main table has its advantages as probably the database treat preferential these keys and it is probably faster than having another column as primary key (that probably should be indexed). What do you advise here?

 

Not sure if I understand your question, but the FK in table A should always be to the PK in table B. Always.

 

products

-----------

id (PK)

product

manufacturer

 

manufacturer

----------------

id (PK) (FK)

manufacturer

 

instead of:

 

manufacturer

----------------

id (PK)

manufacturer

productid (FK)

 

Neither of those designs makes sense to me. I'd do...

 

products: product_id (PK), manufacturer_id (FK), product

manufacturers: manufacturer_id (PK), manufacturer

 

(3)Does it make sense here is to use in the lookout table the "productid" as fk and give up to the "id" as primary key? Does all tables need to have a primary key?

 

I don't understand the first question, but, yes, all tables should have a primary key.

 

(4)Related to (1) above, let say we have a table "employees" where users can chose either "m" or "f" (per "male" and "female"). Now let say we have a "recrutingAgent" table where the recruting agent have in fact three options: "male", "female" and "both". If the recruting agent search on the "emplyees" table than the solution is simple. But how about the recruting agen has to post a job, so now you have to put in the "recrutingAgent" table the selection and you have three. What will be the solution here? Char(1) where we put "m/f" and we add some other letter like "b" for "both"?

 

Putting aside that it's illegal to hire based upon gender... For the recruiting agent or jobs or whatever, I would use CHAR(1): M, F, B. In a way, the gender of an employee and the needed gender for a job are two different criteria. An employee has to be either only M or F. The criteria for jobs does not have that restriction.

 

(5)Regarding the (1) above there are other columns that may be defined as char and they are quite frequent, like "countries" and "states". Both can be abbreviated with 2 characters, countries have a standard with 3 characters as well. What is the best approach here? A lookout table where the foreign key is the abbreviation from the main table so we can print the full country name or state name? Or we chose not to print the full name of the country and give up to the lookout table?

 

This is a judgement call. Proper, full normalization would insist on separate tables, but I don't normally do that. One thing a separate table gives you is a standardized data set. The danger of not using a separate table of options is it allows for inconsistent data. For example, say one person types the state as "Iowa" and another "iowa" and another "iwoa". That's three different representations of the same state, which will likely be a problem. A solution is to create a separate table where the values come from (instead of letting the user type them). Another solution is to use PHP or HTML to restrict the options that a user can select.

 

If a main table contain several such lookout tables we will have to do several queries with the sole purpose, like in the example with states and countries, to print a dropdown with options where the label will be the country names and state names and the values will be country and state values respectively? Isn't this a waste? I have a main table with 6 such lookout tables, it is still true that the meaning will have to remain in MySQL with the result of making 6 queries for the sake of displaying the drop-downs?

 

Yes, you will have to do several queries and I do feel that's a waste. But normalization and separation of presentation from content would suggest you go that route. I'm a pragmatist, though, and don't believe you always have to do things to the nth degree.

 

What is the best practice?

 

The main thing you should understand is that everything's a trade off. With these kinds of questions, you're trading data integrity for performance. But there are ways of improving both, too. So you have to find the right middle ground for your situation.

 

Thank you for your time. I'm delighted I can learn from your books and get answer to my questions in this forum. I regret that I'm not in college anymore to take your classes.

 

You're quite welcome. Glad you're finding my work and these forums to be useful.

  • Upvote 1
Link to comment
Share on other sites

 Share

×
×
  • Create New...