Larry Ullman's Book Forums

Suggest For Book: Large Data Imports



Hi Larry,


In your database chapter you have tips on specific issues one might encounter. Here's a tip on a topic that, for me, was non-obvious.


I need to import large chunks of data, meaning 10,000 to 100,000 rows of CSV data at a time. For large data imports into MySQL InnoDB, I learned:


1. Active Record is prohibitively slow. So is DAO, but not for the reason you'd think.


2. My test platform's performance (a Mac laptop) bears absolutely no relation to the production database server's performance.


Secret number 1 is that, with the MySQL InnoDB storage engine, you get massively better performance doing the entire import (hundreds of thousands of database queries) in a single transaction. InnoDB does extra work at the end of every transaction (notably flushing its log to disk to guarantee durability), so if you have a LOT of transactions, throughput in transactions per second is extremely low. This behavior is distinctly different from MyISAM.
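At the SQL level, the difference is just where the commits happen. A sketch (table and column names are invented):

```sql
-- With autocommit on (the default), every INSERT is its own transaction
-- and pays the per-commit log flush. Wrapping the whole import avoids that:
SET autocommit = 0;
START TRANSACTION;
INSERT INTO my_table (col_a) VALUES ('row 1');
INSERT INTO my_table (col_a) VALUES ('row 2');
-- ...repeated for every imported row...
COMMIT;  -- one flush for the whole batch instead of one per row
```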


This means giving up part of the point of transactions: this scheme rolls back the entire import rather than a single row. But one transaction per row becomes prohibitively slow if you're doing a LOT of inserts at once.


Secret number 2 is that "innodb_buffer_pool_size" needs to be increased from the default value, by a LOT. My own /etc/my.cnf (dev platform) reads:


max_allowed_packet = 12000000
innodb_buffer_pool_size = 8G

One symptom of the issue is when you truncate a table with several million rows and foreign key constraints. The truncate will take hours with the default pool setting. It takes a minute or two with a huge pool setting. Database performance on the inserts likewise goes way up.
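You can confirm the setting actually took effect after restarting the server (the value is reported in bytes):

```sql
-- Check the running server's buffer pool size; 8G should show as 8589934592.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```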


Because I am using a normalized database structure, each import row generates several database queries/inserts hitting the various related tables. That's why you would want to enclose all queries for a single row in a single transaction, and roll them back as a unit.


What I did, instead, was bury all of that inside a stored routine. The routine (defined as a MySQL FUNCTION) returns the main row's insert id, or NULL on failure.
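A minimal sketch of that kind of stored function, assuming an invented two-table schema (addresses and people); the names and columns are hypothetical, but the shape is the one described above: insert into the related tables, and return the main insert id or NULL:

```sql
DELIMITER //

CREATE FUNCTION import_row(p_name VARCHAR(100), p_address VARCHAR(255))
RETURNS INT
MODIFIES SQL DATA
BEGIN
    DECLARE v_address_id INT;
    -- "Ride through" a failed row: swallow the error and signal it with NULL
    -- instead of throwing, so the surrounding transaction survives.
    DECLARE EXIT HANDLER FOR SQLEXCEPTION RETURN NULL;

    INSERT INTO addresses (address) VALUES (p_address);
    SET v_address_id = LAST_INSERT_ID();

    INSERT INTO people (name, address_id) VALUES (p_name, v_address_id);
    RETURN LAST_INSERT_ID();  -- the main row's insert id
END//

DELIMITER ;
```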


Then, in PHP, I keep a count of "good" rows and "bad" rows. I use the Yii database log route for reporting the "bad" rows. If I logged them to the database during that transaction and then rolled the transaction back, I'd lose the logged information. Since Yii does not write the log messages until the end, the transaction is closed by then and I don't lose the Yii log. (Nice!)
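For reference, the database log route in Yii 1 is configured roughly like this (the levels and the 'import' category are my own assumptions, not required names):

```php
// protected/config/main.php (fragment). CDbLogRoute writes messages only
// when processing ends, i.e. after the import transaction has committed
// or rolled back, which is why the log survives a rollback.
'components' => array(
    'log' => array(
        'class' => 'CLogRouter',
        'routes' => array(
            array(
                'class' => 'CDbLogRoute',
                'levels' => 'warning, error',
                'categories' => 'import',  // hypothetical category for import messages
            ),
        ),
    ),
),
```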


So, my method for doing a massive import of many rows of data:


1. Adjust the MySQL Server setting for innodb_buffer_pool_size.


2. Enclose the entire import in a single transaction, or in as few transactions as possible.


3. Run the entire import inside a try/catch block, exactly as The Yii Book demonstrates in the database chapter.


4. Build the DAO command once, and bind the parameters for each row, as that row gets imported. In my case the row data are trusted, and the parameter binding is sufficient data validation.


5. In my case, it's okay to "ride through" a failed import of a single row. Thus I hide the insert-error processing inside the stored procedure, and do NOT throw an exception.


6. Any thrown exception triggers a transaction rollback, and if the import completes without throwing an exception, the transaction commits.


7. I use the Yii log-to-database route for reporting import problems. They do not get written to the database until the batch process has ended, and therefore they do not get rolled back by a failed import.
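The steps above can be sketched in Yii 1 DAO. The stored function name, its parameters, and the 'import' log category are hypothetical carry-overs from the earlier examples, not the actual implementation:

```php
$db = Yii::app()->db;
$transaction = $db->beginTransaction();  // step 2: one transaction for everything
$good = $bad = 0;
try {
    // Step 4: build the DAO command once, bind per row.
    $command = $db->createCommand('SELECT import_row(:name, :address)');
    foreach ($rows as $lineNo => $row) {
        $command->bindValue(':name', $row[0]);
        $command->bindValue(':address', $row[1]);
        // Step 5: the function returns NULL for a bad row instead of throwing.
        $id = $command->queryScalar();
        if ($id === null || $id === false) {
            $bad++;
            // Step 7: logged via the database log route, written after the batch ends.
            Yii::log("Row $lineNo failed to import", CLogger::LEVEL_WARNING, 'import');
        } else {
            $good++;
        }
    }
    $transaction->commit();      // step 6: commit if nothing was thrown...
} catch (Exception $e) {         // step 3: try/catch around the whole import
    $transaction->rollback();    // ...otherwise roll the entire import back
    throw $e;
}
```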


For the record, I drive all the above with PHPUnit. The Yii/PHPUnit integration is nice indeed! The batch import is implemented as a Yii console command, so PHPUnit invokes that with exec(), but each of the batch steps is a component which PHPUnit can exercise directly.


In the same way, PHPUnit can directly exercise and validate the stored procedures by using Yii DAO inside the PHPUnit methods.
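Such a test might look like this (again assuming the hypothetical import_row function from the earlier sketch):

```php
// PHPUnit test hitting the stored function directly through Yii 1 DAO.
class ImportRowTest extends CTestCase
{
    public function testImportRowReturnsAnInsertId()
    {
        $id = Yii::app()->db
            ->createCommand('SELECT import_row(:name, :address)')
            ->queryScalar(array(':name' => 'Alice', ':address' => '1 Main St'));
        $this->assertNotNull($id, 'A valid row should return the new insert id.');
    }
}
```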


Larry, agreed! I considered MySQL "load data" and indeed it does a screamingly fast import. Unfortunately I have not found any command-line way to distribute a row of data into normalized tables (address to the address table, product lookup-or-insert from the product table and insert its id, etc.). If that CAN be done, I'd love to be clued in! Perhaps a stored trigger on insert, but that begins to add another layer of complexity to what should be simple!
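For comparison, the LOAD DATA route looks like this (file path and table name are placeholders); the limitation is exactly the one mentioned above: it targets a single table, so it can't distribute one CSV row across several normalized tables:

```sql
LOAD DATA LOCAL INFILE '/path/to/import.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the CSV header row
```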


My other issue is "scrubbing" the data. I detect various inconsistencies during import. That in itself is not justification for doing a PHP-based import. It would make equal sense to do a fast command-line import and then do a second pass with PHP looking for those discrepancies. But since I'm already in PHP, I just do the discrepancy checking at the same time.

