Increase MySQL output to 80K rows/second in Pentaho Data Integration

One of our clients has a MySQL table with around 40M records. To load the table it took around 2,5 hours. When i was watching the statistics of the transformation I noticed that the bottleneck was the write to the database. I was stuck at around 2000 rows/second. You can imagine that it will take a long time to write 40M records at that speed.
I was looking in what way I could improve the speed. There were a couple of options:
  1. Tune MySQL for better performance on Inserts
  2. Use the MySQL Bulk loader step in PDI
  3. Write SQL statements to file with PDI and  read them with mysql-binary
When i discussed this with one of my contacts of Basis06 they faced a similar issue a while ago. He mentioned that speed can be boosted by using some simple JDBC-connection setting. useServerPrepStmts=false

These options should be entered in PDI at the connection. Double click the connection go to Options and set these values.

Used together, useServerPrepStmts=false and rewriteBatchedStatements=true will “fake” batch inserts on the client. Specifically, the insert statements:

INSERT INTO t (c1,c2) VALUES ('One',1);
INSERT INTO t (c1,c2) VALUES ('Two',2);
INSERT INTO t (c1,c2) VALUES ('Three',3);

will be rewritten into:

INSERT INTO t (c1,c2) VALUES ('One',1),('Two',2),('Three',3);

The third option useCompression=true compresses the traffic between the client and the MySQL server.

Finally I increased the number of copies of the output step to 2 so that there are two treads inserting into the database.

This all together increased the speed to around 84.000 rows a second! WOW!


9 thoughts on “Increase MySQL output to 80K rows/second in Pentaho Data Integration

  1. This is fantastic!
    My table output step’s speed was 200 rows/sec, so I had to use MySQL Bulk Loader.
    After this configuration, the speed went up to 12,000 rows/sec, which is almost 60 times faster!

    (and I haven’t increased the number of copies of output step, and I’ll update the result soon)
    Thanks for sharing great information, David.

  2. Hi,
    I’ve used to use this technique for a while, in my new company we are using amazon RDS and I do not see this setting is working; I still see single row inserts instead of bulk loading. Can someone confirm that RDS supports it ? I do not see any reason why not because its pure JDBC responsibility to construct SQL, however proven solution is not working in my new environment and the only difference I see mysql 5.6 vs 5.5 and RDS service vs linux virtual box with mysql 5.5 on ec2

      • I’m using MariaDB 10.0.17 on a EC2 instance as I wanted to get the RTT out of the way while on RDS MySQL 5.6.22.
        I’m using PID 4.4 with
        And a batch size of 1000. My job takes 5m30s to insert 182k rows in a table with 95% of the time being spent in the DB commits.
        I tweaked every little knob to ensure maximum performance but there seems to be a bottleneck somewhere.
        Here are just some of the things that I did: updated JDBC to latest version, increased # of table output copies (PID is running on a c4.4xlarge EC2 instance), increased the kettle java heap memory, provisioned the EBS volume to 6000 IOPS, using /dev/shm for temporary tables, and various other tweaks.
        The sheer DB performance is very good, but not in combination with PID.

        I am new to PID so I can’t really advise what’s happening in there.
        If anyone can point me in some directions, it’d be great. My last resort is to do a tcpdump against the DB and see what’s happening in there.


  3. Pingback: MySQL + JDBC with rewriteBatchedStatements=true

  4. Amazing! With these settings I have been able to divide the time needed for some of my transformations by 20 times!
    Going from one hour to 3 minutes… I couldn’t believe it, I had to triple check that every row had indeed been copied to the destination 🙂

Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s