Most database systems provide a command-line bulk-loading utility that operates on flat files. These database-specific utilities are usually the fastest way to get data into the database. This post shows a generic technique for using bulk-loading utilities as part of a PDI ETL process. The download package has a working sample showing the technique in action.
Suppose PDI has no ready-made bulk-loading step for the database you are using, or an existing bulk-loading step falls short in some regard. You may be tempted to generate a temporary flat file first, and then launch the database's bulk-loading utility to import it quickly. While that approach works, it adds the I/O overhead of writing and then re-reading the temporary file. For bigger datasets this overhead may very well outweigh the speed improvement gained by using the bulk-loading utility. Besides, you may not be blessed with vast amounts of fast disk space.
Would it not be great to avoid the overhead of writing an actual file, perhaps by creating some kind of channel that passes PDI's output directly to the bulk-loading utility? At this point it is worth recalling the concept of a FIFO pipe. What if the temporary flat file was not an actual file on disk, but a FIFO pipe? The data producer, a PDI transformation, would continuously write to it, while the consumer, the database's bulk-loading utility, would continuously read from it. If you are on a UNIX-based operating system, FIFOs are abstracted as files on the file system. It is easy to create a FIFO and then read and write to it, just like with regular files.
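The whole round trip is easy to try in a few lines of shell. The following sketch (the FIFO path is just an example) creates a named pipe, starts a background writer, and reads the data back like a regular file:

```shell
#!/bin/sh
# Create a FIFO (named pipe) on the file system.
FIFO=/tmp/demo.fifo
rm -f "$FIFO"
mkfifo "$FIFO"

# Writer: runs in the background; its open() blocks until a reader attaches.
printf 'hello from the writer\n' > "$FIFO" &

# Reader: consumes the data just like a regular file.
RESULT=$(cat "$FIFO")
echo "$RESULT"   # prints: hello from the writer

# Clean up.
wait
rm -f "$FIFO"
```

Note that opening a FIFO blocks until both a reader and a writer are attached, which is exactly the hand-off behavior the bulk-loading setup relies on.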
The following sample process shows how the above setup can be achieved with PDI. For demonstration purposes, I chose to ignore the existing MySQL bulk-loading step in PDI, and manually bulk-load a MySQL table using its mysqlimport command-line utility. The mysqlimport utility works very well with CSV-formatted files, so the key components of the solution are to write CSV data to a FIFO, and to run mysqlimport to read from the same FIFO simultaneously.
The first entries of the job just make sure the target table is in place, and emptied if necessary.
Moving on to the delete fifo and create fifo entries: they replace any possibly stale FIFO left over from a previous run with a fresh, empty one.
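In shell terms, the two entries amount to something like this (the FIFO path is illustrative):

```shell
#!/bin/sh
FIFO=/tmp/load_data.fifo

# delete fifo: remove any stale pipe left over from an earlier or failed run;
# -f makes this a no-op when the file does not exist.
rm -f "$FIFO"

# create fifo: a fresh, empty named pipe, ready for writer and reader.
mkfifo "$FIFO"
```

Making the deletion unconditional keeps the job re-runnable: a leftover FIFO from a crashed run would otherwise make mkfifo fail.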
The next section kicks off the parallel writing and reading to/from the FIFO. The transformation generates 10M rows, and writes them in a CSV format into the FIFO. In parallel to that, a shell script is launched that executes the command-line bulk loading utility to import the data from the FIFO. Please note that it is essential that both entries are executing in parallel. The parallel lines on the hops indicate that the shell script and the transformation execute at the same time.
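The producer/consumer pairing can be sketched in plain shell. This is a minimal stand-in, not the PDI job itself: the background loop plays the role of the transformation, and a simple line count plays the role of the bulk loader (the comment shows roughly where the real mysqlimport call would go; database and path names are assumptions):

```shell
#!/bin/sh
FIFO=/tmp/load_data.fifo
rm -f "$FIFO"
mkfifo "$FIFO"

# Writer: stands in for the PDI transformation, emitting CSV rows.
# (10 rows here; the real job writes 10M.)
( for i in 1 2 3 4 5 6 7 8 9 10; do
    printf '%s,row_%s\n' "$i" "$i"
  done > "$FIFO" ) &

# Reader: stands in for the bulk loader. With MySQL this would be e.g.
#   mysqlimport --local --fields-terminated-by=',' mydb "$FIFO"
# (mysqlimport derives the target table name from the file name).
ROWS=$(wc -l < "$FIFO" | tr -d ' ')

wait
rm -f "$FIFO"
echo "loaded $ROWS rows"
```

If either side is missing, the other blocks on opening the FIFO, which is why launching the transformation and the shell script in parallel is essential.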
The speed benefits are usually substantial compared to regular JDBC-based inserts. For my test setup I am generating and inserting 10M rows into a MySQL database on my local machine. If I use a simple table output step with batched inserts (commit size of 10,000), it takes about 11 minutes to complete. If I run with the command-line bulk loader, the whole process takes about 16 seconds.
The download package contains the above sample setup. I've tested it on Mac OS X, but it should work on Linux without substantial changes as well. Just make sure to adjust the connection details and command lines to your environment.
PS: when testing this in Spoon on Mac OS X, make sure you start Spoon using spoon.sh instead of the .app. There seems to be some inconsistent handling of the stderr stream in the "Shell" job entry implementation that causes the job to fail when started from the .app (observed in PDI 4.3.0).