Implementing Universal Bulk-Loading in PDI

Most database systems provide a command-line bulk loading utility that operates on flat files. These database-specific utilities are usually the fastest way to get data into the database. This post shows a generic technique for using bulk loading utilities as part of a PDI ETL process. The download package contains a working sample showing the technique in action.

Suppose PDI has no ready-made bulk loading step for the database you are using, or an existing bulk loading step falls short in some regard. You may be tempted to generate a temporary flat file first, and then launch the database’s bulk loading utility to import it quickly. While that approach works, it adds the I/O overhead of writing the temporary file to disk. For bigger datasets this overhead may very well outweigh the speed improvement gained by using the bulk loading utility. Besides, you may not be blessed with vast amounts of fast disk space.

Would it not be great to avoid the overhead of writing an actual file, and instead create some kind of channel that passes PDI’s output directly to the bulk loading utility? This is where the concept of a FIFO pipe comes in. What if the temporary flat file was not an actual file on disk, but a FIFO? The data producer, a PDI transformation, would continuously write to it, while the consumer, the database’s bulk loading utility, would continuously read from it. On a UNIX-based operating system, FIFOs are exposed as files on the file system, so it is easy to create one and then write to and read from it, just like a regular file.
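
To see the mechanism in isolation, here is a minimal shell sketch; the FIFO path is just a placeholder:

    # create a FIFO on the file system
    mkfifo /tmp/demo_fifo

    # writer: backgrounded, blocks until a reader opens the FIFO
    printf 'hello\nworld\n' > /tmp/demo_fifo &

    # reader: consumes the data as the writer produces it
    cat /tmp/demo_fifo

    # clean up
    rm /tmp/demo_fifo

Nothing is ever written to disk; the data flows through the kernel’s pipe buffer straight from the writer to the reader.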

The following sample process shows how the above setup can be achieved with PDI. For demonstration purposes, I choose to ignore the existing MySQL bulk loading step in PDI and bulk-load a MySQL table manually using the mysqlimport command-line utility. The mysqlimport utility works very well with CSV-formatted files, so the key components of the solution are to write CSV data to a FIFO and to run mysqlimport reading from the same FIFO simultaneously.
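
Since mysqlimport derives the target table name from the basename of the file it is given, the FIFO should be named after the target table. A hedged sketch of such a call, with database, user and FIFO path as placeholders:

    # loads the contents of /tmp/sales_fact into the table "sales_fact"
    # of database "dwh"; add credentials/host options as needed
    mysqlimport --local \
      --fields-terminated-by=',' \
      --fields-optionally-enclosed-by='"' \
      --lines-terminated-by='\n' \
      -u dwh_user dwh /tmp/sales_fact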

The first entries of the job just make sure that the target table is in place and emptied if necessary.
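
This could be done with an SQL job entry or the mysql command-line client; a minimal sketch, assuming the hypothetical dwh database and sales_fact table used above (not the exact DDL from the download package):

    # create the target table if needed and empty it before loading
    mysql -u dwh_user dwh -e "
      CREATE TABLE IF NOT EXISTS sales_fact (
        id   INT NOT NULL,
        name VARCHAR(64)
      );
      TRUNCATE TABLE sales_fact;"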

Moving on to the delete fifo and create fifo entries: they replace any possibly stale FIFO with a fresh, empty one.
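
On the shell these two entries are roughly equivalent to the following commands; the FIFO path is a placeholder:

    # remove a leftover FIFO from a previous run, then create a fresh one
    rm -f /tmp/sales_fact
    mkfifo /tmp/sales_fact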

The next section kicks off the parallel writing to and reading from the FIFO. The transformation generates 10M rows and writes them in CSV format into the FIFO. In parallel, a shell script is launched that executes the command-line bulk loading utility to import the data from the FIFO. Please note that it is essential that both entries execute in parallel; the parallel lines on the hops indicate that the shell script and the transformation run at the same time.
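
Outside of PDI, the parallel part of the job corresponds to a pattern like the following sketch, which uses a trivial generator in place of the real transformation and placeholder names throughout:

    # producer: write CSV rows into the FIFO in the background
    # (in the real job, the PDI transformation plays this role)
    awk 'BEGIN { for (i = 1; i <= 10000000; i++) print i ",row_" i }' > /tmp/sales_fact &

    # consumer: bulk-load from the FIFO while the producer is still writing
    # (in the real job, this is what the Shell entry runs)
    mysqlimport --local --fields-terminated-by=',' \
      -u dwh_user dwh /tmp/sales_fact

    # wait for the background producer to finish
    wait

The consumer sees end-of-file once the producer closes its end of the FIFO, so mysqlimport finishes as soon as the last row has been written.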

The speed benefits are usually substantial compared to regular JDBC-based inserts. For my test setup I am generating and inserting 10M rows into a MySQL database on my local machine. With a simple Table Output step using batched inserts (commit size of 10,000), the run takes about 11 minutes to complete. With the command-line bulk loader, the whole process takes about 16 seconds.

The download package contains the above sample setup. I’ve tested it on Mac OS X, but it should work on Linux without substantial changes as well. Just make sure to adjust the connection details and command lines to your environment.

Cheers
Slawo

PS: when testing this in Spoon on Mac OS X, make sure you start Spoon using spoon.sh instead of the .app. There seems to be some inconsistent handling of the stderr stream in the “Shell” job entry implementation that causes the job to fail when Spoon is started from the .app (observed in PDI 4.3.0).

6 comments to Implementing Universal Bulk-Loading in PDI

  • Hunt

    Perfect !

    Do you think it’s better to never use .app on mac os x ?

    Thank you for sharing this.

  • Slawomir Chodnicki

    Hey there,

    I usually prefer the .sh, as I like keeping my JVM command-line options in a .sh rather than in the .plist file. Besides that, I like seeing stdout/stderr on my terminal and not having it mixed into the system log. But it’s a matter of taste, really.

    Cheers
    Slawo

  • Sean

    Hi Slawo,
    Will it work with Windows 7 or 2008 server? I’m currently loading to Infinidb using some scripts I saw on Diethar’s blog. I do write large files just to read them back again. So is this something that will work with Windows (and specifically Windows and Infinidb)?

    Your blog is one of my go-to blogs and I think I must have referred to your reg-ex article so many times. Please keep writing.

    Regards,
    Sean

  • Slawomir Chodnicki

    Hey Sean,

    Windows does not seem to have good command-line support for its named pipes: http://en.wikipedia.org/wiki/Named_pipe
    Maybe you can make it happen using Cygwin or some other Windows-specific tool that exposes a Windows named pipe as a file.

    Cheers
    Slawo

  • Sathish Kumar S

    Hi,

    How does this work with the PostgreSQL COPY command? When I execute it in parallel, I am experiencing performance issues.

  • Vasili

    Hi Slawo,

    Good solution, thanks for sharing!

    Regards,
    Vasili
