Run Kettle Job for each Row

Sometimes it would be nice to execute a Kettle job or transformation multiple times, each time with different parameters passed in. Suppose for example you are doing migration work and need to import legacy data one day at a time. You already have a Job in place that imports data for a particular date. The date must be specified as a parameter. Suppose you need to import all data since a fixed start date up to yesterday. Running the import for every day manually is a pain and it is error prone. Would it not be nice to let Kettle figure out the proper dates and run the import job for each date?

Things like that can be done by having a master job that first runs a transformation which returns the dates one row at a time. The actual import job is placed after the transformation and it is executed for each row in the transformation result. The job can be set up to get its parameters from a result field.

Let’s look at this one step at a time (no pun intended).

Date Generation

The transformation responsible for generating the dates produces rows with a single date field. The “set start date” step puts thousands of rows to the stream with a constant start date. The “calc days” step inserts a sequence into the stream which is used to calculate consecutive dates in the “calc date” calculator step. The subsequent step throws away the helper fields.  Yesterday’s date is added to the stream, so a comparison is possible checking whether the current row contains a date that is earlier than today.  If it is, it is copied to the result set. If it is not, the row is discarded.

The Master Job

The master job executes the date generation transformation and follows up with an inner job that is supposed to execute for each of the incoming dates. The job must be specifically set up to execute for each incoming row. The check box at “Execute for every input row?” enables that. It is also required to pass in the calculated dates into the job. The job expects them in the “date” parameter, so “Copy previous results to parameters?” is also checked. Finally, the parameters must be mapped from the incoming rows. The expected parameter and the incoming field are both called “date”.

The Inner Job

The inner job runs for each row and accepts the incoming date in the parameter “date”.  In the example it simply logs which date it sees and completes.

Downloads

Download the example setup here. The master job runs the inner job for each date beginning with 2010-01-01 up to and including yesterday.

I would like to point out that this technique might  not fit all possible looping needs. Especially for “repeat till condition holds” scenarios. You might want to check out this article for an alternative approach to looping in kettle jobs.

Cheers

Slawo

Related Literature

14 comments to Run Kettle Job for each Row

  • [...] Run Kettle Job for each Row    Developing a custom Kettle Plugin: Looking up values in Voldemort [...]

  • [...] feature together with the “execute for every input row” feature enables an interesting looping technique. How to access parameter [...]

  • Daniel Melo

    The date parameter is empty at inner job. I’m using kettle 3.2.

  • Slawomir Chodnicki

    Hi Daniel,

    I do not really know what you have tried, so it’s hard for me to diagnose the problem. When I run the master job, the inner job correctly logs the dates it runs for. I can only guess what the problem might be. Maybe you’ve run the inner job stand alone? Or maybe you’ve imported the whole setup into a repository and are facing this issue: http://jira.pentaho.com/browse/PDI-2411. If you provide more details I might be able to help.

    Regards

    Slawo

  • Nikolai Sandved

    Hi Slawo

    Thanks for a splendid post. I’ve just used your example to implement a main job calling two transformations (Kettle 4.0.1). The first collects a list with file names, the second transformation uses this to generate an excel file for each with a name dependent on the current row processed. So this setup works calling transformations as well :-) Thanks again! _nikosan_

  • Slawomir Chodnicki

    Hi _nikosan_,

    great, good to hear the post helped with the solution :)

    Cheers

    Slawo

  • Ilia

    Hi, thanks for your post.

    I’m wandering is it possible to launch the same job with different parameters in parallel ?
    let say I have the presentation of days in the first step like 201201,201202 … etc and I’m building summarization table by date, so I want to ran sum calculation in parallel because all of them will be by different day and no potential concurrency problem

    Thanks

    Ilia

  • Pankaj Kataria

    Hi Slawomir, thanks for this.. it’s an awesome post.

    Though I did figure out to execute my job on each row but now am facing memory issues and it gives me a java.lang.OutOfMemoryError: Java heap space error. I did change the start-pentaho.bat to give JVM enough memory.

    set CATALINA_OPTS=-Xms4g -Xmx8g -XX:MaxPermSize=4g -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000

    What else do you think I can do? Thanks in advance.

    Regards,
    Pankaj

  • Inder

    Awesome Slawo! I am new to pentaho/pdi. Thanks for showing how to do this – helped me a lot.
    -Inder

  • Teddy

    Hi Slawomir

    Thanks for you work.It’s useful.

    Regards
    Teddy

  • Slawomir Chodnicki

    Thanks Teddy ;)

  • Christian

    Great article, thanks. Also a great blog in general, I can’t count how many times I’ve googled kettle + something advanced and your site comes up with an article specifically about something advanced.

  • Yonni

    Hi Slawomir

    I am new to pentaho data integration. I find your example here very helpful. However there is a problem. I generated two columns instead (start,end) but they dont seem to be recognized as parameters $(start) and $(end) do not print the corresponding values in the columns. They appear as “$(start) and $(end)”. They are not recognized as parameters but it seems that ‘date’ is. I dont understand why this is the case. When I changed ‘start’ to ‘date’ i got the correct values. What is it about the parameter name ‘date’. AM using community edition 4.4.0 and am confused here. Can you help ?

  • Thanks Slawomir, your post help me to solve a similiar problem!!!

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>