Sometimes it would be nice to execute a Kettle job or transformation multiple times, each time with different parameters passed in. Suppose for example you are doing migration work and need to import legacy data one day at a time. You already have a Job in place that imports data for a particular date. The date must be specified as a parameter. Suppose you need to import all data since a fixed start date up to yesterday. Running the import for every day manually is a pain and it is error prone. Would it not be nice to let Kettle figure out the proper dates and run the import job for each date?
Things like that can be done by having a master job that first runs a transformation which returns the dates one row at a time. The actual import job is placed after the transformation and it is executed for each row in the transformation result. The job can be set up to get its parameters from a result field.
Let’s look at this one step at a time (no pun intended).
Date Generation
The transformation responsible for generating the dates produces rows with a single date field. The “set start date” step puts thousands of rows to the stream with a constant start date. The “calc days” step inserts a sequence into the stream which is used to calculate consecutive dates in the “calc date” calculator step. The subsequent step throws away the helper fields. Yesterday’s date is added to the stream, so a comparison is possible checking whether the current row contains a date that is earlier than today. If it is, it is copied to the result set. If it is not, the row is discarded.
The Master Job
The master job executes the date generation transformation and follows up with an inner job that is supposed to execute for each of the incoming dates. The job must be specifically set up to execute for each incoming row. The check box at “Execute for every input row?” enables that. It is also required to pass in the calculated dates into the job. The job expects them in the “date” parameter, so “Copy previous results to parameters?” is also checked. Finally, the parameters must be mapped from the incoming rows. The expected parameter and the incoming field are both called “date”.
The Inner Job
The inner job runs for each row and accepts the incoming date in the parameter “date”. In the example it simply logs which date it sees and completes.
Downloads
Download the example setup here. The master job runs the inner job for each date beginning with 2010-01-01 up to and including yesterday.
I would like to point out that this technique might not fit all possible looping needs. Especially for “repeat till condition holds” scenarios. You might want to check out this article for an alternative approach to looping in kettle jobs.
Cheers
Slawo







[...] Run Kettle Job for each Row Developing a custom Kettle Plugin: Looking up values in Voldemort [...]
[...] feature together with the “execute for every input row” feature enables an interesting looping technique. How to access parameter [...]
The date parameter is empty at inner job. I’m using kettle 3.2.
Hi Daniel,
I do not really know what you have tried, so it’s hard for me to diagnose the problem. When I run the master job, the inner job correctly logs the dates it runs for. I can only guess what the problem might be. Maybe you’ve run the inner job stand alone? Or maybe you’ve imported the whole setup into a repository and are facing this issue: http://jira.pentaho.com/browse/PDI-2411. If you provide more details I might be able to help.
Regards
Slawo
Hi Slawo
Thanks for a splendid post. I’ve just used your example to implement a main job calling two transformations (Kettle 4.0.1). The first collects a list with file names, the second transformation uses this to generate an excel file for each with a name dependent on the current row processed. So this setup works calling transformations as well
Thanks again! _nikosan_
Hi _nikosan_,
great, good to hear the post helped with the solution
Cheers
Slawo
Hi, thanks for your post.
I’m wandering is it possible to launch the same job with different parameters in parallel ?
let say I have the presentation of days in the first step like 201201,201202 … etc and I’m building summarization table by date, so I want to ran sum calculation in parallel because all of them will be by different day and no potential concurrency problem
Thanks
Ilia
Hi Slawomir, thanks for this.. it’s an awesome post.
Though I did figure out to execute my job on each row but now am facing memory issues and it gives me a java.lang.OutOfMemoryError: Java heap space error. I did change the start-pentaho.bat to give JVM enough memory.
set CATALINA_OPTS=-Xms4g -Xmx8g -XX:MaxPermSize=4g -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000
What else do you think I can do? Thanks in advance.
Regards,
Pankaj
Awesome Slawo! I am new to pentaho/pdi. Thanks for showing how to do this – helped me a lot.
-Inder
Hi Slawomir
Thanks for you work.It’s useful.
Regards
Teddy
Thanks Teddy
Great article, thanks. Also a great blog in general, I can’t count how many times I’ve googled kettle + something advanced and your site comes up with an article specifically about something advanced.