Sometimes it would be nice to execute a Kettle job or transformation multiple times, each time with different parameters passed in. Suppose, for example, you are doing migration work and need to import legacy data one day at a time. You already have a job in place that imports the data for a particular date, which must be specified as a parameter. Now suppose you need to import all data from a fixed start date up to yesterday. Running the import manually for every day is tedious and error-prone. Would it not be nice to let Kettle figure out the proper dates and run the import job for each one?
Things like that can be done by having a master job that first runs a transformation which returns the dates one row at a time. The actual import job is placed after the transformation and it is executed for each row in the transformation result. The job can be set up to get its parameters from a result field.
Let’s look at this one step at a time (no pun intended).
The transformation responsible for generating the dates produces rows with a single date field. The “set start date” step puts thousands of rows onto the stream, each holding a constant start date. The “calc days” step adds a sequence to the stream, which the “calc date” calculator step uses to compute consecutive dates. The subsequent step throws away the helper fields. Yesterday’s date is added to the stream, so each row can be checked to see whether its date is earlier than today. If it is, the row is copied to the result set; if it is not, the row is discarded.
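The logic of the date-generating transformation can be sketched in plain Python. This is a hypothetical stand-in for the Kettle steps (constant start date, sequence, date calculation, filter on "earlier than today"), not code that runs inside the transformation:

```python
from datetime import date, timedelta

def generate_dates(start: date, today: date):
    """Yield every date from `start` up to and including yesterday,
    mirroring the row generation and filter of the transformation."""
    current = start
    while current < today:  # keep only dates earlier than today
        yield current
        current += timedelta(days=1)

# Fixed "today" so the example is deterministic
dates = list(generate_dates(date(2010, 1, 1), date(2010, 1, 4)))
# dates now holds 2010-01-01, 2010-01-02 and 2010-01-03
```

In the actual transformation, each yielded date corresponds to one row copied to the result set, which the master job then consumes.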
The Master Job
The master job executes the date generation transformation, followed by an inner job that is supposed to run once for each of the incoming dates. The job entry must be explicitly set up to execute for every incoming row, which is what the “Execute for every input row?” check box enables. The calculated dates also need to be passed into the job. The job expects them in its “date” parameter, so “Copy previous results to parameters?” is checked as well. Finally, the parameters must be mapped from the incoming rows; the expected parameter and the incoming field are both called “date”.
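Conceptually, the execute-per-row behaviour boils down to the following loop. This is only an illustrative Python sketch; `run_inner_job` is a hypothetical stand-in for launching the inner job, not a Kettle API:

```python
def run_inner_job(parameters):
    # Stand-in for the inner job: just report which date it received
    return f"importing data for {parameters['date']}"

def master_job(result_rows):
    """Execute the inner job once per result row, copying the row's
    'date' field into the job's 'date' parameter."""
    logs = []
    for row in result_rows:
        params = {"date": row["date"]}  # field-to-parameter mapping
        logs.append(run_inner_job(params))
    return logs

logs = master_job([{"date": "2010-01-01"}, {"date": "2010-01-02"}])
```

Kettle does all of this for you once the check boxes and the parameter mapping described above are in place.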
The Inner Job
The inner job runs for each row and accepts the incoming date in the parameter “date”. In the example it simply logs which date it sees and completes.
Download the example setup here. The master job runs the inner job for each date beginning with 2010-01-01 up to and including yesterday.
I would like to point out that this technique might not fit all possible looping needs, especially “repeat until a condition holds” scenarios. You might want to check out this article for an alternative approach to looping in Kettle jobs.