Squeezing the most out of the JavaScript Step in Pentaho Kettle

If you read the articles about using Java in Kettle and manually generating rows, you have seen constant reminders about the fact that scripted code runs slower than a compiled plugin. But if for some reason you are stuck with a scripted solution, it is good to know a few facts about the JavaScript step that help optimizing execution speed.

Execute initialization code only once

Suppose it is necessary to do some kind of sophisticated custom lookup. For argument’s sake let’s look at an transformation step, that is supposed to match a string from a fixed set of ca. 1000 strings based on a modulo calculation of an input ID. This is what the first attempt might look like.

var values = [
	"AARON",
	"ABHI",
	"AG",
--- snip ---
	"ZIGZAGGLE",
	"ZOGLA",
	"ZONGKER",
	"ZSOLT",
	"ZUKAS",
	"ZVEROVICH"
];

var output_value = values[id % values.length];

It looks entirely innocent, but the array of possible values is recreated for every incoming row. This is not necessary. Variables inside the JavaScript keep their values between incoming rows, so it should be possible to initialize the values array only once.  Probably the most elegant way to do it, is to create a separate script tab for initialization. If you right-click the tab area of the editor, you can create a new tab.

All of the script tabs also appear in the left menu where you can also right click them. The context menu allows to give the newly created tab a more meaningful name.

After moving the initialization code into the new initialization tab, Kettle must be told to execute the code in this tab exactly once just before the first row comes in. This is done by right clicking the initialization tab and marking it as a “Start Script”.

The other possible markings are “End Script” and “Transform Script”. An “End Script” is executed after the last row has been processed. It is perfect for clean up work if you want to free resources like connections or file handles, should you have any. The “Transform Script” is executed for each incoming row. Exactly one tab must be marked as “Transform Script”. The first tab of a JavaScript step is is marked a “Transform Script” by default. There is also the possibility to remove the script type from a tab. In this case the script is just sitting there and is not executed at all.

Unfortunately in Kettle it is not possible to see at a glance whether a tab has been marked as special. When in doubt, you have to click through all your tabs and assign the appropriate script types.
Edit: Samatar fixed this little flaw in Kettle 4.1, see comments section :)

Is it really faster now?

The new step uses the initialization tab and avoids recreating the values array on each row. The unoptimized transformation processes 1,000,000 rows in ca. 19 secs on my laptop.  The optimized version pumps the same amount of rows in under 3 seconds. Download the example transformations, if you like.

How is a script tab with no type useful?

A script tab that is not marked as special will not be executed by Kettle.

This can be useful if you want to keep a code scrapbook around. But you can also execute the typeless tab’s script dynamically from your initialization step. This can be useful if you have several initialization subtasks that you want to keep in separate tabs. The bundled function LoadScriptFromTab is used to execute a script tab dynamically. It is in the “Special functions” section. Which brings me to the next point.

Have a look at the bundled functions

The bundled functions are implemented in Java and are thus executing faster than custom JavaScript functions you might come up with. Each function comes with a short sample which shows how to use it. Some of the functions are quite interesting, and can save you a lot of custom coding. In particular the functions in the “Special functions” section provide some interesting scripting possibilities. I’m thinking along the lines of fireToDB, appendToFile, resolveIP, etc. The “Date functions” section also provides nice functions that are frequently useful in a DWH context. Feel free to explore.

Related Literature

11 comments to Squeezing the most out of the JavaScript Step in Pentaho Kettle

  • I found out about this post in the Pentaho forum ( http://forums.pentaho.com/showthread.php?76911-Javascript-Previous-Row ), and I have to say that this tip is awesome. I didn’t know about those start and end scripts.

    I usually initialized my long-lived JS variables like this:

    var myJavaService;

    // Initialization code – Only executed for the first row
    if (myJavaService == null) {
    myJavaService= new Packages.com.censored.bi.kettle.MyJavaService();
    }

    // Use the service for each row

    I’ve started looking at your other articles as well, and there’s some very good and practical stuff in there. It’s too bad I only found your blog recently: it would have saved me a lot of time (spent with trial and errors), in particular with Mondrian / MDX.

    Keep it up!

  • Slawomir Chodnicki

    Hi Etienne,

    thanks for your comments. I’m glad to hear that you find the blog useful. I’ll try to keep it interesting in the future :)
    Do you have a suggestion for a topic maybe?

    PS: my initialization code looked exactly the same before I found out about the start script feature :)

  • Samatar

    Hi Slawomir,

    see the Jira :

    http://jira.pentaho.com/browse/PDI-4521

    Take care

    Samatar

  • Slawomir Chodnicki

    Hi Samatar,

    thanks for posting the JIRA case. :)
    Wow, it’s already fixed. I didn’t notice at first.

    A big “Thank you” Samatar :)
    Cheers

    Slawo

  • @Samatar: wow, that was fast :) Thanks for fixing this!

    @Slawomir: well, I’m currently interested in “Validation & Error Handling Best Practices” inside Kettle, and this might be an interesting topic (at least, for me).
    More precisely, I have many transformations validating source data with the “Validator Step”, and I think it’s a good idea to keep the invalid rows somewhere for further analysis.
    For example, it would be great, after each ETL run, if there were invalid rows, to send an e-mail like:
    “There were validation errors.
    - 12 invalid rows in table “src_store”
    - 218 invalid rows in table “src_sale”
    …”
    But I still have many problems to solve, such as:
    - should I simply log this data (using “Write to log”), and then attach the whole log file in the mail? If so, what happens when I have a million invalid rows and the log becomes too big?
    - should I save this to some “validation_errors” db table? If so, and if this table has 4 columns (“transformation_name”, “validation_step_name”, “error_description” and “record_values”), how do I create the appropriate “Sub-transformation” that
    a) auto-discovers the name of its parent transformation (“transformation_name”) and the name of the validation step that found the row to be invalid (“step_name”)
    b) is able to take an unknown number of unspecified fields, and concatenate them to populate the “record_values” column… (because the “error_description” returned by the Validator Step does not always contain this information). It might be possible with the JS step, but I got to investigate

    I think I might post about it in the forum to get some opinions, but there are many problems to solve in this area ;)

    - Etienne

  • Slawomir Chodnicki

    Hi Etienne,

    I compiled my thoughts about this topic into another post. I hope it is useful to you :)
    http://type-exit.org/adventures-with-open-source-bi/2010/08/data-validation-and-monitoring-with-pentaho-kettle/

    Cheers

    Slawo

  • Anonymous

    Hey it is a nice one to remove duplicate rows !!!!
    But is their any solution for removal of duplicates entries from csv file.

    -Ranu.. :)

  • Slawomir Chodnicki

    Hey there,

    You can probably read the csv, sort, then use unique rows, then write an output file to do that.

    Cheers
    Slawo

  • DJ

    It would be great if some one could clarify whether we can use variables in start and end scripts of Modified Java Script.
    In simple terms, I have an input field for MJS and I just want to use the input field in start and end scripts. I am getting the below error message though.
    2013/08/29 08:41:03 – Modified Java Script Value 3 2.0 – ERROR (version 4.2.1-stable, build 15952 from 2011-10-25 15.27.10 by buildguy) : Error processing start script
    2013/08/29 08:41:03 – Modified Java Script Value 3 2.0 – ERROR (version 4.2.1-stable, build 15952 from 2011-10-25 15.27.10 by buildguy) : ReferenceError: “result” is not defined. (trans_Start#2)

  • Anonymous

    hi,i have a problem,how to add some blank

  • LeoG

    What does setEnvironmentVar do? I’m trying to write a load to set windows environment variables from a csv file.

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>