Using Java in Pentaho Kettle

Sometimes it would be nice to access a Java library directly from Kettle. You might find it useful for  validation, lookup or custom cryptography support, just to give a few examples. Sometimes even basic access to data is not as straightforward as getting a file dump or a using a database connection. Some data sources might be encapsulated in an application, and the only way to get your hands on it, is using a custom Java client. This article explains how you can directly utilize your Java classes in Kettle a.k.a. PDI.

NOTE: If you’re using Kettle 4.0 or later, you also have the option to use the new User Defined Java Class step.

The Modified Java Script Value Step

The key to using Java in Kettle transformations is the Modified Java Script Value step. While the name of the step implies that it executes JavaScript only, it is just es well possible to execute Java code. To understand why this is the case we need to notice that the step is implemented using the Rhino JavaScript Engine. Rhino makes it possible to execute JavaScript in a JVM, but also makes it possible for the scripts to access Java classes. To illustrate how this works, let’s look at a simple transformation that uses Java to calculate an MD5 hash for a field. You can download the md5 example transformation, if you like. Previewing the output gives the following picture.

The MD5 calculation is happening in the javascript step. It contains the following code.

// simulate java imports
var MessageDigest = java.security.MessageDigest;
var String = java.lang.String;
var StringBuffer = java.lang.StringBuffer;
var Integer = java.lang.Integer;

// get the md5 digest algorithm
var algorithm = MessageDigest.getInstance("MD5");

// get the input as bytes
var bytes = new String(test_value).getBytes("UTF-8");

// calculate the digest
algorithm.reset();
algorithm.update(bytes);
var md5Digest = algorithm.digest();

// turn the digest into a hex-string representation
var hexString = new StringBuffer();

for (var i = 0; i < md5Digest.length; i++){
	var hex = Integer.toHexString(0xFF & md5Digest[i]);
	if (hex.length() == 1){
		hexString.append('0');
	}
	hexString.append(hex);
}

// write output value
var md5_hash = hexString.toString().toUpperCase();

Let’s go through this JavaScript section by section.

The first section looks like a classic Java import section. In fact it serves the same purpose as it does in Java. Declaring variables which point to Java classes saves us the trouble of having to reference each Java class by its full name later on. When working with your own scripts you should be careful to not accidentally overwrite JavaScript classes using this technique. JavaScript has its own Date class for instance, and you could easily overwrite it without really intending to.

The following computation of the MD5 digest value consists of rather straightforward Java API calls. One thing to point out is that our input fields are given to us as JavaScript objects. The test_value field is a JavaScript string in our example. You can convert it to a Java string using many different ways. I chose a simple String constructor for clarity. The resulting Java string has the getBytes() method we need for further processing.

After the digest has been computed, the code engages in a small loop to generate a string representation of the digest, which is ultimately written to the row as a new field.

Using External Libraries

Executing too much scripted code might impact performance, so it is always desirable to let complex things be done by compiled Java code. Usually there is some nice Java library out there that does just what we want. Maybe we even created it ourselves. Would it not be great if we could ask Kettle to make it available to us? Maybe a few calls to an external library save us the effort to write an entire plugin. Kettle will load any jar files present in its libext folder and its subfolders during startup, so if you want to access classes from a custom jar, you could place the jar there. Considering our MD5 example, I notice that Kettle 3.2.0 already ships with a jar of the apache commons codec project. The codec project has a nice method to calculate an MD5 hex string from an input string. Exactly what we need. Since the jar is already in place, we can simply use the DigestUtil class from the commons project to replace our elaborate JavaScript code with a single call to DigestUtils.

// get a nice md5 hash
var md5_hash = org.apache.commons.codec.digest.DigestUtils.md5Hex(test_value);

On my laptop the new transformation step processes around 10,500 rows per second. The first version with the more elaborate code merely manages around 2,900 rows per second.

The difference in processing speed is quite significant, so using encapsulated functionality really seems to pay off. Then again, maybe the apache project uses an entirely different MD5 implementation. In any case, the new code is certainly running faster and is easier to understand. You can download the commons md5 example transformation, if you like.

Conclusion

The Modified Java Script Value step allows access to arbitrary Java classes present in the JVM. If your ETL process has special needs, that are already adressed by a Java implementation, you might consider giving it a try. Always be on the lookout for performance issues when working with scripted code, however. Also be sure to have a proper dependency management and deployment process in place if you are using a lot of external libraries. Issues might arise if you don’t have an automated process to keep consistent versions of the external libraries on the developer workstations, test systems and production systems.

Related Literature

10 comments to Using Java in Pentaho Kettle

  • Harris

    Hi,
    I need to include import sun.misc.BASE64Encoder in Modified java script. While doing this, I am getting error “Reference error:sun is not defined”.

    kindly suggest me a soln. Thanks in advance.

  • Slawomir Chodnicki

    Hi Harris,

    assuming you have a string field “foo” in your row stream, you can use the following code to encode it

    var enc = new Packages.sun.misc.BASE64Encoder();
    var encodedFoo = enc.encode(new java.lang.String(foo).getBytes());

    Cheers

    Slawo

  • Jose

    Hi,
    I haved followed your tuto and work fine for me, thanks a lot!
    But I have a question, in my java custom class I connect to a database and do some tasks with data. On PDI all are right but now I have created a xaction on Pentaho Design Studio and I have launched on the BI Server.

    My custom class has been packed into a jar, uses hibernate to access to data. I have placed all the needed jars in the BI Server classpath. But when I run the xaction the Tomcat log thows an Exception with the Configuration resource: /hibernate.cfg.xml, saying that can not find it.

    I have try to use a jndi connection in order to not use a hibernate.cfg.xml but now say that it can not find the jndi connection.

    ¿Could someone help?

    Thanks in advance

  • Slawomir Chodnicki

    Jose,

    it looks like there are some classloader issues involved. Have you tried packaging the hibernate.cfg.xml into the jar file?
    If all else fails I’d recommend trying the following:

    Make your custom class accept a JDBC database connection as a parameter
    Make sure the database you need to connect to is configured in PDI
    Use the PDI API to get a connection and pass it to your class to work with it

    I’d recommend checking Kettle’s JobEntrySQL.java lines 250ff to see how Kettle manages that internally. Once you got a Database object, you can call .connect() and .getConnection() to get the JDBC connection. You should be able to call this.getTransMeta().findDatabase(“MyDBName”) to get the DatabaseMeta you need for the constructor of Database() by name.

    Hope it makes some sense. Dig into the Kettle classes and you’ll get the hang of it :)

    Cheers
    Slawo

  • Thilan

    Hi,

    Could any one please tell us about the optimization level used in Modified javascript of pentaho v4.2.1 and also the advantages of using v4.2.1 over v3.2.0

  • Ramya Toshniwal

    Hi Slawomir,

    I have a code running perfectly fine in VBA. I need to automate this code through kettle. Is there any way by which I can simply embed the written code in any of the kettle step.

    Thanks,
    Ramya

  • Slawomir Chodnicki

    Hey Ramya,

    The only thing that comes to mind is that maybe you can find a way to call a VBA script on the command line. I don’t know of any way to run VBA within PDI.

    May be good to check with the tech experts at stackoverflow: http://stackoverflow.com/questions/6833588/calling-vba-code-from-java

    Best
    Slawo

  • Carey

    Is there a performance difference when calling a compiled class from a JAR in JAVASCRIPT or the user defined Java Class? Specifically in your example calling that Apache Commons-MD5 has routine – if that returns a result and that result is passed into the data stream what is the likely (% hit) using javascript to instantiate that Class instance and store in a variable vs User Defined Java Class?

  • Slawomir Chodnicki

    Hey Carey,

    the java class will always be faster because it is actually compiled to bytecode and run, whereas the javascript step is “compiled” to an execution tree with different degrees of optimization. The call to a jar class would not be costly in and of its own in either case, what does have an effect is the aspect of the field values that are brought into the local scope of the javascript, regardless of whether they are actually used. In the java class, there is no overhead in making field values available.

    I hope it makes sense.

    Cheers
    Slawo

  • cpr

    This blog post ranks high for the Google search “Kettle md5sum”. So I would like to mention here that another solution was found: The “Calculator” step has CRC-32, MD5, SHA-1 and other checksums.

    (This is in no way meant to diminish the usefulness of this post explaining how to “jump out” into the VM!)

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>