Sometimes it would be nice to access a Java library directly from Kettle. You might find it useful for validation, lookup or custom cryptography support, just to give a few examples. Sometimes even basic access to data is not as straightforward as getting a file dump or a using a database connection. Some data sources might be encapsulated in an application, and the only way to get your hands on it, is using a custom Java client. This article explains how you can directly utilize your Java classes in Kettle a.k.a. PDI.
NOTE: If you’re using Kettle 4.0 or later, you also have the option to use the new User Defined Java Class step.
The Modified Java Script Value Step
The key to using Java in Kettle transformations is the Modified Java Script Value step. While the name of the step implies that it executes JavaScript only, it is just es well possible to execute Java code. To understand why this is the case we need to notice that the step is implemented using the Rhino JavaScript Engine. Rhino makes it possible to execute JavaScript in a JVM, but also makes it possible for the scripts to access Java classes. To illustrate how this works, let’s look at a simple transformation that uses Java to calculate an MD5 hash for a field. You can download the md5 example transformation, if you like. Previewing the output gives the following picture.
The MD5 calculation is happening in the javascript step. It contains the following code.
// simulate java imports
var MessageDigest = java.security.MessageDigest;
var String = java.lang.String;
var StringBuffer = java.lang.StringBuffer;
var Integer = java.lang.Integer;
// get the md5 digest algorithm
var algorithm = MessageDigest.getInstance("MD5");
// get the input as bytes
var bytes = new String(test_value).getBytes("UTF-8");
// calculate the digest
algorithm.reset();
algorithm.update(bytes);
var md5Digest = algorithm.digest();
// turn the digest into a hex-string representation
var hexString = new StringBuffer();
for (var i = 0; i < md5Digest.length; i++){
var hex = Integer.toHexString(0xFF & md5Digest[i]);
if (hex.length() == 1){
hexString.append('0');
}
hexString.append(hex);
}
// write output value
var md5_hash = hexString.toString().toUpperCase();
Let’s go through this JavaScript section by section.
The first section looks like a classic Java import section. In fact it serves the same purpose as it does in Java. Declaring variables which point to Java classes saves us the trouble of having to reference each Java class by its full name later on. When working with your own scripts you should be careful to not accidentally overwrite JavaScript classes using this technique. JavaScript has its own Date class for instance, and you could easily overwrite it without really intending to.
The following computation of the MD5 digest value consists of rather straightforward Java API calls. One thing to point out is that our input fields are given to us as JavaScript objects. The test_value field is a JavaScript string in our example. You can convert it to a Java string using many different ways. I chose a simple String constructor for clarity. The resulting Java string has the getBytes() method we need for further processing.
After the digest has been computed, the code engages in a small loop to generate a string representation of the digest, which is ultimately written to the row as a new field.
Using External Libraries
Executing too much scripted code might impact performance, so it is always desirable to let complex things be done by compiled Java code. Usually there is some nice Java library out there that does just what we want. Maybe we even created it ourselves. Would it not be great if we could ask Kettle to make it available to us? Maybe a few calls to an external library save us the effort to write an entire plugin. Kettle will load any jar files present in its libext folder and its subfolders during startup, so if you want to access classes from a custom jar, you could place the jar there. Considering our MD5 example, I notice that Kettle 3.2.0 already ships with a jar of the apache commons codec project. The codec project has a nice method to calculate an MD5 hex string from an input string. Exactly what we need. Since the jar is already in place, we can simply use the DigestUtil class from the commons project to replace our elaborate JavaScript code with a single call to DigestUtils.
// get a nice md5 hash var md5_hash = org.apache.commons.codec.digest.DigestUtils.md5Hex(test_value);
On my laptop the new transformation step processes around 10,500 rows per second. The first version with the more elaborate code merely manages around 2,900 rows per second.
The difference in processing speed is quite significant, so using encapsulated functionality really seems to pay off. Then again, maybe the apache project uses an entirely different MD5 implementation. In any case, the new code is certainly running faster and is easier to understand. You can download the commons md5 example transformation, if you like.
Conclusion
The Modified Java Script Value step allows access to arbitrary Java classes present in the JVM. If your ETL process has special needs, that are already adressed by a Java implementation, you might consider giving it a try. Always be on the lookout for performance issues when working with scripted code, however. Also be sure to have a proper dependency management and deployment process in place if you are using a lot of external libraries. Issues might arise if you don’t have an automated process to keep consistent versions of the external libraries on the developer workstations, test systems and production systems.







Hi,
I need to include import sun.misc.BASE64Encoder in Modified java script. While doing this, I am getting error “Reference error:sun is not defined”.
kindly suggest me a soln. Thanks in advance.
Hi Harris,
assuming you have a string field “foo” in your row stream, you can use the following code to encode it
var enc = new Packages.sun.misc.BASE64Encoder();
var encodedFoo = enc.encode(new java.lang.String(foo).getBytes());
Cheers
Slawo
Hi,
I haved followed your tuto and work fine for me, thanks a lot!
But I have a question, in my java custom class I connect to a database and do some tasks with data. On PDI all are right but now I have created a xaction on Pentaho Design Studio and I have launched on the BI Server.
My custom class has been packed into a jar, uses hibernate to access to data. I have placed all the needed jars in the BI Server classpath. But when I run the xaction the Tomcat log thows an Exception with the Configuration resource: /hibernate.cfg.xml, saying that can not find it.
I have try to use a jndi connection in order to not use a hibernate.cfg.xml but now say that it can not find the jndi connection.
¿Could someone help?
Thanks in advance
Jose,
it looks like there are some classloader issues involved. Have you tried packaging the hibernate.cfg.xml into the jar file?
If all else fails I’d recommend trying the following:
Make your custom class accept a JDBC database connection as a parameter
Make sure the database you need to connect to is configured in PDI
Use the PDI API to get a connection and pass it to your class to work with it
I’d recommend checking Kettle’s JobEntrySQL.java lines 250ff to see how Kettle manages that internally. Once you got a Database object, you can call .connect() and .getConnection() to get the JDBC connection. You should be able to call this.getTransMeta().findDatabase(“MyDBName”) to get the DatabaseMeta you need for the constructor of Database() by name.
Hope it makes some sense. Dig into the Kettle classes and you’ll get the hang of it
Cheers
Slawo
Hi,
Could any one please tell us about the optimization level used in Modified javascript of pentaho v4.2.1 and also the advantages of using v4.2.1 over v3.2.0