Developing a custom Kettle Plugin: A Simple Transformation Step

Sometimes when doing ETL work, custom processing tasks unique to the project’s context arise. Custom processing often has to do with master data management, validation, or getting data from non-mainstream data sources such as key-value stores or other NoSQL solutions. Looking at the plugin-enabled architecture of Pentaho Kettle, you might be wondering how to create a Kettle plugin of your own. Coding up a custom plugin can be a good idea if it helps to solve a problem in an elegant, reliable and generic way. If it is generic enough in its design, it might also be a valuable addition to the open source community plugins.

This article shows how to develop a simple plugin which provides a custom transformation step for Kettle 4.0. The transformation step should accept any row stream and append a string field at the end, filling it with a fixed value. The user should be able to define the name of the added field. For starters, that should be enough. Keeping the step functionality at a minimum allows me to explain how the plugin interfaces with Kettle with as little distraction as possible.
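Before looking at any Kettle classes, the intended behavior can be sketched in plain Java. This is only an illustration under simplifying assumptions: a row is modelled as a plain Object[], and the names AppendFieldSketch, appendField and appendFieldName are hypothetical and not part of Kettle or of the example plugin.

```java
import java.util.Arrays;

// Hypothetical stand-in: a row is modelled as a plain Object[]; no Kettle classes are used.
class AppendFieldSketch {

    // Returns a copy of the incoming row with one extra string value at the end.
    static Object[] appendField(Object[] row, String value) {
        Object[] out = Arrays.copyOf(row, row.length + 1);
        out[row.length] = value;
        return out;
    }

    // Returns a copy of the field names with the user-chosen name appended.
    static String[] appendFieldName(String[] names, String newName) {
        String[] out = Arrays.copyOf(names, names.length + 1);
        out[names.length] = newName;
        return out;
    }

    public static void main(String[] args) {
        String[] names = appendFieldName(new String[] {"id", "city"}, "demo_field");
        Object[] row = appendField(new Object[] {1L, "Berlin"}, "dummy value");
        System.out.println(Arrays.toString(names) + " -> " + Arrays.toString(row));
    }
}
```

The incoming row is copied rather than modified, so the sketch works for any row length. The actual step has to do the same two things: extend the row structure by one field and extend each row by one value.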

Prerequisites

In order to develop a working plugin you technically only need Kettle 4.0 CE. Get your copy here, in case you don’t have one yet. I will be using the RC1 version for this article. The stable release of Kettle 4.0 CE is not available yet as I write. But it seems only a few days away.

Downloading the source package as well is not mandatory, but highly recommended. If you find yourself wondering how a certain feature is implemented in an existing transformation step, all you need to do is examine the step’s Java source. There’s no better place to learn plugin development for Kettle than the original sources.

I will be using a bare-bones “Eclipse 3.5 for Java Developers” to guide you through the code. It is the standard installation, no bells, no whistles.

Get the example step sources discussed in this article. I would like to think of the provided example as a template step, since it is a good starting point for developing something more useful.

The Plugin as an Eclipse Project

The sources of the example plugin come in the form of an Eclipse project. After importing it into your workspace, you need to satisfy a few external dependencies, namely the jars from Kettle’s lib folder as well as the SWT library from Kettle’s libswt folder. Choose the swt.jar for the system you are developing on. Once the compile-time dependencies are satisfied, the project should be error free.

Before digging into the plugin classes, you should convince yourself that the plugin really works. It can be installed by simply generating a jar file from the Java classes and copying it into Kettle’s plugins/steps folder. Finally, you need to copy the plugin.xml and icon.png files from the plugin folder of the project. (In real-life scenarios this should be done by a build script, of course.) The screencast shows the entire process (you might want to watch in full screen mode), and also shows the installed step in action. It really should add a new field and fill it with a dummy value.

How a Kettle step works

Four classes make up the Kettle step. Each has a specific purpose and an important role to play.

  • TemplateStep: the step class implements the StepInterface. Instances of this class do the actual row processing when the transformation runs. Each thread of execution is represented by an instance of this class. It is given instances of the data and meta classes, when executed.
  • TemplateStepData: the data class is used for storing data unique to a thread of execution, when the plugin runs. This is where database connections, file handles, caches and other things needed during execution are stored.
  • TemplateStepMeta: the meta class implements the StepMetaInterface. It is responsible for holding and serializing the settings chosen for a particular instance of the step. For our template step it holds the step name and the name of the output field.
  • TemplateStepDialog: the dialog class implements the user interface of the step. It shows a dialog allowing the user to configure the step’s behavior to their liking. This dialog class is closely related to the meta class which keeps track of the chosen settings.

In addition to the code, there is the plugin.xml file which wires the step into Kettle by specifying its meta class. It also defines the visual appearance of the step and the category it is listed in.

To understand how everything fits together, it helps to follow the user’s perspective: as users we first see the appearance of a step, then drag the step onto the canvas and specify its settings, and finally let the transformation run. Accordingly, I’d like to start with the plugin.xml file, continue with the meta and dialog classes, and finish by examining the step and data classes.

Writing the plugin.xml file

The plugin.xml file is rather simple for most plugins. Its main function is to tell Kettle about the implementation (meta) class, the plugin name and description (localized) as well as additional jar files to load. Be sure to check out the reference article about plug-in loading for details. For our little template step the settings are straightforward.

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="TemplatePlugin"
   iconfile="icon.png"
   description="Template Plugin"
   tooltip="Only there for demonstration purposes"
   category="Demonstration"
   classname="plugin.template.TemplateStepMeta">
   <libraries>
      <library name="templatestep.jar"/>
   </libraries>
</plugin>

The ID for the plugin should be globally unique and should not be changed, as it is used in serialization. The icon file is a PNG image that represents the step. “Description” is the name of the step as it appears in the tree menu. The “tooltip” appears on mouse rollover in the tree menu. “Category” is the name of the tree folder the plugin appears in. Our little example goes into a folder labeled “Demonstration”. For the “classname” Kettle expects the fully qualified name of the meta class. The library section specifies the jars to load for this plugin. In our example this is only a single jar file.

This short XML file wires the plugin into Kettle and makes it accessible through the user interface. Time to move on to the meta class.

The Meta Class

The TemplateStepMeta class has the following main responsibilities, and methods to deal with them.

// keep track of the step settings
public String getOutputField()
public void setOutputField(...)
public void setDefault()

// serialize the step settings to and from xml
public String getXML()
public void loadXML(...)

// serialize the step settings to and from a kettle repository
public void readRep(...)
public void saveRep(...)

// provide information about how the step affects the field structure of processed rows
public void getFields(...)

// perform extended validation checks for the step
public void check(...)

// provide instances of the step, data and dialog classes to Kettle
public StepInterface getStep(...)
public StepDataInterface getStepData()
public StepDialogInterface getDialog(...)

Please note that the meta class stores the name of the output field of the step in a private member named outputField.

There are more aspects to the meta class. Most of them are covered by default implementations in BaseStepMeta, which is the parent class of TemplateStepMeta. The default implementations work just fine for our template step. Check out the Javadoc for StepMetaInterface and BaseStepMeta for more details and possibilities.
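To make the first two responsibilities more tangible, here is a minimal sketch of a settings holder that round-trips its single setting through XML. It is a simplified stand-in, not the code of TemplateStepMeta: the class name MetaSketch, the tag name outputfield and the default value are assumptions, it uses the JDK’s DOM parser instead of Kettle’s XML helpers, and loadXML takes a string where the real method receives a parsed node.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for a step meta class; it does not extend BaseStepMeta.
class MetaSketch {
    private String outputField;

    MetaSketch() {
        setDefault();
    }

    String getOutputField() {
        return outputField;
    }

    void setOutputField(String outputField) {
        this.outputField = outputField;
    }

    // Assumed default name, chosen for this sketch only.
    void setDefault() {
        outputField = "template_outfield";
    }

    // Serializes the single setting as one XML element (the tag name is an assumption).
    String getXML() {
        return "<outputfield>" + escape(outputField) + "</outputfield>";
    }

    // Restores the setting from an XML fragment that contains the element written by getXML().
    void loadXML(String stepXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(stepXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("outputfield");
            if (nodes.getLength() == 0) {
                throw new IllegalArgumentException("missing <outputfield> element");
            }
            outputField = nodes.item(0).getTextContent();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalArgumentException("unable to read step settings", e);
        }
    }

    // Escapes the characters that would otherwise break the element content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        MetaSketch saved = new MetaSketch();
        saved.setOutputField("my_field");
        MetaSketch loaded = new MetaSketch();
        loaded.loadXML("<step>" + saved.getXML() + "</step>");
        System.out.println(loaded.getOutputField());
    }
}
```

The point of the sketch is the symmetry: whatever getXML() writes, loadXML() must be able to read back, and setDefault() defines the state of a freshly created step. The repository methods readRep and saveRep follow the same pattern against a different storage.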

The Dialog Class

The TemplateStepDialog class implements the settings dialog for the template step. Kettle uses the Eclipse SWT framework for its user interface widgets, so most of the code is SWT code which you need to become familiar with before producing complex dialogs yourself. The SWT documentation is available from inside Eclipse (use the help menu) and online. The source of the other plugins should be your guide and inspiration when designing new settings dialogs.

During construction, a dialog object is given a step meta object, which it should read the settings from when opened, and which it should save the settings to when confirmed. For the template step the only setting is the name of the output field. A custom step dialog derived from BaseStepDialog must provide an open(…) method. This method must return the (possibly changed) name of the step, or null if the dialog has been cancelled.

The Step Class

The step class is doing all the actual processing and transformation work. Since most of the boilerplate code is provided by the BaseStep parent class, most plugins focus on only a few implementation specific methods.

// initialization and teardown
public boolean init(...)
public void dispose(...)

// processing rows
public void run()
public boolean processRow(...)

The method init() is called by Kettle before any step of the transformation starts. The transformation run will only start after all steps have returned successfully from their init() call. The template step does nothing there, but since it is a common place for doing initialization work, I kept it in the code. Similarly, dispose() is called after the step is done. The step should close resources such as file handles and caches when dispose() is called.

Kettle calls the run() method when it is time to actually process rows. A common implementation of run() would call processRow() in a tight loop, until there is nothing more to process, or the transformation has been stopped.

The method processRow() is called to process a single row of data. This method usually has a getRow() call to get a row to process. This call will block if necessary, for example if the step is getting rows at a slow rate. Subsequently processRow() would do its transformation work and call putRow(), which passes the row downstream. Your steps will need to examine and change the row structure. To do that safely and conveniently, be sure to get familiar with the package org.pentaho.di.core.row, especially with RowMetaInterface and RowDataUtil.

The BaseStep implementation conveniently provides a first flag to indicate that the row being processed is the first one. This can be useful for executing certain code only once, for example an expensive lookup whose result is going to be cached.
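The interplay of run(), processRow(), getRow(), putRow() and the first flag can be sketched without Kettle on the classpath. Everything in the following block is a hypothetical stand-in: StepSketch is not a BaseStep subclass, its getRow() reads from an in-memory queue and never blocks, and its putRow() simply collects rows in a list. Only the control flow is meant to mirror the description above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a step class; the real one extends BaseStep and gets rows from Kettle.
class StepSketch {
    private final Queue<Object[]> input = new ArrayDeque<>();
    final List<Object[]> output = new ArrayList<>();
    private boolean first = true; // mirrors the "first" flag described above
    int setupCalls = 0;           // counts how often the one-time setup ran
    private int outputIndex;      // position of the appended field, computed once

    StepSketch(List<Object[]> rows) {
        input.addAll(rows);
    }

    // Stand-in for getRow(): returns null once the input is exhausted (this version never blocks).
    private Object[] getRow() {
        return input.poll();
    }

    // Stand-in for putRow(): hands the row to whatever comes next.
    private void putRow(Object[] row) {
        output.add(row);
    }

    // Processes one row; returns false when there is nothing left to do.
    boolean processRow() {
        Object[] row = getRow();
        if (row == null) {
            return false;
        }
        if (first) {
            first = false;
            setupCalls++;
            outputIndex = row.length; // one-time work: remember where the new field goes
        }
        Object[] out = Arrays.copyOf(row, outputIndex + 1);
        out[outputIndex] = "dummy value";
        putRow(out);
        return true;
    }

    // Stand-in for run(): call processRow() until it reports that it is done.
    void run() {
        while (processRow()) {
            // keep going until processRow() returns false
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {1L, "a"});
        rows.add(new Object[] {2L, "b"});
        StepSketch step = new StepSketch(rows);
        step.run();
        for (Object[] row : step.output) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The one-time setup inside the first branch runs exactly once no matter how many rows pass through, which is the reason the flag exists. A real step would additionally check whether the transformation has been stopped.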

The Data Class

Most steps require some kind of storage for caching indices or temporary data. The proper place to put it is the data class. Each thread of execution will get its own instance of the data class, so it can run in a self-contained way. The boilerplate code is provided by the BaseStepData class. Think of the data class as the per-thread storage of your step plugin. As a rule of thumb, avoid adding non-constant fields to your step class. Whatever you need to store is probably better placed in the data class.

The template step uses a data object to store the output structure of the rows it processes. It does not require any other storage.
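The idea of per-thread storage can be illustrated with a small, Kettle-free sketch. DataSketch and CopySketch are hypothetical names: the first plays the role of the data class, the second the role of one running copy of a step. Each copy is handed its own data object, so the copies never share mutable state.

```java
// Hypothetical stand-ins, not Kettle classes: DataSketch plays the role of the data class,
// CopySketch the role of one running copy of the step.
class DataSketch {
    int rowsSeen; // mutable per-copy state lives here, not in the step class
}

class CopySketch implements Runnable {
    private final DataSketch data; // handed in from outside, one instance per copy
    private final int rowsToProcess;

    CopySketch(DataSketch data, int rowsToProcess) {
        this.data = data;
        this.rowsToProcess = rowsToProcess;
    }

    @Override
    public void run() {
        for (int i = 0; i < rowsToProcess; i++) {
            data.rowsSeen++; // needs no locking: no other thread touches this instance
        }
    }

    // Runs one thread per entry in rowCounts and returns each thread's own data object.
    static DataSketch[] runCopies(int... rowCounts) {
        DataSketch[] data = new DataSketch[rowCounts.length];
        Thread[] threads = new Thread[rowCounts.length];
        for (int i = 0; i < rowCounts.length; i++) {
            data[i] = new DataSketch();
            threads[i] = new Thread(new CopySketch(data[i], rowCounts[i]));
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return data;
    }

    public static void main(String[] args) {
        DataSketch[] result = runCopies(1000, 2500);
        System.out.println(result[0].rowsSeen + " " + result[1].rowsSeen);
    }
}
```

Because every thread increments only its own counter, the results are deterministic without any synchronization. Had the counter been a shared field, the copies would have interfered with each other.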

Conclusion

A Kettle step plugin consists of four classes, each with its own roles and responsibilities. The meta, dialog, step and data classes fit together nicely, especially since a lot of the boilerplate code and common tasks are already implemented by the respective base classes. The open source nature of Kettle makes it possible to look at the techniques used by other developers, so if you find yourself stuck at some point, go ahead and look at the sources of some other steps. The answers are usually there.

If you want more detailed information take a look at another article which explains the code of a more advanced plugin: the Voldemort lookup plugin.

Other sources of help are the PDI wiki page and the forums at forums.pentaho.org. Especially check out the PDI users forum and PDI developers forum if you need help developing your custom plugin.

Comments and corrections are welcome.

Slawo

11 comments to Developing a custom Kettle Plugin: A Simple Transformation Step

  • Youngwoo Kim

    Hi Slawo,
    Really nice post! If you post follow-ups on advanced topics for developing plugins, it would be very useful!

  • Slawomir Chodnicki

    Hi there,

    I am planning to follow up with an article on a step that extracts values from the Voldemort key-value store. Another thing I am thinking about is a step that would do GeoIP lookup based on maxmind’s open source country database. Maybe I’ll have time to work on these tonight so stay tuned :)

  • Fady

    This is an excellent post. Greatly helps jump start custom plugins development.

    As an aside, it would be very helpful if you could talk about testing plugins. Looking at the PDI code, there are classes such as TransformationTestCase and TransTestFactory which look like the premise of a testing framework, but it’s not obvious how they are meant to be used.

    Again, thank you for your work.

    Fady

  • Slawomir Chodnicki

    Hi Fady,

    thanks for commenting, and special thanks for suggesting an article topic. Stay tuned :)
    In the meantime you might want to check out the automated tests that are part of the Kettle source (they are in the test folder). There are unit tests checking up on individual classes, and blackbox tests that compare transformation results with gold data.

    Cheers

    Slawo

  • This is the best d*** tutorial on writing a Kettle Step plugin there is !

    Miles better than the original wiki at http://wiki.pentaho.com/display/EAI/Writing+your+own+Pentaho+Data+Integration+Plug-In

    I would have saved quite a bunch of hours had I found this article the first time ;-)

    Thanks Slawomir !!

    Hey, you’re the one who created the Voldemort plugin youtube videos! :-)

  • Slawomir Chodnicki

    Hi Hendy,

    Thanks for the comment! Yes, I did the Voldemort plugin demo and uploaded some videos :)
    I also did not find any coherent introduction to plugin development, so after finding things out the hard way, I decided to give it a shot myself. Glad to see it is used :)

    Cheers

    Slawo

  • KettleUser

    Hi

    I am trying to write a plug-in (step), in which I need to make a call to other Kettle Core steps. Is it possible to do this? Do you have any example for this?

    Thanks

  • Slawomir Chodnicki

    Hi there,

    while it is technically possible to call another step directly at runtime, I’d advise against that. Transformation steps operate in parallel in different threads and they are not supposed to communicate directly. Core steps don’t do that either. Their medium of communication is the row stream.

    However, if you really want to call other steps for some reason, you can ask the transformation object to give you instances of all steps. Call getDispatcher() on your step to get the Trans object and call getSteps() on that to get a list of all steps that are part of the transformation.

    Cheers

    Slawo

  • Chris

    Great article, thanks. The code still works perfectly with 4.2.1 btw.

  • Slawomir Chodnicki

    Good to hear ;)
