This article shows how to develop a simple plugin which provides a custom transformation step for Kettle 4.0. The transformation step should accept any row stream and append a string field at the end, filling it with a fixed value. The user should be able to define the name of the added field. For starters, that should be enough. Keeping the step functionality at a minimum allows me to explain how the plugin interfaces with Kettle with as little distraction as possible.
Prerequisites
In order to develop a working plugin you technically only need Kettle 4.0 CE. Get your copy here, in case you don’t have one yet. I will be using the RC1 version for this article. The stable release of Kettle 4.0 CE is not available yet as I write. But it seems only a few days away.
It is not mandatory, but of course highly recommended to download the source package, as well. If you find yourself wondering how a certain feature is implemented in an existing transformation step, all you need to do is examine the step’s Java source. There’s no better place to learn plugin development for Kettle than the original sources.
I will be using a bare-bones “Eclipse 3.5 for Java Developers” installation to walk through the code. It is the standard installation: no bells, no whistles.
Get the example step sources discussed in this article. I would like to think of the provided example as a template step, since it is a good starting point for developing something more useful.
The Plugin as an Eclipse Project
The sources of the example plugin come in the form of an Eclipse project. After importing it into your workspace, you need to satisfy a few external dependencies, namely the jars from Kettle’s lib folder as well as the SWT library from Kettle’s libswt folder. Choose the swt.jar for the system you are developing on. Once the compile-time dependencies are satisfied, the project should be error free.
Before digging into the plugin classes, you should convince yourself that the plugin really works. It can be installed by simply generating a jar file from the Java classes and copying it into Kettle’s plugins/steps folder. Finally, you need to copy the plugin.xml and icon.png files from the plugin folder of the project. (In real-life scenarios this should be done by a build script, of course.) The screencast shows the entire process (you might want to watch in full screen mode) and also shows the installed step in action. It really should add a new field and fill it with a dummy value.
How a Kettle step works
Four classes make up the Kettle step. Each has a specific purpose and an important role to play.
- TemplateStep: the step class implements the StepInterface. Instances of this class do the actual row processing when the transformation runs. Each thread of execution is represented by an instance of this class. When executed, it is given instances of the data and meta classes.
- TemplateStepData: the data class stores data unique to a thread of execution while the plugin runs. This is where database connections, file handles, caches and other things needed during execution are stored.
- TemplateStepMeta: the meta class implements the StepMetaInterface. It is responsible for holding and serializing the settings chosen for a particular instance of the step. For our template step it holds the step name and the name of the output field.
- TemplateStepDialog: the dialog class implements the user interface of the step. It shows a dialog allowing the user to configure the step’s behavior to their liking. This dialog class is closely related to the meta class which keeps track of the chosen settings.
In addition to the code, there is the plugin.xml file which wires the step into Kettle by specifying its meta class. It also defines the visual appearance of the step and the category it is listed in.
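As a rough illustration of how these classes relate, here is a minimal sketch in plain Java. The Wiring* classes are hypothetical stand-ins, not the Kettle classes, and the dialog is left out. The sketch assumes what the list above describes: one meta object holds the settings, and every running copy of the step gets its own data object.

```java
// Stand-ins only: these are NOT the Kettle classes, just their rough shape.
class WiringMeta {
    // The single setting of the template step (the default value is made up).
    private String outputField = "demo_field";

    String getOutputField() { return outputField; }
    void setOutputField(String outputField) { this.outputField = outputField; }

    // The meta class hands out the other objects on request.
    WiringData getStepData() { return new WiringData(); }
    WiringStep getStep(WiringData data) { return new WiringStep(this, data); }
}

class WiringData {
    // Per-copy scratch space (connections, caches, ...) would live here.
}

class WiringStep {
    final WiringMeta meta; // settings, shared by all copies
    final WiringData data; // private to this copy

    WiringStep(WiringMeta meta, WiringData data) {
        this.meta = meta;
        this.data = data;
    }
}

class WiringDemo {
    public static void main(String[] args) {
        WiringMeta meta = new WiringMeta();
        meta.setOutputField("greeting");

        // Two copies of the same step, as if it ran in two threads.
        WiringStep copy1 = meta.getStep(meta.getStepData());
        WiringStep copy2 = meta.getStep(meta.getStepData());

        System.out.println("same settings: " + (copy1.meta == copy2.meta));
        System.out.println("same data:     " + (copy1.data == copy2.data));
    }
}
```

The real getStep(...) and getDialog(...) methods take more arguments than shown here; the sketch only keeps what is needed to show who creates what.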
To understand how everything fits together, it helps to follow the user’s perspective: as users, we first see the appearance of a step, then drag the step onto the canvas and specify its settings, and finally let the transformation run. Accordingly, I’d like to start with the plugin.xml file, continue with the meta and dialog classes, and finish by examining the step and data classes.
Writing the plugin.xml file
The plugin.xml file is rather simple for most plugins. Its main function is to tell Kettle about the implementation (meta) class, the plugin name and description (localized) as well as additional jar files to load. Be sure to check out the reference article about plug-in loading for details. For our little template step the settings are straightforward.
<?xml version="1.0" encoding="UTF-8"?>
<plugin
    id="TemplatePlugin"
    iconfile="icon.png"
    description="Template Plugin"
    tooltip="Only there for demonstration purposes"
    category="Demonstration"
    classname="plugin.template.TemplateStepMeta">
    <libraries>
        <library name="templatestep.jar"/>
    </libraries>
</plugin>
The ID for the plugin should be globally unique and should not be changed, as it is used in serialization. The icon file is a PNG image. “Description” is the name of the step as it appears in the tree menu. The “tooltip” appears on mouse rollover in the tree menu. “Category” is the name of the tree folder the plugin appears in. Our little example goes into a folder labeled “Demonstration”. For the “classname” Kettle expects the fully qualified name of the meta class. The library section specifies the jars to load for this plugin. In our example this is only a single jar file.
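To make the attributes a little more tangible, the following sketch reads the file with the standard Java DOM parser and prints what it finds. This is not how Kettle itself loads plugins; the class name PluginXmlPeek is made up for this illustration, and the XML is embedded as a string so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; Kettle has its own plugin loader.
class PluginXmlPeek {
    static final String PLUGIN_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<plugin id=\"TemplatePlugin\" iconfile=\"icon.png\""
        + " description=\"Template Plugin\""
        + " tooltip=\"Only there for demonstration purposes\""
        + " category=\"Demonstration\""
        + " classname=\"plugin.template.TemplateStepMeta\">"
        + "<libraries><library name=\"templatestep.jar\"/></libraries>"
        + "</plugin>";

    // Parses the XML text and returns the root <plugin> element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse plugin.xml", e);
        }
    }

    public static void main(String[] args) {
        Element plugin = parseRoot(PLUGIN_XML);
        // The attributes discussed above.
        for (String name : new String[] {"id", "description", "tooltip", "category", "classname"}) {
            System.out.println(name + " = " + plugin.getAttribute(name));
        }
        // The jars listed in the <libraries> section.
        NodeList libs = plugin.getElementsByTagName("library");
        for (int i = 0; i < libs.getLength(); i++) {
            System.out.println("library = " + ((Element) libs.item(i)).getAttribute("name"));
        }
    }
}
```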
This short xml file wires the plugin into Kettle and makes it accessible through the user interface. Time to move on to the meta class.
The Meta Class
The TemplateStepMeta class has the following main responsibilities, and methods to deal with them.
// keep track of the step settings
public String getOutputField()
public void setOutputField(...)
public void setDefault()

// serialize the step settings to and from xml
public String getXML()
public void loadXML(...)

// serialize the step settings to and from a kettle repository
public void readRep(...)
public void saveRep(...)

// provide information about how the step affects the field structure of processed rows
public void getFields(...)

// perform extended validation checks for the step
public void check(...)

// provide instances of the step, data and dialog classes to Kettle
public StepInterface getStep(...)
public StepDataInterface getStepData()
public StepDialogInterface getDialog(...)
Please note that the meta class stores the name of the output field of the step in a private member named outputField.
There are more aspects to the meta class. Most of them are covered with a default implementation in BaseStepMeta, which is the parent class of TemplateStepMeta. The default implementations are working just fine for our template step. Check out the javadoc for StepMetaInterface, and BaseStepMeta for more details and possibilities.
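The XML serialization pair is perhaps the easiest part to picture. Below is a minimal sketch of such a round trip using only the standard Java XML parser. SketchStepMeta is a stand-in and does not extend BaseStepMeta; the tag name outputfield, the default value demo_field and the helper parseStepNode are assumptions made for this example. The real loadXML(...) takes further arguments, which are elided in the listing above and left out here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Stand-in for the meta class; NOT the Kettle implementation.
class SketchStepMeta {
    private String outputField;

    public String getOutputField() { return outputField; }
    public void setOutputField(String outputField) { this.outputField = outputField; }

    // Sensible settings for a freshly created step (value is made up).
    public void setDefault() { outputField = "demo_field"; }

    // Serialize the settings as an XML fragment.
    public String getXML() {
        return "<outputfield>" + escape(outputField) + "</outputfield>";
    }

    // Restore the settings from the XML node that holds the fragment.
    public void loadXML(Node stepNode) {
        NodeList children = stepNode.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if ("outputfield".equals(child.getNodeName())) {
                outputField = child.getTextContent();
            }
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Hypothetical helper: wraps a settings fragment in a <step> element and parses it.
    static Node parseStepNode(String settingsXml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader("<step>" + settingsXml + "</step>")))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse step settings", e);
        }
    }
}

class SketchStepMetaDemo {
    public static void main(String[] args) {
        SketchStepMeta saved = new SketchStepMeta();
        saved.setOutputField("greeting");

        SketchStepMeta loaded = new SketchStepMeta();
        loaded.loadXML(SketchStepMeta.parseStepNode(saved.getXML()));

        System.out.println(saved.getXML());
        System.out.println("restored: " + loaded.getOutputField());
    }
}
```

Whatever getXML() writes, loadXML(...) must be able to read back, which is why a round-trip check like the one in main is a handy first test for any meta class.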
The Dialog Class
The TemplateStepDialog class implements the settings dialog for the template step. Kettle uses the Eclipse SWT framework for its user interface widgets, so most of the code is SWT code, which you need to become familiar with before producing complex dialogs yourself. The SWT documentation is available from inside Eclipse (use the help menu) and online. The source of the other plugins should be your guide and inspiration when designing new settings dialogs.
During construction, a dialog object is given a step meta object. It reads the settings from this object when the dialog is opened and saves them back to it when the dialog is confirmed. For the template step the only setting is the name of the output field. A custom step dialog derived from BaseStepDialog must provide an open(…) method. This method must return the (possibly changed) name of the step, or null if the dialog has been cancelled.
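The contract of the dialog can be sketched without any SWT code. In the following illustration the user's input is simply passed in as parameters, whereas a real dialog would build widgets and run an event loop. DialogSketch and DialogSketchMeta are hypothetical names, and nothing here comes from BaseStepDialog.

```java
// Stand-in for the meta class, holding the one setting of the template step.
class DialogSketchMeta {
    private String outputField = "demo_field"; // made-up default

    String getOutputField() { return outputField; }
    void setOutputField(String outputField) { this.outputField = outputField; }
}

// Stand-in for the dialog: no widgets, only the read/save/return contract.
class DialogSketch {
    private final DialogSketchMeta meta;

    DialogSketch(DialogSketchMeta meta) {
        this.meta = meta;
    }

    // typedName and typedField play the role of the text widgets' contents,
    // okPressed tells whether the user confirmed or cancelled.
    String open(String typedName, String typedField, boolean okPressed) {
        if (!okPressed) {
            return null; // cancelled: the meta object stays untouched
        }
        meta.setOutputField(typedField); // confirmed: save settings to meta
        return typedName;                // and report the (possibly new) step name
    }
}

class DialogSketchDemo {
    public static void main(String[] args) {
        DialogSketchMeta meta = new DialogSketchMeta();
        DialogSketch dialog = new DialogSketch(meta);

        String cancelled = dialog.open("My Step", "ignored", false);
        System.out.println(cancelled + " / " + meta.getOutputField());

        String confirmed = dialog.open("My Step", "greeting", true);
        System.out.println(confirmed + " / " + meta.getOutputField());
    }
}
```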
The Step Class
The step class is doing all the actual processing and transformation work. Since most of the boilerplate code is provided by the BaseStep parent class, most plugins focus on only a few implementation specific methods.
// initialization and teardown
public boolean init(...)
public void dispose(...)

// processing rows
public void run()
public boolean processRow(...)
The method init() is called by Kettle before any step of the transformation starts. The transformation run will only start after all the steps have returned successfully from their init() call. The template step does nothing there, but since it is a common place for doing initialization work, I kept it in the code. Similarly, dispose() is called after the step is done. The step should close resources such as file handles, caches and the like when dispose() is called.
Kettle calls the run() method when it is time to actually process rows. A common implementation of run() would call processRow() in a tight loop, until there is nothing more to process, or the transformation has been stopped.
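The life cycle described so far can be sketched as follows. This is a simplified stand-in and not BaseStep: the input is a plain in-memory queue, the methods take no arguments, and calling dispose() from a finally block inside run() is an assumption of this sketch. The log list exists only to make the order of calls visible.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Stand-in for a step; records the order in which its methods are called.
class LifecycleStep {
    final List<String> log = new ArrayList<>();
    private final Queue<String> input;
    private volatile boolean stopped;

    LifecycleStep(Queue<String> input) {
        this.input = input;
    }

    // Called before any row is processed; false would abort the run.
    boolean init() {
        log.add("init");
        return true;
    }

    // Handles one row; returns false when there is nothing left to do.
    boolean processRow() {
        String row = input.poll();
        if (row == null) {
            return false;
        }
        log.add("row:" + row);
        return true;
    }

    // Releases whatever init() or processRow() acquired.
    void dispose() {
        log.add("dispose");
    }

    void stop() {
        stopped = true;
    }

    // The tight loop: keep processing until done or stopped, then clean up.
    void run() {
        try {
            while (processRow() && !stopped) {
                // nothing to do here, processRow() does the work
            }
        } finally {
            dispose();
        }
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        LifecycleStep step = new LifecycleStep(new ArrayDeque<>(Arrays.asList("a", "b")));
        if (step.init()) {
            step.run();
        }
        System.out.println(step.log); // prints [init, row:a, row:b, dispose]
    }
}
```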
The method processRow() is called to process a single row of data. This method usually has a getRow() call to get a row to process. This call will block if necessary, for example if the step is getting rows at a slow rate. Subsequently processRow() would do its transformation work and call putRow(), which puts the row downstream. Your steps will need to examine and change the row structure. To do that safely and conveniently be sure to get familiar with the package org.pentaho.di.core.row, especially with RowMetaInterface and RowDataUtil.
The BaseStep implementation conveniently provides a first flag to indicate that the row being processed is the first one. This is useful for executing certain code only once, for example an expensive lookup whose result is going to be cached.
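Here is a minimal sketch of what a processRow() for the template step could look like, written against plain Java collections instead of the Kettle API. AppendFieldStep and AppendFieldData are hypothetical stand-ins, getRow() here polls an in-memory queue and never blocks, and the row is extended with Arrays.copyOf where a real step would lean on RowMetaInterface and RowDataUtil.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Stand-in for the data class: remembers the output structure.
class AppendFieldData {
    List<String> outputFieldNames;
}

// Stand-in for the step class; NOT derived from BaseStep.
class AppendFieldStep {
    private final List<String> inputFieldNames;
    private final Queue<Object[]> input;             // upstream rows
    final List<Object[]> output = new ArrayList<>(); // downstream rows
    private final String fieldName;
    private final String fixedValue;
    final AppendFieldData data = new AppendFieldData();
    private boolean first = true;                    // mimics the flag described above

    AppendFieldStep(List<String> inputFieldNames, Queue<Object[]> input,
                    String fieldName, String fixedValue) {
        this.inputFieldNames = inputFieldNames;
        this.input = input;
        this.fieldName = fieldName;
        this.fixedValue = fixedValue;
    }

    Object[] getRow() { return input.poll(); }     // null means: no more rows
    void putRow(Object[] row) { output.add(row); }

    boolean processRow() {
        Object[] row = getRow();
        if (row == null) {
            return false; // input exhausted, this step is done
        }
        if (first) {
            first = false;
            // One-time work: derive the output structure from the input structure.
            data.outputFieldNames = new ArrayList<>(inputFieldNames);
            data.outputFieldNames.add(fieldName);
        }
        // Append the fixed value at the end of the row and pass it on.
        Object[] out = Arrays.copyOf(row, row.length + 1);
        out[row.length] = fixedValue;
        putRow(out);
        return true;
    }
}

class AppendFieldDemo {
    public static void main(String[] args) {
        Queue<Object[]> rows = new ArrayDeque<>();
        rows.add(new Object[] {1L, "x"});
        rows.add(new Object[] {2L, "y"});
        AppendFieldStep step =
            new AppendFieldStep(Arrays.asList("id", "code"), rows, "greeting", "hello");
        while (step.processRow()) {
            // keep going until the input is exhausted
        }
        System.out.println(step.data.outputFieldNames);
        for (Object[] row : step.output) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```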
The Data Class
Most steps require some kind of storage for caching indices or temporary data. The proper place to put it is the data class. Each thread of execution gets its own instance of the data class, so it can run in a self-contained way. The boilerplate code is provided by the BaseStepData class. Think of the data class as the per-thread storage of your step plugin. As a rule of thumb, avoid adding non-constant fields to your step class. Whatever you need to store is probably better placed in the data class.
The template step uses a data object to store the output structure of the rows it processes. It does not require any other storage.
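Why per-thread storage keeps things simple can be shown with a small experiment. In this sketch several threads each own a private data object and update it without any locking. CopyData and CopyWorker are hypothetical stand-ins for a data class and a running step copy; nothing here uses BaseStepData.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for a data class: plain mutable state, no synchronization.
class CopyData {
    int rowsProcessed;
}

// Stand-in for one running copy of a step.
class CopyWorker implements Runnable {
    final CopyData data = new CopyData(); // private to this copy
    private final int rows;

    CopyWorker(int rows) {
        this.rows = rows;
    }

    @Override
    public void run() {
        for (int i = 0; i < rows; i++) {
            data.rowsProcessed++; // safe: no other thread touches this object
        }
    }
}

class PerCopyDataDemo {
    // Runs the given number of copies in parallel and returns each copy's counter.
    static int[] runCopies(int copies, int rowsEach) {
        List<CopyWorker> workers = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            CopyWorker worker = new CopyWorker(rowsEach);
            Thread thread = new Thread(worker);
            workers.add(worker);
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        int[] counts = new int[copies];
        for (int i = 0; i < copies; i++) {
            counts[i] = workers.get(i).data.rowsProcessed;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Every copy counts exactly its own rows.
        System.out.println(Arrays.toString(runCopies(3, 100000)));
    }
}
```

Had the counter been a single field shared by all threads, the unsynchronized increments could get lost. Keeping mutable state in a per-copy object sidesteps the problem.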
Conclusion
A Kettle step plugin consists of four classes, each with its own roles and responsibilities. The meta, dialog, step and data classes fit together nicely, especially since a lot of the boilerplate code and many common tasks are already implemented by the respective base classes. The open source nature of Kettle makes it possible to look at the techniques used by other developers, so if you find yourself stuck at some point, go ahead and look at the sources of some other steps. The answers are usually there.
If you want more detailed information take a look at another article which explains the code of a more advanced plugin: the Voldemort lookup plugin.
Other sources of help are the PDI wiki page and the forums at forums.pentaho.org. Especially check out the PDI users forum and PDI developers forum if you need help developing your custom plugin.
Comments and corrections are welcome.
Slawo




Hi Slawo,
Really nice post! If you post follow-ups on advanced topics for developing plugins, it would be very useful!
Hi there,
I am planning to follow up with an article on a step that extracts values from the Voldemort key-value store. Another thing I am thinking about is a step that would do GeoIP lookup based on maxmind’s open source country database. Maybe I’ll have time to work on these tonight, so stay tuned.
[...] article expands the example from the introductory article on custom plugins. With noSQL storage solutions on the rise, I thought it might be useful to show [...]
This is an excellent post. Greatly helps jump start custom plugins development.
As an aside, it would be very helpful if you could talk about testing plugins. Looking at the PDI code, there are classes such as TransformationTestCase and TransTestFactory which look like the premise of a testing framework, but it’s not obvious how they are meant to be used.
Again, thank you for your work.
Fady
Hi Fady,
thanks for commenting, and special thanks for suggesting an article topic. Stay tuned
In the meantime you might want to check out the automated tests that are part of the Kettle source (they are in the test folder). There are unit tests checking up on individual classes and black-box tests that compare transformation results with gold data.
Cheers
Slawo
This is the best d*** tutorial on writing a Kettle Step plugin there is !
Miles better than the original wiki at http://wiki.pentaho.com/display/EAI/Writing+your+own+Pentaho+Data+Integration+Plug-In
I would have saved quite a bunch of hours had I found this article the first time
Thanks Slawomir !!
Hey, you’re the one who created the Voldemort plugin youtube videos!
Hi Hendy,
Thanks for the comment! Yes, I did the Voldemort plugin demo and uploaded some videos
I also did not find any coherent introduction to plugin development, so after finding things out the hard way, I decided to give it a shot myself. Glad to see it is used
Cheers
Slawo
Hi
I am trying to write a plug-in (step) in which I need to make a call to other Kettle core steps. Is it possible to do that? Do you have any example for this?
Thanks
Hi there,
while it is technically possible to call another step directly at runtime, I’d advise against that. Transformation steps operate in parallel in different threads and they are not supposed to communicate directly. Core steps don’t do that either. Their medium of communication is the row stream.
However, if you really want to call other steps for some reason, you can ask the transformation object to give you instances of all steps. Call getDispatcher() on your step to get the Trans object and call getSteps() on that to get a list of all steps that are part of the transformation.
Cheers
Slawo
Great article, thanks. The code still works perfectly with 4.2.1 btw.
Good to hear