Writing custom output formats in Pentaho Kettle

java_output_serialization

Sometimes an ETL process needs to generate files in very specific non row based formats. This can be standard files like EDIFACT record files or maybe files using some ancient format you need to feed to a legacy system. In this post I would like to show some techniques to create those files using Pentaho Kettle . . . → Read More: Writing custom output formats in Pentaho Kettle

Building a detailed Date Dimension with Pentaho Kettle

teaser

Having rich date dimensions in a data warehouse often enables sophisticated business relevant analytical queries. This post shows a way to generate a detailed date dimension table that includes fixed date and variable date holidays, working days, special events and week of year information using the Kettle ETL tool, also known as Pentaho PDI. . . . → Read More: Building a detailed Date Dimension with Pentaho Kettle

Accessing Previous Row Values in Kettle

stock_transformation

Many people resort to JavaScript hacking when faced with the requirement to access a previous row’s value in Kettle. In most cases I find that to be unnecessary. And while there isn’t anything bad about the JavaScript step as such, it has the problem of a somewhat lower performance and it also adds complexity to the . . . → Read More: Accessing Previous Row Values in Kettle

GeoIP lookup using MaxMind’s Country Database and Kettle

geoip_screenshot

Reliable location information is a valuable asset when looking at internet traffic. Among other uses it can be utilized for fraud prevention or help in estimating foreign market potential. This article explains how you can lookup location information for an IP address using Kettle and MaxMind’s free GeoIP database.

Edit: As Daniel Einspanjer points out, there’s a . . . → Read More: GeoIP lookup using MaxMind’s Country Database and Kettle

Using Named Parameters in Kettle

settings dialog

Kettle’s named parameters often enable very elegant solutions for ETL requirements. This post gives an introduction to the named parameters feature of Kettle.

It is safe to think of named parameters as variables with their initial values assigned before the ETL process starts. In fact, they are available as variables when the ETL process executes. Their values can be specified in the launch dialog inside spoon or on the command line when running kitchen or pan. Continue reading Using Named Parameters in Kettle

Error Handling in the JavaScript Step

error_handling

If you are familiar with the error handling feature of Kettle transformation steps, you might be wondering how it can be utilized in a JavaScript step. The user interface mechanics are the same as with any other step that supports the error handling feature. It can be enabled and configured by right clicking the JavaScript step . . . → Read More: Error Handling in the JavaScript Step

Custom Database Lookups using the JavaScript Step and fireToDb

fire2db

Suppose you need to do a non-trivial lookup operation.  For arguments sake, let’s assume you need to check the database for a set of regular expressions, and you need to determine for a row field how many of those expressions match. In fact, a similar scenario has been brought up in the Pentaho forums in this post. One way to solve this is to use a JavaScript step. When initialized, it queries the database for the regular expressions. For each passing row it then loops over the expressions and counts how many match, writing the result to a row field. Sounds easy enough, but how do you query the database using JavaScript? . . . → Read More: Custom Database Lookups using the JavaScript Step and fireToDb

Accessing Kettle Internals with JavaScript

dbinfo

On rare occasions an ETL requirement comes up that is not easily implemented using the Kettle standard steps. Suppose for example that each record must be tagged with information about the database it is coming from. The source database is defined in the DB connection settings, but how to extract its current settings (i.e. type, host, port, dbname)? . . . → Read More: Accessing Kettle Internals with JavaScript

Writing BLOBs to Files

blob transformation

Every now and then a BLOB field pops up during ETL work. I suppose in migration work, mostly. Dealing with BLOBs is usually DB vendor specific and your mileage may vary as to how well Kettle is able to support the BLOB types of your RDBMS. But if you are lucky and Kettle recognizes your binary type . . . → Read More: Writing BLOBs to Files

Squeezing the most out of the JavaScript Step in Pentaho Kettle

set script type

If you read the articles about using Java in Kettle and manually generating rows, you have seen constant reminders about the fact that scripted code runs slower than a compiled plugin. But if for some reason you are stuck with a scripted solution, it is good to know a few facts about the JavaScript step that . . . → Read More: Squeezing the most out of the JavaScript Step in Pentaho Kettle