Documentation for Kettle ETL

Whenever you’re working on an ETL project and the responsibilities of each job, transformation, and execution step are becoming clearly defined, it might be a good idea to write down some documentation for the ETL solution. If that sounds like something that may be useful (or even required by your client or boss) but you just . . .

Building a detailed Date Dimension with Pentaho Kettle

Having rich date dimensions in a data warehouse often enables sophisticated, business-relevant analytical queries. This post shows a way to generate a detailed date dimension table that includes fixed-date and variable-date holidays, working days, special events, and week-of-year information using the Kettle ETL tool, also known as Pentaho Data Integration (PDI). . . .

Accessing Previous Row Values in Kettle

Many people resort to JavaScript hacking when faced with the requirement to access a previous row’s value in Kettle. In most cases I find that to be unnecessary. And while there isn’t anything bad about the JavaScript step as such, it performs somewhat worse and it also adds complexity to the . . .

Data Validation and Monitoring with Pentaho Kettle

When doing ETL work, you sometimes have to deal with data inputs that come with few or no consistency guarantees. If you choose to do the data validation in Kettle, there are a few options. Among other things, you may choose to verify data using the validator step, flag the rows and fields based on some calculation and . . .

A Simple Date Dimension for Mondrian Cubes

Date dimensions are among the most important dimensions of many Mondrian cubes. The usefulness of a cube often depends on the way the date dimension has been modeled. This post shows how to create a basic date dimension and how it can be augmented with properties to suit specific analysis needs. If at some point you . . .

GeoIP lookup using MaxMind’s Country Database and Kettle

Reliable location information is a valuable asset when looking at internet traffic. Among other uses, it can support fraud prevention or help estimate foreign market potential. This article explains how you can look up location information for an IP address using Kettle and MaxMind’s free GeoIP database.

Edit: As Daniel Einspanjer points out, there’s a . . .

Bulk Uploads with Kettle

Kettle processes sometimes need to upload files to remote machines. Uploading is usually not much of an issue, since Kettle provides several steps to upload files using different transmission protocols. The upload steps have their limitations, however, when it comes to uploading an entire folder structure. None of the built-in steps accepts a directory that it . . .

Sub-Transformations a.k.a Mappings

Recently a question about sub-transformations appeared on the Kettle forum, so I thought I’d honor the occasion and write a small tutorial on how to use them. They are a nice feature for reusing whole transformations, so if you find yourself copying and pasting the same steps into multiple transformations, mappings a.k.a. sub-transformations might be a . . .

Developing a Custom Kettle Plugin: Triggering a Report on JasperServer

The previous posts on Kettle plugin development focus on transformation steps. It is also possible to extend Kettle with custom job entries. This post introduces a plugin that provides a job entry which can trigger a report on JasperServer 3.7 Community Edition. Scheduling reports can be a tricky thing. If you keep your reports on JasperServer, . . .

Prevent running multiple instances of a Kettle Job

Sometimes a Kettle job needs to be executed on a tight schedule, every few minutes for example. In such cases it is often undesirable to have multiple instances of the job run in parallel. This can happen when a run takes longer than usual and the subsequent run starts before the current one finishes. This post shows a . . .