Implementing Universal Bulk-Loading in PDI

Most database systems provide a command-line bulk-loading utility that operates on flat files. These database-specific utilities are usually the fastest way to get data into a database. This post shows a generic technique for using bulk-loading utilities as part of a PDI ETL process. The download package has a working sample showing the . . . → Read More: Implementing Universal Bulk-Loading in PDI

Book Review: PDI 4 Cookbook by María Carina and Adrián Sergio

“PDI 4 Cookbook”, published June 2011, is a wonderful collection of tips, tricks, techniques and best practices regarding Kettle 4.x. It contains over 70 individual recipes that show how to solve common (and sometimes extraordinary) data processing tasks. This book is not about architecture diagrams, technology buzzwords and the philosophy behind Enterprise APIs. It is a . . . → Read More: Book Review: PDI 4 Cookbook by María Carina and Adrián Sergio

Clustering in Kettle

This article introduces clustering concepts supported by Kettle a.k.a. PDI. If you need to replicate data to several physical databases, or would like to learn about scale-out options for record processing, this article may be for you. As usual, the downloads section has the demo transformations for this article. . . . → Read More: Clustering in Kettle

Partitioning in Kettle

This article introduces partitioning concepts supported by Kettle a.k.a. PDI. If you need to partition records over several tables, or would like to learn about increasing the parallelism of your transformations, this article may be for you. . . . → Read More: Partitioning in Kettle

An Introduction to Regular Expressions

Regular expressions are a very useful tool for a variety of string related tasks. In Kettle they are frequently used for extraction and manipulation tasks, as well as for specifying groups of file names. This post gives an introduction to regular expressions in general as well as some applications within Kettle a.k.a. PDI. Since the built-in . . . → Read More: An Introduction to Regular Expressions
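As a rough illustration of the two uses named in the excerpt (selecting groups of file names and extracting values from strings), here is a minimal Python sketch. The file names and the pattern are hypothetical, and Kettle itself uses Java regular expressions; for a simple pattern like this one the syntax is the same in both.

```python
import re

# Hypothetical naming scheme: sales_<year>-<month>.csv
FILE_PATTERN = re.compile(r"sales_(\d{4})-(\d{2})\.csv")


def select_files(names):
    """Keep only the names that match the pattern from start to end."""
    return [name for name in names if FILE_PATTERN.fullmatch(name)]


def extract_year_month(name):
    """Pull year and month out of a matching name via capture groups."""
    match = FILE_PATTERN.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Matching the whole name (rather than searching for a substring) is what keeps a file such as `sales_2011-06.csv.bak` out of the selection.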

Write ETL that writes ETL – Creating Crosstabs with Kettle

In this post I’d like to demonstrate a technique for creating dynamic Kettle transformations. What do I mean by dynamic in this case? Imagine a transformation dynamically creating or changing another transformation before it is executed. Why would anyone want to do this? In this post I’ll take up the example of creating crosstabs in Kettle, . . . → Read More: Write ETL that writes ETL – Creating Crosstabs with Kettle

Dynamic SQL Queries in PDI a.k.a. Kettle

When doing ETL work, every now and then the exact SQL query you want to execute depends on input parameters determined at runtime. This requirement comes up most frequently when SELECTing data. This article shows the techniques you can employ with the “Table Input” step in PDI to make it execute dynamic or parametrized queries. . . . → Read More: Dynamic SQL Queries in PDI a.k.a. Kettle
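The general idea of a parametrized query can be sketched outside PDI. The example below is not the “Table Input” step itself; it uses Python's standard `sqlite3` module with an in-memory database, and the `orders` table, its columns, and the `fetch_orders` helper are assumptions made up for illustration.

```python
import sqlite3


def fetch_orders(conn, min_amount):
    """Run a fixed SQL text; the ? placeholder is bound to a runtime value."""
    cursor = conn.execute(
        "SELECT id, amount FROM orders WHERE amount >= ? ORDER BY id",
        (min_amount,),
    )
    return cursor.fetchall()


# Hypothetical sample data in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(1, 10.0), (2, 25.5), (3, 40.0)],
)
```

Binding the value through a placeholder, instead of pasting it into the SQL string, keeps the query text constant and avoids quoting problems.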

Using an on-demand in-memory SQL database in PDI

Anybody who finds themselves working in a client’s environment will usually face the fact that access to databases is restricted to what’s absolutely required to get the job done. The source files and target systems will be available, but creating helper tables or databases may be completely out of the question, or it may involve overcoming . . . → Read More: Using an on-demand in-memory SQL database in PDI

Processing the void – detecting and handling empty row streams in PDI

ETL processes sometimes need to generate data even if there’s no input. This can be puzzling, since a typical ETL row stream produces nothing if there’s no input. In some scenarios an ETL process is supposed to generate some sort of aggregation, which implies it should report a value of 0 even . . . → Read More: Processing the void – detecting and handling empty row streams in PDI
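The problem can be illustrated with a small Python sketch. It is not a PDI transformation; the function names and the `amount` field are hypothetical, and the row stream is modeled as a plain list of dictionaries.

```python
def aggregate(rows):
    """Stream-style aggregation: emits a summary row only if input arrived."""
    count = 0
    total = 0
    for row in rows:
        count += 1
        total += row["amount"]
    if count == 0:
        return []  # no input rows, so nothing flows downstream
    return [{"count": count, "total": total}]


def aggregate_or_default(rows):
    """Detect the empty result and substitute an explicit zero row."""
    result = aggregate(rows)
    return result if result else [{"count": 0, "total": 0}]
```

The second function makes the "report 0 for no input" requirement explicit instead of relying on the aggregation to produce a row on its own.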

Tracking Transformation Progress becomes easier in PDI

Tracking the progress of a transformation in PDI Spoon usually involves closely observing the numbers displayed in the step metrics tab. The step metrics tab displays information about each step’s processed rows and input/output buffers, which makes it an important tool for understanding step performance at a glance as well as understanding the progress of a . . . → Read More: Tracking Transformation Progress becomes easier in PDI