Book Review: PDI 4 Cookbook by María Carina and Adrián Sergio

“PDI 4 Cookbook”, published June 2011, is a wonderful collection of tips, tricks, techniques and best practices for Kettle 4.x (Pentaho Data Integration). It contains over 70 individual recipes that show how to solve common (and sometimes extraordinary) data processing tasks. This book is not about architecture diagrams, technology buzzwords and the philosophy behind Enterprise APIs. It is a practical collection of sample solutions to real-world problems. If you need to create or maintain a PDI-based solution, it’s good to have this book on your shelf. It never fails to inspire in moments of requirements-induced despair ;)

Chapter 1

The book’s first chapter is devoted to working with databases. Some recipes show the correct use of the major DB-related steps, but there are also advanced topics: dynamic connection definitions, building and executing SQL queries at runtime, and loading a hierarchy into a parent-child (adjacency list) table.
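
For readers unfamiliar with the parent-child layout, here is a minimal Python sketch of the idea, not the book's Kettle recipe. The function name `flatten_hierarchy`, the nested-dict input and the sample data are all assumptions made for illustration. Each node becomes one row that stores its own id and the id of its parent.

```python
from itertools import count


def flatten_hierarchy(tree, parent_id=None, ids=None):
    """Yield (id, parent_id, name) rows for a nested {name: children} dict.

    Hypothetical helper: ids are assigned in depth-first order, and the
    root rows get parent_id None.
    """
    if ids is None:
        ids = count(1)
    for name, children in tree.items():
        node_id = next(ids)
        yield (node_id, parent_id, name)
        # Children reference the id that was just assigned to their parent.
        yield from flatten_hierarchy(children, node_id, ids)


# Made-up sample hierarchy.
tree = {"World": {"Europe": {"Spain": {}, "France": {}}, "Asia": {"Japan": {}}}}
rows = list(flatten_hierarchy(tree))
for row in rows:
    print(row)
```

Rows in this shape can be inserted into a table with `id`, `parent_id` and `name` columns, which is what an adjacency list table amounts to.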

Chapter 2

The second chapter focuses on reading and writing files. The basics of processing input files are again explained very nicely, but the book does not stop there. Every now and then an ETL project has to deal with an unstructured file, or a file with a custom structure that needs to be parsed, and the cookbook shows some tricks for handling these. There are also a couple of tips on file output, and writing custom file structures is explained using a very nice example.

Chapter 3

In many projects some form of XML is used for data interchange. The third chapter of the cookbook explains how to validate, read and write XML files using Kettle. If you ever wondered how to easily extract information from XML using Kettle, or have found that your ETL process is expected to generate complex XML output, this chapter is going to be invaluable.

Chapter 4

File management is something that many ETL solutions do at one level or another. The fourth chapter shows how to work effectively with files and directories, both locally and remotely.

Chapter 5

This chapter is about data lookups. It features interesting examples that range from common uses of DB lookups, to calling web services, to using “fuzzy matching” to look up the values that most closely match the input. This is a very creative chapter. I would recommend skimming through it when trying to come up with an inventive solution to “impossible” requirements.
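
To make the fuzzy matching idea concrete, here is a small sketch in Python using the standard library's `difflib`. It only illustrates the concept and is not how Kettle implements it; the function name `fuzzy_lookup`, the `cutoff` value and the reference list are assumptions.

```python
from difflib import get_close_matches


def fuzzy_lookup(value, reference, cutoff=0.6):
    """Return the reference entry most similar to value, or None.

    Hypothetical helper: similarity is difflib's ratio, and entries
    scoring below cutoff are rejected. The comparison is case sensitive.
    """
    matches = get_close_matches(value, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up reference data and a misspelled input value.
cities = ["Buenos Aires", "Montevideo", "Santiago"]
print(fuzzy_lookup("Buenos Aries", cities))  # Buenos Aires
print(fuzzy_lookup("xyz", cities))           # None
```

Raising `cutoff` makes the lookup stricter, which trades missed matches for fewer false ones.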

Chapter 6

The concept of row streams is absolutely central to Kettle. Chapter 6 explains the mechanics and the rules behind row streams. It features interesting use cases that manipulate row streams. The samples include splitting, joining, appending, merging, finding differences (very useful for some CDC solutions), and detecting empty streams.
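
As an illustration of why finding differences helps with CDC, the following Python sketch compares an old and a new snapshot by key and flags each row. It is a plain in-memory approximation and not the book's recipe; the function name `diff_rows`, the flag names and the sample rows are assumptions.

```python
def diff_rows(old_rows, new_rows, key):
    """Yield (flag, row) pairs comparing two snapshots by a key field.

    Hypothetical helper: flag is one of "identical", "changed", "new"
    or "deleted". Both inputs are assumed to fit in memory.
    """
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    for k in sorted(old.keys() | new.keys()):
        if k not in new:
            yield ("deleted", old[k])
        elif k not in old:
            yield ("new", new[k])
        elif old[k] != new[k]:
            yield ("changed", new[k])
        else:
            yield ("identical", new[k])


# Made-up snapshots of the same table at two points in time.
old_snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
new_snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}, {"id": 4, "name": "d"}]
changes = list(diff_rows(old_snapshot, new_snapshot, "id"))
for flag, row in changes:
    print(flag, row)
```

A downstream load would then act only on the rows flagged as new, changed or deleted.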

Chapter 7

Having a clever solution to a particular data processing problem is one thing. Building a successful solution to a real-world problem turns out to be quite another. Chapter 7 focuses on techniques for creating flexible and reusable jobs and transformations that help structure and maintain an ETL project. The topics include executing jobs and transformations in loops, passing data between transformations, determining the job or transformation to execute at runtime, and using sub-transformations, a.k.a. mappings.

Chapter 8

Many Kettle projects transform data that is then made accessible through other components of the Pentaho BI suite, like reporting or OLAP analysis. Chapter 8 is about integrating Kettle with the Pentaho Suite. Topics covered include executing jobs and transformations on the BI-Server, as well as using transformations as data sources for dashboards and reports. This section of the book shows the versatility of PDI and some creative solutions to otherwise hard problems. If you ever wanted a dashboard or report to display information that is being generated by a PDI transformation, this chapter is for you.

Chapter 9

The last chapter has some nice bits and pieces that do not quite fit into the other chapters. It includes topics such as sending emails, working with JSON, customizing PDI logging, advice on ETL testing, using the “User Defined Java Class” step, and generating documentation for the ETL project.

All the recipes come with sample files and code that is freely available from the publisher’s website.

Grab the book and enjoy :)


2 comments to Book Review: PDI 4 Cookbook by María Carina and Adrián Sergio

  • The best thing I found about this book – is that a whole load of the things described validated work I was already doing – some of which I considered quite hacky solutions! An experienced user won’t learn much from this book – but it is still worth a read for those one or two extremely useful nuggets.

  • Mick

    I have bought this book and I think it could be very helpful.
    I already knew some of those proposed solutions, but even for an experienced Kettle user it’s possible to find some useful tricks.

    The next step would be to create an online version, where Kettle users can add their own problems+solutions!

