Write ETL that writes ETL – Creating Crosstabs with Kettle

In this post I’d like to demonstrate a technique for creating dynamic Kettle transformations. What do I mean by dynamic in this case? Imagine a transformation dynamically creating or changing another transformation before it is executed. Why would anyone want to do this? In this post I’ll take up the example of creating crosstabs in Kettle, which often requires a dynamic, data-driven row structure. There are other uses for generating ETL automatically, the “copy table” wizard being an example. Let’s jump in.

The Requirements

The input is a file of transactional data, holding individual sales records structured like this:

Item               Category  Price  City              Region        Country
Celery             Produce   0.99   Finlayson         Minnesota     United States
Sweetener packets  Baking    1.49   Jose Leon Suarez  Buenos Aires  Argentina

The requirement is to generate a CSV file with a crosstab showing turnover summed by country (on rows) and product category (on columns). So depending on the countries and product categories present, the result would look similar to the following:

Country        Baking  Produce
United States  1.65    1.89
Argentina      7.71    13.23

The First Shot

The first approach to the problem is a transformation that groups the data by country and category, thus creating the necessary sums. The grouped data is then denormalized to bring the distinct product categories into distinct columns. The result is written to a file.
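
To make the logic concrete, here is a plain-Ruby sketch of what the group-by and denormalizer steps compute. This is for illustration only, with made-up sample rows; the real work happens in the Kettle steps:

# group: sum turnover per (country, category) pair,
# then denormalize: one row per country, one column per category
rows = [
  {"Country" => "United States", "Category" => "Produce", "Price" => 0.99},
  {"Country" => "United States", "Category" => "Baking",  "Price" => 1.65},
  {"Country" => "Argentina",     "Category" => "Baking",  "Price" => 1.49},
]

sums = Hash.new(0.0)
rows.each { |r| sums[[r["Country"], r["Category"]]] += r["Price"] }

crosstab = Hash.new { |h, k| h[k] = {} }
sums.each { |(country, category), sum| crosstab[country][category] = sum }
# => {"United States"=>{"Produce"=>0.99, "Baking"=>1.65}, "Argentina"=>{"Baking"=>1.49}}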

The Problem

While the above solution works just fine, it assumes that the product categories remain static. Should the data change to include another product category, you’d have to edit the denormalizer step and the text output step to account for the new column. Conversely, if a certain product category is absent from the data, it will still appear as a column unless you manually remove it from the denormalizer and text output steps.

Wouldn’t it be cool to automatically adjust the field definitions on the denormalizer and text output steps after having a peek at the actual product categories present in the input file? This post shows how to do it using scripting.

The Solution

The idea is simple: have a transformation that finds the product categories present in the input and changes the crosstab-generating transformation accordingly. The following transformation does just that. It first distills all product categories from the input file, then executes a Ruby script that opens the crosstab-generating .ktr file, sets the field definitions on the denormalizer and text output steps, and saves the file again. It uses Kettle’s Java API to do that.

The main script does the job:

# this script opens the generate_crosstab.ktr file and modifies it, setting the categories on the
# field mapper step and the output step, then saves the file back to disk

$tabs['Helper Library'].require

# read all incoming rows ($input is provided by the Ruby plugin step) and collect the categories
all_categories = $input.read_all.map {|row| row["Category"]}

# load transformation ($trans_file is set as a "Global Variable" in the options pane below)
trans_meta = load_trans $trans_file

# set fields on denormalizer step
set_fields_on_denormalizer_step trans_meta, all_categories

# set fields on text output step
set_fields_on_text_output_step trans_meta, all_categories

# save back to disk
save_trans trans_meta, $trans_file

# don't generate any rows in this script
nil

A short library script provides the actual functionality:
# java imports for short-name access to Java classes
java_import org.pentaho.di.trans.TransMeta
java_import org.pentaho.di.trans.steps.denormaliser.DenormaliserMeta
java_import org.pentaho.di.trans.steps.denormaliser.DenormaliserTargetField
java_import org.pentaho.di.core.row.ValueMetaInterface
java_import org.pentaho.di.trans.steps.textfileoutput.TextFileField

# loads a .ktr file from disk
def load_trans file_name
	TransMeta.new(file_name)
end

# saves a .ktr file to disk
def save_trans trans_meta, file_name
	# split off leading "file://" from the .ktr url, if present
	file_name = file_name.sub(/^file:\/\//, "")
	# save transformation to file
	File.open(file_name, 'w') {|f| f.write(trans_meta.get_xml)}
end

# sets the fields on the denormalizer step according to the given categories
def set_fields_on_denormalizer_step trans_meta, categories

	# get the denormalizer step and generate a field definition for each category
	denormaliser = trans_meta.find_step("Map Categories to Fields").step_meta_interface

	fields = categories.map do |cat|

		field = DenormaliserTargetField.new
		field.field_name       = "turnover"
		field.key_value        = cat
		field.target_name      = cat
		field.target_precision = -1
		field.target_length    = -1
		field.target_format    = "0.00"
		field.target_type      = ValueMetaInterface::TYPE_NUMBER
		field.target_aggregation_type = "SUM"

		field
	end

	denormaliser.set_denormaliser_target_field(fields.to_java(DenormaliserTargetField))

end

# sets the fields on the text output step according to the given categories
def set_fields_on_text_output_step trans_meta, categories

	# get the text output step and set the field definitions
	text_output = trans_meta.find_step("Output Crosstab").step_meta_interface

	country_field = TextFileField.new(
			"Country",                          # field name
			ValueMetaInterface::TYPE_STRING,    # field type
			"",                                 # format
			-1,                                 # length
			-1,                                 # precision
			nil,                                # currency symbol
			nil,                                # decimal symbol
			nil,                                # group symbol
			''                                  # null string
		)

	fields = [country_field] + categories.map do |cat|
		TextFileField.new(
			cat,                                # field name
			ValueMetaInterface::TYPE_NUMBER,    # field type
			"0.00",                             # format
			-1,                                 # length
			-1,                                 # precision
			nil,                                # currency symbol
			nil,                                # decimal symbol
			nil,                                # group symbol
			''                                  # null string
		)
	end

	text_output.set_output_fields(fields.to_java(TextFileField))
end

There are some degrees of freedom here: the idea can be implemented in another language, using the UDJC or JavaScript steps, for example. Directly manipulating the XML of the .ktr file is also a viable option, since Kettle produces reasonable XML. If you’re not much into Kettle programming, a look at the XML may very well open the door to dynamically creating/changing transformations too. The Kettle Cookbook works with the XML, for example. Enjoy, and have fun!
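
For instance, here is a minimal sketch of the XML route using Ruby’s REXML. The step name matches this post’s example, but the tag names under the fields element are assumptions based on what Spoon writes for the Denormaliser step, so save a version from Spoon and verify them first:

# minimal sketch: rewrite the denormalizer's field list directly in the .ktr XML
# (tag names are assumptions; compare with a .ktr saved from Spoon)
require 'rexml/document'

doc = File.open("generate_crosstab.ktr") { |f| REXML::Document.new(f) }

# locate the denormalizer step by name and drop its current field definitions
step   = REXML::XPath.first(doc, "//step[name='Map Categories to Fields']")
fields = step.elements["fields"]
fields.elements.delete_all("field")

# add one field definition per category
["Baking", "Produce"].each do |cat|
  field = fields.add_element("field")
  field.add_element("field_name").text              = "turnover"
  field.add_element("key_value").text               = cat
  field.add_element("target_name").text             = cat
  field.add_element("target_type").text             = "Number"
  field.add_element("target_format").text           = "0.00"
  field.add_element("target_aggregation_type").text = "SUM"
end

File.open("generate_crosstab.ktr", "w") { |f| doc.write(f) }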

Edit: Roland Bouman has contributed a similar sample that manipulates the XML of a transformation directly. Thanks :)

When both transformations are executed in sequence, the result is a dynamic solution that adapts to the data at hand. Feel free to get the example package and add a few extra lines to the sales.txt file to introduce new categories. See how the transformation adapts :)

Conclusion

If you’re into generic solutions to ETL problems, the above technique may be for you. It has the nice property of generating tangible transformation files, which means that you can always test your transformation-changing/generating code in isolation. That said, for the particular problem of dynamic columns, you may also want to check out Kettle’s “Metadata Injection” facilities, on which Matt has recently written a nice tutorial.
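
If you want to sanity-check a generated .ktr without opening Spoon, a few lines of Kettle API calls run it programmatically. This is a sketch: inside a Ruby step the Kettle environment is already initialized, otherwise initialize Kettle first:

java_import org.pentaho.di.trans.Trans
java_import org.pentaho.di.trans.TransMeta

# load and execute the generated transformation, then report the error count
trans_meta = TransMeta.new("generate_crosstab.ktr")
trans = Trans.new(trans_meta)
trans.execute(nil)          # no command line arguments
trans.wait_until_finished
puts "errors: #{trans.get_errors}"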

Downloads

Get the sample package to check out the details. You’ll need Kettle 4.x and the Ruby plugin.

Comments and corrections are welcome :)

Cheers

Slawo

PS: As a variation on the theme, and in response to this forum thread, I’d like to offer a sample solution that sets the Excel sheets of an Excel Input step dynamically. Enjoy :)
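
The approach is the same as above. Here is a minimal sketch of the idea, with hypothetical file and step names; the ExcelInputMeta setters are assumptions to verify against your Kettle version:

java_import org.pentaho.di.trans.TransMeta
java_import org.pentaho.di.trans.steps.excelinput.ExcelInputMeta

sheet_names = ["January", "February", "March"]

# "read_workbook.ktr" and "Excel Input" are hypothetical names for this sketch
trans_meta  = TransMeta.new("read_workbook.ktr")
excel_input = trans_meta.find_step("Excel Input").step_meta_interface

# ExcelInputMeta keeps parallel arrays: one entry per sheet
excel_input.set_sheet_name(sheet_names.to_java(:string))
excel_input.set_start_row(([0] * sheet_names.size).to_java(Java::int))
excel_input.set_start_column(([0] * sheet_names.size).to_java(Java::int))

File.open("read_workbook.ktr", "w") { |f| f.write(trans_meta.get_xml) }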


8 comments to Write ETL that writes ETL – Creating Crosstabs with Kettle

  • Hi!

    great post, very cool!

    I just wanted to add: there’s a gentler example of manipulating the Kettle transformation XML than the Kettle Cookbook. I uploaded an example to the “Kettle Exchange” page in the Pentaho wiki that dumps the result of an arbitrary SQL statement to Excel. The logic of that example is identical to the one you have here: a single job runs a transformation that generates the actual transformation, which is then executed as the second step in the job. The generating transformation is very simple: it merges a template transformation file (initially created with Spoon, then manually edited to remove the dynamic part) with an XML fragment using the XML Join step. The example is here:

    Dump any SQL resultset to MS Excel (generic job)

  • Slawomir Chodnicki

    Hi Roland,

    thanks for providing the link! I was just thinking to myself “It’s gonna be a day or two and somebody will request a sample for an XML based implementation…” Knowing you I should have thought “It’s gonna be a day or two and Roland will provide a sample for an XML based implementation…” ;)

    May every craftsman pick the appropriate hammer to nail this down :)

    Cheers

    Slawo

  • Hi Slawomir,

    This is exactly what I’m doing for a case at a telecommunications operator – dynamic sizing in both columns and rows.

    And this has been slightly discussed in our popular user group https://groups.google.com/forum/#!searchin/pentaho-id/etl$20generate$20etl/pentaho-id/gr7cxVTI0Ik/PYQ5WOyQqncJ.

    Thank you for a great article !

    Regards,

    Feris

  • yhuang

    Hello,

    Can you provide Chinese service? My English is poor.

    Thanks

  • Jeffrey

    Hello,

    Is it possible to give an example of writing out to the ExcelWriter step? I’m having some issues with the injection when I use an Excel Output rather than a Text Output.

    Thank you!

  • Jeffrey

    Never mind. I see now how to manipulate the metadata.

  • Jeffrey

    Is it possible to use a repository with this step for the global variables? Thanks!

  • Sergio

    Hello guys I was looking for a way to achieve this without using the array_to_string function of postgreSQL. http://www.mkyong.com/database/convert-subquery-result-to-array/ Thank you
