Building a detailed Date Dimension with Pentaho Kettle

Having rich date dimensions in a data warehouse often enables sophisticated business relevant analytical queries. This post shows a way to generate a detailed date dimension table that includes fixed date and variable date holidays, working days, special events and week of year information using the Kettle ETL tool, also known as Pentaho PDI.

The Example Transformation

I’d like to present a transformation that creates and stores the date dimension in a master table. Other processes in the DWH would access this table to associate relevant records with the date dimension. The example transformation works with Kettle 3.2 and 4.x alike. It generates dates from 2000 till 2020.

The main stream starts by generating a row for each day. What follows is a set of calculations that generate additional fields derived from calendar information on the processed day. There are two additional streams that handle fixed date holidays and variable date holidays. The main stream looks up these days from the stream and defines working days based on week day and holiday information. There’s also a JavaScript step that generates miscellaneous special information. Finally, the transformation is setting some useful flags on the dates, derived from the already determined information and writes the rows to the database.

I am reusing the ISO string of the date as the primary key. You may consider adding rows with id “N/A”, “malformed”, “unknown” and other special meaning rows to the date dimension, depending on your particular needs.

Click on the screenshot to view in full size. If you’d like to follow along the steps of the transformation, it’s probably best to download the example transformation, adjust the database connection, and create the output table (click the SQL button on the “insert/update” output step to do that quickly). You can now execute the transformation and follow along each step by right clicking it and selecting “preview” to see what it returns.

The Transformation Explained

There are two major types of information that go into a date master table. There are fields that are directly derived from the calendar like day of week, number of month, week of month etc. The second set of fields is derived from rules based on your location, cultural setting and business context. Among these are national and religious holidays, the definition of working days, and optionally also special business periods like fair days, selling season indicators, etc.

Getting Calendar Information

Generating fields directly derived from the calendar is pretty straightforward in Kettle. The calculator step provides some of the calendar fields when given a date object to work with. For more advanced fields it is possible to use the rhino-backed JavaScript step to use the Calendar object of the Java language.

Fixed Date Holidays

Fixed date holidays are kept in a simple CSV file and joined to the appropriate days. I’m using a “Steam lookup” step to join the holidays to the appropriate dates.

Variable Date Holidays

Variable date holidays are usually based on some non-trivial algorithm. All variable date holidays in Germany are based on the easter sunday, so I am using that as a basis for the variable holidays. The variable date holidays must be generated for each relevant year, since by their nature every year’s date will be different. Therefore I start with the generation of relevant years. For each year a JavaScript step calculates the relevant holidays and puts them to the row stream. I presume the same idea would work for most cultural and religious settings. There must be some algorithm that yields the date of any variable date holiday for any given year.

Special Dates and Events

An additional step calculates miscellaneous date information specific to the organization. For the example I chose to flag all days of the FIFA soccer world championships from 2002 till 2010. I am using a JavaScript step to do that, but you may choose any other means to join this information to the row stream. In real life scenarios I’d probably consider reading in an Excel sheet that holds the relevant dates and descriptions and join the information to the row stream, similar to the approach taken for the fixed date holidays.

Calculating useful Flags

The “consolidate holidays/working days” step is assigning the “is_holiday”, “holiday_name”, and “is_working_day” fields. They are filled based on the content of the fixed and variable date holidays and weekday information.

Internationalization Considerations

Depending on what your default Java locale is, the date calculations will return slightly different results. Especially when it comes to handling weeks. There are different rules in use as to what the first week of the year is, for instance. Similarly, in some countries weeks begin with sundays. In other countries weeks begin with mondays. The date calculations in the example transformation use the default Java locale, so they should yield numbers appropriate for your location. I chose to also include the relevant fields according to the ISO8601 standard that is often used in international organizations.

In the “Calc more calendar info” step I also show how to explicitly use specific Java locales for calendars and output formats. I use specific locales to calculate the week of month as per ISO8601 and to generate english and german labels for months and weekdays. The example code should help you along if you need to support different locales and decide to store localized information in the master data table.

Updating the Dimension Table

The transformation is designed to be rerun in case the need arises. This might be the case if fixed date holidays change, or you’d like to retrospectively mark a time period as special. Since the transformation is using a smart key, days will be updated as required, and the references to the date dimension will remain intact.

Downloads

Download the example transformation.
I hope it serves as a good starting point for creating a useful date dimension table for you :)

Related Literature:

23 comments to Building a detailed Date Dimension with Pentaho Kettle

  • [...] holidays, working day info and other special flags, you may want to check out this article about building a detailed master date table. Drawing from this article you should be able to expand the existing member properties to include [...]

  • Hi Slawomir,

    Good job and good explanation of your transformation.

    Some days ago I published a similar entry in my blog with my particular version of this kind of transformation.

    We can do the comparation exercise maybe both can learn something.

    Regards
    Enric

  • Slawomir Chodnicki

    Hi Enric,

    I’ve just had a look at your site. I think your blog is in Catalan, which I am afraid I don’t have a chance to read, but the transformation is mostly kept in english :)

    I think the transformation is quite nice. I especially like the way you parametrized the locale info. Building on that it should be possible to create some nice multi-language OLAP cubes :)

    Best Regards

    Slawo

  • Thanks,
    A particular interest in this transformation for me was an example of using stream lookups.

  • Slawomir Chodnicki

    Thanks for reading Vasili :)

    Cheers
    Slawo

  • [...] a detailed Date Dimension with Pentaho Kettle” – Sept 2010 (link) In here, Slawomir Chodnicki explains briefly the desing considerations in his design. One [...]

  • hamza

    hey Slawomir,

    I am doing a project on creating the data warehouse for a retail store. The problem i am getting is in your transformation the column week_of_month has a 0 in it instead of a number….how can this be corrected and if its not possible then how to interpret it….

    Thnx in advance….

    Hamza Mughal

  • Slawomir Chodnicki

    Hi Hamza,

    please refer to http://download.oracle.com/javase/6/docs/api/java/util/Calendar.html and check the section on “First Week” to interpret the week numbers.

    Cheers
    Slawo

  • It was exactly I was looking for. A customizable time-dimension generator.
    Great job and thanks for sharing!

  • simmon

    Hi Slawomir, Best able to provide the target table structure. Thanks.

  • Slawo, Your information help me a lot. It works fine in Japanese. Thanks.

  • [...] a detailed Date Dimension with Pentaho Kettle – Sept 2010 (link) In here, Slawomir Chodnicki explains briefly the desing considerations in his design. One [...]

  • wihel

    Hi, Slawomir..
    i’m doing a project for my tesis…
    and i’ll make dimension snowflake with pentaho kettle…
    can you help me to give some examples for dimension snowflake.ktr

    what tools i must do….
    just simple dimension snowflake. i hope you can help me.
    thank u

  • Sara

    Hi Slawomir,

    I’m a beginner in Pentaho Data Integration [Kettle]. And I am doing my final year project.
    Well, I want to do an ETL process from conventional relational database model to dimensional database model. (Loading Dimension/Fact Table).
    I am just wondering how to go about doing this ! I have read around and looked at examples, but I am just having difficulties wrapping my head around how this works !!!!

    * For exemple, we have this “relational database” (Mysql) :

    Customer(CustmerID, CustomerName, CustomerAddress, CustomerCity, CustomerState, CustomerZip, CustomerContactLName, CustomerContactFName, CustomerContactPhone);

    OrderHeader(OrderID, CustomerID, OrderData, BillToAddress, BillToCity, BillToState, BillToZip, ShipToAddress, ShipToCity, ShipToState, ShipToZip);

    Category(CategoryID, CategoryName);

    Product(ProductID, CategoryID, ProductName, ProductDescription, ProductPrice);

    OrderDetail(OrderID, ProductID, LineItem, QuantityOrdred);

    * In the other side, concerning the “Star Model”, we have:

    ProductDimension(ProductID, ProductCode, ProductName, Category, SubCategory, Brand, Height, Width)

    StoreDimension(StoreID, StoreName, ParentChain, Region, Territory, Zone, Address, City, State, Zip)

    TimeDimesion(TimeID, DayofWeek, DayofMonth, DayofYear, Month, Quarter, Year, Holiday, Weekend)

    SalesFact(ProductID, TimeID, StoreID, SaledDollars)

    Well, What isn’t clear for me is the “TimeDimension” :

    1. I have read that there is an ETL process to “create/fill” a TimeDimension ; How comes ?!…. Should I create a process (Transformation) in ETL then load it to a Table in the database, and after that loading that table from the database into a Transformation
    (Kettle) or what ???

    2. Now after loading the Dimesions Tables and the fact Table : How will I affect the Date to a row in the DataBase. Because in the database, there is no table created called “TimeTable” !!
    In other word, When we will execute a query , we ‘re going to interrogate the database but indeed, there is no relationship between the Table Product and the TimeTable(which even doesn’t exist), for example… !!?

    3. In the exemple above, concerning the Fact Table : How can I Add a row into it, in order to calculate the “SalesDollars”?

    Could you please clarify this for me ?!

    Thanks and Regards.

  • This is great! And still applicable as of v5.0.4

  • croo

    Thanks Slawomir, I found that very useful.
    Written & shared in 2010 but it’s still giving today!
    thanks

    Colin

  • Marlon

    Thanks !!

  • vineetdubey

    hi where can i get the structure of the dim_date table ?

  • [...] Building a detailed Date Dimension with … – Having rich date dimensions in a data warehouse often enables sophisticated business relevant analytical queries. This post shows a way to generate a … [...]

  • [...] Building a detailed Date Dimension … – Having rich date dimensions in a data warehouse often enables sophisticated business relevant analytical queries. This post shows a way to generate a … [...]

  • ¡Many thanks! I have been working with Pentaho for the last year, and it is awesome what it can do. Everyday I discover a new functionality ;)

  • [...] Building a detailed Date Dimension with Pentaho Kettle [...]

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>