This post is a hands on tutorial on how to create an analysis cube for the Mondrian OLAP engine. It is an introductory post for an audience with no OLAP experience at all. I will assume some experience with relational databases and the Kettle ETL tool. If you’d like to follow the examples you will need access to a database, a copy of Pentaho Kettle and a Mondrian installation. I will be using MySQL as RDBMS and JasperServer 3.7. CE for the Mondrian installation. Other possibilities include Pentaho BI-Server and a bare bones Mondrian/JPivot and PAT installation.
The example data
For the example data I would like to use a small and simple dataset, so it is easy to share and easy to understand. Therefore I decided to use an Excel extract from the public issue tracker for the Kettle tool. It has one thousand lines of issues, bugs and feature requests that I would like to put into an OLAP cube for analysis. Each row contains the issue type, a summary, the assignee, a priority, a status and a resolution. I took the liberty to replace the real names of the assignees with figures from Sesame Street in the input data. And no, I will not reveal who is who
Download the Excel file that contains the example data.
Transforming the Excel file to a Mondrian cube
A Mondrian cube is stored in a relational database. The central table for an OLAP cube is called the fact table. It contains entities (think: rows) whose properties are analyzed in the OLAP cube. For the example cube an issue will be a fact. It will have the properties type, assignee, priority, status and resolution. I’d like to drop the summary field, because it contains free text information that is of little use in aggregations or breakdowns.
The simplest path from the Excel sheet to the database is to create the table in the database and to import the Excel sheet. I will be using Kettle for that. I’ll be using a counter to create a key for each issue. Other than that the import transformation is just your garden variety database import.
As a result the facts table looks like this:
CREATE TABLE fact_issue ( id INT PRIMARY KEY, issue_type TINYTEXT, assignee TINYTEXT, priority TINYTEXT, status TINYTEXT, resolution TINYTEXT );
Get the import transformation if you want to follow along.
Now that the data has made it to the database Mondrian needs to be configured to use it.
Telling Mondrian about the cube
Mondrian cubes are configured using schema files. The schema is going to hold a single cube. We must give it a name and define dimensions for it. A dimension is an OLAP concept that I will not bother to define with academic precision at this point. For the time being I’d like to think of a Mondrian dimension as a property of a fact (i.e. issue) that we can break down along. As such the issue type, assignee, priority, status and resolution are each dimensions of an issue. OLAP dimensions are always hierarchical in nature. The issue dimensions only have one hierarchy level each. In real world applications dimensions may have multiple hierarchies with multiple levels each. If you’re wondering how this can possibly be useful, imagine a date dimension that breaks down by year/quarter/month/day or by year/week/weekday. That’s two hierarchies with four levels and three levels respectively. An OLAP cube also needs at least one measure. Again, academic definitions are probably not useful at this point. Therefore I’d like to think of a measure as something describing a quantitative value. In the cube at hand an obvious measure seems to be the amount of issues in any given breakdown. It is interesting how many open issues are not assigned yet, for instance. In real world cubes measures usually address questions beginning with “how many?” and “how much?”
The discussion above leads to the following schema file:
<?xml version="1.0"?> <Schema name="IssueSchema"> <Cube name="Issue"> <Table name="fact_issue"/> <Dimension name="Type"> <Hierarchy hasAll="true" allMemberName="All Types"> <Level name="Type" column="issue_type" uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Assignee"> <Hierarchy hasAll="true" allMemberName="All Assignees"> <Level name="Assignee" column="assignee" uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Priority"> <Hierarchy hasAll="true" allMemberName="All Priorities"> <Level name="Priority" column="priority" uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Status"> <Hierarchy hasAll="true" allMemberName="All Status"> <Level name="Status" column="status" uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Resolution"> <Hierarchy hasAll="true" allMemberName="All Resolution"> <Level name="Resolution" column="resolution" uniqueMembers="true"/> </Hierarchy> </Dimension> <Measure name="Issue Count" column="id" aggregator="count" formatString="Standard"/> </Cube> </Schema>
The schema file must be presented to the Mondrian engine. I can not give a complete account on how to do this, since the necessary steps depend on the server software (and version) you are using. The screencast shows the steps necessary on JasperServer 3.7 CE. You’ll have to consult the manual of your server solution to find out how to introduce a new OLAP schema.
Regardless of which server you use, at some point you’ll likely need to provide an MDX query for a view of the new cube. The initial query usually is not very important in interactive environments, since you’ll be able to use the UI to navigate the cube. And since this post is not dealing with MDX as such, I’d like to present a short MDX query that will work as a kickstart for the cube without comment.
SELECT Type.Children ON COLUMNS, Assignee.Children ON ROWS FROM Issue
This post gives a practical introduction to Mondrian cubes. It shows how to create a minimal, but fully operational Mondrian cube. It also explains that an OLAP cube always revolves around a fact table and tries to give an intuitive understanding of dimensions and measures. Dimensions are essentially hierarchical properties attributed to facts that can be used to examine facts with certain characteristics at various levels of detail. Measures are describing facts in terms of numbers. They answer quantitative questions.
I hope that this little introduction enables Mondrian beginners to get up to speed faster. There is much more to know, of course. The follow up article deals with dimension tables. Check it out if you want the cube to be stored more efficiently.
If you would like to suggest a Mondrian topic, feel free to leave a comment.