Pentaho Mondrian: Custom Formatting with Cell Formatter


Repost from http://diethardsteiner.github.io/mondrian/2015/07/29/Mondrian-Cell-Formatter.html

Formatting measures in cubes is quite essential for the readability of any analysis results. Pentaho Mondrian features a formatString attribute (on the measure declaration in a Mondrian cube definition) and a FORMAT_STRING property (in MDX queries), which allow you to define the format in a Visual Basic syntax. Say you want to display 12344 as 12,344: you can easily create the following formatting mask: #,###. This kind of approach works for most use cases.
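As a minimal sketch (measure and column names here are illustrative, not from the original post), the mask can sit directly on the measure declaration in the schema:

<Measure name="Sales" column="sales" aggregator="sum" formatString="#,###"/>

Or it can be applied at query time via the FORMAT_STRING cell property, e.g. against the SteelWheelsSales sample cube:

WITH MEMBER [Measures].[Sales Formatted] AS
    [Measures].[Sales]
    , FORMAT_STRING = '#,###'
SELECT {[Measures].[Sales Formatted]} ON 0
FROM [SteelWheelsSales]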

However, sometimes you might require a custom formatting option which is not covered by the Visual Basic formatting options. Imagine e.g. that one of the measures of your cube is a duration in seconds. The integer value that you store in your database is not really a nice way of presenting this measure, unless of course you are only dealing with very small figures. But what if you have e.g. a figure like 102234 seconds? How many days, hours etc. is this?
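Spelling out the arithmetic for that figure: 102234 seconds = 1 × 86400 (one day) + 4 × 3600 (hours) + 23 × 60 (minutes) + 54 seconds, i.e. 1d 4h 23min 54sec.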

One approach of dealing with this is to create hidden calculated members which break this figure down into days, hours, minutes etc:

measure | MDX calculation
days    | CAST(INT(sec/(60*60*24)) AS INTEGER)
hours   | CAST(INT((sec/(60*60))-(days*24)) AS INTEGER)
minutes | CAST(INT((sec/60)-(days*(24*60))-(hours*60)) AS INTEGER)
seconds | CAST(ROUND(sec-(days*(24*60*60))-(hours*60*60)-(minutes*60),0) AS INTEGER)
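In the Mondrian schema these could be declared as invisible calculated members, roughly like this (a sketch, assuming the raw seconds live in a base measure called [Measures].[Duration]; sec, days etc. in the table above are shorthand for that measure and the previously defined members):

<CalculatedMember name="Days" dimension="Measures" visible="false">
    <Formula>CAST(INT([Measures].[Duration] / (60 * 60 * 24)) AS INTEGER)</Formula>
</CalculatedMember>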

You could then create a visible measure which concatenates all these invisible measures and displays the figure as 1d 4h 23min 54sec. Example final calculated visible measure:

CAST([Measures].[Days] AS STRING) || "d " || IIF([Measures].[Hours] < 10, "0", "") || CAST([Measures].[Hours] AS STRING) || ... and so forth

However, while this approach works, you will realize that you cannot sort by this measure properly! That’s rather inconvenient.

Fortunately enough, Mondrian also provides a Cell Formatter, which allows you to access the value of e.g. a measure and manipulate it in any way for display purposes, but – and this is the very important bit – this does not influence the underlying data type. So in our example, the integer value for the duration will still be an integer value and hence the sorting will work! The other really good point is that you can use various languages to manipulate the value, e.g. Java or JavaScript.

To add a special Cell Formatter, simply nest a CellFormatter XML element within the Measure or CalculatedMember XML element. Then nest a Script XML element within it to specify the script language, and finally place your code inside that element. Example (this time including weeks as well):

<Measure name="Duration" column="duration" visible="true" aggregator="sum">
    <CellFormatter>
         language="JavaScript">
            var result_string = '';
            // access Mondrian value
            var sec =  value;
            var weeks = Math.floor(sec/(60*60*24*7));
            var days = Math.floor((sec/(60*60*24)) - (weeks*7));
            var hours = Math.floor(sec/(60*60) - (weeks*7*24) - (days*24));
            var minutes = Math.floor((sec/60) - (weeks*7*24*60) - (days*24*60) - (hours*60));
            var seconds = Math.floor(sec - (weeks*7*24*60*60) - (days*24*60*60) - (hours*60*60) - (minutes*60));
            result_string = weeks.toString() + 'w ' + days.toString() + 'd ' + hours.toString() + 'h ' + minutes.toString() + 'min ' + seconds.toString() + 'sec';
            return result_string;
        
    </CellFormatter>
</Measure>

You could of course improve the JavaScript further by only showing the relevant duration portions:

var result_string = '';
// access Mondrian value
var sec =  value;
var weeks = Math.floor(sec/(60*60*24*7));
var days = Math.floor((sec/(60*60*24)) - (weeks*7));
var hours = Math.floor(sec/(60*60) - (weeks*7*24) - (days*24));
var minutes = Math.floor((sec/60) - (weeks*7*24*60) - (days*24*60) - (hours*60));
var seconds = Math.floor(sec - (weeks*7*24*60*60) - (days*24*60*60) - (hours*60*60) - (minutes*60));
if(weeks !== 0){
    result_string = weeks.toString() + 'w ' + days.toString() + 'd ' + hours.toString() + 'h ' + minutes.toString() + 'min ' + seconds.toString() + 'sec';
} else if(days !== 0){
    result_string = days.toString() + 'd ' + hours.toString() + 'h ' + minutes.toString() + 'min ' + seconds.toString() + 'sec';
} else if(hours !== 0){
    result_string = hours.toString() + 'h ' + minutes.toString() + 'min ' + seconds.toString() + 'sec';
} else if(minutes !== 0){
    result_string = minutes.toString() + 'min ' + seconds.toString() + 'sec';
} else if(seconds !== 0){
    result_string = seconds.toString() + 'sec';
} else {
    // always provide a display value - do not return null
    result_string = '0sec';
}
return result_string;

Note: The script has to return a value in any situation – it must not return null, otherwise there can be issues with the client tool. E.g. Analyzer doesn’t display all the results properly if null is returned.

Important: When adding the CellFormatter make sure that you remove the formatString attribute from the Measure or CalculatedMember XML element, otherwise this attribute will take precedence over the CellFormatter.

Amazing, isn’t it? If I had only known about this feature earlier on! A big thanks to Matt Campbell for pointing out that Mondrian has a Cell Formatter.

Pentaho Mondrian: The MDX Generate Function


Repost from http://diethardsteiner.github.io/cde/2015/04/23/MDX-Generate-Function.html

This is a brief overview of how you can use the MDX GENERATE() function. All the examples below work with the SteelWheelsSales cube, which ships with the Pentaho BA Server. Try the examples listed below with Pentaho Analyzer, the Saiku CE plugin, Pivot4J or Roland Bouman’s Pentaho Analysis Shell (Pash) (the last three are plugins, which you can download via the Pentaho Marketplace).

The GENERATE() function has two different applications:

  • If the second parameter is a set, it will return the members that exist in both sets (a UNION() of the sets). You can think of it as a kind of for-each function which checks if members in the first set exist in the second one. ALL can be used as a third parameter to keep any duplicates.
  • If the second parameter is a string expression then the members will be concatenated.
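So, roughly, the two signatures look like this:

GENERATE(<set1>, <set2> [, ALL])
GENERATE(<set1>, <string expression> [, <separator>])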

Examples:

Find the members that are in both sets:

SELECT
    [Measures].[Sales] ON 0
    , NON EMPTY GENERATE(
        {[Markets].[Country].[Australia], [Markets].[Country].[USA]}
        , [Markets].[Country].[Australia]
    ) ON 1
FROM [SteelWheelsSales]
Country   | Sales
Australia | 630,623

In this case Australia survives, because it is a member of the first as well as the second set.

For each member of the first set retrieve the first child:

SELECT
    [Measures].[Sales] ON 0
    , NON EMPTY GENERATE(
        {[Markets].[Country].[Australia], [Markets].[Country].[USA]}
        , [Markets].currentMember.firstChild
    ) ON 1
FROM [SteelWheelsSales]
State Province | Sales
NSW            | 305,567
CA             | 1,505,542

Next we’d like to know the top performing country within each territory:

SELECT
    [Measures].[Sales] ON 0
    , NON EMPTY GENERATE(
        [Markets].[Territory].Members
        , TOPCOUNT([Markets].CurrentMember.Children, 1, [Measures].[Sales])
    ) ON 1
FROM [SteelWheelsSales]
Country   | Sales
Australia | 630623
Spain     | 1215687
Japan     | 188168
USA       | 3627983

Taking the previous example a bit further, we can get the top performing country and customer combination based on sales for each territory:

SELECT
    [Measures].[Sales] ON 0
    , NON EMPTY GENERATE(
        [Markets].[Territory].Members
        , TOPCOUNT([Markets].CurrentMember.Children * [Customers].[Customer].Members, 1, [Measures].[Sales])
    ) ON 1
FROM [SteelWheelsSales]
Country   | Customer                     | Sales
Australia | Australian Collectors, Co.   | 200,995
Spain     | Euro+ Shopping Channel       | 912,294
Singapore | Dragon Souveniers, Ltd.      | 172,990
USA       | Mini Gifts Distributors Ltd. | 654,858

There is also an option to retain duplicates in the results using the ALL flag as the third argument:

SELECT
    [Measures].[Sales] ON 0
    , NON EMPTY GENERATE(
        {[Markets].[Country].[Australia], [Markets].[Country].[Australia]}
        , [Markets].currentMember.firstChild
        , ALL
    ) ON 1
FROM [SteelWheelsSales]
State Province | Sales
NSW            | 305,567
NSW            | 305,567

Now let’s have a look at a completely different way we can use the GENERATE() function:

Let’s show the territories with all their countries next to each other in one line. We can define a delimiter as the third parameter of the GENERATE() function:

WITH 
MEMBER [Measures].[Countries] AS
    GENERATE(
            [Markets].CurrentMember.Children
            , [Markets].CurrentMember.Name
            , "-"
        )
SELECT
    {[Measures].[Countries]} ON 0
    , NON EMPTY [Markets].[Territory].Members ON 1
FROM [SteelWheelsSales]
Territory | Countries
#null     | Germany-Ireland-Israel-Netherlands-Poland-Portugal-Russia-Singapore-South Africa-Spain-Switzerland
APAC      | Australia-New Zealand-Singapore
EMEA      | Austria-Belgium-Denmark-Finland-France-Germany-Ireland-Italy-Norway-Spain-Sweden-Switzerland-UK
Japan     | Hong Kong-Japan-Philippines-Singapore
NA        | Canada-USA

As mentioned earlier, this is just a brief intro to the extremely useful GENERATE() function. I hope the examples gave you some ideas on how to use this function.

MDX Solve Order, SCOPE_ISOLATION and the Aggregate() function


Reblog from http://cwebbbi.wordpress.com/2014/10/14/mdx-solve-order-scope_isolation-and-the-aggregate-function/

Solve order in MDX is a mess. Back in the good old days of Analysis Services 2000 it was a difficult concept but at least comprehensible; unfortunately when Analysis Services 2005 was released a well-intentioned attempt at making it easier to work with in fact ended up making things much, much worse. In this post I’m going to summarise everything I know about solve order in MDX to try to make this complicated topic a little bit easier to understand.

If you’re an experienced MDXer, at this point you’ll probably lose interest because you think you know everything there is to know about solve order already. Up until two weeks ago that’s what I thought too, so even if you know everything I say in the first half of this post keep reading – there’s some new stuff at the end I’ve only just found out about.

Let’s start with a super-simple cube built from a single table, with two measures (Sales Amount and Cost Amount) and a Product dimension containing a single attribute hierarchy with two members (Apples and Oranges). Everything is built from the following table:

image
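(The table itself was an image. From the values quoted later in the post, totals of 30 and 21 and Cost % values of 60% and 75%, it works out to something like this:)

Product | Sales Amount | Cost Amount
Apples  | 10           | 6
Oranges | 20           | 15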

Solve Order and calculated members in the WITH clause

To understand what solve order is and how it can be manipulated, let’s start off looking at an example that uses only calculated members in the WITH clause of a query. Consider the following:

WITH
 MEMBER [Measures].[Cost %] AS
DIVIDE([Measures].[Cost Amount],[Measures].[Sales Amount]),
FORMAT_STRING='0.0%'
MEMBER [Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]})
SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],
MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],
[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

There are two calculated members here:

  • Cost % divides Cost Amount by Sales Amount to find the percentage that costs make up of the sales amount
  • Total Fruit sums up the values for Apples and Oranges

The output of the query is as follows:

image

Solve order controls the order that MDX calculations are evaluated when two or more of them overlap in the same cell. In this case Cost % and Total Fruit are both evaluated in the bottom right-hand cell; Total Fruit is calculated first, giving the values of 30 for Sales Amount and 21 for Cost Amount, and Cost % is calculated after that. The bottom right-hand cell is the only cell where these two calculations overlap and the only cell where solve order is relevant in this query.

In this case, 70% (i.e. 21/30) is the value you would expect to get. However, you can control solve order for calculations in the WITH clause by setting the SOLVE_ORDER property for each calculated member, like so:

WITH
 MEMBER [Measures].[Cost %] AS
DIVIDE([Measures].[Cost Amount],[Measures].[Sales Amount]),
FORMAT_STRING='0.0%',
SOLVE_ORDER=1
 MEMBER [Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]}),
SOLVE_ORDER=2
 SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],
MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],
[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

Now the value in the bottom right-hand corner is 135% instead of 70%: Cost % is calculated first, then Total Fruit second so 60%+75%=135%. The SOLVE_ORDER property of a calculated member is an integer value, and the lower the SOLVE_ORDER value the earlier the calculation will be evaluated, so with Cost % having a solve order of 1 and Total Fruit having a solve order of 2, this forces Cost % to be calculated first now even though in this case it gives what is clearly an ‘incorrect’ result.

Solve Order and calculated members defined on the cube

Things now get a bit more complicated. There’s a different way of controlling solve order if your calculations are defined on the cube itself: in this case, solve order is determined by the order that the calculations appear on the Calculations tab. So if the calculations tab of the Cube Editor contains the calculations in this order:

CREATE MEMBER CURRENTCUBE.[Measures].[Cost %] AS
DIVIDE([Measures].[Cost Amount],[Measures].[Sales Amount]),
FORMAT_STRING='0.0%';
CREATE MEMBER CURRENTCUBE.[Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]});

image

…and you run the following query:

SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],
MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

You get the incorrect result again:

image

…but if you change the order of the calculations so that Total Fruit comes first, and rerun the same query, you get the correct results:

image

image

The SOLVE_ORDER property can also be used with calculations defined on the cube to override the effect of the order of calculations. So defining the following calculations on the cube:

CREATE MEMBER CURRENTCUBE.MEASURES.[Cost %] AS
DIVIDE([Measures].[Cost Amount], [Measures].[Sales Amount]),
FORMAT_STRING='PERCENT', SOLVE_ORDER=2;
 
CREATE MEMBER CURRENTCUBE.[Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]}), SOLVE_ORDER=1;

image

…means that, even though Total Fruit comes after Cost % on the Calculations tab, because it has a lower solve order set using the SOLVE_ORDER property it is evaluated before Cost % and the query still returns the correct value:

image

Solve order and calculations defined in the WITH clause and on the cube

What happens if some calculations are defined on the cube, and some are defined in the WITH clause of a query? By default, calculations defined on the cube always have a lower solve order than calculations defined in the WITH clause of a query; the SOLVE_ORDER property has no effect here. So if Total Fruit is defined in the WITH clause and Cost % on the cube, you get the incorrect result:

image

WITH MEMBER [Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]})
 SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],
MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

Of course, if Total Fruit is defined on the cube and Cost % is defined in the WITH clause you will get the correct answer. However, usually measures like Cost % are defined on the cube and it’s calculations like Total Fruit, which define custom groupings, that are defined on an ad hoc basis in the WITH clause. This is a problem.

The SCOPE_ISOLATION property

This default behaviour of calculations defined on the cube always having a lower solve order than calculations in the WITH clause can be overridden using the SCOPE_ISOLATION property. Setting SCOPE_ISOLATION=CUBE for a calculated member defined in the WITH clause will give that calculated member a lower solve order than any calculations defined on the cube. So, with Cost % still defined on the cube the following query now gives the correct results:

WITH MEMBER [Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]}),
SCOPE_ISOLATION=CUBE
 SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

The Aggregate() function

Using the MDX Aggregate() function (and in fact also the VisualTotals() function – but you probably won’t ever want to use it) inside a calculation has a similar effect to the SCOPE_ISOLATION property in that it forces a calculation to be evaluated at a lower solve order than anything else. Therefore, in the previous example, instead of using the SCOPE_ISOLATION property you can change the calculation to use the Aggregate() function instead of Sum() and get the correct results:

WITH
 MEMBER [Product].[Product].[Total Fruit] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
 SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

The general rule is, therefore, whenever you are creating custom-grouping type calculated members like Total Fruit in the WITH clause of a query, to use the Aggregate() function rather than Sum(). The fact that Aggregate() takes into account the AggregateFunction property of each measure on the cube (so that distinct count, min and max measures are dealt with correctly) is another good reason to use it.
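To sketch that last point: if the cube also had a distinct count measure, say [Measures].[Customer Count] (not part of the demo cube used in this post), Sum() over Apples and Oranges would double-count customers who bought both products, whereas Aggregate() resolves the distinct count correctly:

WITH MEMBER [Product].[Product].[Total Fruit] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
SELECT
{[Measures].[Customer Count]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],
[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES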

Using the Aggregate() function in calculations defined on the cube has the same effect. Even when the Total Fruit calculated member is defined after Cost % on the Calculations tab, as here:

image

…so long as Total Fruit uses the Aggregate() function, running the test query gives the correct result:

SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

There are some very interesting details about the way Aggregate() changes solve order though.

First of all, using the Aggregate() function in a calculated member doesn’t change the solve order of the whole calculation, just the part of the calculation that uses the Aggregate() function. With the following calculations defined on the cube:

CREATE MEMBER CURRENTCUBE.[Measures].[Cost %] AS
DIVIDE([Measures].[Cost Amount],[Measures].[Sales Amount]),
FORMAT_STRING='0.0%';

CREATE MEMBER CURRENTCUBE.[Product].[Product].[One Aggregate] 
AS AGGREGATE({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]});
 CREATE MEMBER CURRENTCUBE.[Product].[Product].[One Sum] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]});

CREATE MEMBER CURRENTCUBE.[Product].[Product].[Two Aggregates] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
+
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]});

CREATE MEMBER CURRENTCUBE.[Product].[Product].[Two Sums] AS
SUM({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
+
SUM({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]});

CREATE MEMBER CURRENTCUBE.[Product].[Product].[One Aggregate One Sum] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
+
SUM({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]});

…running the following query:

SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[One Aggregate],[Product].[Product].[One Sum],[Product].[Product].[Two Aggregates],[Product].[Product].[Two Sums],[Product].[Product].[One Aggregate One Sum]}
ON ROWS
FROM SALES

…gives these results:

image

The value returned for the calculation [One Aggregate One Sum], which contains an Aggregate() and a Sum(), shows that the value returned by the Aggregate() is evaluated at a different solve order than the value returned by Sum(), even if they are inside the same calculated member.

Furthermore, in some very obscure cases the contents of the set passed to the Aggregate() function determine whether its special solve order behaviour happens or not. I don’t know for sure what all those cases are but I have seen this happen with time utility (aka date tool aka shell) dimensions. Here’s an example.

The demo cube I’ve been using in this post has been changed to add a new dimension, called Data Type, which has just one hierarchy with one member on it called Actuals; Data Type is a fairly standard time utility dimension. The Cost % calculation has also been changed so that it’s now a calculated member on the Data Type dimension, although it is still defined on the cube. Here’s its new definition:

CREATE MEMBER CURRENTCUBE.[Data Type].[Data Type].[Cost %] AS
DIVIDE(
([Measures].[Cost Amount],[Data Type].[Data Type].&[Actuals]),
([Measures].[Sales Amount],[Data Type].[Data Type].&[Actuals])),
FORMAT_STRING='0.0%';

Now if I run the following query:

WITH MEMBER [Product].[Product].[Simple Set] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
 MEMBER [Product].[Product].[Nextmember Function Used] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Apples].NEXTMEMBER})
 MEMBER [Product].[Product].[Descendants Function Used] AS
AGGREGATE(DESCENDANTS({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]}))
 MEMBER [Product].[Product].[Descendants Function Used Twice] AS
AGGREGATE({
DESCENDANTS([Product].[Product].&[Apples]),
DESCENDANTS([Product].[Product].&[Oranges])
})
 MEMBER [Product].[Product].[Descendants Function Used Twice With Union] AS
AGGREGATE(
UNION(
DESCENDANTS([Product].[Product].&[Apples]),
DESCENDANTS([Product].[Product].&[Oranges])
))
 SELECT
{[Measures].[Sales Amount]} * [Data Type].[Data Type].ALLMEMBERS
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],
[Product].[Product].[Simple Set],[Product].[Product].[Nextmember Function Used],
[Product].[Product].[Descendants Function Used], [Product].[Product].[Descendants Function Used Twice],
[Product].[Product].[Descendants Function Used Twice With Union]}
ON ROWS
FROM [Sales With Data Type]

I get these results:

image

Note that for some of the calculations, the Aggregate() function results in a lower solve order in the way we’ve already seen, but not for all of them. Using the NextMember() function, or having two Descendants() functions without wrapping them in a Union() function, seems to stop SSAS assigning the calculation a lower solve order. Ugh. Luckily, though, I have only been able to replicate this with calculated members from two non-measures dimensions; if Cost % is a calculated measure Aggregate() always gives the lower solve order. Apparently this is something that SSAS does on purpose to try to recognise ‘visual total’-like calculated members and make them work the way you want automatically. This is definitely something to beware of if you are using time utility dimensions and calculations on other dimensions though, as it may result in incorrect values being displayed or performance problems if you’re not careful.

[Thanks to Gabi Münster for showing me how Aggregate() works with different sets and Marius Dumitru for confirming that this is intended behaviour]

Open Business Analytics Training in London #BI #BigData #ETL #OLAP


Training Main page

Training

Dates: from 28th April to 1st May 2014

Duration: 24 hours (4 days)

Location: Executive Offices Group meeting rooms, London

Address: Central Court, 25 Southampton Bldgs, WC2A 1AL

Training contents:

DAY 1
Business Intelligence Open Source Introduction and BI Server User Console
a. Pentaho 5 Architecture and new features, Mondrian, Kettle, etc…
b. Users and roles in Pentaho 5.
c. Browsing the repository in the user console.
d. Design tools.
Pentaho Data Integration (Kettle) ETL tool
a. Best Practices of ETL Processes.
b. Functional Overview (Jobs, transformations, stream control)
c. Parameters and Variables in kettle
• Environment variables and shared connections.
• ETL scheduling
d. Jobs
• Overview
• Step types (Mail, File Management, Scripting, etc…)
• Steps description
e. Transformations
• Overview
• Step types (Input, Output, Transform, Big Data, etc…)
• Steps description
f. Practical exercises
g. Data profiling with DataCleaner (pattern analysis, value distribution, date gap analysis …)
h. Talend Open Studio vs Kettle comparison
DAY 2
Data warehousing, OLAP and Mondrian
a. Datawarehouse – Datamart.
b. Star database schemas.
c. Multidimensional/OLAP
d. Mondrian ROLAP engine.
e. JPivot and MDX.
f. Designing OLAP structures Schema Workbench.
g. Tips to maximize Mondrian performance.
h. Alternatives to JPivot: STPivot, Saiku Analytics, OpenI
i. Practical Exercises
Social Intelligence
a. Introduction
b. Social KPIs (Facebook, Twitter …)
c. Samples
DAY 3
Reporting
a. AdHoc Reporting
• WAQR
• Pentaho Metadata Editor
• Creating a business model
b. Pentaho Reporting. Report Designer.
c. Practical Exercises

Big Data

a. Big Data Introduction
b. Pentaho Big Data components
c. Relational vs Columnar and Document Databases
DAY 4
Dashboards and Ctools
a. Introduction.
• Previous concepts.
• Best practices in dashboard design.
• Practical design.
b. Types of dashboards.
c. CDF
• Introduction.
• Samples.
d. CDE (Dashboard Editor)
• Introduction.
• Samples.
• Practical Exercise.
e. Ad hoc Dashboards.
• Introduction.
• Samples.
f. STDashboard
Plug-ins
a. SPARKL (Application designer)
b. Startup Rules (substitute for xactions)

Book Review: Mondrian in Action by William D. Back, Nicholas Goodman & Julian Hyde


Hi all,

Today I will post my review of a real must-read book: Mondrian in Action http://www.manning.com/back/

Mondrian in Action

One of the things that most attracted me was that this reference book is excellent for a great variety of IT roles:

  • Business Analysts
  • Data Architects
  • Business Intelligence/Analytics Consultants
  • ETL Developers
  • Application Developers
  • Enterprise Architects

Enjoy….

Chapter 1: Beyond reporting: Business analytics

The book’s first chapter is devoted to introducing you to some of the usual problems encountered with a report-based approach to analysis. It explains why creating database reports for analysis is not a good idea, how Mondrian can be used to overcome those challenges, and some of the characteristics that make the Mondrian OLAP analytic engine the best choice.


Chapter 2: Mondrian: A first look


The second chapter starts with a brief overview of the architecture; then you will discover the sort of things you can do with Mondrian. Finally, it explains how to get data from your operational systems into Mondrian to be used for analysis.

Chapter 3: Creating the data mart

This chapter is aimed at data warehouse architects, since it unveils the general architecture of an analytic solution and then moves on to explore the best database modeling technique for business analytics systems; as you surely know, this technique consists of building a star schema. The star schema is also compared with the third normal form modeling technique.

The following terms are mentioned: Dimension Tables, Fact Tables, Slowly Changing Dimensions, Star Schemas vs. Snowflakes, Junk/Degenerate Dimensions and Time Dimensions.

Chapter 4: Multidimensional Modeling: Making Analytics Data Accessible

The chapter describes the new XML schema syntax in Mondrian version 4. Logical elements (schemas, cubes, attributes and measures) and physical elements (tables and columns) are described in detail, along with how Mondrian acquires the data from the data mart. Obsolete Mondrian 3.x models are also mentioned in an upgrade section. Finally, an optimized Time Dimension is created.

Chapter 5: How schemas grow

This chapter describes advanced modeling features.

We will see how to design and use:

  • Shared Dimensions
  • Measure Groups (cubes using more than one fact table)
  • Parent-Child hierarchies
  • Hanger Dimensions for comparing Target vs. Actual values
  • Calculated Members

Chapter 6: Securing data


This chapter shows how to restrict access to specific data members, dimensions, and even the full schema using Mondrian’s role-based access control security policy. Some of the terms used are the following: SchemaGrant, CubeGrant, DimensionGrant, HierarchyGrant, MemberGrant, Measure Grants…

Chapter 7: Maximizing Mondrian Performance

This is a very important topic, since the chapter focuses on describing the different techniques available to improve Mondrian performance. Configuring Mondrian caches (Community Distributed Cache, Infinispan and Memcached), tuning the database and creating aggregate tables are some of the techniques mentioned.


Chapter 8: Dynamic Security

This chapter is a continuation of chapter 6 and explains how to handle advanced security requirements in Mondrian by using a Dynamic Schema Processor. A DSP allows the dynamic creation of a Mondrian schema tailored to the connected user. Previous knowledge of the Java language is required.

Chapter 9: Working with Mondrian and Pentaho


This chapter takes a look at a good number of tools that are frequently used with Mondrian and shows how they are used.

Pentaho Analyzer: Plug-in that provides drag & drop analysis and advanced charting features (Pentaho Enterprise Edition).

Saiku Analytics: Open source OLAP thin client interface that provides drag & drop analysis and basic charting.

Community Dashboard Framework: Open source tool that allows users to create dashboards using Mondrian data, included in Ctools suite.

Pentaho Report Designer: Open source desktop application that allows users to create pixel-perfect reports using Mondrian as a data source.

Pentaho Data Integration: Open source ETL tool (aka Kettle) which is commonly used to populate the data used by Mondrian, as mentioned in previous chapters, but which can also use Mondrian as a source of data.

Chapter 10: Developing with Mondrian

This chapter is aimed at software developers and unveils several possibilities for embedding the Mondrian engine into your custom applications.

There are two ways of using Mondrian with third-party apps:

  • XML for Analysis using a thin client.
  • Xmla4js, a JavaScript library that provides basic XML for Analysis (XML/A) capabilities, allowing JavaScript developers to access data and metadata from OLAP servers.

Chapter 11: Advanced Analytics

This final chapter covers how to do advanced analytics using the enormous power of the MDX language, both inside Mondrian and with external tools. This kind of complex analytics, through MDX, addresses many use cases such as growth, profitability and ratios.


Apart from that, it explains Mondrian’s limited What-If Analysis (aka scenarios) support, which lets Mondrian help you model and think about several “What would occur if X occurred?” questions. Then it covers how to do data mining and machine learning inside Mondrian using the R language or the Weka framework, with analyses such as clustering, forecasting or fraud detection. Finally, it briefly covers where Mondrian fits within the Big Data ecosystem (Hadoop, Hive, CouchDB) and why Mondrian is much faster with more data on a columnar analytical database (Vertica, Vectorwise, Greenplum, Infobright, InfiniDB, MonetDB, LucidDB).


Appendices

A Installing and Running Mondrian

Explains how to use the virtual machine that accompanies the book, with Pentaho CE configured with Mondrian, Saiku and Ctools.

B Online Resources

Lists all available community resources like blogs and wikis

Summarizing: although the softbound print will not be available until August 2013, I strongly recommend that you don’t miss the opportunity to purchase this wonderful book now through the early access program.

Saiku running Mondrian 4


Interest in the upcoming Mondrian 4 has increased recently, and as it reaches a more usable state and Pentaho has started publishing builds to their Artifactory repo, I decided to put together a Saiku & Mondrian 4 build.

This can be found here.

Feel free to download it, and especially check out the new-style Foodmart schema, as schema design is the big new change in Mondrian 4.

Play with it and find out what works and what doesn’t. Mondrian 4 requires people to test it, so what better way than whilst taking Saiku for a spin?

Remember it is not for production use; it is not meant for that. It’s purely a way to get people using a Mondrian 4 pre-release.

 

DOWNLOAD LINK: http://ci.analytical-labs.com/job/saiku-server-mondrian4/

 

Pentaho Aggregation Designer configuration on Ubuntu


The Pentaho Aggregation Designer (PAD) is a tool that comes from the Mondrian project. Its purpose is to generate the SQL queries that create the summary tables, the instructions to populate them with aggregated data, and the modifications to the Mondrian schema so the engine can use them.

The engine can then check that precalculated data is available, based on the names of the fields used in the keys, and use those tables to speed up the display of results. The cost is that the cube construction will take considerably longer.
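For reference, the schema modification PAD generates has roughly this shape (a hand-written sketch with made-up table and column names, not actual PAD output):

<Table name="orderfact">
    <AggName name="agg_sales_by_year">
        <AggFactCount column="fact_count"/>
        <AggMeasure name="[Measures].[Sales]" column="sales_sum"/>
        <AggLevel name="[Time].[Years]" column="year_id"/>
    </AggName>
</Table>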

Install

  1. Download it from the Mondrian SourceForge page (1.2.2-stable).
  2. Unzip its contents to /pentaho/aggregation-designer
  3. Open a terminal, navigate to that folder and make the .sh files executable:
    $ chmod +x *.sh

To start using it

In that folder, type:

$ ./startaggregationdesigner.sh
  1. Click Configure and fill in the parameters as we have done before: SampleData as the name, localhost, sampledata, pentaho_user, password. Click on Test and then OK.
  2. Click on Mondrian schema and locate steelwheels.mondrian.xml in /Pentaho/biserver-ce/pentaho-solutions/steel-wheels/analysis/, click Apply.
  3. Select the Steelwheels model and click Connect.
    PAD will check for primary keys and nulls on foreign keys (which should not exist in a correctly done ETL process). Click OK.
  4. Now click Advisor and then Recommend in the dialog window.
  5. If you click on one of the recommended aggregates in the right bottom panel, the columns used will be displayed on the left bottom panel.

    PAD Screen

    You can select which ones to generate, as it shows a graph with cost to build (number of records) against benefit (time to get the result).

  6. Click export to open a dialog window:
    Click on Create Aggregates: export and save it to a file like create.sql
    Click on Populate Aggregates: export to a file like populate.sql
    Click on Publish Mondrian Schema: export to a file like schema.xml
  7. Save your work to a file like PAD_workspace and exit
  8. Now you can check and modify the scripts, maybe adding TRUNCATE SQL instructions for the tables in the populate file (see the sketch below).
    You can automate running them with an .xaction sequence and a PDI job.
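A minimal sketch of that tweak (table and column names are made up; the INSERT … SELECT stands in for whatever PAD actually wrote to populate.sql):

-- clear the aggregate table before reloading it
TRUNCATE TABLE agg_sales_by_year;
-- repopulate it from the fact table
INSERT INTO agg_sales_by_year (year_id, sales_sum, fact_count)
SELECT year_id, SUM(sales), COUNT(*)
FROM orderfact
GROUP BY year_id;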

Reading Material

  • Julian Hyde’s blog post about the 2.0 BI release (and PAD) here
  • Mondrian documentation here
  • User Guide 1.0 (PDF) and 1.2 (Scribd)
  • If you use LucidDB, here’s an article on using it with a plugin: LucidDbAggregateDesigner
  • In 2008, Chris Webb made an accurate prediction about Mondrian, column databases (LucidDB) and PAD here