Executing JNDI based Pentaho reports using the Java API

While working on the Alfresco Business Reporting project, I built a tool that can execute Pentaho/Jaspersoft reports (against Alfresco business objects, using the Pentaho/Jaspersoft Java API) on a scheduled basis. I ran into the issue of portability of the (Pentaho) reports: the database, URL and credentials were not identical on my Development, Test and Production environments… JNDI is the answer.

In the JNDI (Java Naming and Directory Interface) world, a datasource is identified by a name, and each naming provider can carry different credentials/URLs for the same name in different environments. Datasources can be LDAP systems, JDBC connections etc. Just what I needed.
I ran into 3 issues, initially entangled:
  1. How does the Pentaho reporting API deal with JNDI
  2. Running a report using my custom Java logic from a command-line
  3. Running a report from within Alfresco – How to manage the duplicate credentials. The reporting database connection properties were already defined in alfresco-global.properties, and I prefer to re-use these in the named JNDI connection.
 Although I like journeys, I do prefer reaching the destination too, especially in this particular case. There were not many useful guides along the route, so finally figuring it out also resulted in this blog.
How does the Pentaho reporting API deal with JNDI

Pentaho report definitions (.prpt files) are zip files containing mostly XML files. One of these files contains all database related info. This means the JNDI name and the JDBC username/password/driver/url information is stored inside the report. It appears hard to override when executing the report using the Java API.
The Pentaho API has a class MasterReport with the methods setDataFactory and setDataSource. The DataFactory is typically used for JDBC-like connections (user, password, url, driver). I have had no success overriding these values when executing a report in a different environment.
The JNDI name is stored in the report too, but since it is resolved against the context at time of execution, it is perfectly fine to use the name as stored. Exactly how JNDI was meant. Find here how to configure the Pentaho Report Designer for JNDI.
Running a report using my custom Java logic from a command-line
I created a command-line version of the logic to execute reports. Building stuff into Alfresco, restarting and the full procedure takes too much time. But how to use JNDI like this?
Most guides expect some kind of JNDI provider (usually application servers like JBoss or Tomcat), and tell you how to get definitions from there. I did not want to depend on yet another definition of my reporting database details; it is in Alfresco's global properties already. I got stuck in Simple-JNDI and related tools, which apparently get datasources and details from, yet again, an existing JNDI provider. The report executor does not accept anything that Simple-JNDI provides (DataSource or Connection). (I wonder how Pentaho Report Designer does the trick itself… I have had my share of reverse engineering for now.) The Pentaho report just requests the JNDI name, and gets it provided by the environment. How to get such a responsive environment?
I finally ran into this guide from Oracle. It inspired me to finalize their idea (stopRegistry was missing):
import java.rmi.NoSuchObjectException;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.ExportException;
import java.rmi.server.UnicastRemoteObject;
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource;

// Start a local RMI registry that will act as our own JNDI provider.
private static Registry startRegistry() throws RemoteException {
    System.out.println("enter startRegistry");
    Registry registry = null;
    try {
        registry = LocateRegistry.createRegistry(Registry.REGISTRY_PORT);
        System.out.println("exit startRegistry: RMI registry created.");
    } catch (ExportException e) {
        System.out.println("exit startRegistry: RMI registry already existed.");
    }
    return registry;
}

private static boolean stopRegistry(Registry registry) {
    System.out.println("enter stopRegistry");
    boolean result = false;
    try {
        result = UnicastRemoteObject.unexportObject(registry, true);
    } catch (NoSuchObjectException e) {
        System.out.println("stopRegistry: RMI registry already stopped.");
    }
    System.out.println("exit stopRegistry: RMI registry stopped: " + result);
    return result;
}

// Create an InitialContext backed by the local RMI registry.
private static InitialContext createContext() throws NamingException {
    Properties env = new Properties();
    env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.rmi.registry.RegistryContextFactory");
    env.put(Context.PROVIDER_URL, "rmi://localhost:" + Registry.REGISTRY_PORT);
    return new InitialContext(env);
}

// Bind a MySQL datasource under the JNDI name the report refers to.
private static void setJndiDataSource(String source) throws NamingException, RemoteException {
    System.out.println("enter setJndiDataSource source=" + source);
    MysqlConnectionPoolDataSource dataSource = new MysqlConnectionPoolDataSource();
    dataSource.setUser("username");
    dataSource.setPassword("password");
    dataSource.setServerName("localhost");
    dataSource.setPort(3306);
    dataSource.setDatabaseName("alfrescoreporting");
    InitialContext context = createContext();
    context.rebind("java:/comp/env/jdbc/" + source, dataSource);
    System.out.println("exit setJndiDataSource");
}
Note that the unexportObject() method only works with the exact object returned by the createRegistry() call. I invoked startRegistry() and setJndiDataSource() before, and stopRegistry() after, Pentaho's reportProcessor.processReport(), and got exactly what I needed! (I suggest you call stopRegistry() in the finally part of the try/catch block around processReport().)
Now I was able to build my own JNDI provider that actually allowed me to execute a Pentaho JNDI-based report from the command-line (e.g. without any application server).
Running a report from within Alfresco
It would be too easy if my command-line code worked inside an application server too. The application server (Tomcat in my case) owns the registry. You can access the registry read-only, but you cannot rebind, and therefore cannot 'dynamically' add the appropriate details for any given JNDI name. This actually makes sense, of course.
Basically, define a JNDI source within Tomcat (or your application server of choice). I followed Tomcat's example, and added
<Resource defaultTransactionIsolation="-1" defaultAutoCommit="false"
    maxActive="15" maxIdle="3" initialSize="1"
    username="username" password="password"
    driverClassName="org.gjt.mm.mysql.Driver"
    type="javax.sql.DataSource" auth="Container"
    name="jdbc/alfrescoReporting"/>
to my META-INF/context.xml. My web.xml was enhanced with:
<resource-ref>
    <description>Alfresco Reporting JNDI definition</description>
    <res-ref-name>jdbc/alfrescoReporting</res-ref-name>
    <res-type>javax.sql.DataSource</res-type>
    <res-auth>Container</res-auth>
</resource-ref>
You also need to make sure your database driver (mysql-connector-java-5.x.yy.jar, or any Postgres, Oracle, MSSql equivalent) is located in tomcat/lib. (Usually it is already there.)
According to the guys from Tomcat, you're done. However, it doesn't work. Apparently you also need to tweak <applicationname>.xml (in my case: alfresco.xml) in tomcat/conf/Catalina/localhost. You need to insert exactly the same <Resource> line(s) as added to context.xml above.
And now we’re done. Pentaho reports are executed using the Java API in a web application like Alfresco using JNDI!

Single Sign-On (SSO) on Pentaho Community Edition using CAS


We have lots of different systems used by corporate users, customers and providers, so managing users for all those systems started to become quite complicated. So we decided to implement CAS as our single sign-on solution.

We started integrating CAS with some of our internal websites (mostly written in .Net), then we integrated it with several corporate WordPress blogs, and after a customer request, we integrated it with our business intelligence solution, Pentaho.

We are using the Community Edition of Pentaho, and the documentation for integrating CAS and Pentaho is only available for the Enterprise Edition. So after some googling and some Spring magic, we got our SSO working with Pentaho. Please note that this tutorial doesn't cover the installation and configuration of CAS, and that we have tried this on Pentaho Community Edition 3.10.0-STABLE, but it will probably work on other versions.

This is how we did it:

1. Download required CAS and Spring Security libraries and add them to Pentaho:

– Download spring-security-cas-client-2.0.5.RELEASE.jar
– Download cas-client-core-3.2.1.jar
– Copy both jars to biserver-ce/tomcat/webapps/pentaho/WEB-INF/

2. Create the CAS configuration file applicationContext-spring-security-cas.xml

Create the file applicationContext-spring-security-cas.xml in biserver-ce/pentaho-solutions/system/ with this content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springsource.org/dtd/spring-beans.dtd">
<beans default-autowire="no" default-dependency-check="none">

<bean id="filterChainProxy" class="org.springframework.security.util.FilterChainProxy"
    autowire="default" dependency-check="default" lazy-init="default">
  <property name="filterInvocationDefinitionSource">
    <!-- filter chain definition omitted here; adapt the one from
         applicationContext-spring-security.xml -->
  </property>
</bean>

<bean id="serviceProperties" class="org.springframework.security.ui.cas.ServiceProperties"
    autowire="default" dependency-check="default" lazy-init="default">
  <property name="sendRenew" value="false" />
</bean>

<bean id="casProcessingFilter" class="org.springframework.security.ui.cas.CasProcessingFilter"
    autowire="default" dependency-check="default" lazy-init="default">
  <property name="authenticationManager">
    <ref bean="authenticationManager" />
  </property>
  <property name="authenticationFailureUrl" value="/Login?login_error=2" />
  <property name="defaultTargetUrl" value="/" />
  <property name="filterProcessesUrl" value="/j_spring_cas_security_check" />
</bean>

<bean id="casSingleSignOutFilter" class="org.jasig.cas.client.session.SingleSignOutFilter" />

<bean id="casSingleSignOutHttpSessionListener" class="org.jasig.cas.client.session.SingleSignOutHttpSessionListener" />

<bean id="exceptionTranslationFilter" class="org.springframework.security.ui.ExceptionTranslationFilter"
    autowire="default" dependency-check="default" lazy-init="default">
  <property name="authenticationEntryPoint">
    <ref local="casProcessingFilterEntryPoint" />
  </property>
  <property name="accessDeniedHandler">
    <bean class="org.springframework.security.ui.AccessDeniedHandlerImpl" />
  </property>
</bean>

<bean id="casProcessingFilterEntryPoint" class="org.springframework.security.ui.cas.CasProcessingFilterEntryPoint"
    autowire="default" dependency-check="default" lazy-init="default">
  <property name="loginUrl" value="http://localhost:8080/cas/login" />
  <property name="serviceProperties">
    <ref local="serviceProperties" />
  </property>
</bean>

<bean id="authenticationManager" class="org.springframework.security.providers.ProviderManager"
    autowire="default" dependency-check="default" lazy-init="default">
  <property name="providers">
    <list>
      <ref bean="anonymousAuthenticationProvider" />
      <ref bean="casAuthenticationProvider" />
    </list>
  </property>
</bean>

<bean id="casAuthenticationProvider" class="org.springframework.security.providers.cas.CasAuthenticationProvider">
  <property name="userDetailsService">
    <ref bean="userDetailsService" />
  </property>
  <property name="serviceProperties">
    <ref local="serviceProperties" />
  </property>
  <property name="ticketValidator">
    <ref local="ticketValidator" />
  </property>
  <property name="key" value="my_password_for_this_auth_provider_only" />
</bean>

<bean id="ticketValidator" class="org.jasig.cas.client.validation.Cas20ServiceTicketValidator"
    autowire="default" dependency-check="default" lazy-init="default">
  <constructor-arg index="0" value="http://localhost:8080/cas" />
</bean>

<bean id="logoutFilter" class="org.springframework.security.ui.logout.LogoutFilter"
    autowire="default" dependency-check="default" lazy-init="default">
  <!-- constructor arguments (logout success URL and the logout handler beans)
       omitted here -->
  <property name="filterProcessesUrl" value="/Logout" />
</bean>

</beans>

You should change the URLs in this file to match your Pentaho and CAS installation.

3. Edit pentaho-spring-beans.xml:

Edit the file biserver-ce/pentaho-solutions/system/pentaho-spring-beans.xml to make it look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springsource.org/dtd/spring-beans.dtd">
<beans>
  <import resource="pentahoSystemConfig.xml" />
  <import resource="adminPlugins.xml" />
  <import resource="systemListeners.xml" />
  <import resource="sessionStartupActions.xml" />
  <import resource="applicationContext-spring-security.xml" />
  <import resource="applicationContext-spring-security-cas.xml" />
  <import resource="applicationContext-common-authorization.xml" />
  <import resource="applicationContext-spring-security-hibernate.xml" />
  <import resource="applicationContext-pentaho-security-hibernate.xml" />
  <import resource="pentahoObjects.spring.xml" />
</beans>

Please note that we’ve added applicationContext-spring-security-cas.xml after applicationContext-spring-security.xml.

4. Add CAS certificate to Java keyring in Pentaho server

If your CAS server is not running in the same machine as your Pentaho installation, you will need to import its certificate into the Java VM installation.

– Copy your .cert or .pem file to the Pentaho machine.
– Execute: keytool -import -trustcacerts -file your_cert.crt -alias CAS -keystore /usr/lib/jvm/java-6-sun/jre/lib/security/cacerts

Please change your CAS cert filename and the path to your JRE installation.

5. Add users to Pentaho database

Unfortunately, you will have to add users to the Pentaho database (or use the administration console). The only thing you have to keep in mind is that the username in Pentaho has to be equal to the username in CAS.

6. Restart Pentaho and enjoy

After restarting Pentaho, when you try to access it you will be redirected to the CAS login form.

Beginners Guide to CDF (Community Dashboard Framework)

The Dashboard will allow the user to search for companies. Companies are arranged in a 3-level hierarchy, and the Dashboard will allow users to navigate down the hierarchy: if the user clicks a company in one level, the levels below it will show its subsidiaries and the levels above it will show its parent companies. Whenever a company is selected (in any of the levels), the graphs on the bottom display additional data for the company.

To accomplish the task, we are going to use Pentaho BI Server and CDF, which is an open source dashboard framework developed by Pedro Alves from WebDetails.

So, first things first, we have to set up our infrastructure. The good news is that CDF is now an integral part of Pentaho CE, so we just need to download and install Pentaho 3.6:

$ wget http://sourceforge.net/projects/pentaho/files/Business%20Intelligence%20Server/3.6.0-stable/biserver-ce-3.6.0-stable.tar.gz/download
$ tar xzvf biserver-ce-3.6.0-stable.tar.gz
$ cd biserver-ce
$ ./start-pentaho.sh

Great, you can see that it is working by navigating to port 8080 on localhost. You can log in using the default user and password (joe/password); there are many samples, including CDF samples, and CDF documentation is also available through the system.

I am going to publish the system on the web, so I have to edit tomcat/conf/server.xml to change the Connector port from 8080 to just 80, and also edit tomcat/webapps/pentaho/WEB-INF/web.xml to modify the base-url parameter to the URL I am going to use.

The next step is to create a datasource for the database. This can be done using the administration console, but for me it is easier to just edit tomcat/webapps/pentaho/META-INF/context.xml and add this:

<Resource name="jdbc/datasource" auth="Container" type="javax.sql.DataSource"
    factory="org.apache.commons.dbcp.BasicDataSourceFactory" maxActive="20" maxIdle="5"
    maxWait="10000" username="biserver" password="nottelling"
    driverClassName="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/dw"/>

At this point don’t forget to restart the server.

OK, the next step is to create a solution and a dashboard; the easiest way is to copy-paste from the examples.

$ cd pentaho-solutions
$ cp -r cdf-samples/20-samples/charts mySolution
$ cd mySolution

The solution is defined by the index.xml file and we can edit the index.properties file to give it a name and a description. The Dashboard is defined by the .xcdf file which we should also rename and edit to our own names.

Now you can go back to the web-browser, in order to see the new solution and new Dashboard you have to “refresh the repository cache” (there’s a menu option for that and also a refresh icon) and Voilà – a new Dashboard was born.

The content of the dashboard is stored in a template.html file; there are several parts to it, which I will explain below.

First there is the html template itself which defines the structure of the Dashboard. In my case I needed a search box, 3 multi-select-boxes, two bar-charts and one gauge which I am going to layout as follows:

<table align="center" style="border: 0px solid">
<tr>
<td valign="top">Search:<div id="search_object"></div></td>
</tr>
<tr>
<td valign="top">A Level:</td>
<td valign="top">B Level:</td>
<td valign="top">C Level:</td>
</tr>
<tr>
<td valign="top"><div id="a_object"></div></td>
<td valign="top"><div id="b_object"></div></td>
<td valign="top"><div id="c_object"></div></td>
</tr>
<tr>
<td valign="top"><div id="channel_object"></div></td>
<td valign="top"><div id="dialChart_object"></div></td>
<td valign="top"><div id="product_object"></div></td>
</tr>
</table>

The next step is to create the different components using CDF, the search box is declared using javascript:

var query;
var searchComponent = {
    name: "searchComponent",
    type: "textInputComponent",
    htmlObject: "search_object",
    executeAtStart: true,
    parameter: "query",
    tooltip: "enter partial company name and press Enter",
    postChange: function(){ searchFunction(query); }
};

The htmlObject property is very important, it tells CDF where in the html template the component is going to reside.

The postChange trigger function is going to be called when the user enters a new search query, and will be described in more detail later.

Each of the 3 multi-select-boxes is declared in a similar way:

var a;
var aComponent = {
    name: "aComponent",
    type: "selectMultiComponent",
    parameters: [],
    parameter: "a",
    size: "10",
    valueAsId: false,
    htmlObject: "a_object",
    executeAtStart: true,
    queryDefinition: aQueryDefinition,
    listeners: ["event"],
    postChange: function(){ aFunction(a); }
};

The queryDefinition object, as you must have already guessed, is responsible for querying the data that is going to populate the component; it will be described in more detail below.

Bar-charts are created using this code:

var channelComponent = {
    name: "channelComponent",
    type: "jFreeChartComponent",
    chartDefinition: channelDefinition,
    htmlObject: "channel_object",
    executeAtStart: true,
    listeners: ["event"]
};

The listeners property lists the event names to which this component listens; whenever such an event is fired, the component will refresh its data.

Finally the code for a gauge component:

var overallComponent = {
    name: "overallComponent",
    type: "dialComponent",
    chartDefinition: dialChartDefinition,
    htmlObject: "dialChart_object",
    executeAtStart: true,
    listeners: ["event"]
};

Now that all the components are ready we can go ahead and initiate the Dashboard with this call:

Dashboards.init([searchComponent, aComponent, bComponent, cComponent, overallComponent, channelComponent, productComponent]);

CDF includes many components such as Buttons, Combo-boxes, Menus, Tables, Maps, FlashCharts, Pivot views, Traffic lights and more… check out the online documentation for the full list, descriptions and samples.

The next thing we want to define is the behavior of the Dashboard. In my case, the state of the Dashboard is going to be managed using 4 variables, and just one event is used to refresh all the views when the state changes:

var queryStatus;
var aStatus;
var bStatus;
var cStatus;

Now we can go back and look at the postChange function of the search component, which stores the query and resets any previous state. The call to fireChange will dispatch the event to all our views, causing them to update.

function searchFunction(val) {
    queryStatus = val;
    aStatus = "";
    bStatus = "";
    cStatus = "";
    Dashboards.fireChange("event", queryStatus);
}

And for each of the multi-select components, the code is just as simple:

function aFunction(val) {
    queryStatus = "";
    aStatus = val;
    bStatus = "";
    cStatus = "";
    Dashboards.fireChange("event", queryStatus);
}
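To make the event wiring concrete, here is a toy stand-in for the Dashboards object (an illustration of the listener mechanism only, not the real CDF implementation): firing an event re-runs every component that lists that event name in its listeners.

```javascript
// Toy mock of CDF's event dispatch, for illustration only: fireChange
// re-runs the update of every registered component listening to the event.
var components = [];
var Dashboards = {
  init: function (list) { components = list; },
  fireChange: function (eventName, value) {
    components.forEach(function (c) {
      if ((c.listeners || []).indexOf(eventName) >= 0) c.update(value);
    });
  }
};

// Two fake components: only the first listens to "event".
var updated = [];
Dashboards.init([
  { name: "channelComponent", listeners: ["event"],
    update: function (v) { updated.push("channelComponent"); } },
  { name: "searchComponent", listeners: [],
    update: function (v) { updated.push("searchComponent"); } }
]);

Dashboards.fireChange("event", "acme");
console.log(updated); // [ 'channelComponent' ]
```

This is why the chart components above list "event" in their listeners while the input components do not: the inputs drive the state, the charts react to it.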

Now we are approaching the last part of the development, which is to define the data queries in either SQL or MDX. The query is constructed based on the state of the dashboard.

The queryDefinition object for the multi-selects is:

var aQueryDefinition = {
    queryType: 'sql',
    jndi: "datasource",
    query: function(){
        var q = "SELECT distinct A_ID, A_NAME " +
            "FROM company " +
            "WHERE 1=1 " +
            (queryStatus == "" ? "" : ("AND A_NAME like '%" + queryStatus + "%' ")) +
            (aStatus == "" ? "" : ("AND A_ID=" + aStatus + " ")) +
            (bStatus == "" ? "" : ("AND B_ID=" + bStatus + " ")) +
            (cStatus == "" ? "" : ("AND C_ID=" + cStatus + " ")) +
            "LIMIT 100";
        return q;
    }
};

As you can see, we are using an SQL query against the datasource to pull a list of all the A-level companies that match the criteria defined by the status of the Dashboard: if the status is a search query, we pick only companies whose names contain it, and if the status is a company, we pick only companies on that branch of the hierarchy.
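To see what that mapping actually produces, here is the same builder logic copied out as a standalone function (so it can run outside the dashboard), with the SQL emitted for two sample states:

```javascript
// Standalone copy of the A-level query builder, taking the four state
// variables as arguments instead of reading globals.
function buildAQuery(queryStatus, aStatus, bStatus, cStatus) {
  return "SELECT distinct A_ID, A_NAME FROM company WHERE 1=1 " +
    (queryStatus == "" ? "" : ("AND A_NAME like '%" + queryStatus + "%' ")) +
    (aStatus == "" ? "" : ("AND A_ID=" + aStatus + " ")) +
    (bStatus == "" ? "" : ("AND B_ID=" + bStatus + " ")) +
    (cStatus == "" ? "" : ("AND C_ID=" + cStatus + " ")) +
    "LIMIT 100";
}

// After a search for "acme":
console.log(buildAQuery("acme", "", "", ""));
// SELECT distinct A_ID, A_NAME FROM company WHERE 1=1 AND A_NAME like '%acme%' LIMIT 100

// After selecting B-level company 7:
console.log(buildAQuery("", "", "7", ""));
// SELECT distinct A_ID, A_NAME FROM company WHERE 1=1 AND B_ID=7 LIMIT 100
```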

Similarly, the queryDefinition of the bar-charts is:

var channelDefinition = {
    width: 300,
    height: 250,
    chartType: "BarChart",
    datasetType: "CategoryDataset",
    is3d: "false",
    isStacked: "true",
    includeLegend: "false",
    foregroundAlpha: 0.7,
    queryType: 'sql',
    jndi: "datasource",
    title: "Channel Revenues:",
    query: function(){
        var query = "SELECT name, sum(Revenue) " +
            "FROM company LEFT JOIN revenue ON company.id=COMPANY_SEQ LEFT JOIN channel ON CHANNEL_SEQ=channel.id " +
            "WHERE " +
            (aStatus == "" && bStatus == "" && cStatus == "" ? "1=0 " : "1=1 ") +
            (aStatus == "" ? "" : ("AND A_ID=" + aStatus + " ")) +
            (bStatus == "" ? "" : ("AND B_ID=" + bStatus + " ")) +
            (cStatus == "" ? "" : ("AND C_ID=" + cStatus + " ")) +
            "GROUP BY id ORDER BY sum(Revenue) DESC LIMIT 10";
        //query = "SELECT 'London',100";
        return query;
    }
};

This query returns the total revenues by channel of all the companies on the branch of the selected company; if no company is selected, the query returns no results.

The gauge query will select the average value over all the companies in the branch.

var dialChartDefinition = {
    width: 300,
    height: 200,
    chartType: "DialChart",
    queryType: 'sql',
    is3d: 'true',
    jndi: "datasource",
    title: "Overall",
    intervals: [40,60,100],
    includeLegend: true,
    query: function(){
        var query = " SELECT avg(ovrall) FROM overall LEFT JOIN company ON COMPANY_SEQ=id " +
            " WHERE " +
            (aStatus == "" && bStatus == "" && cStatus == "" ? "1=0 " : "1=1 ") +
            (aStatus == "" ? "" : ("AND A_ID=" + aStatus + " ")) +
            (bStatus == "" ? "" : ("AND B_ID=" + bStatus + " ")) +
            (cStatus == "" ? "" : ("AND C_ID=" + cStatus + " "));
        //query = "SELECT 75";
        return query;
    }
};

That would be all. We have defined the Dashboard's view by supplying the html template and the components to populate it; the Dashboard's controller was defined using the components' callback functions and the update event; and the Dashboard's data model by defining the dynamic queries against the datasource. The Dashboard is ready to be used. Enjoy!



Pentaho Aggregation Designer on Ubuntu configuration

The Pentaho Aggregation Designer (PAD) is a tool that comes from the Mondrian project. Its purpose is to generate the SQL queries that create summarization tables, the instructions for data aggregation, and the modifications to the Mondrian schema so the engine can use them.

The engine can check whether precalculated data is available, based on the names of the fields used in keys, and use it to speed up the display of results. The cost is that cube construction will take considerably longer.


  1. Download it from the Mondrian SourceForge page (1.2.2-stable).
  2. Unzip its content to /pentaho/aggregation-designer
  3. Open a terminal, navigate to that folder and make the .sh files executable:
    $ chmod +x *.sh

To start using it, go to that folder and type:

$ ./startaggregationdesigner.sh
  1. Click Configure and fill in the parameters as we have done before: SampleData name, localhost, sampledata, pentaho_user, password. Click Test, then Ok.
  2. Click on Mondrian schema and locate steelwheels.mondrian.xml in /Pentaho/biserver-ce/pentaho-solutions/steel-wheels/analysis/, then click Apply.
  3. Select the Steelwheels model, and click Connect.
    PAD will check for primary keys and nulls on foreign keys – which should not exist in a correctly done ETL process. Click Ok.
  4. Now click Advisor and then Recommend in the dialog window.
  5. If you click on the right bottom panel, the columns used will be displayed on the left bottom panel.

    [Screenshot: the PAD screen]

    You can select which ones to generate, as it shows a graph of cost to build (number of records) against benefit (time to get the result).

  6. Click Export to open a dialog window:
    Click on Create Aggregates: export to save it to a file like create.sql
    Click on Populate Aggregates: export to a file like populate.sql
    Click on Publish Mondrian Schema: export to a file like schema.xml
  7. Save your work to a file like PAD_workspace and exit.
  8. Now you can check and modify the scripts, maybe adding TRUNCATE SQL instructions for the tables in the populate file.
    You can automate running them with an .xaction sequence and a PDI job.

Reading Material

  • Julian Hyde's blog post about the 2.0 BI release – and PAD here
  • Mondrian documentation here
  • User Guide 1.0 PDF and 1.2 on Scribd
  • If you use LucidDB: here's an article about using it with the LucidDbAggregateDesigner plugin
  • In 2008, Chris Webb made an accurate prediction about Mondrian, column databases (LucidDB) and PAD here

Pentaho and MongoDB

I am at the MongoNYC conference in New York today, where Pentaho is a sponsor. 10gen have done a great job with this event, with 1,000 attendees.

We just announced a strategic partnership between 10gen and Pentaho. From a technical perspective the integration between MongoDB and Pentaho means:

  • No Big Silos. Data silos are bad. Big ones are no better. Our MongoDB ETL connectors for reading and writing data mean you can integrate your MongoDB data store with the rest of your data architecture (relational databases, hosted applications, custom applications, etc).
  • Live reporting. We can provide desktop and web-based reports directly on MongoDB data.
  • Staging. We can provide trending and historical analysis by staging snapshots of MongoDB aggregations in a column store.

I’m looking forward to working with 10gen to integrate some of their new aggregation capabilities into Pentaho.

Community Data Validation (thanks to WebDetails & Pedro Alves for this job)

Why?

We need a way to do data validation.

Use cases

Below are some of the use cases we want to tackle. Emphasized are the ones we think the current spec satisfies.

  • Global
    • Is the server running?
    • Is the server running properly?
  • Connectivity
    • Do we have all the access we should? (network / database)
  • Query specific
    • Do we have up to date data?
    • Can we trust the data?
    • How long did the queries take to run?
    • Do we have wrong data? (duplicated users in community)
    • Do we have a big number of ‘unknowns’? (tk=1 in DW)
    • Do we have peaks or valleys in the data? (due to double process or no process)
    • Is the data stalled? (eg: number of twitter followers not updating)
    • Did the data format change?
    • We need a way to handle known effects (eg: Christmas dip)
    • We need to correlate independent datasources (eg: comparing AUS with Blocklist evolution)
    • We need to connect the long running queries from CDA to CDV
    • Be able to validate big chunks of reprocessing
    • Do we have clearly wrong lines in resultset? (eg: a line there)
  • Dashboards
    • Are the dashboards rendering properly?
      • Do we have all the components?
      • Any layout change?
      • Any js errors?
    • Are the dashboards performing properly?
    • Can CDF talk with CDV to report client-side errors?
    • Alternatively, can CDA talk with CDV to report query errors?
      • Who caused the error?
  • ETL
    • Did the etl run?
    • Are we processing the expected amount of data?
    • Is the etl taking the expected time to run?
    • Did the etl finish before X am?
    • Test etl against tracer bullets?

Work flow

We expect from this system:

  • A central dashboard that allows us to quickly glimpse the overall status of our system.
    • Did all tests pass?
    • Which one failed?
    • Why?
    • When was the last time the tests ran?
  • We need simple ways to define the tests (based on existing CDAs)
  • We need to know which queries failed and which queries took a long time to run
  • We need a push notification system by email
  • We need to make sure it can talk to Nagios
  • We need an outside test to check if the server is up

Logging types

Every test will result in one of the following levels:

  • Critical
  • Error
  • Warn
  • Ok

Each specific test will be responsible for converting the output of that test (validation function for cda, tbd for kettle) into that status. The object format is:

{
  level: "Critical",
  type: "Missing data",
  description: "Whatever string the user defined"
}

On each test definition, we need to be able to optionally set a timing threshold for the queries, and that will automatically generate a log entry with type 'Duration'.
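As a sketch of how such a timing threshold could map a measured duration to a level (this is our reading of the spec, not CDV's actual code; the field names expected, warnPercentage, errorPercentage and errorOnLow come from the executionTimeValidation block shown later):

```javascript
// Hypothetical mapping of a query duration to a log level: deviations beyond
// the warn/error percentages of the expected time escalate the level, and
// errorOnLow also flags suspiciously fast runs.
function durationLevel(cfg, actualMs) {
  var deviation = (actualMs - cfg.expected) / cfg.expected;
  if (cfg.errorOnLow && -deviation >= cfg.errorPercentage) return "Error";
  if (deviation >= cfg.errorPercentage) return "Error";
  if (deviation >= cfg.warnPercentage) return "Warn";
  return "Ok";
}

var cfg = { expected: 5000, warnPercentage: 0.30, errorPercentage: 0.70, errorOnLow: true };
console.log(durationLevel(cfg, 7000)); // Warn  (40% over the expected 5s)
console.log(durationLevel(cfg, 9000)); // Error (80% over)
console.log(durationLevel(cfg, 1000)); // Error (80% under, caught by errorOnLow)
```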

Test types

There are 4 possible types of tests:

  • CDA based query validation
  • ETL monitoring
  • Datawarehouse validation (a specific set of the cda based query validation)
  • Dashboard validation (we may opt to leave this one out for now as we’ll try to infer the errors from CDA’s 405)

CDA based query


We want to select one or more cda / dataAccessId from our system, define the input parameters and select the type of validation we need.
The shape of the function will be: f( [ query, [params] ], validationFunction )
The generic test will be the implementation of the validation function:

validationFunction = function ( [ {metadata: [] , resultset: [[]]} ] ) :  value

That will be freely mapped to the log outputs.
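For example, a hypothetical validation function with the shape described above: it takes the array of CDA-style query results and reduces them to a single value (here, the number of rows in the first resultset), which the alert rules can then map to a level.

```javascript
// Hypothetical validation function: reduce the query results to one value.
var validationFunction = function (results) {
  return results[0].resultset.length;
};

// A fake CDA-style result: metadata plus a resultset of rows.
var sampleResults = [{
  metadata: ["day", "rows_loaded"],
  resultset: [["2012-05-01", 1000], ["2012-05-02", 0]]
}];

console.log(validationFunction(sampleResults)); // 2
```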

ETL monitoring query

The workflow defined here has to match the previous section. We'll build specific CDA queries that will read the Kettle log files. From that point on, specific validations will have to be built for these logs.
We'll need, in Pentaho, to define which connection refers to the Kettle logging tables, either by defining a special JNDI name or by specifying it in the settings.
We’ll need to test for:

  • Time
  • Start / end time
  • Amount of data processed

Datawarehouse schema validation

There are some specific tests we can do on the sanity of a datawarehouse.

  • Coherent amount of data on a daily / hourly basis
  • Test the same as before with specific breakdowns
  • Test for the amount of ‘unknowns’ on dimensions

Invocation and Scheduling

There are 2 ways to call the validations:

  • By url request
  • Scheduled calls

The URL will be based on the id / query name (tbd). The scheduled calls will be cron based, with the following presets:

  • Every hour
  • Every day
  • Every week
  • Every month
  • Custom cron

User interface

These are the features in the main user interface (this is the ultimate goal; the implementation may be broken into stages):

  • See existing validations
    • Allow to fire a specific validation
    • Get the url of a specific validation / all validations
  • Create / Edit validation
    • Define query name
    • Define queries and parameters
    • Define validation function
    • Choose log alerts (when to throw error / severe / warn / ok)
    • Choose duration thresholds
    • Define error message
    • Define cron
  • Validation status dashboard
  • CDA Query error dashboard (Should this belong to CDA instead?)
    • Query and parameters
    • Error
    • Incidents
  • Duration dashboard to identify slow points in the system
    • Query and parameters
    • Duration
    • Incidents

Technical approach

All the specific information will be stored in solution/cdv/queries/. The files will have the format _queryName.cdv and will internally be JSON files with the following structure:

{
  type: "query",
  name: "validationName",
  group: "MyGrouping",
  validation: [
    { cdaFile: "/solution/cda/test.cda", dataAccessId: "1", parameters: [...] },
    { cdaFile: "/solution/cda/test2.cda", dataAccessId: "2", parameters: [...] }
  ],
  validationType: "custom",
  validationFunction: "function(resultArray,arguments){ return 123 }",
  alerts: {
     /* These functions will be executed from the bottom up. While the functions return true, the next one
        will be executed and the last matching level will be thrown.
        The exception to this rule is the optional okAlert(v) function. If it returns true, no other calls will be made. */
     criticalAlert: "function(v){ return v > 10 }",
     errorAlert: undefined,
     warnAlert:  "function(v){ return v > 5 }",
     okAlert: "function(v){ return v < 3; }",
     alertType: "MissingData",
     alertMessage: "function(level,v){ return 'My error message: ' + v }"  /* this can either be a function or a string */
  },
  executionTimeValidation: {
      expected: 5000,
      warnPercentage: 0.30,
      errorPercentage: 0.70,
      errorOnLow: true
  },
  cron: "0 2 * * ? *"
}
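The bottom-up alert evaluation described in the comment above can be sketched like this (a non-authoritative reading of the spec; the function and level names are assumptions):

```javascript
// Sketch of the bottom-up alert evaluation: okAlert short-circuits; the
// remaining functions are tried from warn up to critical, climbing while
// they keep returning true, and the last matching level wins. Undefined
// entries (e.g. errorAlert) are skipped.
function evaluateAlerts(alerts, v) {
  if (alerts.okAlert && alerts.okAlert(v)) { return "OK"; }
  var order = [
    ["WARN", alerts.warnAlert],
    ["ERROR", alerts.errorAlert],
    ["CRITICAL", alerts.criticalAlert]
  ];
  var level = "OK";
  for (var i = 0; i < order.length; i++) {
    var fn = order[i][1];
    if (fn === undefined) { continue; }   // skip undefined levels
    if (fn(v)) { level = order[i][0]; }   // keep climbing on a match
    else { break; }                       // a miss stops the climb
  }
  return level;
}
```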

Preset validations

We won’t need to manually define all kinds of validations. CDV will support presets, which can also be extended by adding definitions to solution/cdv/validationFunctions/. The template for one such JavaScript file looks like this:

{
  name: "Existence",
  validationArguments: [
    {name: "testAll", type: "boolean", default: true}
  ],

  validationFunction: function(rs, conf) {
    return rs.map(function(r){ return r.length > 0; }).reduce(function(prev, curr){
      return conf.testAll ? (curr && prev) : (curr || prev);
    });
  },

  alertArguments: [
    {name: "failOnExistence", type: "boolean", default: true},
    {name: "failureLevel", type: "alarmLevel", default: "ERROR"},
    {name: "failureMessage", type: "String", default: "Failed Existence Test: ${result}"}
  ],

  alertMapper: function(result, conf) {
    var success = conf.failOnExistence && result,
        level = success ? "OK" : conf.failureLevel,
        message = success ? conf.successMessage : conf.failureMessage;
    return Alarm(level, message, result);
  }
}
The wd.cdb.validation API is defined in the Validation Module.
There are four objects there that we need to analyze:

  • validationFunction(rs, conf) – This is the validation function that will be executed after the query runs
  • validationArguments – Definition of the arguments that will be used within the validation function
  • alertArguments – Definition of the arguments that will be sent to the alertMapper
  • alertMapper(result, conf) – Mapping between the validation result and the alerts

Preset validations or custom validations

When we define a query, we can choose which validation function to use and pass the parameters that specific validation requires.
Alternatively, we can use a custom validation function. That function has the following format, where all we need to do is return the alarm level (this is a spec and may change after implementation):

function(rs, conf) {

    var exists = rs.map(function(r){ return r.length > 0; }).reduce(function(prev, curr){
      return conf.testAll ? (curr && prev) : (curr || prev);
    });

    return exists ? Alarm.ERROR : Alarm.OK;
}

CDA integration

We need a tight integration between CDA and CDV to report:

  • Errors in CDA queries
  • Long running CDA queries
  • Queries with obvious errors in the structure (eg: missing lines)

It will obviously need to take into account that CDV may not be installed, and it must not have a performance impact on CDA.

External interfaces

We can have several external interfaces supported:

  • Email
  • Http
  • Nagios integration
  • Server up check

The last one is a very specific check: all the other integrations will fail if the server suddenly hangs, and we must be notified of that. Through the HTTP and Nagios integrations, we’ll be able to get reports on the individual tests and also on the test groups. This will not rerun the tests but report the last status of a test.
In the HTTP case, we can pass a flag to force a test to be rerun.
For Nagios, we can have an export of test rules.


We’ll be able to define the group rules, mainly for connectivity reasons. So the settings (which can later be converted to a UI) will look like this:

  • Groups
    • Email
      • Threshold
      • Destinations
    • SMS
      • Threshold
      • Destinations

Pentaho, 10gen Collaborate to Integrate MongoDB

Business analytics vendor Pentaho and 10gen, the company behind MongoDB, today announced a partnership to provide direct integration between Pentaho Business Analytics and MongoDB.

As enterprise data architectures continue to evolve, customers are looking to address rapidly changing multi-structured data and take advantage of cloud-like architectures. This alliance brings the data integration, data discovery and visualization capabilities of Pentaho to MongoDB.

The companies say that the native integration between Pentaho and MongoDB helps enterprises take advantage of the flexible, scalable data storage capabilities of MongoDB while ensuring compatibility and interoperability with existing data infrastructure.

Pentaho and 10gen have developed connectors to tightly integrate MongoDB and Pentaho Business Analytics. By adding MongoDB integration to its existing library of connectors for relational databases, analytic databases, data warehouses, enterprise applications, and standards-based information exchange formats, Pentaho says it can provide a more robust solution for enterprise architects, developers, data scientists and analysts working with both MongoDB and existing databases.

As a release this week stated, “Enterprise architects benefit from a scalable data integration framework functioning across MongoDB and other data stores, and developers gain access to familiar graphical interfaces for data integration and job management with full support for MongoDB. Data scientists and analysts can now visualize and explore data across multiple data sources, including MongoDB.”

Soon I will make a video explaining MongoDB integration with Pentaho Data Integration and Pentaho Reporting.