Removing Special Characters from a string field in Oracle

Today, while doing consultancy work, I ran into a problem loading a table from Oracle into PostgreSQL. When I checked the logs, I saw that some Oracle VARCHAR fields had strange characters at the end, and these caused the INSERT statements to fail. Initially I tried Pentaho Data Integration's replace values in string, replacing CR, LF and CRLF, since the characters looked like carriage returns when I copied the log files into Notepad++. All attempts were unsuccessful, so I decided to look for Oracle functions and soon found a proper solution.

REGEXP_REPLACE helped me, as you can see in the query below:

SELECT REGEXP_REPLACE(customer_description, '[^[:alnum:]'' '']', NULL)
  FROM dim_customer


Brief Explanation

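The [:alnum:] character class matches alphabetic and numeric characters; for plain ASCII data it behaves like a-zA-Z0-9 inside a bracket expression. The leading ^ negates the set, so the pattern matches every character that is not alphanumeric, a single quote or a space, and REGEXP_REPLACE replaces each match with NULL, which removes it from the string.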
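If you need the same cleanup outside the database, the idea can be sketched in Python. This is only an illustration, not part of the original solution: Python's re module has no POSIX classes, so the sketch assumes ASCII letters and digits as a stand-in for [:alnum:], and the function name strip_special_chars is made up for this example.

```python
import re

# Assumption: ASCII letters and digits stand in for Oracle's [:alnum:] class,
# because Python's re module does not support POSIX character classes.
_UNWANTED = re.compile(r"[^a-zA-Z0-9' ]")


def strip_special_chars(value):
    """Remove every character that is not a letter, digit, single quote or space."""
    if value is None:
        return None
    return _UNWANTED.sub("", value)


# A trailing CR/LF and a NUL byte are removed, the text itself is kept.
print(strip_special_chars("ACME Ltd\r\n\x00"))
```

Note that, like the Oracle query, this also removes punctuation such as commas or ampersands, so check that this is acceptable for your field.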


Hope you have enjoyed 🙂


Increase MySQL output to 80K rows/second in Pentaho Data Integration

One of our clients has a MySQL table with around 40M records. Loading the table took around 2.5 hours. When I watched the statistics of the transformation, I noticed that the bottleneck was the write to the database: I was stuck at around 2000 rows/second. You can imagine that it takes a long time to write 40M records at that speed.
I looked for ways to improve the speed. There were a couple of options:
  1. Tune MySQL for better performance on inserts
  2. Use the MySQL Bulk Loader step in PDI
  3. Write SQL statements to a file with PDI and read them with the mysql binary
When I discussed this with one of my contacts at Basis06, it turned out they had faced a similar issue a while ago. He mentioned that the speed can be boosted with some simple JDBC connection settings:

useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true

These options should be entered on the connection in PDI: double-click the connection, go to Options and set these values.
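Outside PDI, the same settings are normally passed as parameters on the MySQL JDBC URL. The sketch below only shows what such a URL could look like; the host, port and database name (localhost, 3306, dwh) are placeholders I made up, and the helper mysql_jdbc_url is hypothetical.

```python
from urllib.parse import urlencode


def mysql_jdbc_url(host, port, database, options):
    """Build a MySQL JDBC URL with the given connection options as query parameters."""
    return f"jdbc:mysql://{host}:{port}/{database}?{urlencode(options)}"


# The three options discussed in this post.
options = {
    "useServerPrepStmts": "false",
    "rewriteBatchedStatements": "true",
    "useCompression": "true",
}

# Placeholder host and database, replace with your own.
print(mysql_jdbc_url("localhost", 3306, "dwh", options))
```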

Used together, useServerPrepStmts=false and rewriteBatchedStatements=true will “fake” batch inserts on the client. Specifically, the insert statements:

INSERT INTO t (c1,c2) VALUES ('One',1);
INSERT INTO t (c1,c2) VALUES ('Two',2);
INSERT INTO t (c1,c2) VALUES ('Three',3);

will be rewritten into:

INSERT INTO t (c1,c2) VALUES ('One',1),('Two',2),('Three',3);
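To make the rewrite concrete, here is a small Python sketch that builds the same multi-row statement from a list of rows. It only illustrates the idea; the JDBC driver does this internally, and the function multi_row_insert is a made-up name, not part of any library.

```python
def multi_row_insert(table, columns, rows):
    """Combine several rows into one multi-row INSERT statement (illustration only)."""

    def literal(value):
        # Quote strings and double embedded single quotes; other values are used as-is.
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    values = ",".join(
        "(" + ",".join(literal(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({','.join(columns)}) VALUES {values};"


print(multi_row_insert("t", ["c1", "c2"], [("One", 1), ("Two", 2), ("Three", 3)]))
```

The gain comes from sending one statement and one network round trip instead of three. Do not use string building like this for untrusted input; it is here only to show the shape of the rewritten statement.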

The third option, useCompression=true, compresses the traffic between the client and the MySQL server.

Finally, I increased the number of copies of the output step to 2 so that two threads insert into the database.

All together this increased the speed to around 84,000 rows a second! WOW!

Book Review: Pentaho for Big Data Analytics (November 2013)


Book review by: David Fombella Pombal (twitter: @pentaho_fan)

Book Title: Pentaho for Big Data Analytics

Authors: Manoj R Patil, Feris Thia

Paperback: 118 pages

I recommend this book if you want to get started with the Pentaho open source BI tools together with Hadoop and Big Data.

Target Audience
If you are a Data Scientist, a Hadoop programmer, a Big Data enthusiast, or a developer working in the Business Intelligence domain who is aware of Hadoop or the Pentaho tools and wants to try creating a solution in the Big Data space, this is your manual.

Rating: 7 out of 10

Chapter 1, The Rise of Pentaho Analytics along with Big Data

This chapter serves as a brief summary of the Pentaho tools and their history in the Business Intelligence field, weaving in stories on the rise of Big Data.

Pentaho Tools:

Server Applications

  • Business Analytics (BA) Server: Java-based BI system with a report management system, a lightweight process-flow engine and an HTML5-based web interface. In the Community Edition, the equivalent application is called Business Intelligence (BI) Server.

  • Data Integration (DI) Server: Enterprise-only server for ETL processes and Data Integration.

Thin Client Tools

  • Pentaho Interactive Reporting: WYSIWYG design interface used to build simple, ad hoc reports on the fly without IT or programming skills. There are several CE alternatives, such as WAQR (Web Ad-Hoc Query Reporting) and Saiku Reporting.

[Image: Pentaho Interactive Reporting (EE)]

[Image: Saiku Reporting (CE)]

[Image: Web Ad Hoc Query Reporting]

  • Pentaho Analyzer (EE): An advanced OLAP viewer with support for drag-and-drop. It is an intuitive analytical visualization tool that can filter and drill down into data stored in a Mondrian (Pentaho ROLAP engine) data source.

[Image: Pentaho Analyzer]

  • Pentaho Dashboard Designer (EE): Commercial plugin that allows users to create dashboards with an easy graphical interface

Design Tools

  • Schema Workbench: Graphical tool for creating ROLAP schemas for Pentaho Analysis (Mondrian).
  • Aggregation Designer: Generates pre-calculated tables to improve the performance of Mondrian OLAP schemas.
  • Design Studio: An Eclipse-based application and plugin that eases the creation of business process flows, using a special XML script to define action sequences (xactions).
  • Report Designer: A banded report designing tool with a great GUI, useful to create sub-reports, charts and graphs.
  • Data Integration: This wonderful ETL tool, also known as Kettle, is composed of an ETL engine and a GUI that allows the user to design ETL jobs and transformations.
  • Metadata Editor: This tool is used to create business models and acts as an abstraction layer from the underlying physical database.


[Image: Pentaho BI Suite components]

Chapter 2, Setting Up the Ground

In this chapter we install Pentaho BI Suite CE and the Saiku OLAP plugin from the Marketplace. We also learn how to administer data sources using the Pentaho User Console and the Pentaho Administration Console.

[Image: Marketplace plugin]

Chapter 3, Churning Big Data with Pentaho

This chapter provides a basic understanding of the Big Data ecosystem and an example to analyze data sitting on the Hadoop framework using Pentaho. At the end of this chapter, you will learn how to translate diverse data sets into meaningful data sets using Hadoop/Hive.
This chapter covers the following subjects:
• Overview of Big Data and Hadoop
• Hadoop architecture
• Big Data capabilities of Pentaho Data Integration (PDI) / Kettle
• Working with PDI and Hortonworks Data Platform, a Hadoop distribution
• Loading data from Hadoop Distributed File System (HDFS) to Hive using PDI

[Image: The Hadoop ecosystem]

[Image: HDFS to Hive transformation]

Chapter 4, Pentaho Business Analytics Tools

This chapter gives a quick summary of the business analytics life cycle. It looks at several applications, such as Pentaho Action Sequence and Pentaho Report Designer, as well as the Community Dashboard Editor (CDE), Community Data Access (CDA) and Community Dashboard Framework (CDF) plugins and their configuration, so that you can get familiar with them.


[Image: Hive Java query using User Defined Java Class Step]

Chapter 5, Visualization of Big Data

This chapter provides a basic understanding of visualizations and examples of analyzing patterns using various charts based on Hive data. It shows how to create an interactive analytical dashboard that gets its data from Hive. In summary, this chapter covers the following themes:
• Evolution of data visualization and its classification
• Data source preparation
• Consumption of HDFS-based data through HiveQL
• Creation of several types of charts
• Making charts more attractive using styling

[Image: Hive query]

[Image: Stock Price Analysis Dashboard]

Appendix A, Big Data Sets

Talks about data preparation with one sample from stock exchange data.

Appendix B, Hadoop Setup

Takes you through the installation and configuration of the third-party Hadoop distribution, Hortonworks Sandbox, which is used throughout the book.



Tips for Editing Pentaho Auto-Generated OLAP Models



If you’ve followed some of my earlier tutorials here or here, where I described auto-generating OLAP models with the Pentaho auto-modeler, you will end up with a basic multidimensional star schema that allows a basic level of customization, such as here:


In most cases, that environment will provide enough control for you to create a model that will cover most of your analytical reporting needs. But if you want to build out a more complex model, you can manipulate the underlying Mondrian schema XML directly in a file or use the Pentaho Schema Workbench tool to build out snowflake schemas, custom calculations, Analyzer annotations, etc.


For direct XML editing of the multidimensional model, you can follow the Mondrian schema guide here.

To pull the Mondrian model out of these Data Source Wizard sources for editing, click the Export button on the Data Sources dialog box below:


If you use this method from the UI, you will download a ZIP file. Unzip that file and save the “schema.xml” inside the ZIP to your local file system. You can then edit that file in Schema Workbench (PSW) or in an XML editor and import your changes back into the platform from that same Manage Data Sources dialog in the Web UI, or just publish it directly to your server from PSW:


Here’s another tip: when I pull a Mondrian schema out of an auto-generated Data Source Wizard model, I find it easier to extract the XML schema directly with a REST API call than to export a ZIP. I downloaded curl on my Windows laptop to use as a command-line tool for calling web service APIs. Now I can make this REST call:

curl --user Admin:password localhost:8080/pentaho/plugin/data-access/api/datasource/analysis/foodmart/download > foodmart.xml

To make the above call work in your environment, change the “--user” credentials to your username:password, replace the hostname with your server and replace “foodmart” with the name of the model that you wish to modify. You can then edit the resulting file (foodmart.xml) in PSW or with an XML editor.
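If you prefer a script over curl, the same call can be sketched with Python's standard library. Treat this as a rough equivalent under assumptions: it assumes the server accepts HTTP basic authentication (which is what curl's user option sends), and the function names build_schema_request and download_schema are made up for this example.

```python
import base64
import urllib.request


def build_schema_request(base_url, model, user, password):
    """Prepare the download request for a Mondrian schema, using HTTP basic auth."""
    url = f"{base_url}/pentaho/plugin/data-access/api/datasource/analysis/{model}/download"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def download_schema(base_url, model, user, password, target_path):
    """Send the request and save the returned schema XML to target_path."""
    request = build_schema_request(base_url, model, user, password)
    with urllib.request.urlopen(request) as response, open(target_path, "wb") as out:
        out.write(response.read())


# Example call against a local server (same values as the curl command):
# download_schema("http://localhost:8080", "foodmart", "Admin", "password", "foodmart.xml")
```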

Don’t forget to import the updated file back into the platform or Publish it from Schema Workbench so that users will then be able to build their reports from the new schema.

One last trick: when I re-import or re-publish an edited model that started as a generated Data Source Wizard model, I rename the model in PSW or in the XML file so that it appears as a new model in the Pentaho tools. This way, you avoid losing your changes if you later update the model again in the thin modeler from Data Source Wizard.


Parallelization jobs in Kettle – Pentaho Data Integration

Reblogged from

We always end up ROFL in our team when trying to find a name for strange-looking ETL process diagrams. This monster has no name yet:

Parallel kettle job

This is a parallelization framework for Pentaho Kettle 4.x. As you probably know, the upcoming version of Kettle (5.0) has a native ability to launch job entries in parallel, but we haven’t got there yet.

In order to run a job in parallel, you have to call this abstract job, and provide it with 3 parameters:

  • Path to your job (which is supposed to run in parallel).
  • Number of threads (concurrency level).
  • Optional flag that says whether to wait for completion of all jobs or not.
Regarding the number of threads: as you can see, the framework supports up to 8 threads, but it can easily be extended.
How this stuff works: the “Thread #N” transformations are executed in parallel, each on a copy of all rows. Each transformation then filters the rows by the given number of threads, so only the relevant portion of rows is passed to its job (Job – Thread #N). For example, if the original row set was:
           [“Apple”, “Banana”, “Orange”, “Lemon”, “Cucumber”]
and the concurrency level was 2, then the first job (Job – Thread #1) gets [“Apple”, “Banana”, “Orange”] and the second job gets the rest: [“Lemon”, “Cucumber”]. All the other jobs get an empty row set.
Finally, there’s a flag that tells whether we should wait until all jobs are completed.
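The row split can be sketched in a few lines of Python. This is my reading of the example above, not the actual Kettle transformations: it assumes rows are handed out in contiguous blocks of ceil(rows / threads), which reproduces the 3 + 2 split, and the name split_rows is made up.

```python
import math

MAX_THREADS = 8  # the framework described above supports up to 8 threads


def split_rows(rows, threads, max_threads=MAX_THREADS):
    """Return one portion of rows per thread slot; unused slots get an empty list."""
    if not 1 <= threads <= max_threads:
        raise ValueError(f"threads must be between 1 and {max_threads}")
    # Assumption: contiguous blocks, sized so that the first jobs get the larger share.
    chunk = math.ceil(len(rows) / threads)
    portions = []
    for n in range(max_threads):
        if n < threads:
            portions.append(rows[n * chunk:(n + 1) * chunk])
        else:
            portions.append([])
    return portions


fruit = ["Apple", "Banana", "Orange", "Lemon", "Cucumber"]
for number, portion in enumerate(split_rows(fruit, 2), start=1):
    print(f"Job - Thread #{number}: {portion}")
```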
I hope you find the attached transformations useful. And if not, at least help me find a name for the ETL diagram. Fish, maybe? 🙂

How to remove the “Web Ad Hoc Query and Reporting has been replace by the new Interactive Reporting client …” message in Pentaho BI Server CE 5.0.1

In this quick post I will show how to remove the “Web Ad Hoc Query and Reporting has been replace by the new Interactive Reporting client…” message in Pentaho BI Server CE 5.0.1 stable.

The annoying message is the following:

Web Ad Hoc Query and Reporting has been replace by the new Interactive Reporting client. It is provided as a convenience but will no longer be enhanced or offically supported by Pentaho.

It appears every time you open the WAQR ad hoc reporting component.

1) Open the waqr.html file in the biserver-ce-5.0.1-stable/biserver-ce/pentaho-solutions/system/waqr/resources folder

2) Look for the following code and comment it out with <!-- -->

<img src="resources/images/warning.png"/>
Web Ad Hoc Query and Reporting has been replace by the new Interactive Reporting client.<br/>
It is provided as a convenience but will no longer be enhanced or offically supported by Pentaho.

3) Go to the upper menu and execute Tools -> Refresh -> System Settings

After executing the refresh the warning message will be hidden.
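If you have to repeat step 2 on several servers, the edit can be scripted. The Python sketch below wraps the text between two markers in an HTML comment. It is only an illustration: the function comment_out is made up, and you should check the exact markup in your own waqr.html before relying on the markers.

```python
def comment_out(html, start_marker, end_marker):
    """Wrap the text from start_marker through end_marker in an HTML comment."""
    start = html.find(start_marker)
    if start == -1:
        return html  # warning not found, leave the file content unchanged
    end = html.find(end_marker, start)
    if end == -1:
        return html
    end += len(end_marker)
    # Do nothing if the block is already commented out.
    if html[:start].rstrip().endswith("<!--"):
        return html
    return html[:start] + "<!-- " + html[start:end] + " -->" + html[end:]


sample = (
    '<div><img src="resources/images/warning.png"/>\n'
    "It is provided as a convenience but will no longer be enhanced or "
    "offically supported by Pentaho.</div>"
)
print(comment_out(sample, '<img src="resources/images/warning.png"/>', "supported by Pentaho."))
```

Keep a backup of the original file before overwriting it with the result.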

Hope you enjoy!

Help me keep the guides up to date and the posts flowing by donating, every small amount of money helps!




How to remove the “JPivot is a community plug-in that has been provided for your convenience….” message in Pentaho BI Server CE 5.0.1

In this quick post I will show how to remove the “JPivot is a community plug-in that has been provided for your convenience…” message in Pentaho BI Server CE 5.0.1 stable.

The annoying message is the following:

JPivot is a community plug-in that has been provided for your convenience. If you are a Pentaho customer we encourage you to transition current Analysis Views to Pentaho Analyzer.

It appears every time you open the JPivot client.

1) Open the mdxtable.css file in the biserver-ce-5.0.1-stable/biserver-ce/pentaho-solutions/system/pentaho-jpivot-plugin/jpivot/table folder

Then add the following CSS code at the start of the file:

#deprecatedWarning {
display: none;
}

Restart BI Server and the deprecated warning will be hidden.
Hope you enjoy!
