Creating a connection to SAP HANA using Pentaho PDI


 

Reblogged from http://scn.sap.com/community/developer-center/hana/blog/2014/09/04/creating-a-connection-to-sap-hana-using-pentaho-pdi

In this blog post we are going to learn how to create a HANA Database Connection within Pentaho PDI.

1) Go to the SAP HANA Client installation path and copy the “ngdbc.jar” file.

* You can get the SAP HANA Client and SAP HANA Studio from: https://hanadeveditionsapicl.hana.ondemand.com/hanadevedition/

 


2) Paste the jar file into: <YourPentahoRootFolder>/data-integration/lib


3) Start Pentaho PDI and create a new Connection

* Make sure your JAVA_HOME environment variable is set correctly.


4) Create a transformation, then right-click on Database connections to create a new database connection


 

5) Select “Generic Database” as the connection type and “Native (JDBC)” as the access method

 


6) Fill in the following parameters under Settings

Connection Name: NAMEYOURCONNECTION

Custom Connection URL: jdbc:sap://YOUR_IP_ADDRESS:30015

Custom Driver Class Name: com.sap.db.jdbc.Driver

User Name: YOURHANAUSER

Password: YOURHANAPASSWORD
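
For reference, the settings above can be collected in a small Python helper. This is only a sketch: the function name and the example values are made up for illustration, and the default port 30015 is simply the one used in this post, so your instance may listen on a different port.

```python
def hana_connection_settings(name, host, user, password, port=30015):
    """Return the PDI "Generic Database" settings for SAP HANA as a dict.

    Hypothetical helper: it only assembles the values typed into the
    Settings dialog and does not talk to PDI or HANA.
    """
    return {
        "Connection Name": name,
        # URL format taken from the post: jdbc:sap://<host>:<port>
        "Custom Connection URL": f"jdbc:sap://{host}:{port}",
        # Driver class shipped in ngdbc.jar
        "Custom Driver Class Name": "com.sap.db.jdbc.Driver",
        "User Name": user,
        "Password": password,
    }


if __name__ == "__main__":
    # Placeholder values only; replace them with your own.
    settings = hana_connection_settings(
        "hana_dev", "192.0.2.10", "YOURHANAUSER", "YOURHANAPASSWORD"
    )
    for field, value in settings.items():
        print(f"{field}: {value}")
```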


 

7) Test your connection.


Removing Special Characters from a string field in Oracle


Today, while doing consultancy work, I ran into an issue loading a table from Oracle into PostgreSQL. When I checked the logs, I saw that some Oracle varchar fields had strange characters at the end, and these caused the INSERT statements to fail. Initially I tried using Pentaho Data Integration to replace values in the string and to replace CR, LF and CRLF, since the characters looked like carriage returns when I copied the log files into Notepad++. All attempts were unsuccessful, so I decided to look for Oracle functions and soon found a proper solution.

REGEXP_REPLACE helped me, as you can see in the query below:

SELECT
  REGEXP_REPLACE(customer_description, '[^[:alnum:]'' '']', NULL)
FROM dim_customer

 

Brief Explanation

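The [[:alnum:]] character class represents alphabetic and numeric characters, and for plain ASCII data it is the same as using [a-zA-Z0-9] in a regular expression. The leading ^ negates the bracket expression, so every character that is not alphanumeric, a single quote (written as '' inside the SQL string) or a space is replaced with NULL, that is, removed.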
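
To try the pattern outside the database, here is a minimal Python sketch of the same idea. The function name is made up, and it assumes ASCII data: it uses [A-Za-z0-9] in place of [[:alnum:]], so unlike Oracle it will also strip accented letters.

```python
import re

# Everything that is NOT an ASCII letter, a digit, a single quote or a space.
# Approximates the Oracle pattern '[^[:alnum:]'' '']' for ASCII input only.
_NOT_ALLOWED = re.compile(r"[^A-Za-z0-9' ]")


def remove_special_characters(text):
    """Drop control characters and other symbols, like the query above does."""
    return _NOT_ALLOWED.sub("", text)


if __name__ == "__main__":
    dirty = "ACME Corp\r\n"
    print(repr(remove_special_characters(dirty)))
```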

 

Hope you have enjoyed 🙂

Increase MySQL output to 80K rows/second in Pentaho Data Integration


One of our clients has a MySQL table with around 40M records. Loading the table took around 2.5 hours. While watching the statistics of the transformation, I noticed that the bottleneck was the write to the database: I was stuck at around 2,000 rows/second. You can imagine that it takes a long time to write 40M records at that speed.
I looked for ways to improve the speed. There were a couple of options:
  1. Tune MySQL for better performance on inserts
  2. Use the MySQL Bulk loader step in PDI
  3. Write SQL statements to a file with PDI and read them with the mysql binary
When I discussed this with one of my contacts at Basis06, it turned out they had faced a similar issue a while ago. He mentioned that speed can be boosted with a few simple JDBC connection settings:
useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true

These options should be entered in PDI on the connection. Double-click the connection, go to Options and set these values.
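
These are ordinary MySQL Connector/J connection properties, so they can also be written as query parameters on a JDBC URL. The sketch below only assembles such a URL; the helper name, host, port and database name are placeholders chosen for illustration.

```python
from urllib.parse import urlencode

# The three options from this post, in the order given above.
FAST_INSERT_OPTIONS = {
    "useServerPrepStmts": "false",
    "rewriteBatchedStatements": "true",
    "useCompression": "true",
}


def mysql_jdbc_url(host, port, database, options=FAST_INSERT_OPTIONS):
    """Build a MySQL JDBC URL with the options appended as parameters.

    Hypothetical helper for illustration; it does not open a connection.
    """
    return f"jdbc:mysql://{host}:{port}/{database}?{urlencode(options)}"


if __name__ == "__main__":
    # "localhost", 3306 and "dwh" are example values only.
    print(mysql_jdbc_url("localhost", 3306, "dwh"))
```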

Used together, useServerPrepStmts=false and rewriteBatchedStatements=true will “fake” batch inserts on the client. Specifically, the insert statements:

INSERT INTO t (c1,c2) VALUES ('One',1);
INSERT INTO t (c1,c2) VALUES ('Two',2);
INSERT INTO t (c1,c2) VALUES ('Three',3);

will be rewritten into:

INSERT INTO t (c1,c2) VALUES ('One',1),('Two',2),('Three',3);

The third option useCompression=true compresses the traffic between the client and the MySQL server.
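
To make the rewriting concrete, here is a toy Python sketch that merges single-row INSERT statements into one multi-row statement. It only illustrates the idea: the real rewriting happens inside the JDBC driver, and this hypothetical function assumes every statement has exactly the same text before the VALUES keyword.

```python
def rewrite_batched_inserts(statements):
    """Merge single-row INSERTs into one multi-row INSERT (illustration only)."""
    prefix = None
    value_tuples = []
    for stmt in statements:
        # Split "INSERT INTO t (c1,c2)" from "('One',1)".
        head, sep, values = stmt.strip().rstrip(";").partition(" VALUES ")
        if not sep:
            raise ValueError(f"not a single-row INSERT: {stmt!r}")
        if prefix is None:
            prefix = head
        elif head != prefix:
            raise ValueError("all statements must target the same table and columns")
        value_tuples.append(values.strip())
    if prefix is None:
        raise ValueError("no statements given")
    return f"{prefix} VALUES {','.join(value_tuples)};"


if __name__ == "__main__":
    batch = [
        "INSERT INTO t (c1,c2) VALUES ('One',1);",
        "INSERT INTO t (c1,c2) VALUES ('Two',2);",
        "INSERT INTO t (c1,c2) VALUES ('Three',3);",
    ]
    print(rewrite_batched_inserts(batch))
```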

Finally, I increased the number of copies of the output step to 2, so that there are two threads inserting into the database.

Altogether this increased the speed to around 84,000 rows a second! WOW!

Book Review: Pentaho for Big Data Analytics (November 2013)


https://www.packtpub.com/pentaho-for-big-data-analytics/book

Book review by: David Fombella Pombal (twitter: @pentaho_fan)

Book Title: Pentaho for Big Data Analytics

Authors: Manoj R Patil, Feris Thia

Paperback: 118 pages

I would like to suggest this book if you want to get started with Pentaho Open Source BI tool together with Hadoop and Big Data.

Target Audience
If you are a Data Scientist, a Hadoop programmer, a Big Data enthusiast, or a developer working in the Business Intelligence domain who is aware of Hadoop or the Pentaho tools and wants to try out creating a solution in the Big Data space, this is your manual.

Rating: 7 out of 10

Chapter 1, The Rise of Pentaho Analytics along with Big Data

This chapter serves as a brief summary of the Pentaho tools and their history in the Business Intelligence field, weaving in stories on the rise of Big Data.

Pentaho Tools:

Server Applications

  • Business Analytics (BA) Server: Java-based BI system with a report management system, a lightweight process-flow engine and an HTML5-based web interface. In the Community Edition there is a substitute application called Business Intelligence (BI) Server.

  • Data Integration (DI) Server: Enterprise Edition-only server for ETL processes and data integration.

Thin Client Tools

  • Pentaho Interactive Reporting: WYSIWYG design interface used to build simple, ad hoc reports on the fly without needing IT or programming skills. There are several CE alternatives, such as WAQR (Web Ad Hoc Query Reporting) and Saiku Reporting.

Pentaho Interactive Reporting (EE)

Saiku Reporting (CE)

Web Ad Hoc Query Reporting

  • Pentaho Analyzer: An advanced OLAP viewer with drag-and-drop support. It is an intuitive EE analytical visualization tool that can filter and drill down into data stored in a Mondrian (Pentaho ROLAP engine) data source.

Pentaho Analyzer

  • Pentaho Dashboard Designer (EE): Commercial plugin that allows users to create dashboards with an easy graphical interface

Design Tools

  • Schema Workbench: Graphical tool for creating ROLAP schemas for Pentaho Analysis (Mondrian).
  • Aggregation Designer: Generates pre-calculated tables to improve the performance of Mondrian OLAP schemas.
  • Design Studio: An Eclipse-based application and plugin that eases the creation of business process flows, using a special XML script to define action sequences (xactions).
  • Report Designer: A banded report design tool with a great GUI, useful for creating sub-reports, charts and graphs.
  • Data Integration: This wonderful ETL tool, also known as Kettle, is composed of an ETL engine and a GUI that lets the user design ETL jobs and transformations.
  • Metadata Editor: This tool is used to create business models and acts as an abstraction layer from the underlying physical database.

 

Pentaho BI Suite components

Chapter 2, Setting Up the Ground

In this chapter we install Pentaho BI Suite CE and the Saiku OLAP plugin from the Marketplace. We also learn how to administer data sources using the Pentaho User Console and the Pentaho Administration Console.

Marketplace plugin

Chapter 3, Churning Big Data with Pentaho

This chapter provides a basic understanding of the Big Data ecosystem and an example to analyze data sitting on the Hadoop framework using Pentaho. At the end of this chapter, you will learn how to translate diverse data sets into meaningful data sets using Hadoop/Hive.
This chapter covers the following subjects:
• Overview of Big Data and Hadoop
• Hadoop architecture
• Big Data capabilities of Pentaho Data Integration (PDI/Kettle)
• Working with PDI and Hortonworks Data Platform, a Hadoop distribution
• Loading data from Hadoop Distributed File System (HDFS) to Hive using PDI

The Hadoop ecosystem

HDFS to Hive transformation

Chapter 4, Pentaho Business Analytics Tools

This chapter gives a quick summary of the business analytics life cycle. We look at several applications such as Pentaho Action Sequence and Pentaho Report Designer, as well as the Community Dashboard Editor (CDE), Community Data Access (CDA) and Community Dashboard Framework (CDF) plugins and their configuration, in order to get acquainted with them.

Ctools

Hive Java query using User Defined Java Class Step

Chapter 5, Visualization of Big Data

This chapter provides a basic understanding of visualizations, with examples that analyze patterns using various charts based on Hive data. It shows us how to create an interactive analytical dashboard that gets its data from Hive. In summary, this chapter covers the following themes:
• Evolution of data visualization and its classification
• Data source preparation
• Consumption of HDFS-based data through HiveQL
• Creation of several types of charts
• Making charts more attractive using styling

Hive query

Stock Price Analysis Dashboard

Appendix A, Big Data Sets

Talks about data preparation with one sample from stock exchange data.

Appendix B, Hadoop Setup

Takes you through the installation and configuration of the third-party Hadoop distribution, Hortonworks Sandbox, which is used throughout the book.

http://hortonworks.com/products/hortonworks-sandbox/


 

Parallelization jobs in Kettle – Pentaho Data Integration


Reblogged from http://spektom.blogspot.com.es/2014/02/parallelization-monster-framework-for.html

We always end up with ROFL in our team, when trying to find a name for strange looking ETL processes diagrams. This monster has no name yet:

Parallel kettle job

This is a parallelization framework for Pentaho Kettle 4.x. As you probably know, the upcoming version of Kettle (5.0) has a native ability to launch job entries in parallel, but we haven’t got there yet.

In order to run a job in parallel, you have to call this abstract job, and provide it with 3 parameters:

  • Path to your job (which is supposed to run in parallel).
  • Number of threads (concurrency level).
  • Optional flag that says whether to wait for completion of all jobs or not.
Regarding the number of threads: as you can see, the framework supports up to 8 threads, but it can easily be extended.
How does this work? The “Thread #N” transformations are executed in parallel, each on a copy of all rows. Each transformation then filters the rows according to the given number of threads, so only the relevant portion of rows is passed to its job (Job – Thread #N). For example, if the original row set was:
           [“Apple”, “Banana”, “Orange”, “Lemon”, “Cucumber”]
and the concurrency level was 2, then the first job (Job – Thread #1) gets [“Apple”, “Banana”, “Orange”] and the second job gets the rest: [“Lemon”, “Cucumber”]. All the other jobs get an empty row set.
Finally, there’s a flag that tells whether we should wait until all jobs are completed.
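
The row split from the example can be sketched in a few lines of Python. This is not the Kettle implementation: the post does not spell out the exact splitting rule, so the sketch assumes contiguous chunks of ceil(rows / threads) rows, which reproduces the Apple/Lemon example above. The function name and the MAX_THREADS constant are made up.

```python
import math

MAX_THREADS = 8  # the framework described above has 8 "Thread #N" slots


def split_rows(rows, threads, max_threads=MAX_THREADS):
    """Return one row list per thread slot; unused slots get an empty list.

    Assumed rule: contiguous chunks of ceil(len(rows) / threads) rows.
    """
    if not 1 <= threads <= max_threads:
        raise ValueError(f"threads must be between 1 and {max_threads}")
    chunk = math.ceil(len(rows) / threads)
    parts = []
    for n in range(max_threads):
        if n < threads:
            # Slot n keeps only its own slice of the full row set.
            parts.append(rows[n * chunk:(n + 1) * chunk])
        else:
            parts.append([])
    return parts


if __name__ == "__main__":
    fruit = ["Apple", "Banana", "Orange", "Lemon", "Cucumber"]
    for number, part in enumerate(split_rows(fruit, 2), start=1):
        print(f"Job - Thread #{number}: {part}")
```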
I hope someone will find the attached transformations useful. And if not, at least help me find a name for the ETL diagram. Fish, maybe? 🙂

Book Review: Pentaho Data Integration Cookbook – Second Edition


Pentaho Data Integration Cookbook, Second Edition picks up where the first edition left off, by updating the recipes to the latest edition of PDI and diving into new topics such as working with Big Data and cloud sources, and more.

https://www.packtpub.com/pentaho-data-integration-cookbook-second-edition/book

Book review by: David Fombella Pombal (twitter: @pentaho_fan)

Book Title: Pentaho Data Integration Cookbook – Second Edition

Authors: Alex Meadows, Adrián Sergio Pulvirenti, María Carina Roldán

Paperback: 462 pages

I would like to recommend this useful book, since it shows us how to take advantage of all aspects of Kettle through a set of practical recipes organized to give quick solutions to our everyday needs. Although this book covers advanced topics, all recipes are explained step by step in order to help all types of readers.

Target Audience
If you are a software developer, data scientist, or anyone else looking for a tool that will help extract, transform, and load data as well as provide the tools to perform analytics and data cleansing, then this book is for you.

Rating: 9 out of 10

Chapter 1, Working with Databases – 15 recipes

This chapter shows us how to work with relational databases in Kettle. The recipes show us how to create and share database connections, perform typical database functions (select, insert, update, and delete), and use more advanced tricks such as building and executing queries at ETL runtime. Remember that in Kettle you can connect to MySQL, Oracle, SQL Server, PostgreSQL, DB2 … and nearly all the database engines available.

Inserting new records when PK has to be generated based on previous values transformation

Chapter 2, Reading and Writing Files – 15 recipes

This chapter not only shows us how to read and write files (CSV, TXT, Excel …), but also how to work with semi-structured files and read data from Amazon Web Services S3 instances.

Loading data into an AWS S3 Instance transformation

Chapter 3, Working with Big Data and Cloud Sources – 8 recipes

This third chapter covers how to load and read data from some of the many different NoSQL data sources (MongoDB, HBase, Hadoop …) as well as from Salesforce.com. I would like to highlight this part of the book, given the importance of Big Data techniques nowadays.

Loading data into HBase transformation

Chapter 4, Manipulating XML Structures – 10 recipes

This chapter shows us how to read, write, and validate XML files. Simple and complex XML structures are shown, as well as more specialized formats such as RSS feeds. Even an HTML page is generated using XML and XSL transformations. You should read this chapter carefully if you regularly load, read, update or validate XML files.

Generating an HTML page using XML and XSL sources transformation

Chapter 5, File Management – 9 recipes

This chapter demonstrates how to copy, move, transfer, and encrypt files and directories. Here you will learn how to get data from remote FTP servers, zip files and encrypt files using the OpenPGP standard.

Encrypting and decrypting files transformation

Chapter 6, Looking for Data – 8 recipes

This chapter shows you how to search for information through various methods via databases, web services, files, and more. It also shows you how to validate data with Kettle’s built-in validation steps. In the last recipe you will learn how to validate data at runtime.

Validating data at runtime transformation

Chapter 7, Understanding and Optimizing Data Flows – 12 recipes

This chapter details how Kettle moves data through jobs and transformations and how to optimize data flows (processing jobs in parallel, splitting a stream into two or more, comparing streams …).

Run transformations in parallel job

Chapter 8, Executing and Re-using Jobs and Transformations – 9 recipes

This chapter shows us how to launch jobs and transformations in various ways through static or dynamic arguments and parameterization. Object-oriented transformations through subtransformations are also explained.

Moving the reusable part of a transformation to a sub-transformation (Mapping)

Chapter 9, Integrating Kettle and the Pentaho Suite – 6 recipes

This chapter works with some of the other tools in the Pentaho suite (BI Server, Report Designer) to show how combining tools provides even more capabilities and functionality for reporting, dashboards, and more. In this part of the book you will create Pentaho reports from PDI, execute PDI transformations from the BI Server and populate a dashboard with PDI.

Creating a Pentaho report directly from PDI transformation

Chapter 10, Getting the Most Out of Kettle – 9 recipes

This part works with some of the commonly needed features (e-mail and logging) as well as building sample data sets, and using Kettle to read meta information on jobs and transformations via files or Kettle’s database repository.

Programming custom functionality using Java code transformation

Chapter 11, Utilizing Visualization Tools in Kettle – 4 recipes

This chapter explains how to work with plugins and focuses on DataCleaner, AgileBI, and Instaview, an Enterprise feature that allows for fast analysis of data sources.

PDI Marketplace (Here you can install all plugins available)

Chapter 12, Data Analytics – 3 recipes

This chapter shows us how to work with the various analytical tools built into Kettle, focusing on statistics-gathering steps and building datasets for Weka (the Pentaho data mining tool). You will also read data from a SAS data file.

Reading data from a SAS file transformation

Appendix A, Data Structures, shows the different data structures used throughout the book.

Steelwheels database model structure

Appendix B, References, provides a list of books and other resources that will help you
connect with the rest of the Pentaho community and learn more about Kettle and the other
tools that are part of the Pentaho suite.

Book link:

https://www.packtpub.com/pentaho-data-integration-cookbook-second-edition/book