Book Review: Pentaho Data Integration Cookbook – Second Edition

Pentaho Data Integration Cookbook, Second Edition picks up where the first edition left off, updating the recipes to the latest version of PDI and diving into new topics such as working with Big Data and cloud sources.


Book review by: David Fombella Pombal (twitter: @pentaho_fan)

Book Title: Pentaho Data Integration Cookbook – Second Edition

Authors: Alex Meadows, Adrián Sergio Pulvirenti, María Carina Roldán

Paperback: 462 pages

I would like to recommend this useful book because it shows us how to take advantage of every aspect of Kettle through a set of practical recipes, organized so that we can find quick solutions to our everyday needs. Although this book covers advanced topics, all recipes are explained step by step in order to help all types of readers.

Target Audience
If you are a software developer, data scientist, or anyone else looking for a tool that will help extract, transform, and load data as well as provide the tools to perform analytics and data cleansing, then this book is for you.

Rating: 9 out of 10

Chapter 1, Working with Databases – 15 recipes

This chapter shows us how to work with relational databases in Kettle. The recipes show us how to create and share database connections, perform typical database functions (select, insert, update, and delete), and use more advanced tricks such as building and executing queries at ETL runtime. Remember that in Kettle you can connect to MySQL, Oracle, SQL Server, PostgreSQL, DB2, and nearly all other database engines available.

Chapter 1: Inserting new records when PK has to be generated based on previous values transformation

Chapter 2, Reading and Writing Files – 15 recipes

This chapter not only shows us how to read and write files (CSV, TXT, Excel, …), but also how to work with semi-structured files and read data from Amazon Web Services S3 instances.

Chapter 2: Loading data into an AWS S3 Instance transformation

Chapter 3, Working with Big Data and Cloud Sources – 8 recipes

This third chapter covers how to load and read data from some of the many different NoSQL data sources (MongoDB, HBase, Hadoop, …) as well as from cloud sources. I would like to stress the importance of this part of the book, given how relevant Big Data techniques are nowadays.

Chapter 3: Loading data into HBase transformation

Chapter 4, Manipulating XML Structures – 10 recipes

This chapter shows us how to read, write, and validate XML files. Simple and complex XML structures are shown, as well as more specialized formats such as RSS feeds. Even an HTML page is generated using XML and XSL transformations. Read this chapter carefully if you regularly load, read, update, or validate XML files.

Chapter 4: Generating an HTML page using XML and XSL sources transformation

Chapter 5, File Management – 9 recipes

This chapter demonstrates how to copy, move, transfer, and encrypt files and directories. Here you will learn how to get data from remote FTP servers, zip files, and encrypt files using the OpenPGP standard.

Chapter 5: Encrypting and decrypting files transformation

Chapter 6, Looking for Data – 8 recipes

This chapter shows you how to search for information through various methods via databases, web services, files, and more. It also shows you how to validate data with Kettle’s built-in validation steps. In addition, the last recipe teaches you how to validate data at runtime.

Chapter 6: Validating data at runtime transformation

Chapter 7, Understanding and Optimizing Data Flows – 12 recipes

This chapter details how Kettle moves data through jobs and transformations and how to optimize data flows (processing jobs in parallel, splitting a stream into two or more, comparing streams, …).

Chapter 7: Run transformations in parallel job

Chapter 8, Executing and Re-using Jobs and Transformations – 9 recipes

This chapter shows us how to launch jobs and transformations in various ways through static or dynamic arguments and parameterization. Object-oriented transformations through subtransformations are also explained.

Chapter 8: Moving the reusable part of a transformation to a sub-transformation (Mapping)

Chapter 9, Integrating Kettle and the Pentaho Suite – 6 recipes

This chapter works with some of the other tools in the Pentaho suite (BI Server, Report Designer) to show how combining tools provides even more capabilities and functionality for reporting, dashboards, and more. In this part of the book you will create Pentaho reports from PDI, execute PDI transformations from BI Server, and populate a dashboard with PDI.

Chapter 9: Creating a Pentaho report directly from PDI transformation

Chapter 10, Getting the Most Out of Kettle – 9 recipes

This part works with some of the commonly needed features (e-mail and logging) as well as building sample data sets, and using Kettle to read meta information on jobs and transformations via files or Kettle’s database repository.

Chapter 10: Programming custom functionality using Java code transformation

Chapter 11, Utilizing Visualization Tools in Kettle – 4 recipes

This chapter explains how to work with plugins and focuses on DataCleaner, AgileBI, and Instaview, an Enterprise feature that allows for fast analysis of data sources.

Chapter 11: PDI Marketplace (here you can install all available plugins)

Chapter 12, Data Analytics – 3 recipes

This part shows us how to work with the various analytical tools built into Kettle, focusing on statistics-gathering steps and building datasets for Weka (the Pentaho data mining tool). You will also read data from a SAS data file.

Chapter 12: Reading data from a SAS file transformation

Appendix A, Data Structures, shows the different data structures used throughout the book.

Appendix A: Steelwheels database model structure

Appendix B, References, provides a list of books and other resources that will help you connect with the rest of the Pentaho community and learn more about Kettle and the other tools that are part of the Pentaho suite.

Book link:

Book Review: Pentaho Data Integration Beginner’s Guide – Second Edition

Hello friends, today I am going to review Pentaho Data Integration Beginner’s Guide – Second Edition:

First of all, I would like to congratulate María Carina, a great contributor to the Pentaho community whom I met in person at the last Pentaho Community Meeting (#PCM13) in Sintra.

Below you can check the link to purchase the book:

Book review by: David Fombella Pombal (twitter: @pentaho_fan)

Book Title: Pentaho Data Integration Beginner’s Guide – Second Edition

Authors: María Carina Roldán

Paperback: 502 pages

I would like to recommend this book because, if you are a noob in Pentaho Data Integration, you will gain a lot of knowledge about this cool tool; and if you are already advanced with PDI, you can use it as a reference guide.

Target Audience
This book is an excellent starting point for database administrators, data warehouse developers, or anyone who is responsible for ETL and data warehouse projects and needs to load data into them.

Rating: 9 out of 10

Although this book targets the PDI 4.4.0 CE version, some new features of PDI 5.0.1 CE are listed in an appendix of the book.

Kettle version

Chapter List

Chapter 1 – Getting Started with Pentaho Data Integration
In this chapter you learn what Pentaho Data Integration is and install the software required to start using the PDI graphical designer. As an additional task, a MySQL DBMS server is installed.

Chapter 1: Hello world transformation

Chapter 2 – Getting started with Transformations
This chapter introduces us to the basic terminology of PDI and gives an introduction to handling runtime errors. We will also learn the simplest ways of transforming data.

Chapter 2: Calculating project duration transformation

Chapter 3 – Manipulating Real-World Data
Here we will learn how to get data from different sorts of files (CSV, TXT, XML, …) using PDI. We will also send data from Kettle to plain files.

Chapter 3: Creation of a CSV file with random values transformation

Chapter 4 – Filtering, Searching, and Performing Other Useful Operations with Data
This chapter explains how to sort and filter data, group data by different criteria, and look up data outside the main stream of data. Some data cleansing tasks are also performed in this chapter.

Chapter 4: Filtering data transformation

Chapter 5 – Controlling the Flow of Data
In this chapter, a very important one for ETL developers, we will learn how to control the flow of data. In particular, we will cover the following topics: copying and distributing rows, splitting streams based on conditions, and merging streams of data.

Chapter 5: Copying rows transformation
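
To make the difference between copying and distributing rows more concrete, here is a minimal Python sketch (plain Python, not PDI code and not taken from the book). It assumes that copying sends every row to every target step, while distributing hands rows out to the targets in round-robin order. The function names `copy_rows` and `distribute_rows` and the sample rows are hypothetical, used only for illustration.

```python
from itertools import cycle


def copy_rows(rows, n_targets):
    """Copying: every target receives its own copy of every row."""
    return [list(rows) for _ in range(n_targets)]


def distribute_rows(rows, n_targets):
    """Distributing: rows are handed out to the targets in round-robin order."""
    targets = [[] for _ in range(n_targets)]
    for row, target in zip(rows, cycle(targets)):
        target.append(row)
    return targets


# Hypothetical sample stream of five rows sent to two target steps.
rows = ["row1", "row2", "row3", "row4", "row5"]
copied = copy_rows(rows, 2)
distributed = distribute_rows(rows, 2)
print(copied)
print(distributed)
```

With copying, both targets see all five rows; with distributing, each target sees only part of the stream, which is why the choice matters when the following steps aggregate or count rows.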

Chapter 6 – Transforming Your Data by Coding
This chapter explains how to insert code in your transformations. Specifically, you will learn how to insert and test JavaScript and Java code in your transformations, and how to distinguish situations where coding is the best option from those where there are better alternatives. PDI uses the Rhino JavaScript engine from Mozilla. To allow Java programming inside PDI, the tool uses the Janino project libraries. Janino is a super-small, fast embedded compiler that compiles Java code at runtime. In summary, always remember that code in the JavaScript step is interpreted, whereas code in the User Defined Java Class (UDJC) step is compiled. This means that a transformation that uses the UDJC step will usually perform much better.

Chapter 6: Transformation with Java code

Chapter 7 – Transforming the Rowset
This chapter is dedicated to learning how to convert rows to columns (denormalizing) and columns to rows (normalizing). Furthermore, you will be introduced to a very important topic in data warehousing: time dimensions.

Chapter 7: Denormalizing rows transformation
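
For readers who have never pivoted data before, the idea behind denormalizing and normalizing can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how the PDI steps are implemented; the function names, field names, and sample data are all made up for the example.

```python
from collections import defaultdict


def denormalize(rows, key, name_field, value_field):
    """Rows to columns: one tall row per (key, name) becomes one wide row per key."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row[key]][row[name_field]] = row[value_field]
    return [{key: k, **cols} for k, cols in grouped.items()]


def normalize(rows, key, name_field, value_field):
    """Columns to rows: every non-key column of a wide row becomes its own row."""
    return [
        {key: row[key], name_field: col, value_field: val}
        for row in rows
        for col, val in row.items()
        if col != key
    ]


# Hypothetical sample data: sales per product and quarter.
tall = [
    {"product": "A", "quarter": "Q1", "sales": 10},
    {"product": "A", "quarter": "Q2", "sales": 12},
    {"product": "B", "quarter": "Q1", "sales": 7},
]
wide = denormalize(tall, "product", "quarter", "sales")
print(wide)
print(normalize(wide, "product", "quarter", "sales"))
```

In this small example the two operations are inverses of each other: normalizing the wide rows gives back the original tall rows.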

Chapter 8 – Working with Databases
This is the first of two chapters fully dedicated to working with databases. We will learn how to connect to a database, preview and get data from a database, and insert, update, and delete data in a database.

Chapter 8: List of some of the many types of databases available to connect to in PDI

Chapter 9 – Performing Advanced Operations with Databases
This chapter explains different advanced operations with databases, such as doing simple and complex lookups in a database. An introduction to dimensional modeling and loading dimensions is also included.

Chapter 9: Database lookup in a transformation

Chapter 10 – Creating Basic Task Flows
So far, we have been working with data (running transformations). A PDI transformation does not run in isolation; it is usually embedded in a bigger process. Processes such as generating a daily report and transferring it to a shared repository, or updating a data warehouse and sending a notification by email, can be implemented with PDI jobs. In this chapter we are introduced to jobs, executing tasks upon conditions, and working with arguments and named parameters.

Chapter 10: Creating a folder transformation

Chapter 11 – Creating Advanced Transformations and Jobs
This chapter is about learning techniques for creating complex transformations and jobs (creating subtransformations, implementing process flows, nesting jobs, iterating the execution of jobs and transformations, …).

Chapter 11: Execute transformation included in a job for every input row

Chapter 12 – Developing and Implementing a Simple Datamart
This chapter covers the following: an introduction to a sales datamart based on a provided database, loading the dimensions and fact table of the sales datamart, and automating what has been done.

Appendix A – Working with Repositories
PDI allows us to store our transformations and jobs under two different configurations: file-based and database repository. Throughout this book we have used the file-based option; however, the database repository is convenient in some situations.

Appendix B – Pan and Kitchen – Launching Transformations and Jobs from the Command Line

Although we have used Spoon as the tool for running jobs and transformations, you may also run them from a terminal window. Pan is a command-line program that lets you launch the transformations designed in Spoon, both from .ktr files and from a repository. The counterpart to Pan is Kitchen, which allows you to run jobs from .kjb files and from a repository.
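
As a rough idea of what such calls look like, the following Python sketch only builds the argument lists for Pan and Kitchen; it does not run anything. It assumes a Linux installation (the `pan.sh` and `kitchen.sh` scripts) and the `-file`, `-level`, and `-param:` options; the helper name `build_command`, the file paths, and the `RUN_DATE` parameter are hypothetical. Check the documentation of your PDI version before relying on the exact option syntax.

```python
def build_command(script, file_path, level="Basic", params=None):
    """Build the argument list for a Pan or Kitchen call (nothing is executed)."""
    args = [script, f"-file={file_path}", f"-level={level}"]
    # Named parameters are assumed to be passed as -param:NAME=VALUE.
    for name, value in (params or {}).items():
        args.append(f"-param:{name}={value}")
    return args


# Hypothetical paths: a transformation (.ktr) for Pan and a job (.kjb) for Kitchen.
pan_cmd = build_command("./pan.sh", "/tmp/hello_world.ktr")
kitchen_cmd = build_command(
    "./kitchen.sh", "/tmp/daily_load.kjb", params={"RUN_DATE": "2014-01-31"}
)
print(" ".join(pan_cmd))
print(" ".join(kitchen_cmd))
```

To actually launch one of them, the list could be passed to `subprocess.run(pan_cmd, check=True)` from the PDI installation directory.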

Appendix C – Quick Reference – Steps and Job Entries

This appendix summarizes the purpose of the steps and job entries used in the labs throughout the book.

Appendix D – Spoon Shortcuts

This very useful appendix includes tables summarizing the main Spoon shortcuts.

Appendix E – Introducing PDI 5 Features

This appendix introduces the new PDI 5 features (PDI 5 is already available).

Book link: