Pentaho Data Integration Cookbook, Second Edition picks up where the first edition left off, updating the recipes to the latest version of PDI and diving into new topics such as working with Big Data and cloud sources.
Book review by: David Fombella Pombal (twitter: @pentaho_fan)
Book Title: Pentaho Data Integration Cookbook – Second Edition
Authors: Alex Meadows, Adrián Sergio Pulvirenti, María Carina Roldán
Paperback: 462 pages
I would like to recommend this useful book, since it shows us how to take advantage of every aspect of Kettle through a set of practical recipes organized for finding quick solutions to our everyday needs. Although the book covers advanced topics, all recipes are explained step by step in order to help all types of readers.
If you are a software developer, data scientist, or anyone else looking for a tool that will help extract, transform, and load data as well as provide the tools to perform analytics and data cleansing, then this book is for you.
Rating: 9 out of 10
Chapter 1, Working with Databases – 15 recipes
This chapter shows us how to work with relational databases in Kettle. The recipes show us how to create and share database connections, perform typical database operations (select, insert, update, and delete), as well as more advanced tricks such as building and executing queries at ETL runtime. Remember that Kettle can connect to MySQL, Oracle, SQL Server, PostgreSQL, DB2, and nearly all other available database engines.
Inserting new records when PK has to be generated based on previous values transformation
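The idea behind that recipe can be sketched outside Kettle in plain Python with SQLite (the table and column names here are invented for illustration, not taken from the book): look up the highest existing key, add one, and insert.

```python
import sqlite3

# In-memory database standing in for the target table (names are hypothetical)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO authors VALUES (1, 'Alex Meadows')")

def insert_with_generated_pk(conn, name):
    # Generate the PK from the previous maximum value, as the recipe does in Kettle
    (max_id,) = conn.execute("SELECT COALESCE(MAX(id), 0) FROM authors").fetchone()
    conn.execute("INSERT INTO authors VALUES (?, ?)", (max_id + 1, name))
    return max_id + 1

new_id = insert_with_generated_pk(conn, "Maria Carina Roldan")
print(new_id)  # → 2
```

In the book this is of course done with PDI steps rather than hand-written SQL, but the logic is the same.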
Chapter 2, Reading and Writing Files – 15 recipes
This chapter shows us not only how to read and write files (CSV, TXT, Excel, and so on), but also how to work with semi-structured files and read data from Amazon Web Services S3 instances.
Loading data into an AWS S3 Instance transformation
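As a rough, non-Kettle illustration of what the file-handling steps in this chapter do, here is a minimal read-then-write round trip over a delimited file using Python's standard library (the file contents and field names are made up):

```python
import csv
import io

# A small CSV source, stood in here by an in-memory string (contents are made up)
raw = "name,pages\nPDI Cookbook,462\n"

rows = list(csv.DictReader(io.StringIO(raw)))   # read: one dict per data row

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "pages"])
writer.writeheader()
writer.writerows(rows)                          # write the same rows back out

print(rows[0]["pages"])  # → 462
```

PDI's Text file input/output steps handle the same concerns (headers, delimiters, field typing) through configuration instead of code.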
Chapter 3, Working with Big Data and Cloud Sources – 8 recipes
This third chapter covers how to load and read data from some of the many different NoSQL data sources (MongoDB, HBase, Hadoop, and so on) as well as from Salesforce.com. I would like to highlight this part of the book, given the importance of Big Data techniques nowadays.
Loading data into HBase transformation
Chapter 4, Manipulating XML Structures – 10 recipes
This chapter shows us how to read, write, and validate XML files. Simple and complex XML structures are covered, as well as more specialized formats such as RSS feeds. Even an HTML page is generated using XML and XSL transformations. You should read this chapter carefully if you regularly load, read, update, or validate XML files.
Generating an HTML page using XML and XSL sources transformation
Chapter 5, File Management – 9 recipes
This chapter demonstrates how to copy, move, transfer, and encrypt files and directories. Here you will learn how to get data from remote FTP servers, zip files, and encrypt files using the OpenPGP standard.
Encrypting and decrypting files transformation
Chapter 6, Looking for Data – 8 recipes
This chapter shows you how to search for information through various methods via databases, web services, files, and more. It also shows you how to validate data with Kettle’s built-in validation steps. In addition, in the last recipe you will learn how to validate data at runtime.
Validating data at runtime transformation
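Outside Kettle, the idea of validating rows at runtime can be sketched in plain Python (the rules and field names here are hypothetical, chosen only to illustrate the concept behind steps like PDI's Data Validator):

```python
# Hypothetical row-level validation rules: field name -> predicate
rules = {
    "name":  lambda v: isinstance(v, str) and v.strip() != "",
    "pages": lambda v: isinstance(v, int) and v > 0,
}

def validate(row):
    """Return the list of field names that fail their rule."""
    return [field for field, ok in rules.items() if not ok(row.get(field))]

good = {"name": "PDI Cookbook", "pages": 462}
bad  = {"name": "", "pages": -1}

print(validate(good))  # → []
print(validate(bad))   # → ['name', 'pages']
```

The recipe's point is that the rules themselves can be supplied at runtime (from a file or a previous step) rather than hard-coded, which is what makes the validation dynamic.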
Chapter 7, Understanding and Optimizing Data Flows – 12 recipes
This chapter details how Kettle moves data through jobs and transformations and how to optimize data flows (processing jobs in parallel, splitting a stream into two or more, comparing streams, and so on).
Run transformations in parallel job
Chapter 8, Executing and Re-using Jobs and Transformations – 9 recipes
This chapter shows us how to launch jobs and transformations in various ways through static or dynamic arguments and parameterization. Object-oriented transformations through subtransformations are also explained.
Moving the reusable part of a transformation to a sub-transformation (Mapping)
Chapter 9, Integrating Kettle and the Pentaho Suite – 6 recipes
This chapter works with some of the other tools in the Pentaho suite (BI Server, Report Designer) to show how combining tools provides even more capabilities and functionality for reporting, dashboards, and more. In this part of the book you will create Pentaho reports from PDI, execute PDI transformations from the BI Server, and populate a dashboard with PDI.
Creating a Pentaho report directly from PDI transformation
Chapter 10, Getting the Most Out of Kettle – 9 recipes
This part covers some commonly needed features (e-mail and logging), as well as building sample datasets and using Kettle to read meta information about jobs and transformations via files or Kettle’s database repository.
Programming custom functionality using Java code transformation
Chapter 11, Utilizing Visualization Tools in Kettle – 4 recipes
This chapter explains how to work with plugins and focuses on DataCleaner, AgileBI, and Instaview, an Enterprise feature that allows for fast analysis of data sources.
PDI Marketplace (here you can install all available plugins)
Chapter 12, Data Analytics – 3 recipes
This part shows us how to work with the various analytical tools built into Kettle, focusing on statistics-gathering steps and building datasets for Weka (Pentaho’s data mining tool). You will also read data from a SAS data file.
Reading data from a SAS file transformation
Appendix A, Data Structures, shows the different data structures used throughout the book.
Steelwheels database model structure
Appendix B, References, provides a list of books and other resources that will help you
connect with the rest of the Pentaho community and learn more about Kettle and the other
tools that are part of the Pentaho suite.