Introduction xxxi Part I Getting Started 1 Chapter 1 ETL Primer 3 OLTP versus Data Warehousing 3 What Is ETL? 5 The Evolution of ETL Solutions 5 ETL Building Blocks 7 ETL, ELT, and EII 8 ELT 9 EII: Virtual Data Integration 10 Data Integration Challenges 11 Methodology: Agile BI 12 ETL Design 14 Data Acquisition 14 Beware of Spreadsheets 15 Design for Failure 15 Change Data Capture 16 Data Quality 16 Data Profiling 16 Data Validation 17 ETL Tool Requirements 17 Connectivity 17 Platform Independence 18 Scalability 18 Design Flexibility 19 Reuse 19 Extensibility 19 Data Transformations 20 Testing and Debugging 21 Lineage and Impact Analysis 21 Logging and Auditing 22 Summary 22 Chapter 2 Kettle Concepts 23 Design Principles 23 The Building Blocks of Kettle Design 25 Transformations 25 Steps 26 Transformation Hops 26 Parallelism 27 Rows of Data 27 Data Conversion 29 Jobs 30 Job Entries 31 Job Hops 31 Multiple Paths and Backtracking 32 Parallel Execution 33 Job Entry Results 34 Transformation or Job Metadata 36 Database Connections 37 Special Options 38 The Power of the Relational Database 39 Connections and Transactions 39 Database Clustering 40 Tools and Utilities 41 Repositories 41 Virtual File Systems 42 Parameters and Variables 43 Defining Variables 43 Named Parameters 44 Using Variables 44 Visual Programming 45 Getting Started 46 Creating New Steps 47 Putting It All Together 49 Summary 51 Chapter 3 Installation and Configuration 53 Kettle Software Overview 53 Integrated Development Environment: Spoon 55 Command-Line Launchers: Kitchen and Pan 57 Job Server: Carte 57 Encr.bat and encr.sh 58 Installation 58 Java Environment 58 Installing Java Manually 58 Using Your Linux Package Management System 59 Installing Kettle 59 Versions and Releases 59 Archive Names and Formats 60 Downloading and Uncompressing 60 Running Kettle Programs 61 Creating a Shortcut Icon or Launcher for Spoon 62 Configuration 63 Configuration Files and the .kettle Directory 63 The Kettle Shell Scripts 69 General Structure of the Startup Scripts 70 Adding an Entry to the Classpath 70 Changing the Maximum Heap Size 71 Managing JDBC Drivers 72 Summary 72 Chapter 4 An Example ETL Solution--Sakila 73 Sakila 73 The Sakila Sample Database 74 DVD Rental Business Process 74 Sakila Database Schema Diagram 75 Sakila Database Subject Areas 75 General Design Considerations 77 Installing the Sakila Sample Database 77 The Rental Star Schema 78 Rental Star Schema Diagram 78 Rental Fact Table 79 Dimension Tables 79 Keys and Change Data Capture 80 Installing the Rental Star Schema 81 Prerequisites and Some Basic Spoon Skills 81 Setting Up the ETL Solution 82 Creating Database Accounts 82 Working with Spoon 82 Opening Transformation and Job Files 82 Opening the Step''s Configuration Dialog 83 Examining Streams 83 Running Jobs and Transformations 83 The Sample ETL Solution 84 Static, Generated Dimensions 84 Loading the dim_date Dimension Table 84 Loading the dim_time Dimension Table 86 Recurring Load 87 The load_rentals Job 88 The load_dim_staff Transformation 91 Database Connections 91 The load_dim_customer Transformation 95 The load_dim_store Transformation 98 The fetch_address Subtransformation 99 The load_dim_actor Transformation 101 The load_dim_film Transformation 102 The load_fact_rental Transformation 107 Summary 109 Part II ETL 111 Chapter 5 ETL Subsystems 113 Introduction to the 34 Subsystems 114 Extraction 114 Subsystems 1-3: Data Profiling, Change Data Capture, and Extraction 115 Cleaning and Conforming Data 116 Subsystem 4: Data Cleaning and Quality Screen Handler System 116 Subsystem 5: Error Event Handler 117 Subsystem 6: Audit Dimension Assembler 117 Subsystem 7: Deduplication System 117 Subsystem 8: Data Conformer 118 Data Delivery 118 Subsystem 9: Slowly Changing Dimension Processor 118 Subsystem 10: Surrogate Key Creation System 119 Subsystem 11: Hierarchy Dimension Builder 119 Subsystem 12: Special Dimension Builder 120 Subsystem 13: Fact Table Loader 121 Subsystem 14: Surrogate Key Pipeline 121 Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121 Subsystem 16: Late-Arriving Data Handler 122 Subsystem 17: Dimension Manager System 122 Subsystem 18: Fact Table Provider System 122 Subsystem 19: Aggregate Builder 123 Subsystem 20: Multidimensional (OLAP) Cube Builder 123 Subsystem 21: Data Integration Manager 123 Managing the ETL Environment 123 Summary 126 Chapter 6 Data Extraction 127 Kettle Data Extraction Overview 128 File-Based Extraction 128 Working with Text Files 128 Working with XML files 133 Special File Types 134 Database-Based Extraction 134 Web-Based Extraction 137 Text-Based Web Extraction 137 HTTP Client 137 Using SOAP 138 Stream-Based and Real-Time Extraction 138 Working with ERP and CRM Systems 138 ERP Challenges 139 Kettle ERP Plugins 140 Working with SAP Data 140 ERP and CDC Issues 146 Data Profiling 146 Using eobjects.org DataCleaner 147 Adding Profile Tasks 149 Adding Database Connections 149 Doing an Initial Profile 151 Working with Regular Expressions 151 Profiling and Exploring Results 152 Validating and Comparing Data 153 Using a Dictionary for Column Dependency Checks 153 Alternative Solutions 154 Text Profiling with Kettle 154 CDC: Change Data Capture 154 Source Data-Based CDC 155 Trigger-Based CDC 157 Snapshot-Based CDC 158 Log-Based CDC 162 Which CDC Alternative Should You Choose? 163 Delivering Data 164 Summary 164 Chapter 7 Cleansing and Conforming 167 Data Cleansing 168 Data-Cleansing Steps 169 Using Reference Tables 172 Conforming Data Using Lookup Tables 172 Conforming Data Using Reference Tables 175 Data Validation 179 Applying Validation Rules 180 Validating Dependency Constraints 183 Error Handling 183 Handling Process Errors 184 Transformation Errors 186 Handling Data (Validation) Errors 187 Auditing Data and Process Quality 191 Deduplicating Data 192 Handling Exact Duplicates 193 The Problem of Non-Exact Duplicates 194 Building Deduplication Transforms 195 Step 1: Fuzzy Match 197 Step 2: Select Suspects 198 Step 3: Lookup Validation Value 198 Step 4: Filter Duplicates 199 Scripting 200 Formula 201 JavaScript 202 User-Defined Java Expressions 202 Regular Expressions 203 Summary 205 Chapter 8 Handling Dimension Tables 207 Managing Keys 208 Managing Business Keys 209 Keys in the Source System 209 Keys in the Data Warehouse 209 Business Keys 209 Storing Business Keys 210 Looking Up Keys with Kettle 210 Generating Surrogate Keys 210 The "Add sequence" Step 211 Working with auto_increment or IDENTITY Columns 217 Keys for Slowly Changing Dimensions 217 Loading Dimension Tables 218 Snowflaked Dimension Tables 218 Top-Down Level-Wise Loading 219 Sakila Snowflake Example 219 Sample Transformation 221 Database Lookup Configuration 222 Sample Job 225 Star Schema Dimension Tables 226 Denormalization 226 Denormalizing to 1NF with the "Database lookup" Step 226 Change Data Capture 227 Slowly Changing Dimensions 228 Types of Slowly Changing Dimensions 228 Type 1 Slowly Changing Dimensions 229 The Insert / Update Step 229 Type 2 Slowly Changing Dimensions 232 The "Dimension lookup / update" Step 232 Other Types of Slowly Changing Dimensions 237 Type 3 Slowly Changing Dimensions 237 Hybrid Slowly Changing Dimensions 238 More Dimensions 239 Generated Dimensions 239 Date and Time Dimensions 239 Generated Mini-Dimensions 239 Junk Dimensions 241 Recursive Hierarchies 242 Summary 243 Chapter 9 Loading Fact Tables 245 Loading in Bulk 246 STDIN and FIFO 247 Kettle Bulk Loaders 248 MySQL Bulk Loading 249 LucidDB Bulk Loader 249 Oracle Bulk Loader 249 PostgreSQL Bulk Loader 250 Table Output Step 250 General Bulk Load Considerations 250 Dimension Lookups 251 Maintaining Referential Integrity 251 The Surrogate Key Pipeline 252 Using In-Memory Lookups 253 Stream Lookups 253 Late-Arriving Data 255 Late-Arriving Facts 256 Late-Arriving Dimensions 256 Fact Table Handling 260 Periodic and Accumulating Snapshots 260 Introducing State-Oriented Fact Tables 261 Loading Periodic Snapshots 263 Loading Accumulati.
Pentaho Kettle Solutions : Building Open Source ETL Solutions with Pentaho Data Integration