Foreword by Ravi Bapna xxi
Preface to the RapidMiner Edition xxiii
Acknowledgments xxvii

PART I PRELIMINARIES

CHAPTER 1 Introduction 3
1.1 What Is Business Analytics? 3
1.2 What Is Machine Learning? 5
1.3 Machine Learning, AI, and Related Terms 5
    Statistical Modeling vs. Machine Learning 6
1.4 Big Data 7
1.5 Data Science 8
1.6 Why Are There So Many Different Methods? 9
1.7 Terminology and Notation 9
1.8 Road Maps to This Book 12
    Order of Topics 12
1.9 Using RapidMiner Studio 14
    Importing and Loading Data in RapidMiner 16
    RapidMiner Extensions 16

CHAPTER 2 Overview of the Machine Learning Process 19
2.1 Introduction 19
2.2 Core Ideas in Machine Learning 20
    Classification 20
    Prediction 20
    Association Rules and Recommendation Systems 20
    Predictive Analytics 21
    Data Reduction and Dimension Reduction 21
    Data Exploration and Visualization 21
    Supervised and Unsupervised Learning 22
2.3 The Steps in a Machine Learning Project 23
2.4 Preliminary Steps 25
    Organization of Data 25
    Sampling from a Database 25
    Oversampling Rare Events in Classification Tasks 26
    Preprocessing and Cleaning the Data 26
2.5 Predictive Power and Overfitting 32
    Overfitting 32
    Creation and Use of Data Partitions 34
2.6 Building a Predictive Model with RapidMiner 37
    Predicting Home Values in the West Roxbury Neighborhood 39
    Modeling Process 39
2.7 Using RapidMiner for Machine Learning 45
2.8 Automating Machine Learning Solutions 47
    Predicting Power Generator Failure 48
    Uber's Michelangelo 50
2.9 Ethical Practice in Machine Learning 52
    Machine Learning Software Tools: The State of the Market by Herb Edelstein 53
Problems 57

PART II DATA EXPLORATION AND DIMENSION REDUCTION

CHAPTER 3 Data Visualization 63
3.1 Introduction 63
3.2 Data Examples 65
    Example 1: Boston Housing Data 65
    Example 2: Ridership on Amtrak Trains 66
3.3 Basic Charts: Bar Charts, Line Charts, and Scatter Plots 66
    Distribution Plots: Boxplots and Histograms 69
    Heatmaps: Visualizing Correlations and Missing Values 72
3.4 Multidimensional Visualization 75
    Adding Attributes: Color, Size, Shape, Multiple Panels, and Animation 75
    Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, and Filtering 78
    Reference: Trend Lines and Labels 81
    Scaling Up to Large Datasets 82
    Multivariate Plot: Parallel Coordinates Plot 83
    Interactive Visualization 84
3.5 Specialized Visualizations 87
    Visualizing Networked Data 87
    Visualizing Hierarchical Data: Treemaps 89
    Visualizing Geographical Data: Map Charts 90
3.6 Summary: Major Visualizations and Operations, by Machine Learning Goal 92
    Prediction 92
    Classification 92
    Time Series Forecasting 92
    Unsupervised Learning 93
Problems 94

CHAPTER 4 Dimension Reduction 97
4.1 Introduction 97
4.2 Curse of Dimensionality 98
4.3 Practical Considerations 98
    Example 1: House Prices in Boston 99
4.4 Data Summaries 100
    Summary Statistics 100
    Aggregation and Pivot Tables 102
4.5 Correlation Analysis 103
4.6 Reducing the Number of Categories in Categorical Attributes 105
4.7 Converting a Categorical Attribute to a Numerical Attribute 107
4.8 Principal Component Analysis 107
    Example 2: Breakfast Cereals 107
    Principal Components 112
    Normalizing the Data 113
    Using Principal Components for Classification and Prediction 117
4.9 Dimension Reduction Using Regression Models 117
4.10 Dimension Reduction Using Classification and Regression Trees 119
Problems 120

PART III PERFORMANCE EVALUATION

CHAPTER 5 Evaluating Predictive Performance 125
5.1 Introduction 125
5.2 Evaluating Predictive Performance 126
    Naive Benchmark: The Average 127
    Prediction Accuracy Measures 127
    Comparing Training and Holdout Performance 130
    Lift Chart 130
5.3 Judging Classifier Performance 131
    Benchmark: The Naive Rule 132
    Class Separation 133
    The Confusion (Classification) Matrix 133
    Using the Holdout Data 134
    Accuracy Measures 135
    Propensities and Threshold for Classification 136
    Performance in Case of Unequal Importance of Classes 139
    Asymmetric Misclassification Costs 143
    Generalization to More Than Two Classes 146
5.4 Judging Ranking Performance 146
    Lift Charts for Binary Data