Link to the repository

Introduction

  • This project uses the House Prices data obtained from Kaggle.

On Kaggle, I have seen people use linear regression, generalized linear models, XGBoost, support vector machines, and other methods. This article is about using automated machine learning (AutoML) with the h2o package.

The current version of AutoML trains and cross-validates the following algorithms (in the following order): three pre-specified XGBoost GBM (Gradient Boosting Machine) models, a fixed grid of GLMs, a default Random Forest (DRF), five pre-specified H2O GBMs, a near-default Deep Neural Net, an Extremely Randomized Forest (XRT), a random grid of XGBoost GBMs, a random grid of H2O GBMs, and a random grid of Deep Neural Nets.

AutoML then trains two Stacked Ensemble models: one from all the models created, and one from the best model of each algorithm class/family. Both ensembles should outperform any individual model from the AutoML run, except in rare cases. The advantage of AutoML is that it can help produce better predictions. In addition, after fitting the models, we can use model explanation methods to identify the important factors that affect the outcome and how they affect it. (Link to the “Interpretable Machine Learning” book.)
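To make the description above concrete, here is a minimal sketch of an AutoML run for a continuous outcome. The wrapper name `fit_automl` and its default values are my own assumptions for illustration. The column names `SalePrice` and `Id` are those of the Kaggle House Prices training file, and the `h2o::` functions are the package's real entry points.

```r
# Hypothetical wrapper around an AutoML run; argument defaults are assumptions.
fit_automl <- function(train_csv,
                       outcome = "SalePrice",
                       id_col = "Id",
                       max_models = 20,
                       seed = 1) {
  # Start (or connect to) a local H2O cluster; this needs Java installed.
  h2o::h2o.init()

  # Load the CSV into the cluster as an H2OFrame.
  train <- h2o::h2o.importFile(train_csv)

  # Predictors: every column except the outcome and the row identifier.
  predictors <- setdiff(names(train), c(outcome, id_col))

  # A numeric outcome column makes AutoML treat this as a regression problem.
  # max_models caps the number of base models (ensembles are built on top).
  h2o::h2o.automl(
    x = predictors,
    y = outcome,
    training_frame = train,
    max_models = max_models,
    seed = seed
  )
}
```

Calling `aml <- fit_automl("house_dat/train.csv")` (the path is an assumption about where the data lives) returns an AutoML object. Its `aml@leaderboard` slot ranks all trained models, including the two Stacked Ensembles, and `aml@leader` holds the best one.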

My goal for this project is to explain how to use AutoML with data that has a continuous outcome. All the code is wrapped in functions with detailed explanations, so it can be conveniently reused later with other data. I will first write some simple code for exploring the data, then move on to the central part, AutoML.

Some notes before working with the project

1- You need to create a project in R. This is how. It keeps your work much more organized and ensures that your code, data, and results do not get mixed together.

2- For this project, I first create a new project in a folder with the same name, say ml_with_r_h2o. Then I create a folder named code that holds all the R scripts (.R) and R Markdown files (.Rmd) that I will use. I also create a folder named house_dat that contains the train and test data.

3- This article is easier to follow if you are familiar with data wrangling using tidyverse. A good book for studying is R for Data Science, or, for more advanced material, Advanced R. Here are some things I use in this project that you might find in these books or through the links provided:

How to reuse functions that you create in scripts: load the script that defines them with source().

Using glue to write shorter and well-organized R code.
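The notes above can be tied together in a small sketch. The helper name `data_path`, the script name `code/helpers.R`, and the use of a temporary directory as the project root are assumptions for illustration. `source()` and `glue::glue()` are the real functions mentioned in the notes.

```r
# Illustrative project root; in practice this is the project folder itself.
project_root <- file.path(tempdir(), "ml_with_r_h2o")

# Note 2: create the code/ and house_dat/ folders.
for (d in c("code", "house_dat")) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# A helper saved in its own script (hypothetical file name: code/helpers.R).
# glue() fills the {placeholders} with the values of the function arguments.
helper_file <- file.path(project_root, "code", "helpers.R")
writeLines(
  c(
    "data_path <- function(root, file) {",
    "  as.character(glue::glue('{root}/house_dat/{file}'))",
    "}"
  ),
  helper_file
)

# Reuse the helper in another script by sourcing the file that defines it.
source(helper_file)

train_path <- data_path(project_root, "train.csv")
```

Keeping helpers in their own script means every analysis file can start with a single `source()` call instead of repeating the function definitions.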

In this project, I will:

  1. Create a table that contains information from the description file. This seems unnecessary for small data, but it pays off for data with many columns, where you can’t check every single column manually. (Link to the work - Work with description file)

  2. Clean the data according to the description file, using Tidyverse for data wrangling. (Link to the work - Read, explore and pre-process the data with Tidyverse)

  3. Run some models with the preprocessed data by applying AutoML using h2o.
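As a rough idea of what step 1 involves, here is a minimal base-R sketch that turns description text into a table. The sample lines only imitate the layout of Kaggle's data_description.txt (a `Name: description` line followed by indented level lines). That layout, the object names, and the `n_levels` column are my assumptions, not the code used in the linked work.

```r
# A few lines in the style of the description file (sample, not the full file).
desc_lines <- c(
  "MSSubClass: Identifies the type of dwelling involved in the sale.",
  "",
  "        20\t1-STORY 1946 & NEWER ALL STYLES",
  "        30\t1-STORY 1945 & OLDER",
  "",
  "LotArea: Lot size in square feet",
  "",
  "Street: Type of road access to property",
  "",
  "       Grvl\tGravel",
  "       Pave\tPaved"
)

# Variable lines start in the first column and look like "Name: description".
is_var <- grepl("^[A-Za-z0-9]+:", desc_lines)
# Level lines are indented and non-empty.
is_level <- grepl("^\\s+\\S", desc_lines)

# Assign every line to the most recent variable line above it.
grp <- cumsum(is_var)
keep <- grp > 0

var_lines <- desc_lines[is_var]
desc_tbl <- data.frame(
  variable = sub(":.*$", "", var_lines),
  description = trimws(sub("^[^:]+:", "", var_lines)),
  # Number of listed levels per variable (0 suggests a numeric column).
  n_levels = as.integer(tapply(is_level[keep], grp[keep], sum)),
  stringsAsFactors = FALSE
)

print(desc_tbl)
```

With the real file, `desc_lines` would come from `readLines()` on the description file. A table like this gives a quick overview of which columns are categorical before any cleaning starts.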