How to automate machine learning on SQL Server 2019 big data clusters

In this post, we will explore how to use automated machine learning (AutoML) to create new machine learning models over your data in SQL Server 2019 big data clusters.

SQL Server 2019 big data clusters make it possible to use the software of your choice to fit machine learning models on big data and use those models to perform scoring. In fact, Apache Spark™, the popular open source big data framework, is now built in! Apache Spark™ includes the MLlib machine learning library, and the open source community has developed a wealth of additional packages that integrate with and extend Apache Spark™ and MLlib.

Automated machine learning

Manually selecting and tuning machine learning models requires familiarity with a variety of model types and can be laborious and time-consuming. Software that automates this process has recently become available, relieving both novice and expert data scientists and ML engineers of much of the burden of manual model selection and tuning.

H2O’s open source AutoML APIs

H2O provides popular open source software for data science and machine learning on big data, including Apache Spark™ integration. It provides two open source Python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations; however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.

H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.

H2OAutoML can automatically split training data into training, validation, and leaderboard frames. The h2o.automl.H2OAutoML API also allows these frames to be specified manually, which is useful when the task is to predict the future using a model trained on historical data.
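
As a rough sketch of that manual option (column and variable names are illustrative), assuming train_df, valid_df, and lb_df are Spark DataFrames split by date, and hc is an H2OContext as created later in the notebook:

    # Sketch only: uses the h2o.automl API, which accepts explicit frames.
    # Note: asH2OFrame is named as_h2o_frame in some older Sparkling Water releases.
    from h2o.automl import H2OAutoML

    aml = H2OAutoML(max_models=2, seed=1)
    aml.train(y="HourlyEnergyOutputMW",              # illustrative label column
              training_frame=hc.asH2OFrame(train_df),
              validation_frame=hc.asH2OFrame(valid_df),
              leaderboard_frame=hc.asH2OFrame(lb_df))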

Models produced by H2OAutoML can be persisted to disk, used for prediction/scoring in an Apache Spark™ cluster, used in local mode in Apache Spark™ running on a single node, or used in a Java Virtual Machine (JVM) with the necessary libraries on the CLASSPATH. These options allow batch and real-time scoring in a SQL Server 2019 big data cluster: within Apache Spark™, within a Transact-SQL stored procedure, or deployed as an application.
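
For example, because the fitted pipeline is a standard MLlib PipelineModel, it can be saved and reloaded with the usual persistence API (a sketch; the path is illustrative, and the H2O stages must support MLlib persistence in your Sparkling Water version):

    from pyspark.ml import PipelineModel

    # Save the fitted pipeline (see the power plant example below) and reload it.
    model.write().overwrite().save("/models/powerplant_automl")  # illustrative path
    reloaded = PipelineModel.load("/models/powerplant_automl")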

Running PySpark3 notebooks in Azure Data Studio

The code discussed in this blog is available as a Jupyter notebook written for the PySpark3 kernel. You can now run Apache Spark™ notebooks in Azure Data Studio connected to a SQL Server 2019 big data cluster as described in this notebook how-to.

Power plant output prediction

Let’s take a tour through our example Jupyter notebook, which shows how a customer running a power plant could take advantage of H2O AutoML in Apache Spark™ to predict power plant output. This example is based on an H2O blog post.

The first cells of the notebook set Apache Spark™ parameters and install the H2O PySparkling package if it’s not already installed; this package provides the pysparkling.ml.H2OAutoML class.
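
A minimal sketch of such an install cell, assuming pip is available on the driver and using the Spark-version-specific package name (h2o_pysparkling_2.3 targets Spark 2.3; adjust to match your cluster):

    import importlib, subprocess, sys

    # Install H2O's PySparkling package on the driver if it isn't already present.
    try:
        importlib.import_module("pysparkling")
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "h2o_pysparkling_2.3"])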

Next, the notebook code downloads the CSV file containing the data and copies the file to HDFS, if it’s not already present.
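
A sketch of that step, with a placeholder download URL and illustrative paths:

    import os, subprocess, urllib.request

    csv_url = "https://example.com/powerplant_output.csv"  # placeholder URL
    local_path = "/tmp/powerplant_output.csv"              # illustrative paths
    hdfs_path = "/user/data/powerplant_output.csv"

    # Download the CSV to the driver's local disk if it isn't already there.
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(csv_url, local_path)

    # hdfs dfs -test -e returns a nonzero exit code when the file is missing.
    if subprocess.call(["hdfs", "dfs", "-test", "-e", hdfs_path]) != 0:
        subprocess.check_call(["hdfs", "dfs", "-put", local_path, hdfs_path])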

Running H2OContext.getOrCreate starts the H2O engine.
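
In code, assuming the SparkSession is available as spark (recent Sparkling Water releases also accept getOrCreate() with no arguments):

    from pysparkling import H2OContext

    # Start (or attach to) the H2O engine running inside the Spark cluster.
    hc = H2OContext.getOrCreate(spark)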

Next, the notebook uses Apache Spark™ to read the data from HDFS and randomly split it into training and prediction/test sets.
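
For instance (the split ratio and seed are illustrative):

    # Read the power plant CSV from HDFS; infer the schema from the data.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(hdfs_path))

    # Randomly split into training and prediction/test sets.
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)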

It takes only a few lines of code to invoke automated machine learning. Here is a minimal sketch, assuming the power plant dataset’s column names; note that the label column parameter has been named labelCol in some Sparkling Water versions and predictionCol in others:
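
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import SQLTransformer
    from pysparkling.ml import H2OAutoML

    # Keep only rows with a Celsius temperature above 10 (column name illustrative).
    temperature_filter = SQLTransformer(
        statement="SELECT * FROM __THIS__ WHERE TemperatureCelcius > 10")

    # Build and tune models automatically; stop after two models for demo purposes.
    automl = H2OAutoML(labelCol="HourlyEnergyOutputMW", maxModels=2, seed=1)

    pipeline = Pipeline(stages=[temperature_filter, automl])
    model = pipeline.fit(train_df)           # train on the training set
    predictions = model.transform(test_df)   # score the held-out set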

Here, you are defining a modeling pipeline, fitting it on the training data, and using it to generate predictions on the test data. In our example, we set maxModels=2, which results in two tree-based models and two (identical) stacked ensemble models. This is sufficient for demonstration purposes, but in practice, you should allow H2OAutoML to explore more models to achieve the best possible prediction metrics. If you simply omit the maxModels argument, then H2OAutoML will explore models for a maximum of maxRuntimeSecs, which defaults to 3600 seconds (1 hour).
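
For example (a sketch, with the same caveats about parameter names as above):

    # Omit maxModels and cap the search by time instead (10 minutes here).
    automl = H2OAutoML(labelCol="HourlyEnergyOutputMW", maxRuntimeSecs=600)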

Our code follows the standard pattern for using the Apache Spark™ MLlib library because the pysparkling.ml.H2OAutoML class inherits from pyspark.ml.wrapper.JavaEstimator.

Notice that we included an Apache Spark™ SQLTransformer in our pipeline, showing that a standard Spark MLlib transformer can be used alongside a pysparkling.ml.H2OAutoML estimator in an Apache Spark™ MLlib pipeline. During both training and scoring, this transformer skips any rows whose Celsius temperature is less than or equal to 10.

You can see the generalization performance of our model by looking at the leaderboard. The generalization performance we get for predictions on held-out data should be similar to the leaderboard performance. You can use Apache Spark’s RegressionEvaluator class to compute metrics such as the mean absolute error (MAE). As expected, the MAE for predictions on held-out data is similar to the leaderboard MAE, with both typically between 2.3 and 2.5.
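
A sketch of that computation, assuming the predictions DataFrame produced by the pipeline above:

    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(
        labelCol="HourlyEnergyOutputMW",  # illustrative label column
        predictionCol="prediction",
        metricName="mae")
    print("Held-out MAE:", evaluator.evaluate(predictions))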

Scale and monitor big data in SQL Server 2019 big data clusters

With SQL Server 2019, not only can you automatically select and tune machine learning models, you can also easily scale and monitor your big data cluster.

Scaling to big data

Using SQL Server 2019 big data clusters, you can leverage large amounts of compute and memory to process data at scale quickly and efficiently. To scale to big data, you can configure the following parameters:

  • The number and size of nodes in the cluster
  • The number of Apache Spark™ pods
  • YARN scheduler memory and cores
  • Apache Spark™ driver and executor memory, cores, and the number of executors per pod
  • Livy timeout

Details on setting these parameters are included in the sample notebook.
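
As one example, notebooks using the PySpark3 kernel can pass Apache Spark™ settings to Livy through a %%configure cell at the top of the notebook (the values shown are illustrative):

    %%configure -f
    {
        "driverMemory": "4g",
        "executorMemory": "8g",
        "executorCores": 4,
        "numExecutors": 8
    }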

Monitoring and diagnostics

SQL Server 2019 big data clusters include powerful tools for monitoring and diagnostics. The sample notebook includes instructions for accessing the following graphical user interfaces for monitoring, controlling, and troubleshooting runs in Apache Spark™:

YARN UI

  • Shows the available and used memory and virtual cores in the Apache Spark™ cluster
  • Lists running and completed Apache Spark™ applications
  • Provides links to the Apache Spark™ UI for running applications and Spark History for completed applications
  • Allows running applications to be terminated

Apache Spark™ UI

  • Provides detailed information on running Apache Spark™ applications

Apache Spark™ History

  • Provides details on completed Apache Spark™ applications
  • Includes newly available Microsoft diagnostics for Apache Spark™ applications

H2O Flow UI

  • Monitors H2O job progress and engine status

Conclusion

In this blog post, you’ve learned that SQL Server has gained a powerful new capability in the 2019 preview: the ability to run machine learning workloads on big data using built-in Apache Spark™, with additional packages of your choosing, such as H2O’s automated machine learning software. We took a tour through a sample Apache Spark™ notebook for automated machine learning that can be run in Azure Data Studio against a SQL Server 2019 big data cluster. You’ve also seen how to scale resources such as nodes, cores, and memory, and how to monitor Apache Spark™ applications using built-in graphical user interfaces.

Resources

The pysparkling.ml.H2OAutoML class is part of H2O’s Spark integration, Sparkling Water, which is documented here. Unfortunately, this site currently lacks detailed documentation of pysparkling.ml.H2OAutoML. Instead, you can find help on pysparkling.ml.H2OAutoML’s attributes and methods by running the following Python commands:

    from pysparkling.ml import H2OAutoML

    help(H2OAutoML)

Since pysparkling.ml.H2OAutoML and h2o.automl.H2OAutoML share underlying code, it is also helpful to refer to the latter’s documentation.