Vertex AI AutoML

Anurag Bhatia
7 min read · Sep 2, 2022


Submitting training jobs through Python SDK

Who is this blog meant for? The goals of this blog are to: 1) cover the basics of AutoML, its advantages and limitations, 2) show how to submit AutoML training jobs using the Python SDK for Vertex AI (Google Cloud’s flagship product for machine learning), and 3) share code snippets (one each) for both structured and unstructured data. Along the way, we’ll hopefully learn a thing or two about kubeflow pipelines as well.

What is AutoML? At the risk of over-simplifying, AutoML is a tool that takes our data as input and trains a machine learning model for us. We just submit our data and specify the performance metric to be optimized (e.g. Recall or ROC-AUC), and AutoML does pretty much everything else on its own, from feature engineering to trying out different model architectures and hyper-parameter tuning. The combination that performs best on that metric across all trials is chosen over the others. AutoML model performance is usually pretty decent to say the least, and sometimes not at all easy to beat.


Why should we care? AutoML can be a very handy thing to have in our ML toolkit if the “time to market” for executing an idea is rather short and we don’t have the luxury of spending weeks on EDA, feature engineering, trying different training algorithms, hyperparameter tuning etc. Each of those steps is automated in AutoML. If we choose to deploy, AutoML also provides an endpoint and an API instantly, ready to be consumed by our application. Another use-case of AutoML is benchmarking: consider a situation where we have an entirely new dataset and no idea about its inherent complexity, or whether there is any signal present in it at all. Using AutoML in such a case usually establishes a pretty robust reference point against which our custom models can be compared going forward.

Just one last thing before we dig deeper into the code: what is the difference between kubeflow and Vertex AI, given that both use Kubernetes (K8S) under the hood? kubeflow is open-source software that makes deployments of machine learning (ML) workflows simple, portable and scalable. It is a K8S-based toolkit, but its real USP is that it is cloud-native and custom-designed for ML use-cases. Vertex AI is a kubeflow-based managed service from Google Cloud (GCP) that provides a unified platform to address common pain points across the entire ML lifecycle, from data preparation all the way to post-deployment concerns like model monitoring. It enables this end-to-end solutioning and orchestration through kubeflow pipelines.

Pipelines-based architecture of Vertex AI (Source: Google Cloud)

Main advantages of Vertex AI: 1) It provides easy integration with other GCP products like Cloud Storage, BigQuery and Dataproc. 2) Since it containerizes the components (the different stages of modelling, like feature transformations, model training and model deployment), Vertex AI is framework-agnostic, i.e. it does not care whether we use sklearn, TensorFlow, PyTorch or any other framework. 3) Through Vertex AI, one can easily leverage GCP’s hardware accelerators (e.g. GPUs) for faster and distributed training of models. In fact, we even get the flexibility to choose which specific components use an accelerator (e.g. model training) while others don’t (e.g. data ingestion). 4) It supports pre-trained model APIs (e.g. Vision and Natural Language), AutoML, as well as custom models.

Common template: Irrespective of the nature of our data (tabular, images or text), the underlying building blocks of submitting AutoML jobs remain the same. Step 1) Create a managed dataset on Vertex AI. Consider this an intermediary yet necessary step to make our input data (e.g. files lying in a Cloud Storage bucket or a table in BigQuery) compatible with what AutoML expects as input. Step 2) Submit the training job, either as a kubeflow component of a pipeline (DAG), or in a stand-alone manner. We’ll try both of these here.

Let’s take a tabular data example to begin with: the relatively well-known open-source card-fraud dataset from Kaggle. We’ll start building a kubeflow pipeline and add the very first (i.e. most upstream) component, which creates a Vertex AI managed dataset.
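Here is a minimal sketch of that first component, using the pre-built `TabularDatasetCreateOp` from the `google_cloud_pipeline_components` library. The project ID, region and GCS paths below are placeholder assumptions, not values from the original notebook:

```python
from kfp.v2 import dsl
from google_cloud_pipeline_components import aiplatform as gcc_aip

PROJECT_ID = "my-project"                       # placeholder
REGION = "us-central1"                          # placeholder
GCS_SOURCE = "gs://my-bucket/creditcard.csv"    # placeholder path to the Kaggle CSV
PIPELINE_ROOT = "gs://my-bucket/pipeline_root"  # placeholder

@dsl.pipeline(name="automl-fraud-pipeline", pipeline_root=PIPELINE_ROOT)
def pipeline(project: str = PROJECT_ID, region: str = REGION):
    # Most upstream component: create a Vertex AI managed (tabular)
    # dataset from the CSV file sitting in a Cloud Storage bucket.
    dataset_create_op = gcc_aip.TabularDatasetCreateOp(
        project=project,
        location=region,
        display_name="card-fraud-dataset",
        gcs_source=GCS_SOURCE,
    )
```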

Now that our input data is ready to be fed into AutoML, it’s time to add the next component (i.e. the training job) to our pipeline.
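A sketch of that training component, added inside the same pipeline function as above. The optimization objective and budget are illustrative assumptions; `Class` is the label column of the Kaggle dataset:

```python
    # Downstream of dataset creation, inside the same pipeline function:
    training_op = gcc_aip.AutoMLTabularTrainingJobRunOp(
        project=project,
        location=region,
        display_name="card-fraud-automl-training",
        dataset=dataset_create_op.outputs["dataset"],
        target_column="Class",
        optimization_prediction_type="classification",
        optimization_objective="maximize-au-prc",  # a sensible choice for imbalanced fraud data
        budget_milli_node_hours=1000,              # 1000 milli node-hours = 1 node-hour
        column_transformations=[
            {"numeric": {"column_name": "Amount"}},
            # ...one entry per feature column (V1 .. V28, Time, etc.)
        ],
    )
```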

Note: 1) Providing values to the ‘column_transformations’ argument in the above step can be quite cumbersome if we have hundreds of columns, unless we write a custom script to automate it (a sketch follows below). 2) ‘budget_milli_node_hours’ is expressed in thousandths of a node-hour, so 1,000 milli node-hours equal one node-hour of compute, which is NOT the same thing as 1 hour (60 minutes) of wall-clock time; e.g. the node-hours used for this job were less than 2.5 while the actual time elapsed was different.
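As a hypothetical workaround for the first point, the entries can be generated programmatically, e.g. with pandas (assuming, as in this dataset, that all feature columns are numeric):

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")  # local copy of the dataset (placeholder path)
TARGET = "Class"

# One transformation entry per feature column, instead of
# typing out hundreds of entries by hand.
column_transformations = [
    {"numeric": {"column_name": col}} for col in df.columns if col != TARGET
]
```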

Model deployment (optional): Finally, we would like to deploy the model on Vertex AI, but only if the performance metrics are satisfactory. Hence the need to add another component here, this time using an inbuilt method.
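A sketch of that step with the pre-built `EndpointCreateOp` and `ModelDeployOp` components. The machine type and replica counts are illustrative, and the metric check that gates deployment is omitted here for brevity:

```python
    # Still inside the pipeline function: create an endpoint...
    endpoint_op = gcc_aip.EndpointCreateOp(
        project=project,
        location=region,
        display_name="card-fraud-endpoint",
    )

    # ...and deploy the trained AutoML model onto it, choosing the
    # serving machine type explicitly (no GPU needed for this model).
    gcc_aip.ModelDeployOp(
        model=training_op.outputs["model"],
        endpoint=endpoint_op.outputs["endpoint"],
        dedicated_resources_machine_type="n1-standard-4",
        dedicated_resources_min_replica_count=1,
        dedicated_resources_max_replica_count=1,
    )
```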

Now that all the individual components are ready and have been stitched together, it’s time to compile and trigger the pipeline.
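A minimal sketch of that submission step (file names below are placeholders):

```python
from kfp.v2 import compiler
from google.cloud import aiplatform

# Compile the pipeline function into a job spec...
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path="fraud_pipeline.json",
)

# ...and submit it as a run on Vertex AI Pipelines.
aiplatform.init(project=PROJECT_ID, location=REGION)
job = aiplatform.PipelineJob(
    display_name="automl-fraud-pipeline",
    template_path="fraud_pipeline.json",
    pipeline_root=PIPELINE_ROOT,
)
job.run()
```

Once the run completes successfully, this is how it looks on the Vertex AI console: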

Snapshot of Vertex AI pipeline (DAG)

What value do Vertex AI pipelines bring to the table? Here are some of the important ones: 1) the functionality to dig deeper into each component and check its logs for debugging, if required, 2) the ability to trace and analyze the artifact(s) passed from one component to another, and 3) the option to choose the type of instance (virtual machine) used for model deployment; e.g. I may or may not need GPUs here, and might prefer a high-memory instance over a higher number of cores or vice-versa, depending on the constraints and business requirements. For more details and access to the complete Jupyter notebook, here is the code repository.

Unstructured data: So far, so good. But how does the process look if the data contains images rather than table rows? Also, what if I am at an experimentation stage for now, making this not the best time to build a training-cum-deployment pipeline? Well, I also have the flexibility to submit AutoML training jobs on a stand-alone basis. I am going to take an image classification problem as an example (a fire-detection scenario). Diving back into the code, we again start with the creation of a managed dataset.
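A minimal sketch, assuming the labeled images are listed in a CSV import file on Cloud Storage (bucket paths and display names are placeholders):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# Create a managed image dataset from a GCS import file that lists
# image URIs and their labels (e.g. "fire" / "no_fire").
dataset = aiplatform.ImageDataset.create(
    display_name="fire-detection-dataset",
    gcs_source="gs://my-bucket/fire_detection/import.csv",  # placeholder
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
)
```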

And since this one is not orchestrated through a pipeline, we can proceed straight to the training job.
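A stand-alone sketch using `AutoMLImageTrainingJob` from the Vertex AI SDK; the split ratios and budget below are illustrative assumptions:

```python
# Stand-alone AutoML training job (no pipeline / DAG involved).
job = aiplatform.AutoMLImageTrainingJob(
    display_name="fire-detection-automl",
    prediction_type="classification",
    multi_label=False,
    model_type="CLOUD",
)

model = job.run(
    dataset=dataset,
    training_fraction_split=0.8,   # explicit train-validation-test split
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    budget_milli_node_hours=8000,  # 8 node-hours, the minimum for this job type
    disable_early_stopping=False,  # keep early stopping on to limit overfitting
)
```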

Note: 1) the ability to decide the train-validation-test split ratio for our data, and 2) whether or not we want to enable early stopping (and hopefully avoid overfitting). For more details, take a look at the Computer Vision repository.

When NOT to use AutoML? Though things are getting somewhat better over time on this front, the black-box problem continues to be the Achilles’ heel of AutoML, despite all the advantages mentioned above. What if we are operating in a heavily regulated industry like finance or healthcare, where good performance metrics are necessary but not sufficient, because we had better be in a position to explain where those results are coming from, i.e. what is happening under the hood of the model: which features? which algorithm? etc. Also, costs can quickly get out of hand with AutoML, especially if the data is large and complex and the model needs frequent re-training, so keep an eye on them.

Hope this was helpful. Please reach out if you have questions and/or suggestions. Happy (machine) learning!
