AI Training - Tutorial - Train a PyTorch model and export it to ONNX
Objective
The aim of this tutorial is to show you how to train a custom PyTorch model and export it into ONNX (Open Neural Network Exchange) format.
The goal is to train your own image classification model on the famous MNIST dataset. At the end of the model training, it will be saved in PyTorch format (.pt) and then converted to ONNX.
Exporting your model in ONNX format allows you to optimize the inference of a Machine Learning model.
Requirements
- Access to the OVHcloud Control Panel.
- A Public Cloud project created.
- The ovhai CLI interface installed on your system (more information here).
- Docker installed and configured to build images.
- An OCI / Docker image registry. You can use a public registry (such as Docker Hub for example) or a private registry. Refer to the Creating a private registry documentation to create a private registry based on Harbor. To make your registry compatible with AI Solutions usage, follow the Use & manage your registries guide.
- Knowledge about building images with Dockerfile.
Instructions
Create an Object Storage bucket for your ONNX model
To be able to retrieve and use the ONNX model at the end of training, you need to create an empty bucket to store it.
Create your bucket via UI (Control Panel)
If you do not feel comfortable with commands, this method may be more intuitive.
First, go to the Public Cloud section of the OVHcloud Control Panel.
Then, select the Object Storage section (in the Storage category) and create a new object container by clicking Storage > Object Storage > Create an object container.
You can create the bucket that will store your ONNX model at the end of the training. Select the container type and the region that match your needs.
Create your bucket via ovhai CLI
To follow this part, make sure you have installed the ovhai CLI on your computer or on an instance.
As in the Control Panel, you will have to specify the region and the name (cnn-model-onnx) of your bucket. Create your Object Storage bucket as follows:
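For example, in the GRA region (the exact subcommand may vary between ovhai versions; run `ovhai bucket --help` to confirm the syntax of your CLI):

```
ovhai bucket create GRA cnn-model-onnx
```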
Write the model training Python code
To train the model, we will use AI Training. This powerful tool will allow you to train your AI models from your own Docker images.
You need to create a Python script that is in charge of doing the training: train_image_classification.py.
First, import the Python dependencies.
Then, define the Neural Network architecture.
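The architecture below is a small convolutional network suited to 28x28 grayscale MNIST images; the layer sizes are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """Simple CNN for 28x28 grayscale MNIST images, 10 output classes."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)   # 28x28 -> 26x26
        self.conv2 = nn.Conv2d(32, 64, 3, 1)  # 26x26 -> 24x24
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)       # 64 channels * 12 * 12 after max-pool
        self.fc2 = nn.Linear(128, 10)         # 10 digit classes

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)                # 24x24 -> 12x12
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        return F.log_softmax(self.fc2(x), dim=1)
```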
Now, define the load_data function.
Check the GPU availability.
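A common pattern for this check: use the GPU when one is available, otherwise fall back to the CPU.

```python
import torch

# Select the GPU if CUDA is available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")
```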
Next, define the function that will train your model.
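A sketch of the training function. It assumes the model outputs log-probabilities (log_softmax), so the negative log-likelihood loss applies; the log_interval parameter is illustrative.

```python
import torch
import torch.nn.functional as F

def train(model, device, train_loader, optimizer, epoch, log_interval=100):
    """Run one training epoch and log the loss periodically."""
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)  # matches the log_softmax output
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print(f"Epoch {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}] "
                  f"loss: {loss.item():.4f}")
```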
You can also define the function that will test your model.
Finally, export the model to ONNX format thanks to the following function.
It's now time to call these functions in the main function!
Find the full Python code on our GitHub repository.
Create the requirements.txt file
Then, create a requirements.txt file to declare the Python dependencies and their versions.
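For example (the version pins below are illustrative; pin the versions you actually tested with):

```
torch==2.0.1
torchvision==0.15.2
```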
Build your own Docker image
Create a Dockerfile compliant with AI Training
The Dockerfile should start with the FROM instruction indicating the parent image to use. In our case we choose to start from a python:3.10 image.
Then, specify the workspace path, install the Python dependencies, and launch the model training using the CMD instruction.
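A minimal Dockerfile along these lines (OVHcloud AI products run containers as user 42420:42420, hence the chown; the script name matches the one created earlier):

```
FROM python:3.10

WORKDIR /workspace
ADD . /workspace

RUN pip install -r requirements.txt

# AI Training runs containers as user 42420:42420
ENV HOME=/workspace
RUN chown -R 42420:42420 /workspace

CMD [ "python3", "/workspace/train_image_classification.py" ]
```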
Build the Docker image from the Dockerfile
From the directory containing your Dockerfile, run one of the following commands to build your application image:
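For reference, the two variants (with the image name chosen below):

```
docker build . -t train-cnn-model-export-onnx:latest

docker buildx build --platform linux/amd64 . -t train-cnn-model-export-onnx:latest
```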
- The first command builds the image using your system's default architecture. This may work if your machine already uses the linux/amd64 architecture, which is required to run containers with our AI products. However, on systems with a different architecture (e.g. ARM64 on Apple Silicon), the resulting image will not be compatible and cannot be deployed.
- The second command explicitly targets the linux/amd64 architecture to ensure compatibility with our AI services. This requires buildx, which is not installed by default. If you haven't used buildx before, you can install it by running: docker buildx install
The dot . argument indicates that your build context (the location of the Dockerfile and other needed files) is the current directory.
The -t argument allows you to choose the identifier to give to your image. Usually image identifiers are composed of a name and a version tag <name>:<version>. For this example we chose train-cnn-model-export-onnx:latest.
Push the image into the shared registry
The shared registry of AI Deploy should only be used for testing purposes. Please consider attaching your own Docker registry. More information about this can be found here. The images pushed to this registry are for AI Tools workloads only, and will not be accessible for external uses.
Find the address of your shared registry by launching this command:
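```
ovhai registry list
```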
Log in to the shared registry with your usual AI Platform user credentials:
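Replace the placeholders with your own values; docker login will prompt for your password:

```
docker login -u <username> <shared-registry-address>
```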
Push the compiled image into the shared registry:
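Tag the image with the registry address, then push it:

```
docker tag train-cnn-model-export-onnx:latest <shared-registry-address>/train-cnn-model-export-onnx:latest
docker push <shared-registry-address>/train-cnn-model-export-onnx:latest
```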
Once your Docker image is created and pushed into the registry, you can directly use the ovhai command to create your model training.
You can launch the training with more or fewer GPUs, depending on how fast you want the training to run.
If your images are stored in a private registry, please follow the documentation Registries - Use & manage your registries to add your registry.
Launch the AI Training job
You can launch the training job using the UI or the CLI.
Create your training job via UI (Control Panel)
If you do not feel comfortable with commands, this method may be more intuitive.
First, go to the Public Cloud section of the OVHcloud Control Panel.
Then, select the AI Training section (in the AI & Machine Learning category) and create a new job by clicking AI Training > Launch a new job.
You can create the job that will train your model and export it to ONNX format. Select the region and add your custom Docker image (<shared-registry-address>/train-cnn-model-export-onnx:latest).
Then attach your Object Storage container cnn-model-onnx and define the mount directory: /workspace/models.
Finally, configure your job and choose at least 1 GPU.
Create your training job via ovhai CLI
The following command starts the training:
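An illustrative invocation with one GPU, mounting the cnn-model-onnx bucket read-write at /workspace/models (the volume syntax assumes a GRA bucket; check `ovhai job run --help` for the exact form supported by your CLI version):

```
ovhai job run <shared-registry-address>/train-cnn-model-export-onnx:latest \
    --gpu 1 \
    --volume cnn-model-onnx@GRA:/workspace/models:RW
```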
Then you can check the training evolution through the job logs:
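```
ovhai job logs <job-id>
```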
Go further
Check our other tutorials to learn how to:
- Use Transfer Learning with ResNet50 for image classification
- Fine-Tune and export AI model to ONNX through an AI Notebook
Feedback
Please send us your questions, feedback and suggestions to improve the service:
- On the OVHcloud Discord server