CLI - Launch an AI Training job
Objective
This guide covers the submission of jobs through the ovhai CLI.
To illustrate the submission, we will iteratively build a command to run the notebook image ovhcom/ai-training-transformers:3.1.0, which has the Hugging Face framework preinstalled.
This Docker image is freely available.
Requirements
- a working ovhai CLI (see how to install the ovhai CLI)
Instructions
job run
If you need any help while submitting a new job, run ovhai job run --help:
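For reference, the help is printed with:

```shell
ovhai job run --help
```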
Size your run
First, size the resources for your run according to your expected workload.
For example, if you are exploring data or designing your neural network, you might start with a few vCPUs. Once your experiment is ready, switch over to GPUs for training.
The --cpu and --gpu flags are mutually exclusive: if GPU resources are specified, the CPU flag is ignored and the standard GPU-to-CPU ratio is applied.
You can find out more about these ratios on the capabilities page.
If you provision GPUs for your run you can also select the model of GPU you wish to use with the --gpu-model flag.
If this flag is not specified the default GPU model for the cluster on which you submit is used.
You can find out about the default GPU for your cluster with the ovhai capability command.
The maximum amount of vCPUs or GPUs available depends on the GPU model and the cluster you are using.
You can find out about your cluster resources limitation with ovhai capability.
For this experiment we will deploy a notebook with 1 GPU of the default model.
- If no resource flag is specified, the job runs with one unit of the default GPU model.
- If both CPU and GPU flags are provided, only the GPU flag is considered.
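Putting this together, a minimal sketch of the sizing step, using the notebook image from this guide and the flags described above:

```shell
# Request one GPU of the cluster's default model
ovhai job run --gpu 1 ovhcom/ai-training-transformers:3.1.0
```

To start on CPUs instead, swap --gpu 1 for e.g. --cpu 4; remember the two flags are exclusive.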
Attaching volumes
This step assumes that you either have data in your OVHcloud Object Storage that you wish to use during your experiment, or that you need to save your job results into the Object Storage. To learn more about data, volumes and permissions, check out the data page.
You can attach as many volumes as you want to your job with various options. Let us go through those options and outline a few good practices with volume mounts.
The --volume flag is used to attach a container as a volume to the job.
The volume description sets the options for the volume and the synchronisation process, in the form <container@region/prefix:mount_path:permission:cache>:
- container: the container in OVHcloud Object Storage to synchronise
- region: the Object Storage region on which the container is located
- prefix: objects in the container are filtered on the basis of this prefix; only matching objects are synced
- mount_path: the location in the job where the synced data is mounted
- permission: the permission rights on the mounted data. Available rights are read only (ro), read write (rw) or read write delete (rwd). Data mounted with ro permission is not synced back at the end of the job, which avoids unnecessary synchronisation delay on static data.
- cache: whether the synced data should be added to the project cache. Available options are either cache or no-cache. Data in the cache can be used by other jobs without additional synchronisation; to benefit from the cache, the new jobs also need to mount the data with the cache option.
Let's assume you have a team of data scientists working on the same input dataset, each running their own experiment. In this case a good practice is to mount the input dataset with ro permission and cache activated for each experiment: the input data is synced only once and never synced back. In addition, each experiment yields specific results that should be stored in a dedicated container, so for each job we would mount an output container with rw permission and no cache. If a container does not exist yet in the Object Storage, it is created during the data synchronisation.
Assuming our data is located in the Gravelines Object Storage in a container named dataset, the command would now be:
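A sketch of the command with the dataset attached, assuming GRA is the region code for Gravelines; the mount path is illustrative:

```shell
# Mount the dataset container read-only with caching enabled
ovhai job run --gpu 1 \
	--volume dataset@GRA/:/workspace/dataset:ro:cache \
	ovhcom/ai-training-transformers:3.1.0
```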
- Data in the cache is not persisted indefinitely. After a period of inactivity the data is emptied from the cache. Inactivity is defined as having no running jobs using the data in cache.
Define your process
Once resources and volumes are set up, you need to define the specifics of the process running within your job.
First you need a Docker image, either one you built yourself or one freely available on a public repository such as DockerHub.
In our example we will use the notebook image ovhcom/ai-training-transformers:3.1.0.
You can tweak the behavior of your Docker image without having to rebuild it every time (like updating the number of epochs for a training run) by using the --env flag.
Using this you can simply set environment variables directly in your job, e.g.:
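As a sketch, where EPOCHS is a hypothetical variable that your training script would read:

```shell
# Set an environment variable inside the job
ovhai job run --gpu 1 \
	--env EPOCHS=100 \
	ovhcom/ai-training-transformers:3.1.0
```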
In our example we do not require any environment variable.
It is also possible to override the default CMD or ENTRYPOINT of the Docker image: simply add the new command at the end of the job run request.
To make sure flags from your command are not interpreted as ovhai parameters, you can prefix your command with --.
To simply print Hello World the command would be:
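A sketch of such an override, using -- as the separator described above:

```shell
# Replace the image's default command with a simple echo
ovhai job run ovhcom/ai-training-transformers:3.1.0 -- echo "Hello World"
```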
When a job is running, a job_url is associated with it that allows you to access any service exposed in your job. By default, the exposed port for this URL is 8080; in our case the Jupyter notebook is directly exposed on 8080, so we do not need to override it.
However, if you are running an experiment and monitoring it with TensorBoard, the default port should be 6006. You can override the port with:
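As a sketch, assuming the exposed port is overridden with the --default-http-port flag (check ovhai job run --help for the exact flag on your CLI version):

```shell
# Expose TensorBoard's default port behind the job_url
ovhai job run --gpu 1 \
	--default-http-port 6006 \
	ovhcom/ai-training-transformers:3.1.0
```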
Extra options
A few other options are available for your jobs.
- --timeout: timeout after which the job stops even if the process in the job did not end; helps you control your consumption
- --label: free labels to help you organize your jobs. Labels are also used to scope app_token; learn more about app_token and how to create them here.
- --read-user: you can add a read-user to a job; a read user will only have access to the service exposed behind the job_url. The read-user must match the username of an AI Platform user with an AI Training read role.
- --ssh-public-keys: allows you to access your job through SSH; particularly useful to set up a VSCode Remote
- --from: run a job based on the specification of a previous one. All options override the base job values. The --image flag is used to override the image of the base job.
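For illustration, a sketch combining some of these options; the timeout value and label are placeholders, and the exact value formats should be checked in the help output:

```shell
# Stop the job after the timeout and tag it with a label
ovhai job run --gpu 1 \
	--timeout 3600 \
	--label team=nlp \
	ovhcom/ai-training-transformers:3.1.0
```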
Run a job
Finally, to submit a notebook job with 1 GPU, a dataset container and an output container, we run:
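A sketch of the full submission; the container names, region and mount paths are illustrative:

```shell
# Read-only cached input, writable uncached output
ovhai job run --gpu 1 \
	--volume dataset@GRA/:/workspace/dataset:ro:cache \
	--volume output@GRA/:/workspace/output:rw \
	ovhcom/ai-training-transformers:3.1.0
```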
You can then follow the progress of all your jobs using the following commands:
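For example:

```shell
# List your jobs and their current status
ovhai job list
```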
If you want to fetch the specific job you just selected, retrieve its ID and then:
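A sketch, where <job-id> is the ID returned at submission:

```shell
# Show the details of a single job
ovhai job get <job-id>
```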
For more information about the job and its lifecycle refer to the jobs page.
Going further
To know more about the CLI and the available commands to interact with your job, check out the ovhai overview.
If you need training or technical assistance to implement our solutions, contact your sales representative or click on this link to get a quote and ask our Professional Services experts for a custom analysis of your project.
Feedback
Please send us your questions, feedback and suggestions to improve the service:
- On the OVHcloud Discord server
Secure Shell (SSH): a secure network protocol used to establish connections between a client and a server. It allows commands to be executed remotely in a secure manner. ↩