In this blog, we will introduce you to training machine learning models on the TrueFoundry Platform. We will discuss how we can run training jobs on TrueFoundry. We will also see how you can perform hyperparameter tuning easily for your machine-learning models, and run your jobs on GPUs.
Let's first start with a problem statement, say we want to see how the diabetes disease will progress in a patient based on various features like age, BMI, blood pressure, etc. In this blog, we will train the Diabetes Dataset machine-learning model in scikit-learn.Speaking of training machine-learning models, there are several ways to do so like training locally on your machine, training in Jupyter Notebooks, etc. However, the training process might require more resources than what is available on a local machine. This is where TrueFoundry's Jobs enables you to deploy the training code to run on a remote machine and you can track the logs and metrics.
Diabetes Dataset
Note: While we are using this diabetes dataset, the instructions mentioned in this blog apply to other machine-learning / deep-learning models as well.
Jobs provide a way to run short-lived, parallel, or sequential batch tasks within the cluster. Jobs are designed to run to completion, rather than being long-running services or continuously running applications. Once a job is completed, compute and memory resources are released, hence we don't incur any extra cost.
As said before, we will use the Diabetes Dataset in scikit-learn. The dataset contains 442 samples (patients) and 10 features, which are all numeric. The features represent various factors that can affect the progression of diabetes in patients. The target variable is also numeric and represents a quantitative measure of disease progression one year after baseline for each patient.
Before we proceed to train machine-learning models, let's go through the setup instructions:
Go to the TrueFoundry Dashboard and create an account. As soon as you log in, you will be prompted to create a workspace. You would be deploying your jobs in this workspace.
Once you have created your workspace, go ahead and create an ML Repository from the dashboard.
An ML Repository is a collection of runs, models, and artifacts which represents a Machine Learning project. You can think of it like a git repository except that it houses artifacts, models and metadata. All the access controls can be configured on the level of ml-repo.
Once the ML Repo is created, go to workspaces and edit your workspace and enable 'ML Repo Access'. Click on 'Add ML Repo Access' to add your ML Repo to this workspace. This will grant the job running in the workspace to write and read from the ML repo.
pip install servicefoundry
--host: Pass your TrueFoundry Dashboard URL here
--host
sfy login --host <YOUR-HOST-URL-HERE>
Once we have completed the above setup instructions, we can move forward with the implementation section.
Directory Structure
In this blog, we will adhere to the following directory structure where:
❯ tree.├── deploy.py├── requirements.txt└── train.py
Now let's go over the diabetes model training code:
Model Training Steps
from sklearn.metrics import accuracy_scorefrom sklearn.datasets import load_diabetesfrom sklearn.model_selection import train_test_splitfrom sklearn.compose import TransformedTargetRegressorfrom sklearn.preprocessing import QuantileTransformerfrom sklearn.svm import SVR
X, y = load_diabetes(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
regressor = SVR(kernel=kernel)model = TransformedTargetRegressor( regressor=regressor, transformer=QuantileTransformer(n_quantiles=n_quantiles, output_distribution="normal"),)model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)print(f"Accuracy: {accuracy:.2f}")
Model Training Complete Code
from sklearn.metrics import accuracy_scorefrom sklearn.datasets import load_diabetesfrom sklearn.model_selection import train_test_splitfrom sklearn.compose import TransformedTargetRegressorfrom sklearn.preprocessing import QuantileTransformerfrom sklearn.svm import SVRdef train(kernel: str, n_quantiles: int): # load the dataset and create train and test sets X, y = load_diabetes(as_frame=True, return_X_y=True) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # initialize the model regressor = SVR(kernel=kernel) model = TransformedTargetRegressor( regressor=regressor, transformer=QuantileTransformer(n_quantiles=n_quantiles, output_distribution="normal"), ) # train and test model model.fit(X_train, y_train) y_pred = model.predict(X_test) accuracy = accuracy_score(y_test, y_pred) print(f"Accuracy: {accuracy:.2f}" return regressor, model, X_test, y_test
Now that we have seen the code to train a machine-learning model, we can go ahead and learn about how to save (or log) such models for future use. A Model comprises of model file and some metadata. Each Model can have multiple versions. We can automatically serialize, save, and version model objects by using the save_model_metadata method and the following are the steps to do so:
save_model_metadata
Model Logging Steps
import mlfoundry
run = mlfoundry.get_client().create_run(ml_repo=ml_repo, run_name="SVR-with-QT")
A Run represents a single experiment which in the context of Machine Learning is one specific model (say Logistic Regression), with a fixed set of hyper-parameters. Metrics, and parameters (details below) are all logged under a specific run.
run.log_params(regressor.get_params())
run.log_metrics({"score": model.score(X_test, y_test)})
model_version = run.log_model(name="diabetes-regression", model=model, framework="sklearn")print("model_version =", model_version.version, "model_fqn =", model_version.model_fqn)
Each logged model generates a new version associated with the given name and linked to the current run. Multiple versions of the model can be logged as separate versions under the same name.
name
Model Logging Complete Code
import mlfoundrydef save_model_metadata(regressor, model, X_test, y_test, ml_repo): # create a run in truefoundry’s ml_repo run = mlfoundry.get_client().create_run(ml_repo=ml_repo, run_name="SVR-with-QT") # log the hyperparameters of the model run.log_params(regressor.get_params()) # log the metrics of the model run.log_metrics({"score": model.score(X_test, y_test)}) # log the model model_version = run.log_model(name="diabetes-regression", model=model, framework="sklearn") print("model_version =", model_version.version, "model_fqn =", model_version.model_fqn)
Now that we have seen the process of model training and logging, we can compile that into a single train.py file. The final contents of the train.py file should look like this:
train.py
# required import statements for both functionsdef train(kernel, n_quantiles): ...def save_model_metadata(regressor, model, X_test, y_test, ml_repo): ...regressor, model, X_test, y_test = train(kernel="linear", n_quantiles=100)save_model_metadata(regressor, model, X_test, y_test, ml_repo="YOUR ML REPO NAME")
This completes our code for model training and logging. Now we have to deploy the model training code as a job. The deploy.py contains the code for deploying the above model training code as shown below:
deploy.py
from servicefoundry import Build, Job, PythonBuild, LocalSource# defining the job specificationsjob = Job( name="diabetes-train-job", image=Build( build_spec=PythonBuild(command="python train.py", requirements_path="requirements.txt"), build_source=LocalSource(local_build=False) ),)deployment = job.deploy(workspace_fqn="YOUR WORKSPACE FQN HERE")
The requirements.txt should contain the following packages:
requirements.txt
pandas==1.3.5scikit-learn==1.2.1mlfoundry>=0.7.2,<0.8.0
In the above deploy.py code, a job is deployed that displays the accuracy score of the trained model in the logs when it is invoked. It also logs the trained machine-learning model. To do this, a job object is created using the servicefoundry.Job class. The job name is kept as diabetes-train-job here.
servicefoundry.Job
diabetes-train-job
NOTE: Make sure you replace "YOUR ML REPO NAME" with your ML Repo Name in train.py and "YOUR WORKSPACE FQN HERE" with your workspace FQN in deploy.py file.
In the train.py file, you need to pass the Name of the ML Repo you created to the save_model_metadata() function. In the deploy.py file, you need to pass FQN of the workspace you created to job.deploy() function. Now execute the following command to deploy the job:
Name of the ML Repo
save_model_metadata()
FQN
job.deploy()
python deploy.py
After deploying the training job, go to the "Jobs" sub-section in the "Deployments" section, it should look similar to this:
Now that we are done with deploying our job, we will want to trigger it. You can do so using either our Python SDK or the TrueFoundry Dashboard. Firstly, we will talk about triggering jobs from TrueFoundry Dashboard. To know about other methods of triggering jobs, refer to the Triggering Jobs from Python SDK section.
Triggering Jobs from Python SDK
After the above training job has finished deploying, go to the "Jobs" sub-section in the "Deployments" section, click on the "diabetes-train-job", and click on the "Run Job" to configure the job before triggering. It should look similar to this:
When at the above screen, click on "Run Job" at the bottom right corner to trigger this job. After the training job has finished running, go to the "Jobs" sub-section in the "Deployments" section, and click on the "diabetes-train-job", it should look similar to this:
Under "Run Details", click on "SVR-with-QT" to see key metrics and hyperparameters that were logged in the train.py file. It should look similar to this:
"SVR-with-QT"
Imagine that you have large-scale data processing or batch processing tasks where running a single job with different configurations is essential to not only streamline your workflow but also ensure consistency in job execution. In such cases, a Parameterized job will come in handy.
A Parameterized Job is a type of job that allows you to create multiple instances (pods) with different parameters or inputs. The primary goal of a Parameterized Job is to provide flexibility in job execution by customizing its behavior for different scenarios.
As an example, a job with command as python main.py --n_quantiles {{n_quantiles}} is a parameterized job as it takes n_quantiles as input before running. We can simplify the job deployed above using parameters.
python main.py --n_quantiles {{n_quantiles}}
n_quantiles
To parse the command-line arguments, we will use the argparse module. The following code shows the updated code for the train.py and deploy.py files, where the default values of kernel and n_quantiles are linear and 100 respectively:
argparse
kernel
linear
100
import os, argparsedef train(kernel, n_quantiles): ...def log_model(regressor, model, X_test, y_test, ml_repo): ...parser = argparse.ArgumentParser()parser.add_argument("--kernel", default="linear", type=str)parser.add_argument("--n_quantiles", default=100, type=int)args = parser.parse_args()regressor, model, X_test, y_test = train(kernel=args.kernel, n_quantiles=args.n_quantiles)log_model(regressor, model, X_test, y_test, ml_repo=os.environ.get("ML_REPO_NAME"))
import argparsefrom servicefoundry import Build, Job, PythonBuild, Param, LocalSourceparser = argparse.ArgumentParser()parser.add_argument("--workspace_fqn", type=str, required=True)parser.add_argument("--ml_repo", type=str, required=True)args = parser.parse_args()cmd = "python train.py --kernel {{kernel}} --n_quantiles {{n_quantiles}}"# Defining the job specifications# Only the command changes in 'image' attributejob = Job( ... image=Build(build_spec=PythonBuild(command=cmd, ...), ...), params=[ Param(name="n_quantiles", default='100'), Param(name="kernel", default='linear', description="svm kernel"), ], env={ "ML_REPO_NAME": args.ml_repo })deployment = job.deploy(workspace_fqn=args.workspace_fqn)
NOTE: Make sure you replace "YOUR ML REPO NAME" with your ML Repo Name and "YOUR WORKSPACE FQN HERE" with your workspace FQN in the below command.
Now execute the following command to deploy the parameterized job:
python deploy.py --workspace_fqn "YOUR WORKSPACE FQN HERE" --ml_repo "YOUR ML REPO NAME"
Alternatively, you can execute directly from our Github repository.
git clone https://github.com/truefoundry/truefoundry-examples.gitcd training-job-examplepython deploy.py --workspace_fqn "YOUR WORKSPACE FQN HERE" --ml_repo "YOUR ML REPO NAME"
Version 2 of the Job will be created after deployment has finished. After the training job has finished deploying, the next step is to trigger this job.
Click on the "diabetes-train-job", and click on the "Run Job" to configure the job before triggering. You can now change the n_quantiles and kernel parameters. It should look similar to this:
Try triggering job runs with different values of the kernel parameter like linear, sigmoid, poly and rbf. Similarly, you can use different values of n_quantiles parameters like 50, 80, 100, etc. Various job runs should look similar to this:
sigmoid
poly
rbf
You can compare the metrics of different job runs in the dashboard by clicking on compare button in the top right section of the dashboard as shown below:
You can read more about Parameterized Job Deployment here:
So far we have only seen triggering job runs from the TrueFoundry Dashboard. Now it is possible that triggering a job isn't always desirable via UI, so we will now go over how to trigger a job programmatically via Python SDK.
You can trigger your job programmatically by using the trigger_job function as shown below:
trigger_job
from servicefoundry import Job, trigger_job# Configure a Job Deploymentjob = Job(...)# Deploy a Jobjob_deployment = job.deploy(workspace_fqn="YOUR WORKSPACE FQN")# Trigger/Run a Jobtrigger_job( application_fqn=job_deployment.application_fqn, params={"n_quantiles":"80"})
It is also possible to get the application_fqn easily from the dashboard by going to Deployments --> Jobs --> Look for your job name in your workspace (here diabetes-train-job).
application_fqn
Below is another example of triggering a job programmatically. Firstly, replace YOUR_APPLICATION_FQN with the Application FQN of the job deployed above in the code shown below. The Application FQN for an application is highlighted below:
YOUR_APPLICATION_FQN
In the code shown below, we are randomly choosing a value for model parameters and triggering a job run using those parameters. After that, we search through job runs to find the run with the max score. The job run with the maximum score indicates a more optimal choice of model parameters.
score
import randomimport mlfoundry as mlffrom servicefoundry import trigger_job# Find deployed job, replace with the application fqn of your jobapplication_fqn = "YOUR_APPLICATION_FQN"# Randomly generate paramsn_quantiles = random.randint(50, 100)kernel_values = ['linear', 'sigmoid', 'poly', 'rbf']kernel = kernel_values[random.randrange(0, len(kernel_values))]# Trigger/Run a Jobtriggered_job = trigger_job( application_fqn=application_fqn, params={ "n_quantiles": str(n_quantiles), "kernel": kernel })print(f'Triggered job run with n_quantiles={n_quantiles} and kernel={kernel} and name as', triggered_job.jobRunName)client = mlf.get_client()ml_repo_name = "YOUR ML REPO NAME HERE"runs = client.search_runs(ml_repo=ml_repo_name)max_score = 0for run in runs: metrics = run.get_metrics() print(f'All Metrics for run with name as {run.run_name}":', metrics) if 'score' in metrics: max_score = max(max_score, metrics['score'][0].value)print("Model Max Score: ", max_score)
NOTE: Make sure you replace "YOUR ML REPO NAME HERE" with your ML Repo Name and "YOUR WORKSPACE FQN HERE" with your workspace FQN in the below command.
You can read more about Triggering Jobs here:
You can compare the metrics of different job runs programmatically by using mlfoundry.search_runs function as described in the following code:
mlfoundry.search_runs
import mlfoundry as mlfclient = mlf.get_client()ml_repo_name = "YOUR-ML-REPO-NAME"# Returns all runsruns = client.search_runs(ml_repo=ml_repo_name)# Search for the subset of runs with logged accuracy metric > 0.7filter_string = "metrics.score > 0.7"runs = client.search_runs(ml_repo=ml_repo_name, filter_string=filter_string)
You can read more about the search_runs function here:
search_runs
Imagine dealing with large-scale models which have millions or billions of parameters. Training such models using traditional CPUs would be very time-consuming and may even be infeasible due to memory constraints.
For example, training a CNN model based on the CIFAR-10 dataset on a CPU environment with 10 epochs takes 36 minutes and 31 seconds, but the same model when trained on a GPU (NVIDIA K80) environment took only 4 minutes and 6 seconds. That is an improvement by a factor of 9x (Refer here)
GPUs play a vital role in large-scale model training. They provide the necessary memory bandwidth and parallel processing capabilities to handle large model sizes and complex architectures efficiently. Now we will go over how to use a GPU in a Job.
Using a GPU in the above example requires slight modifications in how a job is configured in the deploy.py file. The code below shows the updated Job Configuration for GPU utilization and Custom CPU and Memory resource allocation:
from servicefoundry import Job, NodeSelector, GPUType, Resourcesjob = Job( resources=Resources( # Configure GPU gpu_count=1, node=NodeSelector(gpu_type=GPUType.T4) # (Optional) Configure CPU and Memory Resources cpu_request=0.2, cpu_limit=0.5, memory_request=128, memory_limit=512, ), ...)
Note: Rest of the code remains unchanged
So far, we have seen several Job Deployment options such as image, params, and env. There are several ways to customize your job with advanced options. Some of those are as follows:
image
params
env
Now that we have discussed Jobs that can be triggered manually either via TrueFoundry Dashboard or Python SDK. But what if we want a Job to run on a schedule (like a cron job)?
A cron job runs the defined job on a repeating schedule. This can be useful to retrain a model periodically, generate reports, and more. We can implement such jobs by changing their trigger type as follows:
from servicefoundry import Job, Schedulejob = Job( trigger=Schedule( schedule="0 8 1 * *", concurrency_policy="Forbid" # Values: ["Forbid","Allow", "Replace"] ), concurrency_limit=3, ...)
For cron jobs, it is possible that the previous run of the job hasn't been completed while it is already time for the job to run again because of the scheduled time. In such cases, we may define concurrency_policy as follows:
Forbid
Allow
Replace
Concurrency doesn't apply to manually triggered jobs. In that case, it always creates a new job run.
Job status can be of 3 types that are FINISHED, TERMINATED, and FAILED. A job can be configured to retry several times on failure.
A Job is marked as FAILED if it does not successfully finish even after the configured number of retries. Retries can be configured for a job like this:
from servicefoundry import Jobjob = Job( retries=6, # default = 1 ...)
In some use cases, you may need to specify what maximum amount of time you want a job to continue executing.
Using timeout, you can specify (in seconds) the maximum amount of time for a job to run, whether it has failed or not. This will take precedence over the retries Limit. By default, this is set to 1000 seconds.
retries
1000
For example, if you set theretries to 6 and a timeout of 480 seconds, the job will terminate after 480 seconds regardless of how many times it attempted to run.
6
timeout
480
from servicefoundry import Jobjob = Job( timeout=480, ...)
Apart from the Job Deployment Options discussed in this blog, there are still some that we won't be discussing in this blog like:
.... and a few more. You may refer to our Documentation given below to know the answers to the above questions :)
Our public repository truefoundry-examples contains this blog's job source code and it also includes several examples including LLM Finetuning, Getting Started with Notebooks, End-to-End Examples giving broader exposure to features offered by TrueFoundry Platform.
In summary, TrueFoundry's Job provides a powerful framework for managing and executing training tasks in a scalable, fault-tolerant, and resource-efficient manner.
They enable you to distribute and control the execution of machine learning workloads, monitor their progress, and ensure that your training models are trained effectively and reliably, and this makes them ideal for executing one-time or on-demand tasks.
Join AI/ML leaders for the latest on product, community, and GenAI developments