In the following blog we are going to talk about Label Studio and how you can easily utilize Label Studio for Labelling via deploying it on TrueFoundry
Quality of data is much more important than quantity. This is a well-established principle in the machine learning community, and it is becoming increasingly important as the MLOps community moves towards a data-centric approach.
In a recent talk, Andrew Ng cited the example of a steel defect detection problem, where the data-centric approach was able to improve the model's accuracy by 16%, while the model-centric approach did not improve the accuracy at all.
The data-centric approach is more important than the model-centric approach, especially in cases where the data is limited or noisy - Andrew Ng
You can watch this talk by Andrew Ng regarding the Data-centric approach to Machine Learning:
This is because even the best machine learning models are only as good as the data they are trained on. If the data is noisy or incomplete, the model will learn to make mistakes. This is why it is so important to use high-quality data when training machine learning models.
Even in the context of large language models (LLMs), the importance of high-quality data remains paramount. The paper titled "Training Compute-Optimal Large Language Models" by DeepMind (2022) delves into the scaling laws for LLMs, demonstrating that data quality trumps dataset size. The authors illustrate that a model trained on a mere 500 rows of high-quality data can outperform a model trained on a dataset containing a staggering 100 million rows of low-quality data.
You can read the paper using the following link:
In a similar vein, "WizardLM: A Small, Efficient, and High-Performance Language Model" by Google AI (2022) introduces WizardLM, a compact language model trained on a dataset comprising only 500,000 words. Remarkably, WizardLM achieves state-of-the-art results across various natural language processing tasks, despite its significantly smaller size compared to other LLMs.
Label Studio is an open-source, web-based data labeling tool developed by Heartex. It's designed to streamline the process of annotating data for machine learning tasks. Whether you're working on image classification, named entity recognition, sentiment analysis, or any other labeling task, Label Studio provides a flexible and user-friendly environment to label your data efficiently.
Step 1: Data Import
The labeling process begins with the import of your raw, unlabeled data into Label Studio. This data could be images, text, audio, video, or any other data format depending on your labeling task. For our example, let's consider an image classification task where you want to categorize images of animals.
Step 2: Project Creation
Once your data is imported, you create a new labeling project within Label Studio. In this project, you define the labeling task, select the data source, and configure the labeling interface. You can customize the interface to suit your specific needs, choosing from various annotation types like bounding boxes, polygons, key points, and more.
For our animal image classification example, you'd create a project named "Animal Classification" and configure it to display images for annotation.
Step 3: Annotation
With the project set up, you invite annotators to start labeling the data. Annotators access the Label Studio platform through a web interface, where they are presented with images from your dataset. In our example, annotators would see images of animals and classify them into predefined categories like "dog," "cat," "horse," etc.
As annotators review each image, they use the labeling tools provided by Label Studio to mark their annotations. For image classification, annotators simply select the appropriate category from the predefined list.
Step 4: Quality Control
To maintain labeling accuracy and consistency, Label Studio offers features for quality control. Annotators can review and validate each other's work, ensuring that labels are correct. You can also implement consensus-building workflows to resolve disagreements among annotators.
Step 5: Data Export
Once the data is labeled to your satisfaction, you export the labeled dataset from Label Studio. The platform provides various export options, including JSON, CSV, or formats compatible with popular machine learning libraries. This labeled dataset is now ready for use in training machine learning models.
Step 6: Model Training
Using the high-quality labeled data obtained from Label Studio, you can proceed to train your machine learning model for animal classification. The labeled data serves as the training set, and the model learns to recognize different animal categories based on the annotations provided by the annotators.
Step 7: Model Evaluation and Deployment
After training the model, you evaluate its performance using a separate test dataset. If the model meets your accuracy criteria, you can deploy it for real-world use, such as automatically categorizing animal images.
You've learned all about LabelStudio, and now we're excited to share that you can effortlessly host LabelStudio on TrueFoundry. Below, you'll find step-by-step instructions on how to deploy LabelStudio on your Machine Learning setup. Let's get started!
💡
Note:Very soon you will be able to deploy LabelStudio on TrueFoundry with just one click. We will keep you posted.
Before you can harness the power of Label Studio within TrueFoundry, you need to register and set up your TrueFoundry account.
TrueFoundry is a ML Deployment PaaS over Kubernetes to speed up developer workflows while allowing them full flexibility in testing and deploying models while ensuring full security and control for the Infra team.
Go ahead and click on the link below and follow along the instructions:
Show advanced fields
Patch
spec.http.route.destinatoin.host
spec.hosts
apiVersion: networking.istio.io/v1alpha3 kind: VirtualServicemetadata: name: labelstudio-vsspec: http: - route: - destination: host: <label-studio-app-name>-ls-app.<workspace-name>.svc.cluster.local port: number: 80 hosts: - <ds.organization.com/labelstudio> gateways: - istio-system/tfy-wildcard
Once you have deployed Label Studio on TrueFoundry, you can access the Label Studio UI via the provided endpoint. Let's explore how to label images for image classification using Label Studio.
When you open your hosted Label Studio, you'll be prompted to sign up for an account or log in if you already have one. Follow these steps:
Setting up a labeling project in Label Studio involves several steps, including defining the project name, importing the dataset, and configuring labeling tasks. Let's walk through these steps:
After signing up and logging in, you can start creating your labeling projects. Follow these steps:
Project Name Tab:
Before you can begin labeling, you need to import the dataset you want to label. Label Studio supports various data types, including images, text, and audio. Here, we'll focus on importing images:
Data Import Tab:
After importing your dataset, you need to set up labeling tasks that define what you want labelers to annotate or classify in the images. Follow these steps:
Labeling Setup Tab:
Your labelling project is now configured, and labellers can begin annotating or classifying images based on the defined task.
Once your project setup is complete, you'll be redirected to your project's Data Manager. Here, you can efficiently manage your dataset and labeling tasks. Follow these steps to initiate the labeling process:
In the Data Manager, you have several powerful tools at your disposal to efficiently manage your labeling project:
To maintain high-quality labels throughout your project, consider implementing a review process:
Once you've completed the labeling process, you can export the labeled data for use in your machine learning projects or other applications.
Congratulations! You've successfully completed the labeling process and exported your labeled data for further use.
Your labelled data will be somewhat similar to this:
[ { "id": 1, "annotations": [ { "id": 1, "completed_by": 1, "result": [ { "value": { "choices": ["Dog"] }, "id": "m2SBwNysux", "from_name": "choice", "to_name": "image", "type": "choices", "origin": "manual" } ], "was_cancelled": false, "ground_truth": false, "created_at": "2023-09-14T00:51:14.067278Z", "updated_at": "2023-09-14T00:51:14.067302Z", "draft_created_at": null, "lead_time": 7.287, "prediction": {}, "result_count": 0, "unique_id": "21bce98c-ef29-4725-8add-f5d6224441ec", "import_id": null, "last_action": null, "task": 1, "project": 3, "updated_by": 1, "parent_prediction": null, "parent_annotation": null, "last_created_by": null } ], "file_upload": "c2f3a287-czNmcy1wcml2YXRlL3Jhd3BpeG_AYUTabf.pngsNjVuBb-Kdw49uxifZtlp1-3P4mypZAScmHj9-qHiDSk", "drafts": [], "predictions": [], "data": { "image": "\/data\/upload\/3\/c2f3a287-czNmcy1wcml2YXRlL3Jhd3BpeG_AYUTabf.pngsNjVuBb-Kdw49uxifZtlp1-3P4mypZAScmHj9-qHiDSk" }, "meta": {}, "created_at": "2023-09-14T00:45:00.762438Z", "updated_at": "2023-09-14T00:51:14.111241Z", "inner_id": 1, "total_annotations": 1, "cancelled_annotations": 0, "total_predictions": 0, "comment_count": 0, "unresolved_comment_count": 0, "last_comment_updated_at": null, "project": 3, "updated_by": 1, "comment_authors": [] }]
Feel free to customize your Label Studio projects and workflows based on your specific requirements, leveraging Label Studio's powerful data labeling capabilities.
Keep watching the TrueML youtube series and reading all the TrueML blog series.
TrueFoundry is a ML Deployment PaaS over Kubernetes to speed up developer workflows while allowing them full flexibility in testing and deploying models while ensuring full security and control for the Infra team. Through our platform, we enable Machine learning Teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds - allowing them to save cost and release Models to production faster, enabling real business value realisation.
Join AI/ML leaders for the latest on product, community, and GenAI developments