In this article, we discuss deploying the Falcon model on your own cloud.
The Technology Innovation Institute in Abu Dhabi has developed Falcon, an innovative series of language models. These models, released under the Apache 2.0 license, represent a significant advancement in the field. Notably, Falcon-40B stands out as a truly open model, surpassing numerous closed-source models in its capabilities. This development brings tremendous opportunities for professionals, enthusiasts, and the industry as it paves the way for various exciting applications.
In this blog post, we will walk through the LLMOps process on SageMaker: hosting the Falcon model in your own AWS account and the different options available. We also plan to release another blog post in the future focusing on running Falcon on other clouds.
We have also written another blog on deploying Llama 2 on your own cloud. Check it out below:
Moving on, the Falcon family has two base models: Falcon-40B and Falcon-7B. The 40B-parameter model currently tops the charts of the Open LLM Leaderboard, while the 7B model is the best in its weight class. We will be discussing the options for deploying the Falcon-40B model.
Falcon-40B requires ~90GB of GPU memory - roughly 80GB just for the 40B parameters in bfloat16 (2 bytes each), plus overhead for activations and the KV cache - so this will not fit in a single A100 with 80GB of GPU memory. An instance type that works in AWS is g5.12xlarge, which has 4x A10G GPUs with 96GB of GPU memory in total (https://aws.amazon.com/ec2/instance-types/g5/). We can deploy the model either as an API endpoint for real-time inference or load it in the code itself for batch inference use cases.
The code to load the model and run a text-generation task on it is as follows:
# pip install "transformers[tokenizers]>=4.29.2,<5.0.0" \
#   "sentencepiece==0.1.99" "accelerate>=0.19.0,<1.0.0" \
#   "safetensors>=0.3.1,<0.4.0"
import torch
from transformers import pipeline

# Load Falcon-40B-Instruct in bfloat16 and spread it across the available GPUs
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-40b-instruct",
    tokenizer="tiiuae/falcon-40b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="balanced_low_0",
)

output = generator(
    "Explain to me the difference between nuclear fission and fusion.",
    min_new_tokens=30,
    max_new_tokens=50,
)
print(output)
Python code to load Falcon-40B in the notebook
We can deploy the model as an endpoint on AWS SageMaker, on an EKS cluster, or on a plain EC2 machine. To deploy the model on SageMaker, you can follow this tutorial: https://aws.amazon.com/blogs/machine-learning/deploy-falcon-40b-with-large-model-inference-dlcs-on-amazon-sagemaker/.
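For a rough sense of what a SageMaker deployment looks like, here is a minimal sketch using the Hugging Face LLM (TGI) serving container via the sagemaker Python SDK. Note that the linked tutorial uses the large model inference (LMI) DLC instead; the container version, environment variables, and generation parameters below are illustrative assumptions, so refer to the tutorial for the exact configuration.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hugging Face text-generation-inference (TGI) serving container
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
        "SM_NUM_GPUS": "4",            # g5.12xlarge has 4 A10G GPUs
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=900,  # large model weights take a while to load
)

print(predictor.predict({
    "inputs": "Explain to me the difference between nuclear fission and fusion.",
    "parameters": {"max_new_tokens": 50},
}))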
To deploy the model on EKS, we need to bring up an EKS cluster, set up a GPU node pool and the NVIDIA GPU operator on it, and add an ingress layer so we can hit the API endpoint; a rough sketch of the core deployment step is shown below. TrueFoundry can make this entire journey much simpler by turning model deployment into a one-click process.
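To illustrate what the manual path involves (this is not TrueFoundry's implementation), here is a sketch that creates a GPU-backed Deployment for Falcon-40B using the kubernetes Python client. It assumes the cluster already has GPU nodes and the NVIDIA device plugin, and that the model is served with the Hugging Face text-generation-inference container; the image tag, names, and labels are illustrative.

from kubernetes import client, config

config.load_kube_config()  # assumes kubeconfig points at the EKS cluster

container = client.V1Container(
    name="falcon-40b",
    image="ghcr.io/huggingface/text-generation-inference:1.0.3",  # illustrative tag
    args=["--model-id", "tiiuae/falcon-40b-instruct", "--num-shard", "4"],
    ports=[client.V1ContainerPort(container_port=80)],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "4"}),
)

template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "falcon-40b"}),
    spec=client.V1PodSpec(
        containers=[container],
        # schedule onto the GPU node pool (g5.12xlarge nodes)
        node_selector={"node.kubernetes.io/instance-type": "g5.12xlarge"},
    ),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="falcon-40b"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "falcon-40b"}),
        template=template,
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
# A Service and an Ingress on top of this Deployment would then expose the API endpoint.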
Let's dive into the cost of running the Falcon LLM on your own AWS account. We will compare the cost of running it on SageMaker vs. TrueFoundry.
Cost of a SageMaker instance (ml.g5.12xlarge) per hour (us-east-1): $7.09
We ran a quick benchmark to compute the request throughput and latency for the Falcon model on SageMaker JumpStart. The exact numbers will vary based on your prompt lengths and request concurrency, but this should provide a rough idea:
As we can see in the graphs above, the p50 latency is around 5.7 seconds and p90 is around 9.4 seconds. We are able to get a throughput of around 6-7 requests per second.
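For context, these numbers came from a simple load test. Below is a minimal sketch of this kind of benchmark (not the exact script we used): it fires concurrent requests at a SageMaker endpoint via boto3 and reports p50/p90 latency and throughput. The endpoint name, payload shape, and concurrency are assumptions you would adapt to your own deployment.

import json
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

import boto3

ENDPOINT_NAME = "falcon-40b-instruct-endpoint"  # hypothetical endpoint name
CONCURRENCY = 32
TOTAL_REQUESTS = 200

runtime = boto3.client("sagemaker-runtime")
payload = json.dumps({
    "inputs": "Explain to me the difference between nuclear fission and fusion.",
    "parameters": {"max_new_tokens": 50},
})

def one_request(_):
    # Time a single synchronous invocation of the endpoint
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=payload,
    )
    return time.perf_counter() - start

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(TOTAL_REQUESTS)))
wall = time.perf_counter() - wall_start

deciles = statistics.quantiles(latencies, n=10)
print(f"p50 latency: {statistics.median(latencies):.2f}s")
print(f"p90 latency: {deciles[8]:.2f}s")  # 9th decile cut point ~= p90
print(f"throughput: {TOTAL_REQUESTS / wall:.2f} req/s")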
TrueFoundry deploys the model on EKS, where we can utilize spot and on-demand instances to significantly reduce the cost. Let's compare the per-hour on-demand, spot, and reserved pricing of a g5.12xlarge machine in the us-east-1 region.
On-Demand: $5.672 (20% cheaper than SageMaker)
Spot: $2.076 (70% cheaper than SageMaker)
1-Yr Reserved: $3.573 (50% cheaper than SageMaker)
3-Yr Reserved: $2.450 (65% cheaper than SageMaker)
Let's compare the throughput and latency of the model deployed on EKS using TrueFoundry.
As we can see from the stats above, the p50 latency is 5.8 seconds and the p90 is 9.5 seconds, with a throughput of around 6-7 requests per second, which is essentially on par with the SageMaker deployment.
Let's try to estimate the cost of hosting the Falcon model for an actual use case with live traffic. Suppose we get 100K requests a day and hit the Falcon model for every single request. 100K requests a day works out to roughly 1.2 requests per second, and each g5.12xlarge instance can handle about 6 requests per second, so 1 instance should be enough to serve this traffic. However, for reliability reasons, we will want to run at least 2 instances. Let's compare the cost of running the 2 instances:
SageMaker: $7.1 * 2 ($ per hour) ≈ $10,000 a month
EKS using spot instances: ~$2 * 2 ($ per hour) ≈ $2,880 a month
EKS using on-demand instances: ~$5.67 * 2 ($ per hour) ≈ $8,000 a month
We can also use a combination of 1 spot and 1 on-demand instance to reduce the cost by around 40% and also achieve a high level of reliability.
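As a sanity check on the numbers above, here is a small back-of-the-envelope calculation comparing monthly costs, assuming 2 instances running ~720 hours per month and using the us-east-1 hourly rates quoted earlier (the results come out slightly higher than the rounded figures in the text):

HOURS_PER_MONTH = 24 * 30  # ~720 hours
INSTANCES = 2

hourly_rates = {
    "SageMaker (ml.g5.12xlarge)": 7.09,
    "EKS on-demand (g5.12xlarge)": 5.672,
    "EKS spot (g5.12xlarge)": 2.076,
    "EKS 1 spot + 1 on-demand": (2.076 + 5.672) / 2,  # average rate across the 2 instances
}

for setup, rate in hourly_rates.items():
    monthly = rate * INSTANCES * HOURS_PER_MONTH
    print(f"{setup}: ~${monthly:,.0f} / month")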
If you're looking to maximise the returns from your LLM projects and empower your business to leverage AI the right way, we would love to chat and exchange notes.