A Guide to LLM Gateways

August 8, 2024

Introduction

An LLM gateway acts as a centralized interface that simplifies the complexities associated with accessing multiple LLM providers. By providing a unified API, it allows developers to interact with various models without needing to navigate the intricacies of each provider's specific requirements. 

Moreover, LLM gateways play a critical role in enhancing security and compliance. They manage authentication, rate limiting, and data governance, ensuring that sensitive information is protected and that interactions adhere to regulatory standards. This is particularly important for industries dealing with personal data or operating under strict compliance frameworks, as it mitigates risks associated with data breaches and misuse.

In addition to security, LLM gateways optimize performance through features like load balancing and caching, which help manage traffic and reduce latency. By distributing requests across multiple models and caching frequent responses, these gateways ensure that applications remain responsive even under high demand.

Gateway Architecture

Features of an LLM Gateway

  1. Unified Access: LLM gateways provide a single, unified interface for accessing multiple LLMs, simplifying interactions and reducing the need to manage various APIs or services separately.
  2. Authentication and Authorization
    • Centralized Key Management: Distributing root API keys can be risky; the AI Gateway centralizes key management, giving each developer their own API key while keeping root keys secure and accountable through integration with secret managers like AWS SSM, Google Secret Manager, or Azure Key Vault.
    • Role-Based Access Control (RBAC): LLM gateways enforce security policies through RBAC, ensuring that only authorized users have access to certain models or functionalities.
  3. Performance Monitoring: Continuous monitoring of model performance, including latency, error rates, and throughput, allows organizations to ensure that models are operating as expected and to identify issues early.
  4. Usage Analytics: Detailed analytics on how models are being used, who is using them, and in what context help organizations optimize resource allocation and understand the impact of their AI initiatives.
  5. Cost Management: Features that track the cost associated with running LLMs help organizations manage their budgets effectively, providing insights into how resources are consumed and where savings can be made.
  6. API Integrations: LLM gateways support integrations with various APIs, allowing them to interact seamlessly with other systems, such as data lakes, databases, or other AI/ML tools (e.g., guardrails or evaluation frameworks), enabling a more connected AI ecosystem.
  7. Custom Model Support: Beyond standard models, LLM gateways allow users to deploy and manage custom models tailored to their specific needs, offering flexibility in AI applications.
  8. Caching: Improves speed and reduces costs by storing and reusing responses for identical or semantically similar requests.
  9. Routing
    • Fallback: Provides continuous service by automatically switching to backup models if the primary one encounters issues, ensuring consistent application performance.
    • Automatic retries: Increases request success rates by retrying failed attempts, reducing the impact of temporary disruptions or errors.
    • Rate Limiting: Restricts the number of requests a user or application can make to an LLM service within a specified time frame. This is essential for maintaining service stability while managing costs.
    • Load Balancing: Distributes requests across multiple LLM providers or models to optimize performance, availability, and cost. 

How to Evaluate an LLM Gateway

Learn how to evaluate an LLM Gateway by assessing its features for authentication, model selection, usage analytics, cost management, and more.

Authentication and Authorization

The LLM Gateway should include centralized key management to securely store and manage API keys, assigning individual keys to each developer or product for accountability while safeguarding root keys.

It should also integrate with secret managers like AWS SSM, Google Secret Manager, or Azure Key Vault for enhanced security and streamlined key management.

TrueFoundry’s LLM Gateway provides fine-grained access control over all models, whether third-party (called through their respective APIs) or self-hosted, through a single admin interface. We ensure that admins do not have to share third-party (e.g., OpenAI) API keys with users, safeguarding against leaks.

We have a concept of ProviderAccounts through which any self-hosted or third-party LLM can be integrated. Once set up, the admins can provide or restrict access to any user or application to any of the integrated models.

The authorization configuration is saved as YAML, which can also be tracked in Git for auditing.
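
The exact schema of this file is not documented here, so the following is only a hypothetical sketch of what such a Git-tracked authorization config could look like; all field names and values are illustrative assumptions, not TrueFoundry's actual format:

```yaml
# Hypothetical schema -- field names are illustrative, not TrueFoundry's actual format.
provider_accounts:
  - name: openai-main            # a ProviderAccount wrapping a third-party key
    provider: openai
    models: [gpt-4o, gpt-4o-mini]
access_rules:
  - subject: team:data-science   # a user, team, or application
    models: [openai-main/gpt-4o]
    permission: invoke
```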

Unified API and Code Generation

The Unified API in the Gateway should offer a standardized interface for accessing and interacting with language models from various providers. It allows for seamless switching between models and providers without requiring changes to your application's code structure. By abstracting underlying complexities, the Unified API simplifies multi-model integration and maintains consistency in access and usage. Additionally, it adheres to the OpenAI request-response format, ensuring compatibility with popular Python libraries like OpenAI and LangChain.

Key Features of the Unified API:

  • Standardization: Uniform requests and responses across models simplify management and integration.
  • Flexibility: Switch between models and providers effortlessly without modifying core application code.
  • Efficiency: Use a single API to interact with multiple models, reducing the need for multiple integrations.
  • Compatibility: Adheres to the OpenAI format, ensuring smooth integration with Python libraries such as OpenAI and LangChain.

TrueFoundry’s Gateway provides automated code generation for integrating language models using various languages and libraries. You can call any model from any provider using standardized code through the Gateway.
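
Because a unified API adheres to the OpenAI request-response format, calls can go through the standard OpenAI Python client pointed at the gateway. The base URL, API key, and provider-prefixed model name below are placeholders, so treat this as a minimal sketch rather than an exact endpoint:

```python
from openai import OpenAI

# Hypothetical gateway endpoint and key; substitute your own deployment's values.
client = OpenAI(
    base_url="https://your-gateway.example.com/api/llm/v1",
    api_key="your-gateway-api-key",
)

# The same OpenAI-format call works regardless of which provider hosts the model;
# switching providers means changing only the model string, not the code structure.
response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # provider-prefixed model name (hypothetical)
    messages=[{"role": "user", "content": "Summarize what an LLM gateway does."}],
)
print(response.choices[0].message.content)
```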

Model Selection

LLM gateways facilitate seamless connections to third-party models hosted on platforms like AWS Bedrock, Azure OpenAI, and others. In addition to third-party models, LLM gateways should support the integration of self-hosted models that organizations may develop or fine-tune for specific applications.

TrueFoundry, by design, can provide access to any open-source or commercially available LLM and is not restricted to any particular model, provider, or set of open-source models (documentation covers adding any model to the gateway). This can be done through the following three routes:

  1. 3rd Party Integrations - For any commercially available LLMs.
  2. Self Hosted models - For custom/fine-tuned models, any open source models.
  3. TrueFoundry public models - Popular and latest Open Source Models

3rd Party Model Integrations

Integrate with any model provider to give access to commercial LLMs. Cost is the same as that charged by the model provider. Some existing integrations (not limited to):

  1. AWS Bedrock
  2. Anthropic
  3. Vertex AI
  4. Azure OpenAI
  5. Cohere
  6. AI21
  7. Anyscale
  8. DeepInfra
  9. Groq
  10. Mistral AI
  11. Nomic
  12. Ollama
  13. PaLM
  14. Perplexity AI
  15. Together AI

Users can make use of any LLM offered by these providers, as well as by providers not present in this list.

Self-Hosting Open Source Models

Users can deploy any open-source LLM on their own cloud; this is not restricted to any particular set of models. We provide direct integration with HuggingFace so that any model from the HuggingFace Model Hub can be deployed and added to the gateway in a few clicks, ready for use through the gateway.

Additional documentation covers deploying any self-hosted model. Any open-source model can be deployed through this route; many can be found at https://huggingface.co/models

In addition, custom-built, pre-trained, or fine-tuned models can also be deployed and served through the LLM gateway via this route.

TrueFoundry provides a one-click ‘Add to Gateway’ feature for all self-hosted models, including fine-tuned ones.

TrueFoundry Hosted Open Source Models

This route provides access to popular open-source LLMs through models hosted by TrueFoundry and shared across multiple TrueFoundry clients.

Most of the latest and most popular models are available through this route, for example:

  1. Llama 3.1 (all sizes)
  2. Llama 3
  3. Llama 2
  4. Vicuna
  5. CodeLlama
  6. DeepSeek-Coder
  7. Embedding models
  8. SOLAR
  9. NSQL Llama 2
  10. Mixtral
  11. Nous Hermes Llama 2
  12. Stable Diffusion 2.1

And 100+ more models.

Performance Monitoring

LLM gateways should collect a wide range of performance metrics, including:

  • Response times: Measure the latency of LLM responses to identify slow-performing models or queries.
  • Throughput: Track the number of requests processed per second to monitor capacity utilization.
  • Error rates: Monitor the frequency of errors or timeouts to detect potential issues with LLM providers.
  • Resource utilization: Collect data on CPU, memory, and network usage to identify performance bottlenecks.

TrueFoundry captures performance monitoring metrics such as those mentioned below and provides intuitive dashboards and reporting tools to visualize performance data:

  1. Rate of Tokens
  2. Rate of Inference
  3. Request Latency
  4. Rate of Inference Failure

Usage Analytics

An LLM Gateway should provide comprehensive usage analytics to monitor and manage interactions with LLMs effectively. This ensures that organizations can track performance, optimize resource allocation, and maintain control over model usage.

  1. Request and Response Tracking: Capture detailed logs of API requests and responses, including timestamps, model names, endpoints accessed, and user details.
  2. Token Usage: Track the number of tokens consumed to understand AI usage patterns and costs.

TrueFoundry captures usage analytics metrics such as the ones mentioned below:

  • Models Invoked
  • Cost Incurred
  • Total Input Tokens
  • Total Output Tokens
  • Rate of Tokens

Cost Management

  1. Cost Logging: The gateway should track and log the costs associated with all model usage, including both self-hosted and third-party models.
  2. Cost Monitoring Dashboards: Provide integrated dashboards that offer real-time insights into cost metrics and usage patterns. 
  3. Budget Controls: Implement budget controls that allow organizations to set spending limits for various models or projects.

TrueFoundry logs the cost of all self-hosted and third-party models used by its users. The platform offers the ability to rate-limit access to these models at a granular level, including by model, user, provider, project, and team.

Metrics can be exported to any preferred dashboard, with an integrated dashboard available for monitoring within the platform. Alerts can be configured based on this data, and administrators can receive notifications through their chosen channel (such as email or Slack) depending on the alerting tool in use.
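
As a rough illustration of cost logging and budget controls, here is a minimal Python sketch. The prices, project names, and budget values are hypothetical, and a real gateway would persist this state and meter usage server-side rather than keeping it in memory:

```python
from collections import defaultdict

# Illustrative per-1K-token prices; real prices vary by provider and model.
PRICE_PER_1K = {"gpt-4o": {"input": 0.005, "output": 0.015}}

# Hypothetical monthly budgets (in dollars) per project.
BUDGETS = {"support-bot": 100.0}

spend = defaultdict(float)  # running spend per project

def record_usage(project, model, input_tokens, output_tokens):
    """Log the cost of one call and enforce a simple per-project budget."""
    price = PRICE_PER_1K[model]
    cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
    spend[project] += cost
    if spend[project] > BUDGETS.get(project, float("inf")):
        raise RuntimeError(f"Project {project!r} exceeded its budget")
    return cost
```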


Advanced Features

Model Caching

An LLM Gateway should ensure effective model caching through the following features:

  1. Response Caching: Implement a caching mechanism to store responses from language models. By saving previous responses, the gateway can quickly return cached results for repeated or identical requests, reducing the need for redundant model invocations and improving overall response times.
  2. Cache Modes: Support multiple caching modes to accommodate different use cases:
    • Exact Match Caching: Cache responses for requests that match exactly, ideal for identical prompts and frequent queries.
    • Semantic Caching: Cache responses based on semantic similarity to handle variations in phrasing or context, useful for requests with slight differences in wording.
  3. Configurable Cache Expiry: Allow for customizable cache expiration policies. This ensures that cached data remains relevant and up-to-date, while also freeing up resources by clearing outdated or stale responses.
  4. Cache Invalidation: Provide mechanisms to invalidate or refresh cache entries as needed. This is important for scenarios where updated or new information needs to be retrieved from the model, ensuring that the gateway serves the most current responses.
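
To make these modes concrete, here is a minimal Python sketch of a cache supporting exact-match lookup, semantic lookup via embedding similarity, TTL-based expiry, and invalidation. The embed callable, similarity threshold, and data layout are illustrative assumptions, not any gateway's actual implementation:

```python
import hashlib
import time

import numpy as np

class GatewayCache:
    """Toy response cache illustrating exact-match and semantic modes."""

    def __init__(self, embed, ttl_seconds=3600, similarity_threshold=0.95):
        self.embed = embed          # callable: str -> np.ndarray (any embedding model)
        self.ttl = ttl_seconds      # configurable cache expiry
        self.threshold = similarity_threshold
        self.entries = []           # (hash key, embedding, response, timestamp)

    def get(self, prompt):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        query_vec = self.embed(prompt)
        for entry_key, vec, response, ts in self.entries:
            if time.time() - ts > self.ttl:
                continue            # expired entry: skip it
            if entry_key == key:
                return response     # exact-match hit
            # Semantic hit: cosine similarity above the threshold.
            sim = float(np.dot(query_vec, vec) /
                        (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
            if sim >= self.threshold:
                return response
        return None                 # miss: caller invokes the model and calls put()

    def put(self, prompt, response):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        self.entries.append((key, self.embed(prompt), response, time.time()))

    def invalidate(self):
        self.entries.clear()        # cache invalidation: drop all entries
```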

Routing

Fallback: LLM Gateways need fallback capabilities to maintain uninterrupted service and application performance. When the primary model encounters issues or fails, the gateway can automatically switch to backup models, ensuring that users continue to receive responses without experiencing downtime or degradation in service quality.

Automatic retries: Automatic retries are crucial for improving request success rates by addressing temporary disruptions or errors. If a request fails due to transient issues, the gateway will automatically attempt to resend it, minimizing the impact of brief service interruptions and enhancing the reliability of the system.

Rate Limiting Support: Rate limiting helps manage the volume of requests sent to an LLM service, preventing overload and maintaining service stability. By restricting the number of requests within a specified timeframe, the gateway ensures fair usage, prevents abuse, and controls costs associated with high usage, thereby contributing to better resource management.

Load Balancing: Load balancing is essential for optimizing the distribution of requests across multiple LLM providers or models. By evenly distributing the load, the gateway enhances performance, increases availability, and helps manage costs effectively. It ensures that no single provider or model is overwhelmed, leading to more reliable and efficient service.
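
The sketch below shows how fallback and automatic retries might compose on the client side of an OpenAI-compatible gateway; the endpoint, model names, backoff policy, and broad exception handling are illustrative only (production code should catch only retryable errors):

```python
import time

from openai import OpenAI

# Placeholder endpoint, key, and model names; substitute your deployment's values.
client = OpenAI(base_url="https://your-gateway.example.com/api/llm/v1", api_key="...")

def complete_with_fallback(messages, models=("primary-model", "backup-model"),
                           max_retries=3, backoff_seconds=1.0):
    """Try each model in order; retry transient failures with exponential backoff."""
    last_error = None
    for model in models:                    # fallback: move on to the next model
        for attempt in range(max_retries):  # automatic retries per model
            try:
                return client.chat.completions.create(model=model, messages=messages)
            except Exception as exc:        # in practice, catch only retryable errors
                last_error = exc
                time.sleep(backoff_seconds * 2 ** attempt)
    raise last_error
```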

Tool Calling

Tool calling allows LLMs to perform specific tasks beyond their core natural language processing abilities. By integrating with external tools and APIs, LLMs can access real-time data, execute custom functions, and extend their utility to a wide range of applications.

Tool calling within the TrueFoundry LLM Gateway allows language models to simulate interactions with external functions. While the gateway does not execute calls to external tools directly, it enables users to describe the tools and simulate the call within the response. This simulation provides a comprehensive representation of the request and the expected response, helping developers understand how the language model would interact with external systems.
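
Here is a sketch of describing a tool in the OpenAI function-calling format through an OpenAI-compatible gateway; the endpoint, model name, and get_weather tool are hypothetical:

```python
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example.com/api/llm/v1", api_key="...")

# Describe a tool in the OpenAI function-calling format. The gateway forwards the
# schema; the model returns its proposed call without anything being executed.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# The response carries the function name and JSON arguments the model would call.
print(response.choices[0].message.tool_calls)
```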

Multimodal

Multimodal support in LLM Gateways is essential for applications that need to process and integrate multiple types of data simultaneously. For instance, a customer support application leveraging multimodal capabilities can handle text descriptions and images in a single support ticket, providing more accurate responses by analyzing both modalities.
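
In the OpenAI-compatible format, such a request mixes content parts in a single message. The sketch below assumes a vision-capable model behind the gateway; the endpoint, model name, and image URL are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example.com/api/llm/v1", api_key="...")

# One request combining text and an image, e.g., a support ticket with a photo.
response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder for any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is broken in this product photo?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/ticket-photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```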

API integrations

By connecting with a wide range of tools, gateways can enhance the functionality, security, and performance of AI applications.

Monitoring and Observability

By integrating with monitoring tools like Prometheus and Grafana, LLM gateways can track key performance metrics in real-time.
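
As an illustration, a gateway could export such metrics with the prometheus_client library; the metric names and wrapper below are an assumed design, not TrueFoundry's actual instrumentation:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names for request counts and latency, labeled by model.
REQUESTS = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
LATENCY = Histogram("llm_request_latency_seconds", "Request latency", ["model"])

def instrumented_call(model, fn, *args, **kwargs):
    """Wrap any model call so Prometheus can scrape latency and error rates."""
    with LATENCY.labels(model=model).time():   # records elapsed time on exit
        try:
            result = fn(*args, **kwargs)
            REQUESTS.labels(model=model, status="success").inc()
            return result
        except Exception:
            REQUESTS.labels(model=model, status="error").inc()
            raise

start_http_server(9090)  # exposes /metrics for Prometheus to scrape
```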

Guardrails and Safety

Integrating with guardrail tools like Guardrails AI and Nemo Guardrails enables LLM gateways to implement safety measures around LLM interactions. These integrations help filter out inappropriate or harmful content, ensuring that model outputs align with organizational policies and user expectations.

Evaluation and Validation

Tools like Arize AI and MLflow allow LLM gateways to continuously evaluate the performance and accuracy of their models. By integrating with these frameworks, gateways can track key metrics such as response quality, relevance, and user satisfaction.

Discover more about TrueFoundry's Gateway and its advanced features by reaching out to us. We can schedule a personalized demo to showcase its capabilities.

