Recently, Amazon SageMaker AI announced the launch of a real-time inference endpoint that supports OpenAI-compatible APIs. Users can simply change the endpoint URL to call models on SageMaker AI using tools such as the OpenAI SDK, LangChain, or Strands Agents, without needing additional client customization, SigV4 wrapping, or code rewriting.
This update opens up an /openai/v1 path for SageMaker AI endpoints, which can accept chat completion requests and return responses directly, including streaming output. All endpoints and inference components that use standard SageMaker AI APIs and SDKs have been enabled with OpenAI endpoints. By changing the URL, users' existing applications can seamlessly integrate.
SageMaker AI offers rich features that allow users to build multi-step AI agent workflows on their own infrastructure, such as using Strands Agents or LangChain. Users' agents can call models using the same OpenAI interface as their original framework, while the inference process runs on their own GPU instances. Additionally, users can host multiple models on the same SageMaker AI endpoint, such as Llama for general tasks, fine-tuned Mistral models for specific domains, and small models for classification—all accessible through the same OpenAI SDK.
To use these features, users need certain prerequisites, including having an AWS account and the appropriate permissions, installing the SageMaker and OpenAI Python SDKs, and preparing models stored in Amazon S3. In addition, using the SageMaker AI OpenAI-compatible endpoint requires Bearer Token authentication, and the SageMaker Python SDK includes tools to generate tokens, simplifying the authentication process.
In practice, users can easily deploy single-model endpoints or inference component endpoints to host multiple models on a single endpoint. With the OpenAI Python SDK, users can simply call these models to obtain the required inference results. The introduction of this new feature enables seamless integration between SageMaker AI and existing AI applications, providing users with a more efficient and flexible inference solution.
Key Points:
🌟 New OpenAI-Compatible API: The real-time inference endpoints of SageMaker AI now support the OpenAI API, allowing users to call models by simply changing the URL.
🛠️ Multi-Model Hosting: Users can host multiple models on the same endpoint, accessing them using the same OpenAI SDK.
🔑 Simplified Authentication Process: Supports Bearer Token authentication, making it easy for users to securely access SageMaker AI endpoints.
