title | titleSuffix | description | manager | ms.service | ms.topic | ms.date | ms.reviewer | reviewer | ms.author | author | ms.custom | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Azure AI Model Inference API | Azure AI Studio | Learn about how to use the Azure AI Model Inference API | scottpolly | azure-ai-studio | conceptual | 5/21/2024 | fasantia | santiagxf | mopeakande | msakande | |
[!INCLUDE feature-preview]
The Azure AI Model Inference API exposes a common set of capabilities for foundational models. Developers can use it to consume predictions from a diverse set of models in a uniform and consistent way, and can talk with different models deployed in Azure AI Studio without changing the underlying code.
Foundational models, such as language models, have made remarkable strides in recent years. These advancements have revolutionized fields such as natural language processing and computer vision, and they have enabled applications like chatbots, virtual assistants, and language translation services.
While foundational models excel in specific domains, they lack a uniform set of capabilities. Some models are better at specific tasks, and even for the same task, some models approach the problem in one way while others approach it in another. Developers can benefit from this diversity by using the right model for the right job, allowing them to:
[!div class="checklist"]
- Improve the performance in a specific downstream task.
- Use more efficient models for simpler tasks.
- Use smaller models that can run faster on specific tasks.
- Compose multiple models to develop intelligent experiences.
Having a uniform way to consume foundational models allows developers to realize all those benefits without sacrificing portability or changing the underlying code.
The Azure AI Model Inference API is available for the following models:
Models deployed to serverless API endpoints:
[!div class="checklist"]
- Cohere Embed V3 family of models
- Cohere Command R family of models
- Meta Llama 2 chat family of models
- Meta Llama 3 instruct family of models
- Mistral-Small
- Mistral-Large
- Jais family of models
- Jamba family of models
- Phi-3 family of models
Models deployed to managed inference:
[!div class="checklist"]
- Meta Llama 3 instruct family of models
- Phi-3 family of models
- Mistral and Mixtral family of models.
The API is compatible with Azure OpenAI model deployments.
Note
The Azure AI model inference API is available in managed inference (Managed Online Endpoints) for models deployed after June 24th, 2024. To take advantage of the API, redeploy your endpoint if the model was deployed before that date.
The following section describes some of the capabilities the API exposes. For a full specification of the API, view the reference section.
The API indicates how developers can consume predictions for the following modalities:
- Get info: Returns the information about the model deployed under the endpoint.
- Text embeddings: Creates an embedding vector representing the input text.
- Text completions: Creates a completion for the provided prompt and parameters.
- Chat completions: Creates a model response for the given chat conversation.
- Image embeddings: Creates an embedding vector representing the input text and image.
You can use streamlined inference clients in the language of your choice to consume predictions from models running the Azure AI model inference API.
Install the package azure-ai-inference using your package manager, like pip:
pip install azure-ai-inference
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

model = ChatCompletionsClient(
    endpoint=os.environ["AZUREAI_ENDPOINT_URL"],
    credential=AzureKeyCredential(os.environ["AZUREAI_ENDPOINT_KEY"]),
)
If you are using an endpoint that supports Microsoft Entra ID, you can create your client as follows:
import os
from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

model = ChatCompletionsClient(
    endpoint=os.environ["AZUREAI_ENDPOINT_URL"],
    credential=DefaultAzureCredential(),
)
Explore our samples and read the API reference documentation to get yourself started.
Install the package @azure-rest/ai-inference using npm:
npm install @azure-rest/ai-inference
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:
import ModelClient from "@azure-rest/ai-inference";
import { isUnexpected } from "@azure-rest/ai-inference";
import { AzureKeyCredential } from "@azure/core-auth";
const client = new ModelClient(
process.env.AZUREAI_ENDPOINT_URL,
new AzureKeyCredential(process.env.AZUREAI_ENDPOINT_KEY)
);
For endpoints that support Microsoft Entra ID, you can create your client as follows:
import ModelClient from "@azure-rest/ai-inference";
import { isUnexpected } from "@azure-rest/ai-inference";
import { DefaultAzureCredential } from "@azure/identity";

const client = new ModelClient(
    process.env.AZUREAI_ENDPOINT_URL,
    new DefaultAzureCredential()
);
Explore our samples and read the API reference documentation to get yourself started.
Install the Azure AI inference library with the following command:
dotnet add package Azure.AI.Inference --prerelease
For endpoints that support Microsoft Entra ID (formerly Azure Active Directory), install the Azure.Identity package:
dotnet add package Azure.Identity
Import the following namespaces:
using Azure;
using Azure.Identity;
using Azure.AI.Inference;
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:
ChatCompletionsClient client = new ChatCompletionsClient(
new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")),
new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL"))
);
For endpoints that support Microsoft Entra ID (formerly Azure Active Directory):
ChatCompletionsClient client = new ChatCompletionsClient(
new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")),
new DefaultAzureCredential(includeInteractiveCredentials: true)
);
Explore our samples and read the API reference documentation to get yourself started.
Use the reference section to explore the API design and which parameters are available. For example, the reference section for Chat completions details how to use the route /chat/completions to generate predictions based on chat-formatted instructions:
Request
POST /chat/completions?api-version=2024-04-01-preview
Authorization: Bearer <bearer-token>
Content-Type: application/json
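The request above can also be assembled by hand with only the Python standard library. The following is a minimal sketch, not the SDK's implementation: `build_chat_request` is a hypothetical helper, and the endpoint URL and token passed to it are placeholders. It only builds the request object; sending it is left to the caller.

```python
import json
import urllib.request

API_VERSION = "2024-04-01-preview"  # version used in the request shown above


def build_chat_request(endpoint: str, bearer_token: str, messages: list) -> urllib.request.Request:
    """Build (but don't send) a POST request for the /chat/completions route."""
    url = f"{endpoint.rstrip('/')}/chat/completions?api-version={API_VERSION}"
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
    )


request = build_chat_request(
    "https://example.invalid",  # placeholder endpoint (assumption)
    "<bearer-token>",           # placeholder token (assumption)
    [{"role": "user", "content": "How many languages are in the world?"}],
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send the request to a real endpoint.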
The Azure AI Model Inference API specifies a set of modalities and parameters that models can subscribe to. However, some models may have further capabilities than the ones the API indicates. In those cases, the API allows the developer to pass them as extra parameters in the payload.
By setting the header extra-parameters: pass-through, the API attempts to pass any unknown parameter directly to the underlying model. If the model can handle that parameter, the request completes.
The following example shows a request passing the parameter safe_prompt, which is supported by Mistral-Large but isn't specified in the Azure AI Model Inference API.
from azure.ai.inference.models import SystemMessage, UserMessage

response = model.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
    # Extra parameter that isn't part of the API; it's forwarded to the model.
    model_extras={
        "safe_prompt": True
    }
)

print(response.choices[0].message.content)
Tip
When using the Azure AI Inference SDK, using model_extras configures the request with extra-parameters: pass-through automatically for you.
var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    // Send the pass-through instruction as a request header.
    headers: {
        "extra-parameters": "pass-through"
    },
    body: {
        messages: messages,
        safe_prompt: true
    }
});

console.log(response.body.choices[0].message.content)
requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("How many languages are in the world?")
    },
    // Extra parameter that isn't part of the API; it's forwarded to the model.
    AdditionalProperties = { { "safe_prompt", BinaryData.FromString("true") } },
};

response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough);
Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}");
Request
POST /chat/completions?api-version=2024-04-01-preview
Authorization: Bearer <bearer-token>
Content-Type: application/json
extra-parameters: pass-through
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Explain Riemann's conjecture in 1 paragraph"
}
],
"temperature": 0,
"top_p": 1,
"response_format": { "type": "text" },
"safe_prompt": true
}
Note
The default value for extra-parameters is error, which returns an error if the payload contains an extra parameter. Alternatively, you can set extra-parameters: drop to drop any unknown parameter in the request. Use this capability when you send requests with extra parameters that you know the model won't support, but you want the request to complete anyway. A typical example is the seed parameter.
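The three values of extra-parameters can be pictured with a small local simulation. This is only an illustrative sketch of the behavior described above, not the service's implementation: `apply_extra_parameters` is a hypothetical function, and `KNOWN` is an assumed, partial set of parameter names chosen for the example.

```python
def apply_extra_parameters(payload: dict, known: set, mode: str = "error") -> dict:
    """Mimic locally how the three extra-parameters modes treat unknown keys."""
    unknown = [key for key in payload if key not in known]
    if mode == "pass-through":
        # Everything is forwarded; the underlying model decides what to do.
        return dict(payload)
    if mode == "drop":
        # Unknown keys are silently removed before the request is processed.
        return {key: value for key, value in payload.items() if key in known}
    if mode == "error":
        # Default: any unknown key makes the request fail.
        if unknown:
            raise ValueError(f"Unknown parameters: {', '.join(unknown)}")
        return dict(payload)
    raise ValueError(f"Unsupported mode: {mode!r}")


KNOWN = {"messages", "temperature", "top_p"}  # illustrative subset (assumption)
payload = {"messages": [], "temperature": 0, "safe_prompt": True}

print(apply_extra_parameters(payload, KNOWN, "drop"))
print(apply_extra_parameters(payload, KNOWN, "pass-through"))
```

Calling the function with the default mode and the same payload raises an error, which mirrors the default behavior of the header.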
The Azure AI Model Inference API indicates a general set of capabilities, but each model can decide whether to implement them. A specific error is returned in those cases where the model can't support a specific parameter.
The following example shows the response for a chat completion request that sets the parameter response_format and asks for a reply in JSON format. Because the model doesn't support this capability, error 422 is returned to the user.
import json
from azure.ai.inference.models import SystemMessage, UserMessage, ChatCompletionsResponseFormatJSON
from azure.core.exceptions import HttpResponseError

try:
    response = model.complete(
        messages=[
            SystemMessage(content="You are a helpful assistant."),
            UserMessage(content="How many languages are in the world?"),
        ],
        response_format=ChatCompletionsResponseFormatJSON()
    )
except HttpResponseError as ex:
    if ex.status_code == 422:
        response = json.loads(ex.response._content.decode('utf-8'))
        if isinstance(response, dict) and "detail" in response:
            for offending in response["detail"]:
                param = ".".join(offending["loc"])
                value = offending["input"]
                print(
                    f"Looks like the model doesn't support the parameter '{param}' with value '{value}'"
                )
    else:
        raise ex
try {
    var messages = [
        { role: "system", content: "You are a helpful assistant" },
        { role: "user", content: "How many languages are in the world?" },
    ];

    var response = await client.path("/chat/completions").post({
        body: {
            messages: messages,
            response_format: { type: "json_object" }
        }
    });
}
catch (error) {
    if (error.status_code == 422) {
        var response = JSON.parse(error.response._content)
        if (response.detail) {
            for (const offending of response.detail) {
                var param = offending.loc.join(".")
                var value = offending.input
                console.log(`Looks like the model doesn't support the parameter '${param}' with value '${value}'`)
            }
        }
    }
    else
    {
        throw error
    }
}
try
{
    requestOptions = new ChatCompletionsOptions()
    {
        Messages = {
            new ChatRequestSystemMessage("You are a helpful assistant"),
            new ChatRequestUserMessage("How many languages are in the world?"),
        },
        ResponseFormat = new ChatCompletionsResponseFormatJSON()
    };

    response = client.Complete(requestOptions);
    Console.WriteLine(response.Value.Choices[0].Message.Content);
}
catch (RequestFailedException ex)
{
    if (ex.Status == 422)
    {
        Console.WriteLine($"Looks like the model doesn't support a parameter: {ex.Message}");
    }
    else
    {
        throw;
    }
}
Request
POST /chat/completions?api-version=2024-04-01-preview
Authorization: Bearer <bearer-token>
Content-Type: application/json
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Explain Riemann's conjecture in 1 paragraph"
}
],
"temperature": 0,
"top_p": 1,
"response_format": { "type": "json_object" }
}
Response
{
"status": 422,
"code": "parameter_not_supported",
"detail": {
"loc": [ "body", "response_format" ],
"input": "json_object"
},
"message": "One of the parameters contain invalid values."
}
Tip
You can inspect the property detail.loc to understand the location of the offending parameter and detail.input to see the value that was passed in the request.
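To make the tip concrete, the following sketch parses the 422 response body shown above using only the standard library. `offending_parameters` is a hypothetical helper, not part of any SDK. It accepts `detail` either as a single object, as in the REST response above, or as a list of such objects, which is an assumption made so that both shapes are handled.

```python
import json

# The 422 response body from the example above.
RESPONSE_BODY = """
{
  "status": 422,
  "code": "parameter_not_supported",
  "detail": {
    "loc": [ "body", "response_format" ],
    "input": "json_object"
  },
  "message": "One of the parameters contain invalid values."
}
"""


def offending_parameters(body: str) -> list:
    """Return (location, value) pairs for each parameter the model rejected."""
    detail = json.loads(body).get("detail", [])
    # Normalize a single object into a one-element list.
    entries = detail if isinstance(detail, list) else [detail]
    return [
        (".".join(str(part) for part in entry["loc"]), entry["input"])
        for entry in entries
    ]


for location, value in offending_parameters(RESPONSE_BODY):
    print(f"Unsupported parameter '{location}' with value '{value}'")
```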
The Azure AI model inference API supports Azure AI Content Safety. When using deployments with Azure AI Content Safety on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering (preview) system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.
The following example shows the response for a chat completion request that has triggered content safety.
import json
from azure.ai.inference.models import AssistantMessage, UserMessage, SystemMessage
from azure.core.exceptions import HttpResponseError

try:
    response = model.complete(
        messages=[
            SystemMessage(content="You are an AI assistant that helps people find information."),
            UserMessage(content="Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."),
        ]
    )

    print(response.choices[0].message.content)

except HttpResponseError as ex:
    if ex.status_code == 400:
        # Content safety errors are returned with status code 400.
        response = json.loads(ex.response._content.decode('utf-8'))
        if isinstance(response, dict) and "error" in response:
            print(f"Your request triggered an {response['error']['code']} error:\n\t {response['error']['message']}")
        else:
            raise ex
    else:
        raise ex
try {
    var messages = [
        { role: "system", content: "You are an AI assistant that helps people find information." },
        { role: "user", content: "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." },
    ]

    var response = await client.path("/chat/completions").post({
        body: {
            messages: messages,
        }
    });

    console.log(response.body.choices[0].message.content)
}
catch (error) {
    if (error.status_code == 400) {
        var response = JSON.parse(error.response._content)
        if (response.error) {
            console.log(`Your request triggered an ${response.error.code} error:\n\t ${response.error.message}`)
        }
        else
        {
            throw error
        }
    }
}
try
{
    requestOptions = new ChatCompletionsOptions()
    {
        Messages = {
            new ChatRequestSystemMessage("You are an AI assistant that helps people find information."),
            new ChatRequestUserMessage(
                "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."
            ),
        },
    };

    response = client.Complete(requestOptions);
    Console.WriteLine(response.Value.Choices[0].Message.Content);
}
catch (RequestFailedException ex)
{
    if (ex.ErrorCode == "content_filter")
    {
        Console.WriteLine($"Your query has triggered Azure Content Safety: {ex.Message}");
    }
    else
    {
        throw;
    }
}
Request
POST /chat/completions?api-version=2024-04-01-preview
Authorization: Bearer <bearer-token>
Content-Type: application/json
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."
}
],
"temperature": 0,
"top_p": 1
}
Response
{
"status": 400,
"code": "content_filter",
"message": "The response was filtered",
"param": "messages",
"type": null
}
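A caller that talks to the REST route directly might detect this situation along the following lines. This is a sketch under stated assumptions: `content_filter_message` is a hypothetical helper, and it accepts both the flat payload shown in the response above and a payload nested under an "error" key, which is the shape the Python SDK example reads.

```python
import json


def content_filter_message(status: int, body: str):
    """Return the filter message if the response is a content_filter error, else None."""
    if status != 400:
        return None
    payload = json.loads(body)
    # Use the nested "error" object when present, otherwise the payload itself.
    error = payload.get("error", payload)
    if isinstance(error, dict) and error.get("code") == "content_filter":
        return error.get("message")
    return None


# The response body from the example above.
flat_body = (
    '{"status": 400, "code": "content_filter", '
    '"message": "The response was filtered", "param": "messages", "type": null}'
)
print(content_filter_message(400, flat_body))
```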
The Azure AI Model Inference API is currently supported in certain models deployed as Serverless API endpoints and Managed Online Endpoints. Deploy any of the supported models and use the exact same code to consume their predictions.
The client library azure-ai-inference does inference, including chat completions, for AI models deployed by Azure AI Studio and Azure Machine Learning Studio. It supports Serverless API endpoints and Managed Compute endpoints (formerly known as Managed Online Endpoints).
Explore our samples and read the API reference documentation to get yourself started.
The client library @azure-rest/ai-inference does inference, including chat completions, for AI models deployed by Azure AI Studio and Azure Machine Learning Studio. It supports Serverless API endpoints and Managed Compute endpoints (formerly known as Managed Online Endpoints).
Explore our samples and read the API reference documentation to get yourself started.
The client library Azure.AI.Inference does inference, including chat completions, for AI models deployed by Azure AI Studio and Azure Machine Learning Studio. It supports Serverless API endpoints and Managed Compute endpoints (formerly known as Managed Online Endpoints).
Explore our samples and read the API reference documentation to get yourself started.
Explore the reference section of the Azure AI model inference API to see parameters and options to consume models, including chat completions models, deployed by Azure AI Studio and Azure Machine Learning Studio. The API supports Serverless API endpoints and Managed Compute endpoints (formerly known as Managed Online Endpoints).
- Get info: Returns the information about the model deployed under the endpoint.
- Text embeddings: Creates an embedding vector representing the input text.
- Text completions: Creates a completion for the provided prompt and parameters.
- Chat completions: Creates a model response for the given chat conversation.
- Image embeddings: Creates an embedding vector representing the input text and image.