Building a self-serve analytics platform with Azure Data Explorer and OpenAI
Published Jul 16 2023 02:36 AM

Introduction

Data is the new oil, but simply having access to it is not enough. To make informed decisions, you must be able to visualize and analyze your data in a meaningful way.

In this article, I will show you how to build a self-serve analytics platform that lets your customers ask questions about their data in plain English.


Azure Data Explorer (aka Kusto)

Azure Data Explorer is a powerful analytics platform designed to handle large volumes of data with low latency.

It is a cloud-based service in Azure that enables users to perform real-time data analysis on large datasets, making it ideal for businesses that need to process vast quantities of data quickly.
One of Kusto's key benefits is its scalability. The platform handles petabytes of data with ease, so companies can scale their data processing capabilities up or down as their needs change, without having to invest in additional hardware or infrastructure.
Kusto is also highly versatile, supporting a wide range of data sources, including structured, semi-structured, and unstructured data. This makes it a good fit for businesses that need to process diverse data types, such as log files, IoT data, and social media feeds.
One of the most popular use cases for Kusto is in conjunction with Power BI, Microsoft's business intelligence tool.

By integrating Kusto with Power BI, businesses can build powerful analytics solutions that enable them to gain insights into their data quickly and easily. This integration enables users to create interactive dashboards and reports that provide a real-time view of their data, helping them make more informed decisions.


Building a self-serve analytics platform

The need for self-serve data analytics is becoming increasingly important: customers want to access and analyze their own data without having to rely on engineering or data analysts.

This is where the ability to query a database in plain English becomes essential.

With the help of GPT-3.5, you can translate natural language into queries. Using this paradigm, you can offer your customers the ability to ask questions about their data in plain English.

This means that customers can ask questions like

"What were my sales last month?" or "Which products are selling the most?" and your system will automatically translate these questions into queries and query the database.
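For instance, here is a hypothetical pair of such translations. The exact KQL depends on your schema, and the filter values below are purely illustrative, but the column names match the sales table used later in this article:

```python
# Hypothetical examples of the translation step
# (the filter values and query shapes are illustrative):
examples = {
    "What were my sales last month?":
        "sales | where year == 2023 and month == 6 "
        "| summarize total=sum(retail_sales)",
    "Which products are selling the most?":
        "sales | summarize total=sum(retail_sales) by item_description "
        "| top 10 by total",
}

for question, kql in examples.items():
    print(f"{question}\n  -> {kql}")
```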

 

Using Azure OpenAI to let your customers query your data in English

OpenAI can be used to query databases in two ways: by translating plain-English customer questions into a query language, or by using embeddings to convert the customer's question into a numerical representation that can be matched against the stored data. The best approach depends on the type of data being analyzed.

  • Translating a customer's question in English to a query language and querying the database based on that: this works best for analytical data, such as numerical or categorical data.
  • Using embeddings to convert the customer's question into a numerical representation that can be matched against the same numerical representations stored in the database: this works best for text data that can be indexed with embeddings and then queried through a vector store.

This is because embeddings can capture the semantic meaning of text data, even if the wording or phrasing of a question is not precise. For more information on how to do this you can read my previous article – Build a chatbot to query your documentation using Langchain and Azure OpenAI

  • There is a third case, which is a combination of the first two options. If the data is mostly text but creating embeddings for all of it is not an option, or you have a strong semantic search engine (such as Elasticsearch), you can use OpenAI to generate a JSON object that describes the customer’s question and then use that JSON to generate the query in your database. You can see such an approach in my GitHub repo with this example - generate JSON from english
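As a minimal sketch of that third option, here is how a structured description could be turned back into a query. The JSON field names ("metric", "group_by", "top", "filters") are illustrative, not a fixed schema, and the model response shown is a hand-written example of what such a call might return:

```python
import json

# Hypothetical JSON the model might return for
# "top 5 suppliers by retail sales in 2020" (field names are illustrative):
model_response = '''
{"metric": "retail_sales", "group_by": "supplier", "top": 5,
 "filters": {"year": 2020}}
'''

def json_to_kql(spec: dict, table: str = "sales") -> str:
    # Turn the structured description of the question into a KQL query string.
    where = " and ".join(f"{k} == {v}" for k, v in spec["filters"].items())
    return (f"{table} | where {where} "
            f"| summarize total=sum({spec['metric']}) by {spec['group_by']} "
            f"| top {spec['top']} by total")

spec = json.loads(model_response)
print(json_to_kql(spec))
# → sales | where year == 2020 | summarize total=sum(retail_sales) by supplier | top 5 by total
```

The advantage of this approach is that the model only produces a constrained JSON object, while your own code stays in control of the final query.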

In this article we will use the first approach to “translate” the customer’s question into a query language, in our case KQL (Kusto Query Language), and we will query an analytical DB containing sales data.

 

The Flow

Denise_Schlesinger_0-1689499080034.png

 

Loading the data into Azure Data Explorer

  • Create an ADX (Azure Data Explorer, aka Kusto) cluster in the Azure portal

Here is the link on how to create a free cluster for this demo

  • In the ADX cluster, create a Database named "retail" 

Denise_Schlesinger_1-1689499080036.png

 

  • Right-click on the database name and select “Ingest Data”

Denise_Schlesinger_2-1689499080041.png

 

 

Denise_Schlesinger_3-1689499080043.png

 

Denise_Schlesinger_4-1689499080044.png

 

Denise_Schlesinger_5-1689499080046.png

 

  • Click next

Denise_Schlesinger_6-1689499080048.png

 

  • Click “Next: Start ingestion”

Denise_Schlesinger_7-1689499080053.png

 

  • The data should be loaded into the table

Denise_Schlesinger_8-1689499080055.png

 

 

 

  • Add the Kusto connection details to a file called .env as follows:

AAD_TENANT_ID = "<your aad tenant id>"
KUSTO_CLUSTER = "https://<your azure data explorer name>.westeurope.kusto.windows.net"
KUSTO_DATABASE = "<your kusto database name>"
KUSTO_TABLE = "sales"
KUSTO_MANAGED_IDENTITY_APP_ID = "<your aad app registration id>"
KUSTO_MANAGED_IDENTITY_SECRET = "<your kusto managed identity secret>"

 

 

Create an Azure OpenAI deployment

  • In Azure OpenAI, deploy the gpt-35-turbo model

Denise_Schlesinger_9-1689499080058.png

 

 

  • Get the Azure OpenAI endpoint and key and add them to the .env file as follows:

 

 

OPENAI_DEPLOYMENT_ENDPOINT = "https://<your openai>.openai.azure.com/"
OPENAI_API_KEY = "<your openai api key>"
OPENAI_DEPLOYMENT_NAME = "<your gpt35 deployment name>"
OPENAI_DEPLOYMENT_VERSION = "2023-03-15-preview"
OPENAI_MODEL_NAME="gpt-35-turbo"

 

 

The code

  • Initialize environment variables

 

 

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
from dotenv import load_dotenv
import utils
import pandas as pd
import os
import openai
import json 

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 
OPENAI_DEPLOYMENT_ENDPOINT = os.getenv("OPENAI_DEPLOYMENT_ENDPOINT")
OPENAI_DEPLOYMENT_NAME = os.getenv("OPENAI_DEPLOYMENT_NAME")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME")
OPENAI_DEPLOYMENT_VERSION = os.getenv("OPENAI_DEPLOYMENT_VERSION")
# Configure OpenAI API
openai.api_type = "azure"
openai.api_version = OPENAI_DEPLOYMENT_VERSION
openai.api_base = OPENAI_DEPLOYMENT_ENDPOINT
openai.api_key = OPENAI_API_KEY

AAD_TENANT_ID = os.getenv("AAD_TENANT_ID")
KUSTO_CLUSTER = os.getenv("KUSTO_CLUSTER")
KUSTO_DATABASE = os.getenv("KUSTO_DATABASE")
KUSTO_TABLE = os.getenv("KUSTO_TABLE")
KUSTO_MANAGED_IDENTITY_APP_ID = os.getenv("KUSTO_MANAGED_IDENTITY_APP_ID")
KUSTO_MANAGED_IDENTITY_SECRET = os.getenv("KUSTO_MANAGED_IDENTITY_SECRET")

 

 

  • Connect to Azure Data Explorer Cluster (Kusto)

 

 

cluster = KUSTO_CLUSTER
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(cluster, KUSTO_MANAGED_IDENTITY_APP_ID, KUSTO_MANAGED_IDENTITY_SECRET,  AAD_TENANT_ID)
client = KustoClient(kcsb)

 

 

  • Test that the connection to Kusto works - a sample query to get 10 results from the “sales” table

 

 

query = "sales | take 10"

response = client.execute("retail", query)
for row in response.primary_results[0]:
    print("Item:{}".format(row["item_description"]))

 

 

Denise_Schlesinger_0-1689499427430.png

 

  • Create a prompt for the model to generate the KQL query

 

 

kusto_query_template_prefix = """
<|im_start|>system
I have an Azure Data Explorer (Kusto) table containing the following columns: 
year, month, supplier, item_code, item_description,item_type,retail_sales,retail_transfers,warehouse_sales

Write a KQL query based on the user input below. Answer in a concise manner. Answer only with the KQL query where the table name is T, no extra text.

user input: 
"""
template_sufix = "<|im_end|>\n<|im_start|>assistant"

 

 

  • Define functions to call the OpenAI API and run KQL query

 

 

import re

def call_openai(template_prefix, text):
    prompt = template_prefix + text + template_sufix
    response = openai.Completion.create(
        engine=OPENAI_DEPLOYMENT_NAME,
        prompt=prompt,
        temperature=0,
        # max_tokens counts completion tokens only; leave room for the prompt
        # inside gpt-35-turbo's 4096-token context window
        max_tokens=1024,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=["<|im_end|>"])
    response = response['choices'][0]['text']
    response = utils.remove_chars("\n", response)
    response = utils.start_after_string("Answer:", response)
    response = utils.remove_tail_tags("<|im_end|>", response)
    return response

def call_openai_kql_response(text):
    response = call_openai(kusto_query_template_prefix, text)
    # Replace the placeholder table name T with the real table name,
    # using a word-boundary match so other capital T's are left alone
    query = re.sub(r"\bT\b", "sales", response)
    query = query.replace("```", "")
    response = client.execute("retail", query)
    df = dataframe_from_result_table(response.primary_results[0])
    return df

 

 

  • Test the model with a sample question - aggregate the retail sales by month for the year 2020

 

 

df = call_openai_kql_response("I would like to get the total retail sales for 2020 by months")
print(df)

 

 

Denise_Schlesinger_1-1689499507378.png

  • Use plotly to visualize the results

Denise_Schlesinger_10-1689499080059.png

 

  • Generate KQL and run - aggregate the retail sales by supplier - get the top 10

 

 

df = call_openai_kql_response("I would like to get the top 10 suppliers by total retail sales")
print(df)

 

 

  • Use plotly to visualize the results

Denise_Schlesinger_11-1689499080061.png

 

  • Generate KQL and run

 

 

df = call_openai_kql_response("I would like to get the total retail sales by year and month for the top 10 suppliers")
print(df)

 

 

 

Denise_Schlesinger_2-1689499558545.png

What’s next

This is just sample code.

In order to create a highly scalable and efficient system to let your customers query your data in English, you must build a robust infrastructure.

Some of the things you need to consider are:

  • Create a production Azure Data Explorer cluster: the free cluster is for demo purposes only.
  • Auto scaling: allows you to dynamically adjust the number of resources your application uses based on demand, ensuring that your application can handle sudden spikes in traffic without experiencing downtime or performance issues. Consider using auto scaling for the backend APIs on your application to handle high demand.
  • Caching: helps reduce the load on your application by storing frequently accessed data in memory, reducing the need to access the database repeatedly. Consider caching the questions your customers ask and saving the answers you get from OpenAI to reduce the number of API calls to GPT.
  • Visualization and Dashboards: you can let customers ask questions about the data and then present the results in the most suitable graphical representation, for example by using Power BI embedded.
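As a minimal sketch of the caching idea from the list above, an in-memory dict keyed on the normalized question avoids repeated OpenAI calls for rephrased duplicates; a production system would more likely use a shared cache such as Redis:

```python
_query_cache = {}

def cached_translate(question, translate):
    # Normalize whitespace and case so trivially different phrasings
    # share a single cache entry.
    key = " ".join(question.lower().split())
    if key not in _query_cache:
        # Only call the (expensive) OpenAI translation on a cache miss.
        _query_cache[key] = translate(question)
    return _query_cache[key]

# Usage with a stand-in translator instead of a real OpenAI call:
calls = []
def fake_translate(q):
    calls.append(q)
    return "sales | take 10"

cached_translate("What were my sales?", fake_translate)
cached_translate("what were  my sales?", fake_translate)
print(len(calls))  # → 1 (the second call hits the cache)
```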

The source code for this article can be found in this repo:

azure-data-and-ai-examples/openai/ask your Analytical DB questions in english.ipynb at master · deni...

 

I hope you found this article enjoyable and easy to follow, and that it points you in the right direction for building a self-serve analytics application.

Thanks

Denise

Version history
Last update: Sep 13 2023 10:16 PM