Build a Chat Agent for Your BigQuery Data using Gemini Functions

Pia Riachi
7 min read · Jul 16, 2024


Forget writing complex SQL queries. What if you could simply ask questions in plain English and get instant answers from your BigQuery dataset?

The example in this blog post uses the Google Ads Data Transfer dataset, but the data agent could be built on top of any dataset you have in BigQuery.

We will explore using function calling in Gemini to build a data exploration chat app that translates natural language questions into SQL, sends the queries to BigQuery (BQ), and returns the results as a natural-language response.

What is the Google Ads Data Transfer service?

The Google Ads BigQuery Data Transfer is a service that lets you schedule recurring transfers of your Google Ads reporting data into BigQuery. However, the transfer produces a large number of tables and views, which can be daunting to work with, especially for users who are not comfortable writing complex SQL queries.

What is Function Calling in Gemini?

In essence, function calling lets you describe functions to a generative model; the model then returns structured data indicating which function to call and with which arguments, and your application executes the call (for example, against an external API). This enables LLMs to communicate with many services, such as SQL databases, document stores, and anything with an API endpoint.
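To make the mechanics concrete, here is a minimal plain-Python mock of that round trip. It makes no Vertex AI calls; handlers and fake_model_response are illustrative names for this sketch, not part of the Gemini SDK.

```python
# Plain-Python mock of the function-calling round trip (no real model calls).
# The model never executes anything itself: it returns a structured request
# naming a declared function plus arguments, and the application runs it.

# Functions the app "declares" to the model, keyed by name (illustrative).
handlers = {
    "list_tables": lambda dataset_id: ["campaign_stats", "click_stats"],
}

# What a model function-call response conceptually contains.
fake_model_response = {"name": "list_tables", "args": {"dataset_id": "google_ads_demo"}}

# The app dispatches the call; in the real flow the result is sent back to
# the model, which phrases a natural-language answer for the user.
result = handlers[fake_model_response["name"]](**fake_model_response["args"])
print(result)  # ['campaign_stats', 'click_stats']
```

The real app follows the same dispatch pattern, just with Vertex AI response objects instead of dicts.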

How is Function Calling Useful Here?

Function calling excels at extracting granular details from the underlying database. It can help you construct complex queries with filters, joins, and aggregations, ensuring you get exactly the data you need. Compared to approaches such as using BigQuery ML with Gemini, which are geared more toward broad text analysis and generation tasks, function calling gives you finer control over the data retrieval and manipulation process.

Solution Overview

Image by Author

Let’s Start Building

The code below is adapted from the SQL Talk app, an example AI data agent that leverages the power of Gemini’s function calling to let you interact with your data conversationally. Aside from the Vertex AI library, it uses Streamlit, an open-source Python library that allows you to create interactive web applications with ease.

Prerequisites

  • A Google Cloud project with billing enabled.
  • Google Ads BigQuery Data Transfer (or another dataset).
  • Vertex AI and BigQuery APIs enabled on GCP.
  • Familiarity with CloudShell, Python, and SQL concepts.

Set up your Google Ads Data Transfer

If you have not done so already, you can set up your Google Ads BigQuery data transfer by following the instructions here.

Configure your project

Open the Cloud Shell Editor in your Google Cloud console and set the Google Cloud project to use with this sample app. To do so, run the following command, replacing YOUR_PROJECT_ID with your own Google Cloud project ID:

gcloud config set project YOUR_PROJECT_ID

Enable the Vertex AI and BigQuery APIs

gcloud services enable aiplatform.googleapis.com
gcloud services enable bigquery.googleapis.com

Install the needed Python packages

pip install google-cloud-aiplatform
pip install google-cloud-bigquery
pip install streamlit

Create a directory in your Cloud Shell

Create a new directory for the app, for instance gads_chat_app, by running the commands below.

mkdir gads_chat_app
cd gads_chat_app

Create the solution’s main Python file

Inside the new directory, create a file named app.py and paste in the code below, replacing google_ads_demo with your Google Ads dataset name.

import time

from google.cloud import bigquery
import streamlit as st
from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Part, Tool

BIGQUERY_DATASET_ID = "google_ads_demo"

list_datasets_func = FunctionDeclaration(
    name="list_datasets",
    description="Get a list of datasets that will help answer the user's question",
    parameters={
        "type": "object",
        "properties": {},
    },
)

list_tables_func = FunctionDeclaration(
    name="list_tables",
    description="List tables in a dataset that will help answer the user's question",
    parameters={
        "type": "object",
        "properties": {
            "dataset_id": {
                "type": "string",
                "description": "Dataset ID to fetch tables from.",
            }
        },
        "required": [
            "dataset_id",
        ],
    },
)

get_table_func = FunctionDeclaration(
    name="get_table",
    description="Get information about a table, including the description, schema, and number of rows that will help answer the user's question. Always use the fully qualified dataset and table names.",
    parameters={
        "type": "object",
        "properties": {
            "table_id": {
                "type": "string",
                "description": "Fully qualified ID of the table to get information about",
            }
        },
        "required": [
            "table_id",
        ],
    },
)

sql_query_func = FunctionDeclaration(
    name="sql_query",
    description="Get information from data in BigQuery using SQL queries",
    parameters={
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "SQL query on a single line that will help give quantitative answers to the user's question when run on a BigQuery dataset and table. In the SQL query, always use the fully qualified dataset and table names.",
            }
        },
        "required": [
            "query",
        ],
    },
)

sql_query_tool = Tool(
    function_declarations=[
        list_datasets_func,
        list_tables_func,
        get_table_func,
        sql_query_func,
    ],
)

model = GenerativeModel(
    "gemini-1.5-pro-001",
    generation_config={"temperature": 0},
    tools=[sql_query_tool],
)

st.set_page_config(
    page_title="SQL Talk with BigQuery",
    layout="wide",
)

col1, col2 = st.columns([8, 1])
with col1:
    st.title("SQL Talk with BigQuery Google Ads Data Transfer")
with col2:
    st.text("Demo")

st.subheader("Powered by Function Calling in Gemini")

st.markdown(
    "[Source Code](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/function-calling/sql-talk-app/) • [Documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/function-calling) • [Codelab](https://codelabs.developers.google.com/codelabs/gemini-function-calling) • [Sample Notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/intro_function_calling.ipynb)"
)

with st.expander("Sample prompts", expanded=True):
    st.write(
        """
        - What kind of information is in this database?
        - Can you show the top 3 campaigns by impressions?
        - Can you show how clicks are distributed across devices?
        - What is the click-through rate (clicks divided by impressions) for campaigns in the 'DISPLAY' Network?
        """
    )

if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"].replace("$", "\$"))  # noqa: W605
        try:
            with st.expander("Function calls, parameters, and responses"):
                st.markdown(message["backend_details"])
        except KeyError:
            pass

if prompt := st.chat_input("Ask me about information in the database..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        full_response = ""
        chat = model.start_chat()
        client = bigquery.Client()

        prompt += """
            Please give a concise, high-level summary followed by detail in
            plain language about where the information in your response is
            coming from in the database. Only use information that you learn
            from BigQuery, do not make up information.
            """

        response = chat.send_message(prompt)
        response = response.candidates[0].content.parts[0]

        print(response)

        api_requests_and_responses = []
        backend_details = ""

        function_calling_in_process = True
        while function_calling_in_process:
            try:
                params = {}
                for key, value in response.function_call.args.items():
                    params[key] = value

                print(response.function_call.name)
                print(params)

                if response.function_call.name == "list_datasets":
                    api_response = client.list_datasets()
                    api_response = BIGQUERY_DATASET_ID
                    api_requests_and_responses.append(
                        [response.function_call.name, params, api_response]
                    )

                if response.function_call.name == "list_tables":
                    api_response = client.list_tables(params["dataset_id"])
                    api_response = str([table.table_id for table in api_response])
                    api_requests_and_responses.append(
                        [response.function_call.name, params, api_response]
                    )

                if response.function_call.name == "get_table":
                    api_response = client.get_table(params["table_id"])
                    api_response = api_response.to_api_repr()
                    api_requests_and_responses.append(
                        [
                            response.function_call.name,
                            params,
                            [
                                str(api_response.get("description", "")),
                                str(
                                    [
                                        column["name"]
                                        for column in api_response["schema"]["fields"]
                                    ]
                                ),
                            ],
                        ]
                    )
                    api_response = str(api_response)

                if response.function_call.name == "sql_query":
                    job_config = bigquery.QueryJobConfig(
                        maximum_bytes_billed=100000000
                    )  # Data limit per query job
                    try:
                        cleaned_query = (
                            params["query"]
                            .replace("\\n", " ")
                            .replace("\n", "")
                            .replace("\\", "")
                        )
                        query_job = client.query(cleaned_query, job_config=job_config)
                        api_response = query_job.result()
                        api_response = str([dict(row) for row in api_response])
                        api_response = api_response.replace("\\", "").replace("\n", "")
                        api_requests_and_responses.append(
                            [response.function_call.name, params, api_response]
                        )
                    except Exception as e:
                        api_response = f"{str(e)}"
                        api_requests_and_responses.append(
                            [response.function_call.name, params, api_response]
                        )

                print(api_response)

                response = chat.send_message(
                    Part.from_function_response(
                        name=response.function_call.name,
                        response={
                            "content": api_response,
                        },
                    ),
                )
                response = response.candidates[0].content.parts[0]

                backend_details += "- Function call:\n"
                backend_details += (
                    "   - Function name: ```"
                    + str(api_requests_and_responses[-1][0])
                    + "```"
                )
                backend_details += "\n\n"
                backend_details += (
                    "   - Function parameters: ```"
                    + str(api_requests_and_responses[-1][1])
                    + "```"
                )
                backend_details += "\n\n"
                backend_details += (
                    "   - API response: ```"
                    + str(api_requests_and_responses[-1][2])
                    + "```"
                )
                backend_details += "\n\n"
                with message_placeholder.container():
                    st.markdown(backend_details)

            except AttributeError:
                function_calling_in_process = False

        time.sleep(3)

        full_response = response.text
        with message_placeholder.container():
            st.markdown(full_response.replace("$", "\$"))  # noqa: W605
            with st.expander("Function calls, parameters, and responses:"):
                st.markdown(backend_details)

        st.session_state.messages.append(
            {
                "role": "assistant",
                "content": full_response,
                "backend_details": backend_details,
            }
        )

Notice that each function declaration includes parameter definitions and a description. In sql_query_func, for example, we declare a single query parameter and ask the model to generate the corresponding SQL for it. Gemini’s function calling uses the parameter types and descriptions to identify the appropriate function to call and to fill in the argument values.
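The parameters block follows a JSON-Schema-like shape. As an illustration only (the app itself never validates arguments; Gemini performs this matching on the model side), here is a sketch of how model-supplied arguments line up with a declared schema. args_match_schema is a hypothetical helper, not part of the app or SDK.

```python
# JSON-Schema-style declaration like the one in sql_query_func.
schema = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

def args_match_schema(args, schema):
    """Hypothetical check: all required keys present, all keys declared."""
    has_required = all(key in args for key in schema.get("required", []))
    all_declared = all(key in schema["properties"] for key in args)
    return has_required and all_declared

print(args_match_schema({"query": "SELECT 1"}, schema))  # True
print(args_match_schema({}, schema))  # False
```

This is just to show the contract: whatever Gemini sends back as function-call arguments conforms to the properties and required fields you declared.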

We then register the functions in a Tool so that Gemini can use them.

Finally, after receiving a natural language response from Gemini, we display the response to the end user.

The full analysis flow is also displayed to the user. This includes the table selection, the field selection, and the query that was used to return the results.
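As a standalone sketch of how that transcript is assembled, the helper below mirrors the backend_details markdown string built inside the loop in app.py. format_backend_details is a hypothetical refactoring for illustration, not a function in the app.

```python
def format_backend_details(calls):
    """Render (function name, params, response) triples as the markdown
    transcript shown in the app's "Function calls" expander."""
    details = ""
    for name, params, api_response in calls:
        details += "- Function call:\n"
        details += "   - Function name: ```" + str(name) + "```\n\n"
        details += "   - Function parameters: ```" + str(params) + "```\n\n"
        details += "   - API response: ```" + str(api_response) + "```\n\n"
    return details

transcript = format_backend_details(
    [("list_tables", {"dataset_id": "google_ads_demo"}, "['campaign_stats']")]
)
print(transcript)
```

Each round of the function-calling loop appends one such entry, so the user can audit every table lookup and query the agent performed.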

The first function call, list_datasets, lists the available datasets so the model can identify the right one to use.

Image by Author

The second function call, get_table, retrieves the schema of the right table to use, and the third, sql_query, takes the generated SQL query and sends it as a parameter to the BigQuery API.

Image by Author

Run your application

Run the command below to launch your application locally. You can select a different port number if needed.

streamlit run app.py --server.enableCORS=false --server.enableXsrfProtection=false --server.port 8080

Depending on your requirements, you could also deploy your application to App Engine or Cloud Run and secure it via IAP (Identity-Aware Proxy), but that is out of scope for this post.

Modify the app’s behavior and add functionality

You could also add new functionality to the app by registering new function definitions as instructed here.
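As a sketch of the pattern, suppose you wanted a hypothetical get_row_count capability. In the real app you would add a FunctionDeclaration, register it in the Tool, and add a matching branch in the function-calling loop; the version below mocks this with plain dicts and a stubbed BigQuery client so it stays self-contained (get_row_count and StubClient are illustrative, not part of the sample app or SDK).

```python
# Hypothetical new capability: report a table's row count.
get_row_count_func = {  # stands in for a FunctionDeclaration
    "name": "get_row_count",
    "description": "Get the number of rows in a table",
    "parameters": {
        "type": "object",
        "properties": {"table_id": {"type": "string"}},
        "required": ["table_id"],
    },
}

class StubClient:
    """Stands in for bigquery.Client in this sketch."""
    def get_table(self, table_id):
        class Table:  # real bigquery.Table objects expose num_rows
            num_rows = 1234
        return Table()

def handle_get_row_count(client, params):
    # Mirrors the if-branches inside app.py's while loop: call the API,
    # then stringify the result before sending it back to Gemini.
    table = client.get_table(params["table_id"])
    return str(table.num_rows)

print(handle_get_row_count(StubClient(), {"table_id": "proj.ds.table"}))  # 1234
```

With the real classes, the new declaration would simply be appended to the Tool's function_declarations list alongside the four existing ones.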

How it all works

To summarize what we did: first, we defined list_datasets_func, list_tables_func, get_table_func, and sql_query_func. Each declaration includes parameters and a description, which Gemini uses to decide which function to trigger based on the user’s input. Gemini then generates the corresponding SQL, the app executes it in BigQuery via the BigQuery Python client, and the results are returned in plain language.

Disclaimer: My posts and opinions are my own, not those of my company (Google).

Written by Pia Riachi

Engineer @Google | Advertising Solutions Engineering | Business Intelligence | Data Engineering | Artificial Intelligence (AI)
