As part of their recent DEV Day presentation, OpenAI announced that Prompt Caching was now available for various models. At the time of writing, those models were:-
GPT-4o, GPT-4o mini, o1-preview and o1-mini, as well as fine-tuned versions of those models.
This news shouldn’t be underestimated, as it will allow developers to save on costs and reduce application runtime latency.
API calls to supported models will automatically benefit from Prompt Caching on prompts longer than 1,024 tokens. The API caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments. If you reuse prompts with common prefixes, OpenAI will automatically apply the Prompt Caching discount without requiring you to change your API integration.
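To make the increment rule concrete, here’s a rough sketch in Python of how the longest cacheable prefix grows with prompt length. This is my own illustration of the rule as described above, not an official OpenAI formula, and remember that caching also requires the prefix content itself to match a previous request:

def max_cacheable_prefix(prompt_tokens):
    # Below the 1,024-token minimum, nothing is eligible for caching
    if prompt_tokens < 1024:
        return 0
    # Above it, the cacheable prefix grows in 128-token increments
    return 1024 + (prompt_tokens - 1024) // 128 * 128

print(max_cacheable_prefix(1000))  # 0
print(max_cacheable_prefix(1092))  # 1024
print(max_cacheable_prefix(1300))  # 1280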
As an OpenAI API developer, the only thing you may have to worry about is how to monitor your Prompt Caching use, i.e. check that it’s being applied.
In this article, I’ll show you how to do that using Python, a Jupyter Notebook and a chat completion example.
Install WSL2 Ubuntu
I’m on Windows, but I’ll run my example code under WSL2 Ubuntu. Check out the link below for a comprehensive guide on installing WSL2 for Windows.
Setting up our development environment
Before starting development work like this, I always create a separate Python environment where I can install any software I need and experiment with code. Anything I do in this environment is then siloed and won’t impact my other projects.
I use Miniconda for this, but there are many other ways to do it, so use whatever method you know best.
If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link.
To follow along with my example, you’ll need an OpenAI API key. Create an OpenAI account if you don’t already have one, then you can get a key from the OpenAI platform using the link below:
https://platform.openai.com/api-keys
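As an aside, rather than hardcoding the key in your script, as I do later to keep the example simple, a safer pattern is to export it as an environment variable. The OpenAI client picks up OPENAI_API_KEY automatically if you don’t pass api_key explicitly:

$ export OPENAI_API_KEY="sk-..."

Then, in Python:

from openai import OpenAI
client = OpenAI()  # reads the OPENAI_API_KEY environment variable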
1/ Create our new dev environment and install the required libraries
(base) $ conda create -n oai_test python=3.11 -y
(base) $ conda activate oai_test
(oai_test) $ pip install openai --upgrade
(oai_test) $ pip install jupyter
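As a quick optional sanity check, you can confirm the library installed correctly. Prompt Caching is applied server-side, so no particular SDK version is required, but a recent one is sensible:

(oai_test) $ python -c "import openai; print(openai.__version__)"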
2/ Start Jupyter
Now type jupyter notebook into your command prompt. A Jupyter Notebook should open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after running the command; near the bottom, there will be a URL that you should copy and paste into your browser to launch the Jupyter Notebook.
Your URL will be different to mine, but it should look something like this:-
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69
The code
Prompt Caching is automatic, so you don’t have to change your existing code base. But recall that it only kicks in when the combined prompt, here the system and user messages plus the tool definitions, is longer than 1,024 tokens.
OpenAI recommends structuring your prompts so that any static information is at the beginning and dynamic content towards the end. This ties in nicely with the static data being in the system prompt and the dynamic data in the user prompt. You don’t have to do this, but it makes the most sense to do so.
So, let’s put all this together with a hypothetical example grounded in a realistic use case. In our scenario, we’ll model a smart home system where you can remotely request actions to be taken in or around your home. For example, you might like your smart home system to turn on your lights, heating system and so on while you’re away from the house.
Our code consists of two tools (functions) that the LLM can use. One does the actual switching on/off of a control device, and the other schedules such an action for a specified time.
After that, we have our system prompt, which clearly defines what the smart home system should be capable of and any rules/guidance it needs to perform its function.
Additionally, we have, in the first instance, a simple user prompt that requests the control system to turn on the house lights. We run this initial command and get a count of the total tokens in the prompts, the number of cached tokens and a few other data points.
After this initial run, we ask the control system to perform a different task, and once again, we get various token counts for that operation.
from openai import OpenAI
import os
import json
import time
api_key = "YOUR_API_KEY_GOES_HERE"
client = OpenAI(api_key=api_key)
# Define tools (functions)
tools = [
    {
        "type": "function",
        "function": {
            "name": "control_device",
            "description": "Control a smart home device, such as turning it on/off or changing settings.",
            "parameters": {
                "type": "object",
                "properties": {
                    "device_id": {
                        "type": "string",
                        "description": "The unique identifier of the device to control."
                    },
                    "action": {
                        "type": "string",
                        "description": "The action to perform (e.g., 'turn_on', 'turn_off', 'set_temperature')."
                    },
                    "value": {
                        "type": ["string", "number"],
                        "description": "Optional value for the action, such as temperature setting."
                    }
                },
                "required": ["device_id", "action"],
                "additionalProperties": False
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "set_schedule",
            "description": "Set a schedule for a smart home device to perform an action at a specified time.",
            "parameters": {
                "type": "object",
                "properties": {
                    "device_id": {
                        "type": "string",
                        "description": "The unique identifier of the device to schedule."
                    },
                    "action": {
                        "type": "string",
                        "description": "The action to perform (e.g., 'turn_on', 'turn_off')."
                    },
                    "schedule_time": {
                        "type": "string",
                        "description": "The time to perform the action, in ISO 8601 format or a natural language description."
                    }
                },
                "required": ["device_id", "action", "schedule_time"],
                "additionalProperties": False
            }
        }
    }
]
# System message with guidelines
# The system message is deliberately expanded beyond 1,024 tokens
# to make sure Prompt Caching is triggered
messages = [
    {
        "role": "system",
        "content": (
            "You are a smart home assistant that helps users control their smart home devices securely and efficiently. "
            "Your goals are to execute user commands, provide device statuses, and manage schedules while ensuring safety and privacy. "
            "Always confirm actions with the user before executing them, especially for critical devices like security systems or door locks. "
            "Maintain a friendly and professional tone, adapting to the user's level of technical expertise.\n\n"
            # Begin expansion
            "Important guidelines to follow:\n\n"
            "1. **User Privacy and Security**: Handle all personal and device information confidentially. "
            "Verify the user's identity if necessary before performing sensitive actions. Never share personal data with unauthorized parties. "
            "Ensure that all communications comply with data protection laws and regulations.\n\n"
            "2. **Confirmation Before Actions**: Always confirm the user's intent before executing actions that affect their devices. "
            "For example, if a user asks to unlock the front door, verify their identity and confirm the action to prevent unauthorized access.\n\n"
            "3. **Error Handling**: If an action cannot be completed, politely inform the user and suggest alternative solutions. "
            "Provide clear explanations for any issues, and guide the user through troubleshooting steps if appropriate.\n\n"
            "4. **Safety Measures**: Ensure that commands do not compromise safety. "
            "Avoid setting temperatures beyond safe limits, and alert the user if a requested action might be unsafe. "
            "For instance, if the user tries to turn off security cameras, remind them of potential security risks.\n\n"
            "5. **No Unauthorized Access**: Do not control devices without explicit user permission. "
            "Ensure that any scheduled tasks or automated routines are clearly communicated and approved by the user.\n\n"
            "6. **Clear Communication**: Use simple language and avoid technical jargon unless the user is familiar with it. "
            "Explain any technical terms if necessary, and ensure that instructions are easy to understand.\n\n"
            "7. **Compliance**: Adhere to all relevant laws, regulations, and company policies regarding smart home operations. "
            "Stay updated on changes to regulations that may affect how devices should be controlled or monitored.\n\n"
            "8. **Accurate Information**: Provide precise device statuses and avoid speculation. "
            "If unsure about a device's status, inform the user and suggest ways to verify or troubleshoot the issue.\n\n"
            "9. **Accessibility Considerations**: Be mindful of users with disabilities. "
            "Ensure that instructions and responses are accessible, and offer alternative interaction methods if needed.\n\n"
            "10. **Personalization**: Adapt to the user's preferences and prior interactions. "
            "Remember frequent commands and offer suggestions based on usage patterns, while respecting privacy settings.\n\n"
            "11. **Timeouts and Idle States**: If a session is idle for a prolonged period, securely end the session to protect user data. "
            "Notify the user when the session is about to expire and provide options to extend it if necessary.\n\n"
            "12. **Multi-User Environments**: Recognize when multiple users may be interacting with the system. "
            "Manage profiles separately to ensure personalized experiences and maintain privacy between users.\n\n"
            "13. **Energy Efficiency**: Promote energy-saving practices. "
            "If a user forgets to turn off devices, gently remind them or offer to automate energy-saving routines.\n\n"
            "14. **Emergency Protocols**: Be prepared to assist during emergencies. "
            "Provide quick access to emergency services if requested, and understand basic protocols for common emergencies.\n\n"
            "15. **Continuous Learning**: Stay updated with the latest device integrations and features. "
            "Inform users about new capabilities that may enhance their smart home experience.\n\n"
            "16. **Language and Cultural Sensitivity**: Be aware of cultural differences and language preferences. "
            "Support multiple languages if possible and be sensitive to cultural norms in communication.\n\n"
            "17. **Proactive Assistance**: Anticipate user needs by offering helpful suggestions. "
            "For example, if the weather forecast indicates rain, suggest closing windows or adjusting irrigation systems.\n\n"
            "18. **Logging and Monitoring**: Keep accurate logs of actions taken, while ensuring compliance with privacy policies. "
            "Use logs to help troubleshoot issues but never share log details with unauthorized parties.\n\n"
            "19. **Third-Party Integrations**: When interacting with third-party services, ensure secure connections and compliance with their terms of service. "
            "Inform users when third-party services are involved.\n\n"
            "20. **Disaster Recovery**: In case of system failures, have protocols in place to restore functionality quickly. "
            "Keep the user informed about outages and provide estimated resolution times.\n\n"
        )
    },
    {
        "role": "user",
        "content": "Hi, could you please turn on the living room lights?"
    }
]
# Function to run completion with the provided message history and tools
def completion_run(messages, tools):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        tools=tools,
        messages=messages,
        tool_choice="required"
    )
    usage_data = json.dumps(completion.to_dict(), indent=4)
    return usage_data

# Main function to handle the runs
def main(messages, tools):
    # Run 1: Initial query
    print("Run 1:")
    run1 = completion_run(messages, tools)
    print(run1)

    # Delay for 3 seconds
    time.sleep(3)

    # Append user_query2 to the message history
    user_query2 = {
        "role": "user",
        "content": "Actually, could you set the thermostat to 72 degrees at 6 PM every day?"
    }
    messages.append(user_query2)

    # Run 2: With appended query
    print("\nRun 2:")
    run2 = completion_run(messages, tools)
    print(run2)

# Run the main function
if __name__ == "__main__":
    main(messages, tools)
And our output is:-
Run 1:
{
    "id": "chatcmpl-AFePFIyWQtNJ4txIGcLbXZaZleEZv",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": null,
                "refusal": null,
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_m4V9sn2PY7X3EapH7ph1K8t9",
                        "function": {
                            "arguments": "{\"device_id\":\"living_room_lights\",\"action\":\"turn_on\"}",
                            "name": "control_device"
                        },
                        "type": "function"
                    }
                ]
            }
        }
    ],
    "created": 1728293605,
    "model": "gpt-4o-mini-2024-07-18",
    "object": "chat.completion",
    "system_fingerprint": "fp_f85bea6784",
    "usage": {
        "completion_tokens": 21,
        "prompt_tokens": 1070,
        "total_tokens": 1091,
        "completion_tokens_details": {
            "reasoning_tokens": 0
        },
        "prompt_tokens_details": {
            "cached_tokens": 0
        }
    }
}
Run 2:
{
    "id": "chatcmpl-AFePJwIczKSjJnvwed7wpyRI7gLWU",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": null,
                "refusal": null,
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_PjCse4kD4QJxYcFuZ7KlqJAc",
                        "function": {
                            "arguments": "{\"device_id\": \"living_room_lights\", \"action\": \"turn_on\"}",
                            "name": "control_device"
                        },
                        "type": "function"
                    },
                    {
                        "id": "call_GOr7qfGUPD0ZV9gAgUktyKj6",
                        "function": {
                            "arguments": "{\"device_id\": \"thermostat\", \"action\": \"set_temperature\", \"schedule_time\": \"2023-10-23T18:00:00\"}",
                            "name": "set_schedule"
                        },
                        "type": "function"
                    }
                ]
            }
        }
    ],
    "created": 1728293609,
    "model": "gpt-4o-mini-2024-07-18",
    "object": "chat.completion",
    "system_fingerprint": "fp_f85bea6784",
    "usage": {
        "completion_tokens": 75,
        "prompt_tokens": 1092,
        "total_tokens": 1167,
        "completion_tokens_details": {
            "reasoning_tokens": 0
        },
        "prompt_tokens_details": {
            "cached_tokens": 1024
        }
    }
}
We can see that in Run 1, the `cached_tokens` count is zero, which is to be expected since nothing had been cached yet. In Run 2, however, the `cached_tokens` count is 1024, which indicates that caching took place.
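Rather than eyeballing the raw JSON, you can pull these numbers out programmatically. Here’s a minimal sketch, assuming run2 holds the JSON string that completion_run returned for the second call:

usage = json.loads(run2)["usage"]
cached = usage["prompt_tokens_details"]["cached_tokens"]
print(f"{cached} of {usage['prompt_tokens']} prompt tokens were read from the cache")
# 1024 of 1092 prompt tokens were read from the cache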
Summary
Prompt Caching is a very useful new addition to OpenAI’s capabilities. It can cut application run times by reducing latency, and it lowers your token costs. So it’s important to monitor if and when it’s being applied, and to investigate when it isn’t but you think it should be.
Using code like the example above, you can monitor your system and intervene when Prompt Caching isn’t being applied. It would be fairly straightforward to send an automated message to yourself or to a team flagging a potential caching issue, as sketched below.
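As a sketch of that idea, here’s a hypothetical check_cache_usage helper. Both function names are my own inventions, and alert_team is just a placeholder you’d wire up to email, Slack or whatever alerting you use:

def check_cache_usage(usage, threshold=1024):
    # Flag runs where a cache-eligible prompt reports no cached tokens
    prompt_tokens = usage["prompt_tokens"]
    cached = usage["prompt_tokens_details"]["cached_tokens"]
    if prompt_tokens > threshold and cached == 0:
        alert_team(f"No cached tokens on a {prompt_tokens}-token prompt - worth investigating.")

def alert_team(message):
    # Placeholder: replace with your email/Slack/pager integration
    print(f"[CACHE ALERT] {message}")

check_cache_usage(json.loads(run2)["usage"])

Bear in mind that the first call after a cold start will legitimately report zero cached tokens, so in practice you’d only alert on repeat calls.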
That’s all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content.
I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.