How to validate LLM responses continuously in real time
January 17, 2024
Introduction
No one likes looking at a blank white page on the Internet. One of the allures of Large Language Models (LLMs) like ChatGPT is that they provide output to users in real-time. Problem is, the LLM's first response may not be its best response.
In this article, I'll show how Guardrails AI provides the best of both worlds with its support for ChatGPT's streaming capabilities in its completion and chat completion models. By coupling streaming responses with Guardrails' advanced verification logic, you can provide users with real-time responses that are both fast and accurate.
How streaming works in ChatGPT
By default, calls you make to ChatGPT's completion and chat completion APIs are batched. This means they won't return any output to the user until ChatGPT has generated the full response.
This isn't a great user experience. When using ChatGPT directly, users expect the engine will begin issuing a response within a few seconds at most.
Fortunately, you can get the same real-time behavior in your own applications by setting the stream parameter to True in your ChatGPT API calls. For example, you can stream a call to the chat.completions.create method like so:
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

completion = client.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True  # this time, we set stream=True
)
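With stream=True, the call returns an iterator of chunks rather than a single response. Here's a minimal sketch of consuming it with the OpenAI v1 Python client (the delta.content field can be None on the final chunk, so we guard against that):

# Iterate over the streamed chunks and print tokens as they arrive.
for chunk in completion:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)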
What's the catch? According to OpenAI themselves, streaming makes it harder to validate the output from the LLM. That can be a major stumbling block when accuracy is of utmost importance in user interactions.
Guardrails AI is the leading framework for crafting custom validation and orchestration of LLM responses to ensure maximum accuracy. And now, with our latest release, we support real-time validation for streaming responses, plugging this critical gap in streaming LLM usage.
Guardrails AI streaming under the hood
How does Guardrails AI marry accuracy with real-time delivery?
To use Guardrails, developers define a specification in Pydantic or RAIL that specifies the expected format of a response from an LLM. For example, a developer using AI to generate synthetic structured data can specify that a field returned in JSON format from the LLM must be an integer within a specified range.
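For instance, a minimal Pydantic spec for that integer-range case might look like the sketch below; the model and field names here are illustrative, not part of any particular application:

from pydantic import BaseModel, Field
from guardrails.validators import ValidRange

class Record(BaseModel):
    # The LLM must return an integer between 1 and 10 for this field;
    # out-of-range values trigger the validator's on_fail behavior.
    score: int = Field(
        description="Quality score for the record",
        validators=[ValidRange(min=1, max=10, on_fail="fix")],
    )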
Guardrails generates a prompt based on the developer's inputs tailored to return the specified format. It then wraps the call to the LLM, waits for the response, and then filters the output to ensure it conforms to the provided spec. It may also re-ask the LLM with a more detailed prompt if the original response doesn't meet a specific quality threshold.
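Conceptually, and leaving out a great deal of detail, the batch flow looks something like the pseudocode below. The function and attribute names are purely illustrative, not Guardrails' internals:

# Illustrative pseudocode for the wrap -> validate -> re-ask loop described above.
def call_with_validation(llm_call, spec, prompt, max_reasks=2):
    for _ in range(max_reasks + 1):
        raw_output = llm_call(prompt)        # call the LLM with the spec-tailored prompt
        result = spec.validate(raw_output)   # filter/check the output against the spec
        if result.passed:
            return result.validated_output
        prompt = result.reask_prompt         # re-ask with a more detailed prompt
    return result.validated_output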
Typically, Guardrails has to wait for the entire LLM response to arrive before it performs validation. When you enable streaming, Guardrails instead validates each valid fragment as the LLM returns it. For responses in structured formats like JSON, Guardrails defines a “valid fragment” as a chunk that lints as fully-formed JSON.
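One way to picture fragment detection is to accumulate streamed text and attempt to parse it as JSON each time a new chunk arrives, only handing the text off for validation once a parse succeeds. This is just an illustrative sketch of the idea, not Guardrails' actual implementation:

import json

def valid_fragments(chunks):
    """Yield the accumulated text each time it parses as fully-formed JSON."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            json.loads(buffer)    # does the text so far lint as complete JSON?
        except json.JSONDecodeError:
            continue              # not yet a valid fragment; keep accumulating
        yield buffer              # a valid fragment is ready for validation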
With a valid fragment in hand, Guardrails then performs sub-schema validation on structured responses, testing the response chunk against the appropriate chunk of the schema. For example, say you are asking an LLM to extract patient information from a doctor's description. The schema that Guardrails sends to the LLM might look like this:
<output>
    <string name="gender" description="Patient's gender"/>
    <integer name="age" description="Patient's age" format="valid-range: min=0 max=100"/>
    <list name="symptoms" description="Symptoms that the patient is currently experiencing. Each symptom should be classified into separate item in the list.">
        <object>
            <string name="symptom" description="Symptom that a patient is experiencing"/>
            <string name="affected_area" description="What part of the body the symptom is affecting" format="lower-case"/>
        </object>
    </list>
    <list name="current_meds" description="Medications the patient is currently taking and their response">
        <object>
            <string name="medication" description="Name of the medication the patient is taking" format="upper-case"/>
            <string name="response" description="How the patient is responding to the medication"/>
        </object>
    </list>
    <string name="miscellaneous" description="Any other information that is relevant to the patient's health; something that doesn't fit into the other categories." format="lower-case; one-line"/>
</output>
If Guardrails receives a chunk corresponding to a symptom, it'll use the specification for the symptoms element to validate it. Guardrails then runs the on_fail handler you specify to process any failed validations before sending the result on to the user.
For unstructured (text) responses, there is no schema, so Guardrails simply runs the validation logic directly on each chunk it receives.
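For text output, you can think of this as running your configured validators over every chunk as it streams in. Here's an illustrative sketch; the validator function is a stand-in for whatever validators you've configured, not a Guardrails API:

def validate_text_chunks(chunks, validators):
    """Run each validator directly on every streamed text chunk."""
    for chunk in chunks:
        for validator in validators:
            chunk = validator(chunk)   # each validator may fix or reject the chunk
        yield chunk

# Example: a trivial stand-in "validator" that lowercases each chunk.
lowercase = lambda text: text.lower()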
Advantages and limitations of Guardrails AI streaming validation
There are a number of reasons you might want to use streaming with Guardrails AI:
- Low latency. For longer outputs, you don't need to force the user to wait on the final generated response. They can see the raw and validated outputs as soon as they're ready.
- Broad support. Guardrails supports both JSON and string (text) outputs.
- Supports multiple on_fail behaviors. The current version of streaming supports the fix, refrain, filter, and noop on_fail behaviors.
- Works with OpenAI's completion and chat completion methods for both OpenAI v0.x and v1.x.
- Supports any arbitrary LLM provider. As long as the LLM follows instructions well and supports streaming its output, Guardrails will work with it.
However, keep in mind there are also a couple of drawbacks:
- No reask support. The on_fail="reask" behavior, which re-prompts the LLM when its output doesn't conform to the specification, isn't available in streaming mode yet.
- No async callback support.
- Streaming support is dependent on the LLM's ability to follow instructions. Specifically, if you prompt an LLM to return JSON, it should return just JSON with no additional extraneous free-form text.
We plan to support both reask and async in future versions, so stay tuned!
Guardrails AI streaming in action
Let's see how this works in action. We'll build on the example I cited earlier where we want to parse plain text descriptions of a doctor/patient consult and return them as structured JSON data. Once we have this data in structured format, we can more easily store, query, and mine intelligence from it to detect anomalies or improve patient care.
You can run the code in this Jupyter notebook to see it working first-hand.
Install Python prerequisites
You can run this from Jupyter or your Python command line:
! python -m pip install --upgrade pip
! pip install -U guardrails-ai openai cohere numpy nltk rich -q
Obtain an OpenAI API key
To obtain a free OpenAI key, create an account. When asked to choose between using ChatGPT and API, select API.
From there, select your account icon in the upper right corner and then select View API keys. Select Create new secret key to generate an API key.
Make sure to copy and save the value somewhere, as you will not be able to copy it again after this. (You can always generate a new API key if you lose the old one.)
Once you have your OpenAI key, set it as the value of the environment variable OPENAI_API_KEY:
import os

os.environ["OPENAI_API_KEY"] = "aaasecretaaa"
(Note: This is just an example - remember to never check in secrets to source code control. Instead, store all secrets in an encrypted file or secure secrets manager service in the cloud.)
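For example, one lightweight option is to keep the key in a local .env file that's excluded from source control and load it at runtime. This sketch assumes the python-dotenv package is installed:

import os
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from a local .env file into the environment

if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("OPENAI_API_KEY is not set")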
Run the batch example
Now let's see how Guardrails works normally in batch mode. First, we define a model for symptoms responses using Pydantic:
from pydantic import BaseModel, Field
from typing import List
from guardrails.validators import (
    ValidRange,
    UpperCase,
    LowerCase,
    OneLine,
)

prompt = """
Given the following doctor's notes about a patient, please extract a dictionary that contains the patient's information.

${doctors_notes}

${gr.complete_json_suffix_v2}
"""

doctors_notes = """152 y/o female with chronic macular rash to face and hair, worse in beard, eyebrows and nares.
The rash is itchy, flaky and slightly scaly. Moderate response to OTC steroid cream.
Patient has been using cream for 2 weeks and also suffers from diabetes."""

class Symptom(BaseModel):
    symptom: str = Field(description="Symptom that a patient is experiencing")
    affected_area: str = Field(
        description="What part of the body the symptom is affecting",
        validators=[
            LowerCase(on_fail="fix"),
        ],
    )

class Medication(BaseModel):
    medication: str = Field(
        description="Name of the medication the patient is taking",
        validators=[UpperCase(on_fail="fix")],
    )
    response: str = Field(description="How the patient is responding to the medication")

class PatientInfo(BaseModel):
    gender: str = Field(description="Patient's gender")
    age: int = Field(
        description="Patient's age",
        validators=[ValidRange(min=0, max=150, on_fail="fix")],
    )
    symptoms: List[Symptom] = Field(
        description="Symptoms that the patient is currently experiencing. Each symptom should be classified into separate item in the list."
    )
    current_meds: List[Medication] = Field(
        description="Medications the patient is currently taking and their response"
    )
    miscellaneous: str = Field(
        description="Any other information that is relevant to the patient's health; something that doesn't fit into the other categories.",
        validators=[LowerCase(on_fail="fix"), OneLine(on_fail="fix")],
    )
We're asking the LLM to scan the provided description and return the following data:
- The patient's gender and age, with age restricted to between 0 and 150
- A list of medications the patient is taking
- The symptoms the patient is experiencing
- Any other information that doesn't fit these parameters
We then create a guard object in Guardrails that instantiates this specification:
import guardrails as gd

guard = gd.Guard.from_pydantic(output_class=PatientInfo, prompt=prompt)
Finally, we wrap the OpenAI call with the guard object to verify the response once it's finished generating:
import openai

# Wrap the OpenAI API call with the `guard` object
raw_llm_output, validated_output, *rest = guard(
    openai.chat.completions.create,
    prompt_params={"doctors_notes": doctors_notes},
    max_tokens=1024,
    temperature=0.0,
)

# Print the validated output from the LLM
print(validated_output)
The guard call returns the raw LLM output, the validated output corrected by Guardrails AI, and any additional information, which we unpack above.
The LLM will eventually return a response like this:
{ "gender": "female", "age": 100, "symptoms": [ { "symptom": "chronic macular rash", "affected_area": "face" }, { "symptom": "itchy", "affected_area": "beard" }, { "symptom": "itchy", "affected_area": "eyebrows" }, { "symptom": "itchy", "affected_area": "nares" }, { "symptom": "flaky", "affected_area": "face" }, { "symptom": "flaky", "affected_area": "beard" }, { "symptom": "flaky", "affected_area": "eyebrows" }, { "symptom": "flaky", "affected_area": "nares" }, { "symptom": "slightly scaly", "affected_area": "face" }, { "symptom": "slightly scaly", "affected_area": "beard" }, { "symptom": "slightly scaly", "affected_area": "eyebrows" }, { "symptom": "slightly scaly", "affected_area": "nares" } ], "current_meds": [ { "medication": "OTC STEROID CREAM", "response": "moderate" } ], "miscellaneous": "patient also suffers from diabetes" }
Implementing streaming
If you run this yourself, you'll see that the program takes half a minute or so to return the full response. We can switch this to a stream example, however, with a small code change:
import time
from IPython.display import clear_output

fragment_generator = guard(
    openai.chat.completions.create,
    prompt_params={"doctors_notes": doctors_notes},
    max_tokens=1024,
    temperature=0,
    stream=True,
)

for op in fragment_generator:
    clear_output(wait=True)
    print(op)
    time.sleep(0.5)
You'll notice we've made two slight changes. First, we're sending OpenAI the stream=True parameter. We didn't include stream in our previous example, so it defaulted to False.
Second, the output from the guard object changes. It now returns a generator, fragment_generator, that yields each validated response chunk as Guardrails produces it. We print each chunk directly to the user: we clear the output at the top of the loop and then, when the next chunk is ready, re-print the entirety of the response we've received so far, pausing half a second between iterations.
If you run this, you'll see how the response JSON slowly fills in over time as the LLM returns more chunks and Guardrails validates them. You can see the difference in the batch vs. streaming behavior in this short video:
Conclusion
With the addition of streaming support, Guardrails AI lets you add safeguards to your interactive, AI-driven applications while still ensuring high-quality output. Continue exploring Guardrails' features to see how it can take your Generative AI products to the next level.