Streaming
In production systems with user interaction, streaming output from LLMs greatly improves the user experience. Streaming lets you build real-time systems that minimize the time to first token (TTFT) instead of waiting for the entire response to complete before proceeding.
Guardrails natively supports validation of streaming output, with both synchronous and asynchronous approaches.
from rich import print
import guardrails as gd
import litellm
from IPython.display import clear_output
import time
To stream with a Guard, set the 'stream' parameter to 'True' when calling it.
from guardrails.hub import CompetitorCheck
prompt = "Tell me about the Apple iPhone"
guard = gd.Guard().use(CompetitorCheck, ["Apple"])
fragment_generator = guard(
    litellm.completion,
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        # Use the prompt defined above so the competitor check has something to flag.
        {"role": "user", "content": prompt},
    ],
    max_tokens=1024,
    temperature=0,
    stream=True,
)
for op in fragment_generator:
    clear_output(wait=True)
    print(op)
    time.sleep(0.5)
With streaming, not only do chunks from the LLM arrive as they are generated, but validation results can stream in real time as well.
To do this, validators specify a chunk strategy. By default, validators wait until they have accumulated a sentence's worth of content from the LLM before running validation. Once they've run validation, they emit that result in real time.
In practice, this means that you do not have to wait until the LLM has finished outputting tokens to access validation results, which helps you create smoother and faster user experiences. It also means that validation can run only on individual sentences, instead of the entire accumulated response, which helps save on costs for validators that require expensive inference.
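The sentence-level chunking described above can be pictured with a small standalone sketch. It does not use Guardrails itself and is not the library's implementation: the `validate_by_sentence` function, the regex used as a sentence boundary, and the toy `no_competitor` validator are all assumptions made for illustration.

```python
import re
from typing import Callable, Iterable, Iterator, Tuple

# Assumed sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def validate_by_sentence(
    chunks: Iterable[str],
    validate: Callable[[str], bool],
) -> Iterator[Tuple[str, bool]]:
    """Yield (sentence, passed) as soon as each full sentence has arrived."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence, validate(sentence)
        # The last part may still be mid-sentence, so keep buffering it.
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer, validate(buffer)


def no_competitor(sentence: str) -> bool:
    # Toy validator: fail any sentence that mentions "Apple".
    return "Apple" not in sentence


stream = ["Phones are ", "popular. Apple ma", "kes the iPhone. ", "Others exist"]
results = list(validate_by_sentence(stream, no_competitor))
for sentence, passed in results:
    print("PASS" if passed else "FAIL", sentence)
```

Each result is emitted while later chunks are still arriving, and the validator only ever sees one sentence at a time rather than the whole accumulated response.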
To access these validation results, use the error_spans_in_output helper method on Guard. It returns an up-to-date list of all ranges of text in the output so far that have failed validation.
error_spans = guard.error_spans_in_output()
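One way to use such ranges is to mark the failing text before showing it to a user. The sketch below is self-contained and uses a hypothetical `Span` dataclass as a stand-in; it assumes each span carries `start` and `end` character offsets plus a `reason`, and that spans do not overlap. Check the Guardrails API reference for the actual type returned by error_spans_in_output before adapting it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    # Hypothetical stand-in for a failed-validation range (assumed fields).
    start: int
    end: int
    reason: str


def mark_failures(text: str, spans: List[Span]) -> str:
    """Wrap each failed range of `text` in [[...]] markers."""
    out = []
    cursor = 0
    # Walk the spans left to right, copying the untouched text between them.
    for span in sorted(spans, key=lambda s: s.start):
        out.append(text[cursor:span.start])
        out.append("[[" + text[span.start:span.end] + "]]")
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


text = "Phones are popular. Apple makes the iPhone."
spans = [Span(start=20, end=43, reason="Mentions a competitor")]
marked = mark_failures(text, spans)
print(marked)
```

Because the list of spans grows as the stream progresses, a helper like this could be re-run on each new fragment to keep the displayed text in sync with the latest validation results.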
Async Streaming
When many network calls run concurrently (for example, several LLM calls at once), an asynchronous LLM client can be beneficial. Guardrails also natively supports asynchronous streaming calls.
Learn more about async streaming here.
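import asyncio

# Awaiting the call and iterating with 'async for' requires the async guard
# and litellm's async completion function.
guard = gd.AsyncGuard()
fragment_generator = await guard(
    litellm.acompletion,
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the streaming API of guardrails."},
    ],
    max_tokens=1024,
    temperature=0,
    stream=True,
)
async for op in fragment_generator:
    clear_output(wait=True)
    print(op)
    # Non-blocking pause so other tasks on the event loop keep running.
    await asyncio.sleep(0.5)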
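The benefit of an asynchronous client shows up when several streams are consumed at once. The sketch below illustrates that pattern with plain asyncio only: `fake_stream` is a hypothetical stand-in for a streaming LLM call, and no Guardrails or litellm APIs are involved.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(name: str, n_chunks: int, delay: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM call."""
    for i in range(n_chunks):
        await asyncio.sleep(delay)  # simulates network latency per chunk
        yield f"{name}-{i}"


async def collect(stream: AsyncIterator[str]) -> List[str]:
    # Drain one stream into a list of its chunks.
    return [chunk async for chunk in stream]


async def main() -> List[List[str]]:
    # Both streams are consumed concurrently on a single event loop;
    # gather returns the results in the order the tasks were passed in.
    return await asyncio.gather(
        collect(fake_stream("a", 3, 0.01)),
        collect(fake_stream("b", 2, 0.01)),
    )


outputs = asyncio.run(main())
print(outputs)
```

In a notebook, where an event loop is already running, `await main()` would replace the `asyncio.run(main())` call.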