How to Generate Synthetic Structured Data with Cohere

Safeer Mohiuddin

October 26, 2023

Introduction

Generating synthetic structured data is critical for training your AI models. But getting AI engines to produce structured data isn’t always easy. This article will show how you can easily use Cohere and Guardrails to produce synthetic structured data.

What is synthetic structured data?

Structured data is any data put into a format that machines can easily parse, manage, and analyze.

Structured data abounds in traditional applications. Database tables, XML files, and JSON files are common examples of structured data software developers use daily. Structured data is also a good way for models in a complex AI-based application to exchange data.
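As a small illustration of why structured formats are easy for machines to work with, the sketch below parses a JSON record with Python's standard json module. The record and its field names are made up for this example.

```python
import json

# A hypothetical order record, serialized as JSON text.
raw = '{"user_id": 1, "user_name": "John Mcdonald", "num_orders": 6}'

# json.loads turns the text into a Python dict with typed values.
record = json.loads(raw)

print(record["user_name"])       # a str
print(record["num_orders"] + 1)  # an int, so arithmetic works directly
```

Because the fields arrive already typed and named, no custom text parsing is needed before the data can be used.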

Synthetic data is fictional data that is statistically or mathematically similar to real data. Companies, particularly in fields such as finance and healthcare, are increasingly using synthetic data to train their AI models. Recent research shows that, on top of being cheaper to produce, synthetic data may create AI models that are just as good as, if not better than, models trained on real-world data.
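To make "statistically similar" concrete, here is a minimal sketch that uses only the standard library. It assumes a made-up sample of real order counts and draws fictional values from a normal distribution fitted to that sample. Real synthetic-data pipelines use far richer models, and every name here is hypothetical.

```python
import random
import statistics

# Hypothetical "real" order counts; not taken from any actual dataset.
real_orders = [3, 7, 12, 5, 9, 14, 6, 8, 11, 4]

mean = statistics.mean(real_orders)
stdev = statistics.stdev(real_orders)

def synthetic_orders(n, seed=0):
    """Draw n fictional order counts from a normal fit of the real sample."""
    rng = random.Random(seed)
    # Round to whole orders and clamp at zero, since counts cannot be negative.
    return [max(0, round(rng.gauss(mean, stdev))) for _ in range(n)]

fake_orders = synthetic_orders(1000)

# Compare the real and synthetic means; they should be close.
print(round(mean, 2), round(statistics.mean(fake_orders), 2))
```

None of the generated values belongs to a real customer, yet their overall distribution resembles the original sample.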

Generating synthetic structured data with Cohere and Guardrails AI

Generating synthetic structured data using today’s AI engines can be a challenge. By default, a Large Language Model (LLM) outputs unstructured text. So, how do you coach it to return properly structured synthetic text?

Using Cohere and Guardrails together, you can generate synthetic structured text with high realism and accuracy. Cohere’s Command model provides generation capabilities for structured data. Guardrails AI adds structural and quality assurances that refine Cohere’s output, resulting in more accurate, realistic data.

Let’s discuss how the Cohere and Guardrails AI portions work, then see how they work even better together.

How Cohere works

Cohere offers access to cutting-edge LLMs through a simple API. It provides a variety of API endpoints to use depending on your use case, including Chat, Generate, Embed, Rerank, Semantic Search, and Classify.

Cohere’s models power a variety of use cases, including running interactive chatbots, generating text for product descriptions or blog articles, moderating content, and recognizing intent. Companies can leverage Cohere’s LLMs in their apps without needing to train their own AI models from the ground up.

How Guardrails AI works

Guardrails AI is a Python package you can use to enhance the outputs of LLMs by adding structural, type, and quality assurance checks.

Guardrails AI leverages Pydantic, one of the industry's most widely used data validation libraries. Guardrails checks for defects such as bias in generated text and bugs in generated code. Guardrails also enforces structural and type guarantees (e.g., returning proper JSON formatting) and takes corrective actions, such as prompt submission retries, when validation fails.

Creating a Guardrail for LLM output is a three-step process:

  1. Create a data structure spec. You can create a spec either using a Pydantic model or using RAIL. RAIL (Reliable AI Markup Language) is a language-agnostic, human-readable XML dialect for defining the expected structure and type from the LLM, as well as any validators and corrective actions. For the example below, we will use Pydantic.

  2. Create a guard from the spec. The Python gd.Guard object serves as the basic executable unit for calls to the LLM.

  3. Wrap the LLM call with the guard. The guard combines the spec and the call to the LLM in order to validate, structure, and correct its outputs.
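The pattern behind these three steps can be sketched in plain Python. The code below is not Guardrails AI's implementation or API. It is a toy illustration, with a hypothetical fake_llm stub standing in for a real model, of how a wrapper can validate output against a spec and retry when validation fails.

```python
import json

# Step 1: a toy "spec" mapping each required field to its expected type.
SPEC = {"user_name": str, "num_orders": int}

def validate(text):
    """Return the parsed dict if it matches SPEC, otherwise None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in SPEC.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data

# Steps 2 and 3: a toy "guard" that wraps any callable taking a prompt string,
# validates what comes back, and retries when validation fails.
def guarded_call(llm, prompt, max_retries=2):
    raw = None
    for _ in range(max_retries + 1):
        raw = llm(prompt)
        validated = validate(raw)
        if validated is not None:
            return raw, validated
    return raw, None

# Hypothetical stub LLM: replies with prose first, then with valid JSON.
replies = iter([
    "Sure! Here is your data.",
    '{"user_name": "Jane Doe", "num_orders": 3}',
])

def fake_llm(prompt):
    return next(replies)

raw_output, validated_output = guarded_call(fake_llm, "Generate one user order as JSON.")
print(validated_output)
```

The first reply fails validation because it is not JSON, so the wrapper calls the stub again and accepts the second reply.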

Walkthrough: Generating structured data with Cohere and Guardrails AI

Now, let’s see how to use these two technologies to create highly realistic synthetic structured data.

Prerequisites

  • Python 3 installed on a dev machine, with the latest version of pip
  • Cohere account and a Cohere API key

Generating data with Cohere

First, to get started with Cohere, sign up and generate an API key.
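One common way to keep the key out of your source code is to read it from an environment variable. The sketch below assumes a variable named COHERE_API_KEY, which is a naming choice made for this example rather than a requirement.

```python
import os

def load_api_key(var_name="COHERE_API_KEY"):
    """Read the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The value returned by load_api_key() can then be passed wherever this walkthrough shows the '<API_KEY>' placeholder.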

We'll use Cohere's Generate endpoint (co.generate) to generate realistic text conditioned on a given input. Cohere supports a REST API that developers can call from any programming language. In this walkthrough, we'll use Cohere's official Python SDK.

Start by installing the Python library for Cohere on your dev box:

pip install cohere

Next, write a simple Cohere application to generate structured data in JSON format:

import cohere

co = cohere.Client(api_key='<API_KEY>')

response = co.generate(
    prompt='Generate different structured data and render it in JSON format',
    model='command',
    max_tokens=300,
    temperature=0.9,
    k=0,
    stop_sequences=[],
    return_likelihoods='NONE'
)

print(response)
print('Prediction: {}'.format(response.generations[0].text))

Let's step through this line by line to understand what's going on.

  1. import cohere: Standard Python import call.

  2. co = cohere.Client(api_key='<API_KEY>'): Create a client object named co to interact with the Cohere API. It uses the provided API key to authenticate the requests. The API key is essential for accessing the Cohere services.

    (Note: Remember never to check API secrets directly into source code or leave them in Notebooks! Use a secrets vault, such as AWS Secrets Manager, for secure storage and retrieval of secrets.)

  3. response = co.generate(...): This line sends a text generation request to the Cohere API using the generate method. It provides several parameters as input for the text generation task:

    • model='command': Specifies the type of language model to use for text generation. In this case, it uses the "command" model.
    • prompt='Generate different structured data and render it in JSON format': Contains the input text prompt that serves as a starting point for text generation. The language model will generate text based on this prompt.
    • max_tokens=300: Sets the maximum number of tokens (words or subwords) the generated text should contain. This is used to limit the length of the generated response. Longer responses take more computational power to process, which increases application costs.
    • temperature=0.9: Controls the randomness of the generated text. Higher values (e.g., 1.0) make the output more diverse, while lower values (e.g., 0.2) make it more deterministic.
    • k=0: The number of top-k candidates to sample from during text generation. Setting it to 0 means it will consider all candidates.
    • stop_sequences=[]: A list of strings that will stop the text generation if encountered. Because the list is empty here, generation continues until the model finishes its response or reaches the max_tokens limit.
    • return_likelihoods='NONE': Specifies whether to return the likelihoods of each generated candidate. In this case, it is set to 'NONE', meaning it won't return likelihoods.
  4. print(response): This line prints the entire response object returned by the Cohere API after text generation. The response may contain various information, such as the generated text, the likelihoods, and other metadata.

  5. print('Prediction: {}'.format(response.generations[0].text)): This line prints the generated text obtained from the response. The response.generations attribute is a list of generated text candidates, and response.generations[0].text retrieves the first candidate's text. The format function includes the generated text in the output string, labeled as "Prediction."
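To build intuition for the temperature and k parameters described in step 3, the sketch below applies them to a made-up set of token scores. It illustrates temperature scaling and top-k filtering in general. It is not Cohere's sampling code, and the tokens and scores are invented.

```python
import math

def token_probabilities(logits, temperature=1.0, k=0):
    """Turn raw token scores into sampling probabilities.

    k=0 keeps every candidate; k>0 keeps only the k highest-scoring tokens.
    temperature must be greater than zero in this sketch.
    """
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if k > 0:
        items = items[:k]
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [score / temperature for _, score in items]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return {token: w / total for (token, _), w in zip(items, weights)}

# Invented scores for three candidate next tokens.
logits = {"{": 2.0, "[": 1.0, "Here": 0.5}

print(token_probabilities(logits, temperature=0.9))
print(token_probabilities(logits, temperature=0.2))
print(token_probabilities(logits, temperature=0.9, k=2))
```

At the lower temperature, nearly all of the probability lands on the top-scoring token, which is why low values give more deterministic output.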

Adding guards with Guardrails AI

Now, let’s improve the quality of Cohere’s response by adding a guard. Install Guardrails AI and other required dependencies locally using pip:

pip install guardrails-ai pydantic openai rich

Create a new Jupyter Notebook cell that imports the Guardrails AI library:

import guardrails as gd

We will define structured data for an online order that has the following attributes:

  1. Each user should have a first and a last name.
  2. Each user should have between 0 and 50 orders.
  3. The dataset should contain exactly 10 rows.
  4. The output should be in JSON format.

To accomplish this, we’ll create a spec as a Pydantic model:

from pydantic import BaseModel, Field
from guardrails.validators import ValidLength, TwoWords, ValidRange
from typing import List

prompt = """
Generate a dataset of fake user orders. Each row of the dataset should be valid. The format should not be a list, it should be a JSON object.
${gr.complete_json_suffix}
an example of output may look like this:
{
    "user_orders": [{
        "user_id": 1,
        "user_name": "John Mcdonald",
        "num_orders": 6
    }]
}
"""

class Order(BaseModel):
    user_id: int = Field(description="The user's id.", validators=[("1-indexed", "noop")])
    user_name: str = Field(
        description="The user's first name and last name",
        validators=[TwoWords()]
    )
    num_orders: int = Field(
        description="The number of orders the user has placed",
        validators=[ValidRange(0, 50)]
    )

class Orders(BaseModel):
    user_orders: List[Order] = Field(
        description="Generate a list of users and how many orders they have placed in the past.",
        validators=[ValidLength(10, 10, on_fail="noop")]
    )

The code above defines two Pydantic models: an Order model that defines the format of each order, and an Orders model that holds a list of Order objects. The validators argument on each field defines the constraints that the LLM's output must satisfy for Guardrails to accept it.
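To see what these constraints amount to, the sketch below restates them as plain Python checks. These functions are illustrative stand-ins written for this example, not the Guardrails implementations of TwoWords, ValidRange, or ValidLength, and the inclusive bounds are an assumption of the sketch.

```python
def is_two_words(value):
    """True when the string is exactly two whitespace-separated words."""
    return len(value.split()) == 2

def is_in_range(value, low, high):
    """True when low <= value <= high."""
    return low <= value <= high

def has_valid_length(items, min_len, max_len):
    """True when the list length is between min_len and max_len inclusive."""
    return min_len <= len(items) <= max_len

row = {"user_id": 1, "user_name": "John Mcdonald", "num_orders": 6}

print(is_two_words(row["user_name"]))         # True
print(is_in_range(row["num_orders"], 0, 50))  # True
print(has_valid_length([row], 10, 10))        # False: one row is not ten
```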

Now, we can create a Guard from our Pydantic model:

guard = gd.Guard.from_pydantic(output_class=Orders, prompt=prompt)

Guardrails will generate a full prompt based on our Pydantic model. Note that it compiles an XML specification for the output and makes it part of the prompt.

Finally, let’s wrap our call to the LLM in Cohere with our guard:

raw_llm_response, validated_response = guard(
    co.generate,
    model="command",
    max_tokens=1024,
    temperature=0.3
)

Once again, let's break this down line by line:

  1. raw_llm_response: The raw output returned by Cohere's Command model, before Guardrails validates it.
  2. validated_response: The output after Guardrails has validated it and applied any corrective actions.
  3. co.generate: The method to call on Cohere to generate our synthetic structured data.
  4. model="command": Specifies that we use Cohere's Command model.
  5. max_tokens=1024: Again, we use max_tokens to limit response length and cap computing resources.
  6. temperature=0.3: The degree of randomness. Because we want structured output, we use a low value of 0.3 so that the result is mostly deterministic.

The result in validated_response is the JSON data generated by the model and validated by Guardrails:

{
  "user_orders": [
    { "user_id": 1, "user_name": "John Smith", "num_orders": 10 },
    { "user_id": 2, "user_name": "Jane Doe", "num_orders": 20 },
    { "user_id": 3, "user_name": "Bob Jones", "num_orders": 30 },
    { "user_id": 4, "user_name": "Alice Smith", "num_orders": 40 },
    { "user_id": 5, "user_name": "John Doe", "num_orders": 50 },
    { "user_id": 6, "user_name": "Jane Jones", "num_orders": 0 },
    { "user_id": 7, "user_name": "Bob Smith", "num_orders": 10 },
    { "user_id": 8, "user_name": "Alice Doe", "num_orders": 20 },
    { "user_id": 9, "user_name": "John Jones", "num_orders": 30 },
    { "user_id": 10, "user_name": "Jane Smith", "num_orders": 40 }
  ]
}
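Once you have validated output, you will typically want it in a form your training pipeline can read. As one possibility, the sketch below flattens a dict shaped like the output above into CSV text using the standard library. It assumes the validated output is a Python dict; the two-row sample and the to_csv helper are illustrative and not part of Guardrails or Cohere.

```python
import csv
import io

# A shortened stand-in for the validated output shown above.
validated = {
    "user_orders": [
        {"user_id": 1, "user_name": "John Smith", "num_orders": 10},
        {"user_id": 2, "user_name": "Jane Doe", "num_orders": 20},
    ]
}

def to_csv(data):
    """Render the user_orders rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["user_id", "user_name", "num_orders"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(data["user_orders"])
    return buffer.getvalue()

csv_text = to_csv(validated)
print(csv_text)
```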

Guardrails logs the full history of calls it makes to the LLM. You can see this history in Python by running:

print(guard.state.most_recent_call.tree)

You can use this information for:

  • Debugging and pinpointing issues. Use the full sequence of LLM calls to identify the source of errors or unexpected behavior in the generated output.
  • Understanding model behavior. By reviewing the full history, developers can better understand how the LLM interprets and responds to various input types.
  • Fine-tuning and parameter optimization. The history of calls provides insights into how different prompts and parameters affect the model's responses. This information can be invaluable when fine-tuning the model or optimizing parameters to achieve desired outcomes.
  • Version control and collaboration. Developers can track changes and experiment with different prompt variations over time. This is essential for version control and collaboration among team members on the same project.
  • Context preservation. Examining past calls preserves the context of previous interactions with the model. This context is vital when dealing with conversations or dialogue-based systems, where the model's responses depend on preceding prompts.

Conclusion

Cohere combined with Guardrails AI is a novel and groundbreaking approach to data generation. In this article, we’ve introduced both technologies and shown how you can leverage them with a simple Python script to create your own synthetic structured data. By harnessing the power of Large Language Models, you can effortlessly generate diverse and contextually relevant structured data with just a few lines of code.
