Product problem considerations when building LLM based applications

Diego Oppenheimer

December 19, 2023

Categories:

news
knowledge

The Advent of LLM-Powered Applications

In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) represent a paradigm shift. These sophisticated models, which continue to showcase emergent capabilities, are reshaping how we build applications, offering capabilities ranging from text generation to complex problem-solving. However, their integration into product development comes with unique challenges. In this blog I will explore four critical problem areas we have been spending time on: Stability, Accuracy, Limited Developer Control, and Critical Application Concerns, offering some awareness for those embarking on the product-building journey as well as potential ways to address each. It is worth noting that this space is evolving faster than probably any technology in human history, so new methodologies and concerns pop up daily.

Stability: Ensuring Consistent and Reliable Outputs

By definition, LLMs are probabilistic, just like all other machine learning models. When integrating machine learning models into previous generations of applications, we usually had only a narrow task for them to complete (predict X, classify Y). LLMs are general task completers, meaning they can handle many different kinds of tasks (summarize, extract, generate text, etc.), but all of these tasks now become probabilistic in nature. We therefore must design our applications to handle this, even though users expect deterministic outcomes when interacting with them. Stability in LLMs refers to the consistency of their output: given the same input, an LLM might produce varying responses. This characteristic is particularly challenging in applications where consistency is paramount.

A notable example was observed recently at a financial services organization building an AI-powered chatbot. Customers making inquiries were receiving different responses to identical queries, leading to confusion and a decrease in user satisfaction, and ultimately to customers abandoning the virtual assistant altogether (we have all asked to go straight to a "representative", right?).

One way to address this stability issue is to take advantage of the fact that LLMs are generally good instruction followers: by communicating requirements and expectations explicitly, you can specify the desired behavior up front. Another approach is to create custom validators that check the output of an LLM for accuracy, reliability, and adherence to specific guidelines.
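To make this concrete, here is a minimal Python sketch of both ideas, an instruction-heavy prompt plus a custom validator, with `call_llm` standing in as a hypothetical placeholder for whichever model API you use; the specific rules in the validator are purely illustrative:

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder for your model call (OpenAI, Anthropic, an OSS model, etc.)."""
    raise NotImplementedError

# Spell out requirements explicitly -- LLMs are generally good instruction followers.
PROMPT_TEMPLATE = """You are a customer support assistant for a bank.
Answer the question below in exactly two sentences.
Do not speculate about fees or rates; if unsure, direct the user to a representative.

Question: {question}
"""

def validate_response(text: str) -> list[str]:
    """Custom validator: return a list of guideline violations (empty list = pass)."""
    problems = []
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) > 2:
        problems.append("response longer than two sentences")
    if re.search(r"\d+(\.\d+)?\s*%", text):
        problems.append("response quotes a specific rate")
    return problems

def answer(question: str) -> str:
    response = call_llm(PROMPT_TEMPLATE.format(question=question))
    if validate_response(response):
        # Fall back to a deterministic, safe answer instead of an unstable one.
        return "Let me connect you with a representative who can help with that."
    return response
```

The important design choice is that the application, not the model, decides what reaches the user: if the output drifts outside the specified behavior, the user gets a consistent fallback rather than a surprising answer.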

As we navigate the waters of stability, we encounter the equally crucial aspect of accuracy in LLM applications.

Accuracy: Navigating the Truth in a Sea of Data

Accuracy in LLMs concerns the correctness and reliability of the information they provide. These models, while powerful, are only as accurate as the data they're trained on, leading to potential misinformation. You may have heard this problem referred to as hallucination. Hallucinations are statements or pieces of text generated by an LLM that are not supported by the evidence provided to the model. They are often the result of an LLM being fed incomplete or inaccurate information, and they can be a source of confusion, misinformation, and loss of trust, particularly in workflows where users have high expectations that this will not happen. The issue becomes especially hairy when building high-risk applications in finance, legal, or healthcare.

Take the example of a health-advice chatbot trained on an extensive but outdated medical database: it provides inaccurate health recommendations, underscoring the importance of current and reliable data sources. Even worse, the LLM might hallucinate a medical fact that a user acts on.

Improving accuracy can involve adding more high-quality training data, as well as continuously updating the model so it stays current with the latest information and trends. It can also involve using proprietary data to improve model performance via fine-tuning. In most cases, though, when we build LLM-based applications we are consuming a third-party model (OpenAI, Anthropic, or one of the many OSS models available), meaning there is not much we can do about the training itself. We therefore need to rely on methodologies where all we control is the input into the model, in order to get some level of guarantee on the output. In this category there are a number of techniques (including using other machine learning models, and even other LLMs) to validate and "check" the work before returning information to the application or user, e.g., a provenance validator ensuring that any fact can be traced back to a specific document or piece of information.
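As a rough illustration of that last idea, a provenance validator can check that every sentence in an answer has meaningful overlap with the retrieved source documents and flag anything it cannot trace. The sketch below uses naive word overlap for readability; a production system would more likely use embeddings or an entailment model, and the threshold shown is an arbitrary assumption:

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def provenance_check(answer: str, sources: list[str], threshold: float = 0.5) -> list[str]:
    """Naive provenance validator: flag sentences whose words are not
    sufficiently covered by any of the supplied source documents."""
    source_tokens = [_tokens(doc) for doc in sources]
    unsupported = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        # Coverage of this sentence by the best-matching source document.
        best = max((len(words & st) / len(words) for st in source_tokens), default=0.0)
        if best < threshold:
            unsupported.append(sentence)
    return unsupported

# Any sentence returned here should be blocked, regenerated, or shown with a warning.
docs = ["The 2023 guideline recommends 150 minutes of moderate exercise per week."]
print(provenance_check(
    "Adults should exercise 150 minutes per week. Vitamin X cures colds.", docs))
```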

With a grasp on accuracy, we turn to the complexities of developer control in the realm of LLMs.

Limited Developer Control: Unlocking the Black Box of LLMs

Limited developer control refers to the challenge of understanding and manipulating the internal workings of LLMs. This "black box" nature can impede troubleshooting and refinement. In most cases you will be relying on a third-party model provider and accessing the model via an API, which means that for most purposes the input (the prompt) is the only developer-controllable component of interacting with an LLM.

This can become problematic if an LLM used for automated content creation produces unpredictable and at times inappropriate content.

Addressing this issue involves developing a set of validations on the output, plus a system that gives you the option to either block undesired outputs or inject new information into the prompt (the input) that guides the LLM's output to pass those validations.
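A rough sketch of that validate-then-block-or-reprompt loop, with `call_llm` and the banned-topics list as hypothetical stand-ins for your own model call and validation rules:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a third-party model call -- the prompt is all we control."""
    raise NotImplementedError

BANNED_TOPICS = ["violence", "medical advice"]  # illustrative policy, not exhaustive

def violations(text: str) -> list[str]:
    """Output validation: return the policy rules the draft breaks."""
    return [topic for topic in BANNED_TOPICS if topic in text.lower()]

def generate(prompt: str, max_attempts: int = 3) -> str | None:
    """Validate each draft; on failure, inject the feedback into the prompt and retry.
    Returning None lets the caller block the response entirely."""
    current_prompt = prompt
    for _ in range(max_attempts):
        draft = call_llm(current_prompt)
        problems = violations(draft)
        if not problems:
            return draft
        current_prompt = (
            f"{prompt}\n\nYour previous draft violated these guidelines: "
            f"{', '.join(problems)}. Rewrite the content so it avoids them."
        )
    return None  # block: no acceptable output within the retry budget
```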

The final piece of the puzzle lies in addressing the application-specific concerns of LLMs.

Critical Application Concerns: Safeguarding High-Stakes Use Cases

Critical application concerns revolve around the use of LLMs in high-risk scenarios where errors can have severe consequences, such as healthcare or finance.

An LLM used for financial recommendations could produce inaccurate predictions, leading to significant investment losses. This situation illustrates the need for extreme caution in such applications.

Mitigating risks in these scenarios requires comprehensive testing and validation procedures, as well as the development of frameworks and guidelines to govern LLM use.
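One concrete shape such testing can take is a golden set of high-stakes prompts that runs against every model or prompt change. The sketch below is pytest-style; the cases and the `call_llm` placeholder are purely illustrative assumptions:

```python
import pytest

# Hypothetical golden set for a financial-recommendation assistant:
# prompts paired with hedging language the response must contain.
# In practice this set grows with every incident and review.
GOLDEN_CASES = [
    ("Should I put my entire savings into a single stock?", "diversif"),
    ("Can you guarantee a 20% annual return?", "cannot guarantee"),
]

def call_llm(prompt: str) -> str:
    """Placeholder for the production model call under test."""
    raise NotImplementedError

@pytest.mark.parametrize("prompt,required_phrase", GOLDEN_CASES)
def test_high_stakes_responses(prompt, required_phrase):
    response = call_llm(prompt).lower()
    # Every high-stakes answer must carry the expected caution before
    # a model or prompt change is allowed into production.
    assert required_phrase in response
```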

Charting a Course Through the LLM Landscape

As we journey through the world of LLM-powered applications, it becomes clear that while they offer transformative potential, their integration into products is a path laden with challenges. Addressing issues of stability, accuracy, developer control, and application-specific concerns requires a combination of innovative solutions, rigorous testing, and ongoing checks and balances. By navigating these waters with care and expertise, we can harness the full power of LLMs to create products that are not only revolutionary but also reliable and responsible.

Join our community discord and mailing list!

About the author:

Diego Oppenheimer is a serial entrepreneur, product developer and investor with an extensive background in all things data. Currently, he is a Managing Partner at Factory, a venture fund specialized in AI investments, as well as a co-founder at Guardrails AI. Previously he was an executive vice president at DataRobot, founder and CEO at Algorithmia (acquired by DataRobot), and shipped some of Microsoft's most used data analysis products, including Excel, Power BI and SQL Server.

Diego is active in AI/ML communities as a founding member and strategic advisor for the AI Infrastructure Alliance and MLOps.Community, and works with leaders to define AI industry standards and best practices. Diego holds a Bachelor's degree in Information Systems and a Master's degree in Business Intelligence and Data Analytics from Carnegie Mellon University.

Tags:

opinion
