April 30, 2024

5 min read

Challenges of evaluating LLM apps


Software testing is usually straightforward, following established methods. But when it comes to testing Large Language Model (LLM) apps, things get tricky.

Unlike regular software, which we can check with standard methods and tools, testing LLM apps brings new challenges and complexities that most teams are ill-prepared to navigate.

When it comes to evaluating and testing applications built on top of LLMs, the traditional software QA playbook and tools fall short.

There are a lot of reasons for that:

  • Outputs are often qualitative
    • This makes it hard to objectively define what a good response is.
  • LLMs are highly unstable systems
    • Their behavior is non-deterministic and involves chance or probability. The same input doesn't always yield the same output.
  • LLMs are very general
    • The range of use cases people apply them to, and the inputs you see in production, is very long-tailed, making it difficult to get good coverage on what to test.
      • In production you face a wider distribution of inputs than you tested, so it is difficult to know upfront whether the application works well across a wide variety of cases.
  • Traditional ML metrics are not enough
    • LLMs cover a broad range of uses, while traditional ML focuses on specific tasks where the output is predetermined.
  • Non-technical people are the unsung heroes of LLM QA

Let's drill into these challenges in detail.

LLMs are very general

Testing LLM applications presents a unique set of challenges due to their reliance on natural language interaction.

Unlike traditional software, LLM apps have very few input constraints.

LLM apps cover an extensive range of use cases given their conversational nature. This means they can respond to virtually infinite variations of inputs, making it incredibly challenging to predict and verify their responses accurately. 

Imagine an AI-copilot designed to assist college students. It must be prepared to handle inquiries on a vast array of subjects, each with its own nuances and complexities. Moreover, the output it generates can manifest in countless forms, ranging from brief answers to extensive paragraphs. However, even a single error or inconsistency within a response can render the entire output unreliable.

This broad nature introduces uncertainty into the testing process, as it becomes challenging to predict and verify the myriad ways in which the app may respond to different inputs.

It is virtually impossible to account for all possible scenarios during testing. 

That doesn't mean testing shouldn't be done. The issue is finding the right coverage for your LLM app, knowing there is a trade-off: more coverage means more cost; less coverage means less cost.

💡
Be risk-focused and prioritize what to test
Look at a product and ask: What can go terribly wrong here? What kind of failure would cost me money? What kind of failure would cost my reputation? It is impossible to test everything. A good test strategy addresses the biggest risks first.
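
As a rough sketch of what that prioritization can look like in practice, here's a tiny Python example that scores hypothetical failure scenarios by likelihood and impact and orders them so the riskiest get tested first. The scenario names and scores are purely illustrative, not from any real product.

```python
# A minimal sketch of risk-based prioritization: score each failure scenario
# by how likely it is and how costly it would be, then test the highest-risk
# scenarios first. All names and numbers are illustrative placeholders.
scenarios = [
    {"name": "gives incorrect billing information", "likelihood": 3, "impact": 5},
    {"name": "leaks another user's data",           "likelihood": 1, "impact": 5},
    {"name": "uses an overly casual tone",          "likelihood": 4, "impact": 2},
    {"name": "answers an off-topic question",       "likelihood": 5, "impact": 1},
]

# Risk score = likelihood x impact (both on a 1-5 scale); sort descending.
for s in sorted(scenarios, key=lambda s: s["likelihood"] * s["impact"], reverse=True):
    print(f'{s["likelihood"] * s["impact"]:>2}  {s["name"]}')
```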

Traditional ML metrics don't work

Traditional machine learning focuses on a single task where the output is predetermined (e.g. predicting whether a transaction is fraudulent: it either is or isn't).

As a consequence, there are consolidated metrics, standardized across the community, that are efficient for analyzing the performance of single-task traditional ML.

But they do not offer much help in understanding whether your LLM app is working as intended.

They are also normally not accessible to folks who are not ML engineers.

Some established ML metrics might not directly apply to evaluating the performance of most LLM applications.

LLMs introduce a broader array of tasks that may lack clearly defined metrics for assessment.

💡
Compile a test suite that includes various use cases representing what your app is expected to do. Test your app against each case in the suite. By doing so, you'll gain a more comprehensive understanding of its performance by observing both successes and failures across different scenarios. 
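
As a rough illustration, a use-case suite can start as simple as a list of inputs paired with lightweight checks. In the Python sketch below, `ask_llm` is a stand-in for however your app calls the model, and the cases and checks are illustrative only.

```python
# A minimal sketch of a use-case test suite. `ask_llm` is a placeholder for
# however your app calls the model; the cases and checks are illustrative.
def ask_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to your LLM app.
    return "stubbed response"

test_suite = [
    # (use case, input, simple pass/fail check on the output)
    ("math help",   "What is the derivative of x^2?", lambda out: "2x" in out),
    ("summarizing", "Summarize this paragraph in one sentence: ...",
                    lambda out: len(out) < 200),
    ("refusal",     "Write my graded essay for me.",
                    lambda out: "can't" in out.lower() or "cannot" in out.lower()),
]

for name, prompt, check in test_suite:
    output = ask_llm(prompt)
    print(f'[{"PASS" if check(output) else "FAIL"}] {name}')
```

Running every case in the suite, rather than spot-checking a single prompt, is what surfaces the pattern of successes and failures across scenarios.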

LLMs are unstable systems

LLMs work differently from traditional software.

Their behavior is non-deterministic and involves chance or probability. The same input doesn't always yield the same output.

💡
LLMs are stochastic systems
Stochastic refers to a random or probabilistic process. It means involving chance or probability. A stochastic process is one whose behavior is non-deterministic in nature and evolves over time due to random fluctuations.

This unpredictability poses challenges for reproducibility. 

Quality assurance (QA) and product teams often struggle to replicate issues using the same input because the system's behavior is not consistent. 

Sometimes, problems arise seemingly out of nowhere, even when no changes have been made to the system.

Not to mention that LLM applications typically rely on external providers who may alter the system, further complicating matters and potentially causing previously functional aspects to malfunction.

💡
Don't assume things are still working just because they were fine yesterday and you haven't made any changes. LLM apps are inherently unstable, and even if you don't change anything, the provider's model can drift or update, throwing you into a complicated situation. Testing should be done continuously.
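
One way to put that into practice is a small regression script that re-runs a fixed set of prompts on a schedule and flags anything that stops meeting its expectation, even when you haven't shipped a change. A minimal sketch, where `ask_llm`, the prompts, and the expected substrings are all placeholder assumptions:

```python
# A minimal sketch of a recurring regression check: run it on a schedule
# (e.g. a daily cron job), not only at release time, to catch provider drift.
import datetime

def ask_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to your LLM app.
    return "stubbed response"

REGRESSION_CASES = {
    # name: (prompt, substring the answer is expected to contain)
    "pricing question": ("How much does the Pro plan cost?", "$"),
    "greeting":         ("Hi!", "help"),
}

failures = [name for name, (prompt, expected) in REGRESSION_CASES.items()
            if expected not in ask_llm(prompt)]

stamp = datetime.datetime.now().isoformat(timespec="seconds")
print(f"{stamp}: {len(failures)} failing case(s): {failures}")
```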

LLM QA falls mostly on non-technical collaborators

It is seldom an ML engineer or a developer who validates the performance of these applications.

This happens for a very simple reason. 

The outputs and the overall experience with the conversational interface are a matter of content validation, not functionality validation. In most cases you can't do that properly without expertise in the subject.

What we've seen growing in companies across different spaces building with LLMs is a conviction that the QA heroes of LLM apps are non-technical people with vertical domain knowledge in a certain field.

Bottom line is: 

Quality assurance in LLM applications should involve a cross-functional team, with both technical and non-technical collaborators working closely together.

But this doesn't mean that QA can be done right without method and rigor.

Domain experts in areas other than engineering are seldom oriented towards the processes and principles needed to build a good test strategy. Testers should understand the basics of how an LLM works in order to develop a systematic approach to testing these systems.

There are a number of skills non-technical QA folks should be empowered with to do good work ensuring the system behaves as it should.

Non-technical folks are the unsung heroes of QA. 

💡
Empower non-technical folks.
QA and testing of your LLM apps often fall to non-technical staff, not ML engineers or devs. Avoid writing tests in code. This creates silos and obscures understanding. Instead, aim for tests in which they can be involved and that everyone can understand.
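
One way to do that is to keep the test cases themselves out of code: domain experts maintain a spreadsheet of inputs and plain-language expectations, and a small harness runs it. The sketch below is an assumption-heavy illustration; the column names, example rows, and the `ask_llm` stub are all placeholders.

```python
# A minimal sketch of tests that non-technical collaborators can own.
# In practice the cases live in a file they edit (e.g. cases.csv);
# it's inlined here only so the sketch is self-contained.
import csv, io

CASES_CSV = """input,must_mention
What are your opening hours?,9am
Do you ship to Canada?,yes
"""

def ask_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to your LLM app.
    return "We are open from 9am to 5pm on weekdays."

for row in csv.DictReader(io.StringIO(CASES_CSV)):
    output = ask_llm(row["input"])
    status = "PASS" if row["must_mention"].lower() in output.lower() else "FAIL"
    print(f"[{status}] {row['input']}")
```

The point of the design is that adding or changing a test case never requires touching code, so the people with the domain knowledge can keep the suite current themselves.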

Outputs are often qualitative

LLM outputs are often qualitative, making it hard to objectively define what a good response is.

Teams are often navigating subjectivity in the evaluation of LLMs.

What may be deemed as a satisfactory response by one individual could be perceived differently by another.

There is a lack of a common language. This gap can be bridged by setting up objective criteria around the application, which also gives a clear roadmap for improving the model's performance.

💡
LLM testing with clear criteria helps teams share a common understanding of whether the application is robust enough to go live. Testing is a way to build consensus within the team on the performance of an LLM app.
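
A lightweight way to make those criteria explicit is a shared rubric: a handful of yes/no questions every reviewer answers about a response, turned into a score the team can set a go-live threshold on. A minimal sketch with illustrative criteria:

```python
# A minimal sketch of turning a subjective "is this response good?" into
# explicit, shared criteria. Each criterion is a yes/no question a reviewer
# (human or automated) answers about a response; the criteria are illustrative.
CRITERIA = [
    "Does it directly answer the user's question?",
    "Is it factually consistent with our documentation?",
    "Is the tone appropriate for our brand?",
    "Is it free of sensitive or private information?",
]

def score(answers: list[bool]) -> float:
    # Share of criteria met; the team can agree a threshold (e.g. 1.0 to go live).
    return sum(answers) / len(answers)

# Example: one reviewer's verdict on a single response, criterion by criterion.
print(score([True, True, False, True]))  # 0.75
```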

Conclusion

Testing Large Language Model (LLM) apps is tough because they give qualitative outputs, behave unpredictably, and cover a wide range of tasks.

Traditional QA methods don't work. To tackle this, focus on high-risk areas, create diverse test scenarios, and involve both tech and non-tech folks. Setting clear evaluation criteria helps everyone understand how well the app is doing and keeps things moving forward.

Written by
Rafael Pinheiro

Co-founder at Ottic