Software testing is usually straightforward, following established methods. But when it comes to testing Large Language Model (LLM) apps, things get tricky.
Unlike regular software, which we can check with standard methods and tools, testing LLM apps brings new challenges and complexities that most teams are ill-prepared to navigate.
When it comes to evaluating and testing applications built on top of LLMs, the traditional software QA playbook and tools fall short.
There are many reasons for this.
Let's drill into these challenges in detail.
Testing LLM applications presents a unique set of challenges due to their reliance on natural language interaction.
Unlike traditional software, LLM apps have very few input constraints.
LLM apps cover an extensive range of use cases given their conversational nature. This means they can respond to virtually infinite variations of inputs, making it incredibly challenging to predict and verify their responses accurately.
Imagine an AI-copilot designed to assist college students. It must be prepared to handle inquiries on a vast array of subjects, each with its own nuances and complexities. Moreover, the output it generates can manifest in countless forms, ranging from brief answers to extensive paragraphs. However, even a single error or inconsistency within a response can render the entire output unreliable.
This broad nature introduces uncertainty into the testing process, as it becomes challenging to predict and verify the myriad ways in which the app may respond to different inputs.
It is virtually impossible to account for all possible scenarios during testing.
That doesn't mean testing shouldn't be done. The challenge is to find the right level of coverage for your LLM app, knowing there is a trade-off: more coverage -> more cost; less coverage -> less cost.
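To make that trade-off concrete, here is a minimal sketch of a deliberately scoped test suite. It assumes a hypothetical `call_llm()` wrapper around your model and simple keyword checks as expectations; both are illustrative, not a prescribed approach.

```python
# Minimal sketch: a small, deliberately scoped test set for an LLM app.
# `call_llm` is a hypothetical wrapper around whatever model/provider you use.

TEST_CASES = [
    # (input prompt, substrings we expect somewhere in the answer)
    ("What is the deadline to drop a course?", ["deadline", "drop"]),
    ("Explain photosynthesis in one sentence.", ["light", "energy"]),
    ("Summarize the syllabus for BIO-101.", ["BIO-101"]),
    # Adding more cases increases coverage -- and cost, since every case
    # is at least one extra model call per test run.
]

def run_suite(call_llm):
    failures = []
    for prompt, expected_terms in TEST_CASES:
        answer = call_llm(prompt)
        if not all(term.lower() in answer.lower() for term in expected_terms):
            failures.append((prompt, answer))
    return failures
```

Every row added to a suite like this buys a bit more coverage at the price of at least one more model call per run, which is exactly the trade-off described above.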
Traditional machine learning is typically focused on a single task where the output is predetermined (e.g., predicting whether a transaction is fraudulent: the output either is or is not).
As a consequence, there are consolidated metrics that are standardized across the community and are effective for analyzing the performance of single-task traditional ML. But they do not offer much insight into whether your LLM app is working as intended.
They are also generally not accessible to people who are not ML engineers.
Some established ML metrics might not directly apply to evaluating the performance of most LLM applications.
LLMs introduce a broader array of tasks that may lack clearly defined metrics for assessment.
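As a rough illustration of the gap, compare a single-task metric like accuracy with the multi-criteria judgment an LLM answer usually needs. The function names and criteria structure below are illustrative assumptions, not a standard:

```python
# Accuracy summarizes a single-task classifier (e.g., fraud detection) in one number.
def accuracy(predictions, labels):
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

# An LLM answer has no single ground-truth label; it is usually judged
# against several criteria at once. `checks` maps a criterion name to a
# function answer -> bool (the criteria themselves are up to your team).
def score_llm_answer(answer, checks):
    return {name: check(answer) for name, check in checks.items()}
```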
LLMs work differently from traditional software.
Their behavior is non-deterministic and involves chance or probability: the same input doesn't always yield the same output.
This unpredictability poses challenges for reproducibility.
Quality assurance (QA) and product teams often struggle to replicate issues using the same input because the system's behavior is not consistent.
Sometimes, problems arise seemingly out of nowhere, even when no changes have been made to the system.
Not to mention that LLM applications typically rely on external providers who may change the underlying model, further complicating matters and potentially causing previously functional behavior to break.
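A simple way to surface this in practice is to replay the same input several times and compare the outputs. A minimal sketch, again assuming a hypothetical `call_llm()` wrapper; many providers expose a temperature setting that reduces, but rarely eliminates, this variance:

```python
# Sketch: replaying one prompt to observe non-deterministic behavior.
def replay(call_llm, prompt, runs=5):
    outputs = [call_llm(prompt) for _ in range(runs)]
    distinct = set(outputs)
    print(f"{len(distinct)} distinct outputs across {runs} runs")
    return distinct
```

Logging the exact input, output, and model version for every run is what makes issues like these reproducible later, especially when a provider-side change is the real culprit.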
It is seldom an ML engineer or a developer who validates the performance of these applications.
This happens for a very simple reason.
Validating the outputs and the experience of the conversational interface is largely a matter of content, not functionality. In most cases you can't do that properly without expertise in the subject.
What we've seen growing among companies across different spaces building with LLMs is a conviction that the QA heroes of LLM apps are non-technical people with vertical domain knowledge in a given field.
Bottom line is:
Quality assurance in LLM applications should involve a cross-functional team, with both technical and non-technical collaborators working closely together.
But this doesn't mean that QA can be done right without method and rigor.
Domain experts outside engineering are seldom oriented toward the processes and principles needed to build a good test strategy. Testers should understand the basics of how an LLM works in order to develop a systematic approach to testing these systems.
There is a set of skills non-technical QA should be empowered with to do good work ensuring the system behaves as it should.
Non-technical folks are the unsung heroes of QA.
LLM outputs are often qualitative, making it hard to objectively define what a good response is.
Teams often find themselves navigating subjectivity when evaluating LLMs.
What may be deemed as a satisfactory response by one individual could be perceived differently by another.
There is a lack of a common language, a gap that can be bridged by setting objective criteria around the application, which also gives a clear roadmap for improving the model's performance.
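One lightweight way to build that common language is a shared rubric that both technical and non-technical reviewers score against. Here is a minimal sketch with illustrative criteria; your own rubric should come from your domain experts:

```python
# Sketch: making "a good response" explicit as shared, named criteria.
# The criteria below are illustrative examples, not a standard.

RUBRIC = {
    "accuracy": "Facts match the source material or documentation.",
    "completeness": "Every part of the user's question is addressed.",
    "tone": "Clear, respectful, and appropriate for the audience.",
    "safety": "No harmful, biased, or policy-violating content.",
}

def review(scores):
    """scores: dict mapping each criterion to a 1-5 rating from a reviewer."""
    assert set(scores) == set(RUBRIC), "every criterion must be rated"
    return sum(scores.values()) / len(scores)
```

Scoring against named criteria turns "I don't like this answer" into a concrete signal the whole team can act on.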
Testing LLM apps is tough because they give qualitative outputs, behave unpredictably, and cover a wide range of tasks.
Traditional QA methods don't work. To tackle this, focus on high-risk areas, create diverse test scenarios, and involve both tech and non-tech folks. Setting clear evaluation criteria helps everyone understand how well the app is doing and keeps things moving forward.
Rafael Pinheiro
Co-founder at Ottic