Post By: Raj Gupta
Understanding the Testing Philosophy
Testing AI applications requires a fundamentally different approach compared to traditional software testing. PydanticAI recognises this by establishing two distinct categories of tests, each serving a unique purpose in ensuring application reliability.
The Two Pillars of Testing
Let’s first understand the core distinction in PydanticAI’s testing approach:
Unit Tests: These are traditional software tests that verify whether your application code is functioning correctly. They follow established patterns and practices from software engineering.
Evals: These specialized tests assess the Large Language Model (LLM) itself, measuring how well it performs and the quality of its responses. Unlike unit tests, evals are more akin to benchmarks than pass/fail checks.
This separation is crucial because each type of test addresses different aspects of AI application reliability. Let’s explore each in detail.
Unit Testing in PydanticAI
Setting Up the Testing Environment
PydanticAI recommends a specific set of tools for effective unit testing:
pytest as your primary test harness
inline-snapshot for managing complex assertions
dirty-equals for comparing large data structures
Additionally, PydanticAI provides two crucial features for unit testing:
TestModel or FunctionModel to replace actual LLM calls
Agent.override to modify your model’s behaviour during tests
Let’s explore PydanticAI testing by building and testing a book recommendation system. We’ll create an AI agent that recommends books based on user preferences and reading history, then learn how to thoroughly test it.
from typing import List
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

class BookRecommendation(BaseModel):
    title: str
    author: str
    genre: str
    confidence_score: float
    reasoning: str

class BookService:
    """Service class that handles database operations for the agent."""

    async def __aenter__(self) -> 'BookService':
        # Acquire database connections here in a real implementation.
        return self

    async def __aexit__(self, *exc_info) -> None:
        # Release connections here.
        pass

    async def get_user_history(self, user_id: str) -> List[str]:
        """Retrieve a user's reading history."""
        ...

    async def check_availability(self, title: str) -> bool:
        """Verify whether a book is in stock."""
        ...

book_agent = Agent(
    'openai:gpt-4',
    deps_type=BookService,
    system_prompt='You are a book recommendation assistant that suggests books based on user preferences and reading history.'
)

@book_agent.tool
async def recommend_books(
    ctx: RunContext[BookService],
    user_preferences: str,
    num_recommendations: int = 3
) -> List[BookRecommendation]:
    """Recommend books based on user preferences."""
    # In a real application this would use ctx.deps to query the BookService.
    ...

async def get_personalized_recommendations(
    user_id: str,
    preferences: str
) -> List[BookRecommendation]:
    """Get personalized book recommendations for a user."""
    async with BookService() as book_service:
        result = await book_agent.run(
            f"Recommend books for a reader who enjoys {preferences}",
            deps=book_service
        )
        return result.data
Let’s break down each component:
1. BookRecommendation Model:
This Pydantic model defines the structure of a book recommendation
confidence_score indicates the AI's confidence in the recommendation
reasoning provides an explanation for why this book was recommended
2. BookService:
A service class that handles database operations
get_user_history retrieves a user's reading history
check_availability verifies if a book is in stock
3. book_agent:
Creates a PydanticAI agent using GPT-4
Specifies BookService as a dependency
Sets up the system prompt for book recommendations
4. recommend_books Tool:
A tool that the agent can use to generate recommendations
Takes user preferences and desired number of recommendations
Returns a list of BookRecommendation objects
5. get_personalized_recommendations:
The main function that clients will call
Creates a BookService instance
Runs the agent with the user’s preferences
Returns the recommended books
Testing with TestModel
TestModel is PydanticAI’s simplest approach to unit testing. Here’s how it works with our Book Recommendation application:
import pytest
from pydantic_ai import models
from pydantic_ai.models.test import TestModel
from pydantic_ai.messages import SystemPrompt, UserPrompt, ModelStructuredResponse
# book_agent, BookRecommendation and get_personalized_recommendations come from the application code above

pytestmark = pytest.mark.anyio
models.ALLOW_MODEL_REQUESTS = False

async def test_basic_recommendation():
    """Test that the recommendation system returns valid book recommendations."""
    with book_agent.override(model=TestModel()):
        recommendations = await get_personalized_recommendations(
            "user123",
            "science fiction with complex characters"
        )
    assert len(recommendations) == 3
    for rec in recommendations:
        assert isinstance(rec, BookRecommendation)
        assert 0 <= rec.confidence_score <= 1
        assert rec.reasoning != ""
Let’s understand what’s happening in this test:
1. Test Setup:
pytestmark = pytest.mark.anyio
models.ALLOW_MODEL_REQUESTS = False
pytestmark enables async test support
ALLOW_MODEL_REQUESTS = False prevents accidental API calls to OpenAI
2. Agent Override:
with book_agent.override(model=TestModel()):
Temporarily replaces the real GPT-4 model with TestModel
TestModel generates valid data without making API calls
The context manager ensures the original model is restored after the test
3. Test Execution:
recommendations = await get_personalized_recommendations(
    "user123",
    "science fiction with complex characters"
)
Calls our main function with test parameters
TestModel automatically generates valid BookRecommendation objects
4. Assertions:
assert len(recommendations) == 3
for rec in recommendations:
    assert isinstance(rec, BookRecommendation)
    assert 0 <= rec.confidence_score <= 1
    assert rec.reasoning != ""
Verifies we get the expected number of recommendations
Checks that each recommendation matches our schema
Validates confidence scores are in the valid range
Ensures reasoning is provided for each recommendation
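For larger response structures, the dirty-equals matchers recommended earlier keep this kind of assertion compact. Here is a minimal sketch; the specific field constraints are illustrative rather than part of the original test:

from dirty_equals import IsFloat, IsStr

# Compare each recommendation against flexible matchers instead of
# asserting field by field (the constraints below are illustrative).
for rec in recommendations:
    assert rec.model_dump() == {
        'title': IsStr(),
        'author': IsStr(),
        'genre': IsStr(),
        'confidence_score': IsFloat(ge=0, le=1),
        'reasoning': IsStr(min_length=1),
    }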
Let’s look at testing the message flow:
async def test_message_flow():
    """Test the sequence of messages in the recommendation process."""
    with book_agent.override(model=TestModel()):
        await get_personalized_recommendations(
            "user123",
            "mystery novels"
        )
    assert book_agent.last_run_messages == [
        SystemPrompt(
            content='You are a book recommendation assistant that suggests books based on user preferences and reading history.',
            role='system'
        ),
        UserPrompt(
            content='Recommend books for a reader who enjoys mystery novels',
            role='user'
        ),
        ModelStructuredResponse(
            calls=[{
                'tool_name': 'recommend_books',
                'args': {
                    'user_preferences': 'mystery novels',
                    'num_recommendations': 3
                }
            }]
        )
    ]
This test verifies:
The correct system prompt is used
User input is properly formatted
The recommend_books tool is called with correct arguments
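Message-list assertions like this are tedious to write by hand. The inline-snapshot library mentioned earlier can generate them for you: wrap the expected value in snapshot() and let pytest fill it in. A sketch of the same test using that workflow:

from inline_snapshot import snapshot

async def test_message_flow_snapshot():
    """Same check as above, but let inline-snapshot record the expected messages."""
    with book_agent.override(model=TestModel()):
        await get_personalized_recommendations("user123", "mystery novels")
    # Run `pytest --inline-snapshot=create` once to populate the snapshot,
    # then review the generated value and commit it.
    assert book_agent.last_run_messages == snapshot()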
Advanced Testing with FunctionModel
While TestModel is great for basic validation, FunctionModel gives us precise control over how our agent behaves during tests. This is particularly useful when we need to:
Test specific response patterns
Verify handling of different user inputs
Simulate complex interaction sequences
Test edge cases and error conditions
Let’s examine a comprehensive FunctionModel implementation:
from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import Message, ModelAnyResponse, ModelStructuredResponse, ModelTextResponse
def custom_book_recommendations(
    messages: list[Message],
    info: AgentInfo
) -> ModelAnyResponse:
    """
    Custom function to generate specific test recommendations.

    Parameters:
        messages: List of messages in the conversation history
        info: Information about the agent and its configuration

    Returns:
        ModelAnyResponse: Either a structured response for tool calls
        or a text response for final answers
    """
    user_request = messages[1].content.lower()
    if "mystery" in user_request:
        recommendations = [
            BookRecommendation(
                title="The Silent Patient",
                author="Alex Michaelides",
                genre="Mystery/Thriller",
                confidence_score=0.95,
                reasoning="Strong psychological mystery elements"
            ),
            BookRecommendation(
                title="Gone Girl",
                author="Gillian Flynn",
                genre="Mystery/Thriller",
                confidence_score=0.90,
                reasoning="Complex plot with unreliable narrators"
            )
        ]
        return ModelStructuredResponse(
            calls=[{
                'tool_name': 'recommend_books',
                'args': {
                    'user_preferences': 'mystery novels',
                    'num_recommendations': len(recommendations)
                },
                'response': recommendations
            }]
        )
    elif "science fiction" in user_request:
        ...
    else:
        return ModelStructuredResponse(
            calls=[{
                'tool_name': 'recommend_books',
                'args': {
                    'user_preferences': 'general fiction',
                    'num_recommendations': 3
                }
            }]
        )
Let’s break down how this FunctionModel works:
1. Message Processing:
user_request = messages[1].content.lower()
The messages list contains the conversation history
Index 0 is the system prompt
Index 1 is the user’s request
We convert to lowercase for consistent matching
2. Conditional Response Generation:
if "mystery" in user_request:
    recommendations = [
        BookRecommendation(
            title="The Silent Patient",
            author="Alex Michaelides",
            genre="Mystery/Thriller",
            confidence_score=0.95,
            reasoning="Strong psychological mystery elements"
        ),
        ...
    ]
We check the user’s request for specific genres
Create predetermined recommendations for each genre
Include realistic metadata like confidence scores and reasoning
3. Structured Response Creation:
return ModelStructuredResponse(
    calls=[{
        'tool_name': 'recommend_books',
        'args': {
            'user_preferences': 'mystery novels',
            'num_recommendations': len(recommendations)
        },
        'response': recommendations
    }]
)
Returns a structured response that mimics the real LLM
Includes tool name, arguments, and response data
Maintains the same interface as the real agent
Now let’s look at how to use this FunctionModel in tests:
async def test_genre_specific_recommendations():
    """
    Test that recommendations change based on genre preferences.

    This test verifies that our agent provides appropriate
    recommendations for different genres.
    """
    with book_agent.override(model=FunctionModel(custom_book_recommendations)):
        mystery_recs = await get_personalized_recommendations(
            "user123",
            "mystery novels with complex plots"
        )
        scifi_recs = await get_personalized_recommendations(
            "user123",
            "science fiction with AI themes"
        )
    assert all('mystery' in rec.genre.lower() for rec in mystery_recs)
    assert all('science fiction' in rec.genre.lower() for rec in scifi_recs)
    for rec in mystery_recs + scifi_recs:
        assert rec.confidence_score >= 0.8, "Low confidence recommendation"
        assert len(rec.reasoning) >= 20, "Insufficient reasoning provided"
Let’s also test some edge cases and error handling:
async def test_recommendation_edge_cases():
    """Test handling of unusual or edge case requests."""
    def edge_case_handler(messages: list[Message], info: AgentInfo) -> ModelAnyResponse:
        user_request = messages[1].content.lower()
        if not user_request.strip():
            return ModelTextResponse(
                content="Error: Empty preference string provided"
            )
        elif len(user_request) > 1000:
            return ModelTextResponse(
                content="Error: Request too long"
            )
        return ModelStructuredResponse(
            calls=[{
                'tool_name': 'recommend_books',
                'args': {
                    'user_preferences': user_request,
                    'num_recommendations': 1
                },
                'response': [
                    BookRecommendation(
                        title="Universal Appeal",
                        author="Test Author",
                        genre="General Fiction",
                        confidence_score=0.5,
                        reasoning="Fallback recommendation for unusual request"
                    )
                ]
            }]
        )

    with book_agent.override(model=FunctionModel(edge_case_handler)):
        with pytest.raises(ValueError):
            await get_personalized_recommendations("user123", "")
        long_request = "fiction " * 200
        with pytest.raises(ValueError):
            await get_personalized_recommendations("user123", long_request)
        result = await get_personalized_recommendations(
            "user123",
            "books about quantum physics written as romance novels"
        )
        assert len(result) == 1
        assert result[0].confidence_score == 0.5
Reusable Test Fixtures
For tests that frequently need model overrides, we can create pytest fixtures:
import pytest
from pydantic_ai.models.test import TestModel
from book_app import book_agent  # import your agent from wherever it is defined

@pytest.fixture
def override_book_agent():
    with book_agent.override(model=TestModel()):
        yield

async def test_recommendations(override_book_agent: None):
    ...
Evals: The Art of Model Evaluation
Evals represent an emerging field in AI testing that requires a different mindset. Rather than strict pass/fail checks, they are more like benchmarks that help you understand how your model’s performance changes over time.
Here’s what makes evals unique:
They never truly “pass” in the traditional sense — they provide performance metrics that you track over time
They’re typically slower and more expensive to run than unit tests
They’re not suitable for continuous integration pipelines that run on every commit
They require careful consideration of what constitutes “good performance”
Let’s explore this through a practical example of a Cypher query generation system.
Implementing a Cypher Generation System
First, let’s look at how we structure our Cypher generation application:
import json
from pathlib import Path
from typing import Union

from pydantic_ai import Agent, RunContext
from neo4j_database import GraphDBConn

class CypherSystemPrompt:
    def __init__(
        self,
        examples: Union[list[dict[str, str]], None] = None,
        db: str = 'Neo4j'
    ):
        if examples is None:
            with Path('cypher_examples.json').open('rb') as f:
                # The JSON file (shown below) wraps the examples in an "examples" key
                self.examples = json.load(f)['examples']
        else:
            self.examples = examples
        self.db = db

    def build_prompt(self) -> str:
        return f"""\
Given the following {self.db} graph schema, your job is to
write a Cypher query that suits the user's request.

Graph schema:

CREATE
  (person:Person {{name: string, age: int}}),
  (movie:Movie {{title: string, year: int}})
CREATE (person)-[:ACTED_IN]->(movie)

{''.join(self.format_example(example) for example in self.examples)}"""

    @staticmethod
    def format_example(example: dict[str, str]) -> str:
        return f"""\
<example>
  <request>{example['request']}</request>
  <cypher>{example['cypher']}</cypher>
</example>
"""

cypher_agent = Agent(
    'gemini-1.5-flash',
    deps_type=CypherSystemPrompt,
)

@cypher_agent.system_prompt
async def system_prompt(ctx: RunContext[CypherSystemPrompt]) -> str:
    return ctx.deps.build_prompt()
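The evaluation code in the next section calls a generate_cypher helper that the article doesn't define. A minimal sketch of what it might look like, assuming it simply runs the agent and returns the generated query text (the name and signature are assumptions):

# Hypothetical helper assumed by the evaluation code below: run the agent and
# return the generated Cypher text. Inside cypher_agent.override(deps=...) the
# overridden system-prompt deps are supplied automatically.
async def generate_cypher(request: str) -> str:
    result = await cypher_agent.run(request)
    return result.data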
Measuring Performance with Cross-Validation
The most challenging aspect of evals is measuring performance effectively. Let’s implement a comprehensive evaluation system:
import statistics
from itertools import chain

from neo4j_database import GraphDBConn, QueryError

async def evaluate_cypher_generation():
    with Path('cypher_examples.json').open('rb') as f:
        examples = json.load(f)['examples']

    # Split the examples into 5 folds for cross-validation
    fold_size = len(examples) // 5
    folds = [examples[i : i + fold_size] for i in range(0, len(examples), fold_size)]

    conn = GraphDBConn()
    scores = []

    for i, fold in enumerate(folds):
        fold_score = 0
        # Build the system prompt from every fold except the one under test
        other_folds = list(chain(*(f for j, f in enumerate(folds) if j != i)))
        system_prompt = CypherSystemPrompt(examples=other_folds)

        with cypher_agent.override(deps=system_prompt):
            for case in fold:
                try:
                    agent_results = await generate_cypher(case['request'])
                    agent_nodes = await conn.execute(agent_results)
                except QueryError as e:
                    print(f'Fold {i} {case}: {e}')
                    fold_score -= 100
                else:
                    expected_nodes = await conn.execute(case['cypher'])
                    agent_node_ids = [n['id'] for n in agent_nodes]
                    expected_node_ids = {n['id'] for n in expected_nodes}
                    # Penalise breadth, reward correct matches
                    fold_score -= len(agent_node_ids)
                    fold_score += 5 * len(set(agent_node_ids) & expected_node_ids)

        scores.append(fold_score)

    overall_score = statistics.mean(scores)
    variance = statistics.variance(scores) if len(scores) > 1 else 0

    return {
        'overall_score': overall_score,
        'score_variance': variance,
        'fold_scores': scores,
        'number_of_examples': len(examples)
    }
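Because the evaluation is a plain coroutine, you can run it ad hoc outside pytest and inspect the metrics it returns; for example:

import asyncio

# Run the evaluation and print the aggregate metrics it returns.
results = asyncio.run(evaluate_cypher_generation())
print(f"Overall score: {results['overall_score']:.1f} "
      f"(variance {results['score_variance']:.1f} across {len(results['fold_scores'])} folds)")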
Performance Measurement Strategies
Our evaluation system implements several key strategies for measuring LLM performance:
1. End-to-end Testing: We execute the generated Cypher queries against a real graph database to verify they work correctly.
2. Cross-validation: We use 5-fold cross-validation to ensure our results are robust:
Split the examples into 5 parts
Train on 4 parts, test on 1
Rotate through all combinations
Average the results
3. Sophisticated Scoring:
Heavily penalise invalid queries (-100 points)
Penalise overly broad queries (-1 point per returned node)
Reward accurate results (+5 points per correct node)
For example, a query that returns 10 nodes of which 8 match the expected results scores -10 + 5 × 8 = +30 for that case.
4. Variance Tracking: We calculate score variance across folds to understand consistency.
Here’s an example of what our test examples might look like in cypher_examples.json:
{
  "examples": [
    {
      "request": "Find all actors who appeared in movies from 2020",
      "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE m.year = 2020 RETURN p"
    },
    {
      "request": "Get actors who worked with Tom Hanks",
      "cypher": "MATCH (p1:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person) WHERE p2.name = 'Tom Hanks' RETURN DISTINCT p1"
    },
    {
      "request": "Find actors who appeared in both drama and comedy movies",
      "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m1:Movie), (p)-[:ACTED_IN]->(m2:Movie) WHERE m1.genre = 'Drama' AND m2.genre = 'Comedy' RETURN DISTINCT p"
    }
  ]
}
Best Practices for Evals
From our implementation, we can extract several key practices for effective evals:
1. Separate Training and Testing Data:
Use cross-validation to make the most of limited examples
Ensure test cases don’t leak into training data
2. Comprehensive Scoring:
Penalise undesirable behaviours (invalid queries, over-broad results)
Reward desired outcomes (accurate matches)
Balance precision and recall
3. Error Handling:
Gracefully handle invalid queries
Record and categorise failures
Use penalties that reflect real-world impact
4. Performance Tracking:
Calculate meaningful statistics (mean, variance)
Track scores over time
Monitor for performance regression
Implementing Testing in Your Development Workflow
To implement these testing approaches effectively in your own projects, consider this structured approach:
1. Start with comprehensive unit tests:
tests/
├── conftest.py
├── unit/
│ ├── test_tools.py
│ └── test_agents.py
└── evals/
└── test_quality.py
2. Create reusable testing utilities:
import pytest
from pydantic_ai import models
from pydantic_ai.models.test import TestModel

@pytest.fixture(autouse=True)
def prevent_real_calls():
    models.ALLOW_MODEL_REQUESTS = False
    yield

@pytest.fixture
def override_agent():
    # Replace `agent` with your own agent instance
    with agent.override(model=TestModel()):
        yield
3. Implement regular evaluation runs:
async def run_evaluation_suite():
    # Fast unit tests run on every invocation
    pytest.main(['tests/unit'])
    # Expensive evals only run on a schedule; the helpers below are placeholders
    # for your own scheduling, evaluation, storage and regression checks
    if is_scheduled_eval_time():
        results = await evaluate_model_performance()
        store_eval_results(results)
        check_performance_regression(results)
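The storage and regression helpers are left undefined above. As one possible sketch, results could be appended to a JSONL history file and compared against the previous run; the file name and tolerance below are assumptions:

import json
import time
from pathlib import Path

HISTORY_FILE = Path('eval_history.jsonl')  # assumed location for eval history

def store_eval_results(results: dict) -> None:
    # Append each eval run with a timestamp so scores can be tracked over time
    record = {'timestamp': time.time(), **results}
    with HISTORY_FILE.open('a') as f:
        f.write(json.dumps(record) + '\n')

def check_performance_regression(results: dict, tolerance: float = 5.0) -> None:
    # Compare against the previous stored run (the latest record is the current run,
    # since store_eval_results is called first) and warn if the score dropped
    runs = [json.loads(line) for line in HISTORY_FILE.read_text().splitlines()]
    if len(runs) < 2:
        return
    previous = runs[-2]['overall_score']
    if results['overall_score'] < previous - tolerance:
        print(f"Possible regression: {previous:.1f} -> {results['overall_score']:.1f}")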
Best Practices for Production
When moving your AI application to production, remember these key principles:
Separation of Concerns: Keep unit tests and evals separate but complementary. Unit tests should run quickly and often, while evals can run less frequently but more thoroughly.
Progressive Testing: Start with basic unit tests, then add increasingly sophisticated evals as you understand your application’s requirements better.
Continuous Evaluation: Regularly run your eval suite to track performance over time and catch regressions early.
Data Management: Maintain a growing set of test cases and examples, learning from production usage to improve your test coverage.
Final Thoughts
PydanticAI’s testing framework provides a robust foundation for building reliable AI applications. By combining traditional unit testing with specialised AI evaluation techniques, you can create applications that are both technically sound and functionally effective.
Remember that testing AI applications is an evolving field. While the principles we’ve covered provide a solid foundation, be prepared to adapt and extend these approaches as you encounter new challenges and as the field continues to develop.
The key to success lies in maintaining a balance: use unit tests to ensure your code’s reliability and evals to measure and improve your AI’s performance. Together, these tools enable you to build AI applications that you can confidently deploy and maintain in production environments.
Remember, the goal isn’t perfection but continuous improvement.
Start with these basic patterns and evolve them as your needs grow. Your testing strategy should grow with your application, always focusing on providing real value to your users while maintaining system reliability.