Post By: Raj Gupta
Understanding the Testing Philosophy
Testing AI applications requires a fundamentally different approach compared to traditional software testing. PydanticAI recognises this by establishing two distinct categories of tests, each serving a unique purpose in ensuring application reliability.
The Two Pillars of Testing
Let’s first understand the core distinction in PydanticAI’s testing approach:
Unit Tests: These are traditional software tests that verify whether your application code is functioning correctly. They follow established patterns and practices from software engineering.
Evals: These specialized tests assess the Large Language Model (LLM) itself, measuring how well it performs and the quality of its responses. Unlike unit tests, evals are more akin to benchmarks than pass/fail checks.
This separation is crucial because each type of test addresses different aspects of AI application reliability. Let’s explore each in detail.
Unit Testing in PydanticAI
Setting Up the Testing Environment
PydanticAI recommends a specific set of tools for effective unit testing:
pytest as your primary test harness
inline-snapshot for managing complex assertions
dirty-equals for comparing large data structures
Additionally, PydanticAI provides two crucial features for unit testing:
TestModel or FunctionModel to replace actual LLM calls
Agent.override to modify your model’s behaviour during tests
Let’s explore PydanticAI testing by building and testing a book recommendation system. We’ll create an AI agent that recommends books based on user preferences and reading history, then learn how to thoroughly test it.
from typing import List
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

class BookRecommendation(BaseModel):
    title: str
    author: str
    genre: str
    confidence_score: float
    reasoning: str

class BookService:
    """Service class that handles database operations for the agent."""

    async def __aenter__(self) -> 'BookService':
        # Acquire database connections here in a real implementation.
        return self

    async def __aexit__(self, *exc_info) -> None:
        # Release connections here.
        pass

    async def get_user_history(self, user_id: str) -> List[str]:
        """Retrieve a user's reading history."""
        ...

    async def check_availability(self, title: str) -> bool:
        """Verify whether a book is in stock."""
        ...

book_agent = Agent(
    'openai:gpt-4',
    deps_type=BookService,
    system_prompt='You are a book recommendation assistant that suggests books based on user preferences and reading history.'
)

@book_agent.tool
async def recommend_books(
    ctx: RunContext[BookService],
    user_preferences: str,
    num_recommendations: int = 3
) -> List[BookRecommendation]:
    """Recommend books based on user preferences."""
    # In a real application this would use ctx.deps to query the BookService.
    ...

async def get_personalized_recommendations(
    user_id: str,
    preferences: str
) -> List[BookRecommendation]:
    """Get personalized book recommendations for a user."""
    async with BookService() as book_service:
        result = await book_agent.run(
            f"Recommend books for a reader who enjoys {preferences}",
            deps=book_service
        )
        return result.data
Let’s break down each component:
1. BookRecommendation Model:
This Pydantic model defines the structure of a book recommendation
confidence_score indicates the AI's confidence in the recommendation
reasoning provides an explanation for why this book was recommended
2. BookService:
A service class that handles database operations
get_user_history retrieves a user's reading history
check_availability verifies if a book is in stock
3. book_agent:
Creates a PydanticAI agent using GPT-4
Specifies BookService as a dependency
Sets up the system prompt for book recommendations
4. recommend_books Tool:
A tool that the agent can use to generate recommendations
Takes user preferences and desired number of recommendations
Returns a list of BookRecommendation objects
5. get_personalized_recommendations:
The main function that clients will call
Creates a BookService instance
Runs the agent with the user’s preferences
Returns the recommended books
Testing with TestModel
TestModel is PydanticAI’s simplest approach to unit testing. Here’s how it works with our Book Recommendation application:
import pytest
from pydantic_ai import models
from pydantic_ai.models.test import TestModel
from pydantic_ai.messages import SystemPrompt, UserPrompt, ModelStructuredResponse
# book_agent, BookRecommendation and get_personalized_recommendations come from the application code above

pytestmark = pytest.mark.anyio
models.ALLOW_MODEL_REQUESTS = False

async def test_basic_recommendation():
    """Test that the recommendation system returns valid book recommendations."""
    with book_agent.override(model=TestModel()):
        recommendations = await get_personalized_recommendations(
            "user123",
            "science fiction with complex characters"
        )
    assert len(recommendations) == 3
    for rec in recommendations:
        assert isinstance(rec, BookRecommendation)
        assert 0 <= rec.confidence_score <= 1
        assert rec.reasoning != ""
Let’s understand what’s happening in this test:
1. Test Setup:
pytestmark = pytest.mark.anyio
models.ALLOW_MODEL_REQUESTS = False
pytestmark enables async test support
ALLOW_MODEL_REQUESTS = False prevents accidental API calls to OpenAI
2. Agent Override:
with book_agent.override(model=TestModel()):
Temporarily replaces the real GPT-4 model with TestModel
TestModel generates valid data without making API calls
The context manager ensures the original model is restored after the test
3. Test Execution:
recommendations = await get_personalized_recommendations(
    "user123",
    "science fiction with complex characters"
)
Calls our main function with test parameters
TestModel automatically generates valid BookRecommendation objects
4. Assertions:
assert len(recommendations) == 3
for rec in recommendations:
    assert isinstance(rec, BookRecommendation)
    assert 0 <= rec.confidence_score <= 1
    assert rec.reasoning != ""
Verifies we get the expected number of recommendations
Checks that each recommendation matches our schema
Validates confidence scores are in the valid range
Ensures reasoning is provided for each recommendation
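For larger response structures, the dirty-equals matchers recommended earlier keep this kind of assertion compact. Here is a minimal sketch; the specific field constraints are illustrative rather than part of the original test:

from dirty_equals import IsFloat, IsStr

# Compare each recommendation against flexible matchers instead of
# asserting field by field (the constraints below are illustrative).
for rec in recommendations:
    assert rec.model_dump() == {
        'title': IsStr(),
        'author': IsStr(),
        'genre': IsStr(),
        'confidence_score': IsFloat(ge=0, le=1),
        'reasoning': IsStr(min_length=1),
    }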
Let’s look at testing the message flow:
async def test_message_flow():
    """Test the sequence of messages in the recommendation process."""
    with book_agent.override(model=TestModel()):
        await get_personalized_recommendations(
            "user123",
            "mystery novels"
        )
    assert book_agent.last_run_messages == [
        SystemPrompt(
            content='You are a book recommendation assistant that suggests books based on user preferences and reading history.',
            role='system'
        ),
        UserPrompt(
            content='Recommend books for a reader who enjoys mystery novels',
            role='user'
        ),
        ModelStructuredResponse(
            calls=[{
                'tool_name': 'recommend_books',
                'args': {
                    'user_preferences': 'mystery novels',
                    'num_recommendations': 3
                }
            }]
        )
    ]
This test verifies:
The correct system prompt is used
User input is properly formatted
The recommend_books tool is called with correct arguments
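Message-list assertions like this are tedious to write by hand. The inline-snapshot library mentioned earlier can generate them for you: wrap the expected value in snapshot() and let pytest fill it in. A sketch of the same test using that workflow:

from inline_snapshot import snapshot

async def test_message_flow_snapshot():
    """Same check as above, but let inline-snapshot record the expected messages."""
    with book_agent.override(model=TestModel()):
        await get_personalized_recommendations("user123", "mystery novels")
    # Run `pytest --inline-snapshot=create` once to populate the snapshot,
    # then review the generated value and commit it.
    assert book_agent.last_run_messages == snapshot()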
Advanced Testing with FunctionModel
While TestModel is great for basic validation, FunctionModel gives us precise control over how our agent behaves during tests. This is particularly useful when we need to:
Test specific response patterns
Verify handling of different user inputs
Simulate complex interaction sequences
Test edge cases and error conditions
Let’s examine a comprehensive FunctionModel implementation:
from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import Message, ModelAnyResponse, ModelStructuredResponse, ModelTextResponse
def custom_book_recommendations(
    messages: list[Message],
    info: AgentInfo
) -> ModelAnyResponse:
    """
    Custom function to generate specific test recommendations.

    Parameters:
        messages: List of messages in the conversation history
        info: Information about the agent and its configuration

    Returns:
        ModelAnyResponse: Either a structured response for tool calls
        or a text response for final answers
    """
    user_request = messages[1].content.lower()
    if "mystery" in user_request:
        recommendations = [
            BookRecommendation(
                title="The Silent Patient",
                author="Alex Michaelides",
                genre="Mystery/Thriller",
                confidence_score=0.95,
                reasoning="Strong psychological mystery elements"
            ),
            BookRecommendation(
                title="Gone Girl",
                author="Gillian Flynn",
                genre="Mystery/Thriller",
                confidence_score=0.90,
                reasoning="Complex plot with unreliable narrators"
            )
        ]
        return ModelStructuredResponse(
            calls=[{
                'tool_name': 'recommend_books',
                'args': {
                    'user_preferences': 'mystery novels',
                    'num_recommendations': len(recommendations)
                },
                'response': recommendations
            }]
        )
    elif "science fiction" in user_request:
        ...
    else:
        return ModelStructuredResponse(
            calls=[{
                'tool_name': 'recommend_books',
                'args': {
                    'user_preferences': 'general fiction',
                    'num_recommendations': 3
                }
            }]
        )
Let’s break down how this FunctionModel works:
1. Message Processing:
user_request = messages[1].content.lower()
The messages list contains the conversation history
Index 0 is the system prompt
Index 1 is the user’s request
We convert to lowercase for consistent matching
2. Conditional Response Generation:
if "mystery" in user_request:
    recommendations = [
        BookRecommendation(
            title="The Silent Patient",
            author="Alex Michaelides",
            genre="Mystery/Thriller",
            confidence_score=0.95,
            reasoning="Strong psychological mystery elements"
        ),
        ...
    ]
We check the user’s request for specific genres
Create predetermined recommendations for each genre
Include realistic metadata like confidence scores and reasoning
3. Structured Response Creation:
return ModelStructuredResponse(
    calls=[{
        'tool_name': 'recommend_books',
        'args': {
            'user_preferences': 'mystery novels',
            'num_recommendations': len(recommendations)
        },
        'response': recommendations
    }]
)
Returns a structured response that mimics the real LLM
Includes tool name, arguments, and response data
Maintains the same interface as the real agent
Now let’s look at how to use this FunctionModel in tests:
async def test_genre_specific_recommendations():
    """
    Test that recommendations change based on genre preferences.

    This test verifies that our agent provides appropriate
    recommendations for different genres.
    """
    with book_agent.override(model=FunctionModel(custom_book_recommendations)):
        mystery_recs = await get_personalized_recommendations(
            "user123",
            "mystery novels with complex plots"
        )
        scifi_recs = await get_personalized_recommendations(
            "user123",
            "science fiction with AI themes"
        )
    assert all('mystery' in rec.genre.lower() for rec in mystery_recs)
    assert all('science fiction' in rec.genre.lower() for rec in scifi_recs)
    for rec in mystery_recs + scifi_recs:
        assert rec.confidence_score >= 0.8, "Low confidence recommendation"
        assert len(rec.reasoning) >= 20, "Insufficient reasoning provided"
Let’s also test some edge cases and error handling:
async def test_recommendation_edge_cases():
    """Test handling of unusual or edge case requests."""
    def edge_case_handler(messages: list[Message], info: AgentInfo) -> ModelAnyResponse:
        user_request = messages[1].content.lower()
        if not user_request.strip():
            return ModelTextResponse(
                content="Error: Empty preference string provided"
            )
        elif len(user_request) > 1000:
            return ModelTextResponse(
                content="Error: Request too long"
            )
        return ModelStructuredResponse(
            calls=[{
                'tool_name': 'recommend_books',
                'args': {
                    'user_preferences': user_request,
                    'num_recommendations': 1
                },
                'response': [
                    BookRecommendation(
                        title="Universal Appeal",
                        author="Test Author",
                        genre="General Fiction",
                        confidence_score=0.5,
                        reasoning="Fallback recommendation for unusual request"
                    )
                ]
            }]
        )

    with book_agent.override(model=FunctionModel(edge_case_handler)):
        with pytest.raises(ValueError):
            await get_personalized_recommendations("user123", "")
        long_request = "fiction " * 200
        with pytest.raises(ValueError):
            await get_personalized_recommendations("user123", long_request)
        result = await get_personalized_recommendations(
            "user123",
            "books about quantum physics written as romance novels"
        )
        assert len(result) == 1
        assert result[0].confidence_score == 0.5
Reusable Test Fixtures
For tests that frequently need model overrides, we can create pytest fixtures:
import pytest
from pydantic_ai.models.test import TestModel
from book_app import book_agent  # import your agent from wherever it is defined

@pytest.fixture
def override_book_agent():
    with book_agent.override(model=TestModel()):
        yield

async def test_recommendations(override_book_agent: None):
    ...
Evals: The Art of Model Evaluation
Evals represent an emerging field in AI testing that requires a different mindset. Rather than strict pass/fail checks, they are more like benchmarks that help you understand how your model’s performance changes over time.
Here’s what makes evals unique:
They never truly “pass” in the traditional sense — they provide performance metrics that you track over time
They’re typically slower and more expensive to run than unit tests
They’re not suitable for continuous integration pipelines that run on every commit
They require careful consideration of what constitutes “good performance”
Let’s explore this through a practical example of a Cypher query generation system.
Implementing a Cypher Generation System
First, let’s look at how we structure our Cypher generation application:
import json
from pathlib import Path
from typing import Union

from pydantic_ai import Agent, RunContext
from neo4j_database import GraphDBConn

class CypherSystemPrompt:
    def __init__(
        self,
        examples: Union[list[dict[str, str]], None] = None,
        db: str = 'Neo4j'
    ):
        if examples is None:
            with Path('cypher_examples.json').open('rb') as f:
                # The JSON file (shown below) wraps the examples in an "examples" key
                self.examples = json.load(f)['examples']
        else:
            self.examples = examples
        self.db = db

    def build_prompt(self) -> str:
        return f"""\
Given the following {self.db} graph schema, your job is to
write a Cypher query that suits the user's request.

Graph schema:

CREATE
  (person:Person {{name: string, age: int}}),
  (movie:Movie {{title: string, year: int}})
CREATE (person)-[:ACTED_IN]->(movie)

{''.join(self.format_example(example) for example in self.examples)}"""

    @staticmethod
    def format_example(example: dict[str, str]) -> str:
        return f"""\
<example>
  <request>{example['request']}</request>
  <cypher>{example['cypher']}</cypher>
</example>
"""

cypher_agent = Agent(
    'gemini-1.5-flash',
    deps_type=CypherSystemPrompt,
)

@cypher_agent.system_prompt
async def system_prompt(ctx: RunContext[CypherSystemPrompt]) -> str:
    return ctx.deps.build_prompt()
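The evaluation code in the next section calls a generate_cypher helper that the article doesn't define. A minimal sketch of what it might look like, assuming it simply runs the agent and returns the generated query text (the name and signature are assumptions):

# Hypothetical helper assumed by the evaluation code below: run the agent and
# return the generated Cypher text. Inside cypher_agent.override(deps=...) the
# overridden system-prompt deps are supplied automatically.
async def generate_cypher(request: str) -> str:
    result = await cypher_agent.run(request)
    return result.data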
Measuring Performance with Cross-Validation
The most challenging aspect of evals is measuring performance effectively. Let’s implement a comprehensive evaluation system:
import statistics
from itertools import chain

from neo4j_database import GraphDBConn, QueryError

async def evaluate_cypher_generation():
    with Path('cypher_examples.json').open('rb') as f:
        examples = json.load(f)['examples']

    # Split the examples into 5 folds for cross-validation
    fold_size = len(examples) // 5
    folds = [examples[i : i + fold_size] for i in range(0, len(examples), fold_size)]

    conn = GraphDBConn()
    scores = []

    for i, fold in enumerate(folds):
        fold_score = 0
        # Build the system prompt from every fold except the one under test
        other_folds = list(chain(*(f for j, f in enumerate(folds) if j != i)))
        system_prompt = CypherSystemPrompt(examples=other_folds)

        with cypher_agent.override(deps=system_prompt):
            for case in fold:
                try:
                    agent_results = await generate_cypher(case['request'])
                    agent_nodes = await conn.execute(agent_results)
                except QueryError as e:
                    print(f'Fold {i} {case}: {e}')
                    fold_score -= 100
                else:
                    expected_nodes = await conn.execute(case['cypher'])
                    agent_node_ids = [n['id'] for n in agent_nodes]
                    expected_node_ids = {n['id'] for n in expected_nodes}
                    # Penalise breadth, reward correct matches
                    fold_score -= len(agent_node_ids)
                    fold_score += 5 * len(set(agent_node_ids) & expected_node_ids)

        scores.append(fold_score)

    overall_score = statistics.mean(scores)
    variance = statistics.variance(scores) if len(scores) > 1 else 0

    return {
        'overall_score': overall_score,
        'score_variance': variance,
        'fold_scores': scores,
        'number_of_examples': len(examples)
    }
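Because the evaluation is a plain coroutine, you can run it ad hoc outside pytest and inspect the metrics it returns; for example:

import asyncio

# Run the evaluation and print the aggregate metrics it returns.
results = asyncio.run(evaluate_cypher_generation())
print(f"Overall score: {results['overall_score']:.1f} "
      f"(variance {results['score_variance']:.1f} across {len(results['fold_scores'])} folds)")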
Performance Measurement Strategies
Our evaluation system implements several key strategies for measuring LLM performance:
1. End-to-end Testing: We execute the generated Cypher queries against a real graph database to verify they work correctly.
2. Cross-validation: We use 5-fold cross-validation to ensure our results are robust:
Split the examples into 5 parts
Train on 4 parts, test on 1
Rotate through all combinations
Average the results
3. Sophisticated Scoring:
Heavily penalise invalid queries (-100 points)
Penalise overly broad queries (-1 point per returned node)
Reward accurate results (+5 points per correct node)
For example, a query that returns 10 nodes of which 8 match the expected results scores -10 + 5 × 8 = +30 for that case.
4. Variance Tracking: We calculate score variance across folds to understand consistency.
Here’s an example of what our test examples might look like in cypher_examples.json:
{
  "examples": [
    {
      "request": "Find all actors who appeared in movies from 2020",
      "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE m.year = 2020 RETURN p"
    },
    {
      "request": "Get actors who worked with Tom Hanks",
      "cypher": "MATCH (p1:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person) WHERE p2.name = 'Tom Hanks' RETURN DISTINCT p1"
    },
    {
      "request": "Find actors who appeared in both drama and comedy movies",
      "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m1:Movie), (p)-[:ACTED_IN]->(m2:Movie) WHERE m1.genre = 'Drama' AND m2.genre = 'Comedy' RETURN DISTINCT p"
    }
  ]
}
Best Practices for Evals
From our implementation, we can extract several key practices for effective evals:
1. Separate Training and Testing Data:
Use cross-validation to make the most of limited examples
Ensure test cases don’t leak into training data
2. Comprehensive Scoring:
Penalise undesirable behaviours (invalid queries, over-broad results)
Reward desired outcomes (accurate matches)
Balance precision and recall
3. Error Handling:
Gracefully handle invalid queries
Record and categorise failures
Use penalties that reflect real-world impact
4. Performance Tracking:
Calculate meaningful statistics (mean, variance)
Track scores over time
Monitor for performance regression
Implementing Testing in Your Development Workflow
To implement these testing approaches effectively in your own projects, consider this structured approach:
1. Start with comprehensive unit tests:
tests/
├── conftest.py
├── unit/
│ ├── test_tools.py
│ └── test_agents.py
└── evals/
└── test_quality.py
2. Create reusable testing utilities:
import pytest
from pydantic_ai import models
from pydantic_ai.models.test import TestModel

@pytest.fixture(autouse=True)
def prevent_real_calls():
    models.ALLOW_MODEL_REQUESTS = False
    yield

@pytest.fixture
def override_agent():
    # Replace `agent` with your own agent instance
    with agent.override(model=TestModel()):
        yield
3. Implement regular evaluation runs:
async def run_evaluation_suite():
    # Fast unit tests run on every invocation
    pytest.main(['tests/unit'])
    # Expensive evals only run on a schedule; the helpers below are placeholders
    # for your own scheduling, evaluation, storage and regression checks
    if is_scheduled_eval_time():
        results = await evaluate_model_performance()
        store_eval_results(results)
        check_performance_regression(results)
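The storage and regression helpers are left undefined above. As one possible sketch, results could be appended to a JSONL history file and compared against the previous run; the file name and tolerance below are assumptions:

import json
import time
from pathlib import Path

HISTORY_FILE = Path('eval_history.jsonl')  # assumed location for eval history

def store_eval_results(results: dict) -> None:
    # Append each eval run with a timestamp so scores can be tracked over time
    record = {'timestamp': time.time(), **results}
    with HISTORY_FILE.open('a') as f:
        f.write(json.dumps(record) + '\n')

def check_performance_regression(results: dict, tolerance: float = 5.0) -> None:
    # Compare against the previous stored run (the latest record is the current run,
    # since store_eval_results is called first) and warn if the score dropped
    runs = [json.loads(line) for line in HISTORY_FILE.read_text().splitlines()]
    if len(runs) < 2:
        return
    previous = runs[-2]['overall_score']
    if results['overall_score'] < previous - tolerance:
        print(f"Possible regression: {previous:.1f} -> {results['overall_score']:.1f}")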
Best Practices for Production
When moving your AI application to production, remember these key principles:
Separation of Concerns: Keep unit tests and evals separate but complementary. Unit tests should run quickly and often, while evals can run less frequently but more thoroughly.
Progressive Testing: Start with basic unit tests, then add increasingly sophisticated evals as you understand your application’s requirements better.
Continuous Evaluation: Regularly run your eval suite to track performance over time and catch regressions early.
Data Management: Maintain a growing set of test cases and examples, learning from production usage to improve your test coverage.
Final Thoughts
PydanticAI’s testing framework provides a robust foundation for building reliable AI applications. By combining traditional unit testing with specialised AI evaluation techniques, you can create applications that are both technically sound and functionally effective.
Remember that testing AI applications is an evolving field. While the principles we’ve covered provide a solid foundation, be prepared to adapt and extend these approaches as you encounter new challenges and as the field continues to develop.
The key to success lies in maintaining a balance: use unit tests to ensure your code’s reliability and evals to measure and improve your AI’s performance. Together, these tools enable you to build AI applications that you can confidently deploy and maintain in production environments.
Remember, the goal isn’t perfection but continuous improvement.
Start with these basic patterns and evolve them as your needs grow. Your testing strategy should grow with your application, always focusing on providing real value to your users while maintaining system reliability.