A production-ready C# service that extracts structured data from raw invoice text using OpenAI's GPT-4o model, with an evaluation suite focused on probabilistic testing and evals.
Represents the extracted invoice data with strong typing:
- `InvoiceNumber` (`string?`): The extracted invoice number
- `VendorName` (`string?`): The extracted vendor/supplier name
- `InvoiceDate` (`DateTime?`): The invoice date
- `TotalAmount` (`decimal`): The total invoice amount
- `LineItems` (`List<LineItem>`): Collection of line items
Represents individual line items:
- `Description` (`string`): Item description
- `Amount` (`decimal`): Item amount
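As a rough illustration of the two models described above, the classes could look like the sketch below. The property names and types come from the lists above; the default initializers, the `sealed` modifier, and the `List<LineItem>` element type are assumptions, not the project's actual source.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// One line on the invoice. Property names follow the README; defaults are assumed.
public sealed class LineItem
{
    public string Description { get; set; } = string.Empty;
    public decimal Amount { get; set; }
}

// The extraction result. Nullable members model "field missing from the invoice".
public sealed class InvoiceExtractionResult
{
    public string? InvoiceNumber { get; set; }
    public string? VendorName { get; set; }
    public DateTime? InvoiceDate { get; set; }
    public decimal TotalAmount { get; set; }

    // Initialized to an empty list so callers can enumerate without a null check (assumption).
    public List<LineItem> LineItems { get; set; } = new();
}
```

Using `decimal` rather than `double` for monetary amounts keeps the ±0.01 consistency check exact.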
Defines the contract for invoice extraction:
`Task<InvoiceExtractionResult> ExtractInvoiceAsync(string invoiceText, CancellationToken cancellationToken = default);`
- Uses OpenAI's GPT-4o model (`gpt-4o-2024-08-06`)
- Implements Structured Outputs (a strict JSON schema response format, which is stricter than plain JSON mode) so every response has the same shape
- System Prompt: "You are a financial data extraction assistant. Extract data strictly. If a field is missing, return null."
- Ensures strict JSON schema compliance
- Error handling with descriptive exceptions
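To make the Structured Outputs point concrete, here is a minimal sketch of the request body such a service might send to the Chat Completions endpoint. The README does not say whether the service uses the official SDK or raw HTTP, so this only builds the JSON payload with `System.Text.Json`. The class name `InvoiceRequestBuilder`, the schema name `invoice_extraction`, and the exact schema layout are hypothetical; the model name and system prompt are taken from the list above.

```csharp
using System.Text.Json.Nodes;

// Hypothetical helper: builds a Chat Completions payload that requests a strict JSON schema.
public static class InvoiceRequestBuilder
{
    public const string Model = "gpt-4o-2024-08-06";
    public const string SystemPrompt =
        "You are a financial data extraction assistant. Extract data strictly. If a field is missing, return null.";

    public static string BuildRequestBody(string invoiceText)
    {
        // Strict mode expects every property to be listed in "required" and
        // "additionalProperties" to be false; optional fields are modeled as nullable types.
        var schema = JsonNode.Parse("""
        {
          "type": "object",
          "additionalProperties": false,
          "required": ["InvoiceNumber", "VendorName", "InvoiceDate", "TotalAmount", "LineItems"],
          "properties": {
            "InvoiceNumber": { "type": ["string", "null"] },
            "VendorName": { "type": ["string", "null"] },
            "InvoiceDate": { "type": ["string", "null"], "description": "ISO 8601 date" },
            "TotalAmount": { "type": "number" },
            "LineItems": {
              "type": "array",
              "items": {
                "type": "object",
                "additionalProperties": false,
                "required": ["Description", "Amount"],
                "properties": {
                  "Description": { "type": "string" },
                  "Amount": { "type": "number" }
                }
              }
            }
          }
        }
        """);

        var body = new JsonObject
        {
            ["model"] = Model,
            ["messages"] = new JsonArray(
                new JsonObject { ["role"] = "system", ["content"] = SystemPrompt },
                new JsonObject { ["role"] = "user", ["content"] = invoiceText }),
            ["response_format"] = new JsonObject
            {
                ["type"] = "json_schema",
                ["json_schema"] = new JsonObject
                {
                    ["name"] = "invoice_extraction", // hypothetical schema name
                    ["strict"] = true,
                    ["schema"] = schema
                }
            }
        };

        return body.ToJsonString();
    }
}
```

The schema constrains the shape of the response, not its values, which is why the evaluation suite is still needed to catch wrong totals, vendors, or dates.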
The InvoiceExtractionEvals class implements comprehensive quality gates:
5 parameterized test cases covering:
- Standard invoices with complete data
- Vendor name variations (case differences, abbreviations)
- Minimal invoice formats
- Decimal precision handling
- OCR-like variations and typos
Each case provides:
- `input_text`: Raw invoice text
- `expected_vendor`: Ground truth vendor name
- `expected_total`: Ground truth total amount
Purpose: Hallucination Detection
- Validates that sum of `LineItems.Amount` equals `TotalAmount`
- Allows delta of ±0.01 for rounding errors
- Detects when LLM generates inconsistent totals
Theory Test: Runs across all 5 golden invoices
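The internal consistency check described above reduces to a small comparison. The following is a minimal sketch under the stated ±0.01 tolerance; the class and method names are hypothetical and the real test may be written as an xUnit assertion instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ConsistencyCheck
{
    // Returns true when the line item amounts add up to the reported total,
    // within the given tolerance (0.01 by default, as in the README).
    public static bool IsInternallyConsistent(
        IEnumerable<decimal> lineItemAmounts,
        decimal totalAmount,
        decimal tolerance = 0.01m)
    {
        decimal sum = lineItemAmounts.Sum();
        return Math.Abs(sum - totalAmount) <= tolerance;
    }
}
```

For the sample invoice used later in this README, `IsInternallyConsistent(new[] { 100.00m, 50.00m }, 150.00m)` returns `true`, while a reported total of `175.00m` would fail the check.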
Purpose: Fuzzy Matching with OCR Error Tolerance
- Uses `CalculateLevenshteinDistance()` helper function
- Levenshtein distance threshold: ≤ 3 characters
- Allows small OCR errors: "Inc." ↔ "Inc", typos, case variations
- Detects vendor name extraction quality
Theory Test: Runs across all 5 golden invoices
Purpose: Data Format and Reasonableness Validation
- Asserts `InvoiceDate` is not null
- Asserts `InvoiceDate` is not in the future (1-day tolerance)
- Asserts `InvoiceDate` year >= 2000 (sanity check)
- Detects hallucinated or invalid dates
Theory Test: Runs across all 5 golden invoices
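The three date assertions above can be combined into one predicate. This is a sketch only: the class and method names are hypothetical, and passing the current time in as a parameter (rather than reading the clock inside the method) is an assumption made here to keep the check testable.

```csharp
using System;

public static class DateValidity
{
    // True when the date is present, at most one day in the future, and from the year 2000 or later.
    public static bool IsPlausibleInvoiceDate(DateTime? invoiceDate, DateTime now)
    {
        if (invoiceDate is null) return false;                 // not null
        if (invoiceDate.Value > now.AddDays(1)) return false;  // not in the future (1-day tolerance)
        return invoiceDate.Value.Year >= 2000;                 // sanity check on the year
    }
}
```

The one-day tolerance presumably absorbs time zone differences between the invoice issuer and the machine running the tests.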
Calculates minimum edit distance between two strings:
- Handles null/empty cases
- Case-insensitive comparison
- 2D dynamic programming implementation
- O(n*m) time complexity where n, m are string lengths
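A helper with the properties listed above could be written as follows. This is a standard 2D dynamic programming Levenshtein implementation, offered as a sketch rather than the contents of `StringDistance.cs`; the class name `StringDistanceSketch` is hypothetical, and lowercasing with the invariant culture is an assumption about how the case-insensitive comparison is done.

```csharp
#nullable enable
using System;

public static class StringDistanceSketch
{
    // Minimum number of single-character insertions, deletions, and substitutions
    // needed to turn one string into the other, ignoring case.
    public static int CalculateLevenshteinDistance(string? a, string? b)
    {
        // Treat null as empty and normalize case before comparing.
        a = (a ?? string.Empty).ToLowerInvariant();
        b = (b ?? string.Empty).ToLowerInvariant();
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        // d[i, j] = distance between the first i chars of a and the first j chars of b.
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1,   // deletion
                             d[i, j - 1] + 1),  // insertion
                    d[i - 1, j - 1] + cost);    // substitution or match
            }
        }

        return d[a.Length, b.Length];
    }
}
```

A vendor accuracy assertion would then compare the result against the threshold, for example `CalculateLevenshteinDistance(expectedVendor, actualVendor) <= 3`.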
LLM-Integration/
├── LLM-Integration.csproj
├── Program.cs
├── Settings.json
├── Models/
│   ├── InvoiceExtractionResult.cs
│   └── LineItem.cs
└── Services/
    ├── IInvoiceParser.cs
    └── OpenAIInvoiceService.cs
LLM-Integration.Tests/
├── LLM-Integration.Tests.csproj
├── Evals/
│   └── InvoiceExtractionEvals.cs
└── Utilities/
    └── StringDistance.cs
- .NET 9.0+
- OpenAI API key (GPT-4o access required)
- Update `Settings.json`:
{
  "API-Key": "your-openai-api-key-here"
}
- Build Solution:
dotnet build
- Run Tests:
# Run all evaluation tests
dotnet test LLM-Integration.Tests/
# Run specific test class
dotnet test LLM-Integration.Tests/ --filter "FullyQualifiedName~LLM_Integration.Tests.Evals.InvoiceExtractionEvals"
# Run with verbose output
dotnet test LLM-Integration.Tests/ --logger "console;verbosity=detailed"

using LLM_Integration.Services;
var apiKey = "your-openai-api-key";
var service = new OpenAIInvoiceService(apiKey);
var invoiceText = """
INVOICE INV-2024-001
Vendor: ACME Corp
Date: 2024-11-15
Items:
- Widget: $100.00
- Service: $50.00
Total: $150.00
""";
var result = await service.ExtractInvoiceAsync(invoiceText);
Console.WriteLine($"Vendor: {result.VendorName}");
Console.WriteLine($"Total: ${result.TotalAmount}");
foreach (var item in result.LineItems)
{
    Console.WriteLine($" - {item.Description}: ${item.Amount}");
}

- Ensures consistently structured JSON responses from GPT-4o
- Eliminates free-form text parsing ambiguity
- Guarantees schema compliance
- Golden dataset approach for regression testing
- Theory-based tests (xUnit) for parameterized validation
- Multiple evaluation angles (consistency, accuracy, format)
- Levenshtein distance handles OCR errors
- Threshold of 3 allows realistic variances
- Example: "ACME Inc." → "Acme Inc" is 1 edit
- Avoids API call costs during testing
- Provides predictable, deterministic results
- Faster feedback loop for development
| Evaluation | Purpose | Method | Threshold |
|---|---|---|---|
| Internal Consistency | Hallucination Detection | Sum(LineItems) == TotalAmount | ±0.01 delta |
| Vendor Accuracy | Fuzzy Name Matching | Levenshtein Distance | ≤ 3 characters |
| Date Validity | Format & Sanity | Date checks | Not null, not future, year ≥ 2000 |
# Build and run the console app
dotnet run --project LLM-Integration/
# This will attempt to extract invoice data from a sample invoice
# Requires Settings.json with valid OpenAI API key

- Manually curated test cases with known good outputs
- Covers edge cases: variations in vendor names, OCR errors, precision
- Provides ground truth for accuracy measurement
- Each evaluation runs against all 5 golden invoices
- Total of 15 test cases (3 evals × 5 invoices)
- Parallel execution via xUnit (across test collections; tests within a single class run sequentially by default)
To add new test cases:
- Add a new `yield return` statement in `GetGoldenInvoices()`
- All three evaluation tests automatically run against the new case
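As an illustration of that workflow, a golden dataset provider might look like the sketch below. Only the method name `GetGoldenInvoices()` and the three fields (`input_text`, `expected_vendor`, `expected_total`) come from this README; the class name and both sample rows are made up for the example and are not the project's actual golden invoices.

```csharp
using System.Collections.Generic;

public static class GoldenDataset
{
    // Each row is { input_text, expected_vendor, expected_total }.
    // In an xUnit project, [Theory] tests would typically consume this
    // through [MemberData(nameof(GetGoldenInvoices))].
    public static IEnumerable<object[]> GetGoldenInvoices()
    {
        // Illustrative row based on the usage example in this README.
        yield return new object[]
        {
            "INVOICE INV-2024-001\nVendor: ACME Corp\nDate: 2024-11-15\nTotal: $150.00",
            "ACME Corp",
            150.00m
        };

        // Adding a case is one more yield return (hypothetical data).
        yield return new object[]
        {
            "Invoice 77\nFrom: Globex LLC\nDate: 2024-10-01\nAmount due: 42.50",
            "Globex LLC",
            42.50m
        };
    }
}
```

Because every theory reads from the same provider, a new row is picked up by each evaluation without touching the test methods.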
- Integration with real OpenAI API in integration tests
- Performance benchmarking (response time, cost tracking)
- Confidence scores for extracted fields
- Support for multiple invoice formats (scanned images via OCR)
- Database persistence for audit trails
- Async batch processing for bulk extractions
- Custom evaluation metrics per customer
- Cost optimization (switching between GPT-4o and GPT-4o mini)
MIT License
For issues or questions, please refer to the OpenAI API documentation: