QuantCoderFS – Implementation Evaluation and First Public Release
Version 1.1 · Released 9 May 2025 · Includes an evaluation of QuantCoderFS on the low volatility anomaly as a foundational strategy test

Overview
This release note presents a focused benchmark of the QuantCoderFS software—an agent-based system designed to extract and implement trading strategies from research papers—against a well-documented empirical anomaly: the long-term outperformance of low volatility equity portfolios.
The primary goal is to evaluate the intrinsic added value of QuantCoderFS relative to direct prompt engineering, using a single academic source across multiple large language model configurations. The results serve as a reference point for future code releases and as technical guidance for users interested in automated alpha replication.
Methodology
The experiment consists of a single research article input:
Baker, Bradley, and Wurgler (2010), Benchmarks as Limits to Arbitrage: Understanding the Low Volatility Anomaly
QuantCoderFS was tasked with parsing the PDF, interpreting the strategy, and generating production-ready code for QuantConnect’s Lean engine. The system was evaluated across three OpenAI model configurations:
GPT-3.5 (via API)
GPT-4o (via API)
GPT-4.1 (gpt-4-0125-preview, via API)
Each model was used to generate code within the same structured agentic flow, and the resulting output was tested for correctness, compilation, and backtest viability.
Summary of Results
Only GPT-4.1 produced a working implementation consistent with the source paper. It accurately captured the key design elements:
Filtering by liquidity and price
Computing 252-day volatility
Selecting the lowest volatility quintile
Monthly rebalancing
Benchmarking and volatility tracking versus SPY
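The selection mechanics listed above can be sketched outside of Lean in plain Python. This is a minimal illustration, not code taken from the generated algorithms; the function names and the quintile cutoff handling are assumptions for the sketch.

```python
import numpy as np

TRADING_DAYS = 252  # lookback window used for the volatility estimate


def annualized_volatility(closes: np.ndarray) -> float:
    """Annualized volatility from daily log returns."""
    returns = np.diff(np.log(closes))
    return float(np.std(returns) * np.sqrt(TRADING_DAYS))


def lowest_volatility_quintile(histories: dict[str, np.ndarray]) -> list[str]:
    """Rank symbols by trailing volatility and keep the bottom 20%."""
    vols = {sym: closes for sym, closes in histories.items()}
    vols = {sym: annualized_volatility(px) for sym, px in histories.items()}
    ranked = sorted(vols, key=vols.get)
    cutoff = max(1, len(ranked) // 5)  # always keep at least one symbol
    return ranked[:cutoff]
```

In the working GPT-4.1 implementations, an equivalent ranking runs inside the scheduled monthly rebalance rather than as a standalone function.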
Model Behavior Analysis
GPT-3.5
This version generated only placeholder methods and high-level comments. It lacked any usable implementation logic. While syntactically valid, the output failed to capture both strategy structure and trading behavior.
GPT-4o (First Attempt)
The initial GPT-4o output produced a generic equal-weight portfolio with no volatility logic. It demonstrated knowledge of QuantConnect's framework but did not extract the core alpha signal from the paper. The result resembled a template more than a strategy.
GPT-4o (Second Attempt)
The second attempt included logic to compute volatility but did so incorrectly, calling self.History() inside the CoarseSelectionFunction, which is not permitted by QuantConnect. Despite this, the design intent was closer to the paper than in previous versions.
GPT-4.1 (Two Attempts)
Both versions generated by GPT-4.1 successfully implemented the core mechanics of the low volatility anomaly strategy. Each version computed volatility using log returns, maintained internal state correctly, applied scheduled monthly rebalancing, and incorporated benchmarking against SPY. Crucially, both algorithms compiled without error and executed backtests on QuantConnect without requiring manual intervention.
Note: the LEAN engine is used primarily as a compilation and runtime check. Full backtests are subsequently replicated on the QuantConnect platform for performance evaluation and visualization.
Code and backtest results are available here.
Of the two, the second implementation stands out for its cleaner structure and higher maintainability. It demonstrates a clearer separation of responsibilities across methods, encapsulates volatility calculations within the rebalance logic, and manages universe changes more robustly. These attributes align more closely with QuantConnect development best practices, making the second version the stronger candidate for production use and public code release.
On the Value of Prompting vs. Agentic Design
A natural question arises: could the same implementation quality be achieved by prompting GPT-4.1 directly with a well-structured description of the paper?
The answer is conditionally yes. If the relevant portions of the paper are pre-extracted and the model is given a clear, structured prompt emphasizing:
Universe selection
Signal construction
Rebalancing
Portfolio weights
Risk constraints
Then GPT-4.1 is fully capable of producing equivalent code in a single-shot API call.
However, this presumes that the user has already:
Extracted and structured the relevant data
Removed theoretical or behavioral finance content
Identified the strategy implementation scope
The QuantCoderFS architecture adds intrinsic value by automating these steps: parsing, summarization, validation, and modular code generation. This is particularly important when scaling to:
Long and complex documents
Multiple strategy sections
Noisy or unstructured formats
In such cases, prompt engineering alone becomes brittle and unscalable.
Implementation Issues and Code Design Improvements
The benchmarking revealed several recurrent implementation challenges:
Incorrect API usage
self.History() must not be called in the CoarseSelectionFunction. Instead, history requests should be deferred to Rebalance() or OnSecuritiesChanged().
Missing rebalancing schedule
Several outputs omitted Schedule.On(...), resulting in either no trades or continuous daily logic.
Insufficient selection logic
Some outputs did not sort by volatility, use quintiles, or apply ranking at all.
Recommended code improvements:
Always separate data fetching from universe selection
Use log returns for volatility: returns = np.diff(np.log(closes))
Ensure all SetHoldings() calls are gated by IsTradable
Store selected symbols and rebalance state clearly
Use RollingWindow only where appropriate and size-limited
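The size-limited window recommended in the last point can be mimicked in plain Python with collections.deque. RollingWindow is the actual Lean class; the class below is a hypothetical stand-in used only to illustrate the bounded-memory pattern.

```python
from collections import deque


class BoundedWindow:
    """Keeps only the most recent `size` observations, in the spirit of Lean's RollingWindow."""

    def __init__(self, size: int):
        self._data = deque(maxlen=size)  # old values are dropped automatically

    def add(self, value: float) -> None:
        self._data.append(value)

    @property
    def is_ready(self) -> bool:
        # Ready only once the window is fully populated
        return len(self._data) == self._data.maxlen

    def values(self) -> list[float]:
        # Most recent observation first, matching RollingWindow indexing
        return list(reversed(self._data))
```

Because the deque enforces its own maximum length, the window can never grow unbounded, which is the failure mode the recommendation guards against.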
Recommendations for QuantCoderFS
To improve robustness and accuracy across varied academic inputs:
Add a dedicated SignalExtractorAgent
Extracts structured strategy logic from textual input (lookback, metric, selection rule, etc.)
Refactor the SummaryTool prompt
Guide the model to return actionable implementation metadata in JSON or bullet format
Add a strategy completeness validator
Checks for signal logic, universe constraints, rebalance method, and weight allocation before passing to code generation
Use GPT-4.1 for all production code tasks
Until GPT-4o shows parity in multi-step financial logic, restrict GPT-4o use to summarization and user-facing tasks
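One hypothetical shape for that implementation metadata, together with a minimal completeness check, is sketched below. The field names and required sections are illustrative assumptions, not an existing QuantCoderFS schema.

```python
import json

# Hypothetical metadata a SignalExtractorAgent could emit for the low volatility paper
metadata_json = """
{
  "universe": {"min_price": 5, "liquidity_rank_top": 1000},
  "signal": {"metric": "volatility", "lookback_days": 252, "returns": "log"},
  "selection": {"rule": "lowest_quintile"},
  "rebalance": {"frequency": "monthly"},
  "weights": {"scheme": "equal"}
}
"""

# Sections a completeness validator would require before code generation
REQUIRED_SECTIONS = ("universe", "signal", "selection", "rebalance", "weights")


def is_complete(raw: str) -> bool:
    """Return True only if every required section is present and non-empty."""
    spec = json.loads(raw)
    return all(spec.get(section) for section in REQUIRED_SECTIONS)
```

Gating code generation on a check like this would catch the incomplete outputs observed with GPT-3.5 and the first GPT-4o attempt before any Lean code is written.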
Costs of code generation
Given the results of this benchmark, it is evident that GPT-4.1 is currently necessary to produce reliably implementable trading strategies under the QuantCoderFS agentic configuration. The average cost of generating a complete algorithm with GPT-4.1 via the API is approximately $0.70 per run, based on current token pricing and workflow length.
For users seeking a more cost-effective alternative, particularly for exploratory or lower-fidelity use cases, it is advisable to use the legacy NLP-based pipeline (built with LangChain). While less robust in handling complex documents, the legacy version avoids multi-agent orchestration and relies on prompt-chained summarization and code generation using older, cheaper models such as GPT-3.5. This approach is suitable when high implementation accuracy is not a strict requirement.
Conclusion
This study provides a foundational benchmark for evaluating the current capabilities of QuantCoderFS in translating academic research into executable trading strategies. The results demonstrate that high-fidelity code generation is achievable using GPT-4.1, particularly when combined with structured document parsing and a modular agent-based workflow. These findings offer practical guidance for users aiming to extract implementable alpha directly from research literature, while also highlighting the architectural and model-level dependencies of the system.
However, a full assessment of QuantCoderFS's effectiveness will require the generation and validation of a broader set of strategies. Only through repeated application across diverse papers and asset classes can its generalizability be realistically evaluated and future development priorities be clearly defined.
References
QuantCoderFS repository on GitHub
Thanks for reading. Happy coding and trading!
S.M.L.