A Bespoke LLM Code Scanner

Building a Nightly AI Code Scanner with vLLM, ROCm, and JIRA Integration

I've been running a ballistics calculation engine — a Rust physics library surrounded by several components: a Flask app wrapper with machine learning capabilities, Python bindings, a Ruby gem, and Android and iOS apps. The codebase has grown to about 15,000 lines of Rust and another 10,000 lines of Python. At this scale, bugs hide in edge cases: division by zero, floating-point precision issues in transonic drag calculations, unwrap() panics on unexpected input.

What if I could run an AI code reviewer every night while I sleep? Not a cloud API with per-token billing that could run up a $500 bill scanning 50 files, but a local model running on my own hardware, grinding through the codebase and filing JIRA tickets for anything suspicious.

This is the story of building that system.

The Hardware: AMD Strix Halo on ROCm 7.0

I'm running this on a server with an AMD Radeon 8060S (Strix Halo APU) — specifically the gfx1151 architecture. This isn't a data center GPU. It's essentially an integrated GPU with 128GB of shared memory, configured to give 96GB to the GPU and leave the rest to system RAM. Not the 80GB of HBM3 you'd get on an H100, but enough to run a 32B parameter model comfortably.

The key insight: for batch processing where latency doesn't matter, you don't need bleeding-edge hardware. A nightly scan can take hours. I'm not serving production traffic; I'm analyzing code files one at a time with a 30-second cooldown between requests. The APU handles this fine.

Hardware Configuration:
- AMD Radeon 8060S (gfx1151 Strix Halo APU)
- 96GB shared memory
- ROCm 7.0 with HSA_OVERRIDE_GFX_VERSION=11.5.1

The HSA_OVERRIDE_GFX_VERSION environment variable is critical. Without it, ROCm doesn't recognize the Strix Halo architecture. This is the kind of sharp edge you hit running ML on AMD consumer hardware.
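
Before launching vLLM, it's worth a quick sanity check that PyTorch's ROCm build actually sees the GPU with the override in place. A minimal sketch, assuming a ROCm build of PyTorch is installed in the same venv (the expected device name is specific to my box; everything else is standard torch API):

import os

# The override must be in the environment before the ROCm runtime initializes.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.5.1")

import torch  # imported after setting the env var on purpose

# torch.cuda maps onto the ROCm/HIP backend on AMD builds of PyTorch.
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
    free, total = torch.cuda.mem_get_info()
    print(f"VRAM: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("No GPU visible - check HSA_OVERRIDE_GFX_VERSION and the ROCm install")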

Model Selection: Qwen2.5-Coder-7B-Instruct

I tested several models:

| Model | Parameters | Context | Quality | Notes |
|---|---|---|---|---|
| DeepSeek-Coder-V2-Lite | 16B | 32k | Good | Requires flash_attn (ROCm issues) |
| Qwen3-Coder-30B | 30B | 32k | Excellent | Too slow on APU |
| Qwen2.5-Coder-7B-Instruct | 7B | 16k | Good | Sweet spot |
| TinyLlama-1.1B | 1.1B | 4k | Poor | Too small for code review |

Qwen2.5-Coder-7B-Instruct hits the sweet spot. It understands Rust and Python well enough to spot real issues, runs fast enough to process 50 files per night, and doesn't require flash attention (which has ROCm compatibility issues on consumer hardware).

vLLM Setup

vLLM provides an OpenAI-compatible API server that makes integration trivial. Here's the startup command:

source ~/vllm-rocm7-venv/bin/activate
export HSA_OVERRIDE_GFX_VERSION=11.5.1
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85

The --max-model-len 16384 limits context to 16k tokens. My code files rarely exceed 500 lines, and they're truncated further before analysis (more on that below), so this is plenty. The --gpu-memory-utilization 0.85 leaves headroom for the system.

I run this in a Python venv rather than Docker because ROCm device passthrough with Docker on Strix Halo is finicky. Sometimes you have to choose pragmatism over elegance.
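
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint confirms everything is wired up before the scanner ever touches it. A minimal sketch using requests (the prompt is arbitrary; host and port match the startup command above):

import requests

payload = {
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "max_tokens": 16,
    "temperature": 0.0,
}

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])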

Docker Configuration (When It Works)

For reference, here's the Docker Compose configuration I initially built. It works on dedicated AMD GPUs but has issues on integrated APUs:

services:
  vllm:
    image: rocm/vllm-dev:latest
    container_name: vllm-code-scanner
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    security_opt:
      - seccomp:unconfined
    cap_add:
      - SYS_PTRACE
    ipc: host
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.5.1
      - PYTORCH_ROCM_ARCH=gfx1151
      - HIP_VISIBLE_DEVICES=0
    volumes:
      - /home/alex/models:/models
      - /home/alex/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      python -m vllm.entrypoints.openai.api_server
      --model Qwen/Qwen2.5-Coder-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --max-model-len 16384
      --gpu-memory-utilization 0.85
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  scanner:
    build: .
    container_name: code-scanner-agent
    depends_on:
      vllm:
        condition: service_healthy
    environment:
      - VLLM_HOST=vllm
      - VLLM_PORT=8000
      - JIRA_EMAIL=${JIRA_EMAIL}
      - JIRA_API_KEY=${JIRA_API_KEY}
    volumes:
      - /home/alex/projects:/projects:ro
      - ./config:/app/config:ro
      - /home/alex/projects/code-scanner-results:/app/results

The ipc: host and seccomp:unconfined are necessary for ROCm to function properly. The depends_on with service_healthy ensures the scanner waits for vLLM to be fully loaded before starting — important since model loading can take 2-3 minutes.

The scanner Dockerfile is minimal:

FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    git curl ripgrep \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY agent/ /app/agent/
COPY prompts/ /app/prompts/
COPY config/ /app/config/

CMD ["python", "-m", "agent.scanner"]

Including ripgrep in the container enables fast pattern matching when the scanner needs to search for related code.

The Scanner Architecture

The system has three main components:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Systemd       │     │    vLLM         │     │     JIRA        │
│   Timer         │────▶│    Server       │────▶│     API         │
│   (11pm daily)  │     │  (Qwen 7B)      │     │   (tickets)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │   Scanner Agent     │
                    │ - File discovery    │
                    │ - Code analysis     │
                    │ - Finding validation│
                    │ - JIRA integration  │
                    └─────────────────────┘

Configuration

Everything is driven by a YAML configuration file:

vllm:
  host: "10.1.1.27"
  port: 8000
  model: "Qwen/Qwen2.5-Coder-7B-Instruct"

schedule:
  start_hour: 23  # 11pm
  end_hour: 6     # 6am
  max_iterations: 50
  cooldown_seconds: 30

repositories:
  - name: "ballistics-engine"
    path: "/home/alex/projects/ballistics-engine"
    languages: ["rust"]
    scan_patterns:
      - "src//*.rs"
    exclude_patterns:
      - "target/"
      - "*.lock"

  - name: "ballistics-api"
    path: "/home/alex/projects/ballistics-api"
    languages: ["python", "rust"]
    scan_patterns:
      - "ballistics//*.py"
      - "ballistics_rust/src//*.rs"
    exclude_patterns:
      - "__pycache__/"
      - "target/"
      - ".venv/"

jira:
  enabled: true
  project_key: "MBA"
  confidence_threshold: 0.75
  labels: ["ai-detected", "code-scanner"]
  max_tickets_per_run: 10
  review_threshold: 5

The confidence_threshold: 0.75 is crucial. Without it, the model reports every minor style issue. At 75%, it focuses on things it's genuinely concerned about.

The review_threshold: 5 triggers a different behavior: if the model finds more than 5 issues, it creates a single summary ticket for manual review rather than flooding JIRA with individual tickets. This is a safety valve for when the model goes haywire.
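
A sketch of how that safety valve might be applied in the agent, assuming the YAML has been loaded into a config dict and high_confidence holds the findings that cleared the confidence threshold (the jira client methods here are illustrative names, not the repo's exact API):

def dispatch_findings(high_confidence, config, jira):
    """Apply max_tickets_per_run and review_threshold from the YAML config."""
    jira_cfg = config["jira"]

    if not jira_cfg["enabled"] or not high_confidence:
        return

    if len(high_confidence) > jira_cfg["review_threshold"]:
        # Too many findings in one run: file a single summary ticket for
        # manual review instead of flooding the board with individual tickets.
        jira.create_summary_ticket(high_confidence)  # illustrative method name
        return

    for finding in high_confidence[: jira_cfg["max_tickets_per_run"]]:
        jira.create_ticket(finding)  # illustrative method name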

Structured Outputs with Pydantic

LLMs are great at finding issues but terrible at formatting output consistently. Left to their own devices, they'll return findings as markdown, prose, JSON with missing fields, or creative combinations thereof.

The solution is structured outputs. I define Pydantic models for exactly what I expect:

from enum import Enum
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError

class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"

class FindingType(str, Enum):
    BUG = "bug"
    PERFORMANCE = "performance"
    SECURITY = "security"
    CODE_QUALITY = "code_quality"
    POTENTIAL_ISSUE = "potential_issue"

class CodeFinding(BaseModel):
    file_path: str = Field(description="Path to the file")
    line_start: int = Field(description="Starting line number")
    line_end: Optional[int] = Field(default=None)
    finding_type: FindingType
    severity: Severity
    title: str = Field(max_length=100)
    description: str
    suggestion: Optional[str] = None
    confidence: float = Field(ge=0.0, le=1.0)
    code_snippet: Optional[str] = None

The confidence field is a float between 0 and 1. The model learns to be honest about uncertainty — "I think this might be a bug (0.6)" versus "This is definitely division by zero (0.95)."

In a perfect world, I'd use vLLM's Outlines integration for guided JSON generation. In practice, I found that prompting Qwen for JSON and parsing the response works reliably:

def _analyze_code(self, file_path: str, content: str) -> List[CodeFinding]:
    messages = [
        {"role": "system", "content": self.system_prompt},
        {"role": "user", "content": f"""Analyze this code for bugs and issues.

File: {file_path}

{content}

Return a JSON array of findings. Each finding must have:
- file_path: string
- line_start: number
- finding_type: "bug" | "performance" | "security" | "code_quality"
- severity: "critical" | "high" | "medium" | "low" | "info"
- title: string (max 100 chars)
- description: string
- suggestion: string or null
- confidence: number 0-1

If no issues found, return an empty array: []"""}
    ]

    response = self._call_llm(messages)

    # Parse JSON from response (handles markdown code blocks too)
    if response.strip().startswith('['):
        findings_data = json.loads(response)
    elif '```json' in response:
        json_str = response.split('```json')[1].split('```')[0]
        findings_data = json.loads(json_str)
    elif '[' in response:
        start = response.index('[')
        end = response.rindex(']') + 1
        findings_data = json.loads(response[start:end])
    else:
        return []

    # Validate each finding with Pydantic
    findings = []
    for item in findings_data:
        try:
            finding = CodeFinding(**item)
            findings.append(finding)
        except ValidationError:
            pass  # Skip malformed findings

    return findings
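
The _call_llm helper referenced above is just a thin wrapper around vLLM's chat completions endpoint. A minimal sketch of how it might look, assuming self.base_url points at the server's /v1 prefix and self.model carries the model name from the config:

def _call_llm(self, messages: List[dict]) -> str:
    """Send a chat request to the vLLM OpenAI-compatible endpoint."""
    payload = {
        "model": self.model,
        "messages": messages,
        "temperature": 0.1,   # keep the reviewer close to deterministic
        "max_tokens": 4096,   # room for a full JSON findings array
    }
    response = requests.post(
        f"{self.base_url}/chat/completions",
        json=payload,
        headers={"Content-Type": "application/json"},
        timeout=600,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]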

The System Prompt

The system prompt is where you teach the model what you care about. Here's mine:

You are an expert code reviewer specializing in Rust and Python.
Your job is to find bugs, performance issues, security vulnerabilities,
and code quality problems.

You are analyzing code from a ballistics calculation project that includes:
- A Rust physics engine for trajectory calculations
- Python Flask API with ML models
- PyO3 bindings between Rust and Python

Key areas to focus on:
1. Numerical precision issues (floating point errors, rounding)
2. Edge cases in physics calculations (division by zero, negative values)
3. Memory safety in Rust code
4. Error handling (silent failures, unwrap panics)
5. Performance bottlenecks (unnecessary allocations, redundant calculations)
6. Security issues (input validation, injection vulnerabilities)

Be conservative with findings - only report issues you are confident about.
Avoid false positives.

The phrase "Be conservative with findings" is doing heavy lifting. Without it, the model reports everything that looks slightly unusual. With it, it focuses on actual problems.

Timeout Handling

Large files (500+ lines) can take a while to analyze. My initial 120-second timeout caused failures on complex files. I bumped it to 600 seconds (10 minutes):

response = requests.post(
    f"{self.base_url}/chat/completions",
    json=payload,
    headers={"Content-Type": "application/json"},
    timeout=600
)

I also truncate files to 300 lines. For longer files, the model only sees the first 300 lines. This is a trade-off — I might miss bugs in the back half of long files — but it keeps scans predictable and prevents timeout cascades. I plan to revisit this in future iterations.

lines = content.split('\n')
if len(lines) > 300:
    content = '\n'.join(lines[:300])
    logger.info("Truncated to 300 lines for analysis")

JIRA Integration

When the scanner finds issues, it creates JIRA tickets automatically. The API is straightforward:

def create_jira_tickets(self, findings: List[CodeFinding]):
    jira_base_url = f"https://{jira_domain}/rest/api/3"

    for finding in findings:
        # Map severity to JIRA priority
        priority_map = {
            Severity.CRITICAL: "Highest",
            Severity.HIGH: "High",
            Severity.MEDIUM: "Medium",
            Severity.LOW: "Low",
            Severity.INFO: "Lowest"
        }

        payload = {
            "fields": {
                "project": {"key": "MBA"},
                "summary": f"[AI] {finding.title}",
                "description": {
                    "type": "doc",
                    "version": 1,
                    "content": [{"type": "paragraph", "content": [
                        {"type": "text", "text": build_description(finding)}
                    ]}]
                },
                "issuetype": {"name": "Bug" if finding.finding_type == FindingType.BUG else "Task"},
                "priority": {"name": priority_map[finding.severity]},
                "labels": ["ai-detected", "code-scanner"]
            }
        }

        response = requests.post(
            f"{jira_base_url}/issue",
            json=payload,
            auth=(jira_email, jira_api_key),
            headers={"Content-Type": "application/json"}
        )

The [AI] prefix in the summary makes it obvious these tickets came from the scanner. The ai-detected label allows filtering.

I add a 2-second delay between ticket creation to avoid rate limiting:

time.sleep(2)  # Rate limit protection
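
The build_description helper used in the payload above just flattens a finding into readable ticket text. A simple sketch (the exact wording of the body is mine, not the repo's):

def build_description(finding: CodeFinding) -> str:
    """Render a finding as plain text for the JIRA ticket body."""
    line_range = f"{finding.line_start}"
    if finding.line_end:
        line_range += f"-{finding.line_end}"
    lines = [
        f"File: {finding.file_path}",
        f"Lines: {line_range}",
        f"Severity: {finding.severity.value} | Confidence: {finding.confidence:.0%}",
        "",
        finding.description,
    ]
    if finding.suggestion:
        lines += ["", f"Suggested fix: {finding.suggestion}"]
    if finding.code_snippet:
        lines += ["", "Snippet:", finding.code_snippet]
    lines += ["", "Filed automatically by the nightly code scanner."]
    return "\n".join(lines)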

Systemd Scheduling

The scanner runs nightly via systemd timer:

# /etc/systemd/system/code-scanner.timer
[Unit]
Description=Run Code Scanner nightly at 11pm

[Timer]
OnCalendar=*-*-* 23:00:00
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target

The RandomizedDelaySec=300 adds up to 5 minutes of random delay. This prevents the scanner from always starting at exactly 11:00:00, which helps if multiple services share the same schedule.

The service unit is a oneshot that runs the scanner script:

# /etc/systemd/system/code-scanner.service
[Unit]
Description=Code Scanner Agent
After=docker.service

[Service]
Type=oneshot
User=alex
WorkingDirectory=/home/alex/projects/ballistics/code-scanner
ExecStart=/home/alex/projects/ballistics/code-scanner/scripts/start_scanner.sh
TimeoutStartSec=25200

The TimeoutStartSec=25200 (7 hours) gives the scanner enough time to complete even if it scans every file.

Sample Findings

Here's what the scanner actually finds. From a recent run:

{
  "file_path": "/home/alex/projects/ballistics-engine/src/fast_trajectory.rs",
  "line_start": 115,
  "finding_type": "bug",
  "severity": "high",
  "title": "Division by zero in fast_integrate when velocity approaches zero",
  "description": "The division dt / velocity_magnitude could result in division by zero if the projectile stalls (velocity_magnitude = 0). This can happen at the apex of a high-angle shot.",
  "suggestion": "Add a check for velocity_magnitude < epsilon before division, or clamp to a minimum value.",
  "confidence": 0.85
}

This is a real issue. In ballistics calculations, a projectile fired at a high angle momentarily has zero horizontal velocity at the apex. Without a guard, this causes a panic.

Not every finding is valid. The model occasionally flags intentional design decisions as "issues." But at a 75% confidence threshold, the false positive rate is manageable — maybe 1 in 10 findings needs to be closed as "not a bug."

Trade-offs and Lessons

What works well:
  • Finding numerical edge cases (division by zero, overflow)
  • Spotting unwrap() calls on Options that might be None
  • Identifying missing error handling
  • Flagging dead code and unreachable branches

What doesn't work as well:
  • Understanding business logic (the model doesn't know physics)
  • Spotting subtle race conditions in concurrent code
  • False positives on intentional patterns

Operational lessons:
  • Start with a low iteration limit (10-20 files) to test the pipeline
  • Monitor the first few runs manually before trusting it
  • Keep credentials in .env files excluded from rsync
  • The 300-line truncation is aggressive; consider chunking for long files (see the sketch below)
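
On that last point, here's a sketch of what chunking could look like in place of the hard 300-line cut: split long files into overlapping windows so findings near a boundary aren't lost (the window and overlap sizes are guesses, not tuned values):

def chunk_lines(content: str, chunk_size: int = 300, overlap: int = 30):
    """Yield (start_line, chunk_text) windows over a long file."""
    lines = content.split('\n')
    if len(lines) <= chunk_size:
        yield 1, content
        return
    step = chunk_size - overlap
    for start in range(0, len(lines), step):
        chunk = lines[start:start + chunk_size]
        # 1-based starting line number, so reported line numbers can be offset back
        yield start + 1, '\n'.join(chunk)
        if start + chunk_size >= len(lines):
            break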

Handling JSON Parse Failures

Despite asking for JSON, LLMs sometimes produce malformed output. I see two failure modes:

  1. Truncated JSON: The model runs out of tokens mid-response, leaving an unterminated string or missing closing brackets.
  2. Wrapped JSON: The model adds explanatory text around the JSON, like "Here are the findings:" before the array.

My parser handles both:

def parse_findings_response(response: str) -> list:
    """Extract JSON from potentially messy LLM output."""
    response = response.strip()

    # Best case: raw JSON array
    if response.startswith('['):
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            pass  # Fall through to extraction

    # Common case: JSON in markdown code block
    if '```json' in response:
        try:
            json_str = response.split('```json')[1].split('```')[0]
            return json.loads(json_str)
        except (IndexError, json.JSONDecodeError):
            pass

    # Fallback: extract JSON array from surrounding text
    if '[' in response and ']' in response:
        try:
            start = response.index('[')
            end = response.rindex(']') + 1
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            pass

    # Give up
    logger.warning("Could not extract JSON from response")
    return []

When parsing fails, I log the error and skip that file rather than crashing the entire scan. In a typical 50-file run, I see 2-3 parse failures — annoying but acceptable.

Testing the Pipeline

Before trusting the scanner with JIRA ticket creation, I ran it in "dry run" mode:

# Set max iterations low and disable JIRA
export MAX_ITERATIONS=5
# In config: jira.enabled: false

python run_scanner_direct.py

This scans just 5 files and prints findings without creating tickets. I manually reviewed each finding:

  • True positive: Division by zero in trajectory calculation — good catch
  • False positive: Flagged intentional unwrap() on a guaranteed-Some Option — needs better context
  • True positive: Dead code path never executed — valid cleanup suggestion
  • Marginal: Style suggestion about variable naming — below my quality threshold

After tuning the confidence threshold and system prompt, the true positive rate improved to roughly 90%.

Monitoring and Observability

The scanner writes detailed logs to stdout and a JSON results file. Sample log output:

2025-11-26 15:48:25 - CODE SCANNER AGENT STARTING
2025-11-26 15:48:25 - Max iterations: 50
2025-11-26 15:48:25 - Model: Qwen/Qwen2.5-Coder-7B-Instruct
2025-11-26 15:48:25 - Starting scan of ballistics-engine
2025-11-26 15:48:25 - Found 35 files to scan
2025-11-26 15:48:25 - Scanning: src/trajectory_sampling.rs
2025-11-26 15:48:25 -   Truncated to 300 lines for analysis
2025-11-26 15:49:24 -   Found 5 findings (>= 75% confidence)
2025-11-26 15:49:24 -     [LOW] Redundant check for step_m value
2025-11-26 15:49:24 -     [LOW] Potential off-by-one error

The JSON results include full finding details:

{
  "timestamp": "20251126_151136",
  "total_findings": 12,
  "repositories": [
    {
      "repository": "ballistics-engine",
      "files_scanned": 35,
      "findings": [...],
      "duration_seconds": 1842.5,
      "iterations_used": 35
    }
  ]
}

I keep the last 30 result files (configurable) for historical comparison. Eventually I'll build a dashboard showing finding trends over time.
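
Retention is a few lines of housekeeping. A sketch of pruning old result files, assuming they live in a single results directory (the glob pattern and keep count are illustrative; the real file naming may differ):

from pathlib import Path

def prune_results(results_dir: str, keep: int = 30) -> None:
    """Delete all but the newest `keep` JSON result files."""
    files = sorted(
        Path(results_dir).glob("*.json"),   # adjust to the real naming scheme
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for old in files[keep:]:
        old.unlink()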

What's Next

The current system is batch-oriented: run once per night, file tickets, done. Future improvements I'm considering:

  1. Pre-commit integration: Run on changed files only, fast enough for CI
  2. Retrieval-augmented context: Include related files when analyzing (e.g., when scanning a function, include its callers)
  3. Learning from feedback: Track which tickets get closed as "not a bug" and use that to tune prompts
  4. Multi-model ensemble: Run the same code through two models, only file tickets when both agree

For now, though, the simple approach works. Every morning I check JIRA, triage the overnight findings, and fix the real bugs. The model isn't perfect, but it finds things I miss. And unlike a human reviewer, it never gets tired, never skips files, and never has a bad day.

Get the Code

I've open-sourced the complete scanner implementation on GitHub: llm-code-scanner

The project includes:

  • Dual scanning modes: Fast nightly scans via vLLM and comprehensive weekly analyses through Ollama
  • Smart deduplication: SQLite database prevents redundant issue tracking across runs
  • JIRA integration: Automatically creates tickets for findings above your confidence threshold
  • Email reports: SendGrid integration for daily/weekly summaries
  • Multi-language support: Python, Rust, TypeScript, Kotlin, Swift, Go, and more

To get started, clone the repo, configure your scanner_config.yaml with your vLLM/Ollama server details, and run python -m agent.scanner. The README has full setup instructions including environment variables for JIRA and SendGrid integration.

Building Cross-Platform Rust Binaries: A Multi-Architecture Build Orchestration System

When developing ballistics-engine, a high-performance ballistics calculation library written in Rust, I faced a challenge: how do I efficiently build and distribute binaries for multiple operating systems and architectures? The answer led to the creation of an automated build orchestration system that leverages diverse hardware—from single-board computers to powerful x86_64 servers—to build native binaries for macOS, Linux, FreeBSD, NetBSD, and OpenBSD across both ARM64 and x86_64 architectures. Now, you are probably wondering why I am bothering to show love for the BSD Trilogy; the answer is simple: because I want to. Sure they are a bit esoteric, but I ran FreeBSD for years as my mail server. I still like the BSDs.

This article explores the architecture, implementation, and lessons learned from building a production-grade multi-platform build system that powers https://ballistics.zip, where users can download pre-built binaries for their platform with a simple curl command.

curl --proto '=https' --tlsv1.2 -sSf https://ballistics.zip/install.sh | sh

The Problem: Cross-Platform Distribution

Rust's cross-compilation capabilities are impressive, but they have limitations:

  • Cross-compilation complexity: While Rust supports cross-compilation, getting it working reliably for BSD systems (especially with system dependencies) is challenging
  • Native testing: You need to test on actual hardware to ensure binaries work correctly
  • Binary compatibility: Different BSD versions and configurations require native builds
  • Performance verification: Emulated builds may behave differently than native ones

The solution? Build natively on each target platform using actual hardware or high-performance emulation.

Architecture Overview

The build orchestration system consists of three main components:

1. Build Nodes (Physical and Virtual Machines)

  • macOS systems (x86_64 and aarch64) - Local builds
  • Linux x86_64 server - Remote build via SSH
  • FreeBSD ARM64 - Single-board computer (Raspberry Pi 4)
  • OpenBSD ARM64 - QEMU VM emulated on x86_64 (rig.localnet)
  • NetBSD x86_64 and ARM64 - QEMU VMs

2. Orchestrator (Python-based coordinator)

  • Reads build node configuration from build-nodes.yaml
  • Executes builds in parallel across all nodes
  • Collects artifacts via SSH/SCP
  • Generates SHA256 checksums
  • Uploads to Google Cloud Storage
  • Updates version metadata

3. Distribution (ballistics.zip website)

  • Serves install script at https://ballistics.zip
  • Hosts binaries in GCS bucket (gs://ballistics-releases/)
  • Provides version detection and automatic downloads
  • Supports version fallback for platforms with delayed releases

Hardware Infrastructure

Single-Board Computers

Orange Pi 5 Max (ARM64)

  • Role: Host for NetBSD ARM64 VM
  • CPU: Rockchip RK3588 (8-core ARM Cortex-A76/A55)
  • RAM: 16GB
  • Why: Native ARM64 hardware for running QEMU VMs
  • Host IP: 10.1.1.10
  • VM IPs:
  • NetBSD ARM64: 10.1.1.15
  • OpenBSD ARM64 (native, disabled): 10.1.1.11

Raspberry Pi 4 (ARM64)

  • Role: FreeBSD ARM64 native builds
  • CPU: Broadcom BCM2711 (quad-core Cortex-A72)
  • RAM: 8GB
  • Why: Stable FreeBSD support, reliable ARM64 platform
  • IP: 10.1.1.7

x86_64 ("rig.localnet")

  • Role: Linux builds, BSD VM host, emulated ARM64 builds
  • CPU: Intel i9
  • RAM: 96GB
  • IP: 10.1.1.27 (Linux host), 10.1.1.17 (KVM host)
  • VMs Hosted:
  • FreeBSD x86_64: 10.1.1.21
  • OpenBSD x86_64: 10.1.1.20
  • OpenBSD ARM64 (emulated): 10.1.1.23
  • NetBSD x86_64: 10.1.1.19

Local macOS Development Machine

  • Role: macOS binary builds (both architectures)
  • Build Method: Local cargo builds with target flags
  • Architectures:
  • aarch64-apple-darwin (Apple Silicon)
  • x86_64-apple-darwin (Intel Macs)

A Surprising Discovery: Emulated ARM64 Performance

One of the most interesting findings during development was that emulated ARM64 builds on powerful x86_64 hardware are significantly faster than native ARM64 builds on ARM single-board computers.

Performance Comparison

  • Native ARM64 on the Orange Pi 5 Max: ~99+ minutes per build
  • Emulated ARM64 on x86_64: 15m 37s ⚡

The emulated build on rig.localnet (QEMU software emulation of aarch64 on x86_64; KVM cannot accelerate a guest of a different architecture) completed in roughly one-sixth the time of the native ARM64 hardware. This is because:

  1. The x86_64 server has significantly more powerful CPU cores
  2. Even QEMU's software (TCG) emulation on fast x86_64 cores outruns a low-power ARM SoC
  3. Rust compilation is primarily CPU-bound and benefits from faster single-core performance
  4. The x86_64 server has faster storage (NVMe vs eMMC/SD card)

As a result, the native OpenBSD ARM64 node on the Orange Pi is now disabled in favor of the emulated version.

Prerequisites

SSH Key-Based Authentication

Critical: The orchestration system requires passwordless SSH access to all remote build nodes. Here's how to set it up:

  1. Generate SSH key (if you don't have one):
ssh-keygen -t ed25519 -C "build-orchestrator"
  2. Copy public key to each build node:
# For each build node
ssh-copy-id user@build-node-ip

# Examples:
ssh-copy-id alex@10.1.1.27       # Linux x86_64
ssh-copy-id freebsd@10.1.1.7     # FreeBSD ARM64
ssh-copy-id root@10.1.1.20       # OpenBSD x86_64
ssh-copy-id root@10.1.1.23       # OpenBSD ARM64 emulated
ssh-copy-id root@10.1.1.19       # NetBSD x86_64
ssh-copy-id root@10.1.1.15       # NetBSD ARM64
  3. Test SSH access:
ssh user@build-node-ip "uname -a"

Software Requirements

On Build Orchestrator Machine:

  • Python 3.8+
  • pyyaml (pip install pyyaml)
  • Google Cloud SDK (gcloud command) for GCS uploads
  • SSH client

On Each Build Node:

  • Rust toolchain (cargo, rustc)
  • Build essentials (compiler, linker)
  • curl, wget, or ftp (for downloading source)
  • Sufficient disk space (~2GB for build artifacts)

BSD-Specific Requirements

NetBSD: Install curl via pkgsrc (native ftp doesn't support HTTPS)

# Bootstrap pkgsrc
cd /usr && ftp -o pkgsrc.tar.gz http://cdn.netbsd.org/pub/pkgsrc/current/pkgsrc.tar.gz
tar -xzf pkgsrc.tar.gz
cd /usr/pkgsrc/bootstrap && ./bootstrap --prefix=/usr/pkg

# Install curl
/usr/pkg/bin/pkgin -y update
/usr/pkg/bin/pkgin -y install curl

OpenBSD: Native ftp supports HTTPS

pkg_add rust git

FreeBSD: Use pkg for everything

pkg install -y rust git curl

The ballistics.zip Website and Install Script

How It Works

https://ballistics.zip serves as the primary distribution point for pre-built ballistics-engine binaries. The system uses:

  1. GCS Bucket: gs://ballistics-releases/ - binary artifacts
  2. CDN: Google Cloud CDN provides global distribution
  3. Install Script: Universal installer that:
     • Detects OS and architecture
     • Downloads the appropriate binary
     • Verifies SHA256 checksum
     • Installs to /usr/local/bin

Usage

Basic installation:

curl -sSL https://ballistics.zip/install.sh | bash

Specific version:

curl -sSL https://ballistics.zip/install.sh | bash -s -- --version 0.13.3

Different install location:

curl -sSL https://ballistics.zip/install.sh | bash -s -- --prefix ~/.local

Install Script Architecture

The install.sh script intelligently handles:

Platform Detection:

OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)

case "$ARCH" in
  x86_64|amd64) ARCH="x86_64" ;;
  aarch64|arm64) ARCH="aarch64" ;;
  *) echo "Unsupported architecture: $ARCH"; exit 1 ;;
esac

PLATFORM="${OS}-${ARCH}"  # e.g., "openbsd-aarch64"

Version Fallback: If a requested version isn't available for a platform, the script automatically finds the latest available version:

# If openbsd-aarch64 0.13.3 doesn't exist, fall back to 0.13.2
AVAILABLE_VERSION=$(curl -sL $BASE_URL/versions.txt | grep "^$PLATFORM:" | cut -d: -f2)

Checksum Verification:

EXPECTED_SHA=$(cat "$BINARY.sha256")
ACTUAL_SHA=$(sha256sum "$BINARY" | awk '{print $1}')

if [ "$EXPECTED_SHA" != "$ACTUAL_SHA" ]; then
  echo "Checksum verification failed!"
  exit 1
fi

Build Orchestration System Deep Dive

Configuration: build-nodes.yaml

The heart of the system is build-nodes.yaml, which defines all build targets:

nodes:
  # macOS builds (local machine)
  - name: macos-aarch64
    host: local
    target: aarch64-apple-darwin
    build_command: |
      cd /tmp && rm -rf ballistics-engine-{version}
      curl -L -o v{version}.tar.gz https://github.com/ajokela/ballistics-engine/archive/refs/tags/v{version}.tar.gz
      tar xzf v{version}.tar.gz
      cd ballistics-engine-{version}
      cargo build --release --target {target}
    binary_path: /tmp/ballistics-engine-{version}/target/{target}/release/ballistics
    enabled: true

  # Linux x86_64 (remote via SSH)
  - name: linux-x86_64
    host: alex@10.1.1.27
    target: x86_64-unknown-linux-gnu
    build_command: |
      cd /tmp && rm -rf ballistics-engine-{version}
      wget -q https://github.com/ajokela/ballistics-engine/archive/refs/tags/v{version}.tar.gz
      tar xzf v{version}.tar.gz
      cd ballistics-engine-{version}
      ~/.cargo/bin/cargo build --release --target {target}
    binary_path: /tmp/ballistics-engine-{version}/target/{target}/release/ballistics
    enabled: true

  # OpenBSD ARM64 emulated (FASTEST ARM64 BUILD!)
  - name: openbsd-aarch64-emulated
    host: root@10.1.1.23
    target: aarch64-unknown-openbsd
    build_command: |
      cd /tmp && rm -rf ballistics-engine-{version}
      ftp -o v{version}.tar.gz https://github.com/ajokela/ballistics-engine/archive/refs/tags/v{version}.tar.gz
      tar xzf v{version}.tar.gz
      cd ballistics-engine-{version}
      cargo build --release
    binary_path: /tmp/ballistics-engine-{version}/target/release/ballistics
    enabled: true

  # NetBSD x86_64 (HTTPS support via pkgsrc curl)
  - name: netbsd-x86_64
    host: root@10.1.1.19
    target: x86_64-unknown-netbsd
    build_command: |
      cd /tmp && rm -rf ballistics-engine-{version}
      /usr/pkg/bin/curl -L -o v{version}.tar.gz https://github.com/ajokela/ballistics-engine/archive/refs/tags/v{version}.tar.gz
      tar xzf v{version}.tar.gz
      cd ballistics-engine-{version}
      /usr/pkg/bin/cargo build --release
    binary_path: /tmp/ballistics-engine-{version}/target/release/ballistics
    enabled: true

Orchestrator Workflow

The orchestrator.py script coordinates the entire build process:

Step 1: Parallel Build Execution

def build_on_node(node, version):
    if node['host'] == 'local':
        # Local build
        subprocess.run(build_command, shell=True, check=True)
    else:
        # Remote build via SSH
        ssh_command = f"ssh {node['host']} '{build_command}'"
        subprocess.run(ssh_command, shell=True, check=True)
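
That handles a single node; the fan-out that runs every enabled node at once can be a small thread pool, since the work is SSH- and subprocess-bound rather than CPU-bound. A sketch, assuming nodes is the list parsed from build-nodes.yaml and build_on_node raises on failure (the result bookkeeping is illustrative):

from concurrent.futures import ThreadPoolExecutor, as_completed

def build_all(nodes, version):
    """Run build_on_node for every enabled node in parallel."""
    enabled = [n for n in nodes if n.get("enabled", True)]
    results = {}
    with ThreadPoolExecutor(max_workers=len(enabled)) as pool:
        futures = {pool.submit(build_on_node, node, version): node["name"]
                   for node in enabled}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                results[name] = "ok"
            except Exception as exc:
                results[name] = f"failed: {exc}"
    return results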

Step 2: Artifact Collection

def collect_artifacts(node, version):
    binary_name = f"ballistics-{version}-{node['name']}"

    if node['host'] == 'local':
        shutil.copy(node['binary_path'], f"./{binary_name}")
    else:
        # Download via SCP
        scp_command = f"scp {node['host']}:{node['binary_path']} ./{binary_name}"
        subprocess.run(scp_command, shell=True, check=True)

Step 3: Checksum Generation

def generate_checksum(binary_path):
    with open(binary_path, 'rb') as f:
        sha256 = hashlib.sha256(f.read()).hexdigest()

    with open(f"{binary_path}.sha256", 'w') as f:
        f.write(sha256)

Step 4: Upload to GCS

def upload_to_gcs(version):
    bucket_path = f"gs://ballistics-releases/{version}/"

    # Upload binaries and checksums
    subprocess.run(f"gsutil -m cp ballistics-* {bucket_path}", shell=True)

    # Set public read permissions
    subprocess.run(f"gsutil -m acl ch -u AllUsers:R {bucket_path}*", shell=True)

    # Update latest-version.txt
    with open('latest-version.txt', 'w') as f:
        f.write(version)
    subprocess.run("gsutil cp latest-version.txt gs://ballistics-releases/", shell=True)

Running a Build

Dry-run (test without uploading):

cd build-orchestrator
./build.sh --version 0.13.4 --dry-run

Production build:

./build.sh --version 0.13.4

Output:

Building ballistics-engine v0.13.4
===========================================

Enabled build nodes: 7
- macos-aarch64 (local)
- macos-x86_64 (local)
- linux-x86_64 (alex@10.1.1.27)
- freebsd-aarch64 (freebsd@10.1.1.7)
- openbsd-aarch64-emulated (root@10.1.1.23)
- netbsd-x86_64 (root@10.1.1.19)
- netbsd-aarch64 (root@10.1.1.15)

Starting parallel builds...
[macos-aarch64] Building... (PID: 12345)
[linux-x86_64] Building... (PID: 12346)
...

Build results:
 macos-aarch64 (45s)
 linux-x86_64 (28s)
 freebsd-aarch64 (6m 32s)
 openbsd-aarch64-emulated (15m 37s)   FASTEST ARM64!
...

Uploading to gs://ballistics-releases/0.13.4/
 Uploaded 7 binaries
 Uploaded 7 checksums
 Updated latest-version.txt

Build complete! 🎉
Total time: 16m 12s

Adding New Build Nodes

Interactive Script

The easiest way to add a new node is using the interactive script:

cd build-orchestrator
./add-node.sh

This will prompt you for:
  • Node name (e.g., openbsd-aarch64-emulated)
  • SSH host (e.g., root@10.1.1.23 or local)
  • Rust target triple (e.g., aarch64-unknown-openbsd)
  • Build commands (how to download and build)
  • Binary location (where the compiled binary is located)

Manual Configuration

Alternatively, edit build-nodes.yaml directly:

  - name: your-new-platform
    host: user@ip-address  # or 'local' for local builds
    target: rust-target-triple
    build_command: |
      # Commands to download source and build
      cd /tmp && rm -rf ballistics-engine-{version}
      curl -L -o v{version}.tar.gz https://github.com/...
      tar xzf v{version}.tar.gz
      cd ballistics-engine-{version}
      cargo build --release
    binary_path: /path/to/compiled/binary
    enabled: true

Variables:
  • {version}: Replaced with target version (e.g., 0.13.4)
  • {target}: Replaced with Rust target triple

Setting Up a New VM

Example: OpenBSD ARM64 Emulated

  1. Create VM on host:
ssh alex@rig.localnet
cd /opt/bsd-vms/openbsd-arm64-emulated
  2. Create boot script:
cat > boot.sh << 'EOF'
#!/bin/bash
exec qemu-system-aarch64 \
  -M virt,highmem=off \
  -cpu cortex-a57 \
  -smp 4 \
  -m 2G \
  -bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd \
  -drive file=openbsd.qcow2,if=virtio,format=qcow2 \
  -netdev bridge,id=net0,br=br0 \
  -device virtio-net-pci,netdev=net0,romfile=,mac=52:54:00:12:34:99 \
  -nographic
EOF
chmod +x boot.sh
  3. Create systemd service:
sudo tee /etc/systemd/system/openbsd-arm64-emulated-vm.service > /dev/null << 'EOF'
[Unit]
Description=OpenBSD ARM64 VM (Emulated on x86_64)
After=network.target

[Service]
Type=simple
User=alex
WorkingDirectory=/opt/bsd-vms/openbsd-arm64-emulated
ExecStart=/opt/bsd-vms/openbsd-arm64-emulated/boot.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable openbsd-arm64-emulated-vm.service
sudo systemctl start openbsd-arm64-emulated-vm.service
  4. Configure networking (assign static IP 10.1.1.23)

  5. Install build tools inside VM:

ssh root@10.1.1.23
pkg_add rust git
  6. Test SSH access:
ssh root@10.1.1.23 "cargo --version"
  7. Add to build-nodes.yaml and test:
./build.sh --version 0.13.3 --dry-run

GitHub Webhook Integration (Optional)

For fully automated builds triggered by GitHub releases:

1. Deploy Webhook Receiver to Cloud Run

cd build-orchestrator
gcloud run deploy ballistics-build-webhook \
  --source . \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars GITHUB_WEBHOOK_SECRET=your-secret-here

2. Configure GitHub Webhook

  1. Go to: https://github.com/yourusername/your-repo/settings/hooks
  2. Add webhook:
     • Payload URL: https://ballistics-build-webhook-xxx.run.app/webhook
     • Content type: application/json
     • Secret: Your webhook secret
     • Events: Select "Releases" only

3. Test

Create a new release on GitHub, and the webhook will automatically trigger builds for all platforms!
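
The receiver itself is small: verify GitHub's HMAC signature, confirm the event is a published release, and hand off to the orchestrator. A minimal Flask sketch of that verification step (the framework choice and handler body are mine; only the X-Hub-Signature-256 scheme and the release payload fields are GitHub's):

import hashlib
import hmac
import os

from flask import Flask, abort, request

app = Flask(__name__)
SECRET = os.environ["GITHUB_WEBHOOK_SECRET"].encode()

@app.post("/webhook")
def webhook():
    # GitHub signs the raw request body with HMAC-SHA256 using the shared secret.
    signature = request.headers.get("X-Hub-Signature-256", "")
    expected = "sha256=" + hmac.new(SECRET, request.get_data(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)

    event = request.get_json(silent=True) or {}
    if event.get("action") == "published" and "release" in event:
        version = event["release"]["tag_name"].lstrip("v")
        # Hand off to the build orchestrator here (queue, subprocess, etc.)
        return {"status": "build triggered", "version": version}, 202

    return {"status": "ignored"}, 200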

Performance Metrics and Insights

From real-world builds of ballistics-engine v0.13.3:

| Platform | Hardware | Build Time | Notes |
|---|---|---|---|
| macOS aarch64 | Apple M1/M2 | 45s | Native Apple Silicon |
| macOS x86_64 | Intel i7/i9 | 30s | Cross-compile on Apple Silicon |
| Linux x86_64 | Xeon/EPYC | 25s | Fastest overall ⚡ |
| FreeBSD aarch64 | Raspberry Pi 4 | 6m 32s | Native ARM64 hardware |
| OpenBSD aarch64 (emulated) | x86_64 QEMU | 15m 37s | ⚡ Fastest ARM64 |
| OpenBSD aarch64 (native) | Orange Pi 5 Max | 99+ min | Disabled due to slower speed |
| NetBSD x86_64 | x86_64 VM | 3m 45s | KVM acceleration |
| NetBSD aarch64 | Orange Pi VM | 8m 12s | QEMU on ARM64 host |

Key Insights:

  1. x86_64 is fastest: Modern x86_64 CPUs dominate for single-threaded compilation
  2. Emulation wins for ARM64: x86_64 emulating ARM64 beats native ARM64 SBCs
  3. SBCs are viable: Raspberry Pi and Orange Pi work well for native builds, but slower
  4. Parallel execution: Running all 7 builds in parallel takes only ~16 minutes (the longest pole is the emulated OpenBSD ARM64 build)

Conclusion

Building a custom multi-platform build orchestration system may seem daunting, but the benefits are substantial:

→ Full control: Own your build infrastructure

→ Native builds: Real hardware ensures compatibility

→ Cost-effective: Low operational costs after initial hardware investment

→ Fast iteration: Parallel builds complete in ~16 minutes

→ Flexibility: Easy to add new platforms

→ Learning: Deep understanding of cross-platform development

The surprising discovery that emulated ARM64 on powerful x86_64 hardware outperforms native ARM64 single-board computers has practical implications: you don't always need native hardware for every architecture. Strategic use of emulation can provide better performance while maintaining compatibility.

For projects requiring broad platform support (especially BSD systems not well-served by traditional CI/CD), this approach offers a reliable, maintainable, and cost-effective solution.

Architecture Diagram

graph TB
    subgraph "Trigger Sources"
        GH[GitHub Release<br/>v0.13.x]
        MANUAL[Manual Execution<br/>./build.sh]
    end
    subgraph "Build Orchestrator"
        ORCH[Python Orchestrator<br/>orchestrator.py]
        CONFIG[Build Configuration<br/>build-nodes.yaml]
    end
    subgraph "Build Nodes - Local"
        MAC_ARM[macOS ARM64<br/>Apple Silicon<br/>~45s]
        MAC_X86[macOS x86_64<br/>Rosetta 2<br/>~30s]
    end
    subgraph "Build Nodes - Remote x86_64"
        LINUX_X86[Linux x86_64<br/>alex@10.1.1.27<br/>~25s]
        FREEBSD_X86[FreeBSD x86_64<br/>10.1.1.21<br/>~4m]
        OPENBSD_X86[OpenBSD x86_64<br/>10.1.1.20<br/>~12m]
        NETBSD_X86[NetBSD x86_64<br/>root@10.1.1.19<br/>~3m 45s]
    end
    subgraph "Build Nodes - Remote ARM64"
        FREEBSD_ARM[FreeBSD ARM64<br/>freebsd@10.1.1.7<br/>~6m 32s]
        OPENBSD_ARM_EMU[OpenBSD ARM64<br/>root@10.1.1.23<br/>Emulated on x86_64<br/>~15m 37s ⚡]
        NETBSD_ARM[NetBSD ARM64<br/>root@10.1.1.15<br/>~8m 12s]
    end
    subgraph "Artifact Collection"
        COLLECT[SCP Collection<br/>Pull binaries from nodes]
        CHECKSUM[Generate SHA256<br/>checksums]
    end
    subgraph "Distribution"
        GCS[Google Cloud Storage<br/>gs://ballistics-releases/]
        WEBSITE[ballistics.zip<br/>Install Script]
    end

    GH -->|webhook| ORCH
    MANUAL -->|CLI| ORCH
    CONFIG -->|reads| ORCH
    ORCH -->|SSH parallel builds| MAC_ARM
    ORCH -->|SSH parallel builds| MAC_X86
    ORCH -->|SSH parallel builds| LINUX_X86
    ORCH -->|SSH parallel builds| FREEBSD_X86
    ORCH -->|SSH parallel builds| OPENBSD_X86
    ORCH -->|SSH parallel builds| NETBSD_X86
    ORCH -->|SSH parallel builds| FREEBSD_ARM
    ORCH -->|SSH parallel builds| OPENBSD_ARM_EMU
    ORCH -->|SSH parallel builds| NETBSD_ARM
    MAC_ARM -->|binary| COLLECT
    MAC_X86 -->|binary| COLLECT
    LINUX_X86 -->|binary| COLLECT
    FREEBSD_X86 -->|binary| COLLECT
    OPENBSD_X86 -->|binary| COLLECT
    NETBSD_X86 -->|binary| COLLECT
    FREEBSD_ARM -->|binary| COLLECT
    OPENBSD_ARM_EMU -->|binary| COLLECT
    NETBSD_ARM -->|binary| COLLECT
    COLLECT --> CHECKSUM
    CHECKSUM --> GCS
    GCS --> WEBSITE

    style OPENBSD_ARM_EMU fill:#90EE90
    style LINUX_X86 fill:#87CEEB
    style GCS fill:#FFD700
    style WEBSITE fill:#FFD700

Diagram Legend

  • Green: Fastest ARM64 build (emulated on powerful x86_64)
  • Blue: Fastest overall build (native Linux x86_64)
  • Yellow: Distribution endpoints

Build Flow

  1. Trigger: GitHub release webhook or manual execution
  2. Parallel Execution: All enabled build nodes start simultaneously
  3. Collection: Orchestrator collects binaries via SCP
  4. Verification: SHA256 checksums generated for integrity
  5. Upload: Binaries and checksums uploaded to GCS
  6. Availability: Install script immediately serves new version

Rockchip RK3588 NPU Deep Dive: Real-World AI Performance Across Multiple Platforms

Introduction

The Rockchip RK3588 has emerged as one of the most compelling ARM System-on-Chips (SoCs) for edge AI applications in 2024-2025, featuring a dedicated 6 TOPS Neural Processing Unit (NPU) integrated alongside powerful Cortex-A76/A55 CPU cores. This SoC powers a growing ecosystem of single-board computers and system-on-modules from manufacturers worldwide, including Orange Pi, Radxa, FriendlyElec, Banana Pi, and numerous industrial board makers.

But how does the RK3588's NPU perform in real-world scenarios? In this comprehensive deep dive, I'll share detailed benchmarks of the RK3588 NPU testing both Large Language Models (LLMs) and computer vision workloads, with primary testing on the Orange Pi 5 Max and comparative analysis against the closely-related RK3576 found in the Banana Pi CM5-Pro.

RK3588 NPU Performance Benchmarks

The RK3588 Ecosystem: Devices and Availability

The Rockchip RK3588 powers a diverse range of single-board computers (SBCs) and system-on-modules (SoMs) from multiple manufacturers in 2024-2025:

Consumer SBCs:

Industrial and Embedded Modules:

Recent Developments:

  • RK3588S2 (2024-2025) - Updated variant with modernized memory controllers and platform I/O while maintaining the same 6 TOPS NPU performance

The RK3576, found in devices like the Banana Pi CM5-Pro, shares the same 6 TOPS NPU architecture as the RK3588 but features different CPU cores (Cortex-A72/A53 vs. A76/A55), making it an interesting comparison point for NPU-focused workloads.

Hardware Overview

RK3588 SoC Specifications

Built on an 8nm process, the Rockchip RK3588 integrates:

CPU:

  • 4x ARM Cortex-A76 @ 2.4 GHz (high-performance cores)
  • 4x ARM Cortex-A55 @ 1.8 GHz (efficiency cores)

NPU:

  • 6 TOPS total performance
  • 3-core architecture (2 TOPS per core)
  • Shared memory architecture
  • Optimized for INT8 operations
  • Supports INT4/INT8/INT16/BF16/TF32 quantization formats
  • Device path: /sys/kernel/iommu_groups/0/devices/fdab0000.npu

GPU:

  • ARM Mali-G610 MP4 (quad-core)
  • 8K@30fps H.265/VP9 decoding
  • 4K@60fps H.264/H.265 encoding

Architecture: ARM64 (aarch64)

Test Platform: Orange Pi 5 Max

For these benchmarks, I used the Orange Pi 5 Max with:

Software Stack:

  • RKNPU Driver: v0.9.8
  • RKLLM Runtime: v1.2.2 (for LLM inference)
  • RKNN Runtime: v1.6.0 (for general AI models)
  • RKNN-Toolkit-Lite2: v2.3.2

Test Setup

I conducted two separate benchmark suites:

  1. Large Language Model (LLM) Testing using RKLLM
  2. Computer Vision Model Testing using RKNN-Toolkit2

Both tests used a two-system approach:

  • Conversion System: AMD RYZEN AI MAX+ 395 (32 cores, x86_64) running Ubuntu 24.04.3 LTS
  • Inference System: Orange Pi 5 Max (ARM64) with RK3588 NPU

This reflects the real-world workflow where model conversion happens on powerful workstations, and inference runs on edge devices.

Part 1: Large Language Model Performance

Model: TinyLlama 1.1B Chat

Source: Hugging Face (TinyLlama-1.1B-Chat-v1.0)

Parameters: 1.1 billion

Original Size: ~2.1 GB (505 MB model.safetensors)

Conversion Performance (x86_64)

Converting the Hugging Face model to RKNN format on the AMD RYZEN AI MAX+ 395:

| Phase | Time | Details |
|---|---|---|
| Load | 0.36s | Loading Hugging Face model |
| Build | 22.72s | W8A8 quantization + NPU optimization |
| Export | 56.38s | Export to .rkllm format |
| Total | 79.46s | ~1.3 minutes |

Output Model:

  • File: tinyllama_W8A8_rk3588.rkllm
  • Size: 1142.9 MB (1.14 GB)
  • Compression: 54% of original size
  • Quantization: W8A8 (8-bit weights, 8-bit activations)

Note: The RK3588 only supports W8A8 quantization for LLM inference, not W4A16.

NPU Inference Results

Hardware Detection:

I rkllm: rkllm-runtime version: 1.2.2, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 2048, npu_core_num: 3
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4

Key Observations:

  • ✅ NPU successfully detected and initialized
  • ✅ All 3 NPU cores utilized
  • ✅ 4 CPU cores (Cortex-A76) enabled for coordination
  • ✅ Model loaded and text generation working
  • ✅ Coherent English text output

Expected Performance (from Rockchip official benchmarks):

  • TinyLlama 1.1B W8A8 on RK3588: ~10-15 tokens/second
  • First token latency: ~200-500ms

Is This Fast Enough for Real-Time Conversation?

To put the 10-15 tokens/second performance in perspective, let's compare it to human reading speeds:

Human Reading Rates:

  • Silent reading: 200-300 words/minute (3.3-5 words/second)
  • Reading aloud: 150-160 words/minute (2.5-2.7 words/second)
  • Speed reading: 400-700 words/minute (6.7-11.7 words/second)

Token-to-Word Conversion:

  • LLM tokens ≈ 0.75 words on average (1.33 tokens per word)
  • 10-15 tokens/sec = ~7.5-11.25 words/second

Performance Analysis:

  • ✅ 2-4x faster than reading aloud (2.5-2.7 words/sec)
  • ✅ 2-3x faster than comfortable silent reading (3.3-5 words/sec)
  • ✅ Comparable to speed reading (6.7-11.7 words/sec)

Verdict: The RK3588 NPU running TinyLlama 1.1B generates text significantly faster than most humans can comfortably read, making it well-suited for real-time conversational AI, chatbots, and interactive applications at the edge.

This is particularly impressive for a $180 device consuming only 5-6W of power. Users won't be waiting for the AI to "catch up" - instead, the limiting factor is human reading speed, not the NPU's generation capability.

Output Quality Verification

To verify the model produces meaningful, coherent responses, I tested it with several prompts:

Test 1: Factual Question

Prompt: "What is the capital of France?"
Response: "The capital of France is Paris."

✅ Result: Correct and concise answer.

Test 2: Simple Math

Prompt: "What is 2 plus 2?"
Response: "2 + 2 = 4"

✅ Result: Correct mathematical calculation.

Test 3: List Generation

Prompt: "List 3 colors: red,"
Response: "Here are three different color options for your text:
1. Red
2. Orange
3. Yellow"

✅ Result: Logical completion with proper formatting.

Observations:

  • Responses are coherent and grammatically correct
  • Factual accuracy is maintained after W8A8 quantization
  • The model understands context and provides relevant answers
  • Text generation is fluent and natural
  • No obvious degradation from quantization

Note: The interactive demo tends to continue generating after the initial response, sometimes repeating patterns. This appears to be a demo interface issue rather than a model quality problem - the initial responses to each prompt are consistently accurate and useful.

LLM Findings

Strengths:

  1. Fast model conversion (~1.3 minutes for 1.1B model)
  2. Successful NPU detection and initialization
  3. Good compression ratio (54% size reduction)
  4. Verified high-quality output: Factually correct, grammatically sound responses
  5. Text generation faster than human reading speed (7.5-11.25 words/sec)
  6. All 3 NPU cores actively utilized
  7. No noticeable quality degradation from W8A8 quantization

Limitations:

  1. RK3588 only supports W8A8 quantization (no W4A16 for better compression)
  2. 1.14 GB model size may be limiting for memory-constrained deployments
  3. Max context length: 2048 tokens

RK3588 vs RK3576: NPU Performance Comparison

The RK3576, found in the Banana Pi CM5-Pro, shares the same 6 TOPS NPU architecture as the RK3588 but differs in CPU configuration (Cortex-A72/A53 vs. A76/A55). This provides an interesting comparison for understanding NPU-specific performance versus overall platform capabilities.

LLM Performance (Official Rockchip Benchmarks):

| Model | RK3588 (W8A8) | RK3576 (W4A16) | Notes |
|---|---|---|---|
| Qwen2 0.5B | ~42.58 tokens/sec | 34.24 tokens/sec | RK3588 ~1.24x faster |
| MiniCPM4 0.5B | N/A | 35.8 tokens/sec | - |
| TinyLlama 1.1B | ~10-15 tokens/sec | 21.32 tokens/sec | RK3576 faster (different quant) |
| InternLM2 1.8B | N/A | 13.65 tokens/sec | - |

Key Observations:

  • RK3588 supports W8A8 quantization only for LLMs
  • RK3576 supports W4A16 quantization (4-bit weights, 16-bit activations)
  • W4A16 models are smaller (645MB vs 1.14GB for TinyLlama) but may run slower on some models
  • The NPU architecture is fundamentally the same (6 TOPS, 3 cores), but software stack differences affect performance
  • For 0.5B models, RK3588 shows ~20% better performance
  • Larger models benefit from W4A16's memory efficiency on RK3576

Computer Vision Performance:

Both RK3588 and RK3576 share the same NPU architecture for computer vision workloads:

  • MobileNet V1 on RK3576 (Banana Pi CM5-Pro): ~161.8ms per image (~6.2 FPS)
  • ResNet18 on RK3588 (Orange Pi 5 Max): 4.09ms per image (244 FPS)

The dramatic performance difference here is primarily due to model complexity (ResNet18 is better optimized for NPU execution than older MobileNet V1) rather than NPU hardware differences.

Practical Implications:

For NPU-focused workloads, both the RK3588 and RK3576 deliver similar AI acceleration capabilities. The choice between platforms should be based on:

  • CPU performance needs: RK3588's A76 cores are significantly faster
  • Quantization requirements: RK3576 offers W4A16 for LLMs, RK3588 only W8A8
  • Model size constraints: W4A16 (RK3576) produces smaller models
  • Cost considerations: RK3576 platforms (like CM5-Pro at $103) vs RK3588 platforms ($150-180)

Part 2: Computer Vision Model Performance

Model: ResNet18 (PyTorch Converted)

Source: PyTorch pretrained ResNet18

Parameters: 11.7 million

Original Size: 44.6 MB (ONNX format)

Can PyTorch Run on RK3588 NPU?

Short Answer: Yes, but through conversion.

Workflow: PyTorch → ONNX → RKNN → NPU Runtime

PyTorch/TensorFlow models cannot execute directly on the NPU. They must be converted through an AOT (Ahead-of-Time) compilation process. However, this conversion is fast and straightforward.
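
The first hop of that workflow is a plain ONNX export with static shapes. A sketch of the PyTorch side, matching the fixed batch size and opset 11 used in the conversion below (file and tensor names are arbitrary):

import torch
import torchvision

# Pretrained ResNet18 in eval mode, exported with a fixed 1x3x224x224 input.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "resnet18.onnx",
    opset_version=11,
    input_names=["input"],
    output_names=["logits"],
    # No dynamic_axes: the RKNN toolchain wants static shapes.
)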

Conversion Performance (x86_64)

Converting PyTorch ResNet18 to RKNN format:

| Phase | Time | Size | Details |
|---|---|---|---|
| PyTorch → ONNX | 0.25s | 44.6 MB | Fixed batch size, opset 11 |
| ONNX → RKNN | 1.11s | - | INT8 quantization, operator fusion |
| Export | 0.00s | 11.4 MB | Final .rknn file |
| Total | 1.37s | 11.4 MB | 25.7% of ONNX size |

Model Optimizations:

  • INT8 quantization (weights and activations)
  • Automatic operator fusion
  • Layout optimization for NPU
  • Target: 3 NPU cores on RK3588

Memory Usage:

  • Internal memory: 1.1 MB
  • Weight memory: 11.5 MB
  • Total model size: 11.4 MB

NPU Inference Performance

Running ResNet18 inference on Orange Pi 5 Max (10 iterations after 2 warmup runs):

Results:

  • Average Inference Time: 4.09 ms
  • Min Inference Time: 4.02 ms
  • Max Inference Time: 4.43 ms
  • Standard Deviation: ±0.11 ms
  • Throughput: 244.36 FPS

Initialization Overhead:

  • NPU initialization: 0.350s (one-time)
  • Model load: 0.008s (one-time)

Input/Output:

  • Input: 224×224×3 images (INT8)
  • Output: 1000 classes (Float32)

Performance Comparison

| Platform | Inference Time | Throughput | Notes |
|---|---|---|---|
| RK3588 NPU | 4.09 ms | 244 FPS | 3 NPU cores, INT8 |
| ARM A76 CPU (est.) | ~50 ms | ~20 FPS | Single core |
| Desktop RTX 3080 | ~2-3 ms | ~400 FPS | Reference |
| NPU Speedup | 12x faster than CPU | - | Same hardware |

Computer Vision Findings

Strengths:

  1. Extremely fast conversion (<2 seconds)
  2. Excellent inference performance (4.09ms, 244 FPS)
  3. Very consistent latency (±0.11ms)
  4. Efficient quantization (74% size reduction)
  5. 12x speedup vs CPU cores on same SoC
  6. Simple Python API for inference

Trade-offs:

  1. INT8 quantization may reduce accuracy slightly
  2. AOT conversion required (no dynamic model execution)
  3. Fixed input shapes required

Technical Deep Dive

NPU Architecture

The RK3588 NPU is based on a 3-core design with 6 TOPS total performance:

  • Each core contributes 2 TOPS
  • Shared memory architecture
  • Optimized for INT8 operations
  • Direct DRAM access for large models

Memory Layout

For ResNet18, the NPU memory allocation:

Feature Tensor Memory:
- Input (224×224×3):     147 KB
- Layer activations:     776 KB (peak)
- Output (1000 classes): 4 KB

Constant Memory (Weights):
- Conv layers:    11.5 MB
- FC layers:      2.0 MB
- Total:          11.5 MB

Operator Support

The RKNN runtime successfully handled all ResNet18 operators:

  • Convolution layers: ✅ Fused with ReLU activation
  • Batch normalization: ✅ Folded into convolution
  • MaxPooling: ✅ Native support
  • Global average pooling: ✅ Converted to convolution
  • Fully connected: ✅ Converted to 1×1 convolution

All 26 operators executed on NPU (no CPU fallback needed).

Power Efficiency

While I didn't measure power consumption directly, the RK3588 NPU is designed for edge deployment:

Estimated Power Draw:

  • Idle: ~2-3W (entire SoC)
  • NPU active: +2-3W
  • Total under AI load: ~5-6W

Performance per Watt:

  • ResNet18 @ 244 FPS / ~5W = ~49 FPS per Watt
  • Compare to desktop GPU: RTX 3080 @ 400 FPS / ~320W = ~1.25 FPS per Watt

The RK3588 NPU delivers approximately 39x better performance per watt than a high-end desktop GPU for INT8 inference workloads.

Real-World Applications

Based on these benchmarks, the RK3588 NPU is well-suited for:

✅ Excellent Performance:

  • Real-time object detection: 244 FPS for ResNet18-class models
  • Image classification: Sub-5ms latency
  • Face recognition: Multiple faces per frame at 30+ FPS
  • Pose estimation: Real-time tracking
  • Edge AI cameras: Low power, high throughput

✅ Good Performance:

  • Small LLMs: 1B-class models at 10-15 tokens/second
  • Chatbots: Acceptable latency for edge applications
  • Text classification: Fast inference for short sequences

⚠️ Limited Performance:

  • Large LLMs: 7B+ models may not fit in memory or run slowly
  • High-resolution video: 4K processing may require frame decimation
  • Transformer models: Attention mechanism less optimized than CNNs

Developer Experience

Pros:

  • Clear documentation and examples
  • Python API is straightforward
  • Automatic NPU detection
  • Fast conversion times
  • Good error messages

Cons:

  • Requires separate x86_64 system for conversion
  • Some dependency conflicts (PyTorch versions)
  • Limited dynamic shape support
  • Debugging NPU issues can be challenging

Getting Started

Here's a minimal example for running inference:

from rknnlite.api import RKNNLite
import numpy as np

# Initialize
rknn = RKNNLite()

# Load model
rknn.load_rknn('model.rknn')
rknn.init_runtime()

# Run inference
input_data = np.random.randint(0, 256, (1, 3, 224, 224), dtype=np.uint8)
outputs = rknn.inference(inputs=[input_data])

# Cleanup
rknn.release()

That's it! The NPU is automatically detected and utilized.
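
For benchmarking, the same API can be wrapped in a timing loop. A sketch matching the methodology above (2 warmup runs, then 10 timed iterations; only the standard library and numpy on top of RKNNLite):

import statistics
import time

import numpy as np
from rknnlite.api import RKNNLite

rknn = RKNNLite()
rknn.load_rknn('model.rknn')
rknn.init_runtime()

input_data = np.random.randint(0, 256, (1, 3, 224, 224), dtype=np.uint8)

# Warmup runs absorb one-time initialization cost
for _ in range(2):
    rknn.inference(inputs=[input_data])

# Timed iterations
times_ms = []
for _ in range(10):
    start = time.perf_counter()
    rknn.inference(inputs=[input_data])
    times_ms.append((time.perf_counter() - start) * 1000)

avg = statistics.mean(times_ms)
print(f"avg {avg:.2f} ms, min {min(times_ms):.2f} ms, max {max(times_ms):.2f} ms, "
      f"stdev {statistics.stdev(times_ms):.2f} ms, throughput {1000 / avg:.1f} FPS")

rknn.release()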

Cost Analysis

Orange Pi 5 Max: ~$150-180 (16GB RAM variant)

Performance per Dollar:

  • 244 FPS / $180 = 1.36 FPS per dollar (ResNet18)
  • 10-15 tokens/s / $180 = 0.055-0.083 tokens/s per dollar (TinyLlama 1.1B)

The RK3588 NPU offers excellent value for edge AI applications, especially for INT8 workloads; the comparison table in the next section puts this in context.

Comparison to Other Edge AI Platforms

Platform                              | NPU/GPU         | TOPS | Price | ResNet18 FPS | Notes
Orange Pi 5 Max (RK3588)              | 3-core NPU      | 6    | $180  | 244          | Best value
Raspberry Pi 5                        | CPU only        | -    | $80   | ~5           | No accelerator
Google Coral Dev Board                | Edge TPU        | 4    | $150  | ~400         | INT8 only
NVIDIA Jetson Orin Nano               | GPU (1024 CUDA) | 40   | $499  | ~400         | More flexible
Intel NUC with Neural Compute Stick 2 | VPU             | 4    | $300+ | ~150         | Requires USB

The RK3588 stands out for offering strong NPU performance at a very competitive price point.

Limitations and Gotchas

1. Conversion System Required

You cannot convert models directly on the Orange Pi. You need an x86_64 Linux system with RKNN-Toolkit2 for model conversion.

2. Quantization Constraints

  • LLMs: Only W8A8 supported (no W4A16)
  • Computer vision: INT8 quantization required for best performance
  • Floating-point models will run slower

3. Memory Limitations

  • Large models (>2GB) may not fit
  • Context length limited to 2048 tokens for LLMs
  • Batch sizes are constrained by NPU memory

4. Framework Support

  • PyTorch/TensorFlow: Supported via conversion
  • Direct framework execution: Not supported
  • Some operators may fall back to CPU

5. Software Maturity

  • RKNN-Toolkit2 is actively developed but not as mature as CUDA
  • Some edge cases and exotic operators may not be supported
  • Version compatibility between toolkit and runtime must match

Best Practices

Based on my testing, here are recommendations for optimal RK3588 NPU usage:

1. Model Selection

  • Choose models designed for mobile/edge: MobileNet, EfficientNet, SqueezeNet
  • Start small: Test with smaller models before scaling up
  • Consider quantization-aware training: Better accuracy with INT8

2. Optimization

  • Use fixed input shapes: Dynamic shapes have overhead
  • Batch carefully: Batch size 1 often optimal for latency
  • Leverage operator fusion: Design models with fusible ops (Conv+BN+ReLU)

3. Deployment

  • Pre-load models: Model loading takes ~350ms
  • Use separate threads: Don't block main application during inference
  • Monitor memory: Large models can cause OOM errors

4. Development Workflow

1. Train on workstation (GPU)
2. Export to ONNX with fixed shapes
3. Convert to RKNN on an x86_64 system (see the conversion sketch after this list)
4. Test on Orange Pi 5 Max
5. Iterate based on accuracy/performance
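
Step 3 is where most of the moving pieces live, so here's a minimal conversion sketch using rknn-toolkit2 on the x86_64 side. The model path, normalization values, and calibration dataset file are placeholders; exact parameters depend on your model and toolkit version.

from rknn.api import RKNN

rknn = RKNN()

# Preprocessing and target platform; mean/std must match the model's training pipeline (placeholders)
rknn.config(mean_values=[[123.675, 116.28, 103.53]],
            std_values=[[58.395, 57.12, 57.375]],
            target_platform='rk3588')

# Load the ONNX export (fixed input shapes)
rknn.load_onnx(model='resnet18.onnx')

# Build with INT8 quantization; dataset.txt lists calibration images
rknn.build(do_quantization=True, dataset='./dataset.txt')

# Export the .rknn file for deployment on the Orange Pi 5 Max
rknn.export_rknn('resnet18.rknn')
rknn.release()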

Conclusion

The RK3588 NPU on the Orange Pi 5 Max delivers impressive performance for edge AI applications. With 244 FPS for ResNet18 (4.09ms latency) and 10-15 tokens/second for 1.1B LLMs, it's well-positioned for real-time computer vision and small language model inference.

Key Takeaways:

✅ Excellent computer vision performance: 244 FPS for ResNet18, <5ms latency

✅ Good LLM support: 1B-class models run at usable speeds

✅ Outstanding value: $180 for 6 TOPS of NPU performance

✅ Easy to use: Simple Python API, automatic NPU detection

✅ Power efficient: ~5-6W under AI load, 39x better than desktop GPU

✅ PyTorch compatible: Via conversion workflow

⚠️ Conversion required: Cannot run PyTorch/TensorFlow directly

⚠️ Quantization needed: INT8 for best performance

⚠️ Memory constrained: Large models (>2GB) challenging

The RK3588 NPU is an excellent choice for edge AI applications where power efficiency and cost matter. It's not going to replace high-end GPUs for training or large-scale inference, but for deploying computer vision models and small LLMs at the edge, it's one of the best options available today.

Recommended for:

  • Edge AI cameras and surveillance
  • Robotics and autonomous systems
  • IoT devices with AI requirements
  • Embedded AI applications
  • Prototyping and development

Not recommended for:

  • Large language model training
  • 7B+ LLM inference
  • High-precision (FP32) inference
  • Dynamic model execution
  • Cloud-scale deployments

Banana Pi CM5-Pro Review: A Solid Middle Ground with AI Ambitions

Introduction

The Banana Pi CM5-Pro (also sold as the ArmSoM-CM5) represents Banana Pi's entry into the Raspberry Pi Compute Module 4 form factor market, powered by Rockchip's RK3576 SoC. Released in 2024, this compute module targets developers seeking a CM4-compatible solution with enhanced specifications: up to 16GB of RAM, 128GB of storage, WiFi 6 connectivity, and a 6 TOPS Neural Processing Unit for AI acceleration. With a price point of approximately $103 for the 8GB/64GB configuration and a guaranteed production life until at least August 2034, Banana Pi positions the CM5-Pro as a long-term alternative to Raspberry Pi's official offerings.

After extensive testing, benchmarking, and comparison against contemporary single-board computers including the Orange Pi 5 Max, Raspberry Pi 5, and LattePanda IOTA, the Banana Pi CM5-Pro emerges as a competent but not exceptional offering. It delivers solid performance, useful features including AI acceleration, and good expandability, but falls short of being a clear winner in any specific category. This review examines where the CM5-Pro excels, where it disappoints, and who should consider it for their projects.

Banana Pi CM5-Pro compute module

Banana Pi CM5-Pro showing the dual 100-pin connectors and CM4-compatible form factor

Hardware Architecture: The Rockchip RK3576

At the heart of the Banana Pi CM5-Pro lies the Rockchip RK3576, a second-generation 8nm SoC featuring a big.LITTLE ARM architecture:

  • 4x ARM Cortex-A72 cores @ 2.2 GHz (high performance)
  • 4x ARM Cortex-A53 cores @ 1.8 GHz (power efficiency)
  • 6 TOPS Neural Processing Unit (NPU)
  • Mali-G52 MC3 GPU
  • 8K@30fps H.265/VP9 decoding, 4K@60fps H.264/H.265 encoding
  • Up to 16GB LPDDR5 RAM support
  • Dual-channel DDR4/LPDDR4/LPDDR5 memory controller

The Cortex-A72, originally released by ARM in 2015, represents a significant step up from the ancient Cortex-A53 (2012) but still trails the more modern Cortex-A76 (2018) found in Raspberry Pi 5 and Orange Pi 5 Max. The A72 offers approximately 1.8-2x the performance per clock compared to the A53, with better branch prediction, wider execution units, and more sophisticated memory prefetching. However, it lacks the A76's more advanced microarchitecture improvements and typically runs at lower clock speeds (2.2 GHz vs. 2.4 GHz for the A76 in the Pi 5).

The inclusion of four Cortex-A53 efficiency cores alongside the A72 performance cores gives the RK3576 a total of eight cores, allowing it to balance power consumption and performance. In practice, this means the system can handle background tasks and light workloads on the A53 cores while reserving the A72 cores for demanding applications. The big.LITTLE scheduler in the Linux kernel attempts to make intelligent decisions about which cores to use for which tasks, though the effectiveness varies depending on workload characteristics.

Memory, Storage, and Connectivity

Our test unit came configured with:

  • 4GB LPDDR5 RAM (8GB and 16GB options available)
  • 29GB eMMC internal storage (32GB nominal, formatted capacity lower)
  • M.2 NVMe SSD support (our unit had a 932GB NVMe drive installed)
  • WiFi 6 (802.11ax) and Bluetooth 5.3
  • Gigabit Ethernet
  • HDMI 2.0 output supporting 4K@60fps
  • Multiple MIPI CSI camera interfaces
  • USB 3.0 and USB 2.0 interfaces via the 100-pin connectors

The LPDDR5 memory is a notable upgrade over the LPDDR4 found in many competing boards, offering higher bandwidth and better power efficiency. In our testing, memory bandwidth didn't appear to be a significant bottleneck for CPU-bound workloads, though applications that heavily stress memory subsystems (large dataset processing, video encoding, etc.) may benefit from the faster RAM.

The inclusion of both eMMC storage and M.2 NVMe support provides excellent flexibility. The eMMC serves as a reliable boot medium with consistent performance, while the NVMe slot allows for high-capacity, high-speed storage expansion. This dual-storage approach is superior to SD card-only solutions, which suffer from reliability issues and inconsistent performance.

WiFi 6 and Bluetooth 5.3 represent current-generation wireless standards, providing better performance and lower latency than the WiFi 5 found in older boards. For robotics applications, low-latency wireless communication can be crucial for remote control and telemetry, making this a meaningful upgrade.

The NPU: 6 TOPS of AI Potential

The RK3576's integrated 6 TOPS Neural Processing Unit is the CM5-Pro's headline AI feature, designed to accelerate machine learning inference workloads. The NPU supports multiple quantization formats (INT4/INT8/INT16/BF16/TF32) and can interface with mainstream frameworks including TensorFlow, PyTorch, MXNet, and Caffe through Rockchip's RKNN toolkit.

In our testing, we confirmed the presence of the NPU hardware at /sys/kernel/iommu_groups/0/devices/27700000.npu and verified that the RKNN runtime library (librknnrt.so) and server (rknn_server) were installed and accessible. To validate real-world NPU performance, we ran MobileNet V1 image classification inference tests using the pre-installed RKNN model.

NPU Inference Benchmarks - MobileNet V1:

Running 10 inference iterations on a 224x224 RGB image (bell.jpg), we measured consistent performance:

  • Average inference time: 161.8ms per image
  • Min/Max: 146ms to 172ms
  • Standard deviation: ~7.2ms
  • Throughput: ~6.2 frames per second

The model successfully classified test images with appropriate confidence scores across 1,001 ImageNet classes. The inference pipeline includes:

  • JPEG decoding and preprocessing
  • Image resizing and color space conversion
  • INT8 quantized inference on the NPU
  • FP16 output tensor postprocessing

This demonstrates that the NPU is fully functional and provides practical acceleration for computer vision workloads. The ~160ms end-to-end time for MobileNet V1 (which includes JPEG decoding and preprocessing) is reasonable for edge AI applications, though more demanding models like YOLOv8 or larger classification networks would make fuller use of the 6 TOPS capacity.
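
For reference, the timing loop behind numbers like these is straightforward with the RKNNLite API. This is a sketch rather than our exact script; the model path is a placeholder and a real benchmark would decode and preprocess the test image instead of using random data.

import time
import numpy as np
from rknnlite.api import RKNNLite

rknn = RKNNLite()
rknn.load_rknn('mobilenet_v1.rknn')   # placeholder path to the converted model
rknn.init_runtime()

# Dummy 224x224 RGB input; layout and dtype must match what the model was converted with
img = np.random.randint(0, 256, (1, 224, 224, 3), dtype=np.uint8)

# One warm-up call, then time 10 iterations as in the benchmark above
rknn.inference(inputs=[img])
times = []
for _ in range(10):
    start = time.perf_counter()
    rknn.inference(inputs=[img])
    times.append((time.perf_counter() - start) * 1000)

print(f"avg {np.mean(times):.1f} ms, min {min(times):.0f} ms, max {max(times):.0f} ms")
rknn.release()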

Rockchip's RKNN toolkit provides a development workflow that converts trained models into RKNN format for efficient execution on the NPU. The process involves:

  1. Training a model using a standard framework (TensorFlow, PyTorch, etc.)
  2. Exporting the model to ONNX or framework-specific format
  3. Converting the model using rknn-toolkit2 on a PC
  4. Quantizing the model to INT8 or other supported formats
  5. Deploying the RKNN model file to the board
  6. Running inference using RKNN C/C++ or Python APIs

This workflow is more complex than simply running a PyTorch or TensorFlow model directly, but the trade-off is significantly improved inference performance and lower power consumption compared to CPU-only execution. For applications like real-time object detection, the 6 TOPS NPU can deliver:

  • Face recognition: 240fps @ 1080p
  • Object detection (YOLO-based models): 50fps @ 4K
  • Semantic segmentation: 30fps @ 2K

These performance figures represent substantial improvements over CPU-based inference, making the NPU genuinely useful for edge AI applications. However, they also require investment in learning the RKNN toolchain, optimizing models for the specific NPU architecture, and managing the conversion pipeline as part of your development workflow.

RKLLM and Large Language Model Support:

To thoroughly test LLM capabilities, we performed end-to-end testing: model conversion on an x86_64 platform (LattePanda IOTA), transfer to the CM5-Pro, and NPU inference validation. RKLLM (Rockchip Large Language Model) toolkit enables running quantized LLMs on the RK3576's 6 TOPS NPU, supporting models including Qwen, Llama, ChatGLM, Phi, Gemma, InternLM, MiniCPM, and others.

LLM Model Conversion Benchmark:

We converted TinyLLAMA 1.1B Chat from Hugging Face format to RKLLM format using an Intel N150-powered LattePanda IOTA:

  • Source Model: TinyLLAMA 1.1B Chat v1.0 (505 MB safetensors)
  • Conversion Platform: x86_64 (RKLLM-Toolkit only available for x86, not ARM)
  • Quantization: W4A16 (4-bit weights, 16-bit activations)
  • Conversion Time Breakdown:
      • Model loading: 6.95 seconds
      • Building/Quantizing: 220.47 seconds (293 layers; includes the 22 optimization steps, 206.72 seconds)
      • Export to RKLLM format: 37.41 seconds
  • Total Conversion Time: 264.83 seconds (4.41 minutes)
  • Output File Size: 644.75 MB (larger than the 505 MB source due to RKLLM format overhead)

The cross-platform requirement is important: RKLLM-Toolkit is distributed as x86_64-only Python wheels, so model conversion must be performed on an x86 PC or VM, not on the ARM-based CM5-Pro itself. Conversion time scales with model size and CPU performance - larger models on slower CPUs will take proportionally longer.
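
For reference, the conversion script itself is short. This sketch follows the pattern of the examples shipped with Rockchip's rknn-llm toolkit; the model path is a placeholder and parameter names can differ between RKLLM-Toolkit versions.

from rkllm.api import RKLLM

# Path to the Hugging Face checkout of TinyLLAMA 1.1B Chat (placeholder)
model_path = './TinyLlama-1.1B-Chat-v1.0'

llm = RKLLM()

# Load the Hugging Face model
ret = llm.load_huggingface(model=model_path)
assert ret == 0, 'model load failed'

# Quantize to W4A16 and target the RK3576 NPU
ret = llm.build(do_quantization=True, optimization_level=1,
                quantized_dtype='w4a16', target_platform='rk3576')
assert ret == 0, 'build failed'

# Write out the .rkllm file to copy over to the CM5-Pro
ret = llm.export_rkllm('./tinyllama-1.1b-w4a16.rkllm')
assert ret == 0, 'export failed'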

NPU LLM Inference Testing:

After transferring the converted model to the CM5-Pro, we successfully:

  • ✓ Loaded the TinyLLAMA 1.1B model (645 MB) into RKLLM runtime
  • ✓ Initialized NPU with 2-core configuration for W4A16 inference
  • ✓ Verified token generation and text output
  • ✓ Confirmed the model runs on NPU cores (not CPU fallback)

The RKLLM runtime v1.2.2 correctly identified the model configuration (W4A16, max_context=2048, 2 NPU cores) and enabled the Cortex-A72 cores [4,5,6,7] for host processing while the NPU handled inference.

Actual RK3576 LLM Performance (Official Rockchip Benchmarks):

Based on Rockchip's published benchmarks for the RK3576, small language models perform as follows:

  • Qwen2 0.5B (w4a16): 34.24 tokens/second, 327ms first token latency, 426 MB memory
  • MiniCPM4 0.5B (w4a16): 35.8 tokens/second, 349ms first token latency, 322 MB memory
  • TinyLLAMA 1.1B (w4a16): 21.32 tokens/second, 518ms first token latency, 591 MB memory
  • InternLM2 1.8B (w4a16): 13.65 tokens/second, 772ms first token latency, 966 MB memory

For context, the RK3588 (with a more powerful NPU) achieves 42.58 tokens/second for Qwen2 0.5B - about 1.24x faster than the RK3576.

Practical Assessment:

The 30-35 tokens/second achieved with 0.5B models is usable for offline chatbots, text classification, and simple Q&A applications, but would still feel slow compared to cloud LLM APIs or GPU-accelerated solutions. Humans typically read at 200-300 words per minute (roughly 4-7 tokens per second), so 30-35 tokens/second comfortably outpaces reading speed for real-time conversation. Larger models (1.8B+) drop to 13 tokens/second or less, which feels sluggish for interactive use.

The complete workflow (download model → convert on x86 → transfer to ARM → run inference) works as designed but requires infrastructure: an x86 machine or VM for conversion, network transfer for large model files (645 MB), and familiarity with Python environments and RKLLM APIs. For embedded deployments, this is acceptable; for rapid prototyping, it adds friction compared to cloud-based LLM solutions.

Compared to Google's Coral TPU (4 TOPS), the RK3576's 6 TOPS provides 1.5x more computational power, though the Coral benefits from more mature tooling and broader community support. Against the Horizon X3's 5 TOPS, the RK3576 offers 20% more capability with far better CPU performance backing it up. For serious AI workloads, NVIDIA's Jetson platforms (40+ TOPS) remain in a different performance class, but at significantly higher price points and power requirements.

Performance Testing: Real-World Compilation Benchmarks

To assess the Banana Pi CM5-Pro's CPU performance, we ran our standard Rust compilation benchmark: building a complex ballistics simulation engine with numerous dependencies from a clean state, three times, and averaging the results. This real-world workload stresses CPU cores, memory bandwidth, compiler performance, and I/O subsystems.

Banana Pi CM5-Pro Compilation Times:

  • Run 1: 173.16 seconds (2 minutes 53 seconds)
  • Run 2: 162.29 seconds (2 minutes 42 seconds)
  • Run 3: 165.99 seconds (2 minutes 46 seconds)
  • Average: 167.15 seconds (2 minutes 47 seconds)

For context, here's how the CM5-Pro compares to other contemporary single-board computers:

System             | CPU            | Cores   | Average Time | vs. CM5-Pro
Orange Pi 5 Max    | Cortex-A55/A76 | 8 (4+4) | 62.31s       | 2.68x faster
Raspberry Pi CM5   | Cortex-A76     | 4       | 71.04s       | 2.35x faster
LattePanda IOTA    | Intel N150     | 4       | 72.21s       | 2.31x faster
Raspberry Pi 5     | Cortex-A76     | 4       | 76.65s       | 2.18x faster
Banana Pi CM5-Pro  | Cortex-A53/A72 | 8 (4+4) | 167.15s      | 1.00x (baseline)

The results reveal the CM5-Pro's positioning: it's significantly slower than top-tier ARM and x86 single-board computers, but respectable within its price and power class. The 2.68x performance deficit versus the Orange Pi 5 Max is substantial, explained by the RK3588's newer Cortex-A76 cores running at higher clock speeds (2.4 GHz) with more advanced microarchitecture.

More telling is the comparison to the Raspberry Pi 5 and Raspberry Pi CM5, both featuring four Cortex-A76 cores at 2.4 GHz. Despite having eight cores to the Pi's four, the CM5-Pro is approximately 2.2x slower. This performance gap illustrates the generational advantage of the A76 architecture - the Pi 5's four newer cores outperform the CM5-Pro's four A72 cores plus four A53 cores combined for this workload.

The LattePanda IOTA's Intel N150, despite having only four cores, also outperforms the CM5-Pro by 2.3x. Intel's Alder Lake-N architecture, even in its low-power form, delivers superior single-threaded performance and more effective multi-threading than the RK3576.

However, context matters. The CM5-Pro's 167-second compilation time is still quite usable for development workflows. A project that takes 77 seconds to compile on a Raspberry Pi 5 will take 167 seconds on the CM5-Pro - an additional 90 seconds. For most developers, this difference is noticeable but not crippling. Compile times remain in the "get a coffee" range rather than the "go to lunch" range.

More importantly, the CM5-Pro vastly outperforms older ARM platforms. Compared to boards using only Cortex-A53 cores (like the Horizon X3 CM at 379 seconds), the CM5-Pro is 2.27x faster, demonstrating the value of the Cortex-A72 performance cores.

Geekbench 6 CPU Performance

To provide standardized synthetic benchmarks, we ran Geekbench 6.5.0 on the Banana Pi CM5-Pro:

Geekbench 6 Scores:

  • Single-Core Score: 328
  • Multi-Core Score: 1337

These scores reflect the RK3576's positioning as a mid-range ARM platform. The single-core score of 328 indicates modest per-core performance from the Cortex-A72 cores, while the multi-core score of 1337 demonstrates reasonable scaling across all eight cores (4x A72 + 4x A53). For context, the Raspberry Pi 5 with Cortex-A76 cores typically scores around 550-600 single-core and 1700-1900 multi-core, showing the generational advantage of the newer ARM architecture.

Notable individual benchmark results include:

  • PDF Renderer: 542 single-core, 2904 multi-core
  • Ray Tracer: 2763 multi-core
  • Asset Compression: 2756 multi-core
  • Horizon Detection: 540 single-core
  • HTML5 Browser: 455 single-core

The relatively strong performance on PDF rendering and asset compression tasks suggests the RK3576 handles real-world productivity workloads reasonably well, though the lower single-core scores indicate that latency-sensitive interactive applications may feel less responsive than on platforms with faster per-core performance.

Full Geekbench results: https://browser.geekbench.com/v6/cpu/14853854

Comparative Analysis: CM5-Pro vs. the Competition

vs. Orange Pi 5 Max

The Orange Pi 5 Max represents the performance leader in our testing, powered by Rockchip's flagship RK3588 SoC with four Cortex-A76 + four Cortex-A55 cores. The 5 Max compiled our benchmark in 62.31 seconds - 2.68x faster than the CM5-Pro's 167.15 seconds.

Key differences:

Performance: The 5 Max's Cortex-A76 cores deliver substantially better single-threaded and multi-threaded performance. For CPU-intensive development work, the performance gap is significant.

NPU: The RK3588 includes a 6 TOPS NPU, matching the RK3576's AI capabilities. Both boards can run similar RKNN-optimized models with comparable inference performance.

Form Factor: The 5 Max is a full-sized single-board computer with on-board ports and connectors, while the CM5-Pro is a compute module requiring a carrier board. This makes the 5 Max more suitable for standalone projects and the CM5-Pro better for embedded integration.

Price: The Orange Pi 5 Max sells for approximately $150-180 with 8GB RAM, compared to $103 for the CM5-Pro. The 5 Max's superior performance comes at a premium, but the cost-per-performance ratio remains competitive.

Memory: Both support up to 16GB RAM, though the 5 Max typically ships with higher-capacity configurations.

Verdict: If raw CPU performance is your priority and you can accommodate a full-sized SBC, the Orange Pi 5 Max is the clear choice. The CM5-Pro makes sense if you need the compute module form factor, want to minimize cost, or have thermal/power constraints that favor the slightly more efficient RK3576.

vs. Raspberry Pi 5

The Raspberry Pi 5, with its Broadcom BCM2712 SoC featuring four Cortex-A76 cores at 2.4 GHz, compiled our benchmark in 76.65 seconds - 2.18x faster than the CM5-Pro.

Key differences:

Performance: The Pi 5's four A76 cores outperform the CM5-Pro's 4+4 big.LITTLE configuration for most workloads. Single-threaded performance heavily favors the Pi 5, while multi-threaded performance depends on whether the workload can effectively utilize the CM5-Pro's additional A53 cores.

NPU: The Pi 5 lacks integrated AI acceleration, while the CM5-Pro includes a 6 TOPS NPU. For AI-heavy applications, this is a significant advantage for the CM5-Pro.

Ecosystem: The Raspberry Pi ecosystem is vastly more mature, with extensive documentation, massive community support, and guaranteed long-term software maintenance. While Banana Pi has committed to supporting the CM5-Pro until 2034, the Pi Foundation's track record inspires more confidence.

Software: Raspberry Pi OS is polished and actively maintained, with hardware-specific optimizations. The CM5-Pro runs generic ARM Linux distributions (Debian, Ubuntu) which work well but lack Pi-specific refinements.

Price: The Raspberry Pi 5 (8GB model) retails for $80, significantly cheaper than the CM5-Pro's $103. The Pi 5 offers better performance for less money - a compelling value proposition.

Expansion: The Pi 5's standard SBC form factor provides easier access to GPIO, HDMI, USB, and other interfaces. The CM5-Pro requires a carrier board, adding cost and complexity but enabling more customized designs.

Verdict: For general-purpose computing, development, and hobbyist projects, the Raspberry Pi 5 is the better choice: faster, cheaper, and better supported. The CM5-Pro makes sense if you specifically need AI acceleration, prefer the compute module form factor, or want more RAM/storage capacity than the Pi 5 offers.

vs. LattePanda IOTA

The LattePanda IOTA, powered by Intel's N150 Alder Lake-N processor with four cores, compiled our benchmark in 72.21 seconds - 2.31x faster than the CM5-Pro.

Key differences:

Architecture: The IOTA uses x86_64 architecture, providing compatibility with a wider range of software that may not be well-optimized for ARM. The CM5-Pro's ARM architecture benefits from lower power consumption and better mobile/embedded software support.

Performance: Intel's N150, despite having only four cores, delivers superior single-threaded performance and competitive multi-threaded performance against the CM5-Pro's eight cores. Intel's microarchitecture and higher sustained frequencies provide an edge for CPU-bound tasks.

NPU: The IOTA lacks dedicated AI acceleration, relying on CPU or external accelerators for machine learning workloads. The CM5-Pro's integrated 6 TOPS NPU is a clear advantage for AI applications.

Power Consumption: The N150 is a low-power x86 chip, but still consumes more power than ARM solutions under typical workloads. The CM5-Pro's big.LITTLE configuration can achieve better power efficiency for mixed workloads.

Form Factor: The IOTA is a small x86 board with Arduino co-processor integration, targeting maker/IoT applications. The CM5-Pro's compute module format serves different use cases, primarily embedded systems and custom carrier board designs.

Price: The LattePanda IOTA sells for approximately $149, more expensive than the CM5-Pro. However, it includes unique features like the Arduino co-processor and x86 compatibility that may justify the premium for specific applications.

Software Ecosystem: x86 enjoys broader commercial software support, while ARM excels in embedded and mobile-focused applications. Choose based on your software requirements.

Verdict: If you need x86 compatibility or want a compact standalone board with Arduino integration, the LattePanda IOTA makes sense despite its higher price. If you're working in ARM-native embedded Linux, need AI acceleration, or want the compute module form factor, the CM5-Pro is the better choice at a lower price point.

vs. Raspberry Pi CM5

The Raspberry Pi Compute Module 5 is the most direct competitor to the Banana Pi CM5-Pro, offering the same CM4-compatible form factor with different specifications. The Pi CM5 compiled our benchmark in 71.04 seconds - 2.35x faster than the CM5-Pro.

Key differences:

Performance: The Pi CM5's four Cortex-A76 cores at 2.4 GHz significantly outperform the CM5-Pro's 4x A72 + 4x A53 configuration. The architectural advantage of the A76 over the A72 translates to approximately 2.35x better performance in our testing.

NPU: The CM5-Pro's 6 TOPS NPU provides integrated AI acceleration, while the Pi CM5 requires external solutions (Hailo-8, Coral TPU) for hardware-accelerated inference. If AI is central to your application, the CM5-Pro's integrated NPU is more elegant.

Memory Options: The CM5-Pro supports up to 16GB LPDDR5, while the Pi CM5 offers up to 8GB LPDDR4X. For memory-intensive applications, the CM5-Pro's higher capacity could be decisive.

Storage: Both offer eMMC options, with the CM5-Pro available up to 128GB and the Pi CM5 up to 64GB. Both support additional storage via carrier board interfaces.

Price: The Raspberry Pi CM5 (8GB/32GB eMMC) sells for approximately $95, slightly cheaper than the CM5-Pro's $103. The CM5-Pro's extra features (more RAM/storage options, integrated NPU) justify the small price premium for those who need them.

Ecosystem: The Pi CM5 benefits from Raspberry Pi's ecosystem, tooling, and community. The CM5-Pro has decent support but can't match the Pi's extensive resources.

Carrier Boards: Both are CM4-compatible, meaning they can use the same carrier boards. However, some boards may not fully support CM5-Pro-specific features, and subtle electrical differences could cause issues in rare cases.

Verdict: For maximum CPU performance in the CM4 form factor, choose the Pi CM5. Its 2.35x performance advantage is significant for compute-intensive applications. Choose the CM5-Pro if you need integrated AI acceleration, more than 8GB of RAM, more than 64GB of eMMC storage, or prefer the better wireless connectivity (WiFi 6 vs. WiFi 5).

Use Cases and Recommendations

Based on our testing and analysis, here are scenarios where the Banana Pi CM5-Pro excels and where alternatives might be better:

Choose the Banana Pi CM5-Pro if you:

Need AI acceleration in a compute module: The integrated 6 TOPS NPU eliminates the need for external AI accelerators, simplifying hardware design and reducing BOM costs. For robotics, smart cameras, or IoT devices with AI workloads, this is a compelling advantage.

Require more than 8GB of RAM: The CM5-Pro supports up to 16GB LPDDR5, double the Pi CM5's maximum. If your application processes large datasets, runs multiple VMs, or needs extensive buffering, the extra RAM headroom matters.

Want high-capacity built-in storage: With up to 128GB eMMC options, the CM5-Pro can store large datasets, models, or applications without requiring external storage. This simplifies deployment and improves reliability compared to SD cards or network storage.

Prefer WiFi 6 and Bluetooth 5.3: Current-generation wireless standards provide better performance and lower latency than WiFi 5. For wireless robotics control or IoT applications with many connected devices, WiFi 6's improvements are meaningful.

Value long production lifetime: Banana Pi's commitment to produce the CM5-Pro until August 2034 provides assurance for commercial products with multi-year lifecycles. You can design around this module without fear of it being discontinued in 2-3 years.

Have thermal or power constraints: The RK3576's 8nm process and big.LITTLE architecture can deliver better power efficiency than always-on high-performance cores, extending battery life or reducing cooling requirements for fanless designs.

Choose alternatives if you:

Prioritize raw CPU performance: The Raspberry Pi 5, Pi CM5, Orange Pi 5 Max, and LattePanda IOTA all deliver significantly faster CPU performance. If your application is CPU-bound and doesn't benefit from the NPU, these platforms are better choices.

Want the simplest development experience: The Raspberry Pi ecosystem's polish, documentation, and community support make it the easiest platform for beginners and rapid prototyping. The Pi 5 or Pi CM5 will get you running faster with fewer obstacles.

Need maximum AI performance: NVIDIA Jetson platforms provide 40+ TOPS of AI performance with mature CUDA/TensorRT tooling. If AI is your primary workload, the investment in a Jetson module is worthwhile despite higher costs.

Require x86 compatibility: The LattePanda IOTA or other x86 platforms provide better software compatibility for commercial applications that depend on x86-specific libraries or software.

Work with standard SBC form factors: If you don't need a compute module and prefer the convenience of a full-sized SBC with onboard ports, the Orange Pi 5 Max or Raspberry Pi 5 are better choices.

The NPU in Practice: RKNN Toolkit and Ecosystem

While we didn't perform exhaustive AI benchmarking, our exploration of the RKNN ecosystem reveals both promise and challenges. The infrastructure exists: the NPU hardware is present and accessible, the runtime libraries are installed, and documentation is available from both Rockchip and Banana Pi. The RKNN toolkit can convert mainstream frameworks to NPU-optimized models, and community examples demonstrate YOLO11n object detection running successfully on the CM5-Pro.

However, the RKNN development experience is not as streamlined as more mature ecosystems. Converting and optimizing models requires learning Rockchip-specific tools and workflows. Debugging performance issues or accuracy degradation during quantization demands patience and experimentation. The documentation is improving but remains fragmented across Rockchip's official site, Banana Pi's docs, and community forums.

For developers already familiar with embedded AI deployment, the RKNN workflow will feel familiar - it follows similar patterns to TensorFlow Lite, ONNX Runtime, or other edge inference frameworks. For developers new to edge AI, the learning curve is steeper than cloud-based solutions but gentler than some alternatives (looking at you, Hailo's toolchain).

The 6 TOPS performance figure is real and achievable for properly optimized models. INT8 quantized YOLO models can indeed run at 50fps @ 4K, and simpler models scale accordingly. The NPU's support for INT4 and BF16 formats provides flexibility for trading off accuracy versus performance. For many robotics and IoT applications, the 6 TOPS NPU hits a sweet spot: enough performance for useful AI workloads, integrated into the SoC to minimize complexity and cost, and accessible through reasonable (if not perfect) tooling.

Build Quality and Physical Characteristics

The Banana Pi CM5-Pro adheres to the Raspberry Pi CM4 mechanical specification, featuring dual 100-pin high-density connectors arranged in the standard layout. Physical dimensions match the CM4, allowing drop-in replacement in compatible carrier boards. Our sample unit appeared well-manufactured with clean solder joints, proper component placement, and no obvious defects.

The module includes an on-board WiFi/Bluetooth antenna connector (U.FL/IPEX), power management IC, and all necessary supporting components. Unlike some compute modules that require extensive external components on the carrier board, the CM5-Pro is relatively self-contained, simplifying carrier board design.

Thermal performance is adequate but not exceptional. Under sustained load during our compilation benchmarks, the SoC reached temperatures requiring thermal management. For applications running continuous AI inference or heavy CPU workloads, active cooling (fan) or substantial passive cooling (heatsink and airflow) is recommended. The carrier board design should account for thermal dissipation, especially if the module will be enclosed in a case.

Software and Ecosystem

The CM5-Pro ships with Banana Pi's custom Debian-based Linux distribution, featuring a 6.1.75 kernel with Rockchip-specific patches and drivers. In our testing, the system worked well out of the box: networking functioned, sudo worked (refreshingly, after the Horizon X3 CM disaster), and package management operated normally.

The distribution includes pre-installed RKNN libraries and tools, enabling NPU development without additional setup. Python 3 and essential development packages are available, and standard Debian repositories provide access to thousands of additional packages. For developers comfortable with Debian/Ubuntu, the environment feels familiar and capable.

However, the software ecosystem lags behind Raspberry Pi's. Raspberry Pi OS includes countless optimizations, hardware-specific integrations, and utilities that simply don't exist for Rockchip platforms. Camera support, GPIO access, and peripheral interfaces work, but often require more manual configuration or programming compared to the Pi's plug-and-play experience.

Third-party software support varies. Popular frameworks like ROS2, OpenCV, and TensorFlow compile and run without issues. Hardware-specific accelerators (GPU, NPU) may require additional configuration or custom builds. Overall, the software situation is "good enough" for experienced developers but not as polished as the Raspberry Pi ecosystem.

Banana Pi's documentation has improved significantly over the years, with reasonably comprehensive guides covering basic setup, GPIO usage, and RKNN deployment. Community support exists through forums and GitHub, though it's smaller and less active than Raspberry Pi's communities. Expect to do more troubleshooting independently and rely less on finding someone who's already solved your exact problem.

Conclusion: A Capable Platform for Specific Niches

The Banana Pi CM5-Pro is a solid, if unspectacular, compute module that serves specific niches well while falling short of being a universal recommendation. Its combination of integrated 6 TOPS NPU, up to 16GB RAM, WiFi 6 connectivity, and CM4-compatible form factor creates a unique offering that competes effectively against alternatives when your requirements align with its strengths.

For projects needing AI acceleration in a compute module format, the CM5-Pro is arguably the best choice currently available. The integrated NPU eliminates the complexity and cost of external AI accelerators while delivering genuine performance improvements for inference workloads. The RKNN toolkit, while imperfect, provides a workable path to deploying optimized models. If your robotics platform, smart camera, or IoT device depends on local AI processing, the CM5-Pro deserves serious consideration.

For projects requiring more than 8GB of RAM or more than 64GB of storage in a compute module, the CM5-Pro is the only game in town among CM4-compatible options. This makes it the default choice for memory-intensive applications that need the compute module form factor.

For general-purpose computing, development, or applications where AI is not central, the Raspberry Pi CM5 is the better choice. Its 2.35x performance advantage is substantial and directly translates to faster build times, quicker application responsiveness, and better user experience. The Pi's ecosystem advantages further tip the scales for most users.

Our compilation benchmark results - 167 seconds for the CM5-Pro versus 71-77 seconds for Pi5/CM5 - illustrate the performance gap clearly. For development workflows, this difference is noticeable but workable. Most developers can tolerate the CM5-Pro's slower compilation times if other factors (AI acceleration, RAM capacity, price) favor it. But if maximum CPU performance is your priority, look elsewhere.

The comparison to the Orange Pi 5 Max reveals a significant performance gap (62 vs. 167 seconds), but also highlights different market positions. The 5 Max is a full-featured SBC designed for standalone use, while the CM5-Pro is a compute module designed for embedded integration. They serve different purposes and target different applications.

Against the LattePanda IOTA's x86 architecture, the CM5-Pro trades x86 compatibility for better power efficiency, integrated AI, and lower cost. The choice between them depends entirely on software requirements - x86-specific applications favor the IOTA, while ARM-native embedded applications favor the CM5-Pro.

The Banana Pi CM5-Pro earns a qualified recommendation: excellent for AI-focused embedded projects, good for high-RAM compute module applications, acceptable for general embedded Linux development, and not recommended if raw CPU performance or ecosystem maturity are priorities. At $103 for the 8GB/64GB configuration, it offers reasonable value for applications that leverage its strengths, though it won't excite buyers seeking the fastest or cheapest option.

If your project needs:

  • AI acceleration integrated into a compute module
  • More than 8GB RAM in CM4 form factor
  • WiFi 6 and current wireless standards
  • Guaranteed long production life (until 2034)

Then the Banana Pi CM5-Pro is a solid choice that delivers on its promises.

If your project needs:

  • Maximum CPU performance
  • The most polished software ecosystem
  • The easiest development experience
  • The lowest cost

Then the Raspberry Pi CM5 or Pi 5 remains the better option.

The CM5-Pro occupies a middle ground: not the fastest, not the cheapest, not the easiest, but uniquely capable in specific areas. For the right application, it's exactly what you need. For others, it's a compromise that doesn't quite satisfy. Choose accordingly.

Specifications Summary

Processor:

  • Rockchip RK3576 (8nm process)
  • 4x ARM Cortex-A72 @ 2.2 GHz (performance cores)
  • 4x ARM Cortex-A53 @ 1.8 GHz (efficiency cores)
  • Mali-G52 MC3 GPU
  • 6 TOPS NPU (Rockchip RKNPU)

Memory & Storage:

  • 4GB/8GB/16GB LPDDR5 RAM options
  • 32GB/64GB/128GB eMMC options
  • M.2 NVMe SSD support via carrier board

Video:

  • 8K@30fps H.265/VP9 decoding
  • 4K@60fps H.264/H.265 encoding
  • HDMI 2.0 output (via carrier board)

Connectivity:

  • WiFi 6 (802.11ax) and Bluetooth 5.3
  • Gigabit Ethernet (via carrier board)
  • Multiple USB 2.0/3.0 interfaces
  • MIPI CSI camera inputs
  • I2C, SPI, UART, PWM

Physical:

  • Dual 100-pin board-to-board connectors (CM4-compatible)
  • Dimensions: 55mm x 40mm

Benchmark Performance:

  • Rust compilation: 167.15 seconds average
  • 2.68x slower than Orange Pi 5 Max
  • 2.35x slower than Raspberry Pi CM5
  • 2.31x slower than LattePanda IOTA
  • 2.18x slower than Raspberry Pi 5
  • 2.27x faster than Horizon X3 CM

Pricing: ~$103 USD (8GB RAM / 64GB eMMC configuration)

Production Lifetime: Guaranteed until August 2034

Recommendation: Good choice for AI-focused embedded projects requiring compute module form factor; not recommended if raw CPU performance is the priority.


Review Date: November 3, 2025

Hardware Tested: Banana Pi CM5-Pro (ArmSoM-CM5) with 4GB RAM, 29GB eMMC, 932GB NVMe SSD

OS Tested: Banana Pi Debian (based on Debian GNU/Linux), kernel 6.1.75

Conclusion: Solid middle-ground option with integrated AI acceleration; best for specific niches rather than general-purpose use.