(Research-only Tool) The Most Efficient Chunked or Target Quality AV1/AV2 Encoding Framework
- Dependencies
- Description
- Features
- Design Decisions
- Usage
- Building
- Video Showcase
- How TQ Works
- Credits
- Minimal and Faster Than Av1an
- SVT-AV1 (mainline or a fork)
- mkvmerge (to concatenate chunks)
- FFMS2 (a hard dependency)
- VSHIP (optional - needed for target quality encoding with CVVDP)
- ZIMG (optional - provides color conversion features needed by VSHIP)
xav aims to be the fastest, most minimal AV1/AV2 encoding framework. By keeping its feature scope limited, the potential of the best encoder and the best video-quality metric can be maximized without being constrained by extensive features.
The author has been involved with the av1an project since its inception, both as a user and as a developer, so creating a purposeless direct competitor was never the objective. xav is a faster, more minimal alternative to Av1an's most popular features, and the author acknowledges that av1an is the most powerful and feature-rich video encoding framework. This tool was developed with a strong interest in and focus on the "av1an" concept.
For this reason, adding xav's features to av1an, or av1an's features to xav, does not make sense.
- Parses the fancy new progress output of SVT-AV1 encoders (there is an example in the video below).
- Parses color and video metadata (container- and frame-based) and passes it to encoders automatically, including HDR metadata (Dolby Vision RPU automation for chunking is being considered), FPS, and resolution.
- Offers fun process monitoring with almost no overhead for the indexing, SCD, encoding, and TQ processes.
- Fastest chunked encoding with `svt-av1`.
- Fastest target quality encoding with `CVVDP`.
- Uses only absolute bleeding-edge tools with an opinionated setup.
- No flexibility or extensive feature support (such as VapourSynth filtering, zoning, different encoders, metrics, chunking methods, scaling, configurable SC parameters, statistical pooling for TQ, probing with different parameters than actual encoding for TQ).
- `yuv420p` & `yuv420p10le` input AND `yuv420p10le` output only. No 8-bit (output) or 12-bit support, nor yuv422/yuv444 support.
- The TQ aim: get exactly what you requested in the most accurate and fastest way possible, with no chance of deviation.
- Chunked encoding's aim is to optimize internally and reduce overhead as much as possible to get the fastest possible encoding speed overall.
- The tool's general aim is to achieve the previous two points using as few characters on the CLI as possible:
xav -t 9.4-9.6 i.mkv
These help me bring the tool's existing features closer to perfect each day, so I am constantly trying to reduce extra options and code size.
Run the build_all_static.sh script to build the dependencies statically and then build the main tool against them. This is the intended way to get maximum performance, though it is not particularly trivial.
For dynamic builds, you need ffmpegsource (ffms2) installed on your system and must run build_dynamic.sh.
For TQ support, you need zimg, ffms2, and vship.
NOTE: Building this tool statically requires static libraries on your system for the C library (glibc), the C++ library (libstdc++), llvm-libunwind, and compiler-rt. They are usually found under -static, -dev, or -git suffixes in package managers. Some package managers do not provide them; in that case, they need to be compiled manually.
Rust Nightly is also needed for -Z based optimizations.
NOTE: The tool is still in pre-beta. Even though it works, static building in particular has complexities that are hard to handle universally. I will provide arch-specific optimized builds soon, with or without TQ support.
i.mp4
The target quality logic comes from my pull requests on av1an, with some improvements on top of those.
The tool gets the allowed target and CRF range from the user, such as:
- `CRF = 12.25 - 44.75` - This means the tool will never use a CRF lower than `12.25` or higher than `44.75`.
- `TQ = 9.49-9.51` - This means the allowed TQ range is very narrow, and we target a CVVDP score of `9.5` for each chunk separately.
Convergence rounds:
- Binary Search
- Binary Search
- Linear Interpolation
- Natural Cubic Spline Interpolation
- PCHIP Interpolation
- AKIMA Interpolation
- Falls back to Binary Search
It progressively uses higher-order interpolation methods to increase accuracy as additional data points accumulate, and after each round the search space shrinks.
For example, if the user allows the whole CRF range (0-70), the first binary search tries CRF 35; if the resulting score is lower than the target quality, the next search is limited to CRF 0 to 34.75.
Interpolation + search-space shrinkage + intelligently used --tq and --qp parameters make the tool as fast as possible while keeping accuracy.
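The binary-search-with-shrinkage idea above can be sketched roughly as follows. This is a minimal, hypothetical illustration only: the real tool layers the interpolation rounds listed earlier on top of this, and the function name `find_crf`, the `measure` closure, and the 0.25 shrink step are assumptions for the sketch, not xav's actual code.

```rust
// Hypothetical sketch: binary search over the allowed CRF range,
// shrinking the search space after every probe, with early exit when
// a probe's score lands inside the allowed TQ range. `measure` stands
// in for a real probe-encode + CVVDP run (assumed monotonic in CRF).
fn find_crf(
    mut lo: f64,
    mut hi: f64,
    target: (f64, f64),           // allowed TQ range, e.g. (9.49, 9.51)
    measure: impl Fn(f64) -> f64, // CVVDP score at a given CRF
    max_rounds: usize,
) -> f64 {
    let mid_target = (target.0 + target.1) / 2.0;
    let mut best = (f64::INFINITY, lo); // (distance to target, crf)
    for _ in 0..max_rounds {
        let crf = (lo + hi) / 2.0; // binary-search probe
        let score = measure(crf);
        let dist = (score - mid_target).abs();
        if dist < best.0 {
            best = (dist, crf); // remember the closest candidate
        }
        if score >= target.0 && score <= target.1 {
            return crf; // early exit: target found
        }
        // Higher CRF means lower quality, so shrink toward the side
        // that can still contain the target.
        if score < mid_target {
            hi = crf - 0.25; // quality too low: cap CRF below the probe
        } else {
            lo = crf + 0.25; // quality too high: raise the floor
        }
        if lo > hi {
            break; // impossible to find: fall back to closest candidate
        }
    }
    best.1
}
```

With a full 0-70 range and a narrow target like 9.49-9.51, the search space roughly halves every round, which is why only a handful of probe encodes are needed per chunk.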
Early Exit Conditions:
- It found the target.
- Impossible to find (picks the closest candidate). This can happen because of a very narrow TQ range or an absurd CRF range (e.g. you allowed CRF 60-70 but requested visually transparent quality).
Huge thanks to Soda for the tremendous help, motivation, and support in building this tool, and more importantly, for his friendship along the way. He is my partner in crime.
Also, thanks to Lumen for her great contributions to GPU-based, accessible, state-of-the-art metric implementations and her general help around the tooling.
- Uses a direct memory pipeline (zero external process overhead). Everything runs within one Rust process with direct memory access.
- Direct C FFI bindings to FFMS2. FFMS2 is currently the most efficient library to open/index/decode videos. This way, we also get rid of the Python/VapourSynth/FFmpeg dependencies.
- Frames flow directly from decoder -> memory buffers -> encoder stdin via pipes.
- Uses zero-copy frame handling.
- If the input is 10-bit, custom 4-pixel-to-5-byte packing reduces memory by 37.5%. The bit-packing overhead is effectively zero.
- If the input is 8-bit, we can store the chunk in memory as 8-bit, reducing memory by almost 50%.
- On-demand 10-bit conversion is done efficiently only when needed.
- Uses contiguous YUV420 layout optimized for cache locality.
- The producer-consumer pipeline is lockless.
- Single thread extracts frames using FFMS2 -> Multiple encoder threads process chunks in parallel -> Lockless MPSC crossbeam channel communication with backpressure
- There is no thread contention: Single decoder eliminates seeking conflicts.
- Bounded channels prevent memory explosion.
- Workers operate on independent memory regions.
- All components share the same address space.
- OS can optimize single-process thread scheduling in an easier way.
- Minimal data movement between processing stages.
- Sequential memory access
- Only a single index needed for SCD/encoding.
- No interpreter overhead.
- TQ: Can directly reuse already-decoded frames both for encoding and for metric comparison by utilizing the `vship` API directly, instead of using VapourSynth-based CVVDP with inefficient seeking/decoding/computing.
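The 10-bit packing mentioned above can be illustrated with a tiny sketch. The exact bit layout xav uses is not documented here; `pack4`/`unpack4` are hypothetical names, and this shows just one way to fit four 10-bit samples into 40 bits (5 bytes) instead of the 8 bytes needed for `u16` storage, which is the 37.5% saving described.

```rust
/// Pack four 10-bit pixel values (held in u16, low 10 bits used)
/// into 5 bytes: 4 * 10 bits = 40 bits = 5 bytes, vs. 4 * 2 = 8 bytes
/// when stored as u16 — a 37.5% memory reduction.
/// NOTE: illustrative layout, not necessarily xav's actual one.
fn pack4(px: [u16; 4]) -> [u8; 5] {
    let bits: u64 = (px[0] as u64 & 0x3FF)
        | ((px[1] as u64 & 0x3FF) << 10)
        | ((px[2] as u64 & 0x3FF) << 20)
        | ((px[3] as u64 & 0x3FF) << 30);
    [
        (bits & 0xFF) as u8,
        ((bits >> 8) & 0xFF) as u8,
        ((bits >> 16) & 0xFF) as u8,
        ((bits >> 24) & 0xFF) as u8,
        ((bits >> 32) & 0xFF) as u8,
    ]
}

/// Reverse of `pack4`: recover the four 10-bit samples from 5 bytes.
fn unpack4(b: [u8; 5]) -> [u16; 4] {
    let bits: u64 = (b[0] as u64)
        | ((b[1] as u64) << 8)
        | ((b[2] as u64) << 16)
        | ((b[3] as u64) << 24)
        | ((b[4] as u64) << 32);
    [
        (bits & 0x3FF) as u16,
        ((bits >> 10) & 0x3FF) as u16,
        ((bits >> 20) & 0x3FF) as u16,
        ((bits >> 30) & 0x3FF) as u16,
    ]
}
```

Since both directions are a handful of shifts and masks on registers, the pack/unpack cost is negligible next to decoding and encoding, matching the "overhead is effectively zero" claim.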
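The producer-consumer shape described in the bullets above can be sketched with the standard library alone. xav is said to use a lockless crossbeam channel with multiple encoder workers; `std::sync::mpsc::sync_channel` is not lockless and this sketch uses a single consumer, but it demonstrates the same single-decoder, bounded-backpressure pattern. `run_pipeline` and its parameters are illustrative, not xav's API.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Sketch: one "decoder" thread produces chunks into a BOUNDED channel;
// the consumer drains them. Because the channel is bounded, the producer
// blocks when `capacity` chunks are in flight, which is the backpressure
// that prevents memory explosion. Returns how many chunks were processed.
fn run_pipeline(n_chunks: usize, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    // Single producer: stands in for the one FFMS2 decoding thread,
    // which eliminates seeking conflicts between workers.
    let producer = thread::spawn(move || {
        for i in 0..n_chunks {
            // Stand-in for a decoded frame chunk (independent buffer
            // per chunk, so workers never share memory regions).
            tx.send(vec![(i % 256) as u8; 8]).expect("receiver alive");
        }
        // Dropping `tx` closes the channel, ending the consumer loop.
    });

    let mut processed = 0;
    for chunk in rx {
        assert_eq!(chunk.len(), 8); // each chunk arrives intact
        processed += 1;
    }
    producer.join().unwrap();
    processed
}
```

Everything stays inside one process and one address space: sending a `Vec` across the channel moves only a pointer-sized handle, not the pixel data, which is the zero-copy property the bullets describe.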
Av1an on the other hand:
Relies on Python -> VapourSynth -> FFmpeg -> Encoder, which means multiple pipe/subprocess calls with serialization overhead. It must also parse and execute .vpy scripts.
The whole overhead can be summed up as:
- Python interpreter startup
- VapourSynth initialization
- FFmpeg subprocess spawning
- Multiple encoder process creation
- Python objects <-> VapourSynth frames
- FFmpeg -> VapourSynth -> Encoder pipes and inter-process communication between them. Say you use 32 workers: that means 32 independent FFmpeg instances, 32 VapourSynth instances, and 32 encoder instances (96 processes communicating with each other and risking memory explosion).
- If you add TQ into the equation, separate decoding/seeking and VapourSynth-based metrics create significant extra overhead.