This is the Definitive Technical Manual and Documentation for the MuseTalk Warm-Server Edition (v1.5 Custom). This document serves as a comprehensive guide for developers, researchers, and system architects aiming to deploy real-time digital humans with sub-second latency.
Before running any of the commands below, cd into MuseTalk/MuseTalk and install requirements.txt.
The requirements.txt file lists the latest known-working dependencies for this project. You will also need to install a few dependencies from their GitHub repositories and download the pretrained weights; because these change frequently, an LLM can help you work out the current steps.
(Downloading and updating dependencies can take a long time.)
Note: mouth movement can still look slightly unrealistic, even after applying noise-reduction algorithms such as the Kalman and One-Euro filters.
The Definitive Architecture Guide & Usage Manual
Version: 1.5.0-Warm-Production Author: Customized for Real-Time Digital Human Interaction Architecture: Persistent Latent Inference with Continuous Session Management
- Executive Summary & Core Philosophy
- System Architecture Deep Dive
- The "Cold Start" Problem
- The "Warm Server" Solution
- The Continuous Session Pipeline
- Hardware & Software Prerequisites
- Installation & Environment Setup
- Phase 1: The Force-Prep Protocol (Caching)
- Theory of Operation
- Execution Guide
- Phase 2: High-Fidelity Enhancement
- GFPGAN Integration
- The Enhancement Workflow
- Phase 3: The Warm Server (Runtime)
- API Endpoints
- Under the Hood: warm_api.py
- Stabilization Algorithms (One-Euro, Affine)
- Phase 4: The Full-Stack Web Application
- Backend Job System
- Frontend Polling Architecture
- Configuration Reference
- Troubleshooting & Optimization
Standard implementation of Generative AI video models, specifically MuseTalk, suffers from a critical bottleneck known as "Cold Start Latency." In a naive implementation, a request to generate a speaking video triggers a cascade of heavy operations: loading 4GB+ of PyTorch weights into VRAM, instantiating the VAE (Variational Autoencoder) and UNet, reading the source video from disk, running face detection algorithms (like S3FD) on every frame, and finally encoding those frames into latent tensors.
For a 5-second response, this "overhead" can take 10-15 seconds before generation even begins. This renders real-time conversation impossible.
The MuseTalk Warm-Server Edition fundamentally re-architects this process into a persistent, stateful service. By decoupling the "Preparation Phase" from the "Inference Phase," we achieve:
- Sub-Second Time-to-First-Frame (TTFF): The model waits in VRAM ("warm"), ready to accept audio tensors instantly.
- Zero-Shot Face Detection: Face coordinates and background latents are pre-calculated and cached (force_prep.py), eliminating computer-vision overhead during the live chat.
- Continuous Conversation: Instead of generating isolated video files, the server intelligently stitches new sentences into a growing, seamless video file using FFmpeg concat-demuxing (server_fast.py).
In standard inference scripts (like the original inference.py), the lifecycle of a request is:
1. import torch (2s)
2. Load UNet/VAE (4s)
3. Face Detection (~100ms per frame)
4. VAE Encoding (~50ms per frame)
5. UNet Inference (fast)
6. VAE Decoding (fast)
7. Cleanup
For a 100-frame video (4 seconds), steps 3 and 4 alone add 15 seconds of latency on a mid-range GPU.
Also, the first time the AI-generated mouth is rendered onto an uploaded video, rendering can take up to 10 minutes on a low-end GPU. After that initial render, however, the cached preparation data can be reused so later requests start almost immediately.
The Warm Server (server_fast.py + warm_api.py) changes the lifecycle to:
- Server Start: Load all models and caches (Done once).
- Idle State: Consume ~6GB VRAM, waiting for requests.
- Request Received:
- Audio → Whisper Feature Extractor (100ms)
- UNet Inference (Direct on GPU)
- VAE Decode & Blend
- Total Latency: ~0.8s for start of generation.
To simulate a real video call, we cannot just return "video_1.mp4", then "video_2.mp4". We need a single, growing stream.
- The Session File: results/full_session.mp4
- The Stitcher: When a new clip is generated, server_fast.py generates a temporary text file list (concat_list.txt) containing the absolute paths of the existing session and the new clip.
- FFmpeg Concat: It executes ffmpeg -f concat -c copy .... The -c copy flag is crucial; it performs a bitstream copy, meaning it does not re-encode frames. It simply updates the container metadata. This allows stitching hours of video in milliseconds.
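To make the stitching concrete, here is a minimal sketch of the concat step, assuming an append_to_session() helper and the results/full_session.mp4 path used elsewhere in this manual; the actual logic in server_fast.py may differ in detail.

```python
# Minimal sketch of the concat-demuxer stitching step (not the exact server_fast.py code).
import os
import subprocess

def append_to_session(new_clip: str, session: str = "results/full_session.mp4") -> None:
    """Append new_clip to the growing session file without re-encoding."""
    if not os.path.exists(session):
        # First clip of the conversation: it simply becomes the session file.
        os.replace(new_clip, session)
        return

    list_file = "concat_list.txt"
    with open(list_file, "w") as f:
        f.write(f"file '{os.path.abspath(session)}'\n")
        f.write(f"file '{os.path.abspath(new_clip)}'\n")

    tmp_out = session + ".tmp.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_file, "-c", "copy", tmp_out],
        check=True,
    )
    os.replace(tmp_out, session)  # swap the longer session file into place
```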
- GPU: NVIDIA RTX 3060 (12GB) or higher recommended.
  - Minimum: 8GB VRAM (e.g., RTX 2070). You may need to reduce batch sizes.
  - Ideal: RTX 4090 (24GB) allows for batch sizes of 32+ and 4K caching.
- Storage: NVMe SSD. Loading large .pt latent caches from a spinning HDD will introduce stutter.
- RAM: 32GB system RAM.
- OS: Linux (Ubuntu 20.04/22.04) or Windows 10/11 (via PowerShell or CMD).
- Python: 3.10.x (Strict dependency).
- CUDA: 11.8 or 12.1.
- FFmpeg: CRITICAL. Must be installed and accessible via system PATH.
- Verify by running ffmpeg -version in your terminal.
# Clone the repository
git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk/MuseTalk
# create and activate environment
python3 -m venv venv
source venv/bin/activate
# Install PyTorch (Ensure CUDA compatibility)
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
Use the provided requirements.txt.
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"
Note: We also need gfpgan for the enhancement module.
pip install gfpgan basicsr
You must organize your models/ directory exactly as follows. Use the download_weights.sh script on Linux, or place the files manually on Windows. (Because the download sources change frequently, an LLM can help you find the current ones.)
Directory Structure:
./models/
├── musetalkV15/ # The V1.5 UNet Checkpoint
│ ├── musetalk.json
│ └── unet.pth
├── sd-vae/ # Stable Diffusion VAE (ft-mse)
│ ├── config.json
│ └── diffusion_pytorch_model.bin
├── whisper/ # Whisper Tiny (Audio Encoder)
│ ├── config.json
│ └── pytorch_model.bin
├── dwpose/ # Body pose estimation models
├── face-parse-bisent/ # Face Parsing models
├── GFPGANv1.4.pth # For enhance.py
└── resnet18-5c106cde.pth
Important folder: MuseTalk/MuseTalk/scripts. Here is an explanation of what each script in that directory does:
inference.py: This is the main script for offline inference, used to generate a lip-synced video from a source video and an audio file.
- Functionality: It processes input videos by extracting frames, detecting faces, and encoding them. It then uses the MuseTalk model (UNet and VAE) to generate lip movements synchronized with the input audio.
- Key Features:
- Configuration: Reads tasks (pairs of audio and video) from a YAML configuration file (e.g., test_img.yaml).
- Face Parsing: Uses a face parsing model to blend the generated mouth region back into the original face seamlessly.
- Optimization: Supports float16 precision to speed up inference.
- Output: Combines the generated visual frames with the input audio using ffmpeg to produce the final MP4 file.
realtime_inference.py: This script simulates a real-time inference scenario designed for lower latency, suitable for digital avatars or chatbots.
- Functionality: Unlike inference.py, which processes everything from scratch, this script relies on pre-computed "avatar" data. It caches the VAE latent representations of the background video so that during inference, the model only needs to run the generation step (UNet) and decoding, skipping the costly encoding step.
- Key Features:
  - Avatar Class: Manages a specific video avatar's state, loading cached latents (latents.pt) and coordinates (coords.pkl).
  - Preparation Mode: Can automatically generate the necessary cache files if they don't exist.
  - Concurrency: Uses Python threading and queues to process frame resizing and blending in parallel with the model's generation loop to maximize throughput (see the sketch below).
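As a rough illustration of that producer/consumer pattern (this is not the actual realtime_inference.py code), the GPU generation loop can push frames into a queue while a worker thread handles the CPU-side resizing and blending:

```python
# Illustrative producer/consumer sketch: GPU loop produces frames, worker blends them.
import queue
import threading

frame_queue = queue.Queue(maxsize=64)

def blend_worker():
    while True:
        item = frame_queue.get()
        if item is None:          # sentinel: generation finished
            break
        frame_idx, generated_face = item
        # ... resize the face patch and blend it into the cached background frame ...
        frame_queue.task_done()

worker = threading.Thread(target=blend_worker, daemon=True)
worker.start()

# In the generation loop (pseudo-driver):
# for frame_idx, generated_face in enumerate(unet_outputs):
#     frame_queue.put((frame_idx, generated_face))
# frame_queue.put(None)   # tell the worker to stop
# worker.join()
```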
force_prep.py: This is a utility script dedicated to manually generating the cache required for real-time inference.
- Functionality: It performs the "Preparation Phase" of realtime_inference.py in isolation. It extracts frames, detects landmarks, and pre-calculates the VAE latents (the heavy computation) for a specific video.
- Purpose: It is useful for setting up an avatar (e.g., bank_avatar_1) beforehand so that the real-time server or inference script can start immediately without a setup delay. It saves the cache files (latents.pt, masks.pt, mask_coords.pt) to the results directory.
This script is for training data preparation, not for generating videos. It processes raw video datasets to train the MuseTalk models.
- Functionality: It standardizes raw videos into a format suitable for training.
- Key Steps:
- Conversion: Converts input videos to a fixed frame rate (25 FPS).
- Segmentation: Splits long videos into shorter clips (e.g., 30 seconds).
- Metadata Extraction: Uses mmpose and FaceAlignment to detect face bounding boxes and landmarks for every frame, saving the data to JSON files.
- Audio Extraction: Separates audio tracks into .wav files.
- List Generation: Automatically splits data into training and validation lists based on the configuration.
One additional script is present in the folder: an advanced inference script focused on high-stability output.
- Functionality: It implements advanced filtering (One-Euro filters) to smooth out jittery landmarks and uses sophisticated blending techniques (Poisson blending, sub-pixel shifting) to improve the visual quality of the final composite video.
The Force-Prep phase is the secret sauce of the Warm Server. It moves the heavy lifting from "Runtime" to "Setup Time."
The script scripts/force_prep.py performs the following atomic operations:
- Frame Extraction: Converts the source MP4 into a sequence of PNGs (stored in memory or temp disk).
- Face Alignment: Uses face_alignment to detect 68 landmarks. It calculates the bounding box for the face.
- Coordinate Caching: Saves the bounding box coordinates (x1, y1, x2, y2) to a pickle file (coords.pkl). This ensures the "camera" doesn't jitter during live inference because the coordinates are fixed.
- VAE Encoding: It crops the face, normalizes it to -1...1, and passes it through the VAE Encoder. The result is a latent tensor of shape (4, 32, 32).
- Latent Caching: These tensors are concatenated and saved as latents.pt (see the sketch below).
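A conceptual sketch of these caching steps is shown below. It assumes the models/sd-vae layout from this manual (loaded via diffusers) and uses a placeholder face_bbox() helper in place of the real face-alignment step; force_prep.py's actual code differs.

```python
# Conceptual sketch of the force-prep caching pipeline (not the real force_prep.py).
import pickle
import cv2
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("./models/sd-vae").to("cuda").eval()

def face_bbox(frame):
    # Placeholder for the face_alignment landmark/bbox step.
    h, w = frame.shape[:2]
    return w // 4, h // 4, 3 * w // 4, 3 * h // 4

latents, coords = [], []
cap = cv2.VideoCapture("data/video/avatar_1.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x1, y1, x2, y2 = face_bbox(frame)
    crop = cv2.resize(frame[y1:y2, x1:x2], (256, 256))
    # BGR -> RGB, HWC -> CHW, normalize to -1...1
    tensor = torch.from_numpy(crop[:, :, ::-1].copy()).permute(2, 0, 1).float() / 127.5 - 1.0
    with torch.no_grad():
        latent = vae.encode(tensor.unsqueeze(0).to("cuda")).latent_dist.sample()  # (1, 4, 32, 32)
    latents.append(latent.cpu())
    coords.append((x1, y1, x2, y2))
cap.release()

torch.save(torch.cat(latents), "latents.pt")
with open("coords.pkl", "wb") as f:
    pickle.dump(coords, f)
```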
- Prepare Source Video: Place your high-quality avatar video (e.g., avatar_1.mp4) in data/video/.
- Edit Config: Open scripts/force_prep.py and set:

AVATAR_ID = "my_avatar_v1"
VIDEO_PATH = "data/video/avatar_1.mp4"
BBOX_SHIFT = -5 # Adjusts the chin/mouth crop region

- Run Script:

python -m scripts.force_prep

Result: A new folder results/avatars/my_avatar_v1/ is created containing latents.pt, coords.pkl, etc.
After this step, you may encounter the following error:
no file path: ./models/dwpose/dw-ll_ucoco_384.pth
This happens because the open-source model weights have not been downloaded yet. To fix it, paste the following into your terminal:
python -c '
import os
from huggingface_hub import hf_hub_download

# Define the weights to download
downloads = [
    ("TMElyralab/MuseTalk", "musetalk/musetalk.json", "models/musetalk"),
    ("TMElyralab/MuseTalk", "musetalk/pytorch_model.bin", "models/musetalk"),
    ("TMElyralab/MuseTalk", "musetalkV15/musetalk.json", "models/musetalkV15"),
    ("TMElyralab/MuseTalk", "musetalkV15/unet.pth", "models/musetalkV15"),
    ("stabilityai/sd-vae-ft-mse", "config.json", "models/sd-vae"),
    ("stabilityai/sd-vae-ft-mse", "diffusion_pytorch_model.bin", "models/sd-vae"),
    ("openai/whisper-tiny", "config.json", "models/whisper"),
    ("openai/whisper-tiny", "pytorch_model.bin", "models/whisper"),
    ("openai/whisper-tiny", "preprocessor_config.json", "models/whisper"),
    ("yzd-v/DWPose", "dw-ll_ucoco_384.pth", "models/dwpose"),
    ("ByteDance/LatentSync", "latentsync_syncnet.pt", "models/syncnet")
]

print("Starting download...")
for repo, filename, local_dir in downloads:
    print(f"Downloading {filename}...")
    os.makedirs(local_dir, exist_ok=True)
    hf_hub_download(repo_id=repo, filename=filename, local_dir=local_dir)

# Download ResNet manually (not on HF Hub)
import urllib.request
print("Downloading resnet18...")
os.makedirs("models/face-parse-bisent", exist_ok=True)
urllib.request.urlretrieve(
    "https://download.pytorch.org/models/resnet18-5c106cde.pth",
    "models/face-parse-bisent/resnet18-5c106cde.pth"
)
print("All downloads complete.")
'
If this still fails with errors about missing or nested model paths (e.g., models/musetalkV15/musetalkV15/unet.pth), apply this:
python -c '
import os
import shutil
from huggingface_hub import hf_hub_download

def fix_or_download(repo, filename, expected_path, correct_local_dir_arg):
    # 1. Check if file exists in the "wrong" nested location
    # Previous script likely made: models/musetalkV15/musetalkV15/unet.pth
    nested_path = os.path.join(os.path.dirname(expected_path), filename)

    if os.path.exists(nested_path):
        print(f"Found nested file at {nested_path}. Moving to {expected_path}...")
        os.rename(nested_path, expected_path)
        # Try to remove empty nested dir
        try:
            os.rmdir(os.path.dirname(nested_path))
        except:
            pass
    elif os.path.exists(expected_path):
        print(f"✅ File already exists at {expected_path}")
    else:
        print(f"File missing. Downloading {filename} to {expected_path}...")
        # To get models/musetalkV15/unet.pth, we set local_dir="models"
        # because filename already contains "musetalkV15/"
        hf_hub_download(repo_id=repo, filename=filename, local_dir=correct_local_dir_arg)

# Fix V1.5 UNet
fix_or_download(
    "TMElyralab/MuseTalk",
    "musetalkV15/unet.pth",
    "models/musetalkV15/unet.pth",
    "models"
)

# Fix V1.5 Config
fix_or_download(
    "TMElyralab/MuseTalk",
    "musetalkV15/musetalk.json",
    "models/musetalkV15/musetalk.json",
    "models"
)

print("Fix complete. Running prep...")
'
If your source video is 1080p or 4K, the standard MuseTalk output (a 256x256 face crop) might look soft. Use enhance.py to fix this; it improves quality significantly and makes the result look more realistic.
The enhance.py script wraps the GFPGAN (Generative Facial Prior GAN) framework. It acts as a restoration filter that "hallucinates" high-frequency details (pores, eyelashes, sharper teeth) onto the generated face.
You have two strategies here:
- Pre-Enhancement (Recommended): Run enhance.py on your source video before running force_prep.py. This ensures the background and non-moving parts of the face are already HD.
- Post-Enhancement: Run enhance.py on the final full_session.mp4 after the chat is over.
Usage:
Edit enhance.py configuration:
INPUT_VIDEO = "results/full_session.mp4"
OUTPUT_VIDEO = "results/full_session_hd.mp4"

Run the script:

python enhance.py
Note: The script extracts frames to a temp_frames directory, applies restorer.enhance(), and then restitches them using FFmpeg to preserve the audio sync.
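For reference, here is a hedged sketch of the per-frame restoration step using GFPGAN's public GFPGANer API; the paths follow this manual's layout, and the frame I/O is simplified compared to the real enhance.py.

```python
# Sketch of a single-frame GFPGAN restoration pass (simplified vs. enhance.py).
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="./models/GFPGANv1.4.pth",
    upscale=1,                # keep the original resolution, just restore detail
    arch="clean",
    channel_multiplier=2,
)

frame = cv2.imread("temp_frames/frame_00001.png")
_, _, restored = restorer.enhance(
    frame,
    has_aligned=False,
    only_center_face=True,    # we only care about the speaking face
    paste_back=True,
)
cv2.imwrite("temp_frames_hd/frame_00001.png", restored)
# After all frames are restored, FFmpeg re-assembles them with the original audio track.
```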
7. Phase 3: The Warm Server (Runtime). IMPORTANT: this is the main component responsible for keeping the render stable and efficient.
This is the core of the project. The server_fast.py script initializes the warm_api.py engine.
The RealTimeInference class inside warm_api.py is engineered for stability.
Raw face detection is noisy. One frame the box is at x=100, the next at x=101. This causes the face to shake.
The One-Euro Filter (implemented in OneEuro class) is a low-pass filter that dynamically adjusts its aggressiveness based on speed.
- Stationary Face: High filtering (jitter removal).
- Moving Head: Low filtering (low latency tracking).
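For readers who want to experiment, here is a minimal reference implementation of a One-Euro filter. It follows the standard published algorithm and is not the project's exact OneEuro class.

```python
import math

class OneEuroFilter:
    """Minimal One-Euro filter: adaptive low-pass smoothing for noisy signals."""
    def __init__(self, freq=25.0, min_cutoff=1.0, beta=0.1, d_cutoff=1.0):
        self.freq, self.min_cutoff, self.beta, self.d_cutoff = freq, min_cutoff, beta, d_cutoff
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        te = 1.0 / self.freq
        return 1.0 / (1.0 + tau / te)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Estimate speed and smooth it
        dx = (x - self.x_prev) * self.freq
        dx_hat = self.dx_prev + self._alpha(self.d_cutoff) * (dx - self.dx_prev)
        # Faster motion -> higher cutoff -> less smoothing (lower latency)
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        x_hat = self.x_prev + self._alpha(cutoff) * (x - self.x_prev)
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat

# Example: smooth the x-coordinate of the face bbox frame by frame
# smooth_x1 = OneEuroFilter(freq=25.0, beta=0.05)
# stable_x1 = smooth_x1(raw_x1)
```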
warm_api.py includes a scale_lock_enabled flag.
- Warmup: For the first 50 frames, it collects width/height statistics.
- Lock: It calculates the median size and locks the zoom level. This prevents the "pulsing head" effect where the face grows/shrinks slightly as the mouth opens.
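A tiny sketch of that warm-up/lock idea (names are illustrative, not warm_api.py's exact code):

```python
import statistics

class ScaleLock:
    """Collect crop sizes during warm-up, then freeze them at the median."""
    def __init__(self, warmup_frames=50):
        self.warmup_frames = warmup_frames
        self.sizes = []
        self.locked_size = None

    def update(self, width, height):
        if self.locked_size is not None:
            return self.locked_size           # ignore per-frame fluctuations
        self.sizes.append((width, height))
        if len(self.sizes) >= self.warmup_frames:
            self.locked_size = (
                int(statistics.median(w for w, _ in self.sizes)),
                int(statistics.median(h for _, h in self.sizes)),
            )
            return self.locked_size
        return width, height                   # still warming up
```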
To handle head rotation (roll), the API uses cv2.calcOpticalFlowPyrLK (Lucas-Kanade) to track stable features (nose bridge, eyes).
It computes an Affine Matrix (M_affine) representing the rotation difference between the current frame and the reference. This matrix is applied to the generated mouth patch before pasting, ensuring the mouth rotates perfectly with the head.
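A hedged sketch of this compensation step using standard OpenCV calls (cv2.calcOpticalFlowPyrLK, cv2.estimateAffinePartial2D, cv2.warpAffine); the real warm_api.py logic is more involved and handles coordinate-frame offsets that are omitted here.

```python
import cv2

def rotate_patch_to_head(ref_gray, cur_gray, ref_points, mouth_patch):
    """ref_points: float32 array of shape (N, 1, 2), e.g. nose bridge / eye corners."""
    # Track the reference points into the current frame with Lucas-Kanade optical flow
    cur_points, status, _ = cv2.calcOpticalFlowPyrLK(ref_gray, cur_gray, ref_points, None)
    good_ref = ref_points[status.flatten() == 1]
    good_cur = cur_points[status.flatten() == 1]
    if len(good_ref) < 3:
        return mouth_patch  # not enough points tracked; paste unrotated

    # Rotation + translation + uniform scale between reference and current frame
    m_affine, _ = cv2.estimateAffinePartial2D(good_ref, good_cur)
    if m_affine is None:
        return mouth_patch

    h, w = mouth_patch.shape[:2]
    # Warp the generated mouth patch so it rotates with the head before pasting
    return cv2.warpAffine(mouth_patch, m_affine, (w, h), flags=cv2.INTER_LINEAR)
```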
You can add further features and algorithms on top of these to improve rendering stability.
POST /speak
- Payload: {"text": "Hello world"}
- Action:
  - Triggers VoiceEngine (TTS) to generate output_raw.wav.
  - Converts the audio to 16kHz via FFmpeg.
  - warm_model.run() generates the visual clip.
  - append_to_session() stitches the clip into full_session.mp4.
- Return: JSON with the path to the updated session video.

GET /reset
- Action: Deletes full_session.mp4 to start a fresh conversation.
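A minimal client sketch for these two endpoints, assuming the server listens on 127.0.0.1:8000 (the actual host and port depend on how you launch server_fast.py):

```python
import requests

BASE = "http://127.0.0.1:8000"

resp = requests.post(f"{BASE}/speak", json={"text": "Hello world"})
print(resp.json())            # e.g. a path to the updated full_session.mp4

requests.get(f"{BASE}/reset")  # start a fresh conversation
```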
Located in webapp/, this is a fully functional chat interface.
Because video generation takes time (even if it's fast), we cannot block the HTTP request.
- Request: User hits /chat.
- Job Creation: Backend creates a UUID job_id and spawns a Thread.
- Thread Execution:
  - Call the LLM (simulated or real).
  - Call TTS (_generate_tts_wav).
  - Call render_avatar_video.
- Status: The main thread returns the job_id immediately.
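An illustrative sketch of this job pattern using Flask-style handlers is shown below; the real webapp backend may use a different framework, and call_llm, _generate_tts_wav, and render_avatar_video are stubbed placeholders here.

```python
# Sketch of a UUID + background-thread job system (assumed Flask-style routes).
import threading
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # job_id -> {"status": ..., "video": ...}

def call_llm(text):              # placeholder: plug in your real LLM call
    return f"Echo: {text}"

def _generate_tts_wav(text):     # placeholder for the real TTS step
    return "results/output_raw.wav"

def render_avatar_video(wav):    # placeholder for the MuseTalk render step
    return "results/full_session.mp4"

def run_job(job_id, user_text):
    try:
        reply = call_llm(user_text)
        wav_path = _generate_tts_wav(reply)
        video = render_avatar_video(wav_path)
        jobs[job_id] = {"status": "done", "video": video}
    except Exception as exc:
        jobs[job_id] = {"status": "error", "detail": str(exc)}

@app.post("/chat")
def chat():
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing"}
    threading.Thread(target=run_job, args=(job_id, request.json["text"]), daemon=True).start()
    return jsonify({"job_id": job_id})          # return immediately, do not block

@app.get("/status/<job_id>")
def status(job_id):
    return jsonify(jobs.get(job_id, {"status": "unknown"}))
```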
The JavaScript frontend does not wait for a long HTTP response.
- Poll: It checks /status/{job_id} every 500ms.
- Video Swap: When the status is done, it updates the <video> src attribute.
- Auto-Play Handling: It manages the video.play() promise to avoid browser "User Activation" errors.
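The browser does this in JavaScript; the same polling flow, written as a small Python test client against the hypothetical routes sketched above, looks like this:

```python
import time
import requests

BASE = "http://127.0.0.1:8000"

job_id = requests.post(f"{BASE}/chat", json={"text": "Hi there"}).json()["job_id"]
while True:
    state = requests.get(f"{BASE}/status/{job_id}").json()
    if state["status"] in ("done", "error"):
        break
    time.sleep(0.5)            # poll every 500 ms, like the frontend
print(state)                   # when done, the frontend swaps this video into <video src=...>
```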
The behavior of the Warm Server is governed by configs/inference/realtime_live.yaml.
| Parameter | Type | Description | Recommended |
|---|---|---|---|
| avatar_id | String | The folder name in results/avatars/. Must match force_prep. | bank_avatar_1 |
| video_path | Path | Source video file. | data/video/bank.mp4 |
| bbox_shift | Int | Vertical offset for the mouth mask. | -5 (more chin) |
| batch_size | Int | How many frames to process in parallel on the GPU. | 4 (RTX 3060), 16 (RTX 4090) |
| preparation | Bool | Whether to run face detection at runtime. | False (use cache) |
| fps | Int | Target output FPS. | 25 |
Inside scripts/warm_api.py, RealTimeInference.__init__:
- self.blend_mode = 'feather' or 'poisson'. Poisson is higher quality but slower (CPU intensive).
- self.subpixel_enabled = True. Keeps the face stable.
- self.scale_lock_enabled = True. Prevents pulsing.
**1. "Error: Input video not found or empty" in enhance.py**
- Cause: The INPUT_VIDEO path in enhance.py is incorrect or the video has 0 frames.
- Fix: Verify the path logic. Note that enhance.py defines INPUT_VIDEO = "results/...". Ensure you are pointing to the correct file.
2. "FFmpeg command not found"
- Cause: Python's subprocess or os.system cannot see ffmpeg.
- Fix: Add FFmpeg to your system PATH variables. Restart your terminal/IDE.
**3. "Model returned None" in server_fast.py**
- Cause: The inference pipeline failed to generate frames, usually because the audio file was empty or the avatar_id cache is missing.
- Fix: Run force_prep.py again. Check logs/ for specific CUDA errors.
4. Face Jitter / Vibration
- Optimization: In warm_api.py, lower the beta value in the OneEuro filter (e.g., 0.05). This increases smoothing but adds slight latency trails.
5. OOM (Out of Memory)
- Optimization: Reduce BATCH_SIZE in force_prep.py and in realtime_live.yaml.
Mandarin version (中文版本):
权威架构指南 & 使用手册(开发者/研究者/系统架构师版)
版本: 1.5.0-Warm-Production 作者: 为实时数字人交互定制(Real-Time Digital Human Interaction) 架构: 持久在线推理(Persistent Latent Inference)+ 连续会话管理(Continuous Session Management)
本文档是 MuseTalk Warm-Server Edition (v1.5 Custom) 的 最终技术手册与文档。它为希望部署 亚秒级延迟 的实时数字人系统的开发者、研究人员与系统架构师提供完整指导。
请先 cd MuseTalk/MuseTalk 再执行后续命令,并安装 requirements.txt。
requirements.txt 里包含运行该项目的最新可用依赖。但你还需要通过安装 GitHub 仓库依赖与下载权重文件来补全运行环境 —— 请用 LLM(大模型)来帮你管理这部分(因为依赖会经常变动、手动查很麻烦)。
(你可能需要花较长时间下载与更新依赖)
另外:即使你已经用了 Kalman / One-Euro 等多种去噪稳定算法,嘴部运动仍然有点不够真实。
- 执行摘要 & 核心理念
- 系统架构深度解析
  - “冷启动(Cold Start)”问题
  - “Warm Server(热启动服务)”解决方案
  - 连续会话流水线(Continuous Session Pipeline)
- 硬件 & 软件前置条件
- 安装与环境配置
- 阶段 1:Force-Prep 协议(缓存)
  - 工作原理
  - 执行指南
- 阶段 2:高保真增强
  - GFPGAN 集成
  - 增强工作流
- 阶段 3:Warm Server(运行时)
  - API 端点
  - warm_api.py 的内部机制
  - 稳定算法(One-Euro、仿射等)
- 阶段 4:全栈 Web 应用
  - 后端 Job 系统
  - 前端轮询架构
- 配置参数参考
- 排错与优化
标准的生成式 AI 视频模型实现(尤其是 MuseTalk)存在一个关键瓶颈:冷启动延迟(Cold Start Latency)。
在最朴素的实现中(naive implementation),每次请求生成口型视频都会触发一连串“重操作”:
- 将 4GB+ 的 PyTorch 权重加载进 VRAM
- 实例化 VAE(变分自编码器)和 UNet
- 从硬盘读取源视频
- 对每一帧都做人脸检测(例如 S3FD)
- 将帧编码成 latent 张量
因此,即便你只想生成 5 秒的回复视频,启动开销也可能要 10–15 秒 才开始真正推理 —— 这让实时对话几乎不可能。
MuseTalk Warm-Server 版本从根本上重构了这个流程:把系统做成持久、可复用、状态化服务。通过将“准备阶段(Preparation Phase)”与“推理阶段(Inference Phase)”解耦,我们实现:
- 亚秒级首帧时间(TTFF,Time-to-First-Frame):模型常驻 VRAM(保持 warm),随时可接收音频张量。
- 零开销人脸检测(Zero-Shot Face Detection):人脸坐标与背景 latents 在 force_prep.py 预计算并缓存,实时推理中不再做 CV 计算。
- 连续对话输出:不是生成 video_1.mp4、video_2.mp4 这种碎片文件,而是通过 server_fast.py + FFmpeg 拼接机制,把每段新句子无缝拼进一个“持续增长”的会话视频里。
在标准推理脚本(如原版 inference.py)中,一次请求的生命周期通常是:
1. import torch(2s)
2. 加载 UNet/VAE(4s)
3. 人脸检测(每帧 ~100ms)
4. VAE 编码(每帧 ~50ms)
5. UNet 推理(快)
6. VAE 解码(快)
7. 清理资源
对一个 100 帧(约 4 秒) 的视频来说,仅步骤 3 与 4 就可能额外增加 15 秒 的延迟(在中档 GPU 上)。
另外:当你第一次把 AI 口型渲染到上传视频上时,在低端 GPU 上可能需要 长达 10 分钟。但渲染完成后,缓存的准备数据可以被复用,后续请求几乎能立即开始。
Warm Server(server_fast.py + warm_api.py)把生命周期改成:
- 服务器启动时:加载所有模型与缓存(只做一次)
- 空闲等待:占用 ~6GB VRAM,等待请求
- 收到请求:
  - 音频 → Whisper 特征提取(~100ms)
  - UNet 推理(直接走 GPU)
  - VAE 解码 + 混合合成
- 总延迟: 首次开始生成大约 ~0.8s
为了模拟视频通话,我们不能只返回 “video_1.mp4”,再返回 “video_2.mp4”。我们需要一个持续增长的、单一的输出流。
- 会话文件: results/full_session.mp4
- 拼接器: 每次生成新片段后,server_fast.py 会生成一个临时文本列表(concat_list.txt),里面写的是旧会话视频与新片段的绝对路径
- FFmpeg 拼接: 执行 ffmpeg -f concat -c copy ...
其中 -c copy 非常关键:它做的是比特流拷贝(bitstream copy),不会重新编码帧,只更新容器元数据。因此拼接几个小时视频也能在毫秒级完成。
- GPU: 推荐 NVIDIA RTX 3060(12GB)或更高
  - 最低: 8GB VRAM(如 RTX 2070),可能需要降低 batch size
  - 理想: RTX 4090(24GB),可支持 batch 32+ 与 4K 缓存
- 存储: NVMe SSD(关键)
  - 从机械硬盘读取大 .pt latent 缓存会造成卡顿
- 内存: 32GB 系统内存
- OS: Linux(Ubuntu 20.04/22.04)或 Windows 10/11(PowerShell/CMD)
- Python: 3.10.x(严格要求)
- CUDA: 11.8 或 12.1
- FFmpeg: 关键依赖,必须安装并加入 PATH
  - 通过运行 ffmpeg -version 验证
# Clone the repository
git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk/MuseTalk
# create and activate environment
python3 -m venv venv
source venv/bin/activate
# Install PyTorch (Ensure CUDA compatibility)
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

使用提供的 requirements.txt:
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"

注:增强模块还需要 gfpgan:

pip install gfpgan basicsr

你必须把 models/ 目录组织成如下结构。Linux 可用 download_weights.sh,Windows 需要手动放置文件。(由于依赖经常变化,建议问 AI/LLM 获取最新正确方法)
目录结构:
./models/
├── musetalkV15/ # The V1.5 UNet Checkpoint
│ ├── musetalk.json
│ └── unet.pth
├── sd-vae/ # Stable Diffusion VAE (ft-mse)
│ ├── config.json
│ └── diffusion_pytorch_model.bin
├── whisper/ # Whisper Tiny (Audio Encoder)
│ ├── config.json
│ └── pytorch_model.bin
├── dwpose/ # Body pose estimation models
├── face-parse-bisent/ # Face Parsing models
├── GFPGANv1.4.pth # For enhance.py
└── resnet18-5c106cde.pth
MuseTalk/MuseTalk/scripts: (比较重要的文件夹)
这是用于 离线推理 (Offline Inference) 的主脚本,用于根据源视频和音频文件生成口型同步的视频。
- 功能: 它处理输入视频,提取帧,检测人脸并对其进行编码。然后,它使用 MuseTalk 模型(UNet 和 VAE)生成与输入音频同步的唇部动作。
- 主要特点:
  - 配置: 从 YAML 配置文件(例如 test_img.yaml)中读取任务(音频和视频对)。
  - 面部解析 (Face Parsing): 使用面部解析模型将生成的嘴部区域无缝融合回原始面部。
  - 优化: 支持 float16 精度以加速推理。
  - 输出: 使用 ffmpeg 将生成的视觉帧与输入音频结合,生成最终的 MP4 文件。
此脚本模拟 实时推理 (Real-time Inference) 场景,专为低延迟应用(如数字头像或聊天机器人)设计。
- 功能: 与从头开始处理所有内容的 inference.py 不同,此脚本依赖于预先计算的“头像”数据。它缓存背景视频的 VAE 潜在表示(Latents),因此在推理过程中,模型只需要运行生成步骤(UNet)和解码,从而跳过了耗时的编码步骤。
- 主要特点:
  - Avatar 类: 管理特定视频头像的状态,加载缓存的潜在表示 (latents.pt) 和坐标 (coords.pkl)。
  - 准备模式: 如果缓存文件不存在,可以自动生成它们。
  - 并发: 使用 Python 的 threading 和队列并行处理帧的大小调整和混合,与模型的生成循环同时进行,以最大化吞吐量。
这是一个专门用于 手动生成实时推理所需缓存 的实用脚本。
- 功能: 它独立执行 realtime_inference.py 中的“准备阶段”。它提取帧,检测关键点,并为特定视频预先计算 VAE 潜在表示(这是繁重的计算部分)。
- 目的: 用于预先设置头像(例如 bank_avatar_1),以便实时服务器或推理脚本可以立即启动,而无需设置延迟。它将缓存文件(latents.pt、masks.pt、mask_coords.pt)保存到结果目录中。
此脚本用于 训练数据准备,而不是用于生成视频。它处理原始视频数据集以训练 MuseTalk 模型。
- 功能: 将原始视频标准化为适合训练的格式。
- 主要步骤:
- 转换: 将输入视频转换为固定的帧率(25 FPS)。
- 分割: 将长视频分割成较短的片段(例如 30 秒)。
- 元数据提取: 使用 mmpose 和 FaceAlignment 检测每一帧的人脸边界框和关键点,并将数据保存到 JSON 文件中。
- 音频提取: 将音轨分离为 .wav 文件。
- 列表生成: 根据配置自动将数据拆分为训练列表和验证列表。
Force-Prep 是 Warm Server 的“秘密武器”:它把运行时的重计算转移到“部署准备阶段”。
scripts/force_prep.py 的原子操作流程:
- 抽帧: 将源 MP4 转为 PNG 帧序列(内存或临时磁盘)
- 人脸对齐: 用 face_alignment 检测 68 个关键点并计算人脸 bbox
- 坐标缓存: 将 bbox 坐标 (x1, y1, x2, y2) 保存到 coords.pkl
  - 这样坐标固定,推理时不会“摄像机抖动”
- VAE 编码: 裁剪人脸并归一化到 -1...1,送入 VAE Encoder
  - 得到 (4, 32, 32) 的 latent 张量
- Latent 缓存: 将所有帧 latent 拼接并保存为 latents.pt
- 准备源视频: 把高质量 avatar 视频(如 avatar_1.mp4)放到 data/video/
- 编辑配置: 打开 scripts/force_prep.py,设置:

AVATAR_ID = "my_avatar_v1"
VIDEO_PATH = "data/video/avatar_1.mp4"
BBOX_SHIFT = -5 # Adjusts the chin/mouth crop region

- 运行脚本:

python -m scripts.force_prep

结果: 会生成 results/avatars/my_avatar_v1/,其中包含 latents.pt、coords.pkl 等文件。
Force-prep 之后你可能会遇到:
no file path: ./models/dwpose/dw-ll_ucoco_384.pth
原因:你没下载开源模型的权重。修复方式:把下面命令粘贴到终端运行:
python -c '
import os
from huggingface_hub import hf_hub_download
# Define the weights to download
downloads = [
("TMElyralab/MuseTalk", "musetalk/musetalk.json", "models/musetalk"),
("TMElyralab/MuseTalk", "musetalk/pytorch_model.bin", "models/musetalk"),
("TMElyralab/MuseTalk", "musetalkV15/musetalk.json", "models/musetalkV15"),
("TMElyralab/MuseTalk", "musetalkV15/unet.pth", "models/musetalkV15"),
("stabilityai/sd-vae-ft-mse", "config.json", "models/sd-vae"),
("stabilityai/sd-vae-ft-mse", "diffusion_pytorch_model.bin", "models/sd-vae"),
("openai/whisper-tiny", "config.json", "models/whisper"),
("openai/whisper-tiny", "pytorch_model.bin", "models/whisper"),
("openai/whisper-tiny", "preprocessor_config.json", "models/whisper"),
("yzd-v/DWPose", "dw-ll_ucoco_384.pth", "models/dwpose"),
("ByteDance/LatentSync", "latentsync_syncnet.pt", "models/syncnet")
]
print("Starting download...")
for repo, filename, local_dir in downloads:
print(f"Downloading {filename}...")
os.makedirs(local_dir, exist_ok=True)
hf_hub_download(repo_id=repo, filename=filename, local_dir=local_dir)
# Download ResNet manually (not on HF Hub)
import urllib.request
print("Downloading resnet18...")
os.makedirs("models/face-parse-bisent", exist_ok=True)
urllib.request.urlretrieve(
"https://download.pytorch.org/models/resnet18-5c106cde.pth",
"models/face-parse-bisent/resnet18-5c106cde.pth"
)
print("All downloads complete.")
'

如果仍然报出类似的路径缺失或嵌套错误(例如 models/musetalkV15/musetalkV15/unet.pth),请用下面脚本修复:
python -c '
import os
import shutil
from huggingface_hub import hf_hub_download
def fix_or_download(repo, filename, expected_path, correct_local_dir_arg):
# 1. Check if file exists in the "wrong" nested location
# Previous script likely made: models/musetalkV15/musetalkV15/unet.pth
nested_path = os.path.join(os.path.dirname(expected_path), filename)
if os.path.exists(nested_path):
print(f"Found nested file at {nested_path}. Moving to {expected_path}...")
os.rename(nested_path, expected_path)
# Try to remove empty nested dir
try:
os.rmdir(os.path.dirname(nested_path))
except:
pass
elif os.path.exists(expected_path):
print(f"✅ File already exists at {expected_path}")
else:
print(f"File missing. Downloading {filename} to {expected_path}...")
# To get models/musetalkV15/unet.pth, we set local_dir="models"
# because filename already contains "musetalkV15/"
hf_hub_download(repo_id=repo, filename=filename, local_dir=correct_local_dir_arg)
# Fix V1.5 UNet
fix_or_download(
"TMElyralab/MuseTalk",
"musetalkV15/unet.pth",
"models/musetalkV15/unet.pth",
"models"
)
# Fix V1.5 Config
fix_or_download(
"TMElyralab/MuseTalk",
"musetalkV15/musetalk.json",
"models/musetalkV15/musetalk.json",
"models"
)
print("Fix complete. Running prep...")
'

如果你的源视频是 1080p 或 4K,标准 MuseTalk 输出(256x256 人脸裁剪)可能会偏软。可以用 enhance.py 改善清晰度。
enhance.py 封装了 GFPGAN(Generative Facial Prior GAN)作为修复滤镜,它会“脑补”高频细节(毛孔、睫毛、更清晰的牙齿等),让脸更锐利。
两种策略:
- 预增强(推荐): 在跑 force_prep.py 之前先对源视频跑 enhance.py
  - 这样背景与静态区域先变 HD,后续口型合成整体更稳
- 后增强: 等对话结束后,对最终 full_session.mp4 再跑增强
用法: 修改 enhance.py 配置:
INPUT_VIDEO = "results/full_session.mp4"
OUTPUT_VIDEO = "results/full_session_hd.mp4"

运行:

python enhance.py

脚本会抽帧到 temp_frames,对每帧执行 restorer.enhance(),再用 FFmpeg 拼回视频以保持音画同步。
这是项目核心。server_fast.py 会初始化 warm_api.py 引擎。
warm_api.py 中的 RealTimeInference 类专为稳定性而设计。
原始人脸检测有噪声。上一帧 bbox 可能 x=100,下一帧 x=101,导致画面轻微抖动。
One-Euro Filter(在 OneEuro 类里)是一种动态低通滤波:
- 脸不动: 强滤波(去抖动)
- 头移动: 弱滤波(减延迟,跟踪更灵敏)
warm_api.py 里有 scale_lock_enabled:
- 热身阶段: 前 50 帧统计宽高
- 锁定: 取中位数并锁住缩放比例,避免“头一会儿大一会儿小”的脉动效果(嘴张合时尤其明显)
为了处理头部旋转(roll),API 使用 cv2.calcOpticalFlowPyrLK(Lucas-Kanade 光流)跟踪稳定特征点(鼻梁、眼睛)。
它计算一个 仿射矩阵 M_affine 表示当前帧相对参考帧的旋转差异,并在贴 mouth patch 之前先把生成结果做仿射变换,使嘴部能与头部旋转同步。
你还可以继续加入更多特征与算法来提升渲染稳定性。
POST /speak
- Payload: {"text": "Hello world"}
- 流程:
  - 触发 VoiceEngine(TTS)生成 output_raw.wav
  - 用 FFmpeg 转 16kHz 音频
  - warm_model.run() 生成视频片段
  - append_to_session() 拼接进 full_session.mp4
- 返回: JSON(包含更新后的会话视频路径)

GET /reset
- 作用: 删除 full_session.mp4 以开始新会话
位于 webapp/,提供可用的聊天 UI。
因为视频生成仍需要时间(即使很快),不能阻塞 HTTP 请求:
- 用户请求 /chat
- 后端生成 UUID job_id 并启动一个 Thread
- 线程执行:
  - 调用 LLM(模拟或真实)
  - 调用 TTS(_generate_tts_wav)
  - 调用 render_avatar_video
- 主线程立即返回 job_id
前端不等待长请求返回:
- 每 500ms 轮询 /status/{job_id}
- 当状态为 done,更新 <video> 的 src
- 处理 video.play() Promise,避免浏览器 “User Activation” 报错
Warm Server 的行为由 configs/inference/realtime_live.yaml 控制:
| 参数 | 类型 | 描述 | 推荐 |
|---|---|---|---|
| avatar_id | String | results/avatars/ 下的文件夹名,必须匹配 force_prep | bank_avatar_1 |
| video_path | Path | 源视频路径 | data/video/bank.mp4 |
| bbox_shift | Int | mouth mask 的垂直偏移 | -5(更多下巴) |
| batch_size | Int | GPU 并行处理帧数 | RTX 3060 用 4,RTX 4090 用 16 |
| preparation | Bool | 是否运行时做人脸检测 | False(使用缓存) |
| fps | Int | 输出目标 FPS | 25 |
在 scripts/warm_api.py 的 RealTimeInference.__init__:
- self.blend_mode = 'feather' 或 'poisson':Poisson 质量更高但更慢(CPU 重)
- self.subpixel_enabled = True:让脸更稳定
- self.scale_lock_enabled = True:防止 pulsing(头部轻微缩放)
1)enhance.py 报 “Error: Input video not found or empty”
- 原因: enhance.py 里的 INPUT_VIDEO 路径不正确或视频为 0 帧
- 修复: 检查路径与文件是否存在;注意 enhance.py 默认 INPUT_VIDEO = "results/...",确认指向正确文件
2)“FFmpeg command not found”
- 原因: Python 的 subprocess 或 os.system 找不到 ffmpeg
- 修复: 把 FFmpeg 加入系统 PATH,重启终端/IDE
3)server_fast.py 报 “Model returned None”
- 原因: 推理流水线没生成帧,常见是音频为空或 avatar 缓存缺失
- 修复: 重新跑 force_prep.py,并检查 logs/ 中是否有具体的 CUDA 错误
4)人脸抖动 / 震动(Face Jitter / Vibration)
- 优化: 在 warm_api.py 中把 One-Euro 的 beta 调小(如 0.05)
  - 平滑更强,但会有轻微拖尾(增加一点点延迟感)
5)显存不足(OOM)
- 优化: 降低 force_prep.py 与 realtime_live.yaml 中的 BATCH_SIZE