MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting


This is the Definitive Technical Manual and Documentation for the MuseTalk Warm-Server Edition (v1.5 Custom). It serves as a comprehensive guide for developers, researchers, and system architects aiming to deploy real-time digital humans with sub-second latency.


Please cd MuseTalk/MuseTalk before running the following commands, and install requirements.txt.

The requirements.txt file contains the latest working dependencies for this project. However, you also need to pull in additional dependencies by installing a few GitHub repos and downloading model weights; an LLM can help you work through this part, since the exact steps change frequently.

(Downloading and updating dependencies may take a long time.)

Also, the mouth movement is still a bit unrealistic, even after applying multiple stabilization algorithms such as the Kalman and One-Euro filters to reduce noise.

MuseTalk: Warm-Server Edition (v1.5)

The Definitive Architecture Guide & Usage Manual

Version: 1.5.0-Warm-Production Author: Customized for Real-Time Digital Human Interaction Architecture: Persistent Latent Inference with Continuous Session Management


📚 Table of Contents

  1. Executive Summary & Core Philosophy
  2. System Architecture Deep Dive
    • The "Cold Start" Problem
    • The "Warm Server" Solution
    • The Continuous Session Pipeline
  3. Hardware & Software Prerequisites
  4. Installation & Environment Setup
  5. Phase 1: The Force-Prep Protocol (Caching)
    • Theory of Operation
    • Execution Guide
  6. Phase 2: High-Fidelity Enhancement
    • GFPGAN Integration
    • The Enhancement Workflow
  7. Phase 3: The Warm Server (Runtime)
    • API Endpoints
    • Under the Hood: warm_api.py
    • Stabilization Algorithms (One-Euro, Affine)
  8. Phase 4: The Full-Stack Web Application
    • Backend Job System
    • Frontend Polling Architecture
  9. Configuration Reference
  10. Troubleshooting & Optimization

1. Executive Summary & Core Philosophy

Standard implementations of generative AI video models, and MuseTalk specifically, suffer from a critical bottleneck known as "Cold Start Latency." In a naive implementation, a request to generate a speaking video triggers a cascade of heavy operations: loading 4GB+ of PyTorch weights into VRAM, instantiating the VAE (Variational Autoencoder) and UNet, reading the source video from disk, running face detection algorithms (like S3FD) on every frame, and finally encoding those frames into latent tensors.

For a 5-second response, this "overhead" can take 10-15 seconds before generation even begins. This renders real-time conversation impossible.

The MuseTalk Warm-Server Edition fundamentally re-architects this process into a persistent, stateful service. By decoupling the "Preparation Phase" from the "Inference Phase," we achieve:

  1. Sub-Second Time-to-First-Frame (TTFF): The model waits in VRAM ("warm"), ready to accept audio tensors instantly.
  2. Zero-Shot Face Detection: Face coordinates and background latents are pre-calculated and cached (force_prep.py), eliminating computer vision overhead during the live chat.
  3. Continuous Conversation: Instead of generating isolated video files, the server intelligently stitches new sentences into a growing, seamless video file using FFmpeg concat-demuxing (server_fast.py).

2. System Architecture Deep Dive

The "Cold Start" Problem

In standard inference scripts (like the original inference.py), the lifecycle of a request is:

  1. import torch (2s)
  2. Load UNet/VAE (4s)
  3. Face Detection (100ms per frame)
  4. VAE Encoding (50ms per frame)
  5. UNet Inference (Fast)
  6. VAE Decoding (Fast)
  7. Cleanup

For a 100-frame video (4 seconds), steps 3 and 4 alone add 15 seconds of latency on a mid-range GPU.

Also, the first time the AI-generated mouth is rendered on top of an uploaded video, the render can take up to 10 minutes on a low-end GPU. But after that initial render, the prepared caches can be reused, so subsequent requests start almost instantly.

The "Warm Server" Solution

The Warm Server (server_fast.py + warm_api.py) changes the lifecycle to:

  1. Server Start: Load all models and caches (Done once).
  2. Idle State: Consume ~6GB VRAM, waiting for requests.
  3. Request Received:
  • Audio → Whisper Feature Extractor (100ms)
  • UNet Inference (Direct on GPU)
  • VAE Decode & Blend
  • Total Latency: ~0.8s for start of generation.
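
To make the "load once, serve many" idea concrete, here is a minimal sketch of the warm-server pattern. It is not the actual server_fast.py code: load_models, load_avatar_cache, and handle_speak are illustrative stand-ins for the real loading and inference logic.

import time

def load_models():
    """Stand-in for loading the UNet, VAE and Whisper weights into VRAM (done once)."""
    return {"unet": "unet", "vae": "vae", "whisper": "whisper"}

def load_avatar_cache(avatar_id):
    """Stand-in for reading latents.pt / coords.pkl produced by force_prep.py."""
    return {"avatar_id": avatar_id, "latents": [], "coords": []}

# Startup cost is paid exactly once, when the server process boots.
MODELS = load_models()
CACHE = load_avatar_cache("my_avatar_v1")

def handle_speak(text):
    """Per-request path: only audio features, UNet inference and decoding remain."""
    start = time.time()
    # Real code would run: Whisper features -> UNet -> VAE decode -> blend -> write clip.
    clip_path = "results/clip_%d.mp4" % int(start)
    return clip_path, time.time() - start

if __name__ == "__main__":
    print(handle_speak("Hello world"))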

The Continuous Session Pipeline

To simulate a real video call, we cannot just return "video_1.mp4", then "video_2.mp4". We need a single, growing stream.

  • The Session File: results/full_session.mp4
  • The Stitcher: When a new clip is generated, server_fast.py generates a temporary text file list (concat_list.txt) containing the absolute paths of the existing session and the new clip.
  • FFmpeg Concat: It executes ffmpeg -f concat -c copy .... The -c copy flag is crucial; it performs a bitstream copy, meaning it does not re-encode frames. It simply updates the container metadata. This allows stitching hours of video in milliseconds.
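
As a concrete illustration of the stitching step, here is a minimal Python sketch of the concat-demuxer append (not the literal server_fast.py code; the append_to_session name and paths are illustrative). The -f concat / -safe 0 / -c copy flags are the standard way to append without re-encoding; note this only works because every clip is produced with identical codec settings.

import os
import subprocess
import tempfile

def append_to_session(session_path, new_clip_path):
    """Append new_clip_path to session_path via FFmpeg bitstream copy (no re-encode)."""
    if not os.path.exists(session_path):
        # First clip of the conversation simply becomes the session file.
        os.replace(new_clip_path, session_path)
        return session_path

    # Write the concat list: one "file '<absolute path>'" line per input.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("file '%s'\n" % os.path.abspath(session_path))
        f.write("file '%s'\n" % os.path.abspath(new_clip_path))
        concat_list = f.name

    merged = session_path + ".tmp.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", concat_list, "-c", "copy", merged],
        check=True,
    )
    os.replace(merged, session_path)   # swap the grown session back into place
    os.remove(concat_list)
    return session_path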

3. Hardware & Software Prerequisites

Hardware

  • GPU: NVIDIA RTX 3060 (12GB) or higher recommended.

  • Minimum: 8GB VRAM (e.g., RTX 2070). You may need to reduce batch sizes.

  • Ideal: RTX 4090 (24GB) allows for batch sizes of 32+ and 4K caching.

  • Storage: NVMe SSD. Loading large .pt latent caches from a spinning HDD will introduce stutter.

  • RAM: 32GB System RAM.

Software

  • OS: Linux (Ubuntu 20.04/22.04) or Windows 10/11 (via PowerShell or CMD).
  • Python: 3.10.x (Strict dependency).
  • CUDA: 11.8 or 12.1.
  • FFmpeg: CRITICAL. Must be installed and accessible via system PATH.
  • Verify by running ffmpeg -version in your terminal.

4. Installation & Environment Setup

Step 1: Clone and Environment

# Clone the repository
git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk/MuseTalk

# create and activate environment
python3 -m venv venv
source venv/bin/activate

# Install PyTorch (Ensure CUDA compatibility)
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

Step 2: Install Core Dependencies

Use the provided requirements.txt.

pip install -r requirements.txt
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"

Note: We also need gfpgan for the enhancement module.

pip install gfpgan basicsr

Step 3: Model Weight Acquisition

You must organize your models/ directory exactly as follows. Use the download_weights.sh script on Linux, or manually place the files on Windows. (Since these dependencies change frequently, an LLM can help you find the current download steps.)

Directory Structure:

./models/
├── musetalkV15/       # The V1.5 UNet Checkpoint
│   ├── musetalk.json
│   └── unet.pth
├── sd-vae/            # Stable Diffusion VAE (ft-mse)
│   ├── config.json
│   └── diffusion_pytorch_model.bin
├── whisper/           # Whisper Tiny (Audio Encoder)
│   ├── config.json
│   └── pytorch_model.bin
├── dwpose/            # Body pose estimation models
├── face-parse-bisent/ # Face Parsing models
├── GFPGANv1.4.pth     # For enhance.py
└── resnet18-5c106cde.pth


Important folder: MuseTalk/MuseTalk/scripts

Here is an explanation of what each script in this directory does:

1. inference.py

This is the main script for offline inference, used to generate a lip-synced video from a source video and an audio file.

  • Functionality: It processes input videos by extracting frames, detecting faces, and encoding them. It then uses the MuseTalk model (UNet and VAE) to generate lip movements synchronized with the input audio.
  • Key Features:
  • Configuration: Reads tasks (pairs of audio and video) from a YAML configuration file (e.g., test_img.yaml).
  • Face Parsing: Uses a face parsing model to blend the generated mouth region back into the original face seamlessly.
  • Optimization: Supports float16 precision to speed up inference.
  • Output: Combines the generated visual frames with the input audio using ffmpeg to produce the final MP4 file.

2. realtime_inference.py (cache-based)

This script simulates a real-time inference scenario designed for lower latency, suitable for digital avatars or chatbots.

  • Functionality: Unlike inference.py, which processes everything from scratch, this script relies on pre-computed "avatar" data. It caches the VAE latent representations of the background video so that during inference, the model only needs to run the generation step (UNet) and decoding, skipping the costly encoding step.
  • Key Features:
  • Avatar Class: Manages a specific video avatar's state, loading cached latents (latents.pt) and coordinates (coords.pkl).
  • Preparation Mode: Can automatically generate the necessary cache files if they don't exist.
  • Concurrency: Uses Python threading and queues to process frame resizing and blending in parallel with the model's generation loop to maximize throughput.
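
For readers unfamiliar with that pattern, the sketch below shows the general producer/consumer shape it describes: a queue plus a worker thread so that CPU-side resizing/blending overlaps with the GPU generation loop. The names here are illustrative, not the actual methods in realtime_inference.py.

import queue
import threading

frame_queue = queue.Queue(maxsize=32)                     # generated face patches waiting to be blended
generated_patches = ["patch_%d" % i for i in range(8)]    # stand-in for UNet/VAE output

def blend_worker():
    """CPU-side consumer: resize and paste patches while the GPU keeps generating."""
    while True:
        item = frame_queue.get()
        if item is None:                                  # sentinel: generation finished
            break
        frame_idx, face_patch = item
        # Placeholder for: resize patch, paste into the cached background frame, write out.
        frame_queue.task_done()

worker = threading.Thread(target=blend_worker, daemon=True)
worker.start()

for frame_idx, face_patch in enumerate(generated_patches):   # the "GPU" generation loop
    frame_queue.put((frame_idx, face_patch))

frame_queue.put(None)                                     # tell the worker to stop
worker.join()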

3. force_prep.py

This is a utility script dedicated to manually generating the cache required for real-time inference.

  • Functionality: It performs the "Preparation Phase" of realtime_inference.py in isolation. It extracts frames, detects landmarks, and pre-calculates the VAE latents (the heavy computation) for a specific video.
  • Purpose: It is useful for setting up an avatar (e.g., bank_avatar_1) beforehand so that the real-time server or inference script can start immediately without a setup delay. It saves the cache files (latents.pt, masks.pt, mask_coords.pt) to the results directory.

4. preprocess.py

This script is for training data preparation, not for generating videos. It processes raw video datasets to train the MuseTalk models.

  • Functionality: It standardizes raw videos into a format suitable for training.
  • Key Steps:
  1. Conversion: Converts input videos to a fixed frame rate (25 FPS).
  2. Segmentation: Splits long videos into shorter clips (e.g., 30 seconds).
  3. Metadata Extraction: Uses mmpose and FaceAlignment to detect face bounding boxes and landmarks for every frame, saving the data to JSON files.
  4. Audio Extraction: Separates audio tracks into .wav files.
  5. List Generation: automatically splits data into training and validation lists based on the configuration.

Bonus: warm_api.py

This file is also present in the scripts folder and is an advanced inference script focused on high-stability output.

  • Functionality: It implements advanced filtering (One-Euro filters) to smooth out jittery landmarks and uses sophisticated blending techniques (Poisson blending, sub-pixel shifting) to improve the visual quality of the final composite video.

5. Phase 1: The Force-Prep Protocol (Caching)

The Force-Prep phase is the secret sauce of the Warm Server. It moves the heavy lifting from "Runtime" to "Setup Time."

Theory of Operation

The script scripts/force_prep.py performs the following atomic operations:

  1. Frame Extraction: Converts the source MP4 into a sequence of PNGs (stored in memory or temp disk).
  2. Face Alignment: Uses face_alignment to detect 68 landmarks. It calculates the bounding box for the face.
  3. Coordinate Caching: Saves the bounding box coordinates (x1, y1, x2, y2) to a pickle file (coords.pkl). This ensures the "camera" doesn't jitter during live inference because the coordinates are fixed.
  4. VAE Encoding: It crops the face, normalizes it to -1...1, and passes it through the VAE Encoder. The result is a latent tensor of shape (4, 32, 32).
  5. Latent Caching: These tensors are concatenated and saved as latents.pt.
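
The sketch below shows steps 2-5 in code form, assuming a diffusers-style AutoencoderKL for the sd-vae weights. It is a simplified stand-in for scripts/force_prep.py (the real script's exact normalization, bbox logic, and file layout may differ); detect_face_bbox and the frames list are placeholders.

import pickle
import cv2
import numpy as np
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("./models/sd-vae").to("cuda").eval()

frames = [np.zeros((720, 1280, 3), dtype=np.uint8)]   # stand-in for the extracted frames

def detect_face_bbox(frame):
    return (400, 100, 656, 356)                        # stand-in for the landmark-based bbox

def encode_face(frame_bgr, bbox):
    """Crop the face, normalize to [-1, 1], and encode it into a (4, 32, 32) latent."""
    x1, y1, x2, y2 = bbox
    crop = cv2.resize(frame_bgr[y1:y2, x1:x2], (256, 256))
    tensor = torch.from_numpy(crop[..., ::-1].copy()).float() / 127.5 - 1.0   # BGR->RGB, [-1, 1]
    tensor = tensor.permute(2, 0, 1).unsqueeze(0).to("cuda")
    with torch.no_grad():
        latent = vae.encode(tensor).latent_dist.mode()   # the real script may scale these
    return latent.squeeze(0).cpu()                       # shape (4, 32, 32)

latents, coords = [], []
for frame in frames:
    bbox = detect_face_bbox(frame)
    coords.append(bbox)
    latents.append(encode_face(frame, bbox))

torch.save(torch.stack(latents), "results/avatars/my_avatar_v1/latents.pt")
with open("results/avatars/my_avatar_v1/coords.pkl", "wb") as f:
    pickle.dump(coords, f)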

Execution Guide

  1. Prepare Source Video: Place your high-quality avatar video (e.g., avatar_1.mp4) in data/video/.
  2. Edit Config: Open scripts/force_prep.py.

AVATAR_ID = "my_avatar_v1"
VIDEO_PATH = "data/video/avatar_1.mp4"
BBOX_SHIFT = -5  # Adjusts the chin/mouth crop region

  3. Run Script:

python -m scripts.force_prep

Result: A new folder results/avatars/my_avatar_v1/ is created containing latents.pt, coords.pkl, etc.


However, after this you may encounter the following error:

no file path: ./models/dwpose/dw-ll_ucoco_384.pth

This happens because you have not yet downloaded the weights of the open-source models. To fix it, paste this into the terminal:


python -c '
import os
from huggingface_hub import hf_hub_download

# Define the weights to download
downloads = [
    ("TMElyralab/MuseTalk", "musetalk/musetalk.json", "models/musetalk"),
    ("TMElyralab/MuseTalk", "musetalk/pytorch_model.bin", "models/musetalk"),
    ("TMElyralab/MuseTalk", "musetalkV15/musetalk.json", "models/musetalkV15"),
    ("TMElyralab/MuseTalk", "musetalkV15/unet.pth", "models/musetalkV15"),
    ("stabilityai/sd-vae-ft-mse", "config.json", "models/sd-vae"),
    ("stabilityai/sd-vae-ft-mse", "diffusion_pytorch_model.bin", "models/sd-vae"),
    ("openai/whisper-tiny", "config.json", "models/whisper"),
    ("openai/whisper-tiny", "pytorch_model.bin", "models/whisper"),
    ("openai/whisper-tiny", "preprocessor_config.json", "models/whisper"),
    ("yzd-v/DWPose", "dw-ll_ucoco_384.pth", "models/dwpose"),
    ("ByteDance/LatentSync", "latentsync_syncnet.pt", "models/syncnet")
]

print("Starting download...")
for repo, filename, local_dir in downloads:
    print(f"Downloading {filename}...")
    os.makedirs(local_dir, exist_ok=True)
    hf_hub_download(repo_id=repo, filename=filename, local_dir=local_dir)

# Download ResNet manually (not on HF Hub)
import urllib.request
print("Downloading resnet18...")
os.makedirs("models/face-parse-bisent", exist_ok=True)
urllib.request.urlretrieve(
    "https://download.pytorch.org/models/resnet18-5c106cde.pth",
    "models/face-parse-bisent/resnet18-5c106cde.pth"
)
print("All downloads complete.")
'


If this still fails with other errors, such as the nested-path problem shown in the repository screenshot (weights ending up under models/musetalkV15/musetalkV15/), apply this:


python -c '
import os
import shutil
from huggingface_hub import hf_hub_download

def fix_or_download(repo, filename, expected_path, correct_local_dir_arg):
    # 1. Check if file exists in the "wrong" nested location
    # Previous script likely made: models/musetalkV15/musetalkV15/unet.pth
    nested_path = os.path.join(os.path.dirname(expected_path), filename)

    if os.path.exists(nested_path):
        print(f"Found nested file at {nested_path}. Moving to {expected_path}...")
        os.rename(nested_path, expected_path)
        # Try to remove empty nested dir
        try:
            os.rmdir(os.path.dirname(nested_path))
        except:
            pass
    elif os.path.exists(expected_path):
        print(f"✅ File already exists at {expected_path}")
    else:
        print(f"File missing. Downloading {filename} to {expected_path}...")
        # To get models/musetalkV15/unet.pth, we set local_dir="models"
        # because filename already contains "musetalkV15/"
        hf_hub_download(repo_id=repo, filename=filename, local_dir=correct_local_dir_arg)

# Fix V1.5 UNet
fix_or_download(
    "TMElyralab/MuseTalk",
    "musetalkV15/unet.pth",
    "models/musetalkV15/unet.pth",
    "models"
)

# Fix V1.5 Config
fix_or_download(
    "TMElyralab/MuseTalk",
    "musetalkV15/musetalk.json",
    "models/musetalkV15/musetalk.json",
    "models"
)

print("Fix complete. Running prep...")
'


6. Phase 2: High-Fidelity Enhancement (important)

If your source video is 1080p or 4K, the standard MuseTalk output (a 256x256 face crop) might look soft. Use enhance.py to fix this; it improves quality significantly and makes the result look more realistic.

GFPGAN Integration

The enhance.py script wraps the GFPGAN (Generative Facial Prior GAN) framework. It acts as a restoration filter that "hallucinates" high-frequency details (pores, eyelashes, sharper teeth) onto the generated face.

The Enhancement Workflow

You have two strategies here:

  1. Pre-Enhancement (Recommended): Run enhance.py on your source video before running force_prep.py. This ensures the background and non-moving parts of the face are already HD.
  2. Post-Enhancement: Run enhance.py on the final full_session.mp4 after the chat is over.

Usage: Edit enhance.py configuration:

INPUT_VIDEO = "results/full_session.mp4"
OUTPUT_VIDEO = "results/full_session_hd.mp4"

Run the script:

python enhance.py

Note: The script extracts frames to a temp_frames directory, applies restorer.enhance(), and then restitches them using FFmpeg to preserve the audio sync.
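
For reference, here is a minimal sketch of the per-frame enhancement loop, assuming the standard gfpgan package API (GFPGANer). The directory names are illustrative, and the real enhance.py may differ in details such as the upscale factor or background upsampler.

import glob
import os
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="models/GFPGANv1.4.pth",
    upscale=1,                 # keep the original resolution; we only want detail restoration
    arch="clean",
    channel_multiplier=2,
)

os.makedirs("temp_frames_hd", exist_ok=True)
for path in sorted(glob.glob("temp_frames/*.png")):
    frame = cv2.imread(path)
    # enhance() returns (cropped_faces, restored_faces, restored_full_image).
    _, _, restored = restorer.enhance(
        frame, has_aligned=False, only_center_face=False, paste_back=True
    )
    cv2.imwrite(os.path.join("temp_frames_hd", os.path.basename(path)), restored)

# Afterwards, restitch temp_frames_hd/*.png with FFmpeg and mux the original audio back in.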

(Before / After comparison screenshots are included in the repository.)


7. Phase 3: The Warm Server (Runtime) (IMPORTANT: this is the main component that keeps the render stable and effective)

This is the core of the project. The server_fast.py script initializes the warm_api.py engine.

Under the Hood: warm_api.py

The RealTimeInference class inside warm_api.py is engineered specifically for output stability.

1. One-Euro Filtering

Raw face detection is noisy. One frame the box is at x=100, the next at x=101. This causes the face to shake. The One-Euro Filter (implemented in OneEuro class) is a low-pass filter that dynamically adjusts its aggressiveness based on speed.

  • Stationary Face: High filtering (jitter removal).
  • Moving Head: Low filtering (low latency tracking).
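
For reference, here is a compact, self-contained implementation of the standard One-Euro filter (Casiez et al.). The parameter names follow the usual formulation (min_cutoff, beta, d_cutoff); the OneEuro class in warm_api.py may organize things differently, but the idea is the same. Lowering beta increases smoothing at the cost of a slight lag, which is exactly the tuning knob mentioned in the Troubleshooting section.

import math

class OneEuroFilter:
    """Standard One-Euro filter: adaptive low-pass for a noisy 1-D signal (e.g., bbox x)."""

    def __init__(self, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.min_cutoff, self.beta, self.d_cutoff = min_cutoff, beta, d_cutoff
        self.x_prev = None
        self.dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff, dt):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x, dt=1.0 / 25.0):             # dt = frame period at 25 FPS
        if self.x_prev is None:
            self.x_prev = x
            return x
        dx = (x - self.x_prev) / dt                    # estimated speed of the signal
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)   # fast motion -> less smoothing lag
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat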

2. Scale Locking

warm_api.py includes a scale_lock_enabled flag.

  • Warmup: For the first 50 frames, it collects width/height statistics.
  • Lock: It calculates the median size and locks the zoom level. This prevents the "pulsing head" effect where the face grows/shrinks slightly as the mouth opens.
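
A sketch of that warm-up-then-lock logic, with hypothetical names (the actual warm_api.py statistics may be computed differently):

import statistics

WARMUP_FRAMES = 50
observed_sizes = []
locked_size = None

def stabilized_size(width, height):
    """Collect crop sizes during warm-up, then return the locked median size forever after."""
    global locked_size
    if locked_size is not None:
        return locked_size
    observed_sizes.append((width, height))
    if len(observed_sizes) >= WARMUP_FRAMES:
        locked_size = (
            int(statistics.median(w for w, _ in observed_sizes)),
            int(statistics.median(h for _, h in observed_sizes)),
        )
        return locked_size
    return (width, height)     # still warming up: pass the raw size through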

3. Affine Tracking & Sub-Pixel Shift

To handle head rotation (roll), the API uses cv2.calcOpticalFlowPyrLK (Lucas-Kanade) to track stable features (nose bridge, eyes). It computes an Affine Matrix (M_affine) representing the rotation difference between the current frame and the reference. This matrix is applied to the generated mouth patch before pasting, ensuring the mouth rotates perfectly with the head.
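
The sketch below shows that optical-flow-to-affine pipeline with standard OpenCV calls (cv2.calcOpticalFlowPyrLK, cv2.estimateAffinePartial2D, cv2.warpAffine). The point selection and patch handling are simplified stand-ins for what warm_api.py actually tracks; in practice you would typically keep only the rotation/scale part of the matrix and re-center it on the patch before warping.

import cv2
import numpy as np

def rotate_patch_with_head(ref_gray, cur_gray, ref_points, mouth_patch):
    """Estimate head rotation between the reference and current frame, then warp the mouth patch to match."""
    # Track stable reference points (e.g., nose bridge, eye corners) into the current frame.
    cur_points, status, _ = cv2.calcOpticalFlowPyrLK(ref_gray, cur_gray, ref_points, None)
    good_ref = ref_points[status.flatten() == 1]
    good_cur = cur_points[status.flatten() == 1]
    if len(good_ref) < 3:
        return mouth_patch                         # not enough tracked points: paste unrotated

    # Partial affine = rotation + uniform scale + translation (no shear), robust to a few bad tracks.
    M_affine, _ = cv2.estimateAffinePartial2D(good_ref, good_cur)
    if M_affine is None:
        return mouth_patch

    h, w = mouth_patch.shape[:2]
    return cv2.warpAffine(mouth_patch, M_affine, (w, h), flags=cv2.INTER_LINEAR)

# ref_points must be float32 of shape (N, 1, 2), e.g.:
# ref_points = np.array([[[320.0, 180.0]], [[360.0, 182.0]], [[340.0, 210.0]]], dtype=np.float32)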

You can add further features and algorithms here to improve rendering stability.

API Endpoints (server_fast.py)

  • POST /speak
  • Payload: {"text": "Hello world"}
  • Action:
  1. Triggers VoiceEngine (TTS) to generate output_raw.wav.
  2. Converts audio to 16kHz via FFmpeg.
  3. warm_model.run() generates the visual clip.
  4. append_to_session() stitches the clip to full_session.mp4.
  • Return: JSON with the path to the updated session video.

  • GET /reset

  • Action: Deletes full_session.mp4 to start a fresh conversation.
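
A quick way to exercise these endpoints from Python (the host and port are assumptions; adjust them to however you launch server_fast.py):

import requests

BASE = "http://127.0.0.1:8000"   # assumed host/port for the warm server

# Generate a new sentence and stitch it into the growing session video.
resp = requests.post(f"{BASE}/speak", json={"text": "Hello world"})
print(resp.json())               # JSON containing the path to the updated full_session.mp4

# Start a fresh conversation (deletes full_session.mp4).
requests.get(f"{BASE}/reset")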


8. Phase 4: The Full-Stack Web Application (work in progress, especially the frontend)

Located in webapp/, this is a fully functional chat interface.

Backend Job System (webapp/backend/main.py)

Because video generation takes time (even if it's fast), we cannot block the HTTP request.

  1. Request: User hits /chat.
  2. Job Creation: Backend creates a UUID job_id and spawns a Thread.
  3. Thread Execution:
    • Call LLM (Simulated or Real).
    • Call TTS (_generate_tts_wav).
    • Call render_avatar_video.
  4. Status: The main thread returns the job_id immediately.
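
Here is a minimal sketch of that job pattern using FastAPI and a plain dict as the job store. The endpoint shapes follow the description above, but the body is a stand-in: the LLM, TTS (_generate_tts_wav), and render_avatar_video calls are represented by comments rather than copied from webapp/backend/main.py.

import threading
import uuid
from fastapi import FastAPI

app = FastAPI()
JOBS = {}   # job_id -> {"status": "running" | "done" | "error", "video": path or None}

def run_job(job_id, user_text):
    try:
        # 1) call the LLM, 2) synthesize speech (_generate_tts_wav), 3) render_avatar_video(...)
        video_path = "results/full_session.mp4"          # stand-in for the rendered output path
        JOBS[job_id] = {"status": "done", "video": video_path, "text": user_text}
    except Exception as exc:
        JOBS[job_id] = {"status": "error", "video": None, "detail": str(exc)}

@app.post("/chat")
def chat(payload: dict):
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "video": None}
    threading.Thread(target=run_job, args=(job_id, payload.get("text", "")), daemon=True).start()
    return {"job_id": job_id}        # returned immediately; the frontend polls /status/{job_id}

@app.get("/status/{job_id}")
def status(job_id: str):
    return JOBS.get(job_id, {"status": "unknown"})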

Frontend Polling Architecture (webapp/frontend/app.js)

The JavaScript frontend does not wait for a long HTTP response.

  1. Poll: It checks /status/{job_id} every 500ms.
  2. Video Swap: When status is done, it updates the <video> src attribute.
  3. Auto-Play Handling: It manages the video.play() promise to avoid browser "User Activation" errors.

9. Configuration Reference

The behavior of the Warm Server is governed by configs/inference/realtime_live.yaml.

Parameter | Type | Description | Recommended
avatar_id | String | The folder name in results/avatars/. Must match force_prep. | bank_avatar_1
video_path | Path | Source video file. | data/video/bank.mp4
bbox_shift | Int | Vertical offset for the mouth mask. | -5 (more chin)
batch_size | Int | How many frames to process in parallel on the GPU. | 4 (RTX 3060), 16 (RTX 4090)
preparation | Bool | Whether to run face detection at runtime. | False (use cache)
fps | Int | Target output FPS. | 25

Warm API Internal Toggles

Inside scripts/warm_api.py, RealTimeInference.__init__:

  • self.blend_mode = 'feather' or 'poisson'. Poisson is higher quality but slower (CPU intensive).
  • self.subpixel_enabled = True. Keeps the face stable.
  • self.scale_lock_enabled = True. Prevents pulsing.

10. Troubleshooting & Optimization

Common Errors

1. "Error: Input video not found or empty" in enhance.py

  • Cause: The INPUT_VIDEO path in enhance.py is incorrect or the video has 0 frames.
  • Fix: Verify the path logic. Note that enhance.py defines INPUT_VIDEO = "results/...". Ensure you are pointing to the correct file.

2. "FFmpeg command not found"

  • Cause: Python's subprocess or os.system cannot see ffmpeg.
  • Fix: Add FFmpeg to your System PATH variables. Restart your terminal/IDE.

3. "Model returned None" in server_fast.py

  • Cause: The inference pipeline failed to generate frames. Usually because the audio file was empty or the avatar_id cache is missing.
  • Fix: Run force_prep.py again. Check logs/ for specific CUDA errors.

4. Face Jitter / Vibration

  • Optimization: In warm_api.py, lower the beta value in OneEuro filter (e.g., 0.05). This increases smoothing but adds slight latency trails.

5. OOM (Out of Memory)

  • Optimization: Reduce BATCH_SIZE in force_prep.py and in realtime_live.yaml.

Mandarin version (中文版本):

MuseTalk:Warm-Server 版本(v1.5)

权威架构指南 & 使用手册(开发者/研究者/系统架构师版)

版本: 1.5.0-Warm-Production 作者: 为实时数字人交互定制(Real-Time Digital Human Interaction) 架构: 持久在线推理(Persistent Latent Inference)+ 连续会话管理(Continuous Session Management)


本文档是 MuseTalk Warm-Server Edition(v1.5 Custom)的最终技术手册与文档。它为希望部署 亚秒级延迟 的实时数字人系统的开发者、研究人员与系统架构师提供完整指导。


✅ 首先请在终端执行(很重要)

请先 cd MuseTalk/MuseTalk 再执行后续命令,并安装 requirements.txt

requirements.txt 里包含运行该项目的最新可用依赖。但你还需要通过安装 GitHub 仓库依赖与下载权重文件来补全运行环境 —— 请用 LLM(大模型)来帮你管理这部分(因为依赖会经常变动、手动查很麻烦)。

(你可能需要花较长时间下载与更新依赖)

另外:即使你已经用了 Kalman / One-Euro 等多种去噪稳定算法,嘴部运动仍然有点不够真实。


📚 目录

  1. 执行摘要 & 核心理念

  2. 系统架构深度解析

    • “冷启动(Cold Start)”问题
    • “Warm Server(热启动服务)”解决方案
    • 连续会话流水线(Continuous Session Pipeline)
  3. 硬件 & 软件前置条件

  4. 安装与环境配置

  5. 阶段 1:Force-Prep 协议(缓存)

    • 工作原理
    • 执行指南
  6. 阶段 2:高保真增强

    • GFPGAN 集成
    • 增强工作流
  7. 阶段 3:Warm Server(运行时)

    • API 端点
    • warm_api.py 的内部机制
    • 稳定算法(One-Euro、仿射等)
  8. 阶段 4:全栈 Web 应用

    • 后端 Job 系统
    • 前端轮询架构
  9. 配置参数参考

  10. 排错与优化


1. 执行摘要 & 核心理念

标准的生成式 AI 视频模型实现(尤其是 MuseTalk)存在一个关键瓶颈:冷启动延迟(Cold Start Latency)

在最朴素的实现中(naive implementation),每次请求生成口型视频都会触发一连串“重操作”:

  • 4GB+ 的 PyTorch 权重加载进 VRAM
  • 实例化 VAE(变分自编码器)和 UNet
  • 从硬盘读取源视频
  • 对每一帧都做人脸检测(例如 S3FD)
  • 将帧编码成 latent 张量

因此,即便你只想生成 5 秒的回复视频,启动开销也可能要 10–15 秒 才开始真正推理 —— 这让实时对话几乎不可能。

MuseTalk Warm-Server 版本从根本上重构了这个流程:把系统做成持久、可复用、状态化服务。通过将“准备阶段(Preparation Phase)”与“推理阶段(Inference Phase)”解耦,我们实现:

  1. 亚秒级首帧时间(TTFF,Time-to-First-Frame):模型常驻 VRAM(保持 warm),随时可接收音频张量。
  2. 零开销人脸检测(Zero-Shot Face Detection):人脸坐标与背景 latents 在 force_prep.py 预计算并缓存,实时推理中不再做 CV 计算。
  3. 连续对话输出:不是生成 video_1.mp4、video_2.mp4 这种碎片文件,而是通过 server_fast.py + FFmpeg 拼接机制,把每段新句子无缝拼进一个“持续增长”的会话视频里。

2. 系统架构深度解析

“冷启动(Cold Start)”问题

在标准推理脚本(如原版 inference.py)中,一次请求的生命周期通常是:

  1. import torch(2s)
  2. 加载 UNet/VAE(4s)
  3. 人脸检测(每帧 ~100ms)
  4. VAE 编码(每帧 ~50ms)
  5. UNet 推理(快)
  6. VAE 解码(快)
  7. 清理资源

对一个 100 帧(约 4 秒) 的视频来说,仅步骤 3 与 4 就可能额外增加 15 秒 的延迟(在中档 GPU 上)。

另外:当你第一次把 AI 口型渲染到上传视频上时,在低端 GPU 上可能需要 长达 10 分钟。但渲染完成后,这些预处理缓存即可被复用,后续请求几乎可以立即开始生成。


“Warm Server(热启动服务)”解决方案

Warm Server(server_fast.py + warm_api.py)把生命周期改成:

  1. 服务器启动时:加载所有模型与缓存(只做一次)

  2. 空闲等待:占用 ~6GB VRAM,等待请求

  3. 收到请求

    • 音频 → Whisper 特征提取(~100ms)
    • UNet 推理(直接走 GPU)
    • VAE 解码 + 混合合成
    • 总延迟: 首次开始生成大约 ~0.8s

连续会话流水线(Continuous Session Pipeline)

为了模拟视频通话,我们不能只返回 “video_1.mp4”,再返回 “video_2.mp4”。我们需要一个持续增长的、单一的输出流。

  • 会话文件: results/full_session.mp4

  • 拼接器: 每次生成新片段后,server_fast.py 会生成一个临时文本列表(concat_list.txt),里面写的是 旧会话视频新片段 的绝对路径

  • FFmpeg 拼接: 执行:

    ffmpeg -f concat -c copy ...

其中 -c copy 非常关键:它做的是比特流拷贝(bitstream copy),不会重新编码帧,只更新容器元数据。因此拼接几个小时视频也能在毫秒级完成。


3. 硬件 & 软件前置条件

硬件

  • GPU: 推荐 NVIDIA RTX 3060(12GB)或更高

    • 最低: 8GB VRAM(如 RTX 2070),可能需要降低 batch size
    • 理想: RTX 4090(24GB),可支持 batch 32+ 与 4K 缓存
  • 存储: NVMe SSD(关键)

    • 从机械硬盘读取大 .pt latent 缓存会造成卡顿
  • 内存: 32GB 系统内存

软件

  • OS: Linux(Ubuntu 20.04/22.04)或 Windows 10/11(PowerShell/CMD)

  • Python: 3.10.x(严格要求)

  • CUDA: 11.8 或 12.1

  • FFmpeg: 关键依赖,必须安装并加入 PATH

    • 通过运行 ffmpeg -version 验证

4. 安装与环境配置

Step 1:克隆与创建环境

# Clone the repository
git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk/MuseTalk

# create and activate environment
python3 -m venv venv
source venv/bin/activate

# Install PyTorch (Ensure CUDA compatibility)
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

Step 2:安装核心依赖

使用提供的 requirements.txt

pip install -r requirements.txt
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"

注:增强模块还需要 gfpgan

pip install gfpgan basicsr

Step 3:模型权重获取(Model Weight Acquisition)

你必须把 models/ 目录组织成如下结构。Linux 可用 download_weights.sh,Windows 需要手动放置文件。(由于依赖经常变化,建议“问 AI/LLM 获取最新正确方法”)

目录结构:

./models/
├── musetalkV15/       # The V1.5 UNet Checkpoint
│   ├── musetalk.json
│   └── unet.pth
├── sd-vae/            # Stable Diffusion VAE (ft-mse)
│   ├── config.json
│   └── diffusion_pytorch_model.bin
├── whisper/           # Whisper Tiny (Audio Encoder)
│   ├── config.json
│   └── pytorch_model.bin
├── dwpose/            # Body pose estimation models
├── face-parse-bisent/ # Face Parsing models
├── GFPGANv1.4.pth     # For enhance.py
└── resnet18-5c106cde.pth

MuseTalk/MuseTalk/scripts: (比较重要的文件夹)

1. inference.py

这是用于 离线推理 (Offline Inference) 的主脚本,用于根据源视频和音频文件生成口型同步的视频。

  • 功能: 它处理输入视频,提取帧,检测人脸并对其进行编码。然后,它使用 MuseTalk 模型(UNet 和 VAE)生成与输入音频同步的唇部动作。
  • 主要特点:
  • 配置: 从 YAML 配置文件(例如 test_img.yaml)中读取任务(音频和视频对)。
  • 面部解析 (Face Parsing): 使用面部解析模型将生成的嘴部区域无缝融合回原始面部。
  • 优化: 支持 float16 精度以加速推理。
  • 输出: 使用 ffmpeg 将生成的视觉帧与输入音频结合,生成最终的 MP4 文件。

2. realtime_inference.py

此脚本模拟 实时推理 (Real-time Inference) 场景,专为低延迟应用(如数字头像或聊天机器人)设计。

  • 功能: 与从头开始处理所有内容的 inference.py 不同,此脚本依赖于 预先计算的“头像”数据。它缓存背景视频的 VAE 潜在表示(Latents),因此在推理过程中,模型只需要运行生成步骤(UNet)和解码,从而跳过了耗时的编码步骤。
  • 主要特点:
  • Avatar 类: 管理特定视频头像的状态,加载缓存的潜在表示 (latents.pt) 和坐标 (coords.pkl)。
  • 准备模式: 如果缓存文件不存在,可以自动生成它们。
  • 并发: 使用 Python 的 threading 和队列并行处理帧的大小调整和混合,与模型的生成循环同时进行,以最大化吞吐量。

3. force_prep.py

这是一个专门用于 手动生成实时推理所需缓存 的实用脚本。

  • 功能: 它独立执行 realtime_inference.py 中的“准备阶段”。它提取帧,检测关键点,并为特定视频预先计算 VAE 潜在表示(这是繁重的计算部分)。
  • 目的: 用于预先设置头像(例如 bank_avatar_1),以便实时服务器或推理脚本可以立即启动,而无需设置延迟。它将缓存文件(latents.pt, masks.pt, mask_coords.pt)保存到结果目录中。

4. preprocess.py

此脚本用于 训练数据准备,而不是用于生成视频。它处理原始视频数据集以训练 MuseTalk 模型。

  • 功能: 将原始视频标准化为适合训练的格式。
  • 主要步骤:
  1. 转换: 将输入视频转换为固定的帧率(25 FPS)。
  2. 分割: 将长视频分割成较短的片段(例如 30 秒)。
  3. 元数据提取: 使用 mmposeFaceAlignment 检测每一帧的人脸边界框和关键点,并将数据保存到 JSON 文件中。
  4. 音频提取: 将音轨分离为 .wav 文件。
  5. 列表生成: 根据配置自动将数据拆分为训练列表和验证列表。

5. 阶段 1:Force-Prep 协议(缓存)

Force-Prep 是 Warm Server 的“秘密武器”:它把运行时的重计算转移到“部署准备阶段”。

工作原理(Theory of Operation)

scripts/force_prep.py 的原子操作流程:

  1. 抽帧: 将源 MP4 转为 PNG 帧序列(内存或临时磁盘)

  2. 人脸对齐:face_alignment 检测 68 个关键点并计算人脸 bbox

  3. 坐标缓存: 将 bbox 坐标 (x1, y1, x2, y2) 保存到 coords.pkl

    • 这样坐标固定,推理时不会“摄像机抖动”
  4. VAE 编码: 裁剪人脸并归一化到 -1...1,送入 VAE Encoder

    • 得到 (4, 32, 32) 的 latent 张量
  5. Latent 缓存: 将所有帧 latent 拼接并保存为 latents.pt

执行指南(Execution Guide)

  1. 准备源视频: 把高质量 avatar 视频(如 avatar_1.mp4)放到 data/video/
  2. 编辑配置: 打开 scripts/force_prep.py
AVATAR_ID = "my_avatar_v1"
VIDEO_PATH = "data/video/avatar_1.mp4"
BBOX_SHIFT = -5  # Adjusts the chin/mouth crop region
  1. 运行脚本:
python -m scripts.force_prep

结果: 会生成 results/avatars/my_avatar_v1/,其中包含 latents.ptcoords.pkl 等文件。


常见错误:缺少 dwpose 权重

Force-prep 之后你可能会遇到:

no file path: ./models/dwpose/dw-ll_ucoco_384.pth

原因:你没下载开源模型的权重。修复方式:把下面命令粘贴到终端运行:

python -c '
import os
from huggingface_hub import hf_hub_download

# Define the weights to download
downloads = [
    ("TMElyralab/MuseTalk", "musetalk/musetalk.json", "models/musetalk"),
    ("TMElyralab/MuseTalk", "musetalk/pytorch_model.bin", "models/musetalk"),
    ("TMElyralab/MuseTalk", "musetalkV15/musetalk.json", "models/musetalkV15"),
    ("TMElyralab/MuseTalk", "musetalkV15/unet.pth", "models/musetalkV15"),
    ("stabilityai/sd-vae-ft-mse", "config.json", "models/sd-vae"),
    ("stabilityai/sd-vae-ft-mse", "diffusion_pytorch_model.bin", "models/sd-vae"),
    ("openai/whisper-tiny", "config.json", "models/whisper"),
    ("openai/whisper-tiny", "pytorch_model.bin", "models/whisper"),
    ("openai/whisper-tiny", "preprocessor_config.json", "models/whisper"),
    ("yzd-v/DWPose", "dw-ll_ucoco_384.pth", "models/dwpose"),
    ("ByteDance/LatentSync", "latentsync_syncnet.pt", "models/syncnet")
]

print("Starting download...")
for repo, filename, local_dir in downloads:
    print(f"Downloading {filename}...")
    os.makedirs(local_dir, exist_ok=True)
    hf_hub_download(repo_id=repo, filename=filename, local_dir=local_dir)


# Download ResNet manually (not on HF Hub)
import urllib.request
print("Downloading resnet18...")
os.makedirs("models/face-parse-bisent", exist_ok=True)
urllib.request.urlretrieve(
    "https://download.pytorch.org/models/resnet18-5c106cde.pth", 
    "models/face-parse-bisent/resnet18-5c106cde.pth"
)
print("All downloads complete.")
'

如果你又遇到嵌套路径错误(例如截图所示)

如果仍然报出类似错误(你的截图场景),请用下面脚本修复:

python -c '
import os
import shutil
from huggingface_hub import hf_hub_download

def fix_or_download(repo, filename, expected_path, correct_local_dir_arg):
    # 1. Check if file exists in the "wrong" nested location
    # Previous script likely made: models/musetalkV15/musetalkV15/unet.pth
    nested_path = os.path.join(os.path.dirname(expected_path), filename)
    
    if os.path.exists(nested_path):
        print(f"Found nested file at {nested_path}. Moving to {expected_path}...")
        os.rename(nested_path, expected_path)
        # Try to remove empty nested dir
        try:
            os.rmdir(os.path.dirname(nested_path))
        except:
            pass
    elif os.path.exists(expected_path):
        print(f"✅ File already exists at {expected_path}")
    else:
        print(f"File missing. Downloading {filename} to {expected_path}...")
        # To get models/musetalkV15/unet.pth, we set local_dir="models" 
        # because filename already contains "musetalkV15/"
        hf_hub_download(repo_id=repo, filename=filename, local_dir=correct_local_dir_arg)

# Fix V1.5 UNet
fix_or_download(
    "TMElyralab/MuseTalk", 
    "musetalkV15/unet.pth", 
    "models/musetalkV15/unet.pth",
    "models" 
)

# Fix V1.5 Config
fix_or_download(
    "TMElyralab/MuseTalk", 
    "musetalkV15/musetalk.json", 
    "models/musetalkV15/musetalk.json",
    "models"
)

print("Fix complete. Running prep...")
'

6. 阶段 2:高保真增强(High-Fidelity Enhancement)

如果你的源视频是 1080p 或 4K,标准 MuseTalk 输出(256x256 人脸裁剪)可能会偏软。可以用 enhance.py 改善清晰度。

GFPGAN 集成

enhance.py 封装了 GFPGAN(Generative Facial Prior GAN)作为修复滤镜,它会“脑补”高频细节(毛孔、睫毛、更清晰的牙齿等),让脸更锐利。

增强工作流(Enhancement Workflow)

两种策略:

  1. 预增强(推荐): 在跑 force_prep.py 之前先对源视频跑 enhance.py

    • 这样背景与静态区域先变 HD,后续口型合成整体更稳
  2. 后增强: 等对话结束后,对最终 full_session.mp4 再跑增强

用法: 修改 enhance.py 配置:

INPUT_VIDEO = "results/full_session.mp4"
OUTPUT_VIDEO = "results/full_session_hd.mp4"

运行:

python enhance.py

脚本会抽帧到 temp_frames,对每帧执行 restorer.enhance(),再用 FFmpeg 拼回视频以保持音画同步。

(Before / After 对比截图见上文英文版。)


7. 阶段 3:Warm Server(运行时)—— 主文件(非常重要)

这是项目核心。server_fast.py 会初始化 warm_api.py 引擎。

warm_api.py 内部机制(Under the Hood)

warm_api.py 中的 RealTimeInference 类专为稳定性而设计。

1)One-Euro 滤波

原始人脸检测有噪声。上一帧 bbox 可能 x=100,下一帧 x=101,导致画面轻微抖动。 One-Euro Filter(在 OneEuro 类里)是一种动态低通滤波:

  • 脸不动: 强滤波(去抖动)
  • 头移动: 弱滤波(减延迟,跟踪更灵敏)

2)尺度锁定(Scale Locking)

warm_api.py 里有 scale_lock_enabled

  • 热身阶段: 前 50 帧统计宽高
  • 锁定: 取中位数并锁住缩放比例,避免“头一会儿大一会儿小”的脉动效果(嘴张合时尤其明显)

3)仿射跟踪 & 亚像素位移(Affine + Sub-pixel)

为了处理头部旋转(roll),API 使用 cv2.calcOpticalFlowPyrLK(Lucas-Kanade 光流)跟踪稳定特征点(鼻梁、眼睛)。 它计算一个 仿射矩阵 M_affine 表示当前帧相对参考帧的旋转差异,并在贴 mouth patch 之前先把生成结果做仿射变换,使嘴部能与头部旋转同步。

你还可以继续加入更多特征与算法来提升渲染稳定性。


API 端点(server_fast.py

POST /speak

  • Payload: {"text": "Hello world"}

  • 流程:

    1. 触发 VoiceEngine(TTS)生成 output_raw.wav
    2. 用 FFmpeg 转 16kHz 音频
    3. warm_model.run() 生成视频片段
    4. append_to_session() 拼接进 full_session.mp4
  • 返回: JSON(包含更新后的会话视频路径)

GET /reset

  • 作用: 删除 full_session.mp4 以开始新会话

8. 阶段 4:全栈 Web 应用(仍需完善,尤其前端)

位于 webapp/,提供可用的聊天 UI。

后端 Job 系统(webapp/backend/main.py

因为视频生成仍需要时间(即使很快),不能阻塞 HTTP 请求:

  1. 用户请求 /chat

  2. 后端生成 UUID job_id 并启动一个 Thread

  3. 线程执行:

    • 调用 LLM(模拟或真实)
    • 调用 TTS(_generate_tts_wav
    • 调用 render_avatar_video
  4. 主线程立即返回 job_id

前端轮询架构(webapp/frontend/app.js

前端不等待长请求返回:

  1. 每 500ms 轮询 /status/{job_id}
  2. 当状态为 done,更新 <video>src
  3. 处理 video.play() Promise,避免浏览器 “User Activation” 报错

9. 配置参数参考

Warm Server 的行为由 configs/inference/realtime_live.yaml 控制:

参数 | 类型 | 描述 | 推荐
avatar_id | String | results/avatars/ 下的文件夹名,必须匹配 force_prep | bank_avatar_1
video_path | Path | 源视频路径 | data/video/bank.mp4
bbox_shift | Int | mouth mask 的垂直偏移 | -5(更多下巴)
batch_size | Int | GPU 并行处理帧数 | RTX 3060 用 4,RTX 4090 用 16
preparation | Bool | 是否运行时做人脸检测 | False(使用缓存)
fps | Int | 输出目标 FPS | 25

Warm API 内部开关

scripts/warm_api.pyRealTimeInference.__init__

  • self.blend_mode = 'feather''poisson'

    • Poisson 质量更高但更慢(CPU 重)
  • self.subpixel_enabled = True

    • 让脸更稳定
  • self.scale_lock_enabled = True

    • 防止 pulsing(头部轻微缩放)

10. 排错与优化(Troubleshooting & Optimization)

常见错误

1)enhance.py 报 “Error: Input video not found or empty”

  • 原因: enhance.py 里的 INPUT_VIDEO 路径不正确或视频 0 帧
  • 修复: 检查路径与文件是否存在;注意 enhance.py 默认 INPUT_VIDEO = "results/...",确认指向正确文件

2)“FFmpeg command not found”

  • 原因: Python 的 subprocessos.system 找不到 ffmpeg
  • 修复: 把 FFmpeg 加入系统 PATH,重启终端/IDE

3)server_fast.py 报 “Model returned None”

  • 原因: 推理流水线没生成帧,常见是音频为空或 avatar 缓存缺失
  • 修复: 重新跑 force_prep.py,并检查 logs/ 中是否有 CUDA 具体错误

4)人脸抖动 / 震动(Face Jitter / Vibration)

  • 优化:warm_api.py 中把 One-Euro 的 beta 调小(如 0.05

    • 平滑更强,但会有轻微拖尾(增加一点点延迟感)

5)显存不足(OOM)

  • 优化: 降低 force_prep.pyrealtime_live.yaml 中的 BATCH_SIZE
