Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu; Mengqi Huang; Shaojin Wu; Yunsheng Jiang; Yufei Huo; Hao Li; Yinghang Song; Fei Ding; Jianzhu Guo; Qian He; Zheren Fu; Zhendong Mao; Yongdong Zhang

Fengyi Fu^*, Mengqi Huang^*†, Shaojin Wu^*, Yunsheng Jiang^*, Yufei Huo, Jianzhu Guo^†§

Hao Li, Yinghang Song, Fei Ding, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

^* Equal contribution ^† Corresponding authors ^§ Project lead

Intelligent Creation Lab, ByteDance

Lance is a 3B-parameter native unified multimodal model for image and video understanding, generation, and editing, trained from scratch with a staged multi-task recipe on a budget of at most 128 GPUs.

Note: Lance is a research project rather than a polished product model.

The released checkpoint was trained on up to 128 A100 GPUs, at up to 768x768 resolution for image generation and up to 480p at 12 FPS for video generation. Our goal is to share a research artifact for studying unified image/video understanding, generation, and editing under a relatively small model and limited compute budget. Output quality may vary across prompts, resolutions, durations, motion complexity, and editing scenarios, and we see further opportunities to improve the post-training recipe. We appreciate constructive feedback from the community as we continue improving the project.

Text-to-Video

Nine text-conditioned cases focused on character motion, fantasy animals, two-person interaction, and cinematic dreamlike scenes.

Surfing animal character

Robot and dog duet

Wizard cat with magic

Tropical sunset shoreline

Motorcycle through canyon

Pottery character motion

Piano performance

Robot boxing match

Two-person embrace

Displayed with 2x super-resolution and 2x frame interpolation.
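As a rough worked example of what this display note implies, the sketch below scales an assumed native clip by the two factors. It assumes the clip is produced at the 480p, 12 FPS setting mentioned in the note at the top of the page. The 854-pixel width (a 16:9 frame) and the names `VideoSpec` and `displayed_spec` are illustrative assumptions, not details from the release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoSpec:
    width: int
    height: int
    fps: int


def displayed_spec(native: VideoSpec, sr_factor: int = 1, interp_factor: int = 1) -> VideoSpec:
    """Scale spatial size by sr_factor and frame rate by interp_factor."""
    return VideoSpec(
        width=native.width * sr_factor,
        height=native.height * sr_factor,
        fps=native.fps * interp_factor,
    )


# Assumed native output: 480p at 12 FPS; the 16:9 width of 854 is an assumption.
native = VideoSpec(width=854, height=480, fps=12)
shown = displayed_spec(native, sr_factor=2, interp_factor=2)
print(shown)  # VideoSpec(width=1708, height=960, fps=24)
```

Under these assumptions the displayed clips would play at 24 FPS with twice the native height and width, so visual sharpness and smoothness on the page reflect the post-processing as well as the model.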

Image-to-Video

Starting from a single image, Lance preserves subject identity and scene composition while synthesizing natural motion, camera movement, and temporal detail.

Snow leopard standing on a glacier before leaping

Input image

Snow leopard leap

A snow leopard stands calmly near the edge of a glacier, observing the icy gap ahead. The snow leopard remains the dominant main subject throughout the entire sequence, shown in a medium-close to medium framing and occupying a strong, consistent portion of the frame. Its anatomy, body proportions, fur pattern, facial features, tail, and limbs remain visually consistent and anatomically correct at all times, with no distortion, shrinking, or unnatural deformation. The environment is a vast frozen glacier canyon with layered ice cliffs, deep blue crevasses, snow-covered ledges, drifting snow particles, and distant snowy ridges under a pale winter sky. Soft golden side light illuminates the snow leopard's fur and creates subtle highlights across the ice surfaces. After a brief pause, the snow leopard performs a smooth, controlled slow-motion leap to the glacier on the opposite side. The motion is gentle and physically natural, with a soft push-off, subtle tail balance, and clear readable body movement. The camera smoothly follows the snow leopard and does not pull back to a wide distant view. The landing area on the opposite glacier remains relatively close in screen space, so the snow leopard stays large and prominent. After landing, the snow leopard remains a large foreground or mid-foreground subject, occupying about one-half to two-thirds of the frame, while the camera stays close and stable or slightly tracks forward. It lands steadily and returns to a calm stable stance. The glacier walls, snow surfaces, and distant ice formations remain visually stable without cracking or collapsing. The environment maintains soft natural motion with realistic snow atmosphere and physically consistent animal movement.

Generated video

Emperor penguin standing beside calm icy water

Input image

Penguin dive

Generated video

Input image

Firefly forest turn

Generated video

Full-body portrait of a woman in a simple modern room

Input image

Hair-smoothing portrait

Generated video

Ginger cat sleeping on a sunlit wooden windowsill

Input image

Sleeping cat ambience

Generated video

Input image

Rainbow waterfall motion

Generated video

Video is displayed with 2x frame interpolation.

Video Editing

Nine prompt-driven single-step and compositional editing cases spanning background transformation, object addition and removal, subject replacement, appearance restyling, stylization, and action edits.

Replace the background with a campfire.

Add a row of colorful balloons.

Change the boy to a girl with black shirt.

Change the dog to a cat.

Change the style to watercolor painting, soft colors, natural and dreamy.

Make the car a shiny red color and add a snowy street background.

Have the woman raise her right hand to gently brush her hair, slightly turn her body to the right, soften her expression, and shift her gaze to the right.

Add a scarf around her neck and replace the background with a snowy park.

Remove face stickers.

Multi-turn Consistency Editing

Source video followed by four linked edits on the same subject: replacement, accessory addition, background rewrite, and motion update.

Source video

Replace short straight hair with French curly hair.

Add a floral headband with red and white flowers to her hair.

Change the background to a fairytale castle by a lake.

Make her raise one hand to wave slowly.

Intelligent Video Generation

Structured planning and physics-oriented examples that probe control over multi-step spatial behavior.

4x4 grid navigation, route 1

5x5 grid navigation, route 1

6x6 grid navigation, route 1

4x4 grid navigation, route 2

5x5 grid navigation, route 2

6x6 grid navigation, route 2
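The page does not describe how these navigation tasks are specified or scored. As a hypothetical illustration of the kind of multi-step spatial constraint such cases probe, the sketch below checks whether a route on an N x N grid is valid: every cell lies on the grid, no cell is blocked, and each step moves to a horizontally or vertically adjacent cell. The cell representation, the `blocked` parameter, and the function name `is_valid_route` are all assumptions.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]  # (row, col), zero-indexed; an assumed representation


def is_valid_route(n: int, route: List[Cell], blocked: Iterable[Cell] = ()) -> bool:
    """Return True if the route stays on the n x n grid, avoids blocked cells,
    and moves exactly one cell horizontally or vertically per step."""
    blocked_set: Set[Cell] = set(blocked)
    if not route:
        return False
    for r, c in route:
        if not (0 <= r < n and 0 <= c < n) or (r, c) in blocked_set:
            return False
    for (r0, c0), (r1, c1) in zip(route, route[1:]):
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return False
    return True


route = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3)]
print(is_valid_route(4, route))                    # True
print(is_valid_route(4, route, blocked=[(1, 1)]))  # False: passes through a blocked cell
print(is_valid_route(4, [(0, 0), (1, 1)]))         # False: diagonal step
```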

Video Understanding

Selected video question answering and captioning cases that evaluate temporal reasoning, motion recognition, and concise-to-detailed description.

VQA question

How many times did the person launch objects on the table?
Options:
(A) 3
(B) 2
(C) 4

Response

(A) 3

VQA question

The person makes sets of repeated actions. How many distinct repeated actions did the person do?
Options:
(A) 2
(B) 3
(C) 4

Response

(A) 2

VQA question

In which direction does the purple sphere move in the video?
Options:
(A) Down and to the right.
(B) Up and to the left.
(C) Up and to the right.
(D) The object is stationary.

Response

(A) Down and to the right.

VQA question

What is the unrealistic phenomenon displayed in the video?
Options:
(A) The man can manipulate time via phone.
(B) Man grabs an object through a phone screen.
(C) Chocolate transforms into different objects.
(D) Visible means of propulsion enables flight.

Response

(B) Man grabs an object through a phone screen.
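Multiple-choice responses like the ones above are typically scored by extracting the option letter and comparing it with a reference answer. The sketch below shows one minimal way to do that. The helper names `extract_choice` and `accuracy` are hypothetical, the page does not state which scoring procedure was used, and the strings passed to `accuracy` are made-up toy inputs rather than results.

```python
import re
from typing import List, Optional

_OPTION = re.compile(r"\(([A-Z])\)")


def extract_choice(response: str) -> Optional[str]:
    """Return the first option letter written as '(X)' in the response, if any."""
    match = _OPTION.search(response)
    return match.group(1) if match else None


def accuracy(responses: List[str], answers: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the reference letter."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


# The four responses shown on this page.
page_responses = [
    "(A) 3",
    "(A) 2",
    "(A) Down and to the right.",
    "(B) Man grabs an object through a phone screen.",
]
print([extract_choice(r) for r in page_responses])  # ['A', 'A', 'A', 'B']

# Toy inputs (not real results): one of three responses matches its reference.
toy_score = accuracy(["(A) yes", "(C) no", "no letter"], ["A", "B", "C"])
```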

Short caption

Offer a succinct account of the culinary process shown in this video.

Response

Add tomato puree and mix it well with chicken pieces.

Long caption

Provide a detailed description of the given video, capturing its key moments.

Response

In a sunlit meadow, a small tortoiseshell butterfly rests on a purple flower. A bee, with black and yellow stripes, lands on the same flower. The butterfly flaps its wings gently, while the bee busies itself, collecting nectar. The flower sways slightly in the breeze. The butterfly then takes off, and the bee follows, both heading to the next flower. The scene is a vivid display of insect interaction in a natural setting, with the colors of the butterfly and the bee contrasting against the green background of the meadow. The video captures this peaceful moment in a short 6-second duration.

Text-to-Image

Representative text-to-image outputs spanning photorealistic, stylized, compositional, and typography-heavy prompts.

Image Editing

Instruction-guided image editing cases showing local replacement, style transfer, object-aware modifications, and layout-preserving transformations.

Image Understanding

Six selected visual question answering cases spanning charts, trade data, OCR, documents, landmarks, and natural phenomena.

Pie chart used for segment comparison reasoning

Is the largest segment greater than sum of all the other segments?

Response

Yes, the largest segment in the pie chart is the blue one, which is the most populous segment.

Survey chart about border security priorities

What percentage of respondents want better border security?

Response

29%

Car image used for license plate recognition

What is the license plate number of the car?

Response

The license plate number of the car is BX62 BFY.

Market research chart for promotional meeting and event spending

According to the proprietary market research data, how much was spent on promotional meetings and events during 1998?

Response

According to the data from the proprietary market research, the total amount spent on the promotional meetings and events during 1998 was approximately $1.3 billion.

Colosseum image for appearance description

What is the appearance of the Colosseum in Rome, Italy?

Response

The Colosseum in Rome, Italy is a magnificent ancient amphitheater made of stone and concrete. It is characterized by its large, arched openings and its semi-circular shape. The Colosseum has been partially restored, and its original color has faded, but the architectural details, like the columns and the semi-circular arches, are still visible. The lighting in the image highlights the grandeur and historical significance of the Colosseum, making it stand out against the backdrop of the evening sky.

Total solar eclipse image for visual knowledge answering

How does a total solar eclipse look like from Earth?

Response

A total solar eclipse looks like a dark circle with a bright, white outer edge. This is a result of the Earth's rotation and the Earth's shadow, which partially obscures the sun's light. The bright white edge of the eclipse is caused by the sun's high energy and its interaction with the Earth's atmosphere, while the dark part of the eclipse is due to the Earth's shadow and the surrounding air currents. The solar eclipse's shape, with its bright white edge and dark center, is similar to the shape of a full moon or a dark disk. It is a natural phenomenon that occurs in the atmosphere of the Earth and is an important part of the solar system.

Framework

Lance keeps a shared interleaved sequence for text, image, and video context, then separates semantic understanding and visual generation through dedicated experts.
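One way to read this description is that all tokens live in a single ordered sequence, while the per-token computation is handled by one of two parameter sets. The toy sketch below illustrates only that routing idea. The `Token` fields, the routing rule (visual tokens being generated go to the generation expert, everything else to the understanding expert), and the stand-in expert functions are assumptions for illustration, not the released architecture, and any attention shared across the sequence is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Token:
    modality: str  # "text", "image", or "video"
    role: str      # "context" (given as input) or "target" (to be generated)
    value: float   # stand-in for a hidden state


def understanding_expert(x: float) -> float:
    return x + 1.0  # placeholder transform, not a real layer


def generation_expert(x: float) -> float:
    return x * 2.0  # placeholder transform, not a real layer


EXPERTS: Dict[str, Callable[[float], float]] = {
    "understanding": understanding_expert,
    "generation": generation_expert,
}


def route(token: Token) -> str:
    """Assumed rule: visual tokens being generated use the generation expert;
    all other tokens use the understanding expert."""
    if token.role == "target" and token.modality in ("image", "video"):
        return "generation"
    return "understanding"


def forward(sequence: List[Token]) -> List[float]:
    """Process one shared interleaved sequence, picking an expert per token."""
    return [EXPERTS[route(t)](t.value) for t in sequence]


sequence = [
    Token("text", "context", 1.0),
    Token("image", "context", 2.0),
    Token("video", "target", 3.0),
]
print([route(t) for t in sequence])  # ['understanding', 'understanding', 'generation']
print(forward(sequence))             # [2.0, 3.0, 6.0]
```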

Comparison on multimodal benchmarks

Radar charts compare Lance with representative unified and task-specialized baselines. Detailed tables: GenEVAL, DPG-Bench, GEdit-Bench, VBench, and MVBench.

Radar chart comparing Lance with image generation, editing, and video generation baselines

Image generation on GenEVAL

GenEVAL measures object count, color, position, and attribute binding. Lance ties for the best overall score among the listed unified models while remaining a compact 3B model.
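For readers unfamiliar with how an overall score can be aggregated from category scores, the sketch below assumes the overall number is the unweighted mean of the per-category scores. That aggregation rule is an assumption here, and the category values are placeholders chosen for easy arithmetic, not results for Lance or any other model.

```python
from typing import Dict


def overall_score(per_category: Dict[str, float]) -> float:
    """Unweighted mean of per-category scores (assumed aggregation rule)."""
    if not per_category:
        raise ValueError("at least one category score is required")
    return sum(per_category.values()) / len(per_category)


# Placeholder values only; these are NOT measured results.
placeholder_scores = {
    "object_count": 0.5,
    "color": 0.75,
    "position": 0.25,
    "attribute_binding": 0.5,
}
print(overall_score(placeholder_scores))  # 0.5
```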