A medium-close shot shows a red panda wearing a gold-trimmed cap and travel satchel on a bright seaside wave with a painted surfboard, foam spray, and a glowing summer sky. Subject fills frame; premium detail, clear focus, lively eyes, readable motion. tracking shot. It rides the wave, lifts one paw in balance, and laughs as spray catches the light.
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance is a 3B native unified multimodal model for image and video understanding, generation, and editing, trained from scratch on at most 128 GPUs with a staged multi-task recipe.
Note: Lance is a research project rather than a polished product model.
The released checkpoint was trained on up to 128 A100 GPUs, at resolutions up to 768x768 for image generation and 480p at 12 FPS for video generation. Our goal is to share a research artifact for studying unified image/video understanding, generation, and editing with a relatively small model and a limited compute budget. Output quality may vary across prompts, resolutions, durations, motion complexity, and editing scenarios, and we see further opportunities to improve the post-training recipe. We appreciate constructive feedback from the community as we continue improving the project.
Text-to-Video
Nine text-conditioned cases focused on character motion, fantasy animals, two-person interaction, and cinematic dreamlike scenes.
A premium animated-film shot shows a brass robot playing violin in a lantern-lit city square with one puppy seated nearby under warm evening light. The main subject occupies at least two-thirds of the frame and remains the clear visual focus. The scene is whimsical, beautiful, and richly detailed, with strong character focus and elegant atmosphere. fixed shot. The robot draws the bow in smooth arcs while the puppy listens quietly.
A medium-close shot shows a Persian cat wearing ornate spectacles and a velvet academic robe inside a candlelit salon with carved shelves, chandeliers, and mosaic floors. The cat fills the frame with crisp fur detail and lively eyes. fixed shot. It lifts a slender magic wand and traces a soft glowing arc through the air.
A cinematic landscape shot shows a tropical coastline at sunset with pink sky, moving waves, black rocks, and palms swaying in warm wind. The scene is majestic, highly aesthetic, and rich in layered natural detail, with refined atmosphere and premium scenic clarity. wide shot. The sun sinks toward the horizon while wave foam advances and retreats along the shore.
A close-to-medium cinematic shot shows a handsome motorcyclist riding a classic black motorcycle along a coastal road with cliffs, sea spray, and dramatic sky. The background stays bright, layered, and aesthetically refined, with luminous depth and elegant environmental variation while remaining secondary to the main subject. The eyes are lively and expressive, with subtle blinking, natural gaze shifts, and gentle movement in the brows and mouth that keep the face vivid on camera. The subject is beautiful, highly detailed, and photographed with a premium cinematic aesthetic. The subject occupies at least two-thirds of the frame, with beautiful styling, refined facial detail, convincing skin texture, and anatomically correct hands. The rider's body posture matches the bike's motion and the hands grip the handlebars naturally. the camera follows from the side as the motorcycle leans through a curve.
A detailed cinematic portrait begins from a medium view and gradually moves into a close facial framing of a beautiful young woman shaping clay on a pottery wheel in a bright ceramic workshop with sunlit shelves, bowls, and hanging tools. The person is the dominant subject in the frame, styled with a tied-back apron, delicate earrings, rolled sleeves, and a simple pendant, and shown with premium skin detail, expressive eyes, subtle brow and cheek motion, anatomically convincing hands, and rich costume texture. Her hands guide the spinning clay in one smooth controlled motion as her expression moves from serene focus into a soft smile. Her gaze starts on the camera, follows the clay, briefly rises toward the window light, and returns to the lens while her head inclines naturally with the wheel.
A detailed cinematic portrait begins from a medium view and gradually moves into a close facial framing of a beautiful young woman playing a grand piano in a luminous marble music hall with tall windows, gold sconces, flowing curtains, polished floors, and refined floral arrangements. Styled with pearl earrings, a delicate crystal hairpin, and a layered silver necklace above an elegant satin gown. Subject dominates; sharp face, open eyes, subtle micro-expressions, correct visible hands. Both hands stay clearly visible on the piano keys, and every finger movement is elegant, natural, and easy to read as she plays a calm melodic phrase; her head gives a subtle natural sway in time with the music while the smile slowly grows warmer.
An elegant medium-close shot centers a shiba inu and a chrome boxing robot inside a palace-inspired championship ring with carved ivory columns, bright gold trim, glossy stone steps, and sweeping crystal chandeliers. The shiba inu wears an embroidered brocade boxing robe, a jeweled waist sash, and refined round goggles, and both fighters wear premium boxing gloves; robot has exposed polished mechanical body. Bright luxury arena; fighters dominate frame; slow readable boxing. steady camera. Controlled footwork and visible punches, with brief pauses after each exchange.
A cinematic shot shows two young adults meeting again on a quiet train platform in warm sunset light with drifting steam and long shadows. Subject fills frame; premium face/detail, correct hands and posture. medium shot. They pause in disbelief, step closer, and embrace tightly; the camera then pushes into a close-up of their tearful relieved faces.
Displayed with 2x super-resolution and 2x frame interpolation.
Image-to-Video
Starting from a single image, Lance preserves subject identity and scene composition while synthesizing natural motion, camera movement, and temporal detail.
A snow leopard stands calmly near the edge of a glacier, observing the icy gap ahead. The snow leopard remains the dominant main subject throughout the entire sequence, shown in a medium-close to medium framing and occupying a strong, consistent portion of the frame. Its anatomy, body proportions, fur pattern, facial features, tail, and limbs remain visually consistent and anatomically correct at all times, with no distortion, shrinking, or unnatural deformation. The environment is a vast frozen glacier canyon with layered ice cliffs, deep blue crevasses, snow-covered ledges, drifting snow particles, and distant snowy ridges under a pale winter sky. Soft golden side light illuminates the snow leopard's fur and creates subtle highlights across the ice surfaces. After a brief pause, the snow leopard performs a smooth, controlled slow-motion leap to the glacier on the opposite side. The motion is gentle and physically natural, with a soft push-off, subtle tail balance, and clear readable body movement. The camera smoothly follows the snow leopard and does not pull back to a wide distant view. The landing area on the opposite glacier remains relatively close in screen space, so the snow leopard stays large and prominent. After landing, the snow leopard remains a large foreground or mid-foreground subject, occupying about one-half to two-thirds of the frame, while the camera stays close and stable or slightly tracks forward. It lands steadily and returns to a calm stable stance. The glacier walls, snow surfaces, and distant ice formations remain visually stable without cracking or collapsing. The environment maintains soft natural motion with realistic snow atmosphere and physically consistent animal movement.
Generated video
A cinematic polar wildlife shot shows an emperor penguin standing near the edge of a flat ice shelf beside calm icy water, with its reflection clearly visible on the smooth blue-gray surface below. The penguin is the clear main subject, featuring a black head, white chest, soft yellow neck markings, a compact upright body, small dark feet, and smooth feather texture. The surrounding environment is cold, quiet, and minimal, with pale snow-covered ice, a slightly uneven frozen edge, distant flat ice fields, and soft overcast light creating a calm Antarctic atmosphere. The water remains still and reflective, with faint ripples near the ice edge and subtle texture across the surface. The scene feels clean, realistic, and highly detailed, keeping the penguin large and readable against the simple polar landscape. steady camera. The penguin slowly shifts its body toward the water, takes a small careful step at the edge, then gently hops into the water in a clear and natural motion. As it enters, a modest splash rises around its body and spreads outward into soft ripples, briefly breaking the reflection. The surrounding ice remains stable, the water movement stays physically natural, and the overall scene remains calm and believable.
Generated video
A premium fantasy-anime shot shows a young forest elf girl standing in a glowing enchanted woodland, surrounded by tall dark trees, twisting branches, soft moss, lush plants, and countless floating fireflies. She is the clear main subject, positioned slightly to the right of center, with long flowing green hair, pointed elf ears, delicate facial features, large expressive eyes, and a small flower ornament near her hair. She wears an elegant green-and-white fantasy dress with gold trim, translucent sleeves, a jeweled collar detail, and soft layered fabric that catches the turquoise forest light. The background is deep and magical, with misty blue-green atmosphere, layered tree trunks fading into the distance, glowing particles, and warm firefly lights scattered throughout the scene. The lighting is dreamy and luminous, with a bright teal glow behind her creating a soft halo effect and subtle highlights on her hair, face, and dress. The scene feels peaceful, magical, and highly detailed, with refined color contrast, rich forest textures, and a strong main-subject presence. gentle tracking shot. She steps forward, then slowly turns her head to follow a cluster of fireflies crossing in front of her. Her expression stays quiet and curious while the teal light outlines her face and dress.
Generated video
A detailed cinematic full-body portrait shows a young woman standing indoors in a simple modern room, facing the camera with a calm and pleasant expression. She has long brown hair falling naturally over her shoulders, warm brown eyes, natural skin texture, and a relaxed upright posture. She wears a fitted black sleeveless top, light blue high-waisted jeans, and clean white sneakers. One hand is gently raised near the side of her head, while the other arm rests naturally by her side. The background is minimal and softly lit, with a plain light-colored wall, warm wooden floor, white baseboard, soft daylight entering from the left, and a green potted plant near the window adding a natural detail. The subject remains the clear main focus, centered in the frame with stable body proportions, clear facial detail, natural hands, and realistic clothing texture. slow push-in shot. She slowly and gently lifts her hand to lightly smooth her hair beside her head in a soft, graceful motion. She looks toward the camera and gradually shows a warm subtle smile, with slight natural blinking and a relaxed presence. As the action continues, the camera gradually moves closer from the full-body view toward her face, bringing more attention to her expression while the room remains calm and stable.
Generated video
A warm cinematic interior shot shows a ginger cat sleeping peacefully on a sunlit wooden windowsill beside indoor plants and a wooden cabinet. The cat is the clear main subject, curled comfortably with soft orange fur, a striped tail, relaxed paws, closed eyes, and gentle breathing. Sunlight enters from the window, casting warm golden highlights across the cat's fur, the polished wood, and nearby green leaves. The room feels quiet and cozy, with potted plants, soft shadows, natural window light, and warm wooden textures creating a peaceful domestic atmosphere. The background remains simple and stable, keeping attention on the sleeping cat while adding natural detail and depth. steady cozy shot. The curled cat stays relaxed, eyes opened, with a slow rise and fall of its body. The nearby plants move faintly, and warm wood textures remain stable.
Generated video
A cinematic landscape shot shows a powerful waterfall plunging from a high moss-covered cliff into a misty river basin below, surrounded by lush green canyon walls, dark wet rock textures, and soft overcast daylight. A bright rainbow arches clearly across the lower mist in front of the waterfall, with vivid bands of red, orange, yellow, green, blue, and violet standing out against the white spray. The waterfall remains the dominant visual focus, with thick streams of white water pouring straight down in a continuous curtain, while mist rises and spreads across the base. The foreground shows a shallow rocky riverbank with dark pebbles and scattered stones, while the background is filled with layered green slopes, damp cliff faces, and soft atmospheric haze. The scene is majestic, fresh, and highly detailed, with rich natural textures, luminous mist, and a bright, clearly visible rainbow adding a vivid magical accent to the realistic landscape. calm cinematic shot. The waterfall, mist, and river move continuously while the surrounding moss-covered rock walls stay still. The rainbow gently flickers within the spray under soft daylight.
Generated video
Video is displayed with 2x frame interpolation.
Video Editing
Nine prompt-driven single-step and compositional editing cases spanning background transformation, object addition and removal, subject replacement, appearance restyling, stylization, and action edits.
Replace the background with a campfire.
Add a row of colorful balloons.
Change the boy to a girl with black shirt.
Change the dog to a cat.
Change the style to watercolor painting, soft colors, natural and dreamy.
Make the car a shiny red color and add a snowy street background.
Have the woman raise her right hand to gently brush her hair, slightly turn her body to the right, soften her expression, and shift her gaze to the right.
Add a scarf around her neck and replace the background with a snowy park.
Remove face stickers.
Multi-turn Consistency Editing
Source video followed by four linked edits on the same subject: replacement, accessory addition, background rewrite, and motion update.
Replace short straight hair with French curly hair.
Add a floral headband with red and white flowers to her hair.
Change the background to a fairytale castle by a lake.
Make her raise one hand to wave slowly.
Intelligent Video Generation
Structured planning and physics-oriented examples that probe control over multi-step spatial behavior.
Describe the key elements of the input maze image (layout, white path, black walls, blue star, red flag, and overall background), then generate a 2D animation. The blue star should slide smoothly along the white path, stop exactly on the red flag, and then acquire a trophy. Ensure the blue star never crosses or enters the black maze walls. Keep the camera as a static top-down view showing the entire maze.
The same prompt is used for all six maze examples.
Video Understanding
Selected video question answering and captioning cases that evaluate temporal reasoning, motion recognition, and concise-to-detailed description.
How many times did the person launch objects on the table?
Options:
(A) 3
(B) 2
(C) 4
Response: (A) 3
The person makes sets of repeated actions. How many distinct repeated actions did the person do?
Options:
(A) 2
(B) 3
(C) 4
Response: (A) 2
In which direction does the purple sphere move in the video?
Options:
(A) Down and to the right.
(B) Up and to the left.
(C) Up and to the right.
(D) The object is stationary.
Response: (A) Down and to the right.
What is the unrealistic phenomenon displayed in the video?
Options:
(A) The man can manipulate time via phone.
(B) Man grabs an object through a phone screen.
(C) Chocolate transforms into different objects.
(D) Visible means of propulsion enables flight.
Response: (B) Man grabs an object through a phone screen.
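Multiple-choice cases like the ones above are usually scored by comparing the option letter in the response with an answer key. The following is a minimal sketch of that scoring step, not Lance's evaluation code: `extract_choice`, `accuracy`, and the answer key are hypothetical names and values introduced only for illustration.

```python
import re
from typing import List, Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter from a response such as '(A) 3', or None."""
    match = re.match(r"\s*\(([A-Z])\)", response)
    return match.group(1) if match else None


def accuracy(responses: List[str], answer_key: List[str]) -> float:
    """Fraction of responses whose option letter matches the answer key."""
    if not answer_key:
        return 0.0
    correct = sum(
        extract_choice(response) == answer
        for response, answer in zip(responses, answer_key)
    )
    return correct / len(answer_key)


# Responses copied from the cases above; the answer key is hypothetical.
responses = ["(A) 3", "(A) 2", "(A) Down and to the right."]
hypothetical_key = ["A", "A", "C"]
score = accuracy(responses, hypothetical_key)
```

A response with no parsable option letter counts as incorrect, because `extract_choice` returns `None` for it.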
Offer a succinct account of the culinary process shown in this video.
Response: Add tomato puree and mix it well with chicken pieces.
Provide a detailed description of the given video, capturing its key moments.
Response: In a sunlit meadow, a small tortoiseshell butterfly rests on a purple flower. A bee, with black and yellow stripes, lands on the same flower. The butterfly flaps its wings gently, while the bee busies itself, collecting nectar. The flower sways slightly in the breeze. The butterfly then takes off, and the bee follows, both heading to the next flower. The scene is a vivid display of insect interaction in a natural setting, with the colors of the butterfly and the bee contrasting against the green background of the meadow. The video captures this peaceful moment in a short 6-second duration.
Text-to-Image
Representative text-to-image outputs spanning photorealistic, stylized, compositional, and typography-heavy prompts.
Image Editing
Instruction-guided image editing cases showing local replacement, style transfer, object-aware modifications, and layout-preserving transformations.
Image Understanding
Six selected visual question answering cases spanning charts, trade data, OCR, documents, landmarks, and natural phenomena.
Is the largest segment greater than sum of all the other segments?
Response: Yes, the largest segment in the pie chart is the blue one, which is the most populous segment.
What percentage of respondents want better border security?
Response: 29%
What is the license plate number of the car?
Response: The license plate number of the car is BX62 BFY.
According to the proprietary market research data, how much was spent on promotional meetings and events during 1998?
Response: According to the data from the proprietary market research, the total amount spent on the promotional meetings and events during 1998 was approximately $1.3 billion.
What is the appearance of the Colosseum in Rome, Italy?
Response: The Colosseum in Rome, Italy is a magnificent ancient amphitheater made of stone and concrete. It is characterized by its large, arched openings and its semi-circular shape. The Colosseum has been partially restored, and its original color has faded, but the architectural details, like the columns and the semi-circular arches, are still visible. The lighting in the image highlights the grandeur and historical significance of the Colosseum, making it stand out against the backdrop of the evening sky.
How does a total solar eclipse look like from Earth?
Response: A total solar eclipse looks like a dark circle with a bright, white outer edge. This is a result of the Earth's rotation and the Earth's shadow, which partially obscures the sun's light. The bright white edge of the eclipse is caused by the sun's high energy and its interaction with the Earth's atmosphere, while the dark part of the eclipse is due to the Earth's shadow and the surrounding air currents. The solar eclipse's shape, with its bright white edge and dark center, is similar to the shape of a full moon or a dark disk. It is a natural phenomenon that occurs in the atmosphere of the Earth and is an important part of the solar system.
Framework
Lance keeps a shared interleaved sequence for text, image, and video context, then separates semantic understanding and visual generation through dedicated experts.
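The idea of one shared sequence served by two experts can be pictured with a toy routing sketch. This is an illustration under stated assumptions, not Lance's implementation: the `Segment` type, the `role` field, and the rule that generation targets go to the generation expert are all hypothetical, since the page does not specify the routing criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    modality: str  # "text", "image", or "video"
    role: str      # hypothetical tag: "context" (to be understood) or "target" (to be generated)
    payload: str


def route(segment: Segment) -> str:
    """Hypothetical rule: generation targets use the generation expert, everything else the understanding expert."""
    return "generation" if segment.role == "target" else "understanding"


def run_sequence(
    sequence: List[Segment],
    experts: Dict[str, Callable[[Segment], str]],
) -> List[Tuple[str, str]]:
    """Walk one interleaved sequence in order; only the expert applied to each segment differs."""
    outputs = []
    for segment in sequence:
        name = route(segment)
        outputs.append((name, experts[name](segment)))
    return outputs


# Stand-in experts that only describe what they would do.
experts = {
    "understanding": lambda s: f"encode {s.modality}",
    "generation": lambda s: f"synthesize {s.modality}",
}

# A video-editing style sequence: instruction and source clip as context, edited clip as target.
sequence = [
    Segment("text", "context", "edit instruction"),
    Segment("video", "context", "source clip"),
    Segment("video", "target", "edited clip"),
]
result = run_sequence(sequence, experts)
```

The point of the sketch is only that text, image, and video segments stay in a single ordered sequence, while the choice of expert is made per segment.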
Comparison on multimodal benchmarks
Radar charts compare Lance with representative unified and task-specialized baselines. Detailed tables: GenEVAL, DPG-Bench, GEdit-Bench, VBench, and MVBench.
Image generation on GenEVAL
GenEVAL measures object count, color, position, and attribute binding. Lance ties the best overall score among listed unified models while remaining a compact 3B model.
| Method | # Params. | Overall↑ | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. |
|---|---|---|---|---|---|---|---|---|
| Generation-only models | | | | | | | | |
| FLUX.1-dev | 12B | 0.82 | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 |
| GPT Image 1 | - | 0.84 | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 |
| Qwen-Image | 20B | 0.87 | 0.99 | 0.92 | 0.89 | 0.88 | 0.76 | 0.77 |
| Unified models | | | | | | | | |
| MetaQuery-XL† | 7B | 0.80 | - | - | - | - | - | - |
| OmniGen2 | 4B | 0.80 | 1.00 | 0.95 | 0.64 | 0.88 | 0.55 | 0.76 |
| Show-o2 | 7B | 0.76 | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 |
| UniWorld-V1 | 13B | 0.80 | 0.99 | 0.93 | 0.79 | 0.89 | 0.49 | 0.70 |
| BAGEL† | 7B | 0.88 | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 |
| Mogao | 7B | 0.89 | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 |
| TUNA | 7B | 0.90 | 1.00 | 0.97 | 0.81 | 0.91 | 0.88 | 0.83 |
| Lance | 3B | 0.90 | 1.00 | 0.94 | 0.84 | 0.97 | 0.87 | 0.81 |
† indicates methods that use LLM rewriters for prompt rewriting before generation.
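The Overall column in the table above appears consistent with an unweighted mean of the six sub-scores. The short check below is a sketch under that assumption; the 0.01 tolerance is chosen here to absorb two-decimal rounding, and the numbers are copied from the table rows.

```python
from typing import Dict, List

# Sub-scores in table order: Single Obj., Two Obj., Counting, Colors, Position, Color Attri.
SUBSCORES: Dict[str, List[float]] = {
    "BAGEL": [0.98, 0.95, 0.84, 0.95, 0.78, 0.77],
    "TUNA": [1.00, 0.97, 0.81, 0.91, 0.88, 0.83],
    "Lance": [1.00, 0.94, 0.84, 0.97, 0.87, 0.81],
}
REPORTED_OVERALL: Dict[str, float] = {"BAGEL": 0.88, "TUNA": 0.90, "Lance": 0.90}


def overall(scores: List[float]) -> float:
    """Unweighted mean of the sub-scores (assumed aggregation)."""
    return sum(scores) / len(scores)


def matches_reported(name: str, tolerance: float = 0.01) -> bool:
    """True if the recomputed mean agrees with the reported value up to rounding."""
    return abs(overall(SUBSCORES[name]) - REPORTED_OVERALL[name]) < tolerance


all_consistent = all(matches_reported(name) for name in SUBSCORES)
```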
Image generation on DPG-Bench
DPG-Bench stresses complex prompt following across global, entity, attribute, relation, and other compositional dimensions; Lance is especially strong on relation grounding.
| Method | # Params. | Overall↑ | Global | Entity | Attribute | Relation | Other |
|---|---|---|---|---|---|---|---|
| Generation-only models | | | | | | | |
| PixArt-α | 0.6B | 71.11 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 |
| SDXL | 3.5B | 74.65 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 |
| Hunyuan-DiT | 1.5B | 78.87 | 84.59 | 80.59 | 88.01 | 74.36 | 86.41 |
| Playground v2.5 | - | 75.47 | 83.06 | 82.59 | 81.20 | 84.08 | 83.50 |
| DALL-E 3 | - | 83.50 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 |
| SD3-Medium | 2B | 84.08 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 |
| Emu3-Gen | 8B | 80.60 | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 |
| FLUX.1-dev | 12B | 83.84 | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 |
| Qwen-Image | 20B | 88.32 | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 |
| Unified models | | | | | | | |
| Emu3-DPO | 8B | 81.60 | - | - | - | - | - |
| Janus | - | 79.68 | 82.33 | 87.38 | 87.70 | 85.46 | 86.41 |
| Janus-Pro-7B | 7B | 84.19 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 |
| Ovis-U1 | 1.2B | 83.72 | 82.37 | 90.08 | 88.68 | 93.35 | 85.20 |
| OmniGen2 | 4B | 83.57 | 88.81 | 88.83 | 90.18 | 89.37 | 90.27 |
| Show-o2 | 7B | 86.14 | 89.00 | 91.78 | 89.96 | 91.81 | 91.64 |
| UniWorld-V1 | 13B | 81.38 | 83.64 | 88.39 | 88.44 | 89.27 | 87.22 |
| BAGEL† | 7B | 85.07 | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 |
| Mogao | 7B | 84.33 | 82.37 | 90.03 | 88.26 | 93.18 | 85.40 |
| InternVL-U | 1.7B | 85.18 | 90.39 | 90.78 | 90.68 | 90.29 | 88.77 |
| TUNA | 7B | 86.76 | 90.42 | 91.68 | 90.94 | 91.87 | 90.73 |
| Lance | 3B | 84.67 | 83.89 | 91.07 | 89.36 | 93.38 | 80.80 |
† indicates methods that use LLM rewriters for prompt rewriting before generation.
Image editing on GEdit-Bench
GEdit-Bench evaluates instruction-guided edits such as background, color, material, subject, style, and tone changes. Lance reports the best average score among listed unified models.
| Method | # Params. | Avg/G-O↑ | BC | CA | MM | MC | PB | ST | SA | SR | SRp | TM | TT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation-only models | | | | | | | | | | | | | |
| Gemini 2.0 | - | 6.32 | - | - | - | - | - | - | - | - | - | - | - |
| GPT Image 1 | - | 7.49 | 6.96 | 6.85 | 7.10 | 5.41 | 6.74 | 7.44 | 7.51 | 8.73 | 8.55 | 8.45 | 8.69 |
| Qwen-Image-Edit | 20B | 8.01 | 8.23 | 8.30 | 7.33 | 8.05 | 7.49 | 6.74 | 8.57 | 8.09 | 8.29 | 8.48 | 8.50 |
| Unified models | | | | | | | | | | | | | |
| Lumina-DiMOO | 8B | 3.91 | 3.43 | 4.27 | 3.08 | 2.77 | 4.74 | 5.19 | 4.44 | 3.80 | 4.38 | 2.68 | 4.20 |
| Ovis-U1 | 1.2B | 6.42 | 7.49 | 6.88 | 6.21 | 4.79 | 5.98 | 6.46 | 7.49 | 7.25 | 7.27 | 4.48 | 6.31 |
| BAGEL | 7B | 6.52 | 7.32 | 6.91 | 6.38 | 4.75 | 4.57 | 6.15 | 7.90 | 7.16 | 7.02 | 7.32 | 6.22 |
| InternVL-U | 1.7B | 6.66 | 7.08 | 7.05 | 6.38 | 7.02 | 6.03 | 6.27 | 7.13 | 6.55 | 6.33 | 6.59 | 6.85 |
| InternVL-U (w/ CoT) | 1.7B | 6.88 | 7.05 | 7.87 | 6.50 | 6.99 | 5.77 | 6.10 | 7.33 | 7.16 | 7.12 | 7.36 | 6.46 |
| Lance | 3B | 7.30 | 7.73 | 7.74 | 7.28 | 7.83 | 7.50 | 7.03 | 7.64 | 7.85 | 7.71 | 4.46 | 7.57 |
Video generation on VBench
VBench covers video quality, semantic alignment, object attributes, spatial relations, and motion-related dimensions. Lance obtains the top total score in the unified model group.
| Model | # Params. | Total Score↑ | Quality Score | Semantic Score | Subj. Consist. | Bkg. Consist. | Temp. Flicker | Motion Smooth. | Dynamic Degree | Aesthetic Quality | Imaging Quality | Object Class | Multi. Objects | Human Action | Color | Spatial Relation | Scene | Appear. Style | Temp. Style | Overall Consist. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation-only models | | | | | | | | | | | | | | | | | | | | |
| ModelScope | 1.7B | 75.75 | 78.05 | 66.54 | 89.87 | 95.29 | 98.28 | 95.79 | 66.39 | 52.06 | 58.57 | 82.25 | 38.98 | 92.40 | 81.72 | 33.68 | 39.26 | 23.39 | 25.37 | 25.67 |
| LaVie | 3B | 77.08 | 78.78 | 70.31 | 91.41 | 97.47 | 98.30 | 96.38 | 49.72 | 54.94 | 61.90 | 91.82 | 33.32 | 96.80 | 86.39 | 34.09 | 52.69 | 23.56 | 25.93 | 26.41 |
| Show-1 | 6B | 78.93 | 80.42 | 72.98 | 95.53 | 98.02 | 99.12 | 98.24 | 44.44 | 57.35 | 58.66 | 93.07 | 45.47 | 95.60 | 86.35 | 53.50 | 47.03 | 23.06 | 25.28 | 27.46 |
| AnimateDiff-V2 | - | 80.27 | 82.90 | 69.75 | 95.30 | 97.68 | 98.75 | 97.76 | 40.83 | 67.16 | 70.10 | 90.90 | 36.88 | 92.60 | 87.47 | 34.60 | 50.19 | 22.42 | 26.03 | 27.04 |
| VideoCrafter-2.0 | - | 80.44 | 82.20 | 73.42 | 96.85 | 98.22 | 98.41 | 97.73 | 42.50 | 63.13 | 67.22 | 92.55 | 40.66 | 95.00 | 92.92 | 35.86 | 55.29 | 25.13 | 25.84 | 28.23 |
| CogVideoX | 5B | 81.61 | 82.75 | 77.04 | 96.23 | 96.52 | 98.66 | 96.92 | 70.97 | 61.98 | 62.90 | 85.23 | 62.11 | 99.40 | 82.81 | 66.35 | 53.20 | 24.91 | 25.38 | 27.59 |
| Kling | - | 81.85 | 83.39 | 75.68 | 98.33 | 97.60 | 99.30 | 99.40 | 46.94 | 61.21 | 65.62 | 87.24 | 68.05 | 93.40 | 89.90 | 73.03 | 50.86 | 19.62 | 24.17 | 26.42 |
| Open-Sora-2.0 | - | 81.71 | 82.10 | 80.14 | 98.75 | 98.00 | 99.40 | 99.49 | 20.74 | 64.33 | 65.62 | 94.50 | 77.72 | 95.40 | 85.98 | 76.18 | 52.71 | 22.98 | 25.91 | 27.57 |
| Gen-3 | - | 82.32 | 84.11 | 75.17 | 97.10 | 96.62 | 98.61 | 99.23 | 60.14 | 63.34 | 66.82 | 87.81 | 53.64 | 96.40 | 80.90 | 65.09 | 54.57 | 24.31 | 24.71 | 26.69 |
| Step-Video-T2V | 30B | 81.83 | 84.46 | 71.28 | 98.05 | 97.67 | 99.40 | 99.08 | 53.06 | 61.23 | 70.63 | 80.56 | 50.55 | 94.00 | 88.25 | 71.47 | 24.38 | 23.17 | 26.01 | 27.12 |
| Hunyuan Video | - | 83.43 | 85.07 | 76.88 | 97.22 | 97.60 | 99.39 | 99.05 | 71.94 | 60.28 | 67.24 | 83.48 | 66.71 | 94.40 | 89.79 | 72.13 | 54.46 | 22.21 | 24.52 | 26.95 |
| Wan2.1-T2V | 14B | 83.69 | 85.59 | 76.11 | 97.52 | 98.09 | 99.46 | 98.30 | 65.46 | 66.07 | 69.43 | 86.28 | 69.58 | 95.40 | 88.59 | 75.39 | 45.75 | 22.64 | 23.19 | 25.91 |
| Unified models | | | | | | | | | | | | | | | | | | | | |
| HaploOmni | 7B | 78.10 | - | - | 96.40 | 97.60 | - | 96.80 | 65.30 | - | - | - | - | - | - | - | 34.60 | - | - | - |
| Emu3 | 8B | 80.96 | - | - | 95.32 | 97.69 | - | 98.93 | 79.27 | 59.64 | - | 86.17 | 44.64 | 77.71 | - | 68.73 | 37.11 | 20.92 | - | - |
| VILA-U | 7B | 74.01 | 76.26 | 65.04 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Show-o2 | 2B | 81.34 | 82.10 | 78.31 | 97.28 | 96.78 | 97.68 | 98.25 | 40.83 | 65.15 | 67.06 | 94.81 | 76.01 | 95.20 | 80.89 | 62.61 | 57.67 | 23.29 | 25.27 | 27.00 |
| TUNA | 1.5B | 84.06 | 84.32 | 83.04 | 95.99 | 96.72 | 98.02 | 98.33 | 69.39 | 65.88 | 66.83 | 95.41 | 92.31 | 97.50 | 87.67 | 78.12 | 58.59 | 23.18 | 24.68 | 27.71 |
| Lance | 3B | 85.11 | 85.14 | 84.96 | 94.52 | 94.28 | 99.66 | 95.93 | 75.83 | 64.33 | 66.78 | 96.58 | 93.86 | 97.80 | 92.61 | 93.61 | 64.75 | 23.14 | 25.53 | 27.04 |
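The Total Score column above is consistent with a 4:1 weighting of Quality Score to Semantic Score, which is how VBench is commonly described as combining the two. The check below is a sketch under that assumption; the weights and the 0.01 tolerance are stated here rather than taken from this page, and the rows are copied from the table.

```python
from typing import Dict, Tuple

QUALITY_WEIGHT = 4.0   # assumed VBench weighting
SEMANTIC_WEIGHT = 1.0  # assumed VBench weighting

# (Quality Score, Semantic Score, reported Total Score) copied from the table.
ROWS: Dict[str, Tuple[float, float, float]] = {
    "ModelScope": (78.05, 66.54, 75.75),
    "Wan2.1-T2V": (85.59, 76.11, 83.69),
    "TUNA": (84.32, 83.04, 84.06),
    "Lance": (85.14, 84.96, 85.11),
}


def total_score(quality: float, semantic: float) -> float:
    """Weighted average of the quality and semantic scores."""
    weighted = QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic
    return weighted / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)


def max_deviation() -> float:
    """Largest gap between recomputed and reported totals over the sample rows."""
    return max(abs(total_score(q, s) - reported) for q, s, reported in ROWS.values())


within_rounding = max_deviation() < 0.01
```

Small gaps are expected because the recomputation starts from inputs that were already rounded to two decimals.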
Video understanding on MVBench
MVBench evaluates video understanding across action, object, spatial, temporal, and reasoning categories. Lance achieves the best average score among listed unified models.
| Model | # Params. | Avg.↑ | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | CO | EN | ER | CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Understanding-only models | | | | | | | | | | | | | | | | | | | | | |
| Video-LLaMA | 7B | 34.1 | 27.5 | 25.5 | 51.0 | 29.0 | 39.0 | 48.0 | 40.5 | 38.0 | 22.5 | 22.5 | 43.0 | 34.0 | 22.5 | 32.5 | 45.5 | 40.0 | 30.0 | 21.0 | 37.0 |
| LLaMA-Adapter | 7B | 31.7 | 23.0 | 28.0 | 51.0 | 30.0 | 33.0 | 53.5 | 32.5 | 33.5 | 25.5 | 21.5 | 30.5 | 29.0 | 22.5 | 41.5 | 39.5 | 31.5 | 22.5 | 28.0 | 32.0 |
| Video-ChatGPT | 7B | 32.7 | 23.5 | 26.0 | 62.0 | 22.5 | 26.5 | 54.0 | 28.0 | 40.0 | 23.0 | 20.0 | 31.0 | 30.5 | 25.5 | 39.5 | 48.5 | 33.0 | 29.5 | 26.0 | 35.5 |
| VideoChat | 7B | 35.5 | 33.5 | 26.5 | 56.0 | 33.5 | 40.5 | 53.0 | 40.5 | 30.0 | 25.5 | 27.0 | 48.5 | 35.0 | 20.5 | 42.5 | 46.0 | 41.0 | 23.5 | 23.5 | 36.0 |
| VideoChat2 | 7B | 51.1 | 66.0 | 47.5 | 83.5 | 49.5 | 60.0 | 58.0 | 71.5 | 42.5 | 23.0 | 23.0 | 88.5 | 39.0 | 42.0 | 58.5 | 44.0 | 36.5 | 35.0 | 40.5 | 65.5 |
| ST-LLM | 7B | 54.9 | 66.0 | 53.5 | 84.0 | 44.0 | 58.5 | 80.5 | 73.5 | 38.5 | 42.5 | 31.0 | 86.5 | 36.5 | 56.5 | 78.5 | 43.0 | 46.5 | 34.5 | 41.5 | 58.5 |
| GPT-4V | - | 43.5 | 55.5 | 63.5 | 72.0 | 46.5 | 73.5 | 18.5 | 59.0 | 29.5 | 12.0 | 40.5 | 83.5 | 39.0 | 12.0 | 22.5 | 45.0 | 52.0 | 31.0 | 59.0 | 11.0 |
| PLLaVA | 34B | 58.1 | 67.5 | 53.0 | 82.0 | 47.0 | 79.0 | 68.5 | 67.5 | 36.5 | 37.5 | 49.5 | 91.0 | 40.5 | 43.0 | 70.0 | 51.5 | 66.5 | 39.5 | 63.5 | 59.0 |
| Video-CCAM | 9B | 64.6 | 83.0 | 67.0 | 89.5 | 49.0 | 72.0 | 86.5 | 81.0 | 45.0 | 28.0 | 29.0 | 90.0 | 59.0 | 67.0 | 85.0 | 63.5 | 77.0 | 34.0 | 73.5 | 59.0 |
| Qwen2.5-VL | 3B | 67.0 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| TimeMarker | 8B | 67.4 | 79.0 | 74.5 | 89.0 | 53.5 | 77.0 | 94.0 | 76.0 | 41.5 | 52.5 | 47.0 | 91.5 | 53.0 | 76.5 | 92.5 | 57.0 | 70.5 | 23.5 | 53.5 | 82.5 |
| InternVideo2 | 7B | 67.3 | 86.0 | 70.0 | 87.0 | 56.0 | 75.0 | 91.0 | 86.0 | 40.0 | 48.0 | 53.0 | 90.0 | 41.0 | 73.0 | 92.0 | 52.0 | 56.0 | 33.0 | 57.0 | 74.0 |
| Unified models | | | | | | | | | | | | | | | | | | | | | |
| Show-o2 | 1.5B | 50.6 | 63.8 | 59.5 | 63.5 | 40.0 | 70.5 | 54.5 | 66.0 | 36.5 | 36.0 | 27.0 | 88.0 | 43.5 | 43.0 | 58.0 | 44.5 | 54.0 | 28.5 | 39.5 | 45.0 |
| Show-o2 | 7B | 55.7 | 60.1 | 67.0 | 68.0 | 45.5 | 78.0 | 51.0 | 73.5 | 44.5 | 36.0 | 39.0 | 92.5 | 51.5 | 36.0 | 59.5 | 52.0 | 64.0 | 38.0 | 60.0 | 43.0 |
| TUNA | 1.5B | 54.4 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| UniVideo | 7B | 46.3 | 54.3 | 41.5 | 77.5 | 50.0 | 62.5 | 68.2 | 50.5 | 37.5 | 36.0 | 29.5 | 35.5 | 28.5 | 52.5 | 70.5 | 33.5 | 40.5 | 37.5 | 36.5 | 38.0 |
| Lance | 3B | 62.0 | 73.9 | 76.5 | 71.5 | 49.0 | 63.5 | 96.0 | 72.5 | 33.0 | 63.5 | 33.0 | 86.0 | 41.0 | 82.0 | 97.5 | 43.0 | 47.5 | 31.5 | 40.0 | 77.0 |
Citation
@misc{fu2026lanceunifiedmultimodalmodeling,
title = {Lance: Unified Multimodal Modeling by Multi-Task Synergy},
author = {Fengyi Fu and Mengqi Huang and Shaojin Wu and Yunsheng Jiang and Yufei Huo and Hao Li and Yinghang Song and Fei Ding and Jianzhu Guo and Qian He and Zheren Fu and Zhendong Mao and Yongdong Zhang},
year = {2026},
eprint = {2605.18678},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.18678},
}