MultiAct

Text-to-Motion Generation from Composite Text via Tailored Attention Guidance

SIGGRAPH 2026 Conference

Abstract

Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human–computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete and ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Our code and data are available at https://github.com/natsala13/multiact.

Composite Text-to-Motion Generation

MultiAct is designed for prompts that describe multiple actions or modifiers simultaneously. Instead of allowing a single dominant action to override the rest, it amplifies underrepresented prompt components through attention guidance.

By selectively strengthening cross-attention signals for weakly represented tokens, MultiAct synthesizes motions that capture all specified semantic elements while maintaining motion realism and temporal coherence.

This approach mitigates semantic collapse in composite text prompts and enables more reliable generation across diverse motion descriptions.

Baseline

MultiAct

In the example above, the left video shows the baseline failing to execute both actions together, while the right video shows MultiAct successfully generating a person dribbling a ball while walking backwards.

Pipeline

MultiAct pipeline

MultiAct operates on a frozen text-to-motion backbone and steers generation by modulating cross-attention during inference. Given a composite text prompt, the framework selects the prompt tokens and layers where attention should be strengthened, and applies a tailored guided generation schedule. This process amplifies weak semantic components while preserving the pretrained motion generator’s structure and temporal dynamics. As a result, the generated motion better reflects all prompt details without requiring additional training or architectural changes.
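The core attention-strengthening step can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function name, the multiplicative `gain`, and the `(frames, tokens)` layout of post-softmax cross-attention weights are illustrative assumptions.

```python
import numpy as np

def amplify_attention(weights, token_ids, gain=2.0):
    """Strengthen cross-attention for selected prompt tokens (illustrative sketch).

    weights:   (frames, tokens) post-softmax cross-attention weights,
               each row summing to 1.
    token_ids: indices of underrepresented prompt tokens to boost.
    gain:      multiplicative boost applied before renormalization
               (hypothetical parametrization, not the paper's).
    """
    boosted = weights.copy()
    boosted[:, token_ids] *= gain
    # Renormalize each frame's distribution so it still sums to 1.
    return boosted / boosted.sum(axis=-1, keepdims=True)
```

Because the boost is applied per frame and renormalized, the selected tokens gain attention mass at the expense of the dominant ones while the weights remain a valid distribution.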

Diving into Cross-Attention Features

MultiAct exploits the structure of cross-attention in motion diffusion models to identify which prompt components are underrepresented. By analyzing attention patterns across tokens and layers, the framework adaptively amplifies weak semantic cues while preserving temporal coherence. This attention-guided strategy enables robust generation from composite prompts.
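One way to detect underrepresented components, sketched below under simplifying assumptions: average the attention mass each action token receives across frames and flag tokens that fall well below the strongest action token. The function name, the relative `threshold`, and the selection rule are hypothetical illustrations, not the paper's decision scheme.

```python
import numpy as np

def find_weak_tokens(weights, action_token_ids, threshold=0.5):
    """Flag prompt tokens receiving disproportionately low attention.

    weights:          (frames, tokens) post-softmax cross-attention weights.
    action_token_ids: indices of tokens describing the prompt's actions.
    threshold:        fraction of the strongest action token's mass below
                      which a token counts as weak (illustrative heuristic).
    """
    mass = weights.mean(axis=0)          # mean attention per token over time
    ref = mass[action_token_ids].max()   # strongest action token as reference
    return [t for t in action_token_ids if mass[t] < threshold * ref]
```

Tokens returned by such a detector would then be candidates for attention strengthening in subsequent denoising steps.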

Cross Attention Visualization

Attention score visualization in MultiAct.

Cross-attention score comparison video.

Attention visualization. The colored heatmaps illustrate attention scores for the words “forward” (yellow) and “arms” (green). The baseline backbone assigns low attention to arm-related tokens, producing motions in which the arms are never raised. In contrast, our method assigns high attention scores to both tokens, yielding a synchronized motion that faithfully reflects the prompt.

Composite Prompt Challenges

MultiAct is built to handle prompts that combine multiple simultaneous actions and modifiers. The examples below show three representative prompts, each comparing the baseline, MultiAct, and an additional comparison method.

Dribbling while moving backward

Prompt: a person dribbles a ball while moving backward. The baseline fails to combine the two actions, while MultiAct captures both the dribbling and the backward locomotion.

Baseline

MultiAct

STMC

Hopping forward while raising arms

Prompt: a person hops forward while raising arms. This example requires coordinated leg and arm motion; MultiAct preserves both simultaneously.

Baseline

MultiAct

All

Running while waving arms

Prompt: a person is running while waving his arms. MultiAct captures both the running rhythm and the arm motion more faithfully than the baseline.

Baseline

MultiAct

STMC

These three prompt examples demonstrate MultiAct’s ability to preserve multiple simultaneous actions across challenging composite descriptions.

BibTeX


@inproceedings{sala2026multiact,
title={MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance},
author={Sala, Nathan and Abramovich, Ofir and Shamir, Ariel and Cohen-Or, Daniel and Aristidou, Andreas and Raab, Sigal},
booktitle={SIGGRAPH 2026 Conference Papers},
year={2026}
}