This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. Since video-based motion RAG is nontrivial, we identify and address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the error propagation caused by suboptimal retrieval results. To this end, we design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs otherwise constrained to text-only input.
VimoRAG first retrieves a relevant video from an unlabeled video database based on the input text. Both the text and the retrieved video are then fed into an LLM to generate motion tokens, which are finally decoded into a motion sequence via a VQ-VAE. To enhance this pipeline, we propose two key components: Gemini-MVR for effective cross-modal human video retrieval, and McDPO, a novel training strategy that mitigates error propagation from suboptimal retrieval.
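For concreteness, the sketch below walks through this retrieve-then-generate flow. All class names and method signatures (GeminiMVRStub, MotionLLMStub, MotionVQVAEStub, generate_motion) are hypothetical stubs introduced for illustration; they are not the released implementation.

```python
# Minimal, self-contained sketch of the VimoRAG inference pipeline described above.
# All component names and interfaces below are hypothetical placeholders.
from typing import List

import numpy as np


class GeminiMVRStub:
    """Stands in for Gemini-MVR: maps a text prompt to the best-matching video."""

    def __init__(self, video_db: List[str]):
        self.video_db = video_db

    def retrieve(self, text: str) -> str:
        # Placeholder: a real retriever scores every video against the text
        # and returns the argmax; here we simply return the first entry.
        return self.video_db[0]


class MotionLLMStub:
    """Stands in for the motion LLM: (text, retrieved video) -> discrete motion tokens."""

    def generate_tokens(self, text: str, video_path: str) -> List[int]:
        return [3, 17, 42, 8]  # placeholder motion-token ids


class MotionVQVAEStub:
    """Stands in for the VQ-VAE decoder: motion tokens -> 3D motion sequence."""

    def decode(self, tokens: List[int]) -> np.ndarray:
        # One 22-joint, 3-D pose per token (shape: frames x joints x 3).
        return np.zeros((len(tokens), 22, 3))


def generate_motion(text: str, retriever, motion_llm, decoder) -> np.ndarray:
    video = retriever.retrieve(text)                  # 1) retrieve an in-the-wild video
    tokens = motion_llm.generate_tokens(text, video)  # 2) text + video -> motion tokens
    return decoder.decode(tokens)                     # 3) tokens -> motion sequence


if __name__ == "__main__":
    retriever = GeminiMVRStub(["videos/bend_over.mp4"])
    motion = generate_motion(
        "The person is bending over to put food on the floor for a pet.",
        retriever, MotionLLMStub(), MotionVQVAEStub(),
    )
    print(motion.shape)  # (4, 22, 3)
```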
Text: The person is bending over to put food on the floor for a pet, then straightening up and stepping back to standing position.
Text: The person appears to be mimicking the action of riding a bicycle while standing up, alternately raising their knees as if pedaling and swinging their arms as though holding handlebars.
Text: The person is standing upright with a rapid sequence of raising both fists from waist level to above the head and then lowering them back down in a cheering motion.
Text: The person is performing a punching motion while standing stationary. He is transitioning from a relaxed stance to a boxing stance, throwing a series of punches, and then returning to the relaxed stance.
Text: The person is performing a stationary basketball shooting motion. Starting from a standing position, they bend their knees to generate power, raise the ball with both hands in front of them, extend their arms upwards while jumping slightly, and then follow through with one hand to release the ball, mimicking a basketball shot.
Text: The person is walking back and forth in a room, turning slightly at each end, and appears to be fanning themselves continuously with one hand as they go.
Text: The person is squatting down to lift a potted plant, then sitting on the floor with the plant.
Text: The person is standing and making a phone call gesture. They lift their right hand to their ear as if holding a phone. Their body remains relatively static while performing the gesture.
Text: The person is preparing to throw a frisbee. Starting with a stance where the weight is on the back foot, they shift the weight forward, bringing the arm with the frisbee back for momentum. Then, they step forward with the opposite leg, rotating the torso and extending the arm to release the frisbee.
The Gemini-MVR retrieval model consists of two independent retrievers—object-level and action-level—whose outputs are fused by a lightweight router to produce the final similarity score. Each retriever encodes both text and video with a specific focus. The object-level retriever captures visual entities (objects) and their textual arguments, while the action-level retriever targets motion and predicate-level semantics.
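As an illustration of how the two branch similarities might be fused, the sketch below computes a cosine similarity per branch and lets a small routing head weight them per query. The module name, feature dimension, and softmax-weighted fusion are assumptions for illustration, not the exact Gemini-MVR router.

```python
# Hedged sketch of dual-branch score fusion; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchFusion(nn.Module):
    """Fuses object-level and action-level similarities with a learned router."""

    def __init__(self, text_dim: int = 512):
        super().__init__()
        self.router = nn.Linear(text_dim, 2)  # lightweight routing head

    def forward(self, text_feat, obj_text_emb, obj_video_emb,
                act_text_emb, act_video_emb):
        s_obj = F.cosine_similarity(obj_text_emb, obj_video_emb, dim=-1)  # entities
        s_act = F.cosine_similarity(act_text_emb, act_video_emb, dim=-1)  # motion
        w = torch.softmax(self.router(text_feat), dim=-1)                 # (B, 2)
        return w[:, 0] * s_obj + w[:, 1] * s_act                          # final score


if __name__ == "__main__":
    B, D = 4, 512
    fuse = DualBranchFusion(D)
    score = fuse(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D),
                 torch.randn(B, D), torch.randn(B, D))
    print(score.shape)  # torch.Size([4])
```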
Given a text t and a retrieved video v, we first perform visual demonstration-enhanced instruction tuning to establish a base reference model π_ref. Using π_ref and the proposed motion-centric dual-alignment reward model, we construct a preference training set, and then apply DPO training on this dataset.
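The sketch below shows the standard DPO objective (Rafailov et al., 2023) that such a preference set would be optimized with, written over sequence-level log-probabilities of the chosen and rejected motion-token sequences under the policy and the frozen π_ref. The function name and the value of β are placeholders; the motion-centric dual-alignment reward model that ranks the pairs is not reproduced here.

```python
# Minimal sketch of the DPO step, assuming preference pairs (chosen vs. rejected
# motion-token sequences) have already been ranked by the reward model.
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Inputs are sequence-level log-probs under the policy and the frozen pi_ref."""
    # Margin of the preferred sample over the dispreferred one, relative to pi_ref.
    logits = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()


if __name__ == "__main__":
    b = 8  # toy batch of preference pairs
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(float(loss))
```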