
Paper Review

MoST: Motion Style Transformer between Diverse Action Contents

(Accepted by CVPR 2024)

https://doi.org/10.48550/arXiv.2403.06225


Abstract

Limitations of existing methods:

While existing motion style transfer methods are effective between two motions with identical content, their performance significantly diminishes when transferring style between motions with different contents.

(Because content and style are not clearly separated in motion, existing methods are effective for motions with the same content, but their performance drops when transferring between different contents.)

 

Proposed Approach:

We propose a novel motion style transformer that effectively disentangles style from content and generates a plausible motion with transferred style from a source motion.

(Content and style are effectively disentangled, and a motion is generated by applying the style transferred from the source motion.)

 

To achieve this goal, the paper introduces:

(1) a new architecture for the motion style transformer, with a 'part-attentive style modulator across body parts' and 'Siamese encoders that encode style and content features separately'

(2) a style disentanglement loss

 

Result of Motion Style Transformer


Introduction

Our goal is to transfer stylistic characteristics from a source motion sequence (style motion) to a target motion sequence (content motion) without manually providing style labels.

 

The most significant challenge is transfer failure, where the generated motion loses the content of the content motion or does not reflect the style of the style motion. (→ the problem with existing methods)

 

To address the primary concern of transfer failure, 

we design a new framework called MoST and new loss functions for effectively transferring style between different contents.

 

MoST comprises transformer-based Siamese motion encoders, a part-attentive style modulator (PSM), and a motion generator.

 

 

The main contributions of this study:

We design MoST, incorporating Siamese encoders and PSM, to effectively disentangle style from the source motion and align it with the target content.

 

We introduce 'Siamese motion encoders' capable of simultaneously extracting both features (→ content and style) from an input motion.

 

PSM modulates the raw style feature extracted from the style motion to align with the content of the content motion before being inserted into the generator. This modulation enables effective disentanglement of style from the style motion and its expression in the content motion.

(The raw style feature (from the style motion) is aligned with the content (of the content motion) before being fed into the generator, so that the style can be applied naturally to the content!)

 

We propose novel loss functions to improve the model's ability to disentangle style from content within the motion and generate plausible motion.

 

The style disentanglement loss aims to distinctly separate style and content. This separation enhances the model's robustness in transferring style, irrespective of the content of the style motion.

 

 

In short, the method outperforms existing approaches at style transfer between different contents and produces good results even without post-processing.


Method

(1) Overall Framework

The framework aims to transfer the style of a given style motion to a given content motion, generating 'Mg'.

 

 

The Siamese motion encoder is designed to encode both a content feature and a style feature simultaneously from a single motion.
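To make the overall data flow concrete, here is a minimal PyTorch sketch of the pipeline. It assumes the shared encoder returns a (content feature, style feature) pair and that the PSM and generator have the interfaces shown; the class and argument names are illustrative, not the authors' code.

```python
import torch.nn as nn

class MoSTSketch(nn.Module):
    """High-level sketch: one shared (Siamese) encoder, the PSM, and the generator.
    The component modules are sketched in the later sections of this post."""

    def __init__(self, encoder, psm, generator):
        super().__init__()
        self.encoder = encoder      # a single encoder applied to both motions (Siamese)
        self.psm = psm              # part-attentive style modulator
        self.generator = generator  # transformer-based motion generator

    def forward(self, m_content, m_style):
        c_c, _ = self.encoder(m_content)     # content feature of the content motion
        c_s, s_s = self.encoder(m_style)     # content + style features of the style motion
        s_mod = self.psm(c_c, c_s, s_s)      # style re-aligned to the content motion's parts
        return self.generator(c_c, s_mod)    # generated motion 'Mg'
```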

 

 

(2) Motion Representation

We generate motion embeddings that are the input tensors of the encoder. We group the joints into P body parts, accounting for the body structure. (→ joints are grouped into P body parts)

 

Moreover, to consider global translational motion, which existing methods overlook, we acquire a separate embedding of global translation in addition to the body part embeddings; each embedding is denoted as the i-th body part embedding at the t-th frame. (→ each body part gets its own embedding at every frame, hence this notation!)

 

The joint information of each body part is concatenated into a single vector, and a fully connected layer is applied to produce the final body part embedding.

The global translational motion information is then split into the 'root joint position' and the 'global velocity', each of which is embedded separately.

Finally, at every frame, the body part embeddings and the global translational motion embedding are combined to form the final input tensor.

(That is, at each frame t, the final input tensor consists of P body part embeddings plus one global translation embedding.)
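As a rough illustration of this embedding step, the PyTorch sketch below groups joints into body parts, embeds each part with a fully connected layer, embeds the root position and global velocity separately, and concatenates everything per frame. The dimensions, layer names (`part_fcs`, `root_fc`, `vel_fc`, `trans_fc`), and the five-part grouping in the usage example are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MotionEmbedding(nn.Module):
    """Sketch: per-frame body-part embeddings plus one global-translation embedding."""

    def __init__(self, part_joint_counts, joint_dim=3, embed_dim=64):
        super().__init__()
        # One FC layer per body part: concatenated joint features -> part embedding.
        self.part_fcs = nn.ModuleList(
            [nn.Linear(n * joint_dim, embed_dim) for n in part_joint_counts]
        )
        # Root position and global velocity are embedded separately, then fused.
        self.root_fc = nn.Linear(3, embed_dim)
        self.vel_fc = nn.Linear(3, embed_dim)
        self.trans_fc = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, joints, root_pos, global_vel, part_indices):
        # joints:       [T, J, joint_dim]  local joint features per frame
        # root_pos:     [T, 3]             root joint position
        # global_vel:   [T, 3]             global (root) velocity
        # part_indices: list of P index lists grouping the joints into body parts
        part_embs = []
        for fc, idx in zip(self.part_fcs, part_indices):
            flat = joints[:, idx, :].flatten(start_dim=1)   # [T, n_i * joint_dim]
            part_embs.append(fc(flat))                      # [T, D]
        part_embs = torch.stack(part_embs, dim=1)           # [T, P, D]

        trans = torch.cat([self.root_fc(root_pos), self.vel_fc(global_vel)], dim=-1)
        trans_emb = self.trans_fc(trans).unsqueeze(1)       # [T, 1, D]

        return torch.cat([part_embs, trans_emb], dim=1)     # [T, P+1, D]


# Usage: 5 body parts over 20 joints, 60-frame clip (all numbers illustrative).
parts = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]
emb = MotionEmbedding([len(p) for p in parts])
x = emb(torch.randn(60, 20, 3), torch.randn(60, 3), torch.randn(60, 3), parts)
print(x.shape)  # torch.Size([60, 6, 64])
```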

 

 

(3) Siamese Motion Encoders

We introduce a Siamese architecture for both encoders to eliminate redundancy.

 - We design the encoder to extract both style and content features from each motion, which are utilized in the next step.

 - Our model extracts a global style feature across the entire motion sequence.

 

The motion encoder comprises N stacked transformer blocks, and each block contains body part attention and temporal attention modules. (We employ part attention instead of joint attention.)

Additionally, we introduce a style token to aggregate the style features across an entire motion sequence.

(The encoder consists of N transformer blocks, each containing body part attention and temporal attention. A style token is also added to effectively aggregate the style features across the entire motion sequence!)
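A simplified PyTorch sketch of such an encoder block and the style token is given below, assuming the input tensor has shape [T, P, D] (frames × part tokens × channels). The block layout, the single shared style token that attends over each part's frames, and all hyperparameters are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PartTemporalBlock(nn.Module):
    """Sketch of one encoder block: attention across body parts, then across time."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.part_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temp_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x):                                   # x: [T, P, D]
        # Body part attention: each frame attends over its P part tokens.
        x = self.norm1(x + self.part_attn(x, x, x, need_weights=False)[0])
        # Temporal attention: each part attends over the T frames.
        h = x.transpose(0, 1)                               # [P, T, D]
        h = self.temp_attn(h, h, h, need_weights=False)[0].transpose(0, 1)
        x = self.norm2(x + h)
        return self.norm3(x + self.ff(x))


class SiameseMotionEncoder(nn.Module):
    """Sketch: N stacked blocks plus a learnable style token pooled per body part."""

    def __init__(self, dim=64, heads=4, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([PartTemporalBlock(dim, heads) for _ in range(n_blocks)])
        self.style_token = nn.Parameter(torch.randn(1, 1, dim))
        self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: [T, P, D] motion embedding
        for blk in self.blocks:
            x = blk(x)
        # The style token attends over each part's frames, aggregating a per-part
        # style feature across the entire sequence: [P, D].
        h = x.transpose(0, 1)                               # [P, T, D]
        q = self.style_token.expand(h.shape[0], -1, -1)     # [P, 1, D]
        style = self.style_attn(q, h, h, need_weights=False)[0].squeeze(1)
        # Instance normalization in the last stage then yields the content feature
        # (sketched separately below); here we return the raw feature and the style.
        return x, style


# Usage with the [T, P+1, D] embedding from the previous sketch (T=60, P+1=6, D=64).
enc = SiameseMotionEncoder()
feat, style = enc(torch.randn(60, 6, 64))
print(feat.shape, style.shape)  # torch.Size([60, 6, 64]) torch.Size([6, 64])
```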

 

 

To obtain the content feature, instance normalization (IN) is applied in the last block to remove style characteristics from the motion features.

(Instance normalization (IN) is applied to the motion features, which contain style information, to remove the style, so that only the content feature remains.)
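As a small sketch of this step (assuming the encoder feature has shape [T, P, D]), instance normalization can be applied over the temporal dimension so that per-sequence statistics, treated here as style, are stripped away:

```python
import torch

def instance_norm_content(feat, eps=1e-6):
    """Strip per-sequence (style) statistics from an encoder feature of shape [T, P, D]
    by normalizing each part's channels over the temporal dimension."""
    mean = feat.mean(dim=0, keepdim=True)       # [1, P, D]
    std = feat.std(dim=0, keepdim=True)         # [1, P, D]
    return (feat - mean) / (std + eps)          # content feature with style statistics removed
```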

 

 

(4) Part-Attentive Style Modulator (PSM)

PSM modulates the style feature, which originates from 'Ms', to be more effectively expressed in 'Mc'.

Cross-attention identifies how style should be transmitted from a specific body part of 'Cs' to a corresponding body part of 'Cc'. Consequently, PSM prevents the transmission of motion from an undesired body part of 'Ms'.

 

The body part mapping between the style motion and the content motion may differ, so cross-attention is used to prevent style from being transferred from the wrong body part and to apply the style only to the appropriate parts.
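Below is a minimal sketch of how such a part-wise cross-attention could look, assuming the per-part content features of the two motions ('Cc', 'Cs') and the raw per-part style feature have been pooled over time into [P, D] tensors; the shapes and module names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PartAttentiveStyleModulator(nn.Module):
    """Sketch: cross-attention over body parts re-aligns the raw style feature
    with the body parts of the content motion."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, content_c, content_s, style_s):
        # content_c: [P, D] per-part content feature of the content motion (queries)
        # content_s: [P, D] per-part content feature of the style motion   (keys)
        # style_s:   [P, D] per-part raw style feature of the style motion (values)
        q, k, v = content_c.unsqueeze(0), content_s.unsqueeze(0), style_s.unsqueeze(0)
        # The attention weights decide which source part's style feeds each target
        # part, suppressing transfer from unrelated body parts of the style motion.
        modulated, _ = self.cross_attn(q, k, v)
        return modulated.squeeze(0)             # [P, D] style aligned with the content motion


# Usage with 6 part tokens (P body parts + global translation) of dimension 64.
psm = PartAttentiveStyleModulator()
out = psm(torch.randn(6, 64), torch.randn(6, 64), torch.randn(6, 64))
print(out.shape)  # torch.Size([6, 64])
```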

 

 

(5) Motion Generator

The motion generator is composed of N stacked transformer blocks, similar to the encoder.

'Yc' is introduced as the input tensor for the first transformer block.

 

In other words, the input tensor of the motion generator starts from the content feature 'Yc', which contains the frame-wise dynamics of each body part. As a result, the content information is effectively preserved and carried into the generated motion.

 

To incorporate 'Ss', we employ Adaptive Instance Normalization (AdaIN), which is applied in each transformer block.

Each transformer block then passes the feature through part attention and temporal attention to produce a new, style-infused content feature.
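A minimal sketch of the AdaIN step is shown below, assuming a [T, P, D] content feature and a [P, D] style feature; the style vector is mapped to per-channel scale and shift parameters that re-style the normalized content feature. The layer names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Sketch: adaptive instance normalization injects the modulated style feature
    into the generator's content feature inside each transformer block."""

    def __init__(self, dim=64):
        super().__init__()
        # The style vector predicts a per-channel scale (gamma) and shift (beta).
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, content, style, eps=1e-6):
        # content: [T, P, D] block feature; style: [P, D] per-part style feature
        mean = content.mean(dim=0, keepdim=True)
        std = content.std(dim=0, keepdim=True) + eps
        normalized = (content - mean) / std                         # strip existing statistics
        gamma, beta = self.to_scale_shift(style).chunk(2, dim=-1)   # each [P, D]
        return gamma.unsqueeze(0) * normalized + beta.unsqueeze(0)  # re-styled feature


adain = AdaIN()
y = adain(torch.randn(60, 6, 64), torch.randn(6, 64))
print(y.shape)  # torch.Size([60, 6, 64])
```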

 

The local motion, global translation, and global velocity of the generated motion are reconstructed from the output feature tensor of the final transformer block. (→ the final generated motion)

 

 

(6) Loss

We introduce the style disentanglement loss, aiming to effectively separate style from content. It increases the robustness of our model in generating a well-stylized motion regardless of the content of 'Ms'.

 

Style Disentanglement Loss, 'LD'

LD minimizes the discrepancy between the generated motions stylized by 'MaS' and 'MbS', where 'MaS' and 'MbS' denote two style motions that have identical style labels but different content labels. LD induces the model to clearly remove content from 'Ms', independent of the specific content present in 'Ms', thereby avoiding the blending of content into 'MG'.

('LD' ensures a clear separation of content and style by making the style be learned independently of the content. In effect, it guides the model to learn that when different style motions (with the same style label) are applied to the same content motion, the generated results should be as similar as possible.)
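A minimal sketch of the idea behind 'LD' is given below, assuming `generate` denotes the full style-transfer pipeline and using a simple MSE between the two outputs; the paper's actual formulation may differ in the distance measure and the space in which it is computed.

```python
import torch

def style_disentanglement_loss(generate, m_c, m_s_a, m_s_b):
    """Sketch of 'LD': two style motions with the same style label but different
    contents, applied to the same content motion, should yield matching outputs.
    `generate(content_motion, style_motion)` stands in for the full pipeline."""
    g_a = generate(m_c, m_s_a)            # content motion stylized by style motion a
    g_b = generate(m_c, m_s_b)            # ... and by style motion b (same style label)
    return torch.mean((g_a - g_b) ** 2)   # push the two generated motions together


# Dummy usage with a placeholder "generator" that just mixes its inputs.
fake_generate = lambda c, s: 0.9 * c + 0.1 * s
loss = style_disentanglement_loss(
    fake_generate, torch.randn(60, 63), torch.randn(60, 63), torch.randn(60, 63)
)
print(loss.item())
```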

 

Physics-Based Loss, 'Lphy'

We introduce the physics-based loss to mitigate pose jittering and improve foot-contact stability. It comprises regularization terms of velocity, acceleration, and foot contact.

(Since the generated motion can be physically implausible (jittering, unstable foot contact, etc.), regularization terms related to velocity, acceleration, and foot contact are added!)
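A rough sketch of such regularization terms is given below, assuming generated joint positions of shape [T, J, 3] and a per-frame foot-contact mask; the exact terms and weights in the paper may differ.

```python
import torch

def physics_based_loss(motion, foot_joint_idx, contact_mask, w_vel=1.0, w_acc=1.0, w_foot=1.0):
    """Sketch of 'Lphy': regularize velocity and acceleration to reduce jitter, and
    penalize foot-joint velocity on frames labeled as ground contact.
    motion: [T, J, 3] generated joint positions; contact_mask: [T] bool tensor."""
    vel = motion[1:] - motion[:-1]                       # [T-1, J, 3] frame-to-frame velocity
    acc = vel[1:] - vel[:-1]                             # [T-2, J, 3] acceleration
    loss_vel = vel.pow(2).mean()
    loss_acc = acc.pow(2).mean()
    # Feet should not slide while they are in contact with the ground.
    foot_vel = vel[:, foot_joint_idx, :]                 # [T-1, F, 3]
    loss_foot = (foot_vel.pow(2).sum(dim=-1) * contact_mask[1:, None].float()).mean()
    return w_vel * loss_vel + w_acc * loss_acc + w_foot * loss_foot


# Usage: 60 frames, 20 joints, toe joints at indices 9 and 19 (illustrative).
motion = torch.randn(60, 20, 3)
contact = torch.zeros(60, dtype=torch.bool)
contact[::2] = True
print(physics_based_loss(motion, [9, 19], contact).item())
```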


Conclusions

MoST is designed to effectively disentangle style and content of input motions and transfer styles between them. The proposed loss functions successfully train MoST to generate well-stylized motion without compromising content. Our method outperforms existing methods significantly, especially in motion pairs with different contents.

 

Limitations and future work:

Foot contact problem → the physics-based loss does not forcibly eliminate it, but test-time optimization could be applied to remove it completely.

 

Transformer architecture → it requires specifying the maximum motion length in advance; the authors also plan to extend the model to few-shot learning so it can handle small datasets, given the high cost of motion data acquisition.