Model Architecture Design
model architecture deep-learning
The core of the recognition system is a Spatial-Temporal Transformer (ST-Transformer), a dual-attention neural network designed for processing streaming skeletal data by modeling both spatial relationships between body parts and temporal dynamics over time.
Architecture Overview
The model takes a sequence of spatial keypoints and outputs a probability distribution over the sign classes. It utilizes a hierarchical approach, first embedding anatomical groups and then processing them through alternating spatial and temporal attention mechanisms.
Diagram
graph TD Input["Input Sequence (Batch, T, F)"] --> GTE[Group Token Embedding] GTE --> PE[Positional Encoding] PE --> Drop[Dropout] Drop --> B1[ST-Transformer Block 1] B1 --> B2[ST-Transformer Block N] B2 --> MP[Mean Pooling] MP --> AP[Attention Pooling] AP --> FC[Classifier Head] FC --> Logits[Class Logits]
1. Group Token Embedding
The model independently projects anatomical regions (Pose, Face, Left Hand, Right Hand) into a shared latent space.
- Separate Projections: 4 independent Linear layers for each body part.
- Normalization: Batch Normalization on raw input features for stability.
- Tokenization: Body parts are stacked as “tokens” for spatial attention.
- Part Embeddings: Learnable parameter added to distinguish anatomical regions.
2. Positional Encoding
Since Transformers are permutation-invariant, a sinusoidal positional encoding is added to the temporal dimension to provide the model with information about the order of frames.
3. ST-Transformer Blocks
A stack of consecutive blocks that perform dual-stream attention:
- Spatial Attention: Multi-head self-attention operating across the body part tokens (Pose, Face, Hands) within each time step.
- Temporal Attention: Multi-head self-attention operating across the time steps for each body part.
- MLP Head: A position-wise feed-forward network with GELU activation.
- Residual Connections & LayerNorm: Each sub-layer (Spatial, Temporal, MLP) uses residual additions and normalization.
4. Self-Attention Pooling
Instead of simple averaging, the model uses a trainable Attention Pooling mechanism to aggregate the temporal dimension into a single context vector.
- Trainable Weights: Learns which frames are most informative for the sign.
- Softmax Weighting: Produces a normalized weighted sum of the hidden states.
5. Classification Head
- Linear Layer: Maps the pooled representation to the final class logits (502 signs).