
    Self Supervised Speech Representation Learning - I

    04 June, 2025

    Abstract

    Self-supervised learning has revolutionized how we approach speech processing, enabling models to learn meaningful representations from raw audio without requiring expensive human annotations. In this article, we take a deep dive into two foundational models that have shaped the landscape of speech representation learning: HuBERT and its enhanced successor, MS-HuBERT.


    Challenges of Speech Representation Learning

    Before diving into the models, it's crucial to understand why speech presents unique challenges compared to other domains like computer vision or natural language processing.

    Three fundamental problems distinguish speech from text and images:

    • Multiple sound units per utterance: Unlike images where we typically classify a single object, speech contains multiple phonemes, words, and acoustic events within each input sequence.

    • No predefined lexicon: During pre-training, we don't have access to a dictionary of discrete sound units the way we do with words in NLP applications.

    • Variable-length units with unclear boundaries: Sound units have different durations and lack explicit segmentation markers, making it difficult to apply standard masked prediction techniques.

    These challenges imply that traditional self-supervised approaches from computer vision (instance classification) and NLP (masked language modeling) cannot be directly applied to speech.


    HuBERT: Hidden Unit BERT for Speech

    Core Innovation

    HuBERT tackles the speech representation challenge through a cleverly designed two-stage approach that separates acoustic unit discovery from representation learning. The key insight driving HuBERT is that consistency of targets matters more than their correctness: stable, even if imperfect, pseudo-labels let the model focus on learning the sequential structure of speech rather than chasing constantly changing targets.

    Architecture and Methodology

    Stage 1: Acoustic Unit Discovery
    HuBERT begins by applying a simple clustering algorithm (k-means) to traditional audio features (Mel-Frequency Cepstral Coefficients, or MFCCs) to generate pseudo-phonetic labels. While these initial clusters are noisy, they provide the consistency needed for effective learning.
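
    As a concrete illustration, here is a minimal sketch of this first clustering pass, assuming librosa for MFCC extraction and scikit-learn for k-means. The 39-dimensional MFCC features (13 coefficients plus first- and second-order deltas) and the 100-cluster setting mirror HuBERT's first iteration; the file list and other parameters are placeholders.

    ```python
    import numpy as np
    import librosa
    from sklearn.cluster import KMeans

    def mfcc_features(path, sr=16000, hop=320):
        """39-dim MFCC + delta + delta-delta features at a 20 ms frame rate (hop = 320 samples at 16 kHz)."""
        wav, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=hop)
        feats = np.concatenate([mfcc,
                                librosa.feature.delta(mfcc),
                                librosa.feature.delta(mfcc, order=2)])
        return feats.T                      # (num_frames, 39)

    # Placeholder file list; in practice this is the unlabeled pre-training corpus.
    files = ["utt1.wav", "utt2.wav"]
    features = [mfcc_features(f) for f in files]

    # Fit k-means with 100 clusters (HuBERT's first-iteration setting) on all frames.
    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(np.vstack(features))

    # Frame-level pseudo-labels: one cluster id per 20 ms frame of each utterance.
    pseudo_labels = [kmeans.predict(f) for f in features]
    ```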

    Stage 2: Masked Prediction Learning
    The model architecture follows a familiar pattern (a PyTorch sketch follows this list):

    • CNN Encoder: Processes the raw waveform, downsampling it by a factor of 320 to produce 20 ms frame representations

    • BERT Encoder: Transformer-based architecture that processes masked sequences

    • Prediction Head: Projects transformer outputs to cluster vocabulary space
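
    A simplified PyTorch sketch of this pipeline is shown below. The convolutional strides multiply to 320 as described above, but the exact kernel sizes, layer counts, and dimensions are illustrative stand-ins rather than the published configuration.

    ```python
    import torch
    import torch.nn as nn

    class HuBERTLikeModel(nn.Module):
        """Simplified HuBERT-style model: CNN feature encoder -> Transformer -> cluster prediction head."""

        def __init__(self, dim=768, num_layers=12, num_clusters=100):
            super().__init__()
            # CNN encoder: strided 1-D convolutions whose strides multiply to 320,
            # so 16 kHz audio becomes one frame every 20 ms.
            kernels, strides = [10, 3, 3, 3, 3, 2, 2], [5, 2, 2, 2, 2, 2, 2]
            convs, in_ch = [], 1
            for k, s in zip(kernels, strides):
                convs += [nn.Conv1d(in_ch, 512, kernel_size=k, stride=s), nn.GELU()]
                in_ch = 512
            self.feature_encoder = nn.Sequential(*convs)
            self.proj = nn.Linear(512, dim)

            # BERT-style Transformer encoder over the (possibly masked) frame sequence.
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, dim_feedforward=3072,
                                               batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

            # Prediction head projecting each frame onto the k-means cluster vocabulary.
            self.head = nn.Linear(dim, num_clusters)

        def forward(self, waveform):                            # waveform: (batch, samples)
            x = self.feature_encoder(waveform.unsqueeze(1))     # (batch, 512, frames)
            x = self.proj(x.transpose(1, 2))                    # (batch, frames, dim)
            x = self.transformer(x)
            return self.head(x)                                 # (batch, frames, num_clusters)

    # Quick shape check with a small model and two 2-second dummy utterances.
    logits = HuBERTLikeModel(num_layers=2)(torch.randn(2, 32000))
    ```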

    The Training Process (a code sketch follows these steps):

    1. Randomly select ~8% of input frames as mask start indices and mask a span of ~10 frames from each

    2. Apply the prediction loss only over the masked regions (i.e., the masked-loss weight λ = 1), forcing the model to infer hidden units from context

    3. Use iterative refinement: after initial training, extract features from the learned model to generate better clusters for the next iteration
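
    To make steps 1 and 2 concrete, here is a rough sketch of span masking and a masked-only cross-entropy loss, assuming PyTorch. The ~8% start probability and 10-frame spans follow the description above; the learned mask embedding and the random tensors standing in for model outputs are purely illustrative.

    ```python
    import torch
    import torch.nn.functional as F

    def span_mask(num_frames, mask_prob=0.08, span=10):
        """Pick ~8% of frames as span starts and mask `span` consecutive frames from each start."""
        mask = torch.zeros(num_frames, dtype=torch.bool)
        starts = (torch.rand(num_frames) < mask_prob).nonzero().flatten().tolist()
        for s in starts:
            mask[s:s + span] = True
        return mask

    def masked_prediction_loss(logits, targets, mask):
        """Cross-entropy over masked frames only (the unmasked term gets zero weight)."""
        return F.cross_entropy(logits[mask], targets[mask])

    # Illustrative usage with random tensors standing in for real frame features and k-means labels.
    frames, dim, num_clusters = 200, 768, 100
    features = torch.randn(frames, dim)
    targets = torch.randint(0, num_clusters, (frames,))

    mask = span_mask(frames)
    mask_embedding = torch.nn.Parameter(torch.randn(dim))          # learned [MASK] embedding
    masked_features = torch.where(mask.unsqueeze(-1), mask_embedding, features)

    logits = torch.randn(frames, num_clusters)                     # stand-in for transformer + head output
    loss = masked_prediction_loss(logits, targets, mask)
    ```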

    Iterative Refinement Strategy

    HuBERT employs a sophisticated iterative approach:

    • Iteration 1: Use k-means on MFCC features (100 clusters)

    • Iteration 2+: Extract features from intermediate transformer layers of the previous model, apply k-means with more clusters (500+)

      This process gradually improves target quality while maintaining the consistency that enables effective learning.
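
    A minimal sketch of the re-clustering step might look like the following, assuming scikit-learn. The layer-6 features are represented here by random arrays, since extracting them from a real checkpoint is outside the scope of this snippet.

    ```python
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def refine_targets(layer_features, n_clusters=500):
        """Re-cluster hidden states taken from an intermediate transformer layer of the previous model.

        `layer_features` is a list of (num_frames, dim) arrays, one per utterance,
        e.g. the layer-6 outputs of the iteration-1 model.
        """
        kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=10_000, n_init=3, random_state=0)
        kmeans.fit(np.vstack(layer_features))
        return [kmeans.predict(f) for f in layer_features]

    # Illustrative call with random arrays standing in for real layer-6 features.
    fake_features = [np.random.randn(400, 768) for _ in range(4)]
    new_targets = refine_targets(fake_features)        # 500-cluster pseudo-labels for iteration 2
    ```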


    MS-HuBERT: Bridging Pre-training and Inference Gaps

    Identified Limitations

    MS-HuBERT addresses two critical limitations in the original HuBERT framework:

    1. Pre-training/Inference Mismatch: During training, models see masked inputs, but during inference, they process complete sequences.

    2. Underutilized Model Capacity: Using only final layer outputs for loss calculation doesn't fully leverage the model's representational power.

    The Swap Method

    Motivation: Inspired by computer vision techniques that use multiple views of the same input, the Swap method ensures the model learns consistent representations regardless of masking.

    Implementation (a code sketch follows this list):

    • Create two views of each input: masked and unmasked

    • Process both views through the transformer simultaneously

    • After each transformer layer, swap the embeddings at masked positions between the two views

    • This forces the model to generate identical representations for both views
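
    The core swap operation can be sketched in a few lines of PyTorch. The idea of exchanging hidden states at masked positions between the two views after a transformer layer follows the description above; the tensor shapes and names are assumptions.

    ```python
    import torch

    def swap_masked_positions(h_masked, h_unmasked, mask):
        """Exchange hidden states at masked positions between the masked and unmasked views.

        h_masked, h_unmasked: (batch, frames, dim) outputs of the same transformer layer
        mask:                 (batch, frames) boolean tensor marking masked positions
        """
        m = mask.unsqueeze(-1)                                   # (batch, frames, 1) for broadcasting
        swapped_masked = torch.where(m, h_unmasked, h_masked)    # masked view receives unmasked states
        swapped_unmasked = torch.where(m, h_masked, h_unmasked)  # unmasked view receives masked states
        return swapped_masked, swapped_unmasked

    # Illustrative usage with random tensors in place of real layer outputs;
    # in training this would run after every transformer layer.
    batch, frames, dim = 2, 200, 768
    h_m, h_u = torch.randn(batch, frames, dim), torch.randn(batch, frames, dim)
    mask = torch.rand(batch, frames) < 0.3
    h_m, h_u = swap_masked_positions(h_m, h_u, mask)
    ```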

    Multicluster Masked Prediction Loss

    Concept: Instead of computing loss only at the final layer, MS-HuBERT calculates masked prediction losses across multiple transformer layers using different cluster granularities.

    Implementation:

    • Generate multiple cluster sets with varying resolutions (e.g., 1000 → 500 → 250 → 125 → 50 → 25 clusters)

    • Apply different cluster sets to different transformer layers

    • Compute the combined loss

      $$\mathcal{L} = \sum_{(i,\,j)} \text{MPL}(\text{layer}_i, \text{clusters}_j)$$

      over the selected layer-cluster pairs

    • Randomly drop some cluster sets during training to manage memory constraints

    This approach enables the model to learn hierarchical representations, from fine-grained acoustic details in early layers to higher-level linguistic patterns in deeper layers.
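
    A rough sketch of this layer-wise loss is shown below, assuming PyTorch. The particular layer-to-cluster-set pairing, the drop probability, and the plain linear heads are illustrative choices, not the published configuration.

    ```python
    import random
    import torch
    import torch.nn.functional as F

    def multicluster_mpl(layer_outputs, heads, targets, mask, drop_prob=0.3):
        """Sum masked-prediction losses over (layer, cluster-set) pairs.

        layer_outputs: dict layer index -> (frames, dim) hidden states from that layer
        heads:         dict layer index -> linear head onto that layer's cluster vocabulary
        targets:       dict layer index -> (frames,) cluster ids at the matching resolution
        mask:          (frames,) boolean tensor of masked positions
        """
        loss = 0.0
        for i, h in layer_outputs.items():
            if random.random() < drop_prob:          # randomly drop a cluster set to save memory
                continue
            logits = heads[i](h)
            loss = loss + F.cross_entropy(logits[mask], targets[i][mask])
        return loss

    # Illustrative setup: three layers paired with progressively coarser cluster sets.
    # (In a real model the heads would live in an nn.ModuleDict attached to the network.)
    frames, dim = 200, 768
    resolutions = {4: 1000, 8: 250, 12: 50}          # layer index -> number of clusters
    layer_outputs = {i: torch.randn(frames, dim) for i in resolutions}
    heads = {i: torch.nn.Linear(dim, k) for i, k in resolutions.items()}
    targets = {i: torch.randint(0, k, (frames,)) for i, k in resolutions.items()}
    mask = torch.rand(frames) < 0.4

    loss = multicluster_mpl(layer_outputs, heads, targets, mask)
    ```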

    Performance Analysis and Results

    HuBERT Performance

    HuBERT demonstrated remarkable effectiveness across multiple model sizes and amounts of pre-training data:

    | Model Size | Parameters | LibriSpeech 960h | Libri-Light 60k hours |
    |------------|------------|------------------|-----------------------|
    | BASE       | 95M        | Competitive      | Strong performance    |
    | LARGE      | 317M       | SOTA             | Significant gains     |
    | X-LARGE    | 964M       | 19% relative WER reduction | 13% relative WER reduction |

    Key Insights:

    • Performance scales consistently with model size and unlabeled data

    • Particularly strong in low-resource scenarios (10 minutes to 100 hours of labeled data)

    • Matches or exceeds wav2vec 2.0 performance across all evaluation settings

    MS-HuBERT Improvements

    MS-HuBERT showed substantial improvements over vanilla HuBERT:

    | Fine-tuning Data | HuBERT (WER) | MS-HuBERT (WER) | Improvement    |
    |------------------|--------------|-----------------|----------------|
    | 1 hour           | 10.9         | 10.9            | Matched        |
    | 10 hours         | 9.0          | 8.5             | 5.6% relative  |
    | 100 hours        | 7.8          | 7.1             | 9.0% relative  |
    | 960 hours        | 5.9          | 5.1             | 13.6% relative |

    Notable Achievements:

    • Better utilization of model capacity, as evidenced by improved CCA similarity with word labels

    • Competitive with data2vec in high-resource settings while using discrete rather than continuous targets

    • Superior performance on SUPERB benchmark tasks, particularly phoneme recognition

    Why These Approaches Work

    The Power of Consistency in HuBERT

    The clustering-based approach succeeds because it provides stable targets that allow the model to focus on learning temporal dependencies. Even though k-means clusters don't perfectly align with true phonemes, their consistency enables the model to discover meaningful acoustic-linguistic patterns.

    Addressing the Masked Language Model Dilemma

    The Swap method in MS-HuBERT elegantly solves a fundamental problem: during pre-training, models learn to handle masked inputs, but during inference, they never see masks. By training on both masked and unmasked views simultaneously and enforcing consistency between them, the model learns representations that generalize better to inference conditions.

    Hierarchical Learning Through Multicluster MPL

    The multi-resolution clustering approach mimics how humans process speech - from low-level acoustic features to high-level semantic content. By applying different cluster granularities at different layers, the model learns a natural hierarchy of representations.

    Future Directions and Potential Improvements

    Architectural Enhancements

    Adaptive Masking Strategies: Current masking is random and uniform. Future work could explore:

    • Content-aware masking that targets linguistically meaningful units

    • Dynamic masking ratios based on input complexity

    • Hierarchical masking that respects prosodic boundaries

    Cross-Modal Integration: Incorporating visual information (lip reading) or text during pre-training could provide richer supervision signals.

    Training Methodology Improvements

    Advanced Clustering Techniques: Moving beyond k-means to:

    • Neural clustering methods that learn better acoustic unit boundaries

    • Hierarchical clustering that naturally provides multi-resolution targets

    • Contrastive clustering that explicitly separates confusable sounds

    Curriculum Learning: Gradually increasing task difficulty:

    • Start with longer masking spans, progressively reduce to phone-level granularity

    • Begin with clean speech, gradually introduce noise and different speaking styles

    Scaling and Efficiency

    Model Compression: Developing techniques to maintain performance while reducing computational requirements:

    • Knowledge distillation from large models to efficient student networks

    • Pruning strategies that preserve critical acoustic-linguistic representations

    Multilingual and Cross-Domain Adaptation:

    • Joint training across multiple languages to learn universal speech representations

    • Domain adaptation techniques for noisy, spontaneous, or accented speech

    Evaluation and Analysis

    Better Probing Tasks: Developing evaluation metrics that better capture:

    • Phonetic knowledge vs. linguistic understanding

    • Robustness to acoustic variations

    • Transfer learning capabilities

    Conclusion

    The evolution from HuBERT to MS-HuBERT represents a significant advancement in self-supervised speech representation learning. HuBERT's key insight about target consistency over correctness opened new possibilities for learning from unlabeled speech data. MS-HuBERT's innovations in addressing pre-training/inference mismatch and better capacity utilization demonstrate how careful analysis of model limitations can lead to meaningful improvements.

    These models have established a strong foundation for future research in speech representation learning. Their success stems from thoughtful adaptation of ideas from other domains (BERT's masked language modeling, computer vision's multi-view learning) to the unique challenges of speech processing.

    As the field continues to evolve, we can expect to see further innovations in clustering techniques, training strategies, and architectural designs that push the boundaries of what's possible with self-supervised speech learning. The ultimate goal remains clear: developing models that can learn rich, generalizable speech representations from minimal supervision, enabling better speech technologies for the diverse range of human languages and speaking conditions.

    The journey from HuBERT to MS-HuBERT illustrates how incremental but well-motivated improvements can yield substantial gains in model performance and our understanding of speech representation learning. This progression provides valuable lessons for researchers working on the next generation of self-supervised speech models.
