>94% Accuracy

200 Milliseconds to Prediction

30+ Languages Supported

As artificial intelligence continues to evolve at a rapid pace, so too does the sophistication of deepfake technology. Deepfakes, particularly in the form of highly realistic synthetic audio, pose an increasing threat to trust online as they become more difficult to distinguish from authentic content. To stay ahead of this challenge and provide our customers with the most advanced deepfake detection capabilities, Resemble AI is excited to introduce DETECT-2B, the latest generation of our industry-leading deepfake detection solution.

Building upon the strong foundation of our original Detect model, DETECT-2B represents a major leap forward in terms of model architecture, training data, and overall performance. The result is an extremely robust and accurate deepfake detection model that achieves a remarkable level of performance when evaluated against a massive dataset of real and fake audio clips. Let’s dive into the high-level details of how DETECT-2B works and what makes it so effective.

Model Architecture

At its core, DETECT-2B is an ensemble of multiple sub-models that leverage several key architectural components:

  1. Pre-trained self-supervised audio representation models
  2. Efficient fine-tuning techniques to adapt the pre-trained models for the deepfake detection task
  3. Advanced sequence modeling layers, including Mamba-SSM, used as the final classification stage

The sub-architectures consist of a frozen audio representation model with an adaptation module inserted into its key layers. This allows the adapter to learn to shift the model’s attention towards the subtle artifacts that distinguish real audio from fake, without retraining the entire model from scratch.
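
The sketch below illustrates this general pattern in PyTorch: a small bottleneck adapter with a residual connection is paired with a frozen backbone so that only the adapter weights receive gradients. The module names, hidden size, and layer count are illustrative assumptions; DETECT-2B’s actual adaptation modules are not public.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable module paired with a frozen layer (illustrative only)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection: the frozen layer's output passes through
        # unchanged, plus a small learned correction from the adapter.
        return hidden + self.up(self.act(self.down(hidden)))

def freeze_and_attach(backbone: nn.Module, num_layers: int, dim: int = 768) -> nn.ModuleList:
    """Freeze every backbone parameter and create one adapter per layer
    (the layer count and hidden size are hypothetical here)."""
    for p in backbone.parameters():
        p.requires_grad = False  # only the adapters will be trained
    return nn.ModuleList(BottleneckAdapter(dim) for _ in range(num_layers))
```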

Additional sub-architectures build on this foundation by incorporating powerful sequence modeling techniques as the final classification layer. By combining self-supervised audio representation learning, efficient fine-tuning, and advanced sequence modeling, these sub-architectures are able to effectively capture both low-level acoustic features as well as higher-level sequential patterns that are indicative of audio deepfakes.

These sub-models are trained on a large-scale dataset for a sufficient number of iterations using an optimized learning rate schedule. Once the individual sub-models have converged, they are combined into a single ensemble model using a sophisticated fusion approach. At inference time, this ensemble predicts a fakeness score for short time slices across the duration of an input audio clip. These scores are then aggregated and compared to a carefully tuned threshold to make the final real vs. fake classification for the full audio clip.
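
As a rough illustration of this inference flow, the following sketch fuses per-slice fakeness scores from several sub-models and applies a threshold to produce a clip-level decision. The mean fusion and the fixed 0.5 threshold are placeholders, not the tuned production fusion logic.

```python
import numpy as np

def classify_clip(slice_scores_per_model, threshold: float = 0.5) -> dict:
    """Fuse per-slice fakeness scores from several sub-models into a clip-level call.

    slice_scores_per_model: list of arrays, one per sub-model, each of shape
    (num_slices,) with values in [0, 1].
    """
    scores = np.stack(slice_scores_per_model)   # (num_models, num_slices)
    fused = scores.mean(axis=0)                 # simple mean fusion per slice
    clip_score = float(fused.mean())            # aggregate over the whole clip
    return {
        "per_slice": fused,
        "clip_score": clip_score,
        "label": "fake" if clip_score >= threshold else "real",
    }

# Example: three sub-models scoring a clip that was split into five slices
result = classify_clip([np.random.rand(5) for _ in range(3)])
print(result["label"], round(result["clip_score"], 3))
```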

One of the key advantages of DETECT-2B’s architecture is its parameter efficiency. By leveraging pre-trained components and efficient fine-tuning techniques, DETECT-2B is able to achieve state-of-the-art performance while still being relatively fast to train and lightweight to deploy.

DETECT-2B Result Analysis

DETECT-2B’s output is a granular frame-by-frame analysis of the audio stream, with a prediction for each frame indicating whether it is spoofed.

How DETECT-2B Uses Mamba-SSM

Mamba-SSM is an emerging state space model (SSM) architecture designed to enhance deepfake detection models by improving their sequence modeling capabilities. State space models aim to perform sequence modeling tasks like transformers but with greater efficiency, although more research is needed to determine whether they can outperform transformers in detection accuracy. Mamba-SSM is characterized by its unique approach to modeling temporal sequences and capturing intricate patterns within audio data.

At its core, Mamba-SSM leverages stochastic processes to model the state transitions within audio sequences. Traditional classifiers might look at static snapshots of data or rely on recurrent structures that process sequences in a somewhat deterministic manner. In contrast, Mamba-SSM introduces an element of stochasticity, allowing it to probabilistically transition between states based on observed data patterns. This stochastic approach provides several advantages:

  1. Enhanced Temporal Dynamics: By treating audio sequences as stochastic processes, Mamba-SSM can better capture the temporal dynamics of audio signals. This is crucial in deepfake detection, where subtle temporal inconsistencies often differentiate real from fake audio.
  2. Adaptive State Transitions: The stochastic nature of Mamba-SSM allows it to adaptively transition between states based on the observed audio features. This adaptability enables the classifier to more accurately track the evolution of audio signals over time, improving its ability to detect anomalies.
  3. Robustness to Variability: Mamba-SSM’s probabilistic framework makes it inherently robust to variations and noise in the audio data. This robustness is particularly beneficial in real-world scenarios where audio quality can vary significantly.
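
To make the state-space idea concrete, here is a deliberately simplified linear state-space recurrence in PyTorch. It is not the Mamba-SSM implementation, which uses selective, input-dependent dynamics and a hardware-aware scan; it only shows how a hidden state is carried across audio frames to produce per-frame scores.

```python
import torch
import torch.nn as nn

class TinySSM(nn.Module):
    """Minimal discrete state-space layer: h_t = A·h_{t-1} + B·x_t, y_t = C·h_t.
    Real Mamba layers use selective, input-dependent dynamics; this is only a toy."""
    def __init__(self, input_dim: int, state_dim: int):
        super().__init__()
        self.A = nn.Parameter(torch.eye(state_dim) * 0.9)                 # state transition
        self.B = nn.Parameter(torch.randn(state_dim, input_dim) * 0.01)   # input projection
        self.C = nn.Parameter(torch.randn(1, state_dim) * 0.01)           # readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, input_dim)
        h = torch.zeros(x.size(0), self.A.size(0), device=x.device)
        frame_logits = []
        for t in range(x.size(1)):                         # carry state across frames
            h = h @ self.A.T + x[:, t] @ self.B.T
            frame_logits.append(h @ self.C.T)              # one logit per frame
        return torch.cat(frame_logits, dim=1)              # (batch, time)

# Example: 2 clips, 50 frames of 768-dimensional features each
per_frame_logits = TinySSM(input_dim=768, state_dim=16)(torch.randn(2, 50, 768))
```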

The introduction of Mamba-SSM represents a significant advancement in the field of deepfake detection for several reasons:

1. Superior Detection of Subtle Artifacts

Deepfake audio often contains artifacts that are too subtle for traditional classifiers to detect reliably. These artifacts might manifest as slight variations in pitch, timing, or spectral properties that are not easily noticeable. Mamba-SSM’s stochastic state transitions allow it to capture these subtle differences with greater accuracy. By modeling the probabilistic relationships between audio frames, Mamba-SSM can identify inconsistencies that would otherwise go undetected.

2. Improved Generalization Across Languages and Accents

The ability of DETECT-2B to perform well across diverse languages is largely attributed to our extensive multi-lingual training data and the use of pre-trained models like Wav2Vec2. These components enable the system to learn language-agnostic features that are indicative of audio manipulation. While Mamba-SSM’s probabilistic modeling of temporal dynamics may contribute to robustness, the primary factor in DETECT-2B’s cross-lingual performance is likely the diversity of the training data.

3. Integration with Self-Supervised Learning

Mamba-SSM is designed to integrate seamlessly with self-supervised pre-trained models like Wav2Vec2. Self-supervised learning models have already demonstrated exceptional performance in various audio tasks by learning rich representations from large amounts of unlabeled data. When combined with Mamba-SSM, these models gain an additional layer of refinement, focusing more precisely on the artifacts that indicate deepfake audio. This integration results in a synergistic effect, enhancing the overall performance of the deepfake detection system.
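
As a sketch of this pairing, the example below uses the publicly available Wav2Vec2 checkpoint from the Hugging Face transformers library as a frozen feature extractor feeding a simple per-frame classification head. The linear head and the checkpoint name are placeholders for illustration; DETECT-2B’s actual sequence-model classifier is not public.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Frozen self-supervised backbone (a commonly used public checkpoint)
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
for p in backbone.parameters():
    p.requires_grad = False

# Placeholder per-frame head; DETECT-2B uses its own sequence-model classifier
head = nn.Linear(backbone.config.hidden_size, 1)

waveform = torch.randn(16000)  # one second of 16 kHz audio as a stand-in
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = backbone(**inputs).last_hidden_state       # (1, frames, hidden)
frame_scores = torch.sigmoid(head(features)).squeeze(-1)  # per-frame fakeness in [0, 1]
```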

4. Scalability and Efficiency

Despite its advanced capabilities, Mamba-SSM is designed to be computationally efficient. Its probabilistic framework can be scaled to handle large volumes of audio data with improved speed compared to alternative approaches. While more research is needed to fully characterize the trade-offs between speed and accuracy, the efficiency of the underlying models enables us to create powerful ensembles that deliver strong performance. This efficiency is essential for deploying deepfake detection systems in real-time applications, where quick decisions are critical.

Training and Evaluation Data

Of course, a model is only as good as the data it is trained and evaluated on. For DETECT-2B, we have curated an extensive and diverse dataset that includes a substantial amount of real and fake audio data generated using a variety of methods. The dataset covers a wide range of speakers across multiple languages to ensure robustness and generalization.

Crucially, we make sure to maintain a strict separation between the speakers used in the training and evaluation sets, to ensure that the model is truly learning to detect the artifacts of fake audio rather than overfitting to specific voices.
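
One common way to enforce such a speaker-disjoint split is to group clips by speaker ID, for example with scikit-learn’s GroupShuffleSplit; the file names, labels, and split ratio below are illustrative only.

```python
from sklearn.model_selection import GroupShuffleSplit

# Illustrative metadata: one entry per clip (labels: 0 = real, 1 = fake)
clips    = ["a.wav", "b.wav", "c.wav", "d.wav"]
labels   = [0, 1, 0, 1]
speakers = ["spk1", "spk1", "spk2", "spk3"]  # grouping key

# Each speaker falls entirely into train or entirely into eval, never both
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, eval_idx = next(splitter.split(clips, labels, groups=speakers))
assert not {speakers[i] for i in train_idx} & {speakers[i] for i in eval_idx}
```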

For model evaluation, we have put together a very large test set containing unseen speakers, deepfake generation methods, and languages. This includes data sourced from various academic datasets as well as internal data collected from diverse real-world sources. By testing on such a comprehensive dataset, we can be confident that DETECT-2B’s performance metrics reflect its ability to generalize to the types of deepfake audio it would encounter in the wild.

    DETECT-2B Language Analysis

    Performance Evaluation

    So how does DETECT-2B actually perform on this challenging test set? It achieves a low equal error rate (EER), the point at which the false positive rate (real audio clips incorrectly classified as fake) and the false negative rate (fake audio clips incorrectly classified as real) are equal. DETECT-2B correctly identifies the vast majority of deepfake audio clips while maintaining a very low false positive rate, a substantial improvement over our original Detect model.
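
    For reference, an EER can be computed from per-clip fakeness scores with scikit-learn’s ROC utilities, as in the sketch below; the labels and scores shown are synthetic placeholders, not DETECT-2B’s evaluation results.

    ```python
    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores) -> float:
        """EER: the operating point where false positive and false negative rates meet."""
        fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = fake, scores: fakeness
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fpr - fnr))    # index of the closest crossing point
        return float((fpr[idx] + fnr[idx]) / 2)

    # Synthetic example only
    labels = np.array([0, 0, 1, 1, 1, 0])
    scores = np.array([0.10, 0.40, 0.80, 0.90, 0.35, 0.20])
    print(f"EER = {equal_error_rate(labels, scores):.2%}")
    ```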

    Looking at the aggregate results, DETECT-2B exhibits consistently high accuracy across a variety of languages, including those not seen during training. This indicates that the model is learning to pick up on language-agnostic cues of audio manipulation.

    Similarly, when broken down by deepfake generation method, DETECT-2B performs well across the board, including on the latest synthetic audio approaches. Impressively, it achieves strong performance even on methods that were not represented in the training data, suggesting that the model is learning some fundamental features of synthetic audio rather than simply memorizing known fake audio patterns.

    DETECT-2B support for different deepfake generation models

    Integrating DETECT-2B

    For customers looking to integrate DETECT-2B into their audio processing pipelines, we offer a simple yet flexible API. Audio clips can be submitted for analysis either individually or in batches. The API will first preprocess the audio to ensure consistent format and quality. It then analyzes the audio using the DETECT-2B ensemble model, employing various optimizations for maximum efficiency.

    The raw fakeness scores for segments of the audio can be returned directly, or the API can aggregate them and apply the real/fake classification threshold to produce a single overall prediction for the clip. The threshold can be adjusted to trade off between false positives and false negatives depending on the use case.
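
    As a purely hypothetical illustration of this kind of integration, the sketch below submits a clip with the Python requests library. The endpoint URL, authentication header, parameter names, and response fields are assumptions made for illustration, not Resemble AI’s documented API; please consult the official API reference for the real interface.

    ```python
    import requests

    API_URL = "https://api.example.invalid/detect"  # placeholder, not a real endpoint
    API_KEY = "YOUR_API_KEY"                        # placeholder credential

    def detect_clip(path: str, threshold: float = 0.5) -> dict:
        """Submit a single audio clip and return a hypothetical detection payload."""
        with open(path, "rb") as audio_file:
            response = requests.post(
                API_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"audio": audio_file},
                data={"threshold": threshold},  # assumed parameter name
                timeout=30,
            )
        response.raise_for_status()
        # Assumed to contain per-segment fakeness scores and an overall label
        return response.json()

    # result = detect_clip("suspicious_clip.wav")
    ```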

    Additionally, we offer a web-based dashboard interface for customers who prefer a more visual, user-friendly way to interact with DETECT-2B. The dashboard allows users to easily upload audio files, view analysis results, and adjust settings without needing to work with the API directly.

    Future Work

    With the release of DETECT-2B, Resemble AI continues to push the boundaries of what is possible in deepfake detection. By leveraging state-of-the-art machine learning techniques and training on a vast, diverse dataset, DETECT-2B achieves unparalleled accuracy in identifying audio deepfakes.

    But our work is far from over. As generative AI capabilities continue to advance, so must our detection capabilities. We have several exciting research directions planned to further improve DETECT-2B, focusing on areas such as representation learning, advanced model architectures, and data expansion.

    We are committed to staying at the forefront of deepfake detection technology in order to help our customers navigate the increasingly complex landscape of AI-generated media. With tools like DETECT-2B in combination with our other innovative solutions, Resemble AI is providing the most comprehensive and effective suite for ensuring the integrity of audio content.

    If you’re interested in learning more about DETECT-2B or integrating it into your own application, please reach out to us for further discussion. Together, we can work towards a future where AI is used responsibly and transparently for the benefit of all.

    More FAQs

    What is DETECT-2B and how does it differ from previous deepfake detection models?

    DETECT-2B is the latest generation of Resemble AI’s deepfake detection solution. It represents a significant advancement over previous models, featuring:

    • An ensemble of multiple sub-models
    • Pre-trained self-supervised audio representation models
    • Efficient fine-tuning techniques
    • Advanced sequence modeling, including Mamba-SSM (State Space Models)
    • Greater parameter efficiency
    • Improved accuracy and performance across various languages and deepfake generation methods

    How does DETECT-2B work to identify deepfake audio?

    DETECT-2B works through several key steps:

    • It uses an ensemble of sub-models that analyze different aspects of the audio.
    • The model processes short time slices across the duration of an input audio clip.
    • It predicts a fakeness score for each slice.
    • These scores are aggregated and compared to a tuned threshold.
    • Based on this comparison, it makes a final real vs. fake classification for the full audio clip.
    • The model leverages pre-trained components and efficient fine-tuning to achieve high performance while remaining relatively fast and lightweight.

    What is Mamba-SSM and why is it important for deepfake detection?

    Mamba-SSM, an emerging state space model architecture used in DETECT-2B, enhances its sequence modeling capabilities. It’s important because:

    • It uses stochastic processes to model state transitions within audio sequences.
    • This approach allows for better capture of temporal dynamics in audio signals.
    • It enables adaptive state transitions based on observed audio features.
    • The probabilistic framework makes it robust to variations and noise in audio data.
    • It can detect subtle artifacts that traditional classifiers might miss.
    • It integrates well with self-supervised learning models like Wav2Vec2.

    How effective is DETECT-2B across different languages and accents?

    DETECT-2B demonstrates high effectiveness across various languages and accents:

    • It performs consistently well on a diverse range of languages, including those not seen during training.
    • This cross-lingual performance is primarily attributed to the extensive multi-lingual training data and the use of pre-trained models like Wav2Vec2.
    • The model learns language-agnostic features indicative of audio manipulation.
    • Its architecture, including Mamba-SSM, contributes to its robustness across different linguistic contexts.

    What kind of data was used to train and evaluate DETECT-2B?

    DETECT-2B was trained and evaluated on a comprehensive dataset:

    • The training data includes a large amount of real and fake audio generated using various methods.
    • It covers a wide range of speakers across multiple languages.
    • There’s a strict separation between speakers in the training and evaluation sets to prevent overfitting.
    • The evaluation dataset is very large and includes unseen speakers, deepfake generation methods, and languages.
    • It incorporates data from academic datasets and diverse real-world sources.
    • This extensive and diverse dataset ensures the model’s robustness and ability to generalize to real-world scenarios.

    How can customers integrate DETECT-2B into their own systems?

    Resemble AI offers two main ways for customers to integrate DETECT-2B:

    1. API Integration:
      • A flexible API for submitting audio clips individually or in batches
      • Options to receive raw fakeness scores or aggregated predictions
      • Adjustable classification thresholds to balance false positives and negatives
    2. Web-based Dashboard:
      • A user-friendly interface for those who prefer visual interaction
      • Allows easy upload of audio files, viewing of analysis results, and adjustment of settings
      • No direct API work required

    What are the future plans for improving DETECT-2B?

    Resemble AI has several research directions planned to further enhance DETECT-2B:

    • Focusing on advanced representation learning techniques
    • Exploring new model architectures
    • Expanding and diversifying the training data
    • Continuing to adapt to emerging deepfake generation methods
    • Improving efficiency and scalability for real-time applications
    • Enhancing cross-lingual and cross-accent performance
    • Researching ways to make the model more robust against adversarial attacks

    The goal is to stay ahead of advancements in generative AI and provide increasingly effective tools for ensuring audio content integrity.