Patent 12417756

Derivative works

Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.

Defensive Disclosure and Prior Art Generation for Real-Time Accent Mimicking

Publication Date: April 30, 2026
Subject Matter: Derivative works and extensions related to the technology disclosed in U.S. Patent 12,417,756. This document is intended to enter the public domain to serve as prior art for future inventions in the field of speech processing and voice modification.


Axis 1: Algorithmic and Architectural Substitution

Derivative 1.1: Adversarial Accent-Style Transfer Network

  • Enabling Description: This derivative replaces the distinct analysis and synthesis modules with a unified Generative Adversarial Network (GAN) architecture. The system comprises one generator and two discriminator networks. The generator (G) receives the second user's speech waveform (S2) and a target accent embedding vector (E1) extracted from the first user's speech. It outputs a modified waveform (S_mod). The first discriminator (D_accent) is trained to distinguish between S_mod and authentic speech from the first user (S1), forcing G to learn the accent features. The second discriminator (D_identity) is trained to distinguish the speaker identity of S_mod from that of the original speaker S2, ensuring that G preserves the natural voice characteristics. The loss function for G is a weighted sum of the adversarial losses from both discriminators, balancing accent accuracy against speaker preservation.

  • Mermaid Diagram:

    graph TD
        subgraph User 1
            S1[Speech Waveform] --> AE[Accent Encoder]
            AE --> E1[Accent Embedding Vector]
        end
    
        subgraph User 2
            S2[Speech Waveform] --> G[Generator]
            S2 --> DI[Speaker Identity Encoder]
            DI --> ID2[Identity Vector]
        end
    
        E1 --> G
        G --> S_mod[Modified Waveform]
    
        subgraph Training / Discrimination
            S_mod --> D_accent[Accent Discriminator]
            S1_samples[Real S1 Samples] --> D_accent
            D_accent --> L_accent[Accent Loss]
    
            S_mod --> D_identity[Identity Discriminator]
            S2_samples[Real S2 Samples] --> D_identity
            D_identity --> L_identity[Identity Loss]
        end
    
        L_accent --> G
        L_identity --> G
    
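The weighted dual-discriminator objective for G described above can be sketched in Python. This is a minimal numpy illustration, not the trained networks: the discriminator outputs passed in and the weights `w_accent`/`w_identity` are placeholder values, not figures from the disclosure.

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy for a scalar discriminator output in (0, 1)."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def generator_loss(d_accent_out, d_identity_out, w_accent=0.7, w_identity=0.3):
    """Weighted sum of the two adversarial losses for G.

    G is rewarded when D_accent mistakes S_mod for authentic first-user
    speech (target 1) and when D_identity still attributes S_mod to the
    original second speaker (target 1).
    """
    l_accent = bce(d_accent_out, 1.0)
    l_identity = bce(d_identity_out, 1.0)
    return w_accent * l_accent + w_identity * l_identity
```

The loss falls as both discriminators are fooled, which is the balance point between accent accuracy and speaker preservation.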

Derivative 1.2: End-to-End Flow-Based Waveform Generation

  • Enabling Description: This variation utilizes a non-autoregressive, flow-based deep learning model, analogous to VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), for direct waveform conversion. The system first extracts linguistic features (phonemes) from the second user's speech using an acoustic model. Simultaneously, a speaker encoder generates a speaker embedding vector. The target accent is represented by a separate accent embedding vector. These three inputs (phonemes, speaker embedding, accent embedding) are fed into a conditional variational autoencoder (VAE) with normalizing flows. The model learns to map the distribution of the second user's speech to the distribution of the first user's accent, conditioned on the linguistic content and speaker identity. The output is a modified waveform generated in a single pass, enabling faster-than-real-time synthesis.

  • Mermaid Diagram:

    sequenceDiagram
        participant S2 as Second User Speech
        participant ASR as ASR/Phoneme Extractor
        participant SE as Speaker Encoder
        participant AE as Accent Encoder (from User 1)
        participant VAE as Flow-Based VAE
        participant Vocoder
    
        S2->>ASR: Raw Audio
        ASR->>VAE: Phoneme Sequence (p)
        S2->>SE: Raw Audio
        SE->>VAE: Speaker Embedding (e_spk)
        AE->>VAE: Accent Embedding (e_acc)
    
        VAE->>Vocoder: Latent Representation (z_mod)
        Note over VAE: P(z|p, e_spk, e_acc)
        Vocoder->>S2: Modified Waveform
    
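The conditional flow at the heart of this derivative can be illustrated with a single affine coupling layer conditioned on the concatenated phoneme, speaker, and accent embeddings. The dimensions and random weights below are stand-ins for a trained model; the point demonstrated is that the mapping is exactly invertible, which is what permits single-pass, faster-than-real-time synthesis.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, COND = 4, 6          # half-dimension of the latent, conditioning size
W_s = rng.normal(size=(DIM + COND, DIM)) * 0.1   # stand-in trained weights
W_t = rng.normal(size=(DIM + COND, DIM)) * 0.1

def coupling_forward(z, cond):
    """One conditional affine coupling step: z -> x, exactly invertible."""
    z1, z2 = z[:DIM], z[DIM:]
    h = np.concatenate([z1, cond])
    s, t = np.tanh(h @ W_s), h @ W_t     # log-scale and shift from condition
    return np.concatenate([z1, z2 * np.exp(s) + t])

def coupling_inverse(x, cond):
    """Exact inverse of coupling_forward, used at synthesis time."""
    x1, x2 = x[:DIM], x[DIM:]
    h = np.concatenate([x1, cond])
    s, t = np.tanh(h @ W_s), h @ W_t
    return np.concatenate([x1, (x2 - t) * np.exp(-s)])

# Condition = phoneme + speaker + accent embeddings, concatenated.
cond = np.concatenate([rng.normal(size=2), rng.normal(size=2), rng.normal(size=2)])
z = rng.normal(size=2 * DIM)
x = coupling_forward(z, cond)
z_rec = coupling_inverse(x, cond)
```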

Axis 2: Operational Parameter Expansion

Derivative 2.1: Ultra-Low Latency Mimicking via Predictive Phoneme Framing

  • Enabling Description: To keep added processing latency under 10ms for applications like simultaneous interpretation (total mouth-to-ear delay then being dominated by the 20ms frame size), this system employs a predictive model. The feature extraction pipeline operates on 20ms audio frames. A lightweight LSTM (Long Short-Term Memory) network, running in parallel, analyzes the linguistic content of the incoming speech and predicts the most likely subsequent phoneme sequence for the next 40-60ms. While the current frame is being converted, the synthesis module pre-computes the acoustic features for the predicted phonemes based on the target accent. When the actual audio frames arrive, the system combines the pre-computed features with the real-time prosodic information (pitch, energy) from the user, drastically reducing the synthesis computation time per frame. This predictive buffering minimizes the perceived delay.

  • Mermaid Diagram:

    graph TD
        A[Audio Input Stream] --> B{"Frame Buffer (20ms)"};
        B --> C[Feature Extraction];
        B --> D[Linguistic Analysis];
        D --> E[Predictive LSTM];
        E --> F[Predicted Phoneme Buffer];
        F --> G{Pre-computation Module};
        C --> H{Accent Translation};
        H --> I[Feature Combination];
        G --> I;
        I --> J[Waveform Synthesis];
        J --> K[Audio Output Stream];
    
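The benefit of predictive pre-computation can be sketched with a toy frame loop. The per-phoneme costs and the predictor below are illustrative assumptions, not measured figures: a correct prediction replaces a full synthesis step with a cheap feature-combination step.

```python
# Hypothetical per-phoneme costs (ms) illustrating why prediction helps.
FULL_SYNTHESIS_MS = 8.0     # compute accent-translated features from scratch
COMBINE_ONLY_MS = 1.5       # merge pre-computed features with live prosody

def process_stream(actual_phonemes, predictor):
    """Simulate the frame loop: pre-compute features for predicted phonemes,
    then pay only the combine cost when a prediction was correct."""
    cache, total_ms = set(), 0.0
    for i, ph in enumerate(actual_phonemes):
        if ph in cache:
            total_ms += COMBINE_ONLY_MS      # prediction hit
        else:
            total_ms += FULL_SYNTHESIS_MS    # prediction miss
        cache = set(predictor(actual_phonemes[: i + 1]))  # look-ahead set
    return total_ms

# Toy stand-in for the LSTM: guesses a repeat of the last phoneme or "AH".
def oracle_predictor(history):
    return {history[-1], "AH"}

stream = ["S", "S", "AH", "AH", "T"]
```

With this stream the loop pays the full cost only on the two mispredicted phonemes, versus full cost on every phoneme without prediction.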

Derivative 2.2: Accent Mimicking for Hypersonic and Ultrasonic Acoustic Signals

  • Enabling Description: This system is designed for scientific and industrial analysis by applying the concept of "accent" to non-human audio signals. For hypersonic analysis, the system analyzes the acoustic signature of airflow over a vehicle traveling at Mach 5+ to establish a "nominal flight accent." It then analyzes real-time acoustic data from sensors on the vehicle, converting it to mimic the nominal accent. Deviations in the required transformation indicate changes in atmospheric conditions or structural integrity. For ultrasonic applications, it analyzes ultrasonic vocalizations from rodents in a lab. It establishes a "calm accent" (baseline) and converts real-time vocalizations to this baseline. The acoustic distance of the conversion quantifies the animal's stress level in response to stimuli.

  • Mermaid Diagram:

    stateDiagram-v2
        state "Hypersonic Application" as H {
            [*] --> Baseline: Capture nominal flight acoustic signature
            Baseline --> Monitoring: Real-time sensor data input
            Monitoring --> Monitoring: Analyze & transform signature to baseline
            state "Transformation Delta > Threshold" as Alert
            note right of Alert
                Indicates structural flutter
                or unexpected turbulence
            end note
            Monitoring --> Alert
            Alert --> [*]
        }
        state "Ultrasonic Application" as S {
            [*] --> S_Baseline: Record baseline rodent vocalizations (calm state)
            S_Baseline --> S_Monitoring: Monitor vocalizations after stimulus
            S_Monitoring --> S_Monitoring: Convert active vocalizations to calm baseline
            state "Acoustic Distance High" as Stress
            note right of Stress
                Quantifies stress level based on
                the degree of required conversion
            end note
            S_Monitoring --> Stress
            Stress --> [*]
        }
    
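The "acoustic distance of the conversion" for the vocalization case can be sketched as the distance between coarse spectral embeddings of the live signal and the calm baseline. The band layout, sample rate, and test frequencies below are illustrative assumptions, not parameters from the disclosure.

```python
import numpy as np

def band_energies(signal, n_bands=8):
    """Coarse spectral embedding: mean power in n_bands equal FFT bands."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(spec, n_bands)
    return np.array([b.mean() for b in bands])

def transformation_magnitude(signal, baseline_embedding):
    """Distance the converter must travel to reach the baseline 'accent'."""
    return float(np.linalg.norm(band_energies(signal) - baseline_embedding))

# Synthetic example: a calm 22 kHz call vs. a shifted, noisier 50 kHz call,
# sampled at 250 kHz to cover the ultrasonic range.
sr = 250_000
t = np.arange(0, 0.02, 1 / sr)
calm = np.sin(2 * np.pi * 22_000 * t)
stressed = np.sin(2 * np.pi * 50_000 * t) + 0.3 * np.random.default_rng(1).normal(size=t.size)
baseline = band_energies(calm)
```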

Axis 3: Cross-Domain Application

Derivative 3.1: Aerospace - ATC Accent Simulation for Pilot Training

  • Enabling Description: In a flight simulator, a text-to-speech engine generates standard Air Traffic Control (ATC) commands. These serve as the "second user speech" (in its base, accent-neutral form). The system stores a library of accent embeddings from real-world recordings of air traffic controllers in challenging airspaces (e.g., Guangzhou, Mexico City, Lagos). The training scenario selects a target accent ("first user accent"). The accent mimicking system modifies the standard TTS output to realistically replicate the chosen regional accent, including its unique cadence, phonology, and intonation, while preserving the clarity of the base TTS voice ("natural voice"). This exposes student pilots to realistic communication challenges in a safe environment.

  • Mermaid Diagram:

    flowchart LR
        subgraph Simulator Core
            A[Training Scenario] --> B{Select Target Airspace};
            B --> C[Load ATC Accent Embedding];
            A --> D[Generate ATC Command Text];
        end
        subgraph Accent Mimicking System
            D --> E[Standard TTS Engine];
            E --> F[Base Speech Output];
            C --> G[Accent/Prosody Modifier];
            F --> G;
            G --> H[Accented Speech Output];
        end
        H --> I[Cockpit Audio System];
    

Derivative 3.2: AgTech - Pathogenic Beehive Acoustics

  • Enabling Description: The system is used to detect diseases like Varroa mite infestation in beehives. A high-fidelity microphone records the collective buzzing frequency and pattern of a healthy hive, which is used to create an acoustic embedding for the "healthy hive accent." The system then monitors other hives. The buzzing from a monitored hive ("second user speech") is analyzed. The system modifies this buzzing to mimic the "healthy accent." The parameters of the transformation (e.g., required frequency shift, amplitude modulation) correlate with specific pathogenic stressors. A large transformation magnitude indicates a high probability of infestation, triggering an alert for the beekeeper. The "natural voice" preservation corresponds to maintaining the hive's unique baseline hum, distinguishing it from background noise.

  • Mermaid Diagram:

    sequenceDiagram
        participant Sensor as Hive Acoustic Sensor
        participant Analyzer as Accent Analyzer
        participant Transformer as Accent Transformer
        participant Dashboard as Beekeeper Dashboard
    
        Sensor->>Analyzer: Continuous Buzzing Audio (Hive B)
        note right of Analyzer: Pre-loaded with "Healthy Hive Accent" embedding (from Hive A)
        Analyzer->>Transformer: Buzzing Audio + Target Healthy Accent
        Transformer->>Transformer: Calculate Transformation Parameters
        Transformer->>Dashboard: Send Health Score (based on transform magnitude)
        alt Health Score < Threshold
            Dashboard->>Dashboard: Display "Hive B is Unhealthy"
        else
            Dashboard->>Dashboard: Display "Hive B is Healthy"
        end
    
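The "required frequency shift" transformation parameter for the hive case can be estimated by comparing FFT peak frequencies, as sketched below. The sample rate, hum frequencies, and tolerance are illustrative assumptions chosen so the example is self-checking.

```python
import numpy as np

SR = 8000  # Hz; hive hum energy sits in the low hundreds of Hz

def peak_frequency(signal, sr=SR):
    """Dominant frequency of the hive hum via the FFT magnitude peak."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, 1 / sr)
    return freqs[int(np.argmax(spec[1:])) + 1]   # skip the DC bin

def health_score(monitored, healthy_peak_hz, tolerance_hz=50.0):
    """1.0 when no frequency shift is needed to match the healthy 'accent';
    decays toward 0 as the required shift grows."""
    shift = abs(peak_frequency(monitored) - healthy_peak_hz)
    return max(0.0, 1.0 - shift / (2 * tolerance_hz))

t = np.arange(0, 1.0, 1 / SR)
healthy = np.sin(2 * np.pi * 230 * t)     # illustrative calm-colony hum
infested = np.sin(2 * np.pi * 300 * t)    # raised pitch under stress
healthy_peak = peak_frequency(healthy)
```

A score below a chosen threshold would trigger the "Hive B is Unhealthy" branch on the dashboard.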

Axis 4: Integration with Emerging Tech

Derivative 4.1: IoT and AI for Dynamic Acoustic Ambiance Matching

  • Enabling Description: In a vehicle or smart home, an array of IoT microphones constantly monitors the ambient conversation. An AI model determines the dominant accent and language of the occupants. When the user interacts with the voice assistant, this system modifies the assistant's standard response voice ("second user") to match the detected ambient accent ("first user"). This integration allows the AI assistant to seamlessly blend into the social environment. If the conversation switches accents (e.g., a new passenger joins the car), the IoT sensors trigger the AI to update the target accent embedding in real-time, ensuring the assistant's voice adapts dynamically.

  • Mermaid Diagram:

    graph TD
        A[IoT Mic Array] --> B(Real-time Audio Stream);
        B --> C{AI Ambient Accent Detection};
        C --> D[Target Accent Profile];
        E[User Query] --> F{Voice Assistant};
        F --> G[Standard TTS Response];
        D --> H(Accent Mimicking Module);
        G --> H;
        H --> I[Adapted TTS Response];
        I --> J[Speakers];
    
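The real-time target-accent update can be sketched as a tracker that smooths small drifts with an exponential moving average and hard-switches when the detected accent jumps (e.g., a new passenger). The embedding dimension, smoothing factor, and switch threshold below are illustrative assumptions.

```python
import numpy as np

class AmbientAccentTracker:
    """Keeps the assistant's target accent embedding in sync with the room."""

    def __init__(self, dim=4, alpha=0.1, switch_threshold=1.0):
        self.target = np.zeros(dim)        # current target accent embedding
        self.alpha = alpha                 # EMA smoothing factor
        self.switch_threshold = switch_threshold

    def update(self, detected_embedding):
        distance = np.linalg.norm(detected_embedding - self.target)
        if distance > self.switch_threshold:
            self.target = detected_embedding.copy()   # hard switch: new speaker
            return "switched"
        # small drift: blend toward the newly detected accent
        self.target = (1 - self.alpha) * self.target + self.alpha * detected_embedding
        return "smoothed"
```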

Derivative 4.2: Blockchain-Verified "Voice Skins" for the Metaverse

  • Enabling Description: A voice actor creates a unique vocal identity, including a specific accent, and registers it as a "Voice NFT" on a public blockchain (e.g., Ethereum). The NFT's metadata contains the trained accent embedding vector. A user in the metaverse who purchases or licenses this NFT can apply it to their own voice. When the user speaks ("second user"), the system pulls the accent embedding from the blockchain via a smart contract call. It then modifies the user's voice to mimic the NFT's accent ("first user") while preserving the user's own intonation and emotion ("natural voice"). The blockchain transaction ledger provides an immutable, auditable trail of who is authorized to use the voice skin, preventing digital voice impersonation.

  • Mermaid Diagram:

    classDiagram
    class User {
        +walletAddress
        +speak()
    }
    class AccentMimickingSystem {
        +applyVoiceSkin(audio, nftContractAddress)
    }
    class Blockchain {
        +getAccentEmbedding(nftContractAddress)
    }
    class VoiceNFT {
        <<SmartContract>>
        +ownerAddress
        +accentEmbeddingVector
    }
    User "1" -- "1" AccentMimickingSystem : Interacts with
    AccentMimickingSystem "1" -- "1" Blockchain : Queries
    Blockchain "1" -- "*" VoiceNFT : Manages
    
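The authorization rule the ledger enforces can be modeled with a mock registry. A real deployment would issue a read-only contract call (e.g., over Ethereum JSON-RPC) rather than the in-memory dictionary used here; the class and method names are purely illustrative.

```python
class VoiceNFTRegistry:
    """Mock of the on-chain registry: token id -> (owner, accent embedding)."""

    def __init__(self):
        self._tokens = {}

    def mint(self, token_id, owner, embedding):
        """Register a voice NFT with its trained accent embedding."""
        self._tokens[token_id] = {"owner": owner, "embedding": embedding}

    def get_accent_embedding(self, token_id, caller):
        """Return the embedding only to the current holder, mirroring the
        ownership check a smart contract would perform."""
        token = self._tokens[token_id]
        if token["owner"] != caller:
            raise PermissionError("caller does not hold this voice NFT")
        return token["embedding"]
```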

Axis 5: The "Inverse" or Failure Mode

Derivative 5.1: Accent Anonymization Filter

  • Enabling Description: This system operates in an inverse "anonymization" mode. It is designed for applications where accent may introduce bias (e.g., automated job screening, anonymous witness testimony). The system analyzes the user's speech, extracts the accent-specific features (phoneme pronunciation, prosody), and also extracts the core vocal identity features (pitch, timbre, formant structure). It then synthesizes a new speech signal using the user's vocal identity features but replaces the accent-specific features with those from a pre-defined, standardized "neutral" accent model (e.g., a generic newscaster accent). The result is speech that is clearly in the user's voice but stripped of any regional or socio-economic accent markers.

  • Mermaid Diagram:

    flowchart TD
        A[User Speech Input] --> B{Feature Splitter};
        B --> C[Accent Features];
        B --> D[Vocal Identity Features];
        E[Neutral Accent Model] --> F[Neutral Accent Features];
        C --> G{Feature Discard};
        D --> H{Speech Synthesizer};
        F --> H;
        H --> I[Anonymized Speech Output];
    
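The split-and-replace step can be sketched as below. The assumption that identity and accent cues occupy fixed, disjoint slices of one feature vector is purely illustrative; a real system would learn the disentanglement.

```python
import numpy as np

# Illustrative layout: the first half of the feature vector carries vocal
# identity (pitch, timbre, formants), the second half carries accent cues.
IDENTITY, ACCENT = slice(0, 4), slice(4, 8)

def anonymize(features, neutral_accent_features):
    """Keep the speaker's identity features; replace accent features with
    those of a standardized neutral accent model."""
    out = features.copy()
    out[ACCENT] = neutral_accent_features
    return out

user = np.array([1.0, 2.0, 3.0, 4.0,   0.9, 0.8, 0.7, 0.6])
neutral = np.zeros(4)   # stand-in "generic newscaster" accent features
anon = anonymize(user, neutral)
```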

Derivative 5.2: Graceful Degradation to Phonetic Subtitling

  • Enabling Description: This is a fail-safe mode for high-noise environments where accent conversion could produce unintelligible artifacts. The system continuously calculates a Signal-to-Noise Ratio (SNR) and a confidence score for its accent analysis. If the SNR drops below a pre-set threshold (e.g., 5dB) or the confidence score is low, the system disables audio synthesis entirely. Instead, it performs a real-time speech-to-text conversion of the user's speech. Crucially, it then uses its accent analysis module not to convert the audio, but to generate a phonetic or dialect-aware subtitle. For example, if it detects a Scottish accent saying "I cannae do it," the subtitle might read: "I cannae [can't] do it," providing the original dialect word and its standard equivalent for maximum clarity.

  • Mermaid Diagram:

    stateDiagram-v2
        [*] --> Monitoring
        Monitoring: SNR > 5dB and Confidence > 0.8
        Monitoring --> Accent_Conversion: Process Audio
        Accent_Conversion --> Monitoring: Output modified audio
    
        Monitoring --> Phonetic_Subtitling: SNR <= 5dB or Confidence <= 0.8
        note right of Phonetic_Subtitling
          1. Disable audio synthesis
          2. Perform STT
          3. Annotate text with phonetic/dialect hints
        end note
        Phonetic_Subtitling --> Monitoring: Output enhanced subtitles
    
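The gating logic and the dialect-gloss annotation can be sketched as follows. The thresholds mirror the figures in the description, while the gloss table is a toy stand-in for the accent analysis module's dialect knowledge.

```python
# Illustrative dialect lexicon; a production system would draw on a learned
# accent model rather than a lookup table.
DIALECT_GLOSSES = {"cannae": "can't", "didnae": "didn't", "ken": "know"}

def select_mode(snr_db, confidence, snr_floor=5.0, conf_floor=0.8):
    """Gate between full accent conversion and the subtitle fallback."""
    if snr_db > snr_floor and confidence > conf_floor:
        return "accent_conversion"
    return "phonetic_subtitling"

def annotate(transcript):
    """Append the standard gloss after each recognized dialect word."""
    words = []
    for w in transcript.split():
        gloss = DIALECT_GLOSSES.get(w.lower())
        words.append(f"{w} [{gloss}]" if gloss else w)
    return " ".join(words)
```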

Combination Prior Art with Open-Source Standards

  1. Combination with WebRTC and Insertable Streams: A system where the accent mimicking algorithm is compiled to WebAssembly (WASM) and deployed as a JavaScript library. In a peer-to-peer WebRTC video conference, the library uses the Insertable Streams for Media API to intercept the raw audio frames from a user's MediaStreamTrack. The WASM module performs the accent conversion in-browser, modifying the audio frames before they are passed to the RTCRtpSender for encryption and transmission to the remote peer. This enables client-side, real-time accent mimicking in any modern web application without server-side processing.

  2. Combination with the Kaldi Speech Recognition Toolkit: A method for improving the accuracy of accent mimicking by leveraging the detailed acoustic models and forced alignment capabilities of the open-source Kaldi toolkit. The second user's speech is first processed by a Kaldi model to generate a precise, time-aligned phoneme transcription. The accent translation module then uses this alignment to perform a more accurate phoneme-to-phoneme mapping and prosody transfer from the target accent, as it knows the exact start and end time of every sound in the source speech.

  3. Combination with Open-Source Voice Assistant Mycroft: An accent mimicking "skill" for the Mycroft open-source voice assistant. The skill allows a user to configure the assistant's voice personality. The user can have a short conversation with Mycroft ("first user speech") in their own accent. Mycroft's skill extracts the accent features and applies them to its own default TTS voice ("second user speech"). Thereafter, all of Mycroft's responses are delivered in its own voice but mimicking the user's regional accent, creating a personalized and localized user experience.
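The alignment-driven mapping in combination 2 can be sketched as below. The phoneme substitution table and the alignment tuples are illustrative; Kaldi itself would supply the time-aligned phonemes via forced alignment.

```python
# Time-aligned phonemes as produced by forced alignment: (phoneme, start_s,
# end_s). The mapping table is a toy stand-in for a learned accent-to-accent
# phoneme substitution model.
ACCENT_MAP = {"AE": "AA", "R": ""}   # e.g. vowel shift, non-rhotic /r/ drop

def convert_alignment(aligned):
    """Apply per-phoneme substitutions while preserving timings, so the
    synthesis stage can transfer prosody segment by segment."""
    out = []
    for ph, start, end in aligned:
        mapped = ACCENT_MAP.get(ph, ph)
        if mapped:                   # empty string marks a deleted phoneme
            out.append((mapped, start, end))
    return out

alignment = [("K", 0.00, 0.08), ("AE", 0.08, 0.20), ("R", 0.20, 0.31)]
```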
