Patent 12417756

Derivative works

Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.

Defensive Disclosure and Prior Art Generation for Real-Time Accent Mimicking

Publication Date: April 30, 2026
Subject Matter: Derivative works and extensions related to the technology disclosed in U.S. Patent 12,417,756. This document is intended to enter the public domain to serve as prior art for future inventions in the field of speech processing and voice modification.


Axis 1: Algorithmic and Architectural Substitution

Derivative 1.1: Adversarial Accent-Style Transfer Network

  • Enabling Description: This derivative replaces the distinct analysis and synthesis modules with a unified Generative Adversarial Network (GAN) architecture. The system comprises one generator and two discriminator networks. The generator (G) receives the second user's speech waveform (S2) and a target accent embedding vector (E1) extracted from the first user's speech. It outputs a modified waveform (S_mod). The first discriminator (D_accent) is trained to distinguish between S_mod and authentic speech from the first user (S1), forcing G to learn the accent features. The second discriminator (D_identity) is trained to distinguish the speaker identity of S_mod from that of the original speaker S2, ensuring that G preserves the natural voice characteristics. The loss function for G is a weighted sum of the adversarial losses from both discriminators, balancing accent accuracy against speaker preservation.

  • Mermaid Diagram:

    graph TD
        subgraph User 1
            S1[Speech Waveform] --> AE[Accent Encoder]
            AE --> E1[Accent Embedding Vector]
        end
    
        subgraph User 2
            S2[Speech Waveform] --> G[Generator]
            S2 --> DI[Speaker Identity Encoder]
            DI --> ID2[Identity Vector]
        end
    
        E1 --> G
        G --> S_mod[Modified Waveform]
    
        subgraph Training / Discrimination
            S_mod --> D_accent[Accent Discriminator]
            S1_samples[Real S1 Samples] --> D_accent
            D_accent --> L_accent[Accent Loss]
    
            S_mod --> D_identity[Identity Discriminator]
            S2_samples[Real S2 Samples] --> D_identity
            D_identity --> L_identity[Identity Loss]
        end
    
        L_accent --> G
        L_identity --> G
    
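The weighted dual-discriminator objective for G described above can be sketched in Python. This is a minimal numpy illustration, not the trained networks: the discriminator outputs passed in and the weights `w_accent`/`w_identity` are placeholder values, not figures from the disclosure.

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy for a scalar discriminator output in (0, 1)."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def generator_loss(d_accent_out, d_identity_out, w_accent=0.7, w_identity=0.3):
    """Weighted sum of the two adversarial losses for G.

    G is rewarded when D_accent mistakes S_mod for authentic first-user
    speech (target 1) and when D_identity still attributes S_mod to the
    original second speaker (target 1).
    """
    l_accent = bce(d_accent_out, 1.0)
    l_identity = bce(d_identity_out, 1.0)
    return w_accent * l_accent + w_identity * l_identity
```

The loss falls as both discriminators are fooled, which is the balance point between accent accuracy and speaker preservation.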

Derivative 1.2: End-to-End Flow-Based Waveform Generation

  • Enabling Description: This variation utilizes a non-autoregressive, flow-based deep learning model, analogous to VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), for direct waveform conversion. The system first extracts linguistic features (phonemes) from the second user's speech using an acoustic model. Simultaneously, a speaker encoder generates a speaker embedding vector. The target accent is represented by a separate accent embedding vector. These three inputs (phonemes, speaker embedding, accent embedding) are fed into a conditional variational autoencoder (VAE) with normalizing flows. The model learns to map the distribution of the second user's speech to the distribution of the first user's accent, conditioned on the linguistic content and speaker identity. The output is a modified waveform generated in a single pass, enabling faster-than-real-time synthesis.

  • Mermaid Diagram:

    sequenceDiagram
        participant S2 as Second User Speech
        participant ASR as ASR/Phoneme Extractor
        participant SE as Speaker Encoder
        participant AE as Accent Encoder (from User 1)
        participant VAE as Flow-Based VAE
        participant Vocoder
    
        S2->>ASR: Raw Audio
        ASR->>VAE: Phoneme Sequence (p)
        S2->>SE: Raw Audio
        SE->>VAE: Speaker Embedding (e_spk)
        AE->>VAE: Accent Embedding (e_acc)
    
        VAE->>Vocoder: Latent Representation (z_mod)
        Note over VAE: P(z|p, e_spk, e_acc)
        Vocoder->>S2: Modified Waveform
    
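The conditional flow at the heart of this derivative can be illustrated with a single affine coupling layer conditioned on the concatenated phoneme, speaker, and accent embeddings. The dimensions and random weights below are stand-ins for a trained model; the point demonstrated is that the mapping is exactly invertible, which is what permits single-pass, faster-than-real-time synthesis.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, COND = 4, 6          # half-dimension of the latent, conditioning size
W_s = rng.normal(size=(DIM + COND, DIM)) * 0.1   # stand-in trained weights
W_t = rng.normal(size=(DIM + COND, DIM)) * 0.1

def coupling_forward(z, cond):
    """One conditional affine coupling step: z -> x, exactly invertible."""
    z1, z2 = z[:DIM], z[DIM:]
    h = np.concatenate([z1, cond])
    s, t = np.tanh(h @ W_s), h @ W_t     # log-scale and shift from condition
    return np.concatenate([z1, z2 * np.exp(s) + t])

def coupling_inverse(x, cond):
    """Exact inverse of coupling_forward, used at synthesis time."""
    x1, x2 = x[:DIM], x[DIM:]
    h = np.concatenate([x1, cond])
    s, t = np.tanh(h @ W_s), h @ W_t
    return np.concatenate([x1, (x2 - t) * np.exp(-s)])

# Condition = phoneme + speaker + accent embeddings, concatenated.
cond = np.concatenate([rng.normal(size=2), rng.normal(size=2), rng.normal(size=2)])
z = rng.normal(size=2 * DIM)
x = coupling_forward(z, cond)
z_rec = coupling_inverse(x, cond)
```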

Axis 2: Operational Parameter Expansion

Derivative 2.1: Ultra-Low Latency Mimicking via Predictive Phoneme Framing

  • Enabling Description: To keep added processing latency under 10ms for applications like simultaneous interpretation (total mouth-to-ear delay then being dominated by the 20ms frame size), this system employs a predictive model. The feature extraction pipeline operates on 20ms audio frames. A lightweight LSTM (Long Short-Term Memory) network, running in parallel, analyzes the linguistic content of the incoming speech and predicts the most likely subsequent phoneme sequence for the next 40-60ms. While the current frame is being converted, the synthesis module pre-computes the acoustic features for the predicted phonemes based on the target accent. When the actual audio frames arrive, the system combines the pre-computed features with the real-time prosodic information (pitch, energy) from the user, drastically reducing the synthesis computation time per frame. This predictive buffering minimizes the perceived delay.

  • Mermaid Diagram:

    graph TD
        A[Audio Input Stream] --> B{"Frame Buffer (20ms)"};
        B --> C[Feature Extraction];
        B --> D[Linguistic Analysis];
        D --> E[Predictive LSTM];
        E --> F[Predicted Phoneme Buffer];
        F --> G{Pre-computation Module};
        C --> H{Accent Translation};
        H --> I[Feature Combination];
        G --> I;
        I --> J[Waveform Synthesis];
        J --> K[Audio Output Stream];
    
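The benefit of predictive pre-computation can be sketched with a toy frame loop. The per-phoneme costs and the predictor below are illustrative assumptions, not measured figures: a correct prediction replaces a full synthesis step with a cheap feature-combination step.

```python
# Hypothetical per-phoneme costs (ms) illustrating why prediction helps.
FULL_SYNTHESIS_MS = 8.0     # compute accent-translated features from scratch
COMBINE_ONLY_MS = 1.5       # merge pre-computed features with live prosody

def process_stream(actual_phonemes, predictor):
    """Simulate the frame loop: pre-compute features for predicted phonemes,
    then pay only the combine cost when a prediction was correct."""
    cache, total_ms = set(), 0.0
    for i, ph in enumerate(actual_phonemes):
        if ph in cache:
            total_ms += COMBINE_ONLY_MS      # prediction hit
        else:
            total_ms += FULL_SYNTHESIS_MS    # prediction miss
        cache = set(predictor(actual_phonemes[: i + 1]))  # look-ahead set
    return total_ms

# Toy stand-in for the LSTM: guesses a repeat of the last phoneme or "AH".
def oracle_predictor(history):
    return {history[-1], "AH"}

stream = ["S", "S", "AH", "AH", "T"]
```

With this stream the loop pays the full cost only on the two mispredicted phonemes, versus full cost on every phoneme without prediction.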

Derivative 2.2: Accent Mimicking for Hypersonic and Ultrasonic Acoustic Signals

  • Enabling Description: This system is designed for scientific and industrial analysis by applying the concept of "accent" to non-human audio signals. For hypersonic analysis, the system analyzes the acoustic signature of airflow over a vehicle traveling at Mach 5+ to establish a "nominal flight accent." It then analyzes real-time acoustic data from sensors on the vehicle, converting it to mimic the nominal accent. Deviations in the required transformation indicate changes in atmospheric conditions or structural integrity. For ultrasonic applications, it analyzes ultrasonic vocalizations from rodents in a lab. It establishes a "calm accent" (baseline) and converts real-time vocalizations to this baseline. The acoustic distance of the conversion quantifies the animal's stress level in response to stimuli.

  • Mermaid Diagram:

    stateDiagram-v2
        state "Hypersonic Application" as H {
            [*] --> Baseline: Capture nominal flight acoustic signature
            Baseline --> Monitoring: Real-time sensor data input
            Monitoring --> Monitoring: Analyze & transform signature to baseline
            state "Transformation Delta > Threshold" as Alert
            note right of Alert
                Indicates structural flutter
                or unexpected turbulence
            end note
            Monitoring --> Alert
            Alert --> [*]
        }
        state "Ultrasonic Application" as S {
            [*] --> S_Baseline: Record baseline rodent vocalizations (calm state)
            S_Baseline --> S_Monitoring: Monitor vocalizations after stimulus
            S_Monitoring --> S_Monitoring: Convert active vocalizations to calm baseline
            state "Acoustic Distance High" as Stress
            note right of Stress
                Quantifies stress level based on
                the degree of required conversion
            end note
            S_Monitoring --> Stress
            Stress --> [*]
        }
    
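The "acoustic distance of the conversion" for the vocalization case can be sketched as the distance between coarse spectral embeddings of the live signal and the calm baseline. The band layout, sample rate, and test frequencies below are illustrative assumptions, not parameters from the disclosure.

```python
import numpy as np

def band_energies(signal, n_bands=8):
    """Coarse spectral embedding: mean power in n_bands equal FFT bands."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(spec, n_bands)
    return np.array([b.mean() for b in bands])

def transformation_magnitude(signal, baseline_embedding):
    """Distance the converter must travel to reach the baseline 'accent'."""
    return float(np.linalg.norm(band_energies(signal) - baseline_embedding))

# Synthetic example: a calm 22 kHz call vs. a shifted, noisier 50 kHz call,
# sampled at 250 kHz to cover the ultrasonic range.
sr = 250_000
t = np.arange(0, 0.02, 1 / sr)
calm = np.sin(2 * np.pi * 22_000 * t)
stressed = np.sin(2 * np.pi * 50_000 * t) + 0.3 * np.random.default_rng(1).normal(size=t.size)
baseline = band_energies(calm)
```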

Axis 3: Cross-Domain Application

Derivative 3.1: Aerospace - ATC Accent Simulation for Pilot Training

  • Enabling Description: In a flight simulator, a text-to-speech engine generates standard Air Traffic Control (ATC) commands. These serve as the "second user speech" (in its base, accent-neutral form). The system stores a library of accent embeddings from real-world recordings of air traffic controllers in challenging airspaces (e.g., Guangzhou, Mexico City, Lagos). The training scenario selects a target accent ("first user accent"). The accent mimicking system modifies the standard TTS output to realistically replicate the chosen regional accent, including its unique cadence, phonology, and intonation, while preserving the clarity of the base TTS voice ("natural voice"). This exposes student pilots to realistic communication challenges in a safe environment.

  • Mermaid Diagram:

    flowchart LR
        subgraph Simulator Core
            A[Training Scenario] --> B{Select Target Airspace};
            B --> C[Load ATC Accent Embedding];
            A --> D[Generate ATC Command Text];
        end
        subgraph Accent Mimicking System
            D --> E[Standard TTS Engine];
            E --> F[Base Speech Output];
            C --> G[Accent/Prosody Modifier];
            F --> G;
            G --> H[Accented Speech Output];
        end
        H --> I[Cockpit Audio System];
    

Derivative 3.2: AgTech - Pathogenic Beehive Acoustics

  • Enabling Description: The system is used to detect diseases like Varroa mite infestation in beehives. A high-fidelity microphone records the collective buzzing frequency and pattern of a healthy hive, which is used to create an acoustic embedding for the "healthy hive accent." The system then monitors other hives. The buzzing from a monitored hive ("second user speech") is analyzed. The system modifies this buzzing to mimic the "healthy accent." The parameters of the transformation (e.g., required frequency shift, amplitude modulation) correlate with specific pathogenic stressors. A large transformation magnitude indicates a high probability of infestation, triggering an alert for the beekeeper. The "natural voice" preservation corresponds to maintaining the hive's unique baseline hum, distinguishing it from background noise.

  • Mermaid Diagram:

    sequenceDiagram
        participant Sensor as Hive Acoustic Sensor
        participant Analyzer as Accent Analyzer
        participant Transformer as Accent Transformer
        participant Dashboard as Beekeeper Dashboard
    
        Sensor->>Analyzer: Continuous Buzzing Audio (Hive B)
        note right of Analyzer: Pre-loaded with "Healthy Hive Accent" embedding (from Hive A)
        Analyzer->>Transformer: Buzzing Audio + Target Healthy Accent
        Transformer->>Transformer: Calculate Transformation Parameters
        Transformer->>Dashboard: Send Health Score (based on transform magnitude)
        alt Health Score < Threshold
            Dashboard->>Dashboard: Display "Hive B is Unhealthy"
        else
            Dashboard->>Dashboard: Display "Hive B is Healthy"
        end
    
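The "required frequency shift" transformation parameter for the hive case can be estimated by comparing FFT peak frequencies, as sketched below. The sample rate, hum frequencies, and tolerance are illustrative assumptions chosen so the example is self-checking.

```python
import numpy as np

SR = 8000  # Hz; hive hum energy sits in the low hundreds of Hz

def peak_frequency(signal, sr=SR):
    """Dominant frequency of the hive hum via the FFT magnitude peak."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, 1 / sr)
    return freqs[int(np.argmax(spec[1:])) + 1]   # skip the DC bin

def health_score(monitored, healthy_peak_hz, tolerance_hz=50.0):
    """1.0 when no frequency shift is needed to match the healthy 'accent';
    decays toward 0 as the required shift grows."""
    shift = abs(peak_frequency(monitored) - healthy_peak_hz)
    return max(0.0, 1.0 - shift / (2 * tolerance_hz))

t = np.arange(0, 1.0, 1 / SR)
healthy = np.sin(2 * np.pi * 230 * t)     # illustrative calm-colony hum
infested = np.sin(2 * np.pi * 300 * t)    # raised pitch under stress
healthy_peak = peak_frequency(healthy)
```

A score below a chosen threshold would trigger the "Hive B is Unhealthy" branch on the dashboard.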

Axis 4: Integration with Emerging Tech

Derivative 4.1: IoT and AI for Dynamic Acoustic Ambiance Matching

  • Enabling Description: In a vehicle or smart home, an array of IoT microphones constantly monitors the ambient conversation. An AI model determines the dominant accent and language of the occupants. When the user interacts with the voice assistant, this system modifies the assistant's standard response voice ("second user") to match the detected ambient accent ("first user"). This integration allows the AI assistant to seamlessly blend into the social environment. If the conversation switches accents (e.g., a new passenger joins the car), the IoT sensors trigger the AI to update the target accent embedding in real-time, ensuring the assistant's voice adapts dynamically.

  • Mermaid Diagram:

    graph TD
        A[IoT Mic Array] --> B(Real-time Audio Stream);
        B --> C{AI Ambient Accent Detection};
        C --> D[Target Accent Profile];
        E[User Query] --> F{Voice Assistant};
        F --> G[Standard TTS Response];
        D --> H(Accent Mimicking Module);
        G --> H;
        H --> I[Adapted TTS Response];
        I --> J[Speakers];
    
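The real-time target-accent update can be sketched as a tracker that smooths small drifts with an exponential moving average and hard-switches when the detected accent jumps (e.g., a new passenger). The embedding dimension, smoothing factor, and switch threshold below are illustrative assumptions.

```python
import numpy as np

class AmbientAccentTracker:
    """Keeps the assistant's target accent embedding in sync with the room."""

    def __init__(self, dim=4, alpha=0.1, switch_threshold=1.0):
        self.target = np.zeros(dim)        # current target accent embedding
        self.alpha = alpha                 # EMA smoothing factor
        self.switch_threshold = switch_threshold

    def update(self, detected_embedding):
        distance = np.linalg.norm(detected_embedding - self.target)
        if distance > self.switch_threshold:
            self.target = detected_embedding.copy()   # hard switch: new speaker
            return "switched"
        # small drift: blend toward the newly detected accent
        self.target = (1 - self.alpha) * self.target + self.alpha * detected_embedding
        return "smoothed"
```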

Derivative 4.2: Blockchain-Verified "Voice Skins" for the Metaverse

  • Enabling Description: A voice actor creates a unique vocal identity, including a specific accent, and registers it as a "Voice NFT" on a public blockchain (e.g., Ethereum). The NFT's metadata contains the trained accent embedding vector. A user in the metaverse who purchases or licenses this NFT can apply it to their own voice. When the user speaks ("second user"), the system pulls the accent embedding from the blockchain via a smart contract call. It then modifies the user's voice to mimic the NFT's accent ("first user") while preserving the user's own intonation and emotion ("natural voice"). The blockchain transaction ledger provides an immutable, auditable trail of who is authorized to use the voice skin, preventing digital voice impersonation.

  • Mermaid Diagram:

    classDiagram
    class User {
        +walletAddress
        +speak()
    }
    class AccentMimickingSystem {
        +applyVoiceSkin(audio, nftContractAddress)
    }
    class Blockchain {
        +getAccentEmbedding(nftContractAddress)
    }
    class VoiceNFT {
        <<SmartContract>>
        +ownerAddress
        +accentEmbeddingVector
    }
    User "1" -- "1" AccentMimickingSystem : Interacts with
    AccentMimickingSystem "1" -- "1" Blockchain : Queries
    Blockchain "1" -- "*" VoiceNFT : Manages
    
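The authorization rule the ledger enforces can be modeled with a mock registry. A real deployment would issue a read-only contract call (e.g., over Ethereum JSON-RPC) rather than the in-memory dictionary used here; the class and method names are purely illustrative.

```python
class VoiceNFTRegistry:
    """Mock of the on-chain registry: token id -> (owner, accent embedding)."""

    def __init__(self):
        self._tokens = {}

    def mint(self, token_id, owner, embedding):
        """Register a voice NFT with its trained accent embedding."""
        self._tokens[token_id] = {"owner": owner, "embedding": embedding}

    def get_accent_embedding(self, token_id, caller):
        """Return the embedding only to the current holder, mirroring the
        ownership check a smart contract would perform."""
        token = self._tokens[token_id]
        if token["owner"] != caller:
            raise PermissionError("caller does not hold this voice NFT")
        return token["embedding"]
```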

Axis 5: The "Inverse" or Failure Mode

Derivative 5.1: Accent Anonymization Filter

  • Enabling Description: This system operates in an inverse "anonymization" mode. It is designed for applications where accent may introduce bias (e.g., automated job screening, anonymous witness testimony). The system analyzes the user's speech, extracts the accent-specific features (phoneme pronunciation, prosody), and also extracts the core vocal identity features (pitch, timbre, formant structure). It then synthesizes a new speech signal using the user's vocal identity features but replaces the accent-specific features with those from a pre-defined, standardized "neutral" accent model (e.g., a generic newscaster accent). The result is speech that is clearly in the user's voice but stripped of any regional or socio-economic accent markers.

  • Mermaid Diagram:

    flowchart TD
        A[User Speech Input] --> B{Feature Splitter};
        B --> C[Accent Features];
        B --> D[Vocal Identity Features];
        E[Neutral Accent Model] --> F[Neutral Accent Features];
        C --> G{Feature Discard};
        D --> H{Speech Synthesizer};
        F --> H;
        H --> I[Anonymized Speech Output];
    
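The split-and-replace step can be sketched as below. The assumption that identity and accent cues occupy fixed, disjoint slices of one feature vector is purely illustrative; a real system would learn the disentanglement.

```python
import numpy as np

# Illustrative layout: the first half of the feature vector carries vocal
# identity (pitch, timbre, formants), the second half carries accent cues.
IDENTITY, ACCENT = slice(0, 4), slice(4, 8)

def anonymize(features, neutral_accent_features):
    """Keep the speaker's identity features; replace accent features with
    those of a standardized neutral accent model."""
    out = features.copy()
    out[ACCENT] = neutral_accent_features
    return out

user = np.array([1.0, 2.0, 3.0, 4.0,   0.9, 0.8, 0.7, 0.6])
neutral = np.zeros(4)   # stand-in "generic newscaster" accent features
anon = anonymize(user, neutral)
```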

Derivative 5.2: Graceful Degradation to Phonetic Subtitling

  • Enabling Description: This is a fail-safe mode for high-noise environments where accent conversion could produce unintelligible artifacts. The system continuously calculates a Signal-to-Noise Ratio (SNR) and a confidence score for its accent analysis. If the SNR drops below a pre-set threshold (e.g., 5dB) or the confidence score is low, the system disables audio synthesis entirely. Instead, it performs a real-time speech-to-text conversion of the user's speech. Crucially, it then uses its accent analysis module not to convert the audio, but to generate a phonetic or dialect-aware subtitle. For example, if it detects a Scottish accent saying "I cannae do it," the subtitle might read: "I cannae [can't] do it," providing the original dialect word and its standard equivalent for maximum clarity.

  • Mermaid Diagram:

    stateDiagram-v2
        [*] --> Monitoring
        Monitoring: SNR > 5dB and Confidence > 0.8
        Monitoring --> Accent_Conversion: Process Audio
        Accent_Conversion --> Monitoring: Output modified audio
    
        Monitoring --> Phonetic_Subtitling: SNR <= 5dB or Confidence <= 0.8
        note right of Phonetic_Subtitling
          1. Disable audio synthesis
          2. Perform STT
          3. Annotate text with phonetic/dialect hints
        end note
        Phonetic_Subtitling --> Monitoring: Output enhanced subtitles
    
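The gating logic and the dialect-gloss annotation can be sketched as follows. The thresholds mirror the figures in the description, while the gloss table is a toy stand-in for the accent analysis module's dialect knowledge.

```python
# Illustrative dialect lexicon; a production system would draw on a learned
# accent model rather than a lookup table.
DIALECT_GLOSSES = {"cannae": "can't", "didnae": "didn't", "ken": "know"}

def select_mode(snr_db, confidence, snr_floor=5.0, conf_floor=0.8):
    """Gate between full accent conversion and the subtitle fallback."""
    if snr_db > snr_floor and confidence > conf_floor:
        return "accent_conversion"
    return "phonetic_subtitling"

def annotate(transcript):
    """Append the standard gloss after each recognized dialect word."""
    words = []
    for w in transcript.split():
        gloss = DIALECT_GLOSSES.get(w.lower())
        words.append(f"{w} [{gloss}]" if gloss else w)
    return " ".join(words)
```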

Combination Prior Art with Open-Source Standards

  1. Combination with WebRTC and Insertable Streams: A system where the accent mimicking algorithm is compiled to WebAssembly (WASM) and deployed as a JavaScript library. In a peer-to-peer WebRTC video conference, the library uses the Insertable Streams for Media API to intercept the raw audio frames from a user's MediaStreamTrack. The WASM module performs the accent conversion in-browser, modifying the audio frames before they are passed to the RTCRtpSender for encryption and transmission to the remote peer. This enables client-side, real-time accent mimicking in any modern web application without server-side processing.

  2. Combination with the Kaldi Speech Recognition Toolkit: A method for improving the accuracy of accent mimicking by leveraging the detailed acoustic models and forced alignment capabilities of the open-source Kaldi toolkit. The second user's speech is first processed by a Kaldi model to generate a precise, time-aligned phoneme transcription. The accent translation module then uses this alignment to perform a more accurate phoneme-to-phoneme mapping and prosody transfer from the target accent, as it knows the exact start and end time of every sound in the source speech.

  3. Combination with Open-Source Voice Assistant Mycroft: An accent mimicking "skill" for the Mycroft open-source voice assistant. The skill allows a user to configure the assistant's voice personality. The user can have a short conversation with Mycroft ("first user speech") in their own accent. Mycroft's skill extracts the accent features and applies them to its own default TTS voice ("second user speech"). Thereafter, all of Mycroft's responses are delivered in its own voice but mimicking the user's regional accent, creating a personalized and localized user experience.
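The alignment-driven mapping in combination 2 can be sketched as below. The phoneme substitution table and the alignment tuples are illustrative; Kaldi itself would supply the time-aligned phonemes via forced alignment.

```python
# Time-aligned phonemes as produced by forced alignment: (phoneme, start_s,
# end_s). The mapping table is a toy stand-in for a learned accent-to-accent
# phoneme substitution model.
ACCENT_MAP = {"AE": "AA", "R": ""}   # e.g. vowel shift, non-rhotic /r/ drop

def convert_alignment(aligned):
    """Apply per-phoneme substitutions while preserving timings, so the
    synthesis stage can transfer prosody segment by segment."""
    out = []
    for ph, start, end in aligned:
        mapped = ACCENT_MAP.get(ph, ph)
        if mapped:                   # empty string marks a deleted phoneme
            out.append((mapped, start, end))
    return out

alignment = [("K", 0.00, 0.08), ("AE", 0.08, 0.20), ("R", 0.20, 0.31)]
```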
