Patent 12417756
Derivative works
Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.
Defensive Disclosure and Prior Art Generation for Real-Time Accent Mimicking
Publication Date: April 30, 2026
Subject Matter: Derivative works and extensions related to the technology disclosed in U.S. Patent 12,417,756. This document is intended to enter the public domain to serve as prior art for future inventions in the field of speech processing and voice modification.
Axis 1: Algorithmic and Architectural Substitution
Derivative 1.1: Adversarial Accent-Style Transfer Network
Enabling Description: This derivative replaces the distinct analysis and synthesis modules with a unified Generative Adversarial Network (GAN) architecture. The system comprises one generator and two discriminator networks. The generator (G) receives the second user's speech waveform (S2) and a target accent embedding vector (E1) extracted from the first user's speech. It outputs a modified waveform (S_mod). The first discriminator (D_accent) is trained to distinguish between S_mod and authentic speech from the first user (S1), forcing G to learn the accent features. The second discriminator (D_identity) is trained to distinguish the speaker identity of S_mod from the original speaker S2, ensuring that G preserves the natural voice characteristics. The loss function for G is a weighted sum of the adversarial losses from both discriminators, ensuring a balance between accent accuracy and speaker preservation.
Mermaid Diagram:
graph TD
  subgraph User 1
    S1[Speech Waveform] --> AE[Accent Encoder]
    AE --> E1[Accent Embedding Vector]
  end
  subgraph User 2
    S2[Speech Waveform] --> G[Generator]
    S2 --> DI[Speaker Identity Encoder]
    DI --> ID2[Identity Vector]
  end
  E1 --> G
  G --> S_mod[Modified Waveform]
  subgraph Training / Discrimination
    S_mod --> D_accent[Accent Discriminator]
    S1_samples[Real S1 Samples] --> D_accent
    D_accent --> L_accent[Accent Loss]
    S_mod --> D_identity[Identity Discriminator]
    S2_samples[Real S2 Samples] --> D_identity
    D_identity --> L_identity[Identity Loss]
  end
  L_accent --> G
  L_identity --> G
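Illustrative Code Sketch (Python): A minimal sketch of the generator objective described above, assuming mel-spectrogram frames as the acoustic features and simple fully connected networks; the layer sizes, binary cross-entropy adversarial losses, and loss weights are illustrative assumptions, not the disclosed implementation.

# Minimal sketch (not the disclosed implementation): a toy PyTorch training
# step for a generator conditioned on an accent embedding, trained against an
# accent discriminator and an identity discriminator. Sizes and weights are
# illustrative assumptions.
import torch
import torch.nn as nn

FEAT = 80        # e.g. mel-spectrogram bins per frame (assumption)
ACCENT_DIM = 64

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT + ACCENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, FEAT))
    def forward(self, s2_frames, e1):
        # Condition every frame of the second user's speech on the accent embedding E1.
        cond = e1.unsqueeze(1).expand(-1, s2_frames.size(1), -1)
        return self.net(torch.cat([s2_frames, cond], dim=-1))

def discriminator():
    return nn.Sequential(nn.Linear(FEAT, 128), nn.ReLU(), nn.Linear(128, 1))

G, D_accent, D_identity = Generator(), discriminator(), discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()
lambda_accent, lambda_identity = 1.0, 0.5   # balance accent accuracy vs. speaker preservation (assumed)

def generator_step(s2_frames, e1):
    s_mod = G(s2_frames, e1)
    ones = torch.ones(s_mod.size(0), s_mod.size(1), 1)
    # G wants D_accent to judge S_mod as authentic first-user speech ...
    loss_accent = bce(D_accent(s_mod), ones)
    # ... and D_identity to judge S_mod as coming from the original second speaker.
    loss_identity = bce(D_identity(s_mod), ones)
    loss = lambda_accent * loss_accent + lambda_identity * loss_identity
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return s_mod.detach(), float(loss)

s_mod, g_loss = generator_step(torch.randn(4, 100, FEAT), torch.randn(4, ACCENT_DIM))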
Derivative 1.2: End-to-End Flow-Based Waveform Generation
Enabling Description: This variation utilizes a non-autoregressive, flow-based deep learning model, analogous to VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), for direct waveform conversion. The system first extracts linguistic features (phonemes) from the second user's speech using an acoustic model. Simultaneously, a speaker encoder generates a speaker embedding vector. The target accent is represented by a separate accent embedding vector. These three inputs (phonemes, speaker embedding, accent embedding) are fed into a conditional variational autoencoder (VAE) with normalizing flows. The model learns to map the distribution of the second user's speech to the distribution of the first user's accent, conditioned on the linguistic content and speaker identity. The output is a modified waveform generated in a single pass, enabling faster-than-real-time synthesis.
Mermaid Diagram:
sequenceDiagram
  participant S2 as Second User Speech
  participant ASR as ASR/Phoneme Extractor
  participant SE as Speaker Encoder
  participant AE as Accent Encoder (from User 1)
  participant VAE as Flow-Based VAE
  participant Vocoder
  S2->>ASR: Raw Audio
  ASR->>VAE: Phoneme Sequence (p)
  S2->>SE: Raw Audio
  SE->>VAE: Speaker Embedding (e_spk)
  AE->>VAE: Accent Embedding (e_acc)
  VAE->>Vocoder: Latent Representation (z_mod)
  Note over VAE: P(z|p, e_spk, e_acc)
  Vocoder->>S2: Modified Waveform
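Illustrative Code Sketch (Python): A simplified single-pass conditioning path, assuming the phoneme, speaker, and accent embeddings are fused into one conditioning tensor from which a latent is sampled and decoded in one shot; the prior network and decoder are simple stand-ins for the normalizing-flow and vocoder stages, not a VITS implementation.

# Illustrative sketch: non-autoregressive, one-pass generation conditioned on
# phonemes, speaker embedding, and accent embedding. All dimensions are assumed.
import torch
import torch.nn as nn

PHONE_VOCAB, PHONE_DIM, SPK_DIM, ACC_DIM, LATENT, FEAT = 100, 128, 64, 64, 96, 80

phone_emb = nn.Embedding(PHONE_VOCAB, PHONE_DIM)
prior_net = nn.Linear(PHONE_DIM + SPK_DIM + ACC_DIM, 2 * LATENT)  # predicts mean and log-variance
decoder   = nn.Linear(LATENT + SPK_DIM + ACC_DIM, FEAT)           # stand-in for flow + vocoder frontend

def synthesize(phonemes, e_spk, e_acc):
    """phonemes: (T,) long tensor; e_spk / e_acc: (dim,) tensors."""
    T = phonemes.size(0)
    cond = torch.cat([e_spk, e_acc]).unsqueeze(0).expand(T, -1)
    h = torch.cat([phone_emb(phonemes), cond], dim=-1)
    mean, logvar = prior_net(h).chunk(2, dim=-1)
    z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()   # reparameterised sample
    return decoder(torch.cat([z, cond], dim=-1))               # (T, FEAT) acoustic frames in one pass

frames = synthesize(torch.randint(0, PHONE_VOCAB, (50,)),
                    torch.randn(SPK_DIM), torch.randn(ACC_DIM))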
Axis 2: Operational Parameter Expansion
Derivative 2.1: Ultra-Low Latency Mimicking via Predictive Phoneme Framing
Enabling Description: To keep added processing latency under 10 ms for applications like simultaneous interpretation, this system employs a predictive model. The feature extraction pipeline operates on 20 ms audio frames. A lightweight LSTM (Long Short-Term Memory) network, running in parallel, analyzes the linguistic content of the incoming speech and predicts the most likely phoneme sequence for the next 40-60 ms. While the current frame is being converted, the synthesis module pre-computes the acoustic features for the predicted phonemes in the target accent. When the actual audio frames arrive, the system combines the pre-computed features with the real-time prosodic information (pitch, energy) from the user, drastically reducing the synthesis computation time per frame. This predictive buffering minimizes the perceived delay.
Mermaid Diagram:
graph TD
  A[Audio Input Stream] --> B{"Frame Buffer (20ms)"}
  B --> C[Feature Extraction]
  B --> D[Linguistic Analysis]
  D --> E[Predictive LSTM]
  E --> F[Predicted Phoneme Buffer]
  F --> G{Pre-computation Module}
  C --> H{Accent Translation}
  H --> I[Feature Combination]
  G --> I
  I --> J[Waveform Synthesis]
  J --> K[Audio Output Stream]
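Illustrative Code Sketch (Python): A sketch of the predictive pre-computation loop, assuming frames arrive with their prosody and a provisional phoneme label already attached; the phoneme predictor, feature functions, and look-ahead length are placeholders for the components named above.

# Illustrative sketch of predictive pre-computation (all components are
# placeholders; the LSTM predictor, frame size, and look-ahead are assumptions
# drawn from the description above, not a reference implementation).
from collections import deque

FRAME_MS = 20
LOOKAHEAD_FRAMES = 3          # roughly 40-60 ms of predicted phonemes

precomputed = {}              # phoneme -> target-accent acoustic features

def predict_next_phonemes(history):
    """Placeholder for the lightweight LSTM predictor."""
    return ["ah", "n", "d"][:LOOKAHEAD_FRAMES]

def accent_features_for(phoneme):
    """Placeholder: acoustic features of this phoneme in the target accent."""
    return {"phoneme": phoneme, "spectral_env": None}

def combine(accent_feats, pitch, energy):
    """Placeholder: merge target-accent features with live prosody."""
    return {**accent_feats, "pitch": pitch, "energy": energy}

def convert_stream(frames):
    history = deque(maxlen=50)
    for frame in frames:
        phoneme = frame["phoneme"]            # from real-time linguistic analysis
        history.append(phoneme)
        # Use cached features if this phoneme was predicted and pre-computed,
        # otherwise fall back to computing them now.
        feats = precomputed.pop(phoneme, None) or accent_features_for(phoneme)
        yield combine(feats, frame["pitch"], frame["energy"])
        # While the next frame is still being captured, pre-compute features
        # for the phonemes the predictor expects in the next 40-60 ms.
        for p in predict_next_phonemes(history):
            precomputed.setdefault(p, accent_features_for(p))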
Derivative 2.2: Accent Mimicking for Hypersonic Flight and Ultrasonic Acoustics
Enabling Description: This system is designed for scientific and industrial analysis by applying the concept of "accent" to non-human audio signals. For hypersonic analysis, the system analyzes the acoustic signature of airflow over a vehicle traveling at Mach 5+ to establish a "nominal flight accent." It then analyzes real-time acoustic data from sensors on the vehicle, converting it to mimic the nominal accent. Deviations in the required transformation indicate changes in atmospheric conditions or structural integrity. For the ultrasonic application, it analyzes ultrasonic vocalizations from laboratory rodents. It establishes a "calm accent" (baseline) and converts real-time vocalizations to this baseline. The acoustic distance of the conversion quantifies the animal's stress level in response to stimuli.
Mermaid Diagram:
stateDiagram-v2
  state "Hypersonic Application" as H {
    [*] --> Baseline: Capture nominal flight acoustic signature
    Baseline --> Monitoring: Real-time sensor data input
    Monitoring --> Monitoring: Analyze & transform signature to baseline
    state "Transformation Delta > Threshold" as Alert {
      note right of Alert
        Indicates structural flutter or unexpected turbulence
      end note
    }
    Monitoring --> Alert
    Alert --> [*]
  }
  state "Ultrasonic Application" as S {
    [*] --> S_Baseline: Record baseline rodent vocalizations (calm state)
    S_Baseline --> S_Monitoring: Monitor vocalizations after stimulus
    S_Monitoring --> S_Monitoring: Convert active vocalizations to calm baseline
    state "Acoustic Distance High" as Stress {
      note right of Stress
        Quantifies stress level based on the degree of required conversion
      end note
    }
    S_Monitoring --> Stress
    Stress --> [*]
  }
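Illustrative Code Sketch (Python): A sketch of the "transformation delta" measurement shared by both applications, assuming a coarse band-spectrum representation and the spread of per-band log-gains as the acoustic distance metric; the band count and alert threshold are illustrative assumptions.

# Illustrative sketch: quantify how large a transformation is needed to map a
# live acoustic signature onto the baseline ("nominal flight" or "calm")
# signature. Band count, threshold, and distance metric are assumptions.
import numpy as np

def band_spectrum(signal, n_bands=32):
    """Average magnitude spectrum in coarse frequency bands."""
    mag = np.abs(np.fft.rfft(signal))
    return np.array([b.mean() for b in np.array_split(mag, n_bands)])

def transformation_delta(live, baseline):
    live_b, base_b = band_spectrum(live), band_spectrum(baseline)
    # The per-band gain that would convert the live signature into the baseline;
    # its spread serves as the "acoustic distance" of the conversion.
    gains = (base_b + 1e-9) / (live_b + 1e-9)
    return float(np.std(np.log(gains)))

ALERT_THRESHOLD = 0.5   # assumed; tuned per vehicle or per subject in practice

def monitor(live, baseline):
    delta = transformation_delta(live, baseline)
    return {"delta": delta, "alert": delta > ALERT_THRESHOLD}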
Axis 3: Cross-Domain Application
Derivative 3.1: Aerospace - ATC Accent Simulation for Pilot Training
Enabling Description: In a flight simulator, a text-to-speech engine generates standard Air Traffic Control (ATC) commands. These serve as the "second user speech" (in its base, accent-neutral form). The system stores a library of accent embeddings derived from real-world recordings of air traffic controllers in challenging airspaces (e.g., Guangzhou, Mexico City, Lagos). The training scenario selects a target accent ("first user accent"). The accent mimicking system modifies the standard TTS output to realistically replicate the chosen regional accent, including its unique cadence, phonology, and intonation, while preserving the clarity of the base TTS voice ("natural voice"). This exposes student pilots to realistic communication challenges in a safe environment.
Mermaid Diagram:
flowchart LR
  subgraph Simulator Core
    A[Training Scenario] --> B{Select Target Airspace}
    B --> C[Load ATC Accent Embedding]
    A --> D[Generate ATC Command Text]
  end
  subgraph Accent Mimicking System
    D --> E[Standard TTS Engine]
    E --> F[Base Speech Output]
    C --> G[Accent/Prosody Modifier]
    F --> G
    G --> H[Accented Speech Output]
  end
  H --> I[Cockpit Audio System]
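Illustrative Code Sketch (Python): A sketch of the simulator-side wiring; the function names (load_accent_embedding, synthesize_tts, apply_accent) and the airspace-to-file mapping are hypothetical placeholders for the modules shown in the diagram above.

# Illustrative sketch of the training-scenario pipeline; every function body
# is a placeholder for the corresponding module in the disclosure.
import numpy as np

def load_accent_embedding(airspace):
    """Placeholder: load a pre-trained accent embedding for an airspace."""
    library = {"Guangzhou": "zggg.npy", "Mexico City": "mmmx.npy", "Lagos": "dnmm.npy"}
    return np.zeros(64) if airspace in library else None   # stand-in for loading the stored vector

def synthesize_tts(text):
    """Placeholder: standard, accent-neutral TTS output (waveform samples)."""
    return np.zeros(16000)

def apply_accent(waveform, accent_embedding):
    """Placeholder: the accent/prosody modifier from the disclosure."""
    return waveform

def atc_transmission(scenario_airspace, command_text):
    e_acc = load_accent_embedding(scenario_airspace)
    base_speech = synthesize_tts(command_text)
    return apply_accent(base_speech, e_acc) if e_acc is not None else base_speech

audio = atc_transmission("Guangzhou", "Speedbird 27, descend and maintain flight level 110.")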
Derivative 3.2: AgTech - Pathogenic Beehive Acoustics
Enabling Description: The system is used to detect diseases like Varroa mite infestation in beehives. A high-fidelity microphone records the collective buzzing frequency and pattern of a healthy hive, which is used to create an acoustic embedding for the "healthy hive accent." The system then monitors other hives. The buzzing from a monitored hive ("second user speech") is analyzed. The system modifies this buzzing to mimic the "healthy accent." The parameters of the transformation (e.g., required frequency shift, amplitude modulation) correlate with specific pathogenic stressors. A large transformation magnitude indicates a high probability of infestation, triggering an alert for the beekeeper. The "natural voice" preservation corresponds to maintaining the hive's unique baseline hum, distinguishing it from background noise.
Mermaid Diagram:
sequenceDiagram
  participant Sensor as Hive Acoustic Sensor
  participant Analyzer as Accent Analyzer
  participant Transformer as Accent Transformer
  participant Dashboard as Beekeeper Dashboard
  Sensor->>Analyzer: Continuous Buzzing Audio (Hive B)
  note right of Analyzer: Pre-loaded with "Healthy Hive Accent" embedding (from Hive A)
  Analyzer->>Transformer: Buzzing Audio + Target Healthy Accent
  Transformer->>Transformer: Calculate Transformation Parameters
  Transformer->>Dashboard: Send Health Score (based on transform magnitude)
  alt Health Score < Threshold
    Dashboard->>Dashboard: Display "Hive B is Unhealthy"
  else
    Dashboard->>Dashboard: Display "Hive B is Healthy"
  end
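Illustrative Code Sketch (Python): A sketch of deriving a hive health score from the transformation magnitude; the feature set (dominant buzzing frequency, amplitude modulation depth), the normalizers, and the threshold are assumptions chosen only to make the scoring flow concrete.

# Illustrative sketch: health score from the transformation needed to match
# the "healthy hive accent". Features, weights, and threshold are assumed.
import numpy as np

def hive_features(buzz, sr):
    spectrum = np.abs(np.fft.rfft(buzz))
    freqs = np.fft.rfftfreq(len(buzz), 1.0 / sr)
    peak_hz = freqs[np.argmax(spectrum)]                    # dominant buzzing frequency
    envelope = np.abs(buzz)
    mod_depth = envelope.std() / (envelope.mean() + 1e-9)   # amplitude modulation depth
    return np.array([peak_hz, mod_depth])

def health_score(monitored_buzz, healthy_buzz, sr):
    # Transformation parameters = what it takes to map the monitored hive onto
    # the healthy baseline; a large magnitude lowers the score.
    delta = hive_features(monitored_buzz, sr) - hive_features(healthy_buzz, sr)
    magnitude = np.linalg.norm(delta / np.array([50.0, 0.2]))  # assumed normalisers
    return float(np.clip(1.0 - magnitude, 0.0, 1.0))

ALERT_THRESHOLD = 0.6   # assumed
def dashboard_status(score):
    return "Hive B is Unhealthy" if score < ALERT_THRESHOLD else "Hive B is Healthy"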
Axis 4: Integration with Emerging Tech
Derivative 4.1: IoT and AI for Dynamic Acoustic Ambiance Matching
Enabling Description: In a vehicle or smart home, an array of IoT microphones constantly monitors the ambient conversation. An AI model determines the dominant accent and language of the occupants. When the user interacts with the voice assistant, this system modifies the assistant's standard response voice ("second user") to match the detected ambient accent ("first user"). This integration allows the AI assistant to seamlessly blend into the social environment. If the conversation switches accents (e.g., a new passenger joins the car), the IoT sensors trigger the AI to update the target accent embedding in real-time, ensuring the assistant's voice adapts dynamically.
Mermaid Diagram:
graph TD
  A[IoT Mic Array] --> B(Real-time Audio Stream)
  B --> C{AI Ambient Accent Detection}
  C --> D[Target Accent Profile]
  E[User Query] --> F{Voice Assistant}
  F --> G[Standard TTS Response]
  D --> H(Accent Mimicking Module)
  G --> H
  H --> I[Adapted TTS Response]
  I --> J[Speakers]
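Illustrative Code Sketch (Python): A sketch of the dynamic target-accent update; the detection and mimicking models are placeholder callables, and the confidence gate is an added assumption rather than part of the original disclosure.

# Illustrative sketch of the dynamic update loop described above.
class AmbientAccentMatcher:
    def __init__(self, detect_accent, mimic, min_confidence=0.7):
        self.detect_accent = detect_accent    # audio chunk -> (accent_embedding, confidence)
        self.mimic = mimic                    # (tts_audio, accent_embedding) -> adapted audio
        self.min_confidence = min_confidence
        self.target_embedding = None

    def on_ambient_audio(self, chunk):
        """Called continuously by the IoT microphone array."""
        embedding, confidence = self.detect_accent(chunk)
        if confidence >= self.min_confidence:
            # e.g. a new passenger changes the dominant ambient accent
            self.target_embedding = embedding

    def respond(self, tts_audio):
        """Called when the assistant answers a user query."""
        if self.target_embedding is None:
            return tts_audio                   # no reliable ambient accent detected yet
        return self.mimic(tts_audio, self.target_embedding)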
Derivative 4.2: Blockchain-Verified "Voice Skins" for the Metaverse
Enabling Description: A voice actor creates a unique vocal identity, including a specific accent, and registers it as a "Voice NFT" on a public blockchain (e.g., Ethereum). The NFT's metadata contains the trained accent embedding vector. A user in the metaverse who purchases or licenses this NFT can apply it to their own voice. When the user speaks ("second user"), the system pulls the accent embedding from the blockchain via a smart contract call. It then modifies the user's voice to mimic the NFT's accent ("first user") while preserving the user's own intonation and emotion ("natural voice"). The blockchain transaction ledger provides an immutable, auditable trail of who is authorized to use the voice skin, preventing digital voice impersonation.
Mermaid Diagram:
classDiagram
  class User {
    +walletAddress
    +speak()
  }
  class AccentMimickingSystem {
    +applyVoiceSkin(audio, nftContractAddress)
  }
  class Blockchain {
    +getAccentEmbedding(nftContractAddress)
  }
  class VoiceNFT {
    <<SmartContract>>
    +ownerAddress
    +accentEmbeddingVector
  }
  User "1" -- "1" AccentMimickingSystem : Interacts with
  AccentMimickingSystem "1" -- "1" Blockchain : Queries
  Blockchain "1" -- "*" VoiceNFT : Manages
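Illustrative Code Sketch (Python): A sketch of the authorization-then-apply flow; the BlockchainClient interface, contract method name, and NFT metadata layout are hypothetical, and any on-chain client library could back them.

# Illustrative, library-free sketch of the voice-skin authorization flow.
from dataclasses import dataclass
from typing import Protocol, Sequence

@dataclass
class VoiceNFT:
    owner_address: str
    accent_embedding: Sequence[float]

class BlockchainClient(Protocol):
    # Hypothetical interface; a real deployment would back this with an
    # on-chain smart contract call.
    def get_voice_nft(self, contract_address: str) -> VoiceNFT: ...

def apply_voice_skin(chain: BlockchainClient, contract_address: str,
                     user_wallet: str, user_audio, mimic):
    """mimic: (audio, accent_embedding) -> modified audio (placeholder)."""
    nft = chain.get_voice_nft(contract_address)
    if nft.owner_address.lower() != user_wallet.lower():
        raise PermissionError("Wallet is not authorized to use this voice skin")
    # The accent comes from the NFT; intonation and emotion stay with the speaker.
    return mimic(user_audio, nft.accent_embedding)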
Axis 5: The "Inverse" or Failure Mode
Derivative 5.1: Accent Anonymization Filter
Enabling Description: This system operates in an inverse "anonymization" mode. It is designed for applications where accent may introduce bias (e.g., automated job screening, anonymous witness testimony). The system analyzes the user's speech, extracts the accent-specific features (phoneme pronunciation, prosody), and also extracts the core vocal identity features (pitch, timbre, formant structure). It then synthesizes a new speech signal using the user's vocal identity features but replaces the accent-specific features with those from a pre-defined, standardized "neutral" accent model (e.g., a generic newscaster accent). The result is speech that is clearly in the user's voice but stripped of any regional or socio-economic accent markers.
Mermaid Diagram:
flowchart TD
  A[User Speech Input] --> B{Feature Splitter}
  B --> C[Accent Features]
  B --> D[Vocal Identity Features]
  E[Neutral Accent Model] --> F[Neutral Accent Features]
  C --> G{Feature Discard}
  D --> H{Speech Synthesizer}
  F --> H
  H --> I[Anonymized Speech Output]
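Illustrative Code Sketch (Python): A sketch of the split-and-recombine pipeline; every function body is a placeholder, and only the partition of features into identity versus accent is the point being illustrated.

# Illustrative sketch of the anonymization pipeline described above.
def extract_identity_features(speech):
    """Placeholder: pitch, timbre, and formant structure of the speaker."""
    return {"pitch_track": None, "timbre": None, "formants": None}

def extract_accent_features(speech):
    """Placeholder: phoneme realisations and prosodic patterns (to be discarded)."""
    return {"phoneme_variants": None, "prosody": None}

def load_neutral_accent_model():
    """Placeholder: standardized 'newscaster' accent features."""
    return {"phoneme_variants": "neutral", "prosody": "neutral"}

def synthesize(identity_features, accent_features):
    """Placeholder: synthesizer that renders identity + accent into a waveform."""
    return b""

def anonymize(speech):
    identity = extract_identity_features(speech)
    _ = extract_accent_features(speech)          # extracted only to be discarded
    neutral = load_neutral_accent_model()
    return synthesize(identity, neutral)         # user's voice, neutral accent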
Derivative 5.2: Graceful Degradation to Phonetic Subtitling
Enabling Description: This is a safe-fail mode for high-noise environments where accent conversion could produce unintelligible artifacts. The system continuously calculates a Signal-to-Noise Ratio (SNR) and a confidence score for its accent analysis. If the SNR drops below a pre-set threshold (e.g., 5dB) or the confidence score is low, the system disables audio synthesis entirely. Instead, it performs a real-time speech-to-text conversion of the user's speech. Crucially, it then uses its accent analysis module not to convert the audio, but to generate a phonetic or dialect-aware subtitle. For example, if it detects a Scottish accent saying "I cannae do it," the subtitle might read:
"I cannae [can't] do it," providing the original dialect word and its standard equivalent for maximum clarity.
Mermaid Diagram:
stateDiagram-v2
  [*] --> Monitoring
  Monitoring: SNR > 5dB and Confidence > 0.8
  Monitoring --> Accent_Conversion: Process Audio
  Accent_Conversion --> Monitoring: Output modified audio
  Monitoring --> Phonetic_Subtitling: SNR <= 5dB or Confidence <= 0.8
  note right of Phonetic_Subtitling
    1. Disable audio synthesis
    2. Perform STT
    3. Annotate text with phonetic/dialect hints
  end note
  Phonetic_Subtitling --> Monitoring: Output enhanced subtitles
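Illustrative Code Sketch (Python): A sketch of the safe-fail switch; the STT engine and the dialect lexicon entries are placeholders, and only the 5 dB and 0.8 thresholds come from the description above.

# Illustrative sketch of the graceful-degradation decision per frame.
SNR_THRESHOLD_DB = 5.0
CONFIDENCE_THRESHOLD = 0.8

DIALECT_LEXICON = {"cannae": "can't", "didnae": "didn't"}   # assumed example entries

def annotate(transcript):
    words = []
    for w in transcript.split():
        gloss = DIALECT_LEXICON.get(w.lower().strip(",.?!"))
        words.append(f"{w} [{gloss}]" if gloss else w)
    return " ".join(words)

def process_frame(frame, snr_db, confidence, convert_audio, speech_to_text):
    """convert_audio and speech_to_text are placeholder callables."""
    if snr_db > SNR_THRESHOLD_DB and confidence > CONFIDENCE_THRESHOLD:
        return {"mode": "accent_conversion", "audio": convert_audio(frame)}
    # Degrade gracefully: disable synthesis, emit dialect-aware subtitles only.
    return {"mode": "phonetic_subtitling", "subtitle": annotate(speech_to_text(frame))}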
Combination Prior Art with Open-Source Standards
Combination with WebRTC and Insertable Streams: A system where the accent mimicking algorithm is compiled to WebAssembly (WASM) and deployed as a JavaScript library. In a peer-to-peer WebRTC video conference, the library uses the Insertable Streams for Media API to intercept the raw audio frames from a user's MediaStreamTrack. The WASM module performs the accent conversion in-browser, modifying the audio frames before they are passed to the RTCRtpSender for encryption and transmission to the remote peer. This enables client-side, real-time accent mimicking in any modern web application without server-side processing.
Combination with the Kaldi Speech Recognition Toolkit: A method for improving the accuracy of accent mimicking by leveraging the detailed acoustic models and forced-alignment capabilities of the open-source Kaldi toolkit. The second user's speech is first processed by a Kaldi model to generate a precise, time-aligned phoneme transcription. The accent translation module then uses this alignment to perform a more accurate phoneme-to-phoneme mapping and prosody transfer from the target accent, as it knows the exact start and end time of every sound in the source speech.
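Illustrative Code Sketch (Python): A sketch of consuming a Kaldi forced alignment for the combination above, assuming the alignment has been exported to a CTM-style text file (utterance, channel, start, duration, phone), e.g. via ali-to-phones; the phoneme-mapping and prosody-transfer functions are placeholders.

# Illustrative sketch: use phone-level timing from a Kaldi alignment export to
# drive per-phoneme accent mapping. The CTM-style layout is assumed.
def read_phone_ctm(path):
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            utt, _chan, start, dur, phone = parts[:5]
            segments.append({"utt": utt, "start": float(start),
                             "end": float(start) + float(dur), "phone": phone})
    return segments

def accent_translate(segments, phone_map, transfer_prosody):
    """phone_map: source phone -> target-accent phone; transfer_prosody is a placeholder."""
    out = []
    for seg in segments:
        target_phone = phone_map.get(seg["phone"], seg["phone"])
        # Exact start/end times allow prosody to be transferred phone by phone.
        out.append(transfer_prosody({**seg, "phone": target_phone}))
    return out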
Combination with Open-Source Voice Assistant Mycroft: An accent mimicking "skill" for the Mycroft open-source voice assistant. The skill allows a user to configure the assistant's voice personality. The user can have a short conversation with Mycroft ("first user speech") in their own accent. Mycroft's skill extracts the accent features and applies them to its own default TTS voice ("second user speech"). Thereafter, all of Mycroft's responses are delivered in its own voice but mimicking the user's regional accent, creating a personalized and localized user experience.
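Illustrative Code Sketch (Python): A library-free sketch of the skill's enroll-then-mimic logic; how a Mycroft skill would actually hook the recorded user audio and the TTS output is left abstract here, and extract_accent_embedding / apply_accent are placeholders.

# Illustrative sketch of the personalization flow described above.
class AccentMimicPersonality:
    def __init__(self, extract_accent_embedding, apply_accent):
        self.extract = extract_accent_embedding   # user audio clips -> accent embedding
        self.apply = apply_accent                 # (tts_audio, embedding) -> accented audio
        self.user_accent = None

    def enroll(self, conversation_audio_clips):
        """Short calibration conversation with the user ('first user speech')."""
        self.user_accent = self.extract(conversation_audio_clips)

    def speak(self, tts_audio):
        """Wraps the assistant's default TTS output ('second user speech')."""
        if self.user_accent is None:
            return tts_audio
        return self.apply(tts_audio, self.user_accent)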