Patent 8355484
Derivative works
Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.
As a Senior Patent Strategist and Research Engineer specializing in Defensive Publishing, I have analyzed US Patent 8,355,484. The following document constitutes a defensive disclosure of derivative variations and technological combinations. The purpose of this disclosure is to place these concepts in the public domain, thereby establishing them as prior art against future patent applications seeking to claim these or obvious variations thereof.
DEFENSIVE DISCLOSURE: DERIVATIVE WORKS AND IMPROVEMENTS FOR LATENCY MASKING IN DIALOG SYSTEMS
Publication Date: May 8, 2026
Reference Patent: US 8,355,484 B2 ("Methods and apparatus for masking latency in text-to-speech systems")
I. Derivatives Based on Component & Algorithm Substitution
1.1. Generative Paralinguistic Event Synthesis
Enabling Description: This variation replaces the static database of transitional messages (as described in FIG. 2, item 218 of US 8,355,484) with a real-time generative neural network. A lightweight Variational Autoencoder (VAE) or a small-footprint Generative Adversarial Network (GAN) is trained on a corpus of human non-lexical vocalizations (e.g., "uh," "hmmm," breaths, hesitations). When the filler generator is triggered, instead of retrieving a pre-recorded file, it samples a vector from the VAE's latent space and decodes it into a novel, non-repeating audio waveform, so that users do not perceive annoying repetition in the transitional sounds and the interaction feels more natural. The model is optimized for low inference latency so that playback can begin immediately after the user stops speaking.
Mermaid Diagram:
```mermaid
sequenceDiagram
    participant User
    participant ASR
    participant FillerGenerator
    participant VAE_Decoder
    participant SpeechSynthesizer
    User->>+ASR: Speaks query
    ASR-->>-User: (Silence)
    ASR->>FillerGenerator: End-of-speech signal
    activate FillerGenerator
    loop Until Main Response Ready
        FillerGenerator->>VAE_Decoder: Request new filler waveform
        activate VAE_Decoder
        Note right of VAE_Decoder: Samples from latent space
        VAE_Decoder-->>FillerGenerator: Generates unique waveform
        deactivate VAE_Decoder
        FillerGenerator->>SpeechSynthesizer: Stream waveform
        activate SpeechSynthesizer
        SpeechSynthesizer-->>User: Plays novel 'um' sound
        deactivate SpeechSynthesizer
    end
    deactivate FillerGenerator
```
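By way of illustration, the latent-space sampling step might be sketched in Python as below. The decode function here is a hypothetical stand-in for a trained VAE decoder; all names, dimensions, and signal parameters are illustrative, not drawn from the reference patent.

```python
# Minimal sketch of latent-space filler sampling; `decode` is a
# hypothetical stand-in for a trained VAE decoder.
import numpy as np

LATENT_DIM = 16
SAMPLE_RATE = 16_000

def decode(z: np.ndarray) -> np.ndarray:
    """Map a latent vector to ~0.4 s of audio. A real system would
    run a neural decoder here; this stub just perturbs a tone."""
    t = np.linspace(0, 0.4, int(0.4 * SAMPLE_RATE), endpoint=False)
    f0 = 110 + 40 * abs(z[0])          # latent vector perturbs pitch...
    env = np.exp(-3 * t)               # ...under a decaying envelope
    return (env * np.sin(2 * np.pi * f0 * t)).astype(np.float32)

def next_filler(rng: np.random.Generator) -> np.ndarray:
    """Draw a fresh latent sample so no two fillers are identical."""
    z = rng.standard_normal(LATENT_DIM)
    return decode(z)

rng = np.random.default_rng()
waveform = next_filler(rng)   # stream to the audio output while NLG runs
```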
1.2. Emotionally-Attuned Filler Selection
Enabling Description: This system enhances the filler generator by integrating it with a real-time emotion detection module that analyzes the user's vocal prosody (pitch, tone, speaking rate). When the audio communication is received, the user's speech is analyzed in parallel by the ASR and the emotion detection module. The module classifies the user's emotional state (e.g., frustrated, calm, inquisitive) and provides this classification as input to the filler generator. The filler generator then selects a transitional message from a database that is tagged with corresponding emotional attributes. For example, if frustration is detected, a more placating and thoughtful phrase like "Okay, let me check that for you carefully..." is selected over a simple "uhm."
Mermaid Diagram:
```mermaid
flowchart TD
    A[User Communication] --> B{ASR Engine}
    A --> C{Vocal Emotion Detection}
    C --> D[Emotional State Vector]
    B --> E[Transcribed Words]
    E --> F{"NLU & Dialog Manager"}
    F --> G[Response Data]
    subgraph Filler Logic
        D --> H{Filler Generator}
        I[Emotionally-Tagged Filler DB] --> H
    end
    H --> J(Select Emotionally-Appropriate Filler)
    G --> K{Natural Language Generator}
    K --> L[Final Response Text]
```

```mermaid
stateDiagram-v2
    [*] --> Processing: User speaks
    Processing --> LatencyMasking: End-of-speech detected
    LatencyMasking: Play placating filler if user is frustrated
    LatencyMasking --> Responding: NLG generates response
    Responding --> [*]: Synthesize and speak response
```
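A minimal sketch of the selection logic follows, assuming a hypothetical emotionally-tagged database; the tags and phrases are placeholders.

```python
# Minimal sketch of emotion-tagged filler selection; the tags and
# phrases are illustrative, not the patent's database.
import random

FILLER_DB = {
    "frustrated": ["Okay, let me check that for you carefully..."],
    "calm": ["One moment...", "Let me see..."],
    "inquisitive": ["Good question, checking now..."],
}

def select_filler(emotional_state: str) -> str:
    """Pick a transitional message matching the detected emotion,
    falling back to a neutral filler for unknown states."""
    candidates = FILLER_DB.get(emotional_state, ["Hmm..."])
    return random.choice(candidates)

assert select_filler("frustrated").startswith("Okay")
```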
II. Derivatives Based on Operational Parameter Expansion
2.1. Sub-Perceptual Latency Masking for High-Frequency Systems
Enabling Description: In applications where system response must be in the sub-100 millisecond range (e.g., real-time audio feedback for pilots, surgeons, or financial traders), traditional paralinguistic fillers are too long. This implementation uses sub-perceptual sonic artifacts as transitional messages. When latency is detected (e.g., a 50ms delay in a data stream), the system synthesizes a phase-coherent, low-amplitude audio signal that is harmonically related to the user's own voice frequency or a background hum. This artifact is not perceived as a distinct sound but maintains a sense of an active audio channel, preventing the user from perceiving the jarring silence of a dropped connection during the brief delay.
Mermaid Diagram:
```mermaid
graph LR
    subgraph Real-Time System
        A(User Input) --> B{Process Query}
        B -- "Delay > 10ms" --> C{Filler Generator}
        C --> D[Synthesize Phase-Coherent<br>Sub-Perceptual Tone]
        D --> E(Output Audio Stream)
        B -- "Delay <= 10ms" --> F[Generate Response]
        F --> E
    end
```
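The tone-generation step might be sketched as follows, assuming the user's fundamental frequency has already been estimated upstream; the amplitude, duration, and ramp values are illustrative.

```python
# Sketch of a sub-perceptual masking tone, one octave below the
# user's estimated f0; parameter values are illustrative.
import numpy as np

SAMPLE_RATE = 48_000

def masking_tone(user_f0_hz: float, duration_s: float = 0.05,
                 amplitude: float = 0.005) -> np.ndarray:
    """Low-amplitude tone harmonically related to the user's voice,
    with short fades to avoid audible clicks at the buffer edges."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    tone = amplitude * np.sin(2 * np.pi * (user_f0_hz / 2) * t)
    ramp = min(64, len(tone) // 2)
    tone[:ramp] *= np.linspace(0, 1, ramp)
    tone[-ramp:] *= np.linspace(1, 0, ramp)
    return tone.astype(np.float32)

buffer = masking_tone(user_f0_hz=180.0)   # inject during the ~50 ms gap
```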
2.2. Structured Multi-Stage Fillers for High-Latency Systems
Enabling Description: For systems with expected latencies of 30 seconds to several minutes (e.g., queries to distributed scientific databases or satellite-linked remote systems), this variation employs a multi-stage, structured transitional message. The filler generator functions as a state machine. Upon receiving a query, it provides an initial acknowledgment ("Query received, accessing remote archives."). It then provides periodic, substantive updates based on milestones in the data retrieval pipeline ("Access granted. Now processing 2 terabytes of imaging data..."). This transforms the latency period from a silent wait into an informative progress report, managed by the filler generator and synthesized by the TTS system, until the final NLG-generated response is ready.
Mermaid Diagram:
```mermaid
stateDiagram-v2
    [*] --> Acknowledged: Query Received
    Acknowledged --> Accessing: "Contacting remote server..."
    Accessing --> Processing: "Data link established. Processing..."
    Processing --> Synthesizing: "Analysis complete. Generating your summary."
    Synthesizing --> FinalResponse: (NLG completes)
    FinalResponse --> [*]: Deliver full response
```
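A minimal sketch of the milestone-driven state machine follows; the stage names and messages are illustrative.

```python
# Sketch of the multi-stage filler state machine; the retrieval
# pipeline calls on_milestone() as each stage is reached.
from enum import Enum, auto

class Stage(Enum):
    ACKNOWLEDGED = auto()
    ACCESSING = auto()
    PROCESSING = auto()
    SYNTHESIZING = auto()

STAGE_MESSAGES = {
    Stage.ACKNOWLEDGED: "Query received, accessing remote archives.",
    Stage.ACCESSING: "Contacting remote server...",
    Stage.PROCESSING: "Data link established. Processing...",
    Stage.SYNTHESIZING: "Analysis complete. Generating your summary.",
}

def on_milestone(stage: Stage, tts_speak) -> None:
    """Forward the matching transitional message to the TTS engine."""
    tts_speak(STAGE_MESSAGES[stage])

on_milestone(Stage.ACKNOWLEDGED, tts_speak=print)
```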
III. Derivatives Based on Cross-Domain Application
3.1. Aerospace: High-Workload Cockpit Voice Assistant
Enabling Description: In a flight deck environment, a pilot issues a voice command such as, "Check weather and icing conditions for landing at KBOS in 45 minutes." The aircraft's avionics system must query multiple data sources. To prevent distracting silence and confirm receipt of the command during a critical flight phase, the system immediately responds with a transitional message synthesized in a calm, standardized aviation voice: "Checking...". As it completes sub-tasks, it can optionally provide further fillers: "Weather data received... checking icing model...". This masks the latency of the complex data fusion task and assures the flight crew the system is working, without requiring them to divert visual attention.
Mermaid Diagram:
```mermaid
sequenceDiagram
    participant Pilot
    participant CockpitVoiceSystem
    participant AvionicsDataBus
    Pilot->>+CockpitVoiceSystem: "Check weather at KBOS"
    CockpitVoiceSystem->>CockpitVoiceSystem: ASR/NLU Processing
    CockpitVoiceSystem-->>Pilot: Synthesizes "Checking..."
    CockpitVoiceSystem->>+AvionicsDataBus: Request METAR, PIREPs
    AvionicsDataBus-->>-CockpitVoiceSystem: Data streams
    CockpitVoiceSystem->>CockpitVoiceSystem: Data fusion & NLG
    CockpitVoiceSystem-->>-Pilot: Synthesizes full weather brief
```
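The acknowledge-then-brief pattern could be sketched with asyncio as below; the avionics query and TTS calls are placeholders for real interfaces, and the simulated data values are illustrative.

```python
# Sketch of the immediate-acknowledgement pattern; speak() and
# fetch_weather() stand in for real cockpit interfaces.
import asyncio

async def speak(text: str) -> None:
    print(f"TTS: {text}")        # placeholder for the cockpit TTS engine

async def fetch_weather(station: str) -> str:
    await asyncio.sleep(2.0)     # simulated avionics/data-link latency
    return f"{station}: winds 270 at 10, icing negative."

async def handle_command(station: str) -> None:
    await speak("Checking...")   # acknowledge before the slow data fusion
    brief = await fetch_weather(station)
    await speak(brief)           # final NLG-style weather brief

asyncio.run(handle_command("KBOS"))
```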
3.2. AgTech: Remote Irrigation System Control
Enabling Description: A farm manager remotely commands an automated irrigation system via a cellular link: "Activate zone 7 for 45 minutes but delay start until soil moisture drops below 25%." The command requires the central controller to query the sensor in zone 7, which may be on a low-power, high-latency radio network. To confirm the command is being processed and not lost, the system's voice interface immediately replies, "Understood. Querying zone 7 sensor." This bridges the potential 10-20 second delay for the sensor to wake, take a reading, and transmit it back, providing immediate assurance to the operator. The final confirmation ("Zone 7 scheduled.") is only given after the sensor data is received and the command is successfully scheduled.
Mermaid Diagram:
```mermaid
flowchart TD
    A[Operator Issues Voice Command] --> B{Central Controller}
    B --> C["TTS: 'Understood. Querying sensor...'"]
    B --> D{Send Wake-Up to Zone 7 Sensor}
    D -- ~15s Latency --> E[Receive Moisture Data]
    E --> F{Schedule Irrigation Task}
    F --> G["TTS: 'Zone 7 scheduled.'"]
    C --> H((Operator))
    G --> H
```
IV. Derivatives Based on Integration with Emerging Technology
4.1. AI-Driven Reinforcement Learning for Filler Optimization
Enabling Description: This system uses a Reinforcement Learning (RL) agent to dynamically select the optimal transitional message. The "state" includes user identity, conversation history, and detected emotional state. The "action" is the selection of a specific filler type (e.g., paralinguistic, short phrase, silence). The "reward" is calculated based on the user's subsequent behavior; a positive reward is given if the user waits patiently, while a negative reward is assigned if the user interrupts, hangs up, or shows signs of frustration (e.g., raised voice). Over time, the RL agent learns a personalized policy for each user, discovering that one user prefers silence while another is reassured by phrases like "Let me see...".
Mermaid Diagram:
```mermaid
classDiagram
    class RL_Agent {
        +observeState()
        +selectAction(state) policy
        +receiveReward(reward)
        +updatePolicy()
    }
    class DialogSystem {
        +user_state
        +latency_detected
        -filler_generator
        -reward_function
        +handleQuery()
    }
    class FillerGenerator {
        +playFiller(action)
    }
    DialogSystem o-- RL_Agent
    DialogSystem o-- FillerGenerator
    RL_Agent ..> FillerGenerator : selects action
```
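One simple realization of this reward loop is a contextual epsilon-greedy bandit, sketched below; the states, actions, and reward values are illustrative, not a prescribed algorithm.

```python
# Sketch of the reward loop as a contextual epsilon-greedy bandit.
import random
from collections import defaultdict

ACTIONS = ["paralinguistic", "short_phrase", "silence"]
q_values = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
counts = defaultdict(lambda: {a: 0 for a in ACTIONS})

def select_action(state: str, epsilon: float = 0.1) -> str:
    """Explore occasionally; otherwise pick the best-known filler type."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_values[state], key=q_values[state].get)

def update(state: str, action: str, reward: float) -> None:
    """Incremental mean update of the action value for this user state."""
    counts[state][action] += 1
    n = counts[state][action]
    q_values[state][action] += (reward - q_values[state][action]) / n

state = "user42|frustrated"               # identity + detected emotion
action = select_action(state)
update(state, action, reward=+1.0)        # user waited patiently
```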
4.2. IoT-Aware Contextual Filler Generation in Smart Vehicles
Enabling Description: The filler generator in an in-vehicle voice assistant is integrated with the car's Controller Area Network (CAN bus) and other IoT sensors (cameras, GPS, proximity sensors). When a user asks a question, the filler generator considers the real-time driving context. If the user asks, "Where is the nearest coffee shop?" while the vehicle's sensors indicate it is performing a complex maneuver like merging onto a highway, the filler generator selects a safety-oriented transitional message: "One moment... focusing on the merge. I'll find that for you once we're stable." This acknowledges the query but prioritizes the immediate driving context, making the interaction feel more intelligent and safe.
Mermaid Diagram:
```mermaid
sequenceDiagram
    participant Driver
    participant VoiceAssistant
    participant CarSensors as CarSensors (CAN bus)
    Driver->>VoiceAssistant: "Find coffee shop"
    activate VoiceAssistant
    VoiceAssistant->>CarSensors: Query driving context
    CarSensors-->>VoiceAssistant: State: "Merging on highway"
    VoiceAssistant->>VoiceAssistant: Select safety-oriented filler
    VoiceAssistant-->>Driver: "One moment, focusing on merge."
    loop Check context
        VoiceAssistant->>CarSensors: Query driving context
        CarSensors-->>VoiceAssistant: State: "Stable in lane"
    end
    VoiceAssistant->>VoiceAssistant: Process coffee shop query (NLG)
    VoiceAssistant-->>Driver: "The nearest coffee shop is..."
    deactivate VoiceAssistant
```
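A minimal sketch of the context gate follows; the maneuver labels and messages are illustrative placeholders for whatever the CAN-bus integration actually reports.

```python
# Sketch of context gating against vehicle state; labels are illustrative.
HIGH_WORKLOAD = {"merging", "overtaking", "emergency_braking"}

def choose_filler(driving_state: str) -> str:
    """Defer the query verbally when the vehicle reports a complex
    maneuver; otherwise use an ordinary transitional message."""
    if driving_state == "merging":
        return ("One moment... focusing on the merge. "
                "I'll find that for you once we're stable.")
    if driving_state in HIGH_WORKLOAD:
        return "One moment... I'll find that for you once we're stable."
    return "Let me see..."

assert "merge" in choose_filler("merging")
```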
V. Derivatives Based on "Inverse" or Failure Modes
5.1. Graceful Degradation to Non-Verbal Fillers
Enabling Description: This system is designed for low-power or low-bandwidth environments, such as a battery-operated smart device or an application operating over a weak cellular signal. The system monitors its available computational resources and network quality. If resources fall below a predefined threshold, it disables the resource-intensive TTS synthesis for fillers. Instead, the filler generator switches to a library of low-footprint, pre-recorded, non-verbal audio cues (e.g., a simple click, a soft chime, a subtle hum). This "gracefully degraded" mode still masks latency and provides feedback but consumes minimal power and bandwidth. The final response may also be synthesized with a lower-quality, less-natural vocoder to conserve resources.
Mermaid Diagram:
```mermaid
stateDiagram-v2
    state "Full Power Mode" as Full
    state "Low Power Mode" as Low
    [*] --> Full: System Start
    Full --> Low: Battery < 20% OR Network < 2 bars
    Low --> Full: Battery > 20% AND Network > 2 bars
    state Full {
        FullMasking: Synthesize "Let me see..."
    }
    state Low {
        LowMasking: Play pre-recorded 'chime.wav'
    }
```
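The threshold check might be sketched as follows; the thresholds mirror the state diagram above, and the asset names are placeholders.

```python
# Sketch of the degradation threshold check; thresholds and the
# chime asset name are illustrative.
def filler_mode(battery_pct: float, network_bars: int) -> str:
    """Drop to a low-footprint non-verbal cue when either resource
    falls below its threshold; otherwise use full TTS synthesis."""
    if battery_pct < 20 or network_bars < 2:
        return "chime.wav"           # pre-recorded non-verbal cue
    return "tts:Let me see..."       # full TTS-synthesized filler

assert filler_mode(battery_pct=15, network_bars=4) == "chime.wav"
```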
VI. Combination Prior Art with Open-Source Standards
6.1. Combination with W3C Speech Synthesis Markup Language (SSML)
Enabling Description: The transitional message is not a static audio file but a dynamically generated SSML document. The filler generator creates an SSML string, such as `<speak><phoneme alphabet="ipa" ph="əːm">um</phoneme><break time="700ms"/></speak>`, which is then passed to any SSML-compliant speech synthesis engine. This approach allows fine-grained control over the paralinguistic event's pronunciation, pitch, and duration, and it decouples the latency-masking logic from any specific proprietary TTS engine, making it interoperable with open standards.
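A minimal sketch of the generation step, with the phoneme and pause duration exposed as tunable parameters (the defaults simply reproduce the example above):

```python
# Sketch of dynamic SSML filler generation; the filler generator
# would tune these parameters per context.
def filler_ssml(ipa: str = "əːm", pause_ms: int = 700) -> str:
    """Build a standards-compliant SSML fragment for one filler event."""
    return (
        "<speak>"
        f'<phoneme alphabet="ipa" ph="{ipa}">um</phoneme>'
        f'<break time="{pause_ms}ms"/>'
        "</speak>"
    )

print(filler_ssml())   # hand off to any SSML-compliant TTS engine
```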
6.2. Combination with VoiceXML (VXML) for IVR Systems
Enabling Description: The latency masking technique is implemented within a standard VXML 2.1 architecture. A `<form>` element contains a `<block>` whose prompts execute immediately upon form entry, playing a series of short audio prompts (the transitional messages) while the block submits a request to a server-side application for the main response data. Once the server-side logic completes and populates the form variables, control flow proceeds to the main `<field>` prompt, which delivers the final response. This standard-compliant method achieves latency masking without custom client-side logic.
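One standards-compliant arrangement, sketched below as a document the server might emit, uses VXML's fetchaudio attribute to keep filler audio playing while the response page is fetched; the file names and page name are placeholders, and this is one possible realization rather than the only one.

```python
# Sketch of a VXML document the server could emit; audio file names
# and the answer page are illustrative placeholders.
VXML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="query">
    <block>
      <!-- Transitional prompt plays immediately on form entry. -->
      <audio src="ack.wav"/>
      <!-- fetchaudio loops filler audio while the server-generated
           response page (answer.vxml) is fetched. -->
      <goto next="answer.vxml" fetchaudio="filler_loop.wav"/>
    </block>
  </form>
</vxml>"""

print(VXML_DOC)
```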
6.3. Combination with WebRTC for Browser-Based Dialog Systems
Enabling Description: In a web application, user audio is captured via the Web Audio API and streamed to a server. When the server's ASR detects the end of speech, it signals the application logic, which immediately establishes a WebRTC `MediaStream` back to the client and begins streaming filler audio (e.g., pre-recorded "um" sounds). This stream provides immediate feedback. When the NLG has prepared the final response, the server-side application seamlessly swaps the `MediaStream`'s audio source from the filler audio to the newly synthesized response audio. This leverages the low-latency, real-time capabilities of the open WebRTC standard to manage the audio flow for latency masking.
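On the server side, the source swap could be sketched with the aiortc Python WebRTC library as below; everything beyond aiortc's MediaStreamTrack subclassing API (the class name, field names, and swap method) is illustrative.

```python
# Server-side sketch using the aiortc WebRTC library; the
# SwappableAudioTrack class and its members are illustrative.
from aiortc import MediaStreamTrack

class SwappableAudioTrack(MediaStreamTrack):
    """Outgoing audio track that forwards frames from an
    interchangeable source: filler audio first, then the response."""
    kind = "audio"

    def __init__(self, filler_track: MediaStreamTrack) -> None:
        super().__init__()
        self._source = filler_track

    def swap_to(self, response_track: MediaStreamTrack) -> None:
        # Called once NLG/TTS output is ready; subsequent recv()
        # calls pull frames from the response instead of the filler.
        self._source = response_track

    async def recv(self):
        return await self._source.recv()
```

Because the track object added to the peer connection never changes, swapping sources requires no renegotiation with the client.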