Patent 8355484
Derivative works
Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.
As a Senior Patent Strategist and Research Engineer specializing in Defensive Publishing, I have analyzed US Patent 8,355,484. The following document constitutes a defensive disclosure of derivative variations and technological combinations. The purpose of this disclosure is to place these concepts in the public domain, thereby establishing them as prior art against future patent applications seeking to claim these or obvious variations thereof.
DEFENSIVE DISCLOSURE: DERIVATIVE WORKS AND IMPROVEMENTS FOR LATENCY MASKING IN DIALOG SYSTEMS
Publication Date: May 8, 2026
Reference Patent: US 8,355,484 B2 ("Methods and apparatus for masking latency in text-to-speech systems")
I. Derivatives Based on Component & Algorithm Substitution
1.1. Generative Paralinguistic Event Synthesis
Enabling Description: This variation replaces the static database of transitional messages (as described in FIG. 2, item 218 of US 8,355,484) with a real-time generative neural network. A lightweight Variational Autoencoder (VAE) or a small-footprint Generative Adversarial Network (GAN) is trained on a corpus of human non-lexical vocalizations (e.g., "uh," "hmmm," breaths, hesitations). When the filler generator is triggered, instead of retrieving a pre-recorded file, it samples a vector from the VAE's latent space and decodes it into a novel, non-repeating audio waveform, so that users do not perceive annoying repetition in the transitional sounds and the interaction feels more natural. The model is optimized for low inference latency so that playback can begin immediately after the user stops speaking.
Mermaid Diagram:
```mermaid
sequenceDiagram
    participant User
    participant ASR
    participant FillerGenerator
    participant VAE_Decoder
    participant SpeechSynthesizer
    User->>+ASR: Speaks query
    ASR-->>-User: (Silence)
    ASR->>FillerGenerator: End-of-speech signal
    activate FillerGenerator
    loop Until Main Response Ready
        FillerGenerator->>VAE_Decoder: Request new filler waveform
        activate VAE_Decoder
        Note right of VAE_Decoder: Samples from latent space
        VAE_Decoder-->>FillerGenerator: Generates unique waveform
        deactivate VAE_Decoder
        FillerGenerator->>SpeechSynthesizer: Stream waveform
        activate SpeechSynthesizer
        SpeechSynthesizer-->>User: Plays novel 'um' sound
        deactivate SpeechSynthesizer
    end
    deactivate FillerGenerator
```
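By way of illustration, the latent-space sampling step might be sketched in Python as below. The decode function here is a hypothetical stand-in for a trained VAE decoder; all names, dimensions, and signal parameters are illustrative, not drawn from the reference patent.

```python
# Minimal sketch of latent-space filler sampling; `decode` is a
# hypothetical stand-in for a trained VAE decoder.
import numpy as np

LATENT_DIM = 16
SAMPLE_RATE = 16_000

def decode(z: np.ndarray) -> np.ndarray:
    """Map a latent vector to ~0.4 s of audio. A real system would
    run a neural decoder here; this stub just perturbs a tone."""
    t = np.linspace(0, 0.4, int(0.4 * SAMPLE_RATE), endpoint=False)
    f0 = 110 + 40 * abs(z[0])          # latent vector perturbs pitch...
    env = np.exp(-3 * t)               # ...under a decaying envelope
    return (env * np.sin(2 * np.pi * f0 * t)).astype(np.float32)

def next_filler(rng: np.random.Generator) -> np.ndarray:
    """Draw a fresh latent sample so no two fillers are identical."""
    z = rng.standard_normal(LATENT_DIM)
    return decode(z)

rng = np.random.default_rng()
waveform = next_filler(rng)   # stream to the audio output while NLG runs
```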
1.2. Emotionally-Attuned Filler Selection
Enabling Description: This system enhances the filler generator by integrating it with a real-time emotion detection module that analyzes the user's vocal prosody (pitch, tone, speaking rate). When the audio communication is received, the user's speech is analyzed in parallel by the ASR and the emotion detection module. The module classifies the user's emotional state (e.g., frustrated, calm, inquisitive) and provides this classification as input to the filler generator. The filler generator then selects a transitional message from a database that is tagged with corresponding emotional attributes. For example, if frustration is detected, a more placating and thoughtful phrase like "Okay, let me check that for you carefully..." is selected over a simple "uhm."
Mermaid Diagram:
```mermaid
flowchart TD
    A[User Communication] --> B{ASR Engine}
    A --> C{Vocal Emotion Detection}
    C --> D[Emotional State Vector]
    B --> E[Transcribed Words]
    E --> F{"NLU & Dialog Manager"}
    F --> G[Response Data]
    subgraph Filler Logic
        D --> H{Filler Generator}
        I[Emotionally-Tagged Filler DB] --> H
    end
    H --> J(Select Emotionally-Appropriate Filler)
    G --> K{Natural Language Generator}
    K --> L[Final Response Text]
```

```mermaid
stateDiagram-v2
    [*] --> Processing: User speaks
    Processing --> LatencyMasking: End-of-speech detected
    LatencyMasking: Play placating filler if user is frustrated
    LatencyMasking --> Responding: NLG generates response
    Responding --> [*]: Synthesize and speak response
```
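A minimal sketch of the selection logic follows, assuming a hypothetical emotionally-tagged database; the tags and phrases are placeholders.

```python
# Minimal sketch of emotion-tagged filler selection; the tags and
# phrases are illustrative, not the patent's database.
import random

FILLER_DB = {
    "frustrated": ["Okay, let me check that for you carefully..."],
    "calm": ["One moment...", "Let me see..."],
    "inquisitive": ["Good question, checking now..."],
}

def select_filler(emotional_state: str) -> str:
    """Pick a transitional message matching the detected emotion,
    falling back to a neutral filler for unknown states."""
    candidates = FILLER_DB.get(emotional_state, ["Hmm..."])
    return random.choice(candidates)

assert select_filler("frustrated").startswith("Okay")
```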
II. Derivatives Based on Operational Parameter Expansion
2.1. Sub-Perceptual Latency Masking for High-Frequency Systems
Enabling Description: In applications where system response must be in the sub-100 millisecond range (e.g., real-time audio feedback for pilots, surgeons, or financial traders), traditional paralinguistic fillers are too long. This implementation uses sub-perceptual sonic artifacts as transitional messages. When latency is detected (e.g., a 50ms delay in a data stream), the system synthesizes a phase-coherent, low-amplitude audio signal that is harmonically related to the user's own voice frequency or a background hum. This artifact is not perceived as a distinct sound but maintains a sense of an active audio channel, preventing the user from perceiving the jarring silence of a dropped connection during the brief delay.
Mermaid Diagram:
```mermaid
graph LR
    subgraph Real-Time System
        A(User Input) --> B{Process Query}
        B -- "Delay > 10ms" --> C{Filler Generator}
        C --> D[Synthesize Phase-Coherent<br>Sub-Perceptual Tone]
        D --> E(Output Audio Stream)
        B -- "Delay <= 10ms" --> F[Generate Response]
        F --> E
    end
```
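The tone-generation step might be sketched as follows, assuming the user's fundamental frequency has already been estimated upstream; the amplitude, duration, and ramp values are illustrative.

```python
# Sketch of a sub-perceptual masking tone, one octave below the
# user's estimated f0; parameter values are illustrative.
import numpy as np

SAMPLE_RATE = 48_000

def masking_tone(user_f0_hz: float, duration_s: float = 0.05,
                 amplitude: float = 0.005) -> np.ndarray:
    """Low-amplitude tone harmonically related to the user's voice,
    with short fades to avoid audible clicks at the buffer edges."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    tone = amplitude * np.sin(2 * np.pi * (user_f0_hz / 2) * t)
    ramp = min(64, len(tone) // 2)
    tone[:ramp] *= np.linspace(0, 1, ramp)
    tone[-ramp:] *= np.linspace(1, 0, ramp)
    return tone.astype(np.float32)

buffer = masking_tone(user_f0_hz=180.0)   # inject during the ~50 ms gap
```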
2.2. Structured Multi-Stage Fillers for High-Latency Systems
Enabling Description: For systems with expected latencies of 30 seconds to several minutes (e.g., queries to distributed scientific databases or satellite-linked remote systems), this variation employs a multi-stage, structured transitional message. The filler generator functions as a state machine. Upon receiving a query, it provides an initial acknowledgment ("Query received, accessing remote archives."). It then provides periodic, substantive updates based on milestones in the data retrieval pipeline ("Access granted. Now processing 2 terabytes of imaging data..."). This transforms the latency period from a silent wait into an informative progress report, managed by the filler generator and synthesized by the TTS system, until the final NLG-generated response is ready.
Mermaid Diagram:
```mermaid
stateDiagram-v2
    [*] --> Acknowledged: Query Received
    Acknowledged --> Accessing: "Contacting remote server..."
    Accessing --> Processing: "Data link established. Processing..."
    Processing --> Synthesizing: "Analysis complete. Generating your summary."
    Synthesizing --> FinalResponse: (NLG completes)
    FinalResponse --> [*]: Deliver full response
```
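A minimal sketch of the milestone-driven state machine follows; the stage names and messages are illustrative.

```python
# Sketch of the multi-stage filler state machine; the retrieval
# pipeline calls on_milestone() as each stage is reached.
from enum import Enum, auto

class Stage(Enum):
    ACKNOWLEDGED = auto()
    ACCESSING = auto()
    PROCESSING = auto()
    SYNTHESIZING = auto()

STAGE_MESSAGES = {
    Stage.ACKNOWLEDGED: "Query received, accessing remote archives.",
    Stage.ACCESSING: "Contacting remote server...",
    Stage.PROCESSING: "Data link established. Processing...",
    Stage.SYNTHESIZING: "Analysis complete. Generating your summary.",
}

def on_milestone(stage: Stage, tts_speak) -> None:
    """Forward the matching transitional message to the TTS engine."""
    tts_speak(STAGE_MESSAGES[stage])

on_milestone(Stage.ACKNOWLEDGED, tts_speak=print)
```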
III. Derivatives Based on Cross-Domain Application
3.1. Aerospace: High-Workload Cockpit Voice Assistant
Enabling Description: In a flight deck environment, a pilot issues a voice command such as, "Check weather and icing conditions for landing at KBOS in 45 minutes." The aircraft's avionics system must query multiple data sources. To prevent distracting silence and confirm receipt of the command during a critical flight phase, the system immediately responds with a transitional message synthesized in a calm, standardized aviation voice: "Checking...". As it completes sub-tasks, it can optionally provide further fillers: "Weather data received... checking icing model...". This masks the latency of the complex data fusion task and assures the flight crew the system is working, without requiring them to divert visual attention.
Mermaid Diagram:
```mermaid
sequenceDiagram
    participant Pilot
    participant CockpitVoiceSystem
    participant AvionicsDataBus
    Pilot->>+CockpitVoiceSystem: "Check weather at KBOS"
    CockpitVoiceSystem->>CockpitVoiceSystem: ASR/NLU Processing
    CockpitVoiceSystem-->>Pilot: Synthesizes "Checking..."
    CockpitVoiceSystem->>+AvionicsDataBus: Request METAR, PIREPs
    AvionicsDataBus-->>-CockpitVoiceSystem: Data streams
    CockpitVoiceSystem->>CockpitVoiceSystem: Data fusion & NLG
    CockpitVoiceSystem-->>-Pilot: Synthesizes full weather brief
```
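The acknowledge-then-brief pattern could be sketched with asyncio as below; the avionics query and TTS calls are placeholders for real interfaces, and the simulated data values are illustrative.

```python
# Sketch of the immediate-acknowledgement pattern; speak() and
# fetch_weather() stand in for real cockpit interfaces.
import asyncio

async def speak(text: str) -> None:
    print(f"TTS: {text}")        # placeholder for the cockpit TTS engine

async def fetch_weather(station: str) -> str:
    await asyncio.sleep(2.0)     # simulated avionics/data-link latency
    return f"{station}: winds 270 at 10, icing negative."

async def handle_command(station: str) -> None:
    await speak("Checking...")   # acknowledge before the slow data fusion
    brief = await fetch_weather(station)
    await speak(brief)           # final NLG-style weather brief

asyncio.run(handle_command("KBOS"))
```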
3.2. AgTech: Remote Irrigation System Control
Enabling Description: A farm manager remotely commands an automated irrigation system via a cellular link: "Activate zone 7 for 45 minutes but delay start until soil moisture drops below 25%." The command requires the central controller to query the sensor in zone 7, which may be on a low-power, high-latency radio network. To confirm the command is being processed and not lost, the system's voice interface immediately replies, "Understood. Querying zone 7 sensor." This bridges the potential 10-20 second delay for the sensor to wake, take a reading, and transmit it back, providing immediate assurance to the operator. The final confirmation ("Zone 7 scheduled.") is only given after the sensor data is received and the command is successfully scheduled.
Mermaid Diagram:
```mermaid
flowchart TD
    A[Operator Issues Voice Command] --> B{Central Controller}
    B --> C["TTS: 'Understood. Querying sensor...'"]
    B --> D{Send Wake-Up to Zone 7 Sensor}
    D -- ~15s Latency --> E[Receive Moisture Data]
    E --> F{Schedule Irrigation Task}
    F --> G["TTS: 'Zone 7 scheduled.'"]
    C --> H((Operator))
    G --> H
```
IV. Derivatives Based on Integration with Emerging Technology
4.1. AI-Driven Reinforcement Learning for Filler Optimization
Enabling Description: This system uses a Reinforcement Learning (RL) agent to dynamically select the optimal transitional message. The "state" includes user identity, conversation history, and detected emotional state. The "action" is the selection of a specific filler type (e.g., paralinguistic, short phrase, silence). The "reward" is calculated based on the user's subsequent behavior; a positive reward is given if the user waits patiently, while a negative reward is assigned if the user interrupts, hangs up, or shows signs of frustration (e.g., raised voice). Over time, the RL agent learns a personalized policy for each user, discovering that one user prefers silence while another is reassured by phrases like "Let me see...".
Mermaid Diagram:
```mermaid
classDiagram
    class RL_Agent {
        +observeState()
        +selectAction(state) policy
        +receiveReward(reward)
        +updatePolicy()
    }
    class DialogSystem {
        +user_state
        +latency_detected
        -filler_generator
        -reward_function
        +handleQuery()
    }
    class FillerGenerator {
        +playFiller(action)
    }
    DialogSystem o-- RL_Agent
    DialogSystem o-- FillerGenerator
    RL_Agent ..> FillerGenerator : selects action
```
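One simple realization of this reward loop is a contextual epsilon-greedy bandit, sketched below; the states, actions, and reward values are illustrative, not a prescribed algorithm.

```python
# Sketch of the reward loop as a contextual epsilon-greedy bandit.
import random
from collections import defaultdict

ACTIONS = ["paralinguistic", "short_phrase", "silence"]
q_values = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
counts = defaultdict(lambda: {a: 0 for a in ACTIONS})

def select_action(state: str, epsilon: float = 0.1) -> str:
    """Explore occasionally; otherwise pick the best-known filler type."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_values[state], key=q_values[state].get)

def update(state: str, action: str, reward: float) -> None:
    """Incremental mean update of the action value for this user state."""
    counts[state][action] += 1
    n = counts[state][action]
    q_values[state][action] += (reward - q_values[state][action]) / n

state = "user42|frustrated"               # identity + detected emotion
action = select_action(state)
update(state, action, reward=+1.0)        # user waited patiently
```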
4.2. IoT-Aware Contextual Filler Generation in Smart Vehicles
Enabling Description: The filler generator in an in-vehicle voice assistant is integrated with the car's Controller Area Network (CAN bus) and other IoT sensors (cameras, GPS, proximity sensors). When a user asks a question, the filler generator considers the real-time driving context. If the user asks, "Where is the nearest coffee shop?" while the vehicle's sensors indicate it is performing a complex maneuver like merging onto a highway, the filler generator selects a safety-oriented transitional message: "One moment... focusing on the merge. I'll find that for you once we're stable." This acknowledges the query but prioritizes the immediate driving context, making the interaction feel more intelligent and safe.
Mermaid Diagram:
```mermaid
sequenceDiagram
    participant Driver
    participant VoiceAssistant
    participant CarSensors as CarSensors (CAN bus)
    Driver->>VoiceAssistant: "Find coffee shop"
    activate VoiceAssistant
    VoiceAssistant->>CarSensors: Query driving context
    CarSensors-->>VoiceAssistant: State: "Merging on highway"
    VoiceAssistant->>VoiceAssistant: Select safety-oriented filler
    VoiceAssistant-->>Driver: "One moment, focusing on merge."
    loop Check context
        VoiceAssistant->>CarSensors: Query driving context
        CarSensors-->>VoiceAssistant: State: "Stable in lane"
    end
    VoiceAssistant->>VoiceAssistant: Process coffee shop query (NLG)
    VoiceAssistant-->>Driver: "The nearest coffee shop is..."
    deactivate VoiceAssistant
```
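A minimal sketch of the context gate follows; the maneuver labels and messages are illustrative placeholders for whatever the CAN-bus integration actually reports.

```python
# Sketch of context gating against vehicle state; labels are illustrative.
HIGH_WORKLOAD = {"merging", "overtaking", "emergency_braking"}

def choose_filler(driving_state: str) -> str:
    """Defer the query verbally when the vehicle reports a complex
    maneuver; otherwise use an ordinary transitional message."""
    if driving_state == "merging":
        return ("One moment... focusing on the merge. "
                "I'll find that for you once we're stable.")
    if driving_state in HIGH_WORKLOAD:
        return "One moment... I'll find that for you once we're stable."
    return "Let me see..."

assert "merge" in choose_filler("merging")
```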
V. Derivatives Based on "Inverse" or Failure Modes
5.1. Graceful Degradation to Non-Verbal Fillers
Enabling Description: This system is designed for low-power or low-bandwidth environments, such as a battery-operated smart device or an application operating over a weak cellular signal. The system monitors its available computational resources and network quality. If resources fall below a predefined threshold, it disables the resource-intensive TTS synthesis for fillers. Instead, the filler generator switches to a library of low-footprint, pre-recorded, non-verbal audio cues (e.g., a simple click, a soft chime, a subtle hum). This "gracefully degraded" mode still masks latency and provides feedback but consumes minimal power and bandwidth. The final response may also be synthesized with a lower-quality, less-natural vocoder to conserve resources.
Mermaid Diagram:
```mermaid
stateDiagram-v2
    state "Full Power Mode" as Full
    state "Low Power Mode" as Low
    [*] --> Full: System Start
    Full --> Low: Battery < 20% OR Network < 2 bars
    Low --> Full: Battery > 20% AND Network > 2 bars
    state Full {
        FullMasking: Synthesize "Let me see..."
    }
    state Low {
        LowMasking: Play pre-recorded 'chime.wav'
    }
```
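The threshold check might be sketched as follows; the thresholds mirror the state diagram above, and the asset names are placeholders.

```python
# Sketch of the degradation threshold check; thresholds and the
# chime asset name are illustrative.
def filler_mode(battery_pct: float, network_bars: int) -> str:
    """Drop to a low-footprint non-verbal cue when either resource
    falls below its threshold; otherwise use full TTS synthesis."""
    if battery_pct < 20 or network_bars < 2:
        return "chime.wav"           # pre-recorded non-verbal cue
    return "tts:Let me see..."       # full TTS-synthesized filler

assert filler_mode(battery_pct=15, network_bars=4) == "chime.wav"
```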
VI. Combination Prior Art with Open-Source Standards
6.1. Combination with W3C Speech Synthesis Markup Language (SSML)
Enabling Description: The transitional message is not a static audio file but a dynamically generated SSML document. The filler generator creates an SSML string, such as `<speak><phoneme alphabet="ipa" ph="əːm">um</phoneme><break time="700ms"/></speak>`, which is then passed to any SSML-compliant speech synthesis engine. This approach allows fine-grained control over the paralinguistic event's pronunciation, pitch, and duration, and it decouples the latency-masking logic from any specific proprietary TTS engine, making it interoperable with open standards.
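A minimal sketch of the generation step, with the phoneme and pause duration exposed as tunable parameters (the defaults simply reproduce the example above):

```python
# Sketch of dynamic SSML filler generation; the filler generator
# would tune these parameters per context.
def filler_ssml(ipa: str = "əːm", pause_ms: int = 700) -> str:
    """Build a standards-compliant SSML fragment for one filler event."""
    return (
        "<speak>"
        f'<phoneme alphabet="ipa" ph="{ipa}">um</phoneme>'
        f'<break time="{pause_ms}ms"/>'
        "</speak>"
    )

print(filler_ssml())   # hand off to any SSML-compliant TTS engine
```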
6.2. Combination with VoiceXML (VXML) for IVR Systems
Enabling Description: The latency masking technique is implemented within a standard VXML 2.1 architecture. A `<form>` element contains a `<block>` whose prompts execute immediately upon form entry, playing a series of short audio prompts (the transitional messages) while the block submits a request to a server-side application for the main response data. Once the server-side logic completes and populates the form variables, control flow proceeds to the main `<field>` prompt, which delivers the final response. This standard-compliant method achieves latency masking without custom client-side logic.
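One standards-compliant arrangement, sketched below as a document the server might emit, uses VXML's fetchaudio attribute to keep filler audio playing while the response page is fetched; the file names and page name are placeholders, and this is one possible realization rather than the only one.

```python
# Sketch of a VXML document the server could emit; audio file names
# and the answer page are illustrative placeholders.
VXML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="query">
    <block>
      <!-- Transitional prompt plays immediately on form entry. -->
      <audio src="ack.wav"/>
      <!-- fetchaudio loops filler audio while the server-generated
           response page (answer.vxml) is fetched. -->
      <goto next="answer.vxml" fetchaudio="filler_loop.wav"/>
    </block>
  </form>
</vxml>"""

print(VXML_DOC)
```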
6.3. Combination with WebRTC for Browser-Based Dialog Systems
Enabling Description: In a web application, user audio is captured via the Web Audio API and streamed to a server. When the server's ASR detects the end of speech, it signals the application logic, which immediately establishes a WebRTC `MediaStream` back to the client and begins streaming filler audio (e.g., pre-recorded "um" sounds). This stream provides immediate feedback. When the NLG has prepared the final response, the server-side application seamlessly swaps the `MediaStream`'s audio source from the filler audio to the newly synthesized response audio. This leverages the low-latency, real-time capabilities of the open WebRTC standard to manage the audio flow for latency masking.
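On the server side, the source swap could be sketched with the aiortc Python WebRTC library as below; everything beyond aiortc's MediaStreamTrack subclassing API (the class name, field names, and swap method) is illustrative.

```python
# Server-side sketch using the aiortc WebRTC library; the
# SwappableAudioTrack class and its members are illustrative.
from aiortc import MediaStreamTrack

class SwappableAudioTrack(MediaStreamTrack):
    """Outgoing audio track that forwards frames from an
    interchangeable source: filler audio first, then the response."""
    kind = "audio"

    def __init__(self, filler_track: MediaStreamTrack) -> None:
        super().__init__()
        self._source = filler_track

    def swap_to(self, response_track: MediaStreamTrack) -> None:
        # Called once NLG/TTS output is ready; subsequent recv()
        # calls pull frames from the response instead of the filler.
        self._source = response_track

    async def recv(self):
        return await self._source.recv()
```

Because the track object added to the peer connection never changes, swapping sources requires no renegotiation with the client.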