Patent 12236947

Derivative works

Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.

Defensive Disclosure and Prior Art Generation

Publication Date: May 8, 2026
Reference ID: DPD-T7-12236947
Title: System and Method for Multimodal, Context-Aware, and Failsafe Determination of Command Intent for Human-Machine Interfaces

This document discloses a series of technical implementations and derivative works related to the core concepts embodied in U.S. Patent 12,236,947. The purpose of this disclosure is to place these concepts into the public domain, thereby establishing prior art against future patent applications claiming these or similar incremental innovations.


Disclosures Pertaining to Claim 1: Multimodal Command Recognition

The core claim involves processing audio and video to identify a user's command intent. The following derivatives expand upon this concept.

Derivative 1.1: Component Substitution with Non-Visible Spectrum and Non-Acoustic Sensors

  • Enabling Description: The system is implemented using a long-wave infrared (LWIR) thermal camera instead of a standard CMOS/RGB camera. The "visual characteristic" is a thermal signature change in the user's perioral and nasal regions, which corresponds to the exhalation patterns of speech. This method is effective in zero-light conditions. The audio input is supplemented by a bone conduction transducer pressed against the user's mastoid process, capturing vocal vibrations directly, rendering the system highly immune to ambient acoustic noise. Data from the LWIR sensor (as a 32x32 pixel thermal array) and the bone conduction sensor are fed into a convolutional neural network (CNN) for intent fusion.
  • Mermaid Diagram:
    graph TD
        A[User Utterance] --> B{LWIR Thermal Camera};
        A --> C{Bone Conduction Transducer};
        B --> D["Thermal Feature Extraction <br>(e.g., Perioral heat map)"];
        C --> E[Vibrational Audio Processing];
        D & E --> F[Fusion CNN];
        F --> G{"Intent Classification <br>(Command / Non-Command)"};
        G --> H[System Action];
    
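  • Illustrative Code Sketch (Python): A minimal sketch of the fusion CNN described above, assuming PyTorch is available; the layer sizes, the 40-dimensional bone-conduction feature vector, and the class name ThermalAudioFusionCNN are illustrative assumptions, not taken from the patent.
    import torch
    import torch.nn as nn

    class ThermalAudioFusionCNN(nn.Module):
        def __init__(self, n_audio_features=40, n_classes=2):
            super().__init__()
            # Branch 1: 32x32 single-channel LWIR thermal frame.
            self.thermal_branch = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                     # 32x32 -> 16x16
                nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                     # 16x16 -> 8x8
                nn.Flatten(),                        # 16 * 8 * 8 = 1024 features
            )
            # Branch 2: framed features from the bone conduction transducer.
            self.audio_branch = nn.Sequential(
                nn.Linear(n_audio_features, 64), nn.ReLU(),
            )
            # Fusion head: command vs. non-command.
            self.classifier = nn.Sequential(
                nn.Linear(1024 + 64, 128), nn.ReLU(),
                nn.Linear(128, n_classes),
            )

        def forward(self, thermal_frame, audio_features):
            t = self.thermal_branch(thermal_frame)   # (batch, 1024)
            a = self.audio_branch(audio_features)    # (batch, 64)
            return self.classifier(torch.cat([t, a], dim=1))

    # Example: one 32x32 thermal frame and one 40-dim bone-conduction vector.
    model = ThermalAudioFusionCNN()
    logits = model(torch.randn(1, 1, 32, 32), torch.randn(1, 40))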

Derivative 1.2: Component Substitution with Gaze and Neurological Input

  • Enabling Description: The video input device is a dedicated eye-tracking module using infrared emitters and sensors to calculate the user's precise point of gaze (POG). The "visual characteristic" is the user's gaze dwelling on a system-controllable object for more than 500 ms concurrently with an utterance. This is combined with audio input and a consumer-grade electroencephalography (EEG) headband. The system identifies a command only if the utterance co-occurs with a P300 event-related potential (ERP) in the EEG signal, indicating a recognition/decision event in the user's brain.
  • Mermaid Diagram:
    sequenceDiagram
        participant User
        participant EyeTracker
        participant EEG_Headband
        participant AudioMic
        participant FusionEngine
    
        User->>+AudioMic: Speaks "Activate that"
        User->>+EyeTracker: Looks at target device
        User->>+EEG_Headband: Brain registers decision (P300 wave)
    
        AudioMic->>FusionEngine: Provides audio stream
        EyeTracker->>FusionEngine: Provides Point of Gaze (POG) data
        EEG_Headband->>FusionEngine: Provides EEG data stream
    
        FusionEngine->>FusionEngine: Fuses POG, Audio, and P300 signal
        FusionEngine-->>User: Executes command on target device
        deactivate AudioMic
        deactivate EyeTracker
        deactivate EEG_Headband
    
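  • Illustrative Code Sketch (Python): A minimal sketch of the co-occurrence gate described above. The >500 ms dwell requirement comes from the description; the 600 ms co-occurrence window, the event structures, and the function name are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class GazeEvent:
        target_id: str     # system-controllable object the gaze dwelt on
        dwell_ms: float    # dwell duration on that object
        t_ms: float        # timestamp of the dwell

    @dataclass
    class EEGEvent:
        is_p300: bool      # P300 event-related potential detected
        t_ms: float        # timestamp of the detection

    def classify_utterance(utterance_t_ms, gaze, eeg,
                           dwell_threshold_ms=500.0, window_ms=600.0):
        """Accept the utterance as a command only if a sufficient gaze dwell
        and a P300 signal both co-occur with it."""
        gaze_ok = (gaze.dwell_ms > dwell_threshold_ms and
                   abs(gaze.t_ms - utterance_t_ms) <= window_ms)
        eeg_ok = eeg.is_p300 and abs(eeg.t_ms - utterance_t_ms) <= window_ms
        if gaze_ok and eeg_ok:
            return ("command", gaze.target_id)
        return ("non-command", None)

    # Example: 620 ms dwell on "lamp_01" and a P300 300 ms after speech onset.
    print(classify_utterance(1000.0, GazeEvent("lamp_01", 620.0, 1100.0),
                             EEGEvent(True, 1300.0)))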

Derivative 1.3: Cross-Domain Application in Sterile Surgical Environments

  • Enabling Description: In a surgical robotics suite, a surgeon wears augmented reality (AR) glasses with an integrated microphone and an inward-facing camera for eye tracking. When the surgeon issues a command such as "increase power," the system determines intent. If the surgeon's gaze is directed at the robotic arm's control panel (displayed in the AR view), the command is routed to the robot. If their gaze is directed at a human nurse, the command is ignored by the robotic system and instead relayed through a speaker to the nurse. The "visual characteristic" is the POG relative to virtual objects in the AR overlay.
  • Mermaid Diagram:
    stateDiagram-v2
        [*] --> Idle
        Idle --> Listening: Surgeon speaks
        Listening --> Intent_Analysis: Gaze Data Received
    
        state Intent_Analysis {
            [*] --> Gaze_On_Robot_UI: Eye tracking POG on AR robot controls
            Gaze_On_Robot_UI --> Route_To_Robot
            [*] --> Gaze_On_Human: Eye tracking POG on human colleague
            Gaze_On_Human --> Ignore_Command
        }
    
        Route_To_Robot --> Executed: Command sent to surgical robot
        Ignore_Command --> Idle: Command is for human staff
        Executed --> Idle
    
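  • Illustrative Code Sketch (Python): A minimal sketch of the gaze-based routing rule described above; the region labels for the AR overlay and the return values are illustrative assumptions.
    def route_spoken_command(command_text, gaze_target):
        """Route a surgeon's utterance according to point of gaze (POG)."""
        if gaze_target == "ar_robot_control_panel":
            return ("robot", command_text)      # forwarded to the surgical robot
        if gaze_target == "human_colleague":
            return ("speaker", command_text)    # relayed audibly to the nurse
        return ("ignore", None)                 # ambiguous gaze: take no action

    # Example: "increase power" while looking at the AR robot controls.
    print(route_spoken_command("increase power", "ar_robot_control_panel"))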

Derivative 1.4: Cross-Domain Application in Livestock Monitoring (AgTech)

  • Enabling Description: Arrays of pan-tilt-zoom (PTZ) cameras and long-range microphones are installed throughout a cattle feedlot. The system continuously analyzes audio for bovine vocalizations indicative of distress (e.g., specific pitch and duration). Upon detecting such a vocalization, it directs the nearest camera to the source. The video stream is then analyzed to identify "visual characteristics" of distress, such as limping, isolation from the herd, or postural abnormalities. A "command" is determined if both audio and visual distress indicators are present, triggering an alert to the rancher's mobile device with the animal's tag number and location.
  • Mermaid Diagram:
    graph TD
        subgraph Monitoring_System
            A(Audio Analysis) -- Detects Distress Vocalization --> B(Cue Camera);
            B -- PTZ Camera focuses on source --> C(Video Analysis);
            C -- Identifies Visual Distress Signs --> D(Confirm Intent);
        end
        D -- Both Modalities Positive --> E{Send Alert};
        E --> F[Rancher's Device];
        C -- No Visual Distress --> A;
    
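  • Illustrative Code Sketch (Python): A minimal sketch of the two-stage, audio-then-video confirmation described above; the pitch/duration thresholds, feature names, and alert payload are illustrative assumptions.
    def audio_distress(vocalization):
        # Crude rule: distress calls are assumed to be high-pitched and sustained.
        return vocalization["pitch_hz"] > 500.0 and vocalization["duration_s"] > 2.0

    def visual_distress(video_features):
        # Any one of the listed visual indicators counts as visual distress.
        return any(video_features.get(k) for k in
                   ("limping", "isolation", "abnormal_posture"))

    def evaluate(vocalization, video_features, animal_tag, location):
        """Alert only when both modalities indicate distress."""
        if audio_distress(vocalization) and visual_distress(video_features):
            return {"alert": True, "tag": animal_tag, "location": location}
        return {"alert": False}

    # Example: a long, high-pitched call plus a limping animal triggers an alert.
    print(evaluate({"pitch_hz": 620.0, "duration_s": 3.1}, {"limping": True},
                   animal_tag="A-1042", location=(44.02, -102.55)))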

Derivative 1.5: Integration with AI for Predictive Intent Modeling

  • Enabling Description: The system integrates a transformer-based neural network that is pre-trained on a massive dataset of human interactions. It receives the real-time audio and video streams (as sequences of feature vectors). Instead of just classifying the current utterance, the model predicts a probability distribution over a set of potential future commands the user might issue in the next 1-3 seconds. If a spoken command matches a high-probability prediction from the model, the confidence threshold for executing the command is lowered, allowing for faster, more responsive interaction, especially in high-noise environments where the audio signal may be degraded.
  • Mermaid Diagram:
    classDiagram
        class User {
            +Utterance
            +FacialData
        }
        class PredictiveIntentModel {
            -TransformerNetwork
            +process(audio_features, video_features)
            +predictNextCommands(top_k) : list
        }
        class CommandInterpreter {
            +transcribe(audio) : string
            +execute(command)
        }
        User --> PredictiveIntentModel : provides features
        PredictiveIntentModel --> CommandInterpreter : provides predictions
        User --> CommandInterpreter : provides audio
    
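  • Illustrative Code Sketch (Python): A minimal sketch of the threshold-lowering logic described above. The transformer itself is abstracted away behind the predicted_commands argument, and all numeric thresholds are illustrative assumptions.
    def should_execute(transcript, asr_confidence, predicted_commands,
                       base_threshold=0.80, predicted_threshold=0.60):
        """Execute when ASR confidence clears the threshold; the threshold is
        lowered if the transcript matches a high-probability prediction."""
        threshold = (predicted_threshold if transcript in predicted_commands
                     else base_threshold)
        return asr_confidence >= threshold

    # Example: a noise-degraded utterance (confidence 0.65) is still accepted
    # because the predictive model had already ranked it among its top-k.
    print(should_execute("turn on the lights", 0.65,
                         predicted_commands={"turn on the lights", "lock the door"}))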

Disclosures Pertaining to Claim 17: Dialog State Context

This claim adds the use of conversational context. The following derivatives expand this by redefining "dialog state" and its application.

Derivative 17.1: Expansion of "Dialog State" to Environmental and Biosensor State

  • Enabling Description: The "state of a dialog" is expanded to include data from a network of IoT sensors. In a vehicle, this includes the current navigation route, weather conditions (from an external API), and cabin occupancy (from weight sensors in seats). The user's biometric state is monitored via a smartwatch, providing heart rate and galvanic skin response. An utterance like "it's getting dark" is interpreted as a command to turn on the headlights only if the IoT state confirms ambient light is below a set lumen threshold and the dialog state indicates no ongoing conversation about philosophy. An utterance like "I'm stressed" combined with high heart rate data will cause the system to suggest a calming playlist.
  • Mermaid Diagram:
    erDiagram
        USER {
            string utterance
            string biometric_state
        }
        SYSTEM {
            string dialog_history
            string environmental_state
            string vehicle_state
        }
        INTENT_PROCESSOR {
            string fused_context
        }
        USER ||--o{ INTENT_PROCESSOR : provides
        SYSTEM ||--o{ INTENT_PROCESSOR : provides
    
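  • Illustrative Code Sketch (Python): A minimal sketch of the expanded dialog-state rules described above; the sensor field names, the 400 lux cutoff, the 100 bpm cutoff, and the action labels are illustrative assumptions.
    def interpret(utterance, context):
        if utterance == "it's getting dark":
            # Headlight command only if ambient light is actually low and the
            # dialog is not an unrelated (e.g., philosophical) conversation.
            if (context["ambient_lux"] < 400
                    and context["dialog_topic"] != "philosophy"):
                return "turn_on_headlights"
            return "no_action"
        if utterance == "I'm stressed" and context["heart_rate_bpm"] > 100:
            return "suggest_calming_playlist"
        return "no_action"

    # Example: low ambient light during a navigation-related dialog.
    print(interpret("it's getting dark",
                    {"ambient_lux": 120, "dialog_topic": "navigation",
                     "heart_rate_bpm": 72}))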

Derivative 17.2: Cross-Domain Application in Automated Educational Tutors

  • Enabling Description: An AI-powered language tutor uses a comprehensive dialog state that tracks the student's learning history, including common grammatical errors and vocabulary weaknesses. When the student speaks a phrase, the system processes the audio. The video input is analyzed for facial cues of confusion (e.g., furrowed brow). If the student's spoken phrase contains a grammatical error previously flagged in the dialog state, and the visual cues indicate confusion, the system interrupts to provide a targeted correction. If no confusion is detected, it allows the conversation to flow, assuming a minor slip of the tongue. The "command" is an implicit request for help.
  • Mermaid Diagram:
    flowchart LR
        subgraph Student
            A[Speaks Phrase]
            B[Facial Expression]
        end
        subgraph TutorSystem
            C[Audio Processing]
            D[Video Processing]
            E["Access Dialog State <br> (Past Errors)"]
        end
        A --> C
        B --> D
        F{"Fuse Inputs & State"}
        C & D & E --> F
        F -- Error matches past & Confusion detected --> G[Provide Correction]
        F -- Else --> H[Continue Conversation]
    
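  • Illustrative Code Sketch (Python): A minimal sketch of the correction-gating rule described above; the error detector, the confusion score, and the 0.7 threshold are illustrative assumptions.
    def tutor_response(detected_errors, past_error_types, confusion_score,
                       confusion_threshold=0.7):
        """Interrupt with a correction only when a previously flagged error
        recurs and facial cues indicate confusion; otherwise let the
        conversation flow."""
        recurring = [e for e in detected_errors if e in past_error_types]
        if recurring and confusion_score >= confusion_threshold:
            return ("correct", recurring[0])
        return ("continue", None)

    # Example: a known article-agreement error plus a furrowed-brow score of 0.85.
    print(tutor_response(["article_agreement"],
                         {"article_agreement", "past_tense"}, 0.85))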

Derivative 17.3: The "Inverse" or Failure Mode: Stateless Privacy Mode

  • Enabling Description: The system is designed with a user-selectable "stateless" mode. When activated, the system intentionally purges all dialog history after each interaction. It does not store logs of conversations or commands. In this mode, the determination of intent relies exclusively on the immediate audio and video input from the current utterance. This provides a higher degree of user privacy at the cost of contextual awareness. For example, the system cannot resolve anaphora (e.g., "turn it off") and will prompt for clarification, as it has no memory of what "it" refers to. This is a failsafe for privacy-sensitive applications.
  • Mermaid Diagram:
    stateDiagram-v2
        state "Standard Mode" as S1
        state "Stateless Mode" as S2
    
        [*] --> S1
        S1 --> S2: User Toggles Privacy
        S2 --> S1: User Toggles Privacy
    
        S1: Utterance -> Process(audio, video, history) -> Action
        S2: Utterance -> PurgeHistory -> Process(audio, video) -> Action/Prompt
    
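  • Illustrative Code Sketch (Python): A minimal sketch of the stateless privacy mode described above; the anaphora check and the processing pipeline are simplified placeholders, not the claimed implementation.
    class IntentEngine:
        def __init__(self, stateless=False):
            self.stateless = stateless
            self.history = []                  # dialog state; purged in stateless mode

        def handle(self, utterance, visual_features):
            if self.stateless:
                self.history.clear()           # intentionally forget prior turns
            # Anaphora ("it", "that") needs history; without it, ask the user.
            if (any(w in utterance.lower().split() for w in ("it", "that"))
                    and not self.history):
                return "Please clarify which device you mean."
            action = f"execute: {utterance}"   # placeholder for multimodal processing
            if not self.stateless:
                self.history.append(utterance)
            return action

    # Example: "Turn it off" cannot be resolved without dialog history.
    print(IntentEngine(stateless=True).handle("Turn it off", visual_features={}))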

Disclosures Pertaining to Claim 18: The Physical System

This claim covers the physical hardware. The following derivatives propose alternative and advanced hardware architectures.

Derivative 18.1: Distributed System Architecture using Edge Computing

  • Enabling Description: The system is not a single computing device but a distributed network. The "audio input device" and "video input device" (e.g., cameras and mics in a smart home) are edge nodes. These nodes perform initial feature extraction locally using low-power processors (e.g., ARM Cortex-M series). The camera extracts facial landmark vectors, and the microphone extracts MFCCs (Mel-frequency cepstral coefficients). Only these low-bandwidth feature vectors are transmitted over the network to a central hub or cloud service for the final, computationally expensive intent fusion and command processing. This architecture preserves privacy (raw video/audio doesn't leave the room) and saves network bandwidth.
  • Mermaid Diagram:
    graph TD
        subgraph Edge_Device_1
            A[Camera] --> B(Local Feature Extractor <br> Facial Landmarks);
        end
        subgraph Edge_Device_2
            C[Microphone] --> D(Local Feature Extractor <br> MFCCs);
        end
        B -- Landmark Vector --> F{Central Fusion Hub};
        D -- MFCC Vector --> F;
        F --> G[Intent Determination];
        G --> H[Command Execution];
    
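  • Illustrative Code Sketch (Python): A minimal sketch of the edge-to-hub message described above. The feature extractors are stubbed out, and the field names and vector sizes are illustrative assumptions; the point is that only compact feature vectors, not raw audio or video, leave the edge node.
    import json, time

    def extract_facial_landmarks(frame):
        # Placeholder for on-device landmark detection (e.g., 68 x/y pairs).
        return [0.0] * 136

    def extract_mfccs(audio_chunk):
        # Placeholder for on-device MFCC computation (e.g., 13 coefficients).
        return [0.0] * 13

    def edge_message(node_id, frame, audio_chunk):
        """Serialize only low-bandwidth feature vectors for the fusion hub."""
        return json.dumps({
            "node": node_id,
            "timestamp": time.time(),
            "landmarks": extract_facial_landmarks(frame),
            "mfcc": extract_mfccs(audio_chunk),
        })

    # Roughly a kilobyte per update, versus megabits per second of raw A/V.
    print(len(edge_message("living_room_node_1", frame=None, audio_chunk=None)))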

Derivative 18.2: Component Substitution with Neuromorphic Hardware

  • Enabling Description: The "computing device" is a specialized neuromorphic processor (e.g., based on Loihi or SpiNNaker architecture). Both the audio and video sensor data are converted into asynchronous spike trains. The intent determination model is implemented as a Spiking Neural Network (SNN). This hardware architecture provides extreme low-power operation, making it suitable for always-on applications in battery-powered devices like wearables or AR glasses. The processing is event-driven, consuming power only when new visual or auditory information is detected.
  • Mermaid Diagram:
    sequenceDiagram
        participant Sensor_Video
        participant Sensor_Audio
        participant SpikingEncoder
        participant Neuromorphic_SNN
        participant Actuator
    
        Sensor_Video->>SpikingEncoder: Raw pixel data
        Sensor_Audio->>SpikingEncoder: Raw audio waveform
        SpikingEncoder->>Neuromorphic_SNN: Asynchronous Spike Trains
        Neuromorphic_SNN->>Neuromorphic_SNN: Processes spikes, determines intent
        Neuromorphic_SNN->>Actuator: Triggers command action
    
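  • Illustrative Code Sketch (Python): A minimal sketch of event-driven spike encoding for the neuromorphic pipeline described above, using simple delta (threshold-crossing) modulation; the threshold value and input signal are illustrative assumptions.
    def delta_spike_encode(samples, threshold=0.1):
        """Emit (index, +1/-1) spike events only when the signal moves by more
        than the threshold since the last emitted level -- no change, no spikes."""
        spikes, level = [], samples[0]
        for i, x in enumerate(samples[1:], start=1):
            while x - level > threshold:
                level += threshold
                spikes.append((i, +1))
            while level - x > threshold:
                level -= threshold
                spikes.append((i, -1))
        return spikes

    # Example: a mostly flat signal yields spikes only at the step change.
    print(delta_spike_encode([0.0, 0.0, 0.0, 0.25, 0.25, 0.25]))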

Combination Prior Art Scenarios with Open-Source Standards

  1. Combination with OpenCV and WebRTC: An implementation is disclosed wherein the system operates entirely within a web browser. The video and audio input devices are a standard webcam and microphone accessed via the WebRTC getUserMedia() API. The received video frames are processed client-side in a WebAssembly module that uses the OpenCV.js library to perform real-time facial landmark detection and head pose estimation. These "visual characteristics" are combined with the audio stream (which may be transcribed locally or sent to a server) to determine command intent, enabling any website to become a multimodal conversational agent without requiring plugins or dedicated hardware.

  2. Combination with Kaldi and MQTT: A system is disclosed for an industrial control environment (IIoT). Microphones distributed throughout a factory floor serve as the audio input devices; they run a lightweight build of the Kaldi speech recognition toolkit for keyword spotting. Cameras act as the video input devices. When a worker speaks a potential command, the Kaldi keyword spotter and the video feed (analyzed for gestures or gaze direction) provide the inputs. The fused intent is then published as a message to a lightweight MQTT (Message Queuing Telemetry Transport) broker, and subscribed robotic arms or machinery act on the command, creating a robust, low-latency, standards-based factory control system; a schematic sketch of such a fused-intent message follows this list.

  3. Combination with Android Open Source Project (AOSP) and TensorFlow Lite: A system is disclosed as a modification to the core AOSP accessibility services. The audio input is from the device microphone and the video input is from the front-facing camera. A TensorFlow Lite model, optimized for mobile NPUs, is integrated into the OS. This model continuously processes the audio/video streams to determine if a user with motor impairments is attempting to issue a command versus speaking to someone else in the room. This multimodal intent signal is then used to activate the standard AOSP Voice Access service, reducing false activations and making the device more usable.

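Illustrative Code Sketch (Python): The following sketch corresponds to combination scenario 2 above and shows one possible shape for the fused-intent message published to the MQTT broker; the topic naming scheme, payload fields, and confirmation rule are illustrative assumptions rather than a prescribed wire format.

    import json

    def build_intent_message(keyword, keyword_confidence, gaze_target, gesture):
        """Fuse keyword-spotting output with visual cues into one MQTT payload."""
        return {
            "topic": f"factory/commands/{gaze_target}",   # e.g., factory/commands/press_3
            "payload": json.dumps({
                "command": keyword,
                "audio_confidence": keyword_confidence,
                "visual_confirmation": gesture in ("pointing", "sustained_gaze"),
            }),
            "qos": 1,                                      # at-least-once delivery
        }

    # A subscribed robot arm would act only when visual_confirmation is true.
    msg = build_intent_message("stop_conveyor", 0.92, "press_3", "sustained_gaze")
    print(msg["topic"], msg["payload"])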