Patent 12236947

Derivative works

Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.

Defensive Disclosure and Prior Art Generation

Publication Date: May 8, 2026
Reference ID: DPD-T7-12236947
Title: System and Method for Multimodal, Context-Aware, and Failsafe Determination of Command Intent for Human-Machine Interfaces

This document discloses a series of technical implementations and derivative works related to the core concepts embodied in U.S. Patent 12,236,947. The purpose of this disclosure is to place these concepts into the public domain, thereby establishing prior art against future patent applications claiming these or similar incremental innovations.


Disclosures Pertaining to Claim 1: Multimodal Command Recognition

The core claim involves processing audio and video to identify a user's command intent. The following derivatives expand upon this concept.

Derivative 1.1: Component Substitution with Non-Visible Spectrum and Non-Acoustic Sensors

  • Enabling Description: The system is implemented using a long-wave infrared (LWIR) thermal camera instead of a standard CMOS/RGB camera. The "visual characteristic" is a thermal signature change in the user's perioral and nasal regions, which corresponds to the exhalation patterns of speech. This method is effective in zero-light conditions. The audio input is supplemented by a bone conduction transducer pressed against the user's mastoid process, capturing vocal vibrations directly, rendering the system highly immune to ambient acoustic noise. Data from the LWIR sensor (as a 32x32 pixel thermal array) and the bone conduction sensor are fed into a convolutional neural network (CNN) for intent fusion.
  • Mermaid Diagram:
    graph TD
        A[User Utterance] --> B{LWIR Thermal Camera};
        A --> C{Bone Conduction Transducer};
        B --> D["Thermal Feature Extraction <br>(e.g., Perioral heat map)"];
        C --> E[Vibrational Audio Processing];
        D & E --> F[Fusion CNN];
        F --> G{"Intent Classification <br>(Command / Non-Command)"};
        G --> H[System Action];
    
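  • Illustrative Code Sketch (Python): A minimal sketch of the fusion CNN described above, assuming PyTorch is available; the layer sizes, the 40-dimensional bone-conduction feature vector, and the class name ThermalAudioFusionCNN are illustrative assumptions, not taken from the patent.
    import torch
    import torch.nn as nn

    class ThermalAudioFusionCNN(nn.Module):
        def __init__(self, n_audio_features=40, n_classes=2):
            super().__init__()
            # Branch 1: 32x32 single-channel LWIR thermal frame.
            self.thermal_branch = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                     # 32x32 -> 16x16
                nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                     # 16x16 -> 8x8
                nn.Flatten(),                        # 16 * 8 * 8 = 1024 features
            )
            # Branch 2: framed features from the bone conduction transducer.
            self.audio_branch = nn.Sequential(
                nn.Linear(n_audio_features, 64), nn.ReLU(),
            )
            # Fusion head: command vs. non-command.
            self.classifier = nn.Sequential(
                nn.Linear(1024 + 64, 128), nn.ReLU(),
                nn.Linear(128, n_classes),
            )

        def forward(self, thermal_frame, audio_features):
            t = self.thermal_branch(thermal_frame)   # (batch, 1024)
            a = self.audio_branch(audio_features)    # (batch, 64)
            return self.classifier(torch.cat([t, a], dim=1))

    # Example: one 32x32 thermal frame and one 40-dim bone-conduction vector.
    model = ThermalAudioFusionCNN()
    logits = model(torch.randn(1, 1, 32, 32), torch.randn(1, 40))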

Derivative 1.2: Component Substitution with Gaze and Neurological Input

  • Enabling Description: The video input device is a dedicated eye-tracking module using infrared emitters and sensors to calculate the user's precise point of gaze (POG). The "visual characteristic" is the user's gaze dwelling on a system-controllable object for more than 500 ms concurrently with an utterance. This is combined with audio input and a consumer-grade electroencephalography (EEG) headband. The system identifies a command only if the utterance co-occurs with a P300 event-related potential (ERP) in the EEG signal, indicating a recognition/decision event in the user's brain.
  • Mermaid Diagram:
    sequenceDiagram
        participant User
        participant EyeTracker
        participant EEG_Headband
        participant AudioMic
        participant FusionEngine
    
        User->>+AudioMic: Speaks "Activate that"
        User->>+EyeTracker: Looks at target device
        User->>+EEG_Headband: Brain registers decision (P300 wave)
    
        AudioMic->>FusionEngine: Provides audio stream
        EyeTracker->>FusionEngine: Provides Point of Gaze (POG) data
        EEG_Headband->>FusionEngine: Provides EEG data stream
    
        FusionEngine->>FusionEngine: Fuses POG, Audio, and P300 signal
        FusionEngine-->>User: Executes command on target device
        deactivate AudioMic
        deactivate EyeTracker
        deactivate EEG_Headband
    
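  • Illustrative Code Sketch (Python): A minimal sketch of the co-occurrence gate described above. The >500 ms dwell requirement comes from the description; the 600 ms co-occurrence window, the event structures, and the function name are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class GazeEvent:
        target_id: str     # system-controllable object the gaze dwelt on
        dwell_ms: float    # dwell duration on that object
        t_ms: float        # timestamp of the dwell

    @dataclass
    class EEGEvent:
        is_p300: bool      # P300 event-related potential detected
        t_ms: float        # timestamp of the detection

    def classify_utterance(utterance_t_ms, gaze, eeg,
                           dwell_threshold_ms=500.0, window_ms=600.0):
        """Accept the utterance as a command only if a sufficient gaze dwell
        and a P300 signal both co-occur with it."""
        gaze_ok = (gaze.dwell_ms > dwell_threshold_ms and
                   abs(gaze.t_ms - utterance_t_ms) <= window_ms)
        eeg_ok = eeg.is_p300 and abs(eeg.t_ms - utterance_t_ms) <= window_ms
        if gaze_ok and eeg_ok:
            return ("command", gaze.target_id)
        return ("non-command", None)

    # Example: 620 ms dwell on "lamp_01" and a P300 300 ms after speech onset.
    print(classify_utterance(1000.0, GazeEvent("lamp_01", 620.0, 1100.0),
                             EEGEvent(True, 1300.0)))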

Derivative 1.3: Cross-Domain Application in Sterile Surgical Environments

  • Enabling Description: In a surgical robotics suite, a surgeon wears augmented reality (AR) glasses with an integrated microphone and an inward-facing camera for eye tracking. When the surgeon issues a command such as "increase power," the system determines intent. If the surgeon's gaze is directed at the robotic arm's control panel (displayed in the AR view), the command is routed to the robot. If their gaze is directed at a human nurse, the command is ignored by the robotic system and instead relayed through a speaker to the nurse. The "visual characteristic" is the POG relative to virtual objects in the AR overlay.
  • Mermaid Diagram:
    stateDiagram-v2
        [*] --> Idle
        Idle --> Listening: Surgeon speaks
        Listening --> Intent_Analysis: Gaze Data Received
    
        state Intent_Analysis {
            [*] --> Gaze_On_Robot_UI: Eye tracking POG on AR robot controls
            Gaze_On_Robot_UI --> Route_To_Robot
            [*] --> Gaze_On_Human: Eye tracking POG on human colleague
            Gaze_On_Human --> Ignore_Command
        }
    
        Route_To_Robot --> Executed: Command sent to surgical robot
        Ignore_Command --> Idle: Command is for human staff
        Executed --> Idle
    
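  • Illustrative Code Sketch (Python): A minimal sketch of the gaze-based routing rule described above; the region labels for the AR overlay and the return values are illustrative assumptions.
    def route_spoken_command(command_text, gaze_target):
        """Route a surgeon's utterance according to point of gaze (POG)."""
        if gaze_target == "ar_robot_control_panel":
            return ("robot", command_text)      # forwarded to the surgical robot
        if gaze_target == "human_colleague":
            return ("speaker", command_text)    # relayed audibly to the nurse
        return ("ignore", None)                 # ambiguous gaze: take no action

    # Example: "increase power" while looking at the AR robot controls.
    print(route_spoken_command("increase power", "ar_robot_control_panel"))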

Derivative 1.4: Cross-Domain Application in Livestock Monitoring (AgTech)

  • Enabling Description: Arrays of pan-tilt-zoom (PTZ) cameras and long-range microphones are installed throughout a cattle feedlot. The system continuously analyzes audio for bovine vocalizations indicative of distress (e.g., specific pitch and duration). Upon detecting such a vocalization, it directs the nearest camera to the source. The video stream is then analyzed to identify "visual characteristics" of distress, such as limping, isolation from the herd, or postural abnormalities. A "command" is determined if both audio and visual distress indicators are present, triggering an alert to the rancher's mobile device with the animal's tag number and location.
  • Mermaid Diagram:
    graph TD
        subgraph Monitoring_System
            A(Audio Analysis) -- Detects Distress Vocalization --> B(Cue Camera);
            B -- PTZ Camera focuses on source --> C(Video Analysis);
            C -- Identifies Visual Distress Signs --> D(Confirm Intent);
        end
        D -- Both Modalities Positive --> E{Send Alert};
        E --> F[Rancher's Device];
        C -- No Visual Distress --> A;
    
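  • Illustrative Code Sketch (Python): A minimal sketch of the two-stage, audio-then-video confirmation described above; the pitch/duration thresholds, feature names, and alert payload are illustrative assumptions.
    def audio_distress(vocalization):
        # Crude rule: distress calls are assumed to be high-pitched and sustained.
        return vocalization["pitch_hz"] > 500.0 and vocalization["duration_s"] > 2.0

    def visual_distress(video_features):
        # Any one of the listed visual indicators counts as visual distress.
        return any(video_features.get(k) for k in
                   ("limping", "isolation", "abnormal_posture"))

    def evaluate(vocalization, video_features, animal_tag, location):
        """Alert only when both modalities indicate distress."""
        if audio_distress(vocalization) and visual_distress(video_features):
            return {"alert": True, "tag": animal_tag, "location": location}
        return {"alert": False}

    # Example: a long, high-pitched call plus a limping animal triggers an alert.
    print(evaluate({"pitch_hz": 620.0, "duration_s": 3.1}, {"limping": True},
                   animal_tag="A-1042", location=(44.02, -102.55)))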

Derivative 1.5: Integration with AI for Predictive Intent Modeling

  • Enabling Description: The system integrates a transformer-based neural network that is pre-trained on a massive dataset of human interactions. It receives the real-time audio and video streams (as sequences of feature vectors). Instead of just classifying the current utterance, the model predicts a probability distribution over a set of potential future commands the user might issue in the next 1-3 seconds. If a spoken command matches a high-probability prediction from the model, the confidence threshold for executing the command is lowered, allowing for faster, more responsive interaction, especially in high-noise environments where the audio signal may be degraded.
  • Mermaid Diagram:
    classDiagram
        class User {
            +Utterance
            +FacialData
        }
        class PredictiveIntentModel {
            -TransformerNetwork
            +process(audio_features, video_features)
            +predictNextCommands(top_k) : list
        }
        class CommandInterpreter {
            +transcribe(audio) : string
            +execute(command)
        }
        User --> PredictiveIntentModel : provides features
        PredictiveIntentModel --> CommandInterpreter : provides predictions
        User --> CommandInterpreter : provides audio
    
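  • Illustrative Code Sketch (Python): A minimal sketch of the threshold-lowering logic described above. The transformer itself is abstracted away behind the predicted_commands argument, and all numeric thresholds are illustrative assumptions.
    def should_execute(transcript, asr_confidence, predicted_commands,
                       base_threshold=0.80, predicted_threshold=0.60):
        """Execute when ASR confidence clears the threshold; the threshold is
        lowered if the transcript matches a high-probability prediction."""
        threshold = (predicted_threshold if transcript in predicted_commands
                     else base_threshold)
        return asr_confidence >= threshold

    # Example: a noise-degraded utterance (confidence 0.65) is still accepted
    # because the predictive model had already ranked it among its top-k.
    print(should_execute("turn on the lights", 0.65,
                         predicted_commands={"turn on the lights", "lock the door"}))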

Disclosures Pertaining to Claim 17: Dialog State Context

This claim adds the use of conversational context. The following derivatives expand this by redefining "dialog state" and its application.

Derivative 17.1: Expansion of "Dialog State" to Environmental and Biosensor State

  • Enabling Description: The "state of a dialog" is expanded to include data from a network of IoT sensors. In a vehicle, this includes the current navigation route, weather conditions (from an external API), and cabin occupancy (from weight sensors in seats). The user's biometric state is monitored via a smartwatch, providing heart rate and galvanic skin response. An utterance like "it's getting dark" is interpreted as a command to turn on the headlights only if the IoT state confirms ambient light is below a set lumen threshold and the dialog state indicates no ongoing conversation about philosophy. An utterance like "I'm stressed" combined with high heart rate data will cause the system to suggest a calming playlist.
  • Mermaid Diagram:
    erDiagram
        USER {
            string utterance
            string biometric_state
        }
        SYSTEM {
            string dialog_history
            string environmental_state
            string vehicle_state
        }
        INTENT_PROCESSOR {
            string fused_context
        }
        USER ||--o{ INTENT_PROCESSOR : provides
        SYSTEM ||--o{ INTENT_PROCESSOR : provides
    
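  • Illustrative Code Sketch (Python): A minimal sketch of the expanded dialog-state rules described above; the sensor field names, the 400 lux cutoff, the 100 bpm cutoff, and the action labels are illustrative assumptions.
    def interpret(utterance, context):
        if utterance == "it's getting dark":
            # Headlight command only if ambient light is actually low and the
            # dialog is not an unrelated (e.g., philosophical) conversation.
            if (context["ambient_lux"] < 400
                    and context["dialog_topic"] != "philosophy"):
                return "turn_on_headlights"
            return "no_action"
        if utterance == "I'm stressed" and context["heart_rate_bpm"] > 100:
            return "suggest_calming_playlist"
        return "no_action"

    # Example: low ambient light during a navigation-related dialog.
    print(interpret("it's getting dark",
                    {"ambient_lux": 120, "dialog_topic": "navigation",
                     "heart_rate_bpm": 72}))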

Derivative 17.2: Cross-Domain Application in Automated Educational Tutors

  • Enabling Description: An AI-powered language tutor uses a comprehensive dialog state that tracks the student's learning history, including common grammatical errors and vocabulary weaknesses. When the student speaks a phrase, the system processes the audio. The video input is analyzed for facial cues of confusion (e.g., furrowed brow). If the student's spoken phrase contains a grammatical error previously flagged in the dialog state, and the visual cues indicate confusion, the system interrupts to provide a targeted correction. If no confusion is detected, it allows the conversation to flow, assuming a minor slip of the tongue. The "command" is an implicit request for help.
  • Mermaid Diagram:
    flowchart LR
        subgraph Student
            A[Speaks Phrase]
            B[Facial Expression]
        end
        subgraph TutorSystem
            C[Audio Processing]
            D[Video Processing]
            E["Access Dialog State <br> (Past Errors)"]
        end
        A --> C
        B --> D
        F{"Fuse Inputs & State"}
        C & D & E --> F
        F -- Error matches past & Confusion detected --> G[Provide Correction]
        F -- Else --> H[Continue Conversation]
    
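  • Illustrative Code Sketch (Python): A minimal sketch of the correction-gating rule described above; the error detector, the confusion score, and the 0.7 threshold are illustrative assumptions.
    def tutor_response(detected_errors, past_error_types, confusion_score,
                       confusion_threshold=0.7):
        """Interrupt with a correction only when a previously flagged error
        recurs and facial cues indicate confusion; otherwise let the
        conversation flow."""
        recurring = [e for e in detected_errors if e in past_error_types]
        if recurring and confusion_score >= confusion_threshold:
            return ("correct", recurring[0])
        return ("continue", None)

    # Example: a known article-agreement error plus a furrowed-brow score of 0.85.
    print(tutor_response(["article_agreement"],
                         {"article_agreement", "past_tense"}, 0.85))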

Derivative 17.3: The "Inverse" or Failure Mode: Stateless Privacy Mode

  • Enabling Description: The system is designed with a user-selectable "stateless" mode. When activated, the system intentionally purges all dialog history after each interaction. It does not store logs of conversations or commands. In this mode, the determination of intent relies exclusively on the immediate audio and video input from the current utterance. This provides a higher degree of user privacy at the cost of contextual awareness. For example, the system cannot resolve anaphora (e.g., "turn it off") and will prompt for clarification, as it has no memory of what "it" refers to. This is a failsafe for privacy-sensitive applications.
  • Mermaid Diagram:
    stateDiagram-v2
        state "Standard Mode" as S1
        state "Stateless Mode" as S2
    
        [*] --> S1
        S1 --> S2: User Toggles Privacy
        S2 --> S1: User Toggles Privacy
    
        S1: Utterance -> Process(audio, video, history) -> Action
        S2: Utterance -> PurgeHistory -> Process(audio, video) -> Action/Prompt
    
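  • Illustrative Code Sketch (Python): A minimal sketch of the stateless privacy mode described above; the anaphora check and the processing pipeline are simplified placeholders, not the claimed implementation.
    class IntentEngine:
        def __init__(self, stateless=False):
            self.stateless = stateless
            self.history = []                  # dialog state; purged in stateless mode

        def handle(self, utterance, visual_features):
            if self.stateless:
                self.history.clear()           # intentionally forget prior turns
            # Anaphora ("it", "that") needs history; without it, ask the user.
            if (any(w in utterance.lower().split() for w in ("it", "that"))
                    and not self.history):
                return "Please clarify which device you mean."
            action = f"execute: {utterance}"   # placeholder for multimodal processing
            if not self.stateless:
                self.history.append(utterance)
            return action

    # Example: "Turn it off" cannot be resolved without dialog history.
    print(IntentEngine(stateless=True).handle("Turn it off", visual_features={}))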

Disclosures Pertaining to Claim 18: The Physical System

This claim covers the physical hardware. The following derivatives propose alternative and advanced hardware architectures.

Derivative 18.1: Distributed System Architecture using Edge Computing

  • Enabling Description: The system is not a single computing device but a distributed network. The "audio input device" and "video input device" (e.g., cameras and mics in a smart home) are edge nodes. These nodes perform initial feature extraction locally using low-power processors (e.g., ARM Cortex-M series). The camera extracts facial landmark vectors, and the microphone extracts MFCCs (Mel-frequency cepstral coefficients). Only these low-bandwidth feature vectors are transmitted over the network to a central hub or cloud service for the final, computationally expensive intent fusion and command processing. This architecture preserves privacy (raw video/audio doesn't leave the room) and saves network bandwidth.
  • Mermaid Diagram:
    graph TD
        subgraph Edge_Device_1
            A[Camera] --> B(Local Feature Extractor <br> Facial Landmarks);
        end
        subgraph Edge_Device_2
            C[Microphone] --> D(Local Feature Extractor <br> MFCCs);
        end
        B -- Landmark Vector --> F{Central Fusion Hub};
        D -- MFCC Vector --> F;
        F --> G[Intent Determination];
        G --> H[Command Execution];
    
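  • Illustrative Code Sketch (Python): A minimal sketch of the edge-to-hub message described above. The feature extractors are stubbed out, and the field names and vector sizes are illustrative assumptions; the point is that only compact feature vectors, not raw audio or video, leave the edge node.
    import json, time

    def extract_facial_landmarks(frame):
        # Placeholder for on-device landmark detection (e.g., 68 x/y pairs).
        return [0.0] * 136

    def extract_mfccs(audio_chunk):
        # Placeholder for on-device MFCC computation (e.g., 13 coefficients).
        return [0.0] * 13

    def edge_message(node_id, frame, audio_chunk):
        """Serialize only low-bandwidth feature vectors for the fusion hub."""
        return json.dumps({
            "node": node_id,
            "timestamp": time.time(),
            "landmarks": extract_facial_landmarks(frame),
            "mfcc": extract_mfccs(audio_chunk),
        })

    # Roughly a kilobyte per update, versus megabits per second of raw A/V.
    print(len(edge_message("living_room_node_1", frame=None, audio_chunk=None)))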

Derivative 18.2: Component Substitution with Neuromorphic Hardware

  • Enabling Description: The "computing device" is a specialized neuromorphic processor (e.g., based on Loihi or SpiNNaker architecture). Both the audio and video sensor data are converted into asynchronous spike trains. The intent determination model is implemented as a Spiking Neural Network (SNN). This hardware architecture provides extreme low-power operation, making it suitable for always-on applications in battery-powered devices like wearables or AR glasses. The processing is event-driven, consuming power only when new visual or auditory information is detected.
  • Mermaid Diagram:
    sequenceDiagram
        participant Sensor_Video
        participant Sensor_Audio
        participant SpikingEncoder
        participant Neuromorphic_SNN
        participant Actuator
    
        Sensor_Video->>SpikingEncoder: Raw pixel data
        Sensor_Audio->>SpikingEncoder: Raw audio waveform
        SpikingEncoder->>Neuromorphic_SNN: Asynchronous Spike Trains
        Neuromorphic_SNN->>Neuromorphic_SNN: Processes spikes, determines intent
        Neuromorphic_SNN->>Actuator: Triggers command action
    
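  • Illustrative Code Sketch (Python): A minimal sketch of event-driven spike encoding for the neuromorphic pipeline described above, using simple delta (threshold-crossing) modulation; the threshold value and input signal are illustrative assumptions.
    def delta_spike_encode(samples, threshold=0.1):
        """Emit (index, +1/-1) spike events only when the signal moves by more
        than the threshold since the last emitted level -- no change, no spikes."""
        spikes, level = [], samples[0]
        for i, x in enumerate(samples[1:], start=1):
            while x - level > threshold:
                level += threshold
                spikes.append((i, +1))
            while level - x > threshold:
                level -= threshold
                spikes.append((i, -1))
        return spikes

    # Example: a mostly flat signal yields spikes only at the step change.
    print(delta_spike_encode([0.0, 0.0, 0.0, 0.25, 0.25, 0.25]))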

Combination Prior Art Scenarios with Open-Source Standards

  1. Combination with OpenCV and WebRTC: An implementation is disclosed wherein the system operates entirely within a web browser. The video and audio input devices are a standard webcam and microphone accessed via the WebRTC getUserMedia() API. The received video frames are processed client-side in a WebAssembly module that uses the OpenCV.js library to perform real-time facial landmark detection and head pose estimation. These "visual characteristics" are combined with the audio stream (which may be transcribed locally or sent to a server) to determine command intent, enabling any website to become a multimodal conversational agent without requiring plugins or dedicated hardware.

  2. Combination with Kaldi and MQTT: A system is disclosed for an industrial control environment (IIoT). Microphones distributed throughout a factory floor serve as the audio input devices; they run a lightweight build of the Kaldi speech recognition toolkit for keyword spotting. Cameras act as the video input devices. When a worker speaks a potential command, the Kaldi keyword spotter and the video feed (analyzed for gestures or gaze direction) provide the inputs. The fused intent is then published as a message to a lightweight MQTT (Message Queuing Telemetry Transport) broker, and subscribed robotic arms or machinery act on the command, creating a robust, low-latency, standards-based factory control system; a schematic sketch of such a fused-intent message follows this list.

  3. Combination with Android Open Source Project (AOSP) and TensorFlow Lite: A system is disclosed as a modification to the core AOSP accessibility services. The audio input is from the device microphone and the video input is from the front-facing camera. A TensorFlow Lite model, optimized for mobile NPUs, is integrated into the OS. This model continuously processes the audio/video streams to determine if a user with motor impairments is attempting to issue a command versus speaking to someone else in the room. This multimodal intent signal is then used to activate the standard AOSP Voice Access service, reducing false activations and making the device more usable.

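Illustrative Code Sketch (Python): The following sketch corresponds to combination scenario 2 above and shows one possible shape for the fused-intent message published to the MQTT broker; the topic naming scheme, payload fields, and confirmation rule are illustrative assumptions rather than a prescribed wire format.

    import json

    def build_intent_message(keyword, keyword_confidence, gaze_target, gesture):
        """Fuse keyword-spotting output with visual cues into one MQTT payload."""
        return {
            "topic": f"factory/commands/{gaze_target}",   # e.g., factory/commands/press_3
            "payload": json.dumps({
                "command": keyword,
                "audio_confidence": keyword_confidence,
                "visual_confirmation": gesture in ("pointing", "sustained_gaze"),
            }),
            "qos": 1,                                      # at-least-once delivery
        }

    # A subscribed robot arm would act only when visual_confirmation is true.
    msg = build_intent_message("stop_conveyor", 0.92, "press_3", "sustained_gaze")
    print(msg["topic"], msg["payload"])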