Patent 12236947

Obviousness

Combinations of prior art that suggest the claimed invention would have been obvious under 35 U.S.C. § 103.


Obviousness Analysis of US Patent 12,236,947

Date of Analysis: May 8, 2026

Patent under Review: US 12,236,947 ("the '947 patent")

Relevant Legal Standard: Under 35 U.S.C. § 103, a patent claim is invalid as obvious if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious, before the effective filing date of the claimed invention, to a person having ordinary skill in the art (POSITA).

This analysis examines the independent claims of the '947 patent in light of the cited prior art. The core of the invention, as outlined in the independent claims, is a multimodal system for recognizing voice commands by processing both audio and video inputs to determine user intent.

Claim 1 Analysis: Multimodal Command Recognition

Independent claim 1 recites a method of processing voice commands by:

  1. Receiving a first audio input.
  2. Receiving a first video input of the user.
  3. Determining the utterance is a system-directed command based on processing both the audio and video, including identifying a "visual characteristic" of the user.
  4. Causing the system to act on the command.
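To make the structure of these steps concrete, the decision logic might be sketched as follows. This is purely an illustrative model of the claim language, not the patent's actual implementation; the field names (`directed_score`, `gaze_on_device`) and the fusion weights are hypothetical stand-ins for whatever audio model and "visual characteristic" a real system would use.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioInput:
    transcript: str
    directed_score: float  # hypothetical score from an audio model

@dataclass
class VideoInput:
    face_detected: bool
    gaze_on_device: bool   # a hypothetical "visual characteristic"

def is_system_directed(audio: AudioInput, video: VideoInput,
                       threshold: float = 0.5) -> bool:
    """Fuse audio and visual cues (steps 1-3) to decide whether an
    utterance is a command directed at the system."""
    if not video.face_detected:
        return False
    # Visual attention corroborates the audio-based score.
    score = audio.directed_score + (0.3 if video.gaze_on_device else 0.0)
    return score >= threshold

def process_command(audio: AudioInput, video: VideoInput) -> Optional[str]:
    """Step 4: act on the utterance only if it is system-directed."""
    if is_system_directed(audio, video):
        return f"executing: {audio.transcript}"
    return None
```

Under this toy model, the same utterance is accepted when the user is looking at the device and rejected otherwise, which is exactly the false-trigger distinction the combination of Kim and Electronic Arts is argued to teach.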

A potential obviousness argument could be constructed by combining the teachings of US 20130297319 A1 (Kim) and US 20200310842 A1 (Electronic Arts).

  • US 20130297319 A1 ("Kim") discloses a mobile device that uses a microphone and camera to recognize user commands. Kim's abstract and detailed description focus on activating voice recognition based on detecting a user's face and voice. This reference clearly establishes the use of both audio and video inputs for a voice-activated system.

  • US 20200310842 A1 ("Electronic Arts") describes a system for tracking user sentiment, which involves analyzing audio cues (like tone) and visual cues (like facial expressions) to gauge a user's emotional state. This system is designed to understand user intent and sentiment through multimodal analysis.

Motivation to Combine: A person of ordinary skill in the art developing voice-command systems would have been motivated to combine the teachings of Kim and Electronic Arts to improve the accuracy of command recognition. At the time of the invention, a well-known problem in voice-command systems was the "false trigger," where the system would incorrectly interpret ambient conversation as a command. A POSITA would recognize that simply detecting a face and a voice, as taught by Kim, is insufficient to confirm user intent. The sentiment and intent analysis taught by Electronic Arts, which explicitly uses facial expressions ("visual characteristics") and vocal tone, provides a direct solution to this problem. By integrating the intent-analysis methods of Electronic Arts with the basic multimodal command structure of Kim, a POSITA could create a more robust system that better distinguishes between a direct command and extraneous speech. This combination would lead to the system described in claim 1, as it would use both audio and visual characteristics to determine that an utterance is a system-directed command, thus rendering the claim obvious.

Claim 17 Analysis: Dialog State Context

Independent claim 17 builds upon claim 1 by adding the limitation of "using a state of a dialog between the system and the user in the determining." This means the system's understanding of the ongoing conversation influences whether an utterance is treated as a command.
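As an illustrative sketch only (the state names and the adjustment value are hypothetical, not drawn from the patent), the dialog-state limitation could be modeled as a bias on the system-directed determination:

```python
from enum import Enum

class DialogState(Enum):
    IDLE = "idle"
    AWAITING_ANSWER = "awaiting_answer"  # system just asked a question

def directed_probability(base_score: float, state: DialogState) -> float:
    """Adjust the system-directed score using dialog state: an utterance
    arriving right after a system prompt is more likely a reply to the
    system than side conversation."""
    if state is DialogState.AWAITING_ANSWER:
        return min(1.0, base_score + 0.25)
    return base_score
```

This mirrors the contextual-cue approach attributed to Wang et al.: the same utterance crosses the command threshold when the dialog state says the system is expecting an answer, and falls short when it does not.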

This claim could be rendered obvious by combining Kim and Electronic Arts (as above) with the "Non-Patent Citations" referenced in the '947 patent's file history, specifically the 2013 IEEE paper by Wang et al., "Understanding computer-directed utterances in multi-user dialog systems."

  • Wang et al. directly addresses the challenge of distinguishing system-directed speech from inter-user conversation in a multi-user environment. A key aspect of their research is using contextual cues, including the dialog history, to make this determination. The paper explicitly discusses how the system's state (e.g., whether it has just asked a question) is a critical factor in classifying a user's utterance.

Motivation to Combine: A POSITA, having already combined Kim and Electronic Arts to improve intent recognition, would naturally look to dialog context to further refine the system's accuracy. The problem of distinguishing commands from conversation is particularly acute in a continuous dialog. The Wang et al. paper provides a clear, well-documented method for using dialog state to solve this very problem. Therefore, a POSITA would have been motivated to incorporate the dialog-state analysis from Wang et al. into the multimodal framework of Kim and Electronic Arts. This would be a predictable improvement, allowing the system to understand, for example, that a short user utterance is likely a response to a system prompt rather than a new, unrelated command. This combination directly teaches all the elements of claim 17.

Claim 18 Analysis: The Physical System

Independent claim 18 recites the physical embodiment of the method: a system comprising an audio input device, a video input device, and a computing device configured to perform the multimodal command recognition.

This claim would be rendered obvious by the same prior art combination that renders the method claims obvious.

  • Kim discloses a system with the necessary hardware: a microphone (audio input device), a camera (video input device), and the mobile device's processor (a computing device).
  • The combination of Kim, Electronic Arts, and Wang et al. teaches the functionality that the computing device would be configured to perform.

Since the underlying method is obvious, and the hardware components (microphone, camera, processor) are conventional and disclosed in the prior art for this exact purpose, the claim for the system itself would also have been obvious to a POSITA. There is no inventive concept in merely implementing an obvious method on a standard set of hardware components.

Conclusion

Based on the cited prior art, the independent claims of US Patent 12,236,947 appear to be vulnerable to an obviousness challenge under 35 U.S.C. § 103. The combination of prior art references teaches the use of multimodal (audio and video) inputs to detect user commands, including the analysis of visual characteristics to determine intent, and the use of dialog context to improve accuracy. A person of ordinary skill in the art would have been motivated to combine these teachings to solve the well-known problem of false triggers and improve the overall user experience in voice-command systems.
