Obviousness — US Patent 12125496

Correction Regarding Patent 12125496

The previous analysis incorrectly stated that US patent 12125496 could not be located. The full text of US12125496B1 has since been provided. It is treated as authoritative, and all analysis below relies on it.

Obviousness Analysis of US12125496B1 Under 35 U.S.C. § 103

Patent Identification: US12125496B1, titled "Methods for neural network-based voice enhancement and systems thereof."
Priority Date: 2023-05-05.
Current Date: 2026-04-26.

Level of Ordinary Skill in the Art (POSITA)

A person having ordinary skill in the art (POSITA) relevant to US12125496B1 would be an individual with a strong background in digital signal processing, machine learning (especially deep learning), and audio/speech processing. This would typically entail a Master's degree or equivalent professional experience in fields such as Electrical Engineering, Computer Science, or Artificial Intelligence, with expertise in neural network architectures, audio feature extraction, and speech enhancement techniques.

Overview of Independent Claims

The independent claims of US12125496B1 describe a two-stage neural network approach for real-time voice enhancement:

Claim 1 (System): A voice enhancement system that fragments input audio data (containing foreground speech content, non-content elements, and speech characteristics) into speech frames. A first neural network converts these frames into low-dimensional representations, explicitly omitting one or more non-content elements. A second neural network is then applied to these low-dimensional representations to generate target speech frames, which are combined to form output audio data that retains the foreground speech content and speech characteristics.
Claim 11 (Method): A method that includes training both the first and second neural networks. The trained first neural network converts input speech frames to low-dimensional representations, omitting non-content elements. The trained second neural network is applied to these low-dimensional representations to generate target speech frames, which are combined to produce output audio data comprising the foreground speech content and speech characteristics.
Claim 16 (Non-transitory computer-readable medium): Instructions for digitizing input audio, fragmenting it, converting frames to low-dimensional representations (omitting non-content elements) via a first neural network, applying a second neural network to generate target speech frames, combining them into output audio data, and converting to analog output.
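
The data flow recited in the independent claims can be sketched roughly as follows. This is an illustration only, not the patentee's implementation: the function names, frame length, hop size, latent dimension, and the single-layer "networks" with caller-supplied weights are all assumptions made for the example, and the digitizing and analog-conversion steps of Claim 16 are omitted.

```python
import numpy as np

def fragment(audio, frame_len, hop):
    """Split a 1-D signal into overlapping frames, zero-padding the tail."""
    n_frames = 1 + max(0, int(np.ceil((len(audio) - frame_len) / hop)))
    padded = np.zeros((n_frames - 1) * hop + frame_len)
    padded[:len(audio)] = audio
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return padded[idx]                      # shape: (n_frames, frame_len)

def first_network(frames, w_enc):
    """Hypothetical stand-in for the first network: frame -> low-dimensional code."""
    return np.tanh(frames @ w_enc)          # shape: (n_frames, latent_dim)

def second_network(codes, w_dec):
    """Hypothetical stand-in for the second network: code -> target speech frame."""
    return codes @ w_dec                    # shape: (n_frames, frame_len)

def combine(frames, hop, out_len):
    """Window-weighted overlap-add of frames back into one signal."""
    n_frames, frame_len = frames.shape
    window = np.hanning(frame_len + 2)[1:-1]   # strictly positive Hann window
    out = np.zeros((n_frames - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start:start + frame_len] += window * frame
        norm[start:start + frame_len] += window
    return (out / norm)[:out_len]

def enhance(audio, w_enc, w_dec, frame_len=256, hop=128):
    """Fragment, encode, decode, and recombine, mirroring the claimed stages."""
    frames = fragment(audio, frame_len, hop)
    codes = first_network(frames, w_enc)
    target_frames = second_network(codes, w_dec)
    return combine(target_frames, hop, len(audio)), codes
```

With untrained weights the output is not meaningful audio. The sketch only shows where the low-dimensional bottleneck sits between the two stages, which is the feature the obviousness analysis turns on.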

The feature the patent presents as its core innovation is the "low-dimensional representation," an intermediate stage between the two neural networks that omits non-content elements while preserving speech characteristics.

Cited Prior Art References

The following prior art references, with priority dates preceding US12125496B1's priority date (2023-05-05), were identified:

US11410684B1 (Amazon Technologies, Inc.): Priority Date: 2019-06-04. Relates to Text-to-Speech (TTS) processing with transfer of vocal characteristics.
US11482235B2 (Qnap Systems, Inc.): Priority Date: 2019-04-01. Discloses a speech enhancement method and system using a deep neural network to predict a target spectrum diagram from an audio signal's spectrum.
US11705147B2 (Qualcomm Incorporated): Priority Date: 2020-04-29. Describes methods, systems, and devices for speech enhancement using mixed adaptive and fixed coefficient neural networks.
US11868883B1 (Michael Lamport Commons): Priority Date: 2010-10-26. Concerns an intelligent control system with hierarchical stacked neural networks.

Obviousness Analysis (35 U.S.C. § 103)

The primary problem addressed by US12125496B1, as stated in its background, is that existing voice enhancement techniques can reduce noise but often distort essential speech features, which leads to inaccurate automatic speech recognition (ASR) results, and that they fail to improve naturally unclear speech. This establishes a known problem in the art, which gives a POSITA a reason to seek an improved design.

The independent claims of US12125496B1 introduce a two-neural-network architecture where a first network creates a low-dimensional representation of input speech frames that specifically omits non-content elements (e.g., background noise, microphone pops, low-fidelity audio) but retains foreground speech content and characteristics, before a second network reconstructs enhanced target speech.

While no single prior art reference explicitly discloses this exact two-stage, noise-omitting low-dimensional representation, a POSITA would have been motivated to combine existing knowledge and techniques from the cited prior art and the general state of the art to arrive at the claimed invention.

Combination of Prior Art and Motivation:

A POSITA, in 2023, would be aware of the following:

Neural Network-Based Speech Enhancement: Both US11482235B2 and US11705147B2 teach the use of neural networks for speech enhancement. US11482235B2 uses a deep neural network to predict a target spectrum diagram for generating an enhanced audio signal, and US11705147B2 uses mixed adaptive and fixed coefficient neural networks for general speech enhancement. These patents would provide the basic teaching of employing neural networks for enhancing speech signals.
Hierarchical/Stacked Neural Networks: US11868883B1 demonstrates the concept of hierarchical stacked neural networks for complex tasks like intelligent control. While not directly in speech enhancement, it teaches that multi-stage or cascaded neural network architectures are a known and effective approach for breaking down and solving complex problems by processing information through successive layers or networks. A POSITA would readily recognize the applicability of a multi-stage neural network architecture to the complex problem of robust speech enhancement.
Feature Extraction and Dimensionality Reduction: The specification of US12125496B1 itself describes that the low-dimensional representation can be achieved by "pre-processing the input audio data 402 to remove noise and other distortions" using "a noise reduction algorithm... or a filtering technique" and that "features may be extracted... such as by using Fourier Transform, Mel-Frequency Cepstral Coefficients (MFCC)... encoded... using techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or other dimensionality reduction techniques." The patent's own disclosure is not prior art, but the techniques it names long predate the priority date and are standard signal processing and machine learning tools for isolating and compressing relevant information while discarding noise or irrelevant data.
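
To make the feature extraction and dimensionality reduction point concrete, the following minimal sketch computes a log-magnitude Fourier spectrum per frame and compresses it with PCA. It assumes the frames are already available as a 2-D array. The function names and the choice of log-magnitude features are illustrative assumptions, not taken from the patent or the cited references, and PCA here is a linear technique rather than a neural network.

```python
import numpy as np

def log_spectrum(frames):
    """Per-frame log-magnitude spectrum via a windowed real FFT."""
    window = np.hanning(frames.shape[1])
    magnitude = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log(magnitude + 1e-8)         # small offset avoids log(0)

def pca_fit(features, k):
    """Return the feature mean and the top-k principal directions."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:k]                     # components shape: (k, n_features)

def pca_encode(features, mean, components):
    """Project features onto the k principal directions."""
    return (features - mean) @ components.T

def pca_decode(codes, mean, components):
    """Map k-dimensional codes back to the original feature space."""
    return codes @ components + mean
```

Keeping only the leading components discards the directions with the least variance. That is the sense in which such techniques are said to compress the relevant information, although whether the discarded directions correspond to noise depends on the data.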

Motivation for Combination:

Given the acknowledged problem that existing single-stage neural network enhancement methods (e.g., as taught by US11482235B2 or US11705147B2) often distort desired speech characteristics while attempting to remove noise, a POSITA would be strongly motivated to improve upon these systems. The motivation would be to achieve more precise noise suppression without compromising speech intelligibility and integrity.

To address this, a POSITA would consider:

Breaking down the problem: Instead of a single network trying to do everything, separating the task into distinct stages using a multi-network approach (informed by US11868883B1).
Explicitly isolating speech content: Leveraging known feature extraction (e.g., MFCC) and dimensionality reduction techniques (e.g., PCA, LDA) to create an intermediate representation that focuses only on the critical speech characteristics and filters out or minimizes the non-content elements (noise, pops, etc.). This step directly addresses the problem of distortion by ensuring that the signal presented for enhancement is already stripped of undesirable components.
Reconstruction: Using a second network to reconstruct the full speech signal from this cleaner, low-dimensional representation, ensuring that the output retains the desired speech characteristics but is free from the previously omitted non-content elements.

The combination of:

Neural network-based speech enhancement (e.g., US11482235B2 or US11705147B2)
The architectural concept of multi-stage/hierarchical neural networks (e.g., US11868883B1)
Well-known techniques for feature extraction, noise reduction, and dimensionality reduction (as named in the specification of US12125496B1)

would have led a POSITA to develop the claimed two-network system with an intermediate low-dimensional representation that omits non-content elements. The motivation is to overcome the known trade-off between noise reduction and speech distortion in prior art systems by explicitly separating speech content from non-content elements before reconstruction. This approach represents a predictable engineering choice for improving performance in a known problem area.

Therefore, the systems, methods, and computer-readable media recited in independent claims 1, 11, and 16 of US12125496B1 would have been obvious to a POSITA as of the priority date, who would have combined the teachings of the cited prior art with general knowledge in the field to address a recognized technical challenge.