Obviousness — US Patent 12131745

Obviousness Analysis of US Patent 12131745 under 35 U.S.C. § 103

This analysis identifies combinations of prior art references that would render the claims of US Patent 12131745 obvious to a person having ordinary skill in the art (POSITA) at the time of the invention (priority date of June 27, 2023). The primary inventive step claimed by US12131745 is the use of a "differentiable alignment by jointly maximizing a cosine distance" between source and transformed phonetic embedding vectors for real-time accent conversion, specifically to overcome the limitations of Dynamic Time Warping (DTW) [Description].

Claims Under Consideration

The independent claims (Claim 1 for a system, Claim 8 for a method, and Claim 15 for a non-transitory computer-readable medium) share the following core elements:

Obtaining phonetic embedding vectors representing a source accent from input audio data.
Applying a trained machine learning model (neural network) to generate transformed phonetic embedding vectors representing a target accent.
Determining an alignment by maximizing a cosine distance between the source and transformed phonetic embedding vectors.
Aligning the speech data to the phonetic content based on this determined alignment to generate output audio data representing the target accent.

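Claim 4 further details the cosine distance determination: the vectors are normalized to unit norm and their dot product is computed. This is the standard computation of cosine similarity (conventional cosine distance is one minus that value), so the quantity the claims maximize appears to be what is ordinarily called cosine similarity. [Claim 4]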
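
The computation recited in Claim 4 can be illustrated with a minimal sketch. The function names and the two-dimensional example vectors below are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
import math

def unit_normalize(vec):
    """Scale a vector to unit (L2) norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_score(a, b):
    """Dot product of the unit-normalized vectors (cosine similarity)."""
    ua, ub = unit_normalize(a), unit_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))

# Hypothetical stand-ins for a source and a transformed embedding vector.
source_vec = [3.0, 4.0]
transformed_vec = [4.0, 3.0]
score = cosine_score(source_vec, transformed_vec)  # 0.6*0.8 + 0.8*0.6
```

The score ranges from -1 to 1 and equals 1 when the two vectors point in the same direction, which is why maximizing it pulls the compared embeddings into agreement.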

Identified Prior Art Combinations and Motivation

By 2023, a POSITA in the field of real-time accent conversion and deep learning would have been well aware of the limitations of conventional alignment techniques such as DTW. The patent itself explicitly highlights these issues: DTW is "non-differentiable and not providing gradient information," requires "two separate steps," "makes it difficult to train an accent conversion model effectively," and suffers from "non-monotonicity and instability" leading to "alignment errors" and "poor-quality audio signals." [Description] The patent also states that these limitations make it "challenging to optimize current accent conversion systems using gradient-based methods, which are widely used in deep learning models." [Description]
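
For context, the textbook DTW recursion can be sketched as follows. This is a generic formulation under assumed inputs (a precomputed frame-to-frame cost matrix), not the implementation used in the patent or in any cited reference. The alignment path is recovered by discrete min/argmin choices during backtracking, which is the step that yields no gradient for the path itself.

```python
def dtw(cost):
    """Classic DTW over a precomputed cost matrix.

    cost[i][j] is the cost of matching source frame i to target frame j.
    Returns (total_cost, path), where path is a list of (i, j) index pairs.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j]: cheapest cost of aligning the first i source frames
    # with the first j target frames (row/column 0 are sentinels).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]
            )
    # Backtrack: each step is a hard, discrete choice of predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path

# Hypothetical 3x3 cost matrix whose cheapest path is the diagonal.
example_cost = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
total_cost, path = dtw(example_cost)
```

Because the path is a list of integer index pairs, an arbitrarily small change in the costs either leaves it unchanged or flips it abruptly, so it cannot be trained through directly with gradient descent.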

The overarching motivation for a POSITA would therefore be to develop an accent conversion system that is fully differentiable, allows for end-to-end training using gradient-based optimization, and provides a stable, monotonic, and efficient alignment, thereby improving the accuracy, naturalness, and real-time performance of accent conversion.

Combination: US20220358903A1 in view of US20220122579A1 and general knowledge of a POSITA

1. Primary Reference: US20220358903A1 (Sanas.ai Inc.)
This published application, titled "Real-Time Accent Conversion Model," is assigned to the same entity (Sanas.ai Inc.) as US12131745 and is highly relevant prior art. It would have taught a POSITA the fundamental components of an accent conversion system, including:

Obtaining input audio data and extracting speech characteristics, which would implicitly involve or lead to the generation of phonetic representations or embeddings.
Utilizing a model (likely a neural network, given the context of "Real-Time Accent Conversion Model") to process these characteristics to convert a source accent to a target accent.
Generating output audio data in the target accent.

As a "current accent conversion system" at the time of the instant patent's filing, it would likely have suffered from the DTW limitations that US12131745 explicitly aims to overcome, serving as the baseline system a POSITA would seek to improve. [Description]

2. Secondary Reference: US20220122579A1 (Google LLC)
This published application, titled "End-to-end speech conversion," teaches systems and methods for determining a mapping between source and target speech signals using a deep neural network, specifically highlighting its "end-to-end" nature. In deep learning, "end-to-end" generally refers to systems whose components are differentiable and can be trained jointly using gradient-based optimization, which is precisely what a non-differentiable component such as DTW prevents. This reference would motivate a POSITA to seek differentiable solutions for alignment within speech conversion systems.

3. General Knowledge of a POSITA:
By 2023, a POSITA would have possessed the following common knowledge:

Limitations of DTW: The non-differentiability of DTW and its drawbacks for end-to-end training in deep learning architectures were widely recognized problems in sequence-to-sequence tasks, including speech processing. [Description]
Utility of Phonetic Embedding Vectors: Phonetic embedding vectors are a well-established means to numerically represent the phonetic characteristics of speech, making them suitable for machine learning processing. The patent itself describes them as capturing "important features related to pronunciation, intonation, and other phonetic aspects." [Description]
Cosine Distance/Similarity as a Differentiable Metric: Cosine similarity (and its complement, cosine distance) is a standard, mathematically differentiable metric for quantifying the similarity or dissimilarity between two vectors, and is particularly effective for high-dimensional embeddings. Calculating cosine similarity involves normalizing the vectors to unit norm and then computing their dot product, both of which are differentiable operations for nonzero vectors. [Description] In sequence models, similarity scores of this kind (dot-product or cosine-based) are commonly used within attention mechanisms to determine alignment or relevance between different parts of input and output sequences in a differentiable manner.
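
The attention-style use of cosine scores mentioned above can be sketched as a soft alignment: every source frame receives a probability distribution over target frames instead of a single hard match. This is a generic illustration, not the alignment procedure claimed in US12131745; the function names, the temperature value, and the example frames are assumptions.

```python
import math

def unit_normalize(vec):
    """Scale a nonzero vector to unit (L2) norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_matrix(source, target):
    """Pairwise cosine similarity between source and target frames."""
    src = [unit_normalize(v) for v in source]
    tgt = [unit_normalize(v) for v in target]
    return [[sum(x * y for x, y in zip(s, t)) for t in tgt] for s in src]

def softmax(scores, temperature):
    """Numerically stable softmax; lower temperature gives sharper weights."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_alignment(source, target, temperature=0.1):
    """Row-wise softmax over cosine scores (assumed temperature of 0.1)."""
    return [softmax(row, temperature) for row in cosine_matrix(source, target)]

# Hypothetical 2-D embeddings: three source frames, two target frames.
source_frames = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
target_frames = [[0.9, 0.1], [0.1, 0.9]]
weights = soft_alignment(source_frames, target_frames)
```

Each row of `weights` sums to one. Every step (normalization, dot product, exponential, division) is smooth, so gradients can flow from an alignment-dependent loss back to the embeddings.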

Motivation for Combination:
A POSITA, tasked with improving the "Real-Time Accent Conversion Model" taught by US20220358903A1, would be strongly motivated by the widely known problems associated with DTW (non-differentiability, instability, multi-step training) as articulated in US12131745. [Description] Recognizing the benefits of "end-to-end speech conversion" as taught by US20220122579A1, the POSITA would seek to replace the non-differentiable DTW alignment with a differentiable alternative to enable efficient gradient-based optimization of the entire accent conversion pipeline.

Given that phonetic embedding vectors are already being processed (as implied by US20220358903A1) and that cosine distance is a standard and differentiable measure of similarity between vectors, it would have been an obvious engineering choice for a POSITA to adapt cosine distance maximization to create a differentiable alignment for these phonetic embedding vectors. Maximizing the claimed cosine measure between unit-normed vectors (which, per Claim 4, amounts to maximizing their dot product) is a differentiable operation that could be integrated directly into the neural network's loss function, thereby achieving a "differentiable alignment" and allowing end-to-end training of the accent conversion neural network, which addresses the stated deficiencies of DTW. A POSITA would reasonably expect this approach to yield the improved performance, stability, and speed (e.g., "about twenty times faster than alignment achieved using dynamic time warping (DTW)") described as advantages of the claimed invention. [Description]
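
The differentiability relied on above can be checked numerically. The sketch below compares the closed-form gradient of the cosine score with a finite-difference estimate and then takes one gradient-ascent step. The vectors, step size, and function names are hypothetical; a real system would obtain these gradients from an automatic-differentiation framework rather than by hand.

```python
import math

def norm(v):
    """Euclidean (L2) norm."""
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def cosine_grad_wrt_a(a, b):
    """Closed-form gradient with respect to a: (b_hat - s * a_hat) / ||a||."""
    na, nb = norm(a), norm(b)
    s = cosine(a, b)
    return [(y / nb - s * x / na) / na for x, y in zip(a, b)]

def numeric_grad(a, b, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for k in range(len(a)):
        up, down = list(a), list(a)
        up[k] += eps
        down[k] -= eps
        grad.append((cosine(up, b) - cosine(down, b)) / (2 * eps))
    return grad

# Hypothetical embedding vectors.
a = [1.0, 2.0, 0.5]
b = [0.5, 1.0, 2.0]
analytic = cosine_grad_wrt_a(a, b)
numeric = numeric_grad(a, b)

# One gradient-ascent step on a (assumed step size of 0.1).
step = 0.1
a_updated = [x + step * g for x, g in zip(a, analytic)]
```

The two gradient estimates agree to within numerical error, and the ascent step raises the cosine score, which is the behavior a loss term built on this measure would depend on during training.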

Conclusion on Obviousness

The combination of US20220358903A1 (teaching a real-time accent conversion model), US20220122579A1 (teaching the desirability of, and means for, end-to-end differentiable speech conversion systems), and the general knowledge of a POSITA regarding DTW's limitations and the differentiable properties and applications of cosine distance for vector alignment would render the claims of US12131745 obvious. A POSITA would have been motivated to combine these elements to overcome the known technical problems of non-differentiable alignment in accent conversion, thereby enabling end-to-end optimization and improving the performance and efficiency of such systems.