Obviousness — US Patent 11948550

Obviousness Analysis under 35 U.S.C. § 103 for US Patent 11948550

This analysis identifies combinations of prior art references that would render the claims of US Patent 11948550 obvious to a person having ordinary skill in the art (POSITA) at the time of the invention (priority date May 6, 2021). The motivation for combining these references is also explained.

The independent claims of US11948550 (Claims 1, 11, and 19) describe a system, non-transitory computer-readable medium, and method, respectively, for real-time accent conversion. Key features include:

Deriving a non-text linguistic representation from input speech (first accent).
Synthesizing output audio (second accent) by mapping a first non-text linguistic representation of a first phoneme to a second non-text linguistic representation of a second, different phoneme.
Utilizing two machine-learning algorithms (one for deriving the linguistic representation, one for synthesis).
Real-time operation with low latency (e.g., 50-700 ms).
Preservation of prosodic features.

Combination of References: Zhao et al. (2019) in view of US10163451B2 (Amazon) and Sajjan & Vijaya (2016)

References:

Zhao, G., Ding, S., & Gutierrez-Osuna, R. (2019). Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams. (Hereinafter "Zhao (2019)")
US10163451B2: Accent translation (Amazon Technologies, Inc.) (Hereinafter "Amazon '451")
Sajjan, S. C., & Vijaya, C. (Mar. 2016). Continuous Speech Recognition of Kannada language using triphone modeling. (Hereinafter "Sajjan (2016)")

Detailed Obviousness Rationale:

A POSITA in speech processing or machine learning, seeking to improve real-time accent conversion, would have been motivated to combine the teachings of Zhao (2019) with those of Amazon '451, optionally incorporating the well-known ASR techniques described in Sajjan (2016).

Problem Addressed in the Art:
The background of US11948550 explicitly identifies shortcomings in existing accent conversion solutions:

Voice conversion methods that only adjust audio characteristics (e.g., pitch, intonation) fail to account for pronunciation differences between accents (e.g., the "th-stopping" found in Indian English but not in Standard American English).
Speech-to-text (STT) followed by text-to-speech (TTS) approaches introduce significant latency (up to several seconds) and may lose nuances like prosody and emotion.

How the Combination Renders Claims Obvious:

Preamble (System, processor, non-transitory computer-readable medium): Both Zhao (2019) and Amazon '451 describe computer-implemented systems and methods, implicitly requiring processors and non-transitory computer-readable media for their operation. This element is standard for any modern speech processing technology.
Claim Element 1(a) (Training a first ML algorithm... with multi-speaker first accent data, aligning/classifying frames):
- Zhao (2019) describes "Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams". The generation of Phonetic Posteriorgrams (PPGs) from speech, as a non-text linguistic representation, fundamentally relies on an underlying Automatic Speech Recognition (ASR) model. Training such an ASR model requires "speech content captured from a plurality of different speakers" to ensure robustness and generalization.
- Sajjan (2016) explicitly teaches "Continuous Speech Recognition... using triphone modeling". Aligning and classifying speech frames according to monophone and triphone sounds is a standard and well-known technique in ASR training for developing robust phonetic representations. A POSITA would readily apply these established ASR training methods to train the first machine-learning algorithm used to generate PPGs or other non-text linguistic representations.
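The monophone and triphone frame classification discussed for element 1(a) can be illustrated with a minimal sketch. It assumes a frame-level monophone alignment is already available (here a hand-written toy list) and relabels each frame with its left and right phone context in the common "left-center+right" notation. The function names and the "sil" boundary symbol are hypothetical and are not taken from Sajjan (2016) or US11948550.

```python
from typing import List, Tuple


def frames_to_segments(frame_labels: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse per-frame monophone labels into (phone, start, end) segments.

    Simplification: two adjacent instances of the same phone are merged
    into one segment, which a real aligner would keep apart.
    """
    segments: List[Tuple[str, int, int]] = []
    for i, phone in enumerate(frame_labels):
        if segments and segments[-1][0] == phone:
            prev_phone, start, _ = segments[-1]
            segments[-1] = (prev_phone, start, i + 1)
        else:
            segments.append((phone, i, i + 1))
    return segments


def triphone_labels(frame_labels: List[str], boundary: str = "sil") -> List[str]:
    """Relabel every frame as 'left-center+right' using neighbouring segments."""
    segments = frames_to_segments(frame_labels)
    out: List[str] = []
    for idx, (phone, start, end) in enumerate(segments):
        left = segments[idx - 1][0] if idx > 0 else boundary
        right = segments[idx + 1][0] if idx + 1 < len(segments) else boundary
        out.extend([f"{left}-{phone}+{right}"] * (end - start))
    return out


# Toy alignment: six frames covering the phones "dh", "ih", "s".
toy_frames = ["dh", "dh", "ih", "ih", "ih", "s"]
print(triphone_labels(toy_frames))
```

Each frame thereby receives both a monophone class (the center phone) and a triphone class (the full context label), which are the kinds of training targets the claim element refers to.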
Claim Element 1(b) (Applying the first ML algorithm to received speech to derive a non-text linguistic representation):
- Zhao (2019) directly teaches deriving "Phonetic Posteriorgrams" (PPGs) from input speech for accent conversion. PPGs are a clear example of a "non-text linguistic representation" as they represent phonetic probabilities over time without full text transcription, directly addressing the patent's stated advantage over STT-TTS. Input speech would be "received via at least one microphone," a common component of any speech processing system.
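To make concrete what a phonetic posteriorgram is, the following minimal sketch turns per-frame scores into per-frame probability distributions over phones. The five-phone inventory and the score values are invented for illustration, and the acoustic model that would produce such scores is assumed rather than shown; this is not the model used in Zhao (2019).

```python
import math
from typing import List

# Hypothetical toy phone inventory; a real ASR model uses a far larger set.
PHONES = ["th", "t", "d", "ih", "s"]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one frame's scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posteriorgram(frame_logits: List[List[float]]) -> List[List[float]]:
    """Return one probability distribution over PHONES per frame."""
    return [softmax(row) for row in frame_logits]


# Made-up scores for two frames, standing in for an acoustic model's output.
toy_logits = [
    [2.0, 0.5, 0.1, -1.0, -1.0],
    [0.0, 0.0, 0.0, 3.0, 0.0],
]
ppg = posteriorgram(toy_logits)
for row in ppg:
    most_likely = PHONES[row.index(max(row))]
    print(most_likely, [round(p, 3) for p in row])
```

The result is a frames-by-phones matrix of probabilities rather than a text transcript, which is why a PPG qualifies as a "non-text linguistic representation" in the sense used above.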
Claim Element 1(c) (Synthesizing using a second ML algorithm... by mapping first phoneme to a second, different phoneme):
- Zhao (2019) generally teaches "Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams," which involves a second machine-learning algorithm to convert the non-text linguistic representation (PPGs) into synthesized audio of a target accent. This algorithm would be trained with audio data of both the first and second accents.
- Amazon '451 directly teaches the crucial step of modifying pronunciations for accent conversion. Claim 1 of Amazon '451 recites a method comprising "modifying at least some of the set of phonemes to a target set of phonemes based at least in part on the target accent". This corresponds to the claimed "mapping at least a first non-text linguistic representation of a first phoneme... to a second non-text linguistic representation of a second phoneme... that is different from the first pronunciation... wherein the first and second phonemes are different phonemes." For instance, in an Indian English to Standard American English conversion, the "t" or "d" phoneme produced by th-stopping would be changed to the "th" phoneme, the pronunciation difference highlighted in US11948550's background. A POSITA would understand that the "set of phonemes" and "acoustic characteristics" in Amazon '451 constitute a form of linguistic representation, which could be implemented as a non-textual representation such as the PPGs of Zhao (2019).
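The idea of carrying out a phoneme modification inside a posteriorgram, rather than on text, can be sketched as follows. This is a deliberately simplified stand-in: it uses a fixed lookup table that moves probability mass from one phone column to another, whereas in the combination discussed above the mapping would presumably be learned by the second machine-learning algorithm. The phone inventory, the table, and its direction (which depends on which accent is the source and which the target) are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy phone inventory shared by source and target accents.
PHONES = ["th", "t", "d", "ih", "s"]

# Hypothetical, context-free mapping. A real system would only remap
# a phone in the contexts where the two accents actually differ.
PHONE_MAP: Dict[str, str] = {"t": "th"}


def remap_posteriors(
    ppg: List[List[float]],
    phones: List[str],
    phone_map: Dict[str, str],
) -> List[List[float]]:
    """Move each source phone's probability mass onto its target phone."""
    index = {p: i for i, p in enumerate(phones)}
    converted: List[List[float]] = []
    for row in ppg:
        new_row = list(row)
        for src, dst in phone_map.items():
            mass = row[index[src]]  # read from the unmodified frame
            new_row[index[dst]] += mass
            new_row[index[src]] -= mass
        converted.append(new_row)
    return converted


# Two made-up frames: the first is mostly "t", the second mostly "ih".
toy_ppg = [
    [0.1, 0.7, 0.1, 0.05, 0.05],
    [0.0, 0.0, 0.0, 0.9, 0.1],
]
converted = remap_posteriors(toy_ppg, PHONES, PHONE_MAP)
for before, after in zip(toy_ppg, converted):
    print(PHONES[before.index(max(before))], "->", PHONES[after.index(max(after))])
```

Each converted frame remains a probability distribution, so a downstream synthesizer trained on target-accent PPGs could consume it unchanged. That is the sense in which the PPG serves as an intermediate format for the phoneme modification.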
Claim Element 1(d) (Converting synthesized audio data into a synthesized version... comprising the updated set of phonemes):
- Both Zhao (2019) and Amazon '451 ultimately produce synthesized speech in the target accent. Zhao's method explicitly involves "Synthesizing Speech" from PPGs. Converting the synthesized audio data (e.g., mel spectrograms) into an audible waveform with a vocoder or similar component, as US11948550 describes, is a well-known step in speech synthesis, and both references teach it at least implicitly as the end goal of accent conversion. The output speech would naturally embody the "updated set of phonemes" resulting from the conversion process.

Motivation for Combination:

A POSITA would have been motivated to combine Zhao (2019) and Amazon '451 to address the identified problems in the art:

To overcome latency and prosody loss of STT-TTS: Zhao (2019)'s use of Phonetic Posteriorgrams (PPGs) provides a direct speech-to-speech conversion path using a "non-text linguistic representation". Because this path never reduces the speech to a full text transcription, a POSITA would expect it to offer lower latency and better preservation of prosodic features (like pitch, emotion, and pauses) than STT-TTS, directly addressing the issues raised in US11948550's background.
To address pronunciation differences: Amazon '451 directly provides a solution for modifying phonemes based on the target accent. This explicitly remedies the deficiency of prior voice conversion methods that only adjust acoustic characteristics but fail to alter specific pronunciations, a problem central to US11948550.
Synergy and Predictability: It would be obvious for a POSITA to integrate the phoneme modification capability of Amazon '451 into the low-latency, non-textual framework of Zhao (2019). The "non-text linguistic representation" (PPGs) from Zhao (2019) provides an ideal intermediate format within which the phoneme modifications taught by Amazon '451 could be implemented, thereby creating a real-time accent conversion system that effectively handles both acoustic and phonetic variations.

Additional Obvious Features:

Real-time operation (Claim 9): Both the PPG-based method of Zhao (2019) and the accent translation of Amazon '451 were developed in contexts where real-time performance is highly desirable for communication applications. Optimizing such systems for "low latency" (e.g., 50-700 ms, or 300 ms as mentioned in US11948550's abstract) is a common design goal in speech processing and is within the ordinary skill in the art for these types of systems.
Preservation of prosodic features (Claim 10): As noted above, the PPGs of Zhao (2019) retain more of the continuous characteristics of speech than discrete text does, allowing for better retention of prosody, pitch, and emotion, which are explicitly mentioned as advantages in US11948550.
Multi-speaker first accent, single speaker second accent (Claims 4, 5): The practice of training ASR or voice conversion models with diverse speakers for input robustness and a single, representative speaker for target voice identity is a well-established technique in speech synthesis and conversion, readily apparent to a POSITA.
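For the real-time feature above, the kind of latency budgeting a POSITA would perform can be shown as simple arithmetic. The stage names and millisecond figures below are hypothetical placeholders rather than measurements from any reference. The sketch only illustrates that end-to-end delay is roughly the buffered audio plus any lookahead plus per-stage compute time, and that the sum can be checked against the 50-700 ms range.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Hypothetical per-stage delays for a streaming conversion pipeline (ms)."""

    chunk_ms: float      # audio buffered before processing can start
    lookahead_ms: float  # future context the models wait for
    encoder_ms: float    # compute time of the first model (speech -> PPG)
    synth_ms: float      # compute time of the second model (PPG -> spectrogram)
    vocoder_ms: float    # compute time of waveform generation

    def total_ms(self) -> float:
        """End-to-end delay under the assumption that stages run back to back."""
        return (
            self.chunk_ms
            + self.lookahead_ms
            + self.encoder_ms
            + self.synth_ms
            + self.vocoder_ms
        )


def within_range(total_ms: float, low_ms: float = 50.0, high_ms: float = 700.0) -> bool:
    """Check a total delay against the example range discussed above."""
    return low_ms <= total_ms <= high_ms


# Placeholder numbers chosen only to make the arithmetic concrete.
budget = LatencyBudget(
    chunk_ms=160.0, lookahead_ms=40.0, encoder_ms=20.0, synth_ms=30.0, vocoder_ms=25.0
)
print(budget.total_ms(), within_range(budget.total_ms()))
```

Shrinking the chunk size or the lookahead lowers the total at the cost of giving the models less context, which is the routine design trade-off referred to above.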

Therefore, the combination of Zhao (2019) and Amazon '451, possibly supplemented by Sajjan (2016) for standard ASR training methodologies, renders the claimed invention obvious under 35 U.S.C. § 103.