Obviousness — US Patent 12412561

The patent text of US12412561 does not reproduce the content of any specific prior art reference. It lists "Prior art keywords" (audio signal, input, output, features, chunk) and cross-references a parent application (U.S. application Ser. No. 18/083,727, which issued as US11715457B1), but the disclosures of those documents are not part of the provided text. This analysis therefore cannot identify specific combinations of distinct prior art references or cite their detailed disclosures.

However, based on the problem statement and technical descriptions within US12412561, an analysis of hypothetical obviousness can be conducted.

Problem Addressed by US12412561 in the Prior Art:
The patent explicitly states, "Existing solutions for correcting accents in audio signals are not very effective in real-time communications." This statement indicates that, as of the priority date (2022-01-10), systems for "correcting accents in audio signals" already existed and audio communications were in wide use. The core problem was that these existing accent correction solutions were not effective in real time.

General Knowledge of a Person Having Ordinary Skill in the Art (POSITA):
A POSITA in speech processing and real-time audio systems, by the priority date, would have been generally aware of:

Speech Processing Pipelines/Computational Graphs: The modular decomposition of complex speech tasks into sequential or parallel processing blocks (e.g., feature extraction, analysis, synthesis).
Acoustic Feature Extraction: Techniques for extracting features like pitch (F0), energy, Voice Activity Detection (VAD), Linear Prediction Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), and Bark Frequency Cepstral Coefficients (BFCC) from audio signals.
Linguistic Feature Extraction: Methods for deriving linguistic information, such as phonemes, Phonetic PosteriorGrams (PPGs), or bottleneck features from Automatic Speech Recognition (ASR) neural networks.
Speech Synthesis and Vocoding: Technologies for generating speech from features, including various synthesis modules and vocoders (e.g., HiFi-GAN, LPCNet).
Speaker Embeddings: The use of speaker-specific features to control voice characteristics in speech synthesis.
Real-Time Processing Techniques: General methods for achieving low-latency processing in streaming data, such as dividing input into "chunks," employing "context caching" to maintain continuity, and utilizing parallel processing across different computational units (CPUs, GPUs, NPUs, FPGAs).
Digital Signal Processing Fundamentals: The necessity of handling different sample rates within a pipeline and performing resampling operations as needed.
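
As an illustration of how routine the simpler acoustic features in this list are, the following minimal sketch computes frame energy and an energy-threshold VAD in plain Python. The frame length, hop size, threshold, and function names are assumptions chosen for the example; none of them come from US12412561.

```python
import math

def frame_energies(samples, frame_len=160, hop=80):
    """Mean-square energy of each full frame (a partial tail frame is dropped)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

def energy_vad(energies, threshold=1e-3):
    """Flag a frame as active when its energy exceeds a fixed threshold."""
    return [e > threshold for e in energies]

# Synthetic input: 50 ms of silence followed by 50 ms of a 200 Hz tone at 8 kHz.
SAMPLE_RATE = 8000
silence = [0.0] * 400
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]
energies = frame_energies(silence + tone)
flags = energy_vad(energies)
print(flags)
```

With 20 ms frames and a 10 ms hop, the frames that lie entirely in the silent segment are flagged inactive and the frames that overlap the tone are flagged active. Production systems typically use more robust detectors; the point is only that such features are standard building blocks.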

Hypothetical Obviousness Argument:

A person having ordinary skill in the art (POSITA), faced with the acknowledged problem that "Existing solutions for correcting accents in audio signals are not very effective in real-time communications," would have been motivated to combine known speech processing components and real-time system design principles to create an effective real-time accent correction system.

The core method of US12412561 involves a computational graph with specific modules: an acoustic features extraction module, a linguistic features extraction module (designed to produce accent-reduced features), a synthesis module (using acoustic, linguistic features, and a speaker embedding to produce a spectrum representation), and a vocoder.
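
The data flow between these four modules can be sketched as follows. This is a hypothetical illustration only: every function body is a trivial placeholder (the real modules would be trained models), and all names and signatures are assumptions rather than anything disclosed in US12412561.

```python
from typing import Dict, List

def extract_acoustic(chunk: List[float]) -> Dict[str, float]:
    """Stand-in acoustic features: mean-square energy only."""
    return {"energy": sum(x * x for x in chunk) / max(len(chunk), 1)}

def extract_linguistic(chunk: List[float]) -> List[float]:
    """Placeholder for a trained network that would emit accent-reduced features."""
    return [sum(abs(x) for x in chunk) / max(len(chunk), 1)]

def synthesize(acoustic: Dict[str, float], linguistic: List[float],
               speaker_embedding: List[float]) -> List[float]:
    """Placeholder synthesis: concatenate the three inputs into one vector
    standing in for a spectrum representation."""
    return [acoustic["energy"]] + list(linguistic) + list(speaker_embedding)

def vocode(spectrum: List[float], n_samples: int) -> List[float]:
    """Placeholder vocoder: constant signal whose level follows the first bin."""
    amplitude = spectrum[0] ** 0.5
    return [amplitude] * n_samples

def run_graph(chunk: List[float], speaker_embedding: List[float]) -> List[float]:
    """Wire the modules together in the order described above."""
    acoustic = extract_acoustic(chunk)
    linguistic = extract_linguistic(chunk)
    spectrum = synthesize(acoustic, linguistic, speaker_embedding)
    return vocode(spectrum, len(chunk))

output = run_graph([0.1] * 160, speaker_embedding=[0.0, 1.0])
print(len(output))
```

The two extraction steps depend only on the input chunk, not on each other, which is why a graph of this shape lends itself to running modules in parallel.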

Motivation to combine these elements would stem from the desire to overcome the limitations of prior art real-time accent correction. Specifically:

Basic Architecture: The concept of a pipeline or computational graph for speech transformation (analysis-synthesis) involving feature extraction, modification, and regeneration of speech was well-known. A POSITA would naturally apply this established paradigm to the problem of accent correction.
Accent Reduction in Linguistic Features: Accent manifests in both the acoustic domain (e.g., prosody, timbre) and the linguistic domain (e.g., phoneme realization). Knowing this, a POSITA would have been motivated to develop linguistic features that are "accent-agnostic" or "accent-reduced," for instance by training a linguistic feature extractor with an accent reduction loss function, as US12412561 describes. This would be an engineering choice to isolate and modify the accent component.
Speaker Embedding for Voice Preservation: To maintain the original speaker's voice identity while correcting accent, the inclusion of a speaker embedding in the synthesis process would be an obvious design choice for a POSITA familiar with voice conversion and synthesis techniques.
Real-Time Optimization Techniques: To achieve the desired "real-time" performance and overcome the "not very effective in real-time" limitation of prior art:
- Chunking Input: Processing audio in "chunks" is a standard method for streaming applications to manage latency and memory.
- Context Caching: Storing and updating "context" for each module and submodule is a known technique to ensure continuity and quality in chunk-based processing, mitigating artifacts that could arise from processing isolated chunks.
- Parallel Processing: Utilizing "at least two processing units" (CPUs, GPUs, NPUs, FPGAs) to process "at least two modules from the computational graph in parallel" is a fundamental strategy for accelerating complex computations and meeting real-time deadlines in modern computing systems.
- Sample Rate Management: The need to handle "different sample rates" between modules and "resampling the output data" is a basic digital signal processing requirement when integrating components that operate at different sampling frequencies.
- Time-Shift Parameters: Explicitly managing "time-shift parameters" within a real-time pipeline to account for processing delays and ensure proper alignment of features (e.g., acoustic features might be available with lower latency than linguistic features due to simpler processing) would be a logical engineering approach to optimize overall system latency and synchronization. The choice to assign a "lower time-shift parameter" to acoustic features than linguistic features could be motivated by the typically less complex and faster extraction of raw acoustic parameters compared to higher-level linguistic analysis.
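
The chunking and context caching techniques listed above can be made concrete with a small sketch. It uses a causal moving-average filter as a stand-in for a pipeline module (an assumption for illustration; the patent's modules are far more complex) and shows that caching the tail of the previous chunk makes chunked output identical to processing the whole signal at once, while processing chunks in isolation produces errors at the chunk boundaries. The class name and parameters are hypothetical.

```python
class StreamingMovingAverage:
    """Causal moving average over the last `taps` input samples,
    processed chunk by chunk with a cached context."""

    def __init__(self, taps):
        self.taps = taps
        # Context cache: the last (taps - 1) input samples seen so far.
        self.context = [0.0] * (taps - 1)

    def process(self, chunk):
        padded = self.context + list(chunk)
        out = [sum(padded[i:i + self.taps]) / self.taps
               for i in range(len(chunk))]
        # Update the cache with the tail of the input for the next chunk.
        self.context = padded[len(padded) - (self.taps - 1):]
        return out

def chunks(signal, size):
    """Split a signal into consecutive chunks (the last one may be shorter)."""
    return [signal[i:i + size] for i in range(0, len(signal), size)]

signal = [float(n % 7) for n in range(20)]

# Reference: the whole signal processed in one call.
offline = StreamingMovingAverage(taps=3).process(signal)

# Streaming with context caching: one module instance reused across chunks.
module = StreamingMovingAverage(taps=3)
streamed = [y for c in chunks(signal, 6) for y in module.process(c)]

# Isolated chunks: a fresh module (empty context) for every chunk.
isolated = [y for c in chunks(signal, 6)
            for y in StreamingMovingAverage(taps=3).process(c)]

print(streamed == offline, isolated == offline)
```

The same pattern generalizes to any module with a finite receptive field: the cache holds exactly as much past input (or internal state) as the module needs to reproduce its offline behavior.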

Therefore, a POSITA, motivated by the clear need for effective real-time accent correction and equipped with general knowledge of speech processing components and real-time system optimization techniques, would arguably have found it obvious to combine these known elements in the manner claimed by US12412561. The particular combination of modules and the specific techniques for real-time operation (chunking, caching, parallelization, sample rate handling, and time-shift management) represent an assembly of known elements that addresses a known problem with predictable results, which potentially renders the claims obvious.