Obviousness — US Patent 11650968

The following analysis addresses the obviousness of US Patent 11650968 under 35 U.S.C. § 103, identifying combinations of prior art references that would render the claims obvious and explaining the motivation for a person having ordinary skill in the art (POSITA) to combine them. The analysis primarily focuses on Independent Claim 1, as Claims 10 and 16 are directed to a system and a computer program product for carrying out the method of Claim 1, respectively.

The patent's priority date is May 24, 2019. The provided patent text describes the general state of prior art for neural network training: "Prior training methods typically stop by periodically testing a NN using a holdout data set not included in training data. Training may be stopped for example when improvement stagnates." [cite: US11650968 Description] The patent identifies a problem: "It is hard or impossible using prior art methods to predict at any given point in training how much improvement can be achieved by training using further epochs." [cite: US11650968 Description]

The references from the Google Patents listing for US11650968B2 (found under the "References" section) are used for this analysis.

Combination of Prior Art References

A combination of WO2019053350A1 ("EFFICIENT MACHINE LEARNING MODEL TRAINING WITH EARLY STOPPING" by Li et al., published March 21, 2019) and general knowledge within the field of machine learning, possibly supplemented by Prechelt, Lutz, "Early stopping—But when?" (2012), would render Claim 1 of US11650968 obvious.

WO2019053350A1 (Li et al.)
This international patent application, published prior to the priority date of US11650968, discloses methods for optimizing machine learning model training. Its abstract states: "Methods may include collecting metrics associated with a machine learning model training, determining an expected performance curve for the machine learning model training based on at least one trained model and the collected metrics, and stopping the machine learning model training based on the expected performance curve." [cite: WO2019053350A1 Abstract] The description further clarifies that "The trained model may be a surrogate model trained to predict future performance of a machine learning model based on its past performance." [cite: WO2019053350A1 Description]

Prechelt, Lutz, "Early stopping—But when?" (2012)
This seminal work discusses various strategies for early stopping based on monitoring validation error to prevent overfitting during neural network training. It highlights the reactive nature of conventional early stopping methods, which cease training when observed improvement stagnates. [cite: Prechelt (2012)]

General Knowledge of a Person Having Ordinary Skill in the Art (POSITA)
A POSITA in machine learning as of 2019 would possess knowledge of:

- Standard neural network training processes, including iterative epochs and loss computation.
- The computational expense of training large neural networks and the desirability of early stopping.
- Techniques for training predictive models (e.g., surrogate models) on diverse datasets to achieve robustness and generalization.
- Basic statistical methods for deriving probabilities and expected values from predictive model outputs.
- Using thresholds for decision-making in automated processes.

Obviousness Analysis of Claim 1

Claim 1: A computer-implemented method for predictively stopping training of a neural network (NN), the method comprising:

1a. training, by one or more processors, the NN over a series of training periods, wherein a loss for the NN is computed during each training period;

Disclosure in Prior Art: This element is widely known in the art and is taught by both Prechelt (2012) and WO2019053350A1. Training a neural network iteratively and computing a loss function in each iteration are fundamental aspects of machine learning. WO2019053350A1 describes "collecting metrics associated with a machine learning model training," and a POSITA would understand those metrics to include the loss computed during each training period. [cite: WO2019053350A1 Abstract]
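For illustration only, element 1a can be sketched as a loop that computes and records a loss in every training period. The sketch below is a minimal stand-in, not the method of any cited reference: it assumes a one-weight linear model in place of a neural network, and the names `train_with_loss_history`, `periods`, and `lr` are hypothetical.

```python
def train_with_loss_history(xs, ys, periods=20, lr=0.05):
    """Fit y = w * x by gradient descent, recording the loss of each period.

    Assumption: a single-weight linear model stands in for the NN, and one
    gradient step stands in for one training period.
    """
    w = 0.0
    losses = []
    n = len(xs)
    for _ in range(periods):
        # Mean squared error for the current weight (the per-period loss).
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        losses.append(loss)
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w, losses


w, losses = train_with_loss_history([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The recorded `losses` list is the kind of per-period loss history that the later claim elements take as input.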

1b. determining, by the one or more processors and using one or more models that have been trained using training losses of a plurality of NNs other than the NN, a probability of improvement in the loss of the NN, wherein the one or more models have input thereto a set of model parameters for the NN and data describing training loss of the NN;

Disclosure in Prior Art: WO2019053350A1 discloses "determining an expected performance curve for the machine learning model training based on at least one trained model and the collected metrics." [cite: WO2019053350A1 Abstract] It further states that this "trained model may be a surrogate model trained to predict future performance of a machine learning model based on its past performance." [cite: WO2019053350A1 Description]
- "using one or more models that have been trained using training losses of a plurality of NNs other than the NN": While WO2019053350A1 does not explicitly use the phrase "plurality of NNs other than the NN," a POSITA would understand that a "surrogate model trained to predict future performance of a machine learning model" would ordinarily be trained on a dataset of training histories from many different machine learning models or configurations. Training such a predictive model on data from only a single NN would limit its generalizability, which is contrary to the purpose of a surrogate model designed for "efficient machine learning model training." It would therefore have been an obvious design choice for a POSITA to train the "at least one trained model" using data from a plurality of NNs, so that the model is broadly applicable and robust.
- "a probability of improvement in the loss of the NN": Once an "expected performance curve" or "predicted future performance" is determined (as taught by WO2019053350A1), calculating a "probability of improvement" (e.g., the likelihood that the loss will fall below a target value or continue to decrease by a significant amount) is a routine statistical derivation. US11650968 itself describes using a cumulative distribution function (CDF) with a mean and variance from a model to determine such a probability. [cite: US11650968 Description] Evaluating the CDF of a predictive distribution at a target value is a standard statistical operation.
- "input thereto a set of model parameters for the NN and data describing training loss of the NN": WO2019053350A1 describes using "collected metrics" to determine the expected performance curve, and specifically mentions the surrogate model predicting future performance "based on its past performance." [cite: WO2019053350A1 Abstract, WO2019053350A1 Description] The "past performance" corresponds to the data describing the training loss of the NN, and a POSITA would understand the "collected metrics" to include the model parameters of the NN being trained.
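The CDF-based derivation discussed for the "probability of improvement" element can be illustrated with a short sketch. It assumes, hypothetically, that the trained model outputs a Gaussian predictive distribution (a mean and a standard deviation) for the future loss of the NN. The names `probability_of_improvement`, `best_loss`, and `min_delta` are illustrative and are not taken from US11650968 or WO2019053350A1.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a normal distribution N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def probability_of_improvement(best_loss, predicted_mean, predicted_std,
                               min_delta=0.0):
    """Probability that the future loss falls below best_loss - min_delta.

    Assumption: the future loss is modeled as N(predicted_mean,
    predicted_std^2), so the probability is the CDF at the target value.
    """
    if predicted_std <= 0:
        raise ValueError("predicted_std must be positive")
    target = best_loss - min_delta
    return normal_cdf(target, predicted_mean, predicted_std)


# Predicted loss centered one standard deviation below the best loss so far.
p = probability_of_improvement(best_loss=0.5, predicted_mean=0.4,
                               predicted_std=0.1)
```

Under these assumptions the probability is a single CDF evaluation, which is consistent with the characterization of the step as a routine statistical derivation.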

1c. stopping, by the one or more processors, the training of the NN if the determined probability is less than a probability threshold, or if a wait value is greater than a wait threshold.

Disclosure in Prior Art: WO2019053350A1 teaches "stopping the machine learning model training based on the expected performance curve." [cite: WO2019053350A1 Abstract] The use of "thresholds" for decision-making based on a calculated metric (such as a probability) is a fundamental engineering principle. Given the derivation of a "probability of improvement," it would be obvious to a POSITA to define a "probability threshold" to trigger early stopping. The "wait value" (or patience) mechanism, which stops training after a period of non-improvement, is a well-established heuristic in early stopping, as taught by Prechelt (2012) and generally known in the art. [cite: Prechelt (2012)]
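The two-part stopping condition of element 1c can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the "wait value" is assumed to count consecutive training periods without an improvement in the best loss, and the function names and default thresholds are hypothetical.

```python
def should_stop(prob_improvement, prob_threshold, wait, wait_threshold):
    """Stop if improvement is unlikely OR the wait value exceeds its threshold."""
    return prob_improvement < prob_threshold or wait > wait_threshold


def run_stopping_policy(losses, probabilities, prob_threshold=0.1,
                        wait_threshold=3):
    """Return the index of the period at which training stops, or None.

    Assumption: `wait` counts consecutive periods in which the loss did not
    improve on the best loss seen so far (a patience counter).
    """
    best_loss = float("inf")
    wait = 0
    for period, (loss, prob) in enumerate(zip(losses, probabilities)):
        if loss < best_loss:
            best_loss = loss
            wait = 0  # improvement observed: reset the patience counter
        else:
            wait += 1  # no improvement in this period
        if should_stop(prob, prob_threshold, wait, wait_threshold):
            return period
    return None


# The predicted probability drops below the threshold at period 3.
stop_at = run_stopping_policy([1.0, 0.8, 0.7, 0.7, 0.7],
                              [0.9, 0.8, 0.5, 0.05, 0.04])
```

The first branch of `should_stop` is the predictive criterion and the second is the conventional patience criterion, so either one alone is sufficient to halt training.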

Motivation to Combine

A POSITA would be motivated to combine these references and general knowledge for the following reasons:

- Addressing Computational Inefficiency: The core problem addressed by US11650968, the wasteful expenditure of computational resources on neural network training when improvement is no longer likely, was a well-known challenge in the art. [cite: US11650968 Description] Prior art like Prechelt (2012) highlighted the importance of early stopping but primarily offered reactive solutions based on observed stagnation. A strong motivation therefore existed to develop more proactive and efficient stopping mechanisms.
- Improving Stopping Accuracy and Efficiency: WO2019053350A1 offered a significant step forward by introducing a "trained model" to predict an "expected performance curve." A POSITA would be motivated to integrate this predictive capability into existing early stopping strategies to achieve more precise and timely cessation of training.
- Generalization for Robustness: To make a predictive early stopping model broadly useful across different neural network architectures and hyperparameters, a POSITA would naturally train the "surrogate model" (from WO2019053350A1) on a diverse dataset of training runs from "a plurality of NNs." This is a standard approach in machine learning to ensure robustness and generalizability, and it makes the predictive model more valuable and widely applicable.
- Deriving Actionable Metrics: Once a predicted performance curve is available, it would be a straightforward and obvious step for a POSITA to derive actionable metrics such as the "probability of improvement." Such probabilities provide a clear, quantifiable basis for making stopping decisions.
- Integrating Established Heuristics: Combining a predictive probability threshold with a traditional "wait value" (patience) mechanism from existing early stopping methods (like those in Prechelt (2012)) would be an obvious way to create a robust and comprehensive stopping policy, ensuring that training is halted both when improvement is predicted to be unlikely and when it simply fails to materialize over time.

In conclusion, the combination of the teachings of WO2019053350A1, the well-known early stopping techniques described in works like Prechelt (2012), and the general understanding of machine learning principles by a POSITA would have made the method of Claim 1 of US11650968 obvious before the effective filing date of the claimed invention. The advancements claimed in US11650968 represent predictable developments and applications of known techniques to address a recognized problem in the field.