Patent 11275900
Obviousness
Combinations of prior art that suggest the claimed invention would have been obvious under 35 U.S.C. § 103.
Analysis of US Patent 11,275,900 under 35 U.S.C. § 103 suggests that its claims may be rendered obvious by combinations of prior art references. The core invention lies in applying machine learning to classify dark web forum content, using a pre-defined tag hierarchy to refine the results by adding parent tags based on prediction probabilities. Several prior art documents teach the essential elements of this process, and a person having ordinary skill in the art (POSITA) would have been motivated to combine them.
Analysis of Independent Claim 1
Claim 1 details a system with the following key elements:
- Accessing a ground truth dataset from deep web forums with a predetermined tag hierarchy.
- Extracting features from new data using word or paragraph vectors.
- Applying a machine classifier to generate a prediction list of tags with prediction probability values.
- Adding all parent tags to the prediction list based on a comparison between the prediction probability value and a first predetermined threshold.
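The fourth element, adding parent tags when a prediction clears a threshold, can be sketched in a few lines. The tag names, hierarchy, and threshold below are invented for illustration; the claim itself does not specify how confidence propagates upward, so carrying the child's probability to each ancestor is an assumption here.

```python
# Hypothetical tag hierarchy: maps each child tag to its parent.
# Top-level tags have no entry. Names are illustrative, not from the patent.
TAG_PARENTS = {
    "credit-card-fraud": "financial-fraud",
    "financial-fraud": "fraud",
}

def add_parent_tags(predictions, threshold):
    """Given a prediction list as {tag: probability}, add every ancestor
    of any tag whose probability meets the threshold (element 4)."""
    result = dict(predictions)
    for tag, prob in predictions.items():
        if prob >= threshold:
            parent = TAG_PARENTS.get(tag)
            while parent is not None and parent not in result:
                # Assumed propagation rule: a confident sub-category
                # prediction implies at least equal confidence in parents.
                result[parent] = max(result.get(parent, 0.0), prob)
                parent = TAG_PARENTS.get(parent)
    return result

tags = add_parent_tags({"credit-card-fraud": 0.92}, threshold=0.5)
# Both "financial-fraud" and "fraud" are now in the prediction list.
```

This mirrors the intuition discussed below: if the classifier is confident about a specific sub-category, the hierarchy dictates confidence in its parents.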
The combination of Forman (US 2004/0064464 A1) and IBM (US 2013/0097103 A1) appears to render the key features of claim 1 obvious.
Forman (US 2004/0064464 A1): This reference explicitly teaches a "hierarchical categorization method and system." Forman discloses using a hierarchy of categories (analogous to the patent's "tag hierarchy") to classify documents. The system uses multiple classifiers, and its core purpose is to organize information into a structured, hierarchical form. This directly addresses the concept of a "predetermined tag hierarchy" (element 1) and applying classifiers to categorize data (element 3). While Forman does not specify the "deep web," the application of hierarchical classification to text is a well-established principle taught by this reference.
IBM (US 2013/0097103 A1): This reference addresses the problem of training classifiers with imbalanced or unlabeled datasets. It discloses techniques for "Generating Balanced and Class-Independent Training Data From Unlabeled Data Set." This is highly relevant to the problem domain of US 11,275,900, which notes the difficulty of creating large, hand-labeled "ground truth" datasets for dark web content. IBM teaches the generation and use of a "ground truth dataset" to train a classifier (element 1). The reference also implicitly involves generating prediction probabilities to assess the classifier's output.
Motivation to Combine: A POSITA, skilled in machine learning and text classification, would have been motivated to combine Forman and IBM. Forman provides a robust framework for hierarchical classification. However, a known challenge in applying such a system to a new domain like the dark web is the scarcity of labeled training data—a problem the '900 patent explicitly aims to solve. The POSITA would naturally look to solutions like those presented in IBM to generate a more effective ground truth dataset for the hierarchical classifier described by Forman. The combination is a straightforward application of a known technique (IBM's data generation) to improve a known system (Forman's hierarchical classification). The use of word/paragraph vectors (element 2) was a standard and well-known method for feature extraction in natural language processing at the time of the invention. Adding parent tags based on a child tag's prediction probability (element 4) would be an obvious way to enforce the hierarchy taught by Forman; if the classifier is confident about a specific sub-category, it should also be confident about its parent categories.
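Element 2's feature extraction can be illustrated by the simplest vector-based scheme: averaging per-word embeddings into a document vector. The tiny embedding table below is invented for illustration; a real system would use learned word2vec or paragraph-vector (doc2vec) embeddings of much higher dimension.

```python
# Toy two-dimensional word embeddings, purely illustrative.
EMBEDDINGS = {
    "stolen": [0.9, 0.1],
    "cards":  [0.8, 0.2],
    "forum":  [0.1, 0.7],
}

def document_vector(tokens, dim=2):
    """Average the word vectors of known tokens into one feature vector,
    a common baseline for word-vector feature extraction (element 2)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return [0.0] * dim  # no known words: zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

features = document_vector("stolen cards for sale".split())
# Averages the vectors for "stolen" and "cards"; unknown words are skipped.
```

The point is only that such feature extraction was routine at the time of the invention, supporting the obviousness of element 2.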
Analysis of Independent Claim 12
Claim 12 mirrors the system of Claim 1 but is framed as a method. It includes:
- Accessing data from a deep web forum.
- Extracting features for a machine classifier.
- Applying the classifier to generate a prediction list with probability values.
- Adding parent tags based on a probability threshold.
The same combination of Forman (US 2004/0064464 A1) and IBM (US 2013/0097103 A1) also renders this claim obvious for the same reasons. The steps outlined in the method are functionally identical to the components of the system in Claim 1.
Additionally, the combination of Forman (US 2004/0064464 A1) with AGT International (US 2017/0032276 A1) provides another strong argument for obviousness.
- AGT International (US 2017/0032276 A1): This reference teaches "Data fusion and classification with imbalanced datasets." It describes using machine learning classifiers and explicitly deals with the problem of class imbalance, a core issue addressed by the '900 patent. AGT's methods are designed to improve classifier accuracy when some categories have far fewer training examples than others, which is characteristic of the dark web data described in the '900 patent.
Motivation to Combine: A POSITA would have been motivated to apply the techniques for handling imbalanced data from AGT to the hierarchical classification system of Forman. When classifying text into a deep hierarchy, it is almost certain that lower-level, more specific categories will have fewer examples than higher-level, broader categories, leading to an imbalanced dataset. To make Forman's hierarchical system work effectively on real-world data, a POSITA would have found it obvious to incorporate methods like those in AGT to ensure the classifier was not biased toward the more general, high-frequency parent categories. Adding parent tags based on a child's prediction confidence is a logical step to maintain the integrity of the hierarchy taught by Forman.
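One standard remedy for the imbalance described above is inverse-frequency class weighting during training. The sketch below is a generic illustration of that idea, not AGT's specific method; the label counts are invented.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency so rare, specific
    sub-categories are not drowned out by broad parent categories.
    A generic imbalance remedy, shown here for illustration only."""
    counts = Counter(labels)
    total = len(labels)
    # Balanced weighting: total / (num_classes * class_count)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Hypothetical training labels: the broad category dominates 9-to-1.
weights = class_weights(["fraud"] * 90 + ["credit-card-fraud"] * 10)
# The rare sub-category receives the larger weight.
```

With such weights, a misclassified example from the rare sub-category contributes more to the training loss, counteracting the bias toward high-frequency parent categories.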
In conclusion, the fundamental concepts of hierarchical classification, using ground truth data for training, and applying machine learning classifiers to text were all well-established in the prior art. The specific application to dark web forums represents an application of known techniques to a new, but analogous, domain. The refinement of enforcing the tag hierarchy by adding parent tags based on a probability threshold is an obvious implementation detail that a POSITA would have considered to ensure logical consistency in the classifier's output.
Generated 5/14/2026, 12:47:43 AM