Obviousness — US Patent 8190610

Under 35 U.S.C. § 103, an invention is considered obvious if the differences between the claimed invention and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art (POSITA). This analysis considers the known prior art references, the motivation to combine them, and whether a POSITA would have had a reasonable expectation of success in doing so.

The independent claims of US Patent 8190610 (Claims 1, 17, 33, and 40) are directed primarily to extending the MapReduce programming methodology so that it can efficiently process data from a plurality of grouped sets of key/value pairs, in particular enabling operations such as joins on related but possibly heterogeneous datasets (i.e., data with different schemas that share a common key).

Prior Art References

The authoritative prior art for this analysis, as indicated by the patent text, includes:

Dean and Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters," by Jeffrey Dean and Sanjay Ghemawat, appearing in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, Calif., December, 2004. This document describes the conventional MapReduce programming methodology. [cite: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"]
General Knowledge of Relational Database Management Systems (RDBMS): The patent itself references the common understanding of joining relational tables with different schemas on a common key (e.g., FIG. 3). This represents well-established principles in the field of data processing and databases.
Cited Patents: Although the patent cites numerous patents, none appears to directly teach handling heterogeneous schemas within a MapReduce framework for relational operations. Some, however, such as US6341289B1 ("Object identity and partitioning for user defined extents") and US7620936B2 ("Schema-oriented content management system"), show that the prior art was familiar with data partitioning, object identity, and schema management.

Obviousness Analysis

The core of US8190610's claims lies in adapting the MapReduce framework to overcome its described limitation in "efficiently process[ing] data from heterogeneous sources" and "perform[ing] joins over two relational tables that have different schemas".

A strong argument for obviousness can be built by combining Dean and Ghemawat with the general knowledge of relational database operations.

Combination: Dean and Ghemawat + General Knowledge of Relational Database Operations

Motivation to Combine:
A POSITA in 2006, such as a software engineer or database architect specializing in distributed systems and large-scale data processing, would have been familiar with both the scalability and fault tolerance of the MapReduce paradigm (as taught by Dean and Ghemawat) and the established techniques for managing and querying diverse data in relational databases. Efficiently performing common relational operations on large datasets, such as joining heterogeneous tables (i.e., tables with different schemas) on a common key, was a known and important problem. Recognizing the power of MapReduce for large-scale processing, a POSITA would have been motivated to extend the framework to these operations and so bring its advantages to tasks beyond simple aggregation or filtering. The patent itself states that it was "impractical to perform joins over two relational tables that have different schemas" with conventional MapReduce implementations, which identifies the problem a POSITA would have sought to solve.

Application to Independent Claims (e.g., Claims 1 and 33):

1. Treating data as "plurality of data groups" with "different schema" but a "key in common" (Claims 1 and 33):

Dean and Ghemawat teaches processing large datasets in a distributed system by splitting the input data and presenting it to user code as key/value pairs. [cite: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"]
It would have been obvious to a POSITA to apply this framework to various input sources, which naturally often comprise "data groups" (e.g., separate files, tables, or log streams) that possess "different schemas" (e.g., an "Employee" table and a "Department" table, as illustrated in FIG. 3 of the patent).
The concept of combining such heterogeneous data using a "key in common" (e.g., DeptID in a join operation) is fundamental to relational database theory and a well-known prerequisite for performing joins.
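The relational baseline referred to above can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table and column names follow the Employee/Department example discussed in this analysis, and the rows themselves are hypothetical, not taken from FIG. 3.

```python
import sqlite3


def relational_join():
    """Join two tables with different schemas on their common key, DeptID."""
    conn = sqlite3.connect(":memory:")
    # Two relations with different schemas that share the DeptID column.
    conn.execute("CREATE TABLE Employee (LastName TEXT, DeptID INTEGER)")
    conn.execute("CREATE TABLE Department (DeptID INTEGER, DeptName TEXT)")
    # Hypothetical sample rows (assumption, for illustration only).
    conn.executemany(
        "INSERT INTO Employee VALUES (?, ?)",
        [("Rafferty", 31), ("Jones", 33), ("Smith", 34)],
    )
    conn.executemany(
        "INSERT INTO Department VALUES (?, ?)",
        [(31, "Sales"), (33, "Engineering"), (35, "Marketing")],
    )
    # Inner join on the key in common; unmatched rows on either side drop out.
    rows = conn.execute(
        "SELECT e.LastName, d.DeptName "
        "FROM Employee e JOIN Department d ON e.DeptID = d.DeptID "
        "ORDER BY e.LastName"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for last_name, dept_name in relational_join():
        print(last_name, dept_name)
```

The output relation (LastName, DeptName) has a schema that differs from both inputs, which is the ordinary result of a join.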

2. "Mapping differently" for each group to "output different lists of values" and "intermediate data identifiable to that data group" (Claims 1 and 33):

Dean and Ghemawat describes user-configurable map functions that process input key/value pairs to produce intermediate key/value pairs. [cite: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"]
Given that input data groups have different schemas, it would be an obvious design choice for a POSITA to configure the map function differently for each data group. For example, a map function processing employee records would extract (DeptID, LastName), while a map function processing department records would extract (DeptID, DeptName). This "different mapping" naturally leads to "different lists of values" in the intermediate data.
To enable the correct merging of data from different original sources in the subsequent reduce phase, it would be obvious for a POSITA to ensure the "intermediate data" is "identifiable to that data group." This could be achieved by including a group identifier in the intermediate key (e.g., (group_id, out_key)), by including it in the intermediate value, or by storing intermediate data for different groups in distinct, identifiable locations. The patent's own example, emit_to_group(group, DeptID, val), takes this form.
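A minimal single-process sketch of group-specific mapping with tagged intermediate data is given below. It is not the patent's implementation: the function names, the group labels "emp" and "dept", and the in-memory dictionary standing in for intermediate storage are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical group identifiers (assumption, not from the patent).
EMPLOYEE, DEPARTMENT = "emp", "dept"


def map_employee(record):
    """Map an Employee record (LastName, DeptID) to (group, DeptID, LastName)."""
    last_name, dept_id = record
    yield (EMPLOYEE, dept_id, last_name)


def map_department(record):
    """Map a Department record (DeptID, DeptName) to (group, DeptID, DeptName)."""
    dept_id, dept_name = record
    yield (DEPARTMENT, dept_id, dept_name)


# One map function per data group, chosen by the group's schema.
MAPPERS = {EMPLOYEE: map_employee, DEPARTMENT: map_department}


def run_map_phase(inputs):
    """Apply each group's mapper; keep values separated by key, then by group."""
    intermediate = defaultdict(lambda: defaultdict(list))
    for group, records in inputs.items():
        for record in records:
            for tagged_group, key, value in MAPPERS[group](record):
                intermediate[key][tagged_group].append(value)
    return intermediate


if __name__ == "__main__":
    sample = {
        EMPLOYEE: [("Rafferty", 31), ("Jones", 33)],
        DEPARTMENT: [(31, "Sales"), (33, "Engineering")],
    }
    for key, groups in run_map_phase(sample).items():
        print(key, dict(groups))
```

Here the group tag travels alongside the key, which corresponds to the first of the three options listed above; placing the tag in the value or in a separate storage location would serve the same purpose.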

3. "Reducing the intermediate data for the data groups" by "processing the intermediate data for each data group in a manner that is defined to correspond to that data group" to "merge" based on a "key in common" (Claims 1 and 33):

Dean and Ghemawat teaches that the reduce function processes all intermediate values sharing the same key. [cite: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"]
When intermediate data from multiple heterogeneous groups, identified by their respective group IDs and sharing a "key in common," arrives at a reducer, a POSITA would be motivated to "merge" this data, mimicking a relational join operation.
To "process the intermediate data for each data group in a manner that is defined to correspond to that data group," it would be an obvious implementation choice to use separate "iterators" for the intermediate values belonging to each distinct group, as explicitly described in the patent ("applying a different iterator to intermediate values for each group"). This provides a structured and flexible way for the reducer to access and combine the distinct value lists associated with the common key from each original data group (e.g., emp_iter and dept_iter in the patent's pseudocode). This approach is analogous to how join algorithms in RDBMS might process streams of data from different relations.
The result, a "merging of the corresponding different intermediate data based on the key in common" (Claim 1) or an "output data set" with a "different schema than the first and second schema" (Claim 33), is the predictable outcome of performing a join operation on heterogeneous data.
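The reduce step described in this element can be sketched as below. The names emp_iter and dept_iter follow the patent's pseudocode as quoted above; everything else (function names, the nested-dictionary layout of the intermediate data, the sample values) is a hypothetical choice for illustration and not the patent's code.

```python
def reduce_join(dept_id, emp_iter, dept_iter):
    """Merge the two groups' value lists for one key in common (inner join)."""
    # Buffer the (typically smaller) department side so it can be re-scanned.
    dept_names = list(dept_iter)
    for last_name in emp_iter:
        for dept_name in dept_names:
            # Output schema (DeptID, LastName, DeptName) differs from both inputs.
            yield (dept_id, last_name, dept_name)


def run_reduce_phase(intermediate):
    """Call the reducer once per key, handing it one iterator per data group."""
    output = []
    for key in sorted(intermediate):
        groups = intermediate[key]
        emp_iter = iter(groups.get("emp", []))    # values from the Employee group
        dept_iter = iter(groups.get("dept", []))  # values from the Department group
        output.extend(reduce_join(key, emp_iter, dept_iter))
    return output


if __name__ == "__main__":
    # Hypothetical intermediate data: key -> group -> list of values.
    sample = {
        31: {"emp": ["Rafferty"], "dept": ["Sales"]},
        33: {"emp": ["Jones", "Heisenberg"], "dept": ["Engineering"]},
        34: {"emp": ["Smith"]},          # no matching department
        35: {"dept": ["Marketing"]},     # no matching employee
    }
    for row in run_reduce_phase(sample):
        print(row)
```

Keys present in only one group produce no output in this sketch, matching inner-join semantics; an outer join would only change how the empty side is handled inside reduce_join.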

4. "Mapping and reducing operations are performed by a distributed system" (Claims 1 and 33):

This element is explicitly taught by Dean and Ghemawat, which the patent characterizes at [0002] as describing a programming methodology for "parallel computations over distributed (typically, very large) data sets." [cite: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"]
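How a distributed run keeps records with a key in common together can be sketched as follows. Dean and Ghemawat describe a default partitioning of intermediate keys by a hash of the key modulo the number of reduce tasks; the CRC32 hash, the function names, and the tagged-record layout used here are assumptions chosen only to make the sketch deterministic and self-contained.

```python
import zlib


def partition(key, num_reducers):
    """Assign an intermediate key to a reduce task: hash(key) mod R."""
    # CRC32 is an arbitrary deterministic stand-in for the framework's hash.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(tagged_records, num_reducers):
    """Route (group, key, value) records to reduce partitions by key only."""
    partitions = [[] for _ in range(num_reducers)]
    for group, key, value in tagged_records:
        # The group tag is ignored for routing, so records from different
        # data groups that share a key land on the same reduce task.
        partitions[partition(key, num_reducers)].append((group, key, value))
    return partitions


if __name__ == "__main__":
    # Hypothetical tagged intermediate records from two data groups.
    records = [
        ("emp", 31, "Rafferty"),
        ("dept", 31, "Sales"),
        ("emp", 33, "Jones"),
        ("dept", 33, "Engineering"),
    ]
    for index, part in enumerate(shuffle(records, 4)):
        print(index, part)
```

Because routing depends on the key alone, each reduce task sees every group's values for the keys it owns, which is what allows the merge to run independently on each machine.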

Conclusion on Obviousness:
A POSITA, motivated to adapt the scalable MapReduce framework (Dean and Ghemawat) to efficiently perform well-known relational database operations like joins on heterogeneous datasets, would have found it obvious to apply known techniques of schema-specific data processing, group identification, and iterative data access to the MapReduce model. The proposed modifications represent predictable extensions to MapReduce for a known problem domain (relational database processing), yielding predictable results (enabling joins across disparate data). Thus, the claimed methods and systems would have been obvious to a POSITA at the time of the invention.