Patent 6460050

Obviousness

Combinations of prior art that suggest the claimed invention would have been obvious under 35 U.S.C. § 103.


Obviousness Analysis under 35 U.S.C. § 103

This analysis evaluates the obviousness of the claims of U.S. Patent No. 6,460,050 (the "'050 patent") in light of prior art existing before its priority date of December 22, 1999. The patent's core concept is to generate a digital identifier (a hash) on a client machine, send it to a central server, and use a collective database of these identifiers to classify content such as spam or viruses; that concept can be rendered obvious by combining technologies and techniques well known in that era.

A person having ordinary skill in the art (PHOSITA) at the time would have been a computer scientist or software engineer with experience in network protocols, email systems, and cybersecurity, including antivirus and anti-spam techniques.

Primary Obviousness Combination

A strong argument for obviousness can be constructed by combining the teachings of distributed antivirus signature systems with known hashing algorithms for file identification.

  1. Distributed Antivirus Technology (The "How"): By the late 1990s, antivirus software was a mature industry. Companies like Symantec and McAfee employed a client-server model where individual computers (clients) would run software that scanned files. These clients would periodically connect to a central server to download updated "signature" files. These signatures were essentially unique identifiers for known viruses. This established the model of a distributed network of agents all reporting to, and receiving updates from, a central authority to identify unwanted content. The fundamental architecture of a "plurality of agents" (the antivirus software on user machines) communicating with a central "server" having a "database" (the virus signature database) was well-established and ubiquitous.

  2. Content Hashing for Identification (The "What"): Hashing algorithms like MD5 and SHA-1 were well-known and standardized before 1999. They were widely used for verifying file integrity and creating unique, compact "digital fingerprints" or "identifiers" for any piece of digital data. A PHOSITA would have been well aware that hashing was the standard method for creating a unique, fixed-size identifier from a variable-sized piece of content (like an email body or a file). The use of a hash as a "digital ID" was a common and fundamental computer science technique.
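As a concrete illustration of the hashing technique described above, a minimal Python sketch follows; the function name and sample content are illustrative, not drawn from the patent or any specific prior-art reference:

```python
import hashlib

def content_id(body: bytes) -> str:
    """Return a compact, fixed-size 'digital fingerprint' of arbitrary
    content: the 128-bit MD5 digest, rendered as a 32-character hex string."""
    return hashlib.md5(body).hexdigest()

# Identical content always yields the identical identifier,
# regardless of the content's length...
assert content_id(b"Buy cheap watches now!!!") == content_id(b"Buy cheap watches now!!!")

# ...while any change to the content produces a different identifier.
assert content_id(b"Buy cheap watches now!") != content_id(b"Buy cheap watches now!!!")
```

This is exactly the property a PHOSITA would have exploited: a variable-sized email body collapses to a small, fixed-size ID that can be transmitted and compared cheaply.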

Motivation to Combine

The motivation to combine these two established concepts would have been driven by the clear and pressing need for a more efficient and scalable solution to the burgeoning problem of email spam.

By 1999, spam was a significant problem. Simple client-side keyword filtering was becoming ineffective as spammers adapted their methods, and sending entire emails to a third-party service for analysis was recognized as slow, bandwidth-intensive, and privacy-invasive, problems explicitly acknowledged in the background of the '050 patent itself.

A PHOSITA, tasked with creating a better spam filter, would have naturally looked at the successful model used for combating viruses. The analogy is direct:

  • A virus is an unwanted file; spam is an unwanted email.
  • The antivirus model uses a compact signature to identify a virus without needing to see the whole file every time.
  • Spam emails, particularly from a single campaign, are often identical or nearly identical.

The logical and obvious step would be to apply the proven, efficient client-server signature model from the antivirus world to the spam problem. Instead of a virus signature, the PHOSITA would use a hash of the email's content as the signature. This approach would be highly efficient, as only a small hash value would need to be transmitted over the network, solving the bandwidth and privacy issues of sending the entire email. The central server could then identify spam by observing that many different clients were submitting the exact same hash in a short period—a clear indicator of a mass-mailing campaign.
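The server-side logic described above can be sketched in a few lines of Python. This is a hypothetical illustration of the combined antivirus/hashing model, not code from the patent; the class name, method names, and threshold value are all assumptions chosen for clarity:

```python
from collections import Counter

# Illustrative threshold: a hash submitted this many times is
# treated as evidence of a mass-mailing campaign.
SPAM_THRESHOLD = 3

class ClassificationServer:
    """Central server that tallies content-ID submissions from many
    agents and classifies a hash as spam once it appears often enough."""

    def __init__(self) -> None:
        self.appearances: Counter = Counter()  # content ID -> submission count

    def submit(self, content_id: str) -> str:
        """Record one agent's submission and return the characteristic."""
        self.appearances[content_id] += 1
        if self.appearances[content_id] >= SPAM_THRESHOLD:
            return "spam"
        return "unknown"

server = ClassificationServer()
h = "d41d8cd98f00b204e9800998ecf8427e"  # hash reported by three different agents
server.submit(h)            # first sighting: "unknown"
server.submit(h)            # second sighting: "unknown"
assert server.submit(h) == "spam"  # third sighting crosses the threshold
```

The design choice mirrors the motivation stated above: only small hash values cross the network, and the classification falls out of simple frequency counting rather than content inspection.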

Analysis of Independent Claims

Claim 1: A file content classification system...

  • "a plurality of agents, each agent including a file content ID generator creating file content IDs using a mathematical algorithm": This is disclosed by the combination. The "agents" are analogous to the distributed antivirus clients. The "file content ID generator" is the known hashing algorithm (e.g., MD5) that the PHOSITA would obviously select to create a unique signature for an email.
  • "an ID appearance database, provided on a server, coupled to receive file content IDs from the agents": This describes the central antivirus signature server and its database, adapted to store hashes instead of virus signatures. The network connectivity is inherent to the client-server model.
  • "a characteristic comparison routine on the server, identifying a characteristic of the file content based on the appearance of the file content ID": This is the core logic of the central server. In the antivirus world, the "characteristic" is "is a virus," determined by checking if the file's signature is in the database. For spam, the obvious "characteristic" would be "is spam," and the "comparison routine" would be a simple algorithm to check if a hash appears with a high frequency from multiple agents, as motivated above.
  • "transmitting the characteristic to the client agents": This is the standard final step in the client-server model. The server informs the client of the result so the client can take action (e.g., quarantine a virus, or in this case, delete or flag spam).

Claims 9 & 16: A method for identifying characteristics... / A method of filtering an email message...

These method claims mirror the system of claim 1 and are rendered obvious by the same combination of prior art. The steps of "receiving... file content identifiers... from a plurality of... agents," "determining... whether each received content identifier matches a characteristic," and "outputting... an indication of the characteristic" are the direct and obvious implementation of the combined antivirus/hashing model applied to spam filtering.

Claim 21: A file content classification system... a computed value of at least two non-contiguous sections of data in a file...

This claim adds the limitation that the identifier is computed from "at least two non-contiguous sections of data." This was a known technique for making identifiers more robust against minor changes. Spammers in the late 1990s had already begun adding random text or extra whitespace to the end of messages to try to defeat simple hashing of the entire file. A PHOSITA would have considered it an obvious and routine design choice to create a more resilient hash by selecting stable parts of the message (e.g., the first 500 bytes and the last 500 bytes of the body, or the subject line and part of the body) and concatenating them before hashing. This would have been a predictable adaptation to circumvent known evasion techniques and would not constitute an inventive step.
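A minimal sketch of such a sectioned identifier, assuming MD5 as the underlying algorithm and a 500-byte section size (both illustrative; the patent claim does not specify them):

```python
import hashlib

def sectioned_id(body: bytes, section: int = 500) -> str:
    """Hash two non-contiguous sections of the content (the first and
    last `section` bytes), skipping the middle entirely."""
    if len(body) <= 2 * section:
        # Too short to have a distinct middle; hash the whole thing.
        return hashlib.md5(body).hexdigest()
    return hashlib.md5(body[:section] + body[-section:]).hexdigest()

base = b"A" * 2000
# Padding inserted into the *middle* of the message no longer
# changes the identifier, since the middle is never hashed...
assert sectioned_id(base[:1000] + b"random padding" + base[1000:]) == sectioned_id(base)
# ...while the sampled sections still participate, so changes
# inside them (e.g., junk appended at the end) alter the ID.
assert sectioned_id(base + b"junk") != sectioned_id(base)
```

Which sections to sample is a tuning decision against the evasion technique at hand, which is precisely why this reads as a routine design choice rather than an inventive step.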

Claim 22: A method for providing a service on the Internet...

This claim describes the invention as an internet "service." The combination of distributed clients (antivirus software) communicating over the internet with a central server (the update server) to provide a service (virus protection) was the standard business and technical model for this type of software in 1999. Applying this same service model to spam filtering by substituting hashes for virus signatures would have been an obvious commercial and technical implementation. The steps of "collecting data," "characterizing the files... based on said digital content identifiers," and "transmitting a substance identifier" are all present in the proposed obvious combination.
