Patent 7069287

Derivative works

Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.


Defensive Disclosure and Prior Art Generation

Based on U.S. Patent 7,069,287: Method for efficient computation of odd characteristic extension fields

Publication Date: May 10, 2026
Reference Patent: U.S. Patent 7,069,287 B2
Purpose: This document is intended to enter the public domain as prior art. The following disclosures describe foreseeable and obvious extensions, combinations, and variations of the inventions claimed in U.S. Patent 7,069,287. The intent is to prevent the patenting of these incremental improvements by third parties.


Derivative Disclosures Based on Claim 1 (Method)

The core method of claim 1 involves computing temporary coefficients (c_k') as a sum of intermediate products (a_i * b_j) without an immediate modular reduction, followed by a single modular reduction on the accumulated sum. The following are derivative methods based on this core concept.
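As a concrete illustration, the accumulate-then-reduce schedule of claim 1 can be sketched in a few lines of Python. The function name and list-based coefficient representation are illustrative choices, not taken from the patent, and reduction modulo the field polynomial is omitted:

```python
def mul_accumulate_then_reduce(a, b, p):
    """Schoolbook coefficient multiplication with deferred reduction.

    a, b: coefficient lists of two GF(p^m) elements (length m).
    Returns the 2m-1 coefficients c_k, each reduced mod p exactly once.
    """
    m = len(a)
    c_prime = [0] * (2 * m - 1)             # wide, unreduced accumulators
    for i in range(m):
        for j in range(m):
            c_prime[i + j] += a[i] * b[j]   # no per-product reduction
    return [ck % p for ck in c_prime]       # single reduction per c_k

# Example in GF(7^2): A = 3 + 5t, B = 2 + 6t
print(mul_accumulate_then_reduce([3, 5], [2, 6], 7))  # [6, 0, 2]
```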

1.1. Material & Component Substitution: Systolic Array Implementation

  • Derivative Idea: The method is implemented on a systolic array or a dedicated block within a Field-Programmable Gate Array (FPGA) rather than a general-purpose microcontroller's CPU and ALU. This substitutes the software-based control flow with a dedicated hardware data flow architecture.
  • Enabling Description: A 2D systolic array of m x m processing elements (PEs) is configured. Field element coefficients a_i and b_j are fed into the array from the top and left edges, respectively. Each PE performs a single multiplication a_i * b_j and adds the result to a value passed from a neighboring PE. The intermediate products are accumulated as they flow diagonally through the array. The unreduced sums c_k' emerge from the bottom and right edges of the array into a wide accumulator buffer (e.g., 24-bit or 32-bit width, depending on m and p). A final, separate hardware module, the Reduction Unit (RU), reads from this buffer and performs the multi-word modular reduction as described in Algorithm 1.2 of the reference patent. This replaces sequential software loops with parallel hardware execution.
  • Mermaid Diagram:
    graph TD
        subgraph Systolic Array Architecture
            direction TB
            A0[a_0] --> PE00((PE 0,0));
            A1[a_1] --> PE10((PE 1,0));
            B0[b_0] --> PE00;
            B1[b_1] --> PE01((PE 0,1));
            PE00 -->|a_0*b_0+...| PE11((PE 1,1));
            PE10 -->|a_1*b_0+...| PE11;
            PE01 -->|a_0*b_1+...| PE11;
            PE11 --> C_sum[c_k' Accumulator];
        end
        C_sum --> RU[Reduction Unit];
        RU --> C_final[Final c_k Coefficients];
    
        style PE00 fill:#f9f,stroke:#333,stroke-width:2px
        style PE10 fill:#f9f,stroke:#333,stroke-width:2px
        style PE01 fill:#f9f,stroke:#333,stroke-width:2px
        style PE11 fill:#f9f,stroke:#333,stroke-width:2px
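The diagonal data flow of the array can be modeled in software. The following Python sketch (names illustrative) treats each clock cycle t as firing the anti-diagonal of PEs with i + j = t, with a separate function standing in for the Reduction Unit:

```python
def systolic_wavefront(a, b):
    """Model of the PE array: cycle t fires the anti-diagonal i + j == t."""
    m = len(a)
    acc = [0] * (2 * m - 1)            # wide accumulator buffer at the edges
    for t in range(2 * m - 1):         # one wavefront per cycle
        for i in range(m):
            j = t - i
            if 0 <= j < m:
                acc[t] += a[i] * b[j]  # PE (i, j) multiply-accumulate
    return acc                         # unreduced c_k', handed to the RU

def reduction_unit(acc, p):
    """The RU's single per-coefficient modular reduction."""
    return [x % p for x in acc]

print(reduction_unit(systolic_wavefront([3, 5], [2, 6]), 7))  # [6, 0, 2]
```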
    

1.2. Material & Component Substitution: GPU-based Parallelization

  • Derivative Idea: The method is adapted for massively parallel execution on a Graphics Processing Unit (GPU). The computation of each temporary coefficient c_k' is assigned to a separate thread or a block of threads.
  • Enabling Description: The coefficients of field elements A and B are loaded into the GPU's global memory. A CUDA or OpenCL kernel is launched with 2m-1 threads (or thread blocks). Each thread k is responsible for calculating a single temporary coefficient c_k'. It iterates through the necessary i and j indices where i+j=k, computes the products a_i * b_j, and accumulates them in its local register or shared memory. Because all threads run in parallel, all unreduced c_k' coefficients are generated simultaneously. A second kernel, or a synchronization step followed by a final reduction stage on the GPU, performs the modular reduction on all c_k' coefficients in parallel, writing the final reduced coefficients back to global memory.
  • Mermaid Diagram:
    sequenceDiagram
        participant CPU
        participant GPU
        CPU->>GPU: Transfer coefficients a_i, b_j to Global Memory
        CPU->>GPU: Launch Kernel 1 (Parallel Accumulation)
        activate GPU
        par For k = 0 to 2m-2
            GPU->>GPU: Thread k computes c_k' = sum(a_i * b_j)
        end
        GPU-->>CPU: Kernel 1 complete
        deactivate GPU
        CPU->>GPU: Launch Kernel 2 (Parallel Reduction)
        activate GPU
        par For k = 0 to 2m-2
            GPU->>GPU: Thread k computes c_k = c_k' mod p
        end
        GPU-->>CPU: Kernel 2 complete
        deactivate GPU
        CPU->>GPU: Read final c_k coefficients from Global Memory
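The two-kernel structure can be emulated on a CPU, with `ThreadPoolExecutor` standing in for the GPU thread grid (a hedged model of the dispatch, not CUDA/OpenCL code):

```python
from concurrent.futures import ThreadPoolExecutor

def thread_k_accumulate(a, b, k):
    """Kernel-1 work item: thread k sums all a_i*b_j with i + j == k."""
    m = len(a)
    lo, hi = max(0, k - m + 1), min(k, m - 1)
    return sum(a[i] * b[k - i] for i in range(lo, hi + 1))

def gpu_style_multiply(a, b, p):
    m = len(a)
    with ThreadPoolExecutor() as pool:
        # Kernel 1: all unreduced c_k' computed in parallel
        c_prime = list(pool.map(lambda k: thread_k_accumulate(a, b, k),
                                range(2 * m - 1)))
        # Kernel 2: per-coefficient modular reduction in parallel
        return list(pool.map(lambda ck: ck % p, c_prime))

print(gpu_style_multiply([3, 5], [2, 6], 7))  # [6, 0, 2]
```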
    

1.3. Operational Parameter Expansion: Post-Quantum Cryptography Scale

  • Derivative Idea: The method is applied to finite fields with parameters suitable for post-quantum cryptography, specifically isogeny-based cryptography (e.g., SIDH/SIKE), which operates over fields of characteristic p = 2^x * 3^y - 1 with bit lengths of 751 bits or more.
  • Enabling Description: In this context, the coefficients a_i and b_j are themselves multi-word integers (e.g., composed of multiple 64-bit words). The intermediate product a_i * b_j is a large integer requiring a significant number of registers. The accumulation step c_k' += a_i * b_j is performed using arbitrary-precision arithmetic libraries. The key inventive step is preserved: the computationally intensive modular reduction (mod p) is deferred until the entire sum for c_k' is computed. The accumulator for c_k' must be able to hold a value significantly larger than p^2, potentially spanning dozens of 64-bit words, before the final, specialized reduction algorithm is applied.
  • Mermaid Diagram:
    flowchart TD
        A[Start: PQC-scale Multiplication A*B] --> B{Initialize Multi-Word c_k' Accumulator};
        B --> C{Loop for each c_k'};
        C --> D{Loop for each product a_i*b_j for k};
        D --> E[Compute multi-word product T = a_i * b_j];
        E --> F[Add T to c_k' using arbitrary-precision addition];
        F --> D;
        D -- done --> G[Perform final multi-word modular reduction: c_k = c_k' mod p];
        G --> C;
        C -- done --> H[End: Final c_k coefficients];
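A minimal sketch at PQC scale, assuming the SIKEp751 prime p = 2^372 * 3^239 - 1 and using Python's arbitrary-precision integers as a stand-in for a multi-word bignum library:

```python
# SIKEp751-style prime, ~751 bits
P751 = 2**372 * 3**239 - 1

def mul_deferred_reduction(a, b, p):
    """Coefficient multiplication where a_i, b_j are themselves huge integers."""
    m = len(a)
    c_prime = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c_prime[i + j] += a[i] * b[j]   # multi-word add, no reduction
    # accumulators may exceed p**2 before the single final reduction
    return [ck % p for ck in c_prime]
```

With a = b = [p-1, p-1], the middle accumulator reaches 2(p-1)^2 — larger than p^2 — before its single reduction yields [1, 2, 1], illustrating why the accumulator must span many machine words.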
    

1.4. Cross-Domain Application: Number Theoretic Transforms in Signal Processing

  • Derivative Idea: The core method is used to optimize the butterfly operations within a Number Theoretic Transform (NTT), a variant of the FFT used for high-speed, error-free convolution of digital signals.
  • Enabling Description: An NTT operation involves repeated multiplications by powers of a root of unity (w) within a finite field GF(p). A standard decimation-in-time NTT algorithm requires numerous A = X + Y*w^k and B = X - Y*w^k calculations. The multiplication Y*w^k is performed using the "add-first, reduce-later" method. When the elements Y and w^k are represented as polynomials in an extension field GF(p^m), the coefficient products are accumulated without reduction. This is especially useful in hardware DSPs where multi-word accumulators are standard features. The expensive modular reduction is performed only once after all coefficient products for the Y*w^k term are summed, before the final additions/subtractions with X.
  • Mermaid Diagram:
    graph TD
        subgraph NTT Butterfly Unit
            X[Input X] --> AddSub1;
            Y[Input Y] --> Mult;
            W[Twiddle Factor w^k] --> Mult;
            Mult(Multiply Y*w^k) -- uses 'add-first, reduce-later' --> TempProd[Unreduced Product];
            TempProd --> Reducer[Modular Reducer];
            Reducer --> ReducedProd[Reduced Product];
            ReducedProd --> AddSub1(Add/Subtract);
            ReducedProd --> AddSub2(Add/Subtract);
            X --> AddSub2;
            AddSub1 --> Output_A[Output A = X + Yw^k];
            AddSub2 --> Output_B[Output B = X - Yw^k];
        end
        style Mult fill:#cde,stroke:#333
        style Reducer fill:#f99,stroke:#333
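A minimal GF(p) butterfly matching the diagram (the m = 1 case; the deferred-reduction benefit appears when y and w^k are multi-word, which Python's native big integers model by forming the full product before a single reduction):

```python
def ntt_butterfly(x, y, wk, p):
    """Decimation-in-time butterfly over GF(p).

    The twiddle product y*wk is formed in full (add-first) and reduced
    exactly once before the final add/subtract with x.
    """
    t = (y * wk) % p                  # single reduction of the full product
    return (x + t) % p, (x - t) % p

# p = 17, w = 4 is a 4th root of unity (4^2 = 16 = -1 mod 17)
print(ntt_butterfly(3, 5, 4, 17))    # (6, 0)
```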
    

1.5. Integration with Emerging Tech: AI-Driven Dynamic Reduction Scheduling

  • Derivative Idea: An AI model, such as a lightweight neural network or a reinforcement learning agent, dynamically determines the optimal number of intermediate products to accumulate before triggering a modular reduction.
  • Enabling Description: A small multilayer perceptron (MLP) is pre-trained or trained on-device. Its inputs are features of the system state: CPU load, cache miss rate, available accumulator memory, and statistical properties of the operand coefficients (e.g., average bit length). The MLP's output is an integer N, representing the number of a_i * b_j products to sum before a reduction is performed. Instead of accumulating all products for a given c_k', the multiplication loop runs N times, accumulates, and then calls a partial reduction routine. This adaptive approach allows the system to balance throughput against memory pressure in real-time, outperforming a fixed strategy when computational loads are variable.
  • Mermaid Diagram:
    stateDiagram-v2
        [*] --> Idle
        Idle --> Accumulating: Start Calc c_k'
        state Accumulating {
            direction LR
            [*] --> Summing
            Summing --> CheckPolicy: product added
            CheckPolicy --> Summing: Policy says continue
            CheckPolicy --> Reducing: Policy says reduce
        }
        Accumulating --> Done: All products summed
        Reducing --> Accumulating: Partial reduction complete
        Done --> [*]
    
        note left of Accumulating
            AI Policy Engine:
            Input: CPU Load, Mem Usage
            Output: Reduce or Continue
        end note
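The policy-driven accumulation loop can be sketched as follows; the policy function below is a hypothetical heuristic standing in for the trained MLP, and all names are illustrative:

```python
def policy_n(cpu_load, mem_free_bytes):
    """Stand-in for the MLP policy: how many products to accumulate
    before a partial reduction (hypothetical heuristic, not a model)."""
    return 2 if mem_free_bytes < 1024 else 8

def accumulate_with_policy(products, p, n):
    acc, since_reduce = 0, 0
    for t in products:                 # t = one a_i * b_j product
        acc += t
        since_reduce += 1
        if since_reduce == n:          # policy-triggered partial reduction
            acc %= p
            since_reduce = 0
    return acc % p                     # final reduction

products = [3 * 2, 5 * 6, 2 * 4]       # arbitrary a_i * b_j products
assert accumulate_with_policy(products, 7, policy_n(0.5, 512)) == sum(products) % 7
```

Because reduction mod p preserves congruence, every choice of N yields the same residue; N only trades accumulator width against reduction count, which is what makes it safe for the policy to vary at runtime.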
    

1.6. The "Inverse" or Failure Mode: Graceful Degradation for Low-Power States

  • Derivative Idea: A version of the method designed for power-constrained devices that prioritizes stability over speed. When a low-power state is detected, the system switches to a less memory-intensive, albeit slower, multiplication algorithm.
  • Enabling Description: The system continuously monitors the device's battery level or power state. During normal operation (e.g., battery > 20%), it uses the high-performance "add-first, reduce-later" method, which requires a larger RAM footprint for the multi-word c_k' accumulators. When the battery level drops below the threshold, the system flags a state change. On the next call to the multiplication function, it dynamically dispatches to the conventional method: c = (c + (a_i * b_j) mod p) mod p. This avoids allocating large accumulators, reducing the risk of a low-memory crash, and ensures that critical cryptographic operations can complete successfully, even with degraded performance.
  • Mermaid Diagram:
    flowchart TD
        A[Start Multiplication] --> B{Check Power State};
        B -- "Normal (>20%)" --> C[Use 'add-first, reduce-later' method];
        B -- "Low (<20%)" --> D[Use 'conventional' reduce-per-product method];
        C --> E[Allocate Multi-word Accumulator];
        E --> F[Compute full sum c_k'];
        F --> G[Perform single reduction];
        D --> H[Use Single-word Accumulator];
        H --> I[Loop: c = (c + a_i*b_j mod p) mod p];
        G --> Z[Return Result];
        I --> Z;
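The power-aware dispatch can be sketched in Python (the 20% threshold comes from the description above; both paths must agree on the result, differing only in accumulator footprint):

```python
LOW_BATTERY_PCT = 20  # illustrative threshold from the description

def multiply_coeffs(a, b, p, battery_pct):
    m = len(a)
    c = [0] * (2 * m - 1)
    if battery_pct > LOW_BATTERY_PCT:
        # Normal path: wide accumulators, single reduction at the end
        for i in range(m):
            for j in range(m):
                c[i + j] += a[i] * b[j]
        return [x % p for x in c]
    # Low-power path: reduce after every product, small accumulators
    for i in range(m):
        for j in range(m):
            c[i + j] = (c[i + j] + (a[i] * b[j]) % p) % p
    return c

# Both paths agree on the result; only memory use and speed differ.
assert multiply_coeffs([3, 5], [2, 6], 7, 80) == multiply_coeffs([3, 5], [2, 6], 7, 10)
```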
    

Combination Prior Art with Open-Source Standards

2.1. RISC-V Custom ISA Extension

  • Scenario: The method is implemented as a set of custom instructions within the open-source RISC-V Instruction Set Architecture (ISA). This creates a hardware-software co-design that optimally executes the claimed method.
  • Enabling Description: We propose a non-standard RISC-V extension, "Xocem" (Odd Characteristic Extension Multiplication). It defines the following instructions:
    1. ocem.maccum rd, rs1, rs2: Takes coefficients from registers rs1 and rs2, multiplies them, and adds the full unreduced product to a special wide accumulator register rd (e.g., a 64-bit accumulator on a 32-bit core). This instruction does not perform any modular reduction.
    2. ocem.reduce rd, rs1: Takes a multi-word value from register pair rs1, performs the efficient multi-step reduction (x_1*c + x_0) mod p, and writes the single-word result to rd.
    3. ocem.setp rs1: Sets the modulus p for the reduction hardware from a value in rs1.
      An open-source RISC-V core (like Rocket or BOOM) is modified to include the hardware logic for these instructions. A modified GCC or LLVM compiler toolchain is then used to recognize a C intrinsic function, e.g., __riscv_ocem_multiply(), and automatically generate the optimal sequence of ocem.maccum and ocem.reduce instructions, thus combining the patented method with an open standard for processor design.
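The intended semantics of the three proposed instructions can be modeled in software. The instruction names come from this disclosure's hypothetical "Xocem" extension; the class below is an illustrative functional model, not RTL:

```python
class OcemUnit:
    """Functional model of the proposed Xocem instruction semantics."""

    def __init__(self):
        self.p = None
        self.acc = 0                 # wide accumulator (e.g., 64-bit on RV32)

    def setp(self, p):               # ocem.setp rs1
        self.p = p

    def maccum(self, a, b):          # ocem.maccum rd, rs1, rs2
        self.acc += a * b            # unreduced multiply-accumulate

    def reduce(self):                # ocem.reduce rd, rs1
        r, self.acc = self.acc % self.p, 0
        return r

u = OcemUnit()
u.setp(7)
u.maccum(3, 2)                       # a_0 * b_0
u.maccum(5, 6)                       # a_1 * b_1 (accumulator now 36)
print(u.reduce())                    # 1
```

A compiler intrinsic such as the hypothetical __riscv_ocem_multiply() would lower to exactly this sequence: repeated maccum operations followed by one reduce.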

2.2. OpenSSL Engine for Embedded Systems

  • Scenario: The method is integrated into the widely used OpenSSL cryptographic library via its ENGINE API.
  • Enabling Description: An OpenSSL ENGINE is created, named "ocem_engine". This engine provides alternative implementations for high-level Elliptic Curve functions, specifically ECDSA_sign and ECDSA_verify. When the engine is loaded on a system with a processor known to benefit from the "add-first, reduce-later" strategy (e.g., an ARM Cortex-M4 or an Intel 8051 variant), the engine overrides OpenSSL's default bignum arithmetic functions. It replaces the standard field multiplication routines with a new function that implements the claimed method, using multi-precision integer additions to accumulate c_k' before a final reduction. The source code for this engine is made publicly available on a repository like GitHub, providing a direct implementation of the method within a major open-source cryptographic framework.

2.3. FIDO2/WebAuthn Open-Source Firmware

  • Scenario: The method is embedded into the firmware of an open-source FIDO2/WebAuthn hardware authenticator (e.g., SoloKeys or Nitrokey) to accelerate the cryptographic signature generation.
  • Enabling Description: The firmware for a FIDO2 authenticator, which is typically written in C or Rust for an embedded microcontroller (like a Nordic nRF52840 or STM32), is modified. The cryptographic library used for generating the ECDSA signature over the P-256 curve (which operates over a prime field GF(p)) is replaced or augmented. When the device must sign the client data hash during an authentication ceremony (authenticatorGetAssertion operation), the underlying scalar multiplication of the private key with the curve's base point relies on the claimed field multiplication method. By accumulating intermediate products and performing a single reduction, the overall latency of the signature generation is reduced, providing a faster and more responsive user experience for passwordless login, while being fully compliant with the open WebAuthn and FIDO2 standards.
