Patent 7069287
Derivative works
Defensive disclosure: derivative variations of each claim designed to render future incremental improvements obvious or non-novel.
Defensive Disclosure and Prior Art Generation
Based on U.S. Patent 7,069,287: Method for efficient computation of odd characteristic extension fields
Publication Date: May 10, 2026
Reference Patent: U.S. Patent 7,069,287 B2
Purpose: This document is intended to enter the public domain as prior art. The following disclosures describe foreseeable and obvious extensions, combinations, and variations of the inventions claimed in U.S. Patent 7,069,287. The intent is to prevent the patenting of these incremental improvements by third parties.
Derivative Disclosures Based on Claim 1 (Method)
The core method of claim 1 involves computing temporary coefficients (c_k') as a sum of intermediate products (a_i * b_j) without an immediate modular reduction, followed by a single modular reduction on the accumulated sum. The following are derivative methods based on this core concept.
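The core computation can be made concrete with a minimal Python sketch. The parameters below (p = 13, m = 3) and the function names are illustrative, not taken from the patent; polynomial reduction by the field's irreducible polynomial is a separate subsequent step and is not shown.

```python
def mul_deferred(a, b, p):
    """'Add-first, reduce-later': accumulate every product a_i * b_j
    with i + j == k into a temporary c_k', then reduce each sum mod p
    exactly once."""
    m = len(a)
    c_prime = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c_prime[i + j] += a[i] * b[j]   # no reduction inside the loop
    return [ck % p for ck in c_prime]       # single reduction per coefficient

def mul_conventional(a, b, p):
    """Baseline that reduces after every single product."""
    m = len(a)
    c = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c[i + j] = (c[i + j] + a[i] * b[j] % p) % p
    return c

assert mul_deferred([3, 7, 11], [5, 2, 9], 13) == \
       mul_conventional([3, 7, 11], [5, 2, 9], 13)
```

Both routines agree on every input; the deferred variant trades a wider accumulator for far fewer modular reductions.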
1.1. Material & Component Substitution: Systolic Array Implementation
- Derivative Idea: The method is implemented on a systolic array or a dedicated block within a Field-Programmable Gate Array (FPGA) rather than a general-purpose microcontroller's CPU and ALU. This substitutes the software-based control flow with a dedicated hardware data flow architecture.
- Enabling Description: A 2D systolic array of `m x m` processing elements (PEs) is configured. Field element coefficients `a_i` and `b_j` are fed into the array from the top and left edges, respectively. Each PE performs a single multiplication `a_i * b_j` and adds the result to a value passed from a neighboring PE. The intermediate products are accumulated as they flow diagonally through the array. The unreduced sums `c_k'` emerge from the bottom and right edges of the array into a wide accumulator buffer (e.g., 24-bit or 32-bit width, depending on `m` and `p`). A final, separate hardware module, the Reduction Unit (RU), reads from this buffer and performs the multi-word modular reduction as described in Algorithm 1.2 of the reference patent. This replaces sequential software loops with parallel hardware execution.
- Mermaid Diagram:

```mermaid
graph TD
    subgraph "Systolic Array Architecture"
        direction TB
        A0[a_0] --> PE00((PE 0,0))
        A1[a_1] --> PE10((PE 1,0))
        B0[b_0] --> PE00
        B1[b_1] --> PE01((PE 0,1))
        PE00 -->|a_0*b_0+...| PE11((PE 1,1))
        PE10 -->|a_1*b_0+...| PE11
        PE01 -->|a_0*b_1+...| PE11
        PE11 --> C_sum["c_k' Accumulator"]
    end
    C_sum --> RU[Reduction Unit]
    RU --> C_final[Final c_k Coefficients]
    style PE00 fill:#f9f,stroke:#333,stroke-width:2px
    style PE10 fill:#f9f,stroke:#333,stroke-width:2px
    style PE01 fill:#f9f,stroke:#333,stroke-width:2px
    style PE11 fill:#f9f,stroke:#333,stroke-width:2px
```
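The anti-diagonal data flow can be validated in software before committing it to RTL. A hedged Python model follows: cycle-accurate pipelining is abstracted away, each PE contributes exactly one product, and the grid size and operands are illustrative.

```python
def systolic_model(a, b, p):
    """Software model of the m x m PE grid: products on the same
    anti-diagonal (i + j == k) accumulate into the wide c_k' buffer,
    and a Reduction Unit reduces each buffered sum once."""
    m = len(a)
    # PE grid: one multiplication per processing element
    pe = [[a[i] * b[j] for j in range(m)] for i in range(m)]
    # Wide accumulator buffer: collect each anti-diagonal, unreduced
    c_prime = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c_prime[i + j] += pe[i][j]
    # Reduction Unit: one modular reduction per accumulated sum
    return [ck % p for ck in c_prime]
```

The model produces the same coefficient vector as a sequential implementation of the claimed method, which is the property a hardware testbench would check.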
1.2. Material & Component Substitution: GPU-based Parallelization
- Derivative Idea: The method is adapted for massively parallel execution on a Graphics Processing Unit (GPU). The computation of each temporary coefficient `c_k'` is assigned to a separate thread or a block of threads.
- Enabling Description: The coefficients of field elements A and B are loaded into the GPU's global memory. A CUDA or OpenCL kernel is launched with `2m-1` threads (or thread blocks). Each thread `k` is responsible for calculating a single temporary coefficient `c_k'`. It iterates through the necessary `i` and `j` indices where `i+j=k`, computes the products `a_i * b_j`, and accumulates them in its local register or shared memory. Because all threads run in parallel, all unreduced `c_k'` coefficients are generated simultaneously. A second kernel, or a synchronization step followed by a final reduction stage on the GPU, performs the modular reduction on all `c_k'` coefficients in parallel, writing the final reduced coefficients back to global memory.
- Mermaid Diagram:

```mermaid
sequenceDiagram
    participant CPU
    participant GPU
    CPU->>GPU: Transfer coefficients a_i, b_j to Global Memory
    CPU->>GPU: Launch Kernel 1 (Parallel Accumulation)
    activate GPU
    par For k = 0 to 2m-2
        GPU->>GPU: Thread k computes c_k' = sum(a_i * b_j)
    end
    GPU-->>CPU: Kernel 1 complete
    deactivate GPU
    CPU->>GPU: Launch Kernel 2 (Parallel Reduction)
    activate GPU
    par For k = 0 to 2m-2
        GPU->>GPU: Thread k computes c_k = c_k' mod p
    end
    GPU-->>CPU: Kernel 2 complete
    deactivate GPU
    CPU->>GPU: Read final c_k coefficients from Global Memory
```
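A host-side Python model of the two-kernel structure can clarify the thread-level work split. Each "thread" here is simulated by one loop iteration; in a real CUDA or OpenCL port each call body would run as one device thread. Function names and parameters are ours, for illustration.

```python
def kernel1_accumulate(a, b, k):
    """Thread k of 'kernel 1': sum all a_i * b_j with i + j == k,
    with no modular reduction."""
    m = len(a)
    lo, hi = max(0, k - m + 1), min(k, m - 1)
    return sum(a[i] * b[k - i] for i in range(lo, hi + 1))

def kernel2_reduce(c_prime_k, p):
    """Thread k of 'kernel 2': single modular reduction of its coefficient."""
    return c_prime_k % p

def gpu_model(a, b, p):
    m = len(a)
    # "Kernel 1": all 2m-1 threads accumulate unreduced sums in parallel
    c_prime = [kernel1_accumulate(a, b, k) for k in range(2 * m - 1)]
    # "Kernel 2": all threads reduce their coefficient in parallel
    return [kernel2_reduce(ck, p) for ck in c_prime]
```

Because every `c_k'` is independent, no inter-thread synchronization is needed until the kernel boundary, which is the point of mapping this method to a GPU.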
1.3. Operational Parameter Expansion: Post-Quantum Cryptography Scale
- Derivative Idea: The method is applied to finite fields with parameters suitable for post-quantum cryptography, specifically isogeny-based cryptography (e.g., SIDH/SIKE), which may require fields of size `p = 2^x * 3^y - 1` with bit lengths of 751 bits or more.
- Enabling Description: In this context, the coefficients `a_i` and `b_j` are themselves multi-word integers (e.g., composed of multiple 64-bit words). The intermediate product `a_i * b_j` is a large integer requiring a significant number of registers. The accumulation step `c_k' += a_i * b_j` is performed using arbitrary-precision arithmetic libraries. The key inventive step is preserved: the computationally intensive modular reduction (mod `p`) is deferred until the entire sum for `c_k'` is computed. The accumulator for `c_k'` must be able to hold a value significantly larger than `p^2`, potentially spanning dozens of 64-bit words, before the final, specialized reduction algorithm is applied.
- Mermaid Diagram:

```mermaid
flowchart TD
    A[Start: PQC-scale Multiplication A*B] --> B{"Initialize Multi-Word c_k' Accumulator"}
    B --> C{"Loop for each c_k'"}
    C --> D{Loop for each product a_i*b_j for k}
    D --> E[Compute multi-word product T = a_i * b_j]
    E --> F["Add T to c_k' using arbitrary-precision addition"]
    F --> D
    D -- done --> G["Perform final multi-word modular reduction: c_k = c_k' mod p"]
    G --> C
    C -- done --> H[End: Final c_k coefficients]
```
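At this scale the sketch below leans on Python's arbitrary-precision integers to stand in for the multi-word accumulator. The prime has the `2^x * 3^y - 1` shape named above, `m = 2` models a quadratic extension as used by isogeny-based schemes, and the operand values are randomized but seeded; all of these choices are illustrative.

```python
import random

p = 2**372 * 3**239 - 1     # a 751-bit prime of the stated 2^x * 3^y - 1 form
m = 2                        # GF(p^2)-style quadratic extension

random.seed(1)
a = [random.randrange(p) for _ in range(m)]
b = [random.randrange(p) for _ in range(m)]

# Accumulate all coefficient products without any intermediate reduction.
c_prime = [0] * (2 * m - 1)
for i in range(m):
    for j in range(m):
        c_prime[i + j] += a[i] * b[j]   # may exceed p**2 before reduction

# The unreduced sum is bounded by m * (p-1)^2, which fixes the accumulator width.
assert max(c_prime) <= m * (p - 1) ** 2

# Deferred step: one modular reduction per temporary coefficient.
c = [ck % p for ck in c_prime]
```

A production implementation would replace the final `% p` with a reduction specialized to the prime's shape; the deferral structure is what carries over.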
1.4. Cross-Domain Application: Number Theoretic Transforms in Signal Processing
- Derivative Idea: The core method is used to optimize the butterfly operations within a Number Theoretic Transform (NTT), a variant of the FFT used for high-speed, error-free convolution of digital signals.
- Enabling Description: An NTT operation involves repeated multiplications by powers of a root of unity (`w`) within a finite field `GF(p)`. A standard decimation-in-time NTT algorithm requires numerous `A = X + Y*w^k` and `B = X - Y*w^k` calculations. The multiplication `Y*w^k` is performed using the "add-first, reduce-later" method. When the elements `Y` and `w^k` are represented as polynomials in an extension field `GF(p^m)`, the coefficient products are accumulated without reduction. This is especially useful in hardware DSPs where multi-word accumulators are standard features. The expensive modular reduction is performed only once, after all coefficient products for the `Y*w^k` term are summed, before the final additions/subtractions with `X`.
- Mermaid Diagram:

```mermaid
graph TD
    subgraph "NTT Butterfly Unit"
        X[Input X] --> AddSub1
        Y[Input Y] --> Mult
        W[Twiddle Factor w^k] --> Mult
        Mult(Multiply Y*w^k) -->|uses add-first, reduce-later| TempProd[Unreduced Product]
        TempProd --> Reducer[Modular Reducer]
        Reducer --> ReducedProd[Reduced Product]
        ReducedProd --> AddSub1(Add/Subtract)
        ReducedProd --> AddSub2(Add/Subtract)
        X --> AddSub2
        AddSub1 --> Output_A[Output A = X + Yw^k]
        AddSub2 --> Output_B[Output B = X - Yw^k]
    end
    style Mult fill:#cde,stroke:#333
    style Reducer fill:#f99,stroke:#333
```
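One butterfly can be sketched over a small quadratic extension. The sketch assumes GF(p^2) built with irreducible polynomial x^2 + 1, which is valid for the illustrative p = 19 since p ≡ 3 (mod 4); the fold x^2 → -1 happens during accumulation, and the mod-p reduction runs once per output coefficient.

```python
p = 19  # illustrative; p ≡ 3 (mod 4), so x^2 + 1 is irreducible over GF(p)

def gf_p2_mul_deferred(a, b):
    """(a0 + a1*x) * (b0 + b1*x) in GF(p^2) with x^2 = -1.
    Products are accumulated first; mod p runs once per coefficient."""
    c0_prime = a[0] * b[0] - a[1] * b[1]   # x^2 term folded in, still unreduced
    c1_prime = a[0] * b[1] + a[1] * b[0]
    return [c0_prime % p, c1_prime % p]    # single reduction per coefficient

def butterfly(X, Y, w):
    """One DIT butterfly: return (X + Y*w, X - Y*w) over GF(p^2)."""
    t = gf_p2_mul_deferred(Y, w)           # the deferred-reduction multiply
    A = [(X[i] + t[i]) % p for i in range(2)]
    B = [(X[i] - t[i]) % p for i in range(2)]
    return A, B

A, B = butterfly([1, 2], [3, 4], [5, 6])
```

In a hardware DSP the two unreduced sums map directly onto the wide multiply-accumulate path, with the single reducer sitting between the multiplier and the add/subtract stage, as in the diagram.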
1.5. Integration with Emerging Tech: AI-Driven Dynamic Reduction Scheduling
- Derivative Idea: An AI model, such as a lightweight neural network or a reinforcement learning agent, dynamically determines the optimal number of intermediate products to accumulate before triggering a modular reduction.
- Enabling Description: A small multilayer perceptron (MLP) is pre-trained or trained on-device. Its inputs are features of the system state: CPU load, cache miss rate, available accumulator memory, and statistical properties of the operand coefficients (e.g., average bit length). The MLP's output is an integer `N`, representing the number of `a_i * b_j` products to sum before a reduction is performed. Instead of accumulating all products for a given `c_k'`, the multiplication loop runs `N` times, accumulates, and then calls a partial reduction routine. This adaptive approach allows the system to balance throughput against memory pressure in real time, outperforming a fixed strategy when computational loads are variable.
- Mermaid Diagram:

```mermaid
stateDiagram-v2
    [*] --> Idle
    Idle --> Accumulating: Start Calc c_k'
    state Accumulating {
        direction LR
        [*] --> Summing
        Summing --> CheckPolicy: product added
        CheckPolicy --> Summing: Policy says continue
        CheckPolicy --> Reducing: Policy says reduce
    }
    Accumulating --> Done: All products summed
    Reducing --> Accumulating: Partial reduction complete
    Done --> [*]
    note left of Accumulating
        AI Policy Engine:
        Input: CPU Load, Mem Usage
        Output: Reduce or Continue
    end note
```
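The control loop around the policy can be sketched independently of the model itself. Below, `policy_says_reduce` is a hypothetical stand-in for the trained MLP (here a simple accumulator-size threshold, `ACC_LIMIT`, standing in for the memory-pressure signal); all names and parameters are ours.

```python
ACC_LIMIT = 2**64   # stand-in threshold; a real system would query the policy

def policy_says_reduce(acc):
    """Placeholder for the MLP/RL policy decision: reduce when the
    accumulator nears its width limit."""
    return acc >= ACC_LIMIT

def coeff_with_adaptive_reduction(a, b, k, p):
    """Compute c_k, inserting partial reductions whenever the policy fires.
    Partial reduction preserves the result since (x + y) mod p ==
    ((x mod p) + y) mod p."""
    m = len(a)
    acc = 0
    for i in range(max(0, k - m + 1), min(k, m - 1) + 1):
        acc += a[i] * b[k - i]          # accumulate one product
        if policy_says_reduce(acc):
            acc %= p                     # partial reduction keeps acc small
    return acc % p                       # final reduction
```

With a permissive policy this degenerates to the pure "add-first, reduce-later" method; with an aggressive one it approaches the conventional reduce-per-product method, so the scheduler interpolates between the two extremes.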
1.6. The "Inverse" or Failure Mode: Graceful Degradation for Low-Power States
- Derivative Idea: A version of the method designed for power-constrained devices that prioritizes stability over speed. When a low-power state is detected, the system switches to a less memory-intensive, albeit slower, multiplication algorithm.
- Enabling Description: The system continuously monitors the device's battery level or power state. During normal operation (e.g., battery > 20%), it uses the high-performance "add-first, reduce-later" method, which requires a larger RAM footprint for the multi-word `c_k'` accumulators. When the battery level drops below the threshold, the system flags a state change. On the next call to the multiplication function, it dynamically dispatches to the conventional method: `c = (c + (a_i * b_j) mod p) mod p`. This avoids allocating large accumulators, reducing the risk of a low-memory crash, and ensures that critical cryptographic operations can complete successfully, even with degraded performance.
- Mermaid Diagram:

```mermaid
flowchart TD
    A[Start Multiplication] --> B{Check Power State}
    B -- "Normal (>20%)" --> C[Use add-first, reduce-later method]
    B -- "Low (<20%)" --> D[Use conventional reduce-per-product method]
    C --> E{Allocate Multi-word Accumulator}
    E --> F["Compute full sum c_k'"]
    F --> G[Perform single reduction]
    D --> H{Use Single-word Accumulator}
    H --> I["Loop: c = (c + a_i*b_j mod p) mod p"]
    G --> Z[Return Result]
    I --> Z
```
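The dispatch itself is a small wrapper. In this sketch `battery_percent` is a hypothetical sensor reading passed in by the caller; both paths return identical coefficient vectors and differ only in accumulator footprint.

```python
LOW_POWER_THRESHOLD = 20  # percent; illustrative cutoff from the text

def mul_deferred(a, b, p):
    """Fast path: wide accumulators, one reduction per coefficient."""
    m = len(a)
    c = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c[i + j] += a[i] * b[j]          # needs a multi-word accumulator
    return [x % p for x in c]

def mul_reduce_per_product(a, b, p):
    """Fallback path: every value stays below p, single-word storage."""
    m = len(a)
    c = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            c[i + j] = (c[i + j] + a[i] * b[j] % p) % p
    return c

def field_mul(a, b, p, battery_percent):
    """Dispatch on power state: speed when healthy, small footprint when low."""
    if battery_percent > LOW_POWER_THRESHOLD:
        return mul_deferred(a, b, p)
    return mul_reduce_per_product(a, b, p)
```

Because the two paths are result-equivalent, the switch can happen between any two multiplication calls without corrupting an in-progress cryptographic operation.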
Combination Prior Art with Open-Source Standards
2.1. RISC-V Custom ISA Extension
- Scenario: The method is implemented as a set of custom instructions within the open-source RISC-V Instruction Set Architecture (ISA). This creates a hardware-software co-design that optimally executes the claimed method.
- Enabling Description: We propose a non-standard RISC-V extension, "Xocem" (Odd Characteristic Extension Multiplication). It defines the following instructions:
  - `ocem.maccum rd, rs1, rs2`: Takes coefficients from registers `rs1` and `rs2`, multiplies them, and adds the full unreduced product to a special wide accumulator register `rd` (e.g., a 64-bit accumulator on a 32-bit core). This instruction does not perform any modular reduction.
  - `ocem.reduce rd, rs1`: Takes a multi-word value from register pair `rs1`, performs the efficient multi-step reduction `(x_1*c + x_0) mod p`, and writes the single-word result to `rd`.
  - `ocem.setp rs1`: Sets the modulus `p` for the reduction hardware from a value in `rs1`.

An open-source RISC-V core (like Rocket or BOOM) is modified to include the hardware logic for these instructions. A modified GCC or LLVM compiler toolchain is then used to recognize a C intrinsic function, e.g., `__riscv_ocem_multiply()`, and automatically generate the optimal sequence of `ocem.maccum` and `ocem.reduce` instructions, thus combining the patented method with an open standard for processor design.
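A behavioral model of the proposed instructions is useful for testing compiler-generated sequences before committing them to RTL. The Python sketch below is ours: `WORD = 32` models a 32-bit core with a 64-bit accumulator, and we read the fold constant `c` in `(x_1*c + x_0) mod p` as `2^WORD mod p`, where `x_1` and `x_0` are the accumulator's high and low words.

```python
WORD = 32  # modeling assumption: 32-bit core, 2*WORD-bit accumulator

class XocemModel:
    """Behavioral (not cycle-accurate) model of the Xocem instructions."""

    def __init__(self):
        self.acc = 0        # wide accumulator register
        self.p = None       # modulus set by ocem.setp
        self.c = None       # fold constant 2^WORD mod p

    def setp(self, p):
        """ocem.setp: load the modulus and precompute the fold constant."""
        self.p = p
        self.c = (1 << WORD) % p

    def maccum(self, rs1, rs2):
        """ocem.maccum: acc += rs1 * rs2, no reduction, wrap at 2*WORD bits."""
        self.acc = (self.acc + rs1 * rs2) & ((1 << (2 * WORD)) - 1)

    def reduce(self):
        """ocem.reduce: fold the high word into the low word via
        (x_1*c + x_0) mod p, clear the accumulator, return the result."""
        x1, x0 = self.acc >> WORD, self.acc & ((1 << WORD) - 1)
        result = (x1 * self.c + x0) % self.p
        self.acc = 0
        return result
```

The fold is correct because `acc = x_1 * 2^WORD + x_0`, so `acc mod p == (x_1 * (2^WORD mod p) + x_0) mod p`; a verification testbench would check exactly this identity against random inputs.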
2.2. OpenSSL Engine for Embedded Systems
- Scenario: The method is integrated into the widely used OpenSSL cryptographic library via its `ENGINE` API.
- Enabling Description: An OpenSSL `ENGINE` is created, named "ocem_engine". This engine provides alternative implementations for high-level elliptic curve functions, specifically `ECDSA_sign` and `ECDSA_verify`. When the engine is loaded on a system with a processor known to benefit from the "add-first, reduce-later" strategy (e.g., an ARM Cortex-M4 or an Intel 8051 variant), the engine overrides OpenSSL's default bignum arithmetic functions. It replaces the standard field multiplication routines with a new function that implements the claimed method, using multi-precision integer additions to accumulate `c_k'` before a final reduction. The source code for this engine is made publicly available on a repository like GitHub, providing a direct implementation of the method within a major open-source cryptographic framework.
2.3. FIDO2/WebAuthn Open-Source Firmware
- Scenario: The method is embedded into the firmware of an open-source FIDO2/WebAuthn hardware authenticator (e.g., SoloKeys or Nitrokey) to accelerate the cryptographic signature generation.
- Enabling Description: The firmware for a FIDO2 authenticator, which is typically written in C or Rust for an embedded microcontroller (like a Nordic nRF52840 or STM32), is modified. The cryptographic library used for generating the ECDSA signature over the P-256 curve (which operates over a prime field `GF(p)`) is replaced or augmented. When the device must sign the client data hash during an authentication ceremony (the `authenticatorGetAssertion` operation), the underlying scalar multiplication of the private key with the curve's base point relies on the claimed field multiplication method. By accumulating intermediate products and performing a single reduction, the overall latency of signature generation is reduced, providing a faster and more responsive user experience for passwordless login, while being fully compliant with the open WebAuthn and FIDO2 standards.