Unified FHE Accelerator Targets Logic and SIMD Schemes on One Array

Patrick LongaMicrosoft ResearchThursday, June 4, 202615 min read

Minxuan Zhou of the Illinois Institute of Technology argues that fully homomorphic encryption will not become practical through cryptographic schemes alone, because its costs are dominated by ciphertext expansion, polynomial arithmetic, and data movement. In a Microsoft Research talk hosted by Patrick Longa, Zhou presents UFC, a unified FHE accelerator designed to support both logic and SIMD schemes by reducing their workloads to shared low-level primitives rather than building separate scheme-specific pipelines. The case for UFC is that hybrid FHE applications need both styles of computation, and that a common hardware substrate, NTT-centered interconnect, near-memory support, and compiler scheduling can outperform or avoid the inefficiencies of split accelerators.

FHE’s practical problem is not cryptographic promise; it is data movement, representation, and arithmetic cost

Minxuan Zhou frames fully homomorphic encryption as a cryptography framework that lets computation run directly on encrypted data without decrypting it. The source description places FHE in the context of post-quantum privacy-preserving computing; Zhou’s talk itself focuses on the hardware cost of making that privacy-preserving computation practical. In the client-cloud model he uses, a client sends encrypted sensitive data to the cloud, the cloud computes on that encrypted input — possibly together with its own data or other encrypted data — and returns an encrypted output. The attraction is end-to-end privacy for applications such as machine learning, biomedical computing, and databases.

The cost begins at representation. Zhou’s first example is a learning-with-error, or LWE, ciphertext. Encrypting a single value becomes a vector with hundreds of coefficients. In the slide’s representative setting, an LWE ciphertext has $a \in Z_{q}^{n}$ and $b \in Z_{q}$ , with $lo g (q) = 32$ and $n = 500$ . Zhou describes this as more than a 500-times expansion in data size for a single value.

>500x

data-size expansion in Zhou’s LWE single-value example

Ring-LWE, or RLWE, is more efficient for vectors because it can encrypt a vector of values into one ciphertext rather than encrypting each value separately. But it does not remove the underlying hardware problem. In Zhou’s representative RLWE example, a ciphertext contains two polynomials in $Z_{q} [x] / (X^{N} + 1)$ . The degree can be 64K and coefficients can be around 1,000 bits. The design problem becomes one of operating on very large integer vectors, not scalar plaintext values.

Computation also changes form. A multiplication on encrypted data may become polynomial multiplication. The naive polynomial multiplication cost is $O (N^{2})$ . FHE implementations commonly use the number theoretic transform, or NTT, to convert the polynomial to a form where multiplication is element-wise: an $O (N lo g N)$ transform, followed by $O (N)$ multiplication. That reduces algorithmic complexity but imposes a hardware obligation: the accelerator must support NTT-style transforms, modular arithmetic, and the movement patterns those transforms require.

The coefficient sizes introduce another layer. Because coefficients may exceed 1,000 bits, practical systems use a residue number system, decomposing large-integer polynomial coefficients into multiple smaller-modulus polynomials. This changes the processing pattern: the accelerator must support RNS-based polynomial operations, not merely large integer arithmetic in the abstract.

The resulting slowdown is large enough that hardware acceleration becomes a practical necessity. Zhou shows recent reported FHE results for neural-network workloads: convolutional-network results reproduced from Orion and a BERT result extracted from MOAI. Even on comparatively small models, the FHE latencies are in the range of hundreds to tens of thousands of seconds, while plaintext CPU execution would be on the order of tens of milliseconds.

Challenge	Representative detail from Zhou’s examples	Hardware consequence
Ciphertext expansion	LWE encrypts one value into hundreds of coefficients; Zhou gives n=500	Large memory footprint and bandwidth pressure
RLWE polynomial size	Two high-degree polynomials; Zhou gives N=65,536 and ~1,000-bit coefficients	Wide modular arithmetic over very large vectors
Polynomial multiplication	Naive O(N²) multiplication replaced by NTT-based O(N log N) transform plus O(N) element-wise multiplication	NTT units and shuffle networks become central
RNS decomposition	Large coefficients decomposed into smaller-modulus polynomial representations	Additional algorithmic and data-layout overhead

Zhou’s hardware motivation for FHE acceleration is rooted in ciphertext size, polynomial arithmetic, and data movement.

Logic and SIMD FHE solve different parts of the application, which is why hybrid execution matters

Zhou divides the FHE world, at a high level, into two families: logic schemes and SIMD schemes. The division matters because each family is efficient for different computations and carries different hardware needs.

Logic FHE schemes such as TFHE and FHEW use LWE ciphertexts as the basic representation. They encrypt a single value into a single ciphertext. Their advantage is that they can evaluate arbitrary functions on that value, including non-linear functions through lookup tables. ReLU, sign, and compare are examples of operations well matched to this style. Logic schemes also have relatively small ciphertexts. Their limitations are throughput and precision: each ciphertext carries one value, and high-bit-precision data can become costly.

SIMD FHE schemes such as BGV, BFV, and CKKS use RLWE ciphertexts and pack many values into one encrypted vector. Their advantage is high-throughput vector arithmetic, including high-bit-precision data. Their limitation is functional expressiveness: Zhou characterizes SIMD schemes as supporting arithmetic operations such as addition and multiplication, along with vector rotation, but not arbitrary functions in the same direct way. Their ciphertexts are also large.

Scheme family	Examples named by Zhou	Basic ciphertext	Strength	Limitation
Logic FHE	TFHE, FHEW	LWE; single value	Arbitrary function on a single value; smaller ciphertext	Limited throughput; costly for high-bit precision
SIMD FHE	BGV, BFV, CKKS	RLWE; vector	High-throughput arithmetic; supports high-bit precision	Supports add/mul and rotation rather than arbitrary functions directly; large ciphertext

Zhou’s logic-versus-SIMD distinction sets up the case for hybrid-scheme FHE.

A privacy-preserving machine-learning workload may need both: arithmetic-intensive vector operations and non-linear operations. The SIMD side can handle vector multiplications, additions, rotations, and accumulations. The logic side can handle non-linear operations such as ReLU, sign, and comparison.

The obstacle is that the schemes do not share the same ciphertext granularity. SIMD uses packed RLWE vectors. Logic uses single-value LWE ciphertexts. Moving between them requires scheme switching.

From SIMD to logic, Zhou describes an extraction process: reduce the modulus of the RLWE ciphertext, perform a slot-to-coefficient conversion, reorganize coefficients into orders that produce different LWE ciphertexts, and then apply LWE key switching to bring those ciphertexts into the parameter setting used by the logic scheme.

The inverse process is repacking. Multiple LWE ciphertexts are packed back into one RLWE ciphertext through linear transformations with repacking keys, followed by coefficient-to-slot transformation and bootstrapping to recover levels in the modulus tower. These conversions are still operations over the same LWE and RLWE data structures, mostly polynomial and vector operations.

That is the hinge for hybrid FHE workloads. If extraction and repacking required a separate class of computation, a unified accelerator would be much harder to justify. Zhou’s argument is narrower and more concrete: because scheme switching can be decomposed into operations over the same polynomial and vector structures, it becomes plausible to support hybrid execution by identifying shared low-level primitives and scheduling them well. The hardware question is therefore not whether logic and SIMD schemes are identical. They are not. The question is whether their differences can be absorbed by a common substrate plus limited support for the cases that do not fit.

Is it possible to support both schemes in the same hardware?

Minxuan Zhou

Scheme-specific accelerators waste hardware when the application crosses scheme boundaries

The answer is not obvious because prior FHE accelerators have been designed scheme by scheme. Their pipelines are customized around the operation pattern of a target scheme.

For logic FHE accelerators, Zhou describes a pipeline with components such as a decomposition unit, NTT unit, element-wise unit, key-switch unit, and memory. The hardware is hard-wired so that the output of one unit flows into the next, maximizing utilization for the logic-scheme pipeline. For SIMD accelerators, the customized blocks differ: base conversion, NTT, element-wise operations, automorphism, and memory. The same pipeline strategy works when the workload matches the scheme, but it becomes a liability when asked to support another scheme. A unit may be underutilized, or a needed operation may not be present in that accelerator’s pipeline.

Even common units are not automatically reusable. NTT is central in both logic and SIMD schemes, but the polynomial-degree ranges differ. Zhou cites CKKS as commonly using polynomial degrees from 8K to 64K, while TFHE commonly uses 512 to 16K. Existing accelerators tune their NTT units for those ranges. A CKKS accelerator such as SHARP, in Zhou’s example, reaches maximum NTT utilization at 64K-degree polynomials; a TFHE accelerator such as Strix reaches maximum utilization at 16K-degree polynomials.

When those parameter ranges cross, utilization falls or functionality may be missing for the example under discussion. Zhou says using a CKKS accelerator for a 512-degree polynomial can drive utilization down to around 50%. In the other direction, the TFHE accelerator in his example is designed for a smaller parameter range and does not support the larger-degree polynomial case he uses to illustrate CKKS-style needs.

Prior accelerator type	Example baseline named by Zhou	Tuned for	Issue when reused
CKKS accelerator	SHARP [Kim ISCA ’23]	Large CKKS polynomial degrees; maximum NTT utilization at 64K	Underutilization on small TFHE-like polynomials in Zhou’s example
TFHE accelerator	Strix [Putra MICRO ’23]	TFHE polynomial range; maximum NTT utilization at 16K	Lacks support for the larger-degree polynomial case Zhou uses to illustrate CKKS-style parameters

Zhou’s critique of scheme-specific accelerators is about both pipeline specialization and parameter specialization.

This motivates UFC, the unified FHE accelerator presented as collaborative work among UCSD, Intel Labs, and Illinois Institute of Technology, published at MICRO 2024. The design goal is not to add every high-level operation from every scheme as a specialized block. It is to identify low-level primitives shared across schemes, build a high-throughput array around them, and use algorithm, hardware, and compiler choices to keep utilization high across parameter sizes.

UFC reduces the accelerator to common primitives, then spends the hardware budget on throughput

Minxuan Zhou describes UFC as a 2D array of general processing elements rather than a sequence of scheme-specific pipeline stages. In the configuration he presents, the array has 64 processing elements arranged as 8 by 8. Each PE has 256 lanes, producing 16K lanes per chip for primitive operations.

16K

processing lanes in Zhou’s UFC configuration: 64 PEs × 256 lanes

The PE lanes support primitives such as multiplication, addition, digit decomposition, and butterfly operations. The primitive set comes from breaking down CKKS and TFHE operations. CKKS requires operations including NTT and inverse NTT, element-wise modular addition, element-wise modular multiplication, and automorphism. TFHE requires NTT and inverse NTT, element-wise modular arithmetic, vector rotation, vector reduction, and digit decomposition.

Most of the computation reduces to element-wise modular arithmetic: multiplication, addition, subtraction, negation, and butterfly patterns for NTT. Reduction is the exception, and UFC offloads it to lightweight near-memory support rather than building more specialized compute into each PE.

The memory-side units are intentionally limited. UFC includes lightweight near-memory support near the HBM controllers for operations such as data packing, addressing, memory reads and writes with shuffle, polynomial fetch/store, and LWE fetch/store. These support operations that do not need the full 2D compute array.

The third design pillar is compiler-level scheduling. Since a unified accelerator has many resources and must serve programs with different parameter sizes, the compiler must schedule operations so that small polynomials do not leave most lanes idle. Zhou later returns to this problem in the TFHE case, where polynomial degrees may be much smaller than the hardware’s natural 16K-wide throughput.

The hardest part is not arithmetic; it is the shuffle network

Once the computation is reduced to common primitives, data movement becomes the central cost. Many FHE operations require not only modular arithmetic but also reordering, rotating, or reorganizing data. Zhou lists NTT, automorphism, rotate, and extract as operations with significant movement or shuffling patterns.

A naive unified design would need to connect processing elements through enough networks to support all those shuffle patterns. Zhou argues that this would be prohibitively expensive to scale with PE throughput. UFC’s main algorithm-hardware co-design move is therefore to reduce the number and complexity of interconnect patterns the hardware must support.

The first technique is constant-geometry NTT. Classical NTT has different data exchange patterns at different stages: the stride and pairing pattern changes as the transform progresses. Constant-geometry NTT is algorithmically equivalent, according to Zhou, but uses the same shuffle pattern across stages. That is attractive for a 2D PE array because the hardware can implement one repeated pattern rather than many stage-specific ones.

But even a constant-geometry NTT network is still expensive when routed naively across a 2D array. Zhou shows a 4-by-4 example in which wires interleave heavily across rows and columns, creating large area overhead in a real chip layout. His team therefore applies permutation decomposition, citing Miel’s 1993 work on constant-geometry fast Fourier transforms on array processors. The constant-geometry permutation is decomposed into three phases: x-axis shuffling, y-axis shuffling, and register shuffling. The first two can be replicated as simpler 1D patterns across rows and columns; the third is local movement inside each PE’s registers.

This does not merely optimize the NTT. UFC then uses the NTT network as the substrate for other data movements, avoiding specialized hardware for each operation where possible.

For automorphism in SIMD schemes, prior accelerators may use specialized automorphism hardware. UFC instead transforms automorphism into an NTT process by preparing, offline, special twiddle factors corresponding to a given rotation. During online execution, the accelerator uses those prepared twiddle factors through the NTT process to produce the required automorphism result. The hardware reuses the NTT network instead of adding an automorphism-specific interconnect.

For polynomial rotation in TFHE, Zhou makes a similar substitution. A rotation of polynomial coefficients can be reinterpreted as multiplication by a monomial such as $x^{a}$ . Since polynomial multiplication can be supported through NTT operations, the rotation no longer requires its own specialized movement network.

Extraction is handled differently. Zhou says coefficient extraction can rely on off-chip memory-controller support because it is difficult and does not need the same high throughput as the main compute array.

The net design principle is clear: keep the high-throughput PE array focused on common arithmetic and one carefully engineered movement pattern; bend other operations toward that pattern through algorithmic transformations or move lower-throughput support work near memory.

Small polynomials are packed by layout and scheduling, not by changing the hardware

UFC’s 16K lanes make high utilization straightforward when the polynomial degree is at least 16K. CKKS workloads often fit that scale. TFHE creates a harder case because Zhou gives its polynomial-degree range as roughly 512 to 16K. A unified accelerator with 16K lanes risks wasting most of the array on small polynomials unless multiple small operations can be packed together.

The first answer is data layout. In Zhou’s simplified example, the hardware supports a 16-degree polynomial but the workload has two 8-degree polynomials. With a continuous layout, the first polynomial occupies the first two PEs and the second occupies the next two. As the constant-geometry network performs successive stages, coefficients are shuffled. The shuffled layouts remain usable by subsequent stages without hardware modification. At the end, the layout may be interleaved, but inverse NTT can transform it back to the original continuous layout. In other words, the constant-geometry network can support multiple small polynomials intrinsically if the compiler uses the right layout.

The second answer is parallelism selection. In TFHE bootstrapping, Zhou identifies at least two kinds of parallelism: ciphertext parallelism and polynomial parallelism. Ciphertext parallelism schedules the same operation across different ciphertexts. Polynomial parallelism schedules parallel work within a single bootstrapping computation.

UFC prioritizes ciphertext parallelism. Zhou says it is more efficient because multiple ciphertexts can share the same bootstrapping key, lowering memory bandwidth requirements. Polynomial-level parallelism would reload the bootstrapping key multiple times. Therefore the compiler aggressively finds different TFHE ciphertexts and schedules the same operation from multiple ciphertexts together on the accelerator.

This is an important part of the argument because it shows the unified design is not just hardware. UFC depends on compiler knowledge of FHE program structure, ciphertext independence, data layout, and memory reuse.

The reported evaluation shows modest CKKS gains, larger TFHE gains, and hybrid gains from avoiding a split system

Zhou presents UFC’s hardware evaluation as synthesized and optimized for a commercial technology, then projected to 1 GHz and scaled to 7 nm using publicly available technology. The reported values are projections, not representations of any commercial technology. Large-scale application evaluation uses a trace-based cycle-accurate simulator, with traces generated from an FHE library.

The workloads include CKKS logistic regression, ResNet-20, sorting, and bootstrapping; TFHE bootstrapping and Zama neural-network workloads; and a hybrid k-nearest-neighbor search workload using CKKS for distance calculation and TFHE for comparison.

The presented UFC configuration has an area of about 300 mm² and power of about 170.8 W. The slide’s area breakdown is: processing elements 20%, Int Network 5%, Local Interconnect 6%, Scratchpad 30%, and Global Interconnect 25%. Zhou characterizes the design as balanced across functional units, interconnect, and on-chip memory; the slide itself reports those categories separately rather than collapsing them into a single interconnect figure.

Evaluation item	Value or description from Zhou
Projection	1 GHz, scaled to 7 nm
Design status	Synthesized and optimized for a commercial technology; projected values not representing a commercial technology
Simulator	Trace-based cycle-accurate simulator
Security setting	FHE parameters at 128-bit security
UFC area	~300 mm²
UFC power	~170.8 W
Area breakdown	Processing Elements 20%; Int Network 5%; Local Interconnect 6%; Scratchpad 30%; Global Interconnect 25%
Scratchpad configuration discussed	64 MB reported in the Q&A; design space explored from 64 MB to 256 MB

Zhou’s reported experimental setup and hardware configuration for UFC.

Against the CKKS-specific SHARP baseline, UFC is reported as 1.1 times faster and 1.4 times lower energy, with a 1.5 times improvement in energy-delay-area product. Zhou attributes the gain to high computation throughput and better utilization from focusing hardware on critical primitives. The throughput is broadly comparable to SHARP in lane count, but UFC’s 2D array provides higher bisection bandwidth, contributing to better performance.

Against the TFHE-specific Strix baseline, the reported gain is larger: 6 times speedup, 1.2 times less energy, and 1.5 times lower EDAP. Zhou says one reason is that UFC has 4.6 times more compute throughput than Strix. He also argues the unified design still achieves high utilization despite not being TFHE-specific.

For the hybrid workload, Zhou compares UFC to a heterogeneous system with both Strix and SHARP connected by PCIe 5.0. UFC reports 1.02 times to 3.7 times speedup, with larger gains in large TFHE parameter settings, plus 3.1 times average energy-delay product improvement and 3.7 times average EDAP improvement.

Comparison	Baseline	Reported UFC result
CKKS workloads	SHARP [Kim ISCA ’23]	1.1x faster, 1.4x lower energy, 1.5x EDAP improvement
TFHE workloads	Strix [Putra MICRO ’23]	6x speedup, 1.2x less energy, 1.5x lower EDAP
Hybrid workload	Strix + SHARP over PCIe 5.0	1.02x–3.7x speedup, 3.1x EDP and 3.7x EDAP improvement on average

The performance and efficiency comparisons Zhou reports for UFC.

The headline performance numbers are not uniform across workloads: CKKS gains are modest relative to a CKKS-specific accelerator, TFHE gains are larger relative to Strix, and hybrid gains come from avoiding a split accelerator system and its scheme-crossing inefficiencies.

The memory system is software-managed and distributed across processing elements

In the recorded Q&A, an audience member asks about UFC’s memory hierarchy, specifically whether the scratchpad is like an L1 cache. Zhou answers that it is not a cache. It is a software-managed distributed scratchpad: each processing element has its own scratchpad, managed by software or by the compiler.

For the reported 64 MB scratchpad configuration, Zhou says the 64 PEs each have 1 MB. He also says the team explored scratchpad sizes ranging from 64 MB to 256 MB.

Asked about coherency between processing elements, Zhou does not spell out cache-coherence semantics. He answers by describing HBM connectivity and the intended data movement pattern. Each HBM pseudo-channel connects to a scratchpad. With two HBM stacks and 64 pseudo-channels, data can be moved simultaneously to all PEs. Zhou’s explanation is tied to FHE’s data shape: the data is usually very large, and the accelerator manipulates whole polynomials, so it moves data across channels into all scratchpads at the same time.

That answer reinforces a recurring theme in the talk. UFC is not built around general-purpose cache behavior. It assumes structured, compiler-visible FHE data movement across large polynomial objects.

Zhou’s current work extends the same problem into compilers and FHE systems

Patrick Longa asks what Zhou is working on now, since UFC was presented earlier. Zhou says his focus has shifted more toward compiler-side optimization on different hardware targets.

One ongoing direction is work with Google on the HEIR compiler. Zhou describes the goal as a unified optimization framework that can target different hardware configurations, including the UFC accelerator, TPU, and GPU hardware. The compiler problem is hardware-aware offloading: how to map different FHE applications to different hardware targets while optimizing performance. UFC remains one target of that work, but the optimization effort is expanding across hardware platforms.

Zhou also mentions work on making FHE more efficient for other applications. He says his group recently submitted a SIGMOD paper on a new FHE-plus-MPC algorithm for database applications, and they plan to support database applications in the compiler infrastructure, especially for large-scale data. He characterizes that direction as more systems-oriented optimization.

Longa also asks whether the UFC design shares similarities with, or serves as the basis of, Intel’s HERACLES. Zhou declines to discuss non-public details, saying he was told not to talk about HERACLES after leaving Intel Labs. He limits the comparison to what he presents as public information: the HERACLES ISCA paper also has a 2D array, while UFC and HERACLES have different design targets. Zhou says HERACLES was not designed as a unified accelerator for different schemes. He also notes a public interconnect distinction as he understands it: HERACLES claims a mesh-based network, while UFC’s network is designed specifically to support the constant-geometry NTT pattern. He characterizes HERACLES as an industry accelerator with more extensive evaluation and chip/tapeout considerations, whereas UFC is a research paper evaluated post-synthesis.

AI Research Methods AI Security AI Infrastructure and Compute