# Snap Cut Experimentation Job Costs 76% With GPU-Accelerated Spark
Prudhvi Vatala, Snap’s head of engineering platforms, argues that the company’s 10-plus-petabyte daily experimentation pipeline became a cost and scale problem that could not be solved by adding more CPUs. In an NVIDIA AI Podcast interview, he says Snap cut job costs by 76% by moving Spark workloads to NVIDIA GPU-accelerated infrastructure on Google Cloud, reusing idle inference GPUs overnight, and doing so without application code changes.

## Snap’s experimentation pipeline became a 10-petabyte-a-day constraint
Prudhvi Vatala puts the scale of Snap’s experimentation platform at more than 10 petabytes of data each day. That figure is not Snap’s entire big-data estate; it is only the data behind the company’s product experimentation system.
The constraint is not just volume. Experimentation results have a strict morning SLA because developers, product managers, and data scientists need them early enough to make product decisions. For his team, “accelerating data processing” meant finding a way to keep those pipelines fast and affordable without simply adding more CPUs as the user base and feature surface grew.
Snap, in Vatala’s description, operates at the intersection of augmented reality, AI, and visual communication, serving close to a billion monthly active users. His engineering platforms organization covers big-data infrastructure, developer productivity, enterprise AI, and related platform work. The experimentation platform sits inside that broader mandate because product changes at Snap are evaluated through statistical systems rather than shipped only on intuition.
That role matters because Snap’s product-development philosophy, as Vatala frames it, rests on experimentation, safety, and privacy. When the company ships features to what he describes as half a billion daily active users globally, it needs to know whether users are receiving those features as intended, whether they are adding value, and whether they are harming performance or device experience.
A/B testing is the mechanism for that evaluation. Vatala describes randomized controlled trials as a long-standing idea, especially in clinical fields, that became a way to bring statistical rigor to digital decision-making. At Snap’s scale, that rigor includes holdouts, well-defined controls, and statistical methods that his team has added to the platform over time.
The methods are not decorative. Heterogeneous treatment effect detection is used because a feature can look successful for the global audience while performing poorly for a subset of users. Variance reduction matters because, even with randomization, power users can end up disproportionately represented on one side of an experiment; Snap works to even out distributions when results are read. Sample-size mismatch is another signal: if users dislike a feature enough to stop showing up, the experiment can become imbalanced in a way that has to be measured.
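That last signal is commonly formalized as a sample-ratio-mismatch (SRM) check. The transcript does not detail Snap’s implementation, but a minimal sketch of the idea, using a chi-square goodness-of-fit test from scipy, looks like this (the threshold and counts are illustrative, not Snap’s):

```python
from scipy.stats import chisquare

def srm_check(control_users: int, treatment_users: int,
              expected_split: float = 0.5, alpha: float = 0.001) -> bool:
    """Flag a sample-ratio mismatch: has the observed control/treatment
    split drifted improbably far from the intended allocation?"""
    total = control_users + treatment_users
    expected = [total * expected_split, total * (1 - expected_split)]
    _, p_value = chisquare([control_users, treatment_users], f_exp=expected)
    return p_value < alpha  # True: the experiment looks imbalanced

# A 50/50 experiment where the treatment arm is quietly losing users:
print(srm_check(1_000_000, 985_000))  # True -> investigate before trusting results
```

A low p-value here does not say which arm is wrong, only that the split itself can no longer be trusted, which is exactly the imbalance Vatala describes when users dislike a feature enough to stop showing up.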
The acceleration project therefore began from a specific operational problem. The platform was not merely running large Spark jobs. It was supporting a product-development process where stale or incomplete experiment data could delay decisions, and where statistical safeguards were part of the product quality system.
## The first appeal of GPU acceleration was that Spark jobs did not have to be rewritten
Snap’s interest in NVIDIA’s Spark acceleration began after the team saw a blog post about NVIDIA Spark RAPIDS. Vatala uses the Spark RAPIDS name in the discussion; the source description identifies the technology as the NVIDIA cuDF plugin for Apache Spark, formerly referred to as the NVIDIA RAPIDS plugin. The published numbers looked strong enough to justify testing: performance claims of around 3.6x speedups and roughly 50% runtime reductions. Snap’s experimentation stack was already on Google Cloud, and Vatala describes Google Cloud Dataproc as a strong partner throughout the platform’s scaling.
The benchmark work was not treated as a single yes-or-no test. Snap’s jobs were complex enough that the team evaluated different workload shapes separately: jobs heavy on joins, repartitions, and shuffles; jobs that primarily union data from multiple places; and jobs that mostly aggregate values, such as sums.
The results varied by workload type. Join-heavy jobs saw more than 3x improvement. Union-heavy jobs saw close to 2x. Aggregation jobs saw a little over 1.5x, which Vatala attributes partly to CPUs already being good at aggregation. GPUs were a stronger fit where parallelism and high-bandwidth memory could be brought to bear on data movement and join-heavy processing.
| Workload type | Reported improvement | Vatala’s explanation |
|---|---|---|
| Join-heavy jobs | More than 3x | Benefited from GPU parallelism and high-bandwidth memory, especially where data movement and repartitioning were significant. |
| Union-heavy jobs | Close to 2x | Showed material acceleration, though less than join-heavy workloads. |
| Aggregation-heavy jobs | A little over 1.5x | Improved less because CPUs were already effective for aggregations. |
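The transcript does not include Snap’s job code, but the three shapes are easy to caricature in PySpark; the paths, table names, and columns below are hypothetical stand-ins, not Snap’s schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("workload-shapes").getOrCreate()

# Hypothetical inputs standing in for experimentation data.
events = spark.read.parquet("gs://example/events/")        # assumed path
assignments = spark.read.parquet("gs://example/assign/")   # assumed path
hourly = [spark.read.parquet(f"gs://example/hour={h}/") for h in range(24)]

# 1. Join-heavy shape: shuffles and repartitions dominate; >3x reported gain.
joined = events.repartition("user_id").join(assignments, "user_id")

# 2. Union-heavy shape: stitching many partitions together; ~2x reported gain.
unioned = hourly[0]
for part in hourly[1:]:
    unioned = unioned.unionByName(part)

# 3. Aggregation-heavy shape: grouped sums; ~1.5x, since CPUs already do this well.
totals = joined.groupBy("experiment_id").agg(F.sum("metric_value").alias("total"))
```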
The decisive platform feature was not only the speedup. It was that the Spark RAPIDS technology discussed by Vatala did not require application code changes.
> “We didn’t have to change a single thing about how we ran the jobs. That was the beauty of it. No code changes.”
For an organization responsible for developer productivity and enablement, that mattered. Dataproc abstracted the Spark runtime, while the NVIDIA Spark acceleration layer could accelerate the jobs without forcing teams to rewrite their Spark applications. Vatala describes that combination as “phenomenal” because it lowered the adoption burden for production workloads.
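That “no code changes” property follows from how the accelerator is wired in: it is enabled through Spark configuration rather than application edits. A minimal sketch using the documented plugin settings, with illustrative values rather than Snap’s production tuning:

```python
from pyspark.sql import SparkSession

# Illustrative configuration, not Snap's production settings. The RAPIDS
# Accelerator jar must also be on the cluster classpath (e.g., via spark.jars).
spark = (
    SparkSession.builder
    .appName("experimentation-shard")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # load the plugin
    .config("spark.rapids.sql.enabled", "true")             # turn acceleration on
    .config("spark.executor.resource.gpu.amount", "1")      # one GPU per executor
    .config("spark.task.resource.gpu.amount", "0.25")       # four tasks share it
    .getOrCreate()
)

# The application itself is untouched: the same DataFrame/SQL code now runs
# on GPUs where the plugin supports the operators, and on CPUs otherwise.
```

Flipping `spark.rapids.sql.enabled` back to `false` returns the same job to the CPU path, a property that becomes relevant in the fallback design described later.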
Snap first migrated one shard of the experimentation architecture to production on Google Dataproc using 300 GPUs. The results were strong enough that the team planned a next phase: moving 10 shards out of a total architecture of more than 50 shards. That required about 3,000 GPUs, which was still feasible with on-demand Dataproc GPUs.
But that approach did not scale indefinitely. Snap eventually hit a roadblock with on-demand GPU capacity. The project then shifted from a straightforward migration to a capacity-reuse problem: where inside Snap did GPU capacity already exist that could be borrowed?
## The capacity breakthrough came from using idle inference GPUs overnight
The insight was that Snap’s audience is global but usage still follows daily cycles. Snapchatters wake up, use Snapchat, and go to bed; when major markets went to sleep, some of Snap’s online inference GPU capacity sat idle. Vatala identifies roughly 1 a.m. to 5 a.m. Pacific time as the window when that opportunity emerged.
That spare capacity, however, belonged to a different kind of system. Online serving GPUs were tied to Kubernetes and Google Kubernetes Engine. Batch data processing was not what that serving stack had been built to do. To use that capacity, Snap had to migrate workloads to a Kubernetes-based Spark runtime hosted on GKE.
The goal also expanded. It would have been one thing for Vatala’s team to opportunistically use idle GPUs for experimentation. Snap instead wanted a platform that could let any team use available GPU capacity for its own needs, while recognizing that user-facing serving had priority. If a user needed fresh Spotlight content, that would supersede experimentation GPU demand.
That meant preemption had to be designed into the platform. If traffic spiked, batch jobs needed to give capacity back to online serving. Vatala describes this as one of the core complexities of building the platform “ground up”: idle capacity was useful only if the system could respect the hierarchy between user-facing inference and back-office data work.
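The transcript does not specify Snap’s scheduling configuration, but on GKE that hierarchy maps naturally onto Kubernetes PriorityClass objects, where higher-priority pods can preempt lower-priority ones. A hypothetical sketch with the kubernetes Python client; the names and values are invented for illustration:

```python
from kubernetes import client

# Hypothetical PriorityClass pair expressing the stated hierarchy: user-facing
# inference outranks batch Spark, so batch pods are preempted when traffic spikes.
def priority_class(name: str, value: int, description: str) -> client.V1PriorityClass:
    return client.V1PriorityClass(
        api_version="scheduling.k8s.io/v1",
        kind="PriorityClass",
        metadata=client.V1ObjectMeta(name=name),
        value=value,
        preemption_policy="PreemptLowerPriority",
        description=description,
    )

online = priority_class("online-inference", 1_000_000, "user-facing serving")
batch = priority_class("batch-spark", 1_000, "preemptible experimentation jobs")
# client.SchedulingV1Api().create_priority_class(online)  # applied via the API
```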
The first production migration on the GKE-based, NVIDIA-accelerated Spark platform targeted Snap’s hourly pipeline. The experimentation workloads are split into two broad cadences. Hourly pipelines are used mainly for guardrailing: they help Snap avoid breaking user experience and provide rapid feedback when product changes are live. Daily pipelines serve as the statistical authority for decision-making.
That ordering reflected the value of speed. Hourly guardrails are where faster feedback matters most, because they can detect regressions while a rollout is still in motion. But making the hourly pipeline reliable on shared GPU infrastructure required fallbacks. Snap had to handle cases where GPU capacity was unavailable, such as during peak daytime activity. In that case, the system needed to fall back gracefully from GPUs to CPUs. If shared GKE resources themselves were constrained, it needed another fallback from CPUs to Dataproc clusters.
In Vatala’s account, the operational design had to span those failure modes: use borrowed GPUs when available, fall back to CPUs when GPU capacity is not available, and fall back again to Dataproc clusters when shared GKE resources are constrained.
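The transcript stays at the level of that description, but the selection logic it implies is a simple ordered probe. A hypothetical sketch, with the live capacity checks reduced to flags for brevity:

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    available: bool  # in practice a live capacity/quota check, not a flag

def pick_runtime(gpu_pool: Target, cpu_pool: Target, dataproc: Target) -> Target:
    """Hypothetical selection mirroring the described fallback chain:
    borrowed inference GPUs first, shared GKE CPUs next, Dataproc last."""
    for target in (gpu_pool, cpu_pool, dataproc):
        if target.available:
            return target
    raise RuntimeError("no runtime available; retry after backoff")

# Daytime peak: inference traffic has reclaimed the GPUs, GKE CPUs are tight.
chosen = pick_runtime(
    Target("gke-idle-inference-gpus", available=False),
    Target("gke-shared-cpus", available=False),
    Target("dataproc-cluster", available=True),
)
print(chosen.name)  # dataproc-cluster
```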
## The production impact was cost, cores, memory, and spill
Prudhvi Vatala gives the migration’s business and engineering impact in four main numbers: 76% lower job costs, 62% fewer cores, 80% lower memory footprint, and roughly 120 terabytes of disk and memory spill eliminated from the pipelines.
He emphasizes the spill reduction for Spark practitioners because spill, where intermediate data overflows executor memory and is written out to disk, is one of the recurring pain points in large data pipelines. In his phrasing, the spill “just vanished” once the migrated architecture was in place.
| Metric | Change |
|---|---|
| Job costs | Down about 76% |
| Required cores | Down about 62% |
| Memory footprint | Down about 80% |
| Disk and memory spill | About 120 TB eliminated |
The timeline also matters. Snap moved a 10-plus-petabyte production pipeline from prototyping and exploration to full production in about eight to nine months. Vatala attributes that speed to the three-way partnership among Snap, NVIDIA, and Google Cloud, describing the back-and-forth knowledge sharing among the companies as essential to completing the migration on that timescale.
The technical enablers were not only the GPU runtime. Vatala also mentions a tool the transcript renders as “NVIDIA Ether,” likely NVIDIA’s Project Aether, describing it as a solution that provides Spark tuning out of the box. In his account, that became important because Snap’s fallback system crossed multiple environments: GPUs, CPUs, and Dataproc. Different environments required different Spark parameters, and the tool gave the team a starting point that helped keep tuning consistent across those environments.
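Whatever the tool’s exact name, the underlying problem is concrete: one logical job, three runtimes, three parameter sets. A hypothetical sketch of keeping them consistent with a shared baseline plus per-environment overrides; the keys and values are illustrative, not the tool’s output:

```python
# Hypothetical per-environment overlays on a shared Spark baseline.
BASELINE = {
    "spark.sql.shuffle.partitions": "2000",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

OVERRIDES = {
    "gpu": {"spark.plugins": "com.nvidia.spark.SQLPlugin",
            "spark.rapids.sql.enabled": "true"},
    "cpu": {"spark.rapids.sql.enabled": "false"},
    "dataproc": {"spark.dynamicAllocation.enabled": "true"},
}

def spark_conf(environment: str) -> dict:
    """One baseline, one override per runtime, so the fallback chain never
    launches a job with parameters tuned for a different environment."""
    return {**BASELINE, **OVERRIDES[environment]}

print(spark_conf("gpu")["spark.rapids.sql.enabled"])  # true
```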
Zero application code changes still left substantial platform engineering work. Snap had to solve image building, environment differences, testing cycles, and production rollout. Any production workload, Vatala notes, still needs a rigorous rollout. The claim is narrower and more useful: the NVIDIA Spark acceleration lowered the application-migration burden, while the platform work made the accelerated runtime operationally viable inside Snap.
## The platform changed how Snap thinks about internal GPU capacity
The GPU project began with one large, urgent workload: experimentation data processing. Prudhvi Vatala says it has since affected Snap’s roadmap because the team built a bottom-up data platform rather than a one-off migration path. The platform is meant to let teams across Snap use available GPU capacity and NVIDIA libraries where they make sense.
That has already changed behavior inside Vatala’s own organization. His team has begun migrating and experimenting with additional workloads that were not part of the original scope. Idle capacity cannot cover every job all the time; the opportunity, as he describes it, is in scheduling work creatively and moving workloads around to maximize use of whatever capacity is free.
The broader lesson Vatala offers for teams attempting similar migrations is that unused inference capacity can be valuable, but only when the production realities are handled. Snap’s online workloads remain primary. Batch jobs must be prepared to lose access to GPUs, run elsewhere, and maintain reliability across those transitions.
Vatala ties that platform burden to Snap’s broader product expansion. During his nearly eight years at the company, he has watched Snap grow from camera and picture messaging into augmented reality, Stories, AI, and visual communication. Running internal platforms means enabling that product evolution as new features and user behaviors add dimensions to the data infrastructure.
Whether the next workload comes from AR, AI, Spotlight, or another product area, his platform organization has to make the underlying systems scale. In that sense, the experimentation migration was both an immediate cost-and-performance project and, in Vatala’s telling, part of a broader effort to let teams at Snap make use of available GPU capacity.