RunPod’s Serverless LLM Endpoint Trades Cold Starts for Lower Idle Cost

Audry Hsu · AI Engineer · Sunday, June 7, 2026 · 6 min read

Audry Hsu presents RunPod as a cloud AI infrastructure company trying to move GPU provisioning and operations behind a deployable model endpoint. In the walkthrough, she shows a Qwen model deployed from RunPod’s Hub as an OpenAI-compatible vLLM serverless endpoint on H100s in under five minutes, with billing tied to workers while they handle requests. Her case is narrower than eliminating infrastructure tradeoffs: the first request waited 41.6 seconds on cold start, while subsequent execution took about 1.5 seconds, leaving teams to choose between lower idle cost and keeping workers warm for lower latency.

RunPod’s pitch is that GPU infrastructure should disappear behind the model endpoint

Audry Hsu described RunPod as a cloud AI infrastructure company built around a simple division of labor: developers bring their code or model, and RunPod supplies the GPU-backed environment needed to run it. The model can be private or open source, including models from Hugging Face. The platform’s promise is not that infrastructure is unimportant, but that managing it is not where most software teams create value.

Hsu framed the problem in three parts: infrastructure consumes developer time, GPU access is slow and opaque, and builders should spend their attention on building. She compared the current GPU market to the early-COVID toilet-paper shortage: demand surged, supply is constrained, and customers are still learning to estimate how much compute they actually need. Her expectation was that the market will recover as companies and users get better at compute planning, but RunPod’s product argument does not depend on waiting for that recovery. It is designed to abstract away GPU procurement and operations now.

The company’s founding story reinforced that builder-first positioning. Hsu said RunPod began in 2022 with two engineers, Zhen and Pardeep, who had failed crypto-mining GPU rigs in a basement. They prototyped what became RunPod, posted on Reddit offering free GPU access in exchange for feedback, and grew from there. Hsu presented the story less as a bootstrapping tale than as evidence of how the company still wants to operate: in public, with feedback from developer communities such as Reddit and Discord.

The scale numbers she cited were already well beyond the basement origin: more than 500,000 developers on the platform, 30-plus data centers across the world, and a slide that described the footprint as 10 countries. Hsu also cited $120 million in annual recurring revenue.

A customer slide listed Anthropic, Databricks, Microsoft, Perplexity, Wix, Replit, Toyota, CivitAI, and Zillow. Hsu said even AI-native cloud companies use RunPod for the same reason as other customers: flexible and reliable GPU infrastructure.

The product splits the infrastructure problem into four paths

RunPod’s product surface is organized around Pods, Serverless, Clusters, and the Hub.

Pods are the core sandbox environment. RunPod spins up a container, allocates GPUs to it, and manages the rest. Developers bring their Dockerfiles and code. Serverless is the autoscaling product, meant for bursty or batch workloads where workers can spin down when idle. Hsu’s central economic point was that the customer does not pay for a serverless worker while it is idle. Clusters are for heavier training workloads, including multi-node setups with high-speed networking. The Hub is the repository of preconfigured AI repos that can be forked, watched, starred, and deployed on RunPod.

Serverless is the relevant product for real-time inference, variable or spiky traffic, and user-facing AI products. Teams use it because they do not have to pre-provision exactly how much compute they will need. They can set maximum workers, configure spending caps, and optionally keep some workers “always on” so models remain downloaded and ready to respond immediately.

That is the practical center of the product: teams can optimize for lower idle cost by allowing cold starts, or pay for active workers to reduce first-request latency. Hsu showed the relevant controls: worker caps, active workers, GPU selection, pricing, and observability.

A Hub listing becomes an OpenAI-compatible vLLM endpoint with only a few configuration choices

The Serverless Hub is positioned as a starting point for developers who want to see what can run immediately. Hsu selected a vLLM listing whose underlying repository was shown as runpod-workers / worker-vllm, described as an “OpenAI-Compatible vLLM Serverless Endpoint Worker” for “OpenAI-Compatible Routing Fast LLM Endpoints powered by vLLM.”

The repo included a Dockerfile, defaults, setup instructions, and environment-variable configuration. Hsu clicked deploy from the Hub and selected a Qwen model, shown in the console as Qwen/Qwen1.5-4B. The model would be downloaded from Hugging Face. She increased the max model length to enlarge the context window, left the rest at defaults, and noted that options such as max LoRAs are passed through as flags to vLLM serve.
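As a rough illustration of what "passed through as flags" could look like, the sketch below turns a settings dictionary into a vllm serve command line. The helper name vllm_serve_command and the snake_case-to-flag mapping are assumptions made for this example, not the worker's actual code; --max-model-len is a real vLLM flag, and 8192 is the value shown in the configuration panel.

```python
def vllm_serve_command(model: str, options: dict[str, object]) -> list[str]:
    """Build an argv list for `vllm serve` from a settings dictionary.

    Hypothetical helper: each key such as "max_model_len" becomes a
    flag such as "--max-model-len", followed by its value as a string.
    """
    argv = ["vllm", "serve", model]
    for name, value in options.items():
        argv += [f"--{name.replace('_', '-')}", str(value)]
    return argv


# Model and context length as shown in the walkthrough's console.
cmd = vllm_serve_command("Qwen/Qwen1.5-4B", {"max_model_len": 8192})
print(" ".join(cmd))
```

The point of the sketch is only that a console field maps one-to-one onto a serving flag, which is why leaving the remaining fields at their defaults is a reasonable first deployment.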

The endpoint configuration screen showed the default deployment targeting H100 PCIe GPUs with A100s as backup. The pricing shown was $0.00115 per second for an 80 GB H100 PCIe configuration, charged while the worker is actually running and handling a request. The same screen let her set max workers, with the configuration showing a cap of 15, and configure active workers that should stay on rather than letting the container spin down.

Setting shown	Value or behavior in the deployment
Model	Qwen/Qwen1.5-4B
Serving stack	OpenAI-compatible vLLM worker
Max model length	8192 shown in the configuration panel
Primary GPU	80 GB H100 PCIe
Displayed price	$0.00115 per second
Max workers	15

Key deployment settings visible during the RunPod serverless walkthrough

The endpoint itself was exposed as a normal HTTP API. The console showed a POST request to https://api.runpod.ai/v2/vllm/run with a JSON payload asking, “How did Big Ben get its name?” Hsu described this as the same kind of endpoint a developer’s customers could send requests to directly.
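A minimal sketch of that request in Python, using only the standard library, might look like the following. The URL and prompt come from the console as shown; the {"input": {"prompt": ...}} payload shape, the Bearer authorization header, and the YOUR_API_KEY placeholder are assumptions for illustration and should be checked against the endpoint's own documentation. The code builds the request without sending it.

```python
import json
import urllib.request

# URL as displayed in the console during the walkthrough.
RUN_URL = "https://api.runpod.ai/v2/vllm/run"


def build_run_request(prompt: str, api_key: str, url: str = RUN_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a JSON prompt.

    The payload shape and the Bearer header are assumptions for this sketch.
    """
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_run_request("How did Big Ben get its name?", api_key="YOUR_API_KEY")
# To actually send it, with a real key: urllib.request.urlopen(req)
```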

The first request waited on cold start; execution after startup was about 1.5 seconds

The most useful operational detail was the split between queue delay and execution time. After Hsu submitted requests, the worker view showed containers initializing. She described that initialization as the container being created and the model being downloaded. Some workers had already reached the running state and were expected to pick up the queued requests.

The completed request returned an answer beginning, “Big Ben is the nickname for the Great Bell of the Great Clock of Westminster...” The console reported a delay time of 41.6 seconds and an execution time of 1.48 seconds. Hsu attributed the longer delay on the first request to cold start work: downloading the model and initializing the first container. Subsequent requests, she said, should be shorter because that startup cost has already been paid.
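To make the split concrete, the sketch below adds the two console figures and computes how much of the wait was queue delay. It assumes the caller's total wait is simply delay time plus execution time, which the console did not state explicitly; the RequestTiming class is a hypothetical container for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    """Delay and execution times for one request, in seconds."""
    delay_s: float
    execution_s: float

    @property
    def total_s(self) -> float:
        # Assumption: total wait is queue delay plus execution time.
        return self.delay_s + self.execution_s

    @property
    def delay_share(self) -> float:
        # Fraction of the total wait spent before execution began.
        return self.delay_s / self.total_s


# Figures reported by the console for the first request.
first = RequestTiming(delay_s=41.6, execution_s=1.48)
print(f"total: {first.total_s:.2f}s, delay share: {first.delay_share:.1%}")
```

Under that assumption, nearly all of the first request's roughly 43-second wait was startup rather than inference.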

“It sat in the queue for about 41 seconds. That’s going to be a little bit longer than all of the subsequent requests because of some of the cold start time.”

Audry Hsu

That creates the core tradeoff. If no worker is kept warm, the platform can avoid idle charges, but the first request can wait while infrastructure initializes. If a team needs immediate response, it can configure active workers and pay for them to stay ready. Hsu also pointed to endpoint telemetry for request count, execution time, and delay time, positioning observability as part of operating the API rather than an afterthought.
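The tradeoff can be put in rough numbers with the figures shown in the demo. This is a back-of-the-envelope sketch under two assumptions the talk did not confirm: that on-demand requests are billed only for their execution seconds, and that an always-on worker is billed at the same displayed per-second rate. The function names are hypothetical.

```python
PRICE_PER_SECOND_USD = 0.00115  # displayed rate for the 80 GB H100 PCIe configuration
EXECUTION_SECONDS = 1.48        # execution time reported for the demo request


def on_demand_hourly_cost(requests_per_hour: float) -> float:
    """Hourly cost if only execution seconds are billed (assumption)."""
    return requests_per_hour * EXECUTION_SECONDS * PRICE_PER_SECOND_USD


def always_on_hourly_cost(workers: int = 1) -> float:
    """Hourly cost of keeping workers running for the full 3600 seconds (assumed same rate)."""
    return workers * 3600 * PRICE_PER_SECOND_USD


for rph in (10, 100, 1000):
    print(
        f"{rph:>5} req/h: on-demand ${on_demand_hourly_cost(rph):.2f} "
        f"vs one warm worker ${always_on_hourly_cost():.2f}"
    )
```

At low request volumes the warm worker's cost is dominated by idle time, which is the case for accepting cold starts; as traffic grows, the gap narrows and the latency benefit of a warm worker costs relatively less.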

The full path from Hub listing to working endpoint took under five minutes. Hsu’s serverless claim was not that cold starts disappear. It was that the developer can get to a GPU-backed LLM endpoint quickly, choose how many workers can scale up, decide whether any should stay warm, and pay for workers only while they are handling requests.
