Orply.

A Python Decorator Replaces the GPU Deployment Container Loop

Audry Hsu · AI Engineer · Tuesday, June 9, 2026 · 10 min read

RunPod’s Audry Hsu argues that GPU inference development should not require a commit, container build, registry push and server provisioning cycle for every model change. In a demo of Flash, RunPod’s Python SDK, she shows how adding a `@flash.endpoint` decorator to an async function can package that function as a GPU-backed cloud endpoint while the rest of the application stays in the developer’s IDE. Her broader case is that teams should experiment on Pods or low worker counts, then move to Serverless when they need autoscaling inference across many GPU workers.

Flash targets the container-build loop, not model code itself

Audry Hsu frames RunPod as an AI cloud infrastructure company built to keep developers from spending their time on infrastructure configuration rather than models. In her description, RunPod brings the GPUs and compute; developers bring code and models. The platform’s job is to make deployment fast enough that teams are not forced to think first about CUDA version alignment, which PyTorch versions are compatible with it, which GPU SKUs have been tested, or where scaling should happen.

The specific product Hsu demonstrates is Flash, a Python SDK meant to shorten the development loop for GPU-backed inference. Her baseline for that loop is familiar: change inference code, commit it, push to GitHub, build a Docker image, pull it from a container registry, load it onto a server, allocate a GPU, then finally test whether the model behaves as expected. If it does not, the same loop repeats.

Flash’s intervention is deliberately narrow: let a developer mark an async Python function as a remote GPU endpoint from local code. The documentation shown on screen describes Flash as “a Python SDK for developing cloud-native AI apps where you define everything—hardware, remote functions, and dependencies—using local code.” The code example uses a @flash.endpoint decorator on an async main() function, declares a GPU requirement, and lists dependencies such as torch.

Hsu’s summary is that “all you need to know about Flash” is this: take a regular async Python function, add the Flash endpoint decorator, and Flash packages and deploys everything inside the function onto GPU cloud. Everything around that decorated function, including main functions and helper functions, runs in the local development environment. Flash also provides hot file reload, so changes anywhere in the application are repackaged and pushed without leaving the development environment.

You have a regular async Python function, you add our Flash endpoint decorator, and it’s going to deploy and package everything inside your function onto a GPU cloud.

Audry Hsu

The demo code makes that boundary concrete. A function named generate_image() imports PyTorch and diffusers, loads a pretrained text-to-image model, generates an image, writes it into a buffer, base64-encodes it, and returns the encoded string. The endpoint decorator names the endpoint, requests one GPU from the Ada family, and sets worker limits:

@flash.endpoint(
    name="image_generator",
    gpu_dict={"Ada": 1},
    max_workers=5,
    min_workers=1
)
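
The decorator pattern itself can be illustrated with a minimal sketch. It does not use Flash: `endpoint`, `EndpointConfig`, and `REGISTRY` are hypothetical stand-ins, and the wrapped function simply runs locally where a real SDK would package it and ship it to a GPU worker. Only the parameter names are taken from the demo code.

```python
import asyncio
import functools
import inspect
from dataclasses import dataclass


@dataclass
class EndpointConfig:
    """Deployment settings carried alongside the function (hypothetical type)."""
    name: str
    gpu_dict: dict
    max_workers: int = 1
    min_workers: int = 0


# Hypothetical registry mapping endpoint name to (config, function).
REGISTRY = {}


def endpoint(**kwargs):
    """Stand-in for an endpoint decorator; NOT the Flash implementation."""
    config = EndpointConfig(**kwargs)

    def decorate(func):
        if not inspect.iscoroutinefunction(func):
            raise TypeError("endpoint functions must be async")

        @functools.wraps(func)
        async def wrapper(*args, **kw):
            # A real SDK would serialize this call and run it on a remote
            # GPU worker; this sketch just awaits the function locally.
            return await func(*args, **kw)

        wrapper.config = config
        REGISTRY[config.name] = (config, wrapper)
        return wrapper

    return decorate


@endpoint(name="image_generator", gpu_dict={"Ada": 1}, max_workers=5, min_workers=1)
async def generate_image(prompt: str) -> str:
    # Placeholder body; the demo function loads a diffusion model here.
    return f"image for: {prompt}"


result = asyncio.run(generate_image("cats flying over London"))
print(result)
```

The point of the sketch is the split it makes visible: the configuration travels with the function, and everything outside the decorated function stays ordinary local Python.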

Hsu starts the app with flash run image_generation.py, which spins up a local development server. She describes it as a FastAPI server. A helper script then sends a POST request to the local endpoint and decodes the image response so the audience can see the generated output.

The local server receives the request, queues a job, and forwards the GPU work through the Flash endpoint. The editor, terminal, and generated outputs remain visible as the deployment surface: the developer is still working from the IDE, while the GPU-backed function is packaged and run through RunPod.
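
The round trip between the endpoint, which returns a base64 string, and the helper script, which decodes it, can be sketched with the standard library alone. The function names and the placeholder payload below are assumptions for illustration; the demo's own helper script is not shown in full.

```python
import base64
import io


def encode_image_bytes(raw: bytes) -> str:
    """Endpoint side: write bytes to an in-memory buffer, return base64 text."""
    buffer = io.BytesIO()
    buffer.write(raw)
    return base64.b64encode(buffer.getvalue()).decode("ascii")


def decode_image_string(encoded: str) -> bytes:
    """Client side: turn the base64 string from the response back into bytes."""
    return base64.b64decode(encoded)


# Placeholder payload: the 8-byte PNG signature, standing in for a real image.
fake_png = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_bytes(fake_png)
decoded = decode_image_string(encoded)
print(encoded)
```

Returning text rather than raw bytes keeps the response easy to embed in a JSON body, at the cost of roughly a third more payload size.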

The first image shows what fast iteration is for

Audry Hsu first uses Stable Diffusion XL Turbo for image generation, describing it as “really great for fast” image generation. The audience supplies a deliberately simple prompt: cats flying in the sky, on a cloudy day, somewhere in London. The first visible result is a dragon, which Hsu attributes to not passing the prompt correctly. After correcting the prompt flag, the model produces distorted cats floating above rooftops. Hsu calls them “abstract cats” and says the result “looks terrible.”

The poor output matters because Hsu is not presenting Flash as a way to guarantee the right model choice or the right prompt. She is presenting it as a way to make the next experiment cheap in developer time.

She comments out the SDXL Turbo pipeline and swaps in DreamShaper:

pipe = AutoPipelineForText2Image.from_pretrained(
    "Lykon/dreamshaper-xl-1-0",
    torch_dtype=torch.float16,
    variant="fp16"
).to("cuda")

Hsu describes DreamShaper as a fine-tuned model based on Stable Diffusion 1.5 and says she expects better image quality, especially for more artistic or illustrative styles. She also changes generation settings, including increasing inference steps to 25 and using a 1024-by-1024 output.

What made this really fast is instead of making a code change, committing it, rebuilding my Docker, uploading it somewhere, and then allocating GPU infrastructure, all of this is happening right here from my IDE and I never have to leave.

Audry Hsu

The endpoint decorator carries deployment configuration alongside the Python function. In the demo, Hsu points to parameters for endpoint name, GPU family, maximum workers, minimum workers, and timeout. The code shown requests one GPU from the Ada family, which is the selection Hsu describes, although the RunPod console view later in the demo lists H100 80GB workers.

In the code shown, max_workers=5 means the endpoint can have up to five workers running at once. min_workers=1 keeps one worker active. Hsu describes timeout as controlling how long a worker can remain idle.
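
One way to read those two limits is as a clamp on the number of active workers. The sketch below assumes a deliberately simplified policy of one worker per pending request; `target_workers` is a hypothetical helper and is not a description of RunPod's actual autoscaler.

```python
def target_workers(pending_requests: int, min_workers: int, max_workers: int) -> int:
    """Clamp a naive one-worker-per-request target to the configured limits."""
    if min_workers > max_workers:
        raise ValueError("min_workers cannot exceed max_workers")
    return max(min_workers, min(pending_requests, max_workers))


# With the demo's limits (min_workers=1, max_workers=5):
idle = target_workers(0, min_workers=1, max_workers=5)    # floor applies
three = target_workers(3, min_workers=1, max_workers=5)   # within limits
burst = target_workers(12, min_workers=1, max_workers=5)  # ceiling applies
print(idle, three, burst)
```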

The model swap is therefore not only a code edit; it is a test of the product’s promised iteration loop. The GPU-backed endpoint is treated as part of local development rather than as a separate infrastructure artifact that must be rebuilt before every meaningful experiment.

The orchestration demo is where local code matters most

Audry Hsu says the larger value of a developer tool like Flash appears when a developer is not making a single call to a single model, but writing orchestration code around multiple model calls. Her second demo uses a pre-prepared pipeline that chains three model operations.

| Pipeline step | Model or endpoint | Role in the demo |
| --- | --- | --- |
| Prompt expansion | Qwen 3, with terminal text showing Qwen/Qwen-110B | Generate three richer prompts from a simple user prompt |
| Image generation | DreamShaper running on the Flash endpoint | Render images from the expanded prompts |
| Composition | Nano Banana 2 | Compose the generated images with a reference photo into final images |

The multi-model demo chains prompt generation, image rendering, and final composition.

The orchestration code shown on screen is plain Python: ask for a prompt, send it to Qwen to generate better prompts, loop through those prompts and send each to DreamShaper, then pass the resulting images into Nano Banana 2 for composition. Hsu identifies Nano Banana 2 as a premium Google model that is “really good at composing photos together.”
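
The shape of that pipeline can be sketched with stub functions in place of the three model calls. Everything here is an assumption for illustration: `expand_prompt`, `render_image`, `compose`, and `run_pipeline` are hypothetical names, the stubs return strings rather than images, and no real model or RunPod service is contacted.

```python
import asyncio


async def expand_prompt(prompt: str, count: int = 3) -> list[str]:
    """Stub for the prompt-expansion model call."""
    return [f"{prompt}, detailed variation {i + 1}" for i in range(count)]


async def render_image(prompt: str) -> str:
    """Stub for the GPU image endpoint; returns a label instead of image data."""
    return f"image({prompt})"


async def compose(images: list[str], reference: str) -> list[str]:
    """Stub for the composition model call."""
    return [f"composed({image}, {reference})" for image in images]


async def run_pipeline(prompt: str, reference: str) -> list[str]:
    # Step 1: turn one simple prompt into several richer ones.
    prompts = await expand_prompt(prompt)
    # Step 2: render every expanded prompt; gather issues the calls concurrently.
    images = await asyncio.gather(*(render_image(p) for p in prompts))
    # Step 3: combine each rendered image with the reference photo.
    return await compose(list(images), reference)


final = asyncio.run(
    run_pipeline("two men walking in London on a cloudy day", "reference.jpg")
)
for item in final:
    print(item)
```

The sketch uses `asyncio.gather` for the rendering step, whereas the demo is described as looping over the prompts; a plain `for` loop with `await` would produce the same results sequentially.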

The audience prompt is “two men walking in London on a cloudy day,” refined with “AI engineers,” “glasses,” and “close up of their faces.” The terminal output shows the prompt-expansion step returning a more detailed image prompt: two men with glasses walking through a bustling London street, close-up faces with thoughtful expressions and weathered features, soft-focus cloudy background, muted grays and deep blues, overcast lighting, subtle lens droplets, and detailed skin and fabric textures.

That expanded prompt becomes part of Hsu’s argument about orchestration. The developer did not manually write elaborate prompts for each generated image. The local pipeline delegates that job to one model, sends the results to another model running through Flash, then sends the generated images to a third model for composition. The demo’s emphasis is not only remote GPU execution but control over a multi-step application from local code.

The final web page compares generated and composed results. Hsu shows the original prompt, the DreamShaper outputs based on Qwen’s prompt engineering, and the composed images created from a reference photo of RunPod’s founders. She reads out some of the Qwen-expanded cues because the screen is hard to see: “thoughtful expressions and weathered faces,” “soft focus on background clouds,” “muted urban palette with grays and deep blues,” and “overcast lighting.” She jokes that one composed result makes Pardeep look handsome and Zhen look unexpectedly old, then scrolls to show the reference photo used for composition.

The demo is deliberately informal, but its technical shape is clear: Flash is used for the GPU endpoint in a larger local pipeline, not as a replacement for the rest of the application. Hsu closes that thread by saying developers can start in a local development environment, use open-source models, or bring their own private models.

RunPod’s product split maps to the stage of work

Audry Hsu presents RunPod as four main ways to build: Pods, Serverless, Clusters, and Hub. The distinctions matter because Flash is not introduced as the whole platform. It is introduced as a developer workflow for GPU-backed serverless work.

Pods are for persistent VM-like environments where a developer rents GPU capacity on demand, pays by the second, and tears it down when finished. If a pod is running, Hsu says, the GPU is reserved: “the GPU is yours and no one can take it away from you.”

Serverless is for workloads that are ready to deploy and need scaling because request frequency and load are variable. RunPod scales workers up and down, and Hsu says users do not pay for idle time when no requests are happening. A slide describes Serverless as “autoscaling inference without infrastructure overhead,” best suited to real-time inference, variable or spiky traffic, and user-facing AI products. The listed reasons teams use it are no pre-provisioning, automatic scaling, and paying only for usage.

Clusters are positioned for training and multi-node use cases, including high-speed networking, Slurm, and PyTorch. Hub is for one-click deployment of pre-vetted open-source AI repositories and popular models such as ComfyUI, Stable Diffusion, and vLLM.

The product taxonomy reflects the split that appears in the demo. Hub reduces the first-click burden for people exploring open-source AI tools. Pods provide direct, reserved GPU environments. Clusters target training and multi-node workloads. Serverless targets inference services with variable or spiky demand. Flash is presented as a developer interface into that Serverless layer, designed to make the remote GPU accessible from the same place the developer is writing the Python application.

Serverless pricing is usage-based, with a premium for scaling

An audience question leads Audry Hsu to distinguish Serverless from Pods on pricing. In the RunPod console, she shows the image-generator serverless endpoint created from the terminal. The worker list shows multiple H100 80GB workers, with some running and some idle. Hsu points out that three workers are running because she asked for three photos. The console displays the H100 80GB cost as $0.00116 /s.

$0.00116/s
displayed H100 80GB serverless worker cost in the RunPod console

Hsu says every request is charged only for how long that request is running. The edit endpoint modal also shows GPU memory tiers and per-second prices: 16 GB at $0.00012 /s, 24 GB at $0.00020 /s, 32 GB at $0.00030 /s, 40 GB at $0.00045 /s, 48 GB at $0.00060 /s, and 80 GB at $0.00116 /s.

| GPU memory tier shown | Displayed serverless price |
| --- | --- |
| 16 GB | $0.00012/s |
| 24 GB | $0.00020/s |
| 32 GB | $0.00030/s |
| 40 GB | $0.00045/s |
| 48 GB | $0.00060/s |
| 80 GB | $0.00116/s |

Per-second GPU prices visible in the RunPod endpoint configuration modal.
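
To put the per-second figures in more familiar terms, the arithmetic can be worked through in a few lines. The prices are the ones displayed in the console; the 20-second request duration and the helper names are hypothetical and chosen only to illustrate the calculation.

```python
# Per-second prices by GPU memory tier (GB), as displayed in the console.
PRICE_PER_SECOND = {
    16: 0.00012,
    24: 0.00020,
    32: 0.00030,
    40: 0.00045,
    48: 0.00060,
    80: 0.00116,
}


def hourly_rate(memory_gb: int) -> float:
    """Cost of one worker running continuously for an hour."""
    return PRICE_PER_SECOND[memory_gb] * 3600


def request_cost(memory_gb: int, seconds: float, workers: int = 1) -> float:
    """Cost of `workers` workers each busy for `seconds` seconds."""
    return PRICE_PER_SECOND[memory_gb] * seconds * workers


print(round(hourly_rate(80), 3))                           # 4.176
print(round(request_cost(80, seconds=20, workers=3), 4))   # 0.0696
```

At the displayed rate, one 80 GB worker kept busy for a full hour comes to about $4.18, while three workers each running a hypothetical 20-second job cost about seven cents in total.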

Hsu says Serverless pricing is “a little bit different” from Pods because Pods do not include the same scaling behavior. Serverless therefore carries “a little bit of a premium.” Her recommendation is practical: while experimenting, start either with a very low worker count or with Pods, because an experiment may only require one or two GPUs at a time. Serverless is for cases where a team needs hundreds of workers on hundreds of GPUs, distributed across data centers for availability.

The guidance keeps the product positioning specific. Flash makes the iteration loop look local and lightweight, while RunPod’s deployment choices still depend on stage and load: Pods for experimentation or reserved GPU access; Serverless for autoscaling inference when demand justifies workers scaling across infrastructure.

The infrastructure pitch is developer adoption and GPU reach

Audry Hsu gives a short company history to explain why RunPod exists. Its founders, Zhen and Pardeep, started the company in 2022 after a failed crypto mining venture left them with spare GPUs in a basement. They built a prototype, posted on Reddit asking whether anyone wanted free GPUs in exchange for feedback, and built from there with the community. Hsu says RunPod has been revenue-generating from the beginning, which she calls rare.

The company metrics shown and stated are large for a GPU infrastructure provider that Hsu describes as “punching above our weight class”: around 500,000 developers, more than 30 data centers across 10 countries, and $120 million in annualized run rate. In Europe, she names France, Romania, and Iceland, with a brief aside on whether Iceland counts as Europe. A customer logo slide includes Anthropic, Databricks, Microsoft, Perplexity, Wix, Replit, Toyota, Civitai, and Zillow.

| RunPod metric shown | Value |
| --- | --- |
| Developers on platform | 500K |
| Data centers | 30+ |
| Countries | 10 |
| Annualized run rate | $120M |

RunPod scale metrics shown in Hsu’s presentation.

Hsu’s explanation of the customer set is that AI-native companies and large enterprises share a need for flexible, reliable GPU infrastructure. That is the infrastructure-level claim behind the Flash demo: developers may want local-feeling iteration, but the platform still has to provide GPU availability, worker scaling, geographic distribution, and configuration options for different stages of AI development.
