Audry Hsu

Software engineer at RunPod focused on AI infrastructure and developer workflows. She has presented RunPod demos on deploying serverless vLLM/LLM endpoints and describes herself as a full-stack software engineer with experience building scalable distributed applications.

A Python Decorator Replaces the GPU Deployment Container Loop

RunPod’s Audry Hsu argues that GPU inference development should not require a commit, container build, registry push and server provisioning cycle for every model change. In a demo of Flash, RunPod’s Python SDK, she shows how adding a `@flash.endpoint` decorator to an async function can package that function as a GPU-backed cloud endpoint while the rest of the application stays in the developer’s IDE. Her broader case is that teams should experiment on Pods or low worker counts, then move to Serverless when they need autoscaling inference across many GPU workers.

AI Engineer · Jun 9, 2026 · 10 min read

RunPod’s Serverless LLM Endpoint Trades Cold Starts for Lower Idle Cost

Audry Hsu presents RunPod as a cloud AI infrastructure company trying to move GPU provisioning and operations behind a deployable model endpoint. In the walkthrough, she shows a Qwen model deployed from RunPod’s Hub as an OpenAI-compatible vLLM serverless endpoint on H100s in under five minutes, with billing tied to workers while they handle requests. Her case is narrower than eliminating infrastructure tradeoffs: the first request waited 41.6 seconds on cold start, while subsequent execution took about 1.5 seconds, leaving teams to choose between lower idle cost and keeping workers warm for lower latency.

AI Engineer · Jun 7, 2026 · 6 min read