Hackathon Caps Models at 32B Parameters to Reward Tinkerable AI Apps

Shashank Verma

Nikita Pavlichenko Hannah Blair Zhong ZhangHugging FaceFriday, June 5, 202620 min read

Build Small is a Hugging Face and Gradio hackathon organized around a hard constraint: every model used must be under 32 billion parameters. Yuvraj Sharma framed the rule as a way to move AI building away from dependence on giant hosted models and back toward systems that participants can inspect, fine-tune, run locally, and ship as working Gradio Spaces. Sponsor presentations from Black Forest Labs, OpenBMB, OpenAI, NVIDIA, Modal, JetBrains, and Cohere largely reinforced that premise, offering small models, credits, tools, and prize categories meant to turn the constraint into runnable projects rather than demos in name only.

The constraint is the point: build with models small enough to inspect, adapt, and ship

Yuvraj Sharma framed Build Small as a deliberate move away from the current center of gravity in AI: large API inference providers and ever-larger models. The hackathon’s rule is not simply an eligibility boundary. It is the design premise. Participants are asked to build with models under 32 billion parameters so the work can return to a mode where models are “tinkerable,” fine-tunable, and fun to experiment with.

The hackathon runs June 3–15 under the title “Small Models Big Adventures,” hosted by Hugging Face and Gradio, with sponsors including Hugging Face, OpenAI, NVIDIA, Modal, OpenBMB, JetBrains, Cohere, and Black Forest Labs. The total prize structure includes about $48,000 in cash, $20,000 in Modal credits, two RTX 5080s, one-year ChatGPT Pro subscriptions, and 29 prize categories.

The two project tracks define the intended shape of submissions. “Backyard AI” is for software that helps someone the builder actually knows. Sharma’s examples were modest by design: a storybook generator for a child, a practical assistant, something whose value can be judged by whether the intended person uses it. “Thousand Token Wood” is the whimsical track: weird, delightful, creative work where AI is load-bearing rather than decorative. An interactive game was given as the simple example, but the track is explicitly meant to leave room for stranger ideas.

The core rules are compact. Every model used must be under 32 billion parameters. Sharma clarified that this applies to the total parameter count for mixture-of-experts models, not only active parameters. A builder can use multiple models, and each model must remain below the limit. Submissions must be Gradio apps hosted as Hugging Face Spaces under the hackathon organization; the SDK can be Gradio or Docker, but the app must still be a Gradio Space. Each submission also needs a demo video and a social post, with links included in the Space README, so judges can evaluate the project even if the app cannot run at judging time because of GPU limits, exhausted API credits, or other operational issues.

32B

maximum total parameter count for each model used in a Build Small submission

The hackathon also uses “bonus quests” as a second layer of incentives. “Off the Grid” recognizes apps that avoid cloud APIs and run the model locally in the Space. “Well-Tuned” is for projects using a fine-tuned model published on Hugging Face. “Off-Brand” rewards custom frontends that push beyond the default Gradio look, with gr.Server mentioned as a hint. “Llama Champion” is for models running through the llama.cpp runtime. “Sharing is Caring” recognizes shared agent traces on the Hub. “Field Notes” is for a blog post or report about what was built and learned. Sharma described these as badges tied to extra points, recognition, and specific prize categories.

The resource package is built to make that constraint plausible rather than symbolic. Participants in the hackathon organization receive $20 in Hugging Face credits, ZeroGPU access, and access to inference providers. Sharma said the organization’s team status gives participants the ability to create ZeroGPU Spaces and use ZeroGPU for around 40 minutes per day, with a limit of up to 10 ZeroGPU apps. Modal is providing $250 in credits to registered participants. OpenAI is providing Codex credits for the first 1,000 registrations. Support is being routed through Discord, with sponsor AMA sessions planned around June 9, 10, and 11.

Gradio’s new workflow layer turns model composition into a visual artifact

Hannah Blair introduced gr.workflow() as a new Gradio feature being released the same day, and positioned it as a way to build AI pipelines without code. It sits on top of Gradio and lets a builder drag and drop nodes on a canvas to chain image generators, voice models, summarizers, and other components into a pipeline.

The examples shown were deliberately compositional: a webcam input connected to a video-avatar model and then to an output video. Blair said workflows can use any Hugging Face Space, model, or dataset. Builders can browse by category — image, audio, video, 3D, and text — filter for ZeroGPU Spaces, and search semantically through the Hub. For more complex workflows, custom Python functions can be mixed into the visual pipeline.

The output is not a separate product surface. Blair emphasized that a finished workflow can be shared or handed off as a Gradio app, saved to the Hub as a workflow file, and launched like any other Gradio app because it is “just another layer on top of Gradio.” Other users can then use the workflow inside their own Spaces.

The caveat was clear: this is a beta launch. Blair invited participants to use it creatively for the hackathon, but also to report bugs and suggestions. More documentation and tutorials were promised to help builders get started.

Black Forest Labs wants image generation to be fine-tuned, not just prompted

Stephen Batifol presented Black Forest Labs’ contribution through FLUX.2 [klein], the smallest member of the FLUX.2 family. The model line is open weights and Apache 2.0 licensed, with 4B and 9B variants. Batifol focused on the 4B model because it fits the hackathon’s “small” constraint particularly well: it can run on consumer hardware, on Hugging Face ZeroGPU, and in a starter Space built for the event.

Black Forest Labs described itself as a frontier visual AI research lab focused on image generation, editing, and products built on top of visual models. Its slides claimed more than 400 million open-model downloads, a team of more than 80 people, and a $3.25 billion Series B valuation. Batifol also connected the founding team to Stable Diffusion and latent diffusion.

For builders, the practical distinction was between base and distilled versions. The base model is the training target and the version to use for LoRAs. The distilled version runs in four steps and is meant for fast inference, testing adapters, and deploying the result. Batifol’s recommendation was therefore procedural: train on base, ship on distilled.

The capabilities demonstrated were text-to-image, image-to-image editing, and style variation. The same model can generate an image from a prompt, edit an existing image, change backgrounds or other image attributes, and handle a range of styles from anime-like images to more artistic renderings. The 4B version was described as requiring roughly 13 GB of VRAM.

Batifol’s main technical guidance concerned LoRAs. He wrote a guide for the hackathon, “Fine-tune FLUX.2 [klein] with a LoRA under 60 minutes,” covering the full loop from dataset construction to trainer configuration, training, loading in diffusers, and deploying a Gradio Space. The guide covers both style LoRAs and edit LoRAs. For a style LoRA, Batifol said a dataset can be small: roughly 15 to 20 images for teaching a style, and perhaps 15 to 40 depending on the use case. The important captioning rule is that captions should describe what is in the image, not the style; the style is what the model should infer.

For edit LoRAs, the dataset is paired: reference input and target output. The caption is an instruction rather than a description. Batifol showed an example where a photo of a cat becomes a line drawing, then emphasized that the LoRA learns the edit rather than the subject. A LoRA trained mostly on pets could still transform a city skyline into a sketch, which he used to illustrate generalization across subject matter. The more fluid part of edit-LoRA behavior, according to the slide, is the data rather than the training, and varying caption phrasing helps if looser prompts should work.

The starter Space collects the whole workflow: text-to-image, image-to-image, loading a LoRA, training a LoRA, and links. Builders can duplicate it and get a running FLUX.2 [klein] app in their own account with no token and no gating. It also renders the base model and a LoRA at the same seed, so the builder can see exactly what the adapter changed. Batifol showed the Space generating a “cozy ramen stall at night in the rain” and described using a webcam photo as an input for image editing, though the live webcam did not work during the presentation.

Sharma added that Black Forest Labs is sponsoring a $5,000 cash prize pool and that the guide will be linked from the hackathon landing page.

OpenBMB’s small-model pitch is breadth: text, vision, voice, and on-device deployment

Zhong Zhang introduced OpenBMB as an open-source community co-founded by THUNLP and ModelBest, focused on deploying and adopting LLMs on devices and hardware. Its MiniCPM family spans language, multimodal, and edge deployment models, supported by tools for compression, quantization, and inference optimization. Zhong said the goal is to make LLMs run efficiently on resource-constrained hardware such as smartphones, IoT devices, and robots.

The OpenBMB slides claimed 139,000 GitHub stars, more than 32 million total model downloads, and 104 open-source models. THUNLP was described as Tsinghua’s NLP lab and China’s earliest research group for NLP and large language models, with more than 200 papers at top conferences and 44,000 citations. ModelBest was described as a team working on efficient on-device models for law, automotive, and IoT customers.

The MiniCPM family presented for the hackathon includes text, reasoning, vision, multimodal, and voice models. MiniCPM-5 1B is the lightweight text model, recommended for local-first apps such as a personal assistant, writing helper, study tutor, lightweight chatbot, or email and note helper. Zhong said it ranked number one on the Artificial Analysis Intelligence Index for tiny models and mentioned a desktop-pet demo as an example of a small local application.

MiniCPM-4.1 8B is the stronger text-reasoning model, aimed at apps that need deeper problem solving: reasoning assistants, study solvers, planning assistants, research helpers, and structured decision-making tools. VoxCPM covers end-to-end voice generation, including TTS, voice cloning, and creative audio. Zhong suggested voice storytellers, character voices, audio postcards, voice companions, and language-learning voice practice as possible uses.

For vision and multimodal applications, MiniCPM-V 4.6 is the recommended model for image understanding, OCR, document assistants, and video understanding. Zhong suggested receipt and bill parsers, screenshot helpers, homework image tutors, repair-manual readers, shop-menu assistants, and visual puzzle games. MiniCPM-o 4.5 is the omni model. Zhong called it the world’s first full-duplex omni-modal model, able to continuously see, listen, and speak naturally. The practical meaning he emphasized is conversational interruption: while it is talking, a user can interrupt, and the model can respond to speech immediately.

OpenBMB is sponsoring $10,000 split evenly across the two hackathon tracks. For Backyard AI, the prize pool is $5,000: $2,500 for first place, $1,500 for second, and $1,000 for third. Thousand Token Wood has the same distribution. To be eligible for the OpenBMB category, Sharma clarified, the app must build with MiniCPM models.

OpenBMB is also offering free APIs for the event, including endpoints for MiniCPM-4.1-8B, MiniCPM-V-4.5, MiniCPM-V-4.6, and MiniCPM-V-4.6-Thinking, with an authorization token and sample OpenAI-compatible curl request shown on screen. For deployment, Zhong recommended vLLM for GPU serving and OpenAI-compatible high-throughput APIs, and llama.cpp for local-first quantized deployment on laptops and edge devices. He also showed Transformers quick-start snippets for MiniCPM-V 4.6 image inference and MiniCPM-5 1B text generation.

His recommended build path was intentionally minimal: pick a concrete use case, choose the matching MiniCPM model, build a minimal Gradio app, deploy it to a Hugging Face Space, write the README, record the demo video, and submit the Space link. “A great hackathon project doesn’t need to be complicated,” the slide said; it needs to be clear, runnable, useful, or delightful.

Codex is being judged as a development method, not an app dependency

Vaibhav Srivastav presented Codex as OpenAI’s software delivery agent, integrated across the Codex app, IDEs, and CLI, and built on OpenAI coding-focused models. His emphasis was less on using an OpenAI model inside the final hackathon app and more on using Codex to build the software: Gradio servers, model demos, tools, MCP servers, orchestration code, and supporting infrastructure.

Codex, as Srivastav described it, can connect to the tools a developer already uses: GitHub, Figma, Notion, Google Docs, Hugging Face, and others through plugins. Hugging Face has its own plugin that can be used from Codex. This means Codex can operate across project materials, inspect designs, edit code, and interact with repositories rather than functioning only as a chat interface.

Srivastav gave a short model-history account focused on Codex capability gains over the previous six months: GPT-5.1-Codex-Max, GPT-5.2-Codex, GPT-5.3-Codex, GPT-5.4, and GPT-5.5. The slide associated GPT-5.1-Codex-Max with long-running tasks, compaction, extra-high reasoning effort, Windows operation, and better performance with fewer thinking tokens. GPT-5.2-Codex was described as improving cybersecurity capability, large code changes, and vision. GPT-5.3-Codex was described as 25% faster, with new highs on SWE-Bench Pro and Terminal-Bench. GPT-5.4 added a 1 million context window and better efficiency on long-running tasks. GPT-5.5 was presented as the smartest and fastest model for real-world work, with computer use and token efficiency.

Srivastav said GPT-5.5 is more token-efficient than GPT-5.4 at the same reasoning levels while scoring higher on Terminal-Bench 2.0, which he described as a proxy for day-to-day and coding tasks. The slide claimed Codex delivers up to 1.5x faster token velocity with GPT-5.5 in /fast mode.

The newer Codex product features are relevant to hackathon work. An in-app browser lets Codex inspect and interact with dashboards, apps, servers, or other UI surfaces during development. Image generation is available directly inside Codex. Remote SSH connections allow the Codex app to operate on a remote host; Srivastav specifically mentioned Modal VMs as a place builders might connect Codex. Codex Mobile lets users start tasks from ChatGPT mobile, steer work on remote SSH hosts, review diffs, and push PRs, which Srivastav connected to the reality that hackathon participants may not be at a computer continuously.

The feature he highlighted most strongly was /goal, which is generally available. It lets a user define an objective and describe what “done” means, then allow Codex to work for hours or days. Srivastav suggested that a prompt such as training a 2B-parameter model under a given strategy could lead Codex to set up the environment, decide the approach, and eventually return a fine-tuned model.

Sharma then explained the specific OpenAI prize mechanics. OpenAI is sponsoring a $10,000 cash award with first, second, and third prizes. Evaluation in that category will be done by Codex. Submissions must still be apps in the hackathon organization, but the Space README must mention a public GitHub repository, and that repository must contain Codex-attributed commits. Projects showing more holistic use of Codex — for example using it to fine-tune a model or create agent traces — will be ranked higher in that category.

NVIDIA’s eligible models cover the whole app stack, but Nemotron Ultra is outside the contest

Shashank Verma mapped the NVIDIA Nemotron models that fit the Build Small rules. He described the family as efficient, open, and intelligent, with many models available on Hugging Face and some offered in quantized versions such as NVFP4 checkpoints for efficient use on newer NVIDIA GPUs.

For general-purpose language work, Verma identified Nemotron 3 Nano 30B-A3B and Nemotron 3 Nano 4B. The 30B model has 3B active parameters and uses a mixture-of-experts architecture. It is meant for reasoning, chat, tool use, coding helpers, retrieval-augmented generation, and long-running agent workflows. The smaller 4B edge model is intended for RTX and Jetson devices, local assistants, games, NPCs, reasoning, and tool-use agents. Both fit the hackathon’s parameter limit as described.

For multimodal understanding, Verma pointed to Nemotron 3 Nano Omni 30B-A3B. It can understand audio, image, text, and video, with text output, making it suitable for document intelligence and GUI agents. For math and code, he pointed to Nemotron Cascade 2 30B-A3B, a fine-tune of Nemotron Nano aimed at advanced math/code reasoning and tool-integrated reasoning. The slide claimed gold-medal performance in the 2025 International Mathematical Olympiad and International Olympiad in Informatics.

The speech models include Nemotron 3.5 ASR, a 0.6B multilingual continuous streaming speech-recognition model supporting 42 language locales, punctuation and capitalization, and chunk-size configuration between 400ms and 1200ms. Verma suggested voice agents, live captions, multilingual transcription, meeting and audio apps, and accessibility tools. He also referred to a broader Nemotron Speech collection on Hugging Face, including TTS, speaker diarization, streaming variants, and speech-to-speech or audio-to-audio agent components.

For document-heavy retrieval workflows, Verma highlighted Nemotron Parse v1.2, a model under 1B parameters purpose-built to extract structured text, tables, markdown, bounding boxes, and semantic classes from PDFs, PowerPoints, forms, reports, screenshots, charts, diagrams, and figures. He described it as an intermediate model in a RAG pipeline: parse the document first, then pass the extracted structure to a larger reasoning model such as Nano.

NVIDIA also has embedding models for vision and document intelligence. Nemotron ColEmbed VL comes in 4B and 8B sizes for high-accuracy visual document retrieval over dense pages, tables, charts, screenshots, layouts, and infographics. Llama Nemotron Embed VL 1B is the lighter option for multimodal RAG and PDF/image/text retrieval.

For safety, Verma introduced Nemotron 3.5 Content Safety 4B, a small language model for multimodal content moderation, input/output safety checks, and privacy-policy enforcement. He suggested placing such a model at the end of an agent to check what comes in and what goes out.

The only model he explicitly placed outside the hackathon was Nemotron 3 Ultra. NVIDIA had released it the day before, and Verma described it as an efficient large model for long-running agents, high-end reasoning, and orchestration, but “not for this hackathon.” The slide presented it as 85B and compared it with GLM 4.2, Qwen 2.5, and Qwen 3 on benchmarks including Agent Productivity Multibench, OSWorld, Terminal Bench 2.0, IFBench, GSM8K, Multimodal Agent, and long-context rule Q&A.

NVIDIA’s prize structure is tied to Nemotron use. Sharma said NVIDIA is sponsoring two RTX 5080s. One will be judged or evaluated by the NVIDIA team for projects built with the eligible Nemotron models presented. The second will also require Nemotron models but will be awarded based on community signals such as likes and interaction.

Modal is the compute layer for fine-tuning, inference, and agent sandboxes

Felicia Chang described Modal as an AI infrastructure platform for running inference, training models, batch processing, and sandboxes for coding agents. Its primitive is a Python function that can access CPU and GPU compute at scale.

Her examples were intentionally code-sized. Modal showed vLLM inference in 200 lines of code, with a link to a Modal examples repository. It showed supervised fine-tuning in 300 lines of code, using Modal volumes to keep data in one place and using open-source libraries with serverless infrastructure to make parallel hyperparameter sweeps straightforward. It also showed OpenCode running in a Modal Sandbox as an example of building coding agents at home.

Chang also pointed to Modal’s integration with OpenAI’s Agents SDK. The tutorial shown, “Building with Modal and the OpenAI Agents SDK,” described using Modal Sandboxes with the Agents SDK for parameter golf, a use case she connected to the theme of building small models.

For Build Small participants, Modal is supplying $250 in credits. Chang said that should be enough to build with a variety of open-source models, run training, and use sandboxes. She directed participants to reach her at @felicia_modal on the Hugging Face Discord and to join Modal’s Slack to speak directly with engineers for technical support.

Sharma added that Modal is also sponsoring a Modal category worth around $20,000 in Modal credits. To be eligible, builders must use Modal and specify that use in the Space README. Sharma connected this directly to the bonus badges: using Modal to fine-tune or host a model can help participants pursue categories where fine-tuning or stronger deployment work matters.

JetBrains is using Mellum 2 to test fast coding models under hackathon pressure

Nikita Pavlichenko presented Mellum 2, a JetBrains model released the Monday before the kickoff. It is a 12B-parameter mixture-of-experts model optimized for H100 and H200 GPUs, permissively licensed under Apache 2.0, and built for coding and language tasks.

Pavlichenko said coding is where the model is likely stronger, but it can be used for other tasks. The feature he emphasized most was throughput: Mellum 2 “really shines” when many requests run in parallel, making it suitable for use cases that benefit from low-latency, high-throughput inference. The slide suggested AI coding assistants, RAG applications, intelligent routing between models, code analysis and developer tools, real-time chat, and automation.

JetBrains has shared llama.cpp weights. Pavlichenko said MLX support had not yet been merged, but invited participants to tag the team on Discord if they need it, so JetBrains can try to speed that up. There are two versions of the model: a thinking version that outputs reasoning and an instruct version that does not produce reasoning traces and is “blazingly fast.”

His instructions to builders were open-ended: deploy it, fine-tune it, use LoRA, and “please break the model.” If something works, fine; if something does not, tell the JetBrains team so they can help and improve the model. Engineers from the Mellum team, including people who worked on pre-training and post-training, will be available on Discord. Pavlichenko even floated the possibility that the team might train another model during the hackathon, though he did not commit to it.

Sharma clarified that JetBrains is sponsoring a $5,000 cash prize and described Mellum 2 as a recently released code-completion model with team support available on Discord.

Cohere’s small models are aimed at speech and multilingual applications underserved by default LLM choices

Julian Mack presented two Cohere and Cohere Labs model offerings for the hackathon: Cohere Transcribe and Tiny Aya. He described Cohere as an enterprise-focused model lab and Cohere Labs as its research arm.

Cohere Transcribe is an audio-to-text ASR model. Mack described it as 2B parameters, fast, optimized for low latency, and among the best ASR models available. The slide said it supports 14 languages and performs strongly in English and multilingual settings, including noisy and far-field microphone conditions. It also noted that 90% of parameters are in the encoder for efficient inference, and that the model has support across Transformers, vLLM, transformers.js for browser use, mlx-audio for Apple Silicon, Whisper_Menos for iOS, Whisper.cpp through a provided pull request, and a Linux app. The slide recommended silero-vad voice activity detection for end-of-utterance detection and hallucination mitigation.

Mack suggested that the encoder may be interesting for ambitious fine-tuning because Cohere pushed much of the model’s parameter budget there, making it a strong transcription encoder that could potentially be extended to related tasks.

Tiny Aya is a family of multilingual LLMs. Mack described it as a 3.3B family, while a later slide and guide screenshot described Tiny Aya as 3.8B; both remain far below the hackathon limit. The family supports around 70 languages and includes five variants. base is the pretrained model. global has the broadest language coverage and is the default to try first. earth is best for West Asian and African languages. fire is best for South Asian languages. water is best for European and Asia Pacific languages.

The model family has GGUF quantizations available and is small enough to run locally on phones and in browsers. Mack framed it as a particularly good choice for languages that are not well served by commonly used LLMs. The core strengths he named were multilingual text generation, conversational AI, summarization, translation, and cross-lingual tasks — especially prompts that require understanding multiple languages at once.

Cohere has a guide for Build Small participants covering Tiny Aya and Cohere Transcribe. The screenshot described them as a good fit for local multilingual assistants, voice interfaces, accessibility tools, offline translation helpers, and small apps for real people. Mack said he would be in Discord for Cohere Transcribe, while Alejandro and Saurabh would support Tiny Aya.

Sharma added that Cohere’s $5,000 sponsorship goes into the general prize pool and again connected Cohere’s models to the badge system: fine-tuning, Modal use, and quantization through llama.cpp can stack into additional recognition and prize chances.

The final submission is a runnable Space with evidence that it works

Across the sponsors, the repeated pattern was not “use a small model” in the abstract. Each sponsor tried to reduce the path from model choice to working demo: a starter Space for FLUX.2 [klein], free APIs and deployment snippets for MiniCPM, Codex credits and commit attribution for OpenAI’s category, Nemotron model maps and GitHub cookbooks, Modal credits and examples, Mellum 2 weights and Discord access, Cohere guides and quantized multilingual models, and Gradio’s new workflow layer for assembling pipelines.

The judging mechanics reinforce that Build Small is not just a benchmark contest. A submission must be hosted under the hackathon organization, implemented as a Gradio Space, documented in the README, and accompanied by a demo video and social post. Sponsor-specific categories add their own evidence requirements: MiniCPM for the OpenBMB prizes, Nemotron for NVIDIA’s RTX 5080 awards, Modal usage disclosed in the README for the Modal category, and Codex-attributed commits in a public GitHub repository for OpenAI’s prize.

Sharma ended by directing participants to the landing page and Discord for questions about credits, joining deadlines, registration deadlines, and prize details. He emphasized that the hackathon organization is the event homepage and that participants need to be members of that organization to compete for the prizes. Registration had already closed by the kickoff’s end, and he warned that the organization itself might close soon as well.

AI Application Architecture Data and Training Inference and Deployment Voice and Audio AI Agents and Autonomy Multimodal AI Open Models Image and Video Generation Model Releases Coding Assistants