MiniCPM-V 2.6 Runs at 18 Tokens per Second on iPhone

Hugging Face | Wednesday, June 10, 2026 | 6 min read

OpenBMB used its Build Small hackathon session to argue that small models are valuable when they can be deployed where applications and data already live: on phones, laptops, mobile apps and edge devices. Its main example was MiniCPM-V 2.6, a vision-language model shown running on an iPhone 15 Pro at 18 tokens per second with llama.cpp and 4-bit quantization. The broader claim was that compact, open models paired with existing runtimes can expand access, reduce cloud dependence, and improve privacy and latency for local AI use cases.

OpenBMB’s practical claim starts with an iPhone speed test

OpenBMB’s case for building small models is practical: compact models matter when they can run on the devices people already use, close to the data and the application. The most concrete example was MiniCPM-V 2.6, OpenBMB’s vision-language model, shown running on an iPhone 15 Pro at 18 tokens per second using llama.cpp and 4-bit quantization. The speaker described that rate as faster than most humans read, and therefore suitable for user interaction.
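The reading-speed comparison can be checked with rough arithmetic. The sketch below assumes about 0.75 English words per token and a silent reading rate of about 250 words per minute. Both figures are rule-of-thumb assumptions, not numbers from the talk, and the real ratio depends on the tokenizer and the language.

```python
# Rough check of the "faster than most humans read" claim.
# ASSUMPTIONS (not from the talk): ~0.75 English words per token and
# a typical silent reading speed of ~250 words per minute.
WORDS_PER_TOKEN = 0.75
HUMAN_READING_WPM = 250.0


def tokens_per_sec_to_wpm(tokens_per_sec: float,
                          words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a generation rate in tokens/s to approximate words/minute."""
    return tokens_per_sec * words_per_token * 60.0


iphone_wpm = tokens_per_sec_to_wpm(18.0)  # rate reported for the iPhone 15 Pro
ratio = iphone_wpm / HUMAN_READING_WPM
print(f"~{iphone_wpm:.0f} words/min, about {ratio:.1f}x the assumed reading speed")
```

Under these assumptions, 18 tokens per second works out to roughly 810 words per minute, a little over three times the assumed reading rate.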

18 tokens/s

MiniCPM-V 2.6 generation speed shown on an iPhone 15 Pro using llama.cpp and 4-bit quantization

That result was part of a broader deployment argument. OpenBMB presented the MiniCPM family not only as a set of small-model releases, but as a group of models meant to be deployed across servers, laptops, phones, and edge systems. The emphasis was not simply model size. It was whether a model can be paired with existing runtimes and shipped into the environments where users, applications, robots, or mobile devices actually operate.

The speaker emphasized that users do not need to compile low-level C++ themselves to run these models on-device. The available deployment tools vary by target. For server-side deployment, the speaker pointed to vLLM and LMDeploy. For MacBook or PC use, the strongest recommendation was llama.cpp and Ollama, described as user-friendly and well optimized for CPU and Apple Silicon. For iOS and Android apps, OpenBMB pointed to mobile-specific frameworks such as Alibaba’s MNN and MLC-LLM.

A deployment slide organized the runtime choices by platform. It listed MLX for Apple Silicon, MLC-LLM and TVM for Android and web targets, TensorRT for PC, Alibaba’s MNN, and Tencent’s NCNN, alongside cross-platform options including llama.cpp, vLLM, Ollama, and LMDeploy. The point was not that every runtime serves the same job. It was that current small-model deployment has enough tooling to match different environments: server inference, laptop use, mobile apps, web targets, and device-local execution.

For OpenBMB, that tooling is part of why small models are worth building. A model that fits a phone, tablet, or laptop can be paired with runtimes that already support those devices. The practical question becomes which runtime fits the target platform and application.
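As a minimal sketch, the speaker's per-target recommendations can be written as a lookup table. The target labels and the helper name below are hypothetical and exist only for illustration; the runtime names are the ones mentioned in the talk.

```python
# Runtime suggestions as summarized from the talk. The dictionary keys and
# the helper name are hypothetical and exist only for this illustration.
RUNTIMES_BY_TARGET = {
    "server": ["vLLM", "LMDeploy"],
    "laptop": ["llama.cpp", "Ollama"],   # MacBook or PC
    "mobile_app": ["MNN", "MLC-LLM"],    # iOS and Android apps
}


def suggest_runtimes(target: str) -> list[str]:
    """Return the runtimes the speaker pointed to for a deployment target."""
    if target not in RUNTIMES_BY_TARGET:
        known = ", ".join(sorted(RUNTIMES_BY_TARGET))
        raise ValueError(f"unknown target {target!r}; expected one of: {known}")
    # Return a copy so callers cannot mutate the shared table.
    return list(RUNTIMES_BY_TARGET[target])


print(suggest_runtimes("laptop"))
```

A lookup like this only narrows the candidates. Choosing among them still depends on the application, the model format, and the hardware on hand.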

MiniCPM-V 2.6 was shown running at interactive speeds on Apple devices

OpenBMB used MiniCPM-V 2.6 as its example of on-device viability. With llama.cpp and 4-bit quantization, the model was shown running across Apple devices, including iPads, an iPhone, a Mac Studio, and a MacBook Pro. OpenBMB presented these as its llama.cpp test results for the model under that configuration.

Device	Memory	MiniCPM-V 2.6 speed
iPad Pro M4	8GB	15 tokens/s
iPad Pro M2	8GB	11 tokens/s
iPhone 15 Pro	8GB	18 tokens/s
Mac Studio M2 Max	32GB	31 tokens/s
MacBook Pro M3 Max	128GB	32 tokens/s

OpenBMB’s llama.cpp test of MiniCPM-V 2.6 with 4-bit quantization across Apple devices

The iPhone 15 Pro result was the figure the speaker singled out: 18 tokens per second. With 4-bit quantization and Apple Silicon optimization through llama.cpp, MiniCPM-V 2.6 was presented as deployable on a phone-class device at a usable generation rate.
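A back-of-envelope estimate suggests why 4-bit quantization matters on an 8GB device. The sketch below assumes roughly 8 billion parameters for the model, a figure taken from the model's public description rather than from the talk. It counts weight storage only. Real quantized files use somewhat more than the nominal bit width, and inference also needs memory for the KV cache and activations.

```python
# Back-of-envelope weight footprint under quantization.
# ASSUMPTION (not from the talk): roughly 8 billion parameters in total.
# Real quantized files carry per-block scales, so they use somewhat more than
# the nominal bit width, and inference also needs memory for the KV cache.
PARAMS = 8e9


def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * bits_per_weight / 8 / 2**30


fp16 = weight_gib(PARAMS, 16)
q4 = weight_gib(PARAMS, 4)
print(f"16-bit: ~{fp16:.1f} GiB, 4-bit: ~{q4:.1f} GiB")
```

Under these assumptions, the 16-bit weights alone would exceed the 8GB of memory on the phone and the iPads in the table, while the 4-bit weights come to under 4 GiB.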

The rest of the device results give the range shown in the presentation. On an iPad Pro M2, the model was shown at 11 tokens per second; on an iPad Pro M4, 15 tokens per second; on a Mac Studio M2 Max, 31 tokens per second; and on a MacBook Pro M3 Max, 32 tokens per second. The spread supported the deployment argument: the same vision-language model can move from phone to tablet to desktop-class Apple hardware while remaining interactive under the tested setup.
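To make the spread concrete, the reported rates can be turned into approximate wait times. The sketch below uses the speeds from the table and a hypothetical 100-token reply. It covers decode time only and ignores prompt and image processing, which the talk's figures do not break out.

```python
# Generation speeds (tokens/s) from OpenBMB's llama.cpp test table.
SPEEDS = {
    "iPad Pro M4": 15,
    "iPad Pro M2": 11,
    "iPhone 15 Pro": 18,
    "Mac Studio M2 Max": 31,
    "MacBook Pro M3 Max": 32,
}

REPLY_TOKENS = 100  # hypothetical reply length chosen for illustration


def seconds_to_generate(tokens: int, tokens_per_sec: float) -> float:
    """Decode time only; ignores prompt and image processing."""
    return tokens / tokens_per_sec


# Print devices from slowest to fastest.
for device, tps in sorted(SPEEDS.items(), key=lambda kv: kv[1]):
    print(f"{device:<20} {seconds_to_generate(REPLY_TOKENS, tps):5.1f} s")
```

For a reply of that length, the slowest device in the table needs about 9 seconds and the fastest about 3 seconds.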

OpenBMB wants edge AI to be usable without writing code

Deployment was not limited to developer runtimes. OpenBMB also highlighted a collaboration with Alibaba’s MNN team on ChatMini, an application described on the slide as a way to experience “Edge AI Out of the Box.” The slide listed MiniCPM-V 2.6 and MiniCPM3-4B as supported models and showed the app as available for Android, with iOS “coming soon.”

Frameworks such as llama.cpp, Ollama, MNN, and MLC-LLM make small models accessible to developers. ChatMini is meant to remove even that setup burden for users. According to the speaker, a user can download the app, let it download the model, and start using OpenBMB’s newest text and vision-language models without writing code.

That no-code layer reinforces the deployment claim. Compact models are more useful when they can be embedded in mobile apps, run on laptops, and be packaged in forms that let users experience local inference directly rather than setting up a custom deployment workflow.

The case for building small is learning, access, privacy, and latency

The answer to “why build small?” had four parts: small models are easier to learn from, they broaden access to AI, they support privacy and security through local execution, and they enable edge-device use cases where latency or connectivity constraints matter.

First, small models were presented as a learning tool. Because they are small, iteration is faster. Because they do not necessarily have very long context windows or complex architectures such as mixture-of-experts, the code and architecture are easier to understand. The speaker also emphasized that small models can be fine-tuned on cheaper GPUs. For people trying to understand model training and adaptation, small models lower both infrastructure cost and conceptual overhead.

Second, the access argument is explicitly about hardware. Not everyone has access to H100s or RTX 4090s; many people do have a MacBook or an iPhone. The speaker’s framing was that compact models let users run powerful, state-of-the-art language models on their own devices rather than depending on paid access to OpenAI or other large companies for intelligence.

Third, local execution changes the privacy and security tradeoff. If a model runs completely on-device, data does not need to leave the device. The speaker connected that to individuals using personal information and to companies processing data inside their own pipelines.

In the speaker’s words: “The model runs completely local. Data doesn’t need to leave your device, so you can feed it with personal information.”

Fourth, small models were tied to embodied and wearable computing. Robots cannot be assumed to have a stable, fast network. For safety and latency reasons, a robot may need to make local decisions. The same logic was applied to AI pins, wearable devices, and other edge products that need quick responses or simple reasoning. In those environments, a small model is not merely a cheaper substitute for a cloud model; it fits the operating constraints.

MiniCPM is presented as a compact family spanning text, vision, code, and math

OpenBMB summarized MiniCPM as a model family that has evolved through several generations, with work spanning text, code, math, mixture-of-experts, and vision-language understanding. The summary slide described the MiniCPM-V series as “best in class” for vision-language understanding and MiniCPM 3.0 as “best in class” for 4B text, code, and math capabilities. Earlier MiniCPM 1.0 and 2.0 series models were described as covering text, code, math, and MoE work.

The speaker said the MiniCPM family had reached a third generation, alongside version 2.6 of the vision-language model, and that the models discussed were open source and available on Hugging Face. OpenBMB pointed users to the OpenBMB Hugging Face organization at huggingface.co/openbmb and GitHub at github.com/OpenBMB.

The name MiniCPM has an older history. Asked what CPM stands for, the speaker explained that it originally meant “Chinese Pre-trained Model.” CPM 1.0, the first release, launched around 2020, when the team focused on pre-training over Chinese corpora. The name stuck even though, according to the speaker, the models now have strong capabilities in English and other languages.

That origin explains the acronym, while the current presentation of MiniCPM is broader than the original Chinese pretraining focus. In this talk, MiniCPM was positioned as a compact, open-source model family intended for deployment across local devices, mobile apps, and edge systems.
