Understanding the present, shaping the future.

Search
08:46 AM UTC · TUESDAY, JUNE 2, 2026 XIANDAI · Xiandai
Jun 2, 2026 · Updated 08:46 AM UTC
AI

Developers Run Google Gemma 4 Locally on Mac Using New LM Studio CLI

Developer George Liu has successfully deployed Google Gemma 4 26B locally on a MacBook Pro using the headless CLI feature in LM Studio 0.4.0, achieving efficient on-device AI inference.

Alex Chen

2 min read

Developers Run Google Gemma 4 Locally on Mac Using New LM Studio CLI
Photo: amazon.com

Developer George Liu recently demonstrated a complete workflow for running the Google Gemma 4 26B model on local hardware using the new headless CLI tool introduced in LM Studio 0.4.0. This setup not only supports local inference but also enables integration with Claude Code via the command line, offering developers a robust alternative that eliminates reliance on cloud-based APIs.

As AI use cases continue to expand, cloud APIs often come with drawbacks such as network latency, recurring costs, and potential privacy risks. By deploying models locally, developers can bypass these issues entirely, ensuring that all data remains strictly on their own devices.

Performance Advantages of the Mixture-of-Experts Architecture

The 26B-A4B model in the Google Gemma 4 series utilizes a Mixture-of-Experts (MoE) architecture. In his testing, Liu noted that while the model features 128 expert models and one shared expert, it only activates eight experts per inference—totaling approximately 3.8 billion active parameters. This design allows the model to maintain high performance while significantly lowering hardware requirements.

On a 14-inch M4 Pro MacBook Pro equipped with 48GB of unified memory, the model runs smoothly, achieving generation speeds of up to 51 tokens per second. Liu believes this architecture offers exceptional value for local inference, noting that its performance can rival models with parameter counts hundreds of times larger.

Comparative data shows that the Gemma 4 26B-A4B achieves a score of 82.6% on the MMLU Pro benchmark, trailing only slightly behind massive 31B dense models, which score 85.2%. Liu remarked, "You don't need an expensive GPU cluster to run an AI that can compete with models boasting tens of billions of parameters."

By leveraging the API interface provided by LM Studio, developers can connect local models to development tools like Claude Code. While there is a slight performance trade-off when running within the Claude Code environment, this setup provides a private, cost-free environment for high-frequency development tasks such as code review and drafting.

Comments