Developer George Liu recently demonstrated a complete workflow for running the Google Gemma 4 26B model on local hardware using the new headless CLI tool introduced in LM Studio 0.4.0. This setup not only supports local inference but also enables integration with Claude Code via the command line, offering developers a robust alternative that eliminates reliance on cloud-based APIs.
As AI use cases continue to expand, cloud APIs often come with drawbacks such as network latency, recurring costs, and potential privacy risks. By deploying models locally, developers can bypass these issues entirely, ensuring that all data remains strictly on their own devices.
Performance Advantages of the Mixture-of-Experts Architecture
The 26B-A4B model in the Google Gemma 4 series utilizes a Mixture-of-Experts (MoE) architecture. In his testing, Liu noted that while the model features 128 expert models and one shared expert, it only activates eight experts per inference—totaling approximately 3.8 billion active parameters. This design allows the model to maintain high performance while significantly lowering hardware requirements.
On a 14-inch M4 Pro MacBook Pro equipped with 48GB of unified memory, the model runs smoothly, achieving generation speeds of up to 51 tokens per second. Liu believes this architecture offers exceptional value for local inference, noting that its performance can rival models with parameter counts hundreds of times larger.
Comparative data shows that the Gemma 4 26B-A4B achieves a score of 82.6% on the MMLU Pro benchmark, trailing only slightly behind massive 31B dense models, which score 85.2%. Liu remarked, "You don't need an expensive GPU cluster to run an AI that can compete with models boasting tens of billions of parameters."
By leveraging the API interface provided by LM Studio, developers can connect local models to development tools like Claude Code. While there is a slight performance trade-off when running within the Claude Code environment, this setup provides a private, cost-free environment for high-frequency development tasks such as code review and drafting.