Google’s latest iteration of its open-model architecture, Gemma 4, is proving capable of handling complex agentic coding tasks locally. Developer Daniel Vaughan recently tested the model in the Codex CLI, aiming to determine if it could serve as a viable, private alternative to cloud-based models like GPT-5.4.
For the test, Vaughan used two distinct hardware configurations. His first setup was a MacBook Pro with an M4 Pro chip and 24 GB of RAM, running the 26B Mixture-of-Experts (MoE) variant via llama.cpp. His second was a Dell Pro Max GB10 with an NVIDIA Blackwell chip and 128 GB of unified memory, running the 31B dense variant via Ollama v0.20.5.
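A local setup of this kind typically exposes the model over an OpenAI-compatible HTTP endpoint that a coding agent can target. The sketch below uses llama.cpp's `llama-server`; the GGUF filename is a placeholder, and the exact setup Vaughan used is not described in detail.

```shell
# Serve a local GGUF model over llama.cpp's OpenAI-compatible HTTP server.
# The model filename is a placeholder; substitute whichever quantized
# Gemma build you have downloaded.
llama-server --model gemma-4-26b-moe-q4_k_m.gguf --port 8080

# In another terminal: confirm the endpoint is up before pointing an
# agent such as the Codex CLI at http://localhost:8080/v1.
curl http://localhost:8080/v1/models
```

Once the server responds, any client that speaks the OpenAI chat API can send requests to it instead of a cloud endpoint.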
Vaughan’s primary motivation for the shift was to address the rising costs of API usage and to resolve privacy concerns regarding sensitive codebases. He noted that cloud-based models often present issues with throttling and price volatility, making local execution a more resilient choice for day-to-day work.
Overcoming tool-calling limitations
Previous Gemma releases were unsuitable as a foundation for agentic coding because of poor tool-calling accuracy. According to benchmarks, earlier models scored only 6.6 percent on the tau2-bench function-calling test. The Gemma 4 31B model, however, has improved dramatically, scoring 86.4 percent on the same benchmark.
"Gemma 4 31B scores 86.4 per cent on the same benchmark. That is what made this test worth running," Vaughan wrote. This capability allows the model to reliably read files, write code, and apply patches without needing to send requests to an external server.
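In practice, "tool calling" means the model emits structured function calls that the agent executes locally. The sketch below is illustrative only, not Vaughan's harness or the Codex CLI's actual schemas: it shows a minimal OpenAI-style tool definition and a dispatcher that runs a model-emitted `read_file` call against the local workspace.

```python
import json
from pathlib import Path

# Hypothetical tool schema in the OpenAI function-calling format that
# agent frontends send alongside each chat request. The real Codex CLI
# tool set is richer (patching, shell execution, etc.).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Return the contents of a file in the workspace.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
]

def dispatch(tool_call: dict, workspace: Path) -> str:
    """Execute one model-emitted tool call against the local workspace."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "read_file":
        return (workspace / args["path"]).read_text()
    raise ValueError(f"unknown tool: {name}")
```

Because both the model and the tool execution run on the same machine, no file contents ever leave the workstation; the benchmark improvement matters because the whole loop breaks down if the model emits malformed calls.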
The transition to local hardware was not without obstacles. Vaughan reported that initial attempts were hindered by software bugs, specifically within the Ollama streaming process. He found that v0.20.3 incorrectly routed tool-call responses to the reasoning output rather than the tool-call field. These challenges required a full day of debugging to resolve before the model could function effectively as a coding agent.
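A misrouting bug of this kind can sometimes be worked around client-side while waiting for an upstream fix. The sketch below is a hypothetical illustration, not Vaughan's actual fix: it assumes an Ollama-style chat message where tool-call JSON has leaked into the reasoning (`thinking`) field, and falls back to scanning that text for a function-call-shaped object.

```python
import json
import re

def recover_tool_calls(message: dict) -> list:
    """Return tool calls from a chat message, falling back to any
    tool-call-shaped JSON that was misrouted into the reasoning text.

    The field names ("tool_calls", "thinking") follow Ollama's chat API;
    the fallback heuristic itself is illustrative.
    """
    if message.get("tool_calls"):
        return message["tool_calls"]
    # Fallback: look in the reasoning text for a JSON object that
    # resembles a function call, e.g. {"name": ..., "arguments": {...}}.
    reasoning = message.get("thinking", "")
    for candidate in re.findall(r"\{.*\}", reasoning, re.DOTALL):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if "name" in obj and "arguments" in obj:
            return [{"function": obj}]
    return []
```

Heuristics like this are brittle, which is why Vaughan's day of debugging, and ultimately an updated Ollama release, were needed before the setup was dependable.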
Vaughan’s findings suggest that while local inference requires more setup time, the model quality is sufficient to compete with cloud-based alternatives for professional coding tasks. By moving the workload to local hardware, developers can maintain control over their data while mitigating the ongoing costs associated with high-frequency API calls.