Apple has unveiled an on-device AI architecture that removes the long-standing requirement that a model's weights fit entirely in Dynamic Random Access Memory (DRAM). The design, showcased with the AFM 3 Core Advanced model at WWDC26, keeps up to 20 billion parameters in flash memory instead of RAM, potentially changing how enterprises deploy AI agents.
Breaking the Memory Barrier
The parameter count for on-device AI models has historically been limited by the necessity to fit entirely within DRAM. This restriction has constrained the capabilities of local deployments, forcing enterprises to choose between powerful cloud-dependent models and less capable on-device alternatives. By shifting the weight set away from DRAM, Apple’s new architecture creates numerous opportunities for on-device AI applications, particularly in sectors that prioritize data privacy and compliance.
"Instead of forcing the entire model into DRAM, the full model is stored in flash memory," Apple’s research team explained, marking a significant shift in how AI models can be built and implemented. The AFM 3 Core Advanced model uses a method called Instruction-Following Pruning (IFP), where flash memory acts as the primary storage location for parameters, while DRAM serves as a temporary buffer for active tasks.
Innovative Routing Mechanism
The AFM 3 Core Advanced architecture features a novel routing mechanism that makes its routing decision once per prompt rather than once per token. This design choice works around limited NAND-to-DRAM bandwidth, which cannot keep up with the rapid weight swapping that standard Mixture of Experts (MoE) models require. Instead, the model predicts from the incoming prompt which experts to load into RAM, then generates every token using that fixed expert set.
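To make the difference between per-prompt and per-token routing concrete, here is a minimal Python sketch. It is not Apple's implementation: the keyword-based router, the expert names, and the `ExpertCache` class are all hypothetical stand-ins, and the only thing the sketch measures is how many times an expert has to be pulled from slow storage under each scheme.

```python
def score_experts(prompt, expert_keywords):
    """Toy router (an assumption): score experts by keyword overlap with the prompt."""
    words = set(prompt.lower().split())
    return {name: len(words & kws) for name, kws in expert_keywords.items()}


def select_experts(prompt, expert_keywords, k):
    """Pick the k highest-scoring experts once, using the prompt alone."""
    scores = score_experts(prompt, expert_keywords)
    # Sort by descending score; break ties alphabetically so the result is stable.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k]


class ExpertCache:
    """Stands in for DRAM: tracks resident experts and counts loads from flash."""

    def __init__(self):
        self.resident = set()
        self.loads = 0

    def ensure(self, experts):
        needed = set(experts)
        # Every expert that is needed but not resident costs one load.
        self.loads += len(needed - self.resident)
        self.resident = needed


def prompt_level_loads(prompt, expert_keywords, k, num_tokens):
    """Choose experts once per prompt, then reuse them for every generated token."""
    cache = ExpertCache()
    chosen = select_experts(prompt, expert_keywords, k)
    for _ in range(num_tokens):
        cache.ensure(chosen)  # same set each step, so only the first call loads
    return cache.loads


def token_level_loads(per_token_choices):
    """Conventional MoE-style routing: the expert set may change at every token."""
    cache = ExpertCache()
    for experts in per_token_choices:
        cache.ensure(experts)
    return cache.loads


# Hypothetical experts and keywords, for illustration only.
EXPERTS = {
    "code": {"python", "function", "bug"},
    "legal": {"contract", "clause"},
    "math": {"integral", "prove"},
    "chat": {"hello", "thanks"},
}
```

With per-prompt routing, the number of loads is bounded by the number of selected experts no matter how long the response runs. With per-token routing it can grow with every generated token, which is the pattern the article says flash bandwidth cannot sustain.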
Awni Hannun, a researcher at Anthropic and former Apple scientist, remarked, "You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using pretty exotic architecture by today's standards." This prediction-and-load mechanism boosts efficiency and performance, allowing the model to scale its active parameters from 1 billion to 4 billion based on task complexity.
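The arithmetic behind Hannun's remark is easy to check. The sketch below uses the parameter counts reported above, but the bit widths (16, 8, and 4 bits per weight) are illustrative assumptions; Apple has not been quoted here on the precision it uses. The figures also cover weights only and ignore activations, caches, and runtime overhead.

```python
TOTAL_PARAMS = 20_000_000_000      # full model, per the article
ACTIVE_PARAMS_MIN = 1_000_000_000  # low end of the active range, per the article
ACTIVE_PARAMS_MAX = 4_000_000_000  # high end of the active range, per the article


def weight_gb(num_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def footprint_table(bit_widths=(16, 8, 4)):
    """Map each assumed precision to (full model, min active, max active) in GB."""
    return {
        bits: (
            weight_gb(TOTAL_PARAMS, bits),
            weight_gb(ACTIVE_PARAMS_MIN, bits),
            weight_gb(ACTIVE_PARAMS_MAX, bits),
        )
        for bits in bit_widths
    }


if __name__ == "__main__":
    for bits, (full, low, high) in footprint_table().items():
        print(f"{bits:>2}-bit: full {full:.1f} GB, active {low:.1f} to {high:.1f} GB")
```

Under these assumptions, the full weight set needs 40 GB at 16 bits and still 10 GB at 4 bits, while a 4-billion-parameter active set at 4 bits needs about 2 GB. That gap is why keeping only the active experts in DRAM matters.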
Implications for Enterprises
Apple's new architecture has significant implications for enterprise architects planning AI agent deployments. Running a 20-billion-parameter model locally lets businesses keep sensitive data on the device while still accessing advanced AI capabilities, without relying on cloud infrastructure. However, the shift brings challenges of its own, especially around deployment constraints and operational transparency.
While the technical paper detailing the memory design and activation mechanisms sheds light on the architecture, many critical metrics related to production viability—such as energy consumption and thermal performance—remain undisclosed. Marco Abis, a developer focused on local AI profiling tools, voiced concerns about this lack of information, stating, "Energy, memory bandwidth, thermal? Not in the docs. A notable gap, given those decide most of on-device performance."
Future Considerations
As enterprises assess the AFM 3 Core Advanced model, they face a crucial decision about where to draw the architectural boundary between on-device and cloud-based AI processing. Apple has not specified when an on-device request might offload to the cloud or whether that routing is visible to developers, raising potential compliance issues for organizations that must document in detail where inference takes place.
The AFM 3 Cloud Pro, which utilizes Nvidia GPUs within Google Cloud, provides a server-side option for more complex tasks, adding another layer of complexity to the architecture. Enterprises must now manage the balance between a capable local option and their ongoing dependence on cloud infrastructure for more demanding AI operations.
With Apple set to release a comprehensive technical report featuring benchmarks later this summer, the AI community is keenly watching how these advancements will impact the enterprise landscape, particularly in regulated industries. The transition from relying solely on DRAM to a hybrid model that incorporates flash memory could define the next era of on-device AI.
