    On-device AI agents hit a hard memory limit. Apple’s new architecture routes around it.

    By admin | June 9, 2026 | 5 min read

    On-device AI models have stayed small because the entire weight set has to live in DRAM, capping practical parameter counts well below what server-side deployments use. Enterprise architects evaluating agentic workloads have had to choose between capable cloud-dependent models and limited on-device ones. Apple’s third-generation foundation models, announced at WWDC26, break that constraint by keeping the full weight set in flash storage and loading only the parts a prompt needs into DRAM.

    The AFM 3 family was developed in collaboration with Google and spans five models: two on-device and three server-based, all running within Apple’s Private Cloud Compute boundary. The server-side models, including AFM 3 Cloud Pro for agentic tool use and complex reasoning, run on Nvidia GPUs in Google Cloud. The on-device architecture is Apple’s own. AFM 3 Core Advanced is a 20-billion-parameter model that stores weights in NAND flash rather than DRAM.

    “Instead of forcing the entire model into DRAM, the full model is stored in flash memory,” Apple’s research team wrote. “Because NAND-to-DRAM bandwidth is too slow to swap weights token by token, as standard MoE models require, AFM 3 Core Advanced makes routing decisions per prompt.”

    How the architecture actually works

    The memory wall Apple is working around is one every local AI developer runs into.

    “You can’t put 20B parameters in RAM at any reasonable precision,” Awni Hannun, a researcher at Anthropic and former Apple research scientist, posted on X. “To make it work they are using pretty exotic architecture by today’s standards. A small model predicts from the query (or prompt) which experts to load from NAND into RAM.”

    That prediction-and-load mechanism has three distinct components, each driven by the hardware constraints of consumer silicon.

    The full 20B weight set lives in flash, not DRAM. AFM 3 Core Advanced stores its entire parameter set in NAND flash rather than active memory. Standard on-device deployments require the full model to fit in DRAM, which is what caps their parameter counts. Apple’s approach, which it calls Instruction-Following Pruning (IFP) and which its own researchers developed, treats flash as the model’s permanent home and DRAM as a working buffer for whichever experts a given prompt requires.

    Expert routing happens once per prompt, not per token. In a conventional Mixture of Experts model, a router selects different experts for every token generated — which would require continuous weight movement between flash and DRAM at inference speed. NAND-to-DRAM bandwidth cannot support that. AFM 3 Core Advanced routes once at prompt time, selects a fixed expert set, loads it into DRAM alongside always-active shared experts, and generates all tokens from that same configuration.

    “The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts,” Hannun wrote.
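The difference between the two routing schemes can be illustrated with a toy simulation that counts how many experts are copied from flash into DRAM. This is a minimal sketch, not Apple's implementation: the expert count, the top-k value, the `route` function (a deterministic stand-in for the small routing model) and the `DramBuffer` class are all hypothetical names and numbers chosen for illustration.

```python
import random

NUM_EXPERTS = 16  # hypothetical number of routed experts stored in flash
TOP_K = 4         # hypothetical number of routed experts selected at a time


def route(text: str, k: int = TOP_K) -> frozenset[int]:
    """Toy stand-in for the routing model: pick k expert ids from the text."""
    rng = random.Random(text)  # seeding with a string makes the choice repeatable
    return frozenset(rng.sample(range(NUM_EXPERTS), k))


class DramBuffer:
    """Working buffer that counts experts copied in from flash."""

    def __init__(self) -> None:
        self.resident: frozenset[int] = frozenset()
        self.loads = 0

    def ensure(self, experts: frozenset[int]) -> None:
        # Only experts that are not already resident cost a flash read.
        self.loads += len(experts - self.resident)
        self.resident = experts  # anything outside the new set is evicted


def loads_per_prompt(prompt: str, num_tokens: int) -> int:
    """Route once from the prompt, then reuse the same experts for every token."""
    dram = DramBuffer()
    dram.ensure(route(prompt))
    for _ in range(num_tokens):
        dram.ensure(dram.resident)  # decode step: expert set unchanged, no flash read
    return dram.loads


def loads_per_token(prompt: str, num_tokens: int) -> int:
    """Conventional MoE-style routing: the expert set may change at every token."""
    dram = DramBuffer()
    for step in range(num_tokens):
        dram.ensure(route(f"{prompt}#{step}"))
    return dram.loads


if __name__ == "__main__":
    prompt = "Summarize this email thread"
    print("flash loads, per-prompt routing:", loads_per_prompt(prompt, 200))
    print("flash loads, per-token routing: ", loads_per_token(prompt, 200))
```

With per-prompt routing the load count stays at `TOP_K` however many tokens are generated, while per-token routing keeps pulling experts from flash throughout decoding. The sketch leaves out the always-active shared experts, which would sit in DRAM under either scheme.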

    Figure: The AFM 3 Core Advanced model architecture. Source: Apple Machine Learning Research, June 8, 2026.

    Active parameter count scales from 1B to 4B depending on task complexity. Rather than running a fixed model size for every request, AFM 3 Core Advanced adjusts how many parameters it activates based on what the task requires — 1 billion for simpler operations, up to 4 billion for harder ones, all drawn from the 20-billion-parameter pool in flash.
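A rough calculation shows why the gap between total and active parameters matters for memory. The 20B total and the 1B to 4B active range come from the article; the bit widths below are assumptions chosen for illustration, since the article does not state the precision used, and the calculation covers weights only, ignoring activations and the KV cache.

```python
def weight_gigabytes(params: float, bits_per_weight: int) -> float:
    """Weight storage in GB (10^9 bytes) for a parameter count and precision."""
    return params * bits_per_weight / 8 / 1e9


FULL_MODEL = 20e9          # total parameters, per the article
ACTIVE_RANGE = (1e9, 4e9)  # active parameters, per the article

# Bit widths are illustrative assumptions, not published settings.
for bits in (16, 8, 4):
    full = weight_gigabytes(FULL_MODEL, bits)
    low = weight_gigabytes(ACTIVE_RANGE[0], bits)
    high = weight_gigabytes(ACTIVE_RANGE[1], bits)
    print(f"{bits:>2}-bit: full model {full:.1f} GB, active set {low:.1f} to {high:.1f} GB")
```

Under these assumptions the full weight set needs 10 GB even at 4 bits per weight, while the active set needs between 0.5 GB and 2 GB at the same precision.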

    What Apple has and hasn’t disclosed

    The architecture paper is detailed on the memory design and sparse activation mechanism. It is less forthcoming on practical deployment constraints.

    Apple’s profiling tools expose timing but not the metrics that decide production viability. “Energy, memory bandwidth, thermal? Not in the docs,” Marco Abis, who is building Ziraph, a profiler for local AI on Apple silicon, posted on X. “A notable gap, given those decide most of on-device performance.” 

    Abis also found no statement in Apple’s documentation, whether in the Core AI docs, the Foundation Models docs or the Private Cloud Compute security post, of when an on-device request transparently offloads, or whether that routing is visible to the developer or the user. For enterprises that need to document where inference runs, that is a direct compliance problem.

    Some of these details are not yet public. Apple has indicated that a full technical report with benchmarks is coming later this summer.

    What this means for enterprise architects

    Regulated industries evaluating agentic AI deployments now have a concrete architectural decision to make.

    • The DRAM wall for on-device agents just moved. Enterprises that need agents to run without a cloud round-trip now have a 20-billion-parameter local option to evaluate. The constraint shifts from model capability to device hardware.

    • The private/cloud boundary is now an architectural decision, not a default. Simpler requests stay on-device; complex agentic tasks route to AFM 3 Cloud Pro on Private Cloud Compute. Apple has not publicly specified when a request offloads or whether that routing is visible to the developer — a gap that complicates policy decisions for organizations that need to document where inference runs.

    • The agentic server tier depends on Google Cloud. AFM 3 Cloud Pro runs on Nvidia GPUs in Google Cloud. The Private Cloud Compute guarantee covers data privacy. It does not eliminate the Google Cloud dependency for server-side inference.

    AFM 3 Core Advanced gives enterprises a 20-billion-parameter on-device option that did not exist before WWDC26. Whether it is deployable at scale depends on answers Apple has not yet published. Those details are due in the summer technical report.
