    Researchers say they trained a foundation model from scratch for about $1,500

    By admin | June 11, 2026 | 9 min read

    Training a foundation LLM from scratch costs millions and requires internet-scale data — which is why most enterprises don’t bother. Sapient thinks it has a cheaper path.

    To sidestep brute-force scaling, researchers at Sapient developed HRM-Text, which replaces the standard Transformer with a highly sample-efficient Hierarchical Reasoning Model (HRM), an architecture they first introduced last year.

    HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. Instead of brute-force autoregressive prediction on raw text, HRM-Text trains exclusively on instruction-response pairs. This format mirrors real-world enterprise use, where users usually expect a targeted answer to a specific task.

    The researchers trained a 1B-parameter HRM-Text from scratch at a fraction of the cost, and with a fraction of the training tokens, of conventional LLMs. Their model achieved performance competitive with much larger open models on key industry benchmarks.

    For real-world AI applications, this means foundational pretraining is no longer restricted to highly resourced institutions. With HRM-Text, organizations can affordably pretrain their own highly capable reasoning models from scratch and pair them with external knowledge stores.

    The training bottleneck

    When we train an LLM, we don’t actually care if it has memorized the exact sequence of words in a random 2014 Reddit thread. What we want is for the model to develop a deep, underlying understanding of human language, logic, facts, and reasoning.

    The current approach is brute force: scrape the internet, run next-token prediction trillions of times, and assume the model has developed a working internal model of the world.

    In effect, the industry spends millions of dollars of compute forcing models to memorize everything collected from the internet, just so they can indirectly learn how to think. For example, standard decoder-only models spend valuable compute assigning loss to reconstructing the prompt itself, even though the user’s prompt is already known and provided at inference time.

    Instead of simply viewing this as a computational hurdle, the industry must recognize it as a severe business limitation. In comments provided to VentureBeat, Guan Wang, CEO of Sapient Intelligence, framed this as an issue of the “economics of iteration.”

    “Enterprises today face three compounding problems: training is expensive, infrastructure is heavy, and experimentation cycles are too slow,” Wang said. “The industry’s scaling addiction says: ‘When the model fails, make it bigger. Add more data. Add more GPUs.’ That has worked, but it is reaching a point of diminishing returns. More scale often means more memorization, more latency, more infrastructure, and more vendor dependency. It does not necessarily give an enterprise a better reasoning engine.”

    This architectural and computational inefficiency is exactly why fine-tuning existing dense Transformers isn’t always the silver bullet for enterprises. Preserving a model’s general capabilities during fine-tuning often requires mixing substantial general-purpose data into the process, which makes it computationally heavy and difficult to control.

    “Imagine a hedge fund, insurer, or bank that has highly proprietary data: internal research notes, transaction logic, compliance rules, analyst memos, risk models, portfolio constraints,” Wang said. “They may not want to send that data to an external frontier model, and they may not need a giant general-purpose model that memorized the internet. What they need is a compact reasoning core that can learn their task structure, reason across rules and numbers, and run in a controlled environment.”

    Because HRM-Text focuses its computation strictly on task completion and latent reasoning, it allows enterprises to start with a smaller, smarter model and adapt it to a proprietary domain with far less infrastructure.

    Rethinking architectures with HRM-Text

    HRM, which was introduced in 2025, represents a fundamental departure from traditional Transformer models. To build a more sample-efficient engine, HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. The fast L-module performs local iterative refinement, while the slow H-module maintains stable semantic context across cycles. Processing consists of two high-level cycles, where each cycle executes three fast L-module updates followed by a single slow H-module update.


    Hierarchical reasoning model (HRM) (source: arXiv)
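
    The update order described above (two high-level cycles, each running three fast L-module updates followed by one slow H-module update) can be sketched as a nested loop. This is a minimal illustration, not Sapient's implementation: the function names, the scalar states, and the toy update rules are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def hrm_schedule(high_cycles: int = 2, low_steps: int = 3) -> List[str]:
    """Return the order of module updates: `low_steps` L updates, then one H update, per cycle."""
    order: List[str] = []
    for _ in range(high_cycles):
        order.extend(["L"] * low_steps)
        order.append("H")
    return order


def run_hrm(
    x: float,
    l_update: Callable[[float, float, float], float],
    h_update: Callable[[float, float], float],
    high_cycles: int = 2,
    low_steps: int = 3,
) -> Tuple[float, float]:
    """Run the nested loop with caller-supplied (hypothetical) update functions."""
    z_h, z_l = 0.0, 0.0  # slow (H) and fast (L) states; scalars stand in for hidden vectors
    for _ in range(high_cycles):
        for _ in range(low_steps):
            # Fast module: refines its state using the current slow state and the input.
            z_l = l_update(z_l, z_h, x)
        # Slow module: updates once per cycle from the refined fast state.
        z_h = h_update(z_h, z_l)
    return z_h, z_l


if __name__ == "__main__":
    print(hrm_schedule())
    # Toy update rules, chosen only so the loop has something to compute.
    print(run_hrm(1.0,
                  lambda z_l, z_h, x: 0.5 * z_l + 0.5 * (z_h + x),
                  lambda z_h, z_l: 0.5 * z_h + 0.5 * z_l))
```

    With the defaults, the fast state is updated six times and the slow state twice in a single forward pass.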

    Standard parameter-shared recurrent architectures (like Samsung’s TRM) can sometimes handle small logic puzzles, but the Sapient researchers found they become highly unstable when scaled to 1 billion parameters for language tasks. The separation between HRM’s slow H-module and fast L-module is mathematically necessary, not just an aesthetic choice. As Wang said: “For logic grids, you can sometimes get away with a tiny recursive mechanism because the world is clean and bounded. Language is not like that. Language needs both fast local refinement and slow semantic stability.”

    The original HRM proved highly effective on controlled, symbolic reasoning problems, but the researchers hit a wall when they applied it to open-ended language modeling. The same recurrent loops that make HRM an efficient reasoner also make it unstable to train on diverse natural language: repeatedly passing signals through the loop causes gradients to explode or vanish.


    HRM-Text architecture (source: Sapient Inc.)

    To tame this instability, the researchers introduced two key innovations in HRM-Text. First, they developed MagicNorm, a normalization technique designed to keep the internal signals stable no matter how many times the model loops its thought process.

    Second, they designed a warm-up method to stabilize training. During early training, the model is only evaluated on short, shallow reasoning loops. As training progresses, the system warms up, gradually giving the model deeper and longer reasoning sequences.
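
    The article does not give the actual warm-up schedule, so the following is only a sketch of the general idea: start with shallow loops and ramp the depth up as training progresses. The linear ramp, the function name `loop_depth`, and every parameter value are assumptions chosen for illustration.

```python
def loop_depth(step: int, warmup_steps: int, min_depth: int = 1, max_depth: int = 4) -> int:
    """Return the recurrence depth to use at a given training step.

    Hypothetical linear ramp: depth starts at `min_depth` and reaches
    `max_depth` by the end of the warm-up window, then stays there.
    """
    if warmup_steps <= 0:
        raise ValueError("warmup_steps must be positive")
    if step >= warmup_steps:
        return max_depth
    frac = max(step, 0) / warmup_steps  # progress through warm-up, in [0, 1)
    depth = min_depth + int(frac * (max_depth - min_depth + 1))
    return min(depth, max_depth)


if __name__ == "__main__":
    for s in (0, 250, 500, 750, 1000):
        print(s, loop_depth(s, warmup_steps=1000))
```

    A step-wise ramp like this keeps early gradients flowing through only a few loop iterations; a real schedule could just as well be exponential or tied to the loss.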

    They also switched the training objective from next-token prediction to task completion, where the model is rewarded only on the full response as opposed to individual tokens it generates. To achieve this goal, they changed the training data of HRM-Text from raw text to instruction-response pairs only.
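
    One common way to score a model on the response alone is to mask prompt positions out of the loss. Whether HRM-Text implements its task-completion objective exactly this way is an assumption; the sketch below only shows the masking idea, using made-up per-token loss values and hypothetical helper names.

```python
from typing import List


def response_mask(prompt_len: int, total_len: int) -> List[int]:
    """Return 0 for prompt positions and 1 for response positions."""
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def masked_mean_loss(token_losses: List[float], mask: List[int]) -> float:
    """Average the per-token losses over positions where the mask is 1."""
    if len(token_losses) != len(mask):
        raise ValueError("token_losses and mask must have the same length")
    kept = [loss for loss, m in zip(token_losses, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Illustrative numbers: two prompt tokens followed by two response tokens.
    losses = [4.0, 4.0, 1.0, 3.0]
    mask = response_mask(prompt_len=2, total_len=4)
    print("response-only loss:", masked_mean_loss(losses, mask))
    print("all-token loss:", masked_mean_loss(losses, [1] * 4))
```

    In this toy case the prompt tokens would dominate an all-token average, while the masked version spends the training signal only on the answer.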

    HRM-Text in action

    The researchers built a highly compact 1-billion-parameter HRM-Text model. Instead of using the standard multi-stage pipeline that requires churning through trillions of words of raw internet text, they trained it from scratch on a tightly curated dataset of just 40 billion tokens. The training data consisted entirely of instruction-response pairs across general instructions, math, symbolic logic, textbook exercises, and rewritten knowledge.

    They trained the model using the task-completion objective. To force the model to rely on its internal hierarchical architecture rather than copying step-by-step logic, they explicitly stripped out “thinking” tokens from the training data.

    The model was evaluated across a diverse suite of standard foundation-model benchmarks, weighted toward knowledge, reasoning, logic, math, and comprehension. The researchers tested HRM-Text against both small models and highly resourced open-weight and fully open models.

    The results show a significant shift in the compute-to-performance frontier. The 1B-parameter HRM-Text achieved 60.7% on MMLU, 84.5% on GSM8K, and 56.2% on MATH. This performance is highly competitive with (and in several cases surpasses) the 2B to 7B parameter foundation models it was tested against.


    HRM-Text performance (source: arXiv)

    The most important takeaway for the enterprise audience lies in the efficiency statistics and their practical implications. Pretraining a foundation model from scratch is typically a multimillion-dollar endeavor reserved for tech giants. HRM-Text was trained in just 1.9 days on a cluster of 16 GPUs. The total estimated compute cost was roughly $1,500. It achieved its competitive scores using 100 to 900 times fewer training tokens and 96 to 432 times less estimated compute than models like Qwen, Gemma, and Llama.
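
    The figures above can be cross-checked with some quick arithmetic. The derived numbers (GPU-hours, an implied hourly rate, and the token budgets of the comparison models) are not stated in the article; they follow only from the quoted inputs, under the assumption that the $1,500 estimate is spread evenly over the GPU time.

```python
from typing import Dict


def training_economics(
    gpus: int = 16,
    days: float = 1.9,
    cost_usd: float = 1500.0,
    tokens: int = 40_000_000_000,
) -> Dict[str, float]:
    """Derive back-of-the-envelope figures from the numbers quoted in the article."""
    gpu_hours = gpus * days * 24
    return {
        "gpu_hours": gpu_hours,                    # about 730 GPU-hours
        "usd_per_gpu_hour": cost_usd / gpu_hours,  # about $2.06, if cost is all GPU time
        "baseline_tokens_low": tokens * 100,       # "100 times fewer" implies 4 trillion
        "baseline_tokens_high": tokens * 900,      # "900 times fewer" implies 36 trillion
    }


if __name__ == "__main__":
    for name, value in training_economics().items():
        print(f"{name}: {value:,.2f}")
```

    The implied comparison range of roughly 4 to 36 trillion tokens is consistent with the article's description of conventional pipelines churning through trillions of words.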

    Another important point is the decoupling of reasoning from knowledge memorization. HRM-Text’s success on reasoning-heavy tasks despite its small 40B-token training diet suggests that a model does not need to memorize the entire internet to become a capable reasoning engine.

    For enterprise applications, this behavior is a feature, not a bug. The researchers suggest a future where businesses deploy highly compact, incredibly cheap recurrent models that act as the “reasoning core” specialized for business logic. Instead of forcing the model to memorize company databases during pretraining, the model acts as the reasoning engine, relying on external retrieval systems to fetch factual knowledge.

    Critics have pointed out that training on instruction-response pairs makes comparisons against models trained on raw text an “apples-to-oranges” scenario. Wang pushes back on this framing, pointing out that every serious modern LLM sees instruction-response data during training or alignment. “So the comparison is not apples-to-oranges. It is closer to apple cores-and-apples. We started directly from the core task format because that is how people actually use models: they give an instruction and expect a useful response,” he said.

    The researchers also ran rigorous contamination tests to ensure the model wasn’t simply memorizing benchmark answers. On DROP, the one benchmark showing a marginal contamination signal under a specific setting, HRM-Text still scored an impressive 81.1% on a strictly clean, 0% contamination subset.

    Ultimately, Wang argues that for enterprises, “the right evaluation is not trivia recall. It is a workflow evaluation… Give HRM-Text a task like: multi-step financial reasoning, compliance logic, scientific workflow automation, structured extraction followed by reasoning.”

    Practical implementation and the future of enterprise AI

    While the benchmark scores and cost efficiencies are striking, Sapient is clear about the model’s current boundaries. The initial release is best viewed as a proof-of-concept, akin to early GPT releases, designed to showcase the architecture’s unique advantages.

    “Honestly, HRM-Text is not yet a plug-and-play ChatGPT replacement,” Wang said. “It is a compact foundation language reasoning model. For an enterprise engineering team, the operational work is mainly around templates, mode selection, attention masking, and alignment.”

    For AI engineering teams looking to experiment, getting started requires some specific, but standard, text-generation discipline. The model lists native support in the Transformers library (requiring transformers >= 5.9.0), and usage paths for vLLM and SGLang are actively being developed. The primary engineering task involves managing the PrefixLM design: production multi-turn chat applications will require careful KV-cache logic to ensure user prompts receive full bidirectional attention while the assistant’s outputs remain causal.
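
    The PrefixLM attention pattern described above can be illustrated with a small mask builder: prompt positions attend to the whole prompt in both directions, while response positions see the prompt plus earlier response tokens only. This is a single-turn sketch in plain Python with a hypothetical function name; it is not the model's actual masking code and does not address multi-turn KV-cache handling.

```python
from typing import List


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """Build an attention mask where mask[i][j] is True if position i may attend to j."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask: List[List[bool]] = []
    for i in range(total_len):
        row: List[bool] = []
        for j in range(total_len):
            if j < prefix_len:
                allowed = True    # every position can see the full prompt
            else:
                allowed = j <= i  # response tokens are visible causally only
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two prompt tokens followed by two response tokens.
    for row in prefix_lm_mask(prefix_len=2, total_len=4):
        print("".join("x" if allowed else "." for allowed in row))
```

    In a multi-turn chat, each new user message extends the bidirectional region, which is why cached keys and values from earlier turns need careful handling.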

    “When the cost of training a capable reasoning model drops to around $1,500, AI stops being only an infrastructure question and becomes a strategy question,” Wang said. “A Fortune 500 company no longer has to ask, ‘Can we afford a foundation model?’ It would ask, ‘What should our model know about our business, and what kind of reasoning should it be optimized for?’”
