    AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.

    By admin, May 14, 2026 (11-minute read)

    For decades, the IQ test has been one of the most familiar — and most contested — yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world’s most powerful language models and plotting them on a standard bell curve.

    The result is a set of interactive visualizations at aiiq.org that have ricocheted across social media in the past week, drawing praise from enterprise technologists who say the charts make an impossibly complex market legible — and sharp criticism from researchers and commentators who warn the entire framework is misleading.

    “This is super useful,” wrote Thibaut Mélen, a technology commentator, on X. “Much easier to understand model progress when it’s mapped like this instead of another giant leaderboard table.”

    Brian Vellmure, a business strategist, offered a similar endorsement: “This is helpful. Anecdotally tracks with personal experience.”

    But the backlash arrived just as quickly. “It’s nonsense. AI is far too jagged. The map is not the territory,” posted AI Deeply, an artificial intelligence commentary account, crystallizing a worry shared by many researchers: that reducing a language model’s sprawling, uneven capabilities to a single number creates a dangerous illusion of precision.

    More than 50 AI language models, plotted on a standard IQ bell curve by the site AI IQ. The most capable models crowd the right tail of the distribution. (Credit: AI IQ)

    Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works

    AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested in the early stages of several unicorns, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University.

    The site’s methodology rests on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. The composite IQ is a straight average of those four dimension scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).
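The averaging step described above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not AI IQ's actual code: the function name, the dictionary keys, and the example scores are all assumptions made for this sketch.

```python
DIMENSIONS = ("abstract", "math", "prog", "acad")


def composite_iq(dimension_iq: dict[str, float]) -> float:
    """Return the unweighted mean of the four dimension-level IQ estimates."""
    missing = [d for d in DIMENSIONS if d not in dimension_iq]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return sum(dimension_iq[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Hypothetical dimension scores, chosen only to make the arithmetic easy to follow.
example = {"abstract": 128.0, "math": 140.0, "prog": 132.0, "acad": 136.0}
print(composite_iq(example))  # 134.0
```

Because the weights are equal, one weak dimension pulls the composite down by a quarter of its shortfall, no matter how strong the other three are.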

    The abstract reasoning dimension draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1–3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity’s Last Exam, CritPt, and GPQA Diamond.

    Each raw benchmark score gets mapped to an implied IQ through what the site describes as “hand-calibrated difficulty curves.” Crucially, the methodology compresses ceilings for benchmarks considered easier or more susceptible to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: models need scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores down rather than up. The site states that “every derived IQ averages all four dimensions, so missing coverage cannot make a model look better by omission.”
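A rough sketch of how such a pipeline might look follows. The site does not publish its curves or its exact rule for missing data, so everything concrete here is an assumption for illustration: the anchor points, the ceiling value, the piecewise-linear interpolation, and the choice to fill a missing dimension with the lowest observed dimension minus a fixed penalty.

```python
from bisect import bisect_right
from typing import Optional

DIMENSIONS = ("abstract", "math", "prog", "acad")


def implied_iq(raw_score: float, anchors: list[tuple[float, float]], ceiling: float) -> float:
    """Interpolate linearly between (raw_score, iq) anchors, then cap at the ceiling."""
    xs = [x for x, _ in anchors]
    ys = [y for _, y in anchors]
    if raw_score <= xs[0]:
        iq = ys[0]
    elif raw_score >= xs[-1]:
        iq = ys[-1]
    else:
        i = bisect_right(xs, raw_score)  # xs[i-1] <= raw_score < xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        iq = y0 + (y1 - y0) * (raw_score - x0) / (x1 - x0)
    return min(iq, ceiling)


def derived_iq(dimension_iq: dict[str, Optional[float]], penalty: float = 5.0) -> Optional[float]:
    """Average all four dimensions, filling gaps pessimistically (assumed rule)."""
    present = [v for v in dimension_iq.values() if v is not None]
    if len(present) < 2:  # the site requires at least two of four dimensions
        return None
    fill = min(present) - penalty  # hypothetical conservative fill value
    filled = [dimension_iq.get(d) if dimension_iq.get(d) is not None else fill
              for d in DIMENSIONS]
    return sum(filled) / len(DIMENSIONS)


# Hypothetical curve for an "easy" benchmark with a compressed ceiling of 100.
easy_curve = [(0.0, 70.0), (50.0, 100.0), (100.0, 130.0)]
print(implied_iq(25.0, easy_curve, ceiling=100.0))  # 85.0
print(implied_iq(90.0, easy_curve, ceiling=100.0))  # 100.0 (capped from 124.0)
print(derived_iq({"abstract": 120.0, "math": 130.0, "prog": None, "acad": None}))  # 120.0
```

In the last call, the two observed dimensions average 125, but the derived score is 120 because the two missing dimensions are filled below the weakest observed one. That is one way to ensure omission never helps a model.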

    OpenAI leads the bell curve, but the gap between the top AI models has never been smaller

    As of mid-May 2026, the AI IQ charts tell a story of rapid convergence at the top of the frontier — and widening diversity in the tiers below.

    According to the Frontier IQ Over Time chart, GPT-5.5 from OpenAI currently sits at the peak of the bell curve, with an estimated IQ near 136, the highest of any model tracked. It is closely followed by Opus 4.7 from Anthropic (approximately 132), GPT-5.4 (approximately 131), and Opus 4.6 (approximately 129). Google’s Gemini 3.1 Pro lands near 131, putting the five leading models within about seven points of one another.

    That compression is not unique to AI IQ’s framework. Visual Capitalist, drawing from a separate Mensa-based ranking by TrackingAI, recently observed the same dynamic, noting that “the biggest takeaway is how compressed the top of the leaderboard has become.” On that scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, with Gemini 3.1 Pro at 141.

    Below the frontier cluster, the AI IQ charts show a crowded midfield. Models from Chinese labs — Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7 — bunch between roughly 112 and 118, making the cost-performance tier increasingly competitive for enterprise buyers who don’t need the absolute best model for every task. One X user, ovsky, noted that the data “confirms experience with sonnet 4.6 being an absolute workhorse as opposed to opus 4.5” — pointing to the way the charts can validate practitioner intuitions that headline rankings often miss.

    The trajectory of frontier AI models from October 2023 to mid-2026, as tracked by AI IQ. Provider-colored step-lines connect each lab’s flagship releases, showing roughly 60 points of estimated IQ improvement in 30 months. (Credit: AI IQ)

    Why emotional intelligence scores are becoming the new battleground in AI model rankings

    What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an “EQ” — emotional intelligence — score. The site maps each model’s EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 weighted composite of the two.

    The EQ scores produce a meaningfully different ranking than IQ alone. On the IQ vs. EQ scatter plot, Anthropic’s Opus 4.7 leads on EQ with a score near 132, pushing it into the upper-right quadrant — the most desirable position, signaling both high cognitive and high emotional intelligence. OpenAI’s GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google’s Gemini 3.1 Pro sits in a strong middle position on both axes.

    One notable methodological choice has drawn attention: EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges “creates potential scoring bias in favor of Anthropic models.” To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected since it uses human judges. That self-correction is unusual in the benchmarking world, and it suggests Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something IQ alone cannot: the growing importance of conversational quality, collaboration, and trust in models deployed for user-facing work.
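The EQ calculation described above (two Elo scores mapped to an EQ scale, a 200-point deduction on the EQ-Bench side for Anthropic models, then a 50/50 blend) might be sketched as follows. The Elo-to-EQ anchor values are invented for this example, and a single linear segment stands in for the site's calibrated piecewise-linear scales, which are not reproduced here.

```python
ANTHROPIC_ELO_PENALTY = 200.0  # stated by the site; applied to EQ-Bench 3 only


def elo_to_eq(elo: float, lo: tuple[float, float] = (800.0, 70.0),
              hi: tuple[float, float] = (1600.0, 150.0)) -> float:
    """Map an Elo score to an EQ estimate on one clamped linear segment.

    The anchor pairs (Elo, EQ) are hypothetical placeholders.
    """
    (x0, y0), (x1, y1) = lo, hi
    clamped = max(x0, min(x1, elo))
    return y0 + (y1 - y0) * (clamped - x0) / (x1 - x0)


def composite_eq(eqbench_elo: float, arena_elo: float, provider: str) -> float:
    """Blend the two implied EQ values with equal weight."""
    if provider == "Anthropic":
        eqbench_elo -= ANTHROPIC_ELO_PENALTY  # offset for the Claude-judged benchmark
    return 0.5 * elo_to_eq(eqbench_elo) + 0.5 * elo_to_eq(arena_elo)


# Same hypothetical Elo scores, with and without the judge-bias correction.
print(composite_eq(1500.0, 1400.0, provider="Anthropic"))  # 125.0
print(composite_eq(1500.0, 1400.0, provider="OtherLab"))   # 135.0
```

Under these made-up anchors, the penalty costs an Anthropic model 10 composite EQ points. The real effect depends on the slope of the site's own curves at the model's Elo.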

    Plotting IQ against EQ reveals that the smartest models aren’t always the most emotionally intelligent. Anthropic’s Opus 4.7 dominates the upper-right quadrant. (Credit: AI IQ)

    The AI cost-performance chart that enterprise buyers actually need to see

    Perhaps the most practically useful chart on the site is not the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model’s estimated IQ against an “effective cost” metric — defined as the token cost for a task using 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor.
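As a minimal sketch of that definition, the calculation reduces to one line of arithmetic. The per-million-token prices and the efficiency factor below are hypothetical inputs, and the site's exact derivation of the usage efficiency factor is not specified in this article.

```python
INPUT_TOKENS_M = 2.0   # reference task: 2 million input tokens
OUTPUT_TOKENS_M = 1.0  # reference task: 1 million output tokens


def effective_cost(input_price_per_m: float, output_price_per_m: float,
                   efficiency_factor: float = 1.0) -> float:
    """Token cost of the reference task, scaled by a usage efficiency factor."""
    token_cost = INPUT_TOKENS_M * input_price_per_m + OUTPUT_TOKENS_M * output_price_per_m
    return token_cost * efficiency_factor


# Hypothetical pricing: $5 per million input tokens, $30 per million output tokens.
print(effective_cost(5.0, 30.0))                          # 40.0
print(effective_cost(5.0, 30.0, efficiency_factor=1.25))  # 50.0
```

A factor above 1 would represent a model that burns more tokens than its peers on the same work, so two models with identical list prices can land at different points on the cost axis.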

    The chart reveals a familiar pattern in enterprise technology: the best models are not always the best value. GPT-5.5 and Opus 4.7 sit in the upper-left corner — high IQ, high cost, with effective per-task costs north of $30 and $50 respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in the middle: respectable IQ scores between 112 and 120, at effective costs ranging from roughly $1 to $5 per task. At the cheapest extreme, GPT-oss-20b (an open-source OpenAI model) appears near $0.20 effective cost with an IQ around 107 — potentially the most economical option for bulk classification or extraction workloads.

    The site also offers a 3D visualization mapping IQ, EQ, and effective cost simultaneously. A dashed line running through the cube points toward the ideal: higher IQ, higher EQ, and lower cost. Models near the “green end” of that axis are stronger all-around deals; those near the “red end” sacrifice capability, cost efficiency, or both. For CIOs staring at API invoices, the implication is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing — using expensive models for hard problems and cheap ones for everything else — is no longer optional. It is the dominant architecture for serious AI deployments.
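One simple way to express that routing idea in code is to pick the cheapest model that clears a task's required capability level. The sketch below is an assumption about how a team might do this, not a method the site prescribes. The catalog figures are rough readings of the numbers quoted in this article (the DeepSeek-V3.2 entry is a midpoint guess within the quoted ranges), and the `min_iq` thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    iq: float    # estimated IQ, approximate
    cost: float  # effective cost per reference task in USD, approximate


# Illustrative catalog; values are approximations, not authoritative data.
CATALOG = [
    Model("GPT-5.5", 136.0, 30.0),
    Model("Opus 4.7", 132.0, 50.0),
    Model("DeepSeek-V3.2", 115.0, 3.0),
    Model("GPT-oss-20b", 107.0, 0.20),
]


def route(min_iq: float, catalog: list[Model] = CATALOG) -> Model:
    """Return the cheapest model meeting min_iq, or the strongest one if none does."""
    eligible = [m for m in catalog if m.iq >= min_iq]
    if not eligible:
        return max(catalog, key=lambda m: m.iq)
    return min(eligible, key=lambda m: m.cost)


print(route(105.0).name)  # GPT-oss-20b
print(route(110.0).name)  # DeepSeek-V3.2
print(route(130.0).name)  # GPT-5.5
```

A production router would also weigh latency, context length, and per-task accuracy rather than a single composite score, which is precisely the concern critics of the one-number approach raise.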

    Critics say AI’s “jagged” capabilities make a single IQ score dangerously misleading

    The loudest objection to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model’s uneven capabilities into a single score obscures more than it reveals.

    “IQ as a proxy is fading — we’re seeing reasoning density spikes that don’t map to g-factor,” posted Zaya, a technology commentator, on X. “GPT-5.5 already hit saturation on MMLU-Pro, but still fails ClockBench 50% of the time.”

    That observation touches on what AI researchers call the “jaggedness” problem: large language models often exhibit wildly uneven capabilities, excelling at graduate-level physics while failing at tasks a child could do. A composite score can paper over those gaps.

    Pressureangle, another X user, posted a more granular critique, calling out “complete lack of transparency” and arguing the site never fully discloses how its calibration curves were created or validated. In fairness, AI IQ does list its 12 benchmarks and shows the shape of each calibration curve in its methodology modal. But the raw data and precise mathematical transformations are not published as open datasets — a gap that matters to researchers accustomed to fully reproducible methods.

    Others questioned the premise itself. “As useless as human IQ testing,” wrote haashim on X. Shubham Sharma, an AI and technology writer, offered a constructive alternative: “Why not having the Models take an official (MENSA-Grade) test? Wouldn’t this be the most accurate and most ‘human-comparable’ way to benchmark intelligence?” That approach already exists through TrackingAI, which administers the Mensa Norway IQ test to language models. But Mensa-style tests measure only abstract pattern recognition, while AI IQ attempts a broader composite across coding, mathematics, and academic reasoning. As Visual Capitalist noted, “an IQ-style benchmark captures only one slice of capability.” Each approach has tradeoffs — and neither has won the argument yet.

    The real race isn’t for the highest score — it’s for the smartest model stack

    For all the debate about methodology, the most important signal in AI IQ’s data may not be any single model’s score. It is the shape of the market the charts reveal.

    There are now more than 50 frontier-class models available through APIs, from at least 14 major providers spanning the United States, China, and Europe. Each provider publishes its own benchmarks, often cherry-picked to showcase strengths. The result is a Tower of Babel where no two companies measure the same thing in the same way. Academic research has highlighted that “most benchmarks introduce bias by focusing on a particular type of domain,” and the Frontier IQ Over Time chart on AI IQ shows just how fast the targets are moving: in October 2023, GPT-4-turbo sat near an estimated IQ of 75. By early 2026, the top models were brushing 135 — roughly 60 points of improvement in 30 months.

    That pace raises a fundamental question about whether any scoring system can keep up. The site compresses ceilings for saturated benchmarks, but as models continue to max out even the hardest tests — ARC-AGI-2, FrontierMath Tier 4, Humanity’s Last Exam — the framework will face the same ceiling effects that have plagued every AI evaluation before it. Connor Forsyth pointed to this dynamic on X: “ARC AGI 3 disagrees,” he wrote, referencing a next-generation benchmark that may already be undermining current scores.

    AI IQ is not perfect. Its methodology is partially opaque. Its IQ metaphor can mislead. And its creator acknowledges known biases while likely missing others. But the alternative — wading through dozens of provider-specific benchmark tables, each using different test suites and scoring conventions — is worse. The site offers enterprise buyers something genuinely scarce: a single framework for comparing models across providers, dimensions, and price points, updated regularly, with enough nuance to show that the right answer to “which model is best?” is almost always “it depends on the task.”

    As Debdoot Ghosh mused on X after viewing the charts: “Now a human’s role is just to orchestrate?”

    Maybe. But if the AI IQ data shows anything clearly, it is that orchestration — knowing which model to deploy, when, and at what price — has become its own form of intelligence. And for that, there is no benchmark yet.
