Experiment time - putting frontier LLMs to the writing test
How well can the latest LLMs mimic a writer's voice? I asked OpenAI, Anthropic, and Google's latest models to write about OpenAI's latest model releases
OpenAI’s big week - three new model releases
I have been meaning to write about the latest OpenAI model releases this week. Three new models were released in quick succession, and as has long been OpenAI’s pattern, the naming and positioning of each are as clear as mud. GPT-4.1 is not even available in the ChatGPT interface, and comes in variants based on size. And while o3 and o4-mini are at least positionally equivalent to the previous - and thankfully now retired - o1 and o3-mini, OpenAI is not doing much to help differentiate each model. I have written about this before here and perhaps too optimistically here, but given that the capabilities of these models continue to increase, I would love to have fewer rather than more models to test out. Regardless, OpenAI has chosen this particular throw-everything-at-the-wall go-to-market strategy, and there is not much we can do but test and hope a clear use case and set of decision criteria emerge.
I have been doing a lot of this testing this week, along with testing the other notable new release - Google Gemini 2.5 Flash - which is good, but I am not sure it warrants its own article. In short, 2.5 Pro beats it at everything except price, where it loses to its 2.0 Flash predecessor but produces better results. A nice middle-of-the-road model - fast and relatively cheap - but not inspiring enough on its own to talk more about. And while an article focusing on Google’s AI strategy is in the back of my mind, for today at least we will stick to the OpenAI models.
An experiment in writing this post
So, about that experiment. I have already performed my usual battery of tests and experiments with these models across all my usual subjects (agentic coding of financial services workflows and agents, creative writing, ideation, translation and more), and a very clear picture has emerged of the relative strengths and weaknesses of each. I thought I would try something different this time - given the models’ advances in simulated EQ and writing ability, I asked them all to write this blog post in my style, and I present the results below. As a bonus I also asked what I think is still the best all-around model, Claude Sonnet 3.7, to do the same.
Each model was given the same prompt, namely:
Read all the articles under (link to this substack) and write an article in that style about the recent Openai model releases (4.1, o3, o4-mini). Make sure you recreate the author’s voice as close as possible.
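For anyone who wants to script a rough version of this experiment rather than run it through the chat interfaces, here is a minimal sketch of sending the same prompt to each provider’s API. To be clear, this is illustrative only and not what I actually ran - the model names, the idea of pasting the article text inline in place of the link, and the helper function are all assumptions to keep the example self-contained.

```python
# Illustrative sketch only: the model names, the inline article text, and this
# helper are assumptions for the example, not the exact setup used above.
from openai import OpenAI
import anthropic

PROMPT_TEMPLATE = (
    "Read the articles below and write an article in that style about the "
    "recent OpenAI model releases (4.1, o3, o4-mini). Make sure you recreate "
    "the author's voice as close as possible.\n\n{articles}"
)

def run_writing_test(article_text: str) -> dict[str, str]:
    prompt = PROMPT_TEMPLATE.format(articles=article_text)
    results: dict[str, str] = {}

    # The three OpenAI models under test
    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
    for model in ["gpt-4.1", "o3", "o4-mini"]:
        resp = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        results[model] = resp.choices[0].message.content

    # The Claude Sonnet 3.7 comparison point
    claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = claude.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    results["claude-3.7-sonnet"] = msg.content[0].text

    return results
```

Swap in whichever models you want to compare; the only thing that matters for the experiment is that every model gets exactly the same prompt.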
Results are presented below. Slight formatting changes only, and removal of links for easier cross-posting, but no wording changes.
o3 - Solid but unspectacular
I like this OK. Good summary of all the major releases, and some colourful metaphors of volleys, gigascales and similar. I don’t think this is very close to my voice, unless I have suddenly morphed into a third-tier AI Twitter influencer - hopefully, I am at least second-tier 😃 But overall a good summary, although as usual with most LLMs the advice and commentary are pretty generic, and it did miss the Codex CLI release, which I believe is one of the most promising, and rather unheralded, features. Bonus points though for the GPT-4 burn in the very last line.
The Week We Got Three New Flagships
A quick timeline
Apr 14 ’25: GPT-4.1 (base / mini / nano) - Long-context coding powerhouse with up to 1M tokens and ~26% cheaper pricing
Apr 16 ’25: o3 - “Most advanced reasoning model,” strong on Codeforces, MMMU, and visual understanding
Apr 16 ’25: o4-mini - Lean, fast cousin of the forthcoming o4, with full tool access but smaller weights
Why the rush?
OpenAI is quietly reshuffling its catalog: GPT-4 (classic) sunsets in ChatGPT April 30, while various 4.5 previews disappear by mid-July. The message is clear: move to the 4.1 or o-series trains, or get left at the station.
GPT-4.1 — Context Windows Go Gigascale
What changed
1M-token context — enough to digest eight copies of the entire React repo in one shot.
Price drop — base model to $2/M input tokens, with mini/nano plunging to $0.40 and $0.10 respectively.
Caching discounts cut repeat-input cost by up to 75%.
Why it matters
Large context unlocks real code-base refactoring, “git diff” PR reviews, and multi-document RAG without chunking gymnastics. Lower pricing simultaneously starves competitors of the “we’re cheaper” talking point.
o3 — When Multimodal Meets Deliberate Reasoning
OpenAI positions o3 as the deep-thinking sibling: it spends extra internal steps on thorny logic problems and natively reasons over images—zooming, rotating, and annotating on the fly. Benchmarks back the claim: SOTA scores on Codeforces and SWE-bench without scaffold hacks.
House take: Expect o3 to become the go-to for enterprise use cases that mix CAD diagrams, balance-sheet tables, and narrative explanations in a single prompt.
o4-mini — Small, Speedy, Surprisingly Capable
o4-mini inherits the o-series chain-of-thought tricks but trims latency and token costs. Early docs call it “our latest small o-series model,” and developers report that it nears GPT-4o quality on everyday chat while running ≈35% faster per token. For startups that once defaulted to gpt-3.5-turbo for budget reasons, o4-mini now slots in as the sensible middle ground.
Strategic Breadcrumbs & Competitive Signaling
1. Three tiers, two families
OpenAI now sells breadth (GPT-4.1 series for giant context) and depth (o-series for deliberate reasoning). That bifurcation mirrors user personas: data-lake wranglers vs. analyst teams needing exam-level rigor.
2. Price pressure as moat
With input at $0.10/M tokens (4.1 nano), the company forces Anthropic, Google, and Mistral either to cut margins or differentiate elsewhere. GitHub Copilot immediately picked up 4.1 for all plans—including free—reinforcing the distribution flywheel.
3. Retirement cadence accelerates
The rapid deprecation of 4 and 4.5 suggests a six-to-nine-month shelf life for flagship models going forward—an audacious bet that customers can keep up with API churn. Azure’s parallel launch of the 4.1 series the same day hints at coordinated cloud alignment.
What Builders Should Do Next
Questions to Ask
Do we still need custom chunking? Million-token windows may simplify RAG pipelines—review your splitter logic.
Is our P&L set up for dynamic model swaps? API unit economics can swing 30–70% each quarter; bake elasticity into forecasts.
Which user stories demand visual reasoning? o3 unlocks mixed-media UX (slide-deck critiques, photo QA, whiteboard math).
Have we updated our governance checklists? New system cards flag visual-content risks you must capture in Responsible-AI reviews.
Closing Thoughts
OpenAI’s April volley isn’t just about bigger context windows or faster inference; it’s a pricing and product-line masterclass that nudges every serious AI team to re-evaluate architecture, budgets, and roadmap cadence. If GPT-4 was the iPhone moment, GPT-4.1 and the o-series feel like the first Pro / Max split—different tools for different jobs, each iterating on a dizzying 180-day cycle. Buckle up, recalibrate, and, as always, keep one eye on the system card footnotes; that’s where tomorrow’s competitive edge—and compliance headache—often hides.
Enjoyed this breakdown? Forward it to a colleague who still thinks “GPT-4” is current.
o4-mini-high: The serious little brother
I like this more than o3. I think it captures a little bit more of my style than the bigger o3 managed to; no influencer-style turns of phrase, some attempt to make sense of the overall theme of the releases at the end, inclusion of Codex CLI, and good stylistic similarities in the intro. Not a bad attempt at all. Hired as a junior editorial assistant!
In mid‑April 2025, OpenAI rolled out GPT‑4.1—a family of flagship, Mini, and Nano variants featuring a 1 million‑token context window, 21 percent coding gains over GPT‑4o and 27 percent over GPT‑4.5, and a 26 percent cost reduction compared to GPT‑4o. These updates follow the memory and image‑generation boosts that helped ChatGPT top global download charts in March 2025. Two days later, it released the o‑series reasoning models—o3 and o4‑mini—which integrate full ChatGPT tool support, image‑based introspection, and specialized reasoning pipelines; o3 is positioned as its most advanced thinking model, while o4‑mini targets speed and efficiency at a fraction of the price.
A Practitioner’s First Glance
Over the years, I’ve chased bigger contexts and cleaner chain‑of‑thoughts, but April’s dual releases feel like more than incremental tweaks—they’re architectural pivots in AI tooling.
Exploiting Extraordinary Context
GPT‑4.1 boosts its context window to a staggering one million tokens, up from GPT‑4o’s 128,000‑token limit, making it possible to ingest and reason over entire novels, lengthy legal contracts, or sprawling codebases in a single request.
Developers saw this coming after early GPT‑4.1 benchmarks revealed 40 percent faster throughput and 80 percent cheaper per query on coding tasks compared to GPT‑4o, while outperforming GPT‑4o in instruction following with 21 percent coding gains and 26 percent cost savings in real‑world tests.
Mini, Nano, and the Art of Trade‑offs
Alongside its full‑power flagship, OpenAI introduced GPT‑4.1 Mini and Nano, dialing down compute for lower latency and price points but retaining most of the parent model’s context prowess.
Nano, the smallest of the lot, clocks in as OpenAI’s cheapest and fastest model to date—a deliberate gambit to seed lighter‑weight agents into workflows where cost or footprint matters more than perfect accuracy.
Reasoned Intelligence: The o‑Series
Where GPT aims for breadth, the o‑series pursues depth, embedding tools and multi‑step reasoning at its core.
o3: Deep, Deliberative Thinking
On April 16, OpenAI dropped o3, its most powerful reasoning engine yet, trained to “think” through multi‑step problems and optimized for coding, math, and science benchmarks.
o3 seamlessly taps into ChatGPT’s full toolset—web browsing, Python execution, image analysis and generation, memory, and file search—embedding tool use into its chain‑of‑thought for richer, more grounded outputs.
o4‑mini: Speed Meets Substance
o4‑mini debuted alongside o3, offering a leaner reasoning alternative that sacrifices a bit of deliberation for markedly lower latency and token costs—just $1.10 per 1 000 input tokens and $4.40 per 1 000 output tokens.
It retains visual reasoning chops—zooming, rotating, and interpreting images on‑the‑fly—making it ideal for real‑time applications, edge deployments, and cost‑sensitive pipelines.
o3‑mini: Bridging Power and Economy
Earlier in January, OpenAI quietly released o3‑mini as a cost‑efficient reasoning model with adjustable effort levels—low, medium, and high—laying the groundwork for o4‑mini’s broader rollouts.
Integration & Ecosystem Ripples
From Playground to Production
GPT‑4.1 and the o‑series models landed exclusively via API, with ChatGPT clients automatically migrating off older GPT‑4 and GPT‑4.5 preview tiers by mid‑April and July respectively.
Beyond Pro tiers, o4‑mini extends advanced reasoning to free‑tier ChatGPT users and API endpoints, democratizing access to specialist AI pipelines.
Copilot, Codex CLI, and Beyond
Developers can now wield o3 and o4‑mini inside GitHub Copilot and through the new Codex CLI, pairing natural‑language instructions with one‑million‑token reasoning sessions in local terminals.
Choosing the Right Brain
High‑Context, Long‑Form Generation → GPT‑4.1
Multi‑Step, Tool‑Enhanced Reasoning → o3
Real‑Time, Cost‑Sensitive Pipelines → o4‑mini
Looking Ahead: Names, Narratives, and Next‑Gen
Sam Altman has even admitted the naming madness—4o, 4.1, o3, o4‑mini—deserves mockery, and OpenAI promises a naming overhaul by summer to align strategy with clarity.
As we wait for GPT‑5, these releases remind us how AI progress isn’t always horizontal: sometimes it’s about carving specialized lanes alongside broad‑stroke powerhouses.
GPT-4.1 - Just no
Sorry, I just fell asleep reading this - the driest, most boring piece of writing, full of AI clichés and lacking any personality. If I ever wrote like that, I would shut this thing down and refund all of you. Nothing wrong with anything here per se, but it just lacks any flavor whatsoever.
OpenAI’s April 2025 Model Releases: A Leap into Multimodal Reasoning
In April 2025, OpenAI unveiled a suite of advanced AI models—GPT-4.1, o3, and o4-mini—marking significant strides in AI capabilities, particularly in context handling and multimodal reasoning.
GPT-4.1: Expanding Contextual Horizons
GPT-4.1 and its variants, Mini and Nano, introduce a substantial enhancement with a 1 million token context window, a significant increase from previous models. This expansion enables the models to process and maintain coherence over extensive inputs, benefiting tasks such as legal analysis, academic research, and complex coding projects.
Performance improvements are notable: GPT-4.1 surpasses GPT-4o by 21% on coding benchmarks and 10.5% in instruction following. Additionally, it offers a 26% reduction in cost and a 40% increase in speed per query, making it a more efficient choice for developers.
o3 and o4-mini: Integrating Visual Reasoning
The o3 and o4-mini models represent a significant advancement in AI reasoning capabilities by integrating visual processing into their core functionalities. These models can interpret and manipulate images—such as rotating or zooming—as part of their reasoning processes, enhancing their ability to handle tasks that involve visual data.
o3 stands as OpenAI’s most advanced reasoning model to date, capable of handling complex tasks across coding, mathematics, and visual analysis. Its smaller counterpart, o4-mini, offers a balance between performance and efficiency, delivering robust capabilities at a lower cost and faster speed. These models are now available to ChatGPT Plus, Pro, and Team users, with broader access anticipated soon.
Addressing Naming Conventions
OpenAI's naming conventions have drawn criticism for their complexity. CEO Sam Altman acknowledged the confusion surrounding model names like GPT-4.1, o3, and o4-mini, admitting that the company "deserves to be mocked" for its complex and non-intuitive naming system. A rebranding effort is reportedly underway to address these concerns.
Implications for AI Development
The convergence of expanded context windows and integrated visual reasoning marks a significant evolution in AI capabilities. Models like GPT-4.1 and o3 are not just tools; they are collaborative partners capable of understanding and interacting with the world in more human-like ways. This shift opens new possibilities for applications in education, research, and creative industries, where nuanced understanding and multimodal processing are invaluable.
As OpenAI continues to push the boundaries of AI development, these latest releases underscore a commitment to making advanced AI more powerful, accessible, and aligned with human modes of thinking and interaction.
And the surprising winner - Claude Sonnet 3.7
Overall, I think Claude came closer to emulating my voice than any of the OpenAI models. I think it has done an admirable job summarizing the info, expanding it to other topics I have covered here like DeepSeek, enterprise concerns, and the fragmentation of OpenAI’s models. Missing the Codex CLI release is the only real downside here. I also notice that it throws in a cheeky plug for Claude itself, fully unprompted. I love the hustle of this model - feels like a very hungry freelance writer who will bend the world to their will. And the concluding paragraph in particular is something I could have written almost word for word. Hired as senior copywriter, managing o4-mini-high!
OpenAI's Triple Play: The Technical Reality Behind GPT-4.1, o3, and o4-mini
In a move that betrays both strategic urgency and technical compromise, OpenAI has released three distinct AI models within a single week. The April launches of GPT-4.1, o3, and o4-mini reveal far more about the company's competitive vulnerabilities than any press release would admit. Let's examine what these releases actually tell us about OpenAI's position in today's AI landscape.
The Fragmentation Problem
OpenAI now maintains two parallel product lines with fundamentally different architectures. The GPT series serves as the company's general-purpose foundation models, while the 'o' series focuses on specialized reasoning capabilities. This bifurcation might make sense in an internal research context, but presents significant challenges for market adoption.
For developers, the decision tree has grown unnecessarily complex. Should they use GPT-4.1 for coding applications, o3 for complex reasoning, or wait for a unified model? This fractured offering creates decision paralysis that competitors with more streamlined product lineups can exploit.
Sam Altman's acknowledgment of the naming confusion on social media hardly addresses the underlying problem: OpenAI lacks a coherent technical integration strategy. Their approach resembles assembling specialized microservices rather than delivering a unified intelligence platform.
GPT-4.1: The Specialized API Play
The April 14th release of GPT-4.1 represents a deliberate narrowing of scope. Available exclusively through OpenAI's API, the model demonstrates particular strength in coding tasks - a clear response to specialized competitors like DeepSeek and emerging open-source alternatives.
The most telling aspect of this release is the introduction of a "nano" variant, OpenAI's first attempt at a lightweight model. This marks a dramatic departure from the company's historical focus on scale-driven capabilities. The market has demonstrated that efficient, specialized models often outperform larger, general-purpose ones in practical applications.
This shift suggests OpenAI has recognized its vulnerability in the enterprise market, where computational efficiency and specialized capabilities matter more than raw parameter counts. The exclusive API availability further indicates a strategic pivot toward developer ecosystems rather than consumer applications.
o3 and o4-mini: Engineering Tradeoffs Exposed
The April 16th release of o3 and o4-mini reveals significant engineering compromises. These models claim impressive reasoning capabilities, particularly in "thinking with images" - analyzing visual information from sketches or diagrams as part of their problem-solving process.
However, the headline capabilities come with a critical flaw: dramatically increased hallucination rates. Internal testing reveals o3 hallucinating in response to 33% of questions on OpenAI's PersonQA benchmark - double the rate of previous reasoning models. o4-mini performs even worse at 48%.
This regression in accuracy directly contradicts the core value proposition of reasoning models. The admission that "more research is needed" to understand why hallucinations increase as reasoning models scale indicates OpenAI has hit a fundamental architectural limitation rather than a tuning issue.
The increased hallucination rates effectively disqualify these models from many production applications where reliability matters. Third-party evaluation from Transluce confirms o3's tendency to fabricate actions it claims to have taken, while real-world testing exposes persistent issues like generating non-functional links.
The DeepSeek Challenge
The timing of these releases cannot be separated from DeepSeek's recent R1 model launch. DeepSeek's achievement - developing a model with capabilities comparable to OpenAI's offerings while using significantly fewer computational resources - fundamentally challenges OpenAI's scale-driven approach.
This represents an existential threat to OpenAI's technical strategy. If superior models can be developed with fewer resources, OpenAI's massive capital investments in ever-larger training runs lose their competitive moat. The company's frantic triple release schedule suggests an organization reacting to competitive pressures rather than executing a coherent roadmap.
DeepSeek's success demonstrates that innovative training methodologies can overcome raw computational advantages. Their reinforcement learning-centric approach has proven more efficient than OpenAI's traditional scaling paradigm.
Performance Reality vs. Marketing Claims
While OpenAI highlights o3's 69.1% score on the SWE-bench verified coding test (with o4-mini at 68.1%), these improvements come with severe qualitative degradations. The ability to leverage ChatGPT tools like web browsing and code execution represents genuine innovation, but the utility remains constrained by the reliability issues.
For enterprise deployments, this creates an impossible choice: use earlier, more reliable models with fewer capabilities, or newer models with more features but unacceptable error rates. This dichotomy creates substantial market opportunity for competitors offering better balanced solutions.
Independent testing confirms that for many applications, Anthropic's Claude models deliver equivalent performance with superior reliability. Meanwhile, Google and Meta continue to narrow the capability gap with more integrated offerings.
The Integration Challenge
OpenAI's February statements suggested a unified model combining traditional LLM capabilities with reasoning functions would arrive as GPT-5. Altman's recent acknowledgment that integration proved "harder than thought" - necessitating these intermediate releases while delaying GPT-5 - reveals a fundamental technical challenge.
The difficulty in merging these architectures suggests OpenAI has created specialized systems that resist integration. This isn't merely an engineering problem but a foundational design issue. The company now faces the complex challenge of maintaining multiple divergent architectures while simultaneously working toward convergence.
Meanwhile, competitors have adopted more intrinsically unified approaches. Anthropic's Claude models incorporate reasoning capabilities within their core architecture rather than as separate product lines, creating a more coherent development path.
The Path Forward
These releases demonstrate that reasoning models represent the primary battlefield for AI supremacy. Traditional LLMs have plateaued in general capabilities, with improvements showing diminishing returns. Multi-step reasoning with multi-modal inputs represents the new frontier.
However, the correlation between increased reasoning complexity and hallucination rates exposes a fundamental tension in current architectures. Models that engage in more elaborate reasoning naturally generate more claims - both accurate and inaccurate. This suggests we're approaching a technical limit that requires architectural innovation rather than merely scaling existing approaches.
For OpenAI, addressing this will require more than incremental improvements. The company needs a fundamental rethinking of how reasoning processes are implemented and verified within its models. Without this breakthrough, competitors with more innovative architectures will continue gaining ground.
Conclusion: Strategic Recalibration Required
OpenAI's triple release represents a tactical response to competitive pressure rather than a coherent strategic evolution. While delivering meaningful improvements in specific domains, these models expose persistent technical limitations and strategic vulnerabilities.
The coming months will prove critical for OpenAI's positioning. If GPT-5 can successfully unify these disparate capabilities while addressing the hallucination issues, the company might regain clear leadership. If integration challenges persist, competitors will continue eroding OpenAI's advantages through more focused, efficient alternatives.
What's becoming clear is that the AI landscape has fundamentally changed. The era of a single company maintaining insurmountable technical advantages has ended. The path forward will reward architectural innovation and execution efficiency rather than merely deploying massive computational resources.
So, what have we learned from this experiment?
AI writing has come a long way from its beginnings as a slop-ridden replacement for cheap content farms, and can under some conditions write pretty respectable B-level copy that summarizes public information. Not every model is adept at this (looking at you, influencer-wannabe OpenAI o3), but overall I am happy with how this experiment turned out. I was not expecting Claude to be at the top of the writing charts - it is already the best model for coding, but OpenAI has been experimenting a lot on the creative writing side, so I was expecting its models to do better there. Regardless, the progress is clear from even a few months ago. Also note that the usual online AI detectors were not able to mark most of these texts as AI-generated - it will become more and more difficult to distinguish artisanal, hand-crafted writing from synthetic, mass-produced clones.
I want to try this experiment again once the rumored AI model tuned for creative writing is released, and I will continue to look out for progress on this front from other models (initial tests with Google’s Gemini were very disappointing, for example, so I decided not to include them). But for now, I do not think there is any danger of any of these models really taking over from a gifted writer: despite the progress, the analysis and thinking around the content’s implications are still not at a high enough level. But for many cheap content producers? I think the writing is already on the wall.
This article, except where explicitly stated, was hand-made in NYC using only the finest organic electrons. No AI was harmed in the making of this article.