Claude 3 vs GPT-4: The Competitive AI Landscape We’ve All Been Waiting For

Published: April 18, 2024

Claude 3’s March 2024 release marked a pivotal moment: it overtook GPT-4 in the Chatbot Arena rankings, bumping an OpenAI model to second place for the first time. Large Language Model (LLM) innovation is accelerating, and while Google’s Gemini Ultra and Mistral AI’s models have tried to outperform GPT-4, it was Anthropic’s Claude 3 that seemed to emerge on top.

But just a few weeks later, OpenAI updated GPT-4, and the model returned to first place. Now that the post-release hype around Claude 3 is cooling down, we’re wondering how the two models compare. After reviewing the benchmarks, research, and public opinion, we conclude the models will continue to compete head-to-head, at least until the release of GPT-5.

Cost and Context Size

First, the basics: how do the models compare in terms of context window and price? Claude 3 has a larger context window (200k tokens vs. GPT-4’s 128k), so it can hold almost an entire codebase in memory. Claude is also cheaper: Anthropic charges $15 per one million input tokens for Claude 3 Opus, whereas OpenAI charges $30 per million input tokens for GPT-4. So if you’re looking to balance price while still using a highly capable LLM, Claude may be a great fit.
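To put that pricing in concrete terms, here is a minimal sketch of the arithmetic, using the per-million-token input prices cited above; the 100k-token workload is a hypothetical example, and output-token pricing is ignored for simplicity:

```python
# Rough input-token cost comparison using the per-million-token prices cited above.
# The 100k-token workload is a made-up example for illustration.
PRICE_PER_MILLION_INPUT_USD = {
    "claude-3-opus": 15.00,  # Anthropic, per 1M input tokens
    "gpt-4": 30.00,          # OpenAI, per 1M input tokens
}

def input_cost(model: str, tokens: int) -> float:
    """Estimate the input-token cost in USD for a given token count."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_USD[model]

# Example: sending a 100k-token prompt (e.g. a large codebase) to each model once.
for model in PRICE_PER_MILLION_INPUT_USD:
    print(f"{model}: ${input_cost(model, 100_000):.2f}")
# claude-3-opus: $1.50
# gpt-4: $3.00
```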

Plus, when it comes to performance accuracy, Claude 3 shines across a wide range of considerations, with some exciting capabilities.

Performance Accuracy and Fact Retrieval

When it comes to performance accuracy, the benchmarks run on Claude 3 suggest it has superior accuracy to GPT-4 on undergraduate-level knowledge, graduate-level reasoning, grade-school math, math problem-solving, multilingual math, code, reasoning over text, and more.

Source: Anthropic

What makes Claude so impressive is not only its ability to effectively process long-context prompts, but also its extraordinary recall. The Needle-In-A-Haystack (NIAH) evaluation measures a model’s ability to accurately pick out a piece of information from a vast body of data: insert a target sentence (the “needle”) into a collection of documents (the “haystack”), then ask the model a question that can only be answered using the information in the needle.

Anthropic enhanced the NIAH evaluation by using one of 30 random needle/question pairs per prompt and testing on a crowdsourced, diverse range of documents. They found that Claude 3 Opus not only surpassed 99% accuracy, but also identified a limitation of the evaluation itself: it recognized that the “needle” sentence appeared to be artificially inserted into the original text. The needle was so out of place that the model inferred it was being tested.
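For readers curious what a needle-in-a-haystack check looks like in practice, here is a minimal sketch. This is not Anthropic’s actual harness: `ask_model` is a placeholder for whatever API client you use, and the documents, needle, and scoring rule are assumptions for illustration.

```python
# Minimal sketch of a Needle-In-A-Haystack style trial (not Anthropic's harness).
import random

def build_haystack(documents: list[str], needle: str, depth: float) -> str:
    """Concatenate documents and insert the needle at a relative depth (0.0-1.0)."""
    text = "\n\n".join(documents)
    insert_at = int(len(text) * depth)
    return text[:insert_at] + "\n" + needle + "\n" + text[insert_at:]

def run_niah_trial(documents, needle, question, expected, ask_model) -> bool:
    """Place the needle at a random depth, ask the question, and score the answer."""
    haystack = build_haystack(documents, needle, random.random())
    prompt = f"{haystack}\n\nQuestion: {question}\nAnswer concisely."
    answer = ask_model(prompt)  # ask_model: callable that sends the prompt to an LLM
    return expected.lower() in answer.lower()
```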

Moving Beyond Benchmarks 

Vendors do tend to cherry-pick numerical benchmarks or test-taking abilities to make results appear favorable, so results from highly technical benchmarks don’t always correlate with the average user’s experience. The unreliable significance of benchmarks, however, is precisely why Claude’s short-lived dominance of the Chatbot Arena leaderboard is so intriguing: in contrast to vendor-provided benchmarks, the leaderboard is crowdsourced.

Chatbot Arena gives each visitor an opportunity to rate LLMs based on whatever criteria the user sees fit, and then calculates the “best” models in aggregate. This allows us to measure the quality of an LLM based on subjective assessment of its output—or, as independent AI researcher Simon Willison puts it to Ars Technica, “its vibes.”
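The leaderboard aggregates these head-to-head votes into Elo-style ratings. The sketch below shows the general idea; the starting rating, K-factor, and vote data are illustrative assumptions, not the Arena’s exact methodology.

```python
# Simplified Elo-style aggregation of pairwise votes, in the spirit of how
# Chatbot Arena ranks models. Constants and vote data are illustrative only.
from collections import defaultdict

K = 32  # update step size (assumed value)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def rank_from_votes(votes: list[tuple[str, str]]) -> dict[str, float]:
    """votes is a list of (winner, loser) pairs from head-to-head comparisons."""
    ratings = defaultdict(lambda: 1000.0)  # everyone starts at an assumed baseline
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += K * (1 - exp_win)
        ratings[loser] -= K * (1 - exp_win)
    return dict(ratings)

# Toy example with three votes:
print(rank_from_votes([("claude-3-opus", "gpt-4"),
                       ("gpt-4", "claude-3-opus"),
                       ("claude-3-opus", "gpt-4")]))
```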

Tech.Co compared ChatGPT and Claude by asking both LLMs the same 13 questions and contrasting the results. They found Claude to be more articulate, with answers that were usually better written and easier to read. GPT was better at creative writing, creating spreadsheet formulas, and composing an email, but Claude was better at brainstorming ideas, ethical reasoning, summarizing text, creating product descriptions, analyzing text, and providing factual information. Claude 3 also seemed to understand advanced science in a way previous LLMs could not: quantum physicist Kevin Fischer reports that Claude 3 grasped his doctoral thesis, and another expert in quantum computing reported that Claude 3 reinvented his algorithm with just two prompts.

Other Considerations: Claude 3 Reigns Supreme in Faithfulness and Spatial Understanding

More recent experiments have moved beyond simple fact-retrieval tests and subjective prompt-level analysis. Yekyung Kim and colleagues compared how models summarize book-length documents (>100k tokens), with particular attention to whether the output represents the narrative accurately (what they call “faithfulness”) and stays relevant to the content, rather than merely measuring coherence. Claude 3 significantly outperformed all closed-source LLMs, suggesting it is the superior model for long-context understanding. Other researchers examined how LLMs represent and reason about spatial structures (squares, triangles, hexagons, rings, and trees), and found that Claude 3 beat out both GPT-4 and GPT-4 Turbo.

Knowledge Cutoff

The last comparison concerns knowledge cutoff. GPT-4 Turbo’s training data extends roughly four months beyond Claude 3’s. While this gap is more than likely temporary, it can be a problem for Claude users who need up-to-date information from the end of 2023.

Our Conclusion

These recent evaluations suggest that Claude 3 surpasses GPT-4’s capabilities in quite a few areas, particularly summarization, spatial understanding, human-like writing style, and advanced knowledge. Whether or not the hype continues, Claude 3’s emergence on the market marks an AI milestone: we are finally navigating an incredibly competitive scene.
