Written by Miriam Khanukaev and Don Lariviere
Choosing the right LLM has become one of the most crucial decisions in AI implementation. With new models constantly emerging and existing ones rapidly evolving, the choice is complex and nuanced. The right LLM for your use case will depend on a variety of factors. This guide breaks down the essential considerations to help you make an informed decision.
Core factors in selecting an LLM
1. Task and use case alignment
Before evaluating specific models, clearly define your use case. LLMs vary in their capabilities across different tasks:
- General language tasks: Text generation, summarization, translation
- Technical tasks: Code generation, mathematical reasoning, logical analysis
- Specialized domains: Medical, legal, financial, or scientific text
- Multi-modal tasks: Handling images, audio, or other non-text inputs
For instance, GPT-4o excels in multi-modal applications, while Code Llama specializes in programming tasks. Match the model's strengths to your primary use case.
2. Performance requirements
Evaluate models based on quantitative metrics relevant to your use case:
- Benchmark scores: MMLU, HumanEval, or domain-specific benchmarks
- Task-specific accuracy: How well does the model perform on similar tasks?
- Context window size: For longer documents or conversations
Latency
Response time is critical for real-time applications. Consider:
- API response times for cloud-based models
- Time to first token (TTFT)
- Throughput for batch processing needs
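Time to first token and total generation time can be measured with a simple timing harness around any streaming API. The sketch below uses a stand-in generator (`fake_stream`) in place of a real provider client, which is an assumption for illustration; swap in your provider's streaming call to benchmark it.

```python
import time
from typing import Iterator

def fake_stream() -> Iterator[str]:
    """Stand-in for a provider's streaming API; replace with a real client call."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulate network/inference delay
        yield token

def measure_latency(stream: Iterator[str]) -> dict:
    """Record time to first token (TTFT), total time, and token count."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "tokens": tokens}

stats = measure_latency(fake_stream())
print(stats["tokens"])  # 4
```

Running the same harness against several providers with identical prompts gives you a like-for-like latency comparison.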
Reliability
Consider the service's uptime, error handling, and consistency of outputs.
3. Deployment flexibility
Determine whether you need:
- Cloud-based API: Managed services like OpenAI, Anthropic, or Cohere
- Self-hosted open-source: Llama, Mistral, or Falcon for greater control
- Hybrid approaches: Combining cloud and on-premises solutions
Open-source models like Llama offer flexibility but require significant technical expertise to deploy and maintain. The decision often comes down to whether you want control over your infrastructure or the convenience of a managed service.
4. Cost-efficiency analysis
Develop a comprehensive cost model considering:
- API pricing: Cost per token for input/output
- Volume discounts: Bulk pricing or enterprise agreements
- Infrastructure costs: For self-hosted solutions
- Hidden costs: Fine-tuning, storage, monitoring
For high-volume applications, the cost difference between models can be substantial. Calculate your estimated token usage and compare total costs across providers.
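A back-of-the-envelope cost model makes these comparisons concrete. The per-million-token prices below are hypothetical placeholders, not real provider rates; check current pricing pages before relying on the numbers.

```python
def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
                 price_in_per_m: float, price_out_per_m: float, days: int = 30) -> float:
    """Estimate monthly spend given per-million-token input/output prices."""
    total_in = requests_per_day * in_tokens * days
    total_out = requests_per_day * out_tokens * days
    return total_in / 1e6 * price_in_per_m + total_out / 1e6 * price_out_per_m

# Hypothetical price points (input $/M tokens, output $/M tokens)
models = {"premium_model": (5.00, 15.00), "budget_model": (0.50, 1.50)}
for name, (p_in, p_out) in models.items():
    cost = monthly_cost(10_000, 1_000, 500, p_in, p_out)
    print(name, round(cost, 2))  # premium_model 3750.0 / budget_model 375.0
```

At 10,000 requests per day, a 10x difference in token price translates directly into thousands of dollars per month, which is why this calculation is worth doing early.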
5. Privacy and security
Evaluate based on your security requirements:
- Data handling: How is your data processed and stored?
- Compliance: GDPR, HIPAA, or other regulatory requirements
- Isolation options: Private deployment for sensitive data
For highly sensitive industries like healthcare or finance, consider models with dedicated enterprise solutions or the ability to self-host.
6. Integration and ecosystem
Consider the technical integration requirements:
- API quality: Documentation, SDKs, and developer tools
- Ecosystem compatibility: Integration with your existing tech stack
- Community support: For open-source models
Model-specific considerations
OpenAI's GPT series
GPT-4o is OpenAI's flagship offering, with strong all-round performance and multi-modal capabilities. Best suited for:
- Complex reasoning tasks
- Multi-modal applications
- Enterprise solutions requiring reliability
GPT-3.5-Turbo offers a cost-effective option for simpler tasks while maintaining good performance.
Anthropic's Claude series
Claude 3 models (Haiku, Sonnet, Opus) offer different capability-cost trade-offs, excelling in:
- Long-context tasks
- Safety-critical applications
- Nuanced instruction following
Open-source alternatives
Models like Llama 3, Mistral, and Falcon offer:
- Complete control over deployment
- No per-token costs after initial setup
- Customization through fine-tuning
However, they require more technical expertise and infrastructure management.
Advanced selection criteria
Context window and memory
For applications requiring long-term context or processing of long documents, the context window size is crucial. Some models now offer context windows of 100K tokens or more, enabling analysis of entire codebases or legal documents.
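Before committing to a model, it helps to check whether your typical inputs actually fit its context window. The sketch below uses the rough ~4-characters-per-token heuristic for English text, which is an approximation; for production, count tokens with the model's actual tokenizer.

```python
def fits_context(text: str, context_window: int, reserved_for_output: int = 1024) -> bool:
    """Rough fit check using the ~4 chars/token heuristic for English text.
    Reserves headroom for the model's response."""
    est_tokens = len(text) / 4
    return est_tokens + reserved_for_output <= context_window

doc = "word " * 50_000  # ~250k characters, roughly 62k estimated tokens
print(fits_context(doc, 128_000))  # True
print(fits_context(doc, 8_192))    # False
```

If your documents routinely fail this check for a candidate model, you will need chunking, retrieval, or a longer-context alternative.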
Customization potential
Evaluate the feasibility of:
- Fine-tuning: Training on domain-specific data
- Prompt engineering: Using few-shot learning and system prompts
- RAG implementation: Retrieval-Augmented Generation for up-to-date information
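The RAG pattern above can be sketched in a few lines: retrieve the passages most relevant to a query, then assemble them into a grounded prompt. This toy version uses naive keyword overlap for retrieval; real systems typically use embedding similarity and a vector store, so treat the names and documents here as illustrative assumptions.

```python
import re

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; swap in embedding similarity for production."""
    def words(text: str) -> set[str]:
        return set(re.findall(r"\w+", text.lower()))
    q = words(query)
    return sorted(documents, key=lambda d: len(q & words(d)), reverse=True)[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble a prompt that grounds the model in the retrieved passages."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days.",
    "Shipping takes 3-5 business days.",
    "Support is available by email 24/7.",
]
prompt = build_prompt("What is the refund policy?", docs)
```

The assembled prompt is then sent to whichever model you are evaluating, which lets you keep the model itself unchanged while refreshing the knowledge it draws on.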
Output quality and consistency
Test models thoroughly on your specific use case. Consider:
- Hallucination rates and factual accuracy
- Consistency across multiple runs
- Response format compliance
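Consistency across runs can be quantified with a simple agreement score: run the same prompt several times and measure what fraction of outputs match the most common answer. The normalization here (strip and lowercase) is a minimal assumption; stricter or fuzzier matching may suit your use case better.

```python
from collections import Counter

def consistency(outputs: list[str]) -> float:
    """Fraction of runs that match the most common output (exact match after normalization)."""
    normalized = [o.strip().lower() for o in outputs]
    _, count = Counter(normalized).most_common(1)[0]
    return count / len(normalized)

# Five hypothetical runs of the same prompt
runs = ["Paris", "paris", "Paris ", "Lyon", "Paris"]
print(consistency(runs))  # 0.8
```

A model that scores well on a single run but poorly on this metric may be a risky choice for applications that demand reproducible outputs.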
Building an evaluation framework
Create a structured evaluation process:
1. Define success metrics: Specific KPIs for your use case
2. Create test cases: Representative samples of your actual use case
3. Benchmark testing: Quantitative comparison across models
4. Cost analysis: Total cost of ownership calculation
5. Pilot implementation: Real-world testing with a subset of users
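The steps above can be wired into a small harness that scores each candidate model on the same test set. The stub models and the contains-answer metric below are simplifying assumptions; in practice you would call real provider APIs and use graded or LLM-assisted scoring.

```python
from typing import Callable

def evaluate(model: Callable[[str], str], test_cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases where the model's output contains the expected answer."""
    hits = sum(expected.lower() in model(prompt).lower()
               for prompt, expected in test_cases)
    return hits / len(test_cases)

# Stub models standing in for real API calls
def model_a(prompt: str) -> str:
    return "The capital of France is Paris."

def model_b(prompt: str) -> str:
    return "I am not sure."

cases = [("What is the capital of France?", "Paris")]
print(evaluate(model_a, cases), evaluate(model_b, cases))  # 1.0 0.0
```

Running every candidate through the same harness keeps the comparison fair and makes it cheap to re-run the evaluation when new models appear.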
The verdict
The ideal LLM selection requires balancing multiple factors: task performance, deployment requirements, cost, security, and integration capabilities. There's no universal "best" model – it's about finding the best fit for your specific needs and constraints.
Regularly reassess your choice as the field evolves rapidly. What's optimal today may not be the best solution in six months. Build in flexibility to switch models as better options emerge or your needs change.
Remember, the goal isn't to pick the most advanced model, but the most appropriate one for your use case, budget, and technical requirements. Start with a clear definition of success criteria and let that guide your evaluation process.
Contact Proxet if you need help selecting and integrating an LLM, and we will help you cut through the noise!