Anthropic Announces Claude 3 AI Models; Beats GPT-4 and Gemini 1.0 Ultra

Anthropic’s Claude 3 Opus Model Outperforms GPT-4 on Popular Benchmarks

In the latest development in AI, Anthropic’s Claude 3 Opus model has reportedly surpassed GPT-4 on several benchmarks. The company, founded by former OpenAI employees Daniela and Dario Amodei, has introduced a family of three Claude 3 models: Opus (largest), Sonnet (mid-size), and Haiku (smallest).

Benchmark Scores

Anthropic has tested all three models on popular benchmarks such as MMLU, GPQA, GSM8K, MATH, HumanEval, and HellaSwag. The Opus model scored 86.8% on MMLU, slightly higher than GPT-4’s reported score of 86.4%. On the HumanEval coding benchmark, Opus scored 84.9%, significantly higher than GPT-4’s 67% and Gemini 1.0 Ultra’s 74.4%. On the HellaSwag test, Opus scored 95.4%, narrowly edging out GPT-4’s 95.3% and well ahead of Gemini 1.0 Ultra’s 87.8%.

Capabilities of Claude 3 Models

Anthropic claims that all three models are strong at analysis and forecasting, nuanced content creation, code generation, and conversing in non-English languages such as Spanish, Japanese, and French.

Vision Capability

All Claude 3 models can process images, though Anthropic is not marketing them as multimodal models. The vision capability is aimed at enterprise customers who need to analyze charts, graphs, and technical diagrams. On vision benchmarks, Claude 3 performs better than GPT-4V but slightly trails Gemini 1.0 Ultra.
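
For developers who want to try the vision capability, here is a minimal sketch of sending a chart image to Claude 3 through Anthropic’s Messages API using the official anthropic Python SDK. The file name and prompt are illustrative, and the model identifier is the Opus name at launch.

```python
# Sketch: sending an image to Claude 3 via the Anthropic Messages API.
# Assumes the `anthropic` SDK is installed and ANTHROPIC_API_KEY is set
# in the environment. The file name is a placeholder.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("quarterly_revenue_chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",  # Opus identifier at launch
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Summarize the key trends in this chart."},
        ],
    }],
)
print(message.content[0].text)
```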

Context Length and Performance

At launch, all three models offer a 200K-token context window, which is already large by current standards. Anthropic says the Claude 3 models can accept inputs of more than 1 million tokens, but this capability will be available to select customers only. On a Needle In A Haystack (NIAH) evaluation spanning more than 200K tokens, the Opus model performed exceptionally well, achieving over 99% accurate retrieval. Anthropic states that Opus delivers speeds similar to Claude 2 and 2.1 but with much higher intelligence, while the mid-size Sonnet model is about 2x faster than Claude 2 and 2.1.
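
To make the evaluation concrete, here is a minimal, hypothetical sketch of how a NIAH-style test works: a “needle” fact is buried at varying depths inside long filler text, and the model is asked to retrieve it. The ask_model helper is a placeholder for any model call, not part of a real API.

```python
# Hypothetical sketch of a Needle In A Haystack (NIAH) style test.
# `ask_model(prompt)` is a placeholder that returns the model's answer as a string.

FILLER = "The quick brown fox jumps over the lazy dog. " * 5000  # long distractor text
NEEDLE = "The magic number for the evaluation is 42417."

def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    pos = int(len(filler) * depth)
    return filler[:pos] + " " + needle + " " + filler[pos:]

def run_niah(ask_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    """Return the fraction of insertion depths at which retrieval succeeds."""
    hits = 0
    for depth in depths:
        prompt = (build_haystack(FILLER, NEEDLE, depth)
                  + "\n\nWhat is the magic number for the evaluation?")
        if "42417" in ask_model(prompt):
            hits += 1
    return hits / len(depths)
```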

Pricing

The flagship Opus model can be accessed by subscribing to Anthropic’s paid Claude Pro plan, which costs $23.60 a month after taxes. The mid-size Claude 3 Sonnet model already powers the free version of claude.ai. Developers can access the Opus and Sonnet models through the API immediately.
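
For a sense of what API access looks like, the sketch below calls the Sonnet model with the official anthropic Python SDK; the prompt is illustrative, and the model identifier is the Sonnet name at launch.

```python
# Minimal sketch of a text-only Claude 3 API call with the `anthropic` SDK.
# Assumes ANTHROPIC_API_KEY is set in the environment.
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-3-sonnet-20240229",  # Sonnet identifier at launch; swap in Opus if desired
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain context windows in one paragraph."}],
)
print(message.content[0].text)
```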

API Pricing

Claude 3 Opus with a 200K context window costs $15 per one million input tokens and $75 per one million output tokens. Compared to GPT-4 Turbo ($10 per million input tokens and $30 per million output tokens, with a 128K context window), the pricing is steep.
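
To see what those rates mean in practice, here is a back-of-the-envelope comparison using the prices quoted above; the token counts are illustrative.

```python
# Cost comparison using the per-million-token prices quoted above.
PRICES = {  # USD per one million tokens: (input, output)
    "claude-3-opus": (15.00, 75.00),
    "gpt-4-turbo": (10.00, 30.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request."""
    inp, out = PRICES[model]
    return input_tokens / 1_000_000 * inp + output_tokens / 1_000_000 * out

# Example: a request with 10K input tokens and 1K output tokens
print(request_cost("claude-3-opus", 10_000, 1_000))  # 0.225 USD
print(request_cost("gpt-4-turbo", 10_000, 1_000))    # 0.13 USD
```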

What’s Your Opinion?

What do you think about Anthropic’s new family of models, especially the Opus model? Let us know in the comment section below.

Summary

Anthropic’s Claude 3 Opus model has reportedly outperformed GPT-4 on popular benchmarks, scoring higher in areas such as language understanding and coding ability. The company claims that all three models are strong at analysis and forecasting, nuanced content creation, code generation, and non-English languages such as Spanish, Japanese, and French. The models launch with a 200K-token context window, and inputs of more than 1 million tokens will be available to select customers. On the Needle In A Haystack test, the Opus model performed exceptionally well, with over 99% accurate retrieval. Opus delivers speeds similar to Claude 2 and 2.1 but with much higher intelligence, while the mid-size Sonnet model is about 2x faster than Claude 2 and 2.1. The flagship Opus model can be accessed through Anthropic’s paid Claude Pro subscription, which costs $23.60 a month after taxes, while the mid-size Claude 3 Sonnet already powers the free version of claude.ai. Developers can access the Opus and Sonnet models through the API immediately. Claude 3 Opus with a 200K context window costs $15 per one million input tokens and $75 per one million output tokens, which is expensive compared to GPT-4 Turbo.