The upgraded Claude 3.5 Sonnet demonstrates substantial improvements across a wide range of benchmarks, with the most striking gains in coding. On SWE-bench Verified it scored 49.0%, surpassing all publicly available models, including specialized coding systems. Performance on agentic tool use also improved, with notable gains in both the retail and airline domains of the TAU-bench evaluation, all while keeping the same price and speed as its predecessor. Early feedback from companies such as GitLab and Cognition has confirmed significant improvements in coding, planning, and problem-solving.
A new capability called computer use has been introduced in public beta, available exclusively through the API. It allows Claude to interact with computers the way people do: looking at a screen, moving a cursor, clicking buttons, and typing text to navigate interfaces and complete tasks. Several companies, including Asana, Canva, and DoorDash, have already begun exploring these possibilities, automating tasks that require many steps to complete. While still experimental and sometimes error-prone, computer use has shown promising results on the OSWorld evaluation, scoring 14.9% in the screenshot-only category and 22.0% when given more steps to complete tasks.
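For developers who want to try the beta, the snippet below is a minimal sketch of a computer use request through Anthropic's Python SDK. It reflects the identifiers published at launch (the computer-use-2024-10-22 beta flag, the computer_20241022, text_editor_20241022, and bash_20241022 tool types, and the claude-3-5-sonnet-20241022 model ID); treat these as assumptions to verify against the current documentation, and note that the surrounding agent loop that actually takes screenshots and executes clicks is omitted.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ask the model to plan computer-use actions. The application, not the API,
# must execute the returned tool calls (screenshots, mouse moves, keystrokes)
# and feed the results back in follow-up messages.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        },
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"},
    ],
    messages=[
        {"role": "user", "content": "Open the spreadsheet on my desktop and total column B."}
    ],
    betas=["computer-use-2024-10-22"],
)
print(response.content)

In practice this call sits inside a loop: the model returns tool-use blocks, the host runs them in a sandboxed environment, and the results are sent back to the model until the task is finished.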
The new Claude 3.5 Haiku combines speed with capability: while running at a similar speed to its predecessor, Claude 3 Haiku, it matches or exceeds Claude 3 Opus on many intelligence benchmarks. It is particularly strong on coding tasks, scoring 40.6% on SWE-bench Verified and outperforming many existing models, including the original Claude 3.5 Sonnet. With its low latency and improved instruction following, it is well suited to user-facing products and specialized tasks. The model will be available across multiple platforms, including Anthropic's API, Amazon Bedrock, and Google Cloud's Vertex AI.
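As a point of reference for the low-latency, user-facing use cases mentioned above, a plain text request to Claude 3.5 Haiku through the Python SDK looks like the sketch below. The dated model identifier claude-3-5-haiku-20241022 is an assumption based on Anthropic's naming convention for this release; Bedrock and Vertex AI use their own model IDs.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A small, latency-sensitive task of the kind described above:
# short prompt, small max_tokens, no tools.
message = client.messages.create(
    model="claude-3-5-haiku-20241022",  # assumed launch identifier; check the docs
    max_tokens=256,
    messages=[
        {
            "role": "user",
            "content": "Classify this support ticket as billing, bug, or feature request: "
                       "'I was charged twice this month.'",
        }
    ],
)
print(message.content[0].text)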
Anthropic has taken significant steps to ensure the responsible development and deployment of these new capabilities. The company conducted joint pre-deployment testing with the US AI Safety Institute and the UK AI Safety Institute, and has implemented safety measures specifically for computer use, including new classifiers to identify potential misuse for spam, misinformation, or fraud. The company maintains that the ASL-2 Standard, as outlined in its Responsible Scaling Policy, remains appropriate for these models. Anthropic has emphasized that while these technologies are still in their early stages, they represent important steps forward in AI capability and accessibility.