The upgraded Claude 3.5 Sonnet demonstrates substantial improvements across a wide range of benchmarks, with the most striking gains in coding. On SWE-bench Verified it scored 49.0%, surpassing all publicly available models, including specialized coding systems. Performance on agentic tool use also improved, with notable gains in both the retail and airline domains of the TAU-bench evaluation, all while keeping the same price and speed as its predecessor. Early feedback from companies such as GitLab and Cognition has confirmed significant improvements in coding, planning, and problem-solving.
A new capability called computer use has been introduced in public beta, available exclusively through the API. It allows Claude to interact with computers the way people do: looking at a screen, moving a cursor, clicking buttons, and typing text to navigate interfaces and complete tasks. Several companies, including Asana, Canva, and DoorDash, have already begun exploring these possibilities, automating tasks that require many steps to complete. While still experimental and sometimes error-prone, computer use has shown promising results on the OSWorld evaluation, scoring 14.9% in the screenshot-only category and 22.0% when given more steps to complete tasks.
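For developers who want to try the beta, the snippet below is a minimal sketch of a computer use request through Anthropic's Python SDK. It reflects the identifiers published at launch (the computer-use-2024-10-22 beta flag, the computer_20241022, text_editor_20241022, and bash_20241022 tool types, and the claude-3-5-sonnet-20241022 model ID); treat these as assumptions to verify against the current documentation, and note that the surrounding agent loop that actually takes screenshots and executes clicks is omitted.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ask the model to plan computer-use actions. The application, not the API,
# must execute the returned tool calls (screenshots, mouse moves, keystrokes)
# and feed the results back in follow-up messages.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        },
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"},
    ],
    messages=[
        {"role": "user", "content": "Open the spreadsheet on my desktop and total column B."}
    ],
    betas=["computer-use-2024-10-22"],
)
print(response.content)

In practice this call sits inside a loop: the model returns tool-use blocks, the host runs them in a sandboxed environment, and the results are sent back to the model until the task is finished.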
The new Claude 3.5 Haiku combines speed with capability: while running at a similar speed to its predecessor, Claude 3 Haiku, it matches or exceeds Claude 3 Opus on many intelligence benchmarks. It is particularly strong on coding tasks, scoring 40.6% on SWE-bench Verified and outperforming many existing models, including the original Claude 3.5 Sonnet. With its low latency and improved instruction following, it is well suited to user-facing products and specialized tasks. The model will be available across multiple platforms, including Anthropic's API, Amazon Bedrock, and Google Cloud's Vertex AI.
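As a point of reference for the low-latency, user-facing use cases mentioned above, a plain text request to Claude 3.5 Haiku through the Python SDK looks like the sketch below. The dated model identifier claude-3-5-haiku-20241022 is an assumption based on Anthropic's naming convention for this release; Bedrock and Vertex AI use their own model IDs.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A small, latency-sensitive task of the kind described above:
# short prompt, small max_tokens, no tools.
message = client.messages.create(
    model="claude-3-5-haiku-20241022",  # assumed launch identifier; check the docs
    max_tokens=256,
    messages=[
        {
            "role": "user",
            "content": "Classify this support ticket as billing, bug, or feature request: "
                       "'I was charged twice this month.'",
        }
    ],
)
print(message.content[0].text)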
Anthropic has taken significant steps to ensure the responsible development and deployment of these new capabilities. The company conducted joint pre-deployment testing with the US AI Safety Institute and the UK AI Safety Institute, and has implemented safety measures specifically for computer use, including new classifiers to identify potential misuse for spam, misinformation, or fraud. The company maintains that the ASL-2 Standard, as outlined in its Responsible Scaling Policy, remains appropriate for these models. Anthropic has emphasized that while these technologies are still in their early stages, they represent important steps forward in AI capability and accessibility.