GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and GPT-4.5 across coding (54.6% SWE-bench Verified), instruction compliance (87.4% IFEval), and multimodal understanding benchmarks. It is tuned for precise code diffs, agent reliability, and high recall in large document contexts, making it ideal for agents, IDE tooling, and enterprise knowledge retrieval.
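As a rough illustration of the long-context use case described above, here is a minimal sketch using the OpenAI Python SDK. The file name, prompt, and question are placeholders; "gpt-4.1" is simply the model identifier implied by this entry, and the call otherwise follows the standard Chat Completions pattern.

```python
# Minimal long-document Q&A sketch with the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set; "design_doc.md" is a hypothetical file.
from openai import OpenAI

client = OpenAI()

with open("design_doc.md") as f:
    document = f.read()  # large documents fit within the 1M-token context window

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "Answer questions using only the attached document."},
        {"role": "user", "content": f"{document}\n\nQuestion: summarize the open risks."},
    ],
)
print(response.choices[0].message.content)
```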
GPT-4.1 mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard instruction evals, 35.8% on MultiChallenge, and 84.1% on IFEval. Mini also shows strong coding ability (e.g., 31.6% on Aider’s polyglot diff benchmark) and vision understanding, making it suitable for interactive applications with tight performance constraints.
OpenAI's high-intelligence flagship model for complex, multi-step tasks. GPT-4o is cheaper and faster than GPT-4 Turbo.
OpenAI's previous high-intelligence model, optimized for chat but also well suited to traditional completions tasks.
OpenAI o4-mini is a compact reasoning model in the o-series, optimized for fast, cost-efficient performance while retaining strong multimodal and agentic capabilities. It supports tool use and demonstrates competitive reasoning and coding performance across benchmarks like AIME (99.5% with Python) and SWE-bench, outperforming its predecessor o3-mini and even approaching o3 in some domains.
o3-mini is an OpenAI model optimized for STEM reasoning that excels in science, math, and coding tasks. It matches o1's performance in these domains while delivering faster responses at lower cost. The model supports tool calling, structured outputs, and system messages, making it a good option for a wide range of applications.
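As a hedged sketch of the tool-calling support mentioned above, the snippet below uses the OpenAI Python SDK; the get_weather tool, its schema, and the prompt are invented for illustration.

```python
# Tool-calling sketch with the OpenAI Python SDK; get_weather is a hypothetical tool.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function the application would implement
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="o3-mini",
    messages=[
        {"role": "system", "content": "Use tools when they help answer the question."},
        {"role": "user", "content": "What's the weather in Lisbon right now?"},
    ],
    tools=tools,
)

# If the model decides to call the tool, the request to execute it appears here.
print(response.choices[0].message.tool_calls)
```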
The o1 model family from OpenAI is designed to spend more time thinking before responding. The o1 series is trained with large-scale reinforcement learning to reason using chain of thought. These models are optimized for math, science, programming, and other STEM-related tasks, and they consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology.
Claude 3.7 Sonnet is an advanced large language model with improved reasoning, coding, and problem-solving capabilities. It introduces a hybrid reasoning approach, allowing users to choose between rapid responses and extended, step-by-step processing for complex tasks. The model demonstrates notable improvements in coding, particularly in front-end development and full-stack updates, and excels in agentic workflows, where it can autonomously navigate multi-step processes.
Claude 3.7 Sonnet with extended thinking enabled, trading response latency for step-by-step reasoning on complex tasks.
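A minimal sketch of enabling that extended thinking via the Anthropic Python SDK follows; the model identifier and token budgets are assumptions to verify against Anthropic's documentation.

```python
# Extended-thinking sketch with the Anthropic Python SDK.
# The model ID and budgets below are assumptions, not authoritative values.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",                   # assumed model identifier
    max_tokens=2048,                                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},  # opt in to step-by-step reasoning
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)

# The reply interleaves "thinking" blocks with the final "text" blocks.
for block in response.content:
    print(block.type)
```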
Claude 3.5 Sonnet is a high-speed, cost-effective model offering industry-leading performance in reasoning, knowledge, and coding. It operates twice as fast as its predecessor. Key features include a stronger grasp of nuance and humor, advanced coding capabilities, and strong visual reasoning.
Claude 3.5 Haiku offers enhanced speed, coding accuracy, and tool use. Engineered for real-time applications, it delivers the quick response times essential for dynamic tasks such as chat interactions and immediate coding suggestions.
Anthropic's Claude 3 Opus can handle complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks. It provides top-level performance, intelligence, fluency, and understanding.
Gemini 2.5 Pro is Google's state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.
Gemini 2.5 Flash is Google's first fully hybrid reasoning model that allows developers to toggle thinking capabilities on or off according to their needs, offering enhanced reasoning abilities while maintaining the speed and cost-effectiveness of its predecessor.
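As a sketch of that on/off toggle, the snippet below uses the google-genai Python SDK; the model name and config fields are assumptions based on this description, with a thinking budget of 0 standing in for "thinking off".

```python
# Thinking-toggle sketch with the google-genai SDK (assumes an API key in the environment).
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",  # assumed model identifier
    contents="Classify this ticket as bug, feature request, or question: 'App crashes on login.'",
    config=types.GenerateContentConfig(
        # A budget of 0 skips thinking for latency-sensitive calls; raise it to enable reasoning.
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
print(response.text)
```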
Gemini 2.0 Flash builds on the success of 1.5 Flash, offering improved performance and twice the speed of 1.5 Pro on key benchmarks. It supports multimodal inputs like images, video, and audio, as well as outputs such as generated images, text, and multilingual text-to-speech. Additionally, it can natively integrate with tools like Google Search, execute code, and use third-party functions.
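A short sketch of the native Google Search integration mentioned above, again with the google-genai SDK; the exact tool wiring is an assumption drawn from that description.

```python
# Search-grounding sketch with the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What changed in the latest stable Chrome release?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())]  # ground the answer in live search results
    ),
)
print(response.text)
```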
Gemini 2.0 Flash Thinking is an experimental model that exposes its reasoning through transparent thought processes. It shows its planning steps as it works, handles complex problems quickly, and offers expanded functionality, letting users observe the model's reasoning in real time while receiving fast, sophisticated solutions.
Gemini 1.5 Pro is a multimodal AI model developed by Google DeepMind. It excels at processing and understanding text, images, audio, and video, and features a long context window of up to 1 million tokens. The model powers generative AI services across Google's platforms and is available to third-party developers.
DeepSeek R1 is a model developed by DeepSeek and released as a competitor to OpenAI's o1. It emphasizes strong reasoning capabilities in areas such as complex math, coding, and logic, and it combines transparency with competitive performance, making it a significant step forward in open-source AI development.
DeepSeek-V3 is the latest open-source model from DeepSeek. It has outperformed other open-source models such as Qwen2.5-72B and Llama-3.1-405B in various evaluations, and its performance is on par with leading closed-source models like GPT-4o and Claude 3.5 Sonnet.
Grok 3 is the most advanced model from xAI. Grok 3 displays significant improvements in reasoning, mathematics, coding, world knowledge, and instruction-following tasks.
Grok-2 is xAI's frontier language model with state-of-the-art reasoning capabilities.
The Llama 4 collection consists of natively multimodal models that enable text and multimodal experiences. These models use a mixture-of-experts architecture to deliver industry-leading performance in text and image understanding.
Meta Llama 3.3 is a pretrained and instruction-tuned multilingual large language model (LLM) available in a 70B size (text in/text out). The instruction-tuned, text-only model is optimized for multilingual dialogue use cases and outperforms many available open-source and closed chat models on common industry benchmarks.
Mistral Large is a cutting-edge language model developed by Mistral AI, renowned for its advanced reasoning capabilities. It excels in multilingual tasks, code generation, and complex problem-solving, making it ideal for diverse text-based applications.
Pixtral Large is a 124B open-weights multimodal model built on top of Mistral Large 2. The model is able to understand documents, charts and natural images.
Gemma 2 is a state-of-the-art, lightweight open model developed by Google, available in 9 billion and 27 billion parameter sizes. It offers enhanced performance and efficiency, building on the technology used in the Gemini models. Designed for a wide range of applications, Gemma 2 excels in text-to-text tasks, making it a versatile tool for developers.
The Perplexity Sonar Online model is a state-of-the-art large language model developed by Perplexity AI. It offers real-time internet access, ensuring up-to-date information retrieval. Known for its cost-efficiency, speed, and enhanced performance, it surpasses previous models in the Sonar family, making it ideal for dynamic and accurate data processing.
Command A is an open-weights 111B parameter model with a 256k context window focused on delivering great performance across agentic, multilingual, and coding use cases.
Qwen2.5-Max is a large MoE LLM pretrained on massive data and post-trained with curated SFT and RLHF recipes. It achieves competitive performance against top-tier models and outcompetes DeepSeek V3 on benchmarks such as Arena-Hard, LiveBench, LiveCodeBench, and GPQA-Diamond.
Qwen2.5 is a model pretrained on a large-scale dataset of up to 18 trillion tokens, offering significant improvements in knowledge, coding, mathematics, and instruction following compared to its predecessor Qwen2. It also features enhanced capabilities in generating long texts, understanding structured data, and producing structured outputs, while supporting over 29 languages.
QwQ is an experimental research model developed by the Qwen Team to advance AI reasoning capabilities. The model embodies the spirit of philosophical inquiry, approaching problems with genuine wonder and doubt. QwQ demonstrates impressive analytical abilities, achieving scores of 65.2% on GPQA, 50.0% on AIME, 90.6% on MATH-500, and 50.0% on LiveCodeBench, pairing its contemplative approach with exceptional performance on complex problems.
QVQ-Max, the inaugural official visual reasoning model from the Qwen Team, was released in March 2025. This model builds upon their earlier exploratory work with QVQ-72B-Preview. QVQ-Max is engineered to not only "see" content within images and videos but also to analyze and reason based on this visual data. Furthermore, it can generate solutions for diverse challenges, spanning mathematical problems, everyday scenarios, programming code, and artistic endeavors. Although this marks its first iteration, QVQ-Max showcases considerable promise as a practical visual agent equipped with both strong visual perception and analytical capabilities.
Doubao 1.5 Pro is a large language model developed in China by ByteDance's Doubao team.
The Yi Large model was designed by 01.AI with the following use cases in mind: knowledge search, data classification, human-like chatbots, and customer service. It stands out for its multilingual proficiency, particularly in Spanish, Chinese, Japanese, German, and French.
WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to leading proprietary models and consistently outperforms existing state-of-the-art open-source models.
Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the-art performance on key benchmarks including visual question answering (TextVQA) and video understanding (VATEX).