By: Jeffrey Kondas, Technology Fellow, with Grok xAI
Alibaba has recently introduced several new AI models under its Qwen series, the latest being Qwen 2.5-Max and a family of vision-language models named Qwen2.5-VL. Here is a detailed overview based on the latest public information, along with a comparison to the leading AI models on the market:
Qwen 2.5-Max:
- Performance Claims:
- Alibaba claims that Qwen 2.5-Max outperforms OpenAI’s GPT-4o, DeepSeek’s V3, and Meta’s Llama 3.1-405B across a range of benchmarks covering problem-solving, coding, and math. It also performed on par with Anthropic’s Claude 3.5 Sonnet in some areas.
- Market Impact:
- Following the announcement, Alibaba’s stock saw significant gains, suggesting market approval of the new model’s capabilities.
- Tech Specifications:
- Alibaba has described Qwen 2.5-Max as a large-scale Mixture-of-Experts (MoE) model, but its full architecture details and parameter count have not been disclosed.
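For readers who want to try the model directly, below is a minimal sketch of querying the hosted Qwen 2.5-Max. Alibaba Cloud Model Studio documents an OpenAI-compatible endpoint; the base URL and the qwen-max model name shown here follow that documentation at the time of writing, but verify both before relying on them.

```python
# Minimal sketch: querying Qwen 2.5-Max through Alibaba Cloud Model Studio's
# OpenAI-compatible endpoint. Base URL and model name follow Alibaba's public
# docs at the time of writing; treat both as assumptions and verify them.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],  # issued by Alibaba Cloud Model Studio
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen-max",  # the hosted Qwen 2.5-Max tier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the trade-offs of MoE architectures."},
    ],
)
print(response.choices[0].message.content)
```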
Qwen2.5-VL Series:
- Capabilities:
- These models are capable of parsing files, understanding videos, counting objects in images, and even controlling PCs and mobile devices. They can perform tasks similar to OpenAI’s Operator by interacting with software applications.
- Benchmarking:
- According to Alibaba, Qwen2.5-VL models have shown superior performance in video understanding, math, document analysis, and question-answering when compared to GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash.
- Applications:
- They can analyze charts, extract data from invoices and forms, comprehend long videos, and recognize intellectual property from movies and TV series. A notable feature is their ability to book flights or perform other tasks directly through app interfaces (see the API sketch after this list).
- Availability:
- The smaller models (Qwen2.5-VL-3B and Qwen2.5-VL-7B) are available under a permissive license, while the flagship model (Qwen2.5-VL-72B) operates under Alibaba’s custom license, which has specific commercial use restrictions for large enterprises.
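As a concrete illustration of the document-analysis capability described above, here is a hedged sketch of asking a Qwen2.5-VL model to pull structured fields out of an invoice image through the same OpenAI-compatible endpoint. The model identifier is an assumption based on Alibaba’s published naming scheme, and the image URL is a placeholder.

```python
# Hedged sketch: invoice field extraction with a Qwen2.5-VL model via the
# OpenAI-compatible endpoint. The model identifier is assumed from Alibaba's
# naming scheme; check the Model Studio catalog for the exact name.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen2.5-vl-72b-instruct",  # assumed hosted name for the flagship VL model
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample-invoice.png"}},  # placeholder
            {"type": "text",
             "text": "Extract the vendor name, invoice number, and total as JSON."},
        ],
    }],
)
print(response.choices[0].message.content)
```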
Strategic Context:
- Competitive Landscape:
- Alibaba’s move is seen as a response to the rapid advancements and market disruptions caused by competitors like DeepSeek in China and international players like OpenAI and Meta. The timing of Qwen 2.5-Max’s release during the Lunar New Year underscores the competitive pressure.
- AI Price Wars:
- Alibaba, along with other Chinese tech companies, has been part of an AI price war, reducing costs to attract more users and developers, which is crucial for expanding their AI ecosystem.
- Open-Source Strategy:
- Alibaba has taken a hybrid approach, offering both proprietary and open-source models to cater to a broad audience and encourage wider adoption and contributions from the global AI community.
Qwen 2.5-Max Overview:
Features:
- Coding:
- Qwen 2.5-Max has shown competitive performance in coding tasks. On benchmarks like HumanEval and MBPP it reportedly scores 73.2 and 80.6 respectively, which puts it on par with or slightly ahead of models like DeepSeek V3 and significantly ahead of Llama 3.1-405B. This points to capability not just in code generation but in understanding and resolving coding problems (a toy sketch of how pass@1-style scoring works appears after this feature list).
- Prompt and Response Token Limits:
- Qwen 2.5-Max was pretrained on over 20 trillion tokens, one of the largest training corpora publicly reported; note that this figure describes training data volume, not context length. The operational context window for user interaction is 131,072 tokens (128K).
- Character Count: As a rough rule, one token of English text corresponds to about 3-4 characters, so 131,072 tokens translate to approximately 393,216 to 524,288 characters (the sketch below makes the arithmetic explicit).
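The character estimates quoted here and for the other models below follow from simple multiplication; this tiny sketch makes the arithmetic explicit. The 3-4 characters-per-token ratio is a rough heuristic for English text, not a property of any particular tokenizer.

```python
# Back-of-the-envelope token-to-character estimates used throughout this piece.
# The 3-4 chars/token ratio is a heuristic for English text, not a tokenizer
# guarantee.
def char_range(tokens: int, low: float = 3.0, high: float = 4.0) -> tuple[int, int]:
    """Return the approximate (min, max) character count for a token budget."""
    return int(tokens * low), int(tokens * high)

for name, window in [("Qwen 2.5-Max", 131_072), ("GPT-4o", 128_000),
                     ("Claude 3.5 Sonnet", 200_000), ("Gemini (2M tier)", 2_000_000)]:
    lo, hi = char_range(window)
    print(f"{name}: {window:,} tokens ≈ {lo:,}-{hi:,} characters")
```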
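For readers unfamiliar with how scores like the HumanEval figure above are produced, here is a toy illustration of pass@1-style scoring: each model-generated solution is executed against the benchmark’s unit test, and the score is the fraction of tasks that pass. The task and candidate solution below are invented for the example.

```python
# Toy illustration of HumanEval-style pass@1 scoring: run each model-generated
# solution against its unit test and report the fraction that pass. The sample
# task and candidate here are invented for the example.
samples = [
    {
        "candidate": "def add(a, b):\n    return a + b",
        "test": "assert add(2, 3) == 5",
    },
]

def passes(sample: dict) -> bool:
    scope: dict = {}
    try:
        exec(sample["candidate"], scope)   # load the generated function
        exec(sample["test"], scope)        # run the benchmark's unit test
        return True
    except Exception:
        return False

pass_at_1 = sum(passes(s) for s in samples) / len(samples)
print(f"pass@1 = {pass_at_1:.0%}")
```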
Comparison to Market AI:
- Qwen 2.5-Max stands out with its large training dataset, which might contribute to its performance in coding and problem-solving tasks. Its token limit is competitive within the industry, though not the largest, indicating a balance between capability and efficiency.
- GPT-4o (OpenAI):
- Coding: Excels in both code generation and problem-solving, scoring around 69.2 on HumanEval in the figures cited here (reported numbers vary by evaluation setup); it is particularly noted for versatility across languages and frameworks.
- Token Limits: Supports up to 128,000 tokens in its latest version, translating to around 384,000 to 512,000 characters.
- Source: OpenAI’s GPT-4o Capabilities
- Claude 3.5 Sonnet (Anthropic):
- Coding: Known for its accuracy and contextual understanding, scoring around 80% on code-related tasks, with an emphasis on ethical coding practices.
- Token Limits: Can handle up to 200,000 tokens, making it suitable for long-form content analysis, roughly 600,000 to 800,000 characters.
- Source: Anthropic’s Claude 3.5 Sonnet Announcement
- DeepSeek V3 (DeepSeek AI):
- Coding: Specifically designed to excel in coding tasks, it performs well in benchmarks like DS-FIM-Eval and DS-Arena-Code but lags behind Qwen 2.5-Max in some areas.
- Token Limits: Similar to Qwen 2.5-Max with a 128K token context window, translating to about 384,000 to 512,000 characters.
- Source: DeepSeek AI Blog on V3
- Gemini (Google):
- Coding: While versatile, it’s noted for its integration capabilities rather than being a top performer in coding benchmarks.
- Token Limits: Gemini’s largest model supports up to 2 million tokens, equating to around 6 million to 8 million characters, significantly outpacing others in context handling.
- Source: Google DeepMind Gemini
- Compared with its competitors, Qwen 2.5-Max offers strong coding capability, matching or outperforming high-profile models like GPT-4o and Claude 3.5 Sonnet on the benchmarks cited above. Its token limit is well-suited to most practical applications, though models like Gemini push the boundary for handling extremely long contexts.
Disclaimer: Token to character conversion is approximate and can vary based on the text’s nature. The data here is based on the latest public information, which might evolve with new updates from these companies.
Check It: