TL;DR

  • Baidu’s ERNIE-4.5-VL-28B-A3B-Thinking uses a mixture-of-experts architecture that activates roughly 3 billion of its 28 billion parameters per token
  • Baidu claims performance matching or exceeding Google Gemini 2.5 Pro and OpenAI GPT-5-High on document and chart understanding benchmarks
  • Released under the Apache 2.0 licence, permitting unrestricted commercial use, and deployable on a single 80GB GPU

Efficient Architecture Targets Enterprise Deployment

Baidu has released ERNIE-4.5-VL-28B-A3B-Thinking, a vision-language model employing a mixture-of-experts (MoE) architecture to achieve competitive performance whilst consuming significantly less compute than larger competitors. The model holds 28 billion total parameters but activates only about 3 billion per token through selective expert routing, enabling deployment on a single 80GB GPU, hardware readily available in corporate data centres.
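To make the routing idea concrete, the toy PyTorch layer below shows how a gating network selects a small top-k subset of experts for each token, so only a fraction of the layer’s parameters runs per token. This is a generic illustration of MoE routing, not Baidu’s actual implementation; all names and sizes here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router picks top-k experts per token,
    so only a small fraction of the layer's parameters is active at once."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)            # gating network
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        gate_logits = self.router(x)                            # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)     # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = TinyMoELayer(d_model=64, n_experts=8, top_k=2)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])
```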

The model’s distinctive “Thinking with Images” capability lets it zoom in and out of an image dynamically to inspect fine-grained details, departing from traditional fixed-resolution processing. This approach targets enterprise applications requiring both broad context and granular detail, including complex technical diagram analysis and manufacturing quality control defect detection. Baidu claims the system matches or exceeds the performance of substantially larger Google and OpenAI models on tasks involving document understanding, chart analysis, and visual reasoning.
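As a rough sketch of what such a capability involves at the tooling level, a crop-and-rescale helper like the one below could be exposed to a model in a tool-calling loop so it can re-inspect a region at higher effective resolution. This is a hypothetical illustration using Pillow; the `zoom_region` helper is not part of Baidu’s release.

```python
from PIL import Image

def zoom_region(image: Image.Image, box: tuple[int, int, int, int],
                out_size: tuple[int, int] = (896, 896)) -> Image.Image:
    """Crop a region of interest and rescale it so a vision-language model can
    re-examine fine-grained detail (small text, defects) at higher resolution."""
    return image.crop(box).resize(out_size, Image.LANCZOS)

# Hypothetical usage inside a tool-calling loop: the model proposes a bounding
# box, the runtime returns the zoomed crop as a new image turn.
# detail = zoom_region(Image.open("schematic.png"), box=(1200, 800, 1600, 1100))
```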

Strategic Licensing and Enterprise Positioning

Baidu released the model under the permissive Apache 2.0 licence, allowing unrestricted commercial use without ongoing licensing fees or usage restrictions. This contrasts with more restrictive competitor approaches and potentially accelerates enterprise adoption by eliminating deployment friction. The company provides development tooling through ERNIEKit, with compatibility across popular frameworks including Hugging Face Transformers and the vLLM inference engine.
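A minimal loading sketch via the Hugging Face Transformers route might look like the following. The exact repository ID, processor class, and dtype settings are assumptions to verify against the published model card rather than confirmed details from Baidu’s documentation.

```python
# Assumes the checkpoint is published on Hugging Face under an ID like
# "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"; check the model card for the exact
# repo name and recommended processor before use.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # bf16 weights keep the 28B MoE within a single 80GB GPU
    device_map="auto",
    trust_remote_code=True,
)
```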

The technical documentation describes advanced training techniques, including “multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling.” The model forms part of Baidu’s broader ERNIE 4.5 family, comprising 10 variants that range from a 424-billion-parameter flagship to compact 0.3-billion-parameter dense models.
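Baidu has not published the exact scheme behind “dynamic difficulty sampling,” but one plausible reading is weighting training tasks toward those the model currently solves about half the time, so examples are neither trivial nor hopeless. The snippet below is a conceptual sketch of that reading only, with made-up task names and success rates.

```python
import random

def sample_task(tasks, success_rate, temperature=0.5):
    """Sample a training task, favouring those near a 50% success rate
    (one plausible reading of 'dynamic difficulty sampling')."""
    weights = [
        max(1e-3, 1.0 - abs(success_rate[t] - 0.5) / 0.5) ** (1 / temperature)
        for t in tasks
    ]
    return random.choices(tasks, weights=weights, k=1)[0]

# Hypothetical usage: success rates tracked per task from recent rollouts.
rates = {"chart_qa": 0.9, "doc_parse": 0.55, "grounding": 0.1}
print(sample_task(list(rates), rates))  # usually prints "doc_parse"
```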

Looking Forward

Baidu’s performance claims require independent verification before industry acceptance. Benchmark performance often fails to capture real-world behaviour across diverse enterprise scenarios, with models excelling in specific domains potentially struggling in others. The minimum 80GB GPU memory requirement, whilst more accessible than multi-GPU setups, still represents substantial infrastructure investment for organisations lacking existing GPU resources. Success depends on whether claimed efficiency advantages and open licensing overcome adoption barriers in enterprises evaluating vision-language systems for production deployment.


Article based on reporting by VentureBeat
