---
license: apache-2.0
language:
- en
---

# CogAgent

### Reminder: This is the repository for the [SAT (SwissArmyTransformer)](https://github.com/THUDM/SwissArmyTransformer/) version of CogAgent.

### Please refer to [https://huggingface.co/THUDM/cogagent-chat-hf](https://huggingface.co/THUDM/cogagent-chat-hf) for the Hugging Face version of CogAgent.

## Introduction

**CogAgent** is an open-source visual language model improved upon **CogVLM**. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters.

📖 Paper: https://arxiv.org/abs/2312.08914

🚀 GitHub: For more information, please refer to [our GitHub](https://github.com/THUDM/CogVLM/)

CogAgent demonstrates **strong performance** in image understanding and as a GUI agent:

1. CogAgent-18B **achieves state-of-the-art generalist performance on 9 cross-modal benchmarks**: VQAv2, MM-Vet, POPE, ST-VQA, OK-VQA, TextVQA, ChartQA, InfoVQA, and DocVQA.
2. CogAgent-18B significantly **surpasses existing models on GUI operation datasets**, including AITW and Mind2Web.

In addition to all the **features** already present in **CogVLM** (visual multi-round dialogue, visual grounding), **CogAgent**:

1. Supports higher-resolution visual input and dialogue question answering, accepting ultra-high-resolution image inputs of **1120×1120**.
2. Possesses the capabilities of a visual agent, returning a plan, the next action, and specific operations with coordinates for any given task on any GUI screenshot.
3. Offers enhanced GUI-related question answering, handling questions about any GUI screenshot, such as web pages, PC apps, and mobile applications.
4. Delivers enhanced capabilities in OCR-related tasks through improved pre-training and fine-tuning.
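
The 1120×1120 input resolution mentioned above can be illustrated with a minimal preprocessing sketch. Note that `preprocess_screenshot` is a hypothetical helper for illustration only; in practice, the model's bundled image processor (in the SAT or Hugging Face code) handles resizing and normalization.

```python
from PIL import Image

# CogAgent's ultra-high-resolution input size (see model card above)
TARGET_SIZE = (1120, 1120)

def preprocess_screenshot(img: Image.Image) -> Image.Image:
    """Resize a screenshot to the 1120x1120 resolution CogAgent expects.

    Illustrative sketch only: the real pipeline also normalizes pixel
    values before feeding the image to the vision encoder.
    """
    return img.convert("RGB").resize(TARGET_SIZE, Image.BICUBIC)

# Example with a synthetic 1920x1080 "screenshot"
demo = Image.new("RGB", (1920, 1080), color="white")
processed = preprocess_screenshot(demo)
print(processed.size)  # (1120, 1120)
```

Note that resizing a 16:9 screenshot to a square changes the aspect ratio; the model's own preprocessing determines how coordinates returned by the agent map back to the original screen.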