---
license: cc-by-nc-4.0
base_model: Qwen/Qwen2-7B-Instruct
model-index:
  - name: Dolphin
    results: []
tags:
  - RAG
  - on-device language model
  - Retrieval Augmented Generation
inference: false
space: false
spaces: false
language:
  - en
---

# Dolphin: Long Context as a New Modality for On-Device RAG

- Nexa Model Hub
- ArXiv


## Overview

Dolphin is a novel approach to accelerating language model inference that treats long context as a new modality, analogous to the image, audio, and video modalities in vision-language models. It incorporates a language encoder that compresses context information into embeddings, applying multimodal-model concepts to improve the efficiency of language model inference.

Model highlights:

- 🧠 Context as a distinct modality
- 🗜️ Language encoder for context compression
- 🔗 Multimodal techniques applied to language processing
- ⚡ Optimized for energy efficiency and on-device use
- 📜 Specialized for long context understanding

## Model Architecture

Dolphin employs a decoder-decoder framework with two main components:

1. A smaller decoder (0.5B parameters) that compresses information from extensive contexts into embeddings
2. A larger decoder (7B parameters) that comprehends the current query and generates the response

A projector aligns the embeddings produced by the smaller decoder with the embedding space of the main decoder; an illustrative sketch of this design is given in the appendix at the end of this card.

![Model Architecture](modelstructure.jpg)

## Running the Model

```python
import time

from transformers import AutoTokenizer

# configuration_dolphin.py ships with the model repository.
from configuration_dolphin import DolphinForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    'nexa-collaboration/dolphin_instruct_1M_0805', trust_remote_code=True)
model = DolphinForCausalLM.from_pretrained(
    'nexa-collaboration/dolphin_instruct_1M_0805', trust_remote_code=True)

def inference(input_text):
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

input_text = "Take a selfie for me with front camera"
nexa_query = f"Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: {input_text} \n\nResponse:"

start_time = time.time()
result = inference(nexa_query)
print("Dolphin model result:\n", result)
print("Latency:", time.time() - start_time, "s")
```

## Training Process

Dolphin is trained in three stages:

1. **Restoration Training**: reconstructing the original context from compressed embeddings
2. **Continual Training**: generating context continuations from partially compressed contexts
3. **Instruction Fine-tuning**: generating responses to queries given compressed contexts

This multi-stage approach progressively strengthens the model's ability to handle long contexts and generate appropriate responses; a sketch of the three objectives also appears in the appendix.

## Citation

If you use Dolphin in your research, please cite our paper:

```bibtex
@article{dolphin2024,
  title={Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models},
  author={[Author Names]},
  journal={arXiv preprint arXiv:[paper_id]},
  year={2024}
}
```

## Contact

For questions or feedback, please [contact us](mailto:octopus@nexa4ai.com).
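## Appendix: Illustrative Sketches

For a concrete picture of the decoder-decoder design described under Model Architecture, below is a minimal, hypothetical sketch in plain PyTorch. Everything here (the class names, hidden sizes, pooling-based compression, and layer counts) is an assumption made for illustration and does not correspond to the released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only; the real system pairs a
# 0.5B-parameter context model with a 7B-parameter main decoder.
ENCODER_DIM = 512     # hidden size of the small context model
DECODER_DIM = 1024    # hidden size of the main decoder
COMPRESS_RATIO = 8    # context tokens folded into one memory embedding

class ContextEncoder(nn.Module):
    """Stand-in transformer stack that compresses long-context token
    embeddings into a short sequence of memory embeddings."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=ENCODER_DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        # Pool every COMPRESS_RATIO positions into one embedding.
        self.pool = nn.AvgPool1d(kernel_size=COMPRESS_RATIO)

    def forward(self, context_embeds):            # (B, T, ENCODER_DIM)
        hidden = self.blocks(context_embeds)
        pooled = self.pool(hidden.transpose(1, 2)).transpose(1, 2)
        return pooled                             # (B, T // COMPRESS_RATIO, ENCODER_DIM)

class Projector(nn.Module):
    """Aligns compressed context embeddings with the main decoder's
    embedding space, mirroring projectors in vision-language models."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(ENCODER_DIM, DECODER_DIM),
            nn.GELU(),
            nn.Linear(DECODER_DIM, DECODER_DIM),
        )

    def forward(self, x):
        return self.proj(x)

# The main decoder then consumes [projected context memory ; query embeddings]
# as one input sequence, sketched here with random tensors:
encoder, projector = ContextEncoder(), Projector()
context = torch.randn(1, 64, ENCODER_DIM)          # 64 long-context token embeddings
query = torch.randn(1, 16, DECODER_DIM)            # 16 query token embeddings
memory = projector(encoder(context))               # (1, 8, DECODER_DIM)
decoder_input = torch.cat([memory, query], dim=1)  # (1, 24, DECODER_DIM)
print(decoder_input.shape)
```

The design point this sketch tries to capture is that the expensive long context only ever passes through the small model, so the large decoder sees a sequence shortened by roughly the compression ratio.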
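The three training stages can likewise be pictured as variants of a single next-token objective over different targets. The helper below and the way targets are masked are assumptions for illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, targets, ignore_index=-100):
    """Shared next-token cross-entropy; each stage differs only in what
    the targets are and which positions contribute to the loss."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
        ignore_index=ignore_index,
    )

# Toy shapes: batch of 2, sequence of 10, vocabulary of 100.
logits = torch.randn(2, 10, 100)
tokens = torch.randint(0, 100, (2, 10))

# Stage 1 - Restoration Training: conditioned only on compressed context
# embeddings, the decoder is trained to reproduce the original context tokens.
restoration_loss = next_token_loss(logits, tokens)

# Stage 2 - Continual Training: a document prefix is compressed and the
# decoder is trained to generate the continuation tokens.
continual_loss = next_token_loss(logits, tokens)

# Stage 3 - Instruction Fine-tuning: conditioned on compressed context plus
# a query, the loss covers only the response positions (the rest are masked).
response_mask = torch.zeros(2, 10, dtype=torch.bool)
response_mask[:, 5:] = True  # pretend the last 5 positions are the response
instruction_loss = next_token_loss(
    logits, tokens.masked_fill(~response_mask, -100))

print(restoration_loss.item(), continual_loss.item(), instruction_loss.item())
```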