
Quantizations of https://huggingface.co/internlm/internlm2_5-7b-chat-1m
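
For a quick local test of one of the GGUF files, llama-cpp-python can load them directly. A minimal sketch; the quantization filename below is only an assumed example, so substitute the file you actually downloaded:

from llama_cpp import Llama

# Hypothetical filename: replace with the quantization you downloaded.
llm = Llama(
    model_path="internlm2_5-7b-chat-1m-Q4_K_M.gguf",
    n_ctx=8192,        # context window for this session, not the full 1M
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])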

Inference Clients/UIs


From original readme

Introduction

InternLM2.5 has open-sourced a 7-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:

  • Outstanding reasoning capability: State-of-the-art performance on Math reasoning, surpassing models like Llama3 and Gemma2-9B.

  • 1M Context window: Nearly perfect at finding needles in the haystack with 1M-long context, with leading performance on long-context tasks like LongBench. Try it with LMDeploy for 1M-context inference and a file chat demo.

  • Stronger tool use: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation will be released in Lagent soon. InternLM2.5 has stronger tool-use capabilities in instruction following, tool selection, and reflection. See examples.

LMDeploy

Since Hugging Face Transformers does not directly support inference with a 1M-long context, we recommend using LMDeploy. The conventional usage with Hugging Face Transformers is also shown below.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

Here is an example of 1M-long context inference. Note: 1M context length requires 4xA100-80G!

pip install lmdeploy

You can run batch inference locally with the following python code:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
        rope_scaling_factor=2.5,
        session_len=1048576,  # 1M context length
        max_batch_size=1,
        cache_max_entry_count=0.7,
        tp=4)  # 4xA100-80G.
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
response = pipe(prompt)
print(response)
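
The GenerationConfig imported above can optionally be passed to the pipeline to control sampling; a minimal sketch with illustrative values:

gen_config = GenerationConfig(
    max_new_tokens=1024,  # cap the length of the generated answer
    top_p=0.8,
    temperature=0.7)
response = pipe(prompt, gen_config=gen_config)
print(response)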

Find more details in the LMDeploy documentation.

Import from Transformers

Since Transformers does not support the 1M-long context, we only show usage with a standard context length. To load the InternLM2.5 7B Chat model using Transformers, use the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat-1m", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat-1m", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
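
Alternatively, the prompt can be built with the tokenizer's chat template and generated through the standard Transformers API instead of the remote-code chat helper. A minimal sketch, assuming the tokenizer ships a chat template and reusing the model and tokenizer loaded above:

messages = [{"role": "user", "content": "hello"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# Illustrative generation settings; adjust as needed.
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))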

The responses can be streamed using stream_chat:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "internlm/internlm2_5-7b-chat-1m"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.eval()
length = 0
for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
    print(response[length:], flush=True, end="")
    length = len(response)

vLLM

Launch an OpenAI-compatible server with vLLM >= 0.3.2:

pip install vllm
python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code

If you encounter an out-of-memory (OOM) error, try reducing --max-model-len or increasing --tensor-parallel-size.

Then you can send a chat request to the server:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "internlm2_5-7b-chat-1m",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce deep learning to me."}
    ]
    }'
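
The same endpoint can also be queried from Python with the openai client package pointed at the local server; the api_key value is a placeholder, since vLLM does not check it by default:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="internlm2_5-7b-chat-1m",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce deep learning to me."},
    ],
)
print(completion.choices[0].message.content)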