
metadata

license: other


The Aquila language model is the first open-source language model that combines Chinese and English knowledge, a license that permits commercial use, and compliance with China's domestic data regulations.

🌟 Supports open-source commercial licensing. The source code of the Aquila series models is released under the Apache 2.0 license, while the model weights are covered by the BAAI Aquila Model License Agreement. Commercial use is permitted as long as the license restrictions are met.
✍️ Possesses Chinese and English knowledge. The Aquila series models are trained from scratch on a high-quality corpus of Chinese and English text, with Chinese accounting for about 40% of the corpus. This ensures that the models acquire native Chinese world knowledge during pre-training rather than knowledge translated from other languages.
👮‍♀️ Complies with domestic data regulations. The Chinese corpora of the Aquila series models come from the Chinese datasets that BAAI has accumulated over the years, including Chinese internet data from over 10,000 sources (more than 99% of which are domestic), as well as high-quality Chinese literature and book data supported by authoritative domestic organizations. We will continue to accumulate high-quality, diverse datasets and incorporate them into subsequent training of the Aquila base models.
🎯 Continuous improvement and open sourcing. We will keep improving the training data, optimizing training methods, and enhancing model performance, growing a flourishing "model tree" on a stronger base model and continuously updating the open-source releases.

Additional details of the Aquila model will be presented in the official technical report. Please stay tuned for updates on the official channels, including the FlagAI GitHub repository, FlagAI's Zhihu account, and FlagAI's official technical communication group.

Model	Model Type	Description	Status	GPUs Used
AquilaSQL-7B	chat model	Text2SQL model, trained further from the AquilaCode-base model; AquilaSQL achieved state-of-the-art results on the CSpider leaderboard	published	Nvidia-A100

We will continue to release improved versions of the Aquila model as open source. For the history of changes, see the change log (https://huggingface.co/BAAI/AquilaSQL-7B/blob/main/change_log.log).

Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = torch.device("cuda")
model_info = "BAAI/AquilaSQL-7B"

tokenizer = AutoTokenizer.from_pretrained(model_info, trust_remote_code=True)
# device_map='auto' already places the fp16 weights on the available device(s),
# so a separate model.to(device) call afterwards is not needed.
model = AutoModelForCausalLM.from_pretrained(
    model_info, trust_remote_code=True, torch_dtype=torch.float16, device_map='auto')

model.eval()
torch.manual_seed(123)

text = "有多个数据库表，信息如下：\n表名为cars_data，包含的属性为cars_data.horsepower,cars_data.accelerate,cars_data.mpg,cars_data.id,cars_data.year;表名为continents，包含的属性为continents.contid,continents.continent;表名为countries，包含的属性为countries.continent,countries.countryname,countries.countryid;表名为model_list，包含的属性为model_list.model,model_list.maker,model_list.modelid，它们之间的关系为 countries.continent = continents.contid\n请为下面的问题编写sql查询语句：\n加速度比马力最大的汽车更大的汽车有多少辆？ "

def generate_prompt(question: str) -> str:
    # Wrap the question in the "###Human: ... ###Assistant:" chat template.
    prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: {question}###Assistant:"
    return prompt

stop_tokens = ["###", "[UNK]", "</s>","<|endoftext|>"]

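with torch.no_grad():
    _input = generate_prompt(text)
    # Tokenize the prompt and add a batch dimension.
    tokens = tokenizer.encode_plus(_input, None, max_length=None)['input_ids']
    tokens = torch.tensor(tokens)[None,].to(device)
    # Greedy decoding. Only max_new_tokens is set: passing max_length as well is
    # redundant and, depending on the transformers version, triggers a warning or an error.
    out = model.generate(tokens, do_sample=False, eos_token_id=100007, max_new_tokens=512,
                         bad_words_ids=[[tokenizer.encode(token)[0] for token in stop_tokens]])[0]
    # The decoded text contains the prompt followed by the generated answer.
    out = tokenizer.decode(out.cpu().numpy().tolist())
    print(out)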
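
The example question above follows a fixed layout: a list of tables with their qualified column names, the relations between tables, and then the question. As a minimal sketch, the two helpers below show one way to assemble that text from a schema and to pull the answer out of the decoded output. Both `build_schema_prompt` and `extract_sql` are hypothetical names, not part of the Aquila release. The layout is inferred from the single example in this README, and joining several relations with commas is an assumption. The extraction step assumes the decoded output echoes the prompt before the answer, as is usual for causal language models.

```python
from typing import Dict, List, Sequence

# Same stop strings as in the inference example above.
STOP_TOKENS = ["###", "[UNK]", "</s>", "<|endoftext|>"]


def build_schema_prompt(tables: Dict[str, List[str]],
                        relations: Sequence[str],
                        question: str) -> str:
    """Assemble the schema description and question (layout inferred from the README example)."""
    table_parts = [
        f"表名为{name}，包含的属性为" + ",".join(f"{name}.{col}" for col in cols)
        for name, cols in tables.items()
    ]
    text = "有多个数据库表，信息如下：\n" + ";".join(table_parts)
    if relations:
        # Assumption: several relations are separated by commas.
        text += "，它们之间的关系为 " + ",".join(relations)
    return text + "\n请为下面的问题编写sql查询语句：\n" + question + " "


def extract_sql(decoded: str, stop_tokens: Sequence[str] = STOP_TOKENS) -> str:
    """Return the text after the last '###Assistant:' marker, cut at the earliest stop string."""
    answer = decoded.rsplit("###Assistant:", 1)[-1]
    for token in stop_tokens:
        idx = answer.find(token)
        if idx != -1:
            answer = answer[:idx]
    return answer.strip()


if __name__ == "__main__":
    schema = {
        "continents": ["contid", "continent"],
        "countries": ["continent", "countryname", "countryid"],
    }
    print(build_schema_prompt(schema, ["countries.continent = continents.contid"], "有多少个国家？"))
```

The result of `build_schema_prompt` can be passed to `generate_prompt` in place of the hand-written `text`, and `extract_sql(out)` can be applied to the decoded string instead of printing it whole.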

License

The AquilaSQL-7B open-source model is licensed under the BAAI Aquila Model License Agreement.