license: apache-2.0
language:
- zh
- en
datasets:
- HIT-TMG/TruthReader_RAG_train
pipeline_tag: text-generation
tags:
- chat
Mixtral_13B_Chat_RAG-Reader
Introduction
This is a retrieval-augmented large language model with reliable attribution, designed to serve as a bilingual (English and Chinese) document reading assistant. It is trained to generate accurate answers, produce reliable citations, and refuse unanswerable questions.
Quickstart
RAG QA format
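The RAG question answering format is as follows. The Chinese instruction on the first line means "Please generate an answer to the question based on the given documents. If the documents do not contain the information needed to answer, apologize and give a reason." The label 文档 means "document".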
请基于给定的文档,生成问题的答案。如果文档中没有包含答案的信息,请回复抱歉并给出理由。
# DOCUMENTS:
## 文档[1] this is dummy title 1
this is dummy document text 1
## 文档[2] this is dummy title 2
this is dummy document text 2
# QUESTION: this is a dummy question
# ANSWER:
Note that this prompt should be used as the last-turn message in the conversation history. To handle a multi-turn dialogue with RAG, concatenate the entire history into the messages list, following the "Templates for Chat Models" guidelines.
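The multi-turn layout described here can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `build_messages` is a hypothetical helper name, and it assumes the history is a flat list that starts with a user turn and alternates user/assistant.

```python
from typing import Dict, List


def build_messages(
    history: List[str],
    last_turn_prompt: str,
    system_prompt: str = "You are a helpful assistant.",
) -> List[Dict[str, str]]:
    """Turn a flat, alternating user/assistant history into chat messages.

    Assumption: `history` starts with a user turn and contains complete
    (user, assistant) pairs, so its length must be even.
    """
    if len(history) % 2 != 0:
        raise ValueError("history must contain complete user/assistant pairs")

    messages = [{"role": "system", "content": system_prompt}]
    for i, content in enumerate(history):
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": content})

    # The RAG QA prompt always goes last, as a user message.
    messages.append({"role": "user", "content": last_turn_prompt})
    return messages


example = build_messages(
    history=["Hi, who are you?", "I am a document reading assistant."],
    last_turn_prompt="<RAG QA prompt built from the format above>",
)
for message in example:
    print(message["role"])
```

Rejecting an odd-length history is a design choice of this sketch: it keeps the RAG prompt from landing directly after another user message.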
Code Example
Here is a code snippet using apply_chat_template that builds the RAG QA prompt and produces the complete model input.
from transformers import AutoTokenizer

def generate_qa_prompt(
    question: str,
    documents: list,
    tokenizer: AutoTokenizer,
    max_context_len: int = 3000
) -> str:
    documents_list = []
    for i, item in enumerate(documents):
        title = item.get("title", "")
        text = item["document"]
        # Truncate the title to 100 tokens and the body to max_context_len tokens.
        title = tokenizer.decode(tokenizer.encode(title, add_special_tokens=False)[:100])
        text = tokenizer.decode(tokenizer.encode(text, add_special_tokens=False)[:max_context_len])
        document = f"""## 文档[{i+1}]\t{title}\n{text}"""
        documents_list.append(document)
    documents_string = "\n\n".join(documents_list)
    # Chinese instruction: answer from the given documents; if they lack the answer, apologize and give a reason.
    system = '''请基于给定的文档,生成问题的答案。如果文档中没有包含答案的信息,请回复抱歉并给出理由。'''
    prompt = "{system}\n\n# DOCUMENTS:\n{documents}\n\n# QUESTION: {Question}\n\n# ANSWER: "
    qa_prompt = prompt.format_map({
        "system": system,
        "documents": documents_string,
        "Question": question
    })
    return qa_prompt

tokenizer = AutoTokenizer.from_pretrained("HIT-TMG/Mixtral_13B_Chat_RAG-Reader")
question = "Wakatobi national park and its significance in marine biodiversity."
history = [
"How did go fas racing come into existence, and who were the people involved in its formation?",
"Go fas racing was formed as a result of a merger between frank allen stoddard's fas lane racing and archie st. Hilaire's go green racing in 2014. Frank allen stoddard, the long-time crew chief, initially founded fas lane racing in 2011 after his previous team, latitude 43 motorsports, closed down. He formed the team with remaining crew members and used acquired cars and equipment from multiple sources, such as no fear racing, richard petty motorsports, roush fenway racing, and his old latitude 43 team. Fas is an abbreviation for his initials, francis allen stoddard. He chose the number 32 as a tribute to his racing mentor, stub fadden, a busch north series racer from new england. In 2014, after a few struggling years in team ownership, stoddard merged with archie st. Hilaire's go green racing to create go fas racing. By 2017, the primary team owner was st. Hilaire, and stoddard served as a consultant with mason st. Hilaire as the general manager."
]
documents = [
{
"title": "The Hawaii longline fishery is managed under Western Pacific Regional Fishery Management Council’s (WPRFMC's) Pelagics Fisheries Ecosystem Plan (formerly Pelagics Fisheries Management Plan). Through this plan, the WPRFMC has introduced logbooks, observers, vessel monitoring systems, fishing gear modifications and spatial management for the Hawaii longline fishery. Until relatively recently, the main driver for management of the Hawaii longline fishery has been bycatch and not fishery resources.",
"document": "The revival of the Hawaii longline fleet in the late 1980s meant that larger ocean-going longline vessels began operating from Honolulu. The advent of the new fleet was driven primarily by targeting swordfish, which meant using squid bait on hooks deployed in relatively shallow depths (<30 m) and with light sticks attached to the branch lines. Observers began to be employed on vessels in 1994 and it soon became apparent that in the shallow set fishery there were catches of sea turtles and seabirds. The principal seabirds caught were black-footed and Laysan albatross, and for the turtles, loggerheads and leatherbacks. There were turtle and seabird interactions in the deep set fishery also, but these were one to two orders of magnitude lower than in the shallow set fishery.\n\nSeabird Bycatch Mitigation Development \n\nPrior to 2001, 1380 black footed albatross and 1163 Laysan albatrosses were caught annually by the Hawaii longline fishery. The WPRFMC's response to the volume of seabirds being caught was to mount a project through 1998 and 1999 to test various seabird mitigation methods. It was found that during gear setting operations, blue dyed baits were the most successful mitigation method, followed by strategic offal discards. Tori lines and a towed buoy system also proved to be effective mitigation measures during the set. During hauling operations, blue dyed baited and tori lines were found to be equally effective mitigation strategies, followed by the towed buoy. Retaining offal on the vessel during the haul increased seabird interactions.\n\nThe National Marine Fisheries Service Pacific Islands Fisheries Science Center (NMFS PIFSC) also tested tori lines, blue dyed bait and weighted hooks in 1999, They found that baits dyed blue and baits with additional weight reduced the number of interactions with both black-footed and Laysan albatross. "
"Tori lines reduced contact between baits and albatrosses by 70%\n\nThe WPRFMC's plan for implementing seabird mitigation measures was for an Fishery Management Plan(FMP) amendment where fishermen could choose the measures from a selected list of proven mitigation methods. However, this was forestalled by a 2000 US Fish and Wildlife Service Biological Opinion (BiOp) on the endangered Short-tailed albatross in, which prescribed what seabird mitigation measures would be used by the tuna-targeting (deep sets) and by swordfish (shallow sets) as follows:\n\nSummary of seabird deterrent measures by set type \n\nThe WPRFMC incorporated these measures into a Pelagics FMP amendment in 2002, requiring that these seabird mitigation measures be used when fishing north of 23 deg N. This measure was further refined in 2006 by an FMP amendment that allowed operators of Hawaii-based longline vessels fishing north of 23 degrees north latitude, as well as those targeting swordfish south of 23 degrees north, to utilize side-setting to reduce seabird interactions in lieu of the seabird mitigation already measures required.\n\nThe implementation of the seabird measures caused a massive drop in seabird interactions by more than 90% in the Hawaii longline fishery.\n\nSea Turtle Bycatch Mitigation Development \n\nDespite low observer coverage, usually 5% or less, it was estimated that prior to 2001, a total of 666 turtles were caught annually in the Hawaii longline fishery: 418 loggerheads, 146 olive ridleys, 112 leatherbacks and 40 green turtles.\n\nUnlike the seabird issue, the solutions for sea turtles were propelled initially by litigation by environmental organizations which resulted in a complete closure of the shallow set longline fishery between 2001 and 2004. Over these years, the Hawaii fishery was only permitted to target tunas. An FMP amendment in 2002 incorporated reasonable and prudent alternative of the March 2001 Biological Opinion issued by NMFS. "
"This amendment prohibited shallow set pelagic longlining north of the equator and closed waters between 0° and 15° N from April–May annually to longline fishing. It instituted sea turtle handling requirements for all vessels using hooks to target pelagic species in the region's EEZ waters and extended the protected species workshop requirement to include the operators of vessels registered to longline general permits\n\nSalvation was at hand, however, for the shallow-set longline fishery, based on hook research by NMFS Fisheries Engineering Laboratory in Pascagoula, Mississippi. This research found that large 18/0 circle hooks combined with mackerel type fish bait could sharply reduce loggerhead and leatherback interactions of longline vessels fishing on the Grand Banks for swordfish. The WPRFMC operationalized this technology in an FMP amendment which established a limited Hawaii-based shallow-set swordfish fishery using circle hooks with mackerel bait."
},
{
"title": "Capitella teleta is a small, cosmopolitan, segmented annelid worm. It is a well-studied invertebrate, which has been cultured for use in laboratories for over 30 years. C. teleta is the first marine polychaete to have its genome sequenced.",
"document": "Description\n\nInitial discovery \nFor many years researchers believed that Capitella capitata was the only representative of this genus that survived, and flourished, in polluted environments. After the oil spill that occurred near Cape Cod in West Falmouth, Massachusetts in 1969, researchers collected sediment and found an abundance of what they believed to be C. capitata. However, subsequent research showed that while the individuals collected from that region had very similar gross morphology, their life histories, methods of reproduction and genetics indicated there were at least six distinct species. Capitella species I, eventually described as Capitella teleta in 2009, was one of the initial species identified from these surveys.\n\nEtymology \nAfter 30 years of research on the group, Capitella teleta was officially described in 2009 by Blake et al. The species name is derived from the Greek word teleta, meaning \"initiation\". This word symbolizes that it was the first alternative Capitella species that was identified.\n\nPhylogenetics \nA 2018 molecular phylogeny of the family Capitellidae established clear monophyly and showed 8 genera. The phylogeny utilized 36 capitellid species and combined data from 18S, 28S, H3, and COI gene sequences. This study also established Capitellidae as the sister group to Echiura. While the study attempted to map morphological characters to the molecular phylogeny, this was not phylogenetically informative and a more detailed re-evaluation of morphology could help to elucidate character trait evolution.\n\nTaxonomic morphology \nCapitella teleta has a narrow, segmented body with reduced parapodia and is red in color. There are nine anterior thoracic segments and many more abdominal segments. New segments are added throughout the lifespan from a posterior subterminal growth zone, called the posterior growth zone. Like other polychaetes, C. teleta has fine bristles or setae. "
"Setae are segmentally repeated along the body, with morphologically distinct setae in the thoracic (hooded hooks) and abdominal segments (capillary setae). This animal exhibits sexual dimorphism and males have dorsally-positioned genital spines on setigers 8-9 while females have paired ovaries in the abdominal segments. Generally, there are separate sexes, however, hermaphroditism is possible when there are low densities of females. Males, females and hermaphrodites are of similar size (max size collected was a male that is 24 mm in length).\n\nEcology\n\nHabitat \nCapitella teleta lives in the shallow-water or intertidal marine environment. It is also found in salt marshes and is often found in high concentrations in disturbed soft sediments. It is a member of the infaunal benthic community. C. teleta burrows through the sediment by peristalsis, using its hydrostatic skeleton and contraction of longitudinal and circular muscles in the body wall. The thoracic segments of C. teleta also contain helical muscles that are proposed to generate additional force for burrowing. Capitellids are commonly thought of as opportunistic in nature, due to their ability to inhabit and flourish in organically enriched marine sediments.\n\nThis organism is commonly found in sediments along the east and west coasts of North America. Additional reports have placed this group in the Mediterranean region as well as Japan.\n\nLife history \nCapitella teleta embryos and early larval stages develop in a brood tube that surrounds the mother. The embryos are approximately 200 µm in diameter. Over the course of approximately a week, the embryos develop into non-feeding larvae which form musculature, a centralized nervous system, two circular ciliary bands, two eye spots, segments, and setae. The larvae are non-feeding and the digestive system develops at a later stage than other organs. Pre-metamorphosis larvae can be categorized into nine stages, with each stage lasting approximately one day. "
"Upon further body elongation and gut maturation, the larvae emerge from the brood tube, and swim forward with a rotational turn via the beating of cilia organized within two circular bands, the prototroch and telotroch. Larvae exhibit positive phototactic behavior in which they swim towards light, potentially an adaptation to aid in larval dispersal C. teleta is an indirect developer and undergoes metamorphosis from a swimming larva into a burrowing juvenile. Metamorphosis is characterized by cilia loss, body elongation, and crawling behavior. Marine sediment functions as a cue to initiate metamorphosis into juvenile worms that thereafter grow into mature adults. Competent larvae can be induced to metamorphose into juveniles when exposed to the B vitamins Nicotinamide (B3) and Riboflavin (B2), suggesting that these chemical compounds may be responsible for the inductive role of the marine sediment in larval metamorphosis. The number of offspring in each brood tube can vary between 50 - 400 individuals, and is influenced by food quality.\n\nAfter metamorphosis, the juveniles begin burrowing and feeding. The juvenile worms continue to grow and add segments during the eight weeks it takes to become sexually mature adults. Males and females can reproduce multiple times during their lifetime. Adults live approximately 12–14 weeks after maturation.\n\nFeeding \nCapitella teleta feeds on the enriched sediment in which it burrows. C. teleta has a complex, regionalized alimentary canal consisting of a foregut, midgut and hindgut. It ingests the sediment by everting its proboscis, which contains a ciliated, muscular dorsal pharynx. Presence of a dorsal pharynx is uncommon in marine polychaetes, and this adaptation may have evolved independently in the family Capitellidae through selective pressures on feeding mode in the benthic marine niche they occupy.\n\nResearch \n\nA wide range of techniques have been developed to investigate C. teleta developmental processes. "
"In 2006, the first study using whole mount in situ hybridization was published. This technique allows investigation of the expression and localization of specific mRNAs within a fixed sample. Immunohistochemistry was later developed as a way to visualize specific cell types in fixed specimens. A microinjection protocol for uncleaved embryos and early cleavage stages was developed in 2010 and was used in a fate mapping study to investigate the ultimate fate of blastomeres."
},
]
last_turn_qa_prompt = generate_qa_prompt(question=question, documents=documents, tokenizer=tokenizer, max_context_len=3000)
messages = [
    {"role": "user", "content": content} if i % 2 == 0
    else {"role": "assistant", "content": content}
    for i, content in enumerate(history)
]
messages = [{"role": "system", "content": "You are a helpful assistant."}] + messages + [{"role": "user", "content": last_turn_qa_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    max_length=5000,
    truncation=True,
)
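The snippet above stops once input_ids has been built. The remaining generation step might look like the sketch below. It is not taken from the model authors: `generate_answer` is a hypothetical helper, the `max_new_tokens` default is an arbitrary illustrative value, and the commented usage assumes the checkpoint loads with AutoModelForCausalLM.

```python
def generate_answer(model, tokenizer, input_ids, max_new_tokens=512):
    """Generate a reply and return only the newly generated text.

    Assumptions: `model` and `tokenizer` behave like Hugging Face objects
    (they expose `generate` and `decode`), and `input_ids` is a batch of
    size 1, as returned by apply_chat_template with return_tensors="pt".
    """
    prompt_len = len(input_ids[0])
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # For decoder-only models, generate() returns prompt + continuation,
    # so the prompt tokens are dropped before decoding.
    new_tokens = output_ids[0][prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Hypothetical usage (downloads the checkpoint, so it is left commented out):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "HIT-TMG/Mixtral_13B_Chat_RAG-Reader", torch_dtype="auto", device_map="auto"
# )
# print(generate_answer(model, tokenizer, input_ids.to(model.device)))
```

Slicing off the prompt keeps the documents and question from being echoed back in the decoded answer.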
Citation
If you find our model helpful, please consider citing it:
@misc{truthreader,
author = {Xinshuo Hu and Zetian Sun and Dongfang Li and Shaolin Ye and Zifei Shan and Qian Chen and Baotian Hu and Min Zhang},
title = {TruthReader: Towards Trustworthy Document Assistant Chatbot with Reliable Attribution},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/HITsz-TMG/TruthReader-document-assistant}},
}