xLAM-v0.1-r / README.md

jianguozhang

Add Discord

11d58b4 verified 12 days ago

preview code

raw

history blame contribute delete

No virus

9.4 kB

	---
	extra_gated_heading: >-
	Acknowledge to follow corresponding license to access the
	repository
	extra_gated_button_content: Agree and access repository
	extra_gated_fields:
	First Name: text
	Last Name: text
	Country: country
	Affiliation: text
	license: cc-by-nc-4.0
	---

	<p align="center">
	<img width="500px" alt="xLAM" src="https://huggingface.co/datasets/jianguozhang/logos/resolve/main/xlam-no-background.png">
	</p>
	<p align="center">
	<a href="https://arxiv.org/abs/2402.15506">[AgentOhana Paper]</a> \|
	<a href="https://github.com/SalesforceAIResearch/xLAM">[Github]</a> \|
	<a href="https://discord.gg/tysWwgZyQ2">[Discord]</a> \|
	<a href="https://www.salesforceairesearch.com/projects/xlam-large-action-models">[Homepage]</a> \|
	<a href="https://huggingface.co/spaces/Tonic/Salesforce-Xlam-7b-r">[Community Demo]</a>
	</p>
	<hr>


	License: cc-by-nc-4.0

	If you already know [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), xLAM-v0.1 is a significant upgrade and better at many things.
	For the same number of parameters, the model have been fine-tuned across a wide range of agent tasks and scenarios, all while preserving the capabilities of the original model.

	xLAM-v0.1-r represents the version 0.1 of the Large Action Model series, with the "-r" indicating it's tagged for research.
	This model is compatible with VLLM and FastChat platforms.




	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("Salesforce/xLAM-v0.1-r")
	model = AutoModelForCausalLM.from_pretrained("Salesforce/xLAM-v0.1-r", device_map="auto")

	messages = [
	{"role": "user", "content": "What is your favourite condiment?"},
	{"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
	{"role": "user", "content": "Do you have mayonnaise recipes?"}
	]

	inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

	outputs = model.generate(inputs, max_new_tokens=512)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	You may need to tune the Temperature setting for different applications. Typically, a lower Temperature is helpful for tasks that require deterministic outcomes.
	Additionally, for tasks demanding adherence to specific formats or function calls, explicitly including formatting instructions is advisable.


	# Benchmarks

	## [BOLAA](https://github.com/salesforce/BOLAA)

	### Webshop


	<div class="datagrid" style="width:700px;">
	<table>
	<!-- <thead><tr><th></th><th colspan="6"></th></tr></thead> -->
	<thead><tr><th>LLM Name</th><th>ZS</th><th>ZST</th><th>ReaAct</th><th>PlanAct</th><th>PlanReAct</th><th>BOLAA</th></tr></thead>
	<tbody>
	<tr><td>Llama-2-70B-chat </td><td>0.0089 </td><td>0.0102</td><td>0.4273</td><td>0.2809</td><td>0.3966</td><td>0.4986</td></tr>
	<tr><td>Vicuna-33B </td><td>0.1527 </td><td>0.2122</td><td>0.1971</td><td>0.3766</td><td>0.4032</td><td>0.5618</td></tr>
	<tr><td>Mixtral-8x7B-Instruct-v0.1 </td><td>0.4634 </td><td>0.4592</td><td><u>0.5638</u></td><td>0.4738</td><td>0.3339</td><td>0.5342</td></tr>
	<tr><td>GPT-3.5-Turbo </td><td>0.4851 </td><td><u>0.5058</u></td><td>0.5047</td><td>0.4930</td><td><u>0.5436</u></td><td><u>0.6354</u></td></tr>
	<tr><td>GPT-3.5-Turbo-Instruct </td><td>0.3785 </td><td>0.4195</td><td>0.4377</td><td>0.3604</td><td>0.4851</td><td>0.5811</td></tr>
	<tr><td>GPT-4-0613</td><td><u>0.5002</u></td><td>0.4783 </td><td>0.4616</td><td><strong>0.7950</strong></td><td>0.4635</td><td>0.6129</td></tr>
	<tr><td>xLAM-v0.1-r</td><td><strong>0.5201</strong></td><td><strong>0.5268</strong></td><td><strong>0.6486</strong></td><td><u>0.6573</u></td><td><strong>0.6611</strong></td><td><strong>0.6556</strong></td></tr>
	</tbody>
	</table>

	### HotpotQA

	<div class="datagrid" style="width:700px;">
	<table>
	<!-- <thead><tr><th></th><th colspan="6"></th></tr></thead> -->
	<thead><tr><th>LLM Name</th><th>ZS</th><th>ZST</th><th>ReaAct</th><th>PlanAct</th><th>PlanReAct</th></thead>
	<tbody>
	<tr><td>Mixtral-8x7B-Instruct-v0.1 </td><td>0.3912 </td><td>0.3971</td><td>0.3714</td><td>0.3195</td><td>0.3039</td></tr>
	<tr><td>GPT-3.5-Turbo </td><td>0.4196 </td><td>0.3937</td><td>0.3868</td><td>0.4182</td><td>0.3960</td></tr>
	<tr><td>GPT-4-0613</td><td><strong>0.5801</strong></td><td><strong>0.5709 </strong></td><td><strong>0.6129</strong></td><td><strong>0.5778</strong></td><td><strong>0.5716</strong></td></tr>
	<tr><td>xLAM-v0.1-r</td><td><u>0.5492</u></td><td><u>0.4776</u></td><td><u>0.5020</u></td><td><u>0.5583</u></td><td><u>0.5030</u></td></tr>
	</tbody>
	</table>

	## [AgentLite](https://github.com/SalesforceAIResearch/AgentLite/tree/main)

	Please note: All prompts provided by AgentLite are considered "unseen prompts" for xLAM-v0.1-r, meaning the model has not been trained with data related to these prompts.

	#### Webshop

	<div class="datagrid" style="width:780px;">
	<table>
	<!-- <thead><tr><th></th><th colspan="2">Easy</th><th colspan="2">Medium</th><th colspan="2">Hard</th></tr></thead> -->
	<thead><tr><th>LLM Name</th><th>Act</th><th>ReAct</th><th>BOLAA</th></tr></thead>
	<tbody>
	<tr><td>GPT-3.5-Turbo-16k </td><td>0.6158 </td><td>0.6005</td><td>0.6652</td></tr>
	<tr><td>GPT-4-0613</td><td><strong>0.6989 </strong></td><td><strong>0.6732</strong></td><td><strong>0.7154</strong></td></tr>
	<tr><td>xLAM-v0.1-r</td><td><u>0.6563</u></td><td><u>0.6640</u></td><td><u>0.6854</u></td></tr>
	</tbody>
	</table>

	#### HotpotQA

	<div class="datagrid" style="width:700px;">
	<table>
	<thead><tr><th></th><th colspan="2">Easy</th><th colspan="2">Medium</th><th colspan="2">Hard</th></tr></thead>
	<thead><tr><th>LLM Name</th><th>F1 Score</th><th>Accuracy</th><th>F1 Score</th><th>Accuracy</th><th>F1 Score</th><th>Accuracy</th></tr></thead>
	<tbody>
	<tr><td>GPT-3.5-Turbo-16k-0613 </td><td>0.410 </td><td>0.350</td><td>0.330</td><td>0.25</td><td>0.283</td><td>0.20</td></tr>
	<tr><td>GPT-4-0613</td><td><strong>0.611</strong></td><td><strong>0.47</strong> </td><td><strong>0.610</strong></td><td><strong>0.480</strong></td><td><strong>0.527</strong></td><td><strong>0.38</strong></td></tr>
	<tr><td>xLAM-v0.1-r</td><td><u>0.532</u></td><td><u>0.45</u></td><td><u>0.547</u></td><td><u>0.46</u></td><td><u>0.455</u></td><td><u>0.36</u></td></tr>
	</tbody>
	</table>



	## ToolBench

	<div class="datagrid" style="width:780px;">
	<table>
	<!-- <thead><tr><th></th><th colspan="2">Easy</th><th colspan="2">Medium</th><th colspan="2">Hard</th></tr></thead> -->
	<thead><tr><th>LLM Name</th><th>Unseen Insts & Same Set</th><th>Unseen Tools & Seen Cat</th><th>Unseen Tools & Unseen Cat</th></tr></thead>
	<tbody>
	<tr><td>TooLlama V2 </td><td>0.4385 </td><td>0.4300</td><td>0.4350</td></tr>
	<tr><td>GPT-3.5-Turbo-0125 </td><td>0.5000 </td><td>0.5150</td><td>0.4900</td></tr>
	<tr><td>GPT-4-0125-preview</td><td><strong>0.5462</strong></td><td><u>0.5450</u></td><td><u>0.5050</u></td></tr>
	<tr><td>xLAM-v0.1-r</td><td><u>0.5077</u></td><td><strong>0.5650</strong></td><td><strong>0.5200</strong></td></tr>
	</tbody>
	</table>

	## [MINT-BENCH](https://github.com/xingyaoww/mint-bench)


	<div class="datagrid" style="width:780px;">
	<table>
	<!-- <thead><tr><th></th><th colspan="2">Easy</th><th colspan="2">Medium</th><th colspan="2">Hard</th></tr></thead> -->
	<thead><tr><th>LLM Name</th><th>1-step</th><th>2-step</th><th>3-step</th><th>4-step</th><th>5-step</th></tr></thead>
	<tbody>
	<tr><td>GPT-4-0613</td><td>-</td><td>-</td><td>-</td><td>-</td><td>69.45</td></tr>
	<tr><td>Claude-Instant-1</td><td>12.12</td><td>32.25</td><td>39.25</td><td>44.37</td><td>45.90</td></tr>
	<tr><td>xLAM-v0.1-r</td><td>4.10</td><td>28.50</td><td>36.01</td><td>42.66</td><td>43.96</td></tr>
	<tr><td>Claude-2 </td><td>26.45 </td><td>35.49</td><td>36.01</td><td>39.76</td><td>39.93</td></tr>
	<tr><td>Lemur-70b-Chat-v1 </td><td>3.75 </td><td>26.96</td><td>35.67</td><td>37.54</td><td>37.03</td></tr>
	<tr><td>GPT-3.5-Turbo-0613 </td><td>2.73</td><td>16.89</td><td>24.06</td><td>31.74</td><td>36.18</td></tr>
	<tr><td>AgentLM-70b </td><td>6.48</td><td>17.75</td><td>24.91</td><td>28.16</td><td>28.67</td></tr>
	<tr><td>CodeLlama-34b </td><td>0.17</td><td>16.21</td><td>23.04</td><td>25.94</td><td>28.16</td></tr>
	<tr><td>Llama-2-70b-chat </td><td>4.27</td><td>14.33</td><td>15.70</td><td>16.55</td><td>17.92</td></tr>
	</tbody>
	</table>


	## [Tool-Query](https://github.com/hkust-nlp/AgentBoard)

	<div class="datagrid" style="width:780px;">
	<table>
	<!-- <thead><tr><th></th><th colspan="2">Easy</th><th colspan="2">Medium</th><th colspan="2">Hard</th></tr></thead> -->
	<thead><tr><th>LLM Name</th><th>Success Rate</th><th>Progress Rate</th></tr></thead>
	<tbody>
	<tr><td>xLAM-v0.1-r</td><td><strong>0.533</strong></td><td><strong>0.766</strong></td></tr>
	<tr><td>DeepSeek-67B </td><td>0.400 </td><td>0.714</td></tr>
	<tr><td>GPT-3.5-Turbo-0613 </td><td>0.367 </td><td>0.627</td></tr>
	<tr><td>GPT-3.5-Turbo-16k </td><td>0.317</td><td>0.591</td></tr>
	<tr><td>Lemur-70B </td><td>0.283</td><td>0.720</td></tr>
	<tr><td>CodeLlama-13B </td><td>0.250</td><td>0.525</td></tr>
	<tr><td>CodeLlama-34B </td><td>0.133</td><td>0.600</td></tr>
	<tr><td>Mistral-7B </td><td>0.033</td><td>0.510</td></tr>
	<tr><td>Vicuna-13B-16K </td><td>0.033</td><td>0.343</td></tr>
	<tr><td>Llama-2-70B </td><td>0.000</td><td>0.483</td></tr>
	</tbody>
	</table>