Safetensors
qwen2
linqq9 committed
Commit 0faa7ad (1 parent: 7c0ce81)

Update README.md

Files changed (1)
  1. README.md +21 -1
README.md CHANGED
@@ -71,7 +71,27 @@ The model is capable of handling various function calling scenarios, including:
 
  *Note: The rankings are based on the performance metrics provided.*
 
-
+ 2. In our evaluation, we assessed the function calling capabilities of various models, including our own models fine-tuned with both masked and non-masked approaches. The results below cover several benchmarks, all evaluated zero-shot. Our model, **hammer-7b-mask**, compares favorably with the other models, most clearly on argument F1.
+
+ The table below follows and extends the format of Table 6 (Function Calling Academic Benchmarks) in ["Granite-Function Calling Model"](https://arxiv.org/abs/2407.00121).
+
+ | Model | Size | API-Bank L-1 F1 Func Name | API-Bank L-1 F1 Args | API-Bank L-2 F1 Func Name | API-Bank L-2 F1 Args | Tool-Alpaca F1 Func Name | Tool-Alpaca F1 Args | Nexus F1 Func Name | Nexus F1 Args | Average F1 Func Name | Average F1 Args |
+ |-------------------------------|------|--------------|--------------|------------|---------|-----------|---------|-----------|---------|--------------|---------|
+ | Functionary-small-v2.4 | 7B | 78.0% | 70.0% | 54.0% | 45.0% | 88.0% | 47.0% | 82.0% | 64.0% | 75.5% | 56.5% |
+ | Gorilla-openfunctions-v2 | 7B | 43.0% | 41.0% | 12.0% | 12.0% | 69.0% | 39.0% | 81.0% | 65.0% | 51.2% | 39.3% |
+ | Hermes-2-Pro-Mistral | 7B | 93.0% | 77.0% | 54.0% | 25.0% | 80.0% | 26.0% | 90.0% | 63.0% | 79.3% | 47.8% |
+ | Mistral-Instruct-v0.3 | 7B | 79.0% | 69.0% | 69.0% | 46.0% | 33.0% | 33.0% | 71.0% | 54.0% | 63.0% | 50.5% |
+ | CodeGemma-Instruct | 7B | 77.0% | 57.0% | 59.0% | 38.0% | 59.0% | 31.0% | 84.0% | 68.0% | 69.8% | 48.5% |
+ | Nexusflow-Raven-v2 | 13B | 51.0% | 42.0% | 28.0% | 22.0% | 85.0% | 37.0% | 92.0% | 75.0% | 64.0% | 44.0% |
+ | C4AI-Command-R-v01 | 35B | 93.0% | 76.0% | 77.0% | 54.0% | 90.0% | 42.0% | 93.0% | 71.0% | 88.3% | 60.8% |
+ | Meta-Llama-3-70B-Instruct | 70B | 85.0% | 67.0% | 69.0% | 52.0% | 78.0% | 43.0% | 70.0% | 52.0% | 75.5% | 53.5% |
+ | GRANITE-20B-FUNCTIONCALLING | 20B | 91.0% | 71.0% | 83.0% | 60.0% | 89.0% | 44.0% | 92.0% | 72.0% | 88.8% | 61.8% |
+ | **hammer-7b-mask** | 7B | **93.8%** | **85.9%** | **79.2%** | 64.4% | 82.3% | **59.9%** | **92.5%** | **77.4%** | **86.9%** | **71.9%** |
+ | hammer-7b-nomask | 7B | 93.5% | 85.8% | 74.1% | **65.8%** | **82.6%** | 59.4% | 87.1% | 75.0% | 84.3% | 71.5% |
+ | xlam-7b-fc-r | 7B | 90.0% | 80.7% | 68.9% | 60.7% | 67.3% | 59.0% | 54.1% | 57.5% | 70.1% | 64.5% |
+
+ Among the models compared, **hammer-7b-mask** achieves the highest average argument F1 (71.9%) and, among the 7B models, the highest average function-name F1 (86.9%); only the larger GRANITE-20B-FUNCTIONCALLING (88.8%) and C4AI-Command-R-v01 (88.3%) score higher on function names. This suggests that our fine-tuning approach is effective for function calling tasks.
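The average columns can be sanity-checked with a short script. The sketch below assumes that "F1 Average" is the unweighted mean of the four per-benchmark scores, which is consistent with the figures in the table but is not stated in the README. The names `SCORES`, `mean_f1`, `AVERAGES`, and `best_model` are hypothetical, and only a subset of rows is copied in.

```python
# Per-benchmark F1 scores (percent) copied from the table, in the order
# (API-Bank L-1, API-Bank L-2, Tool-Alpaca, Nexus). Only some rows are included.
SCORES = {
    "C4AI-Command-R-v01": {
        "func_name": (93.0, 77.0, 90.0, 93.0),
        "args": (76.0, 54.0, 42.0, 71.0),
    },
    "GRANITE-20B-FUNCTIONCALLING": {
        "func_name": (91.0, 83.0, 89.0, 92.0),
        "args": (71.0, 60.0, 44.0, 72.0),
    },
    "hammer-7b-mask": {
        "func_name": (93.8, 79.2, 82.3, 92.5),
        "args": (85.9, 64.4, 59.9, 77.4),
    },
    "hammer-7b-nomask": {
        "func_name": (93.5, 74.1, 82.6, 87.1),
        "args": (85.8, 65.8, 59.4, 75.0),
    },
    "xlam-7b-fc-r": {
        "func_name": (90.0, 68.9, 67.3, 54.1),
        "args": (80.7, 60.7, 59.0, 57.5),
    },
}


def mean_f1(values):
    """Unweighted mean of per-benchmark scores (assumed averaging rule)."""
    return sum(values) / len(values)


# Average F1 per model and metric, recomputed from the per-benchmark scores.
AVERAGES = {
    model: {metric: mean_f1(values) for metric, values in metrics.items()}
    for model, metrics in SCORES.items()
}


def best_model(metric):
    """Return the model with the highest recomputed average for `metric`."""
    return max(AVERAGES, key=lambda model: AVERAGES[model][metric])


if __name__ == "__main__":
    for model, avg in AVERAGES.items():
        print(f"{model:30s} func_name={avg['func_name']:.2f} args={avg['args']:.2f}")
    print("best average func_name:", best_model("func_name"))
    print("best average args:", best_model("args"))
```

Recomputing the averages this way makes it easy to see which model leads on each metric, and to extend the check to the remaining rows by adding them to `SCORES`.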
 
  ## Upcoming Developments