Update README.md
README.md
CHANGED
@@ -71,7 +71,27 @@ The model is capable of handling various function calling scenarios, including:
 
 *Note: The rankings are based on the performance metrics provided.*
 
-
+2. In our evaluation, we assessed the function calling capabilities of various models, including our own models fine-tuned with both masked and non-masked approaches. The results below cover several benchmarks, all evaluated zero-shot. Our model, **hammer-7b-mask**, achieved the highest average F1 score on arguments of all the models compared.
+
+The table below replicates and extends the format of Table 6 (Function Calling Academic Benchmarks) in ["Granite-Function Calling Model"](https://arxiv.org/abs/2407.00121).
+
+| Model                       | Size | API-Bank L-1 |           | API-Bank L-2 |           | Tool-Alpaca  |           | Nexus        |           | F1 Average   |           |
+|-----------------------------|------|--------------|-----------|--------------|-----------|--------------|-----------|--------------|-----------|--------------|-----------|
+|                             |      | F1 Func Name | F1 Args   | F1 Func Name | F1 Args   | F1 Func Name | F1 Args   | F1 Func Name | F1 Args   | F1 Func Name | F1 Args   |
+| Functionary-small-v2.4      | 7B   | 78.0%        | 70.0%     | 54.0%        | 45.0%     | 88.0%        | 47.0%     | 82.0%        | 64.0%     | 75.5%        | 56.5%     |
+| Gorilla-openfunctions-v2    | 7B   | 43.0%        | 41.0%     | 12.0%        | 12.0%     | 69.0%        | 39.0%     | 81.0%        | 65.0%     | 51.2%        | 39.3%     |
+| Hermes-2-Pro-Mistral        | 7B   | 93.0%        | 77.0%     | 54.0%        | 25.0%     | 80.0%        | 26.0%     | 90.0%        | 63.0%     | 79.3%        | 47.8%     |
+| Mistral-Instruct-v0.3       | 7B   | 79.0%        | 69.0%     | 69.0%        | 46.0%     | 33.0%        | 33.0%     | 71.0%        | 54.0%     | 63.0%        | 50.5%     |
+| CodeGemma-Instruct          | 7B   | 77.0%        | 57.0%     | 59.0%        | 38.0%     | 59.0%        | 31.0%     | 84.0%        | 68.0%     | 69.8%        | 48.5%     |
+| Nexusflow-Raven-v2          | 13B  | 51.0%        | 42.0%     | 28.0%        | 22.0%     | 85.0%        | 37.0%     | 92.0%        | 75.0%     | 64.0%        | 44.0%     |
+| C4AI-Command-R-v01          | 35B  | 93.0%        | 76.0%     | 77.0%        | 54.0%     | **90.0%**    | 42.0%     | **93.0%**    | 71.0%     | 88.3%        | 60.8%     |
+| Meta-Llama-3-70B-Instruct   | 70B  | 85.0%        | 67.0%     | 69.0%        | 52.0%     | 78.0%        | 43.0%     | 70.0%        | 52.0%     | 75.5%        | 53.5%     |
+| GRANITE-20B-FUNCTIONCALLING | 20B  | 91.0%        | 71.0%     | **83.0%**    | 60.0%     | 89.0%        | 44.0%     | 92.0%        | 72.0%     | **88.8%**    | 61.8%     |
+| **hammer-7b-mask**          | 7B   | **93.8%**    | **85.9%** | 79.2%        | 64.4%     | 82.3%        | **59.9%** | 92.5%        | **77.4%** | 86.9%        | **71.9%** |
+| hammer-7b-nomask            | 7B   | 93.5%        | 85.8%     | 74.1%        | **65.8%** | 82.6%        | 59.4%     | 87.1%        | 75.0%     | 84.3%        | 71.5%     |
+| xlam-7b-fc-r                | 7B   | 90.0%        | 80.7%     | 68.9%        | 60.7%     | 67.3%        | 59.0%     | 54.1%        | 57.5%     | 70.1%        | 64.5%     |
+
+Our results show that **hammer-7b-mask** achieves the highest average F1 score on arguments (71.9%) across the four function calling benchmarks, while its average function-name F1 (86.9%) trails only the much larger C4AI-Command-R-v01 (88.3%) and GRANITE-20B-FUNCTIONCALLING (88.8%). This highlights the efficacy of our fine-tuning approach for function calling tasks.
 
 ## Upcoming Developments
 
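
The F1 Average columns in the table added by this change are consistent with an unweighted mean of the four per-benchmark scores. The README does not state the averaging rule, so treat that as an assumption. The following minimal Python sketch recomputes the averages for three rows copied from the table; the names `SCORES`, `mean_f1`, and `best_model` are hypothetical helpers introduced only for illustration.

```python
# F1 scores (%) copied from three rows of the table, in benchmark order:
# API-Bank L-1, API-Bank L-2, Tool-Alpaca, Nexus.
SCORES = {
    "C4AI-Command-R-v01": {
        "func_name": [93.0, 77.0, 90.0, 93.0],
        "args": [76.0, 54.0, 42.0, 71.0],
    },
    "GRANITE-20B-FUNCTIONCALLING": {
        "func_name": [91.0, 83.0, 89.0, 92.0],
        "args": [71.0, 60.0, 44.0, 72.0],
    },
    "hammer-7b-mask": {
        "func_name": [93.8, 79.2, 82.3, 92.5],
        "args": [85.9, 64.4, 59.9, 77.4],
    },
}


def mean_f1(values):
    """Unweighted mean of per-benchmark F1 scores (assumed averaging rule)."""
    return sum(values) / len(values)


def best_model(scores, metric):
    """Return the model name with the highest mean F1 for the given metric."""
    return max(scores, key=lambda model: mean_f1(scores[model][metric]))


if __name__ == "__main__":
    for model, metrics in SCORES.items():
        func_avg = round(mean_f1(metrics["func_name"]), 2)
        args_avg = round(mean_f1(metrics["args"]), 2)
        print(f"{model}: func_name={func_avg} args={args_avg}")
    print("best on args:", best_model(SCORES, "args"))
    print("best on func_name:", best_model(SCORES, "func_name"))
```

Under this assumption, the recomputed means agree with the table's F1 Average columns to within rounding, and the ranking differs by metric: hammer-7b-mask leads on arguments, while GRANITE-20B-FUNCTIONCALLING leads on function names.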