gx-ai-architect committed
Commit 8d4f3d5
1 Parent(s): d2d12c3

Update README.md

Files changed (1): README.md (+20 -19)
README.md CHANGED
@@ -71,25 +71,6 @@ We also observe a clear correlation between the Mixtral DPO reward scores and MT
 
 The final Merlinite-7B-pt is the peak checkpoint measured by both Batch-Reward and MT-Bench.
 
-### Acknowledgements
-
-Guangxuan Xu,
-Project lead.
-
-Akash Srivastava,
-Primary advisor
-
-Kai Xu,
-Advised on evaluation and model training.
-
-Tahira Naseem,
-Advised on DPO rewards.
-
-Abhishek Bhandwaldar,
-Advised on distributed sampling and reward annotation implementation.
-
-Thanks to Luis Lastras, David D. Cox, Ruchir Puri, and Sriram Raghavan for enabling this project and for provisioning the resources.
-
 
 ## Model description
 
@@ -117,3 +98,23 @@ The model has been tuned via AI preference. However, this is not a targeted RLHF
 The model undergoes training on synthetic data, leading to the potential inheritance of both advantages and limitations from the underlying teacher models and data generation methods. The incorporation of safety measures during Merlinite-7b-pt's training process is considered beneficial. However, a nuanced understanding of the associated risks requires detailed studies for more accurate quantification.
 
 In the absence of adequate safeguards, there exists a risk of malicious utilization of these models for generating disinformation or harmful content. Caution is urged against complete reliance on a specific language model for crucial decisions or impactful information, as preventing these models from fabricating content is not straightforward. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in ungrounded generation scenarios due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.
+
+### Acknowledgements
+
+Guangxuan Xu,
+Project lead.
+
+Akash Srivastava,
+Primary advisor
+
+Kai Xu,
+Advised on evaluation and model training.
+
+Tahira Naseem,
+Advised on DPO rewards.
+
+Abhishek Bhandwaldar,
+Advised on distributed sampling and reward annotation implementation.
+
+Thanks to Luis Lastras, David D. Cox, Ruchir Puri, and Sriram Raghavan for enabling this project and for provisioning the resources.
+