gugarosa committed on
Commit
1cb0668
1 Parent(s): 3e53f58

Upload 5 files

Files changed (5)
  1. CODE_OF_CONDUCT.md +9 -0
  2. LICENSE +22 -0
  3. NOTICE.md +38 -0
  4. README.md +36 -31
  5. SECURITY.md +41 -0
CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,9 @@
1
+ # Microsoft Open Source Code of Conduct
2
+
3
+ This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
4
+
5
+ Resources:
6
+
7
+ - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
8
+ - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
9
+ - Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ PhyAGI.
2
+ Copyright (c) Microsoft Corporation.
3
+
4
+ MIT License
5
+
6
+ Permission is hereby granted, free of charge, to any person obtaining a copy
7
+ of this software and associated documentation files (the "Software"), to deal
8
+ in the Software without restriction, including without limitation the rights
9
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10
+ copies of the Software, and to permit persons to whom the Software is
11
+ furnished to do so, subject to the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be included in all
14
+ copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22
+ SOFTWARE.
NOTICE.md ADDED
@@ -0,0 +1,38 @@
1
+ NOTICES AND INFORMATION
2
+ Do Not Translate or Localize
3
+
4
+ This software incorporates material from third parties.
5
+
6
+ **Component.** https://github.com/Dao-AILab/flash-attention
7
+
8
+ **Open Source License/Copyright Notice.**
9
+
10
+ BSD 3-Clause License
11
+
12
+ Copyright (c) 2022, the respective contributors, as shown by the AUTHORS file.
13
+ All rights reserved.
14
+
15
+ Redistribution and use in source and binary forms, with or without
16
+ modification, are permitted provided that the following conditions are met:
17
+
18
+ * Redistributions of source code must retain the above copyright notice, this
19
+ list of conditions and the following disclaimer.
20
+
21
+ * Redistributions in binary form must reproduce the above copyright notice,
22
+ this list of conditions and the following disclaimer in the documentation
23
+ and/or other materials provided with the distribution.
24
+
25
+ * Neither the name of the copyright holder nor the names of its
26
+ contributors may be used to endorse or promote products derived from
27
+ this software without specific prior written permission.
28
+
29
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
30
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
31
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
32
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
33
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
34
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
35
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
36
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
37
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
38
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
README.md CHANGED
@@ -1,19 +1,20 @@
1
  ---
2
  inference: false
3
- license: other
4
- license_name: microsoft-research-license
5
- license_link: https://huggingface.co/microsoft/phi-1/resolve/main/Research%20License.docx
6
  language:
7
  - en
8
  pipeline_tag: text-generation
9
  tags:
10
  - code
11
  ---
 
12
  ## Model Summary
13
 
14
  The language model Phi-1 is a Transformer with 1.3 billion parameters, specialized for basic Python coding. Its training involved a variety of data sources, including subsets of Python codes from [The Stack v1.2](https://huggingface.co/datasets/bigcode/the-stack), Q&A content from [StackOverflow](https://archive.org/download/stackexchange), competition code from [code_contests](https://github.com/deepmind/code_contests), and synthetic Python textbooks and exercises generated by [gpt-3.5-turbo-0301](https://platform.openai.com/docs/models/gpt-3-5). Even though the model and the datasets are relatively small compared to contemporary Large Language Models (LLMs), Phi-1 has demonstrated an impressive accuracy rate exceeding 50% on the simple Python coding benchmark, HumanEval.
15
 
16
  ## Intended Uses
 
17
  Given the nature of the training data, Phi-1 is best suited for prompts using the code format:
18
 
19
  ### Code Format:
@@ -30,35 +31,18 @@ def print_prime(n):
30
  else:
31
  print(num)
32
  ```
 
33
  where the model generates the code after the comments. (Note: This is a legitimate and correct use of the else statement in Python loops.)
34
 
35
  **Notes:**
36
- * Phi-1 is intended for research purposes. The model-generated code should be treated as a starting point rather than a definitive solution for potential use cases. Users should be cautious when employing this model in their applications.
 
 
37
  * Direct adoption for production coding tasks is out of the scope of this research project. As a result, Phi-1 has not been tested to ensure that it performs adequately for production-level code. Please refer to the limitation sections of this document for more details.
38
- * If you are using `transformers>=4.36.0`, always load the model with `trust_remote_code=True` to prevent side-effects.
39
 
40
- ## Sample Code
41
 
42
- There are four types of execution mode:
43
-
44
- 1. FP16 / Flash-Attention / CUDA:
45
- ```python
46
- model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype="auto", flash_attn=True, flash_rotary=True, fused_dense=True, device_map="cuda", trust_remote_code=True)
47
- ```
48
- 2. FP16 / CUDA:
49
- ```python
50
- model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype="auto", device_map="cuda", trust_remote_code=True)
51
- ```
52
- 3. FP32 / CUDA:
53
- ```python
54
- model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torch.float32, device_map="cuda", trust_remote_code=True)
55
- ```
56
- 4. FP32 / CPU:
57
- ```python
58
- model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torch.float32, device_map="cpu", trust_remote_code=True)
59
- ```
60
-
61
- To ensure the maximum compatibility, we recommend using the second execution mode (FP16 / CUDA), as follows:
62
 
63
  ```python
64
  import torch
@@ -79,26 +63,33 @@ text = tokenizer.batch_decode(outputs)[0]
79
  print(text)
80
  ```
81
 
82
- **Remark:** In the generation function, our model currently does not support beam search (`num_beams > 1`).
83
- Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings.
84
-
85
  ## Limitations of Phi-1
86
 
87
  * Limited Scope: 99.8% of the Python scripts in our fine-tuning dataset use only the packages "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages, we strongly recommend users manually verify all API uses.
 
88
  * Replicate Scripts Online: As our model is trained on Python scripts found online, there is a small chance it may replicate such scripts, especially if they appear repetitively across different online sources.
 
89
  * Generate Inaccurate Code: The model frequently generates incorrect code. We suggest that users view these outputs as a source of inspiration rather than definitive solutions.
90
  * Unreliable Responses to Alternate Formats: Despite appearing to comprehend instructions in formats like Q&A or chat, our models often respond with inaccurate answers, even when seeming confident. Their capabilities with non-code formats are significantly more limited.
 
91
  * Limitations on Natural Language Comprehension. As a coding bot, Phi-1's main focus is to help with coding-related questions. While it may have some natural language comprehension capabilities, its primary function is not to engage in general conversations or demonstrate common sense like a general AI assistant. Its strength lies in providing assistance and guidance in the context of programming and software development.
 
92
  * Potential Biases: Phi-1, like other AI models, is trained on web and synthetic data. This data can contain biases and errors that might affect the AI's performance. Biases could stem from various sources like unbalanced representation, stereotypes, or controversial opinions present in the training data. As a result, the model might sometimes generate responses that reflect these biases or errors.
93
 
94
  ## Warning about Security Risks
 
95
  When leveraging Phi-1, it's paramount to be vigilant. The model, though powerful, can inadvertently introduce security vulnerabilities in the generated code. Examples include, but are not limited to:
96
 
97
  * Directory Traversal: The code might fail to implement safe checks against directory traversal attacks, potentially allowing unauthorized access to sensitive files on your system.
 
98
  * Injection Attacks: There could be lapses in escaping strings properly, making the application susceptible to SQL, OS commands, or other injection attacks.
 
99
  * Misunderstanding Requirements: The model might sometimes misunderstand or oversimplify user requirements, leading to incomplete or insecure solutions.
 
100
  * Lack of Input Validation: In some cases, the model might neglect to incorporate input validation or sanitize user inputs, opening doors to attacks like Cross-Site Scripting (XSS).
 
101
  * Insecure Defaults: The model might recommend or generate code with insecure default settings, such as weak password requirements or unencrypted data transmissions.
 
102
  * Failure in Error Handling: Improper error handling can inadvertently reveal sensitive information about the system or the application's internal workings.
103
 
104
  Given these potential pitfalls, and others not explicitly mentioned, it's essential to thoroughly review, test, and verify the generated code before deploying it in any application, especially those that are security-sensitive. Always consult with security experts or perform rigorous penetration testing when in doubt.
@@ -106,21 +97,31 @@ Given these potential pitfalls, and others not explicitly mentioned, it's essent
106
  ## Training
107
 
108
  ### Model
 
109
  * Architecture: a Transformer-based model with next-word prediction objective
 
110
  * Training tokens: 54B tokens (7B unique tokens)
 
111
  * Precision: fp16
 
112
  * GPUs: 8 A100
 
113
  * Training time: 6 days
114
 
115
  ### Software
 
116
  * [PyTorch](https://github.com/pytorch/pytorch)
 
117
  * [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
118
  * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
119
 
120
  ### License
121
- The model is licensed under the [Research License](https://huggingface.co/microsoft/phi-1/resolve/main/Research%20License.docx).
 
122
 
123
  ### Citation
 
124
  ```bib
125
  @article{gunasekar2023textbooks,
126
  title={Textbooks Are All You Need},
@@ -128,4 +129,8 @@ The model is licensed under the [Research License](https://huggingface.co/micros
128
  journal={arXiv preprint arXiv:2306.11644},
129
  year={2023}
130
  }
131
- ```
 
 
 
 
 
1
  ---
2
  inference: false
3
+ license: mit
4
+ license_link: https://huggingface.co/microsoft/phi-1/resolve/main/LICENSE
 
5
  language:
6
  - en
7
  pipeline_tag: text-generation
8
  tags:
9
  - code
10
  ---
11
+
12
  ## Model Summary
13
 
14
  The language model Phi-1 is a Transformer with 1.3 billion parameters, specialized for basic Python coding. Its training involved a variety of data sources, including subsets of Python codes from [The Stack v1.2](https://huggingface.co/datasets/bigcode/the-stack), Q&A content from [StackOverflow](https://archive.org/download/stackexchange), competition code from [code_contests](https://github.com/deepmind/code_contests), and synthetic Python textbooks and exercises generated by [gpt-3.5-turbo-0301](https://platform.openai.com/docs/models/gpt-3-5). Even though the model and the datasets are relatively small compared to contemporary Large Language Models (LLMs), Phi-1 has demonstrated an impressive accuracy rate exceeding 50% on the simple Python coding benchmark, HumanEval.
15
 
16
  ## Intended Uses
17
+
18
  Given the nature of the training data, Phi-1 is best suited for prompts using the code format:
19
 
20
  ### Code Format:
 
31
  else:
32
  print(num)
33
  ```
34
+
35
  where the model generates the code after the comments. (Note: This is a legitimate and correct use of the else statement in Python loops.)
36
 
37
  **Notes:**
38
+
39
+ * Phi-1 is intended for code purposes. The model-generated code should be treated as a starting point rather than a definitive solution for potential use cases. Users should be cautious when employing this model in their applications.
40
+
41
  * Direct adoption for production coding tasks is out of the scope of this research project. As a result, Phi-1 has not been tested to ensure that it performs adequately for production-level code. Please refer to the limitation sections of this document for more details.
 
42
 
43
+ * If you are using `transformers<4.37.0`, always load the model with `trust_remote_code=True` to prevent side-effects.
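
The version note above can be turned into a small guard. The following is a minimal sketch, not part of the commit: `needs_trust_remote_code` and `load_phi1` are hypothetical helper names, and the version parsing deliberately ignores pre-release suffixes such as `.dev0`.

```python
import re


def needs_trust_remote_code(transformers_version: str) -> bool:
    """Return True when the given transformers version is older than 4.37.0."""
    # Keep only the leading numeric components; suffixes like ".dev0" are ignored.
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", transformers_version)
    if match is None:
        # Unrecognized version string: be conservative and request the flag.
        return True
    major, minor, patch = (int(part or 0) for part in match.groups())
    return (major, minor, patch) < (4, 37, 0)


def load_phi1():
    """Hypothetical helper: load microsoft/phi-1, setting the flag only when needed."""
    # Imported lazily so the version check above stays usable without transformers.
    import transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer

    flag = needs_trust_remote_code(transformers.__version__)
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-1", torch_dtype="auto", trust_remote_code=flag
    )
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1", trust_remote_code=flag)
    return model, tokenizer
```

Calling `load_phi1()` downloads the checkpoint, so it is left uncalled here; only the version comparison runs on import.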
44
 
45
+ ## Sample Code
46
 
47
  ```python
48
  import torch
 
63
  print(text)
64
  ```
65
 
66
  ## Limitations of Phi-1
67
 
68
  * Limited Scope: 99.8% of the Python scripts in our fine-tuning dataset use only the packages "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages, we strongly recommend users manually verify all API uses.
69
+
70
  * Replicate Scripts Online: As our model is trained on Python scripts found online, there is a small chance it may replicate such scripts, especially if they appear repetitively across different online sources.
71
+
72
  * Generate Inaccurate Code: The model frequently generates incorrect code. We suggest that users view these outputs as a source of inspiration rather than definitive solutions.
73
  * Unreliable Responses to Alternate Formats: Despite appearing to comprehend instructions in formats like Q&A or chat, our models often respond with inaccurate answers, even when seeming confident. Their capabilities with non-code formats are significantly more limited.
74
+
75
  * Limitations on Natural Language Comprehension. As a coding bot, Phi-1's main focus is to help with coding-related questions. While it may have some natural language comprehension capabilities, its primary function is not to engage in general conversations or demonstrate common sense like a general AI assistant. Its strength lies in providing assistance and guidance in the context of programming and software development.
76
+
77
  * Potential Biases: Phi-1, like other AI models, is trained on web and synthetic data. This data can contain biases and errors that might affect the AI's performance. Biases could stem from various sources like unbalanced representation, stereotypes, or controversial opinions present in the training data. As a result, the model might sometimes generate responses that reflect these biases or errors.
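
The "Limited Scope" item above recommends manually verifying API uses whenever generated code reaches beyond the six listed packages. One way to flag such scripts for review is sketched below, assuming the generated code parses as valid Python; `COMMON_PACKAGES` and `unfamiliar_imports` are illustrative names, not part of the model card.

```python
import ast

# Packages named in the "Limited Scope" limitation.
COMMON_PACKAGES = {"typing", "math", "random", "collections", "datetime", "itertools"}


def unfamiliar_imports(source: str) -> set:
    """Return top-level packages imported by `source` that are outside COMMON_PACKAGES."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import os.path" counts as the top-level package "os".
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped; they refer to local modules.
            found.add(node.module.split(".")[0])
    return found - COMMON_PACKAGES
```

A non-empty result does not mean the code is wrong, only that it uses packages the fine-tuning data rarely covered and so deserves a closer look.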
78
 
79
  ## Warning about Security Risks
80
+
81
  When leveraging Phi-1, it's paramount to be vigilant. The model, though powerful, can inadvertently introduce security vulnerabilities in the generated code. Examples include, but are not limited to:
82
 
83
  * Directory Traversal: The code might fail to implement safe checks against directory traversal attacks, potentially allowing unauthorized access to sensitive files on your system.
84
+
85
  * Injection Attacks: There could be lapses in escaping strings properly, making the application susceptible to SQL, OS commands, or other injection attacks.
86
+
87
  * Misunderstanding Requirements: The model might sometimes misunderstand or oversimplify user requirements, leading to incomplete or insecure solutions.
88
+
89
  * Lack of Input Validation: In some cases, the model might neglect to incorporate input validation or sanitize user inputs, opening doors to attacks like Cross-Site Scripting (XSS).
90
+
91
  * Insecure Defaults: The model might recommend or generate code with insecure default settings, such as weak password requirements or unencrypted data transmissions.
92
+
93
  * Failure in Error Handling: Improper error handling can inadvertently reveal sensitive information about the system or the application's internal workings.
94
 
95
  Given these potential pitfalls, and others not explicitly mentioned, it's essential to thoroughly review, test, and verify the generated code before deploying it in any application, especially those that are security-sensitive. Always consult with security experts or perform rigorous penetration testing when in doubt.
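
As an illustration of the directory traversal item above, this is the kind of check a reviewer might look for (or add) when generated code builds file paths from user input. It is a minimal sketch using only the standard library; `safe_join` is a hypothetical helper, not something the model or this repository provides.

```python
import os


def safe_join(base_dir: str, user_path: str) -> str:
    """Join user_path onto base_dir, raising ValueError if the result escapes base_dir."""
    base = os.path.realpath(base_dir)
    # realpath collapses ".." segments and resolves symlinks before the comparison.
    candidate = os.path.realpath(os.path.join(base, user_path))
    # The longest shared ancestor of both paths must be the base directory itself.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

A check like this covers only one of the listed risks; the review and testing advice above still applies to everything else the generated code does.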
 
97
  ## Training
98
 
99
  ### Model
100
+
101
  * Architecture: a Transformer-based model with next-word prediction objective
102
+
103
  * Training tokens: 54B tokens (7B unique tokens)
104
+
105
  * Precision: fp16
106
+
107
  * GPUs: 8 A100
108
+
109
  * Training time: 6 days
110
 
111
  ### Software
112
+
113
  * [PyTorch](https://github.com/pytorch/pytorch)
114
+
115
  * [DeepSpeed](https://github.com/microsoft/DeepSpeed)
116
+
117
  * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
118
 
119
  ### License
120
+
121
+ The model is licensed under the [MIT license](https://huggingface.co/microsoft/phi-1/resolve/main/LICENSE).
122
 
123
  ### Citation
124
+
125
  ```bib
126
  @article{gunasekar2023textbooks,
127
  title={Textbooks Are All You Need},
 
129
  journal={arXiv preprint arXiv:2306.11644},
130
  year={2023}
131
  }
132
+ ```
133
+
134
+ ## Trademarks
135
+
136
+ This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
SECURITY.md ADDED
@@ -0,0 +1,41 @@
1
+ <!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
2
+
3
+ ## Security
4
+
5
+ Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
6
+
7
+ If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
8
+
9
+ ## Reporting Security Issues
10
+
11
+ **Please do not report security vulnerabilities through public GitHub issues.**
12
+
13
+ Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
14
+
15
+ If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
16
+
17
+ You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
18
+
19
+ Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
20
+
21
+ * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
22
+ * Full paths of source file(s) related to the manifestation of the issue
23
+ * The location of the affected source code (tag/branch/commit or direct URL)
24
+ * Any special configuration required to reproduce the issue
25
+ * Step-by-step instructions to reproduce the issue
26
+ * Proof-of-concept or exploit code (if possible)
27
+ * Impact of the issue, including how an attacker might exploit the issue
28
+
29
+ This information will help us triage your report more quickly.
30
+
31
+ If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
32
+
33
+ ## Preferred Languages
34
+
35
+ We prefer all communications to be in English.
36
+
37
+ ## Policy
38
+
39
+ Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
40
+
41
+ <!-- END MICROSOFT SECURITY.MD BLOCK -->