---
language:
- code
- en
task_categories:
- text-classification
tags:
- arxiv:2305.06156
license: mit
metrics:
- accuracy
widget:
- text: |-
    Sum two integersdef sum(a, b):
        return a + b
  example_title: Simple toy
- text: |-
    Look for methods that might be dynamically defined and define them for lookup.def respond_to_missing?(name, include_private = false)
        if name == :to_ary || name == :empty?
            false
        else
            return true if mapping(name).present?
            mounting = all_mountings.find{ |mount| mount.respond_to?(name) }
            return false if mounting.nil?
        end
    end
  example_title: Ruby example
- text: |-
    Method that adds a candidate to the party
    @param c the candidate that will be added to the partypublic void addCandidate(Candidate c) {
        this.votes += c.getVotes();
        candidates.add(c);
    }
  example_title: Java example
- text: |-
    we do not need Buffer pollyfill for nowfunction(str){
        var ret = new Array(str.length), len = str.length;
        while(len--) ret[len] = str.charCodeAt(len);
        return Uint8Array.from(ret);
    }
  example_title: JavaScript example
pipeline_tag: text-classification
---

## Table of Contents
- [Model Description](#model-description)
- [Model Details](#model-details)
- [Usage](#usage)
- [Limitations](#limitations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Model Description

This model is built on [CodeBERT](https://github.com/microsoft/CodeBERT) and trained on a 5M subset of [The Vault](https://huggingface.co/datasets/Fsoft-AIC/the-vault-function) to detect inconsistencies between a docstring/comment and the function it describes. It is used to remove noisy examples from The Vault dataset.
More information:
- **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)
- **Paper:** [The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://arxiv.org/abs/2305.06156)
- **Contact:** support.ailab@fpt.com

## Model Details

* Developed by: [Fsoft AI Center](https://www.fpt-aicenter.com/ai-residency/)
* License: MIT
* Model type: Transformer-Encoder based Language Model
* Architecture: BERT-base
* Data set: [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level)
* Tokenizer: Byte Pair Encoding
* Vocabulary Size: 50265
* Sequence Length: 512
* Language: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
* Training details:
  * Self-supervised learning, binary classification
  * Positive class: Original code-docstring pair
  * Negative class: Random pairing of code and docstring

## Usage

The input to the model follows the template below:

```python
# Template: {docstring}{code}

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Docstring followed directly by the function source, as in the template
text = "Sum two integersdef sum(a, b):\n    return a + b"
# add_special_tokens=False: the string is encoded exactly as written
tokenized_input = tokenizer(text, add_special_tokens=False)
```

Using the model with JAX and PyTorch:

```python
from transformers import AutoModelForSequenceClassification, FlaxAutoModelForSequenceClassification

# Load the model with JAX (Flax)
flax_model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Load the model with PyTorch
torch_model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
```

## Limitations

This model is trained on a 5M subset of The Vault in a self-supervised manner.
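
To make the self-supervised setup concrete, the following is a minimal sketch of how such training pairs could be constructed. It is not the authors' actual pipeline. It assumes that a negative example pairs a docstring with the code of a different, randomly chosen function, that docstring and code are concatenated as in the `{docstring}{code}` template, and that label 1 marks an original pair and label 0 a random one. The function name `build_pairs` and the label convention are hypothetical.

```python
import random


def build_pairs(samples, seed=0):
    """Build (text, label) pairs from a list of (docstring, code) tuples.

    Assumed convention (not taken from the model card):
    label 1 = original docstring-code pair, label 0 = random pairing.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to form a negative pair")

    rng = random.Random(seed)
    n = len(samples)
    pairs = []
    for i, (docstring, code) in enumerate(samples):
        # Positive: the docstring with its own function
        pairs.append((docstring + code, 1))

        # Negative: the same docstring with another function's code.
        # Draw from n - 1 indices and skip over i so that j != i.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        pairs.append((docstring + samples[j][1], 0))
    return pairs


# Hypothetical toy data, for illustration only
toy_samples = [
    ("Sum two integers", "def sum(a, b):\n    return a + b"),
    ("Negate a number", "def neg(x):\n    return -x"),
]
toy_pairs = build_pairs(toy_samples)
```

A negative built this way differs from its docstring in obvious ways (unrelated names, unrelated logic), which is the property the rest of this section is concerned with.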
Since the negative samples are generated artificially, the model's ability to identify inconsistencies that require a strong semantic understanding of both the code and the docstring might be limited.

The model is hard to evaluate because no labeled dataset is available. GPT-3.5-turbo is used as a reference, and the correlation between the model's scores and GPT-3.5-turbo's scores is measured. However, this result could be influenced by GPT-3.5-turbo's potential biases and by ambiguous conditions. We therefore recommend building a human-labeled dataset and fine-tuning this model on it to achieve the best results.

## Additional Information

### Licensing Information

MIT License

### Citation Information

```
@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}
```