CodeAstra-7b: Open Source State-of-the-Art Vulnerability Detection Model 🔍🛡️

Model Description

CodeAstra-7b is a state-of-the-art language model fine-tuned for vulnerability detection in multiple programming languages. Based on the powerful Mistral-7B-Instruct-v0.2 model, CodeAstra-7b has been specifically trained to identify potential security vulnerabilities across a wide range of popular programming languages.

Key Features

🌐 Multi-language Support: Detects vulnerabilities in Go, Python, C, C++, Fortran, Ruby, Java, Kotlin, C#, PHP, Swift, JavaScript, and TypeScript.
🏆 State-of-the-Art Performance: Reaches 83.00% accuracy in the comparison below, second only to gpt4o among the models tested.
📊 Custom Dataset: Trained on a proprietary dataset curated for comprehensive vulnerability detection.
🖥️ Large-scale Training: Utilized A100 GPUs for efficient and powerful training.

Performance Comparison 📊

In vulnerability detection accuracy, CodeAstra-7b outperforms every model in the comparison except gpt4o. Here's the comparison table:

Model	Accuracy (%)
gpt4o	88.78
CodeAstra-7b	83.00
codebert-base-finetuned-detect-insecure-code	65.30
CodeBERT	62.08
RoBERTa	61.05
TextCNN	60.69
BiLSTM	59.37

As shown in the table, CodeAstra-7b reaches 83.00% accuracy. That is about 5.8 percentage points below gpt4o (88.78%) and 17.7 points above the next-best model, codebert-base-finetuned-detect-insecure-code (65.30%).

Intended Use

CodeAstra-7b is designed to assist developers, security researchers, and code auditors in identifying potential security vulnerabilities in source code. It can be integrated into development workflows, code review processes, or used as a standalone tool for code analysis.
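
As one way to wire the model into a review workflow, the sketch below picks out the files whose extensions match the languages listed under Key Features. The extension map and the helper name `select_files` are illustrative assumptions and are not part of CodeAstra-7b.

```python
from pathlib import Path

# Assumed mapping from file extension to the languages named in Key Features.
SUPPORTED_EXTENSIONS = {
    ".go": "Go", ".py": "Python", ".c": "C", ".cpp": "C++", ".f90": "Fortran",
    ".rb": "Ruby", ".java": "Java", ".kt": "Kotlin", ".cs": "C#",
    ".php": "PHP", ".swift": "Swift", ".js": "JavaScript", ".ts": "TypeScript",
}

def select_files(paths):
    """Return (path, language) pairs for files in a supported language."""
    selected = []
    for p in paths:
        language = SUPPORTED_EXTENSIONS.get(Path(p).suffix.lower())
        if language is not None:
            selected.append((str(p), language))
    return selected
```

The resulting list can be fed file by file to whatever analysis function the workflow uses, for example in a pre-commit hook or a CI job.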

Multiple Vulnerability Scenarios

It's important to note that while CodeAstra-7b excels at finding security issues in most cases, its performance may vary when multiple vulnerabilities are present in the same code snippet. In scenarios where two or three vulnerabilities coexist, the model might not always identify all of them correctly. Users should be aware of this limitation and consider using the model as part of a broader, multi-faceted security review process.
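
One possible mitigation, which is an assumption and has not been evaluated by the author, is to submit smaller units so that each snippet is less likely to contain several vulnerabilities at once. The sketch below does this for Python input only, using the standard `ast` module; the helper name `split_python_functions` is hypothetical.

```python
import ast

def split_python_functions(source):
    """Return the source text of each top-level function or class in `source`."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the text spanned by the node
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

Each returned chunk can then be analyzed on its own. Module-level statements are skipped here, so a full review would still need a pass over the whole file.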

Training 🏋️‍♂️

CodeAstra-7b was fine-tuned from the Mistral-7B-Instruct-v0.2 base model using a custom dataset specifically compiled for vulnerability detection across multiple programming languages. The training process leveraged A100 GPUs to ensure optimal performance and efficiency.

Usage 💻

CodeAstra-7b was trained using PEFT (Parameter-Efficient Fine-Tuning). To use the model for vulnerability detection and code quality analysis, you can leverage the Hugging Face Transformers library along with PEFT. Here's how to get started:

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the adapter config to find the base model it was trained on
peft_model_id = "rootxhacker/CodeAstra-7B"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model in 4-bit precision (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)

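def get_completion(query, model, tokenizer):
    # Move the tokenized inputs to the same device as the model
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    # The decoded text contains the prompt followed by the generated analysis
    return tokenizer.decode(outputs[0], skip_special_tokens=True)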

# Example usage
code_to_analyze = """
def user_input():
    name = input("Enter your name: ")
    print("Hello, " + name + "!")

user_input()
"""

query = f"Analyze this code for vulnerabilities and quality issues:\n{code_to_analyze}"
result = get_completion(query, model, tokenizer)
print(result)

This script loads the CodeAstra-7b model and tokenizer and defines a function that generates completions. You can use this setup to analyze code for vulnerabilities and quality issues.
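
For causal language models, `generate` returns the prompt tokens followed by the newly generated tokens, so the string returned by `get_completion` starts with the query itself. The helper below is a minimal sketch for dropping that echo; the name `strip_prompt` is hypothetical, and it falls back to the full text when the decoded prompt does not match the query exactly (tokenization can change whitespace).

```python
def strip_prompt(full_text, prompt):
    """Return only the generated part of `full_text`, dropping an echoed `prompt`."""
    stripped_full = full_text.lstrip()
    stripped_prompt = prompt.strip()
    if stripped_prompt and stripped_full.startswith(stripped_prompt):
        return stripped_full[len(stripped_prompt):].lstrip()
    # Fall back to the unmodified text if the echo does not match exactly
    return full_text
```

With the script above, this would be called as `strip_prompt(result, query)` before printing or storing the analysis.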

Limitations ⚠️

While CodeAstra-7b represents a significant advancement in automated vulnerability detection and code quality analysis, it's important to note that:

The model may not catch all vulnerabilities or code quality issues and should be used as part of a comprehensive security and code review strategy.
In cases where multiple vulnerabilities (two or three) are present in the same code snippet, the model might not identify all of them correctly.
False positives are possible, and results should be verified by human experts.
The model's performance may vary depending on the complexity and context of the code being analyzed.
CodeAstra's performance depends on input code snippet length.
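
Since results depend on snippet length, long files may need to be split before analysis. The sketch below cuts a source string into overlapping windows of lines. The function name `chunk_lines` and the default sizes are illustrative assumptions, not values recommended by the author.

```python
def chunk_lines(source, max_lines=60, overlap=10):
    """Split `source` into windows of at most `max_lines` lines sharing `overlap` lines."""
    if max_lines <= 0 or not 0 <= overlap < max_lines:
        raise ValueError("need max_lines > 0 and 0 <= overlap < max_lines")
    lines = source.splitlines()
    step = max_lines - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + max_lines]))
        if start + max_lines >= len(lines):
            break  # the last window already reaches the end of the snippet
    return chunks
```

The overlap keeps some surrounding context in each window. It does not guarantee that a vulnerability spanning a window boundary is seen in full.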

Test Apparatus

I tested CodeAstra-7b against code snippets from datasets such as Cvefix, the YesWeHack vulnerable code repository, code generated synthetically using LLMs, and the OWASP Juice Shop source code.
For comparison, I ran the same vulnerable scripts against LLMs such as GPT4 and GPT4o.
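
The card does not describe how individual verdicts were scored. Assuming each snippet gets a single reference label and a single predicted verdict, the accuracy figures in the comparison table would correspond to a calculation like the following sketch (the function name `accuracy` and the boolean labels are assumptions).

```python
def accuracy(predictions, labels):
    """Percentage of snippets where the predicted verdict equals the reference label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```

For example, two matching verdicts out of four snippets give `accuracy([True, False, True, True], [True, True, True, False]) == 50.0`.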

Citation 📜

If you use CodeAstra-7b in your research or project, please cite it as follows:

@software{CodeAstra-7b,
  author = {Harish Santhanalakshmi Ganesan},
  title = {CodeAstra-7b: State-of-the-Art Vulnerability Detection Model},
  year = {2024},
  howpublished = {\url{https://huggingface.co/rootxhacker/CodeAstra-7b}}
}

License 📄

CodeAstra-7b is released under the Apache License 2.0.

Copyright 2024 [Harish Santhanalakshmi Ganesan]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Acknowledgements 🙏

We would like to thank the Mistral AI team for their excellent base model, which served as the foundation for CodeAstra-7b.
