fullstack's picture
Upload folder using huggingface_hub
e0b1078 verified

Content Classification LoRA Adapter for Gemma-2B

A LoRA adapter for unsloth/gemma-2b that determines content indexing suitability using chain-of-thought reasoning.

Used in a pipeline.

Technical Specifications

Base Model

  • Model: unsloth/gemma-2b
  • LoRA Rank: 64
  • Target Modules: q_proj, up_proj, down_proj, gate_proj, o_proj, k_proj, v_proj
  • Task: CAUSAL_LM
  • Dropout: 0
  • Alpha: 32

Input/Output Format

Input XML structure:

<instruction>Determine true or false if the following content is suitable and should be indexed.</instruction>
<suitable>
  <content>{input_text}</content>

Output XML structure:

  <thinking>{reasoning_process}</thinking>
  <category>{content_type}</category>
  <should_index>{true|false}</should_index>
</suitable>

The model then expects an indefinite list of <suitable> ... </suitable> that you may not want. But you can use this to do fewshots with incontext learning to correct a mistake or enhance the results.

Your stop token should be </suitable>.

Deployment

VLLM Server Setup

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

vllm serve unsloth/gemma-2-2b \
  --gpu-memory-utilization=1 \
  --port 6002 \
  --served-model-name="gemma" \
  --trust-remote-code \
  --max-model-len 8192 \
  --disable-log-requests \
  --enable-lora \
  --lora-modules lora=./dataset/output/unsloth/lora_model \
  --max-lora-rank 64

Processing Pipeline

  1. Install Dependencies:
pip install requests tqdm concurrent.futures
  1. Run Content Processor:
python process.py --input corpus.jsonl --output results.jsonl --threads 24

Client Implementation

import requests

def classify_content(text: str, vllm_url: str = "http://localhost:6002/v1/completions") -> dict:
    xml_content = (
        '<instruction>Determine true or false if the following content is '
        'suitable and should be indexed.</instruction>\n'
        '<suitable>\n'
        f'  <content>{text}</content>'
    )
    
    response = requests.post(
        vllm_url,
        json={
            "prompt": xml_content,
            "max_tokens": 6000,
            "temperature": 1,
            "model": "lora",
            "stop": ["</suitable>"]
        },
        timeout=30000
    )
    
    completion = response.json()["choices"][0]["text"]
    
    # Parse XML tags
    import re
    def extract_tag(tag: str) -> str:
        match = re.search(f'<{tag}>(.*?)</{tag}>', completion, re.DOTALL)
        return match.group(1).strip() if match else ""
        
    return {
        "thinking": extract_tag("thinking"),
        "category": extract_tag("category"),
        "should_index": extract_tag("should_index")
    }

Example Usage

text = """Multiservice Tactics, Techniques, and Procedures
for
Nuclear, Biological, and Chemical Aspects of Consequence
Management

TABLE OF CONTENTS..."""

result = classify_content(text)
print(result)

Example output:

{
    "thinking": "This is a table of contents for a document, not the actual content.",
    "category": "table of contents",
    "should_index": "false"
}

Batch Processing

The included processor supports parallel processing of JSONL files:

from request_processor import RequestProcessor

processor = RequestProcessor(
    input_file="corpus.jsonl",
    output_file="results.jsonl",
    num_threads=24
)
processor.process_file()

Input JSONL format:

{
    "pid": "document_id",
    "docid": "path/to/source",
    "content": "document text",
    "metadata": {
        "key": "value"
    }
}

Output JSONL format:

{
    "pid": "document_id",
    "docid": "path/to/source",
    "content": "document text",
    "metadata": {
        "key": "value"
    },
    "thinking": "reasoning process",
    "category": "content type",
    "should_index": "true/false",
    "processed_at": "2024-10-22 02:52:33"
}

Implementation and Performance Considerations

  • Use thread pooling for parallel processing
  • Implement atomic writes with file locking
  • Progress tracking with tqdm
  • Automatic error handling and logging
  • Configurable thread count for optimization

Error Handling

Errors are captured in the output JSONL:

{
    "error": "error message",
    "processed_at": "timestamp"
}

Monitor errors in real-time:

tail -f results.jsonl | grep error