---
language: de
tags:
  - text-classification
license: other
metrics:
  - f1
  - precision
  - recall
library_name: spacy
---

# Extraction of Formal Educational Requirements from Online Job Advertisements

## Pipeline

1. **Localization (NER)**
   - Task: find the span in the job ad that contains the educational requirements
   - Trained and evaluated on annotated data (see table below)
2. **Education Level Extraction (rule-based NER)**
   - Task: determine the education level(s) requested by the employer
   - Rules refined and evaluated on annotated data (see table below)
   - Entities correspond to a granular set of educational qualifications in the German system and are assigned to one or more ISCED codes via mapping rules
3. **Educational Subject Localization (NER)**
   - Task: find entities within the span retrieved in step 1 that match an educational subject
   - Performed separately for academic and vocational subjects
   - Trained on annotated data
4. **Educational Subject Classification (few-shot SBERT)**
   - Task: classify the span retrieved in step 3 according to an educational subject taxonomy
   - Performed separately for academic and vocational subjects
   - Training examples from
     - academic: official classifications of various study programs at German universities
     - vocational: synonym lists of vocational professions provided by the Bundesagentur für Arbeit (available at https://download-portal.arbeitsagentur.de/); additional examples for higher-order clusters of vocational subjects were generated using ChatGPT-4
   - All examples and the candidate span are converted to embeddings using a fine-tuned SBERT model and classified with radius nearest neighbors under cosine distance
   - The combination of steps 3 and 4 is evaluated using annotated data (see table below)
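The classification in step 4 can be sketched with toy vectors standing in for the fine-tuned SBERT embeddings. All values below (embeddings, labels, radius) are illustrative, not the pipeline's actual parameters:

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

# Toy stand-ins for SBERT embeddings: the real pipeline encodes the
# training examples and the candidate span with a fine-tuned SBERT model
X_train = np.array([
    [1.0, 0.1, 0.0],   # e.g. "Mathematik"
    [0.9, 0.2, 0.1],   # e.g. "Angewandte Mathematik"
    [0.0, 1.0, 0.2],   # e.g. "Politikwissenschaft"
])
y_train = ["mathematics", "mathematics", "political_science"]

# Radius nearest neighbors with cosine distance: a span receives the
# majority label of all training examples within the radius; spans with
# no neighbor inside the radius get the outlier label
clf = RadiusNeighborsClassifier(radius=0.3, metric="cosine",
                                outlier_label="no_match")
clf.fit(X_train, y_train)

candidate = np.array([[0.95, 0.15, 0.05]])  # close to the mathematics cluster
print(clf.predict(candidate))  # → ['mathematics']
```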

## Usage

```python
from huggingface_hub import snapshot_download
import sys

HF_TOKEN = "hf_..."  # replace with your Hugging Face access token

# Download a snapshot of the model
path = snapshot_download(
    cache_dir="tmp/",
    repo_id="bertelsmannstift/oja_education_extraction",
    revision="main",
    token=HF_TOKEN,
)

# Add the pipeline module to the path and import it
sys.path.append(path)
from pipeline import PipelineWrapper

# Initialize the model
pipeline = PipelineWrapper(path=path)

# Predictions
queries = [{
    "posting_id": "123",
    "full_text": "Wir sind Firma XYZ. Wir suchen einen Data Scientist. Sie haben Mathematik, Politikwissenschaften oder ein vergleichbares Fach studiert.",
    "candidate_description": None,
    "job_description": None,
}]

result = pipeline(queries)
```

## Output

```python
[{'posting_id': '123',
  'education_level_raw_id': [...],
  'education_level_isced_id': [...],
  'education_studies_label': [...],
  'education_vocational_label': [...]},
 ...
]
```

## Variables and Taxonomy

## Performance and Datasets

Training data are sampled from the Textkernel German OJA dataset (https://www.textkernel.com/).

| Task | n | Data Sampling | Annotators | Annotator Overlap | Inter-Annotator Agreement | Test Split | Evaluation Mode | Precision (Micro) | Recall (Micro) |
|---|---|---|---|---|---|---|---|---|---|
| Localization (step 1) | 1500 | Stratified by first ISCO digit, equally distributed | 4 | 10 % | n/a | 0.2 | NER partial | 0.95 | 0.92 |
| Education Level (step 2) | 1200 | Stratified by first ISCO digit, equally distributed | 3 | 20 % | 0.84 (Krippendorff) | 0.5 | Multilabel classification | 0.92 | 0.87 |
| Academic Subject | 1200 | Stratified by first ISCO digit, optimized | 5 | 20 % | 0.77 (Krippendorff) | 0.2 | NER strict | 0.87 | 0.88 |
| Vocational Subject | 1200 | Stratified by first ISCO digit, optimized | 2 | 20 % | 0.72 (Krippendorff) | 0.2 | NER strict | 0.86 | 0.83 |

Notes:

- Performance estimates for step 2 refer to the granular taxonomy (`education_level_raw_id`), not ISCED; the ISCED figures will be marginally better
- False negatives accumulate along the pipeline, so total recall may be lower than the per-step figures
- However, since inter-annotator agreement is below 1, some of the false negatives and false positives are due to annotator misclassification, so effective performance (mostly precision) will be higher
- Optimized sampling performs equally distributed sampling on a 0.1 % subsample, in which not all strata can be filled completely; this suppresses rare subgroups
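The recall accumulation noted above can be illustrated with a back-of-the-envelope calculation, under the simplifying assumption that step errors are independent:

```python
# Micro recall of the individual steps (from the table above)
recall_localization = 0.92   # step 1
recall_level = 0.87          # step 2

# A posting must survive step 1 before step 2 can extract its education
# level, so under independence the compound recall is the product
compound_recall = recall_localization * recall_level
print(round(compound_recall, 3))  # → 0.8
```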