metadata
language: de
tags:
- text-classification
license: other
metrics:
- f1
- precision
- recall
library_name: spacy
Extraction of Formal Educational Requirements from Online Job Advertisements
Pipeline
1. Localization (NER)
   - Task: find the span in the job ad that contains the educational requirements
   - Trained and evaluated using annotated data (see table below)
2. Education Level Extraction (rule-based NER)
   - Task: determine the education level(s) requested by the employer
   - Rules refined and evaluated using annotated data (see table below)
   - Entities correspond to a granular set of educational qualifications in the German system and are assigned to one or more ISCED codes by mapping rules
3. Educational Subject Localization (NER)
   - Task: within the span retrieved in step 1, find entities that match an educational subject
   - Performed separately for academic and vocational subjects
   - Trained using annotated data
4. Educational Subject Classification (few-shot SBERT)
   - Task: classify the span retrieved in step 3 according to the educational subject taxonomy
   - Performed separately for academic and vocational subjects
   - Training examples come from:
     - academic: official classifications of various study programs at German universities
     - vocational: synonym lists of vocational professions provided by Bundesagentur für Arbeit (available at https://download-portal.arbeitsagentur.de/). Additional examples for higher-order clusters of vocational subjects were generated using ChatGPT 4
   - All examples and the candidate span are converted to embeddings using a fine-tuned SBERT model and classified using radius nearest neighbors with cosine distance
   - The combination of steps 3 and 4 is evaluated using annotated data (see table below)
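The classification rule in step 4 (radius nearest neighbors with cosine distance) can be pictured with a minimal sketch. This is not the pipeline's implementation: the three-dimensional vectors stand in for SBERT embeddings, the radius value is arbitrary, and both the majority vote and returning `None` when no example lies within the radius are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classify_by_radius(query, examples, radius):
    """Return the majority label among examples within `radius`, else None."""
    votes = Counter(
        label
        for vector, label in examples
        if cosine_distance(query, vector) <= radius
    )
    if not votes:
        # Assumption: a span with no close example stays unclassified.
        return None
    return votes.most_common(1)[0][0]


# Toy "embeddings" (hypothetical); real SBERT vectors have hundreds of dimensions.
examples = [
    ([0.9, 0.1, 0.0], "Mathematik"),
    ([0.8, 0.2, 0.1], "Mathematik"),
    ([0.1, 0.9, 0.2], "Politikwissenschaft"),
]

print(classify_by_radius([0.85, 0.15, 0.05], examples, radius=0.2))  # Mathematik
print(classify_by_radius([0.0, 0.0, 1.0], examples, radius=0.2))     # None
```

Unlike k-nearest neighbors, a radius rule can decline to assign a label, which suits few-shot settings where a candidate span may not belong to any known subject.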
Usage
import os
import sys
from huggingface_hub import snapshot_download
# Read the Hugging Face access token from the environment
# (the variable name HF_TOKEN is an assumption; None falls back to a cached login)
HF_TOKEN = os.environ.get("HF_TOKEN")
# Download a snapshot of the model repository
path = snapshot_download(
    cache_dir="tmp/",
    repo_id="bertelsmannstift/oja_education_extraction",
    revision="main",
    token=HF_TOKEN,
)
# Add the pipeline module to the import path and import the wrapper
sys.path.append(path)
from pipeline import PipelineWrapper
# Initialize the model
pipeline = PipelineWrapper(path=path)
# Run predictions on a list of job postings
queries = [{"posting_id": "123",
            "full_text": "Wir sind Firma XYZ. Wir suchen einen Data Scientist. Sie haben Mathematik, Politikwissenschaften oder ein vergleichbares Fach studiert.",
            "candidate_description": None,
            "job_description": None}]
result = pipeline(queries)
Output
[{'posting_id': '123',
  'education_level_raw_id': [...],
  'education_level_isced_id': [...],
  'education_studies_label': [...],
  'education_vocational_label': [...]},
 ...
]
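Since each field holds a list, a common follow-up is to count how many postings mention each value. The sketch below assumes the output shape shown above. The records are hypothetical placeholders, including the value types, and are not taken from the codebooks or from a real model run.

```python
from collections import Counter

# Hypothetical records in the documented output shape (placeholder values).
result = [
    {"posting_id": "123",
     "education_level_raw_id": [],
     "education_level_isced_id": [64],
     "education_studies_label": ["Mathematik", "Politikwissenschaft"],
     "education_vocational_label": []},
    {"posting_id": "124",
     "education_level_raw_id": [],
     "education_level_isced_id": [35],
     "education_studies_label": [],
     "education_vocational_label": []},
    {"posting_id": "125",
     "education_level_raw_id": [],
     "education_level_isced_id": [64],
     "education_studies_label": ["Mathematik"],
     "education_vocational_label": []},
]


def count_postings_per_value(records, field):
    """Count the number of postings in which each value of `field` occurs."""
    counts = Counter()
    for record in records:
        # A set ensures that a posting is counted once per distinct value.
        counts.update(set(record.get(field) or []))
    return counts


print(count_postings_per_value(result, "education_level_isced_id"))
print(count_postings_per_value(result, "education_studies_label"))
```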
Variables and Taxonomy
- education_level_raw_id:
  - A list of codes for granular education levels in the German educational system
  - Loosely based on https://www.datenportal.bmbf.de/portal/de/G293.html, column "Bildungsprogramme"
  - Codebook: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/2022-12-08_formale_abschluesse_codebook_v4.json
  - Rules for pattern matcher: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/patterns_education_level.json
- education_level_isced_id:
  - Assignment of each education_level_raw_id to one or more codes of the ISCED taxonomy (first two digits): https://www.datenportal.bmbf.de/portal/de/G293.html
  - It is generally recommended to use education_level_isced_id instead of education_level_raw_id, except for very specific analyses
  - Modifications:
    - Additional category "alternative Berufserfahrung" (alternative job experience)
    - ISCED levels start at 2
    - ISCED codes 44 and 55 have been removed
    - All vocational training ("Berufsausbildung") is mapped to ISCED 35, except for health and social professions, which are mapped to ISCED 45. (ISCED subclass 454 officially refers either to "Zweitausbildung" or to vocational training in combination with "Erwerb einer Studienberechtigung", but neither can be determined from OJAs. Thus only ISCED subclass 453 remains in the ISCED 45 class.)
    - "Schulbildung" (not otherwise specified) is mapped to ISCED 24
    - "Studium" (not otherwise specified) is mapped to ISCED 64
  - Codebook and assignment rules: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/map_education_level_isced.json
- education_studies_label:
  - A list of the academic subjects requested in the ad, if any
  - Taxonomy follows DESTATIS Fächersystematik 2023: https://www.destatis.de/DE/Methoden/Klassifikationen/Bildung/studenten-pruefungsstatistik.pdf?__blob=publicationFile
  - Codebook and training examples: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/2023-03-24_codebook_studies_v2_1.json
- education_vocational_label:
  - A list of the vocational subjects requested in the ad, if any
  - Based on the official list of German vocational subjects by Bundesagentur für Arbeit, available at https://download-portal.arbeitsagentur.de/
  - Additionally, subjects have been grouped into 20 clusters
  - Subjects with different specializations have been grouped as well
  - One additional supercluster: "Handwerk"
  - Codebook and training examples: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/2024-01-11_codebook_ausbildungsfaecher_v2_2.json
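The relation between education_level_raw_id and education_level_isced_id described above is a one-to-many lookup. The sketch below illustrates the idea only. The raw IDs and the dictionary layout are hypothetical and do not reflect the actual structure of map_education_level_isced.json; the ISCED codes follow the modifications listed above.

```python
# Hypothetical raw IDs (assumption); the real IDs are defined in the codebook.
RAW_TO_ISCED = {
    "schulbildung_unspecified": [24],        # "Schulbildung" (not otherwise specified)
    "berufsausbildung": [35],                # general vocational training
    "berufsausbildung_health_social": [45],  # health and social professions
    "studium_unspecified": [64],             # "Studium" (not otherwise specified)
}


def to_isced(raw_ids, mapping=RAW_TO_ISCED):
    """Map a list of raw education-level IDs to a sorted list of unique ISCED codes."""
    codes = set()
    for raw_id in raw_ids:
        # Unknown raw IDs contribute no ISCED code in this sketch.
        codes.update(mapping.get(raw_id, []))
    return sorted(codes)


print(to_isced(["berufsausbildung", "studium_unspecified"]))  # [35, 64]
```

Because several raw IDs can share an ISCED code, the ISCED list for a posting may be shorter than its raw list.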
Performance and Datasets
Training data are sampled from the Textkernel German OJA dataset (https://www.textkernel.com/).
Task | n | Sampling | Annotators | Annotator Overlap | Inter-Annotator Agreement | Test Split | Evaluation Mode | Precision (Micro) | Recall (Micro) |
---|---|---|---|---|---|---|---|---|---|
Localization (step 1) | 1500 | Stratified by first ISCO digit, equally distributed | 4 | 10 % | | 0.2 | NER partial | 0.95 | 0.92 |
Education Level (step 2) | 1200 | Stratified by first ISCO digit, equally distributed | 3 | 20 % | 0.84 (Krippendorff) | 0.5 | Multilabel classification | 0.92 | 0.87 |
Academic Subject (steps 3 and 4) | 1200 | Stratified by first ISCO digit, optimized | 5 | 20 % | 0.77 (Krippendorff) | 0.2 | NER strict | 0.87 | 0.88 |
Vocational Subject (steps 3 and 4) | 1200 | Stratified by first ISCO digit, optimized | 2 | 20 % | 0.72 (Krippendorff) | 0.2 | NER strict | 0.86 | 0.83 |
Notes:
- Performance estimates for step 2 refer to the granular taxonomy (education_level_raw_id), not to ISCED; results on the coarser ISCED codes are expected to be marginally better
- False negatives accumulate along the pipeline, so end-to-end recall may be lower than the per-step values
- However, since inter-annotator agreement is below 1, some of the counted false negatives and false positives are annotation errors rather than model errors, so effective performance (mostly precision) is likely higher than reported
- Optimized sampling draws an equally distributed sample from a 0.1 % subsample of the data. Not all strata can be filled completely there, which suppresses rare subgroups
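To see why false negatives accumulate, consider a rough back-of-the-envelope sketch. It assumes that each step can only find what the previous step passed on and that the per-step recall values are conditional on correct upstream output. Neither assumption is confirmed by the evaluation setup above, so the numbers are an illustration, not a measured end-to-end result.

```python
def end_to_end_recall(stage_recalls):
    """Multiply per-stage recalls; valid only under the independence assumption above."""
    total = 1.0
    for recall in stage_recalls:
        total *= recall
    return total


# Per-step recall values taken from the table above.
localization = 0.92
education_level = 0.87
academic_subject = 0.88

print(round(end_to_end_recall([localization, education_level]), 2))   # 0.8
print(round(end_to_end_recall([localization, academic_subject]), 2))  # 0.81
```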