metadata

language: de
tags:
  - text-classification
license: other
metrics:
  - f1
  - precision
  - recall
library_name: spacy

Extraction of Formal Educational Requirements from Online Job Advertisements

Pipeline

1. Localization (NER)
- Task: find the span in the job ad that contains the educational requirements
- Trained and evaluated on annotated data (see table below)
2. Education Level Extraction (rule-based NER)
- Task: determine the education level(s) requested by the employer
- Rules refined and evaluated on annotated data (see table below)
- Entities correspond to a granular set of educational qualifications in the German system and are assigned to one or more ISCED codes by mapping rules
3. Educational Subject Localization (NER)
- Task: find entities within the span retrieved in step 1 that match an educational subject
- Performed separately for academic and vocational subjects
- Trained on annotated data
4. Educational Subject Classification (few-shot SBERT)
- Task: classify the span retrieved in step 3 according to the educational subject taxonomy
- Performed separately for academic and vocational subjects
- Training examples from
  - academic: official classifications of various study programs at German universities
  - vocational: synonym lists of vocational professions provided by Bundesagentur für Arbeit (available at https://download-portal.arbeitsagentur.de/); additional examples for higher-order clusters of vocational subjects were generated with ChatGPT 4
- All examples and the candidate span are converted to embeddings with a fine-tuned SBERT model and classified using radius nearest neighbors with cosine distance
- The combination of steps 3 and 4 is evaluated on annotated data (see table below)
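
The classification idea in step 4 can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: the 2-d vectors stand in for SBERT embeddings, the labels, the radius and the majority vote are assumptions made for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def radius_neighbors_label(query, examples, radius):
    """Return the majority label among examples within `radius` of `query`.

    `examples` is a list of (embedding, label) pairs. Returns None when no
    example lies inside the radius, so the span stays unclassified.
    """
    votes = Counter(
        label
        for embedding, label in examples
        if cosine_distance(query, embedding) <= radius
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy vectors standing in for SBERT embeddings (hypothetical values and labels).
examples = [
    ([1.0, 0.0], "Mathematik"),
    ([0.9, 0.1], "Mathematik"),
    ([0.0, 1.0], "Politikwissenschaften"),
]

print(radius_neighbors_label([0.95, 0.05], examples, radius=0.1))
print(radius_neighbors_label([-1.0, 0.0], examples, radius=0.1))
```

Unlike k-nearest neighbors, a radius-based rule can abstain: a candidate span that is far from every training example receives no label instead of being forced into the closest class.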

Usage

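import os
import sys

from huggingface_hub import snapshot_download

# Access token; assumed here to be stored in the HF_TOKEN environment variable
HF_TOKEN = os.environ.get("HF_TOKEN")

# Download a snapshot of the model repository
path = snapshot_download(
    cache_dir="tmp/",
    repo_id="bertelsmannstift/oja_education_extraction",
    revision="main",
    token=HF_TOKEN,
)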

# Add pipeline module to path and import 
sys.path.append(path)
from pipeline import PipelineWrapper

# Init model
pipeline = PipelineWrapper(path=path)

# Predictions: one dict per job posting
queries = [
    {
        "posting_id": "123",
        "full_text": "Wir sind Firma XYZ. Wir suchen einen Data Scientist. Sie haben Mathematik, Politikwissenschaften oder ein vergleichbares Fach studiert.",
        "candidate_description": None,
        "job_description": None,
    }
]

result = pipeline(queries)

Output

[{'posting_id': '123',
  'education_level_raw_id': [...],
  'education_level_isced_id': [...],
  'education_studies_label': [...],
  'education_vocational_label': [...]},
 ...
]
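
Because every field except posting_id is a list, downstream analyses usually need to flatten the output first. The following is a minimal sketch of counting values across postings. The records are hypothetical placeholders that only mirror the structure shown above, the type of the codes (string vs. integer) is an assumption, and `count_values` is an illustrative helper that is not part of the pipeline.

```python
from collections import Counter

# Hypothetical results mirroring the output structure; all values are placeholders.
result = [
    {
        "posting_id": "123",
        "education_level_raw_id": [],
        "education_level_isced_id": ["64"],
        "education_studies_label": ["Mathematik", "Politikwissenschaften"],
        "education_vocational_label": [],
    },
    {
        "posting_id": "124",
        "education_level_raw_id": [],
        "education_level_isced_id": ["35", "64"],
        "education_studies_label": [],
        "education_vocational_label": [],
    },
]


def count_values(results, field):
    """Count how often each value of a list-valued field occurs across postings."""
    counts = Counter()
    for record in results:
        # Treat a missing or None field like an empty list.
        counts.update(record.get(field) or [])
    return counts


isced_counts = count_values(result, "education_level_isced_id")
print(isced_counts.most_common())
```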

Variables and Taxonomy

education_level_raw_id:
- A list of codes for granular education levels of the German educational system
- Loosely based on https://www.datenportal.bmbf.de/portal/de/G293.html, column "Bildungsprogramme"
- Codebook: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/2022-12-08_formale_abschluesse_codebook_v4.json
- Rules for pattern matcher: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/patterns_education_level.json
education_level_isced_id:
- Assignment of each education_level_raw_id to one or more codes of the ISCED taxonomy (first two digits): https://www.datenportal.bmbf.de/portal/de/G293.html
- Using education_level_isced_id instead of education_level_raw_id is generally recommended, except for very specific analyses
- Modifications:
  - additional category "alternative Berufserfahrung" (alternative job experience)
  - ISCED level starts at 2
  - ISCED codes 44 and 55 have been eliminated
  - All vocational training ("Berufsausbildung") is mapped to ISCED 35, except for health and social professions, which are mapped to ISCED 45. (ISCED subclass 454 officially refers either to "Zweitausbildung" (second vocational training) or to vocational training combined with "Erwerb einer Studienberechtigung" (acquisition of a higher-education entrance qualification). Neither can be determined from OJAs, so only ISCED subclass 453 remains in the ISCED 45 class.)
  - "Schulbildung" (not otherwise specified) is mapped to ISCED 24
  - "Studium" (not otherwise specified) is mapped to ISCED 64
- Codebook and assignment rules: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/map_education_level_isced.json
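
The assignment of granular levels to ISCED codes can be pictured as a lookup table. The sketch below only illustrates the modifications listed above. The raw identifiers and the dictionary layout are hypothetical and do not reproduce the contents or format of map_education_level_isced.json.

```python
# Hypothetical raw identifiers mapped to ISCED codes (first two digits).
# Only the rules stated in the list above are encoded here.
RAW_TO_ISCED = {
    "schulbildung_unspecified": ["24"],       # "Schulbildung", not otherwise specified
    "berufsausbildung": ["35"],               # general vocational training
    "berufsausbildung_health_social": ["45"], # health and social professions
    "studium_unspecified": ["64"],            # "Studium", not otherwise specified
}


def map_to_isced(raw_ids, mapping=RAW_TO_ISCED):
    """Return the sorted, de-duplicated ISCED codes for a list of raw ids.

    Raw ids without a mapping entry are skipped rather than guessed.
    """
    codes = set()
    for raw_id in raw_ids:
        codes.update(mapping.get(raw_id, []))
    return sorted(codes)


print(map_to_isced(["berufsausbildung", "studium_unspecified"]))
```

Returning a list keeps the sketch compatible with the rule that one raw level may be assigned to more than one ISCED code.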
education_studies_label:
- A list of the academic subjects requested in the job ad, if any
- Taxonomy follows DESTATIS Fächersystematik 2023: https://www.destatis.de/DE/Methoden/Klassifikationen/Bildung/studenten-pruefungsstatistik.pdf?__blob=publicationFile
- Codebook and training examples: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/2023-03-24_codebook_studies_v2_1.json
education_vocational_label:
- A list of the vocational subjects requested in the job ad, if any
- Based on the official list of German vocational subjects by Bundesagentur für Arbeit, which can be obtained here: https://download-portal.arbeitsagentur.de/
- Additionally, subjects have been grouped into 20 clusters
- Subjects that differ only in their specialization have been grouped as well
- One additional supercluster: "Handwerk" (skilled crafts)
- Codebook and training examples: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/2024-01-11_codebook_ausbildungsfaecher_v2_2.json

Performance and Datasets

Training data were sampled from the Textkernel German OJA dataset (https://www.textkernel.com/).

Task	Sample size (n)	Sampling	Annotators	Annotator overlap	Inter-annotator agreement	Test split	Evaluation mode	Precision (micro)	Recall (micro)
Localization (step 1)	1500	Stratified by first ISCO digit, equally distributed	4	10 %		0.2	NER partial	0.95	0.92
Education Level (step 2)	1200	Stratified by first ISCO digit, equally distributed	3	20 %	0.84 (Krippendorff)	0.5	Multilabel classification	0.92	0.87
Academic Subject (steps 3 and 4)	1200	Stratified by first ISCO digit, optimized	5	20 %	0.77 (Krippendorff)	0.2	NER strict	0.87	0.88
Vocational Subject (steps 3 and 4)	1200	Stratified by first ISCO digit, optimized	2	20 %	0.72 (Krippendorff)	0.2	NER strict	0.86	0.83
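
The table reports micro precision and recall only. A micro F1 value can be derived from them as their harmonic mean. The sketch below does this arithmetic for the reported figures. The resulting F1 values are derived here and are not reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
reported = {
    "Localization": (0.95, 0.92),
    "Education Level": (0.92, 0.87),
    "Academic Subject": (0.87, 0.88),
    "Vocational Subject": (0.86, 0.83),
}

for task, (precision, recall) in reported.items():
    print(f"{task}: F1 = {f1_score(precision, recall):.2f}")
```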

Notes:

- Performance estimates for step 2 refer to the granular taxonomy (education_level_raw_id), not to ISCED; performance on ISCED codes is expected to be marginally better
- False negatives accumulate along the pipeline, so total recall might be lower than the per-step values
- However, since inter-annotator agreement is below 1, some of the false negatives and false positives are due to annotator misclassification, so effective performance is likely higher than reported (mostly precision)
- Optimized sampling draws an equally distributed sample from a 0.1 % subsample of the data. Not all strata can be filled completely in that subsample, which suppresses rare subgroups
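
The note on accumulating false negatives can be made concrete with a rough back-of-the-envelope sketch. It assumes that the steps miss items independently and that each step's recall was measured on correct input from the previous step. Neither assumption is confirmed by the source, so the result is only an illustrative lower-end estimate and not a measured end-to-end figure.

```python
import math


def compounded_recall(*step_recalls):
    """Product of per-step recalls, valid only under the independence assumption."""
    return math.prod(step_recalls)


# Reported recall values: localization (step 1) followed by a downstream step.
localization = 0.92
downstream = {
    "Education Level": 0.87,
    "Academic Subject": 0.88,
    "Vocational Subject": 0.83,
}

for task, recall in downstream.items():
    estimate = compounded_recall(localization, recall)
    print(f"{task}: estimated end-to-end recall = {estimate:.2f}")
```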