---
language: de
tags:
- text-classification
license: "other"
metrics:
- f1
- precision
- recall
library_name: spacy
---

# Extraction of Formal Educational Requirements from Online Job Advertisements

## Usage

```python
import os
import sys

from huggingface_hub import snapshot_download

# Access token for the repository; reading it from the HF_TOKEN
# environment variable is an assumption, adapt to your setup
HF_TOKEN = os.environ.get("HF_TOKEN")

# Download a snapshot of the model
path = snapshot_download(
    cache_dir="tmp/",
    repo_id="bertelsmannstift/oja_education_extraction",
    revision="main",
    token=HF_TOKEN,
)

# Add the pipeline module to the path and import it
sys.path.append(path)
from pipeline import PipelineWrapper

# Initialize the model
pipeline = PipelineWrapper(path=path)

# Run predictions
queries = [
    {
        "posting_id": "123",
        "full_text": "Wir sind Firma XYZ. Wir suchen einen Data Scientist. Sie haben Mathematik, Politikwissenschaften oder ein vergleichbares Fach studiert.",
        "candidate_description": None,
        "job_description": None,
    }
]
result = pipeline(queries)
```

## Input:

```
queries = [
    {
        "posting_id": "foo",
        "full_text": "bla bla",
        "candidate_description": "bla bla",
        "job_description": "bla bla",
    },
    ...
]
```

## Output:

```
[{'posting_id': 'foo',
  'education_level_raw_id': [...],
  'education_level_isced_id': [...],
  'education_studies_label': [...],
  'education_vocational_label': [...]},
 ...
]
```

## Variables and Taxonomy

- education_level_raw_id:
  - A code for a granular education level, following the German educational system.
  - Classes: TODO
  - Rules for pattern matcher: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/patterns_education_level.json
- education_level_isced_id:
  - Mapping of the detected granular education levels to the international ISCED taxonomy (first two digits).
  - Modifications: TODO
  - Classes: TODO
  - Assignment rules: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/assets/map_education_level_isced.json
- education_studies_label:
  - A list of academic subjects, if the posting asks for any.
  - Taxonomy follows DESTATIS Fächersystematik 2023: https://www.destatis.de/DE/Methoden/Klassifikationen/Bildung/studenten-pruefungsstatistik.pdf?__blob=publicationFile
  - Examples for few-shot training: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/models/studies/sbert_entity_classifier/examples.json
- education_vocational_label:
  - A list of vocational subjects, if the posting asks for any.
  - Based on the official list of German vocational subjects by the Bundesagentur für Arbeit, which can be obtained here: https://download-portal.arbeitsagentur.de/
  - Additionally, subjects have been grouped into 20 clusters.
  - Subjects with different specializations have been grouped as well.
  - One additional supercluster: "Handwerk"
  - Examples for few-shot training: https://huggingface.co/bertelsmannstift/oja_education_extraction/blob/main/models/vocational/sbert_entity_classifier/examples.json

## Performance and Datasets

| Task                  | Samples (n) | Sampling                                            | Annotators | Annotator Overlap | Inter-Annotator Agreement | Test Split | Evaluation Mode           | Precision (Micro) | Recall (Micro) |
| --------------------- | ----------- | --------------------------------------------------- | ---------- | ----------------- | ------------------------- | ---------- | ------------------------- | ----------------- | -------------- |
| Localization          | 1500        | Stratified by first ISCO digit, equally distributed | 4          | 10 %              |                           | 0.2        | NER partial               | 0.95              | 0.92           |
| Education Level (raw) | 1200        | Stratified by first ISCO digit, equally distributed | 3          | 20 %              | 0.84 (Krippendorff)       | 0.5        | Multilabel classification | 0.92              | 0.87           |
| Academic Subject      | 1200        | Stratified by first ISCO digit, optimized*          | 5          | 20 %              | 0.77 (Krippendorff)       | 0.2        | NER strict                | 0.87              | 0.88           |
| Vocational Subject    | 1200        | Stratified by first ISCO digit, optimized*          | 2          | 20 %              | 0.72 (Krippendorff)       | 0.2        | NER strict                | 0.86              | 0.83           |

Note: Because the tasks run as a pipeline, errors from earlier stages may propagate to later ones.
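
The table reports micro precision and micro recall but no F1 score. As a minimal sketch, the micro F1 per task can be derived as the harmonic mean of the two reported values. The helper name `micro_f1` and the dictionary `reported` are illustrative and not part of the model repository; because the table values are rounded, the results are approximations.

```python
def micro_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the performance table above
reported = {
    "Localization": (0.95, 0.92),
    "Education Level (raw)": (0.92, 0.87),
    "Academic Subject": (0.87, 0.88),
    "Vocational Subject": (0.86, 0.83),
}

# Approximate micro F1 per task, derived from the rounded table values
f1_by_task = {task: micro_f1(p, r) for task, (p, r) in reported.items()}

for task, f1 in f1_by_task.items():
    print(f"{task}: {f1:.2f}")
```

These scores are computed per task on each task's own test split, so they do not describe end-to-end performance of the full pipeline.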