Update README.md
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# Overfitting issue

I used this Colab notebook:
https://colab.research.google.com/drive/1AXh3G3-VmbMWlwbSvesVIurzNlcezTce?usp=sharing

to fine-tune LayoutLMv2ForTokenClassification on the CORD dataset.
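
One input detail worth keeping in mind with this model class: LayoutLMv2 consumes each word together with its bounding box normalized to a 0–1000 coordinate space, so boxes fed at inference time must be rescaled the same way as during training. A minimal sketch of that normalization, assuming pixel-space `(x0, y0, x1, y1)` boxes (the helper name is mine, not from the notebook):

```python
# LayoutLMv2 expects bounding boxes rescaled to a 0-1000 coordinate space.
# Sketch of the usual normalization for a pixel-space (x0, y0, x1, y1) box.

def normalize_box(box, width, height):
    """Rescale a pixel-space box to LayoutLM's 0-1000 coordinate space."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# Example: a word box on a 600x800 receipt scan
print(normalize_box((60, 80, 300, 120), width=600, height=800))
# → [100, 100, 500, 150]
```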

Here is the result:
https://huggingface.co/doc2txt/layoutlmv2-finetuned-cord

* F1: 0.9665
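
For context, F1 for token classification like this is typically reported entity-level (seqeval-style): a predicted entity counts as correct only if both its span and its label match exactly. A hand-rolled sketch of that metric, using hypothetical CORD-like BIO tags for illustration:

```python
# Entity-level F1 (seqeval-style), sketched by hand: an entity counts as a
# true positive only if its full span and label match. Tags are illustrative.

def extract_entities(tags):
    """Return a set of (label, start, end) spans from BIO tags."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if label is not None:
                entities.add((label, start, i))
                label = None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
        elif tag.startswith("I-") and tag[2:] != label:
            # An I- tag with a different label starts a new span (lenient handling)
            if label is not None:
                entities.add((label, start, i))
            start, label = i, tag[2:]
    return entities

def entity_f1(true_tags, pred_tags):
    """F1 over exact-match entity spans."""
    t, p = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(t & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(t) if t else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

true = ["B-menu.nm", "I-menu.nm", "O", "B-total.price"]
pred = ["B-menu.nm", "I-menu.nm", "O", "B-menu.price"]
print(entity_f1(true, pred))  # → 0.5 (one of two entities matches exactly)
```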

The results are indeed quite good when running on the CORD test set; however, on any other receipt (printed or PDF) the predictions are completely off.

So for some reason the model is overfitting to the CORD dataset, even though I use similar images for testing.

I don't think there is **data leakage**, unless the CORD dataset is not clean (which I assume it is).

What could be the reason for this?
Is it some inherent property of LayoutLM?
The LayoutLM models are somewhat old, and they seem abandoned...

I don't have much experience, so I would appreciate any info.
Thanks

# layoutlmv2-finetuned-cord