Update README.md
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# Overfitting issue

I used this Colab notebook:
https://colab.research.google.com/drive/1AXh3G3-VmbMWlwbSvesVIurzNlcezTce?usp=sharing

to fine-tune LayoutLMv2ForTokenClassification on the CORD dataset.
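
One input detail worth keeping in mind with this model class: LayoutLMv2 consumes each word together with its bounding box normalized to a 0–1000 coordinate space, so boxes fed at inference time must be rescaled the same way as during training. A minimal sketch of that normalization, assuming pixel-space `(x0, y0, x1, y1)` boxes (the helper name is mine, not from the notebook):

```python
# LayoutLMv2 expects bounding boxes rescaled to a 0-1000 coordinate space.
# Sketch of the usual normalization for a pixel-space (x0, y0, x1, y1) box.

def normalize_box(box, width, height):
    """Rescale a pixel-space box to LayoutLM's 0-1000 coordinate space."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# Example: a word box on a 600x800 receipt scan
print(normalize_box((60, 80, 300, 120), width=600, height=800))
# → [100, 100, 500, 150]
```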

Here is the result:
https://huggingface.co/doc2txt/layoutlmv2-finetuned-cord

* F1: 0.9665
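
For context, F1 for token classification like this is typically reported entity-level (seqeval-style): a predicted entity counts as correct only if both its span and its label match exactly. A hand-rolled sketch of that metric, using hypothetical CORD-like BIO tags for illustration:

```python
# Entity-level F1 (seqeval-style), sketched by hand: an entity counts as a
# true positive only if its full span and label match. Tags are illustrative.

def extract_entities(tags):
    """Return a set of (label, start, end) spans from BIO tags."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if label is not None:
                entities.add((label, start, i))
                label = None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
        elif tag.startswith("I-") and tag[2:] != label:
            # An I- tag with a different label starts a new span (lenient handling)
            if label is not None:
                entities.add((label, start, i))
            start, label = i, tag[2:]
    return entities

def entity_f1(true_tags, pred_tags):
    """F1 over exact-match entity spans."""
    t, p = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(t & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(t) if t else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

true = ["B-menu.nm", "I-menu.nm", "O", "B-total.price"]
pred = ["B-menu.nm", "I-menu.nm", "O", "B-menu.price"]
print(entity_f1(true, pred))  # → 0.5 (one of two entities matches exactly)
```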

The results are indeed quite good when running on the CORD test set; however, on any other receipt (printed or PDF) the predictions are completely off.

So for some reason the model is overfitting to the CORD dataset, even though I use similar images for testing.

I don't think there is **data leakage**, unless the CORD dataset is not clean (which I assume it is).

What could be the reason for this?
Is it some inherent property of LayoutLM?
The LayoutLM models are somewhat old, and they seem abandoned...

I don't have much experience, so I would appreciate any info.
Thanks

# layoutlmv2-finetuned-cord