Inconsistencies in the output
Hello,
I have a relatively large dataset of paragraphs and used this model to extract the graph JSON schema, but some of the outputs are problematic.
Here are some issues I managed to identify:
- The model inserts words like "detailedal" or "detailedalb" in the middle of the JSON, attached to no key or value, which breaks JSON parsing.
- I used your system prompt, and even though it explicitly addresses repetition, some long JSON outputs contain a couple of repetitions, especially in the edges section.
- Sometimes the model uses the wrong field names. For instance, instead of detailed_type, I saw detailedal_type a couple of times.
- And sometimes it makes up words, like "detailedalloy", {"id":"0.0%","type":"number","detailedalian population percentage"}, which garbles key:value pairs and values and makes the whole JSON unparsable.
I haven't tried the other two models, but have you seen such cases in your experiments, and how did you resolve them?
I am using NVIDIA Tesla P100 with CUDA 11.4.
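To make the failure cases above easier to count across a large dataset, here is a minimal triage sketch using only the Python standard library. It assumes each output is a JSON object with a top-level "nodes" list whose items carry id, type, and detailed_type; the "nodes" key and the helper name check_output are assumptions for illustration, not something the model card confirms here.

```python
import json
from typing import List

# Field names taken from this thread; the "nodes" key below is an assumption.
REQUIRED_NODE_KEYS = {"id", "type", "detailed_type"}


def check_output(raw: str) -> List[str]:
    """Return a list of problems found in one model output (empty if none)."""
    try:
        graph = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Covers stray tokens and key-less strings that break parsing.
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(graph, dict):
        return ["top level is not a JSON object"]

    problems = []
    for i, node in enumerate(graph.get("nodes", [])):
        if not isinstance(node, dict):
            problems.append(f"node {i}: not an object")
            continue
        missing = sorted(REQUIRED_NODE_KEYS - node.keys())
        unexpected = sorted(node.keys() - REQUIRED_NODE_KEYS)
        if missing:
            problems.append(f"node {i}: missing {missing}")
        if unexpected:
            # Catches misspelled field names such as detailedal_type.
            problems.append(f"node {i}: unexpected {unexpected}")
    return problems


# The broken example from the list above fails at the parsing stage.
broken = '{"nodes":[{"id":"0.0%","type":"number","detailedalian population percentage"}]}'
renamed = '{"nodes":[{"id":"a","type":"person","detailedal_type":"x"}]}'
clean = '{"nodes":[{"id":"a","type":"person","detailed_type":"x"}]}'

for sample in (broken, renamed, clean):
    print(check_output(sample))
```

Running this over all outputs gives a rough count per failure type, which should help show how often each issue actually occurs.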
Hey,
Indeed, it can happen in a small percentage of cases, depending on the variations in your input text. It’s good you are sticking to the exact prompt we give you.
The simplest and most effective solution is to use Outlines to constrain your output schema. This allows you to use nearly any shape and structure of input text without running into problems with mini. Simply define a pydantic class with id, type, and detailed_type, and pass that model to Outlines when you run your generation.
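As a rough sketch of the pydantic class described above, assuming pydantic v2 and a top-level "nodes" list (the Graph wrapper and the "nodes" key are assumptions; edge fields are left out because they are not spelled out in this thread). The Outlines call itself is omitted, since its API differs between versions; the idea is that this class is what you would hand to Outlines' JSON-constrained generation.

```python
from typing import List

from pydantic import BaseModel, ConfigDict, ValidationError


class Node(BaseModel):
    # Reject unknown keys so a misspelled field name fails loudly.
    model_config = ConfigDict(extra="forbid")

    id: str
    type: str
    detailed_type: str


class Graph(BaseModel):
    # Assumed wrapper: the thread only names the three node fields.
    nodes: List[Node]


good = '{"nodes": [{"id": "0.0%", "type": "number", "detailed_type": "population percentage"}]}'
bad = '{"nodes": [{"id": "0.0%", "type": "number", "detailedal_type": "population percentage"}]}'

# A well-formed output parses into typed objects.
graph = Graph.model_validate_json(good)

# An output with a wrong field name is rejected instead of silently accepted.
try:
    Graph.model_validate_json(bad)
    bad_rejected = False
except ValidationError:
    bad_rejected = True
```

Even without constrained generation, the same class can be used after the fact to validate outputs and decide which paragraphs need to be re-run.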
The other option is to use the medium-sized model instead, as it is more adaptive to your input text.
Cheers,
Rob
Thanks for the report - keep in mind that using Outlines would yield 100% valid graph outputs!