The concept of language use with LLM models isn't quite right, my friend:
the model learned language by consuming corpora of text data and building word matrices at each step: at each layer of the transformer it calculates a word-to-word probability matrix: we can say the same for training embeddings... so the model is trained on word expectation:
it has some understanding of structure (based on probability).
so we need a refinement of the output based on a set of rules: i.e. grammar:
I'm not sure this would be of benefit:
but:
Teaching methodologies! Yes:
Teaching it to reformat and output according to a set of rules: Yes
So we could write a function to clean the output text based on a set of rules:
so given a text, the output should be formatted according to those rules:
So we would need an example set...
or write a specific Python function to perform the task:
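A minimal sketch of such a rule-based cleaner, assuming a hypothetical ruleset (these specific rules are illustrative, not a fixed standard):

```python
import re

# Hypothetical ruleset: each rule is a (pattern, replacement) pair.
# The rules below are illustrative assumptions only.
RULES = [
    (r"\s+([.,;:!?])", r"\1"),                  # remove spaces before punctuation
    (r"\s{2,}", " "),                           # collapse repeated whitespace
    (r"([.!?])\s*([a-z])",                      # capitalise after sentence end
     lambda m: m.group(1) + " " + m.group(2).upper()),
]

def clean_output(text: str) -> str:
    """Apply each formatting rule in order to the raw model output."""
    text = text.strip()
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    # Ensure the first character is capitalised.
    return text[:1].upper() + text[1:]
```

The same pattern extends to any ruleset: keep the rules as data, apply them serially.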
i.e. process the output! So we could create a dataset, produce the processed output according to the function, and train the model to use this method (in its methodology).... i.e. internally, in a scratchpad: given an input, produce the output using the internal function (simulated), then format the output as required:
by using a Python library such as the NLTK toolkit: this dataset will be based on the proposed ruleset, and outputs which do not conform can be filtered.
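One way to sketch that dataset-building step, assuming hypothetical `process()` and `conforms()` helpers (plain regexes stand in here where NLTK's tokenizers could be used):

```python
import re

def process(text: str) -> str:
    """Target formatting: single spaces, no space before punctuation."""
    text = re.sub(r"\s+([.,!?])", r"\1", text.strip())
    return re.sub(r"\s{2,}", " ", text)

def conforms(text: str) -> bool:
    """A processed sample conforms if re-processing changes nothing."""
    return process(text) == text

def build_dataset(raw_texts):
    """Pair each raw text with its processed form; filter non-conforming outputs."""
    dataset = []
    for raw in raw_texts:
        out = process(raw)
        if conforms(out):
            dataset.append({"input": raw, "output": out})
    return dataset
```

The resulting input/output pairs are what the model would be trained on so it learns to simulate the function internally.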
so by giving it this method, the model will also simulate the same method to format the output:
it would be prudent to create multiple stages of output processing... so given a text where we say "replace X", then with enough samples the model will perform that task:
so given a compound task, it should generate the steps and perform them in serial, then produce the output... hence allowing chains to be thought of by the model, as well as recreating the compound task with an advanced function:
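The serial, multi-stage idea can be sketched as a chain of steps with a recorded scratchpad (the step functions and their order here are illustrative assumptions):

```python
# Each stage is a small named function; the chain runs them in serial
# and records a scratchpad of intermediate states, mimicking what the
# model is meant to simulate internally.

def replace_x(text: str) -> str:
    return text.replace("X", "y")          # the "replace X" stage

def strip_spaces(text: str) -> str:
    return " ".join(text.split())          # normalise whitespace

def run_chain(text, steps):
    """Run each step in order, keeping a scratchpad of intermediate results."""
    scratchpad = [("input", text)]
    for step in steps:
        text = step(text)
        scratchpad.append((step.__name__, text))
    return text, scratchpad
```

The scratchpad is exactly the "chain" the model should learn to produce before the final answer.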
this should also be done for tasks such as entity recognition, sentiment analysis, etc.:
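As a toy stand-in for one such task, a lexicon-based sentiment scorer (the word lists are invented for illustration, not a real lexicon):

```python
# Toy lexicon-based sentiment labeller: the kind of exact, rule-defined
# target a dataset for this task could be generated from.
POSITIVE = {"good", "great", "excellent", "correct"}
NEGATIVE = {"bad", "wrong", "broken", "poor"}

def sentiment(text: str) -> str:
    """Label text by counting positive vs. negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

Any task with a deterministic labelling function can be turned into training pairs the same way.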
this enables the model to have exact training, so it will produce exact, correct expectations:
the problem is bad data... although we have lots of data, hidden within it is often bad data, badly formatted, etc.:
in truth we need to train the model to understand a THING! <<< what it can or cannot do, its parts... what it is a part of... what it is!! what it is made of... its genus... how it can be used and how it has been used.... the deeper the description of these components, the more it will know about a thing!
the question is.... is the model generating new code or copied code!~ >>>>>
the model should be trained to be a good hallucinator.... i.e. a best guesser... not a repeater... hence training at depth and in multiple ways with the same data.... this is task training: in the pretraining stages, knowledge was given via text dumps, so now we need access points into our corpus probabilities! <<<
it will be able to make true predictions and not hallucinations!! <<< we want the model to generate original voice, original code, creative images, creative text... by utilising its content, hence we need the pretraining data in multiple forms and styles!
hence when training for a task, you will need to first dump the correct corpora, then task-train it afterwards!