Lacking documentation of datasets used, architecture, fine-tuning procedures, source code
Intriguing to see Solar described as "a great example of the progress enabled by open source". However, for a model making that claim, Solar is remarkably silent about its own sources. The only dataset details given are:
- Orca-style dataset
- Alpaca-style dataset
It would be helpful to document exactly which datasets and which versions have been used, and to specify the instruction tuning process and overall architecture in more detail.
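For what it's worth, if the sources were declared in the model card's YAML metadata, downstream users could check them programmatically. Here is a minimal sketch using the huggingface_hub library; the repo id and the dataset names in the comments are illustrative placeholders, not confirmed sources:

```python
from huggingface_hub import ModelCard

# Load the model card from the Hub (repo id is a placeholder).
card = ModelCard.load("upstage/SOLAR-0-70b-16bit")

# If the card's YAML header declared its training data, e.g.
#   datasets:
#     - Open-Orca/OpenOrca   # hypothetical entry
#     - tatsu-lab/alpaca     # hypothetical entry
# the exact sources would be machine-readable rather than left unstated.
print(card.data.datasets)
```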
Currently, this model sits at the very bottom of the openness leaderboard: it is more closed and less documented than even Llama 2 itself. Hoping to see this improve!
Well, they were enabled by open source, but they are clearly NOT open source. No dataset, no code, and no commercial use. "Weights available for evaluation/personal use" is not open. One might assume from "alpaca-style dataset" that they are using an actual Alpaca variant; and as Alpaca is CC-BY-NC, they may feel they must then restrict their own license to CC-BY-NC. But CC-BY-NC also requires attribution, and "alpaca-style" is NOT an attribution, which would mean, imo, that they are breaching the Alpaca terms. Strange days.