Christoph Holthaus committed
Commit d65f135
1 Parent(s): 4a70f4c

more readme ideas

Files changed (1)
  1. README.md +8 -5
README.md CHANGED
@@ -14,12 +14,15 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
  This is a test ...

  TASKS:
- - rewrite generation from scratch or use the one from the Mistral space if possible; alternatively use https://github.com/abetlen/llama-cpp-python#chat-completion
+ - for fast debug: add a debug mode that lets me run direct CLI commands? -> Never for prod!
+ - prod-harden the Docker image with proper users etc., OR mention that this is only a dev build and intended for messing with, no read-only filesystem etc.
+ - rewrite generation from scratch or use the one from the Mistral space if possible; alternatively use https://github.com/abetlen/llama-cpp-python#chat-completion or https://huggingface.co/spaces/deepseek-ai/deepseek-coder-7b-instruct/blob/main/app.py
  - write IN LARGE LETTERS that this is not the original model but a quantized one that is able to run on free CPU inference
- - check how much parallel generation is possible or only one queue, and set max context etc. accordingly. Maybe live-log free RAM etc. to the interface as a "system health" graph if it is not too resource-hungry on its own ... -> Gradio for display??
- - live-stream the response (see the Mistral space!!)
- - log memory usage to the console? Maybe auto-reboot if it gets too slim
- - re-add the system prompt? Maybe check out how to set it up in LM Studio - could be a dropdown, with an option to also set a fixed one via env var when only one model is available ...
+ - test multimodal with llama?
+ - can I use swap in Docker to maximize usable memory?
+ - proper token handling - make it a real chat (if not handled automatically by the chat-completion interface ...)
+ - maybe run a web server locally and have Gradio use the web server only as a backend? (better for async but maybe worse for control - just an idea)
+ - check how much parallel generation is possible or only one queue, and set
  - move the model to download into an env var, with proper error handling
  - chore: clean up the ignore files, Dockerfile, etc.
  - update all deps to one up-to-date version, then PIN them!
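
For the llama-cpp-python chat-completion idea in the TASKS list, a minimal sketch could look like the following; the model path, context size, and thread count are placeholders, not values the Space actually uses.

```python
# Minimal sketch of llama-cpp-python chat completion (placeholder paths/params).
from llama_cpp import Llama

MODEL_PATH = "./model.q4_K_M.gguf"  # hypothetical local GGUF file

llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_threads=2)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, who are you?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```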
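
The "live-stream the response" and "proper token handling / make it a real chat" items could share one pattern: rebuild the chat-completion message list from the UI history and stream deltas from a generator. This is only a sketch assuming the `llm` object above and a recent Gradio version; names and the history format may need adjusting to the Space's actual code.

```python
# Sketch: stream partial replies into a Gradio chat UI (assumes `llm` from above).
import gradio as gr

def chat_fn(message, history):
    # Rebuild the full message list from Gradio's (user, assistant) history pairs.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    partial = ""
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        delta = chunk["choices"][0]["delta"]
        partial += delta.get("content", "")
        yield partial  # each yielded value replaces the currently shown reply

gr.ChatInterface(chat_fn).queue().launch()
```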
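
For the memory items (live-log free RAM as a "system health" signal, log memory usage to the console, auto-reboot if it gets too slim), a rough psutil-based watchdog could look like this; the threshold and the exit-to-restart behaviour are assumptions, not something the Space does today.

```python
# Sketch: log free RAM periodically and exit when it drops below a threshold,
# on the assumption that the Space/container supervisor restarts the process.
import os
import threading
import time

import psutil

MIN_FREE_MB = 200  # hypothetical threshold


def memory_watchdog(interval_s: float = 10.0) -> None:
    while True:
        free_mb = psutil.virtual_memory().available / (1024 * 1024)
        print(f"[health] free RAM: {free_mb:.0f} MB", flush=True)
        if free_mb < MIN_FREE_MB:
            # os._exit terminates the whole process even from a background
            # thread, which should trigger a container restart.
            print("[health] memory too low, exiting for a restart", flush=True)
            os._exit(1)
        time.sleep(interval_s)


threading.Thread(target=memory_watchdog, daemon=True).start()
```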
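
The env-var items (model to download via env var with proper error handling, plus a fixed system prompt when only one model is available) might look roughly like this; `MODEL_URL` and `SYSTEM_PROMPT` are assumed variable names, and a plain urllib download is just one option.

```python
# Sketch: read the model URL and an optional fixed system prompt from env vars.
# MODEL_URL and SYSTEM_PROMPT are assumed names, not ones the Space defines yet.
import os
import sys
import urllib.request

MODEL_URL = os.environ.get("MODEL_URL")
SYSTEM_PROMPT = os.environ.get("SYSTEM_PROMPT", "You are a helpful assistant.")

if not MODEL_URL:
    sys.exit("MODEL_URL is not set - refusing to start without a model to download.")

MODEL_PATH = "model.gguf"
if not os.path.exists(MODEL_PATH):
    try:
        print(f"Downloading model from {MODEL_URL} ...", flush=True)
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    except OSError as err:
        sys.exit(f"Model download failed: {err}")
```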