It's been a while since we shipped native quantization support in diffusers 🧨
We currently support bitsandbytes as the official backend, but using others like torchao is already very simple.
This post is just a reminder of what's possible:
1. Loading a model with a quantization config
2. Saving a model with a quantization config
3. Loading a pre-quantized model
4. enable_model_cpu_offload()
5. Training and loading LoRAs into quantized checkpoints
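As a refresher, here's a minimal sketch of points 1, 2, and 4, assuming bitsandbytes is installed; the FLUX.1-dev checkpoint and the 4-bit NF4 settings are just illustrative choices:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# 1. Load a model with a quantization config (4-bit NF4 via bitsandbytes).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# 2. Save the quantized model. 3. Reloading it later with from_pretrained picks
#    up the serialized quantization config, so no config needs to be passed again.
transformer.save_pretrained("flux-transformer-nf4")

# 4. Plug the quantized transformer into the pipeline and offload to save VRAM.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```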
For anyone who struggles with NER or information extraction with LLMs:
We showed an efficient workflow for token classification, including zero-shot suggestions and model fine-tuning, with Argilla, GLiNER, the NuMind NuExtract LLM, and SpanMarker. @argilla
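As a taste of the zero-shot suggestion step, here's a minimal GLiNER sketch (the checkpoint id and example text are illustrative; the Argilla annotation loop and SpanMarker fine-tuning aren't shown):

```python
from gliner import GLiNER

# Load a pretrained GLiNER checkpoint (illustrative choice of model id).
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

text = "OpenAI was founded in San Francisco in December 2015."
# Labels are free-form strings, so new entity types work zero-shot.
labels = ["organization", "location", "date"]

for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```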
✨ Unified 3D generation & text understanding.
✨ 3D meshes as plain text for seamless LLM integration.
✨ High-quality 3D outputs rivaling specialized models.
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
- a new vision language model with 9x fewer image tokens, super efficient
- aligned with DPO to reduce hallucinations
- Apache 2.0 license
Models
💻 Coding: The Qwen team released two Qwen2.5-Coder checkpoints, 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models, with instruction-SFT'd versions and their datasets!
🖼️ Image/Video Gen: Alibaba's vision lab released In-Context LoRA: 10 LoRA models on different themes based on Flux. Also, Mochi, the SOTA video generation model with an Apache 2.0 license, is now natively supported in diffusers (a sketch follows below).
🖼️ VLMs/Multimodal: NexaAIDev released OmniVision-968M, a new vision language model aligned with DPO to reduce hallucinations, which also comes with GGUF checkpoints. Microsoft released LLM2CLIP, a new CLIP-like model with a longer context window, allowing complex text inputs and better search.
🎮 AGI?: Etched released Oasis 500M, a diffusion-based open world model that takes keyboard input and outputs gameplay 🤯
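Since Mochi is now natively supported, a minimal text-to-video sketch with diffusers might look like this (prompt, frame count, and output path are illustrative; check the model card for recommended settings):

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# Load the Mochi pipeline and offload submodules to CPU to fit smaller GPUs.
pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Generate a short clip and save it as an MP4.
frames = pipe("a corgi surfing a small wave at sunset", num_frames=84).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```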
Datasets
Common Corpus: a text dataset with 2T tokens and a permissive license, covering EN/FR across various sources: code, science, finance, culture
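To poke at a corpus this size without downloading it all, streaming via the datasets library works; note the dataset id below is an assumption, so verify it on the Hub:

```python
from datasets import load_dataset

# Hypothetical dataset id -- check the actual Hub page before running.
ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Stream a few records instead of materializing 2T tokens locally.
for i, example in enumerate(ds):
    print(example)
    if i == 2:
        break
```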
Reacted to maxiw's post:
Microsoft released LLM2CLIP: a CLIP model with a longer context window for complex text inputs 🤯 All models, with Apache 2.0 license, are here: microsoft/llm2clip-672323a266173cfa40b32d4c
TL;DR: they replaced CLIP's text encoder with various LLMs fine-tuned on captioning, yielding better top-k accuracy on retrieval. This will enable better image-text retrieval, better zero-shot image classification, and better vision language models. Read the paper to learn more: LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation (2411.04997)
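To make the retrieval claim concrete, here's a generic image-text matching sketch using vanilla CLIP from transformers as a stand-in; LLM2CLIP checkpoints follow the same encode-and-compare pattern, but their exact loading code may differ (see the model cards):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in checkpoint: plain CLIP; LLM2CLIP swaps the text tower for an LLM.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # illustrative local file
texts = ["a cat sleeping on a sofa", "a dog playing in the park"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity

# The highest-scoring caption is the retrieval match.
print(texts[logits.argmax(dim=-1).item()])
```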