--- license: cc-by-nc-4.0 datasets: - passing2961/stark-dialogue - passing2961/stark-face-image language: - en base_model: - meta-llama/Llama-3.2-11B-Vision-Instruct tags: - conversational ai --- # Ultron-Summarizer-8B Model Card [🏠 Homepage](https://stark-dataset.github.io/) | [💻 Github](https://github.com/passing2961/Stark) | [📄 Arxiv](https://arxiv.org/abs/2407.03958) | [📕 PDF](https://arxiv.org/pdf/2407.03958) ## List of Provided Model Series - **Ultron-Summarizer-Series:** [🤖 Ultron-Summarizer-1B](https://huggingface.co/passing2961/Ultron-Summarizer-1B) | [🤖 Ultron-Summarizer-3B](https://huggingface.co/passing2961/Ultron-Summarizer-3B) | [🤖 Ultron-Summarizer-8B](https://huggingface.co/passing2961/Ultron-Summarizer-8B) - **Ultron 7B**: [🤖 Ultron-7B](https://huggingface.co/passing2961/Ultron-7B) | [🤖 Ultron-11B](https://huggingface.co/passing2961/Ultron-11B) > 🚨 Disclaimer: All models and datasets are intended for research purposes only. ## Model Description - **Repository:** [Code](https://github.com/passing2961/Stark) - **Paper:** [Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge](https://arxiv.org/abs/2407.03958) - **Point of Contact:** [Young-Jun Lee](mailto:yj2961@kaist.ac.kr) ## Model Details - **Model**: Ultron-11B is a fully open-source multi-modal conversation model that generates the most appropriate image description at the image-sharing moment. - **Date**: Ultron-11B was trained in 2024. - **Training Dataset**: [Stark-Dialogue](https://huggingface.co/datasets/passing2961/stark-dialogue) - **Architecture**: Ultron-11B was trained on top of [LLaMA-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct). ## How to Use ```python import logging from PIL import Image import torch from transformers import ( AutoModelForVision2Seq, BitsAndBytesConfig, AutoProcessor, ) # Define Ultron template ULTRON_TEMPLATE = 'You are an excellent image sharing system that generates token with the following image description. The image description must be provided with the following format: image description . The following conversation is between {name} and AI assistant on {date}. The given image is {name}\'s appearance.\n{dialogue}' # Ultron model initialization def load_ultron_model(model_path): """ Loads the Ultron model and processor. Args: model_path (str): Path to the pre-trained model. Returns: model: Loaded Vision-to-Seq model. processor: Corresponding processor for the model. """ logging.info(f"Loading Ultron model from {model_path}...") quantization_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type='nf4' ) model_kwargs = dict( torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True, device_map="auto", ) processor = AutoProcessor.from_pretrained( 'meta-llama/Llama-3.2-11B-Vision-Instruct', torch_dtype=torch.bfloat16 ) model = AutoModelForVision2Seq.from_pretrained( model_path, **model_kwargs ).eval() logging.info("Ultron model loaded successfully.") return model, processor # Run Ultron model def run_ultron_model(model, processor, dialogue, name='Tom', date='2023.04.20', face_image_path='sample_face.png'): """ Runs the Ultron model with a given dialogue, name, and image. Args: model: Pre-trained model instance. processor: Processor for model input. dialogue (str): Input dialogue for the assistant. name (str): Name of the user. date (str): Date of the conversation. face_image_path (str): Path to the face image file. Returns: str: Description of the shared image. """ logging.info("Running Ultron model...") face_image = Image.open(face_image_path).convert("RGB") prompt = ULTRON_TEMPLATE.format( dialogue=dialogue, name=name, date=date ) messages = [ { "content": [ {"text": prompt, "type": "text"}, {"type": "image"} ], "role": "user" }, ] logging.info("Preparing input for Ultron model...") prompt_input = processor.apply_chat_template(messages, add_generation_prompt=True) inputs = processor(face_image, prompt_input, return_tensors='pt').to('cuda') with torch.inference_mode(): logging.info("Generating output from Ultron model...") output = model.generate( **inputs, do_sample=True, temperature=0.9, max_new_tokens=512, top_p=1.0, use_cache=True, num_beams=1, ) output_text = processor.decode(output[0], skip_special_token=True) logging.info("Output generated successfully from Ultron model.") return parse_ultron_output(output_text) # Parse Ultron output def parse_ultron_output(output): """ Parses the output to extract the image description. Args: output (str): The generated output text from the model. Returns: str: Extracted image description. """ logging.info("Parsing output from Ultron model...") if '' in output: return output.split('')[-1].split('')[0].strip() else: logging.warning(" not found in output.") return output # Example usage def main(): """ Example usage of Ultron model. """ model_path = "passing2961/Ultron-11B" model, processor = load_ultron_model(model_path) dialogue = """Tom: I have so much work at the office, I'm exhausted... Personal AI Assistant: How can I help you feel less tired? Tom: Hmm.. I miss my dog Star at home. Personal AI Assistant: """ image_description = run_ultron_model(model, processor, dialogue) logging.info(f"Image description generated: {image_description}") if __name__ == "__main__": main() ``` ## License and Recommendations 🚨 Ultron-11B is intended to be used for research purposes only. ## Acknowledgement This work was supported by a grant of the KAIST-KT joint research project through AI Tech Lab, Institute of convergence Technology, funded by KT [Project No. G01230605, Development of Task-oriented Persona-based Dialogue Generation Combining Multi-modal Interaction and Knowledge Modeling]. ## Citation If you find the resources in this repository useful, please cite our work: ``` @article{lee2024stark, title={Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge}, author={Lee, Young-Jun and Lee, Dokyong and Youn, Junyoung and Oh, Kyeongjin and Ko, Byungsoo and Hyeon, Jonghwan and Choi, Ho-Jin}, journal={arXiv preprint arXiv:2407.03958}, year={2024} } ```