Xiaotian Han

xiaotianhan

AI & ML interests

Multimodal LLM

Recent Activity

authored a paper 22 days ago
liked a model 23 days ago
shallowdream204/DreamClear
upvoted a paper 24 days ago

Organizations

xiaotianhan's activity

posted an update 2 months ago
Excited to announce the release of InfiMM-WebMath-40B, the largest open-source multimodal pretraining dataset designed to advance mathematical reasoning in AI!

With 40 billion tokens, this dataset aims to enhance the reasoning capabilities of multimodal large language models in the domain of mathematics.

If you're interested in MLLMs, AI, and math reasoning, check out our work and dataset:

HF: InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning (2409.12568)
Dataset: Infi-MM/InfiMM-WebMath-40B
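
For anyone who wants to peek at the data before committing to a full download, here is a minimal sketch using the datasets library. It assumes the Hub repo id above and streams records rather than downloading all 40 billion tokens; the split name and record fields are assumptions, so check the dataset card for the actual configuration.

from datasets import load_dataset

# Stream instead of downloading the full 40B-token corpus up front.
# The split name and record fields here are assumptions -- see the
# dataset card for the actual schema.
ds = load_dataset("Infi-MM/InfiMM-WebMath-40B", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example.keys())   # e.g. interleaved text and image reference fields
    if i == 2:
        break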
reacted to victor's post with 🚀 7 months ago
The hype is real: a mysterious gpt2-chatbot model has appeared on the LLM Arena leaderboard 👀.
It seems to be at least on par with the top-performing models (closed and open).

To try it out, go to https://chat.lmsys.org/, click the Direct Chat tab, and select gpt2-chatbot.

Place your bet: what do you think it is?
replied to their post 7 months ago

Thanks for your interest. Yes, we will open-source our code and pretrained weights soon.

posted an update 8 months ago
Happy to share our recent work. We noticed that image resolution plays an important role, whether in improving multimodal large language model (MLLM) performance or in Sora-style any-resolution encoder-decoders. We hope this work can help lift the 224x224 resolution restriction of ViT.

ViTAR: Vision Transformer with Any Resolution (2403.18361)
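
For context, the common baseline for running a pretrained ViT at a resolution other than 224x224 is to interpolate its learned position embeddings. The sketch below shows that standard trick in PyTorch as a point of comparison only; it is not ViTAR's method (see the paper for the any-resolution design), and the tensor layout and 16-pixel patch size are assumptions.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_hw, patch=16):
    # pos_embed: (1, 1 + N, dim) with a leading [CLS] token, N = (224 // patch) ** 2
    cls_tok, grid = pos_embed[:, :1], pos_embed[:, 1:]
    old = int(grid.shape[1] ** 0.5)
    new_h, new_w = new_hw[0] // patch, new_hw[1] // patch
    grid = grid.reshape(1, old, old, -1).permute(0, 3, 1, 2)        # (1, dim, old, old)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic",
                         align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, -1)   # (1, N', dim)
    return torch.cat([cls_tok, grid], dim=1)

# Example: adapt a 224x224 embedding table to a 448x336 input.
pe = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pe, (448, 336)).shape)   # torch.Size([1, 589, 768]): 1 CLS + 28*21 patches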
reacted to akhaliq's post with 👍 9 months ago
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens (2402.13753)

Large context window is a desirable feature in large language models (LLMs). However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE, which, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with at most 1k fine-tuning steps at training lengths within 256k, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformities in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k-length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE at 8k length to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.
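
To make the "positional interpolation" idea concrete, here is a minimal sketch of the basic uniform variant: positions beyond the trained window are rescaled so they map back into the range RoPE was trained on. LongRoPE's contribution is the non-uniform, per-dimension rescaling found by search plus the progressive extension recipe described above; the uniform version below is only the baseline it improves on, and the lengths are made up for illustration.

import torch

def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    # Standard RoPE rotation angles. scale < 1 implements uniform positional
    # interpolation: out-of-window positions are squeezed back into the
    # range of positions seen during pretraining.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.float() * scale, inv_freq)   # (len(positions), dim // 2)

# Illustrative numbers only: pretend the model was trained at 4k and we target 2048k.
train_len, target_len = 4096, 2048 * 1024
far_positions = torch.arange(target_len - 4, target_len)      # positions far past 4k
angles = rope_angles(far_positions, scale=train_len / target_len)
print(angles.max() <= rope_angles(torch.tensor([train_len])).max())   # tensor(True)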
reacted to DmitryRyumin's post with ❤️ 9 months ago
Exciting announcement: NVIDIA AI Foundation Models

Interact effortlessly with the latest SOTA AI model APIs, all optimized on the powerful NVIDIA accelerated computing stack, right from your browser!

Web page: https://catalog.ngc.nvidia.com/ai-foundation-models

Favorites:

Code Generation:
1. Code Llama 70B: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/codellama-70b
   Model: codellama/CodeLlama-70b-hf

Text and Code Generation:
1. Gemma 7B: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/gemma-7b
   Model: google/gemma-7b
2. Yi-34B: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/yi-34b
   Model: 01-ai/Yi-34B

Text Generation:
1. Mamba-Chat: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/mamba-chat
   Model: havenhq/mamba-chat
2. Llama 2 70B: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/llama2-70b
   Model: meta-llama/Llama-2-70b

Text-To-Text Translation:
1. SeamlessM4T V2: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/seamless-m4t2-t2tt
   Model: facebook/seamless-m4t-v2-large

Image Generation:
1. Stable Diffusion XL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/sdxl

Image Conversation:
1. NeVA-22B: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/neva-22b

Image Classification and Object Detection:
1. CLIP: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/clip

Voice Conversion:
1. Maxine Voice Font: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/voice-font

Multimodal LLM (MLLM):
1. Kosmos-2: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/kosmos-2
reacted to dvilasuero's post with ❤️ 9 months ago
Introducing OpenHermesPreferences: the largest open AI feedback dataset for RLHF & DPO

> Using LLMs to improve other LLMs, at scale!

Built in collaboration with the Hugging Face H4 team, it's a 1M-preference dataset on top of the amazing @teknium's dataset.

Dataset:
argilla/OpenHermesPreferences

The dataset is another example of open collaboration:

> The H4 team created responses with Mixtral using llm-swarm

> Argilla created responses with NousResearch Hermes-2-Yi-34B using distilabel

> The H4 team ranked these responses, plus the original response, with PairRM from AllenAI, the University of Southern California, and Zhejiang University (@yuchenlin, @DongfuTingle and colleagues)

We hope this dataset will help the community's research efforts towards understanding the role of AI feedback for LLM alignment.

We're particularly excited about the ability to filter specific subsets to improve LLM skills like math or reasoning.

Here's how easy it is to filter by subset:

from datasets import load_dataset

ds = load_dataset("argilla/OpenHermesPreferences", split="train")

# Get the categories of the source dataset
# ['airoboros2.2', 'CamelAI', 'caseus_custom', ...]
sources = ds.unique("source")

# Filter for a subset
ds_filtered = ds.filter(lambda x: x["source"] in ["metamath", "EvolInstruct_70k"], num_proc=6)


As usual, all the scripts to reproduce this work are available and open to the community!

argilla/OpenHermesPreferences

Such a fun collab between @vwxyzjn, @plaguss, @kashif, @philschmid & @lewtun!

Open Source AI FTW!
posted an update 10 months ago
Thrilled to share some of our recent work in the field of Multimodal Large Language Models (MLLMs).

1๏ธโƒฃ A Survey on Multimodal Reasoning ๐Ÿ“š
Are you curious about the reasoning abilities of MLLMs? In our latest survey, we delve into the world of multimodal reasoning. We comprehensively review existing evaluation protocols, categorize the frontiers of MLLMs, explore recent trends in their applications for reasoning-intensive tasks, and discuss current practices and future directions. For an in-depth exploration, check out our paper: Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning (2401.06805)

2๏ธโƒฃ Advancing Flamingo with InfiMM ๐Ÿ”ฅ
Building upon the foundation of Flamingo, we introduce the InfiMM model series. InfiMM is a reproduction of Flamingo, enhanced with stronger Large Language Models (LLMs) such as LLaMA2-13B, Vicuna-13B, and Zephyr-7B. We've meticulously filtered the pre-training data and fine-tuning instructions, resulting in superior performance on recent benchmarks like MMMU, InfiMM-Eval, MM-Vet, and more. Explore the power of InfiMM on Hugging Face: Infi-MM/infimm-zephyr (a loading sketch follows at the end of this post).

3๏ธโƒฃ Exploring Multimodal Instruction Fine-tuning ๐Ÿ–ผ๏ธ
Visual Instruction Fine-tuning (IFT) is crucial for aligning MLLMs' output with user intentions. Our research identified challenges with models trained on the LLaVA-mix-665k dataset, particularly in multi-round dialog settings. To address this, we've created a new IFT dataset with high-quality, diverse instruction annotations and images sourced exclusively from the COCO dataset. Our experiments demonstrate that when fine-tuned with this dataset, MLLMs excel in open-ended evaluation benchmarks for both single-round and multi-round dialog settings. Dive into the details in our paper: COCO is "ALL" You Need for Visual Instruction Fine-tuning (2401.08968)

Stay tuned for more exciting developments.
Special thanks to all our collaborators: @Ye27 @wwyssh @Yongfei @Yi-Qi638 @xudonglin @KhalilMrini @lllliuhhhhggg @Borise @Hongxia
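
For readers who want to try the InfiMM checkpoint mentioned in item 2, here is a minimal, hedged sketch of pulling a custom-code model from the Hub with transformers. It shows only the generic trust_remote_code pattern; the exact model class, processor, and generation call for Infi-MM/infimm-zephyr are documented on the model card, so treat the Auto classes and dtype below as assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Infi-MM/infimm-zephyr"

# trust_remote_code lets transformers load the custom modeling code that
# ships with the repo; the Auto classes here are an assumption -- follow
# the model card if it specifies different entry points.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")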