
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

😃 TOP 3 on HuggingFace for posts 🤗 Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io

Recent Activity

posted an update about 5 hours ago
posted an update 3 days ago

Organizations

MLX Community · Social Post Explorers · C4AI Community

singhsidhukuldeep's activity

posted an update about 5 hours ago
Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding.

The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with:
- Process 40,000+ pages across 3,000+ documents
- Answer questions requiring information from multiple pages
- Understand visual elements like charts, tables, and figures
- Support both closed-domain (single document) and open-domain (multiple documents) queries

Under the hood, M3DocRAG operates through three sophisticated stages:

>> Document Embedding:
- Converts PDF pages to RGB images
- Uses ColPali to project both text queries and page images into a shared embedding space
- Creates dense visual embeddings for each page while maintaining visual information integrity

>> Page Retrieval:
- Employs MaxSim scoring to compute relevance between queries and pages
- Implements inverted file indexing (IVFFlat) for efficient search
- Reduces retrieval latency from 20s to under 2s when searching 40K+ pages
- Supports approximate nearest neighbor search via Faiss
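
To make the retrieval stage concrete, here is a minimal sketch of MaxSim scoring over a Faiss IVFFlat index. The embedding dimension, cluster count, and per-page token counts below are illustrative assumptions, not values from the paper:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128  # per-token embedding dimension (illustrative)
# Pretend ~40K pages each yield ~10 ColPali-style token vectors
page_vecs = np.random.rand(400_000, d).astype("float32")
faiss.normalize_L2(page_vecs)

# IVFFlat clusters the vectors so each query only scans a few inverted lists
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(page_vecs)
index.add(page_vecs)
index.nprobe = 16  # inverted lists visited per query vector

def maxsim(query_vecs: np.ndarray, page_token_vecs: np.ndarray) -> float:
    # MaxSim: each query token keeps its best-matching page token, then sum
    return (query_vecs @ page_token_vecs.T).max(axis=1).sum()
```

In a full pipeline, the approximate IVFFlat search shortlists candidate pages and exact MaxSim re-scores the shortlist, which is where the 20s-to-2s latency drop comes from.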

>> Question Answering:
- Leverages Qwen2-VL 7B as the multi-modal language model
- Processes retrieved pages through a visual encoder
- Generates answers considering both textual and visual context

The results are impressive:
- State-of-the-art performance on MP-DocVQA benchmark
- Superior handling of non-text evidence compared to text-only systems
- Significantly better performance on multi-hop reasoning tasks

This is a game-changer for industries dealing with large document volumes—finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.
posted an update 1 day ago
Exciting breakthrough in multimodal search technology! @nvidia researchers have developed MM-Embed, a groundbreaking universal multimodal retrieval system that's changing how we think about search.

Key innovations:
• First-ever universal multimodal retriever that excels at both text and image searches across diverse tasks
• Leverages advanced multimodal LLMs to understand complex queries combining text and images
• Implements novel modality-aware hard negative mining to overcome modality bias issues
• Achieves state-of-the-art performance on M-BEIR benchmark while maintaining superior text retrieval capabilities

Under the hood:
The system uses a sophisticated bi-encoder architecture with LLaVA-NeXT (based on Mistral 7B) as its backbone. It employs a unique two-stage training approach: first with random negatives, then with carefully mined hard negatives to improve cross-modal understanding.

The real magic happens in the modality-aware negative mining, where the system learns to distinguish between incorrect modality matches and unsatisfactory information matches, ensuring retrieved results match both content and format requirements.
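
As a rough illustration of that two-stage training, here is a minimal InfoNCE sketch in PyTorch with in-batch negatives plus one mined hard negative per query. MM-Embed's exact loss, temperature, and mining policy are not reproduced here, so treat this as an assumption-laden sketch:

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, hard_neg, temperature=0.05):
    """InfoNCE with in-batch negatives plus one mined hard negative per query.
    q, pos, hard_neg: (batch, dim) L2-normalized bi-encoder embeddings."""
    in_batch = q @ pos.T                             # (B, B); diagonal = positives
    hard = (q * hard_neg).sum(dim=-1, keepdim=True)  # (B, 1); mined negatives
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)  # positive sits on the diagonal
    return F.cross_entropy(logits, labels)
```

Stage one would fill hard_neg with random samples; stage two swaps in negatives mined by the retriever itself, filtered by the modality-aware criteria described above.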

What sets it apart is its ability to handle diverse search scenarios, from simple text queries to complex combinations of images and text, all while maintaining high accuracy across different domains.
posted an update 3 days ago
Excited to share @LinkedIn's innovative approach to evaluating semantic search quality! As part of the Search AI team, we've developed a groundbreaking evaluation pipeline that revolutionizes how we measure search relevance.

>> Key Innovation: On-Topic Rate (OTR)
This novel metric measures the semantic match between queries and search results, going beyond simple keyword matching. The system evaluates whether content is truly relevant to the query's intent, not just matching surface-level terms.

>> Technical Implementation Details
Query Set Construction
• Golden Set: Contains curated top queries and complex topical queries
• Open Set: Includes trending queries and random production queries for diversity

Evaluation Pipeline Architecture
1. Query Processing:
- Retrieves top 10 documents per query
- Extracts post text and article information
- Processes both primary content and reshared materials

2. GAI Integration:
- Leverages GPT-3.5 with specialized prompts
- Produces three key outputs:
  - Binary relevance decision
  - Relevance score (0-1 range)
  - Decision reasoning
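
LinkedIn's actual prompts aren't reproduced here, but an LLM-as-judge call producing those three outputs might look roughly like this. The OpenAI Python SDK usage is real; the prompt wording and JSON schema are my assumptions:

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (v1+)

client = OpenAI()

PROMPT = (
    "You are a search relevance rater.\n"
    "Query: {query}\nPost: {post}\n"
    'Reply with JSON only: {{"on_topic": true/false, "score": 0-1, "reasoning": "..."}}'
)

def judge(query: str, post: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(query=query, post=post)}],
        temperature=0,  # deterministic judgments for reproducible evaluation
    )
    return json.loads(resp.choices[0].message.content)
```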

Quality Assurance
• Validation achieved 94.5% accuracy on a test set of 600 query-post pairs
• Human evaluation showed 81.72% consistency with expert annotators

>> Business Impact
This system now serves as LinkedIn's benchmark for content search experiments, enabling:
• Weekly performance monitoring
• Rapid offline testing of new ML models
• Systematic identification of improvement opportunities

What are your thoughts on semantic search evaluation?
posted an update 6 days ago
Good folks at Google have released a paper on CAT4D, a cutting-edge framework that's pushing the boundaries of multi-view video generation. Probably coming to Google Photos near you!

This innovative approach introduces a novel way to create dynamic 4D content with unprecedented control and quality.

Key Technical Innovations:
- Multi-view video diffusion model architecture that handles both spatial and temporal dimensions simultaneously
- Zero-shot text-to-4D generation pipeline
- Temporal-aware attention mechanisms for consistent motion synthesis
- View-consistent generation across multiple camera angles

Technical Deep Dive:
The framework employs a sophisticated cascade of diffusion models that work in harmony to generate consistent content across both space and time. The architecture leverages view-dependent rendering techniques while maintaining temporal coherence through specialized attention mechanisms.

What sets CAT4D apart:
- Real-time view synthesis capabilities
- Seamless integration of temporal and spatial information
- Advanced motion handling through specialized temporal encoders
- Robust view consistency preservation across generated frames

Thoughts on how this could transform content creation in your industry?
updated a Space 6 days ago
posted an update 8 days ago
Exciting breakthrough in AI Hallucination Detection & Mitigation! Introducing THaMES (Tool for Hallucination Mitigations and EvaluationS), a groundbreaking end-to-end framework tackling one of AI's biggest challenges: hallucination in Large Language Models.

Key Technical Features:

• Automated QA Testset Generation using weighted sampling and batch processing
- Implements VectorStoreIndex for knowledge base construction
- Uses text-embedding-3-large for semantic similarity
- Generates 6 question types: simple, reasoning, multi-context, situational, distracting, and double

• Advanced Hallucination Detection
- Utilizes a fine-tuned NLI model (deberta-v3-base-tasksource-nli)
- Implements HHEM-2.1-Open for factual consistency scoring
- Combines entailment and factual consistency for ensemble scoring
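
A minimal sketch of what such ensemble scoring could look like with Hugging Face transformers. The model name follows the post; the Hub namespace, label handling, and equal weighting are my assumptions:

```python
from transformers import pipeline

# NLI entailment: does the retrieved source support the model's answer?
nli = pipeline("text-classification", model="sileod/deberta-v3-base-tasksource-nli")

def ensemble_score(source: str, answer: str, hhem_score: float) -> float:
    """Average an NLI entailment probability with an HHEM-style
    factual-consistency score (equal weighting is an assumption)."""
    out = nli({"text": source, "text_pair": answer}, top_k=None)
    entail = next(o["score"] for o in out if o["label"].lower() == "entailment")
    return 0.5 * entail + 0.5 * hhem_score
```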

• Multiple Mitigation Strategies
- In-Context Learning with Chain-of-Verification (CoVe)
- Retrieval-Augmented Generation (RAG)
- Parameter-Efficient Fine-Tuning (PEFT) using LoRA

Real-world Results:
- GPT-4o showed significant improvement with RAG
- Llama-3.1 performed better with In-Context Learning
- PEFT significantly improved Llama-3.1's hallucination metrics

Why it matters:
This framework sets a new standard for reliable AI development by providing comprehensive tools to evaluate and mitigate hallucinations in LLMs. Perfect for AI researchers, developers, and organizations focused on building trustworthy AI systems
posted an update 10 days ago
Good folks from @amazon, @Stanford, and other great institutions have released “A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models”!

This comprehensive survey examines over 32 cutting-edge techniques to combat hallucination in Large Language Models (LLMs). As LLMs become increasingly integral to our daily operations, addressing their tendency to generate ungrounded content is crucial.

Retrieval-Augmented Generation (RAG) Innovations:
- Pre-generation retrieval using LLM-Augmenter with Plug-and-Play modules
- Real-time verification through the EVER framework implementing three-stage validation
- Post-generation refinement via the RARR system for automated attribution

Advanced Decoding Strategies:
- Context-Aware Decoding (CAD) utilizing contrastive output distribution
- DoLa's innovative approach of contrasting logit differences between transformer layers
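
Of these, CAD is simple enough to sketch: contrast next-token logits computed with and without the retrieved context. The alpha value and greedy decoding below are simplifications, and the model is assumed to be a Hugging Face causal LM:

```python
import torch

def cad_next_logits(model, ids_with_ctx, ids_no_ctx, alpha=0.5):
    """Context-Aware Decoding: amplify what the context contributes by
    contrasting logits computed with and without the context."""
    with torch.no_grad():
        with_ctx = model(ids_with_ctx).logits[:, -1, :]  # conditioned on context
        no_ctx = model(ids_no_ctx).logits[:, -1, :]      # parametric knowledge only
    # (1 + alpha) * contextual - alpha * contextless; decode greedily from this
    return (1 + alpha) * with_ctx - alpha * no_ctx
```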

Knowledge Integration Methods:
- The RHO framework leveraging entity representations and relation predicates
- FLEEK's intelligent fact verification system using curated knowledge graphs

Novel Loss Functions:
- Text Hallucination Regularization (THR) derived from mutual information
- The mFACT metric for evaluating faithfulness in multilingual contexts

This research provides a structured taxonomy for categorizing these mitigation techniques, offering valuable insights for practitioners and researchers working with LLMs.

What are your thoughts on hallucination mitigation in LLMs?
posted an update 12 days ago
Excited to share my analysis of the groundbreaking DCN-V2 paper from @Google, which introduces significant improvements to deep learning recommendation systems!

Key technical highlights:

>> Core Architecture
- Starts with an embedding layer that handles both sparse categorical and dense features
- Unique capability to handle variable embedding sizes from small to large vocabulary sizes
- Cross network creates explicit bounded-degree feature interactions
- Deep network complements with implicit feature interactions
- Two combination modes: stacked and parallel architectures

>> Key Technical Innovations
- Enhanced cross layers with full matrix-based feature interaction learning instead of vector-based
- Mixture of Low-Rank architecture with:
* Multiple expert networks learning in different subspaces
* Dynamic gating mechanism to adaptively combine experts
* Reduced time complexity when the rank is kept much smaller than the input dimension
* Support for non-linear transformations in projected spaces
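
For intuition, a minimal PyTorch sketch of the DCN-V2 cross layer, x_{l+1} = x0 * (W·x_l + b) + x_l, with the low-rank option; the mixture-of-experts gating described above is omitted for brevity:

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """DCN-V2 cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l.
    A positive `rank` replaces the full d x d matrix with a U @ V^T pair."""
    def __init__(self, d: int, rank: int = 0):
        super().__init__()
        if rank > 0:
            self.w = nn.Sequential(nn.Linear(d, rank, bias=False),
                                   nn.Linear(rank, d))   # project down, then up
        else:
            self.w = nn.Linear(d, d)                     # full-matrix interaction

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl  # element-wise cross with residual connection
```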

>> Production Optimizations
- Low-rank matrix approximation leveraging singular value decay patterns
- Mixture-of-Experts decomposition into smaller subspaces
- Efficient parameter allocation between cross and deep networks
- Automatic feature interaction learning for higher-order interactions in multi-layered networks
- Support for both homogeneous and heterogeneous polynomial patterns

>> Real-World Impact
- Successfully deployed across Google's recommendation systems
- Significant gains in both offline accuracy and online metrics
- Better performance-latency tradeoffs through low-rank approximations
- Proven effectiveness on large-scale data with billions of training examples

This represents a major leap forward in making deep learning recommendation systems more practical and efficient at scale.

Thoughts? Would love to hear your experiences implementing similar architectures in production!
posted an update 13 days ago
It's always exciting to revisit Google's DCN paper—impractical but good!

Deep & Cross Network (DCN) - a groundbreaking approach to click-through rate prediction that's revolutionizing digital advertising!

Key Innovation:
DCN introduces a novel cross-network architecture that automatically learns feature interactions without manual engineering. What sets it apart is its ability to explicitly model bounded-degree feature crossings while maintaining the power of deep neural networks.

Technical Deep Dive:
- The architecture combines a cross network with a deep network in parallel.
- The cross network performs automatic feature crossing at each layer.
- The embedding layer transforms sparse categorical features into dense vectors.
- Cross layers use a unique formula that enables efficient high-degree polynomial feature interactions.
- Memory-efficient design with linear complexity O(d) in the input dimension.
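
For contrast with DCN-V2's full-matrix version above, here is a minimal PyTorch sketch of the original cross layer; computing the scalar x_l·w first is exactly the projection trick that keeps the layer linear in d:

```python
import torch
import torch.nn as nn

class CrossLayerV1(nn.Module):
    """Original DCN cross layer: x_{l+1} = x0 * (x_l . w) + b + x_l.
    Taking the dot product x_l . w first (one scalar per example) avoids
    materializing the d x d outer product x0 @ x_l^T."""
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d) / d ** 0.5)
        self.b = nn.Parameter(torch.zeros(d))

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * (xl @ self.w).unsqueeze(-1) + self.b + xl
```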

Performance Highlights:
- Outperforms traditional DNN models with 60% less memory usage.
- Achieved 0.4419 logloss on the Criteo Display Ads dataset.
- Consistently performs better than state-of-the-art models like Deep Crossing and Factorization Machines.
- Exceptional performance on non-CTR tasks like Forest Covertype (97.40% accuracy).

Under the Hood:
- Uses embedding vectors of dimension 6 × (category cardinality)^(1/4).
- Implements batch normalization and the Adam optimizer.
- The cross network depth determines the highest polynomial degree of feature interactions.
- An efficient projection mechanism reduces cubic computational cost to linear.
- Parameter sharing enables better generalization to unseen feature interactions.

Key Advantages:
1. No manual feature engineering required.
2. Explicit feature crossing at each layer.
3. Highly memory-efficient.
4. Scalable to web-scale data.
5. Robust performance across different domains.

Thoughts on how this could transform digital advertising?
posted an update 14 days ago
Sorry judge, my lawyer hallucinated? 😂 If you get an AI lawyer, you would want it to be hallucination-free!

New @Stanford and @Yale research reveals surprising findings about leading AI legal research tools. Here's what you need to know:

>> Key Findings
The study tested LexisNexis (Lexis+ AI), Thomson Reuters (Westlaw AI & Ask Practical Law AI), and GPT-4, finding hallucination rates between 17% and 33% despite claims of being "hallucination-free".

>> Technical Deep Dive
The research evaluated these tools using Retrieval-Augmented Generation (RAG) architecture, which operates in two crucial steps:

1. Retrieval System:
- Uses neural text embeddings to capture semantic meaning
- Employs both lexical and semantic search mechanisms
- Implements document filtering and extraction
- Retrieves relevant legal documents from vast databases

2. Generation Pipeline:
- Processes retrieved documents alongside original queries
- Synthesizes information from multiple legal sources
- Generates responses based on retrieved context
- Includes citation verification mechanisms
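
A compact sketch of that two-step retrieve-then-generate structure. The embedding model is a generic stand-in, not a legal-domain retriever, and the prompt wording is my assumption:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative general-purpose model

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Step 1: neural retrieval - rank documents by embedding similarity."""
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q_vec)[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Step 2: generation - ground the answer in (and cite) retrieved sources."""
    sources = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(retrieved))
    return f"Answer using only these sources, with citations:\n{sources}\n\nQ: {query}\nA:"
```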

>> Performance Breakdown:
- Lexis+ AI: 65% accuracy rate
- Westlaw AI: 42% accuracy rate
- Ask Practical Law AI: Over 60% incomplete answers

>> Why This Matters
This research exposes critical vulnerabilities in AI legal tools that lawyers increasingly rely on. It's essential for legal professionals to understand these limitations when incorporating AI into their practice.
posted an update 16 days ago
Exciting breakthrough in LLM reasoning: Introducing "Thread of Thought" (ThoT) - a novel prompting strategy that revolutionizes how language models handle chaotic contexts!

Unlike traditional approaches that struggle with complex, interleaved information, ThoT enables LLMs to methodically segment and analyze extended contexts with remarkable precision. Here's how it works:

Technical Deep Dive:
- ThoT employs a two-step prompting mechanism:
1. Initial Analysis: Uses a template combining chaotic context (X) and query (Q) with a trigger sentence that initiates systematic reasoning.
2. Conclusion Refinement: Leverages the organized thought sequence to extract definitive answers.
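
Concretely, the trigger sentence below follows the paper's published template; the helper function is my own scaffolding:

```python
# Step 1: elicit systematic reasoning over the chaotic context X and query Q
THOT_STEP1 = (
    "{context}\n\nQ: {query}\n"
    "Walk me through this context in manageable parts step by step, "
    "summarizing and analyzing as we go.\nA:"
)
# Step 2: distill the organized reasoning into a definitive answer
THOT_STEP2 = "{step1}{reasoning}\nTherefore, the answer is:"

def thot_prompt(context: str, query: str, reasoning: str = "") -> str:
    step1 = THOT_STEP1.format(context=context, query=query)
    # First call the LLM with step1; pass its output back in as `reasoning`
    return step1 if not reasoning else THOT_STEP2.format(step1=step1, reasoning=reasoning)
```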

Implementation Details:
- Seamlessly integrates as a "plug-and-play" module with existing LLMs.
- Requires no model retraining or fine-tuning.
- Works with various prompting techniques and model architectures.

Performance Highlights:
- Outperformed traditional methods on PopQA and EntityQ datasets.
- Achieved 57.4% accuracy with GPT-3.5-turbo (vs. 48.2% for Chain-of-Thought).
- Demonstrated superior performance across model scales, from 7B to 70B parameters.

Key Applications:
- Retrieval-augmented generation.
- Multi-turn conversation responses.
- Complex reasoning tasks requiring information synthesis.

What makes it special: ThoT mirrors human cognitive processes by breaking down complex information into manageable segments while maintaining logical continuity – a game-changer for handling information-dense contexts.
New activity in maxiw/hf-posts 17 days ago

Update Request

#2 opened 17 days ago by singhsidhukuldeep
posted an update 17 days ago
Good folks at @nvidia and @Tsinghua_Uni have released LLAMA-MESH - A Revolutionary Approach to 3D Content Generation!

This innovative framework enables the direct generation of 3D meshes from natural language prompts while maintaining strong language capabilities.

Here is the Architecture & Implementation!

>> Core Components

Model Foundation
- If you haven't guessed it yet, it's built on the LLaMA-3.1-8B-Instruct base model
- Maintains original language capabilities while adding 3D generation
- Context length is set to 8,000 tokens

3D Representation Strategy
- Uses the OBJ file format for mesh representation
- Quantizes vertex coordinates into 64 discrete bins per axis
- Sorts vertices by z-y-x coordinates, from lowest to highest
- Sorts faces by the lowest vertex indices for consistency
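
A minimal sketch of the quantize-and-sort step. The bin count comes from the post; everything else, including the skipped face re-indexing, is simplified:

```python
import numpy as np

def quantize_vertices(verts: np.ndarray, bins: int = 64) -> np.ndarray:
    """Map float vertex coordinates into `bins` integer levels per axis so a
    mesh serializes into a small, LLM-friendly token vocabulary."""
    lo, hi = verts.min(axis=0), verts.max(axis=0)
    return np.round((verts - lo) / (hi - lo + 1e-9) * (bins - 1)).astype(int)

def sort_zyx(q: np.ndarray) -> np.ndarray:
    """Sort vertices z-y-x, lowest first (np.lexsort's last key is primary).
    A real pipeline must also remap face indices after sorting."""
    return q[np.lexsort((q[:, 0], q[:, 1], q[:, 2]))]
```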

Data Processing Pipeline
- Filters meshes to a maximum of 500 faces for computational efficiency
- Applies random rotations (0°, 90°, 180°, 270°) for data augmentation
- Generates ~125k mesh variations from 31k base meshes
- Uses Cap3D-generated captions for text descriptions

>> Training Framework

Dataset Composition
- 40% Mesh Generation tasks
- 20% Mesh Understanding tasks
- 40% General Conversation (UltraChat dataset)
- 8x training turns for generation, 4x for understanding

Training Configuration
- Deployed on 32 A100 GPUs (for Nvidia, this is literally in-house)
- 21,000 training iterations
- Global batch size: 128
- AdamW optimizer with a 1e-5 learning rate
- 30-step warmup with cosine scheduling
- Total training time: approximately 3 days (based on the paper)

This research opens exciting possibilities for intuitive 3D content creation through natural language interaction. The future of digital design is conversational!
posted an update 18 days ago
It's not every day you see the No. 1 ranked paper of the day open-sourcing a very powerful image editing app!

Fascinating to see MagicQuill - a groundbreaking interactive image editing system that makes precise photo editing effortless through advanced AI!

The system's architecture features three sophisticated components:

1. Editing Processor:
- Implements a dual-branch architecture integrated into a latent diffusion framework
- Utilizes PiDiNet for edge map extraction and content-aware per-pixel inpainting
- Features a specialized UNet architecture with zero-convolution layers for feature insertion
- Employs denoising score matching for training the control branch
- Processes both structural modifications via scribble guidance and color manipulation through downsampled color blocks
- Maintains pixel-level control through VAE-based latent space operations

2. Painting Assistor:
- Powered by a fine-tuned LLaVA multimodal LLM using Low-Rank Adaptation (LoRA)
- Trained on a custom dataset derived from Densely Captioned Images (DCI)
- Processes user brushstrokes through specialized Q&A tasks for add/subtract/color operations
- Features bounding box coordinate normalization for precise stroke localization
- Implements streamlined single-word/phrase outputs for real-time performance

3. Idea Collector:
- Built as a modular ReactJS component library
- Supports cross-platform deployment via HTTP protocols
- Compatible with Gradio and ComfyUI frameworks
- Features comprehensive layer management and parameter adjustment capabilities
- Implements real-time canvas updates and preview generation

The system outperforms existing solutions like SmartEdit and BrushNet in edge alignment and color fidelity while maintaining seamless integration with popular AI frameworks.

What are your thoughts on AI-powered creative tools?
replied to m-ric's post 18 days ago
replied to maxiw's post 19 days ago
posted an update 19 days ago
Sometimes, we forget that all these LLMs are trained on just raw text. Fundamentally, they are simply text-completion models. Imagine a model that keeps writing follow-up questions when you ask, "How to make pizza?" rather than answering you!

That's where Instruction Tuning comes in—it’s a game-changer.

Instruction tuning has revolutionized how we interact with Large Language Models (LLMs), bridging the crucial gap between raw model capabilities and practical applications.

It’s what transforms a GPT into ChatGPT!

Think of instruction tuning as teaching AI to "speak human"—it's the difference between a model that merely predicts the next words and one that truly understands and executes our intentions.

The real magic? It enables zero-shot learning, meaning models can tackle new tasks they've never encountered before, as long as the instructions are clear. This versatility is what makes modern AI assistants so powerful and user-friendly.
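
For a sense of what the tuning data looks like, here is a single training example in the common Alpaca-style layout; field names vary across datasets, and this record is purely illustrative:

```python
example = {
    "instruction": "How to make pizza?",
    "input": "",  # optional extra context
    "output": (
        "1. Make a dough from flour, water, yeast, and salt. "
        "2. Let it rise, then stretch it into a base. "
        "3. Add sauce, cheese, and toppings. "
        "4. Bake at high heat until the crust is golden."
    ),
}
# Fine-tuning on many such pairs teaches a text-completion model to
# answer the instruction instead of continuing it with more questions.
```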
replied to maxiw's post 20 days ago

Enough to make a grown man cry! 😃🤗

Anyway, next week I will be posting about the best papers (according to me), one every day, that discuss ways to reduce hallucinations (seven in total)... cheers 😬

posted an update 25 days ago
Thinking about upgrading from Python 3.10 to 3.11? Here's why you should make the move - a deep technical breakdown that might convince you:

>> Performance Revolution
The performance improvements are staggering, with benchmarks showing 10-60% faster execution across different workloads. Let me break down the game-changing features:

>> Core Architecture Changes
Python 3.11's interpreter now uses statically allocated core modules, eliminating the multi-step loading process we've dealt with in 3.10. This means your applications will start 10-15% faster out of the gate.

>> Function Optimization
The redesigned frame objects are a thing of beauty - they've been stripped of unnecessary baggage, resulting in a 3-7% speedup for all function calls. But it gets better: function calls are now inlined, giving us a 1-3% boost, with recursive functions like Fibonacci seeing up to 1.7x improvement!

>> Adaptive Intelligence
The new Specializing Interpreter is perhaps the most exciting addition. Think of it as a lightweight JIT - it identifies hot code paths and optimizes them automatically.

The interpreter now automatically specializes math operations, array indexing, and even sequence unpacking based on actual usage patterns.

>> Exception Handling Revolution
My favorite feature? Zero-cost exceptions! Your try-except blocks no longer carry overhead when no exceptions occur. The code runs at full speed until an exception actually happens.
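
You can check this yourself with a quick micro-benchmark; the exact numbers depend on your machine and Python build:

```python
import timeit

def guarded(x):
    try:
        return x + 1
    except TypeError:  # never taken on the happy path
        return None

def bare(x):
    return x + 1

# On 3.11+, the try/except costs essentially nothing when no exception is raised
print("guarded:", timeit.timeit("guarded(1)", globals=globals()))
print("bare:   ", timeit.timeit("bare(1)", globals=globals()))
```
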

Ready to make the switch? These improvements aren't just numbers - they're real-world performance gains waiting to be unlocked in your codebase.