metadata

title: README
emoji: 🏃
colorFrom: pink
colorTo: blue
sdk: static
pinned: false

Welcome - This classroom organization holds examples and links for this session. Begin by adding a bookmark.

Chat and Clinical

🥫Open Datasets for Health Care📊

Open source and Creative Commons Zero (CC0) datasets, plus links to PDFs for public clinical use:
Curated Datasets: Kaggle. NLM UMLS. LOINC. ICD10 Diagnosis. ICD11. Papers,Code,Datasets for SOTA in Medicine. Mental. Behavior. CMS Downloads. CMS CPT and HCPCS Procedures and Services

Examples and Exercises - Create These Spaces in Your Account and Test / Modify

Easy Examples

FastSpeech - https://huggingface.co/spaces/AIZero2HeroBootcamp/FastSpeech2LinerGradioApp
Memory - https://huggingface.co/spaces/AIZero2HeroBootcamp/Memory
StaticHTML5PlayCanvas - https://huggingface.co/spaces/AIZero2HeroBootcamp/StaticHTML5Playcanvas
3DHuman - https://huggingface.co/spaces/AIZero2HeroBootcamp/3DHuman
TranscriptAILearnerFromYoutube - https://huggingface.co/spaces/AIZero2HeroBootcamp/TranscriptAILearnerFromYoutube
AnimatedGifGallery - https://huggingface.co/spaces/AIZero2HeroBootcamp/AnimatedGifGallery
VideoToAnimatedGif - https://huggingface.co/spaces/AIZero2HeroBootcamp/VideoToAnimatedGif

Hard Examples:

ChatGPTandLangChain - https://huggingface.co/spaces/AIZero2HeroBootcamp/ChatGPTandLangchain
  a. Keys: https://platform.openai.com/account/api-keys
MultiPDFQAChatGPTLangchain - https://huggingface.co/spaces/AIZero2HeroBootcamp/MultiPDF-QA-ChatGPT-Langchain

👋 Two easy ways to turbo boost your AI learning journey - Let's go 100X! 💻

🌐 AI Pair Programming with GPT

Open 2 Browsers to:

🌐 ChatGPT URL or URL2 and
🌐 Huggingface URL in separate browser windows.
🤖 Use prompts to generate a Streamlit program, then test it on Hugging Face or locally.
🔧 For advanced work, add Python 3.10 and VSCode locally, and debug as gradio or streamlit apps.
🚀 Use these two superpower processes to reduce the time it takes you to make a new AI program! ⏱️

🎥 YouTube University Method:

🏋️‍♀️ Plan two hours each weekday to exercise your body and brain.
🎬 Make a playlist of videos you want to learn from on YouTube. Save the links to edit later.
🚀 Try watching the videos at a faster speed while exercising, and sample the first five minutes of each video.
📜 Reorder the playlist so the most useful videos are at the front, and take breaks to exercise.
📝 Practice note-taking in markdown to instantly save what you want to remember. Share your notes with others!
👥 AI Pair Programming Using Long Answer Language Models with Human Feedback

🎥 2023 AI/ML Learning Playlists for ChatGPT, LLMs, Recent Events in AI:

AI News: https://www.youtube.com/playlist?list=PLHgX2IExbFotMOKWOErYeyHSiikf6RTeX
ChatGPT Code Interpreter: https://www.youtube.com/playlist?list=PLHgX2IExbFou1pOQMayB7PArCalMWLfU-
Ilya Sutskever and Sam Altman: https://www.youtube.com/playlist?list=PLHgX2IExbFovr66KW6Mqa456qyY-Vmvw-
Andrew Huberman on Neuroscience and Health: https://www.youtube.com/playlist?list=PLHgX2IExbFotRU0jl_a0e0mdlYU-NWy1r
Andrej Karpathy: https://www.youtube.com/playlist?list=PLHgX2IExbFovbOFCgLNw1hRutQQKrfYNP
Medical Futurist on GPT: https://www.youtube.com/playlist?list=PLHgX2IExbFosVaCMZCZ36bYqKBYqFKHB2
ML APIs: https://www.youtube.com/playlist?list=PLHgX2IExbFovPX9z4m61rQImM7cDDY79L
FastAPI and Streamlit: https://www.youtube.com/playlist?list=PLHgX2IExbFosyX2jzJJimPAI9C0FHflwB
AI UI UX: https://www.youtube.com/playlist?list=PLHgX2IExbFosCUPzEp4bQaygzrzXPz81w
ChatGPT Streamlit 2023: https://www.youtube.com/playlist?list=PLHgX2IExbFotDzxBRWwUBTb0_XFEr4Dlg

LLM Base Model Overview and Evolutionary Tree: https://github.com/Mooler0410/LLMsPracticalGuide

🎥 2023 AI/ML Advanced Learning Playlists:

Azure Development Architectures in 2023:

ChatGPT: https://azure.github.io/awesome-azd/?tags=chatgpt
Azure OpenAI Services: https://azure.github.io/awesome-azd/?tags=openai
Python: https://azure.github.io/awesome-azd/?tags=python
AI LLM Architecture - Guidance by MS: https://github.com/microsoft/guidance

Dockerfile and Azure ACR->ACA Easy Robust Deploys from VSCode:

Set up VSCode with the Azure and Remote extensions, and install the Azure CLI locally.
Get access to Azure subscriptions. From there, in VSCode, expand the subscription to Container Apps.
In Container Apps, create a new app and pick your Dockerfile. Azure builds the image, pushes it to an Azure Container Registry (ACR), and then spins it up in Azure Container Apps (ACA).

Dockerfile for Streamlit and Dockerfile for FastAPI:

Show two examples.
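
As a hedged starting point for the two examples, the Python sketch below writes one minimal Dockerfile for a Streamlit app and one for a FastAPI app. Everything specific in it is an assumption rather than something this README defines: the `python:3.10-slim` base image, a `requirements.txt` that lists the dependencies, a Streamlit entry point named `app.py`, a FastAPI module `main.py` that exposes `app`, the ports 8501 and 8000, and the helper name `write_dockerfiles`.

```python
from pathlib import Path

# Assumed layout (hypothetical): requirements.txt next to app.py / main.py.
STREAMLIT_DOCKERFILE = """\
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
"""

FASTAPI_DOCKERFILE = """\
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
"""


def write_dockerfiles(target_dir: str) -> list[Path]:
    """Write both Dockerfiles into target_dir and return their paths."""
    root = Path(target_dir)
    root.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, body in (
        ("Dockerfile.streamlit", STREAMLIT_DOCKERFILE),
        ("Dockerfile.fastapi", FASTAPI_DOCKERFILE),
    ):
        path = root / name
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_dockerfiles(".")` drops both files into the current folder. Rename the one you need to `Dockerfile` before picking it in the Container Apps step, and adjust the entry point and port to match your app.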

Example Starter Prompts for AIPP:

Write a streamlit program that demonstrates Data synthesis. Synthesize data from multiple sources to create new datasets. Use two datasets and demonstrate pandas dataframe query merge and join with two datasets in python list dictionaries: List of Hospitals that are over 1000 bed count by city and state, and State population size and square miles. Perform a calculated function on the merged dataset.
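
To judge what the model returns for this prompt, it helps to know roughly what the core logic should look like. The sketch below is one possible version of the join and the calculated columns, written in plain Python over lists of dictionaries. All hospital names, bed counts, populations, and areas are made-up placeholders, and the function name `merge_hospitals_with_states` is hypothetical.

```python
# Placeholder data for illustration only; none of these figures are real.
hospitals = [
    {"hospital": "Hospital A", "city": "City One", "state": "AA", "beds": 1200},
    {"hospital": "Hospital B", "city": "City Two", "state": "AA", "beds": 1500},
    {"hospital": "Hospital C", "city": "City Three", "state": "BB", "beds": 1100},
    {"hospital": "Hospital D", "city": "City Four", "state": "CC", "beds": 800},
]
states = [
    {"state": "AA", "population": 10_000_000, "square_miles": 50_000},
    {"state": "BB", "population": 4_000_000, "square_miles": 80_000},
]


def merge_hospitals_with_states(hospitals, states, min_beds=1000):
    """Inner-join hospitals over min_beds to their state and add derived columns."""
    states_by_code = {row["state"]: row for row in states}
    merged = []
    for hospital in hospitals:
        state = states_by_code.get(hospital["state"])
        if hospital["beds"] <= min_beds or state is None:
            continue  # drop small hospitals and hospitals with no matching state
        row = {**hospital, **state}
        # Calculated columns on the merged row.
        row["beds_per_100k"] = round(hospital["beds"] * 100_000 / state["population"], 2)
        row["people_per_sq_mile"] = round(state["population"] / state["square_miles"], 1)
        merged.append(row)
    return merged


for row in merge_hospitals_with_states(hospitals, states):
    print(row["hospital"], row["state"], row["beds_per_100k"], row["people_per_sq_mile"])
```

A generated Streamlit program would typically load the same lists into pandas DataFrames, join them with `pandas.merge(hospitals_df, states_df, on="state")`, and show the result with `st.dataframe`.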

Comparison of Large Language Models

Model Name	Model Size (in Parameters)
BigScience-tr11-176B	176 billion
GPT-3	175 billion
OpenAI's DALL-E 2.0	about 3.5 billion
NVIDIA's Megatron	8.3 billion
Transformer-XL	250 million
XLNet	about 340 million (XLNet-Large)
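
Parameter counts translate fairly directly into memory needs. The sketch below is a rough estimate only: it assumes 2 bytes per parameter (16-bit weights) and counts the weights alone, ignoring activations, optimizer state, and any framework overhead. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Parameter counts taken from the table above.
models = {
    "BigScience-tr11-176B": 176e9,
    "GPT-3": 175e9,
    "NVIDIA's Megatron": 8.3e9,
}

for name, params in models.items():
    print(f"{name}: about {weight_memory_gb(params):.1f} GB of 16-bit weights")
```

Under these assumptions the 176 billion parameter model needs about 352 GB for its weights, which is why models of this size are split across several accelerators or loaded in lower precision.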

ChatGPT Datasets 📚

WebText
Common Crawl
BooksCorpus
English Wikipedia
Toronto Books Corpus
OpenWebText

ChatGPT Datasets - Details 📚

WebText: A dataset of web pages collected by following outbound links from Reddit posts that received at least 3 karma. This dataset was used to pretrain GPT-2.
- Language Models are Unsupervised Multitask Learners by Radford et al.
Common Crawl: A dataset of web pages from a variety of domains, which is updated regularly. This dataset was used to pretrain GPT-3.
- Language Models are Few-Shot Learners by Brown et al.
BooksCorpus: A dataset of over 11,000 books from a variety of genres.
- Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books by Zhu et al.
English Wikipedia: A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
- Improving Language Understanding by Generative Pre-Training
- Space for Wikipedia Search
Toronto Books Corpus: Another name for BooksCorpus, which was collected by researchers at the University of Toronto. The GPT paper describes it as over 7,000 unique unpublished books from a variety of genres.
- Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books by Zhu et al.
OpenWebText: An open-source recreation of WebText, built from web pages linked on Reddit and filtered to remove content that was likely to be low-quality or spammy. GPT-3 was pretrained on WebText2, OpenAI's own expanded version of WebText.
- Language Models are Few-Shot Learners by Brown et al.

Big Science Model 🚀

📜 Papers:
1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper
2. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism Paper
3. 8-bit Optimizers via Block-wise Quantization Paper
4. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation Paper
5. Other papers related to Big Science
6. 217 other models optimized for use with Bloom
📚 Datasets:


- Universal Dependencies: A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.

Universal Dependencies official website.

- WMT 2014: The ninth edition of the Workshop on Statistical Machine Translation, featuring shared tasks on translating between English and various other languages.

WMT14 website.

- The Pile: An 825 GiB English language corpus from EleutherAI that combines 22 smaller datasets of diverse text.

The Pile official website.

- HumanEval: A set of 164 hand-written Python programming problems with unit tests, used to evaluate code generation.

Evaluating Large Language Models Trained on Code by Chen et al.

- FLORES-101: An evaluation benchmark of sentences from English Wikipedia, professionally translated into 101 languages, designed for low-resource and multilingual machine translation.

The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation by Goyal et al.

- CrowS-Pairs: A dataset of sentence pairs, one more stereotyping than the other, designed for measuring social biases in masked language models.

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models by Nangia et al.

- WikiLingua: A dataset of article and summary pairs in 18 languages, sourced from WikiHow, designed for cross-lingual abstractive summarization.

WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization by Ladhak et al.

- MTEB: The Massive Text Embedding Benchmark, which evaluates text embedding models across tasks such as classification, clustering, retrieval, and semantic textual similarity.

MTEB: Massive Text Embedding Benchmark by Muennighoff et al.

- xP3: The Crosslingual Public Pool of Prompts, a collection of prompted tasks in 46 languages used for multitask fine-tuning of models such as BLOOMZ and mT0.

Crosslingual Generalization through Multitask Finetuning by Muennighoff et al.

- DiaBLa: A dataset of spontaneous written English-French dialogues, designed for evaluating machine translation.

DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation by Bawden et al.
📚 Dataset Papers with Code
1. Universal Dependencies
2. WMT 2014
3. The Pile
4. HumanEval
5. FLORES-101
6. CrowS-Pairs
7. WikiLingua
8. MTEB
9. xP3
10. DiaBLa

Deep RL ML Strategy 🧠

The AI strategies are:

Language Model Preparation using Human Augmented with Supervised Fine Tuning 🤖
Reward Model Training with Prompts Dataset Multi-Model Generate Data to Rank 🎁
Fine Tuning with Reinforcement Reward and Distance Distribution Regret Score 🎯
Proximal Policy Optimization Fine Tuning 🤝
Variations - Preference Model Pretraining 🤔
Use Ranking Datasets Sentiment - Thumbs Up/Down, Distribution 📊
Online Version Getting Feedback 💬
OpenAI - InstructGPT - Humans generate LM Training Text 🔍
DeepMind - Advantage Actor Critic Sparrow, GopherCite 🦜
Reward Model Human Preference Feedback 🏆
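
The reward model step in the list above can be sketched in a few lines. A common choice, used in OpenAI's InstructGPT work, is a pairwise ranking loss: given two responses to the same prompt, the loss is small when the reward model scores the human-preferred response higher. The function name `pairwise_ranking_loss` and the plain float inputs are illustrative assumptions. A real implementation would compute the two rewards with a neural network and average the loss over a batch.

```python
import math


def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_preferred - reward_rejected)).

    Uses the identity -log(sigmoid(m)) = log(1 + exp(-m)) and picks the
    branch that avoids overflow in exp() for large margins.
    """
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the preferred response is scored further above the rejected one.
for preferred, rejected in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
    print(preferred, rejected, round(pairwise_ranking_loss(preferred, rejected), 4))
```

Thumbs up/down feedback fits the same loss once it is turned into pairs: a thumbs-up response to a prompt is treated as preferred over a thumbs-down response to that prompt.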

For more information on specific techniques and implementations, check out the following resources:

OpenAI's paper on GPT-3 which details their Language Model Preparation approach
DeepMind's paper on A3C (Asynchronous Methods for Deep Reinforcement Learning), which describes the Advantage Actor Critic algorithm
OpenAI's paper on Reward Learning which explains their approach to training Reward Models
OpenAI's blog post on GPT-3's fine-tuning process
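
For the Proximal Policy Optimization step listed under the strategies above, the core idea is the clipped surrogate objective: the policy is rewarded for moving toward actions with positive advantage, but the gain is capped once the new policy drifts too far from the old one. The sketch below computes that objective for a single sample. The function name and the default `clip_epsilon=0.2` are illustrative assumptions. A full RLHF setup would also average over a batch and usually add a penalty for diverging from the supervised fine-tuned model.

```python
def ppo_clipped_objective(ratio: float, advantage: float, clip_epsilon: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is new_policy_prob / old_policy_prob for the sampled action,
    advantage is the estimated advantage of that action.
    """
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
# Negative advantage: the objective is floored once the ratio drops below 1 - eps.
print(ppo_clipped_objective(ratio=0.5, advantage=-1.0))
```

Training maximizes this value, so the clipping removes any incentive to push the probability ratio outside the range from 1 - eps to 1 + eps in a single update.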