nbroad HF staff committed on
Commit
3fb26c5
1 Parent(s): 6a99e97
Dockerfile ADDED
@@ -0,0 +1,16 @@
+ # Read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
+ # you will also find guides on how best to write your Dockerfile
+
+ FROM python:3.11
+
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV PATH="/home/user/.local/bin:$PATH"
+
+ WORKDIR /app
+
+ COPY --chown=user ./requirements.txt requirements.txt
+ RUN pip install --no-cache-dir --upgrade -r requirements.txt
+
+ COPY --chown=user . /app
+ CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,12 +1,11 @@
  ---
- title: About Me
- emoji: 📊
- colorFrom: red
- colorTo: gray
- sdk: gradio
- sdk_version: 4.24.0
- app_file: app.py
- pinned: false
+ title: Me
+ emoji: 🤠
+ colorFrom: green
+ colorTo: yellow
+ sdk: docker
+ pinned: true
+ license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
about.py ADDED
@@ -0,0 +1,126 @@
+ from app import *
+ from datetime import datetime
+
+
+ def page():
+     # me = Image('https://live.staticflickr.com/65535/53939742438_6ca1a4b3eb.jpg', alt='Nicholas Broad and his dog, Maya in Maine', caption="it'sa me", left=False, width=300)
+     h2s = 'Hi there 👋', 'Work Experience', "Education", "Volunteering",'Hobbies'
+     txts = [Markdown(intro), Markdown(work), Markdown(education), Markdown(volunteering), Markdown(hobbies)]
+     secs = Sections(h2s, txts)
+     return BstPage(0, "", *secs)
+
+ birthdate = datetime(1994, 3, 18)
+ age_in_years = (datetime.now() - birthdate).days // 365
+
+ maya_birthdate = datetime(2019, 3, 18)
+ maya_age_in_years = (datetime.now() - maya_birthdate).days // 365
+
+ intro = f"""
+ <div class="row">
+ <div class="col-md-7">
+
+ I'm Nicholas Broad. I'm currently {age_in_years} years old, living in San Francisco, California. I'm an open-minded, curious, and self-motivated person who enjoys learning new things and working on interesting projects. I like to think of myself as friendly, humble, and a bit goofy. I live with my beautiful dog, <a href="https://www.flickr.com/photos/131470140@N06/albums/72177720295849325/">Maya</a>, who is {maya_age_in_years} years old. You can often find us at parks in SF playing fetch. Maya will almost always be wearing a bandana, as all dogs should.
+ </div>
+ <div class="col-md-4">
+ <a data-flickr-embed="true" href="https://www.flickr.com/photos/131470140@N06/53939742438/in/album-72177720300307682/" title="Nico and Maya"><img src="https://live.staticflickr.com/65535/53939742438_6ca1a4b3eb.jpg" width="333" height="500" alt="Nico and Maya" class="img-fluid rounded"/></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
+ </div>
+ </div>
+ """
+
+ hf_start_date = datetime(2021, 12, 6)
+ time_at_hf = (datetime.now() - hf_start_date)
+ years_at_hf = time_at_hf.days // 365
+ months_at_hf = (time_at_hf.days - years_at_hf * 365) // 30
+
+ work = f"""
+ I am currently working at [Hugging Face](https://hf.co) as a Machine Learning Engineer. I have been there {years_at_hf} years and {months_at_hf} months as a member of the monetization team. I have helped companies such as Johnson & Johnson, Grammarly, Optum, Writer, Uber, and Liberty Mutual improve their machine learning capabilities.
+
+ Prior to working at Hugging Face, I was a data scientist at GSK, where I created a search engine to make it easier to understand Medical Voice of Customer data (what the clinicians who use the products think of them).
+
+
+ I believe in being transparent about my work journey, so I will share more than most people typically do. Here is all of my work experience since I was 16.
+
+
+ * Machine Learning Engineer @ [Hugging Face](https://hf.co) (2021 - Present)
+     * Member of the success team within the monetization team.
+     * Helping enterprises and startups improve their machine learning capabilities.
+     * This involves teaching how to use Hugging Face libraries such as transformers, accelerate, datasets, peft, trl, and diffusers.
+     * Typical use cases:
+         * Fine-tuning a model on a custom dataset.
+         * Serving fine-tuned models (both LLM and non-LLMs).
+         * Pre-training from scratch.
+         * RAG using LLMs and embeddings.
+         * Document understanding using vision-language models.
+         * Curating a high-quality instruction dataset.
+     * Manage relations with ~$2M worth of customers.
+     * Most senior member of post-sales MLE team.
+     * Hired 5 other team members.
+     * Author on [BLOOM paper](https://huggingface.co/papers/2211.05100) for minor help with datasets.
+ * Data Scientist @ GSK (2020 - 2021)
+     * Created a search engine to make it easier to understand what clinicians think of the products.
+ * Data Science Fellowship @ SharpestMinds (2019 - 2020)
+     * Fellowship is a generous term. It was a 3-month bootcamp where I taught myself all about data science and machine learning while being mentored by an industry professional.
+ * High School / Middle School Tutor @ AJ Tutoring (2018 - 2019)
+     * Shared my love for physics, math, and chemistry with students in the Bay Area.
+     * I even worked with the daughter of Google's CEO.
+     * One of my students was in [the scandal where parents paid to get their kids into college](https://en.wikipedia.org/wiki/Varsity_Blues_scandal).
+ * Dog Walker (2017-2019)
+     * I ran with my neighbor's dog 3 times a week for 3 years.
+     * We did over 2000 miles together.
+ * Image Sensor Intern @ ON Semiconductor (2017)
+     * Quit after 5 weeks because it was a waste of my time. Unrealistic expectations, no guidance, and not much learning.
+ * Research Assistant Internship @ Stanford University (2016)
+     * Worked on thermionic energy harvesters.
+     * Spent too much time in the cleanroom alone at odd hours around dangerous chemicals and tools.
+ * Financial Manager @ [Muwekma Tah-Ruk House](https://nacc.stanford.edu/our-community/muwekma-tah-ruk-native-theme-house) (2015-2016)
+     * Budgeted $100k for 35 students in the house for food and events.
+ * Research Internship @ Lawrence Livermore Laboratory (2015)
+     * Fabrication of implantable neural devices.
+ * Research Internship @ Stanford University (2014)
+     * Fabricating sensors that work in extreme environments.
+     * Worked with graphene, high bandgap semiconductors, and cool equipment like scanning electron microscopes and x-ray diffraction.
+     * Co-author on 2 papers:
+         * [Irradiation effects of graphene-enhanced gallium nitride (GaN) metal-semiconductor-metal (MSM) ultraviolet photodetectors](https://www.spiedigitallibrary.org/conference-proceedings-of-spie/9491/949107/Irradiation-effects-of-graphene-enhanced-gallium-nitride-GaN-metal-semiconductor/10.1117/12.2178091.short)
+         * [Impact of gamma irradiation on GaN/sapphire surface acoustic wave resonators](https://ieeexplore.ieee.org/abstract/document/6932293)
+ * Admin Assistant @ Stanford Graduate School of Education (2014)
+     * Did various small tasks to help the school run smoothly.
+ * Research Internship @ Stanford University (2013)
+     * Worked on helping underrepresented students start learning Computer Science.
+ * Soccer Coach @ Los Gatos Soccer (2010-2012)
+     * Coached soccer basics for kids during the summer.
+ """
+
+ volunteering = """All of my volunteering is tutoring.
+
+ * [Lucile Packard Children's Hospital](https://med.stanford.edu/phm/clinical/lpch.html) (2013 - 2014)
+     * Tutored kids who had been in the hospital for a long time.
+ * [East Palo Alto Tennis and Tutoring](https://www.epatt.org/) (2019)
+     * Tutored high school students in math and science.
+ * [Student U](https://studentudurham.org/) (2019-2020)
+     * Tutored high school students in math and science.
+ """
+
+ education = """
+ * Coterminal M.S. in Electrical Engineering @ Stanford University (2016-2018)
+     * Did not graduate.
+     * Decided that I wasn't interested in the classes I was taking (physics, analog devices, signal processing, networks).
+     * I was also sick of doing research in a cleanroom.
+ * B.S. in Electrical Engineering @ Stanford University (2012-2016)
+     * Focus on semiconductor device physics.
+ * Los Gatos High School (2008-2012)
+     * Voted "Best All Around" by senior class.
+     * Played Varsity Soccer for 3 years.
+     * [Captain of the Varsity Soccer Team (2011-2012).](https://www.maxpreps.com/ca/los-gatos/los-gatos-wildcats/athletes/nicky-broad/soccer/stats/?careerid=11o9q24ridm22)
+     * [Co-MVP of the league (2012).](https://www.removepaywall.com/search?url=https://www.mercurynews.com/2012/03/19/broad-mason-are-league-mvps-in-soccer/)
+     * [Honorable mention in CCS (2012).](https://www.removepaywall.com/search?url=https://www.mercurynews.com/2012/03/27/all-mercury-news-boys-soccer-honors/)
+ """
+
+ hobbies = """My main hobbies are going on walks with my dog, playing soccer, doing [Kaggle competitions](https://www.kaggle.com/nbroad), reading, [photography](https://flickr.com/photos/131470140@N06/), and taiko. This last one, taiko, is probably the most interesting, though I am on a bit of a hiatus. I played taiko for 12 years, starting with [San Jose Junior Taiko](https://www.taikoconservatory.org/juniortaiko) and continuing with [Stanford Taiko](https://taiko.stanford.edu/). Here is a video of me performing in Stanford's Bing Concert Hall.
+ """
+
+ four_sizzlin_beets = """<div class="embed-responsive embed-responsive-16by9"><iframe width="560" height="315" src="https://www.youtube.com/embed/0Q_4sffQZBA?si=4ArGNw1wy5ps2ra0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe></div>"""
+
+ stanford_instagram_embed = """<blockquote class="instagram-media" data-instgrm-captioned data-instgrm-permalink="https://www.instagram.com/p/njAzUvj7Cs/?utm_source=ig_embed&amp;utm_campaign=loading" data-instgrm-version="14" style=" background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);"><div style="padding:16px;"> <a href="https://www.instagram.com/p/njAzUvj7Cs/?utm_source=ig_embed&amp;utm_campaign=loading" style=" background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;" target="_blank"> <div style=" display: flex; flex-direction: row; align-items: center;"> <div style="background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;"></div> <div style="display: flex; flex-direction: column; flex-grow: 1; justify-content: center;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;"></div></div></div><div style="padding: 19% 0;"></div> <div style="display:block; height:50px; margin:0 auto 12px; width:50px;"><svg width="50px" height="50px" viewBox="0 0 60 60" version="1.1" xmlns="https://www.w3.org/2000/svg" xmlns:xlink="https://www.w3.org/1999/xlink"><g stroke="none" stroke-width="1" fill="none" fill-rule="evenodd"><g transform="translate(-511.000000, -20.000000)" fill="#000000"><g><path d="M556.869,30.41 C554.814,30.41 553.148,32.076 553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 
C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 
525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631"></path></g></g></g></svg></div><div style="padding-top: 8px;"> <div style=" color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;">View this post on Instagram</div></div><div style="padding: 12.5% 0;"></div> <div style="display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;"><div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);"></div> <div style="background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;"></div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) translateY(-18px);"></div></div><div style="margin-left: 8px;"> <div style=" background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;"></div> <div style=" width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)"></div></div><div style="margin-left: auto;"> <div style=" width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);"></div> <div style=" background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);"></div> <div style=" width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) 
translateX(8px);"></div></div></div> <div style="display: flex; flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;"></div></div></a><p style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;"><a href="https://www.instagram.com/p/njAzUvj7Cs/?utm_source=ig_embed&amp;utm_campaign=loading" style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;" target="_blank">A post shared by Stanford University (@stanford)</a></p></div></blockquote> <script async src="//www.instagram.com/embed.js"></script>"""
+
+
+ hobbies += four_sizzlin_beets + "<br><br>Stanford Instagram also featured me on their page.<br><br>" + stanford_instagram_embed
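A note on the date arithmetic in about.py: dividing a day count by 365 ignores leap days, so the computed age can tick over a few days before the actual anniversary. The following is a minimal sketch of a calendar-based alternative. The helper names `full_years` and `years_and_months` are hypothetical and are not part of this commit.

```python
from datetime import date


def full_years(start: date, today: date) -> int:
    """Whole calendar years elapsed between start and today."""
    years = today.year - start.year
    # Subtract one if this year's anniversary has not happened yet.
    if (today.month, today.day) < (start.month, start.day):
        years -= 1
    return years


def years_and_months(start: date, today: date) -> tuple[int, int]:
    """Whole years and remaining whole months between start and today."""
    months = (today.year - start.year) * 12 + (today.month - start.month)
    # The current month only counts once the start day-of-month is reached.
    if today.day < start.day:
        months -= 1
    return divmod(months, 12)
```

For example, `full_years(date(1994, 3, 18), date(2024, 3, 17))` gives 29, whereas the day-count division would already report 30 on that date.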
app.py ADDED
@@ -0,0 +1,42 @@
+ from fh_bootstrap import *
+ from itertools import chain
+ from markdown import markdown
+
+ md_exts='codehilite', 'smarty', 'extra', 'sane_lists'
+ def Markdown(s, exts=md_exts, **kw): return Div(NotStr(markdown(s, extensions=exts)), **kw)
+
+ ghurl = 'https://github.com/nbroad1881'
+ hf_logo_svg = "https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.svg"
+ li_url = "https://www.linkedin.com/in/nicholas-m-broad/"
+ kaggle_url = "https://www.kaggle.com/nbroad"
+ fh_url = "https://fastht.ml/"
+ fh_logo = 'assets/fasthtml_logo.svg'
+
+ def BstPage(selidx, title, *c):
+     navitems = [('Home', '/'), ('Blog', '/blog')]
+
+     ra_items = (
+         A(Image(src="/assets/hf-logo.svg", width=28, height=28, cls="my-0 px-0 mx-0 py-0", left=False, pad=0), cls="ms-2 my-0 px-1 btn-lg btn", role="button", href="https://hf.co/nbroad"),
+         Icon('fab fa-github', dark=False, sz='lg', href=ghurl, cls='ms-2 px-2'),
+         Icon("fab fa-linkedin", dark=False, sz='lg', href=li_url, cls='ms-2 px-2'),
+         Icon("fab fa-kaggle", dark=False, sz='lg', href=kaggle_url, cls='ms-2 px-2'),
+         Icon("fab fa-youtube", dark=False, sz='lg', href="https://www.youtube.com/@nicholasbroad1881", cls='ms-2 px-2'),
+         Icon("fab fa-twitter", dark=False, sz='lg', href="https://twitter.com/nbroad1881", cls='ms-2 px-2'),
+     )
+     ftlinks = [A(k, href=v, cls='nav-link px-2 text-muted')
+                for k,v in dict(Home='/').items()]
+     return (
+         Title(title),
+         Script('initTOC()'),
+         Container(
+             Navbar('nav', selidx, items=navitems, ra_items=ra_items, cls='navbar-light bg-secondary rounded-lg',
+                    image=f'', hdr_href="fhurl", placement=PlacementT.Default, expand=SizeT.Md, toggle_left=False),
+             Toc(Container(H1(title, cls='pb-2 pt-1'), *c, cls='mt-3')),
+             BstFooter('Made using FastHTML', File(fh_logo), img_href=fh_url, cs=ftlinks),
+             typ=ContainerT.Xl, cls='mt-3', data_bs_spy='scroll', data_bs_target='#toc'))
+
+ def Sections(h2s, texts):
+     colors = 'yellow', 'pink', 'teal', 'blue'
+     div_cls = 'py-2 px-3 mt-4 bg-light-{} rounded-lg'
+     return chain([Div(H2(h2, id=f'sec{i+1}', cls=div_cls.format(colors[i%4])), Div(txt, cls='px-2'))
+                   for i,(h2,txt) in enumerate(zip(h2s, texts))])
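`Sections` gives each heading a sequential id (`sec1`, `sec2`, ...) and cycles through four background colours. The sketch below reproduces only that id and class computation in plain Python, without the FastHTML components, to show how a fifth heading wraps back to the first colour. `section_attrs` is a hypothetical helper written for illustration; it is not part of app.py.

```python
def section_attrs(h2s, colors=("yellow", "pink", "teal", "blue")):
    """Return the (id, css_class) pair each heading would receive."""
    div_cls = "py-2 px-3 mt-4 bg-light-{} rounded-lg"
    return [
        (f"sec{i + 1}", div_cls.format(colors[i % len(colors)]))
        for i, _ in enumerate(h2s)
    ]


# The five headings used by about.py.
headings = ["Hi there", "Work Experience", "Education", "Volunteering", "Hobbies"]
for sec_id, css in section_attrs(headings):
    print(sec_id, css)
```

With five headings and four colours, the last section reuses `bg-light-yellow`.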
assets/fasthtml_logo.svg ADDED
assets/hf-logo.svg ADDED
assets/hl-styles.css ADDED
@@ -0,0 +1,74 @@
+ pre { line-height: 125%; }
+ td.linenos .normal { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
+ span.linenos { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
+ td.linenos .special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
+ span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
+ .codehilite .hll { background-color: #ffffcc }
+ .codehilite { background: #f8f8f8; }
+ .codehilite .c { color: #3D7B7B; font-style: italic } /* Comment */
+ .codehilite .err { border: 1px solid #FF0000 } /* Error */
+ .codehilite .k { color: #008000; font-weight: bold } /* Keyword */
+ .codehilite .o { color: #666666 } /* Operator */
+ .codehilite .ch { color: #3D7B7B; font-style: italic } /* Comment.Hashbang */
+ .codehilite .cm { color: #3D7B7B; font-style: italic } /* Comment.Multiline */
+ .codehilite .cp { color: #9C6500 } /* Comment.Preproc */
+ .codehilite .cpf { color: #3D7B7B; font-style: italic } /* Comment.PreprocFile */
+ .codehilite .c1 { color: #3D7B7B; font-style: italic } /* Comment.Single */
+ .codehilite .cs { color: #3D7B7B; font-style: italic } /* Comment.Special */
+ .codehilite .gd { color: #A00000 } /* Generic.Deleted */
+ .codehilite .ge { font-style: italic } /* Generic.Emph */
+ .codehilite .gr { color: #E40000 } /* Generic.Error */
+ .codehilite .gh { color: #000080; font-weight: bold } /* Generic.Heading */
+ .codehilite .gi { color: #008400 } /* Generic.Inserted */
+ .codehilite .go { color: #717171 } /* Generic.Output */
+ .codehilite .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
+ .codehilite .gs { font-weight: bold } /* Generic.Strong */
+ .codehilite .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
+ .codehilite .gt { color: #0044DD } /* Generic.Traceback */
+ .codehilite .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
+ .codehilite .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
+ .codehilite .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
+ .codehilite .kp { color: #008000 } /* Keyword.Pseudo */
+ .codehilite .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
+ .codehilite .kt { color: #B00040 } /* Keyword.Type */
+ .codehilite .m { color: #666666 } /* Literal.Number */
+ .codehilite .s { color: #BA2121 } /* Literal.String */
+ .codehilite .na { color: #687822 } /* Name.Attribute */
+ .codehilite .nb { color: #008000 } /* Name.Builtin */
+ .codehilite .nc { color: #0000FF; font-weight: bold } /* Name.Class */
+ .codehilite .no { color: #880000 } /* Name.Constant */
+ .codehilite .nd { color: #AA22FF } /* Name.Decorator */
+ .codehilite .ni { color: #717171; font-weight: bold } /* Name.Entity */
+ .codehilite .ne { color: #CB3F38; font-weight: bold } /* Name.Exception */
+ .codehilite .nf { color: #0000FF } /* Name.Function */
+ .codehilite .nl { color: #767600 } /* Name.Label */
+ .codehilite .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
+ .codehilite .nt { color: #008000; font-weight: bold } /* Name.Tag */
+ .codehilite .nv { color: #19177C } /* Name.Variable */
+ .codehilite .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
+ .codehilite .w { color: #bbbbbb } /* Text.Whitespace */
+ .codehilite .mb { color: #666666 } /* Literal.Number.Bin */
+ .codehilite .mf { color: #666666 } /* Literal.Number.Float */
+ .codehilite .mh { color: #666666 } /* Literal.Number.Hex */
+ .codehilite .mi { color: #666666 } /* Literal.Number.Integer */
+ .codehilite .mo { color: #666666 } /* Literal.Number.Oct */
+ .codehilite .sa { color: #BA2121 } /* Literal.String.Affix */
+ .codehilite .sb { color: #BA2121 } /* Literal.String.Backtick */
+ .codehilite .sc { color: #BA2121 } /* Literal.String.Char */
+ .codehilite .dl { color: #BA2121 } /* Literal.String.Delimiter */
+ .codehilite .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
+ .codehilite .s2 { color: #BA2121 } /* Literal.String.Double */
+ .codehilite .se { color: #AA5D1F; font-weight: bold } /* Literal.String.Escape */
+ .codehilite .sh { color: #BA2121 } /* Literal.String.Heredoc */
+ .codehilite .si { color: #A45A77; font-weight: bold } /* Literal.String.Interpol */
+ .codehilite .sx { color: #008000 } /* Literal.String.Other */
+ .codehilite .sr { color: #A45A77 } /* Literal.String.Regex */
+ .codehilite .s1 { color: #BA2121 } /* Literal.String.Single */
+ .codehilite .ss { color: #19177C } /* Literal.String.Symbol */
+ .codehilite .bp { color: #008000 } /* Name.Builtin.Pseudo */
+ .codehilite .fm { color: #0000FF } /* Name.Function.Magic */
+ .codehilite .vc { color: #19177C } /* Name.Variable.Class */
+ .codehilite .vg { color: #19177C } /* Name.Variable.Global */
+ .codehilite .vi { color: #19177C } /* Name.Variable.Instance */
+ .codehilite .vm { color: #19177C } /* Name.Variable.Magic */
+ .codehilite .il { color: #666666 } /* Literal.Number.Integer.Long */
assets/styles.css ADDED
@@ -0,0 +1,41 @@
+ pre {
+     background-color: #f8f9fa;
+     border: 1px solid #e9ecef;
+     border-radius: 4px;
+     padding: 1rem;
+     margin-bottom: 1rem;
+ }
+
+ :root, [data-bs-theme=light] {
+     --bs-secondary: #169873;
+     --bs-secondary-rgb: 22, 152, 115;
+     --bs-primary: #c6f97f;
+     --bs-light-yellow: #faaf5c;
+     --bs-light-pink: #f49fbc;
+     --bs-light-red: #ff6060;
+     --bs-light-teal: #B8E1FF;
+     --bs-light-blue: #A9FFF7;
+     --bs-light-green: #94FBAB;
+ }
+
+ nav.navbar { --bs-btn-hover-bg: rgba(255,255,255,0.2); }
+ .nav-link:hover { color: rgba(255,255,0,0.6); }
+ .nav-link.active { font-weight: bold; }
+
+ @media (min-width: 992px) {
+     .rounded-tl-lg { border-top-left-radius: 1.5rem !important; }
+ }
+
+ .rounded-lg {
+     border-top-left-radius: 1rem !important;
+     border-top-right-radius: 1rem !important;
+ }
+
+ blockquote {
+     border-left: 5px solid #007bff;
+     padding-left: 15px;
+     margin-left: 0;
+     background-color: #f8f9fa;
+     padding: 15px;
+     padding-bottom: 5px;
+ }
blog.py ADDED
@@ -0,0 +1,64 @@
+ from app import *
+ import yaml
+ import functools
+
+ from pathlib import Path
+ from datetime import datetime, date
+
+ file_path = Path(__file__).parent
+
+ NUM_RECENT_BLOGS = 20
+
+ def full_page():
+
+     secs = Sections(["Recent Blogs"], [Div(*[blog_preview(blog_id) for blog_id in sorted_blogs[:NUM_RECENT_BLOGS]])])
+     return BstPage(0, '', *secs)
+
+
+ def blog_preview(blog_id):
+     details = all_blogs[blog_id]
+
+     return Div(
+         A(H3(details[0]["title"]), href=f"/blog/{blog_id}"),
+         P(details[0].get("date_published", "")),
+         P(details[0].get("desc", "")+"...")
+     )
+
+
+
+ @functools.lru_cache()
+ def get_blogs():
+     blogs = (file_path / "blogs").rglob("*.md")
+
+     blog_dict = {}
+
+     for blog in blogs:
+         with open(blog, 'r', encoding='utf-8') as f:
+             id_ = blog.stem
+             text = f.read()
+             if text.count("---") < 2:  # need both front-matter delimiters to unpack below
+                 continue
+             metadata, markdown = text.split("---", 2)[1:]
+             metadata = yaml.safe_load(metadata)
+             metadata["id"] = id_
+             blog_dict[id_] = (metadata, markdown)
+
+     blog_dict = {k:v for k, v in blog_dict.items() if v[0].get("date_published", "") != ""}
+
+     sorted_blogs = [x[0]["id"] for x in sorted(blog_dict.values(), key=lambda x: x[0].get("date_published"), reverse=True)]
+
+     return blog_dict, sorted_blogs
+
+
+ all_blogs, sorted_blogs = get_blogs()
+
+
+
+ def single_blog(blog_id):
+     """
+     Return a single blog post.
+
+     """
+     details = all_blogs[blog_id]
+
+     return BstPage(0, details[0]["title"], Markdown(details[1]))
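`get_blogs` relies on each post starting with a front-matter block delimited by `---`. The sketch below isolates that split step so it can be checked on its own. `split_front_matter` is a hypothetical helper, not code from this commit, and it stops before the YAML parsing that blog.py delegates to `yaml.safe_load`.

```python
from typing import Optional, Tuple


def split_front_matter(text: str) -> Optional[Tuple[str, str]]:
    """Split a post into (front_matter, body).

    Returns None when the text lacks two '---' delimiters, so a caller
    can skip the file instead of failing while unpacking.
    """
    parts = text.split("---", 2)
    if len(parts) < 3:
        return None
    _, meta, body = parts
    return meta.strip(), body.lstrip("\n")


post = "---\ntitle: Sentimentr\ndate_published: 2020-01-12\n---\n\nBody text."
print(split_front_matter(post))
print(split_front_matter("no front matter here"))
```

Because the split is capped at two delimiters, a later `---` in the body (for example a horizontal rule) stays in the body text.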
blogs/2020/sentimentr.md ADDED
@@ -0,0 +1,58 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: Sentimentr
3
+ desc: A tool to visualize bias in news headlines about presidential candidates
4
+ published: true
5
+ date_published: 2020-01-12
6
+ tags: nlp
7
+ ---
8
+
9
+ <!-- {% include figure image_path="/assets/images/i-voted.jpg" alt="I voted"%} -->
10
+
11
+ With presidential primaries just around the corner, I thought it would be interesting to see if I could tell if there is a consistent bias toward one candidate or another. Could I quantitatively show that Fox News has more favorable headlines about Trump and CNN showing the opposite?
12
+
13
+ <!-- This year also broke the previous record with 20 candidates vying for the nomination. I thought it would be interesting to see if the general success or failure of a candidate could be seen in news headlines. To make the task easier, I limited the candidate pool to the top 5 candidates: Biden, Sanders, Warren, Harris, and Buttigieg. At the time I started Harris had not dropped out, and since there was a reasonable amount of data about Harris, I decided to keep her in the final results.
14
+ -->
15
+ The ideal news source is unbiased and not focusing all of their attention on one candidate; however we live in a time where "fake news" has entered everyone's daily vernacular. Unfortunately, there is scorn going both ways between liberals and conservatives with both claiming that their side knows the truth and lambasting the other side for being deceived and following villainous leaders.
16
+
17
+ I gathered thousands of headlines from CNN, Fox News, and The New York Times that contain the keywords Trump, Biden, Sanders, Warren, Harris, or Buttigieg. I had to exclude many headlines that contained the names of multiple candidates because it would require making multiple models that are each tailored to one single candidate.
18
+
19
+ Here are a few instances that have contain different candidates in the same headline that would make it difficult to measure a single sentiment for each candidate.
20
+
21
+ * *Here's why Trump keeps pumping up Bernie Sanders*
22
+ * *Buttigieg on Trump: 'Senate is the jury today, but we are the jury tomorrow'*
23
+ * *Elizabeth Warren sought to 'raise a concern' with Bernie Sanders in post-debate exchange, Sanders campaign says*
24
+
25
+ For this reason I decided to drop all headlines with the names of multiple candidates for this analysis. Thankfully, I still ended up with over 5,000 articles. Take a look at the distribution of articles for each candidate and for each news source.
26
+
27
+ <!-- {% include figure image_path="/assets/images/graphs/article-count-raw-and-opinion.png" alt="Linear Graph of Article Counts" caption="Bar plots of how many articles had a given candidate's name in it. Top is a raw count of total articles. Bottom separates it by news group."%} -->
28
+
29
+ <!--{% include figure image_path="/assets/images/graphs/article-count-log.png" alt="Logarithmic Graph of Article Counts" caption="Logarithmic barplots of how many articles had a given candidate's name in it. Blue represents CNN, orange is Fox News, and green is The New York Times."%}
30
+ -->
31
+ Trump is by far the most talked-about candidate and for good reason: he is the sitting president and the sole republican candidate. After Trump in the ranking goes Biden, then Sanders and Warren are about the same then finally Harris and Buttigieg.
32
+
33
+ I was surprised at the sheer volume of CNN articles and also The New York Times' tiny quantity.
34
+
35
+
36
+ ## Sentiment Analysis Models
37
+
38
+ I used 3 different sentiment analysis models, two of which were pre-made packages. VADER and TextBlob are Python packages that offer sentiment analysis trained on different subjects. VADER is a lexicon approach based on tweets that contain emoticons and slang. TextBlob is a Naive Bayes approach trained on IMDB review data. My model is an LSTM with a base language model based on the [AWD-LSTM](https://arxiv.org/abs/1708.02182). I then trained its language model on news articles. Following that, I trained it on hand-labeled (by me 😤) article headlines.
39
+
40
+ Here are the average scores for each candidate.
41
+
42
+ <!-- {% include figure image_path="/assets/images/graphs/news-barplot-average-scores.png" alt="Sentiment scores over time" caption="Bar plots of average sentiment scores separated by model and candidate."%} -->
43
+
44
+ And here are the average scores over time.
45
+
46
+ <!-- {% include figure image_path="/assets/images/graphs/average-4wk-scores-over-time-top-6.png" alt="Sentiment scores over time" caption="Line plots of sentiment scores separated by model and candidate."%} -->
47
+
48
+ I should also note that these scores have been smoothed by a sliding average with a window size of 4 weeks. Without smoothing, the plots look much more chaotic. Smoothing hopefully shows long-term trends, but even with smoothing it is a bit hard to tell if there are any consistent trends. The next thing I tried was to superimpose debate dates onto the Democratic candidates' plots to see if a candidate's performance could be seen after a debate. In some cases, there does seem to be a rise or drop in scores after a debate, but whether they are correlated remains unknown.
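For reference, here is a minimal sketch of that kind of smoothing. It assumes the scores have already been averaged per week and that the window looks backwards (the current week plus the three before it); the function name `sliding_average` and the numbers are made up for illustration, and the exact alignment used for the plots may differ.

```python
def sliding_average(values, window=4):
    """Trailing mean over the last `window` points.

    The first few points use however many values are available so far,
    so the output has the same length as the input.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical weekly average sentiment scores for one candidate.
weekly_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed_scores = sliding_average(weekly_scores, window=4)
print(smoothed_scores)
```

A wider window gives a smoother line but also delays and flattens any real jump, which is part of why a debate effect could be hard to spot.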
49
+
50
+
51
+ <!-- {% include figure image_path="/assets/images/graphs/average-4wk-scores-over-time-top-6-with-debates.png" alt="Sentiment scores over time" caption="Line plots of sentiment scores separated by model and candidate with debates superimposed over."%} -->
52
+
53
+
54
+ <!-- Out of 477,
55
+ vader got 261,
56
+ textblob got 151,
57
+ lstm got 233
58
+ -->
blogs/2021/harry_potter_rag.md ADDED
@@ -0,0 +1,10 @@
1
+ ---
2
+ title: Facebook's RAG
3
+ desc: How well does it do on Harry Potter trivia?
4
+ published: true
5
+ date_published: 2021-09-22
6
+ tags: nlp
7
+ ---
8
+
9
+
10
+ I posted this on Medium to see what it would be like to publish there. [Link here](https://medium.com/@nicholas_59841/testing-facebooks-rag-model-on-harry-potter-trivia-e95c316a6487)
blogs/2021/kaggle_chaii.md ADDED
@@ -0,0 +1,75 @@
1
+ ---
2
+ title: Reflections on Kaggle competition [chaii - Hindi and Tamil Question Answering]
3
+ desc: Reflecting on what worked and what didn't
4
+ published: true
5
+ date_published: 2021-01-05
6
+ tags: kaggle nlp
7
+ ---
8
+
9
+ On the evening of November 15, 2021, I eagerly sat at my computer, counting down the seconds until 7pm. As soon as the clock struck 7, I hit refresh and my heart skipped a beat as I saw my name at the 17th spot on the leaderboard. I got a silver medal in a competition with over 900 teams! I was over the moon with excitement, and I achieved the rank of Kaggle Expert in the competitions category.
10
+
11
+ I'm sure this is a common experience for Kagglers when they do well in a competition for the first time. This also might be a description of the moment when Kagglers officially become addicted. And as far as addictions go, this one is at least educational, but I'm sure many spouses and partners have complained about how much time Kaggle consumes.
12
+
13
+ It's been a couple of months since the competition ended, and I've come down from my high enough to reflect on what I learned from the competition. This post is quite detailed so feel free to skip around to what interests you.
14
+
15
+ ## Competition Details 🏆
16
+
17
+ Google Research India hosted the [chaii - Hindi and Tamil Question Answering](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering) competition to create question answering models for two Indian languages used by millions of people. Hindi is a relatively high resource language, but Tamil is more in the mid-tier category. When there aren't as many high-quality public datasets or models, it can be hard for the community and industry to make useful AI applications.
18
+
19
+ ### Question Answering
20
+
21
+ Given a passage of text and a question about the passage, a question answering (QA) model should find the span of text that best answers that question. This is an extractive process, so the answer must be in one contiguous section of text. There is a bit of ambiguity between QA and machine reading comprehension (MRC), so click [here to read an in-depth explanation about the differences](https://alessandro-lombardini.medium.com/question-and-answering-vs-machine-reading-comprehension-qa-vs-mrc-acf599536fe1).
22
+
23
+ ## The power of SQuAD 💪
24
+
25
+ SQuAD is a popular dataset for training English QA models, but interestingly enough, training on English SQuAD helps models do better in languages that are nothing like English, such as Hindi and Tamil. Training a model on SQuAD before training on the chaii dataset resulted in significant improvements.
26
+
27
+ ## Model choices
28
+
29
+ Encoder-only models like RoBERTa were heavily used, and while there were a few attempts to use encoder-decoder models like BART or T5, it seemed like people ultimately decided against them. XLM-RoBERTa (XLM-R), MuRIL, and RemBERT showed up in all of the top teams' solutions and for good reason - they are high-performing multilingual models that can be trained on English QA to get better scores in Hindi and Tamil.
30
+
31
+ ### The power of multilingual models 🔥
32
+
33
+ Early on there was a discussion about whether it would be better to make two monolingual models, one for Hindi and one for Tamil, or just a single multilingual model. Quick experiments showed that the multilingual models were superior, and nearly all of the top scores relied on only multilingual models*. It was somewhat mind-blowing for me to see first-hand how multilingual models were actually learning the similarities between related languages. This was a huge factor in getting good performance because there were Bengali and Telugu datasets that gave a slight boost to the model's performance in Hindi and Tamil. My strategy, along with many others, was to use SQuAD and TyDiQA as "QA Pre-training." This is not typical pre-training using Masked Language Modeling (MLM), but rather pre-training in the sense that the model is gaining a general understanding of the task and more fine-tuning will happen afterwards. It's just the name I'm using for the stage of training that occurs in between MLM pre-training and QA fine-tuning. This stage primes the model to have a sense of how the QA task works in general, and then the fine-tuning helps the model understand QA in the context of a specific language or two. Maybe midtuning or pre-fine-tuning would be better names.
34
+
35
+ *There was at least one team in the gold medal range that used an [interesting technique to transfer model performance from English to Hindi and Tamil](https://www.kaggle.com/competitions/chaii-hindi-and-tamil-question-answering/discussion/288110).
36
+
37
+ ## The limit of multilingual models
38
+
39
+ After seeing that SQuAD could serve as useful QA pre-training, I thought that I should find as many QA datasets in as many languages as possible. I found datasets in Korean, Thai, Bengali, German, French, Spanish, Arabic, Vietnamese, Japanese, Persian, English, and probably some more that I'm not remembering. This was quite a lot of data to churn through, but on TPU it only took an hour or so.
40
+
41
+ This was not a rigorous examination, but it did show the amazing cross-lingual learning ability of these models. Only after I did these tests did I notice that the XLM-R paper mentioned that training on too many languages resulted in reduced performance, making me choose only a small subset of QA datasets for QA pre-training.
42
+
43
+ ## Max Sequence Length and Doc Stride
44
+
45
+ Two of the most important parameters when training a QA model are the max sequence length and doc stride. Ideally, the model would be able to look at all of the passage in one pass, but the passages are usually long enough that they need to be broken up to be able to be processed by the model. When breaking it up, the doc stride is how much overlap there is between one chunk and the next. There didn't seem to be any well-known methods for determining max sequence length and doc stride, so most people ended up empirically finding what worked best. The standard max sequence length/doc stride of 384/128 is a good default.
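To make the two parameters concrete, here is a minimal sketch of the chunking. It assumes the passage is already tokenized into a flat list, and it ignores the question and special tokens that a real QA pipeline also has to fit into each window. The function name `split_into_chunks` is made up for this example, and doc stride is treated as the overlap between consecutive chunks, as described above.

```python
def split_into_chunks(tokens, max_seq_length=384, doc_stride=128):
    """Split a token list into windows of at most `max_seq_length` tokens.

    Consecutive windows share `doc_stride` tokens, so an answer that
    straddles a boundary still appears whole in at least one window
    (as long as the answer is shorter than the overlap).
    """
    if not 0 <= doc_stride < max_seq_length:
        raise ValueError("doc_stride must be smaller than max_seq_length")

    step = max_seq_length - doc_stride  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_seq_length])
        if start + max_seq_length >= len(tokens):
            break  # this window already reaches the end of the passage
        start += step
    return chunks


# Toy example with tiny numbers so the overlap is easy to see.
toy_tokens = list(range(10))
toy_chunks = split_into_chunks(toy_tokens, max_seq_length=4, doc_stride=2)
print(toy_chunks)
```

A bigger overlap means more chunks per passage (slower training and inference) but fewer answers cut in half, which is the trade-off people were tuning empirically.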
46
+
47
+ ## BigBird 🐥
48
+
49
+ One trick that set me apart from other competitors was using BigBird to handle long sequence lengths. Because many of the texts had thousands of tokens, I thought it would be advantageous to use a model that does well on long contexts. BigBird is a model designed to have the attention mechanism scale linearly in memory rather than quadratically like the default transformer. Unfortunately, the public pre-trained BigBird models are only in English, so I had to come up with a way to stick MuRIL into BigBird. This is much simpler than it seems because MuRIL is essentially RoBERTa trained on Indian languages, and since the authors of BigBird did a "warm start" by using the pre-trained weights from RoBERTa, nearly all of the parameters lined up one-to-one.
50
+
51
+ The only component of BigBird that could not directly come from RoBERTa/MuRIL was the position embeddings. These are trained parameters that are of the shape (max sequence length, hidden size) which means RoBERTa/MuRIL only has values up to a max sequence length of 512 and BigBird needs 4096. The solution to this is to tile the RoBERTa/MuRIL embeddings 8 times so it has the right dimensions.
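Here is a minimal sketch of the tiling idea using plain Python lists so that it runs without any dependencies. The function name `tile_position_embeddings` and the toy sizes are made up for illustration; with real weights the same thing would be done on the framework's arrays (for example with `numpy.tile`), and this is not the exact conversion script I used.

```python
def tile_position_embeddings(embeddings, target_length):
    """Repeat a (source_length, hidden_size) table until it has target_length rows.

    Row i of the result is a copy of row (i % source_length) of the input,
    so position 512 starts over with the embedding of position 0, and so on.
    """
    source_length = len(embeddings)
    if source_length == 0 or target_length % source_length != 0:
        raise ValueError("target_length must be a multiple of the source length")
    repeats = target_length // source_length
    return [list(row) for _ in range(repeats) for row in embeddings]


# Toy table: 2 positions with hidden size 3, stretched to 8 positions.
small_table = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
tiled_table = tile_position_embeddings(small_table, target_length=8)

# For the real model the numbers would be 512 -> 4096, i.e. 8 repeats.
print(len(tiled_table), 4096 // 512)
```

The tiled values are only a starting point. The extra training on mC4 is what lets the model learn that position 600 is not the same as position 88.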
52
+
53
+ I used the [Google TPU Research Cloud (TRC)](https://sites.research.google/trc/about/) to do training on TPU v3-8 accelerators and ran a base and large model for a couple of days on mC4 data. I used the Flax scripts in the Hugging Face examples file and it could not have been simpler.
54
+
55
+ I couldn't train the large BigBird model at 4096 sequence length, nor even 2048, but rather a measly 1024 -- even at a batch size of 1 and using bfloat16. A 6-fold ensemble of these models got my highest score, earning me a silver medal in the competition.
56
+
57
+ ## Ensembling
58
+
59
+ I had decent-scoring models using XLM-R, RemBERT, MuRIL, and BigBird MuRIL, but I ran out of time before I was able to ensemble them together. I took inspiration from [this notebook from a previous competition](https://www.kaggle.com/code/theoviel/character-level-model-magic) that turned token-level predictions into character-level predictions before combining. Different models have different tokenizers, so it isn't straightforward to combine token-level outputs from them. The one thing they do have in common is the original context, thus the need for character-level predictions. To go from token-level to character-level, the token-level predictions are duplicated for each character.
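A minimal sketch of that conversion is below. It assumes each tokenizer can report a `(start, end)` character offset for every token and that each model gives one score per token (say, the probability of being the answer start). The function names, the two toy tokenizations, and the scores are all made up for illustration; this is not the code from the linked notebook.

```python
def token_to_char_predictions(token_scores, offsets, text_length):
    """Copy each token's score onto every character that the token covers.

    Characters not covered by any token (e.g. spaces) keep a score of 0.0.
    """
    char_scores = [0.0] * text_length
    for score, (start, end) in zip(token_scores, offsets):
        for i in range(start, end):
            char_scores[i] = score
    return char_scores


def average_char_predictions(per_model_char_scores):
    """Average character-level scores across models, position by position."""
    return [sum(column) / len(column) for column in zip(*per_model_char_scores)]


context = "New Delhi"

# Model A splits the text into 2 tokens, model B into 3 tokens.
chars_a = token_to_char_predictions([0.25, 0.75], [(0, 3), (4, 9)], len(context))
chars_b = token_to_char_predictions([0.5, 0.5, 1.0], [(0, 3), (4, 7), (7, 9)], len(context))

ensembled = average_char_predictions([chars_a, chars_b])
print(ensembled)
```

Once every model's output lives on the same character grid, combining them is a plain average (or a weighted one), no matter how differently the tokenizers cut up the text.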
60
+
61
+ I submitted an ensemble for scoring after the competition ended, but it did poorly. If I ever get around to seeing why my ensemble failed, I'd be curious to know how well it could have done. Many of the gold medalists did similar ensembling, but theirs got good results, unlike mine.
62
+
63
+ ## What might be interesting to explore in the future
64
+
65
+ Two approaches I tried that didn't yield any results were Splinter and SAM.
66
+
67
+ ### Splinter
68
+
69
+ Splinter is a model designed for few-shot QA, which seems ideal for this competition. Just like BigBird, there was only an English model so I tried to replicate it in Hindi and Tamil to no avail. The [original work](https://github.com/oriram/splinter) was done using TensorFlow 1, and I attempted to replicate it using Hugging Face and PyTorch. I trained on TPU v3-8 but I was unable to get it to converge smoothly. Hugging Face even released [official support for Splinter](https://huggingface.co/docs/transformers/model_doc/splinter), but I was still unable to get it to work. I have trained [Splinter on SQuAD v2](https://huggingface.co/nbroad/splinter-base-squad2) so I can confirm that the English version works.
70
+
71
+ ### SAM
72
+
73
+ Sharpness-Aware Minimization is an optimization technique that is supposed to lead to better generalization because it finds regions of flatness in optimization space rather than sharp holes. There are papers that indicate it does well for [computer vision tasks](https://github.com/google-research/sam) and for [language tasks](https://arxiv.org/abs/2110.08529) as well. Having a model that generalizes well is useful for avoiding bad surprises on the final leaderboard. When I used SAM, I had no intuition for the hyperparameters, so I found my training runs diverging or just scoring the same as or worse than before.
74
+
75
+ Both of these approaches seemed very useful, and I thought that Splinter would have been a gold-medal-worthy approach. I'm still curious to see how well it could have done in this competition.
blogs/2021/kaggle_coleridge.md ADDED
@@ -0,0 +1,67 @@
1
+ ---
2
+ title: Reflections on Kaggle competition [Coleridge Initiative - Show US the Data]
3
+ desc: Reflecting on what worked and what didn't
4
+ published: true
5
+ date_published: 2021-12-29
6
+ tags: kaggle nlp
7
+ ---
8
+
9
+ It's been several months since the [Coleridge Initiative - Show US the Data](https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data) competition ended, but I recently got in the mood to write a quick reflection about my experience. This reflection is mainly a way for me to assess what I learned, but maybe you'll also find something worthwhile.
10
+
11
+
12
+ ## Competition Details 🏆
13
+
14
+ The hosts wanted to find all of the mentions of public datasets hidden in published journals and articles. When I say hidden, I mean that the paper refers to a dataset but never officially cites anything in the references section. In the hosts' own words,
15
+
16
+ > Can natural language processing find the hidden-in-plain-sight data citations? [This will] show how public data are being used in science and help the government make wiser, more transparent public investments.
17
+
18
+ Participants competed to see who could find the most mentions of datasets in roughly 8,000 publications.
19
+
20
+ As this was my first Kaggle competition, I quickly realized that it was much more nuanced and complicated than I expected. It was also a wake-up call for me to realize how little I knew about PyTorch, TensorFlow, GPUs, TPUs, and Machine Learning in general. I was challenged and pushed to think in new ways, and I felt that I had grown significantly after the competition ended.
21
+
22
+ Moreover, the final results of the competition were very surprising because my team jumped over 1400 spots from the public to private leaderboard to end at 47th out of 1610 teams, earning us a silver medal (top 3%). If you would like to know more about why there are separate public and private leaderboards, [read this post here.](https://qr.ae/pG6Xc1) I've included a plot below of public ranking vs private ranking to show how much "shake-up" there was. My team had the 3rd highest positive delta, meaning that only 2 other teams jumped more positions from public to private. The person who probably had the worst day went from 176 (just outside the medal range) to 1438. To understand the figure, there are essentially 3 different categories. The first category would be any point on the line y=x. This means that the team had the exact same score on public and private leaderboards. The further the points get away from y=x, the bigger the difference between public and private leaderboards. The second category is for the teams who dropped from public to private -- they are in the region between the line y=x and the y-axis. The final category is for the teams that went up from public to private leaderboards: the region from the line y=x to the x-axis. My team fell into this third category and we were practically in the bottom right corner.
23
+
24
+ {% include figure image_path="/assets/images/coleridge-shakeup.png" alt="Public vs private leaderboard rank for Coleridge competition"%}
25
+
26
+ ## Why the shake-up?
27
+
28
+ >*shake-up is the term used on Kaggle to describe the change in rank from public to private leaderboard*
29
+
30
+ In general, shake-up is caused by overfitting to the public leaderboard, not having a good cross-validation method, and having a model that does not generalize well.
31
+
32
+ For this competition, the huge jump in score was because
33
+ 1. string matching worked well on the public leaderboard but not the private one.
34
+ - the public leaderboard contained dataset names from the training data but the private leaderboard didn't have any.
35
+ 2. most people did not check to see if their approach could find datasets that weren't mentioned in the training dataset
36
+
37
+ If I'm being honest, I didn't have great cross-validation, but I also refused to do string matching using dataset lists because it didn't seem like it was what the hosts wanted and I was more interested in applying a deep learning approach.
38
+
39
+ ## Best solution 💡
40
+
41
+
42
+ It was a bit annoying that my best submission actually didn't require any training whatsoever. I suppose I could consider this as being resourceful, but I still think it's annoying considering all the time I put into other methods that ultimately scored worse.
43
+
44
+ My best submission used a combination of regular expressions and a pre-trained question answering model that I pulled off the 🤗 Hugging Face Model Hub. In terms of regular expressions, I came up with a few simple patterns that looked for the sentences that contained words indicative of a dataset, such as Study, Survey, Dataset, etc. These regex patterns were used to quickly narrow down the amount of text that I would run through the slow transformer model. I sampled a few different architectures, BERT, RoBERTa, ELECTRA, and ultimately chose an ELECTRA model that had been trained on SQuAD 2.0. The final step was to pass the sentences that matched the regex patterns into the question answering model as the context, alongside "What is the dataset?" as the question. Adding the question gives the model extra information which helps it extract the right span of text. This submission went from 0.335 to 0.339 from public to private leaderboard.
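Here is a minimal sketch of the first, regex-based stage. The pattern, the naive sentence splitter, the function name `candidate_sentences`, and the example text (including the dataset name in it) are all made up for illustration; my actual patterns were a bit different. The second stage is only described in a comment because it needs a downloaded model.

```python
import re

# Words that hint that a sentence mentions a dataset (illustrative, not my exact list).
DATASET_HINT = re.compile(r"\b(Study|Survey|Dataset)\b")


def candidate_sentences(text):
    """Return only the sentences that contain a dataset-like keyword."""
    # Naive sentence splitting: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if DATASET_HINT.search(sentence)]


example_text = (
    "We thank our funders for their support. "
    "Data come from the Example Household Survey. "
    "Results were robust to several specifications."
)

filtered = candidate_sentences(example_text)
print(filtered)

# Each sentence in `filtered` would then be passed as the context to an
# extractive QA model, with "What is the dataset?" as the question.
```

The point of the filter is speed: the regex is practically free, so the slow transformer only has to read the small fraction of sentences that could plausibly contain a dataset name.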
45
+
46
+
47
+ ## Setbacks 😖
48
+
49
+ As mentioned earlier, my NER models were absolutely dreadful. I figured that identifying the dataset names from passages of text was essentially token classification so I focused 95% of my efforts on building an NER model. I think NER models struggled because the training set had over 14,000 different texts, but only 130 different dataset labels. If there were a wider variety of dataset labels, it might have done better but I think the model just memorized those 130 labels. Moreover, there were many spans that had dataset names that were not labeled, so I was implicitly training the model to ignore other dataset names. I ended up using my NER model for one of my two final submissions, and it went from 0.398 in the public leaderboard to 0.038 in the private leaderboard. Big oof.
50
+
51
+
52
+ ## Key Takeaways 🗝️
53
+
54
+
55
+
56
+ I don't think this was a typical Kaggle competition, and it was a little surprising to see how some top teams relied much more on good rule-based approaches rather than on deep learning models. I think most Kagglers default to using the biggest, most complicated transformer-based approach because they've heard so much about BERT et al. While transformers have been able to accomplish remarkable scores on numerous benchmarks, they can't do everything! I fell into this trap too, and I think my final score was mainly due to luck. 🍀
57
+
58
+
59
+
60
+ ### Current stats
61
+
62
+ ![competition](https://road-to-kaggle-grandmaster.vercel.app/api/badges/nbroad/competition)
63
+ ![dataset](https://road-to-kaggle-grandmaster.vercel.app/api/badges/nbroad/dataset)
64
+ ![notebook](https://road-to-kaggle-grandmaster.vercel.app/api/badges/nbroad/notebook)
65
+ ![discussion](https://road-to-kaggle-grandmaster.vercel.app/api/badges/nbroad/discussion)
66
+
67
+ If you want to add Kaggle badges to your GitHub profile or website, use this: <https://github.com/subinium/kaggle-badge>
blogs/2022/exotic_cheese.md ADDED
@@ -0,0 +1,27 @@
1
+ ---
2
+ title: "Eccentric billionaire hobby #1 - exotic cheeses"
3
+ desc: I like cheese
4
+ published: true
5
+ date_published: 2022-07-04
6
+ tags: fun
7
+ ---
8
+
9
+ # 🧀
10
+
11
+ While I am nowhere close to becoming a billionaire, I do occasionally daydream about frivolous scenarios that, in all likelihood, will never happen but are entertaining nonetheless. One scenario I keep returning to is the one where I'm a billionaire. "Regular" billionaires like Bezos and Musk might take trips into outer space, but I've decided that my first eccentric billionaire hobby will be to open up an exotic cheese shop.
12
+
13
+ > You might be saying, "Exotic cheese? Aren't there better ways of spending your money?" Yes, and since this is a fictional scenario, let's just assume I've already funded the research that cured cancer, I've already ended child poverty and hunger, and I've already fixed whichever tragic cause is closest to your heart. Even after all of that heroic philanthropy, I will still have tons of cash to do dumb shit.
14
+
15
+ Ok, back to exotic cheese. Have you ever wondered what jaguar cheese tastes like? What about elephant cheese? Dolphin cheese? It's questions like this that keep me up at night.
16
+
17
+ {% include figure image_path="/assets/images/dolphin_cheese.jpeg" alt="Woman: He's probably thinking about other women. Man: I wonder what dolphin cheese tastes like."%}
18
+
19
+ This cheese shop will only be possible if the milk from any mammal can be turned into cheese, which may or may not be true. For the sake of this fantasy, let's assume it is possible.
20
+
21
+ There are tons of different mammals like giraffes, whales, and wolves, and my shop will have hundreds of different types to try! I'd be willing to bet that a good portion of these cheeses will be absolutely disgusting, but it will certainly be a one-of-a-kind experience.
22
+
23
+ Now, I have absolutely no idea how to actually procure the milk from endangered/rare/dangerous animals, but I'm sure that a billionaire would be able to make it happen. Who knows, maybe my exotic cheese shop will be so popular that it will spur a movement to protect more animals so that more cheese can be made.
24
+
25
+ Oh, and if a billionaire somehow reads this and wants to make a dream come true, please reach out 😄
26
+
27
+ # 🐵 🐶 🦊 🐻 🐼 🧀
blogs/2022/kaggle_commonlit_readability.md ADDED
@@ -0,0 +1,63 @@
1
+ ---
2
+ title: Reflections on Kaggle competition [CommonLit Readability Prize]
3
+ desc: Reflecting on what worked and what didn't
4
+ published: true
5
+ date_published: 2022-01-05
6
+ tags: kaggle nlp
7
+ ---
8
+
9
+ The [CommonLit Readability Prize](https://www.kaggle.com/c/commonlitreadabilityprize) was my second Kaggle competition, as well as my worst leaderboard finish. In retrospect, it was a great learning opportunity and I'm still only a *little* disappointed that I didn't rank higher. Here are some details about what I learned, what I was impressed by, and what I would do differently in the future.
10
+
11
+ {% include figure image_path="https://upload.wikimedia.org/wikipedia/commons/7/74/Open_books_stacked.jpg" alt="open books" caption="Hh1718, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons" %}
12
+
13
+ ## Competition Details 🏆
14
+
15
+ The task for the competition was to rate the readability of a passage of text - basically assigning a grade level based on the difficulty of the passage. For 99% of the participants this task seemed like a straightforward regression problem, but some interesting approaches emerged after the competition ended. Transformers were, of course, necessary to do well. The competition had over 3000 teams, which I think was due to the large amount of prize money and to students, who would normally be in school, having more free time during the summer.
16
+
17
+ ## Defining Readability 📚
18
+
19
+ Readability is pretty subjective, so the hosts got many teachers from grades 3-12 to do pairwise comparisons between book excerpts. That is, the task for each teacher was to decide, given two texts, which one was relatively harder/easier to read. After getting over 100k pairwise comparisons, they used a [Bradley-Terry (BT) Model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) to get values for all excerpts.
20
+
21
+ ## Approaches
22
+
23
+ Since the training data was basically a text and a number (the readability score), most participants turned it into a regression task. Give the model a bit of text and the model will output a score. Many of the winning solutions used this, but after the competition ended, a few people shared how they tried implementing pairwise models.
24
+
25
+ In essence, the pairwise model takes two texts as input and then gives a score for how much harder/easier one text is over the other. If one of those texts has a known readability score, such as those that are in the training data, the unlabeled text can be assigned a score based on the known score and the model's prediction of the relative difference in readability.
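As a rough sketch of that last step, suppose a pairwise model has already predicted, for a new text and several labeled "anchor" texts, how much harder or easier the new text is than each anchor. The function name `score_from_anchors` and the numbers are made up, and averaging over anchors is my own simplifying assumption; the linked posts describe what people actually did.

```python
def score_from_anchors(anchor_scores, predicted_differences):
    """Estimate a readability score for an unlabeled text.

    anchor_scores[i] is the known score of labeled text i, and
    predicted_differences[i] is the model's prediction of
    (score of the new text) - (score of labeled text i).
    Each anchor gives one estimate; the estimates are averaged.
    """
    if len(anchor_scores) != len(predicted_differences) or not anchor_scores:
        raise ValueError("need one predicted difference per anchor")
    estimates = [
        anchor + difference
        for anchor, difference in zip(anchor_scores, predicted_differences)
    ]
    return sum(estimates) / len(estimates)


# Two labeled anchors and the (hypothetical) model outputs against each one.
known_scores = [-1.0, 0.5]
model_differences = [0.5, -1.0]
estimated_score = score_from_anchors(known_scores, model_differences)
print(estimated_score)
```

Using more anchors should average out some of the noise in individual comparisons, at the cost of one extra forward pass per anchor.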
26
+
27
+ I'm probably not doing the explanation justice, so please refer to the following posts.
28
+
29
+ - Chris Deotte ([@cdeotte](https://www.kaggle.com/cdeotte)) [explaining how he used BT to score passages](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/257446),
30
+ - Abhishek Thakur's ([@ahbishek](https://www.kaggle.com/ahbishek)) [notebook]() and post about his pairwise model.
31
+ - User [@Allohvk](https://www.kaggle.com/allohvk) [giving more details about BT as a loss function](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/276356).
32
+
33
+ I don't think the pairwise approach was able to surpass the regression approach, but I think it could potentially be used again in future competitions, such as this one: [Jigsaw Rate Severity of Toxic Comments - Rank relative ratings of toxicity between comments](https://www.kaggle.com/c/jigsaw-toxic-severity-rating)
34
+
35
+ To summarize the various techniques that the top solutions used, here is a quick list.
36
+
37
+ - In-domain pre-training to adapt language models to this type of text.
38
+ - People used [Project Gutenberg](https://www.gutenberg.org/) and other freely available literature.
39
+ - Pseudo-labeling on unlabeled text and then training on the pseudo-labels.
40
+ - The unlabeled text came from sources similar to what was used for in-domain pre-training.
41
+ - Ensembling models
42
+ - Some of the top approaches used over 10 different models with different architectures, different training schemes, and different sizes.
43
+ - Adjusting predictions to have the same mean score as the training data
44
+ - Using SVR on top of model predictions
45
+ - Not using dropout when fine-tuning
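Most of these need a full training setup, but the mean adjustment is simple enough to sketch. This assumes the adjustment is a plain shift of every prediction by the same amount, which is the simplest reading of that bullet; the function name `match_mean` and the numbers are made up, and the top teams may have done something fancier.

```python
def match_mean(predictions, target_mean):
    """Shift all predictions by a constant so their mean equals target_mean."""
    if not predictions:
        return []
    shift = target_mean - sum(predictions) / len(predictions)
    return [prediction + shift for prediction in predictions]


# Hypothetical test predictions and a hypothetical training-set mean.
raw_predictions = [1.0, 2.0, 3.0]
training_mean = 0.0
adjusted_predictions = match_mean(raw_predictions, training_mean)
print(adjusted_predictions)
```

The ranking of the passages stays exactly the same. Only the overall offset changes, which matters because the competition metric penalizes a systematic bias in the scores.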
46
+
47
+ ## The magic of no dropout ✨
48
+
49
+ One interesting discovery during this competition was the bizarre effect of dropout when using transformer models for regression. Typically dropout is essential to prevent overfitting, but if a transformer is being used for regression, dropout will actually ***hurt*** the performance. It seems hard to believe but [it actually made a substantial difference](https://www.kaggle.com/competitions/commonlitreadabilityprize/discussion/260729). Some users did some digging and found published articles claiming that dropout for classification is fine because the magnitude of the outputs does not really matter – as long as the relative values produce the right answer when taking the greatest value, it doesn’t matter how big or small that output is. With regression, the magnitude of the output is precisely what you need fine control over. [Here is a discussion on it](https://www.kaggle.com/competitions/commonlitreadabilityprize/discussion/260729#1442448).
50
+
51
+ ## Lessons Learned 👨‍🏫
52
+
53
+ Looking back, I can identify two main reasons why I struggled: working alone and refusing to use public notebooks. Due to a combination of factors I decided to fly solo, which meant I was able to learn a lot, but it also meant I went slowly and didn't have great support when running into problems. This was useful for learning purposes but it wasted a lot of valuable time when there were good public notebooks and models available. I realized how foolish my decision was when I read how the [person who got 1st place used the public notebooks to make even better models](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/257844). I'm still a little salty that I didn't do better, but I'm taking it as a learning opportunity and moving forward. 🤷‍♂️ Onto the next one!
54
+
55
+
56
+ ### Current stats
57
+
58
+ ![competition](https://road-to-kaggle-grandmaster.vercel.app/api/badges/nbroad/competition)
59
+ ![dataset](https://road-to-kaggle-grandmaster.vercel.app/api/badges/nbroad/dataset)
60
+ ![notebook](https://road-to-kaggle-grandmaster.vercel.app/api/badges/nbroad/notebook)
61
+ ![discussion](https://road-to-kaggle-grandmaster.vercel.app/api/badges/nbroad/discussion)
62
+
63
+ If you want to add Kaggle badges to your GitHub profile or website, use this: <https://github.com/subinium/kaggle-badge>
blogs/2022/kaggle_nbme.md ADDED
File without changes
blogs/2023/kaggle_benetech.md ADDED
File without changes
blogs/2023/kaggle_llm_sci_exam.md ADDED
File without changes
blogs/2024/ai_disconnect.md ADDED
@@ -0,0 +1,17 @@
1
+ ---
2
+ title: The Great AI Disconnect
3
+ desc: The dystopian AI future that is very close.
4
+ date_published: 2024-11-09
5
+ published: true
6
+ ---
7
+
8
+ I have an uneasy feeling about the way the digital world is getting more and more cluttered with increasingly realistic AI-generated text, images, audio, and video. In the next few years, AI content will become more convincingly real as well as cheaper to produce. Social media sites will become saturated with AI content, and there won't be any reliable ways of detecting it. Moreover, people are already primed to believe what they want to believe, so it shouldn't even be that hard to convince people of lies. To make matters worse, it is far easier to spread AI content than it is to correct people's misconceptions. Imagine a photo that gets seen by 10 million people in 1 hour - something that happens every single day on Twitter, Instagram, TikTok, Facebook, etc. It might take a few hours, maybe a few days, to figure out if that image was actually real. How many of those millions of people would see the update that the photo was fake? I'd imagine it would be a tiny fraction, thus showing the power of spreading AI content.
9
+
10
+ I am anticipating a critical event where millions of people are misled by AI-generated content, resulting in people losing all trust in anything they can't see. There are already people who don't feel they can trust any news because of the "fake-news" rhetoric. How will people be able to trust anything that they don't see with their own eyes? How will information be spread when people are disconnected from the internet? Will there need to be a slow verification process by multiple third parties before anything can be posted? These are a few of the questions I have been thinking about, and it is very hard for me to see how this catastrophe can be avoided.
11
+
12
+ The one possibility I see is that people will start to rebel and destroy data centers, leading to all GPUs behind consolidated within an international organization that has many layers of redundant oversight and transparency. Maybe the GPUs will even be air-gapped so that nothing can get out, meaning that all research and development would have to happen at this one organization. There would undoubtedly be a black market of GPUs that didn't get seized, or GPUs that get "lost" somewhere in the manufacturing process. Even a single DGX machine could be pretty useful for creating a large amount of AI content, and since it is about the size of a desktop PC, it would be pretty easy to conceal and smuggle.
13
+
14
+ I suppose one positive aspect is that people may start developing more in-person relationships after disconnecting from the internet. This is one silver lining, but I think there are far more negative aspects than positive ones. I predict that the Great AI Disconnect will happen before 2030.
15
+
16
+
17
+ Why don't I think that AI-detectors will be useful? There are already AI-detectors being used to check student's essays with alarming false-positive rates. I've worked on a project related to this task, and it seems like something that is extremely difficult to reliable detect. Moreover, just like cybersecurity, this is a cat and mouse game where the attacker is always trying to beat the latest defense, and vice versa. There is always a new way to beat a new technique, and I have a feeling that it is far easier to break the detection than it is to create a robust detector.
blogs/2024/beautiful_dog.md ADDED
@@ -0,0 +1,4 @@
+ ---
+ title: Dogs are Beautiful
+ ---
+
blogs/2024/kaggle_llm_detect_ai.md ADDED
File without changes
blogs/2024/kaggle_llm_prompt_recovery.md ADDED
File without changes
blogs/2024/kaggle_pii_data_detection.md ADDED
File without changes
favicon.ico ADDED
main.py ADDED
@@ -0,0 +1,25 @@
+ from fh_bootstrap import *
+ import about, blog
+
+ hdrs = (
+     Link(href='/assets/hl-styles.css', rel='stylesheet'),
+     Link(href='/assets/styles.css', rel='stylesheet'),
+     *Socials(title='Nicholas Broad', description='', site_name='',
+              twitter_site='@nbroad1881', image=f'/assets/og-sq.png', url='')
+ )
+
+ app,rt = fast_app(pico=False, hdrs=bst_hdrs+hdrs, live=False)
+
+ app.get('/')(about.page)
+ app.get('/blog')(blog.full_page)
+
+ @rt("/blog/{blog_id}")
+ def get(blog_id: str):
+     return blog.single_blog(blog_id)
+
+
+ @rt("/{fname:path}.{ext:static}")
+ def get(fname:str, ext:str): return FileResponse(f'{fname}.{ext}')
+
+
+ serve()
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ uvicorn
+ python-fasthtml
+ python-dotenv
+ fasthtml-hf
+ markdown
+ huggingface_hub