pearsonkyle
committed on
Commit
•
d83d12e
1
Parent(s):
a6f4731
figure
- README.md +4 -66
- exoplanet_keywords.png +0 -0
README.md
CHANGED
@@ -10,7 +10,7 @@ A deep language model, GPT-2, is trained on scientific manuscripts from NASA's A
|
|
10 |
```python
|
11 |
from transformers import pipeline
|
12 |
|
13 |
-
exo = pipeline('text-generation',model='gpt2-exomachina
|
14 |
machina = lambda text: exo(text)[0]['generated_text']
|
15 |
|
16 |
print(machina("Transiting exoplanets are"))
|
@@ -19,7 +19,7 @@ print(machina("Transiting exoplanets are"))
|
|
19 |
## Training Samples
|
20 |
~40,000 Abstracts from NASA's Astrophysical data system (ADS) and ArXiv.
|
21 |
|
22 |
-
![](
|
23 |
|
24 |
A few generated samples are below:
|
25 |
|
@@ -28,49 +28,7 @@ A few generated samples are below:
|
|
28 |
`that large, deciduous forest fires are composed of mostly dry, unprocessed material that is distributed in a nearly patchy fashion. The distributions of these fires are correlated with temperature, and also with vegetation...`
|
29 |
- *Directly imaged exoplanets probe* `key aspects of planet formation and evolution theory, as well as atmospheric and interior physics. These insights have led to numerous direct imaging instruments for exoplanets, many using polarimetry. However, current instruments take`
|
30 |
|
31 |
-
|
32 |
-
|
33 |
-
- ### Set up a SQL database to store training samples
|
34 |
-
A PostgreSQL database is set up on Amazon RDS so that multiple computers can access the same data online. Follow the instructions below to set up your own database using the AWS Free Tier:
|
35 |
-
|
36 |
-
1. Sign in or register: https://aws.amazon.com/
|
37 |
-
2. Search for services and go to RDS
|
38 |
-
|
39 |
-
Add your credentials to a new file called `settings.json`, like so:
|
40 |
-
```
|
41 |
-
{
|
42 |
-
"database":{
|
43 |
-
"dialect":"postgresql",
|
44 |
-
"username":"readonly",
|
45 |
-
"password":"readonly",
|
46 |
-
"endpoint":"exomachina.c4luhvcn1k1s.us-east-2.rds.amazonaws.com",
|
47 |
-
"port":5432,
|
48 |
-
"dbname":"exomachina"
|
49 |
-
}
|
50 |
-
}
|
51 |
-
```
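As a minimal sketch of how such a settings file could be consumed, the snippet below reads one section of the file and assembles a connection URL of the form `dialect://username:password@endpoint:port/dbname`. The function names `load_settings` and `connection_url` are hypothetical and not taken from the repository; the `key` argument is assumed to mirror the `-k KEY` option of `ads_query.py`, and the actual scripts may read the file differently.

```python
import json
from urllib.parse import quote_plus


def load_settings(path, key="database"):
    """Read one section of a settings.json file (hypothetical helper)."""
    with open(path) as handle:
        return json.load(handle)[key]


def connection_url(db):
    """Assemble dialect://username:password@endpoint:port/dbname."""
    # quote_plus escapes special characters that may appear in credentials
    user = quote_plus(db["username"])
    password = quote_plus(db["password"])
    return (
        f"{db['dialect']}://{user}:{password}"
        f"@{db['endpoint']}:{db['port']}/{db['dbname']}"
    )
```

A URL string in this form can be passed to a database client such as SQLAlchemy's `create_engine`, for example `connection_url(load_settings("settings.json"))`.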
|
52 |
-
|
53 |
-
## Scraping NASA ADS
|
54 |
-
|
55 |
-
https://ads.readthedocs.io/en/latest/
|
56 |
-
|
57 |
-
Scrape ADS and save entries into a SQL database:
|
58 |
-
|
59 |
-
`python ads_query.py -s settings.json -q exoplanet`
|
60 |
-
|
61 |
-
```
|
62 |
-
usage: ads_query.py [-h] [-q QUERY] [-s SETTINGS] [-k KEY]
|
63 |
-
|
64 |
-
optional arguments:
|
65 |
-
-h, --help show this help message and exit
|
66 |
-
-q QUERY, --query QUERY
|
67 |
-
Initial search criteria
|
68 |
-
-s SETTINGS, --settings SETTINGS
|
69 |
-
Settings file
|
70 |
-
-k KEY, --key KEY Settings key
|
71 |
-
```
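The usage text above has the shape of the help output produced by Python's `argparse`. A minimal sketch of a parser with the same three options follows; the function name `build_parser` and the default values are assumptions, since the original script's defaults are not listed here.

```python
import argparse


def build_parser():
    """Build a parser with the -q, -s and -k options shown in the usage text."""
    parser = argparse.ArgumentParser(prog="ads_query.py")
    # Defaults below are illustrative assumptions, not the script's real defaults
    parser.add_argument("-q", "--query", type=str, default="exoplanet",
                        help="Initial search criteria")
    parser.add_argument("-s", "--settings", type=str, default="settings.json",
                        help="Settings file")
    parser.add_argument("-k", "--key", type=str, default="database",
                        help="Settings key")
    return parser


# Parse an explicit argument list, equivalent to:
#   python ads_query.py -s settings.json -q exoplanet
args = build_parser().parse_args(["-s", "settings.json", "-q", "exoplanet"])
```

Here `args.query`, `args.settings` and `args.key` hold the parsed values, with `args.key` falling back to its assumed default.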
|
72 |
-
|
73 |
-
Letting the scrape run for ~2 hours found articles from these publications in descending order:
|
74 |
```
|
75 |
5364 - The Astrophysical Journal
|
76 |
3365 - Astronomy and Astrophysics
|
@@ -87,24 +45,4 @@ Letting the scrape run for ~2 hours found articles from these publications in de
|
|
87 |
129 - Planetary and Space Science
|
88 |
114 - Space Science Reviews
|
89 |
109 - Geophysical Research Letters
|
90 |
-
```
|
91 |
-
|
92 |
-
The number of manuscripts for each year:
|
93 |
-
![](Figures/exoplanet_histogram.png)
|
94 |
-
|
95 |
-
## Pre-processing
|
96 |
-
Extract abstracts from the database and create a new file where each line is a new sample. Try a new tokenizer.
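A minimal sketch of the one-sample-per-line step, assuming the abstracts have already been fetched from the database as a list of strings (possibly containing `None` for missing entries). The names `to_lines` and `write_samples` and the whitespace handling are illustrative choices, not the repository's actual pre-processing.

```python
def to_lines(abstracts):
    """Return one cleaned single-line string per non-empty abstract."""
    lines = []
    for abstract in abstracts:
        # Collapse newlines, tabs and repeated spaces so a sample stays on one line
        line = " ".join((abstract or "").split())
        if line:
            lines.append(line)
    return lines


def write_samples(abstracts, path):
    """Write the cleaned abstracts to path, one per line; return the count."""
    lines = to_lines(abstracts)
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return len(lines)
```

Keeping the cleaning in `to_lines` separate from the file writing makes it easy to inspect the samples before committing them to disk.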
|
97 |
-
|
98 |
-
## Things to improve
|
99 |
-
|
100 |
-
## Export the models to an iOS application
|
101 |
-
|
102 |
-
|
103 |
-
References
|
104 |
-
- https://huggingface.co/roberta-base
|
105 |
-
- GPT-2 generative text
|
106 |
-
- https://huggingface.co/docs
|
107 |
-
- https://huggingface.co/transformers/training.html
|
108 |
-
- https://huggingface.co/transformers/notebooks.html
|
109 |
-
https://colab.research.google.com/drive/1vsCh85T_Od7RBwXfvh1iysV-vTxmWXQO#scrollTo=ljknzOlNoyrv
|
110 |
-
http://jalammar.github.io/illustrated-gpt2/
|
|
|
10 |
```python
|
11 |
from transformers import pipeline
|
12 |
|
13 |
+
exo = pipeline('text-generation',model='pearsonkyle/gpt2-exomachina', tokenizer='gpt2', config={'max_length':1600})
|
14 |
machina = lambda text: exo(text)[0]['generated_text']
|
15 |
|
16 |
print(machina("Transiting exoplanets are"))
|
|
|
19 |
## Training Samples
|
20 |
~40,000 Abstracts from NASA's Astrophysical data system (ADS) and ArXiv.
|
21 |
|
22 |
+
![](exoplanet_keywords.png)
|
23 |
|
24 |
A few generated samples are below:
|
25 |
|
|
|
28 |
`that large, deciduous forest fires are composed of mostly dry, unprocessed material that is distributed in a nearly patchy fashion. The distributions of these fires are correlated with temperature, and also with vegetation...`
|
29 |
- *Directly imaged exoplanets probe* `key aspects of planet formation and evolution theory, as well as atmospheric and interior physics. These insights have led to numerous direct imaging instruments for exoplanets, many using polarimetry. However, current instruments take`
|
30 |
|
31 |
+
Letting the scrape run for ~2 hours found articles from these publications in descending amount:
|
32 |
```
|
33 |
5364 - The Astrophysical Journal
|
34 |
3365 - Astronomy and Astrophysics
|
|
|
45 |
129 - Planetary and Space Science
|
46 |
114 - Space Science Reviews
|
47 |
109 - Geophysical Research Letters
|
48 |
+
```
exoplanet_keywords.png
ADDED