pearsonkyle committed on
Commit
d83d12e
1 Parent(s): a6f4731
Files changed (2)
  1. README.md +4 -66
  2. exoplanet_keywords.png +0 -0
README.md CHANGED
@@ -10,7 +10,7 @@ A deep language model, GPT-2, is trained on scientific manuscripts from NASA's A
  ```python
  from transformers import pipeline
 
- exo = pipeline('text-generation',model='gpt2-exomachina/checkpoint-320000', tokenizer='gpt2', config={'max_length':1600})
  machina = lambda text: exo(text)[0]['generated_text']
 
  print(machina("Transiting exoplanets are"))
@@ -19,7 +19,7 @@ print(machina("Transiting exoplanets are"))
  ## Training Samples
  ~40,000 Abstracts from NASA's Astrophysical data system (ADS) and ArXiv.
 
- ![](Figures/exoplanet_keywords.png)
 
  A few generated samples are below:
 
@@ -28,49 +28,7 @@ A few generated samples are below:
  `that large, deciduous forest fires are composed of mostly dry, unprocessed material that is distributed in a nearly patchy fashion. The distributions of these fires are correlated with temperature, and also with vegetation...`
  - *Directly imaged exoplanets probe* `key aspects of planet formation and evolution theory, as well as atmospheric and interior physics. These insights have led to numerous direct imaging instruments for exoplanets, many using polarimetry. However, current instruments take`
 
- ## Instructions
-
- - ### Setup a SQL Databse to store training samples
- A postegres SQL database is set up on Amazon RDS in order to provide online access to the same data for multiple computers. Follow the instructions below to set up your own database using the Free Tier of services on AWS:
-
- 1. sign in or register: https://aws.amazon.com/
- 2. Search for a services and go to RDS
-
- Add your credentials to a new file called `settings.json` like such:
- ```
- {
- "database":{
- "dialect":"postgresql",
- "username":"readonly",
- "password":"readonly",
- "endpoint":"exomachina.c4luhvcn1k1s.us-east-2.rds.amazonaws.com",
- "port":5432,
- "dbname":"exomachina"
- }
- }
- ```
-
- ## Scraping NASA ADS
-
- https://ads.readthedocs.io/en/latest/
-
- Scrape ADS and save entries into a sql database:
-
- `python ads_query.py -s settings.json -q exoplanet`
-
- ```
- usage: ads_query.py [-h] [-q QUERY] [-s SETTINGS] [-k KEY]
-
- optional arguments:
- -h, --help show this help message and exit
- -q QUERY, --query QUERY
- Initial search criteria
- -s SETTINGS, --settings SETTINGS
- Settings file
- -k KEY, --key KEY Settings key
- ```
-
- Letting the scrape run for ~2 hours found articles from these publications in descending order:
  ```
  5364 - The Astrophysical Journal
  3365 - Astronomy and Astrophysics
@@ -87,24 +45,4 @@ Letting the scrape run for ~2 hours found articles from these publications in de
  129 - Planetary and Space Science
  114 - Space Science Reviews
  109 - Geophysical Research Letters
- ```
-
- The number of manuscripts for each year:
- ![](Figures/exoplanet_histogram.png)
-
- ## Pre-processing
- Extract abstracts from the database and create a new file where each line is an new sample. Try a new tokenizer
-
- ## Things to improve
-
- ## Export the models to an iOS application
-
-
- References
- - https://huggingface.co/roberta-base
- - GPT-2 generative text
- - https://huggingface.co/docs
- - https://huggingface.co/transformers/training.html
- - https://huggingface.co/transformers/notebooks.html
- https://colab.research.google.com/drive/1vsCh85T_Od7RBwXfvh1iysV-vTxmWXQO#scrollTo=ljknzOlNoyrv
- http://jalammar.github.io/illustrated-gpt2/
 
  ```python
  from transformers import pipeline
 
+ exo = pipeline('text-generation',model='pearsonkyle/gpt2-exomachina', tokenizer='gpt2', config={'max_length':1600})
  machina = lambda text: exo(text)[0]['generated_text']
 
  print(machina("Transiting exoplanets are"))
 
  ## Training Samples
  ~40,000 Abstracts from NASA's Astrophysical data system (ADS) and ArXiv.
 
+ ![](exoplanet_keywords.png)
 
  A few generated samples are below:
 
 
  `that large, deciduous forest fires are composed of mostly dry, unprocessed material that is distributed in a nearly patchy fashion. The distributions of these fires are correlated with temperature, and also with vegetation...`
  - *Directly imaged exoplanets probe* `key aspects of planet formation and evolution theory, as well as atmospheric and interior physics. These insights have led to numerous direct imaging instruments for exoplanets, many using polarimetry. However, current instruments take`
 
+ Letting the scrape run for ~2 hours found articles from these publications in descending amount:
  ```
  5364 - The Astrophysical Journal
  3365 - Astronomy and Astrophysics
 
  129 - Planetary and Space Science
  114 - Space Science Reviews
  109 - Geophysical Research Letters
+ ```

exoplanet_keywords.png ADDED
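
The README snippet changed in this commit requests `max_length` of 1600 through a `config` dict. The stock GPT-2 architecture has a 1024-token context window, so a request of that size may not be honored as written. Below is a minimal sketch, not the author's method, that passes generation settings at call time instead and clamps the length. It assumes the fine-tuned checkpoint keeps the standard GPT-2 context size; `GPT2_CONTEXT`, `generation_kwargs`, and `build_machina` are hypothetical names introduced only for illustration.

```python
# Assumption: the fine-tuned model keeps GPT-2's standard 1024-token context.
GPT2_CONTEXT = 1024


def generation_kwargs(max_length=1600, context=GPT2_CONTEXT):
    """Return generation settings with max_length clamped to the context size."""
    return {"max_length": min(max_length, context), "do_sample": True}


def build_machina(model_id="pearsonkyle/gpt2-exomachina"):
    """Build a text -> generated text function (downloads the model on first use)."""
    # Imported here so the helper above can be used without transformers installed.
    from transformers import pipeline

    exo = pipeline("text-generation", model=model_id, tokenizer="gpt2")
    kwargs = generation_kwargs()
    # Generation settings are forwarded to the pipeline call itself.
    return lambda text: exo(text, **kwargs)[0]["generated_text"]


# Example usage (requires network access and the transformers package):
# machina = build_machina()
# print(machina("Transiting exoplanets are"))
```

Keeping the clamp in a separate helper makes the length limit easy to check without loading the model.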