Commit 0e02b5f by FoodDesert (1 parent: 38b3693)
Upload 5 files
- README.md +1 -44
- app.py +70 -6
- e621FastTextModel010Replacement_small.bin +3 -0
- fluffyrock_3m.csv +0 -0
- requirements.txt +1 -1
README.md
CHANGED
@@ -7,49 +7,6 @@ sdk: gradio
 sdk_version: 4.19.1
 app_file: app.py
 pinned: false
-tags:
-- not-for-all-audience
 ---
 
-
-## Frequently Asked Questions (FAQs)
-
-Technically I am writing this before anyone but me has used the tool, so no one has asked questions yet. But if they did, here are the questions I think they might ask:
-
-### Why is this space tagged "not-for-all-audience"
-
-The "not-for-all-audience" tag informs users that this tool's text output is derived from e621.net data for tag prediction and completion. This measure underscores a commitment to responsible content sharing.
-
-### Does input order matter?
-
-No
-
-### Should I use underscores in the input tags?
-
-It doesn't matter. The application handles tags either way.
-
-### Why are some valid tags marked as "unseen", and why don't some artists ever get returned?
-
-Some data is excluded from consideration if it did not occur frequently enough in the sample from which the application makes its calculations.
-If an artist or tag is too infrequent, we might not think we have enough data to make predictions about it.
-
-### Are there any special tags?
-
-Yes. We normalized the favorite counts of each image to a range of 0-9, with 0 being the lowest favcount, and 9 being the highest.
-You can include any of these special tags: "score:0", "score:1", "score:2", "score:3", "score:4", "score:5", "score:6", "score:7", "score:8", "score:9"
-in your list to bias the output toward artists with higher or lower scoring images.
-
-### Are there any other special tricks?
-
-Yes. If you want to more strongly bias the artist output toward a specific tag, you can just list it multiple times.
-So for example, the query "red fox, red fox, red fox, score:7" will yield a list of artists who are more strongly associated with the tag "red fox"
-than the query "red fox, score:7".
-
-### What calculation is this thing actually performing?
-
-Each artist is represented by a "pseudo-document" composed of all the tags from their uploaded images, treating these tags similarly to words in a text document.
-Similarly, when you input a set of tags, the system creates a pseudo-document for your query out of all the tags.
-It then uses a technique called cosine similarity to compare your tags against each artist's collection, essentially finding which artist's tags are most "similar" to yours.
-This method helps identify artists whose work is closely aligned with the themes or elements you're interested in.
-For those curious about the underlying mechanics of comparing text-like data, we employ the TF-IDF (Term Frequency-Inverse Document Frequency) method, a standard approach in information retrieval.
-You can read more about TF-IDF on its [Wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
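The FAQ text that this commit removes from README.md describes the ranking as TF-IDF pseudo-documents compared by cosine similarity. The block below is a minimal, standard-library sketch of that idea, not the app's actual code: the app calls a prefit `vectorizer` and `cosine_similarity`, whose settings are not visible in this diff. The artist names, the tag lists, the smoothed idf formula, and the helper names (`idf_weights`, `tfidf_vector`, `cosine`, `rank_artists`) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical pseudo-documents: every tag from each artist's images.
artist_tags = {
    "artist_a": ["red fox", "forest", "red fox", "snow"],
    "artist_b": ["wolf", "forest", "moon"],
    "artist_c": ["red fox", "city", "night"],
}

def idf_weights(docs):
    """Smoothed inverse document frequency for every tag in the corpus (assumed formula)."""
    n_docs = len(docs)
    doc_freq = Counter(tag for tags in docs.values() for tag in set(tags))
    return {tag: math.log((1 + n_docs) / (1 + df)) + 1 for tag, df in doc_freq.items()}

def tfidf_vector(tags, idf):
    """Term frequency times idf; tags outside the corpus ("unseen") are dropped."""
    counts = Counter(tag for tag in tags if tag in idf)
    return {tag: count * idf[tag] for tag, count in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tag, 0.0) for tag, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_artists(query, docs):
    """Rank artists by similarity to a comma-separated tag query, best first."""
    idf = idf_weights(docs)
    tags = [t.replace("_", " ").strip() for t in query.split(",")]
    q = tfidf_vector(tags, idf)
    scores = {name: cosine(q, tfidf_vector(doc, idf)) for name, doc in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_artists("red_fox, forest", artist_tags)
repeated = dict(rank_artists("red fox, red fox, red fox, forest", artist_tags))
```

Under these assumptions, input order does not change the scores because the query is reduced to tag counts, and repeating a tag raises its term frequency, which is why listing "red fox" several times pulls the ranking toward artists associated with that tag.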
app.py
CHANGED
@@ -4,6 +4,11 @@ import numpy as np
 from joblib import load
 import h5py
 from io import BytesIO
+import csv
+import re
+import random
+import compress_fasttext
+from collections import OrderedDict
 
 
 faq_content="""
@@ -59,13 +64,71 @@ with h5py.File('complete_artist_data.hdf5', 'r') as f:
 
     # Load artist names and decode to strings
     artist_names = [name.decode() for name in f['artist_names'][:]]
+
+def clean_tag(tag):
+    return ''.join(char for char in tag if ord(char) < 128)
+
+#Normally returns tag to aliases, but when reverse=True, returns alias to tags
+def build_aliases_dict(filename, reverse=False):
+    aliases_dict = {}
+    with open(filename, 'r', newline='', encoding='utf-8') as csvfile:
+        reader = csv.reader(csvfile)
+        for row in reader:
+            tag = clean_tag(row[0])
+            alias_list = [] if row[3] == "null" else [clean_tag(alias) for alias in row[3].split(',')]
+            if reverse:
+                for alias in alias_list:
+                    aliases_dict.setdefault(alias, []).append(tag)
+            else:
+                aliases_dict[tag] = alias_list
+    return aliases_dict
+
+
+def find_similar_tags(test_tags):
+
+    #Initialize stuff
+    if not hasattr(find_similar_tags, "fasttext_small_model"):
+        find_similar_tags.fasttext_small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('e621FastTextModel010Replacement_small.bin')
+    tag_aliases_file = 'fluffyrock_3m.csv'
+    if not hasattr(find_similar_tags, "tag2aliases"):
+        find_similar_tags.tag2aliases = build_aliases_dict(tag_aliases_file)
+    if not hasattr(find_similar_tags, "alias2tags"):
+        find_similar_tags.alias2tags = build_aliases_dict(tag_aliases_file, reverse=True)
+
+
+    # Find similar tags and prepare data for dataframe.
+    results_data = []
+    for tag in test_tags:
+        similar_words = find_similar_tags.fasttext_small_model.most_similar(tag)
+        result, seen = [], set()
+        if tag in find_similar_tags.tag2aliases:
+            result.append((tag, 1))
+            seen.add(tag)
+        else:
+            for item in similar_words:
+                similar_word, similarity = item
+                if similar_word not in seen:
+                    if similar_word in find_similar_tags.tag2aliases:
+                        result.append((similar_word.replace('_', ' '), round(similarity, 3)))
+                        seen.add(similar_word)
+                    else:
+                        for similar_tag in find_similar_tags.alias2tags.get(similar_word, []):
+                            if similar_tag not in seen:
+                                result.append((similar_tag.replace('_', ' '), round(similarity, 3)))
+                                seen.add(similar_tag)
+        # Append tag and formatted similar tags to results_data
+        for word, sim in result:
+            #if word not in seen:
+            results_data.append([tag, word, sim])
+            #seen.add(word)
+
+    return results_data # Return list of lists for Dataframe
 
 def find_similar_artists(new_tags_string, top_n):
-    #
     new_image_tags = [tag.replace('_', ' ').strip() for tag in new_tags_string.split(",")]
-    unseen_tags = set(new_image_tags) - set(vectorizer.vocabulary_.keys())
-
-
+    unseen_tags = list(set(OrderedDict.fromkeys(new_image_tags)) - set(vectorizer.vocabulary_.keys()))
+    unseen_tags_data = find_similar_tags(unseen_tags) if unseen_tags else [["No unseen tags", "", ""]]
+
     X_new_image = vectorizer.transform([','.join(new_image_tags)])
     similarities = cosine_similarity(X_new_image, X_artist)[0]
 
@@ -75,7 +138,8 @@ def find_similar_artists(new_tags_string, top_n):
     top_artists_str = "\n".join([f"{rank+1}. {artist[3:]} ({score:.4f})" for rank, (artist, score) in enumerate(top_artists)])
     dynamic_prompts_formatted_artists = "{" + "|".join([artist for artist, _ in top_artists]) + "}"
 
-    return
+    return unseen_tags_data, top_artists_str, dynamic_prompts_formatted_artists
+
 
 iface = gr.Interface(
     fn=find_similar_artists,
@@ -84,7 +148,7 @@ iface = gr.Interface(
         gr.Slider(minimum=1, maximum=100, value=10, step=1, label="Number of artists")
     ],
     outputs=[
-        gr.
+        gr.Dataframe(label="Unseen Tags", headers=["Tag", "Similar Tags"]),
         gr.Textbox(label="Top Artists", info="These are the artists most strongly associated with your tags. The number in parenthes is a similarity score between 0 and 1, with higher numbers indicating greater similarity."),
         gr.Textbox(label="Dynamic Prompts Format", info="For if you're using the Automatic1111 webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui) with the Dynamic Prompts extension activated (https://github.com/adieyal/sd-dynamic-prompts) and want to try them all individually.")
     ],
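The alias handling that this commit adds to app.py (`build_aliases_dict` plus the lookups inside `find_similar_tags`) can be tried in isolation, without the fastText model. The sketch below is a simplified illustration rather than the committed code: it builds both mappings in one pass from an in-memory CSV, and the sample rows, the contents of columns 1 and 2, and the names `build_alias_maps` and `resolve` are hypothetical. Only the use of column 0 as the tag and column 3 as a comma-separated alias list (or the literal "null") comes from the diff.

```python
import csv
from io import StringIO

# Hypothetical rows: column 0 is the tag and column 3 holds comma-separated
# aliases or the literal "null", as in the diff; columns 1 and 2 are placeholders.
SAMPLE_CSV = '''red_fox,0,100,"fox_(red),vulpes_vulpes"
forest,0,50,null
wolf,0,80,canis_lupus
'''

def build_alias_maps(csv_text):
    """Return (tag -> aliases, alias -> tags) built in a single pass."""
    tag2aliases, alias2tags = {}, {}
    for row in csv.reader(StringIO(csv_text)):
        tag = row[0]
        aliases = [] if row[3] == "null" else row[3].split(",")
        tag2aliases[tag] = aliases
        for alias in aliases:
            alias2tags.setdefault(alias, []).append(tag)
    return tag2aliases, alias2tags

def resolve(word, tag2aliases, alias2tags):
    """Map a word to canonical tags: itself if it is a tag, else via its aliases."""
    if word in tag2aliases:
        return [word]
    return alias2tags.get(word, [])

tag2aliases, alias2tags = build_alias_maps(SAMPLE_CSV)
```

With this sample data, `resolve("vulpes_vulpes", tag2aliases, alias2tags)` maps the alias back to `red_fox`, while a word that is neither a tag nor an alias yields an empty list, which mirrors how a fastText neighbor that matches nothing contributes no rows to the results.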
e621FastTextModel010Replacement_small.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a9ade94b75665a92776b73d4bb8871deca566b1b24a0866c0b1d2c56fa7ce68e
+size 15782079
fluffyrock_3m.csv
ADDED
The diff for this file is too large to render.
requirements.txt
CHANGED
@@ -3,4 +3,4 @@ numpy==1.25.1
 scikit-learn==1.2.2
 h5py==3.8.0
 joblib==1.2.0
-
+compress-fasttext