2nd Place Solution - All about text embeddings :P
#22 by zayedupal
Hi All,
First of all, thanks a lot to @huggingface-co for hosting this competition.
It was a truly enjoyable experience to take part in, and I learned a lot.
I hope to share a more detailed post and code in the future when I have time, but here is a summary of the steps I took and the intuitions driving them:
Basic Data Analysis:
From the basic analysis, it was clear that the dataset was balanced and had no missing values. However, there were duplicates based on the movie name and synopsis (discussed later).
Feature selection:
- I used only the synopsis at first, and later concatenated the movie name and synopsis into a single text. Combining them increased the evaluation accuracy.
Removing Duplicates:
To remove duplicates, I used a sentence-transformer to calculate the cosine similarity between each movie's name-plus-synopsis text and its genre, and kept the most similar row; a sketch of the idea follows below.
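Here is a minimal sketch of that deduplication idea, assuming a pandas DataFrame with hypothetical column names (movie_name, synopsis, genre); a small sentence-transformers model stands in for whichever one the author actually used:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

df = pd.read_csv("train.csv")  # hypothetical file name

# Any sentence-transformers model works for the similarity check;
# all-MiniLM-L6-v2 is used here purely as a lightweight example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = (df["movie_name"] + " " + df["synopsis"]).tolist()
text_emb = model.encode(texts, convert_to_tensor=True)
genre_emb = model.encode(df["genre"].tolist(), convert_to_tensor=True)

# Cosine similarity between each row's text and its own genre label.
df["genre_sim"] = util.cos_sim(text_emb, genre_emb).diagonal().cpu().numpy()

# Among duplicate (movie_name, synopsis) rows, keep the most genre-similar one.
df = (df.sort_values("genre_sim", ascending=False)
        .drop_duplicates(subset=["movie_name", "synopsis"], keep="first"))
```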
My approach:
- I tried out different pre-trained models for generating text embeddings.
- Classified those embeddings using different models (I used scikit-learn for easy model building).
- Finally, I combined the predictions through soft voting (averaging the predicted probabilities from each classifier to select the class); see the sketch after this list.
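A rough illustration of the soft-voting step, assuming one embedding matrix per embedding model (the variable names X_train_list, X_test_list, and y_train are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One classifier per embedding set, all aligned to the same rows.
probas = []
for X_train, X_test in zip(X_train_list, X_test_list):
    clf = LogisticRegression(solver="saga")  # other params left at defaults
    clf.fit(X_train, y_train)
    probas.append(clf.predict_proba(X_test))

# Soft vote: average the probability matrices, then take the argmax.
avg_proba = np.mean(probas, axis=0)
# classes_ ordering is identical across fits since y_train is shared.
pred = clf.classes_[avg_proba.argmax(axis=1)]
```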
The model that won second place:
- Embeddings generated using: 1. https://huggingface.co/sentence-transformers/sentence-t5-xxl, 2. https://huggingface.co/google/flan-t5-xxl, 3. https://huggingface.co/google/flan-t5-xl.
- Model used: Logistic Regression with the saga solver. All other parameters remained at scikit-learn's defaults. A sketch of the embedding generation step follows below.
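A sketch of how such embeddings can be generated. sentence-t5 loads directly through sentence-transformers; flan-t5 is a seq2seq model, and the post does not say how its representations were pooled, so masked mean pooling over the encoder output is my assumption here (the smaller flan-t5-xl is shown, since these are very large checkpoints):

```python
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel

texts = ["The Matrix. A hacker discovers reality is a simulation."]

# sentence-t5 has a sentence-transformers head and encodes directly.
st5 = SentenceTransformer("sentence-transformers/sentence-t5-xxl")
st5_emb = st5.encode(texts)  # shape: (n_texts, 768)

# flan-t5: run only the encoder and mean-pool over non-padding tokens.
tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
enc = AutoModel.from_pretrained("google/flan-t5-xl").encoder

batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = enc(**batch).last_hidden_state       # (n, seq_len, dim)
mask = batch["attention_mask"].unsqueeze(-1)
flan_emb = (hidden * mask).sum(1) / mask.sum(1)   # masked mean pooling
```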
The model with the best private score:
Everything was the same as the previous model, except I also added https://huggingface.co/hkunlp/instructor-xl for embedding generation (see the sketch below).
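Instructor models take an instruction alongside each text. A minimal sketch with the InstructorEmbedding package, where the instruction wording is my assumption rather than the author's exact prompt:

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

# Each input is an [instruction, text] pair; the instruction below is
# a hypothetical example of a task-specific prompt.
pairs = [["Represent the movie synopsis for genre classification: ",
          "A hacker discovers reality is a simulation."]]
emb = model.encode(pairs)  # shape: (1, 768)
```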
Learnings and Observations
- Simple Logistic Regression worked better than Random Forest, MLP, Decision Tree, and SVM.
- https://huggingface.co/spaces/mteb/leaderboard has been a huge help in choosing different models for text embedding generation.
- https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu and https://paperswithcode.com/sota/natural-language-inference-on-rte benchmarks also helped a lot.
- I used a simple 70:30 split on the training data to evaluate the models. I should have used k-fold cross-validation for more reliable model selection; a sketch follows this list.
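A minimal sketch of the k-fold evaluation suggested above, with X and y standing for the embedding matrix and genre labels:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep the (balanced) genre distribution in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(solver="saga"), X, y,
                         cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```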
Hope this helps.
Best,
Zayed
How much time did it take to generate the embeddings?
Okay btw, nicely done! So much to learn from your solution.
@zayedupal, thank you. I am trying to reproduce your results but without success. Could you please share your code?
Sorry for the late reply.
I have uploaded my code here:
https://github.com/zayedupal/Hugging_Face_Movie_Genre_Prediction_Public/blob/main/README.md
@AkmalAzzam @janbelke, waiting for the prize :P