2nd Place Solution - All about text embeddings :P
#22 by zayedupal
Hi All,
First of all, thanks a lot to @huggingface-co for hosting this competition.
It was a truly enjoyable experience to take part in, and I learned a lot.
I hope to share a more detailed post and code in the future when I have time, but here is a summary of the steps I took and the intuitions driving them:
Basic Data Analysis:
From the basic analysis, it was clear that the dataset was balanced and had no missing values. However, there were duplicates based on the movie name and synopsis (discussed later).
Feature selection:
- I used only the synopsis at first, and later concatenated the movie name and synopsis into a single text. Combining them increased the evaluation accuracy.
Removing Duplicates:
To remove duplicates, I used a sentence-transformer to calculate the cosine similarity between each movie's name-plus-synopsis text and its genre, and kept the most similar row; a sketch of the idea follows below.
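Here is a minimal sketch of that deduplication idea, assuming a pandas DataFrame with hypothetical column names (movie_name, synopsis, genre); a small sentence-transformers model stands in for whichever one the author actually used:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

df = pd.read_csv("train.csv")  # hypothetical file name

# Any sentence-transformers model works for the similarity check;
# all-MiniLM-L6-v2 is used here purely as a lightweight example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = (df["movie_name"] + " " + df["synopsis"]).tolist()
text_emb = model.encode(texts, convert_to_tensor=True)
genre_emb = model.encode(df["genre"].tolist(), convert_to_tensor=True)

# Cosine similarity between each row's text and its own genre label.
df["genre_sim"] = util.cos_sim(text_emb, genre_emb).diagonal().cpu().numpy()

# Among duplicate (movie_name, synopsis) rows, keep the most genre-similar one.
df = (df.sort_values("genre_sim", ascending=False)
        .drop_duplicates(subset=["movie_name", "synopsis"], keep="first"))
```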
My approach:
- I tried out different pre-trained models for generating text embeddings.
- Classified those embeddings using different models (I used scikit-learn for easy model building).
- Finally, I combined the predictions through soft voting (averaging the predicted probabilities from each classifier to select the class); see the sketch after this list.
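A rough illustration of the soft-voting step, assuming one embedding matrix per embedding model (the variable names X_train_list, X_test_list, and y_train are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One classifier per embedding set, all aligned to the same rows.
probas = []
for X_train, X_test in zip(X_train_list, X_test_list):
    clf = LogisticRegression(solver="saga")  # other params left at defaults
    clf.fit(X_train, y_train)
    probas.append(clf.predict_proba(X_test))

# Soft vote: average the probability matrices, then take the argmax.
avg_proba = np.mean(probas, axis=0)
# classes_ ordering is identical across fits since y_train is shared.
pred = clf.classes_[avg_proba.argmax(axis=1)]
```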
The model that won second place:
- Embeddings generated using: 1. https://huggingface.co/sentence-transformers/sentence-t5-xxl, 2. https://huggingface.co/google/flan-t5-xxl, 3. https://huggingface.co/google/flan-t5-xl.
- Model used: Logistic Regression with the saga solver. All other parameters remained at scikit-learn's defaults. A sketch of the embedding generation step follows below.
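A sketch of how such embeddings can be generated. sentence-t5 loads directly through sentence-transformers; flan-t5 is a seq2seq model, and the post does not say how its representations were pooled, so masked mean pooling over the encoder output is my assumption here (the smaller flan-t5-xl is shown, since these are very large checkpoints):

```python
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel

texts = ["The Matrix. A hacker discovers reality is a simulation."]

# sentence-t5 has a sentence-transformers head and encodes directly.
st5 = SentenceTransformer("sentence-transformers/sentence-t5-xxl")
st5_emb = st5.encode(texts)  # shape: (n_texts, 768)

# flan-t5: run only the encoder and mean-pool over non-padding tokens.
tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
enc = AutoModel.from_pretrained("google/flan-t5-xl").encoder

batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = enc(**batch).last_hidden_state       # (n, seq_len, dim)
mask = batch["attention_mask"].unsqueeze(-1)
flan_emb = (hidden * mask).sum(1) / mask.sum(1)   # masked mean pooling
```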
The model with the best private score:
Everything was the same as the previous model, except I also added https://huggingface.co/hkunlp/instructor-xl for embedding generation (see the sketch below).
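Instructor models take an instruction alongside each text. A minimal sketch with the InstructorEmbedding package, where the instruction wording is my assumption rather than the author's exact prompt:

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

# Each input is an [instruction, text] pair; the instruction below is
# a hypothetical example of a task-specific prompt.
pairs = [["Represent the movie synopsis for genre classification: ",
          "A hacker discovers reality is a simulation."]]
emb = model.encode(pairs)  # shape: (1, 768)
```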
Learnings and Observations
- Simple Logistic Regression worked better than Random Forest, MLP, Decision Tree, and SVM.
- https://huggingface.co/spaces/mteb/leaderboard has been a huge help in choosing different models for text embedding generation.
- https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu and https://paperswithcode.com/sota/natural-language-inference-on-rte benchmarks also helped a lot.
- I used a simple 70:30 split on the training data to evaluate the models. I should have used k-fold cross-validation for more reliable model selection; a sketch follows this list.
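A minimal sketch of the k-fold evaluation suggested above, with X and y standing for the embedding matrix and genre labels:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep the (balanced) genre distribution in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(solver="saga"), X, y,
                         cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```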
Hope this helps.
Best,
Zayed
How much time did it take to generate the embeddings?
Okay btw, nicely done! So much to learn from your solution.
@zayedupal, thank you. I am trying to reproduce your results but without success. Could you please share your code?
Sorry for the late reply.
I have uploaded my code here:
https://github.com/zayedupal/Hugging_Face_Movie_Genre_Prediction_Public/blob/main/README.md
@AkmalAzzam @janbelke, waiting for the prize :P