Bajiyo/ml-en-transliteration

TF-Keras

Bajiyo committed on Jul 21

Commit b65512d • 1 Parent(s): 7588ebf

Update README.md

Files changed (1)

README.md +86 -0

@@ -1,3 +1,89 @@
 ---
 license: other
 license_name: other

+# Malayalam to English Transliteration Model
+This repository contains a model that transliterates Malayalam names into English using a bidirectional LSTM with an attention mechanism.
+## Dataset
+The model was trained on a small subset of [Santhosh's English-Malayalam Names dataset](https://huggingface.co/datasets/santhosh/english-malayalam-names).
+The training and testing code, together with the subset of the dataset used, is available on GitHub:
+- [GitHub Repository](https://github.com/Bajiyo2223/ml-en_trasnliteration/blob/main/ml_en_transliteration.ipynb)
+The train and test datasets are in that repository's `dataset` folder and can be used to run the notebook.
+## Model Files
+- `saved_model.pb`: The trained model saved in TensorFlow's SavedModel format.
+- `source_tokenizer.json`: Tokenizer for Malayalam text.
+- `target_tokenizer.json`: Tokenizer for English text.
+- `variables.data-00000-of-00001`: Model variables.
+- `variables.index`: Index for model variables.
+## Model Architecture
+The model architecture consists of the following components:
+- **Embedding Layer**: Converts the input characters to dense vectors of fixed size.
+- **Bidirectional LSTM Layer**: Captures the sequence dependencies in both forward and backward directions.
+- **Attention Layer**: Helps the model focus on relevant parts of the input sequence when generating the output sequence.
+- **Dense Layer**: Produces the final output with a softmax activation function to generate character probabilities.
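To make the role of the attention layer concrete, here is a minimal NumPy sketch of dot-product attention over encoder states. The README does not say which attention variant the model uses, so the dot-product scoring, the function name `dot_product_attention`, and the toy shapes are all assumptions for illustration, not the model's actual code.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Return (context, weights) for one decoder step.

    decoder_state:  shape (hidden,)
    encoder_states: shape (timesteps, hidden)
    """
    # Score each encoder timestep against the current decoder state.
    scores = encoder_states @ decoder_state            # (timesteps,)
    # Softmax (shifted by the max for numerical stability) turns scores into weights.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()            # sums to 1
    # The context vector is the attention-weighted average of the encoder states.
    context = weights @ encoder_states                 # (hidden,)
    return context, weights

# Toy example: 4 input characters, hidden size 3 (made-up numbers).
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 3))
decoder_state = encoder_states[2]  # hypothetical decoder state for illustration
context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```

The weights show how much each input position contributes to the current output character. In a full model, the context vector would typically be combined with the decoder state before the dense softmax layer.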
+## Preprocessing
+- **Tokenization**: Both source (Malayalam) and target (English) texts are tokenized at the character level.
+- **Padding**: Sequences are padded to ensure uniform input lengths.
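The two preprocessing steps above can be sketched in plain Python. This is a simplified stand-in for what a character-level tokenizer and post-padding do, not the repository's code: the helper names `build_char_index`, `encode`, and `pad_post` are hypothetical, IDs are assigned in order of first appearance (Keras' `Tokenizer` orders by frequency), and short Latin-script strings are used as toy input.

```python
def build_char_index(texts):
    """Map each distinct character to an integer ID, starting at 1 (0 is reserved for padding)."""
    index = {}
    for text in texts:
        for ch in text:
            if ch not in index:
                index[ch] = len(index) + 1
    return index

def encode(text, index):
    """Convert a string to a list of character IDs, skipping unseen characters."""
    return [index[ch] for ch in text if ch in index]

def pad_post(sequence, maxlen, value=0):
    """Truncate or right-pad a sequence with `value` so it has exactly `maxlen` items."""
    return (sequence + [value] * maxlen)[:maxlen]

names = ["anu", "manu"]  # toy examples, not taken from the dataset
index = build_char_index(names)
padded = [pad_post(encode(name, index), maxlen=6) for name in names]
print(index)
print(padded)
```

Reserving ID 0 for padding lets later stages tell real characters from filler, which is why the reverse lookup in the usage code maps unknown IDs to an empty string.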
+## Training
+- **Optimizer**: Adam
+- **Loss Function**: Sparse categorical cross-entropy
+- **Metrics**: Accuracy
+- **Callbacks**: EarlyStopping and ModelCheckpoint to save the best model during training.
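As a rough illustration of the loss and metric listed above, the following NumPy sketch computes sparse categorical cross-entropy and accuracy by hand for three made-up predictions. The function name and the numbers are assumptions for illustration. Whether padding positions are masked out during training is not stated in the README, so no masking is shown.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the true class).

    y_true: integer class IDs, shape (n,)
    y_prob: predicted probabilities, shape (n, num_classes), rows sum to 1
    """
    picked = y_prob[np.arange(len(y_true)), y_true]  # probability of each true class
    return float(-np.log(picked).mean())

# Three output positions, three possible characters (made-up values).
y_true = np.array([2, 0, 1])
y_prob = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.3, 0.4, 0.3],
])
loss = sparse_categorical_crossentropy(y_true, y_prob)
accuracy = float((y_prob.argmax(axis=-1) == y_true).mean())
print(loss, accuracy)
```

"Sparse" here means the targets are integer character IDs rather than one-hot vectors, which matches the integer sequences produced by character-level tokenization.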
+## Results
+The model achieved the following performance on the test set:
+- **Test Loss**: `insert_test_loss`
+- **Test Accuracy**: `insert_test_accuracy`
+## Usage
+To use the model for transliteration:
+```python
+import tensorflow as tf
+from tensorflow.keras.preprocessing.sequence import pad_sequences
+import numpy as np
+import json
+# Function to convert sequences back to strings
+def sequence_to_text(sequence, tokenizer):
+    reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
+    text = ''.join([reverse_word_map.get(i, '') for i in sequence])
+    return text
+# Load the model (the directory containing saved_model.pb and the variables files)
+model = tf.keras.models.load_model('path_to_your_model_directory')
+# Load tokenizers
+with open('source_tokenizer.json') as f:
+    source_tokenizer_data = json.load(f)
+source_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(source_tokenizer_data)
+with open('target_tokenizer.json') as f:
+    target_tokenizer_data = json.load(f)
+target_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(target_tokenizer_data)
+# Prepare the input text
+input_text = "your_input_text"
+input_sequence = source_tokenizer.texts_to_sequences([input_text])
+input_padded = pad_sequences(input_sequence, maxlen=100, padding='post')  # maxlen should match the input length used in training
+# Get the prediction
+prediction = model.predict(input_padded)
+predicted_sequence = np.argmax(prediction, axis=-1)[0]
+predicted_text = sequence_to_text(predicted_sequence, target_tokenizer)
+print("Transliterated Text:", predicted_text)
+```
 ---
 license: other
 license_name: other