---
license: other
license_name: other
---

# Malayalam to English Transliteration Model

This repository contains a model for transliterating Malayalam names to English using an LSTM with an attention mechanism.

## Dataset

The dataset used for training this model is a subset of [Santhosh's English-Malayalam Names dataset](https://huggingface.co/datasets/santhosh/english-malayalam-names); only a small portion of the full dataset was used.

The code for training and testing the model, along with the subset of the dataset used, is available in the following GitHub repository:
- [GitHub Repository](https://github.com/Bajiyo2223/ml-en_trasnliteration/blob/main/ml_en_transliteration.ipynb)

The train and test splits used for this model are in the `dataset` folder of that repository.
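
If you want the full source data rather than the subset, it can be pulled from the Hugging Face Hub with the `datasets` library. A minimal sketch; the split and column names are not documented here and should be checked against the dataset card:

```python
from datasets import load_dataset

# Load the full English-Malayalam names dataset from the Hub.
# Inspect the printed structure to see the available splits and columns
# before building your own subset.
ds = load_dataset("santhosh/english-malayalam-names")
print(ds)
```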

## Model Files

- `saved_model.pb`: The trained model saved in TensorFlow's SavedModel format.
- `source_tokenizer.json`: Tokenizer for Malayalam text.
- `target_tokenizer.json`: Tokenizer for English text.
- `variables.data-00000-of-00001`: Model variables.
- `variables.index`: Index for model variables.

## Model Architecture

The model consists of the following components (a sketch in Keras follows the list):
- **Embedding Layer**: Converts the input characters to dense vectors of fixed size.
- **Bidirectional LSTM Layer**: Captures sequence dependencies in both the forward and backward directions.
- **Attention Layer**: Helps the model focus on the relevant parts of the input sequence when generating the output sequence.
- **Dense Layer**: Produces the final output with a softmax activation to generate character probabilities.
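
The exact layer sizes and wiring are not documented in this card; the following is a minimal sketch of one way such a model can be assembled in Keras, with every dimension (`SRC_VOCAB`, `TGT_VOCAB`, `MAX_LEN`, `EMBED_DIM`, `LSTM_UNITS`) chosen purely for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# All sizes below are illustrative assumptions, not the trained model's values.
SRC_VOCAB = 80     # distinct Malayalam characters + 1 for padding
TGT_VOCAB = 40     # distinct English characters + 1 for padding
MAX_LEN = 100      # padded sequence length
EMBED_DIM = 64
LSTM_UNITS = 128

inputs = layers.Input(shape=(MAX_LEN,), name="source_chars")

# Embedding layer: character ids -> dense vectors of fixed size
x = layers.Embedding(SRC_VOCAB, EMBED_DIM)(inputs)

# Bidirectional LSTM: captures context in both directions
encoded = layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=True))(x)

# Attention layer: lets each output position attend over the encoded sequence
attended = layers.Attention()([encoded, encoded])

# Dense layer with softmax: a character probability distribution per position
outputs = layers.Dense(TGT_VOCAB, activation="softmax")(attended)

model = tf.keras.Model(inputs, outputs)
model.summary()
```

A model of this shape maps a padded source sequence directly to one character distribution per output position, which is consistent with the single `model.predict` call in the Usage section below.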

## Preprocessing

- **Tokenization**: Both the source (Malayalam) and target (English) texts are tokenized at the character level.
- **Padding**: Sequences are padded so that all inputs have a uniform length (see the sketch after this list).
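
The preprocessing code itself lives in the notebook linked above; as a rough sketch of character-level tokenization and padding with Keras, where the toy name pairs and `maxlen=100` are assumptions for illustration:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy pairs for illustration only -- the real data comes from the `dataset` folder.
malayalam_names = ["അജയ്", "മറിയ"]
english_names = ["ajay", "mariya"]

# Character-level tokenizers for source (Malayalam) and target (English)
source_tokenizer = Tokenizer(char_level=True)
target_tokenizer = Tokenizer(char_level=True)
source_tokenizer.fit_on_texts(malayalam_names)
target_tokenizer.fit_on_texts(english_names)

# Convert each name to a sequence of character ids, then pad to a uniform length
X = pad_sequences(source_tokenizer.texts_to_sequences(malayalam_names),
                  maxlen=100, padding='post')
y = pad_sequences(target_tokenizer.texts_to_sequences(english_names),
                  maxlen=100, padding='post')
print(X.shape, y.shape)  # (2, 100) (2, 100)
```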

## Training

- **Optimizer**: Adam
- **Loss Function**: Sparse categorical cross-entropy
- **Metrics**: Accuracy
- **Callbacks**: `EarlyStopping` and `ModelCheckpoint` to save the best model during training (see the sketch after this list).
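
A minimal sketch of how these settings translate into Keras calls, continuing from the sketches above; the epoch count, batch size, validation split, and checkpoint path are assumptions, not values from the actual training run:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Stop once validation loss stops improving and keep the best weights
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    # Save the best model seen so far (a SavedModel directory under TF 2.x)
    ModelCheckpoint('best_model', monitor='val_loss', save_best_only=True),
]

# X and y come from the preprocessing sketch; real training data (not the toy
# pairs) is assumed so that the validation split is non-empty.
model.fit(X, y,
          validation_split=0.1,
          epochs=50,
          batch_size=64,
          callbacks=callbacks)
```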

## Results

The model achieved the following performance on the test set:
- **Test Loss**: `insert_test_loss`
- **Test Accuracy**: `insert_test_accuracy`

## Usage

To use the model for transliteration:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json

# Function to convert a sequence of character ids back to a string
def sequence_to_text(sequence, tokenizer):
    reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
    return ''.join(reverse_word_map.get(i, '') for i in sequence)

# Load the model (point this at the directory containing saved_model.pb)
model = tf.keras.models.load_model('path_to_your_model_directory')

# Load tokenizers (tokenizer_from_json expects the raw JSON string)
with open('source_tokenizer.json') as f:
    source_tokenizer = tokenizer_from_json(f.read())

with open('target_tokenizer.json') as f:
    target_tokenizer = tokenizer_from_json(f.read())

# Prepare the input text
input_text = "your_input_text"
input_sequence = source_tokenizer.texts_to_sequences([input_text])
input_padded = pad_sequences(input_sequence, maxlen=100, padding='post')  # adjust maxlen to match training

# Get the prediction
prediction = model.predict(input_padded)
predicted_sequence = np.argmax(prediction, axis=-1)[0]
predicted_text = sequence_to_text(predicted_sequence, target_tokenizer)

print("Transliterated Text:", predicted_text)
```