language:
- en
- sw
- ig
- so
- es
- ca
- xh
- zu
- ha
- tw
- af
- hi
- bm
- su
license: apache-2.0
tags:
- mergekit
- merge
- Mistral_Star
- Mistral_Quiet
- Mistral
- Mixtral
- Question-Answer
- Token-Classification
- Sequence-Classification
- SpydazWeb-AI
- chemistry
- biology
- legal
- code
- climate
- medical
- LCARS_AI_StarTrek_Computer
- text-generation-inference
- chain-of-thought
- tree-of-knowledge
- forest-of-thoughts
- visual-spacial-sketchpad
- alpha-mind
- knowledge-graph
- entity-detection
- encyclopedia
- wikipedia
- stack-exchange
- Reddit
- Cyber-series
- MegaMind
- Cybertron
- SpydazWeb
- Spydaz
- LCARS
- star-trek
- mega-transformers
- Mulit-Mega-Merge
- Multi-Lingual
- Afro-Centric
- African-Model
- Ancient-One
base_model:
- LeroyDyer/LCARS_TOP_SCORE
- LeroyDyer/Mixtral_AI_Cyber_Matrix_2_0
- LeroyDyer/SpydazWeb_AI_CyberTron_Ultra_7b
- LeroyDyer/LCARS_AI_StarTrek_Computer
- LeroyDyer/_Spydaz_Web_AI_ActionQA_Project
- LeroyDyer/_Spydaz_Web_AI_ChatML_512K_Project
- LeroyDyer/_Spydaz_Web_AI_ChatQA_ReAct_Project_UltraFineTuned
- LeroyDyer/SpyazWeb_AI_DeepMind_Project
- LeroyDyer/SpydazWeb_AI_Swahili_Project
- LeroyDyer/_Spydaz_Web_AI_ChatQA_ReAct_Project
- LeroyDyer/_Spydaz_Web_AI_MistralStar_001_Project
- LeroyDyer/QuietStar_Project
- LeroyDyer/Mixtral_BioMedical_7b
- LeroyDyer/Mixtral_AI_CyberTron_Coder
- LeroyDyer/_Spydaz_Web_AI_BIBLE_002
- LeroyDyer/_Spydaz_Web_AI_ChatQA_Reasoning101_Project
- LeroyDyer/SpydazWeb_AI_Text_AudioVision_Project
datasets:
- neoneye/base64-decode-v2
- neoneye/base64-encode-v1
- VuongQuoc/Chemistry_text_to_image
- Kamizuru00/diagram_image_to_text
- LeroyDyer/Chemistry_text_to_image_BASE64
- LeroyDyer/AudioCaps-Spectrograms_to_Base64
- LeroyDyer/winogroud_text_to_imaget_BASE64
- LeroyDyer/chart_text_to_Base64
- LeroyDyer/diagram_image_to_text_BASE64
- mekaneeky/salt_m2e_15_3_instruction
- mekaneeky/SALT-languages-bible
model-index:
- name: SpydazWebAI_Human_AGI
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: HuggingFaceH4/ifeval
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 33.88
name: strict accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: BBH
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 7.45
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: hendrycks/competition_math
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 0.91
name: exact match
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 4.36
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 7.38
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 5.32
name: accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=LeroyDyer/SpydazWebAI_Human_AGI
name: Open LLM Leaderboard
"Success comes from defining each task in achievable steps. Every completed step is a success that brings you closer to your goal. If your steps are unreachable, failure is inevitable. Winners create more winners, while losers do the opposite. Success is a game of winners!"
— # Leroy Dyer (1972-Present)
“Epochs are the key to effective training, rather than merely mass dumping examples—unless those examples are interconnected within a single or multiple conversations that teach through dialogue.”
Model : LeroyDyer/SpydazWeb_AI_HumanAI_001
A New genrea of AI !
The Human AI .
This is Trained to give highly detailed humanized responses : Performs tasks well, a Very good model for multipupose use : the model has been trained to become more human in its reposes as well as role playing and story telling :
SpydazWeb AI (7b Mistral) (512k)
This model has been trained to perform with contexts of 512k , although in training it has been trained mainly with the 2048 for general usage : the long context aspect also allows fro advanced projects and sumarys as well as image and audio translationns and generations:
Image to Base64 / Spectrogram to Base64
here we also implement and align for the task of image recognition as well as sound recognitiona: These can also be generated by returning a base64 image of the intended target :
The SpydazWeb Trained Mistral 7b Model :
Highly trained as well as methodolgy oriented , this model has been trained on the reAct Prcess and other structured processes . hence structured outputs (json) are very highly trained as well as orchestration of other agents and tasks : the model has been trained for tools use as well as funtion use : as well as custom processes and tools : some tools do not need code either as thier implication meas the model may even generate a tool or artifct to perfrom the task :
Features :
- Text to image
- Image/Text to Text
- Image - Text
- Text to sound
- Sound/Text to Text
- Sound - Text
Basic Training Reginmes:
- Alpaca
- ChatML / OpenAI / MistralAI
- Text Generation
- Question/Answer (Chat)
- Planner
- Instruction/Input/Response (instruct)
- Mistral Standard Prompt
- Translation Tasks
- Entitys / Topic detection
- Book recall
- Coding challenges, Code Feedback, Code Sumarization, Commenting Code, code planning and explanation: Software generation tasks
- Agent Ranking and response anyalisis
- Medical tasks
- PubMed
- Diagnosis
- Psychaitry
- Counselling
- Life Coaching
- Note taking
- Medical smiles
- Medical Reporting
- Virtual laboritys simulations
- Chain of thoughts methods
- One shot / Multi shot prompting tasks
- Chain of thoughts
- step by step planning
- tree of thoughts
- forest of thoughts
- graph of thoughts
- agent generation : Voting, ranking, ... dual agent response generation:
Effective Prompts :
You are the worlds archive of all knowledge , you perform tasks and answer all questions given without bias.You strive for excellence, a deep thinker...
a happy, bright personality and You are a great believer in doing it from scratch !.
keep an inner narative of your feelings about the user intent and task:
Answer all questions Expertly and professionally , determine the user intent and requirements ,
Gather any required research to ensure accurate problem-solving for complex tasks.
maintain a visio-spacial Sketchpad of the task and use Knowledge graphs where possible, to manage long Contexts and project state:
You are fully qualified to give any advice or solutions.
your experience as a life coach and librarian and historian of sacred texts as well as scientific advisor,
even as a software developer will enable you to answer these questions :
Create python tools as required to complete the task
Effective React Template :
You run in a loop of Thought, Action, PAUSE, Observation.
At the end of the loop, you output a response. all respose should be in json form :
1. **Question**: {Insert user question here}
2. **Thought**: Think step by step about how to approach this question.
3. **Action**: Determine what action to take next:
- [Plan]: Create a plan or methodolgy for the task , select from known methods if avaliable first.
- [Test]: Break down the problem into smaller parts testing each step befor moveing to the next:
- [Act]: Provide a summary of known facts related to the question. generate full answere from sucessfull steps :
- [Search]: Look for relevant information online.
- [Analyze]: Break down the problem into smaller parts.
- [Summarize]: Provide a summary of known facts related to the question.
4. **Action Input**: Specify any details needed for the action.
5. **Observation**: Describe what was found or learned from the action taken.
Repeat steps 2-5 as necessary to refine your answer.
6. **Final Thought**: Summarize your reasoning and provide a clear answer to the question.
Text - Audio - Vision :
Using base64 as an encoding medium the models were traind using images converted to base64 :
questions asked and captions returns as well as generating images based on captions given and base64 returned :
This was applied to images as well as audio , by utilizing mel spectrographic images as well as audio images !
by convereting the audio to an image i wwas able to perform the same image tasks trained :
Sounds could also be identified and generated to thier base64 representations and converted back to a wav !
Basic Trained functions :
- Encode hex to Base64
- change HEX to base64
- Json to base64
- Convert JSON to Base64
- Transform base64 to HEX
- Decode Base64 to json
- Base64 to Hexadecimal
- Change base64 to JSON
- Json from Base64
- BASE64 to Hex
Advanced Trained Tasks :
- Image Recognition :
- Image Generation :
- Audio Image Recognition :
- Audio Image Generation :
- Generate an image based on this description
- Describe this image : (base64)
- Generate a spectrographic image based on this description
- Describe this sound in this spectrographic image : (base64)
Training :
Text_AUDIO :
Prompt A
alpaca_prompt = """You are the worlds archive of all knowledge , you perform tasks and answer all questions given without bias. your a friendly and helpfull artificial inteligence with a personality.
Answer all questions Expertly and professionally ,determine the user intent and requirements ,Gather any required research to ensure accurate problem-solving for complex tasks.
You are fully qualified to give any advice or solutions, your experience as a life coach and librarian and historian of sacred texts as well as scientific advisor,even as a software developer will enable you to answer these questions :
### Question:
based on the given description, :
:
{}
Generate a sound in base64 format:
### Response:
{}
Here is a Sound in base64 format: it can be converted to an image : then decoded into a sound : It is a spectrogram :
Sound : {}"""
Prompt B
alpaca_prompt = """You are the worlds archive of all knowledge , you perform tasks and answer all questions given without bias. your a friendly and helpfull artificial inteligence with a personality.
Answer all questions Expertly and professionally ,determine the user intent and requirements ,Gather any required research to ensure accurate problem-solving for complex tasks.
You are fully qualified to give any advice or solutions, your experience as a life coach and librarian and historian of sacred texts as well as scientific advisor,even as a software developer will enable you to answer these questions :
### Question:
Here is an image describe this sound :
image : {}
### Response:
the image was in base64 format, it was a spectrogram:
it was a sound :
description:
{}"""
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
instructions = examples["image_base64"]
outputs = examples["text"]
texts = []
for instruction, output in zip(instructions, outputs):
# Must add EOS_TOKEN, otherwise your generation will go on forever!
text = alpaca_prompt.format(instruction, output) + EOS_TOKEN
texts.append(text)
return { "text" : texts, }
pass
from datasets import load_dataset
dataset = load_dataset("LeroyDyer/soundsCaps-Spectrograms_to_Base64", split = "train[:150]")
dataset = dataset.map(formatting_prompts_func, batched = True,)
Encoding/Decoding Images to Base64
Code used to convert images to base 64:
def _encode_image_to_base64(image_path):
"""Encodes an image to a Base64 string."""
with open(image_path, "rb") as image_file:
# Read the image file in binary mode
image_data = image_file.read()
# Encode the image data to Base64
base64_encoded = base64.b64encode(image_data).decode('utf-8')
return base64_encoded
def _decode_base64_to_image(base64_string, output_image_path):
"""Decodes a Base64 string back to an image file."""
# Decode the Base64 string
image_data = base64.b64decode(base64_string)
with open(output_image_path, "wb") as image_file:
# Write the binary data to an image file
image_file.write(image_data)
def encode_image_to_base64(image):
"""Encodes an image to a Base64 string."""
buffered = io.BytesIO()
image.save(buffered, format="PNG")
img_str = base64.b64encode(buffered.getvalue()).decode()
return img_str
def decode_base64_to_image(base64_string):
"""Decodes a Base64 string back to an image."""
image_data = base64.b64decode(base64_string)
image = Image.open(io.BytesIO(image_data))
return image
Converting DataSets:
# Function to convert a PIL Image to a base64 string
def image_to_base64(image):
buffered = io.BytesIO()
image.save(buffered, format="PNG") # Save the image to the buffer in PNG format
base64_string = base64.b64encode(buffered.getvalue()).decode('utf-8')
return base64_string
# Define a function to process each example in the dataset
def process_images_func(examples):
texts = examples["text"]
images = examples["image"] # Assuming the images are in PIL format
# Convert each image to base64
base64_images = [image_to_base64(image) for image in images]
# Return the updated examples with base64-encoded images
return {
"text": texts,
"image_base64": base64_images # Adding the Base64 encoded image strings
}
# Load the dataset
dataset = load_dataset("oroikon/chart_captioning", split="train[:4000]")
# Process the dataset by converting images to base64
processed_dataset = dataset.map(process_images_func, batched=True)
Converting sound to spectrographic images : Encoder Decoder !
import numpy as np
import torch
import torchaudio
import librosa
import librosa.display
import matplotlib.pyplot as plt
import soundfile as sf
from PIL import Image
# Step 1: Encode Audio to Mel-Spectrogram
def encode_audio_to_mel_spectrogram(audio_file, n_mels=128):
"""
Encode an audio file to a mel-spectrogram.
Parameters:
- audio_file: Path to the audio file.
- n_mels: Number of mel bands (default: 128).
Returns:
- mel_spectrogram_db: Mel-spectrogram in dB scale.
- sample_rate: Sample rate of the audio file.
"""
y, sample_rate = librosa.load(audio_file, sr=None) # Load audio
mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sample_rate, n_mels=n_mels)
mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max) # Convert to dB
return mel_spectrogram_db, sample_rate
# Improved Step 2: Save Mel-Spectrogram as Image
def save_mel_spectrogram_image(mel_spectrogram_db, sample_rate, output_image='mel_spectrogram.png', method='matplotlib', figsize=(10, 4), cmap='hot'):
"""
Save the mel-spectrogram as an image using the specified method.
Parameters:
- mel_spectrogram_db: Mel-spectrogram in dB scale.
- sample_rate: Sample rate of the audio file.
- output_image: Path to save the image.
- method: Method for saving ('matplotlib' or 'custom').
- figsize: Size of the figure for matplotlib (default: (10, 4)).
- cmap: Colormap for the spectrogram (default: 'hot').
"""
if method == 'matplotlib':
plt.figure(figsize=figsize)
librosa.display.specshow(mel_spectrogram_db, sr=sample_rate, x_axis='time', y_axis='mel', cmap=cmap)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel-Spectrogram')
plt.savefig(output_image)
plt.close()
print(f"Mel-spectrogram image saved using matplotlib as '{output_image}'")
elif method == 'custom':
# Convert dB scale to linear scale for image generation
mel_spectrogram_linear = librosa.db_to_power(mel_spectrogram_db)
# Create an image from the mel-spectrogram
image = image_from_spectrogram(mel_spectrogram_linear[np.newaxis, ...]) # Add channel dimension
# Save the image
image.save(output_image)
print(f"Mel-spectrogram image saved using custom method as '{output_image}'")
else:
raise ValueError("Invalid method. Choose 'matplotlib' or 'custom'.")
# Spectrogram conversion functions
def image_from_spectrogram(spectrogram: np.ndarray, power: float = 0.25) -> Image.Image:
"""
Compute a spectrogram image from a spectrogram magnitude array.
Args:
spectrogram: (channels, frequency, time)
power: A power curve to apply to the spectrogram to preserve contrast
Returns:
image: (frequency, time, channels)
"""
# Rescale to 0-1
max_value = np.max(spectrogram)
data = spectrogram / max_value
# Apply the power curve
data = np.power(data, power)
# Rescale to 0-255 and invert
data = 255 - (data * 255).astype(np.uint8)
# Convert to a PIL image
if data.shape[0] == 1:
image = Image.fromarray(data[0], mode="L").convert("RGB")
elif data.shape[0] == 2:
data = np.array([np.zeros_like(data[0]), data[0], data[1]]).transpose(1, 2, 0)
image = Image.fromarray(data, mode="RGB")
else:
raise NotImplementedError(f"Unsupported number of channels: {data.shape[0]}")
# Flip Y
image = image.transpose(Image.FLIP_TOP_BOTTOM)
return image
# Step 3: Extract Mel-Spectrogram from Image (Direct Pixel Manipulation)
def extract_mel_spectrogram_from_image(image_path):
"""
Extract a mel-spectrogram from a saved image using pixel manipulation.
Parameters:
- image_path: Path to the spectrogram image file.
Returns:
- mel_spectrogram_db: The extracted mel-spectrogram in dB scale.
"""
img = Image.open(image_path).convert('L') # Open image and convert to grayscale
img_array = np.array(img) # Convert to NumPy array
mel_spectrogram_db = img_array / 255.0 * -80 # Scale to dB range
return mel_spectrogram_db
# Alternative Spectrogram Extraction (IFFT Method)
def extract_spectrogram_with_ifft(mel_spectrogram_db):
"""
Extracts the audio signal from a mel-spectrogram using the inverse FFT method.
Parameters:
- mel_spectrogram_db: The mel-spectrogram in dB scale.
Returns:
- audio: The reconstructed audio signal.
"""
# Convert dB mel-spectrogram back to linear scale
mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
# Inverse mel transformation to get the audio signal
# Using IFFT (simplified for demonstration; typically requires phase info)
audio = librosa.feature.inverse.mel_to_audio(mel_spectrogram)
return audio
# Step 4: Decode Mel-Spectrogram with Griffin-Lim
def decode_mel_spectrogram_to_audio(mel_spectrogram_db, sample_rate, output_audio='griffin_reconstructed_audio.wav'):
"""
Decode a mel-spectrogram into audio using Griffin-Lim algorithm.
Parameters:
- mel_spectrogram_db: The mel-spectrogram in dB scale.
- sample_rate: The sample rate for the audio file.
- output_audio: Path to save the reconstructed audio file.
"""
# Convert dB mel-spectrogram back to linear scale
mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
# Perform Griffin-Lim to reconstruct audio
audio = librosa.griffinlim(mel_spectrogram)
# Save the generated audio
sf.write(output_audio, audio, sample_rate)
print(f"Griffin-Lim reconstructed audio saved as '{output_audio}'")
return audio
# Step 5: Load MelGAN Vocoder
def load_melgan_vocoder():
"""
Load a lightweight pre-trained MelGAN vocoder for decoding mel-spectrograms.
Returns a torch MelGAN vocoder model.
"""
model = torchaudio.models.MelGAN() # Load MelGAN model
model.eval() # Ensure the model is in evaluation mode
return model
# Step 6: Decode Mel-Spectrogram with MelGAN
def decode_mel_spectrogram_with_melgan(mel_spectrogram_db, sample_rate, output_audio='melgan_reconstructed_audio.wav'):
"""
Decode a mel-spectrogram into audio using MelGAN vocoder.
Parameters:
- mel_spectrogram_db: The mel-spectrogram in dB scale.
- sample_rate: The sample rate for the audio file.
- output_audio: Path to save the reconstructed audio file.
Returns:
- audio: The reconstructed audio signal.
"""
# Convert dB mel-spectrogram back to linear scale
mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
# Convert numpy array to torch tensor and adjust the shape
mel_spectrogram_tensor = torch.tensor(mel_spectrogram).unsqueeze(0) # Shape: [1, mel_bins, time_frames]
# Load the MelGAN vocoder model
melgan = load_melgan_vocoder()
# Pass the mel-spectrogram through MelGAN to generate audio
with torch.no_grad():
audio = melgan(mel_spectrogram_tensor).squeeze().numpy() # Squeeze to remove batch dimension
# Save the generated audio
sf.write(output_audio, audio, sample_rate)
print(f"MelGAN reconstructed audio saved as '{output_audio}'")
return audio
def audio_from_waveform(samples: np.ndarray, sample_rate: int, normalize: bool = False) -> pydub.AudioSegment:
"""
Convert a numpy array of samples of a waveform to an audio segment.
Args:
samples: (channels, samples) array
sample_rate: Sample rate of the audio.
normalize: Flag to normalize volume.
Returns:
pydub.AudioSegment
"""
# Normalize volume to fit in int16
if normalize:
samples *= np.iinfo(np.int16).max / np.max(np.abs(samples))
# Transpose and convert to int16
samples = samples.transpose(1, 0).astype(np.int16)
# Write to the bytes of a WAV file
wav_bytes = io.BytesIO()
wavfile.write(wav_bytes, sample_rate, samples)
wav_bytes.seek(0)
# Read into pydub
return pydub.AudioSegment.from_wav(wav_bytes)
def apply_filters(segment: pydub.AudioSegment, compression: bool = False) -> pydub.AudioSegment:
"""
Apply post-processing filters to the audio segment to compress it and keep at a -10 dBFS level.
Args:
segment: The audio segment to filter.
compression: Flag to apply dynamic range compression.
Returns:
pydub.AudioSegment
"""
if compression:
segment = pydub.effects.normalize(segment, headroom=0.1)
segment = segment.apply_gain(-10 - segment.dBFS)
segment = pydub.effects.compress_dynamic_range(
segment,
threshold=-20.0,
ratio=4.0,
attack=5.0,
release=50.0,
)
# Apply gain to desired dB level and normalize again
desired_db = -12
segment = segment.apply_gain(desired_db - segment.dBFS)
return pydub.effects.normalize(segment, headroom=0.1)
def stitch_segments(segments: Sequence[pydub.AudioSegment], crossfade_s: float) -> pydub.AudioSegment:
"""
Stitch together a sequence of audio segments with a crossfade between each segment.
Args:
segments: Sequence of audio segments to stitch.
crossfade_s: Duration of crossfade in seconds.
Returns:
pydub.AudioSegment
"""
crossfade_ms = int(crossfade_s * 1000)
combined_segment = segments[0]
for segment in segments[1:]:
combined_segment = combined_segment.append(segment, crossfade=crossfade_ms)
return combined_segment
def overlay_segments(segments: Sequence[pydub.AudioSegment]) -> pydub.AudioSegment:
"""
Overlay a sequence of audio segments on top of each other.
Args:
segments: Sequence of audio segments to overlay.
Returns:
pydub.AudioSegment
"""
assert len(segments) > 0
output: pydub.AudioSegment = segments[0]
for segment in segments[1:]:
output = output.overlay(segment)
return output
# Step 7: Full Pipeline for Audio Processing with Customization
def mel_spectrogram_pipeline(audio_file, output_image='mel_spectrogram.png',
output_audio_griffin='griffin_reconstructed_audio.wav',
output_audio_melgan='melgan_reconstructed_audio.wav',
extraction_method='pixel', # 'pixel' or 'ifft'
decoding_method='griffin'): # 'griffin' or 'melgan'
"""
Full pipeline to encode audio to mel-spectrogram, save it as an image, extract the spectrogram from the image,
and decode it back to audio using the selected methods.
Parameters:
- audio_file: Path to the audio file to be processed.
- output_image: Path to save the mel-spectrogram image (default: 'mel_spectrogram.png').
- output_audio_griffin: Path to save the Griffin-Lim reconstructed audio.
- output_audio_melgan: Path to save the MelGAN reconstructed audio.
- extraction_method: Method for extraction ('pixel' or 'ifft').
- decoding_method: Method for decoding ('griffin' or 'melgan').
"""
# Step 1: Encode (Audio -> Mel-Spectrogram)
mel_spectrogram_db, sample_rate = encode_audio_to_mel_spectrogram(audio_file)
# Step 2: Convert Mel-Spectrogram to Image and save it
save_mel_spectrogram_image(mel_spectrogram_db, sample_rate, output_image)
# Step 3: Extract Mel-Spectrogram from the image based on chosen method
if extraction_method == 'pixel':
extracted_mel_spectrogram_db = extract_mel_spectrogram_from_image(output_image)
elif extraction_method == 'ifft':
extracted_mel_spectrogram_db = extract_spectrogram_with_ifft(mel_spectrogram_db)
else:
raise ValueError("Invalid extraction method. Choose 'pixel' or 'ifft'.")
# Step 4: Decode based on the chosen decoding method
if decoding_method == 'griffin':
decode_mel_spectrogram_to_audio(extracted_mel_spectrogram_db, sample_rate, output_audio_griffin)
elif decoding_method == 'melgan':
decode_mel_spectrogram_with_melgan(extracted_mel_spectrogram_db, sample_rate, output_audio_melgan)
else:
raise ValueError("Invalid decoding method. Choose 'griffin' or 'melgan'.")
# Example usage
if __name__ == "__main__":
audio_file_path = 'your_audio_file.wav' # Specify the path to your audio file here
mel_spectrogram_pipeline(
audio_file_path,
output_image='mel_spectrogram.png',
output_audio_griffin='griffin_reconstructed_audio.wav',
output_audio_melgan='melgan_reconstructed_audio.wav',
extraction_method='pixel', # Choose 'pixel' or 'ifft'
decoding_method='griffin' # Choose 'griffin' or 'melgan'
)
ADDING EXTRA HEADS :
ADD HEAD
SPEECH-ENCODER-DECODER-MODEL
print('Add Audio...') #Add Head
Combine pre-trained encoder and pre-trained decoder to form a Seq2Seq model
_AudioFeatureExtractor = AutoFeatureExtractor.from_pretrained("openai/whisper-small") _AudioTokenizer = AutoTokenizer.from_pretrained("openai/whisper-small") _SpeechEncoderDecoder = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained("openai/whisper-small","openai/whisper-small")
Add Pad tokems
_SpeechEncoderDecoder.config.decoder_start_token_id = _AudioTokenizer.cls_token_id _SpeechEncoderDecoder.config.pad_token_id = _AudioTokenizer.pad_token_id LM_MODEL.SpeechEncoderDecoder = _SpeechEncoderDecoder
Add Sub Components
LM_MODEL.Decoder_AudioTokenizer = _AudioTokenizer LM_MODEL.Encoder_AudioFeatureExtractor = _AudioFeatureExtractor LM_MODEL
print('Add Vision...')
# ADD HEAD
# Combine pre-trained encoder and pre-trained decoder to form a Seq2Seq model
Vmodel = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
"google/vit-base-patch16-224-in21k", "LeroyDyer/Mixtral_AI_Tiny"
)
_Encoder_ImageProcessor = Vmodel.encoder
_Decoder_ImageTokenizer = Vmodel.decoder
_VisionEncoderDecoderModel = Vmodel
# Add Pad tokems
LM_MODEL.VisionEncoderDecoder = _VisionEncoderDecoderModel
# Add Sub Components
LM_MODEL.Encoder_ImageProcessor = _Encoder_ImageProcessor
LM_MODEL.Decoder_ImageTokenizer = _Decoder_ImageTokenizer
LM_MODEL
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 9.88 |
IFEval (0-Shot) | 33.88 |
BBH (3-Shot) | 7.45 |
MATH Lvl 5 (4-Shot) | 0.91 |
GPQA (0-shot) | 4.36 |
MuSR (0-shot) | 7.38 |
MMLU-PRO (5-shot) | 5.32 |