Great project! Some questions on model versions and performance

#1
by aotrih - opened

I enjoyed reading the blog post, kudos to the team! https://research.roblox.com/tech-blog/2024/07/deploying-ml-for-voice-safety

I have two questions:

1-) It looks like this particular model corresponds to the 96M fine-tuned WavLM. Do you plan to publish the distilled version? Per the blog post, the distilled model adds MFCC pre-processing before the CNN stage so that the model processes a shorter sequence. Models without MFCC have to process a raw waveform of 16,000 Hz * 15 seconds = 240,000 samples in the CNN layers, which adds a lot of compute to the forward pass.
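To make the sequence-length point concrete, here is a rough sketch of the arithmetic. It assumes a 16 kHz sample rate and the standard wav2vec 2.0 / WavLM base convolutional feature extractor (seven layers with the kernel sizes and strides listed below). Whether this particular checkpoint uses exactly that layout is an assumption on my part, and the helper names are purely illustrative.

```python
# Assumption: 16 kHz audio and the standard wav2vec 2.0 / WavLM base
# feature-extractor layout, given as (kernel_size, stride) per conv layer.
SAMPLE_RATE_HZ = 16_000
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_samples(seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw waveform samples in a clip of the given duration."""
    return int(seconds * sample_rate_hz)


def num_frames(samples: int) -> int:
    """Output length of the (unpadded) conv stack for a given input length."""
    length = samples
    for kernel, stride in CONV_LAYERS:
        # Standard 1-D convolution output length with no padding.
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    samples = num_samples(15)
    # The first conv layer sees every raw sample; the transformer only
    # sees the much shorter frame sequence produced by the conv stack.
    print(f"raw samples: {samples}, frames after CNN: {num_frames(samples)}")
```

Under these assumptions the first convolution runs over all 240,000 samples, while the transformer sees only a few hundred frames, which is why shortening the CNN input (e.g. via MFCC) could matter for the total compute.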

2-) The reported latency numbers are normalized by input audio length, but forward-pass FLOPs do not scale linearly with input length, so it is unclear from the shared information how fast the system would be for longer inputs. For example, a 15-second input should take longer than 15 times the latency of a 1-second input. Are you able to share more insight on this?
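As a back-of-the-envelope illustration of the non-linearity, the sketch below estimates transformer-encoder FLOPs as a function of sequence length. It counts only the matrix multiplications (Q/K/V/output projections, the feed-forward block, and the attention score and weighting products) and ignores the CNN front end, softmax, and layer norms. The dimensions (768 hidden, 3072 feed-forward, 12 layers) and the rate of roughly 50 frames per second are assumed WavLM-base-style values, not numbers confirmed for this model.

```python
# Assumption: roughly 50 encoder frames per second of 16 kHz audio.
FRAMES_PER_SECOND = 50


def encoder_flops(frames: int, d_model: int = 768, d_ff: int = 3072,
                  layers: int = 12) -> int:
    """Approximate matmul FLOPs for a transformer encoder forward pass."""
    # Q, K, V, and output projections: 4 matmuls of 2 * T * d^2 each.
    projections = 8 * frames * d_model ** 2
    # Feed-forward block: 2 matmuls of 2 * T * d * d_ff each.
    feed_forward = 4 * frames * d_model * d_ff
    # Attention scores (QK^T) and weighted sum (AV): quadratic in T.
    attention = 4 * frames ** 2 * d_model
    return layers * (projections + feed_forward + attention)


def flops_for_seconds(seconds: float) -> int:
    """Estimated encoder FLOPs for a clip of the given duration."""
    return encoder_flops(int(seconds * FRAMES_PER_SECOND))


if __name__ == "__main__":
    one_second = flops_for_seconds(1)
    fifteen_seconds = flops_for_seconds(15)
    # A ratio above 1 means the 15 s clip costs more than 15 separate 1 s clips.
    print(f"ratio: {fifteen_seconds / (15 * one_second):.3f}")
```

The quadratic attention term makes the ratio exceed 1 in this estimate. Actual wall-clock latency also depends on batching, hardware utilization, and memory bandwidth, so this only bounds the FLOP side of the question.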

Screenshot 2024-07-08 at 11.55.48 PM.png
