Hecheng0625 committed
Commit: cd03576
Parent(s): 4c9108a
Upload 12 files

Browse files

- Amphion/models/ns3_codec/README.md +160 -0
- Amphion/models/ns3_codec/__init__.py +1 -0
- Amphion/models/ns3_codec/alias_free_torch/__init__.py +6 -0
- Amphion/models/ns3_codec/alias_free_torch/act.py +30 -0
- Amphion/models/ns3_codec/alias_free_torch/filter.py +99 -0
- Amphion/models/ns3_codec/alias_free_torch/resample.py +58 -0
- Amphion/models/ns3_codec/facodec.py +593 -0
- Amphion/models/ns3_codec/gradient_reversal.py +30 -0
- Amphion/models/ns3_codec/quantize/__init__.py +2 -0
- Amphion/models/ns3_codec/quantize/fvq.py +111 -0
- Amphion/models/ns3_codec/quantize/rvq.py +82 -0
- Amphion/models/ns3_codec/transformer.py +217 -0
Amphion/models/ns3_codec/README.md
ADDED
@@ -0,0 +1,160 @@
## FACodec: Speech Codec with Attribute Factorization used for NaturalSpeech 3

[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/pdf/2403.03100.pdf)
[![demo](https://img.shields.io/badge/FACodec-Demo-red)](https://speechresearch.github.io/naturalspeech3/)
[![model](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Models-pink)](https://huggingface.co/amphion/naturalspeech3_facodec)
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Spaces-yellow)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec)

## Overview

FACodec is a core component of the advanced text-to-speech (TTS) model NaturalSpeech 3. FACodec converts complex speech waveforms into disentangled subspaces representing the speech attributes of content, prosody, timbre, and acoustic details, and reconstructs high-quality speech waveforms from these attributes. By decomposing speech into subspaces representing different attributes, FACodec simplifies the modeling of speech representations.

Researchers can use FACodec to develop different kinds of TTS models, such as non-autoregressive discrete diffusion models (NaturalSpeech 3) or autoregressive models (like VALL-E).

<br>
<div align="center">
<img src="../../imgs/ns3/ns3_overview.png" width="65%">
</div>
<br>

<br>
<div align="center">
<img src="../../imgs/ns3/ns3_facodec.png" width="100%">
</div>
<br>

## Usage

Download the pre-trained FACodec model from HuggingFace: [Pretrained FACodec checkpoint](https://huggingface.co/amphion/naturalspeech3_facodec)

Install Amphion
```bash
git clone https://github.com/open-mmlab/Amphion.git
```

A few lines of code to use the pre-trained FACodec model
```python
import torch
import librosa
import soundfile as sf

from Amphion.models.ns3_codec import FACodecEncoder, FACodecDecoder

fa_encoder = FACodecEncoder(
    ngf=32,
    up_ratios=[2, 4, 5, 5],
    out_channels=256,
)

fa_decoder = FACodecDecoder(
    in_channels=256,
    upsample_initial_channel=1024,
    ngf=32,
    up_ratios=[5, 5, 4, 2],
    vq_num_q_c=2,
    vq_num_q_p=1,
    vq_num_q_r=3,
    vq_dim=256,
    codebook_dim=8,
    codebook_size_prosody=10,
    codebook_size_content=10,
    codebook_size_residual=10,
    use_gr_x_timbre=True,
    use_gr_residual_f0=True,
    use_gr_residual_phone=True,
)

fa_encoder.load_state_dict(torch.load("ns3_facodec_encoder.bin"))
fa_decoder.load_state_dict(torch.load("ns3_facodec_decoder.bin"))

fa_encoder.eval()
fa_decoder.eval()
```

Test
```python
test_wav_path = "test.wav"
test_wav = librosa.load(test_wav_path, sr=16000)[0]
test_wav = torch.from_numpy(test_wav).float()
test_wav = test_wav.unsqueeze(0).unsqueeze(0)

with torch.no_grad():

    # encode
    enc_out = fa_encoder(test_wav)
    print(enc_out.shape)

    # quantize
    vq_post_emb, vq_id, _, quantized, spk_embs = fa_decoder(enc_out, eval_vq=False, vq=True)

    # latent after quantization
    print(vq_post_emb.shape)

    # codes
    print("vq id shape:", vq_id.shape)

    # get prosody code
    prosody_code = vq_id[:1]
    print("prosody code shape:", prosody_code.shape)

    # get content code
    content_code = vq_id[1:3]
    print("content code shape:", content_code.shape)

    # get residual code (acoustic detail codes)
    residual_code = vq_id[3:]
    print("residual code shape:", residual_code.shape)

    # speaker embedding
    print("speaker embedding shape:", spk_embs.shape)

    # decode (recommended)
    recon_wav = fa_decoder.inference(vq_post_emb, spk_embs)
    print(recon_wav.shape)
    sf.write("recon.wav", recon_wav[0][0].cpu().numpy(), 16000)
```

## Some Q&A

Q1: What audio sample rate does FACodec support? What is the hop size? How many codes are generated for each frame?

A1: FACodec supports 16 kHz speech audio. The hop size is 200 samples, so the frame rate is 80 frames per second (16000 / 200), and each frame yields 6 codes, one per codebook, for a total of 480 codes per second.
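
A quick sketch of that arithmetic (plain Python, just restating the numbers above):

```python
sr, hop = 16000, 200
num_codebooks = 6                  # 1 prosody + 2 content + 3 acoustic detail
frames_per_second = sr // hop      # 80
codes_per_second = frames_per_second * num_codebooks  # 480
# For a 1-second clip, vq_id from the usage example has shape (6, 1, 80):
# num_codebooks x batch x frames.
```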

Q2: Is it possible to train an autoregressive TTS model like VALL-E using FACodec?

A2: Yes. In fact, the authors of NaturalSpeech 3 have already explored autoregressive generative models for discrete token generation with FACodec. They use an autoregressive language model to generate prosody codes, followed by a non-autoregressive model to generate the remaining content and acoustic detail codes.
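
For illustration, a minimal sketch (not the authors' training code) of splitting the codes from the usage example into those two streams:

```python
# vq_id: (6, B, T) from fa_decoder(enc_out, eval_vq=False, vq=True)
prosody_tokens = vq_id[0]      # (B, T): targets for the AR language model
remaining_tokens = vq_id[1:]   # (5, B, T): content + acoustic detail codes,
                               # generated non-autoregressively afterwards
```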

Q3: Is it possible to train a latent diffusion TTS model like NaturalSpeech 2 using FACodec?

A3: Yes. You can use the latent obtained after quantization as the modeling target for the latent diffusion model.
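
For example, a minimal sketch of extracting that target, reusing `fa_encoder` and `fa_decoder` from the usage section (the diffusion model itself is out of scope here):

```python
with torch.no_grad():
    enc_out = fa_encoder(test_wav)
    vq_post_emb, _, _, _, spk_embs = fa_decoder(enc_out, eval_vq=False, vq=True)

# vq_post_emb (B, 256, T) is the quantized latent: train the latent diffusion
# model to generate it, then render audio with
# fa_decoder.inference(vq_post_emb, spk_embs).
```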

Q4: Can FACodec compress and reconstruct audio from other domains, such as sound effects or music?

A4: Since FACodec is designed for speech, it may not be suitable for other audio domains. It is still possible to compress and reconstruct such audio with FACodec, but the reconstruction quality may be noticeably worse than for speech.

Q5: Can the content codes of FACodec be used as content features for other tasks, like voice conversion?

A5: We believe the answer is yes. Researchers can use the content codes of FACodec as the content feature for voice conversion, and we hope to see more research in this direction.
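
A minimal voice-conversion sketch along these lines, assuming `src_wav` and `ref_wav` are loaded and shaped like `test_wav` above (an illustration, not a tested recipe):

```python
with torch.no_grad():
    # codes of the source utterance (prosody + content; drop acoustic details)
    enc_src = fa_encoder(src_wav)
    _, vq_id, _, _, _ = fa_decoder(enc_src, eval_vq=False, vq=True)

    # speaker embedding (timbre) of the reference utterance
    enc_ref = fa_encoder(ref_wav)
    _, _, _, _, ref_spk_emb = fa_decoder(enc_ref, eval_vq=False, vq=True)

    # re-embed the source codes and decode with the reference timbre
    emb = fa_decoder.vq2emb(vq_id, use_residual_code=False)
    converted = fa_decoder.inference(emb, ref_spk_emb)
```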

## Citations

If you use our FACodec model, please cite the following papers:

```bibtex
@misc{ju2024naturalspeech,
      title={NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models},
      author={Zeqian Ju and Yuancheng Wang and Kai Shen and Xu Tan and Detai Xin and Dongchao Yang and Yanqing Liu and Yichong Leng and Kaitao Song and Siliang Tang and Zhizheng Wu and Tao Qin and Xiang-Yang Li and Wei Ye and Shikun Zhang and Jiang Bian and Lei He and Jinyu Li and Sheng Zhao},
      year={2024},
      eprint={2403.03100},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

@article{zhang2023amphion,
      title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
      author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Haorui He and Chaoren Wang and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
      journal={arXiv},
      year={2024},
      volume={abs/2312.09911}
}
```
Amphion/models/ns3_codec/__init__.py
ADDED
@@ -0,0 +1 @@
from .facodec import *
Amphion/models/ns3_codec/alias_free_torch/__init__.py
ADDED
@@ -0,0 +1,6 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

from .filter import *
from .resample import *
from .act import *
Amphion/models/ns3_codec/alias_free_torch/act.py
ADDED
@@ -0,0 +1,30 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import torch.nn as nn
from .resample import UpSample1d, DownSample1d


class Activation1d(nn.Module):
    def __init__(
        self,
        activation,
        up_ratio: int = 2,
        down_ratio: int = 2,
        up_kernel_size: int = 12,
        down_kernel_size: int = 12,
    ):
        super().__init__()
        self.up_ratio = up_ratio
        self.down_ratio = down_ratio
        self.act = activation
        self.upsample = UpSample1d(up_ratio, up_kernel_size)
        self.downsample = DownSample1d(down_ratio, down_kernel_size)

    # x: [B, C, T]
    def forward(self, x):
        x = self.upsample(x)
        x = self.act(x)
        x = self.downsample(x)

        return x
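
# Usage sketch (illustrative, assumes `torch` is available): applying the
# activation at a higher sample rate and low-pass downsampling afterwards
# suppresses the aliasing a pointwise nonlinearity would otherwise introduce.
#
#   act = Activation1d(activation=nn.SiLU())
#   y = act(torch.randn(1, 4, 100))  # length preserved: (1, 4, 100)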
Amphion/models/ns3_codec/alias_free_torch/filter.py
ADDED
@@ -0,0 +1,99 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

if "sinc" in dir(torch):
    sinc = torch.sinc
else:
    # This code is adopted from adefossez's julius.core.sinc under the MIT License
    # https://adefossez.github.io/julius/julius/core.html
    # LICENSE is in incl_licenses directory.
    def sinc(x: torch.Tensor):
        """
        Implementation of sinc, i.e. sin(pi * x) / (pi * x)
        __Warning__: Different to julius.sinc, the input is multiplied by `pi`!
        """
        return torch.where(
            x == 0,
            torch.tensor(1.0, device=x.device, dtype=x.dtype),
            torch.sin(math.pi * x) / math.pi / x,
        )


# This code is adopted from adefossez's julius.lowpass.LowPassFilters under the MIT License
# https://adefossez.github.io/julius/julius/lowpass.html
# LICENSE is in incl_licenses directory.
def kaiser_sinc_filter1d(
    cutoff, half_width, kernel_size
):  # return filter [1,1,kernel_size]
    even = kernel_size % 2 == 0
    half_size = kernel_size // 2

    # For kaiser window
    delta_f = 4 * half_width
    A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
    if A > 50.0:
        beta = 0.1102 * (A - 8.7)
    elif A >= 21.0:
        beta = 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21.0)
    else:
        beta = 0.0
    window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)

    # ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
    if even:
        time = torch.arange(-half_size, half_size) + 0.5
    else:
        time = torch.arange(kernel_size) - half_size
    if cutoff == 0:
        filter_ = torch.zeros_like(time)
    else:
        filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
        # Normalize filter to have sum = 1, otherwise we will have a small leakage
        # of the constant component in the input signal.
        filter_ /= filter_.sum()
    filter = filter_.view(1, 1, kernel_size)

    return filter


class LowPassFilter1d(nn.Module):
    def __init__(
        self,
        cutoff=0.5,
        half_width=0.6,
        stride: int = 1,
        padding: bool = True,
        padding_mode: str = "replicate",
        kernel_size: int = 12,
    ):
        # kernel_size should be even number for stylegan3 setup,
        # in this implementation, odd number is also possible.
        super().__init__()
        if cutoff < -0.0:
            raise ValueError("Minimum cutoff must be larger than zero.")
        if cutoff > 0.5:
            raise ValueError("A cutoff above 0.5 does not make sense.")
        self.kernel_size = kernel_size
        self.even = kernel_size % 2 == 0
        self.pad_left = kernel_size // 2 - int(self.even)
        self.pad_right = kernel_size // 2
        self.stride = stride
        self.padding = padding
        self.padding_mode = padding_mode
        filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
        self.register_buffer("filter", filter)

    # input [B, C, T]
    def forward(self, x):
        _, C, _ = x.shape

        if self.padding:
            x = F.pad(x, (self.pad_left, self.pad_right), mode=self.padding_mode)
        out = F.conv1d(x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)

        return out
Amphion/models/ns3_codec/alias_free_torch/resample.py
ADDED
@@ -0,0 +1,58 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import torch.nn as nn
from torch.nn import functional as F
from .filter import LowPassFilter1d
from .filter import kaiser_sinc_filter1d


class UpSample1d(nn.Module):
    def __init__(self, ratio=2, kernel_size=None):
        super().__init__()
        self.ratio = ratio
        self.kernel_size = (
            int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
        )
        self.stride = ratio
        self.pad = self.kernel_size // ratio - 1
        self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
        self.pad_right = (
            self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
        )
        filter = kaiser_sinc_filter1d(
            cutoff=0.5 / ratio, half_width=0.6 / ratio, kernel_size=self.kernel_size
        )
        self.register_buffer("filter", filter)

    # x: [B, C, T]
    def forward(self, x):
        _, C, _ = x.shape

        x = F.pad(x, (self.pad, self.pad), mode="replicate")
        x = self.ratio * F.conv_transpose1d(
            x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C
        )
        x = x[..., self.pad_left : -self.pad_right]

        return x


class DownSample1d(nn.Module):
    def __init__(self, ratio=2, kernel_size=None):
        super().__init__()
        self.ratio = ratio
        self.kernel_size = (
            int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
        )
        self.lowpass = LowPassFilter1d(
            cutoff=0.5 / ratio,
            half_width=0.6 / ratio,
            stride=ratio,
            kernel_size=self.kernel_size,
        )

    def forward(self, x):
        xx = self.lowpass(x)

        return xx
Amphion/models/ns3_codec/facodec.py
ADDED
@@ -0,0 +1,593 @@
import numpy as np
import torch
from torch import nn, sin, pow
from torch.nn import Parameter
import torch.nn.functional as F
from torch.nn.utils import weight_norm
from .alias_free_torch import *
from .quantize import *
from einops import rearrange
from einops.layers.torch import Rearrange
from .transformer import TransformerEncoder
from .gradient_reversal import GradientReversal


def init_weights(m):
    if isinstance(m, nn.Conv1d):
        nn.init.trunc_normal_(m.weight, std=0.02)
        nn.init.constant_(m.bias, 0)


def WNConv1d(*args, **kwargs):
    return weight_norm(nn.Conv1d(*args, **kwargs))


def WNConvTranspose1d(*args, **kwargs):
    return weight_norm(nn.ConvTranspose1d(*args, **kwargs))


class CNNLSTM(nn.Module):
    def __init__(self, indim, outdim, head, global_pred=False):
        super().__init__()
        self.global_pred = global_pred
        self.model = nn.Sequential(
            ResidualUnit(indim, dilation=1),
            ResidualUnit(indim, dilation=2),
            ResidualUnit(indim, dilation=3),
            Activation1d(activation=SnakeBeta(indim, alpha_logscale=True)),
            Rearrange("b c t -> b t c"),
        )
        self.heads = nn.ModuleList([nn.Linear(indim, outdim) for i in range(head)])

    def forward(self, x):
        # x: [B, C, T]
        x = self.model(x)
        if self.global_pred:
            x = torch.mean(x, dim=1, keepdim=False)
        outs = [head(x) for head in self.heads]
        return outs


class SnakeBeta(nn.Module):
    """
    A modified Snake function which uses separate parameters for the magnitude of the periodic components
    Shape:
        - Input: (B, C, T)
        - Output: (B, C, T), same shape as the input
    Parameters:
        - alpha - trainable parameter that controls frequency
        - beta - trainable parameter that controls magnitude
    References:
        - This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
        https://arxiv.org/abs/2006.08195
    Examples:
        >>> a1 = SnakeBeta(256)
        >>> x = torch.randn(256)
        >>> x = a1(x)
    """

    def __init__(
        self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False
    ):
        """
        Initialization.
        INPUT:
            - in_features: shape of the input
            - alpha - trainable parameter that controls frequency
            - beta - trainable parameter that controls magnitude
            alpha is initialized to 1 by default, higher values = higher-frequency.
            beta is initialized to 1 by default, higher values = higher-magnitude.
            alpha will be trained along with the rest of your model.
        """
        super(SnakeBeta, self).__init__()
        self.in_features = in_features

        # initialize alpha
        self.alpha_logscale = alpha_logscale
        if self.alpha_logscale:  # log scale alphas initialized to zeros
            self.alpha = Parameter(torch.zeros(in_features) * alpha)
            self.beta = Parameter(torch.zeros(in_features) * alpha)
        else:  # linear scale alphas initialized to ones
            self.alpha = Parameter(torch.ones(in_features) * alpha)
            self.beta = Parameter(torch.ones(in_features) * alpha)

        self.alpha.requires_grad = alpha_trainable
        self.beta.requires_grad = alpha_trainable

        self.no_div_by_zero = 0.000000001

    def forward(self, x):
        """
        Forward pass of the function.
        Applies the function to the input elementwise.
        SnakeBeta := x + 1/b * sin^2 (xa)
        """
        alpha = self.alpha.unsqueeze(0).unsqueeze(-1)  # line up with x to [B, C, T]
        beta = self.beta.unsqueeze(0).unsqueeze(-1)
        if self.alpha_logscale:
            alpha = torch.exp(alpha)
            beta = torch.exp(beta)
        x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)

        return x


class ResidualUnit(nn.Module):
    def __init__(self, dim: int = 16, dilation: int = 1):
        super().__init__()
        pad = ((7 - 1) * dilation) // 2
        self.block = nn.Sequential(
            Activation1d(activation=SnakeBeta(dim, alpha_logscale=True)),
            WNConv1d(dim, dim, kernel_size=7, dilation=dilation, padding=pad),
            Activation1d(activation=SnakeBeta(dim, alpha_logscale=True)),
            WNConv1d(dim, dim, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)


class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 16, stride: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            ResidualUnit(dim // 2, dilation=1),
            ResidualUnit(dim // 2, dilation=3),
            ResidualUnit(dim // 2, dilation=9),
            Activation1d(activation=SnakeBeta(dim // 2, alpha_logscale=True)),
            WNConv1d(
                dim // 2,
                dim,
                kernel_size=2 * stride,
                stride=stride,
                padding=stride // 2 + stride % 2,
            ),
        )

    def forward(self, x):
        return self.block(x)


class FACodecEncoder(nn.Module):
    def __init__(
        self,
        ngf=32,
        up_ratios=(2, 4, 5, 5),
        out_channels=1024,
    ):
        super().__init__()
        self.hop_length = np.prod(up_ratios)
        self.up_ratios = up_ratios

        # Create first convolution
        d_model = ngf
        self.block = [WNConv1d(1, d_model, kernel_size=7, padding=3)]

        # Create EncoderBlocks that double channels as they downsample by `stride`
        for stride in up_ratios:
            d_model *= 2
            self.block += [EncoderBlock(d_model, stride=stride)]

        # Create last convolution
        self.block += [
            Activation1d(activation=SnakeBeta(d_model, alpha_logscale=True)),
            WNConv1d(d_model, out_channels, kernel_size=3, padding=1),
        ]

        # Wrap block into nn.Sequential
        self.block = nn.Sequential(*self.block)
        self.enc_dim = d_model

        self.reset_parameters()

    def forward(self, x):
        out = self.block(x)
        return out

    def inference(self, x):
        return self.block(x)

    def remove_weight_norm(self):
        """Remove weight normalization module from all of the layers."""

        def _remove_weight_norm(m):
            try:
                torch.nn.utils.remove_weight_norm(m)
            except ValueError:  # this module didn't have weight norm
                return

        self.apply(_remove_weight_norm)

    def apply_weight_norm(self):
        """Apply weight normalization module to all of the layers."""

        def _apply_weight_norm(m):
            if isinstance(m, nn.Conv1d):
                torch.nn.utils.weight_norm(m)

        self.apply(_apply_weight_norm)

    def reset_parameters(self):
        self.apply(init_weights)


class DecoderBlock(nn.Module):
    def __init__(self, input_dim: int = 16, output_dim: int = 8, stride: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            Activation1d(activation=SnakeBeta(input_dim, alpha_logscale=True)),
            WNConvTranspose1d(
                input_dim,
                output_dim,
                kernel_size=2 * stride,
                stride=stride,
                padding=stride // 2 + stride % 2,
                output_padding=stride % 2,
            ),
            ResidualUnit(output_dim, dilation=1),
            ResidualUnit(output_dim, dilation=3),
            ResidualUnit(output_dim, dilation=9),
        )

    def forward(self, x):
        return self.block(x)


class FACodecDecoder(nn.Module):
    def __init__(
        self,
        in_channels=256,
        upsample_initial_channel=1536,
        ngf=32,
        up_ratios=(5, 5, 4, 2),
        vq_num_q_c=2,
        vq_num_q_p=1,
        vq_num_q_r=3,
        vq_dim=1024,
        vq_commit_weight=0.005,
        vq_weight_init=False,
        vq_full_commit_loss=False,
        codebook_dim=8,
        codebook_size_prosody=10,  # true codebook size is equal to 2^codebook_size
        codebook_size_content=10,
        codebook_size_residual=10,
        quantizer_dropout=0.0,
        dropout_type="linear",
        use_gr_content_f0=False,
        use_gr_prosody_phone=False,
        use_gr_residual_f0=False,
        use_gr_residual_phone=False,
        use_gr_x_timbre=False,
        use_random_mask_residual=True,
        prob_random_mask_residual=0.75,
    ):
        super().__init__()
        self.hop_length = np.prod(up_ratios)
        self.ngf = ngf
        self.up_ratios = up_ratios

        self.use_random_mask_residual = use_random_mask_residual
        self.prob_random_mask_residual = prob_random_mask_residual

        self.vq_num_q_p = vq_num_q_p
        self.vq_num_q_c = vq_num_q_c
        self.vq_num_q_r = vq_num_q_r

        self.codebook_size_prosody = codebook_size_prosody
        self.codebook_size_content = codebook_size_content
        self.codebook_size_residual = codebook_size_residual

        quantizer_class = ResidualVQ

        self.quantizer = nn.ModuleList()

        # prosody
        quantizer = quantizer_class(
            num_quantizers=vq_num_q_p,
            dim=vq_dim,
            codebook_size=codebook_size_prosody,
            codebook_dim=codebook_dim,
            threshold_ema_dead_code=2,
            commitment=vq_commit_weight,
            weight_init=vq_weight_init,
            full_commit_loss=vq_full_commit_loss,
            quantizer_dropout=quantizer_dropout,
            dropout_type=dropout_type,
        )
        self.quantizer.append(quantizer)

        # phone
        quantizer = quantizer_class(
            num_quantizers=vq_num_q_c,
            dim=vq_dim,
            codebook_size=codebook_size_content,
            codebook_dim=codebook_dim,
            threshold_ema_dead_code=2,
            commitment=vq_commit_weight,
            weight_init=vq_weight_init,
            full_commit_loss=vq_full_commit_loss,
            quantizer_dropout=quantizer_dropout,
            dropout_type=dropout_type,
        )
        self.quantizer.append(quantizer)

        # residual
        if self.vq_num_q_r > 0:
            quantizer = quantizer_class(
                num_quantizers=vq_num_q_r,
                dim=vq_dim,
                codebook_size=codebook_size_residual,
                codebook_dim=codebook_dim,
                threshold_ema_dead_code=2,
                commitment=vq_commit_weight,
                weight_init=vq_weight_init,
                full_commit_loss=vq_full_commit_loss,
                quantizer_dropout=quantizer_dropout,
                dropout_type=dropout_type,
            )
            self.quantizer.append(quantizer)

        # Add first conv layer
        channels = upsample_initial_channel
        layers = [WNConv1d(in_channels, channels, kernel_size=7, padding=3)]

        # Add upsampling + MRF blocks
        for i, stride in enumerate(up_ratios):
            input_dim = channels // 2**i
            output_dim = channels // 2 ** (i + 1)
            layers += [DecoderBlock(input_dim, output_dim, stride)]

        # Add final conv layer
        layers += [
            Activation1d(activation=SnakeBeta(output_dim, alpha_logscale=True)),
            WNConv1d(output_dim, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        ]

        self.model = nn.Sequential(*layers)

        self.timbre_encoder = TransformerEncoder(
            enc_emb_tokens=None,
            encoder_layer=4,
            encoder_hidden=256,
            encoder_head=4,
            conv_filter_size=1024,
            conv_kernel_size=5,
            encoder_dropout=0.1,
            use_cln=False,
        )

        self.timbre_linear = nn.Linear(in_channels, in_channels * 2)
        self.timbre_linear.bias.data[:in_channels] = 1
        self.timbre_linear.bias.data[in_channels:] = 0
        self.timbre_norm = nn.LayerNorm(in_channels, elementwise_affine=False)

        self.f0_predictor = CNNLSTM(in_channels, 1, 2)
        self.phone_predictor = CNNLSTM(in_channels, 5003, 1)

        self.use_gr_content_f0 = use_gr_content_f0
        self.use_gr_prosody_phone = use_gr_prosody_phone
        self.use_gr_residual_f0 = use_gr_residual_f0
        self.use_gr_residual_phone = use_gr_residual_phone
        self.use_gr_x_timbre = use_gr_x_timbre

        if self.vq_num_q_r > 0 and self.use_gr_residual_f0:
            self.res_f0_predictor = nn.Sequential(
                GradientReversal(alpha=1.0), CNNLSTM(in_channels, 1, 2)
            )

        if self.vq_num_q_r > 0 and self.use_gr_residual_phone:
            self.res_phone_predictor = nn.Sequential(
                GradientReversal(alpha=1.0), CNNLSTM(in_channels, 5003, 1)
            )

        if self.use_gr_content_f0:
            self.content_f0_predictor = nn.Sequential(
                GradientReversal(alpha=1.0), CNNLSTM(in_channels, 1, 2)
            )

        if self.use_gr_prosody_phone:
            self.prosody_phone_predictor = nn.Sequential(
                GradientReversal(alpha=1.0), CNNLSTM(in_channels, 5003, 1)
            )

        if self.use_gr_x_timbre:
            self.x_timbre_predictor = nn.Sequential(
                GradientReversal(alpha=1),
                CNNLSTM(in_channels, 245200, 1, global_pred=True),
            )

        self.reset_parameters()

    def quantize(self, x, n_quantizers=None):
        outs, qs, commit_loss, quantized_buf = 0, [], [], []

        # prosody
        f0_input = x  # (B, d, T)
        f0_quantizer = self.quantizer[0]
        out, q, commit, quantized = f0_quantizer(f0_input, n_quantizers=n_quantizers)
        outs += out
        qs.append(q)
        quantized_buf.append(quantized.sum(0))
        commit_loss.append(commit)

        # phone
        phone_input = x
        phone_quantizer = self.quantizer[1]
        out, q, commit, quantized = phone_quantizer(
            phone_input, n_quantizers=n_quantizers
        )
        outs += out
        qs.append(q)
        quantized_buf.append(quantized.sum(0))
        commit_loss.append(commit)

        # residual
        if self.vq_num_q_r > 0:
            residual_quantizer = self.quantizer[2]
            residual_input = x - (quantized_buf[0] + quantized_buf[1]).detach()
            out, q, commit, quantized = residual_quantizer(
                residual_input, n_quantizers=n_quantizers
            )
            outs += out
            qs.append(q)
            quantized_buf.append(quantized.sum(0))  # [L, B, C, T] -> [B, C, T]
            commit_loss.append(commit)

        qs = torch.cat(qs, dim=0)
        commit_loss = torch.cat(commit_loss, dim=0)
        return outs, qs, commit_loss, quantized_buf

    def forward(
        self,
        x,
        vq=True,
        get_vq=False,
        eval_vq=True,
        speaker_embedding=None,
        n_quantizers=None,
        quantized=None,
    ):
        if get_vq:
            return self.quantizer.get_emb()
        if vq is True:
            if eval_vq:
                self.quantizer.eval()
            x_timbre = x
            outs, qs, commit_loss, quantized_buf = self.quantize(
                x, n_quantizers=n_quantizers
            )

            x_timbre = x_timbre.transpose(1, 2)
            x_timbre = self.timbre_encoder(x_timbre, None, None)
            x_timbre = x_timbre.transpose(1, 2)
            spk_embs = torch.mean(x_timbre, dim=2)
            return outs, qs, commit_loss, quantized_buf, spk_embs

        out = {}

        layer_0 = quantized[0]
        f0, uv = self.f0_predictor(layer_0)
        f0 = rearrange(f0, "... 1 -> ...")
        uv = rearrange(uv, "... 1 -> ...")

        layer_1 = quantized[1]
        (phone,) = self.phone_predictor(layer_1)

        out = {"f0": f0, "uv": uv, "phone": phone}

        if self.use_gr_prosody_phone:
            (prosody_phone,) = self.prosody_phone_predictor(layer_0)
            out["prosody_phone"] = prosody_phone

        if self.use_gr_content_f0:
            content_f0, content_uv = self.content_f0_predictor(layer_1)
            content_f0 = rearrange(content_f0, "... 1 -> ...")
            content_uv = rearrange(content_uv, "... 1 -> ...")
            out["content_f0"] = content_f0
            out["content_uv"] = content_uv

        if self.vq_num_q_r > 0:
            layer_2 = quantized[2]

            if self.use_gr_residual_f0:
                res_f0, res_uv = self.res_f0_predictor(layer_2)
                res_f0 = rearrange(res_f0, "... 1 -> ...")
                res_uv = rearrange(res_uv, "... 1 -> ...")
                out["res_f0"] = res_f0
                out["res_uv"] = res_uv

            if self.use_gr_residual_phone:
                (res_phone,) = self.res_phone_predictor(layer_2)
                out["res_phone"] = res_phone

        style = self.timbre_linear(speaker_embedding).unsqueeze(2)  # (B, 2d, 1)
        gamma, beta = style.chunk(2, 1)  # (B, d, 1)
        if self.vq_num_q_r > 0:
            if self.use_random_mask_residual:
                bsz = quantized[2].shape[0]
                res_mask = np.random.choice(
                    [0, 1],
                    size=bsz,
                    p=[
                        self.prob_random_mask_residual,
                        1 - self.prob_random_mask_residual,
                    ],
                )
                res_mask = (
                    torch.from_numpy(res_mask).unsqueeze(1).unsqueeze(1)
                )  # (B, 1, 1)
                res_mask = res_mask.to(
                    device=quantized[2].device, dtype=quantized[2].dtype
                )
                x = (
                    quantized[0].detach()
                    + quantized[1].detach()
                    + quantized[2] * res_mask
                )
                # x = quantized_perturbe[0].detach() + quantized[1].detach() + quantized[2] * res_mask
            else:
                x = quantized[0].detach() + quantized[1].detach() + quantized[2]
                # x = quantized_perturbe[0].detach() + quantized[1].detach() + quantized[2]
        else:
            x = quantized[0].detach() + quantized[1].detach()
            # x = quantized_perturbe[0].detach() + quantized[1].detach()

        if self.use_gr_x_timbre:
            (x_timbre,) = self.x_timbre_predictor(x)
            out["x_timbre"] = x_timbre

        x = x.transpose(1, 2)
        x = self.timbre_norm(x)
        x = x.transpose(1, 2)
        x = x * gamma + beta

        x = self.model(x)
        out["audio"] = x

        return out

    def vq2emb(self, vq, use_residual_code=True):
        # vq: [num_quantizer, B, T]
        self.quantizer = self.quantizer.eval()
        out = 0
        out += self.quantizer[0].vq2emb(vq[0 : self.vq_num_q_p])
        out += self.quantizer[1].vq2emb(
            vq[self.vq_num_q_p : self.vq_num_q_p + self.vq_num_q_c]
        )
        if self.vq_num_q_r > 0 and use_residual_code:
            out += self.quantizer[2].vq2emb(vq[self.vq_num_q_p + self.vq_num_q_c :])
        return out

    def inference(self, x, speaker_embedding):
        style = self.timbre_linear(speaker_embedding).unsqueeze(2)  # (B, 2d, 1)
        gamma, beta = style.chunk(2, 1)  # (B, d, 1)
        x = x.transpose(1, 2)
        x = self.timbre_norm(x)
        x = x.transpose(1, 2)
        x = x * gamma + beta
        x = self.model(x)
        return x

    def remove_weight_norm(self):
        """Remove weight normalization module from all of the layers."""

        def _remove_weight_norm(m):
            try:
                torch.nn.utils.remove_weight_norm(m)
            except ValueError:  # this module didn't have weight norm
                return

        self.apply(_remove_weight_norm)

    def apply_weight_norm(self):
        """Apply weight normalization module to all of the layers."""

        def _apply_weight_norm(m):
            if isinstance(m, nn.Conv1d) or isinstance(m, nn.ConvTranspose1d):
                torch.nn.utils.weight_norm(m)

        self.apply(_apply_weight_norm)

    def reset_parameters(self):
        self.apply(init_weights)
Amphion/models/ns3_codec/gradient_reversal.py
ADDED
@@ -0,0 +1,30 @@
from torch.autograd import Function
import torch
from torch import nn


class GradientReversal(Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.save_for_backward(x, alpha)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad_input = None
        _, alpha = ctx.saved_tensors
        if ctx.needs_input_grad[0]:
            grad_input = -alpha * grad_output
        return grad_input, None


revgrad = GradientReversal.apply


class GradientReversal(nn.Module):
    def __init__(self, alpha):
        super().__init__()
        self.alpha = torch.tensor(alpha, requires_grad=False)

    def forward(self, x):
        return revgrad(x, self.alpha)
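
# Note: `revgrad` binds the autograd Function before the nn.Module of the
# same name shadows it, so the module still dispatches correctly.
#
# Usage sketch (illustrative): identity in the forward pass, gradients
# scaled by -alpha in the backward pass.
#
#   grl = GradientReversal(alpha=1.0)
#   x = torch.randn(2, 8, requires_grad=True)
#   grl(x).sum().backward()
#   assert torch.allclose(x.grad, -torch.ones_like(x))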
Amphion/models/ns3_codec/quantize/__init__.py
ADDED
@@ -0,0 +1,2 @@
from .fvq import *
from .rvq import *
Amphion/models/ns3_codec/quantize/fvq.py
ADDED
@@ -0,0 +1,111 @@
from typing import Union

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from torch.nn.utils import weight_norm


class FactorizedVectorQuantize(nn.Module):
    def __init__(self, dim, codebook_size, codebook_dim, commitment, **kwargs):
        super().__init__()
        self.codebook_size = codebook_size
        self.codebook_dim = codebook_dim
        self.commitment = commitment

        if dim != self.codebook_dim:
            self.in_proj = weight_norm(nn.Linear(dim, self.codebook_dim))
            self.out_proj = weight_norm(nn.Linear(self.codebook_dim, dim))
        else:
            self.in_proj = nn.Identity()
            self.out_proj = nn.Identity()
        self._codebook = nn.Embedding(codebook_size, self.codebook_dim)

    @property
    def codebook(self):
        return self._codebook

    def forward(self, z):
        """Quantize the input tensor using a fixed codebook and return
        the corresponding codebook vectors

        Parameters
        ----------
        z : Tensor[B x D x T]

        Returns
        -------
        Tensor[B x D x T]
            Quantized continuous representation of input
        Tensor[B x T]
            Codebook indices (quantized discrete representation of input)
        Tensor[B]
            Commitment loss to train the encoder to predict vectors closer to
            codebook entries, plus codebook loss to update the codebook;
            zeros at eval time
        """
        # transpose since we use linear

        z = rearrange(z, "b d t -> b t d")

        # Factorized codes project input into low-dimensional space
        z_e = self.in_proj(z)  # z_e : (B x T x D)
        z_e = rearrange(z_e, "b t d -> b d t")
        z_q, indices = self.decode_latents(z_e)

        if self.training:
            commitment_loss = (
                F.mse_loss(z_e, z_q.detach(), reduction="none").mean([1, 2])
                * self.commitment
            )
            codebook_loss = F.mse_loss(z_q, z_e.detach(), reduction="none").mean([1, 2])
            commit_loss = commitment_loss + codebook_loss
        else:
            commit_loss = torch.zeros(z.shape[0], device=z.device)

        z_q = (
            z_e + (z_q - z_e).detach()
        )  # noop in forward pass, straight-through gradient estimator in backward pass

        z_q = rearrange(z_q, "b d t -> b t d")
        z_q = self.out_proj(z_q)
        z_q = rearrange(z_q, "b t d -> b d t")

        return z_q, indices, commit_loss

    def vq2emb(self, vq, proj=True):
        emb = self.embed_code(vq)
        if proj:
            emb = self.out_proj(emb)
        return emb.transpose(1, 2)

    def get_emb(self):
        return self.codebook.weight

    def embed_code(self, embed_id):
        return F.embedding(embed_id, self.codebook.weight)

    def decode_code(self, embed_id):
        return self.embed_code(embed_id).transpose(1, 2)

    def decode_latents(self, latents):
        encodings = rearrange(latents, "b d t -> (b t) d")
        codebook = self.codebook.weight  # codebook: (N x D)
        # L2 normalize encodings and codebook
        encodings = F.normalize(encodings)
        codebook = F.normalize(codebook)

        # Compute euclidean distance with codebook
        dist = (
            encodings.pow(2).sum(1, keepdim=True)
            - 2 * encodings @ codebook.t()
            + codebook.pow(2).sum(1, keepdim=True).t()
        )
        indices = rearrange((-dist).max(1)[1], "(b t) -> b t", b=latents.size(0))
        z_q = self.decode_code(indices)
        return z_q, indices
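
# Usage sketch (illustrative): codes are searched in the low-dimensional
# factorized space (codebook_dim) while inputs/outputs stay in `dim`.
#
#   fvq = FactorizedVectorQuantize(dim=256, codebook_size=1024,
#                                  codebook_dim=8, commitment=0.005)
#   z_q, indices, commit_loss = fvq(torch.randn(2, 256, 80))
#   # z_q: (2, 256, 80), indices: (2, 80)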
Amphion/models/ns3_codec/quantize/rvq.py
ADDED
@@ -0,0 +1,82 @@
import math
import torch
from torch import nn
from .fvq import FactorizedVectorQuantize


class ResidualVQ(nn.Module):
    """Follows Algorithm 1. in https://arxiv.org/pdf/2107.03312.pdf"""

    def __init__(self, *, num_quantizers, codebook_size, **kwargs):
        super().__init__()
        VQ = FactorizedVectorQuantize
        if isinstance(codebook_size, int):
            codebook_size = [codebook_size] * num_quantizers
        self.layers = nn.ModuleList(
            [VQ(codebook_size=2**size, **kwargs) for size in codebook_size]
        )
        self.num_quantizers = num_quantizers
        self.quantizer_dropout = kwargs.get("quantizer_dropout", 0.0)
        self.dropout_type = kwargs.get("dropout_type", None)

    def forward(self, x, n_quantizers=None):
        quantized_out = 0.0
        residual = x

        all_losses = []
        all_indices = []
        all_quantized = []

        if n_quantizers is None:
            n_quantizers = self.num_quantizers
        if self.training:
            n_quantizers = torch.ones((x.shape[0],)) * self.num_quantizers + 1
            if self.dropout_type == "linear":
                dropout = torch.randint(1, self.num_quantizers + 1, (x.shape[0],))
            elif self.dropout_type == "exp":
                dropout = torch.randint(
                    1, int(math.log2(self.num_quantizers)), (x.shape[0],)
                )
                dropout = torch.pow(2, dropout)
            n_dropout = int(x.shape[0] * self.quantizer_dropout)
            n_quantizers[:n_dropout] = dropout[:n_dropout]
            n_quantizers = n_quantizers.to(x.device)

        for idx, layer in enumerate(self.layers):
            if not self.training and idx >= n_quantizers:
                break
            quantized, indices, loss = layer(residual)

            mask = (
                torch.full((x.shape[0],), fill_value=idx, device=x.device)
                < n_quantizers
            )

            residual = residual - quantized

            quantized_out = quantized_out + quantized * mask[:, None, None]

            # loss
            loss = (loss * mask).mean()

            all_indices.append(indices)
            all_losses.append(loss)
            all_quantized.append(quantized)
        all_losses, all_indices, all_quantized = map(
            torch.stack, (all_losses, all_indices, all_quantized)
        )
        return quantized_out, all_indices, all_losses, all_quantized

    def vq2emb(self, vq):
        # vq: [n_quantizers, B, T]
        quantized_out = 0.0
        for idx, layer in enumerate(self.layers):
            quantized = layer.vq2emb(vq[idx])
            quantized_out += quantized
        return quantized_out

    def get_emb(self):
        embs = []
        for idx, layer in enumerate(self.layers):
            embs.append(layer.get_emb())
        return embs
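
# Usage sketch (illustrative): each stage quantizes the residual left by the
# previous stages; per-stage codes can be summed back into an embedding.
#
#   rvq = ResidualVQ(num_quantizers=3, codebook_size=10,  # 2**10 entries/stage
#                    dim=256, codebook_dim=8, commitment=0.005)
#   rvq.eval()  # the training branch expects dropout_type to be set
#   q_out, codes, losses, per_stage = rvq(torch.randn(2, 256, 80))
#   emb = rvq.vq2emb(codes)  # (2, 256, 80)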
Amphion/models/ns3_codec/transformer.py
ADDED
@@ -0,0 +1,217 @@
import numpy as np
import torch
import torch.nn as nn
import math
from torch.nn import functional as F


class StyleAdaptiveLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        self.in_dim = normalized_shape
        self.norm = nn.LayerNorm(self.in_dim, eps=eps, elementwise_affine=False)
        self.style = nn.Linear(self.in_dim, self.in_dim * 2)
        self.style.bias.data[: self.in_dim] = 1
        self.style.bias.data[self.in_dim :] = 0

    def forward(self, x, condition):
        # x: (B, T, d); condition: (B, T, d)
        # The forward pass was missing in this upload; this minimal completion
        # (an assumption) mirrors the gamma/beta conditioning used by
        # timbre_linear in facodec.py. It is only reached when use_cln=True.
        style = self.style(torch.mean(condition, dim=1, keepdim=True))
        gamma, beta = style.chunk(2, -1)
        x = self.norm(x)
        x = x * gamma + beta
        return x


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super().__init__()

        self.dropout = dropout
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[: x.size(0)]
        return F.dropout(x, self.dropout, training=self.training)


class TransformerFFNLayer(nn.Module):
    def __init__(
        self, encoder_hidden, conv_filter_size, conv_kernel_size, encoder_dropout
    ):
        super().__init__()

        self.encoder_hidden = encoder_hidden
        self.conv_filter_size = conv_filter_size
        self.conv_kernel_size = conv_kernel_size
        self.encoder_dropout = encoder_dropout

        self.ffn_1 = nn.Conv1d(
            self.encoder_hidden,
            self.conv_filter_size,
            self.conv_kernel_size,
            padding=self.conv_kernel_size // 2,
        )
        self.ffn_1.weight.data.normal_(0.0, 0.02)
        self.ffn_2 = nn.Linear(self.conv_filter_size, self.encoder_hidden)
        self.ffn_2.weight.data.normal_(0.0, 0.02)

    def forward(self, x):
        # x: (B, T, d)
        x = self.ffn_1(x.permute(0, 2, 1)).permute(
            0, 2, 1
        )  # (B, T, d) -> (B, d, T) -> (B, T, d)
        x = F.relu(x)
        x = F.dropout(x, self.encoder_dropout, training=self.training)
        x = self.ffn_2(x)
        return x


class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        encoder_hidden,
        encoder_head,
        conv_filter_size,
        conv_kernel_size,
        encoder_dropout,
        use_cln,
    ):
        super().__init__()
        self.encoder_hidden = encoder_hidden
        self.encoder_head = encoder_head
        self.conv_filter_size = conv_filter_size
        self.conv_kernel_size = conv_kernel_size
        self.encoder_dropout = encoder_dropout
        self.use_cln = use_cln

        if not self.use_cln:
            self.ln_1 = nn.LayerNorm(self.encoder_hidden)
            self.ln_2 = nn.LayerNorm(self.encoder_hidden)
        else:
            self.ln_1 = StyleAdaptiveLayerNorm(self.encoder_hidden)
            self.ln_2 = StyleAdaptiveLayerNorm(self.encoder_hidden)

        self.self_attn = nn.MultiheadAttention(
            self.encoder_hidden, self.encoder_head, batch_first=True
        )

        self.ffn = TransformerFFNLayer(
            self.encoder_hidden,
            self.conv_filter_size,
            self.conv_kernel_size,
            self.encoder_dropout,
        )

    def forward(self, x, key_padding_mask, condition=None):
        # x: (B, T, d); key_padding_mask: (B, T), mask is 0; condition: (B, T, d)

        # self attention
        residual = x
        if self.use_cln:
            x = self.ln_1(x, condition)
        else:
            x = self.ln_1(x)

        if key_padding_mask is not None:
            key_padding_mask_input = ~(key_padding_mask.bool())
        else:
            key_padding_mask_input = None
        x, _ = self.self_attn(
            query=x, key=x, value=x, key_padding_mask=key_padding_mask_input
        )
        x = F.dropout(x, self.encoder_dropout, training=self.training)
        x = residual + x

        # ffn
        residual = x
        if self.use_cln:
            x = self.ln_2(x, condition)
        else:
            x = self.ln_2(x)
        x = self.ffn(x)
        x = residual + x

        return x


class TransformerEncoder(nn.Module):
    def __init__(
        self,
        enc_emb_tokens=None,
        encoder_layer=4,
        encoder_hidden=256,
        encoder_head=4,
        conv_filter_size=1024,
        conv_kernel_size=5,
        encoder_dropout=0.1,
        use_cln=False,
        cfg=None,
    ):
        super().__init__()

        self.encoder_layer = (
            encoder_layer if encoder_layer is not None else cfg.encoder_layer
        )
        self.encoder_hidden = (
            encoder_hidden if encoder_hidden is not None else cfg.encoder_hidden
        )
        self.encoder_head = (
            encoder_head if encoder_head is not None else cfg.encoder_head
        )
        self.conv_filter_size = (
            conv_filter_size if conv_filter_size is not None else cfg.conv_filter_size
        )
        self.conv_kernel_size = (
            conv_kernel_size if conv_kernel_size is not None else cfg.conv_kernel_size
        )
        self.encoder_dropout = (
            encoder_dropout if encoder_dropout is not None else cfg.encoder_dropout
        )
        self.use_cln = use_cln if use_cln is not None else cfg.use_cln

        if enc_emb_tokens is not None:
            self.use_enc_emb = True
            self.enc_emb_tokens = enc_emb_tokens
        else:
            self.use_enc_emb = False

        self.position_emb = PositionalEncoding(
            self.encoder_hidden, self.encoder_dropout
        )

        self.layers = nn.ModuleList([])
        self.layers.extend(
            [
                TransformerEncoderLayer(
                    self.encoder_hidden,
                    self.encoder_head,
                    self.conv_filter_size,
                    self.conv_kernel_size,
                    self.encoder_dropout,
                    self.use_cln,
                )
                for i in range(self.encoder_layer)
            ]
        )

        if self.use_cln:
            self.last_ln = StyleAdaptiveLayerNorm(self.encoder_hidden)
        else:
            self.last_ln = nn.LayerNorm(self.encoder_hidden)

    def forward(self, x, key_padding_mask, condition=None):
        if len(x.shape) == 2 and self.use_enc_emb:
            x = self.enc_emb_tokens(x)
            x = self.position_emb(x)
        else:
            x = self.position_emb(x)  # (B, T, d)

        for layer in self.layers:
            x = layer(x, key_padding_mask, condition)

        if self.use_cln:
            x = self.last_ln(x, condition)
        else:
            x = self.last_ln(x)

        return x