---
title: "Activation functions"
notebook-links: false
crossref:
  lof-title: "List of Figures"
number-sections: false
---
When choosing an activation function, consider the following:
- **Non-saturation:** Avoid activations that saturate (e.g., sigmoid, tanh) to prevent vanishing gradients.
- **Computational efficiency:** Choose activations that are computationally efficient (e.g., ReLU, Swish) for large models or real-time applications.
- **Smoothness:** Smooth activations (e.g., GELU, Mish) can help with optimization and convergence.
- **Domain knowledge:** Select activations based on the problem domain and desired output (e.g., softmax for multi-class classification).
- **Experimentation:** Try different activations and evaluate their performance on your specific task.
[Slideshow](activations_slideshow.qmd)
{{< embed ActivationFunctions.ipynb#fig-overview >}}
## Sigmoid {#sec-sigmoid}
**Strengths:** Maps any real-valued number to a value between 0 and 1, making it suitable for binary classification problems.
**Weaknesses:** Saturates (i.e., output values approach 0 or 1) for large inputs, leading to vanishing gradients during backpropagation.
**Usage:** Binary classification, logistic regression.
::: columns
::: {.column width="50%"}
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
``` python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-sigmoid >}}
:::
:::
## Hyperbolic Tangent (Tanh) {#sec-tanh}
**Strengths:** Similar to sigmoid, but maps to (-1, 1); its zero-centered outputs can make optimization easier.
**Weaknesses:** Also saturates, leading to vanishing gradients.
**Usage:** Hidden layers where zero-centered outputs are preferred, e.g., classic RNN/LSTM cells.
::: columns
::: {.column width="50%"}
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$
``` python
def tanh(x):
    return np.tanh(x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-tanh >}}
:::
:::
## Rectified Linear Unit (ReLU)
**Strengths:** Computationally efficient, non-saturating, and easy to compute.
**Weaknesses:** Not differentiable at x = 0, and neurons can "die" (output zero permanently) when their inputs stay negative.
**Usage:** Default activation function in many deep learning frameworks, suitable for most neural networks.
::: columns
::: {.column width="50%"}
$$
\text{ReLU}(x) = \max(0, x)
$$
``` python
def relu(x):
    return np.maximum(0, x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-relu >}}
:::
:::
## Leaky ReLU
**Strengths:** Similar to ReLU, but allows a small, non-zero gradient for negative inputs, which helps prevent dying neurons.
**Weaknesses:** Still non-differentiable at x=0.
**Usage:** Alternative to ReLU, especially when dealing with dying neurons.
::: columns
::: {.column width="50%"}
$$
\text{Leaky ReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$
``` python
def leaky_relu(x, alpha=0.01):
    # alpha is a small constant (e.g., 0.01) that scales negative inputs
    return np.where(x > 0, x, x * alpha)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-leaky_relu >}}
:::
:::
## Swish
In its general form, $\text{Swish}(x) = x \cdot \sigma(\beta x)$, where $\beta$ is a fixed or learnable parameter; the common case $\beta = 1$ is shown below, and a sketch of the parameterized form follows the figure.
**Strengths:** Self-gated, smooth, and non-saturating.
**Weaknesses:** More expensive to compute than ReLU; the general form adds a learnable parameter $\beta$.
**Usage:** Drop-in replacement for ReLU or other activations, though it may not always outperform them.
::: columns
::: {.column width="50%"}
$$
\text{Swish}(x) = x \cdot \sigma(x)
$$
``` python
def swish(x):
    return x * sigmoid(x)
```
See also: [sigmoid](#sec-sigmoid)
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-swish >}}
:::
:::
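The original Swish paper treats $\beta$ as trainable; below is a minimal sketch of the parameterized form, reusing the `sigmoid` helper defined above. The name `swish_beta` and the fixed keyword argument are illustrative choices, not from the source.

``` python
def swish_beta(x, beta=1.0):
    # General form x * sigmoid(beta * x); beta = 1 recovers swish/SiLU
    return x * sigmoid(beta * x)
```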
## Mish
**Strengths:** Non-saturating, smooth, and non-monotonic.
**Weaknesses:** More expensive to compute than ReLU; not as well-studied.
**Usage:** Alternative to ReLU, especially in computer vision tasks.
::: columns
::: {.column width="50%"}
$$
\text{Mish}(x) = x \cdot \tanh(\text{Softplus}(x))
$$
``` python
def mish(x):
    # softplus(x) = log(1 + e^x); see the SoftPlus section below
    return x * np.tanh(softplus(x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-mish >}}
:::
:::
See also: [softplus](#softplus), [tanh](#sec-tanh)
## Softmax
**Strengths:** Normalizes output to ensure probabilities sum to 1, making it suitable for multi-class classification.
**Weaknesses:** Only suitable for output layers with multiple classes.
**Usage:** Output layer activation for multi-class classification problems.
::: columns
::: {.column width="50%"}
$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}
$$
``` python
def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e_x / e_x.sum()
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softmax >}}
:::
:::
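A quick check of the normalization property using the `softmax` helper above (the input values are illustrative):

``` python
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # roughly [0.659, 0.242, 0.099]
print(probs.sum())  # ~1.0 (sums to one up to floating point)
```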
## Softsign
**Strengths:** Similar to tanh (output range (-1, 1)), but approaches its asymptotes more gradually.
**Weaknesses:** Not commonly used, may not provide significant benefits over sigmoid or tanh.
**Usage:** Alternative to sigmoid or tanh in certain situations.
::: columns
::: {.column width="50%"}
$$
\text{Softsign}(x) = \frac{x}{1 + |x|}
$$
``` python
def softsign(x):
    return x / (1 + np.abs(x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softsign >}}
:::
:::
## SoftPlus {#softplus}
**Strengths:** Smooth, strictly positive, and differentiable everywhere; a soft approximation of ReLU.
**Weaknesses:** More expensive to compute than ReLU; rarely outperforms it as a hidden-layer activation.
**Usage:** When a strictly positive output is needed (e.g., predicting variances); also appears inside Mish.
::: columns
::: {.column width="50%"}
$$
\text{Softplus}(x) = \log(1 + e^x)
$$
``` python
def softplus(x):
    # log(1 + e^x), computed via logaddexp to avoid overflow for large x
    return np.logaddexp(0, x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-softplus >}}
:::
:::
## ArcTan
**Strengths:** Smooth and continuous; saturates more gradually than sigmoid or tanh.
**Weaknesses:** Not commonly used, may not outperform other activations.
**Usage:** Experimental or niche applications.
::: columns
::: {.column width="50%"}
$$
\text{ArcTan}(x) = \tan^{-1}(x)
$$
``` python
def arctan(x):
    return np.arctan(x)
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-arctan >}}
:::
:::
## Gaussian Error Linear Unit (GELU)
**Strengths:** Smooth and non-monotonic; weights inputs by the Gaussian CDF instead of hard-gating them.
**Weaknesses:** More expensive to compute than ReLU; the exact form requires the Gaussian CDF, which is usually approximated.
**Usage:** Default activation in many Transformer architectures (e.g., BERT, GPT).
::: columns
::: {.column width="50%"}
$$
\text{GELU}(x) = x \cdot \Phi(x)
$$
``` python
def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi)
                                  * (x + 0.044715 * np.power(x, 3))))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-gelu >}}
:::
:::
See also: [tanh](#sec-tanh)
## Sigmoid Linear Unit (SiLU)
**Strengths:** Smooth, non-saturating, and equivalent to Swish with $\beta = 1$.
**Weaknesses:** Slightly more expensive to compute than ReLU; not as well-studied.
**Usage:** Alternative to ReLU, especially in computer vision models (e.g., EfficientNet).
::: columns
::: {.column width="50%"}
$$
\text{SiLU}(x) = x \cdot \sigma(x)
$$
``` python
def silu(x):
    return x / (1 + np.exp(-x))
```
:::
::: {.column width="50%"}
{{< embed ActivationFunctions.ipynb#fig-silu >}}
:::
:::
## GELU Approximation (GELU Approx.)
$$
\text{GELU}(x) \approx 0.5\,x \left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right)
$$
**Strengths:** Fast, non-saturating, and smooth.
**Weaknesses:** Approximation, not exactly equal to GELU.
**Usage:** Alternative to GELU, especially when computational efficiency is crucial.
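To see how close the approximation is, here is a minimal sketch comparing it against the exact form $x \cdot \Phi(x)$, assuming SciPy is available for the error function and reusing the `gelu` helper above (which already implements the tanh approximation):

``` python
from scipy.special import erf

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))

x = np.linspace(-4, 4, 9)
print(np.max(np.abs(gelu_exact(x) - gelu(x))))  # small but non-zero
```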
## SELU (Scaled Exponential Linear Unit)
$$
f(x) = \lambda
\begin{cases}
x & x > 0 \\
\alpha e^x - \alpha & x \leq 0
\end{cases}
$$
**Strengths:** Self-normalizing, non-saturating, and computationally efficient.
**Weaknesses:** Requires LeCun-normal initialization and specific fixed values of λ and α; self-normalization can break with architectural changes (e.g., standard dropout).
**Usage:** Alternative to ReLU, especially in deep neural networks.
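For completeness, a minimal NumPy sketch using the fixed constants published with SELU (α ≈ 1.6733, λ ≈ 1.0507); the function name and argument defaults are illustrative:

``` python
def selu(x, alpha=1.6732632423543772, lam=1.0507009873554805):
    # lam * x for x > 0, lam * alpha * (e^x - 1) otherwise
    return lam * np.where(x > 0, x, alpha * np.expm1(x))
```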