jordyvl committed
Commit 0736615
1 Parent(s): dea6ecb

small update to include visual definition of ECE

Files changed (5)
  1. ECE_definition.jpg +0 -0
  2. README.md +40 -11
  3. app.py +4 -3
  4. ece.py +1 -4
  5. local_app.py +31 -20
ECE_definition.jpg ADDED
README.md CHANGED
@@ -14,30 +14,32 @@ pinned: false

# Metric Card for ECE

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
## Metric Description
<!---
*Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
-->
- Expected Calibration Error `ECE` is a standard metric to evaluate top-1 prediction miscalibration.
+
+ Expected Calibration Error *ECE* is a popular metric to evaluate top-1 prediction miscalibration.
It measures the L^p norm difference between a model’s posterior and the true likelihood of being correct.
- ```
- $$ECE_p(f)^p = \mathbb{E}_{(X,Y)}\left[\|\mathbb{E}[Y = \hat{y} \mid f(X) = \hat{p}] - f(X)\|^p_p\right]$$, where $\hat{y} = \arg\max_{y'} [f(X)]_{y'}$ is a class prediction with associated posterior probability $\hat{p} = \max_{y'} [f(X)]_{y'}$.
- ```
- It is generally implemented as a binned estimator that discretizes predicted probabilities into a range of possible values (bins) for which conditional expectation can be estimated.

+ ![ECE definition](./ECE_definition.jpg)
+
+ It is generally implemented as a binned estimator that discretizes predicted probabilities into ranges of possible values (bins) for which conditional expectation can be estimated.
- As a metric of calibration *error*, it holds that the lower, the better calibrated a model is.
- For valid model comparisons, ensure to use the same keyword arguments.


## How to Use
<!---
*Give general statement of how to use the metric*
*Provide simplest possible example for using the metric*
- -->
-
+ -->
+ ```
+ >>> metric = evaluate.load("jordyvl/ece")
+ >>> results = metric.compute(references=[0, 1, 2], predictions=[[0.6, 0.2, 0.2], [0, 0.95, 0.05], [0.7, 0.1, 0.2]])
+ >>> print(results)
+ {'ECE': 0.1333333333333334}
+ ```

+ For valid model comparisons, ensure to use the same keyword arguments.


### Inputs
@@ -55,12 +57,32 @@ For valid model comparisons, ensure to use the same keyword arguments.
#### Values from Popular Papers
*Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-->
+ As a metric of calibration *error*, it holds that the lower, the better calibrated a model is. Depending on the L^p norm, ECE takes values either between 0 and 1 (p=2) or between 0 and +∞.
+ The module returns a dictionary with a single key-value pair, e.g., {"ECE": 0.64}.


### Examples
<!---
*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
-->
+ ```
+ N = 10  # N evaluation instances {(x_i,y_i)}_{i=1}^N
+ K = 5  # K class problem
+
+ def random_mc_instance(concentration=1, onehot=False):
+     reference = np.argmax(
+         np.random.dirichlet(([concentration for _ in range(K)])), -1
+     )  # class targets
+     prediction = np.random.dirichlet(([concentration for _ in range(K)]))  # probabilities
+     if onehot:
+         reference = np.eye(K)[np.argmax(reference, -1)]
+     return reference, prediction
+
+ references, predictions = list(zip(*[random_mc_instance() for i in range(N)]))
+ references = np.array(references, dtype=np.int64)
+ predictions = np.array(predictions, dtype=np.float32)
+ res = ECE()._compute(predictions, references)  # {'ECE': float}
+ ```


## Limitations and Bias
@@ -71,11 +93,18 @@ See [3],[4] and [5].

## Citation
[1] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
+
[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017, July. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.
+
[3] Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. and Tran, D., 2019, June. Measuring Calibration in Deep Learning. In CVPR Workshops (Vol. 2, No. 7).
+
[4] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
+
[5] Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. and Schön, T., 2019, April. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467). PMLR.
+
[6] Allen-Zhu, Z., Li, Y. and Liang, Y., 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32.

## Further References
+ <!---
*Add any useful further references.*
+ -->
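
To make the binned estimator described in the metric card above concrete, the snippet below is a minimal equal-range sketch of ECE: it groups top-1 confidences into `n_bins` equal-width bins, estimates the conditional accuracy in each bin, and averages the L^p gap between accuracy and confidence, weighted by bin frequency. It is an illustration only, not the module's implementation: the helper name `binned_ece` is hypothetical, it mirrors only the `n_bins` and `p` keyword arguments, and it uses each bin's mean confidence where the module offers `upper-edge`/`center` proxies.

```python
import numpy as np

def binned_ece(predictions, references, n_bins=10, p=1):
    """Equal-range binned ECE sketch (illustrative; not the module's exact algorithm)."""
    predictions = np.asarray(predictions, dtype=float)  # (N, K) posterior probabilities
    references = np.asarray(references, dtype=int)      # (N,) integer class labels
    confidences = predictions.max(axis=-1)              # top-1 posterior \hat{p}
    correct = (predictions.argmax(axis=-1) == references).astype(float)

    # equal-range bins over [0, 1]; each sample is assigned by its confidence
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        weight = mask.mean()             # fraction of samples falling in this bin
        acc = correct[mask].mean()       # estimated conditional expectation of correctness
        conf = confidences[mask].mean()  # average confidence in the bin (proxy)
        ece += weight * np.abs(acc - conf) ** p
    return ece ** (1.0 / p)

# toy inputs mirroring the "How to Use" snippet above
print(binned_ece([[0.6, 0.2, 0.2], [0, 0.95, 0.05], [0.7, 0.1, 0.2]], [0, 1, 2]))
```

Because the module also exposes `scheme`, `proxy`, and `bin_range` options that this sketch omits, its result on the same inputs need not match `evaluate.load("jordyvl/ece")` exactly.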
app.py CHANGED
@@ -1,7 +1,8 @@
import evaluate
- import numpy as np
-

from evaluate.utils import launch_gradio_widget
+
module = evaluate.load("jordyvl/ece")
- launch_gradio_widget(module)
+ launch_gradio_widget(module)
+
+
ece.py CHANGED
@@ -63,13 +63,10 @@ Returns
    Expected calibration error (ECE), float.

Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.
-
    >>> my_new_module = evaluate.load("jordyvl/ece")
    >>> results = my_new_module.compute(references=[0, 1, 2], predictions=[[0.6, 0.2, 0.2], [0, 0.95, 0.05], [0.7, 0.1, 0.2]])
    >>> print(results)
-     {'ECE': 1.0}
+     {'ECE': 0.1333333333333334}
"""

# TODO: Define external resources urls if needed
local_app.py CHANGED
@@ -1,23 +1,33 @@
import evaluate
+ import json
+ import sys
+ from pathlib import Path
+ import gradio as gr
+
import numpy as np
import pandas as pd
import ast
- import json
- import gradio as gr
- from evaluate.utils import launch_gradio_widget
- from ece import ECE
+ from ece import ECE  # loads local instead
+

import matplotlib.pyplot as plt
+
+ """
import seaborn as sns
sns.set_style('white')
sns.set_context("paper", font_scale=1)  # 2
+ """
# plt.rcParams['figure.figsize'] = [10, 7]
- plt.rcParams['figure.dpi'] = 300
- plt.switch_backend('agg')  #; https://stackoverflow.com/questions/14694408/runtimeerror-main-thread-is-not-in-main-loop
+ plt.rcParams["figure.dpi"] = 300
+ plt.switch_backend(
+     "agg"
+ )  # ; https://stackoverflow.com/questions/14694408/runtimeerror-main-thread-is-not-in-main-loop

sliders = [
    gr.Slider(0, 100, value=10, label="n_bins"),
-     gr.Slider(0, 100, value=None, label="bin_range", visible=False),  #DEV: need to have a double slider
+     gr.Slider(
+         0, 100, value=None, label="bin_range", visible=False
+     ),  # DEV: need to have a double slider
    gr.Dropdown(choices=["equal-range", "equal-mass"], value="equal-range", label="scheme"),
    gr.Dropdown(choices=["upper-edge", "center"], value="upper-edge", label="proxy"),
    gr.Dropdown(choices=[1, 2, np.inf], value=1, label="p"),
@@ -42,6 +52,7 @@ component.value = [
sample_data = [[component] + slider_defaults]  ##json.dumps(df)


+ local_path = Path(sys.path[0])
metric = ECE()
# module = evaluate.load("jordyvl/ece")
# launch_gradio_widget(module)
@@ -50,6 +61,7 @@ metric = ECE()
Switch inputs and compute_fn
"""

+
def reliability_plot(results):
    fig = plt.figure()
    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
@@ -77,7 +89,7 @@ def reliability_plot(results):
    bin_freqs[anindices] = results["bin_freq"]
    ax2.hist(results["y_bar"], results["y_bar"], weights=bin_freqs)

-     #widths = np.diff(results["y_bar"])
+     # widths = np.diff(results["y_bar"])
    for j, bin in enumerate(results["y_bar"]):
        perfect = results["y_bar"][j]
        empirical = results["p_bar"][j]
@@ -86,31 +98,30 @@ def reliability_plot(results):
            continue

        ax1.bar([perfect], height=[empirical], width=-ranged[j], align="edge", color="lightblue")
-
+         """
        if perfect == empirical:
            continue
-
-     acc_plt = ax2.axvline(
-         x=results["accuracy"], ls="solid", lw=3, c="black", label="Accuracy"
-     )
+         """
+     acc_plt = ax2.axvline(x=results["accuracy"], ls="solid", lw=3, c="black", label="Accuracy")
    conf_plt = ax2.axvline(
        x=results["p_bar_cont"], ls="dotted", lw=3, c="#444", label="Avg. confidence"
    )
    ax2.legend(handles=[acc_plt, conf_plt])

-     #Bin differences
+     # Bin differences
    ax1.set_ylabel("Conditional Expectation")
-     ax1.set_ylim([-0.05, 1.05])  #respective to bin range
+     ax1.set_ylim([-0.05, 1.05])  # respective to bin range
    ax1.legend(loc="lower right")
    ax1.set_title("Reliability Diagram")

-     #Bin frequencies
+     # Bin frequencies
    ax2.set_xlabel("Confidence")
    ax2.set_ylabel("Count")
-     ax2.legend(loc="upper left")  #, ncol=2
+     ax2.legend(loc="upper left")  # , ncol=2
    plt.tight_layout()
    return fig

+
def compute_and_plot(data, n_bins, bin_range, scheme, proxy, p):
    # DEV: check on invalid datatypes with better warnings

@@ -127,7 +138,6 @@ def compute_and_plot(data, n_bins, bin_range, scheme, proxy, p):
        predictions,
        references,
        n_bins=n_bins,
-         # bin_range=None,  # not needed
        scheme=scheme,
        proxy=proxy,
        p=p,
@@ -135,7 +145,7 @@
    )

    plot = reliability_plot(results)
-     return results["ECE"], plot  #plt.gcf()
+     return results["ECE"], plot  # plt.gcf()


outputs = [gr.outputs.Textbox(label="ECE"), gr.Plot(label="Reliability diagram")]
@@ -145,6 +155,7 @@ iface = gr.Interface(
    inputs=[component] + sliders,
    outputs=outputs,
    description=metric.info.description,
-     article=metric.info.citation,
+     article=evaluate.utils.parse_readme(local_path / "README.md"),
+     title=f"Metric: {metric.name}",
    # examples=sample_data;  # ValueError: Examples argument must either be a directory or a nested list, where each sublist represents a set of inputs.
).launch()
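
The reliability diagram drawn by `reliability_plot` can also be approximated outside the Gradio app. The sketch below is a rough stand-in, not the commit's function: it does not consume the module's `results` dictionary (`y_bar`, `p_bar`, `bin_freq`), but instead re-bins top-1 confidences the same way as the ECE sketch above and plots per-bin conditional accuracy against confidence over a confidence histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(predictions, references, n_bins=10):
    """Rough reliability-diagram sketch (illustrative; not local_app.py's reliability_plot)."""
    predictions = np.asarray(predictions, dtype=float)
    references = np.asarray(references, dtype=int)
    conf = predictions.max(-1)                                  # top-1 confidence per sample
    correct = (predictions.argmax(-1) == references).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)                   # equal-range bins
    centers = (edges[:-1] + edges[1:]) / 2
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    acc = np.array([
        correct[bin_ids == b].mean() if (bin_ids == b).any() else 0.0
        for b in range(n_bins)
    ])

    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, gridspec_kw={"height_ratios": [2, 1]})
    ax1.bar(centers, acc, width=1.0 / n_bins, color="lightblue", edgecolor="gray")  # per-bin accuracy
    ax1.plot([0, 1], [0, 1], ls="--", c="gray")  # perfect-calibration diagonal
    ax1.set_ylabel("Conditional Expectation")
    ax1.set_title("Reliability Diagram")
    ax2.hist(conf, bins=edges)                   # how many samples fall in each confidence bin
    ax2.set_xlabel("Confidence")
    ax2.set_ylabel("Count")
    fig.tight_layout()
    return fig

fig = reliability_diagram([[0.6, 0.2, 0.2], [0, 0.95, 0.05], [0.7, 0.1, 0.2]], [0, 1, 2])
```

The app's version additionally marks overall accuracy and average confidence with vertical lines, which is what the `acc_plt`/`conf_plt` axvline calls in the diff above do.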