jordyvl committed
Commit 0736615
1 Parent(s): dea6ecb

small update to include visual definition of ECE

Files changed (5)
  1. ECE_definition.jpg +0 -0
  2. README.md +40 -11
  3. app.py +4 -3
  4. ece.py +1 -4
  5. local_app.py +31 -20
ECE_definition.jpg ADDED
README.md CHANGED
@@ -14,30 +14,32 @@ pinned: false

# Metric Card for ECE

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
## Metric Description
<!---
*Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
-->
- Expected Calibration Error `ECE` is a standard metric to evaluate top-1 prediction miscalibration.
+
+ Expected Calibration Error *ECE* is a popular metric to evaluate top-1 prediction miscalibration.
It measures the L^p norm difference between a model’s posterior and the true likelihood of being correct.
- ```
- $$ECE_p(f)^p = \mathbb{E}_{(X,Y)}\left[\|\mathbb{E}[Y = \hat{y} \mid f(X) = \hat{p}] - f(X)\|^p_p\right]$$, where $\hat{y} = \arg\max_{y'} [f(X)]_{y'}$ is a class prediction with associated posterior probability $\hat{p} = \max_{y'} [f(X)]_{y'}$.
- ```
- It is generally implemented as a binned estimator that discretizes predicted probabilities into a range of possible values (bins) for which conditional expectation can be estimated.

+ ![ECE definition](./ECE_definition.jpg)
+
+ It is generally implemented as a binned estimator that discretizes predicted probabilities into ranges of possible values (bins) for which conditional expectation can be estimated.
- As a metric of calibration *error*, it holds that the lower, the better calibrated a model is.
- For valid model comparisons, ensure to use the same keyword arguments.


## How to Use
<!---
*Give general statement of how to use the metric*
*Provide simplest possible example for using the metric*
- -->
-
+ -->
+ ```
+ >>> metric = evaluate.load("jordyvl/ece")
+ >>> results = metric.compute(references=[0, 1, 2], predictions=[[0.6, 0.2, 0.2], [0, 0.95, 0.05], [0.7, 0.1, 0.2]])
+ >>> print(results)
+ {'ECE': 0.1333333333333334}
+ ```

+ For valid model comparisons, ensure to use the same keyword arguments.


### Inputs
@@ -55,12 +57,32 @@ For valid model comparisons, ensure to use the same keyword arguments.
#### Values from Popular Papers
*Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-->
+ As a metric of calibration *error*, it holds that the lower, the better calibrated a model is. Depending on the L^p norm, ECE takes values either between 0 and 1 (p=2) or between 0 and +∞.
+ The module returns a dictionary with a single key-value pair, e.g., {"ECE": 0.64}.


### Examples
<!---
*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
-->
+ ```
+ N = 10  # N evaluation instances {(x_i,y_i)}_{i=1}^N
+ K = 5  # K class problem
+
+ def random_mc_instance(concentration=1, onehot=False):
+     reference = np.argmax(
+         np.random.dirichlet(([concentration for _ in range(K)])), -1
+     )  # class targets
+     prediction = np.random.dirichlet(([concentration for _ in range(K)]))  # probabilities
+     if onehot:
+         reference = np.eye(K)[np.argmax(reference, -1)]
+     return reference, prediction
+
+ references, predictions = list(zip(*[random_mc_instance() for i in range(N)]))
+ references = np.array(references, dtype=np.int64)
+ predictions = np.array(predictions, dtype=np.float32)
+ res = ECE()._compute(predictions, references)  # {'ECE': float}
+ ```


## Limitations and Bias
@@ -71,11 +93,18 @@ See [3],[4] and [5].

## Citation
[1] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
+
[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017, July. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.
+
[3] Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. and Tran, D., 2019, June. Measuring Calibration in Deep Learning. In CVPR Workshops (Vol. 2, No. 7).
+
[4] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
+
[5] Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. and Schön, T., 2019, April. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467). PMLR.
+
[6] Allen-Zhu, Z., Li, Y. and Liang, Y., 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32.

## Further References
+ <!---
*Add any useful further references.*
+ -->
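
To make the binned estimator described in the metric card above concrete, the snippet below is a minimal equal-range sketch of ECE: it groups top-1 confidences into `n_bins` equal-width bins, estimates the conditional accuracy in each bin, and averages the L^p gap between accuracy and confidence, weighted by bin frequency. It is an illustration only, not the module's implementation: the helper name `binned_ece` is hypothetical, it mirrors only the `n_bins` and `p` keyword arguments, and it uses each bin's mean confidence where the module offers `upper-edge`/`center` proxies.

```python
import numpy as np

def binned_ece(predictions, references, n_bins=10, p=1):
    """Equal-range binned ECE sketch (illustrative; not the module's exact algorithm)."""
    predictions = np.asarray(predictions, dtype=float)  # (N, K) posterior probabilities
    references = np.asarray(references, dtype=int)      # (N,) integer class labels
    confidences = predictions.max(axis=-1)              # top-1 posterior \hat{p}
    correct = (predictions.argmax(axis=-1) == references).astype(float)

    # equal-range bins over [0, 1]; each sample is assigned by its confidence
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        weight = mask.mean()             # fraction of samples falling in this bin
        acc = correct[mask].mean()       # estimated conditional expectation of correctness
        conf = confidences[mask].mean()  # average confidence in the bin (proxy)
        ece += weight * np.abs(acc - conf) ** p
    return ece ** (1.0 / p)

# toy inputs mirroring the "How to Use" snippet above
print(binned_ece([[0.6, 0.2, 0.2], [0, 0.95, 0.05], [0.7, 0.1, 0.2]], [0, 1, 2]))
```

Because the module also exposes `scheme`, `proxy`, and `bin_range` options that this sketch omits, its result on the same inputs need not match `evaluate.load("jordyvl/ece")` exactly.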
app.py CHANGED
@@ -1,7 +1,8 @@
import evaluate
- import numpy as np
-

from evaluate.utils import launch_gradio_widget
+
module = evaluate.load("jordyvl/ece")
- launch_gradio_widget(module)
+ launch_gradio_widget(module)
+
+
ece.py CHANGED
@@ -63,13 +63,10 @@ Returns
    Expected calibration error (ECE), float.

Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.
-
    >>> my_new_module = evaluate.load("jordyvl/ece")
    >>> results = my_new_module.compute(references=[0, 1, 2], predictions=[[0.6, 0.2, 0.2], [0, 0.95, 0.05], [0.7, 0.1, 0.2]])
    >>> print(results)
-     {'ECE': 1.0}
+     {'ECE': 0.1333333333333334}
"""

# TODO: Define external resources urls if needed
local_app.py CHANGED
@@ -1,23 +1,33 @@
import evaluate
+ import json
+ import sys
+ from pathlib import Path
+ import gradio as gr
+
import numpy as np
import pandas as pd
import ast
- import json
- import gradio as gr
- from evaluate.utils import launch_gradio_widget
- from ece import ECE
+ from ece import ECE  # loads local instead
+

import matplotlib.pyplot as plt
+
+ """
import seaborn as sns
sns.set_style('white')
sns.set_context("paper", font_scale=1)  # 2
+ """
# plt.rcParams['figure.figsize'] = [10, 7]
- plt.rcParams['figure.dpi'] = 300
- plt.switch_backend('agg')  #; https://stackoverflow.com/questions/14694408/runtimeerror-main-thread-is-not-in-main-loop
+ plt.rcParams["figure.dpi"] = 300
+ plt.switch_backend(
+     "agg"
+ )  # ; https://stackoverflow.com/questions/14694408/runtimeerror-main-thread-is-not-in-main-loop

sliders = [
    gr.Slider(0, 100, value=10, label="n_bins"),
-     gr.Slider(0, 100, value=None, label="bin_range", visible=False),  #DEV: need to have a double slider
+     gr.Slider(
+         0, 100, value=None, label="bin_range", visible=False
+     ),  # DEV: need to have a double slider
    gr.Dropdown(choices=["equal-range", "equal-mass"], value="equal-range", label="scheme"),
    gr.Dropdown(choices=["upper-edge", "center"], value="upper-edge", label="proxy"),
    gr.Dropdown(choices=[1, 2, np.inf], value=1, label="p"),
@@ -42,6 +52,7 @@ component.value = [
sample_data = [[component] + slider_defaults]  ##json.dumps(df)


+ local_path = Path(sys.path[0])
metric = ECE()
# module = evaluate.load("jordyvl/ece")
# launch_gradio_widget(module)
@@ -50,6 +61,7 @@ metric = ECE()
Switch inputs and compute_fn
"""

+
def reliability_plot(results):
    fig = plt.figure()
    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
@@ -77,7 +89,7 @@ def reliability_plot(results):
    bin_freqs[anindices] = results["bin_freq"]
    ax2.hist(results["y_bar"], results["y_bar"], weights=bin_freqs)

-     #widths = np.diff(results["y_bar"])
+     # widths = np.diff(results["y_bar"])
    for j, bin in enumerate(results["y_bar"]):
        perfect = results["y_bar"][j]
        empirical = results["p_bar"][j]
@@ -86,31 +98,30 @@ def reliability_plot(results):
            continue

        ax1.bar([perfect], height=[empirical], width=-ranged[j], align="edge", color="lightblue")
-
+         """
        if perfect == empirical:
            continue
-
-     acc_plt = ax2.axvline(
-         x=results["accuracy"], ls="solid", lw=3, c="black", label="Accuracy"
-     )
+         """
+     acc_plt = ax2.axvline(x=results["accuracy"], ls="solid", lw=3, c="black", label="Accuracy")
    conf_plt = ax2.axvline(
        x=results["p_bar_cont"], ls="dotted", lw=3, c="#444", label="Avg. confidence"
    )
    ax2.legend(handles=[acc_plt, conf_plt])

-     #Bin differences
+     # Bin differences
    ax1.set_ylabel("Conditional Expectation")
-     ax1.set_ylim([-0.05, 1.05])  #respective to bin range
+     ax1.set_ylim([-0.05, 1.05])  # respective to bin range
    ax1.legend(loc="lower right")
    ax1.set_title("Reliability Diagram")

-     #Bin frequencies
+     # Bin frequencies
    ax2.set_xlabel("Confidence")
    ax2.set_ylabel("Count")
-     ax2.legend(loc="upper left")  #, ncol=2
+     ax2.legend(loc="upper left")  # , ncol=2
    plt.tight_layout()
    return fig

+
def compute_and_plot(data, n_bins, bin_range, scheme, proxy, p):
    # DEV: check on invalid datatypes with better warnings

@@ -127,7 +138,6 @@ def compute_and_plot(data, n_bins, bin_range, scheme, proxy, p):
        predictions,
        references,
        n_bins=n_bins,
-         # bin_range=None,  # not needed
        scheme=scheme,
        proxy=proxy,
        p=p,
@@ -135,7 +145,7 @@
    )

    plot = reliability_plot(results)
-     return results["ECE"], plot  #plt.gcf()
+     return results["ECE"], plot  # plt.gcf()


outputs = [gr.outputs.Textbox(label="ECE"), gr.Plot(label="Reliability diagram")]
@@ -145,6 +155,7 @@ iface = gr.Interface(
    inputs=[component] + sliders,
    outputs=outputs,
    description=metric.info.description,
-     article=metric.info.citation,
+     article=evaluate.utils.parse_readme(local_path / "README.md"),
+     title=f"Metric: {metric.name}",
    # examples=sample_data;  # ValueError: Examples argument must either be a directory or a nested list, where each sublist represents a set of inputs.
).launch()
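
The reliability diagram drawn by `reliability_plot` can also be approximated outside the Gradio app. The sketch below is a rough stand-in, not the commit's function: it does not consume the module's `results` dictionary (`y_bar`, `p_bar`, `bin_freq`), but instead re-bins top-1 confidences the same way as the ECE sketch above and plots per-bin conditional accuracy against confidence over a confidence histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(predictions, references, n_bins=10):
    """Rough reliability-diagram sketch (illustrative; not local_app.py's reliability_plot)."""
    predictions = np.asarray(predictions, dtype=float)
    references = np.asarray(references, dtype=int)
    conf = predictions.max(-1)                                  # top-1 confidence per sample
    correct = (predictions.argmax(-1) == references).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)                   # equal-range bins
    centers = (edges[:-1] + edges[1:]) / 2
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    acc = np.array([
        correct[bin_ids == b].mean() if (bin_ids == b).any() else 0.0
        for b in range(n_bins)
    ])

    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, gridspec_kw={"height_ratios": [2, 1]})
    ax1.bar(centers, acc, width=1.0 / n_bins, color="lightblue", edgecolor="gray")  # per-bin accuracy
    ax1.plot([0, 1], [0, 1], ls="--", c="gray")  # perfect-calibration diagonal
    ax1.set_ylabel("Conditional Expectation")
    ax1.set_title("Reliability Diagram")
    ax2.hist(conf, bins=edges)                   # how many samples fall in each confidence bin
    ax2.set_xlabel("Confidence")
    ax2.set_ylabel("Count")
    fig.tight_layout()
    return fig

fig = reliability_diagram([[0.6, 0.2, 0.2], [0, 0.95, 0.05], [0.7, 0.1, 0.2]], [0, 1, 2])
```

The app's version additionally marks overall accuracy and average confidence with vertical lines, which is what the `acc_plt`/`conf_plt` axvline calls in the diff above do.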