scores: A Python package for verifying and evaluating models and predictions with xarray and pandas (2024)

Tennessee Leeuwenburg, Bureau of Meteorology, Australia (tennessee.leeuwenburg@bom.gov.au)

Nicholas Loveday, Bureau of Meteorology, Australia

Elizabeth E. Ebert, Bureau of Meteorology, Australia

Harrison Cook, Bureau of Meteorology, Australia

Mohammadreza Khanarmuei, Bureau of Meteorology, Australia

Robert J. Taggart, Bureau of Meteorology, Australia

Nikeeth Ramanathan, Bureau of Meteorology, Australia

Maree Carroll, Bureau of Meteorology, Australia

Stephanie Chong, Independent Contributor, Australia

Aidan Griffiths, work undertaken while at the Bureau of Meteorology, Australia

John Sharples, Bureau of Meteorology, Australia

1 Summary

scores is a Python package containing mathematical functions for the verification, evaluation and optimisation of forecasts, predictions or models. It primarily supports the geoscience communities; in particular, the meteorological, climatological and oceanographic communities. In addition to supporting the Earth system science communities, it also has wide potential application in machine learning and other domains such as economics.

scores not only includes common scores (e.g. Mean Absolute Error), it also includes novel scores not commonly found elsewhere (e.g. FIxed Risk Multicategorical (FIRM) score, Flip-Flop Index), complex scores (e.g. threshold-weighted continuous ranked probability score), and statistical tests (such as the Diebold-Mariano test). It also contains isotonic regression, which is becoming an increasingly important tool in forecast verification and can be used to generate stable reliability diagrams. Additionally, it provides pre-processing tools for preparing data for scores in a variety of formats including cumulative distribution functions (CDF). At the time of writing, scores includes over 50 metrics, statistical techniques and data processing tools.
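To make the flavour of these metrics concrete, the following is a minimal pure-Python sketch of the CRPS for an ensemble forecast, using the standard estimator from Gneiting & Raftery (2007). It is illustrative only: the scores package itself provides vectorised, xarray-based implementations, and the function name here is not the package's API.

```python
def crps_ensemble(members, obs):
    """CRPS of an ensemble forecast against a scalar observation, via the
    Gneiting & Raftery (2007) estimator:
    CRPS = mean|x_i - y| - (1 / (2 m^2)) * sum_ij |x_i - x_j|."""
    m = len(members)
    # Mean absolute difference between ensemble members and the observation.
    term1 = sum(abs(x - obs) for x in members) / m
    # Mean absolute difference over all pairs of ensemble members.
    term2 = sum(abs(x - y) for x in members for y in members) / (2 * m * m)
    return term1 - term2

# A sharp, well-centred ensemble scores better (lower) than a biased one.
print(crps_ensemble([0.9, 1.0, 1.1], obs=1.0))  # small value
print(crps_ensemble([2.9, 3.0, 3.1], obs=1.0))  # larger value
```

For a single-member "ensemble" the score reduces to the absolute error, which is one reason the CRPS is a natural generalisation of the MAE to probabilistic forecasts.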

All of the scores and statistical techniques in this package have undergone a thorough scientific and software review. Every score has a companion Jupyter Notebook tutorial that demonstrates its use in practice.

scores primarily supports xarray datatypes for Earth system data, allowing it to work with NetCDF4, HDF5, Zarr and GRIB data sources among others. scores uses Dask for scaling and performance. It has expanding support for pandas.

The software repository can be found at https://github.com/nci/scores/.

2 Statement of Need

The purpose of this software is (a) to mathematically verify and validate models and predictions and (b) to foster research into new scores and metrics.

2.1 Key Benefits of scores

In order to meet the needs of researchers and other users, scores provides the following key benefits.

Data Handling

  • Works with n-dimensional data (e.g., geospatial, vertical and temporal dimensions) for both point-based and gridded data. scores can effectively handle the dimensionality, data size and data structures commonly used for:

    • gridded Earth system data (e.g. numerical weather prediction models)

    • tabular, point, latitude/longitude or site-based data (e.g. forecasts for specific locations).

  • Handles missing data, masking of data and weighting of results.

  • Supports xarray (Hoyer & Hamman, 2017) datatypes, and works with NetCDF4 (Unidata, 2024), HDF5 (The HDF Group & Koziol, 2020), Zarr (Miles et al., 2020) and GRIB (World Meteorological Organization, 2024) data sources among others.
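The missing-data, masking and weighting behaviour listed above can be sketched in a few lines of pure Python, with math.nan standing in for masked values. This is illustrative only: scores operates on xarray objects and handles these cases in a vectorised way, and `weighted_mae` below is a hypothetical name for this sketch, not the package's API.

```python
import math

def weighted_mae(fcst, obs, weights=None):
    """Weighted mean absolute error that skips pairs with missing data.
    `weighted_mae` is a hypothetical illustration, not the scores API."""
    if weights is None:
        weights = [1.0] * len(fcst)
    total_err, total_w = 0.0, 0.0
    for f, o, w in zip(fcst, obs, weights):
        if math.isnan(f) or math.isnan(o):
            continue  # masked / missing points contribute nothing
        total_err += w * abs(f - o)
        total_w += w
    return total_err / total_w

fcst = [1.0, 2.0, math.nan, 4.0]
obs = [1.5, 2.0, 3.0, 3.0]
print(weighted_mae(fcst, obs))                          # NaN pair skipped
print(weighted_mae(fcst, obs, weights=[1, 1, 1, 3]))    # e.g. area weights
```

In practice, weights of this kind are often latitude-dependent area weights for gridded data, which is why weighting and masking are treated as first-class concerns.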

Usability

  • A companion Jupyter Notebook (Jupyter Team, 2024) tutorial for each metric and statistical test that demonstrates its use in practice.

  • Novel scores not commonly found elsewhere (e.g. FIRM (Taggart, Loveday, & Griffiths, 2022), Flip-Flop Index (Griffiths, Foley, Ioannou, & Leeuwenburg, 2019; Griffiths, Loveday, Price, Foley, & McKelvie, 2021)).

  • All scores and statistical techniques have undergone a thorough scientific and software review.

  • An area specifically to hold emerging scores which are still undergoing research and development. This provides a clear mechanism for people to share, access and collaborate on new scores, and be able to easily re-use versioned implementations of those scores.

Compatibility

  • Highly modular - provides its own implementations, avoids extensive dependencies and offers a consistent API.

  • Easy to integrate and use in a wide variety of environments. It has been used on workstations, servers and in high performance computing (supercomputing) environments.

  • Maintains 100% automated test coverage.

  • Uses Dask (Dask Development Team, 2016) for scaling and performance.

  • Expanding support for pandas (McKinney, 2010; The pandas development team, 2024).

2.2 Metrics, Statistical Techniques and Data Processing Tools Included in scores

At the time of writing, scores includes over 50 metrics, statistical techniques and data processing tools. For an up-to-date list, please see the scores documentation.

The ongoing development roadmap includes the addition of more metrics,tools, and statistical tests.

A selection of the functions included in scores, by category:

  • Continuous: Scores for evaluating single-valued continuous forecasts. Includes Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Additive Bias, Multiplicative Bias, Pearson's Correlation Coefficient, Flip-Flop Index (Griffiths et al., 2019, 2021), Quantile Loss, and the Murphy Score (Ehm, Gneiting, Jordan, & Krüger, 2016).

  • Probability: Scores for evaluating forecasts that are expressed as predictive distributions, ensembles, and probabilities of binary events. Includes the Brier Score (Brier, 1950), Continuous Ranked Probability Score (CRPS) for Cumulative Distribution Functions (CDFs) (including threshold-weighting, see Gneiting & Ranjan (2011)), CRPS for ensembles (Ferro, 2013; Gneiting & Raftery, 2007), Receiver Operating Characteristic (ROC), and Isotonic Regression (reliability diagrams) (Dimitriadis, Gneiting, & Jordan, 2021).

  • Categorical: Scores for evaluating forecasts of categories. Includes Probability of Detection (POD), Probability of False Detection (POFD), False Alarm Ratio (FAR), Success Ratio, Accuracy, Peirce's Skill Score (Peirce, 1884), Critical Success Index (CSI), Gilbert Skill Score (Gilbert, 1884), Heidke Skill Score, Odds Ratio, Odds Ratio Skill Score, F1 Score, Symmetric Extremal Dependence Index (Ferro & Stephenson, 2011), and the FIxed Risk Multicategorical (FIRM) Score (Taggart et al., 2022).

  • Spatial: Scores that take into account spatial structure. Includes the Fractions Skill Score (Roberts & Lean, 2008).

  • Statistical Tests: Tools to conduct statistical tests and generate confidence intervals. Includes the Diebold-Mariano test (Diebold & Mariano, 1995) with both the Harvey, Leybourne, & Newbold (1997) and Hering & Genton (2011) modifications.

  • Processing Tools: Tools to pre-process data. Includes data matching, discretisation, and cumulative density function manipulation.
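Several of the categorical scores above derive from a 2x2 contingency table of binary forecasts against binary observations. As a minimal pure-Python sketch (illustrative only; scores computes these on xarray data, and `binary_scores` is a hypothetical name, not the package's API), three of them can be computed as:

```python
def binary_scores(fcst, obs):
    """POD, FAR and CSI for paired binary (0/1) forecasts and observations,
    using the standard contingency-table definitions."""
    hits = sum(1 for f, o in zip(fcst, obs) if f == 1 and o == 1)
    misses = sum(1 for f, o in zip(fcst, obs) if f == 0 and o == 1)
    false_alarms = sum(1 for f, o in zip(fcst, obs) if f == 1 and o == 0)
    return {
        "POD": hits / (hits + misses),                 # probability of detection
        "FAR": false_alarms / (hits + false_alarms),   # false alarm ratio
        "CSI": hits / (hits + misses + false_alarms),  # critical success index
    }

# Eight paired forecast/observation events, e.g. "rain tomorrow: yes/no".
fcst = [1, 1, 0, 1, 0, 0, 1, 0]
obs = [1, 0, 0, 1, 1, 0, 1, 0]
print(binary_scores(fcst, obs))
```

Correct negatives (f == 0 and o == 0) do not enter these three scores, which is why CSI is often preferred for rare events where correct negatives would otherwise dominate.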

2.3 Use in Academic Work

In 2015, the Australian Bureau of Meteorology began developing a new verification system called Jive, which became operational in 2022. For a description of Jive see Loveday, Griffiths, et al. (2024). The Jive verification metrics have been used to support several publications (Foley & Loveday, 2020; Griffiths, Jack, Foley, Ioannou, & Liu, 2017; Taggart, 2022a, 2022b, 2022c). scores has arisen from the Jive verification system and was created to modularise the Jive verification functions and make them available as an open source package. scores also includes additional metrics that Jive does not contain.

scores has been used to explore user-focused approaches to evaluating probabilistic and categorical forecasts (Loveday, Taggart, & Khanarmuei, 2024).

2.4 Related Software Packages

There are multiple open source verification packages in a range of languages. Below is a comparison of scores to other open source Python verification packages. None of these include all of the metrics implemented in scores (and vice versa).

xskillscore (Bell et al., 2021) provides many but not all of the same functions as scores and does not have direct support for pandas. The Jupyter Notebook tutorials in scores cover a wider array of metrics.

climpred (Brady & Spring, 2021) uses xskillscore combined with data handling functionality, and is focused on ensemble forecasts for climate and weather. climpred makes some design choices related to data structure (specifically associated with climate modelling) which may not generalise effectively to broader use cases. Releasing scores separately allows the differing design philosophies to be considered by the community.

METplus (Brown et al., 2021) is a substantial verification system used by weather and climate model developers. METplus includes a database and a visualisation system, with Python and shell script wrappers to use the MET package for the calculation of scores. MET is implemented in C++ rather than Python. METplus is used as a system rather than providing a modular Python API.

Verif (Nipen, Stull, Lussana, & Seierstad, 2023) is a command line tool for generating verification plots, whereas scores provides a Python API for generating numerical scores.

Pysteps (Imhoff et al., 2023; Pulkkinen et al., 2019) is a package for short-term ensemble prediction systems, and includes a significant verification submodule with many useful verification scores. Pysteps does not provide a standalone verification API.

PyForecastTools (Morley & Burrell, 2020) is a Python package for model and forecast verification which supports dmarray rather than xarray data structures and does not include Jupyter Notebook tutorials.

3 Acknowledgements

We would like to thank Jason West and Robert Johnson from the Bureau of Meteorology for their feedback on an earlier version of this manuscript.

We would like to thank and acknowledge the National Computational Infrastructure (nci.org.au) for hosting the scores repository within their GitHub organisation.

References


    • Bell, R., Spring, A., Brady, R., Huang, A., Squire, D., Blackwood, Z., … Chegini, T. (2021). xarray-contrib/xskillscore: Metrics for verifying forecasts. Zenodo. https://doi.org/10.5281/zenodo.5173153
    • Brady, R. X., & Spring, A. (2021). climpred: Verification of weather and climate forecasts. Journal of Open Source Software, 6(59), 2781. https://doi.org/10.21105/joss.02781
    • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. https://doi.org/10.1175/1520-0493(1950)078%3C0001:vofeit%3E2.0.co;2
    • Brown, B., Jensen, T., Gotway, J. H., Bullock, R., Gilleland, E., Fowler, T., … Wolff, J. (2021). The Model Evaluation Tools (MET): More than a decade of community-supported forecast verification. Bulletin of the American Meteorological Society, 102(4), E782–E807. https://doi.org/10.1175/BAMS-D-19-0093.1
    • Dask Development Team. (2016). Dask: Library for dynamic task scheduling. Retrieved from http://dask.pydata.org
    • Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263. https://doi.org/10.3386/t0169
    • Dimitriadis, T., Gneiting, T., & Jordan, A. I. (2021). Stable reliability diagrams for probabilistic classifiers. Proceedings of the National Academy of Sciences, 118(8), e2016191118. https://doi.org/10.1073/pnas.2016191118
    • Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 78(3), 505–562. https://doi.org/10.1111/rssb.12154
    • Ferro, C. A. T. (2013). Fair scores for ensemble forecasts. Quarterly Journal of the Royal Meteorological Society, 140(683), 1917–1923. https://doi.org/10.1002/qj.2270
    • Ferro, C. A. T., & Stephenson, D. B. (2011). Extremal dependence indices: Improved verification measures for deterministic forecasts of rare binary events. Weather and Forecasting, 26(5), 699–713. https://doi.org/10.1175/WAF-D-10-05030.1
    • Foley, M., & Loveday, N. (2020). Comparison of single-valued forecasts in a user-oriented framework. Weather and Forecasting, 35(3), 1067–1080. https://doi.org/10.1175/waf-d-19-0248.1
    • Gilbert, G. K. (1884). Finley's tornado predictions. American Meteorological Journal, 1(5), 166–172.
    • Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. https://doi.org/10.1198/016214506000001437
    • Gneiting, T., & Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29(3), 411–422. https://doi.org/10.1198/jbes.2010.08110
    • Griffiths, D., Foley, M., Ioannou, I., & Leeuwenburg, T. (2019). Flip-Flop Index: Quantifying revision stability for fixed-event forecasts. Meteorological Applications, 26(1), 30–35. https://doi.org/10.1002/met.1732
    • Griffiths, D., Jack, H., Foley, M., Ioannou, I., & Liu, M. (2017). Advice for automation of forecasts: A framework. Bureau of Meteorology. https://doi.org/10.22499/4.0021
    • Griffiths, D., Loveday, N., Price, B., Foley, M., & McKelvie, A. (2021). Circular Flip-Flop Index: Quantifying revision stability of forecasts of direction. Journal of Southern Hemisphere Earth Systems Science, 71(3), 266–271. https://doi.org/10.1071/es21010
    • Harvey, D., Leybourne, S., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281–291. https://doi.org/10.1016/S0169-2070(96)00719-4
    • Hering, A. S., & Genton, M. G. (2011). Comparing spatial predictions. Technometrics, 53(4), 414–425. https://doi.org/10.1198/tech.2011.10136
    • Hoyer, S., & Hamman, J. (2017). xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1). https://doi.org/10.5334/jors.148
    • Imhoff, R. O., De Cruz, L., Dewettinck, W., Brauer, C. C., Uijlenhoet, R., Heeringen, K.-J. van, … Weerts, A. H. (2023). Scale-dependent blending of ensemble rainfall nowcasts and numerical weather prediction in the open-source pysteps library. Quarterly Journal of the Royal Meteorological Society, 149(753), 1335–1364. https://doi.org/10.1002/qj.4461
    • Jupyter Team. (2024). Jupyter interactive notebook. GitHub. Retrieved from https://github.com/jupyter/notebook
    • Loveday, N., Griffiths, D., Leeuwenburg, T., Taggart, R., Pagano, T. C., Cheng, G., … Nagpal, I. (2024). The Jive verification system and its transformative impact on weather forecasting operations. arXiv. https://doi.org/10.48550/arXiv.2404.18429
    • Loveday, N., Taggart, R., & Khanarmuei, M. (2024). A user-focused approach to evaluating probabilistic and categorical forecasts. Weather and Forecasting. https://doi.org/10.1175/waf-d-23-0201.1
    • McKinney, W. (2010). Data structures for statistical computing in Python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
    • Miles, A., Kirkham, J., Durant, M., Bourbeau, J., Onalan, T., Hamman, J., … Banihirwe, A. (2020). Zarr-developers/zarr-python: v2.4.0. Zenodo. https://doi.org/10.5281/zenodo.3773450
    • Morley, S., & Burrell, A. (2020). Drsteve/PyForecastTools: Version 1.1.1. Zenodo. https://doi.org/10.5281/zenodo.3764117
    • Nipen, T. N., Stull, R. B., Lussana, C., & Seierstad, I. A. (2023). Verif: A weather-prediction verification tool for effective product development. Bulletin of the American Meteorological Society, 104(9), E1610–E1618. https://doi.org/10.1175/bams-d-22-0253.1
    • Peirce, C. S. (1884). The numerical measure of the success of predictions. Science, ns-4(93), 453–454. https://doi.org/10.1126/science.ns-4.93.453.b
    • Pulkkinen, S., Nerini, D., Pérez Hortal, A. A., Velasco-Forero, C., Seed, A., Germann, U., & Foresti, L. (2019). Pysteps: An open-source Python library for probabilistic precipitation nowcasting (v1.0). Geoscientific Model Development, 12(10), 4185–4219. https://doi.org/10.5194/gmd-12-4185-2019
    • Roberts, N. M., & Lean, H. W. (2008). Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Monthly Weather Review, 136(1), 78–97. https://doi.org/10.1175/2007MWR2123.1
    • Taggart, R. (2022a). Assessing calibration when predictive distributions have discontinuities. Retrieved from http://www.bom.gov.au/research/publications/researchreports/BRR-064.pdf
    • Taggart, R. (2022b). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306–320. https://doi.org/10.1002/qj.4206
    • Taggart, R. (2022c). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201–231. https://doi.org/10.1214/21-ejs1957
    • Taggart, R., Loveday, N., & Griffiths, D. (2022). A scoring framework for tiered warnings and multicategorical forecasts based on fixed risk measures. Quarterly Journal of the Royal Meteorological Society, 148(744), 1389–1406. https://doi.org/10.1002/qj.4266
    • The HDF Group, & Koziol, Q. (2020). HDF5 version 1.12.0. https://doi.org/10.11578/dc.20180330.1
    • The pandas development team. (2024). pandas-dev/pandas: pandas. Zenodo. https://doi.org/10.5281/zenodo.10957263
    • Unidata. (2024). Network Common Data Form (NetCDF). UCAR/Unidata Program Center. https://doi.org/10.5065/D6H70CW6
    • World Meteorological Organization. (2024). WMO No. 306 FM 92 GRIB (edition 2). World Meteorological Organization. Retrieved from https://codes.wmo.int/grib2