scores: A Python package for verifying and evaluating models and predictions with xarray and pandas (2024)

Tennessee Leeuwenburg, Bureau of Meteorology, Australia (tennessee.leeuwenburg@bom.gov.au)

Nicholas Loveday, Bureau of Meteorology, Australia

Elizabeth E. Ebert, Bureau of Meteorology, Australia

Harrison Cook, Bureau of Meteorology, Australia

Mohammadreza Khanarmuei, Bureau of Meteorology, Australia

Robert J. Taggart, Bureau of Meteorology, Australia

Nikeeth Ramanathan, Bureau of Meteorology, Australia

Maree Carroll, Bureau of Meteorology, Australia

Stephanie Chong, Independent Contributor, Australia

Aidan Griffiths, work undertaken while at the Bureau of Meteorology, Australia

John Sharples, Bureau of Meteorology, Australia

1 Summary

scores is a Python package containing mathematical functions for the verification, evaluation and optimisation of forecasts, predictions or models. It primarily supports the geoscience communities; in particular, the meteorological, climatological and oceanographic communities. In addition to supporting the Earth system science communities, it also has wide potential application in machine learning and other domains such as economics.

scores not only includes common scores (e.g. Mean Absolute Error), it also includes novel scores not commonly found elsewhere (e.g. FIxed Risk Multicategorical (FIRM) score, Flip-Flop Index), complex scores (e.g. threshold-weighted continuous ranked probability score), and statistical tests (such as the Diebold-Mariano test). It also contains isotonic regression, which is becoming an increasingly important tool in forecast verification and can be used to generate stable reliability diagrams. Additionally, it provides pre-processing tools for preparing data for scores in a variety of formats including cumulative distribution functions (CDF). At the time of writing, scores includes over 50 metrics, statistical techniques and data processing tools.
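To make the flavour of these metrics concrete, the following is a minimal pure-Python sketch of the CRPS for an ensemble forecast, using the standard estimator from Gneiting & Raftery (2007). It is illustrative only: the scores package itself provides vectorised, xarray-based implementations, and the function name here is not the package's API.

```python
def crps_ensemble(members, obs):
    """CRPS of an ensemble forecast against a scalar observation, via the
    Gneiting & Raftery (2007) estimator:
    CRPS = mean|x_i - y| - (1 / (2 m^2)) * sum_ij |x_i - x_j|."""
    m = len(members)
    # Mean absolute difference between ensemble members and the observation.
    term1 = sum(abs(x - obs) for x in members) / m
    # Mean absolute difference over all pairs of ensemble members.
    term2 = sum(abs(x - y) for x in members for y in members) / (2 * m * m)
    return term1 - term2

# A sharp, well-centred ensemble scores better (lower) than a biased one.
print(crps_ensemble([0.9, 1.0, 1.1], obs=1.0))  # small value
print(crps_ensemble([2.9, 3.0, 3.1], obs=1.0))  # larger value
```

For a single-member "ensemble" the score reduces to the absolute error, which is one reason the CRPS is a natural generalisation of the MAE to probabilistic forecasts.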

All of the scores and statistical techniques in this package have undergone a thorough scientific and software review. Every score has a companion Jupyter Notebook tutorial that demonstrates its use in practice.

scores primarily supports xarray datatypes for Earth system data, allowing it to work with NetCDF4, HDF5, Zarr and GRIB data sources among others. scores uses Dask for scaling and performance. It has expanding support for pandas.

The software repository can be found at https://github.com/nci/scores/.

2 Statement of Need

The purpose of this software is (a) to mathematically verify and validate models and predictions and (b) to foster research into new scores and metrics.

2.1 Key Benefits of scores

In order to meet the needs of researchers and other users, scores provides the following key benefits.

Data Handling

  • Works with n-dimensional data (e.g., geospatial, vertical and temporal dimensions) for both point-based and gridded data. scores can effectively handle the dimensionality, data size and data structures commonly used for:

    • gridded Earth system data (e.g. numerical weather prediction models)

    • tabular, point, latitude/longitude or site-based data (e.g. forecasts for specific locations).

  • Handles missing data, masking of data and weighting of results.

  • Supports xarray (Hoyer & Hamman, 2017) datatypes, and works with NetCDF4 (Unidata, 2024), HDF5 (The HDF Group & Koziol, 2020), Zarr (Miles et al., 2020) and GRIB (World Meteorological Organization, 2024) data sources among others.
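The missing-data, masking and weighting behaviour listed above can be sketched in a few lines of pure Python, with math.nan standing in for masked values. This is illustrative only: scores operates on xarray objects and handles these cases in a vectorised way, and `weighted_mae` below is a hypothetical name for this sketch, not the package's API.

```python
import math

def weighted_mae(fcst, obs, weights=None):
    """Weighted mean absolute error that skips pairs with missing data.
    `weighted_mae` is a hypothetical illustration, not the scores API."""
    if weights is None:
        weights = [1.0] * len(fcst)
    total_err, total_w = 0.0, 0.0
    for f, o, w in zip(fcst, obs, weights):
        if math.isnan(f) or math.isnan(o):
            continue  # masked / missing points contribute nothing
        total_err += w * abs(f - o)
        total_w += w
    return total_err / total_w

fcst = [1.0, 2.0, math.nan, 4.0]
obs = [1.5, 2.0, 3.0, 3.0]
print(weighted_mae(fcst, obs))                          # NaN pair skipped
print(weighted_mae(fcst, obs, weights=[1, 1, 1, 3]))    # e.g. area weights
```

In practice, weights of this kind are often latitude-dependent area weights for gridded data, which is why weighting and masking are treated as first-class concerns.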

Usability

  • A companion Jupyter Notebook (Jupyter Team, 2024) tutorial for each metric and statistical test that demonstrates its use in practice.

  • Novel scores not commonly found elsewhere (e.g. FIRM (Taggart, Loveday, & Griffiths, 2022), Flip-Flop Index (Griffiths, Foley, Ioannou, & Leeuwenburg, 2019; Griffiths, Loveday, Price, Foley, & McKelvie, 2021)).

  • All scores and statistical techniques have undergone a thorough scientific and software review.

  • An area specifically to hold emerging scores which are still undergoing research and development. This provides a clear mechanism for people to share, access and collaborate on new scores, and be able to easily re-use versioned implementations of those scores.

Compatibility

  • Highly modular - provides its own implementations, avoids extensive dependencies and offers a consistent API.

  • Easy to integrate and use in a wide variety of environments. It has been used on workstations, servers and in high performance computing (supercomputing) environments.

  • Maintains 100% automated test coverage.

  • Uses Dask (Dask Development Team, 2016) for scaling and performance.

  • Expanding support for pandas (McKinney, 2010; The pandas development team, 2024).

2.2 Metrics, Statistical Techniques and Data Processing Tools Included in scores

At the time of writing, scores includes over 50 metrics, statistical techniques and data processing tools. For an up-to-date list, please see the scores documentation.

The ongoing development roadmap includes the addition of more metrics,tools, and statistical tests.

A selection of the functions included in scores, by category:

  • Continuous: Scores for evaluating single-valued continuous forecasts. Includes Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Additive Bias, Multiplicative Bias, Pearson's Correlation Coefficient, Flip-Flop Index (Griffiths et al., 2019, 2021), Quantile Loss, and the Murphy Score (Ehm, Gneiting, Jordan, & Krüger, 2016).

  • Probability: Scores for evaluating forecasts that are expressed as predictive distributions, ensembles, and probabilities of binary events. Includes the Brier Score (Brier, 1950), Continuous Ranked Probability Score (CRPS) for Cumulative Distribution Functions (CDFs) (including threshold-weighting, see Gneiting & Ranjan (2011)), CRPS for ensembles (Ferro, 2013; Gneiting & Raftery, 2007), Receiver Operating Characteristic (ROC), and Isotonic Regression (reliability diagrams) (Dimitriadis, Gneiting, & Jordan, 2021).

  • Categorical: Scores for evaluating forecasts of categories. Includes Probability of Detection (POD), Probability of False Detection (POFD), False Alarm Ratio (FAR), Success Ratio, Accuracy, Peirce's Skill Score (Peirce, 1884), Critical Success Index (CSI), Gilbert Skill Score (Gilbert, 1884), Heidke Skill Score, Odds Ratio, Odds Ratio Skill Score, F1 Score, Symmetric Extremal Dependence Index (Ferro & Stephenson, 2011), and the FIxed Risk Multicategorical (FIRM) Score (Taggart et al., 2022).

  • Spatial: Scores that take into account spatial structure. Includes the Fractions Skill Score (Roberts & Lean, 2008).

  • Statistical Tests: Tools to conduct statistical tests and generate confidence intervals. Includes the Diebold-Mariano test (Diebold & Mariano, 1995) with both the Harvey, Leybourne, & Newbold (1997) and Hering & Genton (2011) modifications.

  • Processing Tools: Tools to pre-process data. Includes data matching, discretisation, and cumulative density function manipulation.
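Several of the categorical scores above derive from a 2x2 contingency table of binary forecasts against binary observations. As a minimal pure-Python sketch (illustrative only; scores computes these on xarray data, and `binary_scores` is a hypothetical name, not the package's API), three of them can be computed as:

```python
def binary_scores(fcst, obs):
    """POD, FAR and CSI for paired binary (0/1) forecasts and observations,
    using the standard contingency-table definitions."""
    hits = sum(1 for f, o in zip(fcst, obs) if f == 1 and o == 1)
    misses = sum(1 for f, o in zip(fcst, obs) if f == 0 and o == 1)
    false_alarms = sum(1 for f, o in zip(fcst, obs) if f == 1 and o == 0)
    return {
        "POD": hits / (hits + misses),                 # probability of detection
        "FAR": false_alarms / (hits + false_alarms),   # false alarm ratio
        "CSI": hits / (hits + misses + false_alarms),  # critical success index
    }

# Eight paired forecast/observation events, e.g. "rain tomorrow: yes/no".
fcst = [1, 1, 0, 1, 0, 0, 1, 0]
obs = [1, 0, 0, 1, 1, 0, 1, 0]
print(binary_scores(fcst, obs))
```

Correct negatives (f == 0 and o == 0) do not enter these three scores, which is why CSI is often preferred for rare events where correct negatives would otherwise dominate.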

2.3 Use in Academic Work

In 2015, the Australian Bureau of Meteorology began developing a new verification system called Jive, which became operational in 2022. For a description of Jive see Loveday, Griffiths, et al. (2024). The Jive verification metrics have been used to support several publications (Foley & Loveday, 2020; Griffiths, Jack, Foley, Ioannou, & Liu, 2017; Taggart, 2022a, 2022b, 2022c). scores has arisen from the Jive verification system and was created to modularise the Jive verification functions and make them available as an open source package. scores also includes additional metrics that Jive does not contain.

scores has been used to explore user-focused approaches to evaluating probabilistic and categorical forecasts (Loveday, Taggart, & Khanarmuei, 2024).

2.4 Related Software Packages

There are multiple open source verification packages in a range of languages. Below is a comparison of scores to other open source Python verification packages. None of these include all of the metrics implemented in scores (and vice versa).

xskillscore (Bell et al., 2021) provides many but not all of the same functions as scores and does not have direct support for pandas. The Jupyter Notebook tutorials in scores cover a wider array of metrics.

climpred (Brady & Spring, 2021) uses xskillscore combined with data handling functionality, and is focused on ensemble forecasts for climate and weather. climpred makes some design choices related to data structure (specifically associated with climate modelling) which may not generalise effectively to broader use cases. Releasing scores separately allows the differing design philosophies to be considered by the community.

METplus (Brown et al., 2021) is a substantial verification system used by weather and climate model developers. METplus includes a database and a visualisation system, with Python and shell script wrappers to use the MET package for the calculation of scores. MET is implemented in C++ rather than Python. METplus is used as a system rather than providing a modular Python API.

Verif (Nipen, Stull, Lussana, & Seierstad, 2023) is a command line tool for generating verification plots, whereas scores provides a Python API for generating numerical scores.

Pysteps (Imhoff et al., 2023; Pulkkinen et al., 2019) is a package for short-term ensemble prediction systems, and includes a significant verification submodule with many useful verification scores. Pysteps does not provide a standalone verification API.

PyForecastTools (Morley & Burrell, 2020) is a Python package for model and forecast verification which supports dmarray rather than xarray data structures and does not include Jupyter Notebook tutorials.

3 Acknowledgements

We would like to thank Jason West and Robert Johnson from the Bureau of Meteorology for their feedback on an earlier version of this manuscript.

We would like to thank and acknowledge the National Computational Infrastructure (nci.org.au) for hosting the scores repository within their GitHub organisation.

References


    • Bell, R., Spring, A., Brady, R., Huang, A., Squire, D., Blackwood, Z., … Chegini, T. (2021). xarray-contrib/xskillscore: Metrics for verifying forecasts. Zenodo. https://doi.org/10.5281/zenodo.5173153
    • Brady, R. X., & Spring, A. (2021). climpred: Verification of weather and climate forecasts. Journal of Open Source Software, 6(59), 2781. https://doi.org/10.21105/joss.02781
    • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. https://doi.org/10.1175/1520-0493(1950)078%3C0001:vofeit%3E2.0.co;2
    • Brown, B., Jensen, T., Gotway, J. H., Bullock, R., Gilleland, E., Fowler, T., … Wolff, J. (2021). The Model Evaluation Tools (MET): More than a decade of community-supported forecast verification. Bulletin of the American Meteorological Society, 102(4), E782–E807. https://doi.org/10.1175/BAMS-D-19-0093.1
    • Dask Development Team. (2016). Dask: Library for dynamic task scheduling. Retrieved from http://dask.pydata.org
    • Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263. https://doi.org/10.3386/t0169
    • Dimitriadis, T., Gneiting, T., & Jordan, A. I. (2021). Stable reliability diagrams for probabilistic classifiers. Proceedings of the National Academy of Sciences, 118(8), e2016191118. https://doi.org/10.1073/pnas.2016191118
    • Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 78(3), 505–562. https://doi.org/10.1111/rssb.12154
    • Ferro, C. A. T. (2013). Fair scores for ensemble forecasts. Quarterly Journal of the Royal Meteorological Society, 140(683), 1917–1923. https://doi.org/10.1002/qj.2270
    • Ferro, C. A. T., & Stephenson, D. B. (2011). Extremal dependence indices: Improved verification measures for deterministic forecasts of rare binary events. Weather and Forecasting, 26(5), 699–713. https://doi.org/10.1175/WAF-D-10-05030.1
    • Foley, M., & Loveday, N. (2020). Comparison of single-valued forecasts in a user-oriented framework. Weather and Forecasting, 35(3), 1067–1080. https://doi.org/10.1175/waf-d-19-0248.1
    • Gilbert, G. K. (1884). Finley's tornado predictions. American Meteorological Journal, 1(5), 166–172.
    • Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. https://doi.org/10.1198/016214506000001437
    • Gneiting, T., & Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29(3), 411–422. https://doi.org/10.1198/jbes.2010.08110
    • Griffiths, D., Foley, M., Ioannou, I., & Leeuwenburg, T. (2019). Flip-Flop Index: Quantifying revision stability for fixed-event forecasts. Meteorological Applications, 26(1), 30–35. https://doi.org/10.1002/met.1732
    • Griffiths, D., Jack, H., Foley, M., Ioannou, I., & Liu, M. (2017). Advice for automation of forecasts: A framework. Bureau of Meteorology. https://doi.org/10.22499/4.0021
    • Griffiths, D., Loveday, N., Price, B., Foley, M., & McKelvie, A. (2021). Circular Flip-Flop Index: Quantifying revision stability of forecasts of direction. Journal of Southern Hemisphere Earth Systems Science, 71(3), 266–271. https://doi.org/10.1071/es21010
    • Harvey, D., Leybourne, S., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281–291. https://doi.org/10.1016/S0169-2070(96)00719-4
    • Hering, A. S., & Genton, M. G. (2011). Comparing spatial predictions. Technometrics, 53(4), 414–425. https://doi.org/10.1198/tech.2011.10136
    • Hoyer, S., & Hamman, J. (2017). xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1). https://doi.org/10.5334/jors.148
    • Imhoff, R. O., De Cruz, L., Dewettinck, W., Brauer, C. C., Uijlenhoet, R., Heeringen, K.-J. van, … Weerts, A. H. (2023). Scale-dependent blending of ensemble rainfall nowcasts and numerical weather prediction in the open-source pysteps library. Quarterly Journal of the Royal Meteorological Society, 149(753), 1335–1364. https://doi.org/10.1002/qj.4461
    • Jupyter Team. (2024). Jupyter interactive notebook. GitHub. Retrieved from https://github.com/jupyter/notebook
    • Loveday, N., Griffiths, D., Leeuwenburg, T., Taggart, R., Pagano, T. C., Cheng, G., … Nagpal, I. (2024). The Jive verification system and its transformative impact on weather forecasting operations. arXiv. https://doi.org/10.48550/arXiv.2404.18429
    • Loveday, N., Taggart, R., & Khanarmuei, M. (2024). A user-focused approach to evaluating probabilistic and categorical forecasts. Weather and Forecasting. https://doi.org/10.1175/waf-d-23-0201.1
    • McKinney, W. (2010). Data structures for statistical computing in Python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
    • Miles, A., Kirkham, J., Durant, M., Bourbeau, J., Onalan, T., Hamman, J., … Banihirwe, A. (2020). Zarr-developers/zarr-python: v2.4.0. Zenodo. https://doi.org/10.5281/zenodo.3773450
    • Morley, S., & Burrell, A. (2020). Drsteve/PyForecastTools: Version 1.1.1. Zenodo. https://doi.org/10.5281/zenodo.3764117
    • Nipen, T. N., Stull, R. B., Lussana, C., & Seierstad, I. A. (2023). Verif: A weather-prediction verification tool for effective product development. Bulletin of the American Meteorological Society, 104(9), E1610–E1618. https://doi.org/10.1175/bams-d-22-0253.1
    • Peirce, C. S. (1884). The numerical measure of the success of predictions. Science, ns-4(93), 453–454. https://doi.org/10.1126/science.ns-4.93.453.b
    • Pulkkinen, S., Nerini, D., Pérez Hortal, A. A., Velasco-Forero, C., Seed, A., Germann, U., & Foresti, L. (2019). Pysteps: An open-source Python library for probabilistic precipitation nowcasting (v1.0). Geoscientific Model Development, 12(10), 4185–4219. https://doi.org/10.5194/gmd-12-4185-2019
    • Roberts, N. M., & Lean, H. W. (2008). Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Monthly Weather Review, 136(1), 78–97. https://doi.org/10.1175/2007MWR2123.1
    • Taggart, R. (2022a). Assessing calibration when predictive distributions have discontinuities. Retrieved from http://www.bom.gov.au/research/publications/researchreports/BRR-064.pdf
    • Taggart, R. (2022b). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306–320. https://doi.org/10.1002/qj.4206
    • Taggart, R. (2022c). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201–231. https://doi.org/10.1214/21-ejs1957
    • Taggart, R., Loveday, N., & Griffiths, D. (2022). A scoring framework for tiered warnings and multicategorical forecasts based on fixed risk measures. Quarterly Journal of the Royal Meteorological Society, 148(744), 1389–1406. https://doi.org/10.1002/qj.4266
    • The HDF Group, & Koziol, Q. (2020). HDF5 version 1.12.0. https://doi.org/10.11578/dc.20180330.1
    • The pandas development team. (2024). pandas-dev/pandas: pandas. Zenodo. https://doi.org/10.5281/zenodo.10957263
    • Unidata. (2024). Network Common Data Form (NetCDF). UCAR/Unidata Program Center. https://doi.org/10.5065/D6H70CW6
    • World Meteorological Organization. (2024). WMO No. 306 FM 92 GRIB (edition 2). World Meteorological Organization. Retrieved from https://codes.wmo.int/grib2