
Data Formats and Standards

Model evaluation often requires comparing output from different models, as in the Coupled Model Intercomparison Project (CMIP). However, such comparisons can be tricky because models use multiple data formats and standards. This is why ACCESS-NRI supports and encourages the use of common, community-supported data formats and variables.

Data Standards

Data standards are agreed-upon guidelines for the "representation, format, definition, structuring, tagging, transmission, manipulation, use, and management" of datasets (definition from Geoscience Australia). Following these standardized guidelines makes data easier to share and combine, and makes it clearer which quantities can be compared across datasets, which is essential for model evaluation.

An example data standard in climate models is the use of Climate and Forecast metadata conventions (CF conventions). These are designed to promote the processing and sharing of NetCDF files (described in more detail below). The conventions specify metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data.

Metadata is information about the data, which can include variable names, dimension names, units, grid information and many others. Standardized metadata can also be more easily made machine readable, allowing software packages to interpret, for example, variable names automatically and making data analysis more efficient and less error prone. The machine readability of standardized formats thus facilitates building software applications with powerful extraction, regridding and display capabilities.
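To illustrate why standardized metadata helps software, the sketch below uses plain Python dictionaries as a stand-in for the attributes attached to variables in a NetCDF file. The variable names (`tas`, `pr`, `temp`) and the choice of `standard_name` and `units` as the attributes to check are assumptions made for illustration; this is not the full set of CF rules, only a minimal example of metadata being read by a program instead of a person.

```python
# Toy stand-in for the attributes attached to variables in a NetCDF file.
# The variable names and values are hypothetical, chosen for illustration.
variables = {
    "tas": {"standard_name": "air_temperature", "units": "K"},
    "pr": {"standard_name": "precipitation_flux", "units": "kg m-2 s-1"},
    "temp": {"long_name": "temperature"},  # no standard_name, no units
}

# Assumed minimal check for this sketch, not the complete CF requirements.
REQUIRED_ATTRS = ("standard_name", "units")


def missing_attrs(attrs):
    """Return the required attribute names that one variable lacks."""
    return [name for name in REQUIRED_ATTRS if name not in attrs]


def find_by_standard_name(all_vars, standard_name):
    """Return the names of variables whose standard_name matches."""
    return [
        var
        for var, attrs in all_vars.items()
        if attrs.get("standard_name") == standard_name
    ]


# Map each variable to the list of attributes it is missing.
report = {var: missing_attrs(attrs) for var, attrs in variables.items()}
print(report)
print(find_by_standard_name(variables, "air_temperature"))
```

Because `tas` carries a standardized name, a tool can locate it without knowing how a particular model happens to label its temperature output, whereas `temp` would need manual inspection.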

Currently, many models do not abide by the CF conventions by default. However, there is a software library called CMOR (Climate Model Output Rewriter) that translates native climate model output into output that complies with the CF conventions. The process of CMORizing is specifically designed for model intercomparison projects, like CMIP.

Network Common Data Form (NetCDF)

Numerous organisations and scientific groups worldwide have adopted a file format called NetCDF as a standard way to store multidimensional scientific data.

NetCDF, which has the file extension *.nc, is a self-describing, machine-independent data format for array-oriented scientific data.

  • Self-describing
    *.nc files include not only the data, but also a header with metadata that describes the data layout.
  • Machine-independent
    *.nc files can be accessed by computers with different ways of storing integers, characters and floating-point numbers.
  • Array-oriented
    *.nc data is organised into variables (e.g., temperature and humidity) that are stored as arrays over named dimensions (e.g., latitude, longitude and time). Variables that share a dimension have the same length along it.

    Schematic of a NetCDF file with data (temperature and pressure as variables stored over the dimensions latitude, longitude, and time) and metadata

Data in a NetCDF file is stored in the form of arrays, where each NetCDF dimension has a name and a length. Different variables and coordinates in the same file can have different numbers of dimensions.

For example, surface temperature variation over time at a fixed location would be stored as a one-dimensional array (with dimension time), whereas surface temperature that varies over a region at a fixed point in time would be stored as a two-dimensional array (with dimensions longitude, latitude). An example of three-dimensional (3D) data would be surface temperature varying with time over a region (with dimensions longitude, latitude, time), and four-dimensional (4D) data would be temperature varying with time over a region with varying altitude (with dimensions longitude, latitude, altitude, time).
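The four cases above can be sketched with in-memory xarray arrays. The grid sizes and dimension names here are hypothetical and the values are all zeros; the point is only to show how the number of named dimensions changes from case to case.

```python
import numpy as np
import xarray as xr

# Hypothetical grid sizes, chosen only to make the shapes easy to read.
n_time, n_alt, n_lat, n_lon = 12, 3, 4, 8

# 1D: surface temperature over time at one fixed location.
point_series = xr.DataArray(np.zeros(n_time), dims=("time",))

# 2D: surface temperature over a region at one fixed time.
snapshot = xr.DataArray(
    np.zeros((n_lat, n_lon)), dims=("latitude", "longitude")
)

# 3D: surface temperature over a region, varying with time.
surface = xr.DataArray(
    np.zeros((n_time, n_lat, n_lon)),
    dims=("time", "latitude", "longitude"),
)

# 4D: temperature over a region, varying with time and altitude.
full = xr.DataArray(
    np.zeros((n_time, n_alt, n_lat, n_lon)),
    dims=("time", "altitude", "latitude", "longitude"),
)

for name, arr in [
    ("point_series", point_series),
    ("snapshot", snapshot),
    ("surface", surface),
    ("full", full),
]:
    print(name, arr.ndim, dict(arr.sizes))

# Fixing one dimension at a single index removes it from the array:
# the 4D field at the lowest altitude level has the same dimensions
# as the 3D surface field.
lowest_level = full.isel(altitude=0)
print(lowest_level.dims)
```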

Loading NetCDF files

There are many ways of reading NetCDF files; a common one is the Python package xarray.
For more information, refer to a quick overview of xarray and xarray tutorials.

xarray is a Python package available through the conda environment on NCI.
Hence, you can either use it directly (as shown below) or through the dataset capabilities of the ACCESS-NRI Model Intake Catalog Tool.

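import xarray as xr

# "example.nc" is a placeholder path; values are read from disk only when needed.
dataset = xr.open_dataset("example.nc")
# In a notebook, evaluating the object displays its dimensions, coordinates, variables and attributes.
dataset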
Example of an actual NetCDF file with data (precipitation/rainfall over the dimensions latitude, longitude, and time) and metadata.
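Once a dataset is open, its variables, dimensions and metadata can be inspected and subset by name. The following is a minimal sketch that builds a small in-memory dataset as a stand-in for the result of `xr.open_dataset`, so that it runs without a file. The variable name `pr`, the coordinate names `lat` and `lon`, and all values are hypothetical.

```python
import numpy as np
import xarray as xr

# In-memory stand-in for a dataset opened from a NetCDF file.
times = np.array(
    ["2000-01-01", "2000-02-01", "2000-03-01"], dtype="datetime64[ns]"
)
dataset = xr.Dataset(
    data_vars={
        "pr": (
            ("time", "lat", "lon"),
            np.ones((3, 2, 4)),
            {"standard_name": "precipitation_flux", "units": "kg m-2 s-1"},
        )
    },
    coords={
        "time": times,
        "lat": [-30.0, -20.0],
        "lon": [110.0, 120.0, 130.0, 140.0],
    },
)

# Pick one variable and read its dimensions and metadata.
pr = dataset["pr"]
print(pr.dims)
print(pr.attrs["units"])

# Select the grid point nearest to a requested location (keeps the time axis).
point = pr.sel(lat=-22.0, lon=133.0, method="nearest")

# Average over the time dimension (keeps the spatial axes).
time_mean = pr.mean(dim="time")
print(point.dims, time_mean.dims)
```

Selecting by coordinate value and reducing by dimension name, instead of by array index, is what the self-describing layout of NetCDF makes possible.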

Other Data formats

NetCDF is described in detail here because it is the most common format for climate data, and evaluation workflows are easiest to compare and optimize when all data is in the same format. Observational data, however, can come from different institutions and be measured with various instruments. These institutions may manage their data for users other than climate researchers, so the data can arrive in other formats, including plain text. Such data can be CMORised for use in evaluation frameworks. Reach out on the Hive Forum for assistance, or to suggest datasets that may be missing or could be useful.

References