pycontrails.core.datalib

Datalib utilities.

Module Attributes

NETCDF_ENGINE

NetCDF engine to use for parsing netcdf files

DEFAULT_CHUNKS

Default chunking strategy when opening datasets with xarray

OPEN_IN_PARALLEL

Whether to open multi-file datasets in parallel

OPEN_WITH_LOCK

Whether to use file locking when opening multi-file datasets

Functions

parse_grid(grid, supported)

Parse input grid spacing.

parse_pressure_levels(pressure_levels[, ...])

Check that input pressure levels have a consistent type and ensure the levels exist in the ECMWF data source.

parse_timesteps(time[, freq])

Parse time input into set of time steps.

parse_variables(variables, supported)

Parse input variables.

round_hour(time, hour)

Round time to the nearest whole hour before input time.

validate_timestep_freq(freq, datasource_freq)

Check that input timestep frequency is compatible with the data source timestep frequency.

Classes

MetDataSource(time, variables[, ...])

Abstract class for wrapping meteorology data sources.

pycontrails.core.datalib.DEFAULT_CHUNKS = {'time': 1}

Default chunking strategy when opening datasets with xarray

class pycontrails.core.datalib.MetDataSource(time, variables, pressure_levels=-1, paths=None, grid=None, **kwargs)

Bases: ABC

Abstract class for wrapping meteorology data sources.

abstract cache_dataset(dataset)

Cache data from data source.

Parameters:

dataset (xarray.Dataset) – Dataset loaded from remote API or local files. The dataset must have the same format as the original data source API or files.

cachestore

Cache store for intermediates while processing data source. If None, cache is turned off.

abstract create_cachepath(t)

Return cachepath to local data file based on datetime.

Parameters:

t (datetime) – Datetime of datafile

Returns:

str – Path to cached data file

download(**xr_kwargs)

Confirm all data files are downloaded and available locally in the cachestore.

Parameters:

**xr_kwargs – Passed into xarray.open_dataset() via is_datafile_cached().

abstract download_dataset(times)

Download data from data source for input times.

Parameters:

times (list[datetime]) – List of datetimes to download and store in cache

grid

Lat / Lon grid spacing

property hash

Generate a unique hash for this datasource.

Returns:

str – Unique hash for met instance (sha1)

is_datafile_cached(t, **xr_kwargs)

Check whether the datafile for datetime t is cached and contains the variables and pressure levels of this instance.

If using a cloud cache store (e.g. cache.GCPCacheStore), this is where the datafile will be mirrored to a local file for access.

Parameters:
  • t (datetime) – Datetime of datafile

  • **xr_kwargs (Any) – Additional kwargs passed directly to xarray.open_mfdataset() when opening files. By default, the following values are used if not specified:

    • chunks: {“time”: 1}

    • engine: “netcdf4”

    • parallel: True

Returns:

bool – True if data file exists for datetime with all variables and pressure levels, False otherwise

property is_single_level

Return True if the datasource is single level data.

Added in version 0.50.0.

list_timesteps_cached(**xr_kwargs)

Get a list of data files available locally in the cachestore.

Parameters:

**xr_kwargs – Passed into xarray.open_dataset() via is_datafile_cached().

list_timesteps_not_cached(**xr_kwargs)

Get a list of data files not available locally in the cachestore.

Parameters:

**xr_kwargs – Passed into xarray.open_dataset() via is_datafile_cached().

open_dataset(disk_paths, **xr_kwargs)

Open multi-file dataset in xarray.

Parameters:
  • disk_paths (str | list[str] | pathlib.Path | list[pathlib.Path]) – list of string paths to local files to open

  • **xr_kwargs (Any) – Additional kwargs passed directly to xarray.open_mfdataset() when opening files. By default, the following values are used if not specified:

    • chunks: {“time”: 1}

    • engine: “netcdf4”

    • parallel: False

    • lock: False

Returns:

xarray.Dataset – Open xarray dataset
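
The default handling listed above can be sketched as follows. This is a minimal illustration and not the pycontrails implementation: the helper name with_default_xr_kwargs is hypothetical, and the default values are copied from the module attributes documented on this page.

```python
from __future__ import annotations

from typing import Any

# Values mirror the module attributes documented on this page.
NETCDF_ENGINE = "netcdf4"
DEFAULT_CHUNKS = {"time": 1}
OPEN_IN_PARALLEL = False
OPEN_WITH_LOCK = False


def with_default_xr_kwargs(**xr_kwargs: Any) -> dict[str, Any]:
    """Fill in the documented defaults without overriding caller-supplied values.

    Hypothetical helper, for illustration only.
    """
    xr_kwargs.setdefault("engine", NETCDF_ENGINE)
    xr_kwargs.setdefault("chunks", dict(DEFAULT_CHUNKS))  # copy to avoid sharing state
    xr_kwargs.setdefault("parallel", OPEN_IN_PARALLEL)
    xr_kwargs.setdefault("lock", OPEN_WITH_LOCK)
    return xr_kwargs


# The caller overrides only the chunking; the other defaults are kept.
kwargs = with_default_xr_kwargs(chunks={"time": 4})
```

A dictionary built this way could then be forwarded as xarray.open_mfdataset(disk_paths, **kwargs).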

abstract open_metdataset(dataset=None, xr_kwargs=None, **kwargs)

Open MetDataset from data source.

This method should download / load any required datafiles and return a MetDataset of the multi-file dataset opened by xarray.

Parameters:
  • dataset (xr.Dataset | None, optional) – Input xr.Dataset loaded manually. The dataset must have the same format as the original data source API or files.

  • xr_kwargs (dict[str, Any] | None, optional) – Dictionary of keyword arguments passed into xarray.open_mfdataset() when opening files. Examples include “chunks”, “engine”, “parallel”, etc. Ignored if dataset is input.

  • **kwargs (Any) – Keyword arguments passed through directly into MetDataset constructor.

Returns:

MetDataset – Meteorology dataset

paths

Path to local source files to load. Set to the paths of files cached in cachestore if no paths input is provided on init.

property pressure_level_variables

Parameters available from data source.

Returns:

list[MetVariable] | None – List of MetVariable available in datasource

pressure_levels

List of pressure levels. Set to [-1] for data without level coordinate. Use parse_pressure_levels() to handle PressureLevelInput.

abstract set_metadata(ds)

Set met source metadata on ds.attrs.

This is called within the open_metdataset() method to set metadata on the returned MetDataset instance.

Parameters:

ds (xr.Dataset | MetDataset) – Dataset to set metadata on. Mutated in place.

property single_level_variables

Parameters available from data source.

Returns:

list[MetVariable] | None – List of MetVariable available in datasource

property supported_pressure_levels

Pressure levels available from datasource.

Returns:

list[int] | None – List of integer pressure levels for class. If None, no pressure level information available for class.

property supported_variables

Parameters available from data source.

Returns:

list[MetVariable] | None – List of MetVariable available in datasource

timesteps

List of individual timesteps from data source derived from time. Use parse_timesteps() to handle TimeInput.

property variable_shortnames

Return a list of variable short names.

Returns:

list[str] – List of variable short names.

property variable_standardnames

Return a list of variable standard names.

Returns:

list[str] – List of variable standard names.

variables

Variables requested from data source. Use parse_variables() to handle VariableInput.

pycontrails.core.datalib.NETCDF_ENGINE = 'netcdf4'

NetCDF engine to use for parsing netcdf files

pycontrails.core.datalib.OPEN_IN_PARALLEL = False

Whether to open multi-file datasets in parallel

pycontrails.core.datalib.OPEN_WITH_LOCK = False

Whether to use file locking when opening multi-file datasets

pycontrails.core.datalib.parse_grid(grid, supported)

Parse input grid spacing.

Parameters:
  • grid (float) – Input grid float

  • supported (Sequence[float]) – Sequence of supported grid values

Returns:

float – Parsed grid spacing

Raises:

ValueError – Raises ValueError when grid is not in supported
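
The documented contract (return the grid spacing if it is supported, raise ValueError otherwise) can be sketched in a few lines. The function name parse_grid_sketch and the supported spacings in the example are illustrative assumptions, not values taken from pycontrails.

```python
from collections.abc import Sequence


def parse_grid_sketch(grid: float, supported: Sequence[float]) -> float:
    """Return ``grid`` if it is a supported spacing, otherwise raise ValueError.

    Illustrative stand-in, not the library implementation.
    """
    if grid not in supported:
        raise ValueError(
            f"Grid spacing {grid} is not supported; choose one of {list(supported)}"
        )
    return grid


# Example spacings are made up for the demonstration.
spacing = parse_grid_sketch(0.5, supported=[0.25, 0.5, 1.0])
```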

pycontrails.core.datalib.parse_pressure_levels(pressure_levels, supported=None)

Check that input pressure levels have a consistent type and ensure the levels exist in the ECMWF data source.

Changed in version 0.50.0: The returned pressure levels are now sorted. Pressure levels must be unique. Raises ValueError if pressure levels have mixed signs.

Parameters:
  • pressure_levels (PressureLevelInput) – Input pressure levels for data, in hPa (mbar). Set to [-1] to represent surface level.

  • supported (list[int], optional) – List of supported pressure levels in data source

Returns:

list[int] – List of integer pressure levels supported by ECMWF data source

Raises:

ValueError – Raises ValueError if pressure level is not supported by ECMWF data source
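
The checks described in this entry (unique levels, no mixed signs, sorted output, membership in the supported list) can be sketched as below. This is a rough illustration under stated assumptions rather than the library code: the name parse_pressure_levels_sketch is hypothetical, ascending sort order is assumed because the entry only says "sorted", and the accepted input types are simplified to a number or an iterable of numbers.

```python
from __future__ import annotations

from collections.abc import Iterable


def parse_pressure_levels_sketch(
    pressure_levels: int | float | Iterable[int | float],
    supported: list[int] | None = None,
) -> list[int]:
    """Normalize pressure levels to a sorted list of unique integers."""
    # Accept either a single level or an iterable of levels
    if isinstance(pressure_levels, (int, float)):
        levels = [int(pressure_levels)]
    else:
        levels = [int(p) for p in pressure_levels]

    if len(set(levels)) != len(levels):
        raise ValueError("Pressure levels must be unique")

    # Negative values (e.g. -1 for surface data) must not be mixed with real levels
    if any(p < 0 for p in levels) and any(p >= 0 for p in levels):
        raise ValueError("Pressure levels must not have mixed signs")

    levels.sort()  # ascending order is an assumption of this sketch

    if supported is not None:
        missing = [p for p in levels if p not in supported]
        if missing:
            raise ValueError(f"Pressure levels {missing} are not supported")

    return levels


parsed = parse_pressure_levels_sketch([300, 200, 250], supported=[200, 250, 300])
```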

pycontrails.core.datalib.parse_timesteps(time, freq='1h')

Parse time input into set of time steps.

If input time is length 2, this creates a range of equally spaced time points between [start, end] with interval freq.

Parameters:
  • time (TimeInput | None) – Input datetime(s) specifying the time or time range of the data [start, end]. Either a single datetime-like or tuple of datetime-like with the first value the start of the date range and second value the end of the time range. Input values can be any type compatible with pandas.to_datetime().

  • freq (str | None, optional) – Timestep interval in range. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases for a list of frequency aliases. If None, returns input time as a list. Defaults to “1h”.

Returns:

list[datetime] – List of unique datetimes. If input time is None, returns an empty list

Raises:

ValueError – Raises when the time has len > 2 or when time elements fail to be parsed with pd.to_datetime
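
The range expansion for a [start, end] input can be illustrated with the standard library alone. The sketch below is not the pycontrails implementation: timesteps_between is a hypothetical name, the interval is passed as a timedelta rather than a pandas frequency string, and both endpoints are assumed to already fall on multiples of the interval (how the library treats unaligned endpoints is not covered here).

```python
from __future__ import annotations

from datetime import datetime, timedelta


def timesteps_between(start: datetime, end: datetime, step: timedelta) -> list[datetime]:
    """Return equally spaced datetimes from ``start`` to ``end`` inclusive."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")

    steps = []
    t = start
    while t <= end:
        steps.append(t)
        t += step
    return steps


# Hourly steps from 00:00 to 03:00 on the same day
times = timesteps_between(
    datetime(2022, 3, 1, 0), datetime(2022, 3, 1, 3), timedelta(hours=1)
)
```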

pycontrails.core.datalib.parse_variables(variables, supported)

Parse input variables.

Changed in version 0.50.0: The output is no longer copied. Each MetVariable is a frozen dataclass, so copying is unnecessary.

Parameters:
  • variables (VariableInput) – Variable name, or sequence of variable names, e.g. "air_temperature", ["air_temperature", "relative_humidity"], [130], [AirTemperature], [[EastwardWind, NorthwardWind]]. If an element is a list of MetVariable, the first MetVariable that is supported will be chosen.

  • supported (list[MetVariable]) – Supported MetVariable.

Returns:

list[MetVariable] – List of MetVariable

Raises:

ValueError – Raises ValueError if variable is not supported
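
The matching rules above, including the "first supported variable in a nested list" behavior, can be sketched as follows. Everything here is a simplified stand-in: Var is a hypothetical frozen dataclass used in place of MetVariable, match_variables is a hypothetical name, names are matched on standard name only, and integer parameter IDs such as 130 are not handled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Hypothetical stand-in for MetVariable."""

    short_name: str
    standard_name: str


def match_variables(variables, supported: list[Var]) -> list[Var]:
    """Resolve requested variables against the supported list."""
    # A single name or variable is treated as a one-element request
    if isinstance(variables, (str, Var)):
        variables = [variables]

    by_name = {v.standard_name: v for v in supported}
    out: list[Var] = []
    for item in variables:
        if isinstance(item, (list, tuple)):
            # Nested list: take the first candidate that is supported
            found = next((v for v in item if v in supported), None)
        elif isinstance(item, Var):
            found = item if item in supported else None
        else:
            found = by_name.get(item)

        if found is None:
            raise ValueError(f"Variable {item!r} is not supported")
        out.append(found)
    return out


T = Var("t", "air_temperature")
U = Var("u", "eastward_wind")
V = Var("v", "northward_wind")

# V is not supported, so the nested list falls through to U.
selected = match_variables(["air_temperature", [V, U]], supported=[T, U])
```

Because the stand-in dataclass is frozen, the selected instances can be returned directly without copying, which matches the note in the version change above.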

pycontrails.core.datalib.round_hour(time, hour)

Round time to the nearest whole hour before input time.

Parameters:
  • time (datetime) – Input time

  • hour (int) – Hour to round down time

Returns:

datetime – Rounded time

Raises:

ValueError – Raised if hour is not a valid value
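
One plausible reading of this entry is that time is rounded down to the most recent multiple of hour within the same day. The sketch below implements that reading only; the function name round_down_to_hour_multiple, the interpretation of hour as a block size, and the accepted range of 1 to 23 are all assumptions of this sketch and are not confirmed by the text above.

```python
from datetime import datetime


def round_down_to_hour_multiple(time: datetime, hour: int) -> datetime:
    """Round ``time`` down to the previous multiple of ``hour`` on the same day."""
    # Assumed valid range; the documented entry does not state it.
    if not 1 <= hour <= 23:
        raise ValueError("hour must be between 1 and 23")

    floored = (time.hour // hour) * hour
    return time.replace(hour=floored, minute=0, second=0, microsecond=0)


# 14:37 with 6-hour blocks falls in the block starting at 12:00
rounded = round_down_to_hour_multiple(datetime(2022, 3, 1, 14, 37), hour=6)
```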

pycontrails.core.datalib.validate_timestep_freq(freq, datasource_freq)

Check that input timestep frequency is compatible with the data source timestep frequency.

A data source timestep frequency of 1 hour allows input timestep frequencies of 1 hour, 2 hours, 3 hours, etc., but not 1.5 hours or 30 minutes.

Parameters:
  • freq (str) – Input timestep frequency

  • datasource_freq (str) – Datasource timestep frequency

Returns:

bool – True if the input timestep frequency is an integer multiple of the data source timestep frequency.
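
The compatibility rule can be checked with a remainder test. The sketch below uses datetime.timedelta values instead of frequency strings to stay within the standard library; is_compatible_freq is a hypothetical name and this is not the pycontrails implementation.

```python
from datetime import timedelta


def is_compatible_freq(freq: timedelta, datasource_freq: timedelta) -> bool:
    """Return True if ``freq`` is a positive whole multiple of ``datasource_freq``."""
    if freq <= timedelta(0) or datasource_freq <= timedelta(0):
        return False
    # timedelta % timedelta yields the remainder as a timedelta
    return freq % datasource_freq == timedelta(0)


hourly = timedelta(hours=1)
three_hourly_ok = is_compatible_freq(timedelta(hours=3), hourly)
ninety_min_ok = is_compatible_freq(timedelta(minutes=90), hourly)
thirty_min_ok = is_compatible_freq(timedelta(minutes=30), hourly)
```

With an hourly data source, the 3-hour request is compatible, while the 90-minute and 30-minute requests are not, matching the examples given in the description.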