Zarr utility python module#

The following section gives an overview of the usage of the available zarr utility python module. Please see the Installation and configuration section on how to install and configure the freva-client library.

Convert data to zarr#

With the help of freva_client.zarr_utils.convert() you can convert and optionally aggregate your data to zarr.

freva_client.zarr_utils.convert(*paths: str, aggregate: Literal['auto', 'merge', 'concat'] | None = None, host: str | None = None, join: Literal['outer', 'inner', 'exact', 'left', 'right'] = 'outer', compat: Literal['no_conflicts', 'equals', 'override'] = 'override', data_vars: Literal['minimal', 'different', 'all'] = 'minimal', coords: Literal['minimal', 'different', 'all'] = 'minimal', dim: str | None = None, group_by: str | None = None, zarr_options: Dict[str, str | int | float | bool | None] | None = None) List[str]#

Convert data files to a zarr store in the cloud.

This method lets you convert data files in netCDF, hdf5, geotiff etc. to zarr stores that are available via http.

It can either directly map each input file to its own zarr store or aggregate the files into one single zarr store. There are three main aggregation modes: auto, merge and concat. Once you have chosen the aggregation mode you can fine tune the aggregation using the join, compat, data_vars, coords, dim and group_by parameters.

Parameters:
  • paths (str) – Collection of paths that are converted to zarr.

  • aggregate (str, choices: None, auto, merge, concat) –

    String indicating how the aggregation should be done; None will not aggregate the data:

    • "auto": let the system choose how to aggregate the data.

    • "merge": merge datasets as variables.

    • "concat": concatenate datasets along a dimension.

  • host (str, default: None) – Override the host name of the databrowser server. This is usually the url where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name is taken from the freva config file.

  • join (str, choices: outer, inner, exact, left, right) –

    String indicating how to combine differing indexes:

    • "outer": use the union of object indexes.

    • "inner": use the intersection of object indexes.

    • "left": use indexes from the first object with each dimension.

    • "right": use indexes from the last object with each dimension.

    • "exact": instead of aligning, error when indexes are not equal.

    This option is only taken into account if aggregate is not None.

  • compat (str, choices: no_conflicts, equals, override) –

    String indicating how to compare non-concatenated variables of the same name for:

    • "equals": all values and dimensions must be the same.

    • "no_conflicts": only values which are not null in both datasets must be equal. The returned dataset then contains the combination of all non-null values.

    • "override": skip comparing and pick the variable from the first dataset.

    This option is only taken into account if aggregate is not None.

  • data_vars (str, choices: minimal, different, all) –

    These data variables will be combined together:

    • "minimal": only data variables in which the dimension already appears are included.

    • "different": data variables which are not equal (ignoring attributes) across all datasets are also concatenated (as well as all for which the dimension already appears).

    • "all": all data variables will be concatenated.

    This option is only taken into account if aggregate is not None.

  • coords (str, choices: minimal, different, all) –

    These coordinate variables will be combined together:

    • "minimal": only coordinates in which the dimension already appears are included.

    • "different": coordinates which are not equal (ignoring attributes) across all datasets are also concatenated (as well as all for which the dimension already appears).

    • "all": all coordinates will be concatenated.

    This option is only taken into account if aggregate is not None.

  • dim (str) –

    Name of the dimension to concatenate along. This can either be a new dimension name, in which case it is added along axis=0, or an existing dimension name, in which case the location of the dimension is unchanged.

    This option is only taken into account if aggregate is not None.

  • group_by (str) –

    If set, forces grouping by a signature key. Otherwise grouping is attempted only when direct combine fails.

    This option is only taken into account if aggregate is not None.

  • zarr_options (dict, default: None) –

    Set additional options for creating the dynamic zarr streams. You can set the following options (see also freva_client.utils.types.ZarrOptions):

    Public urls: If you wish to create a public url instead of a private one that expires in one hour you can set:

    zarr_options={"public": True, "ttl_seconds": 3600}.

    Access pattern: To optimise the chunk size according to your access pattern you can add the access_pattern, chunk_size and map_primary_chunksize parameters. access_pattern can either be map or time_series. chunk_size should be the target chunk size of the dataset in MB. If you choose a map access pattern you can set the chunk size of the primary dimension, such as time, using the map_primary_chunksize parameter:

    zarr_options={"access_pattern": "time_series", "chunk_size": 2.0}

    Force reloading: To improve access performance, data store requests are cached server side. To force a reload you can add the reload=True option:

    zarr_options={"reload": True}
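The join modes listed above follow standard index-alignment semantics. As a plain-Python illustration (a sketch only; these variable names are not part of the freva_client API), the index combinations work like this:

```python
# Illustration of the join modes on two differing indexes.
left = [0, 1, 2]   # indexes of the first object
right = [1, 2, 3]  # indexes of the last object

outer = sorted(set(left) | set(right))  # "outer": union -> [0, 1, 2, 3]
inner = sorted(set(left) & set(right))  # "inner": intersection -> [1, 2]
join_left = left                        # "left": indexes from the first object
join_right = right                      # "right": indexes from the last object

# "exact" would raise an error instead of aligning, because left != right.
assert left != right
```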

Example

import xarray as xr

from freva_client import authenticate
from freva_client.zarr_utils import convert

storage_options = authenticate()["headers"]

urls = convert("/mnt/data/test1.nc", "/mnt/data/test2.nc")
dset = xr.open_zarr(
    urls[0],
    storage_options=storage_options
)

Public access:

You can also create zarr stores that are public, for example a temporary public store that is valid for one day:

import xarray as xr

from freva_client.zarr_utils import convert

urls = convert(
    "/mnt/data/test1.nc",
    "/mnt/data/test2.nc",
    zarr_options={"public": True, "ttl_seconds": 86400}
)
dset = xr.open_zarr(urls[0])

Chunk sizes:

Depending on how you would like to access the data, different chunk sizes can make data access more performant. freva_client defines two major access patterns: map and time_series. Depending on which of the two access patterns you choose, the chunk sizes will change accordingly. For example, with access_pattern=time_series the chunk size will be optimized for time series analysis at given geographical points. map on the other hand optimizes the access pattern for map comparisons over the time dimension. You can set the access pattern and the target chunk size in MB via the access_pattern, chunk_size and map_primary_chunksize entries in the zarr_options dictionary. map_primary_chunksize is the chunk size of the major axis, such as time, if access_pattern=map is chosen.
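As a back-of-the-envelope sketch (assuming, as described above, that chunk_size is given in MB), the target chunk size translates directly into a number of values per chunk:

```python
# Rough arithmetic only: a 2 MB target chunk of float32 data (4 bytes per
# value) holds this many values.
target_mb = 2.0
bytes_per_value = 4  # float32
values_per_chunk = int(target_mb * 1024**2 / bytes_per_value)
print(values_per_chunk)  # 524288
```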

For example, you can request an access pattern optimized for time series analysis using the following zarr_options:

import xarray as xr

from freva_client.zarr_utils import convert

urls = convert(
    "/mnt/data/test1.nc",
    "/mnt/data/test2.nc",
    zarr_options={"public": True,
                  "ttl_seconds": 86400,
                  "chunk_size": 2,
                  "access_pattern": "time_series"
                  }
)
dset = xr.open_zarr(urls[0])

For a map access pattern that optimises the chunk size for map based analysis you can set the slice size of the primary dimension, often time.

For example, instead of reading one time step after another, read 100 time steps at once:

import xarray as xr

from freva_client.zarr_utils import convert

urls = convert(
    "/mnt/data/test1.nc",
    "/mnt/data/test2.nc",
    zarr_options={"public": True,
                  "ttl_seconds": 86400,
                  "access_pattern": "map",
                  "chunk_size": 2,
                  "map_primary_chunksize": 100
                  }
)
dset = xr.open_zarr(urls[0])

Dataset aggregation:

You can also be more specific about the aggregation operation:

import xarray as xr

from freva_client import authenticate
from freva_client.zarr_utils import convert

storage_options = authenticate()["headers"]
urls = convert(
    "/mnt/data/test1.nc", "/mnt/data/test2.nc",
    aggregate="concat",
    join="inner",
    dim="ensemble",
)
dset = xr.open_zarr(
    urls[0],
    storage_options=storage_options
)

The zarr_options dictionary can be used to request public zarr stores:

import xarray as xr

from freva_client import authenticate
from freva_client.zarr_utils import convert

_ = authenticate()
urls = convert(
    "/mnt/data/test1.nc", "/mnt/data/test2.nc",
    aggregate="concat",
    join="inner",
    dim="ensemble",
    zarr_options={"public": True, "ttl_seconds": 86400}
)
dset = xr.open_zarr(urls[0])

Check the status of a zarr store#

You can use the freva_client.zarr_utils.status() method to check the status of a conversion job. This method can be useful if client tools like xarray fail to open the remote zarr stores but don’t give any descriptive error message.

freva_client.zarr_utils.status(url: str, headers: Dict[str, str] | None = None, host: str | None = None) Status#

Query the status of a pre-signed zarr store.

This method can be useful to check the state of a zarr store if clients like xarray fail to load the data.

Parameters:
  • url (str) – The url of the zarr store that should be checked.

  • headers (Dict[str, str]) – Non-public zarr stores need a valid OAuth2 token to query the status.

  • host (str, default: None) – Override the host name of the databrowser server. This is usually the url where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name is taken from the freva config file.

Returns:

  • Status (Dict(status=0, reason="")) – status is an integer status code and reason is a human readable description of that status.
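For example, a small helper (a sketch only; describe_status is not part of freva_client, and the returned Status object is assumed to behave like a mapping) can turn the result into a readable message:

```python
from typing import Any, Dict


def describe_status(info: Dict[str, Any]) -> str:
    """Render a status mapping (status code + reason) as a short message."""
    code = info.get("status", -1)
    reason = info.get("reason", "")
    return "ok" if code == 0 else f"failed ({code}): {reason}"


# With a live server you would obtain the mapping like this (sketch):
#
#     from freva_client import authenticate
#     from freva_client.zarr_utils import status
#     info = status(url, headers=authenticate()["headers"])

print(describe_status({"status": 0, "reason": ""}))
```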

class freva_client.utils.types.ZarrOptions(public: bool = False, ttl_seconds: float = 86400.0, access_pattern: Literal['map', 'time_series'] = 'map', chunk_size: float = 16.0, map_primary_chunksize: int = 1, reload: bool = False)#

Configuration options for Zarr URL requests.

Controls URL generation, caching behavior, and chunk size optimization for different data access patterns.
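Written out as the equivalent zarr_options dictionary, the defaults from the signature above look like this (only the keys you want to override need to be passed to convert()):

```python
# The defaults of ZarrOptions, spelled out as a zarr_options dictionary.
default_zarr_options = {
    "public": False,             # private, pre-signed url
    "ttl_seconds": 86400.0,      # lifetime of the url in seconds
    "access_pattern": "map",     # or "time_series"
    "chunk_size": 16.0,          # target chunk size in MB
    "map_primary_chunksize": 1,  # primary-dimension chunk for "map"
    "reload": False,             # bypass the server-side cache
}
```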