Databrowser python module#
The following section gives an overview of the usage of the available zarr
utility python module. Please see the Installation and configuration section on how
to install and configure the freva-client library.
Convert data to zarr#
With the help of freva_client.zarr_utils.convert() you can convert
and optionally aggregate your data to zarr.
- freva_client.zarr_utils.convert(*paths: str, aggregate: Literal['auto', 'merge', 'concat'] | None = None, host: str | None = None, join: Literal['outer', 'inner', 'exact', 'left', 'right'] = 'outer', compat: Literal['no_conflicts', 'equals', 'override'] = 'override', data_vars: Literal['minimal', 'different', 'all'] = 'minimal', coords: Literal['minimal', 'different', 'all'] = 'minimal', dim: str | None = None, group_by: str | None = None, zarr_options: Dict[str, str | int | float | bool | None] | None = None) List[str]#
Convert data files to a zarr store in the cloud.
This method lets you convert data files in netCDF, HDF5, GeoTIFF etc. to zarr stores that are available via HTTP.
It can either directly map one input file to one zarr store or aggregate the files into a single zarr store. There are three main aggregation modes
(auto, merge or concat). Once you’ve chosen the main aggregation mode, you can fine-tune the aggregation using the join, compat, data_vars, coords, dim and group_by parameters.
- Parameters:
paths (str) – Collection of paths that are converted to zarr.
aggregate (str, choices: None, auto, merge, concat) – None will not aggregate data. The string indicates how the aggregation should be done: - “auto”: let the system choose how to aggregate data. - “merge”: merge datasets as variables. - “concat”: concatenate datasets along a dimension.
host (str, default: None) – Override the host name of the databrowser server. This is usually the url where the freva web site can be found, such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.
join (str, choices: outer, inner, exact, left, right) –
String indicating how to combine differing indexes - “outer”: use the union of object indexes. - “inner”: use the intersection of object indexes. - “left”: use indexes from the first object with each dimension. - “right”: use indexes from the last object with each dimension. - “exact”: instead of aligning, errors when indexes are not equal.
This option is only taken into account if aggregate is not None.
compat (str, choices: no_conflicts, equals, override) –
String indicating how to compare non-concatenated variables of the same name:
- ”equals”: all values and dimensions must be the same.
- ”no_conflicts”: only values which are not null in both datasets must be equal. The returned dataset then contains the combination of all non-null values.
- ”override”: skip comparing and pick the variable from the first dataset.
This option is only taken into account if aggregate is not None.
data_vars (str, choices: minimal, different, all) –
These data variables will be combined together:
- ”minimal”: Only data variables in which the dimension already appears are included.
- ”different”: Data variables which are not equal (ignoring attributes) across all datasets are also concatenated (as well as all for which the dimension already appears).
- ”all”: All data variables will be concatenated.
This option is only taken into account if aggregate is not None.
coords (str, choices: minimal, different, all) –
These coordinate variables will be combined together:
- ”minimal”: Only coordinates in which the dimension already appears are included.
- ”different”: Coordinates which are not equal (ignoring attributes) across all datasets are also concatenated (as well as all for which the dimension already appears).
- ”all”: All coordinates will be concatenated.
This option is only taken into account if aggregate is not None.
dim (str) –
Name of the dimension to concatenate along. This can either be a new dimension name, in which case it is added along axis=0, or an existing dimension name, in which case the location of the dimension is unchanged.
This option is only taken into account if aggregate is not None.
group_by (str) –
If set, forces grouping by a signature key. Otherwise grouping is attempted only when direct combine fails.
This option is only taken into account if aggregate is not None.
zarr_options (dict, default: None) –
Set additional options for creating the dynamic zarr streams. You can set the following options (see also freva_client.utils.types.ZarrOptions):
Public urls: If you wish to create a public instead of a private url that expires in one hour you can set: zarr_options={"public": True, "ttl_seconds": 3600}.
Access pattern: To optimise the chunk size according to your access pattern you can add the access_pattern, chunk_size and map_primary_chunksize parameters. access_pattern can either be map or time_series. chunk_size should be the target chunk size of the dataset. If you choose a map access pattern you can set the chunk size of the primary dimension, such as time, using the map_primary_chunksize parameter: zarr_options={"access_pattern": "time_series", "chunk_size": 2.0}
Force reloading: To improve access performance, data store requests are cached server side. To force a reload you can add the reload=True option: zarr_options={"reload": True}
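The join, compat, data_vars and coords options mirror the semantics of xarray’s own concat and merge functions. The sketch below uses xarray directly (not the freva server) to illustrate what the two aggregation modes do with small in-memory datasets; variable names like tas and pr are only for illustration.

```python
import xarray as xr

# Two single-timestep datasets sharing a "time" dimension.
ds1 = xr.Dataset({"tas": ("time", [280.0])}, coords={"time": [0]})
ds2 = xr.Dataset({"tas": ("time", [281.5])}, coords={"time": [1]})

# aggregate="concat" stacks datasets along a dimension (here "time"),
# honouring the join/data_vars/coords options.
combined = xr.concat(
    [ds1, ds2], dim="time", join="outer", data_vars="minimal", coords="minimal"
)
print(combined.sizes["time"])  # 2

# aggregate="merge" combines datasets as variables of one dataset.
ds_a = xr.Dataset({"tas": ("time", [280.0, 281.0])}, coords={"time": [0, 1]})
ds_b = xr.Dataset({"pr": ("time", [1.0, 2.0])}, coords={"time": [0, 1]})
merged = xr.merge([ds_a, ds_b], join="inner", compat="no_conflicts")
print(sorted(merged.data_vars))  # ['pr', 'tas']
```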
Example

```python
import xarray as xr

from freva_client import authenticate
from freva_client.zarr_utils import convert

storage_options = authenticate()["headers"]
urls = convert("/mnt/data/test1.nc", "/mnt/data/test2.nc")
dset = xr.open_zarr(urls[0], storage_options=storage_options)
```
Public access:
You can also create zarr stores that are public, for example a temporary public store that is valid for one day:

```python
import xarray as xr

from freva_client.zarr_utils import convert

urls = convert(
    "/mnt/data/test1.nc",
    "/mnt/data/test2.nc",
    zarr_options={"public": True, "ttl_seconds": 86400},
)
dset = xr.open_zarr(urls[0])
```
Chunk sizes:
Depending on how you would like to access the data, different chunk sizes can make data access more performant. freva_client defines two major access patterns: map and time_series. Depending on which of the two access patterns you choose, chunk sizes will be adjusted accordingly. For example, with access_pattern=time_series the chunk size will be optimised for time series analysis at given geographical points. map, on the other hand, optimises the access pattern for map comparisons over the time dimension. You can set the access_pattern and the optimal chunk size in MB via the access_pattern, chunk_size and map_primary_chunksize entries in the zarr_options dictionary. map_primary_chunksize is the chunk size of the primary axis, such as time, if access_pattern=map is chosen. For example, you can request an access pattern optimised for time series analysis using the following zarr_options:

```python
import xarray as xr

from freva_client.zarr_utils import convert

urls = convert(
    "/mnt/data/test1.nc",
    "/mnt/data/test2.nc",
    zarr_options={
        "public": True,
        "ttl_seconds": 86400,
        "chunk_size": 2,
        "access_pattern": "time_series",
    },
)
dset = xr.open_zarr(urls[0])
```
For a map access pattern that optimises the chunk size for map based analysis you can set the slice size of the primary dimension, often time. For example, instead of reading one time step after another, read 100 time steps at once:

```python
import xarray as xr

from freva_client.zarr_utils import convert

urls = convert(
    "/mnt/data/test1.nc",
    "/mnt/data/test2.nc",
    zarr_options={
        "public": True,
        "ttl_seconds": 86400,
        "chunk_size": 2,
        "map_primary_chunksize": 100,
    },
)
dset = xr.open_zarr(urls[0])
```
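To get a feeling for what a chunk_size target in MB translates to, here is a back-of-the-envelope sketch. It is an illustration only; the server’s exact chunking algorithm is not specified here.

```python
import numpy as np

# Rough illustration: number of float32 values that fit into a 2 MB chunk.
target_mb = 2
itemsize = np.dtype("float32").itemsize  # 4 bytes per value
values_per_chunk = target_mb * 1024**2 // itemsize
print(values_per_chunk)  # 524288
```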
Dataset aggregation:
You can also be more specific about the aggregation operation:

```python
import xarray as xr

from freva_client import authenticate
from freva_client.zarr_utils import convert

storage_options = authenticate()["headers"]
urls = convert(
    "/mnt/data/test1.nc",
    "/mnt/data/test2.nc",
    aggregate="concat",
    join="inner",
    dim="ensemble",
)
dset = xr.open_zarr(urls[0], storage_options=storage_options)
```
The zarr_options dictionary can be used to request public zarr stores:

```python
import xarray as xr

from freva_client import authenticate
from freva_client.zarr_utils import convert

_ = authenticate()
urls = convert(
    "/mnt/data/test1.nc",
    "/mnt/data/test2.nc",
    aggregate="concat",
    join="inner",
    dim="ensemble",
    zarr_options={"public": True, "ttl_seconds": 86400},
)
dset = xr.open_zarr(urls[0])
```
Check the status of a zarr store#
You can use freva_client.zarr_utils.status() to check the
status of a conversion job. This method can be useful if client tools like
xarray fail to open the remote zarr stores but don’t give any descriptive
error message. You can then check the status of the conversion with this
method.
- freva_client.zarr_utils.status(url: str, headers: Dict[str, str] | None = None, host: str | None = None) Status#
Query the status of a pre-signed zarr store.
This method can be useful to check the state of a zarr store if clients like
xarray fail to load the data.
- Parameters:
url (str) – The url of the zarr store that should be checked.
headers (Dict[str, str]) – Non-Public zarr stores will need a valid OAuth2 token to query the status.
host (str, default: None) – Override the host name of the databrowser server. This is usually the url where the freva web site can be found. Such as www.freva.dkrz.de. By default no host name is given and the host name will be taken from the freva config file.
- Returns:
Status (Dict(status=0, reason=””))
The status is an integer code and reason represents a human readable
description of the status.
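A small helper sketch built on status(). This is hypothetical: it assumes the returned Status object exposes the integer status and string reason attributes shown above, that a status of 0 means success, and that a freva deployment is reachable when the helper is actually called.

```python
from typing import Dict, Optional


def is_store_ready(url: str, headers: Optional[Dict[str, str]] = None) -> bool:
    """Return True if the zarr store at ``url`` converted successfully.

    Hypothetical helper: assumes a status value of 0 means success, as
    suggested by the documented return value Status(status=0, reason="").
    """
    # Imported lazily so the helper can be defined without a freva deployment.
    from freva_client.zarr_utils import status

    info = status(url, headers=headers)
    return info.status == 0
```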
- class freva_client.utils.types.ZarrOptions(public: bool = False, ttl_seconds: float = 86400.0, access_pattern: Literal['map', 'time_series'] = 'map', chunk_size: float = 16.0, map_primary_chunksize: int = 1, reload: bool = False)#
Configuration options for Zarr URL requests.
Controls URL generation, caching behavior, and chunk size optimization for different data access patterns.
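The defaults in the class signature above translate into the following zarr_options dictionary; presumably any key you omit falls back to these values:

```python
# Default ZarrOptions values, as documented in the class signature above.
default_zarr_options = {
    "public": False,
    "ttl_seconds": 86400.0,
    "access_pattern": "map",
    "chunk_size": 16.0,
    "map_primary_chunksize": 1,
    "reload": False,
}
print(default_zarr_options["access_pattern"])  # map
```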