Tutorial#
xcollection extends xarray’s data model to handle a dictionary of xarray Datasets. An xcollection.main.Collection
behaves like a regular dictionary, but it also has a few extra methods that make it easier to work with.
Let’s start by importing the necessary packages.
import xarray as xr
import xcollection as xc
import typing
Creating a collection from a dictionary of datasets#
To create a collection, we just pass a dictionary of xarray.Dataset
objects to the xcollection.main.Collection
constructor.
ds = xr.tutorial.open_dataset('air_temperature')
ds.attrs = {}
dsa = xr.tutorial.open_dataset('rasm')
dsa.attrs = {}
col = xc.Collection({'foo': ds, 'bar': dsa})
col
Accessing keys and values in a collection#
To access the keys and values of a collection, we can use the xcollection.main.Collection.keys()
and xcollection.main.Collection.values()
methods.
col.keys()
dict_keys(['foo', 'bar'])
col.values()
dict_values([<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ..., <xarray.Dataset>
Dimensions: (time: 36, y: 205, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 ...])
In addition, we can use the xcollection.main.Collection.items()
method to iterate over the (key, value) pairs of the collection.
for key, value in col.items():
    print(key, value)
foo <xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ...
bar <xarray.Dataset>
Dimensions: (time: 36, y: 205, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 ...
Mapping operations over a collection#
xcollection provides a number of methods that allow us to map arbitrary operations (functions) over the keys and values of a collection.
One such method is xcollection.main.Collection.map()
. This method takes a function and applies it to the values of the collection and returns a new collection.
To demonstrate this, we’ll create a new collection with the same keys as the original, but with each dataset subsetted along the time dimension.
def subset(ds: xr.Dataset, dim_slice: typing.Dict[str, slice]):
    return ds.isel(**dim_slice)
new_col = col.map(subset, dim_slice={"time": slice(0, 3)})
new_col
As you can see, the new collection has the same keys as the original, but the values are subsets of the original values.
Another method is xcollection.main.Collection.keymap()
. This method takes a function and applies it to the keys of the collection and returns a new collection. This is useful for manipulating the keys of the collection. Let’s create a new collection with the same values as the original, but with the keys converted to upper case.
def capitalize(key: str):
    return key.upper()
new_col_capitalized = col.keymap(capitalize)
new_col_capitalized
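To make the semantics of map and keymap concrete, here is a minimal pure-Python sketch that uses a plain dictionary of lists as a stand-in for a collection of datasets. The helper names map_values, map_keys and head are hypothetical and are not part of xcollection. The sketch only assumes what the examples above show: the function is applied to every value (or key), extra keyword arguments are forwarded, and a new mapping is returned.

```python
from typing import Any, Callable, Dict


def map_values(mapping: Dict[str, Any], func: Callable[..., Any], **kwargs: Any) -> Dict[str, Any]:
    """Apply func to every value, forwarding extra keyword arguments (hypothetical helper)."""
    return {key: func(value, **kwargs) for key, value in mapping.items()}


def map_keys(mapping: Dict[str, Any], func: Callable[[str], str]) -> Dict[str, Any]:
    """Apply func to every key, leaving the values untouched (hypothetical helper)."""
    return {func(key): value for key, value in mapping.items()}


def head(values: list, n: int) -> list:
    """Stand-in for ds.isel(time=slice(0, n)): keep the first n elements."""
    return values[:n]


# Plain lists stand in for datasets.
data = {"foo": [1, 2, 3, 4], "bar": [10, 20, 30, 40]}

subsetted = map_values(data, head, n=2)
renamed = map_keys(data, str.upper)

print(subsetted)
print(renamed)
```

Both helpers build a new dictionary and leave the input untouched, which mirrors the way col.map and col.keymap return a new collection rather than modifying col.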
Filtering a collection#
xcollection provides a number of methods that allow us to filter the keys and values of a collection. One such method is xcollection.main.Collection.filter()
. This method expects two arguments:
func: a function that returns a boolean value
by: which specifies whether the function is applied on keys, values or items.
Filtering based on keys#
def contains_foo(key: str) -> bool:
    return 'foo' in key.lower()
col.filter(func=contains_foo, by='key')
Filtering based on values#
def contains_time(ds: xr.Dataset) -> bool:
    return 'time' in ds.coords
col.filter(func=contains_time, by='value')
Filtering based on items#
def contains_foo_and_spans_2014(item: tuple) -> bool:
    key, ds = item
    return 'foo' in key.lower() and 2014 in ds.time.dt.year
col.filter(func=contains_foo_and_spans_2014, by='item')
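The three filtering modes differ only in what is handed to the predicate. The following pure-Python sketch illustrates this with a plain dictionary whose values are sets of coordinate names standing in for datasets. The helper filter_mapping is hypothetical and is not xcollection's implementation; it only mirrors the func and by arguments described above.

```python
from typing import Any, Callable, Dict


def filter_mapping(mapping: Dict[str, Any], func: Callable[[Any], bool], by: str) -> Dict[str, Any]:
    """Keep the entries for which func returns True (hypothetical helper).

    by selects what func receives: the key, the value, or the (key, value) tuple.
    """
    if by == "key":
        return {k: v for k, v in mapping.items() if func(k)}
    if by == "value":
        return {k: v for k, v in mapping.items() if func(v)}
    if by == "item":
        return {k: v for k, v in mapping.items() if func((k, v))}
    raise ValueError(f"by must be 'key', 'value' or 'item', got {by!r}")


# Sets of coordinate names stand in for datasets.
data = {"foo": {"lat", "lon", "time"}, "bar": {"x", "y"}}

by_key = filter_mapping(data, lambda key: "foo" in key, by="key")
by_value = filter_mapping(data, lambda coords: "time" in coords, by="value")
by_item = filter_mapping(data, lambda item: item[0] == "bar" and "x" in item[1], by="item")

print(sorted(by_key), sorted(by_value), sorted(by_item))
```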
Choosing a subset of a collection based on data variables#
xcollection provides a xcollection.main.Collection.choose()
method that allows us to filter a collection based on whether datasets in a collection contain one or more data variables. For example, our existing col
collection contains two datasets. foo
is a dataset with air
as a data variable, and bar
has Tair
as a data variable. We can filter the collection to only include datasets that have air
as a data variable as follows:
new_col = col.choose(['air'], mode='any')
new_col
As you can see in the output, the new collection only contains the dataset foo
.
By default, the mode
argument is set to any
, meaning that the new collection will only contain the datasets that have one or more of the specified data variables. If we set the mode
argument to all
, xcollection raises an error if any dataset in the collection does not contain all of the specified data variables.
new_col = col.choose(['air'], mode='all')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:1279, in Dataset._copy_listed(self, names)
1278 try:
-> 1279 variables[name] = self._variables[name]
1280 except KeyError:
KeyError: 'air'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/latest/xcollection/main.py:254, in Collection.choose.<locals>._select_vars(dset)
253 try:
--> 254 return dset[data_vars]
255 except KeyError:
File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:1412, in Dataset.__getitem__(self, key)
1411 if utils.iterable_of_hashable(key):
-> 1412 return self._copy_listed(key)
1413 raise ValueError(f"Unsupported key-type {type(key)}")
File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:1281, in Dataset._copy_listed(self, names)
1280 except KeyError:
-> 1281 ref_name, var_name, var = _get_virtual_variable(
1282 self._variables, name, self.dims
1283 )
1284 variables[var_name] = var
File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:175, in _get_virtual_variable(variables, key, dim_sizes)
174 if len(split_key) != 2:
--> 175 raise KeyError(key)
177 ref_name, var_name = split_key
KeyError: 'air'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 new_col = col.choose(['air'], mode='all')
File ~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/latest/xcollection/main.py:260, in Collection.choose(self, data_vars, mode)
257 raise KeyError(f'No data variables: `{data_vars}` found in dataset: {dset!r}')
259 if mode == 'all':
--> 260 result = toolz.valmap(_select_vars, self.datasets)
261 elif mode == 'any':
262 result = toolz.valfilter(_select_vars, self.datasets)
File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/toolz/dicttoolz.py:85, in valmap(func, d, factory)
74 """ Apply function to values of dictionary
75
76 >>> bills = {"Alice": [20, 15, 30], "Bob": [10, 35]}
(...)
82 itemmap
83 """
84 rv = factory()
---> 85 rv.update(zip(d.keys(), map(func, d.values())))
86 return rv
File ~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/latest/xcollection/main.py:257, in Collection.choose.<locals>._select_vars(dset)
255 except KeyError:
256 if mode == 'all':
--> 257 raise KeyError(f'No data variables: `{data_vars}` found in dataset: {dset!r}')
KeyError: "No data variables: `['air']` found in dataset: <xarray.Dataset>\nDimensions: (time: 36, y: 205, x: 275)\nCoordinates:\n * time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00\n xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91\n yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51\nDimensions without coordinates: y, x\nData variables:\n Tair (time, y, x) float64 ..."
As you can see in the output, we get an error because the dataset bar
does not contain the air
data variable.
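The difference between the two modes can be summarized with a small pure-Python sketch in which each dataset is reduced to the set of its data variable names. The function choose_sketch is hypothetical; it illustrates the behavior described above (filter in any mode, raise a KeyError in all mode) and is not the library's implementation, which operates on real xarray datasets and may differ in detail.

```python
from typing import Dict, List, Set


def choose_sketch(datasets: Dict[str, Set[str]], data_vars: List[str], mode: str = "any") -> Dict[str, Set[str]]:
    """Select datasets by data variable names (hypothetical stand-in for Collection.choose)."""

    def has_vars(variables: Set[str]) -> bool:
        # True when every requested name is present in this dataset.
        return all(name in variables for name in data_vars)

    if mode == "any":
        # Silently drop the datasets that lack the requested variables.
        return {key: variables for key, variables in datasets.items() if has_vars(variables)}
    if mode == "all":
        # Every dataset must have the requested variables, otherwise fail loudly.
        for key, variables in datasets.items():
            if not has_vars(variables):
                raise KeyError(f"No data variables: {data_vars} found in dataset: {key!r}")
        return dict(datasets)
    raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")


# Variable names taken from the two tutorial datasets shown earlier.
collection = {"foo": {"air"}, "bar": {"Tair"}}

selected = choose_sketch(collection, ["air"], mode="any")
print(sorted(selected))

try:
    choose_sketch(collection, ["air"], mode="all")
except KeyError as err:
    print("KeyError:", err)
```

In practice, any is the forgiving choice when datasets are heterogeneous, while all acts as a check that every dataset in the collection provides the requested variables.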
Saving a collection to disk#
To save a collection to disk, we can use the xcollection.main.Collection.to_zarr()
method. This method takes a path to a directory or a cloud storage bucket and writes the collection as a Zarr store. Each key in the collection is saved as a Zarr group with the same name as the key.
col.to_zarr('/tmp/my_collection.zarr', consolidated=True, mode='w')
/home/docs/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:2060: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
return to_zarr( # type: ignore
[<xarray.backends.zarr.ZarrStore at 0x7f1022bda030>,
<xarray.backends.zarr.ZarrStore at 0x7f10207cce40>]
!ls -ltrha /tmp/my_collection.zarr
total 28K
-rw-r--r-- 1 docs docs 24 Sep 6 03:23 .zgroup
drwxrwxrwt 1 root root 4.0K Sep 6 03:23 ..
drwxr-xr-x 6 docs docs 4.0K Sep 6 03:23 foo
drwxr-xr-x 6 docs docs 4.0K Sep 6 03:23 bar
-rw-r--r-- 1 docs docs 7.0K Sep 6 03:23 .zmetadata
drwxr-xr-x 4 docs docs 4.0K Sep 6 03:23 .
Loading a collection from disk#
To load a collection from disk, xcollection provides the xcollection.main.open_collection()
function. This function takes a path to a directory or a cloud storage bucket and reads the collection from a Zarr store.
new_col = xc.open_collection('/tmp/my_collection.zarr')
assert col == new_col
new_col