Tutorial

xcollection extends xarray’s data model to handle a dictionary of xarray Datasets. An xcollection.main.Collection behaves like a regular dictionary, with a few extra methods that make groups of datasets easier to work with.

Let’s start by importing the necessary packages.

import xarray as xr
import xcollection as xc
import typing

Creating a collection from a dictionary of datasets

To create a collection, we pass a dictionary of xarray.Dataset objects to the xcollection.main.Collection constructor.

ds = xr.tutorial.open_dataset('air_temperature')
ds.attrs = {}
dsa = xr.tutorial.open_dataset('rasm')
dsa.attrs = {}
col = xc.Collection({'foo': ds, 'bar': dsa})
col

Accessing keys and values in a collection

To access the keys and values of a collection, we can use the xcollection.main.Collection.keys() and xcollection.main.Collection.values() methods.

col.keys()
dict_keys(['foo', 'bar'])
col.values()
dict_values([<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ..., <xarray.Dataset>
Dimensions:  (time: 36, y: 205, x: 275)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
    Tair     (time, y, x) float64 ...])

In addition, we can use the xcollection.main.Collection.items() method to iterate over the key-value pairs of the collection.

for key, value in col.items():
    print(key, value)
foo <xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ...
bar <xarray.Dataset>
Dimensions:  (time: 36, y: 205, x: 275)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
    Tair     (time, y, x) float64 ...
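
The dictionary-style access shown above (keys(), values(), items()) is what Python's collections.abc.Mapping provides for any container that defines __getitem__, __iter__, and __len__. The following is a minimal sketch of that idea, not xcollection's actual implementation: MiniCollection is a hypothetical class, and plain strings stand in for xarray Datasets.

```python
from collections.abc import Mapping


class MiniCollection(Mapping):
    """Hypothetical dictionary-like container (not xcollection's class)."""

    def __init__(self, datasets=None):
        # Copy the input so later changes to the caller's dict do not leak in.
        self._datasets = dict(datasets or {})

    def __getitem__(self, key):
        return self._datasets[key]

    def __iter__(self):
        return iter(self._datasets)

    def __len__(self):
        return len(self._datasets)


# Plain strings stand in for xarray Datasets here.
mini = MiniCollection({"foo": "air dataset", "bar": "rasm dataset"})

# keys(), values(), and items() are inherited from Mapping.
print(list(mini.keys()))
print(list(mini.values()))
for key, value in mini.items():
    print(key, value)
```

Because the three dunder methods are the only required pieces, any extra methods can be layered on top without reimplementing the dictionary interface.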

Mapping operations over a collection

xcollection provides a number of methods that allow us to map arbitrary operations (functions) over the keys and values of a collection.

One such method is xcollection.main.Collection.map(). It takes a function, applies it to each value of the collection, and returns a new collection. Extra keyword arguments, such as dim_slice below, are passed on to the function. To demonstrate this, we’ll create a new collection with the same keys as the original, but with each dataset subsetted along the time dimension.

def subset(ds: xr.Dataset, dim_slice: typing.Dict[str, slice]):
    return ds.isel(**dim_slice)


new_col = col.map(subset, dim_slice={"time": slice(0, 3)})
new_col

As you can see, the new collection has the same keys as the original, but the values are subsets of the original values.
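
Conceptually, mapping over values is a dictionary comprehension that keeps the keys and transforms each value. The sketch below illustrates this on plain dictionaries; map_values is a hypothetical helper, lists of integers stand in for datasets, and list slicing stands in for ds.isel. It is not how xcollection itself is written.

```python
def map_values(datasets, func, **kwargs):
    """Apply func to every value and keep the keys unchanged (hypothetical helper)."""
    return {key: func(value, **kwargs) for key, value in datasets.items()}


def subset(values, dim_slice):
    # Stand-in for ds.isel(**dim_slice): slice a plain list instead of a Dataset.
    return values[dim_slice["time"]]


# Lists of integers stand in for datasets with a "time" dimension.
toy = {"foo": [0, 1, 2, 3, 4], "bar": [10, 11, 12, 13]}
toy_subset = map_values(toy, subset, dim_slice={"time": slice(0, 3)})
print(toy_subset)
```

The input dictionary is left untouched, which mirrors the way map() returns a new collection rather than modifying col.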

Another method is xcollection.main.Collection.keymap(). It takes a function, applies it to each key of the collection, and returns a new collection. This is useful for renaming the keys of a collection. Let’s create a new collection with the same values as the original, but with the keys converted to upper case.

def capitalize(key: str):
    return key.upper()

new_col_capitalized = col.keymap(capitalize)
new_col_capitalized
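
One thing to keep in mind when renaming keys is that two different keys may map to the same new key. The sketch below shows the renaming idea on a plain dictionary and raises an error on such collisions. map_keys is a hypothetical helper, and the collision check is an assumption added for illustration; the tutorial does not say how xcollection handles this case.

```python
def map_keys(datasets, func):
    """Rename every key with func, refusing to silently merge entries (hypothetical helper)."""
    renamed = {}
    for key, value in datasets.items():
        new_key = func(key)
        if new_key in renamed:
            raise ValueError(f"keys collide after renaming: {new_key!r}")
        renamed[new_key] = value
    return renamed


toy = {"foo": 1, "bar": 2}
print(map_keys(toy, str.upper))

# "foo" and "FOO" both become "FOO", so the helper raises instead of dropping one.
try:
    map_keys({"foo": 1, "FOO": 2}, str.upper)
except ValueError as err:
    print(err)
```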

Filtering a collection

xcollection provides a number of methods that allow us to filter the keys and values of a collection. One such method is xcollection.main.Collection.filter(). This method expects two arguments:

  1. func: a function that returns a boolean value

  2. by: a string that specifies whether the function is applied to keys ('key'), values ('value'), or items ('item').

Filtering based on keys

def contains_foo(key: str) -> bool:
    return 'foo' in key.lower()

col.filter(func=contains_foo, by='key')

Filtering based on values

def contains_time(ds: xr.Dataset) -> bool:
    return 'time' in ds.coords

col.filter(func=contains_time, by='value')

Filtering based on items

def contains_foo_and_spans_2014(item: tuple) -> bool:
    key, ds = item
    return 'foo' in key.lower() and 2014 in ds.time.dt.year

col.filter(func=contains_foo_and_spans_2014, by='item')
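
The three values of by differ only in what gets passed to the predicate: the key, the value, or the (key, value) tuple. The following plain-dictionary sketch makes that dispatch explicit. filter_entries is a hypothetical helper, and small dicts listing coordinate names stand in for datasets; this is an illustration of the idea rather than xcollection's code.

```python
def filter_entries(datasets, func, by):
    """Keep the entries for which func returns True (hypothetical helper)."""
    if by == "key":
        return {k: v for k, v in datasets.items() if func(k)}
    if by == "value":
        return {k: v for k, v in datasets.items() if func(v)}
    if by == "item":
        # The predicate receives the (key, value) pair as a single tuple.
        return {k: v for k, v in datasets.items() if func((k, v))}
    raise ValueError(f"by must be 'key', 'value', or 'item', got {by!r}")


# Each stand-in dataset is a dict listing its coordinate names.
toy = {
    "foo": {"coords": ["lat", "lon", "time"]},
    "bar": {"coords": ["x", "y"]},
}

by_key = filter_entries(toy, lambda key: "foo" in key.lower(), by="key")
by_value = filter_entries(toy, lambda ds: "time" in ds["coords"], by="value")
by_item = filter_entries(
    toy, lambda item: item[0] == "bar" and "x" in item[1]["coords"], by="item"
)
print(list(by_key), list(by_value), list(by_item))
```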

Choosing a subset of a collection based on data variables

xcollection provides an xcollection.main.Collection.choose() method that allows us to filter a collection based on whether its datasets contain one or more data variables. For example, our existing col collection contains two datasets: foo has air as a data variable, and bar has Tair as a data variable. We can filter the collection to only include datasets that have air as a data variable as follows:

new_col = col.choose(['air'], mode='any')
new_col

As you can see in the output, the new collection only contains the dataset foo.

By default, the mode argument is set to any, meaning that the resulting collection only contains the datasets that have one or more of the specified data variables; the other datasets are dropped. If we set the mode argument to all, xcollection raises an error if any dataset in the collection does not contain all of the specified data variables.

new_col = col.choose(['air'], mode='all')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _copy_listed(self, names)
   1359             try:
-> 1360                 variables[name] = self._variables[name]
   1361             except KeyError:

KeyError: 'air'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in _select_vars(dset)
    253             try:
--> 254                 return dset[data_vars]
    255             except KeyError:

~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in __getitem__(self, key)
   1500         else:
-> 1501             return self._copy_listed(key)
   1502 

~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _copy_listed(self, names)
   1361             except KeyError:
-> 1362                 ref_name, var_name, var = _get_virtual_variable(
   1363                     self._variables, name, self._level_coords, self.dims

~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _get_virtual_variable(variables, key, level_vars, dim_sizes)
    169     else:
--> 170         ref_var = variables[ref_name]
    171 

KeyError: 'air'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_2256/409816939.py in <module>
----> 1 new_col = col.choose(['air'], mode='all')

~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in choose(self, data_vars, mode)
    258 
    259         if mode == 'all':
--> 260             result = toolz.valmap(_select_vars, self.datasets)
    261         elif mode == 'any':
    262             result = toolz.valfilter(_select_vars, self.datasets)

~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/toolz/dicttoolz.py in valmap(func, d, factory)
     81     """
     82     rv = factory()
---> 83     rv.update(zip(d.keys(), map(func, d.values())))
     84     return rv
     85 

~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in _select_vars(dset)
    255             except KeyError:
    256                 if mode == 'all':
--> 257                     raise KeyError(f'No data variables: `{data_vars}` found in dataset: {dset!r}')
    258 
    259         if mode == 'all':

KeyError: "No data variables: `['air']` found in dataset: <xarray.Dataset>\nDimensions:  (time: 36, y: 205, x: 275)\nCoordinates:\n  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00\n    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91\n    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51\nDimensions without coordinates: y, x\nData variables:\n    Tair     (time, y, x) float64 ..."

As you can see in the output, we get a KeyError because the dataset bar does not contain the air data variable.
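
The difference between the two modes can be sketched on plain dictionaries, where each stand-in dataset maps data variable names to values. choose_vars is a hypothetical helper that follows the description in this section: any drops datasets that lack the requested variables, while all raises a KeyError as soon as one dataset is missing a variable. How the library treats datasets that match only part of the list is an assumption here and may differ in xcollection itself.

```python
def choose_vars(datasets, data_vars, mode="any"):
    """Select datasets by data variable name (hypothetical helper on plain dicts)."""
    if mode == "any":
        # Assumption: keep a dataset if it holds at least one requested variable.
        return {
            key: ds
            for key, ds in datasets.items()
            if any(name in ds for name in data_vars)
        }
    if mode == "all":
        selected = {}
        for key, ds in datasets.items():
            missing = [name for name in data_vars if name not in ds]
            if missing:
                raise KeyError(f"No data variables: {missing} found in dataset {key!r}")
            # Keep only the requested variables of each dataset.
            selected[key] = {name: ds[name] for name in data_vars}
        return selected
    raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")


toy = {"foo": {"air": [1.0, 2.0]}, "bar": {"Tair": [3.0, 4.0]}}

chosen = choose_vars(toy, ["air"], mode="any")
print(list(chosen))  # only "foo" is kept

try:
    choose_vars(toy, ["air"], mode="all")
except KeyError as err:
    print("KeyError:", err)
```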

Saving a collection to disk

To save a collection to disk, we can use the xcollection.main.Collection.to_zarr() method. This method takes a path to a directory or a cloud storage bucket and writes the collection as a zarr store. Each key in the collection is saved as a zarr group with the same name as the key.

col.to_zarr('/tmp/my_collection.zarr', consolidated=True, mode='w')
/home/docs/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py:2037: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(
[<xarray.backends.zarr.ZarrStore at 0x7f7977a2a570>,
 <xarray.backends.zarr.ZarrStore at 0x7f797784c580>]
!ls -ltrha /tmp/my_collection.zarr
total 28K
-rw-r--r-- 1 docs docs   24 Dec 23 16:57 .zgroup
drwxrwxrwt 1 root root 4.0K Dec 23 16:57 ..
drwxr-xr-x 6 docs docs 4.0K Dec 23 16:57 foo
drwxr-xr-x 6 docs docs 4.0K Dec 23 16:57 bar
-rw-r--r-- 1 docs docs 7.0K Dec 23 16:57 .zmetadata
drwxr-xr-x 4 docs docs 4.0K Dec 23 16:57 .

Loading a collection from disk

To load a collection from disk, xcollection provides an xcollection.main.open_collection() function. This function takes a path to a directory or a cloud storage bucket and reads the collection from a zarr store.

new_col = xc.open_collection('/tmp/my_collection.zarr')
assert col == new_col
new_col
/home/docs/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/pycompat.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  duck_array_version = LooseVersion("0.0.0")
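
The save and load steps above follow a simple pattern: one group per key on disk, and a reload that rebuilds an equal collection. The sketch below imitates that pattern with the standard library only, writing one subdirectory with a JSON file per key. It is an analogy for the on-disk layout, not the zarr format, and save_groups and load_groups are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path


def save_groups(datasets, root):
    """Write each entry to its own subdirectory of root (hypothetical helper)."""
    root = Path(root)
    for key, payload in datasets.items():
        group_dir = root / key
        group_dir.mkdir(parents=True, exist_ok=True)
        (group_dir / "data.json").write_text(json.dumps(payload))


def load_groups(root):
    """Rebuild the dictionary from the subdirectories written by save_groups."""
    return {
        group_dir.name: json.loads((group_dir / "data.json").read_text())
        for group_dir in sorted(Path(root).iterdir())
        if group_dir.is_dir()
    }


toy = {"foo": {"air": [1.0, 2.0]}, "bar": {"Tair": [3.0, 4.0]}}

with tempfile.TemporaryDirectory() as tmp:
    save_groups(toy, tmp)
    groups_on_disk = sorted(p.name for p in Path(tmp).iterdir())
    reloaded = load_groups(tmp)

print(groups_on_disk)
print(reloaded == toy)
```

As with the assert col == new_col check above, comparing the reloaded dictionary with the original is a quick way to confirm that nothing was lost in the round trip.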