Tutorial#

xcollection extends xarray’s data model to handle a dictionary of xarray Datasets. An xcollection.main.Collection behaves like a regular dictionary, but it also has a few extra methods that make it easier to work with.

Let’s start by importing the necessary packages.

import xarray as xr
import xcollection as xc
import typing

Creating a collection from a dictionary of datasets#

To create a collection, we pass a dictionary of xarray.Dataset objects to the xcollection.main.Collection constructor.

ds = xr.tutorial.open_dataset('air_temperature')
ds.attrs = {}
dsa = xr.tutorial.open_dataset('rasm')
dsa.attrs = {}
col = xc.Collection({'foo': ds, 'bar': dsa})
col

Accessing keys and values in a collection#

To access the keys and values of a collection, we can use the xcollection.main.Collection.keys() and xcollection.main.Collection.values() methods.

col.keys()
dict_keys(['foo', 'bar'])
col.values()
dict_values([<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ..., <xarray.Dataset>
Dimensions:  (time: 36, y: 205, x: 275)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
    Tair     (time, y, x) float64 ...])

In addition, we can use the xcollection.main.Collection.items() method to iterate over the (key, value) pairs of the collection.

for key, value in col.items():
    print(key, value)
foo <xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ...
bar <xarray.Dataset>
Dimensions:  (time: 36, y: 205, x: 275)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
    Tair     (time, y, x) float64 ...

Mapping operations over a collection#

xcollection provides a number of methods that allow us to map arbitrary operations (functions) over the keys and values of a collection.

One such method is xcollection.main.Collection.map(). This method takes a function, applies it to each value of the collection, and returns a new collection. In the call below, the extra keyword argument dim_slice is passed through to the function. To demonstrate this, we’ll create a new collection with the same keys as the original, but with each dataset subset to its first three time steps.

def subset(ds: xr.Dataset, dim_slice: typing.Dict[str, slice]):
    return ds.isel(**dim_slice)


new_col = col.map(subset, dim_slice={"time": slice(0, 3)})
new_col

As you can see, the new collection has the same keys as the original, but the values are subsets of the original values.
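As a rough mental model, and assuming map forwards extra keyword arguments to the function (as the call above suggests), the same operation on a plain dictionary looks like the sketch below. The helper name map_values and the lists standing in for datasets are illustrative only and are not part of xcollection.

```python
from typing import Any, Callable, Dict


def map_values(d: Dict[str, Any], func: Callable[..., Any], **kwargs: Any) -> Dict[str, Any]:
    """Apply func to every value, keeping the keys; returns a new dict."""
    return {key: func(value, **kwargs) for key, value in d.items()}


# Lists stand in for datasets; slicing stands in for ds.isel(time=slice(0, 3)).
def take(values: list, n: int) -> list:
    return values[:n]


data = {"foo": [1, 2, 3, 4, 5], "bar": [10, 20, 30, 40]}
subsetted = map_values(data, take, n=3)

print(subsetted)  # {'foo': [1, 2, 3], 'bar': [10, 20, 30]}
print(data)       # the original dictionary is left unchanged
```

The point of the sketch is that the keys survive untouched and the input is not modified, which matches the "returns a new collection" behaviour described above.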

Another method is xcollection.main.Collection.keymap(). This method takes a function, applies it to each key of the collection, and returns a new collection. This is useful for renaming the keys of a collection. Let’s create a new collection with the same values as the original, but with the keys converted to upper case.

def capitalize(key: str):
    return key.upper()

new_col_capitalized = col.keymap(capitalize)
new_col_capitalized
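The plain-dictionary sketch below shows the same idea and highlights one thing to watch for: if the function sends two different keys to the same new key, a plain dict silently keeps only the last entry. How xcollection handles such a collision is not covered in this tutorial, so treat the second half as a caution rather than a description of its behaviour. The helper name map_keys is hypothetical.

```python
from typing import Any, Callable, Dict


def map_keys(d: Dict[str, Any], func: Callable[[str], str]) -> Dict[str, Any]:
    """Apply func to every key, keeping the values; returns a new dict."""
    return {func(key): value for key, value in d.items()}


data = {"foo": 1, "bar": 2}
upper = map_keys(data, str.upper)
print(upper)  # {'FOO': 1, 'BAR': 2}

# If two keys map to the same new key, a plain dict keeps only the last one.
clash = {"foo": 1, "Foo": 2}
collided = map_keys(clash, str.upper)
print(collided)  # {'FOO': 2}
```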

Filtering a collection#

xcollection provides a number of methods that allow us to filter the keys and values of a collection. One such method is xcollection.main.Collection.filter(). This method expects two arguments:

  1. func: a function that returns a boolean value

  2. by: specifies whether the function is applied to keys ('key'), values ('value'), or items ('item').

Filtering based on keys#

def contains_foo(key: str) -> bool:
    return 'foo' in key.lower()

col.filter(func=contains_foo, by='key')

Filtering based on values#

def contains_time(ds: xr.Dataset) -> bool:
    return 'time' in ds.coords

col.filter(func=contains_time, by='value')

Filtering based on items#

def contains_foo_and_spans_2014(item: tuple) -> bool:
    key, ds = item
    return 'foo' in key.lower() and 2014 in ds.time.dt.year

col.filter(func=contains_foo_and_spans_2014, by='item')
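The three filtering modes above differ only in what is handed to the predicate. The sketch below reproduces that dispatch on a plain dictionary, with sets of coordinate names standing in for datasets. The helper filter_dict and the ValueError for an unknown by value are assumptions made for illustration, not xcollection’s implementation.

```python
from typing import Any, Callable, Dict


def filter_dict(d: Dict[str, Any], func: Callable[[Any], bool], by: str) -> Dict[str, Any]:
    """Keep the entries for which func returns True.

    by='key' passes the key, by='value' the value, by='item' a (key, value) tuple.
    """
    if by == "key":
        return {k: v for k, v in d.items() if func(k)}
    if by == "value":
        return {k: v for k, v in d.items() if func(v)}
    if by == "item":
        return {k: v for k, v in d.items() if func((k, v))}
    raise ValueError(f"by must be 'key', 'value' or 'item', got {by!r}")


# Sets of coordinate names stand in for datasets.
data = {"foo": {"time", "lat", "lon"}, "bar": {"time", "x", "y"}, "baz": {"x", "y"}}

by_key = filter_dict(data, lambda key: "foo" in key.lower(), by="key")
by_value = filter_dict(data, lambda coords: "time" in coords, by="value")
by_item = filter_dict(data, lambda item: "ba" in item[0] and "time" in item[1], by="item")

print(sorted(by_key))    # ['foo']
print(sorted(by_value))  # ['bar', 'foo']
print(sorted(by_item))   # ['bar']
```

Filtering by item is the most general form: it is the only one where the predicate can combine a condition on the key with a condition on the dataset.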

Choosing a subset of a collection based on data variables#

xcollection provides an xcollection.main.Collection.choose() method that allows us to filter a collection based on whether its datasets contain one or more data variables. For example, our existing col collection contains two datasets: foo has air as a data variable and bar has Tair as a data variable. We can filter the collection to only include datasets that have air as a data variable as follows:

new_col = col.choose(['air'], mode='any')
new_col

As you can see in the output, the new collection only contains the dataset foo.

By default, the mode argument is set to any, meaning that datasets lacking the requested data variables are dropped from the resulting collection. If we set the mode argument to all, xcollection raises an error if any dataset in the collection does not contain all of the specified data variables.

new_col = col.choose(['air'], mode='all')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:1279, in Dataset._copy_listed(self, names)
   1278 try:
-> 1279     variables[name] = self._variables[name]
   1280 except KeyError:

KeyError: 'air'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/latest/xcollection/main.py:254, in Collection.choose.<locals>._select_vars(dset)
    253 try:
--> 254     return dset[data_vars]
    255 except KeyError:

File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:1412, in Dataset.__getitem__(self, key)
   1411 if utils.iterable_of_hashable(key):
-> 1412     return self._copy_listed(key)
   1413 raise ValueError(f"Unsupported key-type {type(key)}")

File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:1281, in Dataset._copy_listed(self, names)
   1280 except KeyError:
-> 1281     ref_name, var_name, var = _get_virtual_variable(
   1282         self._variables, name, self.dims
   1283     )
   1284     variables[var_name] = var

File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:175, in _get_virtual_variable(variables, key, dim_sizes)
    174 if len(split_key) != 2:
--> 175     raise KeyError(key)
    177 ref_name, var_name = split_key

KeyError: 'air'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 new_col = col.choose(['air'], mode='all')

File ~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/latest/xcollection/main.py:260, in Collection.choose(self, data_vars, mode)
    257             raise KeyError(f'No data variables: `{data_vars}` found in dataset: {dset!r}')
    259 if mode == 'all':
--> 260     result = toolz.valmap(_select_vars, self.datasets)
    261 elif mode == 'any':
    262     result = toolz.valfilter(_select_vars, self.datasets)

File ~/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/toolz/dicttoolz.py:85, in valmap(func, d, factory)
     74 """ Apply function to values of dictionary
     75 
     76 >>> bills = {"Alice": [20, 15, 30], "Bob": [10, 35]}
   (...)
     82     itemmap
     83 """
     84 rv = factory()
---> 85 rv.update(zip(d.keys(), map(func, d.values())))
     86 return rv

File ~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/latest/xcollection/main.py:257, in Collection.choose.<locals>._select_vars(dset)
    255 except KeyError:
    256     if mode == 'all':
--> 257         raise KeyError(f'No data variables: `{data_vars}` found in dataset: {dset!r}')

KeyError: "No data variables: `['air']` found in dataset: <xarray.Dataset>\nDimensions:  (time: 36, y: 205, x: 275)\nCoordinates:\n  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00\n    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91\n    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51\nDimensions without coordinates: y, x\nData variables:\n    Tair     (time, y, x) float64 ..."

As you can see in the output, we get an error because the dataset bar does not contain the air data variable.
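Reading the frames in the traceback above, choose appears to try selecting the requested variables from each dataset: with mode='all' a failed selection raises a KeyError, while with mode='any' the dataset is filtered out. The sketch below mimics that control flow with plain dictionaries standing in for datasets. The function names and the ValueError for an unknown mode are hypothetical, and the real implementation may differ in details that are not visible in the traceback.

```python
from typing import Any, Dict, List, Optional

Dataset = Dict[str, Any]  # stand-in: variable name -> data


def select_vars(dset: Dataset, data_vars: List[str], mode: str) -> Optional[Dataset]:
    """Return only the requested variables; on a miss, raise ('all') or return None."""
    try:
        return {name: dset[name] for name in data_vars}
    except KeyError:
        if mode == "all":
            raise KeyError(f"No data variables: {data_vars} found in dataset: {dset!r}")
        return None


def choose(datasets: Dict[str, Dataset], data_vars: List[str], mode: str = "any") -> Dict[str, Dataset]:
    if mode == "all":
        # Every dataset must provide the variables; the first failure raises.
        return {key: select_vars(dset, data_vars, mode) for key, dset in datasets.items()}
    if mode == "any":
        # Datasets for which the selection fails are dropped.
        return {key: dset for key, dset in datasets.items() if select_vars(dset, data_vars, mode)}
    raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")


datasets = {"foo": {"air": [1.0, 2.0]}, "bar": {"Tair": [3.0, 4.0]}}
print(list(choose(datasets, ["air"], mode="any")))  # ['foo']

try:
    choose(datasets, ["air"], mode="all")
except KeyError as err:
    print("KeyError:", err)
```

In short, any is the forgiving mode for heterogeneous collections, and all is the strict mode to use when every dataset is expected to carry the variables.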

Saving a collection to disk#

To save a collection to disk, we can use the xcollection.main.Collection.to_zarr() method. This method takes a path to a directory or to a cloud storage bucket and writes the collection as a zarr store. Each key in the collection is saved as a zarr group with the same name as the key.

col.to_zarr('/tmp/my_collection.zarr', consolidated=True, mode='w')
/home/docs/checkouts/readthedocs.org/user_builds/xcollection/conda/latest/lib/python3.10/site-packages/xarray/core/dataset.py:2060: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore
[<xarray.backends.zarr.ZarrStore at 0x7f1022bda030>,
 <xarray.backends.zarr.ZarrStore at 0x7f10207cce40>]
!ls -ltrha /tmp/my_collection.zarr
total 28K
-rw-r--r-- 1 docs docs   24 Sep  6 03:23 .zgroup
drwxrwxrwt 1 root root 4.0K Sep  6 03:23 ..
drwxr-xr-x 6 docs docs 4.0K Sep  6 03:23 foo
drwxr-xr-x 6 docs docs 4.0K Sep  6 03:23 bar
-rw-r--r-- 1 docs docs 7.0K Sep  6 03:23 .zmetadata
drwxr-xr-x 4 docs docs 4.0K Sep  6 03:23 .
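Because each key becomes a subdirectory of the store, as the listing above shows, the keys of a collection saved to a local path can be recovered by listing those subdirectories. The sketch below uses only the standard library. The helper list_groups is hypothetical and assumes a local directory store laid out as in the listing.

```python
from pathlib import Path
from typing import List


def list_groups(store: str) -> List[str]:
    """Return the names of the subdirectories of a local store, sorted."""
    return sorted(p.name for p in Path(store).iterdir() if p.is_dir())


store = Path("/tmp/my_collection.zarr")
if store.is_dir():  # only inspect the store if it has been written
    print(list_groups(str(store)))
```

Each name returned this way could presumably be passed as the group argument of xarray.open_zarr to open a single dataset without loading the whole collection, assuming the groups are ordinary xarray-written zarr groups.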

Loading a collection from disk#

To load a collection from disk, xcollection provides the xcollection.main.open_collection() function. This function takes a path to a directory or to a cloud storage bucket and reads the collection from a zarr store.

new_col = xc.open_collection('/tmp/my_collection.zarr')
assert col == new_col
new_col