Tutorial¶
xcollection extends xarray’s data model to handle a dictionary of xarray Datasets. An xcollection.main.Collection behaves like a regular dictionary, but it also has a few extra methods that make it easier to work with several datasets at once.
Let’s start by importing the necessary packages.
import xarray as xr
import xcollection as xc
import typing
Creating a collection from a dictionary of datasets¶
To create a collection, we pass a dictionary of xarray.Dataset objects to the xcollection.main.Collection constructor.
ds = xr.tutorial.open_dataset('air_temperature')
ds.attrs = {}
dsa = xr.tutorial.open_dataset('rasm')
dsa.attrs = {}
col = xc.Collection({'foo': ds, 'bar': dsa})
col
Accessing keys and values in a collection¶
To access the keys and values of a collection, we can use the xcollection.main.Collection.keys() and xcollection.main.Collection.values() methods.
col.keys()
dict_keys(['foo', 'bar'])
col.values()
dict_values([<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ..., <xarray.Dataset>
Dimensions: (time: 36, y: 205, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 ...])
In addition, we can use the xcollection.main.Collection.items() method to iterate over (key, value) pairs.
for key, value in col.items():
    print(key, value)
foo <xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ...
bar <xarray.Dataset>
Dimensions: (time: 36, y: 205, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 ...
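One way to picture "behaves like a regular dictionary" is with the standard library's MutableMapping base class. The sketch below is only an illustration: MiniCollection is a hypothetical name, the values are placeholder strings rather than xarray Datasets, and nothing here is taken from xcollection's own implementation.

```python
from collections.abc import MutableMapping


class MiniCollection(MutableMapping):
    """Hypothetical dict-like container (not xcollection's code)."""

    def __init__(self, datasets=None):
        # Copy the input so later changes to the caller's dict do not leak in.
        self._datasets = dict(datasets or {})

    def __getitem__(self, key):
        return self._datasets[key]

    def __setitem__(self, key, value):
        self._datasets[key] = value

    def __delitem__(self, key):
        del self._datasets[key]

    def __iter__(self):
        return iter(self._datasets)

    def __len__(self):
        return len(self._datasets)


# Placeholder strings stand in for datasets.
mini = MiniCollection({"foo": "dataset-1", "bar": "dataset-2"})
keys = list(mini.keys())
items = list(mini.items())
```

Defining the five methods above is enough for MutableMapping to supply keys(), values(), items() and membership tests, which is why such a container can be used wherever a dictionary is expected.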
Mapping operations over a collection¶
xcollection provides a number of methods that allow us to map arbitrary operations (functions) over the keys and values of a collection.
One such method is xcollection.main.Collection.map(). This method takes a function and applies it to the values of the collection and returns a new collection.
To demonstrate this, we’ll create a new collection with the same keys as the original, but with each dataset subset along the time dimension.
def subset(ds: xr.Dataset, dim_slice: typing.Dict[str, slice]):
    return ds.isel(**dim_slice)
new_col = col.map(subset, dim_slice={"time": slice(0, 3)})
new_col
As you can see, the new collection has the same keys as the original, but the values are subsets of the original values.
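The behaviour described here (apply a function to every value, pass extra keyword arguments through, return a new container) can be sketched with plain dictionaries. This is an illustration under assumptions, not xcollection's implementation: map_values and subset_list are hypothetical helpers, and lists stand in for datasets.

```python
from typing import Any, Callable, Dict


def map_values(datasets: Dict[str, Any], func: Callable[..., Any], **kwargs: Any) -> Dict[str, Any]:
    """Call func(value, **kwargs) for each value and return a new dict."""
    return {key: func(value, **kwargs) for key, value in datasets.items()}


def subset_list(values: list, dim_slice: slice) -> list:
    """Stand-in for a dataset subset: slice a plain list."""
    return values[dim_slice]


original = {"foo": [10, 11, 12, 13, 14], "bar": [20, 21, 22, 23]}
mapped = map_values(original, subset_list, dim_slice=slice(0, 3))
```

The keys are untouched and the input dictionary is not modified; only the values in the returned dictionary change.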
Another method is xcollection.main.Collection.keymap(). This method takes a function, applies it to the keys of the collection, and returns a new collection. This is useful for renaming keys. Let’s create a new collection with the same values as the original, but with the keys converted to uppercase.
def capitalize(key: str):
    return key.upper()
new_col_capitalized = col.keymap(capitalize)
new_col_capitalized
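A key-mapping function should produce a distinct new key for each old key. The plain-dictionary sketch below shows why; map_keys is a hypothetical helper, integers stand in for datasets, and whether xcollection itself guards against colliding keys is not covered in this tutorial.

```python
from typing import Any, Callable, Dict


def map_keys(datasets: Dict[str, Any], func: Callable[[str], str]) -> Dict[str, Any]:
    """Return a new dict whose keys are func(old_key); values are unchanged."""
    return {func(key): value for key, value in datasets.items()}


# Distinct keys stay distinct after uppercasing.
renamed = map_keys({"foo": 1, "bar": 2}, str.upper)

# "foo" and "Foo" both become "FOO"; in a plain dict the later entry wins.
collided = map_keys({"foo": 1, "Foo": 2}, str.upper)
```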
Filtering a collection¶
xcollection provides a number of methods that allow us to filter the keys and values of a collection. One such method is xcollection.main.Collection.filter(). This method expects two arguments:
func: a function that returns a boolean value
by: specifies whether the function is applied to keys, values, or items
Filtering based on keys¶
def contains_foo(key: str) -> bool:
    return 'foo' in key.lower()
col.filter(func=contains_foo, by='key')
Filtering based on values¶
def contains_time(ds: xr.Dataset) -> bool:
    return 'time' in ds.coords
col.filter(func=contains_time, by='value')
Filtering based on items¶
def contains_foo_and_spans_2014(item: tuple) -> bool:
    key, ds = item
    return 'foo' in key.lower() and 2014 in ds.time.dt.year
col.filter(func=contains_foo_and_spans_2014, by='item')
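The three filtering modes differ only in what the predicate receives: the key, the value, or the (key, value) tuple. A minimal plain-dictionary sketch makes this explicit. filter_mapping is a hypothetical helper, not xcollection's code, and the datasets are modelled as dictionaries of variable names.

```python
from typing import Any, Callable, Dict


def filter_mapping(datasets: Dict[str, Any], func: Callable[[Any], bool], *, by: str) -> Dict[str, Any]:
    """Keep the entries for which func returns True.

    by='key'   -> func(key)
    by='value' -> func(value)
    by='item'  -> func((key, value))
    """
    if by == "key":
        return {k: v for k, v in datasets.items() if func(k)}
    if by == "value":
        return {k: v for k, v in datasets.items() if func(v)}
    if by == "item":
        return {k: v for k, v in datasets.items() if func((k, v))}
    raise ValueError(f"by must be 'key', 'value' or 'item', got {by!r}")


# Each stand-in dataset maps variable names to placeholder data.
data = {"foo": {"air": [1, 2]}, "bar": {"Tair": [3]}}

by_key = filter_mapping(data, lambda key: "foo" in key, by="key")
by_value = filter_mapping(data, lambda ds: "Tair" in ds, by="value")
by_item = filter_mapping(data, lambda item: item[0] == "foo" and "air" in item[1], by="item")
```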
Choosing a subset of a collection based on data variables¶
xcollection provides an xcollection.main.Collection.choose() method that allows us to filter a collection based on whether its datasets contain one or more data variables. For example, our existing col collection contains two datasets: foo has air as a data variable and bar has Tair as a data variable. We can filter the collection to only include datasets that have air as a data variable as follows:
new_col = col.choose(['air'], mode='any')
new_col
As you can see in the output, the new collection only contains the dataset foo.
By default, the mode argument is set to any, meaning that the new collection only contains the datasets that have the requested data variables; datasets without them are dropped. If we set the mode argument to all, xcollection raises a KeyError if any dataset in the collection does not contain all of the specified data variables.
new_col = col.choose(['air'], mode='all')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _copy_listed(self, names)
1359 try:
-> 1360 variables[name] = self._variables[name]
1361 except KeyError:
KeyError: 'air'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in _select_vars(dset)
253 try:
--> 254 return dset[data_vars]
255 except KeyError:
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in __getitem__(self, key)
1500 else:
-> 1501 return self._copy_listed(key)
1502
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _copy_listed(self, names)
1361 except KeyError:
-> 1362 ref_name, var_name, var = _get_virtual_variable(
1363 self._variables, name, self._level_coords, self.dims
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _get_virtual_variable(variables, key, level_vars, dim_sizes)
169 else:
--> 170 ref_var = variables[ref_name]
171
KeyError: 'air'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/tmp/ipykernel_2256/409816939.py in <module>
----> 1 new_col = col.choose(['air'], mode='all')
~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in choose(self, data_vars, mode)
258
259 if mode == 'all':
--> 260 result = toolz.valmap(_select_vars, self.datasets)
261 elif mode == 'any':
262 result = toolz.valfilter(_select_vars, self.datasets)
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/toolz/dicttoolz.py in valmap(func, d, factory)
81 """
82 rv = factory()
---> 83 rv.update(zip(d.keys(), map(func, d.values())))
84 return rv
85
~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in _select_vars(dset)
255 except KeyError:
256 if mode == 'all':
--> 257 raise KeyError(f'No data variables: `{data_vars}` found in dataset: {dset!r}')
258
259 if mode == 'all':
KeyError: "No data variables: `['air']` found in dataset: <xarray.Dataset>\nDimensions: (time: 36, y: 205, x: 275)\nCoordinates:\n * time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00\n xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91\n yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51\nDimensions without coordinates: y, x\nData variables:\n Tair (time, y, x) float64 ..."
As you can see in the output, we get a KeyError because the dataset bar does not contain the air data variable.
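The difference between the two modes can be sketched with plain dictionaries. This is a simplified model under assumptions, not xcollection's implementation: choose_vars is a hypothetical helper, each stand-in dataset is a dictionary keyed by variable name, and matching datasets are returned unchanged (the sketch does not model any subsetting of variables).

```python
from typing import Any, Dict, List


def choose_vars(datasets: Dict[str, Dict[str, Any]], data_vars: List[str], mode: str = "any") -> Dict[str, Dict[str, Any]]:
    """Select datasets that contain every name in data_vars.

    mode='any': skip datasets that lack the variables.
    mode='all': raise KeyError as soon as one dataset lacks them.
    """
    if mode not in ("any", "all"):
        raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")

    result = {}
    for key, dset in datasets.items():
        missing = [name for name in data_vars if name not in dset]
        if missing:
            if mode == "all":
                raise KeyError(f"No data variables: {data_vars} found in dataset: {key!r}")
            continue  # mode == 'any': drop this dataset silently
        result[key] = dset
    return result


data = {"foo": {"air": [1, 2]}, "bar": {"Tair": [3]}}

lenient = choose_vars(data, ["air"], mode="any")

try:
    choose_vars(data, ["air"], mode="all")
    strict_error = None
except KeyError as err:
    strict_error = err
```

In this model, any is the forgiving mode for mixed collections, while all acts as a check that every dataset carries the variables you need before further processing.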
Saving a collection to disk¶
To save a collection to disk, we can use the xcollection.main.Collection.to_zarr() method. This method takes a path to a directory or a cloud storage bucket and writes the collection as a zarr store. Each key in the collection is saved as a zarr group with the same name as the key.
col.to_zarr('/tmp/my_collection.zarr', consolidated=True, mode='w')
/home/docs/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py:2037: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
return to_zarr(
[<xarray.backends.zarr.ZarrStore at 0x7f7977a2a570>,
<xarray.backends.zarr.ZarrStore at 0x7f797784c580>]
!ls -ltrha /tmp/my_collection.zarr
total 28K
-rw-r--r-- 1 docs docs 24 Dec 23 16:57 .zgroup
drwxrwxrwt 1 root root 4.0K Dec 23 16:57 ..
drwxr-xr-x 6 docs docs 4.0K Dec 23 16:57 foo
drwxr-xr-x 6 docs docs 4.0K Dec 23 16:57 bar
-rw-r--r-- 1 docs docs 7.0K Dec 23 16:57 .zmetadata
drwxr-xr-x 4 docs docs 4.0K Dec 23 16:57 .
Loading a collection from disk¶
To load a collection from disk, xcollection provides the xcollection.main.open_collection() function. This function takes a path to a directory or a cloud storage bucket and reads the collection from a zarr store.
new_col = xc.open_collection('/tmp/my_collection.zarr')
assert col == new_col
new_col
/home/docs/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/pycompat.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
duck_array_version = LooseVersion("0.0.0")