Tutorial
xcollection extends xarray’s data model to handle a dictionary of xarray Datasets. An xcollection.main.Collection
behaves like a regular dictionary, but it also adds methods for mapping over, filtering, and saving the datasets it holds.
Let’s start by importing the necessary packages.
import xarray as xr
import xcollection as xc
import typing
Creating a collection from a dictionary of datasets
To create a collection, we just pass a dictionary of xarray.Dataset
objects to the xcollection.main.Collection
constructor.
ds = xr.tutorial.open_dataset('air_temperature')
ds.attrs = {}
dsa = xr.tutorial.open_dataset('rasm')
dsa.attrs = {}
col = xc.Collection({'foo': ds, 'bar': dsa})
col
Accessing keys and values in a collection
To access the keys and values of a collection, we can use the xcollection.main.Collection.keys()
and xcollection.main.Collection.values()
methods.
col.keys()
dict_keys(['foo', 'bar'])
col.values()
dict_values([<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ..., <xarray.Dataset>
Dimensions: (time: 36, y: 205, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 ...])
In addition, we can use the xcollection.main.Collection.items()
method to iterate over the (key, value) pairs.
for key, value in col.items():
    print(key, value)
foo <xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ...
bar <xarray.Dataset>
Dimensions: (time: 36, y: 205, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 ...
Mapping operations over a collection
xcollection provides a number of methods that allow us to map arbitrary operations (functions) over the keys and values of a collection.
One such method is xcollection.main.Collection.map()
. This method takes a function and applies it to the values of the collection and returns a new collection.
To demonstrate this, we’ll create a new collection with the same keys and values as the original, but with the values of the original collection subsetted along the time dimension.
def subset(ds: xr.Dataset, dim_slice: typing.Dict[str, slice]):
    return ds.isel(**dim_slice)
new_col = col.map(subset, dim_slice={"time": slice(0, 3)})
new_col
As you can see, the new collection has the same keys as the original, but the values are subsets of the original values.
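To make the semantics of mapping over values concrete, here is a minimal plain-dictionary sketch. It is not xcollection's implementation: `map_values`, `first_n`, and `fake_col` are hypothetical names, and lists of integers stand in for datasets so the snippet runs without xarray.

```python
from typing import Any, Callable, Dict


def map_values(mapping: Dict[str, Any], func: Callable[..., Any], **kwargs: Any) -> Dict[str, Any]:
    """Apply func to every value, forwarding extra keyword arguments."""
    return {key: func(value, **kwargs) for key, value in mapping.items()}


def first_n(values: list, n: int) -> list:
    """Keep the first n entries (a stand-in for slicing along time)."""
    return values[:n]


# Hypothetical stand-ins for datasets: plain lists of time steps.
fake_col = {"foo": list(range(10)), "bar": list(range(5))}

subset_col = map_values(fake_col, first_n, n=3)
print(subset_col)  # {'foo': [0, 1, 2], 'bar': [0, 1, 2]}
```

The keys are unchanged, the original mapping is left untouched, and extra keyword arguments are passed through to the function, which mirrors how dim_slice is forwarded to subset in the call above.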
Another method is xcollection.main.Collection.keymap()
. This method takes a function, applies it to the keys of the collection, and returns a new collection, which is useful for renaming keys. Let’s create a new collection with the same values as the original, but with the keys converted to upper case.
def capitalize(key: str):
    return key.upper()
new_col_capitalized = col.keymap(capitalize)
new_col_capitalized
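The same idea can be sketched for keys with a plain dictionary. This is only an illustration: `map_keys` is a hypothetical helper, not part of xcollection, and the integers stand in for datasets. It also shows a general pitfall of renaming dictionary keys: if the function sends two keys to the same name, a plain dictionary keeps only the last value. The tutorial does not say how xcollection handles such collisions, so it is worth checking before relying on either behaviour.

```python
from typing import Any, Callable, Dict


def map_keys(mapping: Dict[str, Any], func: Callable[[str], str]) -> Dict[str, Any]:
    """Return a new dict whose keys are func(key); values are unchanged."""
    return {func(key): value for key, value in mapping.items()}


# Hypothetical stand-ins for datasets.
fake_col = {"foo": 1, "bar": 2}
upper_col = map_keys(fake_col, str.upper)
print(upper_col)  # {'FOO': 1, 'BAR': 2}

# Two distinct keys that collapse to the same name after upper-casing:
clash = map_keys({"foo": 1, "Foo": 2}, str.upper)
print(clash)  # {'FOO': 2}
```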
Filtering a collection
xcollection provides a number of methods that allow us to filter the keys and values of a collection. One such method is xcollection.main.Collection.filter()
. This method expects two arguments:
func
: a function that returns a boolean value
by
: which specifies whether the function is applied on keys ('key'), values ('value') or items ('item').
Filtering based on keys
def contains_foo(key: str) -> bool:
    return 'foo' in key.lower()
col.filter(func=contains_foo, by='key')
Filtering based on values
def contains_time(ds: xr.Dataset) -> bool:
    return 'time' in ds.coords
col.filter(func=contains_time, by='value')
Filtering based on items
def contains_foo_and_spans_2014(item: tuple) -> bool:
    key, ds = item
    return 'foo' in key.lower() and 2014 in ds.time.dt.year
col.filter(func=contains_foo_and_spans_2014, by='item')
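The three filtering modes differ only in what gets passed to the predicate. The sketch below reproduces that dispatch with a plain dictionary; `filter_mapping` and `fake_col` are hypothetical, the datasets are replaced by dictionaries holding a set of years, and nothing here should be read as the library's actual code.

```python
from typing import Any, Callable, Dict


def filter_mapping(mapping: Dict[str, Any], func: Callable[[Any], bool], by: str) -> Dict[str, Any]:
    """Keep the entries for which func returns True.

    by='key' passes the key, by='value' passes the value,
    by='item' passes the (key, value) tuple.
    """
    if by == "key":
        return {k: v for k, v in mapping.items() if func(k)}
    if by == "value":
        return {k: v for k, v in mapping.items() if func(v)}
    if by == "item":
        return {k: v for k, v in mapping.items() if func((k, v))}
    raise ValueError(f"by must be 'key', 'value' or 'item', got {by!r}")


# Hypothetical stand-ins: each "dataset" only records the years it covers.
fake_col = {
    "foo": {"years": {2013, 2014}},
    "bar": {"years": {1980, 1981, 1982, 1983}},
}

by_key = filter_mapping(fake_col, lambda key: "foo" in key, by="key")
by_value = filter_mapping(fake_col, lambda ds: 1980 in ds["years"], by="value")
by_item = filter_mapping(
    fake_col,
    lambda item: "foo" in item[0] and 2014 in item[1]["years"],
    by="item",
)
print(list(by_key), list(by_value), list(by_item))  # ['foo'] ['bar'] ['foo']
```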
Choosing a subset of a collection based on data variables
xcollection provides a xcollection.main.Collection.choose()
method that allows us to filter a collection based on whether its datasets contain one or more data variables. For example, our existing col
collection contains two datasets. foo
is a dataset with air
as a data variable and bar
has Tair
as a data variable. We can filter the collection to only include datasets that have air
as a data variable as follows:
new_col = col.choose(['air'], mode='any')
new_col
As you can see in the output, the new collection only contains the dataset foo
.
By default, the mode
argument is set to any
, meaning that datasets lacking the requested data variables are dropped from the resulting collection. If we set the mode
argument to all
, xcollection raises a KeyError if any of the datasets in the collection does not contain all of the data variables specified.
new_col = col.choose(['air'], mode='all')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _copy_listed(self, names)
1359 try:
-> 1360 variables[name] = self._variables[name]
1361 except KeyError:
KeyError: 'air'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in _select_vars(dset)
253 try:
--> 254 return dset[data_vars]
255 except KeyError:
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in __getitem__(self, key)
1500 else:
-> 1501 return self._copy_listed(key)
1502
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _copy_listed(self, names)
1361 except KeyError:
-> 1362 ref_name, var_name, var = _get_virtual_variable(
1363 self._variables, name, self._level_coords, self.dims
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py in _get_virtual_variable(variables, key, level_vars, dim_sizes)
169 else:
--> 170 ref_var = variables[ref_name]
171
KeyError: 'air'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/tmp/ipykernel_2256/409816939.py in <module>
----> 1 new_col = col.choose(['air'], mode='all')
~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in choose(self, data_vars, mode)
258
259 if mode == 'all':
--> 260 result = toolz.valmap(_select_vars, self.datasets)
261 elif mode == 'any':
262 result = toolz.valfilter(_select_vars, self.datasets)
~/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/toolz/dicttoolz.py in valmap(func, d, factory)
81 """
82 rv = factory()
---> 83 rv.update(zip(d.keys(), map(func, d.values())))
84 return rv
85
~/checkouts/readthedocs.org/user_builds/xcollection/checkouts/stable/xcollection/main.py in _select_vars(dset)
255 except KeyError:
256 if mode == 'all':
--> 257 raise KeyError(f'No data variables: `{data_vars}` found in dataset: {dset!r}')
258
259 if mode == 'all':
KeyError: "No data variables: `['air']` found in dataset: <xarray.Dataset>\nDimensions: (time: 36, y: 205, x: 275)\nCoordinates:\n * time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00\n xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91\n yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51\nDimensions without coordinates: y, x\nData variables:\n Tair (time, y, x) float64 ..."
As you can see in the output, we get a KeyError because the dataset bar
does not contain the air
data variable.
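The contrast between the two modes can be sketched with plain dictionaries. This is an illustration under stated assumptions, not xcollection's code: `choose_sketch` is a hypothetical function, each dataset is modelled as a dictionary from variable name to data, and a dataset counts as matching only when it holds every requested name, which is one reading of the traceback excerpt above. The sketch models only which keys survive, not what each returned dataset contains.

```python
from typing import Any, Dict, List

Dataset = Dict[str, Any]  # hypothetical stand-in: variable name -> data


def choose_sketch(datasets: Dict[str, Dataset], data_vars: List[str], mode: str = "any") -> Dict[str, Dataset]:
    """Keep matching datasets; in 'all' mode, fail on the first mismatch."""
    if mode not in ("any", "all"):
        raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")
    result = {}
    for key, dset in datasets.items():
        missing = [name for name in data_vars if name not in dset]
        if not missing:
            result[key] = dset
        elif mode == "all":
            # Strict mode: one non-matching dataset aborts the whole call.
            raise KeyError(f"No data variables: {missing} found in dataset: {key!r}")
        # Lenient mode ('any'): the non-matching dataset is skipped silently.
    return result


fake_col = {"foo": {"air": [1.0, 2.0]}, "bar": {"Tair": [3.0, 4.0]}}

lenient = choose_sketch(fake_col, ["air"], mode="any")
print(list(lenient))  # ['foo']

try:
    choose_sketch(fake_col, ["air"], mode="all")
    strict_failed = False
except KeyError as err:
    strict_failed = True
    print("strict mode failed:", err)
```

In practice, the lenient mode suits exploratory work where some datasets are expected to lack a variable, while the strict mode is a cheap way to assert that every dataset has what a later step needs.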
Saving a collection to disk
To save a collection to disk, we can use the xcollection.main.Collection.to_zarr()
method. This method takes a path to a directory or a cloud storage bucket and writes the collection as a zarr store. Each key in the collection is saved as a zarr group with the same name as the key.
col.to_zarr('/tmp/my_collection.zarr', consolidated=True, mode='w')
/home/docs/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/dataset.py:2037: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
return to_zarr(
[<xarray.backends.zarr.ZarrStore at 0x7f7977a2a570>,
<xarray.backends.zarr.ZarrStore at 0x7f797784c580>]
!ls -ltrha /tmp/my_collection.zarr
total 28K
-rw-r--r-- 1 docs docs 24 Dec 23 16:57 .zgroup
drwxrwxrwt 1 root root 4.0K Dec 23 16:57 ..
drwxr-xr-x 6 docs docs 4.0K Dec 23 16:57 foo
drwxr-xr-x 6 docs docs 4.0K Dec 23 16:57 bar
-rw-r--r-- 1 docs docs 7.0K Dec 23 16:57 .zmetadata
drwxr-xr-x 4 docs docs 4.0K Dec 23 16:57 .
Loading a collection from disk
To load a collection from disk, xcollection provides a xcollection.main.open_collection()
function. This function takes a path to a directory or a cloud storage bucket and reads the collection from a zarr store.
new_col = xc.open_collection('/tmp/my_collection.zarr')
assert col == new_col
new_col
/home/docs/checkouts/readthedocs.org/user_builds/xcollection/conda/stable/lib/python3.10/site-packages/xarray/core/pycompat.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
duck_array_version = LooseVersion("0.0.0")