---
jupytext:
  text_representation:
    format_name: myst
kernelspec:
  display_name: Python 3
  name: python3
---

# Tutorial

xcollection extends [xarray's data model](https://xarray.pydata.org/en/stable/getting-started-guide/why-xarray.html) to handle a dictionary of xarray Datasets. A {py:class}`xcollection.main.Collection` behaves like a regular dictionary, but it also has a few extra methods that make working with the datasets easier.

Let's start by importing the necessary packages.

```{code-cell} ipython3
import xarray as xr
import xcollection as xc
import typing
```

## Creating a collection from a dictionary of datasets

To create a collection, we pass a dictionary of {py:class}`xarray.Dataset` objects to the {py:class}`xcollection.main.Collection` constructor.

```{code-cell} ipython3
# Load two tutorial datasets and clear their global attributes
ds = xr.tutorial.open_dataset('air_temperature')
ds.attrs = {}
dsa = xr.tutorial.open_dataset('rasm')
dsa.attrs = {}
```

```{code-cell} ipython3
col = xc.Collection({'foo': ds, 'bar': dsa})
col
```

## Accessing keys and values in a collection

To access the keys and values of a collection, we can use the {py:func}`xcollection.main.Collection.keys` and {py:func}`xcollection.main.Collection.values` methods.

```{code-cell} ipython3
col.keys()
```

```{code-cell} ipython3
col.values()
```

In addition, we can use the {py:func}`xcollection.main.Collection.items` method to iterate over the collection as (key, value) pairs.

```{code-cell} ipython3
for key, value in col.items():
    print(key, value)
```

## Mapping operations over a collection

xcollection provides a number of methods that allow us to map arbitrary operations (functions) over the keys and values of a collection. One such method is {py:func}`xcollection.main.Collection.map`. This method takes a function, applies it to each value of the collection, and returns a new collection.
To demonstrate this, we'll create a new collection with the same keys as the original, in which each dataset is subset to its first three time steps.

```{code-cell} ipython3
def subset(ds: xr.Dataset, dim_slice: typing.Dict[str, slice]):
    # Select by integer position along each dimension named in dim_slice
    return ds.isel(**dim_slice)

new_col = col.map(subset, dim_slice={"time": slice(0, 3)})
new_col
```

As you can see, the new collection has the same keys as the original, but the values are subsets of the original values.

Another method is {py:func}`xcollection.main.Collection.keymap`. This method takes a function, applies it to the **keys** of the collection, and returns a new collection. This is useful for renaming the keys of a collection. Let's create a new collection with the same values as the original, but with the keys converted to upper case.

```{code-cell} ipython3
def capitalize(key: str):
    return key.upper()

new_col_capitalized = col.keymap(capitalize)
new_col_capitalized
```

## Filtering a collection

xcollection provides a number of methods that allow us to filter the keys and values of a collection. One such method is {py:func}`xcollection.main.Collection.filter`. This method expects two arguments:

1. `func`: a function that returns a boolean value
2. `by`: specifies whether the function is applied to keys, values, or items.
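
To make the role of `by` concrete, here is a minimal sketch of the same idea on a plain dictionary. It is not xcollection's implementation: `filter_dict` is a hypothetical helper written only for illustration, and it assumes `by` takes the values `'key'`, `'value'`, or `'item'` used in the examples that follow.

```python
from typing import Any, Callable, Dict


def filter_dict(data: Dict[str, Any], func: Callable[..., bool], by: str) -> Dict[str, Any]:
    """Hypothetical helper: keep the entries of `data` for which `func` returns True."""
    if by == "key":
        # func receives only the key
        return {k: v for k, v in data.items() if func(k)}
    if by == "value":
        # func receives only the value
        return {k: v for k, v in data.items() if func(v)}
    if by == "item":
        # func receives the (key, value) tuple
        return {k: v for k, v in data.items() if func((k, v))}
    raise ValueError(f"by must be 'key', 'value' or 'item', got {by!r}")


sizes = {"foo": 3, "bar": 10, "foobar": 7}
by_key = filter_dict(sizes, lambda k: "foo" in k, by="key")
by_value = filter_dict(sizes, lambda v: v > 5, by="value")
by_item = filter_dict(sizes, lambda kv: "foo" in kv[0] and kv[1] > 5, by="item")
print(by_key, by_value, by_item)
```

In each case the result is a new dictionary and the input is left unchanged, which mirrors how the mapping methods above return a new collection.
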
### Filtering based on keys

```{code-cell} ipython3
def contains_foo(key: str) -> bool:
    return 'foo' in key.lower()

col.filter(func=contains_foo, by='key')
```

### Filtering based on values

```{code-cell} ipython3
def contains_time(ds: xr.Dataset) -> bool:
    return 'time' in ds.coords

col.filter(func=contains_time, by='value')
```

### Filtering based on items

```{code-cell} ipython3
def contains_foo_and_spans_2014(item: tuple) -> bool:
    # Each item is a (key, dataset) tuple
    key, ds = item
    return 'foo' in key.lower() and 2014 in ds.time.dt.year

col.filter(func=contains_foo_and_spans_2014, by='item')
```

## Choosing a subset of a collection based on data variables

xcollection provides a {py:func}`xcollection.main.Collection.choose` method that allows us to filter a collection based on whether its datasets contain one or more data variables. For example, our existing `col` collection contains two datasets: `foo` has `air` as a data variable and `bar` has `Tair` as a data variable. We can filter the collection to only include datasets that have `air` as a data variable as follows:

```{code-cell} ipython3
new_col = col.choose(['air'], mode='any')
new_col
```

As you can see in the output, the new collection only contains the dataset `foo`. By default, the `mode` argument is set to `any`, meaning that the new collection keeps only the datasets that contain at least one of the specified data variables. If we set the `mode` argument to `all`, xcollection raises an error if any dataset in the collection does not contain all of the specified data variables.

```{code-cell} ipython3
:tags: [raises-exception]

new_col = col.choose(['air'], mode='all')
```

As you can see in the output, we get an error because the dataset `bar` does not contain the `air` data variable.

## Saving a collection to disk

To save a collection to disk, we can use the {py:func}`xcollection.main.Collection.to_zarr` method. This method takes a path to a directory or a cloud storage bucket and writes the collection as a zarr store.
Each key in the collection is saved as a zarr group with the same name as the key.

```{code-cell} ipython3
col.to_zarr('/tmp/my_collection.zarr', consolidated=True, mode='w')
```

```{code-cell} ipython3
!ls -ltrha /tmp/my_collection.zarr
```

## Loading a collection from disk

To load a collection from disk, xcollection provides the {py:func}`xcollection.main.open_collection` function. This function takes a path to a directory or a cloud storage bucket and reads the collection from a zarr store.

```{code-cell} ipython3
new_col = xc.open_collection('/tmp/my_collection.zarr')
# The round trip should reproduce the original collection
assert col == new_col
new_col
```