Indexing¶
Xarray is fully documented in terms of its concepts and features, and we recommend going through its documentation at least once. However, that documentation is sometimes considered opaque or somewhat abstract: it is a full reference, listing all available operations, but it doesn't necessarily tell you how to combine these operations.
The documentation below attempts to provide a more specific guide, hopefully giving you a running start … or at the least a walking start rather than a crawling start.
Boolean arrays and bitwise operators¶
Python has a number of logical operators:

* and
* or
* not
* exclusive or (xor), via the caret symbol: ^

These work on a single boolean value, or a single pair of boolean values, i.e. True and False.
not True # False
True and False # False
True or False # True
not (True and False) # True
True ^ False # True
True ^ True # False
False ^ False # False
Additionally, numpy provides so-called bitwise operators. These work elementwise, on entire arrays of boolean values:

import numpy as np

a = np.array([True, True, False])
b = np.array([False, True, False])
# bitwise or
a | b # True, True, False
# bitwise and
a & b # False, True, False
# bitwise inversion
~a # False, False, True
~b # True, False, True
# bitwise xor
a ^ b # True, False, False
These operators can be used to manipulate entire boolean arrays at once.
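For example, two comparisons can be combined with & to select values within a range (a minimal sketch; the array values are arbitrary):

values = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
# parentheses are required: & binds more strongly than the comparisons
in_range = (values > -6.0) & (values < 6.0)  # False, True, True, True, False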
Conditional selection¶
Conditional selection in xarray is done with .where. The syntax is:

array.where(boolean_array, other=otherwise)

In English this translates to:

* Take the value of array where boolean_array is True
* Take the value of otherwise where boolean_array is False
For example, if we’re working on a digital elevation model (DEM), we might want to exclude certain values exceeding a threshold:
above_sea_level = dem > 0.0
dem_above_sea_level = dem.where(above_sea_level)
Alternatively, we might want to replace these values by the threshold. In this case, we have to take nan values properly into account. Note:

np.nan > 0.0 # False
np.nan < 0.0 # False

This means the following does not give the required result, since the nodata values in dem will have been filled with 0.0 as well:

corrected_dem = dem.where(above_sea_level, other=0.0)
There are multiple ways around this, by taking an additional step:
corrected_dem = dem.where(above_sea_level, other=0.0)
hasdata = np.isfinite(dem)
corrected_dem = corrected_dem.where(hasdata)
The cleanest way, however, is to use one of the bitwise operators, inversion:
below_sea_level = dem < 0.0
not_below_sea_level = ~below_sea_level
corrected_dem = dem.where(not_below_sea_level, other=0.0)
Step by step:

* First we compare the dem to 0.0. This means nodata values end up False in below_sea_level.
* Next we invert below_sea_level (flip around True and False). This means the values larger than 0.0, as well as the nodata values, are marked by True.
* This is the appropriate selection condition, by which the other values will be replaced correctly by the threshold value.
The default value of other is np.nan, which also acts as the default nodata value in xarray. This plays well with xarray’s reduction functions, which (unlike numpy’s) ignore np.nan values, thereby matching the behaviour of the numpy nan-functions such as np.nansum.

import xarray as xr

coords = {"y": np.arange(100.0), "x": np.arange(100.0)}
dims = ("y", "x")
concentration = xr.DataArray(np.random.rand(100, 100), coords, dims)
threshold = 0.5
above_threshold = (concentration > threshold).sum()
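To see this nan-skipping behaviour in action, compare reductions on a masked array (a minimal sketch, reusing concentration from above):

masked = concentration.where(concentration > threshold)
masked.sum()   # ignores the nan values introduced by .where
masked.mean()  # mean over the remaining values only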
Note that other can also be another xarray.DataArray.
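A minimal sketch, reusing dem and above_sea_level from above; backup_dem is a hypothetical second DataArray aligned with dem:

# backup_dem is assumed to share the dimensions and coordinates of dem
filled_dem = dem.where(above_sea_level, other=backup_dem)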
Combining¶
Another powerful feature is combining two arrays via combine_first. combine_first will automatically align two different arrays, and replace nan values in the first array by the values of the second. Be warned, however, that if you’re working with arrays that do not overlap spatially, combine_first will simply concatenate dimensions:
da_1 = xr.DataArray(np.random.rand(10), {"x": np.arange(0.0, 10.0)}, ["x"])
da_2 = xr.DataArray(np.random.rand(10), {"x": np.arange(20.0, 30.0)}, ["x"])
da_combined = da_1.combine_first(da_2)
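When the coordinates do overlap, nan values in the first array are filled by the second (a minimal sketch; da_a and da_b are illustrative names):

da_a = xr.DataArray([1.0, np.nan, 3.0], {"x": [0, 1, 2]}, ["x"])
da_b = xr.DataArray([10.0, 20.0, 30.0], {"x": [0, 1, 2]}, ["x"])
da_a.combine_first(da_b)  # 1.0, 20.0, 3.0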
combine_first combines well with .where. We can easily select the relevant part using .where, and then fill the nan values using combine_first (here alternative is another DataArray):

# In a single statement, via other:
dem = dem.where(dem < 0.0, other=alternative)

# Or in two steps, filling the nan values afterwards:
dem = dem.where(dem < 0.0)
dem = dem.combine_first(alternative)
Broadcasting¶
Numpy provides a feature called “broadcasting” to ease the use of many functions. Consider the following example, where we have a three dimensional array which describes the thickness of a number of geological layers. We wish to know the relative contribution of every layer to the total. In numpy, we would do so as follows:
thickness = np.random.rand(4, 100, 100)
total_thickness = np.sum(thickness, axis=0)
relative_thickness = thickness / total_thickness
This works well. However, it only works because the shapes of the numpy arrays match, comparing from the trailing dimensions and working backwards.

* The dimensions of thickness are (4 layers, 100 rows, 100 columns)
* The dimensions of total_thickness are (100 rows, 100 columns)
This means that if we want to multiply the thickness of each layer by a certain factor, we have to transpose the array, multiply it, and transpose it back:
factor = np.array([5.0, 10.0, 5.0, 10.0])
multiplied_thickness = (thickness.transpose() * factor).transpose()
This is rather bothersome. Fortunately, xarray provides labelled dimensions, and can align automatically based on dimension names, rather than axis order. In the examples below, this requires defining coordinates and dimensions first, but these generally already exist in real-life applications. Note that the xarray.DataArrays have been prefixed with da_ for clarity.
coords = {"layer": [1, 2, 3, 4], "x": np.arange(100.0), "y": np.arange(100.0)}
dims = ("layer", "y", "x")
da_thickness = xr.DataArray(thickness, coords, dims)
da_factor = xr.DataArray(factor, {"layer": [1, 2, 3, 4]}, ["layer"])
da_mult_thickness = da_thickness * da_factor
Note that it’s not just the infix (arithmetic) operators which support broadcasting: .where automatically broadcasts as well!

z = np.array([-5.0, -15.0, -25.0, -35.0])
da_z = xr.DataArray(z, {"layer": [1, 2, 3, 4]}, ["layer"])
coords_2d = {"y": coords["y"], "x": coords["x"]}
layer_bottom = xr.DataArray(np.random.rand(100, 100), coords_2d, ("y", "x"))
da_conc = xr.DataArray(np.random.rand(4, 100, 100), coords, dims)  # a concentration array
da_layer_conc = da_conc.where(da_z > layer_bottom)
Step by step:

* We create a DataArray describing the “z” dimension of the concentration array.
* We create the bottom boundary of a first layer.
* We take values from da_conc (with dimensions layer, y, x) using da_z (with dimension layer), and layer_bottom (with dimensions y, x).
* This creates a three dimensional boolean array, by which we can select the appropriate values from da_conc.
Morphology operations¶
scipy.ndimage provides a number of powerful morphological operations, which can be easily combined with xarray to provide additional versatility in modifying multi-dimensional arrays. skimage.morphology re-uses and expands on scipy.ndimage’s functionality, and provides very rich documentation.
A typical operation is binary_dilation or binary_erosion. Consider a (binary) mask, such as the location of land and sea, derived from a dem, as done above. After determining the land surface, we might want to add a buffer one cell thick to describe the transition zone from land to sea, which we can then proceed to use to assign a boundary condition to the model.
To identify these cells, we might use:
import scipy.ndimage

is_land = dem > 0.0  # land mask derived from the dem, as above

buffered = xr.full_like(is_land, False)
# scipy.ndimage supports the iterations argument
buffered.values = scipy.ndimage.binary_dilation(is_land.values, iterations=1)
coastal_zone = buffered & ~is_land
Or using the inverse:

is_sea = ~is_land

eroded = xr.full_like(is_sea, False)
eroded.values = scipy.ndimage.binary_erosion(is_sea.values, iterations=1)
coastal_zone = is_sea & ~eroded
coastal_zone = coastal_zone.where(is_sea)
These operations also work in three dimensions.
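A minimal sketch for the three dimensional case, using an arbitrary random mask:

mask_3d = np.random.rand(4, 100, 100) > 0.5
dilated_3d = scipy.ndimage.binary_dilation(mask_3d, iterations=1)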
skimage.morphology has a few other functions to be aware of:

* skimage.morphology.label to find connected zones (also in 3D)
* skimage.morphology.watershed to find watershed basins
* skimage.morphology.skeletonize to reduce connected zones to a pixel wide line
* skimage.morphology.flood_fill to perform flood filling on an image
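As an illustration, connected land regions could be labelled as follows (a sketch; in current scikit-image the labelling function also lives in skimage.measure):

import skimage.measure

labelled = xr.full_like(is_land, 0, dtype=int)
labelled.values = skimage.measure.label(is_land.values)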
Aggregating with groupby¶
A common operation in geospatial analyses is aggregating one grid based on another classification grid, for example land use.
We can do this via .where (here landuse is the classification grid, and grid holds the values to aggregate):

landuse_classes = np.unique(landuse.values)
landuse_classes = landuse_classes[~np.isnan(landuse_classes)]
aggregated = {}
for landuse_class in landuse_classes:
    aggregated[landuse_class] = grid.where(landuse == landuse_class).sum(["y", "x"])
A more convenient way of doing this is by using .groupby:

ds = xr.Dataset()
ds["landuse"] = landuse
ds["grid"] = grid
grouped = ds.groupby("landuse")
aggregated = grouped.sum("stacked_yx")
Note that the dimension to reduce over on the groupby object is rather non-obvious: stacked_yx. To find out how the coordinates have been combined, the following provides a peek at the first group:

list(grouped)[0][0]
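Iterating over a groupby object yields (label, group) pairs, so the group contents, including the name of the stacked dimension, can also be inspected directly (a minimal sketch):

label, group = list(grouped)[0]
group.dims  # reveals the name of the stacked dimension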
A note on performance¶
These implementations are technically not the most efficient: the aggregation takes multiple passes over the data, while in principle a single pass is sufficient:
from collections import defaultdict

aggregated = defaultdict(float)
nrow, ncol = grid.shape
for i in range(nrow):
    for j in range(ncol):
        landuse_class = landuse.values[i, j]
        aggregated[landuse_class] += grid.values[i, j]
While this code takes only a single pass over the data, it will generally be much slower than the two alternatives above. The reason is that while they take multiple passes, the passes are executed within compiled and fast numpy code, rather than slow dynamic Python loops. imod also provides a fast aggregation function, which does aggregate in a single pass.
This is true in general. Getting and setting values via .sel, .isel, and .where will generally be much faster than writing loops to collect or change elements. Where these methods are insufficiently fast, we might have defined a specific fast implementation in imod.
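For example, selecting by label or by position requires no explicit loop (a minimal sketch, reusing da_thickness from above):

da_thickness.sel(layer=2)               # select by coordinate label
da_thickness.isel(layer=1)              # the same layer, by integer position
da_thickness.sel(x=slice(10.0, 20.0))   # select a range along x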
A short demonstration¶
import imod

head = imod.idf.open("head*_l1.idf")
is_max_in_time = head == head.max("time")
max_year = head["time.year"].where(is_max_in_time).min("time")
This code snippet combines several features:

* We open the data for the first layer, for all times.
* We compute the year of every timestamp.
* We look for the heads (at every (x, y)) that are the highest at that location.
* We select the year of the timestamp at those moments; the other values end up as nan.
* Finally, we reduce this DataArray over time, finding the first moment where the head reaches its highest value.
Theoretically, a highest value might be reached at more than one point in time. This is easy to check:
occurrence = head["time.year"].where(head == head.max("time")).count("time")
occurrence.plot()