STAC, xarray and dask | terrabyte Documentation (2024)

This tutorial demonstrates how to load satellite images from aSpatioTemporal Asset Catalog (STAC), createan xarray cube from them, and run a computationusing dask. For those who don't intend to usexarray and dask, the tutorial generally demonstrates how to find earthobservation data on our systems using the STAC metadata catalog andPython.

As an example, the temporal median of Sentinel-2 L2A data acquired overMunich in 2022 is computed, stored in a Zarr file, and visualized.

Setting up the JupyterLab environment

This demo is intended to run in a Jupyter notebook with a JupyterLabinstance inside a SLURM job. We are going to use theterrabyte Portal to set this up.

First we will create a mamba environment and install all Python packagesrequired to run the notebook. For this, we will start JupyterLab basedon micromamba and the pre-configured environment terrabyte_base.

Click on Jupyter Notebook in the portal (e.g. under Apps) and configurethe following (the micromamba version may change):

STAC, xarray and dask | terrabyte Documentation (1)

(full path to the environment:micromamba/<version>:/dss/dsstbyfs01/pn56su/pn56su-dss80020/micromamba/envs/terrabyte_base)

You may leave the rest unchanged and then launch the notebook. It willtake a little moment and you can then start the session.

Once connected, start a Jupyter notebook from the Launcher menu. Here weexecute the code to create our mamba environment "myenv" (or any nameyou prefer) with all Python packages needed for the tutorial:

!micromamba create -y -n myenv jupyter xarray rioxarray odc-stac odc-geo pystac-client dask folium geopandas zarr jupyter-server-proxy

Once installation has finished (it will take a while), you may close thisJupyterLab session and return to the terrabyte portal.

Here, you can now select the newly created micromamba environment tostart our actual JupyterLab session:

STAC, xarray and dask | terrabyte Documentation (2)

Here it is also worth defining a larger amount of resources foraccelerating the computation.

Once the JupyterLab is started you can begin with the tutorial.

A short word on STAC

STAC is the central way of accessing any spatio-temporal data onterrabyte. See here for an introduction and the detailed sepcification:

In principal, data is offered as a catalog containing data fromvarious sources. This catalog is further sub-divided intocollections. A collection could for example contain a certainsatellite data product like Sentinel-1 GRD, SLC or Sentinel-2 L2A. Eachcollection consists of multiple items, which might representindividual satellite scenes or product tiles (e.g. the MGRS tiles ofSentinel-2). Each item consists of one or many assets, whichcontain links to the actual data. For example individual GeoTIFF filesfor each band.

The terrabyte STAC catalog URL ishttps://stac.terrabyte.lrz.de/public/api.

Further down you will find examples on how to list all availablecollections and how to query the STAC catalog in Python using thepystac library. The content canalso be explored by opening the above link in a browser.

STAC, xarray and dask | terrabyte Documentation (3)

From there you can navigate to individual collections and explore theirmetadata content for search queries. For example, underhttps://stac.terrabyte.lrz.de/public/api/collections/sentinel-2-c0-l2a/itemsfeatures0properties we can list the properties of thefirst item of thesentinel-2-c0-l2a collection. This list contains,amongst others, the entries eo:cloud_cover and s2:mgrs_tile, whichwill be queried in the tutorial below. The namespaces eo, grid, proj,mgrs, sat and view refer to dedicatedSTAC extensions. s2is a custom namespace for expressing Sentinel-2 specific metadata thatcannot be described in other STAC extensions.

STAC, xarray and dask | terrabyte Documentation (4)

Let's load some Python packages

import os
import socket
import xarray
import rioxarray
from odc import stac as odc_stac
from odc.geo import geobox
from pystac_client import Client as pystacclient
import dask
from dask.distributed import Client
import folium
import folium.plugins as folium_plugins
import geopandas as gpd

Configuration

# the output directory to store the results
# the path may vary depending on your granted DSS storage location
# for now we use the Home Folder ~
dir_out = '~/xarray-dask-tutorial'

# STAC settings
catalog_url = 'https://stac.terrabyte.lrz.de/public/api'
collection_s2 = 'sentinel-2-c1-l2a'

# input data settings
year = 2022
resolution = 60
mgrs = '32UPU'
max_cloud_cover = 60
bands = ['nir', 'red']

# Output data
filename_base = f'S2_y{year}_res{resolution}_{band}_median_{mgrs}.zarr'
filename = os.path.join(dir_out, filename_base)

# dask
dask_tmpdir = os.path.join(dir_out, 'scratch', 'localCluster')
#we use the values supplied when starting the Jupyterlab session

Explore the STAC catalog

catalog = pystacclient.open(catalog_url, ignore_conformance=True)

# list the IDs of all STAC catalog collections
for collection in catalog.get_all_collections():
print( collection.id)

Query data from the STAC catalog

gte and lte describe greater than or equal and lower than or equal, respectively.

# STAC metadata entries for discovering assets
query = {
'eo:cloud_cover': {
"gte": 0,
"lte": max_cloud_cover
},
's2:mgrs_tile': {'eq': mgrs}
}

start = f'{year}-01-01T00:00:00Z'
end = f'{year}-12-31T23:59:59Z'

resultsS2Stac = catalog.search(
collections=[collection_s2],
datetime=[start, end],
query=query
)

items = list(resultsS2Stac.items())

Visualize the covered area

map = folium.Map(tiles='Stamen Terrain')
layer_control = folium.LayerControl(position='topright', collapsed=True)
fullscreen = folium_plugins.Fullscreen()
style = {'fillColor': '#00000000', "color": "#0000ff", "weight": 1}

footprints = folium.GeoJson(
gpd.GeoDataFrame.from_features([item.to_dict() for item in items]).to_json(),
name='Stac Item footprints',
style_function=lambda x: style,
control=True
)

footprints.add_to(map)
layer_control.add_to(map)
fullscreen.add_to(map)
map.fit_bounds(map.get_bounds())
map

STAC, xarray and dask | terrabyte Documentation (5)

Load the STAC items into xarray

cube = odc_stac.load(
items,
bands=[band],
resolution=resolution,
stac_cfg=cfg,
groupby='solar_day',
chunks={'x': 1024, 'y': 1024}, # spatial chunking
anchor=geobox.AnchorEnum.FLOATING # preserve original pixel grid
)

# temporal chunking
cube = cube.chunk(chunks={'time': -1})

# write CF-compliant CRS representation
cube = cube.rio.write_crs(cube.coords['spatial_ref'].values)
cube

STAC, xarray and dask | terrabyte Documentation (6)

Define the computation

median = cube.quantile(0.5, dim='time', skipna=True, keep_attrs=True)
median = median.rename({b: f'{b}_median' for b in list(cube.keys())})
median

STAC, xarray and dask | terrabyte Documentation (7)

Start the dask client with port forwarding via the jupyter-server-proxy

The client object displayed in the notebook (see example screenshotbelow) contains a link to the dask dashboard, which can be displayedlocally to monitor the progress of the computation once it started.

#get host and port information for the currently running Jupyterlab from Environment Variables
host=os.getenv("host")
JLPort=os.getenv("port")

#create to URL to point to the jupyter-server-proxy
daskUrl=f"https://portal.terrabyte.lrz.de/node/{host}/{JLPort}/proxy/"+"{port}/status"

#dask will insert the port it is running in the {port} part of the URL
dask.config.set({"distributed.dashboard.link": daskUrl})

#set the temporary dir (for intermediate files)
dask.config.set(temporary_directory=dask_tmpdir)

#Dask gets the information on how much CPU/MEM you have selected and sets its workers accordingly
client = Client(threads_per_worker=1)

client

STAC, xarray and dask | terrabyte Documentation (8)

Start the computation

%%time
delayed = median.to_zarr(filename, mode='w', compute=False, consolidated=True)
dask.compute(delayed)

Close the dask cluster

client.cluster.close()
time.sleep(5)
client.close()

Load and visualize the result

result = xarray.open_zarr(filename)
result.to_array().squeeze().plot.imshow()

STAC, xarray and dask | terrabyte Documentation (9)

STAC, xarray and dask | terrabyte Documentation (2024)

References

Top Articles
Latest Posts
Article information

Author: Catherine Tremblay

Last Updated:

Views: 6690

Rating: 4.7 / 5 (67 voted)

Reviews: 82% of readers found this page helpful

Author information

Name: Catherine Tremblay

Birthday: 1999-09-23

Address: Suite 461 73643 Sherril Loaf, Dickinsonland, AZ 47941-2379

Phone: +2678139151039

Job: International Administration Supervisor

Hobby: Dowsing, Snowboarding, Rowing, Beekeeping, Calligraphy, Shooting, Air sports

Introduction: My name is Catherine Tremblay, I am a precious, perfect, tasty, enthusiastic, inexpensive, vast, kind person who loves writing and wants to share my knowledge and understanding with you.