Computational aspects of AI for environmental sciences

Interpretability & Analysis
Design patterns
large-scale data management
large-scale data analysis
data science
storage systems
data management

Course Overview

This lecture provides an introduction to modern techniques for large-scale data handling and data analysis. It focuses primarily on Earth system data, but the principles described here also apply to many other types of data from other scientific disciplines. The course begins with an easy, general introduction and leads to advanced data management concepts and design patterns towards the end. The notion of design patterns is borrowed from software development, where it describes reusable solutions to recurring problems. While reusing large-scale data workflows is difficult, there are nevertheless some overarching principles and techniques that are useful to know and understand. Examples of such techniques include concepts from database design, such as sharding, and modern programming paradigms for massive data analysis on distributed clouds, such as MapReduce. At the end of this lecture you should have a basic understanding of the main challenges of large-scale data handling and of several important techniques you can use to address these challenges.

Details

Lessons:
17
Course Length:
1 h 57 min

Lecturer

PD Dr. Martin Schultz

Intro Part I - Data science and big data analytics

General introduction to data science

PD Dr. Martin Schultz

Web accessible data & data publications

Many datasets from environmental models and observations are now available through web services. Here, I explain how you can work with them.

PD Dr. Martin Schultz

Python's requests library

Python's requests library is an important cornerstone for interacting with web services from within a Python program. We therefore dive a little deeper into it here.

PD Dr. Martin Schultz
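
As a minimal sketch of the kind of interaction this lesson covers, the snippet below uses the requests library to build a GET request with query parameters. The base URL, the endpoint, and the parameter names are hypothetical placeholders and do not refer to a real data service. The request is only prepared, not sent, so the example runs without network access; the helper `fetch_json` shows how the same call would typically be sent.

```python
import requests

# Hypothetical endpoint and query parameters, for illustration only.
BASE_URL = "https://example.org/api/v1/timeseries"
params = {"station": "ABC123", "variable": "o3", "limit": 10}

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", BASE_URL, params=params).prepare()
print(prepared.method, prepared.url)


def fetch_json(url, params, timeout=30):
    """Send a GET request and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()
```

Setting a timeout and calling `raise_for_status()` are common defensive choices when a program depends on a remote service.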

Some hints for good data management

Before you get lost in massive amounts of data, it is useful to understand some good practices for data management. This part of the lecture is intended to help you with that.

PD Dr. Martin Schultz

The netCDF file format

NetCDF is an important format for storing environmental data, and it is primarily used for gridded model output and input. Here, I explain the netCDF data model and show how you can work with data in this format.

PD Dr. Martin Schultz

The role of metadata

Data are useless if you don't know what they represent, in what units the variables are stored, or where the data come from. Here, I introduce a number of fundamental aspects of environmental metadata.

PD Dr. Martin Schultz

Work with netCDF data in Python

This final section of part 1 of the lecture introduces some advanced Python tools and libraries for efficient and user-friendly work with netCDF data.

PD Dr. Martin Schultz

Intro Part II - Data science and big data analytics

Introduction to part 2 of this course.

PD Dr. Martin Schultz

Types of data in Earth system science

Earth system data comes in many different formats and shapes. This part provides a general overview of Earth system data types and their key properties.

PD Dr. Martin Schultz

5 "V"s of Earth system data types

Here, we explore what characterizes "large" data and provide some examples. The first important aspect to investigate is data volume.

PD Dr. Martin Schultz
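
To make the notion of data volume concrete, here is a back-of-the-envelope estimate for one gridded variable. All numbers (grid resolution, number of levels, output frequency, data type) are assumptions chosen for illustration and do not describe a specific dataset.

```python
# All sizes below are assumptions chosen for illustration.
n_lon, n_lat = 1440, 721      # 0.25 degree global grid
n_levels = 137                # vertical model levels
n_times = 24 * 365            # hourly output for one non-leap year
bytes_per_value = 4           # 32-bit float, uncompressed

values = n_lon * n_lat * n_levels * n_times
volume_bytes = values * bytes_per_value
volume_tb = volume_bytes / 1e12   # decimal terabytes

print(f"{values:,} values, {volume_tb:.2f} TB for a single variable")
```

Under these assumptions a single three-dimensional variable already occupies roughly 5 TB per year before compression, which is why volume is the first aspect to investigate.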

How to cope with > 1 TByte of data

This final section of part 2 provides a glimpse of tools and techniques for working with really large datasets.

PD Dr. Martin Schultz

Intro Part III - Data science and big data analytics

Background information on large-scale data handling

PD Dr. Martin Schultz

Challenges of large-scale data analysis and data system architectures

This part of the lecture describes different types of data storage systems and discusses some implications for the management of data. It covers simple file systems, databases and data warehouses, hierarchical storage architectures on HPC systems, and complex client-server architectures.

PD Dr. Martin Schultz

Data structures, data models & data patterns

Data come in many different forms and formats. The following data types are relevant for Earth sciences: unstructured data, point clouds, series and time series, tree structures, relational tables, graphs, gridded data, images, and videos. Data structures and formats influence access patterns and access speed.

PD Dr. Martin Schultz

Classic design patterns

This section explains a number of classic data handling patterns and techniques. It starts with the extract-transform-load pattern, describes some aspects of chunking and tiling, and introduces index tables and memory mapping.

PD Dr. Martin Schultz
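
Three of the patterns named above (chunking, index tables, and memory mapping) can be combined in a small sketch that uses only the Python standard library. The file layout, the chunk size, and the helper `read_chunk` are hypothetical choices for illustration: a synthetic field is stored one time step per chunk, an index table maps each time step to its byte offset, and a memory map reads one chunk without loading the whole file.

```python
import mmap
import os
import struct
import tempfile

# Synthetic data: 4 time steps on a 2 x 3 grid, stored as 32-bit floats.
n_time, n_lat, n_lon = 4, 2, 3
chunk_values = n_lat * n_lon               # one chunk = one time step
chunk_bytes = chunk_values * 4

path = os.path.join(tempfile.mkdtemp(), "field.bin")
with open(path, "wb") as f:
    for t in range(n_time):
        # Each value encodes its time step so that reads are easy to check.
        f.write(struct.pack(f"<{chunk_values}f", *[float(t)] * chunk_values))

# Index table: time step -> byte offset of its chunk in the file.
index = {t: t * chunk_bytes for t in range(n_time)}


def read_chunk(path, t):
    """Read a single time step through a memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index[t]
            raw = mm[start:start + chunk_bytes]
    return struct.unpack(f"<{chunk_values}f", raw)


chunk = read_chunk(path, 2)
print(chunk)
```

The choice of chunk shape determines which access pattern is cheap: here reading one time step touches one contiguous block, whereas a time series at a single grid point would touch every chunk.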

Modern design patterns

Here, we discuss some modern design patterns for data management, with a particular focus on distributed architectures. The key concepts introduced in this section, each with examples, are asynchronous processing, caching, messaging, and sharding. These concepts are important for enabling parallel data processing in heterogeneous environments.

PD Dr. Martin Schultz
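
The four concepts can be illustrated together in one small, single-process sketch based on the Python standard library. This is an assumption-laden toy, not a distributed system: the number of shards, the function `shard_for`, the worker layout, and the station names are all hypothetical. Records are routed to a shard by a stable hash (sharding), the routing result is memoised (caching), each shard has its own queue (messaging), and one coroutine per shard consumes its queue concurrently (asynchronous processing).

```python
import asyncio
import hashlib
from functools import lru_cache

N_SHARDS = 4  # hypothetical number of storage or compute nodes


@lru_cache(maxsize=None)
def shard_for(key: str) -> int:
    """Sharding with caching: stable hash of the key, memoised per key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS


async def worker(shard_id, queue, results):
    """Messaging: consume messages from this shard's queue until a sentinel."""
    while True:
        message = await queue.get()
        if message is None:       # sentinel: no more work for this shard
            break
        results[shard_id].append(message)


async def main(records):
    queues = [asyncio.Queue() for _ in range(N_SHARDS)]
    results = {shard_id: [] for shard_id in range(N_SHARDS)}
    # Asynchronous processing: one concurrent worker task per shard.
    workers = [
        asyncio.create_task(worker(i, q, results)) for i, q in enumerate(queues)
    ]
    for station, value in records:
        await queues[shard_for(station)].put((station, value))
    for q in queues:
        await q.put(None)
    await asyncio.gather(*workers)
    return results


# Synthetic records: (station name, measured value).
records = [("station_a", 1.0), ("station_b", 2.0), ("station_a", 3.0)]
results = asyncio.run(main(records))
print(results)
print(shard_for.cache_info())
```

Because the hash is stable, all records of one station land on the same shard, so each worker can process its share independently of the others.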

Hadoop & MapReduce

Hadoop and MapReduce are two examples of modern, sophisticated designs for the asynchronous parallel processing of massive amounts of data. This section of the lecture provides an overview of the Hadoop architecture and illustrates the MapReduce algorithm with an example.
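
The three phases of MapReduce (map, shuffle, reduce) can be imitated in plain Python. The sketch below runs in a single process and does not use the Hadoop API; in a real cluster the map and reduce functions would run in parallel on the nodes that hold the data blocks. The station names and temperature values are synthetic.

```python
from collections import defaultdict
from functools import reduce

# Synthetic input records, split into two blocks as if stored on two nodes.
blocks = [
    ["station_a,12.5", "station_b,14.0", "station_a,15.5"],
    ["station_b,11.0", "station_c,13.0", "station_a,9.0"],
]


def map_phase(line):
    """Map: turn one input line into a (key, value) pair."""
    station, temperature = line.split(",")
    return station, float(temperature)


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: combine the values of one key into a single result (the maximum)."""
    return reduce(max, values)


mapped = [map_phase(line) for block in blocks for line in block]
result = {key: reduce_phase(values) for key, values in shuffle(mapped).items()}
print(result)
```

The maximum is a convenient reduce operation because it is associative and commutative, so partial results from different nodes can be combined in any order.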