D5.2 - Heterogeneous Data Fusion¶
- Table of contents
- D5.2 - Heterogeneous Data Fusion
- Overall Development Approach to Support the Use Cases
- Deployed VRE
The domain of agro-climatic and agro-economic modelling, the personas identified for this community, and the use cases selected for work package 5 have been introduced and described in deliverable D5.1. In particular, use cases were selected that focus on opportunities to move researchers from their current local, single-computer, and mostly peer-network-based workspace to a cluster-compute and cloud-based (VRE-like) collaborative work environment. Specific work processes have therefore been described that have not yet made that transition but that, in our view, are suitable for, and would benefit from, such a transformation, provided that VREs can fulfil the requirements. Deliverable D5.3 details these requirements from the user community perspective, as well as the planned methods for assessing how well they are met and the value they provide to users. Finally, this document (D5.2) describes the current status and the future steps to be taken to implement VRE functionality in accordance with the use cases' requirements.
Overall Development Approach to Support the Use Cases¶
Initially three separate use cases were described; however, one of them details the more generic activities of the community's users when working with their typical spatio-temporal datasets. Handling geographic data through SDIs (spatial data infrastructures) and GIS (geographic information systems) is an inherent part of the work. To avoid complexities around data availability and licences, and to keep the use cases manageable within the project, it was decided to integrate these generic activities into the two other use cases and to limit them mostly to the use of data from the AgroDataCube (http://agrodatacube.wur.nl). This is a pre-compiled database of selected open data from The Netherlands relevant to agronomic modelling. It contains data about farm fields, crops, soil types, soil physical parameters, weather, and altitude, for multiple years, and will be extended with more detailed satellite imagery. For the use cases it will be necessary to make the data from the AgroDataCube findable and accessible from within the VRE; users should be able to inspect it, e.g. by viewing the data as geographic maps and as time series graphs, and to retrieve data from it.
The central parts of the two remaining use cases are (i) crop modelling and (ii) crop phenology estimation. Both can benefit from a VRE when it gives access to a compute and storage infrastructure that allows remote and parallel processing of the core algorithms of each use case. In one this is a well-known crop model (WOFOST); in the other, more exploratively defined curve-fitting algorithms. These provide computational 'engines' that can then be run independently of each other for different spatial locations, e.g. all agricultural parcels in The Netherlands (approximately 750,000 for each year), after which the results have to be aggregated. Researchers will potentially want to run a large number of calculations, resulting in large amounts of data that need to be handled. Besides the specific algorithmic work for each use case, the VRE therefore needs to provide some form of worker-scheduler (or map-reduce) functionality for the processing.
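As a minimal sketch of this worker-scheduler (map-reduce) idea, assuming a dummy per-parcel 'engine' and a simple mean as the aggregation step (function names and values are illustrative, not project code):

```python
from concurrent.futures import ThreadPoolExecutor

def run_engine(parcel_id):
    """Dummy stand-in for one independent engine run (e.g. a WOFOST
    simulation or a curve fit) for a single parcel."""
    return 1000.0 + (parcel_id % 7)  # placeholder 'yield' value

def map_reduce(parcel_ids, workers=4):
    """Map: run the engine independently per parcel. Reduce: aggregate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_engine, parcel_ids))
    return sum(results) / len(results)  # aggregation step

print(map_reduce(range(70)))  # mean over 70 dummy parcels -> 1003.0
```

A real deployment replaces the thread pool with the VRE's scheduler and the dummy engine with the actual model code.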
Depending on the worker-scheduler implementation, a customised wrapper will have to be written so that the existing WOFOST crop model can be fitted into the VRE. This also applies to the crop phenology curve-fitting algorithm(s), although it is expected that these will be Python-based scripts developed exploratively in a (scientific) notebook environment such as Jupyter or Apache Zeppelin. This is (intentionally) similar to what researchers currently use, but in the VRE with easy access to data and other functionalities.
After running the computations, both use cases will require output data to be aggregated, analysed, and visualised, including as geographic maps. This will be further detailed once the VRE is able to produce the data.
An AGINFRA+ VRE has been deployed to support both WP5 use cases, containing the standard collaboration tools such as file sharing, message posting, and user management. The VRE is available via the AGINFRA+ Gateway (https://aginfra.d4science.org/) with the following details:
Title: Agro Climatic Modelling
Description: This Virtual Research Environment is for supporting AGINFRA+ WP5 use cases development
DataMiner and the SAI (Statistical Algorithm Importer) are available in the VRE to work on the use cases. Additionally, RStudio has been added, as well as GeoExplorer and GISViewer as initial means to process some data (with R) and to visualise spatial data. Lastly, the issue tracker has also been made accessible from this VRE, to ease collaborative development work. All further work in WP5 to implement the use cases and the assessment will take place within this VRE.
The selected use cases do not currently require extensive semantic features. Such features could play a role in describing the data within the AgroDataCube, publishing it within the VRE, and perhaps also making it available to other users. The same might be done for the calculation results produced by the use cases, e.g. through the use of the AGROVOC thesaurus and by linking them to data generated or used in the other work packages. However, such work is currently not foreseen within WP5.
As far as analytics features go, the key will be to have a functional worker-scheduler solution on the VRE that can support both use cases, so that large-scale computations can be parallelised. Integration of the WOFOST crop model and a Python notebook environment are also crucial. The curve-fitting algorithm and the other data aggregation and processing algorithms required are currently considered 'standard', i.e. already available as Python or R libraries. However, when calculated data becomes available in the VRE, more specific requests may arise from the user community.
In both use cases, being able to work with spatio-temporal data is essential. For researchers, working with such data includes being able to view it, e.g. as geographic maps and time series charts. Initially this can be done with data from the AgroDataCube. Later, when the VRE is able to run the computations for the use cases, we will work on the presentation of the calculation results.
Components / Features for M12 (Completed Work)¶
The following list describes the components that were available in the VRE as the M12 deliverable:
- Core VRE: Basic VRE with standard collaboration features such as file sharing and message posting.
- DataMiner: e-Infrastructure service providing data mining algorithms and ecological modelling approaches under the Web Processing Service (WPS) standard (see https://gcube.wiki.gcube-system.org/gcube/Data_Mining_Facilities).
- SAI: The Statistical Algorithms Importer (SAI) is a tool to import algorithms into the D4Science e-Infrastructure (see https://wiki.gcube-system.org/gcube/Statistical_Algorithms_Importer).
- RStudio: A free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics.
- GeoExplorer: A web application that allows users to interactively navigate, organise, analyse, search, and discover internal or external GIS layers. The GeoExplorer portlet operates on a GeoNetwork instance to discover layers residing on a distributed e-Infrastructure (see https://wiki.gcube-system.org/gcube/GeoExplorer).
- GISViewer: A web application that allows users to interactively explore, manipulate, and analyse geographic data (see https://gcube.wiki.gcube-system.org/gcube/GIS_Viewer).
Components / Features for M18 (Completed Work)¶
In addition to the components available as the M12 deliverable, the following have been worked on and implemented for the M18 deliverable:
- WOFOST Crop Model Wrapper: The original WOFOST crop model is implemented as a FORTRAN executable. Work is under way (in another project) to produce a more modern Java-based version. For this newer version a wrapper has been developed that makes it suitable for use with the SAI, so that the model can be deployed as a DataMiner algorithm. Currently two WOFOST algorithms have been added to the VRE. "Wofost 20180424 1" runs the crop model given an input file that contains all parameters for a single simulation. "Wofost 20180430 1" is an experiment showcasing the use of the various input field types available in the SAI, including selection of spatial locations. A small set of combined input parameter files has been added to the VRE workspace for demonstration purposes. The WOFOST algorithm can be run using DataMiner and will produce a log file of the run and the simulation outputs (the daily simulation states such as leaf area index, crop development stage, and biomass). These files are available in DataMiner and the workspace for further processing.
- WOFOST Data Extractor: An algorithm to extract the data required for the WOFOST crop model from the AgroDataCube and other sources. This has been partly implemented in DataMiner and is available as "Agrodatacube Reader 20180430 1". Given inputs such as a geographic area, year, and crop name, it extracts the relevant data using the AgroDataCube REST API. Some additional work is still needed to calculate specific WOFOST parameters related to soil moisture characteristics; formulas have been derived to calculate these from the soil parameters available in the AgroDataCube. Another point of attention is the time it takes to retrieve the required data from the AgroDataCube: the current solution is too slow for large-scale processing, e.g. the number of simultaneous read operations is very limited. Lastly, the collected data has to be written in a format suitable for the WOFOST algorithm.
- DataMiner WOFOST Worker: A finalised version of a DataMiner algorithm that includes the WOFOST crop model and can run crop simulations on the VRE. This was initially listed as a separate component but has been included in the already mentioned WOFOST Crop Model Wrapper. Internally it is set up to run multiple simulations in parallel on a single node, using Akka's (http://akka.io) Actor Model implementation.
- Python Notebooks: Jupyter Lab has been made available in the VRE and supports a Python 3 kernel. The instance runs at EGI, but can access DataMiner algorithms through their WPS (Web Processing Service) API over HTTP, and files in the VRE workspace using gCube's Home Library REST API (https://wiki.gcube-system.org/gcube/Home_Library_REST_API). The user's VRE token has to be added to each web request for access permissions. A few example notebooks have been written to demonstrate how to use this functionality.
- Visualization - Create and View Graphs: WP4 has made available a draft version of the portal for creating and viewing graphs, which has been added to the VRE. It can be used to create a variety of charts from the CSV (comma-separated values) files that are the output of WOFOST crop simulations. For example, it is possible to create a chart of the simulated LAI (leaf area index), minimum and maximum day temperatures, crop production (kg/ha), and so on.
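To illustrate the kind of request the WOFOST Data Extractor performs against the AgroDataCube REST API, the sketch below builds and (optionally) executes a query. The resource path, parameter names, and token header are assumptions for illustration, not the documented API:

```python
import json
import urllib.parse
import urllib.request

BASE = "https://agrodatacube.wur.nl/api/v2/rest"  # assumed API base path

def fields_url(year, crop_code, page_size=25):
    """Build a (hypothetical) request URL for the crop fields of one
    year and crop; parameter names are illustrative only."""
    query = urllib.parse.urlencode(
        {"year": year, "cropcode": crop_code, "page_size": page_size})
    return f"{BASE}/fields?{query}"

def fetch_fields(url, token):
    """Perform the request, passing an API token header (assumed name)."""
    req = urllib.request.Request(url, headers={"token": token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # a GeoJSON FeatureCollection is expected

# No network call is made here; we only show the composed URL.
print(fields_url(2018, 233))  # 233: hypothetical crop code
```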
Components / Features for M24 (Completed Work)¶
Based on the mid-term review results, some adjustments were made to the work originally planned for this period, shifting the focus to providing added-value features for the target communities rather than more general-purpose data science functionalities. In close collaboration with the technical work packages, efforts have been redirected towards:
- Support for spatial and temporal metadata in the catalogue, and for agrisemantics (e.g. GACS).
- Support for processing and visualisation of spatial data, e.g. satellite imagery in GeoTIFF format, and vector data in GeoJSON format.
- Implementation of a Dashboard for users to explore metrics and KPIs of (Dutch) farm fields, both from observations and crop simulations.
Concept of dashboard visualisation for agronomic data
To address the fact that the AgroDataCube currently only contains data about farm fields in The Netherlands, which could be a limiting factor for real-world applications, integrating access to available Earth observation (SentinelHub) and soil (SoilGrids) data is being examined, again in collaboration with the technical work packages. However, this is a complex issue and beyond the original scope of the project.
Basic integration of access to SoilGrids data
Furthermore, there has been ongoing, yet mostly behind-the-scenes, work on the AgroDataCube, the WOFOST crop model, and parallel model execution, needed to support our use cases. One issue has been deriving all the inputs required for running crop simulations from what is available in the AgroDataCube, with soil-related water retention curves posing significant challenges.
Lastly, work is under way to develop a crop phenology algorithm. So far this has been done in Jupyter notebooks. Special care was taken to make it possible to show simple maps wherever a location or a particular area is involved. A grid of 25 x 25 km pixels was used to divide the territory of the Netherlands into manageable pieces. A Jupyter notebook was then developed to retrieve NDVI values for individual parcels and to derive from those values conclusions concerning emergence, flowering, and maturity.
The eventual aim is to retrieve data for each parcel and to establish its phenology. The approach followed is to do this within each 25 x 25 km pixel, and to supplement and/or correct data with those from other parcels nearby, i.e. at least with data from parcels found within the same pixel.
Data relevant to crop parcels were retrieved from the AgroDataCube. For each parcel, NDVI values were retrieved, and attempts were made to supplement these in order to arrive at a time series with enough points within the growing season (1 April to 1 November), so that a parametric function could be fitted. With more points, more parameters can be fitted and a better fit can be obtained, but there are not always enough points.
Work on NDVI time series curve fitting in AgInfra+ Jupyter Notebook environment
With regard to end-user workflows: the M24 VRE still consists of several loosely integrated components. The newly deployed StorageHub API should make manual workflows at least somewhat easier (less copying of files required). Whether workflow systems such as KNIME or Galaxy are usable for our community is still under consideration.
Components / Features for M33 (Completed Work)¶
Crop Modelling Use Case¶
In the final stage of the project the focus has been on completing the work on the capabilities for running crop simulations at scale using DataMiner, on testing, debugging, and streamlining the integration with core D4Science components such as the Workspace, and on assisting WP4 with further developing the Dashboard. Besides that, the use case has been demonstrated during two evaluation events, which included hands-on sessions. These resulted in further feedback that will help to refine the solution and move it beyond the current proof-of-concept implementation.
Continuing from the previous work, the implementation now, in short, consists of an actor- and message-passing-based solution using the Akka framework (http://akka.io), written in the Scala (functional) programming language. Both Akka and the functional programming style are very helpful when writing parallel and distributed software systems.
To make use of the D4Science DataMiner and the compute cluster it manages, the system consists of two main parts: one that can run many crop simulations in parallel on a single fat node of the (master) cluster, and one that distributes the total requested workload of crop parcels by starting batch jobs, monitoring the progress of every job, and summarising the resulting simulation outputs and processing logs.
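The Scala/Akka code itself is not reproduced in this deliverable; as a language-neutral sketch, the second part's scatter/gather pattern (split the workload into batch jobs, monitor completion, summarise the outputs) can be expressed in Python as follows, with the job function and batch size purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch_job(batch):
    """Stand-in for submitting one batch of parcels as a DataMiner job
    and waiting for its outputs (here each parcel just yields a status)."""
    return [(p, "ok") for p in batch]  # placeholder per-parcel result

def scatter_gather(parcels, batch_size=1000, max_jobs=6):
    """Split the requested workload into batches, run them as concurrent
    jobs, monitor each job's completion, and gather all outputs."""
    batches = [parcels[i:i + batch_size]
               for i in range(0, len(parcels), batch_size)]
    outputs = []
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        jobs = [pool.submit(run_batch_job, b) for b in batches]
        for job in as_completed(jobs):   # progress monitoring hook
            outputs.extend(job.result())
    return outputs

print(len(scatter_gather(list(range(14000)))))  # 14000 parcel results
```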
Overall actor design for running the crop simulations
Overall actor design for processing crop simulation batches
In the end, an overall HTML report is generated, giving an agronomist a quick overview of key output values and allowing detailed output files to be opened for further inspection when needed. DataMiner automatically stores these in the user's Workspace, including provenance details that allow the algorithm execution to be easily repeated.
HTML output report example
DataMiner integration with the Workspace
For both parts, several types of DataMiner operators have been created and installed using the SAI (the DataMiner importer). This integration approach automatically makes the functionality available via a standard OGC Web Processing Service (WPS), so it can be called from other components in the VRE, e.g. the Dashboard or a Jupyter notebook, as well as from external clients that support the WPS standard (e.g. QGIS).
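Because the operators are exposed via WPS, any HTTP client can start them. The sketch below composes a WPS Execute request in the key-value-pair encoding, with the gcube-token appended as the VRE requires; the endpoint URL, algorithm identifier, and input names are placeholders, not the deployed operators:

```python
import urllib.parse

# Placeholder endpoint; the real one is published in the VRE.
WPS_ENDPOINT = "https://dataminer.d4science.org/wps/WebProcessingService"

def execute_url(token, algorithm_id, inputs):
    """Compose a WPS 1.0.0 Execute request (KVP encoding) for a
    DataMiner operator, appending the caller's gcube-token."""
    data_inputs = ";".join(f"{k}={v}" for k, v in inputs.items())
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "gcube-token": token,
        "Identifier": algorithm_id,
        "DataInputs": data_inputs,
    }
    return WPS_ENDPOINT + "?" + urllib.parse.urlencode(params)

url = execute_url("MY-TOKEN", "org.example.WofostSimulation",  # placeholder
                  {"input_file": "params.csv"})
print(url)  # urllib.request.urlopen(url) would start the execution
```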
Various crop simulation model operators in DataMiner
The flexibility this gives is of course a trade-off against the raw compute power that more direct data access and method calls would offer instead of an Internet protocol. Still, current performance is reasonable: it takes about 5 minutes to complete 14,000 crop simulations (including saving the results to the Workspace and creating the summaries), using 6 nodes in the DataMiner cluster. Hardly any serious performance tuning has been done so far, and the system can be scaled horizontally (i.e. the cluster can be enlarged to reduce the overall calculation time), leaving plenty of room to improve the overall processing time.
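A back-of-envelope extrapolation from these figures, and from the roughly 750,000 Dutch parcels per year mentioned earlier, assuming purely linear scaling:

```python
# Measured above: ~14,000 simulations in ~5 minutes on 6 nodes.
sims, minutes, nodes = 14_000, 5, 6

per_second = sims / (minutes * 60)        # ~46.7 simulations/s on 6 nodes
per_node = per_second / nodes             # ~7.8 simulations/s per node

parcels = 750_000                         # approx. Dutch parcels per year
hours_all = parcels / per_second / 3600   # naive linear extrapolation
print(round(per_second, 1), round(hours_all, 1))  # 46.7 4.5
```

Doubling the cluster would roughly halve that estimate, which is the horizontal scaling referred to above.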
Automatically making use of all nodes available in the DataMiner cluster
Most of the data needed for running the crop simulations is currently retrieved from the AgroDataCube, a harmonised dataset for The Netherlands of registered crop fields, weather, and soil data, plus detailed data about terrain height and the vegetation index (derived from satellite imagery). Based on this, and on the capability to run the crop simulation model through DataMiner via the WPS interface, the University of Athens (WP4) constructed a Dashboard that allows the user to explore all these resources.
Dashboard to explore crop simulation input and output data
Crop Phenology Use Case¶
- Crop phenology is the study of the timing of recurring biological events.
- Remote sensing data have long been used for this.
- NDVI is the best-known remote sensing indicator. It is a so-called vegetation index; its values range from 0 to 1, with 0 meaning no crop yet and 1 meaning complete coverage of the soil by the crop.
- If enough NDVI observations are available, we have a time series from which phenology indicators can be derived.
- Phenology indicators are needed for crop simulation: they help one to choose the right parameters.
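For reference, NDVI is computed from the red and near-infrared reflectance of a surface; the following sketch uses illustrative reflectance values:

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    Dense green vegetation absorbs red light, pushing the value towards 1."""
    return (nir - red) / (nir + red)

# Bare soil reflects red and NIR similarly -> NDVI near 0;
# a closed green canopy absorbs red strongly -> NDVI near 1.
print(round(ndvi(0.45, 0.40), 2))  # sparse cover -> 0.06
print(round(ndvi(0.50, 0.05), 2))  # dense canopy -> 0.82
```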
Example of NDVI time series for several crop parcels at a trial farm:
We used explorative modelling to develop a method. Modelling was therefore first done in a Jupyter notebook, with Python 3 as the scripting language. This environment allowed us to document the code and to visualise both input data and results. For our purpose, we retrieved data about a number of sugar beet plots in the Netherlands.
The NDVI time series retrieved for the sugar beet parcels were plotted. In most cases the retrieved NDVI values are averages over the parcel, so standard deviations are also known and are plotted as vertical lines.
Applied curve fitting technique¶
- The retrieved NDVI time series have a typical double S-curve (or sigmoid) shape.
- The standard deviations were also used as input for the fitting algorithm.
- We tried a few different functions; eventually we chose to continue with this one:
The graph shows the “meaning” of each of the parameters a - g.
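The exact seven-parameter function (a to g) is shown only in the figure. As a sketch of the same technique, the following fits a comparable six-parameter double-logistic curve to synthetic NDVI observations, using per-point standard deviations as fitting weights; all values are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def double_logistic(t, base, amp, k1, t1, k2, t2):
    """Double S-curve: NDVI rises at green-up (midpoint t1) and falls
    again towards maturity (midpoint t2)."""
    rise = 1.0 / (1.0 + np.exp(-k1 * (t - t1)))
    fall = 1.0 / (1.0 + np.exp(-k2 * (t - t2)))
    return base + amp * (rise - fall)

# Synthetic 'observations' over the growing season (day of year).
days = np.arange(90, 305, 10)
truth = double_logistic(days, 0.15, 0.65, 0.10, 140.0, 0.08, 260.0)
rng = np.random.default_rng(0)
obs = truth + rng.normal(0.0, 0.02, days.size)
sigma = np.full(days.size, 0.02)        # std devs weight the fit

popt, _ = curve_fit(double_logistic, days, obs, sigma=sigma,
                    p0=[0.1, 0.6, 0.1, 150.0, 0.1, 250.0])
print(popt[3], popt[5])  # fitted midpoints, close to 140 and 260
```

The fitted midpoints correspond roughly to the green-up and maturity transitions; the seven-parameter variant adds one more degree of freedom to the same shape.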
In order to interpret the curves, certain points are estimated from the fitted parameters:
- Emergence
- Maximum growth / canopy closure
- Maximum NDVI
- Maturity / harvest.
The four points found for a field can easily be compared with those found for other fields.
The graph below shows the result of the fitting exercise, i.e. the fitted curves plotted in the same graph.
The following graph shows 3 such points; the temperature sums calculated for the time elapsed between the 1st and 2nd points and between the 2nd and 3rd points are known as "growing degree days" (GDD):
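The temperature sums mentioned here follow the standard growing-degree-day calculation: daily mean temperature above a crop-specific base temperature, accumulated over the period (the base value and temperatures below are illustrative):

```python
def growing_degree_days(tmin, tmax, t_base=10.0):
    """Accumulate growing degree days over paired daily minimum and
    maximum temperatures; days with a mean below t_base contribute 0."""
    total = 0.0
    for lo, hi in zip(tmin, tmax):
        total += max(0.0, (lo + hi) / 2.0 - t_base)
    return total

# Three dummy days with mean temperatures 12, 15 and 9 degrees C:
print(growing_degree_days([8, 10, 4], [16, 20, 14]))  # 2 + 5 + 0 = 7.0
```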
Two sources are needed:
- Parcel data, incl. geometry as well as type of crop
- NDVI data for a given parcel / geometry.
We used two ways to obtain our data:
- AgroDataCube for the Netherlands
- Shapefile with parcel data from the Flemish government in combination with NDVI data from SentinelHub.
Code from the notebooks was used to develop a Python script suitable for running on the VRE; this script was installed by means of the DataMiner importer.
On the VRE we normally have to indicate how the needed calculations should be divided up into manageable pieces, so that each of those pieces can be handled by a separate worker computer.
With this aim we divided the territory of the Netherlands into 16 pieces. Then we developed a script that can retrieve parcel data, including geometry, from a data source for each bounding box; this script was imported into DataMiner and named JobScheduler.
The output of this JobScheduler is a set of GeoJSON files. These files form the input for another script, which starts the curve-fitting script on several worker computers at the same time; this script was also imported into DataMiner and named JobStarter.
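The division into 16 pieces amounts to splitting a national bounding box into a 4 x 4 grid, each cell becoming one JobScheduler bounding box. The sketch below uses an approximate extent of the Netherlands in the Dutch RD projection (EPSG:28992); the coordinate values are an assumption for illustration:

```python
def split_bbox(xmin, ymin, xmax, ymax, nx=4, ny=4):
    """Divide a bounding box into an nx-by-ny grid of smaller boxes,
    each of which can be handled by a separate worker."""
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny
    return [(xmin + i * dx, ymin + j * dy,
             xmin + (i + 1) * dx, ymin + (j + 1) * dy)
            for j in range(ny) for i in range(nx)]

# Approximate national extent in metres (illustrative values).
boxes = split_bbox(10_000, 300_000, 290_000, 630_000)
print(len(boxes))  # 16 pieces, as used for the Netherlands
```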
The script was run for all sugar beet parcels in the Netherlands. The dates found for emergence, maximum growth, maximum NDVI, and harvest were plotted. The picture below shows the date of emergence.
A list of follow-up work under consideration, mostly for after the project:
- Crop simulation model improvements
- Upgrade to version WOFOST-WISS 7.2
- Add water limited model runs, and other variants
- Calibrate for Dutch crops
- Usability improvements
- Adjust final HTML reporting based on feedback
- Adjust Dashboard based on feedback
- New features
- Externalise crop parameters
- Support parameter sweep runs
- Allow other input data sources
- Visualise outputs on a map
- Galaxy workflows