D7.2 - Big Data Analysis and Visualisation

1. Introduction

The use case chosen for the Food Security Community is high-throughput phenotyping. High-throughput phenotyping produces large amounts of data that need to be analysed right away. For example, the Montpellier greenhouse platform takes 13 images per plant per day for 1,600 plants, which amounts to more than 20,000 images per day. Field platforms also produce and rely on large numbers of images, acquired by UAV or satellite.
High-throughput phenotyping platforms produce complex data (sensor data, human readings) at different scales (population, individual, molecular, etc.).
The phenomics community needs tools to access large datasets easily and to visualize and analyse them. Moreover, sharing data, analytics processes and results is essential.
A Virtual Research Environment (VRE) should therefore be tailored to these requirements.

2. Overall Development Approach to Support the Use Cases

The objective of the project is to provide a collaborative work environment (a VRE) for phenomics researchers and to assess how effectively the VRE meets the community's needs. VRE users should be able to:
- Access several ontologies (crop ontology, plant ontology, etc.)
- Access phenomics platform data exposed through web services
- Visualize heterogeneous data
- Import and run data analytics scripts in different languages (R, Python, etc.)
- Import or update and run data analytics workflows (Galaxy)
- Share data, algorithms or workflows
- Publish their results
- Collaborate with the other members

Each component will be tested by users during three evaluation phases. These evaluations will help us improve the VRE functionalities and better serve users' needs and requirements.

3. Deployed VRE

3.1 Overall Description

The Food Security VRE has been deployed to support the WP7 use case. This VRE is intended primarily for the high-throughput phenotyping community, but it is also aimed at any plant scientist who needs phenotyping data. It is available via the AGINFRA+ Gateway (https://aginfra.d4science.org/) with the following details:

Title: FoodSecurity
URL: https://aginfra.d4science.org/explore
Description: This Virtual Research Environment is conceived to be the working environment for plant breeders. This VRE should help them to select plant species and varieties better adapted to global change in order to maximize crop yields and provide food security.

The VRE provides collaboration and communication features. The DataMiner and the SAI (Statistical Algorithms Importer) are also available in the VRE to work on the use case. These tools make it possible to import scripts in different languages (R, Python, Java), which other users of the VRE can then easily run.

3.2 Semantics Features

Phenomics data are complex and are produced on different platforms using different devices, protocols, etc. Interoperability is difficult, particularly at the semantic level. The VRE provides access to several ontologies used in plant phenotyping, such as the crop ontology, the plant ontology, the genome ontology, etc. All these ontologies are available on Agroportal (http://agroportal.lirmm.fr/). To look at these ontologies inside the VRE, the user can use:
- the interactive visualization tool WebVOWL, which has been integrated into the VRE. It makes it easy to explore an ontology.
- the ontology management platform Vocbench, where all the relevant ontologies have been imported.
These tools should help phenotyping researchers find the right concept to describe the variables they want to measure in their own experiments.

A matching tool (yam++) has also been deployed to determine correspondences between concepts of several ontologies. It will also assess the relevance of the matches and convey this information to the user by weighting its results.
Research topics change over time, so matching needs have to be managed dynamically.
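
To make the idea of weighted correspondences concrete, here is a minimal sketch in Python. It is not how yam++ works internally (that is not described here). It only compares concept labels with a generic string-similarity measure from the standard library and attaches the score as a weight. The function name, the threshold value and the sample labels are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def match_labels(source_labels, target_labels, threshold=0.8):
    """Pair each source label with its most similar target label.

    Returns a list of (source, target, weight) tuples, where weight is a
    similarity score between 0 and 1. Pairs scoring below the threshold
    (an arbitrary illustrative value) are dropped.
    """
    matches = []
    for source in source_labels:
        best_target, best_score = None, 0.0
        for target in target_labels:
            # Case-insensitive character-level similarity of the two labels.
            score = SequenceMatcher(None, source.lower(), target.lower()).ratio()
            if score > best_score:
                best_target, best_score = target, score
        if best_target is not None and best_score >= threshold:
            matches.append((source, best_target, round(best_score, 2)))
    return matches
```

A call such as `match_labels(["Plant height"], ["plant height", "leaf area"])` pairs the two spellings of "plant height" and reports the weight alongside the pair, which is the kind of information a user needs to decide whether to accept a correspondence.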

3.3 Analytics Features

Data Exploration and Data Access

The VRE must allow the user to find and access phenotyping platform data easily.

The Breeding API (BrAPI) is a standard API specification that lets plant phenotype/genotype databases serve their data to crop breeding applications (see https://brapi.org/). Several phenotyping data providers have implemented BrAPI-compliant web services in their systems. The open-source project OpenSILEX-PHIS, developed by INRA, is an information system for plant phenotyping (http://www.opensilex.org/). It is currently used by 5 platforms in France and is being deployed in Wageningen, Japan, Australia, etc. As part of the AGINFRA+ project, it has been enriched with BrAPI-compliant web services that give access to the observation measurements of studies.

DataMiner algorithms have been implemented to call BrAPI-compliant services and give access to phenotyping data from different sources. The user can retrieve data from any server that exposes BrAPI-compliant web services. Before retrieving data, however, the user needs to know which study they are interested in. That is why an R Shiny application, "Studies Exploration", has been developed and added to the VRE. This application lets the user browse the publicly available studies on any server.

First, the user can view brief information on every study stored on the selected server.
The user can also see the measured variables.
Then the user can preview the observation data. Clicking the "Launch DataMiner algorithm" button directly runs the DataMiner algorithm that retrieves the data, with input parameters corresponding to the user's selection.
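
The retrieval step can be illustrated with a short Python sketch of a client that pages through a BrAPI v1 list endpoint. It is not the code of the DataMiner algorithms deployed in the VRE. It relies only on the BrAPI v1 response layout, in which records are returned under `result.data` and the number of pages under `metadata.pagination.totalPages`. The function names and the injectable `fetch` parameter are assumptions made so that the example can be exercised without a live server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Default fetcher: send an HTTP GET and decode the JSON response body."""
    with urlopen(url) as response:
        return json.load(response)


def brapi_get_all(base_url, path, params=None, page_size=100, fetch=fetch_json):
    """Collect the records from every page of a BrAPI v1 list endpoint.

    base_url is the root of the BrAPI service (ending in /brapi/v1) and
    path is the call to make, e.g. "studies-search" or
    "studies/<studyDbId>/observations".
    """
    records = []
    page = 0
    while True:
        # BrAPI v1 uses zero-based "page" and "pageSize" query parameters.
        query = dict(params or {}, page=page, pageSize=page_size)
        url = f"{base_url.rstrip('/')}/{path.lstrip('/')}?{urlencode(query)}"
        body = fetch(url)
        records.extend(body["result"]["data"])
        total_pages = body["metadata"]["pagination"]["totalPages"]
        page += 1
        if page >= total_pages:
            break
    return records
```

Against a real server, one would first call `brapi_get_all(server, "studies-search")` to list the studies, pick a `studyDbId` from the result, and then request `studies/<studyDbId>/observations` in the same way. Here `server` stands for the BrAPI root URL of the chosen data provider.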

Data analytics

The user must also be able to integrate data analysis scripts into the VRE. In this use case we focused on the R and Python languages, which are the most widely used.
The phenomics platform data will be analysed with R packages available on CRAN and with the PhisStatR package, which should be provided by INRA in the next couple of months. This package is a set of functions and R Markdown scripts dedicated to experiment analysis on the Montpellier phenomics platform. It includes outlier detection, a Bayesian method for genotype comparison and various statistical models. Frequent itemset mining algorithms or machine learning algorithms, for example for leaf detection, could also be implemented.
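
As an illustration of what an outlier detection step can look like, the following Python sketch applies the generic interquartile-range (Tukey) rule to a series of measurements. This is an assumption made for the example only: the methods actually implemented in PhisStatR are not described here and may well differ. The function name and the default factor of 1.5 are likewise illustrative.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the indices of values lying outside the Tukey fences.

    A value is flagged when it is more than k interquartile ranges
    below the first quartile or above the third quartile.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, value in enumerate(values) if value < low or value > high]
```

Applied to plant heights measured on successive days, a call such as `tukey_outliers([50, 52, 51, 53, 49, 5, 54])` flags the sixth measurement, which is far below the others and would deserve a look at the corresponding image.
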
With the large number of plants or plots, a workflow management system is essential to manage the data flow (velocity). Workflows are clearly needed by the high-throughput phenotyping community. It is now possible to import KNIME workflows as DataMiner algorithms. The workflow management system Galaxy has also been integrated into the VRE, so that the user can build a workflow and run it on data available in the VRE. It is also possible to add nodes that call a DataMiner algorithm.

3.4 Visualization of heterogeneous data

Visualizing phenotyping data requires new visualization tools. In greenhouses, the observed variables are measured on plant images. It is important for the researcher to visualize the measurements over time together with the corresponding images in order to detect and understand outliers. A visualization tool has been implemented. It lets the user create a chart of, for example, plant height measurements linked to the images, so that when the user clicks on a specific point, the corresponding image appears.

For now, the tool takes its data from a zip file containing the images and the CSV data file. The objective is to retrieve images and data directly from web services. The tool will also integrate event data.
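
The zip-based input can be sketched in Python as follows. This is not the code of the deployed tool. The CSV file name (`data.csv`) and the column names (`date`, `height`, `image`) are assumptions, since the actual layout of the archive is not specified here. The sketch reads the CSV from the archive and links each measurement to its image when that image is present.

```python
import csv
import io
import zipfile


def load_points(zip_source, csv_name="data.csv"):
    """Read measurements from a zip archive and link them to their images.

    zip_source is a path or a binary file-like object. Each returned point
    has the keys "date", "height" and "image"; "image" is None when the
    file named in the CSV is missing from the archive.
    """
    with zipfile.ZipFile(zip_source) as archive:
        members = set(archive.namelist())
        with archive.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = list(csv.DictReader(text))
    points = []
    for row in rows:
        image = row["image"]
        points.append({
            "date": row["date"],
            "height": float(row["height"]),
            "image": image if image in members else None,
        })
    return points
```

Each point then carries everything a chart needs to show the matching image when the user clicks on it. A point whose `image` is `None` can still be plotted, but without a picture.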

A prototype has been developed to visualize observation data, images and events in the same graph. This is very useful for checking the accuracy of the data. For example, if a plant fell off the station while the picture was being taken, the computed plant height will be inaccurate. The user can easily verify that the data must be rejected: the image shows that the plant fell off, and an event declaring the fall is displayed as well. This prototype has not been added to the VRE.
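
The cross-check between observations and events can be sketched in Python as below. It is not the code of the PHIS prototype. The record fields (`plant`, `time`, `label`) and the one-hour window are hypothetical choices for illustration. The sketch attaches to each observation the events declared for the same plant around the same time.

```python
from datetime import datetime, timedelta


def attach_events(observations, events, window=timedelta(hours=1)):
    """Attach to each observation the labels of nearby events.

    An event is considered nearby when it concerns the same plant and its
    timestamp is within the given window of the observation timestamp.
    Timestamps are ISO 8601 strings.
    """
    annotated = []
    for obs in observations:
        obs_time = datetime.fromisoformat(obs["time"])
        nearby = [
            event["label"]
            for event in events
            if event["plant"] == obs["plant"]
            and abs(datetime.fromisoformat(event["time"]) - obs_time) <= window
        ]
        # Copy the observation so the input records are left untouched.
        annotated.append({**obs, "events": nearby})
    return annotated
```

An observation that comes back with a non-empty `events` list, for instance one labelled "plant fell off", is a candidate for rejection and can be highlighted on the chart next to its image.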

4. Implementation Plan

4.1 Components / Features for M12

The following components have been implemented in the Food Security VRE:

| Component | Description |
| Core VRE | Basic VRE with standard collaboration features such as file sharing and message posting. |
| DataMiner | e-Infrastructure service providing data mining algorithms. |
| SAI | Statistical Algorithms Importer (SAI) is a tool to import algorithms in the D4Science e-Infrastructure (see https://wiki.gcube-system.org/gcube/Statistical_Algorithms_Importer). |
| Integration of R script | A curve-fitting script that takes its data from a CSV file has been imported with the SAI into DataMiner. |
| Integration of Python scripts | A script for image analysis has been imported with the SAI into DataMiner. |

4.2 Components / Features for M18

The following components have been implemented in the Food Security VRE:

| Component | Description |
| Access to data with web services | Access to INRA phenomics platform data through web services using R scripts |
| Data Visualization | A tool to create and view charts from a CSV file |
| Data Visualization | Deployment of an R Shiny application (map of European phenotyping infrastructures) |

4.3 Components / Features for M24

The following components have been implemented in the Food Security VRE:

| Component | Description |
| Workflow management | Importing a KNIME workflow in DataMiner |
| Workflow management | Integration of Galaxy inside the VRE |
| Access to ontologies | Access to ontologies used by the phenomics community (AgroVoc, crop ontology, plant ontology, genomic ontology, etc.) |
| Ontology visualization | Interactive visualization of ontologies (WebVOWL) |
| Ontology matcher | Takes as input ontologies in different formats (owl, skos, and various serializations of rdf, such as ttl) and produces alignments (yam++) |
| Data Visualization | Charts of observations over time with the corresponding images |
| Data Visualization | Spatial visualization (plants in the greenhouse or plots in the field) |
| Data Access | Implementation of BrAPI-compliant web services in OpenSILEX-PHIS (GET/POST studies-search, GET studies/observations) |
| Data Access | Importing Python scripts in DataMiner to retrieve data from BrAPI web services |

4.4 Components / Features for M36

The following components will be implemented in the Food Security VRE:

| Component | Description |
| Data Exploration | Integration of an R Shiny application for studies exploration; the application calls BrAPI-compliant web services from several databases |
| Workflow management | Full integration of Galaxy inside the VRE with access to DataMiner algorithms |
| Data analytics | Access to the VRE workspace from RStudio and Jupyter |
| Data Visualization | Prototype developed in PHIS (Plant Phenotyping Information System) to visualize heterogeneous data on the same chart (plant observation data, images and events) extracted by calling BrAPI web services |
