D3.1 - Open Science Data Analytics Technologies
This solution was designed to satisfy the following key principles and requirements:
- Extensibility: the platform is “open” with respect to (i) the analytics techniques it offers and supports and (ii) the computing infrastructures and solutions it relies on to enact the processing tasks. It is based on a plug-in architecture that supports adding new algorithms / methods and new computing platforms;
- Distributed processing: the platform is conceived to execute processing tasks by relying on “local engines” / “workers” that can be deployed in multiple instances and execute tasks in parallel and seamlessly. The platform can deploy instances of the “local engines” / “workers” on computing resources offered by well-known e-Infrastructures (e.g. EGI) as well as on resources made available by the Research Infrastructure. This is key to making it possible to “move” the computation close to the data;
- Multiple interfaces: the platform offers its services via both a (web-based) graphical user interface and a (web-based) programmatic interface (aka API), thus enlarging the possible application contexts. For instance, having a proper API facilitates the development of components capable of executing processing tasks from well-known applications (e.g. R, KNIME);
- Cater for scientific workflows: the platform is both exploitable by existing WFMSs (e.g. a node of a workflow can be the execution of a task / method offered by the platform) and supports the execution of a workflow specification (e.g. by relying on one or more instances of WFMSs);
- Easy to use: the platform is easy to use for both (a) algorithms / methods providers, i.e., scientists and practitioners called to realise processing methods of interest for the specific community, and (b) algorithms / methods users, i.e., scientists and practitioners called to exploit existing methods to analyse certain datasets;
- Open science friendly: the platform transparently injects open science practices into the processing tasks executed through it. This includes mechanisms for capturing and producing “provenance records” out of any computing task, and mechanisms for producing “research objects” so that others can repeat the task and reproduce the experiment (while still enforcing any policy regulating “access” and exploitation of the experiment and its results, i.e. the “owner” associates a well-defined policy and licence governing future exploitation and access).
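The plug-in idea behind the extensibility principle can be pictured with a small registry. The following Python sketch is purely illustrative: `ALGORITHMS`, `register`, `run` and the toy `mean` method are hypothetical names and do not reflect the actual gCube or DataMiner plug-in interfaces.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps an algorithm name to the callable implementing it.
ALGORITHMS: Dict[str, Callable[[List[float]], float]] = {}


def register(name: str):
    """Decorator that adds a function to the registry under `name`."""
    def decorator(func):
        ALGORITHMS[name] = func
        return func
    return decorator


@register("mean")
def mean(values: List[float]) -> float:
    # A toy analytics method standing in for a real algorithm.
    return sum(values) / len(values)


def run(name: str, values: List[float]) -> float:
    """Look up an algorithm by name and execute it."""
    if name not in ALGORITHMS:
        raise KeyError(f"unknown algorithm: {name}")
    return ALGORITHMS[name](values)
```

With this shape, adding a new method only requires registering one more function; the code that dispatches requests stays unchanged.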
The Data Analytics Technology
The Data Analytics technology mainly comprises:
- the Data Miner platform, a portlet and a series of services enabling users to execute analytics tasks, monitor the execution, and access the final results. Analytic tasks are based on integrated analytics methods and algorithms;
- the Statistical Algorithm Importer (SAI), a portlet and a series of services enabling users to integrate their own algorithms into the Data Miner platform, thus making them available to Data Miner users.
These two services are integrated with others to realize a complete solution, namely:
- the workspace, for sharing files as well as recording the outcomes of every analytics task;
- the catalogue, for publishing the results of analytics tasks;
- various services playing the role of data source to be consumed by Data Miner algorithms;
During the project lifetime, the following facilities have been added to the overall solution:
- Galaxy, a scientific workflow management system;
- KNIME, a data analytics and reporting platform;
- JupyterLab, the web-based user interface for Project Jupyter.
DataMiner is an open-source computational system based on the gCube system. This platform is fully integrated with the D4Science e-Infrastructure, which underlies AGINFRA, and was conceived to meet the requirements of new Science paradigms.
DataMiner was born in this context and supports a number of requirements related to these paradigms. The system interoperates with the services of the D4Science e-Infrastructure; it uses the Web Processing Service (WPS) standard to publish the hosted processes and saves the provenance of an executed experiment using the standard Prov-O ontological representation. DataMiner implements a Cloud computing Map-Reduce approach and is able to process Big Data and to save outputs into a collaborative experimentation space (the D4Science Workspace), which allows users to share computational information with colleagues. DataMiner was also conceived to execute processes provided by communities of practice in several domains, while reducing the integration effort. The DataMiner deployment is fully automatic and is spread across different machine providers (including the European Grid Infrastructure Federated Cloud system).
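Because the hosted processes are published through WPS, they can be discovered with plain key-value-pair requests. The sketch below builds such URLs with the Python standard library only. The endpoint and the process identifier are placeholders (assumptions), and the authorization that a D4Science deployment requires is deliberately left out.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the WPS URL of a real instance.
ENDPOINT = "https://dataminer.example.org/wps/WebProcessingService"


def wps_url(request: str, **params: str) -> str:
    """Build a WPS key-value-pair URL for the given operation."""
    query = {"service": "WPS", "request": request}
    query.update(params)
    return ENDPOINT + "?" + urlencode(query)


# List the processes published by the service.
capabilities_url = wps_url("GetCapabilities")

# Ask for the inputs and outputs of one (hypothetical) process.
describe_url = wps_url(
    "DescribeProcess",
    version="1.0.0",
    identifier="org.example.MY_ALGORITHM",
)
```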
The DataMiner (DM) architecture is made up of two sets of machines (clusters): the Master and the Worker cluster. In a typical deployment scenario, the Master cluster is made up of a number of powerful machines (e.g. Ubuntu 14.04.5 LTS x86_64 with 16 virtual CPUs, 32 GB of random access memory, 100 GB of disk) managed by a load balancer that distributes the requests uniformly to the machines. Each machine is endowed with a DM service that communicates with the D4Science Information System (IS), i.e. the central registry of the e-Infrastructure resources, to notify its presence and capabilities. The balancer is indexed on the IS and is the main access point to interact with the DMs. The machines of the Worker cluster have less local computational power (e.g. Ubuntu 14.04.5 LTS x86_64 with 2 virtual CPUs, 2 GB of random access memory, 10 GB of disk) and serve Cloud computations. DM is based on the 52North WPS service implementation, but extends it to meet the D4Science e-Infrastructure requirements. It is developed in Java and the Web service runs on an Apache Tomcat instance endowed with gCube system libraries. Further, it offers a development framework to integrate new algorithms and to interact with D4Science. When a WPS request reaches the Master cluster balancer, it is forwarded to one of the cluster services (Master DM). The DMs host processes provided by several developers. In particular, two kinds of algorithms are hosted: “local” and “Cloud” algorithms. Local algorithms are executed directly on the Master DMs and may use parallel processing on several cores and a large amount of memory. Cloud algorithms, instead, use distributed computing with a Map-Reduce approach and rely on the DMs in the Worker cluster (Cloud nodes). With respect to the standard 52North implementation, DM adds the management of multiple scopes, i.e. the system returns a different list of processes according to the VRE in which the service is invoked.
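The Map-Reduce pattern used by “Cloud” algorithms can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not DataMiner code: threads stand in for the Worker cluster nodes, and `split`, `worker_task` and `run_cloud_algorithm` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List


def split(data: List[int], n_chunks: int) -> List[List[int]]:
    """Partition the input into at most n_chunks contiguous slices."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker_task(chunk: List[int]) -> int:
    # "Map" step: each worker processes its own chunk independently.
    return sum(x * x for x in chunk)


def run_cloud_algorithm(data: List[int], n_workers: int = 4) -> int:
    chunks = split(data, n_workers)
    # Threads stand in for the nodes of the Worker cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_task, chunks))
    # "Reduce" step: the master merges the partial results.
    return reduce(lambda a, b: a + b, partials, 0)
```

For example, `run_cloud_algorithm(list(range(10)))` sums the squares of 0 to 9 across four chunks and returns 285.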
When an algorithm is installed on a DM, it is also indexed on the IS as a resource. Thus, an e-Infrastructure manager can assign it to a number of VREs. When invoked in a VRE, DM returns only the subset of hosted processes that have been assigned to that VRE. On the other hand, one may also want to create multidisciplinary VREs with algorithms belonging to different domains.
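The scoping rule can be sketched as a simple filter over algorithm-to-VRE assignments. The registry content below is invented for illustration (the algorithm identifiers are hypothetical); in the real system this information lives in the IS.

```python
from typing import Dict, List, Set

# Hypothetical registry content: algorithm identifier -> VREs it is assigned to.
ASSIGNMENTS: Dict[str, Set[str]] = {
    "CROP_MODEL": {"AgroClimaticModeling"},
    "RISK_MODEL": {"FoodborneOutbreak"},
    "GENERIC_STATS": {"AgroClimaticModeling", "FoodborneOutbreak"},
}


def processes_for(vre: str, assignments: Dict[str, Set[str]] = ASSIGNMENTS) -> List[str]:
    """Return the sorted identifiers of the algorithms visible in the given VRE."""
    return sorted(name for name, vres in assignments.items() if vre in vres)
```

Here `processes_for("AgroClimaticModeling")` yields `["CROP_MODEL", "GENERIC_STATS"]`, while an algorithm assigned to several VREs, such as `GENERIC_STATS`, shows how a multidisciplinary environment can be assembled.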
The DataMiner computations can take inputs from the D4Science Workspace. Inputs can also come from Workspace folders shared among several users. This fosters collaborative experimentation right from the input selection phase. Inputs can also come from external repositories, because a file can be provided either as an HTTP link or embedded in a WPS execution request. The outputs of the computations are written onto the D4Science Distributed Storage System and are immediately returned to the client at the end of the computation. Afterwards, an independent thread also writes this information to the Workspace. Indeed, after a completed computation, a Workspace folder is created which contains the input, the output, the parameters of the computation, and a provenance document summarizing this information. This folder can be shared with other people and used to execute the process again. Thus, the complete information about the execution can be shared and reused. This is the main way by which DataMiner fosters collaborative experimentation. The DM processes can access the resources available in a VRE by querying the IS. For example, it is possible to discover geospatial services, maps, databases, and files. The DM Java development framework simplifies the interaction with the IS. Since the IS interface is HTTP REST, it can also be used by the processes directly. Further, the DM development framework provides methods to transform heterogeneous GIS formats into a numeric matrix, which reduces the effort needed to process geospatial data. DataMiner can also import processes from other WPS services. If a WPS service is indexed on the IS for a certain VRE, its process descriptions are automatically harvested, imported, and published among the DM capabilities for that VRE. During a computation, DM acts as a bridge towards the external WPS systems. In doing so, DM adds provenance management, authorization, and collaborative experimentation to the remote services.
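The two ways of supplying a file, by HTTP link or embedded in the request, map onto the `Reference` and `Data` elements of a WPS 1.0.0 Execute document. The following sketch builds such a document with the Python standard library. The process identifier, the input names and the URL are hypothetical, and a real DataMiner process may expect different inputs.

```python
import xml.etree.ElementTree as ET

WPS = "http://www.opengis.net/wps/1.0.0"
OWS = "http://www.opengis.net/ows/1.1"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("wps", WPS), ("ows", OWS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def execute_request(process_id, references=None, embedded=None, mime_type="text/csv"):
    """Build a WPS 1.0.0 Execute document.

    references: input name -> HTTP link to the data (passed by reference)
    embedded:   input name -> text content carried inside the request
    """
    root = ET.Element(f"{{{WPS}}}Execute", {"service": "WPS", "version": "1.0.0"})
    ET.SubElement(root, f"{{{OWS}}}Identifier").text = process_id
    inputs = ET.SubElement(root, f"{{{WPS}}}DataInputs")
    for name, href in (references or {}).items():
        item = ET.SubElement(inputs, f"{{{WPS}}}Input")
        ET.SubElement(item, f"{{{OWS}}}Identifier").text = name
        ET.SubElement(item, f"{{{WPS}}}Reference", {f"{{{XLINK}}}href": href})
    for name, content in (embedded or {}).items():
        item = ET.SubElement(inputs, f"{{{WPS}}}Input")
        ET.SubElement(item, f"{{{OWS}}}Identifier").text = name
        data = ET.SubElement(item, f"{{{WPS}}}Data")
        ET.SubElement(data, f"{{{WPS}}}ComplexData", {"mimeType": mime_type}).text = content
    return ET.tostring(root, encoding="unicode")


xml_request = execute_request(
    "org.example.MY_ALGORITHM",
    references={"InputTable": "https://data.example.org/table.csv"},
    embedded={"Lookup": "a,b\n1,2\n"},
)
```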
The processes currently hosted by DataMiner are written with the Java, R, Fortran, C, Octave, Linux-Shell, Windows-Batch, and Python programming languages and have been provided by developers with heterogeneous expertise (e.g. biologists, mathematicians, agronomists, physicists, data analysts etc.).
DataMiner offers a Web GUI to the users of a VRE (Figure above). On the left panel, the GUI presents the list of capabilities available in the VRE, which are semantically categorised (the category is indicated by the process provider). For each capability, the interface calls the WPS DescribeProcess operation to get the descriptions of the inputs and outputs. When a user selects a process, the GUI generates on the fly, in the right panel, the fields corresponding to its inputs. Input data can be selected from the Workspace (the button associated with the input opens the Workspace selection interface). The “Start Computation” button sends the request to the DM Master cluster, which is managed as explained in the previous section. The usage and the complexity of the Cloud computations are completely hidden from the user, but the type of computation is reported as metadata in the provenance file.
Finally, a view of the Workspace folders produced by the computations is given in the “Check the Computations” area, where a summary sheet of the provenance of the experiment can be obtained (“Show” button). From the same panel, the computation can also be re-submitted. In this case, the Web interface reads the Prov-O XML information associated with a computation and rebuilds a computation request with the same parameters. The computation folders may also include computations executed and shared by other users. The “Access to the Data Space” button lists all the input and output datasets involved in the executed computations, each with provenance information referring to the computation that used the dataset.
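How a form can be derived from a DescribeProcess response is sketched below. The sample document is a hand-written, heavily shortened WPS 1.0.0 description of a hypothetical process, and the mapping to "text" and "file" widgets is an assumption made for illustration, not the logic of the actual GUI.

```python
import xml.etree.ElementTree as ET

OWS = "{http://www.opengis.net/ows/1.1}"

# Hand-written, shortened DescribeProcess response for a hypothetical process.
SAMPLE = """<wps:ProcessDescriptions xmlns:wps="http://www.opengis.net/wps/1.0.0"
    xmlns:ows="http://www.opengis.net/ows/1.1">
  <ProcessDescription>
    <ows:Identifier>org.example.MY_ALGORITHM</ows:Identifier>
    <DataInputs>
      <Input minOccurs="1" maxOccurs="1">
        <ows:Identifier>Threshold</ows:Identifier>
        <ows:Title>Detection threshold</ows:Title>
        <LiteralData>
          <ows:DataType ows:reference="xs:double"/>
          <DefaultValue>0.5</DefaultValue>
        </LiteralData>
      </Input>
      <Input minOccurs="1" maxOccurs="1">
        <ows:Identifier>InputTable</ows:Identifier>
        <ows:Title>Table to analyse</ows:Title>
        <ComplexData/>
      </Input>
    </DataInputs>
  </ProcessDescription>
</wps:ProcessDescriptions>"""


def form_fields(describe_xml):
    """Turn each declared input into a small form-field description."""
    fields = []
    for inp in ET.fromstring(describe_xml).iter("Input"):
        literal = inp.find("LiteralData")
        fields.append({
            "name": inp.findtext(f"{OWS}Identifier"),
            "label": inp.findtext(f"{OWS}Title"),
            # Literal inputs become text boxes; anything else a file selector.
            "widget": "text" if literal is not None else "file",
            "default": literal.findtext("DefaultValue") if literal is not None else None,
        })
    return fields
```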
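Re-submission amounts to reading the parameters back from the provenance document. The sketch below assumes a deliberately simplified PROV-XML layout, with one `prov:activity` for the process and one `prov:entity` per input; the documents actually written by DataMiner are richer and may be organised differently.

```python
import xml.etree.ElementTree as ET

PROV = "{http://www.w3.org/ns/prov#}"

# Simplified, hand-written provenance document (assumed layout).
SAMPLE_PROV = """<prov:document xmlns:prov="http://www.w3.org/ns/prov#">
  <prov:activity prov:id="org.example.MY_ALGORITHM"/>
  <prov:entity prov:id="Threshold"><prov:value>0.5</prov:value></prov:entity>
  <prov:entity prov:id="InputTable"><prov:value>https://data.example.org/table.csv</prov:value></prov:entity>
</prov:document>"""


def rebuild_request(prov_xml):
    """Recover the process identifier and its input values from the document."""
    root = ET.fromstring(prov_xml)
    activity = root.find(f"{PROV}activity")
    inputs = {
        entity.get(f"{PROV}id"): entity.findtext(f"{PROV}value")
        for entity in root.findall(f"{PROV}entity")
    }
    return {"process": activity.get(f"{PROV}id"), "inputs": inputs}
```

The returned dictionary carries everything needed to issue a new execution request with the same parameters.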
Further details and information can be found in the gCube Wiki pages related to the service.
Prototype scripting is the basis of most models in various sciences, including Agriculture and Food. Scientists who write prototype scripts (e.g. using R and Matlab) often need to share results and make their models usable by other scientists on new data. One way to achieve this is to publish scripts as-a-Service, possibly under a recognized standard (e.g. WPS). The Statistical Algorithms Importer (SAI) is an interface that allows scientists to easily and quickly import R scripts onto DataMiner. DataMiner in turn publishes these scripts as-a-Service and manages multi-tenancy and concurrency. Additionally, it allows scientists to update their scripts without following long software re-deployment procedures each time. In summary, SAI produces processes that run on the DataMiner system and are accessible via the WPS standard.
The SAI interface resembles the R Studio environment, a popular IDE for R scripts, in order to make it friendly to script providers. The Project button allows creating, opening and saving a working session. A user uploads a set of files and data to the workspace area (lower-right panel). Upload can be done by dragging and dropping local desktop files. As the next step, the user indicates the “main script”, i.e. the script that will be executed on DataMiner and that will use the other scripts and files. After selecting the main script, the left-side editor panel visualises it with R syntax highlighting and allows modifying it. Afterwards, the user indicates the inputs and outputs of the script by highlighting variable definitions in the script and pressing the +Input (or +Output) button: behind the scenes the application parses the script strings and guesses the name, description, default value and type of the variable. This information is visualised in the top-right Input/Output panel, where the user can modify the guessed information. Alternatively, SAI can automatically fill in the same information based on WPS4R annotations in the script. Other tabs in this interface area allow setting global variables and adding metadata to the process. In particular, the Interpreter tab allows indicating the R interpreter version and the packages required by the script, and the Info tab allows indicating the name of the algorithm and its description. In the Info tab, the user can also specify the VRE the algorithm should be available to. Once the metadata and the variable information have been filled in, the user can create a DataMiner as-a-Service version of the script by pressing the Publish button in the Software panel. The term “software”, in this case, indicates a Java program that implements an as-a-Service version of the user-provided scripts.
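The guessing step can be approximated with a regular expression over a highlighted R assignment. The sketch below is not the SAI parser: the function name, the recognised patterns and the type labels are assumptions chosen for illustration, and only simple one-line assignments are handled.

```python
import re
from typing import Dict, Optional

# Matches simple R assignments such as `x <- 10` or `name = "abc"`.
ASSIGNMENT = re.compile(r'^\s*([A-Za-z.][\w.]*)\s*(?:<-|=)\s*(.+?)\s*$')


def guess_variable(line: str) -> Optional[Dict[str, str]]:
    """Guess name, type and default value from one R assignment line."""
    match = ASSIGNMENT.match(line)
    if match is None:
        return None
    name, raw = match.groups()
    if raw in ("TRUE", "FALSE"):
        kind, default = "boolean", raw
    elif re.fullmatch(r"-?\d+L?", raw):
        kind, default = "integer", raw.rstrip("L")
    elif re.fullmatch(r"-?\d*\.\d+|-?\d+\.\d*", raw):
        kind, default = "double", raw
    elif len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
        kind, default = "string", raw[1:-1]
    else:
        # Expressions, vectors, function calls: leave the decision to the user.
        kind, default = "unknown", raw
    return {"name": name, "type": kind, "default": default}
```

A guess of type "unknown" corresponds to the case where the user has to correct the proposed information by hand in the Input/Output panel.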
The Java software contains instructions to automatically download the scripts and the other required resources onto the server that will execute it. The computations are orchestrated by the DataMiner computing platform, which ensures the program has one instance for each request and user. The servers manage concurrent requests by several users and execute code in a closed sandbox folder, to avoid damage caused by malicious code. Based on the SAI Input/Output definitions written in the generated Java program, DataMiner automatically creates a Web GUI. During the publication process, the application notifies DataMiner that a new process should be deployed. DataMiner does not own the source code, which is downloaded on-the-fly by the computing machines and deleted after the execution. This approach meets the policy requirements of those users who do not want to share their code. The Repackage button re-creates the software so that the computational platform uses the new version of the script. The repackaging function allows a user to modify the script and to immediately have the new code running on the computing system. This approach separates the script updating and deployment phases, making the script producer completely independent of e-Infrastructure deployment and maintenance issues. However, deployment is necessary again whenever the Input/Output definitions or the algorithm’s metadata are changed.
To summarise, the SAI Web application relies on the D4Science e-Infrastructure and endows a script, provided by a community of practice working in a VRE, with as-a-Service features. SAI reduces integration time with respect to writing Java code directly. Additionally, it adds (i) multi-tenancy and concurrent access, (ii) scope and access management through Virtual Research Environments, (iii) output storage on a distributed, high-availability file system, (iv) a graphical user interface, (v) a WPS interface, (vi) data sharing and publication of results, (vii) provenance management and (viii) accounting facilities.
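The download, execute, delete cycle can be sketched with a throw-away folder. This is an illustration under stated assumptions rather than the DataMiner mechanism: a Python script stands in for the user's R script, writing the file replaces the on-the-fly download, and a temporary directory only isolates files (it is not a security sandbox).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(script_source: str, args=()):
    """Run a script inside a throw-away folder; return (stdout, folder path)."""
    with tempfile.TemporaryDirectory(prefix="dm_sandbox_") as folder:
        script = Path(folder) / "main.py"
        # Stand-in for the on-the-fly download of the user's script.
        script.write_text(script_source)
        result = subprocess.run(
            [sys.executable, str(script), *args],
            cwd=folder,            # relative paths resolve inside the folder
            capture_output=True,
            text=True,
            timeout=60,
            check=True,
        )
        output = result.stdout
    # The folder, and with it the source code, no longer exists at this point.
    return output, folder
```

Each call gets its own folder, which mirrors the one-instance-per-request behaviour described above.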
Further details and information can be found in the gCube Wiki pages related to the service.
Galaxy (https://galaxyproject.org) is a scientific workflow management system widely used in several scientific communities. In AGINFRAplus this technology has been extended to integrate with Virtual Research Environments and is operated on EGI premises. The following extensions have been implemented:
- single-sign-on support, i.e. users are transparently logged into Galaxy when accessing it via a VRE equipped with such a service;
- DataMiner algorithms are automatically offered as Galaxy tools (see screenshot below).
JupyterLab is the next-generation web-based user interface for Project Jupyter.
It enables its users to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner. Its users can arrange multiple documents and activities side by side in the work area using tabs and splitters.
JupyterLab also offers a unified model for viewing and handling data formats. JupyterLab understands many file formats (images, CSV, JSON, Markdown, PDF, Vega, Vega-Lite, etc.) and can also display rich kernel output in these formats.
In the AGINFRA+ project it is provisioned by EGI and integrated in VREs. It is also equipped with OWSLib (http://geopython.github.io/OWSLib/), a Python package for client programming with Open Geospatial Consortium (OGC) web service interface standards and their related content models. This is particularly relevant for the analytics, since every DM algorithm is made available via OGC WPS.
During the project lifetime a solution was developed to make dedicated KNIME Server services available for users working within the VRE.
It enables the execution of KNIME workflows, featuring FSK-Lab, on external KNIME Servers with the added benefit of allowing dynamic user interaction based on the content of the input file.
The KNIME Server Integration uses the services of different URI Resolvers in the D4Science Infrastructure, which allow the invocation of workflows hosted by an external KNIME Server which in turn call the Data Miner to invoke the executor of another DM algorithm (see figure below).
FSK-Lab, a KNIME extension for annotating and executing mathematical models, has been integrated into the Data Miner platform. A KNIME workflow for executing an FSK-ML model file is available as a DM algorithm (see figure below).
A KNIME-based DM algorithm was added that incorporates these and other services of the D4Science e-infrastructure in order to publish FSK-ML model files to the AGINFRA Catalogue, including the above-mentioned options for execution (see figure below).
The Data Analytics technology is part of the overall gCube system.
Enabling Technology Version
Major changes and novelties developed to serve AGINFRA+ use cases
Nov '19 - today
Released an enhanced version of SAI providing users with information on the runtime platform (#12594), stronger publishing pipelines (#17569), and new parameters representing a workspace folder (#16503);
Reinforced DataMiner by adding an online visualization facility for process logs and outputs (#11711 and #17106) and by reconsidering how the GUI reports information on computations (#17030);
Revised the catalogue to fully support repeatable fields (#11331) and references to authors (#17577);
Released a workspace GUI relying only on StorageHub (#17226) and reconsidering links management (#17552);
May '19 - Nov '19
Released an enhanced version of SAI supporting zoom (#11708), coordinate systems (#11710) and resources selection (#16157);
Reinforced DataMiner executor widget with the "show" option (#16263);
Integrated workspace in RStudio and Jupyter;
Released a component publishing DataMiner components into the Catalogue;
Feb '19 - May '19
KNIME 3.7.1 (Feb '19)
Nov '18 - Feb. '19
Added support to Python 3.6 (#12742)
Jul '18 - Nov '18
Released an enhanced version of the workspace making it possible to associate a DataMiner algorithm with a folder and execute it (#12100). (relevant for WP6)
JupyterLab has been extended with additional kernels by relying on BeakerX technology (#12724)(relevant for WP5)
JupyterLab has been extended with ipyleaflet, a library that can be used to create interactive maps within Python notebooks (#12724)(relevant for WP5)
Jun. '18 - Jul '18
Added Status control in Algorithm Generator (#11750) to make it possible for algorithms to report on their completion;
Added System parameters support (#11768);
Apr. '18 - Jun. '18
SAI: Added support to Private algorithms (#10779)
SAI: Added user name to algorithms descriptions in SAI publication (#10705)
Reinforced JupyterLab to simplify the invocation of DataMiner algorithms via WPS (including token management);
Feb '18 - Apr. '18
SAI sends information to the Pool Manager (the component dealing with algorithm deployment) about the user publishing the algorithm and all the users who share the SAI project with them (#10779);
DataMiner filters the algorithms depending on the users' visibility rights (#10778);
The algorithm installer manages a list of users with visibility rights on an algorithm (#10750);
Dec. '17 - Feb '18
This is the first release of the technology including enhancements directly originating from AGINFRA+ requirements.
The black-box mechanism making it possible to execute a large set of existing applications has been completed and released (#8819). It includes full support for KNIME-based algorithms (relevant for WP6) and compiled Java algorithms (relevant for WP5);
Added the support to deal with private algorithms;
Two computing clusters have been deployed to decouple the cluster serving prototyping activities from the cluster serving production activities;
The Data Analytics technology has been configured to serve the FoodborneOutbreak VRE and ORIONKnowledgeHub VRE.
Nov. '17 - Dec. '17
The Data Analytics technology has been configured to serve the AgroClimaticModeling VRE and FoodSecurity VRE.
Oct. '17 - Nov. '17
Automating the technology provisioning mode.
Simplifying the access to logs.
Sep. '17 - Oct. '17
No major enhancements to the analytics part.
Jul. '17 - Sep. '17
The Data Analytics technology has been configured to serve the RAKIP_Portal VRE.
Jun '17 - Jul '17
A first version of the document containing a specification of the requirements and use cases characterizing WP5, WP6, and WP7 became available.
A mechanism for executing KNIME-based algorithms as DataMiner algorithms has been developed (#8667). It is based on a wrapper invoking the KNIME engine (WP6 requirement).
May '17 - Jun '17
No major enhancements to the analytics part.
Mar '17 - May '17
The Data Analytics technology has been configured to serve the DEMETER VRE.
Jan '17 - Mar '17
The legacy version of the technology, i.e. the version predating the AGINFRA+ project, has been made available through the AGINFRAplus VRE.
The KNIME technology (of relevance for WP6) has been installed, and KNIME-based workflows showcasing how to invoke algorithms running in the Data Analytics environment have been developed (#7194). This represents one type of interoperability scenario between the two technologies.
(Coro et al. 2017) Coro, G., Panichi, G., Scarponi, P., & Pagano, P. (2017). Cloud computing in a distributed e‐infrastructure using the web processing service standard. Concurrency and Computation: Practice and Experience, 29(18). doi:10.1002/cpe.4219