D4.2 - Scientific Workflow Design Technologies
As stated in the project’s mission, the major goal of AGINFRA+ is to establish a holistic infrastructure capable of supporting researchers throughout the complete lifecycle of their research activities.
Apart from the mechanisms for submitting, finding and using individual processing components within the AGINFRA+ VREs, an essential functionality for effectively serving the scientific process is the incorporation of support for Scientific Workflows in the environment.
Workflows organize the execution of distinct processing modules, combining their functionality to produce the final results of the intended experiments.
To this end, different Workflow Management Systems (WMS) have been implemented and are used by different communities, tailored to their specific computational needs.
In the context of the project, three of the major platforms, KNIME, Galaxy and Airflow, have been selected for use and extension as part of the overall AGINFRA+ e-infrastructure.
The following sections briefly present the selected systems and describe the adaptation, customization and extension steps taken to make them suitable for the requirements of the AGINFRA+ use cases and data infrastructure.
AGINFRA+ Workflow Management Systems
As elicited from the requirements specification for the AGINFRA+ infrastructure, the inclusion of workflow design and execution technologies is of paramount importance to the research activities of the relevant research communities. The three selected WMS, which are used by the communities covered within the project as well as by communities adjacent to them, are presented in the following subsections.
The open source Konstanz Information Miner (KNIME) (https://www.knime.com) is a data analytics and reporting platform initially developed at the University of Konstanz and made available in 2006. Its initial purpose was to serve the requirements of the pharmaceutical industry, i.e. to facilitate so-called virtual drug discovery. This modular open source technology has since evolved into a generic, community-driven data analytics resource that is used in almost all fields of data analysis, from data mining and text mining to image processing and bio- and cheminformatics. The KNIME Analytics Platform is developed in Java (based on the Eclipse framework) and is available under an open source license.
The KNIME Analytics Platform software incorporates various features that can support end-users from the food safety, food security or agro-climatic modelling communities:
(1) it provides an interface for graphical assembly of data analysis pipelines, i.e. each individual data analysis task is represented by a node, and interconnected nodes together constitute a workflow (see Figure 1 for an example),
(2) workflows can be saved, imported, exported and thus shared conveniently,
(3) intermediate results and execution steps are monitored and accessible within the KNIME Analytics Platform software, and
(4) it can integrate other popular programming and data analytics languages, e.g. R, Python, SAS, Matlab and thus serve as a technology integration platform.
Furthermore, KNIME.com provides big data connectors that allow interaction with big data analytics frameworks like Apache Spark and facilitate the training and execution of machine learning algorithms and neural network architectures in the cloud.
Figure 1. KNIME Workflow Editor
Galaxy (https://galaxyproject.org) is a data integration and scientific workflow design and execution framework. It was originally developed to support genomics research and knowledge exchange; however, it also allows the integration of arbitrary processing components, which makes it suitable for a multitude of disciplines and domains.
There are currently multiple Galaxy installations publicly available and usable. The software is available under the Academic Free License and can be freely downloaded and deployed as a stand-alone web platform. The basic distribution offers a generic set of widely used tools. Additional components can be added via the Galaxy Tool Shed repository, and custom scripts can be added by the administrator of a deployment.
Each component added in Galaxy bears annotations that provide basic information and guidelines for its usage. The components can be combined in workflows via the included design environment (see Figure 2).
Figure 2. Galaxy Workflow Editor
Apache Airflow (https://airflow.apache.org) is an open-source workflow management platform. With Airflow, users author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes the authored tasks on an array of workers while following the specified dependencies. Rich command-line utilities make it quick to perform complex operations on DAGs. Airflow also features a rich user interface, which makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. Because workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
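The idea of executing tasks while following the dependencies of a DAG can be illustrated with a minimal sketch in plain Python. This is not the Airflow API and not how the Airflow scheduler is implemented; it only uses the standard-library `graphlib` module, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "fetch": set(),
    "clean": {"fetch"},
    "stats": {"clean"},
    "plot": {"clean"},
    "report": {"stats", "plot"},
}


def execution_order(deps):
    """Return one valid order in which the tasks can run.

    Every task appears after all of the tasks it depends on.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Tasks without a dependency between them (here `stats` and `plot`) may appear in either order, which is what allows a scheduler to hand them to different workers at the same time.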
Figure 3. Airflow DAGs View
Systems Deployment and Integration
Deployed Workflow Management Systems
In the case of the KNIME workflow management system the AGINFRA+ project facilitated the integration via two independent approaches:
Figure 4. KNIME workflow as an algorithm for the Data Miner
2.: With support from WP3, web-based services provided by the KNIME Server infrastructure operated by BfR were integrated into the RAKIP community VRE. KNIME Server is a commercial technology provided by KNIME.com that, among other things, makes it possible to offer easy-to-use websites to end users with KNIME workflows as the backend technology. These web services are highly customizable wrappers around KNIME data analytics workflows that hide the complexity of those workflows from end users. For the RAKIP food safety modelling community, BfR developed dedicated web services for creating, annotating, editing, executing and even graphically combining FSK-ML compliant models online, so end users no longer need to learn KNIME or FSK-Lab to accomplish these tasks. The KNIME Server also offers access to the independent model repository (see Figure 5) that is jointly maintained by the three founding members of RAKIP: DTU, ANSES and BfR. As the desktop KNIME Analytics Platform itself might evolve into a web-based solution (like Galaxy) in the near future, the mechanisms developed for the KNIME Server integration might be reusable once this new KNIME technology is made available by KNIME.com.
Figure 5. RAKIP-Web-repository provided by a BfR hosted KNIME server, integrated into the VRE
An initial testbed deployment of Galaxy was established at: http://18.104.22.168:9090. The deployment was used for assessing its functionalities by the different AGINFRA+ communities and helping the technical partners understand the architecture and dependencies of the component.
Following these assessment activities, Galaxy was deployed as an integrated VRE component. By the end of the project's first period (M18), Galaxy had been deployed over EGI infrastructure and made available to authorised users under the Analytics section of the VRE, along with other processing frameworks (i.e. Data Miner, RStudio, Jupyter, nbviewer).
Galaxy has been integrated with the authorisation mechanism of D4Science in order to provide seamless authentication to VRE users without the need to log in again. A Galaxy deployment is created for each of the VREs that request access to the service, and the VRE provides a link to its dedicated Galaxy deployment for easy access. Each of these deployments is configured to fetch a list of algorithms from the VRE DataMiner facility and make them available as tools in Galaxy. Figure 6 shows the main Galaxy interface with the list of DataMiner algorithms expanded on the left.
Figure 6. Galaxy for the Food Security VRE
Whenever one of these tools is invoked, the DataMiner service is contacted using the credentials of the invoking user, and the outputs are stored back into the Galaxy service. They can then be used as input for another DataMiner invocation or for any of the other tools in the Galaxy ecosystem. A set of extra tools that convert outputs into valid Galaxy formats was added so that users can create workflows easily.
The Galaxy service is now actively used by the use cases for execution of workflows and deployments are further customised as required to cover the needs of users.
3.: For the AGINFRA Data Harvesting Workflow, WP2 introduced Apache Airflow as part of the proposed data e-infrastructure. To meet the data requirements, Agroknow introduced two DAGs. The first one includes all the configuration code for harvesting data records, and the second one includes data transformation and enrichment tasks so that the data can be indexed and served in the proper schema. The DAGs have been tested on the following data sources:
(2) AGINFRA+ VREs,
The first DAG is scheduled to run every hour for sources 2-9, while the first source (AGRIS) is scheduled to be updated once a month.
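The per-source schedule described above can be sketched as a small lookup that returns a standard cron expression, a form that Airflow accepts as a schedule value. This is an illustrative sketch, not the project's actual configuration: the source identifiers and the function name are hypothetical, and only AGRIS and the AGINFRA+ VREs are named in the text.

```python
# Hypothetical identifiers; sources harvested once a month.
MONTHLY_SOURCES = {"agris"}


def harvest_schedule(source_id: str) -> str:
    """Return a cron expression for the harvesting DAG of one source."""
    if source_id in MONTHLY_SOURCES:
        return "0 0 1 * *"  # 00:00 on the first day of every month
    return "0 * * * *"      # minute 0 of every hour


for source in ("agris", "aginfra_vres"):
    print(source, harvest_schedule(source))
```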
Generated DAGs and their configuration per data source can be found on the AGINFRA PLUS GitHub:
The DAG View of the AGINFRA Data Harvesting Workflow can be found at http://22.214.171.124:8888/admin/.
Figure 7. Airflow DAGs for AGINFRA PLUS harvesting
Workflows Discovery and the Common Workflow Language
To support effective access to and exploitation of the AGINFRA+ WMS, the idea of a discovery mechanism for the workflows designed over the three systems was explored. To that end, an established standard was introduced. The so-called Common Workflow Language specifies the following entities for any given workflow:
- An object is a wrapping data structure that incorporates a set of name/value pairs.
- A document is a file defining an object or array of objects.
- A process is a basic unit of computation that is characterized by an input, and an output derived from that input after a specific computation of arbitrary complexity.
- An input object is an object describing the inputs for a process.
- An output object is an object describing the outputs of a process.
- An input schema describes the valid format for an input.
- An output schema describes the valid format for an output.
- A parameter is a named symbolic input or output of a process.
- A workflow is a process characterized by multiple intermediate steps, where step outputs are connected to the inputs of subsequent steps to form a directed acyclic graph. Independent processes belong to different graphs, so the workflow is essentially the forest formed by the graphs of connected subprocesses.
- A workflow platform is a specific setting for executing processes.
- Metadata is information about workflows, processes, inputs and outputs as declared in the native metadata formalisation of the targeted platform.
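As an illustration of how these entities fit together, the sketch below writes a small CWL-style workflow as a Python dict and derives the step graph from it. The field names (`cwlVersion`, `class`, `inputs`, `outputs`, `steps`, `run`, `in`, `out`, `outputSource`) follow CWL v1.0; the step names, tool files and the helper function are hypothetical and do not come from the project.

```python
# A CWL v1.0-style workflow document expressed as a Python dict.
# Step names and the referenced *.cwl tool files are hypothetical.
workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"raw_records": "File"},
    "outputs": {
        "result": {"type": "File", "outputSource": "enrich/enriched"},
    },
    "steps": {
        "transform": {
            "run": "transform.cwl",
            "in": {"records": "raw_records"},
            "out": ["transformed"],
        },
        "enrich": {
            "run": "enrich.cwl",
            "in": {"records": "transform/transformed"},
            "out": ["enriched"],
        },
    },
}


def step_dependencies(wf):
    """Map each step to the set of steps whose outputs it consumes."""
    deps = {}
    for name, step in wf["steps"].items():
        upstream = set()
        for source in step["in"].values():
            # "step/output" refers to another step's output;
            # a bare name refers to a workflow-level input.
            if "/" in source:
                upstream.add(source.split("/", 1)[0])
        deps[name] = upstream
    return deps


print(step_dependencies(workflow))
```

The resulting mapping is the directed acyclic graph mentioned in the definition of a workflow: `enrich` depends on `transform`, which depends only on a workflow input.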
Although the premise was clear, this task was no longer pursued: project activities did not generate a body of workflows large enough to warrant a discovery mechanism, and the use-case communities showed no particular interest in such a scenario.
 Curcin, V., and M. Ghanem. “Scientific Workflow Systems - Can One Size Fit All?” 2008 Cairo International Biomedical Engineering Conference (December 2008). doi:10.1109/cibec.2008.4786077.
 Berthold M.R. et al. (2008) KNIME: The Konstanz Information Miner. In: Preisach C., Burkhardt H., Schmidt-Thieme L., Decker R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg.
 Enis Afgan, Dannon Baker, Marius van den Beek, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Björn Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Greg Von Kuster, Eric Rasche, Nicola Soranzo, Nitesh Turaga, James Taylor, Anton Nekrutenko, and Jeremy Goecks. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Research (2016) 44(W1): W3-W10 doi:10.1093/nar/gkw343
 Peter Amstutz, Michael R. Crusoe, Nebojša Tijanić (editors), Brad Chapman, John Chilton, Michael Heuer, Andrey Kartashov, Dan Leehr, Hervé Ménager, Maya Nedeljkovich, Matt Scales, Stian Soiland-Reyes, Luka Stojanovic (2016): Common Workflow Language, v1.0. Specification, Common Workflow Language working group. https://w3id.org/cwl/v1.0/ doi:10.6084/m9.figshare.3115156.v2