Support #11605

Set DataMiner queue length to 1

Added by Gianpaolo Coro over 1 year ago. Updated 8 months ago.

Status: Closed
Start date: Nov 02, 2018
Priority: Urgent
Due date:
Assignee: Leonardo Candela
% Done: 100%
Category: High-Throughput-Computing
Sprint: UnSprintable
Infrastructure: Production
Milestones:
Duration:

Description

In the templates/setup.cfg file on the DataMiner, the maxcomputations parameter should be changed from 4 to 1. Ideally, this parameter should also be changed in the component itself.
However, this modification is urgent for users currently using DataMiner as a cloud computing platform, so action via provisioning is required.
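A minimal sketch of the intended change, assuming setup.cfg uses simple key=value syntax (the surrounding keys are omitted):

```ini
# templates/setup.cfg (excerpt, sketch)
maxcomputations=1   ; was 4; allow one computation at a time per machine
```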


Subtasks

Task #12614: Need server with joblength 1 to use all 16 cores and a cu...ClosedLars Valentin

Task #12806: Testing the new settings of the RAKIP infrastructureClosedLars Valentin

History

#1 Updated by Andrea Dell'Amico over 1 year ago

Where is it?

root@dataminer1-d-d4s:/home/gcube/tomcat/webapps/wps# find . -name setup.cfg
root@dataminer1-d-d4s:/home/gcube/tomcat/webapps/wps#

#2 Updated by Gianpaolo Coro over 1 year ago

@lucio.lelii@isti.cnr.it could you help us, please?

#3 Updated by Lucio Lelii over 1 year ago

It is in src/main/resources inside the dataminer jar; it cannot be changed that way.

#4 Updated by Lucio Lelii over 1 year ago

I'm building, via ETICS, the dataminer jar with the needed modification. When it is ready, we will need to replace the jar on every dataminer.

#5 Updated by Andrea Dell'Amico over 1 year ago

Please let me know when it's ready and post the maven coordinates, so that I can put in place an ad hoc playbook.

#6 Updated by Andrea Dell'Amico over 1 year ago

  • Status changed from New to In Progress

We have the artifact. I'm going to install it on the dev dataminers first.

#7 Updated by Andrea Dell'Amico over 1 year ago

Can anybody check dataminer1-d-d4s.d4science.org? If I do not hear anything back in an hour, I'll start updating the production dataminers.

#8 Updated by Andrea Dell'Amico over 1 year ago

  • % Done changed from 0 to 30

The services in dev restarted correctly, I'm starting rolling out in production.

#9 Updated by Andrea Dell'Amico over 1 year ago

  • % Done changed from 30 to 100
  • Status changed from In Progress to Feedback

Done.

#10 Updated by Gianpaolo Coro over 1 year ago

As I have discussed with Andrea, this patch should be rolled-back since there are algorithms that invoke themselves, possibly on the same machine. In the next weeks, Andrea will deploy new DataMiner machines as Generic Worker machines behind the Generic Worker proxy, and this should solve the issue.

#11 Updated by Andrea Dell'Amico over 1 year ago

The rollback is running.

Gianpaolo Coro wrote:

As I have discussed with Andrea, this patch should be rolled-back since there are algorithms that invoke themselves, possibly on the same machine. In the next weeks, Andrea will deploy new DataMiner machines as Generic Worker machines behind the Generic Worker proxy, and this should solve the issue.

Would you open a dedicated ticket for the new generic workers? They will substitute the old ones, correct?
And once deployed, will the queue length parameter be fixed at 1 on both the regular dataminers and the generic workers?

#12 Updated by Andrea Dell'Amico over 1 year ago

(rollback completed)

#13 Updated by Gianpaolo Coro over 1 year ago

Thank you. Given the way people are going to use the services, I think we will need to have the standard DataMiners running with the default configuration and the Generic Workers running with 1.

#14 Updated by Andrea Dell'Amico over 1 year ago

Gianpaolo Coro wrote:

Thank you. Given the way people are going to use the services, I think we will need to have the standard DataMiners running with the default configuration and the Generic Workers running with 1.

So that value must be converted into a property, configurable at provisioning time without swapping jar files.
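A sketch of how that could look in an Ansible-style provisioning setup; the variable name dataminer_max_computations and the template path are hypothetical, not the actual d4science playbooks:

```jinja
# templates/setup.cfg.j2 (sketch; variable name is hypothetical)
maxcomputations={{ dataminer_max_computations | default(4) }}
```

Standard DataMiners would keep the default of 4, while the generic-worker group would override the variable to 1 in its group variables.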

#15 Updated by Gianpaolo Coro over 1 year ago

  • Status changed from Feedback to Closed

#16 Updated by Lars Valentin 11 months ago

  • Priority changed from High to Normal
  • Assignee changed from _InfraScience Systems Engineer to Leonardo Candela
  • Tracker changed from Task to Support

Hello Leonardo,

I hope you don't mind that I reuse this specific ticket instead of opening a new one with the same context.

Unfortunately, I did not take the chance in the web meeting yesterday, so I am asking here: I would like to know the current status of the queue length per dataminer server and the future possibilities.

My understanding so far was that each dataminer has 16 cores and 4 slots to accept jobs, which means each job is allowed to use at most 4 cores. Am I right about this?
That is totally fine if we run workflows (models) which are not optimized for multicore use and therefore only use one core.

But what about workflows/models which are able to use all cores provided by the system? Usually, they should be programmed to leave one core for system processes, which would mean only 3 cores are available if one job can be assigned at most 4 cores.

I would like to ask whether it is possible to execute workflows/models via REST in a way that the dataminer knows it should run this job exclusively on a 16-core dataminer, so that a single job can use all 16 cores.

That would be 15 cores available to run the model instead of 3, which might roughly reduce the running time to one fifth.
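The arithmetic above can be checked with a small sketch; the one-reserved-core convention is the assumption stated in the question, and the speedup estimate assumes the model scales linearly with cores:

```python
def usable_cores(total_cores: int, job_slots: int, reserved: int = 1) -> int:
    """Cores a single job can use: an equal share of the machine's cores
    per job slot, minus the core conventionally left for system processes."""
    return total_cores // job_slots - reserved

shared = usable_cores(16, 4)       # 4 slots -> 4 cores per job -> 3 usable
exclusive = usable_cores(16, 1)    # queue length 1 -> 16 cores -> 15 usable
print(shared, exclusive, exclusive / shared)  # 3 15 5.0, i.e. roughly 1/5 runtime
```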

Thank you in advance!
Lars

PS: Might you have dataminers with more cores that could be addressed for such cases in the future?

#17 Updated by Leonardo Candela 11 months ago

Hi @lars.valentin@bfr.bund.de, I would suggest opening specific tickets to discuss any needs you / your use cases might have on aspects like this.

During our last PMB meeting (see slides here https://goo.gl/ZFYGJq) I tried to explain that:

  • right now there are two "clusters" configured to provide data miner facilities (proto and prod);
  • these clusters are not for exclusive use of AGINFRA+ cases;
  • we can have one cluster per VRE;

The business logic the service uses for allocating tasks on machines and consuming the available cores can be better explained by @gianpaolo.coro@isti.cnr.it ... in particular, regarding cores, (a) it also depends on how the algorithm has been implemented, and (b) it cannot be changed at algorithm invocation time.

Last but not least, if you have particular settings / behaviours you need to satisfy, we can try to configure a specific cluster, yet this has a cost we have to evaluate carefully.

These are overall comments from my side; my suggestion is to be specific and report any concrete need you have (e.g. enact process x that needs 100 cores), and we will do our best to satisfy it with the available resources and technologies.

#18 Updated by Lars Valentin 9 months ago

  • Start date changed from Oct 06, 2018 to Nov 02, 2018
  • Due date set to Nov 02, 2018

due to changes in a related task
