GridPP Deployment Board & Project Management Board Combined Minutes 07 - 24th August 2010 (Ambleside)
=====================================================================================================

Present: Steve Lloyd (Chair), Jeremy Coles, Phil Clark, Graeme Stewart, Roger Jones, Alessandra Forti, Pete Gronbech, Dave Colling, Duncan Rand, Andrew Sansum, Derek Ross, Stuart Wakefield, Dave Kelsey, Tony Doyle, Dave Britton, Sarah Pearce, Tony Cass, John Gordon, Pete Clarke (Suzanne Scott, Minutes)

Apologies: Pete Watkins, James Catmore, Raja Nandakumar, Glenn Patrick, John Walsh, Andy Richards

1. Minutes of Previous Meeting
===============================

The Minutes of the previous DB meeting, held on 16th April 2010 at RHUL, were accepted.

2. Actions and Matters Arising
===============================

01.15 DC to provide updated LondonGrid MoU. It was reported that a meeting would be held soon, and it would be agreed then. DR reported that there was a draft MoU coming soon. It was reported that another iteration was happening - a new version should go out shortly. DC reported that the meeting had happened, the LondonGrid MoU had been discussed, comments had been provided but there was no further action. Ongoing.

DB asked whether we needed a formal MoU within the Tier-2? RJ advised that the NorthGrid MoU did not require to be changed. SL observed that we functioned well without signed MoUs. JG explained that EGEE had pushed service-level agreements with sites, but because we had pre-existing MoUs we didn't need to comply. JG noted that in EGI there were OLAs: 'Operations Level Agreement' and if an MoU didn't exist then we needed to do one - this would be an agreement with a site, and would be signed-off by sites. SL noted that grant conditions already ensured that a level of service was provided in return for funding given. JG added that there was also the issue of security being built-in to the MoU.
SL suggested that we needed a generic MoU that could continue and would not require to be re-signed. This also related to Action 05.01 below regarding the NorthGrid MoU. TD asked if the GridPP MoU signed by the four Tiers would continue? Yes. It was agreed that the GridPP MoU should be revisited and made generic. DB advised that it should not specify individual sites but should specify Tiers instead.

ACTION 07.01 SL to update the GridPP MoU document for 2010-2011 and ensure it was generic in order to future-proof it against constant amendment and re-signature.

04.07 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing at present, however other policies had been approved by LCG. It was reported that Mingchao Ma had been looking at this in relation to EGI. JG advised that some Policies had been updated since April. Ongoing.

04.10 SP to contact Robin Tasker re GridMon update (it was noted that this was also a PMB action). SP reported that discussions were ongoing - they had effort at Glasgow to work with Mark Leese on GridMon, but this had been difficult as no response had been received from Mark Leese for some time. ML was currently working on a database. SP reported that she had contacted Robin Tasker at the time, Mark Leese had done some minor update at the start of 2010. TD asked if it was worth maintaining GridMon in GridPP4? GS observed that it was a useful tool if you had network problems. PG agreed that it was useful to have it. JG advised that there were iPerf tools and that the OPN people had been looking at this and would report-back to the GDB in September 2010. PC noted that if we lost our only tool and things didn't work, then it would be bad - he agreed that GridMon was necessary. DB noted that there had been difficulties in contacting Mark Leese in order to organise a handover.
DB had managed to achieve some access to the backend database. It was agreed that renewed efforts would be made to engage someone at Glasgow to tackle this and transfer access in order to ensure the instances were up-to-date and running ok - DB would insist on a meeting with Mark Leese. It was noted that this had to be done by the end of GridPP3. New action below replaces 04.10.

ACTION 07.02 TD/DB to make renewed efforts to engage someone at Glasgow to tackle GridMon and to have access transferred in order to ensure the instances were up-to-date and running ok - DB would insist on a meeting with Mark Leese for a handover. It was noted that this had to be done by the end of GridPP3.

04.11 Re future network requirements, dTeam to look into iperf as an alternative backup, and report-back. It was felt that this was re-inventing the wheel, but it was ongoing. This action remains pending meantime, depending on success in action 04.10. No further action re iPerf to be taken meantime. Ongoing.

05.01 RJ to provide an updated NorthGrid MoU (requires to be modified in relation to EGEE/EGI).

05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda). GS reported that they did this at Glasgow, and it worked. Item closed.

06.01 RJ/Graeme Stewart to provide urls of the place(s) where info is located re ATLAS site tests and measurements (so that sites understand what they're being measured on). Ongoing.

3. Future of the Deployment Board
==================================

SL advised that the future of the DB should be discussed, and it was likely that today the DB meetings would be discontinued. SL asked for questions and proposals as to its replacement. DC asked about the User Board? This was a meeting that was larger than the PMB but included main stakeholders.
DB advised that there were certainly issues to be discussed but he agreed that it was the people who were crucial - no Tier-2 point of view was ever discussed: the four Tier-2 Co-ordinators could overlap with the PMB. DC suggested that the smaller experiments could also be invited. DB thought that the key difference was to get the site view, and that whilst more experiment representation was valuable, it would have the effect of shifting the discussion to the middle ground - DB considered that site issues had to be given space. SL suggested we 'mend' the User Board. DB noted that there were overlap issues, but we also needed a longer term strategic view from the sites. DK observed that originally, the technical decisions were made below the level of the PMB. SL agreed that the dTeam also handled such issues.

DB asked for the opinion of the Tier-2 Co-ordinators. AF considered that it was good to meet together, particularly to have F2F meetings. PG thought it was not always obvious what the PMB were thinking and that the current DB meeting was useful. DB noted that in GridPP4 there would be 8 group analysis sites forming a core of 8 plus Tier-2 Management, which equalled potentially 12 people plus the PMB, making a total of 27. DK asked what were the proposed outputs? SL noted that the Agenda could be set by the dTeam. DB preferred a two-way flow between the new Operations Team and the PMB, but the PMB could not really be expanded, and he suggested also that JC appoint a Deputy who could attend the PMB during GridPP4, and more formal get-togethers could be organised twice per year.

It was agreed to have a joint F2F meeting of the Operations Team core members, plus the PMB, to take place twice per year at the Collaboration Meetings. The first such meeting would take place at Sheffield in April 2011. Actions from this new OT/PMB meeting would go to either the PMB or the Operations Team to progress.

4. Reporting Issues
====================

JC reported that there was inaccuracy in the gstat figures - he presented a slide and there was a discussion of the gstat table. It was noted that the gstat and Quarterly Reports figures didn't tally. JG noted that a different report was being implemented in relation to installed capacity. JC noted that some sites were using StoRM. PG advised that StoRM didn't publish correctly - it was currently being done by hand. JC noted that gstat was not accurate. JG reported that the MB were dealing with this and sites would be asked eventually about their figures being accurate - the bugs would get fixed and a Tier-2 report would be provided in due course. There was a discussion on CPU usage.

5. Manpower Issues
===================

SP reported from the Quarterly Reports that there were staffing issues at sites - how do we support small sites who have limited staff? PG reported that in relation to Bristol, he had spoken to Nick Brook - they have a storage person who was 0.5FTE GridPP-funded, and he was trying to set them up like Oxford. One contract had been extended to the end of March therefore there was a small window in which to transfer the knowledge, following which there would be 'remote control' from Oxford. DC noted that Bristol would eventually be downgraded to a Tier-3. There was a discussion on PhEDEx data. SL observed that being a Tier-2, that meant local resources from the University and staff providing local expertise. PG advised that there were cluster issues regarding upgrades and also limited opportunity for expansion. DB asked if they would actively monitor the dashboard or wait until a GGUS ticket arrived? PG advised both - but if they dropped down to a local cluster, they wouldn't be part of SouthGrid at all, however if they remained as a Grid site then the kit would need to be monitored and maintained. SL commented that we didn't want a huge drain on Oxford to support this.
PG confirmed it would be on a 'best effort' basis only. GS likened it to Durham and ScotGrid - it was possible to maintain but there were virtualisation issues - the NFS area was not available etc, they couldn't duplicate the setup at Glasgow but there were areas Glasgow could have been involved with before the staff member left. GS advised that a period of overlap was quite important at small sites. DR noted that UCL was similar - a staff member had gone to Switzerland for a year, but there was a Deputy there so it was less of a problem, and UCL Central look after their cluster - it was dormant in any case until they upgrade their new worker nodes. TD asked about the smaller sites in GridPP4? DB advised that SP had checked all of the sites with fractional posts and had verified that they could handle the funding issues - however we did need to anticipate problems at an operational level.

6. Other Issues from the Experiments
=====================================

For ATLAS, RJ reported problems with disk available in space tokens, specifically at RAL. There had been an issue with the UK cloud being taken off the Panda brokering, which was a 'heavy handed' action after a server failure at RAL - the server had firmware problems on the RAID controllers - it was easier for ATLAS to declare the files lost, but the ATLAS ADC response had not been appropriate. There were deployment issues for the experiments and for the Tier-1. RJ reported that in relation to network issues, there were potential changes to the way they do data distribution. GS advised that they needed to commission long distance links to the Tier-2. RJ considered that networking needs to take time to organise, and we don't want to rush to spend money on it.
GS reported that they had seen recurrent problems at sites - the speed of response varied greatly, and also how well the fix was done - he wanted to emphasise that seeing an issue once is ok, however twice could be construed as carelessness, and three times was simply unacceptable. Attentiveness was required. RJ reported there were SE issues at sites, either amber or red, and that storage was always the challenge.

For CMS, DC reported that since the last meeting there had been niggles here and there but overall things were OK. The Tier-2 sites had not been as responsive as they would have preferred - they do monitor the sites. For networking they needed more than one Gbit per site if there was contention - they will be investigating the bottlenecks. DC reported that Brunel needed an extra person as soon as possible in order to provide cover. There were XFS issues at Brunel and elsewhere - more Sysadmin support would help speed this up. In the UK, CMS can expand the number of analysis groups slightly, they may cautiously increase next year.

For LHCb there was no-one present. PC advised that from the LHCb viewpoint, RAL was fine, as were the others. DB advised that there were two issues: the CASTOR upgrade and xrootd access (for LHCb). TD noted that direct FTP access inside DIRAC was also an issue for LHCb. JG reported that there had also been a DPM problem in relation to the NAT but this was fixed at some sites now, especially Glasgow.

For 'other' VOs, PClark reported that UKQCD was being supported - they now had 5-10 Tbytes at sites. JC reported that they also had space at both Edinburgh and Glasgow. DB thought this was an interesting application to host, for it used both HPC/HTC. PC advised that since QCD had started their own project, they had an overarching organisation, and we could possibly talk to them at the 'project management' level. JG thought this was useful, especially if we wore NGI/EGI hats.

7. Other Issues from the Tiers
===============================

For London, DR reported that the staffing issue was resolved, new people had been appointed and there was a sharing of roles. UCL was a bit vulnerable due to the SysAdmin leaving. RHUL was working OK, and had moved to Egham. DB reported that RHUL had converted their hardware funding into a person and wanted to continue that in GridPP4. DB advised that we fund 0.5FTE - were there 'value for money' concerns? DC noted no - it seemed the correct approach, without the 'hardware-to-staff' conversion it had not been viable. DB advised that there was small and very limited flexibility in GridPP4 for 'hardware-to-staff' funding. PG noted that the site was easily meeting its MoU.

For ScotGrid, GS reported that Glasgow and Edinburgh were doing well - there had been staff changes at Glasgow, and the Edinburgh fairshare negotiated at ECDF meant they were running 1000 jobs at the moment and delivering well over what was asked. Regarding Durham, GS noted that a difficult staff transition was currently being managed.

For NorthGrid, AF reported that there had been SL5 issues on the servers, the Storage Group were looking at ext4 but it was not considered to be a stable file system. There had been problems with Nagios at Lancaster, bringing availability down. Liverpool were using an operating system that was not listed in Nagios tests - they were using a Red Hat Enterprise server not listed in the user systems therefore they were failing the tests at Liverpool - a ticket had been opened but there was currently no reply (ticket 61224). AF reported that all sites were working on new hardware, shared or local, and Manchester were commissioning new hardware and currently had a storage issue. Manchester had lost a post in GridPP4. Roger Barlow had also stepped down and Mike Seymour and Un-Ki Yang had taken over weekly management.
It was noted that a Group Leader was preferred at the CB - the site needed to decide who to nominate.

For SouthGrid, PG reported that Oxford were doing well over the last few months and they also had a tender out for new hardware, both CPU and storage - this had to be cancelled and new quotes procured. The problem had been on the computing node side - they have re-evaluated and started again. PG reported that RAL PPD was running well and almost back to full strength following air-conditioning issues which had been resolved. Cambridge were doing incremental upgrades but there were support query issues and ticket responses were slow. Birmingham had issues recently but they are running well, although there were long-term concerns re staffing and handover. Bristol had staffing problems and would receive help on a 'best effort' basis. Sussex might become a Tier-3.

PG raised the issue of Oxford running the GridPP Nagios service for the UK - this was running on Grid hardware but the fallback should be set up outside Oxford? JG advised that NGS were trying to run the same thing and he should contact Andy Richards. PG advised that certain tasks were critical and others not so. JG thought this could be brought up at the Technical Forum in September. JC confirmed that a combined instance was already being discussed at NGS.

For Cambridge, PG reported that Andy Parker was no longer on the SouthGrid MB - John Hill would be taking over. This was similar to Birmingham, where Chris Hawkes would take over.

PC asked if the Tier-2 Co-ordinators could prepare slides on network connections at sites for the Wednesday morning session.

Tier-1 issues had been covered at the PMB and did not need to be discussed again. There was no other business.

ACTIONS AS AT 24.08.10
======================

01.15 DC to provide updated LondonGrid MoU. It was reported that a meeting would be held soon, and it would be agreed then. DR reported that there was a draft MoU coming soon. It was reported that another iteration was happening - a new version should go out shortly. DC reported that the meeting had happened, the LondonGrid MoU had been discussed, comments had been provided but there was no further action. Ongoing.

04.07 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing at present, however other policies had been approved by LCG. It was reported that Mingchao Ma had been looking at this in relation to EGI. JG advised that some Policies had been updated since April 2010. Ongoing.

04.11 (JC) Re future network requirements, dTeam to look into iperf as an alternative backup, and report-back. It was felt that this was re-inventing the wheel, but it was ongoing. This action remains pending meantime, depending on success in action 07.02. No further action re iPerf to be taken meantime. Ongoing.

05.01 RJ to provide an updated NorthGrid MoU (only requires to be modified in relation to EGEE/EGI). See also action 01.15.

06.01 RJ/Graeme Stewart to provide urls of the place(s) where info is located re ATLAS site tests and measurements (so that sites understand what they're being measured on). Ongoing.

07.01 SL to update the GridPP MoU document for 2010-2011 and ensure it was generic in order to future-proof it against constant amendment and re-signature.

07.02 TD/DB to make renewed efforts to engage someone at Glasgow to tackle GridMon and to have access transferred in order to ensure the instances were up-to-date and running ok - DB would insist on a meeting with Mark Leese for a handover. It was noted that this had to be done by the end of GridPP3.

The next meeting would be the first meeting (in GridPP4) of a combined Operations Team (core personnel) + PMB meeting, to take place at Sheffield, GridPP26 in April 2011. The actions above would be transferred to the PMB Action List.