GRIDPP24 COLLABORATION MEETING @ RHUL
=====================================

Discussion Session - 15th April 2010
------------------------------------

Topic: Discussion of Storage Issues
------------------------------------

Chair: Jeremy Coles
Panel: Jens Jensen, Wahid Bhimji, Chris Brew, Sam Skipsey, Chris Walker, Brian Davies, Robert Fay

The session comprised short presentations from the panel, followed by questions and discussion.

BD began by observing that storage was about meeting experiment requirements, and that the requirements of the production team differed from user storage requirements. Furthermore, experiment requirements change over time, so we had to adapt in order to try to pre-empt needs, remembering that solutions for one VO might well help another. BD noted that the different file systems in use, and the differing levels of testing, were an issue. The storage group was looking to future developments to better support users - there were efficiency limits to take account of as well as levels of expectation. Ewan MacMahon noted that some of the challenges at the Tier-2 were harder than at the Tier-1. BD advised that access patterns differed between the two.

SS began by noting that file integrity testing was actually file catalogue integrity testing - this existed at low levels at sites due to catalogue inconsistency, and this meant different solutions with the LFC. ATLAS were looking at ways they could simplify the way they manage data. Roger Jones commented that the CMS model had been discussed as a solution for ATLAS, but the problem was that moving to such a model would be very disruptive to ATLAS operations, and things were actually moving in the other direction (i.e. more centrally). CB asked whether CMS was more distributed than ATLAS. RJ replied that it was, and that pushing things out to the periphery was a better design. SS advised that bad data was unusual.

CW began by advising that originally the decision to use StoRM and Lustre had been seen as 'different' and had caused problems; however, if he had to re-install now, it was exactly what he would choose, as it gives good performance. The group file format was also improved. If the storage boxes go down, you can't access the data, and for improved resilience you either have to buy more expensive hardware or think of a software solution. So far, Lustre had been reliable. In the long term, CW noted, we had to think about the issues of moving data from a site level to a worker node level.

RF reported that user jobs submitted by Ganga could saturate the storage pool - it was difficult to tell a user they can't use rfio in that way. RF noted that Liverpool didn't want to stop user jobs, just stop them using rfio - file staging was more efficient. If there were better caching on the network, or a limited file system access approach, then caching would be more efficient overall. The headnode was a single point of failure, and was a concern. SS advised that they had carried out testing on local access, and this was limited by how good the file stager could be.

CB commented that multi-threaded jobs will change access patterns. ATLAS and CMS were looking at this; it changes the memory usage. John Gordon advised that there was a workshop at CERN in June, looking at using multi-cores.

WB began by commenting that for file access, a good summary was random behaviour due to multiple trees, and that re-ordering didn't help. On DPM, they had a good relationship with the developers and also had a lot of expertise, but they needed to prioritise features etc., as there was limited manpower available. Jeremy Coles asked if there were any certification issues. WB noted that there was instability in the SRM at Glasgow and this had been fixed by the developer in a day; however, the 6-month period of waiting for certification of patched middleware after that was too long.
WB advised that there were issues over certification generally taking too long. Stephen Burke asked why we couldn't do the certification ourselves, given that Andrew Elwell was at the top of the process. WB thought that it might be possible to do the certification for ourselves in the UK for local consumption - of course the findings could be fed back to EGI/EMI too.

CB then spoke on caching on worker nodes. CMS had an intelligent caching system that caches on local disk - this reduced data being read by a factor of 10. There was a discussion on lazy downloads. dCache had improved a lot over the last year or two and there was a good level of support from the dCache team - also, the user forum was good as sites helped one another and it meant there was a global resource. Configuration and defaults had helped make things simpler for users. They would get to a petabyte of disk at the Tier-2 this year. There was consensus that they might only be running dCache as a storage solution in the future, as users don't need to know what's behind the scenes; however, it was recognised that dCache file placement was bad.

Jens Jensen had placed some pictures only on the Grid, and had challenged the audience to download these. Only a few had (tried and) managed to do so, but JJ noted that whilst this showed complexity, it could work nonetheless. It was asked whether there were any plans for improvement. JJ advised that testing, installing and configuring were the crucial issues of publishing. There was a discussion of capacity, shared space, and publishing tools. Dave Wallom observed that there was an issue on the presentation slides and that lighter weight methods for accessing the Tier-3 might be desirable: given NGI etc., would they want to install tools in the current implementation without gLite software? Was there stand-alone software? CB advised that dCache was the stand-alone product.

JJ made a last point that storage was a complex issue and it had been good to have the opportunity of the storage workshop to discuss and share experiences - a good team was in place. JJ asked if there had been any gaps not yet addressed. Dave Wallom asked whether, in the solutions discussed, anyone was looking at the SRM itself. JJ confirmed no - the integrated solution was dCache - the SRM was related to other things, and no-one was integrating the SRM directly. DW asked if there were plans for a next version of SRM. JJ advised that version 3 was possible, but this would only be instigated if the users requested it.

Jeremy Coles commented that in the past the Oversight Committee had mentioned that we should reduce our solutions, but that instead we seemed to be adding solutions. Could we go towards one solution, or were things always likely to diverge? There was a discussion on SRM, Lustre, Oracle, and support. It was noted that different site setups (and technology) required different approaches. CB commented that you couldn't persuade him NOT to run dCache now. WB advised that we would need more effort in the UK to run more than three things now.

JC advised that there was also the issue of the users identifying problems before the site did. CB acknowledged that this was an issue. Brian Davies asked whether a central national monitoring system was possible. CB thought this wouldn't work. WB noted that it depended what the problem was - the Tier-2s simply couldn't provide 24/7 cover. Ewan MacMahon asked whether this was really an issue: if the server truly died then we do see it - Nagios raises an alarm. Duncan Rand suggested that the SAM monitoring could be automated to send an email to the site. Roger Jones thought that passive monitoring using the payload would be a good solution.

JC asked whether we could get more value through shared procurements. The consensus was no - this would lead to delays.
Dave Wallom commented that Heriot-Watt had done it, but there was no significant value and it was a lot of hassle. Pete Gronbech noted that there was shared knowledge across sites in any case. RF noted that university departments also provided both infrastructure and support.

Jeremy Coles drew the proceedings to a close, thanked the panel, and thanked all those who had participated.