GridPP21 Collaboration Meeting_Swansea
=============================

03.09.08 - Users' Discussion Session - Users on the Grid (Chair: Roger Jones)
-------------------------------------------------------------------------------

Panel: Hugo Beauchemin (ATLAS Analysis), James Catmore (ATLAS issues), Janusz Martyniak (Portals and Porting), Greig Cowan (LHCb issues), Phil Clark (ScotGrid)

RJ asked the panel to note any interests they represented. JM advised that he supported the smaller experiments at Imperial. GC was present on behalf of LHCb users and to present storage views in relation to LHCb. HB noted that he was a physicist at Oxford dealing with the user experience of ATLAS. PC advised he worked on ScotGrid at Edinburgh and was there to provide a view on general user analysis. JC was from the B-physics group at Lancaster, interested in analysis and MC data production.

JM was asked to commence the session with a brief presentation. JM reported that his role was to help users who were unrelated to the LHC or to the large experiments - it was not easy to find them! He had local users, one from Astrophysics doing 'number crunching' jobs - their question was, could they run Matlab on the Grid? Another group was Bioinformatics at Imperial - their jobs were fairly small, and their forthcoming meeting was to organise a VO certificate etc. He had also received a request from SuperNemo and MICE to prepare the LFC box for them - this was up and running now.

RJ advised that some dialogue was needed at this point. DB noted that there was a perception that new users face one biggest hurdle in adopting the Grid, and asked what that hurdle was. JM was not sure, as he hadn't really begun to support any groups, but he noted that Matlab required licences. Jens Jensen advised that NGS were running Matlab and there were restrictions on the licensed software. PC advised that in ScotGrid, only local users could use Matlab.
JM advised that there was also the issue of payment for the licence. DB asked whether he had pushed people in the direction of Ganga. JM said no; initially they were mainly happy just to get certificates.

GC spoke next by way of slides showing LHCb users on the Grid. His presentation covered user analysis, jobs, compute resources and Ganga. GC noted serious issues around a year ago: the job success rate was low, users were disillusioned, data access was a common problem, software installation was often incorrect, and there were Ganga issues. At that time at Edinburgh, users had been stopped from running jobs, data was replicated to the Grid CEs and there was someone on site dedicated to help. Today, GC advised that system instabilities were down, success rates were higher, Ganga5 had been released and users seemed happier, but was a figure of 80% acceptable? GC noted that data access was still problematic, DIRAC2 was managing user analysis and there had been an evolution of software. For the future, GC noted that DIRAC3 was coming and teething troubles were expected. Solutions to this included the Grid Operations Team, new monitoring tools, and Ganga improvements. RJ noted that the big issue was job reliability and data access - GC agreed. DB asked if all the LHCb users in the UK used Ganga. GC confirmed mainly yes, but others did use DIRAC etc.

HB was asked to speak next, and noted that he wanted to start with the basics - people found the Grid a good idea, but when they submitted to the Grid they got no data. If they use DQ2 and get samples on their own machine and run locally, it makes it difficult to find out what the problem and cure could be, therefore they tend to try something else rather than persisting with the Grid; they also didn't know who to contact when they experienced problems.
The second thing HB cited was that users were not aware of all the tools - he advised that tutorials needed to be held regularly within the organisation to enable people to pass on info. JC advised that he was trying to organise tutorials at CERN once a month - however, announcements missed large sections of the population. HB noted a lack of communication channels generally. RJ advised that such information could go to the Collaboration Board Institute heads. JC noted that there was a common point at CERN. Duncan Rand asked if it would be worth having a similar thing in the UK. RJ agreed yes, but there had been slow takeup from Institutes generally. HB suggested that physics departments in Institutions could be more proactive for local personnel. RJ agreed that local contacts were very important.

HB advised that the ATLAS wiki has info but it was too generic and not specific enough for particular applications etc, therefore trying to get started often doesn't work. HB noted that not so many people were using the Grid with pAthena/Ganga, and more work was being done locally. He noted one final thing: if people find something that works then they tend to stick with it, often this is PNL with Panda rather than Ganga, which isn't a proper use of resources. There ensued a general discussion of pAthena and Ganga comparisons.

HB noted that a lot of people get scared of the Grid because certificates are too difficult to get, and expiry was also difficult to handle. DB noted that most of what HB was saying mainly rested within the experiments' province, not that of GridPP - the last incident with certificates was known to be difficult for everyone. RJ advised that the issue also concerned certificates plus VO registration. DB suggested that a single sign-on would be preferred.

JC spoke next, noting that personally he felt they could use the system now and that they will be able to go to Ganga and submit jobs.
At a tutorial two weeks ago at CERN, everyone who tried to submit jobs to Ganga got something back - this was the first time that this had happened. JC noted however that there was still ignorance among the experiments and some have difficulty in thinking about workflow - distributed systems were a bit more complex. JC advised that there were ongoing incompatibilities between the Python version used in releases of Ganga and Athena; there were also CMT issues, this had been painful, but was being solved now. There was the essential problem of offline software which didn't consider the implications of running on the Grid - JC noted that developers of analysis software needed to do more in considering the complexities; error messages were helped by blacklists. The ATLAS Tag (thumbnails) navigator tool had been tested but was problematic, the web interface worked well. There was a discussion of Ganga TNT and lack of effort to keep it going. RJ advised that this was an ATLAS issue, and effort would be moved.

JC advised that the distributed computing system users tend to use the lxplus at CERN; it would be good if they could run from a local machine more easily. It was commented that an EGEE installation kit for Grid UI would be good. RJ advised that at the ATLAS Jamboree distributed analysis workshop a lot of people wanted a slimmed-down UI. There ensued a discussion of UK UI and resources, plus administration of certificates. DB noted that there should be an action on the PMB to compile a list of UIs at Institutes to be on a webpage which would enable login. PC advised that tutorials at CERN told users to use lxplus. RJ noted that software should be on the site where it runs. GC suggested that we should mount the FS at home Institutes which would give everything from lxplus. Mark commented that many problems are associated with the fact that the UI at Institutes has not been installed correctly, and there were bugs associated with badly-configured UIs.
JC noted that with group submission analysis certain users submitted jobs on behalf of others, and there were two issues here: 1) flexibility: there needs to be a quick turnaround, and 2) you need to be able to follow the provenance of datasets. Could the tools AMI and Ganga cope with that?

PC was asked to speak next, to talk about analysis. PC noted two issues - the first was online reconstruction software - to get it running on the Grid would be an achievement, and this would be followed by a ramp-up of user analysis. The second issue was that when users begin there will be lots of reprocessing and Monte Carlos, plus user ineptitude. In LHCb user jobs were given priority over central jobs, and that won't work when demand is higher. Those running analysis have their own composite candidates, and people use the same code, which was inefficient - it was better to store composite and data lists and only produce them once. Users want analysis for 1-3 years, therefore there is no way to remove that data.

RJ asked for a response from the floor. Brian asked whether the panel could think of short-term and long-term specific/general problems with storage. GC advised that this issue would be dealt with tomorrow. Brian asked about it from a users' point of view. GC advised that things just seem to break for no reason - you run a job on Tuesday then can't access the file on the DPM on Wednesday - this is always the main user concern: accessing data. Sometimes there are problems with middleware also. GS noted that often they receive an error report from a user who can't access the file, they check and all seems ok - the user tries again and it works ok, therefore there is a consistent low level of errors. RJ asked about the behaviour of the SRM. JG advised that the BDII had load-bearing problems. GC noted that we had not seen a high level of user analysis but they were encountering problems transferring files via the FTS and rfio.
The problem was not accessing one file with posix-IO, the problem was accessing 1000 files. RJ noted that we can't yet impose quotas on storage, or monitor file transfers. The point was made that there was nothing to stop users setting up their own FTS server. RJ advised that sysadmins on sites have to deal with this - it was local sysadmins who were implementing experiment policy for the space.

JC noted that if a user runs a job on the grid, the user wants to know: 1) how long the file will be there before deletion, and 2) will the user be able to access data when s/he wants to perform further processing? The third thing was, if they make a mistake, how do they remove it from the disks? The point was made that this was an experiment responsibility, not a site responsibility. RJ advised that they didn't have that level of control - it would have to be a DQ2 removal in the ATLAS case, but in general the files need to be removed from experiment catalogues and not just the filesystem. GC advised that space tokens allowed a crude form of group level quota. DC noted that in CMS the model relied on a certain amount of human interaction. The point was made that we need an international federation of site admins - which would give us more clout. PC noted that a focus of these issues could be in tutorials - teaching general good practice. JC advised that in some cases it wasn't easy to determine what 'good' practice was until after you'd started.

RJ thanked everyone for their contributions. The meeting closed at 5:30 pm.