GridPP 23rd Collaboration Meeting Cambridge
=========================
User Discussion Session Tuesday 08.09.09
=========================

Chair: Steve Lloyd
Panel: Graeme Stewart, Raja Nandakumar, Dave Colling, Lee Barnby, Stephen Burke, Catalin Condurache

SL commenced the discussion by asking: what about the users? The main focus of user support was with the experiments. In order to get things working, support has to come from them - could each experiment representative provide a brief statement?

GS spoke on behalf of ATLAS
-------------------------------------------

User support meant supporting analysis, which was organised via the Distributed Analysis Helpline (essentially a mailing list). There were weekly shifts to cover this, per person, and the 'gmail issue tracker' worked well. They do triage, dealing with all of the usual problems, eg: ganga bug reporting and tracking, or problems encountered at sites. GS observed that it was difficult for users to tell why their job had failed. GS reported that things were working, but more support was required. All queries do get answered by someone; other people involved in production also get involved. The model had worked well over a 6-month period, although they needed to improve their communication with users and provide them with explanations, and also Grid status, as appropriate. In short, the model needed to be scaled up.

SL asked about documentation. GS advised that there was a unified ganga tutorial on the web, and documentation generally was now more unified, but had not yet been tested at scale. More effort was probably required, but the basic model was sound.

SL asked about manpower. GS reported that shift credits were available.
There was a comment from the floor: when it had started as a service this time last year, there had been four shift workers on each of US and EU time (including developers); there were now six on each side, and there was also user-to-user interaction.

SL asked if there were any questions. TD asked whether user-to-user interaction was potentially problematic. GS advised that bad practice could propagate. RJ noted that a standard introduction was available; there was more than one way to solve problems, but standard tutorials did take place showing standardised responses, which was good. GS also noted that documentation was, in addition, available on the web.

LB spoke on behalf of ALICE
-----------------------------------------

They currently had six students - four would be at CERN from October, and could therefore speak directly to people there. LB reported that ALICE ran tutorials every month - it was a 2-day programme, but it still couldn't cover all potential problems. The ALICE project had an analysis taskforce to deal with user questions/problems, although some were directed to Savannah. The Grid stuff was generally picked up by one or two people doing production there - some issues were tracked, but some were dealt with in a more ad-hoc manner. LB noted that user-to-user information was shared, but it was not always good practice that was passed on.

TD asked whether an analysis train was used. LB replied no, not really; analysis tasks can be run scif - there is a macro to test, then you can ask for it to be added. RJ advised that they needed to optimise access to events on tape - ATLAS had talked about this in relation to group-based analysis.

DC spoke on behalf of CMS
----------------------------------------

In CMS there was a direct two-way connection between physics analysis and object groups - which was effectively a one-to-one mapping. There were dedicated mailing lists for groups (like SUSY London).
For other users, each user had a home Tier-2, which meant a dedicated space for user files at the storage element in the Tier-2. There was communication between the people who run the site and the user. In terms of solving problems, DC noted that there was no triage, but the hypernews mailing list was very active, and all questions raised were answered.

Regarding documentation, they had a lot of this, but much of it was not good and was out of date. DC observed that rapidly-moving events made things obsolete fairly quickly, and it was difficult to keep the documentation up to date. Regarding tutorials, they had a lot of these, which take people to a basic stage, but due to ever-expanding files and changing methods, the tutorials can't cover everything. DC reported that there was a new analysis ops group, which was dealing with changing releases, but with variable success so far.

SL asked if the releases scaled. DC noted that it was working better than it had previously. SL asked about the fraction of users using the Grid. RJ advised that it was a complicated environment. DC noted that software makes it easier but also harder! SL noted that version changes make a difference - it would get worse in the lead-up to data.

Jens Jensen asked whether they meant CMS software. DC confirmed that there were multiple layers of problems; cmsw was complex and there were lots of version changes. Stuart Wakefield noted that official stances were ambivalent - they basically didn't want things to break. They wanted the latest bug fixes, but this meant that there was a need to upgrade versions. DC agreed - it could take 2.5 weeks of work to upgrade something, and due to the rate of change, the documentation was also out of date. SW noted that it took a lot of effort to keep up with software, and it was better to wait until the software worked.

RN spoke on behalf of LHCb
----------------------------------------

RN reported that, as GS had said for ATLAS, LHCb also had a mailing list which all users were encouraged to join if they wanted to do analysis on the Grid. The first two days of LHCb software week were spent entirely on tutorials; there was also an annual software tutorial each January. This latter tutorial was more basic, and provided fresh users with an overview of the software; it also spent time looking at ganga. The distributed analysis mailing list (DAML) had new, expert developers on it, so it was useful - users were also enthusiastic about answering queries. As more users come in, they help others to debug. The ganga developers who were on the mailing list helped to keep the ganga documentation up to date. To a large extent, RN noted, problems relating to running on the Grid were getting smaller.

RN noted that when the LHC starts, there would likely be an influx of less competent users - the worry was not in relation to support but, rather, that once jobs went to the Tier-1 they would target only a few files, and this would cause a backlog. The raw data wouldn't be accessible to users, only stripped data, so RN hoped there would be load-balancing. The problem was the potential for 1000 jobs trying to access 2 files at the Tier-1.

In relation to current user support, RN reported that the DAML was holding up well. All user analysis went through ganga. Over the last 6 months they had had c. 250 users for ganga - which was the bulk of the user base in LHCb.

TD asked whether, given the distribution of analysis datasets, any experiment was looking at peer-to-peer networks. RJ noted that this had been mentioned at CHEP by ALICE, in relation to moving software versions around - but RAL were unlikely to allow this due to security. RN reported that the latest software versions were distributed via the SAM tests; DIRAC automatically installs the latest version on the worker node.
Peer-to-peer networking was not happening anytime soon for distributing data. AS advised that no formal statement or final word on this had been agreed. There was a comment from the floor that peer-to-peer was not a formal approach. RJ noted two approaches. TD advised that for derived physics examples it would be a benefit - it was a natural off-the-shelf technique.

SL summarised that so far, all experiments had tutorials, documentation, and mailing lists set up - could those in GridPP access these? What was the preferred route? RN advised that there was a statement on the web to contact LHCb for distributed analysis. SL noted that there should be an obvious route into all of this from the GridPP website. RN suggested that the GridPP website point to the specific wiki pages of the experiments.

SL asked what else GridPP could do. RN noted that they supported ganga. RJ advised that the GridSite map did help people with how to join a VO; however, experiment-specific bugs and training pages were missing from there. GS suggested that this could be done and no-one would look at it for years. Neasan O'Neill advised that the GridPP website pointed elsewhere - there was no documentation on the GridPP site.

ACTION
It was agreed that the GridPP website needed to point to something useful, ie: to experiment links.
1. GS, RN, DC, LB (experiment reps) each to contact Neasan O'Neill, advising where, specifically, the GridPP website should point to for each of their experiments, in terms of user support information.
2. NO to update the GridPP website accordingly.

CC spoke on behalf of the Tier-1
----------------------------------------------

CC noted that he did not know for definite what the position would be when the LHC started up, but they had 2 x FTE, half for CMS and half for ATLAS, plus 2 x half-time for general - these were new experiment support posts, both new for September. CC advised that another colleague was dealing with Grid Services, therefore they were sharing tasks.
CC wasn't sure about support beyond November - no major upgrades were expected, and the R89 move had delayed things. CC gave a brief slide presentation in relation to user support information - a VO support survey had been carried out in 2008-9, showing areas where the Tier-1 had performed well, and also areas of concern, which included service delivery and support of non-LHC VOs. They had recently begun the Tier-1 dashboard, which provided updated information from sub-teams within the Tier-1.

AS noted that one post was EGEE-funded, the other two were Grid-funded - and not all staff had started yet. DC noted his concern about the half posts allocated to 'general'. AS confirmed that 2 x FTE posts were entirely dedicated to experiment support. The other 1 x FTE was EGEE-funded.

SB spoke on behalf of non-LHC users
-----------------------------------

SB reported that we don't do much as things stand - there is fairly substantial support for LHC VOs, but it wasn't possible to provide comparable support to people who were new to the Grid. The GridPP Users Mailing List did not get traffic and we don't currently have any formal mechanism to provide support. In terms of manpower, a fraction of SB and Janusz Martyniak were allocated, but not a lot of support requests were normally received. This could be because users did not know exactly who to contact.

Generally, there was a Global Grid User Support (GGUS) mechanism, but SB noted that few GGUS tickets were 'real' user support - tickets were usually received from 'experts'. The mechanism was there, eg: the EGEE User Support Group, but only 5 tickets had been sent in since it was set up. For ATLAS, SB advised that there was ATLAS VO Support - a mechanism should therefore be available for users there - but for MICE probably not. SB noted that EGEE was coming to an end anyway, and there would be a transition to EGI - it was not yet known what their user support would be.
In EGI the GGUS model would likely continue, but tickets would go to the UK helpdesk; possibly NGS would deal with this. The model is that 'handholding' should be local; however, if NGS could do this task it might be an improvement.

SL asked whether any sites had any comment about user support. Alessandra Forti noted that they provide this internally; there were not many users from 'outside'. It was understood that for Tier-2 admins, it was difficult to direct people to the right place - for contact purposes, GridPP provided help via the web, and also through Neasan O'Neill. It was commented that users fell into two categories: the first were people who were 'embedded' and therefore did not require help, the second were 'quiet sufferers'. In 2-3 months' time, when resources go down, we would likely get a flood of enquiries. Simon George noted that which VO to join was one issue for new users.

SL asked about regional VOs. The question was raised of what was at stake if UK users couldn't get their work done, and whether a better set-up was possible. SL advised probably not - support was available to all. Jens Jensen advised that he had helped new users in small VOs - it could be a scary process for them, and they were more likely to use the LCG setup; they needed knowledge about what happens within GridPP to get anywhere, eg: they would need training on assigning tickets. SB noted that if the merge happens, the distinction should disappear.

SL thanked the floor and the panel for their contributions.