Troubleshooting
Known problems
In some cases there may be general problems, either at a particular site or in the Grid as a whole. One place to look for site problems is the gstat monitor. This provides a large amount of detailed information about the status of sites. It is mainly aimed at system administrators, but it may also help users: you can assume that any problems visible there will be followed up by the Grid operations staff. The Grid Map has a simple colour code for each site. The RAL GOC web site has links to a lot of status information, although again this is largely aimed at administrators and operations staff. Finally, the GridPP monitoring pages give a lot of detailed information about the state of UK sites. There may also be useful information in the operations portal, although this is still under development and the information is fairly limited.
As always, it's worth trying to check that you haven't made a simple mistake, although sometimes it can be hard to tell. It's worth having a few simple "hello world" type jobs which you know normally work, to check that the system is functioning at a basic level.
Error messages
There are many different middleware components, which generate error messages in different formats and with varying levels of usefulness. Error messages which tell you exactly what the problem is are rare; usually the best you can hope for is an indication of what kind of error it was and where it occurred (meaning both the middleware component involved and the physical location). At worst, error messages can be misleading: in some cases the software author assumes that there can only be one cause for a failure when in fact there are many things which can go wrong.
The GOC wiki has a list of problems collected from GGUS tickets and other sources, largely indexed by a characteristic error message, with possible causes and fixes. However, this is mainly aimed at support staff so it may be hard to understand.
It may be useful to put an error message into a search engine; in some cases this will produce a helpful answer, but be aware that the results you get may not be relevant to your situation.
Locating a problem
One of the basic methods to locate any error or bug is "divide and conquer", changing things in a systematic way until you find the part which is failing. If something (a job or a file transfer) related to a particular site fails, try a few other sites to see if the problem is site-specific or general. You may also be able to vary the use of central services, e.g. resource brokers or myproxy servers, although some things (e.g. R-GMA) don't allow that.
If your job is complex, try something simpler. If data management or R-GMA publishing inside a job fails, try it directly from a UI. Publishing to R-GMA can also be directly useful to collect diagnostics as the job runs. In general, even if you still end up submitting a problem ticket, it will help if you can give the simplest example which exhibits the problem.
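The site-by-site comparison described above can be sketched as a small helper. This is only an illustration: the function name, the site names and the idea of recording each attempt as a simple pass/fail flag are assumptions made for the example, not part of any Grid tool.

```python
from typing import Dict


def classify_failure(results: Dict[str, bool]) -> str:
    """Summarise the outcome of running the same simple test at several sites.

    `results` maps a site name (hypothetical labels) to True if the test
    succeeded there and False if it failed.
    """
    if not results:
        raise ValueError("no sites were tried")
    failed = sorted(site for site, ok in results.items() if not ok)
    if not failed:
        return "no failure reproduced"
    if len(failed) == len(results):
        # Every site fails: suspect a central service or your own setup.
        return "general: fails at every site tried"
    # Only some sites fail: the problem is probably local to them.
    return "site-specific: fails only at " + ", ".join(failed)
```

A summary of this kind is also a convenient thing to include in a ticket, since it shows at a glance which sites were tried.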
LCG documents a set of basic system tests. These are aimed at site administrators, but most of the tests can also be performed by users.
VO-specific software
Some failures may relate to software which comes from your VO, rather than from the Grid. It may not always be possible to identify this, and GGUS should be able to cope with tickets which relate to VO software errors, but it will help if you can identify such cases. In general you should refer to documentation from your VO to deal with such problems.
Security problems
Some problems relate to the security infrastructure (certificates, proxies etc). These can be complicated to diagnose and often can't be fixed by a user anyway. However, there are a few basic things you can check. Look at your proxy (voms-proxy-info --all) and see if it looks OK, and in particular that it hasn't expired. Note that the expiry time for the VO credentials is separate from that of the proxy as a whole; in general they must both be valid.
Also try interacting with several remote sites or services to see if the problem is at your end or theirs. Similarly, you may be able to try the same thing from a different UI. You could also ask a colleague to try, to see if the problem is specific to you.
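The proxy check described above, where both the proxy and the VO credentials must still be valid, could be automated roughly as follows. This is a sketch under assumptions: it assumes that voms-proxy-info accepts the options --timeleft and --actimeleft and prints the remaining lifetime as an integer number of seconds; check the help output of your installed version before relying on it. The function names are invented for the example.

```python
import subprocess


def proxy_status(proxy_left: int, vo_left: int, needed: int = 0) -> str:
    """Report whether both lifetimes (in seconds) exceed `needed` seconds."""
    problems = []
    if proxy_left <= needed:
        problems.append("proxy")
    if vo_left <= needed:
        problems.append("VO credentials")
    if not problems:
        return "ok"
    return "expired or too short: " + " and ".join(problems)


def seconds_left(option: str) -> int:
    """Run voms-proxy-info with one option and parse the printed seconds.

    Assumption: `option` is "--timeleft" (whole proxy) or "--actimeleft"
    (VO attributes) and the command prints a single integer.
    """
    result = subprocess.run(
        ["voms-proxy-info", option],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())
```

On a UI with a proxy in place, something like proxy_status(seconds_left("--timeleft"), seconds_left("--actimeleft"), needed=3600) would check that both are good for at least another hour.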
When you submit a job you send (delegate) a copy of your proxy to the WMS. In turn the WMS delegates the proxy to the job, and can arrange to renew the proxy from a MyProxy server before it expires. If a job fails with an error like "Job proxy is expired", it implies that the copy of the proxy held by the job has expired without being renewed, while an error like "X509 proxy not found or I/O error" implies that the proxy has not been correctly transferred to the job. Note that you can also see an error like "request expired", which does not relate to the proxy: it means that the WMS has been unable to find anywhere to submit the job and has given up.
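The three messages above can be kept in a small lookup table so that a failure reason can be matched quickly. This is a minimal sketch: the table holds only the messages mentioned in this section, the matching is a plain case-insensitive substring test, and the names HINTS and explain are made up for the example.

```python
from typing import Optional

# (characteristic text, likely meaning), taken from the paragraph above.
HINTS = [
    ("Job proxy is expired",
     "the job's copy of the proxy expired without being renewed"),
    ("X509 proxy not found or I/O error",
     "the proxy was not correctly transferred to the job"),
    ("request expired",
     "not a proxy problem: the WMS found nowhere to submit the job and gave up"),
]


def explain(message: str) -> Optional[str]:
    """Return a hint for a known error message, or None if it is not listed."""
    lowered = message.lower()
    for pattern, hint in HINTS:
        if pattern.lower() in lowered:
            return hint
    return None
```

A message which is not in the table returns None, which is the cue to look further afield, for example in the GOC wiki list.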
Job submission
The job submission system has several stages. You send a job to a resource broker, it sends it to the gatekeeper at a site, the gatekeeper puts it into a local batch system, and the batch system sends it to a worker node and runs it. (For some of the gory details see the job submission chain diagram.) When the job finishes, that unwinds: the batch system notices that the job has ended and cleans up, the output sandbox files are copied back to the broker, and eventually you retrieve them. The broker may have to renew your proxy along the way. Problems can occur at any point in this process. Errors with multiple job types, e.g. using "dags", can be even more complex. In some circumstances the broker will notice a problem and resubmit the job elsewhere, but that can sometimes happen even if the job actually ran at the first site.
The basic tool here is the edg-job-get-logging-info command; usually you will want the "-v 2" option to get verbose output. This records the progress of a job as a series of "events", each with a timestamp, so you should be able to see how far the job got and where it failed. Even if you can't interpret the output, it will be useful if you can include it in a ticket. One thing to bear in mind is that a job sent to a busy site may be queued, so the fact that it hasn't run is not necessarily an error. You can see the numbers of running and waiting jobs, and various other things, on the GridIce Monitor.
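Since the logging information is worth attaching to a ticket even when you can't interpret it, it can be handy to save it to a file. The sketch below wraps edg-job-get-logging-info with the "-v 2" option mentioned above; it assumes the job identifier is passed as the final argument, and the helper names and the output path are chosen purely for illustration.

```python
import subprocess
from typing import List


def logging_info_command(job_id: str, verbosity: int = 2) -> List[str]:
    """Build the command line for verbose logging info on one job."""
    return ["edg-job-get-logging-info", "-v", str(verbosity), job_id]


def save_logging_info(job_id: str, path: str) -> int:
    """Run the command and write its output to `path` for use in a ticket.

    Returns the exit code of the command; output is saved either way,
    because a failing query can itself be a useful diagnostic.
    """
    result = subprocess.run(
        logging_info_command(job_id), capture_output=True, text=True
    )
    with open(path, "w") as handle:
        handle.write(result.stdout)
        handle.write(result.stderr)
    return result.returncode
```

The job identifier is the one returned when the job was submitted.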
Be aware that the JDL file parsing can be quite picky, and mistakes (e.g. typos) don't always produce obvious error messages.
Data management
The data management system also involves many components: file and metadata catalogues, SEs, I/O protocols, and the File Transfer Service. There may also be an interaction with VO-specific catalogues and/or services. Most of these can be used, and their state queried, from the command line on a UI, and this is usually the best way to locate problems. Most tools can provide debug output of some kind. Again the key thing is to work out which component is failing; in many cases there are tools which operate on only one component, e.g. to talk to an SE directly.
Also be aware that there is a possibility of ending up in an inconsistent state. For example, a file may be on an SE but not in a catalogue or vice versa, or a file may be truncated, although work is underway to make the system more robust to this kind of thing.
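The kinds of inconsistency mentioned above can be spotted by comparing two listings, one from the catalogue and one from the SE. The sketch below assumes you have already obtained both as mappings from file name to size in bytes; how you produce those listings depends on the tools your VO uses, and the function and key names here are invented for the example.

```python
from typing import Dict, List


def find_inconsistencies(
    catalogue: Dict[str, int], storage: Dict[str, int]
) -> Dict[str, List[str]]:
    """Compare catalogue entries with SE contents.

    Both arguments map a file name to its size in bytes.
    """
    in_both = set(catalogue) & set(storage)
    return {
        # Registered in the catalogue but missing from the SE.
        "catalogue_only": sorted(set(catalogue) - set(storage)),
        # Present on the SE but not registered in the catalogue.
        "se_only": sorted(set(storage) - set(catalogue)),
        # Present in both with different sizes: possibly truncated.
        "size_mismatch": sorted(
            name for name in in_both if catalogue[name] != storage[name]
        ),
    }
```

A size mismatch is only a hint of truncation, not proof, but it narrows down which files to examine directly.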
Information system
Both job submission and data management tools query the information system to find information about Grid resources. For example, when you specify Requirements and a Rank expression in the JDL file, the broker matches these against values for each site in the information system to choose which site will run the job. Similar queries are made internally to the various tools, e.g. to find a suitable SE to store a file.
Information system problems are difficult to diagnose unless you are an expert, and in any case there is often little you can do as the problem will typically be at a remote site. However, you should check those things which you specify yourself: things like Rank and Requirements expressions in the JDL, and SE and site names in data management. Both the names of items in the schema and their values, e.g. SE names, need to be correct; typos are a common source of errors. The lcg-infosites command provides a packaged query to the information system to list some common information.
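Because typos in schema names are such a common source of errors, a quick check of the attribute names in a Requirements or Rank expression can save time. The sketch below is illustrative only: KNOWN_ATTRIBUTES is a small hand-picked subset of Glue schema names rather than the full schema, it assumes attributes are written with the "other." prefix, and the function name is made up for the example.

```python
import difflib
import re
from typing import Dict, Optional

# Illustrative subset only; the real schema has many more attributes.
KNOWN_ATTRIBUTES = [
    "GlueCEStateStatus",
    "GlueCEStateFreeCPUs",
    "GlueCEStateEstimatedResponseTime",
    "GlueCEPolicyMaxCPUTime",
    "GlueCEPolicyMaxWallClockTime",
    "GlueCEUniqueID",
]


def check_attribute_names(expression: str) -> Dict[str, Optional[str]]:
    """Map each unrecognised attribute to its closest known name, or None."""
    suggestions: Dict[str, Optional[str]] = {}
    for name in re.findall(r"other\.(\w+)", expression):
        if name not in KNOWN_ATTRIBUTES:
            close = difflib.get_close_matches(name, KNOWN_ATTRIBUTES, n=1)
            suggestions[name] = close[0] if close else None
    return suggestions
```

An empty result means only that every name was found in the list used; it says nothing about whether the values being compared are sensible.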
Last modified Fri 6 May 2011