Ticket #1456 (closed defect: fixed)

Opened 9 years ago

Last modified 9 years ago

MSS web service fails on many files

Reported by: dstn
Owned by: RayPlante
Priority: normal
Milestone:
Component: infrastructure
Keywords:
Cc: mfreemon, dstn@…
Blocked By:
Blocking:
Project: LSST
Version Number:
How to repeat: not applicable

Description

For example, http://lsst1.ncsa.uiuc.edu/lsstdata/dc3product/obs/CFHTLS/D2/raw/v731048-fu/s00/c00-a0.fits produces a web page showing a Python stack trace, ending in:

 /home/rplante/devlp/sciarchtools-trunk/python/ncsa_sciarch/cachemgr/cache/simple.py in score(self=<ncsa_sciarch.cachemgr.cache.simple.TimeSizeScorer object at 0xb7de262c>, item=<ncsa_sciarch.cachemgr.cache.items.CacheItem object at 0xb7de2ecc>)
  267         age = (self._now - item[self.ATIME]) / self._tosecs
  268         p = (item[self.PRIORITY] > 0 and item[self.PRIORITY]) or 1
  269         return age * (1.0 + age/self._t0 + item[self.SIZE]/self._s0) / p
  270 
  271 
age = 13.84506967265021, self = <ncsa_sciarch.cachemgr.cache.simple.TimeSizeScorer object at 0xb7de262c>, self._t0 = '30', item = <ncsa_sciarch.cachemgr.cache.items.CacheItem object at 0xb7de2ecc>, self.SIZE = 1, self._s0 = '323232323232323232323232323232323232323232323232...2323232323232323232323232323232323232323232323232', p = 1

<type 'exceptions.TypeError'>: unsupported operand type(s) for /: 'float' and 'str'
      args = ("unsupported operand type(s) for /: 'float' and 'str'",)
      message = "unsupported operand type(s) for /: 'float' and 'str'" 
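
Note on the trace above: the dump shows `self._t0 = '30'` and `self._s0` as a long run of `32`s, which is consistent with the thresholds being held as strings (and with a string having been multiplied by an integer somewhere upstream). The following is only a minimal sketch of that failure mode and one possible guard. It assumes, without confirmation from this ticket, that the values arrive as strings from configuration; `to_number` and `score` are hypothetical stand-ins and not the actual `TimeSizeScorer` code.

```python
def to_number(value):
    """Coerce a configuration value (possibly a string such as '30') to float."""
    return float(value)


def score(age, size, t0, s0, priority=1):
    """Same shape as the expression in the traceback, with numeric coercion.

    Hypothetical stand-in: t0 and s0 are converted before any arithmetic.
    """
    t0 = to_number(t0)
    s0 = to_number(s0)
    p = priority if priority > 0 else 1
    return age * (1.0 + age / t0 + size / s0) / p


# A string times an int repeats the string instead of multiplying it.
repeated = "32" * 4

# Dividing a float by a string raises the TypeError reported in the ticket.
try:
    13.8 / "30"
except TypeError as exc:
    error_text = str(exc)

# With coercion in place, string thresholds no longer break the score.
value = score(age=10.0, size=5.0, t0="10", s0="5")
```

Converting once, where the configuration is read, would keep the scoring expression itself unchanged.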

Attachments

err.html (16.0 KB) - added by dstn 9 years ago.
err2.html (24.8 KB) - added by dstn 9 years ago.

Change History

comment:1 Changed 9 years ago by DefaultCC Plugin

  • Cc mfreemon added

comment:2 Changed 9 years ago by RayPlante

  • Status changed from new to assigned

comment:3 Changed 9 years ago by RayPlante

  • Status changed from assigned to closed
  • Resolution set to fixed

Fixed. Example is now accessible.

comment:4 Changed 9 years ago by dstn

  • Status changed from closed to assigned
  • Resolution fixed deleted

I'm getting a new error occasionally; see attached.

FWIW, I'm running two wget jobs concurrently, so I'm hitting the web service with near-concurrent requests.

comment:5 Changed 9 years ago by dstn

  • Cc dstn@… added

comment:6 Changed 9 years ago by dstn

Yet another problem: web service claims no such file, but it exists on MSS:

http://lsst1.ncsa.uiuc.edu/lsstdata/dc3product/obs/CFHTLS/D1/raw/v723742-fu/s00/c00-a1.fits

Contents attached (err2.html); error message is:

OSError: [Errno 2] No such file or directory: '/data/cache/1/datacache/cache/cache/obs:CFHTLS:D3:raw:v740098-fr:s00.tar'

On MSS, I see:

mss ac/lsstread> ls -l /UROOT/projects/eiw/lsst/repos/obs/CFHTLS/D3/raw/v740098-fr/s00.tar
-rw-r----- 1 lsst eiw common  AR 1414031360 Jul 12 12:26 /UROOT/projects/eiw/lsst/repos/obs/CFHTLS/D3/raw/v740098-fr/s00.tar

comment:7 Changed 9 years ago by dstn

PS, I have now retrieved 64,000 images successfully, but got 45,000 failures.

comment:8 Changed 9 years ago by dstn

Based on the error message file sizes, I'm seeing a few different errors:

2655 times:

Traceback (most recent call last):
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 128, in <module>
    main()
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 56, in main
    deliver(id, cfg)
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 66, in deliver
    if not id or (not res.available(id) and not res.exists(id)):
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 399, in exists
    return ditem.exists()
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 84, in exists
    self._ensureStat()
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 170, in _ensureStat
    self._statinfo = self._getStat()
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 167, in _getStat
    return self._finditem()
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 157, in _finditem
    data = self._store.stat(paths)
  File "/appl/DATAarch/python/ncsa_sciarch/deepstore/sshfs.py", line 341, in stat
    out, err, ex = self._ssh(cmd)
  File "/appl/DATAarch/python/ncsa_sciarch/deepstore/sshfs.py", line 159, in _ssh
    return self._exec(launch, timeout)
  File "/appl/DATAarch/python/ncsa_sciarch/deepstore/sshfs.py", line 260, in _exec
    (256-ex, "\n".join(err)))
ConnectionError: ssh error (1): ssh: connect to host mss.ncsa.uiuc.edu port 22: Connection refused

2 times:

Traceback (most recent call last):
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 128, in <module>
    main()
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 56, in main
    deliver(id, cfg)
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 75, in deliver
    restore(res, id, int(cfg.timeout))
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 106, in restore
    res.restore(id)
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 455, in restore
    self.restore(tid)
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 415, in restore
    if not ditem.exists():
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 84, in exists
    self._ensureStat()
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 170, in _ensureStat
    self._statinfo = self._getStat()
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 167, in _getStat
    return self._finditem()
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 157, in _finditem
    data = self._store.stat(paths)
  File "/appl/DATAarch/python/ncsa_sciarch/deepstore/sshfs.py", line 341, in stat
    out, err, ex = self._ssh(cmd)
  File "/appl/DATAarch/python/ncsa_sciarch/deepstore/sshfs.py", line 159, in _ssh
    return self._exec(launch, timeout)
  File "/appl/DATAarch/python/ncsa_sciarch/deepstore/sshfs.py", line 252, in _exec
    raise ConnectionError("connection timed out (waiting for pw?)")
ConnectionError: connection timed out (waiting for pw?)

42,146 times:

Traceback (most recent call last):
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 128, in <module>
    main()
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 56, in main
    deliver(id, cfg)
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 75, in deliver
    restore(res, id, int(cfg.timeout))
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 106, in restore
    res.restore(id)
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 455, in restore
    self.restore(tid)
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 474, in restore
    reservation = self._opencache(sz)
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 394, in _opencache
    return self._cachemgr.reserve(amount)
  File "/home/rplante/devlp/sciarchtools-trunk/python/ncsa_sciarch/cachemgr/CacheMgr.py", line 83, in reserve
    plan = cache.createRemovalPlan(need)
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/simple.py", line 123, in createRemovalPlan
    return plnr.makePlan(amount)
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/simple.py", line 82, in makePlan
    items = self._listCache()
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/simple.py", line 74, in _listCache
    ditems.addPath(file, 0)
  File "/home/rplante/devlp/sciarchtools-trunk/python/ncsa_sciarch/cachemgr/cache/items.py", line 154, in addPath
    self._additem(CacheItem.makeFor(path, self._cachedir, priority), score)
  File "/home/rplante/devlp/sciarchtools-trunk/python/ncsa_sciarch/cachemgr/cache/items.py", line 43, in makeFor
    fs = os.stat(fullpath)
OSError: [Errno 2] No such file or directory: '/data/cache/1/datacache/cache/cache/obs:CFHTLS:D3:raw:v740098-fr:s00.tar'

465 times:

Traceback (most recent call last):
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 128, in <module>
    main()
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 56, in main
    deliver(id, cfg)
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 75, in deliver
    restore(res, id, int(cfg.timeout))
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 106, in restore
    res.restore(id)
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 465, in restore
    reservation = self._opencache(memsz)
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 394, in _opencache
    return self._cachemgr.reserve(amount)
  File "/home/rplante/devlp/sciarchtools-trunk/python/ncsa_sciarch/cachemgr/CacheMgr.py", line 83, in reserve
    plan = cache.createRemovalPlan(need)
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/simple.py", line 123, in createRemovalPlan
    return plnr.makePlan(amount)
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/simple.py", line 82, in makePlan
    items = self._listCache()
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/simple.py", line 74, in _listCache
    ditems.addPath(file, 0)
  File "/home/rplante/devlp/sciarchtools-trunk/python/ncsa_sciarch/cachemgr/cache/items.py", line 154, in addPath
    self._additem(CacheItem.makeFor(path, self._cachedir, priority), score)
  File "/home/rplante/devlp/sciarchtools-trunk/python/ncsa_sciarch/cachemgr/cache/items.py", line 43, in makeFor
    fs = os.stat(fullpath)
OSError: [Errno 2] No such file or directory: '/data/cache/1/datacache/cache/cache/obs:CFHTLS:D3:raw:v740098-fr:s00.tar'

34 times:

Traceback (most recent call last):
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 128, in <module>
    main()
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 56, in main
    deliver(id, cfg)
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 75, in deliver
    restore(res, id, int(cfg.timeout))
  File "/var/www/cgi-bin/DATAarch/dc3GetProduct.py", line 106, in restore
    res.restore(id)
  File "/appl/DATAarch/python/lsst/daf/web/dc3restore.py", line 469, in restore
    reservation.done()
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/gen.py", line 259, in done
    self.cancel()
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/gen.py", line 210, in cancel
    self._cache._cancel(self._id)
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/gen.py", line 473, in _cancel
    lock = self.lock()
  File "/appl/DATAarch/python/ncsa_sciarch/cachemgr/cache/gen.py", line 412, in lock
    self._home)
ApplicationLocked: Cache is already locked (/data/cache/5/datacache)

And 64 other miscellaneous...

cheers, dstn
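
Note on the tally above: the counts in this comment were grouped by the size of the error files. As a hedged alternative, the sketch below groups saved error outputs by exception class, taken from the last non-blank line of each traceback. It assumes each saved output is plain traceback text (the attached files are HTML, so they would need converting first); `last_line` and `tally_errors` are hypothetical helper names.

```python
from collections import Counter


def last_line(text):
    """Return the last non-blank line of a traceback (the exception line)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def tally_errors(texts):
    """Count error outputs by exception class (text before the first colon)."""
    counts = Counter()
    for text in texts:
        counts[last_line(text).split(":", 1)[0]] += 1
    return counts


# Abbreviated samples modelled on the exception lines quoted in this ticket.
samples = [
    "Traceback (most recent call last):\n"
    "ConnectionError: connection timed out (waiting for pw?)\n",
    "Traceback (most recent call last):\n"
    "OSError: [Errno 2] No such file or directory\n",
    "Traceback (most recent call last):\n"
    "OSError: [Errno 2] No such file or directory\n",
    "Traceback (most recent call last):\n"
    "ApplicationLocked: Cache is already locked (/data/cache/5/datacache)\n",
]
counts = tally_errors(samples)
```

Grouping by class alone merges distinct messages, such as the two `ConnectionError` variants above, so the full last line may be the better key when that distinction matters.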

comment:9 Changed 9 years ago by RayPlante

  • Status changed from assigned to closed
  • Resolution set to fixed

I have fixed this problem (related to cleaning up the cache; see also #1468), and the service appears to be operational again.
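
Note: the ticket does not record what the fix was. In the `OSError` traces above, the failure occurs in `_listCache` when `os.stat` is called on a cache entry that no longer exists, which is why a request for a D1 file reports a missing D3 tar file. One common way to tolerate that situation is to skip entries that vanish between listing and stat. The sketch below assumes a flat cache directory; `list_cache` is a hypothetical function, not the `ncsa_sciarch` code.

```python
import os


def list_cache(cachedir):
    """Return {name: size} for entries in cachedir, skipping any that vanish.

    Another process may remove a file between os.listdir() and os.stat();
    that case is skipped instead of raising OSError (Errno 2).
    """
    sizes = {}
    for name in os.listdir(cachedir):
        try:
            sizes[name] = os.stat(os.path.join(cachedir, name)).st_size
        except FileNotFoundError:
            # Entry disappeared (or its target is gone); nothing to plan for.
            continue
    return sizes
```

This only hides the symptom for the reader of the cache; concurrent writers and cleaners would still need the cache lock to coordinate removals.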
