
Re: [linux-usb-devel] OHCI hangs after failing to free resources

On Tuesday 03 July 2007, Mike Nuss wrote:
> David Brownell wrote:
> > On Wednesday 02 May 2007, Mike Nuss wrote:
> >
> > It's possible that the SKIP bit isn't handled correctly.
> >
> > I've not looked at that code in some time, but I seem to
> > recall thinking that setting SKIP was an action more in
> > the "defensive paranoia" category than the "essential" one,
> > so far as "functionally necessary" criteria go.  It's just
> > an optimization ... every time it's applied, the QH is also
> > removed from the schedule.  Either one alone should prevent
> > the HCD from doing much more with that QH...
> >
> > But I don't have time to sort out the relationship between
> > the SKIP bit and the software DEQUEUE flag.  The complication
> > is dl_done_list().
> >
> > See if you can make that code behave without turning on SKIP.
> > A quick'n'dirty experiment might be #defining that bit to zero
> > and deferring the clear of ED_H so that dl_done_list() still
> > has a way to tell when it's cleaning up after a halt.
> >
> > - Dave
> >
> Thanks for your reply. ISTR clearing SKIP after the fact without any
> success, but the damage is done at that point, especially if the
> hardware doesn't handle it as we'd expect. I'll take a look at the
> code and see if I can get things to work properly without it,
> per your suggestion.
>
> I posted this dump in my last message; it looks like the transfer
> completed but the TD was not put on the donelist:
>
> ohci_hcd 0000:00:13.0: read endpoint, ed c2d912c0 state 0x0 type intr; next ed 00000000
> ohci_hcd 0000:00:13.0:   info 08405110 MAX=64 DQ SKIP EP=2-IN DEV=16
> ohci_hcd 0000:00:13.0:   tds: head 02ba7300 DATA0 tail 02ba7300
> ohci_hcd 0000:00:13.0:   -> td c2ba7340; urb c272ca40 index 0; hw next td 00000000
> ohci_hcd 0000:00:13.0:      info 02140000 CC=0 DATA0 DI=0 IN R
> ohci_hcd 0000:00:13.0:      cbp 02dbe37a be 02dbe39f (len 38)
>
> 1) CC = 0 seems to indicate that the HC has successfully completed
> the transfer (I believe the HCD sets it to 0xf initially).

No.  CC==0 means that the HC has started working on it;
ISTR it doesn't mean it's even managed to complete one
packet.  TD completion is signaled by pushing the TD
through the donelist (after removing it from the hardware
ED queue).

(With OHCI you also have to be careful, since there's a state
where the HCD can tell the TD has been finished but where it
hasn't yet been handed back through the donelist.  Can't take
the TD back except through the donelist.)
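
For reading dumps like the one quoted above, here is a minimal sketch
that splits a general TD info word into its fields, following the
general TD layout in the OHCI spec.  The td_*() helper names are made
up for illustration; they are not the driver's macros.  Note that
nothing decoded here says whether the TD has been retired; only the
donelist says that.

```c
#include <stdint.h>

/* Hypothetical helpers: fields of the first word of an OHCI general TD. */
static inline uint32_t td_cc(uint32_t info) { return (info >> 28) & 0xfu; } /* condition code */
static inline uint32_t td_ec(uint32_t info) { return (info >> 26) & 0x3u; } /* error count */
static inline uint32_t td_t(uint32_t info)  { return (info >> 24) & 0x3u; } /* data toggle */
static inline uint32_t td_di(uint32_t info) { return (info >> 21) & 0x7u; } /* delay interrupt */
static inline uint32_t td_dp(uint32_t info) { return (info >> 19) & 0x3u; } /* 0 SETUP, 1 OUT, 2 IN */
static inline uint32_t td_r(uint32_t info)  { return (info >> 18) & 0x1u; } /* buffer rounding */

/*
 * The toggle field only overrides the ED's toggle carry when its high
 * bit is set; in that case the low bit selects DATA0 (0) or DATA1 (1).
 */
static inline int td_toggle_from_td(uint32_t info)
{
	return (td_t(info) & 0x2u) != 0;
}
```

Applied to info 02140000, these give CC=0, EC=0, T=2 (DATA0, taken
from the TD), DI=0, DP=2 (IN) and R=1, which matches the annotation
the driver printed next to the raw word.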

> 2) All our URBs are submitted with 64-byte buffers. len=38 means
> 26 bytes have already been transferred, which is the number of bytes
> we were expecting in this particular test.

... ok ...
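
The arithmetic behind that, as a small sketch.  It assumes CBP and BE
lie in the same physical page (true for the values in the dump), so
the remaining length is a plain address difference; the function
names are hypothetical.

```c
#include <stdint.h>

/*
 * Bytes of the TD buffer not yet filled.  BE is the address of the
 * last byte of the buffer; CBP is the next byte to be accessed, or
 * zero once the whole buffer has been consumed.
 * Assumption: cbp and be are in the same physical page.
 */
static inline uint32_t td_bytes_left(uint32_t cbp, uint32_t be)
{
	return cbp ? be - cbp + 1 : 0;
}

/* Bytes already transferred, given the length the URB was submitted with. */
static inline uint32_t td_bytes_done(uint32_t buf_len, uint32_t cbp, uint32_t be)
{
	return buf_len - td_bytes_left(cbp, be);
}
```

With cbp 02dbe37a and be 02dbe39f this yields 38 bytes left, so a
64-byte buffer has received 26 bytes.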

> 3) HwNextTD is null. This would happen when the HC has moved it to
> the donelist when the donelist was previously empty (which it should
> be, because HccaDoneHead is updated and WDH is sent after every
> single completion).  

... ok ...

> 4) However, it never shows up on the donelist. I added some
> 'tracking' code to keep track of the last 50 TDs pulled off
> the donelist. The last TD for this endpoint that appeared on
> the donelist was the TD at 0x02ba7300 (the current 'dummy').    

Hmm.  Here's a theory.  The way that the current code unlinks
an ED is to set the SKIP bit *AND* remove the ED from the relevant
part of the schedule.
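
The ED info word in the quoted dump (08405110) shows that state.  A
rough decoder, with hypothetical ed_*() names: the low 27 bits follow
the OHCI endpoint descriptor layout, and bit 27 is assumed here to be
the driver's software dequeue flag, which is consistent with the "DQ"
the driver printed.

```c
#include <stdint.h>

/* Hypothetical helpers: fields of the first word of an OHCI ED. */
static inline uint32_t ed_fa(uint32_t info)   { return info & 0x7fu; }           /* device address */
static inline uint32_t ed_en(uint32_t info)   { return (info >> 7) & 0xfu; }     /* endpoint number */
static inline uint32_t ed_dir(uint32_t info)  { return (info >> 11) & 0x3u; }    /* 1 OUT, 2 IN, else from TD */
static inline uint32_t ed_low(uint32_t info)  { return (info >> 13) & 0x1u; }    /* low-speed device */
static inline uint32_t ed_skip(uint32_t info) { return (info >> 14) & 0x1u; }    /* HC must skip this ED */
static inline uint32_t ed_iso(uint32_t info)  { return (info >> 15) & 0x1u; }    /* isochronous format */
static inline uint32_t ed_mps(uint32_t info)  { return (info >> 16) & 0x7ffu; }  /* max packet size */

/* Assumption: bit 27 is a software-only flag, not interpreted by the HC. */
static inline uint32_t ed_sw_dequeue(uint32_t info) { return (info >> 27) & 0x1u; }
```

For 08405110 this gives DEV=16, EP=2, direction IN, MAX=64, with both
SKIP and the software dequeue flag set.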

Maybe ... the hardware gets confused when the ED doesn't seem
to be on the relevant list.  Like maybe it expects it to stay
at the head of the ED list.  ISTR some silicon doesn't much
like to see null pointers written into the hardware registers,
and I know for a fact that the ed_deschedule logic was always
a bit racy.

That suggests that the safest route would be setting SKIP for
one frame (so all pending TDs get properly retired), and
only *THEN* taking it out of the queue.
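
Roughly, that ordering would look like the model below.  This is an
untested sketch, not a patch: struct ed_model, its states, and the
one-frame threshold are all invented for illustration, and a real
driver would also need a write barrier after setting SKIP plus the
actual list surgery where on_schedule is cleared.

```c
#include <stdbool.h>
#include <stdint.h>

#define ED_SKIP_BIT	(1u << 14)	/* sKip bit in the ED info word */
#define MIN_SKIP_FRAMES	1		/* assumption: one frame is enough */

enum ed_model_state { ED_M_OPER, ED_M_SKIPPING, ED_M_IDLE };

/* Hypothetical stand-in for the driver's per-endpoint state. */
struct ed_model {
	uint32_t info;			/* ED info word seen by the HC */
	enum ed_model_state state;
	uint16_t skip_frame;		/* frame number when SKIP was set */
	bool on_schedule;		/* still linked into the HC schedule? */
};

/* Phase 1: tell the HC to skip the ED, but leave it on the schedule. */
static void ed_start_unlink(struct ed_model *ed, uint16_t frame_now)
{
	ed->info |= ED_SKIP_BIT;
	ed->skip_frame = frame_now;
	ed->state = ED_M_SKIPPING;
}

/* Phase 2: call on each SOF; deschedule only after SKIP has aged. */
static void ed_finish_unlink(struct ed_model *ed, uint16_t frame_now)
{
	if (ed->state != ED_M_SKIPPING)
		return;
	/* The 16-bit subtraction copes with frame number wraparound. */
	if ((uint16_t)(frame_now - ed->skip_frame) < MIN_SKIP_FRAMES)
		return;
	ed->on_schedule = false;
	ed->state = ED_M_IDLE;
}
```

The point of splitting it this way is that the HC never sees the ED
vanish from its list while it may still be retiring a TD on it.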

> The spec mentions that setting CC and updating HwNextTD "may"
> be done in the same write cycle, but I don't know about updating
> the donehead.

It's easy to imagine one burst write updating the first four
words of the TD.  The donehead would necessarily be updated
at some other instant ... certainly the in-memory version,
plus on most hardware it's not possible to _observe_ a case
where that register updates at the same instant as the memory.

> Who knows what this particular controller is doing. 
> Maybe if the HCD happens to set SKIP in that small timing window
> it gets mishandled.

Similar small timing windows have a certainty of showing up in
real-world loads.  ;)

> As a side note - even though we don't know what the problem is,
> it seems to me that the error message "INTR_SF lossage" and
> the comments surrounding it should be changed. We're not losing
> interrupts. SOF and WDH are both still being generated, actually.
> ISTR the 2.4 code had a "?" after the equivalent message, which
> was at least a little more accurate ;)     

Fair enough.  The real issue is that the ED hasn't moved out
of the UNLINK state, for some as-yet-not-understood reason.

- Dave
