That is roughly how I did it:
- Figure out which set of circumstances makes the issue happen (the reproduction steps).
- Figure out all the code that gets called in those circumstances (including observers and such).
- Use process of elimination to figure out which chunk of code causes the problem.
- Fix the problem and release the new version.
- Deal with support issues from people with the old version still installed.
This was a particularly bad one because an undo observer indirectly triggered some code that inadvertently added transactions to the undo stack, essentially making it impossible to undo!
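To make that failure mode concrete, here is a minimal sketch in Python. It is not the actual application code: `UndoStack`, its `guarded` flag, and the "refresh" observer are all hypothetical names made up for illustration. It shows how an observer that records a transaction during an undo keeps the stack from ever shrinking, and how a simple re-entrancy flag is one possible way to avoid that.

```python
class UndoStack:
    """Toy undo stack (hypothetical) that notifies observers after each undo."""

    def __init__(self, guarded):
        self._transactions = []
        self._observers = []
        self._guarded = guarded   # whether to ignore pushes made during an undo
        self._undoing = False     # True while observers are being notified

    def add_observer(self, callback):
        self._observers.append(callback)

    def push(self, name):
        # With the guard on, transactions recorded while undoing are dropped.
        if self._guarded and self._undoing:
            return
        self._transactions.append(name)

    def undo(self):
        if not self._transactions:
            return None
        undone = self._transactions.pop()
        self._undoing = True
        try:
            for callback in self._observers:
                callback(undone)
        finally:
            self._undoing = False
        return undone

    def depth(self):
        return len(self._transactions)


def run_demo(guarded):
    """Push two edits, undo once, and return the remaining stack depth."""
    stack = UndoStack(guarded=guarded)
    # Hypothetical observer: reacting to an undo indirectly records a transaction.
    stack.add_observer(lambda undone: stack.push("refresh after " + undone))
    stack.push("edit 1")
    stack.push("edit 2")
    stack.undo()
    return stack.depth()


print(run_demo(guarded=False))  # observer re-adds a transaction during undo
print(run_demo(guarded=True))   # observer's push is ignored during undo
```

Without the guard, the depth is still 2 after the undo, so every further undo only removes the transaction the observer just added and "edit 1" is never reached. With the guard, the depth drops to 1 as expected.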