Extreme Debugging (squanderingti.me)
193 points by _fnqu on Nov 1, 2020 | hide | past | favorite | 53 comments


A friend had a problem with EndNote on Mac. Although there are probably better tools, it's standard for their lab so they have to use it.

EndNote, on her machine, was just absurdly slow. Operations like editing a single reference would beachball for 30-60 seconds. She has a moderately-sized library - about 1000 references - but nothing that should bring a modern MacBook to its knees. So I took a look. Happily, on a Mac, you can use spindump to take a sample-based profile of any process, and you can even do it from Activity Monitor. EndNote's symbols were partially stripped - but the Objective-C method names weren't, so I got to see that somewhere in the call chain was a "recalculateColumnWidths" function. It was apparently getting called on practically every operation on the library.

I got out Ghidra, traced the caller of that function, and `nop`ped the call site out. Problem fixed - no more random slowdowns, no more beachballs. Resizing the columns manually worked too - I never really saw any cases where the columns seemed out of whack.

If I had to guess, she probably had some reference with unusual characters in it that was causing EndNote to think it needed to resize all the columns. I never quite figured that part out - but tracking it down would have been another way to solve the problem (e.g. by bisecting her library to find the offender(s)).


Also worth noting: there's a "sample" tool (exposed in Activity Monitor) that targets just one process and doesn't require administrator privileges, and Activity Monitor has native UI for expanding and collapsing the resulting call tree. spindump is useful when you know something is stuck in the kernel, or when you want to track multi-threaded or multi-process hangs ;)


I once fixed a null-pointer crash my brother and I kept hitting when defeating a particular enemy in Battle for Middle-earth 2 by simply nopping out the access... After hundreds of games, it didn't seem to have any side effects.


Yes, this so much. This is how I know 90% of the things I know now. You see a problem–some random thing is crashing, or using more RAM than it needs to, or you want it to do something slightly different–and you resist the urge to go "eh, I'll deal with it later or ignore it" and dig in all the way using all your tools. Sometimes you fix the problem, sometimes you learn something new, and all the time you get better at debugging random stuff. At some point crashes stop being scary, thread safety issues go from "how could you have possibly known this" to "I can spot the issue from a partially-symbolicated stack trace of two threads and five register values", and your answer to "what happens when I type google.com into a browser and press enter" is "well, once I did that my browser hung, so I followed the HID system in the kernel to…"

The obvious disclaimer is this can make you slow to get things done sometimes–but you'll be lightning fast the next time it comes around ;)


> "it’s charming in a way that only academic software can be."

I remember experiencing this as an undergraduate CS student with a bioinformatics research position. At the time I felt like I was too stupid to understand the code.

After I struggled with that, the professor reassigned me to making their software available for web use and wanted me to take over some PHP shell_exec rollercoaster for handling job submissions.

Your NSF funds hard at work.


If only grants were big enough (and permissive enough) to hire professional SWEs.

It's as if we sent a bunch of theoretical physics PhDs out to Geneva, gave them garden trowels, and let them break ground on a particle accelerator themselves.


The word "informatics" is right there in the "bioinformatics" field name, so I would expect them to know how to program.

The reality is that academics are bad at programming because academia is bad at teaching it and doesn't prioritise it.


Some of this debugging is, per the title, "extreme" (trying to "rewind" the realities of source behavior through time until things start to click, dealing with ancient binaries, using disassembly/binary analysis tools to hack up branching logic).

However, a lot of this is using tools that I'd expect everyone (even browser JS programmers, Java programmers, folks that write code in higher level languages) to have at least a passing familiarity with: strace/ltrace, nm/ldd, gdb...no matter who you are or what you do, there's immense benefit in trying to debug your programs like they're closed-source[1].

I'm perennially disappointed to find that extremely skilled and intelligent colleagues (far better at writing and designing code than me) are routinely stymied when asked to figure out why their code is acting a certain way when it runs--why a behavior of their code is occurring in an environment where their typical tools (language-specific debuggers, IDEs, etc.) aren't available.

I think this is a skill-gap that our iteration-focused, "just try shit until it works" industry has instilled in a generation of programmers to their detriment.

This is super important stuff--not just for debugging ability, but because understanding at least the "shallow end" of runtime-analysis tools (and with eBPF and core analysis tools, to say nothing of actual embedded/kernel programming, the pool gets a lot deeper, if you want) makes you become a way better programmer!

This isn't only for folks who write code in runtime-less languages (and don't come at me about crt0); even people who write Python for a living (like me!), get bloody superpowers from understanding things like kernel memory management, syscall semantics below the standard library, and garbage collection behavior. The folks I find that don't have those skills aren't stupid or talentless, but rather victims of a series of roles and managers that have instilled a kind of Stockholm Syndrome of "when these things fail, there's nothing you can do about it, just consider it spooky and move on with your life", when getting at what's really going on is not just easy to do, but super beneficial for your career development!

Put another way, we're all systems programmers in our day jobs--some of us are just in denial.

/rant

1. https://jvns.ca/blog/2014/04/20/debug-your-programs-like-the...


You won't have a good time trying to debug a JVM or Node.js runtime with gdb. There are too many threads, too many moving parts, and trying to stop the world to make sense of it is too invasive; you'll muck with timings and waits and timeouts and GC thresholds and all sorts, and your effort to figure out what's going on will have changed what's going on. And that's if gdb doesn't crash.

IMO not enough people download and build the source. It's primitive, but adding logging or printfs to existing source, combined with strace if necessary, can be effective in a brutal way, where gdb - or any debugger - wouldn't get you far.


Thank you for a reminder of one of my favourite debugging stories. I was working on an x86 bootloader and something was going wrong when making the transition into 32-bit protected mode. Ultimately I found the problem by... adding printfs inside qemu and recompiling, so it would output the CPU state registers for every instruction it was executing (luckily not too many to run before encountering the bug).


While it's certainly much harder in managed runtimes, I've been quite happy with gdb's "where" output on JNI-bound or syscall-stuck JVM cores, and jstack (and more modern friends) where that didn't work. For my other HLL of choice, pyrasite[1] works quite well. When using all of those tools, I encountered significant out-of-the-box usability issues until I learned what those debugger-esque tools were doing with regards to their "target" programs' (or their postmortem heaps) stacks, signal handlers, and allocators. Combining the understanding of the latter concepts with the former tools is a killer combo, and, while I can't speak directly to your counterexample of the NodeJS runtime, I would be extremely surprised if tooling did not exist to perform similar debugging tasks with it.

1. https://github.com/lmacken/pyrasite


Doesn't mean you have to be afraid of debugging your runtime. May be too much noise, or too much info, but sometimes getting a crashing runtime to dump core may be enough to clue you in on what's going wrong.

The other day I did exactly that with the JVM to discover why our JavaFX app was crashing to desktop on a customer's machine. Going through the crash dump, I immediately spotted DirectX calls being prominently displayed (IIRC at the top of one of the stack frames), and upon investigation, it turned out that the computer hadn't had its graphics drivers updated in a long, long time.


I've always wondered if there shouldn't be a VSCode or other plugin that allows you to automatically throw in a bunch of printfs up a dependency chain to dump vars etc, and remove them just as easily.


I'm not sure that would be as useful as it sounds. A big part of what you get from manually adding printfs is familiarity with the code you're instrumenting.


That's called a debugger: put a breakpoint somewhere and the application will pause right there and show you all the variables.


Funny thing about debuggers is that they seem to encourage stepping - step in, step out, step over, single step, step to return.

Stepping is an enormous waste of time with a debugger.

Breakpoints are where it's at. Level up your breakpoints: logging breakpoints, conditional breakpoints, memory breakpoints (super handy when combined with deterministic memory allocation), syscall / library breakpoints - and soon you're going somewhere useful.

Debugging is about formulating theories and proving them wrong, or right. Stepping doesn't scale; breakpoints can scale, if you know where to put them.

I'm not sure that the gap between a debugger with breakpoints, and printfs in source, is that large, though. It's easier to write complex conditionals or log the state of a complex predicate in the source language than in whatever the debugger gives you - even if it gives you the native language, you typically have to write it all on a single line. And you need to dive deep into the source to understand where to put those breakpoints - or printfs.

An underused debugging tool is an instrumenting profiler. The captured call graph is a useful clue for dynamic & indirect control flow in particular, without needing to decipher it from stepping or infer it from chasing source references and indirections. Things like generic message dispatchers and event callbacks are horrible to try and step through and aren't much better when eyeballing the source.


FWIW, I rarely put debugging logs in my code--almost only in cases where I can't stick a breakpoint on something because I can't get a debugger on it. 99% of your debugging printfs can be breakpoint commands. They're actually better, because they're separate from your source code, so you aren't going to accidentally commit them; they can change dynamically without recompiling or even re-running; and they have access to a lot more information than your compiler may let you print out.


It's generally hard to add breakpoints in production. OTOH most logging systems aren't set up for dynamic contextual reconfiguration either; one that was could gain a real advantage over debuggers with a bit of effort.

IME it's easier to deal with modifications to code than configuration in IDEs, precisely because I can identify diffs and, if desired, commit the modifications to source control on my own branch. I frequently debug things like third-party libraries by adding them (e.g. the Ruby gems in my rvm gemset) to a git repo to track my modifications.

It's generally harder to export and import breakpoints from an IDE. They are usually driven by line numbers, which aren't particularly stable over code changes.


How does one go about learning these kinds of skills?


I, personally, learned gradually how to use them by refusing to accept a bug fix when I didn’t understand the bug I was fixing. Gdb was first, because when I learned C it was the only freely available option. Strace likely was next when I started encountering problems that happened too long into a program’s execution to successfully use breakpoints to debug. Valgrind was likely next, when I ran into bugs where the memory corruption happened a long time before the actual crash (think use-after-free or uninitialized values).


Every time I touch gdb I curse: why isn't it as easy to inspect code, watch variables, have multiple views, etc., as it's been in Visual Studio's debugger since 1995?


There are numerous front-ends which do that.


I was curious as well, so I'm currently reading this:

https://dev.to/captainsafia/say-this-five-times-fast-strace-...

where I stumbled upon and read the ezine:

https://jvns.ca/blog/2015/04/14/strace-zine/


Speaking very personally and not axiomatically: by refusing to accept "fixes" that aren't fully understood.

If that means rejecting the PR "stop using globals in $language because that triggers a 'slab allocation' (whatever that is) which stackoverflow says is bad", and instead insisting to your boss that you spend an extra week reading kernel source/stdlib source instead of moving on from the original 2-hour-scoped bug, so be it. The outcome is far more beneficial (for you and for the product/company) in the long term.


Practice.

strace and gdb are your friends for catching evasive bugs. Personally I would have gone exactly the same route, though I wouldn't have modified the binary - I'd have crafted an environment for it to work in instead.


Seconding this. The post focuses on strace & friends in the Linux/Mac world; for Windows, a powerful tool to look into is Event Tracing for Windows, which lets you trace events, syscalls and performance counters for everything that's running - i.e. a global trace of all (or at least most) processes.

Two main tools to look into are xperf, the cmdline tool for configuring and running trace sessions, and Windows Performance Analyzer, a GUI app for exploring recorded traces (additionally, there's Windows Performance Recorder, which seems to be an official GUI for xperf, but I haven't used it myself yet).

This is essentially a superpower that's even less known than strace is among Linux devs. I've only learned the basics quite recently, and have already used it to debug, in a few hours, two hairy issues that had already eaten a man-month of time and would likely have taken another. Being able to explore recordings of the process tree's evolution, and to see all file IO operations - their timing, their targets, and their relation to everything else happening in the system - in both visual and tabular form gives you a perspective you otherwise can't get.

(An unfortunate side effect is that the tough bugs, like multiprocess race conditions or silent crashes, all seem to be directed my way nowadays.)

Unfortunately, xperf and WPA don't come pre-installed in most Windows distributions; the fastest way to get them is through UIforETW[0] - Google's UI for xperf that can automatically locate and install the correct version of Windows Performance Toolkit[1].

Bruce Dawson (of Google Chrome fame) wrote a lot about advanced usage of this tooling, see: https://randomascii.wordpress.com/2015/09/24/etw-central/.

--

[0] - https://github.com/google/UIforETW/releases - fetch the newest .zip package with binaries.

[1] - Alternatively, if you need xperf on a computer with no Internet access (happened to me the other day with a customer's VM): download the Windows SDK ISO (e.g. https://developer.microsoft.com/en-us/windows/downloads/wind...), mount it, and grab the Windows Performance Toolkit installer from the mounted image.


> (An unfortunate side effect is that the tough bugs, like multiprocess race conditions or silent crashes, all seem to be directed my way nowadays.)

The trick is to never reveal your superpower!

The article was a delight to read, it's definitely an area where I am lacking and it seems fairly difficult to pick up this skill through a curriculum.


I am not aware of working strace/gdb/ldd alternatives for Windows. I should still look into strace; it looks really powerful.


Windows has Visual Studio, whose debugger is light years ahead of gdb.

To capture syscalls, there are procmon/regmon/sysmon from Sysinternals, which are also light years ahead of anything available on Linux and come with an amazing UI.

As for ldd, I'm not sure what it would be for; Windows is not quite like Linux when it comes to dynamic libraries. There is depends.exe from Visual Studio, and there was a nice import analyzer tool whose name I forget.


As soon as I read the word "Ceph", I knew what the problem might be. It's not the first time I've found an old 32-bit binary that can't cope with very large partitions. I think they could have saved time by using a virtual machine with a virtualized disk, if the data processing isn't too intensive.


Hah. Binary patching is a neat hack, but my first thought would've been to make a loopbackable filesystem in a file that explicitly didn't have that many inodes.

Mostly on grounds that if one binary had that problem, maybe another one would, and I would really rather not go through the same process twice ;)


The nice thing about OP's solution is that they produced a new binary that's runnable on any system - in particular, it sounded like they had a whole research team who probably needed to run the software. I suppose you could give them an "installer" of sorts that would simply mount the appropriate filesystem and put everything there, but the OP's solution seems decently elegant all things considered :)


Yeah. It's a question of axes: "runnable on any system" versus "robust against any program with that bug".

Trade-offs all the way down, as usual, and "neat hack" was meant positively, mine's just a different neat hack with different advantages and failure modes :D


How about something like a vm


I’ve done that, man (Cheech & Chong reference), but with different languages and contexts.

In my experience, debugging threaded stuff is the worst. Remote-machine threaded debugging (like you see with driver and embedded stuff) makes one wake up screaming for years.

In the eighties and nineties, there was this famous low-level Mac debugger, called “The Debugger V2 & MacNosy.”

If anyone remembers that, I may have just reawakened trauma. Sorry.

The running joke was that the only people that knew how to use that system were the ones personally trained by its author. I never really got the hang of it, myself. I tended to use MacsBug a lot.

My favorite debugging tools were something called “In-Circuit Emulators” (ICEs). These were big-ass machines that had a PGA that you would plonk into the processor slot, and they would emulate the processor. You really could dig out those nasties with an ICE.

Processors are way too hairy for that, nowadays.


ICE: uggggh, the ghosts of my debugging past are haunting me.

The pleasure of a successful two-week debugging journey, the pain of discovering it was an off-by-one written by oneself (when one was not in a fit state to have been at the keyboard!)

I think my most inventive embedded CPU debugging was instrumenting the custom RTOS to output the task number onto data pins. Then capturing that data on a logic analyser, printing the capture from the logic analyser to the parallel port of a PC, then post processing the data so that I could profile the CPU usage per task. I guess my electronic engineer background caused the logic analyser to be my hammer...


I think you get part of that now with Processor Trace or Coresight. You get a full control flow trace, every branch. For values (registers, variables, function parameters) it's not there but I've been experimenting with light PTWrite calls and it can help :-)


Remember OllyDbg? Well, this is its distant cousin edb.

https://github.com/eteran/edb-debugger

You will like it.

Try Ghidra, too.


+1 for Ghidra. I’ve heard good things: https://ghidra-sre.org


Ghidra is very good for how much it costs.


I did a talk [0] at the LISA conference back in 2013 with a narrative around the same stat "bug" and how it caused a huge outage when we were in the final days of rendering a movie. Instead of debugging I was focussing on how complex systems fail. (If you just want the stat story watch the beginning, skip to 14:00 and then from 41:30 for the conclusion.)

While I love reading about how someone can patch a binary to work around this, to me it's more interesting to think about the systemic failures. The problem ultimately is that one system is sending another a response that it can't handle, but the original Unix API has made this impossible to handle. In fact in both this article and my talk, receiving a 64-bit inode causes the segfault, but neither program was asking the filesystem for an inode in the first place. (Ultimately the fix in TFA was to ignore the result of the stat call.)

It's amazing that software works at all when you're dealing with latent 25-year-old "bugs" that are really just badly designed APIs and poorly implemented specs. And the organisational failures that force us to maintain compatibility with ancient binaries without access to the source. The author here did a great job resolving something that a lot of folks would have just walked away from.

[0] https://www.usenix.org/conference/lisa13/drifting-fragility


I like these post-mortem-of-a-bug sort of write-ups.

I feel like there is so much more to gdb than what I've yet had a need to use.


Halfway through the article, when he found that stat was the problem, I was surprised he didn't just solve it with a hastily coded LD_PRELOAD shim and move on with his day. I'm routinely working with Fortran that predates my birth. If I find a convenient inflection point where I can inject a trivial fix, that's usually much easier and safer than trying to debug the code, only to later discover it makes use of a computed goto or a longjmp that would cause an insidious bug.


Does LD_PRELOAD work for static binaries?


Ah, I missed that part. No, LD_PRELOAD isn't going to be able to override statically linked functions.


Not for fully static binaries, but apart from Go programs those are pretty rare, right? For everything-static-but-libc it's still OK.


https://stackoverflow.com/questions/60450102/why-does-ld-pre...

> No, it doesn't. A static binary has no need to resolve any symbol dynamically, and therefore the dynamic loader is not invoked to resolve the usual library function symbols.


A bit unrelated, but I once had to "rewind" quite literally.

I had an embedded system with a CPU that had no instruction cache. It was crashing, and I had the crash address, but I couldn't figure out how it got there. So I attached a logic analyser to the memory bus, triggered it on the crash address (or maybe on some symptom of the crash; it's been a while). I had the trigger point be at the end of the saved trace. I then could see what the address bus had been up to (that is, what instructions were fetched) for some time before the crash. That got me enough information to find and fix the problem.

You couldn't do that today, because everything is too small to put a chip clip on, and the processors cache instructions, so the program counter is not directly visible on the address bus.


This is a great post, nice work! Once the root cause was identified as the ancient stat call - would it have been easier/safer to patch the syscall table to point stat at stat64? Presumably that could be done without rolling your own kernel module?


`struct stat` and `struct stat64` differ in size - so no, this would not be possible. Using `stat64` in place of `stat` would cause a buffer overflow (not to mention screwing up structure offsets).


Punchline - the line explaining the actual bug is here: https://squanderingti.me/blog/2020/10/28/extreme-debugging.h...

PS: ignore the little link icons at the end of each paragraph... they don’t work! I love reading about a successful bug hunting expedition.


The permalink icons do work for me (Firefox)...


Quite clever! I wish there were more articles like this posted here :-)



