I used this technique once to very useful effect in an automated dependency-checking tool.
Essentially given an idempotent data-transformation process, you can inject a dll into it and hook its filesystem calls, maintaining a list of all files it touches. Upon process termination, serialize out the filesystem state (size, timestamp, MD5) of each of those files, saving the data in a dependency file whose filename is the hash of the command line.
When you run the same command line again, the dll can first look up the dependency file from the previous run, and if all files on the filesystem are in the same state as previously, you can short-circuit the process execution entirely.
This is orders of magnitude faster in many cases for highly parallel data build jobs where only some small percentage of the source data changes each time you run the build job. It also has the advantage that you don't need to manually maintain a list of dependencies for each process type (no new dependency can be added without changing one of the existing dependencies).
I wrote a hooking library that implements the methods talked about here, plus a bunch more that aren't, and it handles a bunch of edge cases not mentioned: https://github.com/stevemk14ebr/PolyHook_2_0
Thanks for linking, this is really cool. Also nice to see AMD64 support, I never figured out how it can be done. (Is there an absolute jmp without destroying registers in x64?)
Yes, there is. \xff\x25\x00\x00\x00\x00, where the zeroes are a 32-bit displacement to a memory location containing the 64-bit target address, is what I use: jmp [disp]. But you need to be able to place that constant within +-2 GB of the jump site. This is really hard on x64 Windows, since VirtualAlloc gives you no placement guarantees; I used to walk pages manually, now I just new/delete in a loop and hope for the best.
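For concreteness, here's a little encoder for that jump (my own sketch, not from any particular library). One common layout puts the 64-bit address immediately after the instruction, so the displacement is simply 0:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Emits: jmp qword ptr [rip+0], followed immediately by the 64-bit
   target address, so the RIP-relative displacement is just 0.
   14 bytes total, and no registers are clobbered. */
size_t emit_abs_jmp(uint8_t *buf, uint64_t target) {
    buf[0] = 0xFF;               /* opcode: JMP r/m64            */
    buf[1] = 0x25;               /* ModRM: [RIP + disp32]        */
    memset(buf + 2, 0, 4);       /* disp32 = 0 -> qword follows  */
    memcpy(buf + 6, &target, 8); /* absolute 64-bit destination  */
    return 14;
}
```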
There's also:
push rax
mov rax, 0xDEADBEEFDEADBEEF
xchg qword ptr ss:[rsp], rax
ret
but I don't like it as much since it touches the stack (technically more detectable, since you overwrite what's at rsp - 8). RSP and RAX hold their original values after that gadget, though.
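For reference, that gadget encodes to 16 bytes; a sketch of an encoder (my own illustration, using the standard encodings as I understand them):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Encodes the push/mov/xchg/ret gadget above into buf: 16 bytes that
   jump to `target` while leaving RAX and RSP at their original values. */
size_t emit_push_ret_jmp(uint8_t *buf, uint64_t target) {
    buf[0] = 0x50;                  /* push rax             */
    buf[1] = 0x48; buf[2] = 0xB8;   /* mov rax, imm64       */
    memcpy(buf + 3, &target, 8);
    buf[11] = 0x48; buf[12] = 0x87; /* xchg [rsp], rax      */
    buf[13] = 0x04; buf[14] = 0x24; /*   (ModRM + SIB: rsp) */
    buf[15] = 0xC3;                 /* ret                  */
    return 16;
}
```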
Also, fun fact: my library supports JIT-ing, so you can create the stub that the hook jmps to at runtime, and it will JIT the translation logic for the calling convention and pack the args + return value into a structure that can be modified. So you can hook unknown functions at runtime.
I am guessing the length of the jump itself is probably important for some reason. But if you could afford to overwrite ~16 bytes, maybe you can store the address inside the imm64 of another instruction. Length issues aside, it shouldn’t break nested hooking at least.
Mixing code and data, right? I think the downside here is if you tried to install another hook it’d fail because the LDE wouldn’t be able to make heads or tails of the address. I suggest embedding the value into something with an imm64 argument (mov?) so that LDEs can handle it.
I guess, though, that at ~16 bytes it's probably deep enough into the function that the code may no longer be position independent, or hell, maybe the function isn't even that long to begin with.
If you're writing a hooking library / a hook you should be keeping track of where they are. It's a big hook, that is true but it's also one that doesn't spoil a register and is pretty straightforward to add. It's a tradeoff.
Well the bigger problem imo is other hook engines that might also be roaming around the process space. I think all you need is two extra bytes to make it valid instructions, and in theory then nested hooking should work fine. Though it only exacerbates the length issue.
If you place the data at the end of the trampoline, it avoids these issues of mixing data and code; it's like a little custom data segment you make, since you have to allocate the trampoline anyway. This is what I do in my lib: the displacement's target sits right after the jmp the trampoline uses to jmp back to the original. The original function contains only the jmp [disp], and no data is mixed in.
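A sketch of the displacement arithmetic under that layout (hypothetical addresses; the disp32 is relative to the end of the 6-byte jmp at the hooked function):

```c
#include <assert.h>
#include <stdint.h>

/* The jmp [rip+disp32] placed at the hooked function must reach the
   8-byte address slot stored at the end of the trampoline.  disp32 is
   relative to the address of the *next* instruction (jmp site + 6). */
int32_t rip_disp32(uint64_t jmp_site, uint64_t data_slot) {
    int64_t d = (int64_t)(data_slot - (jmp_site + 6));
    assert(d >= INT32_MIN && d <= INT32_MAX); /* must be within +-2 GB */
    return (int32_t)d;
}
```

This is also why the trampoline allocation has to land within +-2 GB of the hooked code in the first place.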
We used hooking to "isolate" the audio output from a certain application on Windows:
The application would ask the operating system for the default audio device. By intercepting this request, we were able to re-route it to our own virtual audio device. Our program would then fetch the audio data from the virtual device and replay it to the "real" audio device. At the same time, the audio was saved to RAM, and finally to disk.
The benefit of this method was that we were able to actually isolate the audio from all other sources on the computer. So you could, in theory, mute the playback, while still being able to let the recording run.
Ultimately, we abandoned this method, as it proved quite unreliable. But it was fun to come up with, and finally implement.
Oh hey! This is a favorite topic of mine. I wrote myself a hooking library for a project where I didn’t want to use libc (and it was Win32 by design so I could just use the equivalent Win32 calls). The hardest part was definitely the LDE (length disassembly engine) and in the end I found a small header-only open source library that did it perfectly. The rest was very easy, especially on x86.
API hooking sometimes even works in hostile environments, like on software that tries to guard against patches and modification, simply because it can be challenging to detect and you can do it early on (like inside a DLL entry point). So if you can do all of your work at API boundaries, you can get away with a whole lot, even if an app is packed with a strong VM packer.
Worth noting that for many less difficult use cases on Linux, you can use LD_PRELOAD to somewhat similar effect.
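A minimal example of the LD_PRELOAD pattern (a hypothetical shim; the usual trick is to shadow a libc function and fetch the real one with dlsym(RTLD_NEXT, ...)):

```c
/* shim.c -- a minimal LD_PRELOAD interposer (illustrative sketch).
   Build: cc -shared -fPIC shim.c -o shim.so
   Use:   LD_PRELOAD=./shim.so some_program                         */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Our fopen shadows libc's; RTLD_NEXT resolves to the real one. */
FILE *fopen(const char *path, const char *mode) {
    FILE *(*real_fopen)(const char *, const char *) =
        (FILE *(*)(const char *, const char *))dlsym(RTLD_NEXT, "fopen");
    fprintf(stderr, "fopen(%s, %s)\n", path, mode); /* log, then forward */
    return real_fopen(path, mode);
}
```

Any dynamically linked program run under this preload will have its fopen calls logged to stderr, with no code patching at all.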
On Linux (and probably other Unices), another way to hook is via an interesting default behaviour of the dynamic linker: it resolves imports by symbol name alone, without qualifying them with the filename of the library they were meant to come from, and it prefers symbols from already-loaded libraries. I suspect that more often than not, this interposition happens by accident rather than by deliberation.
Renderdoc uses hooking to instrument your 3D API calls to provide excellent graphics debugging. Good example to dive into since it's also cross platform.
I used graphics API hooking in a Source Engine game where I didn't have the source for the engine but needed to display the 2D Flash based GUI (Iggy) at the correct time. It was fun to get it working.
Valve's Steam also does something similar, since it has the ability to superimpose its GUI over a running game.
Conceptually yes, but technically RenderDoc works differently. On Windows it patches the IAT (Import Address Table), then for D3D it wraps complete D3D COM interfaces. It doesn't use the tricks described here to patch code with a jmp instruction; that's not too reliable, IMO.
Similar on Linux, it doesn’t have IAT but it has PLT (Procedure Linkage Table) which is basically the same thing as IAT.
API hooking used to be super easy and very common under DOS. All the API calls worked via traps (software interrupts), so you'd just have to 1) store the original interrupt vector, 2) change it to point at your own interrupt service routine, and 3) do whatever you wanted when the trap was invoked.
Point 3 might include logging, passing the request through to the original ISR (possibly with changed parameter values), changing the return values, anything really. Easy & fun. =)
I've used MS Detours to hook my own C++ interceptor functions into the dispatch from Excel to XLL extensions via the internal Excel4V interface. It worked very well, showing me the XLOPER values. But that was for 32-bit Excel. I didn't appreciate how much trickier it is with the AMD64 instruction set.