IE Toolbar crashes in IE 8 Beta 2
The problem: A client's IE toolbar was crashing in the beta 2 version of IE 8 as soon as IE started up, and the crash was easy to reproduce. The short version of the story is that IE 8 has a feature called LCIE (Loosely-Coupled IE), which runs the browser frame and its tabs in separate processes. The toolbar was sending messages to its parent rebar control, and as of IE 8 beta 2 those calls had become cross-process SendMessage calls. If such a message is not marshalled (more on that later) and contains pointers into the sender's address space, it can easily crash the receiving process.
Below is an analysis of the discovery process for this bug:
Virtual PC and Remote Debugging
For testing purposes, I have a Windows XP SP2 system running in Virtual PC. I installed IE 8 beta 2 on this virtual machine, then installed the client's toolbar product and reproduced the problem right away. To debug it, I installed the Visual Studio Remote Debugging tools on the virtual machine and ran the following command line so that I could debug it from my desktop:
msvsmon /anyuser /noauth /port:9912
Connecting to the Virtual PC
The bug happened on startup, so rather than attaching to an already running process, I wanted the debugger to launch iexplore.exe itself. So I configured my DLL project to launch iexplore.exe on the remote machine when I started the debugger.
Once that was set up, I simply selected Debug.Go to start IE and reproduce the bug.
The Crash With Symbols
The process crashed with an access violation and broke into the debugger. Before showing the callstack, I should explain why it contains so much detail about internal functions within Microsoft's modules: I had retrieved all of the public symbols that Microsoft publishes for the DLLs involved. This can be done using Microsoft's symbol server.
There are a few ways to enable it, but the way I usually do it is to set the _NT_SYMBOL_PATH environment variable before launching Visual Studio. I first created a symbol cache directory, "c:\symbols", and then set the environment variable to the following:
_NT_SYMBOL_PATH=symsrv*symsrv.dll*c:\symbols*http://msdl.microsoft.com/download/symbols
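The value above is a single symbol-server entry of the form symsrv*<dll>*<local cache>*<server URL>, and _NT_SYMBOL_PATH can hold several entries separated by semicolons, searched in order. The following is only a minimal sketch of that layout; the helper names and the local directory "c:\toolbar\Debug" are hypothetical, chosen for illustration.

```python
# Sketch: compose an _NT_SYMBOL_PATH value like the one above.
# Entries are separated by ';'; a symbol-server entry has the form
# symsrv*<dll>*<local cache>*<server URL>.
MS_SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def symbol_server_entry(cache_dir, server=MS_SYMBOL_SERVER, dll="symsrv.dll"):
    """Build one symsrv entry that caches downloaded symbols in cache_dir."""
    return "*".join(["symsrv", dll, cache_dir, server])


def nt_symbol_path(cache_dir, local_dirs=()):
    """Local symbol directories first (searched as-is), then the symbol server."""
    return ";".join(list(local_dirs) + [symbol_server_entry(cache_dir)])


if __name__ == "__main__":
    # Hypothetical local directory holding the toolbar's own PDB files.
    print(nt_symbol_path(r"c:\symbols", [r"c:\toolbar\Debug"]))
```

Putting your own build output ahead of the symbol server means your private PDBs are found locally, and only the Microsoft modules go to the network.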
The first time you launch Visual Studio with this setting, it will run substantially slower, because it has to download the symbols. It should be faster once those symbols are cached. So, after retrieving all of the symbols, I had a nicely detailed callstack.

With this callstack, we can clearly see that the crash is happening within the window procedure of a rebar control.
Where is the toolbar DLL?
At first I looked for our DLL in the callstack and didn't find it. It wasn't in the module list either. Then I realized that there were two IE processes, which is due to the LCIE feature.
Fortunately, Visual Studio lets you attach to multiple processes, so I attached to the other IEXPLORE.EXE process as well and found that one of its threads was indeed waiting on a SendMessage call. Had I built this DLL with debug symbols, I would have been able to see the exact code causing the problem. But let's assume I didn't have any symbols for that DLL and see whether we can still discover the cause.
Why is it crashing?
We can clearly see from the callstack that we are in a window procedure. But which message is being sent, and why is it crashing? It seems fair to assume that a function named WndProc has the signature of an actual window procedure, so let's walk up the callstack and look at the call into that procedure to see whether we can determine which message is being sent.
Based on the callstack, we should be able to determine which window message is causing the problem. We know that a window procedure takes the message identifier as its second argument. The arguments are pushed in reverse order (right to left), so the next-to-last push instruction before the call is the one that pushes the message identifier. Entering that push's operand expression from the disassembly into the watch window shows that our window message identifier is 1053.
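The push-order reasoning can be modeled in a few lines. This is a sketch under stated assumptions: a 32-bit x86 window procedure declared CALLBACK (__stdcall), and a callee that sets up a standard EBP frame. The function names and the sample handle and pointer values are hypothetical.

```python
# Sketch, assuming 32-bit x86, a __stdcall (CALLBACK) window procedure,
# and a callee that begins with PUSH EBP / MOV EBP, ESP.

def pushes_for_call(hwnd, msg, wparam, lparam):
    """Values in the order the caller executes its PUSH instructions
    for WndProc(hwnd, msg, wParam, lParam): right to left."""
    return [lparam, wparam, msg, hwnd]


def message_id_from_pushes(pushes):
    """The last push is hwnd, so the next-to-last push is the message id."""
    return pushes[-2]


def ebp_offset_of_arg(index):
    """Offset of argument `index` (0-based) from EBP inside the callee:
    [ebp] = saved EBP, [ebp+4] = return address, first argument at [ebp+8],
    and each argument occupies one 4-byte stack slot."""
    return 8 + 4 * index


if __name__ == "__main__":
    # Hypothetical handle and pointer values, for illustration only.
    pushes = pushes_for_call(0x000A01B2, 1053, 0, 0x0012F5A0)
    print(message_id_from_pushes(pushes), hex(ebp_offset_of_arg(1)))
```

The same layout explains where to look from the other side: inside the window procedure, the message identifier is at [ebp+0Ch] and lParam is at [ebp+14h].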
So what caused the crash?
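1053 corresponds to the RB_GETBANDINFO message (WM_USER + 29 == 0x0400 + 29 == 1024 + 29 == 1053). More precisely, WM_USER + 29 is RB_GETBANDINFOA, the ANSI form that RB_GETBANDINFO maps to in a non-Unicode build.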
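Decoding a raw message number seen in the debugger is just arithmetic against WM_USER. The sketch below includes only the two RB_GETBANDINFO variants (values as defined in commctrl.h); the lookup table and the describe_message helper are illustrative, not part of any Windows API.

```python
WM_USER = 0x0400  # winuser.h: start of the range that controls define privately

# The two RB_GETBANDINFO variants from commctrl.h; RB_GETBANDINFO maps to
# one or the other depending on whether UNICODE is defined.
RB_GETBANDINFOW = WM_USER + 28
RB_GETBANDINFOA = WM_USER + 29

KNOWN_MESSAGES = {
    RB_GETBANDINFOW: "RB_GETBANDINFOW",
    RB_GETBANDINFOA: "RB_GETBANDINFOA",
}


def describe_message(msg):
    """Name a message id seen in the debugger, falling back to WM_USER + n."""
    if msg in KNOWN_MESSAGES:
        return KNOWN_MESSAGES[msg]
    if msg >= WM_USER:
        return f"WM_USER + {msg - WM_USER}"
    return f"system message 0x{msg:04X}"


print(describe_message(1053))  # RB_GETBANDINFOA
```

A Unicode build would have shown 1052 (RB_GETBANDINFOW) instead, with the same consequences.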
If we look at the documentation for this message, we can see that its lParam is a pointer to a REBARBANDINFO structure. Since this is a cross-process call, any pointer to memory is invalid in the receiving process unless both the pointer and the data it points to are marshalled into that process.
It turns out that Windows does marshal pointer parameters for a number of system messages, but only for messages below WM_USER, and RB_GETBANDINFO is above WM_USER. So it is not safe to send an RB_GETBANDINFO message across processes.
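The rule can be written as a simple predicate. This is a minimal sketch: the function names are hypothetical, and a False result does not prove a message is safe, because only specific system messages are actually marshalled.

```python
WM_USER = 0x0400
WM_SETTEXT = 0x000C  # winuser.h; lParam points to a string, marshalled by the system


def in_system_range(msg):
    """System messages (0 .. WM_USER - 1) are the only ones Windows can
    marshal across processes; being in this range is necessary, not sufficient."""
    return 0 <= msg < WM_USER


def pointer_cannot_be_marshalled(msg, passes_pointer):
    """True when a pointer parameter would arrive in another process unchanged,
    so the receiver would dereference an address from the sender's space."""
    return passes_pointer and not in_system_range(msg)
```

Plain integer parameters survive the trip either way; the danger is specifically a pointer-carrying message at or above WM_USER, which is exactly the RB_GETBANDINFO case.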
To make sure I was right about the bug, I verified that the receiving process was dereferencing the lParam argument pushed by the calling process. Once that was verified, I knew the cause of the crash: the toolbar was sending an RB_GETBANDINFO message to its parent rebar control, and that would have to be fixed.
What is the fix?
In this case, the RB_GETBANDINFO message turned out not to be needed, so it was simply removed. If the information from RB_GETBANDINFO were truly required, there are a number of options that could be explored. Here are just a couple:
1. Find a way to inject code into the parent process and use an IPC mechanism to invoke that code, which in turn would send the RB_GETBANDINFO message within the correct process.
2. There is an interesting approach on CodeProject that allocates memory within the target process (cross-process allocation is what VirtualAllocEx provides). I haven't tried it, but here it is: http://www.codeproject.com/KB/winsdk/CProcessData.aspx.
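Any approach that allocates memory in the target process follows a copy-in, send, copy-out pattern. The sketch below only simulates that pattern in a portable way. It uses a hypothetical, simplified three-field structure (not the real REBARBANDINFO layout) and a SimulatedProcess class standing in for another address space. In Win32 the roles would roughly be played by GetWindowThreadProcessId and OpenProcess to reach the target, then VirtualAllocEx, WriteProcessMemory, SendMessage, ReadProcessMemory and VirtualFreeEx. I have not checked this against the CodeProject code.

```python
import struct

# Hypothetical, simplified layout: cbSize, fMask, cx as three 32-bit UINTs.
# The real REBARBANDINFO is larger and contains pointers (e.g. lpText) that
# would also have to refer to memory inside the target process.
BAND_INFO = struct.Struct("<III")
MASK_SIZE = 0x1  # hypothetical flag: "fill in cx"


class SimulatedProcess:
    """Stands in for another address space: the addresses it hands out
    are meaningful only inside this object."""

    def __init__(self, band_widths):
        self._memory = {}
        self._next_addr = 0x10000
        self._band_widths = band_widths

    def alloc(self, size):  # role of VirtualAllocEx
        addr = self._next_addr
        self._next_addr += 0x1000
        self._memory[addr] = bytearray(size)
        return addr

    def write(self, addr, data):  # role of WriteProcessMemory
        self._memory[addr][: len(data)] = data

    def read(self, addr, size):  # role of ReadProcessMemory
        return bytes(self._memory[addr][:size])

    def free(self, addr):  # role of VirtualFreeEx
        del self._memory[addr]

    def handle_get_band_info(self, index, lparam):
        """The rebar's side of SendMessage: dereference lparam in this
        process's memory. A foreign address raises KeyError, the
        simulation's stand-in for the access violation."""
        buf = self._memory[lparam]
        cb_size, mask, _ = BAND_INFO.unpack_from(buf)
        if mask & MASK_SIZE:
            BAND_INFO.pack_into(buf, 0, cb_size, mask, self._band_widths[index])
        return 1


def remote_band_width(target, index):
    """Copy the request in, 'send' it, copy the result out, then free."""
    addr = target.alloc(BAND_INFO.size)
    try:
        target.write(addr, BAND_INFO.pack(BAND_INFO.size, MASK_SIZE, 0))
        target.handle_get_band_info(index, addr)
        _, _, cx = BAND_INFO.unpack(target.read(addr, BAND_INFO.size))
        return cx
    finally:
        target.free(addr)
```

For example, remote_band_width(SimulatedProcess([120, 300]), 1) returns 300, while passing an address the target never allocated fails the same way the toolbar's RB_GETBANDINFO call did.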
That's that. I hope somebody finds this helpful.
