ADM390
Microsoft® Windows® Crash
Dump Analysis
Mark Russinovich
Winternals Software
David Solomon
David Solomon Expert Seminars
About The Speakers
Authors of:
Inside Windows 2000, 3rd Edition
(Microsoft Press)
Inside Windows 2000/XP/2003 Interactive Internals
Video Tutorial
Used by Microsoft for worldwide internal training
David Solomon:
Teaches Windows internals classes
(www.solsem.com)
Writes books and articles on Windows internals
Mark Russinovich:
Author of tools on www.sysinternals.com
Co-founder and Chief Software Architect for
Winternals Software (www.winternals.com)
Teaches Windows internals classes
Writes books and articles on Windows internals
Outline
What causes crashes?
Crash dump options
Analysis with WinDbg/Kd
Debugging hung systems
Microsoft Online Crash Analysis
Using Driver Verifier
Live kernel debugging
Getting past a crash
Introduction
Many systems administrators ignore
Windows NT/Windows 2000’s crash dump
options
“I don’t know what to do with one”
“It’s too hard”
“It won’t tell me anything anyway”
Basic crash dump analysis is actually pretty
straightforward
Even if only 1 out of 5 or 10 dumps tells you
what’s wrong, isn’t it worth spending a few
minutes?
Why Analyze Dumps?
The debuggers and Microsoft Online
Crash Analysis (OCA) often solve crashes
Sometimes, however, they do not, so your
analysis might tell you:
What driver to disable, update, or replace with
different hardware
What OEM to send the dump to
What Causes Crashes?
System crashes when a fatal error prevents
further execution
Any kernel-mode component can crash the
system
Drivers and the OS share the same memory
space
Therefore, any driver or OS component can,
due to a bug, corrupt system memory
Note: This is for performance reasons and is the
same on Linux, most Unix variants, VMS, etc.
What Are The Root Causes?
Anecdotal evidence suggests:
Buggy drivers
Bugs in the OS
Hardware failure/error
Cosmic rays
At The Crash
A component calls KeBugCheckEx, which takes
five arguments:
Stop code
4 stop-code defined parameters
KeBugCheckEx:
Turns off interrupts
Tells other CPUs to stop
Paints the blue screen
Notifies registered drivers of the crash
If a dump is configured:
Verifies checksums
Calls dump I/O functions
Common Stop Codes
There are about 150 defined stop codes
Shared by many components and drivers
Common ones include:
IRQL_NOT_LESS_OR_EQUAL (0x0A)
Usually an invalid memory access
UNEXPECTED_KERNEL_MODE_TRAP (0x7F) and
KMODE_EXCEPTION_NOT_HANDLED (0x1E)
Generated by executing garbage instructions
Usually caused when a stack is trashed
Documented in Debugger Tools help file
Often, multiple articles in Knowledge Base
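A stop code plus its four parameters is all the blue screen summarizes, so a small lookup can make the codes above easier to read. The following is a minimal sketch: the table holds only the three codes named on this slide (0x7F appears under the name the Debugging Tools documentation uses), and the function name and output layout are illustrative assumptions, not part of any Microsoft tool.

```python
# Names for the three stop codes listed on the slide; the full list is in the
# Debugging Tools help file. This table is deliberately incomplete.
STOP_CODE_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
}


def format_stop_line(code, params):
    """Render a stop code and its four parameters as one summary line.

    Hypothetical helper: the layout imitates a blue screen STOP line.
    """
    if len(params) != 4:
        raise ValueError("a bug check carries exactly four parameters")
    name = STOP_CODE_NAMES.get(code, "UNKNOWN_STOP_CODE")
    args = ",".join(f"0x{p:08X}" for p in params)
    return f"*** STOP: 0x{code:08X} ({args}) {name}"


if __name__ == "__main__":
    # Parameter values here are made up for illustration.
    print(format_stop_line(0x0A, (0xE1234000, 2, 0, 0xF8A51234)))
```

The meaning of each parameter depends on the stop code, which is why the help file entry for the code is the next thing to read.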
Dump Options
Complete memory dump (Windows NT 4,
Windows 2000, Windows XP)
Full contents of memory written to
<systemroot>\memory.dmp
Kernel memory dump (Windows 2000, Windows
XP, Server 2003)
System memory written to <systemroot>\memory.dmp
Small memory dump (Windows 2000, Windows
XP, Server 2003)
Also called a minidump or triage dump
64KB of summary written to
<systemroot>\minidump\MiniMMDDYY-NN.dmp
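The minidump naming pattern above (MiniMMDDYY-NN.dmp) can be sketched as a small function. This is only an illustration of the pattern: the function name is hypothetical, and the assumption that the sequence number is zero-padded to two digits and starts at 01 is mine, not something the slide states.

```python
from datetime import date


def minidump_name(crash_date, sequence):
    """Build a MiniMMDDYY-NN.dmp file name from a date and a sequence number.

    Assumption: NN is a two-digit, zero-padded counter for dumps on one day.
    """
    if not 1 <= sequence <= 99:
        raise ValueError("sequence must fit in two digits")
    return f"Mini{crash_date:%m%d%y}-{sequence:02d}.dmp"


if __name__ == "__main__":
    print(minidump_name(date(2003, 6, 5), 1))
```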
Enabling Dumps
In Windows 2000/XP/2003: Computer Properties->Advanced->Startup and Recovery
What Happens When Crash
Dumps Are Enabled
When the system boots it checks
HKEY_LOCAL_MACHINE\System\
CurrentControlSet\Control\CrashControl
The boot disk paging file’s on-disk mapping
is obtained
Relevant components are checksummed:
Boot disk miniport driver
Crash I/O functions
Page file map
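To see which dump type a machine is configured for, the CrashControl key named above can be read directly. The sketch below assumes the CrashDumpEnabled value and its commonly documented meanings for Windows 2000/XP/2003 (0 to 3); the function names are hypothetical, and the registry read only runs on Windows.

```python
import sys

CRASH_CONTROL_KEY = r"System\CurrentControlSet\Control\CrashControl"

# Assumed meanings of CrashDumpEnabled on Windows 2000/XP/2003.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
}


def describe_dump_setting(value):
    """Translate a CrashDumpEnabled value into the slide's terminology."""
    return DUMP_TYPES.get(value, f"unrecognized value {value}")


def read_crash_dump_enabled():
    """Read CrashDumpEnabled from the local registry (Windows only)."""
    import winreg  # standard library, but present only on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if __name__ == "__main__":
    if sys.platform == "win32":
        print(describe_dump_setting(read_crash_dump_enabled()))
    else:
        print("registry lookup skipped: not running on Windows")
```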
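At The Reboot
[Diagram: in user mode, Session Manager, WinLogon, and SaveDump; in kernel mode, NtCreatePagingFile and the paging file. SaveDump writes Memory.dmp. The numbers 1 to 4 mark the steps described on the following slides.]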
At The Reboot
Session Manager process
(\windows\system32\smss.exe) initializes
the paging file by calling NtCreatePagingFile (step 1)
NtCreatePagingFile determines if the dump
has a crash header (step 2)
Protects the dump from use
WinLogon calls NtQuerySystemInformation
to tell if there’s a dump
At The Reboot
If there’s a dump, WinLogon executes
SaveDump (\windows\system32\savedump.exe) (step 3)
Writes an event to the System event log
SaveDump writes contents to appropriate
file (step 4)
Crash dump portion of paging file is in use
during copy, so virtual memory can run low
Why Crash Dumps Fail
Most common reasons:
Paging file on boot volume is too small
Not enough free space for extracted dump
Less common:
The crash corrupted components involved in the
dump process
Miniport driver doesn’t implement dump I/O
functions
Windows storage drivers must implement dump I/O to
get a Microsoft® digital signature
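The most common failure, a paging file that is too small, can be checked ahead of time. The sketch below rests on sizing guidance I am assuming from Microsoft Knowledge Base articles of this era: a complete dump needs physical RAM plus 1 MB on the boot volume, and a small dump needs 2 MB. A kernel dump has no fixed figure, so the helper returns None for it. All names are hypothetical.

```python
MB = 1024 * 1024


def min_paging_file_bytes(dump_type, physical_ram_bytes):
    """Return the assumed minimum boot-volume paging file size, or None."""
    if dump_type == "complete":
        return physical_ram_bytes + 1 * MB  # assumption: RAM + 1 MB
    if dump_type == "small":
        return 2 * MB  # assumption: 2 MB is enough for a 64 KB minidump
    if dump_type == "kernel":
        return None  # depends on how much kernel memory is in use
    raise ValueError(f"unknown dump type: {dump_type}")


def paging_file_large_enough(dump_type, physical_ram_bytes, paging_file_bytes):
    """True/False when a minimum is known, None when it cannot be computed."""
    needed = min_paging_file_bytes(dump_type, physical_ram_bytes)
    if needed is None:
        return None
    return paging_file_bytes >= needed


if __name__ == "__main__":
    print(paging_file_large_enough("complete", 512 * MB, 512 * MB))
```

Free space for the extracted dump file is a separate requirement and is not covered by this check.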
Microsoft Online Crash Analysis (OCA)
By default, after a reboot Windows XP/Server
2003 prompts you to send information
to http://oca.microsoft.com
Can be configured with Computer
Properties->Advanced->Error Reporting
Can be customized with Group Policies
What Does OCA Do?
The OCA server farm runs !analyze, but with
Microsoft’s Triage.ini file and a database that
includes information about known problems
Several ways to get OCA results:
Via e-mail
At the OCA site
Sometimes OCA will point you at KB
articles that describe the problem
KB articles may tell you to use Windows
Update to get newer drivers, a hotfix, or install
a Service Pack
Analyzing a Crash Dump
If OCA doesn’t help you, or you have an NT4 or
Windows 2000 dump, then you need to open it
with one of the kernel debuggers:
WinDbg – Windows program
Kd – command-line program
Both provide same kernel debugger analysis commands
Part of the Debugging Tools for Windows
Free download from
http://www.microsoft.com/whdc/ddk/debugging/default.mspx
Supports Windows NT 4, Windows 2000, Windows XP,
Server 2003
Check for updates frequently
Don’t use older version on install media
Symbol Files
Before you can use any crash analysis tool you
need symbol files
Symbol files contain global function and variable names
Symbols are service pack-specific and have an
installer (default directory is \windows\symbols)
Windows NT 4: *.dbg
Windows 2000: *.dbg, *.pdb
Windows XP/2003: *.pdb
Note: Service Pack symbols only include updates
Microsoft Symbol Server
WinDbg and Kd can download symbols
automatically from Microsoft
Pick a directory to install symbols and add
the following to the debugger’s symbol
path:
SRV*directory*http://msdl.microsoft.com/download/symbols
The debugger automatically detects the OS
version of a dump and downloads the
symbols on-demand
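The symbol path entry is just three fields joined by asterisks, which the following sketch makes explicit. The function name and the example directory are illustrative assumptions; the URL is the one given on the slide.

```python
SYMBOL_SERVER_URL = "http://msdl.microsoft.com/download/symbols"


def symbol_server_path(local_dir, server_url=SYMBOL_SERVER_URL):
    """Build an SRV*directory*url symbol path entry.

    local_dir is the local cache directory the debugger downloads into.
    """
    if "*" in local_dir:
        raise ValueError("the cache directory must not contain '*'")
    return f"SRV*{local_dir}*{server_url}"


if __name__ == "__main__":
    # c:\symbols is only an example location.
    print(symbol_server_path(r"c:\symbols"))
```

The resulting string can be pasted into the debugger’s symbol path setting or, as one option, placed in the _NT_SYMBOL_PATH environment variable.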
Automated Analysis
When you open a crash dump with WinDbg
or Kd you get a basic crash analysis:
Stop code and parameters
A guess at offending driver
The analysis is the result of the automated
execution of the !analyze debugger
command
Automated Analysis
Always execute !analyze with the -v option
to get more information
Text description of stop code
Meaning (if any) of parameters
Stack dump
!analyze uses heuristics to walk up the
stack and determine what driver is the likely
cause of the crash
“Followup” is taken from optional triage.ini file
Manual Analysis
Sometimes automated analysis isn’t enough
!analyze doesn’t tell you anything useful
You want to know what else was happening at the time of the
crash
Useful commands:
Examine current thread: !thread
May or may not be related to the crash
List all processes: !process 0 0
Make sure you understand what was running on the system
Examine a specific process: !process <pid> 7
List loaded drivers: lm kv
Make sure drivers are all recognized and up to date
Look at memory usage: !vm
Create a smaller dump file: .dump
Additional commands: !help
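When the same commands are run against many dumps, it can help to script Kd rather than type them each time. The sketch below only builds the command line; it assumes Kd’s -z (dump file), -y (symbol path), and -c (initial commands) options, and that a trailing q command exits the debugger. The function name and file names are hypothetical.

```python
def kd_command_line(dump_file, symbol_path, commands):
    """Return an argv list that runs the given debugger commands on a dump."""
    # Commands are separated by semicolons; 'q' quits when they finish.
    script = "; ".join(list(commands) + ["q"])
    return ["kd", "-z", dump_file, "-y", symbol_path, "-c", script]


if __name__ == "__main__":
    argv = kd_command_line(
        "memory.dmp",
        r"SRV*c:\symbols*http://msdl.microsoft.com/download/symbols",
        ["!analyze -v", "!process 0 0", "lm kv", "!vm"],
    )
    print(argv)
```

Passing the list to subprocess.run on a machine with the Debugging Tools installed would run the session and leave the output on standard output for later review.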
Driver Verifier
If you find a driver in a crash dump that looks like
it might be the cause of the crash, turn on
verification for it
If the Verifier detects a violation it crashes
the system and identifies the driver
Use “Last Known Good” if the verifier detects a bug
during the boot
If a bug is detected in a third-party product check for
updates and/or contact the vendor’s support
NotMyFault.exe
In order to demonstrate common crash scenarios, use NotMyFault.exe
Download from http://www.sysinternals.com/files/notmyfault.zip
It loads MyFault.sys
MyFault.sys has an IOCTL interface that implements different bugs
[Diagram: user mode above kernel mode, with the IOCTL interface leading into MyFault.sys in kernel mode]
IRQL_NOT_LESS_OR_EQUAL
Run NotMyFault and select “High IRQL fault (kernel
mode)”
Allocates paged pool buffer
Frees the buffer
Raises IRQL ≥ DISPATCH_LEVEL
Touches the buffer
Paged buffers that are marked “not present” but are
touched when IRQL ≥ DISPATCH_LEVEL result in the
IRQL_NOT_LESS_OR_EQUAL bug check
Memory Manager calls KeBugCheckEx from page fault handler
The IRQL is not less than or equal to the maximum IRQL at
which the operation is legal (which is < DISPATCH_LEVEL)
Using the Stack in Analysis
!analyze easily identifies MyFault.sys by
looking at the KeBugCheckEx parameters
The Memory Manager looked at the stack and
determined the address that caused the page
fault
!analyze often looks at the stack to determine
the cause of a crash
Stacks
Each thread has a user-mode and kernel-mode
stack
The user-mode stack is usually 1 MB on x86
The kernel-mode stack is typically 12 KB on x86
systems
Stacks allow for nested function invocation
Parameters can be passed on the stack
Stores return address
Serves as storage for local variables
Stack Frames
[Diagram, higher addresses at top; each frame is listed from higher to lower addresses]
Function 1: Parameter 1, Return Address, Frame Pointer, Local Variable 1, Local Variable 2
Function 2: Parameter 3, Parameter 2, Parameter 1, Return Address, Frame Pointer, Local Variable 1, Local Variable 2
Function 3: Parameter 2, Parameter 1, Return Address, Frame Pointer, Local Variable 1
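With the frame layout in the diagram, each saved frame pointer leads to the caller’s frame, and the return address sits in the slot just above it. A debugger’s stack trace is essentially a walk along that chain. The sketch below runs the walk over a simulated memory image (a dict of address to 32-bit value); the addresses and the function name are made up for illustration, and real debuggers also use symbol information that this toy ignores.

```python
def walk_frame_chain(memory, frame_pointer, max_frames=64):
    """Follow saved frame pointers and collect return addresses.

    memory maps an address to the 32-bit value stored there. At each frame,
    memory[fp] holds the caller's frame pointer and memory[fp + 4] holds the
    return address (x86 layout with frame pointers enabled).
    """
    return_addresses = []
    seen = set()
    while frame_pointer and frame_pointer not in seen:
        if len(return_addresses) >= max_frames:
            break
        seen.add(frame_pointer)  # guards against a corrupted, looping chain
        if frame_pointer not in memory or frame_pointer + 4 not in memory:
            break  # chain points outside the captured stack
        return_addresses.append(memory[frame_pointer + 4])
        frame_pointer = memory[frame_pointer]
    return return_addresses


if __name__ == "__main__":
    # Three nested frames; the outermost saved frame pointer is 0.
    stack = {
        0x1000: 0x1020, 0x1004: 0x401234,
        0x1020: 0x1040, 0x1024: 0x401500,
        0x1040: 0x0000, 0x1044: 0x401900,
    }
    print([hex(a) for a in walk_frame_chain(stack, 0x1000)])
```

A trashed stack breaks exactly this chain: once a saved frame pointer or return address is overwritten, the walk stops early or reports garbage.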
Stacks
Other calling conventions make the stack hard to
figure out
No frame pointer
Register arguments (fast calls)
Debugger requires symbol information to parse
The stack is the #1 analysis resource
It requires that a driver get “caught in the act”
Sometimes that’s not possible without the Driver
Verifier’s help
Stack Trashing
Stack trashes have several possible causes:
A driver pushing things on the stack causes the stack
to overflow
A driver overruns a stack-allocated buffer
Usually results in garbage code being executed
(KMODE_EXCEPTION_NOT_HANDLED)
Driver Verifier can’t determine cause
Since the stack is corrupted, analysis is especially
hard
Debugging Stack Trashes
Run NotMyFault and select “Stack Trash”
Allocates a buffer on the stack
Overruns the buffer
Returns to the caller
Crash doesn’t show much offhand
!analyze actually blames Win32K.sys, the Win32 kernel-mode
subsystem
Stack doesn’t show anything except an exception handler
Look deeper
!thread shows an outstanding IRP
!irp <irp> shows that myfault.sys was the target of the IRP
Buffer Overruns
Result when a driver goes past the end
(overrun) or the beginning (underrun) of a buffer
Usually detected when overwritten data is referenced
Another driver or the kernel makes the reference
There can be a long delay between corruption and detection
[Diagram, higher addresses at top: Another Driver’s Buffer, Pool Structures, Driver Buffer]
Causing a Buffer Overrun
Run NotMyFault and select “Buffer Overrun”
Allocates a nonpaged pool buffer
Writes a string past the end
Note that you might have to run several times
since a crash will occur only if:
The kernel references the corrupted pool structures
A driver references the corrupted buffer
The crash tells you what happened, but not why
A Buffer Overrun Bluescreen
In this example, where the crash was the result of the
kernel tripping on corrupt pool tracking structures, the
Bluescreen tells you what to do:
What is Special Pool?
Special pool is a kernel buffer area where buffers are
sandwiched with invalid pages
Conditions for a driver allocating from special pool:
Driver Verifier is verifying driver
Special pool is enabled
Allocation is slightly less than one page (4 KB on x86)
[Diagram, higher addresses at top: Page n+2 (Invalid); Page n+1 (Buffer at the top of the page, Signature below it); Page n (Invalid)]
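Placing the buffer at the end of its page is what lets an overrun fault immediately: the first byte past the buffer lands in the invalid page. The sketch below models that placement. It assumes a 4 KB page and 8-byte alignment of the buffer start, which are my simplifications; the function names are hypothetical.

```python
PAGE_SIZE = 4096  # x86 page size, as on the slide
ALIGNMENT = 8     # assumed alignment of the buffer's start address


def special_pool_offset(size):
    """Offset within the page where a buffer of this size would start."""
    if not 0 < size <= PAGE_SIZE:
        raise ValueError("special pool holds allocations of at most one page")
    rounded = (size + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT
    return PAGE_SIZE - rounded


def overrun_faults(size, write_length):
    """True if writing write_length bytes from the buffer start
    touches the invalid page that follows."""
    return special_pool_offset(size) + write_length > PAGE_SIZE


if __name__ == "__main__":
    print(special_pool_offset(4000))   # start offset for a 4000-byte buffer
    print(overrun_faults(4000, 4001))  # one byte too many
```

In this model, a size that is not a multiple of the alignment leaves a few slack bytes before the page boundary, so a very small overrun would not fault at once.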
Turning on Special Pool
Enable Special Pool verification on the suspect driver
The Verifier Catching Buffer
Overrun
The Driver Verifier catches the overrun when it occurs
The Bluescreen tells you whose fault it is
!analyze explains the crash and also tells you the buggy driver
name
The stack shows where the driver bug is
Code Overwrites
Caused when a bug results in a wild pointer
A wild pointer that points at invalid memory is easily detected
A wild pointer that points at data is similar to buffer overrun
Might not cause a problem for a long time
Crash makes it look like it’s something else’s fault
Driver Verifier doesn’t catch code overwrite
System code write protection catches code overwrite,
but it’s not on if:
It’s a Windows 2000 system with > 127 MB memory
It’s a Windows XP or Server 2003 system with > 255 MB
Something has disabled it
Causing a Code Overwrite
Run NotMyFault and select “Code Overwrite”
Overwrites first bytes of nt!NtReadFile
Function is most common entry to I/O system so a random thread
will cause the crash
The crash hints that the fault occurred in NtReadFile
The last user-mode address is ZwReadFile
The ebx register in the exception frame points at NtReadFile
NtReadFile’s start location looks scrambled (u nt!NtReadFile)
System Code Write Protection
Make sure system code write protection is on
Set HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management
LargePageMinimum REG_DWORD 0xFFFFFFFF
EnforceWriteProtection REG_DWORD 1
Reboot to take effect
Rerun NotMyFault
Crash occurs immediately and even the blue screen points at
MyFault.sys:
!analyze shows the address of the write and the target (NtReadFile)
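The two registry values above can be captured in a .reg file so the setting is easy to apply to several test machines. The sketch below generates the file text; it assumes the Version 5.00 .reg format used by Windows 2000 and later, and the function name is hypothetical.

```python
MEMORY_MANAGEMENT_KEY = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control"
    r"\Session Manager\Memory Management"
)


def write_protection_reg_text():
    """Return .reg file text that sets the two values from the slide."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{MEMORY_MANAGEMENT_KEY}]",
        '"LargePageMinimum"=dword:ffffffff',
        '"EnforceWriteProtection"=dword:00000001',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(write_protection_reg_text())
```

As the slide notes, the change takes effect only after a reboot.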
Hung Systems
You can tackle a hung system, but only if you’ve
prepared:
Boot in debug mode, or
Set the keystroke-crash Registry value
For debug mode you need a second system (the
debugger host) connected to the target via serial
cable
Run WinDbg/Kd on the host
Edit the target’s boot.ini file:
/debugport=comX /baudrate=XXX
When the system hangs, connect with the debugger
and hit Ctrl-C
Hung Systems
To configure keystroke-crash:
Set HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\i8042prt\Parameters\CrashOnCtrlScroll to 1
Hold down the right Ctrl key and press Scroll Lock twice to crash
the system
Use !thread to see what’s running
Examine loaded drivers, IRQL, …
Getting Past a Crash
Last-Known Good
Boots with driver/kernel configuration last used during
a successful boot
Safe Mode
Boots the system with core set of drivers and services
Network and non-network
Recovery Console
Manually disable offending service, replace corrupt
images, update files
ERD Commander 2003
Registry Editor, Explorer, Driver/Service Manager,
password changer, Event Log viewer, Notepad
The Bluescreen Screen Saver
Scare your enemies and fool your friends
with the Sysinternals Bluescreen Screen
Saver
Be careful, your job may be on the line!
More Information
Inside Windows 2000, 3rd edition
Section on System Crashes in chapter 4
Debugging Tools help file
Knowledge Base Articles
http://www.microsoft.com/whdc/ddk/debugging/DBG-KB.mspx
Usenet newsgroup microsoft.public.windbg
for discussion of debugger issues
The debugger team wants your feedback
and bug reports - mail suggestions or bug
reports to
[email protected]
Community Resources
http://www.microsoft.com/communities/default.mspx
Most Valuable Professional (MVP)
http://www.mvp.support.microsoft.com/
Newsgroups
Converse online with Microsoft Newsgroups, including Worldwide
http://www.microsoft.com/communities/newsgroups/default.mspx
User Groups
Meet and learn with your peers
http://www.microsoft.com/communities/usergroups/default.mspx
© 2003 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.