ia32lib v1.0
Document Revision 1.0.020105
This is intended to be a short description of the ia32lib library. You can use
the links below to jump to the different sections. Use the HOME key on your
keyboard to return to the top. If you want to get the big picture without
reading too much, I would suggest skipping the Reference
section. If you don't feel like reading this at all, scan through the
Examples section.
If you have any questions not discussed herein, please contact me via
e-mail.
I will appreciate any feedback you might have at any stage (usage,
design, source, documentation...). Enjoy!
[ Introduction | Overview |
Requirements | Installation |
Reference | Examples | Future
Work ]
Introduction
Modern processors based on IA-32 have performance counter registers that allow
programmers to count various statistical events about their running
applications. Programming these counters to count exactly what a programmer
wants, and reading their values, requires access to the so-called Model Specific
Registers (MSRs) of the processor. There are several instructions in the IA-32
ISA that provide access to these special registers (e.g. RDMSR, WRMSR,
RDPMC), but they are either privileged (restricted to kernel ring-0 mode
only) or have other restrictions which, combined with the security model of
the operating system, prevent the application programmer from using them.
The main purpose of the ia32lib library is to export a user-level programming
interface to these crucial performance measurement facilities and to provide
appropriate ways for detailed processor detection (family, model, cache
configuration, ...).
Both the library and its full source code are free for personal use and can be
freely downloaded. I have not yet figured out the policy for commercial uses,
but who knows...
Overview
ia32lib consists of two main parts plus some examples:
-
ia32.sys - a Windows NT/2K/XP Kernel-Mode Driver that provides access to
IA-32's MSRs;
-
ia32.lib + ia32.h - a static library which provides an easy-to-use interface to
ia32.sys and some other nice features like CPU model and cache configuration
detection;
-
ia32detect.cpp and ia32p6.cpp - two examples that use ia32lib. They are
discussed in more detail later (see Examples).
The sources to build ia32.sys are provided for completeness and educational
purposes only. It is not advisable to try rebuilding ia32.sys unless you really
know what you are doing. Furthermore, you will need the Microsoft Windows NT
Driver Development Kit (DDK), freely available from Microsoft's site.
Requirements
-
A PC running Microsoft Windows NT / 2K / XP;
-
Microsoft Visual C++ 6.0 or Microsoft Visual Studio .NET (7.0);
-
Optional: Intel C++ Optimizing Compiler 5.0.
NOTES: The compilers ia32lib has been tested with so far are Microsoft C++ 6.0
and 7.0 and Intel C++ Optimizing Compiler 5.0. Every effort has been made to
make the code portable to other compilers, but no tests have been performed so
far. The next compilers to look at will probably be Watcom C++ 11.0c and
Borland C++ 5.5. Compatibility with the three compilers mentioned above is
guaranteed as long as the development effort continues.
Installation
-
Download the distribution (if you have not already done so);
-
Start ia32lib.exe - this will unpack all files to a directory of your choice;
-
Install the ia32.sys Kernel-Mode Driver on your Windows system (see the
step-by-step instructions);
-
Open the ia32lib.dsw workspace or the ia32lib.sln solution with the appropriate
version of Microsoft Visual C++ (6.0 or .NET 7.0, respectively);
-
You are ready to go! Try building the two sample programs (ia32detect and
ia32p6).
NOTES: Installation of the ia32.sys driver does not require a system restart on
Windows XP. The installation instructions are also written for Windows XP, but
the steps should be analogous on Windows NT and Windows 2000. Note that at this
point I have not tried the driver on Windows NT or Windows 2000, but it should
work :). If you have any problems,
mail me!
The directory structure of the distribution is as follows:
-
ia32 - root directory of the distribution
-
doc - documentation directory
-
steps - images for step-by-step driver installation
-
generic.css - style sheet for the documentation
-
install.htm - step-by-step driver installation instructions
-
index.htm - main documentation file (similar to this one, if not the
same :)
-
drv - NT Kernel-Mode Driver directory
-
source - driver sources
-
makefile - required part of the NT DDK environment build process
-
sources - required part of the NT DDK environment build process
-
ia32.c - main driver source, based on the portio
example in the NT DDK
-
ring0.c - IA-32 assembly support routines
-
ring0.h - IA-32 assembly support routines header
-
ia32.rc - driver version info resource
-
ia32.inf - driver installation information file (needed by the "Add Hardware
Wizard" to install the driver)
-
ia32.sys - compiled binary of the driver itself
-
inc - host for all header files of the ia32lib library
-
ia32.h - main header file, includes all others. This is the only one you need
to include
-
ia32cache.h - describes possible cache configurations for IA-32
-
ia32counter.h - defines the abstract base class for performance counters
-
ia32def.h - defines basic types
-
ia32detect.h - defines the IA-32 CPU detection class
-
ia32driver.h - provides interface constants for use with the driver (also
included by the driver itself)
-
ia32error.h - defines the error exception class
-
ia32ring0.h - defines the class that exposes the driver API to the application
-
ia32size.h - defines an auxiliary class for managing memory sizes
-
p6counter.h - specializes ia32counter for the Intel P6 processor family
(Pentium Pro, II and III)
-
lib - source files needed to build ia32.lib
-
ia32.cpp - used solely for pre-compiled header generation (Microsoft Visual
C++ feature)
-
ia32cache.cpp - initializers for known cache configurations... needs
additions, so keep an eye on it
-
ia32counter.cpp - initializes ia32counter's static variable "count"
-
out - all output files of the build process are placed here
-
ia32.lib - pre-built release version of the library
-
ia32detect.exe - pre-built release version of the ia32detect sample
application
-
ia32p6.exe - pre-built release version of the ia32p6 sample application
(requires the ia32.sys driver to be installed)
-
prj - support files for Microsoft Visual C++
-
vc.6 - support files for Microsoft Visual C++ 6.0
-
ia32detect - ia32detect.exe sample application project directory
-
ia32detect.dsp - ia32detect sample application project file
-
ia32lib - ia32.lib library project directory
-
ia32lib.dsp - ia32.lib library project file
-
ia32p6 - ia32p6.exe sample application project directory
-
ia32p6.dsp - ia32p6.exe sample application project file
-
ia32lib.dsw - Microsoft Visual C++ 6.0 Project Workspace (open this
thing inside the environment)
-
vc.net - support files for Microsoft Visual C++ .NET
-
ia32detect - ia32detect.exe sample application project directory
-
ia32detect.vcproj - ia32detect sample application project file
-
ia32lib - ia32.lib library project directory
-
ia32lib.vcproj - ia32.lib library project file
-
ia32p6 - ia32p6.exe sample application project directory
-
ia32p6.vcproj - ia32p6.exe sample application project file
-
ia32lib.sln - Microsoft Visual C++ .NET Solution (open this thing inside
the environment)
-
src - examples source directory
-
ia32detect - source directory for the ia32detect.exe example
-
ia32detect.cpp - source for the ia32detect.exe example
-
ia32p6 - source directory for the ia32p6.exe example
-
ia32p6.cpp - source for the ia32p6.exe example
Reference
[ ia32def.h | ia32size.h |
ia32error.h | ia32driver.h |
ia32ring0.h | ia32cache.h | ia32detect.h
| ia32counter.h | p6counter.h
]
This part is mostly a top-down description of all features in the library.
Each header file is discussed separately and in detail. As mentioned above, if
you don't feel like reading, this is the part to skip :).
ia32def.h
types
name | equivalent
byte | unsigned char
word | unsigned
bit | unsigned
uint8 | unsigned __int8
uint16 | unsigned __int16
uint32 | unsigned __int32
uint64 | unsigned __int64
Notes:
-
bit is used as the declared type of bit-fields in structures (see
ia32detect.h for examples).
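The `__int8`...`__int64` types are Microsoft/Intel compiler extensions. As a hedged sketch, assuming a C++11 compiler with `<cstdint>` (which the compilers listed under Requirements predate), the same typedefs can be written portably and their sizes checked at compile time. `version_demo` is a hypothetical struct, loosely modeled on ia32detect's `version_`, that only shows `bit` used as a bit-field type.

```cpp
#include <cstdint>

// Portable stand-ins for the ia32def.h types (assumes C++11 <cstdint>).
typedef unsigned char byte;
typedef unsigned      word;
typedef unsigned      bit;
typedef std::uint8_t  uint8;
typedef std::uint16_t uint16;
typedef std::uint32_t uint32;
typedef std::uint64_t uint64;

// Compile-time checks that the fixed-width types have the expected sizes.
static_assert(sizeof(uint8)  == 1, "uint8 must be 1 byte");
static_assert(sizeof(uint16) == 2, "uint16 must be 2 bytes");
static_assert(sizeof(uint32) == 4, "uint32 must be 4 bytes");
static_assert(sizeof(uint64) == 8, "uint64 must be 8 bytes");

// Hypothetical example: 'bit' as the declared type of bit-fields.
struct version_demo
{
    bit Stepping : 4;
    bit Model    : 4;
    bit Family   : 4;
};
```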
ia32size.h
Constants
Name | Value
B | (uint64)1
KB | (1024 * B)
MB | (1024 * KB)
GB | (1024 * MB)
TB | (1024 * GB)
Classes
Name: ia32size
Definition:
class ia32size
{
uint64 size;
public:
ia32size (uint64);
operator const string () const;
operator const uint64 () const;
};
Notes:
-
ia32size's purpose is to convey memory sizes in easy to read textual
format;
-
ia32size::ia32size(uint64)
constructs an instance for a specific capacity value;
-
ia32size::operator const string () const
is used to convert the encapsulated value to a string (see example below);
-
ia32size::operator const uint64 () const is used to return the encapsulated
value in native integer format.
Example:
#include <cstdio>
#include "ia32size.h"
int main ()
{
printf("%8s\n", ((string)ia32size(16)).c_str());
printf("%8s\n", ((string)ia32size(1024)).c_str());
printf("%8s\n", ((string)ia32size(4096)).c_str());
printf("%8s\n", ((string)ia32size(3 * 1024 * 1024)).c_str());
printf("%8s\n", ((string)ia32size((uint64)13 * 1024 * 1024 * 1024 * 1024)).c_str());
printf("%8s\n", ((string)ia32size((uint64)13 * 1024 * 1024 * 1024 * 1024 + (uint64)7 * 1024 * 1024 * 1024)).c_str());
printf("%8I64u\n", (uint64)ia32size(12345678));
}
Output:
16 B
1KB
4KB
3MB
13TB
13319GB
12345678
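The output above suggests that the string conversion picks the largest unit that divides the value exactly, which is why 13TB + 7GB prints as 13319GB. The following is a minimal sketch of that rule, assuming it is indeed the rule used; `format_size` is a hypothetical name, not part of ia32lib, and it does not reproduce the `%8s` padding.

```cpp
#include <cstdint>
#include <string>

// Hypothetical re-creation of the formatting rule suggested by the output
// above; a sketch, not ia32size's actual implementation.
std::string format_size(std::uint64_t size)
{
    // " B" is padded to two characters so it lines up with "KB", "MB", ...
    static const char *const units[] = { " B", "KB", "MB", "GB", "TB" };
    int u = 0;
    // Move up one unit while the value is a non-zero exact multiple of 1024.
    while (u < 4 && size != 0 && size % 1024 == 0)
    {
        size /= 1024;
        ++u;
    }
    return std::to_string(size) + units[u];
}
```

With this rule a value is never rounded: it is shown in a larger unit only when no information is lost.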
ia32error.h
Classes
Name: ia32error
Definition:
class ia32error
{
public:
enum err_
{
err_generic,
err_ring0_cpu,
err_ring0_create,
err_ring0_ioctl,
err_ring0_size,
err_ring0_close,
err_counter_overflow,
err_counter_family,
err_counter_MMX,
err_counter_SSE,
err_counter_counter,
err_invalid
};
ia32error (err_);
operator const char * () const;
protected:
err_ v;
};
Notes:
-
ia32error
is a class whose instances are thrown as exceptions;
-
enum ia32error::err_
enumerates the different error values;
-
ia32error::ia32error (err_)
initializes an instance to a particular error value;
-
ia32error::operator const char * () const converts the encapsulated
error value to a string suitable for error display (for the list of specific
string values, look inside ia32error.h);
-
most ring-0 routines throw ia32errors as exceptions. For specific
examples see the ia32p6 sample.
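The throw-and-convert pattern described above can be sketched as follows. `demo_error`, its three enumerators, its message strings and the `open_driver`/`try_open` functions are all hypothetical stand-ins for illustration; the real error codes and messages live in ia32error.h.

```cpp
#include <string>

// Hypothetical, trimmed-down stand-in for ia32error; the enumerators and the
// message strings are illustrative, not copied from ia32error.h.
class demo_error
{
public:
    enum err_
    {
        err_generic,
        err_ring0_create,
        err_invalid
    };
    explicit demo_error(err_ e) : v(e) {}
    // Map the stored code to a message suitable for display.
    operator const char *() const
    {
        static const char *const text[] = {
            "generic error",
            "cannot open the ia32.sys driver",
            "invalid error code"
        };
        return text[v < err_invalid ? v : err_invalid];
    }
protected:
    err_ v;
};

// Hypothetical ring-0 style routine that reports failure by throwing.
void open_driver(bool available)
{
    if (!available)
        throw demo_error(demo_error::err_ring0_create);
}

// Typical caller: catch by const reference and use the string conversion.
std::string try_open(bool available)
{
    try
    {
        open_driver(available);
        return "ok";
    }
    catch (const demo_error &e)
    {
        return static_cast<const char *>(e);
    }
}
```

The explicit cast in `try_open` is needed because going from `demo_error` to `std::string` would take two user-defined conversions, which C++ does not chain implicitly.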
ia32driver.h
Constants
Name | Value
IA32CPU_TYPE | 40000
IOCTL_IA32CPU_READ_MSR | CTL_CODE(IA32CPU_TYPE, 0x900, METHOD_BUFFERED, FILE_READ_ACCESS)
IOCTL_IA32CPU_WRITE_MSR | CTL_CODE(IA32CPU_TYPE, 0x901, METHOD_BUFFERED, FILE_WRITE_ACCESS)
Notes:
-
Constants defined in this header are used both by the ia32.sys kernel-mode
driver and by the ia32ring0.h driver interface header;
-
IOCTL_IA32CPU_READ_MSR and IOCTL_IA32CPU_WRITE_MSR are the I/O control codes
passed to DeviceIoControl when calling the driver.
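To see which numbers DeviceIoControl actually receives, the CTL_CODE layout from the Windows DDK can be restated portably: device type in bits 31..16, required access in bits 15..14, function code in bits 13..2 and transfer method in bits 1..0. The function codes 0x900 and 0x901 fall in the vendor range (0x800 and above). `ctl_code` is a hypothetical helper, not part of ia32lib or the DDK, and the trailing underscores on the constants only avoid clashing with the real macros from `<winioctl.h>`.

```cpp
#include <cstdint>

// Portable re-statement of the DDK CTL_CODE macro layout.
constexpr std::uint32_t ctl_code(std::uint32_t device_type, std::uint32_t function,
                                 std::uint32_t method, std::uint32_t access)
{
    return (device_type << 16) | (access << 14) | (function << 2) | method;
}

// Values of the DDK constants used by ia32driver.h.
constexpr std::uint32_t METHOD_BUFFERED_   = 0;
constexpr std::uint32_t FILE_READ_ACCESS_  = 1;
constexpr std::uint32_t FILE_WRITE_ACCESS_ = 2;
constexpr std::uint32_t IA32CPU_TYPE_      = 40000;   // 0x9C40

constexpr std::uint32_t read_msr_ioctl =
    ctl_code(IA32CPU_TYPE_, 0x900, METHOD_BUFFERED_, FILE_READ_ACCESS_);
constexpr std::uint32_t write_msr_ioctl =
    ctl_code(IA32CPU_TYPE_, 0x901, METHOD_BUFFERED_, FILE_WRITE_ACCESS_);

// The resulting 32-bit codes, checked at compile time.
static_assert(read_msr_ioctl  == 0x9C406400u, "IOCTL_IA32CPU_READ_MSR");
static_assert(write_msr_ioctl == 0x9C40A404u, "IOCTL_IA32CPU_WRITE_MSR");
```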
ia32ring0.h
Classes
Name: ia32ring0
Definition:
class ia32ring0
{
HANDLE h;
public:
ia32ring0 ();
uint64 rdmsr (uint32 i) const;
void wrmsr (uint32 i, uint64 d) const;
~ia32ring0 ();
};
Notes:
-
ia32ring0 is the exported user-level API to the ia32.sys
driver, used to read and write IA-32 Model Specific Registers (MSRs);
-
ia32ring0::ia32ring0 ()
initializes a connection to the driver;
-
uint64 ia32ring0::rdmsr (uint32 i) const
uses the driver to read the i-th MSR and returns its value;
-
void ia32ring0::wrmsr (uint32 i, uint64 d) const
uses the driver to write the i-th MSR with the value contained in d;
-
ia32ring0::~ia32ring0 () closes the connection to the driver.
ia32cache.h
Classes
Name: ia32cache
Definition:
class ia32cache
{
public:
enum type_
{
type_reserved,
type_unified,
type_instruction,
type_trace,
type_data,
type_invalid
};
enum _
{
level_TLB = -1,
associativity_Full = -1,
block_AnySize = 0
};
const byte descriptor;
const type_ type;
const int level;
const ia32size capacity;
const ia32size block;
const int associativity;
ia32cache (byte, type_, int, ia32size, ia32size, int);
operator const string () const;
protected:
const char * type_text () const;
const string associativity_text () const;
};
Variables
Name | Declaration
ia32caches | extern const ia32cache ia32caches[];
Functions
Name | Prototype
_ia32cache | const ia32cache &_ia32cache (byte);
Notes:
-
ia32cache is a class describing cache memory parameters. For now a
number of predefined instances exist (see ia32cache.cpp
for the complete listing), but in the future the class will also be used to
describe caches detected empirically by software;
-
enum ia32cache::type_
enumerates the different types of caches supported;
-
enum ia32cache::_
enumerates some special values for otherwise integer fields like block size and
associativity;
-
const byte ia32cache::descriptor
contains the IA-32 defined byte descriptor of the cache;
-
const type_ ia32cache::type
contains the type of the cache;
-
const int ia32cache::level
contains the cache level (-1 means TLB cache);
-
const ia32size ia32cache::capacity
contains the size of the cache;
-
const ia32size ia32cache::block
contains the block size of the cache (0 means "Any Size" for page sizes in TLB
caches);
-
const int ia32cache::associativity
contains the associativity of the cache (-1 means "Fully-Associative");
-
ia32cache::ia32cache (byte, type_, int, ia32size, ia32size, int)
initializes a cache instance;
-
ia32cache::operator const string () const converts the cache instance to a
nice-looking string representation (see the ia32detect
sample for detailed examples);
-
const char * ia32cache::type_text () const returns a text representation
of the current value of the type
field;
-
const string ia32cache::associativity_text () const returns a text
representation of the current value of the associativity
field;
-
const ia32cache ia32caches[]
contains pre-initialized cache instances for all descriptors known so far;
-
const ia32cache &_ia32cache (byte) looks up a cache instance by
descriptor in the above array.
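A lookup in the spirit of `_ia32cache (byte)` can be sketched as a linear search over a sentinel-terminated table. `cache_info`, `caches` and `find_cache` are hypothetical names; the four descriptors are copied from the ia32detect output later in this document, and returning a sentinel entry for unknown descriptors is an assumption (the real function may behave differently, e.g. throw).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, trimmed-down cache descriptor record (not the real ia32cache).
struct cache_info
{
    std::uint8_t descriptor;
    const char  *text;
};

// A few descriptors taken from the ia32detect sample output; the last entry
// (descriptor 0, the null descriptor) is a sentinel for "unknown".
static const cache_info caches[] = {
    { 0x01, "TLB instruction, 32 entries, 4KB pages, 4-way" },
    { 0x08, "L1 instruction, 16KB, 32B blocks, 4-way" },
    { 0x0C, "L1 data, 16KB, 32B blocks, 4-way" },
    { 0x83, "L2 unified, 512KB, 32B blocks, 8-way" },
    { 0x00, "unknown descriptor" }
};

// Linear search: return the matching entry, or the sentinel when the
// descriptor is not in the table.
const cache_info &find_cache(std::uint8_t descriptor)
{
    std::size_t i = 0;
    while (caches[i].descriptor != 0 && caches[i].descriptor != descriptor)
        ++i;
    return caches[i];
}
```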
ia32detect.h
Classes
Name: ia32detect
Definition:
class ia32detect
{
public:
enum type_
{
type_OEM,
type_OverDrive,
type_Dual,
type_reserved
};
enum brand_
{
brand_na,
brand_Celeron,
brand_PentiumIII,
brand_PentiumIIIXeon,
brand_reserved1,
brand_reserved2,
brand_PentiumIIIMobile,
brand_reserved3,
brand_Pentium4,
brand_invalid
};
struct version_
{
bit Stepping : 4;
bit Model : 4;
bit Family : 4;
bit Type : 2;
bit Reserved1 : 2;
bit XModel : 4;
bit XFamily : 8;
bit Reserved2 : 4;
};
struct misc_
{
byte Brand;
byte CLFLUSH;
byte Reserved;
byte APICId;
};
struct feature_
{
bit FPU : 1; // Floating Point Unit On-Chip
bit VME : 1; // Virtual 8086 Mode Enhancements
bit DE : 1; // Debugging Extensions
bit PSE : 1; // Page Size Extensions
bit TSC : 1; // Time Stamp Counter
bit MSR : 1; // Model Specific Registers
bit PAE : 1; // Physical Address Extension
bit MCE : 1; // Machine Check Exception
bit CX8 : 1; // CMPXCHG8 Instruction
bit APIC : 1; // APIC On-Chip
bit Reserved1 : 1;
bit SEP : 1; // SYSENTER and SYSEXIT instructions
bit MTRR : 1; // Memory Type Range Registers
bit PGE : 1; // PTE Global Bit
bit MCA : 1; // Machine Check Architecture
bit CMOV : 1; // Conditional Move Instructions
bit PAT : 1; // Page Attribute Table
bit PSE36 : 1; // 32-bit Page Size Extension
bit PSN : 1; // Processor Serial Number
bit CLFSH : 1; // CLFLUSH Instruction
bit Reserved2 : 1;
bit DS : 1; // Debug Store
bit ACPI : 1; // Thermal Monitor and Software Controlled Clock Facilities
bit MMX : 1; // Intel MMX Technology
bit FXSR : 1; // FXSAVE and FXRSTOR Instructions
bit SSE : 1; // Intel SSE Technology
bit SSE2 : 1; // Intel SSE2 Technology
bit SS : 1; // Self Snoop
bit Reserved3 : 1;
bit TM : 1; // Thermal Monitor
bit Reserved4 : 2;
};
string vendor;
string brand;
version_ version;
misc_ misc;
feature_ feature;
byte *cache;
ia32detect ();
const string version_text () const;
protected:
const char * type_text () const;
const string brand_text () const;
private:
uint32 init0 ();
void init1 (uint32 *d);
void process2 (uint32 d, bool c[]);
void init2 (byte count);
void init0x80000000 ();
};
Notes:
-
enum ia32detect::type_ enumerates CPU types for the version.Type
field;
-
enum ia32detect::brand_ enumerates CPU brands for the misc.Brand
field;
-
struct ia32detect::version_ (version
field) describes CPU version information as returned by the CPUID instruction;
-
struct ia32detect::misc_ (misc
field) describes CPU miscellaneous information as returned by the CPUID
instruction;
-
struct ia32detect::feature_ (feature
field) describes CPU feature information as returned by the CPUID instruction;
-
string ia32detect::vendor
specifies the CPU vendor ("GenuineIntel" for Intel CPUs);
-
string ia32detect::brand
specifies the CPU brand string, when supported;
-
byte *ia32detect::cache
points to a null-terminated array of cache descriptors;
-
ia32detect::ia32detect ()
initializes an instance of the class through (multiple) invocations of the
CPUID instruction;
-
const string ia32detect::version_text () returns a string representation
of the version
field;
-
const char *ia32detect::type_text () returns a string representation of
the type
field;
-
const string ia32detect::brand_text () returns a string representation
of the misc.Brand
field;
-
all the private members are auxiliary routines to simplify the work of the
constructor.
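Two pieces of the constructor's work can be illustrated without the library: splitting the CPUID(1) EAX value into the `version_` fields, and collecting the one-byte cache descriptors returned by CPUID(2), where a register with bit 31 set holds no valid descriptors and the low byte of EAX is an iteration count rather than a descriptor. This is a hedged sketch, not ia32detect's code: the function names are hypothetical, it handles a single CPUID(2) invocation, and the register values in the test (EAX=0x03020101, EDX=0x0C040883) are hypothetical values chosen to reproduce the descriptor list in the ia32detect output, not a captured dump.

```cpp
#include <cstdint>
#include <vector>

// Fields of the CPUID(1) EAX "version" value.
struct cpu_version
{
    unsigned stepping, model, family, type, ext_model, ext_family;
};

// Decode with shifts and masks (bit-field allocation order is
// implementation-defined in C++, shifts are not).
cpu_version decode_version(std::uint32_t eax)
{
    cpu_version v;
    v.stepping   =  eax        & 0xF;
    v.model      = (eax >> 4)  & 0xF;
    v.family     = (eax >> 8)  & 0xF;
    v.type       = (eax >> 12) & 0x3;
    v.ext_model  = (eax >> 16) & 0xF;
    v.ext_family = (eax >> 20) & 0xFF;
    return v;
}

// Collect the cache descriptor bytes reported by one CPUID(2) call.
// A 256-entry flag array removes duplicates and yields them in sorted order.
std::vector<std::uint8_t> cache_descriptors(std::uint32_t eax, std::uint32_t ebx,
                                            std::uint32_t ecx, std::uint32_t edx)
{
    bool seen[256] = {};
    const std::uint32_t regs[4] = { eax, ebx, ecx, edx };
    for (int r = 0; r < 4; ++r)
    {
        if (regs[r] & 0x80000000u)
            continue;                               // register holds no descriptors
        for (int b = (r == 0 ? 1 : 0); b < 4; ++b)  // skip AL, the repeat count
            seen[(regs[r] >> (8 * b)) & 0xFF] = true;
    }
    std::vector<std::uint8_t> out;
    for (int d = 1; d < 256; ++d)                   // 0x00 is the null descriptor
        if (seen[d])
            out.push_back(static_cast<std::uint8_t>(d));
    return out;
}
```

For EAX = 0x6B1, `decode_version` gives family 6, model 11, stepping 1, matching the "6.11.1" version string in the sample output.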
ia32counter.h
Classes
Name: ia32counter
Definition:
class ia32counter
{
protected:
static uint32 count;
uint32 index;
public:
ia32counter (uint32 counters);
};
Notes:
-
ia32counter
is an abstract base class for performance monitoring hardware counters;
-
static uint32 ia32counter::count
accumulates the number of instances created;
-
uint32 ia32counter::index
contains the hardware index of this instance;
-
ia32counter::ia32counter (uint32 counters) initializes the index and
checks for structural hazards (enough hardware counters).
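The index bookkeeping described above might look like the following minimal sketch. `counter_slot` is hypothetical, `std::runtime_error` stands in for ia32error, and the sketch never releases an index when an object is destroyed; whether ia32counter does is not stated in this document.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: each new instance takes the next hardware counter
// index, and construction fails when no free counter is left.
class counter_slot
{
protected:
    static std::uint32_t count;   // number of instances created so far
    std::uint32_t index;          // hardware counter used by this instance
public:
    explicit counter_slot(std::uint32_t counters)
        : index(count)
    {
        if (index >= counters)    // structural hazard: not enough counters
            throw std::runtime_error("not enough hardware counters");
        ++count;                  // only claim the index on success
    }
    std::uint32_t hw_index() const { return index; }
};

std::uint32_t counter_slot::count = 0;
```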
p6counter.h
Classes
Name: p6counter
Definition:
class p6counter: public ia32counter
{
public:
enum event_
{
// Data Cache Unit (DCU)
DCU_MEMORY_REFERENCE = 0x43, // DATA_MEM_REFS
DCU_LINES_IN = 0x45,
DCU_M_LINES_IN = 0x46,
DCU_M_LINES_OUT = 0x47,
DCU_MISS_OUTSTANDING = 0x48,
// Instruction Fetch Unit (IFU)
IFU_IFETCH = 0x80,
IFU_IFETCH_MISS = 0x81,
IFU_TLB_MISS = 0x85, // ITLB_MISS
IFU_MEMORY_STALL = 0x86,
IFU_ILD_STALL = 0x87, // ILD_STALL
// L2 Cache
L2_IFETCH = 0x28,
L2_LOADS = 0x29, // L2_LD
L2_STORES = 0x2A, // L2_ST
L2_LINES_IN = 0x24,
L2_LINES_OUT = 0x26,
L2_M_LINES_IN = 0x25,
L2_M_LINES_OUT = 0x27,
L2_REQUEST = 0x2E, // L2_RQSTS
L2_ADDRESS_STROBE = 0x21, // L2_ADS
L2_DATA_BUS_BUSY = 0x22, // L2_DBUS_BUSY
L2_DATA_BUS_BUSY_READ = 0x23, // L2_DBUS_BUSY_RD
// External Bus Logic (EBL)
EBL_DATA_READY = 0x62, // BUS_DRDY_CLOCKS
EBL_LOCK = 0x63, // BUS_LOCK_CLOCKS
EBL_REQ_OUTSTANDING = 0x60, // BUS_REQ_OUTSTANDING
EBL_TRANS_BURST_READ = 0x65, // BUS_TRAN_BRD
EBL_TRANS_READ_OWNER = 0x66, // BUS_TRAN_RFO
EBL_TRANS_WRITEBACK = 0x67, // BUS_TRANS_WB
EBL_TRANS_IFETCH = 0x68, // BUS_TRAN_IFETCH
EBL_TRANS_INVALIDATE = 0x69, // BUS_TRAN_INVAL
EBL_TRANS_PARTIAL_WRITE = 0x6A, // BUS_TRAN_PWR
EBL_TRANS_PARTIAL = 0x6B, // BUS_TRANS_P
EBL_TRANS_IO = 0x6C, // BUS_TRANS_IO
EBL_TRANS_DEFERRED = 0x6D, // BUS_TRAN_DEF
EBL_TRANS_BURST = 0x6E, // BUS_TRAN_BURST
EBL_TRANS_ANY = 0x70, // BUS_TRAN_ANY
EBL_TRANS_MEMORY = 0x6F, // BUS_TRAN_MEM
EBL_DATA_RECEIVE = 0x64, // BUS_DATA_RCV
EBL_DRIVE_BNR = 0x61, // BUS_BNR_DRV
EBL_DRIVE_HIT = 0x7A, // BUS_HIT_DRV
EBL_DRIVE_HITM = 0x7B, // BUS_HITM_DRV
EBL_SNOOP_STALL = 0x7E, // BUS_SNOOP_STALL
// Floating-Point Unit (FPU)
FPU_FLOPS_RETIRED = 0xC1, // FLOPS, Counter 0 only
FPU_FLOPS_EXECUTED = 0x10, // FP_COMP_OPS_EXE, Counter 0 only
FPU_ASSIST = 0x11, // FP_ASSIST, Counter 1 only
FPU_MUL = 0x12, // MUL, Counter 1 only
FPU_DIV = 0x13, // DIV, Counter 1 only
FPU_DIV_BUSY = 0x14, // CYCLES_DIV_BUSY, Counter 0 only
// Memory Ordering (MO)
MO_LOAD_BLOCKED = 0x03, // LD_BLOCKS
MO_STORE_BUFFER_DRAIN = 0x04, // SB_DRAINS
MO_MISALLIGNMENT = 0x05, // MISALIGN_MEM_REF
SSE_PREFETCH_DISPATCHED = 0x07, // EMON_KNI_PREF_DISPATCHED
SSE_PREFETCH_MISS = 0x4B, // EMON_KNI_PREF_MISS
// Instruction Decoding and Retirement (IDR)
IDR_INSTRUCTION_RETIRED = 0xC0, // INST_RETIRED
IDR_UOP_RETIRED = 0xC2, // UOPS_RETIRED
IDR_INSTRUCTION_DECODED = 0xD0, // INST_DECODED
SSE_INSTRUCTION_RETIRED = 0xD8, // EMON_KNI_INST_RETIRED
SSE_COMPUTATION_RETIRED = 0xD9, // EMON_KNI_COMP_INST_RET
// Interrupts (INT)
INT_HW_RECEIVED = 0xC8, // HW_INT_RX
INT_MASKED = 0xC6, // CYCLES_INT_MASKED
INT_PENDING_AND_MASKED = 0xC7, // CYCLES_INT_PENDING_AND_MASKED
// Branches (BR)
BR_INSTRUCTION_RETIRED = 0xC4, // BR_INST_RETIRED
BR_MISSPREDICT_RETIRED = 0xC5, // BR_MISS_PRED_RETIRED
BR_TAKEN_RETIRED = 0xC9,
BR_MISSPREDICT_TAKEN_RETIRED = 0xCA, // BR_MISS_PRED_TAKEN_RET
BR_INSTRUCTION_DECODED = 0xE0, // BR_INST_DECODED
BR_BTB_MISS = 0xE2, // BTB_MISSES
BR_BOGUS = 0xE4,
BR_BACLEAR = 0xE6, // BACLEARS
// Stalls (STALL)
STALL_RESOURCE = 0xA2, // RESOURCE_STALLS
STALL_PARTIAL = 0xD2, // PARTIAL_RAT_STALLS
// Multimedia Extensions (MMX)
MMX_INSTRUCTION_EXECUTE = 0xB0, // MMX_INSTR_EXEC
MMX_SATURATING_EXECUTE = 0xB1, // MMX_SAT_INSTR_EXEC
MMX_UOP_EXECUTE = 0xB2, // MMX_UOPS_EXEC
MMX_TYPE_EXECUTE = 0xB3, // MMX_INSTR_TYPE_EXEC
MMX_FPU_TRANSITION = 0xCC, // FP_MMX_TRANS
MMX_ASSIST = 0xCD,
MMX_INSTRUCTION_RETIRED = 0xCE, // MMX_INSTR_RET
// Segment Register Renaming (SRR)
SRR_STALL = 0xD4, // SEG_RENAME_STALLS
SRR_COUNT = 0xD5, // SEG_REG_RENAME
SRR_COUNT_RETIRED = 0xD6, // RET_SEG_RENAMES
SEGMENT_REGISTER_LOADS = 0x06, // SEGMENT_REG_LOADS
CPU_CLOCKS_UNHALTED = 0x79 // CPU_CLK_UNHALTED
};
enum mask_
{
NONE = 0x0,
L2_M = 0x8,
L2_E = 0x4,
L2_S = 0x2,
L2_I = 0x1,
L2_MESI = 0xF,
EBL_SELF = 0x00,
EBL_ANY = 0x20,
SSE_PREFETCH_NTA = 0x00,
SSE_PREFETCH_T1 = 0x01,
SSE_PREFETCH_T2 = 0x02,
SSE_WEAKLY_ORDERED_STORES = 0x03,
SSE_PACKED_AND_SCALAR = 0x00,
SSE_SCALAR = 0x01,
MMX_PACKED_MULTIPLY = 0x01,
MMX_PACKED_SHIFT = 0x02,
MMX_PACK = 0x04,
MMX_UNPACK = 0x08,
MMX_PACKED_LOGICAL = 0x10,
MMX_PACKED_ARITHMETIC = 0x20,
MMX_ANY = 0x3F,
MMX_TO_FPU = 0x0,
MMX_FROM_FPU = 0x1,
SRR_ES = 0x1,
SRR_DS = 0x2,
SRR_FS = 0x4,
SRR_GS = 0x8,
SRR_ANY = 0xF
};
struct
{
bit event : 8;
bit mask : 8;
bit ring123 : 1;
bit ring0 : 1;
bit edge : 1;
bit pin : 1;
bit int_ : 1;
bit reserved : 1;
bit enable : 1;
bit invert : 1;
bit count : 8;
} config;
p6counter (event_ event, mask_ mask = NONE, byte count = 0, bool invert = false);
operator const uint64 () const;
protected:
ia32ring0 r0;
};
Notes:
-
p6counter is a derived class of ia32counter
for performance monitoring counter on the Intel P6 Family of CPUs (Pentium Pro,
II and III);
-
enum p6counter::event_
enumerates the different events this counter can be programmed to count;
-
enum p6counter::mask_
enumerates the different values for the mask field in the counter programming
register;
-
struct p6counter::config
represents the counter's programming register;
-
p6counter::p6counter (event_, mask_, byte, bool)
initializes the hardware counter and starts it;
-
p6counter::operator uint64 () const
reads the current value of the counter;
-
ia32ring0 p6counter::r0 is used for communication with the kernel-mode
driver.
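The `config` bit-field maps onto the P6 event-select register. The sketch below builds the same 32-bit value with explicit shifts, and also shows why a delta between two readings should be taken modulo 2^40: the Intel manuals document the P6 counters as 40 bits wide. `perfevtsel` and `delta40` are hypothetical helpers; which of the USR/OS/EN bits the p6counter constructor actually sets is not documented here, so the flags passed in the test are assumptions. The edge-detect, pin-control and APIC-interrupt bits are simply left at zero.

```cpp
#include <cstdint>

// Build a P6 event-select value following the layout of p6counter::config:
// event 7..0, unit mask 15..8, USR 16, OS 17, (E 18, PC 19, INT 20 left 0),
// bit 21 reserved, EN 22, INV 23, counter mask 31..24.
std::uint32_t perfevtsel(std::uint32_t event, std::uint32_t mask,
                         bool usr, bool os, bool enable,
                         bool invert = false, std::uint32_t cmask = 0)
{
    return (event & 0xFF)
         | (mask & 0xFF) << 8
         | (usr    ? 1u : 0u) << 16
         | (os     ? 1u : 0u) << 17
         | (enable ? 1u : 0u) << 22
         | (invert ? 1u : 0u) << 23
         | (cmask & 0xFF) << 24;
}

// Difference between two 40-bit counter readings, correct across one wrap.
std::uint64_t delta40(std::uint64_t start, std::uint64_t end)
{
    const std::uint64_t mask = (std::uint64_t(1) << 40) - 1;
    return (end - start) & mask;
}
```

For example, L2_REQUEST (0x2E) with mask L2_MESI (0xF), counting in all rings and enabled, encodes as 0x00430F2E.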
Examples
ia32detect
This example exercises all the CPU detection features of the library; every
supported feature is demonstrated here. The complete source code (not much)
is provided below.
#include <cstdio>
#include "ia32.h"
int main ()
{
ia32detect ia32;
printf("Vendor = %s\n\n", ia32.vendor.c_str());
printf("Brand = %s\n\n", ia32.brand.c_str());
printf("Version = %s\n\n", ia32.version_text().c_str());
printf("Cache: \n\n");
for (int i = 0; ia32.cache[i]; i++)
printf("%s\n", ((string)_ia32cache(ia32.cache[i])).c_str());
printf("\nFeatures:\n\n");
printf("%c %s\n", ia32.feature.FPU ? '+' : '-', "Floating Point Unit On-Chip");
printf("%c %s\n", ia32.feature.VME ? '+' : '-', "Virtual 8086 Mode Enhancements");
printf("%c %s\n", ia32.feature.DE ? '+' : '-', "Debugging Extensions");
printf("%c %s\n", ia32.feature.PSE ? '+' : '-', "Page Size Extensions");
printf("%c %s\n", ia32.feature.TSC ? '+' : '-', "Time Stamp Counter");
printf("%c %s\n", ia32.feature.MSR ? '+' : '-', "Model Specific Registers");
printf("%c %s\n", ia32.feature.PAE ? '+' : '-', "Physical Address Extension");
printf("%c %s\n", ia32.feature.MCE ? '+' : '-', "Machine Check Exception");
printf("%c %s\n", ia32.feature.CX8 ? '+' : '-', "CMPXCHG8 Instruction");
printf("%c %s\n", ia32.feature.APIC ? '+' : '-', "APIC On-Chip");
printf("%c %s\n", ia32.feature.SEP ? '+' : '-', "SYSENTER and SYSEXIT instructions");
printf("%c %s\n", ia32.feature.MTRR ? '+' : '-', "Memory Type Range Registers");
printf("%c %s\n", ia32.feature.PGE ? '+' : '-', "PTE Global Bit");
printf("%c %s\n", ia32.feature.MCA ? '+' : '-', "Machine Check Architecture");
printf("%c %s\n", ia32.feature.CMOV ? '+' : '-', "Conditional Move Instructions");
printf("%c %s\n", ia32.feature.PAT ? '+' : '-', "Page Attribute Table");
printf("%c %s\n", ia32.feature.PSE36 ? '+' : '-', "32-bit Page Size Extension");
printf("%c %s\n", ia32.feature.PSN ? '+' : '-', "Processor Serial Number");
printf("%c %s\n", ia32.feature.CLFSH ? '+' : '-', "CLFLUSH Instruction");
printf("%c %s\n", ia32.feature.DS ? '+' : '-', "Debug Store");
printf("%c %s\n", ia32.feature.ACPI ? '+' : '-', "Thermal Monitor and Software Controlled Clock Facilities");
printf("%c %s\n", ia32.feature.MMX ? '+' : '-', "Intel MMX Technology");
printf("%c %s\n", ia32.feature.FXSR ? '+' : '-', "FXSAVE and FXRSTOR Instructions");
printf("%c %s\n", ia32.feature.SSE ? '+' : '-', "Intel SSE Technology");
printf("%c %s\n", ia32.feature.SSE2 ? '+' : '-', "Intel SSE2 Technology");
printf("%c %s\n", ia32.feature.SS ? '+' : '-', "Self Snoop");
printf("%c %s\n", ia32.feature.TM ? '+' : '-', "Thermal Monitor");
}
Below is the output from my laptop. If you decide to install the package,
please run this small program and e-mail
me the results.
Vendor = GenuineIntel
Brand = Intel(R) Pentium(R) III Mobile CPU 1000MHz
Version = 6.11.1 Intel OEM Processor XVersion(0.0)
Cache:
0x01: TLB instruction, Entries( 32), PageSize(4KB), Associativity(4-way)
0x02: TLB instruction, Entries( 2), PageSize(4MB), Associativity( Full)
0x03: TLB data, Entries( 64), PageSize(4KB), Associativity(4-way)
0x04: TLB data, Entries( 8), PageSize(4MB), Associativity(4-way)
0x08: L1 instruction$, Size( 16KB), Block( 32 B), Associativity(4-way)
0x0c: L1 data$, Size( 16KB), Block( 32 B), Associativity(4-way)
0x83: L2 unified$, Size( 512KB), Block( 32 B), Associativity(8-way)
Features:
+ Floating Point Unit On-Chip
+ Virtual 8086 Mode Enhancements
+ Debugging Extensions
+ Page Size Extensions
+ Time Stamp Counter
+ Model Specific Registers
+ Physical Address Extension
+ Machine Check Exception
+ CMPXCHG8 Instruction
- APIC On-Chip
+ SYSENTER and SYSEXIT instructions
+ Memory Type Range Registers
+ PTE Global Bit
+ Machine Check Architecture
+ Conditional Move Instructions
+ Page Attribute Table
+ 32-bit Page Size Extension
- Processor Serial Number
- CLFLUSH Instruction
- Debug Store
- Thermal Monitor and Software Controlled Clock Facilities
+ Intel MMX Technology
+ FXSAVE and FXRSTOR Instructions
+ Intel SSE Technology
- Intel SSE2 Technology
- Self Snoop
- Thermal Monitor
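The long run of printf calls in ia32detect.cpp could also be written table-driven, pairing each CPUID(1) EDX bit position with its description. The sketch below is a hedged alternative, not the shipped sample: only a few entries are filled in, and the test input 0x0383F9FF is a hypothetical EDX value reconstructed from the +/- list above (treating reserved bits as zero), not a value read from the machine.

```cpp
#include <cstdint>
#include <string>

// One entry per reported feature: EDX bit position and description.
struct feature_name
{
    int bit;
    const char *text;
};

static const feature_name features[] = {
    {  0, "Floating Point Unit On-Chip" },
    {  9, "APIC On-Chip" },
    { 23, "Intel MMX Technology" },
    { 25, "Intel SSE Technology" },
    { 26, "Intel SSE2 Technology" },
    // ... remaining bits as in ia32detect::feature_
};

// Render "+ name" / "- name" lines for the given CPUID(1) EDX value.
std::string feature_report(std::uint32_t edx)
{
    std::string out;
    for (const feature_name &f : features)
    {
        out += ((edx >> f.bit) & 1u) ? "+ " : "- ";
        out += f.text;
        out += '\n';
    }
    return out;
}
```

Adding a feature then means adding one table row rather than another printf line.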
ia32p6
This example demonstrates the usage of Intel P6 Hardware Performance Monitoring
Counters. Processors from this family have two almost identical counters. In
the source below, one of them is set up to count memory references and the
other to count requests to the L2 cache (which are nothing else but L1
misses!).
#include <cstdio>
#include "ia32.h"
#include "p6counter.h"
int main ()
{
p6counter c1(p6counter::L2_REQUEST, p6counter::L2_MESI);
p6counter c2(p6counter::DCU_MEMORY_REFERENCE);
const int c = 10000000;
static int a[c];
for (int ai1 = 0; ai1 < c; ai1++)
a[ai1]++;
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
uint64 t1 = c1;
uint64 t2 = c2;
for (int ai2 = 0; ai2 < c; ai2++)
a[ai2] *= 13;
printf("L1 misses = %I64u\nL1 accesses = %I64u\n", c1 - t1, c2 - t2);
}
We walk an array of 10000000 integers, multiplying each element by 13 (a load
access followed by a store access, i.e. 2 accesses per element). Also, because
the L1 line size is 32 bytes, there are 8 elements per line, or 1250000
cache lines accessed (all misses). This adds up to 20000000 memory accesses
and 1250000 L1 misses. The excess of 1495 misses and 10728 memory accesses in
the results below is due to OS noise, the amount of which (well under 1%) is
quite acceptable.
L1 misses = 1251495
L1 accesses = 20010728
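The expected counts in the paragraph above, and the size of the noise relative to them, can be checked with a few lines of arithmetic. This sketch assumes 4-byte ints, 32-byte L1 lines, one load and one store per element, and an array that starts on a cache-line boundary; `expect` and `noise_percent` are hypothetical helpers.

```cpp
#include <cstdint>

// Expected event counts for the ia32p6 loop under the assumptions above.
struct expected_counts
{
    std::uint64_t accesses;
    std::uint64_t misses;
};

expected_counts expect(std::uint64_t elements, std::uint64_t elem_size,
                       std::uint64_t line_size)
{
    expected_counts e;
    e.accesses = 2 * elements;                                        // load + store
    e.misses   = (elements * elem_size + line_size - 1) / line_size;  // one per line
    return e;
}

// Relative deviation of a measurement from the expectation, in percent.
double noise_percent(std::uint64_t measured, std::uint64_t expected)
{
    return 100.0 * (double(measured) - double(expected)) / double(expected);
}
```

With the measured values above, the noise comes to roughly 0.12% of the misses and 0.05% of the accesses.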
The code of the example employs many techniques to reduce the noise during
measurements. Here are the most important things you need to keep in mind when
monitoring performance in this setting:
-
Microsoft Windows NT / 2K / XP does not allocate all the memory your process
requested instantly after the request. Rather pages are allocated when they are
first accessed. This means that when you access a memory page for the first
time, a page fault occurs and the OS takes over. The instructions executed by
the OS exception handler can be millions, resulting in excessive noise in the
measurements. For this reason the code above walks the array in advance to make
sure all pages are present in memory when the counting starts.
-
Because Microsoft Windows NT / 2K / XP is a preemptive multitasking operating
system, our program is not the only thing running on the machine. Performance
counters are in the CPU and count events for all processes alike. In
order to reduce noise from other code, it is advisable to boost the priority of
your process to the maximum level (real-time priority). This setting will
reserve the machine almost exclusively for your application, and overall system
responsiveness may become jerky until the program terminates. The code above
achieves the priority boost with the SetPriorityClass Windows API call:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
-
Last but not least, make sure you avoid obvious counting overlaps. An example
would be splitting the final printf statement in the example above into two
separate function calls. Note that the current counter value is read when the
'-' operator is evaluated. Thus, if you print the delta of the first counter
(cache misses in this case) in a separate call to printf, the second counter
(memory references in this case) will also count the data accesses performed
during that call.
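The rules above are not specific to hardware counters. As a portable illustration only, the sketch below applies the first and last points with `std::chrono::steady_clock` standing in for a performance counter (reading the real counters needs ia32.sys); raising the priority is Windows-specific and omitted. `measure` and `measurement` are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

struct measurement
{
    std::int64_t  elapsed_ns;  // stand-in for a counter delta
    std::uint64_t checksum;    // keeps the work observable
};

measurement measure(int *a, std::size_t n)
{
    // Touch every element once so page faults are taken outside the
    // measured region.
    for (std::size_t i = 0; i < n; ++i)
        a[i]++;

    const auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)   // the measured kernel
        a[i] *= 13;
    // Capture the end reading in a local before doing anything else,
    // such as printing.
    const auto t2 = std::chrono::steady_clock::now();

    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += static_cast<std::uint64_t>(a[i]);

    measurement m;
    m.elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    m.checksum = sum;
    return m;
}
```

Called on a zero-initialized static array, as in ia32p6, every element ends up as 13.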
Future Work
Although this document seems quite long, it is more of a draft than something
completed.
There are many (orthogonal) directions in which this work can be extended.
First priority is, of course, implementing ia32counter subclasses (like
p6counter) for other processor families, like Intel Pentium 4, Intel Itanium
and various AMD models. I believe it is important to understand the specifics
of the Intel P4, as it is the first processor ever to provide precise
event-based sampling performance monitoring. What this means is that one can
get the processor state when an event (e.g. a cache miss) occurs, so the exact
instruction causing the miss is known. This can further improve the precision
of research methods in this area.
Another direction is to extend the CPU detection procedure with empirical
measurements that can detect the memory hierarchy in conventional software (a
la HW1 cs612). As processors become more and more sophisticated from the
hardware point of view, this task becomes harder and harder, but I believe it
is still doable. This is a very important step if we want to build compilers
that dynamically tune themselves to the current CPU (possibly a CPU that did
not exist when the compiler was released!)
Last, I am not sure how important this is, but this document is way too long
and needs better structure and probably some factoring. If the library grows
bigger, better documentation will be needed, or it will become yet another
public-domain package whose headers you have to read in full before you can
start using it. I said this before, and I will repeat it again: if you ever
plan to use this thing, please, please give feedback. Contributions are also
more than welcome, but if you have an idea, I would suggest coordinating it
with me, as there is a good chance it is already under way...
So far I am not worried about whether this piece of software is useful or not.
It is certainly useful for me. I bet it would also be useful for cs612... I
hope it is useful for you too. Good luck!
References
-
Intel IA-32 Developer Manuals, Volumes 1-3, http://www.intel.com
-
http://www.sandpile.org