The Intel Pentium Pro and Pentium II CPUs have hardware support for counting any two of several dozen low-level hardware events. For example, I-cache misses can be precisely measured with the counters. By default, the counters can be neither read nor programmed from user mode. If the PCE bit in control register 4 (CR4) is set, user-mode code can read the counters, but they must still be programmed from kernel mode. The pmc device driver module sets the PCE bit when it is loaded, so that user-mode programs can then read the counters with the rdpmc instruction.
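Once the PCE bit in control register 4 (CR4.PCE) is set, a user-mode program can read a counter directly. The sketch below is a minimal illustration, not code from the pmc distribution: it assumes GCC-style inline assembly, and the helper names `read_pmc` and `pmc_delta` are hypothetical. RDPMC takes the counter index in ECX and returns the value in EDX:EAX. On the Pentium Pro and Pentium II the counters are 40 bits wide, so the difference of two readings is masked to 40 bits to survive one wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Pentium Pro / Pentium II performance counters are 40 bits wide. */
#define PMC_MASK ((UINT64_C(1) << 40) - 1)

#if defined(__i386__) || defined(__x86_64__)
/* Hypothetical helper: read counter 0 or 1 with RDPMC.
   This faults unless CR4.PCE is set (e.g. by loading the pmc module). */
static inline uint64_t read_pmc(unsigned idx)
{
    uint32_t lo, hi;
    assert(idx < 2); /* the P6 family has exactly two counters */
    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Hypothetical helper: number of events counted between two raw
   readings, correct across at most one 40-bit wraparound. */
static uint64_t pmc_delta(uint64_t start, uint64_t end)
{
    return (end - start) & PMC_MASK;
}
```

Typical use brackets the code under test with two `read_pmc(0)` calls and passes both readings to `pmc_delta`.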
Note that the Pentium and Pentium with MMX CPUs are very different under the hood from the Pentium Pro and Pentium II CPUs. The Pentium and Pentium with MMX have different performance counters, which are accessed in different ways. The pmc device driver supports neither the Pentium nor the Pentium with MMX. Adding this support looks easy enough, but I don't have one of these CPUs to test on. Let me know if you'd like to donate appropriate code, or the use of such a CPU to develop it on. The 486 and earlier are right out.
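A program can tell these CPU generations apart with CPUID: leaf 1 reports the processor family in bits 11:8 of EAX, which is 6 for the P6 family (Pentium Pro, Pentium II), 5 for the Pentium and Pentium with MMX, and 4 for the 486. The sketch below is a hypothetical check, not part of the pmc driver; it assumes the `__get_cpuid` helper from GCC's and Clang's `<cpuid.h>`. The test is coarse: many later Intel CPUs also report family 6, and not all of them share this event list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode the family field (bits 11:8) of the CPUID leaf 1 EAX signature. */
static inline unsigned cpu_family(uint32_t eax)
{
    return (eax >> 8) & 0xF;
}

/* Hypothetical helper: true for P6-family parts (family 6), such as the
   Pentium Pro and Pentium II. The Pentium and Pentium with MMX report
   family 5; the 486 reports family 4. */
static inline bool pmc_cpu_supported(uint32_t eax)
{
    return cpu_family(eax) == 6;
}

#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
#include <cpuid.h>
/* Query the running CPU; false if CPUID leaf 1 is unavailable. */
static inline bool running_on_p6(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return pmc_cpu_supported(eax);
}
#endif
```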
The pmc driver provides ioctls to program each of the two counters to count any of the available events.
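Whatever the exact ioctl interface, programming a counter ultimately means writing an event-select MSR (PerfEvtSel0 at 186H or PerfEvtSel1 at 187H, per Intel's manual), which only kernel mode can do. The sketch below builds such a value from the bit layout Intel documents for the P6 family. The names `evtsel` and `EVTSEL_*` are hypothetical and are not the pmc driver's ioctl API. On these CPUs the enable bit in PerfEvtSel0 turns on both counters.

```c
#include <stdint.h>

/* PerfEvtSel bit fields for P6-family (Pentium Pro / Pentium II) CPUs. */
#define EVTSEL_USR  (1u << 16) /* count while CPL > 0 (user mode) */
#define EVTSEL_OS   (1u << 17) /* count while CPL == 0 (kernel mode) */
#define EVTSEL_EDGE (1u << 18) /* count edges instead of durations */
#define EVTSEL_INT  (1u << 20) /* interrupt via the local APIC on overflow */
#define EVTSEL_EN   (1u << 22) /* in PerfEvtSel0 only: enables both counters */
#define EVTSEL_INV  (1u << 23) /* invert the counter-mask comparison */

/* Hypothetical helper: build a PerfEvtSel value.
   event: event select (bits 7:0), umask: unit mask (bits 15:8),
   cmask: counter mask (bits 31:24), flags: OR of EVTSEL_* bits. */
static uint32_t evtsel(uint8_t event, uint8_t umask, uint8_t cmask,
                       uint32_t flags)
{
    return (uint32_t)event | ((uint32_t)umask << 8) |
           ((uint32_t)cmask << 24) | flags;
}
```

For example, `evtsel(0xC0, 0, 0, EVTSEL_USR | EVTSEL_OS | EVTSEL_EN)` selects INST_RETIRED (event C0H in Intel's table) in both user and kernel mode and yields 0x004300C0.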
Also in this distribution is the pmcTime program, which provides an easy way to measure these events from the command line.
The Intel Architecture Developer's Manual, volume 3, available from Intel, describes the counters in more detail.
I doubt that this code will run on these processors. If you try though, please let me know.
Event | Description |
DATA_MEM_REFS | All memory references, both cacheable and noncacheable. |
DCU_LINES_IN | Total lines allocated in the DCU (data cache unit). |
DCU_M_LINES_IN | Number of M state lines allocated in the DCU. |
DCU_M_LINES_OUT | Number of M state lines evicted from the DCU. |
DCU_MISS_OUTSTANDING | Weighted number of cycles while a DCU miss is outstanding. |
IFU_IFETCH | Number of instruction fetches, both cacheable and noncacheable. |
IFU_IFETCH_MISS | Number of instruction fetch misses. |
ITLB_MISS | Number of ITLB misses. |
IFU_MEM_STALL | Number of cycles that the instruction fetch pipe stage is stalled, including cache misses, ITLB misses, ITLB faults, and victim cache evictions. |
ILD_STALL | Number of cycles that the instruction length decoder is stalled. |
L2_IFETCH | Number of L2 instruction fetches. |
L2_LD | Number of L2 data loads. |
L2_ST | Number of L2 data stores. |
L2_LINES_IN | Number of lines allocated in the L2. |
L2_LINES_OUT | Number of lines removed from the L2 for any reason. |
L2_M_LINES_INM | Number of modified lines allocated in the L2. |
L2_M_LINES_OUTM | Number of modified lines removed from the L2 for any reason. |
L2_RQSTS | Number of L2 requests. |
L2_ADS | Number of L2 address strobes. |
L2_DBUS_BUSY | Number of cycles during which the L2 data bus was busy. |
L2_DBUS_BUSY_RD | Number of cycles during which the L2 data bus was busy transferring read data from the L2 to the processor. |
BUS_DRDY_CLOCKS_SELF | Number of clocks during which DRDY is asserted by this processor. |
BUS_DRDY_CLOCKS_ANY | Number of clocks during which DRDY is asserted by any agent. |
BUS_LOCK_CLOCKS_SELF | Number of clocks during which LOCK is asserted by this processor. |
BUS_LOCK_CLOCKS_ANY | Number of clocks during which LOCK is asserted by any agent. |
BUS_REQ_OUTSTANDING_SELF | Number of bus requests outstanding from this processor. |
BUS_REQ_OUTSTANDING_ANY | Number of bus requests outstanding from any agent. |
BUS_TRAN_BRD_SELF | Number of burst read transactions by this processor. |
BUS_TRAN_BRD_ANY | Number of burst read transactions by any agent. |
BUS_TRAN_RFO | Number of read for ownership transactions. |
BUS_TRANS_WB | Number of write back transactions. |
BUS_TRAN_IFETCH | Number of instruction fetch transactions. |
BUS_TRAN_INVAL | Number of invalidate transactions. |
BUS_TRAN_PWR | Number of partial write transactions. |
BUS_TRANS_P | Number of partial transactions. |
BUS_TRANS_IO | Number of I/O transactions. |
BUS_TRAN_DEF | Number of deferred transactions. |
BUS_TRAN_BURST | Number of burst transactions. |
BUS_TRAN_ANY | Number of all transactions. |
BUS_TRAN_MEM | Number of memory transactions. |
BUS_DATA_RCV | Number of bus clock cycles during which this processor is receiving data. |
BUS_BNR_DRV | Number of bus clock cycles during which this processor is driving the BNR pin. |
BUS_HIT_DRV | Number of bus clock cycles during which this processor is driving the HIT pin. |
BUS_HITM_DRV | Number of bus clock cycles during which this processor is driving the HITM pin. |
BUS_SNOOP_STALL | Number of clock cycles during which the bus is snoop stalled. |
FLOPS | Number of computational floating-point operations retired. |
FP_COMP_OPS_EXE | Number of computational floating-point operations executed. |
FP_ASSIST | Number of floating-point exception cases handled by microcode. |
MUL | Number of multiplies. |
DIV | Number of divides. |
CYCLES_DIV_BUSY | Number of cycles during which the divider is busy. |
LD_BLOCKS | Number of store buffer blocks. |
SB_DRAINS | Number of store buffer drain cycles. |
MISALIGN_MEM_REF | Number of misaligned data memory references. |
INST_RETIRED | Number of instructions retired. |
UOPS_RETIRED | Number of UOPs retired. |
INST_DECODER | Number of instructions decoded. |
HW_INT_RX | Number of hardware interrupts received. |
CYCLES_INT_MASKED | Number of processor cycles for which interrupts are disabled. |
CYCLES_INT_PENDING_AND_MASKED | Number of processor cycles for which interrupts are disabled and interrupts are pending. |
BR_INST_RETIRED | Number of branch instructions retired. |
BR_MISS_PRED_RETIRED | Number of mispredicted branches retired. |
BR_TAKEN_RETIRED | Number of taken branches retired. |
BR_MISS_PRED_NRET | Number of mispredicted taken branches retired. |
BR_INST_DECODED | Number of branch instructions decoded. |
BTB_MISSES | Number of branches that miss the BTB. |
BR_BOGUS | Number of bogus branches. |
BACLEARS | Number of times BACLEAR is asserted. |
RESOURCE_STALLS | Number of cycles during which there are resource-related stalls. |
PARTIAL_RAT_STALLS | Number of cycles or events for partial stalls. |
SEGMENT_REG_LOADS | Number of segment register loads. |
CPU_CLK_UNHALTED | Number of cycles during which the processor is not halted. |
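The table gives event names only. When a counter is programmed, each name stands for an event-select code plus a unit mask, and the `_SELF`/`_ANY` pairs differ only in the unit mask (00H counts this processor's activity, 20H counts any bus agent's). The codes in the sketch below are a few entries from the P6-family event table in Intel's manual; the `pmc_events` array and `find_event` are hypothetical, and the pmc driver's own tables may be organized differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pmc_event {
    const char *name;
    uint8_t code;  /* event select, PerfEvtSel bits 7:0 */
    uint8_t umask; /* unit mask, PerfEvtSel bits 15:8 */
};

/* A few entries from Intel's P6-family event table. */
static const struct pmc_event pmc_events[] = {
    { "DATA_MEM_REFS",        0x43, 0x00 },
    { "IFU_IFETCH",           0x80, 0x00 },
    { "IFU_IFETCH_MISS",      0x81, 0x00 },
    { "INST_RETIRED",         0xC0, 0x00 },
    { "CPU_CLK_UNHALTED",     0x79, 0x00 },
    { "BUS_DRDY_CLOCKS_SELF", 0x62, 0x00 }, /* this processor */
    { "BUS_DRDY_CLOCKS_ANY",  0x62, 0x20 }, /* any bus agent */
    { "BUS_LOCK_CLOCKS_SELF", 0x63, 0x00 },
    { "BUS_LOCK_CLOCKS_ANY",  0x63, 0x20 },
    { "BUS_TRAN_BRD_SELF",    0x65, 0x00 },
    { "BUS_TRAN_BRD_ANY",     0x65, 0x20 },
};

/* Hypothetical helper: look up an event by name; NULL if unknown. */
static const struct pmc_event *find_event(const char *name)
{
    for (size_t i = 0; i < sizeof pmc_events / sizeof pmc_events[0]; i++)
        if (strcmp(pmc_events[i].name, name) == 0)
            return &pmc_events[i];
    return NULL;
}
```

Because two counters run at once, a single run can measure a ratio, for example IFU_IFETCH_MISS on one counter and IFU_IFETCH on the other to get an instruction-fetch miss rate. Intel's table also restricts a few events to one counter (for example, FLOPS to counter 0 and MUL to counter 1).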