Python Cache Hierarchy Simulator
A single-core cache hierarchy simulator written in python.
The goal is to accurately simulate the caching (allocation/hit/miss/replace/evict) behavior of all cache levels found in modern processors. It is developed as a backend to kerncraft, but is also planned to introduce a command line interface to replay LOAD/STORE instructions.
- Currently supported features:
- Inclusive cache hierarchies
- LRU, MRU, RR and FIFO policies
- N-way cache associativity
- Write-allocate with write-back caches
- Non-write-allocate with write-through caches
- Write-combining with sub-blocking
- Speed (core is implemented in C)
- Python 2.7+ and 3.4+ support, with no other dependencies
- Planned features:
- Report cachelines on all levels
- Report timeline of cache events
- Visualize events (html file?)
- More detailed store/evict handling (e.g., using dirty bits)
- (uncertain) instruction cache
- Optional classification into compulsory/capacity and conflict misses (by simulating other cache configurations in parallel)
- (uncertain) multi-core support
- Remove cl_size growth requirement (NVIDIA Kepler’s L1 has 128B cl_size and L2 with 32B)
pycachesim is licensed under AGPLv3.
from cachesim import CacheSimulator, Cache, MainMemory mem = MainMemory() l3 = Cache("L3", 20480, 16, 64, "LRU") # 20MB: 20480 sets, 16-ways with cacheline size of 64 bytes mem.load_to(l3) mem.store_from(l3) l2 = Cache("L2", 512, 8, 64, "LRU", store_to=l3, load_from=l3) # 256KB l1 = Cache("L1", 64, 8, 64, "LRU", store_to=l2, load_from=l2) # 32KB cs = CacheSimulator(l1, mem) cs.load(2342) # Loads one byte from address 2342, should be a miss in all cache-levels cs.store(512, length=8) # Stores 8 bytes to addresses 512-519, # will also be a load miss (due to write-allocate) cs.load(512, length=8) # Loads from address 512 until (exclusive) 520 (eight bytes) cs.force_write_back() cs.print_stats()
This should return:
CACHE *******HIT******** *******MISS******* *******LOAD******* ******STORE******* L1 1 ( 8B) 2 ( 65B) 3 ( 73B) 1 ( 8B) L2 0 ( 0B) 2 ( 128B) 2 ( 128B) 1 ( 64B) L3 0 ( 0B) 2 ( 128B) 2 ( 128B) 1 ( 64B) MEM 2 ( 128B) 0 ( 0B) 2 ( 128B) 1 ( 64B)
Each row refers to one memory-level, starting with L1 and ending with main memory. The 3 loads in L1 are the sum of all individual accesses to the cache-hierarchy. 1 (from first load) + 1 (from store with write-allocate) + 1 (from second load) = 3.
The 1 hit, is for bytes which were cached already. Internally the pycachesim operates on cache-lines, which all addresses get transformed to. Thus, the two misses throughout all cache-levels are actually two complete cache-lines and after the cache-line had been loaded the consecutive access to the same cache-line are handled as hits. That is also the reason why data sizes increase from L1 to L2. L1 is accessed byte-wise and L2 only with cache-line granularity.
So: hits, misses, stores and loads in L1 are byte-wise, just like stores throughout all cache-levels. Every other statistical information are based on cache-lines.
Comparison to other Cache Simulators
While searching for more versatile cache simulator for kerncraft, I stumbled across the following:
- gem5: Very fully-featured full system simulator. Complex to extract only the memory subsystem
- dineroIV: Nice and simple code, but does not support exclusive caches and not available under open source license.
- cachegrind: Maintained and stable code of a well established open source project, but only supports inclusive first and last level caches.
- callgrind: see cachegrind
- SMPcache: Only supports one single cache and runs on Windows with GUI. Also not freely available.
- CMPsim: Was only academically published and source code never made available.
- CASPER: Was only academically published and source code never made available.
|Package||instructions ||blocks ||sub-blocks ||associtivity ||LRU ||MRU ||FIFO ||RR ||CCC ||3+ levels ||exclusive ||victim ||multi-core ||API ||open source |
|gem5||x||x||?||x||x||x||x||?||?||x||?||?||?||python, ruby, c++||yes, BSD-style|
|dineroIV||x||x||x||x||x||x||x||x||x||c||no, free for non-comercial use|
|SMPcache||x||x||x||x||x||?||Windows GUI||no, free for education und research|
|CMPsim||x||x||x||x||x||x||x||?||?||x||?||no, source not public|
|CASPER||x||x||x||x||x||x||x||x||x||x||x||perl, c||no, source not public|
|||Instruction cache support (typically L1I)|
|||Cacheline/block granular caching|
|||Sub-blocking/sectoring for in cache-storage|
|||Support for n-way associativity|
|||(1, 2, 3, 4) Support least-recently-used (LRU), most-recently-used (MRU), first-in-last-out (FIFO), random (RR) replacement policy|
|||Classification of misses into: compulsory (first time access), capacity (access after replacement), conflict (would have been a hit with full-associativity)|
|||Combining of at least three cache levels|
|||Exclusive cache relations (two levels may not share the same cacheline)|
|||Victim caches, where only evicted lines endup(e.g., AMD Bulldozer L3)|
|||Multi-core cache hierarchies with private and shared caches and cache coherency protocol|
|||Supported interfaces (cli = command-line-interface)|
|||Published under an Open Source Initiative approved license?|