# Memory Systems

## Main Memory

• We have a memory hierarchy to balance the tradeoff between cost and speed
• Want to exploit temporal and spatial locality
• Moore's law is long dead and never really applied to memory
• The basic element of main memory is a memory cell capable of being written to or read from
• Need to indicate read/write, data input, and also an enable line
• When organising memory cells into a larger chip, it is important to maintain a structured approach and keep the circuit as compact as possible
• For example, a 16 word x 8 bit memory chip requires 128 cells and 4-bit addresses
• A 1024 bit device as a 128x8 array requires 7 address pins and 8 data pins
• Alternatively, it is possible to organise it as a 1024x1 array, which would be really dumb as the 10-bit address would need a massive 1-of-1024 decoder, wasting space
• Dividing the address inputs into 2 parts, row and column address, minimises the decoder space and allows more space for memory
• Can use the same principle to build smaller ICs into larger ICs, using decoders/multiplexers to split address spaces
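As a concrete illustration of the row/column split, here is a minimal C sketch (the 32x32 geometry and names are made up for illustration):

```c
#include <stdio.h>

/* Hypothetical 1024x1 device organised as a 32x32 cell array:
   a 10-bit address splits into a 5-bit row and a 5-bit column,
   so two 1-of-32 decoders replace one massive 1-of-1024 decoder. */
#define COL_BITS 5

int main(void) {
    unsigned addr = 0x2A7 & 0x3FF;                 /* 10-bit address */
    unsigned row  = addr >> COL_BITS;              /* upper 5 bits drive the row decoder */
    unsigned col  = addr & ((1u << COL_BITS) - 1); /* lower 5 bits drive the column decoder */
    printf("addr=%u -> row=%u, col=%u\n", addr, row, col);
    return 0;
}
```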
• Semiconductor memory is generally what's used for main store, Random Access Memory
• Two main technologies:
• Static RAM (SRAM) uses a flip-flop as a storage element for each bit
• Dynamic RAM (DRAM) uses the presence or lack of charge in a capacitor for each bit
• Charge leaks away over time so needs refreshing, but DRAM is generally cheaper if the overhead of the refresh circuitry is sufficiently amortised
• SRAM typically faster so is used for cache
• DRAM used for main memory
• The interface to main memory is always a bottleneck so we can do some fancy DRAM organisation stuff
• Synchronous DRAM exchanges data with the processor according to an external clock signal
• Clock runs at the speed of the bus to avoid waiting on memory
• Processor can perform other tasks while waiting because clock period and wait times are known
• Rambus DRAM was used by Intel for Pentium and Itanium
• Exchanges data over a 28-wire bus no more than 12cm long
• Provides address and control information
• Asynchronous and block-oriented
• Fast because requests are issued by the processor over the RDRAM bus instead of using explicit R/W and enable signals
• Bus properties such as impedances must be known to the processor
• DDR SDRAM extends SDRAM by sending data to the processor on both the rising and falling edges of the clock
• Actually used
• Cache DRAM (CDRAM) combines DRAM with a small SRAM cache
• Performance very dependent upon domain and load
• ROM typically used in microprogramming or systems stuff
• PROM is same as above, but electrically written
• EPROM is same as above, but is erasable via UV light at the chip level
• EEPROM is erasable electrically at the byte-level
• Flash memory is a high speed semiconductor memory
• Used for persistent storage
• Limited to block-level erasure
• Typically uses 1 transistor per bit

## Interleaved Memory

• A collection of multiple DRAM chips grouped to form a memory bank
• Banks can service requests simultaneously, increasing memory read/write rates by a factor of $n$ for $n$ banks
• If consecutive words of memory are stored in different banks, the transfer of a block of memory is sped up
• Distributing addresses among memory units/banks is called interleaving
• Interleaving addresses among $n$ memory units is known as $n$-way interleaving
• Most effective when the number of memory banks is equal to number of words in a cache line
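A minimal sketch of low-order interleaving in C, assuming 4 banks (the configuration is made up): consecutive word addresses land in different banks, so the words of a cache line can be fetched in parallel.

```c
#include <stdio.h>

#define BANKS 4  /* assumed bank count, ideally equal to words per cache line */

int main(void) {
    /* Low-order interleaving: the low address bits select the bank,
       the remaining bits index the word within that bank */
    for (unsigned word = 0; word < 8; word++) {
        unsigned bank   = word % BANKS;
        unsigned offset = word / BANKS;
        printf("word %u -> bank %u, offset %u\n", word, bank, offset);
    }
    return 0;
}
```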

## Virtual Memory

• Virtual memory is a hierarchical system across caches, main memory and swap that is managed by the OS
• Locality of reference principle: addresses generated by the CPU should be in the first level of memory as often as possible
• Use temporal, spatial, sequential locality to predict
• The working set of memory addresses usually changes slowly, so it should be maintained closest to the CPU
• Performance measured as hit ratio $H$ (assuming a two-level memory hierarchy with data in $M_1$ and $M_2$)
• The average access time is $t_a = H t_1 + (1-H)(t_1 + t_B)$
• When there is a miss, the block is swapped in from $M_2$ to $M_1$ then accessed
• $t_B$ is the time to transfer a block, so $t_a = t_1 + (1-H)t_B$
• $r = t_2/t_1$, the access time ratio of the two levels
• $e = t_1/t_a$, the factor by which average access time differs from minimum, the access efficiency
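Plugging in illustrative numbers (not from the notes): with $t_1 = 10$ ns, $t_2 = t_B = 100$ ns and $H = 0.95$,

```latex
\begin{align*}
t_a &= t_1 + (1-H)\,t_B = 10 + 0.05 \times 100 = 15\ \text{ns} \\
r   &= t_2 / t_1 = 100 / 10 = 10 \\
e   &= t_1 / t_a = 10 / 15 \approx 0.67
\end{align*}
```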
• Memory capacity is limited by cost considerations, so wasting space is bad
• The efficiency with which space is being used can be defined as the ratio of useful stuff in memory over total memory, $u = S_u / S$
• Wasted space can be empty due to fragmentation, or inactive data that is never used
• System also takes up some memory space
• Virtual memory space is usually much greater than physical
• If a memory address is referenced that is not in main memory, then there is a page fault and the OS fetches the data
• When virtual address space is much greater than physical, most page table entries are empty
• Fixed by inverted hashed page tables, where page numbers are hashed to smaller values that index a page table where each entry corresponds to physical frames
• Hash collisions handled by extra chain field in the page table which indicates where colliding entry lives
• Lookup process is:
• Hash page number
• Index the page table using hash. If the tag matches then page found
• If not then check chain field and go to that index
• If chain field is null then page fault
• Average number of probes for an inverted page table with good hashing algorithm is 1.5
• Practical to have a page frame table with twice as many entries as there are frames of memory
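A minimal C sketch of the lookup steps above (the entry layout, table size and hash function are made-up assumptions):

```c
#include <stdint.h>

#define TABLE_SIZE 1024        /* e.g. twice the number of physical frames */
#define CHAIN_END  0xFFFFFFFFu /* null chain field */

typedef struct {
    uint32_t tag;   /* virtual page number stored in this entry */
    uint32_t frame; /* physical frame it maps to */
    uint32_t chain; /* index of next entry with the same hash, or CHAIN_END */
    int      valid;
} ipt_entry;

static ipt_entry table[TABLE_SIZE];

static uint32_t hash_page(uint32_t vpn) {
    return (vpn * 2654435761u) % TABLE_SIZE; /* any decent hash will do */
}

/* Returns the physical frame for vpn, or -1 to signal a page fault */
int ipt_lookup(uint32_t vpn) {
    uint32_t i = hash_page(vpn);          /* 1. hash the page number */
    while (table[i].valid) {
        if (table[i].tag == vpn)          /* 2. tag matches: page found */
            return (int)table[i].frame;
        if (table[i].chain == CHAIN_END)  /* 4. null chain: page fault */
            break;
        i = table[i].chain;               /* 3. follow the chain field */
    }
    return -1;
}
```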
• Segmentation allows programmer to view memory as multiple address spaces - segments
• Each segment has its own access and usage rights
• Provides a number of advantages:
• Simplifies dynamic data structures, as segments can grow/shrink
• Can be shared among processes
• Access privileges give protection
• Programs divided into segments which are logical parts of variable length
• Segments are made up of pages, so the segment table is used to find the offset of the address within the page table
• Two levels of lookup tables, with the address split into 3 parts (segment, page, word offset)
• Translation Lookaside Buffer (TLB) holds most recently referenced table entries as a cache
• When TLB misses, there is a significant overhead in searching main memory page tables
• TLB miss ratio usually low, less than 0.01
• Page size has an impact on memory space utilisation factor
• Too large, then excessive internal fragmentation
• Too small, then page tables become large and reduces space utilisation
• $s$ is the segment size in words and $p$ the page size, so when $s \gg p$, the last page assigned to a segment will contain on average $p/2$ words
• Size of the page table associated with each segment is approx $s/p$ words, assuming each table entry is 1 word
• Memory overhead for each segment is $S = \frac{s}{p} + \frac{p}{2}$
• Space utilisation is therefore $u = \frac{s}{s + S} = \frac{s}{s + s/p + p/2}$
• Optimum page size $p_{\text{opt}} = \sqrt{2s}$, found by minimising $S$ (derivation below)
• Optimum utilisation $u_{\text{opt}} = \frac{1}{1 + \sqrt{2/s}}$
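The optimum follows from minimising the per-segment overhead $S$ with respect to $p$:

```latex
\begin{align*}
S(p) &= \frac{s}{p} + \frac{p}{2} \\
\frac{dS}{dp} &= -\frac{s}{p^2} + \frac{1}{2} = 0
  \;\Rightarrow\; p_{\text{opt}} = \sqrt{2s} \\
u_{\text{opt}} &= \frac{s}{s + \sqrt{2s}} = \frac{1}{1 + \sqrt{2/s}}
\end{align*}
```

For example, a segment of $s = 2^{15}$ words gives $p_{\text{opt}} = \sqrt{2 \times 32768} = 256$ words.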
• Hit ratio increases with page size up to a maximum, then begins to decrease again
• Value of $p$ yielding max hit ratio can be greater than the optimum page size for utilisation
• When a page fault occurs, the memory management software is called to swap in a page from secondary storage
• If memory is full, it is necessary to swap out a page
• Efficient page replacement algorithm required
• Doing it randomly would be fucking stupid, might evict something being used
• FIFO is simple and removes oldest page, but still might evict something being used
• Clock replacement algorithm modifies FIFO, keeping track of unused pages through a use bit (sketched below)
• Use bit is set when the page is referenced and cleared as the clock hand sweeps past, so pages used since the last sweep get a second chance
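A minimal sketch of the second-chance formulation in C (frame count and names are made up):

```c
#define NFRAMES 8

static int use_bit[NFRAMES]; /* set to 1 whenever the page in the frame is referenced */
static int hand = 0;         /* the clock hand sweeps circularly over the frames */

/* Pick a victim frame: skip (and clear) recently used frames until an
   unused one is found; at worst the hand sweeps all the way around once. */
int clock_evict(void) {
    for (;;) {
        if (!use_bit[hand]) {          /* not used since last sweep: evict */
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        use_bit[hand] = 0;             /* used: give it a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```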
• LRU algorithm works well but complex to implement, requires an age counter per entry
• Usually approximated through use bits set at intervals
• Working set replacement algorithm keeps track of the set of pages referenced during a time interval
• Replaces the page which has not been referenced during the preceding time interval
• As time passes, a moving window captures a working set of pages
• Implementation is complex
• Thrashing occurs when there are too many processes in too little memory, and the OS spends all its time swapping pages rather than doing useful work

## Cache

• Cache contains copies of sections of main memory and relies on locality of reference
• Objective of cache is to have as high a hit ratio as possible
• Three techniques used for cache mapping
• Direct, maps each block of memory to only one possible cache line
• Associative, permits each main memory block to be loaded into any line of cache
• Cache control logic must examine each cache line for a match
• Set associative, each main memory block can map to any cache line within one set
• In direct mapping, address is divided into three fields: tag, line and word
• Cache is accessed with the same line and word as main memory
• Tag is stored with data in the cache
• If tag matches that of the address, then that's a cache hit
• If a miss occurs, the new data and tag is fetched to cache
• Simple and inexpensive
• Fixed cache location for each block means that if two needed blocks map to the same line then the cache will thrash
• Victim cache was originally proposed as a solution
• A fully associative cache of 4-16 lines sat between L1 and L2
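A minimal sketch of the direct-mapped address split in C (the 64-line, 16-byte-block geometry is an assumption for illustration):

```c
#include <stdio.h>

#define WORD_BITS 4 /* 16 bytes per block -> 4 word/offset bits */
#define LINE_BITS 6 /* 64 lines in cache  -> 6 line bits */

int main(void) {
    unsigned addr = 0x1234ABCDu;
    unsigned word = addr & ((1u << WORD_BITS) - 1);
    unsigned line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    unsigned tag  = addr >> (WORD_BITS + LINE_BITS);
    /* A hit means the tag stored at index `line` equals `tag` */
    printf("tag=0x%x line=%u word=%u\n", tag, line, word);
    return 0;
}
```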
• Fully associative cache schemes divide the CPU address into tag and word
• Cache accessed by same word
• Tag stored with data; have to examine every tag to determine if there's a cache miss
• Complex because of this
• Set associative combines the two, where a given block maps to any line in a given set
• eg, a 4-way cache has 4 lines per set and a block can map to any one of these 4
• Performance increases diminish as set size increases
• Performance can be improved with separate instruction and data caches, L1 usually split
• Principle of inclusion states that L1 should always be subset of L2, L2 subset of L3, etc
• When L3 is fetched to, data is written to L2 and L1 also
• Writing to cache can result in cache and main memory having inconsistent data
• It is necessary to maintain coherence if
• I/O operates on main memory
• Processors share main memory
• There are two common methods for maintaining consistency
• With write through, every write operation to cache is repeated to main memory in parallel
• Average access time $t_a = t_1 + (1-H)t_B + w(t_2 - t_1)$
• Assumes $t_B$ is the time to transfer a block to cache, $w$ is the fraction of references that are writes, and $t_1$, $t_2$ are the cache and main memory access times
• Main memory write operation must complete before any further cache operations
• If size of block matches datapath width, then whole block can be transferred in one operation, $t_B = t_2$
• If not, then $b$ transfers are required and $t_B = b\,t_2$
• Write through often enhanced by buffers for writes to main memory, freeing cache for subsequent accesses
• In some systems, cache is not fetched to when a miss occurs on a write operation, meaning data is written to main memory but not cache
• With write back, a write operation to main memory is performed only at block replacement time
• Increases efficiency if variables are changed a number of times
• Simple write back refers to always writing back a block when a swap is required, even if data is unaltered
• Average access time becomes $t_a = t_1 + 2(1-H)t_B$
• x2 because you write the block back then fetch a new one
• Tagged write back only writes back a block if the contents have altered
• 1-bit tag stored with each block, and is set when block altered
• Tags examined at replacement time
• Access time $t_a = t_1 + (1-H)(1 + w_b)\,t_B$
• $w_b$ is the probability a block has been altered
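Comparing the three policies with illustrative numbers (not from the notes): $t_1 = 10$ ns, $t_2 = t_B = 100$ ns, $H = 0.95$, $w = 0.2$, $w_b = 0.5$,

```latex
\begin{align*}
\text{write through:}\quad
  t_a &= 10 + 0.05 \times 100 + 0.2\,(100 - 10) = 33\ \text{ns} \\
\text{simple write back:}\quad
  t_a &= 10 + 2 \times 0.05 \times 100 = 20\ \text{ns} \\
\text{tagged write back:}\quad
  t_a &= 10 + 0.05 \times (1 + 0.5) \times 100 = 17.5\ \text{ns}
\end{align*}
```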
• Write buffers can also be implemented
• Most modern processors have at least two cache levels
• Normal memory hierarchy principles apply, though on an L2 miss data is written to L1 and L2
• With two levels, average access time becomes $t_a = t_1 + (1-H_1)t_{B1} + (1-H_1)(1-H_2)t_{B2}$, where $H_1$, $H_2$ are the hit ratios and $t_{B1}$, $t_{B2}$ the block transfer times from L2 and main memory
• A replacement policy is required for evicting cache lines in associative and set-associative mappings
• Most effective policy is LRU, implemented totally in hardware
• Two possible implementations, counter and reference matrix
• A counter associated with each line is incremented at regular intervals
• Counter is reset to 0 every time the line is accessed
• On a miss when the cache is full, the line whose counter holds the maximum value is replaced and its counter reset to 0
• Reference matrix is based on a matrix of status bits
• If there are $n$ lines to consider, then the upper triangular half of an $n \times n$ matrix of status bits is formed without the diagonal, with bits $b_{ij}$ for $i < j$
• When the $i$th line is referenced, all bits in the $i$th row are set to one and the $i$th column is zeroed
• The least recently used one is one that has all 0s in its row and all 1s in its column
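A minimal C sketch of the reference matrix, using the full $n \times n$ matrix for clarity (the triangular form in the notes stores only the bits above the diagonal):

```c
#define NLINES 4

static int bits[NLINES][NLINES]; /* status bits; all zero initially */

/* On a reference to line i: set row i to ones, then zero column i */
void touch(int i) {
    for (int j = 0; j < NLINES; j++) bits[i][j] = 1;
    for (int j = 0; j < NLINES; j++) bits[j][i] = 0;
}

/* The least recently used line is the one whose row is all zeros */
int lru(void) {
    for (int i = 0; i < NLINES; i++) {
        int all_zero = 1;
        for (int j = 0; j < NLINES; j++)
            if (bits[i][j]) { all_zero = 0; break; }
        if (all_zero) return i;
    }
    return 0; /* unreachable once every line has been touched */
}
```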
• There are three types of cache miss:
• Compulsory, where an access will always miss because it is the first access to the block
• Capacity, where a miss occurs because a cache is not large enough to contain all the blocks needed
• Conflict, misses occurring as a result of the cache not being fully associative
• Sometimes a fourth category, coherency, is used to describe misses occurring due to cache flushes in multiprocessor systems
• Performance measures based solely on hit rate don't factor in the actual cost of a cache miss, which is the real performance issue
• Average memory access time = hit time + (miss rate x miss penalty)
• Measuring access time can be a more indicative measure
• There are a number of measures that can be taken to optimise cache performance
• Have larger block sizes to exploit spatial locality
• Likely to reduce number of compulsory misses
• Will increase cache miss penalty
• Have a larger cache
• Longer hit times, increased power consumption, and more expensive
• Higher levels of associativity
• Reduces number of conflict misses
• Can cause longer hit times and increased power consumption
• Multilevel Caches
• Idea is to reduce miss penalty
• L1 cache keeps pace with CPU clock, further caches serve to reduce the number of main memory accesses
• Can redefine average access time for multilevel caches: L1 hit time + (L1 miss rate x (L2 hit time + (L2 miss rate x L2 miss penalty)))
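For example, with assumed values of a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% L2 miss rate and a 100-cycle L2 miss penalty:

```latex
t_a = 1 + 0.05 \times \left(10 + 0.2 \times 100\right) = 1 + 0.05 \times 30 = 2.5\ \text{cycles}
```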
• Prioritising read misses over writes
• Write buffers can hold updated value for a location needed on a read miss
• If no conflicts, then sending the read before the write will reduce the miss penalty
• Optimisation easily implemented in write buffer
• Most modern processors do this as the cost is low
• Avoid address translation during cache indexing
• Caches must cope with the translation of virtual addresses to physical
• Using the page offset (which needs no translation) to index the cache means the lookup can proceed without waiting for the TLB
• Imposes restrictions in structure and size of cache
• Controlling L1 cache size and complexity
• Fast clock cycles encourage small and simple L1 caches
• Lower levels of associativity can reduce hit times as they are less complex
• Way prediction
• Gives the hit time of a direct-mapped cache while keeping the lower conflict misses of set associativity
• Keep extra bits in cache to predict the way (block within the set) of the next cache access
• Requires block predictor bits in each block
• Determine which block to try on the next cache access
• If prediction correct then latency is equal to direct mapped, otherwise at least an extra clock cycle required
• Prediction accuracy commonly 90%+ for 2-way cache
• Pipelined access
• Effective latency of an L1 cache hit can be multiple cycles
• Pipelining allows increased clock speeds and bandwidth
• Can incur slower hit times
• Non-blocking cache
• Processors in many systems do not need to stall on a data cache miss
• Instruction fetch could be performed while data fetched from main memory following a miss
• Allows issuing more than one cache request at a time
• Cache can continue to supply hits immediately following a miss
• Performance hard to measure and model
• Out-of-order processors can hide impact of L1 misses that hit L2
• Multi-bank caches
• Increase cache bandwidth by having multiple banks that support simultaneous access
• Ideal if cache accesses spread themselves across banks
• Critical word first
• A processor often only needs one word of a block at a time
• Request the missing word first and send it to the processor, then fill the remainder of the block
• Most beneficial for large caches with large blocks
• Merging write buffer
• Write buffers are used by write-through and write-back caches
• If write buffer is empty then data and full address are written to buffer
• If the write buffer contains other modified blocks then the address can be checked to see if the new data and a buffer entry match, and if so the data is combined with that entry
• Known as write merging
• Reduces miss penalty
• Hardware prefetching
• Put the data in cache before it's requested
• Instruction prefetches usually done in hardware
• Processor fetches two blocks on a miss, the missed block and then prefetches the next one
• Prefetched block put in instruction stream buffer
• Compiler driven prefetching
• Reduces miss rate and penalty
• Compiler inserts prefetching instructions based on what it can deduce about a program
• Compiler can make other optimisations such as loop interchange and blocking