The following possible uses come to mind, although I am not aware of any theoretical exploration of these ideas.
One possible use is to expand set dueling. Presented in "Adaptive Insertion Policies for High-Performance Caching" by Moinuddin K. Qureshi et al. (2007), the set dueling mechanism dedicates some cache sets to each of two competing policies and uses the policy that performs best on the dedicated sets for the remaining follower sets. By binding policies to ways in an overlaid skewed-associative cache, the number of monitored entries could be expanded without sacrificing the capacity available to a given policy (although the associativity bound to a policy would be lower).
The number of entries bound to a policy could also be adjusted dynamically to trade training sensitivity against the associativity of the preferred policy. It is not clear whether competition for the same entries would also have benefits; the degree of sharing of monitored entries could likewise be adjusted dynamically.
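As a rough illustration of the set dueling mechanism referenced above, the sketch below selects between two policies with a saturating counter. The set count, counter width, leader-set spacing, and all names are illustrative assumptions of mine, not details from Qureshi et al.

```python
# Illustrative sketch of set dueling: leader sets sample each policy,
# a saturating counter picks the policy for the follower sets.
# All parameters below are assumed for illustration.

NUM_SETS = 1024
PSEL_BITS = 10
PSEL_MAX = (1 << PSEL_BITS) - 1

# Dedicate a few "leader" sets to each of the two competing policies.
leaders_a = set(range(0, NUM_SETS, 64))    # sets 0, 64, 128, ...
leaders_b = set(range(32, NUM_SETS, 64))   # sets 32, 96, 160, ...

psel = PSEL_MAX // 2  # policy-selection counter, starts neutral

def on_miss(set_index):
    """Update the policy selector when a leader set misses."""
    global psel
    if set_index in leaders_a:
        psel = min(PSEL_MAX, psel + 1)   # policy A missed: lean toward B
    elif set_index in leaders_b:
        psel = max(0, psel - 1)          # policy B missed: lean toward A

def policy_for(set_index):
    """Leader sets always use their own policy; followers use the winner."""
    if set_index in leaders_a:
        return "A"
    if set_index in leaders_b:
        return "B"
    return "B" if psel > PSEL_MAX // 2 else "A"
```

Binding policies to ways rather than sets, as suggested above, would replace the per-set dedication with per-way dedication, but the counter-based selection would be the same.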
Virtual non-temporal cache
With overlaid skewed associativity, it might be convenient to bind one or more ways to non-temporal accesses and to use indexing functions for those ways that deliberately generate conflicts. For example, by simply dropping bits 6..11 of the block number (indexing with bits 0..5 and 12..N), a sequential stream would wrap onto the same 64 sets every 64 blocks, so a 4096-block region would conflict with itself 64-fold. (By XORing selected low and middle bits, streams with a stride of two blocks could be allocated densely while still providing internal conflict among groups of blocks.)
Compared to a dedicated non-temporal cache, this would allow more flexible use of capacity; compared to placing non-temporal data in a single way of a conventionally indexed cache, it would potentially semi-reserve less capacity (in a conventionally indexed cache, an entire way would be given over to a non-temporal data stream).
Since different ways (and different portions of ways) could generate different degrees of conflict, and less conflict-prone ways could be used selectively, the capacity allowed to a stream could be somewhat flexible and conflicts between streams could be more easily managed.
Although conflict-inducing indexing could be used without overlaying, overlaying would allow more flexible use of capacity and of conflicts between ways to evict data that will not be referenced again.
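The indexing functions sketched above might look as follows. The bit positions for the first function follow the example in the text; the field widths for the XOR variant are my own illustrative choice.

```python
# Sketch of conflict-inducing index functions for a non-temporal way.

def conflict_index(block_number):
    """Index with bits 0..5 and 12..17 of the block number (bits 6..11
    dropped): a sequential stream wraps onto the same 64 sets every
    64 blocks, conflicting with itself and evicting its own older data."""
    low = block_number & 0x3F            # bits 0..5
    high = (block_number >> 12) & 0x3F   # bits 12..17
    return (high << 6) | low

def xor_index(block_number):
    """XOR low and middle bit fields (widths assumed here) so strided
    streams still pack densely while groups of blocks conflict internally."""
    low = block_number & 0xFFF           # bits 0..11
    mid = (block_number >> 12) & 0xFFF   # bits 12..23
    return low ^ mid
```

For example, `conflict_index` maps blocks 0 and 64 to the same set, which is exactly the self-conflict the text describes for streaming data.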
Blocks with variable alignment
By binding different ways to different cache block alignments, larger blocks may become more practical when there is significant spatial locality but the accessed portion of the data does not begin at a highly aligned address. This could facilitate the use of larger blocks with less idle capacity. Larger blocks can reduce tag overhead and increase the accuracy of way prediction (or the utility of way memoization).
This could be particularly useful for instruction caches (which benefit more from spatial locality and are more amenable to way prediction/memoization), especially branch target instruction caches. Some trace-like pre-decode optimizations could be made more convenient by reducing the number of entry points into a cache block. A smaller (faster, more energy-efficient) portion of an instruction cache could be used preferentially for branch targets, avoiding redirection delays much like a BTIC, while still allowing the capacity to be used for non-branch-target blocks.
(For data caches, unaligned storage might be useful for something like a signature cache, reducing the handicap of non-CAM tags. ???)
Similarly, a branch target instruction cache placed after the instruction cache could exploit variable alignment of function entry points without the tag overhead of using several smaller blocks.
Data mapping based on stride
Similar to using ways to map different alignments, it would also be possible to use different ways with different strides of sub-blocks. For example, a way might map data into a block such that alternating words are excluded, not only increasing the capacity used when stride-two accesses are employed but also potentially facilitating fixed-stride SIMD loads and stores.
It might even be practical to support configurable power-of-two strides, as long as the stride does not change often.
(A data trace cache could exploit both the support for different alignments and support for different block compositions, e.g., 32-bit compressed/shifted addresses vs. 64-bit addresses, or address-only vs. an address plus N bytes of data.)
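The stride-based mapping above can be sketched as a function deciding which slot of a block a given word occupies. The block geometry and the function name are my own illustrative assumptions.

```python
# Sketch of stride-based sub-block mapping: a way stores only every
# `stride`-th word of a region, packing them into consecutive slots so a
# strided access pattern fills the block densely. Geometry is assumed.

WORDS_PER_BLOCK = 8

def packed_offset(word_address, stride):
    """Return the slot within the block for a word, or None if that word
    is excluded by the stride mapping (stride is a power of two)."""
    if word_address % stride != 0:
        return None                              # word not covered by this way
    return (word_address // stride) % WORDS_PER_BLOCK
```

With `stride=2`, words 0, 2, 4, ... occupy consecutive slots, so a stride-two stream uses the block's full capacity instead of half of it, and a SIMD load of the packed slots naturally gathers the strided words.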
Binding storage type to groups of ways
In a unified cache, one group of ways could be dedicated to instructions and another to data. Such a binding of ways would facilitate lower associativity, at least for the initial lookup.
This could also be applied to the interpretation of metadata, with different ways storing different kinds of metadata. For example, some ways could use ECC (requiring read-modify-write for sub-word stores) and others parity (allowing low-overhead sub-word stores but with lower reliability); this could interact with storage reliability: for example, ECC-protected blocks could tolerate greater voltage reduction in the SRAM, and parity-protected ways could allocate clean blocks to less reliable SRAM arrays (a clean block can simply be refetched on a parity error).
Block size based on the way
While a classic non-overlaid cache could associate one tag with a pair of storage blocks, using separate ways dedicated to larger blocks would tend to require fewer tag checks. Skewed indexing, especially when overlaid, would reduce the conflict multiplication resulting from coarser-grained caching with greater spatial extent. In either case, extra tags could be used for data-less inclusion or other metadata, although the greater flexibility of overlaid allocation would tend to increase the opportunities for such alternative uses.
Variable characteristics of storage
With overlaid ways, part of each way can be mapped to a cache section providing faster (or more energy-efficient) access or different amounts of metadata (different kinds of metadata were mentioned earlier).
By increasing the independence of cache properties from data addresses through overlaid skewed associativity, more direct indexing could be used while still supporting differences in the characteristics of specific storage locations.
Although such allocation is not as flexible as a conventional non-uniform cache architecture (which uses a level of indirection), it could make NUCA techniques more applicable to L1 caches. (The problem of scheduling operations dependent on variable L1 cache latency would still need to be solved, as would the problems skewed associativity itself poses for an L1 cache.)
(The way indication itself may provide some metadata when ways are reserved for, or biased toward, specific uses or policies.)
It might also be practical for parts of the cache (whether specific ways or different sections of ways) to use sub-block NUCA, where some sub-blocks of a block have faster access (for example, to speed up code pointer chasing). Although fixed assignment of sub-blocks (with different sub-block selection per way and/or address) could be implemented relatively simply, variable allocation of the fast storage (for example, where a 64-byte block could be mapped to four fast 16-byte storage units, one fast and three slow units, or four slow units) would seem difficult.
Such a separation could be provided in a non-overlaid cache by the same association of way/address with cache characteristics, but the allocation flexibility would be lower.
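To make the "relatively simple" fixed case above concrete, here is a minimal sketch of fixed sub-block selection; the rotating selection function and all names are my own assumptions, not a design from the text.

```python
# Sketch of fixed sub-block NUCA assignment: a 64-byte block is stored as
# four 16-byte sub-blocks, and which sub-block lands in the fast storage
# unit is a fixed function of way and set index (a rotation, as an
# arbitrary illustrative choice). No variable allocation is attempted.

SUBBLOCKS = 4  # 64-byte block split into four 16-byte sub-blocks

def fast_subblock(way, set_index):
    """Return which sub-block (0..3) of this block gets fast storage."""
    return (way + set_index) % SUBBLOCKS
```

Because the selection varies by way and address, software (or an allocator biasing placement) can still get a chosen sub-block, such as one holding a code pointer, into fast storage by choosing where the block lands; the hard part noted above, letting one block own zero to four fast units, is what this fixed scheme avoids.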
Decoupling associativity and capacity
With overlaying, the conflict-prone portions of ways can be mapped to different storage sections in different ways, and accesses through different ways can compete for a given entry.
Partial overlaying offers another dimension of design flexibility. For example, a two-way associative cache could map one way half onto an isolated section and half onto an overlaid section, with the other way likewise having its own isolated section (and the shared overlaid section). This kind of partial overlap could be useful for a Knapsack-like cache ("Knapsack: A Zero-Cycle Memory Hierarchy Component", Todd M. Austin et al., 1993), where part of the storage could be dedicated to restricted caching while another part could also be used for other accesses. For example, a two-way associative section could be used traditionally while also allowing the Knapsack region to be larger and somewhat sparse at higher addresses; conflict rates and replacement preferences could be managed flexibly [some similarity might be noted to Brannon Batson and T. N. Vijaykumar's "Reactive-Associative Caches"].
Isolated ways can decouple associativity and capacity by making some ways smaller than others. However, overlaying offers a little more flexibility in the allocation of capacity. With isolated ways, a given block will always be in a higher-conflict section (although this can be compensated by increasing the associativity of the smaller section).
Potentially stronger virtual machine isolation
With overlaying and different indexing functions for different virtual machines, cache capacity could more easily be provided to virtual machines while less side-channel information is communicated.
Traditional skewed associativity with per-virtual-machine indexing adjustment would provide the essence of this benefit, but overlaying provides a little more flexibility of placement and potentially conflicting kinds of mappings for different pairs of virtual machines.
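A per-virtual-machine indexing function of the kind mentioned above could be sketched as follows. The mixing constants and the shape of the hash are arbitrary assumptions for illustration; a hardware design would use a cheaper (and cryptographically better-considered) function.

```python
# Sketch of per-VM index skewing: each virtual machine holds a secret
# key that is mixed into the set index, so one VM cannot easily predict
# which sets another VM's addresses occupy. Hash choice is illustrative.

INDEX_MASK = 0x3FF  # assume 1024 sets per way

def vm_index(block_number, vm_key, way):
    """Mix a per-VM secret key and the way number into the set index."""
    x = block_number ^ (vm_key * 0x9E3779B1) ^ (way * 0x85EBCA6B)
    x ^= x >> 10          # fold higher bits down into the index field
    return x & INDEX_MASK
```

The same block number then lands in different sets for different VMs (and different ways), which is the property that weakens conflict-based side channels between VM pairs.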
Multithreading Conflict Containment
Partial overlaying could be used to isolate some of the data associated with a thread while still providing substantial capacity sharing. Such an assignment would provide more flexible allocation than traditionally assigning a thread-specific storage section.
Partial overlaying would also facilitate higher associativity in a subset of the cache, which could be beneficial for storage shared by more threads, or used for allocations that are more likely to induce conflict misses or for which conflict misses are more expensive.
In a clustered design, overlaying could be useful for biasing placement to reduce inter-cluster communication when threads are biased toward, or allocated exclusively to, particular clusters.
Way prediction implications
If skewed associativity can increase the accuracy of prediction based on partial tags and recency (because spatially local blocks are likely to be mapped to different indexes in different ways for different address regions), overlaying may bring a slight additional benefit.
(Similarly, skewing allocation to improve way prediction would be expected to increase the miss rate less with skewed associativity, and especially with overlaid skewed associativity, than with simple modulo indexing.)
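For reference, the partial-tag prediction mentioned above amounts to filtering ways by a few tag bits before any full compare; the partial-tag width and all names below are my own illustrative assumptions.

```python
# Sketch of partial-tag way prediction: compare only a few low tag bits
# per way; only the matching ways need a full tag compare / data read.
# Partial-tag width is an assumed parameter.

PARTIAL_BITS = 4

def predict_ways(address_tag, stored_partial_tags):
    """Return the list of ways whose stored partial tag matches the
    access's partial tag (candidate ways for the full lookup)."""
    partial = address_tag & ((1 << PARTIAL_BITS) - 1)
    return [w for w, p in enumerate(stored_partial_tags) if p == partial]
```

Skewing helps such a predictor because blocks that would alias on partial tags under modulo indexing are more likely to be spread across different sets in different ways, so fewer ways match spuriously.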
Faulty block tolerance and locking
Although it is not a mechanism as such, it may be worth noting that overlaying would reduce the impact of excluding some blocks from allocation choices. While traditional skewed associativity increases conflict resistance in such cases, overlaying would have an added advantage.
Similarly, locking a portion of the cache could offer the implementation simplicity of traditional locking with less reduction in effective associativity. (Note: mapping a portion of the cache to specific addresses to form a scratchpad memory would free the associated tag storage for other uses; for example, smaller-than-normal blocks could be supported for part of the remaining cache, and overlaid skewed associativity could facilitate flexible allocation of the small blocks.)
Indirection overhead reduction for NUCA
Overlaying would allow different addresses to map to the same block, potentially avoiding traditional NUCA's indirection while providing substantial flexibility in allocating cache entries to faster storage. The absence of indirection would facilitate co-locating tags with the data arrays, at least for a faster storage section (it may be desirable to co-locate all tags to allow faster miss determination, although a cache shared by multiple cores, or even by multiple access paths (e.g., instruction vs. data), might have separate fast storage sections with associated tags). Even if indirection is used, overlaying would reduce its cost by making the indirection less general, which could reduce metadata storage overhead and potentially access latency.
A fixed but different placement per way would have similar benefits, allowing different addresses to be placed in fast storage with a fast lookup, but overlaying increases the flexibility of allocation.