Abstract
A key-value container providing caching with a least-recently-used replacement strategy is a useful tool in any programmer's performance optimisation toolkit; however, with no ready-to-use implementation provided in the standard library or the widely used boost libraries, C++ developers are likely to resort to inefficient or incorrect approximations of the logic. This document describes a couple of simple implementations, built on top of the C++ standard library's map types and the boost library's "bimap" types, in the hope of making this useful pattern more accessible to coders not already familiar with it.
While the general aim of this document is to provide its audience with enough information to implement their own LRU-caching classes, readers are welcome to cut-and-paste the code listings in this document under the terms of the Internet Systems Consortium (ISC) license (an OSI-approved BSD-alike license). The following text should be considered to apply to all code in this document, and added as a comment in any code copied directly from it:
Readers simply looking for templated LRU cache code for instant usage may wish to turn straight to Listing 1 and the comments immediately preceding it.
The sources are also available from a Bitbucket-hosted Mercurial repository.
The need for caching behaviour sometimes arises during system development. Generally the desire is to preserve some expensive-to-obtain results so they can be reused "for free" without repeating the expensive operation in future. Typically the expense arises because a complex calculation is needed to obtain the result, or because it must be obtained via a time-consuming I/O operation. If the total number of such results dealt with over the lifetime of the system does not consume excessive memory, it may suffice to store them in a simple key-value container (for example, a std::map), with the key being the input to the expensive function and the value being the result. This is often referred to as "memoisation" of a function.
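For concreteness, a minimal memoisation sketch (not taken from this document's listings; expensive_f here is a hypothetical expensive function) might look like:

  #include <map>
  #include <string>
  #include <utility>

  std::string expensive_f(int k);   // hypothetical expensive operation to be memoised

  std::string memoised_f(int k)
  {
    static std::map<int, std::string> results;       // grows without bound; no eviction yet
    std::map<int, std::string>::const_iterator it = results.find(k);
    if (it != results.end())
      return it->second;                             // already computed: reuse the stored result
    const std::string v = expensive_f(k);
    results.insert(std::make_pair(k, v));            // first time: compute and remember
    return v;
  }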
However, for most applications this approach would quickly consume too much memory to be of any practical value. The memory consumption issue can be addressed by limiting the maximum number of items stored in the cache or, if the items have a variable size, by limiting the aggregate total stored. Initially the cache is empty and records (key-value pairs) can be stored in it freely. After some further usage it will fill up. Once full, the question arises of what to do with additional records which it would be desirable to cache, but for which there is no space unless some existing records are removed from the store. Assuming the records most recently added to the cache are those most likely to be accessed again (i.e. assuming some temporal coherence in the access sequence), a good general strategy is to make way for a new record by deleting the record in the cache which was "least recently used". This is called an LRU replacement strategy.
Despite the general utility of an LRU-replacement cache component, one is not provided as such in C++'s Standard Library or in the well known Boost libraries. However, the logic is reasonably easy to construct using either library as a basis.
The following two sections describe possible implementations, both of them providing caching up to a limited number of records of a function with signature V f(K), where K and V are the “key type” and the “value type”. The only real difference in the API to either implementation is that the boost-based one passes the cached function as a boost::function rather than a function pointer, providing slightly more flexibility through compatibility with boost::bind.
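Illustratively, the two forms in which the cached function is accepted can be summarised as below; the struct and typedef names are purely for exposition and do not appear in the listings:

  #include <boost/function.hpp>

  // Purely illustrative: the two forms in which a cached function V f(K) is passed.
  template <typename K, typename V>
  struct cached_function_forms
  {
    typedef V (*function_pointer_form)(K);              // standard-library-based cache
    typedef boost::function<V(K)> boost_function_form;  // boost-bimap-based cache; also accepts boost::bind results
  };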
The code presented has been tested with g++ 4.4.5. The -std=c++0x option is required in order to use the std::unordered_map hash-based map container which appeared with the TR1 additions to the C++ Standard Library, and variadic template arguments are also used to bypass some template-template parameter issues associated with the different type arguments to the tree-based and hash-based container types.
For brevity the code explanations below will generally refer to the std::map and boost::bimaps::set_of versions of the code; however both cache implementations work equally well (or better; see subsection 7.2) with std::unordered_map and boost::bimaps::unordered_set_of containers and the code presented is fully intended to support both, subject to the key type supporting the necessary concepts (ordered comparison in the case of std::map and boost::bimaps::set_of, hashability and equality testing in the case of the unordered variants).
Users of the C++ Standard Library needing an LRU-replacement cache generally gravitate towards std::map or std::unordered_map, because of their support for efficient keyed value accesses (O(log n) or O(1) access complexity). The problem then is how to implement the eviction strategy.
The most obvious naive solution is to use a single map which stores, alongside each cached value, a timestamp recording when that value was last accessed.
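In outline (a sketch only; K and V stand for the key and value types, and the type names are illustrative rather than taken from any listing):

  typedef std::map<K, std::pair<V, timestamp_t> > cache_type;   // key -> (value, last-access timestamp)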
The timestamp_t holds a scalar quantity, the ordering of which indicates when the value was last accessed relative to other values; typically some sort of incrementing serial number is used rather than an actual clock-derived time, to ensure a one-to-one mapping from timestamps to records. Keys then give efficient access to values and timestamps, and timestamps can be updated without the need to adjust the map's tree structure (as this depends entirely on the key values). However, to evict a record it is necessary to determine the minimum timestamp, which requires an O(n) search over all the records to find the oldest one.
As a solution to eviction being expensive, it might be tempting to implement the cache as a list of key-value pairs ordered by how recently they were accessed.
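Something like the following (again a sketch only, with illustrative names):

  typedef std::list< std::pair<K, V> > cache_type;   // least recently used at the head, most recently used at the tail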
Moving any accessed item to the tail of the list (a cheap operation for lists) ensures that the least-recently-used item can trivially be obtained (for erasure) at the list head by begin(). However, this just moves the problem elsewhere, as it is now necessary to resort to an O(n) search simply to look up a key in the cache.
While either naive solution can be made to work (and may well be a pragmatic choice for caching a few tens of items), neither can be considered scalable or efficient, due to the O(n) behaviour associated with either identifying eviction targets or accessing values by key.
However, it is possible to implement an LRU-replacement cache with efficient eviction and access using a pair of maps:
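In outline, with illustrative type names (K and V being the key and value types as before):

  typedef std::map<K, std::pair<V, timestamp_t> > key_to_value_type;   // key -> (value, last-access timestamp)
  typedef std::map<timestamp_t, K> timestamp_to_key_type;              // timestamp -> key, ordered oldest first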
On accessing key_to_value by a key, we obtain access to both the value required and the timestamp, which can be updated in both the accessed record and, by lookup, timestamp_to_key. When an eviction is required, the lowest timestamp in timestamp_to_key provides the key to the record which must be erased.
However, this can actually be optimised slightly:
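The trick, sketched with the same illustrative names, is to store in each key_to_value record an iterator into the timestamp-keyed map rather than the raw timestamp:

  typedef std::map<timestamp_t, K> timestamp_to_key_type;
  typedef std::map<
    K,
    std::pair<V, typename timestamp_to_key_type::iterator>
  > key_to_value_type;                                                  // key -> (value, iterator to the timestamp record)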
This has the slight advantage that updating the timestamp on a record access avoids a keyed lookup into timestamp_to_key, replacing it with a (presumably more efficient) direct iterator access.
Pedants will observe that a further slight improvement would be to have the timestamp_to_key_type map's value be an iterator into key_to_value_type, but this introduces a circular type definition. It might be tempting to try to break the dependency by using void* in place of the iterator, but iterators cannot portably be cast to pointers. In any case, the first iterator optimisation mentioned benefits the updating of timestamps needed during cache hits, whereas this second iterator optimisation benefits the eviction associated with cache misses. Since, in a well-functioning cached system, cache hits should be far more common than misses, this second optimisation is likely of much less value than the first. Another consideration is that whatever expensive operation is required to generate a new result following a cache miss is likely to be hugely more expensive than any keyed access. Therefore this second optimisation is not considered further.
In fact there is one final powerful optimisation possible. The only operations actually performed on timestamp_to_key are to access its head (lowest timestamp, least recently used) element, or to move elements to the (most recently used) tail. Therefore it can be replaced by a std::list of keys; this also eliminates the need for any actual instances of timestamp_t (and therefore any concerns about the timestamp possibly overflowing). In previous testing the author has found list-and-map implementations to be almost twice as fast as a version (not shown) using a pair of maps.
Finally, before showing the complete implementation, there is one other detail to consider: it would seem desirable to template the cache class code on the underlying container type used (std::map or std::unordered_map) rather than provide two completely separate implementations which differ only in minor details. This is slightly complicated by the different type-argument signatures of the container types' own templates (which provide the optional capability to override the default comparator or hasher/equality predicate logic as appropriate), but a simple workaround is to declare the map type as templated on variadic type arguments, template<typename...> class MAP, so that the actual container instance required can simply be instantiated on the necessary key and value types. Generalising the code to support overriding of the default comparator (or the unordered map type's equivalent) or allocators is left as an exercise for the reader.
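As an illustration of this workaround, the class template header takes roughly the following form (the class name lru_cache_using_std is assumed here and in later sketches purely for exposition, and may differ from the name used in Listing 1):

  template <
    typename K,                        // the cached function's key (argument) type
    typename V,                        // the cached function's value (result) type
    template<typename...> class MAP    // std::map or std::unordered_map
    >
  class lru_cache_using_std;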
See Listing 1 for a complete example using the pair of containers
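which, in outline, are declared along these lines (member type names are illustrative):

  // Key access history: least recently used at the front, most recently used at the back.
  typedef std::list<K> key_tracker_type;

  // Key to (value, access-history iterator) lookup.
  typedef MAP<
    K,
    std::pair<V, typename key_tracker_type::iterator>
  > key_to_value_type;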
where MAP is expected to be one of std::map or std::unordered_map; for example an LRU cache from std::string to std::string using hashing would be declared as
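(using the illustrative class name from the sketch above)

  lru_cache_using_std<std::string, std::string, std::unordered_map>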
while an int-keyed cache of Foo objects using the “classic” std::map would be declared as
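(again with the illustrative class name)

  lru_cache_using_std<int, Foo, std::map>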
Note that use of a reference counted smart-pointer for heap-allocated value types is strongly advised as the very nature of caches easily leads to situations where value object ownership must be considered shared between the cache and a client which has retrieved the value from the cache.
Readers interested in the performance impact of choosing std::map vs. std::unordered_map will find some results in subsection 7.2.
Assuming usage of the boost C++ libraries is an option, boost::bimap provides an excellent basis on which to construct LRU-replacement cache logic. Unlike the better known std::map, which has an asymmetric relation between key and value (that is, a key can index a value but a value can’t index a key), boost::bimap supports a bidirectional mapping between a pair of keys. This is very convenient for indexing a cache’s collection using either the cached-function argument key or the key-access history as required, and much of the complication of the standard-library version described in section 5 can be avoided.
As with the standard library based implementation above, it is not actually necessary to record timestamps explicitly and it suffices for the access-history view of the bimap to be a list_of, which can maintain the ordering of keys accessed. By default bimap inserts new records at the tail of the list view, and records can be efficiently relocated there on access.
There are several ways in which the boost::bimap container to be used might be instantiated. The most obvious one is
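something along these lines (a sketch; K and V are the key and value types, and the typedef name is illustrative):

  typedef boost::bimaps::bimap<
    boost::bimaps::set_of<K>,       // left view: keyed lookup of records
    boost::bimaps::list_of<V>       // right view: list ordering recording the access history
  > container_type;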
with the left view providing the key-to-value semantics, and the ordering on the right view used to maintain a record of the key-value-pair access history. Replacing the set_of with unordered_set_of to use hashing instead of trees is another choice.
However, there also exists the possibility of declaring
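a container of roughly this shape (again a sketch; dummy_type is simply a placeholder type):

  typedef boost::bimaps::bimap<
    boost::bimaps::set_of<K>,
    boost::bimaps::list_of<dummy_type>,    // the list view now orders placeholders only
    boost::bimaps::with_info<V>            // the cached value is carried as "additional information"
  > container_type;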
with the cached value stored using bimap’s support for “additional information” and dummy_type never storing a meaningful value. In the past the author has found some peculiar performance differences between implementations using a with_info version of the container and those without, but this seems to no longer be the case so only the simpler version without with_info is considered further here.
See Listing 2 for a complete implementation. This is similar to the standard library based version in that it is templated on the key view template to be used, which is expected to be one of boost::bimaps::set_of or boost::bimaps::unordered_set_of. For example an LRU cache from std::string to std::string using hashing would be declared as
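(assuming, as before, an illustrative class name, here lru_cache_using_boost, which may differ from that used in Listing 2)

  lru_cache_using_boost<std::string, std::string, boost::bimaps::unordered_set_of>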
See subsection 7.2 for some comparative performance testing between the tree and hash based versions, and between bimap-based caches and conventional standard library map based caches.
Both test codes below make use of the header Listing 3, which brings together both the standard library and boost bimap based implementations and defines some helper types; these allow, for example, the verbose templated cache declarations shown earlier to be written more compactly.
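A plausible sketch of the kind of helper involved (the names here are assumptions for illustration; Listing 3 contains the actual definitions, which may differ):

  // Convenience wrapper so the underlying container need not be spelled out at each use.
  template <typename K, typename V>
  struct lru_cache_using_unordered_map
  {
    typedef lru_cache_using_std<K, V, std::unordered_map> type;
  };

so that, for instance, lru_cache_using_std<std::string, std::string, std::unordered_map> could instead be written lru_cache_using_unordered_map<std::string, std::string>::type.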
Listing 4 contains templated boost::test code to exercise all four cache implementations and check basic behaviour is as expected.
Listing 5 contains test code to examine the performance of all four cache implementations. It makes use of a timer utility class which is provided in Listing 6 (NB this makes use of the TR1 “chrono” addition to the standard library).
In order to test only pure cache performance, a trivial operation is used for the cached function. The code tests a 1024-element-capacity cache with a series of random accesses over a smaller or larger key range. When the load is less than or equal to 100%, accesses are guaranteed to find a record already in the cache, and the performance reflects the cost of simply finding it and updating the access timestamp. When the load is over 100%, the excess determines the expected proportion of accesses which will result in cache misses; these have a higher cost, as deletion of a previous record and creation of a new one is required, which is more complex than simply adjusting a timestamp. Hence in the 112.5% load test one in nine accesses is expected to be a miss, and in the 200% load case on average half of all accesses should miss.
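As a quick check of those figures (assuming keys drawn uniformly from a range of load times capacity distinct keys, with the steady-state cache holding capacity of them):

  P(miss) = (range - capacity) / range = 1 - 1/load

so a 112.5% load gives 1 - 1/1.125 = 1/9, and a 200% load gives 1 - 1/2 = 1/2.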
Two key-value types are tested:
Table 1 and Table 2 show performance results obtained on a 2.66GHz Intel Q9450 Core2 running Debian Squeeze (32bit) when compiled with
Software versions are g++ 4.4.5, boost 1.42.
Table 1:

Cache load | Standard Library map | Standard Library unordered_map | boost bimap set_of | boost bimap unordered_set_of
50%    | 14.4 | 41.8 | 13.4 | 36.3
87.5%  | 11.6 | 35.1 | 11.9 | 34.4
100%   | 11.6 | 34.4 | 11.9 | 34.8
112.5% | 7.78 | 17.2 | 8.91 | 23.2
200%   | 3.72 | 7.53 | 4.87 | 11.6
Table 2:

Cache load | Standard Library map | Standard Library unordered_map | boost bimap set_of | boost bimap unordered_set_of
50%    | 1.66  | 0.515 | 1.56 | 1.1
87.5%  | 1.58  | 0.515 | 1.49 | 1.09
100%   | 1.57  | 0.513 | 1.49 | 1.09
112.5% | 1.4   | 0.422 | 1.49 | 1.02
200%   | 0.996 | 0.261 | 1.48 | 0.815
The main conclusions which can be drawn from these results are:
Of course these results should not be taken as a substitute for readers carrying out performance testing on their own systems using realistic test cases relevant to their own usage.
There are numerous ways in which basic caching behaviour can be extended. Here are a few suggestions.
In many applications of caches, it is necessary to define the cache capacity not simply as a maximum number of cached elements, but rather in terms of the total of some measure on the elements. For example, a utility for browsing a collection of image files might include a cache of loaded raster images keyed by filename so as to avoid re-loading the most recently viewed images if the user revisits them. A cache which simply stored a fixed number of images would not have very predictable (i.e. bounded) memory consumption due to the enormous variation in image sizes which could be encountered. However, by taking the size-in-bytes of the stored images into account, the maximum total memory consumed by cached objects could be controlled and the number of images cached automatically adjusted in response to the aggregate size.
Implementation of such a cache is left as an exercise for the reader. Hint: it will be necessary to provide some mechanism (another function object?) for determining the capacity consumed by each stored record, to track the aggregate total currently cached, and to call the evict() method within a loop, possibly freeing up multiple elements in order to make sufficient space for insertion of a single large element.
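One possible shape for that eviction loop (a sketch only, with assumed member names; evict() is assumed to remove the least-recently-used record and subtract its size from the running total):

  // Evict least-recently-used records until the incoming record will fit within _capacity.
  void make_space_for(size_t incoming_size)
  {
    while (!_key_to_value.empty() && _current_total + incoming_size > _capacity)
      evict();
  }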
It should be obvious how to trivially make a cache implementation thread-safe by addition of a suitable mutex as a member and locking it for the scope of the access operation. Bear in mind that caches can easily become a serialisation bottleneck because, unlike simpler containers with no state mutated on read, multiple readers cannot simply ignore each other as even a cache-hitting read access must update the access record. Possibly some benefit might be gained from using an upgradeable lock to limit the time for which exclusive access is required to the minimum. For cache hits, exclusivity can be limited to the access history update, but for cache misses the situation is more complicated because several non-exclusive threads could cache miss the same key simultaneously and it is probably desirable for the exclusive-scope code to handle this case efficiently and ideally only compute the corresponding value once.
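A minimal sketch of the simple exclusive-lock approach, written here as a wrapper around any cache providing operator() for a function V f(K) (the wrapper name is illustrative, and this ignores the upgradeable-lock refinements discussed above):

  #include <boost/thread/mutex.hpp>

  // Coarse-grained locking around an existing cache object.
  template <typename K, typename V, typename CACHE>
  class locking_cache
  {
  public:
    explicit locking_cache(CACHE& cache) : _cache(cache) {}

    V operator()(const K& k)
    {
      boost::mutex::scoped_lock lock(_mutex);   // exclusive for the whole access, hits and misses alike
      return _cache(k);
    }

  private:
    CACHE& _cache;
    boost::mutex _mutex;
  };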
In some cases the best solution might be for multiple threads to each maintain their own independent cache: the inefficiency introduced by duplication of some cache-miss calculations must be weighed against the overheads introduced by sharing a cache between multiple threads.
Depending on the nature of the stored result values in the cache, it may be possible to reduce the amount of memory used by the cache (or, more usefully, cache more records in the same amount of memory) by compressing the result values. Storing a value in the cache will incur a compression overhead, and retrieving each value will require a decompression, but if the cost of these operations is small compared with the original operation required to obtain the value, then it may be worth the expense to obtain an overall improvement in system performance through a better cache hit rate.
A further refinement might be two levels of caching; one holding compressed values, and a smaller one holding the most recent decompressed values in anticipation of near-term reuse.
See also boost::flyweight, which may be of use as a cache-size reducer in situations where many records in the cache actually hold the same value (this would imply the cached function is “many-to-one” in character).
For some specialist use cases, alternative strategies might be appropriate. For example, a most-recently-used (MRU) replacement strategy evicts the record most recently added to the cache; this might be relevant where cached results generated earliest in the lifetime of a program are most likely to be reused, or it can be a useful strategy to switch to in cases where a program repeatedly loops over a sequence of records slightly too large for the cache to store fully. Alternatively, a cache might have access to additional information (a so-called “clairvoyant algorithm”) which predicts the cached values to be used in future and thereby be able to select records for eviction so as to minimise the amount of recalculation subsequently needing to be done.
Consider a fully populated cache, which should be considered to be the normal state for caches after they have been in existence for some time. Cache misses result in the deletion of an evicted stored result value, followed by the immediate creation of a new one. If the value type is particularly expensive to create there may be a potential optimisation in recycling evicted values through a simple object pool for more efficient reuse.
Listing 6 contains the timer utility class used by the performance testing code in Listing 5.