<h1>Alan Cache - building the best Python caching library</h1>

<p><img src="/images/admin_dashboard_header.png" alt="Admin dashboard — value inspector" /></p>

<h1 id="1-caching-is-not-easy--a-bit-of-history">1. Caching is not easy — a bit of history</h1>

<p>Alan is a health insurance platform serving multiple countries, powered by a Python/Flask backend with hundreds of web workers and RQ workers (RQ being a Redis-backed job queue).</p>

<pre><code class="language-mermaid">flowchart LR
    Users(("Users")) --&gt; Web["Web Workers&lt;br/&gt;(Flask/Gunicorn)"]
    Web &lt;--&gt; Redis[("Redis")]
    Web -- enqueue --&gt; RQ["RQ Workers"]
    RQ &lt;--&gt; Redis
    Cron["Cron"] -- enqueue --&gt; Redis
</code></pre>

<p>And of course, we have some caching.</p>

<p>On the surface, the caching system we used was simple: Flask-Caching with a single Redis backend. In reality, our code was sprinkled with different caching mechanisms.</p>

<p>There were <strong>more than six different ways</strong> to cache data across the codebase:</p>

<ul>
  <li><strong>Local memory</strong> — <code class="language-plaintext highlighter-rouge">functools.lru_cache</code>, ad-hoc dictionaries with no expiration</li>
  <li><strong>Redis with LRU</strong> — Flask-Caching, Redis only, no local RAM layer</li>
  <li><strong>Simple dictionaries</strong> — plain Python dicts used as caches, no TTL management</li>
  <li><strong>PostgreSQL</strong> — some data was cached directly in Postgres tables, with ad-hoc expiration and deletion management</li>
  <li><strong>RQ job results</strong> — engineers were using the results of RQ background jobs as a makeshift cache, accessing them later instead of recomputing</li>
  <li><strong>Directly storing to Redis</strong> — crafting cache keys manually, with various expiration rules</li>
  <li><strong>Other variations…</strong></li>
</ul>

<p>None of these approaches was standardized. There was very little observability — no way to know what was cached, how much memory it consumed, or whether stale data was being served. The only administration tool was old, and it could do exactly one thing: invalidate <em>all</em> Redis cache entries at once (and it couldn’t touch values cached in local RAM).</p>

<p>I decided to tackle the systemic problem. The plan was to survey all existing caching methods, understand each team’s needs, and then build a single internal product — an adaptive, hybrid cache that would work both in local memory and on Redis, with proper observability, monitoring, and administration tools, and with a set of advanced features that none of the existing approaches could offer.</p>

<p>After a few months of work, <strong>Alan Cache</strong> was born. It’s a Python library that makes caching <strong>dead-simple</strong> for the common case while offering powerful capabilities for the hard ones. It even provides features that had previously been considered <strong>impossible</strong>. Here is a shortlist of its most interesting features:</p>

<ul>
  <li><strong>Two interfaces</strong> — a <code class="language-plaintext highlighter-rouge">@cached_for(hours=1)</code> decorator for the 90% case, plus a direct <code class="language-plaintext highlighter-rouge">get</code>/<code class="language-plaintext highlighter-rouge">set</code>/<code class="language-plaintext highlighter-rouge">delete</code> API for manual control</li>
  <li><strong>Hybrid RAM + Redis storage</strong> — local RAM for sub-millisecond reads, shared Redis for cross-process persistence, both layers managed transparently</li>
  <li><strong>Async background computation</strong> — expensive functions compute in background workers while stale data is served</li>
  <li><strong>Partial cache invalidation</strong> — purge only the entries matching specific arguments, not the entire function’s cache</li>
  <li><strong>Distributed invalidation</strong> — propagate deletions across hundreds of workers without pub/sub or external message brokers</li>
  <li><strong>Atomic writes</strong> — avoid double-writing cache values with a choice between optimistic (<code class="language-plaintext highlighter-rouge">at_least_once</code>) and pessimistic (<code class="language-plaintext highlighter-rouge">at_most_once</code>) strategies; useful when the cached computation has side effects</li>
  <li><strong>Periodic refresh</strong> — keep hot caches warm automatically via scheduled recomputation</li>
  <li><strong>Cache warming on startup</strong> — pre-populate critical entries before the first web request hits</li>
  <li><strong>Object-lifetime expiration</strong> — cache tied to an object’s lifecycle, automatically cleaned up on garbage collection</li>
  <li><strong>Request-scoped caching</strong> — deduplicate expensive calls within a single HTTP request, auto-cleanup on response</li>
  <li><strong>Conditional caching</strong> — skip the cache based on runtime conditions (feature flags, user state, HTTP method)</li>
  <li><strong>Full observability</strong> — Datadog metrics per function, an admin API to browse and purge keys, and an internal tool to explore cache state in production</li>
</ul>

<p>Today, Alan Cache is very robust and hasn’t changed significantly in years. It has <strong>258 usages</strong> across the codebase.</p>

<p>This article presents Alan Cache’s features from simplest to most complex, along with their use cases and technical solutions. The goal is to inspire you to build your own variation that meets your caching needs.</p>

<h1 id="2-the-simplest-case--cached_for">2. The Simplest Case — <code class="language-plaintext highlighter-rouge">@cached_for</code></h1>

<p>One decorator. One line. It works.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_product_catalog</span><span class="p">(</span><span class="n">country</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">fetch_from_database</span><span class="p">(</span><span class="n">country</span><span class="p">)</span>
</code></pre></div></div>

<p>The first call runs the function, computes the value, and stores the result in both local RAM and Redis. Subsequent calls return the cached value — from RAM if available (sub-millisecond), from Redis otherwise (1–5ms). After one hour, the entry expires and the next call recomputes it.</p>

<p>That’s it. No configuration, no setup, no boilerplate.</p>

<p><code class="language-plaintext highlighter-rouge">@cached_for</code> is syntactic sugar for <code class="language-plaintext highlighter-rouge">@cached(expire_in=timedelta(hours=1))</code>. It accepts <code class="language-plaintext highlighter-rouge">weeks</code>, <code class="language-plaintext highlighter-rouge">days</code>, <code class="language-plaintext highlighter-rouge">hours</code>, <code class="language-plaintext highlighter-rouge">minutes</code>, <code class="language-plaintext highlighter-rouge">seconds</code> — any combination. Under the hood, <code class="language-plaintext highlighter-rouge">@cached</code> is the real engine (~1500 lines, 20+ parameters), but you rarely need to touch it directly.</p>
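
<p>For illustration, here is a minimal sketch of how that sugar could be written: the duration keywords collapse into a single <code class="language-plaintext highlighter-rouge">timedelta</code> passed through as <code class="language-plaintext highlighter-rouge">expire_in</code>. The body below is an assumption, not Alan Cache’s actual source.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import timedelta

def cached_for(weeks=0, days=0, hours=0, minutes=0, seconds=0, **cached_kwargs):
    # Collapse the human-friendly duration keywords into one timedelta,
    # then hand everything else through to the real engine untouched.
    expire_in = timedelta(weeks=weeks, days=days, hours=hours,
                          minutes=minutes, seconds=seconds)
    return cached(expire_in=expire_in, **cached_kwargs)
</code></pre></div></div>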

<p>This simple form covers <strong>173 out of 258</strong> caching usages in the codebase — roughly two-thirds of them.</p>

<h1 id="3-under-the-hood--the-two-layer-architecture">3. Under the Hood — The Two-Layer Architecture</h1>

<p>When you write <code class="language-plaintext highlighter-rouge">@cached_for(hours=1)</code>, here’s what actually happens:</p>

<pre><code class="language-mermaid">flowchart TB
    F["🔧 Your Function&lt;br/&gt;@cached_for(hours=1)"]
    L1["⚡ Layer 1: RAM&lt;br/&gt;SimpleCache — per-process&lt;br/&gt;&lt; 1ms"]
    L2["🗄️ Layer 2: Redis&lt;br/&gt;shared — cross-process — persistent&lt;br/&gt;1–5ms"]

    F --&gt; L1
    L1 -- miss --&gt; L2
</code></pre>

<p><code class="language-plaintext highlighter-rouge">AlanCache</code> manages <strong>four internal cache backends</strong>:</p>

<table>
  <thead>
    <tr>
      <th>Backend</th>
      <th>Type</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">shared_cache</code></td>
      <td>Redis</td>
      <td>Primary shared storage, swallows deserialization errors</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">shared_cache_atomic</code></td>
      <td>Redis</td>
      <td>Atomic writes via WATCH/MULTI/EXEC</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">local_cache</code></td>
      <td>SimpleCache</td>
      <td>Fast local RAM with serialization</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">local_cache_no_serializer</code></td>
      <td>SimpleCache</td>
      <td>Local RAM storing objects as-is (ORM models, etc.)</td>
    </tr>
  </tbody>
</table>

<p>Why four and not two? I couldn’t cleanly make a single Redis backend support both atomic and non-atomic writes, and I needed a serialization-free local cache for objects that don’t pickle well.</p>

<p>The lookup order on <code class="language-plaintext highlighter-rouge">get</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">key</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
    <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">local_cache</span><span class="p">.</span><span class="n">has</span><span class="p">(</span><span class="n">key</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">local_cache</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
    <span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">local_cache_no_serializer</span><span class="p">.</span><span class="n">has</span><span class="p">(</span><span class="n">key</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">local_cache_no_serializer</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
    <span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">shared_cache</span><span class="p">.</span><span class="n">has</span><span class="p">(</span><span class="n">key</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">shared_cache</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">shared_cache_atomic</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">AlanCache</code> singleton is instantiated at module level:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alan_cache</span> <span class="o">=</span> <span class="n">AlanCache</span><span class="p">()</span>
</code></pre></div></div>

<p>It initializes from environment variables if available, or falls back to defaults (SimpleCache locally, NullCache for Redis). This means the library works in tests without any Redis connection — it gracefully degrades.</p>

<p><strong>Beyond Flask-Caching.</strong> I reimplemented the parts of Flask-Caching I needed — the backend factory (dispatching to Redis/SimpleCache/Null based on config) and the core decorator machinery — without the Flask app context dependency. <code class="language-plaintext highlighter-rouge">Cache.init_from_config()</code> takes a plain dict, not a Flask app object. This lets the cache work from RQ workers, CLI scripts, and anywhere else outside of a Flask request context.</p>
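
<p>To make the factory idea concrete, here is a sketch of config-driven dispatch built on <code class="language-plaintext highlighter-rouge">cachelib</code> backends, the same family Flask-Caching uses. The dict keys and the helper name are illustrative, not Alan Cache’s actual API.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os

from cachelib import NullCache, RedisCache, SimpleCache

def backend_from_config(config):
    # Dispatch on a plain dict: no Flask app object required, so the same
    # factory works from web workers, RQ workers, and CLI scripts alike.
    cache_type = config.get("CACHE_TYPE", "null")
    if cache_type == "redis":
        return RedisCache(host=config["CACHE_REDIS_HOST"],
                          port=config.get("CACHE_REDIS_PORT", 6379))
    if cache_type == "simple":
        return SimpleCache()
    return NullCache()  # graceful degradation, e.g. in tests without Redis

# Initialize from environment variables when present, defaults otherwise.
shared_backend = backend_from_config(
    {"CACHE_TYPE": "redis", "CACHE_REDIS_HOST": os.environ["REDIS_HOST"]}
    if "REDIS_HOST" in os.environ
    else {"CACHE_TYPE": "null"}
)
</code></pre></div></div>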

<h1 id="4-manual-control--the-getsetdelete-api">4. Manual Control — The <code class="language-plaintext highlighter-rouge">get</code>/<code class="language-plaintext highlighter-rouge">set</code>/<code class="language-plaintext highlighter-rouge">delete</code> API</h1>

<p>Not everything is a decorator. Sometimes you compute a value in one place and need to cache it for use elsewhere. Or you need to cache something that isn’t a function return value. For those cases, <code class="language-plaintext highlighter-rouge">alan_cache</code> exposes a direct API:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">shared.caching.cache</span> <span class="kn">import</span> <span class="n">alan_cache</span>

<span class="c1"># Store a value in both RAM and Redis for 1 hour
</span><span class="n">alan_cache</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"user:123:preferences"</span><span class="p">,</span> <span class="n">preferences</span><span class="p">,</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>

<span class="c1"># Retrieve it (checks RAM first, then Redis)
</span><span class="n">prefs</span> <span class="o">=</span> <span class="n">alan_cache</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"user:123:preferences"</span><span class="p">)</span>

<span class="c1"># Delete from all layers
</span><span class="n">alan_cache</span><span class="p">.</span><span class="n">delete</span><span class="p">(</span><span class="s">"user:123:preferences"</span><span class="p">)</span>

<span class="c1"># Bulk operations
</span><span class="n">alan_cache</span><span class="p">.</span><span class="n">delete_many</span><span class="p">(</span><span class="s">"key1"</span><span class="p">,</span> <span class="s">"key2"</span><span class="p">,</span> <span class="s">"key3"</span><span class="p">)</span>
<span class="n">foo</span><span class="p">,</span> <span class="n">bar</span> <span class="o">=</span> <span class="n">alan_cache</span><span class="p">.</span><span class="n">get_many</span><span class="p">(</span><span class="s">"foo"</span><span class="p">,</span> <span class="s">"bar"</span><span class="p">)</span>
</code></pre></div></div>

<p>The manual API writes to both layers and reads in the same priority order as the decorator: local RAM → local RAM (no serializer) → shared Redis → shared Redis (atomic).</p>

<h1 id="5-choosing-where-to-cache">5. Choosing Where to Cache</h1>

<p>By default, <code class="language-plaintext highlighter-rouge">@cached_for</code> stores values in <strong>both</strong> layers — local RAM and shared Redis. But sometimes you want control over which layer is used.</p>

<h2 id="ram-only--no-redis">RAM Only — No Redis</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">local_ram_cache_only</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_orm_objects</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">User</span><span class="p">]:</span>
    <span class="k">return</span> <span class="n">User</span><span class="p">.</span><span class="n">query</span><span class="p">.</span><span class="nb">all</span><span class="p">()</span>
</code></pre></div></div>

<p>No Redis round-trip, no serialization: the values stay as plain Python objects in the process’s memory. Perfect for ORM models and other objects that don’t pickle well. The downside: each process has its own copy, and there’s no cross-process sharing.</p>

<p>For class methods, there’s an even simpler shortcut:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DateHelper</span><span class="p">:</span>
    <span class="o">@</span><span class="n">memory_only_cache</span>
    <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">date_string</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">date</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">expensive_parse</span><span class="p">(</span><span class="n">date_string</span><span class="p">)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">@memory_only_cache</code> is a class descriptor — it implements <code class="language-plaintext highlighter-rouge">__get__</code> so it works as an instance method decorator. No Redis, no serialization, no expiration. Permanent in-process cache. 36 usages across the codebase.</p>
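
<p>To make the descriptor mechanics concrete, here is a sketch of how such a decorator could be built. It is not Alan Cache’s implementation, and keying per instance on <code class="language-plaintext highlighter-rouge">id(obj)</code> is a simplification:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import functools

class memory_only_cache:
    """Cache a method's results forever in plain process memory."""

    def __init__(self, func):
        self.func = func
        self.cache = {}  # lives as long as the process does

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self.func  # accessed on the class, not an instance

        @functools.wraps(self.func)
        def bound(*args, **kwargs):
            key = (id(obj), args, tuple(sorted(kwargs.items())))
            if key not in self.cache:
                self.cache[key] = self.func(obj, *args, **kwargs)
            return self.cache[key]

        return bound
</code></pre></div></div>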

<h2 id="skip-serialization">Skip Serialization</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">local_ram_cache_only</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">no_serialization</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_heavy_object</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="n">SomeComplexObject</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">build_complex_object</span><span class="p">()</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">no_serialization=True</code> stores the Python object as-is in RAM — no pickle roundtrip. Use this with <code class="language-plaintext highlighter-rouge">local_ram_cache_only=True</code> for objects that are expensive to serialize.</p>

<h2 id="redis-only--no-local-ram">Redis Only — No Local RAM</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shared_redis_cache_only</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_volatile_data</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">fetch_frequently_changing_data</span><span class="p">()</span>
</code></pre></div></div>

<p>Skip the local RAM layer. Useful when data changes often and you don’t want stale local copies. Every read goes to Redis.</p>

<h2 id="thread-local-storage">Thread-Local Storage</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">GoogleCalendarService</span><span class="p">:</span>
    <span class="o">@</span><span class="n">thread_local_class_cache</span><span class="p">(</span><span class="s">"calendar_client"</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">get_client</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CalendarClient</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">build_calendar_client</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">credentials</span><span class="p">)</span>
</code></pre></div></div>

<p>For objects that shouldn’t be shared across threads — like API clients for external services. Each thread gets its own cached instance. Used for 7 integrations with external services.</p>
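
<p>A sketch of the idea with <code class="language-plaintext highlighter-rouge">threading.local</code>: each thread builds and reuses its own client. The real decorator presumably handles arguments and instance state more carefully; this version ignores them for brevity.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import threading

def thread_local_class_cache(attr_name):
    local = threading.local()  # one attribute namespace per thread

    def decorator(func):
        def wrapper(self, *args, **kwargs):
            # Each thread sees its own `local`, so clients built here are
            # never shared across threads.
            if not hasattr(local, attr_name):
                setattr(local, attr_name, func(self, *args, **kwargs))
            return getattr(local, attr_name)
        return wrapper
    return decorator
</code></pre></div></div>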

<h1 id="6-scoping-cache-to-a-request">6. Scoping Cache to a Request</h1>

<p>Some computations are expensive but only relevant within a single HTTP request — like computing user permissions. You don’t want to hit the database 5 times in one request, but you also don’t want to cache permissions across requests (they might change).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">request_cached</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get_user_permissions</span><span class="p">(</span><span class="n">user_id</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
    <span class="k">return</span> <span class="n">compute_permissions</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">@request_cached</code> is RAM only, with a 30-second max TTL. The cache key includes the request’s object ID and a UUID, so there’s no cross-request leakage. When the request ends, a <code class="language-plaintext highlighter-rouge">teardown_request</code> callback deletes all cached keys automatically.</p>

<p>By default, it only caches on <code class="language-plaintext highlighter-rouge">GET</code> requests. You can change that:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">request_cached</span><span class="p">(</span><span class="n">for_http_methods</span><span class="o">=</span><span class="p">{</span><span class="s">"GET"</span><span class="p">,</span> <span class="s">"POST"</span><span class="p">})</span>
<span class="k">def</span> <span class="nf">get_feature_flags</span><span class="p">(</span><span class="n">user_id</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">compute_feature_flags</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>
</code></pre></div></div>

<p>Need to temporarily bypass the cache? Use the context manager:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">without_request_cached_for</span><span class="p">(</span><span class="n">get_user_permissions</span><span class="p">):</span>
    <span class="c1"># This call will skip the cache and recompute
</span>    <span class="n">fresh_permissions</span> <span class="o">=</span> <span class="n">get_user_permissions</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>
</code></pre></div></div>

<p>6 usages in production — permissions, feature flags, and similar per-request computations.</p>

<h2 id="how-it-works-internally">How It Works Internally</h2>

<p><code class="language-plaintext highlighter-rouge">@request_cached</code> is built on top of <code class="language-plaintext highlighter-rouge">@cached</code> with a carefully constructed set of parameters. The magic is in how it isolates cache entries per request and cleans them up automatically.</p>

<p><strong>Cache key isolation.</strong> The key prefix is a combination of the request’s Python object ID and a UUID generated once per request:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cache_key_prefix</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">has_request_context</span><span class="p">():</span>
        <span class="n">request_id</span> <span class="o">=</span> <span class="nb">id</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
        <span class="n">request_uuid</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="s">"caching_uuid"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">request_uuid</span><span class="p">:</span>
            <span class="n">request_uuid</span> <span class="o">=</span> <span class="n">uuid</span><span class="p">.</span><span class="n">uuid4</span><span class="p">()</span>
            <span class="n">request</span><span class="p">.</span><span class="n">caching_uuid</span> <span class="o">=</span> <span class="n">request_uuid</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">request_id</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">request_uuid</span><span class="si">}</span><span class="s">"</span>
    <span class="k">return</span> <span class="s">""</span>
</code></pre></div></div>

<p>Why both? <code class="language-plaintext highlighter-rouge">id(request)</code> alone would be enough within a single request — but Python can reuse memory addresses, so a previous request’s cached values could leak into a new request that happens to reuse the same memory address. The UUID makes each request’s namespace globally unique.</p>
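
<p>The address-reuse hazard is easy to reproduce in CPython:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = object()
stale_id = id(a)
del a                       # the object is freed...
b = object()
print(id(b) == stale_id)    # ...and CPython often reuses the address: True
</code></pre></div></div>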

<p><strong>Automatic cleanup.</strong> When the decorator is first used, it registers a Flask <code class="language-plaintext highlighter-rouge">teardown_request</code> callback (once per app). This callback fires after every request and deletes all cached keys:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">current_app</span><span class="p">.</span><span class="n">teardown_request</span>
<span class="k">def</span> <span class="nf">destroy_request_cached_entries</span><span class="p">(</span><span class="n">_response_or_exc</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">cache_keys</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="s">"cache_keys"</span><span class="p">,</span> <span class="nb">set</span><span class="p">())</span>
        <span class="n">alan_cache</span><span class="p">.</span><span class="n">delete_many</span><span class="p">(</span><span class="o">*</span><span class="n">cache_keys</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
        <span class="k">pass</span>  <span class="c1"># teardown callbacks must never raise
</span></code></pre></div></div>

<p>How does it know which keys to delete? Every time a value is cached, an <code class="language-plaintext highlighter-rouge">on_cache_computed</code> callback appends the cache key to <code class="language-plaintext highlighter-rouge">request.cache_keys</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">on_cache_computed</span><span class="p">(</span><span class="n">cache_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="n">Any</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">has_request_context</span><span class="p">():</span>
        <span class="k">if</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="s">"cache_keys"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">request</span><span class="p">.</span><span class="n">cache_keys</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
        <span class="n">request</span><span class="p">.</span><span class="n">cache_keys</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">cache_key</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">value</span>
</code></pre></div></div>

<p><strong>HTTP method filtering.</strong> The method check is implemented as an <code class="language-plaintext highlighter-rouge">unless</code> callback passed to the underlying <code class="language-plaintext highlighter-rouge">@cached</code> decorator:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_request_is_disabled_or_not_the_right_http_method</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">unless</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="n">unless</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="nb">bool</span><span class="p">(</span>
        <span class="n">_cache_killswitch</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>
        <span class="ow">or</span> <span class="p">(</span><span class="ow">not</span> <span class="n">request</span><span class="p">)</span>
        <span class="ow">or</span> <span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">method</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">http_methods</span><span class="p">)</span>
    <span class="p">)</span>
</code></pre></div></div>

<p>When the HTTP method doesn’t match, <code class="language-plaintext highlighter-rouge">unless</code> returns <code class="language-plaintext highlighter-rouge">True</code>, which means the cache is bypassed entirely — the function runs directly.</p>

<p><strong>The killswitch.</strong> <code class="language-plaintext highlighter-rouge">without_request_cached_for</code> uses a <code class="language-plaintext highlighter-rouge">ContextVar</code> — a thread-safe, async-safe variable scoped to the current execution context:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_cache_killswitch</span><span class="p">:</span> <span class="n">ContextVar</span><span class="p">[</span><span class="nb">bool</span><span class="p">]</span> <span class="o">=</span> <span class="n">ContextVar</span><span class="p">(</span><span class="sa">f</span><span class="s">"_cache_killswitch_</span><span class="si">{</span><span class="n">func</span><span class="p">.</span><span class="n">__qualname__</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="o">@</span><span class="n">contextmanager</span>
<span class="k">def</span> <span class="nf">without_request_cached_for</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
    <span class="n">func_killswitch</span> <span class="o">=</span> <span class="n">func</span><span class="p">.</span><span class="n">request_cached_killswitch</span>
    <span class="n">token</span> <span class="o">=</span> <span class="n">func_killswitch</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">yield</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">func_killswitch</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">token</code> mechanism supports nesting — if you nest two <code class="language-plaintext highlighter-rouge">without_request_cached_for</code> blocks, each <code class="language-plaintext highlighter-rouge">reset</code> restores the previous state correctly.</p>
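
<p>In practice, nesting composes as you’d expect (assuming <code class="language-plaintext highlighter-rouge">user_id</code> is in scope):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>with without_request_cached_for(get_user_permissions):      # killswitch set to True
    with without_request_cached_for(get_user_permissions):  # still True
        get_user_permissions(user_id)  # bypasses the cache
    # the inner reset() restores the outer True, not the default False
    get_user_permissions(user_id)      # still bypasses the cache
get_user_permissions(user_id)          # cached again
</code></pre></div></div>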

<p><strong>The underlying call.</strong> Putting it all together, <code class="language-plaintext highlighter-rouge">@request_cached</code> delegates to <code class="language-plaintext highlighter-rouge">@cached</code> with these hardcoded parameters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cached</span><span class="p">(</span>
    <span class="n">expire_in</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">seconds</span><span class="o">=</span><span class="mi">30</span><span class="p">),</span>       <span class="c1"># safety net TTL
</span>    <span class="n">local_ram_cache_only</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>             <span class="c1"># no Redis round-trip
</span>    <span class="n">cache_key_with_func_args</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>         <span class="c1"># include arguments in key
</span>    <span class="n">cache_none_values</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>                <span class="c1"># None is a valid cached result
</span>    <span class="n">unless</span><span class="o">=</span><span class="n">_request_is_disabled_or_not_the_right_http_method</span><span class="p">,</span>
    <span class="n">cache_key_prefix</span><span class="o">=</span><span class="n">cache_key_prefix</span><span class="p">,</span>     <span class="c1"># request-scoped prefix
</span>    <span class="n">on_cache_computed</span><span class="o">=</span><span class="n">on_cache_computed</span><span class="p">,</span>    <span class="c1"># track keys for cleanup
</span><span class="p">)</span>
</code></pre></div></div>

<p>The 30-second TTL is a safety net, not the primary cleanup mechanism — <code class="language-plaintext highlighter-rouge">teardown_request</code> handles that. But if something goes wrong and the teardown doesn’t fire, values still expire quickly.</p>

<h1 id="7-conditional-caching">7. Conditional Caching</h1>

<p>Sometimes you want to cache <em>most</em> calls but skip the cache for specific cases — guest users, admin debugging, certain feature flags.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span>
    <span class="n">minutes</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
    <span class="n">unless</span><span class="o">=</span><span class="k">lambda</span> <span class="n">func</span><span class="p">,</span> <span class="n">user_id</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">:</span> <span class="n">user_id</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">get_user_preferences</span><span class="p">(</span><span class="n">user_id</span><span class="p">:</span> <span class="nb">int</span> <span class="o">|</span> <span class="bp">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">fetch_preferences</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span> <span class="k">if</span> <span class="n">user_id</span> <span class="k">else</span> <span class="n">get_defaults</span><span class="p">()</span>
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">unless</code> returns <code class="language-plaintext highlighter-rouge">True</code>, the cache is bypassed entirely — no read, no write. The <code class="language-plaintext highlighter-rouge">unless</code> callback receives the decorated function and all its arguments, so you can make decisions based on any input.</p>

<p>For simpler cases, <code class="language-plaintext highlighter-rouge">unless</code> can also be a no-arg callable:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">unless</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">is_admin_mode</span><span class="p">())</span>
<span class="k">def</span> <span class="nf">get_dashboard_data</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">compute_dashboard</span><span class="p">()</span>
</code></pre></div></div>

<p><strong>Caching <code class="language-plaintext highlighter-rouge">None</code> values.</strong> By default, <code class="language-plaintext highlighter-rouge">None</code> return values are not cached — the assumption is that <code class="language-plaintext highlighter-rouge">None</code> means “no result, try again.” If <code class="language-plaintext highlighter-rouge">None</code> is a valid result you want to cache, set <code class="language-plaintext highlighter-rouge">cache_none_values=True</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">cache_none_values</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">find_user</span><span class="p">(</span><span class="n">email</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">User</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">User</span><span class="p">.</span><span class="n">query</span><span class="p">.</span><span class="n">filter_by</span><span class="p">(</span><span class="n">email</span><span class="o">=</span><span class="n">email</span><span class="p">).</span><span class="n">first</span><span class="p">()</span>
</code></pre></div></div>

<h2 id="how-it-works-internally-1">How It Works Internally</h2>

<p><strong>The <code class="language-plaintext highlighter-rouge">unless</code> bypass.</strong> The <code class="language-plaintext highlighter-rouge">unless</code> check happens at the <strong>outermost layer</strong> of the decorator chain — before any cache lookup or write. When <code class="language-plaintext highlighter-rouge">unless</code> returns <code class="language-plaintext highlighter-rouge">True</code>, the original function is called directly, with zero cache interaction:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_wrap_with_disable_cache_and_register</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">_bypass_cache</span><span class="p">(</span><span class="n">unless</span><span class="p">,</span> <span class="n">func</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="n">kwargs</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="s">"_force_cache_update"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>  <span class="c1"># straight to the original function
</span>    <span class="k">return</span> <span class="n">func6</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>     <span class="c1"># through all caching layers
</span></code></pre></div></div>

<p>This is a <strong>complete bypass</strong> — no cache read, no cache write, no metrics, no key tracking. It’s as if the decorator wasn’t there.</p>

<p><strong>Two-signature detection.</strong> How does <code class="language-plaintext highlighter-rouge">unless</code> support both <code class="language-plaintext highlighter-rouge">lambda: is_admin_mode()</code> and <code class="language-plaintext highlighter-rouge">lambda func, user_id, *args, **kwargs: ...</code>? It inspects the callable’s signature at call time:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_wants_args</span><span class="p">(</span><span class="n">f</span><span class="p">):</span>
    <span class="n">spec</span> <span class="o">=</span> <span class="n">inspect</span><span class="p">.</span><span class="n">getfullargspec</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
    <span class="k">return</span> <span class="nb">any</span><span class="p">((</span><span class="n">spec</span><span class="p">.</span><span class="n">args</span><span class="p">,</span> <span class="n">spec</span><span class="p">.</span><span class="n">varargs</span><span class="p">,</span> <span class="n">spec</span><span class="p">.</span><span class="n">varkw</span><span class="p">,</span> <span class="n">spec</span><span class="p">.</span><span class="n">kwonlyargs</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">_bypass_cache</span><span class="p">(</span><span class="n">unless</span><span class="p">,</span> <span class="n">func</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">alan_cache</span><span class="p">.</span><span class="n">disable_cache</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">True</span>
    <span class="k">if</span> <span class="nb">callable</span><span class="p">(</span><span class="n">unless</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">_wants_args</span><span class="p">(</span><span class="n">unless</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">unless</span><span class="p">(</span><span class="n">func</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">True</span><span class="p">:</span>
                <span class="k">return</span> <span class="bp">True</span>
        <span class="k">elif</span> <span class="n">unless</span><span class="p">()</span> <span class="ow">is</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="bp">False</span>
</code></pre></div></div>

<p>If the callable accepts any parameters at all (positional, <code class="language-plaintext highlighter-rouge">*args</code>, <code class="language-plaintext highlighter-rouge">**kwargs</code>, keyword-only), it’s called with the decorated function and all its arguments. Otherwise, it’s called with no arguments. The check uses <code class="language-plaintext highlighter-rouge">is True</code> — not truthiness — so <code class="language-plaintext highlighter-rouge">unless</code> must explicitly return <code class="language-plaintext highlighter-rouge">True</code> to trigger a bypass.</p>

<p><strong>The <code class="language-plaintext highlighter-rouge">None</code> caching problem.</strong> When a cache backend’s <code class="language-plaintext highlighter-rouge">get()</code> returns <code class="language-plaintext highlighter-rouge">None</code>, it’s ambiguous: does the key not exist, or was <code class="language-plaintext highlighter-rouge">None</code> the cached value? The behavior depends on <code class="language-plaintext highlighter-rouge">cache_none_values</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Inside the Flask cache layer's get logic:
</span><span class="n">rv</span> <span class="o">=</span> <span class="n">cache</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">cache_key</span><span class="p">)</span>
<span class="k">if</span> <span class="n">rv</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">cache_none</span><span class="p">:</span>
        <span class="n">found</span> <span class="o">=</span> <span class="bp">False</span>        <span class="c1"># assume cache miss, don't even check
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="n">found</span> <span class="o">=</span> <span class="n">cache</span><span class="p">.</span><span class="n">has</span><span class="p">(</span><span class="n">cache_key</span><span class="p">)</span>  <span class="c1"># actually check if key exists
</span></code></pre></div></div>

<p>With the default <code class="language-plaintext highlighter-rouge">cache_none_values=False</code>: a <code class="language-plaintext highlighter-rouge">None</code> return is always treated as a cache miss. The function runs again, and if it returns <code class="language-plaintext highlighter-rouge">None</code> again, that <code class="language-plaintext highlighter-rouge">None</code> is <strong>not stored</strong> — the function will run on every call. This is the right default for functions like “find user by email” where <code class="language-plaintext highlighter-rouge">None</code> means “not found, might exist later.”</p>

<p>With <code class="language-plaintext highlighter-rouge">cache_none_values=True</code>: an extra <code class="language-plaintext highlighter-rouge">has()</code> call distinguishes “key doesn’t exist” from “key exists and its value is <code class="language-plaintext highlighter-rouge">None</code>.” This costs one additional Redis round-trip, but it’s necessary when <code class="language-plaintext highlighter-rouge">None</code> is a meaningful result you want to cache — like “this feature flag doesn’t exist, stop querying for it.”</p>

<h1 id="8-cache-keys--how-they-work-and-how-to-control-them">8. Cache Keys — How They Work and How to Control Them</h1>

<p>Every cached function gets a deterministic key. Understanding the structure helps when debugging and when you need partial invalidation (next section).</p>

<h2 id="default-key-structure">Default Key Structure</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{funcname}-{hash(args)}-{hash(kwargs)}-{hash(request_path)}-{hash(query_string)}
</code></pre></div></div>

<p>The function’s fully qualified name (<code class="language-plaintext highlighter-rouge">module.qualname</code>) is always prepended. Each part is MD5-hashed (inherited from Flask-Caching):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_encode</span><span class="p">(</span><span class="n">t</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="n">md5</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">t</span><span class="p">).</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">())</span>
</code></pre></div></div>
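
<p>Putting the pieces together, a hypothetical key builder matching the documented shape could look like this; it is illustrative, not the library’s actual function:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from hashlib import md5

def _encode(part):
    return md5(str(part).encode()).hexdigest()

def make_cache_key(func, args, kwargs, request_path="", query_string=""):
    # {funcname}-{hash(args)}-{hash(kwargs)}-{hash(request_path)}-{hash(query_string)}
    funcname = f"{func.__module__}.{func.__qualname__}"
    return "-".join([
        funcname,
        _encode(args),
        _encode(sorted(kwargs.items())),
        _encode(request_path),
        _encode(query_string),
    ])
</code></pre></div></div>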

<h2 id="controlling-what-goes-into-the-key">Controlling What Goes Into the Key</h2>

<p><strong><code class="language-plaintext highlighter-rouge">cache_key_prefix</code></strong> — add a static or dynamic prefix:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Static prefix
</span><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">cache_key_prefix</span><span class="o">=</span><span class="s">"v2"</span><span class="p">)</span>

<span class="c1"># Dynamic prefix based on context
</span><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">cache_key_prefix</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">get_current_tenant_id</span><span class="p">())</span>
</code></pre></div></div>

<p><strong><code class="language-plaintext highlighter-rouge">cache_key_with_request_path</code></strong> and <strong><code class="language-plaintext highlighter-rouge">cache_key_with_query_string</code></strong> — include HTTP context in the key. Useful for caching entire page responses where the same function serves different URLs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">cache_key_with_request_path</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">cache_key_with_query_string</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">render_page</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">expensive_template_rendering</span><span class="p">()</span>
</code></pre></div></div>

<p><strong><code class="language-plaintext highlighter-rouge">args_to_ignore</code></strong>, <strong><code class="language-plaintext highlighter-rouge">ignore_self</code></strong>, <strong><code class="language-plaintext highlighter-rouge">ignore_cls</code></strong> — exclude specific arguments:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">ignore_self</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="c1"># 'self' is excluded from the key, so all instances share the cache
</span>    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">db</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</code></pre></div></div>

<p><strong><code class="language-plaintext highlighter-rouge">cache_key_with_func_args=False</code></strong> — ignore all arguments entirely. Every call returns the same cached value regardless of inputs. Used with <code class="language-plaintext highlighter-rouge">warmup_on_startup</code> and <code class="language-plaintext highlighter-rouge">async_refresh_every</code> (covered later).</p>
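
<p>A hypothetical usage sketch (<code class="language-plaintext highlighter-rouge">get_global_settings</code> and <code class="language-plaintext highlighter-rouge">fetch_settings</code> are invented for illustration):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@cached_for(hours=6, cache_key_with_func_args=False)
def get_global_settings(locale="en"):
    # `locale` plays no part in the cache key: every caller, whatever the
    # arguments, reads and writes the same single entry.
    return fetch_settings(locale)
</code></pre></div></div>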

<h1 id="9-partial-invalidation--purge-surgically">9. Partial Invalidation — Purge Surgically</h1>

<p>This is where cache key design pays off.</p>

<p>Consider a function that caches product definitions by three parameters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span>
    <span class="n">hours</span><span class="o">=</span><span class="mi">24</span><span class="p">,</span>
    <span class="n">cache_key_with_full_args</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">get_product_definition</span><span class="p">(</span><span class="n">product_type</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">country</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">version</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">fetch_product_from_database</span><span class="p">(</span><span class="n">product_type</span><span class="p">,</span> <span class="n">country</span><span class="p">,</span> <span class="n">version</span><span class="p">)</span>
</code></pre></div></div>

<p>The crucial difference is <code class="language-plaintext highlighter-rouge">cache_key_with_full_args=True</code>. Instead of hashing all arguments together into a single component, the key gives <strong>each argument its own hash slot</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Default (cache_key_with_full_args=False):
get_product_definition-{hash((product_type, country, version))}

# With full_args:
get_product_definition-{hash(product_type)}-{hash(country)}-{hash(version)}
</code></pre></div></div>

<p>Now, when the “health” product type changes, you can purge just those entries:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alan_cache</span><span class="p">.</span><span class="n">clear_cached_func_some</span><span class="p">(</span>
    <span class="n">get_product_definition</span><span class="p">,</span>
    <span class="n">product_type</span><span class="o">=</span><span class="s">"health"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="how-it-works-internally-2">How It Works Internally</h2>

<ol>
  <li><strong>Introspects the function signature</strong> to figure out which arguments were provided and which were omitted</li>
  <li><strong>Builds a glob pattern</strong> replacing omitted arguments with <code class="language-plaintext highlighter-rouge">*</code>:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>get_product_definition-{hash("health")}-*-*
</code></pre></div>    </div>
  </li>
  <li><strong>Delegates to async deletion</strong> — an RQ job scans the function’s <code class="language-plaintext highlighter-rouge">cached_func_keys_{funcname}</code> Redis SET using <code class="language-plaintext highlighter-rouge">SSCAN</code> with the glob pattern, deletes matching keys in batches of 1000, then broadcasts to all workers for local cache cleanup</li>
</ol>
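
<p>Steps 1 and 2 could look like the following sketch; the helper name and details are illustrative assumptions, not the library’s actual internals:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import inspect

def _build_partial_pattern(func, provided: dict) -&gt; str:
    # Walk the signature in declaration order: provided arguments contribute
    # their hash, omitted ones become a "*" wildcard for SSCAN.
    parts = [func.__name__]
    for name in inspect.signature(func).parameters:
        parts.append(str(hash(provided[name])) if name in provided else "*")
    return "-".join(parts)

# _build_partial_pattern(get_product_definition, {"product_type": "health"})
# returns "get_product_definition-&lt;hash of 'health'&gt;-*-*"
</code></pre></div></div>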

<p>About 10 functions use this in production. It’s marginal in volume but critical in impact — it gives product definitions, contract rules, and similar domain data a simple caching scheme with surgical invalidation when a single product or rule changes, instead of dozens of hand-managed cache keys.</p>

<p>For other invalidation needs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Delete one specific cached value (exact args match)
</span><span class="n">alan_cache</span><span class="p">.</span><span class="n">clear_cached_func</span><span class="p">(</span><span class="n">get_product_definition</span><span class="p">,</span> <span class="s">"health"</span><span class="p">,</span> <span class="s">"FR"</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>

<span class="c1"># Delete ALL cached values for a function
</span><span class="n">alan_cache</span><span class="p">.</span><span class="n">clear_cached_func_all</span><span class="p">(</span><span class="n">get_product_definition</span><span class="p">)</span>
</code></pre></div></div>

<h1 id="10-distributed-invalidation--the-hard-problem">10. Distributed Invalidation — The Hard Problem</h1>

<p>With up to 300 RQ workers and multiple web server processes, each running its own local <code class="language-plaintext highlighter-rouge">SimpleCache</code>, how do you propagate a cache deletion across all of them?</p>

<p>This is the hardest problem Alan Cache solves. The answer is a <strong>three-stage protocol</strong> that doesn’t require pub/sub, message brokers, or any external infrastructure beyond Redis.</p>

<h2 id="stage-1-local-immediate-delete">Stage 1: Local Immediate Delete</h2>

<p>First, delete matching keys in the current process using <code class="language-plaintext highlighter-rouge">re2</code> regex:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_delete_local_cache_keys_from_patterns</span><span class="p">(</span><span class="n">patterns_to_del</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
    <span class="n">patterns_to_del</span> <span class="o">=</span> <span class="p">[</span><span class="s">"^"</span> <span class="o">+</span> <span class="n">p</span> <span class="o">+</span> <span class="s">"$"</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">patterns_to_del</span><span class="p">]</span>
    <span class="n">regex</span> <span class="o">=</span> <span class="n">re2</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">"|"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">patterns_to_del</span><span class="p">))</span>
    <span class="n">deleted_keys</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">cache</span> <span class="ow">in</span> <span class="p">[</span><span class="n">alan_cache</span><span class="p">.</span><span class="n">local_cache</span><span class="p">.</span><span class="n">cache</span><span class="p">,</span> <span class="n">alan_cache</span><span class="p">.</span><span class="n">local_cache_no_serializer</span><span class="p">.</span><span class="n">cache</span><span class="p">]:</span>
        <span class="n">_cache</span> <span class="o">=</span> <span class="n">cache</span><span class="p">.</span><span class="n">_cache</span>
        <span class="n">to_del</span> <span class="o">=</span> <span class="p">[</span><span class="n">key</span> <span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">_cache</span> <span class="k">if</span> <span class="n">re2</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">regex</span><span class="p">,</span> <span class="n">key</span><span class="p">)]</span>
        <span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">to_del</span><span class="p">:</span>
            <span class="k">del</span> <span class="n">_cache</span><span class="p">[</span><span class="n">key</span><span class="p">]</span>
        <span class="n">deleted_keys</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">to_del</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">deleted_keys</span>
</code></pre></div></div>

<p>I use Google’s <strong>RE2</strong> library instead of Python’s <code class="language-plaintext highlighter-rouge">re</code> for two reasons: RE2 guarantees <strong>linear-time</strong> matching (no exponential blowup on pathological patterns), and it’s immune to <strong>ReDoS</strong> attacks from crafted patterns. Deletion patterns are assembled from function names and argument hashes rather than raw user input, but they are still dynamically built strings, and RE2’s linear-time guarantee means no pattern, however pathological, can stall a worker.</p>

<h2 id="stage-2-redis-async-delete">Stage 2: Redis Async Delete</h2>

<p>An RQ job (on the <code class="language-plaintext highlighter-rouge">CACHE_BUILDER_QUEUE</code>) scans the function’s key set and deletes matching Redis keys in batches of 1000:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">funcname</span><span class="p">,</span> <span class="n">filter_pattern</span> <span class="ow">in</span> <span class="n">funcnames_and_filters</span><span class="p">:</span>
    <span class="n">set_name</span> <span class="o">=</span> <span class="n">CACHED_FUNC_KEYS_SET_PREFIX</span> <span class="o">+</span> <span class="n">funcname</span>

    <span class="k">if</span> <span class="n">filter_pattern</span> <span class="o">==</span> <span class="s">"*"</span><span class="p">:</span>
        <span class="n">keys</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">redis</span><span class="p">.</span><span class="n">smembers</span><span class="p">(</span><span class="n">set_name</span><span class="p">))</span>
        <span class="k">if</span> <span class="n">keys</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">batch_keys</span> <span class="ow">in</span> <span class="n">group_iter</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
                <span class="n">redis</span><span class="p">.</span><span class="n">delete</span><span class="p">(</span><span class="o">*</span><span class="n">batch_keys</span><span class="p">)</span>
            <span class="n">redis</span><span class="p">.</span><span class="n">delete</span><span class="p">(</span><span class="n">set_name</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">keys_to_delete</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">key_bytes</span> <span class="ow">in</span> <span class="n">redis</span><span class="p">.</span><span class="n">sscan_iter</span><span class="p">(</span><span class="n">set_name</span><span class="p">,</span> <span class="n">match</span><span class="o">=</span><span class="p">...):</span>
            <span class="n">keys_to_delete</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">key_bytes</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">batch_keys</span> <span class="ow">in</span> <span class="n">group_iter</span><span class="p">(</span><span class="n">keys_to_delete</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
            <span class="n">redis</span><span class="p">.</span><span class="n">delete</span><span class="p">(</span><span class="o">*</span><span class="n">batch_keys</span><span class="p">)</span>
            <span class="n">redis</span><span class="p">.</span><span class="n">srem</span><span class="p">(</span><span class="n">set_name</span><span class="p">,</span> <span class="o">*</span><span class="n">batch_keys</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="stage-3-broadcast-via-zset">Stage 3: Broadcast via ZSET</h2>

<p>After Redis keys are deleted, the job needs to tell <strong>every other worker</strong> to clean up its local RAM cache. It does this by adding deletion patterns to a Redis Sorted Set, <code class="language-plaintext highlighter-rouge">CACHED_FUNCS_TO_DELETE</code>, scored by the Redis server’s epoch time (not the local clock — avoids clock drift issues across machines):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="n">_</span><span class="p">)</span> <span class="o">=</span> <span class="n">redis</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">patterns_for_dict</span> <span class="o">=</span> <span class="p">[</span>
    <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">funcname</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">filter_pattern</span><span class="si">}</span><span class="s">"</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"*"</span><span class="p">,</span> <span class="s">".*"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">funcname</span><span class="p">,</span> <span class="n">filter_pattern</span> <span class="ow">in</span> <span class="n">funcnames_and_filters</span>
<span class="p">]</span>
<span class="n">redis</span><span class="p">.</span><span class="n">zadd</span><span class="p">(</span><span class="n">CACHED_FUNCS_TO_DELETE</span><span class="p">,</span> <span class="nb">dict</span><span class="p">.</span><span class="n">fromkeys</span><span class="p">(</span><span class="n">patterns_for_dict</span><span class="p">,</span> <span class="n">epoch</span><span class="p">))</span>
<span class="n">redis</span><span class="p">.</span><span class="n">expire</span><span class="p">(</span><span class="n">CACHED_FUNCS_TO_DELETE</span><span class="p">,</span> <span class="mi">3600</span><span class="p">)</span>  <span class="c1"># 1h TTL
</span></code></pre></div></div>

<h2 id="worker-pickup-piggyback-on-cache-access">Worker Pickup: Piggyback on Cache Access</h2>

<p>Workers don’t poll a dedicated channel. Instead, every cached function call checks (at most every 5 minutes) whether new patterns have appeared in the ZSET:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DELETION_CHECK_FREQUENCY_SECS</span> <span class="o">=</span> <span class="mi">60</span> <span class="o">*</span> <span class="mi">5</span>

<span class="k">def</span> <span class="nf">_cleanup_local_cache_keys</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">global</span> <span class="n">_last_time_check_for_deletion</span>
    <span class="n">now</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">UTC</span><span class="p">)</span>
    <span class="n">epoch</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">_last_time_check_for_deletion</span><span class="p">.</span><span class="n">timestamp</span><span class="p">())</span>
    <span class="k">if</span> <span class="n">patterns_to_del_bytes</span> <span class="p">:</span><span class="o">=</span> <span class="n">alan_cache</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="n">zrangebyscore</span><span class="p">(</span>
        <span class="n">CACHED_FUNCS_TO_DELETE</span><span class="p">,</span> <span class="n">epoch</span><span class="p">,</span> <span class="s">"+inf"</span>
    <span class="p">):</span>
        <span class="n">patterns_to_del</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">)</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">patterns_to_del_bytes</span><span class="p">]</span>
        <span class="n">_delete_local_cache_keys_from_patterns</span><span class="p">(</span><span class="n">patterns_to_del</span><span class="p">)</span>
    <span class="n">_last_time_check_for_deletion</span> <span class="o">=</span> <span class="n">now</span>
</code></pre></div></div>

<p><strong>Consistency guarantee.</strong> The ZSET expires after 1 hour. Workers are recycled every 30 minutes. This means even if a worker doesn’t access the cache for a while, it will be replaced by a fresh one before the patterns expire — no worker ever misses an invalidation.</p>

<h2 id="function-registry">Function Registry</h2>

<p>Every cached function registers its fully qualified name in the <code class="language-plaintext highlighter-rouge">CACHED_FUNCS</code> Redis SET on first call:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">funcname</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">_registered_funcnames</span><span class="p">:</span>
    <span class="n">alan_cache</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="n">sadd</span><span class="p">(</span><span class="n">CACHED_FUNCS</span><span class="p">,</span> <span class="n">funcname</span><span class="p">)</span>
    <span class="n">_registered_funcnames</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">funcname</span><span class="p">)</span>
</code></pre></div></div>

<p>This makes all cached functions discoverable by admin tools. Each function’s keys are tracked in a dedicated SET (<code class="language-plaintext highlighter-rouge">cached_func_keys_{funcname}</code>), enabling efficient key counting, pattern-matching deletion, and space estimation.</p>
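
<p>A minimal sketch of the bookkeeping this implies on each cache write, assuming the set-name convention above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># On every cache set, the key is also recorded in the function's tracking SET,
# so it can later be counted, scanned with SSCAN, or pattern-deleted.
alan_cache.redis.sadd(f"cached_func_keys_{funcname}", cache_key)
</code></pre></div></div>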

<h1 id="11-atomic-writes--caching-functions-with-side-effects">11. Atomic Writes — Caching Functions with Side Effects</h1>

<h2 id="the-problem">The Problem</h2>

<p>Not all cached functions are pure. Some compute a value <strong>and</strong> produce a side effect — sending a Slack message, creating a channel, provisioning a resource, calling an external API that charges money.</p>

<p>Consider a cached function that sends a Slack notification as part of an automated task. Two workers race — both see an empty cache, both compute, both send the message. The user gets a duplicate notification. This was a real bug at Alan.</p>

<p>The issue isn’t the redundant computation. It’s the <strong>duplicate side effect</strong>. Whenever a cached function does something beyond returning a value, a race condition on cache miss becomes a correctness problem. You need a guarantee about how many times the function body actually executes.</p>

<p>Alan Cache solves this with two strategies, named after distributed systems concepts. Both ensure that concurrent cache misses don’t cause the function to run multiple times uncontrollably.</p>

<h2 id="at_least_once--optimistic-concurrency"><code class="language-plaintext highlighter-rouge">at_least_once</code> — Optimistic Concurrency</h2>

<p>The idea: let everyone compute, but only the first write to the cache wins.</p>

<p>This uses Redis’s <code class="language-plaintext highlighter-rouge">WATCH</code>/<code class="language-plaintext highlighter-rouge">MULTI</code>/<code class="language-plaintext highlighter-rouge">EXEC</code> transaction mechanism — the same primitive used for optimistic concurrency control in databases. Here’s how it works:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">WATCH</code> the cache key</li>
  <li>Compute the value (side effects may happen here)</li>
  <li>Start a <code class="language-plaintext highlighter-rouge">MULTI</code> transaction, <code class="language-plaintext highlighter-rouge">SET</code> the key, <code class="language-plaintext highlighter-rouge">EXEC</code></li>
  <li>If another process wrote the key between the <code class="language-plaintext highlighter-rouge">WATCH</code> and <code class="language-plaintext highlighter-rouge">EXEC</code>, Redis raises <code class="language-plaintext highlighter-rouge">WatchError</code></li>
  <li>On <code class="language-plaintext highlighter-rouge">WatchError</code>: retry — but now the key exists, so the cache hit returns the value immediately</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_retry_on_watch_exception</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="n">retval</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">retval</span> <span class="o">=</span> <span class="n">func3_cache_shared</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
            <span class="k">break</span>
        <span class="k">except</span> <span class="n">WatchError</span><span class="p">:</span>
            <span class="k">continue</span>
    <span class="k">return</span> <span class="n">retval</span>
</code></pre></div></div>
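
<p>The write itself follows the standard redis-py optimistic pattern. A minimal sketch, assuming a pickled value and a hypothetical <code class="language-plaintext highlighter-rouge">compute</code> callable; a <code class="language-plaintext highlighter-rouge">WatchError</code> propagates up to the retry loop above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pickle
from redis import WatchError  # raised by EXEC when a watched key has changed

def optimistic_cache_set(redis_client, key, compute, ttl_seconds):
    with redis_client.pipeline() as pipe:
        pipe.watch(key)                      # 1. WATCH the cache key
        value = compute()                    # 2. compute (side effects may happen)
        pipe.multi()                         # 3. start the transaction...
        pipe.set(key, pickle.dumps(value), ex=ttl_seconds)
        pipe.execute()                       # ...and EXEC: raises WatchError on conflict
    return value
</code></pre></div></div>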

<p>Multiple processes may compute the value (hence “at least once”), but only one write to the cache succeeds. The others discover the cached value on retry and don’t write again.</p>

<p><strong>When to use:</strong> functions where the side effect is <strong>idempotent</strong> or cheap enough that running it twice is acceptable — e.g. fetching data from an external API (you pay the latency twice, but no visible harm). The guarantee here is about cache consistency (no double-write), not about side-effect uniqueness.</p>

<h2 id="at_most_once--pessimistic-locking"><code class="language-plaintext highlighter-rouge">at_most_once</code> — Pessimistic Locking</h2>

<p>The idea: only one process runs the function, everyone else waits for the result.</p>

<p>This is the strategy for <strong>non-idempotent side effects</strong> — when running the function twice would cause visible problems. Instead of computing the real value, the winning process first writes a lock sentinel to the cache:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_build_at_most_once_lock</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="k">return</span> <span class="sa">f</span><span class="s">"__atomic_lock_proc:</span><span class="si">{</span><span class="n">_get_proc_thread_id</span><span class="p">()</span><span class="si">}</span><span class="s">"</span>
</code></pre></div></div>

<p>The process ID and thread ID identify who holds the lock. Now:</p>

<ol>
  <li>The winning process (whose process/thread ID matches the sentinel) <strong>runs the function</strong> — side effects happen exactly once — then replaces the sentinel with the real value</li>
  <li>All other processes <strong>detect the sentinel</strong> (it starts with <code class="language-plaintext highlighter-rouge">__atomic_lock_proc:</code>), <strong>sleep 10ms</strong>, and retry</li>
  <li>Eventually, the real value appears and everyone gets it</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">set_real_value_after_lock_is_set</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="n">key</span> <span class="o">=</span> <span class="n">make_cache_key</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="n">retval</span> <span class="o">=</span> <span class="n">func3_handle_atomic_conflict</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="n">retval_str</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">retval</span> <span class="ow">or</span> <span class="s">""</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">retval_str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"__atomic_lock_proc:"</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">retval</span>  <span class="c1"># Real value ready
</span>        <span class="n">proc_thread_id</span> <span class="o">=</span> <span class="n">retval_str</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">":"</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">proc_thread_id</span> <span class="o">==</span> <span class="n">_get_proc_thread_id</span><span class="p">():</span>
            <span class="c1"># I won the lock — compute and store
</span>            <span class="n">computed_val</span> <span class="o">=</span> <span class="n">orig_func3</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
            <span class="n">alan_cache</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">computed_val</span><span class="p">,</span> <span class="n">expire_in</span> <span class="ow">or</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">seconds</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
            <span class="k">return</span> <span class="n">computed_val</span>
        <span class="c1"># Another process holds the lock — wait
</span>        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.01</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>When to use:</strong> functions with <strong>non-idempotent side effects</strong> — sending a Slack message, creating a channel, provisioning a cloud resource, calling a billing API. The function runs exactly once; everyone else gets the cached result.</p>

<h2 id="production-usage">Production Usage</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">24</span><span class="p">,</span> <span class="n">atomic_writes</span><span class="o">=</span><span class="s">"at_least_once"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_user_lifecycle_data</span><span class="p">(</span><span class="n">provider</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">fetch_from_external_api</span><span class="p">(</span><span class="n">provider</span><span class="p">)</span>
</code></pre></div></div>

<p>70 usages across 39 files — heavily used in internal tooling that integrates with external providers. With up to 300 concurrent RQ workers, I haven’t observed congestion.</p>

<h1 id="12-async-background-computation--never-block-the-user">12. Async Background Computation — Never Block the User</h1>

<p>Some computations take 30+ seconds — aggregating data from external APIs, generating reports, scanning infrastructure. You can’t make the user wait.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">async_compute</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_infrastructure_status</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">scan_all_kubernetes_clusters</span><span class="p">()</span>  <span class="c1"># Takes 45 seconds
</span></code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">async_compute=True</code>:</p>

<ol>
  <li><strong>First call</strong>: enqueues an RQ job on <code class="language-plaintext highlighter-rouge">CACHE_BUILDER_QUEUE</code>, raises <code class="language-plaintext highlighter-rouge">AsyncValueBeingBuiltException</code>. The caller catches this and shows a loading state.</li>
  <li><strong>While computing</strong>: subsequent calls keep raising the exception.</li>
  <li><strong>Once computed</strong>: the value lands in Redis, and subsequent calls return it instantly.</li>
</ol>
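
<p>On the calling side this typically looks like the following sketch; the route and payload are hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from flask import jsonify

@app.route("/admin/infrastructure")
def infrastructure_status():
    try:
        return jsonify(get_infrastructure_status())
    except AsyncValueBeingBuiltException:
        # A background worker is computing the value: tell the UI to poll again
        return jsonify({"status": "computing"}), 202
</code></pre></div></div>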

<h2 id="the-rq-serialization-trick">The RQ Serialization Trick</h2>

<p>RQ serializes function references as strings like <code class="language-plaintext highlighter-rouge">module.function_name</code> and uses <code class="language-plaintext highlighter-rouge">import_attribute</code> to load them. But for class methods, the path has two levels (<code class="language-plaintext highlighter-rouge">module.Class.method</code>), which <code class="language-plaintext highlighter-rouge">import_attribute</code> can’t handle.</p>

<p>The workaround: I dynamically inject a module-level sync wrapper:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sync_func_name</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"_sync_</span><span class="si">{</span><span class="n">func</span><span class="p">.</span><span class="n">__qualname__</span><span class="si">}</span><span class="s">"</span>

<span class="o">@</span><span class="n">functools</span><span class="p">.</span><span class="n">wraps</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_sync_func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="n">alan_cache</span><span class="p">.</span><span class="n">_running_in_an_async_worker</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">ret</span> <span class="o">=</span> <span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">alan_cache</span><span class="p">.</span><span class="n">_running_in_an_async_worker</span> <span class="o">-=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">ret</span>

<span class="n">_sync_func</span><span class="p">.</span><span class="n">__qualname__</span> <span class="o">=</span> <span class="n">sync_func_name</span>
<span class="n">sync_func_module</span> <span class="o">=</span> <span class="n">getmodule</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
<span class="nb">setattr</span><span class="p">(</span><span class="n">sync_func_module</span><span class="p">,</span> <span class="n">sync_func_name</span><span class="p">,</span> <span class="n">enqueueable</span><span class="p">(</span><span class="n">_sync_func</span><span class="p">))</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">_running_in_an_async_worker</code> counter solves another subtle problem: <strong>recursive async</strong>. If an async-cached function calls another async-cached function, the inner one would also try to enqueue a job and raise <code class="language-plaintext highlighter-rouge">AsyncValueBeingBuiltException</code> — crashing the outer job. The counter forces inner calls to run synchronously when already inside an async worker.</p>
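
<p>Sketched out, with a hypothetical enqueue helper and the counter from the wrapper above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def _call_async_cached(func, *args, **kwargs):
    if alan_cache._running_in_an_async_worker &gt; 0:
        # Already inside an async worker: run inline instead of enqueueing,
        # otherwise the inner call would raise and crash the outer job.
        return func(*args, **kwargs)
    _enqueue_on_cache_builder_queue(func, args, kwargs)  # hypothetical helper
    raise AsyncValueBeingBuiltException()
</code></pre></div></div>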

<h1 id="13-keeping-caches-warm--periodic-refresh--startup-warming">13. Keeping Caches Warm — Periodic Refresh &amp; Startup Warming</h1>

<h2 id="periodic-refresh">Periodic Refresh</h2>

<p>Some data should always be fresh in the cache — Kubernetes cluster state, Cloudflare deployments, CI pipeline configs. You don’t want the first user after expiry to pay the recomputation cost.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span>
    <span class="n">minutes</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="n">async_compute</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">async_refresh_every</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">_get_applications</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
    <span class="k">return</span> <span class="n">run_cli_command</span><span class="p">(...)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">async_refresh_every</code> registers the function and its refresh period in a Redis HASH:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alan_cache</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="n">hset</span><span class="p">(</span><span class="n">CACHED_FUNCS_TO_REFRESH</span><span class="p">,</span> <span class="n">funcname</span><span class="p">,</span> <span class="n">to_seconds</span><span class="p">(</span><span class="n">async_refresh_every</span><span class="p">))</span>
</code></pre></div></div>

<p>An external cron job triggers the <code class="language-plaintext highlighter-rouge">refresh_periodic_cached_values()</code> RQ command, which iterates all registered functions and recomputes the ones that are due:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_refresh_periodic_cached_values</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="n">cached_funcs</span> <span class="o">=</span> <span class="n">alan_cache</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="n">hgetall</span><span class="p">(</span><span class="n">CACHED_FUNCS_TO_REFRESH</span><span class="p">)</span>
    <span class="n">cached_funcs_last_run_start</span> <span class="o">=</span> <span class="n">alan_cache</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="n">hgetall</span><span class="p">(</span>
        <span class="n">CACHED_FUNCS_TO_REFRESH_LAST_RUN_FINISHED</span>
    <span class="p">)</span>
    <span class="k">for</span> <span class="n">func_name</span><span class="p">,</span> <span class="n">period_sec</span> <span class="ow">in</span> <span class="n">cached_funcs</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
        <span class="n">last_run_start</span> <span class="o">=</span> <span class="n">cached_funcs_last_run_start</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">func_name</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">if</span> <span class="nb">int</span><span class="p">(</span><span class="n">last_run_start</span><span class="p">)</span> <span class="o">+</span> <span class="nb">int</span><span class="p">(</span><span class="n">period_sec</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">():</span>
            <span class="n">func</span> <span class="o">=</span> <span class="n">import_attribute</span><span class="p">(</span><span class="n">func_name_str</span><span class="p">)</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">func</span><span class="p">(</span><span class="n">_force_cache_update</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="k">except</span> <span class="n">AsyncValueBeingBuiltException</span><span class="p">:</span>
                <span class="k">pass</span>  <span class="c1"># Already running, check next time
</span></code></pre></div></div>

<p>When combined with <code class="language-plaintext highlighter-rouge">async_compute=True</code>, the refresh happens in a background worker — the old cached value continues to be served until the new one is ready. Users never see a loading state after the first computation.</p>

<p><strong>Constraints:</strong> minimum 5-minute granularity, must be shorter than <code class="language-plaintext highlighter-rouge">expire_in</code>, can’t be used on functions that take arguments — there’s no way to know which arguments to call them with.</p>

<p>9 functions use periodic refresh in production.</p>

<h2 id="startup-warming">Startup Warming</h2>

<p>For critical cache entries that should be ready before the first request:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cached_for</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">warmup_on_startup</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">async_compute</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_system_config</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">load_system_configuration</span><span class="p">()</span>
</code></pre></div></div>

<p>On application startup, a <code class="language-plaintext highlighter-rouge">before_first_request</code> callback eagerly computes the value:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">current_app</span><span class="p">.</span><span class="n">before_first_request</span>
<span class="k">def</span> <span class="nf">_warmup_cache</span><span class="p">():</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">alan_cache</span><span class="p">.</span><span class="n">disable_cache</span><span class="p">:</span>
        <span class="n">timeout_end</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">monotonic</span><span class="p">()</span> <span class="o">+</span> <span class="n">warmup_timeout</span><span class="p">.</span><span class="n">total_seconds</span><span class="p">()</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">func7</span><span class="p">(</span><span class="n">_force_cache_update</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
                <span class="k">break</span>
            <span class="k">except</span> <span class="n">AsyncValueBeingBuiltException</span><span class="p">:</span>
                <span class="k">pass</span>
            <span class="k">if</span> <span class="n">time</span><span class="p">.</span><span class="n">monotonic</span><span class="p">()</span> <span class="o">&gt;</span> <span class="n">timeout_end</span><span class="p">:</span>
                <span class="k">raise</span> <span class="nb">TimeoutError</span><span class="p">(</span><span class="sa">f</span><span class="s">"warming up cache value took more than </span><span class="si">{</span><span class="n">warmup_timeout</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.1</span><span class="p">)</span>
</code></pre></div></div>

<p>It polls until the value is computed or the <code class="language-plaintext highlighter-rouge">warmup_timeout</code> (default: 10 seconds) expires. This prevents cold-start penalties — the first real user request gets a cache hit.</p>

<p><strong>Constraints:</strong> same as periodic refresh — no arguments, no request-path keys.</p>

<h1 id="14-object-lifetime-caching">14. Object-Lifetime Caching</h1>

<p>Sometimes cache entries should live as long as a specific Python object — and be automatically cleaned up when that object is garbage collected.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RequestContext</span><span class="p">:</span>
    <span class="o">@</span><span class="n">cached</span><span class="p">(</span><span class="n">local_ram_cache_only</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">expire_when</span><span class="o">=</span><span class="s">"object_is_destroyed"</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">get_expensive_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">key</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">compute_expensive_data</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">expire_when="object_is_destroyed"</code>:</p>

<ol>
  <li>Alan Cache <strong>injects a <code class="language-plaintext highlighter-rouge">__del__</code> destructor</strong> on the class (preserving any existing destructor)</li>
  <li>It <strong>tracks all cache keys</strong> created for each instance in an <code class="language-plaintext highlighter-rouge">_instance_keys</code> dict, indexed by class name and instance identity</li>
  <li>When the object is garbage collected, the destructor fires and calls <code class="language-plaintext highlighter-rouge">alan_cache.delete_many(*keys)</code> to clean up all associated cache entries</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">destructor</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">keys</span> <span class="o">=</span> <span class="n">_instance_keys</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">class_name</span><span class="p">,</span> <span class="p">{}).</span><span class="n">get</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">),</span> <span class="nb">set</span><span class="p">())</span>
    <span class="n">alan_cache</span><span class="p">.</span><span class="n">delete_many</span><span class="p">(</span><span class="o">*</span><span class="n">keys</span><span class="p">)</span>
    <span class="n">_instance_keys</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">class_name</span><span class="p">,</span> <span class="p">{}).</span><span class="n">pop</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">),</span> <span class="bp">None</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">existing_destructor</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">existing_destructor</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
</code></pre></div></div>

<p>Must be paired with <code class="language-plaintext highlighter-rouge">local_ram_cache_only=True</code> — this feature is designed for in-memory objects whose lifecycle is tied to something transient like a request handler or a temporary computation context.</p>
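
<p>In use, the lifecycle looks like this sketch (the cleanup actually runs whenever CPython collects the object):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctx = RequestContext()
ctx.get_expensive_data("user-42")   # computed once, cached in local RAM
ctx.get_expensive_data("user-42")   # cache hit, no recomputation
del ctx                             # __del__ fires and purges the associated keys
</code></pre></div></div>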

<h1 id="15-observability--admin">15. Observability &amp; Admin</h1>

<p>Observability was a <strong>first-class design goal</strong> — not an afterthought. One of the main pain points with the old caching mess was having no visibility into what was cached.</p>

<h2 id="metrics">Metrics</h2>

<p>Every cache get and set is wrapped with Datadog timing metrics:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">metrics</span><span class="p">.</span><span class="n">timed</span><span class="p">(</span><span class="sa">f</span><span class="s">"cache.</span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s">.duration"</span><span class="p">,</span> <span class="n">tags</span><span class="o">=</span><span class="p">[</span>
    <span class="sa">f</span><span class="s">"cache_type:</span><span class="si">{</span><span class="n">cache_type</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
    <span class="sa">f</span><span class="s">"async:</span><span class="si">{</span><span class="n">async_compute</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
    <span class="sa">f</span><span class="s">"func_name:</span><span class="si">{</span><span class="n">funcname</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
<span class="p">])</span>
</code></pre></div></div>

<p>This gives me <code class="language-plaintext highlighter-rouge">cache.get.duration</code> and <code class="language-plaintext highlighter-rouge">cache.set.duration</code> histograms with per-function granularity. Since Datadog histograms include count, I also get hit rate and throughput for free.</p>

<h2 id="admin-api">Admin API</h2>

<p>Internal endpoints for cache inspection:</p>

<table>
  <thead>
    <tr>
      <th>Endpoint</th>
      <th>Method</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/alan_cache/funcnames</code></td>
      <td>GET</td>
      <td>List all cached functions with code owners</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/alan_cache/funcnames</code></td>
      <td>POST</td>
      <td>Delete all keys for specified functions</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/alan_cache/function_keys</code></td>
      <td>GET</td>
      <td>List keys for a specific function</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/alan_cache/count_keys_and_space</code></td>
      <td>GET</td>
      <td>Key counts and estimated memory per function</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/alan_cache/default_set_keys</code></td>
      <td>GET</td>
      <td>Paginated key browser (sortable by name or size)</td>
    </tr>
  </tbody>
</table>

<p>The admin API itself uses Alan Cache — the <code class="language-plaintext highlighter-rouge">_count_keys_and_space</code> function is decorated with <code class="language-plaintext highlighter-rouge">@cached_for(minutes=60, async_compute=True, async_refresh_every=timedelta(minutes=30))</code>. It uses Redis <code class="language-plaintext highlighter-rouge">PIPELINE</code> and <code class="language-plaintext highlighter-rouge">MEMORY USAGE</code> commands to estimate space without transferring values.</p>
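
<p>A hedged sketch of that estimation (key listing simplified, helper name hypothetical):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def estimate_space(redis_client, funcname: str) -&gt; int:
    keys = redis_client.smembers(f"cached_func_keys_{funcname}")
    pipe = redis_client.pipeline()
    for key in keys:
        pipe.memory_usage(key)  # MEMORY USAGE: size in bytes, value not transferred
    # A key may expire between SMEMBERS and the pipeline: treat None as 0
    return sum(size or 0 for size in pipe.execute())
</code></pre></div></div>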

<h2 id="function-discovery">Function Discovery</h2>

<p>The <code class="language-plaintext highlighter-rouge">CACHED_FUNCS</code> Redis SET serves as a live registry. Combined with per-function key SETs (with alphabetical and size-sorted variants), it provides:</p>
<ul>
  <li>Complete list of which functions are cached in production</li>
  <li>Key count per function</li>
  <li>Estimated memory consumption per function</li>
  <li>Code ownership mapping (via <code class="language-plaintext highlighter-rouge">get_code_owners_of_function</code>)</li>
</ul>

<h1 id="16-the-admin-dashboard--exploring-cache-state-in-production">16. The Admin Dashboard — Exploring Cache State in Production</h1>

<p>The API endpoints from Chapter 15 power a React-based internal admin dashboard. It turns raw Redis data into something anyone on the team can browse — no Redis CLI required.</p>

<h2 id="functions-overview">Functions Overview</h2>

<p>The main view lists every cached function in production:</p>

<p><img src="/images/admin_dashboard_functions.png" alt="Admin dashboard — functions overview" /></p>

<p>Each row shows:</p>
<ul>
  <li><strong>Module &amp; Function name</strong> — which Python function is cached</li>
  <li><strong>Key count</strong> — how many cache entries exist for this function</li>
  <li><strong>Estimated memory</strong> — computed via Redis <code class="language-plaintext highlighter-rouge">MEMORY USAGE</code> across all keys (itself cached and refreshed async every 30 min)</li>
  <li><strong>Code owners</strong> — extracted from CODEOWNERS, so you know who to ping</li>
  <li><strong>Actions</strong> — delete all keys for a function, or drill down into individual keys</li>
</ul>

<p>The search bar at the top filters by function name or module. The bulk delete button lets you wipe multiple functions at once — useful after a deploy that changes return types.</p>

<h2 id="key-browser">Key Browser</h2>

<p>Clicking a function drills into its individual cache keys:</p>

<p><img src="/images/admin_dashboard_keys.png" alt="Admin dashboard — key browser" /></p>

<p>Keys are sortable by name or size. The key format is structured: <code class="language-plaintext highlighter-rouge">flask_cache_{funcname}-{hash(arg₀)}-{hash(arg₁)}-...</code>. You can spot outliers — a single key consuming disproportionate memory usually means someone is caching a large queryset that should be paginated.</p>

<h2 id="value-inspector">Value Inspector</h2>

<p>Clicking a key shows its deserialized value:</p>

<p><img src="/images/admin_dashboard_value.png" alt="Admin dashboard — value inspector" /></p>

<p>The dashboard deserializes the pickled value and renders it as formatted JSON. This is invaluable for debugging — you can verify that the cached data matches expectations without adding print statements or breakpoints. For cached objects containing user data, the dashboard shows the actual field values (PII is visible only on the internal network).</p>

<h1 id="17-the-full-picture">17. The Full Picture</h1>

<p>Now that you’ve seen every feature — from simple decorators to distributed invalidation, atomic writes, async computation, and observability — here’s the complete infrastructure that Alan Cache operates in:</p>

<pre><code class="language-mermaid">flowchart TB
    Clients["🌐 Clients"]

    subgraph Gunicorn["Web Server (Gunicorn, ×N)"]
        W1["Worker 1&lt;br/&gt;🧠 Local RAM Cache"]
        W2["Worker 2&lt;br/&gt;🧠 Local RAM Cache"]
        Wn["Worker …&lt;br/&gt;🧠 Local RAM Cache"]
    end

    subgraph RQ["RQ Workers (up to 300, recycled every 30 min)"]
        R1["Worker 1&lt;br/&gt;🧠 Local RAM Cache"]
        R2["Worker 2&lt;br/&gt;🧠 Local RAM Cache"]
        Rn["Worker …&lt;br/&gt;🧠 Local RAM Cache"]
    end

    subgraph Redis["Redis"]
        subgraph Storage["Cache Storage"]
            STR["flask_cache_ funcname - hash args  — STR"]
        end
        subgraph Registry["Function Registry &amp; Key Tracking"]
            SET1["CACHED_FUNCS — SET"]
            SET2["cached_func_keys_ funcname  — SET"]
            ZSET1["cached_func_keys_ funcname _alpha — ZSET"]
            ZSET2["cached_func_keys_ funcname _size — ZSET"]
        end
        subgraph Invalidation["Distributed Invalidation"]
            ZSET3["CACHED_FUNCS_TO_DELETE — ZSET&lt;br/&gt;1h TTL, checked every 5 min"]
        end
        subgraph Refresh["Periodic Refresh"]
            HASH1["cached_funcs_to_refresh — HASH"]
            HASH2["cached_funcs_to_refresh_last_run — HASH"]
        end
        subgraph Queue["Job Queues (RQ)"]
            LIST["CACHE_BUILDER_QUEUE — LIST"]
        end
    end

    Cron["⏰ Cron Job"]

    Clients -- HTTP --&gt; Gunicorn
    Gunicorn -- "read/write + enqueue" --&gt; Redis
    RQ -- "dequeue + read/write" --&gt; Redis
    Cron -- "enqueues refresh jobs" --&gt; Redis
</code></pre>

<p>Inside every worker process, the <code class="language-plaintext highlighter-rouge">AlanCache</code> singleton manages four cache backends — two local, two remote:</p>

<pre><code class="language-mermaid">flowchart TB
    subgraph AlanCache["AlanCache singleton (one per process)"]
        subgraph Layer1["⚡ Layer 1 — Local RAM (per-process, &lt; 1ms)"]
            LC["local_cache&lt;br/&gt;SimpleCache — pickled values"]
            LCNS["local_cache_no_serializer&lt;br/&gt;SimpleCache — raw Python objects (no I/O)"]
        end
        subgraph Layer2["🗄️ Layer 2 — Shared Redis (cross-process, 1–5ms)"]
            SC["shared_cache&lt;br/&gt;RedisCache — primary, swallows errors"]
            SCA["shared_cache_atomic&lt;br/&gt;RedisCache — WATCH/MULTI/EXEC writes"]
        end
    end

    LC -- miss --&gt; LCNS
    LCNS -- miss --&gt; SC
    SC -- miss --&gt; SCA
</code></pre>

<h1 id="18-build-vs-buy--why-not-use-an-existing-library">18. Build vs Buy — Why Not Use an Existing Library?</h1>

<p>I considered the existing Python caching landscape:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">functools.cache</code> / <code class="language-plaintext highlighter-rouge">functools.lru_cache</code></strong> — Built into Python, zero dependencies. But strictly in-process RAM, no TTL support (<code class="language-plaintext highlighter-rouge">cache</code> has no expiration at all, <code class="language-plaintext highlighter-rouge">lru_cache</code> only evicts by size), no Redis layer, no invalidation beyond <code class="language-plaintext highlighter-rouge">cache_clear()</code> which wipes everything. We were already using <code class="language-plaintext highlighter-rouge">lru_cache</code> in places — it was one of the six fragmented approaches we wanted to consolidate</li>
  <li><strong>cachetools</strong> — Same idea as <code class="language-plaintext highlighter-rouge">functools.lru_cache</code> but with more eviction policies (TTL, LFU, LRU). Still RAM-only, no Redis backend, no shared state across processes</li>
  <li><strong>cache-tower</strong> — The closest in spirit: a multi-layer cache with RAM + Redis support. But it’s a minimal library focused on <code class="language-plaintext highlighter-rouge">get</code>/<code class="language-plaintext highlighter-rouge">set</code> with TTL and LRU eviction — no decorator interface, no distributed invalidation, no partial invalidation, no atomic writes, no background computation</li>
  <li><strong>dogpile.cache</strong> — Good two-layer support, but no distributed invalidation, no partial invalidation, no async computation</li>
  <li><strong>aiocache</strong> — Multi-backend support (Redis, Memcached, in-memory) with a clean decorator API, but built entirely around <code class="language-plaintext highlighter-rouge">asyncio</code>. Our stack is synchronous Flask + RQ — adopting aiocache would have meant either running an async event loop inside sync workers (fragile and complex) or migrating to an async framework first. It also lacked distributed invalidation, partial invalidation, and atomic writes</li>
  <li><strong>redis-simple-cache</strong> — Thin decorator around Redis with TTL support. Redis-only, no local RAM layer — every read is a network round-trip. No partial invalidation, no atomic writes, no background computation. Too simple for our needs</li>
  <li><strong>Flask-Caching itself</strong> — No local RAM layer, no atomic writes, no async computation, no partial invalidation</li>
</ul>
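
<p>To make the gap concrete, here is a minimal sketch (illustrative, not from our codebase) of what the stdlib and cachetools give you out of the box: in-process memoization, at best with a TTL. Every worker keeps its own private copy, and the only built-in invalidation wipes the whole process-local cache:</p>

<pre><code class="language-python">import functools

from cachetools import TTLCache, cached  # third-party: pip install cachetools


def fetch_catalog_from_db(country: str) -&gt; dict:
    # stand-in for a real database query
    return {"country": country}


@functools.lru_cache(maxsize=1024)  # size-based eviction only, no TTL
def get_catalog_forever(country: str) -&gt; dict:
    return fetch_catalog_from_db(country)


@cached(cache=TTLCache(maxsize=1024, ttl=3600))  # TTL, but still RAM-only
def get_catalog(country: str) -&gt; dict:
    return fetch_catalog_from_db(country)


# Neither cache is visible to any other worker process, and the only
# built-in invalidation clears everything in *this* process:
get_catalog_forever.cache_clear()
</code></pre>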

<p>None of them gave me what I needed: a unified decorator with two-layer storage, partial invalidation, distributed cache deletion across 300+ workers, async background computation, and atomic writes with semantic choices (<code class="language-plaintext highlighter-rouge">at_least_once</code> vs <code class="language-plaintext highlighter-rouge">at_most_once</code>).</p>

<p>So I reimplemented the useful parts of Flask-Caching — the backend factory and decorator core — without the Flask app context dependency, and built everything else on top. The key inspiration from CHI (Perl) was the philosophy: <strong>one interface, infinite configurability, observable by default</strong>.</p>

<p>The trade-off is maintenance cost — ~1500 lines of decorator logic. But the return is total control: every feature in Alan Cache exists because a real production incident demanded it.</p>

<h1 id="19-numbers">19. Numbers</h1>

<p>All numbers from Datadog, February 2026. The two-tier architecture shows its value at scale: ~300–500M cache GETs per day, with RAM absorbing ~10x more writes than Redis. The Redis infrastructure itself is barely loaded — ~1% CPU, zero swap, ~200 GiB memory headroom per node — despite serving ~29K GET commands per second.</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Decorator usages across codebase</td>
      <td><strong>258</strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">@cached_for</code> usages</td>
      <td>173</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">@cached</code> usages</td>
      <td>36</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">@memory_only_cache</code> usages</td>
      <td>36</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">@request_cached</code> usages</td>
      <td>6</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">@thread_local_class_cache</code> usages</td>
      <td>7</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">atomic_writes</code> usages</td>
      <td>70 across 39 files</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">async_refresh_every</code> usages</td>
      <td>9</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">clear_cached_func_some</code> usages</td>
      <td>~10</td>
    </tr>
    <tr>
      <td>Max concurrent RQ workers</td>
      <td>300</td>
    </tr>
    <tr>
      <td>Deletion check frequency</td>
      <td>5 minutes</td>
    </tr>
    <tr>
      <td>ZSET expiry (broadcast)</td>
      <td>1 hour</td>
    </tr>
    <tr>
      <td>Worker recycling interval</td>
      <td>30 minutes</td>
    </tr>
    <tr>
      <td>Redis keys</td>
      <td>[XXX — prod numbers TBD]</td>
    </tr>
    <tr>
      <td>Redis memory</td>
      <td><strong>~111 GiB</strong> total across clusters</td>
    </tr>
    <tr>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td><strong>Daily throughput</strong></td>
      <td> </td>
    </tr>
    <tr>
      <td>Cache GETs/day</td>
      <td>~300–500M (peak ~490M)</td>
    </tr>
    <tr>
      <td>Cache SETs in RAM/day</td>
      <td>~80–170M (peak ~170M)</td>
    </tr>
    <tr>
      <td>Cache SETs in Redis/day</td>
      <td>~5–25M (peak ~25M)</td>
    </tr>
    <tr>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td><strong>Redis infrastructure</strong></td>
      <td> </td>
    </tr>
    <tr>
      <td>Redis GET cmd/s</td>
      <td>~29K</td>
    </tr>
    <tr>
      <td>Redis SET cmd/s</td>
      <td>~9K</td>
    </tr>
    <tr>
      <td>Redis memory (total)</td>
      <td>~111 GiB across clusters</td>
    </tr>
    <tr>
      <td>Redis CPU</td>
      <td>~1%</td>
    </tr>
    <tr>
      <td>Redis swap</td>
      <td>0</td>
    </tr>
    <tr>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Core engine lines of code</td>
      <td>~1500</td>
    </tr>
  </tbody>
</table>

<h1 id="summary-tldr">Summary (TL;DR)</h1>

<p><strong>Alan Cache</strong> is the in-house Python caching library I built at Alan in January 2023. Inspired by Perl’s CHI, it replaced many fragmented caching methods with one unified two-layer system.</p>

<p><strong>The basics:</strong> <code class="language-plaintext highlighter-rouge">@cached_for(hours=1)</code> stores values in both local RAM (&lt;1ms) and shared Redis (1–5ms). Covers 173 of 258 usages. A direct <code class="language-plaintext highlighter-rouge">get</code>/<code class="language-plaintext highlighter-rouge">set</code>/<code class="language-plaintext highlighter-rouge">delete</code> API handles the rest.</p>

<p><strong>Storage control:</strong> choose RAM-only (<code class="language-plaintext highlighter-rouge">local_ram_cache_only</code>), Redis-only (<code class="language-plaintext highlighter-rouge">shared_redis_cache_only</code>), or both. Specialized decorators for class methods (<code class="language-plaintext highlighter-rouge">@memory_only_cache</code>), thread-local state (<code class="language-plaintext highlighter-rouge">@thread_local_class_cache</code>), and request-scoped computation (<code class="language-plaintext highlighter-rouge">@request_cached</code>).</p>

<p><strong>Cache keys &amp; invalidation:</strong> keys are structured as <code class="language-plaintext highlighter-rouge">{funcname}-{hash(arg0)}-{hash(arg1)}-...</code>. With <code class="language-plaintext highlighter-rouge">cache_key_with_full_args=True</code>, each argument gets its own slot, enabling <code class="language-plaintext highlighter-rouge">clear_cached_func_some(func, product_type="health")</code> to purge surgically.</p>

<p><strong>Distributed invalidation:</strong> 3-stage protocol — local delete with RE2 regex → async Redis scan+delete in batches of 1000 → broadcast via ZSET scored by Redis server epoch. Workers pick up patterns piggyback-style every 5 min. ZSET TTL 1h + 30min worker recycling = no missed invalidations.</p>

<p><strong>Side-effect safety:</strong> <code class="language-plaintext highlighter-rouge">at_least_once</code> (WATCH/MULTI/EXEC, optimistic, first write wins — for idempotent side effects) and <code class="language-plaintext highlighter-rouge">at_most_once</code> (lock sentinel + polling, pessimistic, exactly-once — for non-idempotent side effects). 70 usages, zero congestion with 300 workers.</p>

<p><strong>Advanced:</strong> async background computation via RQ workers with dynamic function injection. Periodic refresh via external cron (9 functions). Startup warming with configurable timeout. Object-lifetime caching with <code class="language-plaintext highlighter-rouge">__del__</code> injection.</p>

<p><strong>Observability:</strong> Datadog metrics per function, admin API with key browser and space estimation, internal admin dashboard.</p>

<p><strong>Result:</strong> No cache-related incidents since deployment.</p>]]></content><author><name>Damien Krotkine</name></author></entry><entry><title type="html">Riak as Events Storage</title><link href="http://damien.krotkine.com/2018/01/01/RiakEventsStorage.html" rel="alternate" type="text/html" title="Riak as Events Storage" /><published>2018-01-01T00:00:00+00:00</published><updated>2018-01-01T00:00:00+00:00</updated><id>http://damien.krotkine.com/2018/01/01/RiakEventsStorage</id><content type="html" xml:base="http://damien.krotkine.com/2018/01/01/RiakEventsStorage.html"><![CDATA[<p>– <em>This post is a compilation of four old posts. They are gathered here in one piece for more clarity, and for archiving purposes. The work described in this post was done over the years 2014-2016</em> –</p>

<h1 id="riak-as-events-storage">Riak as Events Storage</h1>

<p><a href="http://booking.com">Booking.com</a> constantly monitors, inspects, and
analyzes our systems in order to make decisions. We capture and channel
<strong>events</strong> from our various subsystems, then perform real-time, medium
and long-term computation and analysis.</p>

<p>This is a critical operational process, since our daily work always gives
precedence to data. Relying on data removes the guesswork in making sound
decisions.</p>

<p>In this series of blog posts, we will outline details of our data pipeline, and
take a closer look at the short and medium-term storage layer that was
implemented using <a href="http://basho.com/products/#riak"><strong>Riak</strong></a>.</p>

<h2 id="introduction-to-events-storage">Introduction to Events Storage</h2>

<p>Booking.com receives, creates, and sends an enormous amount of data. Usual
business-related data is handled by traditional databases, caching systems,
etc. We define <em>events</em> as data that is generated by all the subsystems on
Booking.com.</p>

<p>In essence, events are free-form documents that contain a variety of metrics.
The generated data does not contain any direct operational information.
Instead, it is used to report status, states, secondary information, logs,
messages, errors and warnings, health, and so on. The data flow represents a
detailed status of the platform and contains crucial information that will be
harvested and used further down the stream.</p>

<p>To put this in numerical terms: we have <strong>billions of events per day</strong>,
streaming at more than <strong>100 MB per second</strong> and adding up to more than
<strong>6 TB per day</strong>.</p>

<p>Here are some examples of how we use the events stream:</p>

<ul>
  <li><strong>Visualisation</strong>: Wherever possible, we use graphs to express data. To create them, we use a heavily-modified version of <a href="http://graphite.readthedocs.org/en/latest/overview.html">Graphite</a>.</li>
  <li><strong>Looking for anomalies</strong>: When something goes wrong, we need to be notified. We use threshold-based notification systems (like <a href="https://github.com/scobal/seyren">seyren</a>) as well as a custom <em>anomaly detection software</em>, which creates statistical metrics (e.g. change in standard deviation) and alerts if those metrics look suspicious.</li>
  <li><strong>Gathering errors</strong>: We use our data pipeline to pass stack traces from all our production servers into <a href="https://www.elastic.co/products/elasticsearch">ElasticSearch</a>. Doing it this way (as opposed to straight from the web application log files) allows us to correlate errors with the wealth of the information we store in the events.</li>
</ul>

<p>These typical use-cases are made available less than one minute after the related event has been generated.</p>

<h2 id="high-level-overview">High Level overview</h2>

<p>This is a very simplified diagram of the data flow:</p>

<p><center><img src="/images/riak1_flow.png" alt="simplified diagram of the data flow" /></center></p>

<p>We can generate events by using literally any piece of code that exists on our
servers. We pass a HashMap to a function, which packages the provided document
into a UDP packet and sends it to a collection layer. This layer aggregates all
the events together into “blobs”, which are split by seconds (also called
epochs) and other variables. These event <em>blobs</em> are then sent to the storage
layer running Riak. Finally, Riak sends them on to
<a href="https://hadoop.apache.org/">Hadoop</a>. The Riak cluster is meant to safely store
around ten days of data. It is used for <strong>near real-time</strong> analysis (something
that happened seconds or minutes ago), and <strong>medium-term</strong> analysis of
relatively small amounts of data. We use Hadoop for older data analysis or
analysis of a larger volume of data.</p>

<p>The above diagram is a simplified version of our data flow. In practical application, it’s spread across multiple datacenters (DC), and includes an additional aggregation layer.</p>
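<p>As an illustration, here is a minimal Perl sketch of what the sending function could look like. The collector address, port, and the <code class="language-plaintext highlighter-rouge">send_event</code> helper are assumptions for the example; the real in-house API differs, but the idea is the same: serialise a HashMap and fire it at the collection layer over UDP.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;
use IO::Socket::INET;
use Sereal::Encoder;

# hypothetical send_event() helper: serialise a HashMap and send it
# to the collection layer as a UDP packet
sub send_event {
    my ($event) = @_;
    my $sock = IO::Socket::INET-&gt;new(
        Proto    =&gt; 'udp',
        PeerAddr =&gt; 'collector.example.com',  # assumed collector address
        PeerPort =&gt; 5001,                     # assumed collector port
    ) or die "socket: $!";
    $sock-&gt;send( Sereal::Encoder-&gt;new-&gt;encode($event) );
}

# literally any piece of code can emit a free-form event
send_event({ cpu =&gt; 42, hostname =&gt; 'web123' });
</code></pre></div></div>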

<h2 id="individual-events">Individual Events</h2>

<p>An event is a small <strong>schema-less</strong> <strong>[1]</strong> piece of data sent by our systems.
That means that the data can be in any structure with any level of depth, as
long as the top level is a HashTable. This is crucial to Booking.com - the goal
is to give as much flexibility as possible to the sender, so that it’s easy to
add or modify the structure, or the type and number of events.</p>

<p>Events are also tagged in four different ways:</p>

<ul>
  <li>the epoch at which they were created</li>
  <li>the DC where they originated</li>
  <li>the type of event</li>
  <li>the subtype.</li>
</ul>

<p>Some common types are:</p>

<ul>
  <li>WEB events (events produced by code running under a web server)</li>
  <li>CRON events (output of cron jobs)</li>
  <li>LB events (load balancer events)</li>
</ul>

<p>The subtypes are there for further specification and can answer questions like:
“Which one of the web server systems are we talking about?”.</p>

<p>Events are compressed <a href="https://github.com/Sereal/Sereal">Sereal</a>
blobs. Sereal is possibly the
<a href="https://github.com/Sereal/Sereal/wiki/Sereal-Comparison-Graphs">best schema-less serialisation format</a>
currently available. It was also
<a href="http://blog.booking.com/sereal-a-binary-data-serialization-format.html">written at Booking.com</a>.</p>

<p>An individual event is not very big, but a huge number of them are sent every
second.</p>

<p>We use UDP as transport because it provides a fast and simple way to send data.
Despite a (very low) risk of data loss, sending events never blocks or slows
down the senders. We are experimenting with a UDP-to-TCP relay that will be
local to the senders.</p>

<h2 id="aggregated-events">Aggregated Events</h2>

<p>Every second, the events from that second (called an <em>epoch</em>) with the same DC
number, type, and subtype are merged together as an Array of events on the
aggregation layer. At this point, it’s important to get the smallest
size possible, so the events of a given epoch are re-serialised as a Sereal
blob, using these options:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>compress =&gt; Sereal::Encoder::SRL_ZLIB,
dedupe_strings =&gt; 1
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">dedupe_strings</code> increases the serialisation time slightly. However, it removes
duplicated strings, which occur a lot since events are usually quite similar
to one another. We also add gzip compression.</p>
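<p>For reference, here is a minimal sketch of what this re-serialisation could look like with the Perl <code class="language-plaintext highlighter-rouge">Sereal::Encoder</code> API; the <code class="language-plaintext highlighter-rouge">@events_for_this_epoch</code> array is an assumption for the example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use Sereal::Encoder qw(SRL_ZLIB);

# re-serialise the merged events of one epoch/DC/type/subtype
my $encoder = Sereal::Encoder-&gt;new({
    compress       =&gt; SRL_ZLIB,  # zlib(gzip)-compress the payload
    dedupe_strings =&gt; 1,         # emit each repeated string only once
});
my $blob = $encoder-&gt;encode(\@events_for_this_epoch);
</code></pre></div></div>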

<p>We also add the checksum of the blob as a postfix, to be able to ensure data
integrity later on. The following diagram shows what an aggregated blob of
events looks like for a given epoch, DC, type, and subtype. You can get more
information about the Sereal encoding in the
<a href="https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod">Sereal Specification</a>.</p>

<p>This is the general structure of an events blob:</p>

<p><center><img src="/images/riak1_blob.png" alt="general structure of an events blob" /></center></p>

<p>The compressed payload contains the events themselves. It’s an Array of HashMaps,
serialised in a Sereal structure and gzip-compressed. Here is an example of a
trivial payload of two events:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
  { cpu =&gt; 5  },
  { cpu =&gt; 99 }
]
</code></pre></div></div>

<p>And the gzipped payload would be the compressed version of this binary string:</p>

<p><center><img src="/images/riak1_gzip.png" alt="gzipped payload would be the compressed version of this binary string" /></center></p>

<p>It can be hard to follow these hex digits <strong>[2]</strong>, yet it’s a nice illustration
of why the Sereal format helps us to reduce the size of serialised data. The
second array element is encoded on far fewer bytes than the first one, since
the key has already been seen. The resulting binary is then re-compressed. The
Sereal implementation offers multiple compression algorithms, including
<a href="http://google.github.io/snappy/">Snappy</a> and <a href="http://www.gzip.org/">gzip</a>.</p>

<p>A typical blob of events for one second/DC/type/subtype can weigh anywhere from
several kilobytes to several megabytes, which translates into a (current) average
of around 250 gigabytes per hour.</p>

<p>Side note: smaller subtypes on this level of aggregation aren’t always used,
because we want to minimise the data we transmit over our network by having good
compression ratios. Therefore we split types into subtypes only when the blobs
are big enough. The downside to this approach is that consumers have to fetch
data for the whole type, then filter out only subtypes they want. We’re looking
at ways to find more balance here.</p>

<h2 id="data-flow-size-and-properties">Data flow size and properties</h2>

<p>Data flow properties are important, since they’re used to decide how data
should be stored:</p>

<ul>
  <li>The data is timed and all the events blobs are associated with an epoch. It’s
important to bear in mind that events are schema-less, so the data is not a
traditional time series.</li>
  <li>Data can be considered read-only; the aggregated events blobs are written every second and almost never modified
(history rewriting happens very rarely).</li>
  <li>Once sent to the storage, the data must be available as soon as possible.</li>
</ul>

<p>Data is used in different ways on the client side. A lot of consumers are
actually daemons that will consume the fresh data as soon as possible - usually
seconds after an event was emitted. A large number of clients read the last few
hours of data in a chronological sequence. On rare occasions, consumers access
random data that is over a few days old. Finally, consumers that want to work on larger
amounts of older data would have to create Hadoop jobs.</p>

<p>There is a large volume of data to be moved and stored. In numerical terms:</p>

<ul>
  <li>Once serialized and compressed into blobs, the stream is usually larger than 50 MB/s</li>
  <li>That’s around <strong>250 GB per hour</strong> and more than <strong>6 TB per day</strong></li>
  <li>There is a daily <em>peak hour</em>, but the variance of the data size is not huge:
there are no quiet periods</li>
  <li>Yearly <em>peak season</em> stresses all our systems, including events
transportation and storage, so we need to provision capacity for that</li>
</ul>

<h2 id="why-riak">Why Riak</h2>

<p>In order to find the best storage solution for our needs, we tested and
benchmarked several different products and solutions.</p>

<p>The solutions had to reach the right balance of multiple features:</p>

<ul>
  <li>Read performance had to be high, as a lot of external processes would use the data.</li>
  <li>Write security was important, as we had to ensure that the continuous flow of
data could be stored. Write performance should not be impacted by reads.</li>
  <li>Horizontal scalability was of utmost importance, as our business and traffic continuously grows.</li>
  <li>Data resilience was key: we didn’t want to lose portions of our data because of a hardware problem.</li>
  <li>It had to allow a small team to administer the storage and make it evolve.</li>
  <li>The storage shouldn’t require the data to have a specific schema or
structure.</li>
  <li>If possible, it would be able to bring code to the data, performing
computation on the storage itself instead of having to move data out of the storage.</li>
</ul>

<p>After exploring a number of distributed file systems and databases, we chose
Riak over other distributed key-value stores. Riak had good performance and
predictable behaviour when nodes fail and when scaling up. It also had the
advantage of being easy to grasp and implement within a small team. Extending
it was very easy (as we’ll see in the next part of this series of blog
posts) and we found the system very robust - we never had to face dramatic
issues or data loss.</p>

<p><strong>Disclaimer</strong>: This is not an endorsement for Riak. We compared it carefully to
other solutions over a long period of time and it seemed to be the best product
to suit our needs. As an example, we thoroughly tested Cassandra as an
alternative: it had a larger community and similar performance but was less
robust and predictable; it also lacked some advanced features. The choice is
ultimately a question of priorities. The fact that our events are
<em>schema-less</em> made it almost impossible for us to use solutions that require
knowledge of the data structures. We also needed a small team to be able to
operate the storage, and a way to process data on the cluster itself, using
MapReduce or similar mechanisms.</p>

<h2 id="riak-101">Riak 101</h2>

<p>The Riak cluster is a collection of nodes (in our case, physical servers), each
of which claims ownership of a portion of the keys. Depending on the chosen
replication factor, each key might be owned by multiple nodes. You can ask any
node for a key and your request will be redirected to one of the owners. The
same goes for writes.</p>

<p>On closer inspection of Riak, we see that keys are grouped into virtual nodes.
Each physical node can own multiple virtual nodes. This simplifies data
rebalancing when growing a cluster. Riak does not need to recalculate the owner
for each individual key; it will only do it per virtual node.</p>

<p>We won’t cover the Riak architecture in great detail in this post, but we
recommend reading the
<a href="http://docs.basho.com/riak/latest/theory/concepts/Clusters/">following article</a>
for further information.</p>

<h2 id="riak-clusters-configuration">Riak clusters configuration</h2>

<p>The primary goal of this storage is to keep the data safe. We went with the
standard replication factor of <em>three</em>: even if two nodes owning the same
data go down, we won’t lose our data.</p>

<p>Riak offers multiple back-ends for actual data storage. The main three are
Memory, LevelDB, and Bitcask. We chose Bitcask, since it was suitable for our
particular needs. Bitcask uses log-structured hash tables that provide very
fast access. As data gets written to the storage, Bitcask simply appends data
to a number of opened files. Even if a key is modified or deleted, the
information will be written at the end of these storage files. An in-memory
HashTable maps the keys with the position of their (latest) value in files.
That way, at most one seek is needed to fetch data from the file system.</p>

<p>Data files are then periodically compacted, and Bitcask provides very good
expiration flexibility. Since Riak is a temporary storage solution for us, we
set it up with automatic expiration. Our expiration period varies with the
current cluster shape, but usually falls between 8 and 11 days.</p>

<p>Bitcask keeps all of a node’s keys in memory, so keeping large numbers of
individual events as key-value pairs isn’t trivial. We sidestep the issue by
using aggregations of events (blobs), which drastically reduce the number of
keys needed.</p>

<p>More information about Bitcask can be found <a href="http://docs.basho.com/riak/latest/ops/advanced/backends/bitcask/">here</a>.</p>

<p>For our conflict resolution strategy, we use Last Write Wins. The nature of our
data (which is immutable as we described before) allows us to avoid the need
for conflict resolution.</p>

<p>The last important part of our setup is load balancing. It is crucial in an
environment with a high level of reads and only a 1-gigabit network. We use our
own solution for that, based on <a href="https://zookeeper.apache.org/">Zookeeper</a>.
Zooanimal daemons run on the Riak nodes and collect information about
system health. The information is then aggregated into simple text files
containing an ordered list of the IP addresses of up-and-running Riak nodes
that we can connect to. All our Riak clients simply choose a random node to send
their requests to.</p>
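<p>Client-side, node selection then boils down to a few lines of Perl. This is a sketch; the path of the Zooanimal-generated file is an assumption:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pick a random live Riak node from the aggregated text file
open my $fh, '&lt;', '/etc/riak_nodes.txt' or die "open: $!";
chomp( my @nodes = &lt;$fh&gt; );
my $node = $nodes[ rand @nodes ];  # all clients spread load randomly
</code></pre></div></div>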

<p>We currently have two Riak clusters in different geographical locations, each
of which has more than 30 nodes. More nodes means more storage space, CPU
power, RAM, and network bandwidth available.</p>

<h2 id="data-design">Data Design</h2>

<p>Riak is primarily a key-value store. Although it provides advanced features
(secondary indexes, MapReduce, CRDTs), the simplest and most efficient way to
store and retrieve data is to use the key-value model.</p>

<p>Riak has three concepts: a <strong>bucket</strong> is a namespace, in which a key is
unique. A <strong>key</strong> is the identifier of the data and has to be stored in a
bucket. A <strong>value</strong> is the data itself; it has an associated mime-type, which
can make Riak aware of its type.</p>

<p>Riak doesn’t provide efficient ways to retrieve the list of buckets or the list
of keys by default <strong>[3]</strong>. When using Riak, it’s important to know the bucket
and key to access. This is usually resolved by using self-explanatory
identifiers.</p>

<p>In our case, our events are stored as Sereal-encoded blobs. For each blob, we know
the datacenter, type, subtype, and of course the time at which it was created.</p>

<p>When we need to retrieve data, we always know the time we want. We also know
the list of our datacenters - it doesn’t change unexpectedly, so we
can make it static for our applications. We are not always sure about what
types or subtypes will appear in a given epoch for a given datacenter: in some
seconds, events of certain types may not arrive.</p>

<p>We came up with this simple data design:</p>

<ul>
  <li>events blobs are stored in the <strong>events</strong> bucket, keys being
<code class="language-plaintext highlighter-rouge">&lt;epoch&gt;:&lt;dc&gt;:&lt;type&gt;:&lt;subtype&gt;:&lt;chunk&gt;</code></li>
  <li>metadata are stored in the <strong>epochs</strong> bucket, keys being <code class="language-plaintext highlighter-rouge">&lt;epoch&gt;-&lt;dc&gt;</code> and
values being the list of events keys for this epoch and DC combination</li>
</ul>

<p>The value of chunk is an integer, starting at zero, which keeps event blobs
smaller than 500 kilobytes each. We use the integer to split big events blobs
into smaller ones, so that Riak can function more efficiently.</p>

<p>We’ll now see this data design in action when pushing data to Riak.</p>

<h1 id="pushing-to-riak">Pushing to Riak</h1>

<p>Pushing data to Riak is done by a number of <strong>relocators</strong>, which are daemons
running on the aggregation layer that then push events blobs to Riak.</p>

<p>Side note: it’s not recommended to store values larger than 1-2 MB in Riak (see <a href="http://docs.basho.com/riak/latest/community/faqs/developing/#is-there-a-limit-on-the-size-of-files-that-can-be">this FAQ</a>).
Since our blobs can be 5-10 MB in size, we shard them into chunks of 500 KB each.
Chunks are valid Sereal documents, which means we do not have to stitch chunks together in order to read the data back.</p>
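<p>A naive way to build such chunks would be to re-encode growing batches of events until the size limit is reached, as sketched below; the real pipeline splits the encoded blob directly, without deserialising it (see <code class="language-plaintext highlighter-rouge">Sereal::Splitter</code> in the notes).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use Sereal::Encoder qw(SRL_ZLIB);
my $encoder = Sereal::Encoder-&gt;new({ compress =&gt; SRL_ZLIB, dedupe_strings =&gt; 1 });

# naive sketch: shard one epoch's events (@events is assumed) into chunks
# that each encode to at most ~500 KB; each chunk is a valid Sereal document
my (@chunks, @current);
for my $event (@events) {
    push @current, $event;
    if (length($encoder-&gt;encode(\@current)) &gt; 500_000 &amp;&amp; @current &gt; 1) {
        pop @current;                               # close the chunk without it
        push @chunks, $encoder-&gt;encode(\@current);
        @current = ($event);
    }
}
push @chunks, $encoder-&gt;encode(\@current) if @current;
</code></pre></div></div>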

<p>This means that we have quite a lot of blobs to send to Riak, so to maximise our usage of
networking, I/O, and CPU, it’s best to send data in a massively parallel way. To do so, we maintain a number of
forked processes (20 per host is a good start), each of which pushes data to Riak.</p>

<p>Pushing data to Riak can be done using the
<a href="http://docs.basho.com/riak/latest/dev/references/http/">HTTP API</a> or the
<a href="http://docs.basho.com/riak/latest/dev/references/protocol-buffers/">Protocol Buffers Client (PBC) API</a>.
PBC has slightly better performance.</p>

<p>Whatever protocol is used, it’s important to maximise I/O utilisation. One way
is to use an HTTP library that parallelises the requests in terms of I/O
(<a href="https://metacpan.org/pod/YAHC">YAHC</a> is an example). Another method is to use
an asynchronous Riak Client like
<a href="https://metacpan.org/pod/AnyEvent::Riak">AnyEvent::Riak</a>.</p>

<p>We use an in-house library to create and maintain a pool of forks, but there is more than one existing library on CPAN, like
<a href="https://metacpan.org/pod/Parallel::ForkManager">Parallel::ForkManager</a>.</p>

<h2 id="put-to-riak">PUT to Riak</h2>

<p>Writing data to Riak is rather simple. For a given epoch, we have the list of
events blobs, each of them having a different DC/type/subtype combination (remember, DC is short for Data Center). For example:</p>

<p><center><img src="/images/riak_put.png" alt="PUT to Riak" /></center></p>

<p>The first task is to slice the blobs into 500 KB chunks and add a postfix index
number to their name. That gives:</p>

<p><center><img src="/images/riak_put_result.png" alt="PUT to Riak - result" /></center></p>

<p>Next, we can store all the event blobs in Riak in the <code class="language-plaintext highlighter-rouge">events</code> bucket. We can
simulate it with curl:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -d &lt;data&gt; -XPUT "http://node:8098/buckets/events/keys/1413813813:1:type1:subtype1:0"
# ...
curl -d &lt;data&gt; -XPUT "http://node:8098/buckets/events/keys/1413813813:2:type3::0"
</code></pre></div></div>

<p>Side note: we store all events in each of the available Riak clusters. In other
words, all events from all DCs will be stored in the Riak cluster which is in DC
1, as well as in the Riak cluster which is in DC 2. We do not use cross DC
replication to achieve that - instead we simply push data to all our clusters
from the relocators.</p>

<p>Once all the events blobs are stored, we can store the <strong>metadata</strong>, which is
the list of the event keys, in the <code class="language-plaintext highlighter-rouge">epochs</code> bucket. This metadata is stored in one
key per epoch and DC. So for the current example, we will have 2 keys:
<code class="language-plaintext highlighter-rouge">1413813813-1</code> and <code class="language-plaintext highlighter-rouge">1413813813-2</code>. We have chosen to store the list of events
blobs names as pipe separated values. Here is a simulation with curl for
DC 2:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -d "type1:subtype1:0|type1:subtype1:1|type3::0" -XPUT "http://riak_host:8098/buckets/epochs/keys/1413813813-2"
</code></pre></div></div>

<p>Because the epoch and DC are already in the key name, it’s not necessary to
repeat them in the content. It’s important to push the metadata <strong>after</strong>
pushing the data: consumers discover data keys through the metadata, so the
metadata must only appear once all the data it points to is safely stored.</p>
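<p>In Perl, the write sequence could be sketched like this with plain <code class="language-plaintext highlighter-rouge">LWP::UserAgent</code>; the <code class="language-plaintext highlighter-rouge">@chunks</code> structure and key names are assumptions for the example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use LWP::UserAgent;
my $ua = LWP::UserAgent-&gt;new;

# 1. store every data chunk first
for my $chunk (@chunks) {  # e.g. { name =&gt; 'type3::0', blob =&gt; $blob }
    $ua-&gt;put("http://node:8098/buckets/events/keys/1413813813:2:$chunk-&gt;{name}",
             'Content-Type' =&gt; 'application/octet-stream',
             Content        =&gt; $chunk-&gt;{blob});
}

# 2. only then store the metadata pointing at those chunks
$ua-&gt;put("http://node:8098/buckets/epochs/keys/1413813813-2",
         'Content-Type' =&gt; 'text/plain',
         Content        =&gt; join('|', map { $_-&gt;{name} } @chunks));
</code></pre></div></div>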

<h2 id="put-options">PUT options</h2>

<p>When pushing data to the Riak cluster, we can use different attributes to
change the way data is written - either by specifying which ones when using the PBC
API, or by setting the buckets defaults.</p>

<p><a href="http://docs.basho.com/riak/latest/dev/advanced/replication-properties/#Available-Parameters">Riak’s documentation</a> provides a comprehensive list of the parameters and their meaning. We have set these parameters as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"n_val" : 3,

"allow_mult"      : false,
"last_write_wins" : true,

"w"  : 3,
"dw" : 0,
"pw" : 0,
</code></pre></div></div>

<p>Here is a brief explanation of these parameters:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">n_val:3</code> means that the data is replicated three times</li>
  <li><code class="language-plaintext highlighter-rouge">allow_mult</code> and <code class="language-plaintext highlighter-rouge">last_write_wins</code> prohibit siblings values; conflicts are resolved right away by using the last value
written</li>
  <li><code class="language-plaintext highlighter-rouge">w:3</code> means that when writing data, we get a success response only when the data has been written to all three
replica nodes</li>
  <li><code class="language-plaintext highlighter-rouge">dw:0</code> instructs Riak to wait only for the data to have reached the nodes, not the backends on the nodes, before returning success.</li>
  <li><code class="language-plaintext highlighter-rouge">pw:0</code> is here to specify that it’s OK if the nodes that store the replicas are not the primary nodes (i.e. the ones that are supposed to hold the data), but replacement nodes, in case the primary ones were unavailable.</li>
</ul>

<p>In a nutshell, we have a reasonably robust way of writing data. Because our
data is immutable and never modified, we don’t want siblings or
conflict resolution at the application level. Data loss could, in theory, happen
if a major network issue occurred just after a write was acknowledged, but
before the data reached the backend. However, in the worst case we would lose a
fraction of one second of events, which is acceptable for us.</p>

<h1 id="reading-from-riak">Reading from Riak</h1>

<p>This is how the data and metadata for a given epoch is laid out in Riak:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bucket: epochs
key: 1428415043-1
value: 1:cell0:WEB:app:chunk0|1:cell0:EMK::chunk0

bucket: events
key: 1428415043:1:cell0:WEB:app:chunk0
value: &lt;binary sereal blob&gt;

bucket: events
key: 1428415043:1:cell0:EMK::chunk0
value: &lt;binary sereal blob&gt;
</code></pre></div></div>

<p>Fetching one second of data from Riak is quite simple. Given a DC and an epoch,
the process is as follows (see the sketch after the list):</p>

<ul>
  <li>Read the metadata by fetching the key <code class="language-plaintext highlighter-rouge">&lt;epoch&gt;-&lt;dc&gt;</code> from the bucket <code class="language-plaintext highlighter-rouge">"epochs"</code></li>
  <li>Parse the metadata value, split on the pipe character to get data keys, and prepend the epoch to them</li>
  <li>Reject data keys that we are not interested in by filtering on type/subtype</li>
  <li>Fetch the data keys in parallel</li>
  <li>Deserialise the data</li>
  <li>Data is now ready for processing</li>
</ul>
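<p>Here is a minimal Perl sketch of this process for one epoch and DC, using the HTTP API. The node address and the WEB filter are assumptions for the example; production consumers fetch the data keys in parallel and verify the checksum postfix before decoding:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use LWP::UserAgent;
use Sereal::Decoder;

my $ua = LWP::UserAgent-&gt;new;
my ($epoch, $dc) = (1428415043, 1);

# 1. fetch the metadata for this epoch and DC
my $meta = $ua-&gt;get("http://node:8098/buckets/epochs/keys/$epoch-$dc")-&gt;content;

# 2. split on the pipe character, keep only the types we want (say WEB),
#    and prepend the epoch to obtain the data keys
my @data_keys = map  { "$epoch:$_" }
                grep { (split /:/)[2] eq 'WEB' }
                split /\|/, $meta;

# 3. fetch and deserialise each blob (sequentially here, for simplicity)
my $decoder = Sereal::Decoder-&gt;new;
for my $key (@data_keys) {
    my $blob   = $ua-&gt;get("http://node:8098/buckets/events/keys/$key")-&gt;content;
    # NB: real consumers strip and verify the checksum postfix first
    my $events = $decoder-&gt;decode($blob);  # decompression is transparent
    # ... process $events, an ArrayRef of HashMaps
}
</code></pre></div></div>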

<p>Reading a time range of data is done the same way. Fetching ten minutes of
data from <em>Wed, 01 Jul 2015 11:00:00 GMT</em> would be done by enumerating all the
epochs, in this case:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1435748400
1435748401
1435748402
...
1435749000
</code></pre></div></div>

<p>Then, for each epoch, fetch the data as previously mentioned. It should be noted
that Riak is specifically tailored for this kind of workload, where multiple
parallel processes perform a huge number of small requests on different keys.
This is where distributed systems shine.</p>

<h2 id="get-options">GET options</h2>

<p>The <code class="language-plaintext highlighter-rouge">events</code> bucket (where the event data is stored) has the following properties:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"r"            : 1,
"pr"           : 0,
"rw"           : "quorum",
"basic_quorum" : true,
"notfound_ok"  : true,
</code></pre></div></div>

<p>Again, let’s look at these parameters in detail:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">r:1</code> means that when fetching data, as soon as we have a reply from one replica
node, Riak considers this a valid reply and won’t compare it with
other replicas.</li>
  <li><code class="language-plaintext highlighter-rouge">pr:0</code> removes the requirement that the data comes from a primary node</li>
  <li><code class="language-plaintext highlighter-rouge">notfound_ok:true</code> means that as soon as one node can’t find a key, Riak considers that the
key doesn’t exist.</li>
</ul>

<p>These parameter values allow fetching data to be as fast as possible. In
theory, such values don’t protect against conflicts or data corruption.
However, in the “Aggregated Events” section (see the first post), we saw that every event blob
has a checksum suffix. When fetching blobs from Riak, this enables the consumer
to verify that there is no data corruption. The fact that the events are never
modified ensures that no version conflict can occur. This is why having such
“careless” parameter values is not an issue for this use case.</p>

<h1 id="real-time-data-processing-outside-of-riak">Real time data processing outside of Riak</h1>

<p>After the events are properly stored in Riak, it’s time to use them. The first
usage is quite simple: extract data out of them and process it on dedicated
machines, usually grouped in clusters or aggregations of machines that perform
the same kind of analysis. These machines are called <em>consumers</em>, and they
usually run daemons that fetch data from Riak, either continuously or on
demand. Most of the continuous consumers are actually small clusters of machines
spreading the load of fetching data.</p>

<p><center><img src="/images/riak_outside.png" alt="Real time data processing outside of Riak" /></center></p>

<p>Some data processing is required in near real-time. This is the case for
monitoring and building graphs. Booking.com heavily uses graphs at every
layer of its technical stack. A big portion of the graphs are generated from
events. Data is fetched every second from the Riak storage, processed, and
dedicated graphing data is sent to an in-house <a href="http://graphite.wikidot.com/">Graphite</a> cluster.</p>

<p>Other forms of monitoring also consume the events stream - fetched continuously
and aggregated into per-second, per-minute, and daily aggregations in external
databases, which are then provided to multiple departments via internal tools.</p>

<p>These kinds of processes try to be as close as possible to real-time.
Currently there are 10 to 15 seconds of lag. This lag could be shorter: a
portion of it is due to the collection part of the pipeline, and an even bigger
part of it is due to the re-serialisation of the events as they are grouped
together to reduce their size. A good deal of optimisation could be done there
to bring the lag down to a couple of seconds <strong>[4]</strong>. However, there was no operational
requirement for reducing it, and 15 seconds is small enough for our current
needs.</p>

<p>Another way of using the data is to stick to real-time, but accumulate seconds
into periods. One example is our Anomaly Detector, which continuously fetches
events from the Riak clusters. However, instead of using the data right away,
it accumulates it over short moving windows of time (a few minutes) and applies
statistical algorithms to it. The goal is to detect anomalous patterns in our
data stream and provide the first alert that prompts further action. Needless to
say, this client is critical.</p>

<p>Another similar usage is done when gathering data related to A/B testing. A large
number of machines harvest data from the events’ flow before processing it and storing
the results in dedicated databases for use in experimentation-related tooling.</p>

<p>There are a number of other usages of the data outside of Riak, including
manually looking at events to check the behaviour of new features or analysing past
issues and outages.</p>

<h2 id="limitations-of-data-processing-outside-of-riak">Limitations of data processing outside of Riak</h2>

<p>Fetching data outside of the Riak clusters raises some issues that are
difficult to work around without changing the processing mechanism.</p>

<p>First of all, there is a clear network bandwidth limitation to the design: the
more consumer clusters there are, the more network bandwidth is used. Even with
large clusters (more than 30 nodes), it’s relatively easy to exhaust the
network capacity of all the nodes as more and more fetchers try to get data
from them.</p>

<p>Secondly, each consumer cluster tends to use only a small part of the events
flow. Even though consumers can filter out types, subtypes, and DCs, the
resulting events blobs still contain a large quantity of data that is
useless to the consumer. For storage efficiency, events need to be stored as
large compressed serialised blobs, so splitting them more by allowing more
subtyping is not possible <strong>[5]</strong>.</p>

<p>Additionally, statically splitting the events content is too rigid, since use of
the data changes over time and we do not want to be a bottleneck to change for
our downstream consumers. Part of an event from a given type that was critical
two years ago might be used for minor monitoring now. A subtype that was heavily
used for six months may now be rarely used because of a technical change in the
producers.</p>

<p>Finally, the amount of CPU time needed to uncompress, load, and filter the
big events blobs is not negligible. It usually takes around five seconds to fetch,
uncompress, and filter one second’s worth of events, which means that any
real-time data crunching requires multiple threads and likely multiple hosts -
usually a small cluster. It would be much simpler if Riak could provide a
real-time stream of data exactly tailored to the consumer’s needs.</p>

<h2 id="next-data-filtering-and-processing-inside-riak">Next: data filtering and processing inside Riak</h2>

<p>What if we could remove the CPU limitations by doing processing <em>on</em> the Riak
cluster itself? What if we could work around the network bandwidth issue
by generating sub-streams on the fly and in real-time <em>on</em> the Riak cluster?</p>

<p>This is exactly what we implemented, using simple concepts and leveraging the
ease of use and <em>hackability</em> of Riak. These concepts and implementations will
be described in the next sections.</p>

<h2 id="real-time-server-side-data-processing-the-theory">Real-time server-side data processing: the theory</h2>

<p>The reasoning is actually very simple. The final goal is to perform data
processing of the events blobs that are stored in Riak <strong>in real-time</strong>. Data
processing usually produces a very small result, and it appears to be a waste
of network bandwidth to fetch data out of Riak to perform data analysis on consumer clusters, as in this example:</p>

<p><center><img src="/images/riak_outside.png" alt="The Theory" /></center></p>

<p>This diagram is equivalent to:</p>

<p><center><img src="/images/equivalent_to.png" alt="This diagram is equivalent to" /></center></p>

<p>So instead of bringing the data to the processing code, let’s bring the code to
the data:</p>

<p><center><img src="/images/code_to_data.png" alt="instead of bringing the data to the processing code, let's bring the code to the data" /></center></p>

<p>This is a typical use case for MapReduce. We’re going to see how to use
MapReduce on our dataset in Riak, and also why it’s not a usable solution.</p>

<p>For the rest of this post, it’s important to have a name for <strong>all the events that are stored for a time period of exactly one second</strong>. Because we already store our events by the second (and call each second an “epoch”), we’ll refer to this unit of data as <strong>epoch-data</strong>.</p>

<h2 id="a-first-attempt-mapreduce">A first attempt: MapReduce</h2>

<p>MapReduce is a very well known (if somewhat outdated) way of bringing the code near the data and distributing data processing. There are excellent papers explaining this approach for further background study.</p>

<p>Riak has a very good MapReduce implementation. MapReduce jobs can be written in
Javascript or Erlang. We highly recommend using Erlang for better performance.</p>

<p>To perform events processing of an epoch-data on Riak, the MapReduce job would
be structured as follows. The metadata and data keys concepts are explained in
<a href="using-riak-as-events-storage-part2.html">part 2</a> of the blog series. Here are the MapReduce phases:</p>

<ul>
  <li>Given a list of epochs and DCs, the <strong>input</strong> is the list of metadata keys,
and as additional parameter, the processing code to apply to the data.</li>
  <li>A first <strong>Map</strong> phase reads the metadata values and returns a list of data keys.</li>
  <li>A second <strong>Map</strong> phase reads the data values, deserialises it, applies the
processing code and returns the list of results.</li>
  <li>A <strong>Reduce</strong> phase aggregates the results together</li>
</ul>

<p><center><img src="/images/mapreduce.png" alt="MapReduce" /></center></p>

<p>This works just fine. For one epoch-data, one data processing job is properly
mapped to the events; the data is deserialised and processed in around 0.1 seconds
(on our initial 12-node cluster). This is by itself an important result: it
takes less than one second to fully process one second’s worth of events. Riak
makes it possible to implement a <strong>real-time MapReduce processing system</strong> <strong>[6]</strong>.</p>

<p>Should we just use MapReduce and be done with it? Not really, because our
use case involves multiple consumers doing different data processing <strong>at the
same time</strong>. Let’s see why this is an issue.</p>

<h2 id="the-metrics">The metrics</h2>

<p>To be able to test the MapReduce solution, we need a use case and some metrics
to measure.</p>

<p>The use case is the following: every second, multiple consumers (say 20) need
the result of one of the data processing jobs (say 10 of them) applied to the previous second.</p>

<p>We’ll consider that an epoch-data is roughly <strong>70MB</strong>, data processing results
are around <strong>10KB</strong> each. Also, we’ll consider that the Riak cluster is a 30
nodes ring with 10 real CPUs available for data processing on each node.</p>

<p>The first metric we can measure is the <strong>external network bandwidth</strong> usage. This is
the first factor that encouraged us to move away from fetching the events out
of Riak to do external processing. External bandwidth usage is the bandwidth
used to transfer data between the cluster as a whole, and the outside world.</p>

<p>The second metric is the <strong>internal network bandwidth</strong> usage. This represents
the network used between the nodes, inside of the Riak cluster.</p>

<p>Another metric is the time (more precisely, the CPU-time) it takes to
<strong>deserialise</strong> the data. Because of the heavily compressed nature of our data,
decompressing and deserialising one epoch-data takes roughly <strong>5 sec</strong>.</p>

<p>The fourth metric is the CPU-time it takes to <strong>process</strong> the deserialised data,
analyse it, and produce a result. This is very fast (compared to
deserialisation); let’s assume <strong>0.01 sec.</strong> at most.</p>

<p>Note: we are not taking into account the impact of storing the data in the
cluster (remember that events blobs are being stored every second) because it impacts the system the same way in both external processing and MapReduce.</p>

<h2 id="metrics-when-doing-external-processing">Metrics when doing external processing</h2>

<p>When doing standard data processing as seen in the previous part of this blog
series, one <strong>epoch-data</strong> is fetched out from Riak, and deserialised and
processed outside of Riak.</p>

<h4 id="external-bandwidth-usage">External bandwidth usage</h4>

<p>The external bandwidth usage is high. For each query, the epoch-data is
transferred, so that’s 20 queries times 70MB/s = 1400 MB/s. Of course, this
number is properly spread across all the nodes, but that’s still roughly 1400
/ 30 = 47 MB/s. That, however, is just for the data processing. There is a small
overhead that comes from the clusterised nature of the system and from gossiping,
so let’s round that number to 50 MB/s per node, in external output network
bandwidth usage.</p>

<h4 id="internal-bandwidth-usage">Internal bandwidth usage</h4>

<p>The internal bandwidth usage is very high. Each time a key value is requested,
Riak will check its 3 replicas and return the value. So 3 x 20 x 70 MB/s = 4200
MB/s. Per node, that’s 4200 / 30 = 140 MB/s.</p>

<h4 id="deserialise-time">Deserialise time</h4>

<p>Deserialise time is zero: the data is deserialised outside of Riak.</p>

<h4 id="processing-time">Processing time</h4>

<p>Processing time is zero: the data is processed outside of Riak.</p>

<h2 id="metrics-when-using-mapreduce">Metrics when using MapReduce</h2>

<p>When using MapReduce, the data processing code is sent to Riak, included in an
ad hoc MapReduce job, and executed on the Riak cluster by sending the orders
to the nodes where the <strong>epoch-data</strong> related data chunks are stored.</p>

<h4 id="external-bandwidth-usage-1">External bandwidth usage</h4>

<p>When using MapReduce to perform data processing jobs, there is certainly a huge
gain in network bandwidth usage. For each query, only the results are
transferred, so 20 x 10KB/s = 200 KB/s.</p>

<h4 id="internal-bandwidth-usage-1">Internal bandwidth usage</h4>

<p>The internal usage is also very low: it’s only used to spread the MapReduce
jobs, transfer the results, and do bookkeeping. It’s hard to put a proper number
on it because of the way jobs and data are spread on the cluster, but overall
it’s using a couple of MB/s at most.</p>

<h4 id="deserialise-time-1">Deserialise time</h4>

<p>Deserialise time is high: for each query, the data is deserialised, so 20 x 5 =
100 CPU-seconds for the whole cluster. Each node has 10 CPUs available for
deserialisation, so the time needed to deserialise one second’s worth of data is
100/300 = 0.33 sec. We can easily see that this is an issue: one third of all
our CPU power is used for deserialising the same data again and again, once per
MapReduce job. It’s a big waste of CPU time.</p>

<h4 id="processing-time-1">Processing time</h4>

<p>Processing time is 20 x 0.01 = 0.2s for the whole cluster. This is really low
compared to the deserialise time.</p>

<h2 id="limitations-of-mapreduce">Limitations of MapReduce</h2>

<p>As we’ve seen, using MapReduce has its advantages: it’s a well-known standard
and allows us to create real-time processing jobs. However, it doesn’t scale:
because MapReduce jobs are isolated, they can’t share the deserialised data,
CPU time is wasted, and it’s not possible to run more than one or two
dozen real-time data processing jobs at the same time.</p>

<p>It’s possible to overcome this difficulty by caching the deserialised data in
memory, within the Erlang VM, on each node. CPU time would still be 3 times
higher than needed (because a map job can run on any of the 3 replicas that
contain the targeted data), but at least it wouldn’t be tied to the number of
parallel jobs.</p>

<p>Another issue is the fact that writing MapReduce jobs is not that easy,
especially because — in this case — it’s a prerequisite to know Erlang.</p>

<p>Last but not least, it’s possible to create very heavy MapReduce jobs, easily
consuming all the CPU time. This directly impacts the performance and
reliability of the cluster, and in extreme cases the cluster may be unable to
store incoming events at a sufficient pace. It’s not trivial to fully protect
the cluster against MapReduce misuse.</p>

<h1 id="a-better-solution-post-commit-hooks">A better solution: post-commit hooks</h1>

<p>To work around these limitations, we explored a different approach to enable real-time data processing on the
cluster: one that scales properly by deserialising data only once, allows us to cap
its CPU usage, and allows us to write the processing jobs in any language, while
still bringing the code to the data, removing most of the internal and external
network usage.</p>

<p>This technical solution is what is currently in production at
<a href="http://www.booking.com">Booking.com</a> on our Riak events storage clusters, and
it uses post-commit hooks and a companion service on the cluster nodes.</p>

<h2 id="strategy-and-features">Strategy and Features##</h2>

<p>The previous parts introduced the need for data processing of the events blobs
that are stored in Riak <strong>in real-time</strong>, and the strategy of bringing the code to the data:</p>

<p><center><img src="/images/code_to_data.png" alt="instead of bringing the data to the processing code, let's bring the code to the data" /></center></p>

<p>Using MapReduce for computing on-demand data processing worked fine but didn’t
scale to many users (see <a href="using-riak-as-events-storage-part3.html">part 3</a>).</p>

<p>Finding an alternative to MapReduce for server-side real-time data processing
requires listing the required features of the system and the compromises that
can be made:</p>

<h3 id="real-time-isolated-data-transformation">Real-time isolated data transformation</h3>

<p>As seen in the previous parts of this blog series, we need to be able to
perform transformations on the incoming events with as little delay as
possible. We don’t want any lag induced by large batch processing. Luckily,
these transformations are usually small and fast. Moreover, they are
<em>isolated</em>: the real-time processing may involve multiple types and subtypes of
events data, but should not depend on knowledge of previous events. Cross-epoch
data processing can be implemented by reusing the MapReduce concept: computing
a Map-like transformation on each events blob independently,
but leaving the Reduce phase up to the consumer.</p>

<h3 id="performance-and-scalability">Performance and scalability</h3>

<p>The data processing should have a very limited bandwidth usage and reasonable
CPU usage. However, we also need the CPU usage not to be affected by the number
of clients using the processed data. This is where the previous attempt using
MapReduce showed its limits. Of course, horizontal scalability has to be
ensured, to be able to scale with the Riak cluster.</p>

<p>One way of achieving this is to perform the data processing continuously, for
every datum that reaches Riak, upfront. That way, client requests only query
the <em>results</em> of the processing instead of triggering computation at
query time.</p>

<h3 id="no-back-processing">No back-processing</h3>

<p>The data processing will have to be performed on real-time data, but no
back-processing will be done. When a data processing implementation changes, it
will be effective on future events only. If old data is changed or added
(usually as a result of reprocessing), data processing will be applied,
but using the latest version of processing jobs. We don’t want to maintain any
history of data processing, nor any migration of processed data.</p>

<h3 id="only-fast-transformations">Only fast transformations</h3>

<p>To avoid putting too much pressure on the Riak cluster, we only allow data
transformations that produce a small result (to limit the storage and bandwidth
footprint) and that run quickly, with a strict timeout on execution time.
Back-pressure management is very important, and we have a specific strategy to
handle it (see <strong>“Back-pressure management strategy”</strong> below).</p>

<h2 id="the-solution-substreams">The solution: Substreams</h2>

<p><center><img src="/images/simplified_substream.png" alt="Substreams, a simplified overview" /></center></p>

<p>With these features and compromises listed, it is now possible to describe the
data processing layer that we ended up implementing at Booking.com.</p>

<p>This system is called <strong>Substreams</strong>. Every second, the list of keys of the
data that has just been stored is sent to a companion app - a home-made daemon -
running on every Riak node. This daemon fetches the data, decompresses it, runs a
list of data transformation jobs on it, and stores the results back into Riak,
using the same key name but a different namespace. Users can then fetch the
processed data.</p>

<p>A data transformation code is called a <em>substream</em> because most of the time the
data transformation is more about cherry-picking exactly the needed fields and
values out of the full stream, rather than performing complex operations.</p>
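<p>For instance, a trivial substream written in Perl could cherry-pick a single field; this transformation is purely illustrative:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># example substream: keep only the cpu field of each event
sub cpu_substream {
    my ($events) = @_;   # the decompressed ArrayRef of HashMaps
    return [ map  { { cpu =&gt; $_-&gt;{cpu} } }
             grep { exists $_-&gt;{cpu} } @$events ];
}
</code></pre></div></div>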

<p>The companion app is actually a simple pre-forking daemon with a Rest API. It’s
installed on all nodes of the cluster, with around 10 forks. The Rest API is
used to send it the list of keys and wait for the process completion. The
events data doesn’t transit via this API; the daemon fetches the key
values from Riak itself and stores the substreams (the results of data transformation)
back into Riak.</p>

<p>The main purpose of this system is to drastically reduce the size of data
transferred to the end user by enabling the cherry-picking of specific branches
or leaves of the events structures, and also to perform preliminary data
processing on the events. Usually, clients are fetching these substreams to
perform more complex and broader aggregations and computations (for instance
as a data source for Machine Learning).</p>

<p>Unlike MapReduce, this system has multiple benefits:</p>

<h3 id="data-decompressed-only-once">Data decompressed only once</h3>

<p><center><img src="/images/decompress_once.png" alt="Deserialisation and decompression is done once, for many data processing jobs" /></center></p>

<p>A given binary blob of events (at most 500 KB of compressed data) is handled by
one instance of the companion app, which will decompress it once, then run all
the data processing jobs on the decompressed data structure in RAM. This is a
big improvement over MapReduce: the most CPU-intensive task is
actually to <em>decompress</em> and deserialise the data, not to <em>transform</em> it. Here
we have the guarantee that data is decompressed only once in its lifetime.</p>

<h3 id="transformation-at-write-time-not-at-query-time">Transformation at write time, not at query time</h3>

<p><center><img src="/images/computation_once.png" alt="Data is created once and for all" /></center></p>

<p>Unlike MapReduce, once a transformation code is set up and enabled, it’ll be
computed for every epoch, <strong>even if nobody uses the result</strong>. However, the
computation will happen <strong>only once</strong>, even if multiple users request it later
on. Data transformation is already done when users want to fetch the result.
That way, the cluster is protected against simultaneous requests from a large
number of users. It’s also easier to predict the performance of substream
creation.</p>

<h3 id="hard-timeout---open-platform">Hard timeout - open platform</h3>

<p>Data decompression and transformation by the companion app is performed under a
global timeout that would kill the processing if it takes too long. It’s easy
to come up with a realistic timeout value given the average size of event
blobs, the number of companion instances, and the total number of nodes. The
hard timeout makes sure that data processing is not using too many resources,
ensuring that Riak KV works smoothly.</p>

<p>This mechanism allows the cluster to be an open platform: any developer in the
company can create a new substream transformation and quickly get it up and
running on the cluster on their own, without asking for permission. There is no
critical risk for the business, as substream runs are capped by a global
timeout. This approach is a good illustration of the flexible and agile
spirit in IT that we have at Booking.com.</p>

<h2 id="implementation-using-a-riak-commit-hook">Implementation using a Riak commit hook</h2>

<p><center><img src="/images/commit_hook.png" alt="detailed picture with the commit hook" /></center></p>

<p>In this diagram we can see where the Riak commit hook kicks in. We can also see
that when the companion requests data from the Riak service, there is a high
chance that the data is not on the current node and Riak has to get it from
other nodes. This is done transparently by Riak, but it consumes bandwidth. In
the next section we’ll see how to reduce this bandwidth usage and have full
data locality. But for now, let’s focus on the commit hook.</p>

<p><a href="http://docs.basho.com/riak/kv/latest/developing/usage/commit-hooks/">Commit hooks</a> are
a feature of Riak that allow the Riak cluster to execute a provided callback
just before or just after a value is written, using respectively pre-commit and
post-commit hooks. The commit hook is executed on the node that coordinated the
write.</p>

<p>We set up a post-commit hook on the metadata bucket (the <code class="language-plaintext highlighter-rouge">epochs</code> bucket). We
implemented the commit hook callback, which is executed each time a key is
stored to that metadata bucket. In
<a href="using-riak-as-events-storage-part2.html">part 2</a> of this series, we explained
that the metadata is stored in the following way:</p>
<ul>
  <li>the key is <code class="language-plaintext highlighter-rouge">&lt;epoch&gt;-&lt;datacenter_id&gt;</code>, for example: <code class="language-plaintext highlighter-rouge">1413813813-1</code></li>
  <li>the value is the list of data keys (for instance <code class="language-plaintext highlighter-rouge">1413813813:2:type3::0</code>)</li>
</ul>

<p>The post-commit hook callback is quite simple: for each <em>metadata</em> key, it gets
the value (the list of <em>data</em> keys), and sends it over HTTP in async
mode to the <em>companion app</em>. Proper timeouts are set so that the execution of
the callback is capped and can’t impact the Riak cluster performance.</p>

<h3 id="hook-implementation">Hook implementation</h3>

<p>First, let’s write the post commit hook code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:::erlang
metadata_stored_hook(RiakObject) -&gt;
    Key = riak_object:key(RiakObject),
    Bucket = riak_object:bucket(RiakObject),
    [ Epoch, DC ] = binary:split(Key, &lt;&lt;"-"&gt;&gt;),
    MetaData = riak_object:get_value(RiakObject),
    DataKeys = binary:split(MetaData, &lt;&lt;"|"&gt;&gt;, [ global ]),
    send_to_REST(Epoch, Hostname, DataKeys),
    ok.

send_to_REST(Epoch, Hostname, DataKeys) -&gt;
    Method = post,
    URL = "http://" ++ binary_to_list(Hostname)
       ++ ":5000?epoch=" ++ binary_to_list(Epoch),
    HTTPOptions = [ { timeout, 4000 } ],
    Options = [ { body_format, string },
    		         { sync, false },
            		  { receiver, fun(ReplyInfo) -&gt; ok end }
              ],
    Body = iolist_to_binary(mochijson2:encode( DataKeys )),
    httpc:request(Method,
                  {URL, [], "application/json", Body},
                  HTTPOptions, Options),
    ok.
</code></pre></div></div>

<p>These two Erlang functions (simplified here) are the main part of the hook. The
function <code class="language-plaintext highlighter-rouge">metadata_stored_hook</code> is
the entry point of the commit hook, called when a <em>metadata</em> key is
stored. It receives the key and value that were stored, via the <code class="language-plaintext highlighter-rouge">RiakObject</code>, and
uses its value to extract the list of data keys. This list is then sent to the
companion daemon over HTTP using <code class="language-plaintext highlighter-rouge">send_to_REST</code>.</p>

<p>The second step is to get the code compiled and Riak set up to use it
properly. This is described in the documentation about
<a href="http://docs.basho.com/riak/kv/latest/using/reference/custom-code/">custom code</a>.</p>

<h3 id="enabling-the-hook">Enabling the Hook</h3>

<p>Finally, the commit hook has to be added to a Riak bucket-type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>riak-admin bucket-type create metadata_with_post_commit \
'{"props":{"postcommit":["metadata_stored_hook"]}'
</code></pre></div></div>

<p>Then the type is activated:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>riak-admin bucket-type activate metadata_with_post_commit
</code></pre></div></div>

<p>Now, anything sent to Riak to be stored with a key within a bucket whose
bucket-type is <code class="language-plaintext highlighter-rouge">metadata_with_post_commit</code> will trigger our callback
<code class="language-plaintext highlighter-rouge">metadata_stored_hook</code>.</p>

<p>The hook is executed on the coordinator node, that is, the node that received
the write request from the client. It is not necessarily the node where the
metadata will be stored.</p>

<h3 id="the-companion-app">The companion app</h3>

<p>The companion app is a Rest service, running on all Riak nodes, listening on
port 5000, ready to receive a JSON blob: the list of data keys that
Riak has just stored. The daemon fetches these keys from Riak, decompresses
their values, deserialises them, and runs the data transformation code on them.
The results are then stored back in Riak.</p>

<p>There is little point in showing the code of this piece of software here, as it’s
trivial to write. We implemented it in Perl using
a <a href="https://en.wikipedia.org/wiki/PSGI">PSGI</a> preforking web server
(<a href="https://metacpan.org/pod/Starman">Starman</a>). Using a Perl-based web server
allowed us to also have the data transformation code in Perl, making
it easy for anyone in the IT department to write some of their own.</p>
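<p>To give an idea of its shape anyway, here is a minimal PSGI sketch; <code class="language-plaintext highlighter-rouge">fetch_blob</code>, <code class="language-plaintext highlighter-rouge">run_substreams</code>, and <code class="language-plaintext highlighter-rouge">store_results</code> are hypothetical helpers standing in for the real fetching, decompression, and storage plumbing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;
use Plack::Request;
use JSON::XS qw(decode_json);

my $app = sub {
    my $req   = Plack::Request-&gt;new(shift);
    my $epoch = $req-&gt;parameters-&gt;{epoch};
    my $keys  = decode_json($req-&gt;content);  # data keys sent by the hook
    for my $key (@$keys) {
        my $events  = fetch_blob($key);          # fetch + decompress
        my $results = run_substreams($events);   # run all transformations
        store_results($epoch, $key, $results);   # store back into Riak
    }
    return [ 200, [ 'Content-Type' =&gt; 'text/plain' ], [ 'ok' ] ];
};
</code></pre></div></div>

<p>Run under Starman (for example <code class="language-plaintext highlighter-rouge">starman --port 5000 --workers 10 app.psgi</code>), this gives the pre-forking behaviour described above.</p>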

<h3 id="optimising-intra-cluster-network-usage">Optimising intra-cluster network usage</h3>

<p>As we saw earlier, if the commit hook simply sends the request to the local
companion app on the same Riak node, additional bandwidth is consumed to
fetch data from other Riak nodes. As the full stream of events is quite big
(around 150 MB per second), this bandwidth usage is significant.</p>

<p>In an effort to optimise the network usage, we changed the post-commit
hook callback to group the keys by the node that is responsible for their
values. The keys are then sent to the companion apps running on the associated
nodes. That way, a companion app always receives event keys whose data
is on the node it runs on. Hence, fetching event values does not use
any network bandwidth: we have effectively implemented 100% data locality
when computing substreams.</p>

<p><center><img src="/images/substream_optimized.png" alt="Better implementation where metadata is sent to the Riak node that contains the data" /></center></p>

<p>This optimisation is implemented by using Riak’s internal API, which gives the list
of primary nodes responsible for storing the value of a given key. More
precisely, Riak’s Core application API provides the <code class="language-plaintext highlighter-rouge">preflist()</code> function (see
<a href="http://basho.github.io/riak_core/riak_core_apl.html">the API here</a>), which is
used to map the hashed key to its primary nodes.</p>

<p>The result is a dramatic reduction in network usage. Data processing is
optimised by taking place on one of the nodes that store the given data. Only
the metadata (a very small footprint) and the results (a tiny fraction
of the data) travel on the wire.</p>

<h2 id="back-pressure-management-strategy">Back-pressure management strategy</h2>

<p>For a fun and easy-to-read description of what back-pressure is and how to
react to it, you can read this great post by Fred Hebert (<a href="https://twitter.com/mononcqc/">@mononcqc</a>):
<a href="http://ferd.ca/queues-don-t-fix-overload.html">Queues Don’t Fix Overload</a>.</p>

<p>What if there are too many substreams, or one substream is buggy and performs
very costly computations (especially as we allow developers to easily write
their own substreams)? What if, all of a sudden, the events fullstream changes - one
type becomes huge, and a previously working substream now takes 10 times
longer to compute?</p>

<p>One way of dealing with this is to allow back-pressure: the substream creation
system informs the stream storage (Riak) that it cannot keep up, and that
it should reduce the pace at which it stores events. This is however not
practical here. Doing back-pressure that way would slow the storage
down and transmit the back-pressure upstream in the pipeline.
However, events can’t be “slowed down”. Applications send events at a given pace,
and if the pipeline can’t keep up, events are simply lost. So propagating
<em>back-pressure</em> upstream would actually lead to <em>load-shedding</em> of events.</p>

<p>The other typical alternative is applied here: doing <em>load-shedding</em> straight
away. If a substream computation is too costly in CPU time, wallclock time,
disk IO or space, the data processing is simply aborted. This protects the Riak
cluster from slowing down events storage - which after all, is its main and
critical job.</p>
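<p>A minimal sketch of this kind of guard, assuming a wallclock budget and a placeholder <code class="language-plaintext highlighter-rouge">compute_substream()</code> function (the real system also accounts for CPU time, disk IO and space):</p>

<pre><code class="language-perl">use strict;
use warnings;

# placeholder for the real substream computation
sub compute_substream { return "result" }

my $budget = 5;   # assumed wallclock budget, in seconds

my $result = eval {
    local $SIG{ALRM} = sub { die "substream budget exceeded\n" };
    alarm $budget;
    my $r = compute_substream();
    alarm 0;
    $r;
};
alarm 0;   # make sure the timer is disarmed if the eval died

if (!defined $result) {
    # load-shedding: abort and leave this substream missing for the epoch
    warn "shedding substream: $@";
}
</code></pre>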

<p>That leaves the substream consumers downstream with missing data. Substream
creation is <em>no longer guaranteed</em>. However, we used a trick to mitigate the
issue. We implemented a dedicated feature in the common consumer library code;
when a substream is unavailable, the <em>full stream</em> is fetched instead, and the
data transformation is performed on the <em>client side</em>.</p>
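<p>Sketched in Perl, with hypothetical helpers standing in for the real consumer library calls, the fallback looks like this:</p>

<pre><code class="language-perl">use strict;
use warnings;

# hypothetical helpers standing in for the consumer library
sub fetch_substream  { return undef }    # simulate a missing substream blob
sub fetch_fullstream { return [ { type =&gt; "WEB" }, { type =&gt; "CRON" } ] }
sub transform        { my ($events) = @_; [ grep { $_-&gt;{type} eq "WEB" } @$events ] }

my $data = fetch_substream("epoch-1455", "web-only");
if (!defined $data) {
    # substream unavailable: fall back to the full stream and apply
    # the transformation on the client side instead
    $data = transform(fetch_fullstream("epoch-1455"));
}
printf "got %d events\n", scalar @$data;
</code></pre>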

<p>It effectively pushes the overloading issue down to the consumers, who can
react appropriately depending on the guarantees they have to fulfill and on
their own properties.</p>

<ul>
  <li>Some consumers are part of a cluster of hosts that are capable of sustaining
the added bandwidth and CPU usage for some time.</li>
  <li>Some other systems are fine with delivering their results later on, so the consumers will simply be very slow and lag behind real-time.</li>
  <li>Finally, some less critical consumers will be rendered useless because they
cannot catch up with real-time.</li>
</ul>

<p>This multitude of ways of dealing with the absence of
substreams, concentrated at the end of the pipeline, is a very safe yet
flexible approach. In practice, it is not so rare for a substream result for
one epoch to be missing (one blob every couple of days), and such blips have no
impact on the consumers, allowing for a very conservative behaviour of the Riak
cluster regarding substreams: “when in doubt, stop processing substreams”.</p>

<h3 id="conclusion">Conclusion</h3>

<p>This data processing mechanism proved to be very reliable and well-suited for
our needs. The implementation required a surprisingly small amount of
code, leveraging features of Riak that proved to be flexible and easy to use.</p>

<h2 id="notes">Notes</h2>

<p><strong>[1]</strong> It is not strictly true that our events are schema-less. They obey the
structure that the producers found the most useful and natural. But there are so
many producers, each sending events with a different schema, that it is almost
equivalent to considering them schema-less. Our events can be
seen as structured, yet with so many schemas that they can’t be traced. There
is also complete technical freedom to change the structure of an event, if a
producer finds it useful.</p>

<p><strong>[2]</strong> After spending some time looking at and decoding Sereal blobs, the
human eye easily recognizes common data structures like small HashMaps, small
Arrays, small Integers and VarInts, and of course, Strings, since their content
is untouched. That makes Sereal an almost human-readable serialisation format,
especially after a hexdump.</p>

<p><strong>[3]</strong> This can be worked around by using secondary indexes (2i) if the
backend is eleveldb or Riak Search, to create additional indexes on keys, thus
enabling listing them in various ways.</p>

<p><strong>[4]</strong> Some optimisation has been done; the main action was to implement a
module that splits a Sereal blob without deserialising it, thus speeding up the
process greatly. This module can be found here:
<a href="https://metacpan.org/pod/Sereal::Splitter">Sereal::Splitter</a>. Most of the time
spent in splitting Sereal blobs is now spent in decompressing them. The next
optimisation step would be to use compression that decompresses faster than the
currently used gzip; for instance <a href="https://code.google.com/p/lz4/">LZ4_HC</a>.</p>
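<p>For illustration, usage looks roughly like the following. The constructor options shown are assumptions based on the module’s synopsis and may differ between versions, so treat this as a hedged sketch:</p>

<pre><code class="language-perl">use strict;
use warnings;
use Sereal::Encoder;
use Sereal::Splitter;

# build a Sereal blob containing an array of small records
my $blob = Sereal::Encoder-&gt;new-&gt;encode([ map { { n =&gt; $_ } } 1 .. 1000 ]);

# split it into smaller Sereal documents without deserialising it;
# option names and semantics are assumed from the module's synopsis
my $splitter = Sereal::Splitter-&gt;new({ input =&gt; $blob, chunk_size =&gt; 100 });
while (defined( my $chunk = $splitter-&gt;next_chunk() )) {
    # each $chunk is itself a complete, valid Sereal document
}
</code></pre>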

<p><strong>[5]</strong> At that point, the attentive reader may jump in the air and proclaim
“LevelDB and snappy compression!”. It is indeed possible to use LevelDB as the Riak
storage backend, which provides an option to use Snappy compression on the
blocks of data stored. However, this compression algorithm is not good enough
for our needs (using gzip reduced the size by a factor of almost 2). Also,
LevelDB (or at least the eleveldb implementation that is used in Riak) doesn’t
provide automatic expiration, which is critical to us, and versions below 2.x
had issues with reclaiming free space after key deletions.</p>

<p><strong>[6]</strong> Using MapReduce on Riak is usually somewhat discouraged because most of the
time it’s used in the wrong way, for instance to perform bulk fetches, bulk
inserts, or bucket traversals. The MapReduce implementation in Riak is very
powerful and efficient, but must be used properly. It works best when
used on a small number of keys, even if the size of the data processed is very
large. The fewer the keys, the less bookkeeping and the better the performance. In our
case, there are only a couple of hundred keys for one second worth of data (but
somewhat large values, around 400K), which is not a lot. Hence the great
MapReduce performance we’ve witnessed. YMMV.</p>
In practical application, it’s spread across multiple datacenters (DC), and includes an additional aggregation layer. Individual Events An event is a small schema-less [1] piece of data sent by our systems. That means that the data can be in any structure with any level of depth, as long as the top level is a HashTable. This is crucial to Booking.com - the goal is to give as much flexibility as possible for the sender, so that it’s easy to add or modify the structure, or the type and number of events. Events are also tagged in four different ways: the epoch at which they were created the DC where they originated the type of event the subtype. Some common types are: WEB events (events produced by code running under a web server) CRON events (output of cron jobs) LB events (load balancer events) The subtypes are there for further specification and can answer questions like: “Which one of web server systems are we talking about?”. Events are compressed Sereal blobs. Sereal is possibly the best schema-less serialisation format currently available. It was also written at Booking.com. An individual event is not very big, but a huge number of them are sent every second. We use UDP as transport because it provides a fast and simple way to send data. Despite some (very low) risk of data loss, it doesn’t impact senders sending events. We are experimenting with an UDP-to-TCP relay that will be local to the senders. Aggregated Events Literally every second, events from this particular second (called epoch), DC number, type, and subtype are merged together as an Array of events on the aggregation layer. At this point, it’s important to try and get the smallest size possible, so the events of a given epoch are re-serialized as a Sereal blob, using these options: compress =&gt; Sereal::Encoder::SRL_ZLIB, dedupe_strings =&gt; 1 dedupe_strings increases the serialisation time slightly. However it removes strings duplications which occur a lot since events are usually quite similar between them. We also add gzip compression. We also add the checksum of the blob as a postfix, to be able to ensure data integrity later on. The following diagram shows what an aggregated blob of events looks like for a given epoch, DC, type, and subtype. You can get more information about the Sereal encoding in the Sereal Specification. This is the general structure of an events blob:]]></summary></entry><entry><title type="html">PromCon2017 - Prometheus Conference 2017</title><link href="http://damien.krotkine.com/2017/08/22/PromCon.html" rel="alternate" type="text/html" title="PromCon2017 - Prometheus Conference 2017" /><published>2017-08-22T00:00:00+00:00</published><updated>2017-08-22T00:00:00+00:00</updated><id>http://damien.krotkine.com/2017/08/22/PromCon</id><content type="html" xml:base="http://damien.krotkine.com/2017/08/22/PromCon.html"><![CDATA[<p>This post is a list of things that I found interesting about Prometheus and its
ecosystem while attending <a href="https://promcon.io/2017-munich/">PromCon2017, the Prometheus Conference</a>, on the 17th and
18th of August 2017 in Munich (Germany). The notes are not split per talk; instead, I
have gathered information from all the talks and grouped it by topic, so
that it’s more organised and easier to read.</p>

<p>The conference was very nice, well organized, with a good mix of talks:
technical, less technical, war stories, and (remotely) related topics and
products. It was a medium-sized, single-track conference, the kind I
prefer, as one can grasp everything that happens and talk to everybody in the
hallways.</p>

<h1 id="best-practises---general">Best practises - general</h1>
<ul>
  <li>monitor all metrics from all services, and from all libraries</li>
  <li>when coding, instead of printing debug messages or sending to log, send
metrics!</li>
  <li>USE method for resources (queues, CPU, disks…): “Utilization, Saturation, Errors”</li>
  <li>RED method for endpoints and services: “Rate, Errors, Duration”</li>
</ul>

<p />

<h1 id="best-practises---metrics-and-label-naming">Best practises - metrics and label naming</h1>
<ul>
  <li>standardize metric names and labels early on before it’s chaos</li>
  <li>you need conventions</li>
  <li>add unit suffixes</li>
  <li>use base units (<code class="language-plaintext highlighter-rouge">seconds</code> instead of <code class="language-plaintext highlighter-rouge">milliseconds</code>, bytes instead of megabytes)</li>
  <li>add the <code class="language-plaintext highlighter-rouge">_total</code> suffix to counters, to differentiate them from gauges</li>
  <li>all the labels of a given metric should be summable or average-able</li>
  <li>be careful about label cardinality
    <ul>
      <li>it’s OK to ingest millions of series</li>
      <li>but one metric should have max 1000 or 10_000 series (labels combinations)</li>
    </ul>
  </li>
  <li>more best practices (<a href="https://prometheus.io/docs/practices/naming">website</a>)</li>
  <li>when querying counters, don’t do <code class="language-plaintext highlighter-rouge">rate(sum())</code>, because it masks counter resets; do <code class="language-plaintext highlighter-rouge">sum(rate())</code> instead</li>
</ul>

<p />

<h1 id="best-practises---alerting">Best practises - alerting</h1>
<ul>
  <li>use label and regex to do alert routing</li>
  <li>page only on user-visible symptoms, not causes</li>
  <li>“My Philosophy on Alerting” (see the SRE book or the <a href="https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit">google doc</a>)</li>
  <li>for all jobs: have these 2 basic alerts
    <ul>
      <li>alert on the prometheus job being up</li>
      <li>alert if the job is not even there</li>
    </ul>
  </li>
  <li>don’t use a FOR duration that is too short (under 4 or 5 min) or too long (the FOR state does not persist across restarts)</li>
  <li>keep labels when alerting (in both recording and alerting rules) to know where an alert comes from</li>
  <li>use filtering per job, as metrics are per job</li>
</ul>

<p />

<h1 id="remote-storage">Remote storage</h1>
<ul>
  <li>Prometheus provides an API to read/write data from/to a remote storage</li>
  <li>it also provides a gateway that acts as a proxy to other DBs like OpenTSDB or
InfluxDB</li>
  <li>in real life some people use OpenTSDB, others InfluxDB</li>
</ul>

<p />

<h1 id="influxdb">InfluxDB</h1>
<ul>
  <li>InfluxDB works fine as remote storage, both read and write</li>
  <li>InfluxDB will (once again) change a lot of things:
    <ul>
      <li>new data model similar to prometheus</li>
      <li>new QL called Influx Functional Query Language (IFQL)</li>
      <li>isolate QL, storage, computation, have them on different nodes</li>
      <li>generate a DAG for queries, and use an execution engine</li>
    </ul>
  </li>
</ul>

<p />

<h1 id="exporters">Exporters</h1>
<ul>
  <li>telegraf: having one telegraf instance per service is a SPOF, so be careful
and either have redundant telegraf instances or multiple telegrafs per
service.</li>
  <li>useful exporters: node exporters, blackbox (check urls), mtail</li>
  <li>don’t use one exporter to collect more than one service: that way, one thing going
crazy won’t pollute the collection of other metrics.</li>
  <li>graphite exporter is easy and useful, but it’s tricky to get labels exported
and transformed into graphite metric names in the right way</li>
</ul>

<p />

<h1 id="alerting-tools">Alerting tools</h1>
<ul>
  <li>alert manager deduplicates, so can be used from federated prometheus</li>
  <li>use jiralert (<a href="https://github.com/fabxc/jiralerts">github</a>): it’ll reopen an
existing ticket if an alarm is triggered again, which avoids creating too many tickets.</li>
  <li>use alertmanager2es (<a href="https://github.com/cloudflare/alertmanager2es">github</a>) to
index alerts in ES</li>
  <li>unsee (<a href="https://github.com/cloudflare/unsee">github</a>) is a dashboard for alerts</li>
</ul>

<p />

<h1 id="meta-alerting">Meta Alerting</h1>
<ul>
  <li>send one test alert to PagerDuty at the start of a shift, and make sure it’s received</li>
  <li>or use Grafana to graph Alertmanager metrics and to alert on them (basic alerts)</li>
</ul>

<p />

<h1 id="grafana">Grafana</h1>
<ul>
  <li>lots of improvements to the query box (auto-completion, syntax highlighting, etc.)</li>
  <li>improvements to graph display, with spread and upper-limit points</li>
  <li>emoji available for quick glimpse at a state</li>
  <li>table panels available</li>
  <li>heatmap panel: histogram over time</li>
  <li>diagram panel: awesome feature to display your pipeline with annotated metrics/colors</li>
  <li>dashboard version history is available</li>
  <li>dashboards in git:
    <ul>
      <li>currently possible via the grafana lib from cortex</li>
      <li>later on will be provided by grafana</li>
    </ul>
  </li>
  <li>dashboards folders available</li>
  <li>the Grafana data source supports templating, so you can quickly switch data
sources when one Prometheus instance is down; nice for fault tolerance</li>
</ul>

<p />

<h1 id="cortex">Cortex</h1>
<ul>
  <li>A multitenant, horizontally scalable Prometheus as a Service (<a href="https://github.com/weaveworks/cortex">github</a>)</li>
  <li>has multiple parts, ingesters, storage, service discovery, read/write query paths</li>
  <li>storage is implemented through an API so one could use a different storage</li>
</ul>

<p />

<h1 id="various">Various</h1>
<ul>
  <li>promgen: a prometheus configuration tool, worth checking
out (<a href="https://github.com/line/promgen">github</a>)</li>
  <li>load testing: <a href="http://gatling.io/">Gatling</a> (scriptable, generate scala code, Akka
based) vs <a href="http://jmeter.apache.org/">JMeter</a> (UI oriented, XML, threads)</li>
</ul>

<p />

<h1 id="prometheus-limitations">Prometheus limitations</h1>
<ul>
  <li>HA issues: when restarting/upgrading prometheus, gaps in data/graph can appear</li>
  <li>there is no horizontal scaling, only sharding + federation; this can be surprising at first</li>
  <li>the remote storage API and gateway can work around limitations of the local storage</li>
  <li>it’s hard to figure out where the data is located on disk</li>
  <li>retention issues: you can’t specify a disk size, only an expiration date; there
is no downsampling feature, which limits retention capacity</li>
</ul>

<p />

<h1 id="prometheus-v2">Prometheus v2</h1>
<ul>
  <li>will use the optimisations from Facebook’s Gorilla paper, and Damian Gryski’s
(<a href="https://github.com/dgryski">github</a>) implementation</li>
  <li>Prometheus 2 has a new storage engine: not a distributed storage, but a huge improvement in
RAM, CPU and disk usage</li>
  <li>libTSDB is the new storage lib for prometheus v2. It can be used outside of
prometheus: an embeddable TSDB Go library.</li>
  <li>alertmanager with HA through gossip protocol and CRDTs using the mesh library
by Weaveworks (<a href="https://github.com/weaveworks/mesh">github</a>). It’s AP.</li>
  <li>a beta is available now, stable enough for testing and some level of production use</li>
</ul>]]></content><author><name>Damien Krotkine</name></author><summary type="html"><![CDATA[This post is a list of things that I found interesting about Prometheus and its ecosystem while attending PromCon2017, the Prometheus Conference, the 17th and 18th august 2017 in Munich (Germany). Things are not split per talks; instead I have gathered information from all the talks and grouped them by topics, so that it’s more organised, and easier to read. The conference was very nice, well organized, and with a good mix of talks: technical, less technical, war zone experience, (remotely) related topics and products. It was a medium-sized one track conference, which are the ones I prefer, as one can grasp everything that happens and talk to everybody in the hallways. Best practises - general monitor all metrics from all services, and from all libraries when coding, instead of printing debug messages or sending to log, send metrics! USE method for resources (queues, CPU, disks…): “Utilization, Saturation, Errors” RED method for endpoints and services: “Rate, Errors, Duration” Best practises - metrics and label naming standardize metric names and labels early on before it’s chaos you need conventions add unit suffixes base units (seconds vs milliseconds, bytes instead of megabytes) add _total counter suffixes to differenciate between counters and gauge all the labels of a given metrics should be summable or average-able be carefull about label cardinality it’s OK to ingest millions of series but one metric should have max 1000 or 10_000 series (labels combinations) more best practises (website] when querying counters, don’t do rate(sum()), because it masks the resets. Do sum(rate()) Best practises - alerting use label and regex to do alert routing page only on user-visible symptoms, not causes “My Philosophy on Alerting” (see the SRE book or the google doc) for all jobs: have these 2 basic alerts alert on the prometheus job being up alert if the job is not even there don’t use a too short FOR duration (4 or 5 min) or too long (no persistence between restart) keep labels when alerting (both recording and alerting rules) to know where it comes from use filtering per job, as metrics are per jobs Remote storage prometheus provides an API to send/read/write data to a remote storage it also provides a gateway to act as a proxy to other DB like OpenTSDB or InfluxDB in real life some people use OpenTSDB, others influxDB InfluxDB influxDB works fine with remote storage, read/write influxDB will (once again) change a lot of things new data model similar to prometheus new QL called Influx Functional Query Language (IFQL) isolate QL, storage, computation, have them on different nodes generate a DAG for queries, and use an execution engine Exporters telegraf: having one telegraf instance per service is a SPOF, so be careful and either have redundant telegraf instances or multiple telegrafs per service. useful exporters: node exporters, blackbox (check urls), mtail don’t use one exporter to collect more than one service: one thing going crazy won’t pollute other metrics collections. graphite exporter is easy and useful but it’s tricky to get labels exported and transformed in graphite metric names in the right way Alerting tools alert manager deduplicates, so can be used from federated prometheus use jiralert (github), it’ll reopen existing ticket if an alarm is triggered, avoids overcreating tickets. 
use alertmanager2es (github) to index alerts in ES unsee (github) is a dashboard for alerts Meta Alerting send one alert on page duty at start of shift, make sure it’s received or use grafana for graphing alert manager and to alert about it (basic alerts) Grafana lots of improvements of the query box (auto complettion, syntax highlighting, etc) improvements of displaying graph, with spread, upper limit points emoji available for quick glimpse at a state table panels available heatmap panel: histogram over time diagram panel: awesome feature to display your pipeline with annotated metrics/colors dashboard version history is available dashboards in git: currently possible via the grafana lib from cortex later on will be provided by grafana dashboards folders available grafana data source supports templating so you can change quickly data sources when one prometheus instance is down, nice for fault tolerance Cortex A multitenant, horizontally scalable Prometheus as a Service (github) has multiple parts, ingesters, storage, service discovery, read/write query paths storage is implemented through an API so one could use a different storage Various promgen: a prometheus configuration tool, worth checking out (github) load testing: Gatling (scriptable, generate scala code, Akka based) vs JMeter (UI oriented, XML, threads) Prometheus limitations HA issues: when restarting/upgrading prometheus, gaps in data/graph can appear there is no horizontal scaling but sharding + federation; can be surprising at first remote storage API and gateway can work around limitation of the local storage hard time figuring out where the data is located on disk retention issues: you can’t specify a disk size, only expiration date; there is no downsampling feature, which limit retention capacity Prometheus v2 will use Facebook’s Gorilla paper optimization, and Damian Gryski (github) implementation prometheus 2 new storage, not a distributed storage but huge improvement in ram, cpu, disk usage libTSDB is the new storage lib for prometheus v2. It can be used outside of prometheus: an embeddable TSDB Go library. alertmanager with HA through gossip protocol and CRDTs using the mesh library by Weaveworks (github). It’s AP. beta avaioable now, stable enough for testing and some level of production use]]></summary></entry><entry><title type="html">Exception::Stringy - Modern exceptions for legacy code</title><link href="http://damien.krotkine.com/2015/02/10/exception-stringy.html" rel="alternate" type="text/html" title="Exception::Stringy - Modern exceptions for legacy code" /><published>2015-02-10T00:00:00+00:00</published><updated>2015-02-10T00:00:00+00:00</updated><id>http://damien.krotkine.com/2015/02/10/exception-stringy</id><content type="html" xml:base="http://damien.krotkine.com/2015/02/10/exception-stringy.html"><![CDATA[<h1 id="a-small-recap-of-perl-exceptions">A small recap of Perl exceptions</h1>

<h2 id="basic-usage-of-exceptions">Basic Usage Of Exceptions</h2>

<p>In Perl, exceptions are a well known and widely used mechanism. It is an old
feature that has been enhanced over time. At the basic level, exceptions are
triggered by the keyword <code class="language-plaintext highlighter-rouge">die</code>. Exceptions were initially used as a way to stop
the execution of a program in case of a fatal error. The all-too-famous line:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nb">open</span> <span class="k">my</span> <span class="nv">$fh</span><span class="p">,</span> <span class="nv">$file</span> <span class="ow">or</span> <span class="nb">die</span> <span class="p">"</span><span class="s2">failed to open '</span><span class="si">$file</span><span class="s2">', error: $!</span><span class="p">";</span></code></pre></figure>

<p>is a good example.</p>

<p>The original way to catch exceptions in Perl has a somewhat strange syntax,
it’s based on the <code class="language-plaintext highlighter-rouge">eval</code> keyword and the special variable <code class="language-plaintext highlighter-rouge">$@</code>:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nb">eval</span> <span class="p">{</span> <span class="nv">code_that_may_die</span><span class="p">();</span> <span class="mi">1</span><span class="p">;</span> <span class="p">}</span>
      <span class="ow">or</span> <span class="nv">say</span> <span class="p">"</span><span class="s2">exception has been caught: $@</span><span class="p">"</span></code></pre></figure>

<p>Nowadays, exceptions are usually thrown using <code class="language-plaintext highlighter-rouge">croak</code> and friends, from the
<a href="https://metacpan.org/pod/Carp"><code class="language-plaintext highlighter-rouge">Carp</code></a> module. It allows for a much better flexibility about where the exception
seems to originate, and how to display the stack trace, if any.</p>

<p>Catching exceptions with <code class="language-plaintext highlighter-rouge">eval</code> has also been superseded by try/catch mechanisms. The
most used one is the <a href="https://metacpan.org/pod/Try::Tiny"><code class="language-plaintext highlighter-rouge">Try::Tiny</code></a> module by Yuval Kogman and Jesse Luehrs,
and it goes like this:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nv">try</span> <span class="p">{</span>
      <span class="nv">croak</span> <span class="p">"</span><span class="s2">exception</span><span class="p">";</span>
    <span class="p">}</span> <span class="nv">catch</span> <span class="p">{</span>
      <span class="nb">warn</span> <span class="p">"</span><span class="s2">caught error: </span><span class="si">$_</span><span class="p">";</span>
    <span class="p">};</span></code></pre></figure>

<h2 id="throwing-objects">Throwing Objects</h2>

<p>The good thing about <code class="language-plaintext highlighter-rouge">die</code> (or <code class="language-plaintext highlighter-rouge">croak</code>) is that it’s very easy to use when
given a string. It’s perfect for scripts or moderately big
projects. However, for more features or extensive usage of exceptions,
it’s better to throw objects instead of strings, like this:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nb">open</span> <span class="nv">$file</span> <span class="ow">or</span> <span class="nb">die</span> <span class="nn">MyExceptions::IO::</span><span class="nv">File</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">(</span>
      <span class="s">filename</span> <span class="o">=&gt;</span> <span class="nv">$file</span><span class="p">,</span>
      <span class="s">error</span> <span class="o">=&gt;</span> <span class="vg">$!</span>
    <span class="p">);</span></code></pre></figure>

<p>For this snippet of code to work, the <code class="language-plaintext highlighter-rouge">MyExceptions::IO::File</code> class has to be
declared, along with its fields, and it should probably inherit from
<code class="language-plaintext highlighter-rouge">MyExceptions::IO</code>. So it requires some amount of work.</p>

<p>Some modules were created, a long time ago, to automate or help with
declaring exception classes. The most well known one is <a href="https://metacpan.org/pod/Exception::Class"><code class="language-plaintext highlighter-rouge">Exception::Class</code></a>, by
Dave Rolsky. For instance, here is how to declare two exceptions matching the
previous example:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nb">package</span> <span class="nv">MyExceptions</span><span class="p">;</span>

    <span class="k">use</span> <span class="nn">Exception::</span><span class="nv">Class</span> <span class="p">(</span>
        <span class="p">'</span><span class="s1">MyException::IO</span><span class="p">',</span>
        <span class="p">'</span><span class="s1">MyException::IO::File</span><span class="p">'</span> <span class="o">=&gt;</span> <span class="p">{</span>
            <span class="s">isa</span> <span class="o">=&gt;</span> <span class="p">'</span><span class="s1">MyException::IO</span><span class="p">',</span>    
            <span class="s">fields</span> <span class="o">=&gt;</span> <span class="p">[</span> <span class="p">'</span><span class="s1">filename</span><span class="p">'</span> <span class="p">],</span>
        <span class="p">},</span>
    <span class="p">);</span></code></pre></figure>

<p>And then, here is the code to make use of that and throw an exception when
failing to open a file:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="k">use</span> <span class="nv">MyExceptions</span><span class="p">;</span>

    <span class="nb">open</span> <span class="nv">$file</span> <span class="ow">or</span> <span class="nn">MyException::IO::</span><span class="nv">File</span><span class="o">-&gt;</span><span class="nv">throw</span><span class="p">(</span>
      <span class="s">filename</span> <span class="o">=&gt;</span> <span class="nv">$file</span><span class="p">,</span>
      <span class="s">error</span> <span class="o">=&gt;</span> <span class="vg">$!</span>
    <span class="p">);</span></code></pre></figure>

<h2 id="catching-objects-exceptions">Catching Objects Exceptions</h2>

<p>When using objects as exceptions, a set of features becomes available, thanks
to Object Oriented Programming. Inheritance, attributes and introspection are
some of them. However the most visible and used feature is about catching such
exceptions:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="k">use</span> <span class="nv">MyException</span><span class="p">;</span>

    <span class="nv">try</span> <span class="p">{</span>
        <span class="nb">open</span> <span class="nv">$file</span> <span class="ow">or</span> <span class="nn">MyException::IO::</span><span class="nv">File</span><span class="o">-&gt;</span><span class="nv">throw</span><span class="p">(</span>
          <span class="s">filename</span> <span class="o">=&gt;</span> <span class="nv">$file</span><span class="p">,</span>
          <span class="s">error</span> <span class="o">=&gt;</span> <span class="vg">$!</span>
        <span class="p">);</span>
    <span class="p">}</span> <span class="nv">catch</span> <span class="p">{</span>
        <span class="k">my</span> <span class="nv">$exception</span> <span class="o">=</span> <span class="vg">$_</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">isa</span><span class="p">(</span><span class="nn">MyException::</span><span class="nv">IO</span><span class="p">))</span> <span class="p">{</span>
            <span class="c1"># we know how to handle these</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">rethrow</span>
        <span class="p">}</span>
    <span class="p">};</span></code></pre></figure>

<p>As you can see, it’s easy to introspect an exception if it’s an object. In this
case we use the <code class="language-plaintext highlighter-rouge">isa</code> method to know whether the exception is, or inherits from, a
given class.</p>

<h1 id="when-things-go-wrong">When things go wrong</h1>

<h2 id="mixing-objects-and-string-exceptions">Mixing Objects And String Exceptions</h2>

<p>As we saw in the previous chapter, Perl allows exceptions to be whatever you
like (strings, objects; actually numbers, structures, etc. work as well).</p>

<p>Usually, when starting a project, the author decides whether to use simple
strings or objects with a class hierarchy. With very big projects, it is
sometimes not possible to impose one kind of exception. This may be due to
legacy code, a subproject that was included, or the wish to give people freedom
about what they want to use depending on the context.</p>

<p>In these cases, the code may have to handle exceptions of two kinds: strings
and objects. This can be done via this kind of code:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="k">use</span> <span class="nv">MyException</span><span class="p">;</span>
    <span class="k">use</span> <span class="nn">Scalar::</span><span class="nv">Util</span> <span class="sx">qw(blessed)</span><span class="p">;</span>

    <span class="nv">try</span> <span class="p">{</span>
        <span class="c1"># ... code that may die</span>
    <span class="p">}</span> <span class="nv">catch</span> <span class="p">{</span>
        <span class="k">my</span> <span class="nv">$exception</span> <span class="o">=</span> <span class="vg">$_</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="nv">blessed</span> <span class="nv">$exception</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1"># exception is an object</span>
            <span class="c1"># ...</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="c1"># exception is a normal string</span>
            <span class="c1"># ...</span>
        <span class="p">}</span>
    <span class="p">};</span></code></pre></figure>

<h2 id="mixed-exceptions-issues">Mixed Exceptions Issues</h2>

<p>The previous code snippet suffers from increased complexity due to the
additional checks and two different codepaths for handling potential errors.
This is clearly both suboptimal and error prone.</p>

<p>Another issue is that some code may consider that the exception it is catching
is of one type, whereas it could be of another type, especially because of the
action-at-a-distance nature of exceptions. Consider this function:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="k">sub </span><span class="nf">do_stuff</span> <span class="p">{</span>
        <span class="nv">try</span> <span class="p">{</span>
            <span class="c1"># ... code that can only throw objects exceptions</span>
        <span class="p">}</span> <span class="nv">catch</span> <span class="p">{</span>
            <span class="k">my</span> <span class="nv">$exception</span> <span class="o">=</span> <span class="vg">$_</span><span class="p">;</span>
            <span class="c1"># exception is always an object</span>
            <span class="k">if</span> <span class="p">(</span><span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">isa</span><span class="p">(</span><span class="o">...</span><span class="p">))</span> <span class="p">{</span>
                <span class="c1"># ...</span>
            <span class="p">}</span>
        <span class="p">};</span>
    <span class="p">}</span></code></pre></figure>

<p>This code assumes that the exception will always be an object. However,
let’s consider this: in the following example, the function <code class="language-plaintext highlighter-rouge">do_stuff</code> is called
(its original code is unchanged), but before doing so, the special signal
handler for <code class="language-plaintext highlighter-rouge">__DIE__</code> is changed.</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nv">$SIG</span><span class="p">{</span><span class="bp">__DIE__</span><span class="p">}</span> <span class="o">=</span> <span class="k">sub </span><span class="p">{</span> <span class="nb">die</span> <span class="p">"</span><span class="s2">FATAL: </span><span class="si">$_</span><span class="s2">[0]</span><span class="p">"</span> <span class="p">};</span>
    <span class="nv">do_stuff</span><span class="p">();</span></code></pre></figure>

<p>The handler installed on the first line of the example is called when an exception is raised, and
is executed instead of the exception being propagated. What this code does is
prepend <code class="language-plaintext highlighter-rouge">FATAL: </code> to the exception, then propagate it again by using <code class="language-plaintext highlighter-rouge">die</code>.</p>

<p>Alas, it is doing so in a naive way, by forcing the exception (in <code class="language-plaintext highlighter-rouge">$_[0]</code>) to be
evaluated as a string. So when the exception is re-thrown, it is now a
string! And boom, the <code class="language-plaintext highlighter-rouge">-&gt;isa</code> call in <code class="language-plaintext highlighter-rouge">do_stuff</code> won’t work.</p>
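<p>Here is a tiny self-contained demonstration of the problem, independent of any exception framework: interpolating a blessed object into a string silently discards its class.</p>

<pre><code class="language-perl">use strict;
use warnings;
use Scalar::Util qw(blessed);

my $e = bless { msg =&gt; 'oops' }, 'MyException';
print blessed($e) // 'not an object', "\n";         # "MyException"

# what a naive $SIG{__DIE__} handler does: force the exception
# through string interpolation
my $restrung = "FATAL: $e";
print blessed($restrung) // 'not an object', "\n";  # "not an object"
</code></pre>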

<p>The worst thing about this kind of issue is that it doesn’t appear at compile
time, nor at execution time, but at <em>exception time</em>, which is the worst
time…</p>

<h2 id="the-overloaded-stringification-route">The Overloaded Stringification Route</h2>

<p>So at that point, most developers will choose the following strategy: use
object exceptions for their code, but guard against receiving string exceptions,
and also make their object exceptions degrade nicely into strings, by using
stringification overloading. That means that if an object exception is managed
by a handler that treats it as a string, the exception will transform itself
into a string, and try to present some meaningful aspect of itself.</p>

<p>The issue is that exception handling is now back to square one: dealing
with strings, trying to parse them for meaningful information in order to
hopefully make a good decision.</p>

<p>What if, instead of taking an object exception and <strong>downgrading it to a
string</strong> while keeping as much information as possible, one <strong>started from a
string, and enhanced it until it looks like an object</strong>, without it being one? That
way we would have the best of both worlds.</p>

<p>This is what <code class="language-plaintext highlighter-rouge">Exception::Stringy</code> tries to achieve.</p>

<h1 id="exceptionsstringy-from-scratch">Exceptions::Stringy from scratch</h1>

<h2 id="the-needed-features">The Needed Features</h2>

<p>A perfect exception would have these features:</p>

<ul>
  <li>be a string, containing an error message</li>
  <li>be an instance of a class</li>
  <li>be able to inherit from an other exception</li>
  <li>have simple fields with values</li>
  <li>provide a way to introspect itself</li>
</ul>

<p>This set of features is not big, but it’s probably enough for a start. Let’s
see how we can implement them in a simple string. We’re going to use an
exception with these attributes:</p>

<ul>
  <li>an error message ‘permission denied’</li>
  <li>from the class MyException::IO</li>
  <li>which inherits from MyException</li>
  <li>with a field <code class="language-plaintext highlighter-rouge">filename</code></li>
</ul>

<h2 id="class-instance">Class Instance</h2>

<p>Let’s start with the first feature: <em>be a string, containing an error message</em>.
That’s easy:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="p">"</span><span class="s2">permission denied</span><span class="p">"</span></code></pre></figure>

<p>Being an instance of a class is usually done in Perl by using <code class="language-plaintext highlighter-rouge">bless</code> on a
ScalarRef. But we don’t want the exception to be an object. What <code class="language-plaintext highlighter-rouge">bless</code> does,
and what it ultimately means to “be an instance of a class”, is just attach
a <em>label</em> to a value. Let’s do that, by having a label as a substring in our
exception. For instance:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="p">"</span><span class="s2">[MyException::IO]permission denied</span><span class="p">"</span></code></pre></figure>

<p>We could add a magic mark or have a more complex label syntax to make sure it’s
a legit label.</p>

<p>To know what the class of a given exception is, we just need to extract the
label, for instance with a regex.</p>
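<p>A minimal sketch of that extraction, assuming the simple label format shown above:</p>

<pre><code class="language-perl">use strict;
use warnings;

my $exception = "[MyException::IO]permission denied";

# extract the class label at the start of the string
if ($exception =~ /^\[([^\]|]+)\]/) {
    my $class = $1;                  # "MyException::IO"
    print "class: $class\n";
}
</code></pre>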

<h2 id="class-inheritance">Class Inheritance</h2>

<p>Inheritance is easy: it only requires that standard Perl classes be created to
match the exception labels, and then Perl’s usual inheritance can be used.</p>

<p>So, following our example, we need two packages, <code class="language-plaintext highlighter-rouge">MyException</code> and
<code class="language-plaintext highlighter-rouge">MyException::IO</code>, and <code class="language-plaintext highlighter-rouge">@MyException::IO::ISA</code> set to <code class="language-plaintext highlighter-rouge">['MyException']</code>. This can
be made automatically at exception declaration time.</p>
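<p>What that declaration-time work amounts to is nothing more than the following minimal sketch:</p>

<pre><code class="language-perl">use strict;
use warnings;

# plain packages, with @ISA wiring up the inheritance
{ package MyException; }
{ package MyException::IO; our @ISA = ('MyException'); }

# Perl's usual inheritance now works on the class names
# extracted from the labels
print MyException::IO-&gt;isa('MyException') ? "inherits\n" : "does not\n";
</code></pre>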

<h2 id="fields">Fields</h2>

<p>For simplicity, <code class="language-plaintext highlighter-rouge">Exception::Stringy</code> only handles simple field values, basically
strings and numbers. To put fields into our string, we need to be
able to identify them, for instance with a separator between the different
fields, and another one between a field name and its value. Like this:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="p">"</span><span class="s2">[MyException::IO|filename:/tmp/file|]permission denied</span><span class="p">"</span></code></pre></figure>

<p>And if a field name or value contains one of the separators (<code class="language-plaintext highlighter-rouge">[</code>, <code class="language-plaintext highlighter-rouge">|</code>, <code class="language-plaintext highlighter-rouge">:</code>
or <code class="language-plaintext highlighter-rouge">]</code>), we encode it in base64 and mark it as such.</p>

<p>So, by now, we have fleshed out a string with useful data, which can be properly
parsed and described. Let’s now add methods on top of this data.</p>

<h2 id="introspection-and-modification">Introspection and Modification</h2>

<p>Given an exception, it is mandatory to be able to introspect and modify it, namely to:</p>

<ul>
  <li>get/set the class of the exception,</li>
  <li>get/set the fields values attached to the exception,</li>
  <li>get/set the exception message,</li>
  <li>other useful methods.</li>
</ul>

<p>In an ideal world, we would want methods that we can call on our exception
instances. However, because our exceptions are regular strings, we can’t do
this:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">message</span><span class="p">();</span></code></pre></figure>

<p>Usually, this way of calling a method (the arrow notation) works only if
$exception is a blessed reference (that is, an object). However, there are
other cases in which we can use the arrow notation, and have it work in a
similar way. One of them is this one:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$message</span><span class="p">();</span></code></pre></figure>

<p>If $message is a variable that contains a reference to a subroutine, then the previous line translates into:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nv">$message</span><span class="o">-&gt;</span><span class="p">(</span><span class="nv">$exception</span><span class="p">);</span></code></pre></figure>

<p>And it works whatever the type of <code class="language-plaintext highlighter-rouge">$exception</code>, in our case a string. So
<code class="language-plaintext highlighter-rouge">Exception::Stringy</code> creates the needed subroutine references for the user and
allows such arrow notation, which is very similar to OO method invocation. I
call these <strong>pseudo methods</strong>.</p>
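<p>Here is a self-contained example of the trick, with a hand-rolled <code class="language-plaintext highlighter-rouge">$xmessage</code> pseudo method; the real module generates these for you:</p>

<pre><code class="language-perl">use strict;
use warnings;

# a pseudo method is just a code reference held in a variable; the
# arrow notation then works on a plain string
my $xmessage = sub {
    my ($self) = @_;
    (my $msg = $self) =~ s/^\[[^\]]*\]//;   # strip the label part
    return $msg;
};

my $exception = "[MyException::IO|filename:/tmp/file|]permission denied";
print $exception-&gt;$xmessage(), "\n";        # prints "permission denied"
</code></pre>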

<p>However, to avoid clobbering an existing variable, the pseudo methods need to
have names that are unlikely to be already used in the target package. It’s
even better if there is an option to add a prefix to these pseudo-methods.
Once again, <code class="language-plaintext highlighter-rouge">Exception::Stringy</code> provides these features. The default pseudo
method names are:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xthrow</span><span class="p">()</span>
    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xrethrow</span><span class="p">()</span>
    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xraise</span><span class="p">()</span>
    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xclass</span><span class="p">()</span>
    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xisa</span><span class="p">()</span>
    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xfields</span><span class="p">()</span>
    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xfield</span><span class="p">()</span>
    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xmessage</span><span class="p">()</span>
    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xerror</span><span class="p">()</span></code></pre></figure>

<h2 id="launching-the-exception">Launching The Exception</h2>

<p>Finally, once we have created the exception, let’s throw it. The first thing to
do is to implement a <code class="language-plaintext highlighter-rouge">throw</code> or <code class="language-plaintext highlighter-rouge">raise</code> class method on all the exception classes,
so that we can do:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nv">MyException</span><span class="o">-&gt;</span><span class="nv">throw</span><span class="p">(</span><span class="o">...</span><span class="p">)</span></code></pre></figure>

<p>That will basically craft a new exception string, with all the properties
encoded in it, and call <code class="language-plaintext highlighter-rouge">die</code> or <code class="language-plaintext highlighter-rouge">croak</code> on it.</p>
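<p>A minimal sketch of what such a generated <code class="language-plaintext highlighter-rouge">throw</code> boils down to; the real module also base64-encodes unsafe field values, as described earlier:</p>

<pre><code class="language-perl">use strict;
use warnings;
use Carp ();

package MyException;

sub throw {
    my ($class, $message, %fields) = @_;
    my $fields = join "", map { "$_:$fields{$_}|" } sort keys %fields;
    Carp::croak("[$class|$fields]$message");
}

package main;

eval { MyException-&gt;throw("permission denied", filename =&gt; "/tmp/file") };
print $@;   # "[MyException|filename:/tmp/file|]permission denied at ..."
</code></pre>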

<p>We can also use a <strong>pseudo method</strong> on an existing exception to (re)throw it:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="nv">$exception</span><span class="o">-&gt;</span><span class="nv">$xthrow</span><span class="p">();</span></code></pre></figure>

<h1 id="exceptionsstringy-example">Exceptions::Stringy example</h1>

<h2 id="synopsis">Synopsis</h2>

<p>Below is the synopsis of the <code class="language-plaintext highlighter-rouge">Exception::Stringy</code> module. It’s basically a
wrap-up of what has been explained above. The exception declaration syntax is heavily
inspired by <code class="language-plaintext highlighter-rouge">Exception::Class</code>.</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="k">use</span> <span class="nn">Exception::</span><span class="nv">Stringy</span><span class="p">;</span>
    <span class="nn">Exception::</span><span class="nv">Stringy</span><span class="o">-&gt;</span><span class="nv">declare_exceptions</span><span class="p">(</span>
        <span class="p">'</span><span class="s1">MyException</span><span class="p">',</span>
     
        <span class="p">'</span><span class="s1">YetAnotherException</span><span class="p">'</span> <span class="o">=&gt;</span> <span class="p">{</span>
            <span class="s">isa</span>         <span class="o">=&gt;</span> <span class="p">'</span><span class="s1">AnotherException</span><span class="p">',</span>
        <span class="p">},</span>
     
        <span class="p">'</span><span class="s1">ExceptionWithFields</span><span class="p">'</span> <span class="o">=&gt;</span> <span class="p">{</span>
            <span class="s">isa</span>    <span class="o">=&gt;</span> <span class="p">'</span><span class="s1">YetAnotherException</span><span class="p">',</span>
            <span class="s">fields</span> <span class="o">=&gt;</span> <span class="p">[</span> <span class="p">'</span><span class="s1">grandiosity</span><span class="p">',</span> <span class="p">'</span><span class="s1">quixotic</span><span class="p">'</span> <span class="p">],</span>
            <span class="s">throw_alias</span>  <span class="o">=&gt;</span> <span class="p">'</span><span class="s1">throw_fields</span><span class="p">',</span>
        <span class="p">},</span>
    <span class="p">);</span>
    
    <span class="c1">### with Try::Tiny</span>
    
    <span class="k">use</span> <span class="nn">Try::</span><span class="nv">Tiny</span><span class="p">;</span>
     
    <span class="nv">try</span> <span class="p">{</span>
        <span class="c1"># throw an exception</span>
        <span class="nv">MyException</span><span class="o">-&gt;</span><span class="nv">throw</span><span class="p">('</span><span class="s1">I feel funny.</span><span class="p">');</span>
    
        <span class="c1"># or use an alias</span>
        <span class="nv">throw_fields</span> <span class="p">'</span><span class="s1">Error message</span><span class="p">',</span> <span class="s">grandiosity</span> <span class="o">=&gt;</span> <span class="mi">1</span><span class="p">;</span>
  
        <span class="c1"># or with fields</span>
        <span class="nv">ExceptionWithFields</span><span class="o">-&gt;</span><span class="nv">throw</span><span class="p">('</span><span class="s1">I feel funny.</span><span class="p">',</span>
                                   <span class="s">quixotic</span> <span class="o">=&gt;</span> <span class="mi">1</span><span class="p">,</span>
                                   <span class="s">grandiosity</span> <span class="o">=&gt;</span> <span class="mi">2</span><span class="p">);</span>
  
        <span class="c1"># you can build exception step by step</span>
        <span class="k">my</span> <span class="nv">$e</span> <span class="o">=</span> <span class="nv">ExceptionWithFields</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">("</span><span class="s2">The error message</span><span class="p">");</span>
        <span class="nv">$e</span><span class="o">-&gt;</span><span class="nv">$xfield</span><span class="p">(</span><span class="s">quixotic</span> <span class="o">=&gt;</span> <span class="p">"</span><span class="s2">some_value</span><span class="p">");</span>
        <span class="nv">$e</span><span class="o">-&gt;</span><span class="nv">$xthrow</span><span class="p">();</span>
    
    <span class="p">}</span>
    <span class="nv">catch</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span> <span class="vg">$_</span><span class="o">-&gt;</span><span class="nv">$xisa</span><span class="p">('</span><span class="s1">Exception::Stringy</span><span class="p">')</span> <span class="p">)</span> <span class="p">{</span>
            <span class="nb">warn</span> <span class="vg">$_</span><span class="o">-&gt;</span><span class="nv">$xerror</span><span class="p">,</span> <span class="p">"</span><span class="se">\n</span><span class="p">";</span>
        <span class="p">}</span>
    
        <span class="k">if</span> <span class="p">(</span> <span class="vg">$_</span><span class="o">-&gt;</span><span class="nv">$xisa</span><span class="p">('</span><span class="s1">ExceptionWithFields</span><span class="p">')</span> <span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span> <span class="vg">$_</span><span class="o">-&gt;</span><span class="nv">$xfield</span><span class="p">('</span><span class="s1">quixotic</span><span class="p">')</span> <span class="p">)</span> <span class="p">{</span>
                <span class="nv">handle_quixotic_exception</span><span class="p">();</span>
            <span class="p">}</span>
            <span class="k">else</span> <span class="p">{</span>
                <span class="nv">handle_non_quixotic_exception</span><span class="p">();</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="k">else</span> <span class="p">{</span>
            <span class="vg">$_</span><span class="o">-&gt;</span><span class="nv">$xrethrow</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">};</span>
   
    <span class="c1">### without Try::Tiny</span>
   
    <span class="nb">eval</span> <span class="p">{</span>
        <span class="c1"># ...</span>
        <span class="nv">MyException</span><span class="o">-&gt;</span><span class="nv">throw</span><span class="p">('</span><span class="s1">I feel funny.</span><span class="p">');</span>
        <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="ow">or</span> <span class="k">do</span> <span class="p">{</span>
        <span class="k">my</span> <span class="nv">$e</span> <span class="o">=</span> <span class="vg">$@</span><span class="p">;</span>
        <span class="c1"># .. same as above with $e instead of $_</span>
    <span class="p">}</span></code></pre></figure>

<h1 id="conclusion">Conclusion</h1>

<p>This was an in-depth look at why and how to build up a resilient and
non-intrusive exception mechanism. I hope to have demonstrated one aspect of
Perl’s extreme flexibility.</p>

<p>Feel free to use <code class="language-plaintext highlighter-rouge">Exception::Stringy</code>; it has been used in production code for
some time now. Feedback welcome!</p>

<h1 id="links">Links</h1>]]></content><author><name>Damien Krotkine</name></author><summary type="html"><![CDATA[A small recap of Perl exceptions Basic Usage Of Exceptions In Perl, exceptions are a well known and widely used mechanism. It is an old feature that has been enhanced over time. At the basic level, exceptions are triggered by the keyword die. Exceptions were initially used as a way to stop the execution of a program in case of a fatal error. The too famous line: open my $fh, $file or die "failed to open '$file', error: $!"; is a good example. The original way to catch exceptions in Perl has a somewhat strange syntax, it’s based on the eval keyword and the special variable $@: eval { code_that_may_die(); 1; } or say "exception has been caught: $@" Nowadays, exceptions are usually thrown using croak and friends, from the Carp module. It allows for a much better flexibility about where the exception seems to originate, and how to display the stack trace, if any. Catching exceptions with eval is also supersed by try/catch mechanisms. The most used one is via the [Try::Tiny][try-tiny] module by Yuval Kogman and Jesse Luehrs, and goes like this: try { croak "exception"; } catch { warn "caught error: $_"; }; Throwing Objects The good thing about die (or croak), is that it’s very easy to use, when given a string. It’s perfect for using in scripts, or moderately big projects. However, for more features, or extensive usage of exceptions, then it’s better to throw objects instead of strings, like this: open $file or die MyExceptions::IO::File-&gt;new( filename =&gt; $file, error =&gt; $! ); For this snippet of code to work, the MyExceptions::IO::File class has to be declared, its fields as well, and the it should probably inherit from MyExceptions::IO. So it requires some amount of work. Some modules have been created - long time ago - to automate or help with declaring exception classes. The most well known one is [Exception::Class][exception-class], by Dave Rolsky. For instance, here is how to declare two exceptions matching with previous example: package MyExceptions; use Exception::Class ( 'MyException::IO', 'MyException::IO::File' =&gt; { isa =&gt; 'MyException::IO', fields =&gt; [ 'filename' ], }, ); And then, here is the code to make use of that and throw an exception when failing to open a file: use MyExceptions; open $file or MyException::IO::File-&gt;throw( filename =&gt; $file, error =&gt; $! );]]></summary></entry><entry><title type="html">Perl Redis Mailing List</title><link href="http://damien.krotkine.com/2013/11/12/perl-redis-mailing-list.html" rel="alternate" type="text/html" title="Perl Redis Mailing List" /><published>2013-11-12T00:00:00+00:00</published><updated>2013-11-12T00:00:00+00:00</updated><id>http://damien.krotkine.com/2013/11/12/perl-redis-mailing-list</id><content type="html" xml:base="http://damien.krotkine.com/2013/11/12/perl-redis-mailing-list.html"><![CDATA[<p>This is going to be a short post. I’m the new maintainer of
<a href="https://metacpan.org/module/Redis">Redis.pm</a>, the most used Redis Perl client.
Pedro Melo was the previous maintainer, but due to Real Life, he is unable to
continue. I’d like to thank him for all his efforts so far in maintaining and
improving this module. I hope I’ll be able to achieve the same level of
quality. Pedro will actually stay around for a while, watching over my
shoulder and giving his opinions about stuff, to allow for a smooth transition.</p>

<p>We’ve used this maintainership change to improve the tools we use around this
project. So we’ve moved the code to GitHub’s
<a href="https://github.com/PerlRedis">PerlRedis</a> organization (notice the cool logo),
and I’ve performed quite a few code cleanups and housekeeping.</p>

<p>But the big thing is the creation of a mailing list that aims to gather
forces around Redis support in Perl. It’s
<a href="http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/redis">located here</a>, and
hosted by the good folks at <a href="http://shadow.cat/">ShadowCat Systems Limited</a>
(thank you, guys).</p>

<p>It is not limited to the Redis.pm module: any Perl-related Redis topic is
welcome, including other Perl clients. So if you have any interest in Perl and
Redis, feel free to <a href="http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/redis">subscribe</a>!</p>

<p>dams.</p>]]></content><author><name>Damien Krotkine</name></author><summary type="html"><![CDATA[This is going to be a short post. I’m the new maintainer of Redis.pm, the most used Redis Perl client. Pedro Melo was the previous maintainer, but due to Real Life, he is unable to continue. I’d like to thank him for all his efforts so far in maintaining and improving this module. I hope I’ll be able to achieve the same level of quality. Pedro will actually stay around for a while, watching over my shoulder and giving his opinions about stuff, to allow for a smooth transition. We’ve used this maintainership change to improve the tools we use around this project. So we’ve moved the code to the github’s PerlRedis organization (notice the cool logo), and I’ve performed quite a few code cleanups and housekeeping. But the big thing is the creation of a mailing list, that aims at gathering forces around Redis support in Perl. It’s located here, and hosted by the good folks at ShadowCat Systems Limited (thank you guys). It is not limited to the Redis.pm module: any Perl related Redis topic is welcome, including other Perl clients. So if you have any interest in Perl and Redis, feel free to subscribe ! dams.]]></summary></entry><entry><title type="html">p5-mop: a gentle introduction</title><link href="http://damien.krotkine.com/2013/09/17/p5-mop.html" rel="alternate" type="text/html" title="p5-mop: a gentle introduction" /><published>2013-09-17T00:00:00+00:00</published><updated>2013-09-17T00:00:00+00:00</updated><id>http://damien.krotkine.com/2013/09/17/p5-mop</id><content type="html" xml:base="http://damien.krotkine.com/2013/09/17/p5-mop.html"><![CDATA[<p>I guess that you’ve heard about p5-mop by now.</p>

<p>If not, in a nutshell, p5-mop is an attempt to implement a subset of
<a href="http://moose.iinteractive.com/">Moose</a> into the core of Perl. Moose provides a
Meta Object Protocol (MOP) to Perl. So does p5-mop, however p5-mop is
implemented in a way that it can be properly included in the Perl core.</p>

<p>Keep in mind that p5-mop’s goal is to implement a <em>subset</em> of Moose. As
Stevan Little says:</p>

<blockquote>
  <p>We are not putting “Moose into the core” because Moose is too opinionated,
instead we want to put a minimal and less opinionated MOP in the core that is
capable of hosting something like Moose</p>
</blockquote>

<p>As far as I understood, after a first attempt that failed, Stevan Little
restarted the p5-mop implementation: the so-called p5-mop-redux
<a href="https://github.com/stevan/p5-mop-redux">github project</a>, using
<a href="https://metacpan.org/module/Devel::Declare">Devel::Declare</a>, ( then
<a href="https://metacpan.org/module/Parse::Keyword">Parse::Keyword</a> ), so that he can
experiment and release often, while keeping the implementation core-friendly.
Once he’s happy with the features and all, he’ll make sure it finds its way to
the core. A small team (Stevan Little, <a href="http://tozt.net/">Jesse Luehrs</a>, and
other contributors) is actively developing p5-mop, and Stevan is regularly
<a href="http://blogs.perl.org/users/stevan_little/">blogging about it</a>.</p>

<p>If you want more details about the failed first attempt, there is a bunch of
backlog and mailing list archives to read. However, here is how Stevan would
summarize it:</p>

<blockquote>
  <p>We started the first prototype, not remembering the old adage of “write the
first one to throw away” and I got sentimentally attached to my choice of
design approach. This new approach (p5-mop-redux) was purposefully built with
a firm commitment to keeping it as simple as possible, therefore making it
simpler to hack on.
Also, instead of making the MOP I always wanted, I approached it as building the
mop people actually needed (one that worked well with existing perl classes,
etc)</p>
</blockquote>

<p>A few months ago, when p5-mop-redux was announced, I tried to give it a go. And
you should too, because it’s easy!</p>

<h2 id="why-is-it-important-to-try-it-out-">Why is it important to try it out ?</h2>

<p>It’s important to have at least a vague idea of where p5-mop stands, because
this project is shaping a big part of Perl’s future. IMHO, there will be a
<em>before</em> and an <em>after</em> having a MOP in core. And it is being designed and
tested <em>right</em> <em>now</em>. So as Perl users, it’s our chance to have a look at it,
test it, and give our feedback.</p>

<p>Do we like the syntax? Is it powerful enough? What do we prefer more or less
than in Moose? In a few months, things will be decided and it’ll only be a
matter of time and implementation details. Now is the most exciting time to
participate in the project. You don’t need to hack on it, just try it out, and
provide feedback.</p>

<h2 id="install-it">Install it</h2>

<p>p5-mop is very easy to install:</p>

<ol>
  <li>you need at least perl 5.16. If you need to upgrade, consider <a href="http://perlbrew.pl/">perlbrew</a> or <a href="https://github.com/tokuhirom/plenv">plenv</a></li>
  <li>if you don’t have cpanm, get it with <code class="language-plaintext highlighter-rouge">curl -L http://cpanmin.us | perl - App::cpanminus</code></li>
  <li>next, install twigils, with <code class="language-plaintext highlighter-rouge">cpanm --dev twigils</code></li>
  <li>if you’re using GitHub, just fork the <code class="language-plaintext highlighter-rouge">p5-mop-redux</code> project. Otherwise you can get a zip <a href="https://github.com/stevan/p5-mop-redux/archive/master.zip">here</a>.</li>
  <li>using cpanm, execute <code class="language-plaintext highlighter-rouge">cpanm .</code> from within the p5-mop-redux directory.</li>
</ol>
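
<p>Put together, and assuming a typical Unix shell, the whole installation boils
down to a few commands (the clone URL is the project repository mentioned above):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -L http://cpanmin.us | perl - App::cpanminus
cpanm --dev twigils
git clone https://github.com/stevan/p5-mop-redux.git
cd p5-mop-redux
cpanm .
</code></pre></div></div>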

<h2 id="a-first-example">A first example</h2>

<p>Here is the classical point example from the <a href="https://github.com/stevan/p5-mop-redux/blob/master/t/001-examples/001-point.t">p5-mop test suite</a>.</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="k">use</span> <span class="nv">mop</span><span class="p">;</span>

<span class="nv">class</span> <span class="nv">Point</span> <span class="p">{</span>
<span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="nv">x</span> <span class="nv">is</span> <span class="nv">ro</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="sr">y is ro = 0;

    method set_x ($x) {
        $!x = $x;
    }

    method set_y ($y) {
        $!y = $y;
    }

    method clear {
        ($!x, $!y) = (0, 0);
    }

    method pack {
        +{ x =&gt; $self-&gt;x, y =&gt; $self-&gt;y }
    }
}

# ... subclass it ...

class Poin</span><span class="nv">t3D</span> <span class="nv">extends</span> <span class="nv">Point</span> <span class="p">{</span>
    <span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="nv">z</span> <span class="nv">is</span> <span class="nv">ro</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="nv">method</span> <span class="nv">set_z</span> <span class="p">(</span><span class="nv">$z</span><span class="p">)</span> <span class="p">{</span>
        <span class="err">$</span><span class="o">!</span><span class="nv">z</span> <span class="o">=</span> <span class="nv">$z</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="nv">method</span> <span class="nb">pack</span> <span class="p">{</span>
        <span class="k">my</span> <span class="nv">$data</span> <span class="o">=</span> <span class="nv">$self</span><span class="o">-&gt;</span><span class="k">next</span><span class="o">::</span><span class="nv">method</span><span class="p">;</span>
        <span class="nv">$data</span><span class="o">-&gt;</span><span class="p">{</span><span class="nv">z</span><span class="p">}</span> <span class="o">=</span> <span class="err">$</span><span class="o">!</span><span class="nv">z</span><span class="p">;</span>
        <span class="nv">$data</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>This example shows how straightforward it is to declare a class and a
subclass. The syntax is very friendly and similar to what you may find in other
languages.</p>

<p><code class="language-plaintext highlighter-rouge">class</code> declares a class, with proper scoping. <code class="language-plaintext highlighter-rouge">method</code> is used to define
methods, so no <code class="language-plaintext highlighter-rouge">sub</code> there. The distinction is important, because in <em>methods</em>,
additional variables will be automatically available:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">$self</code> will be available directly, no need to shift <code class="language-plaintext highlighter-rouge">@_</code>.</li>
  <li>attribute variables will be available automatically, so you can access
attributes from within the class without having to go through their
<code class="language-plaintext highlighter-rouge">$self-&gt;accessors</code>.</li>
</ul>

<p>Functions defined with the regular <code class="language-plaintext highlighter-rouge">sub</code> keyword won’t have all these features,
and that’s a good thing: it makes the difference between <em>function</em> and <em>method</em>
more explicit.</p>

<p><code class="language-plaintext highlighter-rouge">has</code>declares an attribute. Attribute names are <em>twigils</em>. Borrowed from Perl6,
and implemented by Florian Ragwitz in its
<a href="https://github.com/rafl/twigils/">twigils project on github</a>, twigils are
useful to differenciate standard variables from attributes variables:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">class</span> <span class="nv">Foo</span> <span class="p">{</span>
    <span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="nv">stuff</span><span class="p">;</span>
	<span class="nv">method</span> <span class="nv">do_stuff</span> <span class="p">(</span><span class="nv">$stuff</span><span class="p">)</span> <span class="p">{</span>
        <span class="err">$</span><span class="o">!</span><span class="nv">stuff</span> <span class="o">=</span> <span class="nv">$stuff</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>As you can see, it’s important to be able to differentiate <code class="language-plaintext highlighter-rouge">$stuff</code> (the
variable) and <code class="language-plaintext highlighter-rouge">$!stuff</code> (the attribute).</p>

<p>The added benefit of attribute variables is that one doesn’t need to constantly
use <code class="language-plaintext highlighter-rouge">$self</code>. A good proportion of the code in a class is about attributes.
Being able to use them directly is great.</p>

<p>Other notes worth mentioning (see the sketch after this list):</p>

<ul>
  <li>Classes can have a <code class="language-plaintext highlighter-rouge">BUILD</code> method, as with Moose.</li>
  <li>A class can inherit from another one by <code class="language-plaintext highlighter-rouge">extends</code>-ing it.</li>
  <li>In an inheriting class, calling the parent method is not done using <code class="language-plaintext highlighter-rouge">SUPER</code>,
but <code class="language-plaintext highlighter-rouge">$self-&gt;next::method</code>.</li>
  <li>A class <code class="language-plaintext highlighter-rouge">Foo</code> declared in the package <code class="language-plaintext highlighter-rouge">Bar</code> will be defined as <code class="language-plaintext highlighter-rouge">Bar::Foo</code>.</li>
</ul>
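
<p>Here is a minimal sketch tying these notes together (the <code class="language-plaintext highlighter-rouge">Shape</code> and
<code class="language-plaintext highlighter-rouge">Circle</code> classes are made up for the occasion, not taken from the p5-mop test suite):</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">use mop;

class Shape {
    has $!name is ro = 'shape';

    # BUILD runs right after construction, as in Moose
    method BUILD { warn "built a " . $!name }

    method describe { "a " . $!name }
}

class Circle extends Shape {
    has $!radius is ro = 1;

    method describe {
        # call the parent implementation via next::method, not SUPER
        $self-&gt;next::method . " of radius " . $!radius;
    }
}</code></pre></figure>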

<h2 id="attributes-traits">Attributes traits</h2>

<p>When declaring an attribute name, you can add <code class="language-plaintext highlighter-rouge">is</code>, which is followed by a list of
<em>traits</em>:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="nv">bar</span> <span class="nv">is</span> <span class="nv">ro</span><span class="p">,</span> <span class="nv">lazy</span> <span class="o">=</span> <span class="vg">$_</span><span class="o">-&gt;</span><span class="nv">foo</span> <span class="o">+</span> <span class="mi">2</span><span class="p">;</span></code></pre></figure>

<ul>
  <li><code class="language-plaintext highlighter-rouge">ro</code> / <code class="language-plaintext highlighter-rouge">rw</code> means it’s read-only / read-write</li>
  <li><code class="language-plaintext highlighter-rouge">lazy</code> means the attribute constructor we’ll be called only when the
attribute is being used</li>
  <li><code class="language-plaintext highlighter-rouge">weak_ref</code> enables an attribute to be a weak reference</li>
</ul>
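
<p>For instance, combining traits (the <code class="language-plaintext highlighter-rouge">Node</code> class is hypothetical, just to illustrate):</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">use mop;

class Node {
    has $!payload is rw;
    # weak_ref avoids a reference cycle between parent and child nodes
    has $!parent  is rw, weak_ref;
    # lazy: the builder runs only on first access
    has $!label   is ro, lazy = $_-&gt;default_label;

    method default_label { "node" }
}</code></pre></figure>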

<h2 id="default-value--builder">Default value / builder</h2>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="nv">foo</span> <span class="o">=</span> <span class="p">'</span><span class="s1">default value</span><span class="p">';</span></code></pre></figure>

<p>which is actually</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="nv">foo</span> <span class="o">=</span> <span class="k">sub </span><span class="p">{</span> <span class="p">'</span><span class="s1">default value</span><span class="p">'</span> <span class="p">};</span></code></pre></figure>

<p>So, there is no default value, only builders. That means that <code class="language-plaintext highlighter-rouge">has $!foo = {};</code>
will work as expected (creating a new hashref each time).</p>

<p>You can reference the current instance in the attribute builder by using <code class="language-plaintext highlighter-rouge">$_</code>:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="nv">foo</span> <span class="o">=</span> <span class="vg">$_</span><span class="o">-&gt;</span><span class="nv">_init_foo</span><span class="p">;</span></code></pre></figure>

<p>There have been some comments about using <code class="language-plaintext highlighter-rouge">=</code> instead of <code class="language-plaintext highlighter-rouge">//</code>, <code class="language-plaintext highlighter-rouge">||</code> or
<code class="language-plaintext highlighter-rouge">default</code>, but this syntax is used in a lot of other programming languages, and
is considered somewhat the default (ha-ha) syntax. I think it’s worth sticking with
<code class="language-plaintext highlighter-rouge">=</code> for an easier learning curve for newcomers.</p>

<h2 id="class-and-method-traits">Class and method traits</h2>

<p><strong>UPDATE</strong>: Similarly to attributes, classes and methods can have traits. I
won’t go into detail, to keep this post short, but you can make a class abstract,
change the default behaviour of all its attributes, make it work better with
Moose, etc. Currently there is only one method trait, to allow for operator
overloading, but additional ones may appear shortly.</p>

<h2 id="methods-parameters">Methods parameters</h2>

<p>When calling a method, the parameters are, as usual, available in <code class="language-plaintext highlighter-rouge">@_</code>. However,
you can also declare these parameters in the method signature:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">method</span> <span class="nv">foo</span> <span class="p">(</span><span class="nv">$arg1</span><span class="p">,</span> <span class="nv">$arg2</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="p">{</span>
    <span class="nv">say</span> <span class="nv">$arg1</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>Using <code class="language-plaintext highlighter-rouge">=</code>, you can specify a default value. In the method body, these parameters
will be available directly.</p>
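
<p>A quick usage sketch (assuming the <code class="language-plaintext highlighter-rouge">foo</code> method above is declared in some class <code class="language-plaintext highlighter-rouge">Foo</code>):</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">my $obj = Foo-&gt;new;
$obj-&gt;foo(1);       # prints 1; $arg2 defaults to 10
$obj-&gt;foo(1, 20);   # prints 1; $arg2 is 20</code></pre></figure>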

<h2 id="types">Types</h2>

<p>Types are not yet core to p5-mop, and the team is questioning this idea.
The consensus is currently that types should not be part of the mop, to keep it
simple and flexible. You ought to be able to choose what type system you want
to use. I’m particularly happy about this decision. Perl is so versatile and
flexible that it can be used (and bent to be used) in numerous environments and
configurations. Sometimes you need robustness and high-level powerful features,
and it’s great to use a powerful typing system like Moose’s. Sometimes
(most of the time?) Type::Tiny (before that I used Params::Validate) is good
enough and gives you faster processing. Sometimes you don’t want any type
checking.</p>
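
<p>For example, nothing prevents you from doing your own checking inside a method
body with Type::Tiny (a sketch under that assumption; the <code class="language-plaintext highlighter-rouge">Counter</code> class is made up for the occasion):</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">use mop;
use Types::Standard qw(Int);

class Counter {
    has $!count is ro = 0;

    method add ($n) {
        # a Type::Tiny check, entirely opt-in, not part of the mop
        Int-&gt;check($n) or die "add() expects an integer";
        $!count += $n;
    }
}</code></pre></figure>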

<h2 id="clearer--predicate">Clearer / predicate</h2>

<p>Because the attribute builder is already implemented using <code class="language-plaintext highlighter-rouge">=</code>, what about
clearer and predicate?</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="c1"># clearer</span>
<span class="nv">method</span> <span class="nv">clear_foo</span> <span class="p">{</span> <span class="nb">undef</span> <span class="err">$</span><span class="o">!</span><span class="nv">foo</span> <span class="p">}</span>

<span class="c1"># predicate</span>
<span class="nv">method</span> <span class="nv">has_foo</span> <span class="p">{</span> <span class="nb">defined</span> <span class="err">$</span><span class="o">!</span><span class="nv">foo</span> <span class="p">}</span></code></pre></figure>

<p>That was pretty easy, right? Predicates and clearers were introduced in
Moose because writing them ourselves would require accessing the underlying
HashRef behind an instance (e.g. <code class="language-plaintext highlighter-rouge">sub predicate { exists $self-&gt;{$attr_name}}</code>),
and that’s very bad. To work around that, Moose has to generate that kind of
code and provide a way to enable it or not. Hence the <code class="language-plaintext highlighter-rouge">predicate</code> and <code class="language-plaintext highlighter-rouge">clearer</code>
options. So you see that they exist mostly because of the implementation.</p>

<p>In p5-mop, thanks to the twigils, there is no issue in writing predicates and
clearers ourselves.</p>

<p>But I hear you say: “Wait, these are neither clearers nor predicates! They are not testing the
existence of the attributes, but their defined-ness!” You’re right, but read on!</p>

<h2 id="undef-versus-not-set">Undef versus not set</h2>

<p>In Moose there is a difference between an attribute being unset and an
attribute being undef. In p5-mop, there is no such distinction. Technically, it
would be very difficult to implement that distinction, because an attribute
variable is declared even if the attribute has not been set yet.</p>

<p>In Moose, because objects are stored in blessed hashes, an attribute can either
be:</p>

<ul>
  <li>non-existent in the underlying hash</li>
  <li>present in the hash but with an undef value</li>
  <li>present and defined but false</li>
  <li>present, defined and true</li>
</ul>

<p>That’s probably too many cases… Getting rid of one of them looks sane to me.</p>

<p>After all, we got this “not set” state only because objects are stored in
HashRefs, so it looks like it’s an implementation detail that made its way into
becoming a concept on its own, which is rarely a good thing.</p>

<p>Plus, in standard Perl programming, if an optional argument is not passed to a
function, it’s not “non-existent”, it’s <em>undef</em>:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">foo</span><span class="p">();</span>
<span class="k">sub </span><span class="nf">foo</span> <span class="p">{</span>
    <span class="k">my</span> <span class="p">(</span><span class="nv">$arg</span><span class="p">)</span> <span class="o">=</span> <span class="nv">@_</span><span class="p">;</span> <span class="c1"># $arg is undef</span>
<span class="p">}</span></code></pre></figure>

<p>So it makes sense to have a similar behavior in p5-mop - that is, an attribute
that is not set is undef.</p>

<h2 id="roles">Roles</h2>

<p>The role definition syntax is quite similar to a class definition.</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">role</span> <span class="nv">Bar</span> <span class="p">{</span>
    <span class="nv">has</span> <span class="err">$</span><span class="o">!</span><span class="nv">additional_attr</span> <span class="o">=</span> <span class="mi">42</span><span class="p">;</span>
    <span class="nv">method</span> <span class="nv">more_feature</span> <span class="p">{</span> <span class="nv">say</span> <span class="err">$</span><span class="o">!</span><span class="nv">additional_attr</span> <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>They are consumed right in the class declaration line:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">class</span> <span class="nv">Foo</span> <span class="nv">with</span> <span class="nv">Bar</span><span class="p">,</span> <span class="nv">Baz</span> <span class="p">{</span>
    <span class="c1"># ...</span>
<span class="p">}</span></code></pre></figure>

<h2 id="meta">Meta</h2>

<p>Going meta is not difficult either, but I won’t describe it here, as I just want
to showcase the default OO programming syntax. On that note, it looks like Stevan
will make classes immutable by default, unless specified otherwise. I think that this is
a good idea (how many times have you written <code class="language-plaintext highlighter-rouge">make_immutable</code>?).</p>

<h1 id="my-hopefully-constructive-remarks">My (hopefully constructive) remarks</h1>

<h2 id="method-modifiers">Method Modifiers</h2>

<p>Method modifiers are not yet implemented, but they won’t be difficult to
implement. Actually, here is an example of how to implement method modifiers
using p5-mop’s very own meta. It implements <code class="language-plaintext highlighter-rouge">around</code>:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="k">sub </span><span class="nf">modifier</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="vg">$_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-&gt;</span><span class="nv">isa</span><span class="p">('</span><span class="s1">mop::method</span><span class="p">'))</span> <span class="p">{</span>
        <span class="k">my</span> <span class="nv">$method</span> <span class="o">=</span> <span class="nb">shift</span><span class="p">;</span>
        <span class="k">my</span> <span class="nv">$type</span>   <span class="o">=</span> <span class="nb">shift</span><span class="p">;</span>
        <span class="k">my</span> <span class="nv">$meta</span>   <span class="o">=</span> <span class="nv">$method</span><span class="o">-&gt;</span><span class="nv">associated_meta</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="nv">$meta</span><span class="o">-&gt;</span><span class="nv">isa</span><span class="p">('</span><span class="s1">mop::role</span><span class="p">'))</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span> <span class="nv">$type</span> <span class="ow">eq</span> <span class="p">'</span><span class="s1">around</span><span class="p">'</span> <span class="p">)</span> <span class="p">{</span>
                <span class="nv">$meta</span><span class="o">-&gt;</span><span class="nb">bind</span><span class="p">('</span><span class="s1">after:COMPOSE</span><span class="p">'</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span>
                    <span class="k">my</span> <span class="p">(</span><span class="nv">$self</span><span class="p">,</span> <span class="nv">$other</span><span class="p">)</span> <span class="o">=</span> <span class="nv">@_</span><span class="p">;</span>
                    <span class="k">if</span> <span class="p">(</span><span class="nv">$other</span><span class="o">-&gt;</span><span class="nv">has_method</span><span class="p">(</span> <span class="nv">$method</span><span class="o">-&gt;</span><span class="nv">name</span> <span class="p">))</span> <span class="p">{</span>
                        <span class="k">my</span> <span class="nv">$old_method</span> <span class="o">=</span> <span class="nv">$other</span><span class="o">-&gt;</span><span class="nv">remove_method</span><span class="p">(</span> <span class="nv">$method</span><span class="o">-&gt;</span><span class="nv">name</span> <span class="p">);</span>
                        <span class="nv">$other</span><span class="o">-&gt;</span><span class="nv">add_method</span><span class="p">(</span>
                            <span class="nv">$other</span><span class="o">-&gt;</span><span class="nv">method_class</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">(</span>
                                <span class="s">name</span> <span class="o">=&gt;</span> <span class="nv">$method</span><span class="o">-&gt;</span><span class="nv">name</span><span class="p">,</span>
                                <span class="s">body</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span>
                                    <span class="nb">local</span> <span class="nv">$</span><span class="p">{</span><span class="o">^</span><span class="nv">NEXT</span><span class="p">}</span> <span class="o">=</span> <span class="nv">$old_method</span><span class="o">-&gt;</span><span class="nv">body</span><span class="p">;</span>
                                    <span class="k">my</span> <span class="nv">$self</span> <span class="o">=</span> <span class="nb">shift</span><span class="p">;</span>
                                    <span class="nv">$method</span><span class="o">-&gt;</span><span class="nv">execute</span><span class="p">(</span> <span class="nv">$self</span><span class="p">,</span> <span class="p">[</span> <span class="err">@</span><span class="nv">_</span> <span class="p">]</span> <span class="p">);</span>
                                <span class="p">}</span>
                            <span class="p">)</span>
                        <span class="p">);</span>
                    <span class="p">}</span>
                <span class="p">});</span>
            <span class="p">}</span> <span class="k">elsif</span> <span class="p">(</span> <span class="nv">$type</span> <span class="ow">eq</span> <span class="p">'</span><span class="s1">before</span><span class="p">'</span> <span class="p">)</span> <span class="p">{</span>
                <span class="nb">die</span> <span class="p">"</span><span class="s2">before not yet supported</span><span class="p">";</span>
            <span class="p">}</span> <span class="k">elsif</span> <span class="p">(</span> <span class="nv">$type</span> <span class="ow">eq</span> <span class="p">'</span><span class="s1">after</span><span class="p">'</span> <span class="p">)</span> <span class="p">{</span>
                <span class="nb">die</span> <span class="p">"</span><span class="s2">after not yet supported</span><span class="p">";</span>
            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                <span class="nb">die</span> <span class="p">"</span><span class="s2">I have no idea what to do with </span><span class="si">$type</span><span class="p">";</span>
            <span class="p">}</span>
        <span class="p">}</span> <span class="k">elsif</span> <span class="p">(</span><span class="nv">$meta</span><span class="o">-&gt;</span><span class="nv">isa</span><span class="p">('</span><span class="s1">mop::class</span><span class="p">'))</span> <span class="p">{</span>
            <span class="nb">die</span> <span class="p">"</span><span class="s2">modifiers on classes not yet supported</span><span class="p">";</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>It is supposed to be used like this:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">method</span> <span class="nv">my_method</span> <span class="nv">is</span> <span class="nv">modifier</span><span class="p">('</span><span class="s1">around</span><span class="p">')</span> <span class="p">(</span><span class="nv">$arg</span><span class="p">)</span> <span class="p">{</span>
    <span class="nv">$arg</span> <span class="o">%</span> <span class="mi">2</span> <span class="ow">and</span> <span class="k">return</span> <span class="nv">$self</span><span class="o">-&gt;</span><span class="nv">$</span><span class="p">{</span><span class="o">^</span><span class="nv">NEXT</span><span class="p">}(</span><span class="err">@</span><span class="nv">_</span><span class="p">);</span>
    <span class="nb">die</span> <span class="p">"</span><span class="s2">foo</span><span class="p">";</span>
<span class="p">}</span></code></pre></figure>

<p>I would like to see method modifiers in p5-mop. As per Stevan Little and Jesse
Luehrs, it may be that these won’t be part of the mop, but in a plugin or
extension. I’m not too sure about that; to me, method modifiers are really tied
to OO programming. I prefer using <code class="language-plaintext highlighter-rouge">around</code> to fiddling with
<code class="language-plaintext highlighter-rouge">$self-&gt;next::method</code> or <code class="language-plaintext highlighter-rouge">${^NEXT}</code>.</p>

<p>Here are some syntax proposals I’ve gathered on IRC and blog comments regarding
what could be method modifiers in p5-mop:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>around foo { }
method foo is around { ... }
method foo is modifier(around) { ... }
</code></pre></div></div>

<h2 id="next-and-self">${^NEXT} and ${^SELF}</h2>

<p>These special variables point to the current instance (useful when
you’re not in a method; otherwise <code class="language-plaintext highlighter-rouge">$self</code> is available) and to the next method
in the calling chain. It’s OK to have such variables, but their horrible names
make them difficult to remember and use.</p>

<p>Can’t we have yet another type of twigil for these variables, so that we could
write <code class="language-plaintext highlighter-rouge">$^NEXT</code> and <code class="language-plaintext highlighter-rouge">$^SELF</code>?</p>

<h2 id="twigils-for-public--private-attributes">Twigils for public / private attributes</h2>

<p>Just an idea, but maybe we could have <code class="language-plaintext highlighter-rouge">$!public_attribute</code> and
<code class="language-plaintext highlighter-rouge">$.private_attribute</code>. Or is it the other way around?</p>

<h2 id="why-is--we-already-have-has-">why <code class="language-plaintext highlighter-rouge">is</code> ? we already have <code class="language-plaintext highlighter-rouge">has</code> !</h2>

<p>This one thing is bothering me a lot: why do we have to use the word <code class="language-plaintext highlighter-rouge">is</code> when
declaring an attribute? The attribute declaration starts with <code class="language-plaintext highlighter-rouge">has</code>. So with
<code class="language-plaintext highlighter-rouge">is</code>, that makes it <em>two</em> <em>verbs</em> for <em>one</em> line of code. For me it’s too much.
In Moo* modules, the <code class="language-plaintext highlighter-rouge">is</code> was just one property. We had <code class="language-plaintext highlighter-rouge">default</code>, <code class="language-plaintext highlighter-rouge">lazy</code>,
etc. Now, <code class="language-plaintext highlighter-rouge">is</code> is just a separator between the name and the ‘traits’. In my
opinion, it’s redundant.</p>

<p>Also, among the new keywords added by p5-mop, we have only <em>nouns</em> (<code class="language-plaintext highlighter-rouge">class</code>,
<code class="language-plaintext highlighter-rouge">role</code>, <code class="language-plaintext highlighter-rouge">method</code>). Only one <em>verb</em>, <code class="language-plaintext highlighter-rouge">has</code>.</p>

<p>The counter-argument is that this syntax is inspired by Perl6:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">class</span> <span class="nv">Point</span> <span class="nv">is</span> <span class="nv">rw</span> <span class="p">{</span>
    <span class="nv">has</span> <span class="p">(</span><span class="err">$</span><span class="o">.</span><span class="nv">x</span><span class="p">,</span> <span class="err">$</span><span class="o">.</span><span class="nv">y</span><span class="p">);</span>
    <span class="nv">method</span> <span class="nv">gist</span> <span class="p">{</span> <span class="p">"</span><span class="s2">Point a x=$.x y=$.y</span><span class="p">"</span> <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>So, “blame Larry”? :)</p>

<h2 id="exporter">Exporter</h2>

<p>p5-mop doesn’t use @ISA for inheritance, so <code class="language-plaintext highlighter-rouge">use base 'Exporter'</code> won’t work.
You have to do <code class="language-plaintext highlighter-rouge">use Exporter 'import'</code>. That is somewhat disturbing, because
most Perl developers (I think) implement function and variable exporting by inheriting from
Exporter (that’s also what the documentation of Exporter recommends).</p>
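
<p>For the record, the import-based invocation looks like this (a plain module, no
p5-mop involved; <code class="language-plaintext highlighter-rouge">MyUtils</code> and <code class="language-plaintext highlighter-rouge">fmt_date</code> are made-up names):</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">package MyUtils;

# no 'use base' / @ISA: Exporter's own import method is pulled in directly
use Exporter 'import';
our @EXPORT_OK = qw(fmt_date);

sub fmt_date {
    my ($time) = @_;
    scalar localtime($time // time);
}

1;</code></pre></figure>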

<p>You could argue that one should code clean classes (that don’t export anything)
and clean modules (that export stuff but don’t do OO). Mixing OO in a class
with methods and exportable subs looks a bit unorthodox. But that’s what we do
all day long, and it is almost part of the Perl culture now. Think about all the
modules that provide two APIs, a functional one and an OO one, all in the same
namespace. So, <em>somehow</em>, being able to easily export subs is needed.</p>

<p>However, Jesse Luehrs and Stevan Little don’t think a MOP
implementation should be in charge of implementing an Exporter module, and I
can totally agree with this. So it looks like the solution will be a method
trait, like <code class="language-plaintext highlighter-rouge">exportable</code>:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="k">sub </span><span class="nf">foo</span> <span class="nf">is</span> <span class="nf">exportable</span> <span class="p">{</span> <span class="o">...</span> <span class="p">}</span></code></pre></figure>

<p>But that is not yet implemented.</p>

<h2 id="inside-out-objects-versus-blessed-structure-objects">Inside Out objects versus blessed structure objects</h2>

<p>p5-mop is not using the standard scheme where an object is simply a blessed
structure (usually a <code class="language-plaintext highlighter-rouge">HashRef</code>). Instead, it uses inside-out objects, where
all you get as an object is some kind of identification number (usually a
simple reference), which is used internally to retrieve the object properties,
only accessible from within the class.</p>

<p>This approach may seem odd at first: if I recall correctly, there was a time
when inside-out objects were trendy, especially using <code class="language-plaintext highlighter-rouge">Class::Std</code>. But that
didn’t last long: Moose and its follow-ups came back to using regular
blessed structured objects.</p>

<p>The important thing to keep in mind is that it doesn’t matter too much. Using
inside-out objects is not a big deal, because p5-mop provides so much power to
interact with and introspect the OO concepts that it’s not a problem at all
that the attributes are not in a blessed HashRef.</p>

<p>However, a lot of third-party modules <em>assume</em> that your objects are blessed
HashRefs. So when switching to p5-mop, a whole little ecosystem would need to be
rewritten.</p>

<p><strong>UPDATE</strong>: ilmari pointed out in the comments that there is a class trait
called <code>repr</code> that makes it possible to change the way an instance
is implemented. You can specify whether an object should be a reference to a scalar,
array, hash, glob, or even to a provided CodeRef. This makes p5-mop
objects much more compatible with the OO ecosystem.</p>
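
<p>Based on that description, using it should look something like the following
(the exact trait syntax here is my assumption; check the p5-mop documentation
before relying on it):</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">use mop;

# assumed syntax: ask for a blessed-hash representation so that
# HashRef-poking third-party modules keep working
class LegacyFriendly is repr('HASH') {
    has $!value is rw;
}</code></pre></figure>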

<h1 id="now-where-to-">Now, where to ?</h1>

<p>Now, it’s your turn to try it out, make up your mind, try to port a
module or write one from scratch using p5-mop, and give your feedback. To do
that, go to the IRC channel #p5-mop on the irc.perl.org server, say hi,
and explain what you tried, what went well and what didn’t, and how you feel
about the syntax and concepts.</p>

<p>Also, spread the word by writing about your experience with p5-mop, for
instance on <a href="http://blogs.perl.org">blogs.perl.org</a>.</p>

<p>Lastly, don’t hesitate to participate in the comments below :) Especially if
you don’t agree with my remarks above.</p>

<h2 id="reference--see-also">Reference / See also</h2>

<ul>
  <li><a href="https://github.com/stevan/p5-mop-redux">p5-mop-redux on github</a></li>
  <li><a href="https://github.com/rafl/twigils">twigils on github</a></li>
  <li><a href="https://github.com/stevan/p5-mop-redux/blob/master/lib/mop/manual/tutorials/moose_to_mop.pod">Moose to mop tutorial</a></li>
  <li><a href="http://moose.iinteractive.com/">Moose project homepage</a></li>
  <li><a href="">Moops</a></li>
</ul>

<h2 id="contributors">Contributors</h2>

<p>This article has been written by <a href="http://damien.krotkine.com">Damien Krotkine</a>, and these people helped
proof-read it:</p>

<ul>
  <li>Stevan Little</li>
  <li>Jesse Luehrs</li>
  <li>Toby Inkster</li>
  <li>Lukas Atkinson</li>
</ul>]]></content><author><name>Damien Krotkine</name></author><summary type="html"><![CDATA[I guess that you’ve heard about p5-mop by now. If not, in a nutshell, p5-mop is an attempt to implement a subset of Moose into the core of Perl. Moose provides a Meta Object Protocol (MOP) to Perl. So does p5-mop, however p5-mop is implemented in a way that it can be properly included in the Perl core. Keep in mind that p5-mop goal is to implement a subset of Moose, and. As Stevan Little says: We are not putting “Moose into the core” because Moose is too opinionated, instead we want to put a minimal and less opinionated MOP in the core that is capable of hosting something like Moose As far as I understood, after a first attempt that failed, Stevan Little restarted the p5-mop implementation: the so-called p5-mop-redux github project, using Devel::Declare, ( then Parse::Keyword ), so that he can experiment and release often, while keeping the implementation core-friendly. Once he’s happy with the features and all, he’ll make sure it finds its way to the core. A small team (Stevan Little, Jesse Luehrs, and other contributors) is actively developping p5-mop, and Stevan is regularly blogging about it. If you want more details about the failing first attempt, there is a bunch of backlog and mailing lists archive to read. However, here is how Stevan would summarize it: We started the first prototype, not remembering the old adage of “write the first one to throw away” and I got sentimentally attached to my choice of design approach. This new approach (p5-mop-redux) was purposfully built with a firm commitment to keeping it as simple as possible, therefore making it simpler to hack on. Also, instead of making the MOP I always wanted, I approached as building the mop people actually needed (one that worked well with existing perl classes, etc) Few months ago, when p5-mop-redux was announced, I tried to give it a go. And you should too ! Because it’s easy. Why is it important to try it out ? It’s important to have at least a vague idea of where p5-mop stands at, because this project is shaping a big part of Perl’s future. IMHO, there will be a before and an after having a MOP in core. And it is being designed and tested right now. So as Perl users, it’s our chance to have a look at it, test it, and give our feedback. Do we like the syntax ? Is it powerful enough? What did we prefer more/less in Moose ? etc. In few months, things will be decided and it’ll only be a matter of time and implementation details. Now is the most exciting time to participate in the project. You don’t need to hack on it, just try it out, and provide feedback. Install it p5-mop is very easy to install: you need at least perl 5.16. If you need to upgrade, consider perlbrew or plenv if you don’t have cpanm, get it with curl -L http://cpanmin.us | perl - App::cpanminus first, we need to install twigils, with cpanm --dev twigils if you’re using github, just fork the p5-mop-redux project. Otherwise you can get a zip here. using cpanm, execute cpanm . from within the p5-mop-redux directory. A first example Here is the classical point example from the p5-mop test suite use mop; class Point { has $!x is ro = 0; has $!y is ro = 0; method set_x ($x) { $!x = $x; } method set_y ($y) { $!y = $y; } method clear { ($!x, $!y) = (0, 0); } method pack { +{ x =&gt; $self-&gt;x, y =&gt; $self-&gt;y } } } # ... subclass it ... 
class Point3D extends Point { has $!z is ro = 0; method set_z ($z) { $!z = $z; } method pack { my $data = $self-&gt;next::method; $data-&gt;{z} = $!z; $data; } } This examples shows how straightforward it is to declare a class and a subclass. The syntax is very friendly and similar to what you may find in other languages. class declares a class, with proper scoping. method is used to define methods, so no sub there. The distinction is important, because in methods, additional variables will be automatically available: $self will be available directly, no need to shift @_. attributes variable will be available automatically, so you can access attributes from within the class without having to use their $self-&gt;accessors. Functions defined with the regular sub keyword won’t have all these features, and that’s for good: it makes the difference between function and method more explicit. hasdeclares an attribute. Attribute names are twigils. Borrowed from Perl6, and implemented by Florian Ragwitz in its twigils project on github, twigils are useful to differenciate standard variables from attributes variables: class Foo { has $!stuff; method do_stuff ($stuff) { $!stuff = $stuff; } } As you can see, it’s important to be able to differenciate stuff (the variable) and stuff (the attribute). The added benefit of attributes variables is that one doesn’t need to contantly use $self. A good proportion of the code in a class is about attributes. Being able to use them directly is great. Other notes worth mentiong: Classes can have a BUILD method, as with Moose. A class can inherit from an other one by extend-ing it. In a inheriting class, calling the parent method is not done using SUPER, but $self-&gt;next::method. A class Foo declared in the package Bar will be defined as Bar::Foo. Attributes traits When declaring an attribute name, you can add is, which is followed by a list of traits: has $!bar is ro, lazy = $_-&gt;foo + 2; ro / rw means it’s read-only / read-write lazy means the attribute constructor we’ll be called only when the attribute is being used weak_ref enables an attribute to be a weak reference Default value / builder has $!foo = 'default value'; which is actually has $!foo = sub { 'default value' }; So, there is no default value, only builders. That means that has $!foo = {}; will work as expected ( creating a new hashref each time ). You can reference the current instance in the attribute builder by using $_: has $!foo = $_-&gt;_init_foo; There has been some comments about using = instead of // or || or default, but this syntax is used in a lot of other programing language, and considered somehow the default (ha-ha) syntax. I think it’s worth sticking with = for an easier learning curve for newcomers. Class and method traits UPDATE: Similarly to attributes, classes and methods can have traits. I won’t go in details to keep this post short, but you can make a class abstract, change the default behaviour of all its attributes, make it work better with Moose, etc. Currently there is only one method trait to allow for operator overloading, but additional ones may appear shortly. Methods parameters When calling a method, the parameters are as usual available in @_. However you can also declare these parameters in the method signature: method foo ($arg1, $arg2=10) { say $arg1; } Using = you can specify a default value. In the method body, these parameters will be available directly. Types Types are not yet core to the p5-mop, and the team is questioning this idea. 
The concensus is currently that types should not be part of the mop, to keep it simple and flexible. You ought to be able to choose what type system you want to use. I’m particularly happy about this decision. Perl is so versatile and flexible that it can be used (and bent to be used) in numerous environment and configuration. Sometimes you need robustness and high level powerful features, and it’s great to use a powerful typing system like Moose’s one. Sometimes (most of the time? ) Type::Tiny (before that I used Params::Validate) is good enough and gives you faster processing. Sometimes you don’t want any type checking. Clearer / predicate Because the attribute builder is already implemented using =, what about clearer and predicate? # clearer method clear_foo { undef $!foo } # predicate method has_foo { defined $!foo } That was pretty easy, right? Predicates and clearers have been introduced in Moose because writing them ourselves would require to access the underlying HashRef behind an instance (e.g. sub predicate { exists $self-&gt;{$attr_name}}) and that’s very bad. To work around that, Moose has to generate that kind of code and provide a way to enable it or not. Hence the predicateand clearer options. So you see that they exists mostly because of the implementation. In p5-mop, thanks to the twigils, there is no issue in writing predicates and cleare ourselves. But I hear you say “Wait, these are no clearer nor predicate ! They are not testing the existence of the attributes, but their define-ness!” You’re right, but read on! Undef versus not set In Moose there is a difference between an attribute being unset, and an attribute being undef. In p5-mop, there is no such distinction. Technically, it would be very difficult to implemente that distinction, because an attribute variable is declared even if the attribute has not been set yet. In Moose, because objects are stored in blessed hashes, an attribute can either be: non-existent in the underlying hash present in the hash but with an undef value present and defined but false present, defined and true That’s probably too many cases… Getting rid of one of them looks sane to me. After all, we got this “not set” state only because objects are stored in HashRef, so it looks like it’s an implementation detail that made its way into becoming a concept on its own, which is rarely a good thing. Plus, in standard Perl programming, if an optional argument is not passed to a function, it’s not “non-existent”, it’s undef: foo(); sub foo { my ($arg) = @_; # $arg is undef } So it makes sense to have a similar behavior in p5-mop - that is, an attribute that is not set is undef. Roles Roles definition syntax is quite similar to defining a class. role Bar { has $!additional_attr = 42; method more_feature { say $!additional_attr } } They are consumed right in the class declaration line: class Foo with Bar, Baz { # ... } Meta Going meta is not difficult either but I won’t describe it here, as I just want to showcase default OO programming syntax. On that note, it looks like Stevan will make classes immutable by default, unless specified. I think that this is a good idea (how many time have you written make_immutable ?). My (hopefully constructive) remarks Method Modifiers Method modifiers are not yet implemented, but they won’t be difficult to implement. Actually, here is an example of how to implement method modifiers using p5-mop very own meta. 
It implements around: sub modifier { if ($_[0]-&gt;isa('mop::method')) { my $method = shift; my $type = shift; my $meta = $method-&gt;associated_meta; if ($meta-&gt;isa('mop::role')) { if ( $type eq 'around' ) { $meta-&gt;bind('after:COMPOSE' =&gt; sub { my ($self, $other) = @_; if ($other-&gt;has_method( $method-&gt;name )) { my $old_method = $other-&gt;remove_method( $method-&gt;name ); $other-&gt;add_method( $other-&gt;method_class-&gt;new( name =&gt; $method-&gt;name, body =&gt; sub { local ${^NEXT} = $old_method-&gt;body; my $self = shift; $method-&gt;execute( $self, [ @_ ] ); } ) ); } }); } elsif ( $type eq 'before' ) { die "before not yet supported"; } elsif ( $type eq 'after' ) { die "after not yet supported"; } else { die "I have no idea what to do with $type"; } } elsif ($meta-&gt;isa('mop::class')) { die "modifiers on classes not yet supported"; } } } It is supposed to be used like this: method my_method is modifier('around') ($arg) { $arg % 2 and return $self-&gt;${^NEXT}(@_); die "foo"; } I would like to see method modifiers in p5-mop. As per Stevan Little and Jesse Luehrs, it may be that these won’t be part of the mop, but in a plugin or extension. I’m not to sure about that, for me method modifier is really linked to OO programmning. I prefer using around than fiddling with $self-&gt;next::method or ${^NEXT}. Here are some syntax proposals I’ve gathered on IRC and blog comments regarding what could be method modifiers in p5-mop: around foo { } method foo is around { ... } method foo is modifier(around) { ... } ${^NEXT} and ${^SELF} These special variables are pointing to the current instance (useful when you’re not in a method - otherwise $self is available), and the next method in the calling chain. It’s OK to have such variables, but their horrible name makes it difficult to remember and use. Can’t we have yet an other type of twigils for these variables ? so that we can write $^NEXT and $^SELF. Twigils for public / private attributes Just an idea, but maybe we could have $!public_attribute and $.private_attribute. Or is it the other way around ? why is ? we already have has ! This one thing is bothering me a lot: why do we have to use the word is when declaring an attribute? The attribute declaration starts with has. So with is, that makes it two verbs for one line of code. For me it’s too much. in Moo* modules, the is was just one property. We had default, lazy, etc. Now, is is just a seperator between the name and the ‘traits’. In my opinion, it’s redundant. Also, among the new keywords added by p5-mop, we have only nouns (class, role, method). Only one verb, has. The counter argument on this is that this syntax is inspired by Perl6: class Point is rw { has ($.x, $.y); method gist { "Point a x=$.x y=$.y" } } So, “blame Larry” ? :) Exporter p5-mop doesn’t use @ISA for inheritance, so use base 'Exporter' won’t work. You have to do use Exporter 'import'. That is somewhat disturbing because most Perl developers (I think) implement functions and variables exporting by inheriting from Exporter (that’s also what the documentation of Exporter recommends). You could argue that one should code clean classes (that don’t export anything, and clean modules (that export stuff but don’t do OO). Mixing OO in a class with methods and exportable subs looks a bit un-orthodox. But that’s what we do all day long and it is almost part of the Perl culture now. Think about all the modules that provides 2 APIs, a functional one and an OO one. All in the same namespace. 
So, somehow, being able to easily export subs is needed. However, as per Jesse Luehrs and Stevan Little, they don’t think a MOP implementation should be in charge of implementing an Exporter module, and I can totally agree with this. So it looks like the solution will be a method trait, like exportable: sub foo is exportable { ... } But that is not yet implemented. Inside Out objects versus blessed structure objects p5-mop is not using the standard scheme where an object is simply a blessed structure (usually a HashRef). Instead, it’s using InsideOut objects, where all you get as an object is some kind of identification number (usually a simple reference), which is used internally to retrieve the object properties, only accessible from within the class. This way of doing may seem odd at first: if I recall correctly, there a time where InsideOut objects were trendy, especially using Class::Std. But that didn’t last long, when Moose and its follow ups came back to using regular blessed structured objects. The important thing to keep in mind is that it doesn’t matter too much. Using inside out objects is not a big deal because p5-mop provides so much power to interact and introspect with the OO concepts that it’s not a problem at all that the attributes are not in a blessed HashRef. However, a lot of third-party modules assume that your objects are blessed HashRef. So when switching to p5-mop, a whole little ecosystem will need to be rewritten. UPDATE: ilmari pointed out in the comments that there is a class trait called repr that makes it possible to change the way an instance is implemented. You can specify if an object should be a reference on a scalar, array, hash, glob, or even a reference on a provided CodeRef. This makes p5-mop objects much more compatible with the OO ecosystem. Now, where to ? Now, it’s your turn to try it out, make up your mind, try to port an module or write on from scratch using p5-mop, and give your feedback. To do that, go to the IRC channel #p5-mop on the irc.perl.org server, say hi, and explain what you tried, what went well and what didn’t, and how you feel about the syntax and concepts. Also, spread the word by writing about your experience with p5-mop, for instance on blogs.perl.org. Lastly, don’t hesitate to participate in the comments below :) Especially if you don’t agree with my remarks above. Reference / See also p5-mop-redux on github twigils on github Moose to mop tutorial Moose project homepage Moops Contributors This article has been written by Damien Krotkine, but these people helped proof-reading it: Stevan Little Jesse Luehrs Toby Inkster Lukas Atkinson]]></summary></entry><entry><title type="html">MooX::LvalueAttribute - improved</title><link href="http://damien.krotkine.com/2013/08/21/mooxlvalueattribute-improved.html" rel="alternate" type="text/html" title="MooX::LvalueAttribute - improved" /><published>2013-08-21T00:00:00+00:00</published><updated>2013-08-21T00:00:00+00:00</updated><id>http://damien.krotkine.com/2013/08/21/mooxlvalueattribute--improved</id><content type="html" xml:base="http://damien.krotkine.com/2013/08/21/mooxlvalueattribute-improved.html"><![CDATA[<p><img src="/images/val_approuve.png" alt="New and Improved!" title="I borrowed the image from @yenzie -
 hope you don't mind, Yannick!" /></p>

<p>Just a quick note to mention that following <a href="https://github.com/dams/moox-lvalueattribute/issues/1">Mike Doherty’s bug report</a>, I’ve released a new version of <a href="https://metacpan.org/module/DAMS/MooX-LvalueAttribute-0.12/lib/Method/Generate/Accessor/Role/LvalueAttribute.pm">MooX::LvalueAttribute</a>.</p>

<p>This release (version 0.12) allows you to use MooX::LvalueAttribute in a Moo::Role, like this:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="p">{</span>
    <span class="nb">package</span> <span class="nv">MyRole</span><span class="p">;</span>
    <span class="k">use</span> <span class="nn">Moo::</span><span class="nv">Role</span><span class="p">;</span>
    <span class="k">use</span> <span class="nn">MooX::</span><span class="nv">LvalueAttribute</span><span class="p">;</span>
<span class="p">}</span>

<span class="p">{</span>
    <span class="nb">package</span> <span class="nv">MyApp</span><span class="p">;</span>
    <span class="k">use</span> <span class="nv">Moo</span><span class="p">;</span>

    <span class="nv">with</span> <span class="p">('</span><span class="s1">MyRole</span><span class="p">');</span>

    <span class="nv">has</span> <span class="s">name</span> <span class="o">=&gt;</span> <span class="p">(</span> <span class="s">is</span> <span class="o">=&gt;</span> <span class="p">'</span><span class="s1">rw</span><span class="p">',</span>
                  <span class="s">lvalue</span> <span class="o">=&gt;</span> <span class="mi">1</span><span class="p">,</span>
                <span class="p">);</span>
<span class="p">}</span>

<span class="k">my</span> <span class="nv">$object</span> <span class="o">=</span> <span class="nv">MyApp</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">();</span>
<span class="nv">$object</span><span class="o">-&gt;</span><span class="nv">name</span> <span class="o">=</span> <span class="p">'</span><span class="s1">Joe</span><span class="p">';</span></code></pre></figure>

<p>So now it’s easier to specify which classes will have lvalue attributes and
which ones won’t. So far I have avoided adding a flag that would globally enable
lvalue attributes across all Moo classes (without having to say <code class="language-plaintext highlighter-rouge">lvalue =&gt; 1</code>).
Maybe that’s something some of you would like?</p>

<p>Anyway, that’s all folks! Nothing revolutionary, but I’ve been told we should
talk more about what we do, so that’s what I’m doing.</p>

<p>For more details about MooX::LvalueAttribute, see my <a href="http://damien.krotkine.com/2013/02/11/lvalue-accessors-in-moo.html">original post</a>.</p>]]></content><author><name>Damien Krotkine</name></author></entry><entry><title type="html">New And Improved: Bloomd::Client</title><link href="http://damien.krotkine.com/2013/06/13/new-and-improved-bloomdclient.html" rel="alternate" type="text/html" title="New And Improved: Bloomd::Client" /><published>2013-06-13T00:00:00+00:00</published><updated>2013-06-13T00:00:00+00:00</updated><id>http://damien.krotkine.com/2013/06/13/new-and-improved-bloomdclient</id><content type="html" xml:base="http://damien.krotkine.com/2013/06/13/new-and-improved-bloomdclient.html"><![CDATA[<p><img src="/images/val_approuve.png" alt="New and Improved!" title="I borrowed the image from @yenzie -
 hope you don't mind, Yannick !" /></p>

<p><em>thanks to @yenzie for the picture :P</em></p>

<h2 id="bloom-filters">Bloom filters</h2>

<p><a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a> are probabilistic data
structures. The most common way to use one is to treat it as a bucket: you add
a bunch of elements to it, and once you’ve done that, it’s ready to be used.</p>

<p>You use it by presenting it yet another element, and it’ll tell you, almost
always correctly, whether that element is already in the bucket or not.</p>

<p>More precisely, when asking the question <em>“is this element in the filter?”</em>, if it
answers <strong>no</strong>, then you are sure that it’s <strong>not</strong> in there. If it answers <strong>yes</strong>,
then there is a <strong>high probability</strong> that it’s there.</p>

<p>So basically, you never get false negatives, but you can get a few false
positives. The good thing is that, given the space you allocate to the
filter and the number of elements it contains, you can compute the
probability of getting false positives.</p>
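
<p>For the curious, here is the standard approximation (not specific to bloomd): with a filter of <em>m</em> bits, <em>k</em> hash functions, and <em>n</em> inserted elements, the false positive rate is roughly <code class="language-plaintext highlighter-rouge">(1 - exp(-k*n/m))**k</code>, and <code class="language-plaintext highlighter-rouge">k = (m/n) * ln(2)</code> minimizes it. A quick back-of-the-envelope computation in Perl, with made-up numbers:</p>

<pre><code class="language-perl"># standard Bloom filter approximation:
# P(false positive) ~ (1 - e^(-k*n/m))^k
my $n = 1_000_000;          # number of elements you plan to insert
my $m = 8 * 1024 * 1024;    # filter size in bits (here, 1 MiB of RAM)

# k = (m/n) * ln(2) minimizes the false positive rate
my $k = int( $m / $n * log(2) + 0.5 );

my $p = ( 1 - exp( -$k * $n / $m ) ) ** $k;
printf "k = %d hash functions, false positive rate ~ %.4f\n", $k, $p;
</code></pre>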

<p>The <strong>huge</strong> benefit is that a bloom filter is very small, compared to a hash
table.</p>

<h2 id="bloomd">bloomd</h2>

<p>At work, I replaced a heavy Redis instance (using 60GB of RAM) that was used primarily as a
huge hash table, with a couple of bloom filters (using 2GB). For that I used
<a href="https://github.com/armon/bloomd">bloomd</a>, from <em>Armon Dadgar</em>. It’s light,
fast, has enough features, and the code looks sane.</p>

<p>All I needed was a Perl client to connect to it.</p>

<h2 id="bloomdclient">Bloomd::Client</h2>

<p>So I wrote <a href="https://metacpan.org/module/Bloomd::Client">Bloomd::Client</a>. It is a light
client that connects to bloomd using a regular INET socket, and speaks the
simple ASCII protocol (very similar to Redis’ one) that bloomd implements.</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">    <span class="k">use</span> <span class="nn">Bloomd::</span><span class="nv">Client</span><span class="p">;</span>
    <span class="k">my</span> <span class="nv">$b</span> <span class="o">=</span> <span class="nn">Bloomd::</span><span class="nv">Client</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">;</span>

    <span class="k">my</span> <span class="nv">$filter</span> <span class="o">=</span> <span class="p">'</span><span class="s1">test_filter</span><span class="p">';</span>
    <span class="nv">$b</span><span class="o">-&gt;</span><span class="nv">create</span><span class="p">(</span><span class="nv">$filter</span><span class="p">);</span>
    <span class="k">my</span> <span class="nv">$hash_ref</span> <span class="o">=</span> <span class="nv">$b</span><span class="o">-&gt;</span><span class="nv">info</span><span class="p">(</span><span class="nv">$filter</span><span class="p">);</span>

    <span class="nv">$b</span><span class="o">-&gt;</span><span class="nv">set</span><span class="p">(</span><span class="nv">$filter</span><span class="p">,</span> <span class="p">'</span><span class="s1">u1</span><span class="p">');</span>
    <span class="k">if</span> <span class="p">(</span><span class="nv">$b</span><span class="o">-&gt;</span><span class="nv">check</span><span class="p">(</span><span class="nv">$filter</span><span class="p">,</span> <span class="p">'</span><span class="s1">u1</span><span class="p">'))</span> <span class="p">{</span>
	  <span class="nv">say</span> <span class="p">"</span><span class="s2">it exists!</span><span class="p">"</span>
    <span class="p">}</span></code></pre></figure>

<p>When you use bloomd, it usually means that you are in a high availability
environment, where you can’t afford to get stuck waiting on a socket just because
something went wrong. So Bloomd::Client implements non-blocking timeouts on the
socket: it’ll die if bloomd doesn’t answer fast enough, or if something broke.
That allows you to incorporate the bloomd connection in a retry strategy, to try
again later or fall back to another server…</p>

<p>To implement such a strategy, I recommend using
<a href="https://metacpan.org/module/Action::Retry">Action::Retry</a>. There is a blog
post about it <a href="http://damien.krotkine.com/2013/01/21/new-module-actionretry.html">here</a> :)</p>
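
<p>Putting the two together, here is a rough sketch (the filter name and the strategy tuning are made up for illustration; it relies on Bloomd::Client dying on a timeout, and on the error being handed to <code class="language-plaintext highlighter-rouge">retry_if_code</code>, as in the examples of my Action::Retry post):</p>

<pre><code class="language-perl">use feature 'say';
use Bloomd::Client;
use Action::Retry qw(retry);

my $b = Bloomd::Client-&gt;new;    # default host and port

# Bloomd::Client dies when bloomd is too slow or the socket breaks,
# so retry on any error, with a backoff, and give up after a few tries
retry {
    say "it exists!" if $b-&gt;check( 'test_filter', 'u1' );
}
retry_if_code   =&gt; sub { $_[0] },    # $_[0] is the error: retry only if the check died
strategy        =&gt; { Fibonacci =&gt; { multiplicator =&gt; 2000, max_retries_number =&gt; 5 } },
on_failure_code =&gt; sub { say "bloomd still unreachable, giving up" };
</code></pre>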

<p>dams.</p>]]></content><author><name>Damien Krotkine</name></author></entry><entry><title type="html">MooX::LvalueAttribute - Lvalue accessors in Moo</title><link href="http://damien.krotkine.com/2013/02/11/lvalue-accessors-in-moo.html" rel="alternate" type="text/html" title="MooX::LvalueAttribute - Lvalue accessors in Moo" /><published>2013-02-11T00:00:00+00:00</published><updated>2013-02-11T00:00:00+00:00</updated><id>http://damien.krotkine.com/2013/02/11/lvalue-accessors-in-moo</id><content type="html" xml:base="http://damien.krotkine.com/2013/02/11/lvalue-accessors-in-moo.html"><![CDATA[<p>Yesterday I was reading <a href="http://blogs.perl.org/users/joel_berger/2013/02/in-the-name-of-create-great-things-in-perl.html">Joel’s
post</a>,
where he lists great Perl things he’s seen done lately. It is indeed great
stuff. I was particularly interested in his attempt at playing with <a href="https://gist.github.com/jberger/4740303">Lvalue accessors</a>.</p>

<p>I thought that it would be a great exercise to try to implement it in Moo, as
an additional feature, while getting rid of the <code class="language-plaintext highlighter-rouge">AUTOLOAD</code>. I also wanted
to avoid doing a <code class="language-plaintext highlighter-rouge">tie</code> every time an instance attribute accessor was called:
I only needed to tie <em>once</em> per instance and per attribute, not each
time the attribute was accessed.</p>
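
<p>To illustrate that tie-once idea (a toy sketch only, not the module’s actual implementation, which ended up using Variable::Magic, as explained below): tie a proxy scalar lazily, once per instance and per attribute, cache it, and have the lvalue accessor return it.</p>

<pre><code class="language-perl">use strict;
use warnings;

package LvalueProxy;
# the tied scalar delegates reads and writes to the underlying attribute
sub TIESCALAR { my ($class, $obj, $attr) = @_; bless { obj =&gt; $obj, attr =&gt; $attr }, $class }
sub FETCH { my $self = shift; $self-&gt;{obj}{ $self-&gt;{attr} } }
sub STORE { my ($self, $val) = @_; $self-&gt;{obj}{ $self-&gt;{attr} } = $val }

package App;
sub new { bless { name =&gt; $_[1] }, $_[0] }

# one tied scalar per instance, created lazily on first access; note that
# keying a plain hash on "$self" leaks entries, hence the fieldhashes
# mentioned further down
my %proxy;
sub name : lvalue {
    my $self = shift;
    unless ( exists $proxy{$self} ) {
        my $scalar;
        tie $scalar, 'LvalueProxy', $self, 'name';
        $proxy{$self} = \$scalar;
    }
    ${ $proxy{$self} };
}

package main;
my $app = App-&gt;new('foo');
$app-&gt;name = 'Bar';       # goes through STORE
print $app-&gt;name, "\n";   # Bar, via FETCH
</code></pre>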

<p>So I started hacking on the code of Moo. Getting rid of the AUTOLOAD was easy,
as I could change the way the accessor generator was, well, generating the,
err, accessors.</p>

<p>Shortly after, I ran into issues caching a tied variable. I asked the
all-mighty <a href="https://metacpan.org/author/VPIT">Vincent Pit</a>, and he found a
solution for my tied variables, but more importantly pointed me to
<a href="https://metacpan.org/module/Variable::Magic">Variable::Magic</a>, which is
faster, more flexible, and more powerful.</p>

<p>All I needed then was to move my hacks into a proper Role, wrap the whole thing in a
module, and push it to CPAN. Tadaa, <a href="https://metacpan.org/module/MooX::LvalueAttribute">MooX::LvalueAttribute</a> was born.</p>

<p>In the process I used <a href="http://play-perl.org">play-perl</a> to register my quests,
and exchanged <a href="http://play-perl.org/quest/511800ae94f611130b000025">thoughts with Joel
Berger</a>. I think I’m going
to use this website more, see if it can boost my productivity, and help me
figure out what’s really important to do.</p>

<p>On IRC, haarg discovered a bug and recommended using so-called <em>fieldhashes</em>,
from
<a href="https://metacpan.org/module/Hash::Util::FieldHash::Compat">Hash::Util::FieldHash::Compat</a>.
At the end of the day, I only acted as glue between different pieces of
knowledge, and that was very satisfying.</p>
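
<p>If you wonder what fieldhashes buy you: a plain hash keyed on an object reference stringifies the reference, so entries outlive their object (and a recycled address can even collide with a new object). A fieldhash cleans up after each object automatically. A minimal example:</p>

<pre><code class="language-perl">use strict;
use warnings;
use Hash::Util::FieldHash::Compat qw(fieldhash);

# %cache behaves like a normal hash, except that its keys are object
# references, and each entry is deleted automatically when its object
# is destroyed
fieldhash my %cache;

{
    my $object = bless {}, 'App';
    $cache{$object} = 'some per-instance data';
    print scalar keys %cache, "\n";   # 1
}
# $object went out of scope, and its entry went with it
print scalar keys %cache, "\n";       # 0
</code></pre>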

<h2 id="tldr">TL:DR</h2>

<p><a href="https://metacpan.org/module/MooX::LvalueAttribute">MooX::LvalueAttribute</a> is a
module that provides Lvalue attributes:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nb">package</span> <span class="nv">App</span><span class="p">;</span>
<span class="k">use</span> <span class="nv">Moo</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">MooX::</span><span class="nv">LvalueAttribute</span><span class="p">;</span>

<span class="nv">has</span> <span class="s">name</span> <span class="o">=&gt;</span> <span class="p">(</span>
  <span class="s">is</span> <span class="o">=&gt;</span> <span class="p">'</span><span class="s1">rw</span><span class="p">',</span>
  <span class="s">lvalue</span> <span class="o">=&gt;</span> <span class="mi">1</span><span class="p">,</span>
<span class="p">);</span>

<span class="c1"># Elsewhere</span>
<span class="k">my</span> <span class="nv">$app</span> <span class="o">=</span> <span class="nv">App</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">(</span><span class="s">name</span> <span class="o">=&gt;</span> <span class="p">'</span><span class="s1">foo</span><span class="p">');</span>

<span class="nv">$app</span><span class="o">-&gt;</span><span class="nv">name</span> <span class="o">=</span> <span class="p">'</span><span class="s1">Bar</span><span class="p">';</span>

<span class="k">print</span> <span class="nv">$app</span><span class="o">-&gt;</span><span class="nv">name</span><span class="p">;</span>  <span class="c1"># Bar</span></code></pre></figure>

<p>Enjoy!</p>]]></content><author><name>Damien Krotkine</name></author></entry><entry><title type="html">New Perl module: Action::Retry</title><link href="http://damien.krotkine.com/2013/01/21/new-module-actionretry.html" rel="alternate" type="text/html" title="New Perl module: Action::Retry" /><published>2013-01-21T00:00:00+00:00</published><updated>2013-01-21T00:00:00+00:00</updated><id>http://damien.krotkine.com/2013/01/21/new-module-actionretry</id><content type="html" xml:base="http://damien.krotkine.com/2013/01/21/new-module-actionretry.html"><![CDATA[<p><em>UPDATE: I have included a functional API, as per Oleg Komarov’s request, and amended this post accordingly</em>.</p>

<p>I’ve just released a new module called
<a href="https://metacpan.org/module/Action::Retry">Action::Retry</a>.</p>

<p>Use it when you want to run some code until it succeeds, waiting a bit between
retries.</p>

<p>A simple way to use it is:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="k">use</span> <span class="nn">Action::</span><span class="nv">Retry</span> <span class="sx">qw(retry)</span><span class="p">;</span>
<span class="nv">retry</span> <span class="p">{</span> <span class="o">...</span> <span class="p">};</span></code></pre></figure>

<p>And the Object Oriented API:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nn">Action::</span><span class="nv">Retry</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">(</span> <span class="s">attempt_code</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span> <span class="o">...</span> <span class="p">}</span> <span class="p">)</span><span class="o">-&gt;</span><span class="nv">run</span><span class="p">();</span></code></pre></figure>

<p>The purpose of this module is similar to <code class="language-plaintext highlighter-rouge">Retry</code>, <code class="language-plaintext highlighter-rouge">Sub::Retry</code>, <code class="language-plaintext highlighter-rouge">Attempt</code> and
<code class="language-plaintext highlighter-rouge">AnyEvent::Retry</code>. However, it’s highly configurable, more flexible and has
more features.</p>

<p>You can specify the code to try, but also a callback that will be executed to
check whether the attempt succeeded or failed. There is also a callback that is
run on failure.</p>

<p>The module also supports different sleep strategies (Constant, Linear,
Fibonacci…) and it’s easy to build your own. Strategies can take options
as well.</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="k">my</span> <span class="nv">$action</span> <span class="o">=</span> <span class="nn">Action::</span><span class="nv">Retry</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">(</span>
  <span class="s">attempt_code</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span> <span class="o">...</span> <span class="p">},</span>
  <span class="s">retry_if_code</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span> <span class="vg">$_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=~</span> <span class="sr">/Connection lost/</span> <span class="o">||</span> <span class="vg">$_</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">20</span> <span class="p">},</span>
  <span class="s">strategy</span> <span class="o">=&gt;</span> <span class="p">{</span> <span class="s">Fibonacci</span> <span class="o">=&gt;</span> <span class="p">{</span> <span class="s">multiplicator</span> <span class="o">=&gt;</span> <span class="mi">2000</span><span class="p">,</span>
                               <span class="s">initial_term_index</span> <span class="o">=&gt;</span> <span class="mi">3</span><span class="p">,</span>
                               <span class="s">max_retries_number</span> <span class="o">=&gt;</span> <span class="mi">5</span><span class="p">,</span>
                             <span class="p">}</span>
              <span class="p">},</span>
  <span class="s">on_failure_code</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span> <span class="nv">say</span> <span class="p">"</span><span class="s2">Given up retrying</span><span class="p">"</span> <span class="p">},</span>
<span class="p">);</span>
<span class="nv">$action</span><span class="o">-&gt;</span><span class="nv">run</span><span class="p">();</span></code></pre></figure>

<p>And the functional API:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl">  <span class="k">use</span> <span class="nn">Action::</span><span class="nv">Retry</span> <span class="sx">qw(retry)</span><span class="p">;</span>
  <span class="nv">retry</span> <span class="p">{</span> <span class="o">...</span> <span class="p">}</span>
  <span class="s">retry_if_code</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span> <span class="vg">$_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=~</span> <span class="sr">/Connection lost/</span> <span class="o">||</span> <span class="vg">$_</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">20</span> <span class="p">},</span>
  <span class="s">strategy</span> <span class="o">=&gt;</span> <span class="p">{</span> <span class="s">Fibonacci</span> <span class="o">=&gt;</span> <span class="p">{</span> <span class="s">multiplicator</span> <span class="o">=&gt;</span> <span class="mi">2000</span><span class="p">,</span>
                               <span class="s">initial_term_index</span> <span class="o">=&gt;</span> <span class="mi">3</span><span class="p">,</span>
                               <span class="s">max_retries_number</span> <span class="o">=&gt;</span> <span class="mi">5</span><span class="p">,</span>
                             <span class="p">}</span>
              <span class="p">},</span>
  <span class="s">on_failure_code</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span> <span class="nv">say</span> <span class="p">"</span><span class="s2">Given up retrying</span><span class="p">"</span> <span class="p">};</span></code></pre></figure>

<p>Strategies can also decide whether it’s worthwhile to keep trying, or whether the whole action should fail.</p>

<p><a href="https://metacpan.org/module/Action::Retry">Action::Retry</a> also supports a
pseudo “non-blocking” mode, in which it doesn’t actually sleep, but instead
returns immediately, and won’t run the action code again until the required time has
elapsed. Basically, it allows you to do this:</p>

<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="k">my</span> <span class="nv">$action</span> <span class="o">=</span> <span class="nn">Action::</span><span class="nv">Retry</span><span class="o">-&gt;</span><span class="k">new</span><span class="p">(</span>
  <span class="s">attempt_code</span> <span class="o">=&gt;</span> <span class="k">sub </span><span class="p">{</span> <span class="o">...</span> <span class="p">},</span>
  <span class="s">non_blocking</span> <span class="o">=&gt;</span> <span class="mi">1</span><span class="p">,</span>
  <span class="s">strategy</span> <span class="o">=&gt;</span> <span class="p">{</span> <span class="p">'</span><span class="s1">Constant</span><span class="p">'</span> <span class="p">}</span>
<span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1"># if the action failed, it doesn't sleep</span>
  <span class="c1"># next time it's called, it won't do anything until it's time to retry</span>
  <span class="nv">$action</span><span class="o">-&gt;</span><span class="nv">run</span><span class="p">();</span>

  <span class="nv">do_something_else</span><span class="p">();</span>
  <span class="c1"># do something else while time goes on</span>

<span class="p">}</span></code></pre></figure>

<p>Of course, <code class="language-plaintext highlighter-rouge">do_something_else</code> should be very fast, so that the loop comes back
quickly to retrying the <code class="language-plaintext highlighter-rouge">attempt_code</code>.</p>

<p><a href="https://metacpan.org/module/Action::Retry">Action::Retry</a> is based on
<a href="https://metacpan.org/module/Moo">Moo</a> for performance (and because the module
is simple enough to not require Moose). Moo classes properly expand to Moose
ones if needed, so there is no excuse not to use it.</p>

<p>So, please give <a href="https://metacpan.org/module/Action::Retry">Action::Retry</a> a try, and let me know
what you think.</p>]]></content><author><name>Damien Krotkine</name></author></entry></feed>