Clive Holloway - Senior Developer with expertise in Solution Architecture

These in depth examples will hopefully give you some idea as to how I think and how I approach problems.

Phone number normalization (ZipRecruiter)

Massive in-situ clean up of the way over 65M phone numbers were stored in MySQL. [+]

The problem

This was the first project I undertook at ZipRecruiter. Still in the start-up phase there was a lot of technical debt that needed cleaning up, the most egregious being the way phone numbers were being stored in the database - as free form text fields with no normalization of data! Consequently, searching on phone numbers was problematic, with the existing solution involving some complicated regexes. My task was to clean this up, normalize it, and provide a better interface and implementation.

The solution

First, I researched phone number formats. I ended up using E.164 for the normalized storage, and E.123 for display. Google's libphonenumber was perfect, but had no native Perl implementation. I ended up stealing the metadata from the repo and parsing the regexes from it to use in my code, which I was kindly allowed to release to CPAN (though I haven't touched this since I wrote it). In the CPAN release I added a few tweaks so that you could set up a cron to automatically update the metadata (which changes suprisingly often) and a few helper ENV vars that weren't needed in the ZR implementation.

So, now I had the code to normalize the numbers, I just had to update in the database. This was done in several steps:

add a normalized phone number column to each DB table containing phone numbers
amend all places in the code that inserted/updated phone numbers to also add the normalized value to the new field via triggers
run a batch script to populate this new column across all affected tables
amend all view code that set phone numbers to ensure numbers were received in E.164 format (via a country flag select and free form text entry, validated on submit)
amend views displaying phone numbers to use the new field and display them in E.123 format
amend all search code to only search via the E.164 format after normalizing user input
stop updating the legacy fields
final cleanup to drop the old free form field from affected tables

By the end of the project, I had sped up phone number search from several seconds (yes, really!) to a few milliseconds. At this point, I had thoughts on how we could improve this further, which led to...

Migrate user search from MySQL to Elasticsearch (ZipRecruiter)

The problem

We now had basic search that wasn't killing the servers, but while it worked, search was still clunky. The GUI used by Customer Service Reps (CSRs) had multiple fields you had to type search terms into, and they were exact/partial matches on literal text only, so searches still took a while and it was still often hard to find what you were looking for.

I started working out how we could improve this. By reaching out to individual CSRs, I was able to get a better picture of where their frustrations lay and how we could make their job easier.

The solution

I chose Elasticsearch because of the flexibility it gave me for each field CSRs wanted to search on. After a lot of playing around I ended up defining multiple document attributes, based on real life use cases by CSRs, including:

Double Metaphone tokenizations of names - so that CSRs didn't have to ask for users to spell their names - John or Jon worked the same
last four digits and expiry date of CC on file
domain name of their email address (so that one could look up all users from the same company easily)
Address lines
phone numbers in E.164 format

I think there were about a dozen document attributes in the end that we used.

Initially, I wrote a batch migration script that converted the ~65M contacts in the MySQL table into well indexed Elasticsearch docs in about 4 hours and then added triggers to automatically push updates on amended or added records over to Elasticsearch in real time (via Debezium and Kafka).

Once I had all this live, I then replaced the multi-field legacy search with a new search form that was a single field. After the CSR had typed 3 characters, suggested results would populate in a drop-down under the search box via AJAX until they found the person they needed to. The Perl controller did a little "loose" formatting of data on the way to Elasticsearch (convert possible phone numbers to E.164, assume 5 digits was a zip code, dd/dd was a credit card expiry date etc, and then send a well formed query to Elasticsearch. Document attributes were weighted differently on match, and after some tuning we ended up with a simple UI that allowed CSRs to drill down the users in a very efficient way.

In my initial calculations, I thought this implementation would work well up to about 150M contacts. An ex-colleague contacted me later to confirm it hard started to degrade at ~ 135M contacts, so I wasn't too far out with my estimate.

Once live, I must admit I did enjoy the complimentary feedback from the CSRs! Customer lookup was now a simple process that was a lot cleaner, faster, and accurate than before.

Migrating observability tooling from Sumo to Datadog - and cleaning up the underlying code as we went (Estee Lauder)

Analyzing 20 years of observability technical debt and refactoring it to use a single observability platform. [+]

The problem

As part of a modernization and clean-up push, I migrated observability from several disjointed services (Sumo and Promethus, and a couple of custom implementations) in a very old and large monolith mod_perl app. As part of the preparation for this, I analyzed existing logging and metric implementations and saw several issues including:

multiple teams had rolled their own interfaces to the logger so they could add custom data to all log messages in their code - with a complete lack of standards - and the resulting code added noise to the application logic
one couldn't cross reference all log messages from a request because of the different approaches to metadata
metric names were hard coded, leading to possible naming clashes

So I ran an analysis across the repo to identify each of the existing approaches and the problems the custom code was trying to solve.

Because the monolith ran on a 15 year old version of perl, OpenTelemetry modules could not be used, so we had to create a bespoke solution.

The solution

As a first project for EL, I had written an application framework in Plack, and had implemented generic observability tooling in it. This project was shelved due to company direction, but I was able to salvage the interface to quickly create better tooling for the monolith. Key aspects of the solution:

implement a simple Domain Specific Language (DSL) for all observability code
automatically add multiple attributes to all log messages (caller, user cookie, package and line number where emitted and several other attributes needed internally
wrote the metric emitter so that each metric name was prefixed with the service and namespace (ie like this: perl.namespace.metric_name) to effectively eliminate naming clashes (hard coded metrics were allowed, as long as they began with "perl." so that badly written legacy code could still set the same metric in multiple packages
added helper methods that allowed you to add sticky log attributes that were sent for the rest of the request, or for all messages in the current package
provided a simple way to redact Personally Identifiable Information (PII) in structured data via allow or exclude lists via a simple config - ie when a log message dumped user records, automatically mask or remove any PII before logging
provided the same interface for adding tags to logs and metrics

Once we started rolling this out, I made several presentations to multiple teams to help with the paradigm shift, and kept an eye on the repo to spot when developers were using the tooling inappropriately (mainly trying to roll their own loggers because they didn't understand the metadata features correctly). By communicating and monitoring progress, I was able to see this project fully to completion, educated consumers (developers) and vastly improved the quality of the data being ingested.

For traces, I wrote a simple system where you just had to register the methods you wanted to trace within a package and, at server start up, decorators were added to the method to start/stop the timer and emit the trace. This allowed us to remove dozens of custom timers devs had added to the code and replace with a single config.

Approximate examples of the interface

I have changed the naming of a few things, but here's some annotated examples:

# note, I never export symbols by default
# - it makes it harder for new devs to get up to speed
use MyLogger qw(_logger);

# allow devs to set package tags emitted with every message
# emitted in this package
_logger->package_tags(
    tag_name  => $tag_value,
    tag_name2 => $tag_value2,
);

# tags sent with every message sent for the remainder of request
# - used with caution and only when you know you own the request
_logger->request_tags(
    tag_name  => $tag_value,
    tag_name2 => $tag_value2,
);

# simple message emission
_logger->info("This is an info message");

# message with tags
_logger->info("This is an info message", { with => 'a tag' });

# log an expensive message by wrapping it in a sub
# - it only runs if running at debug level
_logger->debug(sub { expensive_code(); });

      
use MyMetric qw(_metric);

# basic counter, implicit increment by 1
_metric->increment('my_counter');

# basic counter with explicit increment
_metric->increment('my_counter', 5);

# as above, but with tags
_metric->increment('my_counter',    { status => $status });
_metric->increment('my_counter', 5, { status => $status });

# basic gauge
_metric->gauge($temperature);

I kept the lexicon quite small in the DSL so that we could keep things clean and simple. I noticed patterns where developers were preformatting data before emitting log messages, and that led me to expanding the core DSL by adding a plugin architecture.

Log message plugins

Many times in the legacy code, developers dumped whole objects, which was 'OK' when the logging was local, but was a really bad idea if they contained PII when sending it to centralized logging. I also noticed a few other patterns emerging that would manipulate message data before emission, so I added in a plugin architecture to standardize these tools in a clean and simple way.

All plugins were defined in the MyLogger::Plugin namespace
At server startup, the plugins were loaded and registered with the logger
In individual log messages, if you included a tag that began with an underscore, the logger would look for a plugin with that registered name and call it with the message data

For PII, I made it so you could specify which tags needed redacting, and it would automatically recurse through the element specified and redact as needed. Eg

# only remove 2 fields - useful when the data type is static
_logger->info("User data", {
    users => $user_data,
    _redact => { users => { remove => [qw(password ssn)] } }
});

# only keep the fields you want - useful when you just need
# a small party of each record, especially if the data's
# structure was fluid and often had attributes added.
_logger->info("User data", {
    users => $user_data,
    _redact => { users => { keep => [qw(name age )] } },
});

So, whether $user_data was a single record hashref or a list of records in an arrayref of hashrefs, the plugin would recurse through it and remove the specified fields.

Again, I added the plugin pattern as another tool to stop developers rolling their own helpers over and over again. The first plugin not written by me was a plugin that would truncate fields, and was used to shorten deep call stacks, and worked like this.

# truncate the stack trace to first 1000 characters
# note the anonymous sub that stopped the expensive
# code running when log level wasn't debug!
_logger->debug("Stack trace", {
    stack     => sub { Carp::longmess() },
    _truncate => { stack => 1000 },
});

I think they now have about a dozen plugins in use, removing a lot of repetitive fluff code from the main application logic.

I'm quite proud of what I achieved here, and I think my ex-colleagues agreed I did a good job.