Synthetic vs RUM Performance Monitoring – The Pepsi Challenge

When it comes to understanding your site's performance there is a range of tools and techniques available, but they all fall into one of two categories – synthetic monitoring and real user monitoring (known as RUM). But what's the difference, and why should you use both?

Possibly the biggest challenge with performance monitoring is how much performance can vary even when seemingly nothing has changed. There are a huge number of variables at play which impact overall performance beyond just the code on the page itself – everything from the CPU and network capabilities of the user's device, through to other applications, web pages and browser extensions fighting for resources and slowing down page loads. The Lighthouse team lists seven of the most common sources of variability in their Lighthouse variability documentation.

This variability is one of the fundamental differences between synthetic and real user monitoring.

Synthetic monitoring

Otherwise known as lab testing, synthetic monitoring attempts to remove or reduce some of the sources of variability to improve the overall consistency of results. Typically, this means checking a specific set of URLs on a known set of devices and browsers, with network and CPU conditions throttled to fixed values. Whilst there is still a degree of variability, this significantly reduces the variance between tests.
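As a rough, hypothetical sketch of what pinning down those conditions can look like – not something taken from a specific monitoring product – here's how you might run Lighthouse programmatically with explicit throttling settings. The URL is a placeholder, and the throttling values are illustrative, roughly matching Lighthouse's default simulated mobile profile.

```ts
import lighthouse from 'lighthouse';
import * as chromeLauncher from 'chrome-launcher';

// Launch a headless Chrome instance for Lighthouse to drive.
const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] });

const result = await lighthouse(
  'https://example.com', // placeholder URL
  { port: chrome.port, output: 'json', onlyCategories: ['performance'] },
  {
    extends: 'lighthouse:default',
    settings: {
      // Fixed, simulated conditions so repeat runs are comparable.
      throttlingMethod: 'simulate',
      throttling: {
        rttMs: 150,               // network round-trip time
        throughputKbps: 1638.4,   // ~1.6 Mbps download
        cpuSlowdownMultiplier: 4, // emulate a slower CPU
      },
    },
  },
);

console.log('Performance score:', result?.lhr.categories.performance.score);
console.log('LCP (ms):', result?.lhr.audits['largest-contentful-paint'].numericValue);

await chrome.kill();
```

Because the device, network and CPU conditions are fixed in the configuration, two runs of this script are far more comparable than two visits from real users would be.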

Testing samples in a science laboratory

As an analogy, we can compare this to a controlled taste test like the Pepsi Challenge marketing campaign. To market Pepsi as a competitor to Coca-Cola, customers were invited to taste the two products blind and pick their favourite. By tasting the two blind, Pepsi removed a huge number of variables that would usually influence the outcome of the test – product cost, availability, branding, name recognition and marketing – to make this a judgement on the drinks themselves rather than anything around them. This is very similar to synthetic performance monitoring – we're removing the factors that are out of our direct control to understand the performance of the factors that are within our control.

Pepsi Challenge Commercial from 1983

Running a synthetic check on its own is useful, but things get really valuable when you either check against a competitor to see how your sites compare, or check your own site repeatedly to see how its performance changes over time. This is where synthetic checks become synthetic monitoring.

The most popular synthetic testing tools are Lighthouse and WebPageTest, with popular monitoring tools/services being Lighthouse CI (LHCI) Server, SpeedCurve and Calibre. These monitoring tools often introduce additional strategies for dealing with variance.
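One common variance strategy is simply repeating the check and summarising – WebPageTest reports the median of multiple runs, and LHCI can run each URL several times. A minimal sketch of that idea; the helper and the metric values below are made up for illustration:

```ts
// Hypothetical helper: given metric values from repeated runs of the same
// synthetic check, report the median (the value most tools keep) and the
// spread, which is a quick signal of how noisy the test environment was.
function summariseRuns(values: number[]): { median: number; spread: number } {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
  return { median, spread: sorted[sorted.length - 1] - sorted[0] };
}

// Example: LCP values (ms) from five runs of the same page.
console.log(summariseRuns([2450, 2310, 2600, 2380, 2520]));
// -> { median: 2450, spread: 290 }
```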

Synthetic monitoring gives you access to a huge amount of performance data, including screenshots and videos, to really highlight how your site is performing and where there is room for improvement. If you've already got all this insight, why go any further?

Real user monitoring (RUM)

Whereas synthetic monitoring helps us get consistent results, real user monitoring instead embraces the variance, showing you exactly what your users are experiencing in real-world conditions. This is usually done by integrating a real user monitoring service into your site, such as SpeedCurve or Catchpoint.
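Under the hood, most RUM services do something similar to the sketch below: collect Core Web Vitals in the browser and beacon them back to a collection endpoint. This example uses Google's web-vitals library; the '/rum' endpoint is a placeholder rather than a real service URL, and products like SpeedCurve or Catchpoint ship their own snippet that handles this for you.

```ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';

// Send each metric to a hypothetical collection endpoint.
function sendToAnalytics(metric: Metric) {
  const body = JSON.stringify({
    name: metric.name,   // e.g. 'LCP'
    value: metric.value, // the measured value for this page view
    id: metric.id,       // unique per page load, useful for deduplication
  });
  // sendBeacon survives page unloads; fall back to a keepalive fetch.
  if (navigator.sendBeacon) {
    navigator.sendBeacon('/rum', body);
  } else {
    fetch('/rum', { method: 'POST', body, keepalive: true });
  }
}

onLCP(sendToAnalytics);
onCLS(sendToAnalytics);
onINP(sendToAnalytics);
```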

Laptop screen showing performance graphs from SpeedCurve

In Pepsi terms, this is the equivalent of eliminating the blind tastings and measuring the best-tasting cola based on actual sales – people voting with their wallets. This adds many more factors than just the taste into the mix, but ultimately it is the best reflection of how people perceive your product.

You'll see RUM results that align with the synthetic results, but you'll also see faster and slower page loads, depending on how all the variability factors play out for different users. The general trends should match your synthetic data, but you'll now be able to see the impact of the other factors, like varying network conditions.

Comparison

As RUM relies on collecting data from actual users "in the field", the range of metrics you can gather differs from synthetic. With a couple of exceptions, synthetic can gather more metrics than RUM, but RUM makes up for this with the breadth of data it collects – it can capture data from any page your users visit, 24 hours a day.

The two make a great pairing and give you a useful T-shaped set of data. Use the wide breadth of your RUM data to identify trends and areas that may need improvement, then dive deeper into your synthetic data to understand exactly what is happening and how you can make those valuable improvements.

Representation of the T-shape – RUM gives wide coverage but low depth whilst synthetic gives a deep set of data on a thinner slice.

When I started comparing synthetic and RUM data, one of the observations that jumped out the most was how much our performance varies depending on the time of day. Our synthetic tests ran on popular pages at peak times and got fairly consistent results. RUM, on the other hand, showed early mornings as significantly slower than evenings, despite the content being served being the same. This highlighted the impact of factors like our caching strategy and server load on end-user performance.

Synthetic and RUM should generally work in tandem with each other. Both should show you similar patterns: RUM allows you to validate synthetic results in real-life conditions, whilst synthetic tests allow you to benchmark your performance in controlled conditions representative of those of your actual users.

Let's look back at the Pepsi Challenge example.

In the blind taste tests Pepsi generally came out as the winner, and with its marketing campaign highlighting this it saw good market share gains. However, the synthetic taste tests were, whilst great marketing, flawed as a product comparison. In his book "Blink: The Power of Thinking Without Thinking", Malcolm Gladwell showed that Pepsi's wins largely came down to the "sip test" method, where Pepsi's sweeter taste performed particularly well; when the test switched to drinking a full can, the winner switched back to Coca-Cola.

Coca-Cola attributed their change of fortunes to the taste of the product rather than the marketing around it. In other words, they misinterpreted the results of the synthetic test and, in turn, assumed a direct correlation between the test and their real-life market share.

"Pepsi's dominance in blind taste tests never translated to much in the real world. Why not? Because in the real world, no one ever drinks Coca-Cola blind." – Malcolm Gladwell

Whilst an abstract example, this does show the danger of not finding the right balance between synthetic and real user results. On the web, a similar case would be running your synthetic tests on the latest iPhone over a fast 4G connection in the UK and assuming you've got a well-performing website. However, if your users are actually on low-to-mid range Android devices with patchy 3G connections in South America, then there's a very high chance they are getting a completely different experience to the one you've tested against.

Aligning the restrictions on those variability sources to closely match the configurations commonly found among your users is crucial if you want synthetic monitoring to be an effective way of benchmarking your performance. If you get this right, you'll be able to confirm your site's performance using synthetic tests before you release changes to production, giving you a great early-warning system which you can then validate with your real user monitoring once the changes reach production.
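One hypothetical way to approach that alignment is to let your RUM data tell you what the dominant real-user profile actually is, then base your synthetic throttling on it. The beacon shape and field names below are made up for illustration; in the browser, fields like these can be populated from navigator.connection.effectiveType and navigator.deviceMemory, neither of which is supported in every browser.

```ts
// Illustrative RUM beacon shape – not the schema of any specific RUM product.
interface RumBeacon {
  effectiveConnectionType: string; // e.g. '3g', from navigator.connection.effectiveType
  deviceMemoryGb: number;          // e.g. 2, from navigator.deviceMemory
}

// Tally beacons by connection type and device memory and return the most
// common combination, which you could then mirror in your synthetic config.
function dominantProfile(beacons: RumBeacon[]): string {
  const counts = new Map<string, number>();
  for (const b of beacons) {
    const key = `${b.effectiveConnectionType} / ${b.deviceMemoryGb}GB`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

const beacons: RumBeacon[] = [
  { effectiveConnectionType: '3g', deviceMemoryGb: 2 },
  { effectiveConnectionType: '4g', deviceMemoryGb: 8 },
  { effectiveConnectionType: '3g', deviceMemoryGb: 2 },
];
console.log(dominantProfile(beacons)); // -> "3g / 2GB"
```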

Cover photo by Ja San Miguel, laboratory photo by ThisisEngineering RAEng and SpeedCurve dashboard photo by Luke Chesser. All from Unsplash.