Scalar-Aggregate Comparison for High-Performance Correlated Aggregate SQL Queries

By Steven Moseley November 4th, 2019

Whoah, that title is a mouthful.  Let's break it down by starting with definitions.

  • Correlated Aggregate Queries: self-joining a table using an aggregated value, e.g. MIN(id), to find the correlated value of a different column, e.g. URL.
  • Scalar-Aggregate Comparison: the process of combining scalar and aggregate SQL functions to compare compare cross-columnar values in aggregate. (See below)

Why do we need to use Correlated Aggregate Queries?  Often, in SQL, you’ll want to find correlated data points of a given aggregate that aren’t simple to resolve from the grouping.

For example, finding the URL of the first page of a each user’s session over a period of time.

Believe it or not, this isn't a simple task.  Most developers will create a subquery to first identify the first pageview ID of each session, e.g.:

SELECT MIN(id) AS first_pageview_id
FROM pageview
GROUP BY session_id;

This will yield the first pageview’s id in the first_pageview_id result. However, retrieving the correlated url field isn’t a simple task.

The reason for this is because of how aggregation works - reference the Aggregation section again for the 3rd dimension explanation.  In the 3rd dimension, rows of data in a given aggregate dataset have no relativity. That is to say, the individual fields of a given row are no longer correlated, and the output of each aggregate function is independent of the output of the other aggregate functions.

So when I get the MIN(id), that has no correlation to the relative row of that MIN(id).

So how do we get the row?  There are a few methods analysts use to do this.

With Common Table Expressions (CTEs)

CTEs are a nifty recent addition to MySQL syntax.  They allow correlation with an aggregate expression's output, allowing indexes to pass through, while not requiring the creation of a temporary table.

To achieve what we want from the above problem using a CTE, we could write:

WITH first_pageview_cte AS (
    SELECT session_id, MAX(id) AS first_pageview_id,
    FROM pageview
    GROUP BY session_id
)
SELECT
    session_id,
    url
FROM pageview AS p
    INNER JOIN first_pageview_cte AS fp ON p.sessino_id = fp.sessino_id;

CTEs are very powerful, because they allow relationships between expression output while maintaining indexes and limiting problems with repetitive execution of portions of a query (which can happen with poorly designed subqueries).

The problem with CTEs are that they aren't available in all environments. In MySQL, only versions 8+ support CTEs.  This eliminates all but a handful of current production MySQL environments from implementing them.

Aggregate Subquery Method

The aggregate subquery method operates on the basis of being able to break this operation into 2 steps: first, a derived table generating the first_pageview_id of each row, then correlating that to the pageview table in a self-join to retrieve the correlated data-points.

The query looks like this:

SELECT
    fpv.session_id,
    pv.url
FROM (
    SELECT 
        session_id,
        MIN(id) AS first_pageview_id
    FROM pageview
    GROUP BY session_id
) AS fpv
INNER JOIN pageview AS pv
    ON fpv.first_pageview_id = pv.id
GROUP BY fpv.session_id;

As you can see, here we’re building an aggregated correlated subquery that gets the first_pageview_id for each session_id, then in an outer query joins to the pageview table again, so as to be able to retrieve the URL associated with the given first pageview.

This is the method that is most commonly used among programmers and analysts to find correlated aggregates, 1) because it works, and 2) because it can be written in a single query without complicated temporary table structures or stored procedures.

The drawbacks to this method are

  1. It is non-performant.  It requires a self-join on a correlated subquery, which will increase in computational complexity at an O-n² rate.  And the inability of MySQL to bubble-up indexes from Correlated Subqueries results in a massive performance loss when joining and grouping in the outer query.
  2. Furthermore, the requirement of duplicating the FROM clause on the inner and outer queries increases the Cyclomatic Complexity, the length of the query, and the chance of human error at a 2n rate.  This can get out of control when joining tables in the aggregate.

For these reasons, I always suggest avoiding this method.

Temp Table Method 

The aggregate temp-table method is a faster, simpler, better way of achieving the aggregate subquery method.  

CREATE TEMPORARY TABLE first_pageview (
    INDEX (session_id),
    INDEX (first_pageview_id)
)
SELECT 
    session_id,
    MIN(id) AS first_pageview_id
FROM pageview
GROUP BY session_id;

SELECT     fpv.session_id,     pv.url FROM first_pageview AS fpv INNER JOIN pageview AS pv     ON fpv.first_pageview_id = pv.id GROUP BY fpv.session_id;

The main benefit of this method is the ability to index the subquery via a Temp Table, which will eliminate most of the performance issues with the Correlated Subquery method.

The drawbacks of this method are:

  1. It still results in a self-join, which has the same computational complexity problems as the first method.
  2. It’s a multi-step query that requires a consistent session/connection with your database, and therefore isn’t ideal for use in programming solutions.  
  3. It also carries the problem of state persistence issues - if the temporary tables in use change state mid-analysis, it can impact the output of the final results in ways that aren’t easily detectable.

For these reasons, I suggest the 3rd method (and 4th when possible)

My Method:

Scalar-Aggregate Comparison

This is a method I invented that I really like for correlated aggregates in MySQL.

It is the simplest syntax (unless querying multiple columns) and best-performing of all methods for this approach, even outperforming CTEs.  However, that comes at the price of being probably the most complex concept to wrap your head around, and being prone to hard-to-diagnose data-malformation 

Let’s have a look:

SELECT 
    session_id,
    MIN(id) AS first_pageview_id,
    SUBSTRING(MIN(CONCAT(LPAD(id, 11, ‘0’), url)), 12) AS first_url
FROM pageview
GROUP BY session_id;

The idea of this solution is that we’ll create a single aggregate expression to extract all the details we need from a single query without any subqueries or joins required.

How it works is to create an expression for extracting values from one column relative to the MIN (or MAX) value from another column.

To do this, we go through this process:

  1. Concatenate the comparison column (id in this case) with the extraction column (url)
  2. Before concatenating, LPAD (left-pad) numeric comparison columns with 0s to ensure the numeric comparisons will function properly in a string comparison.  For example, when comparing the ids 2 and 11, the lesser of the two in a string comparison is “11”, because strings are compared character-by-character from left-to-right.  Left-padding these numbers to a length of 3 (“003” and “011”) will result ina correct comparison where “003” is the lesser of the two.
  3. Extract the minimum or maximum of the resulting concatenation.  In this case, we’ll get a value like “00000879476/landing_page_1”.  Note this consists of an 11-digit left-padded id and the remainder of the string comprising the correlated URL.
  4. Substring the final output to retrieve the url.  Here, we’re saying “give us everything from the 17th character on”.  This will yield the value “/landing_page_1”.

This method takes advantage of the fact that columns of a given row maintain correlation before aggregation, even though they lose correlation after the aggregation.

By putting the CONCAT() within the aggregate MIN() function, we’re able to manipulate the URL data into a manner that will be universally comparable based on the associated ID.  After the aggregation, we then strip away the ID portion of the value to yield the data we’re interested in.

Furthermore, the performance of this method far exceeds all other methods explained thus far, with a very admirable O-n computational complexity.

In addition to simplifying the load on the DB engine, this method additionally allows first and last records (as well as any other aggregates) in a single result.  

For example:

SELECT 
    session_id,
    COUNT(1) AS total_pageviews,
    MIN(id) AS first_pageview_id,
    SUBSTRING(MIN(CONCAT(LPAD(id, 11, ‘0’), url)), 12) AS first_url,
    MAX(id) AS last_pageview_id,
    SUBSTRING(MAX(CONCAT(LPAD(id, 11, ‘0’), url)), 12) AS last_url
FROM pageview
GROUP BY session_id;

No other method allows for first and last results in a single query.  In every other method, you must structure these as 2 separate queries or self-join twice (once for the first and once for the last).

The main downsides of this method are:

  1. Not super-simple to understand / may confuse others looking at the query,
  2. Requires lots of code for each column being extracted, and
  3. In the Concatenation process, casts everything to a CHAR, which may require additional casting to yield meaningful/comparable data, e.g. integers or decimal values that require additional comparison in an additional outer query.  Something like this may be done like so:
CAST(SUBSTRING(MIN(CONCAT(LPAD(id, 11, ‘0’), val)), 12) AS UNSIGNED) AS val,
CAST(SUBSTRING(MIN(CONCAT(LPAD(id, 11, ‘0’), dt)), 12) AS DATETIME) AS dt