The simplest SQLite common table expression tutorial
I've been trying to wrap my head around Common Table Expressions for a while, and all the tutorials I've read started out with "simple" examples that were way too advanced for me to follow. Here's my attempt to write a tutorial that starts as simple as possible.First, let's start with the simplest query:
sqlite> SELECT 1;
1
sqlite>
All this does is return a result set containing a row. Next, consider the simplest subquery:
sqlite> SELECT * FROM ( SELECT 1 );
1
sqlite>
This just selects all the results from the subquery -- which in this case, is just a single row. A "Common Table Expression" is basically the same as a subquery, except assigned a name and defined prior to the query in which it's referenced. Accordingly, the simplest CTE version of the above query would be like:
sqlite> WITH one AS ( SELECT 1 )SELECT * FROM one;
1
sqlite>
Breaking that down a bit further:
We've defined a common table expression named "one"
We've "filled" it with the output of SELECT 1, which is just 1 row
Then we selected everything from "one"
Such that the final result is a single value: 1
But a CTE can have multiple columns, too, and those columns can be assigned names:
sqlite> WITH twoCol( a, b ) AS ( SELECT 1, 2 )
SELECT a, b FROM twoCol;
1|2
sqlite>
Similarly, a CTE can query other tables:
sqlite> CREATE TABLE foo ( bar INTEGER );
sqlite> INSERT INTO foo VALUES(1);
sqlite> INSERT INTO foo VALUES(2);
sqlite> SELECT * FROM foo;
1
2
sqlite> WITH fooCTE AS (SELECT * FROM foo)
SELECT * FROM fooCTE;
1
2
sqlite>
Additionally, you can define as many CTEs as you want in a single query:
sqlite> WITH aCTE AS (SELECT 'a'),
bCTE AS (SELECT 'b')
SELECT * FROM aCTE, bCTE;
a|b
sqlite>
So, common table expressions can be used to restructure a query to make it more readable, by moving the subqueries out in front. But the real power of common table expressions is when you define an expression that recursively selects itself. They key to this is using a "Compound Select Statements", such as the UNION ALL operator. This just combines two result sets into one (so long as they have the same number of columns):
sqlite> SELECT 1, 2
UNION ALL
SELECT 3, 4;
1|2
3|4
sqlite>
Take this example:
sqlite> WITH RECURSIVE infinite AS (
SELECT 1
UNION ALL
SELECT * FROM infinite
)
SELECT * FROM infinite;
1
1
1
1
1
^C
Runtime error: interrupted (9)
sqlite>
Let's break down why that query will never finish:
"WITH RECURSIVE infinite" defines a common table expression named "infinite"
"SELECT 1" seeds that CTE's output with a single row -- containing "1"
Next the "UNION ALL" says "combine the output of what's on the left, with the output of what's on the right
And on the right we do "SELECT * FROM infinite" -- meaning, select everything currently in the table.
The result is we're defining a common table expression named "infinite" to be the union of "a single row" and "all other rows".
Because no "cap" has been placed on this (via a WHERE or LIMIT), this means we've defined an infinitely recurring CTE. Fun!
So we can "cap" that CTE by writing a query like:
sqlite> WITH RECURSIVE finite AS (
SELECT 1
UNION ALL
SELECT * FROM finite LIMIT 2
)
SELECT * FROM finite;
1
1
sqlite>
This does the same basic thing, but we've limited the number of possible results to only be 2. Ok, so that's all well and good, but what is this good for? It turns out, a lot. Say you wanted to generate a table on the fly containing the numbers one through ten:
sqlite> WITH RECURSIVE ten(x) AS (
SELECT 1
UNION ALL
SELECT x+1 FROM ten WHERE x<10
)
SELECT * FROM ten;
1
2
3
4
5
6
7
8
9
10
sqlite>
To do this, we've defined a CTE named "ten", with a single column named "x" (the column name is optional, but in this case we need it to refer to later). Then in the recursive UNION ALL, we keep adding one more row to the result set -- each one larger than the row before -- until we reach a limit of 10.
So CTEs can be used to generate a wide array of different types of data -- such as date ranges, perhaps to join against when doing a historical analysis against a sparse dataset (where some months have no data, so a simple group-by won't suffice):
sqlite> WITH RECURSIVE dates(x) AS (
SELECT '2015-01-01'
UNION ALL
SELECT DATE(x, '+1 MONTHS') FROM dates WHERE x<'2016-01-01'
)
SELECT * FROM dates;
2015-01-01
2015-02-01
2015-03-01
2015-04-01
2015-05-01
2015-06-01
2015-07-01
2015-08-01
2015-09-01
2015-10-01
2015-11-01
2015-12-01
2016-01-01
sqlite>
But recursive CTEs can do more than just generate data. They're essentially a powerful programming language that can be used to perform complex transformations on data -- such as converting a comma-separated list into a table that can be joined against:
sqlite> WITH RECURSIVE list( element, remainder ) AS (
SELECT NULL AS element, '1,2,3,4,5' AS remainder
UNION ALL
SELECT
CASE WHEN INSTR( remainder, ',' )>0 THEN
SUBSTR( remainder, 0, INSTR( remainder, ',' ) )
ELSE
remainder
END AS element,
CASE WHEN INSTR( remainder, ',' )>0 THEN
SUBSTR( remainder, INSTR( remainder, ',' )+1 )
ELSE
NULL
END AS remainder
FROM list
WHERE remainder IS NOT NULL
)
SELECT element FROM list WHERE element IS NOT NULL;
1
2
3
4
5
sqlite>
To explain a bit more about how that one works:
Imagine a table with two columns: element, and remainder. The "element" of each row contains a single element of the list, whereas "remainder" contains everything that comes after it.
We start off by seeding the table with a single row containing a NULL element, and a "remainder" containing the list we want to transform: "1,2,3,4,5"
Then we define the recursive CTE to generate a new row containing the first element of the list (everything before the first comma), and the remainder (everything after the first comma)
Unfortunately sqlite doesn't support the IF syntax, so I'm using the slightly more confusing (but way more powerful) CASE syntax, but basically it's saying "if there is a comma, then everything before is an element, and everything after is the remainder. If there is no comma, then the whole thing is the element, and there is no remainder."
Finally, it says "keep recursing until there is no more remainder"
The result is a table containing one row per element, plus one NULL element (which is what we seeded the CTE's result table with)
So we select everything except for that NULL element, and voila, we've converted a comma-separated list into a table, purely in SQL.
Now that might all sound academic. But it's not uncommon for a database to have some table (eg, "nameValuePairs") that stores arbitrary data -- and sometimes that data is a comma separated list. Without the CTE, you can get as far as obtaining that list in the form of a string, but then you need to post-process it with Python or some scripting language to complete your analysis. With this CTE, you can convert the list into a standard result set, and then continue your analysis with the full power of SQL -- including joining that result set against other tables.
For example, consider if you had some Github-style commenting system, where users could refer to each other inline by username (eg, "This fixes a bug introduced by @quinthar, thanks @genintho for the tip!"). And imagine this were stored as a simple string in a "comments" table. Normally it would be impossible to write a pure SQL query to determine which users were referenced inside the comment, so as to look up which to notify. But you could modify the CTE above to replace the "comma" with an "@", and thus generate a list of all usernames mentioned in a comment. Then you could take that output and join against your account table to look up the email address for each of those users -- producing a clean list of distinct email addresses to notify. Indeed, you could then put that straight into an INSERT statement to insert a command into a "notifications" table, queueing the email for later sending.
So CTEs enable transformations to happen inside SQL that previously would force you to switch languages. We've already talked about list-based transformations. But the last I'll discuss sounds crazy: processing hierarchical information inside SQL.
Everybody knows relational databases are based on tables containing rows, and that you can join those tables together to produce -- in effect -- super large tables that can be analyzed as one. And many people know that you can join tables against themselves to do pretty tricky analyses. But without a CTE, at best you can join something against itself a fixed number of times -- however many times you write the query to do. Enter the CTE and you can effectively join a table against itself as many times as it takes to complete the analysis.
Take for example, a company's expense report approval hierarchy. This can be stored in a simple "company" table, where everybody's expense report "approver" can be easily looked up by the employee's "name". The approval structure for any specific employee is both their approver, their approver's approver, and so on until you reach someone who has no approver. The hierarchy might have different "depths" for different employees, so there's no way to write a non-recursive query to select an employee's approvers directly. But you can with a recursive CTE:
sqlite> CREATE TABLE company ( name, approver );
sqlite> INSERT INTO company
VALUES
( 'David', NULL ),
( 'Matt', 'David' ),
( 'Jason', 'David' ),
( 'Ryan', 'David' ),
( 'Mike', 'Matt' ),
( 'Carlos', 'Matt' ),
( 'Garrett', 'Jason' ),
( 'Puneet', 'Jason' ),
( 'Joanie', 'Ryan' );
sqlite> WITH RECURSIVE approvers(x) AS (
SELECT 'Joanie'
UNION ALL
SELECT company.approver
FROM company, approvers
WHERE company.name=approvers.x
AND company.approver IS NOT NULL
)
SELECT * FROM approvers;
Joanie
Ryan
David
sqlite>
The upshot of all this is Common Table Expressions allow for extremely powerful data generation, transformation, and hierarchical querying capabilities -- directly from within SQL. They take a bit to wrap your head around, but they open up an entirely new level of potential for in-database analysis and logic. Enjoy!