
Post-ORM via SQL

published 2023-07-08

Recently "Sketch of a Post-ORM" was on HN. The author elucidates some shortcomings of ORMs (some of which are straw men) and of raw SQL, before sketching the shape of the database schema and query management system that he wants. The comments mostly argue that ORMs are better than the author’s straw men – for example, ORMs do in fact make efficient queries most of the time, and you can always drop into raw SQL for the other 5%. A few of the comments also point to other non-ORM query and schema management systems, including:

  • PRQL = a new pipelined query language that compiles to SQL
  • sqlc = a compiler from SQL queries to data access objects (for {MySQL, PostgreSQL} in {Go, Kotlin, Python}).
  • quill = “compile-time language integrated queries for Scala”
  • Ecto = Schema and query system for Elixir

As a long-time application developer and relational database (PostgreSQL) user, I feel the pain of both raw SQL and ORMs. Stripping away the straw men, the pain points can be summarized as follows:

Raw SQL: Shortcomings

  1. Queries are very verbose to write. If you’re an application developer, mostly you just want CRUD queries, and these become very repetitive.
  2. Raw SQL is error prone for inexperienced developers. (SQL injection mostly went away when, as an industry, we moved away from raw SQL in queries.php files and started using ORMs.)
  3. SQL queries are not composable. Most application programming requires queries with dynamically generated filtering, ordering, etc. (Think of a search endpoint that takes a variety of filtering parameters.) These queries are quite difficult to get right in raw SQL (and can be very simple to create using an ORM or language-integrated query generator).
  4. Migrations and schemas written in SQL are hard to convert to a language-integrated data access layer, because parsing SQL DDL correctly essentially requires using the database’s own parser. (NB: So why not do that?)
  5. Migrations written in SQL don’t allow for application logic.

For me, the shortcomings of raw SQL primarily have to do with the difficulty of composing more complex queries. I am comfortable writing SQL and don’t find it much of a burden to build a set of generic queries that can do CRUD on tables. I also think that inexperienced developers ought to bone up on SQL if they are going to develop applications that query databases: Learning and mastering at least one dialect of SQL is a basic skill and requirement for backend developers, because they are going to be using SQL for the rest of their career. (Despite the attempts to replace SQL, I don’t see it going anywhere, though I do think that things like PRQL will become more popular for analytical work.)

ORMs: Shortcomings

  1. ORMs are fantastic for 95% of application CRUD queries – the simple cases are very fast and easy – but the complex cases can be very difficult (though all ORMs do provide a means to drop into raw SQL as needed). (NB: This shortcoming mostly turns out to be a non-issue because of the ability to drop into SQL.)
  2. ORMs require learning a DSL that is only available in that programming language with that framework. The non-transferable time you’ve spent learning all the ins and outs of that DSL would be better spent mastering transferable SQL.
  3. Leaky Abstractions – when things go wrong, the ORM can make it much more difficult to understand what is wrong. (It doesn’t help matters that many ORMs don’t provide a way to generate the resulting SQL and inspect it.)

With #1, I have had good experiences using ORMs for CRUD applications. So I am mostly concerned about #2 and #3. With #3, I have had some very bad experiences debugging issues with complex ORM queries. The composability of queries in, say, SQLAlchemy leads some development teams to write queries in small pieces scattered over several files, which makes them impossible to understand in code, and the way the query functions are written can make it very difficult to render those queries into SQL to understand them.

These issues are compounded by #2: If you are not an expert in the DSL in question, it can be very difficult to understand what the query is doing. I have used both the Django and SQLAlchemy ORMs, but I’ve never become an absolute expert in any ORM’s DSL. If I’m writing a query and it gets to a certain level of complexity, such as needing a CTE or recursion, I will drop into SQL, which I have spent more time and effort mastering. This has proven wise, because it makes it possible for me to be very productive on a wide variety of projects: No matter what ORM is being used, CRUD queries are easy, and complex queries can be done in SQL.

#3 is also an issue with ORM migration systems: Typically, you write the table as a class, and then the migration system generates a language-integrated script that applies changes to the live database. Lots of layers of abstraction over SQL DDL there, when it would be so much easier just to write SQL CREATE TABLE and ALTER TABLE statements.

In the end, most of my issue with ORMs is that, although they provide some conveniences for doing CRUD queries, especially those with filtering driven by business logic, they add too many layers of abstraction over SQL, which developers should understand and not have hidden from them.

Solutions with SQL

From my point of view, the ideal solution to the shortcomings of both ORMs and raw SQL will have the following characteristics:

  • Automatically generate SQL for most simple CRUD queries that application developers write.
  • Compose complex queries with business logic while staying close to SQL.
  • Write schema migrations in SQL.
  • Write data migrations in common data languages: CSV, JSON, YAML, XML.

Generate CRUD SQL

Given a relation name, fields, and filters, it is straightforward to generate SQL for

  • SELECT – relation name, iterator of fields to select, and filters
  • INSERT – table name and key-value field mapping
  • UPDATE – table name and key-value field mapping and filters
  • DELETE – table name and filters

Key-value mappings are handled in a consistent manner:

  • The keys are column names.
  • The values are usually column values but can be something else (especially in the case of filters).

Filters can be modeled as a list of key-value + operator relationships.

  • In the simple case, the operator is =.
  • Other operators can be string representations of the SQL operator (e.g., <, >=, LIKE). A sketch of generating the write queries this way follows below.
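
To make this concrete, here is a minimal sketch (with hypothetical helper names) of generating INSERT, UPDATE, and DELETE from a table name, a key-value mapping, and a list of rendered filters:

# A minimal sketch -- helper names are hypothetical, and parameter values
# are never inlined: they appear as named colon-strings (:name).

def insert(table: str, data: dict) -> str:
    # INSERT INTO table (col, ...) VALUES (:col, ...)
    cols = ", ".join(data)
    placeholders = ", ".join(f":{key}" for key in data)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"

def update(table: str, data: dict, filters: list) -> str:
    # UPDATE table SET col = :col, ... WHERE filter AND ...
    assignments = ", ".join(f"{key} = :{key}" for key in data)
    return f"UPDATE {table} SET {assignments} WHERE {' AND '.join(filters)}"

def delete(table: str, filters: list) -> str:
    # DELETE FROM table WHERE filter AND ...
    return f"DELETE FROM {table} WHERE {' AND '.join(filters)}"

print(insert("items", {"a": 5, "b": "two"}))
# INSERT INTO items (a, b) VALUES (:a, :b)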

Compose Complex SQL

Joins, CTEs, window functions, grouping – all of these are often useful, and many ORMs spend a lot of effort modeling them in application code. Instead, it would be a lot simpler to facilitate the more direct use of SQL.

What is the simplest thing that could possibly work? That would be defining queries as a sequence of SQL fragments and “composing” these fragments into a single SQL query. For example:

# Here's a simple Python example
database = ...  # some database connection object
tablename = "items"
fields = ["*"]
params = {"a": 5, "b": "two"}
ops = {"a": "<"}
filters = {key: ops.get(key, "=") for key in params}

query = " ".join([
    "SELECT",
    ",".join(fields),
    "FROM",
    tablename,
    "WHERE",
    " AND ".join(f"{key} {op} :{key}" for key, op in filters.items())
])
print(query)  # SELECT * FROM items WHERE a < :a AND b = :b
results = database.fetch_all(query, params)

The above query is a string that is composed from other strings. The library can assist with ergonomics around filters, fields, and parameters, as well as with the simple CRUD cases – this SELECT query is essentially what the CRUD helper would produce. Here are some simple helpers, collected under a Q namespace, that would make writing this query a breeze in real life:

from collections.abc import Sequence
from typing import Any

class Q:
    """A namespace of simple query-building helpers."""

    @staticmethod
    def fields(data: Sequence) -> str:
        """
        Render a comma-separated list of field names from the given fields,
        e.g. for dynamically specifying SELECT or INSERT field lists.
        """
        return ", ".join(data)

    @staticmethod
    def filter(key: str, *, op: str = "=", val: Any = None) -> str:
        """
        Render a filter from the given field key, optional operator, and optional
        value. If no value is given, use a named :key parameter.
        """
        return f"{key} {op} {val if val is not None else ':' + key}"

    @staticmethod
    def select(relation: str, fields=None, filters=None, orderby=None, limit=None) -> str:
        """
        SELECT fields
            FROM relation
            [WHERE filters]
            [ORDER BY orderby]
            [LIMIT limit]
        """
        fields = fields or ["*"]
        query = [
            f"SELECT {Q.fields(fields)}",
            f"FROM {relation}",
        ]
        if filters:
            query.append(f"WHERE {' AND '.join(filters)}")
        if orderby:
            query.append(f"ORDER BY {orderby}")
        if limit:
            query.append(f"LIMIT {limit}")
        return " ".join(query)

# For the SELECT query above, we still have to set the data up...
tablename = "items"
fields = ["*"]
params = {"a": 5, "b": "two"}
ops = {"a": "<"}
filters = [Q.filter(field, op=ops.get(field, "=")) for field in params]

# ... but then, the query is just:
query = Q.select(tablename, fields=fields, filters=filters)

results = database.fetch_all(query, params)

This approach can also be used for arbitrarily complex cases. Many people pooh-pooh that, but are other approaches really better? The key thing that requires care is never to include the parameter value directly in the query string, but to let the database interface match values to the query via parameters.
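
For example, a CTE can be composed from the same kinds of fragments. Here is a hedged sketch reusing the Q.select helper from above, with invented table and column names:

# Composing a CTE from fragments -- table and column names are invented.
recent = Q.select("orders", filters=["created_at > :since"])

query = " ".join([
    f"WITH recent_orders AS ({recent})",
    "SELECT customer_id, count(*) AS order_count",
    "FROM recent_orders",
    "GROUP BY customer_id",
    "ORDER BY order_count DESC",
])
results = database.fetch_all(query, {"since": "2023-01-01"})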

There are other approaches, most of which involve a language-integrated DSL for generating SQL queries. Common Lisp has CLSQL and Postmodern, both of which define a query generator that uses Lisp expressions to generate SQL. These libraries stay so close to SQL that the Lisp expressions translate directly to SQL. Other language-specific query-generating libraries are less transparent; SQLAlchemy is perhaps the most complex and complete. This approach suffers from problems #2 and #3 above: leaky abstractions, and learning a DSL when learning SQL would be a better use of time and effort.

Another approach would be to use a library like PRQL and to write queries as a sequence of PRQL fragments that are compiled to SQL. This would be very similar to using the features of the language to compose SQL directly, but using a query language that is perhaps more amenable to composition.
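
A minimal sketch of what that might look like, assuming the prql-python package's compile() function:

# A sketch assuming the prql-python package; fragments compose line by line.
import prql_python as prql

fragments = [
    "from items",
    "filter a < 5",
    "sort b",
]
print(prql.compile("\n".join(fragments)))  # prints the equivalent SQL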

For my money, composing SQL with strings and using a few helper functions for things like CRUD queries is the best all-around approach.

SQL Migrations

I would much rather define a table or add a column in SQL than in some programming language’s DSL.

CREATE TABLE users (
    id    UUID    PRIMARY KEY DEFAULT gen_random_uuid(),
    name  varchar NOT NULL
);
-- some time later...
ALTER TABLE users ADD COLUMN email varchar;

There is nothing that a schema migration should do that can’t be done by this method. (Someone might argue for the ability to create these definitions dynamically in application code. I have never seen the need for this, but if it arose, it could easily be accommodated using the query composition methods above.)
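
For completeness, here is a sketch of what dynamically composed DDL could look like with the same string-composition approach (the column definitions are illustrative):

# Hypothetical: composing DDL dynamically with the same string approach.
columns = {
    "id": "UUID PRIMARY KEY DEFAULT gen_random_uuid()",
    "name": "varchar NOT NULL",
}
coldefs = ", ".join(f"{name} {definition}" for name, definition in columns.items())
print(f"CREATE TABLE users ({coldefs})")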

Data Migrations

A migration tool should also have a method for loading seed data from CSV, JSON, or YAML.

# A YAML example
users:
- name: connieroberts
  id: '8992e675-ac3a-4a1c-8816-293f37e62d0b'
- name: robertthomas
  id: '26b9481d-e11d-413f-8389-a153342ea67c'
- name: joyharris
  id: '923919f2-30a9-4dbe-abb2-f7d70f4407c0'
- name: andrewjohnston
  id: 'ecfe5a2b-0378-4d5f-8cec-93b28e792463'
- name: susancross
  id: 'f4ceacdf-ea1a-4096-800d-04da0caadbdf'

# users.csv - using pipe `|` as a delimiter because it is uncommon in field data.
id|name
8992e675-ac3a-4a1c-8816-293f37e62d0b|connieroberts
26b9481d-e11d-413f-8389-a153342ea67c|robertthomas
923919f2-30a9-4dbe-abb2-f7d70f4407c0|joyharris
ecfe5a2b-0378-4d5f-8cec-93b28e792463|andrewjohnston
f4ceacdf-ea1a-4096-800d-04da0caadbdf|susancross
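
Here is a hedged sketch of what loading that seed data could look like, assuming PyYAML and reusing the hypothetical insert() helper and database object from above:

# A sketch assuming PyYAML and a database object with an
# execute(query, params) method; insert() is the hypothetical helper above.
import yaml

with open("users.yaml") as f:
    seed = yaml.safe_load(f)  # {"users": [{"name": ..., "id": ...}, ...]}

for table, rows in seed.items():
    for row in rows:
        database.execute(insert(table, row), row)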

Migration Sequencing: a DAG of tasks

A migration tool should be able to handle not just a single sequence of migrations, but branching and merging migrations. Merging is not an easy problem to solve in terms of schema maintenance, so the migration tool shouldn’t try to solve merging – the developers have to ensure that a merge does not create conflicts. But the migration tool should be able to process merge migrations.

It should also be able to handle situations where a developer switches branches but has migrations in the database from a branch that is no longer present in the file system – the migration system should be able to reverse out of migrations that are in the database but not in the codebase.
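
To picture how this could work (a sketch, not a description of any particular tool): each migration declares the migrations it depends on, applying them is a topological sort of the resulting DAG, and a merge migration simply depends on both branch heads:

# Migration sequencing as a DAG, sketched with the standard library.
from graphlib import TopologicalSorter

migrations = {
    "001_create_users": [],
    "002a_add_email": ["001_create_users"],     # branch A
    "002b_add_profiles": ["001_create_users"],  # branch B
    "003_merge": ["002a_add_email", "002b_add_profiles"],  # merge migration
}

order = list(TopologicalSorter(migrations).static_order())
print(order)  # apply in this order; walk it in reverse to back out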

The Django migration tool has the ability to incorporate migrations defined in other applications that are imported into (but not present in) the current codebase. This is an extremely nice, labor-saving feature that enables the creation of applications that can be imported and used in other applications, including the migrations that make up the data models. I have not seen this feature elsewhere.

SQLY: A library for SQL queries and migrations

A few years ago (May 2019), I set out to solve these issues for myself: I started a Python library called SQLY with the following goals:

  • Write SQL in SQL. No bespoke DSL.
  • Dialect-aware safe value substitution. Different database adaptors have different syntax for substituting parameters into queries. Some use positional numbering ($1 … $N) or question marks (?), others use named percent-strings (%(name)s) or colon-strings (:name) to indicate parameters that should be substituted. SQLY opts to use named colon-strings and automates generating SQL with the correct dialect’s syntax.
  • Dynamic CRUD queries. For the simple cases.
  • Block composition. Compose SQL queries through a sequence of query blocks.
  • Multi-application, branching SQL and data migrations. The migration engine should handle all of the above-described use cases for migrations, including multiple applications, multiple branches, and both SQL and data migrations.
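
As an illustration of the substitution idea (not SQLY's actual API), a naive converter from colon-strings to other dialects' placeholder syntax might look like this:

# Illustrative only -- not SQLY's actual API. A naive colon-string converter.
import re

def render(query: str, params: dict, dialect: str = "postgres"):
    names = re.findall(r":(\w+)", query)
    if dialect == "postgres":  # positional numbering: $1 ... $N
        for i, name in enumerate(names, 1):
            query = re.sub(rf":{name}\b", f"${i}", query)
        return query, [params[name] for name in names]
    if dialect == "mysql":     # named percent-strings: %(name)s
        return re.sub(r":(\w+)", r"%(\1)s", query), params
    return query, params       # e.g. sqlite3 accepts :name directly

print(render("SELECT * FROM items WHERE a < :a AND b = :b", {"a": 5, "b": "two"}))
# ('SELECT * FROM items WHERE a < $1 AND b = $2', [5, 'two'])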

As of this writing, SQLY has achieved these goals: the query and migration engines work as advertised. But some of the code throughout the library needs to be cleaned up, and it has no documentation (!) and no tests (!!!). Until those things are completed, it is alpha-quality software.