
SQL Databases in Python

published 2023-07-22

Database Drivers

It is useful, when working with a database from an application programming language, to be able to use the same patterns to connect to and interact with the database, regardless of which database you are connecting to. Different projects use different databases. Working in Python, I have had projects that use PostgreSQL, MySQL, and MS SQL Server. In enterprise contexts, the relational database system is often not the application programming team's choice, and a developer might be called upon to work with a variety of databases in a given year or career. It is useful not to have to learn completely different idioms for connecting to and interacting with different relational databases. To address this issue, Python has the DB-API 2.0 standard (see PEP 249 -- and note the date on that baby!). For Python programmers, it makes the most sense to use database drivers that conform to this standard for projects that connect to relational databases. That way, you can learn a set of database interaction idioms once and use them on every project, regardless of what database server technology is in use. An incomplete list of DB-API 2.0 compliant drivers includes:

  • SQLite: sqlite3, built into the standard library
  • PostgreSQL: psycopg, which was adopted by Django and so became very widely used; v3 now supports asyncio with the same interface. (Early in the asyncio era, asyncpg provided a MUCH higher-performance driver, but with quite a different interface from DB-API 2.0. The psycopg project has closed most of the performance gap. Daniele Varrazzo, the primary developer of psycopg, wrote a very interesting blog post about his performance-improvement journey.)
  • MS SQL Server: pymssql.
  • MySQL: mysqlclient, which I haven't used, but is recommended by the Django project.
  • Oracle: python-cx_Oracle, which I also haven't used, but is recommended by the Django project.
  • Other databases: pyodbc. Need to connect to Teradata? MS SQL Server? Some other database that supports ODBC? Here's your driver.

Not all of these drivers support asyncio operations. When using these drivers, there are two possibilities:

  1. Wrap synchronous database interactions in run_in_executor to make them awaitable (see the sketch after this list). This has a performance penalty, but it can be kept quite small since the executor doesn't need to create a new thread on each call.

  2. Use an async driver that doesn't conform to the DB-API 2.0 spec. This is the approach taken by the databases project, which integrates with a number of non-DB-API-2.0 drivers and provides a very different interface.
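
Here is a minimal sketch of the first option, assuming psycopg2 as the synchronous driver and a hypothetical widgets table; the default executor reuses its threads across calls:

    import asyncio
    import psycopg2

    def fetch_widget(conn, name):
        # An ordinary, blocking DB-API 2.0 query.
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM widgets WHERE name = %s", (name,))
            return cur.fetchone()

    async def main():
        conn = psycopg2.connect("postgresql://localhost/mydb")
        loop = asyncio.get_running_loop()
        # None selects the default ThreadPoolExecutor; its threads are reused,
        # so the per-call overhead is a thread handoff, not thread creation.
        row = await loop.run_in_executor(None, fetch_widget, conn, "new name")
        print(row)

    asyncio.run(main())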

For my money and time, it is a better use of resources to standardize all python database operations and interfaces on the DB-API 2.0 spec. (It has been said that DB-API is a synchronous spec and not really applicable to an async context. Psycopg v3 has shown that this is not really true and that the same basic interface can be provided in an async context.)

(For a few years, I have been using the aforementioned databases library as my primary database interface. It exposes a nice API, but it is not DB-API 2.0 compliant, and the underlying database drivers are not, either, which means that when I inevitably want to drop down to the driver level I have to learn the ins and outs of a new database interface. Databases doesn't make this particularly easy to do, since it hides the driver interface pretty well.)

Standardizing on a spec doesn't remove all the need to adapt one's code to the driver, just most of the need. For example, when Python types that are not supported by the driver (such as custom types) need to be converted between Python and the database, rather than doing this in application runtime code, it is better to define type adaptors for the database driver. But DB-API 2.0 doesn't define how to adapt types beyond a base set of built-in types. Where the spec is silent, everyone comes up with their own solution. So if you need to adapt types that aren't covered by the driver, you're going to have to write type adaptors that are very likely to be specific to the driver in question. Abstractions are leaky, and there is no getting around having to deal with lower layers at some point when dealing with non-core or optional aspects of interacting with a database: Different drivers have slightly different implementations of the DB-API 2.0 spec. Even so, having the spec does keep this sort of nonsense to a minimum, especially for the core work of connecting to a database, running queries, and committing transactions.
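
As an illustration, here is a minimal sketch of driver-level type adaptation using psycopg2's register_adapter, with a hypothetical Point type; psycopg 3 and other drivers each have their own (different) registration mechanisms:

    from psycopg2.extensions import register_adapter, AsIs

    class Point:
        def __init__(self, x: float, y: float):
            self.x, self.y = x, y

    def adapt_point(point):
        # Render the value as a PostgreSQL point literal. AsIs inserts the
        # string verbatim, so this should only ever see numeric coordinates.
        return AsIs(f"'({point.x},{point.y})'::point")

    register_adapter(Point, adapt_point)
    # From here on, Point values can be passed as ordinary query parameters:
    # cursor.execute("INSERT INTO places (location) VALUES (%s)", (Point(1.5, 2.5),))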


Database Queries and ORMs

What I have said about database drivers applies even more to the primary way in which an application programmer interacts with a database: by executing queries. For my money and time, it is most valuable to master a single query language that I can use for any project on any database in any application programming language. That language, of course, is SQL.

But many python application programmers never or rarely write SQL queries, and when they do it is called “dropping down to raw SQL,” and it is frowned upon.

As with all conventional wisdom, there is history here: Early in the web era, database queries often consisted of raw SQL embedded in PHP or ASP web pages. A common mistake was to take form input and string-interpolate it into the SQL query, making it possible for a bad actor to seriously harm your database. This problem largely went away when people started using ORMs to construct and issue database queries. The ORM or query builder would take a model or parameters object, construct the query, and then issue it in a safe way. Ever since then, using “raw SQL” has been seen as dangerous and even a little bit illicit. But in fact every database driver has a way to issue queries with sanitized parameters, so there is no inherent danger in using SQL. The real issue is trusting user inputs. SQL injection still happens, even with an ORM, when unsavvy developers trust inputs from savvier bad actors. All applications should validate and sanitize user inputs against an expected schema of types and models, then issue parametrized queries using the validated inputs as parameters. Once we learn to do this, it becomes second nature, and SQL is no more scary than the ORM for issuing queries, at least as regards security.
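
For example, with a DB-API 2.0 driver that uses the %s parameter style (psycopg here; other drivers use ?, :name, or $1, and the connection string and table are hypothetical), the safe version is no harder to write than the dangerous one:

    import psycopg

    conn = psycopg.connect("postgresql://localhost/mydb")
    name = "anything a user typed"

    # UNSAFE: user input interpolated directly into the SQL string.
    # conn.execute(f"SELECT * FROM users WHERE name = '{name}'")

    # SAFE: the value travels separately from the SQL; the driver handles escaping.
    cursor = conn.execute("SELECT * FROM users WHERE name = %s", (name,))
    rows = cursor.fetchall()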

So, if most developers don’t use SQL, what do they use? They use ORMs. Now, the great thing about ORMs is that they make it super easy to issue the kinds of queries that an application programmer might want to issue. Here’s some examples with the Django ORM, using Widget as the data model class, widget as the local instance, and data as a dict of data that goes in a widget:

  • Create a widget: widget = Widget.objects.create(**data)
  • Update the widget: widget.name = 'new name'; widget.save()
  • Select the widget: widget = Widget.objects.get(name='new name')
  • Delete the widget: widget.delete()

What if we want to use “raw SQL” and a DB-API 2.0 database connection? It’s harder. Much harder:

  • Create a widget:

    keys = list(data.keys())
    # Let's assume this database driver uses `$1..$N` for parameters…
    params = [f"${i + 1}" for i in range(len(keys))]
    values = list(data.values())
    sql = f"""
        INSERT INTO widgets ({",".join(keys)})
        VALUES ({",".join(params)})
        RETURNING *
    """  # no values interpolated into the SQL! no SQL injection.
    cursor = connection.execute(sql, values)
    fields = [d.name for d in cursor.description]
    widget = dict(zip(fields, cursor.fetchone()))
    connection.commit()  # don't forget this!
    
  • Update the widget:

    widget["name"] = "new name"
    assigns = [f"{key}=${i+1}" for i, key in enumerate(widget.keys())]
    id_param = f"${assigns.index('id')+1}""
    values = widget.values()
    sql = f"""
    	UPDATE widgets SET {assigns}
    	WHERE id={id_param}
    """
    cursor = connection.execute(sql, values)
    connection.commit()
    
  • Select the widget:

    filters = ["name=$1"]
    values = [widget["name"]]
    sql = f"""
    	SELECT * FROM widgets
    	WHERE {" AND ".join(filters)}
    """
    cursor = connection.execute(sql, values)
    fields = [d.name for d in cursor.description]
    widget = dict(zip(fields, cursor.fetchone()))
    
  • Delete the widget:

    filters = ["name=$1"]
    values = ["new name"]
    sql = f"""
    	DELETE FROM widgets
    	WHERE {" AND ".join(filters)}
    """
    connection.execute(sql, values)
    connection.commit()
    

No wonder ORMs are so popular.

When doing these kinds of queries, nothing beats the simplicity and directness of an ORM. Application programmers will always want what a good ORM provides. (When I say “good ORM,” I mean a mature, battle-tested system. In the Python world there are two: Django and SQLAlchemy.)

The problems with ORMs come about when more complex queries are in play. For example, common table expressions (CTEs) are very useful for structuring complex queries, but Django does not natively support them (though there is an extension library). SQLAlchemy does support them, but in both cases you have to learn a custom DSL (domain-specific language) to construct them.

Here is where I start to balk: SQL is a great language for constructing queries against relational databases, and even if it isn’t great, it is ubiquitous. Effort invested in learning SQL will pay compounding dividends for the rest of one’s career. (Even if it is eventually replaced by something better, which it won’t be, there will always be lots of SQL in the world for the foreseeable future.)

For queries that go beyond the “basic CRUD” (create, read, update, delete) illustrated above, I see no reason to invest effort in learning any ORM’s DSL for making these queries. Instead, I am very quick to drop into “raw SQL” whenever my needs don’t align with the most common uses of the ORM.
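
To make this concrete, here is a sketch of a CTE query issued through the same connection.execute interface used above (the widgets and orders tables are hypothetical); there is no DSL to learn, just SQL:

    sql = """
        WITH recent_orders AS (
            SELECT widget_id, count(*) AS order_count
            FROM orders
            WHERE created_at > now() - interval '30 days'
            GROUP BY widget_id
        )
        SELECT w.*, coalesce(r.order_count, 0) AS order_count
        FROM widgets w
        LEFT JOIN recent_orders r ON r.widget_id = w.id
        ORDER BY order_count DESC
    """
    cursor = connection.execute(sql)
    fields = [d.name for d in cursor.description]
    widgets = [dict(zip(fields, row)) for row in cursor.fetchall()]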

There are, however, a couple of issues that I run into whenever I do this:

  • SQL is hard to compose. If the query is at all dynamic with varying requirements, it can quickly become taxing to construct SQL “by hand.” The most common dynamic queries are UPDATEs with varying fields and SELECTs with varying filters. These common cases are often best handled by the ORM. Less common cases are also less likely to need to be dynamically constructed, so raw SQL becomes more tenable.
  • The results of SQL queries often don’t map precisely to the ORM model objects, so these results have to be handled differently in application code. But there is a strong argument to be made for not using ORM model objects in application code / business domain logic — instead, keep the ORM model objects to the data access layer and have a separate set of classes for objects in the business domain.

In practice, these issues have not been insurmountable. The hardest time I have had was on a project where we were hacking the Django admin beyond reason and wanted to do various complex joins to display data in the admin. The admin requires Django model query managers, collections, and objects, but the queries we wanted to do were a bit too complex. Our hacky approach resulted in a lot of Python code. It would have been better to define some SQL database views that captured our needs and then create model classes to represent those views.
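
That approach looks something like the following sketch (the view and its columns are hypothetical): define the view in SQL, then map an unmanaged Django model onto it so the admin can treat it like any other model.

    # In a migration or SQL file:
    #   CREATE VIEW widget_summary AS
    #   SELECT w.id, w.name, count(o.id) AS order_count
    #   FROM widgets w LEFT JOIN orders o ON o.widget_id = w.id
    #   GROUP BY w.id, w.name;

    from django.db import models

    class WidgetSummary(models.Model):
        name = models.CharField(max_length=255)
        order_count = models.IntegerField()

        class Meta:
            managed = False              # Django never creates or migrates this "table"
            db_table = "widget_summary"  # it is really the view defined above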

The best system is one which combines the best of both worlds: easy-to-use object-to-query methods for the simple cases, and easy first-class access to parametrized SQL for the more complex cases.


Database Schema Definition and Migrations

Database drivers, queries, and now schema definition and migrations.

Most ORMs have a way to define database migrations “in code” rather than in SQL DDL. Usually the “in code” migrations are auto-generated by comparing the current definition of the data access (“model”) objects with the database itself. There are two benefits to this:

  • All code is in your application programming language. So convenient!
  • Once the data access / model objects are defined in code, they are also available to the application. For applications that use the data model for application domain logic, there is one data definition to rule them all.

The devil, however, is in the details, and those details can be devilish.

  • ORM migration libraries can only go so far in autogenerating migrations from data model definitions. Once you reach the end of what can be automated, you have the dubious privilege of writing migration code in a custom DSL (with all of the shortcomings that implies), or else dropping into raw SQL.
  • Be careful about using application-defined objects or methods in your migrations, because the migrations will use the current versions of those objects and methods, which might be different from when the migration was defined. (This is a perennial source of headaches in Django migrations.)
  • Data migrations are another source of headaches, since the most obvious way to do them — just use the model object definitions in the application code — is completely wrong, for the reason just described. (The sketch after this list shows the safe Django pattern: use the historical model provided by the migration state.)
  • Different developers working on different branches often create different changes to the database. Can those changes be merged without issue, or are there conflicts? One must carefully inspect what the migrations are doing and ensure that a merge can be done without revising the branch migrations.
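
Here is a minimal sketch of that safe pattern in a Django data migration; the app label, model, and migration names are hypothetical:

    from django.db import migrations

    def set_default_names(apps, schema_editor):
        # The historical model, frozen at this point in the migration graph,
        # not the (possibly newer) model imported from application code.
        Widget = apps.get_model("widgets", "Widget")
        Widget.objects.filter(name="").update(name="unnamed")

    class Migration(migrations.Migration):
        dependencies = [("widgets", "0002_add_name")]
        operations = [
            migrations.RunPython(set_default_names, migrations.RunPython.noop),
        ]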

What is the alternative? Write schema definitions and migrations in SQL DDL, which is perfectly designed for this purpose.

  • SQL migrations are usually not portable across databases, but they don’t need to be.
  • The SQL does not need to be parametrized; it can generally just be run.
  • Data migrations can also be handled in the same way.
  • There is no issue with drift between the database and the migration schema, as long as the migration system controls all changes to the database schema as it should.
  • Merging branches is still an issue that needs to be addressed by inspecting the migration code. It’s just SQL instead of application-language code, which might make it more obvious exactly what is changing in each branch and whether there is a merge conflict.

The primary downside of using SQL for database migrations is that it doesn’t result in application-language data access objects with which to simplify the creation of CRUD queries. So that means that the database schema is being defined in two places, potentially: In SQL, and in application data model objects. And, if business domain models are being kept separate from database model objects, that is possibly three places. That’s a lot of boilerplate.

Is there a “SQLy” way to get back to the benefit of a single set of definitions that we see in applications that use data model objects for both business domain logic and database schema definition? Or at least a way to combine the data access object definitions with the SQL schema definitions? There are two methods:

  1. Define the data access objects in the application language and then generate the schema and migrations from those definitions. Most ORMs follow this approach, with the benefits and downsides enumerated above.
  2. Define the schema and its migrations in SQL and then generate the data access objects from the current database schema. The only widely-used system that I am aware of to do this is Diesel for Rust. It is a goal for SQLY to provide this capability for Python as well.

The usual reason given for avoiding the second approach is that it is hard to parse SQL schema definitions. However, most database platforms provide a means of querying their system tables to determine the current schema, so parsing SQL is not needed: just query the system tables and construct a set of data access models based on the information in those tables.
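
For example, here is a minimal sketch of that kind of introspection against PostgreSQL's information_schema (the widgets table and connection string are hypothetical); a DAO generator would map these rows onto dataclass or msgspec field definitions:

    import psycopg

    conn = psycopg.connect("postgresql://localhost/mydb")
    cursor = conn.execute(
        """
        SELECT column_name, data_type, is_nullable
        FROM information_schema.columns
        WHERE table_name = %s
        ORDER BY ordinal_position
        """,
        ("widgets",),
    )
    for column_name, data_type, is_nullable in cursor.fetchall():
        print(column_name, data_type, is_nullable)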

The benefit of generating data access object (DAO) definitions is that the data access objects are automatically in sync with the current database schema. DAOs can either be generated statically (separate script or at compile time), or dynamically (application startup). As long as the migration system controls all changes to the database schema, the auto-generated DAOs will exactly match what is defined in the migrations, which is to say that the application code and the migration code are never out of sync the way they can be when defining migrations in application code using application-level objects.


The State of the Art in Python

If I had to advise a Python developer starting an application today what system to use, I would strongly recommend the following approach:

  1. Install the Diesel CLI and use it to manage all your schema definition and migrations in SQL. Diesel is the only widely-used tool I am aware of that supports defining migrations in SQL files. It has limitations, but it works great for most uses. The Diesel CLI is a standalone executable, it can easily be installed in a Docker container (see here or here), and it works with the most common open-source databases (PostgreSQL, MySQL, SQLite).
  2. Install one of the widely-used DB-API 2.0 drivers (psycopg, etc.) and use it directly to set up connections / pools to your database. Learn the DB-API 2.0 interface for connections, cursors, and transactions.
  3. Write data-access object (DAO) definitions using one of the schema definition libraries, such as Pydantic or (my new heartthrob) msgspec.
  4. Make CRUD queries easy by writing some simple methods for creating generalized CRUD queries that can take a DAO object, fields, and/or filters as parameters and return a SQL query and set of parameters that can be executed by your database connection. (The current development version of SQLY has some examples of these queries.)
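
As a minimal sketch of step 4 (not SQLY's implementation, just the shape of the idea), a generic INSERT builder can be a few lines, with the :name parameter style left to the driver or a translation layer:

    def insert_query(table: str, data: dict) -> tuple[str, dict]:
        # Column names come from trusted schema definitions, never from user input.
        columns = ", ".join(data.keys())
        placeholders = ", ".join(f":{key}" for key in data)
        return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", data

    sql, params = insert_query("widgets", {"id": 1, "name": "My Name"})
    # sql    == "INSERT INTO widgets (id, name) VALUES (:id, :name)"
    # params == {"id": 1, "name": "My Name"}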

I have lately been following my own advice and using this system for my own personal and commercial projects. It is solid and production-ready; it is slightly more verbose than the Django ORM, but only slightly, and it is much more in tune with my inner database administrator.

SQLY: Write SQL in SQL

The goal of SQLY is to address all of the concerns raised in this article. Its current status is:

  • SQLY can generate parametrized queries for a wide range of Python database drivers from a single database parametrization syntax (known to DB-API 2.0 as “named” parameters). Given the following inputs:

    data = {"id": 1, "name": "My Name"}
    query = "INSERT INTO my_table (id, name) VALUES (:id, :name)"
    

    It will produce a SQL query string and parameters that are correct for the database driver being used. (A sketch of this kind of parameter rewriting follows this list.)

  • SQLY has some very simple methods to generate CRUD queries given a set of data. For example:

    data = {"id": 1, "name": "My Name"}
    query = INSERT("my_table", data)
    

    These methods generally work with any iterable data (so schema objects might need to be cast to an iterable that returns field names).
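
For illustration, here is a rough sketch of what that parameter rewriting involves for a driver that uses the "format" paramstyle (%s), such as psycopg. This is not SQLY's implementation, and it ignores edge cases like :words inside string literals or ::type casts:

    import re

    def render(query: str, data: dict) -> tuple[str, list]:
        # Collect the :named parameters in order of appearance…
        names = re.findall(r":(\w+)", query)
        # …replace them with the driver's placeholder, and build a positional list.
        sql = re.sub(r":(\w+)", "%s", query)
        return sql, [data[name] for name in names]

    sql, params = render(
        "INSERT INTO my_table (id, name) VALUES (:id, :name)",
        {"id": 1, "name": "My Name"},
    )
    # sql    == "INSERT INTO my_table (id, name) VALUES (%s, %s)"
    # params == [1, "My Name"]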

These parts of the library are tested and working well in production in a variety of projects.

In progress work:

  • SQLY can be used to create and apply SQL-based migrations. The migration system handles both a single series of migrations as well as a DAG (directed acyclic graph) of migrations, so it can merge various branches. It can also incorporate migrations written in a dependency library, including SQLY itself.

    SQLY opts to define the migration in a YAML document, because that is the only single-file format that can incorporate the SQL strings for “up” and “down” migrations along with the metadata that the migration needs (well, unless I create a file named *.sql with comments for metadata and to separate different sections, but that approach is a hack and not solid data management). The only viable alternative is to do what diesel does: Put each migration in its own folder. Then that folder can have up.sql, dn.sql, and meta.yml. We started there and might end up back there, but for now I’m trying the single YAML document approach, which has as its primary drawback the lack of SQL syntax highlighting and formatting capabilities in VS Code.

The migrations part of the library is not yet fully tested or stabilized, so I cannot yet recommend it for production use, but it certainly works and I would love to have feedback from intrepid explorers.

Future possible work:

  1. Generate DAO object definitions from migrations for several of the most common database platforms. This requires standardizing on a schema library, and if that is not dataclasses in the standard library, adding that library as a runtime requirement. I am trying to avoid hard runtime requirements for this library.

  2. Improve the basic CRUD query builders. These queries are so, so simple. I have been ambivalent about whether it is a good idea to include them in the SQLY library; writing this article has convinced me that they are needed if SQLY is going to be wanted and used. So there is room for improvement to make them more useful, though I will always cast a suspicious eye on adding features that turn SQLY into Yet Another ORM. (YAORM sounds like a noise a cat might make.)

  3. (Maybe?) Create a DB-API 2.0 compliant wrapper that smooths out the differences between database drivers, making it possible for application developers to use a single interface no matter what driver they use. But creating an interface wrapper has many pitfalls around tracking and dealing with all the differences between all the drivers, so I'm not sure it's a good idea at all.

Maybe all we need is to…

  1. Create a query method. One of the things I like about the databases library is that it provides a nice interface to issue queries that return dict-like Record objects. A query method would take a DB-API 2.0 compliant connection, a SQL query, and parameters, and return a generator that yields dicts or DAO objects. It wouldn't do any transaction management; that is all outside the query itself. But this simple method could really improve the ergonomics of the library in daily use.
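
Something along these lines (a sketch, not SQLY's implementation):

    def query(connection, sql, parameters=None):
        # Run the query on a DB-API 2.0 connection and yield each row as a dict.
        cursor = connection.cursor()
        cursor.execute(sql, parameters or [])
        # Per DB-API 2.0, the first item of each description entry is the column name.
        fields = [d[0] for d in cursor.description]
        for row in cursor.fetchall():
            yield dict(zip(fields, row))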

(later that same day...)

SQLY vs. Django vs. raw DB-API 2.0

UPDATE: I went off and hacked together a Database interface class that fulfills the query-method desire described above: make it more ergonomic to issue queries without wrapping the database connection or dealing with transaction management (except in one case…).

How does SQLY do compared to Django and raw DB-API 2.0 for issuing basic queries?

  • Initialize the Database interface: (Everyone has to do this in some form, I just didn’t show it earlier for Django or DB-API 2.0 queries.)

    import os
    import psycopg
    from sqly import Database, queries, Q

    conn = psycopg.connect(os.environ['DATABASE_URL'])
    db = Database(dialect='psycopg')
    data = {...}
    
  • Create a widget:

    sql = queries.INSERT('widgets', data) + ' RETURNING *'
    widget = next(db.query(conn, sql, data))
    conn.commit()
    
  • Update the widget:

    widget["name"] = "new name"
    db.execute(conn, queries.UPDATE('widgets', widget, ["id=:id"]), widget)
    conn.commit()
    
  • Select the widget:

    results = db.query(
        conn,
        queries.SELECT('widgets', ['*'], [Q.filter("name")]),
        {'name': 'new name'},
    )
    widget = next(results)
    
  • Delete the widget:

    db.execute(conn, queries.DELETE('widgets', [Q.filter("id")]), widget)
    conn.commit()
    

(I’m still wrestling with the best way to handle query filters, so I used both styles that SQLY supports: ["id=:id"] is exactly what Q.filter("id") produces, but the latter is not clearly better and might be worse than the former.)

Conclusion:

  • SQLY is not as terse as Django, but it’s much closer. I can stand to write and read this code in the course of everyday development work.

  • At the same time, it has the advantage of transparently using widely-available database interfaces:

    • DB-API 2.0 connection objects are in play and visible
    • the SQL queries are still close to the surface here even though they are wrapped in our CRUD query builder methods.

    I like this transparency, as I like my database interactions raw and wriggling.

  • It also is much less error-prone than the raw DB-API 2.0 queries (I had several mistakes in those queries that I only found while creating these!). That is the tax we pay for hand-writing boilerplate, and exactly what a more ergonomic interface helps us avoid.

I’ll take these results as a win.