published 2020-09-02
On 2020-08-27 I described an application based on Andy Matuschak's ideas for a writing inbox, his system for personal note-taking and writing. The envisioned application lets a person write daily and prompts them daily to continue working on ideas and items they have already started.
One of the questions that arises is, How do we store the written pieces? Some requirements are clear:
In addition, the question arises of making the content available to the world -- of publishing it. This question got me thinking about content addressing and from there about using a system like IPFS to store and distribute the content.
Content addressing is a good way to uniquely identify a piece of content and to verify that the content you receive matches the content you expected -- because you yourself can perform the hashing operation and make a comparison.
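As a minimal sketch of that comparison (the function names here are my own, and SHA-256 is assumed as the hash), the receiver re-hashes what arrived and checks it against the address that was requested:

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return the hex SHA-256 digest of data, used here as its address."""
    return hashlib.sha256(data).hexdigest()


def matches_address(data: bytes, expected: str) -> bool:
    """Re-hash the received data and compare with the address we asked for."""
    return content_address(data) == expected


address = content_address(b"This is my content")
print(matches_address(b"This is my content", address))      # same bytes
print(matches_address(b"This is NOT my content", address))  # altered bytes
```

Any change to the bytes, however small, produces a different digest, so the second check fails.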
Decentralized storage is a separate question, but also a fascinating possibility. If all content is in decentralized storage, then publishing the content is just a matter of registering the content with an index or registry.
Decentralization is an important part of the process of making the world safe for all users and fighting against the power of large corporations and governments. It also makes implementing an application significantly more complicated.
My intuition is that the core writing application should first be built with content addressing but without decentralized storage. Then various paths to decentralization can be explored, and the application can be lifted and shifted later.
So, what is the file storage method for the first version of writing inbox?
What I like the best: Minio. Every item is hashed based on its content and stored as an object under that key. Images and other media are similarly hashed and stored.
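A rough sketch of the idea, with a plain dict standing in for a Minio bucket (an assumption made to keep the example self-contained; `bucket` and `put` are hypothetical names, not Minio's API):

```python
import hashlib

# Stand-in for an object-storage bucket: maps object key -> stored bytes.
bucket = {}


def put(data: bytes) -> str:
    """Store data under its SHA-256 hex digest and return that key."""
    key = hashlib.sha256(data).hexdigest()
    bucket.setdefault(key, data)  # identical content always lands on the same key
    return key


note_key = put(b"# A note\n\nSome markdown text.")
image_key = put(b"\x89PNG placeholder image bytes")
same_key = put(b"# A note\n\nSome markdown text.")

print(note_key == same_key)  # storing the same content twice yields one object
print(len(bucket))
```

One consequence of keying by content is that duplicate uploads cost nothing: text and media alike are stored once per distinct byte sequence.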
What about metadata? Is it versioned with the file? Is it included in the hash? In a blockchain, metadata is part of the data that is included in the hash. It also makes sense to do so in order to prove provenance, to indicate content type, etc.
In this case, every node would have two "files" (objects) under the hash key:
The hash would simply be computed over content + metadata. Full algorithm:
>>> import orjson, hashlib, base64
>>> content = b'This is my content'
>>> md = {'content-type': 'text/markdown', 'filename': '2020-09-02.md', 'author': 'Sean Harrison <sah@kruxia.com>'}
>>> metadata = orjson.dumps(md)
>>> h = hashlib.sha256()
>>> h.update(content + metadata)
>>> key = base64.urlsafe_b64encode(h.digest()).rstrip(b'='); print(key)
b'MUgKgOczPTnPjVfq7AWaw91JHLdLLfJftj_HNRlUE18'
>>> # now store content and metadata under key
Note that a base64-encoded SHA256 hash has a single trailing =, which we can trim from the key. (Similarly, a base64-encoded SHA512 hash has two trailing =.)
In object storage, each "key" will have two files. In the above example, we will have:
These files are immutable - or at least, if they are changed, it will be obvious, because the content + metadata can be re-hashed and compared with the key prefix.
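That re-hash check could look something like the following sketch. It assumes the key is the unpadded URL-safe base64 SHA-256 of content + metadata, as in the algorithm above; `make_key` and `verify` are hypothetical helper names, and the metadata bytes are an illustrative example of what might be stored.

```python
import base64
import hashlib


def make_key(content: bytes, metadata: bytes) -> bytes:
    """Hash content + metadata and return the unpadded URL-safe base64 key."""
    digest = hashlib.sha256(content + metadata).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=")


def verify(key: bytes, content: bytes, metadata: bytes) -> bool:
    """Re-hash the stored bytes and compare the result with the key."""
    return make_key(content, metadata) == key


content = b"This is my content"
metadata = b'{"content-type":"text/markdown"}'  # stored metadata bytes (example)
key = make_key(content, metadata)

print(verify(key, content, metadata))                # untouched objects
print(verify(key, content + b" (edited)", metadata)) # content was altered
print(verify(key, content, metadata + b" "))         # metadata was altered
```

Because verification works on the stored bytes, the metadata must be kept exactly as it was serialized when the key was computed; re-serializing it could change the bytes and therefore the hash.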
That seems good enough for a first version of the software. We can always iterate and improve. In the future, I would like to integrate with IPFS or another decentralized filestorage system, and with blockchains in general. But first, a standalone version of the app!
Web application with a PostgreSQL database and Minio file storage, hosted with Docker stack on a single-node cluster (a virtual machine).
Backend: Written in Python with
Frontend: JavaScript with
(This is becoming my standard stack)