Filename Metadata
Files can be used as a makeshift database when the filenames themselves contain the metadata. A real database is better for nearly all purposes, but this approach is great for storing things like logs with huge chunks of data that aren't frequently needed. SQL TEXT and related datatypes, as well as NoSQL databases, can also accomplish this, but let's assume that a new database table or connection is unwanted for the use case.
The filename metadata pattern is similar to AWS Athena's partitioning, which uses key=value path segments like "month=12/day=01"; that layout is commonly known as Hive-style partitioning. The pattern I'm using in this post is similar, but it's closer to a custom JSON/query-parameter scheme. Here's an example for a blog which stores all posts in a directory:
"date_published::2025-11-16__author::Robert%20Tamayo__title::Filename%20Metadata__category::code__status::published__modified::2025-11-16__scheduled_publish::2025-11-16%2017:00.json"
There's no reason normal GET query strings couldn't be used here instead; the exact delimiter characters aren't important. The pattern has existed all along in ordinary GET parameters.
Why is this even relevant? Because filenames can be listed very quickly, much faster than opening each file one by one and parsing its contents. Once the files are listed, the filenames are parsed to recover the relevant metadata, and when a file's contents are actually needed, its name is already in hand. The parsing can be GET-parameter decoding, or the filenames can be JSON, base64-encoded, then decoded and parsed after listing. With any of these approaches, the result would be:
{
    "date_published": "2025-11-16",
    "author": "Robert Tamayo",
    "title": "Filename Metadata",
    "category": "code",
    "status": "published",
    "modified": "2025-11-16",
    "scheduled_publish": "2025-11-16 17:00",
    "filename": "base64-encoded filename, or raw filename"
}
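Here's a minimal sketch of that round trip in Python. The delimiters (:: between key and value, __ between pairs) and the .json suffix are taken from the example filename above; everything else, including the function names, is an assumption for illustration:

from urllib.parse import quote, unquote

def encode_filename(meta: dict) -> str:
    # Join key::value pairs with __, URL-encoding values so spaces
    # become %20. Assumes values never contain a literal "__" once
    # encoded; ":" is left unencoded to match "17:00" in the example.
    pairs = (f"{key}::{quote(value, safe=':')}" for key, value in meta.items())
    return "__".join(pairs) + ".json"

def decode_filename(name: str) -> dict:
    # Reverse the encoding after a directory listing; the file itself
    # is never opened. (removesuffix requires Python 3.9+.)
    meta = {}
    for pair in name.removesuffix(".json").split("__"):
        key, _, value = pair.partition("::")
        meta[key] = unquote(value)
    meta["filename"] = name  # keep the raw filename alongside the fields
    return meta

meta = decode_filename(
    "date_published::2025-11-16__author::Robert%20Tamayo__title::Filename%20Metadata__category::code__status::published__modified::2025-11-16__scheduled_publish::2025-11-16%2017:00.json"
)
print(meta["author"])  # Robert Tamayo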
For logs such as status codes from REST APIs, this sort of pattern works well. The file's contents aren't needed unless they are requested, and those contents can contain encyclopedias if desired. The relevant metadata is right there in the listing, showing things like log date, response type, and so on. A database does all this as well, but when the logs are going to be stored in S3 or as raw files anyway, the naming convention can be a great way to speed up storing and retrieving the data.
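As a sketch of that retrieval path, here's how a hypothetical logs directory could be filtered on metadata alone, reusing decode_filename from above; the directory name and the filter fields are made up for the example:

import os

def list_logs(log_dir: str, **filters: str) -> list[dict]:
    # Only the directory listing is read; no file is ever opened.
    results = []
    for name in os.listdir(log_dir):
        if not name.endswith(".json"):
            continue
        meta = decode_filename(name)
        if all(meta.get(key) == value for key, value in filters.items()):
            results.append(meta)
    return results

# Every 500 response logged on a given day, without touching file contents:
errors = list_logs("logs", date="2025-11-16", status="500")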
A great thing about this compared to partitioning is that the order of the fields doesn't matter. Fields can be added, removed, or simply omitted from the metadata without breaking anything. It's up to the handler to know what's going on, and it can decide which default values to use for missing fields.
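For example, a handler might merge listing results over defaults of its own choosing; the specific defaults here are invented for illustration:

DEFAULTS = {"status": "draft", "category": "uncategorized"}

def with_defaults(meta: dict) -> dict:
    # Missing fields fall back to handler-chosen defaults; unknown or
    # newly added fields pass through untouched.
    return {**DEFAULTS, **meta}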
If partitioning is desired, say to group things by year, category, or anything else, note that it's probably not a great idea to store hundreds of thousands or even millions of files in one directory and expect list commands to stay efficient. That's where scripting and other tricks come in to add partitions, which of course adds overhead. Some amount of initial partitioning is usually desired, and it should be considered thoroughly before implementing something like this.
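If some initial partitioning is warranted, it can be layered on top of the same encoding; this Hive-style year/category layout is just one hypothetical choice:

def partitioned_path(meta: dict) -> str:
    # Bound the number of files per directory by splitting on a couple
    # of coarse fields before falling back to filename metadata.
    year = meta["date_published"][:4]
    return f"year={year}/category={meta['category']}/{encode_filename(meta)}"

A post published in 2025 under "code" would land in year=2025/category=code/, so a listing never has to scan every file ever written.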
Overall, this pattern has been effective for me.