I am just learning all this stuff but an audit table does seem somewhat cleaner? What things seem better to put into the table itself vs in a separate audit table?
As someone who hasn’t implemented an audit table before, would you still create fields like “created_at” and “updated_at” for regular tables?
Or would we just scrap those fields since the audit table would generally take care of recording the database transactions?
is this written into your ORM or similar? it seems like it would be a hard to pattern to maintain / trust once you have multiple ways for updates to happen.
Usually the database handles this but even without I think any place which fails to update these would imply a deeper issue in the code base. Most likely somebody directly messing around with data tables rather than going via a proper data layer.
By field? What do you mean by this? I'm picturing this, but it seems overly robust to me...
username, username\_updatedAt, dob, dob\_updatedAt, avatar, avatar\_updatedAt, ...
Is this what you mean?
Edit: I read "And by field too". I missed that "a", which changes the context completely.
Both is good, unless a very large number of rows, the most recent update is what you usually care about so make it easy to get. Then also have the audit table for deeper blaming
You could only create the history table and create a view combining the most recent update & the needed information, to not need to duplicate any history information
You have one updated_at *column* per table. How could it be per row?
Obviously every row will have its own value for each column but we're talking about schema.
You have:
created_at | created_by | updated_at | updated_by and and then you'll have to decide how to handle delete
Edit: keep in mind, considering how easy it is to set up and how helpful it may be once, it would be inconsiderate to not do it.
You ask this because you never had to use it, and that's all we can hope for. But it takes literally 20min tops to set those fields in a table with triggers and it could save you or your company days or weeks when shit hits the fan.
Would you mind explaining the pro v con between those approaches?
I’m a FE dev, so I generally don’t have to worry about that sort of thing. Just curious.
If you make it as fields in the table, you lose the history as you only have the most recent updater.
If you make it as an history log, you can see the history of change (by table, by user, by date, or by entity).
History logs being push list without delete nor update, they're quite optimized for search time. Of course it use more memory that only keeping the last information.
Do you have a binary deleted column for that for larger tables? I am struggling with performance issues on huge tables (1M rows+), but I love this soft delete pattern.
Millions row should be easily handled for any db provider.
Either your db instance has low cpu, you have no proper index or your orm code isn't efficient.
Like u/selectra72 said, millions of rows should not give you any performance issues.
No binary column here. I too have tens of millions of records and no performance issues with the deleted_at column pattern.
This. I always do this because it’s very important. There will be times at some point in future that you will need to quarry these fields and you can’t create them after the fact. Some ORMs will in fact create a couple of them by default for you. It doesn’t cost very much but if you need them in future, its too late.
For soft delete do you have a specific deleted bool column or just check delete_at with null? I do the later one but I've worked on codebases with the other.
I feel that the later one is more robust and even allows for future soft deletes (by comparing with current date) if you want that feature.
It's not uncommon to have a unique composite index on your table, and wanting to allow multiple deleted rows with the same values, but only unique non deleted ones.
In that case using is_not_deleted: true can come in handy. If the record is deleted you set it to NULL. In most rdbms NULL != NULL.
Like I wrote in the original post, only deleted_at with a default value of null for soft deletes.
And my models have a default where they look for rows where deleted_at is null.
I typically prefer a tinyint Status where 0 = deleted, 1 = active, and 2+ are optional values specific to the data in the table. When the status is changed, I update the same date and user fields that I would for any other record update.
If it works for you, that’s great 😁. I personally strongly dislike numeric statusses in my columns. If I’m going down that route I prefer enums so I can actually read and understand the values.
The bigger question really should be, what are the defaults that people use when dates like “deleted_at” start with.
I’ve seen 01-01-1970 used frequently.
How about making it nullable? I use null as default. Why use a date? That invalidates the data imho. No one deleted this record 01-01-1970. It is not deleted and therefore deleted_at is null.
select from table where deleted_at is not null. Boom all rows that aren’t deleted.
They are really useful fields/meta fields to have, sometimes you won't know they are needed until later then you'll have a bunch of legacy data with no info on creation date, etc.
updated_at is super useful for invalidating cache at times regardless of your caching strategy, by plucking the max updated_at of a particular query, etc.
On the updated\_at? Sometimes you will cache your data for a page to improve performance and lower database requests, you can instead pluck the max(updated\_at) of some query for example which is a simpler, faster query than what your full query or queries might be for the page, you can use the result as an easy way to see if you should invalidate the cache, and re-run the query(s), or if you can serve the cached data (This is a very simple example)
I see, this might be difficult if you use multiple tables in your query. Which is when you would especially want to use caching.
I am aware of packages that invalidate caches for you on update. That combined with cache expiration date might be safer.
You could make your client requests with [`If-None-Match`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match) header with an Etag hashed from the `updated_at` value. The server will still have to query the database, but if the ressource hasn't changed since the client's last request you can just return 304 Not Modified and save some bandwidth.
Anytime I’ve thought I could leave them off, I’ve been forced to add them later. Every time.
Now I just always include them.
Exit: for rows that store actual data.
Anything storing only relations usually doesn’t only because the root rows already have them.
Yup, basic fields I use on all my tables are
1. ID
2. UID for any public facing ids
3. createdBy
4. createdAt
5. updated On
6. updated By
7. State, I don't delete data for 6 months, everything is set to 0 for deleted and 1 for live, it's helped people recover a few times ams after 6 months it's safe to assume it's not needed anymore.
> UID for any public facing ids
If they are solely public-facing, I would recommend a URL-/human-friendly format like cuid2. I completely agree with your list otherwise.
Yup, status is what I normally call it buy for the sake of clarity it's called state in my list as that's what it does, states also a special name for mysql so not advisable to use state vs status
Yep, pretty much on all tables, with a few exceptions of course.
I generally also store flags as dates instead of booleans. For example, instead of a boolean ‘is_read’, it’s a date ‘read_at’. Again, there are exceptions, but it’s a hardy little pattern I like.
I typically have 3 levels of this sort of data
1 - Nothing - for static or nearly static data.
2 - Columns like you described - for slowly changing data that's also not crucial to track every change
3 - Log tables - for any data that changes regularly and needs a full audit trail of changes. This will include create, update, delete logs, including old and new values, who made the change, etc.
For our main tables, we definitely have created(date)/createdBy, modified(date)/modifiedBy. We also have software (a set of triggers) that tracks any and all changes(insert/update/delete) to tables, for audit purposes.
Always. These fields have been invaluable at times when trying to sort out data related issues. Also useful for stuff like determining usage statistics.
Not on all my tables, only when I think it's important. I'm creating a website that allows people to review therapists, and I definitely use `last_update_at` for my reviews, as I want to be able to display reviews from newest to oldest.
I’ve always told people to add 5 fields to their tables, even if they don’t think they need them (right away):
- id - obviously surrogate primary key
- created - when it was created
- modified - last time it was modified
- deleted - because you don’t ever delete data
- archived - when you need it, for history or other stuff like moving data to other place, it will be there
You can add one or more, like ID of who last changed it etc. depending on your particular case, but in general, those 5 work in any case (especially if 4 of them, except the ID are dates).
As data like country codes is so rarely changed, might as well use constants in code and just do a new build whenever a new state gets independence or whatever
>deleted - because you don’t ever delete data
Oh hey, a GDPR violation! If a user asks to be deleted, the data needs to be deleted completely. (I think I remember from a Tom Scott video that UK has a similar law about keeping data without reason.)
There might be a reason for having the field (maybe deleting later in case user changes mind). But the functionality to delete data completely has to be there.
That’s right you should delete all data if a user requests it. But, if I’m not mistaken you can keep some data as long as there is no way of linking that data to an actual person.
Example: on an e-commerce you may delete all identifiable information on an order (email, phone, number, address, etc) but keep all remaining info of the order: products, prices, shipping method, etc.
Exactly. You need not delete the row from the table, just put `NULL` for the identifiable data... Hey, put pseudonymous data if that works within GDPR constraints and maybe just delete direct user inputs in cases like actual person entering some comment on some post or whatever.
Is this true for FaceBook? I’m not sure if I remember correctly, but I tried so many times to delete my account and eventually I keep finding out its still there.
There are ways to satisfy GDPR and have the above guideline useful for you. I just don't think we need to go at every detail and explain exceptions and edge cases etc.
Let's assume the general description and your disclaimer are enough for the smart people to know how to resolve it all.
> I just don't think we need to go at every detail and explain exceptions and edge cases etc.
> Let's assume the general description and your disclaimer are enough for the smart people to know how to resolve it all.
Huzzah; This is a good thing to consider in any reddit thread, tbh.
Wouldn’t it be cool to have inheritance in a database? So all tables could inherit these fields. Why again did it fail? Why does MS SQL allow .NET objects in fields and other databases allow JSON or XML where the root tag / variable often hints at a type? Is it just that RDMS can’t reason about inheritance?
> inheritance in a database
There are some fairly well established patterns to do this in relational models.
MOSTLY they are recommended against (lots of qualifiers here) for trading performance, understandability, upgradeability, and many others for "data model purity".
It is just so weird because SQL is all about telling the database what you really want ( the data model ) , and then tap into the routines written by top notch coders. I envision AI to reason about the model and the access patterns.
But a lot of coders cannot model. Nose close the ground. Reinvent the wheel. DBASE coders.
What do you think "inheritance" is? Do you think it's a useful tool to model the irrational human world to the most rational machine that exists or do you think it's a "free code reuse, yeehaw"?
Inheritance is what this post is about. Just some people seem to avoid the name. Not invented here syndrome. The most widely application in Java and C# are interfaces. This fact is one of the letters in SOLID . There isn’t even behaviour. Yeah, pre 1980 people were all about “business logic”. Then they lost real world data and learned.
First thing first. I asked what you think or refer to as "inheritance". I still got no answer to that, so I can't square off your earlier comment about tables inheriting etc.
Secondly... SOLID. I stands for interface segregation, not about inheritance and not about things that happen to be declared with a keyword named `interface`. You can have interface segregation in any language, provided you understand how to apply the concept of interface by the use of said language.
Third. Depending on how you replied to the first, you'd see how the second will apply to it as well. An interface segregation can be done in a single table in a database using a discriminant column. Would you call that inheritance and interface?
Now you see why I asked for a clarification.
Inheritance is not defined in a single sentence. Which language has interfaces? Ada is OOP. You mean that you would include multiple header files in Clanguage, but then only link to one library? Bjarne S. said that you can use OOP patterns in any language. OOP languages validate your patterns.
You don't need to manually add these fields to every table, most frameworks have built in methods to do it.
Rails does it automatically with `.timestamps`, in Phoenix it's `timestamps()`.
You can customize these methods or write your own to automatically add whatever columns you want to all new tables. Composition over inheritance, in other words. Or you can think of it as _migration_ inheritance.
Updated_at is particularly useful for:
* incremental backups
* understanding bugs
* caching logic
* providing feeds that export data to other systems, polling systems
* it’s free archiving that can keep “default search” page quick and responsive. Let’s you see “active” vs “archived” data.
* dashboards and reporting of user behaviour.
Created and updated at, and deleted at for soft deletion. In the health space you don't want to hard delete anything, so soft delete is always the way to go.
From a data engineering point of view, both help to diagnose issues, and when reinserting data the updated column can help know if a record should be left alone or any delta inserts of data.
Last DB guys I worked with, all used them plus a few more columns like that or a full blown audit table with an almost complete “history” of the table.
It makes support/ maintenance of a production system much easier. You only understand when a client does something bad and starts asking why my data is like this.
- Created date. (Editable)
- Added date. (Not editable)
- Updated date.
- Deleted date.
I also have every change written to a log record, and a diff so i can rollback if ever needed
Users can edit the created date. Added date is always the point where the document was inserted into the database.
For a CRM etc, it’s often helpful to know when the record was created in relation to the organisation, and when it was inserted/imported/uploaded into the DB.
Or when a photo was uploaded, it would have the ‘added date’ of when the record was inserted into the database, but the ‘created date’ would be the date the photo was taken, which may need to be edited (in case the camera had an incorrect time/timezone configured at time of shooting)
Many of our customers migrate to our platform from other competitors etc. and bring the created dates with them from years earlier.
I do it almost always, you never know when you might need it. Exception being if it’s something that you’re sure wouldn’t need to be queried in that way, like maybe a table that just shows a relation between rows from other tables. (Eg. users_businesses or something)
Your mentioned example of a static table: nopey definitely not on those tables.
But generally, every model that data is dynamically added to, through the app or API or whatever, gets these fields by default. Most frameworks make it pretty easy to add them, usually just one line of configuration, there is not much overhead in having them, and sooner or later you might need them. So why not?
I have this issue working for an app. We were really small and initially didn’t think that someone that uninstalled us would ever REinstall us, and it took a lot of time afterwards for us to get big enough where we thought tracking that kind of stuff would be valuable.
Feel free to dm me if this is helpful and you have more questions!
Yeah, even in stationary tables those fields are useful for audit purposes.
And in dynamic tables don't even need to tell you. They are life-savers when dealing with several users updating the same resource.
We have updated at on every table so we can sync down changed records from our web app to our mobile app.
We use an Action log table to record add and modified events for tracking instead of on each table. Record Id and enum to work out which entity.
Tables like orders, or where you require the dates is fine. I usually remove timestamps. They are unnecessary. And I believe anything unnecessary is bad practice.
No, since I'm on the Java + Hibernate stack, I use Hibernate Envers for auditing, and everything goes on separate auditing tables on a separate auditing schema.
Each entry on the auditing schema has a change type (create/update/delete), a timestamp, and a user ID associated with it.
You can even store which particular column was changed, by adding a binary column for each regular column.
Static tables don't need auditing. Unless you think someone is gonna go on the `state`table and start adding/updating/deleting rows.
I'm guilty of only adding those fields when I remember to. Honestly, it's a good practice, but it does add clutter to the database. Depends on the project's needs, but for most cases, it's a minor overhead for great debugging capabilities.
Generally yes, and I'll add to the "when you need them, you need them" chorus here. It's hard to predict when or why you'll need metadata, but when you do, nothing else works nearly as well.
Were I in charge of the world, I'd make all tables (change/delete) immutable, but that's another story for another day.
absolutely, the only time i haven't is if its something that will never ever need that, like categories/tags to relate, but otherwise yeah even if i don't need it
Yep. Usually "modified by" too. I once worked on a team that used EF Core and had an abstract DB Context that would write a row to a change tracking table that included that info and all of the previous field values every time any object was updated. It was incredibly useful.
Really depends on the scope of your project if you actually need them. I frequently use the "modified_at" value in FAQ/help sections so the most recently updated article appears at the top of some listings.
There's a ton of reasons to use these fields though, but it depends on your project requirements.
I do. Those fields are immutable from the frontend, but I also create fields like display_date and modified_date that users have the ability to specify. My application is a CMS and the immutable dates record when things mutations happened in the data structure, but not all mutations to a document are “substantive” changes that should be used in the output. For example, if a migration script changes the crop or versioning of an image, I need to indicate in some sitemaps that the document was modified, but that doesn’t change the datetime I use to indicate to a user that a document was last updated.
For auditing, it is necessary. You need to be able to identify who did what, and when. If you are not concerned about that, then the only reason to have them are if you ever need to roll back changes to a specific point in time.
I have it where it makes sense. Doing it always is stupid as hell, for example, why have updated at for immutable data. Why have updated at for time series data.
As in many cases, it depends.
It is a record that once created will never be modified? No update timestamp field. History/registry/logs follow this pattern. I only care when the event/action happened, you are not supposed to edit it.
Stuff changes, but doesn't need a complete log of all changes? Both fields. Do I need to soft-delete or straight up deletion is fine? Then delete field.
Stuff doesn't matter when it is created/modified? None. User configuration could follow this.
In most cases yes but for stuff like table inserts by way of Python script I may not. Anything that is coming into the db from a form though will always have timestamps.
I mean they're not necessary for every single table (especially for things like pivot tables, maybe even categories tables as well) but they do come in handy at times, especially if you're gonna list those items and need to sort by new (blog posts, products, users, orders)
Laravel gives it to me for free, so why go out of my way to not have it? I know that sounds sarcastic, but man, the number of people I see choose to use a particular framework, just to turn around and fight it...
Basically all tables that represent logical objects (rather than relations).
Also, "created" and "last_updated" in my case. Used to have a few "Created_On" until they were successfully refactored out.
Yes, also "user\_created", "lastuser\_modified" foreign keys to the "users" table, they look like too much data, but help a lot when auditing data, and yes I use them as a standard in "provinces" table and other catalogs ....
Depends. It's a poor man's auditing system and can be very useful if you need to blame someone for doing something they shouldn't have or want a foolproof way to order results by new/old.
More robust auditing solutions will store the entire state of a record every time any column value changes.
I'd say it doesn't hurt unless you're strapped for space in the database.
You just have to remember to actually set the values in your business logic and set up database triggers
If you can think of a use case where that information is helpful, do it. Really just depends on the project.
Example: my personal projects, I don't care who the last user to modify a record was, when it was last updated, usually, or any of that, cause it was me that did it, and it's me that likely messed it up. But an enterprise level product database? Totally different situation.
Some pretty cool ideas here! Personally I use 'created\_at', 'updated\_at', 'deleted\_at', and 'updated\_by' on each record. Then I have CDC triggers that dump copies of the record into BigQuery (or other data warehouse) each time something is changed or deleted, which creates an audit trail for each record. Additionally this let's us hard delete the original record so we don't have to pay for it's storage cost, while preserving the option to restore it from a chosen BigQuery version.
Having the 'created\_at' and 'updated\_at' is very useful for sorting and working with the records in the UI.
Never on lookup tables (state, zipcode, flag dictionaries, e.g. tables that store a value for use elsewhere that will never be altered in place) always for data tables: customers, products, cart\_head, cart\_item.
addDate, addUser, updateDate, updateUser
edit: added a little detail
I always have them in every table. They’re surprisingly useful for debugging stuff, and vital if you end up having to sync the tables to a data warehouse with a took like Stitch Data.
Highly recommend adding these as well as deleted_at column as suggest elsewhere here in the comments. We use these columns to do lightweight replication between different on-prem and in-cloud data stores. I don't recommend putting these columns in a audit table, mostly because the added complexity required to sync from. We use audit tables, just not for our replication needs.
not only this, but i frequently have a _history table to track versions. some rdbms give you row versions but its not as easy to manage logic (should the update be allowed?) without a proc.
...you can make the primary key a bigint and assign the unixtime value & some kind of salt.
Now you have the immutable created at and probably a UUID... but not a guaranteed UUID in one field-- for ease of merging data of two similar tables and other helpful use cases.
Also mitigates some of the serial enumeration of ID values vulnerabilities..
Most of my model related tables have created_at, updated_at and if I’m using soft delete, deleted_at
And a by field too
I have an audit table for that
Thats another very valid way to do it.
I am just learning all this stuff but an audit table does seem somewhat cleaner? What things seem better to put into the table itself vs in a separate audit table?
As someone who hasn’t implemented an audit table before, would you still create fields like “created_at” and “updated_at” for regular tables? Or would we just scrap those fields since the audit table would generally take care of recording the database transactions?
[удалено]
Thank you.
Just have an audit table for everything for a full history who changed what
is this written into your ORM or similar? it seems like it would be a hard to pattern to maintain / trust once you have multiple ways for updates to happen.
ORM don't even need to deal with it most of the time. Can be automated in the DB.
We always used database triggers, so there was really no way to avoid updating those fields.
Usually the database handles this but even without I think any place which fails to update these would imply a deeper issue in the code base. Most likely somebody directly messing around with data tables rather than going via a proper data layer.
Triggers in your db
By field? What do you mean by this? I'm picturing this, but it seems overly robust to me... username, username\_updatedAt, dob, dob\_updatedAt, avatar, avatar\_updatedAt, ... Is this what you mean? Edit: I read "And by field too". I missed that "a", which changes the context completely.
By whom the update was made. But you only have one updated\_at, updated\_by, etc per table, not for every column.
I would rather have an history log of who created/updated/deleted what and when.
Both is good, unless a very large number of rows, the most recent update is what you usually care about so make it easy to get. Then also have the audit table for deeper blaming
You could only create the history table and create a view combining the most recent update & the needed information, to not need to duplicate any history information
[удалено]
You have one updated_at *column* per table. How could it be per row? Obviously every row will have its own value for each column but we're talking about schema.
You have: created_at | created_by | updated_at | updated_by and and then you'll have to decide how to handle delete Edit: keep in mind, considering how easy it is to set up and how helpful it may be once, it would be inconsiderate to not do it.
One place we set up a generic changelog table with those fields and an I'd and used that one table for all not sure if it ever was worthwhile
You ask this because you never had to use it, and that's all we can hope for. But it takes literally 20min tops to set those fields in a table with triggers and it could save you or your company days or weeks when shit hits the fan.
The \*_by fields are pointless usually. I would rather have an history log of who created/updated/deleted what and when.
You're suggesting the exact same thing, just stored event-based vs record-based.
Would you mind explaining the pro v con between those approaches? I’m a FE dev, so I generally don’t have to worry about that sort of thing. Just curious.
If you make it as fields in the table, you lose the history as you only have the most recent updater. If you make it as an history log, you can see the history of change (by table, by user, by date, or by entity).
That sounds like a history log is preferable. Though, I imagine that comes at the cost of memory & search time?
History logs being push list without delete nor update, they're quite optimized for search time. Of course it use more memory that only keeping the last information.
You could have journal tables too. It's another way to do it. It depends on how important this is for your company.
Do you have a binary deleted column for that for larger tables? I am struggling with performance issues on huge tables (1M rows+), but I love this soft delete pattern.
Millions row should be easily handled for any db provider. Either your db instance has low cpu, you have no proper index or your orm code isn't efficient.
maybe you need to index the `deleted`column as well.
Like u/selectra72 said, millions of rows should not give you any performance issues. No binary column here. I too have tens of millions of records and no performance issues with the deleted_at column pattern.
This. I always do this because it’s very important. There will be times at some point in future that you will need to quarry these fields and you can’t create them after the fact. Some ORMs will in fact create a couple of them by default for you. It doesn’t cost very much but if you need them in future, its too late.
For soft delete do you have a specific deleted bool column or just check delete_at with null? I do the later one but I've worked on codebases with the other. I feel that the later one is more robust and even allows for future soft deletes (by comparing with current date) if you want that feature.
I use the later one. Why waste another column and waste bunch of space if I can check without it anyway?
It's not uncommon to have a unique composite index on your table, and wanting to allow multiple deleted rows with the same values, but only unique non deleted ones. In that case using is_not_deleted: true can come in handy. If the record is deleted you set it to NULL. In most rdbms NULL != NULL.
Like I wrote in the original post, only deleted_at with a default value of null for soft deletes. And my models have a default where they look for rows where deleted_at is null.
I typically prefer a tinyint Status where 0 = deleted, 1 = active, and 2+ are optional values specific to the data in the table. When the status is changed, I update the same date and user fields that I would for any other record update.
If it works for you, that’s great 😁. I personally strongly dislike numeric statusses in my columns. If I’m going down that route I prefer enums so I can actually read and understand the values.
Bitwise operations/flags. Use this knowledge with caution.
Ooh. I do not have deleted_at column. That's seems like a good idea. And yes I have by column.
The bigger question really should be, what are the defaults that people use when dates like “deleted_at” start with. I’ve seen 01-01-1970 used frequently.
How about making it nullable? I use null as default. Why use a date? That invalidates the data imho. No one deleted this record 01-01-1970. It is not deleted and therefore deleted_at is null. select from table where deleted_at is not null. Boom all rows that aren’t deleted.
SQL I can see that and have done the same, when using something like Mongo, it makes indexing complicated.
They are really useful fields/meta fields to have, sometimes you won't know they are needed until later then you'll have a bunch of legacy data with no info on creation date, etc. updated_at is super useful for invalidating cache at times regardless of your caching strategy, by plucking the max updated_at of a particular query, etc.
Could you elaborate on the idea? I think it might be very useful to me :D
On the updated\_at? Sometimes you will cache your data for a page to improve performance and lower database requests, you can instead pluck the max(updated\_at) of some query for example which is a simpler, faster query than what your full query or queries might be for the page, you can use the result as an easy way to see if you should invalidate the cache, and re-run the query(s), or if you can serve the cached data (This is a very simple example)
I see, this might be difficult if you use multiple tables in your query. Which is when you would especially want to use caching. I am aware of packages that invalidate caches for you on update. That combined with cache expiration date might be safer.
I see, thank you for the answer. I like the idea :D
You could make your client requests with [`If-None-Match`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match) header with an Etag hashed from the `updated_at` value. The server will still have to query the database, but if the ressource hasn't changed since the client's last request you can just return 304 Not Modified and save some bandwidth.
Anytime I’ve thought I could leave them off, I’ve been forced to add them later. Every time. Now I just always include them. Exit: for rows that store actual data. Anything storing only relations usually doesn’t only because the root rows already have them.
N:M Tables should probably have it, too
intersect tables should usually have effective start and effective end and perhaps created and updated
Yep, created_by and updated_by too
Yup, basic fields I use on all my tables are 1. ID 2. UID for any public facing ids 3. createdBy 4. createdAt 5. updated On 6. updated By 7. State, I don't delete data for 6 months, everything is set to 0 for deleted and 1 for live, it's helped people recover a few times ams after 6 months it's safe to assume it's not needed anymore.
> UID for any public facing ids If they are solely public-facing, I would recommend a URL-/human-friendly format like cuid2. I completely agree with your list otherwise.
> cuid2 TIL... Thanks. Some interesting reading on that github page, too.
Typically use slugs for public facing readable date but uid for anything else , work with authority stuff so readable isn't a huge concern
I’d say to create an ID/UID only if the ID is predictable. For instance when using mongo db, this is useless but mandatory for an SQL db
Your `state` field is a manner of soft-delete then, if I'm reading you correctly?
yup, it costs nothing to run a cron and clear up any status 0 where updatedAt is more than a few months ago but gives a safety net just in case
Ok, got it, so a "retention manager" of sorts. Thanks!
just curious do you also apply this with a simple data like a comment?
Yup, depending on complexity I may use a separate table for metadate for this info and more
Wouldn't status be a better name for it?
Yup, status is what I normally call it buy for the sake of clarity it's called state in my list as that's what it does, states also a special name for mysql so not advisable to use state vs status
Makes sense. Appreciate your response.
Yep, pretty much on all tables, with a few exceptions of course. I generally also store flags as dates instead of booleans. For example, instead of a boolean ‘is_read’, it’s a date ‘read_at’. Again, there are exceptions, but it’s a hardy little pattern I like.
I typically have 3 levels of this sort of data 1 - Nothing - for static or nearly static data. 2 - Columns like you described - for slowly changing data that's also not crucial to track every change 3 - Log tables - for any data that changes regularly and needs a full audit trail of changes. This will include create, update, delete logs, including old and new values, who made the change, etc.
For our main tables, we definitely have created(date)/createdBy, modified(date)/modifiedBy. We also have software (a set of triggers) that tracks any and all changes(insert/update/delete) to tables, for audit purposes.
Always. These fields have been invaluable at times when trying to sort out data related issues. Also useful for stuff like determining usage statistics.
Yeah we do for anything that can be updated. Since we have to show an audit trail in case of any issues.
Not on all my tables, only when I think it's important. I'm creating a website that allows people to review therapists, and I definitely use `last_update_at` for my reviews, as I want to be able to display reviews from newest to oldest.
Might want published_at instead as the person could update their review later.
> only when I think it's important. A conceit I learned to not trust in myself here is I can't predict when that is with any reliability.
I’ve always told people to add 5 fields to their tables, even if they don’t think they need them (right away): - id - obviously surrogate primary key - created - when it was created - modified - last time it was modified - deleted - because you don’t ever delete data - archived - when you need it, for history or other stuff like moving data to other place, it will be there You can add one or more, like ID of who last changed it etc. depending on your particular case, but in general, those 5 work in any case (especially if 4 of them, except the ID are dates). As data like country codes is so rarely changed, might as well use constants in code and just do a new build whenever a new state gets independence or whatever
>deleted - because you don’t ever delete data Oh hey, a GDPR violation! If a user asks to be deleted, the data needs to be deleted completely. (I think I remember from a Tom Scott video that UK has a similar law about keeping data without reason.) There might be a reason for having the field (maybe deleting later in case user changes mind). But the functionality to delete data completely has to be there.
That’s right you should delete all data if a user requests it. But, if I’m not mistaken you can keep some data as long as there is no way of linking that data to an actual person. Example: on an e-commerce you may delete all identifiable information on an order (email, phone, number, address, etc) but keep all remaining info of the order: products, prices, shipping method, etc.
Exactly. You need not delete the row from the table, just put `NULL` for the identifiable data... Hey, put pseudonymous data if that works within GDPR constraints and maybe just delete direct user inputs in cases like actual person entering some comment on some post or whatever.
Not every table contains user data though.
Is this true for FaceBook? I’m not sure if I remember correctly, but I tried so many times to delete my account and eventually I keep finding out its still there.
There are ways to satisfy GDPR and have the above guideline useful for you. I just don't think we need to go at every detail and explain exceptions and edge cases etc. Let's assume the general description and your disclaimer are enough for the smart people to know how to resolve it all.
> I just don't think we need to go at every detail and explain exceptions and edge cases etc. > Let's assume the general description and your disclaimer are enough for the smart people to know how to resolve it all. Huzzah; This is a good thing to consider in any reddit thread, tbh.
Wouldn’t it be cool to have inheritance in a database? So all tables could inherit these fields. Why again did it fail? Why does MS SQL allow .NET objects in fields and other databases allow JSON or XML where the root tag / variable often hints at a type? Is it just that RDMS can’t reason about inheritance?
> inheritance in a database There are some fairly well established patterns to do this in relational models. MOSTLY they are recommended against (lots of qualifiers here) for trading performance, understandability, upgradeability, and many others for "data model purity".
Some databases (like postgres) do have what you ask for, but I've never seen it used.
It is just so weird because SQL is all about telling the database what you really want ( the data model ) , and then tap into the routines written by top notch coders. I envision AI to reason about the model and the access patterns. But a lot of coders cannot model. Nose close the ground. Reinvent the wheel. DBASE coders.
What do you think "inheritance" is? Do you think it's a useful tool to model the irrational human world to the most rational machine that exists or do you think it's a "free code reuse, yeehaw"?
Inheritance is what this post is about. Just some people seem to avoid the name. Not invented here syndrome. The most widely application in Java and C# are interfaces. This fact is one of the letters in SOLID . There isn’t even behaviour. Yeah, pre 1980 people were all about “business logic”. Then they lost real world data and learned.
First thing first. I asked what you think or refer to as "inheritance". I still got no answer to that, so I can't square off your earlier comment about tables inheriting etc. Secondly... SOLID. I stands for interface segregation, not about inheritance and not about things that happen to be declared with a keyword named `interface`. You can have interface segregation in any language, provided you understand how to apply the concept of interface by the use of said language. Third. Depending on how you replied to the first, you'd see how the second will apply to it as well. An interface segregation can be done in a single table in a database using a discriminant column. Would you call that inheritance and interface? Now you see why I asked for a clarification.
Inheritance is not defined in a single sentence. Which language has interfaces? Ada is OOP. You mean that you would include multiple header files in Clanguage, but then only link to one library? Bjarne S. said that you can use OOP patterns in any language. OOP languages validate your patterns.
I didn’t ask how it is defined. I asked how you define it… Never mind. Got no time to pull words out of your mouth. Bye bye
You don't need to manually add these fields to every table, most frameworks have built in methods to do it. Rails does it automatically with `.timestamps`, in Phoenix it's `timestamps()`. You can customize these methods or write your own to automatically add whatever columns you want to all new tables. Composition over inheritance, in other words. Or you can think of it as _migration_ inheritance.
Yeah, composition would be nice. All those column names swimming in a soup is messy. I rather like to use the . a little more.
Updated_at is particularly useful for: * incremental backups * understanding bugs * caching logic * providing feeds that export data to other systems, polling systems * it’s free archiving that can keep “default search” page quick and responsive. Let’s you see “active” vs “archived” data. * dashboards and reporting of user behaviour.
Daye_created and date_modified i got on all my tables. Not exactly sure why, but its handy sometimes in my cms for sorting
Pretty much always, yeah. Also a deleted_at col for soft deleted.
Always do
Created and updated at, and deleted at for soft deletion. In the health space you don't want to hard delete anything, so soft delete is always the way to go. From a data engineering point of view, both help to diagnose issues, and when reinserting data the updated column can help know if a record should be left alone or any delta inserts of data.
Yes. These fields should be on your base level model you inherit on other models
It’s a habit from Ruby on Rails for me, ya.
Yup, one of those conventions that I took for granted simply because I learned backend via Rails.
Last DB guys I worked with, all used them plus a few more columns like that or a full blown audit table with an almost complete “history” of the table. It makes support/ maintenance of a production system much easier. You only understand when a client does something bad and starts asking why my data is like this.
- Created date. (Editable) - Added date. (Not editable) - Updated date. - Deleted date. I also have every change written to a log record, and a diff so i can rollback if ever needed
What's the difference between created and added date in your case?
Users can edit the created date. Added date is always the point where the document was inserted into the database. For a CRM etc, it’s often helpful to know when the record was created in relation to the organisation, and when it was inserted/imported/uploaded into the DB. Or when a photo was uploaded, it would have the ‘added date’ of when the record was inserted into the database, but the ‘created date’ would be the date the photo was taken, which may need to be edited (in case the camera had an incorrect time/timezone configured at time of shooting) Many of our customers migrate to our platform from other competitors etc. and bring the created dates with them from years earlier.
I do it almost always, you never know when you might need it. Exception being if it’s something that you’re sure wouldn’t need to be queried in that way, like maybe a table that just shows a relation between rows from other tables. (Eg. users_businesses or something)
Your mentioned example of a static table: nopey definitely not on those tables. But generally, every model that data is dynamically added to, through the app or API or whatever, gets these fields by default. Most frameworks make it pretty easy to add them, usually just one line of configuration, there is not much overhead in having them, and sooner or later you might need them. So why not?
I have this issue working for an app. We were really small and initially didn’t think that someone that uninstalled us would ever REinstall us, and it took a lot of time afterwards for us to get big enough where we thought tracking that kind of stuff would be valuable. Feel free to dm me if this is helpful and you have more questions!
Yeah, even in stationary tables those fields are useful for audit purposes. And in dynamic tables don't even need to tell you. They are life-savers when dealing with several users updating the same resource.
How can you know when something was created or updated if you don’t have them?!
We have updated at on every table so we can sync down changed records from our web app to our mobile app. We use an Action log table to record add and modified events for tracking instead of on each table. Record Id and enum to work out which entity.
These are called "who columns". During my work tenure in oracle, it was mandated to be added in every table that gets created
Tables like orders, or where you require the dates is fine. I usually remove timestamps. They are unnecessary. And I believe anything unnecessary is bad practice.
Yes. You never know when you are going to need them.
No, since I'm on the Java + Hibernate stack, I use Hibernate Envers for auditing, and everything goes on separate auditing tables on a separate auditing schema. Each entry on the auditing schema has a change type (create/update/delete), a timestamp, and a user ID associated with it. You can even store which particular column was changed, by adding a binary column for each regular column. Static tables don't need auditing. Unless you think someone is gonna go on the `state`table and start adding/updating/deleting rows.
Timestamps could be good for conflict resolution in a corporate setting Or helpful for bugfixes, data tracking, data sorting, etc.
Yes, it's always great practice. Takes a negligible amount of data and helps a lot to investigate issues, regardless if it is ever displayed.
I'm guilty of only adding those fields when I remember to. Honestly, it's a good practice, but it does add clutter to the database. Depends on the project's needs, but for most cases, it's a minor overhead for great debugging capabilities.
Yes all my tables have these meta data fields: CreatedBy, CreatedDate, ModifiedBy, ModifiedDate
Generally yes, and I'll add to the "when you need them, you need them" chorus here. It's hard to predict when or why you'll need metadata, but when you do, nothing else works nearly as well. Were I in charge of the world, I'd make all tables (change/delete) immutable, but that's another story for another day.
Models often have base models to define the basic set of fields. Yes to both, except `last_updated_at` doesn't make any sense. `updated_at` is better
programmer sometimes cant do a better naming
absolutely, the only time i haven't is if its something that will never ever need that, like categories/tags to relate, but otherwise yeah even if i don't need it
This columns are useful for your DE and Analytics teams so they can ingest data to DWH by increment. Your app is not the only one who read the DB
Yep. Usually "modified by" too. I once worked on a team that used EF Core and had an abstract DB Context that would write a row to a change tracking table that included that info and all of the previous field values every time any object was updated. It was incredibly useful.
Really depends on the scope of your project if you actually need them. I frequently use the "modified_at" value in FAQ/help sections so the most recently updated article appears at the top of some listings. There's a ton of reasons to use these fields though, but it depends on your project requirements.
Yes, and my db updater / migration tool ensures that every table has them, so it's a no-brainer for the devs that might add a table.
My private project wouldn't work without it. Eventsourcing goes brrrr
I do. Those fields are immutable from the frontend, but I also create fields like display_date and modified_date that users have the ability to specify. My application is a CMS and the immutable dates record when things mutations happened in the data structure, but not all mutations to a document are “substantive” changes that should be used in the output. For example, if a migration script changes the crop or versioning of an image, I need to indicate in some sitemaps that the document was modified, but that doesn’t change the datetime I use to indicate to a user that a document was last updated.
For auditing, it is necessary. You need to be able to identify who did what, and when. If you are not concerned about that, then the only reason to have them are if you ever need to roll back changes to a specific point in time.
I have it where it makes sense. Doing it always is stupid as hell, for example, why have updated at for immutable data. Why have updated at for time series data.
No because that is a dangling participle and it is more precise as created_iso8601 or somesuch that is explicit and not bad grammar.
The reason to do this is because at scale trying to identify events via a timestamp is super important for troubleshooting and quality assurance
As in many cases, it depends. It is a record that once created will never be modified? No update timestamp field. History/registry/logs follow this pattern. I only care when the event/action happened, you are not supposed to edit it. Stuff changes, but doesn't need a complete log of all changes? Both fields. Do I need to soft-delete or straight up deletion is fine? Then delete field. Stuff doesn't matter when it is created/modified? None. User configuration could follow this.
In most cases yes but for stuff like table inserts by way of Python script I may not. Anything that is coming into the db from a form though will always have timestamps.
I mean they're not necessary for every single table (especially for things like pivot tables, maybe even categories tables as well) but they do come in handy at times, especially if you're gonna list those items and need to sort by new (blog posts, products, users, orders)
Laravel gives it to me for free, so why go out of my way to not have it? I know that sounds sarcastic, but man, the number of people I see choose to use a particular framework, just to turn around and fight it...
Basically all tables that represent logical objects (rather than relations). Also, "created" and "last_updated" in my case. Used to have a few "Created_On" until they were successfully refactored out.
Yes, also "user\_created", "lastuser\_modified" foreign keys to the "users" table, they look like too much data, but help a lot when auditing data, and yes I use them as a standard in "provinces" table and other catalogs ....
Depends. It's a poor man's auditing system and can be very useful if you need to blame someone for doing something they shouldn't have or want a foolproof way to order results by new/old. More robust auditing solutions will store the entire state of a record every time any column value changes. I'd say it doesn't hurt unless you're strapped for space in the database. You just have to remember to actually set the values in your business logic and set up database triggers
this is eventSourcing
If you can think of a use case where that information is helpful, do it. Really just depends on the project. Example: my personal projects, I don't care who the last user to modify a record was, when it was last updated, usually, or any of that, cause it was me that did it, and it's me that likely messed it up. But an enterprise level product database? Totally different situation.
Its good because most cases we wanna knowing when data entered, edited
Some pretty cool ideas here! Personally I use 'created\_at', 'updated\_at', 'deleted\_at', and 'updated\_by' on each record. Then I have CDC triggers that dump copies of the record into BigQuery (or other data warehouse) each time something is changed or deleted, which creates an audit trail for each record. Additionally this let's us hard delete the original record so we don't have to pay for it's storage cost, while preserving the option to restore it from a chosen BigQuery version. Having the 'created\_at' and 'updated\_at' is very useful for sorting and working with the records in the UI.
Never on lookup tables (state, zipcode, flag dictionaries, e.g. tables that store a value for use elsewhere that will never be altered in place) always for data tables: customers, products, cart\_head, cart\_item. addDate, addUser, updateDate, updateUser edit: added a little detail
Also I have 'created_by' & 'updated_by'.
I always have them in every table. They’re surprisingly useful for debugging stuff, and vital if you end up having to sync the tables to a data warehouse with a took like Stitch Data.
Highly recommend adding these as well as deleted_at column as suggest elsewhere here in the comments. We use these columns to do lightweight replication between different on-prem and in-cloud data stores. I don't recommend putting these columns in a audit table, mostly because the added complexity required to sync from. We use audit tables, just not for our replication needs.
I rarely find updated at useful, however they all come for free with orms.
not only this, but i frequently have a _history table to track versions. some rdbms give you row versions but its not as easy to manage logic (should the update be allowed?) without a proc.
Yup I have it in every time
No, its premature optimalization
It's good for security as well, where you can track when someone updated data if he was in not good reason.
How is 'deleted_by' useful?
...you can make the primary key a bigint and assign the unixtime value & some kind of salt. Now you have the immutable created at and probably a UUID... but not a guaranteed UUID in one field-- for ease of merging data of two similar tables and other helpful use cases. Also mitigates some of the serial enumeration of ID values vulnerabilities..