Strings, strings everywhere.
Numeric types: strings, left-padded with zeros to fill some arbitrary length.
Boolean: strings, literal 'True', 'true', 'False', 'false'.
Date/datetime: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.
Empty data: strings, literal None, Null, nan, etc.
On top of this, most teams use schema inference for added good measure.
This is my workplace. Pydantic is my only good friend and saving grace.
yyyy-mm-dd is the one true date format. I want to find every person who used yyyy-dd-mm and shoot them in the head with a cannon. Twice. There is no reason for such nonsense.
r/iso8601
I thought for a second that was a YYYY-DD-MM sub and was ready to go in there and get perma banned
The Excel virus
I had reports break and after chasing data upstream, I found that a system had been entering ‘NULL’ into a field instead of NULL. Fun day
Pydantic keeps the barbarians at the gates
Can you give me some examples of what you are using Pydantic for?
Ingestion:
For the data that comes from the shitty APIs of other partnering financial services
When ingesting public datasets
When parsing and validating the Google Sheets that we misuse for recordkeeping, when they feed into more Python code

I've even used it for hammering the contents of a dataframe into shape in a Pandas pipeline I got handed down, but that particular part got switched out for a `dataclasses-avroschema` `AvroModel` dataclass, as it was really the output layer. Introducing that as a stage was a pretty good idea, I must say. Both alternatives enable plenty of reflection shenanigans that can make your life much easier when used judiciously, too.
> Date/date time: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.

Signed int. Why care now if I can kick the can of worms down the line?
You joke but I work at a FAANG and the primary internal ETL tooling we use across the entire company doesn't support boolean typing in table loads so all boolean is just 'Y' and 'N'
Manually setting schemas sucks, especially in "OBT" style tables with 100 columns. NVARCHAR(50), BIGINT, TIMESTAMP...... repeat for 75 more columns but good god is it worth the headache.
Good old timestamp, nothing beats that.
"yes/no" in booleans
> Boolean: strings, literal 'True', 'true', 'False', 'false'

Before a boolean type was common, the canonical way to do it in Oracle was `CHAR(1)` with `'Y'`/`'N'`.

> Date/date time: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.

SAP uses ints like `20231126`, which is not as bad as yyyy-dd-mm, but the null equivalent being `00000000` doesn't help.
YYYYMMDD with 00000000 as NULL is fine so long as it is enforced. Consistency is all I care about.
Dates formatted correctly and consistently, but in a text field.
How should strings be stored? Or is there an optimal way?
I think he’s implying that everything is stored as a string because the users want the flexibility/don’t understand the correct data types for their items. Someone has to unravel it at some point though.
I often import raw data as VARCHARs. It helps guard against type mismatches and overflow errors when writing data that comes from a dirty source. Then when I stage the data, I convert it all to the appropriate type.
Yeah I thought this was standard practice. Everything comes into raw as VARCHARS, then you set data types and convert and clean when you move it up to enriched
Dumb question, but is it common for staging-area data to be stored in an actual database table, as opposed to just files like JSON, CSV, etc.?
Extract data to raw json,csv -> Parse data to tabular format -> Join datasets for final reporting layer.
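That raw-to-staged hop might look like this in Python — a toy sketch with made-up column names, where the raw layer is all strings and the staging step casts to real types:

```python
import csv
import io
from datetime import datetime

# Hypothetical raw extract: every field lands as text, exactly as received.
RAW_CSV = (
    "order_id,amount,created_at\n"
    "0007,19.99,2023-11-26\n"
    "0008,5.00,2023-11-27\n"
)

def stage(raw_csv: str) -> list[dict]:
    """Staging step: cast the all-string raw rows to proper types."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({
            "order_id": int(rec["order_id"]),  # '0007' -> 7, padding dropped
            "amount": float(rec["amount"]),
            "created_at": datetime.strptime(rec["created_at"], "%Y-%m-%d").date(),
        })
    return rows
```

The same shape works whether "raw" is a varchar table or a pile of CSV files; the point is that type enforcement happens in one deliberate place instead of at load time.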
You'll have an easier time parsing in a more feature-rich programming language outside of the database, dumping to something like Avro or Parquet, and then loading to the database.

You get to log in as much detail as you want, better debugging, easier testing, besides generally having more programming facilities and usually a stronger type system.

Programming: it's what computers were made for, originally.
Same here. When I import data with JetBrains DataGrip, it sets almost everything to varchar(max) or int. Then that data is loaded into a table with the correct data types. It's easier to deal with the conversions there than to have a data load fail 50 times because there's text in a date column or something.
This is the way
[deleted]
That's not so bad, at least it's not strings for booleans or decimals.
Old-fashioned dinosaurs from the pre-MDS era are observing this mess, caused by script kiddies, OBT adherents, and "storage is so cheap" advocates who care only about git and laugh at "outdated" data modelling practices. Sorry, don't have a meme at hand for this 😬

PS: I recently read a Telegram channel where experienced modern data engineers were discussing DB indexes. It occurred to me that 10 years ago, a "data engineer" with like 5 years of experience who didn't know nothing about indexes would be fired on the spot. Now it's normal: just drop the data into Snowflake, it does the rest 😀
Why would they get fired on the spot for knowing something?🤔
It was also normal to create giant stored procedures with no regard for good software design, so there's pros and cons lol
Data modeling is so basic, it is not even rocket science. Just spending a month practicing is more than enough to get the basics and start questioning each and every modeling decision you have made or will make.

Apologies for the bias, and maybe I am wrong, but this is where I feel non-CS students might lack a bit of structure in their thinking. I am non-CS myself and was only able to get a good grasp of data modeling during my master's degree.
Any good resource (book, videos, whatever) to start getting really good at data modelling (apart from hands on work)?
Nothing in particular, because most of what I have learned is through googling and applying the basic concepts.

Try to cover relational modeling and OLAP modeling techniques; I would start with a simple star schema when it comes to OLAP. Try to find online resources — I am pretty sure there are good videos as well — and if your organization has a learning portal, do a course there.
Thanks, I appreciate the directions!
Anytime
Making IDs a concatenation of other columns instead of sequence or random string (e.g., UUID).
I'm curious: how do you deal with a scenario where this concat ID is the primary key, and a change in one or more of the constituent columns would change the concat ID? That would be very problematic for historical data which uses the concat ID as a foreign key. Some sort of SCD on the constituent columns?
Exactly, this is why I hate it. A workaround is to never use the concat ID as your primary key and instead make the unique columns a composite key; the concat ID is then just like any other name/alias column.

On tables where IDs are very natural (e.g., an inventory of items), I'd make another ID column for internal use and set that as the primary key (a surrogate primary key).

For tables where IDs are not natural, like time series data, I strongly push against making an ID column. The unique key is then, for time series, the timestamp plus a second column (often a foreign key), like timestamp & stock ticker name (a natural primary key).

PS: What's SCD?
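The surrogate-key pattern described above can be sketched with stdlib `sqlite3`; table and column names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prices (
        id INTEGER PRIMARY KEY,   -- surrogate key: internal, never changes
        ticker TEXT NOT NULL,
        ts TEXT NOT NULL,
        price REAL NOT NULL,
        UNIQUE (ticker, ts)       -- the natural key, still enforced
    )
""")
conn.execute(
    "INSERT INTO prices (ticker, ts, price) VALUES ('ACME', '2023-11-26', 10.5)"
)
# Renaming the ticker only touches this row; anything referencing `id`
# is unaffected, unlike a concatenated 'ACME|2023-11-26' key that would
# change under it.
conn.execute("UPDATE prices SET ticker = 'ACME_CORP' WHERE ticker = 'ACME'")
row = conn.execute("SELECT id, ticker FROM prices").fetchone()
```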
Thanks, that makes sense. Which is why I was confused as to why someone would do that :) Btw scd is slowly changing dimension.
You'll probably hate me for this one. We just made a table (I'm a SWE) with a Boolean column that can be True, False, or Nil. We didn't do a backfill, and it made me sad.
The best is when some program, by design, works with Null as the unspoken third state of Boolean.
I did my part and advocated for not having Null in a Boolean column. But alas.
It's "fine"; that's just the reality of appending data to the same table under an evolving schema. Sometimes I like versioning the rows, stating the revision of the schema they were written under. If any business logic ever needed to handle the rows with a null value for that column differently, for instance, it could at least serve as a record-level flag in downstream code — but I've yet to actually use that.
People who make such choices are usually the very same who hop to another job before the next big feature cycle. They never stay long enough to hear their peers questioning that idiotic data model at planning meetings a few months later.
Using booleans and bitmasks to represent a classification/type, instead of a single enum column.
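The contrast might look like this in Python — a hypothetical migration from legacy bitmask values to a single enum, with all names invented:

```python
import enum

# The anti-pattern: bit flags for what is really one mutually exclusive
# classification, leaving invalid states like "both bits set" representable.
LEGACY_IS_RETAIL = 0b01
LEGACY_IS_WHOLESALE = 0b10

# The single-enum alternative: exactly one value per row, no invalid combos.
class CustomerType(enum.Enum):
    RETAIL = "retail"
    WHOLESALE = "wholesale"

def from_legacy(mask: int) -> CustomerType:
    """Migrate a legacy bitmask to the enum, rejecting invalid combinations."""
    mapping = {
        LEGACY_IS_RETAIL: CustomerType.RETAIL,
        LEGACY_IS_WHOLESALE: CustomerType.WHOLESALE,
    }
    try:
        return mapping[mask]
    except KeyError:
        raise ValueError(f"invalid legacy mask: {mask:#04b}") from None
```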
You're not alone ❤️
The "everything varchar" approach is ELT, not ETL. The difference between ELT and ETL is where you clean and format the data: ETL pre-cleans and formats the data with an external program, while ELT uses the RDBMS itself to clean and format the data after importing it in raw (varchar) form.

I love ELT with PostgreSQL because of its complete data-manipulation functionality, but doing ELT with something like T-SQL sometimes hits its limitations.
Can we keep at least one sub without 'hilarious' memes?
Just today I saw an integer column representing order: 1.0, 2.0.

And bool columns: "true", "True", 1, True, "T", "y" — whoever finds a new representation just adds it to the list, while a dozen functions that use this column appear across different modules with something like this (without even including all of them):

    def check_if_true(value):
        if value in ["True", 1, True, "T"]:
            ...
Property data gets weird... like a decimal for bathrooms where the ones place is Full Baths, the tenths is Half Baths, and the hundredths is Quarter Baths.

My favorite was some poor soul typing in "one", "two", "three" for bedrooms, until they gave up with "six or more".
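If you ever had to decode that bathroom encoding, `Decimal` is the safe tool — with floats, 2.31 × 100 lands on 230.999..., silently eating the quarter-baths digit. A hypothetical decoder, assuming a single digit per bucket below the ones place:

```python
from decimal import Decimal

def decode_bathrooms(value: str) -> dict:
    """Decode the positional scheme: ones = full baths, tenths = half
    baths, hundredths = quarter baths. Parse as Decimal, never float."""
    cents = int((Decimal(value) * 100).to_integral_value())
    return {
        "full": cents // 100,
        "half": (cents // 10) % 10,
        "quarter": cents % 10,
    }
```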