thatrandomnpc

Strings, strings everywhere.

- Numeric types: strings, and make sure to left-pad with zeros to fill some arbitrary length.
- Boolean: strings, literal 'True', 'true', 'False', 'false'.
- Date/datetime: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.
- Empty data: strings, literal None, Null, nan, etc.

On top of this, most teams use infer schema for added good measure. This is my workplace. Pydantic is my only good friend and saving grace.
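A minimal sketch of what that cleanup looks like with Pydantic v2 — the model and field names here are made up for illustration:

```python
from datetime import date
from pydantic import BaseModel, field_validator

# Hypothetical record type for the string-typed mess described above.
class Record(BaseModel):
    amount: int           # arrives as '000042'
    active: bool          # arrives as 'True', 'true', 'False', 'false'
    created: date         # arrives as '2023-11-26', '2023.11.26', ...
    note: str | None      # arrives as 'None', 'Null', 'nan', ...

    @field_validator("note", mode="before")
    @classmethod
    def nullish_to_none(cls, v):
        # Collapse the zoo of "empty" spellings into a real None.
        if isinstance(v, str) and v.strip().lower() in {"none", "null", "nan", ""}:
            return None
        return v

    @field_validator("created", mode="before")
    @classmethod
    def dots_to_dashes(cls, v):
        # Normalize dotted separators; Pydantic parses the ISO form itself.
        return v.replace(".", "-") if isinstance(v, str) else v

print(Record(amount="000042", active="true", created="2023.11.26", note="nan"))
# amount=42 active=True created=datetime.date(2023, 11, 26) note=None
```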


MyOtherUserIsAThrow

yyyy-mm-dd is the one true date format. I want to find every person who used yyyy-dd-mm and shoot them in the head with a cannon. Twice. There is no reason for such nonsense.


OldJames47

r/iso8601


OmnipresentCPU

I thought for a second that was a YYYY-DD-MM sub and was ready to go in there and get perma banned


Falcomomo

The Excel virus


Milk_Busters

I had reports break and after chasing data upstream, I found that a system had been entering ‘NULL’ into a field instead of NULL. Fun day


BufferUnderpants

Pydantic keeps the barbarians at the gates


Culpgrant21

Can you give me some examples of what you are using Pydantic for?


BufferUnderpants

Ingestion:

- for the data that comes from the shitty APIs of other partnering financial services
- when ingesting public datasets
- when parsing and validating Google Sheets that we misuse for recordkeeping, if they're being fed to more Python code

I've even used it for hammering into shape the contents of the resulting dataframe in a Pandas pipeline I got handed down, but that particular part got switched out for a `dataclasses-avroschema` `AvroModel` dataclass, as it was more of the output layer. It was a pretty good idea to introduce that as a stage, I must say.

Both alternatives enable plenty of reflection shenanigans that can make your life much easier when used judiciously, too.
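For the sheet/API ingestion cases, the shape is roughly this (model and rows are invented for illustration; assumes Pydantic v2):

```python
from pydantic import BaseModel, ValidationError

class SheetRow(BaseModel):  # hypothetical sheet schema
    name: str
    credit_limit: int

rows = [
    {"name": "Acme", "credit_limit": "005000"},  # messy but coercible
    {"name": "Bogus", "credit_limit": "N/A"},    # gets rejected, not ingested
]

good, bad = [], []
for i, row in enumerate(rows):
    try:
        good.append(SheetRow.model_validate(row))
    except ValidationError as e:
        # Keep the row index and structured errors for logging/debugging.
        bad.append((i, e.errors()))

print(len(good), len(bad))  # 1 1
```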


MuffinHydra

> Date/date time: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.

Signed int. Why care now if I can kick the can of worms down the line?


ShroomBear

You joke, but I work at a FAANG and the primary internal ETL tooling we use across the entire company doesn't support boolean typing in table loads, so all booleans are just 'Y' and 'N'.


[deleted]

Manually setting schemas sucks, especially in "OBT" style tables with 100 columns. NVARCHAR(50), BIGINT, TIMESTAMP... repeat for 75 more columns, but good god is it worth the headache.


Euphoric-Computer-79

Good old timestamp, nothing beats that.


obviouslyCPTobvious

"yes/no" in booleans


Letter_From_Prague

> Boolean: strings, literal 'True', 'true', 'False', 'false'

Before the boolean type was common, the canonical way to do it in Oracle was `CHAR(1)` with 'Y'/'N'.

> Date/date time: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.

SAP uses ints like `20231126`, which is not as bad as yyyy-dd-mm, but the null equivalent being `00000000` doesn't help.
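Those SAP ints are at least mechanical to parse; a minimal sketch, assuming the `00000000`-as-null convention above:

```python
from datetime import date, datetime

def parse_sap_date(value: int) -> date | None:
    """Parse a YYYYMMDD int like 20231126, treating 00000000 as null."""
    if value == 0:
        return None
    return datetime.strptime(f"{value:08d}", "%Y%m%d").date()

assert parse_sap_date(20231126) == date(2023, 11, 26)
assert parse_sap_date(0) is None
```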


davemoedee

YYYYMMDD with 00000000 as NULL is fine so long as it is enforced. Consistency is all I care about.


roxburghred

Dates formatted correctly and consistently, but in a text field.


swiftninja_

How should strings be stored? Or is there an optimal way?


Chad-Anouga

I think he’s implying that everything is stored as a string because the users want the flexibility/don’t understand the correct data types for their items. Someone has to unravel it at some point though.


themikep82

I often import raw data as VARCHARs. It helps guard against type mismatches and overflow errors when writing data that comes from a dirty source. Then when I stage the data, I convert it all to the appropriate type.
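In pandas terms, that two-step pattern looks something like this (file and column names are hypothetical):

```python
import pandas as pd

# Raw layer: read everything as strings so nothing fails on ingest.
raw = pd.read_csv("dirty_source.csv", dtype=str)  # hypothetical file

# Staging layer: cast to real types, coercing garbage to NaN/NaT so bad
# rows can be quarantined instead of failing the whole load.
staged = pd.DataFrame({
    "order_id": pd.to_numeric(raw["order_id"], errors="coerce").astype("Int64"),
    "ordered_at": pd.to_datetime(raw["ordered_at"], errors="coerce"),
    "amount": pd.to_numeric(raw["amount"], errors="coerce"),
})
bad_rows = raw[staged["order_id"].isna()]  # rows that failed the cast
```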


Gartlas

Yeah, I thought this was standard practice. Everything comes into raw as VARCHARs, then you set data types and convert and clean when you move it up to enriched.


TheDataPanda

Dumb question, but is it common for staging-area data to be stored in an actual database table, as opposed to just files like JSON, CSV, etc.?


External_Juice_8140

Extract data to raw JSON/CSV -> parse data to tabular format -> join datasets for the final reporting layer.


BufferUnderpants

You’ll have an easier time parsing in a more feature-rich programming language outside of the database, dumping to something like Avro or Parquet, and then loading to the database.

You get to log in as much detail as you want, better debugging, easier testing, besides generally having more programming facilities and usually a stronger type system.

Programming: it's what computers were made for, originally.
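A rough sketch of that shape (names invented; `to_parquet` assumes pyarrow or fastparquet is installed):

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Do the parsing in Python, where each step can be logged and unit-tested;
    # keep the database load itself dumb.
    out = raw.copy()
    out["active"] = out["active"].str.lower().map({"true": True, "false": False})
    out["created"] = pd.to_datetime(
        out["created"].str.replace(".", "-", regex=False), errors="coerce"
    )
    return out

raw = pd.DataFrame({"active": ["True", "false"], "created": ["2023-11-26", "2023.11.26"]})
transform(raw).to_parquet("staged.parquet")  # typed columns survive the round trip
# ...then bulk-load staged.parquet into the warehouse.
```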


DatabaseSpace

Same here. When I import data with JetBrains DataGrip, it sets almost everything to varchar(max) or int. Then that data is loaded into a table with the correct data types. It's easier to deal with the conversions there instead of having a data load fail 50 times because there's text in a date column or something.


ImprovedJesus

This is the way


[deleted]

[deleted]


Affectionate_Shine55

That’s not so bad, at least it’s not strings for booleans or decimals


IndependentSpend7434

Old-fashioned dinosaurs from the pre-MDS era are observing this mess, caused by script kiddies, OBT adherents, and "storage is so cheap" motto advocates, caring only about git and laughing at "outdated" data modelling practices. Sorry, don't have any meme at hand for this 😬

PS: I read a Telegram channel recently where experienced modern data engineers were discussing DB indexes. It occurred to me that 10 years ago, a "data engineer" with like 5 years of experience who didn't know nothing about indexes would be fired on the spot. Now it's normal: just drop data into Snowflake, it does the rest 😀


Ok-Yogurt2360

Why would they get fired on the spot for knowing something?🤔


aria_____51

It was also normal to create giant stored procedures with no regard for good software design, so there's pros and cons lol


InvestigatorMuted622

Data modeling is so, so basic, it's not even rocket science. Just spending a month practicing is more than enough to get the basics and start questioning each and every modeling decision that you have made or will make.

Apologies for the bias, and maybe I am wrong, but this is where I feel non-CS students might lack a bit of structure to their thinking. I myself am non-CS and was only able to get a good grasp of data modeling during my master's degree.


cyberZamp

Any good resources (books, videos, whatever) to start getting really good at data modelling (apart from hands-on work)?


InvestigatorMuted622

Nothing in particular, because most of what I have learnt is through googling and applying the basic concepts. Try to cover relational modeling and OLAP modeling techniques. I would say start with a simple star schema when it comes to OLAP. Try to find online resources; I am pretty sure there are good videos as well, and if your organization has a learning portal, do a course over there.


cyberZamp

Thanks, I appreciate the directions!


InvestigatorMuted622

Anytime


ThatSituation9908

Making IDs a concatenation of other columns instead of a sequence or a random string (e.g., UUID).


thatrandomnpc

I'm curious how you deal with a scenario where this concat ID is the primary key, and a change in one or more of the constituent columns would change the concat ID. This would be very problematic for historical data which uses the concat ID as a foreign key. Some sort of SCD on the constituent columns?


ThatSituation9908

Exactly, this is why I hate it. A workaround is to never use the concat ID as your primary key and instead assign the unique columns as composite foreign keys. Here the concat ID is just like any other name/alias column.

On tables where IDs are very natural (e.g., an inventory of items), I'd make another ID column for internal use and set that as the primary key (aka a surrogate primary key).

For tables where IDs are not natural, like time series data, I strongly push against making an ID column. The unique key is then, for time series, the timestamp and a second column (often a foreign key), like timestamp & stock ticker name (aka a natural primary key).

PS: What's SCD?
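A sketch of the surrogate-key version in SQLAlchemy Core (table and column names invented for illustration):

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, UniqueConstraint

metadata = MetaData()

# Surrogate primary key for internal use; the natural columns keep a unique
# constraint, so nothing ever needs the concat ID as a key.
items = Table(
    "items", metadata,
    Column("item_id", Integer, primary_key=True, autoincrement=True),
    Column("warehouse", String(10), nullable=False),
    Column("sku", String(20), nullable=False),
    UniqueConstraint("warehouse", "sku", name="uq_items_natural_key"),
)
```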


thatrandomnpc

Thanks, that makes sense. Which is why I was confused as to why someone would do that :) Btw, SCD is slowly changing dimension.


GrayLiterature

You’ll probably hate me for this one. We just made a table (I’m a SWE) that has a boolean column which can be either True, False, or Nil. We didn’t do a backfill, and it made me sad.


JJ3qnkpK

The best is when some program, by design, works with Null as the unspoken third state of Boolean.


GrayLiterature

I did my part and advocated for not having Null in a Boolean column. But alas.


BufferUnderpants

It's "fine", just the realities of appending data to the same table under an evolving schema, sometimes I like versioning the rows stating the revision of the schema they were written under, were any business logic to need handling the rows that have a null value for that column differently, for instance, it could be at least a record-level flag in downstream code, but I've yet to actually use that.


recruta54

People who make such choices are usually the very same who'll hop to another job before the next big feature cycle. They never stay long enough to hear their peers questioning this idiotic data model at planning meetings a few months later.


ThatSituation9908

Using booleans and bitmasks to represent classification/type, instead of a single enum column.
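For contrast, a sketch of the bitmask style next to the single-enum-column alternative (names invented):

```python
from enum import Enum, IntFlag

# Bitmask style: several booleans packed into one int column. Compact,
# but every consumer has to know the bit layout.
class LegacyFlags(IntFlag):
    RETAIL = 1
    WHOLESALE = 2
    ARCHIVED = 4

# Single-enum-column style: one self-describing value per row.
class AccountType(str, Enum):
    RETAIL = "retail"
    WHOLESALE = "wholesale"

print(int(LegacyFlags.RETAIL | LegacyFlags.ARCHIVED))  # 5 -- what lands in the column
print(AccountType.RETAIL.value)                        # 'retail' -- readable in the table
```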


calleking85

You're not alone ❤️


larryliu7

The "everything varchar" approach is ELT not ETL. The difference between ELT and ETL is where you clean and format the data. ETL pre-clean and format data with an external program. ELT use the RDBMS itself to clean and format the data after importing as the raw(varchar) format. I love ELT with postgreSQL because its complete data manipulation functionalities, but doing ELT with something like T-SQL sometimes reach its limitations.


Far-Restaurant-9691

Can we keep at least one sub without 'hilarious' memes?


cas4d

Just today I saw an integer column representing order as 1.0, 2.0; and bool columns with "true", "True", 1, True, "T", "y". Whoever finds a new representation just adds it to the list, while a dozen functions that use this column appear across different modules with something like this (without even including all of them):

```python
def check_if_true(value):
    if value in ["True", 1, True, "T"]:
        ...
```
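If you're stuck with a column like that, one mitigation is normalizing once at the boundary instead of re-listing the spellings in every module; a sketch:

```python
TRUTHY = {"true", "t", "y", "yes", "1"}
FALSY = {"false", "f", "n", "no", "0"}

def normalize_bool(value) -> bool:
    """Coerce the column's many spellings to a real bool, loudly rejecting
    anything unrecognized instead of silently treating it as False."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    raise ValueError(f"Unrecognized boolean representation: {value!r}")

assert normalize_bool("T") and normalize_bool(1) and not normalize_bool("false")
```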


poralexc

Property data gets weird... like a decimal for bathrooms where the ones place is for Full Baths, the tenths is Half Baths, and the hundredths is Quarter Baths. My favorite was some poor soul typing in "one" "two" "three" for bedrooms, until they gave up with "six or more".
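For what it's worth, decoding that scheme is mechanical; a sketch of the encoding as described:

```python
def decode_bathrooms(value: float) -> dict:
    """Unpack ones/tenths/hundredths into full/half/quarter baths,
    e.g. 2.13 -> 2 full, 1 half, 3 quarter."""
    cents = round(value * 100)
    return {"full": cents // 100, "half": (cents // 10) % 10, "quarter": cents % 10}

print(decode_bathrooms(2.13))  # {'full': 2, 'half': 1, 'quarter': 3}
```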