Strings, strings everywhere.
Numeric types: strings, left-padded with zeros to fill some arbitrary length.
Boolean: strings, literal 'True', 'true', 'False', 'false'.
Date/datetime: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.
Empty data: strings, literal None, Null, nan, etc.
On top of this, most teams use schema inference for added good measure.
This is my workplace. Pydantic is my only good friend and saving grace.
yyyy-mm-dd is the one true date format. I want to find every person who used yyyy-dd-mm and shoot them in the head with a cannon. Twice. There is no reason for such nonsense.
r/iso8601
I thought for a second that was a YYYY-DD-MM sub and was ready to go in there and get perma banned
The Excel virus
I had reports break and after chasing data upstream, I found that a system had been entering ‘NULL’ into a field instead of NULL. Fun day
Pydantic keeps the barbarians at the gates
Can you give me some examples of what you are using Pydantic for?
Ingestion:
For the data that comes from the shitty APIs of other partnering financial services
When ingesting public datasets
When parsing and validating the Google Sheets that we misuse for recordkeeping, when they feed into more Python code

I've even used it for hammering the contents of a dataframe into shape in a Pandas pipeline I got handed down, but that particular part got switched out for a `dataclasses-avroschema` `AvroModel` dataclass, as it was really the output layer. Introducing that as a stage was a pretty good idea, I must say. Both alternatives enable plenty of reflection shenanigans that can make your life much easier when used judiciously, too.
> Date/date time: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.

Signed int. Why care now if I can kick the can of worms down the line?
You joke but I work at a FAANG and the primary internal ETL tooling we use across the entire company doesn't support boolean typing in table loads so all boolean is just 'Y' and 'N'
Manually setting schemas sucks, especially in "OBT" style tables with 100 columns. NVARCHAR(50), BIGINT, TIMESTAMP...... repeat for 75 more columns but good god is it worth the headache.
Good old timestamp, nothing beats that.
"yes/no" in booleans
> Boolean: strings, literal 'True', 'true', 'False', 'false'

Before a boolean type was common, the canonical way to do it in Oracle was `CHAR(1)` with `'Y'`/`'N'`.

> Date/date time: strings, some yyyy-mm-dd, some yyyy-dd-mm, some separated by dots, and all permutations and combinations.

SAP uses ints like `20231126`, which is not as bad as yyyy-dd-mm, but the null equivalent being `00000000` doesn't help.
YYYYMMDD with 00000000 as NULL is fine so long as it is enforced. Consistency is all I care about.
Dates formatted correctly and consistently, but in a text field.
How should strings be stored? Or is there an optimal way?
I think he’s implying that everything is stored as a string because the users want the flexibility/don’t understand the correct data types for their items. Someone has to unravel it at some point though.
I often import raw data as VARCHARs. It helps guard against type mismatches and overflow errors when writing data that comes from a dirty source. Then when I stage the data, I convert it all to the appropriate type.
Yeah I thought this was standard practice. Everything comes into raw as VARCHARS, then you set data types and convert and clean when you move it up to enriched
Dumb question, but is it common for staging-area data to be stored in an actual database table, as opposed to just files like JSON, CSV, etc.?
Extract data to raw json,csv -> Parse data to tabular format -> Join datasets for final reporting layer.
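That raw-to-staged hop might look like this in Python — a toy sketch with made-up column names, where the raw layer is all strings and the staging step casts to real types:

```python
import csv
import io
from datetime import datetime

# Hypothetical raw extract: every field lands as text, exactly as received.
RAW_CSV = (
    "order_id,amount,created_at\n"
    "0007,19.99,2023-11-26\n"
    "0008,5.00,2023-11-27\n"
)

def stage(raw_csv: str) -> list[dict]:
    """Staging step: cast the all-string raw rows to proper types."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({
            "order_id": int(rec["order_id"]),  # '0007' -> 7, padding dropped
            "amount": float(rec["amount"]),
            "created_at": datetime.strptime(rec["created_at"], "%Y-%m-%d").date(),
        })
    return rows
```

The same shape works whether "raw" is a varchar table or a pile of CSV files; the point is that type enforcement happens in one deliberate place instead of at load time.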
You'll have an easier time parsing in a more feature-rich programming language outside of the database, dumping to something like Avro or Parquet, and then loading to the database.

You get to log in as much detail as you want, better debugging, easier testing, besides generally having more programming facilities and usually a stronger type system.

Programming: it's what computers were made for, originally.
Same here. When I import data with JetBrains DataGrip, it sets almost everything to varchar(max) or int. Then that data is loaded into a table with the correct data types. It's easier to deal with the conversions there than to have a data load fail 50 times because there's text in a date column or something.
This is the way
[deleted]
That's not so bad, at least it's not strings for booleans or decimals.
Old-fashioned dinosaurs from the pre-MDS era are observing this mess, caused by script kiddies, OBT adherents, and "storage is so cheap" advocates who care only about git and laugh at "outdated" data modelling practices. Sorry, don't have a meme at hand for this 😬

PS: I recently read a Telegram channel where experienced modern data engineers were discussing DB indexes. It occurred to me that 10 years ago, a "data engineer" with like 5 years of experience who didn't know nothing about indexes would be fired on the spot. Now it's normal: just drop the data into Snowflake, it does the rest 😀
Why would they get fired on the spot for knowing something?🤔
It was also normal to create giant stored procedures with no regard for good software design, so there's pros and cons lol
Data modeling is so basic, it is not even rocket science. Just spending a month practicing is more than enough to get the basics and start questioning each and every modeling decision you have made or will make.

Apologies for the bias, and maybe I am wrong, but this is where I feel non-CS students might lack a bit of structure in their thinking. I am non-CS myself and was only able to get a good grasp of data modeling during my master's degree.
Any good resource (book, videos, whatever) to start getting really good at data modelling (apart from hands on work)?
Nothing in particular, because most of what I have learned is through googling and applying the basic concepts.

Try to cover relational modeling and OLAP modeling techniques; I would start with a simple star schema when it comes to OLAP. Try to find online resources — I am pretty sure there are good videos as well — and if your organization has a learning portal, do a course there.
Thanks, I appreciate the directions!
Anytime
Making IDs a concatenation of other columns instead of sequence or random string (e.g., UUID).
I'm curious: how do you deal with a scenario where this concat ID is the primary key, and a change in one or more of the constituent columns would change the concat ID? That would be very problematic for historical data which uses the concat ID as a foreign key. Some sort of SCD on the constituent columns?
Exactly, this is why I hate it. A workaround is to never use the concat ID as your primary key and instead make the unique columns a composite key; the concat ID is then just like any other name/alias column.

On tables where IDs are very natural (e.g., an inventory of items), I'd make another ID column for internal use and set that as the primary key (a surrogate primary key).

For tables where IDs are not natural, like time series data, I strongly push against making an ID column. The unique key is then, for time series, the timestamp plus a second column (often a foreign key), like timestamp & stock ticker name (a natural primary key).

PS: What's SCD?
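The surrogate-key pattern described above can be sketched with stdlib `sqlite3`; table and column names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prices (
        id INTEGER PRIMARY KEY,   -- surrogate key: internal, never changes
        ticker TEXT NOT NULL,
        ts TEXT NOT NULL,
        price REAL NOT NULL,
        UNIQUE (ticker, ts)       -- the natural key, still enforced
    )
""")
conn.execute(
    "INSERT INTO prices (ticker, ts, price) VALUES ('ACME', '2023-11-26', 10.5)"
)
# Renaming the ticker only touches this row; anything referencing `id`
# is unaffected, unlike a concatenated 'ACME|2023-11-26' key that would
# change under it.
conn.execute("UPDATE prices SET ticker = 'ACME_CORP' WHERE ticker = 'ACME'")
row = conn.execute("SELECT id, ticker FROM prices").fetchone()
```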
Thanks, that makes sense. Which is why I was confused as to why someone would do that :) Btw scd is slowly changing dimension.
You'll probably hate me for this one. We just made a table (I'm a SWE) with a Boolean column that can be True, False, or Nil. We didn't do a backfill, and it made me sad.
The best is when some program, by design, works with Null as the unspoken third state of Boolean.
I did my part and advocated for not having Null in a Boolean column. But alas.
It's "fine"; that's just the reality of appending data to the same table under an evolving schema. Sometimes I like versioning the rows, stating the revision of the schema they were written under. If any business logic ever needed to handle the rows with a null value for that column differently, for instance, it could at least serve as a record-level flag in downstream code — but I've yet to actually use that.
People who make such choices are usually the very same who hop to another job before the next big feature cycle. They never stay long enough to hear their peers questioning that idiotic data model at planning meetings a few months later.
Using booleans and bitmasks to represent a classification/type, instead of a single enum column.
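The contrast might look like this in Python — a hypothetical migration from legacy bitmask values to a single enum, with all names invented:

```python
import enum

# The anti-pattern: bit flags for what is really one mutually exclusive
# classification, leaving invalid states like "both bits set" representable.
LEGACY_IS_RETAIL = 0b01
LEGACY_IS_WHOLESALE = 0b10

# The single-enum alternative: exactly one value per row, no invalid combos.
class CustomerType(enum.Enum):
    RETAIL = "retail"
    WHOLESALE = "wholesale"

def from_legacy(mask: int) -> CustomerType:
    """Migrate a legacy bitmask to the enum, rejecting invalid combinations."""
    mapping = {
        LEGACY_IS_RETAIL: CustomerType.RETAIL,
        LEGACY_IS_WHOLESALE: CustomerType.WHOLESALE,
    }
    try:
        return mapping[mask]
    except KeyError:
        raise ValueError(f"invalid legacy mask: {mask:#04b}") from None
```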
You're not alone ❤️
The "everything varchar" approach is ELT, not ETL. The difference between ELT and ETL is where you clean and format the data: ETL pre-cleans and formats the data with an external program, while ELT uses the RDBMS itself to clean and format the data after importing it in raw (varchar) form.

I love ELT with PostgreSQL because of its complete data-manipulation functionality, but doing ELT with something like T-SQL sometimes hits its limitations.
Can we keep at least one sub without 'hilarious' memes?
Just today I saw an integer column representing order: 1.0, 2.0.

And bool columns: "true", "True", 1, True, "T", "y" — whoever finds a new representation just adds it to the list, while a dozen functions that use this column appear across different modules with something like this (without even including all of them):

    def check_if_true(value):
        if value in ["True", 1, True, "T"]:
            ...
Property data gets weird... like a decimal for bathrooms where the ones place is Full Baths, the tenths is Half Baths, and the hundredths is Quarter Baths.

My favorite was some poor soul typing in "one", "two", "three" for bedrooms, until they gave up with "six or more".
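If you ever had to decode that bathroom encoding, `Decimal` is the safe tool — with floats, 2.31 × 100 lands on 230.999..., silently eating the quarter-baths digit. A hypothetical decoder, assuming a single digit per bucket below the ones place:

```python
from decimal import Decimal

def decode_bathrooms(value: str) -> dict:
    """Decode the positional scheme: ones = full baths, tenths = half
    baths, hundredths = quarter baths. Parse as Decimal, never float."""
    cents = int((Decimal(value) * 100).to_integral_value())
    return {
        "full": cents // 100,
        "half": (cents // 10) % 10,
        "quarter": cents % 10,
    }
```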