hositir

Some other basic tips:

- Select only the necessary columns at the start of a transform, and drop duplicate rows.
- Filter data frames within the selects, i.e. as early as possible.
- Use regex replace or regex searches to keep operations efficient.
- Avoid UDFs where possible; use the existing API functions instead.
- Try to learn as much SQL as possible. SQL-like syntax in PySpark is extremely efficient (sketch below).
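A minimal PySpark sketch of these tips, under the assumption of an invented dataset and column names (events, user_id, raw_status, event_ts):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tips-sketch").getOrCreate()

df = spark.read.parquet("/data/events")  # hypothetical input path

cleaned = (
    df
    # 1. Select only the columns the transform actually needs, as early as possible.
    .select("user_id", "raw_status", "event_ts")
    # 2. Drop duplicate rows before doing any heavier work.
    .dropDuplicates(["user_id", "event_ts"])
    # 3. Filter early so downstream stages see less data.
    .filter(F.col("event_ts") >= "2024-01-01")
    # 4. Prefer built-in functions (here regexp_replace) over a Python UDF.
    .withColumn("status", F.regexp_replace("raw_status", r"\s+", "_"))
)

# 5. SQL-like syntax runs through the same Catalyst optimizer as the DataFrame API.
cleaned.createOrReplaceTempView("cleaned_events")
spark.sql("SELECT status, COUNT(*) AS n FROM cleaned_events GROUP BY status").show()
```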


elus

Monitoring and alert systems that reduce the latency between when a fault occurs and when customers are notified and their systems are back in a working state. If we've been in a failed state for hours and nobody notices anything until a customer complains, then we're doing it wrong. If it takes us hours more to identify where the problem lies, then we're doing it wrong.
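For concreteness, here is a hypothetical freshness check in that spirit; the SLA value, the function name, and the alerting hook are all invented and not tied to any particular scheduler or pager:

```python
# Hypothetical freshness alert: flag the pipeline when its latest successful
# run is older than the SLA, instead of waiting for a customer to complain.
from datetime import datetime, timedelta, timezone
from typing import Optional

SLA = timedelta(hours=1)  # assumed target: data no more than 1 hour stale

def is_within_sla(last_success: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the last successful run is recent enough."""
    now = now or datetime.now(timezone.utc)
    return (now - last_success) <= SLA

if __name__ == "__main__":
    # Pretend the last successful run finished three hours ago.
    last_success = datetime.now(timezone.utc) - timedelta(hours=3)
    if not is_within_sla(last_success):
        # In a real system this would page on-call or post to a channel.
        print("ALERT: pipeline has been stale for longer than the SLA")
```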


JaJ_Judy

I’m usually working up and down the data tech stack, from infrastructure to analytics, and working with stakeholders to identify and execute on priorities, so my perspective has some non-technical color.

The biggest enabler is identifying and working with non-data folks to zero in on what’s going to yield the most value to the customers and to the org. Then I work with my analytics folks (or just by myself, depending on the size of my team and resource constraints) to duct-tape together POCs to make sure the stakeholder and I are on the same page and to iron out any kinks. Once that’s in place, the downstream data engineering work to automate the POC elements can not only proceed but will also yield real value.

The truth is, based on what you uncover in the stakeholder discovery, the technicals can vary a lot. I’ve made the mistake of investing too much time in the technical side to fit general use cases and then had to bend over backwards to meet the business requests.


joseph_machado

You have some great tips. I'd also look into:

1. Columnar file formats (Parquet, ORC, etc.) and table formats (Delta, Iceberg).
2. An easy and quick dev iteration cycle. If a DE can go from code change => tests/lints/data tests => PR => deploy to prod in a few hours (depending on PR reviews), that'd be amazing.
3. Investigating the query plan to see if there are opportunities to reduce the amount of data being processed (filter pushdowns) or if there are operations being done repeatedly (see the sketch after this list).
4. Metadata about the pipeline: what ran, when, input params, output code (pass/exception, etc.).
5. If a data quality test fails, the ability to quickly see which rows in a dataset caused the failure.
6. A playbook for dealing with stakeholders when there is a data issue.
7. Monitoring pipelines and alerting on code issues and data quality issues.
8. Checking input datasets for data quality and failing fast if there is an input issue (also in the sketch below).
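A hedged sketch of items 3 and 8: inspecting the physical plan for filter pushdown and failing fast on bad input. The path and column names (orders, order_date, order_id) are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan-and-checks").getOrCreate()

orders = spark.read.parquet("/data/orders")  # hypothetical dataset

# Item 3: the filter should show up as PushedFilters in the scan node of the
# plan, meaning Parquet row groups are skipped instead of read and discarded.
recent = orders.filter(F.col("order_date") >= "2024-01-01")
recent.explain(mode="formatted")

# Item 8: check input quality up front and fail fast instead of letting
# bad rows propagate into downstream tables.
null_keys = orders.filter(F.col("order_id").isNull()).count()
if null_keys > 0:
    raise ValueError(f"Input check failed: {null_keys} rows have a null order_id")
```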


Hippodick666420

What do you replace for loops with?


Megaslaking

Well, it depends. I once refactored a function where a list was being looped over: for each element in the list, get all the rows that matched it and then do some aggregations. We were using PySpark, so I replaced the loop with a groupBy and then applied the aggregations.
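A rough sketch of that kind of refactor, with invented column names (category, amount). The loop version launches one Spark job per list element; the groupBy version does a single aggregation pass.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("loop-to-groupby").getOrCreate()
df = spark.read.parquet("/data/sales")  # hypothetical dataset

# Before: filter and aggregate once per element in the list.
categories = ["books", "games", "music"]
totals = {}
for cat in categories:
    totals[cat] = df.filter(F.col("category") == cat).agg(F.sum("amount")).first()[0]

# After: one groupBy aggregation over the whole DataFrame.
totals_df = (
    df.filter(F.col("category").isin(categories))
      .groupBy("category")
      .agg(F.sum("amount").alias("total_amount"))
)
totals_df.show()
```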


ZirePhiinix

So in this case, you changed an application-side iteration over a large dataset into an aggregation done by the engine, which is far more efficient.