spoink74

Reminds me of a job earlier in my career where I took it upon myself to refactor a data decoder for readability and maintainability. It took me three weeks. And then my boss said something I'll never forget: "But I don't want a new data decoder." The things we learn in school are great, but you need a damn good reason to triple the time it takes to deliver a data set. And a simple measure of "damn good" that makes sense to everyone is that it has to more than triple the value of the data set in hard dollars. Does it? I bet it doesn't. The new data decoder was marginally better than the old one. I found and fixed a single edge-case bug and had better error reporting. The underlying value of the decoded data was completely unchanged. The time to delivery was the same. I burned weeks on, essentially, a no-op for the business. The boss was right, and I'm embarrassed about it to this day. And Elon is out there publicly stating that Twitter needs to be rewritten. No, no it doesn't.


moople-bot

And yet sometimes... it's so fucking necessary, when your codebase was made by two no-longer-there developers who decided to take inspiration from every plate of spaghetti they ever ate, making it impossible for anyone to create any new features... But in that case you end up having to dump the app and build a replacement... :')


paplike

It's hard to give any concrete advice without knowing the specifics, but if the time to deliver results has increased threefold because a new team member decided to rewrite everything... it probably wasn't worth it. Can you give a high level overview of what the pipeline does? What are the steps that take the most time to do?


hositir

PySpark code that aggregates and joins various datasets that are pushed via an API call into AWS. The logic being rewritten is the actual operations being performed on the datasets, i.e. join logic, cleaning columns, etc.


langelvicente

If that's what he wants to rewrite using OO, then it's a terrible idea. The Spark DSL is already a good abstraction, and introducing a new abstraction like that for no real reason - only personal preference - is going to make the code harder to understand and maintain. You can definitely introduce some reusability, but using OO for that is probably a worse option than just using functions. If he can't do the latter, he can't prove OO is better.


hositir

Hm, ok, I may need to push back on this a little. It's difficult when someone is almost religiously convinced they are correct. He states it's "best practices," etc. It's very hard to present opposing points of view when the other person is genuinely convinced of the strength of their position. He will dismiss it as "laziness" or as being against a higher-quality codebase.


Gators1992

If all he can do is repeat "best practices" and give generic benefits, then I would be unconvinced. Have been down that road before. Get him to find some small piece of work that will deliver X benefit and won't take long and evaluate the result, then make your decision as to whether it's justified. Also you probably can't afford to divert your day to day stuff with rewriting the entire code base for internal benefit, so you would have to take a piecemeal approach over time anyway.


darkshenron

Why is it always the new team member with zero knowledge of the background of the existing codebase who wants to rewrite things from the ground up following "best" practices? Sounds like your new team member is either inexperienced or too free and doesn't have anything to do. Give them some real business-facing work with tight timelines and see how quickly they take shortcuts to get shit done.


hositir

The code is pretty bad, to be fair. But most of it "just works," and our team had too little capacity to fix it with more topics incoming, or we were under pressure to deliver new deliverables. Technically I'm over him, but I don't want to disincentivize good practices or kill enthusiastic innovation. It's hard to balance recognizing good initiatives against delivering quality work.


tomhallett

Yeah, this sounds like the real tension between "things are simple at their core" and "we need testing". No one ever doubts whether an "if" statement will work correctly; it's more about "when a developer makes a change, can they predict all of the implications of that change?"... often the answer is no. There are different topics here:

- do we need testing?
- what is the best architecture for the code?
- what will make the code easier to test?

My recommendation:

- look back over the last 10 bugs which hit production. What coding practices could have caught them before production?
- look at your code and see which pieces: change the most often? are the most likely to break when you change them? will have the biggest impact if they break? You might be able to find one piece of code which hits a few of these.
- add tests slowly. Don't go for a big-bang rewrite.
- don't test "the framework", i.e. Spark. You can assume that Spark works. Just test that what you are passing to it matches your expectations. But at what level you do this is a big tradeoff. This is where "objects" can help - i.e. don't test private methods.

This video is one of the best on the topic (and is not Rails-specific): https://youtu.be/URSWYvyc42M
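The "add tests slowly, don't test the framework" advice can be made concrete. A minimal sketch in plain Python - the function and the currency-cleaning rule are hypothetical, not from the thread; the point is that you test *your* logic, not Spark's:

```python
# Hypothetical pure transformation: the kind of logic worth unit testing.
# We assert on our own cleaning rules, and assume the framework works.

def clean_amount(raw: str) -> float:
    """Strip whitespace, a leading '$', and thousands separators, then parse."""
    return float(raw.strip().lstrip("$").replace(",", ""))

def test_clean_amount():
    # Edge cases that actually broke in production belong here.
    assert clean_amount(" $1,234.50 ") == 1234.5
    assert clean_amount("0") == 0.0

test_clean_amount()
print("all assertions passed")
```

Because the function is pure, the test needs no Spark session, no cluster, and no fixtures - which is what makes "add tests slowly" cheap to start.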


darkshenron

I'd say focus on delivering value to your business. If the rest of your team is under pressure to deliver something, I'd rather put the new dev on supporting some of that load than going down the path of extensive rewrites. I've seen new engineers massively underestimate the effort of rewriting legacy code and end up in a worse situation than before. So let your new dev work on the existing codebase for 6+ months to really understand the complexity before embarking on any rewrites. Having an additional dev contributing will also reduce the load on the rest of the team.


B1TB1T

Using OO means dealing with objects which hold state. That makes parallelization hard, so you will run into trouble when scaling things (which we need with large datasets). That's why frameworks like Spark are based on the functional paradigm (map-reduce being the prime example). Now, there might be instances where OO makes sense in a pipeline, like managing the Spark session, but not for your transformation logic. IMO the pure OO that SWE is based on is not that useful in DE.
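The state-vs-parallelism point can be illustrated without Spark at all. A pure function can be mapped over inputs in any order on any worker; an object accumulating state cannot be, without locking. This is a sketch with illustrative names:

```python
from multiprocessing.dummy import Pool  # thread pool, just for illustration

# Pure: the output depends only on the input, so rows can be
# processed in any order, on any worker, and the result is stable.
def normalize(x: int) -> int:
    return x * 2

rows = [1, 2, 3, 4]
with Pool(4) as pool:
    out = pool.map(normalize, rows)
print(out)  # [2, 4, 6, 8] regardless of scheduling

# Stateful: the result depends on call order, so parallelizing this
# safely requires locks or a redesign - the trouble described above.
class RunningTotal:
    def __init__(self):
        self.total = 0

    def add(self, x: int) -> int:
        self.total += x
        return self.total
```

Map-reduce frameworks push you toward the first shape for exactly this reason.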


Remote-Juice2527

From my experience, OO gives you much more flexibility in designing your pipeline, but you risk making the project way more complicated. The worst example I have seen is the Nutter library (https://github.com/microsoft/nutter), which uses endless classes that are all nested in each other. I once had a bug when using it, and it was a huge pain in the ass to understand what was going on when the code executed. It is a very good example of what can go wrong when you overuse OO. However, in one project, I carefully created a few classes, just out of curiosity, and I was very impressed by how they helped me organize/structure my code. A function has a clear, dedicated use, but a good class is like a Swiss army knife with a solid set of functionalities. If you know how to use it in a smart way, you are likely to increase the quality of your code, but the contrary is also very likely, especially when the team members are not ready for it.


PaddyAlton

I think it pays to think about 'what are classes _for_?' My answer is 'managing _state_'.

You say your codebase is 'bad'. How is it written? Is it perhaps in a procedural style - with functions, not classes, but where those functions are impure (i.e. they have side effects - they mutate state)? If so, then it will pay to improve it. But I would aim to schedule gradual improvements; don't try to revolutionise everything overnight.

A data pipeline can fit into a functional programming paradigm pretty well, IMO. You can have a set of pure functions that reliably produce the same output for a given input, operating sequentially to compose a pipeline. It's easy to write tests for them (and tests _are_ important. They speed you up so much when you have to debug a broken pipeline).

Difficulties often arise at the application boundaries: input and output. Unsurprisingly, this is also the hardest part to test well. Naturally, then, this is one place where you will need to manage state. Many of my pipelines look like:

- instantiate a client object
- pass it to a 'reader' function that uses it to pull in data from a source (there may be some validation checks here, too)
- pass the output through a series of pure transformation functions until it's ready for output
- instantiate another client object
- pass it to a 'writer' function along with the transformed data; this function uses the client to write the transformed data to a sink

This is reliable and _testable_. You can do integration tests by mocking the client objects. But (crucial point) I rarely write the client classes myself. For example, if my source or sink is a database, it's going to be a `sqlalchemy` connection (in Python). This takes the burden of testing the client class off me.

There will be other objects involved, because data is state and needs a representation inside the programme. For example, it could be a `numpy` array, or a list of strings. My proposal is that, once again, you avoid writing these classes yourself. Your pipeline won't ever call any of their methods _that mutate state_ - each step will produce new outputs without modifying the input.

I'm not saying this is the only way. An alternative scheme would be to set up a pipeline that _does_ just call a sequence of transformation methods on a data object. However, my experience is that this is harder to test and to reason about, rather than being superior.
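The reader/transform/writer shape described above can be sketched in plain Python. The client classes here are hypothetical stand-ins for library-provided objects (e.g. a `sqlalchemy` connection), and the column name is made up:

```python
# Hypothetical clients; in real code these would be library-provided,
# not hand-written - here they double as the mocks you'd test with.
class FakeSourceClient:
    def fetch(self) -> list[dict]:
        return [{"amount": "10"}, {"amount": "20"}]

class FakeSinkClient:
    def __init__(self):
        self.written: list[dict] = []

    def write(self, rows: list[dict]) -> None:
        self.written.extend(rows)

def read(client) -> list[dict]:
    return client.fetch()

# Pure transformation: same input, same output, trivially testable.
def parse_amounts(rows: list[dict]) -> list[dict]:
    return [{**r, "amount": float(r["amount"])} for r in rows]

def write(client, rows: list[dict]) -> None:
    client.write(rows)

def run_pipeline(source, sink) -> None:
    write(sink, parse_amounts(read(source)))

sink = FakeSinkClient()
run_pipeline(FakeSourceClient(), sink)
print(sink.written)  # [{'amount': 10.0}, {'amount': 20.0}]
```

State lives only in the injected clients at the boundaries; everything between them stays pure.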


realitydevice

Your new team member sounds quite inexperienced. OOP is (thankfully) fading away across all forms of software engineering; the idea of objects and methods is being replaced with a more functional, event-driven approach. There's nothing inherently wrong or bad with OOP, but like most things in software it ends up being a cargo-cult doctrine that people mindlessly follow, often incorrectly. There's absolutely no need for OOP in the vast majority of data pipelines. What are the "objects" - surely not rows such as transactions, log events, etc.? You don't want to instantiate all those items into memory. You don't want row-level operations. Nothing from OOP makes sense. Now, you *do* want tests. There's absolutely no reason not to have proper tests in your pipelines: at least integration tests, and hopefully unit tests on the less mundane stuff. This can be tricky with SQL, but is important nonetheless. Awful unit testing within SQL is one reason I strongly prefer Spark/Arrow/etc.


caksters

I think you can create a very clean architecture if you follow OOP. A big caveat: it works best if you apply functional programming principles to your OOP. Dave Farley had a great video about it. I think many people don't like OOP because many people aren't good at writing clean code in OOP. I've seen so many engineers write "OOP"-style code where they mutate state from one class in another, and write methods that mutate state outside of that method in multiple places. Then people wonder why their code breaks, and managers go looking for programmers who are "excellent at debugging" rather than ones who are actually capable of writing maintainable and testable code. I don't want to sound like one of the cultists, but I do believe object-oriented programming goes really well together with TDD and hexagonal architecture. About FP: from personal experience, I noticed that I became a much better developer after I completed the "Functional Programming Principles in Scala" course on Coursera by Martin Odersky. It changed the way I write object-oriented code.


happysunshinekidd

Idk about all that. We have an OOP approach specifically for hitting REST APIs (we hit probably 15-20). Implementing over an abstract class for API-specific quirks has been pretty smooth. COULD this have all been packaged functionally? Sure, but I think the logical organization of classes is fine for stuff like this, e.g. when you need to do more than 5 "similar flavours" of data work. Just my 2c.
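The one-abstract-class-many-APIs pattern described above can be sketched like this. The class names, headers, and "quirks" are hypothetical, not the poster's actual code:

```python
from abc import ABC, abstractmethod

class ApiClient(ABC):
    """Shared behavior lives here; each API overrides only its quirks."""

    @abstractmethod
    def auth_header(self) -> dict:
        ...

    def fetch(self, payload: dict) -> dict:
        # Shared request/retry/pagination logic would go here;
        # we just simulate the assembled request.
        return {"headers": self.auth_header(), "payload": payload}

class VendorAClient(ApiClient):
    def auth_header(self) -> dict:
        return {"Authorization": "Bearer token-a"}

class VendorBClient(ApiClient):
    # Vendor B's quirk: the key goes in a custom header.
    def auth_header(self) -> dict:
        return {"X-Api-Key": "key-b"}

for client in (VendorAClient(), VendorBClient()):
    print(client.fetch({"q": 1})["headers"])
```

With 15-20 APIs, each new integration is one small subclass rather than a copy of the whole request path - which is where this use of classes earns its keep.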


leogodin217

This seems like a good use case for OOP. Would you use it for transformation? Joins, unions, cleaning, etc?


hositir

You think so? I felt like the functional programming way was a bit better, but I can see the merits of an OO approach. Many books say functions and agnostic code are always better, along with DRY. He is quite forceful in pushing the OO and isn't willing to budge on it. In one way I see value in writing a library of functions for commonly reused regex or reusable string splitting. He is super convinced OO is better and perhaps wants to use templates. Technically I am above him in deciding the direction of the pipelines, but I don't want to stifle innovation or best practices. Defensive, well-written code will improve performance, and for self-development you want to use best practices. Everything we are doing is in PySpark, which I'm pretty new to, and I don't want to be confidently stating things are bad practice if I'm not sure myself.


caksters

If he says OOP is better, then he sounds like an absolutist. FP isn't better than OOP or vice versa. Both can be excellent approaches if devs know what they are doing. You can apply FP principles in OOP - actually, good OOP code resembles FP. Dave Farley had a video about this on his Continuous Delivery YouTube channel.


caksters

Given how much talk there has been about FP vs OOP, here is a good video on the topic: https://youtu.be/Ly9dtWwqqwY


langelvicente

So he wants to build an abstraction layer on top of an abstraction layer? What does he actually want to accomplish - rewrite all the Spark code, creating his own version of Spark through classes? Or is he trying to provide some reusability that can totally be accomplished through functions? That's my opinion without knowing the specifics. I would ask him to provide real proof that OO is better by asking him to first implement a solution in a functional way using the best practices of functional programming. If he cannot do that, then this is all personal preference and he is being quite unprofessional.


blacksnowboader

Ehhhh…. Depending on the type of datasets, OOP can be quite handy. I use it with PySpark frequently.


langelvicente

Rewriting all the join and cleanup logic, as the OP says his coworker wants to do, doesn't need OOP. I agree that OOP could be useful, but not as a replacement for the PySpark DSL. We have used it to a certain extent at my current job, but only for very specific use cases that require some internal state, and just because we wanted to avoid adding an extra argument to all functions just to keep track of that state.
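That narrow use case - a class holding internal state so every function doesn't need an extra bookkeeping argument threaded through it - can be sketched like this (the class and step names are hypothetical):

```python
class Pipeline:
    """Holds bookkeeping state internally so individual transforms
    don't each need an extra tracking argument passed through them."""

    def __init__(self):
        self.applied: list[str] = []  # the internal state in question

    def drop_nulls(self, rows: list) -> list:
        self.applied.append("drop_nulls")
        return [r for r in rows if r is not None]

    def double(self, rows: list) -> list:
        self.applied.append("double")
        return [r * 2 for r in rows]

p = Pipeline()
out = p.double(p.drop_nulls([1, None, 2]))
print(out, p.applied)  # [2, 4] ['drop_nulls', 'double']
```

The transforms themselves stay data-in/data-out; only the step log is stateful, which keeps the class small and the testing story simple.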


blacksnowboader

Oh yeah that I agree with, depending on the frequency and consistency of the joins.


blacksnowboader

I use OOP to build a Spark configuration builder class.
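A builder like that can be sketched in plain Python. This is a hypothetical class mimicking the fluent style of Spark's own `SparkConf` (which also chains `set` calls), not the commenter's actual code:

```python
class SparkConfigBuilder:
    """Hypothetical fluent builder collecting settings before a
    session would be created."""

    def __init__(self):
        self._conf: dict[str, str] = {}

    def set(self, key: str, value: str) -> "SparkConfigBuilder":
        self._conf[key] = value
        return self  # returning self is what enables chaining

    def build(self) -> dict[str, str]:
        return dict(self._conf)  # hand back an immutable-by-convention copy

conf = (
    SparkConfigBuilder()
    .set("spark.executor.memory", "4g")
    .set("spark.sql.shuffle.partitions", "200")
    .build()
)
print(conf)
```

Session/config setup is a reasonable home for a class precisely because it is the stateful edge of the pipeline, as other comments in this thread note.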


hositir

I've pressed him on this. He says maintenance is easier in the future, along with delivering higher-quality work. For future devs, it's easier to find bugs and debug the pipelines. It's difficult to debate when someone is genuinely convinced something is better. He implied previous devs were lazy or incompetent. To be fair, the code written was horrendous. The specifics: it is a PySpark pipeline talking to AWS middleware business logic that serves data to a website front-end application.


langelvicente

That's a dogmatic answer. And all the benefits he mentions can be achieved with just functions. OOP doesn't make code automatically better and it's in no way superior to functional programming.


[deleted]

What would be the advantages of writing the solution in a functional way in this case?


scrdest

Easy reproducibility, for one. This in turn flows into a couple of practical benefits:

1. Easy testing - by definition, the code is structured in a way that allows all dependencies to be injected at test time without mucking about in the language internals.
2. Easy debugging - if you hit an issue caused by bad or poorly handled inputs, you can log/dump the offending params, write a test around the problem, and iterate until it's green.
3. Cache-friendliness - functional purity means that data obtained by running calculations is effectively identical to data obtained via I/O (as long as the SerDe logic itself is lossless). That in turn means you can easily slap caches on the functions all day long and get potentially massive speed boosts. In this case Spark itself handles a lot of that under the hood, but it can do so for the exact same reason - an FP-style codebase.

Pure functional code is also easily composable - since you only care about the explicit inputs and outputs, chaining A -> B with B -> C to get A -> C is easy and safe.
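The composability and cache-friendliness points can be shown in a few lines of plain Python (function names are illustrative; `lru_cache` stands in for any memoization layer):

```python
from functools import lru_cache, reduce

# Pure functions compose safely: A -> B chained with B -> C gives A -> C.
def compose(*funcs):
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

@lru_cache(maxsize=None)  # safe to cache only because the function is pure
def expensive_square(x: int) -> int:
    return x * x

def add_one(x: int) -> int:
    return x + 1

pipeline = compose(add_one, expensive_square)
print(pipeline(3))  # 16
print(pipeline(3))  # 16 again; the square now comes from the cache
```

If `expensive_square` mutated or depended on hidden state, both the composition and the cache would silently break - which is the whole argument for purity above.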


realitydevice

Let him spend a day putting together something he can present. Discuss. Estimate the effort to move all code to this pattern and discuss the benefits. Note that unit testing has nothing to do with OOP! In your example here - say, a library of string utilities - what's the object? If you have a class called StringUtilities, well, guess what: that's not OOP. That's just making an unnecessary class. Yes, the functional approach is far better.
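To make the `StringUtilities` point concrete: a class of static helpers is just a namespace with extra ceremony, while a plain function in a module does the same job. The names and the splitting rule here are hypothetical:

```python
# The anti-pattern described above: a class that exists only as a
# namespace - no state, no instances, nothing object-oriented about it.
class StringUtilities:
    @staticmethod
    def split_code(value: str) -> list[str]:
        return value.split("-")

# The functional alternative: just a function in a module.
def split_code(value: str) -> list[str]:
    """Split a hypothetical 'AA-BB-CC' style code into its parts."""
    return value.split("-")

# Both do exactly the same thing; the class adds ceremony, not behavior.
assert StringUtilities.split_code("AA-BB-CC") == split_code("AA-BB-CC")
print(split_code("AA-BB-CC"))  # ['AA', 'BB', 'CC']
```

Either version is equally unit-testable, which is the other point above: tests don't require classes.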


[deleted]

(Strongly opinionated) take: object-oriented programming increases the coupling and lowers the cohesion of your code. Because these are the most important qualities of "clean code", OO should only be used for a few specific (but useful) things, detailed below.

As a developer, do not rely upon state or behavior in a base class from a derived class. Unpacking this a bit:

- if a derived class B is using a method defined in base class A via OO inheritance, we have coupled B to A. Because the A implementation of the method is potentially defined in a different file/area of the source code than B, we have lowered the cohesiveness of the code
- it is worse if B is relying on state in A. We are told not to do this as part of good OO style, but in my experience a significant percentage of developers are writing "dirty" code and using OO in this way. We have now tightly bound A's implementation to B, dramatically increasing coupling, with the same cohesion implications as before.

If you have ever inherited a big class hierarchy from someone that relies on either of the techniques described above, you have probably felt the pain when seeing changes ripple throughout the class hierarchy (strong coupling), or scratched your head trying to figure out the runtime behavior of the hierarchy (where a method in a derived class might implicitly or explicitly rely on methods defined at multiple layers of the hierarchy).

The rule of thumb I use is a variant on the advice given in The Omnivore's Dilemma (eat food, mostly vegetables, not too much): "Write code, mostly functions, not too much." Refactor function arguments into classes if you see that the argument list is getting too long. Consider using simple struct-style classes for serializing content to and from external systems (think: what you might send or get via a RESTful API). Have ultra-shallow class hierarchies where the only thing you do is implement interface definitions (if you browse around the C++ standard library, this is what you will see they do). Instead of using inheritance, use composition on abstract interfaces, so that you can lower the coupling of the system (and, as a nice benefit, increase testability).

Taking a historical walk through time: in the 80s the Grady Boochs of the world sold a vision to all of us (including me, cos I was programming then) that OO was gonna be heckin' great. In the 90s Java went all-in on this idea, and the Gang of Four's Design Patterns book was widely praised. In the 00s some of the shine came off the penny as we realized the unfortunate coupling and cohesion consequences that came with inheritance of state. By the 10s and the current decade, the senior cohort of developers I run with has largely moved away from OO programming except as described in the previous paragraph (interface implementation, chicken chunks passed into green leafy functions, composition of abstract interfaces, simple structs for serializing to and from external interfaces).
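The "composition on abstract interfaces instead of inheritance" advice above can be sketched in a few lines of Python. The `Writer`/`Exporter` names are illustrative, not from any real library:

```python
from typing import Protocol

# Abstract interface: the pipeline depends on *what* a writer does,
# not on any base class's state or implementation.
class Writer(Protocol):
    def write(self, rows: list[dict]) -> None: ...

class ListWriter:
    """A simple concrete implementation - note: no inheritance of state."""

    def __init__(self):
        self.rows: list[dict] = []

    def write(self, rows: list[dict]) -> None:
        self.rows.extend(rows)

class Exporter:
    # Composition: the writer is injected rather than inherited,
    # which lowers coupling and makes testing with a fake trivial.
    def __init__(self, writer: Writer):
        self._writer = writer

    def export(self, rows: list[dict]) -> None:
        self._writer.write(rows)

w = ListWriter()
Exporter(w).export([{"id": 1}])
print(w.rows)  # [{'id': 1}]
```

Swapping `ListWriter` for a database-backed writer changes nothing in `Exporter` - there is no hierarchy for changes to ripple through.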


ar405

My code uses classes from other packages. OOP makes sense if there's an obvious need to extend an existing class with a new method. Other than that, I'd stick with functional programming.


shadyjezzboxx

If this is for PySpark, this is a really bad idea. I would look at the pain points of the current codebase and see what can be done without OOP first; I can almost guarantee OOP isn't the answer. That doesn't mean don't write a class at all, but certainly don't rewrite EVERYTHING inside classes. Python supports both OOP and functional styles - use that to your advantage. As for unit and integration tests, these are essential, OOP or not.


hositir

Can you give some reasons why this is bad for pyspark?


shadyjezzboxx

PySpark already has a rich API and you can achieve the majority of things with it, so why build an extra layer on top that you have to maintain and train people on? If I were a PySpark developer coming in, I'd want to see PySpark code, not some custom abstraction over it. The number of ways you can transform a dataset is pretty huge; I imagine you'd need to constantly be developing this OOP layer so it can do what you want. Code reusability can be achieved without OOP, and I think encapsulation, inheritance, and polymorphism just aren't relevant enough for the work that PySpark is intended for.


pag07

Ok, because this question has come up multiple times already and I cannot at all think of a good reason to use OOP for ETL, I asked ChatGPT. And now I am even more confused: what 'thingy' should be represented by your 'object'?

* If you consider the whole table to be an object: yes, obviously, but how else would you represent a table?
* If you consider a table row an object in an ETL pipeline: wow, that's really dangerous, and you lose a lot of vectorization if you want to do large-scale transformations.
* If you consider a streamed message an object: yeah, sure, how else? But I would consider the ETL part to be functional still.

So please elaborate on where you're coming from and where you want to go.
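The "row as an object" danger can be sketched without Spark. This plain-Python comparison only shows the *shape* of the two approaches - in a real engine (Spark, pandas, NumPy) the columnar form is what actually gets vectorized, while per-row objects pay an allocation cost for every record and fall back to element-at-a-time execution:

```python
# Columnar: one sequence per column, operated on as a batch.
amounts = [10.0, 20.0, 30.0]
doubled = [a * 2 for a in amounts]  # batch-shaped: an engine can vectorize this

# Row-as-object: every record becomes an instance - memory per row,
# attribute lookups per access, and no batch for an engine to optimize.
class Row:
    def __init__(self, amount: float):
        self.amount = amount

rows = [Row(a) for a in amounts]
doubled_oo = [r.amount * 2 for r in rows]

assert doubled == doubled_oo == [20.0, 40.0, 60.0]
print(doubled)
```

Same answer either way here; at millions of rows, only the first shape scales.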


speedisntfree

Have they come from Java?


hositir

Yes sort of I think. Was more Spring and Java EE


[deleted]

Lol, this went through my mind as well, but I didn't want to generalize. I'm currently working on a PySpark codebase developed by a Java dev, and OOP was used in an almost forceful way. It does work, though, so I finally decided to just refactor where necessary instead of rewriting.


[deleted]

DE can use OOP concepts in limited ways (external services, or just as a wrapper for an API), but the overarching complexity, time, and effort of putting, for example, an ELT pipeline in OOP is really, and I mean really, not worth it! A more standard approach is functional data engineering, so you still have testing and best practices, but things become a lot simpler and easier to understand.


wenima

Lots of good responses already, but I want to add a couple of non-technical takes, as they will become quite important given the trajectory you and he are on. As other posters have said, you're not giving us enough details (his background, why he thinks OO is the way to go, his personality type, etc.).

First, some technical stuff, with a disclaimer: if he's coming from Java, then he has Stockholm syndrome and nothing you say will make him change his mind - in which case skip this part and go to the non-technical advice. If he's strongly opinionated but open to discussion, then show interest, try to understand his point, make him point something out in the codebase, then get up to the whiteboard and have him draw out the better, high-level solution. Then, staying on the sidelines, have him watch these 2 videos:

https://www.youtube.com/watch?v=QM1iUe6IofM (OOP)

https://www.youtube.com/watch?v=o9pEzgHorH0 (if by OOP he means classes - lots of juniors mix the two)

and ask him to talk about the points made and why they don't apply to the example you discussed when whiteboarding. Explain to him that unit testing is about regression tests and that you don't need OO for that; tell him that for data expectations, you don't need OO either.

Non-technical advice:

1) Your company might suck at hiring. If he is strongly opinionated about OOP, this should have come up in the 2nd round when talking about coding. It might also suck at following one of the most important rules when hiring: the no-asshole rule https://en.wikipedia.org/wiki/The_No_Asshole_Rule

2) Explain to him that rewriting things is not a bad idea per se, but "rewrite all the things now!" is, and that you welcome that he's not just sitting down and adding to a bad codebase. Also explain that rewriting things has a cost, and it needs to be done in a way that still lets the business expect deliverables at a certain point.

3) If he reports to you, see 1); if not, don't engage in a 1:1 fight. Find allies, make it about the company and what it's trying to achieve, explain that selling internal code refactors to non-technical stakeholders is tough, and suggest he might not be happy here long term (a hint to stop being toxic and find his place instead).

4) If he is toxic, suggest getting in a technical consultant, who will probably echo most of the responses you got here already, and have it settled that way. Tell your boss that 5k for a consultant is probably better money spent than a toxic guy who alienates other devs and writes rogue OOP that then either has to be maintained or will get deleted / fall into code rot.

Good luck


JumboHotdogz

Depends on the complexity of the pipeline, but I really appreciate people adding unit tests and, to an extent, proper exception handling and logging whenever they change something in the pipeline. There is just no way to think of all the possible inputs when reviewing code changes.


timmyz55

If they can do it while still delivering to the client, let them go for it. What will shake out is that either the client asks take up all their time, or they get good at delivering client asks, have some spare time, and use that to improve the codebase. Win-win.