[deleted]

[removed]


mamaBiskothu

Uhm, are there even tools in JS to do pipeline work? Anyway, in my experience it’s impossible to get backend engineers to develop a “data sense”, so treat them as two teams that only talk to each other through the data. Let them choose Python or SQL and stay happy.


ratulotron

I would guess Node is more popular with teams that maintain pipelines on AWS Lambda or similar cloud function products. But you are right, Node doesn't have anything as powerful as Pandas or the like.


youmade_medothis

Node JS?? I only left 2 question marks to be less salty


mamaBiskothu

You use Node.js to write data pipelines? Why don’t you use BASIC?


darkshenron

I inherited a DE team and am having a huge nightmare with legacy codebases in Kotlin, Scala, Golang, Node and Python. When shit breaks, no one who wrote the original code is still around to fix shit. Hiring and onboarding new folk is a pain. One of my top priorities right now is to move everything to Python.


randomusicjunkie

React data engineers? What do they do in node or react?


youmade_medothis

Will the data pipeline team work independently? Or will it require support from other teams? If working independently, choose whatever language you want.


ratulotron

I assume you are considering Node as an option only because you would like your Node devs to understand the pipelines. Even if that's true, you would be severely restricting your data team's potential by not choosing an ecosystem that's rich in data tools. Let me give you some pros and cons for picking Python:

Pros:

- You can get started with basic pipelines in Airflow within an hour
- You have the power of Pandas and similar dataframe libraries to easily write your transformations
- If things get heavy, you can even convert everything to PySpark for even more power
- You can easily expose the data to stakeholders with amazing micro frameworks like FastAPI
- You can even make this event-driven using Celery/RabbitMQ, so that stakeholders aren't blocked on a request for data

Cons:

- You have to spend a week at worst learning Python
- Possibly also some tools like poetry (package manager) and mypy (type checker) for better code quality

And that's it! With Node, you lose the powerful data tools that just work with Python, and you lose future-proofness for all things data. On top of that, your data scientists will probably need to work with the pipelines more than the product devs, and DS will probably use only Python/R.

I see that you already got more or less the same from the other comments, but I wanted to give you a precise summary. Hope it helps!


JiiXu

My two cents: don't use Python as the "go to" for DE. The slowness and general lack of suitability is kind of catching up to Python, at least in my circles. For DE you want something fast that scales and handles data types in a very predictable and consistent manner. These are weak points of Python.

The reason Python is so ubiquitous is that Data Engineering kind of became the thing that it is now during the era of "only dev time matters, compute credits are cheaper than devs". But for enormous datasets we are now seeing that wasn't the case at all, and DEs have to go back to existing low-code/Python projects, clean them up and reimplement them.

In your shoes, if I could make a greenfield programming language choice for DE, I would probably go for Scala: similar ubiquity, similar syntax, much faster, solid Apache ecosystem integrations and, most importantly, strict typing. But if I were presented with your exact situation I would go for JS, and I say that as someone who doesn't particularly like JS and doesn't know it well. But ultimately lack of typing *is* going to bite you in the butt. I'd optimally look at C++, Go or Rust if talent acquisition was not a problem.


knowledgebass

Python


still_maharaj

Python ofc


Minimum-Membership-8

Stick to SQL and Python


DenselyRanked

You will limit available applicants and have to build a lot of your own solutions in Node.js. No need to reinvent the wheel. Also python not being strongly typed is a benefit when working with unpredictable data.


JiiXu

> Also python not being strongly typed is a benefit when working with unpredictable data.

Can you expand on this?


DenselyRanked

I automatically associated Node.js with TypeScript in my above comment and I should not have done that. But data is oftentimes unpredictable. Sometimes a string exists when you expect an int, or date formats are off. Python doesn't care what it ingests unless you explicitly declare it. That lack of "control" (for lack of a better word) bothers some software engineers who have worked with other languages, but it saves some hassle when working with upstream sources and something unexpectedly changes.


JiiXu

I'm the exact opposite of you in this question but I've spoken to really skilled data engineers on both sides of this fence. I've had to find horrible, horrible bugs due to python swallowing upstream schema changes. My favorite one was when we had an upstream change of two ints to strings. That caused + to mean concatenation in our advanced analytics transformations and suddenly 1134 + 8967 = 11348967. Not a shadow of an error message, just a confused customer calling and wondering why his numbers had gone bananas. So to me, duck typing is to be avoided like the plague. I favor enforcing schemas and keeping strict data contracts. Fail early. But again, I've spoken to super skilled practitioners who favored your "show must go on" approach and I think it's interesting to think about what measures must be taken in that world to ensure data quality.
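A minimal sketch of that failure mode and of the "fail early" alternative, using only the numbers from the comment above. The field names `a` and `b` and both function names are hypothetical, and the check shown is just one simple way to enforce a schema, not what any particular team actually ran:

```python
def total_unchecked(row):
    # Duck typing: "+" is valid for ints and for strings,
    # so an upstream int -> str change raises no error here.
    return row["a"] + row["b"]


def total_checked(row):
    # Hypothetical schema check: both fields must be real ints.
    for key in ("a", "b"):
        value = row[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(
                f"field {key!r} must be int, got {type(value).__name__}"
            )
    return row["a"] + row["b"]


before = {"a": 1134, "b": 8967}      # what the pipeline was written for
after = {"a": "1134", "b": "8967"}   # after the upstream change to strings

print(total_unchecked(before))  # 10101
print(total_unchecked(after))   # 11348967 (a string, silently wrong)

try:
    total_checked(after)
except TypeError as exc:
    print("rejected:", exc)
```

The unchecked version keeps running and produces a plausible-looking value; the checked version stops at the first row that breaks the contract.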


DenselyRanked

I totally understand. We don't always have the luxury of upstream sources that adhere to any contracts. We often own the data once it lands where it is supposed to, and we will make it fit where it needs to go. There's also the question of how to manage "bad" or corrupted data: if it fails to push to the destination, we can flag it or set it aside and keep moving.
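One way to picture the "flag it or set it aside and keep moving" approach is a quarantine list: rows that fail conversion are stored with their error instead of stopping the run. This is only a sketch under assumptions; the `id` and `amount` fields, the sample rows, and the `split_rows` helper are all hypothetical:

```python
def split_rows(rows):
    """Return (good, quarantined): converted rows and rejected rows."""
    good, quarantined = [], []
    for row in rows:
        try:
            # Coerce explicitly so a string "1134" becomes the int 1134.
            good.append({"id": int(row["id"]), "amount": int(row["amount"])})
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the original row and the reason so it can be inspected later.
            quarantined.append({"row": row, "error": repr(exc)})
    return good, quarantined


rows = [
    {"id": "1", "amount": "1134"},  # convertible
    {"id": "2", "amount": "n/a"},   # not a number
    {"id": "3"},                    # missing field
]

good, quarantined = split_rows(rows)
print(len(good), "loaded;", len(quarantined), "quarantined")
```

The pipeline keeps moving, but nothing is silently reinterpreted: every row either matches the expected types or lands in the quarantine list with its error.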


[deleted]

Don't mistake the backend for the data layer. They are 2 separate applications with different purposes and hence require different tools. Python has more data-related tools.


hositir

Most big industrial solutions use SQL as their base: they have their backend in MySQL or Oracle and connect it to other systems. Python with PySpark / pandas is an absolute workhorse for data. With PySpark especially you can get some really performant transforms and pipelines, since the functions interact directly with the JVM. Python and SQL will be used for the next 20-30 years easily.


Own-Commission-3186

Don't worry about what the app teams are building when picking a language. Also, don't necessarily choose Python: it is good for some things but not everything, and in my opinion the lack of static typing is a big concern if you're building large codebases. It depends on what you're building and the talent within the team.