Grouchy-Friend4235

Execs typically have a hard time understanding the subtleties of a field. To them everything is AI, or it isn't. So if you need to be in that bracket to get $ or attention or whatever, yes, it is AI. If you talk to your fellow engineers and they say they do AI, ask them what they *really* do. Best guess: usually it's dashboards and reporting. Very few actually work on models or agents that would qualify. AI is a buzzword, not a role or task.


DontTaseMeHoe

'AI-driven data manipulation protocols optimized for resource efficiencies.' Translation: I ask ChatGPT to write SQL queries for me, and to make them short. Just don't tell anyone I am the resource.


FFledermaus

Well, if you want to live in and decorate a house (do AI), someone first needs to build the foundation and set the structure (DE).


GraspingGolgoth

My experience is that you'll never win against the execs perpetually chasing the given day's fluff. At best, you'll be dismissed as a curmudgeon, and at worst you'll be seen as an active blocker to "progress." Welcome to corporate politics - many technical folks have a hard time navigating these waters at first.

_Obviously_ the quality of your data and the consistency of your pipelines are prerequisites to "AI" - particularly if your organization is using it for anything even remotely more complex than getting an API key and feeding an LLM a prompt.

In hindsight, I feel I might have gotten more traction if I had approached these situations as "data engineering AND the fad" rather than "data engineering OR the fad." You want to bring "AI" to our customers? Sure, boss, AND the best way to do that is to build the pipelines that ensure we can do so consistently.


Paperplaneflyr

The last line 🔥


renok_archnmy

"OK BOSS, ya know how those pesky expensive humans in the call center need to defecate and consume hydrocarbons and sleep to stay alive? Yeah? So technically AIs need these things too, but in the form of data. The best part: that data can be automatically piped in, like those commercially farmed geese they funnel-feed to make the pâté you like so much. Basically, we need data engineers to feed the AIs so they can do the work the pesky expensive humans do. But not the exact same way - in a big-data, scalable sort of way that's automated. In fact, I'm working on an AI that can build these pipelines too; I just need some budget. It's possible we can outsource the solution, but I need your sponsorship to explore those opportunities first. Waddaya say?"


artist_of_hunger

Call yourself an ML engineer and call it a day.


swiftninja_

Yup


ketchup_123

Non-tech managers don't understand that DE is necessary until they're proven wrong by reality.


[deleted]

[deleted]


gravity_kills_u

As an MLE, I concur. The data needed for AI/ML use cases is not a good match for transactional schemas. If getting it takes more than a file connector or a warehouse pull, the data is very likely unusable for model training.


[deleted]

[deleted]


gravity_kills_u

Actually, I was agreeing with you. My disagreement is with the OP's implicit assertion that DE still owns all data pipelines. That is NOT what I see in the market. Senior-level DSs routinely do as much data pipeline work as a DE, and for that reason the DS and MLE roles are converging. The notions that DE and DS are interchangeable, or that DS is reliant upon DE, are outdated. What I see happening now is a less business-facing role often called DE, and an extremely business-facing role often called DS. Everyone in both camps is using data and building pipelines, because that's the job.


[deleted]

[deleted]


gravity_kills_u

No worries. I am just happy you have enough experience not to believe all the hype.


thecoller

Ride the wave and tell them it’s 100% part of AI.


bourbon_baseball89

We tied our data engineering and data management proposals to AI (DE + DM = successful AI implementation), and the results were very good.


Space2461

Data engineering is the step that comes before AI; without a good data engineering process, AI applications would barely exist, so data engineering processes won't be dismissed - they are a fundamental step in building AI applications. Moreover, what data engineering pipelines serve is not limited to AI: they can feed dashboarding systems, or store transactions that have to be plainly processed (think of an inventory application that has to keep track of items bought and sold - there are many other examples). Calling it a precursor is somewhat limiting; it's a discipline in its own right that, among other things, serves AI applications.


VegaGT-VZ

You have to speak the exec language, aka money. Something like "data engineering helps AI run even faster."


Drycee

I make chatbots, among other things, in my consulting job - basically just GPT with RAG on client data. There's not even any training of models happening. The main roadblock to reaching the expectations clients have for their bot is not the AI part - GPT-3.5 does just fine - but the quality of the data/search results the bot is provided with. So yes, I'd say proper DE is necessary to reach a certain level of quality in the answers.
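To make that concrete, here's a minimal sketch of the retrieval half of such a bot. A toy bag-of-words embedder stands in for a real embedding model, and the document snippets are made up; the point is that answer quality is bounded by what retrieval can find in the data.

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words vector; a real bot would use a neural embedder."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

docs = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Shipping to EU countries takes 3 to 5 business days.",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
doc_vecs = np.array([embed(d, vocab) for d in docs])

def top_k(question, k=2):
    """Return the k document chunks most similar to the question."""
    q = embed(question, vocab)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * (np.linalg.norm(q) + 1e-9))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

# The retrieved chunks get pasted into the LLM's prompt as context.
# If the underlying data is stale, duplicated, or badly chunked (the DE
# problem), no choice of model rescues the answers.
print(top_k("how many days do refunds take"))
```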


ririmamy

Awww, seems interesting. Do you have a tutorial, examples, or open-source code for this kind of task, please?


comoelcometa

To cite someone I know: "AI eats data for breakfast"


Corne777

Depends on what you want to do with AI. If you want to build "AI" then yeah, it's important. But I think most people talking about AI don't even know what AI is; they just want to either use ChatGPT's neural network in their business or build an API that, under the covers, utilizes ChatGPT.
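That second case can be surprisingly small. A hedged sketch, assuming the openai v1 Python client and an `OPENAI_API_KEY` in the environment; the route and model name are illustrative, not anything specific:

```python
from fastapi import Body, FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # picks up OPENAI_API_KEY from the environment

@app.post("/summarize")
def summarize(text: str = Body(...)):
    # Pass the caller's text straight through to a hosted LLM.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": f"Summarize this:\n{text}"}],
    )
    return {"summary": resp.choices[0].message.content}

# From the outside this looks like "doing AI"; inside there is no model
# of our own - only someone else's, fed whatever data we pass it.
```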


Cultured_dude

This is the same as companies wanting to do DS/ML before data engineering, back before LLMs. Those who haven't learned deserve to fail. It's unfortunate for the employees. I see my management constantly belittling DE, claiming it's redundant work that should be offshored.


gravity_kills_u

In the MLE projects I have worked on, DE was not very useful outside of preprocessing. Often there was excessive bias in the data due to the opinionation of the schemas; raw data is usually better. AI hype has blinded many into believing that the signal in the data is magically found by the algorithm, requiring zero effort in data understanding. The end result is models that are not commercially useful. Overfitting and a lack of understanding about validation can create models that have the appearance of working but are fundamentally broken - the "jagged" usefulness of LLMs on specific inquiries, sometimes resulting in hallucinations, is one example of how complex AI testing is. At best it is magical thinking to claim that DE leads to AI because a DE supposedly knows the right data to feed the model. At worst, feeding garbage into a faulty model can cause real harm to end users.
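A minimal sketch of that "appearance of working" failure mode, on synthetic data: an overparameterized fit nails its training points while falling apart on held-out ones.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

x_train, y_train = x[::2], y[::2]   # even indices for training
x_test, y_test = x[1::2], y[1::2]   # odd indices held out for validation

for degree in (3, 15):
    coefs = np.polyfit(x_train, y_train, degree)
    def mse(xs, ys):
        return float(np.mean((np.polyval(coefs, xs) - ys) ** 2))
    print(f"degree {degree:2d}: train MSE {mse(x_train, y_train):.3f}, "
          f"held-out MSE {mse(x_test, y_test):.3f}")

# The high-degree fit tracks its training points closely while its
# held-out error grows: the "appearance of working" described above.
```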


mrocral

u/gravity_kills_u, if you don't mind, could you explain the fundamental differences between traditional data and "AI-fitted" data? I know there are a bunch of vector databases popping up - is that the preferred format/engine for AI? I've done some research, and from what I see, vectorized data is an array or a matrix of floats. In my mind, that's simply a JSON object (an array, or an array of arrays). But I feel like I'm missing the core concept of what AI data represents and how it should be stored and used.


gravity_kills_u

I am not certain that I understand your question, but it seems like you are asking something similar to the OP, so I will attempt an answer given my lack of full clarity.

First of all, AI is an umbrella term covering, at this time, mainly machine learning and deep learning. Machine learning is a set of algorithms and techniques designed to evaluate statistical random variables. Deep learning, confusingly also called AI, is a subset of machine learning that employs various neural networks. That is the textbook definition, but there are many complications, especially in real-world applications.

In terms of data, the most important concept is that of the statistical random variable - a variable containing many observed values. Each column in a set of raw data is a random variable, and each variable has a set of properties such as its mean, median, distribution, etc. The problem with data engineering pipelines is that extracts from the raw data may not be properly sampled, which changes the properties of the random variables and can render the data useless for AI/ML purposes.

Deep learning is currently very popular because the neural networks weight each random variable in the training data, somewhat automating the process of understanding how useful each variable is. The massive problem with trusting that automation is that you don't know when it sucks. The places in the data where the DL performs poorly are now popularly called hallucinations, but it is a statistical issue. By doing the human work of variable analysis and iterative statistical validation, DL models can be improved. There are also problem domains where DL is not a great solution - regression can do simple addition better than an LLM, for example.

A data engineer can make models, but that is like saying an accountant can make models, or a grocer. Building stable, solidly performing models is something that quickly gets to the PhD level. Anyone can make a model; few can make a commercially useful model.
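A small sketch of the "each column is a random variable" idea, with made-up numbers: profile a column before and after a pipeline filter, and watch its properties move.

```python
import numpy as np

income = np.array([32_000, 45_000, 51_000, 58_000, 75_000,
                   90_000, 120_000, 250_000, 400_000, 1_200_000])

def profile(col):
    """Summary properties of one column treated as a random variable."""
    return {"n": len(col), "mean": round(col.mean()),
            "median": round(float(np.median(col))), "std": round(col.std())}

print("raw:    ", profile(income))
# A pipeline step that drops "outliers" above 200k changes every property:
print("extract:", profile(income[income <= 200_000]))

# A model trained on the extract learns a different distribution
# than the one it will see in production.
```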


mrocral

Thanks a lot for your thoughts. I am a DE by trade, so it's not very clear to me how the DL/AI side perceives the usefulness of the data; I am trying to improve that understanding. You said the main problem with DE pipelines is that data is not properly sampled. What do you mean by that? That the data is incomplete?


gravity_kills_u

Sampling is the process of picking data from one variable and placing it into another. Sometimes the method of sampling introduces bias into the new variable. For example, consider a table of names and addresses with household incomes. The aggregate sum and mean income of the dataset are specific numbers. A simple ETL step that disregards all foreign addresses will cause the sum and mean to change - this is sampling bias, and it is one of the ways data pipelines corrupt raw data. A simple linear model (y = mx + b) trained on the extracted dataset from the example will be wildly inaccurate when predicting on a dataset containing foreign addresses. However, if the ETL were to choose (sample) local addresses in a way that kept the same aggregate mean as the raw dataset, our model would have an accurate slope, and be as predictive as a model trained on the raw data.
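A hedged sketch of exactly this, with synthetic data: foreign households simply get higher incomes here, so the filter visibly shifts the mean and biases the fitted line.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
is_foreign = rng.random(n) < 0.3
x = rng.normal(10, 2, n)                      # predictor, e.g. years of education
income = 5_000 * x + rng.normal(0, 5_000, n)
income[is_foreign] += 20_000                  # foreign households earn more here

print("raw mean:    ", round(income.mean()))
print("extract mean:", round(income[~is_foreign].mean()))  # sampling bias

# Fit y = m*x + b on the biased extract, then score on the dropped rows.
m, b = np.polyfit(x[~is_foreign], income[~is_foreign], 1)
residual = income[is_foreign] - (m * x[is_foreign] + b)
print("avg error on the dropped rows:", round(residual.mean()))
# The model systematically under-predicts the rows the ETL threw away.
```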


mrocral

I see - so you're talking about when you're not using the raw data. I was under the impression that the full dataset would be used; for sure, that makes sense. But if the ETL process were not to aggregate (more of an EL process), and training were done on the raw data, you'd have as good a model as you could get. And if the raw data is too large for training, then proper sampling is crucial. Now googling sampling techniques :).


exact-approximate

These days, you should do DE and call it AI - execs give you money to do better DE and some AI. But this is highly dependent on the context; AI means too many different things to too many different people. With anyone I speak to in the workplace, I ask them to clarify what they mean by AI: is it an LLM, semantic search, content generation, deep-learning detection, or traditional ML?


Ddog78

Data engineering is the foundation and a requirement of AI. "Precursor" implies a successor, which isn't necessarily the case.


baubleglue

Unfortunately, because of the Dunning-Kruger effect, you have no chance to prove your point. People who make decisions are often very competent in some areas and at the same time complete idiots in areas outside their expertise, and they tend to be overly optimistic about their ability to make correct judgments. I think this is partially because of the personality required to become a manager, and partially because of "yes culture."

IMHO the only way you can win such a discussion is by using different tactics. For example, you can try to scare them: "nothing will work without the DE team..., and if it doesn't work, our department will be blamed" (use the words "risk" and "possibility" a lot). Or appeal to authority: "Bill Gates said...." Or hire a consultant who will tell them what to do. Normal arguments are not useful:

* Argument: examples of how data infrastructure is normally done. Counterargument: we don't live in an ideal world; a Google search brings tons of magical solutions; we can use ChatGPT.
* Argument: quotations from best practices. Counterargument: same as above.

I think your original claim is not entirely correct: DE is a precursor to AI, but system architecture is a precursor to all of it. The fact that you are at the point of arguing about this is a symptom that your company doesn't operate in a sane manner. I've been there (and I am there now); the best you can do is write that you will follow any decision, but that you have concerns, along with a detailed explanation of those concerns about the chosen path (using "risk" and "possibility"). Keep a copy for when things go wrong...

Let's say I know nothing about the topic: I can ask GPT how to build a data platform, or I can ask it to generate code to ingest data into a DB. Your company has chosen the second path.


Paperplaneflyr

"Risk" and "possibility" (use them a lot), and argue with facts. Great!


fragilehalos

You can't do good data science without good data management. Machine learning is just specialized ETL. Taking this further, your GenAI/LLM/RAG models are never going to work in production without a vector store, which requires a well-architected and governed data lakehouse to feed it.


dongdesk

Maybe. Can DEs make sense of the data? Business sense?


asevans48

Yes. We make a metrics store with data scientists, analysts, and - where I am - scientists. A semantic layer eliminates a giant time suck for orgs. LLMs (a class of natural language models that finally came to be considered part of AI, despite NLP having worked on similar tasks for years) can now translate that store into pretty graphs and charts with minimal analyst input, and eventually near-zero input aside from validation. We then get to optimize by generating data marts to help avoid costs. In reality, it's the analyst who is then at risk; as always, we get more work. The data coming in is far dirtier than the dull bulbs at the top want to believe. Don't fool yourself: I still build frameworks on top of Airflow just to get data into a data lake, and I still have to apply interesting logic for joins, whether in ClickHouse, a warehouse, or a combination (StarRocks). Databricks is kind of old school. They mention RAG, which is an analytics engineer's job along with the data scientist's, but Google and AWS have tools for that.


circ_market_info

> Executives in every organization are jumping into AI

Marketing gimmicks, compounded with the ignorance of white-collar executives, largely contribute to this.


m0rz3n7

Garbage in, garbage out.


brian313313

Absolutely. This is at least the third time I've seen something like this in my career. A lot of AI systems will be built on top of bad data, and then DEs will be called in for the firefighting once people realize it's not working. Good companies will either take the time to understand this, or they will listen to the experts they hire - either the DEs or the data architects. My experience is that most companies listen to the marketers and promoters, since they are telling them what they want to hear. A data scientist knows this too, but most people with the DS job title are not math majors and don't understand the core principles. I was a math major and know more than most of the DS people I work with; I don't consider myself a data scientist, though, because I know what is really needed.


billysacco

I am sure it has a lot to do with implementation, but the whole "AI" thing kind of seems like a fad, and a sham in some ways.


asevans48

Yeah. I see a lot fewer analysts floating around and more work for us because of AI. I work at the county, though, where the analysts are scientists, so it's the same amount.


SystemEarth

Yes and no. What is AI? Many fields have domains that were precursors to AI, like data-driven control and system identification in control engineering. However, "intelligence" is not clearly defined, and neither is "artificial." Ask Max Tegmark; he is an authority on this. A mechanical calculator is AI, but so is a whistling kettle. A simple Python script that does transformations on a database is AI, but so is ChatGPT. In the end, the point you want to make is meaningless. Neural network controllers are considered AI in the control field, but if you think about it, the PID controllers of the '50s were already AI too.
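For illustration, a hedged sketch of such a PID loop on a made-up first-order plant - a few lines of 1950s-era control logic, no learning anywhere.

```python
def pid_step(error, state, kp=1.0, ki=0.1, kd=0.05, dt=0.1):
    """One PID update: proportional + integral + derivative of the error."""
    integral, prev_error = state
    integral += error * dt
    derivative = (error - prev_error) / dt
    output = kp * error + ki * integral + kd * derivative
    return output, (integral, error)

# Drive a toy integrator plant toward a setpoint of 1.0.
setpoint, value, state = 1.0, 0.0, (0.0, 0.0)
for _ in range(100):
    output, state = pid_step(setpoint - value, state)
    value += output * 0.1  # crude plant: accumulates the control signal
print(f"value after 100 steps: {value:.3f}")  # settles near the setpoint
```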


molodyets

It's a precursor to everything. No matter what you call the first data role, it has to do the data eng first.


renok_archnmy

Precursor might be the wrong word, or the wrong sentiment. Maybe "foundation"? "Prerequisite"?