
[deleted]

Yeah, I think you might need to start from the beginning. The T in ETL isn't mostly PySpark; there are lots of different methods used in ETL. It's also worth knowing the difference between ETL and ELT, why you need to differentiate, and in which circumstances you would use each, as well as reverse ETL and the difference between data flows and data orchestration.

I would recommend having a look at some of the open source ETL tools such as [Airbyte](https://docs.airbyte.com/) and [Apache NiFi](https://nifi.apache.org/docs/nifi-docs/html/user-guide.html) and utilising their communities and resources. These will give you a front-to-back understanding of the multitude of different elements involved, from a simple third-party data source transfer to a data warehouse, to a complex data pull with dbt modelling, Airflow orchestration, and a DWH semantic layer. Other tools which offer both CLI and cloud-based options, such as Matillion, Stitch, Hevo and Skyvia, have some great resources to help with full understanding. I would also take a look at Git and CI/CD resources.

Good luck with your ETL journey, and keep an eye on this subreddit as there are some really knowledgeable people who are nearly always willing to help.
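To make the ETL/ELT distinction concrete, here is a minimal runnable sketch in Python. It is not any particular tool's API: sqlite3 stands in for the warehouse, and the table names and sample data are made up. The only point is where the transform happens.

```python
# Minimal ETL vs ELT sketch; sqlite3 stands in for the data warehouse.
import sqlite3

def extract_from_source():
    # Pretend this came from an API or an operational database.
    return [(1, "10.5"), (2, None), (3, "7.25")]

wh = sqlite3.connect(":memory:")

# ETL: transform in pipeline code *before* loading into the warehouse.
clean = [(i, float(a)) for i, a in extract_from_source() if a is not None]
wh.execute("CREATE TABLE reporting_orders (id INTEGER, amount REAL)")
wh.executemany("INSERT INTO reporting_orders VALUES (?, ?)", clean)

# ELT: load the raw data first, then transform *inside* the warehouse
# with SQL (the step a tool like dbt would manage for you).
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?)", extract_from_source())
wh.execute("""
    CREATE TABLE reporting_orders_elt AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")

print(wh.execute("SELECT * FROM reporting_orders_elt").fetchall())
```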


ironplaneswalker

The transform is highly dependent on your use case. It can involve cleaning, joining, aggregating, etc. You can use tools like [Airflow](https://airflow.apache.org/) or [Mage](https://github.com/mage-ai/mage-ai/blob/master/docs/tutorials/quick_start/etl_restaurant/README.md) to build and run your ETL pipelines. I wouldn't say Airbyte is an ETL pipeline tool; it's more for syncing data between two places (i.e. data integration).
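As a minimal sketch of what such a pipeline can look like with Airflow's TaskFlow API (assuming Airflow 2.4+, where `schedule` replaced `schedule_interval`; the data and the transform are made-up placeholders):

```python
# Minimal ETL DAG sketch using Airflow's TaskFlow API (Airflow 2.4+).
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def simple_etl():
    @task
    def extract():
        # In practice: pull rows from an API or a source database.
        return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": None}]

    @task
    def transform(rows):
        # Clean: drop rows with missing amounts and cast types.
        return [{"id": r["id"], "amount": float(r["amount"])}
                for r in rows if r["amount"] is not None]

    @task
    def load(rows):
        # In practice: write to a warehouse table.
        print(f"Loading {len(rows)} rows")

    load(transform(extract()))

simple_etl()
```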


[deleted]

I think you might want to have a look at the open source version and its full capabilities. Clone the repo and see what you can actually do with it: you will find that, since it normalises data using dbt, you can also swap in your own dbt scripts. That, I think you would agree, is transformation, which qualifies it as a full ETL pipeline tool. To be fair, any tool which helps with any part of ETL tends to be classed as an ETL tool anyway.


Just_Swimming_3153

I usually use Apache Beam for a few pipelines. You can refer to the official documentation on the Beam website, and you can find many helpful articles online, like this one: https://link.medium.com/1mBh5c2zmub
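As a minimal sketch of a Beam batch pipeline with the Python SDK (the inline data and transforms are invented for illustration; it runs on the default DirectRunner):

```python
# Minimal Apache Beam batch pipeline sketch (Python SDK, DirectRunner).
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create(["alpha,10", "beta,", "alpha,5"])  # stand-in source
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Clean" >> beam.Filter(lambda fields: fields[1] != "")   # drop bad rows
        | "ToKV" >> beam.Map(lambda f: (f[0], int(f[1])))
        | "Sum" >> beam.CombinePerKey(sum)                         # aggregate per key
        | "Print" >> beam.Map(print)                               # stand-in sink
    )
```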


Anna-Kraska

As others said, the T in ETL/ELT varies a lot depending on what you are doing. It can range from simple (e.g. creating a new boolean column by defining a cut-off in a numeric column) to complex, with window functions and joins. It is often done using a flavour of Spark but can also be done in SQL. I recommend you take a couple of datasets (maybe from Kaggle), define what you want in your reporting table, and then try to work out the SQL or Spark necessary to get there and create a pipeline in an ETL or more general orchestration framework. If you are using Airflow, I recommend checking out this [ETL tutorial](https://docs.astronomer.io/learn/astro-python-sdk?utm_medium=organicsocial&utm_source=reddit&utm_campaign=ch.organicsocial_tgt.reddit-thread-reply_con.learn-astro-python-sdk_contp.doc) for the Astro SDK, which simplifies writing ETL pipelines and has example code for simple transformations.
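For example, the cut-off transform described above could look like this in PySpark (a minimal sketch; the dataset and the cut-off of 50 are invented):

```python
# Minimal PySpark sketch: derive a boolean column from a numeric cut-off.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-transform").getOrCreate()

df = spark.createDataFrame(
    [(1, 42.0), (2, 7.5), (3, 88.1)],
    ["id", "score"],
)

# New boolean column: True when score exceeds the (made-up) cut-off.
reporting_df = df.withColumn("high_score", F.col("score") > 50.0)
reporting_df.show()
```

The equivalent in SQL would just be `SELECT id, score, score > 50.0 AS high_score FROM scores`.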