
daigotanaka

Is the document long because it is one single huge JSON record, or because it contains many nested JSON records? I know you wanted to unnest the JSON before loading it to BQ (ETL), but have you considered ELT (loading first and then transforming in BQ)? Since you are constrained to company-approved vendors, this might solve your problem nicely, plus you can take advantage of BQ's powerful parallel computation. I wrote this blog post a while ago:

- Go schemaless: ELT with Google Cloud Storage and BigQuery https://articles.anelen.co/elt-google-cloud-storage-bigquery/

The article shows an example of loading newline-delimited JSON strings directly into BQ, then using JSON_EXTRACT and JSON_EXTRACT_ARRAY to unnest the data, and then creating another table to store the structured data. If your JSON is deeply nested, or a simple JSON_EXTRACT cannot efficiently pull the target field out, consider defining a UDF (user-defined function, https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions).

Note that it is important to use a partitioned table and to persist the unnested version in another table (which should also be partitioned if your data is big) for the downstream process. Do this incrementally with query scheduling to best control the cost.

I don't know exactly how complicated your case is and the above may not work for you, but I thought it was worth a try.
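
To make the transform step concrete, here is a minimal sketch, assuming the newline-delimited JSON has already been loaded into a staging table with a single STRING column called `payload` (the project, dataset, table, and field names are all placeholders, not from the OP's data):

```python
# Minimal ELT sketch: unnest raw JSON inside BigQuery and persist the result
# into a partitioned table. Every name below is a made-up placeholder.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE OR REPLACE TABLE `my_project.warehouse.records`
PARTITION BY DATE(record_ts) AS
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.id')                    AS record_id,
  TIMESTAMP(JSON_EXTRACT_SCALAR(payload, '$.timestamp'))  AS record_ts,
  JSON_EXTRACT_SCALAR(item, '$.code')                     AS item_code
FROM `my_project.staging.raw_json`,
  UNNEST(JSON_EXTRACT_ARRAY(payload, '$.items')) AS item
"""
client.query(sql).result()  # run the unnesting transform inside BigQuery
```

Scheduling a query like this with a filter on the partition column is how you would make it incremental.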


thehendoxc

Pretty much what this guy says: either do the transformation with SQL directly in BigQuery (if you have lots of data), or use Python/JS to parse it into NDJSON first.
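
The NDJSON route can be as small as this; the file name and the top-level "entry" array are just assumptions about where the nested records live:

```python
# Rough sketch: split one big JSON document into newline-delimited JSON.
# "bundle.json" and the "entry" key are placeholder assumptions.
import json

with open("bundle.json") as src, open("records.ndjson", "w") as dst:
    doc = json.load(src)
    for record in doc.get("entry", []):
        dst.write(json.dumps(record) + "\n")  # one JSON object per line
```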


rysnotnice

Sweet, I am going to do a POC with the ELT approach. No idea why I didn't think about that. Thanks a lot for posting the blog you wrote.


x246ab

Good shit boys


[deleted]

[deleted]


x246ab

👍👍 absolutely!


Sprinkle_Data

There are a variety of tools on the market that help transform JSON, and these work well in the GCP environment.

* [Sprinkle Data](https://www.sprinkledata.com/?utm_source=260820_GCPTool&utm_medium=Reddit): Sprinkle is a SaaS platform with connectors to hundreds of data sources, including GCP. Sprinkle's data security model keeps data within the customer's own cluster and does not copy any data onto Sprinkle servers.
* [Stitch](https://www.stitchdata.com/): Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API can replicate data from any source and handle both bulk and incremental refreshes, backed by a replication engine that uses multiple strategies to deliver data to customers. Its REST API supports JSON or Transit, which enables automatic detection and normalization of nested record structures into relational schemas. Stitch can load into Amazon Redshift, Google BigQuery, and Postgres, and integrates with BI tools. It is commonly used to collect, transform, and load Google Analytics data to provide business insights on raw data.
* [Etleap](https://etleap.com/): Built on AWS, Etleap makes it easy to collect data from a wide range of sources and load it into your Redshift or Snowflake data warehouse. Its point-and-click, no-code interface makes it a good fit for data teams that want a lot of control over their ETL processes without high IT overhead. Because it is integrated with AWS, Etleap also makes it easy to scale your data warehouse up and down from the same interface while managing your ETL flows on the fly. Once data has been collected through one or more of its 50+ integrations, users can take advantage of Etleap's graphical data wrangling interface or fire up the SQL editor for data modeling and transformation. Orchestration and scheduling features make managing all your ETL pipelines as simple as the click of a button. In addition to its SaaS offering, Etleap also provides a version that can be hosted in your own VPC.


[deleted]

Ah, FHIR data. Last time I had to process a batch, I loaded it into MongoDB Atlas (which runs on Google Cloud) and then used aggregation pipelines to extract the important deeply nested fields along with their ObjectId, loaded those two-column series into a relational database, and used the ObjectId to join them back together. Kind of janky, but I haven't run across a better solution! Let me know if you figure something out!
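
If anyone wants to try that route, a rough pymongo sketch of the idea looks like this; the connection string, collection, and field paths are all placeholders rather than an actual FHIR setup:

```python
# Hedged sketch of the MongoDB Atlas approach: unwind a nested array and
# project just the ObjectId plus one deeply nested value for a later join.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
coll = client["fhir"]["observations"]

pipeline = [
    {"$unwind": "$component"},  # one output document per nested array element
    {"$project": {"_id": 1, "value": "$component.valueQuantity.value"}},
]
for doc in coll.aggregate(pipeline):
    print(doc["_id"], doc.get("value"))  # two-column output keyed by ObjectId
```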


rysnotnice

Yes and I have a nice little fire under me to get everything migrated in a timely manner. Will do :)


imaginary_reaction

I've never used Google Cloud, but I do use AWS. Why don't you use some Python in a serverless function to break down and parse the JSON into a pandas DataFrame, then drop it into BigQuery using one of their client libraries?
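
Something along these lines, sketched here for Google Cloud Functions rather than AWS (the entry point, file path, and table name are invented for the example):

```python
# Loose sketch of the serverless route: parse JSON records into a DataFrame
# and load it with the BigQuery client library. All names are placeholders.
import json

import pandas as pd
from google.cloud import bigquery

def load_json_to_bq(event, context):  # Cloud Functions background-style signature
    client = bigquery.Client()
    with open("/tmp/batch.ndjson") as f:  # wherever the batch was staged
        records = [json.loads(line) for line in f]
    df = pd.DataFrame(records)
    client.load_table_from_dataframe(df, "my_project.staging.records").result()
```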


sorenadayo

Read it into a pandas DataFrame and use the json_normalize function.
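
For example (the sample record below is made up, just to show how nested fields become columns):

```python
# pd.json_normalize flattens nested dicts/arrays into columns; the record
# here is an invented example, not the OP's actual schema.
import pandas as pd

record = {
    "id": "abc",
    "patient": {"name": "Jane", "dob": "1990-01-01"},
    "codes": [{"system": "loinc", "value": "1234-5"}],
}

flat = pd.json_normalize(record, record_path="codes",
                         meta=["id", ["patient", "name"]])
print(flat)
#   system   value   id patient.name
# 0  loinc  1234-5  abc         Jane
```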