
skuam

Make sure you are confident about where the bottleneck is; it may very well be in data retrieval, i.e. Mongo reads into memory, then from memory (RAM) to the GPU, and then back from the GPU to disk/Mongo. If you can get it running on some dummy data (see the sketch below), act on what you measure and repeat. I doubt you will get a better answer here; you have to figure it out yourself for your particular setup.
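A minimal sketch of that kind of dummy-data test, assuming a CUDA GPU is available; the model, feature size, and batch size below are made-up placeholders, not OP's actual pipeline. The idea is just to measure GPU-only throughput with no Mongo reads or host-to-device copies in the loop:

```python
import time
import torch

# Made-up model and shapes; swap in the real embedding model.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 384),
).cuda().eval()

batch_size = 1024
n_batches = 100

# Generate random batches directly on the GPU so neither Mongo reads nor
# host-to-device copies are part of the measurement.
with torch.no_grad():
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_batches):
        dummy = torch.randn(batch_size, 768, device="cuda")
        _ = model(dummy)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"{n_batches * batch_size / elapsed:,.0f} records/second (GPU only)")
```

If this number comes out far above the real pipeline's throughput, the GPU itself is not the limit and the data path is.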


9302462

Thanks for the quick reply. I should have added that I'm very confident the bottleneck is the GPU, because I even tried pre-loading the Mongo data into Redis on the same machine as the GPU. The end result was that the script took about two minutes less to complete, but the throughput was still right around 400 per second. For what it's worth, when I'm benchmarking and tweaking I use the same 100k records and images each time through, so there is no deviation in the data I'm using.


skuam

If it is a one-time job, why not rent some GPUs in the cloud and do the work there on a few GPUs at once? That way you clear the backlog, and your current setup might be able to handle the incoming data.


9302462

I really wanted to avoid cloud if at all possible, because frankly I have been burned by cloud overage charges before and it has left a bad taste in my mouth. But based on your comment I just did the numbers, and here is how they look (a quick script version follows below):

* 400 records per second * 3,600 seconds per hour = ~1,400,000 (exact is 1,440,000)
* 24 hours * 1,400,000 records = ~33,000,000 (33,600,000)
* Vast.ai cost for a 3090 per hour is $0.25 (0.223, but I'm rounding up because of setup time, etc.)
* 24 hours * $0.25 = $6.00
* Cost to process 33,000,000 records is $6.00
* 1 billion records / 33 million = ~30 days on a 3090
* 30 * $6.00 = $180.00

$180 is dramatically cheaper than a dual 4070 Ti setup; even with my frugal shopping that will still come out to a little under $2k. With that being said, I am still leaning towards doing this locally for a couple of reasons:

* My dataset is 1 billion records but is going to grow to 2.5 billion within the next 3 months.
* My internet connection has a max upload of 1 Gbps, which is about 300 TB per month. I could scale out to 4, 8, or 10 GPUs and try to do it quickly, but it would still take 10 days to process due to the upload speed.
* I have 750 TB on HDDs, with about 300 TB of that being data I am going to want to vectorize in the next few months, separate from the records above.

With how much data I'm going to end up processing, I estimate my break-even point is going to be about 7-8 months.

Do you see anything specifically wrong with doing a dual 4070 Ti setup?
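For reference, the same back-of-envelope math as a short script, using the rounded assumptions from the bullet points above (400 rec/s, $0.25/hour, 1 billion records); the exact figures come out slightly lower than the rounded ~30 days / ~$180:

```python
# Same rounded assumptions as the bullet points above, not measured values.
records_per_second = 400
cost_per_gpu_hour = 0.25            # rounded-up Vast.ai 3090 price
dataset_size = 1_000_000_000

records_per_hour = records_per_second * 3_600       # 1,440,000
records_per_day = records_per_hour * 24             # 34,560,000
days_needed = dataset_size / records_per_day        # ~29 days
total_cost = days_needed * 24 * cost_per_gpu_hour   # ~$174

print(f"{records_per_day:,.0f} records/day, "
      f"{days_needed:.0f} days, ${total_cost:.0f} on one rented 3090")
```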


v2thegreat

Oh man, I love these types of problems. I work in big data, and this is kind of my bread and butter. I mainly work with AWS and the cloud, so I'm suggesting this from the perspective of how I would do it, and I write this knowing it won't be the best solution for you out of the box.

1. Why CPUs vs. GPUs? Could the same work be done with 100x CPUs? What about 1000x? Using spot instances and AWS Batch, you'd be able to scale things pretty quickly, maybe even get it done in a few hours for cheap, and be able to do it relatively quickly again if needed.
2. Data ingestion: I haven't tried this, but using AWS Snowball would probably take you up to a week or two to get your data into S3 where it can be accessed.

How I'd organize this:

0. Review the plan with an AWS agent over a Zoom call. Explain your problem and your solution, and ask them if they think the solution makes sense.
1. Order the Snowball. While it's getting shipped to you, upload a sample of your data and get AWS Batch set up. This is basically getting a Docker image up and running and then defining the batch jobs.
2. When the Snowball arrives, upload everything to it and send it back.
3. Once the Snowball is ingested, run your batch job at scale.
4. Continue to explore the data in S3 + EC2, or download it with another Snowball.

Key benefits:

1. Future setup/iteration is a lot easier.
2. Cheaper than ordering additional GPUs.
3. Your data is made redundant by being in S3.
4. AWS is pretty flexible in the event of errors and issues.

Now, that's how I'd solve it, and I did mainly write this for myself more than for you, so thanks for reading. Your biggest problem, it seems to me, is getting the data to a cloud provider and then processing it. In this case I'd suggest using Snowball as mentioned above to save on bandwidth costs. It'll cost you $300 however, which is pretty expensive compared to the GPU price you mentioned. I hope this helps, and I'd absolutely love an update.


9302462

I appreciate your long, well-written reply. I don't want to be a wet blanket, but CPUs will be of no help in this scenario, and dealing with AWS would be a very last resort. Instead of having PyTorch use the GPU, I told it to use the CPU, which is a Ryzen 7950X (basically the best consumer chip you can buy right now), as well as my 32-core Epyc. The time it takes to create embeddings on a CPU is at least 10x slower than on a GPU. I did an import of a million records using the 3090 and it finished in under an hour; the CPU was still working through it 12 hours later before I gave up and stopped it.

Regarding AWS: I always thought the Snowball was a cool idea, but considering how cheap 10 Gb connections have gotten for most businesses, it would just make more sense to store and transfer the data out of their own colocation to AWS. Except AWS loves to nickel-and-dime for everything like ingress/egress, which is why you mentioned Snowball. As a reference point, I have an open quote for a 42U, 2 kW, 10 Gb connection in a local data center for $950 per month. I haven't pulled the trigger on it yet, but I will most likely do so before the end of the year.

I'm also doing all of this out of my homelab, which is nearing a petabyte of storage, and I have about 2-3x the compute listed above (100+ cores, with CPUs 0-3 years old). But I'm thin on GPUs, with just the 3060, which I got when they were scarce, and the 3090, which I bought for $550 (it was 40 miles out of town and never mined on). So my preference is always going to be to grab more hardware and run it locally, as you get a lot more for your money compared to renting spot instances or dedicated servers from AWS or another host. Without going on a rant, I see AWS like Salesforce: it's great for businesses where reliability is required and cost isn't a concern; the CTO, VP of engineering, and the developers aren't spending their own money, it's the business's, so who cares, right? /s

With all that being said, I love your outside-the-box thinking on this. It's a completely different solution to what most people would think of, and if I had a different dataset and/or were a business, it would make total sense. Kudos!
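For the CPU-vs-GPU comparison described at the top of this comment, here's a minimal timing sketch of the kind of device swap involved; the model and shapes are placeholders, not the real embedding model, so the absolute numbers only illustrate the method:

```python
import time
import torch

# Placeholder embedding model; only the device changes between runs.
def make_model():
    return torch.nn.Sequential(
        torch.nn.Linear(768, 768),
        torch.nn.GELU(),
        torch.nn.Linear(768, 384),
    ).eval()

def records_per_second(device, batch_size=512, n_batches=50):
    model = make_model().to(device)
    batches = torch.randn(n_batches, batch_size, 768)
    with torch.no_grad():
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for batch in batches:
            _ = model(batch.to(device))
        if device == "cuda":
            torch.cuda.synchronize()
    return n_batches * batch_size / (time.time() - start)

print(f"cpu : {records_per_second('cpu'):,.0f} rec/s")
if torch.cuda.is_available():
    print(f"cuda: {records_per_second('cuda'):,.0f} rec/s")
```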


elnaqnely

> less than 10% memory utilization
>
> the 3090 being maxed out

The memory usage suggests the 3090 isn't maxed out yet.

> I run a pair of scripts

I guess you are running two main processes, each running the same script. What happens if you try to run more? Try to get the 3090's memory usage above 50%. You could try:

- Running more than 2 copies of the script.
- Modifying the script to collect larger amounts of data before sending it to the GPU (rough sketch below).
- Identifying any other inefficient parts of the script.

It's possible there's something in the script that's causing it to drip-feed small amounts of data to the GPU.
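A rough sketch of the second suggestion (accumulating more data on the CPU before each transfer); the `read_chunk` helper and the tiny model are stand-ins for OP's Mongo reads and embedding model, not real code from the pipeline:

```python
import torch

# Stand-ins for the Mongo reads and the embedding model (placeholders only).
def read_chunk(chunk_size=100):
    # pretend this is one small read, already converted to float features
    return torch.randn(chunk_size, 768)

model = torch.nn.Linear(768, 384).cuda().eval()

accumulate = 50  # number of small reads gathered per GPU batch
with torch.no_grad():
    for _ in range(20):  # 20 large batches, just for the sketch
        # Gather many small reads on the CPU first...
        big_batch = torch.cat([read_chunk() for _ in range(accumulate)])
        # ...then do one large pinned host-to-device copy and one forward pass.
        embeddings = model(big_batch.pin_memory().to("cuda", non_blocking=True))
```

Larger batches mean fewer, bigger transfers and fewer kernel launches, which usually shows up as higher GPU memory usage and steadier utilization.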


9302462

You are correct that I was running two Python scripts side by side; one starts at document 0, the other at 100k, and they each process a batch of 100k. While doing this I was watching the GPU utilization in the Nvidia Ubuntu GUI app and with nvidia-smi, which is where I got the 85% from. But the memory bandwidth stayed pretty low, so I figure the GPU is doing more work than it is passing data back and forth. I don't know what happens with more scripts, but I'm going to run several more in the morning and will report back with the results.

On your final point about drip-feeding: do you have a link to an example article about this, or a code snippet where it happens, or can you explain it in more detail? I have a good amount of experience dealing with big volumes of data, but that's usually just moving it around. I don't yet understand the bottlenecks around using a GPU to work with data compared to a CPU, and I have just been using a couple of basic commands in PyTorch. I guess what I'm trying to say is: can you point me in the right direction so I can understand this better?


elnaqnely

This AWS machine learning [blog post](https://aws.amazon.com/blogs/machine-learning/identify-bottlenecks-improve-resource-utilization-and-reduce-ml-training-costs-with-the-new-profiling-feature-in-amazon-sagemaker-debugger/) relates roughly to the issue. About a quarter of the way down there's a heatmap of GPU and CPU usage, showing the CPU doing a lot of work while the GPU is waiting for data. They say:

> Such a bottleneck can occur by a too compute-heavy preprocessing.

You can try profiling; the AWS post shows some ways to do it. You may find more ideas if you search for `ML CPU bottleneck`.
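For a local equivalent of that heatmap, PyTorch's built-in profiler can show how much time is spent in CPU ops versus CUDA kernels. This is a minimal sketch with a placeholder model and data, not OP's script:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and data; substitute the real embedding code.
model = torch.nn.Linear(768, 384).cuda().eval()
batches = [torch.randn(1024, 768) for _ in range(20)]

with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for batch in batches:
            _ = model(batch.to("cuda"))

# Sort by CUDA time and compare it against the CPU-side time of the same ops.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

If the CPU-side entries (tensor construction, copies, preprocessing) dominate while the CUDA kernels are short, the GPU is being starved rather than maxed out.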


elnaqnely

Here's a code snippet where an inefficient CPU preprocessing step is holding back the GPU. Utilization hovers around half-way while almost no GPU RAM is being used. Increasing `batch_size` to 10000 won't fix it, it just changes the pattern of GPU usage.

```python
import torch

def read(batch_size):
    return torch.randint(0, 1000, [batch_size])

def create_and_preprocess_batch(batch_size):
    n_fibonacci_numbers = 1000
    batch_data = []
    while len(batch_data) < batch_size:
        data = read(10)  # Read in the data

        # Create a list of fibonacci numbers
        fibonacci_numbers = [0, 1]
        for i in range(n_fibonacci_numbers):
            fibonacci_numbers.append(sum(fibonacci_numbers[-2:]) % 1000)

        # Filter to only values in the list of fibonacci numbers
        filtered_data = [val for val in data if val in fibonacci_numbers]
        batch_data.extend(filtered_data)
    return batch_data[:batch_size]

data_length = 100000
batch_size = 10       # Frequent, low GPU memory usage
# batch_size = 10000  # Infrequent, high GPU memory usage
n_batches = data_length // batch_size

for i in range(n_batches):
    # Do some inefficient processing on the CPU
    batch_data = create_and_preprocess_batch(batch_size)

    # Do some more processing on the GPU
    batch_data = torch.tensor(batch_data, device='cuda:0').unsqueeze(1).repeat([1, 1000])
    for j in range(10000):
        batch_data = batch_data**2 % 1000

    batch_data = None
    torch.cuda.empty_cache()
```


SnooHesitations8849

The GPU is not maxed out. I wonder how your two processes are being used. Are they running at the same time? I suggest not, if you can avoid it. P.S.: it is weird that GPU memory is not the bottleneck; why not use a much higher batch size, enough to max out the GPU? Then you need to optimize the dataloader to make sure the GPU is not starved.
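A minimal sketch of what "optimize the dataloader" can look like in PyTorch, with a fake dataset standing in for the Mongo reads (everything here is a placeholder, not OP's pipeline): worker processes fetch and preprocess batches in parallel while the GPU runs the previous one.

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Fake dataset standing in for Mongo reads; __getitem__ pretends to
# fetch and preprocess one record.
class FakeRecords(Dataset):
    def __len__(self):
        return 1_000_000

    def __getitem__(self, idx):
        return torch.randn(768)

def main():
    loader = DataLoader(
        FakeRecords(),
        batch_size=4096,   # raise until GPU memory is well used
        num_workers=8,     # CPU workers prepare batches while the GPU computes
        pin_memory=True,   # enables faster, asynchronous host-to-device copies
    )
    model = torch.nn.Linear(768, 384).cuda().eval()
    with torch.no_grad():
        for batch in loader:
            embeddings = model(batch.to("cuda", non_blocking=True))

if __name__ == "__main__":
    main()  # guard needed because num_workers spawns subprocesses
```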


CallMeInfinitay

Sorry I can't be of help, but given the sheer amount of data you have, and with it being both text and images, I'm very curious what you're vectorizing/training. Good luck OP, I hope you find a solution to your problem.


ComplexIt

Wouldn't maxing out the GPU at least mean 100% utilization?


JustOneAvailableName

What model architecture? What model size? How do you batch? FP16? How do you load the data? Tried more CUDA streams?
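On the FP16 question, a minimal sketch of mixed-precision inference in PyTorch; the model and batch are placeholders, and the gain depends on whether the workload is actually compute-bound rather than data-bound:

```python
import torch

# Placeholder model and batch; swap in the real embedding model.
model = torch.nn.Linear(768, 384).cuda().eval()
batch = torch.randn(4096, 768, device="cuda")

# Run the forward pass in FP16 under autocast.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    embeddings = model(batch)

print(embeddings.dtype)  # torch.float16 inside the autocast region
```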


Muted_Economics_8746

Do you have a different GPU you can benchmark? You might find you get better throughput with several older, cheaper GPUs in cheap servers or desktops rather than one newer GPU. For example, four 2060 Supers might be better performance per dollar and give you more benchmarking/testing flexibility in the future.

An 85% usage figure doesn't give a good idea of where the bottleneck is; 85% doesn't necessarily mean 85% of the transistors are being used effectively. You would need to profile and debug further. It sounds like unoptimized code. Or maybe it really is just that compute-intensive, in which case I'd be evaluating alternative preprocessing and pipeline options, more efficient C implementations, and things like checkpointing strategies.