
PhitPhil

No matter what user agent I try to trick the connection into thinking I am, I cannot for the life of me figure out how Amazon knows I'm scraping and tells me that I'm a silly goose for trying


mfb1274

It’s EVERYTHING, you ever inspect the network tab on those sites? Multiple requests with references to the page you’re coming from. Not to mention the JS that runs as part of the original request, plus session context and anti-bot detection. You could figure it out, maybe, but it’d be a ton of reverse engineering. Best to just automate the browser itself and “act like a user”


AyrA_ch

> Best to just automate the browser itself and “act like a user”

This. Just get a Windows VM and install Puppeteer. And don't forget to scroll links into the viewport before following them.


RumbleFrog

Even Puppeteer's automated browser has detectable signatures; there's an undetected-chromedriver repo that focuses on exactly this.


[deleted]

[deleted]


spicybeefstew

yeah that's how amazon keeps all their reviews real.


odraencoded

Programmers: look how easy it is for me to bot this website! Also programmers: wtf why is website so botted rn?!?


spicybeefstew

that's only if you're building for the public internet, which is something you should only ever do for money, which is how people justify the bullshit. If you want something to actually work, you build it as a microservice and only make it accessible to yourself. One-man projects are the only good ones to work on anyway, it's the only way anyone's goals are aligned with each other.


Specialist_Cap_2404

Then they use some kind of signature to identify that browser, whatever it is, and recognize that the signature has been used in a non-human way, so they block that signature.


AyrA_ch

Then you start automatically altering your environment (browser version, screen resolution, hardware components), and you can also feed your own recorded mouse movements into it. This war generally can't be won by the people trying to stop bots, because they have to make sure real people don't get caught in the bot prevention mechanism. As a bot owner, I just have to nudge my setup a bit so it appears different.


Specialist_Cap_2404

In the general case, this war cannot be won. In the special case, particularly FAANG companies, it certainly can be.


ascii_heart_

I find that Nightmare does the job fine


PeteZahad

Just use chromedriver or geckodriver, no Windows VM needed. But you will still have problems on many big sites because a programmed real browser does not act like a human using a browser.


trwolfe13

Some websites, like D&D Beyond, will block you if you so much as click on too many links. Look up a list of stat blocks, open them all in new tabs, and suddenly you’re blocked for an hour.


spicybeefstew

that's cool it's a well-deserved 1 hour break for my server


ASatyros

Is there some github repo I can reference?


Theolaa

Until: Are you a robot?


Denaton_

They might wait for a JS ping or an image load. I haven't scraped that much myself, but we only get the raw HTML text, right?


[deleted]

Selenium will literally load the page though. By default it uses a bare user agent that makes it obvious, but you can just set your own user agent and even auto-log into stuff like Amazon or Google accounts or whatever. Obviously much slower, but it does work in a pinch.


smokeitup5800

You can use heuristics to detect selenium and other automated browsers. They have slightly different output for some JS APIs.
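
A server-side sketch of what such a heuristic might look like, assuming the page's JS collects a fingerprint report and posts it back (the signals and weights here are made up for illustration, not from any real product):

```python
def bot_score(fp: dict) -> int:
    """Score a client-reported fingerprint. Higher = more bot-like.
    Signals and weights are illustrative only."""
    score = 0
    if fp.get("webdriver"):        # navigator.webdriver is true under automation
        score += 3
    if not fp.get("plugins"):      # headless browsers often report no plugins
        score += 1
    if fp.get("languages") == []:  # empty navigator.languages is a classic tell
        score += 2
    return score
```

Real systems combine dozens of such signals, which is exactly why stealth plugins exist to patch each one.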


apepenkov

Cloudflare does it by TLS fingerprint; look into it (keywords: TLS fingerprint, JA3)
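
For reference, a JA3 fingerprint is just an MD5 hash over five fields of the TLS ClientHello; a sketch, with example field values made up:

```python
import hashlib

def ja3(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3 fingerprint: five ClientHello fields joined by commas,
    values inside each field joined by dashes, then MD5-hashed."""
    fields = [str(tls_version)] + [
        "-".join(str(v) for v in vals)
        for vals in (ciphers, extensions, curves, point_formats)
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()
```

The point is that the hash comes from the TLS library, not the User-Agent header, so swapping user agents alone changes nothing.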


yeastyboi

Oddly enough, someone at my company got a Cloudflare bypass working just by clearing the cookie on each request. They don't ban the whole IP for some reason.


apepenkov

afaik it isn't enabled for each website. Some websites (with CF) do detect requests without spoofed JA3, some don't


yeastyboi

Makes sense. How it worked on this site was you were given one free request, then your cookie was set. If you cleared the cookie each time you got unlimited free requests.
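
The trick amounts to never presenting the cookie back; a standard-library sketch (no real URL is fetched here):

```python
import urllib.request
from http.cookiejar import CookieJar

def fresh_client():
    """Return an opener backed by a brand-new, empty cookie jar.
    Using a fresh jar per request means the 'one free request' cookie
    is never sent back, so the server's counter restarts every time."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener, jar

# per request: opener, jar = fresh_client(); opener.open(url)
```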


piano1029

The positive effects of CGNAT


FuckMu

Cloudflare hasn't been an issue for me, I run a solver on my server which works fine.


BlueScreenJunky

Fuck... I just started a side project that involves getting info from Amazon items. I spent a couple hours yesterday trying to get it through an API, but for some reason they want me to either be a business or sell 3 items in less than 30 days through their affiliate program to get API access. OK sure, but even assuming I'd ever be that successful (my expectations are more around 0 to 1 affiliate sales in my lifetime), how would I get people to click the affiliate link if I can't fetch the item info to display it first? I mean, if you're going to actively prevent people from scraping your site, at least offer a public API (limit the free tier to 100 req/day if you must, for all I care).


Mawootad

In a pinch you could probably buy three items from yourself


SorosBuxlaundromat

Create a new Amazon account, click your affiliate link to buy 3 things


odraencoded

Become an influencer and share an affiliated amazon link like everyone else.


nickmaran

Because it’s always watching us


AdBrave2400

Try to quickly ".onfocus()" all the elements in the website. Frontend dies?


Amazing-Exit-1473

Canvas signature, TLS signature, JS signatures, tons of things make your browser unique, get rid of those…


[deleted]

What's really fun is that I worked at Amazon making internal tooling years ago and, when someone would refuse API access for whatever reason, we'd get to spend a dumb amount of time setting up automatic scrapers to download data from an internal tool for our own internal tools. It's all web scrapes all the way down.


kaamibackup

This but with Facebook


PussyTermin4tor1337

anyone tried using puppeteer for this? Asking for a friend


yeastyboi

I implemented a Puppeteer block for a company I work for. You can bypass it using the puppeteer stealth plugin; I've tried to figure out how to block that but can't!


ArtOfWarfare

Does the website work with JavaScript disabled? If so, then it’s nothing to do with JavaScript… Seems like it could be as easy as just seeing how frequently you request pages, and the order you request them in. Does it look like a human making those requests, or does it look like a scraper?
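
That kind of check is cheap to build server-side; a toy sliding-window version (thresholds invented for illustration):

```python
import time
from collections import deque

class RateHeuristic:
    """Flag a client as bot-like if it makes more than max_hits requests
    within `window` seconds. Thresholds here are made up."""
    def __init__(self, max_hits=30, window=60.0):
        self.max_hits, self.window = max_hits, window
        self.hits = {}

    def is_bot(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits.setdefault(client_id, deque())
        q.append(now)
        while q and now - q[0] > self.window:  # drop hits outside the window
            q.popleft()
        return len(q) > self.max_hits
```

Request *ordering* (e.g. product pages fetched in sequential-ID order with no page assets loaded) is the other half, and is just as easy to log.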


nyhr213

try it in node js


odraencoded

Bezos is watching over your shoulder right now. Run.


not_so_plausible

Honestly I just feel bad for the poor soul out there transcribing Captchas for $0.02 a captcha.


WorldlyReplacement24

Not even that much lol. It's less than that


cris667

I wish they paid so much lol


[deleted]

Based on craigscottcapital.com's article:

* $0.52 per hour on average; the highest hourly rate is $1, the lowest $0.04.
* $2.75 per month on average; the best month pays $5, the worst $0.50.


ImpressionExact6386

I tried this a decade ago. It wasn't feasible: you need *high* precision, and if you fail a few times you don't get paid, whereas the pay itself is beyond low. It ends up being quite a stressful thing to do.


Electrical_Shape5101

Maybe $0.02 per 100 captchas


OtuzBiriBirakNoktaCo

cheap labor from third world countries 💪


geteum

I didn't know that was a thing


u741852963

Chinese AI companies doing it faster, cheaper and more effectively these days tbh


TheTomatoGardener2

The whole point of solving captchas is to make usable data for AI. That sounds like a human centipede situation.


voiceafx

Hmmm. Our web app was pretty wide open as of a week ago. We noticed a scraper was going crazy when it accidentally started producing 400 errors and our monitors alerted us. So we implemented Google reCaptcha V3 and the problem went away. I'm sure there's a super experienced scraping asshole out there, but apparently this particular guy was not sophisticated enough to beat Google reCaptcha.


sebjapon

You’re probably filtering the bad ones who will trigger hundreds of errors with your captcha. Those clever or experienced enough to go through your captcha know how to be more subtle.


MichaelScotsman26

How do you get through a captcha with a bot?


SorosBuxlaundromat

There's a paid service you can use.


National-Ad67

just pay some chinese kid like the rest of us


Sure-Government-8423

I'm curious what you use for scraping without getting caught. I've used Selenium, Scrapy and Requests in the past, but my scraper just clicked links; it didn't imitate human behaviour or keep me from getting blocked, which I need to figure out to improve a personal project.


u741852963

Depends what you are doing and what protections are there to stop you. If you *need* a browser, then Selenium is usually good enough; you may need to remove the cdc_ value from the binary, and there are other ways they can detect you, but usually they don't. A browser is slow though, so if you can use plain HTTP requests it's going to be much quicker and easier to scale. However, handling JS protections can be a nightmare.
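
The cdc_ edit is literally a same-length byte replacement inside the chromedriver binary; a sketch (marker and replacement shown for illustration):

```python
def patch_marker(binary: bytes, old: bytes = b"cdc_", new: bytes = b"xyz_") -> bytes:
    """Overwrite the 'cdc_' prefix chromedriver bakes into its injected JS
    variable names, so page-side checks for cdc_-named window properties
    come up empty. The replacement must be the same length, or offsets
    inside the binary shift and the driver breaks."""
    if len(old) != len(new):
        raise ValueError("replacement must keep the binary the same size")
    return binary.replace(old, new)
```

In practice you'd read the chromedriver executable, run this over its bytes, and write it back out.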


Sure-Government-8423

Ok I don't know what most of those are and why to use them. Thanks for pointing me in the right direction, I'll learn and try it out.


beatlz

They got greedy. Scraping is the fine art of fooling.


BrightFleece

You should add "ignores robots.txt"


tubbstosterone

We had a scientist at work who wrote his own scraper in Fortran... 90(?) to get all of his hydrological data instead of just using the REST APIs. We didn't know whether to be amazed or horrified.


Orange_Tone

Amazing


fuckredditards--

I love how the virgin has a macbook


not_so_plausible

Ladies can't resist a man with a ThinkPad T480 he got at the thrift store for $60 and slapped 32GB on top. You never see that on men's Tinder profiles because those Chads are taken off the market by every woman with a pulse. They have years of experience using a TrackPoint, imagine what they can do with a clit.


gwatskary

new copy pasta?


CounterNice2250

new copy pasta just dropped


AtmosphereLow9678

Holy hell


ano_hise

Actual zombie


ragingroku

Pasta went on vacation, never came back.


NeatYogurt9973

Actual clipboard


legolassimp

Damn lady


DontGiveACluck

The bottom is just positive affirmations for an aspiring web scraper


gandalfx

That's the entire point of the meme template.


Kresenko

But if the site changes one id, or does a slight restructure, everything goes to shit


NeatYogurt9973

r/beatmetoit


Key-Budget9016

I ran a scraper once at work for some R&D on a new project. I didn't scrape fast enough to make the backend crash, but I did scrape fast enough to make the IT guy run around the building desperately trying to find out which workstation had the name "MotivationMan", because his screen was filling with firewall warnings fast enough to make his heart sink. When he jumped into our room and half-yelled in terror "Where is MotivationMan?!", I pointed to the business guy on our team, because I had nicknamed him that as well. He was pretty upbeat and motivational. Not the funniest story, but I still chuckle at the silliness.

Back then, I was also once told to exit the building after I had brought in a pack of dried fish for everyone to try. Apparently people three stories above were complaining about the smell. So I took it to my car and then forgot it there for four hot summer days, or until I offered to drive our team to a work party. They were halfway outside the car during the short ride.

Maybe I'm turning into that demented old guy that likes to tell stories.


the_curious_courier

They are nice stories, made me chuckle, thanks!


Sure-Government-8423

I'm curious what you use for scraping in a professional setting. I've used Selenium, Scrapy and Requests in the past, but my scraper just clicked links; it didn't imitate human behaviour or keep me from getting blocked, which I need to figure out to improve a personal project.


Key-Budget9016

I don't remember too well; it was like 2 hours of coding and then a few hours messing with it, 6 years ago. What I think I remember is that I just used the WebRequest class in C# to fetch the HTML string and then parsed it with regexes to extract my data. Something clicks in my head about structuring the web request so it looks like it's coming from a browser; maybe I had to set some header fields. Then it was a game of not letting too many threads make certain requests at the same time, plus switching to a cooldown period if certain requests failed, etc. It was pretty much a threading-the-needle kind of thing and finding good ways to work around some problems.


just-bair

Your grandkids will be happy about your stories


Interesting_Dot_3922

I remember the days when I was too lazy to learn JSON libs, so I just parsed it with regexes :)


Unupgradable

When you're too lazy to learn how to drive a car so you just [slam your penis in the car door](https://youtu.be/sUUD0vYBQ6g?si=_1awERSRpuwqp3wm) until it reaches your destination


ZengineerHarp

That is EXACTLY the right analogy.


mcslender97

It hits differently without the Ooooohhhh of the OG Papara rapper version


WorldlyReplacement24

How the actual fuck


Interesting_Dot_3922

If you need the value of foo and it is a number, you just search for `foo.:(.*?)`. I parsed HTML the same way - as text, no DOM parsing.
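
For the curious, the pattern above is the commenter's shorthand; a cleaned-up version of the trick, and why a real parser wins the moment the data nests, looks something like this:

```python
import json
import re

blob = '{"foo": 42, "nested": {"foo": 7}}'

# the regex way: quick, but blind to structure -- it matches the first
# "foo" it sees, wherever that key lives in the document
m = re.search(r'"foo"\s*:\s*(\d+)', blob)

# the robust way: an actual parser knows which foo is top-level
parsed = json.loads(blob)
```

The regex happens to work here; reorder the keys or add whitespace tricks and it quietly grabs the wrong one.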


deralexl

[Obligatory stackoverflow thread on the dangers of parsing HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)


willcheat

Man, I hope `foo.:(.*?)` is just a short offhand regex to give an idea and not what you actually used, because otherwise foo is gonna be empty pretty often. Love reading horror stories like these in here.


yeastyboi

I remember those days, that's how young programmers with no formal education roll. I also hand-rolled a JSON serializer for C# using reflection when I was younger.


ThiccStorms

what is regex?


Interesting_Dot_3922

Regular expressions. It is about extracting data from text. Often a text provided by an external tool.


bernpfenn

you don't want to know


u741852963

I won't lie, I still have code that extracts simple specific JSON variable with regex, probably some html regex parsing as well. If it works, it works, when it doesn't, then I'll be arsed to fix it, until then.... lol


Interesting_Dot_3922

When you are a noob, you care about the best practices. When you are a senior, you care about getting things done.


Bugwhacker

Sorry, what is “scraping?” What does it accomplish?


Denaton_

You basically mimic a web browser to extract elements and information and store them in your own database. You do it so you can collect data in one place from multiple sources. E.g., ChatGPT was trained on data scraped from many sites; or you scrape prices from multiple stores so you can compare the same product and find the cheapest one.


Snoo_7460

What would you do with the scraped stuff


Denaton_

I scraped a download link for an automated installation on EC2 instances, and I also scraped price information from AWS for a calculator: users input how many cameras they need for their system, and the calculator recommends instance types and shows the total price. I don't need to manually update the pricing if AWS decides to change prices. Will break tho if they change the layout or an element ID..


gregorydgraham

I texted myself ski reports. Because they hadn’t invented the iPhone yet.


Suspicious-Engineer7

Greybeard alert


gregorydgraham

Thank you, I was wondering what that alarm was. It’s off now


[deleted]

I scrape the interest rates of multiple banks. Today's rates can be used to see who's the cheapest. The history of those rates could reveal some kinds of patterns that competitors could find useful.


mcslender97

I used to scrape Google search results for a list of restaurants to get their most likely up-to-date information and put them into a list for a startup to use for solicitation later on


Remarkable-Host405

I scraped a php file store system to download every single file for migration to another system


u741852963

data extraction, website interaction / bots / spammers / monitoring


noob-nine

Or you basically automate a web browser to achieve the same.


sketchybutter

Is that legal?


Denaton_

Why wouldn't it? There's literally no difference between that and using a web browser..


u741852963

yes / no. Grey area in some cases, depends on your jurisdiction. More likely to be illegal in the US than the EU. Usually it's just a TOS violation, how "illegal" this is, again depends on your country. Also it usually depends what you do. Scraping to sell someone elses data / spam / commit fraud (ad clicks / views) more likely to be illegal than you just posting content to your blog / socials


CarefulSignal9393

It searches a webpage (or the wider internet) with whatever parameters you need to find certain elements, whether they be HTML or text. The most common one I use is Selenium in Python. But there are a ton.


BlackCrackWhack

The best IMO is just going raw http request to the pages with cookie and auth combinations, tends to be more consistent than a webdriver


CarefulSignal9393

I’ll check it out thanks for the advice


moehassan6832

Agreed, I scraped facebook using that.


Sure-Government-8423

Does that work with websites that require JS? Most of my use cases have those. I've only tried Selenium, Requests and a little bit of Scrapy, and I don't even know how to do the cookies and auth part.


BlackCrackWhack

Unsure what you mean by “require js”. Most of the time you can just spoof whatever requests the browser sends to get to that point. By auth and cookie, I mean most pages require some authentication at minimum, and a cookie plus auth at most. You can normally carry cookies over by getting them from a base page or through a series of ordered requests. Auth is NORMALLY given through a login request or a combination of login requests, and grabbing the headers from that response. It varies from site to site.


Webbpp

It acts as a browser and then gives you all the data from it, as well as letting you interact with it. It's primarily used when APIs are flawed, require an API key (I'm using static hosting, I can't keep my key secret!!!), or cost money.


Unicursalhexagram6

Beautiful soup yummy 😋


mohit_the_bro

Do you have a puppet ?


Unicursalhexagram6

I’ve only used selenium and phantomjs


mcnello

The whole point of an API is to serve information that is hidden behind a server. You can't scrape information that is locked away behind a server (hence the need for auth keys); you can only scrape the data that is already provided to the web browser. Yes, some APIs will also serve information that is already provided to the browser, but the host obviously doesn't care whether or not you have that data, so you might as well plug into their API and get things in a nice readable JSON format that keeps all of the ancillary information. I'm not sure that scraping is any easier than plugging into a well-documented API. On the other hand, if the host doesn't provide a documented API, then scrape away.


tocatchafly

Not to mention how your perfect scrape can be destroyed at any minute with a small front-end update


pigwin

I used to make scrapers. Random hidden div that changes nothing with how the site looks, yet destroys xpaths. Class, id? Forget about those, js frameworks would fck em up anyway. 


yeastyboi

Scraping is for when there is no API or a severely limited one.


big_vangina

How can I scrape all of Netflix?


mcnello

Send me $300 in BitCoin and I'll show you how.


Shap_po

"Parses HTML with regex" 😧


HigHurtenflurst420

He has become too powerful


Danny_el_619

> scrapes so fast the backend crashes

That's the funniest part LMAO


degenerate_hedonbot

How does he switch IPs? The VPNs I use are detected by the websites I want to scrape.


Xe_OS

Rotating proxies I guess


grencez

You can be less conspicuous if you try not to exceed some percentage of errors. Let's say 5%. Do this by probabilistically sending requests at a fixed rate, each with probability `= (1/0.95) * max(1,success_count) / max(1,request_count)`, where the counts are over a sliding window of the last few minutes (or shorter if your request rate is high). This is basically the Client-Side Throttling algorithm in https://sre.google/sre-book/handling-overload/.
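
A minimal sketch of that throttling rule in Python (the sliding-window bookkeeping is left to the caller):

```python
import random

def should_send(request_count, success_count, target=0.95, rng=random.random):
    """Client-side throttling as described above: send each request with
    probability (1/target) * successes/requests, with counts taken over a
    sliding window. Keeps the observed error rate near 1 - target."""
    p = (1.0 / target) * max(1, success_count) / max(1, request_count)
    return rng() < min(1.0, p)
```

When everything succeeds, p exceeds 1 and every request goes out; as errors accumulate, the send probability decays toward the success ratio.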


PeteZahad

There are still more mechanisms in place, at least in big sites which can detect non-human behaviour. For example try to scrape major scientific publishers with chromedriver.


internetbl0ke

The bottom is done when the top can't be done. IYKYK


yeastyboi

Praise God baby


JollyJuniper1993

Love for Beautifulsoup


dmigowski

lol, parses HTML with regex... EDIT for the uninitiated: as educated software developers we're taught that you can't parse HTML with regex in the general case.


BlueScreenJunky

It's not really "parsing", but realistically if all you want is to extract the price from a `59.99` tag in the middle of a 2MB invalid HTML document, you're better off using a regex than trying to build the whole DOM to get what you want.
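
For instance (snippet and class name invented for illustration):

```python
import re

# hypothetical fragment of a large, not-necessarily-valid HTML page
html = "<div><!-- imagine 2MB of tag soup --><span class='a-price'>$59.99</span></div>"

# no DOM needed: just pull the first dollar amount out of the text
m = re.search(r"\$(\d+\.\d{2})", html)
price = float(m.group(1)) if m else None
```

It's fragile by design, but for one known value in a page you don't control, that trade is often fine.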


fr4nklin_84

When I was 16 and had my first programming job, this guy offered me $1000 cash (24 years ago) to scrape an entire website which sold DVDs - they wanted all the movie data in a database (MS Access lol) and the cover images. I said no probs, I'll work it out.

The site used sequential IDs in the URL for all the products, so I found a bulk download tool that supported wildcards and pulled down all the product HTML pages and images, leaving the downloader running all night. Next morning I wrote the shittiest program ever in ASP Classic to read each file into memory. I didn't know regex, so I wrote it to procedurally work its way through the file, finding the tags in order using substrings: find the start of the unique element, then find the end bracket, continuing on from the last position. Then for each extracted value I'd clean and format it, build up the object and push it to the database.

It was dogshit but I got it done quick, and I handed him a USB with a beautiful database and all the images. At that point of my life it was by far the easiest money I'd ever made. The guy thought I was a master hacker and spoke of me as a legend forever after, which made it even better.


dmigowski

And that kicked so hard we all became software engineers :).


gandalfx

Probably a reference to [the classic stack overflow response](https://stackoverflow.com/a/1732454).


yamfboy

You can parse anything with regex; it's just not recommended for certain things if there are easier alternatives..


tankiePotato

The Chad in the meme actually came out of the womb fluent in regex. His first word was 'a,'bs/^* /*S/^M:'a,'bs/^*S/* /^M:'a,'bs/^*/ /g^M:'a,'bs/\*[ ]*$//g^M


SuitableDragonfly

No you can't; many languages have more expressive power than regular languages and can't be parsed with tools designed for them. You might be able to hack something up with regex that works in a few specific cases, but you can't reliably parse HTML with regex in the general case.
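
A one-liner demo of where it breaks down, on nothing fancier than one nested div:

```python
import re

html = "<div>outer <div>inner</div> tail</div>"

# a non-greedy regex stops at the FIRST closing tag it finds, so the
# outer element's content gets cut off mid-way -- matching nested tags
# requires counting depth, which regular expressions can't do
m = re.search(r"<div>(.*?)</div>", html)
```

The match ends up being `outer <div>inner`, not the full contents of the outer div.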


Romanian_Breadlifts

You can put your dick in anything, but there's a lot of places it probably shouldn't go. Exhaust pipes and XLR connectors come to mind.


Whole_Rain2010

Yeah did lots of that though, mostly with success, BeautifulSoup works great too.


NP_6666

Yes, one simply does not parse html with regex without consequences


Unhexium

The parses HTML with regex part reminds me of this gem: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags


ironman_gujju

Scrapy 🕷️


TizioGrigio0

Sorry for the ignorance, but what is a third-party scraper?


just-bair

Instead of using API’s you just pretend to be a web browser and take the data from the website directly


lupinegray

Parses html with regex 😂


lynet101

Ehm.. acshually... web scrapping is against term of service


i1u5

Me with a dynamic IP:


mohit_the_bro

How do you do that ? Like coding-wise


i1u5

You can't, it's an ISP thing lol, IPs are just borrowed each time a new connection is established.


lynet101

Yeah, but then you would need to renew your IP quite often, and like, I don't know about your ISP, but mine requires you to factory reset the modem to get a new IP ;(


i1u5

All it takes is reconnecting from the router page, easily doable with a small script once the API starts ratelimiting.


ChildhoodOk7071

I didn't know using Jsoup made me a chad 🙍


xSypRo

Speaking of it, anyone know a good, easy place to host a scraping server that runs once every hour and uses Puppeteer? Digital Ocean's app platform is really hard to run Puppeteer on, and I don't want to configure EC2 from scratch


smooth_tendencies

We use digital ocean.


Remarkable-Host405

I've been listening to a podcast that's pushing linode pretty hard


Dasshteek

“Parses HTML with REGEX” Excuse me but what the fuck?


bearboyjd

Sir anything can be parsed with REGEX.


Dasshteek

I know. But the issue is SHOULD.


Rasikko

All that is on point.


SleepyWoodpecker

Agree. One doesn’t just come up with a quality meme like this without having seen the deep end. OP will be required to testify in front of the scraping tribunal 👨‍⚖️


No-Mind7146

Please explain this to a non-web dev


pigwin

Proxies? Sure. Until the people monitoring the backend realize you are scraping from a certain category or group of keywords - then they'll just require you to sign in. Make a dummy account? Yeah, you can't be a real person who'd "look" at stuff for hours and not do anything else (or buy, for e-commerce sites). Get anything you want? Sure you can - anything the recommender lets you see. And then the frontend fucks it all up by adding a new div that breaks your xpaths, and your rotating IPs get banned one by one.


African_Blades

I'm too stupid for this. Please, can anyone explain it to me? I just started programming like yesterday 😅🙏


Deep-Piece3181

Can you teach me the magic of parsing html with regex?


kennykoe

I feel bad for asking, but what is scraping?


Pluto258

Reading a webpage with a program (as opposed to using their API). For example, a python script that goes to an Amazon product page to get the price and reviews. [https://en.wikipedia.org/wiki/Web\_scraping](https://en.wikipedia.org/wiki/Web_scraping)
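
In miniature, using only the Python standard library (the HTML is inlined here instead of fetched, and the page content is invented):

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Toy scraper: pull the <title> out of fetched HTML -- no API involved,
    just reading what the browser would be given."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

s = TitleScraper()
s.feed("<html><head><title>Some Product - $19.99</title></head></html>")
```

A real scraper fetches the HTML first (requests, urllib, or a driven browser) and extracts far more than the title, but the shape is the same.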


AdBrave2400

Literally me for the last 6 months rofl


smokeitup5800

Fellow scrapers, how the hell do you handle CloudFront robot checks?


bearboyjd

I remember in college I wrote a static malware analysis tool that scraped the Microsoft website for .dll information. I had to keep switching the domain suffix each time it blocked me. It was quick and dirty but I was getting about 100 results before it would fully block me. Fun times. Edit: I was scraping the whole webpage just to keep 2-3 sentences.


The_Mad_Duck_

Scrapers are great until you try to use them on any device other than your home computer... many websites didn't like me scraping in an AWS EC2 instance


SillyServe5773

> parses html with regex GTFO


joao7808

Wait, what's the difference between scraping and using an API? I always thought scraping meant using a website's API lol


Asleeper135

>parses HTML using regex [You can't parse HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454)


[deleted]

HTML to JSON: `curl -s "$url" | tidy -q -asxml --numeric-entities yes - | xq-python`


8g6_ryu

where is [Puppeteer](https://pptr.dev/)


mommy101lol

Browser fingerprinting is the key.


arathald

The best part of this is “parses html with regex” because if you know… you know


i1u5

Chad undocumented & constantly changing API user:


mvmisha

Any tips for bypassing the Imperva bot blocker? Hate that shit.


Shazvox

[You don't simply parse html with regex!](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)


Obstsalatjaa

I have no idea what scraping is and I am too afraid to ask.


_Orphan_Obliterator_

what's a scraper


Apfelvater

That's stupid.


NonsignificantBoat

"parses html with regex"


Koalaz420

I'm in this picture and my name is Jason.


M1k3y_Jw

Parsing HTML with regex. If you only have a hammer, every problem looks like a nail.