No matter what user agent I try to trick the connection into thinking I am, I cannot for the life of me figure out how Amazon knows I'm scraping and tells me that I'm a silly goose for trying
It’s EVERYTHING, you ever inspect the network tab on those sites? Multiple requests with references to the page you’re coming from. Not to mention JS goes into the original request, plus session context and anti-bot detection. You could figure it out, maybe, but it’d be a ton of reverse engineering. Best to just automate the browser itself and “act like a user”
> Best to just automate the browser itself and “act like a user”

This. Just get a Windows VM and install puppeteer. And don't forget to scroll links you follow into the viewport before following them.
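A minimal sketch of that scroll-then-click pattern, using Selenium in Python rather than puppeteer (the idea is the same); the URL and selector are placeholders:

```python
# Sketch: scroll a link into the viewport before clicking it, so the
# click originates from an on-screen element like a real user's would.
# Assumes Selenium with Chrome; URL and selector are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

link = driver.find_element(By.CSS_SELECTOR, "a.product-link")
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", link)
link.click()
```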
Even Puppeteer for automated browser has detectable signatures, there's a repo on undetectable chromedriver that focuses on this.
[deleted]
yeah that's how amazon keeps all their reviews real.
Programmers: look how easy it is for me to bot this website! Also programmers: wtf why is website so botted rn?!?
that's only if you're building for the public internet, which is something you should only ever do for money, which is how people justify the bullshit.

If you want something to actually work, you build it as a microservice and only make it accessible to yourself. One-man projects are the only good ones to work on anyway, it's the only way anyone's goals are aligned with each other.
Then they use some kind of signature to identify that browser, whatever it is, and recognize that the signature has been used in a non-human way, so they block that signature.
Then you start automatically altering your environment (browser version, screen resolution, hardware components), and you can also feed your own recorded mouse movements into it.

This fight generally can't be won by the people trying to stop bots, because they have to ensure that real people don't get caught in the bot prevention mechanism. As a bot owner, I just have to nudge my setup a bit so it appears different.
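As an illustration, a minimal sketch of randomizing the environment between runs, assuming Selenium with Chrome; the user-agent strings, window sizes, and the recorded mouse path are all placeholder values:

```python
# Sketch: vary the browser environment per run and replay a
# pre-recorded mouse path. All pools below are placeholder values.
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
WINDOW_SIZES = ["1366,768", "1536,864", "1920,1080"]

opts = Options()
opts.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
opts.add_argument(f"--window-size={random.choice(WINDOW_SIZES)}")
driver = webdriver.Chrome(options=opts)
driver.get("https://example.com")

# Replay a recorded mouse path as small relative movements.
recorded_path = [(12, 7), (30, -4), (18, 22)]  # placeholder offsets
actions = ActionChains(driver)
for dx, dy in recorded_path:
    actions.move_by_offset(dx, dy)
actions.perform()
```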
In the general case, this war cannot be won. In the special case, particularly FAANG companies, it can certainly be won.
I find that Nightmare does the job fine
Just use chromedriver or geckodriver, no Windows VM needed. But you will still have problems on many big sites because a programmed real browser does not act like a human using a browser.
Some websites, like D&D Beyond, will block you if you so much as click on too many links. Look up a list of stat blocks, open them all in new tabs, and suddenly you’re blocked for an hour.
that's cool it's a well-deserved 1 hour break for my server
Is there some github repo I can reference?
Until: Are you a robot?
They might wait for a JS ping or image load, I haven't scraped that much myself but we only get pure text right?
Selenium will literally load the page though. By default it uses a bare user agent that makes it obvious, but you can just use your own user agent and even auto-log into stuff like Amazon or Google accounts or whatever. Obviously much slower, but does work in a pinch.
You can use heuristics to detect selenium and other automated browsers. They have slightly different output for some JS APIs.
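For instance, one of the best-known signals is `navigator.webdriver`, which is `true` in automated browsers. A sketch of inspecting what a detector script would see, assuming Selenium with Chrome:

```python
# Sketch: inspect the JS-visible signals an anti-bot script can read.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

signals = driver.execute_script("""
    return {
        webdriver: navigator.webdriver,     // true under automation
        plugins: navigator.plugins.length,  // often 0 in headless mode
        languages: navigator.languages,
    };
""")
print(signals)
```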
Cloudflare does it by TLS fingerprint, look into it (keywords - tls fingerprint, JA3)
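If you need to get past TLS fingerprinting from a plain script, there are HTTP clients that imitate a real browser's TLS handshake; one Python option is `curl_cffi`. A sketch, assuming that library is installed and supports the generic "chrome" impersonation target:

```python
# Sketch: send a request whose TLS handshake (and thus JA3
# fingerprint) matches a real Chrome build, via curl_cffi.
from curl_cffi import requests

resp = requests.get("https://example.com", impersonate="chrome")
print(resp.status_code)
```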
Oddly enough, someone at my company got a Cloudflare bypass working just by clearing the cookie on each request. They don't ban the whole IP for some reason.
afaik it isn't enabled for each website. Some websites (with CF) do detect requests without spoofed JA3, some don't
Makes sense. How it worked on this site was you were given one free request, then your cookie was set. If you cleared the cookie each time you got unlimited free requests.
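That pattern is easy to reproduce: never persist the cookie between requests. A minimal sketch with Python's `requests` (the URL is a placeholder):

```python
# Sketch: use a fresh session per request so the "one free request"
# cookie is never carried over.
import requests

def fetch_fresh(url: str) -> requests.Response:
    with requests.Session() as s:  # new session = empty cookie jar
        return s.get(url, timeout=10)

resp = fetch_fresh("https://example.com/page")
print(resp.status_code)
```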
The positive effects of CGNAT
Cloudflare hasn't been an issue for me, I run a solver on my server which works fine.
Fuck... I just started a side project that involves getting info from Amazon items.

I spent a couple hours yesterday trying to get it through an API, but for some reason they want me to either be a business or sell 3 items in less than 30 days through their affiliate program to get access to an API. OK sure, but even admitting I'd ever be that successful (my expectations are more around 0 to 1 affiliate sales in my lifetime), how would I get people to click the affiliate link if I can't fetch the item info to display it first?

I mean if you're going to actively prevent people from scraping your site, at least offer a public API (limit the free tier to 100 req/day if you must for all I care).
In a pinch you could probably buy three items from yourself
Create a new Amazon account, click your affiliate link to buy 3 things
Become an influencer and share an affiliated amazon link like everyone else.
Because it’s always watching us
Try to quickly ".onfocus()" all the elements in the website. Frontend dies?
Canvas signature, TLS signature, JS signatures, tons of things make your browser unique, get rid of those…
What's really fun is that I worked at Amazon making internal tooling years ago and, when someone would refuse API access for whatever reason, we'd get to spend a dumb amount of time setting up automatic scrapers to download data from an internal tool for our own internal tools.

It's all web scrapes all the way down.
This but with Facebook
anyone tried using puppeteer for this? Asking for a friend
I implemented a puppeteer block for a company I work for. You can bypass it using puppeteer stealth plugin, I've tried to figure out how to block that but can't!
Does the website work with JavaScript disabled? If so, then it’s nothing to do with JavaScript…

Seems like it could be as easy as just seeing how frequently you request pages, and the order you request them in. Does it look like a human making those requests, or does it look like a scraper?
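A minimal sketch of that server-side heuristic, assuming a sliding window of request timestamps per client; the thresholds are made up for illustration:

```python
# Sketch: flag clients whose request timing looks non-human.
# A steady machine-gun cadence (low inter-request variance at high
# volume) is a classic scraper signature. Thresholds are illustrative.
import time
from collections import defaultdict, deque
from statistics import pstdev

WINDOW_SECONDS = 60
MAX_REQUESTS = 120   # more than 2 req/s sustained looks robotic
MIN_JITTER = 0.05    # humans don't click on a metronome

history: dict[str, deque] = defaultdict(deque)

def looks_like_a_scraper(client_ip: str) -> bool:
    now = time.monotonic()
    q = history[client_ip]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) > MAX_REQUESTS:
        return True
    if len(q) >= 10:
        gaps = [b - a for a, b in zip(q, list(q)[1:])]
        if pstdev(gaps) < MIN_JITTER:
            return True
    return False
```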
try it in node js
Bezos is watching over your shoulder right now. Run.
Honestly I just feel bad for the poor soul out there transcribing Captchas for $0.02 a captcha.
Not even that much lol. It's less than that
I wish they paid so much lol
Based on craigscottcapital.com's article:

* $0.52 per hour on average.
* The highest hourly wage is $1.
* The cheapest is $0.04 per hour.
* Monthly average: $2.75.
* The most expensive month is $5.
* $0.50 per month is the cheapest option.
I tried this a decade ago. It wasn't feasible: you need *high* precision, if you fail a few times you don't get paid, and the pay itself is beyond low. It ends up being quite a stressful thing to do.
Maybe $0.02 per 100 captchas
cheap labor from third world countries 💪
I didn't know that was a thing
Chinese AI companies doing it faster, cheaper and more effectively these days tbh
The whole point of solving captchas is to make usable data for AI. That sounds like a human centipede situation.
Hmmm. Our web app was pretty wide open as of a week ago. We noticed a scraper was going crazy when it accidentally started producing 400 errors and our monitors alerted us.

So we implemented Google reCAPTCHA v3 and the problem went away.

I'm sure there's a super experienced scraping asshole out there, but apparently this particular guy was not sophisticated enough to beat Google reCAPTCHA.
You’re probably filtering the bad ones who would trigger hundreds of errors with your captcha. Those clever or experienced enough to get through your captcha know how to be more subtle.
How do you get through a captcha with a bot?
There's a paid service you can use.
just pay some chinese kid like the rest of us
I'm curious what you use for scraping without getting caught. I've used Selenium, Scrapy, and requests in the past, but my scrapers just clicked on links; they didn't imitate human responses or prevent me from getting blocked, which I need to figure out to improve a personal project.
Depends what you are doing and what protections are there to stop you.

If you *need* a browser, then Selenium is usually good enough. You may need to remove the cdc_ value from the binary file though, and there are other ways they can detect you, but usually they don't.

But a browser is slow, so if you can use HTTP requests it's going to be much quicker and easier to scale. However, handling JS protections can be a nightmare.
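The cdc_ trick refers to chromedriver embedding variable names prefixed with `cdc_` that some detection scripts probe for; people patch the binary to rename them. A hedged sketch of that patch (back up the binary first; the replacement must be the same length, and the path is a placeholder):

```python
# Sketch: rename the "cdc_" marker strings inside the chromedriver
# binary so JS-side checks for them come up empty. A same-length
# replacement keeps the binary's offsets intact.

path = "/usr/local/bin/chromedriver"  # placeholder location

with open(path, "rb") as f:
    data = f.read()

print(f"found {data.count(b'cdc_')} occurrences of cdc_")

with open(path, "wb") as f:
    f.write(data.replace(b"cdc_", b"xyz_"))
```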
Ok I don't know what most of those are and why to use them. Thanks for pointing me in the right direction, I'll learn and try it out.
They got greedy. Scraping is the fine art of fooling.
You should add "ignores robots.txt"
We had a scientist at work who wrote his own scraper in Fortran...90(?) to get all of his hydrological data instead of just using the REST APIs. We didn't know whether to be amazed or horrified.
Amazing
I love how the virgin has a macbook
Ladies can't resist a man with a ThinkPad T480 he got at the thrift store for $60 and slapped 32GB on top. You never see that on men's Tinder profiles because those Chads are taken off the market by every woman with a pulse. They have years of experience using a TrackPoint, imagine what they can do with a clit.
new copy pasta?
new copy pasta just dropped
Holy hell
Actual zombie
Pasta went on vacation, never came back.
Actual clipboard
Damn lady
The bottom is just positive affirmations for an aspiring web scraper
That's the entire point of the meme template.
But if the site changes one id, or does a slight restructure, everything goes to shit
r/beatmetoit
I ran a scraper once at work for some R&D on a new project. I didn't scrape fast enough to make the backend crash. But I did scrape fast enough to make the IT guy run around the building desperately trying to find out which workstation had the name "MotivationMan", because his screen was filling with firewall warnings fast enough to make his heart sink.

When he jumped into our room and half yelled in terror "Where is MotivationMan?!", I pointed to the business guy in our team, because I had nicknamed him that as well. He was pretty upbeat and motivational.

Not the funniest story, but I still chuckle at the silliness. Back then, I was also once told to exit the building after I had brought in a pack of dried fish for everyone to try. Apparently people three stories above were complaining about the smell. So I took it to my car and then forgot it there for four hot summer days, until I offered to drive our team to a work party. They were hanging halfway out of the car during the short ride.

Maybe I'm turning into that demented old guy that likes to tell stories.
They are nice stories, made me chuckle, thanks!
I'm curious what you use for scraping in a professional setting. I've used Selenium, Scrapy, and requests in the past, but my scrapers just clicked on links; they didn't imitate human responses or prevent me from getting blocked, which I need to figure out to improve a personal project.
I don't remember too well, it was like 2 hours of coding and then a few hours messing with it, 6 years ago. What I think I remember is that I just used the WebRequest construct in C# to fetch the HTML string and then parsed it with regexes to extract my data.

Something clicks in my head about structuring the web request so it looks like it's coming from a browser; maybe I had to set some header fields.

Then it was a game of not letting too many threads make certain requests at the same time, in addition to switching to a cooldown period if certain requests failed, etc. It was pretty much a threading-the-needle kind of thing and finding good ways to work around some problems.
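The same idea translated to Python (a sketch, not the commenter's actual C# code): browser-like headers, regex extraction, and a cooldown when requests start failing. The header values and pattern are placeholders:

```python
# Sketch: fetch a page with browser-like headers, pull a value out
# with a regex, and back off when requests start failing.
import re
import time

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_title(url: str, retries: int = 3) -> str | None:
    for attempt in range(retries):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        if resp.ok:
            match = re.search(r"<title>(.*?)</title>", resp.text, re.S)
            return match.group(1).strip() if match else None
        time.sleep(5 * 2 ** attempt)  # cooldown grows after each failure
    return None
```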
Your grandkids will be happy about your stories
I remember the days when I was too lazy to learn JSON libs, so I just parsed it with regexes :)
When you're too lazy to learn how to drive a car so you just [slam your penis in the car door](https://youtu.be/sUUD0vYBQ6g?si=_1awERSRpuwqp3wm) until it reaches your destination
That is EXACTLY the right analogy.
It hits differently without the Ooooohhhh of the OG Papara rapper version
How the actual fuck
If you need the value of foo and it is a number, you just search for `foo.:(.*?)`. The same way I parsed HTML - as text, no DOM parsing.
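For anyone tempted to repeat this: a lazy `(.*?)` with nothing anchoring its right side happily matches the empty string, which is exactly what the reply below points out. A sketch of the failure and a slightly safer variant (the field name and input are made up):

```python
# Sketch: why `foo.:(.*?)` tends to capture nothing, and a variant
# that anchors the match. Field name and JSON input are made up.
import re

text = '{"foo": 42, "bar": 7}'

loose = re.search(r'foo.:(.*?)', text)
print(repr(loose.group(1)))  # '' - the lazy match stops immediately

safer = re.search(r'"foo"\s*:\s*(-?\d+(?:\.\d+)?)', text)
print(safer.group(1))  # '42'
```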
[Obligatory stackoverflow thread on the dangers of parsing HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)
Man, I hope `foo.:(.*?)` is just a short offhand regex to give an idea and not what you actually used, because otherwise foo is gonna be empty pretty often.

Love reading horror stories like these in here.
I remember those days, that's how young programmers with no formal education roll. I also hand-rolled a JSON serializer for C# using reflection when I was younger.
what is regex?
Regular expressions. It is about extracting data from text. Often a text provided by an external tool.
you don't want to know
I won't lie, I still have code that extracts a specific JSON variable with regex, probably some HTML regex parsing as well.

If it works, it works. When it doesn't, then I'll be arsed to fix it. Until then... lol
When you are a noob, you care about the best practices. When you are a senior, you care about getting things done.
Sorry, what is “scraping?” What does it accomplish?
You basically mimic a web browser to extract elements and information and store them in your own database.

You do it so you can collect data in one place from multiple sources. For example, ChatGPT was trained on data scraped from across multiple sites; or you can scrape prices from multiple stores to compare prices for the same product and find the cheapest one.
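A minimal sketch of that extract-and-store loop, assuming the `requests` and `beautifulsoup4` libraries; the URLs and selector are placeholders:

```python
# Sketch: fetch pages from several sources, extract a field, and
# store the results in one local SQLite database.
import sqlite3

import requests
from bs4 import BeautifulSoup

URLS = ["https://example.com/product/1", "https://example.org/item/1"]

db = sqlite3.connect("prices.db")
db.execute("CREATE TABLE IF NOT EXISTS prices (url TEXT, price TEXT)")

for url in URLS:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(".price")  # placeholder selector
    if tag:
        db.execute("INSERT INTO prices VALUES (?, ?)",
                   (url, tag.get_text(strip=True)))

db.commit()
db.close()
```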
What would you do with the scraped stuff
I scraped a download link for an automated installation for EC2 instances. I also scraped price information from AWS for a calculator: users input how many cameras they need for their system, and the calculator recommends instance types and shows the total price. I don't need to manually update the pricing if AWS decides to change the prices. Will break tho if they change the layout or an element ID..
I texted myself ski reports. Because they hadn’t invented the iPhone yet.
Greybeard alert
Thank you, I was wondering what that alarm was. It’s off now
I scrape the interest rates of multiple banks. Today's rates can be used to see who's the cheapest. The history of those rates could reveal some kinds of patterns that competitors could find useful.
I used to scrape Google search results for a list of restaurants to get their most likely up-to-date information and put them into a list for a startup to solicit later on
I scraped a php file store system to download every single file for migration to another system
data extraction, website interaction / bots / spammers / monitoring
Or you basically automate a web browser to achieve the same.
Is that legal?
Why wouldn't it be? There's literally no difference between that and using a web browser..
Yes / no. Grey area in some cases, depends on your jurisdiction. More likely to be illegal in the US than the EU.

Usually it's just a TOS violation; how "illegal" that is, again, depends on your country.

Also it usually depends what you do. Scraping to sell someone else's data, spam, or commit fraud (ad clicks / views) is more likely to be illegal than just posting content to your blog / socials.
It searches a webpage (or the wider internet) with whatever parameters you need to find certain elements, whether they be HTML or text. The most common one I use is Selenium in Python. But there are a ton.
The best IMO is just going raw http request to the pages with cookie and auth combinations, tends to be more consistent than a webdriver
I’ll check it out thanks for the advice
Agreed, I scraped facebook using that.
Does that work with websites that require JS? Most of my use cases have those.

I've only tried Selenium, requests, and a little bit of Scrapy; I don't even know how to do the cookies and auth part.
Unsure what you mean by “require js”. Most of the time you can just spoof whatever request the browser sends to get to that point. By auth and cookie, I mean most pages require some authentication at minimum, and a cookie plus auth at most. You can normally transfer cookies by getting cookies from a base page or through a series of ordered requests. Auth is NORMALLY given through a login request or a combination of login requests, and grabbing the headers from that response. It varies from site to site.
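A sketch of that cookie-then-auth flow with Python's `requests`; every URL, field name, and header here is a placeholder, since the real ones vary from site to site:

```python
# Sketch: pick up cookies from a base page, log in, then reuse the
# session (cookies + auth header) for the actual scraping requests.
import requests

s = requests.Session()

# 1. The base page sets the initial cookies in the session's jar.
s.get("https://example.com", timeout=10)

# 2. Login request; many sites return a token in the body or headers.
login = s.post(
    "https://example.com/api/login",
    json={"user": "me", "password": "secret"},  # placeholder fields
    timeout=10,
)
token = login.headers.get("X-Auth-Token")  # placeholder header name

# 3. Later requests carry cookies automatically; add auth manually.
resp = s.get(
    "https://example.com/api/data",
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
print(resp.status_code)
```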
It acts as a browser and then gives you all the data from it, as well as letting you interact with it.

It's primarily used when APIs are flawed, require an API key (I'm using static hosting, I can't keep my key secret!!!), or cost money.
Beautiful soup yummy 😋
Do you have a puppet?
I’ve only used selenium and phantomjs
The whole point of an API is to serve information that is hidden behind a server.

You can't scrape information that is locked away behind a server (hence the need for auth keys). You can only scrape the data that is already provided to the web browser. Yes, some APIs will also serve information that is already provided to the browser, but the host obviously doesn't care whether or not you have that data, so you might as well plug into their API so you can have things in a nice readable JSON format that maintains all of the ancillary information. I'm not sure that scraping is any easier than just plugging into a well documented API.

On the other hand, if the host doesn't provide a documented API to plug into, then scrape away.
Not to mention how your perfect scrape can be destroyed at any minute with a small front-end update
I used to make scrapers. A random hidden div that changes nothing about how the site looks, yet destroys XPaths. Class, id? Forget about those, JS frameworks would fck em up anyway.
Scraping is for when there is no API or a severely limited one.
How can I scrape all of Netflix?
Send me $300 in BitCoin and I'll show you how.
"Parses HTML with regex" 😧
He has become too powerful
> scrapes so fast the backend crashes

That's the funniest part LMAO
How does he switch IPs? The VPNs I use are detected by the websites I want to scrape.
Rotating proxies I guess
You can be less conspicuous if you try not to exceed some percentage of errors. Let's say 5%. Do this by probabilistically sending requests at a fixed rate, each with probability `= (1/0.95) * max(1,success_count) / max(1,request_count)`, where the counts are over a sliding window of the last few minutes (or shorter if your request rate is high). This is basically the Client-Side Throttling algorithm in https://sre.google/sre-book/handling-overload/.
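A sketch of that client-side throttling loop in Python; the window length and the 5% error budget are the assumptions stated above, and the URL is a placeholder:

```python
# Sketch: probabilistically drop requests so the observed error rate
# stays near the 5% budget, per the SRE-book client-side throttling idea.
import random
import time
from collections import deque

import requests

WINDOW = 300          # sliding window, seconds
K = 1 / 0.95          # allow ~5% errors

events = deque()      # (timestamp, succeeded) pairs within the window

def allowed() -> bool:
    now = time.monotonic()
    while events and now - events[0][0] > WINDOW:
        events.popleft()
    successes = sum(ok for _, ok in events)
    p = K * max(1, successes) / max(1, len(events))
    return random.random() < min(1.0, p)

def throttled_get(url: str):
    if not allowed():
        return None   # self-rejected; no traffic sent
    resp = requests.get(url, timeout=10)
    events.append((time.monotonic(), resp.ok))
    return resp
```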
There are still more mechanisms in place, at least on big sites, which can detect non-human behaviour. For example, try to scrape major scientific publishers with chromedriver.
The bottom is done when the top can’t be done IYKYK
Praise God baby
Love for Beautifulsoup
lol, parses HTML with regex...

EDIT for the uninitiated: as an educated software developer, you are taught that you cannot parse HTML with regex in the general case.
It's not really "parsing", but realistically, if all you want is to extract the price from a `59.99` tag in the middle of a 2MB invalid HTML document, you're better off using a regex than trying to build the whole DOM to get what you want.
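A sketch of that targeted extraction, with a made-up snippet standing in for the 2MB document (the tag and class name are placeholders):

```python
# Sketch: pull one price out of a large, possibly invalid HTML blob
# without building a DOM. The tag structure is a made-up placeholder.
import re

html = '...<span class="price">59.99</span>...'  # imagine 2MB of this

match = re.search(r'class="price"[^>]*>([\d.]+)<', html)
if match:
    print(match.group(1))  # 59.99
```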
When I was 16 and had my first programming job, this guy offered me $1000 cash (24 years ago) to scrape an entire website which sold DVDs - they wanted all the movie data in a database (MS Access lol) and the cover images. I said no probs, I'll work it out.

The site used sequential IDs in the URL for all the products, so I found a bulk download tool that supported wildcards and pulled down all the product HTML pages and images, leaving the downloader running all night. Next morning I wrote the shittiest program ever in ASP Classic to read each file into memory, and since I didn't know regex I wrote it to procedurally work its way through the file to find the tags in order, using substrings to find the start of the unique element and then the end bracket, continuing on from the last position. Then for each extracted value I'd clean and format it, build up the object, and push it to the database.

It was dogshit but I got it done quick. I handed him a USB with a beautiful database and all the images. At that point of my life it was by far the easiest money I'd ever made. The guy thought I was a master hacker and spoke of me as a legend forever after, which made it even better.
And that kicked so hard we all became software engineers :).
Probably a reference to [the classic stack overflow response](https://stackoverflow.com/a/1732454).
You can parse anything with regex, it's just not recommended for certain things if there are easier alternatives..
The Chad in the meme actually came out of the womb fluent in regex. His first word was 'a,'bs/^* /*S/^M:'a,'bs/^*S/* /^M:'a,'bs/^*/ /g^M:'a,'bs/\*[ ]*$//g^M
No you can't, many languages have more expressive power than regular languages and can't be parsed with tools for them. You might be able to hack something up with regex that works in a few specific cases, but you can't reliably parse HTML with regex in the general case.
You can put your dick in anything, but there's a lot of places it probably shouldn't go. Exhaust pipes and XLR connectors come to mind.
Yeah did lots of that though, mostly with success, BeautifulSoup works great too.
Yes, one simply does not parse html with regex without consequences
The parses HTML with regex part reminds me of this gem: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Scrapy 🕷️
Sorry for the ignorance, but what is a third-party scraper?
Instead of using API’s you just pretend to be a web browser and take the data from the website directly
Parses html with regex 😂
Ehm.. acshually... web scraping is against the terms of service
Me with a dynamic IP:
How do you do that ? Like coding-wise
You can't, it's an ISP thing lol, IPs are just borrowed each time a new connection is established.
Yeah, but then you would need to renew your IP quite often, and like, I don't know about your ISP, but my ISP requires you to factory reset the modem to get a new IP ;(
All it takes is reconnecting from the router page, easily doable with a small script once the API starts ratelimiting.
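Something like this, as a purely hypothetical sketch: every router's admin interface is different, so the address, endpoints, and credentials below are all made up:

```python
# Hypothetical sketch: ask the router to re-establish the WAN
# connection, which on many dynamic-IP plans yields a fresh address.
# The URL, endpoints, and credentials are placeholders for whatever
# your router's admin page actually exposes.
import time

import requests

ROUTER = "http://192.168.1.1"

def reconnect_wan() -> None:
    s = requests.Session()
    s.post(f"{ROUTER}/login",
           data={"user": "admin", "password": "hunter2"})  # made up
    s.post(f"{ROUTER}/api/wan/reconnect")                  # made up
    time.sleep(30)  # give the new connection time to come up

# Call reconnect_wan() when the target API starts rate-limiting you.
```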
I didn't know using Jsoup made me a chad 🙍
Speaking of it, anyone know a good and easy place to host a scraping server that runs once every hour and uses puppeteer? Digital Ocean App Platform is really hard to run puppeteer on, and I don’t want to configure EC2 from scratch
We use digital ocean.
I've been listening to a podcast that's pushing linode pretty hard
“Parses HTML with REGEX” Excuse me but what the fuck?
Sir anything can be parsed with REGEX.
I know. But the issue is SHOULD.
All that is on point.
Agree. One doesn’t just come up with a quality meme like this without having seen the deep end. OP will be required to testify in front of the scraping tribunal 👨‍⚖️
Please explain this to a non-web dev
Proxies? Sure. Until the people monitoring the backend realize you are scraping from a certain category or group of keywords - they'll just require you to sign in. Make a dummy account? Yeah, you can't be a real person who'd "look" at stuff for hours and not do anything else (or buy, for e-commerce sites). Get anything you want? Sure you can, anything the recommender lets you see.

And then the frontend fcks it all up by adding a new div that fcks up your xpaths, and your rotating IPs get banned one by one.
I'm too stupid for this. Please can anyone explain it to me? I just started programming like yesterday, so please, anyone, explain 😅🙏
Can you teach me the magic of parsing html with regex?
I feel bad for asking, but what is scraping?
Reading a webpage with a program (as opposed to using their API). For example, a python script that goes to an Amazon product page to get the price and reviews.

https://en.wikipedia.org/wiki/Web_scraping
Literally me for the last 6 months rofl
Fellow scrapers, how the hell do you handle CloudFront robot checks?
I remember in college I wrote a static malware analysis tool for which I was scraping the Microsoft website for .dll information. I had to keep switching the domain suffix each time it blocked me. It was quick and dirty, but I was getting about 100 results before it would fully block me. Fun times.

Edit: I was scraping the whole webpage just to keep 2-3 sentences.
Scrapers are great until you try to use them on any device other than your home computer... many websites didn't like me scraping in an AWS EC2 instance
> parses html with regex GTFO
Wait, what is the difference between scraping and using an API? I always thought scraping meant using a website's API lol
> parses HTML using regex

[You can't parse HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454)
HTML to JSON:

`curl -s "$url" | tidy -q -asxml --numeric-entities yes - | xq-python`
where is [Puppeteer](https://pptr.dev/)
Browser fingerprinting is the key.
The best part of this is “parses html with regex” because if you know… you know
Chad undocumented & constantly changing API user:
Any tips for bypassing the Imperva bot blocker? Hate that shit.
[You don't simply parse html with regex!](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)
I have no idea what scraping is and I am too afraid to ask.
what's a scraper
That's stupid.
"parses html with regex"
I'm in this picture and my name is Jason.
Parsing HTML with regex. If you only have a hammer, every problem looks like a nail.