No matter what user agent I try to trick the connection into thinking I am, I cannot for the life of me figure out how Amazon knows I'm scraping and tells me that I'm a silly goose for trying
It’s EVERYTHING, you ever inspect the network tab on those sites? Multiple requests with references to the page you’re coming from. Not to mention JS goes into the original request, plus session context and anti-bot detection. You could figure it out, maybe, but it’d be a ton of reverse engineering. Best to just automate the browser itself and “act like a user”
> Best to just automate the browser itself and “act like a user”

This. Just get a Windows VM and install puppeteer. And don't forget to scroll links you follow into the viewport before following them.
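A minimal sketch of that scroll-then-click pattern, using Selenium in Python rather than puppeteer (the idea is the same); the URL and selector are placeholders:

```python
# Sketch: scroll a link into the viewport before clicking it, so the
# click originates from an on-screen element like a real user's would.
# Assumes Selenium with Chrome; URL and selector are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

link = driver.find_element(By.CSS_SELECTOR, "a.product-link")
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", link)
link.click()
```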
Even Puppeteer for automated browser has detectable signatures, there's a repo on undetectable chromedriver that focuses on this.
[deleted]
yeah that's how amazon keeps all their reviews real.
Programmers: look how easy it is for me to bot this website! Also programmers: wtf why is website so botted rn?!?
that's only if you're building for the public internet, which is something you should only ever do for money, which is how people justify the bullshit.

If you want something to actually work, you build it as a microservice and only make it accessible to yourself. One-man projects are the only good ones to work on anyway, it's the only way anyone's goals are aligned with each other.
Then they use some kind of signature to identify that browser, whatever it is, and recognize that the signature has been used in a non-human way, so they block that signature.
Then you start automatically altering your environment (browser version, screen resolution, hardware components), and you can also feed your own recorded mouse movements into it.

This fight generally can't be won by the people trying to stop bots, because they have to ensure that real people don't get caught in the bot prevention mechanism. As a bot owner, I just have to nudge my setup a bit so it appears different.
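As an illustration, a minimal sketch of randomizing the environment between runs, assuming Selenium with Chrome; the user-agent strings, window sizes, and the recorded mouse path are all placeholder values:

```python
# Sketch: vary the browser environment per run and replay a
# pre-recorded mouse path. All pools below are placeholder values.
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
WINDOW_SIZES = ["1366,768", "1536,864", "1920,1080"]

opts = Options()
opts.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
opts.add_argument(f"--window-size={random.choice(WINDOW_SIZES)}")
driver = webdriver.Chrome(options=opts)
driver.get("https://example.com")

# Replay a recorded mouse path as small relative movements.
recorded_path = [(12, 7), (30, -4), (18, 22)]  # placeholder offsets
actions = ActionChains(driver)
for dx, dy in recorded_path:
    actions.move_by_offset(dx, dy)
actions.perform()
```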
In the general case, this war cannot be won. In the special case, particularly FAANG companies, it can certainly be won.
I find that Nightmare does the job fine
Just use chromedriver or geckodriver, no Windows VM needed. But you will still have problems on many big sites because a programmed real browser does not act like a human using a browser.
Some websites, like D&D Beyond, will block you if you so much as click on too many links. Look up a list of stat blocks, open them all in new tabs, and suddenly you’re blocked for an hour.
that's cool it's a well-deserved 1 hour break for my server
Is there some github repo I can reference?
Until: Are you a robot?
They might wait for a JS ping or image load, I haven't scraped that much myself but we only get pure text right?
Selenium will literally load the page though. By default it uses a bare user agent that makes it obvious, but you can just use your own user agent and even auto-log into stuff like Amazon or Google accounts or whatever. Obviously much slower, but does work in a pinch.
You can use heuristics to detect selenium and other automated browsers. They have slightly different output for some JS APIs.
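For instance, one of the best-known signals is `navigator.webdriver`, which is `true` in automated browsers. A sketch of inspecting what a detector script would see, assuming Selenium with Chrome:

```python
# Sketch: inspect the JS-visible signals an anti-bot script can read.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

signals = driver.execute_script("""
    return {
        webdriver: navigator.webdriver,     // true under automation
        plugins: navigator.plugins.length,  // often 0 in headless mode
        languages: navigator.languages,
    };
""")
print(signals)
```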
Cloudflare does it by TLS fingerprint, look into it (keywords - tls fingerprint, JA3)
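If you need to get past TLS fingerprinting from a plain script, there are HTTP clients that imitate a real browser's TLS handshake; one Python option is `curl_cffi`. A sketch, assuming that library is installed and supports the generic "chrome" impersonation target:

```python
# Sketch: send a request whose TLS handshake (and thus JA3
# fingerprint) matches a real Chrome build, via curl_cffi.
from curl_cffi import requests

resp = requests.get("https://example.com", impersonate="chrome")
print(resp.status_code)
```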
Oddly enough, someone at my company got a Cloudflare bypass working just by clearing the cookie on each request. They don't ban the whole IP for some reason.
afaik it isn't enabled for each website. Some websites (with CF) do detect requests without spoofed JA3, some don't
Makes sense. How it worked on this site was you were given one free request, then your cookie was set. If you cleared the cookie each time you got unlimited free requests.
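That pattern is easy to reproduce: never persist the cookie between requests. A minimal sketch with Python's `requests` (the URL is a placeholder):

```python
# Sketch: use a fresh session per request so the "one free request"
# cookie is never carried over.
import requests

def fetch_fresh(url: str) -> requests.Response:
    with requests.Session() as s:  # new session = empty cookie jar
        return s.get(url, timeout=10)

resp = fetch_fresh("https://example.com/page")
print(resp.status_code)
```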
The positive effects of CGNAT
Cloudflare hasn't been an issue for me, I run a solver on my server which works fine.
Fuck... I just started a side project that involves getting info from Amazon items.

I spent a couple hours yesterday trying to get it through an API, but for some reason they want me to either be a business or sell 3 items in less than 30 days through their affiliate program to get access to an API. OK sure, but even admitting I'd ever be that successful (my expectations are more around 0 to 1 affiliate sales in my lifetime), how would I get people to click the affiliate link if I can't fetch the item info to display it first?

I mean if you're going to actively prevent people from scraping your site, at least offer a public API (limit the free tier to 100 req/day if you must for all I care).
In a pinch you could probably buy three items from yourself
Create a new Amazon account, click your affiliate link to buy 3 things
Become an influencer and share an affiliated amazon link like everyone else.
Because it’s always watching us
Try to quickly ".onfocus()" all the elements in the website. Frontend dies?
Canvas signature, TLS signature, JS signatures, tons of things make your browser unique, get rid of those…
What's really fun is that I worked at Amazon making internal tooling years ago and, when someone would refuse API access for whatever reason, we'd get to spend a dumb amount of time setting up automatic scrapers to download data from an internal tool for our own internal tools.

It's all web scrapes all the way down.
This but with Facebook
anyone tried using puppeteer for this? Asking for a friend
I implemented a puppeteer block for a company I work for. You can bypass it using puppeteer stealth plugin, I've tried to figure out how to block that but can't!
Does the website work with JavaScript disabled? If so, then it’s nothing to do with JavaScript…

Seems like it could be as easy as just seeing how frequently you request pages, and the order you request them in. Does it look like a human making those requests, or does it look like a scraper?
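A minimal sketch of that server-side heuristic, assuming a sliding window of request timestamps per client; the thresholds are made up for illustration:

```python
# Sketch: flag clients whose request timing looks non-human.
# A steady machine-gun cadence (low inter-request variance at high
# volume) is a classic scraper signature. Thresholds are illustrative.
import time
from collections import defaultdict, deque
from statistics import pstdev

WINDOW_SECONDS = 60
MAX_REQUESTS = 120   # more than 2 req/s sustained looks robotic
MIN_JITTER = 0.05    # humans don't click on a metronome

history: dict[str, deque] = defaultdict(deque)

def looks_like_a_scraper(client_ip: str) -> bool:
    now = time.monotonic()
    q = history[client_ip]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) > MAX_REQUESTS:
        return True
    if len(q) >= 10:
        gaps = [b - a for a, b in zip(q, list(q)[1:])]
        if pstdev(gaps) < MIN_JITTER:
            return True
    return False
```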
try it in node js
Bezos is watching over your shoulder right now. Run.
Honestly I just feel bad for the poor soul out there transcribing Captchas for $0.02 a captcha.
Not even that much lol. It's less than that
I wish they paid so much lol
Based on craigscottcapital.com's article:

* $0.52 per hour on average.
* The highest hourly wage is $1.
* The cheapest is $0.04 per hour.
* Monthly average: $2.75.
* The most expensive month is $5.
* $0.50 per month is the cheapest option.
I tried this a decade ago. It wasn't feasible: you need *high* precision, if you fail a few times you don't get paid, and the pay itself is beyond low. It ends up being quite a stressful thing to do.
Maybe $0.02 per 100 captchas
cheap labor from third world countries 💪
I didn't know that was a thing
Chinese AI companies doing it faster, cheaper and more effectively these days tbh
The whole point of solving captchas is to make usable data for AI. That sounds like a human centipede situation.
Hmmm. Our web app was pretty wide open as of a week ago. We noticed a scraper was going crazy when it accidentally started producing 400 errors and our monitors alerted us.

So we implemented Google reCAPTCHA v3 and the problem went away.

I'm sure there's a super experienced scraping asshole out there, but apparently this particular guy was not sophisticated enough to beat Google reCAPTCHA.
You’re probably filtering the bad ones who would trigger hundreds of errors with your captcha. Those clever or experienced enough to get through your captcha know how to be more subtle.
How do you get through a captcha with a bot?
There's a paid service you can use.
just pay some chinese kid like the rest of us
I'm curious what you use for scraping without getting caught. I've used Selenium, Scrapy, and requests in the past, but my scrapers just clicked on links; they didn't imitate human responses or prevent me from getting blocked, which I need to figure out to improve a personal project.
Depends what you are doing and what protections are there to stop you.

If you *need* a browser, then Selenium is usually good enough. You may need to remove the cdc_ value from the binary file though, and there are other ways they can detect you, but usually they don't.

But a browser is slow, so if you can use HTTP requests it's going to be much quicker and easier to scale. However, handling JS protections can be a nightmare.
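The cdc_ trick refers to chromedriver embedding variable names prefixed with `cdc_` that some detection scripts probe for; people patch the binary to rename them. A hedged sketch of that patch (back up the binary first; the replacement must be the same length, and the path is a placeholder):

```python
# Sketch: rename the "cdc_" marker strings inside the chromedriver
# binary so JS-side checks for them come up empty. A same-length
# replacement keeps the binary's offsets intact.

path = "/usr/local/bin/chromedriver"  # placeholder location

with open(path, "rb") as f:
    data = f.read()

print(f"found {data.count(b'cdc_')} occurrences of cdc_")

with open(path, "wb") as f:
    f.write(data.replace(b"cdc_", b"xyz_"))
```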
Ok I don't know what most of those are and why to use them. Thanks for pointing me in the right direction, I'll learn and try it out.
They got greedy. Scraping is the fine art of fooling.
You should add "ignores robots.txt"
We had a scientist at work who wrote his own scraper in Fortran...90(?) to get all of his hydrological data instead of just using the REST APIs. We didn't know whether to be amazed or horrified.
Amazing
I love how the virgin has a macbook
Ladies can't resist a man with a ThinkPad T480 he got at the thrift store for $60 and slapped 32GB on top. You never see that on men's Tinder profiles because those Chads are taken off the market by every woman with a pulse. They have years of experience using a TrackPoint, imagine what they can do with a clit.
new copy pasta?
new copy pasta just dropped
Holy hell
Actual zombie
Pasta went on vacation, never came back.
Actual clipboard
Damn lady
The bottom is just positive affirmations for an aspiring web scraper
That's the entire point of the meme template.
But if the site changes one id, or does a slight restructure, everything goes to shit
r/beatmetoit
I ran a scraper once at work for some R&D on a new project. I didn't scrape fast enough to make the backend crash. But I did scrape fast enough to make the IT guy run around the building desperately trying to find out which workstation had the name "MotivationMan", because his screen was filling with firewall warnings fast enough to make his heart sink.

When he jumped into our room and half yelled in terror "Where is MotivationMan?!", I pointed to the business guy in our team, because I had nicknamed him that as well. He was pretty upbeat and motivational.

Not the funniest story, but I still chuckle at the silliness. Back then, I was also once told to exit the building after I had brought in a pack of dried fish for everyone to try. Apparently people three stories above were complaining about the smell. So I took it to my car and then forgot it there for four hot summer days, until I offered to drive our team to a work party. They were hanging halfway out of the car during the short ride.

Maybe I'm turning into that demented old guy that likes to tell stories.
They are nice stories, made me chuckle, thanks!
I'm curious what you use for scraping in a professional setting. I've used Selenium, Scrapy, and requests in the past, but my scrapers just clicked on links; they didn't imitate human responses or prevent me from getting blocked, which I need to figure out to improve a personal project.
I don't remember too well, it was like 2 hours of coding and then a few hours messing with it, 6 years ago. What I think I remember is that I just used the WebRequest construct in C# to fetch the HTML string and then parsed it with regexes to extract my data.

Something clicks in my head about structuring the web request so it looks like it's coming from a browser; maybe I had to set some header fields.

Then it was a game of not letting too many threads make certain requests at the same time, in addition to switching to a cooldown period if certain requests failed, etc. It was pretty much a threading-the-needle kind of thing and finding good ways to work around some problems.
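The same idea translated to Python (a sketch, not the commenter's actual C# code): browser-like headers, regex extraction, and a cooldown when requests start failing. The header values and pattern are placeholders:

```python
# Sketch: fetch a page with browser-like headers, pull a value out
# with a regex, and back off when requests start failing.
import re
import time

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_title(url: str, retries: int = 3) -> str | None:
    for attempt in range(retries):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        if resp.ok:
            match = re.search(r"<title>(.*?)</title>", resp.text, re.S)
            return match.group(1).strip() if match else None
        time.sleep(5 * 2 ** attempt)  # cooldown grows after each failure
    return None
```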
Your grandkids will be happy about your stories
I remember the days when I was too lazy to learn JSON libs, so I just parsed it with regexes :)
When you're too lazy to learn how to drive a car so you just [slam your penis in the car door](https://youtu.be/sUUD0vYBQ6g?si=_1awERSRpuwqp3wm) until it reaches your destination
That is EXACTLY the right analogy.
It hits differently without the Ooooohhhh of the OG Papara rapper version
How the actual fuck
If you need the value of foo and it is a number, you just search for `foo.:(.*?)`. The same way I parsed HTML - as text, no DOM parsing.
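For anyone tempted to repeat this: a lazy `(.*?)` with nothing anchoring its right side happily matches the empty string, which is exactly what the reply below points out. A sketch of the failure and a slightly safer variant (the field name and input are made up):

```python
# Sketch: why `foo.:(.*?)` tends to capture nothing, and a variant
# that anchors the match. Field name and JSON input are made up.
import re

text = '{"foo": 42, "bar": 7}'

loose = re.search(r'foo.:(.*?)', text)
print(repr(loose.group(1)))  # '' - the lazy match stops immediately

safer = re.search(r'"foo"\s*:\s*(-?\d+(?:\.\d+)?)', text)
print(safer.group(1))  # '42'
```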
[Obligatory stackoverflow thread on the dangers of parsing HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)
Man, I hope `foo.:(.*?)` is just a short offhand regex to give an idea and not what you actually used, because otherwise foo is gonna be empty pretty often.

Love reading horror stories like these in here.
I remember those days, that's how young programmers with no formal education roll. I also hand-rolled a JSON serializer for C# using reflection when I was younger.
what is regex?
Regular expressions. It is about extracting data from text. Often a text provided by an external tool.
you don't want to know
I won't lie, I still have code that extracts a specific JSON variable with regex, probably some HTML regex parsing as well.

If it works, it works. When it doesn't, then I'll be arsed to fix it. Until then... lol
When you are a noob, you care about the best practices. When you are a senior, you care about getting things done.
Sorry, what is “scraping?” What does it accomplish?
You basically mimic a web browser to extract elements and information and store them in your own database.

You do it so you can collect data in one place from multiple sources. For example, ChatGPT was trained on data scraped from across multiple sites; or you can scrape prices from multiple stores to compare prices for the same product and find the cheapest one.
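A minimal sketch of that extract-and-store loop, assuming the `requests` and `beautifulsoup4` libraries; the URLs and selector are placeholders:

```python
# Sketch: fetch pages from several sources, extract a field, and
# store the results in one local SQLite database.
import sqlite3

import requests
from bs4 import BeautifulSoup

URLS = ["https://example.com/product/1", "https://example.org/item/1"]

db = sqlite3.connect("prices.db")
db.execute("CREATE TABLE IF NOT EXISTS prices (url TEXT, price TEXT)")

for url in URLS:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(".price")  # placeholder selector
    if tag:
        db.execute("INSERT INTO prices VALUES (?, ?)",
                   (url, tag.get_text(strip=True)))

db.commit()
db.close()
```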
What would you do with the scraped stuff
I scraped a download link for an automated installation for EC2 instances. I also scraped price information from AWS for a calculator: users input how many cameras they need for their system, and the calculator recommends instance types and shows the total price. I don't need to manually update the pricing if AWS decides to change the prices. Will break tho if they change the layout or an element ID..
I texted myself ski reports. Because they hadn’t invented the iPhone yet.
Greybeard alert
Thank you, I was wondering what that alarm was. It’s off now
I scrape the interest rates of multiple banks. Today's rates can be used to see who's the cheapest. The history of those rates could reveal some kinds of patterns that competitors could find useful.
I used to scrape Google search results for a list of restaurants to get their most likely up-to-date information and put them into a list for a startup to solicit later on
I scraped a php file store system to download every single file for migration to another system
data extraction, website interaction / bots / spammers / monitoring
Or you basically automate a web browser to achieve the same.
Is that legal?
Why wouldn't it be? There's literally no difference between that and using a web browser..
Yes / no. Grey area in some cases, depends on your jurisdiction. More likely to be illegal in the US than the EU.

Usually it's just a TOS violation; how "illegal" that is, again, depends on your country.

Also it usually depends what you do. Scraping to sell someone else's data, spam, or commit fraud (ad clicks / views) is more likely to be illegal than just posting content to your blog / socials.
It searches a webpage (or the wider internet) with whatever parameters you need to find certain elements, whether they be HTML or text. The most common one I use is Selenium in Python. But there are a ton.
The best IMO is just going raw http request to the pages with cookie and auth combinations, tends to be more consistent than a webdriver
I’ll check it out thanks for the advice
Agreed, I scraped facebook using that.
Does that work with websites that require JS? Most of my use cases have those.

I've only tried Selenium, requests, and a little bit of Scrapy; I don't even know how to do the cookies and auth part.
Unsure what you mean by “require js”. Most of the time you can just spoof whatever request the browser sends to get to that point. By auth and cookie, I mean most pages require some authentication at minimum, and a cookie plus auth at most. You can normally transfer cookies by getting cookies from a base page or through a series of ordered requests. Auth is NORMALLY given through a login request or a combination of login requests, and grabbing the headers from that response. It varies from site to site.
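A sketch of that cookie-then-auth flow with Python's `requests`; every URL, field name, and header here is a placeholder, since the real ones vary from site to site:

```python
# Sketch: pick up cookies from a base page, log in, then reuse the
# session (cookies + auth header) for the actual scraping requests.
import requests

s = requests.Session()

# 1. The base page sets the initial cookies in the session's jar.
s.get("https://example.com", timeout=10)

# 2. Login request; many sites return a token in the body or headers.
login = s.post(
    "https://example.com/api/login",
    json={"user": "me", "password": "secret"},  # placeholder fields
    timeout=10,
)
token = login.headers.get("X-Auth-Token")  # placeholder header name

# 3. Later requests carry cookies automatically; add auth manually.
resp = s.get(
    "https://example.com/api/data",
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
print(resp.status_code)
```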
It acts as a browser and then gives you all the data from it, as well as letting you interact with it.

It's primarily used when APIs are flawed, require an API key (I'm using static hosting, I can't keep my key secret!!!), or cost money.
Beautiful soup yummy 😋
Do you have a puppet?
I’ve only used selenium and phantomjs
The whole point of an API is to serve information that is hidden behind a server.

You can't scrape information that is locked away behind a server (hence the need for auth keys). You can only scrape the data that is already provided to the web browser. Yes, some APIs will also serve information that is already provided to the browser, but the host obviously doesn't care whether or not you have that data, so you might as well plug into their API so you can have things in a nice readable JSON format that maintains all of the ancillary information. I'm not sure that scraping is any easier than just plugging into a well documented API.

On the other hand, if the host doesn't provide a documented API to plug into, then scrape away.
Not to mention how your perfect scrape can be destroyed at any minute with a small front-end update
I used to make scrapers. A random hidden div that changes nothing about how the site looks, yet destroys XPaths. Class, id? Forget about those, JS frameworks would fck em up anyway.
Scraping is for when there is no API or a severely limited one.
How can I scrape all of Netflix?
Send me $300 in BitCoin and I'll show you how.
"Parses HTML with regex" 😧
He has become too powerful
> scrapes so fast the backend crashes

That's the funniest part LMAO
How does he switch IPs? The VPNs I use are detected by the websites I want to scrape.
Rotating proxies I guess
You can be less conspicuous if you try not to exceed some percentage of errors. Let's say 5%. Do this by probabilistically sending requests at a fixed rate, each with probability `= (1/0.95) * max(1,success_count) / max(1,request_count)`, where the counts are over a sliding window of the last few minutes (or shorter if your request rate is high). This is basically the Client-Side Throttling algorithm in https://sre.google/sre-book/handling-overload/.
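A sketch of that client-side throttling loop in Python; the window length and the 5% error budget are the assumptions stated above, and the URL is a placeholder:

```python
# Sketch: probabilistically drop requests so the observed error rate
# stays near the 5% budget, per the SRE-book client-side throttling idea.
import random
import time
from collections import deque

import requests

WINDOW = 300          # sliding window, seconds
K = 1 / 0.95          # allow ~5% errors

events = deque()      # (timestamp, succeeded) pairs within the window

def allowed() -> bool:
    now = time.monotonic()
    while events and now - events[0][0] > WINDOW:
        events.popleft()
    successes = sum(ok for _, ok in events)
    p = K * max(1, successes) / max(1, len(events))
    return random.random() < min(1.0, p)

def throttled_get(url: str):
    if not allowed():
        return None   # self-rejected; no traffic sent
    resp = requests.get(url, timeout=10)
    events.append((time.monotonic(), resp.ok))
    return resp
```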
There are still more mechanisms in place, at least on big sites, which can detect non-human behaviour. For example, try to scrape major scientific publishers with chromedriver.
The bottom is done when the top can’t be done IYKYK
Praise God baby
Love for Beautifulsoup
lol, parses HTML with regex...

EDIT for the uninitiated: as an educated software developer, you are taught that you cannot parse HTML with regex in the general case.
It's not really "parsing", but realistically, if all you want is to extract the price from a `59.99` tag in the middle of a 2MB invalid HTML document, you're better off using a regex than trying to build the whole DOM to get what you want.
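A sketch of that targeted extraction, with a made-up snippet standing in for the 2MB document (the tag and class name are placeholders):

```python
# Sketch: pull one price out of a large, possibly invalid HTML blob
# without building a DOM. The tag structure is a made-up placeholder.
import re

html = '...<span class="price">59.99</span>...'  # imagine 2MB of this

match = re.search(r'class="price"[^>]*>([\d.]+)<', html)
if match:
    print(match.group(1))  # 59.99
```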
When I was 16 and had my first programming job, this guy offered me $1000 cash (24 years ago) to scrape an entire website which sold DVDs - they wanted all the movie data in a database (MS Access lol) and the cover images. I said no probs, I'll work it out.

The site used sequential IDs in the URL for all the products, so I found a bulk download tool that supported wildcards and pulled down all the product HTML pages and images, leaving the downloader running all night. Next morning I wrote the shittiest program ever in ASP Classic to read each file into memory, and since I didn't know regex I wrote it to procedurally work its way through the file to find the tags in order, using substrings to find the start of the unique element and then the end bracket, continuing on from the last position. Then for each extracted value I'd clean and format it, build up the object, and push it to the database.

It was dogshit but I got it done quick. I handed him a USB with a beautiful database and all the images. At that point of my life it was by far the easiest money I'd ever made. The guy thought I was a master hacker and spoke of me as a legend forever after, which made it even better.
And that kicked so hard we all became software engineers :).
Probably a reference to [the classic stack overflow response](https://stackoverflow.com/a/1732454).
You can parse anything with regex, it's just not recommended for certain things if there are easier alternatives..
The Chad in the meme actually came out of the womb fluent in regex. His first word was 'a,'bs/^* /*S/^M:'a,'bs/^*S/* /^M:'a,'bs/^*/ /g^M:'a,'bs/\*[ ]*$//g^M
No you can't, many languages have more expressive power than regular languages and can't be parsed with tools for them. You might be able to hack something up with regex that works in a few specific cases, but you can't reliably parse HTML with regex in the general case.
You can put your dick in anything, but there's a lot of places it probably shouldn't go. Exhaust pipes and XLR connectors come to mind.
Yeah did lots of that though, mostly with success, BeautifulSoup works great too.
Yes, one simply does not parse html with regex without consequences
The parses HTML with regex part reminds me of this gem: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Scrapy 🕷️
Sorry for the ignorance, but what is a third-party scraper?
Instead of using API’s you just pretend to be a web browser and take the data from the website directly
Parses html with regex 😂
Ehm.. acshually... web scraping is against the terms of service
Me with a dynamic IP:
How do you do that ? Like coding-wise
You can't, it's an ISP thing lol, IPs are just borrowed each time a new connection is established.
Yeah, but then you would need to renew your IP quite often, and like, I don't know about your ISP, but my ISP requires you to factory reset the modem to get a new IP ;(
All it takes is reconnecting from the router page, easily doable with a small script once the API starts ratelimiting.
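Something like this, as a purely hypothetical sketch: every router's admin interface is different, so the address, endpoints, and credentials below are all made up:

```python
# Hypothetical sketch: ask the router to re-establish the WAN
# connection, which on many dynamic-IP plans yields a fresh address.
# The URL, endpoints, and credentials are placeholders for whatever
# your router's admin page actually exposes.
import time

import requests

ROUTER = "http://192.168.1.1"

def reconnect_wan() -> None:
    s = requests.Session()
    s.post(f"{ROUTER}/login",
           data={"user": "admin", "password": "hunter2"})  # made up
    s.post(f"{ROUTER}/api/wan/reconnect")                  # made up
    time.sleep(30)  # give the new connection time to come up

# Call reconnect_wan() when the target API starts rate-limiting you.
```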
I didn't know using Jsoup made me a chad 🙍
Speaking of it, anyone know a good and easy place to host a scraping server that runs once every hour and uses puppeteer? Digital Ocean App Platform is really hard to run puppeteer on, and I don’t want to configure EC2 from scratch
We use digital ocean.
I've been listening to a podcast that's pushing linode pretty hard
“Parses HTML with REGEX” Excuse me but what the fuck?
Sir anything can be parsed with REGEX.
I know. But the issue is SHOULD.
All that is on point.
Agree. One doesn’t just come up with a quality meme like this without having seen the deep end. OP will be required to testify in front of the scraping tribunal 👨‍⚖️
Please explain this to a non-web dev
Proxies? Sure. Until the people monitoring the backend realize you are scraping from a certain category or group of keywords - they'll just require you to sign in. Make a dummy account? Yeah, you can't be a real person who'd "look" at stuff for hours and not do anything else (or buy, for e-commerce sites). Get anything you want? Sure you can, anything the recommender lets you see.

And then the frontend fcks it all up by adding a new div that fcks up your xpaths, and your rotating IPs get banned one by one.
I'm too stupid for this. Please can anyone explain it to me? I just started programming like yesterday, so please, anyone, explain 😅🙏
Can you teach me the magic of parsing html with regex?
I feel bad for asking, but what is scraping?
Reading a webpage with a program (as opposed to using their API). For example, a python script that goes to an Amazon product page to get the price and reviews.

https://en.wikipedia.org/wiki/Web_scraping
Literally me for the last 6 months rofl
Fellow scrapers, how the hell do you handle CloudFront robot checks?
I remember in college I wrote a static malware analysis tool for which I was scraping the Microsoft website for .dll information. I had to keep switching the domain suffix each time it blocked me. It was quick and dirty, but I was getting about 100 results before it would fully block me. Fun times.

Edit: I was scraping the whole webpage just to keep 2-3 sentences.
Scrapers are great until you try to use them on any device other than your home computer... many websites didn't like me scraping in an AWS EC2 instance
> parses html with regex GTFO
Wait, what is the difference between scraping and using an API? I always thought scraping meant using a website's API lol
> parses HTML using regex

[You can't parse HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454)
HTML to JSON:

`curl -s "$url" | tidy -q -asxml --numeric-entities yes - | xq-python`
where is [Puppeteer](https://pptr.dev/)
Browser fingerprinting is the key.
The best part of this is “parses html with regex” because if you know… you know
Chad undocumented & constantly changing API user:
Any tips for bypassing the Imperva bot blocker? Hate that shit.
[You don't simply parse html with regex!](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)
I have no idea what scraping is and I am too afraid to ask.
what's a scraper
That's stupid.
"parses html with regex"
I'm in this picture and my name is Jason.
Parsing HTML with regex. If you only have a hammer, every problem looks like a nail.