> The call to page.evaluate just hangs, and the browser dies silently. browser.close() is never reached, which can cause memory leaks over time.
Not just memory leaks. Since a couple of months ago, if you use Chrome via Playwright etc. on macOS, it will deposit a copy of Chrome (more than 1GB) into /private/var/folders/kd/<...>/X/com.google.Chrome.code_sign_clone/, and if you exit without a clean browser.close(), the copy of Chrome will remain there. I noticed after it ate up ~50GB in two days. I have no idea what the point of this code sign clone thing is, but I had to add --disable-features=MacAppCodeSignClone to all my invocations to prevent it, which is super annoying.
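(For reference, a minimal sketch of the workaround described above, assuming Playwright for Node. The scrape() helper and the timeout race are illustrative additions; only the launch flag and the need for a guaranteed browser.close() come from the comments.)

    // Sketch: launch Chromium with the flag mentioned above and guarantee a
    // clean browser.close() even if page.evaluate() hangs or throws.
    import { chromium } from 'playwright';

    async function scrape(url: string): Promise<string> {
      const browser = await chromium.launch({
        // Prevents the macOS code-sign clone from being created at all.
        args: ['--disable-features=MacAppCodeSignClone'],
      });
      try {
        const page = await browser.newPage();
        await page.goto(url, { timeout: 30_000 });
        // Race the evaluate against a timeout so a hung page cannot block cleanup.
        return await Promise.race([
          page.evaluate(() => document.title),
          new Promise<string>((_, reject) =>
            setTimeout(() => reject(new Error('evaluate timed out')), 15_000)),
        ]);
      } finally {
        await browser.close(); // always runs, so no stale code-sign clone is left behind
      }
    }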
closewith 1 day ago [-]
That's an open bug at the minute, but the one saving grace is that they're APFS clones, so they don't actually consume disk space.
oefrha 1 day ago [-]
Interesting, IIRC I did free up quite a bit of disk space when I removed all the clones, but I also deleted a lot of other stuff that time so I could be mistaken. du(1) being unaware of APFS clones makes it hard to tell.
chrismorgan 2 days ago [-]
Checking https://issues.chromium.org/issues/340836884, I’m mildly surprised to find the report just under a year old, with no attention at all (bar a me-too comment after four months), despite having been filed with priority P1, which I understand is supposed to mean “aim to fix it within 30 days”. If it continues to get no attention, I’m curious if it’ll get bumped automatically in five days’ time when it hits one year, given that they do something like that with P2 and P3 bugs, shifting status to Available or something, can’t quite remember.
I say only “mildly”, because my experience on Chromium bugs (ones I’ve filed myself, or ones I’ve encountered that others have filed) has never been very good. I’ve found Firefox much better about fixing bugs.
I find the "don't let googlebot see this" approach kinda funny considering how top Google results are often much worse. The captcha/anti-bot situation is getting so bad that I had to move to Kagi to block some domains specifically, as browsing the contemporary web is almost impossible at times. Why isn't Google down-ranking this experience?
The reception was not really positive for the obvious reason at that time.
wslh 1 day ago [-]
In Google Chrome, at least, I tried an infinite loop modifying document.title and it freezes pages in other tabs as well. Right now I am not at my computer to try it again.
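(For illustration, the kind of loop being described; a hypothetical sketch to run in a page's console. It busy-loops that tab's main thread; whether other tabs are affected will depend on how Chrome shares processes between them.)

    // A tight loop like this never yields back to the event loop, so the
    // renderer's main thread is pinned and the page stops responding.
    while (true) {
      document.title = Date.now().toString();
    }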
1 day ago [-]
neuroelectron 1 days ago [-]
I, for one, find it hilarious that "headless browsers" are even required. JavaScript interpreters serving webpages is just another amusing bit of serendipity. "Version-less HTML" hahaha
kevin_thibedeau 1 day ago [-]
It exists because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property.
Thorrez 1 day ago [-]
Headless browsers exist because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property?
If we ask the creators of headless chrome or selenium why they created them, would they say "because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property"?
Bjartr 22 hours ago [-]
Whether or not it's true aside, why people decide to do something and the reason they give for doing it don't have to match.
immibis 16 hours ago [-]
Another use is testing websites.
jillyboel 1 day ago [-]
[flagged]
seventh12 1 day ago [-]
The intention is to crash bots' browsers, not users' browsers
ramesh31 1 day ago [-]
Please point me to this 100% correct bot detection system with zero false positives.
FridgeSeal 22 hours ago [-]
You understand the difference between intent and reality right?
The article even warns about this side-effect.
jillyboel 1 day ago [-]
[flagged]
h4ck_th3_pl4n3t 1 day ago [-]
If you are scraping data that my robots.txt forbids, I don't give a damn. I am gonna mess with your bots however I like, and I'm willing to go as far as it takes to teach you a lesson about respecting my robots.txt.
immibis 16 hours ago [-]
Then I will teach you a lesson about trying to make public data private. Residential proxies and headful browsers go brrrr.
bryanrasmussen 1 day ago [-]
[flagged]
h4ck_th3_pl4n3t 1 day ago [-]
Malware installation is something completely different from segfaulting an .exe file that is running the scraper process.
If illegal scraping behavior is expected behavior of the machine, then what the machine is doing is already covered by the Computer Fraud Act.
bryanrasmussen 16 hours ago [-]
several points here -
not sure if the same jurisdictions that are under the Computer Fraud Act have determined there is such a thing as "illegal scraping".
Does the Computer Fraud Act cover segfaulting an .exe file? I don't know, I don't live in the country that has it.
If the Computer Fraud Act says it is OK to segfault an .exe, which I highly doubt, is the organization doing this segfaulting as part of their protection against this supposed "illegal scraping" actually checking that the machines they are segfaulting are all in jurisdictions that are under the Computer Fraud Act?
What happens if they segfault outside those jurisdictions and there are other laws that apply there? I'm guessing they might be screwed then. Should have thought about that, being so clever.
Hey, I get it, I am totally the kind of guy who might decide to segfault someone costing me a lot of money by crawling my site and ignoring my robots.txt. I'm vengeful like that. But I would accept that what I am doing is probably illegal somewhere, too bad; I definitely wouldn't be going around arguing it was totally legal, and I would also be open to the possibility that this fight I'm jumping into might have some collateral damage. Sucks to be them.
Everybody else here seems to be all righteous about how they can destroy people's shit in retaliation, even though the people whose computers they are destroying might not even know you have a beef with them.
On edit: obviously, once it got to the courts or the media, I would argue it was totally legal, ethical, and the right thing to do to prevent these people from being able to attack other sites with their "illegal scraping" behavior, because I don't win the fight if I get punished for winning. I'm just talking about keeping a clear view of what one is actually doing in the process of winning the fight.
anthk 1 day ago [-]
Not my problem. The problem will be for the malware creator. Twice.
anthk 1 day ago [-]
If you are crashing some browser that's hitting a directory disallowed in your robots.txt, it's not your fault.
chrismorgan 1 day ago [-]
[flagged]
BeFlatXIII 1 day ago [-]
> If you’re not familiar with this, read up on it, the reasons can be quite thought-provoking
Are the reasons relevant to headless web browsers?
Which people may be hurt by crashing the machine where the bot is running?
johnisgood 5 hours ago [-]
When said people decide to rob your home, they lose the right to not be hurt, IMO. Of course proportionality and all that.
lightedman 1 day ago [-]
If that's the case, what do we do about websites and apps that do things like disabling your back button (your mobile phone's own back button) or your right-click capabilities (in a desktop browser), when such disabling of functionality is not present in the ToS or even presented to you upon visiting the site or using the app?
dmitrygr 1 day ago [-]
Then maybe we need laws about crashing my server by crawling it 163,000 times per minute nonstop, ignoring robots.txt? Until then, no pity for the bots.
jillyboel 7 hours ago [-]
If your software crashes due to normal usage, then you only have yourself to blame.
dmitrygr 2 hours ago [-]
Yes indeed. Nginx running out of RAM due to A”I” companies hammering my server is my fault.
jillyboel 1 hour ago [-]
Yes. Fix your configuration so it won't try to allocate more RAM than you have. You can still be upset about them hammering your site, but if your server software crashes because of it, that's a misconfiguration that you should fix regardless.
sMarsIntruder 1 day ago [-]
Running a bot farm?
jillyboel 7 hours ago [-]
Of course not; why are you immediately jumping to accusations? If I were, I'd just patch the bug locally and thank OP for pointing out how they're doing it.
It's just blatantly illegal, and I wouldn't want anyone to get into legal trouble.
omneity 1 day ago [-]
[flagged]
randunel 1 day ago [-]
How do you deal with the usual Cloudflare, Akamai, and other services fingerprinting and blocking you? Or is that the customer's job to figure out?
omneity 1 day ago [-]
Thank you for the question! It depends on the scale you're operating at.
1. For individual use (or company use where each user is on their own device), the traffic is typically drowned out in regular user activity since we use the same browser, and no particular measure is needed; it just works. We have options for power users.
2. For large-scale use, we offer tailored solutions depending on the anti-bot measures encountered. Part of it is to emulate #1.
3. We don't deal with "blackhat bots", so we don't offer support for working around legitimate anti-bot measures, e.g. those targeting social spambots.
lyu07282 1 day ago [-]
If you don't put significant effort into it, any headless browser coming from cloud IP ranges will be banned by large parts of the internet. This isn't just about spam bots; you can't even read news articles in many cases. You will have some competition from residential proxies and other custom automation solutions that take care of all of that for their customers.
omneity 1 day ago [-]
Thanks, that's so true! We learned this the hard way building Monitoro[0] and large data scraping pipelines in the past, so we had the opportunity to build up the required muscle.
One thing to note: there are different "tiers" of websites, each requiring different counter-measures. Not everyone is pursuing the high-competition websites, and most importantly, as we learned, in several cases scraping is fully consensual or within the rights of the user. For example:
* Many of our users scrape their own websites to send notifications to their Discord community. It's a super easy way to create alerts without code.
* Sometimes users are locked into their own providers; for example, some companies have years of job posting information in their ATS that they cannot get out. We do help with that.
* Public data websites that are underutilized precisely because the data is difficult to access. We help make that data operational and actionable. We had, for example, a sailor set up alerts on buoys to stay safe in high waters. A random example[1]
0: https://monitoro.co
1: https://wavenet.cefas.co.uk/details/312/EXT
We have a similar solution at metalsecurity.io :) handling large-scale automation for enterprise use cases, bypassing antibots
omneity 1 day ago [-]
That's super cool, thank you for sharing! It's based on Playwright though, right? Can you verify whether the approach you are using is also subject to the bug in TFA?
My original point was not necessarily about bypassing anti-bot protections, but rather about offering a different branch of browser automation, independent of incumbent solutions such as Puppeteer, Selenium and others, which we believe are not made for this purpose and have many limitations, as TFA mentions, requiring way too many workarounds, as your solution illustrates.
erekp 1 day ago [-]
We fix the leaks and bugs of automation frameworks, so we don't have that problem. The downside of the approach of using the user's own browser, like yours, is that you will burn the user's fingerprint, depending on scale.
omneity 1 day ago [-]
Thanks for sharing your experience! I'm quite opinionated on this topic, so buckle up :D
We avoided the fork & patch route because it's both labor-intensive for a limited return on investment and a game of catching up. Updating the forked framework is challenging in its own right, let alone porting existing customer payloads to newer versions, locking you de facto into older versions. I did maintain a custom fork at a previous workplace that was similar in scope to Browserless[0], and I can tell you it was a real pain.
Developing your own framework (besides satisfying the obvious NIH itch) allows you to precisely control your exposure (reduce the attack surface) from a security perspective, and protects your customers from upstream decisions such as deprecations or major changes that might not be aligned with their requirements. I also have enough experience in this space to know exactly what we need to implement and the capabilities we want to enable. No bloat (yet).
> you will burn the user's fingerprint depending on scale
It's relative to your activity. See my other comment about scale and use cases: for personal-device usage this is not an issue in practice, and users can automate several websites[1] using their personal agents without worrying about it. For more involved scenarios we have appropriate strategies that avoid this issue.
> we fix leaks and bugs of automation frameworks
Sounds interesting! I'd love to read a write-up, or PRs if you have contributed something upstream.
0: https://www.browserless.io/
1: https://herd.garden/trails
Sounds good. As you can probably imagine, I also come from a lot of experience in the space :) But fair enough, everyone has their own opinion on what is more or less painful to implement and maintain, and the associated pros and cons. We're tailored to very specific use cases that require scale and speed, so the route we took makes the most sense. I obviously can't share details of our implementation, as it'd expose our evasions. And this is the exact problem with open-source alternatives like Camoufox and the now-defunct puppeteer-stealth.
volemo 1 day ago [-]
Guess we gotta find a way to crash these bots too. :D