I think your team made a change in the last 1-2 months where you now return 429 errors very aggressively when anyone tries to “scrape” a merchant webpage (i.e., with curl).
I suppose you did this to prevent copycat / theft scraping.
However, this is backfiring for apps that have legitimate reasons to scrape (e.g., SEO apps).
Is there a way you can WHITELIST our server (it’s a static IP)? Or make it less aggressive in general? It’s being triggered by just a couple of requests and causing all kinds of random issues (SEO King scrapes pages for MANY reasons).
I’ve reduced the amount of scraping being done by my server. It was caused by a settings issue, and things do seem to work better now, with fewer 429s (maybe the issue is gone, not sure).
It’s something to think about, though. The scraping wasn’t being done maliciously - it’s actually a way to get data from pages built with page builder apps (we can’t rely on the item data fields). Too many users who aren’t using page builders had it activated. However, if our app had 10x more users, and they often used page builders, it would be a problem.
Hey @jason_engage
Typically limits will be in place to ensure platform stability and security. It’s an interesting use case though, as you’re obviously not doing this maliciously and it’s benefiting merchants.
Can you share more on what you mean when you say you can’t rely on item data fields?
Have you looked into some of the Admin API online store or theme endpoints to see if you can get what you need there without needing to scrape the storefront?
I’ve seen some merchants where the page contents (HTML) are not stored in the item’s body_html field. I think it has to do with PageFly perhaps, or some other page management systems, which may be saving the contents somewhere else (maybe privately). I can’t say I’ve spent a lot of time tracking it all down, but I’ve seen it several times, and added some options to simply scrape the pages.
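Roughly the kind of fallback I mean - a simplified sketch, where the store domain, token, API version, and bot name are all placeholders, not our actual values:

```python
# Simplified sketch: try the Admin API's body_html first, and fall back to
# scraping the rendered storefront page when it's empty. Placeholders throughout.
import requests

SHOP = "example-store.myshopify.com"  # placeholder store domain
TOKEN = "shpat_example"               # placeholder Admin API access token

def get_page_content(product_id: int) -> str:
    # 1) Try the Admin REST API's body_html field first.
    resp = requests.get(
        f"https://{SHOP}/admin/api/2024-01/products/{product_id}.json",
        headers={"X-Shopify-Access-Token": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    product = resp.json()["product"]
    body_html = (product.get("body_html") or "").strip()
    if body_html:
        return body_html

    # 2) Page builders may render content outside body_html, so fetch the
    #    live product page instead and parse that.
    page = requests.get(
        f"https://{SHOP}/products/{product['handle']}",
        headers={"User-Agent": "SEOKingBot/1.0 (+https://example.com/bot)"},
        timeout=10,
    )
    page.raise_for_status()
    return page.text  # hand this off to an HTML parser
```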
If you’re familiar with any of these types of scenarios, and some common ways to work around them, I’m all ears!
Thanks for sharing that. To make sure your assumptions match the actual issue, do you have an x-request-id from one of the 429 responses you’ve received recently, preferably from one of your development stores? I can check our logs to confirm this is the case.
From there, my SEO knowledge is probably average, so I’m not fully aware of how other SEO tools that do similar scraping manage this. One thought is to see whether you could use the API to get the bulk of the information you need and then scrape to fill in the gaps. Alternatively, use webhooks (like product update, collection update, etc.) to help narrow down just the resources that need to be re-scraped.
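As a rough illustration of the webhook idea, here’s a minimal sketch of a products/update receiver that queues only the changed product for a re-scrape. The framework, route, and secret are placeholders; the HMAC check follows our documented webhook verification scheme.

```python
# Sketch: re-scrape only products that actually changed, driven by webhooks.
import base64
import hashlib
import hmac
import json

from flask import Flask, abort, request

app = Flask(__name__)
APP_SECRET = b"placeholder-app-secret"  # your app's client secret
rescrape_queue = set()                   # stand-in for a real job queue

def verify_hmac(raw_body: bytes, header_hmac: str) -> bool:
    # Shopify signs the raw body with your app secret (HMAC-SHA256, base64).
    digest = hmac.new(APP_SECRET, raw_body, hashlib.sha256).digest()
    return hmac.compare_digest(base64.b64encode(digest).decode(), header_hmac)

@app.route("/webhooks/products-update", methods=["POST"])
def products_update():
    raw = request.get_data()
    if not verify_hmac(raw, request.headers.get("X-Shopify-Hmac-Sha256", "")):
        abort(401)
    product = json.loads(raw)
    rescrape_queue.add(product["handle"])  # only this page needs a re-scrape
    return "", 200
```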
It may be worth testing to see if updating the shop’s robots.txt to allow your crawler helps as well: Customize robots.txt
I’ve been watching the 429 issues for a few days now. It really looks like you guys ramped up the “aggressivity meter”. Can you check back with the team responsible and confirm or deny this?
Having similar issues using Screaming Frog. Seems like 429 errors are getting way more aggressive.
Hey Kevin, do you have a request id from one of these responses to help us pinpoint one of these occurrences?
Sadly I can’t provide you with request ids at the moment, but requests originated from 2001:4860:7:610::fb / 62.143.48.37 for store https://schleiftitan.myshopify.com/
Will try to gather more from team.
Kevin 
When we use curl, there are no request IDs - it’s not part of the API. It’s direct server access to the webpage. That’s why Screaming Frog would have similar issues.
Would there be anything sent back with the 429 error you’re seeing that could identify the request?
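For example, dumping whatever headers come back with a 429 may surface something we can search for. A quick sketch - the URL is a placeholder, and which headers appear isn’t guaranteed; `curl -i` against the same URL would show the same information.

```python
# Sketch: capture the headers returned with a 429 so there's something
# concrete to check against logs. URL and User-Agent are placeholders.
import requests

resp = requests.get(
    "https://example-store.myshopify.com/products/some-product",
    headers={"User-Agent": "YourBot/1.0 (+https://example.com/bot)"},
    timeout=10,
)
if resp.status_code == 429:
    # Identifiers like x-request-id or cf-ray, if present, plus a timestamp,
    # are the kind of details that help narrow a report down.
    for name, value in resp.headers.items():
        print(f"{name}: {value}")
```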
Kyle, this issue can only be addressed by asking the team responsible for setting the HTTP 429 error codes on the merchants’ sites whether they have recently modified it.
They may have only done so on the ‘.myshopify.com’ domains. They may also be interested to know that it is causing some issues.
We would like to know what they say about it.
You personally would have no knowledge of this, it wouldn’t be written in any docs, and you wouldn’t be able to detect this issue since you don’t seem to know what an HTTP 429 error code is, or how to generate one (i.e., you need to use curl from a server to make GET calls to a merchant website domain).
Do any questions from this forum reach any of the responsible teams? You can’t possibly answer all these questions yourself without consulting others, right?
Thanks again for the additional context, Jason. Happy to unpack this further:
You are 100% right, there’s no way I could possibly answer every question without help. In this case, our storefront team has reviewed this thread. They noted there is room for improvement in our docs around rate limits. With that in mind, they have the following suggestions:
- implement pacing on your crawl rate (see the sketch below)
- correctly advertise your User-Agent (do not hide it)
- apply to Cloudflare’s verified bots registry
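To illustrate the first two suggestions, here’s a minimal sketch of a paced crawler that identifies itself and backs off on 429s. The bot name, delays, and URLs are placeholders, not required values.

```python
# Sketch: paced crawling with an honest User-Agent and backoff on 429.
import time
from typing import Optional

import requests

USER_AGENT = "YourBot/1.0 (+https://example.com/bot)"  # placeholder identity
DELAY_SECONDS = 2.0  # fixed pacing between pages; tune to your needs
MAX_RETRIES = 3

def fetch(url: str) -> Optional[requests.Response]:
    for attempt in range(MAX_RETRIES):
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After when it's a plain seconds value;
        # otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else DELAY_SECONDS * (2 ** attempt)
        time.sleep(wait)
    return None  # gave up after repeated 429s

for url in [
    "https://example-store.myshopify.com/products/a",
    "https://example-store.myshopify.com/products/b",
]:
    page = fetch(url)
    time.sleep(DELAY_SECONDS)  # pacing between successive pages
```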
I would like to share a little context around our internal processes to assure you that questions we respond to aren’t ignored or going off into the void.
Before bringing issues to our developers, we want to gather as much context as we can: replicating when possible, and finding details in our logs when replication isn’t possible. This ensures we get the issue to the correct team and that they have all the necessary details to address it properly. In this case, while I now realize you don’t have clean response headers like with an API request (thank you), details like the User-Agent, the URL being crawled, and a timestamp could still have helped us narrow this down.
Let me know if there’s any other way I can help.