Bug: 429 page errors are too aggressive

I think your team made a change in the last 1-2 months that returns 429 errors very aggressively when an app tries to “scrape” a merchant webpage (i.e. with curl).

I suppose you did this to prevent copycat / theft scraping.

However, this is backfiring for apps that need to scrape for legitimate reasons (e.g. SEO apps).

Is there a way you can WHITELIST our server (it’s a static IP)? Or make it less aggressive in general? It’s being triggered with just a couple of requests and causing all kinds of random issues (SEO King scrapes pages for MANY reasons).

I reduced the amount of scraping being done by my server. The excess was caused by a settings issue, and it does seem to work better now, with fewer 429s (maybe the issue is gone, not sure).

It’s something to think about though. The scraping wasn’t being done maliciously - it’s actually a way to get data from pages that are using page builder apps (can’t rely on the item data fields). Too many users that aren’t using page builders had it activated. However, if our app had 10x more users, and they did in fact often use page builders, it would be a problem.
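For anyone hitting the same limits, reducing the request rate as described above could look roughly like the following. This is only a minimal sketch: the `Throttle` class, its `min_interval` and `max_backoff` defaults, and the doubling backoff are illustrative assumptions, not values Shopify has published. Whether storefront 429 responses carry a `Retry-After` header is also not confirmed in this thread, so that value is optional here.

```python
import time


class Throttle:
    """Keep a minimum gap between requests and back off after a 429.

    All numbers are illustrative assumptions, not documented Shopify limits.
    """

    def __init__(self, min_interval=2.0, max_backoff=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.max_backoff = max_backoff
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0   # earliest clock value for the next request
        self._failures = 0         # consecutive 429 responses seen

    def wait(self):
        """Block until the next request is allowed."""
        delay = self._next_allowed - self._clock()
        if delay > 0:
            self._sleep(delay)

    def record(self, status, retry_after_seconds=None):
        """Update the schedule from the status code of the last response."""
        now = self._clock()
        if status == 429:
            self._failures += 1
            # Double the wait on every consecutive 429.
            backoff = self.min_interval * (2 ** self._failures)
            # If the server did send a Retry-After value, never wait less.
            if retry_after_seconds is not None:
                backoff = max(backoff, float(retry_after_seconds))
            self._next_allowed = now + min(backoff, self.max_backoff)
        else:
            self._failures = 0
            self._next_allowed = now + self.min_interval
```

Usage would be to call `wait()` before each page fetch and `record(status)` after it, so a single 429 slows the whole crawler down instead of triggering more 429s.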

Hey @jason_engage

Typically limits will be in place to ensure platform stability and security. It’s an interesting use case though, as you’re obviously not doing this maliciously and it’s benefiting merchants.

Can you share more on what you mean when you say you can’t rely on item data fields?

Have you looked into some of the Admin API online store or theme endpoints to see if you can get what you need there without needing to scrape the storefront?

I’ve seen some merchants where the page contents (HTML) are not stored in the item’s body_html field. I think it has to do with PageFly perhaps, or some other page management system, which may be saving the contents somewhere else (maybe privately). I can’t say I’ve spent a lot of time tracking it all down, but I’ve seen it several times, and added some options to simply scrape the pages.

If you’re familiar with any of these types of scenarios, and some common ways to work around them, I’m all ears!

Thanks for sharing that. To make sure your assumptions match the actual issue, do you have an x-request-id from one of the 429 responses you’ve received recently, preferably from one of your development stores? I can check our logs to confirm this is the case.

From there, my SEO knowledge is probably average, so I’m not fully aware of how other SEO tools that do similar scraping manage this. One thought is to see if it would work to use the API to get the bulk of the information you need and then scrape to fill in the gaps? Alternatively, use webhooks (like product update, collection update, etc) to help narrow down just the resources that may need to be re-scraped.
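The webhook idea above could be sketched as a small queue of "dirty" storefront paths, so that only resources that actually changed get re-scraped. This is a rough illustration under assumptions: `RescrapeQueue` and its topic-to-path mapping are hypothetical names, and it assumes the webhook payload carries a `handle` field and that storefront URLs follow the usual `/products/<handle>` and `/collections/<handle>` pattern.

```python
class RescrapeQueue:
    """Collect storefront paths that a webhook reported as changed."""

    # Hypothetical mapping from webhook topic to storefront path prefix.
    PATH_PREFIX = {
        "products/update": "/products/",
        "collections/update": "/collections/",
    }

    def __init__(self):
        # A dict keeps insertion order and removes duplicates for free.
        self._pending = {}

    def on_webhook(self, topic, payload):
        """Queue the path for this event; return False if it is not usable."""
        prefix = self.PATH_PREFIX.get(topic)
        handle = payload.get("handle")   # assumed payload field
        if prefix is None or not handle:
            return False
        self._pending[prefix + handle] = True
        return True

    def drain(self, limit):
        """Remove and return at most `limit` paths, oldest first."""
        paths = list(self._pending)[:limit]
        for path in paths:
            del self._pending[path]
        return paths
```

Draining a small batch on a timer keeps the scrape volume proportional to what merchants actually edit, rather than to the size of the catalog.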

It may also be worth testing whether updating the shop’s robots.txt to allow your crawler helps. See: Customize robots.txt

I’ve been watching the 429 issues for a few days now. It really looks like you guys ramped up the “aggressivity meter”. Can you check back with the team responsible and confirm or deny this?

Having similar issues using Screaming Frog. Seems like 429 errors are getting way more aggressive.

Hey Kevin, do you have a request id from one of these responses to help us pinpoint one of these occurrences?

Sadly I can’t provide you with request ids at the moment, but requests originated from 2001:4860:7:610::fb / 62.143.48.37 for store https://schleiftitan.myshopify.com/

Will try to gather more from team.

Kevin :hatching_chick:

When we use curl, there are no request IDs. It’s not part of the API. It’s direct server access to the webpage. That’s why Screaming Frog would have similar issues.

Would there be anything to identify this request that’s being sent with the 429 error you are seeing?

Kyle, this issue can only be addressed by asking the team responsible for setting the HTTP 429 error codes on the merchants’ sites whether they have recently modified it.

They may have only done so on the ‘.myshopify.com’ domains. They may also be interested to know that it is causing some issues.

We would like to know what they say about it.

You personally would have no knowledge of this, it wouldn’t be written in any docs, and you wouldn’t be able to detect this issue since you don’t seem to know what an HTTP 429 error code is, or how to generate one (i.e. you need to use curl from a server to make GET calls to a merchant website domain).

Do any questions from this forum reach any of the responsible teams? You can’t possibly be able to answer all these questions yourself, without consulting with others, right?

Thanks again for the additional context Jason. Happy to unpack this further:

You are 100% right, there’s no way I could possibly answer every question without help. In this case, our storefront team has reviewed this thread. They noted there is room for improvement in our docs around rate limits. With that in mind, they have the following suggestions:

I would like to share a little context around our internal processes to assure you that the questions we respond to aren’t ignored or going off into the void.

Before bringing issues to our developers, we want to ensure we have as much context as we can gather. That means replicating the issue when possible, and finding details in our logs when replication isn’t possible. This ensures we are getting the issue to the correct team and that they have all the necessary details to address it properly. In this case, while I now realize you don’t have clean response headers like with an API request (thank you), details like user-agent, the URL being crawled and a timestamp could have still helped us narrow this down.
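As a rough illustration of capturing those details (user-agent, URL, timestamp, plus any request ID header that happens to be present), something like the sketch below could be logged for every 429. The function names `build_report` and `fetch_report` and the record field names are made up for this example, and the `x-request-id` and `retry-after` lookups simply return `None` when the response does not include those headers.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone


def build_report(url, status, headers, user_agent, when=None):
    """Build a record of the details support asked for in this thread."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    if when is None:
        when = datetime.now(timezone.utc)
    return {
        "url": url,
        "status": status,
        "user_agent": user_agent,
        "timestamp_utc": when.astimezone(timezone.utc).isoformat(timespec="seconds"),
        "x_request_id": lowered.get("x-request-id"),  # None if not sent
        "retry_after": lowered.get("retry-after"),    # None if not sent
    }


def fetch_report(url, user_agent):
    """Fetch a page and return the report, including for error statuses."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            status, headers = resp.status, dict(resp.headers.items())
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx, but the response headers are still available.
        status, headers = err.code, dict(err.headers.items())
    return build_report(url, status, headers, user_agent)
```

Writing the returned dictionary to a log whenever `status == 429` gives support something concrete to search for, even when no request ID is available.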

Let me know if there’s any other way I can help.

@jason_engage @KyleG-Shopify I confirm that the 429 anti-scraping policy seems to have been recently changed. Our set of E2E tests for our app, automated using Cypress, was working reliably for many months. Recently it started failing (very early into the test suite) because of 429 errors. This makes our E2E tests useless. @KyleG-Shopify is there a process we can follow to whitelist a store (or a specific theme) or a user-agent so that Shopify doesn’t block us from running tests on our own dev store?

Hey Bart,

Thanks for this. There isn’t currently a process to whitelist a store or specific theme. What was mentioned above is what our team is suggesting:

Did these suggestions improve anything @jason_engage ?

I work for Equifax which owns Kount. We have an application that we test using e2e tests as well. This problem has caused us a lot of pain and time lost recently. Something changed around 3 weeks ago.

Our application is receiving 429 errors while loading the checkout page from within Cypress (Selenium alternative). Our usage of this functionality is probably around 100 an hour at the very most. When we start receiving 429s, it stops our usage of the Cypress environment for a long time – 1 hour or more. During that time, we can use the APIs with no issues.

Please lower this “screen scraping” limitation to the point that customers who are performing simple tests and placing a reasonable load are not impacted. This has been very painful.

@KyleG-Shopify, we validated that the User-Agent is set to:
Mozilla/5.0 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/126.0.0.0 Safari/537.36

Our “crawl rate” is very low. As mentioned earlier, it is around 100-200 an hour at most. We are not web scraping, we are running e2e tests that load only the checkout page via Cypress.

Cloudflare’s verified bots policy says one must have at least 1000 requests per day. I would doubt that on our busiest day ever we got over 400.

What can we do at this point? Same as Bart’s issue: E2E tests using Cypress. We have done this for years. Our big issues started around 3 weeks ago.

Thanks for those details. It’s interesting you’ve noticed a new change around 3 weeks ago. I’ll dig in here to see what I can find out for you.

Hi @KyleG-Shopify, thanks for investigating this. Adding another voice that we’ve noticed this issue recently too. It’s impacting our e2e tests of our checkout flow. We are also not at a volume to count for verified bots.

@KyleG-Shopify, we have done some more research. We had our system idle for the entire evening. We made zero calls to Shopify in our E2E tests. After 15 hours of being idle, we ran our tests. From my quick look, 15 calls to Shopify web pages via our Cypress GUI worked. We started our tests again 10 minutes after those first 15 calls, and failures started at that point.

So, we had 15 calls in a 10 minute period that worked AFTER having been idle for many hours. Whatever setting this is has us debilitated to the point we cannot deliver product.

See the table for all X-Request-ID values, the times of day, and success or failure. Successes are 302s and failures are 429s. All URLs used in this test were delivered to us via previous endpoints called with direct API usage.

There are 8 tests. Each test calls cy.visit() to request the website twice. First to log in, and second to make the payment.

Target URL: https://{store-name}.myshopify.com/cart/c/…
302 URL: https://{store-name}.myshopify.com/checkouts/cn/…

Run 1 ~ 7:39 AM MDT. Overall 302s on both calls - Successful Tests

| Test Number | Request | Status Code | X-Request-ID |
|---|---|---|---|
| 1 | 1 | 302 | 3e8e53ec-fea5-4e40-916f-95510c656417-1752673027 |
| 1 | 2 | 302 | 2264949c-a171-4598-b480-f7d6ec1f7672-1752673042 |
| 2 | 1 | 302 | 974f81bb-28dd-4290-b4fd-3bc9bd419f24-1752672968 |
| 2 | 2 | 302 | e3a6cb7c-1208-4986-b55f-c970687a8e5c-1752672983 |
| 3 | 1 | 302 | 04aae9d9-afec-4c09-9b11-0524c1a2b246-1752673025 |
| 3 | 2 | 302 | 4a7a13cd-c97e-4e54-a067-247bf18144c3-1752673041 |
| 4 | 1 | 302 | 274b435b-6d36-4ead-adab-306a69245e00-1752672974 |
| 4 | 2 | 302 | a57c1ec8-8a1a-4bc7-89ca-5e3898631af1-1752672989 |
| 5 | 1 | 302 | 5472a180-c80d-4fc0-8bf3-08a841e1dcc3-1752672969 |
| 5 | 2 | 302 | 0b52968c-6a92-4889-9923-6e5499917554-1752672984 |
| 6 | 1 | 302 | e0354e57-df4b-4f9a-959e-963e9cb2ad48-1752672976 |
| 6 | 2 | 302 | 301e6987-3b22-4b40-9af0-af2aed46c59c-1752672991 |
| 7 | 1 | 302 | 47baf4e4-7cb6-4e6a-9244-66b243e2f738-1752673024 |
| 7 | 2 | 302 | 2ac69fb2-1c77-4562-a8cc-839a61e81ea2-1752673039 |
| 8 | 1 | 302 | f2c56328-9223-4a33-b4f2-6c8c755ee693-1752672977 |
| 8 | 2 | 302 | bd0e8b4b-fe36-48b8-9cc6-02589c4c1780-1752672992 |

Run 2 ~ 7:49 AM MDT. Overall 302s and 429s on first calls, 429s on second calls - Failed Tests

| Test Number | Request | Status Code | X-Request-ID |
|---|---|---|---|
| 1 | 1 | 429 | b03ee016-bc3c-40fc-a761-7cf4c4955469-1752673695 |
| 2 | 1 | 429 | 53e48123-ae1c-499c-ba93-5ec68cdd6263-1752673707 |
| 3 | 1 | 302 | 2e76354b-ce1a-4e32-a133-c2a87ae99bcd-1752673691 |
| 3 | 2 | 429 | c2585bbc-37ec-4c68-bca0-98e4d4a1b815-1752673706 |
| 4 | 1 | 429 | 6ef0521d-e101-4053-8f41-a9618b3a0d16-1752673706 |
| 5 | 1 | 302 | 59c0bdd7-c381-4adc-8a28-0fef5dbc2cb8-1752673705 |
| 5 | 2 | 429 | 67a3e7c6-cbd7-4cc4-873b-6439c06cea9e-1752673721 |
| 6 | 1 | 302 | 25d475e7-71d5-42d1-9709-829dbd876337-1752673698 |
| 6 | 2 | 429 | 1117fa67-656b-4779-886d-b37399a792b1-1752673713 |
| 7 | 1 | 429 | f2f04fad-e74c-47e4-8692-4e44d75a4daa-1752673715 |
| 8 | 1 | 302 | 2a1a6ffc-3093-4056-8923-ffe219e52f4f-1752673705 |
| 8 | 2 | 429 | 7535811e-c471-4fa7-ba94-a4253e1ebe80-1752673720 |

Run 3 ~ 8:01 AM MDT. Overall 429s on first calls - Failed Tests

| Test Number | Request | Status Code | X-Request-ID |
|---|---|---|---|
| 1 | 1 | 429 | dcffa40d-acf1-4c2d-b282-c32d0ce6a518-1752674419 |
| 2 | 1 | 429 | 33143d4b-39ea-4869-9c28-ace33ce21482-1752674433 |
| 3 | 1 | 429 | d6ae9d4b-dc45-4896-8a54-aee79b7fd6bb-1752674433 |
| 4 | 1 | 429 | 82138659-e355-478a-8ef6-74d884a6c3e9-1752674425 |
| 5 | 1 | 429 | e642f4dd-d794-4e2a-ae87-da8fb2ae851b-1752674436 |
| 6 | 1 | 429 | 22641add-dd4c-475d-b3a0-f72b543c7c1d-1752674436 |
| 7 | 1 | 429 | bcc8a5db-4d0d-4448-84dd-f76c6b953f1a-1752674435 |
| 8 | 1 | 429 | a32f3556-111a-45a7-b29e-8e3565456b11-1752674443 |

Run 4 ~ 8:36 AM MDT. Overall 302s on both calls - Successful Tests

| Test Number | Request | Status Code | X-Request-ID |
|---|---|---|---|
| 1 | 1 | 302 | a68338fc-b201-4c04-9b51-682bd9cafb6f-1752676575 |
| 1 | 2 | 302 | 20004992-1fd1-4285-92e9-7fc3d250536c-1752676591 |
| 2 | 1 | 302 | 82669854-7ea8-45b5-9082-2433a21a6190-1752676579 |
| 2 | 2 | 302 | 3f2d356e-35fb-43b2-9051-d14d39308683-1752676593 |
| 3 | 1 | 302 | ad31bedd-2447-4f3e-8295-3f24cdf1350b-1752676602 |
| 3 | 2 | 302 | abd2004e-1ccb-4031-8754-e6aac28edd47-1752676616 |
| 4 | 1 | 302 | ce41989e-af78-45b4-b0f7-654af529b9bb-1752676583 |
| 4 | 2 | 302 | 5263ce74-5612-4229-9add-b0232ce8ac3c-1752676597 |
| 5 | 1 | 302 | bdcaf892-0d89-44f1-8f82-484e0db5ddc6-1752676582 |
| 5 | 2 | 302 | f9bee1c0-55e9-4acc-a2e9-353ad155eefc-1752676597 |
| 6 | 1 | 302 | 94805b7f-bdf4-4eee-84d2-2841c1f53d57-1752676575 |
| 6 | 2 | 302 | 720a7c3a-3c1d-4b1d-9c6f-fc3464a81dfe-1752676590 |
| 7 | 1 | 302 | 5fa68fc7-2256-4d6f-962a-21c07fd6abff-1752676578 |
| 7 | 2 | 302 | a5b12e6f-a221-4bfe-b02b-1942ebce62fb-1752676593 |
| 8 | 1 | 302 | bc06287b-317a-4ce5-967d-a6ae7d926cbc-1752676602 |
| 8 | 2 | 302 | 281c0e91-2321-4f0c-99d7-f0e4765934a6-1752676617 |

Run 5 ~ 8:46 AM MDT. Overall 302s and 429s on first calls, 429s on second calls - Failed Tests

| Test Number | Request | Status Code | X-Request-ID |
|---|---|---|---|
| 1 | 1 | 302 | 813da178-5a57-4827-b9ef-a5278b6c8b24-1752677240 |
| 1 | 2 | 429 | 95f860cd-286d-4ea1-bf04-ff29fc850369-1752677255 |
| 2 | 1 | 429 | 660f15b4-42f3-49c9-b96a-9d924dc9ad3e-1752677243 |
| 3 | 1 | 429 | 4d19581b-b7e1-4d3f-9ad0-9b9be16a4de5-1752677257 |
| 4 | 1 | 429 | 3b52939f-ea24-4ca3-bca3-b4e9ff936f3f-1752677246 |
| 5 | 1 | 302 | 2bc6318b-dfea-43cd-893c-22b5e2284c08-1752677248 |
| 5 | 2 | 429 | 1f38c0ec-5ae9-4fbf-a2c4-f1778ebae60b-1752677263 |
| 6 | 1 | 302 | 40a9771d-3550-4b12-8871-7d0c56f25424-1752677252 |
| 6 | 2 | 429 | dc114354-2082-47ce-8d64-8fbf07965a56-1752677266 |
| 7 | 1 | 302 | 9ad006fe-d61c-4fba-ba87-54d8bfe47b84-1752677260 |
| 7 | 2 | 429 | b351eb29-7f08-4cfa-a089-243773398a8f-1752677275 |
| 8 | 1 | 429 | b0c12f05-e0f9-41a7-a296-7c38597ca8a2-1752677267 |

Run 6 ~ 8:58 AM MDT. Overall 429s on first calls - Failed Tests

| Test Number | Request | Status Code | X-Request-ID |
|---|---|---|---|
| 1 | 1 | 429 | a81baeaf-017a-4e6e-aefd-5aa6333005ac-1752677975 |
| 2 | 1 | 429 | 9d368d64-f800-47a7-a4f1-a4228ff838ff-1752677960 |
| 3 | 1 | 429 | 7a8d313a-51da-4003-bdf3-80d99e6fe448-1752677956 |
| 4 | 1 | 429 | e18071ee-d55b-4e25-8314-e3d9ba94bcf7-1752677954 |
| 5 | 1 | 429 | fa025772-a6bb-47aa-8c20-32c03f8a5bc9-1752677975 |
| 6 | 1 | 429 | 0574b105-64d2-4419-bcd8-d46f7bf0d263-1752677962 |
| 7 | 1 | 429 | 3ae06514-f913-4236-9151-c4e9a527219b-1752677958 |
| 8 | 1 | 429 | 5b79c354-0ff3-44a2-b9e7-0287cfbf15dd-1752677950 |
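
The X-Request-ID values in the table above all end in a ten-digit number that looks like a Unix timestamp in seconds. That is an observation from this data, not documented behavior, but under that assumption the request times can be recovered from the IDs alone. A minimal sketch (the helper name `request_id_time` is made up for this example):

```python
from datetime import datetime, timedelta, timezone

MDT = timezone(timedelta(hours=-6), "MDT")  # fixed offset, enough for this table


def request_id_time(request_id):
    """Return the UTC time encoded in the ID's numeric suffix, or None.

    Assumes the part after the last '-' is a Unix timestamp in seconds.
    """
    uuid_part, _, suffix = request_id.rpartition("-")
    if not uuid_part or not suffix.isdigit():
        return None
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


first_ok = request_id_time("3e8e53ec-fea5-4e40-916f-95510c656417-1752673027")
first_429 = request_id_time("b03ee016-bc3c-40fc-a761-7cf4c4955469-1752673695")

print(first_ok.astimezone(MDT).strftime("%H:%M:%S"))  # 07:37:07, consistent with Run 1 (~7:39 AM MDT)
print((first_429 - first_ok).total_seconds())         # 668.0 seconds between the two requests
```

If the assumption holds, sorting the IDs by this suffix gives the exact order and spacing of requests, which may help when comparing against server-side logs.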