Random 502 errors on GraphQL calls

Hello all,

Our queue system has been hitting intermittent failures on some GraphQL API calls. The failures usually persist for a period of time, and the same calls succeed when we manually retry them later.

This is a problem for us, as we need to query these objects reliably without having to retry manually to get results.

Example occurrences (all times CET):
Dec 6, 3:43:32 PM
Dec 6, 12:00:44 PM
Dec 5, 12:59:36 PM

For example, on Dec 5th between 9:47:56 AM and 12:59:37 PM, this happened on more than 800 API calls.

This was the call that kept failing:

query customerStatistics($id: ID!) {
  customer(id: $id) {
    statistics {
      predictedSpendTier
      rfmGroup
    }
  }
}

The error was:

{
  "networkStatusCode": 502,
  "message": "GraphQL Client: Bad Gateway",
  "response": {}
}

I am not sure whether there is something we need to change on our end, but it would be great if you could investigate this, as it affects a lot of our business operations. We are happy to provide any extra technical information that could help trace the cause of these failures.

Best,

Khaled

Hi @khaledAtKeaz! Can you share x-request-id values from a few of those failed responses? If the 502 responses include that header, it’ll let us trace exactly what happened in our logs.

If you don’t see x-request-ids, we could still take a look at the logs using the timestamps you provided if you can share the app/api_client_id and myshopify.com URL of the store(s) the calls were made to. If you’re not comfortable sharing this on the forum, feel free to get in touch with support directly and they’ll get this in front of the right people.
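For anyone collecting these values, here is a minimal Python sketch of recording the request ID whenever a call fails. It assumes your HTTP client exposes the response headers as a mapping; `find_request_id` and `failure_record` are hypothetical helper names, not part of any Shopify library, and whether a given 502 response carries the header at all is exactly what you would be checking.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def find_request_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the x-request-id header value, matching the name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "x-request-id":
            return value
    return None


def failure_record(status: int, headers: Mapping[str, str], query_name: str) -> dict:
    """Build a log entry for one failed call (hypothetical helper for illustration)."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp of the failure
        "query": query_name,
        "status": status,
        "request_id": find_request_id(headers),  # None if the response had no such header
    }


if __name__ == "__main__":
    # "example-id" is a placeholder, not a real request ID.
    print(failure_record(502, {"X-Request-Id": "example-id"}, "customerStatistics"))
```

Logging the timestamp in UTC alongside the ID makes it easier to line the entries up with server-side logs if the header turns out to be missing.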

While you’re gathering the above, it’s worth checking whether you have any proxies, firewalls, or load balancers in your setup that might be timing out or returning 502s when the customer statistics query takes longer than expected. That said, 800+ failures in a short window does suggest something upstream, so the request IDs will help us determine where the issue actually is.
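On the manual retries mentioned in the original post: until the root cause is found, the queue worker could retry gateway errors automatically. Below is a minimal Python sketch of exponential backoff with jitter. It is an illustration under assumptions, not a recommended fix: `send` is a hypothetical callable that performs one GraphQL request and returns `(http_status, parsed_body)`, and the choice to treat 502, 503, and 504 as transient is mine.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE_STATUSES = {502, 503, 504}  # assumption: treat gateway errors as transient


def call_with_retry(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # Stop on success, on a non-retryable error, or after the final attempt.
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return status, body
        # Exponential backoff: base_delay, 2x, 4x, ... plus up to 25% random jitter.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay * 0.25))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    # Simulated responses: two 502s followed by a success.
    fake = iter([(502, {}), (502, {}), (200, {"data": {}})])
    print(call_with_retry(lambda: next(fake), base_delay=0.01))
```

A few attempts spread over seconds only cover brief blips. For a window lasting hours, like the one on Dec 5th, the job would more likely need to be requeued with a much longer delay.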