View Issue Details

ID: 0004126
Project: Rocky Services
Category: Mirror Manager
View Status: public
Last Update: 2023-11-21 07:44
Reporter: Brian Murrell
Assigned To: Neil Hanlon
Priority: high
Severity: major
Reproducibility: N/A
Status: closed
Resolution: unable to reproduce
Summary: 0004126: Mirror load-balancer doesn't return a 503 to clients when they incur a 503 trying to fetch content

Description: It would seem that when there is an issue behind the load-balancer trying to fetch content, the load-balancer returns a 200 response to the client, but with HTML content that says:

503 Service Unavailable

No server is available to handle this request.

Here are the log entries from our proxy indicating that the responses were 200s:

2023-09-05T13:33:55.220Z|40ca9f5eefc6c659|rocky-vault-proxy||HEAD|http://dl.rockylinux.org/vault/rocky/8.6/PowerTools/x86_64/os/repodata/145d657619670c3d373c5c3b6ec9e8b51b3386da2980b3dc761b7310a190869a-filelists.xml.gz|200|0|0|10442
2023-09-05T13:34:04.007Z|40ca9f5eefc6c659|rocky-vault-proxy||GET|http://dl.rockylinux.org/vault/rocky/8.6/PowerTools/x86_64/os/repodata/145d657619670c3d373c5c3b6ec9e8b51b3386da2980b3dc761b7310a190869a-filelists.xml.gz|200|0|0|8782

The 200 between the |…| delimiters in the above log entries is the HTTP status code, but the content actually received from the above GET was:

503 Service Unavailable

No server is available to handle this request.

The problem with all of this is that a client that caches a successful (i.e. 2xx) result is going to cache (and thus serve) erroneous content. If the client instead received the 503 response, it would know the fetch was unsuccessful, would not cache the body, and would serve from its own cache. It would then know to check again for correct content in the near future.
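One way to observe the mismatch from the command line is to compare the status code curl reports against the body it actually delivers. A minimal sketch, using the URL from the log entries above:

```
# Fetch the file, capturing the status code and the body separately.
url=http://dl.rockylinux.org/vault/rocky/8.6/PowerTools/x86_64/os/repodata/145d657619670c3d373c5c3b6ec9e8b51b3386da2980b3dc761b7310a190869a-filelists.xml.gz
code=$(curl -s -o /tmp/body -w '%{http_code}' "$url")
echo "status: $code"
# The bug: a 2xx status whose body is actually the load-balancer's error page.
grep -q '503 Service Unavailable' /tmp/body && echo "error body under a $code status"
```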
Tags: No tags attached.

Activities

Neil Hanlon

2023-09-07 02:43

administrator   ~0004588

Thank you for the report. I'm looking into this.

Brian Murrell

2023-09-27 11:37

reporter   ~0004720

Any progress on this? Your mirrors have just poisoned my cache once again by returning a 200 response, rather than a proper 503 to the client, with this body:

<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

Neil Hanlon

2023-09-27 11:43

administrator   ~0004721

I have been on vacation for the past two weeks. Working on it this week.

I've checked our CDN code and don't know why you are experiencing this -- our error codes are properly set everywhere, as far as I can tell. So, I will have to do some more investigation.

Might I recommend that, in the interim, you mirror using rsync off of one of our community mirrors rather than trying to grab everything over HTTP? In general, this is a better approach to repo syncing.
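A full-tree pull with rsync might look something like this (a sketch only; the mirror host and module path are illustrative, not any particular mirror's actual layout):

```
# Hypothetical mirror host and module path -- check the mirror's rsync
# listing for the real ones. -H preserves the hardlinks mirrors rely on.
rsync -avSH --delete --partial \
    rsync://mirror.example.org/rocky/8.6/PowerTools/x86_64/os/ \
    /srv/mirror/rocky/8.6/PowerTools/x86_64/os/
```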

Brian Murrell

2023-09-27 12:34

reporter   ~0004722

I have no idea how your mirroring infra works, but it seems that something sits in front of a pool of servers proxying requests from clients, probably for load balancing/distribution/etc.

That something occasionally runs into 503 errors on the back-end (your pool of mirrors) and, rather than proxying back that 503 error, it sends back a 200 success code with some HTML telling the "user" there was a 503 on the back-end, hence the:

<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

This is not an entirely unreasonable action when humans are going to see such a message, as would be the case for regular WWW browsing. Humans couldn't care less about return codes, and the 200 is needed to display the HTML message to the user in their browser. So that this all happens is not surprising: whatever software is delivering that message was likely built for the humans-using-web-browsers use case.

This is of course disastrous when machines are involved, as they want proper error codes and couldn't care less what the content is.

I would imagine setting up a reproducer in a sandbox would not be terribly difficult.
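For instance, the "No server is available to handle this request." text is HAProxy's stock 503 body, and HAProxy's errorfile directive takes a raw HTTP response including the status line, so a mis-edited status line would produce exactly this behavior. A minimal sandbox sketch, assuming (without confirmation) that HAProxy or something similar is the component in play:

```
# Assumes HAProxy is the load balancer -- not confirmed in this ticket.
# A backend with no servers forces HAProxy's 503 error page on every request.
cat > /tmp/haproxy.cfg <<'EOF'
defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
frontend fe
    bind *:8080
    default_backend be
    # An errorfile is a raw HTTP response; one whose first line reads
    # "HTTP/1.0 200 OK" instead of "HTTP/1.0 503 ..." would mask the error:
    # errorfile 503 /tmp/503-as-200.http
backend be
    # no servers defined -> every request takes the 503 path
EOF
haproxy -f /tmp/haproxy.cfg &
curl -si http://localhost:8080/ | head -1   # the status line actually returned
```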

As for using rsync to completely mirror: this is not how our caching proxy works. It takes requests from clients, checks whether it has that request cached, and delivers it from the cache if so. If the item is not cached, or is stale, it fetches it from the remote, caches it locally, and then delivers it to the client.

In this way (a) we use no more disk space and network bandwidth than is needed to fetch just what the clients want, rather than the entire mirror, most of which would probably never be requested, and (b) we similarly reduce the load on your mirrors by fetching only the small percentage of content our clients actually want.

The product we are using for this is Artifactory, for what it's worth.
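For illustration only, here is the same pull-through-cache principle expressed as a minimal nginx configuration (our actual product is Artifactory, as noted; this is not our setup):

```
# Not our actual setup (we use Artifactory); an nginx sketch of the same
# pull-through-cache behavior, runnable in a sandbox.
cat > /tmp/cache.conf <<'EOF'
events {}
http {
    proxy_cache_path /var/cache/repo keys_zone=repo:10m inactive=1d;
    server {
        listen 8080;
        location / {
            proxy_pass http://dl.rockylinux.org;
            proxy_cache repo;
            # Only 2xx results are considered cacheable successes, so a 503
            # body delivered under a 200 status gets cached and re-served
            # as if it were good content -- the poisoning described above.
            proxy_cache_valid 200 1h;
        }
    }
}
EOF
nginx -c /tmp/cache.conf
```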

Brian Murrell

2023-10-10 11:33

reporter   ~0004852

Is there any update here? The Rocky mirroring infrastructure seems very fragile and breaks in the way this ticket describes quite frequently. And when it does, it poisons our cache.

We really need this to stop happening: the mirrors should return proper status codes, not erroneously successful ones, when the fetch was not in fact successful.

Neil Hanlon

2023-10-10 13:56

administrator   ~0004854

Hi Brian -- Just wondering, is this a duplicate of https://bugs.rockylinux.org/view.php?id=4094 ?

Are these the same issues?

I am unable to reproduce the exact problem that you are having, and I have thoroughly investigated our CDN and Load Balancer configurations and find no way for them to return an incorrect status code. I will note that we *did* have an issue some time ago (over 6 months) where this was happening, but it was quickly resolved after it was noticed--and was a regression itself, not something that had existed for a long time.

Brian Murrell

2023-10-10 15:29

reporter   ~0004856

@Neil Hanlon: Unfortunately there has been some conflation of issues in ticket 0004094. The comment at https://bugs.rockylinux.org/view.php?id=4094#c4819 is a duplicate of this issue but the original report is not.

> I am unable to reproduce the exact problem that you are having, and I have thoroughly investigated our CDN and Load Balancer configurations and find no way for them to return an incorrect status code. I will note that we *did* have an issue some time ago (over 6 months) where this was happening, but it was quickly resolved after it was noticed--and was a regression itself, not something that had existed for a long time.

How are you trying to reproduce this? Are you somehow forcing the backend server(s) to generate a 503 error and seeing what the load balancer (i.e. what our clients hit when they do a dnf makecache, for example) does? Is this reproducer something I can try from our DNF client?

Can you describe the software components involved between a DNF client on the Internet and the actual server that holds the Rocky Linux content that client will receive on an HTTP GET? That is, what is the pipeline of components between them?

Brian Murrell

2023-10-11 21:06

reporter   ~0004857

Worth noting that I am not the only person to have seen or reported this from the Rocky mirror infrastructure:

https://forums.rockylinux.org/t/503-response-from-mirrors-rockylinux-org/9497

So it looks like it is the mirror infrastructure that reports this textual HTML-encoded error message with a 200 return code.

Any more progress on this? Any answers to the above questions about how the infrastructure is composed, so that I can see if I can reproduce the problem? Or do you have a sandbox where you can reproduce the problem that I can test against?

Louis Abel

2023-10-12 01:39

administrator   ~0004858

You appear to be conflating two separate issues. The issue you are linking to was due to IPA maintenance in the infrastructure which did not go as planned; it was an extended outage that affected everybody. If you read through the rest of the thread, I make a point of stating that it was supposed to be transparent maintenance but that is not what happened, and it didn't just affect the mirror manager, it affected everything else internally because of DNS. This report, in comparison, is specifically about "503" errors being masked as "200" codes, and no one else is reporting that issue to us. So far it's only yourself and your colleague in the other report. It is not widespread: if it were a larger issue, we would have many, many more reports in our forums, Mattermost, mailing lists, and IRC rooms. Since that is currently not the case, and in order for this ticket and the other to be properly resolved, the focus needs to stay on what happens between our endpoints only.

The following shows a forced 503 from mirror manager on our testing debug instance. For context, mirror manager sits behind our CDN, just like our repositories. We make changes to mirror manager (which Rocky Linux uses by default) and test them here before sending them to the /mirrorlist endpoint. I can see a 503 being generated here.

```
> GET /debuglist?repo=AppStream-8&arch=x86_64&1697072209 HTTP/2
> Host: mirrors.rockylinux.org
> user-agent: curl/8.0.1
> accept: */*
> fastly-debug: 1
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [209 bytes data]
< HTTP/2 503
< server: Apache/2.4.37 (centos) OpenSSL/1.1.1k mod_wsgi/4.6.4 Python/3.6
< content-type: text/html; charset=iso-8859-1
< fastly-restarts: 1
< accept-ranges: bytes
< via: 1.1 varnish, 1.1 varnish
< date: Thu, 12 Oct 2023 00:56:49 GMT
< fastly-debug-path: (D cache-cmh1290034-CMH 1697072210) (F cache-cmh1290089-CMH 1697072210) (D cache-chi-kigq8000088-CHI 1697072210) (F cache-chi-kigq8000088-CHI 1697072210)
< fastly-debug-ttl: (M cache-cmh1290034-CMH - - -) (M cache-chi-kigq8000088-CHI - - -)
< fastly-debug-digest: 97b8526ecbdc284208dccfec20746f1df987df49c0395966756f854d41e86fc0
< x-served-by: cache-chi-kigq8000088-CHI, cache-cmh1290034-CMH
< x-cache: MISS, MISS
< x-cache-hits: 0, 0
< x-timer: S1697072210.703540,VS0,VE46
< x-rocky-debug: beresp_status=503,status=503,mirrorlist-errror
< content-length: 299
<
{ [299 bytes data]
* Connection #0 to host mirrors.rockylinux.org left intact
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>

The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.


</body></html>
```
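
(For reference, verbose output like the above can be produced with an invocation along these lines; the exact flags used are an assumption, not shown in the capture:)

```
# Assumed invocation; the fastly-debug request header asks the CDN to emit
# the fastly-debug-* response headers seen above.
curl -sv --http2 -H 'fastly-debug: 1' \
    'https://mirrors.rockylinux.org/debuglist?repo=AppStream-8&arch=x86_64&1697072209' \
    -o /dev/null
```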

With that said, I cannot speak to what Neil's reproduction steps are; I gather they may be similar.

Anyway, the order is this: client -> fastly CDN -> load balancer in AWS us-east-2 (ORD) -> three repo servers. There is no special sauce beyond this, and dl.rockylinux.org, being our tier 0, goes nowhere else, unlike the mirrors found via the mirror list.

It's important to note that there are no 503s occurring in the logs for the repo servers anyway.

repo01# grep '" 200 ' access.log | wc -l
1147259
repo01# grep '" 503 ' access.log | wc -l
0
repo01# zgrep '" 503 ' access.log*.gz | wc -l
0

repo02# grep '" 200 ' access.log | wc -l
1181726
repo02# grep '" 503 ' access.log | wc -l
0
repo02# zgrep '" 503 ' access.log*.gz | wc -l
0

repo03# grep '" 200 ' access.log | wc -l
1201986
repo03# grep '" 503 ' access.log | wc -l
0
repo03# zgrep '" 503 ' access.log*.gz | wc -l
0

* You noted you are using a proxy. I am assuming this is an outbound proxy. Does this outbound proxy also cache? Have you considered not using an outbound proxy to connect to dl.rockylinux.org to rule out misbehavior from your end?
* Are you unable to pick a mirror from mirrors.rockylinux.org that has the current versions and continue to use dl for the vault? What would be the concern internally to your organization with using a tier 1 mirror? If your concern is stability and you believe that we are somehow polluting your cache, choosing another mirror should be an obvious choice for you in that case.
* Excluding the above, with stability still being a concern, what is preventing you or your organization from using rsync or even reposync to get the data you want locally? Users and organizations generally sync the repositories locally. If you use rsync, most data at our endpoints are hardlinked, which would alleviate some storage concerns. (A reposync sketch follows below.)
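
A reposync pull of a single repository might look like the following (a sketch; the repo ID and download path are illustrative):

```
# Requires dnf-plugins-core. Repo ID and path here are illustrative.
dnf reposync --repoid=baseos --arch=x86_64 \
    --download-metadata --delete \
    -p /srv/mirror/rocky
```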

Setting to needinfo.

Louis Abel

2023-11-21 07:44

administrator   ~0005187

Hello. This is a notification that this bug will be closed. We are closing this ticket due to the following:

* Rocky Linux 9.3 has been released and 9.2 is vaulted; 8.9 will soon follow, with 8.8 to be vaulted.
* We have not received any reports from end users of issues with the CDN, mirror manager, or our mirrors.
* The ticket has been set to needinfo since October 11th and we have not received a response to the questions presented.
* Due to the lack of response, we will assume either the issue has resolved itself or you have resolved it internally in your organization.

If you are still experiencing the issues reported here, please open a new bug report with the previously presented data and, if possible, answers to the questions asked in the comment above. After opening the report, you can set this ticket as related in the "relationships" box below the report after submission.

Thank you. We hope you enjoy your holiday.

Issue History

Date Modified Username Field Change
2023-09-05 15:37 Brian Murrell New Issue
2023-09-07 02:43 Neil Hanlon Assigned To => Neil Hanlon
2023-09-07 02:43 Neil Hanlon Status new => acknowledged
2023-09-07 02:43 Neil Hanlon Note Added: 0004588
2023-09-27 11:37 Brian Murrell Note Added: 0004720
2023-09-27 11:43 Neil Hanlon Note Added: 0004721
2023-09-27 12:34 Brian Murrell Note Added: 0004722
2023-10-10 11:33 Brian Murrell Note Added: 0004852
2023-10-10 13:56 Neil Hanlon Note Added: 0004854
2023-10-10 15:29 Brian Murrell Note Added: 0004856
2023-10-11 21:06 Brian Murrell Note Added: 0004857
2023-10-12 01:39 Louis Abel Status acknowledged => needinfo
2023-10-12 01:39 Louis Abel Note Added: 0004858
2023-11-21 07:44 Louis Abel Status needinfo => closed
2023-11-21 07:44 Louis Abel Resolution open => unable to reproduce
2023-11-21 07:44 Louis Abel Note Added: 0005187