Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 22 Apr 2024 09:12:49 -0700
From:      Gleb Smirnoff <glebius@freebsd.org>
To:        Alexander Leidinger <Alexander@leidinger.net>
Cc:        Current <current@freebsd.org>
Subject:   Re: Strange network/socket anomalies since about a month
Message-ID:  <ZiaMgRH-8vecPfSt@cell.glebi.us>
In-Reply-To: <1fe609f252e7fae6d746530d5035ec0e@Leidinger.net>
References:  <1fe609f252e7fae6d746530d5035ec0e@Leidinger.net>

next in thread | previous in thread | raw e-mail | index | archive | help
  Alexander,

On Mon, Apr 22, 2024 at 09:26:59AM +0200, Alexander Leidinger wrote:
A> I see a higher failure rate of socket/network related stuff since a while.
A> Those failures are transient. Directly executing the same thing again may
A> or may not result in success/failure. I'm not able to reproduce this at
A> will. Sometimes they show up.
A> 
A> Examples:
A>  - poudriere runs with the sccache overlay (like ccache but also works for
A> rust) sometimes fail to create the communication socket and as such the
A> build fails. I have 3 different poudriere bulk runs after each other in my
A> build script, and when the first one fails, the second and third still run.
A> If the first fails due to the sccache issue, the second and 3rd may or may
A> not fail. Sometimes the first fails and the rest is ok. Sometimes all fail,
A> and if I then run one by hand it works (the script does the same as the
A> manual run, the script is simply a "for type in A B C; do; poudriere bulk
A> -O sccache -j $type -f  ${type}.pkglist; done" which I execute from the
A> same shell, and the script doesn't do env-sanityzing).
A>  - A webmail interface (inet / local net -> nginx (rev-proxy) -> nginx
A> (webmail service) -> php -> imap) sees intermittent issues sometimes.
A> Opening the same email directly again afterwards normally works. I've also
A> seen transient issues with pgp signing (webmail interface -> gnupg /
A> gpg-agent on the server), simply hitting send again after a failure works
A> fine.
A> 
A> Gleb, could this be related to the socket stuff you did 2 weeks ago? My
A> world is from 2024-04-17-112537. I do notice this since at least then, but
A> I'm not sure if they where there before that and I simply didn't notice
A> them. They are surely "new recently", that amount of issues I haven's seen
A> in January. The last two updates of current I did before the last one where
A> on 2024-03-31-120210 and 2024-04-08-112551.

The stuff I pushed 2 weeks ago was a large rewrite of unix/stream, but that was
reverted as it appears needs more work wrt to aio(4), nfs/rpc and also appeared
that sendfile(2) over unix(4) has some non-zero use.

There were several preparatory commits that were not reverted and one of them
had a bug.  The bug manifested itself as failure to send(2) zero bytes over
unix/stream.  It was fixed with e6a4b57239dafc6c944473326891d46d966c0264. Can
you please check you have this revision? Other than that there are no known
bugs left.

A> I could also imagine that some memory related transient failure could cause
A> this, but with >3 GB free I do not expect this. Important here may be that
A> I have https://reviews.freebsd.org/D40575 in my tree, which is memory
A> related, but it's only a metric to quantify memory fragmentation.
A> 
A> Any ideas how to track this down more easily than running the entire
A> poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?

I don't have any better idea than ktrace over failing application.  Yep, I
understand that poudriere will produce a lot.  But first we need to determine
what syscall fails and on what type of socket.  After that we can scope down to
using dtrace on very particular functions.

-- 
Gleb Smirnoff



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?ZiaMgRH-8vecPfSt>