replicator/ tests sometimes hang with Nightly #14454
Comments
I dug into it a bit more; here is where it hangs. The test thinks it is done after 200s, but then hangs until the timeout in this loop, trying to clean up:
message("TEST: top: test done, now close");
// ...
for (Thread thread : indexers) {
  thread.join();
}
for (Thread thread : searchers) {
  thread.join(); // <-- hang here
}
Stacktrace:
Next step will be to look at how the test uses the searchers and see if there is anything suspicious.
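For reference, one way a cleanup loop like the one above could report a stuck thread instead of hanging is to bound each join. This is only a sketch for illustration, not the test's code: `BoundedJoinSketch` and `joinAll` are hypothetical names, and the 200 ms timeout is arbitrary.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BoundedJoinSketch {

  /**
   * Joins each thread for at most timeoutMillis and returns how many are still alive.
   * timeoutMillis must be greater than 0, because join(0) waits forever.
   */
  static int joinAll(List<Thread> threads, long timeoutMillis) throws InterruptedException {
    int stuck = 0;
    for (Thread thread : threads) {
      thread.join(timeoutMillis); // returns after the timeout even if the thread is still running
      if (thread.isAlive()) {
        stuck++; // a caller could fail the test here and dump this thread's stack
      }
    }
    return stuck;
  }

  public static void main(String[] args) throws InterruptedException {
    CountDownLatch never = new CountDownLatch(1);
    Thread finished = new Thread(() -> {});
    Thread hung = new Thread(() -> {
      try {
        never.await(); // stand-in for a searcher blocked in a read
      } catch (InterruptedException e) {
        // fall through and exit
      }
    });
    hung.setDaemon(true); // do not keep the JVM alive if something goes wrong
    finished.start();
    hung.start();

    int stuck = joinAll(List.of(finished, hung), 200);
    System.out.println("threads still alive: " + stuck);
    never.countDown(); // release the stand-in thread
  }
}
```

A bounded join does not unblock the stuck thread. It only lets the caller notice the thread and fail, so the underlying blocking read still needs its own fix.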
I found suspicious stuff :) Now I just have to review more stacktraces. I'm attaching the Jenkins console output to the issue so it doesn't get lost (I don't know the policeman retention policy). It is 10MB, sorry.
The test framework had clearly pointed directly at the suspect all along. Here it is:
This is the hung searcher that can't be interrupt()'d. It is blocking on a socket read, which is not good and causes the hang. Here is the relevant code:
while (c.sockIn.available() == 0) {
  if (stop.get()) {
    break;
  }
  if (node.isOpen == false) {
    throw new IOException("node closed");
  }
  Thread.sleep(1);
}
version = c.in.readVLong(); // <-- this is the blocking read that hangs forever
The looping on In general, things are getting crashed here, maybe there are bugs in the test, but I'd rather we have failures instead of hangs. As a first step, I recommend setting
These nightly tests sometimes hang and are killed by timeout, instead of passing or failing. Upon inspection of one of the failures, the hung thread is stuck in `SocketDispatcher.read0()` native code, presumably waiting for bytes that will never appear. Instead, set a socket timeout for these tests, so that infinite hangs (of this sort) will no longer happen: they may become failures instead. Failures are better than hangs for many reasons. Relates: apache#14454
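To illustrate the idea in the commit message above, here is a minimal sketch of how a socket read timeout turns an endless blocking read into an exception. It is not the actual change to the replicator tests: the class `SocketReadTimeoutSketch`, the method `readTimesOut`, and the 200 ms value are made up for the example. It uses only the standard `java.net.Socket.setSoTimeout(int)` call.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketReadTimeoutSketch {

  /**
   * Connects to a local server that accepts the connection but never writes,
   * then attempts a read with SO_TIMEOUT set. Returns true if the read failed
   * with SocketTimeoutException instead of blocking forever.
   */
  static boolean readTimesOut(int timeoutMillis) throws IOException {
    // Port 0 asks the OS for any free port; the loopback address keeps everything local.
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
         Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
         Socket silentPeer = server.accept()) {
      client.setSoTimeout(timeoutMillis); // bounds every blocking read on this socket
      InputStream in = client.getInputStream();
      try {
        in.read(); // the peer never sends anything, so this can only time out
        return false;
      } catch (SocketTimeoutException expected) {
        return true; // a failure the caller can report, rather than a hang
      }
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("read timed out: " + readTimesOut(200));
  }
}
```

Without the `setSoTimeout` call, the same `read()` would block until the peer wrote or closed the connection, which is the kind of hang described above.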
Thanks Robert, this is a good investigation. It is still unclear why the socket read hangs on a localhost connection, but indeed you never know. Out of file handles, another VM doing crazy macOS Hackintosh stuff, or whatever can cause strange things.
I'm not sure if the test has killed all the JVMs yet in this case or not. Not sure where the connections are going and coming, maybe the node is attempting to search itself? Also the test kills JVMs, so not all sockets get Basically, I did not yet try to debug any higher level logic, just address this particular case. We should be able to beast the test more efficiently now and find problems more easily.
I'm keeping this issue open as I'm not sure what the test will do, if it encounters the same condition. I'm hoping it fails, and even better if that reproduces... and then we can try to stabilize that.
I will set up nightly runs on policeman
Yeah, that would be awesome! I'm not sure how much these tests are getting exercised today... these problems seem to surface during release votes.
I created a linux job for now. Maybe I will change the "normal" job later to put "nightly" into the randomization to not have so many duplicates. But this helps for now. https://jenkins.thetaphi.de/job/Lucene-nightly-main-Linux/ Windows and Mac nodes will come later with above randomization.
Thanks @uschindler this will definitely help.
Description
Example found by @uschindler during release vote: https://jenkins.thetaphi.de/job/Lucene-10.x-Release-Tester/5/consoleFull
Maybe the test should be marked `@AwaitsFix` until it is deflaked?
Version and environment details
No response