replicator/ tests sometimes hang with Nightly #14454

Open
rmuir opened this issue Apr 8, 2025 · 10 comments

@rmuir
Member

rmuir commented Apr 8, 2025

Description

Example found by @uschindler during release vote: https://jenkins.thetaphi.de/job/Lucene-10.x-Release-Tester/5/consoleFull

Maybe the test should be marked @AwaitsFix until it is deflaked?

Version and environment details

No response

@rmuir
Member Author

rmuir commented Apr 9, 2025

I dug into it a bit more; here is where it hangs. The test thinks it is done after 200s, but then hangs in this loop until the suite timeout while trying to clean up:

message("TEST: top: test done, now close");
// ...
for (Thread thread : indexers) {
  thread.join();
}
for (Thread thread : searchers) {
  thread.join(); // <-- hang here
}

Stacktrace:

1> 228.525s       :     parent [      pump3] top: R3 tcpPort=37267: replica process finished
  2> ᎧᏬ 08, 2025 10:14:04 ᏌᎾᎴ com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
  2> WARNING: Suite execution timed out: org.apache.lucene.replicator.nrt.TestStressNRTReplication
  2>    1) Thread[id=40, name=main, state=WAITING, group=TGRP-TestStressNRTReplication]
  2>         at java.base/java.lang.Object.wait0(Native Method)
  2>         at java.base/java.lang.Object.wait(Object.java:375)
  2>         at java.base/java.lang.Thread.join(Thread.java:2045)
  2>         at java.base/java.lang.Thread.join(Thread.java:2121)
  2>         at org.apache.lucene.replicator.nrt.TestStressNRTReplication.test(TestStressNRTReplication.java:348)

Next step will be to look at how the test uses the searchers and see if there is anything suspicious.

@rmuir
Member Author

rmuir commented Apr 9, 2025

I found suspicious stuff :) Now I just have to review more stacktraces. I'm attaching the Jenkins console output to the issue so it doesn't get lost (I don't know Policeman's retention policy).

It is 10MB, sorry.

#5.txt

@rmuir
Member Author

rmuir commented Apr 9, 2025

The test framework had clearly pointed directly at the suspect all along. Here it is:

  2> Apr 08, 2025 9:14:26 AM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
  2> SEVERE: There are still zombie threads that couldn't be terminated:
  2>    1) Thread[id=50, name=Searcher node=R0 tcpPort=33091, state=RUNNABLE, group=TGRP-TestStressNRTReplication]
  2>         at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
  2>         at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
  2>         at java.base/sun.nio.ch.NioSocketImpl.tryRead(NioSocketImpl.java:256)
  2>         at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:307)
  2>         at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:346)
  2>         at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:796)
  2>         at java.base/java.net.Socket$SocketInputStream.implRead(Socket.java:1108)
  2>         at java.base/java.net.Socket$SocketInputStream.read(Socket.java:1095)
  2>         at java.base/java.net.Socket$SocketInputStream.read(Socket.java:1089)
  2>         at org.apache.lucene.core@10.2.0-SNAPSHOT/org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:34)
  2>         at org.apache.lucene.core@10.2.0-SNAPSHOT/org.apache.lucene.store.DataInput.readVLong(DataInput.java:199)
  2>         at org.apache.lucene.replicator.nrt.TestStressNRTReplication$SearchThread.run(TestStressNRTReplication.java:1079)

This is the hung searcher thread that can't be interrupt()'d: it is blocking on a socket read, which is what causes the hang. Here is the relevant code:

while (c.sockIn.available() == 0) {
  if (stop.get()) {
    break;
  }
  if (node.isOpen == false) {
    throw new IOException("node closed");
  }
  Thread.sleep(1);
}
version = c.in.readVLong();  // <-- this is the blocking read that hangs forever

The looping on available() was the "suspicious stuff": the javadoc's implSpec states that "The available method of InputStream always returns 0", so polling it is not a reliable guard, and the blocking readVLong() that follows can still hang when the expected bytes never arrive.

In general, processes are getting crashed here, and maybe there are bugs in the test, but I'd rather we have failures instead of hangs.

As a first step, I recommend setting a socket timeout (setSoTimeout()) to a reasonable value such as 30s so that reads won't block forever. LockVerifyServer does this, and it never hangs infinitely: `s.setSoTimeout(30000);`
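
For illustration, here is a minimal, self-contained sketch (plain java.net sockets; not the test's code, and the class and variable names are made up) of the two behaviors discussed above: interrupt() does not unblock a thread stuck reading a Socket's InputStream, while setSoTimeout() turns the same hang into a SocketTimeoutException:

import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketReadTimeoutSketch {
  public static void main(String[] args) throws Exception {
    try (ServerSocket server = new ServerSocket(0);
        Socket client = new Socket("localhost", server.getLocalPort());
        Socket accepted = server.accept()) {

      // 1) Without a timeout: a thread blocked in Socket.getInputStream().read()
      //    is not woken up by interrupt(), which is why ThreadLeakControl can only
      //    report it as a zombie thread.
      Thread reader =
          new Thread(
              () -> {
                try {
                  client.getInputStream().read(); // the peer never writes anything
                } catch (Exception e) {
                  System.out.println("reader finally exited: " + e);
                }
              },
              "stuck-reader");
      reader.start();
      Thread.sleep(500); // give the reader time to block
      reader.interrupt();
      reader.join(1000);
      System.out.println("reader alive after interrupt(): " + reader.isAlive()); // true

      // 2) With a socket timeout, the same kind of read fails fast instead of
      //    hanging forever (the fix uses 30s; 1s keeps this demo short).
      accepted.setSoTimeout(1000);
      try {
        accepted.getInputStream().read(); // the client side never writes either
      } catch (SocketTimeoutException expected) {
        System.out.println("read timed out as expected");
      }
    } // closing the sockets here finally unblocks the stuck reader
  }
}

In the test, the equivalent change is to set the timeout on the sockets it opens, so a read on a connection that will never deliver data becomes an exception the framework can report instead of a suite timeout.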

rmuir added a commit to rmuir/lucene that referenced this issue Apr 9, 2025
These nightly tests sometimes hang and are killed by timeout, instead of
passing or failing. Upon inspection of one of the failures, the hung
thread is stuck in `SocketDispatcher.read0()` native code, presumably
waiting for bytes that will never appear.

Instead, set a socket timeout for these tests, so that infinite hangs
(of this sort) will no longer happen: they may become failures instead.
Failures are better than hangs for many reasons.

Relates: apache#14454
@uschindler
Contributor

Thanks Robert, these are good investigations. It is still unclear why the socket read hangs on a localhost connection, but indeed you never know. Running out of file handles, another VM doing crazy macOS Hackintosh stuff, or whatever else can cause strange things.

rmuir added a commit that referenced this issue Apr 10, 2025
@rmuir
Member Author

rmuir commented Apr 10, 2025

I'm not sure whether the test had killed all the JVMs yet in this case or not. I'm also not sure where the connections are coming from and going to; maybe the node is attempting to search itself?

Also the test kills JVMs, so not all sockets get shutdown() cleanly.

Basically, I have not yet tried to debug any higher-level logic, just addressed this particular case. We should be able to beast the test more efficiently now and find problems more easily.

rmuir added a commit that referenced this issue Apr 10, 2025
@rmuir
Member Author

rmuir commented Apr 10, 2025

I'm keeping this issue open as I'm not sure what the test will do if it encounters the same condition. I'm hoping it fails, and even better, that the failure reproduces... and then we can try to stabilize it.

@uschindler
Contributor

I will set up nightly runs on the Policeman Jenkins.

@rmuir
Member Author

rmuir commented Apr 10, 2025

Yeah, that would be awesome! I'm not sure how much these tests are getting exercised today... these problems seem to surface during release votes.

@uschindler
Contributor

uschindler commented Apr 10, 2025

I created a Linux job for now. Maybe I will change the "normal" job later to put "nightly" into the randomization so there aren't so many duplicates. But this helps for now.

https://jenkins.thetaphi.de/job/Lucene-nightly-main-Linux/

Windows and Mac nodes will come later, with the above randomization.

@rmuir
Member Author

rmuir commented Apr 10, 2025

Thanks @uschindler, this will definitely help.

jpountz pushed a commit to jpountz/lucene that referenced this issue Apr 24, 2025