Skip to content

Propagate file settings health info to the health node #127397

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
May 5, 2025

Conversation

prdoyle
Copy link
Contributor

@prdoyle prdoyle commented Apr 25, 2025

Abstract

File settings are processed on the master node, but health is reported by the health node. To get status reported correctly, we must forward the health info from the master to the health node.

See ES-10965.

Design

The system of record for the file settings health info is the master node's FileSettingsHealthIndicatorService.currentInfo field, but health info is actually reported from the HealthInfoCache on the health node. Therefore, whenever currentInfo changes, we send a UpdateHealthInfoCacheAction request to the health node to update its cache to match the actual value on the master node.

Data representation

The HealthInfo record is returned by health API. I've added FileSettingsHealthInfo to this structure, and used the INDETERMINATE value to when no info is available, rather than null, to round off some corner cases. To ensure consistency, I've used this same record as the internal representation inside FileSettingsHealthIndicatorService itself; prior to this PR, we had been using a number of separate volatile fields for the individual parts of the health state.

The UpdateHealthInfoCacheAction.Request record, in contrast, is used to propagate info to the health node. This is triggered by individual health-reporting components, each of which should only set its own health info and leave the others unchanged. Hence, for this record, the absence of file settings health info is represented by null.

Propagation steps

  1. File settings are changed. On the master node, FileSettingsService.completeProcessing notes success or failure, recording the updated info as a FileSettingsHealthInfo object, and then calls FileSettingsHealthIndicatorService.publish in a finally block.
  2. FileSettingsHealthIndicatorPublisherImpl.publish looks up the health node, and sends it a UpdateHealthInfoCacheAction request containing the new health info, with all other fields null to indicate we're not changing those.
  3. On the health node, the HealthInfoCache is updated with the new FileSettingsHealthInfo.

@prdoyle prdoyle added >non-issue :Core/Infra/Core Core issues without another label v9.1.0 labels Apr 25, 2025
@prdoyle prdoyle self-assigned this Apr 25, 2025
@prdoyle prdoyle requested a review from a team as a code owner April 25, 2025 14:04
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Apr 25, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@prdoyle
Copy link
Contributor Author

prdoyle commented Apr 25, 2025

There's a failure:

REPRODUCE WITH: ./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.reservedstate.service.FileSettingsServiceIT.testHealthIndicator" -Dtests.seed=CD596C989260970B -Dtests.locale=xh -Dtests.timezone=US/Hawaii -Druntime.java=24
FileSettingsServiceIT > testHealthIndicator FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([CD596C989260970B:CBC260190393BCD7]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.elasticsearch.reservedstate.service.FileSettingsServiceIT.lambda$testHealthIndicator$13(FileSettingsServiceIT.java:528)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1481)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1453)
        at org.elasticsearch.reservedstate.service.FileSettingsServiceIT.testHealthIndicator(FileSettingsServiceIT.java:528)

This passes for me though. 🤔

@prdoyle
Copy link
Contributor Author

prdoyle commented Apr 25, 2025

It has passed for me 50 times out of 50 runs.

I don't understand how it could fail intermittently, unless somehow the health propagation is taking so long that something times out.

@prdoyle prdoyle added auto-backport Automatically create backport pull requests when merged v8.19.0 labels Apr 25, 2025
@nielsbauman nielsbauman self-requested a review April 25, 2025 15:09
@nielsbauman nielsbauman added Team:Data Management Meta label for data/management team :Data Management/Health labels Apr 25, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Apr 25, 2025
Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely in the direction that I had in mind. Thanks for working on this, Patrick! I left a few comments, let me know what you think.

@prdoyle
Copy link
Contributor Author

prdoyle commented Apr 29, 2025

@nielsbauman - Sorry for the force push. History got tangled so I just used the big hammer to straighten it out.

I think you reviewed up to 936813b in effect, so the commits after that would be new to you.

Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all the comments! I posted a few more small comments but we're almost good to go :).

Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot for the iterations! 🚀

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>
@prdoyle
Copy link
Contributor Author

prdoyle commented Apr 30, 2025

@prdoyle prdoyle enabled auto-merge (squash) May 5, 2025 12:53
@prdoyle prdoyle merged commit 5df5cb8 into elastic:main May 5, 2025
16 of 17 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 127397

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged backport pending :Core/Infra/Core Core issues without another label :Data Management/Health >non-issue serverless-linked Added by automation, don't add manually Team:Core/Infra Meta label for core/infra team Team:Data Management Meta label for data/management team v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants