Skip to content

Fix RegExp.toString generating invalid RE for CHAR and CHAR_RANGE #14493

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Tasssadar
Copy link

The toString on regexes that contain a CHAR or CHAR_RANGE has been broken since 1efce54 , because ASCII character like s becomes \s, which parses as character class.

@rmuir
Copy link
Member

rmuir commented Apr 15, 2025

I like the change. We can get fancy and extend it in the future, but the toString() today does not do its one job, the output is not readable. This fixes it to be more practical and not try to escape, in unnecessary cases.

Separately: the output of .toString() is just that, for humans, and not guaranteed to be reparsable. I can practically guarantee it is a bad idea to do this.

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at tests in diffs of this change, to see why this is an easy win.

@Tasssadar
Copy link
Author

Separately: the output of .toString() is just that, for humans, and not guaranteed to be reparsable. I can practically guarantee it is a bad idea to do this.

What would be the right way to get the original input string then, please? Or do you just declare that RegExp does not support that?

Also, it's not just about readability - even as a human, if I see regex.toString() == "\\s", before this fix, I don't know whether it is s character or whitespace class.

@rmuir
Copy link
Member

rmuir commented Apr 15, 2025

What would be the right way to get the original input string then, please? Or do you just declare that RegExp does not support that?

It doesn't have a method to support this. if the feature is really needed, I would be in favor of doing this in the obvious way if possible: just holding a ref to the original string before any parsing.

Also, it's not just about readability - even as a human, if I see regex.toString() == "\\s", before this fix, I don't know whether it is s character or whitespace class.

This PR makes it better though. Personally my opinion: the entire .toString() should be rewritten from scratch. It was never properly designed to be user-friendly in any way. I still plan to merge your incremental improvement though.

For testing purposes we rely on two things:

  • Correct Parse: mostly this is done via toStringTree() which returns structured-ish format of the represenation. See TestRegExpParsing.java as example.
  • Correct Behavior: mostly this is done via Automaton classes which is unambiguous.

Assertions on the toString() are mostly just legacy and there's no guarantees about the format of that method AFAIK, so we can change it to be better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants