[WIP] sh: implement non-nested brace expansion
Needs ReviewPublic
Actions

Authored by pstef on Sun, Mar 2, 10:39 AM.

Details

Reviewers

Summary

This works, but it doesn't feel as integrated into the parser as it could.
I was trying to use existing string-manipulation functions and macros in order to limit or eliminate all the allocations, but I don't understand this infrastructure well enough so it either didn't work or it crashed.
More documentation and test cases are planned.

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

pstef created this revision.Sun, Mar 2, 10:39 AM

Herald added subscribers: ziaee, imp. · View Herald TranscriptSun, Mar 2, 10:39 AM

pstef requested review of this revision.Sun, Mar 2, 10:39 AM

This adds a non-POSIX feature (although it is pretty common); is the goal to make sh a mostly feature-complete alternative to bash?

I notice the exact behaviour of brace expansion differs considerably between bash, ksh93 and zsh. In bash, it appears to be performed as part of parsing, and in ksh93, it appears to be performed just before pathname generation (so {x,y} from parameter expansion might generate x and y words). Also, I notice that bash does brace expansion in the command {if,1} (reporting that if was not found), while zsh does not. On the other hand, both bash and zsh do brace expansion in i{f,1} and agree that this does not generate a keyword (witnessed by the error that if was not found).

bin/sh/parser.c
837	The parser should use classification macros from syntax.c such as `is_alpha` and not the ones from ctype.h, so it is not inappropriately affected by locales. Where the parser depends on the locale, this should be the locale that was in effect when the shell started (e.g. `initial_localeisutf8`), so there are no weird dependencies on the parse/execute split. By the way, `isalpha` requires a `char` to be cast to `unsigned char` first, but `is_alpha` does not.
931–934	The code in `readtoken1` has already replaced quoting characters and dollar signs with `CTL` sequences at this point. Any backslashes and dollar signs left are literals.
959	Out of memory should fail the parse, not continue without brace expansion. Also, keep in mind the `SIGINT` handler that calls `longjmp` if not disabled. Since this is a temporary allocation, the `parser_temp` mechanism seems suitable to fix both problems. It will throw an exception (`longjmp`) if the allocation fails.
1037	The `!quoteflag` above means that the code will not do brace expansion if there are any quoting characters in the word, which does not match other shells. Fixing this may not be so easy.
1037	`wordtext` can't be null for a `TWORD`, so it should not be checked. Instead, a check for a `{` seems appropriate, so as to avoid a memstream allocation for every word.
1050	Reprocessing the intermediate representation seems difficult to get right, since there is no obvious way to go back to a shell language word from the intermediate representation. Also, the memory allocated by the memstream is leaked in both `expanded` and `!expanded` cases.

In D49201#1122784, @jilles wrote:

This adds a non-POSIX feature (although it is pretty common); is the goal to make sh a mostly feature-complete alternative to bash?

I use FreeBSD and sh daily. I don't like the common argument "install bash from ports", this is impractical in some cases (ephemeral instances, or someone else's hosts, for two examples). Recently I wanted to put something like find /usr{,/local}/bin ... into a crontab. I don't remember the details.

In bash, it appears to be performed as part of parsing, and in ksh93, it appears to be performed just before pathname generation (so {x,y} from parameter expansion might generate x and y words). Also, I notice that bash does brace expansion in the command {if,1} (reporting that if was not found), while zsh does not. On the other hand, both bash and zsh do brace expansion in i{f,1} and agree that this does not generate a keyword (witnessed by the error that if was not found).

I don't know what bash really does, but the manual page claims that

Brace expansion is performed before any other expansions, and any
characters special to other expansions are preserved in the result.  It
is strictly textual.  Bash does not apply any syntactic interpretation
to the context of the expansion or the text between the braces.

At this point I was seeking advice on how to integrate this.

I actually don't like the memstream approach, but it works and seems to be a good starting point for discussion to me. I would be glad to replace it with something lighter, but I don't know how.

The overly strict parsing is inherited from the design phase where I started with just the parser, before I tried to integrate it with /bin/sh. It may be doing too much checking in this context, but it was good for fuzzing. I'm not sure what to do about it here, maybe the call to expandbrace will happen somewhere else in the end, so these checks will suddenly be useful again?

In D49201#1122836, @pstef wrote:

In D49201#1122784, @jilles wrote:

This adds a non-POSIX feature (although it is pretty common); is the goal to make sh a mostly feature-complete alternative to bash?

I use FreeBSD and sh daily. I don't like the common argument "install bash from ports", this is impractical in some cases (ephemeral instances, or someone else's hosts, for two examples). Recently I wanted to put something like find /usr{,/local}/bin ... into a crontab. I don't remember the details.

Hmm, so we are moving more in the direction of bash. Traditionally, this resulted in some objections on the mailing lists but perhaps opinions have changed.

To make it work in a crontab you'll likely also need to enable brace expansion by default. I guess this is OK because of the language in POSIX that requires applications to quote literal {.

In bash, it appears to be performed as part of parsing, and in ksh93, it appears to be performed just before pathname generation (so {x,y} from parameter expansion might generate x and y words). Also, I notice that bash does brace expansion in the command {if,1} (reporting that if was not found), while zsh does not. On the other hand, both bash and zsh do brace expansion in i{f,1} and agree that this does not generate a keyword (witnessed by the error that if was not found).

I don't know what bash really does, but the manual page claims that

Brace expansion is performed before any other expansions, and any
characters special to other expansions are preserved in the result.  It
is strictly textual.  Bash does not apply any syntactic interpretation
to the context of the expansion or the text between the braces.

At this point I was seeking advice on how to integrate this.

I actually don't like the memstream approach, but it works and seems to be a good starting point for discussion to me. I would be glad to replace it with something lighter, but I don't know how.

The overly strict parsing is inherited from the design phase where I started with just the parser, before I tried to integrate it with /bin/sh. It may be doing too much checking in this context, but it was good for fuzzing. I'm not sure what to do about it here, maybe the call to expandbrace will happen somewhere else in the end, so these checks will suddenly be useful again?

Looking more closely, bash's set -o braceexpand and set +o braceexpand will affect the behaviour of an existing function definition (like f() { echo {a,b}; }), so apparently bash performs brace expansion just before other expansions.

In the sh code, this would imply adjusting mksyntax so CTLESC bytes will be added before quoted {, , and }, and doing the new processing in expand.c if EXP_SPLIT is set.

In D49201#1124014, @jilles wrote:

Hmm, so we are moving more in the direction of bash. Traditionally, this resulted in some objections on the mailing lists but perhaps opinions have changed.

As you've noted - there are other shells that support this feature. They slightly differ in corner cases, and I don't care about what bash does at all, it's the basic case where they all agree and is the most used one that I care about. We can mimic what zsh does in all cases, if that's the consensus, I have no strong feelings about this. My preference would be to pick the least invasive implementation and reason about the behavior in corner cases post factum. For example, I believe FreeBSD sh should not support nested brace expansion as it would require significantly more code than this simplified implementation (or a similar one) requires. I think this is the main point of contention with users - they protest against what they perceive as bloat.