@@ -73,11 +73,11 @@ The :mod:`shlex` module defines the following functions:
7373The :mod: `shlex ` module defines the following class:
7474
7575
76- .. class:: shlex(instream=None, infile=None, posix=False)
76+ .. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)
7777
7878 A :class: `~shlex.shlex ` instance or subclass instance is a lexical analyzer
7979 object. The initialization argument, if present, specifies where to read
80- characters from. It must be a file-/stream-like object with
80+ characters from. It must be a file-/stream-like object with
8181 :meth: `~io.TextIOBase.read ` and :meth: `~io.TextIOBase.readline ` methods, or
8282 a string. If no argument is given, input will be taken from ``sys.stdin ``.
8383 The second optional argument is a filename string, which sets the initial
@@ -87,8 +87,19 @@ The :mod:`shlex` module defines the following class:
8787 when *posix * is not true (default), the :class: `~shlex.shlex ` instance will
8888 operate in compatibility mode. When operating in POSIX mode,
8989 :class: `~shlex.shlex ` will try to be as close as possible to the POSIX shell
90- parsing rules.
91-
90+ parsing rules. The *punctuation_chars* argument provides a way to make the
91+ behaviour even closer to how real shells parse. This can take a number of
92+ values: the default value, ``False``, preserves the behaviour seen under
93+ Python 3.5 and earlier. If set to ``True``, then parsing of the characters
94+ ``();<>|&`` is changed: any run of these characters (considered punctuation
95+ characters) is returned as a single token. If set to a non-empty string of
96+ characters, those characters will be used as the punctuation characters. Any
97+ characters in the :attr:`wordchars` attribute that appear in
98+ *punctuation_chars* will be removed from :attr:`wordchars`. See
99+ :ref:`improved-shell-compatibility` for more information.
100+
101+ .. versionchanged:: 3.6
102+ The *punctuation_chars* parameter was added.
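As a rough illustration of the three kinds of value described above, the following sketch tokenizes one string with the default, with ``True``, and with a custom string. The sample text and variable names are chosen here for illustration and are not part of the patch.

```python
import shlex

text = "a && b || c"

# Default (punctuation_chars=False): each special character is its own token.
default_tokens = list(shlex.shlex(text))

# True: runs of the characters ();<>|& are grouped into single tokens.
shell_tokens = list(shlex.shlex(text, punctuation_chars=True))

# Custom string: only '&' is treated as punctuation, so '|' is not grouped.
custom_tokens = list(shlex.shlex(text, punctuation_chars="&"))

print(default_tokens)
print(shell_tokens)
print(custom_tokens)
```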
92103
93104.. seealso ::
94105
@@ -191,7 +202,13 @@ variables which either control lexical analysis or can be used for debugging:
191202.. attribute :: shlex.wordchars
192203
193204 The string of characters that will accumulate into multi-character tokens. By
194- default, includes all ASCII alphanumerics and underscore.
205+ default, includes all ASCII alphanumerics and underscore. In POSIX mode, the
206+ accented characters in the Latin-1 set are also included. If
207+ :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
208+ appear in filename specifications and command line parameters, will also be
209+ included in this attribute, and any characters which appear in
210+ ``punctuation_chars`` will be removed from ``wordchars`` if they are present
211+ there.
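A small sketch of how ``wordchars`` changes when ``punctuation_chars`` is given. The empty input strings and the variable names are illustrative assumptions; only the constructor argument and the ``wordchars`` attribute come from the text above.

```python
import shlex

plain = shlex.shlex("")                             # no punctuation_chars
shellish = shlex.shlex("", punctuation_chars=True)  # default punctuation set
custom = shlex.shlex("", punctuation_chars="=")     # '=' treated as punctuation

# Characters that were added to wordchars because punctuation_chars was set.
added = set(shellish.wordchars) - set(plain.wordchars)
print(sorted(added))

# '=' is removed from wordchars again because it is a punctuation character,
# while the other added characters (such as '-') stay in.
print("=" in custom.wordchars, "-" in custom.wordchars)
```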
195212
196213
197214.. attribute :: shlex.whitespace
@@ -222,9 +239,13 @@ variables which either control lexical analysis or can be used for debugging:
222239
223240.. attribute :: shlex.whitespace_split
224241
225- If ``True ``, tokens will only be split in whitespaces. This is useful, for
242+ If ``True ``, tokens will only be split in whitespaces. This is useful, for
226243 example, for parsing command lines with :class: `~shlex.shlex `, getting
227- tokens in a similar way to shell arguments.
244+ tokens in a similar way to shell arguments. If this attribute is ``True``,
245+ :attr:`punctuation_chars` will have no effect, and splitting will happen
246+ only on whitespace. When using :attr:`punctuation_chars`, which is
247+ intended to provide parsing closer to that implemented by shells, it is
248+ advisable to leave ``whitespace_split`` as ``False`` (the default value).
228249
229250
230251.. attribute :: shlex.infile
@@ -245,10 +266,9 @@ variables which either control lexical analysis or can be used for debugging:
245266 This attribute is ``None `` by default. If you assign a string to it, that
246267 string will be recognized as a lexical-level inclusion request similar to the
247268 ``source `` keyword in various shells. That is, the immediately following token
248- will be opened as a filename and input will
249- be taken from that stream until EOF, at which
250- point the :meth: `~io.IOBase.close ` method of that stream will be called and
251- the input source will again become the original input stream. Source
269+ will be opened as a filename and input will be taken from that stream until
270+ EOF, at which point the :meth: `~io.IOBase.close ` method of that stream will be
271+ called and the input source will again become the original input stream. Source
252272 requests may be stacked any number of levels deep.
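The inclusion mechanism described above can be sketched as follows. The file name ``extra.txt``, its contents, and the keyword ``source`` are assumptions made for this example. The path is double-quoted so that it arrives as a single token in the default (non-POSIX) mode.

```python
import os
import shlex
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file to be included; its tokens are spliced into the stream.
    included = os.path.join(tmpdir, "extra.txt")
    with open(included, "w") as f:
        f.write("x y\n")

    lexer = shlex.shlex('a source "%s" b' % included)
    lexer.source = "source"   # the token after 'source' is opened as a file
    tokens = list(lexer)      # the included stream is closed at its EOF

print(tokens)
```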
253273
254274
@@ -275,6 +295,16 @@ variables which either control lexical analysis or can be used for debugging:
275295 (``'' ``), in non-POSIX mode, and to ``None `` in POSIX mode.
276296
277297
298+ .. attribute:: shlex.punctuation_chars
299+
300+ Characters that will be considered punctuation. Runs of punctuation
301+ characters will be returned as a single token. However, note that no
302+ semantic validity checking will be performed: for example, ``'>>>'`` could be
303+ returned as a token, even though it may not be recognised as such by shells.
304+
305+ .. versionadded:: 3.6
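Because no validity checking is performed, callers may want to check punctuation tokens themselves. The sketch below shows one possible approach. The ``KNOWN_OPERATORS`` set is a hypothetical, deliberately incomplete list written for this example, not something provided by :mod:`shlex`.

```python
import shlex

# Hypothetical allow-list of operators; real shells accept more than these.
KNOWN_OPERATORS = {"&&", "||", ";", "|", "&", "<", ">", ">>", "(", ")"}

lexer = shlex.shlex("echo hi >>> out.txt", punctuation_chars=True)
tokens = list(lexer)

# A token made of punctuation characters that is not a known operator
# is flagged; shlex itself returns it without complaint.
suspicious = [
    tok for tok in tokens
    if tok[0] in lexer.punctuation_chars and tok not in KNOWN_OPERATORS
]

print(tokens)
print(suspicious)
```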
306+
307+
278308.. _shlex-parsing-rules :
279309
280310Parsing Rules
@@ -327,3 +357,62 @@ following parsing rules.
327357* EOF is signaled with a :const: `None ` value;
328358
329359* Quoted empty strings (``'' ``) are allowed.
360+
361+ .. _improved-shell-compatibility:
362+
363+ Improved Compatibility with Shells
364+ ----------------------------------
365+
366+ .. versionadded:: 3.6
367+
368+ The :class:`shlex` class provides compatibility with the parsing performed by
369+ common Unix shells like ``bash``, ``dash``, and ``sh``. To take advantage of
370+ this compatibility, specify the ``punctuation_chars`` argument in the
371+ constructor. This defaults to ``False``, which preserves pre-3.6 behaviour.
372+ However, if it is set to ``True``, then parsing of the characters ``();<>|&``
373+ is changed: any run of these characters is returned as a single token. While
374+ this is short of a full parser for shells (which would be out of scope for the
375+ standard library, given the multiplicity of shells out there), it does allow
376+ you to perform processing of command lines more easily than you could
377+ otherwise. To illustrate, you can see the difference in the following snippet::
378+
379+ import shlex
380+
381+ for punct in (False, True):
382+     if punct:
383+         message = 'New'
384+     else:
385+         message = 'Old'
386+     text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
387+     s = shlex.shlex(text, punctuation_chars=punct)
388+     print('%s: %s' % (message, list(s)))
389+
390+ which prints out::
391+
392+ Old: ['a', '&', '&', 'b', ';', 'c', '&', '&', 'd', '|', '|', 'e', ';', 'f', '>', "'abc'", ';', '(', 'def', '"ghi"', ')']
393+ New: ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', "'abc'", ';', '(', 'def', '"ghi"', ')']
394+
395+ Of course, tokens may be returned which are not valid for shells, and you'll
396+ need to implement your own error checks on the returned tokens.
397+
398+ Instead of passing ``True`` as the value for the *punctuation_chars* parameter,
399+ you can pass a string with specific characters, which will be used to determine
400+ which characters constitute punctuation. For example::
401+
402+ >>> import shlex
403+ >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
404+ >>> list(s)
405+ ['a', '&', '&', 'b', '||', 'c']
406+
407+ .. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
408+ attribute is augmented with the characters ``~-./*?=``. That is because these
409+ characters can appear in file names (including wildcards) and command-line
410+ arguments (e.g. ``--color=auto``). Hence::
411+
412+ >>> import shlex
413+ >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
414+ ... punctuation_chars=True)
415+ >>> list(s)
416+ ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']
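One way this tokenization might be put to use is to group a command line into simple commands separated by operator tokens. The helper ``split_on_operators`` below is a hypothetical sketch written for this example, not part of :mod:`shlex`. It combines ``posix=True`` with ``punctuation_chars=True`` and does not try to distinguish a quoted operator from an unquoted one.

```python
import shlex

def split_on_operators(command_line):
    """Hypothetical helper: split words into groups at punctuation tokens."""
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    groups = [[]]     # words of each simple command
    operators = []    # operator tokens found between the groups
    for token in lexer:
        # Tokens starting with a punctuation character are treated as operators.
        if token and token[0] in lexer.punctuation_chars:
            operators.append(token)
            groups.append([])
        else:
            groups[-1].append(token)
    return groups, operators

groups, operators = split_on_operators("ls -l 'my dir' | wc -l > out.txt")
print(groups)
print(operators)
```

As noted above, the operators returned this way still need to be validated by the caller.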
417+
418+