Examples#
Here are some examples that use tokenize.
To simplify the examples, the following helper function is used.
>>> import tokenize
>>> import io
>>> def tokenize_string(s):
... """
... Generator of tokens from the string s
... """
... return tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline)
Processing Tokens#
These examples show different ways that tokens can be processed.
inside_string()#
inside_string(s, row, col) takes the Python code s and determines whether the position (row, col) is inside a STRING token. To simplify the example for the purposes of illustration, inside_string returns True even if (row, col) is on the quote delimiter or prefix character (like r or f) of the string.
>>> def inside_string(s, row, col):
... """
... Returns True if (row, col) is inside a string in s, False otherwise.
...
... row starts at 1 and col starts at 0.
... """
... try:
... for toknum, tokval, start, end, _ in tokenize_string(s):
... if toknum == tokenize.ERRORTOKEN and tokval[0] in '"\'':
... # There is an unclosed string. We haven't gotten to the
... # position yet, so it must be inside this string
... return True
... if start <= (row, col) <= end:
... return toknum == tokenize.STRING
... except tokenize.TokenError as e:
... # Uncompleted docstring or braces.
... # 'string' in the exception means uncompleted multi-line string
... return 'string' in e.args[0]
...
... return False
Let’s walk through the code.
We don’t want the function to raise TokenError on uncompleted delimiters or unclosed multi-line strings, so we wrap the loop in a try block.
try:
Next we have the main loop. We don’t use the line attribute, so we use _ instead to indicate it isn’t used.
for toknum, tokval, start, end, _ in tokenize_string(s):
The idea is to loop through the tokens until we find one that (row, col) is contained in (that is, it is between the token’s start and end). This may not actually happen, for instance, if (row, col) is inside whitespace that isn’t tokenized.
The first thing to check for is an ERRORTOKEN caused by an unclosed single-quoted string. If an unclosed single-quoted (not multi-line) string is encountered, that is, one that is closed by a newline, like "an unclosed string, and we haven’t reached our (row, col) yet, then we assume our (row, col) is inside this unclosed string. This implicitly makes the rest of the document part of the unclosed string. We could also easily modify this to only assume the rest of the line is inside the unclosed string.
if toknum == tokenize.ERRORTOKEN and tokval[0] in '"\'':
    # There is an unclosed string. We haven't gotten to the
    # position yet, so it must be inside this string
    return True
Now we have the main condition. If (row, col) is between the start and end of a token, we have gone as far as we need to.
if start <= (row, col) <= end:
That token is either a STRING token, in which case we should return True, or it is another token type, which means our (row, col) is not on a STRING token and we can return False. This can be written as simply:
return toknum == tokenize.STRING
Now the exceptional case. If we see a TokenError, we don’t want the function to fail.
except tokenize.TokenError as e:
Remember that there are two possibilities for a TokenError. If 'statement' is in the error message, there is an unclosed brace somewhere. This case only happens when tokenize() has reached the end of the token stream, so if the above checks haven’t returned True yet, then (row, col) must not be inside a STRING token, so we should return False.
If 'string' is in the error message, there is an unclosed multi-line string. In this case, we want to check if we are inside this string. We can check the start of the multi-line string in the TokenError. Remember that the message is in e.args[0] and the start is in e.args[1]. So we should return True in this case if (row, col) is after e.args[1], and False otherwise.
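To make the two cases concrete, here is a small illustrative helper (show_token_error is just a throwaway name for this demo); the exact messages come from the pure-Python tokenize module on the CPython versions these examples target and may differ on newer versions:
>>> def show_token_error(s):
...     try:
...         list(tokenize_string(s))
...     except tokenize.TokenError as e:
...         print(e.args)
>>> show_token_error("f(1, 2")  # unclosed brace
('EOF in multi-line statement', (2, 0))
>>> show_token_error("'''an unclosed multi-line string")  # unclosed multi-line string
('EOF in multi-line string', (1, 0))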
This logic can all be written succinctly as
# Uncompleted docstring or braces.
# 'string' in the exception means uncompleted multi-line string
return 'string' in e.args[0] and (row, col) >= e.args[1]
Finally, if we reach the end of the token stream without returning anything, it means we never found a STRING token that is on our (row, col).
return False
Here are some test cases to verify the code is correct.
>>> # Basic test. Remember that lines start at 1 and columns start at 0.
>>> inside_string("print('a string')", 1, 4) # 't' in print
False
>>> inside_string("print('a string')", 1, 9) # 'a' in 'a string'
True
>>> # Note: because our input uses """, the first line is empty
>>> inside_string("""
... "an unclosed single quote string
... 1 + 1
... """, 2, 4) # 'u' in 'unclosed'
True
>>> # Check for whitespace right before TokenError
>>> inside_string("""
... '''an unclosed multi-line string
... 1 + 1
... """, 1, 0) # the space before '''
False
>>> # Check inside an unclosed multi-line string
>>> inside_string("""
... '''an unclosed multi-line string
... 1 + 1
... """, 1, 4) # 'a' in 'an'
True
>>> # Check for whitespace between tokens
>>> inside_string("""
... def hello(name):
...     return 'hello %s' % name
... """, 2, 3) # ' ' before hello(name)
False
>>> # Check TokenError from unclosed delimiters
>>> inside_string("""
... def hello(name:
...     return 'hello %s' % name
... """, 4, 0) # Last character in the input
False
>>> inside_string("""
... def hello(name:
...     return 'hello %s' % name
... """, 3, 12) # 'h' in 'hello'
True
Exercises#
1. Modify inside_string to return False if (row, col) is on a prefix or quote character. For instance, in rb'abc' it should only return True on the abc part of the string. (This is more challenging than it may sound. Be sure to write lots of test cases.)
2. Right now, if (row, col) is a whitespace character that is not tokenized, the loop will pass over it and tokenize the entire input before returning False. Make this more efficient.
3. Only consider characters to be inside an unclosed single-quoted string if they are on the same line.
4. Write a version of inside_string() using parso’s tokenizer (parso.python.tokenize.tokenize()).
line_numbers()#
Let’s go back to our motivating example from the tokenize vs. Alternatives section, a function that prints the line numbers of every function definition. Our function looked like this (rewritten to use our tokenize_string() helper):
>>> def line_numbers(s):
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.NAME and tok.string == 'def':
...             print(tok.start[0])
As we noted, this function works, but it doesn’t handle any of our error conditions. Looking at our exceptions list, SyntaxError and IndentationError are unrecoverable, so we will just let them bubble up. However, TokenError simply means that the input had an unclosed brace or multi-line string. In the former case, the tokenization reaches the end of the input before the exception is raised, and in the latter case, the remainder of the input is inside the unclosed multi-line string, so we can safely ignore TokenError in either case.
>>> def line_numbers(s):
...     try:
...         for tok in tokenize_string(s):
...             if tok.type == tokenize.NAME and tok.string == 'def':
...                 print(tok.start[0])
...     except tokenize.TokenError:
...         pass
Finally, let’s consider ERRORTOKEN due to unclosed single-quoted strings. Our motivation for using tokenize to solve this problem is to handle incomplete or invalid Python (otherwise, we should use the ast implementation, which is much simpler). Thus, it makes sense to treat unclosed single-quoted strings as if they were closed at the end of the line.
>>> def line_numbers(s):
...     try:
...         skip_line = -1
...         for tok in tokenize_string(s):
...             if tok.start[0] == skip_line:
...                 continue
...             elif tok.start[0] >= skip_line:
...                 # reset skip_line
...                 skip_line = -1
...             if tok.type == tokenize.ERRORTOKEN and tok.string in '"\'':
...                 # Unclosed single-quoted string. Ignore the rest of this line
...                 skip_line = tok.start[0]
...                 continue
...             if tok.type == tokenize.NAME and tok.string == 'def':
...                 print(tok.start[0])
...     except tokenize.TokenError:
...         pass
Here are our original test cases, plus some additional ones for our added behavior.
>>> code = """\
... def f(x):
...     pass
...
... class MyClass:
...     def g(self):
...         pass
... """
>>> line_numbers(code)
1
5
>>> code = '''\
... FUNCTION_SKELETON = """
... def {name}({args}):
...     {body}
... """
... '''
>>> line_numbers(code) # no output
>>> code = """\
... def f():
...     '''
...     an unclosed docstring.
... """
>>> line_numbers(code)
1
>>> code = """\
... def f(: # Unclosed parenthesis
...     pass
... """
>>> line_numbers(code)
1
>>> code = """\
... def f():
... "an unclosed single-quoted string. It should not match this def
... def g():
...     pass
... """
>>> line_numbers(code)
1
3
Indentation Level#
INDENT and DEDENT tokens are always balanced in the token stream (unless there is an IndentationError), so it is easy to detect the indentation level of a block of code by incrementing and decrementing a counter.
>>> def indentation_level(s, row, col):
... """
... Returns the indentation level of the code at (row, col)
... """
... level = 0
... try:
... for tok in tokenize_string(s):
... if tok.start >= (row, col):
... return level
... if tok.type == tokenize.INDENT:
... level += 1
... if tok.type == tokenize.DEDENT:
... level -= 1
... except tokenize.TokenError:
... # Ignore TokenError (we don't care about incomplete code)
... pass
... return level
To demonstrate the function, let’s apply it to itself.
>>> indentation_level_source = '''\
... def indentation_level(s, row, col):
... """
... Returns the indentation level of the code at (row, col)
... """
... level = 0
... try:
... for tok in tokenize_string(s):
... if tok.start >= (row, col):
... return level
... if tok.type == tokenize.INDENT:
... level += 1
... if tok.type == tokenize.DEDENT:
... level -= 1
... except tokenize.TokenError:
... # Ignore TokenError (we don't care about incomplete code)
... pass
... return level
... '''
>>> # Use a large column number so it always looks at the fully indented line.
>>> for i in range(1, indentation_level_source.count('\n') + 2):
...     print(indentation_level(indentation_level_source, i, 100), indentation_level_source.split('\n')[i-1])
0 def indentation_level(s, row, col):
1 """
1 Returns the indentation level of the code at (row, col)
1 """
1 level = 0
1 try:
2 for tok in tokenize_string(s):
3 if tok.start >= (row, col):
4 return level
3 if tok.type == tokenize.INDENT:
4 level += 1
3 if tok.type == tokenize.DEDENT:
4 level -= 1
1 except tokenize.TokenError:
1 # Ignore TokenError (we don't care about incomplete code)
2 pass
1 return level
0
An oddity worth mentioning: the comment near the end of the function is not considered indented more than the except line. Remember that the C parser, which tokenize is based on, ignores comments, so true indentation (INDENT tokens) must occur on lines with real code.
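To see this directly, here is a small check (not part of the original walkthrough): the comment on row 2 only produces COMMENT and NL tokens, and the INDENT for the block is attached to the pass on row 3.
>>> for tok in tokenize_string('if True:\n    # a comment\n    pass\n'):
...     if tok.type == tokenize.INDENT:
...         print('INDENT starts at row', tok.start[0])
INDENT starts at row 3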
Mismatched Parentheses#
This example shows how to use a very important tool when processing tokens, a stack. A stack is a data structure that operates as Last In, First Out. A stack has two basic operations, push, which adds something to the stack, and pop, which removes the most recently added item.
In Python, a stack is usually implemented using a list. The push method is list.append() and the pop method is list.pop().
>>> stack = []
>>> stack.append(1)
>>> stack.append(2)
>>> stack.pop()
2
>>> stack.append(3)
>>> stack.pop()
3
>>> stack.pop()
1
Stacks are important because they allow keeping track of nested structures. Going down one level of nesting can be represented by pushing something on the stack, and going up can be represented by popping. The stack allows keeping track of the opening of the nested structure to ensure it properly matches the closing.
In particular, for a set of parentheses, such as (((())())()), a stack can be used to check if they are properly balanced. Every time we encounter an opening parenthesis, we push it on the stack, and every time we encounter a closing parenthesis, we pop the stack. If the stack is empty when we try to pop it, or if it still has items when we finish processing all the parentheses, it means they are not balanced. Otherwise, they are. In this case, we could simply use a counter like we did for the previous example, and make sure it doesn’t go negative and ends at 0, but a stack is required to handle more than one type of brace, like (())([])[], which, of course, is the case in Python.
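As an aside, here is a sketch (not from the original text; balanced_one_kind and balanced_many_kinds are made-up helper names) contrasting the counter approach with the stack approach on plain strings of braces:
>>> def balanced_one_kind(s):
...     """A counter suffices when there is only one kind of parenthesis."""
...     count = 0
...     for c in s:
...         if c == '(':
...             count += 1
...         elif c == ')':
...             count -= 1
...             if count < 0:
...                 return False
...     return count == 0
>>> def balanced_many_kinds(s):
...     """A stack is needed to match several kinds of braces."""
...     pairs = {'(': ')', '[': ']', '{': '}'}
...     stack = []
...     for c in s:
...         if c in pairs:
...             stack.append(c)
...         elif c in pairs.values():
...             if not stack or pairs[stack.pop()] != c:
...                 return False
...     return not stack
>>> balanced_one_kind('(((())())())')
True
>>> balanced_many_kinds('(])')
False
>>> balanced_many_kinds('(())([])[]')
True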
Here is an example showing how to use a stack to find all the mismatched parentheses or braces in a piece of Python code. The function handles (), [], and {} type braces.
>>> braces = {
...     tokenize.LPAR: tokenize.RPAR, # ()
...     tokenize.LSQB: tokenize.RSQB, # []
...     tokenize.LBRACE: tokenize.RBRACE, # {}
... }
...
>>> def matching_parens(s):
... """
... Find matching and mismatching parentheses and braces
...
... s should be a string of (partial) Python code.
...
... Returns a tuple (matching, mismatching).
...
... matching is a list of tuples of matching TokenInfo objects for
... matching parentheses/braces.
...
... mismatching is a list of TokenInfo objects for mismatching
... parentheses/braces.
... """
... stack = []
... matching = []
... mismatching = []
... try:
... for tok in tokenize_string(s):
... exact_type = tok.exact_type
... if exact_type == tokenize.ERRORTOKEN and tok.string[0] in '"\'':
... # There is an unclosed string. If we do not break here,
... # tokenize will tokenize the stuff after the string
... # delimiter.
... break
... elif exact_type in braces:
... stack.append(tok)
... elif exact_type in braces.values():
... if not stack:
... mismatching.append(tok)
... continue
... prevtok = stack.pop()
... if braces[prevtok.exact_type] == exact_type:
... matching.append((prevtok, tok))
... else:
... mismatching.insert(0, prevtok)
... mismatching.append(tok)
... else:
... continue
... except tokenize.TokenError:
... # Either unclosed brace (what we are trying to handle here), or
... # unclosed multi-line string (which we don't care about).
... pass
...
... matching.reverse()
...
... # Anything remaining on the stack is mismatching. Keep the mismatching
... # list in order.
... stack.reverse()
... mismatching = stack + mismatching
... return matching, mismatching
>>> matching, mismatching = matching_parens("('a', {(1, 2)}, ]")
>>> matching
[(TokenInfo(..., string='{', ...), TokenInfo(..., string='}', ...)),
(TokenInfo(..., string='(', ...), TokenInfo(..., string=')', ...))]
>>> mismatching
[TokenInfo(..., string='(', ...), TokenInfo(..., string=']', ...)]
Exercise#
Add a flag, allow_intermediary_mismatches, which when True allows an opening brace to still be considered matching if it is closed with the wrong brace but later closed with the correct brace (False would give the current behavior, that is, once an opening brace is closed with the wrong brace, neither it nor any unclosed braces before it can be matched).
For example, consider '[ { ] }'. Currently, all the braces are considered mismatched.
>>> matching, mismatching = matching_parens('[ { ] }')
>>> matching
[]
>>> mismatching
[TokenInfo(..., string='[', ...),
TokenInfo(..., string='{', ...),
TokenInfo(..., string=']', ...),
TokenInfo(..., string='}', ...)]
With allow_intermediary_mismatches set to True, the { and } should be considered matching.
>>> matching, mismatching = matching_parens('[ { ] }',
... allow_intermediary_mismatches=True)
>>> matching
[(TokenInfo(..., string='{', ...),
TokenInfo(..., string='}', ...))]
>>> mismatching
[TokenInfo(..., string='[', ...),
TokenInfo(..., string=']', ...)]
Furthermore, with '[ { ] } ]' only the middle ] would be considered mismatched (with the current version, all would be mismatched).
>>> matching, mismatching = matching_parens('[ { ] } ]',
... allow_intermediary_mismatches=True)
>>> matching
[(TokenInfo(..., string='[', ...), TokenInfo(..., string=']', start=(1, 8), ...)),
(TokenInfo(..., string='{', ...), TokenInfo(..., string='}', ...))]
>>> mismatching
[TokenInfo(..., string=']', start=(1, 4), ...)]
>>> # The current version, which would be allow_intermediary_mismatches=False
>>> matching, mismatching = matching_parens('[ { ] } ]')
>>> matching
[]
>>> mismatching
[TokenInfo(..., string='[', ...),
TokenInfo(..., string='{', ...),
TokenInfo(..., string=']', ...),
TokenInfo(..., string='}', ...),
TokenInfo(..., string=']', ...)]
The current behavior (allow_intermediary_mismatches=False) is the more technically correct version, but allow_intermediary_mismatches=True would provide more useful feedback for applications that might use this function to highlight mismatching braces, as it would be more likely to highlight only the “mistake” braces.
Since this exercise is relatively difficult, I’m providing the solution. I recommend trying to solve it yourself first, as it will really force you to understand how the function works.
No really, do try it yourself first. At least think about it.
Replace
mismatching.insert(0, prevtok)
mismatching.append(tok)
with
if allow_intermediary_mismatches:
    stack.append(prevtok)
else:
    mismatching.insert(0, prevtok)
    mismatching.append(tok)
In this code block, tok is a closing brace and prevtok is the most recently found opening brace (stack.pop()). Under the current code, we append both braces to the mismatching list (keeping their order), and we continue to do that with allow_intermediary_mismatches=False. However, if allow_intermediary_mismatches=True, we instead put the prevtok back on the stack, and still put the tok in the mismatching list. This allows prevtok to be matched by a closing brace later.
For example, suppose we have ( } ). We first append ( to the stack, so the stack is ['(']. Then we get to }. We pop ( from the stack, and see that it doesn’t match. If allow_intermediary_mismatches=False, we consider these both to be mismatched, and add them to the mismatching list in the correct order (['(', '}']). If allow_intermediary_mismatches=True, though, we only add '}' to the mismatching list and put ( back on the stack.
Then we get to ). In the allow_intermediary_mismatches=False case, the stack will be empty, so it will not be considered matching, and will thus be placed in the mismatching list (the if not stack: block prior to the code we modified). In the allow_intermediary_mismatches=True case, the stack is ['('], so prevtok will be (, which matches the ), so they are both put in the matching list.
Modifying Tokens#
These examples show some ways that you can modify the token stream.
The general pattern we will apply here is to get the token stream from tokenize(), modify it in some way, and convert it back to a bytes string with untokenize().
When new tokens are added, untokenize() does not maintain whitespace between tokens in a human-readable way. Doing this is possible, by keeping track of column offsets, but we will not bother with it here except where it is convenient. See the discussion in the untokenize() section.
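As a quick sanity check of this pattern, an unmodified stream of 5-tuples round-trips through untokenize() (round-tripping is the documented behavior of tokenize()/untokenize(); this little demo is not part of the original text):
>>> toks = list(tokenize_string('x**2 + 1\n'))
>>> tokenize.untokenize(toks).decode('utf-8')
'x**2 + 1\n'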
Converting ^ to **#
Python’s syntax uses ** for exponentiation, although many might expect it to use ^ instead. ^ is actually the XOR operator.
>>> bin(0b101 ^ 0b001)
'0b100'
Suppose you don’t care about XOR, and want to allow ^ to represent exponentiation. You might think to use the ast module and replace BitXor nodes with Pow, but this will not work, because ^ has a different precedence than **:
>>> import ast
>>> ast.dump(ast.parse('x**2 + 1'))
"Module(body=[Expr(value=BinOp(left=BinOp(left=Name(id='x', ctx=Load()), op=Pow(), right=Constant(value=2)), op=Add(), right=Constant(value=1)))], type_ignores=[])"
>>> ast.dump(ast.parse('x^2 + 1'))
"Module(body=[Expr(value=BinOp(left=Name(id='x', ctx=Load()), op=BitXor(), right=BinOp(left=Constant(value=2), op=Add(), right=Constant(value=1))))], type_ignores=[])"
This is difficult to read, but it basically says that x**2 + 1 is parsed like (x**2) + 1 and x^2 + 1 is parsed like x^(2 + 1). There’s no way to distinguish the two in the AST representation, because it does not keep track of redundant parentheses.
We could do a simple s.replace('^', '**'), but this would also replace any occurrences of ^ in strings and comments.
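For instance, a naive replacement also rewrites the ^ inside the string literal and the comment here (a small illustration):
>>> "print('2^3') # not 2^3".replace('^', '**')
"print('2**3') # not 2**3"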
Instead, we can use tokenize. The replacement is quite easy to do:
>>> def xor_to_pow(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.exact_type == tokenize.CIRCUMFLEX: # CIRCUMFLEX is ^
...             result.append((tokenize.OP, '**'))
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...
>>> xor_to_pow('x^2 + 1')
'x**2 +1 '
Because we are replacing a 1-character token with a 2-character token, untokenize() removes the original whitespace and replaces it with its own. An exercise for the reader is to redefine the column offsets for the new token and all subsequent tokens on that line to avoid this issue.
Wrapping floats with decimal.Decimal#
This example is modified from the example in the standard library docs for tokenize. It is a good example for modifying tokens because the logic is not too complex, and it is something that is not possible to do with other tools such as the ast module, because ast does not keep the full precision of floats as they are in the input.
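To see why ast cannot help here, note that a float literal has already been rounded to a C double by the time it reaches the AST (a quick illustration, not part of the original example):
>>> import ast
>>> ast.literal_eval('1.000000000000000000000000000000001')
1.0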
>>> def float_to_decimal(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         # A float is a NUMBER token with a . or e (scientific notation)
...         if tok.type == tokenize.NUMBER and ('.' in tok.string or 'e' in tok.string.lower()):
...             result.extend([
...                 (tokenize.NAME, 'Decimal'),
...                 (tokenize.OP, '('),
...                 (tokenize.STRING, repr(tok.string)),
...                 (tokenize.OP, ')')
...             ])
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
This works like this:
>>> 1e-1000 + 1.000000000000000000000000000000001
1.0
>>> float_to_decimal('1e-1000 + 1.000000000000000000000000000000001')
"Decimal ('1e-1000')+Decimal ('1.000000000000000000000000000000001')"
Notice that because new tokens were added as length 2 tuples, the whitespace of the result is not the same as the input, and does not really follow PEP 8.
The transformed code can produce arbitrary precision decimals. Note that the decimal module still requires setting the context precision high enough to avoid rounding the input. An exercise for the reader is to extend float_to_decimal to determine the required precision automatically.
>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 1001
>>> eval(float_to_decimal('1e-1000 + 1.000000000000000000000000000000001'))
Decimal('1.0000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001')
Extending Python’s Syntax#
Because tokenize() emits ERRORTOKEN on any unrecognized operators, it can be used to add extensions to the Python syntax. This can be challenging to do in general, as you may need to do significant parsing of the tokens to ensure that your new “operator” has the correct precedence.
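For instance, on the Python versions these examples target, an unrecognized character simply shows up as an ERRORTOKEN in the stream (newer versions may tokenize or reject such characters differently):
>>> for tok in tokenize_string('1 ➕ 2'):
...     if tok.type == tokenize.ERRORTOKEN:
...         print(repr(tok.string), tok.start)
'➕' (1, 2)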
You can find some more advanced examples of extending Python’s syntax in SymPy’s parser module, for example, implicit multiplication (x y ⮕ x*y), implicit function application (sin x ⮕ sin(x)), factorial notation (x! ⮕ factorial(x)), and more.
Emoji Math#
The below example is relatively simple. It allows the “emoji” mathematical symbols ➕, ➖, ➗, and ✖ to be used instead of their ASCII counterparts.
>>> emoji_map = {
...     '➕': '+',
...     '➖': '-',
...     '➗': '/',
...     '✖': '*',
... }
>>> def emoji_math(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.type == tokenize.ERRORTOKEN and tok.string in emoji_map:
...             new_tok = (tokenize.OP, emoji_map[tok.string], *tok[2:])
...             result.append(new_tok)
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...
>>> emoji_math('1 ➕ 2 ➖ 3➗4✖5')
'1 + 2 - 3/4*5'
Because we are replacing a single character with a single character, we can use 5-tuples and keep the column offsets intact, making untokenize() maintain the whitespace of the input.
Note
These emoji may often appear as two characters. For instance, ✖ may often appear instead as ✖️, which is ✖ (HEAVY MULTIPLICATION X) + (VARIATION SELECTOR-16). The VARIATION SELECTOR-16 is an invisible character which forces it to render as an emoji. The above example does not include the VARIATION SELECTOR-16. An exercise for the reader is to modify the above function to work with this.
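A quick way to see the difference is to look at the code points (using the explicit \ufe0f escape here so the invisible character is visible in the source):
>>> [hex(ord(c)) for c in '✖']  # HEAVY MULTIPLICATION X alone
['0x2716']
>>> [hex(ord(c)) for c in '✖\ufe0f']  # with VARIATION SELECTOR-16
['0x2716', '0xfe0f']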
Backporting Underscores in Numeric Literals#
Python 3.6 added a new syntactic feature that allows underscores in numeric literals. To quote the docs, “single underscores are allowed between digits and after any base specifier. Leading, trailing, or multiple underscores in a row are not allowed.”
For example,
>>> # Python 3.6+ only
>>> 123_456
123456
We can write a function using tokenize to backport this feature to Python 3.5. Some experimentation shows how the new literals tokenize in Python 3.5:
- 123_456 tokenizes as NUMBER (123) and NAME (_456). In general, multiple underscores between digits will tokenize like this.
- Additionally, underscores are allowed after base specifiers, like 0x_1. This also tokenizes as NUMBER (0) and NAME (x_1). Since NUMBER and NAME tokens cannot appear next to one another in valid Python, we can simply combine them when they do.
- Finally, if an underscore appears before the . in a floating point literal, like 1_2.3_4, it will tokenize as NUMBER (1), NAME (_2), NUMBER (.3), NAME (_4).
Note that in this example, the ENCODING token allows us to access result[-1] unconditionally, as we know there must always be at least this token already processed before any NAME token. This is a useful property of the ENCODING token that we can often take advantage of.
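For example, the very first token yielded is always the ENCODING token, so result can never be empty by the time we index result[-1]:
>>> first = next(tokenize_string('123'))
>>> first.type == tokenize.ENCODING, first.string
(True, 'utf-8')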
We do some basic checks here to not allow spaces before underscores and double underscores, which are not allowed in Python 3.6. But for simplicity, this function takes a garbage in, garbage out approach. Invalid syntax in Python 3.6, like 123a_bc, will transform to something that is still invalid syntax in Python 3.5 (123abc).
>>> import sys
>>> def underscore_literals(s):
...     if sys.version_info >= (3, 6):
...         return s
...
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.type == tokenize.NAME and result[-1].type == tokenize.NUMBER:
...             # Check that there are no spaces between the tokens,
...             # e.g., 123 _456 is not allowed, and that there aren't multiple
...             # consecutive underscores, e.g., 123__456 is not allowed.
...             if result[-1].end == tok.start and '__' not in tok.string:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string.replace('_', ''),
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                 )
...                 result[-1] = new_tok
...                 continue
...         if tok.type == tokenize.NUMBER and tok.string[0] == '.' and result[-1].type == tokenize.NUMBER:
...             # Float with underscore before the ., like 1_2.0, which is 1, _2, .0
...             if result[-1].end == tok.start:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string,
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                 )
...                 result[-1] = new_tok
...                 continue
...         if tok.exact_type == tokenize.DOT and result[-1].type == tokenize.NUMBER:
...             # Like 1_2. which becomes 1, _2, .
...             if result[-1].end == tok.start:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string,
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                 )
...                 result[-1] = new_tok
...                 continue
...
...         result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...
Note that by reusing the original start and end positions, we are able to make untokenize() keep the whitespace, even though characters were removed. untokenize() only uses the differences between end and start to determine how many spaces to add between tokens, not their absolute values (remember that untokenize() only requires the start and end tuples to be nondecreasing; it doesn’t care if the actual column values are correct).
>>> s = '1_0 + 0b_101 + 0o_1_0 + 0x_a - 1.0_0 + 1e1 + 1.0_0j + 1_2.3_4 + 1_2.'
>>> # In Python 3.5
>>> underscore_literals(s)
'10 + 0b101 + 0o10 + 0xa - 1.00 + 1e1 + 1.00j + 12.34 + 12.'