Brown Water Python: Better Docs for the Python tokenize Module
The tokenize module in the Python standard library is very powerful, but its documentation is somewhat limited. In the spirit of Thomas Kluyver’s Green Tree Snakes project, which provides similar extended documentation for the ast module, I am providing here some extended documentation for effectively working with the tokenize module.
Python Versions Supported
The contents of this guide apply to Python 3.5 and up. Several minor changes were made to the tokenize module in various Python versions between 3.5 and 3.8, and they are noted where appropriate.
The tokenize module tokenizes code according to the version of Python that it is being run under. For example, some new syntax features in 3.6 affect tokenization (in particular, f-strings and underscores in numeric literals).
Take 123_456. This tokenizes as a single token in Python 3.6+, NUMBER (123_456), but in Python 3.5 it tokenizes as two tokens, NUMBER (123) and NAME (_456) (see the reference for the NUMBER token type for more info).
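As a quick check of the behavior described above, here is a minimal sketch that tokenizes 123_456 and prints each token's type name and string. The helper name token_summary is my own and is not part of the tokenize module. Run under Python 3.6 or newer, the literal should come out as a single NUMBER token.

```python
import io
import tokenize


def token_summary(source):
    """Return (type name, token string) pairs for every token in source.

    token_summary is a hypothetical helper for this example only.
    """
    # tokenize.tokenize() expects a readline callable that returns bytes.
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.tokenize(readline)]


summary = token_summary("123_456\n")
for name, string in summary:
    print(name, repr(string))
```

The stream also contains ENCODING, NEWLINE, and ENDMARKER tokens around the NUMBER token; those are emitted for any input.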
Most of what is written here will also apply to earlier Python 3 versions, with obvious exceptions (like tokens that were added for new syntax), though none of it has been tested on those versions.
I don’t have any interest in supporting Python 2 in this guide. Its lifetime has officially come to an end, so you should strongly consider being Python 3-only for any new code you write.
With that being said, I will point out one important difference in Python 2: the tokenize() function in Python 2 prints the tokens instead of returning them. Use the generate_tokens() function instead, which works like tokenize() in Python 3 (see the docs).
>>> # Python 2.7 tokenize example
>>> import tokenize
>>> import io
>>> for tok in tokenize.generate_tokens(io.BytesIO('1 + 2').readline):
...     print tok
...
(2, '1', (1, 0), (1, 1), '1 + 2')
(51, '+', (1, 2), (1, 3), '1 + 2')
(2, '2', (1, 4), (1, 5), '1 + 2')
(0, '', (2, 0), (2, 0), '')
Another difference is that the result of this function in Python 2 is a regular tuple, not a namedtuple, so you will not be able to use attributes to access the members. Instead, use something like for toknum, tokval, start, end, line in tokenize.generate_tokens(...): (this pattern can be used in Python 3 as well; see the Usage section).
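For comparison, here is a minimal Python 3 sketch of the two access styles mentioned above: attribute access on the namedtuple, and plain tuple unpacking (the only option in Python 2). The variable names by_attribute and by_unpacking are chosen for this example only.

```python
import io
import tokenize

source = b"1 + 2\n"

# Style 1: attribute access. This works in Python 3 because each token
# is a namedtuple (tokenize.TokenInfo).
by_attribute = [
    (tok.type, tok.string, tok.start, tok.end)
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]

# Style 2: tuple unpacking. This works for both plain tuples and
# namedtuples, so the same loop shape carries over from Python 2.
by_unpacking = []
for toknum, tokval, start, end, line in tokenize.tokenize(
        io.BytesIO(source).readline):
    by_unpacking.append((toknum, tokval, start, end))

# Both styles see exactly the same data.
print(by_attribute == by_unpacking)
```

Note that tokenize.tokenize() in Python 3 reads bytes, so the source is wrapped in io.BytesIO with a bytes literal.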
Contributing
Contributions are welcome. So are questions. My goal here is to help people understand the tokenize module, so if something is not clear, please let me know. If you see something written here that is wrong, please make a pull request correcting it.
I’m not an expert on tokenize. I mainly know what is written here from trial and error and from reading the source code.
Table of Contents