What is Tokenization?
In the field of parsing, a tokenizer, also called a lexer, is a program that takes a string of characters and splits it into tokens. A token is a substring that has meaning in the grammar of the language.
An example should clarify things. Consider the string of partial Python code ("a") + True -.
>>> import tokenize
>>> import io
>>> string = '("a") + True -\n'
>>> for tok in tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline):
...     print(tok)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='("a") + True -\n')
TokenInfo(type=3 (STRING), string='"a"', start=(1, 1), end=(1, 4), line='("a") + True -\n')
TokenInfo(type=54 (OP), string=')', start=(1, 4), end=(1, 5), line='("a") + True -\n')
TokenInfo(type=54 (OP), string='+', start=(1, 6), end=(1, 7), line='("a") + True -\n')
TokenInfo(type=1 (NAME), string='True', start=(1, 8), end=(1, 12), line='("a") + True -\n')
TokenInfo(type=54 (OP), string='-', start=(1, 13), end=(1, 14), line='("a") + True -\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 14), end=(1, 15), line='("a") + True -\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
The string is split into the following tokens: (, "a", ), +, True, and - (ignore the BytesIO bit and the ENCODING and ENDMARKER tokens for now).
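If all you want are those substrings, a quick sketch like the following (my own illustration, not part of the example above) pulls them out of the same token stream, skipping the bookkeeping tokens:
import io
import tokenize

string = '("a") + True -\n'
toks = tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline)
# Keep only the "real" tokens, dropping ENCODING, NEWLINE, and ENDMARKER.
print([tok.string for tok in toks
       if tok.type not in (tokenize.ENCODING, tokenize.NEWLINE, tokenize.ENDMARKER)])
# ['(', '"a"', ')', '+', 'True', '-']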
I chose this example to demonstrate a few things:
The tokens in Python are things like parentheses, strings, operators, keywords, and variable names.
Every token is represented by a namedtuple called TokenInfo, which has a type, represented by an integer constant, and a string, which is the substring of the input representing the given token. The namedtuple also gives line and column information that indicates exactly where in the input string the token was found.
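For instance, here is a small sketch of reading those fields directly off each TokenInfo instead of printing the whole namedtuple:
import io
import tokenize

string = '("a") + True -\n'
for tok in tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline):
    # type is an integer constant (tok_name maps it back to a readable name),
    # string is the matched substring, and start/end are (row, column) pairs.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)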
The input does not need to be valid Python. Our input, ("a") + True -, is not valid Python. It is, however, a potential beginning of valid Python code. If a valid Python expression were added to the end of the input, completing the subtraction operator, such as ("a") + True - x, it would become valid Python. This illustrates an important aspect of tokenize: it fundamentally works on a stream of characters, so tokens are output as they are seen, without regard to what comes later (the tokenize module does do some lookahead on the input stream internally to ensure that the correct tokens are output, but from the point of view of a user of tokenize, each token can be processed as it is seen). This is why tokenize.tokenize produces a generator. It should be noted, however, that tokenize does raise exceptions on certain incomplete or invalid Python statements. For example, if we omit the closing parenthesis, tokenize produces all the tokens as before, but then raises TokenError:
>>> string = '("a" + True -'
>>> for tok in tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline):
...     print(tok)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='("a" + True -')
TokenInfo(type=3 (STRING), string='"a"', start=(1, 1), end=(1, 4), line='("a" + True -')
TokenInfo(type=54 (OP), string='+', start=(1, 5), end=(1, 6), line='("a" + True -')
TokenInfo(type=1 (NAME), string='True', start=(1, 7), end=(1, 11), line='("a" + True -')
TokenInfo(type=54 (OP), string='-', start=(1, 12), end=(1, 13), line='("a" + True -')
Traceback (most recent call last):
  ...
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))
One of the goals of this guide is to quantify exactly when these error conditions can occur, so that code that attempts to tokenize partial Python code can deal with them properly.
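As a rough sketch of what such handling can look like (the helper name here is mine, not something from the tokenize module), one option is simply to collect tokens until the generator raises:
import io
import tokenize

def tokens_so_far(source):
    """Collect whatever tokens tokenize can produce from partial source."""
    tokens = []
    gen = tokenize.tokenize(io.BytesIO(source.encode('utf-8')).readline)
    try:
        for tok in gen:
            tokens.append(tok)
    except tokenize.TokenError:
        # Raised here because of the unclosed parenthesis at the end of input.
        pass
    return tokens

print([tok.string for tok in tokens_so_far('("a" + True -')])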
Syntactically irrelevant aspects of the input, such as redundant parentheses, are maintained. The parentheses around the "a" in the input string are completely unnecessary, but they are included as tokens anyway. This does not apply to whitespace, however (indentation is an exception, as we will see later), although the whitespace between tokens can generally be deduced from the additional information provided in the TokenInfo.
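For instance, here is a quick sketch (my own illustration, not something tokenize does for you) of recovering the gap between adjacent tokens on the same line from their column positions:
import io
import tokenize

string = '("a") + True -\n'
toks = list(tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline))
for prev, tok in zip(toks, toks[1:]):
    if prev.end[0] == tok.start[0]:  # both tokens sit on the same row
        gap = tok.start[1] - prev.end[1]
        print(f'{prev.string!r} -> {tok.string!r}: {gap} space(s) between them')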
The input need not be semantically meaningful in any way. The input string, even if completed, could only ever raise a TypeError when run, because "a" + True is not allowed by Python. The tokenize module does not know or care about objects, types, or any high-level Python constructs.
Some tokens can be right next to one another in the input string. Other tokens must be separated by a space (for instance, foriinrange(10) will tokenize differently from for i in range(10)). The complete set of rules for when spaces are required or not required to separate Python tokens is quite complicated, especially when multi-line statements with indentation are considered (as an example, consider that 1jand 2 is valid Python: it is tokenized into three tokens, NUMBER (1j), NAME (and), and NUMBER (2)). One use case of the tokenize module is to combine tokens back into valid Python using the untokenize function, which handles these details automatically.
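To make that concrete, here is a minimal sketch of round-tripping a string through tokenize and untokenize; when given the full TokenInfo tuples, untokenize uses the recorded positions to restore the spacing:
import io
import tokenize

string = 'for i in range(10): pass\n'
toks = list(tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline))
# untokenize() accepts the full TokenInfo tuples; the result is bytes because
# the token stream begins with an ENCODING token.
source = tokenize.untokenize(toks)
print(source)  # should print b'for i in range(10): pass\n'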
All parentheses and operators are tokenized as OP. Both variable names and keywords are tokenized as NAME. Determining the exact type of a token therefore often requires further inspection than simply looking at the type (this guide will detail exactly how to do this).
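As a sketch of the kind of further inspection involved (one common approach, not the only one): the exact_type attribute resolves OP tokens to specific operators, and the keyword module distinguishes keywords from ordinary names:
import io
import keyword
import tokenize

string = '("a") + True -\n'
for tok in tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline):
    if tok.type == tokenize.OP:
        # exact_type gives the specific operator, e.g. LPAR for '('
        print(tok.string, tokenize.tok_name[tok.exact_type])
    elif tok.type == tokenize.NAME:
        # tokenize itself does not distinguish keywords; keyword.iskeyword() does
        print(tok.string, 'keyword' if keyword.iskeyword(tok.string) else 'name')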
The examples above do not show it, but even code that can never be valid Python is often tokenized. For example:
>>> string = 'a$b\n'
>>> for tok in tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline):
...     print(tok)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='a', start=(1, 0), end=(1, 1), line='a$b\n')
TokenInfo(type=60 (ERRORTOKEN), string='$', start=(1, 1), end=(1, 2), line='a$b\n')
TokenInfo(type=1 (NAME), string='b', start=(1, 2), end=(1, 3), line='a$b\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 3), end=(1, 4), line='a$b\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
This can be useful for dealing with code that has minor typos that make it invalid. It can also be used to build modules that extend the Python language in limited ways, but be warned that the tokenize module makes no guarantees about how it tokenizes invalid Python. For example, if a future version of Python added $ as an operator, the above string could tokenize completely differently. This exact thing happened, for instance, with f-strings. In Python 3.5, f"{a}" tokenizes as two tokens, NAME (f) and STRING ("{a}"). In Python 3.6, it tokenizes as one token, STRING (f"{a}").
Finally, the key thing to understand about tokenization is that tokens are a very low-level abstraction of Python syntax. The same token may have different meanings in different contexts. For example, in [1], the [ token is part of a list literal, whereas in a[1], the [ token is part of a subscript. If you want to manipulate higher-level abstractions, you might want to use the ast module instead (see the next section).
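To see the contrast concretely, here is a small sketch using the ast module, which reports the two different roles of [ directly:
import ast

# In '[1]' the brackets build a list literal; in 'a[1]' they index into a.
print(ast.dump(ast.parse('[1]', mode='eval').body))   # a List node
print(ast.dump(ast.parse('a[1]', mode='eval').body))  # a Subscript node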
This guide does not detail how things are tokenized, that is, how tokenize chooses which tokens to use for a given input string, except in the ways that this matters to an end user of tokenize. For details on how Python is lexed, see the lexical analysis page in the official Python documentation.