rym-token

The token module provides string tokenization.

Usage

Tokenize Strings

The tokenize function is a generator that scans a string for known patterns and yields each match as a Token carrying its type, value, and location (line and column).

>>> from rym.token import tokenize
>>> text = "42 is an integer; 2023-10-30T21:30:00Z is a date string"
>>> list(tokenize(text))
[Token(type='INTEGER', value=42, line=0, column=0), Token(type='WORD', value='is', line=0, column=3), Token(type='WORD', value='an', line=0, column=6), Token(type='WORD', value='integer', line=0, column=9), Token(type='PUNCTUATION', value=';', line=0, column=16), Token(type='TIMESTAMP', value=datetime.datetime(2023, 10, 30, 21, 30, tzinfo=datetime.timezone.utc), line=0, column=18), Token(type='WORD', value='is', line=0, column=39), Token(type='WORD', value='a', line=0, column=42), Token(type='WORD', value='date', line=0, column=44), Token(type='WORD', value='string', line=0, column=49)]
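
Each Token is a named tuple, so individual fields are easy to inspect or filter on:

>>> [t.value for t in tokenize(text) if t.type == 'WORD']
['is', 'an', 'integer', 'is', 'a', 'date', 'string']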

While rym.token provides several built-in token specifications, you may also supply your own. Patterns are regular-expression strings, and matching is case-sensitive.

>>> from rym.token import TokenSpec
>>> spec = TokenSpec("BOOL", r"True|False")
>>> text = "I prefer True/False over multiple choice"
>>> list(tokenize(text, [spec]))
[Token(type='BOOL', value='True', line=0, column=9), Token(type='BOOL', value='False', line=0, column=14)]
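
Because matching is case-sensitive, input that differs only in case is skipped rather than matched:

>>> list(tokenize("true or False", [spec]))
[Token(type='BOOL', value='False', line=0, column=8)]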

Type Handlers

You may also provide a type handler to customize the final value. Use this feature with care, as it may prevent recreating the input text from the tokens. Type handlers are included in several of the built-in specs: integer, number, timestamp, date, and time.

>>> spec = TokenSpec(
...     "BOOL", r"True|False",
...     lambda x: x.lower() == 'true')
>>> list(tokenize(text, [spec]))
[Token(type='BOOL', value=True, line=0, column=9), Token(type='BOOL', value=False, line=0, column=14)]
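
Handlers are ordinary callables, so any parsing logic works. As a sketch, a hypothetical HEX spec (not part of rym.token) could convert hexadecimal literals to integers:

>>> hex_spec = TokenSpec("HEX", r"0x[0-9a-fA-F]+", lambda x: int(x, 16))
>>> list(tokenize("addr 0xFF", [hex_spec]))
[Token(type='HEX', value=255, line=0, column=5)]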

Subtypes

You may also define subtypes for a type specification. Subtypes are evaluated before the type handler runs, and subtype matching is case-insensitive.

>>> from rym.token.tokenspec import build_subtype_assignment
>>> subtypes = (
...     ('TRUE', ('true', )),
...     ('FALSE', ('false',)),
... )
>>> subtype = build_subtype_assignment(subtypes)
>>> spec = TokenSpec(
...     "BOOL",
...     r"True|False",
...     lambda x: x.lower() == 'true',
...     subtype=subtype)
>>> list(tokenize(text, [spec]))
[Token(type='TRUE', value=True, line=0, column=9), Token(type='FALSE', value=False, line=0, column=14)]
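
As the output shows, the assigned subtype replaces the token's type, so downstream code can dispatch on it directly:

>>> {t.type: t.value for t in tokenize(text, [spec])}
{'TRUE': True, 'FALSE': False}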

API

class rym.token.Token(type: str, value: str, line: int, column: int)

class rym.token.TokenSpec(type, pattern, handle, subtype)

    type: str
        Alias for field number 0

    pattern: str
        Alias for field number 1

    handle: Callable[..., Any]
        Alias for field number 2

    subtype: Callable[[str], str]
        Alias for field number 3

rym.token.tokenize(block: str, specs: Iterable[Callable[..., TokenSpec]] = None) -> Iterable[Token]

    Given a string, identify contextual tokens.