Syntax

  • Path: source/lexbor/css/syntax/

  • Includes: lexbor/css/syntax/tokenizer.h, lexbor/css/syntax/token.h, lexbor/css/syntax/parser.h, lexbor/css/syntax/anb.h

  • Specification: CSS Syntax Module

Overview

The Syntax subsystem is the lowest-level component of the CSS module. It converts raw CSS text into a stream of tokens according to the CSS Syntax Module specification.

What’s Inside

  • Quick Start — minimal working example: tokenize CSS and print each token

  • How the Tokenizer Works — peek, lookahead, and consume: the three-function pattern

  • Token Types — complete list of token types produced by the tokenizer

  • Tokenizer Errors — error types collected during tokenization and how to read them

  • Parser — Syntax Parser documentation

Quick Start

#include <lexbor/css/css.h>

static lxb_status_t
callback(const lxb_char_t *data, size_t len, void *ctx)
{
    printf("%.*s", (int) len, (const char *) data);
    return LXB_STATUS_OK;
}

int main(void)
{
    lxb_status_t status;
    const lxb_char_t *name;
    const lxb_css_syntax_token_t *token;
    const lxb_char_t css[] = "div { color: red; }";

    /* Create and initialize tokenizer */
    lxb_css_syntax_tokenizer_t *tkz = lxb_css_syntax_tokenizer_create();
    status = lxb_css_syntax_tokenizer_init(tkz);
    if (status != LXB_STATUS_OK) {
        fprintf(stderr, "Failed to initialize tokenizer\n");
        return EXIT_FAILURE;
    }

    /* Set input buffer */
    lxb_css_syntax_tokenizer_buffer_set(tkz, css, sizeof(css) - 1);

    /* Read tokens one by one */
    token = lxb_css_syntax_token(tkz);

    while (token != NULL && token->type != LXB_CSS_SYNTAX_TOKEN__EOF) {
        /* Get token type name */
        name = lxb_css_syntax_token_type_name_by_id(token->type);

        /* Serialize token value */
        printf("%s: \"", (const char *) name);
        lxb_css_syntax_token_serialize(token, callback, NULL);
        printf("\"\n");

        /* Advance to next token */
        lxb_css_syntax_token_consume(tkz);
        token = lxb_css_syntax_token(tkz);
    }

    /* Clean up */
    lxb_css_syntax_tokenizer_destroy(tkz);

    return EXIT_SUCCESS;
}

Output:

ident: "div"
whitespace: " "
left-curly-bracket: "{"
whitespace: " "
ident: "color"
colon: ":"
whitespace: " "
ident: "red"
semicolon: ";"
whitespace: " "
right-curly-bracket: "}"

How the Tokenizer Works

The tokenizer is built around a peek / lookahead / consume pattern. Three functions control the token stream:

  • lxb_css_syntax_token() — Peek. Returns the current token. If no token has been generated yet, tokenizes the next one from the input. Calling it multiple times without consuming returns the same token.

  • lxb_css_syntax_token_next() — Lookahead. Always tokenizes the next token from the input and appends it to the internal buffer. Use this to look ahead without consuming the current token.

  • lxb_css_syntax_token_consume() — Consume. Removes the current (first) token from the buffer and frees its memory. After this call, the next buffered token becomes the current one.

Token Buffer

Internally, the tokenizer maintains a linked list of tokens (via first and last pointers). Tokens are allocated from an object pool and linked together:

first -> Token A -> Token B -> Token C -> NULL
                                 |
                                last

  • lxb_css_syntax_token() returns first. If first is NULL, it tokenizes one token from the input, sets it as first (and last), and returns it.

  • lxb_css_syntax_token_next() tokenizes a new token, links it after last, and updates last.

  • lxb_css_syntax_token_consume() removes first, advances first to first->next, and returns the old token to the object pool.

Tokenization is lazy: tokens are generated only when requested via lxb_css_syntax_token() or lxb_css_syntax_token_next().
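
A short sketch of this behavior (assuming tkz is an initialized tokenizer with an input buffer already set): repeated peeks return the same token object, and new input is tokenized only after a consume.

const lxb_css_syntax_token_t *a = lxb_css_syntax_token(tkz);
const lxb_css_syntax_token_t *b = lxb_css_syntax_token(tkz);

/* a == b: the second peek did not tokenize anything new. */

lxb_css_syntax_token_consume(tkz);

/* Only now is the next token generated from the input. */
const lxb_css_syntax_token_t *c = lxb_css_syntax_token(tkz);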

Basic Pattern: Sequential Reading

The most common usage is a simple loop: peek at the current token, process it, consume, repeat.

token = lxb_css_syntax_token(tkz);

while (token != NULL && token->type != LXB_CSS_SYNTAX_TOKEN__EOF) {
    /* Process token... */

    lxb_css_syntax_token_consume(tkz);
    token = lxb_css_syntax_token(tkz);
}

At each iteration:

  1. lxb_css_syntax_token() — returns the current token (generates it on first call).

  2. You inspect and process the token.

  3. lxb_css_syntax_token_consume() — removes the token from the buffer and frees it.

  4. lxb_css_syntax_token() — generates the next token (buffer is now empty, so it tokenizes from input).

Lookahead Pattern

Sometimes you need to see the next token before deciding what to do with the current one. Use lxb_css_syntax_token_next() for this:

const lxb_css_syntax_token_t *current = lxb_css_syntax_token(tkz);
const lxb_css_syntax_token_t *next = lxb_css_syntax_token_next(tkz);

/* Buffer state:
 *   first -> current -> next -> NULL
 *                         |
 *                        last
 */

if (current->type == LXB_CSS_SYNTAX_TOKEN_IDENT
    && next->type == LXB_CSS_SYNTAX_TOKEN_COLON)
{
    /* This is a "name: ..." pair, handle accordingly. */
}

/* Consume current, next becomes the new current. */
lxb_css_syntax_token_consume(tkz);

/* Now lxb_css_syntax_token(tkz) returns what was 'next'. */

You can call lxb_css_syntax_token_next() multiple times to buffer several tokens ahead. They form a chain in the linked list and are consumed one at a time.
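
For instance, a sketch of a two-token lookahead (again assuming an initialized tokenizer tkz; the variable names are illustrative):

const lxb_css_syntax_token_t *t0 = lxb_css_syntax_token(tkz);       /* current */
const lxb_css_syntax_token_t *t1 = lxb_css_syntax_token_next(tkz);  /* one ahead */
const lxb_css_syntax_token_t *t2 = lxb_css_syntax_token_next(tkz);  /* two ahead */

/* Buffer state: first -> t0 -> t1 -> t2 -> NULL (t2 is 'last'). */

if (t0 != NULL && t1 != NULL && t2 != NULL
    && t0->type == LXB_CSS_SYNTAX_TOKEN_IDENT
    && t1->type == LXB_CSS_SYNTAX_TOKEN_COLON
    && t2->type == LXB_CSS_SYNTAX_TOKEN_IDENT)
{
    /* Looks like "name: value"; handle it, then consume the tokens
     * one at a time as usual. */
}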

Bulk Consume

To skip multiple tokens at once, use lxb_css_syntax_token_consume_n():

/* Skip 3 tokens. */
lxb_css_syntax_token_consume_n(tkz, 3);

This simply calls lxb_css_syntax_token_consume() the specified number of times.

Memory and Lifetime

  • Tokens are allocated from a pre-allocated object pool inside the tokenizer. There is no manual malloc/free for individual tokens.

  • lxb_css_syntax_token_consume() returns the token object to the pool. After consuming, the pointer to the consumed token is invalid — do not use it.

  • String data inside tokens (e.g. identifier names, string values) initially references a temporary buffer inside the tokenizer. When a new token is generated, the previous token’s string data is automatically moved to stable memory. This means the data pointer in a token is valid only as long as the token has not been consumed (see the sketch below).
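
For example (a sketch reusing the serialization callback from the Quick Start): extract or serialize whatever you need from a token before consuming it, never after.

token = lxb_css_syntax_token(tkz);

if (token != NULL && token->type == LXB_CSS_SYNTAX_TOKEN_STRING) {
    /* Serialize (or copy) the value while the token is still alive. */
    lxb_css_syntax_token_serialize(token, callback, NULL);
}

lxb_css_syntax_token_consume(tkz);

/* 'token' now dangles: the object has been returned to the pool. */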

Token Types

The tokenizer produces tokens defined in the lxb_css_syntax_token_type_t enum. All token type constants have the prefix LXB_CSS_SYNTAX_TOKEN_. The types below are grouped by their internal data structure.

String Tokens

These tokens carry a string value accessible via the corresponding cast macro (e.g. lxb_css_syntax_token_ident(token)->data).

  • LXB_CSS_SYNTAX_TOKEN_IDENT — An identifier: any CSS name such as a property name, tag name, keyword, or custom identifier. Examples: div, color, auto, --my-var

  • LXB_CSS_SYNTAX_TOKEN_FUNCTION — A function token: an identifier immediately followed by (. The token value is the function name without the opening parenthesis. Examples: rgb(, calc(, var(

  • LXB_CSS_SYNTAX_TOKEN_AT_KEYWORD — An at-keyword: @ followed by an identifier. Used for CSS at-rules. The token value is the name without @. Examples: @media, @import, @keyframes

  • LXB_CSS_SYNTAX_TOKEN_HASH — A hash token: # followed by name characters. Used for ID selectors and color values. Examples: #main, #ff0000, #content

  • LXB_CSS_SYNTAX_TOKEN_STRING — A quoted string: text enclosed in matching single or double quotes. Examples: "hello", 'world'

  • LXB_CSS_SYNTAX_TOKEN_BAD_STRING — A malformed string: a string token that was terminated by an unescaped newline instead of a matching quote. Indicates a parse error. Example: "unterminated + newline

  • LXB_CSS_SYNTAX_TOKEN_URL — A URL token: the contents of url(...) when the argument is not a quoted string. Examples: url(image.png), url(https://example.com/bg.jpg)

  • LXB_CSS_SYNTAX_TOKEN_BAD_URL — A malformed URL: a url() token that contains invalid characters (unescaped whitespace, quotes, or parentheses). Indicates a parse error. Example: url(bad value)

  • LXB_CSS_SYNTAX_TOKEN_COMMENT — A CSS comment (/* ... */). This token is not part of the CSS specification; it is a lexbor extension, since the spec says comments should be discarded during tokenization. Example: /* comment */

  • LXB_CSS_SYNTAX_TOKEN_WHITESPACE — One or more whitespace characters (spaces, tabs, newlines) collapsed into a single token. Examples: " ", \t, \n
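
For example, reading an identifier's name directly (a sketch; it assumes the string payload exposes a length field alongside the data pointer mentioned above):

if (token->type == LXB_CSS_SYNTAX_TOKEN_IDENT) {
    /* The bytes are not guaranteed to be NUL-terminated; use the length. */
    printf("ident: %.*s\n",
           (int) lxb_css_syntax_token_ident(token)->length,
           (const char *) lxb_css_syntax_token_ident(token)->data);
}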

Numeric Tokens

These tokens carry a numeric value. Access via lxb_css_syntax_token_number(token)->num.

  • LXB_CSS_SYNTAX_TOKEN_NUMBER — A numeric value, integer or floating-point. The token stores whether it is a float (is_float) and whether it has an explicit sign (have_sign). Examples: 42, 3.14, -1, +0.5

  • LXB_CSS_SYNTAX_TOKEN_PERCENTAGE — A number immediately followed by %. Internally has the same structure as a number token. Examples: 100%, 50%, 33.3%
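
A sketch of reading a numeric token (the field names follow the description above; num is assumed to be a floating-point value):

if (token->type == LXB_CSS_SYNTAX_TOKEN_NUMBER) {
    if (lxb_css_syntax_token_number(token)->is_float) {
        printf("float: %f\n", (double) lxb_css_syntax_token_number(token)->num);
    }
    else {
        printf("integer: %.0f\n", (double) lxb_css_syntax_token_number(token)->num);
    }
}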

Dimension Token

The dimension token combines a numeric value and a unit string.

  • LXB_CSS_SYNTAX_TOKEN_DIMENSION — A number immediately followed by an identifier (the unit). Contains both a numeric part (accessible via lxb_css_syntax_token_dimension(token)->num) and a string part for the unit (accessible via lxb_css_syntax_token_dimension_string(token)). Examples: 16px, 2em, 100vh, 90deg, 300ms
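
A sketch of reading the unit of a dimension token. It assumes lxb_css_syntax_token_dimension_string() yields a pointer to a string payload with data and length fields, like the other string tokens; check token.h for the exact shape.

if (token->type == LXB_CSS_SYNTAX_TOKEN_DIMENSION) {
    /* Unit name, e.g. "px" or "em". */
    printf("unit: %.*s\n",
           (int) lxb_css_syntax_token_dimension_string(token)->length,
           (const char *) lxb_css_syntax_token_dimension_string(token)->data);
}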

Delimiter Token

  • LXB_CSS_SYNTAX_TOKEN_DELIM — A single code point that doesn’t match any other token type. The character is accessible via lxb_css_syntax_token_delim_char(token). Used for operators, combinators, and other punctuation in CSS. Examples: ., >, +, ~, *, =, |, /
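
For example, a selector parser might branch on the delimiter character (a sketch; comparing against a character constant works regardless of the exact integer type the accessor returns):

if (token->type == LXB_CSS_SYNTAX_TOKEN_DELIM
    && lxb_css_syntax_token_delim_char(token) == '>')
{
    /* A '>' delimiter, e.g. a child combinator between selectors. */
}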

Unicode Range Token

  • LXB_CSS_SYNTAX_TOKEN_UNICODE_RANGE — A Unicode range used in @font-face rules. Contains a start and end code point accessible via lxb_css_syntax_token_unicode_range(token)->start and ->end. Examples: U+0025-00FF, U+4E00-9FFF, U+26
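
A sketch of printing the range bounds (the casts only assume start and end are integer code points):

if (token->type == LXB_CSS_SYNTAX_TOKEN_UNICODE_RANGE) {
    printf("U+%04X-%04X\n",
           (unsigned) lxb_css_syntax_token_unicode_range(token)->start,
           (unsigned) lxb_css_syntax_token_unicode_range(token)->end);
}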

Punctuation Tokens

These tokens represent single punctuation characters. They carry only the base data (position and length), with no additional value.

  • LXB_CSS_SYNTAX_TOKEN_COLON — A colon. Separates property names from values, and is used in pseudo-class selectors. Character: :

  • LXB_CSS_SYNTAX_TOKEN_SEMICOLON — A semicolon. Separates declarations in a rule block. Character: ;

  • LXB_CSS_SYNTAX_TOKEN_COMMA — A comma. Separates values in lists and selectors. Character: ,

  • LXB_CSS_SYNTAX_TOKEN_LS_BRACKET — Left square bracket (U+005B). Used in attribute selectors. Character: [

  • LXB_CSS_SYNTAX_TOKEN_RS_BRACKET — Right square bracket (U+005D). Closes attribute selectors. Character: ]

  • LXB_CSS_SYNTAX_TOKEN_L_PARENTHESIS — Left parenthesis (U+0028). Opens function arguments and grouped expressions. Character: (

  • LXB_CSS_SYNTAX_TOKEN_R_PARENTHESIS — Right parenthesis (U+0029). Closes function arguments and grouped expressions. Character: )

  • LXB_CSS_SYNTAX_TOKEN_LC_BRACKET — Left curly bracket (U+007B). Opens a declaration block or rule body. Character: {

  • LXB_CSS_SYNTAX_TOKEN_RC_BRACKET — Right curly bracket (U+007D). Closes a declaration block or rule body. Character: }
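
Since punctuation tokens carry no extra payload, they are matched by type alone. A small sketch that tracks block nesting depth using the curly-bracket tokens (assuming token and tkz are set up as in the earlier loops):

size_t depth = 0;

token = lxb_css_syntax_token(tkz);

while (token != NULL && token->type != LXB_CSS_SYNTAX_TOKEN__EOF) {
    if (token->type == LXB_CSS_SYNTAX_TOKEN_LC_BRACKET) {
        depth++;
    }
    else if (token->type == LXB_CSS_SYNTAX_TOKEN_RC_BRACKET && depth > 0) {
        depth--;
    }

    lxb_css_syntax_token_consume(tkz);
    token = lxb_css_syntax_token(tkz);
}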

HTML Comment Tokens

Legacy tokens for compatibility with HTML-style comments embedded in CSS, a holdover from early browsers where style sheets were wrapped in <!-- --> so that user agents without <style> support would not render them as text.

  • LXB_CSS_SYNTAX_TOKEN_CDO — Comment Data Open: the <!-- sequence. In modern CSS this is effectively ignored but required by the specification for backwards compatibility.

  • LXB_CSS_SYNTAX_TOKEN_CDC — Comment Data Close: the --> sequence. Same as CDO: exists for backwards compatibility with HTML comments in CSS.

Special Tokens

  • LXB_CSS_SYNTAX_TOKEN_UNDEF — Undefined/uninitialized token. Value 0x00. Indicates the token has not been set.

  • LXB_CSS_SYNTAX_TOKEN__EOF — End of file. The tokenizer reached the end of the input data.

  • LXB_CSS_SYNTAX_TOKEN__END — End of tokenization. Alias for the deprecated LXB_CSS_SYNTAX_TOKEN__TERMINATED. Signals that the tokenizer has been explicitly stopped. Used in parser states.

  • LXB_CSS_SYNTAX_TOKEN__LAST_ENTRY — Sentinel value marking the end of the enum. Used internally for bounds checking; not a real token type.

Tokenizer Errors

  • API: source/lexbor/css/syntax/tokenizer/error.h

During tokenization the tokenizer collects parse errors. Errors do not stop tokenization — they are recorded and can be inspected after processing.

Error Types

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_UNEOF — Unexpected end of file. Example: empty input where a token is expected.

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_EOINCO — EOF in comment: a /* comment was never closed. Example: /* unclosed comment

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_EOINST — EOF in string: a quoted string was never closed. Example: "unterminated string

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_EOINUR — EOF in URL: a url() token was never closed. Example: url(image.png

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_EOINES — EOF in escape: an escape sequence (\) at the end of input. Example: div\

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_QOINUR — Quote in URL: a quote character inside an unquoted url(). Example: url(ba"d)

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_WRESINUR — Wrong escape in URL: an invalid escape sequence inside url(). Example: url(bad\↵value)

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_NEINST — Newline in string: an unescaped newline inside a quoted string. Example: "line1↵line2"

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_BACH — Bad character: a character that is not valid in CSS (e.g. NULL). Example: \0

  • LXB_CSS_SYNTAX_TOKENIZER_ERROR_BACOPO — Bad code point: an escape sequence that resolves to an invalid code point (surrogate or out of range). Example: \DFFF

Error Structure

Each error is stored as lxb_css_syntax_tokenizer_error_t:

typedef struct {
    const lxb_char_t                    *pos;   /* position in the input buffer */
    lxb_css_syntax_tokenizer_error_id_t id;     /* error type */
} lxb_css_syntax_tokenizer_error_t;

Errors are accumulated in the tokenizer’s parse_errors array (lexbor_array_obj_t). You can iterate them after tokenization using lexbor_array_obj_length() and lexbor_array_obj_get().

Example: Collecting Tokenizer Errors

#include <lexbor/css/css.h>

int main(void)
{
    lxb_status_t status;
    const lxb_css_syntax_token_t *token;
    lxb_css_syntax_tokenizer_t *tkz;
    lxb_css_syntax_tokenizer_error_t *error;

    /* CSS with errors: unterminated string and unterminated comment. */
    const lxb_char_t css[] = "div { content: \"bad string\n } /* unclosed";

    /* Create and initialize tokenizer. */
    tkz = lxb_css_syntax_tokenizer_create();
    status = lxb_css_syntax_tokenizer_init(tkz);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Set input buffer. */
    lxb_css_syntax_tokenizer_buffer_set(tkz, css, sizeof(css) - 1);

    /* Consume all tokens to trigger error detection. */
    token = lxb_css_syntax_token(tkz);

    while (token != NULL && token->type != LXB_CSS_SYNTAX_TOKEN__EOF) {
        lxb_css_syntax_token_consume(tkz);
        token = lxb_css_syntax_token(tkz);
    }

    /* Print all tokenizer errors. */
    size_t errors_count = lexbor_array_obj_length(tkz->parse_errors);

    printf("Found %zu tokenizer error(s):\n", errors_count);

    for (size_t i = 0; i < errors_count; i++) {
        error = lexbor_array_obj_get(tkz->parse_errors, i);

        size_t pos = error->pos - css;

        switch (error->id) {
            case LXB_CSS_SYNTAX_TOKENIZER_ERROR_EOINCO:
                printf("  [%zu] EOF in comment (offset: %zu)\n", i, pos);
                break;
            case LXB_CSS_SYNTAX_TOKENIZER_ERROR_NEINST:
                printf("  [%zu] Newline in string (offset: %zu)\n", i, pos);
                break;
            case LXB_CSS_SYNTAX_TOKENIZER_ERROR_EOINST:
                printf("  [%zu] EOF in string (offset: %zu)\n", i, pos);
                break;
            default:
                printf("  [%zu] Error id: %d (offset: %zu)\n",
                       i, error->id, pos);
                break;
        }
    }

    /* Clean up. */
    lxb_css_syntax_tokenizer_destroy(tkz);

    return EXIT_SUCCESS;
}

Output:

Found 2 tokenizer error(s):
  [0] Newline in string (offset: 26)
  [1] EOF in comment (offset: 41)

Parser

See the Parser documentation for details on parsing CSS syntax into higher-level structures such as declarations, rules, and selectors.