URL Module

  • Version: 0.4.0

  • Path: source/lexbor/url

  • Base Includes: lexbor/url/url.h

  • Examples: examples/lexbor/url

  • Specification: WHATWG URL Living Standard

Overview

The URL module implements WHATWG URL Living Standard for parsing, manipulating, and serializing URLs.

Browsers use this exact algorithm to process every URL you type in the address bar, click in a link, or set in JavaScript. This module gives you the same behavior in C — byte-for-byte compatible with how browsers parse URLs.

Why does this matter? Because URLs are surprisingly tricky. The WHATWG specification defines a complex state machine with many edge cases to handle all the weird and wonderful URLs that exist in the wild. By following this spec, lexbor’s URL module ensures that you get the same results as browsers do.

Key Features

  • Specification Compliant — follows WHATWG URL Living Standard

  • Unicode Support — handles international domain names with IDNA/Punycode

  • Relative URL Resolution — parse relative URLs against a base URL

  • Component Access — extract and modify individual URL components

  • Serialization — convert URL objects to strings (callback or string output)

  • URLSearchParams — parse and manipulate query parameters

  • URL API — modify individual URL components after parsing (href, host, port, path, etc.)

What’s Inside

Quick Start

Basic URL Parsing

#include <lexbor/url/url.h>

int main(void)
{
    lxb_url_t *url;
    lxb_status_t status;
    lxb_url_parser_t parser;

    static const lxb_char_t url_str[]
        = "https://example.com:8080/path?query=value#fragment";

    /* Initialize parser (on the stack) */
    status = lxb_url_parser_init(&parser, NULL);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Parse URL */
    url = lxb_url_parse(&parser, NULL, url_str, sizeof(url_str) - 1);
    if (url == NULL) {
        lxb_url_parser_destroy(&parser, false);
        return EXIT_FAILURE;
    }

    /* Access components */
    printf("Scheme: %.*s\n",
           (int) lxb_url_scheme(url)->length,
           lxb_url_scheme(url)->data);

    printf("Port: %u\n", lxb_url_port(url));

    printf("Path: %.*s\n",
           (int) lxb_url_path_str(url)->length,
           lxb_url_path_str(url)->data);

    /* Cleanup */
    lxb_url_parser_destroy(&parser, false);
    lxb_url_memory_destroy(url);

    return EXIT_SUCCESS;
}

Output:

Scheme: https
Port: 8080
Path: /path

URL Structure

A URL has the following structure according to the WHATWG specification:

  https://user:password@example.com:8080/path/to/page?key=value#section
  \___/   \__/ \______/ \_________/ \__/\___________/\________/\______/
    |       |      |         |        |       |           |        |
  scheme  user  password   host     port    path       query   fragment

After parsing, a URL is represented by the lxb_url_t structure. Each component can be accessed through inline functions:

Component

Accessor

Returns

Scheme

lxb_url_scheme(url)

const lexbor_str_t * — e.g. "https"

Username

lxb_url_username(url)

const lexbor_str_t *

Password

lxb_url_password(url)

const lexbor_str_t *

Host

lxb_url_host(url)

const lxb_url_host_t *

Port

lxb_url_port(url)

uint16_t

Has Port

lxb_url_has_port(url)

bool — true if port was explicitly set

Path

lxb_url_path(url)

const lxb_url_path_t *

Path (string)

lxb_url_path_str(url)

const lexbor_str_t * — e.g. "/path/to/page"

Query

lxb_url_query(url)

const lexbor_str_t * — e.g. "key=value" (without ?)

Fragment

lxb_url_fragment(url)

const lexbor_str_t * — e.g. "section" (without #)

Note about lexbor_str_t: This is lexbor’s string type with data (pointer to characters) and length fields. If a component is absent, data will be NULL and length will be 0.

Parsing URLs

Parser Lifecycle

The URL parser has the standard create/init/clean/destroy lifecycle:

/* Option 1: Stack allocation (recommended for simple use) */
lxb_url_parser_t parser;
lxb_url_parser_init(&parser, NULL);
/* ... use parser ... */
lxb_url_parser_destroy(&parser, false);  /* false = don't free the struct */

/* Option 2: Heap allocation */
lxb_url_parser_t *parser = lxb_url_parser_create();
lxb_url_parser_init(parser, NULL);
/* ... use parser ... */
lxb_url_parser_destroy(parser, true);  /* true = free the struct too */

Important: The parser is not bound to the URLs it creates. After parsing, you can destroy the parser and continue working with the parsed URL objects. Each URL stores a pointer to its own memory allocator.

Parsing an Absolute URL

lxb_url_t *url = lxb_url_parse(&parser, NULL, url_str, url_str_length);
if (url == NULL) {
    /* Parsing failed — invalid URL */
}

The second argument (NULL here) is the base URL. Pass NULL when parsing absolute URLs.

Reusing the Parser

To parse multiple URLs with the same parser, call lxb_url_parser_clean() between parses:

lxb_url_parser_t parser;
lxb_url_parser_init(&parser, NULL);

lxb_url_t *url1 = lxb_url_parse(&parser, NULL, str1, len1);

/* Clean parser state before next parse */
lxb_url_parser_clean(&parser);

lxb_url_t *url2 = lxb_url_parse(&parser, NULL, str2, len2);

/* Both url1 and url2 are valid and independent */

lxb_url_parser_destroy(&parser, false);

/* URLs are still alive — parser doesn't own them */
lxb_url_destroy(url1);
lxb_url_destroy(url2);

Basic URL Parser

For advanced use cases, lxb_url_parse_basic() gives more control:

lxb_status_t status = lxb_url_parse_basic(
    &parser,
    NULL,               /* existing URL to update, or NULL to create new */
    NULL,               /* base URL */
    data, length,       /* input string */
    LXB_URL_STATE__UNDEF,  /* override state (UNDEF = default) */
    LXB_ENCODING_UTF_8    /* input encoding */
);

After lxb_url_parse_basic() succeeds, retrieve the URL with lxb_url_get(&parser).

This function allows you to:

  • Start parsing from a specific state (e.g., LXB_URL_STATE_HOST_STATE to parse only the host part)

  • Specify input encoding other than UTF-8

  • Update an existing URL object instead of creating a new one

Relative URL Resolution

Relative URLs are resolved against a base URL. This is exactly what browsers do when they encounter a relative link on a page.

#include <lexbor/url/url.h>

int main(void)
{
    lxb_url_t *url, *base_url;
    lxb_status_t status;
    lxb_url_parser_t parser;

    static const lxb_char_t base_str[] = "https://example.com:2030/";
    static const lxb_char_t rel_str[] = "/path/to/page?id=123#comments";

    status = lxb_url_parser_init(&parser, NULL);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* First, parse the base URL */
    base_url = lxb_url_parse(&parser, NULL,
                              base_str, sizeof(base_str) - 1);
    if (base_url == NULL) {
        return EXIT_FAILURE;
    }

    lxb_url_parser_clean(&parser);

    /* Parse relative URL against the base */
    url = lxb_url_parse(&parser, base_url,
                         rel_str, sizeof(rel_str) - 1);
    if (url == NULL) {
        return EXIT_FAILURE;
    }

    /* url is now: https://example.com:2030/path/to/page?id=123#comments */

    lxb_url_parser_destroy(&parser, false);
    lxb_url_memory_destroy(url);

    return EXIT_SUCCESS;
}

The resolution follows the WHATWG algorithm:

  • "/path" with base "https://example.com/" -> "https://example.com/path"

  • "../other" with base "https://example.com/a/b/" -> "https://example.com/a/other"

  • "?q=1" with base "https://example.com/page" -> "https://example.com/page?q=1"

  • "#frag" with base "https://example.com/page?q=1" -> "https://example.com/page?q=1#frag"

  • "https://other.com" with any base -> "https://other.com" (absolute URL ignores base)

Serialization

Serialization converts a parsed URL object back into a string. The module provides callback-based serialization for all URL components.

Full URL Serialization

static lxb_status_t
callback(const lxb_char_t *data, size_t len, void *ctx)
{
    printf("%.*s", (int) len, (const char *) data);
    return LXB_STATUS_OK;
}

/* Serialize full URL */
lxb_url_serialize(url, callback, NULL, false);
/* false = include fragment; true = exclude fragment */

Note: The callback may be called multiple times for a single URL. For example, serializing https://example.com/path will call the callback separately for https, ://, example.com, and /path.

Unicode Domain Serialization

By default, international domain names are stored in their ASCII (Punycode) form. To serialize them back to Unicode, use lxb_url_serialize_idna():

lxb_unicode_idna_t idna;

lxb_unicode_idna_init(&idna);

/* Serialize with Unicode domain names */
lxb_url_serialize_idna(&idna, url, callback, NULL, false);

lxb_unicode_idna_destroy(&idna, false);

For a URL like https://тест.com/, lxb_url_serialize() outputs https://xn--e1afmapc.com/, while lxb_url_serialize_idna() outputs https://тест.com/.

Component Serialization

Each URL component can be serialized individually:

lxb_url_serialize_scheme(url, callback, ctx);      /* "https"       */
lxb_url_serialize_username(url, callback, ctx);    /* "user"        */
lxb_url_serialize_password(url, callback, ctx);    /* "pass"        */
lxb_url_serialize_host(host, callback, ctx);       /* "example.com" */
lxb_url_serialize_port(url, callback, ctx);        /* "8080"        */
lxb_url_serialize_path(path, callback, ctx);       /* "/path"       */
lxb_url_serialize_query(url, callback, ctx);       /* "key=value"   */
lxb_url_serialize_fragment(url, callback, ctx);    /* "section"     */

Note that lxb_url_serialize_host() takes const lxb_url_host_t * (use lxb_url_host(url)) and lxb_url_serialize_path() takes const lxb_url_path_t * (use lxb_url_path(url)).

For Unicode domain serialization:

lxb_url_serialize_host_unicode(&idna, lxb_url_host(url), callback, ctx);

There are also specialized host serialization functions:

lxb_url_serialize_host_ipv4(ipv4_value, callback, ctx);   /* "192.168.0.1"    */
lxb_url_serialize_host_ipv6(ipv6_array, callback, ctx);   /* "[::1]"          */

Accessing URL Components

After parsing, use inline accessor functions to read URL components:

lxb_url_t *url = lxb_url_parse(&parser, NULL, str, len);

/* Scheme: always present after successful parse */
const lexbor_str_t *scheme = lxb_url_scheme(url);
printf("Scheme: %.*s\n", (int) scheme->length, scheme->data);

/* Host: check type before accessing */
const lxb_url_host_t *host = lxb_url_host(url);

if (host->type == LXB_URL_HOST_TYPE_DOMAIN) {
    printf("Domain: %.*s\n",
           (int) host->u.domain.length, host->u.domain.data);
}
else if (host->type == LXB_URL_HOST_TYPE_IPV4) {
    printf("IPv4: %u\n", host->u.ipv4);
}

/* Port: check if explicitly present */
if (lxb_url_has_port(url)) {
    printf("Port: %u\n", lxb_url_port(url));
}

/* Path */
const lexbor_str_t *path = lxb_url_path_str(url);
if (path->data != NULL) {
    printf("Path: %.*s\n", (int) path->length, path->data);
}

/* Query */
const lexbor_str_t *query = lxb_url_query(url);
if (query->data != NULL) {
    printf("Query: %.*s\n", (int) query->length, query->data);
}

/* Fragment */
const lexbor_str_t *fragment = lxb_url_fragment(url);
if (fragment->data != NULL) {
    printf("Fragment: %.*s\n", (int) fragment->length, fragment->data);
}

/* Credentials */
const lexbor_str_t *user = lxb_url_username(url);
const lexbor_str_t *pass = lxb_url_password(url);
if (user->length > 0) {
    printf("User: %.*s\n", (int) user->length, user->data);
}

Modifying URLs (URL API)

After parsing a URL, you can modify individual components using the URL API functions. These follow the WHATWG URL API specification — the same interface that JavaScript’s URL object provides.

All API functions accept NULL as the data pointer, which is treated as an empty string.

lxb_url_t *url = lxb_url_parse(&parser, NULL, str, len);

/* Change the protocol (scheme) */
lxb_url_api_protocol_set(url, &parser,
                          (lxb_char_t *) "http:", 5);

/* Change the host */
lxb_url_api_host_set(url, &parser,
                      (lxb_char_t *) "other.com:9090", 14);

/* Change just the hostname (without port) */
lxb_url_api_hostname_set(url, &parser,
                          (lxb_char_t *) "new.com", 7);

/* Change the port */
lxb_url_api_port_set(url, &parser,
                      (lxb_char_t *) "3000", 4);

/* Change the path */
lxb_url_api_pathname_set(url, &parser,
                          (lxb_char_t *) "/new/path", 9);

/* Change the query */
lxb_url_api_search_set(url, &parser,
                        (lxb_char_t *) "?newkey=newval", 13);

/* Change the fragment */
lxb_url_api_hash_set(url, &parser,
                      (lxb_char_t *) "#new-section", 12);

/* Change credentials */
lxb_url_api_username_set(url, (lxb_char_t *) "admin", 5);
lxb_url_api_password_set(url, (lxb_char_t *) "secret", 6);

/* Replace the entire URL */
lxb_url_api_href_set(url, &parser,
                      (lxb_char_t *) "https://completely.new/url", 26);

Note: The parser parameter is optional for some API functions. If you pass NULL, parsing still works but no validation logs will be generated. The lxb_url_api_username_set() and lxb_url_api_password_set() functions don’t need a parser at all.

URLSearchParams

URLSearchParams provides a convenient way to work with URL query string parameters. It implements the URLSearchParams interface from the WHATWG specification — the same API available in JavaScript.

Creating from a Query String

#include <lexbor/url/url.h>

/* URLSearchParams needs a memory allocator */
lexbor_mraw_t *mraw = lexbor_mraw_create();
lexbor_mraw_init(mraw, 256);

/* Parse query parameters */
static const lxb_char_t query[] = "name=Alice&age=30&color=blue&color=red";

lxb_url_search_params_t *sp = lxb_url_search_params_init(
    mraw, query, sizeof(query) - 1
);

/* sp now contains 4 entries: name=Alice, age=30, color=blue, color=red */

You can also use the memory allocator from a parsed URL:

lxb_url_t *url = lxb_url_parse(&parser, NULL, str, len);

/* Use the URL's own memory allocator */
lxb_url_search_params_t *sp = lxb_url_search_params_init(
    url->mraw, lxb_url_query(url)->data, lxb_url_query(url)->length
);

Getting Values

/* Get the first value for a parameter name */
lexbor_str_t *value = lxb_url_search_params_get(sp,
    (lxb_char_t *) "name", 4);

if (value != NULL) {
    printf("name = %.*s\n", (int) value->length, value->data);
}

/* Get the entry object (has both name and value) */
lxb_url_search_entry_t *entry = lxb_url_search_params_get_entry(sp,
    (lxb_char_t *) "name", 4);

/* Get count of values for a parameter name */
size_t count = lxb_url_search_params_get_count(sp,
    (lxb_char_t *) "color", 5);
/* count = 2 (blue, red) */

/* Get all values for a parameter name */
lexbor_str_t *buf[10];
size_t found = lxb_url_search_params_get_all(sp,
    (lxb_char_t *) "color", 5, buf, 10);
/* found = 2, buf[0] = "blue", buf[1] = "red" */

Checking Existence

/* Check if parameter name exists */
bool exists = lxb_url_search_params_has(sp,
    (lxb_char_t *) "name", 4, NULL, 0);

/* Check if specific name=value pair exists */
bool exact = lxb_url_search_params_has(sp,
    (lxb_char_t *) "color", 5,
    (lxb_char_t *) "blue", 4);

Adding and Modifying

/* Append a new parameter (duplicates allowed) */
lxb_url_search_params_append(sp,
    (lxb_char_t *) "lang", 4,
    (lxb_char_t *) "en", 2);

/* Set a parameter (removes all existing with this name, creates one) */
lxb_url_search_params_set(sp,
    (lxb_char_t *) "color", 5,
    (lxb_char_t *) "green", 5);
/* Now there is only one color=green (blue and red were removed) */

Deleting

/* Delete all parameters with a given name */
lxb_url_search_params_delete(sp,
    (lxb_char_t *) "age", 3, NULL, 0);

/* Delete only a specific name=value pair */
lxb_url_search_params_delete(sp,
    (lxb_char_t *) "color", 5,
    (lxb_char_t *) "blue", 4);

Sorting

/* Sort parameters alphabetically by name */
lxb_url_search_params_sort(sp);

Serialization

static lxb_status_t
callback(const lxb_char_t *data, size_t len, void *ctx)
{
    printf("%.*s", (int) len, (const char *) data);
    return LXB_STATUS_OK;
}

/* Serialize to application/x-www-form-urlencoded format */
lxb_url_search_params_serialize(sp, callback, NULL);
/* Outputs: "name=Alice&color=green&lang=en" */

The callback is called exactly once with the fully prepared string.

Iteration

You can iterate through matching entries using a callback:

static lexbor_action_t
match_cb(lxb_url_search_params_t *sp,
         lxb_url_search_entry_t *entry, void *ctx)
{
    printf("%.*s = %.*s\n",
           (int) entry->name.length, entry->name.data,
           (int) entry->value.length, entry->value.data);

    return LEXBOR_ACTION_OK;  /* continue iteration */
    /* return LEXBOR_ACTION_STOP to stop early */
}

/* Iterate all entries with name "color" */
lxb_url_search_params_match(sp,
    (lxb_char_t *) "color", 5, NULL, 0,
    match_cb, NULL);

Or iterate manually using lxb_url_search_params_match_entry():

lxb_url_search_entry_t *entry = NULL;

while ((entry = lxb_url_search_params_match_entry(sp,
    (lxb_char_t *) "color", 5, NULL, 0, entry)) != NULL)
{
    printf("color = %.*s\n",
           (int) entry->value.length, entry->value.data);
}

Cleanup

lxb_url_search_params_destroy(sp);
lexbor_mraw_destroy(mraw, true);

Special Schemes

The URL specification defines a set of “special” schemes with default ports and specific parsing rules:

Scheme

Default Port

Notes

http

80

https

443

ws

80

WebSocket

wss

443

Secure WebSocket

ftp

21

file

No port, no credentials

Special schemes have the following behavior differences:

  • Default ports are implicit. Parsing https://example.com:443/ will not store port 443 — lxb_url_has_port() returns false, because 443 is the default for HTTPS.

  • Host is required. http:path is invalid. Non-special schemes like foo:path are valid.

  • Backslash \ is treated as / in the authority and path. https://example.com\path is parsed as https://example.com/path.

  • Two slashes // are expected after the scheme. http://host is the normal form.

Non-special schemes (like data:, mailto:, blob:, or any custom scheme) have more relaxed rules: they can have opaque paths, don’t require a host, and don’t have default ports.

International Domain Names

Domain names with non-ASCII characters (like тест.com or münchen.de) are automatically converted to their ASCII (Punycode) representation during parsing. So тест.com becomes xn--e1afmapc.com in the parsed URL.

To display the domain back in Unicode form, use lxb_url_serialize_host_unicode() with an IDNA object (see Serialization).

Memory Management

The URL module uses lexbor_mraw_t as its memory allocator.

How It Works

  1. When you call lxb_url_parser_init(&parser, NULL), the parser creates its own lexbor_mraw_t object internally.

  2. Every URL created by lxb_url_parse() gets its memory from this allocator and stores a pointer to it (url->mraw).

  3. The parser and URLs are independent — destroying the parser does NOT destroy the URLs.

Destroying a Single URL

lxb_url_t *url = lxb_url_parse(&parser, NULL, str, len);
/* ... use url ... */
lxb_url_destroy(url);  /* Frees this URL's memory */

Destroying All URLs at Once

If you created multiple URLs from the same parser, you can destroy all of them at once by destroying the memory allocator:

lxb_url_t *url1 = lxb_url_parse(&parser, NULL, str1, len1);
lxb_url_parser_clean(&parser);
lxb_url_t *url2 = lxb_url_parse(&parser, NULL, str2, len2);

/* Destroy all URLs created by this parser's allocator */
lxb_url_memory_destroy(url1);
/* Both url1 and url2 are now invalid — they shared the same allocator */

You can also destroy all URLs via the parser:

lxb_url_parser_memory_destroy(&parser);
/* All URLs created by this parser are now invalid */

Using a Custom Memory Allocator

You can pass your own lexbor_mraw_t to the parser:

lexbor_mraw_t *mraw = lexbor_mraw_create();
lexbor_mraw_init(mraw, 4096);

lxb_url_parser_init(&parser, mraw);

/* URLs will use your allocator */
lxb_url_t *url = lxb_url_parse(&parser, NULL, str, len);

/* You manage the allocator's lifecycle */
lexbor_mraw_destroy(mraw, true);  /* Destroys all URLs too */

Replacing the Allocator

After calling lxb_url_parser_memory_destroy(), you need to assign a new allocator before parsing again:

lxb_url_parser_memory_destroy(&parser);
/* parser->mraw is now garbage */

lexbor_mraw_t *new_mraw = lexbor_mraw_create();
lexbor_mraw_init(new_mraw, 4096);

lxb_url_mraw_set(&parser, new_mraw);
/* Parser is ready to create new URLs */

Cloning a URL

You can clone a URL, optionally into a different memory allocator:

/* Clone into the same allocator */
lxb_url_t *copy = lxb_url_clone(url->mraw, url);

/* Clone into a different allocator */
lexbor_mraw_t *other_mraw = lexbor_mraw_create();
lexbor_mraw_init(other_mraw, 4096);

lxb_url_t *independent_copy = lxb_url_clone(other_mraw, url);
/* independent_copy has its own memory — original can be destroyed */

Error Handling

Parse Errors

lxb_url_parse() returns NULL if the URL is invalid. lxb_url_parse_basic() returns a status code.

The parser generates validation errors for various issues during parsing. These correspond to the validation errors defined in the WHATWG specification:

Error Type

Description

LXB_URL_ERROR_TYPE_DOMAIN_TO_ASCII

IDNA encoding failed

LXB_URL_ERROR_TYPE_DOMAIN_INVALID_CODE_POINT

Invalid character in domain

LXB_URL_ERROR_TYPE_HOST_INVALID_CODE_POINT

Invalid character in host

LXB_URL_ERROR_TYPE_IPV4_TOO_MANY_PARTS

IPv4 address has more than 4 parts

LXB_URL_ERROR_TYPE_IPV4_OUT_OF_RANGE_PART

IPv4 part exceeds 255

LXB_URL_ERROR_TYPE_IPV6_UNCLOSED

Missing closing ] in IPv6

LXB_URL_ERROR_TYPE_IPV6_TOO_MANY_PIECES

IPv6 has more than 8 groups

LXB_URL_ERROR_TYPE_INVALID_URL_UNIT

Invalid character in URL

LXB_URL_ERROR_TYPE_MISSING_SCHEME_NON_RELATIVE_URL

No scheme and no base URL

LXB_URL_ERROR_TYPE_HOST_MISSING

Special scheme requires a host

LXB_URL_ERROR_TYPE_PORT_OUT_OF_RANGE

Port number exceeds 65535

LXB_URL_ERROR_TYPE_PORT_INVALID

Port contains non-numeric characters

LXB_URL_ERROR_TYPE_INVALID_CREDENTIALS

Credentials in URL where not allowed

All error types are defined in the lxb_url_error_type_t enum.