Hello, everyone!
We continue the series of articles on developing a browser engine. Yes, better late than never. Yes, there was a long break. At the end of the article, I will describe how the lexbor project is doing and what is happening with it.
In this article, I will try to delve into the peculiarities of parsing Cascading Style Sheets (CSS). I'll explain how to turn the "hedgehog" inside out and how to test the obtained result.
Everything in CSS specifications is explained, well, almost everything. Here, I will tell you how everything is organized, where to look, and where to start. This article is more of an overview, and it won't delve into implementation details. Instead, it provides general information and basic algorithms. For the smallest implementation details, please refer to the GitHub code.
And, of course, as usual, we will aim for the title of the fastest CSS parser.
There are two sources for CSS specifications:
We will be working with drafts.csswg.org. It's concise, and each module has links to all versions of the module, from draft to recommended for use.
CSS is organized into modules: Syntax, Namespaces, Selectors, CSSOM, Values, and so on. You can find a complete list on csswg.org.
Each module, or rather its version, has a status: Working Draft, Candidate Recommendation, Recommendation, and so on. You can see all stages on w3.org. In simple terms, each module has a note: recommended for use, just started writing and it's unclear what will come of it, maybe we'll delete this module altogether, and so on, but more concise.
We are interested only in Editor’s Draft and Working Draft with a glance at Recommendation. Honestly, W3 is a bit slow, and by the time they reach Recommendation (billions of stars will burn in space), everything will be outdated. In other words, we will treat CSS standards as living (living), just like the HTML standard.
Of all the CSS modules, some are fundamental—without them, nothing works. This is the BASIS.
Syntax
The syntax module is the foundation. It describes the structure of CSS, the principles of token construction, and their subsequent parsing.
Values
The module describes the grammar. In all modules, we will encounter grammars and basic CSS types.
For example, for the width
property, the grammar looks like:
width = auto | <length-percentage [0,∞]> | min-content | max-content | fit-content(<length-percentage [0,∞]>)
In addition, the module describes basic types and mathematical functions: <length>
, <angle>
, <time>
, min()
, max()
, and so on.
Let's look at how the <length-percentage>
type breaks down:
<length-percentage> = <length> | <percentage-token>
<length> = <dimension-token> | <number-token [0,0]>
All types ending in -token
are tokens obtained from the tokenizer, meaning they carry various data. In this article, we won't go into detail about each type. You'll find everything in the specifications I mention. I'm describing the overall picture.
All of this is the foundation to not be intimidated by the other modules and to understand what's written in them. And to understand what unequal battle we will face next.
Syntax
: TokenizerJust like everywhere else, the tokenizer takes a data stream and breaks it into tokens.
Let's consider an example, taking CSS:
div {width: 10px !important}
The tokenizer will create tokens:
"div" — <ident-token>
" " — <whitespace-token>
"{" — <left-curly-bracket-token>
"width" — <ident-token>
":" — <colon-token>
" " — <whitespace-token>
"10px" — <dimension-token>
" " — <whitespace-token>
"!" — <delim-token>
"important" — <ident-token>
"}" — <right-curly-bracket-token>
You can find all types of tokens and the algorithm for creating them in the specification. There's nothing unique here.
Syntax
: ParsingParsing accumulates tokens into specific structures for further processing in various CSS modules.
A specific structure includes:
For example, let's take a closer look at Qualified Rule
.
Again, take our example:
div {width: 10px !important}
Qualified Rule
can have a prelude
and rules
.
Here:
div
— is the prelude.{width: 10px !important}
— is the rules list.In this case:
Prelude
contains a Component value
.Rules
contain Lists of Declarations
.width: 10px !important
— is a declaration. When there are many, it becomes a list of declarations (Lists of Declarations
).
Therefore, the declaration contains:
width
— the name (name
)10px
— the value (value
)!important
— the important
flag.Ordinary users would describe this as:
div
— CSS Selectors{width: 10px !important}
— a list of CSS properties.It's not complicated. Other structures are organized in a similar way. For example, Stylesheet
contains Rules
, namely At-Rule
and Qualified Rule
in any quantity.
As we have already understood, the created structures with lists of tokens must be processed by someone else. That someone else is other CSS modules.
For example, the CSS Selectors module receives data from Qualified Rule
Prelude
for parsing. The [Media Queries module](https://drafts.csswg.org/mediaqueries
-4/#mq-syntax) receives the entire structure of At-Rule
for parsing. Each to its own!
Let's take a well-known bootstrap.css
whose weight is approximately ≈180KB
. The tokenizer will create around ≈51700
tokens from this file, excluding comments. That's a considerable number.
Now, let's imagine that in the first pass, we go through them when forming the structure in syntax, and in the second and subsequent passes, we break them down by modules. Of course, subsequent passes will deal with a smaller number of tokens; {
, }
, [
, ]
, and some others will be discarded depending on the situation.
Here is where we start thinking together—how do we optimize this?! There are several options to implement an optimized CSS parser; let's consider the main ones.
We can set up callbacks for all stages of CSS Syntax structures and pass tokens there. It would look something like this:
div {width: 10px !important}
"div" — callback_qualified_rule_prelude(<ident-token>)
" " — callback_qualified_rule_prelude(<whitespace-token>)
— callback_qualified_rule_prelude(<end-token>)
"{" — skip <left-curly-bracket-token>
"width" — callback_declaration_name(<ident-token>)
":" — skip <colon-token>
" " — skip <whitespace-token>
"10px" — callback_declaration_value(<dimension-token>)
" " — skip <whitespace-token>
"!" — skip <delim-token>
"important" — skip <ident-token>
- callback_declaration_important(true)
"}" — skip <right-curly-bracket-token>
As seen from this small example, there will be many callbacks. It will be challenging to keep track of this entire zoo.
In this variant, much is shifted to the user. Also, a significant problem is the inability to look at the next token. Someone might argue and say, "I can call the tokenizer and ask for the next token without violating the general algorithm." The answer is simple: you cannot do that. We won't know if the next token relates to our current stage or not. Therefore, we would be taking on double work, analyzing the CSS structure again.
Pros:
Cons:
Let's have each module keep track of the CSS structure by itself! In general, let's shift everything onto the user; let them figure it out, they know better.
If the user starts parsing CSS Selectors, they should not only deal with the selector grammar but also track nesting through tokens: {
, }
, (
, )
, function(
.
It will look something like this:
div {width: 10px !important}
"div {" — Selectors parse
"width: 10px !important}" — Declarations parse
Parsing selectors will proceed as follows:
{
at the current nesting depth?{
, switch to the next parsing stage.Sounds simple, but in reality, it's more complicated.
{
or }
token before passing it to the next stage?}
. What if it was (})
? It's always essential to precisely understand how the CSS structure is described at the current moment. This is very problematic and significantly complicates development and debugging if something goes wrong.Pros:
Cons:
Initially, I "played" with this parsing approach.
The idea was:
In other words, we give the code generator the grammar of the Selectors module, and it generates C code for parsing, taking into account the global CSS structure. It mixes the understanding of the structure into the Selectors grammar.
I achieved decent results with this approach but decided to stop in this direction. Creating such a code generator is a very large and challenging task, but interesting. Just one optimization stage is worth it. I decided to return to this when the lexbor
project is extensively used and a significant speed boost is needed.
We've already established that it's crucial for us to have the ability to control tokens, obtain them independently.
When we talk about parsing any data, we always imagine a clear sequence:
Everything is logical.
What if we make the tokenizer keep track of the global CSS structure? In other words, when requesting a token from the tokenizer, it internally monitors the CSS structure. A sort of inside-out parsing.
This approach is implemented in my lexbor
project.
The principle is as follows: We set up callbacks for different stages of parsing the CSS structure. At each stage, the callback is called only once. If the prelude begins, the beginning callback is called. Not for every token, only at the start.
Okay, but how do we track the structure in our callbacks?
We create several functions, proxy functions for tokenizer functions. The token acquisition function lxb_css_syntax_parser_token()
and the token consumption function lxb_css_syntax_parser_consume()
. The user interacts not directly with the tokenizer but with our proxy functions.
The algorithm of the lxb_css_syntax_parser_token()
function is simple:
LXB_CSS_SYNTAX_TOKEN__TERMINATED
.When the termination token LXB_CSS_SYNTAX_TOKEN__TERMINATED
is returned to the user, the CSS structure analyzer enters the stage of waiting for a decision from the user. There can be several decisions:
lxb_css_parser_success()
— everything went successfully, the expected data.lxb_css_parser_failed()
— unexpected data for us.lxb_css_parser_memory_fail()
— a memory allocation error occurred, ending parsing.lxb_css_parser_stop()
— simply stop further parsing.In other words, if the user is in any parsing stage, for example, in the Prelude
stage of the Qualified Rule
, they cannot exit it, and the CSS structure analyzer will not switch further until the user returns success
or failed
. In addition, decision-making functions can be called immediately. If you call lxb_css_parser_success()
and there are tokens not related to LXB_CSS_SYNTAX_TOKEN_WHITESPACE
in the current stage, the current stage automatically switches to failed
. This is convenient because we don't need to check all tokens until LXB_CSS_SYNTAX_TOKEN__TERMINATED
arrives. If we are sure that everything is parsed in the current stage, we return lxb_css_parser_success()
. If there are still some tokens, everything will automatically resolve.
The final algorithm looks something like this:
Qualified Rule
. Sets callbacks for the beginning of Prelude
, the beginning of the Rules
block, and the end of parsing Qualified Rule
.lxb_css_syntax_parser_token()
.
3.1.1. Get a token from the tokenizer.
3.1.2. Analyze it in the CSS structure.
3.1.3. The user callback is set by the parser based on the CSS structure.
3.2. We call the set callback with the obtained token.
3.3. In the callback, the user retrieves all tokens until LXB_CSS_SYNTAX_TOKEN__TERMINATED
using the function lxb_css_syntax_parser_token()
.
3.4. The user returns control to us, go to step 3.1.This is a very simplified picture of the world. Inside the lxb_css_syntax_parser_token()
function, there are phases—switching between CSS structures like Qualified Rule
, At-Rule
, and so on + various system phases. There is also a stack; CSS structure is recursive, and we don't like recursion.
Pros:
Cons:
To grasp the scale of the parsing challenges, let's delve into how grammar syntax is structured. Values in grammars can have combinators and multipliers.
Sequential Order:
<my> = a b c
<my>
can contain the following value:
<my> = a b c
One Value from the List:
<my> = a | b | c
<my>
can contain the following values:
<my> = a
<my> = b
<my> = c
One or All Values from the List in Any Order:
<my> = a || b || c
<my>
can contain the following values:
<my> = a
<my> = a b
<my> = a c
<my> = a b c
<my> = a c b
<my> = b
<my> = b a
<my> = b c
<my> = b a c
<my> = b c a
<my> = c
<my> = c a
<my> = c b
<my> = c a b
<my> = c b a
All Values from the List in Any Order:
<my> = a && b && c
<my>
can contain the following values:
<my> = a b c
<my> = a c b
<my> = b a c
<my> = b c a
<my> = c a b
<my> = c b a
Values can be Grouped:
<my> = [ [a | b | c] && x ] r t v
Those familiar with regex will understand immediately.
Zero or Infinite Number of Times:
<my> = a*
<my>
can contain the following values:
<my> = a
<my> = a a a a a a a a a a a a a
<my> =
One or Infinite Number of Times:
<my> = a+
<my>
can contain the following values:
<my> = a
<my> = a a a a a a a a a a a a a
May or May Not be Present:
<my> = a?
<my>
can contain the following values:
<my> = a
<my> =
May be Present from A
to B
Times, Period:
<my> = a{1,4}
<my>
can contain the following values:
<my> = a
<my> = a a
<my> = a a a
<my> = a a a a
One or Infinite Number of Times Separated by Comma:
<my> = a#
<my>
can contain the following values:
<my> = a
<my> = a, a
<my> = a, a, a
<my> = a, a, a, a
Exactly One Value Must be Present:
<my> = [a? | b? | c?]!
In this example, absence of a value within the group is allowed, but the exclamation mark !
demands at least one, otherwise it's an error.
Multipliers can be Combined:
<my> = a#{1,5}
Values of a
, separated by commas, from one to five times.
That's all you need to know about grammar; the most crucial parts.
Now, let's look at the grammar syntax for color:
<color> = <absolute-color-base> | currentcolor | <system-color>
<absolute-color-base> = <hex-color> | <absolute-color-function> | <named-color> | transparent
<absolute-color-function> = <rgb()> | <rgba()> |
<hsl()> | <hsla()> | <hwb()> |
<lab()> | <lch()> | <oklab()> | <oklch()> |
<color()>
<rgb()> = [ <legacy-rgb-syntax> | <modern-rgb-syntax> ]
<rgba()> = [ <legacy-rgba-syntax> | <modern-rgba-syntax> ]
<legacy-rgb-syntax> = rgb( <percentage>#{3} , <alpha-value>? ) |
rgb( <number>#{3} , <alpha-value>? )
<legacy-rgba-syntax> = rgba( <percentage>#{3} , <alpha-value>? ) |
rgba( <number>#{3} , <alpha-value>? )
<modern-rgb-syntax> = rgb( [ <number> | <percentage> | none]{3}
[ / [<alpha-value> | none] ]?)
<modern-rgba-syntax> = rgba([ <number> | <percentage> | none]{3}
[ / [<alpha-value> | none] ]?)
...
It's hard to figure it out while nursing a hangover. However, writing a parser for this is not difficult, even straightforward. But writing all test variations for this is quite challenging. And if you consider the myriad CSS declarations, your hands drop even more.
The logical thought here is to write a test generator for grammars! It turned out to be easier said than done. Not rocket science, but it makes one ponder in the cold.
The main problems I encountered:
For example, take such a grammar:
<text-decoration-line> = none | [ underline || overline || line-through || blink ]
<text-decoration-style> = solid | double | dotted | dashed | wavy
<text-decoration-color> = <color>
<text-decoration> = <text-decoration-line> || <text-decoration-style> || <text-decoration-color>
Generating tests for <text-decoration>
would go on indefinitely.
All <color>
values must be combined with <text-decoration-line>
and <text-decoration-style>
in different variations, which is a lot.
Had to come up with a limiter for group options — /1
. A slash and a value indicating how many options should be taken from the group.
As a result, <text-decoration>
was transformed into:
<text-decoration> = <text-decoration-line> || <text-decoration-style> || <text-decoration-color>/1
LXB_CSS_SYNTAX_TOKEN_WHITESPACE
).Usually, it is mentioned below the grammars that certain values should not have spaces between them.
This doesn't suit us; we need to consider this directly in the grammar.
The ^WS
modifier (Without Spaces) was introduced:
<frequency> = <number-token> <frequency-units>^WS
<
frequency-units> = Hz | kHz
Spaces before <frequency-units>
are not allowed.
The order of input data can be arbitrary, but after parsing, values find their positions in structures, and during serialization, they have a precise order.
For example:
<x> = a && b && c
Tests would be generated as follows:
<x> = a b c
<x> = a c b
<x> = b a c
<x> = b c a
<x> = c a b
<x> = c b a
All these tests are valid, but our result after parsing will always be <x> = a b c
. The question arises — how to compare with the others?
My inner sense hinted that the task suddenly became much more complicated, but the sense of courage (not foolishness) pushed to tackle it head-on.
And, as you understand, it didn't work out right away. I had to think!
Let's look at such an example:
<x> = a && [x || y || [f && g && h]] && c
It became clear that the result of generating tests for each group should return not just a test but also the correct answer for that test. This became a bit of a problem. The implementation got complicated.
Various solutions were tried, I would even say crutches, to avoid complicating the code. But they all brought a bunch of exceptions and were 10% non-working code. It seemed like assigning a unique index in ascending order, depending on where the value is, and in the end, when the test is formed, sorting the indices to get the result for the parser. Here you have a test and what result the parser will return.
Still no. Each value has its own Combinators and Multipliers that can place different values between them. For example, somewhere there might be a space, somewhere not, somewhere commas between values. If we simply sort the final result, we get a mess.
In the end, the most reliable solution is to generate the test and result separately. In other words, forming the result will go through the same stages as forming the test. Costly, of course, but okay, we're not real-time; we can wait.
As a result, we got a wonderful tool that generates tests for grammars.
Now tests for CSS Declarations (properties) are generated in 1
second, and their number is 82911
. On disk, they occupy about 20MB
in JSON format.
Admit it, manually writing so many tests is a time-consuming task.
This approach helped uncover quite a few parsing issues with properties; I think I caught around 10 errors. But now there is 100% confidence that valid CSS will be parsed correctly.
And here comes an urgent question — how to test invalid CSS?
At this stage, I actively use the Clang
fuzzer. But, of course, it is necessary to implement incorrect/broken tests using grammar. So that from incorrect tests, a false correct result does not come out.
Code testing takes as much time as its development, if not more. Currently, the lexbor
project undergoes continuous testing on approximately 30 operating systems with enabled asan
, msan
, UB
, and memory leak
(where possible). Additionally, Clang
fuzzers are constantly at work.
Despite a long hiatus, which was indeed prolonged, the project is actively evolving. Much has already been implemented, and currently, the implementation of the layout/renderer tree is in progress. In other words, the project will soon create windows, render glyphs, draw blocks, inlines, and so forth. The code is abundant; it could fill five articles, provided one refrains from verbosity, and possibly even more.
The break in development was tied to my work at NGINX, where I immersed myself in developing the NJS JavaScript engine. This, in turn, benefitted the lexbor project in terms of knowledge.
Fortunately, a very capable individual (proficient in English and technical documentation) joined me, assisting with the project's documentation. He has taken charge of all English language aspects and anything related to project documentation in general.
They won't be available, at least not right now.
At present, the lexbor project only supports Selectors and Declarations for full parsing. This suffices for the engine's further development. As soon as @media
, variables, and other features appear, we'll immediately measure ourselves against others.
Anything unsupported by the CSS parser is categorized as CUSTOM
. In other words, in the final tree, all data will be present, just with a *_custom
structure.
I'll hint only that lightningcss, written in Rust and positioned as "An extremely fast CSS parser," did not surprise me.
Project link: lexbor.
CSS examples: CSS Examples.
Examples of usage with HTML: HTML + CSS.