# Part One - HTML Hello, everyone! In this article, I will tell you how to create a superfast HTML parser supporting DOM. We will look at the HTML specification and its disadvantages in terms of performance and resource consumption during HTML parsing. I assume the reader has a basic knowledge of HTML: tags, nodes, elements, and namespace. ## HTML Specification Before we implement the parser, it is necessary to understand which HTML specification we should rely upon. There are two HTML specifications: 1. [WHATWG](https://html.spec.whatwg.org/multipage/) * Apple, Mozilla, Google, Microsoft 2. [W3C](https://www.w3.org/TR/html/) * A long list of companies Naturally, our choice fell on the industry leaders, namely, WHATWG. It's a living standard; these are large companies, each maintaining its own browser or a browser engine. UPDATE: Unfortunately, the links don't work if you are located in Russia. Apparently, it's the "echo" of the Telegram throttling attempts by the Man. ## HTML Parsing Details HTML tree construction can be divided into four parts: 1. Decoder 2. Preprocessing 3. Tokenizer 4. Tree construction Let's consider each stage separately. ## Decoder The tokenizer accepts Unicode characters (code points) as input. Accordingly, we need to convert the byte stream to Unicode characters. For this, we use the Encoding specification. If our HTML encoding is unknown (there's no HTTP header), we need to identify it to start decoding. To do so, we use the encoding sniffing algorithm. Essentially, the algorithm is as follows: we wait for `500ms` or the first `1024 bytes` from the byte stream and run a prescan for the byte stream to determine its encoding; this algorithm attempts to find the `` tag with `http-equiv`, content, or charset attributes to understand which encoding the HTML developer has assumed. The Encoding specification identifies the minimum set of supported encodings for a browser engine (21 in total): UTF-8, ISO-8859-2, ISO-8859-7, ISO-8859-8, windows-874, windows-1250, windows-1251, windows-1252, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, gb18030, Big5, ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE, UTF-16LE, and x-user-defined. ## Preprocessing Once we have decoded the bytes into Unicode characters, we need to perform a "sweep". Specifically, we have replace all carriage return characters (`\r`) followed by a newline character (`\n`) with a single carriage return character (`\r`). Next, we replace all carriage return characters with newline characters (`\n`). That's how the specification has it. Namely, `\r\n` => `\r`, `\r` => `\n`. However, in fact, no one does exactly this. It's done in a simpler manner: If you've got a carriage return character (`\r`), check whether there is a newline character (`\n`) coming. If there is one, we replace both characters with a newline character (`\n`); otherwise, we replace only the first character (`\r`) with a newline (`\n`). This completes the preprocessing. Yes, all you have to do is get rid of the carriage return symbols so they don't get into the tokenizer. The tokenizer does not expect them and does not know what to do with the carriage returns. ## Parsing Errors To answer upcoming questions in advance, I should tell now what a parse error is. It's really not a big deal. It sounds menacing, but in fact it is only a warning: we expect one thing and receive another. A parse error does not stop data processing or prevent us from constructing a tree. It is a message which says that our HTML is invalid. The parse error can occur due to a surrogate pair, `\0`, wrong tag location, wrong ``, etc. By the way, some parse errors actually have consequences. For example, if you get "bad" ``, the HTML tree will be labeled as `QUIRKS` which affects the logic of some DOM functions. ## Tokenizer As mentioned earlier, the tokenizer accepts Unicode characters as input. It is a state machine which has `80 states`. Each state has conditions for Unicode characters. Depending on the arriving symbol, the tokenizer may: 1. Change state 2. Generate token and change state 3. Do nothing and wait for the next character The tokenizer generates six types of tokens: `DOCTYPE`, `Start Tag`, `End Tag`, `Comment`, `Character`, `End-Of-File`. Which enter the tree construction stage. Note that the tokenizer does not actually know all its states, only about 40% of them (approximately). "Why all the others?", one might ask. The remaining 60% are used to build the tree. This is required to parse such tags as `