<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Untitled Publication]]></title><description><![CDATA[Untitled Publication]]></description><link>https://blog.sylver.dev</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 15:02:35 GMT</lastBuildDate><atom:link href="https://blog.sylver.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Build your own SQLite, Part 6: Overflow pages]]></title><description><![CDATA[Up to this point, we've been using simple test databases where the data for each row fits within a single page. However, in the wild, it is quite common for a row to be larger than a single page (typically 4096 bytes), especially when using variable-...]]></description><link>https://blog.sylver.dev/build-your-own-sqlite-part-6-overflow-pages</link><guid isPermaLink="true">https://blog.sylver.dev/build-your-own-sqlite-part-6-overflow-pages</guid><category><![CDATA[Rust]]></category><category><![CDATA[SQLite]]></category><category><![CDATA[from scratch]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Mon, 07 Jul 2025 22:59:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751929165169/d7ea3979-d856-4a1d-9300-e4b0c7fa7d68.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Up to this point, we've been using simple test databases where the data for each row fits within a single page. However, in the wild, it is quite common for a row to be larger than a single page (typically 4096 bytes), especially when using variable-length fields like <code>TEXT</code> or <code>BLOB</code>. How does SQLite handle such cases?</p>
<p>In this post, we'll explore the overflow mechanism in SQLite and implement it in our own toy database, allowing us to read large <code>text</code>s and <code>blob</code>s.</p>
<p>As usual, the source code for this post is available on <a target="_blank" href="https://github.com/geoffreycopin/rqlite/tree/fb4e122c7ef01a9dbc69c01c68919d91a66188d0">GitHub</a>.</p>
<h2 id="heading-overview">Overview</h2>
<pre><code class="lang-text">Overflow pages structure
------------------------ 

Page 3                                Page 4
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│ Next: Page 4 │   Data...        │┌─&gt;│ Next: NULL   │   Data...        │
└─────────────────────────────────┘│  └─────────────────────────────────┘
       │                           │
       └───────────────────────────┘
</code></pre>
<p>When a row's payload is too large to be stored entirely in a B-tree leaf cell, our database has two questions to answer:</p>
<ol>
<li><p>where and in what format to store data that doesn't fit in a page?</p>
</li>
<li><p>what to write in the B-tree cell?</p>
</li>
</ol>
<p>The answer to the first question is quite elegant. The data that doesn't fit is split into multiple overflow pages, using the following structure: the first four bytes of every overflow page contain the page number of the next overflow page (or <code>0</code> if there are no more overflow pages), and the rest of the page contains the overflow data. Therefore, overflow pages form a linked list.</p>
<p>As for the second question, the B-tree cell will contain the first <code>N</code> bytes of the complete payload (where <code>N</code> is calculated using a formula we'll explore in the following sections), followed by the page number of the first overflow page.</p>
<p>With this quick overview, let's dive into the implementation!</p>
<h3 id="heading-rust-edition">Rust edition</h3>
<p>Some code in this post uses the new <a target="_blank" href="https://blog.rust-lang.org/2025/06/26/Rust-1.88.0/#let-chains">Let chains</a> feature from Rust 1.88.0. If you want to follow along, make sure to update your <code>Cargo.toml</code> to use the 2024 edition:</p>
<pre><code class="lang-toml"><span class="hljs-comment"># Cargo.toml</span>
<span class="hljs-section">[package]</span>
<span class="hljs-comment"># [...]</span>
<span class="hljs-attr">edition</span> = <span class="hljs-string">"2024"</span>
</code></pre>
<h3 id="heading-erratum">Erratum</h3>
<p>If you're one of the early readers of the first post, and you coded along, a small mistake might have slipped into your code: the <code>read_varint_at</code> function - which decodes a variable-length integer - treats varints as little-endian, while SQLite uses big-endian encoding. Here is the corrected version:</p>
<pre><code class="lang-rust"><span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_varint_at</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>], <span class="hljs-keyword">mut</span> offset: <span class="hljs-built_in">usize</span>) -&gt; (<span class="hljs-built_in">u8</span>, <span class="hljs-built_in">i64</span>) {
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> size = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> result = <span class="hljs-number">0</span>;

    <span class="hljs-keyword">while</span> size &lt; <span class="hljs-number">9</span> {
        <span class="hljs-keyword">let</span> current_byte = buffer[offset] <span class="hljs-keyword">as</span> <span class="hljs-built_in">i64</span>;
        <span class="hljs-keyword">if</span> size == <span class="hljs-number">8</span> {
            result = (result &lt;&lt; <span class="hljs-number">8</span>) | current_byte;
        } <span class="hljs-keyword">else</span> {
            result = (result &lt;&lt; <span class="hljs-number">7</span>) | (current_byte &amp; <span class="hljs-number">0b0111_1111</span>);
        }

        offset += <span class="hljs-number">1</span>;
        size += <span class="hljs-number">1</span>;

        <span class="hljs-keyword">if</span> current_byte &amp; <span class="hljs-number">0b1000_0000</span> == <span class="hljs-number">0</span> {
            <span class="hljs-keyword">break</span>;
        }
    }

    (size, result)
}
</code></pre>
<p>You'll notice that we now shift the <code>result</code> accumulator at each iteration instead of shifting the <code>current_byte</code>. The ninth byte special case is also handled differently, as it is no longer the most significant byte.</p>
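<p>To see the big-endian accumulation at work, here is a minimal standalone sketch. It repeats the same logic under a hypothetical name, <code>decode_varint</code>, so that it can be compiled on its own and tried against a few hand-encoded byte sequences:</p>
<pre><code class="lang-rust">/// Decodes a SQLite varint starting at `offset`.
/// Returns (number of bytes consumed, decoded value).
pub fn decode_varint(buffer: &amp;[u8], mut offset: usize) -&gt; (u8, i64) {
    let mut size = 0u8;
    let mut result = 0i64;

    while size &lt; 9 {
        let current_byte = buffer[offset] as i64;
        if size == 8 {
            // The ninth byte contributes all eight of its bits.
            result = (result &lt;&lt; 8) | current_byte;
        } else {
            // The first eight bytes contribute their lower seven bits.
            result = (result &lt;&lt; 7) | (current_byte &amp; 0b0111_1111);
        }

        offset += 1;
        size += 1;

        // A clear high bit marks the last byte of the varint.
        if current_byte &amp; 0b1000_0000 == 0 {
            break;
        }
    }

    (size, result)
}
</code></pre>
<p>For instance, the two bytes <code>0x81 0x00</code> decode to <code>128</code>: the first byte contributes the value <code>1</code>, which is shifted left by seven bits when the second byte is read. Likewise, <code>0x82 0x2c</code> decodes to <code>300</code>.</p>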
<h2 id="heading-building-the-test-database">Building the test database</h2>
<p>To test our implementation, we'll need a test database that contains rows with large enough fields to trigger the overflow mechanism. To create such a database, you can use the following commands:</p>
<pre><code class="lang-bash">sqlite3 test.db
sqlite&gt; create table t1(id <span class="hljs-built_in">integer</span>, value string);
sqlite&gt; insert into t1(id, value) values (42, <span class="hljs-built_in">printf</span>(<span class="hljs-string">'%.*c'</span>, 10000, <span class="hljs-string">'a'</span>));
</code></pre>
<blockquote>
<p><code>printf('%.*c', 10000, 'a')</code> returns a string composed of 10000 <code>a</code>s.</p>
</blockquote>
<h2 id="heading-computing-the-local-payload-size">Computing the local payload size</h2>
<p>As we mentioned, when reading a record that's too big to fit in a single page, our first task is to compute the local payload size. This is the number of bytes that will be stored directly in the B-tree leaf cell. To compute this size, we first need to define a few variables:</p>
<ul>
<li><p><code>P</code>, the payload size</p>
</li>
<li><p><code>U</code>, the usable page size (which is the page size minus the number of reserved bytes)</p>
</li>
<li><p><code>X = U - 35</code>, which is the overflow threshold: if the payload size is less than or equal to <code>X</code> it will be stored entirely in a B-tree leaf cell, without overflow</p>
</li>
<li><p><code>M = ((U-12)*32/255)-23</code>, the minimum local payload size</p>
</li>
<li><p><code>K = M+((P-M)%(U-4))</code>, the candidate local payload size, which is used only when it does not exceed <code>X</code></p>
</li>
</ul>
<p>The local payload size is computed according to the following rules:</p>
<ol>
<li><p>If <code>P &lt;= X</code>, the local payload size is <code>P</code> and there is no overflow.</p>
</li>
<li><p>Otherwise, there is an overflow, and:</p>
<ul>
<li><p>If <code>K &lt;= X</code>, the local payload size is <code>K</code>.</p>
</li>
<li><p>Otherwise, the local payload size is <code>M</code>.</p>
</li>
</ul>
</li>
</ol>
<p>We'll start by implementing the computation of <code>U</code>, the usable page size. It differs from the page size since SQLite can reserve a few bytes at the end of each page for use by extensions. The exact number of reserved bytes is defined by a one-byte integer in the database header, at offset 20. First, we'll extend our <code>DbHeader</code> struct to include the number of reserved bytes:</p>
<pre><code class="lang-diff">// src/page.rs

#[derive(Debug, Copy, Clone)]
pub struct DbHeader {
    pub page_size: u32,
<span class="hljs-addition">+   pub page_reserved_size: u8,</span>
}

<span class="hljs-addition">+impl DbHeader {</span>
<span class="hljs-addition">+    pub fn usable_page_size(&amp;self) -&gt; usize {</span>
<span class="hljs-addition">+        self.page_size as usize - (self.page_reserved_size as usize)</span>
<span class="hljs-addition">+    }</span>
<span class="hljs-addition">+}</span>
</code></pre>
<p>We also added a utility method that returns the value of <code>U</code>, computed from the page size and reserved bytes count.</p>
<p>Our db header parsing function must also be updated to populate the new field:</p>
<pre><code class="lang-diff">// src/pager.rs
<span class="hljs-addition">+const HEADER_PAGE_RESERVED_SIZE_OFFSET: usize = 20;</span>

// [...]

pub fn parse_header(buffer: &amp;[u8]) -&gt; anyhow::Result&lt;page::DbHeader&gt; {
    if !buffer.starts_with(HEADER_PREFIX) {
        let prefix = String::from_utf8_lossy(&amp;buffer[..HEADER_PREFIX.len()]);
        anyhow::bail!("invalid header prefix: {prefix}");
    }

    let page_size_raw = read_be_word_at(buffer, HEADER_PAGE_SIZE_OFFSET);
    let page_size = match page_size_raw {
        1 =&gt; PAGE_MAX_SIZE,
        n if n.is_power_of_two() =&gt; n as u32,
        _ =&gt; anyhow::bail!("page size is not a power of 2: {}", page_size_raw),
    };

<span class="hljs-addition">+   let page_reserved_size = buffer[HEADER_PAGE_RESERVED_SIZE_OFFSET];</span>

<span class="hljs-deletion">-   Ok(page::DbHeader { page_size })</span>
<span class="hljs-addition">+   Ok(page::DbHeader {</span>
<span class="hljs-addition">+       page_size,</span>
<span class="hljs-addition">+       page_reserved_size,</span>
<span class="hljs-addition">+   })</span>
}
</code></pre>
<p>With this in place, we can implement the <code>local_payload_size</code> method according to the formulas defined above:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/page.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-keyword">impl</span> PageHeader {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">local_payload_size</span></span>(
        &amp;<span class="hljs-keyword">self</span>,
        db_header: &amp;DbHeader,
        payload_size: <span class="hljs-built_in">usize</span>,
    ) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">usize</span>&gt; {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.page_type {
            PageType::TableInterior =&gt; bail!(<span class="hljs-string">"no payload size for interior pages"</span>),
            PageType::TableLeaf =&gt; {
                <span class="hljs-keyword">let</span> usable = db_header.usable_page_size();
                <span class="hljs-keyword">let</span> max_size = usable - <span class="hljs-number">35</span>;
                <span class="hljs-keyword">if</span> payload_size &lt;= max_size {
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(payload_size);
                }
                <span class="hljs-keyword">let</span> min_size = ((usable - <span class="hljs-number">12</span>) * <span class="hljs-number">32</span> / <span class="hljs-number">255</span>) - <span class="hljs-number">23</span>;
                <span class="hljs-keyword">let</span> k = min_size + ((payload_size - min_size) % (usable - <span class="hljs-number">4</span>));
                <span class="hljs-keyword">let</span> size = <span class="hljs-keyword">if</span> k &lt;= max_size { k } <span class="hljs-keyword">else</span> { min_size };
                <span class="hljs-literal">Ok</span>(size)
            }
        }
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">local_and_overflow_size</span></span>(
        &amp;<span class="hljs-keyword">self</span>,
        db_header: &amp;DbHeader,
        payload_size: <span class="hljs-built_in">usize</span>,
    ) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;(<span class="hljs-built_in">usize</span>, <span class="hljs-built_in">Option</span>&lt;<span class="hljs-built_in">usize</span>&gt;)&gt; {
        <span class="hljs-keyword">let</span> local = <span class="hljs-keyword">self</span>.local_payload_size(db_header, payload_size)?;
        <span class="hljs-keyword">if</span> local == payload_size {
            <span class="hljs-literal">Ok</span>((local, <span class="hljs-literal">None</span>))
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-literal">Ok</span>((local, <span class="hljs-literal">Some</span>(payload_size.saturating_sub(local))))
        }
    }
}
</code></pre>
<p>For table-leaf pages, the implementation of <code>local_payload_size</code> is a straightforward translation of the formulas. For interior pages, we simply return an error, as they do not contain an actual payload, so the concept of a local payload size does not apply. In future posts, we'll discover index pages, which will require us to implement a slightly altered version of the formulas.</p>
<p>We also defined a convenient method that computes the local payload size and the overflow size, if any.</p>
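<p>To get a feel for the numbers, the formulas can be checked in isolation. The sketch below uses a hypothetical free function, <code>split_payload</code>, that works on plain <code>usize</code> values instead of our header types, and assumes a table-leaf page:</p>
<pre><code class="lang-rust">/// Splits a table-leaf payload into (local size, overflow size).
/// `usable` is U, the usable page size; `payload` is P, the payload size.
pub fn split_payload(usable: usize, payload: usize) -&gt; (usize, usize) {
    // X: payloads up to this size are stored entirely in the cell.
    let x = usable - 35;
    if payload &lt;= x {
        return (payload, 0);
    }
    // M: the minimum local payload size.
    let m = ((usable - 12) * 32 / 255) - 23;
    // K: the candidate local size, kept only if it does not exceed X.
    let k = m + ((payload - m) % (usable - 4));
    let local = if k &lt;= x { k } else { m };
    (local, payload - local)
}
</code></pre>
<p>With <code>U = 4096</code> (a 4096-byte page without reserved bytes), we get <code>X = 4061</code> and <code>M = 489</code>. A payload of 4061 bytes stays entirely local, while a payload of 4062 bytes keeps only 489 bytes in the cell and sends the other 3573 to an overflow page. For an illustrative payload of 10006 bytes, <code>K = 1822</code>, which leaves 8184 bytes of overflow: exactly two overflow pages holding <code>U - 4 = 4092</code> bytes each.</p>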
<p>How are we going to use this information? When we parse a B-tree leaf cell, we'll compute the local and overflow sizes. If the overflow size is not <code>None</code>, we know that a pointer to the first overflow page is stored right after the local payload, so we'll read this pointer and record it in the resulting <code>Cell</code> struct. We don't want to read the content of the overflow pages just yet, as we have no way to know if the query will actually need it.</p>
<p>Let's modify our <code>TableLeafCell</code> struct and the corresponding parsing function:</p>
<pre><code class="lang-diff">// src/page.rs

// [...]

#[derive(Debug, Clone)]
pub struct TableLeafCell {
    pub size: i64,
    pub row_id: i64,
    pub payload: Vec&lt;u8&gt;,
<span class="hljs-addition">+   pub first_overflow: Option&lt;usize&gt;,</span>
}
</code></pre>
<pre><code class="lang-diff">// src/pager.rs

// [...] 

<span class="hljs-deletion">-fn parse_table_leaf_cell(mut buffer: &amp;[u8]) -&gt; anyhow::Result&lt;page::Cell&gt; {</span>
<span class="hljs-addition">+fn parse_table_leaf_cell(</span>
<span class="hljs-addition">+   db_header: &amp;DbHeader,</span>
<span class="hljs-addition">+   header: &amp;PageHeader,</span>
<span class="hljs-addition">+   mut buffer: &amp;[u8],</span>
<span class="hljs-addition">+) -&gt; anyhow::Result&lt;page::Cell&gt; {</span>
    let (n, size) = read_varint_at(buffer, 0);
    buffer = &amp;buffer[n as usize..];

    let (n, row_id) = read_varint_at(buffer, 0);
    buffer = &amp;buffer[n as usize..];

<span class="hljs-addition">+   let (local_size, overflow_size) = header.local_and_overflow_size(db_header, size as usize)?;</span>
<span class="hljs-addition">+   let first_overflow = overflow_size.map(|_| read_be_double_at(buffer, local_size) as usize);</span>

<span class="hljs-deletion">-   let payload = buffer[..size as usize].to_vec();</span>
<span class="hljs-addition">+   let payload = buffer[..local_size].to_vec();</span>

    Ok(page::TableLeafCell {
        size,
        row_id,
        payload,
        first_overflow,
    }
    .into())
}
</code></pre>
<p>The modifications to <code>parse_table_leaf_cell</code> are straightforward: we leverage our utility method to compute the local and overflow sizes, as well as the first overflow page pointer, if any. Note that we modified the function's signature to accept a reference to the database header, so you'll need to propagate this change to the caller, and adapt the signature of <code>parse_table_interior_cell</code> as well.</p>
<p>We're not far from having a complete implementation of the overflow mechanism. What's left is to implement a way to read and parse the overflow pages, and exercise it when reading fields from a cursor.</p>
<h2 id="heading-reading-overflow-pages">Reading overflow pages</h2>
<p>Before we implement the parsing, we'll create a type to represent an overflow page:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/page.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">OverflowPage</span></span> {
    <span class="hljs-keyword">pub</span> next: <span class="hljs-built_in">Option</span>&lt;<span class="hljs-built_in">usize</span>&gt;,
    <span class="hljs-keyword">pub</span> payload: <span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">u8</span>&gt;,
}
</code></pre>
<p>Parsing an overflow page is quite simple: the first four bytes contain the next overflow page pointer (or <code>0</code> if there are no more overflow pages), and the rest of the page contains the overflow payload:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/pager.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_overflow_page</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>]) -&gt; page::OverflowPage {
    <span class="hljs-keyword">let</span> next = read_be_double_at(buffer, <span class="hljs-number">0</span>);
    page::OverflowPage {
        payload: buffer[<span class="hljs-number">4</span>..].to_vec(),
        next: <span class="hljs-keyword">if</span> next != <span class="hljs-number">0</span> { <span class="hljs-literal">Some</span>(next <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>) } <span class="hljs-keyword">else</span> { <span class="hljs-literal">None</span> },
    }
}
</code></pre>
<p>Since our <code>Pager</code>'s cache expects to only store <code>Page</code> structs, we need to adapt it so it can read and cache either a <code>Page</code> or an <code>OverflowPage</code>. First we'll define a new enum to represent either a <code>Page</code> or an <code>OverflowPage</code>:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/pager.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">CachedPage</span></span> {
    Page(Arc&lt;page::Page&gt;),
    Overflow(Arc&lt;page::OverflowPage&gt;),
}

<span class="hljs-keyword">impl</span> <span class="hljs-built_in">From</span>&lt;Arc&lt;page::Page&gt;&gt; <span class="hljs-keyword">for</span> CachedPage {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">from</span></span>(value: Arc&lt;page::Page&gt;) -&gt; <span class="hljs-keyword">Self</span> {
        CachedPage::Page(value)
    }
}

<span class="hljs-keyword">impl</span> TryFrom&lt;CachedPage&gt; <span class="hljs-keyword">for</span> Arc&lt;page::Page&gt; {
    <span class="hljs-class"><span class="hljs-keyword">type</span> <span class="hljs-title">Error</span></span> = anyhow::Error;

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">try_from</span></span>(value: CachedPage) -&gt; <span class="hljs-built_in">Result</span>&lt;<span class="hljs-keyword">Self</span>, Self::Error&gt; {
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> CachedPage::Page(p) = value {
            <span class="hljs-literal">Ok</span>(p)
        } <span class="hljs-keyword">else</span> {
            bail!(<span class="hljs-string">"expected a regular page"</span>)
        }
    }
}

<span class="hljs-keyword">impl</span> <span class="hljs-built_in">From</span>&lt;Arc&lt;page::OverflowPage&gt;&gt; <span class="hljs-keyword">for</span> CachedPage {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">from</span></span>(value: Arc&lt;page::OverflowPage&gt;) -&gt; <span class="hljs-keyword">Self</span> {
        CachedPage::Overflow(value)
    }
}

<span class="hljs-keyword">impl</span> TryFrom&lt;CachedPage&gt; <span class="hljs-keyword">for</span> Arc&lt;page::OverflowPage&gt; {
    <span class="hljs-class"><span class="hljs-keyword">type</span> <span class="hljs-title">Error</span></span> = anyhow::Error;

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">try_from</span></span>(value: CachedPage) -&gt; <span class="hljs-built_in">Result</span>&lt;<span class="hljs-keyword">Self</span>, Self::Error&gt; {
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> CachedPage::Overflow(o) = value {
            <span class="hljs-literal">Ok</span>(o)
        } <span class="hljs-keyword">else</span> {
            bail!(<span class="hljs-string">"expected an overflow page"</span>)
        }
    }
}
</code></pre>
<p>It's a simple enum with two variants and a few conversion traits to allow our <code>Pager</code> to handle both types seamlessly.</p>
<p>Then, we need to adapt our <code>Pager</code> to support both types. We'll extract the bulk of the logic into a new generic <code>load</code> method that takes a page number and a parsing function, and parses uncached pages with the provided function. Here is the updated <code>Pager</code> implementation:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/pager.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Pager</span></span>&lt;I: Read + Seek = std::fs::File&gt; {
    input: Arc&lt;Mutex&lt;I&gt;&gt;,
    pages: Arc&lt;RwLock&lt;HashMap&lt;<span class="hljs-built_in">usize</span>, CachedPage&gt;&gt;&gt;,
    header: DbHeader,
}

<span class="hljs-keyword">impl</span>&lt;I: Read + Seek&gt; Pager&lt;I&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(header: DbHeader, input: I) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> {
            input: Arc::new(Mutex::new(input)),
            pages: Arc::default(),
            header,
        }
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_overflow</span></span>(&amp;<span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Arc&lt;page::OverflowPage&gt;&gt; {
        <span class="hljs-keyword">self</span>.load(n, |buffer| <span class="hljs-literal">Ok</span>(parse_overflow_page(buffer)))
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_page</span></span>(&amp;<span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Arc&lt;page::Page&gt;&gt; {
        <span class="hljs-keyword">self</span>.load(n, |buffer| parse_page(&amp;<span class="hljs-keyword">self</span>.header, buffer, n))
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">load</span></span>&lt;T&gt;(&amp;<span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>, f: <span class="hljs-keyword">impl</span> <span class="hljs-built_in">Fn</span>(&amp;[<span class="hljs-built_in">u8</span>]) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;T&gt;) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Arc&lt;T&gt;&gt;
    <span class="hljs-keyword">where</span>
        Arc&lt;T&gt;: <span class="hljs-built_in">Into</span>&lt;CachedPage&gt;,
        CachedPage: TryInto&lt;Arc&lt;T&gt;, Error=anyhow::Error&gt;,
    {
        {
            <span class="hljs-keyword">let</span> read_pages = <span class="hljs-keyword">self</span>
                .pages
                .read()
                .map_err(|_| anyhow!(<span class="hljs-string">"poisoned page cache lock"</span>))?;

            <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(page) = read_pages.get(&amp;n).cloned() {
                <span class="hljs-keyword">return</span> page.try_into();
            }
        }

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> write_pages = <span class="hljs-keyword">self</span>
            .pages
            .write()
            .map_err(|_| anyhow!(<span class="hljs-string">"failed to acquire pager write lock"</span>))?;

        <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(page) = write_pages.get(&amp;n).cloned() {
            <span class="hljs-keyword">return</span> page.try_into();
        }

        <span class="hljs-keyword">let</span> buffer = <span class="hljs-keyword">self</span>.load_raw(n)?;
        <span class="hljs-keyword">let</span> parsed = f(&amp;buffer[<span class="hljs-number">0</span>..<span class="hljs-keyword">self</span>.header.usable_page_size()])?;
        <span class="hljs-keyword">let</span> ptr = Arc::new(parsed);

        write_pages.insert(n, ptr.clone().into());

        <span class="hljs-literal">Ok</span>(ptr)
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">load_raw</span></span>(&amp;<span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">u8</span>&gt;&gt; {
        <span class="hljs-keyword">let</span> offset = n.saturating_sub(<span class="hljs-number">1</span>) * <span class="hljs-keyword">self</span>.header.page_size <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>;

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> input_guard = <span class="hljs-keyword">self</span>
            .input
            .lock()
            .map_err(|_| anyhow!(<span class="hljs-string">"poisoned pager mutex"</span>))?;

        input_guard
            .seek(SeekFrom::Start(offset <span class="hljs-keyword">as</span> <span class="hljs-built_in">u64</span>))
            .context(<span class="hljs-string">"seek to page start"</span>)?;

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> buffer = <span class="hljs-built_in">vec!</span>[<span class="hljs-number">0</span>; <span class="hljs-keyword">self</span>.header.page_size <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>];
        input_guard.read_exact(&amp;<span class="hljs-keyword">mut</span> buffer).context(<span class="hljs-string">"read page"</span>)?;

        <span class="hljs-literal">Ok</span>(buffer)
    }
}
</code></pre>
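<p>The locking strategy used by <code>load</code> can be isolated in a smaller sketch. The <code>PageCache</code> type below is hypothetical and stores plain <code>String</code>s instead of parsed pages. It only illustrates the read-then-write double check and, unlike our <code>Pager</code>, it panics on a poisoned lock instead of returning an error:</p>
<pre><code class="lang-rust">use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical cache keyed by page number; the values stand in for parsed pages.
#[derive(Default)]
pub struct PageCache {
    entries: RwLock&lt;HashMap&lt;usize, Arc&lt;String&gt;&gt;&gt;,
}

impl PageCache {
    pub fn new() -&gt; Self {
        Self::default()
    }

    /// Returns the cached value for `n`, calling `load` only on a cache miss.
    pub fn get_or_load(&amp;self, n: usize, load: impl Fn(usize) -&gt; String) -&gt; Arc&lt;String&gt; {
        {
            // Fast path: a shared read lock is enough when the entry exists.
            let read = self.entries.read().expect("poisoned cache lock");
            if let Some(value) = read.get(&amp;n) {
                return Arc::clone(value);
            }
        } // The read lock is released here, before we ask for the write lock.

        let mut write = self.entries.write().expect("poisoned cache lock");

        // Another thread may have filled the entry between the two locks.
        if let Some(value) = write.get(&amp;n) {
            return Arc::clone(value);
        }

        let value = Arc::new(load(n));
        write.insert(n, Arc::clone(&amp;value));
        value
    }
}
</code></pre>
<p>Calling <code>get_or_load</code> twice with the same page number runs the loading closure once; the second call returns a clone of the cached <code>Arc</code>.</p>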
<h2 id="heading-putting-it-all-together">Putting it all together</h2>
<p>The main building blocks of our implementation are in place: we know how to detect when a row is too large to fit in a single page, and we have a way to load overflow pages through our <code>Pager</code>. The last step is to lazily read the overflow data when accessing a field that requires it.</p>
<p>To do this, we'll implement an <code>OverflowScanner</code> with a <code>read</code> method that takes as input the index of the first overflow page and the minimum amount of overflow data to read. The scanner will follow the linked list until the required amount of data is read or there are no more overflow pages.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/cursor.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">OverflowScanner</span></span> {
    pager: Pager,
}

<span class="hljs-keyword">impl</span> OverflowScanner {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(pager: Pager) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> { pager }
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read</span></span>(&amp;<span class="hljs-keyword">self</span>, first_page: <span class="hljs-built_in">usize</span>, size: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;(<span class="hljs-built_in">Option</span>&lt;<span class="hljs-built_in">usize</span>&gt;, <span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">u8</span>&gt;)&gt; {
        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> next_page = <span class="hljs-literal">Some</span>(first_page);
        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> buffer = <span class="hljs-built_in">Vec</span>::with_capacity(size);

        <span class="hljs-keyword">while</span> buffer.len() &lt; size
            &amp;&amp; <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(next) = next_page
        {
            <span class="hljs-keyword">let</span> overflow = <span class="hljs-keyword">self</span>.pager.read_overflow(next)?;
            next_page = overflow.next;
            buffer.extend_from_slice(&amp;overflow.payload);
        }

        <span class="hljs-literal">Ok</span>((next_page, buffer))
    }
}
</code></pre>
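<p>To experiment with the chain-walking logic without a database file, here is a minimal sketch that replaces the <code>Pager</code> with an in-memory map of raw pages. The <code>RawPages</code>, <code>read_chain</code> and <code>make_page</code> names are hypothetical, and a missing or truncated page simply yields <code>None</code>:</p>
<pre><code class="lang-rust">use std::collections::HashMap;

/// Hypothetical stand-in for the pager: page number -&gt; raw page bytes.
pub type RawPages = HashMap&lt;usize, Vec&lt;u8&gt;&gt;;

/// Follows the overflow chain from `first_page` until at least `size` bytes
/// are collected or the chain ends. Returns the next unread page (if any)
/// and the collected bytes, or `None` if a page is missing or too short.
pub fn read_chain(
    pages: &amp;RawPages,
    first_page: usize,
    size: usize,
) -&gt; Option&lt;(Option&lt;usize&gt;, Vec&lt;u8&gt;)&gt; {
    let mut next_page = Some(first_page);
    let mut buffer = Vec::with_capacity(size);

    while buffer.len() &lt; size {
        let n = match next_page {
            Some(n) =&gt; n,
            None =&gt; break,
        };
        let raw = pages.get(&amp;n)?;
        // Bytes 0..4: big-endian number of the next overflow page (0 = end of chain).
        let head = raw.get(..4)?;
        let next = u32::from_be_bytes([head[0], head[1], head[2], head[3]]) as usize;
        next_page = if next != 0 { Some(next) } else { None };
        // The rest of the page is overflow data.
        buffer.extend_from_slice(&amp;raw[4..]);
    }

    Some((next_page, buffer))
}

/// Builds a raw overflow page from a next-page pointer and its data.
pub fn make_page(next: u32, data: &amp;[u8]) -&gt; Vec&lt;u8&gt; {
    let mut page = next.to_be_bytes().to_vec();
    page.extend_from_slice(data);
    page
}
</code></pre>
<p>With page 3 pointing to page 4, as in the diagram from the overview, asking for two bytes only loads page 3 and reports page 4 as the next page to read. Since whole pages are appended, the returned buffer can be larger than the requested size, which is why <code>size</code> is a minimum rather than an exact amount.</p>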
<p>Our <code>Cursor</code> will use this new scanner in the following way: when reading a field, we'll compute the end offset of the field (based on the field's offset and size). If that end offset exceeds the size of the currently loaded payload, we'll read at least (<code>end_offset - payload.len()</code>) bytes through the overflow scanner, and append the result to the payload. If we wanted to read fields so large that they can't fit in RAM, we would need a more sophisticated streaming mechanism, but for our purposes, reading the overflow data into memory is enough.</p>
<p>We'll start by implementing a utility method to compute the end offset of a field:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/cursor.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-keyword">impl</span> RecordField {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">end_offset</span></span>(&amp;<span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">usize</span> {
        <span class="hljs-keyword">let</span> size = <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.field_type {
            RecordFieldType::Null =&gt; <span class="hljs-number">0</span>,
            RecordFieldType::I8 =&gt; <span class="hljs-number">1</span>,
            RecordFieldType::I16 =&gt; <span class="hljs-number">2</span>,
            RecordFieldType::I24 =&gt; <span class="hljs-number">3</span>,
            RecordFieldType::I32 =&gt; <span class="hljs-number">4</span>,
            RecordFieldType::I48 =&gt; <span class="hljs-number">6</span>,
            RecordFieldType::I64 =&gt; <span class="hljs-number">8</span>,
            RecordFieldType::Float =&gt; <span class="hljs-number">8</span>,
            RecordFieldType::Zero =&gt; <span class="hljs-number">0</span>,
            RecordFieldType::One =&gt; <span class="hljs-number">0</span>,
            RecordFieldType::<span class="hljs-built_in">String</span>(size) | RecordFieldType::Blob(size) =&gt; size,
        };

        <span class="hljs-keyword">self</span>.offset + size
    }
}
</code></pre>
<p>Next, we'll implement the overflow reading logic in the <code>Cursor</code>:</p>
<pre><code class="lang-diff">// src/cursor.rs

// [...]

#[derive(Debug)]
pub struct Cursor {
    header: RecordHeader,
    payload: Vec&lt;u8&gt;,
<span class="hljs-addition">+   pager: Pager,</span>
<span class="hljs-addition">+   next_overflow_page: Option&lt;usize&gt;,</span>
}

impl Cursor {
<span class="hljs-deletion">-   pub fn owned_field(&amp;self, n: usize) -&gt; Option&lt;OwnedValue&gt; {</span>
<span class="hljs-deletion">-       self.field(n).map(Into::into)</span>
<span class="hljs-deletion">-   }</span>
<span class="hljs-addition">+   pub fn owned_field(&amp;mut self, n: usize) -&gt; anyhow::Result&lt;Option&lt;OwnedValue&gt;&gt; {</span>
<span class="hljs-addition">+       Ok(self.field(n)?.map(Into::into))</span>
<span class="hljs-addition">+   }</span>

<span class="hljs-deletion">-   pub fn field(&amp;self, n: usize) -&gt; Option&lt;Value&gt; {</span>
<span class="hljs-addition">+   pub fn field(&amp;mut self, n: usize) -&gt; anyhow::Result&lt;Option&lt;Value&gt;&gt; {</span>
<span class="hljs-deletion">-       let record_field = self.header.fields.get(n)?;</span>
<span class="hljs-addition">+       let Some(record_field) = self.header.fields.get(n) else {</span>
<span class="hljs-addition">+           return Ok(None);</span>
<span class="hljs-addition">+       };</span>

<span class="hljs-addition">+       let end_offset = record_field.end_offset();</span>

<span class="hljs-addition">+       if end_offset &gt; self.payload.len()</span>
<span class="hljs-addition">+           &amp;&amp; let Some(overflow_page) = self.next_overflow_page</span>
<span class="hljs-addition">+       {</span>
<span class="hljs-addition">+           let overflow_size = end_offset.saturating_sub(self.payload.len());</span>
<span class="hljs-addition">+           let (next_overflow, overflow_data) = OverflowScanner::new(self.pager.clone())</span>
<span class="hljs-addition">+               .read(overflow_page, overflow_size)</span>
<span class="hljs-addition">+               .context("read overflow page")?;</span>
<span class="hljs-addition">+           self.next_overflow_page = next_overflow;</span>
<span class="hljs-addition">+           self.payload.extend_from_slice(&amp;overflow_data);</span>
<span class="hljs-addition">+       }</span>

<span class="hljs-deletion">-       match record_field.field_type {</span>
<span class="hljs-addition">+       let value = match record_field.field_type {</span>
            RecordFieldType::Null =&gt; Some(Value::Null),
            RecordFieldType::I8 =&gt; Some(Value::Int(read_i8_at(&amp;self.payload, record_field.offset))),
            RecordFieldType::I16 =&gt; {
                Some(Value::Int(read_i16_at(&amp;self.payload, record_field.offset)))
            }
            RecordFieldType::I24 =&gt; {
                Some(Value::Int(read_i24_at(&amp;self.payload, record_field.offset)))
            }
            RecordFieldType::I32 =&gt; {
                Some(Value::Int(read_i32_at(&amp;self.payload, record_field.offset)))
            }
            RecordFieldType::I48 =&gt; {
                Some(Value::Int(read_i48_at(&amp;self.payload, record_field.offset)))
            }
            RecordFieldType::I64 =&gt; {
                Some(Value::Int(read_i64_at(&amp;self.payload, record_field.offset)))
            }
            RecordFieldType::Float =&gt; Some(Value::Float(read_f64_at(
                &amp;self.payload,
                record_field.offset,
            ))),
            RecordFieldType::String(length) =&gt; {
                let value = std::str::from_utf8(
                    &amp;self.payload[record_field.offset..record_field.offset + length],
                )
                .expect("invalid utf8");
                Some(Value::String(Cow::Borrowed(value)))
            }
            RecordFieldType::Blob(length) =&gt; {
                let value = &amp;self.payload[record_field.offset..record_field.offset + length];
                Some(Value::Blob(Cow::Borrowed(value)))
            }
            RecordFieldType::One =&gt; Some(Value::Int(1)),
            RecordFieldType::Zero =&gt; Some(Value::Int(0)),
<span class="hljs-deletion">-       }</span>
<span class="hljs-addition">+       };</span>

<span class="hljs-addition">+       Ok(value)</span>
    }
}
</code></pre>
<blockquote>
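<p>Note that we added fields to the <code>Cursor</code> and changed the signatures of its methods: <code>field</code> and <code>owned_field</code> now take <code>&amp;mut self</code> and return an <code>anyhow::Result</code>. These changes will need to be propagated to the consumers of the <code>Cursor</code> struct.</p>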
</blockquote>
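<p>To see the "load more payload only when a field needs it" idea in isolation, here is a minimal, self-contained sketch. It is not the code of our database: <code>OverflowPage</code>, <code>MockPager</code> and <code>LazyPayload</code> are hypothetical stand-ins backed by an in-memory map, and, like our scanner, the sketch appends whole pages, so the buffer may end up slightly larger than requested.</p>
<pre><code class="lang-rust">use std::collections::HashMap;

/// Hypothetical stand-in for an overflow page: a chunk of payload plus an
/// optional link to the next page in the chain.
struct OverflowPage {
    next: Option&lt;usize&gt;,
    payload: Vec&lt;u8&gt;,
}

/// Hypothetical in-memory "pager" mapping page numbers to overflow pages.
struct MockPager {
    pages: HashMap&lt;usize, OverflowPage&gt;,
}

/// Holds the locally stored part of a record and fetches the rest on demand.
struct LazyPayload {
    payload: Vec&lt;u8&gt;,
    next_overflow_page: Option&lt;usize&gt;,
}

impl LazyPayload {
    /// Makes sure the bytes in `..end_offset` are loaded, following the
    /// overflow chain only as far as needed. Returns `false` if the chain
    /// ends (or a page is missing) before enough bytes are available.
    fn ensure_loaded(&amp;mut self, pager: &amp;MockPager, end_offset: usize) -&gt; bool {
        while self.payload.len() &lt; end_offset {
            let Some(page_number) = self.next_overflow_page else {
                return false;
            };
            let Some(page) = pager.pages.get(&amp;page_number) else {
                return false;
            };
            self.payload.extend_from_slice(&amp;page.payload);
            self.next_overflow_page = page.next;
        }
        true
    }
}
</code></pre>
<p>Calling <code>ensure_loaded</code> with an end offset that already fits in the payload does nothing, which is why reading small fields at the start of a large row never touches the overflow chain.</p>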
<h2 id="heading-conclusion">Conclusion</h2>
<p>This concludes our implementation of the overflow mechanism. Our little database can now read large <code>TEXT</code> and <code>BLOB</code> fields that are split across multiple pages. In the next post, we'll get back to the query engine and implement simple <code>WHERE</code> clauses, allowing us to filter rows based on their content.</p>
]]></content:encoded></item><item><title><![CDATA[Build a Compiler from Scratch, Part 1.2: Intermediate Representation and Code Generation]]></title><description><![CDATA[The frontend part of our compiler is complete, and we can parse the source code of a pylite
program into an AST. This leaves us with a final task: translating the program described
by the AST into assembly code. Technically, we could generate assembl...]]></description><link>https://blog.sylver.dev/build-a-compiler-from-scratch-part-12-intermediate-representation-and-code-generation</link><guid isPermaLink="true">https://blog.sylver.dev/build-a-compiler-from-scratch-part-12-intermediate-representation-and-code-generation</guid><category><![CDATA[Rust]]></category><category><![CDATA[Tutorial]]></category><category><![CDATA[compiler]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Tue, 24 Jun 2025 23:43:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750974007637/11fc3a7f-d99f-4044-8828-6c294bf63d6d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The frontend part of our compiler is complete, and we can parse the source code of a pylite
program into an AST. This leaves us with a final task: translating the program described
by the AST into assembly code. Technically, we could generate assembly code directly from the AST,
but this poses a few problems, in particular:</p>
<ul>
<li>the structure of the AST is dictated by the syntax of the source language, and does not lend itself
especially well to the tasks we need to perform before generating the assembly code, such as control-flow
analysis, optimization, and generally breaking-up high-level operations into sequences of assembly instructions</li>
<li>if we want to extend our compiler to support more CPU architectures, we would need to duplicate
the entire code generation logic for each architecture, thus hurting the maintainability of the codebase</li>
</ul>
<p>For these reasons, we'll split the process of generating assembly code into multiple steps. First, we'll
translate the AST into an intermediate representation (IR) that is more suitable for optimization and code
generation. Then, we'll perform optimization passes on the IR, and finally we'll generate assembly code from the
optimized IR. The process of translating the AST into the IR is often referred to as "lowering" the AST into the
IR.</p>
<p>What will our IR look like? On one end of the spectrum, we could design a very high-level IR that closely resembles
the source language, and on the other we could design a low-level linear IR that mimics the assembly code.
We'll go for a hybrid graph-based approach, where branchless sections of the code are represented as
linear sequences of low-level instructions (which we'll call "basic blocks") and conditional jumps are
represented as edges between the basic blocks. This structure is called a control-flow graph (CFG).</p>
<p>As an example, let's take the following <code>Pylite</code> program, which computes the nth value of the Fibonacci sequence:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fib</span>(<span class="hljs-params">n: int</span>) -&gt; int:</span>
    a = <span class="hljs-number">1</span>
    b = <span class="hljs-number">1</span>

    <span class="hljs-keyword">while</span> n &gt; <span class="hljs-number">0</span>:
        c = b
        b = a + b
        a = c
        n = n - <span class="hljs-number">1</span>

    <span class="hljs-keyword">return</span> a
</code></pre>
<p>The corresponding CFG is shown below:</p>
<pre><code class="lang-plaintext">          ┌───────┐
          │ start │
          └───┬───┘
              │
              ▼
          ┌───────┐
          │ a = 1 │
          │ b = 1 │
          └───┬───┘
              │
              ▼
          ┌───────┐  False  ┌──────────┐
  ┌──────▶│ n &gt; 0 │────────▶│ return a │
  |       └───┬───┘         └──────────┘
  |           │ True                  
  |           ▼                     
  |    ┌───────────────┐           
  |    │ c = b         │          
  |    │ b = a + b     │         
  |    │ a = c         │        
  |    │ n = n - 1     │       
  |    └────┬──────────┘      
  |         │             
  └─────────┘
</code></pre>
<p>You'll notice that control-flow constructs (such as the <code>while</code> in our function definition)
are absent from the basic blocks. This is because they are represented as edges between the basic blocks.</p>
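<p>To make the "edges instead of control-flow statements" idea concrete, here is one possible way to encode the <code>fib</code> graph in Rust. This is only an illustrative sketch: the <code>Graph</code>, <code>Block</code> and <code>Terminator</code> types are hypothetical and are not the IR we define in this post, and statements are kept as plain strings to keep the example short.</p>
<pre><code class="lang-rust">/// Index of a block inside `Graph::blocks` (hypothetical representation).
type BlockId = usize;

/// How control leaves a block: these are the edges of the graph.
#[derive(Debug, Clone, PartialEq)]
enum Terminator {
    Jump(BlockId),
    Branch { cond: String, if_true: BlockId, if_false: BlockId },
    Return(String),
}

#[derive(Debug, Clone)]
struct Block {
    /// Straight-line statements, stored as text for illustration only.
    body: Vec&lt;String&gt;,
    terminator: Terminator,
}

#[derive(Debug, Clone)]
struct Graph {
    blocks: Vec&lt;Block&gt;,
}

impl Graph {
    /// Lists the blocks that control can flow to from block `id`.
    fn successors(&amp;self, id: BlockId) -&gt; Vec&lt;BlockId&gt; {
        match &amp;self.blocks[id].terminator {
            Terminator::Jump(target) =&gt; vec![*target],
            Terminator::Branch { if_true, if_false, .. } =&gt; vec![*if_true, *if_false],
            Terminator::Return(_) =&gt; vec![],
        }
    }
}

/// Small helper to build a block from string literals.
fn block(body: &amp;[&amp;str], terminator: Terminator) -&gt; Block {
    Block {
        body: body.iter().map(|s| s.to_string()).collect(),
        terminator,
    }
}

/// Builds the graph drawn above for `fib` (the `start` node is left out).
fn fib_graph() -&gt; Graph {
    Graph {
        blocks: vec![
            // Block 0: initialization, then fall into the loop condition.
            block(&amp;["a = 1", "b = 1"], Terminator::Jump(1)),
            // Block 1: the loop condition, with one edge per outcome.
            block(&amp;[], Terminator::Branch { cond: "n &gt; 0".into(), if_true: 2, if_false: 3 }),
            // Block 2: the loop body, which jumps back to the condition.
            block(&amp;["c = b", "b = a + b", "a = c", "n = n - 1"], Terminator::Jump(1)),
            // Block 3: the function exit.
            block(&amp;[], Terminator::Return("a".into())),
        ],
    }
}
</code></pre>
<p>In this encoding the <code>while</code> loop never appears as a statement: it only exists as the edge from block 2 back to block 1 and the two outgoing edges of block 1.</p>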
<p>For now our CFGs will be much simpler, as our function bodies will only contain a single
return statement.</p>
<p>Let's start by defining the IR nodes:</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/ir/nodes.rs</span>

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Decl</span></span> {
    Function(Function),
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Function</span></span> {
    <span class="hljs-keyword">pub</span> name: <span class="hljs-built_in">String</span>,
    <span class="hljs-keyword">pub</span> cfg: Cfg,
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Cfg</span></span> {
    <span class="hljs-keyword">pub</span> blocks: <span class="hljs-built_in">Vec</span>&lt;BasicBlock&gt;,
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">BasicBlock</span></span> {
    <span class="hljs-keyword">pub</span> instructions: <span class="hljs-built_in">Vec</span>&lt;Instruction&gt;,
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Instruction</span></span> {
    Return(Operand),
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Operand</span></span> {
    Immediate(<span class="hljs-built_in">i64</span>),
}
</code></pre>
<p>Here is the IR representation of our simple Pylite <code>main</code> function:</p>
<pre><code class="lang-plaintext">Function(
    Function {
        name: "main",
        cfg: Cfg {
            blocks: [
                BasicBlock {
                    instructions: [
                        Return(
                            Immediate(
                                1,
                            ),
                        ),
                    ],
                },
            ],
        },
    },
),
</code></pre>
<p>At this stage, there is a direct correspondence between the AST nodes and their IR
counterparts. Let's build a <code>Generator</code> struct to perform the mapping between
the two representations. Its only public method, <code>generate</code>, will translate a single
AST <code>Decl</code> into the matching IR construct.</p>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> crate::ast;

<span class="hljs-keyword">use</span> super::nodes::{BasicBlock, Cfg, Decl, Function, Instruction, Operand};

<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Generator</span></span> {
    instructions: <span class="hljs-built_in">Vec</span>&lt;Instruction&gt;,
}

<span class="hljs-keyword">impl</span> Generator {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>() -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> {
            instructions: <span class="hljs-built_in">Vec</span>::new(),
        }
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate</span></span>(<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, decl: &amp;ast::Decl) -&gt; Decl {
        <span class="hljs-keyword">match</span> &amp;decl.kind {
            ast::DeclKind::Function(f) =&gt; {
                <span class="hljs-keyword">self</span>.generate_function(f);
                Decl::Function(Function {
                    name: f.name.name.clone(),
                    cfg: Cfg {
                        blocks: <span class="hljs-built_in">vec!</span>[BasicBlock {
                            instructions: <span class="hljs-keyword">self</span>.instructions,
                        }],
                    },
                })
            }
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate_function</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, function: &amp;ast::FunDecl) {
        <span class="hljs-keyword">for</span> stmt <span class="hljs-keyword">in</span> &amp;function.body {
            <span class="hljs-keyword">self</span>.generate_stmt(stmt);
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate_stmt</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, stmt: &amp;ast::BlockStatement) {
        <span class="hljs-keyword">match</span> &amp;stmt.kind {
            ast::BlockStatementKind::Return { value } =&gt; {
                <span class="hljs-keyword">let</span> operand = <span class="hljs-keyword">self</span>.generate_expr(value);
                <span class="hljs-keyword">self</span>.instructions.push(Instruction::Return(operand))
            }
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate_expr</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, expr: &amp;ast::Expr) -&gt; Operand {
        <span class="hljs-keyword">match</span> &amp;expr.kind {
            ast::ExprKind::IntLit { value } =&gt; Operand::Immediate(*value),
        }
    }
}
</code></pre>
<h2 id="heading-assembly-language-101">Assembly language 101</h2>
<p>At the end of the compilation pipeline, our compiler will output assembly code.
Assembly code is a textual representation of the machine code that's executed by
the CPU. Therefore, assembly code is specific to a given CPU architecture, and
the assembly code generated by our compiler will only run natively on x86 CPUs
(which is a prevalent architecture in both personal computers and servers).</p>
<p>By the end of this section, you'll know enough about assembly programming to
understand each line of the assembly equivalent of our Pylite <code>main</code> function.</p>
<h3 id="heading-your-first-assembly-program">Your first assembly program</h3>
<pre><code class="lang-assembly">.globl main     # on macOS, replace with .globl _main
main:           # on macOS, replace with _main:
    mov rax, 1
    ret
</code></pre>
<p>This is the assembly equivalent of our Pylite <code>main</code> function.
In assembly, each line represents a single instruction or assembler directive,
and instructions are executed sequentially, from top to bottom. Let's break this
down line by line:</p>
<pre><code class="lang-assembly">    .globl main
</code></pre>
<p>This line is an assembler directive that declares the <code>main</code> label as a global symbol.
By default, symbols are local to the source file they are defined in, so referencing
<code>main</code> from another source file would result in an error.</p>
<p>In this case, there is only one source file, so why do we need to declare <code>main</code> as a global
symbol? Well, <code>main</code> is not the actual entry point of our program: the real entry point is
the <code>_start</code> symbol, defined in an object file named <code>crt0.o</code> (C runtime zero).
<code>crt0</code> is part of the C standard library, and it is responsible for setting up the
execution environment of our program before invoking our <code>main</code> function.</p>
<pre><code class="lang-assembly">main:
</code></pre>
<p>This is the definition of the <code>main</code> label. Unlike mainstream programming languages,
assembly does not have functions or procedures, but instead uses labels to mark
locations in the code that can be used as targets for jumps and calls.
We'll make extensive use of labels when implementing control flow structures
and function calls in our compiler.</p>
<blockquote>
<p>On macOS, the label must be prefixed with an underscore, so it would be <code>_main:</code> instead of <code>main:</code>.</p>
</blockquote>
<pre><code class="lang-assembly">    mov rax, 1
</code></pre>
<p>This instruction moves the immediate value <code>1</code> into the <code>rax</code> register.
Registers are small, fast and fixed-size storage locations within the CPU that are used to hold
values that are being processed. There is a fixed set of registers, some of which
are "general-purpose" registers, meaning that they can be used for almost any operation,
while others are specialized and have very specific behaviors.</p>
<p><code>rax</code> is one of the general-purpose registers, and like most registers in a 64-bit CPU architecture,
it can hold 64 bits of data. There are 15 other general-purpose registers, named
<code>rbx</code>, <code>rcx</code>, <code>rdx</code>, <code>rsi</code>, <code>rdi</code>, <code>rsp</code>, <code>rbp</code>, and <code>r8</code> to <code>r15</code>.
Don't feel compelled to memorize the full list right now, as we'll discuss a
lot more about registers in the next chapters.</p>
<pre><code class="lang-assembly">    ret
</code></pre>
<p>Through a mechanism that we'll explore in greater depth in the next chapters, <code>ret</code> orders the CPU to
resume execution from the point where we jumped to the <code>main</code> label. In this case, it
will return to the code in <code>crt0</code>, which calls the <code>exit</code> system call with the value
in <code>rax</code> as the exit code of the program.</p>
<h2 id="heading-building-the-executable">Building the executable</h2>
<p>We'll use <code>gcc</code> to assemble and link our assembly code. The following sections
will show you how to set up your environment to assemble and run the assembly
code generated by our compiler.</p>
<h3 id="heading-macos-setup">macOS setup</h3>
<p>On macOS, we'll use the <code>homebrew</code> package manager to install <code>gcc</code>. To install
<code>homebrew</code>, follow the instructions on the <a target="_blank" href="https://brew.sh/">official website</a>.
You can check your installation by running the following command in your terminal:</p>
<pre><code class="lang-bash">$ brew --version
</code></pre>
<p>Once <code>homebrew</code> is installed, you can install <code>gcc</code> by running:</p>
<pre><code class="lang-bash">$ brew install gcc
</code></pre>
<p>If your computer is running an x86 CPU, you're all set! But if you have an Apple Silicon
CPU, there is one last thing to be aware of: while it is perfectly fine to run the
compiler on an Apple Silicon CPU, the assembly code generated by our compiler
will only run on x86 CPUs. To run the generated assembly code, you'll need to
use <code>Rosetta 2</code>, which is a translation layer that allows x86 code to run on Apple Silicon CPUs.
You can install <code>Rosetta</code> by running the following command in your terminal:</p>
<pre><code class="lang-bash">$ softwareupdate --install-rosetta
</code></pre>
<p>Once <code>Rosetta 2</code> is installed, you can run <code>gcc</code> as an x86 binary by first opening
an x86 shell and running <code>gcc</code> from there:</p>
<pre><code class="lang-bash">$ arch -x86_64 zsh
$ &lt;your_gcc_invocation&gt;
</code></pre>
<h3 id="heading-linux-setup">Linux setup</h3>
<p>On Linux, you can install <code>gcc</code> using your distribution's package manager.
For example, on Ubuntu, you can run the following command:</p>
<pre><code class="lang-bash">$ sudo apt install gcc
</code></pre>
<p>You can check your installation by running the following command:</p>
<pre><code class="lang-bash">$ gcc --version
</code></pre>
<h3 id="heading-windows-setup">Windows setup</h3>
<p>Our compiler will not run natively on Windows, but you can use the Windows Subsystem for
Linux (WSL) to run it. WSL allows you to run a Linux distribution on Windows, so you can
use the same instructions as for Linux to install <code>gcc</code> and run the compiler.
To install WSL, follow the instructions on the <a target="_blank" href="https://learn.microsoft.com/en-us/windows/wsl/install">official website</a>.</p>
<h3 id="heading-assembling-and-linking-the-assembly-code">Assembling and linking the assembly code</h3>
<p>For this section, we'll assume that our assembly code is saved in a file named <code>return_code.s</code>.
We can assemble and link the assembly code with a single <code>gcc</code> command:</p>
<pre><code class="lang-bash">$ gcc -masm=intel return_code.s -o return_code
</code></pre>
<p>It instructs <code>gcc</code> to assemble and link the code in <code>return_code.s</code>, and produces an
executable named <code>return_code</code>. The <code>-masm=intel</code> flag tells <code>gcc</code> to use the Intel syntax
for the assembly code, which is the syntax used in our compiler. By default, <code>gcc</code> uses the
AT&amp;T syntax, which is slightly different.</p>
<p>We finally have a working executable! You can run it by executing the following command:</p>
<pre><code class="lang-bash">$ ./return_code
</code></pre>
<p>And predictably... nothing happens. Our program runs and exits, but it does not
print anything to the console. To verify that it worked properly, we can
inspect the exit code of the previous command by running:</p>
<pre><code class="lang-bash">$ <span class="hljs-built_in">echo</span> $?
1
</code></pre>
<p>It prints <code>1</code>, which is the value we returned from the <code>main</code> function.
You can try to substitute <code>1</code> with any other integer between 0 and 255 (the range of valid exit codes) in the assembly code,
rerun <code>gcc</code>, and run the executable again to see the exit code change accordingly.</p>
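<p>If you'd rather check the exit code from Rust (for instance in an integration test for the compiler), a small helper along these lines could do the job. This is a sketch rather than part of our codebase: <code>exit_code_of</code> is a hypothetical name, and it only relies on <code>std::process::Command</code> from the standard library.</p>
<pre><code class="lang-rust">use std::process::Command;

/// Runs `program` with `args` and returns its exit code, or `None` if the
/// process could not be started or was terminated by a signal.
fn exit_code_of(program: &amp;str, args: &amp;[&amp;str]) -&gt; Option&lt;i32&gt; {
    Command::new(program).args(args).status().ok()?.code()
}
</code></pre>
<p>Assuming the executable built above sits in the current directory, <code>exit_code_of("./return_code", &amp;[])</code> should report the same value as <code>echo $?</code>.</p>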
<h2 id="heading-code-generation">Code generation</h2>
<p>We're ready to build the final piece of our compiler: the code generator.
It will take the IR representation of our program as input, and generate the
corresponding x86 assembly code.
This will be a two-step process: first, we'll translate the IR data structure
into another data structure representing the linear sequence of assembly instructions,
then we'll emit the actual textual assembly code.</p>
<p>As we'll discover in the next chapters, splitting this process into two steps will
allow us to perform some transformations at the assembly level before emitting the
final assembly code.</p>
<p>The final assembly code for our Pylite <code>main</code> function will look like this:</p>
<pre><code class="lang-assembly">.globl main
main:
    mov rax, 1
    ret
</code></pre>
<p>We'll define a few Rust types to represent assembly instructions and operands.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/codegen/asm.rs</span>

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Function</span></span> {
    <span class="hljs-keyword">pub</span> label: <span class="hljs-built_in">String</span>,
    <span class="hljs-keyword">pub</span> instructions: <span class="hljs-built_in">Vec</span>&lt;Instruction&gt;,
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Instruction</span></span> {
    Mov { src: Operand, dst: Operand },
    Ret,
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Operand</span></span> {
    Register(Register),
    Immediate(<span class="hljs-built_in">i64</span>),
}

<span class="hljs-meta">#[derive(Debug, Copy, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Register</span></span> {
    Rax,
}

<span class="hljs-keyword">impl</span> std::fmt::Display <span class="hljs-keyword">for</span> Register {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">fmt</span></span>(&amp;<span class="hljs-keyword">self</span>, f: &amp;<span class="hljs-keyword">mut</span> std::fmt::Formatter&lt;<span class="hljs-symbol">'_</span>&gt;) -&gt; std::fmt::<span class="hljs-built_in">Result</span> {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span> {
            Register::Rax =&gt; <span class="hljs-built_in">write!</span>(f, <span class="hljs-string">"rax"</span>),
        }
    }
}
</code></pre>
<p>To translate the IR into assembly instructions, we need to iterate over each
basic block of a function's CFG and generate the corresponding assembly instructions.
To that end, we'll define a <code>FnGenerator</code> struct that wraps a function's IR representation
and a mutable assembly instructions buffer and iterates over the basic block's instructions,
pushing the corresponding assembly instructions into the buffer.
We'll also create a utility function that instantiates a <code>FnGenerator</code> for every function
in the IR program and returns a <code>Vec</code> of assembly <code>Function</code> structs, each containing the
function's label and the corresponding assembly instructions.</p>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> crate::ir;

<span class="hljs-keyword">use</span> super::asm::{Function, Instruction, Operand, Register};


<span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">FnGenerator</span></span>&lt;<span class="hljs-symbol">'f</span>&gt; {
    function: &amp;<span class="hljs-symbol">'f</span> ir::Function,
    instructions: <span class="hljs-built_in">Vec</span>&lt;Instruction&gt;,
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'f</span>&gt; FnGenerator&lt;<span class="hljs-symbol">'f</span>&gt; {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(function: &amp;<span class="hljs-symbol">'f</span> ir::Function) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> {
            function,
            instructions: <span class="hljs-built_in">Vec</span>::new(),
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate</span></span>(<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Vec</span>&lt;Instruction&gt; {
        <span class="hljs-keyword">for</span> block <span class="hljs-keyword">in</span> &amp;<span class="hljs-keyword">self</span>.function.cfg.blocks {
            <span class="hljs-keyword">self</span>.generate_block(block);
        }
        <span class="hljs-keyword">self</span>.instructions
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate_block</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, block: &amp;ir::BasicBlock) {
        <span class="hljs-keyword">for</span> instruction <span class="hljs-keyword">in</span> &amp;block.instructions {
            <span class="hljs-keyword">match</span> &amp;instruction {
                ir::Instruction::Return(op) =&gt; {
                    <span class="hljs-keyword">let</span> operand = <span class="hljs-keyword">self</span>.generate_operand(op);
                    <span class="hljs-keyword">self</span>.push(Instruction::Mov {
                        src: operand,
                        dst: Operand::Register(Register::Rax),
                    });
                    <span class="hljs-keyword">self</span>.push(Instruction::Ret)
                }
            }
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate_operand</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, operand: &amp;ir::Operand) -&gt; Operand {
        <span class="hljs-keyword">match</span> operand {
            ir::Operand::Immediate(value) =&gt; Operand::Immediate(*value),
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">push</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, instruction: Instruction) {
        <span class="hljs-keyword">self</span>.instructions.push(instruction);
    }
}

<span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate</span></span>(program: &amp;[ir::Decl]) -&gt; <span class="hljs-built_in">Vec</span>&lt;Function&gt; {
    program
        .iter()
        .map(|decl| {
            <span class="hljs-keyword">let</span> ir::Decl::Function(f) = decl;
            Function {
                label: f.name.clone(),
                instructions: FnGenerator::new(f).generate(),
            }
        })
        .collect()
}
</code></pre>
<p>The final step before we can assemble and run our program is to render the
assembly code into its textual representation.</p>
<p>One thing to note is that the format of labels is inconsistent across platforms:
on macOS function labels must start with an underscore, which is not the case
on Linux. To make our code portable, we'll use conditional compilation, and define
two different implementations of the function that renders function labels: one
that prepends an underscore to the label on macOS, and another that returns
the label as is on Linux.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/codegen/render.rs</span>

<span class="hljs-meta">#[cfg(target_os = <span class="hljs-meta-string">"macos"</span>)]</span>
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">format_fun_label</span></span>(label: &amp;<span class="hljs-built_in">str</span>) -&gt; <span class="hljs-built_in">String</span> {
    <span class="hljs-built_in">format!</span>(<span class="hljs-string">"_{}"</span>, label)
}

<span class="hljs-meta">#[cfg(target_os = <span class="hljs-meta-string">"linux"</span>)]</span>
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">format_fun_label</span></span>(label: &amp;<span class="hljs-built_in">str</span>) -&gt; <span class="hljs-built_in">String</span> {
    label.to_string()
}
</code></pre>
<p>With this in place, we can implement our <code>render_program</code> function, which takes a slice
of assembly <code>Function</code> structs and returns the textual representation of the assembly code.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/codegen/render.rs</span>

<span class="hljs-keyword">use</span> super::asm::{Function, Instruction, Operand};

<span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">render_program</span></span>(program: &amp;[Function]) -&gt; <span class="hljs-built_in">String</span> {
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> code = <span class="hljs-built_in">String</span>::new();
    <span class="hljs-keyword">for</span> function <span class="hljs-keyword">in</span> program {
        code.push_str(&amp;render_function(function));
    }
    code
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">render_function</span></span>(function: &amp;Function) -&gt; <span class="hljs-built_in">String</span> {
    <span class="hljs-keyword">let</span> label = format_fun_label(&amp;function.label);
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> code = <span class="hljs-built_in">format!</span>(<span class="hljs-string">"\t.globl {label}\n{label}:\n"</span>);

    <span class="hljs-keyword">for</span> instruction <span class="hljs-keyword">in</span> &amp;function.instructions {
        <span class="hljs-keyword">let</span> rendered = render_instruction(instruction);
        code.push_str(&amp;<span class="hljs-built_in">format!</span>(<span class="hljs-string">"\t{rendered}\n"</span>));
    }

    code
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">render_instruction</span></span>(instruction: &amp;Instruction) -&gt; <span class="hljs-built_in">String</span> {
    <span class="hljs-keyword">match</span> instruction {
        Instruction::Mov { src, dst } =&gt; {
            <span class="hljs-built_in">format!</span>(<span class="hljs-string">"mov {}, {}"</span>, render_operand(dst), render_operand(src))
        }
        Instruction::Ret =&gt; <span class="hljs-string">"ret"</span>.to_string(),
    }
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">render_operand</span></span>(operand: &amp;Operand) -&gt; <span class="hljs-built_in">String</span> {
    <span class="hljs-keyword">match</span> operand {
        Operand::Register(register) =&gt; register.to_string(),
        Operand::Immediate(i) =&gt; i.to_string(),
    }
}

<span class="hljs-comment">// [...]</span>
</code></pre>
<h2 id="heading-putting-it-all-together">Putting it all together</h2>
<p>At this stage, we have all the components of a complete compiler.
Let's put it to the test by writing a straightforward main function that
binds all of these components to turn our <code>Pylite</code> code into an assembly program:</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/main.rs</span>

<span class="hljs-keyword">mod</span> ast;
<span class="hljs-keyword">mod</span> codegen;
<span class="hljs-keyword">mod</span> ctx;
<span class="hljs-keyword">mod</span> db;
<span class="hljs-keyword">mod</span> error;
<span class="hljs-keyword">mod</span> id;
<span class="hljs-keyword">mod</span> ir;
<span class="hljs-keyword">mod</span> parse;

<span class="hljs-keyword">use</span> id::UniqueIdGenerator;

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() {
    <span class="hljs-comment">// Read the input source code</span>
    <span class="hljs-keyword">let</span> input_file = std::env::args().nth(<span class="hljs-number">1</span>).expect(<span class="hljs-string">"missing input file"</span>);
    <span class="hljs-keyword">let</span> source_code = std::fs::read_to_string(&amp;input_file).expect(<span class="hljs-string">"failed to read input file"</span>);

    <span class="hljs-keyword">let</span> id_gen = UniqueIdGenerator::default();

    <span class="hljs-comment">// Parse the code into an Abstract Syntax Tree</span>
    <span class="hljs-keyword">let</span> ast_statements = parse::parser::Parser::new(id_gen.clone(), &amp;source_code)
        .parse_module()
        .expect(<span class="hljs-string">"failed to parse module"</span>);

    <span class="hljs-comment">// Lower the AST to the Intermediate Representation</span>
    <span class="hljs-keyword">let</span> ir_statements = ast_statements
        .iter()
        .map(|stmt| {
            <span class="hljs-keyword">let</span> ast::StatementKind::Decl(decl) = &amp;stmt.kind;
            ir::gen::Generator::new().generate(decl)
        })
        .collect::&lt;<span class="hljs-built_in">Vec</span>&lt;_&gt;&gt;();

    <span class="hljs-comment">// Generate and print the assembly code</span>
    <span class="hljs-keyword">let</span> asm = codegen::gen::generate(&amp;ir_statements);
    <span class="hljs-built_in">println!</span>(<span class="hljs-string">"{}"</span>, codegen::render::render_program(&amp;asm));
}
</code></pre>
<p>With our input program in <code>res/samples/return_const/main.py</code>, we can run our compiler and inspect its output
using the following commands:</p>
<pre><code class="lang-bash">$ cargo run -- res/samples/return_const/main.py &gt; out.s
$ cat out.s
        .globl _main
_main:
        mov rax, 1
        ret
</code></pre>
<p>It seems that the compilation was successful! We can execute the program and observe its exit code
by running the following commands:</p>
<pre><code class="lang-bash">$ gcc -masm=intel out.s -o main
$ ./main
$ <span class="hljs-built_in">echo</span> $?
1
</code></pre>
<blockquote>
<p>If you are using a Mac with an Apple Silicon processor, remember to run <code>arch -x86_64 zsh</code>
before typing these commands.</p>
</blockquote>
<p>We just ran our first <code>Pylite</code> program!
In the next section, we'll start laying the foundations for a more robust
architecture.</p>
]]></content:encoded></item><item><title><![CDATA[Build a Compiler from Scratch, Part 1.1: A Hello World of sorts]]></title><description><![CDATA[It has become common practice to start with a "Hello World" program when learning a new programming language.
This is a simple program that outputs the text "Hello, World!" to the screen. While writing such a program 
is a trivial task with most prog...]]></description><link>https://blog.sylver.dev/build-a-compiler-from-scratch-part-11-a-hello-world-of-sorts</link><guid isPermaLink="true">https://blog.sylver.dev/build-a-compiler-from-scratch-part-11-a-hello-world-of-sorts</guid><category><![CDATA[Rust]]></category><category><![CDATA[compiler]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Tue, 24 Jun 2025 23:36:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750973993317/8659f628-c3c4-4f65-a9a7-64d97f5b6e39.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It has become common practice to start with a "Hello World" program when learning a new programming language.
This is a simple program that outputs the text "Hello, World!" to the screen. While writing such a program 
is a trivial task with most programming languages, getting our own language to interface with the 
OS and write data to the standard output will take a fair amount of work.</p>
<p>So we'll start with something simpler. How much simpler? Well, here is the first <code>Pylite</code>
program that our compiler will translate to machine code:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
  <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
</code></pre>
<p>It just sets a return code and exits. This is the simplest program that we can write while
having an easily observable effect. It will be our "Hello World" program for now.</p>
<p>Throughout this chapter, we will build the basic building blocks of our compiler, including the
parser, and translation passes to the intermediate representation and machine code. Each of these
blocks will only support the minimum features needed to compile our "Hello World" program, and
will be expanded in later chapters to support more complex constructs.</p>
<h2 id="heading-setting-up-the-project">Setting up the project</h2>
<p>We will start by creating a new cargo workspace to host our compiler.</p>
<pre><code class="lang-bash">$ mkdir pylite
$ <span class="hljs-built_in">cd</span> pylite
</code></pre>
<p>For now, this workspace will contain a single <code>compiler</code> crate:</p>
<pre><code class="lang-toml"><span class="hljs-comment"># Cargo.toml</span>
<span class="hljs-section">[workspace]</span>
<span class="hljs-attr">resolver</span> = <span class="hljs-string">"2"</span>
<span class="hljs-attr">members</span> = [<span class="hljs-string">"compiler"</span>]
</code></pre>
<p>We can now create the <code>compiler</code> crate:</p>
<pre><code class="lang-bash">$ cargo new compiler --bin
$ cargo run 2&gt;/dev/null
Hello, world!
</code></pre>
<p>Our top-level <code>pylite</code> folder should have the following structure:</p>
<pre><code>.
├── Cargo.toml
└── compiler
    ├── Cargo.toml
    └── src
        └── main.rs
</code></pre><h2 id="heading-parsing-pylite">Parsing Pylite</h2>
<p>The textual representation of a program does not lend itself
well to the kind of analysis and transformation that we need to perform. In order to resolve types,
associate identifiers with their definitions and generate our low-level intermediate representation,
we need to convert the unstructured sequence of characters that make up the source code into a
tree-like structure that represents the program's syntax.
This process is called parsing, and the data structure it produces is called an Abstract Syntax Tree (AST).
The parsing process is typically split into two parts: lexical analysis (or tokenization) and syntactic
analysis (or parsing).</p>
<pre><code class="lang-plaintext">Source code
===========

def main():
    return 1


Token sequence
==============

┌───┐ ┌────┐ ┌─┐ ┌─┐ ┌─┐ ┌──────┐ ┌──────┐ ┌─┐ ┌──────┐
│def│ │main│ │(│ │)│ │:│ │INDENT│ │return│ │1│ │DEDENT│
└───┘ └────┘ └─┘ └─┘ └─┘ └──────┘ └──────┘ └─┘ └──────┘


Abstract Syntax Tree
====================

┌─────────────┐
│ FunctionDef │
│   (main)    │
└─────────────┘
        │
        └── ┌─────────────┐
            │ Return Stmt │
            └─────────────┘
                    │
                    └── ┌─────────┐
                        │ Literal │
                        │    1    │
                        └─────────┘
</code></pre>
<p>During lexical analysis, we group individual characters into tokens, which are the smallest
meaningful units of the language. For example, the characters <code>d</code>, <code>e</code> and <code>f</code> will be grouped
into a single token representing the <code>def</code> keyword. Irrelevant characters, such as non-significant
whitespace, are discarded during this process.
Since <code>Pylite</code> - like Python - is indentation-sensitive, we need to keep track of the current indentation level.
This will be done by inserting <code>INDENT</code> and <code>DEDENT</code> tokens into the token sequence whenever the indentation level changes.</p>
<p>The syntax analysis phase will then take this sequence of tokens and match it against our language's syntax rules
(or grammar) to build a tree-like structure called an Abstract Syntax Tree (AST). </p>
<h3 id="heading-writing-the-lexer">Writing the lexer</h3>
<p>The first step in writing our lexer is to choose a representation for our tokens. Rust's enums are a great 
fit for this purpose. We'll also pair each token with a <code>Span</code> struct that will hold the token's 
start and end byte position in the source code. Keeping track of the token's position will prove useful to display
meaningful error messages to the user.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/ast.rs</span>

<span class="hljs-meta">#[derive(Debug, Copy, Clone, Eq, PartialEq, Hash)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Span</span></span> {
    <span class="hljs-keyword">pub</span> start: <span class="hljs-built_in">u32</span>,
    <span class="hljs-keyword">pub</span> end: <span class="hljs-built_in">u32</span>,
}

<span class="hljs-keyword">impl</span> Span {
    <span class="hljs-comment">// Create a new span with the given start and end position.</span>
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(start: <span class="hljs-built_in">u32</span>, end: <span class="hljs-built_in">u32</span>) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> { start, end }
    }

    <span class="hljs-comment">// Create a new span that covers both the current span and the given span.</span>
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">merge</span></span>(<span class="hljs-keyword">self</span>, other: Span) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> {
            start: <span class="hljs-keyword">self</span>.start.min(other.start),
            end: <span class="hljs-keyword">self</span>.end.max(other.end),
        }
    }
}
</code></pre>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/token.rs</span>

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq, Hash)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Token</span></span> {
    Def,
    Return,
    LPar,
    RPar,
    Colon,
    Identifier(<span class="hljs-built_in">String</span>),
    Int(<span class="hljs-built_in">i64</span>),
    Unknown(<span class="hljs-built_in">char</span>),
}
</code></pre>
<blockquote>
<p>The <code>Unknown</code> variant will be used to represent any unexpected character that we encounter 
during the tokenization process.</p>
</blockquote>
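<p>To illustrate why byte spans are handy for diagnostics, here is a minimal sketch of how a span could be turned into
a human-readable message. The <code>line_and_column</code> and <code>describe</code> helpers are hypothetical and not part of the
compiler; <code>Span</code> is redefined locally so that the snippet is self-contained, and we assume that span offsets always fall on
character boundaries.</p>

```rust
// Local copy of the Span struct, so that this sketch compiles on its own.
#[derive(Debug, Copy, Clone, Eq, PartialEq)]
struct Span {
    start: u32,
    end: u32,
}

// Hypothetical helper: converts a byte offset into a 1-based (line, column) pair.
// Columns are counted in characters, not bytes.
fn line_and_column(source: &str, offset: u32) -> (usize, usize) {
    let before = &source[..offset as usize];
    let line = before.matches('\n').count() + 1;
    let column = match before.rfind('\n') {
        // Count the characters between the last newline and the offset.
        Some(newline) => before[newline + 1..].chars().count() + 1,
        None => before.chars().count() + 1,
    };
    (line, column)
}

// Hypothetical helper: renders a message that points at the text covered by `span`.
fn describe(source: &str, span: Span, message: &str) -> String {
    let (line, column) = line_and_column(source, span.start);
    let snippet = &source[span.start as usize..span.end as usize];
    format!("{line}:{column}: {message}: `{snippet}`")
}
```

<p>For instance, with the source <code>"def main():\n    return $\n"</code> and a span covering bytes 23 to 24,
<code>describe</code> reports the unexpected <code>$</code> at line 2, column 12.</p>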
<p>With these definitions in place, we can create a scaffolding for our lexer. We'll model it as an iterator
yielding tuples of <code>(Token, Span)</code>.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/lexer.rs</span>

<span class="hljs-keyword">use</span> std::{iter::Peekable, <span class="hljs-built_in">str</span>::Chars};

<span class="hljs-keyword">use</span> crate::ast::Span;

<span class="hljs-keyword">use</span> super::token::Token;

<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Lexer</span></span>&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-comment">// iterator over the source code's characters</span>
    input: Peekable&lt;Chars&lt;<span class="hljs-symbol">'c</span>&gt;&gt;,
    <span class="hljs-comment">// current byte position in the source code</span>
    position: <span class="hljs-built_in">u32</span>,
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'c</span>&gt; Lexer&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(input: &amp;<span class="hljs-symbol">'c</span> <span class="hljs-built_in">str</span>) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> {
            input: input.chars().peekable(),
            position: <span class="hljs-number">0</span>,
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_token</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;(Token, Span)&gt; {
        <span class="hljs-built_in">unimplemented!</span>() 
    }
}

<span class="hljs-keyword">impl</span> <span class="hljs-built_in">Iterator</span> <span class="hljs-keyword">for</span> Lexer&lt;<span class="hljs-symbol">'_</span>&gt; {
    <span class="hljs-class"><span class="hljs-keyword">type</span> <span class="hljs-title">Item</span></span> = (Token, Span);

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;Self::Item&gt; {
        <span class="hljs-keyword">self</span>.next_token()
    }
}
</code></pre>
<p>We'll also add the following helper methods to simplify the implementation of <code>next_token</code>:</p>
<ul>
<li><code>emit_token</code> takes a <code>start</code> position and a <code>Token</code> and returns a tuple of the form <code>(Token, Span)</code> (wrapped in <code>Some</code>),
 where the <code>Span</code> starts at <code>start</code> and ends at the current position.</li>
<li><code>next_char</code> consumes the next character in the input (if any), and updates the current position.</li>
<li><code>next_char_if</code> conditionally consumes the next character and updates the current position.</li>
</ul>
<blockquote>
<p>Since we track the byte position of the tokens and UTF-8 characters can span multiple bytes,
we can't simply increment the position for each character. Instead, we'll use the <code>char::len_utf8</code>
method to get the byte size of the current character and update the position accordingly.</p>
</blockquote>
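<p>As a quick illustration of this point, the sketch below computes the byte offset at which each character of a string starts.
<code>char_byte_offsets</code> is a hypothetical helper written for this example only; the lexer itself updates its position incrementally,
as shown in the next listing.</p>

```rust
// Hypothetical helper: pairs each character of `source` with the byte offset
// at which it starts.
fn char_byte_offsets(source: &str) -> Vec<(char, u32)> {
    let mut position = 0u32;
    let mut offsets = Vec::new();
    for c in source.chars() {
        offsets.push((c, position));
        // Advance by the encoded size of the character, which is not always 1.
        position += c.len_utf8() as u32;
    }
    offsets
}
```

<p>With the input <code>"é1"</code>, the <code>é</code> starts at byte 0 and the <code>1</code> starts at byte 2, because <code>é</code> is encoded
on two bytes. Incrementing the position by one per character would have placed the <code>1</code> at byte 1, in the middle of the previous character.</p>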
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/lexer.rs</span>

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'c</span>&gt; Lexer&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-comment">// [...] </span>

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">emit_token</span></span>(&amp;<span class="hljs-keyword">self</span>, start: <span class="hljs-built_in">u32</span>, token: Token) -&gt; <span class="hljs-built_in">Option</span>&lt;(Token, Span)&gt; {
        <span class="hljs-literal">Some</span>((
            token,
            Span {
                start,
                end: <span class="hljs-keyword">self</span>.position,
            },
        ))
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_char</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;<span class="hljs-built_in">char</span>&gt; {
        <span class="hljs-keyword">self</span>.next_char_if(|_| <span class="hljs-literal">true</span>)
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_char_if</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, f: <span class="hljs-keyword">impl</span> <span class="hljs-built_in">FnOnce</span>(<span class="hljs-built_in">char</span>) -&gt; <span class="hljs-built_in">bool</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;<span class="hljs-built_in">char</span>&gt; {
        <span class="hljs-keyword">self</span>.input.next_if(|&amp;c| f(c)).inspect(|c| {
            <span class="hljs-keyword">self</span>.position += c.len_utf8() <span class="hljs-keyword">as</span> <span class="hljs-built_in">u32</span>;
        })
    }  
}
</code></pre>
<p>Some tokens contain a single character, such as <code>(</code> or <code>:</code>. We can handle these tokens by comparing the
current character to the expected one and returning the corresponding token if they match.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/lexer.rs</span>

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'c</span>&gt; Lexer&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_token</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;(Token, Span)&gt; {
        <span class="hljs-keyword">let</span> start_pos = <span class="hljs-keyword">self</span>.position;

        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.next_char()? {
            <span class="hljs-string">'('</span> =&gt; <span class="hljs-keyword">self</span>.emit_token(start_pos, Token::LPar),
            <span class="hljs-string">')'</span> =&gt; <span class="hljs-keyword">self</span>.emit_token(start_pos, Token::RPar),
            <span class="hljs-string">':'</span> =&gt; <span class="hljs-keyword">self</span>.emit_token(start_pos, Token::Colon),
            c =&gt; <span class="hljs-built_in">unimplemented!</span>(),
        }
    }
}
</code></pre>
<p>The remaining cases are more involved, but follow a similar pattern:</p>
<ul>
<li>if the current character is a digit, we consume the following characters until we reach a
non-digit character. We then parse the resulting string as an integer and create an integer token.</li>
<li>if the current character is a letter, we consume the following characters until we reach a
non-alphanumeric character different from <code>_</code>. We then check if the resulting string matches
a keyword and create a keyword token if it does, or an identifier token otherwise.</li>
<li>if the current character is whitespace, we skip all the whitespace that follows and return the next token.</li>
</ul>
<p>Finally, if the current character does not match any of the above cases, we create an <code>Unknown</code> token.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/lexer.rs</span>

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'c</span>&gt; Lexer&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_token</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;(Token, Span)&gt; {
        <span class="hljs-comment">// [...]</span>
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.next_char()? {
            <span class="hljs-comment">// [...]</span>
            <span class="hljs-comment">// ASCII digits only: other Unicode numeric characters (e.g. '½') would make the i64 parse below fail</span>
            c <span class="hljs-keyword">if</span> c.is_ascii_digit() =&gt; {
                <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> int_repr = <span class="hljs-built_in">String</span>::from(c);
                <span class="hljs-keyword">while</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(c) = <span class="hljs-keyword">self</span>.next_char_if(|c| c.is_ascii_digit()) {
                    int_repr.push(c);
                }
                <span class="hljs-keyword">let</span> value = int_repr.parse::&lt;<span class="hljs-built_in">i64</span>&gt;().expect(<span class="hljs-string">"int token"</span>);
                <span class="hljs-keyword">self</span>.emit_token(start_pos, Token::Int(value))
            }
            c <span class="hljs-keyword">if</span> c.is_alphabetic() =&gt; {
                <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> identifier = <span class="hljs-built_in">String</span>::from(c);
                <span class="hljs-keyword">while</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(c) = <span class="hljs-keyword">self</span>.next_char_if(|c| c.is_alphanumeric() || c == <span class="hljs-string">'_'</span>) {
                    identifier.push(c);
                }
                <span class="hljs-keyword">match</span> identifier.as_str() {
                    <span class="hljs-string">"def"</span> =&gt; <span class="hljs-keyword">self</span>.emit_token(start_pos, Token::Def),
                    <span class="hljs-string">"return"</span> =&gt; <span class="hljs-keyword">self</span>.emit_token(start_pos, Token::Return),
                    _ =&gt; <span class="hljs-keyword">self</span>.emit_token(start_pos, Token::Identifier(identifier)),
                }
            }
            c <span class="hljs-keyword">if</span> c.is_whitespace() =&gt; {
                <span class="hljs-keyword">while</span> <span class="hljs-keyword">self</span>.next_char_if(|c| c.is_whitespace()).is_some() {}
                <span class="hljs-keyword">self</span>.next_token()
            }
            c =&gt; <span class="hljs-keyword">self</span>.emit_token(start_pos, Token::Unknown(c)),
        }
    }
}
</code></pre>
<p>With our lexer almost complete, it's time to write our first test!
As building and verifying the token spans by hand would be tedious, we'll write a helper <code>test_lexer</code> 
function that takes as input the source code and a <code>vec</code> of <code>(&amp;str, Token)</code> tuples representing the 
expected tokens, with the <code>&amp;str</code> being the token's textual representation.
If tokenizing the input and rendering the <code>Span</code>s does not yield the same <code>vec</code>, the test will fail.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/lexer.rs</span>

<span class="hljs-meta">#[cfg(test)]</span>
<span class="hljs-keyword">mod</span> tests {
    <span class="hljs-keyword">use</span> super::*;

    <span class="hljs-meta">#[test]</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">tokenize_return_const</span></span>() {
        <span class="hljs-keyword">let</span> input = <span class="hljs-string">r##"
def main():
    return 34
"##</span>;

        <span class="hljs-keyword">let</span> expected = <span class="hljs-built_in">vec!</span>[
            (<span class="hljs-string">"def"</span>, Token::Def),
            (<span class="hljs-string">"main"</span>, Token::Identifier(<span class="hljs-string">"main"</span>.to_string())),
            (<span class="hljs-string">"("</span>, Token::LPar),
            (<span class="hljs-string">")"</span>, Token::RPar),
            (<span class="hljs-string">":"</span>, Token::Colon),
            (<span class="hljs-string">"return"</span>, Token::Return),
            (<span class="hljs-string">"34"</span>, Token::Int(<span class="hljs-number">34</span>)),
        ];

        test_lexer(input, expected);
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">test_lexer</span></span>(input: &amp;<span class="hljs-built_in">str</span>, expected: <span class="hljs-built_in">Vec</span>&lt;(&amp;<span class="hljs-built_in">str</span>, Token)&gt;) {
        <span class="hljs-keyword">let</span> rendered: <span class="hljs-built_in">Vec</span>&lt;(&amp;<span class="hljs-built_in">str</span>, Token)&gt; = Lexer::new(input)
            .map(|(token, span)| {
                <span class="hljs-keyword">let</span> rendered = &amp;input[span.start <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..span.end <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>];
                (rendered, token)
            })
            .collect();

        <span class="hljs-built_in">assert_eq!</span>(expected, rendered);
    }
}
</code></pre>
<p>Let's make sure that our test passes before moving on to the next section:</p>
<pre><code class="lang-bash">$ cargo <span class="hljs-built_in">test</span>
running 1 <span class="hljs-built_in">test</span>
<span class="hljs-built_in">test</span> parse::lexer::tests::tokenize_return_const ... ok

<span class="hljs-built_in">test</span> result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished <span class="hljs-keyword">in</span> 0.00s
</code></pre>
<h4 id="heading-handling-indentation">Handling indentation</h4>
<p>One defining aspect of <code>Pylite</code>'s syntax is that it is indentation-sensitive. This means that blocks
are delimited by their indentation level, rather than by explicit braces or keywords.
To simplify the parsing process, we'll insert <code>Indent</code> and <code>Dedent</code> tokens into the token stream
whenever we detect a change in the current indentation level.</p>
<p>How do we detect such changes? First, we need to define which character - or sequence of characters - represents
a single level of indentation. In <code>Pylite</code> a single tab, or four consecutive spaces, will represent a single level 
of indentation. Every time we reach a new line, we'll count the number of tabs and spaces at the beginning of the line
and compare the new indentation level to the previous one. If the indentation levels match, there is no extra token to
insert. If they don't match we'll compute <code>delta = abs(new_indentation - previous_indentation)</code> and insert <code>delta</code> 
<code>Indent</code> or <code>Dedent</code> tokens accordingly.
In case of inconsistent indentation, for example, if a new line starts with three spaces, we'll insert an <code>InconsistentIndentation</code>
token. </p>
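<p>The rule described above can be sketched as a pair of standalone functions. This is only an illustration under simplifying
assumptions: <code>IndentToken</code>, <code>indentation_level</code> and <code>indentation_tokens</code> are hypothetical names that are not part
of the lexer, the functions work on a whole line rather than on the lexer's character stream, and tabs and spaces may be mixed freely
as long as the number of spaces is a multiple of four.</p>

```rust
// Hypothetical token type, used only in this sketch.
#[derive(Debug, PartialEq)]
enum IndentToken {
    Indent,
    Dedent,
    InconsistentIndentation,
}

// Computes the indentation level of a line: one level per leading tab,
// plus one level per group of four leading spaces.
// Returns None when the leading spaces do not form a whole number of levels.
fn indentation_level(line: &str) -> Option<u32> {
    let mut tabs = 0u32;
    let mut spaces = 0u32;
    for c in line.chars() {
        match c {
            '\t' => tabs += 1,
            ' ' => spaces += 1,
            _ => break,
        }
    }
    if spaces % 4 == 0 {
        Some(tabs + spaces / 4)
    } else {
        None
    }
}

// Returns the tokens to insert when moving from the `previous` indentation
// level to the level of `line`.
fn indentation_tokens(previous: u32, line: &str) -> Vec<IndentToken> {
    match indentation_level(line) {
        None => vec![IndentToken::InconsistentIndentation],
        Some(new) => {
            // delta = abs(new_indentation - previous_indentation)
            let delta = new.abs_diff(previous);
            (0..delta)
                .map(|_| {
                    if new > previous {
                        IndentToken::Indent
                    } else {
                        IndentToken::Dedent
                    }
                })
                .collect()
        }
    }
}
```

<p>For example, going from level 2 back to an unindented line yields two <code>Dedent</code> tokens, while a line starting with three spaces
yields a single <code>InconsistentIndentation</code> token.</p>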
<p>Let's start by adding the new variants to our <code>Token</code> enum:</p>
<pre><code class="lang-diff">//! compiler/src/parse/token.rs

#[derive(Debug, Clone, Eq, PartialEq, Hash)]
pub enum Token {
    Def,
    Return,
    LPar,
    RPar,
    Colon,
    Identifier(String),
    Int(i64),
    Unknown(char),
<span class="hljs-addition">+   Indent,</span>
<span class="hljs-addition">+   Dedent,</span>
<span class="hljs-addition">+   InconsistentIndentation,</span>
}

// [...]
</code></pre>
<p>In some cases, the difference in indentation level will be higher than one. For example, we'll have to insert
two <code>Dedent</code> tokens at the end of the following function definition: one to close the <code>if</code> block, and the
other to close the function definition.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-keyword">if</span> <span class="hljs-literal">True</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
</code></pre>
<p>This doesn't fit well with the current design of our lexer, as we return tokens eagerly as soon as they are
recognized. We'll need to refactor our lexer to support emitting multiple tokens at once. For this reason,
we'll split the work done by our <code>next</code> method into two separate steps: first push the recognized tokens into
a queue, and then return the first token in the queue, if any.</p>
<pre><code class="lang-diff">//! compiler/src/parse/lexer.rs

<span class="hljs-deletion">-use std::{iter::Peekable, str::Chars};</span>
<span class="hljs-addition">+use std::{collections::VecDeque, iter::Peekable, str::Chars};</span>

pub struct Lexer&lt;'c&gt; {
    input: Peekable&lt;Chars&lt;'c&gt;&gt;,
    // queue of tokens to emit, we use a VecDeque to efficiently pop tokens from the front
    // without having to shift the remaining elements
<span class="hljs-addition">+   token_queue: VecDeque&lt;(Token, Span)&gt;,</span>
    position: u32,
<span class="hljs-addition">+   indentation: u32, </span>
}

impl&lt;'c&gt; Lexer&lt;'c&gt; {
    pub fn new(input: &amp;'c str) -&gt; Self {
        Self {
            input: input.chars().peekable(),
<span class="hljs-addition">+           token_queue: VecDeque::default(),</span>
            position: 0,
<span class="hljs-addition">+           indentation: 0, </span>
        }
    }

<span class="hljs-deletion">-   fn next_token(&amp;mut self) -&gt; Option&lt;(Token, Span)&gt; {</span>
<span class="hljs-addition">+   fn emit_next_tokens(&amp;mut self) {</span>
        // handle indents at the beginning of a file
<span class="hljs-addition">+       if self.position == 0 {</span>
<span class="hljs-addition">+           self.emit_indentation_tokens();</span>
<span class="hljs-addition">+       }</span>
<span class="hljs-addition">+</span>
        let start_pos = self.position;

<span class="hljs-addition">+       let Some(first_char) = self.next_char() else {</span>
<span class="hljs-addition">+           return;</span>
<span class="hljs-addition">+       };</span>

<span class="hljs-deletion">-       match self.next_char()? {</span>
<span class="hljs-addition">+       match first_char {</span>
            // [...]
<span class="hljs-addition">+           '\n' =&gt; {</span>
<span class="hljs-addition">+               self.emit_indentation_tokens(); </span>
<span class="hljs-addition">+               self.emit_next_tokens();</span>
            }
            // [...]
        }
    }

<span class="hljs-addition">+   fn emit_indentation_tokens(&amp;mut self) {</span>
<span class="hljs-addition">+       let mut space_count = 0;</span>
<span class="hljs-addition">+       let mut indentation = 0;</span>

<span class="hljs-addition">+       while let Some(c) = self.next_char_if(char::is_whitespace) {</span>
<span class="hljs-addition">+           match c {</span>
                // a tab increases the indentation level by 1
<span class="hljs-addition">+               '\t' =&gt; {</span>
<span class="hljs-addition">+                   if space_count % 4 != 0 {</span>
<span class="hljs-addition">+                       space_count = 4;</span>
<span class="hljs-addition">+                       self.emit_token(self.position, Token::InconsistentIndentation);</span>
<span class="hljs-addition">+                   }</span>
<span class="hljs-addition">+                   indentation += 1;</span>
<span class="hljs-addition">+               }</span>
                // indentation is reset on every new line
<span class="hljs-addition">+               '\n' =&gt; {</span>
<span class="hljs-addition">+                   space_count = 0;</span>
<span class="hljs-addition">+                   indentation = 0;</span>
<span class="hljs-addition">+               }</span>
<span class="hljs-addition">+               _ =&gt; {</span>
<span class="hljs-addition">+                   space_count += 1;</span>
                    // four spaces increase the indentation level
<span class="hljs-addition">+                   if space_count % 4 == 0 {</span>
<span class="hljs-addition">+                       indentation += 1;</span>
<span class="hljs-addition">+                   }</span>
<span class="hljs-addition">+               }</span>
<span class="hljs-addition">+           }</span>
<span class="hljs-addition">+       }</span>

<span class="hljs-addition">+       if space_count % 4 != 0 {</span>
<span class="hljs-addition">+           self.emit_token(self.position, Token::InconsistentIndentation)</span>
<span class="hljs-addition">+       }</span>

<span class="hljs-addition">+       if indentation == self.indentation {</span>
<span class="hljs-addition">+           return;</span>
<span class="hljs-addition">+       }</span>

        // emit indent/dedent tokens if the new indentation level is
        // different from the previous one
<span class="hljs-addition">+       for _ in 0..self.indentation.abs_diff(indentation) {</span>
<span class="hljs-addition">+           let token = if indentation &gt; self.indentation {</span>
<span class="hljs-addition">+               Token::Indent</span>
<span class="hljs-addition">+           } else {</span>
<span class="hljs-addition">+               Token::Dedent</span>
<span class="hljs-addition">+           };</span>
<span class="hljs-addition">+           self.emit_token(self.position, token);</span>
<span class="hljs-addition">+       }</span>

<span class="hljs-addition">+       self.indentation = indentation;</span>
<span class="hljs-addition">+   }</span>

    fn emit_token(&amp;mut self, start: u32, token: Token) {
        self.token_queue.push_back((
            token,
            Span {
                start,
                end: self.position,
            },
        ));
    }

    // [...]
}

impl Iterator for Lexer&lt;'_&gt; {
    type Item = (Token, Span);

    fn next(&amp;mut self) -&gt; Option&lt;Self::Item&gt; {
<span class="hljs-deletion">-       self.next_token();</span>
<span class="hljs-addition">+       if self.token_queue.is_empty() {</span>
<span class="hljs-addition">+           self.emit_next_tokens();</span>
<span class="hljs-addition">+       }</span>
<span class="hljs-addition">+       self.token_queue.pop_front()</span>
    }
}
</code></pre>
<p>Now, our lexer will emit <code>Indent</code> and <code>Dedent</code> tokens whenever the indentation level changes.
The only thing that is left to do is to update our test:</p>
<pre><code class="lang-diff">//! compiler/src/parse/lexer.rs

[...]

    #[test]
    fn tokenize_return_const() {
        let input = r##"
def main():
    return 34
"##;

        let expected = vec![
            ("def", Token::Def),
            ("main", Token::Identifier("main".to_string())),
            ("(", Token::LPar),
            (")", Token::RPar),
            (":", Token::Colon),
<span class="hljs-addition">+           ("", Token::Indent),</span>
            ("return", Token::Return),
            ("34", Token::Int(34)),
<span class="hljs-addition">+           ("", Token::Dedent),</span>
        ];

        test_lexer(input, expected);
    }

    [...]
</code></pre>
<h2 id="heading-building-the-parser">Building the parser</h2>
<h3 id="heading-introduction-to-formal-grammars">Introduction to formal grammars</h3>
<p>Throughout this series, we'll try to limit the amount of purely theoretical content. But
it's challenging to study compilers without at least a cursory understanding of formal grammars.
We'll actually make use of the concepts introduced in this section when writing our parser,
as the shape of our parsing functions will closely mimic the structure of our grammar
rules. </p>
<blockquote>
<p>The kind of grammar that we'll be using is called a context-free grammar (CFG).
Context-free grammars are a subset of formal grammars that are widely used in
computer science, particularly in the field of programming languages.
To learn more about formal grammars, you can check out the <a target="_blank" href="https://en.wikipedia.org/wiki/Chomsky_hierarchy">Wikipedia entry</a>
on the Chomsky hierarchy, which classifies grammars into multiple nested categories.</p>
</blockquote>
<p>But first, what is a language's grammar? 
In the context of programming languages, a grammar is a set of rules that define the
syntax of a language. It specifies how tokens can be combined to form valid statements
and expressions. In a way, grammars solve the same problem as regular expressions: 
given a piece of text, determine whether it is "valid," according to a set of rules.</p>
<p>Why not use regular expressions to define the syntax of our language, then?
The answer is simple: regular expressions are not powerful enough to express the 
syntax of most programming languages. Regular expressions can handle simple patterns
like matching keywords or numeric literals, but can't describe nested structures
and recursive patterns that are fundamental to programming languages.
As a motivating example, consider the challenge of matching balanced parentheses—a
structure that appears everywhere in programming languages, from function calls to
mathematical expressions. You can try to write a regular expression that matches
balanced parentheses: inputs like <code>()</code>, <code>(())</code> and <code>(()())</code> should all be matched, 
while inputs like <code>(()</code> and <code>())</code> should not.</p>
<p>This is a classic example of something that cannot be expressed with regular
expressions, but can be expressed with a context-free grammar. Without
further ado, let's see how to write a grammar for our balanced parentheses
example.</p>
<pre><code class="lang-plaintext">BalancedParenthesis = { Nested }-;
Nested = "(", { Nested }, ")";
</code></pre>
<p>Rules are defined using the <code>=</code> symbol, and are terminated by a semicolon.
Within a rule, <code>,</code> denotes a sequence of elements. An element can be
wrapped in brackets or braces to make it optional or repeated:</p>
<ul>
<li><code>[]</code> means "0 or 1"</li>
<li><code>{}</code> means "0 or more"</li>
<li><code>{}-</code> means "1 or more"</li>
</ul>
<p>Translated into English, this grammar states that a <code>BalancedParenthesis</code> is
a sequence of one or more <code>Nested</code>, where a <code>Nested</code> is
defined as a <code>(</code>, followed by zero or more <code>Nested</code> and then a <code>)</code>.</p>
<p>How can we verify that our grammar is correct? We can use rule substitution to
replace the rule names with their definition, until we reach the expected
string. For example, let's try to verify that the string <code>(()())</code> is
indeed a valid <code>BalancedParenthesis</code>:</p>
<pre><code>(()()) = BalancedParenthesis
       = { Nested }- <span class="hljs-comment">// 1 </span>
       = <span class="hljs-string">"("</span>, { Nested }, <span class="hljs-string">")"</span> <span class="hljs-comment">// 2</span>
       = <span class="hljs-string">"("</span>, <span class="hljs-string">"("</span>, { Nested }, <span class="hljs-string">")"</span>, <span class="hljs-string">"("</span>, { Nested }, <span class="hljs-string">")"</span>, <span class="hljs-string">")"</span>  <span class="hljs-comment">// 3</span>
       = <span class="hljs-string">"("</span>, <span class="hljs-string">"("</span>, <span class="hljs-string">")"</span>, <span class="hljs-string">"("</span>, <span class="hljs-string">")"</span>, <span class="hljs-string">")"</span> <span class="hljs-comment">// 4</span>
</code></pre><p>We start by replacing <code>BalancedParenthesis</code> with its definition (step 1). Since the input
consists of a single outer pair of parentheses, we expand the repetition into exactly one <code>Nested</code> and
substitute it with its definition (step 2). We then replace the inner <code>{ Nested }</code> with two successive
<code>Nested</code>, each substituted with its definition (step 3). Finally, we replace the two remaining
<code>{ Nested }</code> with zero repetitions (step 4) and reach the expected string.</p>
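<p>The same check can also be done mechanically. The sketch below is one possible recognizer for the balanced parentheses grammar, in which each rule becomes a function. The names <code>nested</code> and <code>is_balanced</code> are made up for this illustration and are not part of the <code>Pylite</code> compiler.</p>

```rust
use std::{iter::Peekable, str::Chars};

/// Nested = "(", { Nested }, ")";
fn nested(input: &mut Peekable<Chars<'_>>) -> bool {
    // "("
    if input.next() != Some('(') {
        return false;
    }
    // { Nested }: each further "(" starts an inner Nested
    while input.peek() == Some(&'(') {
        if !nested(input) {
            return false;
        }
    }
    // ")"
    input.next() == Some(')')
}

/// BalancedParenthesis = { Nested }-;
fn is_balanced(text: &str) -> bool {
    let mut input = text.chars().peekable();
    // one mandatory Nested...
    if !nested(&mut input) {
        return false;
    }
    // ...followed by zero or more additional ones
    while input.peek() == Some(&'(') {
        if !nested(&mut input) {
            return false;
        }
    }
    // the whole input must have been consumed
    input.next().is_none()
}
```

<p>Calling <code>is_balanced</code> on <code>()</code>, <code>(())</code> or <code>(()())</code> returns <code>true</code>, while <code>(()</code> and <code>())</code> are rejected. Note how the body of each function follows the right-hand side of its rule element by element.</p>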
<p>Reading and writing grammars can prove to be challenging at first, but
with a bit of practice, you'll find that they are a powerful tool to explore
and describe the syntax of programming languages.
Luckily for us, at this point, the grammar for our language is quite simple,
as it does not include repetitions or recursion like the grammar for balanced parentheses.</p>
<pre><code class="lang-plaintext">Program = FunDecl;

FunDecl = "def" Identifier "(" ")" ":" Block;

Block = Indent ReturnStatement Dedent;

BlockStatement = ReturnStatement;

ReturnStatement = "return" Expression;

Expression = Int;
</code></pre>
<p>You'll notice that some rules don't have a matching definition, such as <code>Identifier</code>, <code>Int</code>, <code>Indent</code>
and <code>Dedent</code>. This is because they are terminal rules, which means that they directly represent
tokens in the token stream produced by the lexer.</p>
<p>As an exercise, you can try to use successive substitutions starting from the <code>Program</code>
rule to verify that the following <code>Pylite</code> program is syntactically valid:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-keyword">return</span> <span class="hljs-number">42</span>
</code></pre>
<h3 id="heading-generating-unique-identifiers">Generating unique identifiers</h3>
<p>The data structures built by a compiler naturally tend to contain cycles. For example,
a recursive function definition includes the AST nodes representing the function's body, which 
in turn contain a reference to the function itself (after name-resolution is complete).
This is a problem for our compiler, as Rust's borrow checker makes it difficult to
create cyclic data structures. 
To solve this problem, we'll add a level of indirection: each node in the AST - and
in general, each entity within the compiler - will be identified by a unique identifier.
In the previous example, during name resolution, we'll simply register the association
between the function's unique identifier and the unique identifier of the recursive call
within the function's body, without introducing any cycle.</p>
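<p>To make the idea more concrete, here is a minimal sketch of such an association table. It is only an illustration under assumptions: <code>NodeId</code> stands in for the compiler's unique identifiers, and the <code>Resolutions</code> type with its <code>record</code> and <code>declaration_of</code> methods is hypothetical, not the structure the compiler will end up using.</p>

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the compiler's unique identifiers.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct NodeId(u32);

/// Hypothetical side table filled in during name resolution: it maps the id of
/// a name use (for example a call) to the id of the declaration it refers to.
#[derive(Debug, Default)]
struct Resolutions {
    targets: HashMap<NodeId, NodeId>,
}

impl Resolutions {
    /// Registers that `usage` refers to `declaration`.
    fn record(&mut self, usage: NodeId, declaration: NodeId) {
        self.targets.insert(usage, declaration);
    }

    /// Looks up the declaration that `usage` refers to, if it was resolved.
    fn declaration_of(&self, usage: NodeId) -> Option<NodeId> {
        self.targets.get(&usage).copied()
    }
}
```

<p>The AST itself stays a plain tree: the recursive call only stores its own identifier, and the link back to the enclosing function lives in the table, so no node ever holds a reference to one of its ancestors.</p>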
<p>Our unique identifiers will be backed by monotonically increasing unsigned integers.
To keep track of the last assigned identifier and generate new ones, we'll create a
<code>UniqueIdGenerator</code> struct. It will have a pointer to an <code>AtomicU32</code> representing
the next identifier to be assigned. Generating a new identifier is as simple as
incrementing the atomic counter and returning its previous value.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/id.rs</span>

<span class="hljs-keyword">use</span> std::{rc::Rc, sync::atomic::AtomicU32};

<span class="hljs-meta">#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">UniqueId</span></span>(<span class="hljs-built_in">u32</span>);

<span class="hljs-keyword">impl</span> <span class="hljs-built_in">From</span>&lt;<span class="hljs-built_in">u32</span>&gt; <span class="hljs-keyword">for</span> UniqueId {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">from</span></span>(value: <span class="hljs-built_in">u32</span>) -&gt; <span class="hljs-keyword">Self</span> {
        UniqueId(value)
    }
}

<span class="hljs-meta">#[derive(Debug, Clone, Default)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">UniqueIdGenerator</span></span> {
    next_id: Rc&lt;AtomicU32&gt;,
}

<span class="hljs-keyword">impl</span> UniqueIdGenerator {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">generate</span></span>(&amp;<span class="hljs-keyword">self</span>) -&gt; UniqueId {
        UniqueId(
            <span class="hljs-keyword">self</span>.next_id
                .fetch_add(<span class="hljs-number">1</span>, std::sync::atomic::Ordering::Relaxed),
        )
    }
}
</code></pre>
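<p>Because the counter sits behind an <code>Rc</code>, cloning the generator does not restart the numbering: every clone draws from the same counter. The sketch below repeats a condensed version of the generator so that it is self-contained. It returns raw <code>u32</code> values instead of <code>UniqueId</code> for brevity, and <code>first_ids</code> is a hypothetical helper that exists only for this demonstration.</p>

```rust
use std::{
    rc::Rc,
    sync::atomic::{AtomicU32, Ordering},
};

// Condensed copy of the generator, returning raw integers for brevity.
#[derive(Debug, Clone, Default)]
struct UniqueIdGenerator {
    next_id: Rc<AtomicU32>,
}

impl UniqueIdGenerator {
    fn generate(&self) -> u32 {
        // fetch_add returns the value held *before* the increment
        self.next_id.fetch_add(1, Ordering::Relaxed)
    }
}

/// Hypothetical helper: draws three ids from a generator and one of its clones.
fn first_ids() -> (u32, u32, u32) {
    let parser_ids = UniqueIdGenerator::default();
    // the clone shares the same underlying counter
    let resolver_ids = parser_ids.clone();
    (
        parser_ids.generate(),
        resolver_ids.generate(),
        parser_ids.generate(),
    )
}
```

<p>Here <code>first_ids()</code> returns <code>(0, 1, 2)</code>: the identifier handed out by the clone falls between the two handed out by the original, which is what lets different parts of the compiler hold their own copy of the generator without ever producing duplicates.</p>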
<h3 id="heading-parsing-the-token-stream">Parsing the token stream</h3>
<p>The next step is to transform the token stream into an Abstract Syntax Tree (AST).
We'll start by defining a set of structs and enums to represent the different
nodes in our AST.</p>
<blockquote>
<p>For some additional type safety, we leverage the <code>nonempty</code> crate.
It provides a <code>NonEmpty</code> type that mimics the behavior of <code>Vec</code>, but guarantees
that the vector always contains at least one element. We'll use this type
to represent function bodies, which always contain at least one statement.
Let's install it by running <code>cargo add nonempty</code> in the <code>compiler</code> folder.</p>
</blockquote>
<pre><code class="lang-rust">//! compiler/src/ast.rs

use std::{path::PathBuf, rc::Rc};

use nonempty::NonEmpty;

use crate::id::UniqueId;

// [...]

// Represents a Pylite module, which is an entire source file.
#[derive(Debug, Clone)]
pub struct Module {
    pub path: Rc&lt;PathBuf&gt;,
}

impl Module {
    pub fn new(path: PathBuf) -&gt; Self {
        Self {
            path: Rc::new(path),
        }
    }
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Statement {
    pub id: UniqueId,
    pub span: Span,
    pub kind: StatementKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum StatementKind {
    Decl(Decl),
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Decl {
    pub id: UniqueId,
    pub span: Span,
    pub kind: DeclKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum DeclKind {
    Function(FunDecl),
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct FunDecl {
    pub name: Identifier,
    pub body: NonEmpty&lt;BlockStatement&gt;,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Identifier {
    pub id: UniqueId,
    pub span: Span,
    pub name: String,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct BlockStatement {
    pub id: UniqueId,
    pub span: Span,
    pub kind: BlockStatementKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum BlockStatementKind {
    Return { value: Expr },
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Expr {
    pub span: Span,
    pub id: UniqueId,
    pub kind: ExprKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum ExprKind {
    IntLit { value: i64 },
}
</code></pre>
<p>Enum nodes are split into two parts: the <code>Kind</code> enum, which contains the different
variants of the node, and a wrapper struct that contains fields that are common to
all variants. This is a common pattern used in <a target="_blank" href="https://doc.rust-lang.org/beta/nightly-rustc/rustc_ast/ast/struct.Expr.html">rustc</a>,
among others.</p>
<p>With our AST nodes defined, we can start writing the parser.
We have many parsing algorithms to choose from to transform the token stream into an AST,
and the one we'll use is called recursive descent parsing. It has the dual advantage
of being relatively simple to implement and easy to extend to support extra features, such as error
recovery and reporting.</p>
<p>In recursive descent parsing, we define a function for each non-terminal rule in the grammar.
Every time a rule references a terminal we consume the corresponding token from the
token stream, and every time a rule references a non-terminal we call the corresponding
function. This process continues until we reach the end of the input or encounter an error.</p>
<p>Before we start implementing the parser, let's define an error type that will
represent all the error scenarios that can occur during parsing. For now,
we'll handle three types of errors:</p>
<ul>
<li><code>UnexpectedToken</code> is used when we encounter a token that does not match
what was expected</li>
<li><code>UnexpectedEof</code> is used when we reach the end of the input while still
expecting more tokens</li>
<li><code>Io</code> is used when we encounter an I/O error while reading the source code</li>
</ul>
<p>To create error types more easily, we'll use the <a target="_blank" href="https://docs.rs/thiserror/latest/thiserror/">thiserror</a> crate.
It provides a convenient macro to derive the <code>std::error::Error</code> trait for our error types.
It can be installed by running <code>cargo add thiserror</code> in the <code>compiler</code> folder.</p>
<pre><code class="lang-rust">//! compiler/src/parse/error.rs

use crate::ast::Span;

use super::token::Token;

#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("Unexpected token {0:?}")]
    UnexpectedToken(Token, Span),
    #[error("io error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Unexpected end of file")]
    UnexpectedEof,
}
</code></pre>
<p>We can now define our <code>Parser</code> struct, with a few utility methods to help us
implement our parsing rules.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/parser.rs</span>

<span class="hljs-keyword">use</span> std::iter::Peekable;

<span class="hljs-keyword">use</span> nonempty::NonEmpty;

<span class="hljs-keyword">use</span> crate::{ast::*, id::UniqueIdGenerator};

<span class="hljs-keyword">use</span> super::{error::Error, lexer::Lexer, token::Token};

<span class="hljs-meta">#[derive(Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Parser</span></span>&lt;<span class="hljs-symbol">'c</span>&gt; {
    lexer: Peekable&lt;Lexer&lt;<span class="hljs-symbol">'c</span>&gt;&gt;,
    id_gen: UniqueIdGenerator,
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'c</span>&gt; Parser&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(id_gen: UniqueIdGenerator, code: &amp;<span class="hljs-symbol">'c</span> <span class="hljs-built_in">str</span>) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> {
            id_gen,
            lexer: Lexer::new(code).peekable(),
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">expect_eq</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, expected: Token) -&gt; <span class="hljs-built_in">Result</span>&lt;Span, Error&gt; {
        <span class="hljs-keyword">let</span> (token, span) = <span class="hljs-keyword">self</span>.next_token()?;
        <span class="hljs-keyword">if</span> token == expected {
            <span class="hljs-literal">Ok</span>(span)
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-literal">Err</span>(Error::UnexpectedToken(token, span))
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_token</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;(Token, Span), Error&gt; {
        <span class="hljs-keyword">self</span>.lexer.next().ok_or(Error::UnexpectedEof)
    }
}
</code></pre>
<ul>
<li><code>expect_eq</code> compares the next token to the expected one and returns its span if they match, 
  or an <code>UnexpectedToken</code> error if they don't</li>
<li><code>next_token</code> returns the next token and its span, or an <code>UnexpectedEof</code> error if there 
are no more tokens</li>
</ul>
<h3 id="heading-expressions-and-identifiers">Expressions and identifiers</h3>
<p>Remember that our current grammar rule for expressions is:</p>
<pre><code class="lang-plaintext">Expression = Int;
</code></pre>
<p>The corresponding parsing method is a direct translation of the BNF rule: 
we consume the next token and check if it is an integer literal. If it is, we create
an <code>Expr</code> node where the <code>kind</code> field represents the integer literal. Otherwise, we
return an <code>UnexpectedToken</code> error.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/parser.rs</span>

<span class="hljs-comment">// [...]</span>
<span class="hljs-keyword">impl</span> &lt;<span class="hljs-symbol">'c</span>&gt; Parser&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-comment">// [...]</span>

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_expr</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;Expr, Error&gt; {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.next_token()? {
            (Token::Int(n), span) =&gt; <span class="hljs-literal">Ok</span>(Expr {
                span,
                id: <span class="hljs-keyword">self</span>.id_gen.generate(),
                kind: ExprKind::IntLit { value: n },
            }),
            (token, span) =&gt; <span class="hljs-literal">Err</span>(Error::UnexpectedToken(token, span)),
        }
    }
}
</code></pre>
<p>Identifiers follow the same pattern, as <code>Identifier</code> AST nodes wrap a single
<code>Identifier</code> token:</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/parser.rs</span>

<span class="hljs-keyword">impl</span> &lt;<span class="hljs-symbol">'c</span>&gt; Parser&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-comment">// [...]</span>

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_identifier</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;Identifier, Error&gt; {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.next_token()? {
            (Token::Identifier(name), span) =&gt; <span class="hljs-literal">Ok</span>(Identifier {
                id: <span class="hljs-keyword">self</span>.id_gen.generate(),
                span,
                name,
            }),
            (token, span) =&gt; <span class="hljs-literal">Err</span>(Error::UnexpectedToken(token, span)),
        }
    }
}
</code></pre>
<h3 id="heading-blocks-and-block-statements">Blocks and block statements</h3>
<p>Block statements are statements that appear within an indented block.
For now, the only block statement we support is the <code>return</code> statement.
The grammar rules for blocks and block statements are as follows:</p>
<pre><code class="lang-plaintext">Block = Indent ReturnStatement Dedent;
BlockStatement = ReturnStatement;
ReturnStatement = "return" Expression;
</code></pre>
<p>Translating these rules into parsing methods is quite straightforward.
Terminals are matched with the <code>expect_eq</code> utility, and non-terminals
are parsed by delegating to the corresponding parsing method.</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/parser.rs</span>

<span class="hljs-keyword">impl</span> &lt;<span class="hljs-symbol">'c</span>&gt; Parser&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_block</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;NonEmpty&lt;BlockStatement&gt;, Error&gt; {
        <span class="hljs-keyword">self</span>.expect_eq(Token::Indent)?;
        <span class="hljs-keyword">let</span> stmts = NonEmpty::new(<span class="hljs-keyword">self</span>.parse_block_stmt()?);
        <span class="hljs-keyword">self</span>.expect_eq(Token::Dedent)?;
        <span class="hljs-literal">Ok</span>(stmts)
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_block_stmt</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;BlockStatement, Error&gt; {
        <span class="hljs-keyword">self</span>.parse_return()
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_return</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;BlockStatement, Error&gt; {
        <span class="hljs-keyword">let</span> ret_span = <span class="hljs-keyword">self</span>.expect_eq(Token::Return)?;
        <span class="hljs-keyword">let</span> expr = <span class="hljs-keyword">self</span>.parse_expr()?;
        <span class="hljs-literal">Ok</span>(BlockStatement {
            id: <span class="hljs-keyword">self</span>.id_gen.generate(),
            <span class="hljs-comment">// merging the return token span with the expression span</span>
            <span class="hljs-comment">// produces a span that covers the entire return statement</span>
            span: ret_span.merge(expr.span),
            kind: BlockStatementKind::Return { value: expr },
        })
    }
}
</code></pre>
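<p>The <code>merge</code> method called in <code>parse_return</code> belongs to <code>Span</code>, whose definition is elided in this section. As a rough idea of what it could look like, here is a minimal sketch. It assumes that <code>Span</code> only holds the <code>start</code> and <code>end</code> offsets filled in by the lexer, and that merging two spans yields the smallest span covering both. The actual definition may differ.</p>

```rust
/// Assumed shape of `Span`: a pair of offsets into the source code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: u32,
    pub end: u32,
}

impl Span {
    /// Returns the smallest span that covers both `self` and `other`
    /// (an assumption about the behavior of `merge`).
    pub fn merge(self, other: Span) -> Span {
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}
```

<p>Under these assumptions, merging the span of a <code>return</code> keyword located at offsets 12 to 18 with the span of an expression located at offsets 19 to 21 produces a span going from 12 to 21, covering the whole statement.</p>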
<h3 id="heading-function-declarations">Function declarations</h3>
<p>Function declarations are a bit more complex than the rules we've seen so far,
but the translation from grammar to code follows the exact same pattern.
Here is the grammar rule for function declarations: </p>
<pre><code>FunDecl = <span class="hljs-string">"def"</span> Identifier <span class="hljs-string">"("</span> <span class="hljs-string">")"</span> <span class="hljs-string">":"</span> Block;
</code></pre><p>And the corresponding parsing method:</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/parser.rs</span>

<span class="hljs-keyword">impl</span> &lt;<span class="hljs-symbol">'c</span>&gt; Parser&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-comment">// [...]</span>

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_fun_decl</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;Decl, Error&gt; {
        <span class="hljs-keyword">let</span> span_start = <span class="hljs-keyword">self</span>.expect_eq(Token::Def)?;
        <span class="hljs-keyword">let</span> name = <span class="hljs-keyword">self</span>.parse_identifier()?;
        <span class="hljs-keyword">self</span>.expect_eq(Token::LPar)?;
        <span class="hljs-keyword">self</span>.expect_eq(Token::RPar)?;
        <span class="hljs-keyword">self</span>.expect_eq(Token::Colon)?;
        <span class="hljs-keyword">let</span> body = <span class="hljs-keyword">self</span>.parse_block()?;
        <span class="hljs-literal">Ok</span>(Decl {
            id: <span class="hljs-keyword">self</span>.id_gen.generate(),
            span: span_start.merge(body.last().span),
            kind: DeclKind::Function(FunDecl { name, body }),
        })
    }
}
</code></pre>
<p>With this addition, our parser is nearly complete. We'll define two last parsing methods:</p>
<ul>
<li><code>parse_decl</code> to parse a <code>Decl</code> node</li>
<li><code>parse_module</code> to parse the top level declarations in a module</li>
</ul>
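<p>Both methods are straightforward for now: <code>parse_decl</code> simply delegates to <code>parse_fun_decl</code>, and <code>parse_module</code> wraps the resulting declaration in a <code>Statement</code>.</p>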
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/parse/parser.rs</span>

<span class="hljs-keyword">impl</span> &lt;<span class="hljs-symbol">'c</span>&gt; Parser&lt;<span class="hljs-symbol">'c</span>&gt; {
    <span class="hljs-comment">// [...]</span>

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_module</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Vec</span>&lt;Statement&gt;, Error&gt; {
        <span class="hljs-keyword">let</span> decl = <span class="hljs-keyword">self</span>.parse_decl()?;
        <span class="hljs-literal">Ok</span>(<span class="hljs-built_in">vec!</span>[Statement {
            id: <span class="hljs-keyword">self</span>.id_gen.generate(),
            span: decl.span,
            kind: StatementKind::Decl(decl),
        }])
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_decl</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;Decl, Error&gt; {
        <span class="hljs-keyword">self</span>.parse_fun_decl()
    }
}
</code></pre>
<p>Let's edit our main function to test our parser:</p>
<pre><code class="lang-rust"><span class="hljs-comment">//! compiler/src/main.rs</span>

<span class="hljs-keyword">mod</span> ast;
<span class="hljs-keyword">mod</span> ctx;
<span class="hljs-keyword">mod</span> db;
<span class="hljs-keyword">mod</span> error;
<span class="hljs-keyword">mod</span> id;
<span class="hljs-keyword">mod</span> parse;

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() {
    <span class="hljs-keyword">let</span> input_file = std::env::args().nth(<span class="hljs-number">1</span>).expect(<span class="hljs-string">"missing input file"</span>);
    <span class="hljs-keyword">let</span> input_code = std::fs::read_to_string(input_file).expect(<span class="hljs-string">"failed to read source code"</span>);
    <span class="hljs-keyword">let</span> id_gen = id::UniqueIdGenerator::default();
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> parser = parse::parser::Parser::new(id_gen, &amp;input_code);
    <span class="hljs-keyword">let</span> stmts = parser.parse_module().expect(<span class="hljs-string">"failed to parse module"</span>);
    <span class="hljs-built_in">println!</span>(<span class="hljs-string">"{stmts:#?}"</span>);
}
</code></pre>
<p>Assuming we have saved our small <code>Pylite</code> program in the <code>res/samples/return_const/main.py</code>
file, we can test our parser by running the following command:</p>
<pre><code class="lang-bash">$ cargo run -- res/samples/return_const/main.py
</code></pre>
<p>It should print the following output, confirming that our parser works as expected:</p>
<pre><code class="lang-plaintext">[
    Statement {
        id: UniqueId(
            4,
        ),
        span: Span {
            start: 0,
            end: 24,
        },
        kind: Decl(
            Decl {
                id: UniqueId(
                    3,
                ),
                span: Span {
                    start: 0,
                    end: 24,
                },
                kind: Function(
                    FunDecl {
                        name: Identifier {
                            id: UniqueId(
                                0,
                            ),
                            span: Span {
                                start: 4,
                                end: 8,
                            },
                            name: "main",
                        },
                        body: NonEmpty {
                            head: BlockStatement {
                                id: UniqueId(
                                    2,
                                ),
                                span: Span {
                                    start: 16,
                                    end: 24,
                                },
                                kind: Return {
                                    value: Expr {
                                        span: Span {
                                            start: 23,
                                            end: 24,
                                        },
                                        id: UniqueId(
                                            1,
                                        ),
                                        kind: IntLit {
                                            value: 1,
                                        },
                                    },
                                },
                            },
                            tail: [],
                        },
                    },
                ),
            },
        ),
    },
]
</code></pre>
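<p>The spans in this output can be checked by hand. The following is a minimal sketch, assuming that <code>Span</code> is a half-open byte range and that <code>merge</code> returns the smallest span covering both operands. The <code>Span</code> type below is a hypothetical stand-in for the one defined earlier, and the keyword offsets (<code>0..3</code> for <code>def</code>, <code>16..22</code> for <code>return</code>) are inferred from the printed output rather than taken from the parser.</p>

```rust
// Hypothetical stand-in for the series' `Span` type: a half-open byte range.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

impl Span {
    // Smallest span that covers both `self` and `other`.
    fn merge(self, other: Span) -> Span {
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}

fn main() {
    // Assumed offsets of the `return` keyword and of the literal `1`.
    let return_kw = Span { start: 16, end: 22 };
    let int_lit = Span { start: 23, end: 24 };
    let return_stmt = return_kw.merge(int_lit);

    // Assumed offset of the `def` keyword, merged with the last body statement.
    let def_kw = Span { start: 0, end: 3 };
    let fun_decl = def_kw.merge(return_stmt);

    println!("{return_stmt:?}");
    println!("{fun_decl:?}");
}
```

<p>Under these assumptions, the return statement covers bytes 16 to 24 and the whole declaration covers bytes 0 to 24, which matches the spans printed by the parser.</p>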
<p>This concludes the first half of our compiler. In the next part, we'll build an
intermediate representation for our program, and start generating assembly code!</p>
]]></content:encoded></item><item><title><![CDATA[Build a Compiler from Scratch, Part 0: Introduction]]></title><description><![CDATA[Compilers are frustrating.
Two decades ago, during a boring afternoon in my teenage room, I set out to
discover how the software and games I spent so many hours with were made. I plunged 
into a Google Search rabbit hole that would yield a puzzling
a...]]></description><link>https://blog.sylver.dev/build-a-compiler-from-scratch-part-0-introduction</link><guid isPermaLink="true">https://blog.sylver.dev/build-a-compiler-from-scratch-part-0-introduction</guid><category><![CDATA[Rust]]></category><category><![CDATA[compiler]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Tue, 24 Jun 2025 23:31:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750973970045/7be059d9-f4bf-41d0-bd74-de6147b36c9b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Compilers are frustrating.</p>
<p>Two decades ago, during a boring afternoon in my teenage room, I set out to
discover how the software and games I spent so many hours with were made. I plunged 
into a Google Search rabbit hole that would yield a puzzling
answer: to build a piece of software, you need... a piece of software.
A compiler or an interpreter, to be precise.</p>
<p>It took me an embarrassingly long time to understand what seemed to me like a
paradox. But the journey of understanding how high-level code is transformed into
machine code and interacts with the OS was fascinating.</p>
<p>In this series, we'll go through this journey together, one tiny iteration at a time.
We'll build a program to compile a minimalistic language inspired by Python into
x86 assembly code. Along the way, we'll learn many things, including how to write
a lexer, a parser, a type checker, a garbage collector, and of course we'll discover
the basics of x86 assembly as we implement our code generator.
We'll also cover more advanced topics, such as code optimization,
intermediate representations, and rich error messages. </p>
<div class="hn-embed-widget" id="zenyth-support"></div><p> </p>
<h2 id="heading-the-mile-high-view">The mile-high view</h2>
<p>Before we dive into the implementation, let's take a moment to briefly discuss
the high-level architecture of our compiler.
In its simplest form, a compiler is a program that takes source code as input,
and translates it into another language, typically assembly code or
bytecode for a virtual machine. In our case, the source language will
be <code>Pylite</code>, a minimalistic language designed specifically for this series,
and the target language will be assembly code for the x86 family of processors.</p>
<p>One might think that this translation process is done in a single pass, emitting
assembly code as we read the input source code. And indeed, this is how many
early compilers worked. However, modern compilers are often implemented as
a pipeline of stages, with each stage taking the output of the previous stage
as input to transform or enrich it in some way. 
Our compilation pipeline will be quite typical, consisting of the following stages:</p>
<pre><code class="lang-text">
        ┌────────┐     ┌──────────┐     ┌─────────────┐
        │ PARSER │────▶│ SEMANTIC │────▶│     IR      │
        └────────┘     │ ANALYZER │     │  GENERATOR  │
                       └──────────┘     └─────────────┘
                                                 │
                                                 ▼
                       ┌─────────────┐     ┌─────────┐
                       │    CODE     │◀────│OPTIMIZER│
                       │  GENERATOR  │     └─────────┘
                       └─────────────┘
</code></pre>
<p>The parser's input is the raw source code, and the code generator produces the final
assembly code. Here is a quick breakdown of each stage:</p>
<ul>
<li>parser: this stage reads the source code and produces a parse tree, which is a
structured in-memory representation of the source code. </li>
<li>semantic analyzer: this stage extracts semantic information from the parse tree.
It performs type checking and name resolution, which is the process of
associating names with their matching declarations.</li>
<li>IR generator: for reasons that we'll explore later in the series, the parse
tree is not the ideal representation for low-level optimizations and code generation.
For this reason, it is transformed (or "lowered") into an intermediate representation
(IR) by the IR generator before the optimizer and code generator stages.</li>
<li>optimizer: this stage performs various optimizations on the IR to improve
the performance of the generated code. </li>
<li>code generator: this stage generates the final assembly code from the optimized IR.</li>
</ul>
<p>Most modern compilers use some variation of this pipeline, often with additional
stages with their own intermediate representations.
The first three stages, the parser, the semantic analyzer and the IR generator, are often
referred to as the front-end of the compiler, while the later stages are called the 
back-end. Generally speaking, the front-end is responsible for analyzing the source code
and producing an intermediate representation, while the back-end is responsible for
IR optimization and code generation.</p>
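<p>To make the data flow between stages concrete, here is a rough sketch of the pipeline as plain function composition. Every type and function name below is a hypothetical placeholder, not the design used in the rest of the series, and the "assembly" it emits is a toy string rather than real x86 code.</p>

```rust
// Hypothetical placeholder types: one per intermediate result in the pipeline.
struct ParseTree(String);
struct TypedTree(String);
struct Ir(Vec<String>);

// Each stage consumes the output of the previous one.
fn parse(source: &str) -> ParseTree {
    ParseTree(source.trim().to_string())
}

fn analyze(tree: ParseTree) -> TypedTree {
    // A real analyzer would resolve names and check types here.
    TypedTree(tree.0)
}

fn lower(tree: TypedTree) -> Ir {
    Ir(vec![format!("ret {}", tree.0)])
}

fn optimize(ir: Ir) -> Ir {
    // Identity "optimization": a real optimizer would rewrite the IR.
    ir
}

fn generate(ir: &Ir) -> String {
    ir.0.join("\n")
}

// Front-end: from source code to IR.
fn front_end(source: &str) -> Ir {
    lower(analyze(parse(source)))
}

// Back-end: from IR to the final output.
fn back_end(ir: Ir) -> String {
    generate(&optimize(ir))
}

fn compile(source: &str) -> String {
    back_end(front_end(source))
}

fn main() {
    println!("{}", compile("1"));
}
```

<p>Giving each stage its own input and output type means the compiler's type checker rejects a pipeline whose stages are wired in the wrong order.</p>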
<p>With this overview in mind, let's move on to the next part, where we'll build
a complete compiler for a tiny subset of the <code>Pylite</code> language!</p>
]]></content:encoded></item><item><title><![CDATA[Build your own SQLite, Part 5: Evaluating queries]]></title><description><![CDATA[In the previous posts, we've explored the SQLite file format and built a simple SQL parser. It's time to put these pieces together and implement a query evaluator! In this post, we'll lay the groundwork for evaluating SQL queries and build a query ev...]]></description><link>https://blog.sylver.dev/build-your-own-sqlite-part-5-evaluating-queries</link><guid isPermaLink="true">https://blog.sylver.dev/build-your-own-sqlite-part-5-evaluating-queries</guid><category><![CDATA[Rust]]></category><category><![CDATA[SQLite]]></category><category><![CDATA[database]]></category><category><![CDATA[from scratch]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Wed, 19 Feb 2025 22:25:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740034565623/1672932a-521c-4c45-9598-34409a3cb56d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous posts, we've explored the <a target="_blank" href="/build-your-own-sqlite-part-1-listing-tables">SQLite file format</a> and built a simple <a target="_blank" href="/build-your-own-sqlite-part-3-sql-parsing-101">SQL parser</a>. It's time to put these pieces together and implement a query evaluator! In this post, we'll lay the groundwork for evaluating SQL queries and build a query evaluator that can handle basic SELECT statements. While our initial implementation won't support filtering, sorting, grouping, or joins yet, it will give us the foundation to add these features in future posts.</p>
<p>As usual, the complete source code for this post is available on <a target="_blank" href="https://github.com/geoffreycopin/rqlite/commit/c7dfeeea6956e209ccbd50a727c2b9352c246082">GitHub</a>.</p>
<h2 id="heading-setting-up-our-test-database">Setting up our test database</h2>
<p>Before we can evaluate queries, we need a database to query. We'll start by creating a simple database with a single table, <code>table1</code>, with two columns, <code>id</code> and <code>value</code>:</p>
<pre><code class="lang-bash">sqlite3 queries_test.db
sqlite&gt; create table table1(id <span class="hljs-built_in">integer</span>, value text);
sqlite&gt; insert into table1(id, value) values
    ...&gt; (1, <span class="hljs-string">'11'</span>),
    ...&gt; (2, <span class="hljs-string">'12'</span>),
    ...&gt; (3, <span class="hljs-string">'13'</span>);
sqlite&gt; .<span class="hljs-built_in">exit</span>
</code></pre>
<p>⚠️ You might be tempted to use an existing SQLite database to test your queries, but keep in mind that our implementation does not support overflow pages yet, so it might not be able to read the data from your database file.</p>
<h2 id="heading-making-the-pager-shareable">Making the pager shareable</h2>
<hr />
<p>This section is specific to the Rust implementation. If you're following along with another language, you can safely skip it!</p>
<hr />
<p>Currently, our pager can only be used through an exclusive mutable reference. This was fine for our initial use cases, but as we start building more complex features, this restriction will constrain our design. We'll make the pager shareable by wrapping its mutable fields in shared, lock-protected containers: the input goes into an <code>Arc&lt;Mutex&lt;_&gt;&gt;</code> and the page cache into an <code>Arc&lt;RwLock&lt;_&gt;&gt;</code>. This will allow us to cheaply clone the pager and use it from multiple places without running into borrow checker issues. At this stage of the project, a simple <code>Rc&lt;RefCell&lt;_&gt;&gt;</code> would have been enough, but we'll eventually need to support concurrent access to the pager, so we'll use the thread-safe counterparts from the start.</p>
<pre><code class="lang-diff">// src/pager.rs

<span class="hljs-deletion">- #[derive(Debug, Clone)]</span>
<span class="hljs-addition">+ #[derive(Debug)]</span>
pub struct Pager&lt;I: Read + Seek = std::fs::File&gt; {
<span class="hljs-deletion">-   input: I,</span>
<span class="hljs-addition">+   input: Arc&lt;Mutex&lt;I&gt;&gt;,</span>
    page_size: usize,
<span class="hljs-deletion">-   pages: HashMap&lt;usize, page::Page&gt;,</span>
<span class="hljs-addition">+   pages: Arc&lt;RwLock&lt;HashMap&lt;usize, Arc&lt;page::Page&gt;&gt;&gt;&gt;,</span>
}
</code></pre>
<p>The <code>read_page</code> and <code>load_page</code> methods need to be updated accordingly:</p>
<pre><code class="lang-rust"><span class="hljs-keyword">impl</span>&lt;I: Read + Seek&gt; Pager&lt;I&gt; {
    <span class="hljs-comment">// [...] </span>
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_page</span></span>(&amp;<span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Arc&lt;page::Page&gt;&gt; {
        {
            <span class="hljs-keyword">let</span> read_pages = <span class="hljs-keyword">self</span>
                .pages
                .read()
                .map_err(|_| anyhow!(<span class="hljs-string">"failed to acquire pager read lock"</span>))?;

            <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(page) = read_pages.get(&amp;n) {
                <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(page.clone());
            }
        }

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> write_pages = <span class="hljs-keyword">self</span>
            .pages
            .write()
            .map_err(|_| anyhow!(<span class="hljs-string">"failed to acquire pager write lock"</span>))?;

        <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(page) = write_pages.get(&amp;n) {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(page.clone());
        }

        <span class="hljs-keyword">let</span> page = <span class="hljs-keyword">self</span>.load_page(n)?;
        write_pages.insert(n, page.clone());
        <span class="hljs-literal">Ok</span>(page)
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">load_page</span></span>(&amp;<span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Arc&lt;page::Page&gt;&gt; {
        <span class="hljs-keyword">let</span> offset = n.saturating_sub(<span class="hljs-number">1</span>) * <span class="hljs-keyword">self</span>.page_size;

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> input_guard = <span class="hljs-keyword">self</span>
            .input
            .lock()
            .map_err(|_| anyhow!(<span class="hljs-string">"failed to lock pager mutex"</span>))?;

        input_guard
            .seek(SeekFrom::Start(offset <span class="hljs-keyword">as</span> <span class="hljs-built_in">u64</span>))
            .context(<span class="hljs-string">"seek to page start"</span>)?;

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> buffer = <span class="hljs-built_in">vec!</span>[<span class="hljs-number">0</span>; <span class="hljs-keyword">self</span>.page_size];
        input_guard.read_exact(&amp;<span class="hljs-keyword">mut</span> buffer).context(<span class="hljs-string">"read page"</span>)?;

        <span class="hljs-literal">Ok</span>(Arc::new(parse_page(&amp;buffer, n)?))
    }
}
</code></pre>
<p>Two things to note regarding the <code>read_page</code> method:</p>
<ul>
<li><p>the initial attempt to read the page from the cache is nested in a block to limit the scope of the read lock, ensuring that it is released before we try to acquire the write lock</p>
</li>
<li><p>after acquiring the write lock, we check again if the page is already in the cache, in case it was inserted in between the two lock acquisitions</p>
</li>
</ul>
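<p>The two points above describe a double-checked lookup. The sketch below isolates that pattern in a small, self-contained program. The <code>Cache</code> type, its <code>loads</code> counter and the use of <code>String</code> in place of a parsed page are all assumptions made for illustration, and lock poisoning is handled with <code>unwrap</code> to keep the example short.</p>

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the pager's cache: `String` plays the role of a page.
struct Cache {
    pages: RwLock<HashMap<usize, Arc<String>>>,
    // Counts how many times the slow path actually loaded a page.
    loads: AtomicUsize,
}

impl Cache {
    fn new() -> Self {
        Cache {
            pages: RwLock::new(HashMap::new()),
            loads: AtomicUsize::new(0),
        }
    }

    fn get(&self, n: usize) -> Arc<String> {
        {
            // Fast path: many readers can hold the read lock at the same time.
            let pages = self.pages.read().unwrap();
            if let Some(page) = pages.get(&n) {
                return Arc::clone(page);
            }
        } // The read guard is dropped here, before we ask for the write lock.

        let mut pages = self.pages.write().unwrap();

        // Second check: another thread may have inserted the page in between.
        if let Some(page) = pages.get(&n) {
            return Arc::clone(page);
        }

        self.loads.fetch_add(1, Ordering::SeqCst);
        let page = Arc::new(format!("page {n}"));
        pages.insert(n, Arc::clone(&page));
        page
    }
}

fn main() {
    let cache = Arc::new(Cache::new());

    // Four threads request the same page concurrently.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let cache = Arc::clone(&cache);
            std::thread::spawn(move || cache.get(1))
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    println!("loads: {}", cache.loads.load(Ordering::SeqCst));
}
```

<p>Because the load happens while the write lock is held and only after the second check, this program prints <code>loads: 1</code> regardless of how the threads are scheduled. Without the second check, several threads could each load the same page.</p>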
<p>We'll also define an owned version of our <code>Value</code> enum for use in the query evaluator, so that rows can outlive the page they were read from:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/value.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">OwnedValue</span></span> {
    Null,
    <span class="hljs-built_in">String</span>(Rc&lt;<span class="hljs-built_in">String</span>&gt;),
    Blob(Rc&lt;<span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">u8</span>&gt;&gt;),
    Int(<span class="hljs-built_in">i64</span>),
    Float(<span class="hljs-built_in">f64</span>),
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'p</span>&gt; <span class="hljs-built_in">From</span>&lt;Value&lt;<span class="hljs-symbol">'p</span>&gt;&gt; <span class="hljs-keyword">for</span> OwnedValue {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">from</span></span>(value: Value&lt;<span class="hljs-symbol">'p</span>&gt;) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">match</span> value {
            Value::Null =&gt; Self::Null,
            Value::Int(i) =&gt; Self::Int(i),
            Value::Float(f) =&gt; Self::Float(f),
            Value::Blob(b) =&gt; Self::Blob(Rc::new(b.into_owned())),
            Value::<span class="hljs-built_in">String</span>(s) =&gt; Self::<span class="hljs-built_in">String</span>(Rc::new(s.into_owned())),
        }
    }
}

<span class="hljs-keyword">impl</span> std::fmt::Display <span class="hljs-keyword">for</span> OwnedValue {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">fmt</span></span>(&amp;<span class="hljs-keyword">self</span>, f: &amp;<span class="hljs-keyword">mut</span> std::fmt::Formatter&lt;<span class="hljs-symbol">'_</span>&gt;) -&gt; std::fmt::<span class="hljs-built_in">Result</span> {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span> {
            OwnedValue::Null =&gt; <span class="hljs-built_in">write!</span>(f, <span class="hljs-string">"null"</span>),
            OwnedValue::<span class="hljs-built_in">String</span>(s) =&gt; s.fmt(f),
            OwnedValue::Blob(items) =&gt; {
                <span class="hljs-built_in">write!</span>(
                    f,
                    <span class="hljs-string">"{}"</span>,
                    items
                        .iter()
                        .filter_map(|&amp;n| <span class="hljs-built_in">char</span>::from_u32(n <span class="hljs-keyword">as</span> <span class="hljs-built_in">u32</span>).filter(<span class="hljs-built_in">char</span>::is_ascii))
                        .collect::&lt;<span class="hljs-built_in">String</span>&gt;()
                )
            }
            OwnedValue::Int(i) =&gt; i.fmt(f),
            OwnedValue::Float(x) =&gt; x.fmt(f),
        }
    }
}
</code></pre>
<p>Finally, we'll enrich our <code>Cursor</code> struct with a method that returns the value of a field as an <code>OwnedValue</code>:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/cursor.rs</span>

<span class="hljs-keyword">impl</span> Cursor {
    <span class="hljs-comment">// [...] </span>
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">owned_field</span></span>(&amp;<span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;OwnedValue&gt; {
        <span class="hljs-keyword">self</span>.field(n).map(<span class="hljs-built_in">Into</span>::into)
    }
    <span class="hljs-comment">// [...]</span>
}
</code></pre>
<h2 id="heading-evaluating-select-statements">Evaluating <code>SELECT</code> statements</h2>
<p>Our query engine will be composed of two main components:</p>
<ul>
<li><p>an iterator-like <code>Operator</code> enum that represents nestable operations on the database, such as scanning a table or filtering rows. Our initial implementation will only contain a <code>SeqScan</code> operator that yields all rows from a table.</p>
</li>
<li><p>a <code>Planner</code> struct that takes a parsed SQL query and produces an <code>Operator</code> that can be evaluated to produce the query result.</p>
</li>
</ul>
<p>Let's start by defining the <code>Operator</code> enum:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/engine/operator.rs</span>
<span class="hljs-keyword">use</span> anyhow::Context;

<span class="hljs-keyword">use</span> crate::{cursor::Scanner, value::OwnedValue};

<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Operator</span></span> {
    SeqScan(SeqScan),
}

<span class="hljs-keyword">impl</span> Operator {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_row</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Option</span>&lt;&amp;[OwnedValue]&gt;&gt; {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span> {
            Operator::SeqScan(s) =&gt; s.next_row(),
        }
    }
}
</code></pre>
<p>The result of evaluating a query will be obtained by repeatedly calling the <code>next_row</code> method on the <code>Operator</code> until it returns <code>None</code>. Each value in the returned slice corresponds to a column in the query result.</p>
<p>The <code>SeqScan</code> struct will be responsible for scanning a table and yielding its rows:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/engine/operator.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">SeqScan</span></span> {
    fields: <span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">usize</span>&gt;,
    scanner: Scanner,
    row_buffer: <span class="hljs-built_in">Vec</span>&lt;OwnedValue&gt;,
}

<span class="hljs-keyword">impl</span> SeqScan {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(fields: <span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">usize</span>&gt;, scanner: Scanner) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">let</span> row_buffer = <span class="hljs-built_in">vec!</span>[OwnedValue::Null; fields.len()];

        <span class="hljs-keyword">Self</span> {
            fields,
            scanner,
            row_buffer,
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_row</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Option</span>&lt;&amp;[OwnedValue]&gt;&gt; {
        <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(record) = <span class="hljs-keyword">self</span>.scanner.next_record()? <span class="hljs-keyword">else</span> {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-literal">None</span>);
        };

        <span class="hljs-keyword">for</span> (i, &amp;n) <span class="hljs-keyword">in</span> <span class="hljs-keyword">self</span>.fields.iter().enumerate() {
            <span class="hljs-keyword">self</span>.row_buffer[i] = record.owned_field(n).context(<span class="hljs-string">"missing record field"</span>)?;
        }

        <span class="hljs-literal">Ok</span>(<span class="hljs-literal">Some</span>(&amp;<span class="hljs-keyword">self</span>.row_buffer))
    }
}
</code></pre>
<p>The <code>SeqScan</code> struct is initialized with a list of field indices to read from each record and a <code>Scanner</code> that will yield the records for every row in the table to be scanned. As the number of fields to read is identical for every row, we can preallocate a buffer to store the values of the selected fields. The <code>next_row</code> method retrieves the next record from the scanner, extracts the requested fields (specified by their indices), and stores them in our buffer.</p>
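<p>The buffer reuse described above can be tried in isolation. In this sketch, <code>Projection</code> is a hypothetical stand-in for <code>SeqScan</code>: records come from an in-memory <code>Vec</code> instead of a <code>Scanner</code>, values are plain <code>i64</code>s, and an out-of-range field index panics instead of returning an error.</p>

```rust
// Hypothetical stand-in for `SeqScan`, reading records from memory.
struct Projection {
    fields: Vec<usize>,
    records: std::vec::IntoIter<Vec<i64>>,
    row_buffer: Vec<i64>,
}

impl Projection {
    fn new(fields: Vec<usize>, records: Vec<Vec<i64>>) -> Self {
        // One slot per selected field, allocated once and reused for every row.
        let row_buffer = vec![0; fields.len()];
        Self {
            fields,
            records: records.into_iter(),
            row_buffer,
        }
    }

    // The returned slice borrows from `self`, so the caller has to be done
    // with the current row before asking for the next one.
    fn next_row(&mut self) -> Option<&[i64]> {
        let record = self.records.next()?;
        for (i, &n) in self.fields.iter().enumerate() {
            self.row_buffer[i] = record[n];
        }
        Some(&self.row_buffer)
    }
}

// Drives the operator to completion, copying each row out of the shared buffer.
fn collect_rows(mut op: Projection) -> Vec<Vec<i64>> {
    let mut out = Vec::new();
    while let Some(row) = op.next_row() {
        out.push(row.to_vec());
    }
    out
}

fn main() {
    let records = vec![vec![1, 11, 111], vec![2, 12, 112]];
    // Select the third field, then the first one.
    let op = Projection::new(vec![2, 0], records);
    println!("{:?}", collect_rows(op));
}
```

<p>This prints <code>[[111, 1], [112, 2]]</code>. Returning a borrowed slice avoids allocating a new row on every call, at the cost of forcing callers to copy any row they want to keep.</p>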
<p>Now that we have an <code>Operator</code> to evaluate <code>SELECT</code> statements, let's move on to the <code>Planner</code> struct that will produce the <code>Operator</code> from a parsed SQL query:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/engine/plan.rs</span>

<span class="hljs-keyword">use</span> anyhow::{bail, Context, <span class="hljs-literal">Ok</span>};

<span class="hljs-keyword">use</span> crate::{
    db::Db,
    sql::ast::{<span class="hljs-keyword">self</span>, SelectFrom},
};

<span class="hljs-keyword">use</span> super::operator::{Operator, SeqScan};

<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Planner</span></span>&lt;<span class="hljs-symbol">'d</span>&gt; {
    db: &amp;<span class="hljs-symbol">'d</span> Db,
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'d</span>&gt; Planner&lt;<span class="hljs-symbol">'d</span>&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(db: &amp;<span class="hljs-symbol">'d</span> Db) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> { db }
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">compile</span></span>(<span class="hljs-keyword">self</span>, statement: &amp;ast::Statement) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Operator&gt; {
        <span class="hljs-keyword">match</span> statement {
            ast::Statement::Select(s) =&gt; <span class="hljs-keyword">self</span>.compile_select(s),
            stmt =&gt; bail!(<span class="hljs-string">"unsupported statement: {stmt:?}"</span>),
        }
    }
}
</code></pre>
<p>The <code>Planner</code> struct is initialized with a reference to the database and provides a <code>compile</code> method that takes a parsed SQL statement and returns the corresponding <code>Operator</code>. The <code>compile</code> method dispatches to a specific method for each type of SQL statement.</p>
<p>Let's see how to build an <code>Operator</code> for a <code>SELECT</code> statement:</p>
<pre><code class="lang-rust">
<span class="hljs-comment">// src/engine/plan.rs</span>

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'d</span>&gt; Planner&lt;<span class="hljs-symbol">'d</span>&gt; {
    <span class="hljs-comment">// [...] </span>

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">compile_select</span></span>(<span class="hljs-keyword">self</span>, select: &amp;ast::SelectStatement) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Operator&gt; {
        <span class="hljs-keyword">let</span> SelectFrom::Table(table_name) = &amp;select.core.from;

        <span class="hljs-keyword">let</span> table = <span class="hljs-keyword">self</span>
            .db
            .tables_metadata
            .iter()
            .find(|m| &amp;m.name == table_name)
            .with_context(|| <span class="hljs-built_in">format!</span>(<span class="hljs-string">"invalid table name: {table_name}"</span>))?;

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> columns = <span class="hljs-built_in">Vec</span>::new();

        <span class="hljs-keyword">for</span> res_col <span class="hljs-keyword">in</span> &amp;select.core.result_columns {
            <span class="hljs-keyword">match</span> res_col {
                ast::ResultColumn::Star =&gt; {
                    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-number">0</span>..table.columns.len() {
                        columns.push(i);
                    }
                }
                ast::ResultColumn::Expr(e) =&gt; {
                    <span class="hljs-keyword">let</span> ast::Expr::Column(col) = &amp;e.expr;
                    <span class="hljs-keyword">let</span> (index, _) = table
                        .columns
                        .iter()
                        .enumerate()
                        .find(|(_, c)| c.name == col.name)
                        .with_context(|| <span class="hljs-built_in">format!</span>(<span class="hljs-string">"invalid column name: {}"</span>, col.name))?;
                    columns.push(index);
                }
            }
        }

        <span class="hljs-literal">Ok</span>(Operator::SeqScan(SeqScan::new(
            columns,
            <span class="hljs-keyword">self</span>.db.scanner(table.first_page),
        )))
    }
}
</code></pre>
<p>First, we find a table metadata entry that matches the table name in the <code>SELECT</code> statement. Then we iterate over the statement's result columns and build a list of field indices to read from each record, either by expanding <code>*</code> to all columns or by looking up the column name in the table metadata.</p>
<p>Finally, we create a <code>SeqScan</code> operator that will scan the entire table and yield the selected fields for each row.</p>
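<p>The column resolution step can be tried in isolation. The following is a minimal sketch, assuming a simplified <code>ResultColumn</code> enum and a plain slice of column names in place of the real AST and table metadata; <code>resolve_columns</code> is a hypothetical helper, not a function from the project.</p>

```rust
/// Simplified stand-in for the result columns of a parsed SELECT statement.
#[derive(Debug)]
enum ResultColumn {
    Star,
    Named(String),
}

/// Maps each result column to the index of the matching table column.
/// Hypothetical helper: the real planner works on `ast` and `TableMetadata` types.
fn resolve_columns(
    table_columns: &[&str],
    result_columns: &[ResultColumn],
) -> Result<Vec<usize>, String> {
    let mut indices = Vec::new();

    for col in result_columns {
        match col {
            // `*` expands to every column, in declaration order.
            ResultColumn::Star => indices.extend(0..table_columns.len()),
            // A named column is resolved to its position in the table definition.
            ResultColumn::Named(name) => {
                let index = table_columns
                    .iter()
                    .position(|c| *c == name.as_str())
                    .ok_or_else(|| format!("invalid column name: {name}"))?;
                indices.push(index);
            }
        }
    }

    Ok(indices)
}
```

<p>For a table with columns <code>id</code>, <code>name</code> and <code>age</code>, selecting <code>age, id</code> resolves to the indices 2 and 0, while <code>*</code> resolves to 0, 1 and 2. An unknown column name produces an error instead of an index.</p>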
<h2 id="heading-query-evaluation-in-the-repl">Query evaluation in the REPL</h2>
<p>It's time to put our query evaluator to the test! We'll create a simple function that reads a raw SQL query and evaluates it:</p>
<pre><code class="lang-rust">
<span class="hljs-comment">// src/main.rs</span>

<span class="hljs-comment">// [...]</span>

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">eval_query</span></span>(db: &amp;db::Db, query: &amp;<span class="hljs-built_in">str</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-keyword">let</span> parsed_query = sql::parse_statement(query, <span class="hljs-literal">false</span>)?;
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> op = engine::plan::Planner::new(db).compile(&amp;parsed_query)?;

    <span class="hljs-keyword">while</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(values) = op.next_row()? {
        <span class="hljs-keyword">let</span> formatted = values
            .iter()
            .map(<span class="hljs-built_in">ToString</span>::to_string)
            .collect::&lt;<span class="hljs-built_in">Vec</span>&lt;_&gt;&gt;()
            .join(<span class="hljs-string">"|"</span>);

        <span class="hljs-built_in">println!</span>(<span class="hljs-string">"{formatted}"</span>);
    }

    <span class="hljs-literal">Ok</span>(())
}
</code></pre>
<p>This function creates a pipeline: it parses the SQL query, builds an <code>Operator</code> with our <code>Planner</code>, and then repeatedly calls <code>next_row()</code> on the resulting operator to retrieve and display each row of the result.</p>
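<p>The pull-based loop at the heart of this function can be illustrated without a database file. This is a sketch under stated assumptions: <code>VecScan</code> is a toy operator backed by an in-memory vector, and <code>format_rows</code> is a hypothetical helper that collects the formatted lines instead of printing them.</p>

```rust
/// A toy operator that yields rows one at a time, mimicking the `next_row` contract.
/// It stands in for the real `Operator`, which reads rows from database pages.
struct VecScan {
    rows: std::vec::IntoIter<Vec<i64>>,
}

impl VecScan {
    fn new(rows: Vec<Vec<i64>>) -> Self {
        Self { rows: rows.into_iter() }
    }

    /// Returns `Ok(Some(row))` until the input is exhausted, then `Ok(None)`.
    fn next_row(&mut self) -> Result<Option<Vec<i64>>, String> {
        Ok(self.rows.next())
    }
}

/// Drains an operator and formats each row as `|`-separated values.
fn format_rows(op: &mut VecScan) -> Result<Vec<String>, String> {
    let mut lines = Vec::new();

    while let Some(values) = op.next_row()? {
        let line = values
            .iter()
            .map(ToString::to_string)
            .collect::<Vec<_>>()
            .join("|");
        lines.push(line);
    }

    Ok(lines)
}
```

<p>The caller only ever asks for the next row, so the same loop keeps working if the operator is later replaced by one that filters or sorts its input.</p>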
<p>The final step is to use this function in the REPL loop:</p>
<pre><code class="lang-diff">// src/main.rs

// [...]

 fn cli(mut db: db::Db) -&gt; anyhow::Result&lt;()&gt; {
     print_flushed("rqlite&gt; ")?;

     let mut line_buffer = String::new();

     while stdin().lock().read_line(&amp;mut line_buffer).is_ok() {
         match line_buffer.trim() {
             ".exit" =&gt; break,
             ".tables" =&gt; display_tables(&amp;mut db)?,
<span class="hljs-addition">+            stmt =&gt; eval_query(&amp;db, stmt)?, </span>
<span class="hljs-deletion">-            stmt =&gt; match sql::parse_statement(stmt, true) {</span>
<span class="hljs-deletion">-                Ok(stmt) =&gt; {</span>
<span class="hljs-deletion">-                    println!("{:?}", stmt);</span>
<span class="hljs-deletion">-                }</span>
<span class="hljs-deletion">-                Err(e) =&gt; {</span>
<span class="hljs-deletion">-                    println!("Error: {}", e);</span>
<span class="hljs-deletion">-                }</span>
<span class="hljs-deletion">-            },</span>
         }

         print_flushed("\nrqlite&gt; ")?;

         line_buffer.clear();
     }

     Ok(())
 }
</code></pre>
<p>Now we can run the REPL and evaluate some simple <code>SELECT</code> statements:</p>
<pre><code class="lang-bash">cargo run -- queries_test.db
rqlite&gt; select * from table1;
</code></pre>
<p>If everything went well, you should see the following output:</p>
<pre><code class="lang-bash">1|11
2|12
3|13
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Our small database engine is starting to take shape! We can now parse and evaluate simple <code>SELECT</code> queries. But there's still a lot to cover before we can call it a fully functional database engine. In the next posts, we'll discover how to filter rows, read indexes, and implement sorting and grouping.</p>
]]></content:encoded></item><item><title><![CDATA[Build your own SQLite, Part 4: reading tables metadata]]></title><description><![CDATA[As we saw in the opening post, SQLite stores metadata about tables in a special "schema table" starting on page 1. We've been reading records from this table to list the tables in the current database, but before we can start evaluating SQL queries a...]]></description><link>https://blog.sylver.dev/build-your-own-sqlite-part-4-reading-tables-metadata</link><guid isPermaLink="true">https://blog.sylver.dev/build-your-own-sqlite-part-4-reading-tables-metadata</guid><category><![CDATA[Rust]]></category><category><![CDATA[Databases]]></category><category><![CDATA[from scratch]]></category><category><![CDATA[SQLite]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Mon, 03 Feb 2025 21:42:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738618916098/fd348757-41bf-483d-9c65-8b4a345d4c2a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As we saw in the <a target="_blank" href="/build-your-own-sqlite-part-1-listing-tables">opening post</a>, SQLite stores metadata about tables in a special "schema table" starting on page 1. We've been reading records from this table to list the tables in the current database, but before we can start evaluating SQL queries against user-defined tables, we need to extract more information from the schema table.</p>
<p>For each table, we need to know:</p>
<ul>
<li><p>the table name</p>
</li>
<li><p>the root page</p>
</li>
<li><p>the name and type of each column</p>
</li>
</ul>
<p>The first two are very easy to extract, as they are directly stored in fields 1 and 3 of the schema table's records. But column names and types will be a bit trickier, as they are not neatly separated into record fields, but are stored in a single field in the form of a <code>CREATE TABLE</code> statement that we'll need to parse.</p>
<p>The complete source code is available on <a target="_blank" href="https://github.com/geoffreycopin/rqlite/tree/4e098ca03b814448eb1a2650d64cda12227e9300">GitHub</a>.</p>
<h2 id="heading-parsing-create-table-statements">Parsing <code>CREATE TABLE</code> statements</h2>
<p>The first step in extending our SQL parser to support <code>CREATE TABLE</code> statements is to add the necessary token types to the tokenizer. We'll support <code>CREATE TABLE</code> statements of the following form:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> table_name
(
    column1_name column1_type,
    column2_name column2_type, 
    ...
)
</code></pre>
<p>The following tokens are new and need to be added to the <code>Token</code> enum: <code>CREATE</code>, <code>TABLE</code>, <code>(</code>, <code>)</code>.</p>
<pre><code class="lang-diff">// sql/tokenizer.rs

#[derive(Debug, Eq, PartialEq)]
pub enum Token {
<span class="hljs-addition">+   Create,</span>
<span class="hljs-addition">+   Table,</span>
    Select,
    As,
    From,
<span class="hljs-addition">+   LPar,</span>
<span class="hljs-addition">+   RPar,</span>
    Star,
    Comma,
    SemiColon,
    Identifier(String),
}

//[...]

pub fn tokenize(input: &amp;str) -&gt; anyhow::Result&lt;Vec&lt;Token&gt;&gt; {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
<span class="hljs-addition">+           '(' =&gt; tokens.push(Token::LPar),</span>
<span class="hljs-addition">+           ')' =&gt; tokens.push(Token::RPar),</span>
            '*' =&gt; tokens.push(Token::Star),
            ',' =&gt; tokens.push(Token::Comma),
            ';' =&gt; tokens.push(Token::SemiColon),
            c if c.is_whitespace() =&gt; continue,
            c if c.is_alphabetic() =&gt; {
                let mut ident = c.to_string().to_lowercase();
                while let Some(cc) = chars.next_if(|&amp;cc| cc.is_alphanumeric() || cc == '_') {
                    ident.extend(cc.to_lowercase());
                }

                match ident.as_str() {
<span class="hljs-addition">+                   "create" =&gt; tokens.push(Token::Create),</span>
<span class="hljs-addition">+                   "table" =&gt; tokens.push(Token::Table),</span>
                    "select" =&gt; tokens.push(Token::Select),
                    "as" =&gt; tokens.push(Token::As),
                    "from" =&gt; tokens.push(Token::From),
                    _ =&gt; tokens.push(Token::Identifier(ident)),
                }
            }
            _ =&gt; bail!("unexpected character: {}", c),
        }
    }

    Ok(tokens)
}
</code></pre>
<p>Next, we need to extend our AST to represent the new statement type. Our representation will be based on the <a target="_blank" href="https://www.sqlite.org/lang_createtable.html">SQLite documentation</a>.</p>
<pre><code class="lang-diff">// sql/ast.rs

//[...]

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum Statement {
    Select(SelectStatement),
<span class="hljs-addition">+   CreateTable(CreateTableStatement),</span>
}
<span class="hljs-addition">+</span>
<span class="hljs-addition">+#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-addition">+pub struct CreateTableStatement {</span>
<span class="hljs-addition">+    pub name: String,</span>
<span class="hljs-addition">+    pub columns: Vec&lt;ColumnDef&gt;,</span>
<span class="hljs-addition">+}</span>
<span class="hljs-addition">+</span>
<span class="hljs-addition">+#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-addition">+pub struct ColumnDef {</span>
<span class="hljs-addition">+    pub name: String,</span>
<span class="hljs-addition">+    pub col_type: Type,</span>
<span class="hljs-addition">+}</span>
<span class="hljs-addition">+</span>
<span class="hljs-addition">+#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-addition">+pub enum Type {</span>
<span class="hljs-addition">+    Integer,</span>
<span class="hljs-addition">+    Real,</span>
<span class="hljs-addition">+    Text,</span>
<span class="hljs-addition">+    Blob,</span>
<span class="hljs-addition">+}</span>

//[...]
</code></pre>
<p>Parsing types is straightforward: we can simply match the incoming identifier token with a predefined set of types. For now, we'll restrict ourselves to <code>INTEGER</code>, <code>REAL</code>, <code>TEXT</code>, <code>STRING</code>, and <code>BLOB</code>. Once our <code>parse_type</code> method is implemented, constructing <code>ColumnDef</code> nodes is trivial.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-comment">//[...]</span>
<span class="hljs-keyword">impl</span> ParserState {
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_column_def</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;ColumnDef&gt; {
        <span class="hljs-literal">Ok</span>(ColumnDef {
            name: <span class="hljs-keyword">self</span>.expect_identifier()?.to_string(),
            col_type: <span class="hljs-keyword">self</span>.parse_type()?,
        })
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_type</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Type&gt; {
        <span class="hljs-keyword">let</span> type_name = <span class="hljs-keyword">self</span>.expect_identifier()?;
        <span class="hljs-keyword">let</span> t = <span class="hljs-keyword">match</span> type_name.to_lowercase().as_str() {
            <span class="hljs-string">"integer"</span> =&gt; Type::Integer,
            <span class="hljs-string">"real"</span> =&gt; Type::Real,
            <span class="hljs-string">"blob"</span> =&gt; Type::Blob,
            <span class="hljs-string">"text"</span> | <span class="hljs-string">"string"</span> =&gt; Type::Text,
            _ =&gt; bail!(<span class="hljs-string">"unsupported type: {type_name}"</span>),
        };
        <span class="hljs-literal">Ok</span>(t)
    }
    <span class="hljs-comment">// [...]</span>
}

<span class="hljs-comment">//[...]</span>
</code></pre>
<p>In our implementation of the <code>parse_create_table</code> method, we'll parse column definitions using the same pattern as in the <code>parse_result_colums</code> method:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-comment">//[...]</span>
<span class="hljs-keyword">impl</span> ParserState {
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_create_table</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;CreateTableStatement&gt; {
        <span class="hljs-keyword">self</span>.expect_eq(Token::Create)?;
        <span class="hljs-keyword">self</span>.expect_eq(Token::Table)?;
        <span class="hljs-keyword">let</span> name = <span class="hljs-keyword">self</span>.expect_identifier()?.to_string();
        <span class="hljs-keyword">self</span>.expect_eq(Token::LPar)?;
        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> columns = <span class="hljs-built_in">vec!</span>[<span class="hljs-keyword">self</span>.parse_column_def()?];
        <span class="hljs-keyword">while</span> <span class="hljs-keyword">self</span>.next_token_is(Token::Comma) {
            <span class="hljs-keyword">self</span>.advance();
            columns.push(<span class="hljs-keyword">self</span>.parse_column_def()?);
        }
        <span class="hljs-keyword">self</span>.expect_eq(Token::RPar)?;
        <span class="hljs-literal">Ok</span>(CreateTableStatement { name, columns })
    }
    <span class="hljs-comment">// [...]</span>
}
<span class="hljs-comment">//[...]</span>
</code></pre>
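<p>The "first element, then repeat while the next token is a comma" pattern used for the column list can be exercised on its own. Below is a minimal sketch, assuming a reduced <code>Token</code> enum and a hypothetical <code>Parser</code> struct with a <code>parse_column_list</code> method that returns plain <code>(name, type)</code> string pairs rather than <code>ColumnDef</code> nodes.</p>

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum Token {
    LPar,
    RPar,
    Comma,
    Identifier(String),
}

/// Hypothetical, reduced parser state: a token list and a cursor into it.
struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    fn next_token_is(&self, expected: &Token) -> bool {
        self.tokens.get(self.pos) == Some(expected)
    }

    fn expect_eq(&mut self, expected: Token) -> Result<(), String> {
        if self.next_token_is(&expected) {
            self.pos += 1;
            Ok(())
        } else {
            Err(format!("expected {expected:?} at position {}", self.pos))
        }
    }

    fn expect_identifier(&mut self) -> Result<String, String> {
        match self.tokens.get(self.pos) {
            Some(Token::Identifier(name)) => {
                let name = name.clone();
                self.pos += 1;
                Ok(name)
            }
            other => Err(format!("expected identifier, found {other:?}")),
        }
    }

    /// Parses `( name type , name type , ... )` into (name, type) pairs.
    fn parse_column_list(&mut self) -> Result<Vec<(String, String)>, String> {
        self.expect_eq(Token::LPar)?;
        // The list must contain at least one column definition.
        let mut columns = vec![(self.expect_identifier()?, self.expect_identifier()?)];
        while self.next_token_is(&Token::Comma) {
            self.pos += 1; // consume the comma
            columns.push((self.expect_identifier()?, self.expect_identifier()?));
        }
        self.expect_eq(Token::RPar)?;
        Ok(columns)
    }
}
```

<p>Parsing the first element outside the loop means an empty list <code>()</code> and a trailing comma are both rejected, since each comma must be followed by another definition.</p>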
<p>Finally, we need to update the <code>parse_statement</code> method to handle the new statement type. We'll also update the <code>parse_statement</code> utility function to make the semicolon terminator optional, as the <code>CREATE TABLE</code> statements stored in the schema table lack a trailing semicolon.</p>
<pre><code class="lang-diff">// sql/parser.rs

//[...]

impl ParserState {
    // [...]

    fn parse_statement(&amp;mut self) -&gt; anyhow::Result&lt;Statement&gt; {
<span class="hljs-deletion">-       Ok(ast::Statement::Select(self.parse_select()?))</span>
<span class="hljs-addition">+       match self.peak_next_token().context("unexpected end of input")? {</span>
<span class="hljs-addition">+           Token::Select =&gt; self.parse_select().map(Statement::Select),</span>
<span class="hljs-addition">+           Token::Create =&gt; self.parse_create_table().map(Statement::CreateTable),</span>
<span class="hljs-addition">+           token =&gt; bail!("unexpected token: {token:?}"),</span>
<span class="hljs-addition">+       }</span>
    }    

    // [...]
}

// [...]

<span class="hljs-deletion">-pub fn parse_statement(input: &amp;str) -&gt; anyhow::Result&lt;Statement&gt; {</span>
<span class="hljs-addition">+pub fn parse_statement(input: &amp;str, trailing_semicolon: bool) -&gt; anyhow::Result&lt;Statement&gt; {</span>
    let tokens = tokenizer::tokenize(input)?;
    let mut state = ParserState::new(tokens);
    let statement = state.parse_statement()?;
<span class="hljs-addition">+   if trailing_semicolon {</span>
        state.expect_eq(Token::SemiColon)?;
<span class="hljs-addition">+   }</span>
    Ok(statement)
}

<span class="hljs-addition">+pub fn parse_create_statement(</span>
<span class="hljs-addition">+    input: &amp;str,</span>
<span class="hljs-addition">+) -&gt; anyhow::Result&lt;CreateTableStatement&gt; {</span>
<span class="hljs-addition">+    match parse_statement(input, false)? {</span>
<span class="hljs-addition">+        Statement::CreateTable(c) =&gt; Ok(c),</span>
<span class="hljs-addition">+        Statement::Select(_) =&gt; bail!("expected a create statement"),</span>
<span class="hljs-addition">+    }</span>
<span class="hljs-addition">+}</span>
</code></pre>
<h2 id="heading-reading-metadata">Reading metadata</h2>
<p>Now that we have the necessary building blocks to read table metadata, we can extend our <code>Db</code> struct to store this information. The <code>TableMetadata::from_cursor</code> method builds a <code>TableMetadata</code> struct from a <code>Cursor</code> object, which represents a record in the schema table. The create statement and first page are extracted from fields 4 and 3, respectively.</p>
<p>As records from the schema table contain information about other kinds of objects, such as triggers, we check the <code>type</code> field at index 0 to ensure we're dealing with a table.</p>
<p>Finally, in <code>Db::collect_tables_metadata</code>, we iterate over all the records in the schema table, collecting table metadata for each table record we encounter.</p>
<pre><code class="lang-diff">// db.rs

<span class="hljs-addition">+#[derive(Debug, Clone)]</span>
<span class="hljs-addition">+pub struct TableMetadata {</span>
<span class="hljs-addition">+    pub name: String,</span>
<span class="hljs-addition">+    pub columns: Vec&lt;ast::ColumnDef&gt;,</span>
<span class="hljs-addition">+    pub first_page: usize,</span>
<span class="hljs-addition">+}</span>

<span class="hljs-addition">+impl TableMetadata {</span>
<span class="hljs-addition">+   fn from_cursor(cursor: Cursor) -&gt; anyhow::Result&lt;Option&lt;Self&gt;&gt; {</span>
<span class="hljs-addition">+       let type_value = cursor</span>
<span class="hljs-addition">+           .field(0)</span>
<span class="hljs-addition">+           .context("missing type field")</span>
<span class="hljs-addition">+           .context("invalid type field")?;</span>

<span class="hljs-addition">+       if type_value.as_str() != Some("table") {</span>
<span class="hljs-addition">+           return Ok(None);</span>
<span class="hljs-addition">+       }</span>

<span class="hljs-addition">+       let create_stmt = cursor</span>
<span class="hljs-addition">+           .field(4)</span>
<span class="hljs-addition">+           .context("missing create statement")</span>
<span class="hljs-addition">+           .context("invalid create statement")?</span>
<span class="hljs-addition">+           .as_str()</span>
<span class="hljs-addition">+           .context("table create statement should be a string")?</span>
<span class="hljs-addition">+           .to_owned();</span>

<span class="hljs-addition">+       let create = sql::parse_create_statement(&amp;create_stmt)?;</span>

<span class="hljs-addition">+       let first_page = cursor</span>
<span class="hljs-addition">+           .field(3)</span>
<span class="hljs-addition">+           .context("missing table first page")?</span>
<span class="hljs-addition">+           .as_int()</span>
<span class="hljs-addition">+           .context("table first page should be an integer")? as usize;</span>

<span class="hljs-addition">+       Ok(Some(TableMetadata {</span>
<span class="hljs-addition">+           name: create.name,</span>
<span class="hljs-addition">+           columns: create.columns,</span>
<span class="hljs-addition">+           first_page,</span>
<span class="hljs-addition">+       }))</span>
<span class="hljs-addition">+    }</span>
<span class="hljs-addition">+}</span>

pub struct Db {
    pub header: DbHeader,
<span class="hljs-addition">+   pub tables_metadata: Vec&lt;TableMetadata&gt;,</span>
    pager: Pager,
}

impl Db {
    pub fn from_file(filename: impl AsRef&lt;Path&gt;) -&gt; anyhow::Result&lt;Db&gt; {
        let mut file = std::fs::File::open(filename.as_ref()).context("open db file")?;

        let mut header_buffer = [0; pager::HEADER_SIZE];
        file.read_exact(&amp;mut header_buffer)
            .context("read db header")?;

        let header = pager::parse_header(&amp;header_buffer).context("parse db header")?;

<span class="hljs-addition">+       let tables_metadata = Self::collect_tables_metadata(&amp;mut Pager::new(</span>
<span class="hljs-addition">+           file.try_clone()?,</span>
<span class="hljs-addition">+           header.page_size as usize,</span>
<span class="hljs-addition">+       ))?;</span>

        let pager = Pager::new(file, header.page_size as usize);

        Ok(Db {
            header,
            pager,
<span class="hljs-addition">+           tables_metadata,</span>
        })
    }

<span class="hljs-addition">+   fn collect_tables_metadata(pager: &amp;mut Pager) -&gt; anyhow::Result&lt;Vec&lt;TableMetadata&gt;&gt; {</span>
<span class="hljs-addition">+       let mut metadata = Vec::new();</span>
<span class="hljs-addition">+       let mut scanner = Scanner::new(pager, 1);</span>

<span class="hljs-addition">+       while let Some(record) = scanner.next_record()? {</span>
<span class="hljs-addition">+           if let Some(m) = TableMetadata::from_cursor(record)? {</span>
<span class="hljs-addition">+               metadata.push(m);</span>
<span class="hljs-addition">+           }</span>
<span class="hljs-addition">+       }</span>

<span class="hljs-addition">+       Ok(metadata)</span>
<span class="hljs-addition">+   }</span>

    // [...]
}
</code></pre>
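<p>The filtering logic is easier to see on plain data. The sketch below assumes a hypothetical <code>SchemaRow</code> struct holding only the three schema table fields used above (type at index 0, root page at index 3, SQL text at index 4); it is an illustration, not the project's <code>Cursor</code> API.</p>

```rust
/// One schema table record, reduced to the fields discussed above.
/// The struct and its field names are illustrative, not taken from the project.
struct SchemaRow {
    kind: String,     // field 0: object type, e.g. "table" or "index"
    root_page: usize, // field 3: page where the object's b-tree starts
    sql: String,      // field 4: the statement that created the object
}

/// Keeps only the records describing tables and returns their root page and SQL text.
fn table_entries(rows: &[SchemaRow]) -> Vec<(usize, &str)> {
    rows.iter()
        .filter(|row| row.kind == "table")
        .map(|row| (row.root_page, row.sql.as_str()))
        .collect()
}
```

<p>Records of any other type are skipped silently, which mirrors the <code>Ok(None)</code> branch of <code>from_cursor</code>: a non-table record is not an error, it is just not interesting yet.</p>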
<p>Our initial implementation of the <code>.tables</code> command can be updated to use the new metadata:</p>
<pre><code class="lang-diff">// main.rs

fn display_tables(db: &amp;mut db::Db) -&gt; anyhow::Result&lt;()&gt; {
<span class="hljs-deletion">-   let mut scanner = db.scanner(1);</span>
<span class="hljs-deletion">-</span>
<span class="hljs-deletion">-   while let Some(mut record) = scanner.next_record()? {</span>
<span class="hljs-deletion">-       let type_value = record</span>
<span class="hljs-deletion">-           .field(0)</span>
<span class="hljs-deletion">-           .context("missing type field")</span>
<span class="hljs-deletion">-           .context("invalid type field")?;</span>

<span class="hljs-deletion">-       if type_value.as_str() == Some("table") {</span>
<span class="hljs-deletion">-           let name_value = record</span>
<span class="hljs-deletion">-               .field(1)</span>
<span class="hljs-deletion">-               .context("missing name field")</span>
<span class="hljs-deletion">-               .context("invalid name field")?;</span>

<span class="hljs-deletion">-           print!("{} ", name_value.as_str().unwrap());</span>
<span class="hljs-deletion">-       }</span>
<span class="hljs-deletion">-   }</span>
<span class="hljs-addition">+   for table in &amp;db.tables_metadata {</span>
<span class="hljs-addition">+       print!("{} ", &amp;table.name)</span>
<span class="hljs-addition">+   }</span>

    Ok(())
}
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We've extended our SQL parser to support <code>CREATE TABLE</code> statements and used it to extract metadata from the schema table. By parsing the schema, we now have a way to understand the structure of tables in our database.</p>
<p>In the next post, we'll leverage this metadata to build a query evaluator that can execute simple <code>SELECT</code> queries against user-defined tables, bringing us one step closer to a fully functional database engine.</p>
]]></content:encoded></item><item><title><![CDATA[Build your own SQLite, Part 3: SQL parsing 101]]></title><description><![CDATA[After discovering the SQLite file format and implementing the .tables command in part 1 and part 2 of this series, we're ready to tackle the next big challenge: writing our own SQL parser from scratch.
As the SQL dialect supported by SQLite is quite ...]]></description><link>https://blog.sylver.dev/build-your-own-sqlite-part-3-sql-parsing-101</link><guid isPermaLink="true">https://blog.sylver.dev/build-your-own-sqlite-part-3-sql-parsing-101</guid><category><![CDATA[Rust]]></category><category><![CDATA[Databases]]></category><category><![CDATA[from scratch]]></category><category><![CDATA[parsing]]></category><category><![CDATA[SQL]]></category><category><![CDATA[SQLite]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Mon, 18 Nov 2024 21:01:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731963390995/1428b4c0-d677-498c-8493-4323b75be327.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>After discovering the SQLite file format and implementing the <code>.tables</code> command in <a target="_blank" href="/build-your-own-sqlite-part-1-listing-tables">part 1</a> and <a target="_blank" href="/build-your-own-sqlite-part-2-scanning-large-tables">part 2</a> of this series, we're ready to tackle the next big challenge: writing our own SQL parser from scratch.</p>
<p>As the SQL dialect supported by SQLite is quite large and complex, we'll initially limit ourselves to a subset that comprises only the <code>select</code> statement, in a stripped-down form. Only expressions of the form <code>select &lt;columns&gt; from &lt;table&gt;</code> will be supported, where <code>&lt;columns&gt;</code> is either <code>*</code> or a comma-separated list of column names (with an optional <code>as</code> alias), and <code>&lt;table&gt;</code> is the name of a table.</p>
<p>The full SQL syntax, as implemented in SQLite, is described in great detail in the <a target="_blank" href="https://www.sqlite.org/lang.html">SQL As Understood By SQLite</a> document.</p>
<h2 id="heading-parsing-basics">Parsing Basics</h2>
<p>Our SQL parser will follow a conventional two-step process: lexical analysis (or tokenization) and syntax analysis (or parsing).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731963279441/984893c1-9e60-4183-94ca-d74709bd3580.png" alt class="image--center mx-auto" /></p>
<p>The lexical analysis step takes the input SQL string and groups individual characters into tokens, which are meaningful units of the language. For example, the character sequence S-E-L-E-C-T will be grouped into a single token of type <code>select</code>, and the sequence <code>*</code> will be grouped into a token of type <code>star</code>. This stage is also responsible for discarding whitespace and normalizing the case of the input.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731963299161/8724dc74-5b37-462b-8676-068a185f9dd1.png" alt class="image--center mx-auto" /></p>
<p>The syntax analysis step takes the stream of tokens produced by the lexical analysis, and tries to match them against the syntax rules of the language. Its output is an abstract syntax tree (AST), which is a hierarchical representation of the input SQL.</p>
<h2 id="heading-writing-the-tokenizer">Writing the tokenizer</h2>
<p>The first step in writing our tokenizer is to define a <code>Token</code> type that will represent the individual tokens of our SQL dialect. This definition will live in a new module: <code>sql::tokenizer</code>.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/tokenizer.rs</span>
<span class="hljs-meta">#[derive(Debug, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Token</span></span> {
    Select,
    As,
    <span class="hljs-built_in">From</span>,
    Star,
    Comma,
    SemiColon,
    Identifier(<span class="hljs-built_in">String</span>),
}

<span class="hljs-keyword">impl</span> Token {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">as_identifier</span></span>(&amp;<span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;&amp;<span class="hljs-built_in">str</span>&gt; {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span> {
            Token::Identifier(ident) =&gt; <span class="hljs-literal">Some</span>(ident),
            _ =&gt; <span class="hljs-literal">None</span>,
        }
    }
}
</code></pre>
<p>We also define a utility function <code>as_identifier</code> that will return the string value of a token if it is an <code>Identifier</code>, and <code>None</code> otherwise.</p>
<p>The logic of the <code>tokenize</code> function is quite simple: we iterate over the input string's characters, and based on the current character we decide which token to emit:</p>
<ul>
<li><p>if the character matches a single-character token, we emit it immediately</p>
</li>
<li><p>if the character is whitespace, it is discarded</p>
</li>
<li><p>finally, if the character is a letter, we start a new identifier token and keep accumulating characters until we reach a character that is not a valid identifier character. At this point, if the accumulated string is a keyword, we emit the corresponding token, otherwise, we emit a raw <code>Identifier</code> token.</p>
</li>
</ul>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/tokenizer.rs</span>
<span class="hljs-keyword">use</span> anyhow::bail;

<span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">tokenize</span></span>(input: &amp;<span class="hljs-built_in">str</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Vec</span>&lt;Token&gt;&gt; {
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> tokens = <span class="hljs-built_in">Vec</span>::new();
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> chars = input.chars().peekable();

    <span class="hljs-keyword">while</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(c) = chars.next() {
        <span class="hljs-keyword">match</span> c {
            <span class="hljs-string">'*'</span> =&gt; tokens.push(Token::Star),
            <span class="hljs-string">','</span> =&gt; tokens.push(Token::Comma),
            <span class="hljs-string">';'</span> =&gt; tokens.push(Token::SemiColon),
            c <span class="hljs-keyword">if</span> c.is_whitespace() =&gt; <span class="hljs-keyword">continue</span>,
            c <span class="hljs-keyword">if</span> c.is_alphabetic() =&gt; {
                <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> ident = c.to_string().to_lowercase();
                <span class="hljs-keyword">while</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(cc) = chars.next_if(|&amp;cc| cc.is_alphanumeric() || cc == <span class="hljs-string">'_'</span>) {
                    ident.extend(cc.to_lowercase());
                }

                <span class="hljs-keyword">match</span> ident.as_str() {
                    <span class="hljs-string">"select"</span> =&gt; tokens.push(Token::Select),
                    <span class="hljs-string">"as"</span> =&gt; tokens.push(Token::As),
                    <span class="hljs-string">"from"</span> =&gt; tokens.push(Token::<span class="hljs-built_in">From</span>),
                    _ =&gt; tokens.push(Token::Identifier(ident)),
                }
            }
            _ =&gt; <span class="hljs-keyword">return</span> <span class="hljs-literal">Err</span>(anyhow::anyhow!(<span class="hljs-string">"unexpected character: {}"</span>, c)),
        }
    }

    <span class="hljs-literal">Ok</span>(tokens)
}
</code></pre>
<p>Since SQL is case-insensitive, all identifiers are normalized to lower case.</p>
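<p>The identifier branch of the tokenizer relies on <code>Peekable::next_if</code>, which consumes a character only when it satisfies a predicate. The sketch below isolates that step in a standalone function; <code>read_identifier</code> is a hypothetical helper introduced for illustration and is not part of the article's code.</p>

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical helper: reads the rest of an identifier whose first character
/// `first` has already been consumed, lower-casing it along the way.
fn read_identifier(first: char, chars: &mut Peekable<Chars<'_>>) -> String {
    let mut ident: String = first.to_lowercase().collect();
    // `next_if` only consumes the next character when the predicate holds,
    // so the character that ends the identifier stays in the iterator.
    while let Some(c) = chars.next_if(|&c| c.is_alphanumeric() || c == '_') {
        ident.extend(c.to_lowercase());
    }
    ident
}
```

<p>Because the terminating character is left in the iterator, the main loop can handle it on its next iteration, whether it is whitespace, a comma, or a semicolon.</p>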
<h2 id="heading-representing-sql-statements">Representing SQL statements</h2>
<p>Before we dive into the implementation of the parser, we need to decide how to represent SQL statements in our code. We'll settle on a conventional representation, based on the description of the SQL syntax in the SQLite documentation, and write the corresponding Rust types in a new module <code>sql::ast</code>.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/ast.rs</span>

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Statement</span></span> {
    Select(SelectStatement),
}

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">SelectStatement</span></span> {
    <span class="hljs-keyword">pub</span> core: SelectCore,
}

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">SelectCore</span></span> {
    <span class="hljs-keyword">pub</span> result_columns: <span class="hljs-built_in">Vec</span>&lt;ResultColumn&gt;,
    <span class="hljs-keyword">pub</span> from: SelectFrom,
}

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">ResultColumn</span></span> {
    Star,
    Expr(ExprResultColumn),
}

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">ExprResultColumn</span></span> {
    <span class="hljs-keyword">pub</span> expr: Expr,
    <span class="hljs-keyword">pub</span> alias: <span class="hljs-built_in">Option</span>&lt;<span class="hljs-built_in">String</span>&gt;,
}

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Expr</span></span> {
    Column(Column),
}

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Column</span></span> {
    <span class="hljs-keyword">pub</span> name: <span class="hljs-built_in">String</span>,
}

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">SelectFrom</span></span> {
    Table(<span class="hljs-built_in">String</span>),
}
</code></pre>
<p>The following query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> col1 <span class="hljs-keyword">as</span> <span class="hljs-keyword">first</span>, col2
<span class="hljs-keyword">from</span> <span class="hljs-keyword">table</span>
</code></pre>
<p>This query will be parsed into the following Rust structure:</p>
<pre><code class="lang-rust">Statement::Select(SelectStatement {
    core: SelectCore {
        result_columns: <span class="hljs-built_in">vec!</span>[
            ResultColumn::Expr(ExprResultColumn {
                expr: Expr::Column(Column {
                     name: <span class="hljs-string">"col1"</span>.to_string()
                }),
                alias: <span class="hljs-literal">Some</span>(<span class="hljs-string">"first"</span>.to_string())
            }),
            ResultColumn::Expr(ExprResultColumn {
                expr: Expr::Column(Column {
                    name: <span class="hljs-string">"col2"</span>.to_string()
                }),
               alias: <span class="hljs-literal">None</span>
            }),
       ],
       from: SelectFrom::Table(<span class="hljs-string">"table"</span>.to_string()),
    },
})
</code></pre>
<p>You may notice a few redundancies in this representation, such as the <code>Expr</code> enum that comprises a single variant. This is intentional, as it will allow us to add new syntactic constructs in future episodes without breaking too much of the existing code.</p>
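<p>To illustrate how this representation might be consumed, here is a minimal sketch of a function that computes the display name of a result column. The types are trimmed-down copies of the ones above so that the snippet compiles on its own, and <code>output_name</code> is a hypothetical helper, not something the series defines.</p>

```rust
// Trimmed-down copies of the AST types, so the sketch compiles on its own.
#[derive(Debug, Clone, Eq, PartialEq)]
pub enum ResultColumn {
    Star,
    Expr(ExprResultColumn),
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct ExprResultColumn {
    pub expr: Expr,
    pub alias: Option<String>,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum Expr {
    Column(Column),
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Column {
    pub name: String,
}

/// Hypothetical helper: the name under which a result column would be displayed.
/// Returns `None` for `*`, which expands to a table-dependent list of columns.
pub fn output_name(col: &ResultColumn) -> Option<&str> {
    match col {
        ResultColumn::Star => None,
        ResultColumn::Expr(ExprResultColumn { expr, alias }) => match (alias, expr) {
            // An explicit alias takes precedence over the column name.
            (Some(alias), _) => Some(alias.as_str()),
            (None, Expr::Column(column)) => Some(column.name.as_str()),
        },
    }
}
```

<p>Since the inner <code>match</code> is exhaustive, adding a new <code>Expr</code> variant later makes the compiler point at every function like this one that needs updating.</p>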
<h2 id="heading-writing-the-parser">Writing the parser</h2>
<p>Parsing algorithms come in all shapes and sizes, and a full discussion of the topic is beyond the scope of this article. The one we'll use here is called recursive descent and is reasonably simple to understand and implement:</p>
<ul>
<li><p>for every node type, we'll define a function that tries to build the node from the current input tokens, and fails if it is not possible. For example, we'll define a method that builds a <code>Column</code> node by consuming an <code>Identifier</code> token, and fails if the current token is not an <code>Identifier</code> token.</p>
</li>
<li><p>complex "nested" nodes are built by delegating the parsing of their child nodes to other functions. For example, <code>ExprResultColumn</code> is built by parsing an <code>Expr</code> node and an optional <code>as</code> token followed by an <code>Identifier</code> token.</p>
</li>
</ul>
<p>In a fully-fledged parser, these functions can be mutually recursive.</p>
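<p>As an illustration of mutual recursion, here is a minimal sketch for a hypothetical grammar with parenthesized expressions, which the parser in this article does not support. The <code>Tok</code>, <code>Node</code> and <code>Parser</code> types are invented for this example only.</p>

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Ident(String),
    LParen,
    RParen,
}

#[derive(Debug, PartialEq)]
enum Node {
    Column(String),
    Paren(Box<Node>),
}

struct Parser {
    tokens: Vec<Tok>,
    pos: usize,
}

impl Parser {
    // expr := primary
    fn parse_expr(&mut self) -> Result<Node, String> {
        self.parse_primary()
    }

    // primary := Ident | '(' expr ')'
    fn parse_primary(&mut self) -> Result<Node, String> {
        match self.tokens.get(self.pos).cloned() {
            Some(Tok::Ident(name)) => {
                self.pos += 1;
                Ok(Node::Column(name))
            }
            Some(Tok::LParen) => {
                self.pos += 1;
                // Calling back into `parse_expr` is what makes the two
                // functions mutually recursive.
                let inner = self.parse_expr()?;
                match self.tokens.get(self.pos) {
                    Some(Tok::RParen) => {
                        self.pos += 1;
                        Ok(Node::Paren(Box::new(inner)))
                    }
                    other => Err(format!("expected ')', found {:?}", other)),
                }
            }
            other => Err(format!("unexpected token: {:?}", other)),
        }
    }
}
```

<p>Each level of nesting in the input corresponds to one extra pair of calls on the stack, which is why the shape of the resulting tree mirrors the shape of the call graph.</p>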
<p>First, let's define a <code>ParserState</code> struct that will hold the state of the parser: the list of tokens, and the current position in the list.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-keyword">use</span> anyhow::{bail, Context};

<span class="hljs-keyword">use</span> crate::sql::{
    ast::{
        Column, Expr, ExprResultColumn, ResultColumn, SelectCore, SelectFrom, SelectStatement,
        Statement,
    },
    tokenizer::{<span class="hljs-keyword">self</span>, Token},
};

<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">ParserState</span></span> {
    tokens: <span class="hljs-built_in">Vec</span>&lt;Token&gt;,
    pos: <span class="hljs-built_in">usize</span>,
}

<span class="hljs-keyword">impl</span> ParserState {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(tokens: <span class="hljs-built_in">Vec</span>&lt;Token&gt;) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> { tokens, pos: <span class="hljs-number">0</span> }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_token_is</span></span>(&amp;<span class="hljs-keyword">self</span>, expected: Token) -&gt; <span class="hljs-built_in">bool</span> {
        <span class="hljs-keyword">self</span>.tokens.get(<span class="hljs-keyword">self</span>.pos) == <span class="hljs-literal">Some</span>(&amp;expected)
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">expect_identifier</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;&amp;<span class="hljs-built_in">str</span>&gt; {
        <span class="hljs-keyword">self</span>.expect_matching(|t| matches!(t, Token::Identifier(_)))
            .map(|t| t.as_identifier().unwrap())
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">expect_eq</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, expected: Token) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;&amp;Token&gt; {
        <span class="hljs-keyword">self</span>.expect_matching(|t| *t == expected)
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">expect_matching</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, f: <span class="hljs-keyword">impl</span> <span class="hljs-built_in">Fn</span>(&amp;Token) -&gt; <span class="hljs-built_in">bool</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;&amp;Token&gt; {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.next_token() {
            <span class="hljs-literal">Some</span>(token) <span class="hljs-keyword">if</span> f(token) =&gt; <span class="hljs-literal">Ok</span>(token),
            <span class="hljs-literal">Some</span>(token) =&gt; bail!(<span class="hljs-string">"unexpected token: {:?}"</span>, token),
            <span class="hljs-literal">None</span> =&gt; bail!(<span class="hljs-string">"unexpected end of input"</span>),
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">peak_next_token</span></span>(&amp;<span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;&amp;Token&gt; {
        <span class="hljs-keyword">self</span>.tokens.get(<span class="hljs-keyword">self</span>.pos).context(<span class="hljs-string">"unexpected end of input"</span>)
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_token</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;&amp;Token&gt; {
        <span class="hljs-keyword">let</span> token = <span class="hljs-keyword">self</span>.tokens.get(<span class="hljs-keyword">self</span>.pos);
        <span class="hljs-keyword">if</span> token.is_some() {
            <span class="hljs-keyword">self</span>.advance();
        }
        token
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">advance</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) {
        <span class="hljs-keyword">self</span>.pos += <span class="hljs-number">1</span>;
    }
}
</code></pre>
<ul>
<li><p><code>next_token_is</code> checks if the current token is equal to the expected token, without consuming it</p>
</li>
<li><p><code>expect_identifier</code> returns the content of the current token if it is an <code>Identifier</code>, and fails otherwise</p>
</li>
<li><p><code>expect_eq</code> checks if the current token is equal to the expected token, and fails otherwise</p>
</li>
<li><p><code>peak_next_token</code> allows us to look at the next token without consuming it, and fails if there are no more tokens</p>
</li>
<li><p><code>next_token</code> returns the current token and advances the parser's position</p>
</li>
<li><p><code>advance</code> increments the parser's position</p>
</li>
</ul>
<p>Armed with these primitives, we can write our simplest parser function: <code>parse_expr</code>! As the only expressions that we support for now are identifiers, the parsing function only has to check that the current token is an <code>Identifier</code> token and build an <code>Expr</code> node from its value.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-keyword">impl</span> ParserState {
    <span class="hljs-comment">//...</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_expr</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Expr&gt; {
        <span class="hljs-literal">Ok</span>(Expr::Column(Column {
            name: <span class="hljs-keyword">self</span>.expect_identifier()?.to_string(),
        }))
    }
    <span class="hljs-comment">//...</span>
}
</code></pre>
<p>A bit more involved, the <code>parse_expr_result_column</code> function parses terms of the form <code>columnName</code> or <code>columnName as alias</code>. It starts by parsing the initial <code>Expr</code> node (<code>columnName</code>, in our examples), then if the next token is <code>as</code>, it consumes it and parses the <code>Identifier</code> token that follows.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-keyword">impl</span> ParserState {
    <span class="hljs-comment">//...</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_expr_result_column</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;ExprResultColumn&gt; {
        <span class="hljs-keyword">let</span> expr = <span class="hljs-keyword">self</span>.parse_expr()?;
        <span class="hljs-keyword">let</span> alias = <span class="hljs-keyword">if</span> <span class="hljs-keyword">self</span>.next_token_is(Token::As) {
            <span class="hljs-keyword">self</span>.advance();
            <span class="hljs-literal">Some</span>(<span class="hljs-keyword">self</span>.expect_identifier()?.to_string())
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-literal">None</span>
        };
        <span class="hljs-literal">Ok</span>(ExprResultColumn { expr, alias })
    }
    <span class="hljs-comment">//...</span>
}
</code></pre>
<p><code>ResultColumn</code> can represent terms of the form described above, or <code>*</code> to represent all columns of a table. The <code>parse_result_column</code> function checks if the current token is <code>*</code>, and returns a <code>Star</code> node if it is. Otherwise, it delegates the parsing of the <code>ExprResultColumn</code> node to the <code>parse_expr_result_column</code> function.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-keyword">impl</span> ParserState {
    <span class="hljs-comment">//...</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_result_column</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;ResultColumn&gt; {
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">self</span>.peak_next_token()? == &amp;Token::Star {
            <span class="hljs-keyword">self</span>.advance();
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(ResultColumn::Star);
        }

        <span class="hljs-literal">Ok</span>(ResultColumn::Expr(<span class="hljs-keyword">self</span>.parse_expr_result_column()?))
    }
    <span class="hljs-comment">//...</span>
}
</code></pre>
<p>Another interesting example is the <code>parse_result_columns</code> function, which parses a list of columns separated by commas. It starts by parsing the first column, then keeps parsing further columns as long as the token following a result column is a comma, accumulating the parsed columns in a vector.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-keyword">impl</span> ParserState {
    <span class="hljs-comment">//...</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_result_columns</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Vec</span>&lt;ResultColumn&gt;&gt; {
        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> result_columns = <span class="hljs-built_in">vec!</span>[<span class="hljs-keyword">self</span>.parse_result_column()?];
        <span class="hljs-keyword">while</span> <span class="hljs-keyword">self</span>.next_token_is(Token::Comma) {
            <span class="hljs-keyword">self</span>.advance();
            result_columns.push(<span class="hljs-keyword">self</span>.parse_result_column()?);
        }
        <span class="hljs-literal">Ok</span>(result_columns)
    }
    <span class="hljs-comment">//...</span>
}
</code></pre>
<p>As you are probably getting the hang of it, implementing the remaining parsing functions can be a fun exercise. In any case, here is my implementation for reference:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-keyword">impl</span> ParserState {
    <span class="hljs-comment">//...</span>
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_statement</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Statement&gt; {
        <span class="hljs-literal">Ok</span>(Statement::Select(<span class="hljs-keyword">self</span>.parse_select()?))
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_select</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;SelectStatement&gt; {
        <span class="hljs-keyword">self</span>.expect_eq(Token::Select)?;
        <span class="hljs-keyword">let</span> result_columns = <span class="hljs-keyword">self</span>.parse_result_columns()?;
        <span class="hljs-keyword">self</span>.expect_eq(Token::<span class="hljs-built_in">From</span>)?;
        <span class="hljs-keyword">let</span> from = <span class="hljs-keyword">self</span>.parse_select_from()?;
        <span class="hljs-literal">Ok</span>(SelectStatement {
            core: SelectCore {
                result_columns,
                from,
            },
        })
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_select_from</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;SelectFrom&gt; {
        <span class="hljs-keyword">let</span> table = <span class="hljs-keyword">self</span>.expect_identifier()?;
        <span class="hljs-literal">Ok</span>(SelectFrom::Table(table.to_string()))
    }
    <span class="hljs-comment">//...</span>
}
</code></pre>
<p>The final piece of the puzzle is a function that ties everything together, taking an input SQL string, tokenizing it, and parsing it into an AST:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// sql/parser.rs</span>

<span class="hljs-comment">//...</span>

<span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_statement</span></span>(input: &amp;<span class="hljs-built_in">str</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Statement&gt; {
    <span class="hljs-keyword">let</span> tokens = tokenizer::tokenize(input)?;
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> state = ParserState::new(tokens);
    <span class="hljs-keyword">let</span> statement = state.parse_statement()?;
    state.expect_eq(Token::SemiColon)?;
    <span class="hljs-literal">Ok</span>(statement)
}
</code></pre>
<h2 id="heading-putting-it-all-together">Putting it all together</h2>
<p>We've covered a lot of ground! Now is the time to test our parser on some actual SQL queries. To that end, let's alter our REPL loop to parse the input as an SQL statement if it does not match a known command, and print the result.</p>
<pre><code class="lang-diff">// src/main.rs

<span class="hljs-addition">+ mod sql;</span>

//...

fn cli(mut db: db::Db) -&gt; anyhow::Result&lt;()&gt; {
    print_flushed("rqlite&gt; ")?;

    let mut line_buffer = String::new();

    while stdin().lock().read_line(&amp;mut line_buffer).is_ok() {
        match line_buffer.trim() {
            ".exit" =&gt; break,
            ".tables" =&gt; display_tables(&amp;mut db)?,
<span class="hljs-addition">+            stmt =&gt; match sql::parse_statement(stmt) {</span>
<span class="hljs-addition">+                Ok(stmt) =&gt; {</span>
<span class="hljs-addition">+                    println!("{:?}", stmt);</span>
<span class="hljs-addition">+                }</span>
<span class="hljs-addition">+                Err(e) =&gt; {</span>
<span class="hljs-addition">+                    println!("Error: {}", e);</span>
<span class="hljs-addition">+                }</span>
<span class="hljs-addition">+            },</span>
<span class="hljs-deletion">-            _ =&gt; {</span>
<span class="hljs-deletion">-               println!("Unrecognized command '{}'", line_buffer.trim());</span>
<span class="hljs-deletion">-           }</span>
        }

        print_flushed("\nrqlite&gt; ")?;

        line_buffer.clear();
    }

    Ok(())
}

//...
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Our database can read data and parse very simple SQL statements. In the next part of this series, we'll bridge the gap between these two functionalities and build a small query engine that compiles SQL queries into execution plans and executes these plans against the persisted data.</p>
]]></content:encoded></item><item><title><![CDATA[Build your own SQLite, Part 2: Scanning large tables]]></title><description><![CDATA[In the previous post, we discovered the SQLite file format and implemented a toy version of the .tables command, allowing us to display the list of tables in a database. But our implementation has a jarring limitation: it assumes that all the data fi...]]></description><link>https://blog.sylver.dev/build-your-own-sqlite-part-2-scanning-large-tables</link><guid isPermaLink="true">https://blog.sylver.dev/build-your-own-sqlite-part-2-scanning-large-tables</guid><category><![CDATA[Rust]]></category><category><![CDATA[SQLite]]></category><category><![CDATA[from scratch]]></category><category><![CDATA[database]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Sat, 24 Aug 2024 14:59:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724511449024/377e2714-c3b1-4bef-846c-1a127f6a59c9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous post, we discovered the SQLite file format and implemented a toy version of the <code>.tables</code> command, allowing us to display the list of tables in a database. But our implementation has a jarring limitation: it assumes that all the data fits into the first page of the file. In this post, we'll discover how SQLite represents tables that are too large to fit into a single page. This will not only make our <code>.tables</code> command more useful, but also lay the groundwork for our query engine.</p>
<h2 id="heading-a-motivating-example">A motivating example</h2>
<p>Let's begin our journey with a much larger test database:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..1000}; <span class="hljs-keyword">do</span>            
    sqlite3 res/test.db <span class="hljs-string">"create table table<span class="hljs-variable">$i</span>(id integer)"</span>
<span class="hljs-keyword">done</span>

cargo run --release -- res/test.db
rqlite&gt; .tables
</code></pre>
<p>Unsurprisingly, our small program isn't able to display the list of tables. The reason is quite simple: database pages are typically 4096 bytes long, which is far from enough to store the definitions of 1000 tables. But why did our code fail, instead of displaying the first records that fit into the first page?</p>
<h2 id="heading-b-tree-interior-pages">B-tree interior pages</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724511435150/0d0e2be9-de89-4f6a-ab67-6524f9691d65.png" alt class="image--center mx-auto" /></p>
<p>When a table is too large to fit into a single page, SQLite splits it into multiple pages, of different types:</p>
<ul>
<li><p>leaf pages, which contain the actual records</p>
</li>
<li><p>interior pages, which store information about which child page contains the records for which range of keys.</p>
</li>
</ul>
<p>Interior pages have the same high-level structure as leaf pages, with two key differences:</p>
<ul>
<li><p>instead of storing records, their cells store a tuple <code>(key, child_page_number)</code> where <code>child_page_number</code> is a 32-bit unsigned integer representing the page number of the "root" page of a subtree that contains records with keys lower than or equal to <code>key</code>. Cells in interior pages are logically ordered by <code>key</code> in ascending order.</p>
</li>
<li><p>the header contains an extra field, the "rightmost pointer", which is the page number of the "root" of the subtree that contains records with keys greater than the largest key in the page.</p>
</li>
</ul>
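<p>To make the role of the keys and of the rightmost pointer more concrete, here is a minimal sketch of how one could pick the child page to visit when looking for a given key. This post only scans whole tables, so nothing below is part of its code: <code>InteriorCell</code> is a simplified stand-in for an interior cell and <code>child_page_for</code> is a hypothetical helper.</p>

```rust
/// Simplified stand-in for an interior cell: a key and the page number of the
/// subtree holding every record whose key is lower than or equal to it.
struct InteriorCell {
    key: i64,
    child_page_number: u32,
}

/// Hypothetical helper: page number of the subtree that may contain `key`.
/// Assumes `cells` is sorted by key in ascending order.
fn child_page_for(cells: &[InteriorCell], rightmost_pointer: u32, key: i64) -> u32 {
    // Index of the first cell whose key is greater than or equal to `key`.
    let idx = cells.partition_point(|cell| cell.key < key);
    match cells.get(idx) {
        Some(cell) => cell.child_page_number,
        // Every key in the page is smaller: follow the rightmost pointer.
        None => rightmost_pointer,
    }
}
```

<p>With cells <code>(10, page 2)</code> and <code>(20, page 3)</code> and a rightmost pointer of 4, keys up to 10 lead to page 2, keys from 11 to 20 lead to page 3, and anything larger leads to page 4.</p>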
<p>With this new knowledge, we can update our page data structure. We'll start by adding the new optional <code>rightmost_pointer</code> field to the page header. We'll also add a <code>byte_size</code> method that returns the size of the header, depending on whether the <code>rightmost_pointer</code> field is set or not, and add a new variant to our <code>PageType</code> enum to represent interior pages.</p>
<pre><code class="lang-diff">// src/page.rs

#[derive(Debug, Copy, Clone, Eq, PartialEq)]
pub enum PageType {
    TableLeaf,
<span class="hljs-addition">+   TableInterior,</span>
}

#[derive(Debug, Copy, Clone)]
pub struct PageHeader {
    pub page_type: PageType,
    pub first_freeblock: u16,
    pub cell_count: u16,
    pub cell_content_offset: u32,
    pub fragmented_bytes_count: u8,
<span class="hljs-addition">+   pub rightmost_pointer: Option&lt;u32&gt;,</span>
}

<span class="hljs-addition">+impl PageHeader {</span>
<span class="hljs-addition">+    pub fn byte_size(&amp;self) -&gt; usize {</span>
<span class="hljs-addition">+        if self.rightmost_pointer.is_some() {</span>
<span class="hljs-addition">+            12</span>
<span class="hljs-addition">+        } else {</span>
<span class="hljs-addition">+            8</span>
<span class="hljs-addition">+        }</span>
<span class="hljs-addition">+    }</span>
<span class="hljs-addition">+}</span>
</code></pre>
<p>Let's modify the parsing function to take the new field into account:</p>
<pre><code class="lang-diff">// src/pager.rs

<span class="hljs-addition">+ const PAGE_LEAF_TABLE_ID: u8 = 0x0d;</span>
<span class="hljs-addition">+ const PAGE_INTERIOR_TABLE_ID: u8 = 0x05;</span>

fn parse_page_header(buffer: &amp;[u8]) -&gt; anyhow::Result&lt;page::PageHeader&gt; {
<span class="hljs-deletion">-   let page_type = match buffer[0] {</span>
<span class="hljs-deletion">-      0x0d =&gt; page::PageType::TableLeaf,</span>
<span class="hljs-addition">+   let (page_type, has_rightmost_ptr) = match buffer[0] {</span>
<span class="hljs-addition">+       PAGE_LEAF_TABLE_ID =&gt; (page::PageType::TableLeaf, false),</span>
<span class="hljs-addition">+       PAGE_INTERIOR_TABLE_ID =&gt; (page::PageType::TableInterior, true),</span>
        _ =&gt; anyhow::bail!("unknown page type: {}", buffer[0]),
    };

    let first_freeblock = read_be_word_at(buffer, PAGE_FIRST_FREEBLOCK_OFFSET);
    let cell_count = read_be_word_at(buffer, PAGE_CELL_COUNT_OFFSET);
    let cell_content_offset = match read_be_word_at(buffer, PAGE_CELL_CONTENT_OFFSET) {
        0 =&gt; 65536,
        n =&gt; n as u32,
    };
    let fragmented_bytes_count = buffer[PAGE_FRAGMENTED_BYTES_COUNT_OFFSET];

<span class="hljs-addition">+   let rightmost_pointer = if has_rightmost_ptr {</span>
<span class="hljs-addition">+       Some(read_be_double_at(buffer, PAGE_RIGHTMOST_POINTER_OFFSET))</span>
<span class="hljs-addition">+   } else {</span>
<span class="hljs-addition">+       None</span>
<span class="hljs-addition">+   };</span>

    Ok(page::PageHeader {
        page_type,
        first_freeblock,
        cell_count,
        cell_content_offset,
        fragmented_bytes_count,
<span class="hljs-addition">+       rightmost_pointer,</span>
    })
}
</code></pre>
<p>We decide whether to parse the <code>rightmost_pointer</code> field depending on the value of the <code>page_type</code> byte (<code>0x0d</code> for leaf pages, <code>0x05</code> for interior pages).</p>
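<p>The helpers <code>read_be_word_at</code> and <code>read_be_double_at</code> are defined outside of this section. As a rough sketch, they could be written as follows. The bodies are an assumption, not the article's actual definitions, and the value 8 for <code>PAGE_RIGHTMOST_POINTER_OFFSET</code> is taken from the SQLite file format, where the rightmost pointer directly follows the 8 bytes shared with leaf page headers.</p>

```rust
// Assumed value: in the SQLite file format, the rightmost pointer sits at
// byte offset 8 of an interior page header.
const PAGE_RIGHTMOST_POINTER_OFFSET: usize = 8;

/// Reads a big-endian 16-bit integer starting at `offset`.
/// Panics if the buffer is too short.
fn read_be_word_at(input: &[u8], offset: usize) -> u16 {
    u16::from_be_bytes([input[offset], input[offset + 1]])
}

/// Reads a big-endian 32-bit integer starting at `offset`.
/// Panics if the buffer is too short.
fn read_be_double_at(input: &[u8], offset: usize) -> u32 {
    u32::from_be_bytes([
        input[offset],
        input[offset + 1],
        input[offset + 2],
        input[offset + 3],
    ])
}
```

<p>Both functions index the buffer directly, so a truncated page would cause a panic. Returning a <code>Result</code> instead would be a reasonable alternative if pages can come from untrusted files.</p>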
<p>Next, we'll update the <code>Page</code> struct to reflect the fact that both leaf and interior pages share the same structure, with the only difference being the content of the cells:</p>
<pre><code class="lang-diff">// src/page.rs

#[derive(Debug, Clone)]
<span class="hljs-deletion">- pub struct TableLeafPage {</span>
<span class="hljs-addition">+ pub struct Page {</span>
    pub header: PageHeader,
    pub cell_pointers: Vec&lt;u16&gt;,
<span class="hljs-deletion">-   pub cells: Vec&lt;TableLeafCell&gt;,</span>
<span class="hljs-addition">+   pub cells: Vec&lt;Cell&gt;,</span>
}

<span class="hljs-deletion">- #[derive(Debug, Clone)]</span>
<span class="hljs-deletion">- pub enum Page {</span>
<span class="hljs-deletion">-   TableLeaf(TableLeafPage),</span>
<span class="hljs-deletion">- }</span>

<span class="hljs-addition">+ #[derive(Debug, Clone)]</span>
<span class="hljs-addition">+ pub enum Cell {</span>
<span class="hljs-addition">+    TableLeaf(TableLeafCell),</span>
<span class="hljs-addition">+    TableInterior(TableInteriorCell),</span>
<span class="hljs-addition">+ }</span>

<span class="hljs-addition">+ impl From&lt;TableLeafCell&gt; for Cell {</span>
<span class="hljs-addition">+    fn from(cell: TableLeafCell) -&gt; Self {</span>
<span class="hljs-addition">+        Cell::TableLeaf(cell)</span>
<span class="hljs-addition">+    }</span>
<span class="hljs-addition">+ }</span>

<span class="hljs-addition">+ impl From&lt;TableInteriorCell&gt; for Cell {</span>
<span class="hljs-addition">+    fn from(cell: TableInteriorCell) -&gt; Self {</span>
<span class="hljs-addition">+        Cell::TableInterior(cell)</span>
<span class="hljs-addition">+    }</span>
<span class="hljs-addition">+ }</span>

<span class="hljs-addition">+ #[derive(Debug, Clone)]</span>
<span class="hljs-addition">+ pub struct TableInteriorCell {</span>
<span class="hljs-addition">+    pub left_child_page: u32,</span>
<span class="hljs-addition">+    pub key: i64,</span>
<span class="hljs-addition">+ }</span>
</code></pre>
<p>This change calls for a major update of our parsing functions, reproduced below:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/pager.rs</span>

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_page</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>], page_num: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::Page&gt; {
    <span class="hljs-keyword">let</span> ptr_offset = <span class="hljs-keyword">if</span> page_num == <span class="hljs-number">1</span> { HEADER_SIZE <span class="hljs-keyword">as</span> <span class="hljs-built_in">u16</span> } <span class="hljs-keyword">else</span> { <span class="hljs-number">0</span> };
    <span class="hljs-keyword">let</span> content_buffer = &amp;buffer[ptr_offset <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..];
    <span class="hljs-keyword">let</span> header = parse_page_header(content_buffer)?;
    <span class="hljs-keyword">let</span> cell_pointers = parse_cell_pointers(
        &amp;content_buffer[header.byte_size()..],
        header.cell_count <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>,
        ptr_offset,
    );

    <span class="hljs-keyword">let</span> cells_parsing_fn = <span class="hljs-keyword">match</span> header.page_type {
        page::PageType::TableLeaf =&gt; parse_table_leaf_cell,
        page::PageType::TableInterior =&gt; parse_table_interior_cell,
    };

    <span class="hljs-keyword">let</span> cells = parse_cells(content_buffer, &amp;cell_pointers, cells_parsing_fn)?;

    <span class="hljs-literal">Ok</span>(page::Page {
        header,
        cell_pointers,
        cells,
    })
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_cells</span></span>(
    buffer: &amp;[<span class="hljs-built_in">u8</span>],
    cell_pointers: &amp;[<span class="hljs-built_in">u16</span>],
    parse_fn: <span class="hljs-keyword">impl</span> <span class="hljs-built_in">Fn</span>(&amp;[<span class="hljs-built_in">u8</span>]) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::Cell&gt;,
) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Vec</span>&lt;page::Cell&gt;&gt; {
    cell_pointers
        .iter()
        .map(|&amp;ptr| parse_fn(&amp;buffer[ptr <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..]))
        .collect()
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_table_leaf_cell</span></span>(<span class="hljs-keyword">mut</span> buffer: &amp;[<span class="hljs-built_in">u8</span>]) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::Cell&gt; {
    <span class="hljs-keyword">let</span> (n, size) = read_varint_at(buffer, <span class="hljs-number">0</span>);
    buffer = &amp;buffer[n <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..];

    <span class="hljs-keyword">let</span> (n, row_id) = read_varint_at(buffer, <span class="hljs-number">0</span>);
    buffer = &amp;buffer[n <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..];

    <span class="hljs-keyword">let</span> payload = buffer[..size <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>].to_vec();

    <span class="hljs-literal">Ok</span>(page::TableLeafCell {
        size,
        row_id,
        payload,
    }
        .into())
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_table_interior_cell</span></span>(<span class="hljs-keyword">mut</span> buffer: &amp;[<span class="hljs-built_in">u8</span>]) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::Cell&gt; {
    <span class="hljs-keyword">let</span> left_child_page = read_be_double_at(buffer, <span class="hljs-number">0</span>);
    buffer = &amp;buffer[<span class="hljs-number">4</span>..];

    <span class="hljs-keyword">let</span> (_, key) = read_varint_at(buffer, <span class="hljs-number">0</span>);

    <span class="hljs-literal">Ok</span>(page::TableInteriorCell {
        left_child_page,
        key,
    }
        .into())
}
</code></pre>
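<p>To make the interior cell layout concrete, here is a minimal, self-contained sketch that decodes one from raw bytes: a 4-byte big-endian child page number followed by a varint key. The <code>read_varint</code> and <code>parse_interior_cell</code> helpers are hypothetical stand-ins written for this example (they return plain tuples rather than the <code>page::Cell</code> type used above), and the sample bytes are made up.</p>
<pre><code class="lang-rust">/// Hypothetical helper: decodes a SQLite varint and returns (bytes read, value).
fn read_varint(buf: &amp;[u8]) -&gt; (usize, i64) {
    let mut value: u64 = 0;
    for (i, &amp;byte) in buf.iter().enumerate().take(9) {
        if i == 8 {
            // The ninth byte contributes all eight of its bits.
            value = (value &lt;&lt; 8) | byte as u64;
            return (9, value as i64);
        }
        // The first eight bytes contribute their seven low bits each.
        value = (value &lt;&lt; 7) | (byte &amp; 0x7f) as u64;
        if byte &amp; 0x80 == 0 {
            // A clear high bit marks the last byte of the varint.
            return (i + 1, value as i64);
        }
    }
    // Truncated input: report what was decoded so far.
    (buf.len().min(9), value as i64)
}

/// Hypothetical helper: decodes (left_child_page, key) from a table interior cell.
fn parse_interior_cell(buf: &amp;[u8]) -&gt; Option&lt;(u32, i64)&gt; {
    // We need the 4-byte page number plus at least one varint byte.
    if buf.len() &lt; 5 {
        return None;
    }
    let left_child_page = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let (_, key) = read_varint(&amp;buf[4..]);
    Some((left_child_page, key))
}
</code></pre>
<p>For instance, the bytes <code>00 00 00 05 82 2c</code> decode to child page 5 and key 300, since the varint <code>82 2c</code> encodes <code>(2 &lt;&lt; 7) | 0x2c</code>.</p>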
<h2 id="heading-scanning-logic">Scanning logic</h2>
<p>Our scanning logic will need to be updated to handle interior pages. We can no longer simply iterate over the cells of a page and call it a day. Instead, we'll need to implement a depth-first algorithm that recursively explores the tree, starting from the root page.</p>
<p>To make our task easier, let's introduce a new <code>PositionedPage</code> struct that stores a page, along with the index of the <code>current</code> cell we're looking at:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/pager.rs</span>

<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">PositionedPage</span></span> {
    <span class="hljs-keyword">pub</span> page: Page,
    <span class="hljs-keyword">pub</span> cell: <span class="hljs-built_in">usize</span>,
}

<span class="hljs-keyword">impl</span> PositionedPage {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_cell</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;&amp;Cell&gt; {
        <span class="hljs-keyword">let</span> cell = <span class="hljs-keyword">self</span>.page.get(<span class="hljs-keyword">self</span>.cell);
        <span class="hljs-keyword">self</span>.cell += <span class="hljs-number">1</span>;
        cell
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_page</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;<span class="hljs-built_in">u32</span>&gt; {
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">self</span>.page.header.page_type == PageType::TableInterior
            &amp;&amp; <span class="hljs-keyword">self</span>.cell == <span class="hljs-keyword">self</span>.page.cells.len()
        {
            <span class="hljs-keyword">self</span>.cell += <span class="hljs-number">1</span>;
            <span class="hljs-keyword">self</span>.page.header.rightmost_pointer
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-literal">None</span>
        }
    }
}
</code></pre>
<p>The <code>next_cell</code> method returns the content of the current cell and increments the cell index, so calling it repeatedly will yield the content of every cell in the page.</p>
<p>The <code>next_page</code> method is a bit more complex: it returns the <code>rightmost_pointer</code> of the current page if it's an interior page and we just visited the last cell, otherwise it returns <code>None</code>.</p>
<p>We'll also update our <code>Cursor</code> so that it owns its payload instead of borrowing it through a <code>Pager</code>:</p>
<pre><code class="lang-plaintext">// src/pager.rs

#[derive(Debug)]
- pub struct Cursor&lt;'p&gt; { 
+ pub struct Cursor {
    header: RecordHeader,
-   pager: &amp;'p mut Pager,
-   page_index: usize,
-   page_cell: usize,
+    payload: Vec&lt;u8&gt;,
}
</code></pre>
<p>This change will allow us to avoid borrowing the <code>Pager</code> mutably from the <code>Cursor</code> and the <code>Scanner</code> at the same time, which would lead to a difficult-to-solve lifetime issue.</p>
<p>With that out of the way, we can focus on the tree scanning logic. We'll maintain a stack of <code>PositionedPage</code> to keep track of the parent pages we've visited. At every step of the walk, there are a few cases to consider:</p>
<ul>
<li><p>if the current page is a leaf page and we haven't visited all the cells yet, we'll just have to build a <code>Cursor</code> with the current cell's payload and return it.</p>
</li>
<li><p>if the current page is an interior page, we'll push the next page (either from the current cell or the rightmost pointer) to the stack and continue the walk.</p>
</li>
<li><p>if we've visited all the cells of the current page, we'll pop the stack and continue the walk from the parent page.</p>
</li>
</ul>
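<p>The cases above can be sketched on a simplified in-memory tree before wiring them to the pager. Everything in this example is an assumption made for illustration: the <code>Node</code> enum stands in for parsed pages, page numbers are plain indices into a slice, and a leaf emits all of its rows at once instead of one <code>Cursor</code> per call.</p>
<pre><code class="lang-rust">/// Simplified stand-in for a table b-tree page (hypothetical, for this sketch only).
enum Node {
    Leaf(Vec&lt;i64&gt;),
    Interior { children: Vec&lt;usize&gt;, rightmost: usize },
}

/// Walks the tree depth-first using an explicit stack of (page index, next cell) pairs.
fn scan(pages: &amp;[Node], root: usize) -&gt; Vec&lt;i64&gt; {
    let mut rows = Vec::new();
    let mut stack = vec![(root, 0usize)];

    while let Some((page, cell)) = stack.pop() {
        match &amp;pages[page] {
            // Leaf page: emit its rows, then fall back to the parent on the stack.
            Node::Leaf(values) =&gt; rows.extend_from_slice(values),
            Node::Interior { children, rightmost } =&gt; {
                if cell &lt; children.len() {
                    // Remember where to resume in this page, then descend.
                    stack.push((page, cell + 1));
                    stack.push((children[cell], 0));
                } else {
                    // All cells visited: the rightmost pointer is the last child.
                    stack.push((*rightmost, 0));
                }
            }
        }
    }
    rows
}
</code></pre>
<p>With a root holding two child cells and a rightmost pointer, the leaves are visited left to right, and the root is dropped from the stack once its rightmost child has been pushed.</p>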
<p>This logic is implemented in the new <code>Scanner</code> struct:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/pager.rs</span>

<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Scanner</span></span>&lt;<span class="hljs-symbol">'p</span>&gt; {
    pager: &amp;<span class="hljs-symbol">'p</span> <span class="hljs-keyword">mut</span> Pager,
    initial_page: <span class="hljs-built_in">usize</span>,
    page_stack: <span class="hljs-built_in">Vec</span>&lt;PositionedPage&gt;,
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'p</span>&gt; Scanner&lt;<span class="hljs-symbol">'p</span>&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(pager: &amp;<span class="hljs-symbol">'p</span> <span class="hljs-keyword">mut</span> Pager, page: <span class="hljs-built_in">usize</span>) -&gt; Scanner&lt;<span class="hljs-symbol">'p</span>&gt; {
        Scanner {
            pager,
            initial_page: page,
            page_stack: <span class="hljs-built_in">Vec</span>::new(),
        }
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_record</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Option</span>&lt;Cursor&gt;&gt; {
        <span class="hljs-keyword">loop</span> {
            <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.next_elem() {
                <span class="hljs-literal">Ok</span>(<span class="hljs-literal">Some</span>(ScannerElem::Cursor(cursor))) =&gt; <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-literal">Some</span>(cursor)),
                <span class="hljs-literal">Ok</span>(<span class="hljs-literal">Some</span>(ScannerElem::Page(page_num))) =&gt; {
                    <span class="hljs-keyword">let</span> new_page = <span class="hljs-keyword">self</span>.pager.read_page(page_num <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>)?.clone();
                    <span class="hljs-keyword">self</span>.page_stack.push(PositionedPage {
                        page: new_page,
                        cell: <span class="hljs-number">0</span>,
                    });
                }
                <span class="hljs-literal">Ok</span>(<span class="hljs-literal">None</span>) <span class="hljs-keyword">if</span> <span class="hljs-keyword">self</span>.page_stack.len() &gt; <span class="hljs-number">1</span> =&gt; {
                    <span class="hljs-keyword">self</span>.page_stack.pop();
                }
                <span class="hljs-literal">Ok</span>(<span class="hljs-literal">None</span>) =&gt; <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-literal">None</span>),
                <span class="hljs-literal">Err</span>(e) =&gt; <span class="hljs-keyword">return</span> <span class="hljs-literal">Err</span>(e),
            }
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_elem</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Option</span>&lt;ScannerElem&gt;&gt; {
        <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(page) = <span class="hljs-keyword">self</span>.current_page()? <span class="hljs-keyword">else</span> {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-literal">None</span>);
        };

        <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(page) = page.next_page() {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-literal">Some</span>(ScannerElem::Page(page)));
        }

        <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(cell) = page.next_cell() <span class="hljs-keyword">else</span> {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-literal">None</span>);
        };

        <span class="hljs-keyword">match</span> cell {
            Cell::TableLeaf(cell) =&gt; {
                <span class="hljs-keyword">let</span> header = parse_record_header(&amp;cell.payload)?;
                <span class="hljs-literal">Ok</span>(<span class="hljs-literal">Some</span>(ScannerElem::Cursor(Cursor {
                    header,
                    payload: cell.payload.clone(),
                })))
            }
            Cell::TableInterior(cell) =&gt; <span class="hljs-literal">Ok</span>(<span class="hljs-literal">Some</span>(ScannerElem::Page(cell.left_child_page))),
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">current_page</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Option</span>&lt;&amp;<span class="hljs-keyword">mut</span> PositionedPage&gt;&gt; {
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">self</span>.page_stack.is_empty() {
            <span class="hljs-keyword">let</span> page = <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.pager.read_page(<span class="hljs-keyword">self</span>.initial_page) {
                <span class="hljs-literal">Ok</span>(page) =&gt; page.clone(),
                <span class="hljs-literal">Err</span>(e) =&gt; <span class="hljs-keyword">return</span> <span class="hljs-literal">Err</span>(e),
            };

            <span class="hljs-keyword">self</span>.page_stack.push(PositionedPage { page, cell: <span class="hljs-number">0</span> });
        }

        <span class="hljs-literal">Ok</span>(<span class="hljs-keyword">self</span>.page_stack.last_mut())
    }
}

<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">ScannerElem</span></span> {
    Page(<span class="hljs-built_in">u32</span>),
    Cursor(Cursor),
}
</code></pre>
<h2 id="heading-putting-it-all-together">Putting it all together</h2>
<p>The only change that remains to be made is to update the <code>display_tables</code> function to account for the change in <code>next_record</code> signature:</p>
<pre><code class="lang-diff">// src/main.rs

fn display_tables(db: &amp;mut db::Db) -&gt; anyhow::Result&lt;()&gt; {
    let mut scanner = db.scanner(1);

<span class="hljs-deletion">-   while let Some(Ok(mut record)) = scanner.next_record() {</span>
<span class="hljs-addition">+   while let Some(mut record) = scanner.next_record()? {</span>
        let type_value = record
            .field(0)
            .context("missing type field")
            .context("invalid type field")?;

        if type_value.as_str() == Some("table") {
            let name_value = record
                .field(1)
                .context("missing name field")
                .context("invalid name field")?;

            print!("{} ", name_value.as_str().unwrap());
        }
    }

    Ok(())
}
</code></pre>
<p>We can now display our (long!) list of tables:</p>
<pre><code class="lang-bash">cargo run --release -- res/test.db
rqlite&gt; .tables
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Our scanning logic is now able to handle tables that span multiple pages, thanks to the introduction of interior pages. This is a major milestone in our journey to build a fully functional database! In the next post, we'll learn how to parse simple SQL queries and lay the groundwork for our query engine.</p>
]]></content:encoded></item><item><title><![CDATA[Build your own SQLite, Part 1: Listing tables]]></title><description><![CDATA[As developers, we use databases all the time. But how do they work?
In this series, we'll try to answer that question by building our own
SQLite-compatible database from scratch.
Source code examples will be provided in Rust, but you are encouraged t...]]></description><link>https://blog.sylver.dev/build-your-own-sqlite-part-1-listing-tables</link><guid isPermaLink="true">https://blog.sylver.dev/build-your-own-sqlite-part-1-listing-tables</guid><category><![CDATA[Rust]]></category><category><![CDATA[SQLite]]></category><category><![CDATA[from scratch]]></category><category><![CDATA[project]]></category><category><![CDATA[database]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Mon, 22 Jul 2024 21:36:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1721684395771/c0c06140-18f6-442d-a6da-f50eb28018de.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As developers, we use databases all the time. But how do they work?
In this series, we'll try to answer that question by building our own
SQLite-compatible database from scratch.</p>
<p>Source code examples will be provided in Rust, but you are encouraged to
follow along using your language of choice, as we won't be relying
on many language-specific features or libraries.</p>
<p>As an introduction, we'll implement the simplest version of the <code>tables</code> command,
which lists the names of all the tables in a database. While this looks simple, we'll
see that it requires us to make our first deep dive into the SQLite file format.</p>
<h2 id="heading-building-the-test-database">Building the test database</h2>
<p>To keep things as simple as possible, let's build a minimalistic
test database:</p>
<pre><code class="lang-bash">sqlite3 minimal_test.db
sqlite&gt; create table table1(id <span class="hljs-built_in">integer</span>);
sqlite&gt; create table table2(id <span class="hljs-built_in">integer</span>);
sqlite&gt; .<span class="hljs-built_in">exit</span>
</code></pre>
<p>This creates a database with two tables, <code>table1</code> and <code>table2</code>, each with a single
column, <code>id</code>. We can verify this by running the <code>tables</code> command in the SQLite shell:</p>
<pre><code class="lang-bash">sqlite3 minimal_test.db
sqlite&gt; .tables
table1  table2
sqlite&gt; .<span class="hljs-built_in">exit</span>
</code></pre>
<h2 id="heading-bootstrapping-the-project">Bootstrapping the project</h2>
<p>Let's start by creating a new Rust project. We'll use <code>cargo add</code> to add our only dependency
for now, <code>anyhow</code>:</p>
<pre><code class="lang-bash">cargo new rsqlite
<span class="hljs-built_in">cd</span> rsqlite
cargo add anyhow
</code></pre>
<h2 id="heading-the-sqlite-file-format">The SQLite file format</h2>
<p>SQLite databases are stored in a single file, the format of which is
documented in the <a target="_blank" href="https://www.sqlite.org/fileformat.html">SQLite File Format Specification</a>.
The file is divided into pages, with each page having the same size: a power of 2, between
512 and 65536 bytes.
The first 100 bytes of the first page contain the database header, which includes
information such as the page size and the file format version. In this first part, we'll only
be interested in the page size.
Pages can be of different types, but for this first article, we'll only be interested in
<code>table btree leaf</code> pages, which store the actual table data.</p>
<p>Our first task will be to implement a <code>Pager</code> struct that reads and caches pages from the
database file. But before we do, we'll have to read the page size from the database header.
Let's start by defining our <code>DbHeader</code> struct:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/page.rs</span>
<span class="hljs-meta">#[derive(Debug, Copy, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">DbHeader</span></span> {
    <span class="hljs-keyword">pub</span> page_size: <span class="hljs-built_in">u32</span>,
}
</code></pre>
<p>The header starts with the magic string <code>SQLite format 3\0</code>, followed by the page size
encoded as a big-endian 2-byte integer at offset 16. With this information, we can
implement a function that reads the header from a buffer:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/pager.rs</span>
<span class="hljs-keyword">pub</span> <span class="hljs-keyword">const</span> HEADER_SIZE: <span class="hljs-built_in">usize</span> = <span class="hljs-number">100</span>;
<span class="hljs-keyword">const</span> HEADER_PREFIX: &amp;[<span class="hljs-built_in">u8</span>] = <span class="hljs-string">b"SQLite format 3\0"</span>;
<span class="hljs-keyword">const</span> HEADER_PAGE_SIZE_OFFSET: <span class="hljs-built_in">usize</span> = <span class="hljs-number">16</span>;

<span class="hljs-keyword">const</span> PAGE_MAX_SIZE: <span class="hljs-built_in">u32</span> = <span class="hljs-number">65536</span>;

<span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_header</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>]) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::DbHeader&gt; {
    <span class="hljs-keyword">if</span> !buffer.starts_with(HEADER_PREFIX) {
        <span class="hljs-keyword">let</span> prefix = <span class="hljs-built_in">String</span>::from_utf8_lossy(&amp;buffer[..HEADER_PREFIX.len()]);
        anyhow::bail!(<span class="hljs-string">"invalid header prefix: {prefix}"</span>);
    }

    <span class="hljs-keyword">let</span> page_size_raw = read_be_word_at(buffer, HEADER_PAGE_SIZE_OFFSET);
    <span class="hljs-keyword">let</span> page_size = <span class="hljs-keyword">match</span> page_size_raw {
        <span class="hljs-number">1</span> =&gt; PAGE_MAX_SIZE,
        <span class="hljs-comment">// check n != 0 first so that n - 1 cannot underflow</span>
        n <span class="hljs-keyword">if</span> n != <span class="hljs-number">0</span> &amp;&amp; (n &amp; (n - <span class="hljs-number">1</span>)) == <span class="hljs-number">0</span> =&gt; n <span class="hljs-keyword">as</span> <span class="hljs-built_in">u32</span>,
        _ =&gt; anyhow::bail!(<span class="hljs-string">"page size is not a power of 2: {}"</span>, page_size_raw),
    };

    <span class="hljs-literal">Ok</span>(page::DbHeader { page_size })
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_be_word_at</span></span>(input: &amp;[<span class="hljs-built_in">u8</span>], offset: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">u16</span> {
    <span class="hljs-built_in">u16</span>::from_be_bytes(input[offset..offset + <span class="hljs-number">2</span>].try_into().unwrap())
}
</code></pre>
<p>Two things to note here:</p>
<ul>
<li>As the maximum page size cannot be represented as a 2-byte integer, a page size of 1 is used to represent the maximum
page size.</li>
<li>We use a somewhat convoluted expression to check if the page size is a power of 2.
The expression <code>n &amp; (n - 1) == 0</code> is true if and only if <code>n</code> is a power of 2, except for <code>n = 0</code>, which has to be ruled out separately.</li>
</ul>
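<p>As a point of comparison, the same decoding rule can be written with the standard library's <code>is_power_of_two</code>, which already returns <code>false</code> for zero. This is only a sketch: <code>decode_page_size</code> is a hypothetical name, and the lower bound of 512 is an extra check taken from the page size range quoted earlier, not something <code>parse_header</code> enforces.</p>
<pre><code class="lang-rust">/// Hypothetical helper: decodes the raw 2-byte page size field of the database header.
fn decode_page_size(raw: u16) -&gt; Option&lt;u32&gt; {
    match raw {
        // 65536 does not fit in two bytes, so it is stored as 1.
        1 =&gt; Some(65536),
        // Any other valid size is a power of 2 of at least 512 bytes.
        n if n &gt;= 512 &amp;&amp; n.is_power_of_two() =&gt; Some(n as u32),
        // Zero, non powers of 2 and sizes below 512 are rejected.
        _ =&gt; None,
    }
}
</code></pre>
<p>Here <code>decode_page_size(4096)</code> yields <code>Some(4096)</code> and <code>decode_page_size(1)</code> yields <code>Some(65536)</code>, while 0, 256 and 1000 are all rejected.</p>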
<div class="hn-embed-widget" id="codecrafters-highend"></div><h2 id="heading-decoding-table-b-tree-leaf-pages">Decoding Table B-tree leaf pages</h2>
<p>Now that we have the minimum information we need to read pages from the disk, let's explore
the content of a <code>table btree-leaf</code> page.
<code>table btree-leaf</code> pages start with an 8-byte header, followed by a sequence of "cell pointers"
containing the offset of every cell in the page. The cells contain the table data, and we
can think of them as key-value pairs, where the key is a 64-bit integer encoded as
a <a target="_blank" href="https://carlmastrangelo.com/blog/lets-make-a-varint">varint</a>
(the <code>rowid</code>) and the value is an arbitrary sequence of bytes representing the row data.
The header contains the following fields:</p>
<ul>
<li><code>page_type</code>: byte representing the page type. For <code>table btree-leaf</code> pages, this is 0x0D.</li>
<li><code>first_freeblock</code>: 2-byte integer representing the offset of the first free block in the page, or zero if there is no
freeblock.</li>
<li><code>cell_count</code>: 2-byte integer representing the number of cells in the page.</li>
<li><code>cell_content_offset</code>: 2-byte integer representing the offset of the start of the cell content area. A value of zero is interpreted as 65536.</li>
<li><code>fragmented_bytes_count</code>: 1-byte integer representing the number of fragmented free bytes in the page (we won't make
use of it for now).</li>
</ul>
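<p>To see how these fields map onto raw bytes, here is a small standalone sketch that decodes a hand-written 8-byte header. The <code>LeafHeader</code> struct, the <code>be_u16</code> and <code>decode_leaf_header</code> helpers and the sample bytes are all assumptions made for this example. It is not the parser used in the rest of the series.</p>
<pre><code class="lang-rust">/// Hypothetical mirror of the 8-byte table b-tree leaf page header.
#[derive(Debug, PartialEq)]
struct LeafHeader {
    page_type: u8,
    first_freeblock: u16,
    cell_count: u16,
    cell_content_offset: u32,
    fragmented_bytes_count: u8,
}

/// Reads a big-endian 2-byte integer at the given offset.
fn be_u16(buf: &amp;[u8], offset: usize) -&gt; u16 {
    u16::from_be_bytes([buf[offset], buf[offset + 1]])
}

/// Decodes the header, or returns None if the buffer is too short
/// or the page is not a table b-tree leaf page (type byte 0x0d).
fn decode_leaf_header(buf: &amp;[u8]) -&gt; Option&lt;LeafHeader&gt; {
    if buf.len() &lt; 8 || buf[0] != 0x0d {
        return None;
    }
    Some(LeafHeader {
        page_type: buf[0],
        first_freeblock: be_u16(buf, 1),
        cell_count: be_u16(buf, 3),
        // A stored value of zero stands for 65536.
        cell_content_offset: match be_u16(buf, 5) {
            0 =&gt; 65536,
            n =&gt; n as u32,
        },
        fragmented_bytes_count: buf[7],
    })
}
</code></pre>
<p>Feeding it the bytes <code>0d 00 00 00 02 0f e0 00</code> gives a page with no freeblock, 2 cells, a cell content area starting at offset <code>0x0fe0</code> (4064) and no fragmented bytes.</p>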
<p>We'll start by defining a <code>Page</code> enum representing a parsed page, along with
the necessary structs to represent the page header and the cell pointers:</p>
<pre><code class="lang-rust"><span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Page</span></span> {
    TableLeaf(TableLeafPage),
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">TableLeafPage</span></span> {
    <span class="hljs-keyword">pub</span> header: PageHeader,
    <span class="hljs-keyword">pub</span> cell_pointers: <span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">u16</span>&gt;,
    <span class="hljs-keyword">pub</span> cells: <span class="hljs-built_in">Vec</span>&lt;TableLeafCell&gt;,
}

<span class="hljs-meta">#[derive(Debug, Copy, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">PageHeader</span></span> {
    <span class="hljs-keyword">pub</span> page_type: PageType,
    <span class="hljs-keyword">pub</span> first_freeblock: <span class="hljs-built_in">u16</span>,
    <span class="hljs-keyword">pub</span> cell_count: <span class="hljs-built_in">u16</span>,
    <span class="hljs-keyword">pub</span> cell_content_offset: <span class="hljs-built_in">u32</span>,
    <span class="hljs-keyword">pub</span> fragmented_bytes_count: <span class="hljs-built_in">u8</span>,
}

<span class="hljs-meta">#[derive(Debug, Copy, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">PageType</span></span> {
    TableLeaf,
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">TableLeafCell</span></span> {
    <span class="hljs-keyword">pub</span> size: <span class="hljs-built_in">i64</span>,
    <span class="hljs-keyword">pub</span> row_id: <span class="hljs-built_in">i64</span>,
    <span class="hljs-keyword">pub</span> payload: <span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">u8</span>&gt;,
}
</code></pre>
<p>The corresponding parsing functions are quite straightforward. Note the offset handling
in <code>parse_page</code>: since the first page contains the database header, we start parsing
the page at offset 100.</p>
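<pre><code class="lang-rust"><span class="hljs-comment">// src/pager.rs</span>
<span class="hljs-comment">// page type byte of table b-tree leaf pages</span>
<span class="hljs-keyword">const</span> PAGE_LEAF_TABLE_ID: <span class="hljs-built_in">u8</span> = <span class="hljs-number">0x0d</span>;
<span class="hljs-keyword">const</span> PAGE_LEAF_HEADER_SIZE: <span class="hljs-built_in">usize</span> = <span class="hljs-number">8</span>;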
<span class="hljs-keyword">const</span> PAGE_FIRST_FREEBLOCK_OFFSET: <span class="hljs-built_in">usize</span> = <span class="hljs-number">1</span>;
<span class="hljs-keyword">const</span> PAGE_CELL_COUNT_OFFSET: <span class="hljs-built_in">usize</span> = <span class="hljs-number">3</span>;
<span class="hljs-keyword">const</span> PAGE_CELL_CONTENT_OFFSET: <span class="hljs-built_in">usize</span> = <span class="hljs-number">5</span>;
<span class="hljs-keyword">const</span> PAGE_FRAGMENTED_BYTES_COUNT_OFFSET: <span class="hljs-built_in">usize</span> = <span class="hljs-number">7</span>;

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_page</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>], page_num: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::Page&gt; {
    <span class="hljs-keyword">let</span> ptr_offset = <span class="hljs-keyword">if</span> page_num == <span class="hljs-number">1</span> { HEADER_SIZE <span class="hljs-keyword">as</span> <span class="hljs-built_in">u16</span> } <span class="hljs-keyword">else</span> { <span class="hljs-number">0</span> };

    <span class="hljs-keyword">match</span> buffer[<span class="hljs-number">0</span>] {
        PAGE_LEAF_TABLE_ID =&gt; parse_table_leaf_page(buffer, ptr_offset),
        _ =&gt; <span class="hljs-literal">Err</span>(anyhow::anyhow!(<span class="hljs-string">"unknown page type: {}"</span>, buffer[<span class="hljs-number">0</span>])),
    }
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_table_leaf_page</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>], ptr_offset: <span class="hljs-built_in">u16</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::Page&gt; {
    <span class="hljs-keyword">let</span> header = parse_page_header(buffer)?;

    <span class="hljs-keyword">let</span> content_buffer = &amp;buffer[PAGE_LEAF_HEADER_SIZE..];
    <span class="hljs-keyword">let</span> cell_pointers = parse_cell_pointers(content_buffer, header.cell_count <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>, ptr_offset);

    <span class="hljs-keyword">let</span> cells = cell_pointers
        .iter()
        .map(|&amp;ptr| parse_table_leaf_cell(&amp;buffer[ptr <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..]))
        .collect::&lt;anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">Vec</span>&lt;page::TableLeafCell&gt;&gt;&gt;()?;

    <span class="hljs-literal">Ok</span>(page::Page::TableLeaf(page::TableLeafPage {
        header,
        cell_pointers,
        cells,
    }))
}


<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_page_header</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>]) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::PageHeader&gt; {
    <span class="hljs-keyword">let</span> page_type = <span class="hljs-keyword">match</span> buffer[<span class="hljs-number">0</span>] {
        <span class="hljs-number">0x0d</span> =&gt; page::PageType::TableLeaf,
        _ =&gt; anyhow::bail!(<span class="hljs-string">"unknown page type: {}"</span>, buffer[<span class="hljs-number">0</span>]),
    };

    <span class="hljs-keyword">let</span> first_freeblock = read_be_word_at(buffer, PAGE_FIRST_FREEBLOCK_OFFSET);
    <span class="hljs-keyword">let</span> cell_count = read_be_word_at(buffer, PAGE_CELL_COUNT_OFFSET);
    <span class="hljs-keyword">let</span> cell_content_offset = <span class="hljs-keyword">match</span> read_be_word_at(buffer, PAGE_CELL_CONTENT_OFFSET) {
        <span class="hljs-number">0</span> =&gt; <span class="hljs-number">65536</span>,
        n =&gt; n <span class="hljs-keyword">as</span> <span class="hljs-built_in">u32</span>,
    };
    <span class="hljs-keyword">let</span> fragmented_bytes_count = buffer[PAGE_FRAGMENTED_BYTES_COUNT_OFFSET];

    <span class="hljs-literal">Ok</span>(page::PageHeader {
        page_type,
        first_freeblock,
        cell_count,
        cell_content_offset,
        fragmented_bytes_count,
    })
}


<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_cell_pointers</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>], n: <span class="hljs-built_in">usize</span>, ptr_offset: <span class="hljs-built_in">u16</span>) -&gt; <span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">u16</span>&gt; {
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> pointers = <span class="hljs-built_in">Vec</span>::with_capacity(n);
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-number">0</span>..n {
        pointers.push(read_be_word_at(buffer, <span class="hljs-number">2</span> * i) - ptr_offset);
    }
    pointers
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_table_leaf_cell</span></span>(<span class="hljs-keyword">mut</span> buffer: &amp;[<span class="hljs-built_in">u8</span>]) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::TableLeafCell&gt; {
    <span class="hljs-keyword">let</span> (n, size) = read_varint_at(buffer, <span class="hljs-number">0</span>);
    buffer = &amp;buffer[n <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..];

    <span class="hljs-keyword">let</span> (n, row_id) = read_varint_at(buffer, <span class="hljs-number">0</span>);
    buffer = &amp;buffer[n <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..];

    <span class="hljs-keyword">let</span> payload = buffer[..size <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>].to_vec();

    <span class="hljs-literal">Ok</span>(page::TableLeafCell {
        size,
        row_id,
        payload,
    })
}

<span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_varint_at</span></span>(buffer: &amp;[<span class="hljs-built_in">u8</span>], <span class="hljs-keyword">mut</span> offset: <span class="hljs-built_in">usize</span>) -&gt; (<span class="hljs-built_in">u8</span>, <span class="hljs-built_in">i64</span>) {
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> size = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> result = <span class="hljs-number">0</span>;

    <span class="hljs-keyword">while</span> size &lt; <span class="hljs-number">9</span> {
        <span class="hljs-keyword">let</span> current_byte = buffer[offset] <span class="hljs-keyword">as</span> <span class="hljs-built_in">i64</span>;
        <span class="hljs-keyword">if</span> size == <span class="hljs-number">8</span> {
            result = (result &lt;&lt; <span class="hljs-number">8</span>) | current_byte;
        } <span class="hljs-keyword">else</span> {
            result = (result &lt;&lt; <span class="hljs-number">7</span>) | (current_byte &amp; <span class="hljs-number">0b0111_1111</span>);
        }

        offset += <span class="hljs-number">1</span>;
        size += <span class="hljs-number">1</span>;

        <span class="hljs-keyword">if</span> current_byte &amp; <span class="hljs-number">0b1000_0000</span> == <span class="hljs-number">0</span> {
            <span class="hljs-keyword">break</span>;
        }
    }

    (size, result)
}
</code></pre>
<p>To read a varint, we copy the 7 least significant bits of each byte to the result, continuing as long as the most significant bit of the current byte is set. As the maximum length of a varint is 9 bytes, we keep track of
the number of bytes visited and stop after at most 9 bytes. Note that to
fill a 64-bit value, we need the 7 low bits of each of the first 8 bytes
and all 8 bits of the last byte. That's why we test the current size
of the varint at each iteration and add a special case for the last byte (when <code>size == 8</code>).</p>
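<p>The rule is easy to check by hand on a few inputs. The sketch below is a standalone restatement of the decoding loop, not the code used in the rest of this series: <code>decode_varint</code> is a hypothetical helper name, and the sample inputs mentioned afterwards are worked out from the rule above rather than taken from a real database file.</p>
<pre><code class="lang-rust">/// Hypothetical standalone decoder: returns (bytes consumed, decoded value).
fn decode_varint(bytes: &amp;[u8]) -&gt; (usize, i64) {
    let mut value: i64 = 0;
    let mut consumed = 0;

    for &amp;byte in bytes.iter().take(9) {
        consumed += 1;

        if consumed == 9 {
            // The ninth byte contributes all of its 8 bits.
            value = (value &lt;&lt; 8) | byte as i64;
            break;
        }

        // Bytes 1 to 8 contribute their 7 low bits.
        value = (value &lt;&lt; 7) | (byte &amp; 0x7f) as i64;

        // A clear high bit marks the last byte of the varint.
        if byte &amp; 0x80 == 0 {
            break;
        }
    }

    (consumed, value)
}
</code></pre>
<p>With this helper, the single byte <code>0x7f</code> decodes to 127, the two bytes <code>0x81 0x00</code> decode to 128 (the first byte contributes a 1 that is shifted left by 7 bits), and nine <code>0xff</code> bytes decode to -1, since all 64 bits of the result end up set.</p>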
<p>We can finally implement the pager itself. For now, it only loads and caches pages without
any eviction policy:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// pager.rs</span>
<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Pager</span></span>&lt;I: Read + Seek = std::fs::File&gt; {
    input: I,
    page_size: <span class="hljs-built_in">usize</span>,
    pages: HashMap&lt;<span class="hljs-built_in">usize</span>, page::Page&gt;,
}

<span class="hljs-keyword">impl</span>&lt;I: Read + Seek&gt; Pager&lt;I&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(input: I, page_size: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">Self</span> {
            input,
            page_size,
            pages: HashMap::new(),
        }
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_page</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;&amp;page::Page&gt; {
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">self</span>.pages.contains_key(&amp;n) {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-keyword">self</span>.pages.get(&amp;n).unwrap());
        }

        <span class="hljs-keyword">let</span> page = <span class="hljs-keyword">self</span>.load_page(n)?;
        <span class="hljs-keyword">self</span>.pages.insert(n, page);
        <span class="hljs-literal">Ok</span>(<span class="hljs-keyword">self</span>.pages.get(&amp;n).unwrap())
    }

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">load_page</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;page::Page&gt; {
        <span class="hljs-keyword">let</span> offset = n.saturating_sub(<span class="hljs-number">1</span>) * <span class="hljs-keyword">self</span>.page_size;

        <span class="hljs-keyword">self</span>.input
            .seek(SeekFrom::Start(offset <span class="hljs-keyword">as</span> <span class="hljs-built_in">u64</span>))
            .context(<span class="hljs-string">"seek to page start"</span>)?;

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> buffer = <span class="hljs-built_in">vec!</span>[<span class="hljs-number">0</span>; <span class="hljs-keyword">self</span>.page_size];
        <span class="hljs-keyword">self</span>.input.read_exact(&amp;<span class="hljs-keyword">mut</span> buffer).context(<span class="hljs-string">"read page"</span>)?;

        parse_page(&amp;buffer, n)
    }
}
</code></pre>
<h2 id="heading-records">Records</h2>
<p>We now have a way to read pages and to access a page's cells. But how do we decode the values stored in the cells?
Each cell contains the value of a row in the table, encoded using
the <a target="_blank" href="https://www.sqlite.org/fileformat2.html#record_format">SQLite record format</a>.
The record format is quite simple: a record consists of a header, followed by a sequence of field values.
The header starts with a varint representing the byte size of the header, followed by a sequence
of varints (one per column) determining the type of each column according to the following table:</p>
<ul>
<li>0: NULL</li>
<li>1: 8-bit signed integer</li>
<li>2: 16-bit signed integer</li>
<li>3: 24-bit signed integer</li>
<li>4: 32-bit signed integer</li>
<li>5: 48-bit signed integer</li>
<li>6: 64-bit signed integer</li>
<li>7: 64-bit IEEE floating point number</li>
<li>8: value is the integer 0</li>
<li>9: value is the integer 1</li>
<li>10 &amp; 11: reserved for internal use</li>
<li>n with n even and n &gt;= 12: BLOB of size (n - 12) / 2</li>
<li>n with n odd and n &gt;= 13: text of size (n - 13) / 2</li>
</ul>
<p>We now have all the information we need to parse and represent record headers:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/cursor.rs</span>
<span class="hljs-meta">#[derive(Debug, Copy, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">RecordFieldType</span></span> {
    Null,
    I8,
    I16,
    I24,
    I32,
    I48,
    I64,
    Float,
    Zero,
    One,
    <span class="hljs-built_in">String</span>(<span class="hljs-built_in">usize</span>),
    Blob(<span class="hljs-built_in">usize</span>),
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">RecordField</span></span> {
    <span class="hljs-keyword">pub</span> offset: <span class="hljs-built_in">usize</span>,
    <span class="hljs-keyword">pub</span> field_type: RecordFieldType,
}

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">RecordHeader</span></span> {
    <span class="hljs-keyword">pub</span> fields: <span class="hljs-built_in">Vec</span>&lt;RecordField&gt;,
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_record_header</span></span>(<span class="hljs-keyword">mut</span> buffer: &amp;[<span class="hljs-built_in">u8</span>]) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;RecordHeader&gt; {
    <span class="hljs-keyword">let</span> (varint_size, header_length) = crate::pager::read_varint_at(buffer, <span class="hljs-number">0</span>);
    buffer = &amp;buffer[varint_size <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..header_length <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>];

    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> fields = <span class="hljs-built_in">Vec</span>::new();
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> current_offset = header_length <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>;

    <span class="hljs-keyword">while</span> !buffer.is_empty() {
        <span class="hljs-keyword">let</span> (discriminant_size, discriminant) = crate::pager::read_varint_at(buffer, <span class="hljs-number">0</span>);
        buffer = &amp;buffer[discriminant_size <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>..];

        <span class="hljs-keyword">let</span> (field_type, field_size) = <span class="hljs-keyword">match</span> discriminant {
            <span class="hljs-number">0</span> =&gt; (RecordFieldType::Null, <span class="hljs-number">0</span>),
            <span class="hljs-number">1</span> =&gt; (RecordFieldType::I8, <span class="hljs-number">1</span>),
            <span class="hljs-number">2</span> =&gt; (RecordFieldType::I16, <span class="hljs-number">2</span>),
            <span class="hljs-number">3</span> =&gt; (RecordFieldType::I24, <span class="hljs-number">3</span>),
            <span class="hljs-number">4</span> =&gt; (RecordFieldType::I32, <span class="hljs-number">4</span>),
            <span class="hljs-number">5</span> =&gt; (RecordFieldType::I48, <span class="hljs-number">6</span>),
            <span class="hljs-number">6</span> =&gt; (RecordFieldType::I64, <span class="hljs-number">8</span>),
            <span class="hljs-number">7</span> =&gt; (RecordFieldType::Float, <span class="hljs-number">8</span>),
            <span class="hljs-number">8</span> =&gt; (RecordFieldType::Zero, <span class="hljs-number">0</span>),
            <span class="hljs-number">9</span> =&gt; (RecordFieldType::One, <span class="hljs-number">0</span>),
            n <span class="hljs-keyword">if</span> n &gt;= <span class="hljs-number">12</span> &amp;&amp; n % <span class="hljs-number">2</span> == <span class="hljs-number">0</span> =&gt; {
                <span class="hljs-keyword">let</span> size = ((n - <span class="hljs-number">12</span>) / <span class="hljs-number">2</span>) <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>;
                (RecordFieldType::Blob(size), size)
            }
            n <span class="hljs-keyword">if</span> n &gt;= <span class="hljs-number">13</span> &amp;&amp; n % <span class="hljs-number">2</span> == <span class="hljs-number">1</span> =&gt; {
                <span class="hljs-keyword">let</span> size = ((n - <span class="hljs-number">13</span>) / <span class="hljs-number">2</span>) <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>;
                (RecordFieldType::<span class="hljs-built_in">String</span>(size), size)
            }
            n =&gt; anyhow::bail!(<span class="hljs-string">"unsupported field type: {}"</span>, n),
        };

        fields.push(RecordField {
            offset: current_offset,
            field_type,
        });

        current_offset += field_size;
    }

    <span class="hljs-literal">Ok</span>(RecordHeader { fields })
}
</code></pre>
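<p>To make the layout concrete, here is a small self-contained sketch that computes the offset and size of each field in a hand-built record. It is not the parser above: <code>serial_type_size</code> and <code>field_layout</code> are hypothetical names, and for brevity the sketch assumes that every varint in the header fits in a single byte (that is, all values are below 128).</p>
<pre><code class="lang-rust">/// Size in bytes of a value with the given serial type, or None for the
/// reserved types 10 and 11 (hypothetical helper).
fn serial_type_size(serial_type: i64) -&gt; Option&lt;usize&gt; {
    match serial_type {
        0 | 8 | 9 =&gt; Some(0),
        1 =&gt; Some(1),
        2 =&gt; Some(2),
        3 =&gt; Some(3),
        4 =&gt; Some(4),
        5 =&gt; Some(6),
        6 | 7 =&gt; Some(8),
        // For odd n, integer division makes (n - 12) / 2 equal to (n - 13) / 2,
        // so one arm covers both blobs and text.
        n if n &gt;= 12 =&gt; Some(((n - 12) / 2) as usize),
        _ =&gt; None,
    }
}

/// Returns (serial type, offset, size) for each field of a record.
/// Assumption: every header varint is a single byte.
fn field_layout(record: &amp;[u8]) -&gt; Option&lt;Vec&lt;(i64, usize, usize)&gt;&gt; {
    // The header length counts its own byte, so values start right after it.
    let header_len = *record.first()? as usize;
    let mut offset = header_len;
    let mut fields = Vec::new();

    for &amp;byte in record.get(1..header_len)? {
        let serial_type = byte as i64;
        let size = serial_type_size(serial_type)?;
        fields.push((serial_type, offset, size));
        offset += size;
    }

    Some(fields)
}
</code></pre>
<p>For a row holding the integer 42 and the text <code>hi</code>, the record bytes would be <code>03 01 11 2a 68 69</code>: a 3-byte header (its own length, serial type 1, serial type 17 = 13 + 2 * 2), followed by the values. <code>field_layout</code> then reports the integer at offset 3 with size 1 and the text at offset 4 with size 2.</p>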
<p>To make it easier to work with records, we'll define a <code>Value</code> type, representing field values
and a <code>Cursor</code> struct that uniquely identifies a record within a database file. The <code>Cursor</code>
will expose a <code>field</code> method, returning the value of the record's n-th field:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/value.rs</span>
<span class="hljs-keyword">use</span> std::borrow::Cow;

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Value</span></span>&lt;<span class="hljs-symbol">'p</span>&gt; {
    Null,
    <span class="hljs-built_in">String</span>(Cow&lt;<span class="hljs-symbol">'p</span>, <span class="hljs-built_in">str</span>&gt;),
    Blob(Cow&lt;<span class="hljs-symbol">'p</span>, [<span class="hljs-built_in">u8</span>]&gt;),
    Int(<span class="hljs-built_in">i64</span>),
    Float(<span class="hljs-built_in">f64</span>),
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'p</span>&gt; Value&lt;<span class="hljs-symbol">'p</span>&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">as_str</span></span>(&amp;<span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;&amp;<span class="hljs-built_in">str</span>&gt; {
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> Value::<span class="hljs-built_in">String</span>(s) = <span class="hljs-keyword">self</span> {
            <span class="hljs-literal">Some</span>(s.as_ref())
        } <span class="hljs-keyword">else</span> {
            <span class="hljs-literal">None</span>
        }
    }
}
</code></pre>
<pre><code class="lang-rust"><span class="hljs-comment">// src/cursor.rs</span>
<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Cursor</span></span>&lt;<span class="hljs-symbol">'p</span>&gt; {
    header: RecordHeader,
    pager: &amp;<span class="hljs-symbol">'p</span> <span class="hljs-keyword">mut</span> Pager,
    page_index: <span class="hljs-built_in">usize</span>,
    page_cell: <span class="hljs-built_in">usize</span>,
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'p</span>&gt; Cursor&lt;<span class="hljs-symbol">'p</span>&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">field</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, n: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;Value&gt; {
        <span class="hljs-keyword">let</span> record_field = <span class="hljs-keyword">self</span>.header.fields.get(n)?;

        <span class="hljs-keyword">let</span> payload = <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.pager.read_page(<span class="hljs-keyword">self</span>.page_index) {
            <span class="hljs-literal">Ok</span>(Page::TableLeaf(leaf)) =&gt; &amp;leaf.cells[<span class="hljs-keyword">self</span>.page_cell].payload,
            _ =&gt; <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>,
        };

        <span class="hljs-keyword">match</span> record_field.field_type {
            RecordFieldType::Null =&gt; <span class="hljs-literal">Some</span>(Value::Null),
            RecordFieldType::I8 =&gt; <span class="hljs-literal">Some</span>(Value::Int(read_i8_at(payload, record_field.offset))),
            RecordFieldType::I16 =&gt; <span class="hljs-literal">Some</span>(Value::Int(read_i16_at(payload, record_field.offset))),
            RecordFieldType::I24 =&gt; <span class="hljs-literal">Some</span>(Value::Int(read_i24_at(payload, record_field.offset))),
            RecordFieldType::I32 =&gt; <span class="hljs-literal">Some</span>(Value::Int(read_i32_at(payload, record_field.offset))),
            RecordFieldType::I48 =&gt; <span class="hljs-literal">Some</span>(Value::Int(read_i48_at(payload, record_field.offset))),
            RecordFieldType::I64 =&gt; <span class="hljs-literal">Some</span>(Value::Int(read_i64_at(payload, record_field.offset))),
            RecordFieldType::Float =&gt; <span class="hljs-literal">Some</span>(Value::Float(read_f64_at(payload, record_field.offset))),
            RecordFieldType::<span class="hljs-built_in">String</span>(length) =&gt; {
                <span class="hljs-keyword">let</span> value = std::<span class="hljs-built_in">str</span>::from_utf8(
                    &amp;payload[record_field.offset..record_field.offset + length],
                ).expect(<span class="hljs-string">"invalid utf8"</span>);
                <span class="hljs-literal">Some</span>(Value::<span class="hljs-built_in">String</span>(Cow::Borrowed(value)))
            }
            RecordFieldType::Blob(length) =&gt; {
                <span class="hljs-keyword">let</span> value = &amp;payload[record_field.offset..record_field.offset + length];
                <span class="hljs-literal">Some</span>(Value::Blob(Cow::Borrowed(value)))
            }
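            RecordFieldType::Zero =&gt; <span class="hljs-literal">Some</span>(Value::Int(<span class="hljs-number">0</span>)),
            RecordFieldType::One =&gt; <span class="hljs-literal">Some</span>(Value::Int(<span class="hljs-number">1</span>)),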
        }
    }
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_i8_at</span></span>(input: &amp;[<span class="hljs-built_in">u8</span>], offset: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">i64</span> {
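    <span class="hljs-comment">// Go through i8 first so that negative values are sign-extended.</span>
    input[offset] <span class="hljs-keyword">as</span> <span class="hljs-built_in">i8</span> <span class="hljs-keyword">as</span> <span class="hljs-built_in">i64</span>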
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_i16_at</span></span>(input: &amp;[<span class="hljs-built_in">u8</span>], offset: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">i64</span> {
    <span class="hljs-built_in">i16</span>::from_be_bytes(input[offset..offset + <span class="hljs-number">2</span>].try_into().unwrap()) <span class="hljs-keyword">as</span> <span class="hljs-built_in">i64</span>
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_i24_at</span></span>(input: &amp;[<span class="hljs-built_in">u8</span>], offset: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">i64</span> {
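    <span class="hljs-comment">// Place the 3 bytes in the high-order part of an i32; the arithmetic shift then sign-extends.</span>
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> bytes = [<span class="hljs-number">0u8</span>; <span class="hljs-number">4</span>];
    bytes[..<span class="hljs-number">3</span>].copy_from_slice(&amp;input[offset..offset + <span class="hljs-number">3</span>]);
    (<span class="hljs-built_in">i32</span>::from_be_bytes(bytes) &gt;&gt; <span class="hljs-number">8</span>) <span class="hljs-keyword">as</span> <span class="hljs-built_in">i64</span>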
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_i32_at</span></span>(input: &amp;[<span class="hljs-built_in">u8</span>], offset: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">i64</span> {
    <span class="hljs-built_in">i32</span>::from_be_bytes(input[offset..offset + <span class="hljs-number">4</span>].try_into().unwrap()) <span class="hljs-keyword">as</span> <span class="hljs-built_in">i64</span>
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_i48_at</span></span>(input: &amp;[<span class="hljs-built_in">u8</span>], offset: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">i64</span> {
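    <span class="hljs-comment">// Place the 6 bytes in the high-order part of an i64; the arithmetic shift then sign-extends.</span>
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> bytes = [<span class="hljs-number">0u8</span>; <span class="hljs-number">8</span>];
    bytes[..<span class="hljs-number">6</span>].copy_from_slice(&amp;input[offset..offset + <span class="hljs-number">6</span>]);
    <span class="hljs-built_in">i64</span>::from_be_bytes(bytes) &gt;&gt; <span class="hljs-number">16</span>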
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_i64_at</span></span>(input: &amp;[<span class="hljs-built_in">u8</span>], offset: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">i64</span> {
    <span class="hljs-built_in">i64</span>::from_be_bytes(input[offset..offset + <span class="hljs-number">8</span>].try_into().unwrap())
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">read_f64_at</span></span>(input: &amp;[<span class="hljs-built_in">u8</span>], offset: <span class="hljs-built_in">usize</span>) -&gt; <span class="hljs-built_in">f64</span> {
    <span class="hljs-built_in">f64</span>::from_be_bytes(input[offset..offset + <span class="hljs-number">8</span>].try_into().unwrap())
}
</code></pre>
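<p>The fixed-width integer readers all decode a big-endian two's-complement value, so they could also be expressed with a single generic helper. The following is only a sketch of that idea: <code>read_be_signed</code> is a hypothetical function that is not used elsewhere in this series, and it assumes the caller passes a slice of between 1 and 8 bytes.</p>
<pre><code class="lang-rust">/// Decodes a big-endian two's-complement integer stored on 1 to 8 bytes
/// (hypothetical helper).
fn read_be_signed(bytes: &amp;[u8]) -&gt; i64 {
    assert!((1..=8).contains(&amp;bytes.len()), "expected 1 to 8 bytes");

    // Start from all ones when the sign bit is set, so that the unused
    // high-order bits of the result are sign-extended.
    let mut value: i64 = if bytes[0] &amp; 0x80 != 0 { -1 } else { 0 };

    for &amp;byte in bytes {
        value = (value &lt;&lt; 8) | byte as i64;
    }

    value
}
</code></pre>
<p>For example, the three bytes <code>ff ff fe</code> decode to -2, while <code>00 01 00</code> decode to 256. Keeping one dedicated function per width, as done above, trades a little repetition for readers whose slice lengths are fixed and visible at each call site.</p>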
<p>To simplify iteration over a page's records, we'll also implement a <code>Scanner</code> struct that
wraps a page and allows us to get a <code>Cursor</code> for each record:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/cursor.rs</span>
<span class="hljs-meta">#[derive(Debug)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Scanner</span></span>&lt;<span class="hljs-symbol">'p</span>&gt; {
    pager: &amp;<span class="hljs-symbol">'p</span> <span class="hljs-keyword">mut</span> Pager,
    page: <span class="hljs-built_in">usize</span>,
    cell: <span class="hljs-built_in">usize</span>,
}

<span class="hljs-keyword">impl</span>&lt;<span class="hljs-symbol">'p</span>&gt; Scanner&lt;<span class="hljs-symbol">'p</span>&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">new</span></span>(pager: &amp;<span class="hljs-symbol">'p</span> <span class="hljs-keyword">mut</span> Pager, page: <span class="hljs-built_in">usize</span>) -&gt; Scanner&lt;<span class="hljs-symbol">'p</span>&gt; {
        Scanner {
            pager,
            page,
            cell: <span class="hljs-number">0</span>,
        }
    }
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">next_record</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;anyhow::<span class="hljs-built_in">Result</span>&lt;Cursor&gt;&gt; {
        <span class="hljs-keyword">let</span> page = <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span>.pager.read_page(<span class="hljs-keyword">self</span>.page) {
            <span class="hljs-literal">Ok</span>(page) =&gt; page,
            <span class="hljs-literal">Err</span>(e) =&gt; <span class="hljs-keyword">return</span> <span class="hljs-literal">Some</span>(<span class="hljs-literal">Err</span>(e)),
        };

        <span class="hljs-keyword">match</span> page {
            Page::TableLeaf(leaf) =&gt; {
                <span class="hljs-keyword">let</span> cell = leaf.cells.get(<span class="hljs-keyword">self</span>.cell)?;

                <span class="hljs-keyword">let</span> header = <span class="hljs-keyword">match</span> parse_record_header(&amp;cell.payload) {
                    <span class="hljs-literal">Ok</span>(header) =&gt; header,
                    <span class="hljs-literal">Err</span>(e) =&gt; <span class="hljs-keyword">return</span> <span class="hljs-literal">Some</span>(<span class="hljs-literal">Err</span>(e)),
                };

                <span class="hljs-keyword">let</span> record = Cursor {
                    header,
                    pager: <span class="hljs-keyword">self</span>.pager,
                    page_index: <span class="hljs-keyword">self</span>.page,
                    page_cell: <span class="hljs-keyword">self</span>.cell,
                };

                <span class="hljs-keyword">self</span>.cell += <span class="hljs-number">1</span>;

                <span class="hljs-literal">Some</span>(<span class="hljs-literal">Ok</span>(record))
            }
        }
    }
}
</code></pre>
<h2 id="heading-table-descriptions">Table descriptions</h2>
<p>With most of the legwork out of the way, we can get back to our original goal: listing tables.
SQLite stores the schema of a database in a special table called <code>sqlite_schema</code> (historically named <code>sqlite_master</code>).
The schema for the <code>sqlite_schema</code> table is as follows:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> sqlite_schema(
  <span class="hljs-keyword">type</span> <span class="hljs-built_in">text</span>,
  <span class="hljs-keyword">name</span> <span class="hljs-built_in">text</span>,
  tbl_name <span class="hljs-built_in">text</span>,
  rootpage <span class="hljs-built_in">integer</span>,
  <span class="hljs-keyword">sql</span> <span class="hljs-built_in">text</span>
);
</code></pre>
<p>These columns are used as follows:</p>
<ul>
<li><code>type</code>: the type of the schema object. For tables, this will always be <code>table</code>.</li>
<li><code>name</code>: the name of the schema object.</li>
<li><code>tbl_name</code>: the name of the table the schema object is associated with. In the case of tables, this will be the same
as <code>name</code>.</li>
<li><code>rootpage</code>: the root page of the table. We'll use it later to read the table's content.</li>
<li><code>sql</code>: the SQL statement used to create the table.</li>
</ul>
<p>Since our simple database only handles basic schemas for now, we can assume that the entire
schema fits in the first page of our database file.
In order to list the tables in the database, we'll need to:</p>
<ul>
<li>initialize the pager with the database file</li>
<li>create a <code>Scanner</code> for the first page</li>
<li>iterate over the records, and print the value of the <code>name</code> field (at index 1) for each record.</li>
</ul>
<p>First, we'll define a <code>Db</code> struct to hold our global state:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// src/db.rs</span>
<span class="hljs-keyword">use</span> std::{io::Read, path::Path};

<span class="hljs-keyword">use</span> anyhow::Context;

<span class="hljs-keyword">use</span> crate::{cursor::Scanner, page::DbHeader, pager, pager::Pager};

<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Db</span></span> {
    <span class="hljs-keyword">pub</span> header: DbHeader,
    pager: Pager,
}

<span class="hljs-keyword">impl</span> Db {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">from_file</span></span>(filename: <span class="hljs-keyword">impl</span> <span class="hljs-built_in">AsRef</span>&lt;Path&gt;) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Db&gt; {
        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> file = std::fs::File::open(filename.as_ref()).context(<span class="hljs-string">"open db file"</span>)?;

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> header_buffer = [<span class="hljs-number">0</span>; pager::HEADER_SIZE];
        file.read_exact(&amp;<span class="hljs-keyword">mut</span> header_buffer)
            .context(<span class="hljs-string">"read db header"</span>)?;

        <span class="hljs-keyword">let</span> header = pager::parse_header(&amp;header_buffer).context(<span class="hljs-string">"parse db header"</span>)?;

        <span class="hljs-keyword">let</span> pager = Pager::new(file, header.page_size <span class="hljs-keyword">as</span> <span class="hljs-built_in">usize</span>);

        <span class="hljs-literal">Ok</span>(Db { header, pager })
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">scanner</span></span>(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, page: <span class="hljs-built_in">usize</span>) -&gt; Scanner {
        Scanner::new(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>.pager, page)
    }
}
</code></pre>
<p>The implementation of a basic REPL supporting the <code>.tables</code> and <code>.exit</code> commands is straightforward:</p>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> std::io::{stdin, BufRead, Write};

<span class="hljs-keyword">use</span> anyhow::Context;

<span class="hljs-keyword">mod</span> cursor;
<span class="hljs-keyword">mod</span> db;
<span class="hljs-keyword">mod</span> page;
<span class="hljs-keyword">mod</span> pager;
<span class="hljs-keyword">mod</span> value;

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-keyword">let</span> database = db::Db::from_file(std::env::args().nth(<span class="hljs-number">1</span>).context(<span class="hljs-string">"missing db file"</span>)?)?;
    cli(database)
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">cli</span></span>(<span class="hljs-keyword">mut</span> db: db::Db) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    print_flushed(<span class="hljs-string">"rqlite&gt; "</span>)?;

    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> line_buffer = <span class="hljs-built_in">String</span>::new();

    <span class="hljs-keyword">while</span> stdin().lock().read_line(&amp;<span class="hljs-keyword">mut</span> line_buffer).is_ok() {
        <span class="hljs-keyword">match</span> line_buffer.trim() {
            <span class="hljs-string">".exit"</span> =&gt; <span class="hljs-keyword">break</span>,
            <span class="hljs-string">".tables"</span> =&gt; display_tables(&amp;<span class="hljs-keyword">mut</span> db)?,
            _ =&gt; {
                <span class="hljs-built_in">println!</span>(<span class="hljs-string">"Unrecognized command '{}'"</span>, line_buffer.trim());
            }
        }

        print_flushed(<span class="hljs-string">"\nrqlite&gt; "</span>)?;

        line_buffer.clear();
    }

    <span class="hljs-literal">Ok</span>(())
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">display_tables</span></span>(db: &amp;<span class="hljs-keyword">mut</span> db::Db) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> scanner = db.scanner(<span class="hljs-number">1</span>);

    <span class="hljs-keyword">while</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Some</span>(<span class="hljs-literal">Ok</span>(<span class="hljs-keyword">mut</span> record)) = scanner.next_record() {
        <span class="hljs-keyword">let</span> type_value = record
            .field(<span class="hljs-number">0</span>)
            .context(<span class="hljs-string">"missing type field"</span>)
            .context(<span class="hljs-string">"invalid type field"</span>)?;

        <span class="hljs-keyword">if</span> type_value.as_str() == <span class="hljs-literal">Some</span>(<span class="hljs-string">"table"</span>) {
            <span class="hljs-keyword">let</span> name_value = record
                .field(<span class="hljs-number">1</span>)
                .context(<span class="hljs-string">"missing name field"</span>)
                .context(<span class="hljs-string">"invalid name field"</span>)?;

            <span class="hljs-built_in">print!</span>(<span class="hljs-string">"{} "</span>, name_value.as_str().unwrap());
        }
    }

    <span class="hljs-literal">Ok</span>(())
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">print_flushed</span></span>(s: &amp;<span class="hljs-built_in">str</span>) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-built_in">print!</span>(<span class="hljs-string">"{}"</span>, s);
    std::io::stdout().flush().context(<span class="hljs-string">"flush stdout"</span>)
}
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The first part of our SQLite-compatible database is now complete. We can read the database header,
parse table btree-leaf pages and decode records, but we still have a long way to go before we can
support rich queries. In the next part, we'll learn how to parse the SQL language and make
our first strides towards implementing the <code>SELECT</code> statement!</p>
]]></content:encoded></item><item><title><![CDATA[Build an HTTP server with Rust and tokio - Part 1: serving static files]]></title><description><![CDATA[In this episode, we'll extend our server to serve static files. We'll also refactor our code to support connection reuse, and implement a graceful shutdown mechanism.
If you didn't follow the previous episode, you can find the code on GitHub.
As we ...]]></description><link>https://blog.sylver.dev/build-a-http-server-with-rust-and-tokio-part-1-serving-static-files</link><guid isPermaLink="true">https://blog.sylver.dev/build-a-http-server-with-rust-and-tokio-part-1-serving-static-files</guid><category><![CDATA[http]]></category><category><![CDATA[Rust]]></category><category><![CDATA[tokio]]></category><category><![CDATA[Web Development]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Sat, 20 May 2023 22:21:57 GMT</pubDate><content:encoded><![CDATA[<p>In this episode, we'll extend our server to serve static files. We'll also refactor our code to support connection reuse, and implement a graceful shutdown mechanism.</p>
<p>If you didn't follow the previous episode, you can find the code on <a target="_blank" href="https://github.com/geoffreycopin/http_server/tree/part_0">GitHub</a>.</p>
<p>As we will use new dependencies, we'll need to update our <code>Cargo.toml</code> file:</p>
<pre><code class="lang-bash">cargo add clap tokio-util futures
</code></pre>
<h2 id="heading-connection-reuse">Connection reuse</h2>
<p>Under HTTP/1.0, a separate TCP connection is established for each request/response pair. This is inefficient, as it requires a new TCP handshake for each request. HTTP/1.1 introduced connection reuse, which allows multiple requests to be sent over the same TCP connection. This mechanism is also necessary to support request pipelining, which we'll see in a future episode.</p>
<p>We'll keep waiting for new requests on the same connection until the client closes it, unless the client sets the <code>Connection: close</code> header. In that case, we'll close the connection after sending the response. All we have to do to implement this change is to wrap our client handling code in a loop:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// main.rs</span>

<span class="hljs-comment">// [...]</span>
info!(?addr, <span class="hljs-string">"new connection"</span>);

<span class="hljs-keyword">loop</span> {
    <span class="hljs-keyword">let</span> req = <span class="hljs-keyword">match</span> req::parse_request(&amp;<span class="hljs-keyword">mut</span> stream).<span class="hljs-keyword">await</span> {
        <span class="hljs-literal">Ok</span>(req) =&gt; {
            info!(?req, <span class="hljs-string">"incoming request"</span>);
            req
        }
        <span class="hljs-literal">Err</span>(e) =&gt; {
            error!(?e, <span class="hljs-string">"failed to parse request"</span>);
            <span class="hljs-keyword">break</span>;
        }
    };

    <span class="hljs-keyword">let</span> close_connection =
        req.headers.get(<span class="hljs-string">"Connection"</span>) == <span class="hljs-literal">Some</span>(&amp;<span class="hljs-string">"close"</span>.to_string());

    <span class="hljs-keyword">let</span> resp = resp::Response::from_html(
        resp::Status::NotFound,
        <span class="hljs-built_in">include_str!</span>(<span class="hljs-string">"../static/404.html"</span>),
    );

    resp.write(&amp;<span class="hljs-keyword">mut</span> stream).<span class="hljs-keyword">await</span>.unwrap();

    <span class="hljs-keyword">if</span> close_connection {
        <span class="hljs-keyword">break</span>;
    }
}
<span class="hljs-comment">// [...]</span>
</code></pre>
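<p>One detail worth keeping in mind: HTTP header names are case-insensitive, so a client may send <code>connection: Close</code> just as well as <code>Connection: close</code>. The comparison above only matches the exact spelling. Below is a minimal sketch of a more tolerant check, assuming the headers are stored in a <code>HashMap</code> of <code>String</code>s as in our <code>Request</code> type. The <code>wants_close</code> helper is hypothetical and not part of the server's code.</p>

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of the original server): returns true when
/// the client asked us to close the connection after the response.
/// The header name and the `close` token are compared case-insensitively.
pub fn wants_close(headers: &HashMap<String, String>) -> bool {
    headers.iter().any(|(name, value)| {
        name.eq_ignore_ascii_case("connection")
            // The header may carry a comma-separated list of tokens.
            && value
                .split(',')
                .any(|token| token.trim().eq_ignore_ascii_case("close"))
    })
}
```

<p>With such a helper, the loop could compute <code>close_connection</code> with <code>wants_close(&amp;req.headers)</code> instead of the exact string comparison.</p>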
<h2 id="heading-serving-static-files">Serving static files</h2>
<p>Answering every request with the same 'Not found' page was ok for a start, but the time has come to serve some real content. In this section, we'll implement a handler that serves static files from a directory. This directory will be either the current working directory, or a directory specified by the user.</p>
<p>As our CLI is getting more complex, we'll use the <a target="_blank" href="https://crates.io/crates/clap">clap</a> crate to parse the command line arguments.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// args.rs</span>
<span class="hljs-keyword">use</span> std::path::PathBuf;

<span class="hljs-keyword">use</span> clap::Parser;

<span class="hljs-meta">#[derive(Parser, Debug)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Args</span></span> {
    <span class="hljs-meta">#[arg(short, long, default_value_t = 8080)]</span>
    <span class="hljs-keyword">pub</span> port: <span class="hljs-built_in">u16</span>,
    <span class="hljs-meta">#[arg(short, long)]</span>
    <span class="hljs-keyword">pub</span> root: <span class="hljs-built_in">Option</span>&lt;PathBuf&gt;,
}

<span class="hljs-comment">// main.rs</span>
<span class="hljs-keyword">use</span> clap::Parser;

<span class="hljs-comment">// [...]</span>

<span class="hljs-meta">#[tokio::main]</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-keyword">let</span> args = args::Args::parse();
    <span class="hljs-keyword">let</span> port = args.port;
    <span class="hljs-keyword">let</span> listener = TcpListener::bind(<span class="hljs-built_in">format!</span>(<span class="hljs-string">"0.0.0.0:{port}"</span>)).<span class="hljs-keyword">await</span>.unwrap();
    <span class="hljs-comment">// [...]</span>
}
</code></pre>
<p>In order to simplify our handling code, we'll also refactor our <code>Response</code> struct by using a trait object instead of a generic type parameter. This will allow us to use the same <code>Response</code> type for all our handlers, even if they don't return the same type of response.</p>
<p>We'll also add a new <code>from_file</code> constructor to our <code>Response</code> type, which will allow us to create a response <code>struct</code> from a file on disk. At this stage, we will not implement any kind of sophisticated content negotiation, so we'll just set the <code>Content-Type</code> header to a MIME type based on the file extension.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// resp.rs</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Response</span></span> {
    <span class="hljs-keyword">pub</span> status: Status,
    <span class="hljs-keyword">pub</span> headers: HashMap&lt;<span class="hljs-built_in">String</span>, <span class="hljs-built_in">String</span>&gt;,
    <span class="hljs-keyword">pub</span> data: <span class="hljs-built_in">Box</span>&lt;<span class="hljs-keyword">dyn</span> AsyncRead + Unpin + <span class="hljs-built_in">Send</span>&gt;,
}

<span class="hljs-keyword">impl</span> Response {
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-keyword">pub</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">from_file</span></span>(path: &amp;Path, file: File) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Response&gt; {
        <span class="hljs-keyword">let</span> headers = hashmap! {
            <span class="hljs-string">"Content-Length"</span>.to_string() =&gt; file.metadata().<span class="hljs-keyword">await</span>?.len().to_string(),
            <span class="hljs-string">"Content-Type"</span>.to_string() =&gt; mime_type(path).to_string(),
        };

        <span class="hljs-literal">Ok</span>(Response {
            headers,
            status: Status::<span class="hljs-literal">Ok</span>,
            data: <span class="hljs-built_in">Box</span>::new(file),
        })
    } 
    <span class="hljs-comment">// [...]</span>
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">mime_type</span></span>(path: &amp;Path) -&gt; &amp;<span class="hljs-built_in">str</span> {
    <span class="hljs-keyword">match</span> path.extension().and_then(|ext| ext.to_str()) {
        <span class="hljs-literal">Some</span>(<span class="hljs-string">"html"</span>) =&gt; <span class="hljs-string">"text/html"</span>,
        <span class="hljs-literal">Some</span>(<span class="hljs-string">"css"</span>) =&gt; <span class="hljs-string">"text/css"</span>,
        <span class="hljs-literal">Some</span>(<span class="hljs-string">"js"</span>) =&gt; <span class="hljs-string">"text/javascript"</span>,
        <span class="hljs-literal">Some</span>(<span class="hljs-string">"png"</span>) =&gt; <span class="hljs-string">"image/png"</span>,
        <span class="hljs-literal">Some</span>(<span class="hljs-string">"jpg"</span>) =&gt; <span class="hljs-string">"image/jpeg"</span>,
        <span class="hljs-literal">Some</span>(<span class="hljs-string">"gif"</span>) =&gt; <span class="hljs-string">"image/gif"</span>,
        _ =&gt; <span class="hljs-string">"application/octet-stream"</span>,
    }
}
</code></pre>
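<p>Note that the lookup above is case-sensitive, so a file named <code>index.HTML</code> would be served as <code>application/octet-stream</code>. As a sketch of one possible refinement, the variant below lowercases the extension before matching and also accepts the <code>htm</code> and <code>jpeg</code> spellings. The name <code>mime_type_ci</code> and the two extra extensions are assumptions made for illustration only.</p>

```rust
use std::path::Path;

/// Hypothetical variant of `mime_type` that ignores the case of the extension.
pub fn mime_type_ci(path: &Path) -> &'static str {
    // Extract the extension as UTF-8 and normalize it to lowercase.
    let ext = path
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.to_ascii_lowercase());

    match ext.as_deref() {
        Some("html") | Some("htm") => "text/html",
        Some("css") => "text/css",
        Some("js") => "text/javascript",
        Some("png") => "image/png",
        Some("jpg") | Some("jpeg") => "image/jpeg",
        Some("gif") => "image/gif",
        // Unknown or missing extension: fall back to a generic binary type.
        _ => "application/octet-stream",
    }
}
```

<p>Only the last extension is considered, so a path such as <code>archive.tar.gz</code> still falls through to the default type.</p>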
<p>These preparatory steps allow us to implement our static file handler in a few lines of code:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// handler.rs</span>
<span class="hljs-keyword">use</span> std::{env::current_dir, io, path::PathBuf};

<span class="hljs-keyword">use</span> crate::{
    req::Request,
    resp::{Response, Status},
};

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">StaticFileHandler</span></span> {
    root: PathBuf,
}

<span class="hljs-keyword">impl</span> StaticFileHandler {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">in_current_dir</span></span>() -&gt; io::<span class="hljs-built_in">Result</span>&lt;StaticFileHandler&gt; {
        current_dir().map(StaticFileHandler::with_root)
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">with_root</span></span>(root: PathBuf) -&gt; StaticFileHandler {
        StaticFileHandler { root }
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">handle</span></span>(&amp;<span class="hljs-keyword">self</span>, request: Request) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Response&gt; {
        <span class="hljs-keyword">let</span> path = <span class="hljs-keyword">self</span>.root.join(request.path.strip_prefix(<span class="hljs-string">'/'</span>).unwrap());

        <span class="hljs-keyword">if</span> !path.is_file() {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(Response::from_html(
                Status::NotFound,
                <span class="hljs-built_in">include_str!</span>(<span class="hljs-string">"../static/404.html"</span>),
            ));
        }

        <span class="hljs-keyword">let</span> file = tokio::fs::File::open(&amp;path).<span class="hljs-keyword">await</span>?;
        Response::from_file(&amp;path, file).<span class="hljs-keyword">await</span>
    }
}


<span class="hljs-comment">// main.rs</span>

<span class="hljs-meta">#[tokio::main]</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-keyword">let</span> handler = args
        .root
        .map(handler::StaticFileHandler::with_root)
        .unwrap_or_else(|| {
            handler::StaticFileHandler::in_current_dir().expect(<span class="hljs-string">"failed to get current dir"</span>)
        });
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-keyword">match</span> handler.handle(req).<span class="hljs-keyword">await</span> {
        <span class="hljs-literal">Ok</span>(resp) =&gt; {
            resp.write(stream).<span class="hljs-keyword">await</span>.unwrap();
        }
        <span class="hljs-literal">Err</span>(e) =&gt; {
            error!(?e, <span class="hljs-string">"failed to handle request"</span>);
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-literal">false</span>);
        }
    };
    <span class="hljs-comment">// [...]</span>
}
</code></pre>
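<p>The handler above joins the request path to the root directory as-is. This means that a path containing <code>..</code> segments could point outside of the served directory, and that a path without a leading <code>/</code> would make the <code>unwrap</code> panic. The following is a minimal sketch of a stricter resolution step, assuming we simply want to reject such paths. The <code>resolve_path</code> helper is hypothetical, and it does not deal with percent-encoding or symbolic links.</p>

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper: maps a request path onto `root`, refusing anything
/// that is not a plain sequence of file or directory names.
pub fn resolve_path(root: &Path, request_path: &str) -> Option<PathBuf> {
    // Request paths are expected to start with '/'; reject anything else.
    let relative = request_path.strip_prefix('/')?;
    let mut resolved = root.to_path_buf();

    for component in Path::new(relative).components() {
        match component {
            Component::Normal(name) => resolved.push(name),
            Component::CurDir => {}
            // `..`, a root or a drive prefix could escape `root`.
            _ => return None,
        }
    }

    Some(resolved)
}
```

<p>A handler using this helper could answer with the 'Not found' page whenever it returns <code>None</code>, in the same way it does for paths that are not regular files.</p>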
<p>After copying the files we want to serve to the <code>static</code> directory, we can now serve them with our server:</p>
<pre><code class="lang-bash">cargo run -- --root static
</code></pre>
<p>Pressing <code>Ctrl+C</code> will stop the server, but in a rather abrupt way: the server will not log any message regarding the shutdown, and will not wait for pending requests to be processed before closing the connections. This is not a big deal for a toy server, but in a real-world application, we would want to handle this more gracefully.</p>
<h2 id="heading-graceful-shutdown">Graceful shutdown</h2>
<p>The most straightforward way to stop our server would be to collect the client handling tasks into a <code>JoinSet</code> and abort them when we receive a <code>SIGINT</code> signal. However, by doing so, we would have no way to wait for the pending requests to be processed before exiting the program.</p>
<p>Instead, we'll use a <code>CancellationToken</code> to signal that the server should:</p>
<ul>
<li><p>stop accepting new connections,</p>
</li>
<li><p>stop processing requests from the current connections</p>
</li>
</ul>
<p>We'll use the two methods provided by the <code>CancellationToken</code>:</p>
<ul>
<li><p><code>cancel</code>: requests cancellation. We call it from a dedicated task when the user presses <code>Ctrl+C</code></p>
</li>
<li><p><code>cancelled</code>: return a future that resolves when the cancellation is requested. Used in conjunction with <code>select!</code>, this will allow us to stop the client handling tasks when the cancellation is requested.</p>
</li>
</ul>
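<p>To make the idea of cooperative cancellation more concrete, here is a small analogy that deliberately uses only the standard library: plain threads and an <code>AtomicBool</code> instead of tokio tasks and a <code>CancellationToken</code>. It is not what our server does (the token is awaited inside <code>select!</code> rather than polled), and the <code>run_until_cancelled</code> function is hypothetical, but the shape is the same: workers check for cancellation between units of work, and the caller waits for them instead of aborting them.</p>

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawns `workers` threads that keep working until a shared flag is set,
/// requests cancellation, then waits for every worker to finish.
/// Returns the number of units of work completed by each worker.
pub fn run_until_cancelled(workers: usize) -> Vec<u32> {
    let cancelled = Arc::new(AtomicBool::new(false));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let cancelled = Arc::clone(&cancelled);
            thread::spawn(move || {
                let mut completed = 0u32;
                // The flag is only checked between units of work, so a unit
                // that has started always runs to completion.
                while !cancelled.load(Ordering::Acquire) {
                    thread::sleep(Duration::from_millis(1)); // stands in for a request
                    completed += 1;
                }
                completed
            })
        })
        .collect();

    thread::sleep(Duration::from_millis(20));
    // Plays the role of `cancel()`.
    cancelled.store(true, Ordering::Release);

    // Plays the role of joining the client tasks: wait instead of aborting.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker panicked"))
        .collect()
}
```

<p>The number of completed units depends on timing, but every worker is guaranteed to return once the flag is set.</p>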
<p>Our <code>main</code> function now looks like this:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// [...]</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-comment">// [...]</span>
    <span class="hljs-keyword">let</span> cancel_token = CancellationToken::new();

    tokio::spawn({
        <span class="hljs-keyword">let</span> cancel_token = cancel_token.clone();
        <span class="hljs-keyword">async</span> <span class="hljs-keyword">move</span> {
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Ok</span>(()) = signal::ctrl_c().<span class="hljs-keyword">await</span> {
                info!(<span class="hljs-string">"received Ctrl-C, shutting down"</span>);
                cancel_token.cancel();
            }
        }
    });
    <span class="hljs-comment">// [...]</span>

    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> tasks = <span class="hljs-built_in">Vec</span>::new();

    <span class="hljs-keyword">loop</span> {
        <span class="hljs-keyword">let</span> cancel_token = cancel_token.clone();

        tokio::<span class="hljs-built_in">select!</span> {
            <span class="hljs-literal">Ok</span>((stream, addr)) = listener.accept() =&gt; {
                <span class="hljs-keyword">let</span> handler = handler.clone();
                <span class="hljs-keyword">let</span> client_task = tokio::spawn(<span class="hljs-keyword">async</span> <span class="hljs-keyword">move</span> {
                    <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Err</span>(e) = handle_client(cancel_token, stream, addr, &amp;handler).<span class="hljs-keyword">await</span> {
                        error!(?e, <span class="hljs-string">"failed to handle client"</span>);
                    }
                });
                tasks.push(client_task);
            },
            _ = cancel_token.cancelled() =&gt; {
                info!(<span class="hljs-string">"stop listening"</span>);
                <span class="hljs-keyword">break</span>;
            }
        }
    }

    futures::future::join_all(tasks).<span class="hljs-keyword">await</span>;

    <span class="hljs-literal">Ok</span>(())
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">handle_client</span></span>(
    cancel_token: CancellationToken,
    stream: TcpStream,
    addr: SocketAddr,
    handler: &amp;handler::StaticFileHandler,
) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> stream = BufStream::new(stream);

    info!(?addr, <span class="hljs-string">"new connection"</span>);

    <span class="hljs-keyword">loop</span> {
        tokio::<span class="hljs-built_in">select!</span> {
            req = req::parse_request(&amp;<span class="hljs-keyword">mut</span> stream) =&gt; {
                <span class="hljs-keyword">match</span> req {
                    <span class="hljs-literal">Ok</span>(req) =&gt; {
                        info!(?req, <span class="hljs-string">"incoming request"</span>);
                        <span class="hljs-keyword">let</span> close_conn = handle_req(req, &amp;handler, &amp;<span class="hljs-keyword">mut</span> stream).<span class="hljs-keyword">await</span>?;
                        <span class="hljs-keyword">if</span> close_conn {
                            <span class="hljs-keyword">break</span>;
                        }
                    }
                    <span class="hljs-literal">Err</span>(e) =&gt; {
                        error!(?e, <span class="hljs-string">"failed to parse request"</span>);
                        <span class="hljs-keyword">break</span>;
                    }
                }
            }
            _ = cancel_token.cancelled() =&gt; {
                info!(?addr, <span class="hljs-string">"closing connection"</span>);
                <span class="hljs-keyword">break</span>;
            }
        }
    }

    <span class="hljs-literal">Ok</span>(())
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">handle_req</span></span>&lt;S: AsyncWrite + Unpin&gt;(
    req: req::Request,
    handler: &amp;handler::StaticFileHandler,
    stream: &amp;<span class="hljs-keyword">mut</span> S,
) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">bool</span>&gt; {
    <span class="hljs-keyword">let</span> close_connection = req.headers.get(<span class="hljs-string">"Connection"</span>) == <span class="hljs-literal">Some</span>(&amp;<span class="hljs-string">"close"</span>.to_string());

    <span class="hljs-keyword">match</span> handler.handle(req).<span class="hljs-keyword">await</span> {
        <span class="hljs-literal">Ok</span>(resp) =&gt; {
            resp.write(stream).<span class="hljs-keyword">await</span>.unwrap();
        }
        <span class="hljs-literal">Err</span>(e) =&gt; {
            error!(?e, <span class="hljs-string">"failed to handle request"</span>);
            <span class="hljs-keyword">return</span> <span class="hljs-literal">Ok</span>(<span class="hljs-literal">false</span>);
        }
    };

    <span class="hljs-literal">Ok</span>(close_connection)
}
</code></pre>
<p>We can now stop the server by pressing <code>Ctrl+C</code>, and the server will wait for the pending requests to be processed before exiting.</p>
<p>The full source code for this part is available <a target="_blank" href="https://github.com/geoffreycopin/http_server">here</a>.</p>
<p>Looking for a Rust dev? <a target="_blank" href="mailto:copin.geoffrey@gmail.com">Let's get in touch!</a></p>
]]></content:encoded></item><item><title><![CDATA[Build a web server with Rust and tokio - Part 0: a simple GET handler]]></title><description><![CDATA[Build a web server with Rust and tokio - Part 0: the simplest possible GET handler
Welcome to this series of blog posts where we will be exploring how to build a web server from scratch using the Rust programming language. We will be taking a hands-o...]]></description><link>https://blog.sylver.dev/build-a-web-server-with-rust-and-tokio-part-0-a-simple-get-handler</link><guid isPermaLink="true">https://blog.sylver.dev/build-a-web-server-with-rust-and-tokio-part-0-a-simple-get-handler</guid><category><![CDATA[Rust]]></category><category><![CDATA[tokio]]></category><category><![CDATA[http]]></category><category><![CDATA[async]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Thu, 11 May 2023 19:00:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683831573498/3039c91a-ce61-446c-9c90-ab799858acba.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-build-a-web-server-with-rust-and-tokio-part-0-the-simplest-possible-get-handler">Build a web server with Rust and tokio - Part 0: the simplest possible GET handler</h1>
<p>Welcome to this series of blog posts where we will be exploring how to build a web server from scratch using the Rust programming language. We will be taking a hands-on approach, maximizing our learning experience by using as few dependencies as possible and implementing as much logic as we can. This will enable us to understand the inner workings of a web server and the underlying protocols that it uses.</p>
<p>By the end of this tutorial, you will have a solid understanding of how to build a web server from scratch using Rust and the tokio library. So, let's dive in and get started on our journey!</p>
<p>In this first part, we'll be building a barebones web server that can only answer GET requests with a static Not Found response. This will give us a good starting point to build upon in the following tutorials.</p>
<h2 id="heading-setting-up-our-project">Setting up our project</h2>
<p>First, we need to create a new Rust project. We'll use the following crates:</p>
<ul>
<li><p><a target="_blank" href="https://docs.rs/tokio/1.28.0/tokio/">tokio</a>: async runtime</p>
</li>
<li><p><a target="_blank" href="https://docs.rs/anyhow/1.0.44/anyhow/">anyhow</a>: easy error handling</p>
</li>
<li><p><a target="_blank" href="https://docs.rs/maplit/1.0.2/maplit/">maplit</a>: macro for creating HashMaps</p>
</li>
<li><p><a target="_blank" href="https://docs.rs/tracing/0.1.27/tracing/">tracing</a>: structured logging</p>
</li>
<li><p><a target="_blank" href="https://docs.rs/tracing-subscriber/0.2.19/tracing_subscriber/">tracing-subscriber</a>: instrumentation</p>
</li>
</ul>
<pre><code class="lang-bash">cargo new webserver
cargo add tokio --features full
cargo add anyhow maplit tracing tracing-subscriber
</code></pre>
<h2 id="heading-anatomy-of-a-simple-get-request">Anatomy of a simple GET request</h2>
<p>In order to actually see what a GET request looks like, we'll set up a simple server listening on port 8080 that will print the incoming requests to the console. This can be done with <code>netcat</code>:</p>
<pre><code class="lang-bash">nc -l 8080
</code></pre>
<p>Now, if we open a new terminal and use <code>curl</code> to send a simple GET request to our server, we should see the following output:</p>
<p><img src="https://raw.githubusercontent.com/geoffreycopin/http_server/gh-pages/blog/img/coloured-get.png" alt /></p>
<p>Let's break down the request parts:</p>
<ul>
<li><p>the method: indicates the action to be performed on the resource. In this case, we are performing a GET request, which means we want to retrieve the resource</p>
</li>
<li><p>the path: uniquely identifies the resource. In this case, we are requesting the root path <code>/</code></p>
</li>
<li><p>the protocol: the protocol version. At this stage, we will always assume HTTP/1.1</p>
</li>
<li><p>the headers: a set of key-value pairs that provide additional information about the request. Our request contains the <code>Host</code> header, which indicates the host name of the server, the <code>User-Agent</code> header, which describes the client software that is making the request and the <code>Accept</code> header, which indicates the media types that are acceptable for the response. We'll go into more details about headers in a later tutorial</p>
</li>
</ul>
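<p>The first three parts all live on the first line of the request, separated by spaces. As a quick illustration, the sketch below splits such a line into its components. The <code>split_request_line</code> helper is hypothetical and only meant to show the layout of the line; the parser we write later in this post works on the incoming stream instead.</p>

```rust
/// Hypothetical helper: splits a request line such as `GET / HTTP/1.1`
/// into its (method, path, protocol) components.
pub fn split_request_line(line: &str) -> Option<(&str, &str, &str)> {
    // `split_whitespace` also discards the trailing "\r\n".
    let mut parts = line.split_whitespace();
    let method = parts.next()?;
    let path = parts.next()?;
    let protocol = parts.next()?;

    // Anything after the protocol means the line is malformed.
    if parts.next().is_some() {
        return None;
    }

    Some((method, path, protocol))
}
```

<p>A line with fewer or more than three components yields <code>None</code>.</p>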
<p>We'll use the following <code>struct</code> to represent requests in our code:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// req.rs</span>

<span class="hljs-meta">#[derive(Debug, Clone, Eq, PartialEq)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Request</span></span> {
    <span class="hljs-keyword">pub</span> method: Method,
    <span class="hljs-keyword">pub</span> path: <span class="hljs-built_in">String</span>,
    <span class="hljs-keyword">pub</span> headers: HashMap&lt;<span class="hljs-built_in">String</span>, <span class="hljs-built_in">String</span>&gt;,
}

<span class="hljs-meta">#[derive(Debug, Copy, Clone, Eq, PartialEq, Hash)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Method</span></span> {
    Get,
}
</code></pre>
<p>Parsing the request is just a matter of splitting the request string into lines. The first line contains the method, path and protocol separated by spaces. The following lines contain the headers, followed by an empty line.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// req.rs</span>
<span class="hljs-keyword">use</span> std::{collections::HashMap, hash::Hash};

<span class="hljs-keyword">use</span> tokio::io::{AsyncBufRead, AsyncBufReadExt};

<span class="hljs-comment">// [...]</span>

<span class="hljs-keyword">impl</span> TryFrom&lt;&amp;<span class="hljs-built_in">str</span>&gt; <span class="hljs-keyword">for</span> Method {
    <span class="hljs-class"><span class="hljs-keyword">type</span> <span class="hljs-title">Error</span></span> = anyhow::Error;

    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">try_from</span></span>(value: &amp;<span class="hljs-built_in">str</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;<span class="hljs-keyword">Self</span>, Self::Error&gt; {
        <span class="hljs-keyword">match</span> value {
            <span class="hljs-string">"GET"</span> =&gt; <span class="hljs-literal">Ok</span>(Method::Get),
            m =&gt; <span class="hljs-literal">Err</span>(anyhow::anyhow!(<span class="hljs-string">"unsupported method: {m}"</span>)),
        }
    }
}

<span class="hljs-keyword">pub</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">parse_request</span></span>(<span class="hljs-keyword">mut</span> stream: <span class="hljs-keyword">impl</span> AsyncBufRead + Unpin) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;Request&gt; {
    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> line_buffer = <span class="hljs-built_in">String</span>::new();
    stream.read_line(&amp;<span class="hljs-keyword">mut</span> line_buffer).<span class="hljs-keyword">await</span>?;

    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> parts = line_buffer.split_whitespace();

    <span class="hljs-keyword">let</span> method: Method = parts
        .next()
        .ok_or(anyhow::anyhow!(<span class="hljs-string">"missing method"</span>))
        .and_then(TryInto::try_into)?;

    <span class="hljs-keyword">let</span> path: <span class="hljs-built_in">String</span> = parts
        .next()
        .ok_or(anyhow::anyhow!(<span class="hljs-string">"missing path"</span>))
        .map(<span class="hljs-built_in">Into</span>::into)?;

    <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> headers = HashMap::new();

    <span class="hljs-keyword">loop</span> {
        line_buffer.clear();
        stream.read_line(&amp;<span class="hljs-keyword">mut</span> line_buffer).<span class="hljs-keyword">await</span>?;

        <span class="hljs-keyword">if</span> line_buffer.is_empty() || line_buffer == <span class="hljs-string">"\n"</span> || line_buffer == <span class="hljs-string">"\r\n"</span> {
            <span class="hljs-keyword">break</span>;
        }

        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> comps = line_buffer.splitn(<span class="hljs-number">2</span>, <span class="hljs-string">":"</span>);
        <span class="hljs-keyword">let</span> key = comps.next().ok_or(anyhow::anyhow!(<span class="hljs-string">"missing header name"</span>))?;
        <span class="hljs-keyword">let</span> value = comps
            .next()
            .ok_or(anyhow::anyhow!(<span class="hljs-string">"missing header value"</span>))?
            .trim();

        headers.insert(key.to_string(), value.to_string());
    }

    <span class="hljs-literal">Ok</span>(Request {
        method,
        path,
        headers,
    })
}
</code></pre>
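<p>One detail worth keeping in mind: a header value can itself contain a colon (for example <code>Host: localhost:8081</code>), so splitting the line on every <code>:</code> and keeping the second component drops everything after the second colon. Here is a minimal sketch of an alternative that splits on the first colon only. It uses the standard library's <code>str::split_once</code>; the <code>parse_header_line</code> helper is a hypothetical name used for illustration, not part of the server code above.</p>
<pre><code class="lang-rust">/// Hypothetical helper (not part of the server code): splits a raw header
/// line into (name, value) on the first ':' only, so that values containing
/// a ':' (such as "localhost:8081") stay intact.
fn parse_header_line(line: &amp;str) -&gt; Option&lt;(String, String)&gt; {
    let (key, value) = line.split_once(':')?;
    // trim() also removes the trailing "\r\n" left by read_line
    Some((key.trim().to_string(), value.trim().to_string()))
}

fn main() {
    let parsed = parse_header_line("Host: localhost:8081\r\n");
    println!("{parsed:?}");
}
</code></pre>
<p>Returning an <code>Option</code> keeps the sketch independent of any error type; in the request parser, a <code>None</code> could be turned into an error with <code>ok_or</code>, in the same way as the other missing components.</p>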
<h2 id="heading-accepting-connections">Accepting connections</h2>
<p>Now that we know how to parse a request, we can start accepting connections. Each time a new connection is established, we'll spawn a new task to handle it in order to keep the main thread free to accept new connections.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// main.rs</span>
<span class="hljs-keyword">use</span> tokio::{io::BufStream, net::TcpListener};
<span class="hljs-keyword">use</span> tracing::info;

<span class="hljs-keyword">mod</span> req;

<span class="hljs-keyword">static</span> DEFAULT_PORT: &amp;<span class="hljs-built_in">str</span> = <span class="hljs-string">"8080"</span>;

<span class="hljs-meta">#[tokio::main]</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
    <span class="hljs-comment">// Initialize the default tracing subscriber.</span>
    tracing_subscriber::fmt::init();

    <span class="hljs-keyword">let</span> port: <span class="hljs-built_in">u16</span> = std::env::args()
        .nth(<span class="hljs-number">1</span>)
        .unwrap_or_else(|| DEFAULT_PORT.to_string())
        .parse()?;

    <span class="hljs-keyword">let</span> listener = TcpListener::bind(<span class="hljs-built_in">format!</span>(<span class="hljs-string">"0.0.0.0:{port}"</span>)).<span class="hljs-keyword">await</span>?;

    info!(<span class="hljs-string">"listening on: {}"</span>, listener.local_addr()?);

    <span class="hljs-keyword">loop</span> {
        <span class="hljs-keyword">let</span> (stream, addr) = listener.accept().<span class="hljs-keyword">await</span>?;
        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> stream = BufStream::new(stream);

        <span class="hljs-comment">// do not block the main thread, spawn a new task</span>
        tokio::spawn(<span class="hljs-keyword">async</span> <span class="hljs-keyword">move</span> {
            info!(?addr, <span class="hljs-string">"new connection"</span>);

            <span class="hljs-keyword">match</span> req::parse_request(&amp;<span class="hljs-keyword">mut</span> stream).<span class="hljs-keyword">await</span> {
                <span class="hljs-literal">Ok</span>(req) =&gt; info!(?req, <span class="hljs-string">"incoming request"</span>),
                <span class="hljs-literal">Err</span>(e) =&gt; {
                    info!(?e, <span class="hljs-string">"failed to parse request"</span>);
                }
            }
        });
    }
}
</code></pre>
<p>We can now run our server on port <code>8081</code> with the following command: <code>cargo run -- 8081</code>. Sending a GET request to <code>localhost:8081</code> should print the following output:</p>
<pre><code class="lang-plaintext">INFO http_server: listening on: 0.0.0.0:8081
INFO http_server: new connection addr=127.0.0.1:49351
INFO http_server: incoming request req=Request { method: Get, path: "/", headers: {"Host": "localhost", "User-Agent": "curl/7.87.0", "Accept": "*/*"} }
</code></pre>
<h2 id="heading-sending-a-response">Sending a response</h2>
<p>At this stage, we'll answer every request with a static <code>Not found</code> page. Our response will have the following format:</p>
<p><img src="https://raw.githubusercontent.com/geoffreycopin/http_server/gh-pages/blog/img/coloured-response.png" alt /></p>
<p>Let's explore the different parts of the response:</p>
<ul>
<li><p>the status line: contains the protocol version, the status code and a human-readable status message</p>
</li>
<li><p>the response headers: encoded in the same way as for the request. Our response contains the <code>Content-Length</code> header, which specifies the length of the response body in bytes, and the <code>Content-Type</code> header, which indicates that the response body is encoded in HTML. The headers are followed by an empty line.</p>
</li>
<li><p>the response body: contains the actual data that will be displayed in the browser. We used an empty HTML document for brevity</p>
</li>
</ul>
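<p>To make the layout concrete, the three parts listed above can be assembled by hand. This is only a rough sketch using plain strings: the <code>raw_response</code> function and its parameters are hypothetical names chosen for illustration, and the header set is limited to the two headers mentioned above.</p>
<pre><code class="lang-rust">/// Hypothetical helper: assembles the raw text of a response from
/// a status (code and message), a content type and a body.
fn raw_response(status: &amp;str, content_type: &amp;str, body: &amp;str) -&gt; String {
    format!(
        // status line, headers, empty line, then the body
        "HTTP/1.1 {status}\r\nContent-Type: {content_type}\r\nContent-Length: {}\r\n\r\n{body}",
        body.len() // Content-Length is a number of bytes, not of characters
    )
}

fn main() {
    print!(
        "{}",
        raw_response("404 Not Found", "text/html", "&lt;html&gt;&lt;/html&gt;")
    );
}
</code></pre>
<p>Every line of the status and header sections ends with <code>\r\n</code>, and the extra <code>\r\n</code> before the body is the empty line that separates the headers from the body.</p>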
<p>We'll use the following <code>struct</code> to represent responses in our code:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// resp.rs</span>
<span class="hljs-keyword">use</span> std::{
    collections::HashMap,
    fmt::{Display, Formatter},
};

<span class="hljs-keyword">use</span> tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt};

<span class="hljs-meta">#[derive(Debug, Clone)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">Response</span></span>&lt;S: AsyncRead + Unpin&gt; {
    <span class="hljs-keyword">pub</span> status: Status,
    <span class="hljs-keyword">pub</span> headers: HashMap&lt;<span class="hljs-built_in">String</span>, <span class="hljs-built_in">String</span>&gt;,
    <span class="hljs-keyword">pub</span> data: S,
}

<span class="hljs-meta">#[derive(Debug, Copy, Clone, Eq, PartialEq, Hash)]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Status</span></span> {
    NotFound,
}

<span class="hljs-keyword">impl</span> Display <span class="hljs-keyword">for</span> Status {
    <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">fmt</span></span>(&amp;<span class="hljs-keyword">self</span>, f: &amp;<span class="hljs-keyword">mut</span> Formatter&lt;<span class="hljs-symbol">'_</span>&gt;) -&gt; std::fmt::<span class="hljs-built_in">Result</span> {
        <span class="hljs-keyword">match</span> <span class="hljs-keyword">self</span> {
            Status::NotFound =&gt; <span class="hljs-built_in">write!</span>(f, <span class="hljs-string">"404 Not Found"</span>),
        }
    }
}
</code></pre>
<p>The <code>data</code> field is generic over the type of the response body to account for future use cases where we might want to send a stream of data.</p>
<p>Creating a response from an HTML string is straightforward:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// resp.rs</span>
<span class="hljs-keyword">use</span> std::io::Cursor;
<span class="hljs-keyword">use</span> maplit::hashmap;

<span class="hljs-comment">// [..]</span>

<span class="hljs-keyword">impl</span> Response&lt;Cursor&lt;<span class="hljs-built_in">Vec</span>&lt;<span class="hljs-built_in">u8</span>&gt;&gt;&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">from_html</span></span>(status: Status, data: <span class="hljs-keyword">impl</span> <span class="hljs-built_in">ToString</span>) -&gt; <span class="hljs-keyword">Self</span> {
        <span class="hljs-keyword">let</span> bytes = data.to_string().into_bytes();

        <span class="hljs-keyword">let</span> headers = hashmap! {
            <span class="hljs-string">"Content-Type"</span>.to_string() =&gt; <span class="hljs-string">"text/html"</span>.to_string(),
            <span class="hljs-string">"Content-Length"</span>.to_string() =&gt; bytes.len().to_string(),
        };

        <span class="hljs-keyword">Self</span> {
            status,
            headers,
            data: Cursor::new(bytes),
        }
    }
}
</code></pre>
<p>Sending a response is a bit more involved. We'll use the <code>AsyncWrite</code> trait to write the response to a generic output stream.</p>
<pre><code class="lang-rust"><span class="hljs-comment">// resp.rs</span>
<span class="hljs-keyword">use</span> std::{
    collections::HashMap,
    fmt::{Display, Formatter},
    io::Cursor,
};

<span class="hljs-keyword">use</span> maplit::hashmap;
<span class="hljs-keyword">use</span> tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt};

<span class="hljs-comment">// [...]</span>

<span class="hljs-keyword">impl</span>&lt;S: AsyncRead + Unpin&gt; Response&lt;S&gt; {
    <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">status_and_headers</span></span>(&amp;<span class="hljs-keyword">self</span>) -&gt; <span class="hljs-built_in">String</span> {
        <span class="hljs-keyword">let</span> headers = <span class="hljs-keyword">self</span>
            .headers
            .iter()
            .map(|(k, v)| <span class="hljs-built_in">format!</span>(<span class="hljs-string">"{}: {}"</span>, k, v))
            .collect::&lt;<span class="hljs-built_in">Vec</span>&lt;_&gt;&gt;()
            .join(<span class="hljs-string">"\r\n"</span>);

        <span class="hljs-built_in">format!</span>(<span class="hljs-string">"HTTP/1.1 {}\r\n{headers}\r\n\r\n"</span>, <span class="hljs-keyword">self</span>.status)
    }

    <span class="hljs-keyword">pub</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">write</span></span>&lt;O: AsyncWrite + Unpin&gt;(<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>, stream: &amp;<span class="hljs-keyword">mut</span> O) -&gt; anyhow::<span class="hljs-built_in">Result</span>&lt;()&gt; {
        stream
            .write_all(<span class="hljs-keyword">self</span>.status_and_headers().as_bytes())
            .<span class="hljs-keyword">await</span>?;

        tokio::io::copy(&amp;<span class="hljs-keyword">mut</span> <span class="hljs-keyword">self</span>.data, stream).<span class="hljs-keyword">await</span>?;

        <span class="hljs-literal">Ok</span>(())
    }
}

</code></pre>
<h2 id="heading-puting-it-all-together">Putting it all together</h2>
<p>We'll use the following document as our <code>404</code> page:</p>
<pre><code class="lang-html"><span class="hljs-comment">&lt;!-- static/404.html --&gt;</span>
<span class="hljs-meta">&lt;!DOCTYPE <span class="hljs-meta-keyword">html</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">html</span> <span class="hljs-attr">lang</span>=<span class="hljs-string">"en"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">title</span>&gt;</span>Page Not Found<span class="hljs-tag">&lt;/<span class="hljs-name">title</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">style</span>&gt;</span><span class="css">
        <span class="hljs-selector-tag">body</span> {
            <span class="hljs-attribute">background-color</span>: <span class="hljs-number">#f8f8f8</span>;
            <span class="hljs-attribute">font-family</span>: Arial, sans-serif;
            <span class="hljs-attribute">font-size</span>: <span class="hljs-number">16px</span>;
            <span class="hljs-attribute">color</span>: <span class="hljs-number">#333</span>;
        }
        <span class="hljs-selector-class">.container</span> {
            <span class="hljs-attribute">max-width</span>: <span class="hljs-number">600px</span>;
            <span class="hljs-attribute">margin</span>: <span class="hljs-number">0</span> auto;
            <span class="hljs-attribute">padding</span>: <span class="hljs-number">40px</span> <span class="hljs-number">20px</span>;
            <span class="hljs-attribute">text-align</span>: center;
            <span class="hljs-attribute">border</span>: <span class="hljs-number">1px</span> solid <span class="hljs-number">#ddd</span>;
            <span class="hljs-attribute">border-radius</span>: <span class="hljs-number">5px</span>;
            <span class="hljs-attribute">background-color</span>: <span class="hljs-number">#fff</span>;
            <span class="hljs-attribute">box-shadow</span>: <span class="hljs-number">0</span> <span class="hljs-number">2px</span> <span class="hljs-number">4px</span> <span class="hljs-built_in">rgba</span>(<span class="hljs-number">0</span>,<span class="hljs-number">0</span>,<span class="hljs-number">0</span>,<span class="hljs-number">0.1</span>);
        }
        <span class="hljs-selector-tag">h1</span> {
            <span class="hljs-attribute">font-size</span>: <span class="hljs-number">48px</span>;
            <span class="hljs-attribute">margin-bottom</span>: <span class="hljs-number">20px</span>;
            <span class="hljs-attribute">color</span>: <span class="hljs-number">#333</span>;
        }
        <span class="hljs-selector-tag">p</span> {
            <span class="hljs-attribute">font-size</span>: <span class="hljs-number">24px</span>;
            <span class="hljs-attribute">margin-bottom</span>: <span class="hljs-number">40px</span>;
        }
    </span><span class="hljs-tag">&lt;/<span class="hljs-name">style</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">body</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"container"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>404<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>The page you are looking for could not be found.<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">html</span>&gt;</span>
</code></pre>
<p>We can now use our <code>Response</code> struct to send a <code>Not found</code> page to the client when we receive a request:</p>
<pre><code class="lang-rust"><span class="hljs-comment">// main.rs</span>
<span class="hljs-comment">// [...]</span>
<span class="hljs-keyword">let</span> resp = resp::Response::from_html(
    resp::Status::NotFound,
    <span class="hljs-built_in">include_str!</span>(<span class="hljs-string">"../static/404.html"</span>),
);

resp.write(&amp;<span class="hljs-keyword">mut</span> stream).<span class="hljs-keyword">await</span>.unwrap();
<span class="hljs-comment">// [...]</span>
</code></pre>
<p>Navigating to <code>localhost:8081</code> should now display our <code>Not found</code> page.</p>
<p>That's a good start, but we're still far from a fully functional web server. In the next part, we'll add support for serving static files. You can find the code for this part <a target="_blank" href="https://github.com/geoffreycopin/http_server">here</a>.</p>
<p>Looking for a Rust dev? <a target="_blank" href="mailto:copin.geoffrey@gmail.com">Let's get in touch!</a></p>
]]></content:encoded></item><item><title><![CDATA[Build a custom Python linter in 5 minutes]]></title><description><![CDATA[Creating a custom linter can be a great way to enforce coding standards and detect code smells. In this tutorial, we'll use Sylver, a source code query engine to build a custom Python linter in just a few lines of code.
Sylver's main interface is a R...]]></description><link>https://blog.sylver.dev/build-a-custom-python-linter-in-5-minutes</link><guid isPermaLink="true">https://blog.sylver.dev/build-a-custom-python-linter-in-5-minutes</guid><category><![CDATA[Python]]></category><category><![CDATA[Linter]]></category><category><![CDATA[coding]]></category><category><![CDATA[programming]]></category><category><![CDATA[static code analysis]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Fri, 20 Jan 2023 15:52:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1674229908747/17b74486-4d61-4682-bf0f-8fc322262305.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Creating a custom linter can be a great way to enforce coding standards and detect code smells. In this tutorial, we'll use Sylver, a source code query engine to build a custom Python linter in just a few lines of code.</p>
<p>Sylver's main interface is a REPL console, in which we can load the source code of our project to query it using a SQL-like query language called <a target="_blank" href="https://docs.sylver.dev/docs/dsl/sylq">SYLQ</a>. Once we have authored <code>SYLQ</code> queries expressing our linting rules, we'll be able to save them into a ruleset that can be run like a traditional linter.</p>
<h1 id="heading-installation">Installation</h1>
<p>If <code>sylver --version</code> doesn't output a version number &gt;= <code>0.2.2</code>, go to <a target="_blank" href="https://sylver.dev">https://sylver.dev</a> to download a fresh copy of the software.</p>
<h1 id="heading-project-setup">Project setup</h1>
<p>We'll use the following Python file to test our linting rules:</p>
<pre><code class="lang-python"><span class="hljs-comment">#main.py</span>
<span class="hljs-keyword">from</span> users.models <span class="hljs-keyword">import</span> *
<span class="hljs-keyword">from</span> auth.models <span class="hljs-keyword">import</span> check_password

foo = <span class="hljs-number">100</span>
O = <span class="hljs-number">100.0</span>

my_dict = {<span class="hljs-string">'hello'</span>: <span class="hljs-string">'world'</span>}

<span class="hljs-keyword">if</span> my_dict.has_key(<span class="hljs-string">'hello'</span>):
    print(<span class="hljs-string">'It works!'</span>)

<span class="hljs-keyword">if</span> <span class="hljs-string">'hello'</span> <span class="hljs-keyword">in</span> my_dict:
    print(<span class="hljs-string">'It works!'</span>)
</code></pre>
<h1 id="heading-starting-the-repl">Starting the REPL</h1>
<p>Starting the REPL is as simple as invoking the following command at the root of your project:</p>
<pre><code>sylver query --files=<span class="hljs-string">"src/**/*.py"</span> --language=python
</code></pre><p>The REPL can be exited by pressing <code>Ctrl+C</code> or typing <code>:quit</code> at the prompt.</p>
<p>We can now execute <code>SYLQ</code> queries by typing the code of the query, followed by a <code>;</code>.
For instance, to retrieve all the <code>if</code> statements (denoted by the node type <code>IfStatement</code>):</p>
<pre><code>match IfStatement;
</code></pre><p>The results of the query will be formatted as follows:</p>
<pre><code>$<span class="hljs-number">0</span> [IfStatement main.py:<span class="hljs-number">1</span>:<span class="hljs-number">9</span><span class="hljs-number">-23</span>:<span class="hljs-number">10</span>]
$<span class="hljs-number">1</span> [IfStatement main.py:<span class="hljs-number">1</span>:<span class="hljs-number">12</span><span class="hljs-number">-23</span>:<span class="hljs-number">13</span>]
</code></pre><p>The code of a given if statement can be displayed by typing <code>:print</code> followed by the node alias (for instance: <code>:print $1</code>). The parse tree can be displayed using the <code>:print_ast</code> command (for instance: <code>:print_ast $1</code>).</p>
<h2 id="heading-rule1-wildcard-imports-inspired-by-f403httpswwwflake8rulescomrulesf403html">Rule1: wildcard imports (inspired by <a target="_blank" href="https://www.flake8rules.com/rules/F403.html">F403</a>)</h2>
<p>This rule will flag all the imports of the form <code>from x import *</code>.</p>
<p>The first step is to get familiar with the tree structure of Python's import statements, so let's print an <code>ImportFromStatement</code> node along with its AST:</p>
<pre><code>λ&gt; match ImportFromStatement;

$<span class="hljs-number">2</span> [ImportFromStatement main.py:<span class="hljs-number">1</span>:<span class="hljs-number">1</span><span class="hljs-number">-27</span>:<span class="hljs-number">1</span>]
$<span class="hljs-number">3</span> [ImportFromStatement main.py:<span class="hljs-number">1</span>:<span class="hljs-number">2</span><span class="hljs-number">-39</span>:<span class="hljs-number">2</span>]

λ&gt; :print $<span class="hljs-number">2</span>

<span class="hljs-keyword">from</span> users.models <span class="hljs-keyword">import</span> *

λ&gt; :print_ast $<span class="hljs-number">2</span>

ImportFromStatement {
. ● module_name: DottedName {
. . Identifier { users }
. . Identifier { models }
. }
. WildcardImport { * }
}
</code></pre><p>It appears that the faulty part of the import statement (the wildcard: <code>*</code>) is represented by a <code>WildcardImport</code> node.
So this first rule can easily be expressed in <code>SYLQ</code>:</p>
<pre><code>match WildcardImport;
</code></pre><h2 id="heading-rule2-ambiguous-variable-name-inspired-by-e741httpswwwflake8rulescomrulese741html">Rule2: Ambiguous variable name (inspired by <a target="_blank" href="https://www.flake8rules.com/rules/E741.html">E741</a>)</h2>
<p>This style-oriented rule will detect variables named 'l', 'I' or 'O', as these names are easily confused with the digits 1 and 0.</p>
<p>Same as before, let's analyze the tree structure of an assignment:</p>
<pre><code>λ&gt; match Assignment;

$<span class="hljs-number">4</span> [Assignment main.py:<span class="hljs-number">1</span>:<span class="hljs-number">4</span><span class="hljs-number">-10</span>:<span class="hljs-number">4</span>]
$<span class="hljs-number">5</span> [Assignment main.py:<span class="hljs-number">1</span>:<span class="hljs-number">5</span><span class="hljs-number">-10</span>:<span class="hljs-number">5</span>]
$<span class="hljs-number">6</span> [Assignment main.py:<span class="hljs-number">1</span>:<span class="hljs-number">7</span><span class="hljs-number">-29</span>:<span class="hljs-number">7</span>]

λ&gt; :print_ast $<span class="hljs-number">5</span>

Assignment {
. ● left: Identifier { O }
. ● right: Float { <span class="hljs-number">100.0</span> }
}
</code></pre><p>The variable's <code>Identifier</code> can be accessed through the <code>left</code> field of the <code>Assignment</code> node. We can match the <code>Identifier</code>'s text against a regex
by using the builtin <code>matches</code> method:</p>
<pre><code>match a@Assignment when a.left.text.matches(<span class="hljs-string">`^(I|O|l)$`</span>);
</code></pre><p>Here the <code>Assignment</code> node is bound to <code>a</code> using the binding operator: <code>@</code>.</p>
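<p>The <code>^</code> and <code>$</code> anchors matter here: the regex has to match the whole identifier, so a name such as <code>Ok</code> is not flagged. As a rough illustration of the accepted set, here is the same predicate written in plain Rust rather than <code>SYLQ</code>; the <code>is_ambiguous</code> function is a hypothetical stand-in and has nothing to do with how Sylver evaluates the query.</p>
<pre><code class="lang-rust">/// Hypothetical stand-in for the regex `^(I|O|l)$`:
/// true only when the whole name is exactly "I", "O" or "l".
fn is_ambiguous(name: &amp;str) -&gt; bool {
    matches!(name, "I" | "O" | "l")
}

fn main() {
    for name in ["O", "foo", "Ok", "l"] {
        println!("{name}: {}", is_ambiguous(name));
    }
}
</code></pre>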
<h2 id="heading-rule3-haskey-is-deprecated-inspired-by-w601httpswwwflake8rulescomrulesw601html">Rule3: <code>has_key()</code> is deprecated (inspired by <a target="_blank" href="https://www.flake8rules.com/rules/W601.html">W601</a>)</h2>
<p>This rule signals uses of the deprecated dictionary <code>has_key</code> method.</p>
<p>Here is the tree representation of a call to <code>has_key</code>:</p>
<pre><code>Call {
. ● <span class="hljs-function"><span class="hljs-keyword">function</span>: <span class="hljs-title">Attribute</span> </span>{
. . ● object: Identifier { my_dict }
. . ● attribute: Identifier { has_key }
. }
. ● <span class="hljs-built_in">arguments</span>: ArgumentList {
. . String { <span class="hljs-string">'hello'</span> }
. }
}
</code></pre><p>This query can be expressed using nested patterns, as follows:</p>
<pre><code>match Call(<span class="hljs-function"><span class="hljs-keyword">function</span>: <span class="hljs-title">Attribute</span>(<span class="hljs-params">attribute: <span class="hljs-string">'has_key'</span></span>));</span>
</code></pre><h2 id="heading-creating-the-ruleset">Creating the ruleset</h2>
<p>The following ruleset uses our linting rules:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">id:</span> <span class="hljs-string">customRules</span>

<span class="hljs-attr">language:</span> <span class="hljs-string">python</span>

<span class="hljs-attr">rules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">F403</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>
      <span class="hljs-attr">message:</span> <span class="hljs-string">"wildcard import"</span>
      <span class="hljs-attr">note:</span> <span class="hljs-string">"wildcard imports are discouraged because the programmer often won’t know where an imported object is defined"</span>

      <span class="hljs-attr">query:</span> <span class="hljs-string">&gt;
        match WildcardImport
</span>


    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">E741</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>
      <span class="hljs-attr">message:</span> <span class="hljs-string">"ambiguous variable name"</span>
      <span class="hljs-attr">note:</span> <span class="hljs-string">"variables named I, O and l can be very hard to read"</span>

      <span class="hljs-attr">query:</span> <span class="hljs-string">&gt;
        match a@Assignment when a.left.text.matches(`^(I|O|l)$`)
</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">W601</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>
      <span class="hljs-attr">message:</span> <span class="hljs-string">".has_key() is deprecated"</span>
      <span class="hljs-attr">note:</span> <span class="hljs-string">"'.has_key()' was deprecated in Python 2. It is recommended to use the 'in' operator instead"</span>

      <span class="hljs-attr">query:</span> <span class="hljs-string">&gt;</span>
        <span class="hljs-string">match</span> <span class="hljs-string">Call(function:</span> <span class="hljs-string">Attribute(attribute:</span> <span class="hljs-string">'has_key'</span><span class="hljs-string">))</span>
</code></pre>
<p>Assuming that it is stored in a file called <code>ruleset.yaml</code> at the root of our project, we can run it with the following command:</p>
<pre><code>sylver ruleset run --files <span class="hljs-string">"**/*.py"</span> --rulesets ruleset.yaml
</code></pre><h1 id="heading-getting-updates">Getting updates</h1>
<p>For more information about new features and/or cool <code>SYLQ</code> one-liners, connect with Sylver on <a target="_blank" href="https://twitter.com/Geoffrey198">Twitter</a> or <a target="_blank" href="https://discord.gg/PaVTgTSSxu">Discord</a>!</p>
]]></content:encoded></item><item><title><![CDATA[Build a custom Javascript linter in 5 minutes]]></title><description><![CDATA[Creating a custom linter can be a great way to enforce coding standards and detect code smells. In this tutorial, we'll use Sylver, a source code query engine to build a custom Javascript linter in just a few lines of code.
Sylver's main interface is...]]></description><link>https://blog.sylver.dev/build-a-custom-javascript-linter-in-5-minutes</link><guid isPermaLink="true">https://blog.sylver.dev/build-a-custom-javascript-linter-in-5-minutes</guid><category><![CDATA[JavaScript]]></category><category><![CDATA[JSX]]></category><category><![CDATA[Linter]]></category><category><![CDATA[linters]]></category><category><![CDATA[static code analysis]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Thu, 24 Nov 2022 17:39:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1669311511035/nTGtzgT-N.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Creating a custom linter can be a great way to enforce coding standards and detect code smells. In this tutorial, we'll use Sylver, a source code query engine to build a custom Javascript linter in just a few lines of code.</p>
<p>Sylver's main interface is a REPL console, in which we can load the source code of our project to query it using a SQL-like query language called <code>SYLQ</code>. Once we have authored <code>SYLQ</code> queries expressing our linting rules, we'll be able to save them into a ruleset that can be run like a traditional linter.</p>
<h1 id="heading-installation">Installation</h1>
<p>If <code>sylver --version</code> doesn't output a version number &gt;= <code>0.1.9</code>, go to <a target="_blank" href="https://sylver.dev">https://sylver.dev</a> to download a fresh copy of the software.</p>
<h1 id="heading-starting-the-repl">Starting the REPL</h1>
<p>Starting the REPL is as simple as invoking the following command at the root of your project:</p>
<pre><code>sylver query --files=<span class="hljs-string">"src/**/*.js"</span> --spec=https:<span class="hljs-comment">//github.com/sylver-dev/javascript.git#javascript.yaml</span>
</code></pre><p>The REPL can be exited by pressing <code>Ctrl+C</code> or typing <code>:quit</code> at the prompt.</p>
<p>We can now execute <code>SYLQ</code> queries by typing the code of the query, followed by a <code>;</code>.
For instance, to retrieve all the method definitions (denoted by the node type <code>MethodDefinition</code>):</p>
<pre><code>match MethodDefinition;
</code></pre><p>The results of the query will be formatted as follows:</p>
<pre><code>[...]
$<span class="hljs-number">0</span> [MethodDefinition src/store/createArticles.js:<span class="hljs-number">36</span>:<span class="hljs-number">5</span><span class="hljs-number">-38</span>:<span class="hljs-number">5</span>]
$<span class="hljs-number">1</span> [MethodDefinition src/store/createArticles.js:<span class="hljs-number">39</span>:<span class="hljs-number">5</span><span class="hljs-number">-41</span>:<span class="hljs-number">5</span>]
$<span class="hljs-number">2</span> [MethodDefinition src/store/createArticles.js:<span class="hljs-number">42</span>:<span class="hljs-number">5</span><span class="hljs-number">-59</span>:<span class="hljs-number">5</span>]
$<span class="hljs-number">3</span> [MethodDefinition src/store/createArticles.js:<span class="hljs-number">60</span>:<span class="hljs-number">5</span><span class="hljs-number">-77</span>:<span class="hljs-number">5</span>]
$<span class="hljs-number">4</span> [MethodDefinition src/store/createArticles.js:<span class="hljs-number">78</span>:<span class="hljs-number">5</span><span class="hljs-number">-83</span>:<span class="hljs-number">5</span>]
$<span class="hljs-number">5</span> [MethodDefinition src/store/createArticles.js:<span class="hljs-number">84</span>:<span class="hljs-number">5</span><span class="hljs-number">-89</span>:<span class="hljs-number">5</span>]
[...]
</code></pre><p>The code of a given method definition can be displayed by typing <code>:print</code> followed by the node alias (for instance: <code>:print $3</code>). The parse tree can be displayed using the <code>:print_ast</code> command (for instance: <code>:print_ast $3</code>).</p>
<h2 id="heading-rule1-use-of-the-operator">Rule1: use of the <code>==</code> operator</h2>
<p>For our first rule, we'd like to detect uses of the unsafe <code>==</code> operator for checking equality.
The first step is to get familiar with the tree structure of Javascript's binary expressions, so let's print a <code>BinaryExpression</code> node along with its AST:</p>
<pre><code>λ&gt; match BinaryExpression;

[...]
$<span class="hljs-number">43</span> [BinaryExpression src/pages/Article/Comments.js:<span class="hljs-number">7</span>:<span class="hljs-number">31</span><span class="hljs-number">-7</span>:<span class="hljs-number">77</span>]
[...]

λ&gt; :print $<span class="hljs-number">43</span>

currentUser.username == comment.author.username

λ&gt; :print_ast $<span class="hljs-number">43</span>

BinaryExpression {
. ● left: MemberExpression {
. . ● object: Identifier { currentUser }
. . ● property: Identifier { username }
. }
. ● operator: EqEq { == }
. ● right: MemberExpression {
. . ● object: MemberExpression {
. . . ● object: Identifier { comment }
. . . ● property: Identifier { author }
. . }
. . ● property: Identifier { username }
. }
}
</code></pre><p>It appears that the nodes violating our rule are the <code>BinaryExpression</code> nodes
for which the <code>operator</code> field contains an <code>EqEq</code> node.
This can be easily expressed in <code>SYLQ</code>:</p>
<pre><code>match BinaryExpression(operator: EqEq);
</code></pre><h2 id="heading-rule2-functions-with-too-many-parameters">Rule2: functions with too many parameters</h2>
<p>For our second linting rule, we'd like to identify functions that have more than
6 parameters.</p>
<p>Here is the relevant part of the parse tree of a <code>Function</code> node:</p>
<pre><code><span class="hljs-built_in">Function</span> {
. ● <span class="hljs-keyword">async</span>: AsyncModifier { <span class="hljs-keyword">async</span> }
. ● name: Identifier { send }
. ● parameters: FormalParameters {
. . ● params: List {
. . . FormalParameter {
. . . . ● value: Identifier { method }
. . . }
. . . FormalParameter {
. . . . ● value: Identifier { url }
. . . }
. . . FormalParameter {
. . . . ● value: Identifier { data }
. . . }
. . . FormalParameter {
. . . . ● value: Identifier { resKey }
. . . }
. . }
. }
. ● body: StatementBlock {
[...]
</code></pre><p>Function parameters are represented by <code>FormalParameters</code> nodes with a <code>params</code> field containing the actual function parameters. In our query, the condition
regarding the length of the <code>params</code> list can be specified in a <code>when</code> clause, as follows:</p>
<pre><code>match f@FormalParameters when f.params.length &gt; <span class="hljs-number">6</span>;
</code></pre><h2 id="heading-rule3-jsx-img-elements-without-an-alt-attribute">Rule3: JSX 'img' elements without an 'alt' attribute</h2>
<p>For our last rule, we'd like to identify <code>&lt;img&gt;</code> elements that lack an <code>alt</code> attribute. <code>img</code> elements are self-closing, so we'll start by looking at the parse tree of a <code>JsxSelfClosingElement</code> node:</p>
<pre><code>λ&gt; match JsxSelfClosingElement;
[...]
$<span class="hljs-number">73</span> [JsxSelfClosingElement src/pages/Article/Comments.js:<span class="hljs-number">21</span>:<span class="hljs-number">11</span><span class="hljs-number">-21</span>:<span class="hljs-number">55</span>]
[...]

λ&gt; :print $<span class="hljs-number">73</span>

&lt;img src={image} <span class="hljs-class"><span class="hljs-keyword">class</span></span>=<span class="hljs-string">"comment-author-img"</span>/&gt;

λ&gt; :print_ast $<span class="hljs-number">73</span>

JsxSelfClosingElement {
. ● name: Identifier { img }
. ● attribute: List {
. . JsxAttribute {
. . . ● name: Identifier { src }
. . . ● value: JsxExpression {
. . . . Identifier { image }
. . . }
. . }
. . JsxAttribute {
. . . ● name: Identifier { <span class="hljs-class"><span class="hljs-keyword">class</span> }
. . . ● <span class="hljs-title">value</span>: <span class="hljs-title">String</span> </span>{ <span class="hljs-string">"comment-author-img"</span> }
. . }
. }
}
</code></pre><p>In order to find the <code>img</code> elements that have no <code>JsxAttribute</code> named <code>alt</code> in their <code>attribute</code> list, we can use a <a target="_blank" href="https://sylver.dev/docs/dsl/sylq#list-quantifying-expressions">list quantifying expression</a>, as illustrated in the following query:</p>
<pre><code>match j@JsxSelfClosingElement(name: <span class="hljs-string">"img"</span>) 
      when no j.attribute match JsxAttribute(name: <span class="hljs-string">"alt"</span>);
</code></pre><h2 id="heading-creating-the-ruleset">Creating the ruleset</h2>
<p>The following ruleset uses our linting rules:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">id:</span> <span class="hljs-string">customLinter</span>

<span class="hljs-attr">language:</span> <span class="hljs-string">"https://github.com/sylver-dev/javascript.git#javascript.yaml"</span>

<span class="hljs-attr">rules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">unsafeEq</span>
      <span class="hljs-attr">message:</span> <span class="hljs-string">equality</span> <span class="hljs-string">comparison</span> <span class="hljs-string">with</span> <span class="hljs-string">`==`</span> <span class="hljs-string">operator</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>

      <span class="hljs-attr">query:</span> <span class="hljs-string">"match BinaryExpression(operator: EqEq)"</span>


    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">tooManyParams</span>
      <span class="hljs-attr">message:</span> <span class="hljs-string">function</span> <span class="hljs-string">has</span> <span class="hljs-string">too</span> <span class="hljs-string">many</span> <span class="hljs-string">parameters</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>
      <span class="hljs-attr">note:</span> <span class="hljs-string">According</span> <span class="hljs-string">to</span> <span class="hljs-string">our</span> <span class="hljs-string">style</span> <span class="hljs-string">guide,</span> <span class="hljs-string">functions</span> <span class="hljs-string">should</span> <span class="hljs-string">have</span> <span class="hljs-string">at</span> <span class="hljs-string">most</span> <span class="hljs-number">6</span> <span class="hljs-string">parameters.</span>

      <span class="hljs-attr">query:</span> <span class="hljs-string">match</span> <span class="hljs-string">f@FormalParameters</span> <span class="hljs-string">when</span> <span class="hljs-string">f.params.length</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">6</span>


    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">missingAlt</span>
      <span class="hljs-attr">message:</span> <span class="hljs-string">&lt;img&gt;</span> <span class="hljs-string">tags</span> <span class="hljs-string">should</span> <span class="hljs-string">have</span> <span class="hljs-string">an</span> <span class="hljs-string">"alt"</span> <span class="hljs-string">attribute</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>

      <span class="hljs-attr">query:</span>
        <span class="hljs-string">"match j@JsxSelfClosingElement(name: 'img')
              when no j.attribute match JsxAttribute(name: 'alt')"</span>
</code></pre>
<p>Assuming that it is stored in a file called <code>custom_linter.yaml</code> at the root of our project, we can run it with the following command:</p>
<pre><code>sylver ruleset run --files=<span class="hljs-string">"src/**/*.js"</span> --rulesets=custom_linter.yaml
</code></pre><h1 id="heading-getting-updates">Getting updates</h1>
<p>For more information about new features and/or cool <code>SYLQ</code> one-liners, connect with Sylver on <a target="_blank" href="https://twitter.com/Geoffrey198">Twitter</a> or <a target="_blank" href="https://discord.gg/PaVTgTSSxu">Discord</a>!</p>
]]></content:encoded></item><item><title><![CDATA[Build a custom Go linter in 5 minutes]]></title><description><![CDATA[Creating a custom linter can be a great way to enforce coding standards and detect code smells. In this tutorial, we'll use Sylver, a source code query engine, to build a custom Golang linter in just a few lines of code.
Sylver's main interface is a...]]></description><link>https://blog.sylver.dev/build-a-custom-go-linter-in-5-minutes</link><guid isPermaLink="true">https://blog.sylver.dev/build-a-custom-go-linter-in-5-minutes</guid><category><![CDATA[Go Language]]></category><category><![CDATA[Linter]]></category><category><![CDATA[static analysis]]></category><category><![CDATA[Query]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Fri, 23 Sep 2022 14:18:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1663942405819/tGDzRT-ex.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Creating a custom linter can be a great way to enforce coding standards and detect code smells. In this tutorial, we'll use Sylver, a source code query engine, to build a custom Golang linter in just a few lines of code.</p>
<p>Sylver's main interface is a REPL console, in which we can load the source code of our project to query it using a SQL-like query language called <code>SYLQ</code>. Once we have authored <code>SYLQ</code> queries expressing our linting rules, we'll be able to save them into a ruleset that can be run like a traditional linter.</p>
<h1 id="heading-installation">Installation</h1>
<p>If <code>sylver --version</code> doesn't output a version number &gt;= <code>0.1.8</code>, go to <a target="_blank" href="https://sylver.dev">https://sylver.dev</a> to download a fresh copy of the software.</p>
<h1 id="heading-starting-the-repl">Starting the REPL</h1>
<p>Starting the REPL is as simple as invoking the following command at the root of your project:</p>
<pre><code>sylver query --files=<span class="hljs-string">"**/*.go"</span> --spec=https:<span class="hljs-comment">//github.com/sylver-dev/golang.git#golang.yaml</span>
</code></pre><p>The REPL can be exited by pressing <code>Ctrl+C</code> or typing <code>:quit</code> at the prompt.</p>
<p>We can now execute <code>SYLQ</code> queries by typing the code of the query, followed by a <code>;</code>.
For instance, to retrieve all the struct declarations:</p>
<pre><code>match StructType;
</code></pre><p>The results of the query will be formatted as follows:</p>
<pre><code>[...]
$<span class="hljs-number">359</span> [StructType association.go:<span class="hljs-number">323</span>:<span class="hljs-number">17</span><span class="hljs-number">-327</span>:<span class="hljs-number">1</span>]
$<span class="hljs-number">360</span> [StructType schema/index.go:<span class="hljs-number">10</span>:<span class="hljs-number">12</span><span class="hljs-number">-18</span>:<span class="hljs-number">1</span>]
$<span class="hljs-number">361</span> [StructType schema/index.go:<span class="hljs-number">20</span>:<span class="hljs-number">18</span><span class="hljs-number">-27</span>:<span class="hljs-number">1</span>]
$<span class="hljs-number">362</span> [StructType tests/group_by_test.go:<span class="hljs-number">70</span>:<span class="hljs-number">12</span><span class="hljs-number">-73</span>:<span class="hljs-number">2</span>]
$<span class="hljs-number">363</span> [StructType schema/check.go:<span class="hljs-number">11</span>:<span class="hljs-number">12</span><span class="hljs-number">-15</span>:<span class="hljs-number">1</span>]
</code></pre><p>The code of a given struct declaration can be displayed by typing <code>:print</code> followed by the node alias (for instance: <code>:print $362</code>). The parse tree can be displayed using the <code>:print_ast</code> command (for instance: <code>:print_ast $362</code>).</p>
<h2 id="heading-rule1-detect-struct-declarations-with-too-many-fields">Rule1: detect struct declarations with too many fields</h2>
<p>For our first rule, we'd like to flag struct declarations that have more than 10 fields.
The first step is to get familiar with the tree structure of struct declarations, so let's print a <code>StructType</code> along with its AST:</p>
<pre><code>λ&gt; :print $<span class="hljs-number">362</span>

struct {
        Name  string
        Total int64
    }

λ&gt; :print_ast $<span class="hljs-number">362</span>

StructType {
. ● fields: List&lt;FieldSpec&gt; {
. . FieldSpec {
. . . ● names: List&lt;Identifier&gt; {
. . . . Identifier { Name }
. . . }
. . . ● type: TypeIdent {
. . . . ● name: Identifier { string }
. . . }
. . }
. . FieldSpec {
. . . ● names: List&lt;Identifier&gt; {
. . . . Identifier { Total }
. . . }
. . . ● type: TypeIdent {
. . . . ● name: Identifier { int64 }
. . . }
. . }
. }
}
</code></pre><p>The fields of the struct are stored in a field aptly named <code>fields</code> that holds a list of <code>FieldSpec</code> nodes. This means that the nodes violating our rule are all the <code>StructType</code> nodes for which the <code>fields</code> list has a length greater than 10.
This can be easily expressed in <code>SYLQ</code>:</p>
<pre><code> match StructType s when s.fields.length &gt; <span class="hljs-number">10</span>;
</code></pre><h2 id="heading-rule2-suggest-the-usage-of-assignment-operators">Rule2: suggest the usage of assignment operators</h2>
<p>For our second linting rule, we'd like to identify assignments that could be simplified by using an assignment operator (like <code>+=</code>) such as: </p>
<pre><code class="lang-go">x = x + <span class="hljs-number">1</span>
</code></pre>
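<p>To make the target pattern concrete, here is a small, self-contained Go sketch of the two equivalent forms. The function names <code>plainAssign</code> and <code>compoundAssign</code> are hypothetical and only serve the illustration:</p>
<pre><code class="lang-go">package main

import "fmt"

// plainAssign uses the form we want to flag: the assigned variable is also
// the left operand of the binary expression on the right-hand side.
func plainAssign(x int) int {
	x = x + 1
	return x
}

// compoundAssign performs the same update with an assignment operator.
func compoundAssign(x int) int {
	x += 1
	return x
}

func main() {
	// Both forms compute the same value.
	fmt.Println(plainAssign(41), compoundAssign(41))
}
</code></pre>
<p>Only the first function contains an assignment that our rule should report.</p>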
<p>Let's explore the parse tree of a simple assignment:</p>
<pre><code>λ&gt; :print $<span class="hljs-number">5750</span>

err = nil

λ&gt; :print_ast $<span class="hljs-number">5750</span>

AssignStmt {
. ● lhs: List&lt;Expr&gt; {
. . Identifier { err }
. }
. ● rhs: List&lt;Expr&gt; {
. . NilLit { nil }
. }
}
</code></pre><p>So we want to retrieve the <code>AssignStmt</code> nodes for which the <code>rhs</code> field contains a <code>BinOp</code> whose left operand is identical to the expression in <code>lhs</code>. Also, the left-hand side of the assignment must contain a single expression. This can be written as:</p>
<pre><code>match AssignStmt a when
      a.lhs.length == <span class="hljs-number">1</span>
   &amp;&amp; a.rhs[<span class="hljs-number">0</span>] is { BinOp b when b.left.text == a.lhs[<span class="hljs-number">0</span>].text };
</code></pre><h2 id="heading-rule3-incorrect-usage-of-the-make-builtin-function">Rule3: incorrect usage of the <code>make</code> builtin function</h2>
<p>For our last linting rule, we want to identify incorrect usage of the <code>make</code> function, where the length is higher than the capacity, as this probably indicates a programming error.</p>
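<p>As a reminder of how the two numeric arguments of <code>make</code> relate, here is a minimal Go sketch. The helper <code>tryMake</code> is hypothetical; it receives the length and capacity as parameters so that the check happens at run time, and it turns the resulting panic into an error:</p>
<pre><code class="lang-go">package main

import "fmt"

// tryMake is a hypothetical helper: it calls make with the given length and
// capacity and converts a run-time panic into an error.
func tryMake(length, capacity int) (s []string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("make failed: %v", r)
		}
	}()
	return make([]string, length, capacity), nil
}

func main() {
	// A valid call: an empty slice with room for 8 elements.
	s, err := tryMake(0, 8)
	fmt.Println(len(s), cap(s), err)

	// An invalid call: the length exceeds the capacity.
	_, err = tryMake(8, 2)
	fmt.Println(err != nil)
}
</code></pre>
<p>The second call is the kind of mistake our rule is meant to catch when both arguments are written as integer literals.</p>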
<p>Here is the parse tree of a call to <code>make</code>:</p>
<pre><code>λ&gt; :print $<span class="hljs-number">16991</span>

make([]string, <span class="hljs-number">0</span>, len(value))

λ&gt; :print_ast $<span class="hljs-number">16991</span>

CallExpr {
. ● fun: Identifier { make }
. ● args: List&lt;GoNode&gt; {
. . SliceType {
. . . ● elemsType: TypeIdent {
. . . . ● name: Identifier { string }
. . . }
. . }
. . IntLit { <span class="hljs-number">0</span> }
. . CallExpr {
. . . ● fun: Identifier { len }
. . . ● args: List&lt;GoNode&gt; {
. . . . Identifier { value }
. . . }
. . }
. }
}
</code></pre><p>Here are the conditions that violating nodes will meet:</p>
<ul>
<li>The text of <code>fun</code> is <code>make</code></li>
<li>The args list contains 3 elements</li>
<li>The last two arguments are int literals</li>
<li>The third argument (capacity) is smaller than the second (length)</li>
</ul>
<p>Let's encode this in <code>SYLQ</code>:</p>
<pre><code>match CallExpr c when
      c.fun.text == <span class="hljs-string">'make'</span>
   &amp;&amp; c.args.length == <span class="hljs-number">3</span>
   &amp;&amp; c.args[<span class="hljs-number">1</span>] is IntLit
   &amp;&amp; c.args[<span class="hljs-number">2</span>] is IntLit
   &amp;&amp; c.args[<span class="hljs-number">2</span>].text.to_int() &lt; c.args[<span class="hljs-number">1</span>].text.to_int();
</code></pre><h2 id="heading-creating-the-ruleset">Creating the ruleset</h2>
<p>The following ruleset uses our linting rules:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">id:</span> <span class="hljs-string">customLinter</span>

<span class="hljs-attr">language:</span> <span class="hljs-string">"https://github.com/sylver-dev/golang.git#golang.yaml"</span>

<span class="hljs-attr">rules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">largeStruct</span>
      <span class="hljs-attr">message:</span> <span class="hljs-string">struct</span> <span class="hljs-string">has</span> <span class="hljs-string">many</span> <span class="hljs-string">fields</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>

      <span class="hljs-attr">query:</span>  <span class="hljs-string">match</span> <span class="hljs-string">StructType</span> <span class="hljs-string">s</span> <span class="hljs-string">when</span> <span class="hljs-string">s.fields.length</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">10</span>


    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">assignOp</span>
      <span class="hljs-attr">message:</span> <span class="hljs-string">assignment</span> <span class="hljs-string">should</span> <span class="hljs-string">use</span> <span class="hljs-string">an</span> <span class="hljs-string">assignment</span> <span class="hljs-string">operator</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>
      <span class="hljs-attr">note:</span> <span class="hljs-string">According</span> <span class="hljs-string">to</span> <span class="hljs-string">our</span> <span class="hljs-string">style</span> <span class="hljs-string">guide,</span> <span class="hljs-string">assignment</span> <span class="hljs-string">operators</span> <span class="hljs-string">should</span> <span class="hljs-string">be</span> <span class="hljs-string">preferred.</span>

      <span class="hljs-attr">query:</span> <span class="hljs-string">&gt;
        match AssignStmt a when
             a.lhs.length == 1
          &amp;&amp; a.rhs[0] is { BinOp b when b.left.text == a.lhs[0].text }
</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">makeCapacityErr</span> 
      <span class="hljs-attr">message:</span> <span class="hljs-string">capacity</span> <span class="hljs-string">should</span> <span class="hljs-string">not</span> <span class="hljs-string">be</span> <span class="hljs-string">lower</span> <span class="hljs-string">than</span> <span class="hljs-string">length</span>
      <span class="hljs-attr">category:</span> <span class="hljs-string">bug</span>

      <span class="hljs-attr">query:</span> <span class="hljs-string">&gt;
        match CallExpr c when
              c.fun.text == 'make'
          &amp;&amp; c.args.length == 3
          &amp;&amp; c.args[1] is IntLit
          &amp;&amp; c.args[2] is IntLit
          &amp;&amp; c.args[2].text.to_int() &lt; c.args[1].text.to_int()</span>
</code></pre>
<p>Assuming that it is stored in a file called <code>custom_linter.yaml</code> at the root of our project, we can run it with the following command:</p>
<pre><code>sylver ruleset run --files=<span class="hljs-string">"**/*.go"</span> --rulesets=custom_linter.yaml
</code></pre>]]></content:encoded></item><item><title><![CDATA[Building a JSON validator with Sylver - Part3/3 : From queries to analyzer]]></title><description><![CDATA[In 
Part1 and
Part2 of the
series, we learned how to build a language spec and how to use Sylver's query
language to explore the parse tree of our JSON documents.
While it can be insightful to explore a codebase interactively through source-code
quer...]]></description><link>https://blog.sylver.dev/building-a-json-validator-with-sylver-part33-from-queries-to-analyzer</link><guid isPermaLink="true">https://blog.sylver.dev/building-a-json-validator-with-sylver-part33-from-queries-to-analyzer</guid><category><![CDATA[Linter]]></category><category><![CDATA[json]]></category><category><![CDATA[static analysis]]></category><category><![CDATA[static code analysis]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Tue, 30 Aug 2022 14:03:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1661936929288/Y-uxxSje1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 
<a target="_blank" href="https://blog.sylver.dev/getting-started-with-sylver-part13-building-a-json-parser-in-49-lines-of-code">Part1</a> and
<a target="_blank" href="https://blog.sylver.dev/building-a-json-validator-with-sylver-part23-intuitive-json-ast-queries">Part2</a> of the
series, we learned how to build a language spec and how to use Sylver's query
language to explore the parse tree of our JSON documents.</p>
<p>While it can be insightful to explore a codebase interactively through source-code
queries, it's not the most practical way to perform source code verification. In this
tutorial, we'll learn how to package the queries we built in the last part
into a <em>ruleset</em> to use them in a linter-like fashion.</p>
<p>If you have already installed <code>sylver</code> and <code>sylver --version</code> doesn't output
a version number &gt;= <code>0.1.4</code>, please go to <a target="_blank" href="https://sylver.dev">https://sylver.dev</a> to download a fresh
copy of the software.</p>
<h2 id="heading-prelude">Prelude</h2>
<p>We'll reuse two files from the last tutorial:</p>
<ul>
<li>json.syl</li>
</ul>
<pre><code>node JsonNode { }

node Null: JsonNode { }

node Bool: JsonNode { }

node <span class="hljs-built_in">Number</span>: JsonNode { }

node <span class="hljs-built_in">String</span>: JsonNode { }

node <span class="hljs-built_in">Array</span>: JsonNode { 
    <span class="hljs-attr">elems</span>: List&lt;JsonNode&gt; 
}

node <span class="hljs-built_in">Object</span>: JsonNode {
    <span class="hljs-attr">members</span>: List&lt;Member&gt;
}

node Member: JsonNode {
    <span class="hljs-attr">key</span>: <span class="hljs-built_in">String</span>,
    <span class="hljs-attr">value</span>: JsonNode
}

term COMMA = <span class="hljs-string">','</span>
term COLON = <span class="hljs-string">':'</span>
term L_BRACE = <span class="hljs-string">'{'</span>
term R_BRACE = <span class="hljs-string">'}'</span>
term L_BRACKET = <span class="hljs-string">'['</span>
term R_BRACKET = <span class="hljs-string">']'</span>
term NULL = <span class="hljs-string">'null'</span>

term BOOL_LIT = <span class="hljs-string">`true|false`</span>
term NUMBER_LIT = <span class="hljs-string">`\-?(0|([1-9][0-9]*))(\.[0-9]+)?((e|E)(\+|-)?[0-9]+)?`</span>
term STRING_LIT = <span class="hljs-string">`"([^"\\]|(\\[\\/bnfrt"])|(\\u[a-fA-F0-9]{4}))*"`</span>


ignore term WHITESPACE = <span class="hljs-string">`\s`</span>

rule string = <span class="hljs-built_in">String</span> { STRING_LIT }

rule member = Member { key@string COLON value@main }

rule main =
    Null { NULL }
  | <span class="hljs-built_in">Number</span> { NUMBER_LIT }
  | Bool { BOOL_LIT }
  | string
  | <span class="hljs-built_in">Array</span> { L_BRACKET elems@sepBy(COMMA, main) R_BRACKET }
  | <span class="hljs-built_in">Object</span> { L_BRACE members@sepBy(COMMA, member) R_BRACE }
</code></pre><ul>
<li>invalid_config.json</li>
</ul>
<pre><code class="lang-json">{
    <span class="hljs-attr">"variables"</span>: [
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"date of birt`"</span>,
            <span class="hljs-attr">"description"</span>: <span class="hljs-string">"Customer's date of birth"</span>,
            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"datetime"</span>
        },
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"activity"</span>,
            <span class="hljs-attr">"description"</span>: <span class="hljs-string">"A short text describing the customer's profession"</span>,
            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>
        },
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"country"</span>,
            <span class="hljs-attr">"description"</span>: <span class="hljs-string">"Customer's country of residence"</span>,
            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>,
            <span class="hljs-attr">"values"</span>: [<span class="hljs-string">"us"</span>, <span class="hljs-string">"fr"</span>, <span class="hljs-string">"it"</span> ]
        }
    ]
}
</code></pre>
<h2 id="heading-stepping-out-of-the-repl">Stepping out of the REPL</h2>
<h3 id="heading-creating-a-ruleset">Creating a ruleset</h3>
<p>Packaging the rules from the previous tutorial into a reusable ruleset is as simple as
creating the following <a target="_blank" href="https://yaml.org">YAML</a> file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">id:</span> <span class="hljs-string">'JSON ruleset'</span>
<span class="hljs-attr">language:</span> <span class="hljs-string">json.syl</span>

<span class="hljs-attr">rules:</span> 
  <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">variable_length</span>
    <span class="hljs-attr">message:</span> <span class="hljs-string">Variable</span> <span class="hljs-string">description</span> <span class="hljs-string">is</span> <span class="hljs-string">too</span> <span class="hljs-string">long</span>
    <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>
    <span class="hljs-attr">query:</span> <span class="hljs-string">&gt;
      match String desc when desc.text.length &gt; 37 &amp;&amp; desc.parent is {
        Member m when m.key.text == '"description"'
      }  
</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">variable_format</span> 
    <span class="hljs-attr">message:</span> <span class="hljs-string">Variable</span> <span class="hljs-string">name</span> <span class="hljs-string">isn't</span> <span class="hljs-string">a</span> <span class="hljs-string">lowercase</span> <span class="hljs-string">word</span>
    <span class="hljs-attr">category:</span> <span class="hljs-string">style</span>
    <span class="hljs-attr">query:</span> <span class="hljs-string">&gt;
      match String s when !s.text.matches(`"[a-z]+"`) &amp;&amp; s.parent is {
        Member m when m.key.text == '"name"'
      }    
</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">types_or_values</span>
    <span class="hljs-attr">message:</span> <span class="hljs-string">Fields</span> <span class="hljs-string">'type'</span> <span class="hljs-string">and</span> <span class="hljs-string">'values'</span> <span class="hljs-string">are</span> <span class="hljs-string">mutually</span> <span class="hljs-string">exclusive</span>    
    <span class="hljs-attr">category:</span> <span class="hljs-string">error</span>
    <span class="hljs-attr">note:</span> <span class="hljs-string">The</span> <span class="hljs-string">type</span> <span class="hljs-string">can</span> <span class="hljs-string">be</span> <span class="hljs-string">deduced</span> <span class="hljs-string">from</span> <span class="hljs-string">the</span> <span class="hljs-string">values</span> <span class="hljs-string">list.</span>
    <span class="hljs-attr">query:</span> <span class="hljs-string">&gt;
      match Object n when
        any n.members.children match {  
            Member m when m.key.text == '"type"' 
        }
        &amp;&amp; any n.members.children match { 
            Member m when m.key.text == '"values"' 
        }</span>
</code></pre>
<p>Where <code>id</code> is a human-readable description of the ruleset, and <code>language</code> refers to
a language spec file.</p>
<p>The following properties describe the individual rules composing the ruleset:</p>
<ul>
<li>id: unique and short name of the rule</li>
<li>message: a concise description of the issue</li>
<li>category: error, bug, smell, style</li>
<li>query: inline query</li>
<li>note: optional additional information</li>
</ul>
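<p>For intuition about what two of these rules flag, here is a plain Go sketch that is independent of Sylver. It is only an approximation under stated assumptions: the <code>variable</code> type, the <code>check</code> function, the anchored regular expression, and the abbreviated sample document are all hypothetical stand-ins, not part of the ruleset itself.</p>
<pre><code class="lang-go">package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// variable mirrors one entry of the "variables" array. Type is a pointer and
// Values a slice so that an absent key stays nil after decoding.
type variable struct {
	Name   string   `json:"name"`
	Type   *string  `json:"type"`
	Values []string `json:"values"`
}

// lowercaseWord is a stand-in for the pattern used by variable_format.
var lowercaseWord = regexp.MustCompile(`^[a-z]+$`)

// check returns the ids of the rules that v would violate.
func check(v variable) []string {
	var ids []string
	if !lowercaseWord.MatchString(v.Name) {
		ids = append(ids, "variable_format")
	}
	if v.Type != nil {
		if v.Values != nil {
			ids = append(ids, "types_or_values")
		}
	}
	return ids
}

func main() {
	// Abbreviated, hypothetical sample in the shape of invalid_config.json.
	doc := `{"variables": [
		{"name": "date of birth", "type": "datetime"},
		{"name": "country", "type": "string", "values": ["us", "fr", "it"]}
	]}`
	cfg := new(struct {
		Variables []variable `json:"variables"`
	})
	if err := json.Unmarshal([]byte(doc), cfg); err != nil {
		panic(err)
	}
	for _, v := range cfg.Variables {
		fmt.Println(v.Name, check(v))
	}
}
</code></pre>
<p>The ruleset expresses the same checks declaratively, as queries over the parse tree, without any hand-written traversal code.</p>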
<p>Assuming that our ruleset file is called <code>ruleset.yaml</code>, we can run this ruleset on every <code>.json</code> file in the current directory by invoking the following command:</p>
<pre><code class="lang-bash">sylver ruleset run --files <span class="hljs-string">"*.json"</span> --rulesets ruleset.yaml
</code></pre>
<h3 id="heading-storing-our-project-configuration">Storing our project configuration</h3>
<p>If we wish to validate our codebase against multiple rulesets, repeating the above
command for every ruleset can be tedious. Instead, we can write a project configuration
in a <code>sylver.yaml</code> file at the root of our project:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">subprojects:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">language:</span> <span class="hljs-string">json.syl</span>
    <span class="hljs-attr">rulesets:</span> [<span class="hljs-string">'ruleset.yaml'</span>]
    <span class="hljs-attr">include:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">'./**/*.json'</span>
</code></pre>
<p>The configuration contains a list of subprojects, each having a language, an optional list of rulesets, and a list of files to include.</p>
<p>Invoking <code>sylver check</code> will read the config from <code>sylver.yaml</code> and run the
specified rulesets.</p>
<h2 id="heading-git-integration">Git integration</h2>
<p>Should you want to reuse your language specs or rulesets in several projects, copying your <code>.syl</code> and <code>.yaml</code> files into every project would be inconvenient. Luckily,
rulesets and project configurations can refer to artifacts stored in a Git repository.</p>
<p>The language spec and ruleset for this tutorial have been uploaded to <a target="_blank" href="https://github.com/geoffreycopin/getting_started_json_tutorial">this repo</a>, so if we rewrite our <code>sylver.yaml</code> config file as:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">subprojects:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">language:</span> 
      <span class="hljs-attr">repo:</span> <span class="hljs-string">https://github.com/geoffreycopin/getting_started_json_tutorial</span>
      <span class="hljs-attr">file:</span> <span class="hljs-string">json.syl</span>
    <span class="hljs-attr">rulesets:</span> 
      <span class="hljs-bullet">-</span> <span class="hljs-attr">repo:</span> <span class="hljs-string">https://github.com/geoffreycopin/getting_started_json_tutorial</span>
        <span class="hljs-attr">file:</span> <span class="hljs-string">'ruleset.yaml'</span>
    <span class="hljs-attr">include:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">'./**/*.json'</span>
</code></pre>
<p>the language spec and ruleset will be cloned automatically in the <code>.sylver</code> directory
when running <code>sylver check</code>.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>We now have a reusable linter for our JSON configuration files built from
scratch using Sylver's DSL.</p>
<p>The next tutorial will use a pre-built <a target="_blank" href="https://go.dev/">Golang</a> language spec to write a general-purpose Go linter.</p>
]]></content:encoded></item><item><title><![CDATA[Building a JSON validator with Sylver - Part2/3 : Intuitive JSON AST queries]]></title><description><![CDATA[In Part 1,
we used Sylver's meta language to build a specification for the
JSON format. But an AST, by itself, is not of much use.
In this next tutorial, we'll continue building our JSON configuration validator.
To this end, we'll learn how to use Sy...]]></description><link>https://blog.sylver.dev/building-a-json-validator-with-sylver-part23-intuitive-json-ast-queries</link><guid isPermaLink="true">https://blog.sylver.dev/building-a-json-validator-with-sylver-part23-intuitive-json-ast-queries</guid><category><![CDATA[Linter]]></category><category><![CDATA[static code analysis]]></category><category><![CDATA[static analysis]]></category><category><![CDATA[json]]></category><category><![CDATA[json parser]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Sat, 20 Aug 2022 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1661024496879/-2KtfU856.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://blog.sylver.dev/getting-started-with-sylver-part1-building-a-json-parser-in-49-lines-of-code">Part 1</a>,
we used <a target="_blank" href="https://sylver.dev">Sylver</a>'s meta language to build a specification for the
JSON format. But an AST, by itself, is not of much use.
In this next tutorial, we'll continue building our JSON configuration validator.
To this end, we'll learn how to use Sylver's query REPL (Read Eval Print Loop) to
identify the parts of our JSON code that do not comply with a set of increasingly complex
rules. In the next and last part, we'll learn how to package queries into a rule set
to share and reuse them easily.</p>
<p>If you have already installed <code>sylver</code> and <code>sylver --version</code> doesn't output
a version number &gt;= <code>0.1.3</code>, please go to <a target="_blank" href="https://sylver.dev">https://sylver.dev</a> to download a fresh copy of the software.</p>
<h2 id="heading-prelude">Prelude</h2>
<p>We will reuse the <code>json.syl</code> spec file built in the previous tutorial.
Since the <code>config.json</code> file that we used to test our parser contains a valid
configuration, we'll need to create a new JSON document (<code>invalid_config.json</code>) containing 
an incorrect configuration:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"variables"</span>: [
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"date of birth"</span>,
            <span class="hljs-attr">"description"</span>: <span class="hljs-string">"Customer's date of birth"</span>,
            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"datetime"</span>
        },
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"activity"</span>,
            <span class="hljs-attr">"description"</span>: <span class="hljs-string">"A short text describing the customer's profession"</span>,
            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>
        },
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"country"</span>,
            <span class="hljs-attr">"description"</span>: <span class="hljs-string">"Customer's country of residence"</span>,
            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>,
            <span class="hljs-attr">"values"</span>: [<span class="hljs-string">"us"</span>, <span class="hljs-string">"fr"</span>, <span class="hljs-string">"it"</span> ]
        }
    ]
}
</code></pre>
<p>This file represents the configuration of an imaginary application that manages a database of
customers. Each customer profile is described using a configurable set of variables.</p>
<p>We want to validate these variable declarations using the following rules:</p>
<ol>
<li>Variable descriptions must contain at most 35 characters</li>
<li>The name of a variable must be a single lowercase word</li>
<li>If the <code>values</code> field is set, the <code>type</code> field should be absent (as it
matches the type of the values)</li>
</ol>
<p>Let's parse this file and start a REPL to query our AST!</p>
<h2 id="heading-basic-repl-usage">Basic REPL usage</h2>
<p>Loading the AST of <code>invalid_config.json</code> is as simple as invoking:</p>
<pre><code>sylver query <span class="hljs-operator">-</span><span class="hljs-operator">-</span>spec<span class="hljs-operator">=</span>json.syl <span class="hljs-operator">-</span><span class="hljs-operator">-</span>files<span class="hljs-operator">=</span>invalid_config.json
</code></pre><p>The <code>files</code> argument accepts one or several file names or quoted glob patterns.</p>
<p>You can exit the REPL by typing <code>:quit</code> at the prompt.</p>
<p><code>:print invalid_config.json</code> and <code>:print_ast invalid_config.json</code> can be used to visualize
one of the loaded files, or the corresponding AST.</p>
<h2 id="heading-query-language-basics">Query language basics</h2>
<p>Syntax queries are of the form <code>match &lt;NodeType&gt; &lt;binding&gt;? (when &lt;boolean expression&gt;)?</code>.
The node binding and the <code>when [...]</code> clause are optional.
<code>NodeType</code> represents either a node type as it appears when printing the AST or a
placeholder (<code>_</code>) that matches every node. The whole part following the <code>match</code> keyword
is called a query pattern.</p>
<p>In the REPL, every query must be terminated by a <code>;</code>.</p>
<p>The most straightforward query, returning every node in the AST, is written as follows:</p>
<pre><code><span class="hljs-keyword">match</span> _;
</code></pre><p>A slightly more advanced query to return every <code>String</code> node in our document:</p>
<pre><code><span class="hljs-keyword">match</span> <span class="hljs-built_in">String</span>;
</code></pre><p>If we only wish to retrieve the string literals above a certain length, we can
add a <code>when</code> clause: </p>
<pre><code><span class="hljs-keyword">match</span> <span class="hljs-built_in">String</span> <span class="hljs-built_in">str</span> when <span class="hljs-built_in">str</span>.text.length &gt; <span class="hljs-number">35</span>;
</code></pre><p>This query matches only the string literals whose text representation (quotes included)
contains more than 35 characters. In our document, there is only one match on line 10.</p>
<p>The node type can be tested using the <code>is</code> keyword. For instance,
to retrieve any node whose direct parent is an <code>Object</code>:</p>
<pre><code>match <span class="hljs-keyword">_</span> node when node.parent <span class="hljs-keyword">is</span> Object;
</code></pre><p>This query returns the <code>members</code> list of every object in our document.</p>
<p>The <code>is</code> keyword can also test a node against a full query pattern surrounded by curly braces. So, for example, we can retrieve every node whose parent
is a member with key <code>name</code>. </p>
<pre><code>match <span class="hljs-keyword">_</span> node when node.parent <span class="hljs-keyword">is</span> {
    Member m when m.key.text <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'"name"'</span>
};
</code></pre><p>String literals can be single or double quoted.</p>
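<p>As a rough mental model (not Sylver's actual implementation), a query pattern can be read as a filter applied to every node of the tree. The Python sketch below mimics the queries above on a tiny hand-built tree. The <code>Node</code> class and the <code>walk</code> and <code>match</code> helpers are hypothetical names invented for this illustration, and treating a member's first child as its key is an assumption of the toy tree:</p>

```python
class Node:
    """Hypothetical stand-in for an AST node (not Sylver's real data model)."""

    def __init__(self, kind, text):
        self.kind = kind
        self.text = text
        self.parent = None
        self.children = []

    def add(self, child):
        """Attach `child` under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def match(root, kind="_", when=lambda node: True):
    """Mimic `match NodeType binding when expr`; "_" accepts every node type."""
    return [node for node in walk(root) if kind in ("_", node.kind) and when(node)]


# Toy tree for the document {"name": "country"}.
root = Node("Object", '{"name": "country"}')
member = root.add(Node("Member", '"name": "country"'))
member.add(Node("String", '"name"'))
member.add(Node("String", '"country"'))

all_nodes = match(root)                                          # match _;
strings = match(root, "String")                                  # match String;
long_strings = match(root, "String", lambda s: len(s.text) > 6)  # match String s when ...
# match _ node when node.parent is { Member m when m.key.text == '"name"' };
under_name = match(
    root,
    "_",
    lambda n: n.parent is not None
    and n.parent.kind == "Member"
    and n.parent.children[0].text == '"name"',
)
```

<p>The binding of a query corresponds to the lambda parameter, and the <code>when</code> clause to the lambda body. A nested <code>is {...}</code> pattern is simply another predicate applied to a related node, here the parent.</p>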
<p>Now that we know how to write basic queries, let's try to find the nodes that violate our rules.</p>
<h2 id="heading-rule-1-variable-descriptions-should-be-shorter-than-35-characters">Rule 1: variable descriptions should contain at most 35 characters</h2>
<p>Except for the <code>&amp;&amp;</code> operator in boolean expressions, this rule only
uses features that appeared in the previous section so you can test
yourself by trying to write it without looking at the following block of code!</p>
<pre><code>match String desc when desc.text.<span class="hljs-built_in">length</span> <span class="hljs-operator">&gt;</span> <span class="hljs-number">37</span> <span class="hljs-operator">&amp;</span><span class="hljs-operator">&amp;</span> desc.parent <span class="hljs-keyword">is</span> {
   Member m when m.key.text <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'"description"'</span>
};
</code></pre><p>This query should return a single node on line 10 that is indeed longer than the 
specified length.
Note that we check that the text length is above 37 instead of 35 because of the surrounding quotes.</p>
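<p>The two-character offset can be checked outside of Sylver. The following Python sketch applies the same rule to a trimmed-down copy of our configuration using the standard <code>json</code> module. The <code>too_long_descriptions</code> helper and the <code>MAX_DESCRIPTION_LENGTH</code> constant are names made up for this example, and the sketch assumes that a node's text is the raw string literal, quotes included:</p>

```python
import json

# Trimmed-down copy of invalid_config.json, inlined to keep the sketch self-contained.
CONFIG = """
{
    "variables": [
        {"name": "date of birth", "description": "Customer's date of birth"},
        {"name": "activity", "description": "A short text describing the customer's profession"},
        {"name": "country", "description": "Customer's country of residence"}
    ]
}
"""

MAX_DESCRIPTION_LENGTH = 35


def too_long_descriptions(config: dict, limit: int = MAX_DESCRIPTION_LENGTH) -> list[str]:
    """Return the descriptions that contain more than `limit` characters."""
    return [
        variable["description"]
        for variable in config["variables"]
        if len(variable.get("description", "")) > limit
    ]


violations = too_long_descriptions(json.loads(CONFIG))

for description in violations:
    # The String node's text keeps the surrounding quotes, so it is
    # two characters longer than the description itself.
    quoted = '"' + description + '"'
    print(len(description), len(quoted), description)
```

<p>Comparing the unquoted length against 35 here is equivalent to comparing the quoted text length against 37 in the query.</p>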
<h2 id="heading-rule-2-variable-names-should-be-a-single-lowercase-word">Rule 2: variable names should be a single lowercase word</h2>
<pre><code>match String s when <span class="hljs-operator">!</span>s.text.matches(`<span class="hljs-string">"[a-z]+"</span>`) <span class="hljs-operator">&amp;</span><span class="hljs-operator">&amp;</span> s.parent <span class="hljs-keyword">is</span> {
   Member m when m.key.text <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'"name"'</span>
};
</code></pre><p>Returns a single node corresponding to the invalid <code>date of birth</code> name.</p>
<p>Apart from the boolean <code>!</code> operator, this rule demonstrates the use of the <code>matches</code>
method on text values. Unsurprisingly, it returns <code>true</code> when the text matches the 
regex literal given as an argument. As in spec files, regex literals are delimited by
backticks.</p>
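<p>To see which names the pattern accepts, here is a small Python sketch using the standard <code>re</code> module. The tutorial does not say whether <code>matches</code> tests the whole text or only part of it. This sketch assumes the whole text must match and therefore uses <code>fullmatch</code>. The <code>is_valid_name</code> helper is a hypothetical name:</p>

```python
import re

# Same pattern as in the query: the double quotes are part of the node's text.
NAME_PATTERN = re.compile(r'"[a-z]+"')


def is_valid_name(quoted_name: str) -> bool:
    """Return True when the quoted text is a single lowercase word."""
    # fullmatch anchors the pattern at both ends of the text.
    return NAME_PATTERN.fullmatch(quoted_name) is not None


names = ['"date of birth"', '"activity"', '"country"', '"Country"', '""']
invalid = [name for name in names if not is_valid_name(name)]
```

<p>Under that assumption, names containing spaces, uppercase letters, or no letters at all are rejected, while <code>"activity"</code> and <code>"country"</code> pass.</p>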
<h2 id="heading-rule-3-fields-type-and-values-should-be-mutually-exclusive">Rule 3: fields <code>type</code> and <code>values</code> should be mutually exclusive</h2>
<p>For this rule, we'll use array quantifying expressions of the form:</p>
<pre><code><span class="hljs-operator">&lt;</span>quantifier<span class="hljs-operator">&gt;</span> <span class="hljs-operator">&lt;</span>array value<span class="hljs-operator">&gt;</span> match <span class="hljs-operator">&lt;</span>query pattern<span class="hljs-operator">&gt;</span>
</code></pre><p>Here, the quantifier is one of the following keywords: <code>no</code>, <code>any</code>, <code>all</code>.
An array quantifying expression returns true when, respectively, none, at least one, or all of the values
in the given array match the query pattern.
Using this new tool, we can find the <code>Object</code> nodes for which
there is at least one child member with key <code>type</code> and one child member with key <code>values</code>:</p>
<pre><code>match Object n when
    any n.members.children match {  
        Member m when m.key.text <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'"type"'</span> 
    }
    <span class="hljs-operator">&amp;</span><span class="hljs-operator">&amp;</span> any n.members.children match { 
        Member m when m.key.text <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'"values"'</span> 
    };
</code></pre><h2 id="heading-conclusion">Conclusion</h2>
<p>We now have queries to identify violations of our business rules, but opening the REPL
and pasting the queries whenever we want to validate a document isn't very practical.
So, in the <a target="_blank" href="https://blog.sylver.dev/getting-started-with-sylver-part33-from-queries-to-analyzer">final part</a> of this series, we'll learn how to package queries into a Sylver rule
set to consume, distribute and share them more conveniently!</p>
]]></content:encoded></item><item><title><![CDATA[Building a JSON validator with Sylver - Part1/3 : Writing a JSON parser in 49 lines of code]]></title><description><![CDATA[Sylver is a language agnostic platform for building custom source code
analyzers (think eslint for every language).
This might be a lot to unpack, so let us explore this tool
by solving a real-world problem: our application's configuration is stored ...]]></description><link>https://blog.sylver.dev/building-a-json-validator-with-sylver-part13-writing-a-json-parser-in-49-lines-of-code</link><guid isPermaLink="true">https://blog.sylver.dev/building-a-json-validator-with-sylver-part13-writing-a-json-parser-in-49-lines-of-code</guid><category><![CDATA[static code analysis]]></category><category><![CDATA[static analysis]]></category><category><![CDATA[code analysis]]></category><category><![CDATA[Linter]]></category><category><![CDATA[json parser]]></category><dc:creator><![CDATA[Geoffrey Copin]]></dc:creator><pubDate>Wed, 10 Aug 2022 16:30:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1660149889094/AYbtAeRQz.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Sylver is a language agnostic platform for building custom source code
analyzers (think ESLint for every language).
This might be a lot to unpack, so let us explore this tool
by solving a real-world problem: our application's configuration is stored in
complex JSON documents, and we'd like to build a tool to automatically validate
these documents against our business rules.</p>
<p>In this series of tutorials, we'll go from having zero knowledge of Sylver or
static analysis to building a fully-fledged linter for our configuration files.
We will use JSON as an example, but the tools and techniques presented
apply to many data formats and even complete programming languages!
Also, note that while we will be building everything from scratch using
Sylver's domain-specific languages (DSL), a catalog of built-in specifications for the most common languages will be included in future releases of the tool.</p>
<ul>
<li><p>In part 1, we will discover Sylver's meta language, a DSL used to describe the shape
of programming languages and data formats. After completing this tutorial, we'll have
a complete spec for the JSON language, allowing us to turn JSON documents into
Sylver parse trees.</p>
</li>
<li><p>In <a target="_blank" href="https://blog.sylver.dev/getting-started-with-sylver-part23-intuitive-json-ast-queries">part 2</a>, we will load the parse trees into Sylver's query engine,
and we will find nodes that violate our business rules using an SQL-like query
language.</p>
</li>
<li><p>In <a target="_blank" href="https://blog.sylver.dev/getting-started-with-sylver-part33-from-queries-to-analyzer">part 3</a>, we will learn how to turn our language spec and queries into a set of
linting rules so that we can run them conveniently and share them.</p>
</li>
</ul>
<h2 id="heading-installation">Installation</h2>
<p>Sylver is distributed as a single static binary. Installing it is as simple as:</p>
<ol>
<li>Go to <a target="_blank" href="https://sylver.dev">https://sylver.dev</a> to download the binary for your platform</li>
<li>Unpack the downloaded archive</li>
<li>Move the <code>sylver</code> binary to a location in your <code>$PATH</code></li>
</ol>
<h2 id="heading-prelude">Prelude</h2>
<p>Let us create a blank workspace for this tutorial!<br />We'll start by creating a new folder and a <code>json.syl</code> file to write
the Sylver specification for the JSON language.</p>
<pre><code class="lang-bash">mkdir sylver_getting_started
<span class="hljs-built_in">cd</span> sylver_getting_started
touch json.syl
</code></pre>
<p>We will also store our test JSON file in <code>config.json</code>.</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"variables"</span>: [
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"country"</span>,
            <span class="hljs-attr">"description"</span>: <span class="hljs-string">"Customer's country of residence"</span>,
            <span class="hljs-attr">"values"</span>: [<span class="hljs-string">"us"</span>, <span class="hljs-string">"fr"</span>, <span class="hljs-string">"it"</span>]
        },
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"age"</span>,
            <span class="hljs-attr">"description"</span>: <span class="hljs-string">"Customer's age"</span>,
            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"number"</span>
        }
    ]
}
</code></pre>
<p>This file specifies the variables used to describe customer profiles in a fictional
customer database.
Variables are assigned a type or a list of potential values.</p>
<h2 id="heading-types-definition">Types definition</h2>
<p>Sylver parses raw text into typed structures called parse trees. The first step in defining a language spec is to define a set of types for our tree nodes.
We'll add the following node type declarations to our <code>json.syl</code> spec:</p>
<pre><code><span class="hljs-selector-tag">node</span> <span class="hljs-selector-tag">JsonNode</span> { }

<span class="hljs-selector-tag">node</span> <span class="hljs-selector-tag">Null</span>: <span class="hljs-selector-tag">JsonNode</span> { }

<span class="hljs-selector-tag">node</span> <span class="hljs-selector-tag">Bool</span>: <span class="hljs-selector-tag">JsonNode</span> { }

<span class="hljs-selector-tag">node</span> <span class="hljs-selector-tag">Number</span>: <span class="hljs-selector-tag">JsonNode</span> { }

<span class="hljs-selector-tag">node</span> <span class="hljs-selector-tag">String</span>: <span class="hljs-selector-tag">JsonNode</span> { }

<span class="hljs-selector-tag">node</span> <span class="hljs-selector-tag">Array</span>: <span class="hljs-selector-tag">JsonNode</span> { 
    <span class="hljs-attribute">elems</span>: List&lt;JsonNode&gt; 
}

<span class="hljs-selector-tag">node</span> <span class="hljs-selector-tag">Object</span>: <span class="hljs-selector-tag">JsonNode</span> {
    <span class="hljs-attribute">members</span>: List&lt;Member&gt;
}

<span class="hljs-selector-tag">node</span> <span class="hljs-selector-tag">Member</span>: <span class="hljs-selector-tag">JsonNode</span> {
    <span class="hljs-attribute">key</span>: String,
    value: JsonNode
}
</code></pre><p>These declarations resemble object type declarations in many mainstream languages.
The <code>:</code> syntax denotes inheritance.</p>
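<p>For readers more familiar with general-purpose languages, the same hierarchy could be sketched with Python dataclasses. This is only an analogy, not something Sylver generates. The <code>text</code> field on <code>String</code> is an addition made for this example (the spec declares no field on that node), used to hold the matched characters:</p>

```python
from dataclasses import dataclass, field


@dataclass
class JsonNode:
    """Base type; the other node types inherit from it, like `: JsonNode` in the spec."""


@dataclass
class Null(JsonNode):
    pass


@dataclass
class Bool(JsonNode):
    pass


@dataclass
class Number(JsonNode):
    pass


@dataclass
class String(JsonNode):
    text: str = ""  # assumption: not part of the spec, only used by this sketch


@dataclass
class Array(JsonNode):
    elems: list[JsonNode] = field(default_factory=list)


@dataclass
class Member(JsonNode):
    key: String
    value: JsonNode


@dataclass
class Object(JsonNode):
    members: list[Member] = field(default_factory=list)


# The document {"name": "age"} expressed with these types.
doc = Object(members=[Member(key=String('"name"'), value=String('"age"'))])
```

<p>Because every type derives from <code>JsonNode</code>, a field declared as <code>JsonNode</code> (such as <code>value</code>) can hold any kind of node, which is what lets JSON values nest.</p>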
<p>Now that we have a set of types to describe JSON documents, we need to specify how to build a
parse tree from a sequence of characters. This process is done in two steps:</p>
<ol>
<li>Lexical analysis: in this step, individual characters that form an indivisible entity
(such as the digits of a number or the characters of a string) are grouped together into tokens.
Some tokens are only one character wide (for example, the brackets, commas, and colons in JSON).</li>
<li>Syntactic analysis: in this step, tree nodes are built from the stream of tokens.</li>
</ol>
<h2 id="heading-lexical-analysis">Lexical analysis</h2>
<p>Tokens are described using declarations of the form <code>term NAME = &lt;term_content&gt;</code> where
<code>&lt;term_content&gt;</code> is either a literal surrounded by single-quotes (<code>'</code>) or a regex
between backticks (<code>` </code>). The regexes use a syntax similar to Perl-style regular expressions.<br />Characters in the input string that match one of the terminal literals
or regexes will be grouped into a token of the given name.</p>
<pre><code>term COMMA <span class="hljs-operator">=</span> <span class="hljs-string">','</span>
term COLON <span class="hljs-operator">=</span> <span class="hljs-string">':'</span>
term L_BRACE <span class="hljs-operator">=</span> <span class="hljs-string">'{'</span>
term R_BRACE <span class="hljs-operator">=</span> <span class="hljs-string">'}'</span>
term L_BRACKET <span class="hljs-operator">=</span> <span class="hljs-string">'['</span>
term R_BRACKET <span class="hljs-operator">=</span> <span class="hljs-string">']'</span>
term NULL <span class="hljs-operator">=</span> <span class="hljs-string">'null'</span>

term BOOL_LIT <span class="hljs-operator">=</span> `<span class="hljs-literal">true</span><span class="hljs-operator">|</span><span class="hljs-literal">false</span>`
term NUMBER_LIT <span class="hljs-operator">=</span> `\<span class="hljs-operator">-</span>?(<span class="hljs-number">0</span><span class="hljs-operator">|</span>([<span class="hljs-number">1</span><span class="hljs-number">-9</span>][<span class="hljs-number">0</span><span class="hljs-number">-9</span>]<span class="hljs-operator">*</span>))(\.[<span class="hljs-number">0</span><span class="hljs-number">-9</span>]<span class="hljs-operator">+</span>)?((e<span class="hljs-operator">|</span>E)(\<span class="hljs-operator">+</span><span class="hljs-operator">|</span><span class="hljs-operator">-</span>)?[<span class="hljs-number">0</span><span class="hljs-number">-9</span>]<span class="hljs-operator">+</span>)?`
term STRING_LIT <span class="hljs-operator">=</span> `<span class="hljs-string">"([^"</span>\\]<span class="hljs-operator">|</span>(\\[\\<span class="hljs-operator">/</span>bnfrt<span class="hljs-string">"])|(\\u[a-fA-F0-9]{4}))*"</span>`


ignore term WHITESPACE <span class="hljs-operator">=</span> `\s`
</code></pre><p>The term rules for numbers and strings are slightly involved because they have to account for some
of JSON's peculiarities.</p>
<p>Note that the <code>WHITESPACE</code> term declaration (matching a single whitespace character) is prefixed with the <code>ignore</code> keyword. This means that <code>WHITESPACE</code> tokens do not affect
the structure of the document and can be ignored during syntactic analysis.</p>
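<p>To make the lexing step more tangible, here is a minimal Python sketch of a tokenizer driven by the same terms. It is not how Sylver works internally. In particular, trying the terms in declaration order and keeping the first match is an assumption of this sketch, and <code>TERMS</code>, <code>IGNORED</code> and <code>tokenize</code> are hypothetical names. The decimal point in <code>NUMBER_LIT</code> is written as <code>\.</code> so that it only matches a literal dot:</p>

```python
import re

# (token name, regex) pairs mirroring the `term` declarations.
# Literal terms are escaped with re.escape so they are matched verbatim.
TERMS = [
    ("COMMA", re.escape(",")),
    ("COLON", re.escape(":")),
    ("L_BRACE", re.escape("{")),
    ("R_BRACE", re.escape("}")),
    ("L_BRACKET", re.escape("[")),
    ("R_BRACKET", re.escape("]")),
    ("NULL", "null"),
    ("BOOL_LIT", "true|false"),
    ("NUMBER_LIT", r"\-?(0|([1-9][0-9]*))(\.[0-9]+)?((e|E)(\+|-)?[0-9]+)?"),
    ("STRING_LIT", r'"([^"\\]|(\\[\\/bnfrt"])|(\\u[a-fA-F0-9]{4}))*"'),
    ("WHITESPACE", r"\s"),
]
IGNORED = {"WHITESPACE"}  # counterpart of the `ignore` keyword
COMPILED = [(name, re.compile(regex)) for name, regex in TERMS]


def tokenize(source: str) -> list[tuple[str, str]]:
    """Group the characters of `source` into (token name, text) pairs."""
    tokens = []
    pos = 0
    while pos != len(source):
        for name, pattern in COMPILED:
            found = pattern.match(source, pos)
            if found:
                if name not in IGNORED:
                    tokens.append((name, found.group()))
                pos = found.end()
                break
        else:
            # No term matches at this position.
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
    return tokens


tokens = tokenize('{"age": 42, "ok": true}')
```

<p>Whitespace is still matched, so the tokenizer can move past it, but the corresponding tokens are dropped before they reach the parser.</p>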
<h2 id="heading-syntactic-analysis">Syntactic analysis</h2>
<p>In this last part of the language spec, we write rules describing how tree nodes are built by matching tokens from the input stream.</p>
<p>For example, a rule specifying: "if the current token is a STRING_LIT, build a
String node" can be written as follows:</p>
<pre><code>rule <span class="hljs-keyword">string</span> = <span class="hljs-keyword">String</span> { STRING_LIT }
</code></pre><p>Rules can refer to other rules to construct nested nodes.
For example, here is a rule specifying that a <code>Member</code> node (corresponding to an object member
in JSON) can be built by building a node using the <code>string</code> rule and then matching a <code>COLON</code> token followed by any valid JSON value:</p>
<pre><code><span class="hljs-keyword">rule</span> member = Member { key@string COLON <span class="hljs-keyword">value</span>@main }
</code></pre><p>Nested nodes are associated with a field using the <code>@</code> syntax.
The <code>main</code> rule is the entry point for the parser, so in our case, it designates any valid JSON value.  </p>
<p>A valid JSON document can be made of a 'null' literal, a number, a boolean value,
a string, an array of JSON values, or a JSON object, which is reflected in the main rule:</p>
<pre><code><span class="hljs-attribute">rule</span> main =
    Null { <span class="hljs-attribute">NULL</span> }
  | Number { <span class="hljs-attribute">NUMBER_LIT</span> }
  | Bool { <span class="hljs-attribute">BOOL_LIT</span> }
  | string
  | Array { <span class="hljs-attribute">L_BRACKET</span> elems<span class="hljs-variable">@sepBy</span>(COMMA, main) R_BRACKET }
  | Object { <span class="hljs-attribute">L_BRACE</span> members<span class="hljs-variable">@sepBy</span>(COMMA, member) R_BRACE }
</code></pre><p>The <code>sepBy(TOKEN, rule_name)</code> syntax is used to parse a sequence of nodes using the <code>rule_name</code> rule,
while matching a <code>TOKEN</code> token between consecutive nodes.</p>
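<p>One way to read these rules is as a recursive-descent parser, sketched below in Python. This is an illustration under stated assumptions, not Sylver's parsing algorithm. The <code>Parser</code> class and its methods are hypothetical, nodes are represented as plain tuples, and <code>sep_by</code> takes the closing token as an extra argument so that it can recognize an empty list (whether Sylver's <code>sepBy</code> accepts empty sequences is not stated in this tutorial):</p>

```python
class Parser:
    """Hypothetical recursive-descent reading of the `main`, `member` and `sepBy` rules."""

    def __init__(self, tokens):
        self.tokens = tokens  # list of (token name, text) pairs
        self.pos = 0

    def peek(self):
        """Return the name of the current token, or None at the end of the input."""
        if self.pos == len(self.tokens):
            return None
        return self.tokens[self.pos][0]

    def expect(self, name):
        """Consume the current token if it has the given name and return its text."""
        if self.peek() != name:
            raise SyntaxError(f"expected {name}, found {self.peek()}")
        text = self.tokens[self.pos][1]
        self.pos += 1
        return text

    def sep_by(self, separator, rule, closing):
        """Parse items with `rule`, matching `separator` between consecutive items."""
        items = []
        if self.peek() == closing:  # empty list: a simplification of this sketch
            return items
        items.append(rule())
        while self.peek() == separator:
            self.expect(separator)
            items.append(rule())
        return items

    def member(self):
        key = ("String", self.expect("STRING_LIT"))
        self.expect("COLON")
        return ("Member", key, self.main())

    def main(self):
        kind = self.peek()
        simple = {"NULL": "Null", "NUMBER_LIT": "Number", "BOOL_LIT": "Bool", "STRING_LIT": "String"}
        if kind in simple:
            return (simple[kind], self.expect(kind))
        if kind == "L_BRACKET":
            self.expect("L_BRACKET")
            elems = self.sep_by("COMMA", self.main, "R_BRACKET")
            self.expect("R_BRACKET")
            return ("Array", elems)
        if kind == "L_BRACE":
            self.expect("L_BRACE")
            members = self.sep_by("COMMA", self.member, "R_BRACE")
            self.expect("R_BRACE")
            return ("Object", members)
        raise SyntaxError(f"unexpected token {kind}")


# Hand-written tokens for the document {"values": ["us", "fr"]}.
tokens = [
    ("L_BRACE", "{"), ("STRING_LIT", '"values"'), ("COLON", ":"),
    ("L_BRACKET", "["), ("STRING_LIT", '"us"'), ("COMMA", ","), ("STRING_LIT", '"fr"'),
    ("R_BRACKET", "]"), ("R_BRACE", "}"),
]
tree = Parser(tokens).main()
```

<p>Each alternative of the <code>main</code> rule becomes one branch of the <code>main</code> method, and the same <code>sep_by</code> helper serves both arrays (with the <code>main</code> rule) and objects (with the <code>member</code> rule).</p>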
<h2 id="heading-conclusion">Conclusion</h2>
<p>We now have a complete language spec for the JSON language:</p>
<pre><code>node JsonNode { }

node Null: JsonNode { }

node Bool: JsonNode { }

node Number: JsonNode { }

node String: JsonNode { }

node Array: JsonNode { 
    elems: List<span class="hljs-operator">&lt;</span>JsonNode<span class="hljs-operator">&gt;</span> 
}

node Object: JsonNode {
    members: List<span class="hljs-operator">&lt;</span>Member<span class="hljs-operator">&gt;</span>
}

node Member: JsonNode {
    key: String,
    <span class="hljs-built_in">value</span>: JsonNode
}

term COMMA <span class="hljs-operator">=</span> <span class="hljs-string">','</span>
term COLON <span class="hljs-operator">=</span> <span class="hljs-string">':'</span>
term L_BRACE <span class="hljs-operator">=</span> <span class="hljs-string">'{'</span>
term R_BRACE <span class="hljs-operator">=</span> <span class="hljs-string">'}'</span>
term L_BRACKET <span class="hljs-operator">=</span> <span class="hljs-string">'['</span>
term R_BRACKET <span class="hljs-operator">=</span> <span class="hljs-string">']'</span>
term NULL <span class="hljs-operator">=</span> <span class="hljs-string">'null'</span>

term BOOL_LIT <span class="hljs-operator">=</span> `<span class="hljs-literal">true</span><span class="hljs-operator">|</span><span class="hljs-literal">false</span>`
term NUMBER_LIT <span class="hljs-operator">=</span> `\<span class="hljs-operator">-</span>?(<span class="hljs-number">0</span><span class="hljs-operator">|</span>([<span class="hljs-number">1</span><span class="hljs-number">-9</span>][<span class="hljs-number">0</span><span class="hljs-number">-9</span>]<span class="hljs-operator">*</span>))(\.[<span class="hljs-number">0</span><span class="hljs-number">-9</span>]<span class="hljs-operator">+</span>)?((e<span class="hljs-operator">|</span>E)(\<span class="hljs-operator">+</span><span class="hljs-operator">|</span><span class="hljs-operator">-</span>)?[<span class="hljs-number">0</span><span class="hljs-number">-9</span>]<span class="hljs-operator">+</span>)?`
term STRING_LIT <span class="hljs-operator">=</span> `<span class="hljs-string">"([^"</span>\\]<span class="hljs-operator">|</span>(\\[\\<span class="hljs-operator">/</span>bnfrt<span class="hljs-string">"])|(\\u[a-fA-F0-9]{4}))*"</span>`


ignore term WHITESPACE <span class="hljs-operator">=</span> `\s`

rule <span class="hljs-keyword">string</span> <span class="hljs-operator">=</span> String { STRING_LIT }

rule member <span class="hljs-operator">=</span> Member { key@<span class="hljs-keyword">string</span> COLON value@main }

rule main <span class="hljs-operator">=</span>
    Null { NULL }
  <span class="hljs-operator">|</span> Number { NUMBER_LIT }
  <span class="hljs-operator">|</span> Bool { BOOL_LIT }
  <span class="hljs-operator">|</span> <span class="hljs-keyword">string</span>
  <span class="hljs-operator">|</span> Array { L_BRACKET elems@sepBy(COMMA, main) R_BRACKET }
  <span class="hljs-operator">|</span> Object { L_BRACE members@sepBy(COMMA, member) R_BRACE }
</code></pre><p>The last step is to test it on our test file!
This is done by invoking the following command:
<code>sylver parse --spec=json.syl --file=config.json</code></p>
<p>Which yields the following parse tree:</p>
<pre><code>Object {
. ● members: List<span class="hljs-operator">&lt;</span>Member<span class="hljs-operator">&gt;</span> {
. . Member {
. . . ● key: String { <span class="hljs-string">"variables"</span> }
. . . ● <span class="hljs-built_in">value</span>: Array {
. . . . ● elems: List<span class="hljs-operator">&lt;</span>JsonNode<span class="hljs-operator">&gt;</span> {
. . . . . Object {
. . . . . . ● members: List<span class="hljs-operator">&lt;</span>Member<span class="hljs-operator">&gt;</span> {
. . . . . . . Member {
. . . . . . . . ● key: String { <span class="hljs-string">"name"</span> }
. . . . . . . . ● <span class="hljs-built_in">value</span>: String { <span class="hljs-string">"country"</span> }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { <span class="hljs-string">"description"</span> }
. . . . . . . . ● <span class="hljs-built_in">value</span>: String { <span class="hljs-string">"Customer's country of residence"</span> }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { <span class="hljs-string">"values"</span> }
. . . . . . . . ● <span class="hljs-built_in">value</span>: Array {
. . . . . . . . . ● elems: List<span class="hljs-operator">&lt;</span>JsonNode<span class="hljs-operator">&gt;</span> {
. . . . . . . . . . String { <span class="hljs-string">"us"</span> }
. . . . . . . . . . String { <span class="hljs-string">"fr"</span> }
. . . . . . . . . . String { <span class="hljs-string">"it"</span> }
. . . . . . . . . }
. . . . . . . . }
. . . . . . . }
. . . . . . }
. . . . . }
. . . . . Object {
. . . . . . ● members: List<span class="hljs-operator">&lt;</span>Member<span class="hljs-operator">&gt;</span> {
. . . . . . . Member {
. . . . . . . . ● key: String { <span class="hljs-string">"name"</span> }
. . . . . . . . ● <span class="hljs-built_in">value</span>: String { <span class="hljs-string">"age"</span> }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { <span class="hljs-string">"description"</span> }
. . . . . . . . ● <span class="hljs-built_in">value</span>: String { <span class="hljs-string">"Customer's age"</span> }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { <span class="hljs-string">"type"</span> }
. . . . . . . . ● <span class="hljs-built_in">value</span>: String { <span class="hljs-string">"number"</span> }
. . . . . . . }
. . . . . . }
. . . . . }
. . . . }
. . . }
. . }
. }
}
</code></pre><p>In the next part, we'll define business rules to validate our JSON 
configuration (for example, the possible values for each variable must be
of the same type), and we will use a query DSL to identify the tree nodes that violate these rules.</p>
]]></content:encoded></item></channel></rss>