Natural Language Databases

Requirements

I’m writing a Bible study app. I’d like users to be able to query word relations (like “what’s every direct object that the verb ‘love’ targets?”). I have a few requirements:

Data model

What is a “semantic word tree” model? There are two popular ones:

I’d like to be able to support any graph model, but I’ll start with Universal Dependencies because there are datasets readily available.
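To make the target query concrete, here is a minimal sketch of answering “every direct object of ‘love’” over one Universal Dependencies sentence. It assumes the data arrives in CoNLL-U (the tab-separated format UD treebanks are distributed in) and that direct objects carry the `obj` relation. The English sentence and the `Word`, `parse_sentence`, and `direct_objects` names are made up for illustration and are not taken from the treebanks mentioned here.

```python
from dataclasses import dataclass

# A made-up English sentence in CoNLL-U. Columns are:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
CONLLU = "\n".join([
    "1\tGod\tGod\tPROPN\t_\t_\t2\tnsubj\t_\t_",
    "2\tloved\tlove\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tthe\tthe\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tworld\tworld\tNOUN\t_\t_\t2\tobj\t_\t_",
])


@dataclass
class Word:
    id: int
    form: str
    lemma: str
    upos: str
    head: int      # ID of the governing word, 0 for the root
    deprel: str    # dependency relation to the head


def parse_sentence(block: str) -> list:
    """Parse one CoNLL-U sentence into a list of Word records."""
    words = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        words.append(Word(int(cols[0]), cols[1], cols[2], cols[3],
                          int(cols[6]), cols[7]))
    return words


def direct_objects(words, verb_lemma: str) -> list:
    """Return the forms of every word that is the `obj` of the given lemma."""
    by_id = {w.id: w for w in words}
    return [
        w.form
        for w in words
        # Strip relation subtypes such as "obj:foo" before comparing.
        if w.deprel.split(":")[0] == "obj"
        and w.head in by_id
        and by_id[w.head].lemma == verb_lemma
    ]


print(direct_objects(parse_sentence(CONLLU), "love"))
```

The query itself is a single hop along one edge of the tree, which is the kind of thing I’d like the database’s query language to express directly.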

Data sets

There’s a Penn Treebank and a Penn-ish treebank for the Old Testament that someone is porting to Universal Dependencies.

There’s another Penn-ish treebank for the New Testament that the same person has already ported to Universal Dependencies.

I’ll likely use the ported treebanks to start.

Candidate databases

There aren’t many embedded on-disk databases. There are even fewer when you consider only those that can be embedded in a browser.

SQLite

With a vector extension and a WASM build, SQLite looks pretty convincing at a light 500 KB, until you consider modelling trees as tables. With some JSON type (ab)use and a custom function for containment, it could work. But even if you perfectly understand the data model, SQL joins are unintuitive. Nested subqueries are also yucky!

Maybe a different query layer on top could work. But at that point, most of the SQL has been stripped out of SQLite. It’d just be a persistence layer with B-tree indices.
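To show what the join complaint looks like in practice, here is a small sketch using Python’s built-in `sqlite3` module. The `word` table and its columns are a hypothetical schema (one row per word, with a `head_id` pointing at the governing word), and the sentence is made up. Each extra level of the tree costs another self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word (
        id      INTEGER PRIMARY KEY,
        form    TEXT NOT NULL,
        lemma   TEXT NOT NULL,
        head_id INTEGER,        -- id of the governing word, NULL for the root
        deprel  TEXT NOT NULL   -- dependency relation to the head
    )
""")
conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?, ?)", [
    (1, "God",   "God",   2,    "nsubj"),
    (2, "loved", "love",  None, "root"),
    (3, "the",   "the",   4,    "det"),
    (4, "world", "world", 2,    "obj"),
])

# One hop: direct objects of "love" need one self-join.
objects = [row[0] for row in conn.execute("""
    SELECT child.form
    FROM word AS child
    JOIN word AS head ON child.head_id = head.id
    WHERE head.lemma = ? AND child.deprel = 'obj'
    ORDER BY child.id
""", ("love",))]

# Two hops: determiners of those direct objects need a second self-join.
determiners = [row[0] for row in conn.execute("""
    SELECT det.form
    FROM word AS det
    JOIN word AS obj  ON det.head_id = obj.id
    JOIN word AS head ON obj.head_id = head.id
    WHERE head.lemma = ? AND obj.deprel = 'obj' AND det.deprel = 'det'
    ORDER BY det.id
""", ("love",))]

print(objects, determiners)
```

It works, but the shape of the query (a flat list of aliased joins) looks nothing like the shape of the tree being matched.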

Kuzu

Kuzu supports graph models! Its query language, Cypher, is more intuitive for relational queries than SQL. It has a WASM build for browsers. It has vector and BM25 search built in. It can relate metadata well, since any two nodes can be related regardless of their type. This makes it possible for a book’s author to be an “Organization” or a “Person”, which makes modeling Schema.org relationships much easier than with SQL.

However, Kuzu’s not quite cut out for text or the web. A sentence “containing” a range of words is unsupported, so each word needs its own “contained by” relation. The base WASM build is too heavy at 15 MB because of a bunch of C++ dependencies. With that much space I could ship over 30 copies of a fully tagged Bible!
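The difference between range containment and per-word “contained by” relations can be sketched as follows. This is an illustration under my own assumptions, not how Kuzu or any other database stores things: words are numbered in text order, each sentence keeps only its first and last word number, and `SENTENCES`, `sentence_of`, and `CONTAINED_BY` are hypothetical names.

```python
from bisect import bisect_right
from typing import Optional

# Range model: one (first_word, last_word) pair per sentence, sorted by start.
SENTENCES = [(1, 4), (5, 9), (10, 12)]
STARTS = [first for first, _ in SENTENCES]


def sentence_of(word: int) -> Optional[int]:
    """Find the index of the sentence containing `word` by binary search."""
    i = bisect_right(STARTS, word) - 1  # last sentence starting at or before word
    if i >= 0 and word <= SENTENCES[i][1]:
        return i
    return None


# Edge model: one explicit "contained by" entry per word.
CONTAINED_BY = {
    word: index
    for index, (first, last) in enumerate(SENTENCES)
    for word in range(first, last + 1)
}

print(len(SENTENCES), "ranges versus", len(CONTAINED_BY), "edges")
print(sentence_of(7), CONTAINED_BY[7])
```

Both models answer the same question, but the range model stores one record per sentence while the edge model stores one per word, which adds up over a whole Bible.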

Text-fabric

Text-fabric is perfect for what I want! Its query language is intuitive.

However, it’s Python-only, slow, and cannot be embedded in a browser. Furthermore, I’d like a single binary database file instead of multiple folders of plaintext files, so that it can be shared more easily.

Conclusion

I’ll embark on writing my own database in Zig, with a query language in the style of Cypher and Text-fabric. If I can’t get far, I’ll try to port Text-fabric to TypeScript and accept the multi-file format.