Natural Language Databases
Requirements §
I’m writing a Bible study app. I’d like users to be able to query word relations (like “what’s every direct object that the verb ‘love’ targets?”). I have a few requirements:
- Semantic word tree model
- Model is language-agnostic for multilingual support
- Model supports semantic and spelling variants (see Qere vs Ketiv)
- Simple relational query language is intuitive to non-technical users
- Embeddable on mobile devices, preferably in a browser
- Paragraph, sentence, and word segmentation
- Logical segmentation (book, chapter, verse)
- Vector and BM25 search for normal text queries
- Can relate books, authors, named entities, audiobooks, and other metadata
Data model §
What is a “semantic word tree” model? There are two popular ones:
- Penn Treebank (1989) which models English sentences as nested phrases (constituents) of words with English Part of Speech (POS) tags.
- Universal Dependencies (2013) which models sentences as a Directed Acyclic Graph (DAG) of “dependent” words with Universal Part of Speech (UPOS) tags.
I’d like to be able to support any graph model, but I’ll start with Universal Dependencies because there are datasets readily available.
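To make the dependency model concrete, here is a minimal sketch of one hand-annotated sentence and the "direct objects of 'love'" query from the requirements. The `Token` fields mirror a subset of the CoNLL-U columns used by Universal Dependencies; the sentence, the `direct_objects` helper, and the flat in-memory list are assumptions made for illustration, not part of any existing treebank tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    id: int      # 1-based position in the sentence, as in CoNLL-U
    form: str    # surface spelling
    lemma: str   # dictionary form
    upos: str    # Universal POS tag
    head: int    # id of the governing token; 0 marks the root
    deprel: str  # dependency relation to the head


# "God loves the world", annotated by hand for illustration only.
SENTENCE = [
    Token(1, "God", "God", "PROPN", 2, "nsubj"),
    Token(2, "loves", "love", "VERB", 0, "root"),
    Token(3, "the", "the", "DET", 4, "det"),
    Token(4, "world", "world", "NOUN", 2, "obj"),
]


def direct_objects(tokens, verb_lemma):
    """Return lemmas of tokens attached as `obj` to a verb with the given lemma."""
    by_id = {t.id: t for t in tokens}
    return [
        t.lemma
        for t in tokens
        if t.deprel == "obj"
        and t.head in by_id
        and by_id[t.head].upos == "VERB"
        and by_id[t.head].lemma == verb_lemma
    ]


print(direct_objects(SENTENCE, "love"))  # ['world']
```

Each word points at exactly one head here, so the whole sentence is just a list of rows; the query is a single hop from dependent to head.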
Data sets §
There’s a Penn Treebank and Penn-ish treebank for the Old Testament that someone is porting to Universal Dependencies.
There’s another Penn-ish treebank for the New Testament that same someone has ported to Universal Dependencies.
I’ll likely use the ported treebanks to start.
Candidate databases §
There aren’t a lot of embedded on-disk databases. There are even fewer when you consider those that can be embedded in a browser.
SQLite §
With a vector extension and WASM build, SQLite looks pretty convincing at a light 500 KB. Until you consider modelling trees as tables. With some JSON type (ab)use and a custom function for containment, it could work. But even if you perfectly understand the data model, SQL joins are unintuitive. Nested subqueries are also yucky!
Maybe a different query layer on top could work. But at that point, SQLite has had the SQL part mostly removed from it. It’d just be a persistence layer with B-Tree indices.
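For a sense of what the join complaint looks like in practice, here is a rough sketch using Python's standard `sqlite3` module. The `token` table, its columns, and the two sample sentences are hypothetical; they only assume that each word stores the id of its head, as in Universal Dependencies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE token (
    sentence_id INTEGER NOT NULL,
    id          INTEGER NOT NULL,  -- position within the sentence
    lemma       TEXT    NOT NULL,
    upos        TEXT    NOT NULL,
    head        INTEGER NOT NULL,  -- id of the governing token, 0 for the root
    deprel      TEXT    NOT NULL,
    PRIMARY KEY (sentence_id, id)
);
""")

# Two hand-annotated sentences, for illustration only.
conn.executemany(
    "INSERT INTO token VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "God", "PROPN", 2, "nsubj"),
        (1, 2, "love", "VERB", 0, "root"),
        (1, 3, "the", "DET", 4, "det"),
        (1, 4, "world", "NOUN", 2, "obj"),
        (2, 1, "love", "VERB", 0, "root"),
        (2, 2, "your", "PRON", 3, "nmod:poss"),
        (2, 3, "neighbor", "NOUN", 1, "obj"),
    ],
)

# One hop along a dependency edge already needs a self-join.
QUERY = """
SELECT dep.lemma
FROM token AS dep
JOIN token AS verb
  ON verb.sentence_id = dep.sentence_id
 AND verb.id = dep.head
WHERE dep.deprel = 'obj'
  AND verb.upos = 'VERB'
  AND verb.lemma = ?
ORDER BY dep.sentence_id, dep.id
"""

objects = [row[0] for row in conn.execute(QUERY, ("love",))]
print(objects)  # ['world', 'neighbor']
```

Every additional hop in the tree (say, the determiner of the object) adds another self-join on the same table, which is where this gets hard to read for a non-technical user.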
Kuzu §
Kuzu supports graph models! Its query language, Cypher, is a more intuitive relational query language than SQL. It has a WASM build for browsers. It has vector and BM25 search built in. It can relate metadata well since any two nodes can be related regardless of their type. This makes it possible for a book’s author to be an “Organization” or a “Person”, which makes modeling Schema.org relationships much easier than with SQL.
However, Kuzu’s not quite cut out for text or the web. A sentence “containing” a range of words is unsupported, so each word needs its own “contained by” relation. The base WASM build is too heavy at 15 MB because of a bunch of C++ dependencies. With that much space I could ship over 30 copies of a fully tagged Bible!
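The difference between range containment and per-word "contained by" relations can be sketched in a few lines. This assumes words are numbered consecutively across the text; the `Span` type, the sample words, and both helper functions are hypothetical and are not Kuzu's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    kind: str   # e.g. "sentence" or "verse"
    first: int  # first word slot, inclusive
    last: int   # last word slot, inclusive

    def contains(self, slot: int) -> bool:
        return self.first <= slot <= self.last


# Words get consecutive slot numbers 0..3 (sample text for illustration).
WORDS = ["Jesus", "wept", "Rejoice", "always"]

# Range model: each segment stores only its first and last slot.
SPANS = [
    Span("verse", 0, 1),
    Span("sentence", 0, 1),
    Span("verse", 2, 3),
    Span("sentence", 2, 3),
]


def containers(slot):
    """Segments whose range covers the given word slot."""
    return [s for s in SPANS if s.contains(slot)]


def contained_by_edges():
    """Edge model: one (slot, span) pair per word per containing segment."""
    return [
        (slot, span)
        for span in SPANS
        for slot in range(span.first, span.last + 1)
    ]


print([s.kind for s in containers(2)])        # ['verse', 'sentence']
print(len(SPANS), len(contained_by_edges()))  # 4 8
```

The range model stores two integers per segment no matter how long it is, while the edge model grows with every word in every segment that covers it.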
Text-fabric §
Text-fabric is perfect for what I want! Its query language is intuitive.
However, it’s Python-only, slow, and cannot be embedded in a browser. Furthermore, I’d like a single binary database file instead of multiple folders of plaintext files so that it can be shared more easily.
Conclusion §
I’ll embark on writing my own database in Zig with a Cypher/Text-fabric-style query language. If I can’t get far, I’ll try to port Text-fabric to TypeScript and accept the multi-file format.