To handle these challenges, we built exa-d: a data framework that uses S3 to store the web. The code below roughly outlines what it does:

    # Base column: ingested data
    documents = Column(name="documents", type=str)

    # Derived columns: each declares its inputs and its implementation
    tokenized = Column(name="documents_tokenized", type=torch.Tensor).derive()._from(documents).impl(Tokenizer)
    embeddings = Column(name="embeddings", type=torch.Tensor).derive()._from(tokenized).impl(EmbeddingModel)

    dataset = Dataset(location="s3://exa-data/documents/")
    execute_columns(dataset, [tokenized, embeddings])

# The Logical Layer: The Dependency Graph

In a production web index, data is transformed not as a linear sequence but as a system of independently evolving derived fields. Each field, such as one of several embedding versions or a derived signal like a structured extraction, has its own update schedule and dependency surface.

exa-d represents the index as typed columns with declared dependencies. Base columns hold ingested data, while derived columns declare what they are computed from and how, forming an explicit dependency graph.

Figure 1: Column dependencies for a single fragment in sample row 2. For each input, an output of a defined type is produced via a specific function.

This does two practical things immediately:

- Execution order is determined by the dependency graph itself rather than by hardcoded scripts. If embeddings depend on tokenized output, the column declares that dependency and the system determines execution order automatically. Otherwise, a separate script specifying that order would need to be written and maintained for each pipeline variant.
- Column definitions are contracts. The builder pattern enforces type guarantees, for example Tokenizer: str → Tensor, and makes column definitions reusable instead of relying on string names and ad hoc assumptions about shapes and schemas.

The graph determines what needs to be computed. For each derived column, the system checks whether its inputs exist and whether its output is already computed.
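To make the idea concrete, here is a minimal sketch of graph-driven execution using only the Python standard library. It is not exa-d's implementation: the `Column` dataclass, the `dtype` field, and the `execution_order`, `pending`, and `execute` helpers are hypothetical names, a plain dict stands in for a stored row, and toy functions stand in for `Tokenizer` and `EmbeddingModel`. It assumes every dependency is itself listed among the columns.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a typed column; not exa-d's real class."""
    name: str
    dtype: type
    inputs: tuple = ()                         # names of upstream columns
    impl: Optional[Callable[..., Any]] = None  # None marks a base column


def execution_order(columns):
    """Order column names so each one comes after the columns it depends on."""
    graph = {c.name: set(c.inputs) for c in columns}
    return list(TopologicalSorter(graph).static_order())


def pending(columns, row):
    """Derived columns whose inputs exist in `row` but whose output does not."""
    by_name = {c.name: c for c in columns}
    todo = []
    for name in execution_order(columns):
        col = by_name[name]
        if col.impl is None or name in row:
            continue  # base column, or output already computed
        if all(dep in row for dep in col.inputs):
            todo.append(name)
    return todo


def execute(columns, row):
    """Fill in missing derived columns in dependency order, checking output types."""
    by_name = {c.name: c for c in columns}
    for name in execution_order(columns):
        col = by_name[name]
        if col.impl is None or name in row:
            continue
        if not all(dep in row for dep in col.inputs):
            continue  # an input is missing, so this column cannot run yet
        value = col.impl(*(row[dep] for dep in col.inputs))
        if not isinstance(value, col.dtype):
            raise TypeError(
                f"{name}: expected {col.dtype.__name__}, got {type(value).__name__}"
            )
        row[name] = value
    return row


# Toy columns mirroring the outline above (lists stand in for tensors).
documents = Column("documents", str)
tokenized = Column("documents_tokenized", list, ("documents",), str.split)
embeddings = Column(
    "embeddings", list, ("documents_tokenized",),
    lambda tokens: [float(len(t)) for t in tokens],
)

columns = [embeddings, tokenized, documents]  # deliberately listed out of order
row = {"documents": "store the web"}
execute(columns, row)
```

The columns are passed in the wrong order on purpose: the ordering comes from the declared inputs, not from the order in which a script happens to list them. Running `execute` a second time on the same row does nothing, because every output is already present.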
Adding a new derived field means adding a node and its edges, not dupl...