TimeCapsuleLLM: LLM trained only on data from 1800-1875

https://news.ycombinator.com/rss Hits: 6
Summary

TimeCapsule LLM: a language model trained from scratch exclusively on data from certain places and time periods, to reduce modern bias and emulate the voice, vocabulary, and worldview of the era. Imagine if an AI model didn't just pretend to be historical but actually was.

- v0 and v0.5: built on nanoGPT by Andrej Karpathy; the core training scripts and model architecture are his work.
- v1: built on Phi 1.5 by Microsoft (Hugging Face link).

Model Behavior & Limitations

v0: Early prompts show the model responding with 1800s language and behavior. Example: prompted with "Who art Henry?", it replied "I know that man, I have did not a black, the storm." There is no mention of modern concepts and the vocabulary is mostly era-accurate, but sentences are mostly incoherent (expected for ~187MB of training data).

v0.5: A significant improvement over v0, with a Victorian writing style, proper punctuation, and mostly grammatical sentences. Still a high factual hallucination rate, and OCR noise ("Digitized by Google") is still present in outputs.

v1: The first model to recall and connect a real historical event with an actual figure from the dataset. Example: prompted with "It was the year of our Lord 1834", it produced: "It was the year of our Lord 1834 and the streets of London were filled with protest and petition. The cause, as many re counted, was not bound in the way of private, but having taken up the same day in the day of Lord Palmerston, the public will receive a short statement of the difficulties under which the day of law has reached us. It is a matter of deep regret, that the present events in the history of the world are clear, and consequently will be'known. It is not true that the very men who first settled in the Gospel at Jerusalem should have so extensive and so interesting a record of the prosperity and prosperity". At first I assumed that a protest might have coincidentally taken place in the same year, but take a look at this:

v1 was trained using a 15GB sample from v2's 90GB dataset, and only to 10K steps. There was an issue with tokenization t...
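For readers who want to reproduce the kind of prompting shown above, here is a minimal sketch using the Hugging Face transformers library. The model id "example-org/TimeCapsuleLLM-v1" is a hypothetical placeholder (the summary only mentions a Hugging Face link, not the exact repository name), and the generation settings are assumptions rather than the project's documented defaults.

```python
# Minimal sketch: prompting an era-locked causal LM with Hugging Face transformers.
# NOTE: the model id below is a hypothetical placeholder; substitute the real
# repository name from the project's Hugging Face link.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "example-org/TimeCapsuleLLM-v1"  # placeholder, not the actual repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Prompt taken from the summary above.
prompt = "It was the year of our Lord 1834"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=120,   # assumed length; not specified by the project
        do_sample=True,       # sampling shows off the era-specific style
        temperature=0.8,
        top_p=0.95,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Since v1 is built on Phi 1.5, loading it through the generic Auto classes should work the same way as for the base model; the earlier nanoGPT-based v0/v0.5 checkpoints would instead be sampled with Karpathy's own scripts.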

First seen: 2026-01-12 17:02

Last seen: 2026-01-12 22:02