A digitized corpus of 6,102 labor contracts between formerly enslaved people and Southern employers, 1864–1868. Transcribed by volunteers at the Smithsonian, structured by machine.
These contracts come from the Freedmen's Bureau Records held at the National Archives (NARA) and digitized by the Smithsonian Transcription Center. Thousands of volunteers transcribed handwritten contract pages from microfilmed records spanning the Reconstruction era.
The Bureau of Refugees, Freedmen, and Abandoned Lands (the "Freedmen's Bureau") operated from 1865 to 1872. One of its central functions was overseeing labor contracts between freed people and their employers—often their former enslavers. Bureau agents witnessed and approved these contracts, which specified wages, rations, working hours, and penalties.
We scraped all transcribed labor contract pages from the Transcription Center, covering 6 NARA microfilm publications across 5 states. Four additional states (Alabama, Georgia, South Carolina, Virginia) have been digitized but not yet transcribed by volunteers.
| Microfilm | State | Contracts | Words |
|---|---|---|---|
We built a five-step Python pipeline to go from the Smithsonian's raw HTML transcription pages to a structured dataset. The key challenge: each "page" on the Transcription Center is a single scanned image, and a single labor contract typically spans 2–4 pages. Our pipeline detects contract boundaries, then extracts structured fields using regular expressions.
Each transcription page is checked for header patterns that signal the start of a new contract. Pages without a header are treated as continuations of the previous contract. The trigger patterns include:
"This Agreement, made and entered into..."
"Articles of Agreement..."
"Contract made and entered into..."
"This Indenture made..."
"Know all men by these presents..."
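The boundary check described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the case-insensitive matching, and the 300-character search window are all assumptions made for the example.

```python
import re

# Header phrases from the list above. Case-insensitive matching and
# whitespace tolerance are assumptions, not confirmed pipeline behavior.
HEADER_PATTERNS = [
    r"this\s+agreement,?\s+made\s+and\s+entered\s+into",
    r"articles\s+of\s+agreement",
    r"contract\s+made\s+and\s+entered\s+into",
    r"this\s+indenture\s+made",
    r"know\s+all\s+men\s+by\s+these\s+presents",
]
HEADER_RE = re.compile("|".join(HEADER_PATTERNS), re.IGNORECASE)


def starts_contract(page_text, window=300):
    """Return True if a header phrase appears near the top of the page."""
    return HEADER_RE.search(page_text[:window]) is not None


def group_pages(pages):
    """Group an ordered list of page texts into contracts (lists of pages)."""
    contracts = []
    for text in pages:
        if starts_contract(text) or not contracts:
            contracts.append([text])    # header found, or the very first page
        else:
            contracts[-1].append(text)  # continuation of the previous contract
    return contracts


# Invented sample pages, for illustration only.
pages = [
    "This Agreement, made and entered into this 1st day of January 1866 ...",
    "... and the said laborers agree to work faithfully ...",
    "Articles of Agreement between John Doe and the undersigned freedmen ...",
]
contracts = group_pages(pages)
print([len(c) for c in contracts])  # [2, 1]
```

Restricting the search to the top of the page is one way to avoid treating a phrase quoted mid-contract as a new header; whether the real pipeline does this is not stated.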
From the opening paragraph of each contract, we extract five structured fields using regular expressions: date, county, employer, workers, and Bureau agent.
The regex approach works well for standardized contract forms (especially Mississippi's "Agreement with Freedmen") but struggles with informal or heavily damaged documents. An LLM-based extraction pass is planned as a follow-up.
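To make the regex approach concrete, here is a hedged sketch that pulls three of the fields from an opening paragraph. The patterns, the `extract_fields` helper, and the sample sentence are all invented for illustration; the pipeline's real expressions are not reproduced here and are likely more elaborate.

```python
import re

# Hypothetical patterns for a standardized opening sentence.
DATE_RE = re.compile(
    r"this\s+(?P<day>\d{1,2})(?:st|nd|rd|th)?\s+day\s+of\s+"
    r"(?P<month>[A-Za-z]+),?\s+(?:A\.?\s?D\.?\s+)?(?P<year>18\d{2})",
    re.IGNORECASE,
)
COUNTY_RE = re.compile(r"\bCounty\s+of\s+(?P<county>[A-Z][A-Za-z]+)")
# Capitalized words (and initials) immediately following "between".
EMPLOYER_RE = re.compile(
    r"\bbetween\s+(?P<employer>[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)"
)


def extract_fields(opening):
    """Return a dict of extracted fields; missing fields stay None."""
    fields = {"date": None, "county": None, "employer": None}
    m = DATE_RE.search(opening)
    if m:
        fields["date"] = f"{int(m['day'])} {m['month'].title()} {m['year']}"
    m = COUNTY_RE.search(opening)
    if m:
        fields["county"] = m["county"]
    m = EMPLOYER_RE.search(opening)
    if m:
        fields["employer"] = m["employer"]
    return fields


# Invented sample text with a placeholder name.
opening = (
    "This Agreement, made and entered into this 1st day of January 1866, "
    "between John Doe of the County of Shelby, State of Tennessee, "
    "and the undersigned freedmen."
)
print(extract_fields(opening))
# {'date': '1 January 1866', 'county': 'Shelby', 'employer': 'John Doe'}
```

Patterns like these depend on fixed phrasing, which is why informal or damaged documents tend to return `None` for several fields.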
Tennessee and Mississippi dominate the corpus, together accounting for 95% of all contracts.
1865 was the peak year for contracting, as the Bureau oversaw the first full season of free labor after the war.
Shelby (Memphis), Robertson, and Madison counties in Tennessee have the most contracts, followed by Hinds (Jackson) in Mississippi.
Most contracts are 200–600 words. The long tail includes multi-page plantation contracts with dozens of named workers.
Confidence is based on how many of the 4 core fields (date, county, employer, workers) were successfully extracted.
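A count-based score of this kind could be computed as in the sketch below. The record layout and the `confidence` function are assumptions; the text only states that the score depends on how many of the four core fields were found, so no labels or thresholds are implied here.

```python
CORE_FIELDS = ("date", "county", "employer", "workers")


def confidence(record):
    """Return (fields found, fraction of the 4 core fields found)."""
    # Empty strings, empty lists, and None all count as "not extracted".
    found = sum(1 for field in CORE_FIELDS if record.get(field))
    return found, found / len(CORE_FIELDS)


# Hypothetical record: date and county extracted, employer and workers missing.
record = {
    "date": "1 January 1866",
    "county": "Shelby",
    "employer": None,
    "workers": [],
}
print(confidence(record))  # (2, 0.5)
```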
The modal contract spans 2 pages (front and back of a form). Single-page entries are often cover sheets or brief agreements.
Agent names are the hardest field to extract (18.8% success rate) because they appear in less standardized positions—sometimes at the bottom as a witness, sometimes on a separate approval page. After normalizing name variants, these are the most frequently identified agents:
| # | Agent Name | Contracts |
|---|---|---|
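The name-variant normalization behind the agent ranking is not specified in detail. One plausible sketch, assuming it amounts to stripping military titles, collapsing whitespace, and unifying case, is shown below; the title list, the `normalize_agent` helper, and the sample names are all hypothetical.

```python
import re
from collections import Counter

# Assumed set of rank abbreviations; the real list is not documented here.
TITLE_RE = re.compile(r"^(?:capt|lieut|lt|maj|col|gen|bvt)\.?\s+", re.IGNORECASE)


def normalize_agent(name):
    """Collapse whitespace, drop leading rank titles, and title-case the name."""
    name = re.sub(r"\s+", " ", name).strip(" ,.")
    while TITLE_RE.match(name):  # handles stacked titles such as "Bvt. Maj."
        name = TITLE_RE.sub("", name, count=1)
    # Note: str.title() would mangle names like "McDonald".
    return name.title()


# Invented variants of a placeholder name.
raw_names = ["Capt. John Doe", "JOHN  DOE", "Bvt. Maj. John Doe,", "J. Doe"]
counts = Counter(normalize_agent(n) for n in raw_names)
print(counts.most_common())  # [('John Doe', 3), ('J. Doe', 1)]
```

Initials such as "J. Doe" are deliberately left unmerged here, since matching them to a full name would require extra evidence such as the same county and year.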
| State | Contracts | Words | Dates | Counties | Employers | Agents |
|---|---|---|---|---|---|---|
Below are three representative contracts from different states and years, showing the range of formats and terms.
The Smithsonian Transcription Center covers only a fraction of the surviving Freedmen's Bureau labor contract records. FamilySearch hosts a curated collection of digitized images across 12+ states, of which we identified 132,084 labor contract images using the Digital Folder Number List.
Our transcription-based dataset captures 19,901 pages, about 15% of this total. The remaining ~112,000 images are digitized scans of the original handwritten records on microfilm and have no text transcriptions. They could be processed with OCR or handwritten text recognition (HTR) models.
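The coverage figures follow directly from the two counts given above. The short check below assumes one transcribed page corresponds to one FamilySearch image, which the comparison implies but does not state outright.

```python
total_images = 132_084       # labor contract images identified on FamilySearch
transcribed_pages = 19_901   # pages in the transcription-based dataset

coverage = transcribed_pages / total_images
remaining = total_images - transcribed_pages

print(f"{coverage:.1%}")  # 15.1%
print(remaining)          # 112183
```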
The largest gaps are in South Carolina (~46,000 untranscribed images), Arkansas (~17,500), Tennessee (~13,000), Louisiana (~11,500), and Virginia (~11,000). Six states have zero transcription coverage: SC, LA, KY, GA, AL, and VA.
All images are freely accessible on FamilySearch (free account required) and from the Smithsonian's IDS image service. A future OCR pipeline could expand this dataset from ~6,000 contracts to potentially tens of thousands.
Below are three examples of the original handwritten contract pages—scanned from NARA microfilm. Volunteer transcribers at the Smithsonian converted these into the text we extracted. The untranscribed pages look similar but have not yet been converted to text.
| State | Digitized Images | Transcribed | Untranscribed | Coverage |
|---|---|---|---|---|