An unedited mirror of an OCR of the Epstein Estate's documentation, as released by the House Oversight Committee.
An automatically processed, OCR'd, searchable archive of publicly released documents related to the Jeffrey Epstein case.
This project automatically processes thousands of scanned document pages using AI-powered OCR to:

- Extract the full text of each page
- Identify the people, organizations, and locations mentioned
- Classify each page's document type
- Reassemble page scans into complete, searchable documents
This is a public service project. All documents are from public releases. This archive makes them more accessible and searchable.
.
├── process_images.py # Python script to OCR images using AI
├── cleanup_failed.py # Python script to clean up failed processing
├── deduplicate.py # Python script to deduplicate entities
├── deduplicate_types.py # Python script to deduplicate document types
├── analyze_documents.py # Python script to generate AI summaries
├── requirements.txt # Python dependencies
├── .env.example # Example environment configuration
├── downloads/ # Place document images here
├── results/ # Extracted JSON data per document
├── processing_index.json # Processing progress tracking (generated)
├── dedupe.json # Entity deduplication mappings (generated)
├── dedupe_types.json # Document type deduplication mappings (generated)
├── analyses.json # AI document analyses (generated)
├── src/ # 11ty source files for website
├── .eleventy.js # Static site generator configuration
└── _site/ # Generated static website (after build)
Python (for OCR processing):
pip install -r requirements.txt
Node.js (for website generation):
npm install
Copy .env.example to .env and configure your OpenAI-compatible API endpoint:
cp .env.example .env
# Edit .env with your API details
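The scripts read these settings at startup. As a rough sketch of that pattern using python-dotenv (the variable names OPENAI_API_BASE, OPENAI_API_KEY, and MODEL here are assumptions; the authoritative names are in .env.example):

```python
# sketch: load API settings from .env (variable names are assumptions)
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from ./.env into the environment

API_BASE = os.environ.get("OPENAI_API_BASE", "https://api.openai.com/v1")
API_KEY = os.environ["OPENAI_API_KEY"]  # fail loudly if unset
MODEL = os.environ.get("MODEL", "gpt-4o-mini")  # assumed default, not confirmed
```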
Place document images in the downloads/ directory, then run:
python process_images.py
# Options:
# --limit N # Process only N images (for testing)
# --workers N # Number of parallel workers (default: 5)
# --no-resume # Process all files, ignore index
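Resume support comes from processing_index.json: images already recorded there are skipped unless --no-resume is passed. A minimal sketch of that idea, assuming the index is a flat JSON list of processed paths (the real schema may differ):

```python
# sketch: skip images already recorded in processing_index.json
import json
import pathlib

INDEX = pathlib.Path("processing_index.json")
done = set(json.loads(INDEX.read_text())) if INDEX.exists() else set()

for image in sorted(pathlib.Path("downloads").rglob("*.jpg")):
    if str(image) in done:
        continue  # already processed; skipped unless --no-resume
    # ... OCR the image and write ./results/{folder}/{imagename}.json ...
    done.add(str(image))
    INDEX.write_text(json.dumps(sorted(done), indent=2))
```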
The script will:

- Save extracted data to ./results/{folder}/{imagename}.json
- Track progress in processing_index.json (resume-friendly)

If processing fails or you need to retry failed files:
# Check for failures (dry run)
python cleanup_failed.py
# Remove failed files from processed list (so they can be retried)
python cleanup_failed.py --doit
# Also delete corrupt JSON files
python cleanup_failed.py --doit --delete-invalid-json
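In outline, the cleanup pass just looks for result files whose JSON will not parse; a simplified sketch (not the script's actual code):

```python
# sketch: find corrupt result files so the originals can be retried
import json
import pathlib

failed = []
for result in pathlib.Path("results").rglob("*.json"):
    try:
        json.loads(result.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        failed.append(result)  # unreadable output -> retry candidate

print(f"{len(failed)} invalid result files")
# --doit would remove these from processing_index.json;
# --delete-invalid-json would also delete the corrupt files
```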
The LLM may extract the same entity with different spellings (e.g., "Epstein", "Jeffrey Epstein", "J. Epstein"). Run the deduplication script to merge these:
python deduplicate.py
# Options:
# --batch-size N # Process N entities per batch (default: 50)
# --show-stats # Show deduplication stats without processing
This will:

- Collect entity names from all files in ./results/
- Use the LLM to merge duplicate spellings into a dedupe.json mapping file

Example dedupe.json:
{
"people": {
"Epstein": "Jeffrey Epstein",
"J. Epstein": "Jeffrey Epstein",
"Jeffrey Epstein": "Jeffrey Epstein"
},
"organizations": {...},
"locations": {...}
}
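Downstream steps can then normalize every extracted name through this mapping. A minimal lookup sketch, assuming the dedupe.json structure shown above:

```python
# sketch: canonicalize entity names via dedupe.json
import json

with open("dedupe.json") as f:
    dedupe = json.load(f)

def canonical(name: str, kind: str = "people") -> str:
    # fall back to the raw name when no mapping exists
    return dedupe.get(kind, {}).get(name, name)

print(canonical("J. Epstein"))  # -> Jeffrey Epstein (with the example mapping)
```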
Deduplicate Document Types:
The LLM may also extract document types with inconsistent formatting (e.g., "deposition", "Deposition", "DEPOSITION TRANSCRIPT"). Run the type deduplication script:
python deduplicate_types.py
This will:

- Collect document types from all files in ./results/
- Merge inconsistent variants into a dedupe_types.json mapping file

Example dedupe_types.json:
{
"stats": {
"original_types": 45,
"canonical_types": 12,
"reduction_percentage": 73.3
},
"mappings": {
"deposition": "Deposition",
"DEPOSITION": "Deposition",
"deposition transcript": "Deposition",
"court filing": "Court Filing"
}
}
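The stats block is straightforward bookkeeping: reduction_percentage is (original_types - canonical_types) / original_types * 100, which checks out for the example values:

```python
# how the example reduction_percentage is derived
original_types, canonical_types = 45, 12
reduction = (original_types - canonical_types) / original_types * 100
print(round(reduction, 1))  # 73.3
```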
Generate AI summaries and insights for each document:
python analyze_documents.py
# Options:
# --limit N # Analyze only N documents (for testing)
# --force # Re-analyze all documents (ignore existing)
This will:

- Generate a summary and key insights for each document
- Save the results to analyses.json

Example analysis output:
{
"document_type": "deposition",
"key_topics": ["Flight logs", "Private aircraft", "Passenger manifests"],
"key_people": [
{"name": "Jeffrey Epstein", "role": "Aircraft owner"}
],
"significance": "Documents flight records showing passenger lists...",
"summary": "This deposition contains testimony regarding..."
}
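Under the hood this is a chat-completion call against the configured OpenAI-compatible endpoint. One way such a request could look (the prompt, model, and JSON handling below are assumptions, not the script's actual code):

```python
# sketch: one possible per-document analysis request (details are assumptions)
import json
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=os.environ.get("OPENAI_API_BASE"),  # OpenAI-compatible endpoint
    api_key=os.environ["OPENAI_API_KEY"],
)

def analyze(document_text: str) -> dict:
    response = client.chat.completions.create(
        model=os.environ.get("MODEL", "gpt-4o-mini"),  # assumed default
        messages=[
            {"role": "system", "content": (
                "Summarize this document as JSON with document_type, "
                "key_topics, key_people, significance, and summary.")},
            {"role": "user", "content": document_text},
        ],
        response_format={"type": "json_object"},  # not all endpoints support this
    )
    return json.loads(response.choices[0].message.content)
```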
Build the static site from the processed data:
npm run build # Build static site to _site/
npm start # Development server with live reload
The build process will automatically:

- Apply entity deduplication if dedupe.json exists
- Include AI analyses if analyses.json exists

Document Processing: Images are sent to an AI vision model that extracts:

- The full text of the page
- People, organizations, and locations mentioned
- Document metadata such as document type, document number, and page number
Document Grouping: Individual page scans are automatically grouped by document number and sorted by page number to reconstruct complete documents
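A minimal sketch of that grouping step, assuming each result JSON carries document_number and page_number fields (an inference from the description above):

```python
# sketch: regroup per-page results into complete documents
import json
import pathlib
from collections import defaultdict

pages = defaultdict(list)
for path in pathlib.Path("results").rglob("*.json"):
    record = json.loads(path.read_text())
    pages[record["document_number"]].append(record)  # assumed field names

documents = {
    doc_id: sorted(doc_pages, key=lambda p: p["page_number"])
    for doc_id, doc_pages in pages.items()
}
```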
Static Site Generation: 11ty processes the JSON data to create:

- A page for each reconstructed document
- Browsable indexes of people, organizations, locations, and document types
- A searchable front end for the whole archive
This is an open archive project, and contributions are welcome.
The site is automatically deployed to GitHub Pages on every push to the main branch.
Deployment repository: https://github.com/epstein-docs/epstein-docs.github.io

The site will be available at: https://epstein-docs.github.io/
Once entities are deduplicated, the next step is to visualize relationships between people, organizations, and locations. Potential approaches:
Pre-generate graph data during the build process so the site stays fully static.

Graph types to consider: entity co-occurrence networks (who appears in the same documents together) and document timelines.

Implementation ideas: emit graph JSON at build time and render it on dedicated pages (e.g., /graphs/people/, /graphs/timeline/).

Example graph data:

{
"nodes": [
{"id": "Jeffrey Epstein", "type": "person", "doc_count": 250},
{"id": "Ghislaine Maxwell", "type": "person", "doc_count": 180}
],
"edges": [
{"source": "Jeffrey Epstein", "target": "Ghislaine Maxwell", "weight": 85, "shared_docs": 85}
]
}
The deduplication step is essential for accurate relationship mapping - without it, "Epstein" and "Jeffrey Epstein" would appear as separate nodes.
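A rough sketch of how that co-occurrence data could be assembled from the per-document entity lists, applying the dedupe mapping first (the per-document people field is an assumption about the results schema):

```python
# sketch: build co-occurrence graph data from per-document people lists
import json
import pathlib
from collections import Counter
from itertools import combinations

with open("dedupe.json") as f:
    dedupe = json.load(f)["people"]

doc_count, edges = Counter(), Counter()
for path in pathlib.Path("results").rglob("*.json"):
    record = json.loads(path.read_text())
    # canonicalize first so "Epstein" and "Jeffrey Epstein" merge into one node
    people = {dedupe.get(p, p) for p in record.get("people", [])}
    doc_count.update(people)
    edges.update(combinations(sorted(people), 2))

graph = {
    "nodes": [{"id": p, "type": "person", "doc_count": n}
              for p, n in doc_count.most_common()],
    "edges": [{"source": a, "target": b, "weight": w, "shared_docs": w}
              for (a, b), w in edges.most_common()],
}
print(json.dumps(graph, indent=2))
```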
This is an independent archival project. Documents are sourced from public releases. The maintainers make no representations about completeness or accuracy of the archive.
This project is licensed under the MIT License - see the LICENSE file for details.
The code in this repository is open source and free to use. The documents themselves are public records.
Repository: https://github.com/epstein-docs/epstein-docs
If you find this archive useful, consider supporting its maintenance and hosting:
Bitcoin: bc1qmahlh5eql05w30cgf5taj3n23twmp0f5xcvnnz