An unedited mirror of an OCR of the Epstein Estate's documentation, as released by the House Oversight Committee.
An automatically processed, OCR'd, searchable archive of publicly released documents related to the Jeffrey Epstein case.
This project automatically processes thousands of scanned document pages using AI-powered OCR to:

- Extract the full text of each page
- Identify the people, organizations, and locations mentioned
- Classify each page's document type
- Reassemble page scans into complete, searchable documents
This is a public service project. All documents are from public releases. This archive makes them more accessible and searchable.
.
├── process_images.py # Python script to OCR images using AI
├── cleanup_failed.py # Python script to clean up failed processing
├── deduplicate.py # Python script to deduplicate entities
├── deduplicate_types.py # Python script to deduplicate document types
├── analyze_documents.py # Python script to generate AI summaries
├── requirements.txt # Python dependencies
├── .env.example # Example environment configuration
├── downloads/ # Place document images here
├── results/ # Extracted JSON data per document
├── processing_index.json # Processing progress tracking (generated)
├── dedupe.json # Entity deduplication mappings (generated)
├── dedupe_types.json # Document type deduplication mappings (generated)
├── analyses.json # AI document analyses (generated)
├── src/ # 11ty source files for website
├── .eleventy.js # Static site generator configuration
└── _site/ # Generated static website (after build)
Python (for OCR processing):
pip install -r requirements.txt
Node.js (for website generation):
npm install
Copy .env.example to .env and configure your OpenAI-compatible API endpoint:
cp .env.example .env
# Edit .env with your API details
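The scripts read these settings at startup. As a rough sketch of that pattern using python-dotenv (the variable names OPENAI_API_BASE, OPENAI_API_KEY, and MODEL here are assumptions; the authoritative names are in .env.example):

```python
# sketch: load API settings from .env (variable names are assumptions)
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from ./.env into the environment

API_BASE = os.environ.get("OPENAI_API_BASE", "https://api.openai.com/v1")
API_KEY = os.environ["OPENAI_API_KEY"]  # fail loudly if unset
MODEL = os.environ.get("MODEL", "gpt-4o-mini")  # assumed default, not confirmed
```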
Place document images in the downloads/ directory, then run:
python process_images.py
# Options:
# --limit N # Process only N images (for testing)
# --workers N # Number of parallel workers (default: 5)
# --no-resume # Process all files, ignore index
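Resume support comes from processing_index.json: images already recorded there are skipped unless --no-resume is passed. A minimal sketch of that idea, assuming the index is a flat JSON list of processed paths (the real schema may differ):

```python
# sketch: skip images already recorded in processing_index.json
import json
import pathlib

INDEX = pathlib.Path("processing_index.json")
done = set(json.loads(INDEX.read_text())) if INDEX.exists() else set()

for image in sorted(pathlib.Path("downloads").rglob("*.jpg")):
    if str(image) in done:
        continue  # already processed; skipped unless --no-resume
    # ... OCR the image and write ./results/{folder}/{imagename}.json ...
    done.add(str(image))
    INDEX.write_text(json.dumps(sorted(done), indent=2))
```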
The script will:

- Save extracted data to ./results/{folder}/{imagename}.json
- Track progress in processing_index.json (resume-friendly)

If processing fails or you need to retry failed files:
# Check for failures (dry run)
python cleanup_failed.py
# Remove failed files from processed list (so they can be retried)
python cleanup_failed.py --doit
# Also delete corrupt JSON files
python cleanup_failed.py --doit --delete-invalid-json
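In outline, the cleanup pass just looks for result files whose JSON will not parse; a simplified sketch (not the script's actual code):

```python
# sketch: find corrupt result files so the originals can be retried
import json
import pathlib

failed = []
for result in pathlib.Path("results").rglob("*.json"):
    try:
        json.loads(result.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        failed.append(result)  # unreadable output -> retry candidate

print(f"{len(failed)} invalid result files")
# --doit would remove these from processing_index.json;
# --delete-invalid-json would also delete the corrupt files
```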
The LLM may extract the same entity with different spellings (e.g., "Epstein", "Jeffrey Epstein", "J. Epstein"). Run the deduplication script to merge these:
python deduplicate.py
# Options:
# --batch-size N # Process N entities per batch (default: 50)
# --show-stats # Show deduplication stats without processing
This will:

- Collect entity names from all files in ./results/
- Use the LLM to merge duplicate spellings into a dedupe.json mapping file

Example dedupe.json:
{
"people": {
"Epstein": "Jeffrey Epstein",
"J. Epstein": "Jeffrey Epstein",
"Jeffrey Epstein": "Jeffrey Epstein"
},
"organizations": {...},
"locations": {...}
}
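Downstream steps can then normalize every extracted name through this mapping. A minimal lookup sketch, assuming the dedupe.json structure shown above:

```python
# sketch: canonicalize entity names via dedupe.json
import json

with open("dedupe.json") as f:
    dedupe = json.load(f)

def canonical(name: str, kind: str = "people") -> str:
    # fall back to the raw name when no mapping exists
    return dedupe.get(kind, {}).get(name, name)

print(canonical("J. Epstein"))  # -> Jeffrey Epstein (with the example mapping)
```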
Deduplicate Document Types:
The LLM may also extract document types with inconsistent formatting (e.g., "deposition", "Deposition", "DEPOSITION TRANSCRIPT"). Run the type deduplication script:
python deduplicate_types.py
This will:

- Collect document types from all files in ./results/
- Merge inconsistent variants into a dedupe_types.json mapping file

Example dedupe_types.json:
{
"stats": {
"original_types": 45,
"canonical_types": 12,
"reduction_percentage": 73.3
},
"mappings": {
"deposition": "Deposition",
"DEPOSITION": "Deposition",
"deposition transcript": "Deposition",
"court filing": "Court Filing"
}
}
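The stats block is straightforward bookkeeping: reduction_percentage is (original_types - canonical_types) / original_types * 100, which checks out for the example values:

```python
# how the example reduction_percentage is derived
original_types, canonical_types = 45, 12
reduction = (original_types - canonical_types) / original_types * 100
print(round(reduction, 1))  # 73.3
```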
Generate AI summaries and insights for each document:
python analyze_documents.py
# Options:
# --limit N # Analyze only N documents (for testing)
# --force # Re-analyze all documents (ignore existing)
This will:

- Generate a summary and key insights for each document
- Save the results to analyses.json

Example analysis output:
{
"document_type": "deposition",
"key_topics": ["Flight logs", "Private aircraft", "Passenger manifests"],
"key_people": [
{"name": "Jeffrey Epstein", "role": "Aircraft owner"}
],
"significance": "Documents flight records showing passenger lists...",
"summary": "This deposition contains testimony regarding..."
}
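Under the hood this is a chat-completion call against the configured OpenAI-compatible endpoint. One way such a request could look (the prompt, model, and JSON handling below are assumptions, not the script's actual code):

```python
# sketch: one possible per-document analysis request (details are assumptions)
import json
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=os.environ.get("OPENAI_API_BASE"),  # OpenAI-compatible endpoint
    api_key=os.environ["OPENAI_API_KEY"],
)

def analyze(document_text: str) -> dict:
    response = client.chat.completions.create(
        model=os.environ.get("MODEL", "gpt-4o-mini"),  # assumed default
        messages=[
            {"role": "system", "content": (
                "Summarize this document as JSON with document_type, "
                "key_topics, key_people, significance, and summary.")},
            {"role": "user", "content": document_text},
        ],
        response_format={"type": "json_object"},  # not all endpoints support this
    )
    return json.loads(response.choices[0].message.content)
```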
Build the static site from the processed data:
npm run build # Build static site to _site/
npm start # Development server with live reload
The build process will automatically:

- Apply entity deduplication if dedupe.json exists
- Include AI analyses if analyses.json exists

Document Processing: Images are sent to an AI vision model that extracts:

- The full text of the page
- People, organizations, and locations mentioned
- Document metadata such as document type, document number, and page number
Document Grouping: Individual page scans are automatically grouped by document number and sorted by page number to reconstruct complete documents
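A minimal sketch of that grouping step, assuming each result JSON carries document_number and page_number fields (an inference from the description above):

```python
# sketch: regroup per-page results into complete documents
import json
import pathlib
from collections import defaultdict

pages = defaultdict(list)
for path in pathlib.Path("results").rglob("*.json"):
    record = json.loads(path.read_text())
    pages[record["document_number"]].append(record)  # assumed field names

documents = {
    doc_id: sorted(doc_pages, key=lambda p: p["page_number"])
    for doc_id, doc_pages in pages.items()
}
```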
Static Site Generation: 11ty processes the JSON data to create:

- A page for each reconstructed document
- Browsable indexes of people, organizations, locations, and document types
- A searchable front end for the whole archive
This is an open archive project, and contributions are welcome.
The site is automatically deployed to GitHub Pages on every push to the main branch.
Deployment repository: https://github.com/epstein-docs/epstein-docs.github.io

The site will be available at: https://epstein-docs.github.io/
Once entities are deduplicated, the next step is to visualize relationships between people, organizations, and locations. Potential approaches:
Pre-generate graph data during the build process so the site stays fully static.

Graph types to consider: entity co-occurrence networks (who appears in the same documents together) and document timelines.

Implementation ideas: emit graph JSON at build time and render it on dedicated pages (e.g., /graphs/people/, /graphs/timeline/).

Example graph data:

{
"nodes": [
{"id": "Jeffrey Epstein", "type": "person", "doc_count": 250},
{"id": "Ghislaine Maxwell", "type": "person", "doc_count": 180}
],
"edges": [
{"source": "Jeffrey Epstein", "target": "Ghislaine Maxwell", "weight": 85, "shared_docs": 85}
]
}
The deduplication step is essential for accurate relationship mapping - without it, "Epstein" and "Jeffrey Epstein" would appear as separate nodes.
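A rough sketch of how that co-occurrence data could be assembled from the per-document entity lists, applying the dedupe mapping first (the per-document people field is an assumption about the results schema):

```python
# sketch: build co-occurrence graph data from per-document people lists
import json
import pathlib
from collections import Counter
from itertools import combinations

with open("dedupe.json") as f:
    dedupe = json.load(f)["people"]

doc_count, edges = Counter(), Counter()
for path in pathlib.Path("results").rglob("*.json"):
    record = json.loads(path.read_text())
    # canonicalize first so "Epstein" and "Jeffrey Epstein" merge into one node
    people = {dedupe.get(p, p) for p in record.get("people", [])}
    doc_count.update(people)
    edges.update(combinations(sorted(people), 2))

graph = {
    "nodes": [{"id": p, "type": "person", "doc_count": n}
              for p, n in doc_count.most_common()],
    "edges": [{"source": a, "target": b, "weight": w, "shared_docs": w}
              for (a, b), w in edges.most_common()],
}
print(json.dumps(graph, indent=2))
```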
This is an independent archival project. Documents are sourced from public releases. The maintainers make no representations about completeness or accuracy of the archive.
This project is licensed under the MIT License - see the LICENSE file for details.
The code in this repository is open source and free to use. The documents themselves are public records.
Repository: https://github.com/epstein-docs/epstein-docs
If you find this archive useful, consider supporting its maintenance and hosting:
Bitcoin: bc1qmahlh5eql05w30cgf5taj3n23twmp0f5xcvnnz