An unedited mirror of an OCR of the Epstein Estate's documentation, as released by the House Oversight Committee.
An automatically processed, OCR'd, searchable archive of publicly released documents related to the Jeffrey Epstein case.
This project automatically processes thousands of scanned document pages with AI-powered OCR, extracting their text and metadata so the pages can be reassembled into complete documents and published as a searchable website.
This is a public service project. All documents are from public releases. This archive makes them more accessible and searchable.
```
.
├── process_images.py       # Python script to OCR images using AI
├── requirements.txt        # Python dependencies
├── .env.example            # Example environment configuration
├── downloads/               # Place document images here
├── results/                  # Extracted JSON data per document
├── src/                      # 11ty source files for website
├── .eleventy.js              # Static site generator configuration
└── _site/                    # Generated static website (after build)
```
Python (for OCR processing):

```bash
pip install -r requirements.txt
```
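If you want to keep the Python dependencies isolated, a standard virtual environment works first (optional, not specific to this project):

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```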
Node.js (for website generation):

```bash
npm install
```
Copy .env.example to .env and configure your OpenAI-compatible API endpoint:
```bash
cp .env.example .env
# Edit .env with your API details
```
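The actual variable names are defined in .env.example; as a rough sketch, an OpenAI-compatible setup typically needs a base URL, an API key, and a model name. The keys below are illustrative placeholders, not necessarily the ones this project reads:

```
# Hypothetical .env values — copy the real keys from .env.example
OPENAI_API_BASE=https://your-endpoint.example.com/v1
OPENAI_API_KEY=your-api-key-here
OPENAI_MODEL=your-model-name
```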
Place document images in the downloads/ directory, then run:
```bash
python process_images.py
# Options:
#   --limit N      # Process only N images (for testing)
#   --workers N    # Number of parallel workers (default: 5)
#   --no-resume    # Process all files, ignore index
```
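For example, to sanity-check the pipeline on a small batch with extra parallelism (both flags are listed above):

```bash
# Test run: only 10 images, 8 parallel workers
python process_images.py --limit 10 --workers 8
```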
The script will:
- Write the extracted data for each image to ./results/{folder}/{imagename}.json
- Track progress in processing_index.json so interrupted runs can resume where they left off (resume-friendly)

Build the static site from the processed data:
```bash
npm run build   # Build static site to _site/
npm start       # Development server with live reload
```
1. Document Processing: Images are sent to an AI vision model that extracts the page text along with identifying metadata such as document and page numbers.
2. Document Grouping: Individual page scans are automatically grouped by document number and sorted by page number to reconstruct complete documents (a rough sketch of this step follows the list).
3. Static Site Generation: 11ty processes the JSON data to create a browsable, searchable static website.
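The grouping step is conceptually similar to the sketch below. The field names document_number and page_number are assumptions about the per-page JSON, not a documented schema of this project's result files:

```python
import json
from collections import defaultdict
from pathlib import Path

# Collect every per-page result and bucket it by its (assumed) document number.
documents = defaultdict(list)
for result_file in Path("results").rglob("*.json"):
    page = json.loads(result_file.read_text(encoding="utf-8"))
    documents[page.get("document_number", "unknown")].append(page)

# Within each document, order pages by their (assumed, numeric) page number.
for doc_id, pages in documents.items():
    pages.sort(key=lambda p: p.get("page_number") or 0)
    print(f"{doc_id}: {len(pages)} page(s)")
```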
This is an open archive project, and contributions are welcome.
If you find this archive useful, consider supporting its maintenance and hosting:
Bitcoin: bc1qmahlh5eql05w30cgf5taj3n23twmp0f5xcvnnz
The site is automatically deployed to GitHub Pages on every push to the main branch.
The site will be available at: https://epstein-docs.github.io/epstein-docs/
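The actual workflow lives under .github/ in the repository; a minimal Pages deployment along these lines would look roughly like the following (a sketch, not the project's exact configuration):

```yaml
# Sketch of a GitHub Pages deployment workflow, not the repo's actual file.
name: Deploy site
on:
  push:
    branches: [main]
permissions:
  contents: read
  pages: write
  id-token: write
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run build
      - uses: actions/upload-pages-artifact@v3
        with:
          path: _site
      - uses: actions/deploy-pages@v4
```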
This project is licensed under the MIT License - see the LICENSE file for details.
The code in this repository is open source and free to use. The documents themselves are public records.
Repository: https://github.com/epstein-docs/epstein-docs
This is an independent archival project. Documents are sourced from public releases. The maintainers make no representations about completeness or accuracy of the archive.