# Epstein Files Archive

An automatically processed, OCR'd, searchable archive of publicly released documents related to the Jeffrey Epstein case.

## About

This project automatically processes thousands of scanned document pages using AI-powered OCR to:
- Extract and preserve all text (printed and handwritten)
- Identify and index entities (people, organizations, locations, dates)
- Reconstruct multi-page documents from individual scans
- Provide a searchable web interface to explore the archive

**This is a public service project.** All documents are from public releases. This archive makes them more accessible and searchable.

## Features

- **Full OCR**: Extracts both printed and handwritten text from all documents
- **Entity Extraction**: Automatically identifies and indexes:
  - People mentioned
  - Organizations
  - Locations
  - Dates
  - Reference numbers
- **Document Reconstruction**: Groups scanned pages back into complete documents
- **Searchable Interface**: Browse by person, organization, location, date, or document type
- **Static Site**: Fast, lightweight, works anywhere

## Project Structure

```
.
├── process_images.py       # Python script to OCR images using AI
├── requirements.txt         # Python dependencies
├── .env.example            # Example environment configuration
├── downloads/              # Place document images here
├── results/                # Extracted JSON data per document
├── src/                    # 11ty source files for website
├── .eleventy.js            # Static site generator configuration
└── _site/                  # Generated static website (after build)
```

## Setup

### 1. Install Dependencies

**Python (for OCR processing):**
```bash
pip install -r requirements.txt
```

**Node.js (for website generation):**
```bash
npm install
```

### 2. Configure API

Copy `.env.example` to `.env` and configure your OpenAI-compatible API endpoint:

```bash
cp .env.example .env
# Edit .env with your API details
```
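As a rough illustration of how a script might read these settings, the sketch below parses simple `KEY=VALUE` lines from `.env` using only the standard library. The helper names `load_env_file` and `get_setting` are hypothetical, and the actual variable names expected by `process_images.py` are whatever `.env.example` defines; this is not the script's own loader.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_setting(name, file_values, default=None):
    """Prefer a real environment variable, then the .env file, then the default."""
    return os.environ.get(name, file_values.get(name, default))
```

Letting real environment variables take precedence over the file makes it easy to override a setting for a single run without editing `.env`.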

### 3. Process Documents

Place document images in the `downloads/` directory, then run:

```bash
python process_images.py

# Options:
# --limit N          # Process only N images (for testing)
# --workers N        # Number of parallel workers (default: 5)
# --no-resume        # Reprocess all files, ignoring processing_index.json
```

The script will:
- Process each image through the OCR API
- Extract text, entities, and metadata
- Save results to `./results/{folder}/{imagename}.json`
- Track progress in `processing_index.json` (resume-friendly)
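The resume behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `process_images.py`: the `"processed"` key inside `processing_index.json` and all function names are assumptions made for the example.

```python
import json
from pathlib import Path

INDEX_PATH = Path("processing_index.json")


def load_index(index_path=INDEX_PATH):
    """Return the set of already-processed image paths (empty if no index yet)."""
    index_path = Path(index_path)
    if not index_path.exists():
        return set()
    data = json.loads(index_path.read_text(encoding="utf-8"))
    return set(data.get("processed", []))  # "processed" is an assumed layout


def save_index(processed, index_path=INDEX_PATH):
    """Write to a temporary file first, then rename it over the index."""
    index_path = Path(index_path)
    tmp = index_path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"processed": sorted(processed)}, indent=2), encoding="utf-8")
    tmp.replace(index_path)


def result_path(image_path, downloads_dir="downloads", results_dir="results"):
    """Map downloads/{folder}/{imagename}.ext to results/{folder}/{imagename}.json."""
    relative = Path(image_path).relative_to(downloads_dir)
    return Path(results_dir) / relative.with_suffix(".json")


def pending_images(all_images, processed, resume=True):
    """With resume enabled, skip images already recorded in the index."""
    if not resume:
        return list(all_images)
    return [p for p in all_images if str(p) not in processed]
```

Writing the index through a temporary file means an interrupted run is less likely to leave a half-written `processing_index.json` behind.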

### 4. Generate Website

Build the static site from the processed data:

```bash
npm run build    # Build static site to _site/
npm start        # Development server with live reload
```

## How It Works

1. **Document Processing**: Images are sent to an AI vision model that extracts:
   - All text in reading order
   - Document metadata (page numbers, document numbers, dates)
   - Named entities (people, orgs, locations)
   - Text type annotations (printed, handwritten, stamps)

2. **Document Grouping**: Individual page scans are automatically grouped by document number and sorted by page number to reconstruct complete documents.

3. **Static Site Generation**: 11ty processes the JSON data to create:
   - Index pages for all entities
   - Individual document pages with full text
   - Search and browse interfaces
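The document grouping step (step 2) can be pictured with a short sketch. The field names `document_number` and `page_number` are assumptions about the per-page JSON, chosen for illustration; the real keys are whatever `process_images.py` writes to `results/`.

```python
from collections import defaultdict


def group_pages(pages):
    """Group page records by document number, then sort each group by page number."""
    groups = defaultdict(list)
    for page in pages:
        # Pages without a document number are collected under None.
        groups[page.get("document_number")].append(page)

    def sort_key(page):
        number = page.get("page_number")
        # Pages with no page number sort after the numbered ones.
        return (number is None, number or 0)

    return {doc: sorted(items, key=sort_key) for doc, items in groups.items()}


if __name__ == "__main__":
    sample = [
        {"document_number": "A-1", "page_number": 2, "text": "second"},
        {"document_number": "B-7", "page_number": 1, "text": "other"},
        {"document_number": "A-1", "page_number": 1, "text": "first"},
    ]
    for doc, items in group_pages(sample).items():
        print(doc, [p["text"] for p in items])
```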

## Performance

- Processes ~2,000 pages into ~400 multi-page documents
- Handles LLM inconsistencies in document number formatting
- Resume-friendly processing (skip already-processed files)
- Parallel processing with configurable workers
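To illustrate what handling inconsistent document numbers can involve, here is a minimal normalization sketch. It assumes the inconsistencies are limited to letter case, stray whitespace, and mixed separators (spaces, underscores, hyphens, Unicode dashes); the project's actual rules may differ, and `normalize_document_number` is a hypothetical name.

```python
import re

# Runs of whitespace, underscores, hyphens, or Unicode dashes (U+2010 to U+2015).
_SEPARATORS = re.compile(r"[\s_\-\u2010-\u2015]+")


def normalize_document_number(raw):
    """Return a canonical form of a document number, or None if it is empty."""
    if raw is None:
        return None
    # Collapse every separator run to a single hyphen and trim the ends.
    cleaned = _SEPARATORS.sub("-", str(raw).strip()).strip("-")
    return cleaned.upper() or None
```

Applying a function like this before grouping lets pages whose numbers differ only in formatting land in the same document.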

## Contributing

This is an open archive project. Contributions are welcome:
- Report issues with OCR accuracy
- Suggest UI improvements
- Add additional document sources
- Improve entity extraction

## Support This Project

If you find this archive useful, consider supporting its maintenance and hosting:

**Bitcoin**: `bc1qmahlh5eql05w30cgf5taj3n23twmp0f5xcvnnz`

## Deployment

The site is automatically deployed to GitHub Pages on every push to the main branch.

### GitHub Pages Setup

1. Push this repository to GitHub: `https://github.com/epstein-docs/epstein-docs`
2. Go to Settings → Pages
3. Source: GitHub Actions
4. The workflow will automatically build and deploy the site

The site will be available at: `https://epstein-docs.github.io/epstein-docs/`

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

The code in this repository is open source and free to use. The documents themselves are public records.

**Repository**: https://github.com/epstein-docs/epstein-docs

## Disclaimer

This is an independent archival project. Documents are sourced from public releases. The maintainers make no representations about completeness or accuracy of the archive.