Case Reports
Structured Case Reports / Case-Related Data

Input documents: PDFs containing case narratives, investigation reports, and case-related data.

Layer 1
Ingestion
Text Extraction & Normalization
# Extract text from PDF
text = extract_pdf_text("case_report.pdf")
# Normalize format
normalized = normalize_text(text)
# Returns: Clean, structured text ready for processing
  • PDF text extraction using pdfplumber
  • Organization detection from filenames
  • Batch processing support
  • Input validation and error handling
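The ingestion steps above can be sketched roughly as follows. This is a minimal sketch, not the project's code: the bodies of extract_pdf_text and normalize_text are assumptions about what those functions might do, and detect_organization (with its "org name is the filename prefix" convention) is a hypothetical helper. Only pdfplumber.open() and page.extract_text() are taken from the pdfplumber library itself.

```python
import re
import unicodedata
from pathlib import Path


def extract_pdf_text(pdf_path: str) -> str:
    """Validate the path, then concatenate the text of every PDF page."""
    path = Path(pdf_path)
    if not path.is_file() or path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a readable PDF file: {pdf_path}")

    import pdfplumber  # imported lazily so the text helpers work without it

    with pdfplumber.open(str(path)) as pdf:
        # extract_text() can return None for pages without a text layer
        pages = [page.extract_text() or "" for page in pdf.pages]
    return "\n".join(pages)


def normalize_text(text: str) -> str:
    """Normalize Unicode, rejoin hyphenated line breaks, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # "investi-\ngation" -> "investigation"
    text = re.sub(r"[ \t]+", " ", text)           # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


def detect_organization(pdf_path: str) -> str:
    """Assumed convention: the org name is the filename prefix before '_', '-' or space."""
    stem = Path(pdf_path).stem.lower()
    return re.split(r"[_\-\s]", stem, maxsplit=1)[0]
```

Batch processing would then be a loop over a directory of PDFs that calls these helpers and records any ValueError per file instead of stopping the run.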
Layer 2
Processing
Feature Extraction & Case Schema

How it's done: A hybrid approach. Regex patterns extract structured data (demographics, platforms, evidence, prosecution), and pattern-based matching identifies semantic features (severity indicators, case topics, severity phrases). ML/NER extraction supplements these with law enforcement agencies, ages, dates, and locations. Before extraction, text is cleaned (URL removal, artifact normalization), cases are batched by month patterns, and each case is assigned a unique case ID.

# 1. Clean URLs and artifacts
cleaned_text = clean_urls_from_text(raw_text)

# 2. Batch cases by month patterns
cases = case_batching(cleaned_text, org_name="azicac")

# 3. Extract features for each batched case (regex + patterns + NER)
features = [extract_features(case) for case in cases]
# Regex: Demographics, platforms, evidence, prosecution
# Patterns: Severity indicators, case topics, phrases
# NER: Law enforcement agencies, ages, dates, locations
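Steps 1 and 2 could look something like the sketch below. It assumes each batch in a report begins with a heading line such as "March 2023"; that layout, the helper names strip_urls, split_by_month and make_case_id, and the hash-based ID scheme are all illustrative assumptions rather than the behavior of the project's clean_urls_from_text or case_batching.

```python
import hashlib
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed layout: each batch starts with a heading line such as "March 2023".
MONTH_HEADING = re.compile(rf"^\s*({MONTHS})\s+(\d{{4}})\s*$", re.MULTILINE)
URL = re.compile(r"https?://\S+|www\.\S+")


def strip_urls(text: str) -> str:
    """Remove URLs, then tidy the doubled spaces they leave behind."""
    return re.sub(r"[ \t]{2,}", " ", URL.sub("", text))


def split_by_month(text: str) -> list[dict]:
    """Split text into one chunk per month heading, keeping the heading as metadata."""
    matches = list(MONTH_HEADING.finditer(text))
    batches = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        if body:
            batches.append({"month": m.group(1), "year": int(m.group(2)), "text": body})
    return batches


def make_case_id(org_name: str, case_text: str) -> str:
    """Deterministic ID: org prefix plus a short SHA-256 digest of the case text."""
    digest = hashlib.sha256(case_text.encode("utf-8")).hexdigest()[:12]
    return f"{org_name}-{digest}"
```

A content-derived ID has the side effect that re-ingesting the same report yields the same IDs, which makes repeated runs idempotent; a counter or UUID would work equally well if that property is not needed.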
Demographics
Victim age, gender, count
Perpetrator age, gender, RSO status
Platforms
Social media, online methods, communication channels
Severity
Infant, very young, rape, production indicators
Topics
Hands-on, possession, online-only, family, stranger
Prosecution
Charges, booking status, outcomes
Evidence
Images, videos, storage volume, messages
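The regex and keyword half of the hybrid approach might be sketched as below. The vocabularies and patterns here are illustrative assumptions (the real pattern lists are not shown in this overview), the function body is not the project's extract_features, and the NER step is omitted entirely.

```python
import re

# Illustrative vocabularies only; the real pattern lists are not shown here.
PLATFORM_TERMS = ["snapchat", "instagram", "discord", "kik", "facebook"]
TOPIC_PATTERNS = {
    "possession": re.compile(r"\bpossess(?:ion|ed|ing)\b", re.I),
    "online-only": re.compile(r"\bonline[- ]only\b", re.I),
}
AGE = re.compile(r"\b(\d{1,2})[- ]year[- ]old\b", re.I)
EVIDENCE_COUNT = re.compile(r"\b(\d[\d,]*)\s+(images?|videos?)\b", re.I)
CHARGE = re.compile(r"\bcharged with ([^.;]+)", re.I)


def extract_features(case_text: str) -> dict:
    """Return a small feature dict built from regex and keyword matches."""
    lowered = case_text.lower()
    evidence = {}
    for count, kind in EVIDENCE_COUNT.findall(case_text):
        key = kind.lower().rstrip("s") + "s"  # "image"/"images" -> "images"
        evidence[key] = evidence.get(key, 0) + int(count.replace(",", ""))
    return {
        "ages": [int(a) for a in AGE.findall(case_text)],
        "platforms": [p for p in PLATFORM_TERMS if re.search(rf"\b{p}\b", lowered)],
        "case_topics": [t for t, rx in TOPIC_PATTERNS.items() if rx.search(case_text)],
        "evidence": evidence,
        "charges": [c.strip() for c in CHARGE.findall(case_text)],
    }
```

Note that a bare age match cannot tell whose age it is; assigning ages to roles would need context rules or the NER pass mentioned above.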
Layer 3
Storage
Case schema & persistence
# Production: DATABASE_URL → PostgreSQL (Railway).
# Local/dev: local SQLite file (caselinker.db).
storage = CaseStorage() # or CaseStorage("caselinker.db")
storage.store_case(case)

# cases: JSON columns (topics, severity, platforms, …),
# raw_data (ingestion + case_text), extracted_features (slim schema)
# Postgres extras: precomputed_clusters, cluster_groups_slim (fast /api/cluster-groups)
  • Deployed on Railway with PostgreSQL; SQLite for local development
  • Normalized columns plus raw_data / extracted_features JSON
  • Optional slim cluster caches on Postgres for large corpora
  • Shared CaseStorage interface; hydrate/slim via case_storage_utils
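To make the "normalized columns plus JSON" idea concrete, here is a minimal sketch of the SQLite side only. SketchCaseStorage, its column set, and get_case are hypothetical stand-ins; the actual CaseStorage schema and the PostgreSQL path are not reproduced here.

```python
import json
import sqlite3


class SketchCaseStorage:
    """Hypothetical stand-in for CaseStorage, covering only the SQLite path."""

    JSON_COLUMNS = ("topics", "severity", "platforms", "raw_data", "extracted_features")

    def __init__(self, db_path: str = "caselinker.db"):
        # A production variant would check the DATABASE_URL environment variable
        # and open a PostgreSQL connection instead; that branch is omitted here.
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS cases (
                   case_id TEXT PRIMARY KEY,
                   org_name TEXT,
                   topics TEXT, severity TEXT, platforms TEXT,
                   raw_data TEXT, extracted_features TEXT
               )"""
        )

    def store_case(self, case: dict) -> None:
        """Insert or replace one case, serializing JSON columns to text."""
        row = {"case_id": case["case_id"], "org_name": case.get("org_name")}
        for col in self.JSON_COLUMNS:
            row[col] = json.dumps(case.get(col))
        self.conn.execute(
            "INSERT OR REPLACE INTO cases VALUES "
            "(:case_id, :org_name, :topics, :severity, :platforms, "
            ":raw_data, :extracted_features)",
            row,
        )
        self.conn.commit()

    def get_case(self, case_id: str):
        """Load one case and decode its JSON columns; None if absent."""
        found = self.conn.execute(
            "SELECT * FROM cases WHERE case_id = ?", (case_id,)
        ).fetchone()
        if found is None:
            return None
        case = dict(found)
        for col in self.JSON_COLUMNS:
            case[col] = json.loads(case[col])
        return case
```

Passing ":memory:" as db_path gives a throwaway database, which is convenient for tests.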
Layer 4
Analysis
Filtering, facets, clustering, triage
# Tag intersection (topics, severity, platforms, investigation, relationship, custom)
cases = return_tagged_cases(all_cases, [
  {'tag': 'production', 'category': 'case_topics'},
  {'tag': 'infant', 'category': 'severity_indicators'},
])

# HTTP: /api/facet-tree, /api/facet-distinct, /api/facet-cohort-members
# (navigable tag combinations + cohort case IDs from live DB)

# Five cluster families + Jaccard “general”; /api/cluster-groups, /api/automated-analysis
triaged = triage_cases(cases) # rule-based 5–10 score
# Triage ML: saved sklearn bundle; /api/triage-model-corpus (live DB), /api/triage-live
  • Filter cases sharing the same tags (intersection logic)
  • Facet tree APIs for exploration with processed features
  • Clustering and automated insights in analysis.py; optional ML triage alongside rules
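The intersection filter and the Jaccard comparison mentioned above can be illustrated with a short sketch. It assumes each case is a dict whose category keys (for example case_topics or platforms) hold lists of tags; return_tagged_cases_sketch, tag_set and jaccard are hypothetical helpers, not the functions in analysis.py.

```python
def return_tagged_cases_sketch(all_cases, required_tags):
    """Keep cases that carry every requested tag in its category (AND logic)."""
    return [
        case for case in all_cases
        if all(req["tag"] in case.get(req["category"], []) for req in required_tags)
    ]


def tag_set(case, categories=("case_topics", "severity_indicators", "platforms")):
    """Flatten a case's tags into a set of (category, tag) pairs."""
    return {(cat, tag) for cat in categories for tag in case.get(cat, [])}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Pairing each tag with its category keeps identical labels in different categories from being counted as a match; a "general" similarity between two cases is then jaccard(tag_set(a), tag_set(b)).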
Layer 5
Visualization
Analyst UI: static pages + JSON APIs
# FastAPI serves visualization/*.html; browser fetches /api/* JSON.

# Examples: /api/cases (full row on demand), /api/case-count, /api/stats, /api/cluster-groups, /api/facet-tree, /api/triage-eval, /api/triage-live

# D3 (and small helpers) in-page: clusters, stats, search, audit trail, live triage, analysis facet explorer
  • Slim list endpoints where possible; click-through loads full case text from API
  • Same nav across pages; server-side caching where it helps (e.g. cases list, clusters)
  • Modular: new views can sit on the same APIs and storage
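The "slim list, full row on demand" pattern and the server-side caching noted above can be sketched independently of the web framework. The field set in SLIM_FIELDS, the to_slim helper and the TTLCache class are assumptions made for illustration; how the project's endpoints actually trim and cache payloads is not shown in this overview.

```python
import time

SLIM_FIELDS = ("case_id", "org_name", "case_topics", "platforms")  # assumed field set


def to_slim(case: dict) -> dict:
    """Drop heavy fields such as raw text; keep only what a list view needs."""
    return {k: case[k] for k in SLIM_FIELDS if k in case}


class TTLCache:
    """Tiny time-based cache for expensive list or cluster payloads."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get_or_compute(self, key, compute):
        """Return a fresh cached value, or call compute() and cache its result."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

A list endpoint would return [to_slim(c) for c in cases] through the cache, while the per-case endpoint reads the full row from storage only when an analyst clicks through.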

Open source on GitHub
