CaseLinker - System Architecture

Case Reports

Structured Case Reports / Case Related Data

Input: PDF documents containing case narratives, investigation reports, and case-related data.

Layer 1

Ingestion

Text Extraction & Normalization

# Extract text from PDF

text = extract_pdf_text("case_report.pdf")

# Normalize format

normalized = normalize_text(text)

# Returns: Clean, structured text ready for processing

PDF text extraction using pdfplumber
Organization detection from filenames
Batch processing support
Input validation and error handling
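The ingestion steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the normalization rules and the filename convention (organization name as the leading token, as in a hypothetical "azicac_2023_report.pdf") are assumptions. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real pdfplumber calls.

```python
import re
from pathlib import Path


def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in a PDF using pdfplumber."""
    import pdfplumber  # third-party; imported lazily so the helpers below run without it

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def normalize_text(text: str) -> str:
    """Assumed normalization: unify newlines, rejoin hyphenated words, collapse whitespace."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Heuristic: "investi-\ngation" -> "investigation" (also joins true hyphenated compounds)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # allow at most one blank line in a row
    return text.strip()


def detect_organization(filename: str) -> str:
    """Guess the organization from the leading letters of the filename (assumed convention)."""
    stem = Path(filename).stem.lower()
    match = re.match(r"[a-z]+", stem)
    return match.group(0) if match else "unknown"
```

Batch processing would then amount to looping these helpers over a directory of PDFs and skipping files that fail validation.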

Layer 2

Processing

Feature Extraction & Case Schema

How it's done: Feature extraction is hybrid. Regex patterns capture structured data (demographics, platforms, evidence, prosecution), and pattern-based matching captures semantic features (severity indicators, case topics, severity phrases). ML/NER extraction supplements both with law enforcement agencies, ages, dates, and locations. Before extraction, the text is cleaned (URL removal, artifact normalization), cases are batched by temporal patterns, and each case is assigned a unique case ID.

# 1. Clean URLs and artifacts

cleaned_text = clean_urls_from_text(raw_text)

# 2. Batch cases by month patterns

cases = case_batching(cleaned_text, org_name="azicac")

# 3. Extract features (regex + patterns + NER)

features = [extract_features(case) for case in cases]

# Regex: Demographics, platforms, evidence, prosecution

# Patterns: Severity indicators, case topics, phrases

# NER: Law enforcement agencies, ages, dates, locations
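Steps 1 and 2 above might look something like the sketch below. The function names match the calls shown, but the bodies are assumptions: it is assumed here that a "month pattern" is a line holding only a month name and a year, and that case IDs are derived from a content hash. The source does not specify either detail.

```python
import hashlib
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Assumed heading format: a line consisting only of a month and a year, e.g. "March 2023"
MONTH_HEADING_RE = re.compile(rf"^(?:{MONTHS})\s+\d{{4}}$", re.MULTILINE)


def clean_urls_from_text(raw_text: str) -> str:
    """Remove URLs, then collapse the doubled spaces their removal leaves behind."""
    without_urls = URL_RE.sub("", raw_text)
    return re.sub(r"[ \t]{2,}", " ", without_urls)


def case_batching(cleaned_text: str, org_name: str) -> list[dict]:
    """Split text at month headings and give each chunk a deterministic case ID."""
    headings = list(MONTH_HEADING_RE.finditer(cleaned_text))
    cases = []
    for i, heading in enumerate(headings):
        # A batch runs from the end of its heading to the start of the next one;
        # any text before the first heading is ignored in this sketch
        end = headings[i + 1].start() if i + 1 < len(headings) else len(cleaned_text)
        body = cleaned_text[heading.end():end].strip()
        if not body:
            continue
        digest = hashlib.sha256(
            f"{org_name}|{heading.group(0)}|{body}".encode()
        ).hexdigest()
        cases.append({
            "case_id": f"{org_name}-{digest[:12]}",  # hypothetical ID scheme
            "period": heading.group(0),
            "org_name": org_name,
            "case_text": body,
        })
    return cases
```

Hash-derived IDs have the convenient property that re-ingesting the same document yields the same IDs, so repeated runs do not create duplicates.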

Demographics

Victim: age, gender, count
Perpetrator: age, gender, RSO (registered sex offender) status

Platforms

Social media, online methods, communication channels

Severity

Infant, very young, rape, production indicators

Topics

Hands-on, possession, online-only, family, stranger

Prosecution

Charges, booking status, outcomes

Evidence

Images, videos, storage volume, messages
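To make the regex-plus-patterns split concrete, here is a small sketch covering a few of the categories above. The pattern tables, the output field names, and the `case_text` key are all illustrative assumptions, and the NER stage is omitted entirely.

```python
import re

# Regex side: structured values with a predictable surface form
AGE_RE = re.compile(r"\b(\d{1,2})[- ]year[- ]old\b", re.IGNORECASE)
EVIDENCE_RE = re.compile(r"\b(\d[\d,]*)\s+(images?|videos?)\b", re.IGNORECASE)

# Pattern side: hypothetical keyword tables; the real lists are not shown in the source
PLATFORM_PATTERNS = {
    "snapchat": re.compile(r"\bsnapchat\b", re.IGNORECASE),
    "discord": re.compile(r"\bdiscord\b", re.IGNORECASE),
    "instagram": re.compile(r"\binstagram\b", re.IGNORECASE),
}
TOPIC_PATTERNS = {
    "possession": re.compile(r"\bpossess(?:ion|ed|ing)\b", re.IGNORECASE),
    "online_only": re.compile(r"\bonline[- ]only\b", re.IGNORECASE),
}


def match_tags(text: str, patterns: dict) -> list:
    """Return the sorted tags whose pattern occurs anywhere in the text."""
    return sorted(tag for tag, pattern in patterns.items() if pattern.search(text))


def extract_features(case: dict) -> dict:
    """Build a small feature dict from a case's text (assumed key: 'case_text')."""
    text = case["case_text"]
    evidence = {}
    for count, kind in EVIDENCE_RE.findall(text):
        key = kind.lower().rstrip("s") + "s"  # "image" and "images" both map to "images"
        evidence[key] = evidence.get(key, 0) + int(count.replace(",", ""))
    return {
        "ages_mentioned": sorted({int(age) for age in AGE_RE.findall(text)}),
        "platforms": match_tags(text, PLATFORM_PATTERNS),
        "case_topics": match_tags(text, TOPIC_PATTERNS),
        "evidence": evidence,
    }
```

Keeping the keyword tables as plain dictionaries means new tags can be added without touching the extraction logic.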

Layer 3

Storage

Case schema & persistence

# Production: DATABASE_URL → PostgreSQL (Railway).

# Local/dev: SQLite file (caselinker.db).

storage = CaseStorage()  # or CaseStorage("caselinker.db")

storage.store_case(case)

# cases: JSON columns (topics, severity, platforms, …),

# raw_data (ingestion + case_text), extracted_features (slim schema)

# Postgres extras: precomputed_clusters, cluster_groups_slim (fast /api/cluster-groups)

Deployed on Railway with PostgreSQL; SQLite for local development
Normalized columns plus raw_data / extracted_features JSON
Optional slim cluster caches on Postgres for large corpora
Shared CaseStorage interface; hydrate/slim via case_storage_utils
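A stripped-down, SQLite-only sketch of this storage layer is shown below. The table layout, the `get_case` method, and the `backend_name` helper are assumptions made for illustration. The real `CaseStorage` also handles PostgreSQL and the slim cluster caches, which are not modeled here.

```python
import json
import os
import sqlite3


class CaseStorage:
    """SQLite-only sketch: scalar columns plus JSON stored as text."""

    # Order matches the table definition below
    JSON_COLUMNS = ("topics", "severity", "platforms", "raw_data", "extracted_features")

    def __init__(self, db_path: str = "caselinker.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS cases (
                   case_id TEXT PRIMARY KEY,
                   org_name TEXT,
                   topics TEXT, severity TEXT, platforms TEXT,
                   raw_data TEXT, extracted_features TEXT
               )"""
        )

    def store_case(self, case: dict) -> None:
        """Insert or overwrite a case, serializing the JSON columns."""
        row = [case["case_id"], case.get("org_name")]
        row += [json.dumps(case.get(col)) for col in self.JSON_COLUMNS]
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO cases VALUES (?, ?, ?, ?, ?, ?, ?)", row
            )

    def get_case(self, case_id: str):
        """Load one case and decode its JSON columns; None if it does not exist."""
        found = self.conn.execute(
            "SELECT * FROM cases WHERE case_id = ?", (case_id,)
        ).fetchone()
        if found is None:
            return None
        case = dict(found)
        for col in self.JSON_COLUMNS:
            case[col] = json.loads(case[col])
        return case


def backend_name() -> str:
    """Mirror the selection rule above: DATABASE_URL set means PostgreSQL, else SQLite."""
    return "postgresql" if os.environ.get("DATABASE_URL") else "sqlite"
```

Passing `":memory:"` as the path gives a throwaway database, which is handy for tests.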

Layer 4

Analysis

Filtering, facets, clustering, triage

# Tag intersection (topics, severity, platforms, investigation, relationship, custom)

cases = return_tagged_cases(all_cases, [

  {'tag': 'production', 'category': 'case_topics'},

  {'tag': 'infant', 'category': 'severity_indicators'},

])

# HTTP: /api/facet-tree, /api/facet-distinct, /api/facet-cohort-members

# (navigable tag combinations + cohort case IDs from live DB)

# Five cluster families + Jaccard “general”; /api/cluster-groups, /api/automated-analysis

triaged = triage_cases(cases)  # rule-based 5–10 score

# Triage ML: saved sklearn bundle; /api/triage-model-corpus (live DB), /api/triage-live

Filter cases sharing the same tags (intersection logic)
Facet tree APIs for exploration with processed features
Clustering and automated insights in analysis.py; optional ML triage alongside rules
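The intersection filter and the Jaccard measure behind the "general" clusters can be sketched in a few lines. The case layout used here (a `tags` dict mapping each category to a list of tags) is an assumption; the actual schema and the contents of analysis.py may differ.

```python
def case_tags(case: dict) -> set:
    """Flatten a case's tags into a set of (category, tag) pairs."""
    return {
        (category, tag)
        for category, tags in case.get("tags", {}).items()
        for tag in tags
    }


def return_tagged_cases(all_cases: list, required: list) -> list:
    """Keep only the cases that carry every requested tag (intersection logic)."""
    wanted = {(item["category"], item["tag"]) for item in required}
    return [case for case in all_cases if wanted <= case_tags(case)]


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Two cases that share one tag out of three distinct tags between them score 1/3, so a clustering pass could group cases whose pairwise score exceeds a chosen threshold.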

Layer 5

Visualization

Analyst UI: static pages + JSON APIs

# FastAPI serves visualization/*.html; browser fetches /api/* JSON.
  
# Examples: /api/cases (full row on demand), /api/case-count, /api/stats, /api/cluster-groups, /api/facet-tree, /api/triage-eval, /api/triage-live

# D3 (and small helpers) in-page: clusters, stats, search, audit trail, live triage, analysis facet explorer

Slim list endpoints where possible; click-through loads full case text from API
Same nav across pages; server-side caching where it helps (e.g. cases list, clusters)
Modular: new views can sit on the same APIs and storage
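The slim-list and caching ideas above could be approximated as follows. This sketch is framework-independent and leaves out the FastAPI route wiring. The field selection in `SLIM_FIELDS` and the `TTLCache` class are hypothetical, chosen only to illustrate the pattern.

```python
import time

# Hypothetical choice of fields that a list view needs; full text is left out
SLIM_FIELDS = ("case_id", "org_name", "topics", "platforms")


def slim_case(case: dict) -> dict:
    """Project a full case row down to the fields a list endpoint returns."""
    return {field: case.get(field) for field in SLIM_FIELDS}


class TTLCache:
    """Cache computed responses for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get_or_compute(self, key, compute):
        """Return the cached value for key, recomputing it once the entry has expired."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

A list endpoint would wrap its database query in `get_or_compute`, while the click-through endpoint would bypass the cache and return the full row.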