Appendix Technical Documentation

Methodology

Complete technical documentation of data sources, processing pipelines, model architectures, and analytical methods used in this investigation.

By Querex Data Science
January 2026

Data Sources

Source               Description              Records            Period
Boston Police Dept   Crime Incident Reports   848,051            Aug 2015 - Present
data.boston.gov      Open Data Portal         17 fields/record   Updated weekly

Data Processing Pipeline

Pipeline Architecture
1. INGESTION
   └── CSV files (crime_2015.csv ... crime_2023_present.csv)
   └── Concatenation: pd.concat([...], ignore_index=True)
   └── Deduplication: drop_duplicates(subset=['INCIDENT_NUMBER'])

2. CLEANING
   └── Missing coordinates: dropna(subset=['Lat', 'Long'])
   └── Invalid values: (Lat == 0) | (Lat == -1) → removed
   └── Date parsing: pd.to_datetime(OCCURRED_ON_DATE)
   └── Outlier detection: IQR method on coordinates

3. FEATURE ENGINEERING
   └── Temporal: hour, day_of_week, month, year
   └── Cyclical: hour_sin, hour_cos, month_sin, month_cos
   └── Geographic: lat, long, district_code
   └── Target: is_violent (binary classification)

4. SPLITTING
   └── Train: 70% stratified by district
   └── Validation: 15%
   └── Test: 15% (held out)
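
Steps 2 through 4 of the pipeline can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's code: the function names, the 1.5 × IQR multiplier, the `DISTRICT` column name, and the random seed are assumptions. The sketch also stratifies all three subsets by district, although the pipeline states this only for the training set.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame, iqr_k: float = 1.5) -> pd.DataFrame:
    """Step 2: drop missing or placeholder coordinates, parse dates, trim outliers."""
    df = df.dropna(subset=["Lat", "Long"])
    df = df[~df["Lat"].isin([0, -1])].copy()
    df["OCCURRED_ON_DATE"] = pd.to_datetime(df["OCCURRED_ON_DATE"])
    for col in ("Lat", "Long"):
        q1, q3 = df[col].quantile([0.25, 0.75])
        spread = iqr_k * (q3 - q1)  # assumed multiplier; not stated in the pipeline
        df = df[df[col].between(q1 - spread, q3 + spread)]
    return df


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: temporal columns plus their sine/cosine (cyclical) encodings."""
    df = df.copy()
    ts = df["OCCURRED_ON_DATE"]
    df["hour"] = ts.dt.hour
    df["day_of_week"] = ts.dt.dayofweek
    df["month"] = ts.dt.month
    df["year"] = ts.dt.year
    # Map each cycle onto the unit circle so that 23:00 and 00:00
    # (or December and January) end up as neighbouring points.
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    df["month_sin"] = np.sin(2 * np.pi * (df["month"] - 1) / 12)
    df["month_cos"] = np.cos(2 * np.pi * (df["month"] - 1) / 12)
    return df


def split(df: pd.DataFrame, by: str = "DISTRICT", seed: int = 42):
    """Step 4: 70/15/15 split with the same proportions inside every district.

    Assumes the district column has no missing values.
    """
    shuffled = df.sample(frac=1.0, random_state=seed)
    grouped = shuffled.groupby(by)
    rank = grouped.cumcount()               # position of the row within its district
    size = grouped[by].transform("size")    # number of rows in that district
    pct = rank * 100 // size                # integer percentile, 0..99
    train = shuffled[pct < 70]
    validation = shuffled[(pct >= 70) & (pct < 85)]
    test = shuffled[pct >= 85]
    return train, validation, test
```

Calling `split(add_features(clean(raw)))` on the concatenated, deduplicated frame would return the three subsets. Using integer percentiles avoids floating-point edge cases at the 70% and 85% boundaries.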

Machine Learning Models

LightGBM Configuration

Three boosting variants were evaluated:

Parameter          GBDT     DART     GOSS
boosting_type      gbdt     dart     goss
num_leaves         31       31       31
learning_rate      0.05     0.05     0.05
feature_fraction   0.9      0.9      0.9
bagging_fraction   0.8      0.8      N/A
drop_rate (DART)   N/A      0.1      N/A
top_rate (GOSS)    N/A      N/A      0.2
AUC-ROC            0.6484   0.6497   0.6318

Model Selection Rationale

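DART was selected even though its AUC gain over GBDT is marginal (0.6497 vs. 0.6484). Its dropout mechanism, which randomly drops a fraction of the existing trees at each boosting iteration (drop_rate = 0.1), is expected to generalize better to unseen data and to reduce the risk of overfitting to historical patterns that may not persist.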
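
For readers who want to reproduce the comparison, the three configurations in the table can be written as LightGBM parameter dictionaries. This is a sketch under stated assumptions: `objective`, `metric`, and `bagging_freq` do not appear in the table and are filled in here for illustration (LightGBM only applies `bagging_fraction` when `bagging_freq` is nonzero), and `build_params` is a hypothetical helper name.

```python
# Settings shared by all three variants, copied from the configuration table.
BASE_PARAMS = {
    "objective": "binary",   # assumption: implied by the binary is_violent target
    "metric": "auc",         # assumption: matches the reported AUC-ROC
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
}

# Variant-specific settings. bagging_freq=1 is an assumption (see lead-in).
VARIANTS = {
    "GBDT": {"boosting_type": "gbdt", "bagging_fraction": 0.8, "bagging_freq": 1},
    "DART": {"boosting_type": "dart", "bagging_fraction": 0.8, "bagging_freq": 1,
             "drop_rate": 0.1},
    # GOSS does its own gradient-based sampling, so no bagging settings here.
    "GOSS": {"boosting_type": "goss", "top_rate": 0.2},
}


def build_params(variant: str) -> dict:
    """Merge the shared settings with those of one boosting variant."""
    return {**BASE_PARAMS, **VARIANTS[variant]}
```

A dictionary such as `build_params("DART")` could then be passed to `lightgbm.train` together with a training `Dataset` and the validation set used for the AUC comparison.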

Network Analysis

iGraph Implementation

R / igraph
library(igraph)

# Create district adjacency graph
# Assumes the JSON file is an array of edge records whose first two
# fields name the adjacent districts
edges <- jsonlite::fromJSON("data/boston_district_network.json")
g <- graph_from_data_frame(edges, directed = FALSE)

# Network metrics
vcount(g)               # 12 vertices (districts)
ecount(g)               # 20 edges (adjacencies)
edge_density(g)         # 0.303
transitivity(g)         # 0.367 (clustering coefficient)

# Centrality measures
degree(g, normalized = TRUE)
betweenness(g, normalized = TRUE)
page_rank(g)$vector

# Community detection
communities <- cluster_louvain(g)
modularity(communities) # 0.412
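
The reported density can be checked by hand from the vertex and edge counts alone: an undirected simple graph on n vertices has n(n-1)/2 possible edges. The short Python check below uses only the two counts given in the listing; the mean degree it prints is derived here and is not a figure reported by the analysis.

```python
def undirected_density(num_vertices: int, num_edges: int) -> float:
    """Share of possible vertex pairs that are joined by an edge."""
    possible_edges = num_vertices * (num_vertices - 1) / 2
    return num_edges / possible_edges


density = undirected_density(12, 20)   # 20 of 66 possible adjacencies
mean_degree = 2 * 20 / 12              # each edge contributes to two districts
print(round(density, 3), round(mean_degree, 2))  # 0.303 3.33
```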

Optimization Solver

OR-Tools VRP Configuration

Parameter         Value                 Description
Vehicles          3                     Patrol units available
Locations         33                    High-risk nodes to visit
Depot             Central Station       Start/end point
First Solution    PATH_CHEAPEST_ARC     Greedy initialization
Metaheuristic     GUIDED_LOCAL_SEARCH   Escape local optima
Time Limit        30 seconds            Solve time budget
Solution Status   OPTIMAL               Proven optimal

3D Visualization

Blender Pipeline

3D assets were created using Blender MCP tools and exported as binary glTF (.glb) files for web rendering:

Asset                  Technique                Data Source
boston-globe.glb       UV Sphere + Wireframe    District boundaries
district-network.glb   Graph nodes/edges        iGraph network
feature-crystal.glb    Icosphere + Emission     Feature importance
time-wave.glb          Displaced plane          Hourly crime counts
patrol-routes.glb      Bezier curves + Points   VRP solution

Reproducibility

Environment

requirements.txt
# Python 3.11 (interpreter version; not installable via pip)
pandas==2.0.3
numpy==1.24.3
lightgbm==4.1.0
ortools==9.7.2996
scikit-learn==1.3.0
matplotlib==3.8.0
igraph==0.11.3
# Blender 4.0 (external application, driven via MCP)

Data Access

All source data is publicly available from the City of Boston Open Data Portal:

https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system

Limitations

Known Limitations

Ethical Statement

This analysis was conducted for educational and research purposes. Predictive policing technologies carry significant risks of perpetuating historical biases and must be deployed with extensive community oversight, transparency, and accountability mechanisms.

We explicitly excluded demographic features from our models to avoid direct discrimination, but acknowledge that geographic features can serve as proxies for protected characteristics.