Methodology | The Pulse of Boston

Data Sources

Source	Description	Records	Period
Boston Police Dept	Crime Incident Reports	848,051	Aug 2015 - Present
data.boston.gov	Open Data Portal	17 fields/record	Updated weekly

Data Processing Pipeline

Pipeline Architecture

1. INGESTION
   └── CSV files (crime_2015.csv ... crime_2023_present.csv)
   └── Concatenation: pd.concat([...], ignore_index=True)
   └── Deduplication: drop_duplicates(subset=['INCIDENT_NUMBER'])

2. CLEANING
   └── Missing coordinates: dropna(subset=['Lat', 'Long'])
   └── Invalid values: (Lat == 0) | (Lat == -1) → removed
   └── Date parsing: pd.to_datetime(OCCURRED_ON_DATE)
   └── Outlier detection: IQR method on coordinates

3. FEATURE ENGINEERING
   └── Temporal: hour, day_of_week, month, year
   └── Cyclical: hour_sin, hour_cos, month_sin, month_cos
   └── Geographic: lat, long, district_code
   └── Target: is_violent (binary classification)

4. SPLITTING
   └── Train: 70% stratified by district
   └── Validation: 15%
   └── Test: 15% (held out)
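
The cleaning, feature engineering, and splitting stages above can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the helper names (within_iqr, clean, add_time_features, split_by_district), the DISTRICT column name, the random seed, and the 1.5 x IQR fence are all hypothetical, since the source does not state the IQR multiplier or how the split was implemented.

```python
import numpy as np
import pandas as pd


def within_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values inside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumption)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    spread = q3 - q1
    return series.between(q1 - k * spread, q3 + k * spread)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate incidents, drop missing/placeholder coordinates, parse dates."""
    df = df.drop_duplicates(subset=["INCIDENT_NUMBER"])
    df = df.dropna(subset=["Lat", "Long"])
    df = df[~((df["Lat"] == 0) | (df["Lat"] == -1))]
    df = df[within_iqr(df["Lat"]) & within_iqr(df["Long"])].copy()
    df["OCCURRED_ON_DATE"] = pd.to_datetime(df["OCCURRED_ON_DATE"])
    return df.reset_index(drop=True)


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add hour/month plus sine-cosine encodings so 23:00 sits next to 00:00."""
    out = df.copy()
    out["hour"] = out["OCCURRED_ON_DATE"].dt.hour
    out["month"] = out["OCCURRED_ON_DATE"].dt.month
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    # Months run 1..12, so shift to 0..11 before mapping onto the circle.
    out["month_sin"] = np.sin(2 * np.pi * (out["month"] - 1) / 12)
    out["month_cos"] = np.cos(2 * np.pi * (out["month"] - 1) / 12)
    return out


def split_by_district(df: pd.DataFrame, seed: int = 42):
    """70/15/15 split, sampling within each district (column name is assumed)."""
    # groupby ignores missing keys, so rows without a district are dropped first.
    df = df.dropna(subset=["DISTRICT"])
    train = df.groupby("DISTRICT").sample(frac=0.70, random_state=seed)
    rest = df.drop(train.index)
    val = rest.groupby("DISTRICT").sample(frac=0.50, random_state=seed)
    test = rest.drop(val.index)
    return train, val, test
```

Sampling within each district keeps the district mix roughly equal across the three sets. scikit-learn's train_test_split with its stratify argument would be an equally reasonable way to do the same thing.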

Machine Learning Models

LightGBM Configuration

Three boosting variants were evaluated:

Parameter	GBDT	DART	GOSS
boosting_type	gbdt	dart	goss
num_leaves	31	31	31
learning_rate	0.05	0.05	0.05
feature_fraction	0.9	0.9	0.9
bagging_fraction	0.8	0.8	N/A
drop_rate (DART)	N/A	0.1	N/A
top_rate (GOSS)	N/A	N/A	0.2
AUC-ROC	0.6484	0.6497	0.6318
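
As a rough illustration of how the table maps onto LightGBM parameter dictionaries, the sketch below builds one dictionary per variant. The objective and metric entries, the helper names build_params and train_variant, and the num_boost_round default are assumptions not stated in the source. LightGBM is imported lazily so the dictionaries can be inspected without it installed.

```python
# Settings shared by all three variants (values from the table above).
# "objective" and "metric" are assumptions based on the binary is_violent target.
BASE_PARAMS = {
    "objective": "binary",
    "metric": "auc",
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
}

# Variant-specific settings; GOSS does its own sampling, so it has no bagging_fraction.
VARIANTS = {
    "gbdt": {"boosting_type": "gbdt", "bagging_fraction": 0.8},
    "dart": {"boosting_type": "dart", "bagging_fraction": 0.8, "drop_rate": 0.1},
    "goss": {"boosting_type": "goss", "top_rate": 0.2},
}


def build_params(variant: str) -> dict:
    """Merge the shared settings with those of one boosting variant."""
    return {**BASE_PARAMS, **VARIANTS[variant]}


def train_variant(variant, X_train, y_train, X_val, y_val, num_boost_round=200):
    """Train one variant and return the booster (num_boost_round is an assumption)."""
    import lightgbm as lgb  # lazy import: only needed when actually training

    train_set = lgb.Dataset(X_train, label=y_train)
    val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)
    return lgb.train(
        build_params(variant),
        train_set,
        num_boost_round=num_boost_round,
        valid_sets=[val_set],
    )
```

Note that in LightGBM bagging_fraction only takes effect when bagging_freq is set to a non-zero value. The table does not list bagging_freq, so it is left out here rather than guessed.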

Model Selection Rationale

DART was selected even though its AUC gain over GBDT is marginal (0.6497 vs. 0.6484). Its dropout mechanism regularizes the ensemble, which is expected to generalize better to unseen data and to reduce the risk of overfitting to historical patterns that may not persist.

Network Analysis

iGraph Implementation

R / igraph

library(igraph)

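# Create district adjacency graph
# Assumption: the JSON file is an array of edge records whose first two fields are the endpoints
edges <- jsonlite::fromJSON("data/boston_district_network.json")
g <- graph_from_data_frame(edges, directed = FALSE)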

# Network metrics
vcount(g)               # 12 vertices (districts)
ecount(g)               # 20 edges (adjacencies)
edge_density(g)         # 0.303
transitivity(g)         # 0.367 (clustering coefficient)

# Centrality measures
degree(g, normalized = TRUE)
betweenness(g, normalized = TRUE)
page_rank(g)$vector

# Community detection
communities <- cluster_louvain(g)
modularity(communities) # 0.412
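
The density figure reported above can be checked by hand from the vertex and edge counts: an undirected simple graph with n vertices has at most n(n-1)/2 edges, and density is the share of those that are present. The short sketch below uses a hypothetical helper name, undirected_density, to redo that arithmetic.

```python
def undirected_density(n_vertices: int, n_edges: int) -> float:
    """Fraction of possible edges present in an undirected simple graph."""
    max_edges = n_vertices * (n_vertices - 1) / 2
    return n_edges / max_edges


# 12 districts and 20 adjacencies: 20 / 66
print(round(undirected_density(12, 20), 3))  # 0.303
```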

Optimization Solver

OR-Tools VRP Configuration

Parameter	Value	Description
Vehicles	3	Patrol units available
Locations	33	High-risk nodes to visit
Depot	Central Station	Start/end point
First Solution	PATH_CHEAPEST_ARC	Greedy initialization
Metaheuristic	GUIDED_LOCAL_SEARCH	Escape local optima
Time Limit	30 seconds	Solve time budget
Solution Status	OPTIMAL	Proven optimal
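
The configuration above corresponds roughly to the following OR-Tools routing setup. This is a sketch under assumptions rather than the project's solver code: the function name solve_patrol_vrp, the integer distance_matrix input, and the depot sitting at index 0 are hypothetical, and no capacity or route-length constraints are modeled.

```python
def solve_patrol_vrp(distance_matrix, num_vehicles=3, depot=0, time_limit_s=30):
    """Return one node sequence per vehicle, or None if no solution is found.

    distance_matrix is assumed to be a square list of lists of integer costs.
    """
    from ortools.constraint_solver import pywrapcp, routing_enums_pb2

    manager = pywrapcp.RoutingIndexManager(len(distance_matrix), num_vehicles, depot)
    routing = pywrapcp.RoutingModel(manager)

    def distance_callback(from_index, to_index):
        # Convert solver indices back to matrix node ids before the lookup.
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return distance_matrix[from_node][to_node]

    transit_index = routing.RegisterTransitCallback(distance_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_index)

    params = pywrapcp.DefaultRoutingSearchParameters()
    params.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
    )
    params.local_search_metaheuristic = (
        routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH
    )
    params.time_limit.seconds = time_limit_s

    solution = routing.SolveWithParameters(params)
    if solution is None:
        return None

    routes = []
    for vehicle in range(num_vehicles):
        index = routing.Start(vehicle)
        route = []
        while not routing.IsEnd(index):
            route.append(manager.IndexToNode(index))
            index = solution.Value(routing.NextVar(index))
        route.append(manager.IndexToNode(index))  # closing depot visit
        routes.append(route)
    return routes
```

Because only total distance is minimized here, the solver is free to give most stops to a single vehicle. Balancing the three patrol units would need an extra dimension, for example a cap on per-route distance. Guided local search also tends to use the full time budget, since it keeps searching for improvements until the limit is reached.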

3D Visualization

Blender Pipeline

3D assets were created using Blender MCP tools and exported as glTF for web rendering:

Asset	Technique	Data Source
boston-globe.glb	UV Sphere + Wireframe	District boundaries
district-network.glb	Graph nodes/edges	iGraph network
feature-crystal.glb	Icosphere + Emission	Feature importance
time-wave.glb	Displaced plane	Hourly crime counts
patrol-routes.glb	Bezier curves + Points	VRP solution

Reproducibility

Environment

requirements.txt

# python 3.11 (interpreter version; not a pip-installable requirement)
pandas==2.0.3
numpy==1.24.3
lightgbm==4.1.0
ortools==9.7.2996
scikit-learn==1.3.0
matplotlib==3.8.0
igraph==0.11.3
# blender 4.0 via MCP (installed separately, not through pip)

Data Access

All source data is publicly available from the City of Boston Open Data Portal:

https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system

Limitations

Known Limitations

Reporting bias: Data reflects reported crimes only; unreported incidents are not captured
Temporal scope: Patterns from 2015-2023 may not generalize to future periods
Spatial resolution: Coordinates are approximate; privacy protections may introduce location noise
Model drift: ML model requires periodic retraining as crime patterns evolve
VRP simplification: Real-world routing requires additional constraints not modeled here

Ethical Statement

This analysis was conducted for educational and research purposes. Predictive policing technologies carry significant risks of perpetuating historical biases and must be deployed with extensive community oversight, transparency, and accountability mechanisms.

We explicitly excluded demographic features from our models to avoid direct discrimination, but acknowledge that geographic features can serve as proxies for protected characteristics.