Data Sources
| Source | Description | Records | Period |
|---|---|---|---|
| Boston Police Dept | Crime Incident Reports | 848,051 | Aug 2015 - Present |
| data.boston.gov | Open Data Portal | 17 fields/record | Updated weekly |
Data Processing Pipeline
1. INGESTION
└── CSV files (crime_2015.csv ... crime_2023_present.csv)
└── Concatenation: pd.concat([...], ignore_index=True)
└── Deduplication: drop_duplicates(subset=['INCIDENT_NUMBER'])
2. CLEANING
└── Missing coordinates: dropna(subset=['Lat', 'Long'])
└── Invalid values: (Lat == 0) | (Lat == -1) → removed
└── Date parsing: pd.to_datetime(OCCURRED_ON_DATE)
└── Outlier detection: IQR method on coordinates
3. FEATURE ENGINEERING
└── Temporal: hour, day_of_week, month, year
└── Cyclical: hour_sin, hour_cos, month_sin, month_cos
└── Geographic: lat, long, district_code
└── Target: is_violent (binary classification)
4. SPLITTING
└── Train: 70% stratified by district
└── Validation: 15%
└── Test: 15% (held out)
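The cyclical encoding, IQR outlier fences, and stratified 70/15/15 split listed above can be sketched in dependency-free Python. This is an illustration of the arithmetic, not the project's pandas code: the function names, the 1.5 IQR multiplier, the rounding rule for group sizes, and the seed are all assumptions that the source does not state.

```python
import math
import random
import statistics
from collections import defaultdict


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour, period 24) onto the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional multiplier, assumed here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def stratified_split(rows, key, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split rows into train/validation/test within each `key` group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    train, val, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train.extend(members[:n_train])
        val.extend(members[n_train:n_train + n_val])
        test.extend(members[n_train + n_val:])  # remainder is held out
    return train, val, test


# Example: hour 23 and hour 0 end up adjacent on the circle.
hour_sin, hour_cos = cyclical_encode(23, period=24)
```

The point of the sine/cosine pair is that hour 23 is as close to hour 0 as hour 0 is to hour 1, which a raw integer hour feature does not express. Note that `statistics.quantiles` defaults to the "exclusive" quartile method, so its fences can differ slightly from those computed with pandas' default interpolation.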
Machine Learning Models
LightGBM Configuration
Three boosting variants were evaluated:
| Parameter | GBDT | DART | GOSS |
|---|---|---|---|
| boosting_type | gbdt | dart | goss |
| num_leaves | 31 | 31 | 31 |
| learning_rate | 0.05 | 0.05 | 0.05 |
| feature_fraction | 0.9 | 0.9 | 0.9 |
| bagging_fraction | 0.8 | 0.8 | N/A |
| drop_rate (DART) | N/A | 0.1 | N/A |
| top_rate (GOSS) | N/A | N/A | 0.2 |
| AUC-ROC | 0.6484 | 0.6497 | 0.6318 |
DART was selected even though its AUC-ROC gain over GBDT is marginal (0.6497 vs. 0.6484). The rationale is that DART's tree-dropout mechanism acts as a regularizer, which is expected to generalize better to unseen data and reduce the risk of overfitting to historical patterns that may not persist.
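The three configurations in the table can be expressed as one shared parameter dictionary plus per-variant overrides. The sketch below is a plain-Python illustration; `BASE_PARAMS`, `VARIANT_PARAMS`, and `build_params` are hypothetical names, and the `objective` and `metric` entries are inferred from the binary `is_violent` target and the AUC-ROC row rather than stated in the table.

```python
# Settings shared by all three variants (values from the table above).
BASE_PARAMS = {
    "objective": "binary",  # inferred: is_violent is a binary target
    "metric": "auc",        # inferred: AUC-ROC is the reported metric
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
}

# Settings that differ per boosting variant.
VARIANT_PARAMS = {
    "gbdt": {"bagging_fraction": 0.8},
    "dart": {"bagging_fraction": 0.8, "drop_rate": 0.1},
    "goss": {"top_rate": 0.2},  # GOSS does its own sampling, so no bagging
}


def build_params(boosting_type):
    """Return the LightGBM parameter dict for one boosting variant."""
    if boosting_type not in VARIANT_PARAMS:
        raise ValueError(f"unknown boosting type: {boosting_type!r}")
    return {
        **BASE_PARAMS,
        "boosting_type": boosting_type,
        **VARIANT_PARAMS[boosting_type],
    }


dart_params = build_params("dart")
```

A dictionary like `dart_params` could then be passed to `lightgbm.train(params, train_set)`. One caveat: LightGBM only applies `bagging_fraction` when `bagging_freq` is set to a positive value, and the table does not list `bagging_freq`, so it is left out here.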
Network Analysis
iGraph Implementation
library(igraph)
# Create district adjacency graph
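# read.csv() cannot parse JSON; this assumes the file is an array of edge records
# whose first two fields are the endpoint districts
edges <- jsonlite::fromJSON("data/boston_district_network.json")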
g <- graph_from_data_frame(edges, directed = FALSE)
# Network metrics
vcount(g) # 12 vertices (districts)
ecount(g) # 20 edges (adjacencies)
edge_density(g) # 0.303
transitivity(g) # 0.367 (clustering coefficient)
# Centrality measures
degree(g, normalized = TRUE)
betweenness(g, normalized = TRUE)
page_rank(g)$vector
# Community detection
communities <- cluster_louvain(g)
modularity(communities) # 0.412
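The reported density follows directly from the vertex and edge counts, which makes it easy to sanity-check by hand. The Python sketch below recomputes it for a simple undirected graph; the function names are illustrative and not part of the igraph API.

```python
def edge_density(n_vertices, n_edges):
    """Density of a simple undirected graph: edges present / edges possible."""
    possible = n_vertices * (n_vertices - 1) / 2
    return n_edges / possible


def mean_degree(n_vertices, n_edges):
    """Average degree; each undirected edge touches two vertices."""
    return 2 * n_edges / n_vertices


density = edge_density(12, 20)     # 20 of 66 possible adjacencies
avg_degree = mean_degree(12, 20)   # 40 edge endpoints over 12 districts
print(round(density, 3), round(avg_degree, 2))
```

With 12 districts and 20 adjacencies this gives 20 / 66 ≈ 0.303, matching the value above, and an average of about 3.33 neighbouring districts per district.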
Optimization Solver
OR-Tools VRP Configuration
| Parameter | Value | Description |
|---|---|---|
| Vehicles | 3 | Patrol units available |
| Locations | 33 | High-risk nodes to visit |
| Depot | Central Station | Start/end point |
| First Solution | PATH_CHEAPEST_ARC | Greedy initialization |
| Metaheuristic | GUIDED_LOCAL_SEARCH | Escape local optima |
| Time Limit | 30 seconds | Solve time budget |
| Solution Status | OPTIMAL | Reported solver status; guided local search is a heuristic and does not by itself prove global optimality |
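The settings in the table map onto the OR-Tools routing API roughly as in the configuration sketch below. The helper names, the `distance_matrix` argument (a square matrix of integer travel costs), and the choice of node 0 as the Central Station depot are assumptions for illustration; the source does not describe how costs between the 33 locations were computed.

```python
from ortools.constraint_solver import pywrapcp, routing_enums_pb2


def build_search_parameters(time_limit_s=30):
    """Search settings from the table: greedy start, then guided local search."""
    params = pywrapcp.DefaultRoutingSearchParameters()
    params.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
    )
    params.local_search_metaheuristic = (
        routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH
    )
    params.time_limit.seconds = time_limit_s
    return params


def solve_patrol_routes(distance_matrix, num_vehicles=3, depot=0):
    """Build and solve a basic VRP; depot index 0 is an assumed placeholder."""
    manager = pywrapcp.RoutingIndexManager(
        len(distance_matrix), num_vehicles, depot
    )
    routing = pywrapcp.RoutingModel(manager)

    def distance_callback(from_index, to_index):
        # Convert solver indices back to location indices before lookup.
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return distance_matrix[from_node][to_node]

    transit_index = routing.RegisterTransitCallback(distance_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_index)

    solution = routing.SolveWithParameters(build_search_parameters())
    return manager, routing, solution
```

With a metaheuristic enabled, the solver keeps searching until the time limit expires, so a call with these settings takes the full 30 seconds. This minimal model only minimizes total arc cost; balancing work across the three patrol units would need an additional dimension or constraint that is not shown here.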
3D Visualization
Blender Pipeline
3D assets were created using Blender MCP tools and exported as glTF for web rendering:
| Asset | Technique | Data Source |
|---|---|---|
| boston-globe.glb | UV Sphere + Wireframe | District boundaries |
| district-network.glb | Graph nodes/edges | iGraph network |
| feature-crystal.glb | Icosphere + Emission | Feature importance |
| time-wave.glb | Displaced plane | Hourly crime counts |
| patrol-routes.glb | Bezier curves + Points | VRP solution |
Reproducibility
Environment
python==3.11
pandas==2.0.3
numpy==1.24.3
lightgbm==4.1.0
ortools==9.7.2996
scikit-learn==1.3.0
matplotlib==3.8.0
igraph==0.11.3
blender==4.0 # via MCP
Data Access
All source data is publicly available from the City of Boston Open Data Portal:
https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system
Limitations
- Reporting bias: Data reflects reported crimes only; unreported incidents are not captured
- Temporal scope: Patterns from 2015-2023 may not generalize to future periods
- Spatial resolution: Coordinates are approximate; privacy protections may introduce location noise
- Model drift: ML model requires periodic retraining as crime patterns evolve
- VRP simplification: Real-world routing requires additional constraints not modeled here
Ethical Statement
This analysis was conducted for educational and research purposes. Predictive policing technologies carry significant risks of perpetuating historical biases and must be deployed with extensive community oversight, transparency, and accountability mechanisms.
We explicitly excluded demographic features from our models to avoid direct discrimination, but acknowledge that geographic features can serve as proxies for protected characteristics.