4 Commits

Author  SHA1        Message                                 Date
nq      f576de2f2d  rnd-v1.2                                2026-05-03 22:00:06 +03:00
nq      50185a8050  rnd-v1.1                                2026-05-03 21:59:22 +03:00
nq      e45d1cb6b4  rnd-v1                                  2026-05-03 21:58:47 +03:00
nq      fd44235ff4  init-research (Signed-off-by: nq <nq>)  2026-05-03 21:49:36 +03:00
21 changed files with 1711 additions and 1118 deletions

199
README.md Normal file

@@ -0,0 +1,199 @@
# BGtopoVJ Blue Rectangle/Square Detection PoC

This is a practical first-pass pipeline for finding blue/light-blue square and rectangle symbols in BGtopoVJ raster maps, then using those detections to score a coordinate dataset and bootstrap a YOLO detector.

The PoC is intentionally hybrid:

1. Download original BGtopoVJ `*.tif` + `*.map` sheet pairs.
2. Open the raster through GDAL/Rasterio, preferring the OziExplorer `.map` sidecar when available.
3. Mine weak candidates using OpenCV HSV thresholding + contour/rectangle filters.
4. Generate QA overlays and HTML report.
5. Score your known coordinates against nearby candidates.
6. Export weak labels into YOLO format.
7. Train a first YOLO model on your RTX 3080 FE after you review/clean the weak labels.

This is not meant to be a final truth engine on day one. It is meant to rapidly produce reviewable candidates, hard negatives, and a training set.

---

## Hardware fit

Your RTX 3080 FE is enough for the first detector. Start with:

- `yolov8s.pt`
- `imgsz=1024`
- `batch=2` or `batch=4`
- `epochs=80`

If you hit CUDA OOM, lower batch first. Do not lower image size below 896 too early, because the target symbols are small.

16 GB system RAM is tight for country-scale processing, but fine for per-sheet scanning. Avoid loading the whole corpus at once. This PoC scans by windows.
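To put rough numbers on the windowed approach, the sketch below compares the memory footprint of one uncompressed RGB window against a whole sheet held in memory. The 1024 px window mirrors the tile size used elsewhere in this README; the 12000 x 10000 px sheet size is a made-up placeholder, not a measured BGtopoVJ dimension, and the helper names are illustrative only.

```python
def rgb_bytes(width: int, height: int) -> int:
    """Bytes needed for an uncompressed 8-bit RGB array of the given size."""
    return width * height * 3


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


# Hypothetical sheet size, for illustration only.
SHEET_W, SHEET_H = 12000, 10000
WINDOW = 1024

window_mib = to_mib(rgb_bytes(WINDOW, WINDOW))
sheet_mib = to_mib(rgb_bytes(SHEET_W, SHEET_H))

print(f"one window:  {window_mib:.1f} MiB")
print(f"whole sheet: {sheet_mib:.1f} MiB")
```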

---

## Install locally

### Linux / WSL / Manjaro-like
```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
GDAL/Rasterio can be the annoying part. If `rasterio.open()` fails on a `.map` file, install GDAL from your OS package manager, or use the Docker option below.
### GPU Docker option
```bash
docker compose -f docker-compose.gpu.yml build
docker compose -f docker-compose.gpu.yml run --rm bgtopo-bluebox bash
```
Inside the container:
```bash
./scripts/run_pilot.sh
```
---
## Run the pilot
```bash
./scripts/run_pilot.sh
```
This downloads only two sheets, scans them, writes candidate CSV files, draws overlays, and builds:
```text
reports/poc_report.html
reports/overlays/*.png
data/interim/candidates/*_candidates.csv
```
Inspect the overlays. If too many rivers/text labels are detected, tighten `configs/blue_detector.yaml`. If real blue rectangles are missed, loosen the HSV ranges and size filters.
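When tuning the HSV ranges, it can help to check by hand where a sampled map colour lands on OpenCV's 8-bit HSV scale (H in 0..179, S and V in 0..255). The sketch below does this with the standard library only; the `LOWER`/`UPPER` bounds are hypothetical and are not the values shipped in `configs/blue_detector.yaml`, and the conversion is an approximation that may differ from `cv2.cvtColor` by a rounding step.

```python
import colorsys


def rgb_to_opencv_hsv(r: int, g: int, b: int) -> tuple:
    """Approximate OpenCV's 8-bit HSV scale: H in 0..179, S and V in 0..255."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return int(round(h * 180)) % 180, int(round(s * 255)), int(round(v * 255))


def in_range(hsv, lower, upper) -> bool:
    """Channel-wise inclusive bounds check, like cv2.inRange for a single pixel."""
    return all(lo <= c <= hi for c, lo, hi in zip(hsv, lower, upper))


# Hypothetical range for illustration; real values live in the YAML config.
LOWER, UPPER = (90, 60, 80), (130, 255, 255)

samples = {"light blue": (90, 160, 230), "map black": (20, 20, 20)}
for name, rgb in samples.items():
    hsv = rgb_to_opencv_hsv(*rgb)
    verdict = "inside" if in_range(hsv, LOWER, UPPER) else "outside"
    print(name, hsv, verdict)
```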

---

## Manual one-sheet run

```bash
python -m bgtopo_poc.cli inventory \
--config configs/blue_detector.yaml \
--out data/manifest.csv \
--limit 1
python -m bgtopo_poc.cli download \
--manifest data/manifest.csv \
--out-dir data/raw \
--out-manifest data/manifest_downloaded.csv \
--limit 1
python -m bgtopo_poc.cli detect \
--config configs/blue_detector.yaml \
--sheet-id K-34-009-2 \
--map data/raw/K-34-009-2/K-34-009-2.map \
--tif data/raw/K-34-009-2/K-34-009-2.tif \
--out-dir data/interim/candidates
python -m bgtopo_poc.cli overlay \
--tif data/raw/K-34-009-2/K-34-009-2.tif \
--candidates data/interim/candidates/K-34-009-2_candidates.csv \
--out reports/overlays/K-34-009-2_overlay.png
```
---
## Score your 60k coordinates
Expected coordinate CSV columns:
```csv
id,lat,lon,expected
pt001,42.58837223,23.19638729,unknown
```
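Before scoring a large file, a quick pre-check of the CSV can save a wasted run. The sketch below mirrors the column aliases accepted by `load_coordinates` in `bgtopo_poc/coordinates.py` (`lat`/`latitude`/`y` and `lon`/`lng`/`longitude`/`x`) and counts rows whose coordinates would not parse. `check_coordinate_csv` is a hypothetical helper, not part of the package, and it is slightly more lenient than the pandas-based loader (for example, a literal `nan` passes here).

```python
import csv
import io

LAT_NAMES = ("lat", "latitude", "y")
LON_NAMES = ("lon", "lng", "longitude", "x")


def check_coordinate_csv(text: str) -> dict:
    """Report which columns would be used and how many rows have unusable coordinates."""
    reader = csv.DictReader(io.StringIO(text))
    cols = {c.lower().strip(): c for c in (reader.fieldnames or [])}
    lat_col = next((cols[n] for n in LAT_NAMES if n in cols), None)
    lon_col = next((cols[n] for n in LON_NAMES if n in cols), None)
    if lat_col is None or lon_col is None:
        raise ValueError("need lat/lon, latitude/longitude, or y/x columns")
    usable = dropped = 0
    for row in reader:
        try:
            float(row[lat_col])
            float(row[lon_col])
            usable += 1
        except (TypeError, ValueError):
            dropped += 1
    return {"lat": lat_col, "lon": lon_col, "usable": usable, "dropped": dropped}


sample = (
    "id,lat,lon,expected\n"
    "pt001,42.58837223,23.19638729,unknown\n"
    "pt002,,23.2,unknown\n"
)
print(check_coordinate_csv(sample))
```

Rows counted as `dropped` correspond to the ones the loader would discard after coercing non-numeric values.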
Then run:
```bash
python -m bgtopo_poc.cli score-coords \
--config configs/blue_detector.yaml \
--sheet-id K-34-009-2 \
--coordinates data/coordinates/your_60k_points.csv \
--candidates data/interim/candidates/K-34-009-2_candidates.csv \
--map data/raw/K-34-009-2/K-34-009-2.map \
--tif data/raw/K-34-009-2/K-34-009-2.tif \
--out-dir data/interim/coordinate_scores \
--coord-crs EPSG:4326
```
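For orientation, the scoring in `score_coordinates_for_sheet` blends the nearest candidate's detector score with its pixel distance, then applies two thresholds. The sketch below restates that arithmetic as standalone functions; the defaults (45 px radius, 0.90 strong, 0.40 weak) are the fallbacks found in `bgtopo_poc/coordinates.py`, and the values in your `configs/blue_detector.yaml` may differ. The function names are illustrative.

```python
def coordinate_score(det_score: float, dist_px: float, radius_px: float = 45.0) -> float:
    """Blend detector score and proximity; candidates beyond the radius score 0."""
    if dist_px > radius_px:
        return 0.0
    dist_factor = max(0.0, 1.0 - dist_px / radius_px)
    return 0.55 * det_score + 0.45 * dist_factor


def decide(score: float, strong: float = 0.90, weak: float = 0.40) -> str:
    """Map a blended score onto the three decision labels."""
    if score >= strong:
        return "auto_positive"
    if score >= weak:
        return "review"
    return "auto_negative"


for det, dist in [(0.95, 2.0), (0.70, 20.0), (0.60, 60.0)]:
    s = coordinate_score(det, dist)
    print(f"det={det:.2f} dist={dist:.0f}px -> {s:.3f} {decide(s)}")
```

Because proximity contributes at most 0.45, a point can only reach `auto_positive` when the nearby candidate itself scores highly and sits almost exactly on the coordinate.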
Extract review crops for predicted positives/review cases:
```bash
python -m bgtopo_poc.cli crops \
--scores data/interim/coordinate_scores/K-34-009-2_coordinate_scores.csv \
--map data/raw/K-34-009-2/K-34-009-2.map \
--tif data/raw/K-34-009-2/K-34-009-2.tif \
--out-dir data/interim/crops/K-34-009-2 \
--crop-size 256
```
Important: The PoC currently scores coordinates sheet-by-sheet. The next production step is assigning every point to the right sheet footprint automatically. This requires confirming that `.map` georeferencing opens correctly on your system.
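One possible shape for that assignment step, once sheet extents are known, is a plain bounding-box lookup. This is only a sketch under assumptions: the `Footprint` type, the `assign_sheet` helper, the sheet names, and the extents below are all made up for illustration, and real footprints would have to be derived from each sheet's georeferencing.

```python
from typing import List, NamedTuple, Optional


class Footprint(NamedTuple):
    sheet_id: str
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def assign_sheet(lon: float, lat: float, footprints: List[Footprint]) -> Optional[str]:
    """Return the first sheet whose bounding box contains the point, else None."""
    for fp in footprints:
        if fp.min_lon <= lon < fp.max_lon and fp.min_lat <= lat < fp.max_lat:
            return fp.sheet_id
    return None


# Made-up extents for illustration only.
FOOTPRINTS = [
    Footprint("sheet-A", 23.00, 42.50, 23.25, 42.6667),
    Footprint("sheet-B", 23.25, 42.50, 23.50, 42.6667),
]

print(assign_sheet(23.19638729, 42.58837223, FOOTPRINTS))
print(assign_sheet(25.0, 41.0, FOOTPRINTS))
```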

---

## Export YOLO dataset

After reviewing/correcting candidates, export YOLO tiles:
```bash
python -m bgtopo_poc.cli export-yolo \
--config configs/blue_detector.yaml \
--sheet-id K-34-009-2 \
--tif data/raw/K-34-009-2/K-34-009-2.tif \
--candidates data/interim/candidates/K-34-009-2_candidates.csv \
--out-dir data/yolo/K-34-009-2 \
--tile-size 1024 \
--overlap 128
```
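To make `--tile-size` and `--overlap` concrete, the sketch below computes tile start offsets along one axis and converts a pixel box into a tile-relative YOLO label line. It is a simplified illustration, not the exporter itself: `tile_origins` and `yolo_line` are hypothetical names, pinning a last tile to the far edge is one assumed way of covering the sheet border, and the box is assumed to lie fully inside the tile (the real exporter also clips partially visible boxes).

```python
def tile_origins(length: int, tile_size: int = 1024, overlap: int = 128) -> list:
    """Start offsets along one axis; a last tile is pinned to the far edge."""
    step = max(1, tile_size - overlap)
    last = max(0, length - tile_size)
    origins = list(range(0, last + 1, step))
    if origins[-1] != last:
        origins.append(last)
    return origins


def yolo_line(cls: int, x: float, y: float, w: float, h: float,
              x0: int, y0: int, tile_size: int = 1024) -> str:
    """Convert a pixel box (top-left x, y, width, height) into a YOLO label line."""
    cx = (x + w / 2 - x0) / tile_size
    cy = (y + h / 2 - y0) / tile_size
    return f"{cls} {cx:.6f} {cy:.6f} {w / tile_size:.6f} {h / tile_size:.6f}"


print(tile_origins(3000))
print(yolo_line(1, x=1000, y=500, w=32, h=16, x0=896, y0=0))
```

With a 1024 px tile and 128 px overlap the step is 896 px, so neighbouring tiles share a 128 px band in which small symbols near a tile border still appear whole in at least one tile.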
Then train:
```bash
python -m bgtopo_poc.cli train-yolo \
--data-yaml data/yolo/K-34-009-2/data.yaml \
--model yolov8s.pt \
--imgsz 1024 \
--epochs 80 \
--batch 4 \
--device 0
```
---
## What to improve after this PoC works
1. Add automatic sheet-footprint discovery and coordinate-to-sheet assignment.
2. Add CVAT export/import so weak labels can be corrected by hand.
3. Add hard-negative mining for rivers, lakes, blue text and blue linework.
4. Add calibrated coordinate scoring using a small sklearn model trained on reviewed points.
5. Add active learning: prioritize review crops where the model and rule detector disagree.
6. Add full-map batch inference with overlap-aware de-duplication.
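As a rough idea of what the active-learning item could look like, the sketch below ranks review crops by how strongly the rule detector and the trained model disagree. Everything here is hypothetical: the queue entries, the score values, and the `disagreement` helper are placeholders rather than outputs of this PoC.

```python
def disagreement(rule_score: float, model_conf: float) -> float:
    """Absolute gap between the rule-based score and the model confidence."""
    return abs(rule_score - model_conf)


# Hypothetical review queue: (crop id, rule detector score, model confidence).
queue = [
    ("pt001", 0.92, 0.88),
    ("pt002", 0.85, 0.10),
    ("pt003", 0.20, 0.75),
]

# Largest disagreement first, so reviewers see the most informative crops early.
ranked = sorted(queue, key=lambda item: disagreement(item[1], item[2]), reverse=True)
for crop_id, rule, model in ranked:
    print(crop_id, f"gap={disagreement(rule, model):.2f}")
```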
---
## Output files
```text
data/manifest.csv # discovered remote assets
data/manifest_downloaded.csv # local paths after download
data/interim/candidates/*_candidates.csv # weak detections
data/interim/coordinate_scores/*.csv # coordinate-level predictions
data/interim/crops/*/*.png # review crops
reports/overlays/*.png # visual QA overlays
reports/poc_report.html # summary report
data/yolo/*/data.yaml # YOLO training dataset
runs/bgtopo_bluebox/* # YOLO training runs
```

1
bgtopo_poc/__init__.py Normal file

@@ -0,0 +1 @@
__version__ = "0.1.0"

178
bgtopo_poc/cli.py Normal file

@@ -0,0 +1,178 @@
from __future__ import annotations

import argparse
import logging
from pathlib import Path

from .coordinates import extract_coordinate_crops, score_coordinates_for_sheet
from .detector_cv import detect_sheet, draw_overlay
from .export_yolo import export_candidates_to_yolo
from .inventory import discover_original_assets, download_assets, read_manifest_csv, write_manifest_csv
from .report import build_report
from .train_yolo import train_yolo
from .utils import load_yaml, setup_logging

LOG = logging.getLogger(__name__)


def cmd_inventory(args):
    cfg = load_yaml(args.config)
    base_url = args.base_url or cfg["source"]["base_url"]
    assets = discover_original_assets(base_url=base_url, include_100k=args.include_100k)
    if args.limit:
        assets = assets[: args.limit]
    write_manifest_csv(assets, args.out)


def cmd_download(args):
    assets = read_manifest_csv(args.manifest)
    selected = download_assets(assets, args.out_dir, limit=args.limit, overwrite=args.overwrite)
    write_manifest_csv(selected, args.out_manifest)


def cmd_detect(args):
    cfg = load_yaml(args.config)
    detect_sheet(args.map, args.tif, args.sheet_id, cfg, args.out_dir)


def cmd_overlay(args):
    draw_overlay(args.tif, args.candidates, args.out)


def cmd_score_coords(args):
    cfg = load_yaml(args.config)
    score_coordinates_for_sheet(
        coord_csv=args.coordinates,
        candidates_csv=args.candidates,
        map_path=args.map,
        tif_path=args.tif,
        sheet_id=args.sheet_id,
        cfg=cfg,
        out_dir=args.out_dir,
        coord_crs=args.coord_crs,
    )


def cmd_crops(args):
    extract_coordinate_crops(
        coord_scores_csv=args.scores,
        map_path=args.map,
        tif_path=args.tif,
        out_dir=args.out_dir,
        crop_size=args.crop_size,
    )


def cmd_export_yolo(args):
    cfg = load_yaml(args.config)
    export_candidates_to_yolo(
        tif_path=args.tif,
        candidates_csv=args.candidates,
        out_dir=args.out_dir,
        cfg=cfg,
        sheet_id=args.sheet_id,
        tile_size=args.tile_size,
        overlap=args.overlap,
        val_fraction=args.val_fraction,
    )


def cmd_train_yolo(args):
    train_yolo(args.data_yaml, model=args.model, imgsz=args.imgsz, epochs=args.epochs, batch=args.batch, device=args.device)


def cmd_report(args):
    build_report([Path(p) for p in args.candidates], [Path(p) for p in args.overlays], args.out)


def build_parser():
    p = argparse.ArgumentParser(prog="bgtopo-bluebox", description="BGtopoVJ blue rectangle/square PoC pipeline")
    p.add_argument("--verbose", action="store_true")
    sub = p.add_subparsers(dest="cmd", required=True)

    s = sub.add_parser("inventory", help="Crawl original raster directory and create manifest")
    s.add_argument("--config", default="configs/blue_detector.yaml")
    s.add_argument("--base-url", default=None)
    s.add_argument("--out", default="data/manifest.csv")
    s.add_argument("--limit", type=int, default=None)
    s.add_argument("--include-100k", action="store_true")
    s.set_defaults(func=cmd_inventory)

    s = sub.add_parser("download", help="Download .map/.tif pairs from manifest")
    s.add_argument("--manifest", default="data/manifest.csv")
    s.add_argument("--out-dir", default="data/raw")
    s.add_argument("--out-manifest", default="data/manifest_downloaded.csv")
    s.add_argument("--limit", type=int, default=2)
    s.add_argument("--overwrite", action="store_true")
    s.set_defaults(func=cmd_download)

    s = sub.add_parser("detect", help="Detect blue rectangle/square candidates on one sheet")
    s.add_argument("--config", default="configs/blue_detector.yaml")
    s.add_argument("--sheet-id", required=True)
    s.add_argument("--map", default=None)
    s.add_argument("--tif", required=True)
    s.add_argument("--out-dir", default="data/interim/candidates")
    s.set_defaults(func=cmd_detect)

    s = sub.add_parser("overlay", help="Draw candidate overlay for manual QA")
    s.add_argument("--tif", required=True)
    s.add_argument("--candidates", required=True)
    s.add_argument("--out", required=True)
    s.set_defaults(func=cmd_overlay)

    s = sub.add_parser("score-coords", help="Score coordinate CSV against candidates from one sheet")
    s.add_argument("--config", default="configs/blue_detector.yaml")
    s.add_argument("--sheet-id", required=True)
    s.add_argument("--coordinates", required=True)
    s.add_argument("--candidates", required=True)
    s.add_argument("--map", default=None)
    s.add_argument("--tif", required=True)
    s.add_argument("--out-dir", default="data/interim/coordinate_scores")
    s.add_argument("--coord-crs", default="EPSG:4326")
    s.set_defaults(func=cmd_score_coords)

    s = sub.add_parser("crops", help="Extract review crops around scored coordinates")
    s.add_argument("--scores", required=True)
    s.add_argument("--map", default=None)
    s.add_argument("--tif", required=True)
    s.add_argument("--out-dir", default="data/interim/crops")
    s.add_argument("--crop-size", type=int, default=256)
    s.set_defaults(func=cmd_crops)

    s = sub.add_parser("export-yolo", help="Export weak candidates to YOLO dataset format")
    s.add_argument("--config", default="configs/blue_detector.yaml")
    s.add_argument("--sheet-id", required=True)
    s.add_argument("--tif", required=True)
    s.add_argument("--candidates", required=True)
    s.add_argument("--out-dir", default="data/yolo/bluebox")
    s.add_argument("--tile-size", type=int, default=1024)
    s.add_argument("--overlap", type=int, default=128)
    s.add_argument("--val-fraction", type=float, default=0.20)
    s.set_defaults(func=cmd_export_yolo)

    s = sub.add_parser("train-yolo", help="Train YOLO on exported dataset")
    s.add_argument("--data-yaml", required=True)
    s.add_argument("--model", default="yolov8s.pt")
    s.add_argument("--imgsz", type=int, default=1024)
    s.add_argument("--epochs", type=int, default=80)
    s.add_argument("--batch", type=int, default=4)
    s.add_argument("--device", default="0")
    s.set_defaults(func=cmd_train_yolo)

    s = sub.add_parser("report", help="Build HTML QA report")
    s.add_argument("--candidates", nargs="+", required=True)
    s.add_argument("--overlays", nargs="*", default=[])
    s.add_argument("--out", default="reports/poc_report.html")
    s.set_defaults(func=cmd_report)

    return p


def main():
    parser = build_parser()
    args = parser.parse_args()
    setup_logging(args.verbose)
    args.func(args)


if __name__ == "__main__":
    main()

150
bgtopo_poc/coordinates.py Normal file

@@ -0,0 +1,150 @@
from __future__ import annotations

import logging
from pathlib import Path
from typing import Dict, List, Optional

import numpy as np
import pandas as pd
from PIL import Image, ImageDraw
from pyproj import Transformer
from rasterio.windows import Window

from .georef import open_georaster, read_window_rgb
from .utils import ensure_dir

LOG = logging.getLogger(__name__)


def _normalize_coordinate_columns(df: pd.DataFrame) -> pd.DataFrame:
    cols = {c.lower().strip(): c for c in df.columns}
    lat_col = cols.get("lat") or cols.get("latitude") or cols.get("y")
    lon_col = cols.get("lon") or cols.get("lng") or cols.get("longitude") or cols.get("x")
    if not lat_col or not lon_col:
        raise ValueError("Coordinate CSV needs lat/lon columns, or latitude/longitude, or y/x.")
    out = df.copy()
    out["lat"] = pd.to_numeric(out[lat_col], errors="coerce")
    out["lon"] = pd.to_numeric(out[lon_col], errors="coerce")
    if "id" not in out.columns:
        out["id"] = [f"pt_{i:06d}" for i in range(len(out))]
    return out.dropna(subset=["lat", "lon"])


def load_coordinates(path: str | Path) -> pd.DataFrame:
    return _normalize_coordinate_columns(pd.read_csv(path))


def coord_to_rowcol(ds, lon: float, lat: float, coord_crs: str = "EPSG:4326") -> Optional[tuple[int, int]]:
    if ds.crs is None:
        return None
    try:
        transformer = Transformer.from_crs(coord_crs, ds.crs, always_xy=True)
        x, y = transformer.transform(lon, lat)
        row, col = ds.index(x, y)
        return int(row), int(col)
    except Exception as e:  # noqa: BLE001
        LOG.debug("coord_to_rowcol failed: %s", e)
        return None


def score_coordinates_for_sheet(
    coord_csv: str | Path,
    candidates_csv: str | Path,
    map_path: str | None,
    tif_path: str | None,
    sheet_id: str,
    cfg: Dict,
    out_dir: str | Path,
    coord_crs: str = "EPSG:4326",
) -> Path:
    out_dir = ensure_dir(out_dir)
    coords = load_coordinates(coord_csv)
    cands = pd.read_csv(candidates_csv)
    rh = open_georaster(map_path=map_path, tif_path=tif_path)
    radius = float(cfg["coordinate_scoring"].get("search_radius_px", 45))
    try:
        rows: List[dict] = []
        for _, pt in coords.iterrows():
            rc = coord_to_rowcol(rh.dataset, float(pt.lon), float(pt.lat), coord_crs=coord_crs)
            if rc is None:
                continue
            row, col = rc
            if row < 0 or col < 0 or row >= rh.height or col >= rh.width:
                continue
            if cands.empty:
                nearest = None
            else:
                dx = cands["cx"].astype(float).to_numpy() - col
                dy = cands["cy"].astype(float).to_numpy() - row
                dist = np.sqrt(dx * dx + dy * dy)
                i = int(np.argmin(dist))
                nearest = (i, float(dist[i]))
            score = 0.0
            nearest_id = None
            nearest_dist = None
            nearest_style = None
            nearest_det_score = None
            decision = "auto_negative"
            if nearest:
                i, d = nearest
                nearest_dist = d
                if d <= radius:
                    nearest_id = i
                    nearest_style = str(cands.iloc[i].get("fill_style", "unknown"))
                    nearest_det_score = float(cands.iloc[i].get("score", 0.0))
                    dist_factor = max(0.0, 1.0 - d / radius)
                    score = float(0.55 * nearest_det_score + 0.45 * dist_factor)
            if score >= float(cfg["coordinate_scoring"].get("strong_score", 0.90)):
                decision = "auto_positive"
            elif score >= float(cfg["coordinate_scoring"].get("weak_score", 0.40)):
                decision = "review"
            rows.append({
                "id": pt.id,
                "sheet_id": sheet_id,
                "lat": float(pt.lat),
                "lon": float(pt.lon),
                "row": row,
                "col": col,
                "nearest_candidate_index": nearest_id,
                "nearest_candidate_distance_px": nearest_dist,
                "nearest_candidate_style": nearest_style,
                "nearest_candidate_score": nearest_det_score,
                "coordinate_score": score,
                "decision": decision,
            })
        out_csv = Path(out_dir) / f"{sheet_id}_coordinate_scores.csv"
        pd.DataFrame(rows).to_csv(out_csv, index=False)
        LOG.info("Wrote coordinate scores: %s", out_csv)
        return out_csv
    finally:
        rh.close()


def extract_coordinate_crops(
    coord_scores_csv: str | Path,
    map_path: str | None,
    tif_path: str | None,
    out_dir: str | Path,
    crop_size: int = 256,
    only_decisions: tuple[str, ...] = ("review", "auto_positive"),
) -> Path:
    out_dir = ensure_dir(out_dir)
    df = pd.read_csv(coord_scores_csv)
    rh = open_georaster(map_path=map_path, tif_path=tif_path)
    half = crop_size // 2
    try:
        for _, r in df.iterrows():
            if str(r.decision) not in only_decisions:
                continue
            row, col = int(r.row), int(r.col)
            win = Window(col - half, row - half, crop_size, crop_size)
            rgb = read_window_rgb(rh.dataset, win)
            img = Image.fromarray(rgb).convert("RGB")
            draw = ImageDraw.Draw(img)
            draw.ellipse([half - 5, half - 5, half + 5, half + 5], outline=(255, 0, 0), width=2)
            name = f"{str(r.id)}__{str(r.decision)}__score_{float(r.coordinate_score):.3f}.png"
            img.save(Path(out_dir) / name)
        LOG.info("Wrote crops into: %s", out_dir)
        return Path(out_dir)
    finally:
        rh.close()

223
bgtopo_poc/detector_cv.py Normal file

@@ -0,0 +1,223 @@
from __future__ import annotations

import logging
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

import cv2
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw
from rasterio.windows import Window
from tqdm import tqdm

from .georef import iter_windows, open_georaster, pixel_to_lonlat, read_window_rgb
from .utils import ensure_dir, safe_stem

LOG = logging.getLogger(__name__)


@dataclass
class Candidate:
    sheet_id: str
    source_path: str
    x: int
    y: int
    w: int
    h: int
    cx: float
    cy: float
    area: float
    aspect: float
    blue_fill_ratio: float
    rectangularity: float
    solidity: float
    approx_vertices: int
    fill_style: str
    score: float
    lon: float | None = None
    lat: float | None = None

    def to_dict(self):
        return asdict(self)


def build_blue_mask(rgb: np.ndarray, cfg: Dict) -> np.ndarray:
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    mask = np.zeros(hsv.shape[:2], dtype=np.uint8)
    for r in cfg["detector"].get("hsv_ranges", []):
        lower = np.array(r["lower"], dtype=np.uint8)
        upper = np.array(r["upper"], dtype=np.uint8)
        mask |= cv2.inRange(hsv, lower, upper)
    morph = cfg["detector"].get("morphology", {})
    open_k = int(morph.get("open_kernel", 0) or 0)
    close_k = int(morph.get("close_kernel", 0) or 0)
    if open_k > 1:
        k = np.ones((open_k, open_k), np.uint8)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, k)
    if close_k > 1:
        k = np.ones((close_k, close_k), np.uint8)
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, k)
    dilate_iter = int(morph.get("dilate_iterations", 0) or 0)
    if dilate_iter > 0:
        mask = cv2.dilate(mask, np.ones((2, 2), np.uint8), iterations=dilate_iter)
    return mask


def _classify_fill_style(blue_fill_ratio: float, rectangularity: float, solidity: float) -> str:
    if blue_fill_ratio >= 0.48 and solidity >= 0.60:
        return "filled"
    if 0.10 <= blue_fill_ratio < 0.42 and rectangularity >= 0.28:
        return "hollow"
    if blue_fill_ratio < 0.18 and rectangularity >= 0.18:
        return "border"
    return "unknown"


def _score_candidate(blue_fill_ratio: float, rectangularity: float, solidity: float, aspect: float, approx_vertices: int) -> float:
    aspect_bonus = 1.0 - min(abs(np.log(max(aspect, 1e-3))), 1.8) / 1.8
    vertex_bonus = 1.0 if 4 <= approx_vertices <= 8 else 0.55
    raw = 0.30 * blue_fill_ratio + 0.28 * rectangularity + 0.22 * solidity + 0.12 * aspect_bonus + 0.08 * vertex_bonus
    return float(max(0.0, min(1.0, raw)))


def find_candidates_in_rgb(rgb: np.ndarray, cfg: Dict, sheet_id: str, source_path: str, xoff: int = 0, yoff: int = 0) -> List[Candidate]:
    det = cfg["detector"]
    mask = build_blue_mask(rgb, cfg)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    out: List[Candidate] = []
    for contour in contours:
        area = float(cv2.contourArea(contour))
        if area < det["min_area_px"] or area > det["max_area_px"]:
            continue
        x, y, w, h = cv2.boundingRect(contour)
        if w < det["min_width_px"] or h < det["min_height_px"]:
            continue
        if w > det["max_width_px"] or h > det["max_height_px"]:
            continue
        aspect = float(w / max(h, 1))
        if not (det["min_aspect"] <= aspect <= det["max_aspect"]):
            continue
        bbox_mask = mask[y:y + h, x:x + w]
        blue_fill_ratio = float(np.count_nonzero(bbox_mask) / max(w * h, 1))
        if not (det["min_blue_fill_ratio"] <= blue_fill_ratio <= det["max_blue_fill_ratio"]):
            continue
        rectangularity = float(area / max(w * h, 1))
        if rectangularity < det["min_rectangularity"]:
            continue
        hull = cv2.convexHull(contour)
        hull_area = float(cv2.contourArea(hull))
        solidity = float(area / hull_area) if hull_area > 0 else 0.0
        if solidity < det["min_solidity"]:
            continue
        peri = float(cv2.arcLength(contour, True))
        approx = cv2.approxPolyDP(contour, 0.04 * peri, True)
        fill_style = _classify_fill_style(blue_fill_ratio, rectangularity, solidity)
        score = _score_candidate(blue_fill_ratio, rectangularity, solidity, aspect, len(approx))
        out.append(
            Candidate(
                sheet_id=sheet_id,
                source_path=source_path,
                x=int(x + xoff),
                y=int(y + yoff),
                w=int(w),
                h=int(h),
                cx=float(x + xoff + w / 2),
                cy=float(y + yoff + h / 2),
                area=area,
                aspect=aspect,
                blue_fill_ratio=blue_fill_ratio,
                rectangularity=rectangularity,
                solidity=solidity,
                approx_vertices=int(len(approx)),
                fill_style=fill_style,
                score=score,
            )
        )
    return out


def bbox_iou(a: Candidate, b: Candidate) -> float:
    ax1, ay1, ax2, ay2 = a.x, a.y, a.x + a.w, a.y + a.h
    bx1, by1, bx2, by2 = b.x, b.y, b.x + b.w, b.y + b.h
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    union = a.w * a.h + b.w * b.h - inter
    return float(inter / union) if union else 0.0


def nms(cands: Iterable[Candidate], threshold: float) -> List[Candidate]:
    items = sorted(cands, key=lambda c: c.score, reverse=True)
    kept: List[Candidate] = []
    for c in items:
        if all(bbox_iou(c, k) < threshold for k in kept):
            kept.append(c)
    return kept


def detect_sheet(map_path: str | None, tif_path: str | None, sheet_id: str, cfg: Dict, out_dir: str | Path) -> Path:
    out_dir = ensure_dir(out_dir)
    rh = open_georaster(map_path=map_path, tif_path=tif_path)
    try:
        tile_size = int(cfg["detector"]["tile_size"])
        overlap = int(cfg["detector"]["tile_overlap"])
        all_candidates: List[Candidate] = []
        windows = list(iter_windows(rh.width, rh.height, tile_size, overlap))
        LOG.info("Scanning %s as %d windows (%dx%d px)", sheet_id, len(windows), rh.width, rh.height)
        for win in tqdm(windows, desc=f"scan {sheet_id}"):
            rgb = read_window_rgb(rh.dataset, win)
            cands = find_candidates_in_rgb(
                rgb=rgb,
                cfg=cfg,
                sheet_id=sheet_id,
                source_path=str(rh.path),
                xoff=int(win.col_off),
                yoff=int(win.row_off),
            )
            all_candidates.extend(cands)
        kept = nms(all_candidates, float(cfg["detector"].get("nms_iou_threshold", 0.25)))
        # Attach georeferenced centers when possible.
        # Note: pixel_to_lonlat returns coordinates in the dataset CRS, so the
        # lon/lat fields hold true longitude/latitude only when that CRS is geographic.
        for c in kept:
            if rh.crs is not None:
                xgeo, ygeo = pixel_to_lonlat(rh.dataset, int(c.cy), int(c.cx))
                c.lon = xgeo
                c.lat = ygeo
        out_csv = out_dir / f"{sheet_id}_candidates.csv"
        pd.DataFrame([c.to_dict() for c in kept]).to_csv(out_csv, index=False)
        LOG.info("%s: wrote %d candidates to %s", sheet_id, len(kept), out_csv)
        return out_csv
    finally:
        rh.close()


def draw_overlay(tif_path: str | Path, candidates_csv: str | Path, out_png: str | Path, max_side: int = 2400) -> Path:
    """Draw candidates on a downscaled image. Uses TIFF path for simple visual QA."""
    img = Image.open(tif_path).convert("RGB")
    scale = min(1.0, max_side / max(img.size))
    disp = img.resize((int(img.width * scale), int(img.height * scale))) if scale < 1 else img.copy()
    draw = ImageDraw.Draw(disp)
    df = pd.read_csv(candidates_csv)
    color_by_style = {
        "filled": (0, 255, 255),
        "hollow": (0, 120, 255),
        "border": (20, 20, 255),
        "unknown": (255, 255, 0),
    }
    for _, r in df.iterrows():
        color = color_by_style.get(str(r.get("fill_style", "unknown")), (255, 255, 0))
        x1, y1 = float(r.x) * scale, float(r.y) * scale
        x2, y2 = float(r.x + r.w) * scale, float(r.y + r.h) * scale
        draw.rectangle([x1, y1, x2, y2], outline=color, width=max(1, int(2 * scale + 1)))
    out_png = Path(out_png)
    ensure_dir(out_png.parent)
    disp.save(out_png)
    LOG.info("Wrote overlay: %s", out_png)
    return out_png

117
bgtopo_poc/export_yolo.py Normal file

@@ -0,0 +1,117 @@
from __future__ import annotations

import logging
import random
import shutil
from pathlib import Path
from typing import Dict

import pandas as pd
from PIL import Image

from .utils import ensure_dir

LOG = logging.getLogger(__name__)

STYLE_TO_CLASS = {
    "unknown": 0,
    "filled": 1,
    "hollow": 2,
    "border": 3,
}


def _crop_image_and_labels(img: Image.Image, boxes: pd.DataFrame, x0: int, y0: int, size: int):
    crop = img.crop((x0, y0, x0 + size, y0 + size)).convert("RGB")
    labels = []
    for _, r in boxes.iterrows():
        bx1, by1, bx2, by2 = float(r.x), float(r.y), float(r.x + r.w), float(r.y + r.h)
        ix1, iy1 = max(bx1, x0), max(by1, y0)
        ix2, iy2 = min(bx2, x0 + size), min(by2, y0 + size)
        if ix2 <= ix1 or iy2 <= iy1:
            continue
        visible_area = (ix2 - ix1) * (iy2 - iy1)
        box_area = max((bx2 - bx1) * (by2 - by1), 1)
        if visible_area / box_area < 0.35:
            continue
        cx = ((ix1 + ix2) / 2 - x0) / size
        cy = ((iy1 + iy2) / 2 - y0) / size
        w = (ix2 - ix1) / size
        h = (iy2 - iy1) / size
        cls = STYLE_TO_CLASS.get(str(r.get("fill_style", "unknown")), 0)
        labels.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return crop, labels


def _tile_origins(length: int, tile_size: int, step: int) -> list[int]:
    """Tile start offsets along one axis, with a final tile flush against the far edge.

    A plain range(0, length - tile_size + 1, step) skips the last strip of the
    image whenever (length - tile_size) is not a multiple of step.
    """
    last = max(0, length - tile_size)
    origins = list(range(0, last + 1, step))
    if origins[-1] != last:
        origins.append(last)
    return origins


def export_candidates_to_yolo(
    tif_path: str | Path,
    candidates_csv: str | Path,
    out_dir: str | Path,
    cfg: Dict,
    sheet_id: str,
    tile_size: int = 1024,
    overlap: int = 128,
    val_fraction: float = 0.20,
    include_empty_tiles: bool = True,
    max_empty_tiles: int = 250,
) -> Path:
    out_dir = Path(out_dir)
    for split in ["train", "val"]:
        ensure_dir(out_dir / "images" / split)
        ensure_dir(out_dir / "labels" / split)
    img = Image.open(tif_path).convert("RGB")
    boxes = pd.read_csv(candidates_csv)
    step = max(1, tile_size - overlap)
    random.seed(42)
    empty_written = 0
    total_written = 0
    for y0 in _tile_origins(img.height, tile_size, step):
        for x0 in _tile_origins(img.width, tile_size, step):
            in_tile = boxes[
                (boxes.cx >= x0) & (boxes.cx < x0 + tile_size) &
                (boxes.cy >= y0) & (boxes.cy < y0 + tile_size)
            ]
            if in_tile.empty:
                if not include_empty_tiles or empty_written >= max_empty_tiles:
                    continue
                # Keep some empty/hard-negative tiles to stop the model from detecting all blue map details.
                if random.random() > 0.08:
                    continue
                empty_written += 1
            crop, labels = _crop_image_and_labels(img, boxes, x0, y0, tile_size)
            split = "val" if random.random() < val_fraction else "train"
            stem = f"{sheet_id}_{x0}_{y0}"
            crop.save(out_dir / "images" / split / f"{stem}.jpg", quality=92)
            with open(out_dir / "labels" / split / f"{stem}.txt", "w", encoding="utf-8") as f:
                f.write("\n".join(labels))
            total_written += 1
    data_yaml = out_dir / "data.yaml"
    names = cfg.get("export", {}).get("yolo_class_names", ["blue_rect_unknown", "blue_rect_filled", "blue_rect_hollow", "blue_rect_border"])
    with open(data_yaml, "w", encoding="utf-8") as f:
        f.write(f"path: {out_dir.resolve()}\n")
        f.write("train: images/train\n")
        f.write("val: images/val\n")
        f.write("names:\n")
        for i, name in enumerate(names):
            f.write(f" {i}: {name}\n")
    LOG.info("YOLO export complete: %s (%d tiles)", data_yaml, total_written)
    return data_yaml


def merge_yolo_datasets(src_dirs: list[str | Path], out_dir: str | Path) -> Path:
    out_dir = Path(out_dir)
    for split in ["train", "val"]:
        ensure_dir(out_dir / "images" / split)
        ensure_dir(out_dir / "labels" / split)
    for src in src_dirs:
        src = Path(src)
        for split in ["train", "val"]:
            for img in (src / "images" / split).glob("*.jpg"):
                shutil.copy2(img, out_dir / "images" / split / img.name)
            for lab in (src / "labels" / split).glob("*.txt"):
                shutil.copy2(lab, out_dir / "labels" / split / lab.name)
    return out_dir

116
bgtopo_poc/georef.py Normal file

@@ -0,0 +1,116 @@
from __future__ import annotations

import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple

import numpy as np
import rasterio
from rasterio.errors import RasterioIOError
from rasterio.transform import rowcol, xy
from rasterio.windows import Window

LOG = logging.getLogger(__name__)


@dataclass
class RasterHandle:
    path: Path
    dataset: rasterio.io.DatasetReader
    used_map_file: bool

    @property
    def width(self) -> int:
        return self.dataset.width

    @property
    def height(self) -> int:
        return self.dataset.height

    @property
    def crs(self):
        return self.dataset.crs

    @property
    def transform(self):
        return self.dataset.transform

    def close(self) -> None:
        self.dataset.close()


def open_georaster(map_path: Optional[str | Path] = None, tif_path: Optional[str | Path] = None) -> RasterHandle:
    """Open .map if possible, otherwise .tif. .map gives georeferencing through GDAL's MAP driver."""
    last_err = None
    if map_path:
        try:
            p = Path(map_path)
            ds = rasterio.open(p)
            LOG.info("Opened georeferenced MAP dataset: %s", p)
            return RasterHandle(p, ds, used_map_file=True)
        except Exception as e:  # noqa: BLE001
            last_err = e
            LOG.warning("Could not open MAP dataset %s: %s", map_path, e)
    if tif_path:
        try:
            p = Path(tif_path)
            ds = rasterio.open(p)
            LOG.info("Opened TIFF dataset: %s", p)
            return RasterHandle(p, ds, used_map_file=False)
        except RasterioIOError as e:
            last_err = e
    raise RuntimeError(f"Could not open raster. Last error: {last_err}")


def read_window_rgb(ds: rasterio.io.DatasetReader, window: Window) -> np.ndarray:
    """Read a raster window as uint8 RGB HxWx3."""
    arr = ds.read(window=window, boundless=True, fill_value=255)
    if arr.ndim != 3:
        raise ValueError(f"Expected band-first array, got shape={arr.shape}")
    if arr.shape[0] >= 3:
        arr = arr[:3]
    else:
        # One or two bands (e.g. gray, or gray + alpha): replicate the first band
        # so the result always has exactly three channels.
        arr = np.repeat(arr[:1], 3, axis=0)
    arr = np.moveaxis(arr, 0, -1)
    if arr.dtype != np.uint8:
        arr = np.clip(arr, 0, 255).astype(np.uint8)
    return arr


def iter_windows(width: int, height: int, tile_size: int, overlap: int):
    step = max(1, tile_size - overlap)
    y = 0
    while y < height:
        x = 0
        h = min(tile_size, height - y)
        while x < width:
            w = min(tile_size, width - x)
            yield Window(x, y, w, h)
            if x + tile_size >= width:
                break
            x += step
        if y + tile_size >= height:
            break
        y += step


def lonlat_to_pixel(ds: rasterio.io.DatasetReader, lon: float, lat: float) -> Tuple[int, int]:
    """Convert lon/lat to row/col. Assumes the raster CRS accepts lon/lat or GDAL handles geographic transform.

    For a production version, reproject coordinates into ds.crs with pyproj first. The PoC does that in coordinates.py.
    """
    row, col = rowcol(ds.transform, lon, lat)
    return int(row), int(col)


def pixel_to_lonlat(ds: rasterio.io.DatasetReader, row: int, col: int) -> Tuple[float, float]:
    x, y = xy(ds.transform, row, col, offset="center")
    return float(x), float(y)


def has_real_georef(ds: rasterio.io.DatasetReader) -> bool:
    try:
        return ds.crs is not None and not ds.transform.is_identity
    except Exception:
        return False

115
bgtopo_poc/inventory.py Normal file

@@ -0,0 +1,115 @@
from __future__ import annotations

import logging
import re
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Iterable, List, Optional
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

from .utils import ensure_dir

LOG = logging.getLogger(__name__)


@dataclass
class SheetAsset:
    sheet_id: str
    map_url: Optional[str]
    tif_url: Optional[str]
    map_path: Optional[str] = None
    tif_path: Optional[str] = None

    def to_dict(self):
        return asdict(self)


def discover_original_assets(base_url: str, include_100k: bool = False) -> List[SheetAsset]:
    """Discover .map/.tif pairs from the BGtopoVJ original raster directory listing."""
    LOG.info("Discovering assets from %s", base_url)
    resp = requests.get(base_url, timeout=60)
    # Fail loudly on HTTP errors instead of parsing an error page as a listing.
    resp.raise_for_status()
    html = resp.text
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a.get("href") for a in soup.find_all("a") if a.get("href")]
    by_sheet: dict[str, SheetAsset] = {}
    for href in hrefs:
        if href.startswith("?") or href.startswith("/") or href == "../":
            continue
        if not include_100k and "100k" in href.lower():
            continue
        if not (href.lower().endswith(".map") or href.lower().endswith(".tif") or href.lower().endswith(".tiff")):
            continue
        sheet_id = re.sub(r"\.(map|tif|tiff)$", "", Path(href).name, flags=re.IGNORECASE)
        item = by_sheet.setdefault(sheet_id, SheetAsset(sheet_id=sheet_id, map_url=None, tif_url=None))
        full_url = urljoin(base_url, href)
        if href.lower().endswith(".map"):
            item.map_url = full_url
        else:
            item.tif_url = full_url
    assets = [v for v in by_sheet.values() if v.map_url and v.tif_url]
    assets.sort(key=lambda x: x.sheet_id)
    LOG.info("Discovered %d complete .map/.tif pairs", len(assets))
    return assets
def write_manifest_csv(assets: Iterable[SheetAsset], out_csv: str | Path) -> Path:
    rows = [a.to_dict() for a in assets]
    out_csv = Path(out_csv)
    ensure_dir(out_csv.parent)
    pd.DataFrame(rows).to_csv(out_csv, index=False)
    LOG.info("Wrote manifest: %s", out_csv)
    return out_csv


def read_manifest_csv(path: str | Path) -> List[SheetAsset]:
    df = pd.read_csv(path).fillna("")
    assets: List[SheetAsset] = []
    for _, r in df.iterrows():
        assets.append(
            SheetAsset(
                sheet_id=str(r["sheet_id"]),
                map_url=str(r.get("map_url") or "") or None,
                tif_url=str(r.get("tif_url") or "") or None,
                map_path=str(r.get("map_path") or "") or None,
                tif_path=str(r.get("tif_path") or "") or None,
            )
        )
    return assets


def _download_one(url: str, out_path: Path, overwrite: bool = False) -> Path:
    if out_path.exists() and out_path.stat().st_size > 0 and not overwrite:
        return out_path
    ensure_dir(out_path.parent)
    with requests.get(url, stream=True, timeout=120) as r:
        r.raise_for_status()
        total = int(r.headers.get("content-length", "0") or 0)
        with open(out_path, "wb") as f, tqdm(total=total, unit="B", unit_scale=True, desc=out_path.name) as pbar:
            for chunk in r.iter_content(chunk_size=1024 * 512):
                if chunk:
                    f.write(chunk)
                    pbar.update(len(chunk))
    return out_path


def download_assets(
    assets: List[SheetAsset],
    out_dir: str | Path,
    limit: Optional[int] = None,
    overwrite: bool = False,
) -> List[SheetAsset]:
    out_dir = Path(out_dir)
    selected = assets[:limit] if limit else assets
    for item in selected:
        sheet_dir = out_dir / item.sheet_id
        if item.map_url:
            item.map_path = str(_download_one(item.map_url, sheet_dir / f"{item.sheet_id}.map", overwrite=overwrite))
        if item.tif_url:
            item.tif_path = str(_download_one(item.tif_url, sheet_dir / f"{item.sheet_id}.tif", overwrite=overwrite))
    return selected

80
bgtopo_poc/report.py Normal file

@@ -0,0 +1,80 @@
from __future__ import annotations
from pathlib import Path
import pandas as pd
from jinja2 import Template
from .utils import ensure_dir
REPORT_TEMPLATE = """
<!doctype html>
<html>
<head>
<meta charset="utf-8" />
<title>BGtopoVJ Blue Box PoC Report</title>
<style>
body { font-family: Arial, sans-serif; max-width: 1200px; margin: 24px auto; line-height: 1.45; }
table { border-collapse: collapse; width: 100%; margin: 16px 0; }
th, td { border: 1px solid #ddd; padding: 8px; font-size: 14px; }
th { background: #f3f3f3; text-align: left; }
.grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(300px, 1fr)); gap: 18px; }
.card { border: 1px solid #ddd; border-radius: 12px; padding: 12px; }
img { max-width: 100%; border: 1px solid #ccc; }
code { background: #f6f6f6; padding: 2px 4px; }
</style>
</head>
<body>
<h1>BGtopoVJ Blue Rectangle/Square PoC Report</h1>
<p>This is a weak-label mining report. Treat candidates as review targets, not truth.</p>
<h2>Candidate summary</h2>
<table>
<tr><th>Metric</th><th>Value</th></tr>
<tr><td>Total candidates</td><td>{{ total }}</td></tr>
<tr><td>Average score</td><td>{{ avg_score }}</td></tr>
<tr><td>Median score</td><td>{{ med_score }}</td></tr>
</table>
<h2>By inferred fill style</h2>
{{ style_table }}
<h2>QA overlays</h2>
<div class="grid">
{% for overlay in overlays %}
<div class="card">
<p><code>{{ overlay.name }}</code></p>
<img src="{{ overlay.rel }}" />
</div>
{% endfor %}
</div>
</body>
</html>
"""
def build_report(candidate_csvs: list[str | Path], overlays: list[str | Path], out_html: str | Path) -> Path:
    frames = [pd.read_csv(p) for p in candidate_csvs if Path(p).exists()]
    df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    out_html = Path(out_html)
    ensure_dir(out_html.parent)
    style_table = "<p>No candidates.</p>"
    if not df.empty:
        style_table = (
            df.groupby("fill_style")
            .agg(count=("fill_style", "size"), avg_score=("score", "mean"))
            .reset_index()
            .to_html(index=False)
        )
    overlay_items = []
    for ov in overlays:
        ov = Path(ov)
        try:
            rel = ov.relative_to(out_html.parent)
        except ValueError:
            rel = ov
        overlay_items.append({"name": ov.name, "rel": str(rel).replace("\\", "/")})
    html = Template(REPORT_TEMPLATE).render(
        total=0 if df.empty else len(df),
        avg_score="" if df.empty else f"{df['score'].mean():.3f}",
        med_score="" if df.empty else f"{df['score'].median():.3f}",
        style_table=style_table,
        overlays=overlay_items,
    )
    out_html.write_text(html, encoding="utf-8")
    return out_html

40
bgtopo_poc/train_yolo.py Normal file

@@ -0,0 +1,40 @@
from __future__ import annotations

import logging
from pathlib import Path

LOG = logging.getLogger(__name__)


def train_yolo(
    data_yaml: str | Path,
    model: str = "yolov8s.pt",
    imgsz: int = 1024,
    epochs: int = 80,
    batch: int = 4,
    device: str = "0",
):
    """Train YOLO on the generated weak-label dataset.

    This function imports ultralytics lazily so the rest of the PoC works without GPU dependencies.
    Review/correct the weak labels before treating this model as useful.
    """
    from ultralytics import YOLO

    yolo = YOLO(model)
    LOG.info(
        "Starting YOLO training: model=%s data=%s imgsz=%d epochs=%d batch=%d device=%s",
        model, data_yaml, imgsz, epochs, batch, device,
    )
    return yolo.train(
        data=str(data_yaml),
        imgsz=imgsz,
        epochs=epochs,
        batch=batch,
        device=device,
        workers=4,
        cache=False,
        patience=20,
        project="runs/bgtopo_bluebox",
        name=f"{Path(data_yaml).parent.name}_{Path(model).stem}",
        hsv_h=0.005,
        hsv_s=0.20,
        hsv_v=0.18,
        degrees=0.0,
        translate=0.05,
        scale=0.20,
        fliplr=0.5,
        flipud=0.5,
        mosaic=0.25,
        close_mosaic=15,
    )

43
bgtopo_poc/utils.py Normal file

@@ -0,0 +1,43 @@
from __future__ import annotations

import json
import logging
from pathlib import Path
from typing import Any, Dict

import yaml


def setup_logging(verbose: bool = False) -> None:
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s | %(levelname)-8s | %(message)s",
        datefmt="%H:%M:%S",
    )


def load_yaml(path: str | Path) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f) or {}


def ensure_dir(path: str | Path) -> Path:
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    return p


def write_json(path: str | Path, data: Any) -> None:
    p = Path(path)
    ensure_dir(p.parent)
    with open(p, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def read_json(path: str | Path) -> Any:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def safe_stem(name: str) -> str:
    return Path(name).stem.replace("/", "_").replace("\\", "_")


@@ -0,0 +1,47 @@
# Tuned to be permissive. Expect false positives from rivers, lakes, names and contours.
# The purpose is weak-label mining + hard-negative discovery, not final truth.
source:
  base_url: "https://web.uni-plovdiv.net/vedrin/raster/original/"
  default_crs: "EPSG:4326"

detector:
  tile_size: 1400
  tile_overlap: 96
  min_width_px: 5
  min_height_px: 5
  max_width_px: 95
  max_height_px: 95
  min_area_px: 18
  max_area_px: 6500
  min_aspect: 0.35
  max_aspect: 4.25
  min_blue_fill_ratio: 0.045
  max_blue_fill_ratio: 0.92
  min_rectangularity: 0.18
  min_solidity: 0.22
  nms_iou_threshold: 0.25
  # OpenCV HSV hue range is 0-179, not 0-360.
  hsv_ranges:
    - name: blue_dark_to_light
      lower: [82, 28, 45]
      upper: [137, 255, 255]
    - name: pale_cyan_low_sat
      lower: [78, 10, 105]
      upper: [115, 115, 255]
  morphology:
    open_kernel: 2
    close_kernel: 3
    dilate_iterations: 1

coordinate_scoring:
  search_radius_px: 45
  strong_score: 0.90
  weak_score: 0.40
  crop_size_px: 256

export:
  yolo_class_names:
    - blue_rect_unknown
    - blue_rect_filled
    - blue_rect_hollow
    - blue_rect_border


@@ -0,0 +1,3 @@
id,lat,lon,expected
example_001,42.58837223,23.19638729,unknown
example_002,42.70000000,23.32000000,unknown

253
deep-research-v1.md Normal file

@@ -0,0 +1,253 @@
# Pipeline design for detecting blue rectangles in BGtopoVJ and scoring 60k coordinates
## Executive summary
The most efficient pipeline is **hybrid and source-aware**, not “pure deep learning from day one.” The prioritized source pass suggests a strong split of responsibilities: use the **per-sheet original raster files** as the canonical training source; use the **Garmin/vector products** as a possible weak-label source if they can be decompiled or converted; use the **processed and merged raster products** for fast visual QA and large-area scanning; and train the final **coordinate-level predictor** on georeferenced centered crops plus detector-derived features. That recommendation comes directly from the BGtopoVJ format inventory, the site's own build/toolchain references, and the GitHub tooling that already exists for Polish `.mp`, GeoJSON conversion, ECW access, raster tiling, and georeferencing.
The best canonical source for training is **the original per-sheet TIFF plus its `.map` sidecar**, not the mobile/tiled variants. The university mirror states that the underlying data are **1:50,000 Soviet army raster TIF images**, scanned at **250 dpi**, **8-bit**, with an effective **~5.5 m/pixel** resolution, over **544 sheets** and **3240 tiles**. Per-sheet pages expose four downloadable items for each sheet: original raster image, processed raster image, Garmin vector map, and processed raster for MapNav; the original raster directory shows that each sheet is paired with a `.tif` image and a `.map` calibration file.
The best first production model is **not** a single end-to-end detector. It is a staged stack: **OpenCV rules** for candidate mining and hard-negative discovery, **YOLOv8s** or **RetinaNet** for scalable symbol detection, **Mask R-CNN** or a small **U-Net** only if fill-style discrimination remains weak, and then a **calibrated coordinate scorer** that converts local detections into the final “should this coordinate have a blue square/rectangle?” probability. The detector literature, official training docs, and calibration literature all support this structure: one-stage models are efficient, RetinaNet explicitly addresses foreground/background imbalance with focal loss, Mask R-CNN adds instance masks with modest overhead, U-Net remains strong for small-mask segmentation with limited data, and temperature scaling is still one of the simplest reliable ways to calibrate probabilities.
A separate pass over adspem.com did **not** surface actionable BGtopoVJ-specific technical material during this research run, and a direct access attempt timed out. In practical terms, that means the design below should be grounded in the much more concrete evidence from GitHub, the mirrored BGtopoVJ site hosted by the CART Lab at the University of Plovdiv Paisii Hilendarski, and the official documentation/original papers that describe the GIS and ML stack.
## Source triage and acquisition
A pass over GitHub did not reveal a turnkey “detect blue BGtopoVJ squares” repository, but it **did** reveal several building blocks that materially change the project design. The highest-value repositories are: a **QGIS Polish-format plugin** that can import and export `.mp` files, an **MP-to-GeoJSON converter**, **GDAL-based raster tiling/merging scripts**, a **rio-tiler** wrapper around Rasterio/GDAL for point/part/tile reads, an **ECW GDAL plugin** helper, and **MapWarper** for map rectification/export to GeoTIFF/WMS/tiles. Those tools are exactly the pieces you need if you want a vector-assisted weak-label branch, georeferenced tile extraction, full-area scanning, and human QA over dedicated validation areas.
The BGtopoVJ mirror is the authoritative source for map formats and technical metadata. It lists support for **Garmin GPSr**, **Garmin MapSource/BaseCamp**, **ECW**, **JNX**, **MapNav**, **OsmAnd / Locus Map**, **OruxMaps**, **NaviComputer**, **WWW**, and **WMS**. It also states that the project goal was to make popular Soviet military raster maps free to use on Garmin-compatible devices, and that the workflow used tools such as Mapwel, GMapTool, Global Mapper, and cGPSmapper.
The site's download section is unusually informative and is directly useful for planning source ingestion. The merged **ECW** product is georeferenced in **UTM zone 35 / WGS84**; the merged **JNX** product is **Geographic / WGS84**; the **MapNav** version is **Mercator** and loads `*.mno` or `*.mnm` raster maps; the **OsmAnd/Locus** and **OruxMaps** products expose pre-tiled zoom pyramids; and the **NaviComputer** product is a `.nmap` archive. The page also exposes full archive sizes, which matter for storage planning, and notes that the mobile/offline variants use zoom levels roughly in the **6-14** range.
The per-sheet pages matter even more than the merged downloads because they expose the data model you want to automate against. Each sheet page lists **original raster image**, **processed raster image**, **vector map suitable for Garmin GPSr devices**, and **processed raster image for MapNav**. The page icons and link behavior indicate that the original and processed raster products are **TIFF-based**, the Garmin map is **IMG-based**, and the MapNav product is packaged separately. The original raster directory listing further shows the crucial pairing of **`.tif` image + `.map` calibration file**, and it also exposes a **`100k/`** subdirectory, which means your ingestion manifest must explicitly separate **1:50k** and **1:100k** materials to avoid symbol-scale leakage.
That `.map` sidecar is not just a nuisance file; it is an advantage. Official OziExplorer documentation says a `.map` file stores the **image path**, **datum**, **projection**, and **calibration/georeferencing** information, and GDAL has a built-in **MAP / OziExplorer** raster driver with georeferencing support. In other words, a per-sheet BGtopoVJ “original raster” is a georeferenceable pair you can parse directly into a robust coordinate-to-pixel pipeline.
The final source-triage conclusion is important: **start with original TIFF+MAP**, not with ECW and not with mobile tile packages, and treat the Garmin/vector branch as a **high-upside auxiliary path**. The reason is that the BGtopoVJ site explicitly says the separate vector archives are based on **BGtopoVJ-v2.00** data, while the merged raster products and the current public release are **v3.00**. That makes vector-derived features or pseudo-labels very useful, but not safe as the sole source of truth unless you version-track them carefully.
All format, projection, and packaging details in the following table come from the BGtopoVJ mirror and its per-sheet pages.
| Source product | What you actually get | Why it matters | Recommended use |
|---|---|---|---|
| Original per-sheet raster | `sheet.tif` + `sheet.map` | Canonical scan, per-sheet georeferencing, closest to source symbology | **Primary training source** |
| Processed per-sheet raster | TIFF-based processed raster | Cleaner colors and less noise than original | Weak-label mining, annotator QA |
| Garmin vector sheet | IMG-based vector map | Potential vector-assisted pseudo-labels or semantic cross-checks | Auxiliary branch only |
| Merged ECW | Country-scale georeferenced raster, UTM 35 / WGS84 | Fast large-area scanning and QA mosaic | Secondary inference source |
| Merged JNX | Country-scale georeferenced raster, Geographic / WGS84 | Easier coordinate lookup for some GPS workflows | Secondary/QA only |
| MapNav | Mercator raster package | Existing mobile-ready pyramid | Optional QA source |
| OsmAnd / Locus | Tiled zoom-pyramid package | Easy casual browsing, but resampled | Human review only |
| OruxMaps | Tiled zoom-pyramid package | Useful for spot-checking by area | Human review only |
| NaviComputer | `.nmap` archive | Offline browsing | Human review only |
| WWW / WMS | Web-served rendering | Useful for dedicated-area validation dashboards | QA and visualization |
## Data engineering and labeling design
The central engineering decision is to make the **per-sheet original TIFF+MAP pair** your canonical object space, and to preserve that provenance all the way through the pipeline. The BGtopoVJ changelog says later releases added **despeckling**, improved **contrast**, improved **georeferencing precision**, improved and harmonized **resolution**, reduced tile counts, and a harmonized **32-color palette**. Those improvements are useful operationally, but they are exactly why you should not silently mix “original,” “processed,” “merged ECW,” and “mobile tile” imagery in one training pool without a `source_format` and `source_version` field.
For georeferencing and coordinate alignment, use the `.map` sidecar through **GDAL's MAP driver** or by parsing the OziExplorer file directly, then expose the transform through Rasterio. Rasterio's transform utilities provide `xy()` and `rowcol()` conversions between pixel and projected/geographic coordinates, while its windowed-reading APIs and window transforms let you extract georeferenced crop windows without loading whole sheets into memory. That is the right primitive for both map-wide scanning and centered-crop extraction around the 60k coordinates.
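To make the pixel/coordinate conversion concrete, here is a minimal standard-library sketch of the arithmetic that `rowcol()` and `xy()` perform for a simple north-up raster. It is not Rasterio's implementation: the `GeoTransform` type, the function names, and the origin values are hypothetical, and it assumes no rotation terms. Real transforms would come from the `.map` sidecar via GDAL/Rasterio.

```python
import math
from typing import NamedTuple, Tuple


class GeoTransform(NamedTuple):
    """Hypothetical north-up affine transform without rotation terms."""
    origin_x: float  # x of the outer top-left corner of the raster
    origin_y: float  # y of the outer top-left corner of the raster
    pixel_w: float   # pixel width in CRS units (positive)
    pixel_h: float   # pixel height in CRS units (negative for north-up rasters)


def world_to_pixel(t: GeoTransform, x: float, y: float) -> Tuple[int, int]:
    """Return (row, col) of the pixel that contains the point (x, y)."""
    col = math.floor((x - t.origin_x) / t.pixel_w)
    row = math.floor((y - t.origin_y) / t.pixel_h)
    return row, col


def pixel_to_world(t: GeoTransform, row: int, col: int) -> Tuple[float, float]:
    """Return the (x, y) coordinate of the center of pixel (row, col)."""
    x = t.origin_x + (col + 0.5) * t.pixel_w
    y = t.origin_y + (row + 0.5) * t.pixel_h
    return x, y


# Illustrative values only: 5.5 m pixels and an arbitrary projected origin.
t = GeoTransform(origin_x=300000.0, origin_y=4700000.0, pixel_w=5.5, pixel_h=-5.5)
row, col = world_to_pixel(t, 300057.0, 4699940.0)
x, y = pixel_to_world(t, row, col)
```

The sketch only works once the input point is already in the raster's CRS, which is why the 60k coordinates need reprojecting before any pixel lookup.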
A robust ingestion pipeline should therefore do five things before any ML starts. First, crawl the original-raster directory and per-sheet pages into a **manifest** keyed by `sheet_id`, `scale`, `image_path`, `map_path`, `bounds`, `projection`, `edition_year`, and `source_format`. Second, reject or quarantine the `100k/` materials until the 1:50k pipeline is stable. Third, convert all 60k coordinates into a **single normalized CRS** and then into each candidate sheet's CRS. Fourth, assign each coordinate to one or more overlapping sheet footprints. Fifth, extract **multi-scale crop windows** around each candidate coordinate from the canonical source, not from screenshots or WMS renderings.
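The sheet-assignment step (the fourth item above) can be sketched with plain bounding boxes. This is an illustration under simplifying assumptions: footprints are treated as axis-aligned lon/lat rectangles, and the `SheetFootprint` class, sheet IDs, and bounds are all hypothetical. Real footprints would be derived from the parsed `.map` calibration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SheetFootprint:
    sheet_id: str
    scale: str  # "50k" or "100k"; kept separate to avoid symbol-scale leakage
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lon: float, lat: float) -> bool:
        return self.min_lon <= lon <= self.max_lon and self.min_lat <= lat <= self.max_lat


def assign_sheets(lon: float, lat: float, sheets: Iterable[SheetFootprint], scale: str = "50k") -> List[str]:
    """Return the IDs of every sheet of the requested scale that contains the point."""
    return sorted(s.sheet_id for s in sheets if s.scale == scale and s.contains(lon, lat))


# Hypothetical footprints for illustration only.
sheets = [
    SheetFootprint("sheet_a", "50k", 23.00, 42.50, 23.25, 42.6667),
    SheetFootprint("sheet_b", "50k", 23.25, 42.50, 23.50, 42.6667),
    SheetFootprint("sheet_c_100k", "100k", 23.00, 42.3333, 23.50, 42.6667),
]
inside = assign_sheets(23.19638729, 42.58837223, sheets)  # interior point
on_seam = assign_sheets(23.25, 42.60, sheets)             # point on a shared sheet edge
```

Returning a list rather than a single sheet keeps seam cases visible, so they can later be flagged as `sheet_edge` instead of being silently assigned to one neighbor.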
```mermaid
flowchart LR
A[Inventory BGtopoVJ files] --> B[Parse TIFF plus MAP georeferencing]
B --> C[Build sheet footprints and manifest]
C --> D[Align 60k coordinates to sheets]
D --> E[Extract centered multi-scale crops]
E --> F[Weak-label candidates from color, shape, vectors]
F --> G[Human review in CVAT or Label Studio]
G --> H[Train detector and optional segmenter]
H --> I[Train calibrated coordinate scorer]
I --> J[Dedicated-area validation]
J --> K[Expand dataset and rescan corpus]
```
For annotation, use a **hierarchical schema**, not a flat explosion of combined classes. Your task description naturally decomposes into geometry, fill style, and color shade; training will be much easier if you keep those separate at the data level even if the first-generation model collapses them into a smaller set. The site itself says BGtopoVJ uses standard Soviet military map symbols, and it points to the English-language **Soviet Topographic Map Symbols** manual as a reference. That is a strong reason to keep room for semantic expansion later, even if you begin with purely visual classes.
| Annotation field | Suggested values | Why it helps |
|---|---|---|
| `presence` | `positive`, `negative`, `uncertain` | Needed for the final 60k-coordinate decision task |
| `shape` | `square`, `rectangle`, `unknown_quad` | Preserves geometry without overfitting early |
| `fill_style` | `filled`, `hollow`, `border_emphasis`, `unknown_fill` | Handles your filled/empty/bordered requirement cleanly |
| `shade_bin` | `dark_blue`, `blue`, `light_blue`, `cyan_other` | Supports later color-robustness studies |
| `annotation_type` | `bbox`, `polygon`, `mask`, `weak_label` | Tracks label reliability |
| `source_format` | `orig_tif_map`, `proc_tif_map`, `ecw`, `img_vector` | Prevents silent domain mixing |
| `review_status` | `weak`, `verified`, `rejected`, `border_case` | Critical for auditing weak labels |
| `uncertain_reason` | `tile_edge`, `sheet_edge`, `low_contrast`, `blue_non_target`, `vector_mismatch`, `other` | Makes the error-analysis loop efficient |
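One way to keep these fields separate at the data level is a small validated record type. The sketch below is an assumption about how the schema could be encoded, not part of the PoC code; the class name `SymbolAnnotation` and the `ALLOWED` table are hypothetical, while the field names and values are taken from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values per field, copied from the annotation schema table.
ALLOWED = {
    "presence": {"positive", "negative", "uncertain"},
    "shape": {"square", "rectangle", "unknown_quad"},
    "fill_style": {"filled", "hollow", "border_emphasis", "unknown_fill"},
    "shade_bin": {"dark_blue", "blue", "light_blue", "cyan_other"},
    "annotation_type": {"bbox", "polygon", "mask", "weak_label"},
    "source_format": {"orig_tif_map", "proc_tif_map", "ecw", "img_vector"},
    "review_status": {"weak", "verified", "rejected", "border_case"},
}
UNCERTAIN_REASONS = {"tile_edge", "sheet_edge", "low_contrast", "blue_non_target", "vector_mismatch", "other"}


@dataclass
class SymbolAnnotation:
    presence: str
    shape: str
    fill_style: str
    shade_bin: str
    annotation_type: str
    source_format: str
    review_status: str = "weak"
    uncertain_reason: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the schema so typos never reach training data.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} is not one of {sorted(allowed)}")
        if self.uncertain_reason is not None and self.uncertain_reason not in UNCERTAIN_REASONS:
            raise ValueError(f"unknown uncertain_reason: {self.uncertain_reason!r}")


example = SymbolAnnotation("positive", "square", "hollow", "light_blue", "bbox", "orig_tif_map")
```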
A good review overlay should make class distinctions obvious to annotators and reviewers. An illustrative overlay convention is below.
```text
Example review overlay on a centered crop
┌────────────────────────────────────────────────────────┐
│ map background                                         │
│                                                        │
│              ┌──────┐                                  │
│              │██████│    cyan solid box = filled       │
│              └──────┘                                  │
│                                                        │
│              ┌──────┐                                  │
│   ○          │      │    blue outline = hollow         │
│ coordinate   └──────┘                                  │
│                                                        │
│              ┏━━━━━━━━┓                                │
│              ┃        ┃  navy thick outline = border   │
│              ┗━━━━━━━━┛                                │
│                                                        │
│              - - -       dashed yellow box = uncertain │
└────────────────────────────────────────────────────────┘
```
For label generation, do **not** start with manual boxing of all 60k points. Start with **weak labels**. The cheapest first pass is a color-and-shape detector in OpenCV: convert to HSV or Lab, threshold the blue/light-blue ranges, clean masks with morphology, recover contours, approximate them to quadrilaterals, and classify them by border occupancy and fill ratio. OpenCV's `inRange`, contour extraction, polygon approximation, and template matching are all directly relevant here. This weak-label stage will not be your final model, but it will dramatically cut annotation burden and produce the hard negatives you need.
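The thresholding idea can be illustrated without OpenCV. The sketch below uses the standard-library `colorsys` module to approximate OpenCV's 8-bit HSV scale (H in 0-179, S and V in 0-255) and checks pixels against the two HSV ranges from the PoC config in this commit. The helper names are hypothetical, and the rounding may differ from `cv2.cvtColor` by a unit, so treat it as a way to reason about thresholds rather than a replacement for `inRange`.

```python
import colorsys
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]

# (lower, upper) bounds in OpenCV-style HSV, copied from the PoC config.
HSV_RANGES = [
    ((82, 28, 45), (137, 255, 255)),   # blue_dark_to_light
    ((78, 10, 105), (115, 115, 255)),  # pale_cyan_low_sat
]


def rgb_to_opencv_hsv(r: int, g: int, b: int) -> Tuple[int, int, int]:
    """Approximate OpenCV 8-bit HSV: hue is halved degrees, S and V scale to 0-255."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return int(round(h * 180)) % 180, int(round(s * 255)), int(round(v * 255))


def is_blue(rgb: RGB) -> bool:
    """True if the pixel falls inside any configured HSV range."""
    hsv = rgb_to_opencv_hsv(*rgb)
    return any(all(lo[i] <= hsv[i] <= hi[i] for i in range(3)) for lo, hi in HSV_RANGES)


def blue_fill_ratio(patch: Sequence[Sequence[RGB]]) -> float:
    """Fraction of pixels in a candidate box that pass the blue threshold."""
    pixels = [px for row in patch for px in row]
    return sum(is_blue(px) for px in pixels) / len(pixels)


blue, white = (0, 0, 255), (255, 255, 255)
ratio = blue_fill_ratio([[blue, white], [white, blue]])
```

A fill ratio like this is what separates filled from hollow candidates: a hollow box has blue only along its border, so its ratio sits well below that of a solid one.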
The GitHub/vector branch can reduce annotation cost even further. Because the university workflow references Mapwel, GMapTool, and Garmin IMG outputs, and because GitHub tooling already exists for **Polish `.mp` import/export** and **MP-to-GeoJSON conversion**, you should explicitly test a side pipeline of **IMG → MP/GeoJSON → candidate features**. If a subset of the target blue symbols survives that conversion as stable vector features or type/style codes, those outputs can become high-quality pseudo-labels or at least a powerful review prior. If they do not, the test still pays for itself by proving that the raster detector must carry the load.
For the final coordinate labels, I recommend a strict three-state policy. A coordinate is **positive** if a reviewed target symbol is present in the search neighborhood and the coordinate aligns to that symbol under the project's chosen tolerance. It is **negative** only if the local crop quality is acceptable and no target symbol is present after review or after a sufficiently reliable detector pass. It is **uncertain** if the coordinate is near a sheet edge, a tile seam, a clipped symbol, a color-ambiguous mark, a candidate produced only by one source domain, or a case where the cartographic placement is visually ambiguous. In practice, uncertain labels should be excluded from first-pass supervised loss and reserved for active-learning review and calibration.
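The three-state policy is easy to encode as a pure function. In this sketch the function name and argument names are hypothetical, and one design choice is an assumption on my part: any ambiguity flag takes precedence over a detected symbol, which errs on the side of sending cases to review.

```python
from typing import Collection


def coordinate_label(
    symbol_within_tolerance: bool,
    crop_quality_ok: bool,
    ambiguity_flags: Collection[str] = (),
) -> str:
    """Return 'positive', 'negative', or 'uncertain' for one coordinate."""
    # Assumption: any ambiguity (sheet edge, tile seam, clipped symbol, ...) wins.
    if ambiguity_flags:
        return "uncertain"
    if symbol_within_tolerance:
        return "positive"
    # A negative is only trusted when the crop itself is of acceptable quality.
    if crop_quality_ok:
        return "negative"
    return "uncertain"


labels = [
    coordinate_label(True, True),
    coordinate_label(False, True),
    coordinate_label(False, False),
    coordinate_label(True, True, ambiguity_flags={"sheet_edge"}),
]
```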
CVAT is the cleanest annotation front end for this project because its COCO export format natively supports **bounding boxes, polygons, masks, and tracks**, and it emits a predictable zip layout that is easy to feed into PyTorch and Detectron2 pipelines. Label Studio is also viable, especially if your team prefers its export APIs and raw JSON model, but CVAT aligns more naturally with a box-first, mask-optional computer-vision workflow.
## Model options and recommended stack
Because the target objects are **tiny**, **blue**, and **shape-constrained**, different model families solve different parts of the task well. Classical CV and template matching are excellent for bootstrapping. One-stage detectors are best for scaling to map-wide scans. Two-stage detectors are strong when accuracy on small objects matters more than speed. Segmentation becomes worthwhile only when distinguishing filled versus hollow versus border-emphasis proves difficult from boxes alone. The final coordinate question is best answered by a **hybrid coordinate scorer** that consumes detector outputs rather than by a detector alone.
The comparison below combines published model characteristics with practical recommendations for this specific cartographic-symbol task. Published YOLOv8 sizes/speeds come from the Ultralytics model card, and Faster/Mask/Retina benchmark speed-memory figures come from the official MMDetection benchmark documentation. Practical budgets for 1024px map tiles are my estimates, because your tile sizes and augmentations will differ from those benchmark settings.
| Option | Best role | Main advantages | Main limitations | Likely data need | Practical starting point |
|---|---|---|---|---|---|
| Classical CV with color + contours | Weak-label miner, baseline | Explainable, fast, CPU-only, no training | Brittle to palette drift, blue hydrography, noise | 50-100 reviewed examples to tune thresholds | HSV/Lab thresholding, morphology, contour quad filters |
| Template matching | High-precision candidate miner | Very good when symbol shapes are stable | Sensitive to scale, scan noise, and style variation | 20-100 templates/class | Masked normalized cross-correlation across a small scale bank |
| YOLOv8n / YOLOv8s | Main production detector | Easy training, strong ecosystem, fast inference | Tiny hollow symbols may need large images and careful tiling | ~1k-3k positive instances | `imgsz` of 1024-1280, 100-150 epochs, low color jitter |
| Faster R-CNN | Accuracy-first detector | Strong small-object baseline, robust with FPN | Slower inference and heavier training | ~2k-5k positives | ResNet-50-FPN, smaller anchors than COCO defaults |
| RetinaNet | Imbalance-aware detector | Focal loss helps with hard negatives | Can be slower than YOLO, still needs tuning | ~2k-5k positives | ResNet-50-FPN, small anchors, hard-negative mining |
| Mask R-CNN | Box + mask + fill-style modeling | Distinguishes interior fill and border shape better | More compute, more label structure | ~1k+ verified masks or auto-masks | Use if fill-style errors dominate after box detection |
| U-Net on centered crops | Coordinate-centered segmentation | Very strong on local pixel masks with little data | Not a full-scene detector by itself | ~1k-3k crop masks | 256-512 px crops, Dice + BCE or Dice + focal |
| Hybrid detector + coordinate scorer | **Recommended final system** | Optimizes the actual end task and supports QA expansion | More moving parts | Moderate | Detector outputs + local crop features + calibration |
A practical compute view is below. Again, the published figures are not your exact training cost, but they are useful anchors for budget planning.
| Model family | Published reference point | Practical budget for this task |
|---|---|---|
| YOLOv8n | 3.2M params; 0.99 ms A100 TensorRT at 640px | Good pilot model on a 12-16 GB GPU with 1024px tiles |
| YOLOv8s | 11.2M params; 1.20 ms A100 TensorRT at 640px | Best first production detector on a 16-24 GB GPU |
| YOLOv8m | 25.9M params; 1.83 ms A100 TensorRT at 640px | Use only if YOLOv8s misses too many small symbols |
| Faster R-CNN | Benchmark training memory about 3.0 GB in published setting | Budget 16 GB+ for larger map tiles and stable training |
| Mask R-CNN | Benchmark training memory about 3.4 GB in published setting | Budget 16-24 GB if masks are enabled on 1024px tiles |
| RetinaNet | Benchmark training memory about 3.9 GB in published setting | Budget 16 GB+ if using hard-negative-heavy training |
For **hyperparameters**, I would start conservatively and preserve the map's discriminative color information. For YOLOv8, use **1024 or 1280 px tiles**, **100-150 epochs**, light augmentations, and minimal hue shift because the blue-vs-non-blue distinction is central. Keep flips if symbol semantics allow them, but avoid strong perspective transforms and aggressive mosaics because they can distort cartographic context. For Faster R-CNN/RetinaNet/Mask R-CNN, use a **ResNet-50-FPN** backbone, shrink anchor sizes to cover very small quadrilateral symbols, and start from COCO-pretrained weights via TorchVision or Detectron2. The official TorchVision tutorial is a good reference for custom detection/instance-segmentation datasets, and Detectron2 remains a strong option if you want richer experiment control.
My recommended build order is straightforward. First, ship a **rules-only candidate miner** and use it to seed annotation. Second, train a **YOLOv8s detector** on boxes only. Third, if filled vs hollow vs border-emphasis remains an operational error source, add either **Mask R-CNN** or a **small U-Net crop segmenter**. Fourth, train the final **coordinate scorer** on features such as nearest-detection class, confidence, distance, local blue-mask statistics, crop classifier outputs, and source-format flags. That last stage is the part that answers the user's actual question and should be treated as the system of record.
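As a rough sketch of the fourth step, the function below turns a list of detections into a few per-coordinate features (nearest-detection class, confidence, distance, and a neighbor count). The `Detection` type, the feature names, and the function are hypothetical; the 45 px radius mirrors `search_radius_px` in the PoC config. A real scorer would add blue-mask statistics and source-format flags.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Detection:
    row: float         # detection center in pixel coordinates
    col: float
    confidence: float  # detector score in [0, 1]
    fill_style: str    # e.g. "filled", "hollow", "border_emphasis"


def coordinate_features(
    row: float, col: float, detections: List[Detection], radius_px: float = 45.0
) -> Dict[str, Union[int, float, str, None]]:
    """Summarize detections near one coordinate as scorer input features."""
    scored = [(math.hypot(d.row - row, d.col - col), d) for d in detections]
    nearby = [(dist, d) for dist, d in scored if dist <= radius_px]
    if not nearby:
        return {"n_nearby": 0, "nearest_dist_px": None, "nearest_conf": 0.0, "nearest_fill_style": "none"}
    dist, best = min(nearby, key=lambda item: item[0])
    return {
        "n_nearby": len(nearby),
        "nearest_dist_px": dist,
        "nearest_conf": best.confidence,
        "nearest_fill_style": best.fill_style,
    }


detections = [
    Detection(100.0, 100.0, 0.8, "filled"),
    Detection(120.0, 130.0, 0.6, "hollow"),
    Detection(500.0, 500.0, 0.9, "filled"),
]
features = coordinate_features(103.0, 104.0, detections)
empty = coordinate_features(900.0, 900.0, detections)
```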
If the vector-assisted branch works, it can reduce annotation effort substantially. If it fails, the project still remains viable because the raster branch is already complete enough to support weak labels, human review, and robust detector training.
## Evaluation and uncertainty
You need to evaluate **two tasks**, not one. The first task is **global symbol detection** on tiles or sheets. The second task is the real business target: **coordinate-level presence prediction** for the 60k points. For detection, the right core metrics are **precision, recall, F1**, **IoU**, **AP50**, **AP75**, and **mAP@[0.5:0.95]**, ideally reported both overall and by fill-style subclass if you keep them separate. For coordinate scoring, the most important metrics are **precision**, **recall**, **F1**, and **PR-AUC**; if you need trustworthy probabilities, also report **Brier score**, **expected calibration error**, and reliability diagrams after calibration.
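For reference, the simplest of these metrics fit in a few lines. The sketch below gives standard definitions of precision/recall/F1 from counts, the Brier score, and box IoU; the function names and the `(x_min, y_min, x_max, y_max)` box convention are my own choices for illustration. AP, mAP, and PR-AUC are better taken from an established evaluation library.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
```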
Use **spatial cross-validation**, not random crop splits. BGtopoVJ has hundreds of sheets at a common source scale, and its later releases harmonized palette and resolution. If you randomly split crops, you will leak near-duplicate local styles, neighboring map content, and source-specific palette statistics across train and validation sets. The correct split primitive is the **sheet ID** or a larger contiguous geographic block. A good default is **GroupKFold by sheet**, plus one **frozen dedicated validation area** that is never touched during model development. The “dedicated area” should be a contiguous region rather than a random point sample, otherwise deployment risk will be underestimated.
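The core property of a sheet-grouped split is that every crop from one sheet lands in the same fold. The sketch below shows that property with a deterministic hash of the sheet ID. It is a stand-in for illustration, not scikit-learn's `GroupKFold` (which also balances fold sizes); the function names and sheet IDs are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Set


def sheet_fold(sheet_id: str, n_folds: int = 5) -> int:
    """Deterministically map a sheet ID to a fold index in [0, n_folds)."""
    digest = hashlib.md5(sheet_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def assign_folds(crop_sheet_ids: Iterable[str], frozen_sheets: Set[str], n_folds: int = 5) -> Dict[str, int]:
    """Map each sheet to a fold; sheets in the frozen validation area get -1."""
    return {
        sid: (-1 if sid in frozen_sheets else sheet_fold(sid, n_folds))
        for sid in set(crop_sheet_ids)
    }


# Hypothetical sheet IDs, one entry per crop; repeated IDs are crops from the same sheet.
crops = ["sheet_a", "sheet_a", "sheet_b", "sheet_c", "sheet_c", "sheet_d"]
folds = assign_folds(crops, frozen_sheets={"sheet_d"})
crop_folds = [folds[sid] for sid in crops]
```

Because the fold depends only on the sheet ID, adding new crops from an existing sheet never moves that sheet between train and validation.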
Class imbalance should be treated as a first-class design issue because there are many blue non-target structures on topographic maps. Your hard negatives will include **rivers, lakes, canals, blue labels, blue linework, and blue outlines that are not the target rectangles/squares**. That is exactly the sort of dense foreground/background skew that RetinaNets **focal loss** was designed to address. Even if you choose YOLOv8 as the main detector, the lesson from RetinaNet still applies: populate the training set with **hard negatives**, not just random empty crops. citeturn43search6turn43search11
For uncertainty, use **calibration**, not just raw detector scores. The calibration literature remains clear that modern neural nets can be overconfident, and that **temperature scaling** is often a simple, strong post-processing fix. In practice, keep a clean validation split, fit a calibrator there, and then expose three operational zones: an **auto-positive** zone, an **auto-negative** zone, and a **human-review** zone in between. If your initial prior belief is that positives among candidate coordinates are around **80-90%**, use that belief to prioritize active-learning review and to set starting thresholds, but let the actual held-out dedicated area determine the final deployed prevalence estimate.
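Temperature scaling is small enough to sketch in full. The version below assumes a binary score expressed as a logit (a detector confidence `c` would first be converted with `log(c / (1 - c))`) and fits the single parameter `T` by grid search on the validation split. The function names and the grid range are assumptions; a real implementation would more likely minimize the same loss with an optimizer.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of 0/1 labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on the grid that minimizes validation NLL."""
    if grid is None:
        grid = [0.25 + 0.05 * i for i in range(116)]  # 0.25 .. 6.00
    return min(grid, key=lambda t: nll(logits, labels, t))


# Toy validation set: the model says about 98% (logit +/-4) but is right 75% of the time.
val_logits = [4, 4, 4, 4, -4, -4, -4, -4]
val_labels = [1, 1, 1, 0, 0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
print(f"T = {T:.2f}, calibrated p = {sigmoid(4 / T):.3f}")
```

A fitted `T` above 1 softens overconfident scores. Fit it only on the clean validation split, never on training data or on the frozen dedicated area.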
A simple operating policy is:
| Score band | Action |
|---|---|
| `p >= 0.90` | auto-positive |
| `0.40 <= p < 0.90` | review unless detector evidence is exceptionally strong |
| `0.10 < p < 0.40` | review if region is mission-critical; else uncertain |
| `p <= 0.10` | auto-negative |
Those thresholds are not sacred; they are an operational starting point. Tune them to the precision target you actually care about.
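The score bands translate directly into a small routing function. This is a sketch only: the name `triage` and the `mission_critical` flag are hypothetical, and the "exceptionally strong detector evidence" exception from the table is deliberately left out because the policy does not define it.

```python
def triage(p, mission_critical=False):
    """Map a calibrated probability to an action band from the policy table."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    if p >= 0.90:
        return "auto-positive"
    if p >= 0.40:
        return "review"
    if p > 0.10:
        # Lower-middle band: reviewed only where the region matters most.
        return "review" if mission_critical else "uncertain"
    return "auto-negative"


for score in (0.95, 0.60, 0.25, 0.05):
    print(score, triage(score), triage(score, mission_critical=True))
```

Keeping the thresholds in one function (or one config block) makes it cheap to retune them against the precision target after each audit round.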
The dedicated-area validation workflow should be explicit and repeatable. Pick one region that is **symbol-dense** and one region that is **symbol-sparse**. Freeze both. Run full tiling and full 60k-coordinate scoring there. Manually audit **all auto-positives**, **all uncertains**, and a **random sample of auto-negatives**. Only after that audit should you expand to neighboring areas and refresh the weak-label rules. That loop is where most real-world performance gains will come from.
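The audit rule can be made repeatable with a seeded sampler. The sketch below assumes each scored point carries one of the band names from the operating policy, and it treats both the `review` and `uncertain` bands as "uncertains". The function name, the sample size of 200, and the seed are placeholders.

```python
import random


def build_audit_queue(scored, negative_sample=200, seed=0):
    """Return point IDs to audit manually in the dedicated area.

    scored is a list of (point_id, band) pairs. All auto-positives and all
    uncertain/review points are kept; auto-negatives are randomly sampled.
    """
    always = [pid for pid, band in scored if band != "auto-negative"]
    negatives = [pid for pid, band in scored if band == "auto-negative"]
    rng = random.Random(seed)  # fixed seed so the audit set is reproducible
    sampled = rng.sample(negatives, min(negative_sample, len(negatives)))
    return always + sampled
```

Store the seed and the resulting ID list alongside the audit workbook so that a later rerun audits exactly the same negatives.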
## Deployment, deliverables, and timeline
For deployment, there are really two modes. The first is **batch scanning** over whole sheets or a merged raster. The second is **coordinate scoring** for the 60k points. Batch scanning should use georeferenced **windowed reads** aligned to raster block structure, with tile overlap so edge symbols are not missed; Rasterio's window APIs and rio-tiler's point/part/tile readers are well-suited to that. Coordinate scoring is lighter: once the sheet assignment is built, you can extract centered windows for the 60k coordinates and score them directly.
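The overlap logic is easy to get wrong at the right and bottom edges, so a small sketch helps. The generator below is pure Python and only computes pixel offsets; the name `tile_windows` is hypothetical, the 1024/128 defaults mirror the tile size and overlap used elsewhere in this repository, and block alignment is ignored for simplicity. Each tuple is in the argument order of `rasterio.windows.Window(col_off, row_off, width, height)`.

```python
def tile_windows(width, height, tile=1024, overlap=128):
    """Yield (col_off, row_off, win_width, win_height) covering the raster.

    Neighboring tiles share `overlap` pixels, and the last tile in each
    direction is shifted back so it ends flush with the raster edge.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]
        offsets = list(range(0, size - tile + 1, stride))
        if offsets[-1] != size - tile:
            offsets.append(size - tile)  # final tile flush with the edge
        return offsets

    for row_off in starts(height):
        for col_off in starts(width):
            yield col_off, row_off, min(tile, width), min(tile, height)


for window in tile_windows(2000, 1024):
    print(window)
```

Detections from overlapping tiles then need to be merged in sheet pixel coordinates (for example with non-maximum suppression) before they are converted to map coordinates.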
For scalability, keep runtime format complexity low. Although BGtopoVJ provides a merged ECW and GDAL supports ECW through the Hexagon SDK, the official GDAL ECW documentation is clear that support depends on the **ECW SDK** and that licensing differs between desktop decode, compression, server deployment, and mobile contexts. That makes ECW fine for ingestion or QA, but a poor long-term training/runtime dependency unless you truly need it. The cleanest path is to ingest once and then standardize your internal processing on **TIFF / VRT / GeoTIFF / COG-like** workflows with Rasterio/GDAL.
The recommended software stack is therefore:
| Layer | Recommended tools |
|---|---|
| Ingestion and reprojection | GDAL, Rasterio, pyproj |
| Large-raster windows and tile serving | Rasterio windows, rio-tiler |
| Classical CV and weak labeling | OpenCV |
| Detection and segmentation | PyTorch, TorchVision, Ultralytics, Detectron2 |
| Annotation | CVAT, optionally Label Studio |
| GIS QA and map review | QGIS, MapWarper-style overlay workflows |
| Vector-assisted parsing | Polish-format/QGIS tooling, MP2GeoJSON, GMapTool-style utilities |
The concrete deliverables should be treated as first-class outputs, not side effects.
| Deliverable | Suggested format |
|---|---|
| Source manifest | `manifest.parquet` or `manifest.csv` with sheet bounds, CRS, source version, paths |
| Reviewed symbol dataset | COCO zip for boxes and masks; optional YOLO text export |
| Coordinate label table | `coordinates_labels.parquet` with `x`, `y`, `crs`, `sheet_id`, `presence`, `class_attrs`, `review_status` |
| Train/val/test splits | `splits.yaml` or `folds.json` grouped by sheet/block |
| Weak-label artifacts | `weak_candidates.geojson` or GeoParquet |
| Trained detector weights | `model.pt` or `model.pth` |
| Calibrator and coordinate scorer | serialized sklearn or PyTorch artifact |
| Inference outputs | GeoParquet/GeoJSON/CSV with box geometry, class, confidence, calibrated score |
| Reproducible environment | `Dockerfile`, `environment.yml`, `requirements.txt`, `Makefile` |
| Validation package | overlay PNGs, HTML report, error slices, dedicated-area audit workbook |
A good repository skeleton would look like this:
```text
data/
  raw/
  interim/
  tiles/
  annotations/
models/
src/
  01_inventory_bgtopovj.py
  02_parse_mapfiles.py
  03_assign_coordinates.py
  04_extract_tiles.py
  05_seed_candidates_cv.py
  06_vector_assisted_labels.py
  07_export_coco.py
  08_train_detector.py
  09_train_segmenter.py
  10_train_coordinate_scorer.py
  11_calibrate_eval.py
  12_infer_batch.py
  13_build_validation_report.py
configs/
docker/
reports/
```
The timeline below assumes one ML engineer and part-time annotation support.
| Stage | Practical duration | Main outputs |
|---|---|---|
| Corpus inventory and georeferencing | 3-5 days | Manifest, sheet footprints, coordinate-to-sheet assignment |
| Weak-label miner and pilot review | 1-2 weeks | First reviewed boxes, hard-negative catalog |
| First detector baseline | 1 week | YOLOv8s or RetinaNet baseline, initial metrics |
| Optional segmentation branch | 1 week | Mask R-CNN or U-Net if fill-style errors remain |
| Coordinate scorer and calibration | 3-5 days | Calibrated probabilities and uncertainty policy |
| Dedicated-area validation | 3-5 days | Acceptance report, error taxonomy |
| Packaging and reproducibility | 2-4 days | Docker/conda environment, scripts, docs |
Resource planning should be modest but not tiny. The BGtopoVJ archive sizes imply that raw inputs alone are already multi-gigabyte, and derived tiles, masks, overlays, and experiment outputs can easily expand that by an order of magnitude. A practical workstation target is **16+ CPU cores**, **64-128 GB RAM**, **1 GPU with 16-24 GB VRAM**, and **2-4 TB SSD**. If you stay with original per-sheet TIFF+MAP instead of relying on ECW during training, operational complexity drops sharply.
The strongest step-by-step implementation plan is this:
1. **Inventory canonical sources** from the original per-sheet TIFF+MAP corpus, and freeze a manifest.
2. **Normalize CRS handling** for the 60k coordinates and assign them to one or more sheets.
3. **Build a weak-label miner** from blue-thresholding, contour filters, and optional vector-assisted pseudo-labels.
4. **Review a pilot set** in CVAT and export COCO.
5. **Train a YOLOv8s detector** on 1024px tiles with source-aware metadata.
6. **Add a mask branch only if needed** for fill-style ambiguity.
7. **Train a coordinate scorer** that converts local detections into calibrated probabilities.
8. **Freeze a dedicated validation area**, audit it thoroughly, and only then expand to the broader corpus.
9. **Package everything reproducibly** in Docker/conda, with manifest + splits + model cards + validation report.
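Step 2 of the plan above, assigning each coordinate to one or more sheets, reduces to a bounding-box lookup once the points and the sheet footprints share a CRS. The sketch below assumes a hypothetical `sheet_bounds` mapping built from the manifest; the second sheet ID and all bounds values in the usage lines are invented for illustration.

```python
def sheets_for_point(x, y, sheet_bounds):
    """Return the IDs of every sheet whose bounding box contains (x, y).

    sheet_bounds maps sheet_id -> (min_x, min_y, max_x, max_y), in the same
    CRS as the point. Points on a shared edge match both neighbors, which is
    why a coordinate can belong to more than one sheet.
    """
    return sorted(
        sheet_id
        for sheet_id, (min_x, min_y, max_x, max_y) in sheet_bounds.items()
        if min_x <= x <= max_x and min_y <= y <= max_y
    )


# Illustrative bounds only; real values come from the frozen manifest.
bounds = {
    "K-34-009-2": (0.0, 0.0, 10.0, 10.0),
    "hypothetical-neighbor": (10.0, 0.0, 20.0, 10.0),
}
print(sheets_for_point(5.0, 5.0, bounds))
print(sheets_for_point(10.0, 5.0, bounds))
```

For 60k points against a few hundred sheets this linear scan is fast enough; a spatial index only becomes worthwhile at much larger scale.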
If I had to choose one default architecture before seeing a single label, I would choose this: **original TIFF+MAP as truth, OpenCV weak-label miner, YOLOv8s as primary detector, calibrated coordinate scorer on top, CVAT for review, and a vector-assisted fallback branch tested early but trusted only after version-mismatch checks**. That design fits the source evidence best, keeps annotation cost under control, and preserves a clean path from pilot work to map-wide deployment.

18
docker-compose.gpu.yml Normal file

@@ -0,0 +1,18 @@
services:
  bgtopo-bluebox:
    build:
      context: .
      dockerfile: docker/Dockerfile.gpu
    image: bgtopo-bluebox-poc:gpu
    working_dir: /app
    volumes:
      - .:/app
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    stdin_open: true
    tty: true

19
docker/Dockerfile.gpu Normal file

@@ -0,0 +1,19 @@
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip python3-venv git wget curl ca-certificates \
        gdal-bin libgdal-dev python3-gdal libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt /app/requirements.txt
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install --no-cache-dir -r requirements.txt
COPY . /app
ENV PYTHONPATH=/app
CMD ["bash"]

12
pyproject.toml Normal file

@@ -0,0 +1,12 @@
[project]
name = "bgtopo-bluebox-poc"
version = "0.1.0"
description = "PoC pipeline for detecting blue square/rectangle symbols in BGtopoVJ rasters and scoring coordinate points."
requires-python = ">=3.10"
dependencies = []
[tool.setuptools]
packages = ["bgtopo_poc"]
[project.scripts]
bgtopo-bluebox = "bgtopo_poc.cli:main"

22
requirements.txt Normal file

@@ -0,0 +1,22 @@
# Core GIS / raster stack
rasterio>=1.3.9
pyproj>=3.6.1
shapely>=2.0.2
geopandas>=0.14.4
fiona>=1.9.6
# CV / ML
opencv-python>=4.9.0
numpy>=1.26.4
pandas>=2.2.1
Pillow>=10.2.0
scikit-learn>=1.4.1
ultralytics>=8.2.0
PyYAML>=6.0.1
tqdm>=4.66.2
requests>=2.31.0
beautifulsoup4>=4.12.3
# Reports / QA
matplotlib>=3.8.3
jinja2>=3.1.3

49
scripts/run_pilot.sh Normal file

@@ -0,0 +1,49 @@
#!/usr/bin/env bash
set -euo pipefail
# Run from repository root.
# First run downloads two sheets only. Increase --limit after confirming overlays look sane.
python -m bgtopo_poc.cli inventory \
  --config configs/blue_detector.yaml \
  --out data/manifest.csv \
  --limit 2
python -m bgtopo_poc.cli download \
  --manifest data/manifest.csv \
  --out-dir data/raw \
  --out-manifest data/manifest_downloaded.csv \
  --limit 2
python - <<'PY'
import pandas as pd
from pathlib import Path
m = pd.read_csv('data/manifest_downloaded.csv')
for _, r in m.iterrows():
    sid = r['sheet_id']
    Path('data/interim/candidates').mkdir(parents=True, exist_ok=True)
    print(f"Scanning {sid}")
PY
while IFS=, read -r sheet_id map_url tif_url map_path tif_path; do
  if [[ "$sheet_id" == "sheet_id" ]]; then continue; fi
  python -m bgtopo_poc.cli detect \
    --config configs/blue_detector.yaml \
    --sheet-id "$sheet_id" \
    --map "$map_path" \
    --tif "$tif_path" \
    --out-dir data/interim/candidates
  python -m bgtopo_poc.cli overlay \
    --tif "$tif_path" \
    --candidates "data/interim/candidates/${sheet_id}_candidates.csv" \
    --out "reports/overlays/${sheet_id}_overlay.png"
done < data/manifest_downloaded.csv
python -m bgtopo_poc.cli report \
  --candidates data/interim/candidates/*_candidates.csv \
  --overlays reports/overlays/*_overlay.png \
  --out reports/poc_report.html
echo "Open reports/poc_report.html"


@@ -0,0 +1,26 @@
#!/usr/bin/env bash
set -euo pipefail
# Pick one downloaded sheet/candidate file first.
SHEET_ID="${1:-K-34-009-2}"
TIF="data/raw/${SHEET_ID}/${SHEET_ID}.tif"
CAND="data/interim/candidates/${SHEET_ID}_candidates.csv"
OUT="data/yolo/${SHEET_ID}"
python -m bgtopo_poc.cli export-yolo \
  --config configs/blue_detector.yaml \
  --sheet-id "$SHEET_ID" \
  --tif "$TIF" \
  --candidates "$CAND" \
  --out-dir "$OUT" \
  --tile-size 1024 \
  --overlap 128
# RTX 3080 FE 10 GB: start batch 2-4 at imgsz=1024. Raise only if VRAM allows it.
python -m bgtopo_poc.cli train-yolo \
  --data-yaml "$OUT/data.yaml" \
  --model yolov8s.pt \
  --imgsz 1024 \
  --epochs 80 \
  --batch 4 \
  --device 0