{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e2b679d6",
   "metadata": {},
   "source": [
    "# Demo Preset Generation\n",
    "\n",
    "This notebook creates the preset inputs used by the web app demo. The goal is not just to create convenient examples, but to make every example defensible: the tabular values must be traceable, the image examples must come from real held-out BreaKHis predictions, and the fusion cases must be clearly documented as synthetic rather than clinical.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7da98f38",
   "metadata": {},
   "source": [
    "## Research Objective\n",
    "\n",
    "The web demo needs presets so a marker or viewer can run the models without manually typing thirty tabular measurements or searching the image dataset. Presets are useful only if they are transparent. This notebook therefore records where each value came from, why each image was selected, and how each synthetic fusion result is constructed.\n",
    "\n",
    "Success criteria:\n",
    "\n",
    "- produce one app-facing JSON manifest at `outputs/reports/demo_presets.json`;\n",
    "- produce one audit table at `outputs/reports/demo_preset_evidence.csv`;\n",
    "- cover three tabular examples: benign, borderline, and malignant;\n",
    "- cover all binary image variants used by the app model: benign/malignant across 40X, 100X, 200X, and 400X;\n",
    "- build six synthetic fusion cases that demonstrate concordant, discordant, and borderline behaviour without implying real patient-level multimodal matching.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb263f01",
   "metadata": {},
   "source": [
    "## Methodological Boundary\n",
    "\n",
    "The tabular Wisconsin dataset and the BreaKHis image dataset are independent. There is no real patient who has both records in this project. For that reason, fusion presets are deliberately described as **synthetic demonstration cases**. They are useful for showing model behaviour and interface behaviour, but they are not clinical multimodal evidence.\n",
    "\n",
    "This distinction matters for the dissertation: the app can demonstrate an exploratory fusion mechanism, but the written claim must remain that the fusion branch is a constrained methodological experiment under data scarcity.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ab345d2",
   "metadata": {},
   "source": [
    "## Setup And Source Files\n",
    "\n",
    "This first code cell keeps all filesystem paths and imports in one place. The `find_project_root()` helper makes the notebook runnable both from the repository root and from inside `dissertation_project/notebooks`. The notebook imports the existing `infer_wisconsin()` helper so that the preset probabilities are computed using the same inference code as the web API.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcb16abd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import json\n",
    "import sys\n",
    "from datetime import datetime, timezone\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def find_project_root() -> Path:\n",
    "    cwd = Path.cwd().resolve()\n",
    "    candidates = [cwd, *cwd.parents]\n",
    "    for candidate in candidates:\n",
    "        if (candidate / \"notebook_Wisconsin\" / \"brca.csv\").is_file():\n",
    "            return candidate\n",
    "        nested = candidate / \"dissertation_project\"\n",
    "        if (nested / \"notebook_Wisconsin\" / \"brca.csv\").is_file():\n",
    "            return nested\n",
    "    raise FileNotFoundError(\"Could not locate dissertation_project from the current working directory.\")\n",
    "\n",
    "\n",
    "PROJECT_ROOT = find_project_root()\n",
    "REPO_ROOT = PROJECT_ROOT.parent\n",
    "sys.path.insert(0, str(PROJECT_ROOT))\n",
    "\n",
    "from src.inference import infer_wisconsin  # noqa: E402\n",
    "\n",
    "WISCONSIN_DATA_PATH = PROJECT_ROOT / \"notebook_Wisconsin\" / \"brca.csv\"\n",
    "WISCONSIN_MODEL_PATH = PROJECT_ROOT / \"notebook_Wisconsin\" / \"model.pt\"\n",
    "WISCONSIN_SCALER_PATH = PROJECT_ROOT / \"notebook_Wisconsin\" / \"scaler.joblib\"\n",
    "IMAGE_PREDICTIONS_PATH = PROJECT_ROOT / \"outputs\" / \"reports\" / \"breakhis_full_test_predictions.csv\"\n",
    "OUTPUT_JSON = PROJECT_ROOT / \"outputs\" / \"reports\" / \"demo_presets.json\"\n",
    "OUTPUT_CSV = PROJECT_ROOT / \"outputs\" / \"reports\" / \"demo_preset_evidence.csv\"\n",
    "SOURCE_URL = \"https://breascope-ai.vercel.app/\"\n",
    "MAGNIFICATIONS = [\"40X\", \"100X\", \"200X\", \"400X\"]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18a6632e",
   "metadata": {},
   "source": [
    "## Tabular Preset Source\n",
    "\n",
    "The three tabular profiles below are the existing quick presets from the deployed BreaScope AI tabular interface:\n",
    "\n",
    "- `Typical Benign Tumour`\n",
    "- `Borderline / Suspicious Case`\n",
    "- `Typical Malignant Tumour`\n",
    "\n",
    "They are copied into this notebook rather than being left inside the frontend because the app should consume generated research artifacts, not hide research data inside UI code. Each profile keeps all 30 Wisconsin features because the tabular model expects the complete feature vector.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c57eedf9",
   "metadata": {},
   "outputs": [],
   "source": [
    "BREASCOPE_TABULAR_PRESETS = {\n",
    "    \"tabular-typical-benign\": {\n",
    "        \"source_label\": \"Typical Benign Tumour\",\n",
    "        \"label_hint\": \"benign\",\n",
    "        \"description\": \"Typical benign Wisconsin profile from the original BreaScope AI preset set.\",\n",
    "        \"values\": {\n",
    "            \"x.radius_mean\": 12.1,\n",
    "            \"x.texture_mean\": 17.9,\n",
    "            \"x.perimeter_mean\": 78.1,\n",
    "            \"x.area_mean\": 462.8,\n",
    "            \"x.smoothness_mean\": 0.092,\n",
    "            \"x.compactness_mean\": 0.08,\n",
    "            \"x.concavity_mean\": 0.046,\n",
    "            \"x.concave_pts_mean\": 0.025,\n",
    "            \"x.symmetry_mean\": 0.181,\n",
    "            \"x.fractal_dim_mean\": 0.062,\n",
    "            \"x.radius_se\": 0.28,\n",
    "            \"x.texture_se\": 1.21,\n",
    "            \"x.perimeter_se\": 2.05,\n",
    "            \"x.area_se\": 20.4,\n",
    "            \"x.smoothness_se\": 0.006,\n",
    "            \"x.compactness_se\": 0.02,\n",
    "            \"x.concavity_se\": 0.025,\n",
    "            \"x.concave_pts_se\": 0.008,\n",
    "            \"x.symmetry_se\": 0.018,\n",
    "            \"x.fractal_dim_se\": 0.0034,\n",
    "            \"x.radius_worst\": 13.4,\n",
    "            \"x.texture_worst\": 25.0,\n",
    "            \"x.perimeter_worst\": 87.0,\n",
    "            \"x.area_worst\": 535.5,\n",
    "            \"x.smoothness_worst\": 0.12,\n",
    "            \"x.compactness_worst\": 0.185,\n",
    "            \"x.concavity_worst\": 0.19,\n",
    "            \"x.concave_pts_worst\": 0.075,\n",
    "            \"x.symmetry_worst\": 0.275,\n",
    "            \"x.fractal_dim_worst\": 0.08,\n",
    "        },\n",
    "    },\n",
    "    \"tabular-borderline-suspicious\": {\n",
    "        \"source_label\": \"Borderline / Suspicious Case\",\n",
    "        \"label_hint\": \"borderline\",\n",
    "        \"description\": \"Borderline Wisconsin profile from the original BreaScope AI preset set.\",\n",
    "        \"values\": {\n",
    "            \"x.radius_mean\": 14.5,\n",
    "            \"x.texture_mean\": 19.5,\n",
    "            \"x.perimeter_mean\": 96.0,\n",
    "            \"x.area_mean\": 680.0,\n",
    "            \"x.smoothness_mean\": 0.1,\n",
    "            \"x.compactness_mean\": 0.12,\n",
    "            \"x.concavity_mean\": 0.1,\n",
    "            \"x.concave_pts_mean\": 0.055,\n",
    "            \"x.symmetry_mean\": 0.185,\n",
    "            \"x.fractal_dim_mean\": 0.063,\n",
    "            \"x.radius_se\": 0.4,\n",
    "            \"x.texture_se\": 1.3,\n",
    "            \"x.perimeter_se\": 2.9,\n",
    "            \"x.area_se\": 42.0,\n",
    "            \"x.smoothness_se\": 0.0075,\n",
    "            \"x.compactness_se\": 0.027,\n",
    "            \"x.concavity_se\": 0.032,\n",
    "            \"x.concave_pts_se\": 0.012,\n",
    "            \"x.symmetry_se\": 0.021,\n",
    "            \"x.fractal_dim_se\": 0.0041,\n",
    "            \"x.radius_worst\": 16.7,\n",
    "            \"x.texture_worst\": 26.5,\n",
    "            \"x.perimeter_worst\": 112.0,\n",
    "            \"x.area_worst\": 910.0,\n",
    "            \"x.smoothness_worst\": 0.135,\n",
    "            \"x.compactness_worst\": 0.27,\n",
    "            \"x.concavity_worst\": 0.28,\n",
    "            \"x.concave_pts_worst\": 0.12,\n",
    "            \"x.symmetry_worst\": 0.3,\n",
    "            \"x.fractal_dim_worst\": 0.084,\n",
    "        },\n",
    "    },\n",
    "    \"tabular-typical-malignant\": {\n",
    "        \"source_label\": \"Typical Malignant Tumour\",\n",
    "        \"label_hint\": \"malignant\",\n",
    "        \"description\": \"Typical malignant Wisconsin profile from the original BreaScope AI preset set.\",\n",
    "        \"values\": {\n",
    "            \"x.radius_mean\": 17.5,\n",
    "            \"x.texture_mean\": 21.0,\n",
    "            \"x.perimeter_mean\": 115.0,\n",
    "            \"x.area_mean\": 990.0,\n",
    "            \"x.smoothness_mean\": 0.105,\n",
    "            \"x.compactness_mean\": 0.145,\n",
    "            \"x.concavity_mean\": 0.16,\n",
    "            \"x.concave_pts_mean\": 0.09,\n",
    "            \"x.symmetry_mean\": 0.195,\n",
    "            \"x.fractal_dim_mean\": 0.065,\n",
    "            \"x.radius_se\": 0.55,\n",
    "            \"x.texture_se\": 1.6,\n",
    "            \"x.perimeter_se\": 3.8,\n",
    "            \"x.area_se\": 55.0,\n",
    "            \"x.smoothness_se\": 0.009,\n",
    "            \"x.compactness_se\": 0.035,\n",
    "            \"x.concavity_se\": 0.045,\n",
    "            \"x.concave_pts_se\": 0.017,\n",
    "            \"x.symmetry_se\": 0.025,\n",
    "            \"x.fractal_dim_se\": 0.0052,\n",
    "            \"x.radius_worst\": 20.9,\n",
    "            \"x.texture_worst\": 29.5,\n",
    "            \"x.perimeter_worst\": 146.0,\n",
    "            \"x.area_worst\": 1326.0,\n",
    "            \"x.smoothness_worst\": 0.16,\n",
    "            \"x.compactness_worst\": 0.35,\n",
    "            \"x.concavity_worst\": 0.4,\n",
    "            \"x.concave_pts_worst\": 0.18,\n",
    "            \"x.symmetry_worst\": 0.33,\n",
    "            \"x.fractal_dim_worst\": 0.09,\n",
    "        },\n",
    "    },\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8901175c",
   "metadata": {},
   "source": [
    "## Validation Helpers\n",
    "\n",
    "The helper functions make two checks explicit. First, each preset must supply exactly the Wisconsin feature names the trained model expects, with no missing or extra keys; the values are then reordered to match the model's feature order. Second, each value is positioned inside the empirical Wisconsin distribution as a percentile, defined here as the share of dataset values less than or equal to the preset value. This gives the evidence CSV enough context to explain whether a preset is low, central, or high relative to the dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c03cb06",
   "metadata": {},
   "outputs": [],
   "source": [
    "def percentile_position(series: pd.Series, value: float) -> float:\n",
    "    \"\"\"Return the percentage of values in `series` that are <= `value`.\"\"\"\n",
    "    sorted_values = np.sort(series.astype(float).to_numpy())\n",
    "    return float(np.searchsorted(sorted_values, value, side=\"right\") / len(sorted_values) * 100)\n",
    "\n",
    "\n",
    "def ordered_features(raw_values: dict[str, float], feature_order: list[str]) -> dict[str, float]:\n",
    "    \"\"\"Check the preset keys against `feature_order` and return the values in that order.\"\"\"\n",
    "    missing = [feature for feature in feature_order if feature not in raw_values]\n",
    "    extra = [feature for feature in raw_values if feature not in feature_order]\n",
    "    assert not missing, f\"Missing features: {missing}\"\n",
    "    assert not extra, f\"Unexpected features: {extra}\"\n",
    "    return {feature: float(raw_values[feature]) for feature in feature_order}\n"
   ]
  },
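  {
   "cell_type": "markdown",
   "id": "d41a7e90",
   "metadata": {},
   "source": [
    "As a quick illustration of the two helpers above, the optional cell below runs them on invented toy inputs. The toy series, the feature names `a` and `b`, and all values are assumptions made up for this sketch; they are not Wisconsin data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6be2f5c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy inputs only: these values are invented for illustration, not taken from the Wisconsin data.\n",
    "toy_series = pd.Series([1.0, 2.0, 3.0, 4.0])\n",
    "assert percentile_position(toy_series, 0.5) == 0.0  # below every value\n",
    "assert percentile_position(toy_series, 2.0) == 50.0  # two of four values are <= 2.0\n",
    "assert percentile_position(toy_series, 4.0) == 100.0  # equal to the maximum\n",
    "\n",
    "# Keys may arrive in any order; the helper returns them in the requested order as floats.\n",
    "toy_ordered = ordered_features({\"b\": 2, \"a\": 1}, [\"a\", \"b\"])\n",
    "assert list(toy_ordered.items()) == [(\"a\", 1.0), (\"b\", 2.0)]\n"
   ]
  },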
  {
   "cell_type": "markdown",
   "id": "ead92ff6",
   "metadata": {},
   "source": [
    "## Building Tabular Cases\n",
    "\n",
    "This step turns the three copied profiles into app-ready tabular cases. For each preset, the notebook:\n",
    "\n",
    "- reorders values to match the Wisconsin model input contract;\n",
    "- asserts that every value is finite;\n",
    "- checks every feature is inside the observed Wisconsin min/max range;\n",
    "- runs the published Wisconsin model to record the malignant probability;\n",
    "- writes one evidence row per feature, including min, max, percentile, and source label.\n",
    "\n",
    "This is why the final demo can justify the exact numbers instead of saying they were arbitrary examples.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9fe121c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_tabular_cases(wisconsin_df: pd.DataFrame, feature_order: list[str]) -> tuple[list[dict], list[dict]]:\n",
    "    evidence_rows = []\n",
    "    tabular_cases = []\n",
    "    feature_summary = wisconsin_df[feature_order].agg([\"min\", \"max\"])\n",
    "\n",
    "    for preset_id, preset in BREASCOPE_TABULAR_PRESETS.items():\n",
    "        features = ordered_features(preset[\"values\"], feature_order)\n",
    "        feature_frame = pd.DataFrame([features], columns=feature_order)\n",
    "        assert np.isfinite(feature_frame.to_numpy()).all(), f\"Non-finite value in {preset_id}\"\n",
    "\n",
    "        probability = float(\n",
    "            infer_wisconsin(WISCONSIN_MODEL_PATH, WISCONSIN_SCALER_PATH, feature_frame)\n",
    "            .iloc[0][\"probability_malignant\"]\n",
    "        )\n",
    "        feature_percentiles = []\n",
    "\n",
    "        for feature, value in features.items():\n",
    "            dataset_min = float(feature_summary.loc[\"min\", feature])\n",
    "            dataset_max = float(feature_summary.loc[\"max\", feature])\n",
    "            percentile = percentile_position(wisconsin_df[feature], value)\n",
    "            assert dataset_min <= value <= dataset_max, f\"{preset_id}:{feature} outside Wisconsin range\"\n",
    "            feature_percentiles.append(percentile)\n",
    "            evidence_rows.append(\n",
    "                {\n",
    "                    \"row_type\": \"tabular_feature\",\n",
    "                    \"preset_id\": preset_id,\n",
    "                    \"case_family\": \"tabular\",\n",
    "                    \"label_hint\": preset[\"label_hint\"],\n",
    "                    \"source_label\": preset[\"source_label\"],\n",
    "                    \"feature\": feature,\n",
    "                    \"value\": value,\n",
    "                    \"dataset_min\": dataset_min,\n",
    "                    \"dataset_max\": dataset_max,\n",
    "                    \"percentile\": percentile,\n",
    "                    \"model_probability_malignant\": probability,\n",
    "                    \"source_url\": SOURCE_URL,\n",
    "                    \"selection_reason\": \"Exact profile copied from the deployed BreaScope AI preset set and validated against Wisconsin feature ranges.\",\n",
    "                }\n",
    "            )\n",
    "\n",
    "        tabular_cases.append(\n",
    "            {\n",
    "                \"id\": preset_id,\n",
    "                \"labelHint\": preset[\"label_hint\"],\n",
    "                \"description\": preset[\"description\"],\n",
    "                \"features\": features,\n",
    "                \"probabilityMalignant\": probability,\n",
    "                \"sourceLabel\": preset[\"source_label\"],\n",
    "                \"sourceUrl\": SOURCE_URL,\n",
    "                \"evidence\": {\n",
    "                    \"percentileMin\": float(np.min(feature_percentiles)),\n",
    "                    \"percentileMedian\": float(np.median(feature_percentiles)),\n",
    "                    \"percentileMax\": float(np.max(feature_percentiles)),\n",
    "                    \"selectionReason\": \"Preset profile from the existing BreaScope AI demo, kept editable in the new app.\",\n",
    "                },\n",
    "            }\n",
    "        )\n",
    "\n",
    "    return tabular_cases, evidence_rows\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0e78a23",
   "metadata": {},
   "source": [
    "## Selecting Image Presets From Held-Out Predictions\n",
    "\n",
    "The image presets must come from real BreaKHis images. They are selected from `breakhis_full_test_predictions.csv`, which is the corrected patient-level holdout prediction table from the image workflow.\n",
    "\n",
    "The selection rule is intentionally conservative: within each `label x magnification` group, only correctly predicted holdout images are eligible, and the chosen image is the one whose malignant probability is nearest to the median over those eligible images. Ties are broken by file path so the choice is reproducible. This avoids cherry-picking the easiest or most extreme examples while still selecting images the model classifies correctly.\n",
    "\n",
    "The result is exactly eight image presets: benign and malignant examples at 40X, 100X, 200X, and 400X.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "230c8742",
   "metadata": {},
   "outputs": [],
   "source": [
    "def select_image_cases(predictions_df: pd.DataFrame) -> tuple[list[dict], list[dict]]:\n",
    "    evidence_rows = []\n",
    "    image_cases = []\n",
    "    predictions_df = predictions_df.copy()\n",
    "    predictions_df[\"correct\"] = predictions_df[\"y_true\"] == predictions_df[\"y_pred\"]\n",
    "\n",
    "    for label in [\"benign\", \"malignant\"]:\n",
    "        for magnification in MAGNIFICATIONS:\n",
    "            # Eligible images: correctly predicted holdout tiles in this label x magnification stratum.\n",
    "            stratum = predictions_df[\n",
    "                (predictions_df[\"label\"] == label)\n",
    "                & (predictions_df[\"magnification\"] == magnification)\n",
    "                & predictions_df[\"correct\"]\n",
    "            ].copy()\n",
    "            assert not stratum.empty, f\"No correct holdout images for {label} {magnification}\"\n",
    "            median_probability = float(stratum[\"y_prob\"].median())\n",
    "            stratum[\"distance_to_median\"] = (stratum[\"y_prob\"] - median_probability).abs()\n",
    "            # Sort by distance first, then by file path so ties resolve deterministically.\n",
    "            selected = stratum.sort_values([\"distance_to_median\", \"filepath\"]).iloc[0]\n",
    "            absolute_path = Path(selected[\"filepath\"])\n",
    "            assert absolute_path.is_file(), f\"Missing selected image: {absolute_path}\"\n",
    "            relative_path = absolute_path.relative_to(PROJECT_ROOT).as_posix()\n",
    "            image_id = f\"{label}-{magnification.lower()}-representative\"\n",
    "            probability = float(selected[\"y_prob\"])\n",
    "            selection_reason = (\n",
    "                f\"Correct held-out {label} {magnification} prediction nearest to the stratum median \"\n",
    "                f\"malignant probability ({median_probability:.6f}).\"\n",
    "            )\n",
    "\n",
    "            image_case = {\n",
    "                \"id\": f\"image-{image_id}\",\n",
    "                \"imageId\": image_id,\n",
    "                \"labelHint\": label,\n",
    "                \"description\": f\"Representative {label} BreaKHis tile at {magnification} magnification.\",\n",
    "                \"relativePath\": relative_path,\n",
    "                \"patientId\": str(selected[\"patient_id\"]),\n",
    "                \"magnification\": magnification,\n",
    "                \"probabilityMalignant\": probability,\n",
    "                \"selectionReason\": selection_reason,\n",
    "            }\n",
    "            image_cases.append(image_case)\n",
    "            evidence_rows.append(\n",
    "                {\n",
    "                    \"row_type\": \"image_preset\",\n",
    "                    \"preset_id\": image_case[\"id\"],\n",
    "                    \"case_family\": \"image\",\n",
    "                    \"label_hint\": label,\n",
    "                    \"image_id\": image_id,\n",
    "                    \"patient_id\": selected[\"patient_id\"],\n",
    "                    \"magnification\": magnification,\n",
    "                    \"relative_path\": relative_path,\n",
    "                    \"model_probability_malignant\": probability,\n",
    "                    \"stratum_median_probability\": median_probability,\n",
    "                    \"distance_to_median\": float(selected[\"distance_to_median\"]),\n",
    "                    \"selection_reason\": selection_reason,\n",
    "                }\n",
    "            )\n",
    "\n",
    "    return image_cases, evidence_rows\n"
   ]
  },
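  {
   "cell_type": "markdown",
   "id": "0f93ab17",
   "metadata": {},
   "source": [
    "The median-nearest rule and its tie-break can be checked on a small invented example. The sketch below repeats the three selection lines from `select_image_cases()` on a toy stratum; the file names and probabilities are hypothetical and chosen only so that two tiles are equally close to the median.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7c81d24",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical toy stratum: file names and probabilities are invented for illustration.\n",
    "toy_stratum = pd.DataFrame(\n",
    "    {\n",
    "        \"filepath\": [\"tile_c.png\", \"tile_b.png\", \"tile_a.png\", \"tile_d.png\"],\n",
    "        \"y_prob\": [0.125, 0.25, 0.5, 0.875],\n",
    "    }\n",
    ")\n",
    "toy_median = float(toy_stratum[\"y_prob\"].median())\n",
    "toy_stratum[\"distance_to_median\"] = (toy_stratum[\"y_prob\"] - toy_median).abs()\n",
    "toy_selected = toy_stratum.sort_values([\"distance_to_median\", \"filepath\"]).iloc[0]\n",
    "\n",
    "assert toy_median == 0.375  # midpoint of the two central values, 0.25 and 0.5\n",
    "# tile_a.png and tile_b.png are both 0.125 from the median; the file path breaks the tie.\n",
    "assert toy_selected[\"filepath\"] == \"tile_a.png\"\n"
   ]
  },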
  {
   "cell_type": "markdown",
   "id": "68c59534",
   "metadata": {},
   "source": [
    "## Constructing Synthetic Fusion Story Cases\n",
    "\n",
    "The fusion presets reuse the tabular and image presets built above. Every fusion case pairs a tabular preset with the 100X representative image for the relevant class. They are story cases for the demo rather than new clinical records.\n",
    "\n",
    "The six cases are designed to show different behaviours:\n",
    "\n",
    "- concordant benign: benign tabular + benign image;\n",
    "- concordant malignant: malignant tabular + malignant image;\n",
    "- discordant stress test: benign tabular + malignant image;\n",
    "- discordant stress test: malignant tabular + benign image;\n",
    "- borderline tabular + benign image;\n",
    "- borderline tabular + malignant image.\n",
    "\n",
    "The probability is calculated using the same late-fusion rule as the app: average the tabular malignant probability and the image malignant probability. Keeping that rule here makes the manifest traceable to the web result shown in the interface.\n"
   ]
  },
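  {
   "cell_type": "markdown",
   "id": "e5a20c6f",
   "metadata": {},
   "source": [
    "The late-fusion rule described above can be written out as a short worked equation. The symbols $p_{\\text{tab}}$, $p_{\\text{img}}$, and $p_{\\text{fusion}}$ are notation introduced here for illustration, and the numbers in the example are hypothetical rather than real model outputs.\n",
    "\n",
    "$$p_{\\text{fusion}} = \\frac{p_{\\text{tab}} + p_{\\text{img}}}{2}$$\n",
    "\n",
    "For example, a hypothetical discordant pair with $p_{\\text{tab}} = 0.10$ and $p_{\\text{img}} = 0.90$ gives $p_{\\text{fusion}} = (0.10 + 0.90) / 2 = 0.50$, so two confident but opposite scores land exactly at the midpoint.\n"
   ]
  },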
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40694426",
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_fusion_cases(tabular_cases: list[dict], image_cases: list[dict]) -> tuple[list[dict], list[dict]]:\n",
    "    tabular_by_id = {case[\"id\"]: case for case in tabular_cases}\n",
    "    image_by_id = {case[\"imageId\"]: case for case in image_cases}\n",
    "    benign_image = image_by_id[\"benign-100x-representative\"]\n",
    "    malignant_image = image_by_id[\"malignant-100x-representative\"]\n",
    "\n",
    "    specs = [\n",
    "        (\"fusion-concordant-benign\", \"tabular-typical-benign\", benign_image, \"benign\", \"concordant\", \"Benign tabular preset paired with a benign BreaKHis tile.\"),\n",
    "        (\"fusion-concordant-malignant\", \"tabular-typical-malignant\", malignant_image, \"malignant\", \"concordant\", \"Malignant tabular preset paired with a malignant BreaKHis tile.\"),\n",
    "        (\"fusion-discordant-benign-tabular-malignant-image\", \"tabular-typical-benign\", malignant_image, \"benign\", \"discordant\", \"Benign tabular preset paired with a malignant tile to stress-test synthetic disagreement.\"),\n",
    "        (\"fusion-discordant-malignant-tabular-benign-image\", \"tabular-typical-malignant\", benign_image, \"malignant\", \"discordant\", \"Malignant tabular preset paired with a benign tile to stress-test synthetic disagreement.\"),\n",
    "        (\"fusion-borderline-with-benign-image\", \"tabular-borderline-suspicious\", benign_image, \"borderline\", \"borderline\", \"Borderline tabular preset paired with a benign tile.\"),\n",
    "        (\"fusion-borderline-with-malignant-image\", \"tabular-borderline-suspicious\", malignant_image, \"borderline\", \"borderline\", \"Borderline tabular preset paired with a malignant tile.\"),\n",
    "    ]\n",
    "\n",
    "    fusion_cases = []\n",
    "    evidence_rows = []\n",
    "    for preset_id, tabular_id, image_case, label_hint, case_type, description in specs:\n",
    "        tabular_case = tabular_by_id[tabular_id]\n",
    "        tabular_probability = float(tabular_case[\"probabilityMalignant\"])\n",
    "        image_probability = float(image_case[\"probabilityMalignant\"])\n",
    "        fusion_probability = float(np.mean([tabular_probability, image_probability]))\n",
    "        selection_reason = (\n",
    "            \"Synthetic demo case built from independent tabular and image datasets; probability uses the app's current late-fusion mean.\"\n",
    "        )\n",
    "        fusion_cases.append(\n",
    "            {\n",
    "                \"id\": preset_id,\n",
    "                \"imageId\": image_case[\"imageId\"],\n",
    "                \"labelHint\": label_hint,\n",
    "                \"description\": description,\n",
    "                \"features\": tabular_case[\"features\"],\n",
    "                \"tabularPresetId\": tabular_id,\n",
    "                \"imagePresetId\": image_case[\"id\"],\n",
    "                \"caseType\": case_type,\n",
    "                \"probabilityMalignant\": fusion_probability,\n",
    "                \"tabularProbabilityMalignant\": tabular_probability,\n",
    "                \"imageProbabilityMalignant\": image_probability,\n",
    "                \"selectionReason\": selection_reason,\n",
    "            }\n",
    "        )\n",
    "        evidence_rows.append(\n",
    "            {\n",
    "                \"row_type\": \"fusion_preset\",\n",
    "                \"preset_id\": preset_id,\n",
    "                \"case_family\": \"fusion\",\n",
    "                \"label_hint\": label_hint,\n",
    "                \"tabular_preset_id\": tabular_id,\n",
    "                \"image_preset_id\": image_case[\"id\"],\n",
    "                \"image_id\": image_case[\"imageId\"],\n",
    "                \"fusion_case_type\": case_type,\n",
    "                \"tabular_probability_malignant\": tabular_probability,\n",
    "                \"image_probability_malignant\": image_probability,\n",
    "                \"model_probability_malignant\": fusion_probability,\n",
    "                \"selection_reason\": selection_reason,\n",
    "            }\n",
    "        )\n",
    "\n",
    "    return fusion_cases, evidence_rows\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c6c0d34",
   "metadata": {},
   "source": [
    "## Manifest Assembly And Assertions\n",
    "\n",
    "The final generation function combines the tabular, image, and fusion presets into one JSON document. The assertions are part of the research evidence: they fail loudly if the app-facing artifact loses a feature, misses an image variant, references a missing image file, or creates the wrong number of fusion cases.\n",
    "\n",
    "These checks protect the demo from silent drift. If the upstream predictions change later, rerunning this notebook will either regenerate a valid manifest or stop at the exact broken assumption.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb46acf6",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_demo_presets() -> tuple[dict, pd.DataFrame]:\n",
    "    wisconsin_df = pd.read_csv(WISCONSIN_DATA_PATH)\n",
    "    feature_order = [column for column in wisconsin_df.columns if column not in {\"Unnamed: 0\", \"y\"}]\n",
    "    predictions_df = pd.read_csv(IMAGE_PREDICTIONS_PATH)\n",
    "\n",
    "    tabular_cases, tabular_evidence = build_tabular_cases(wisconsin_df, feature_order)\n",
    "    image_cases, image_evidence = select_image_cases(predictions_df)\n",
    "    fusion_cases, fusion_evidence = build_fusion_cases(tabular_cases, image_cases)\n",
    "\n",
    "    # Each assertion names the assumption it protects so a failure points at the broken step.\n",
    "    assert len(feature_order) == 30, f\"Expected 30 Wisconsin features, found {len(feature_order)}\"\n",
    "    assert all(list(case[\"features\"].keys()) == feature_order for case in tabular_cases), \"Tabular features are not in model order\"\n",
    "    assert len(tabular_cases) == 3, f\"Expected 3 tabular cases, found {len(tabular_cases)}\"\n",
    "    assert len(image_cases) == 8, f\"Expected 8 image cases, found {len(image_cases)}\"\n",
    "    assert {(case[\"labelHint\"], case[\"magnification\"]) for case in image_cases} == {\n",
    "        (label, magnification) for label in [\"benign\", \"malignant\"] for magnification in MAGNIFICATIONS\n",
    "    }, \"Image presets do not cover every label x magnification pair\"\n",
    "    assert all((PROJECT_ROOT / case[\"relativePath\"]).is_file() for case in image_cases), \"An image preset path is missing on disk\"\n",
    "    assert len(fusion_cases) == 6, f\"Expected 6 fusion cases, found {len(fusion_cases)}\"\n",
    "    assert {case[\"caseType\"] for case in fusion_cases} == {\"concordant\", \"discordant\", \"borderline\"}, \"Fusion case types are incomplete\"\n",
    "\n",
    "    manifest = {\n",
    "        \"schemaVersion\": 1,\n",
    "        \"generatedAt\": datetime.now(timezone.utc).isoformat(timespec=\"seconds\"),\n",
    "        \"source\": {\n",
    "            \"tabularPresetSourceUrl\": SOURCE_URL,\n",
    "            \"wisconsinData\": WISCONSIN_DATA_PATH.relative_to(PROJECT_ROOT).as_posix(),\n",
    "            \"imagePredictions\": IMAGE_PREDICTIONS_PATH.relative_to(PROJECT_ROOT).as_posix(),\n",
    "        },\n",
    "        \"disclaimer\": \"Research demo only. Synthetic fusion cases combine independent datasets and are not real patient-level multimodal records.\",\n",
    "        \"featureOrder\": feature_order,\n",
    "        \"tabular\": tabular_cases,\n",
    "        \"image\": image_cases,\n",
    "        \"fusion\": fusion_cases,\n",
    "    }\n",
    "\n",
    "    evidence = pd.DataFrame(tabular_evidence + image_evidence + fusion_evidence)\n",
    "    OUTPUT_JSON.parent.mkdir(parents=True, exist_ok=True)\n",
    "    # Write with an explicit encoding so the manifest does not depend on the platform default.\n",
    "    OUTPUT_JSON.write_text(json.dumps(manifest, indent=2) + \"\\n\", encoding=\"utf-8\")\n",
    "    evidence.to_csv(OUTPUT_CSV, index=False)\n",
    "    return manifest, evidence\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3bb85a9",
   "metadata": {},
   "source": [
    "## Generate Outputs\n",
    "\n",
    "This cell writes the two artifacts consumed by the project:\n",
    "\n",
    "- `demo_presets.json` is read by the FastAPI `/demo-cases` endpoint and by the frontend fallback route;\n",
    "- `demo_preset_evidence.csv` is the audit trail for the dissertation and demo defence.\n",
    "\n",
    "The short printed summary is intentionally small so the notebook remains readable when executed top-to-bottom.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29820905",
   "metadata": {},
   "outputs": [],
   "source": [
    "manifest, evidence = generate_demo_presets()\n",
    "print(f\"Wrote {OUTPUT_JSON.relative_to(PROJECT_ROOT)}\")\n",
    "print(f\"Wrote {OUTPUT_CSV.relative_to(PROJECT_ROOT)}\")\n",
    "print(f\"Tabular: {len(manifest['tabular'])}, image: {len(manifest['image'])}, fusion: {len(manifest['fusion'])}\")\n",
    "evidence.groupby(\"row_type\").size()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ee9e156",
   "metadata": {},
   "source": [
    "## Interpreting The Output Manifest\n",
    "\n",
    "The JSON manifest is the operational artifact. It contains the feature order, the tabular preset values, the selected image paths, and the synthetic fusion definitions. The web app uses those entries to pre-fill the UI, but the user can still edit tabular fields or upload a different image.\n",
    "\n",
    "The evidence CSV is the methodological artifact. It explains why the generated values and images were chosen: tabular rows include Wisconsin range and percentile context, image rows include the median-selection rationale, and fusion rows include the exact probability construction.\n"
   ]
  },
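  {
   "cell_type": "markdown",
   "id": "48d7f1b9",
   "metadata": {},
   "source": [
    "One way to confirm that the file on disk matches what was just generated is to reload it and compare a few fields. The cell below is an optional sketch, not part of the app pipeline; it assumes the generation cell above has already run in this session, and it only uses names defined earlier in this notebook (`OUTPUT_JSON`, `manifest`).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c6e9a05",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check (a sketch): reload the manifest from disk and compare it with the in-memory copy.\n",
    "reloaded = json.loads(OUTPUT_JSON.read_text(encoding=\"utf-8\"))\n",
    "assert reloaded[\"featureOrder\"] == manifest[\"featureOrder\"]\n",
    "assert [case[\"id\"] for case in reloaded[\"fusion\"]] == [case[\"id\"] for case in manifest[\"fusion\"]]\n",
    "sorted(reloaded.keys())\n"
   ]
  },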
  {
   "cell_type": "markdown",
   "id": "c424428b",
   "metadata": {},
   "source": [
    "## Research Notes And Limitations\n",
    "\n",
    "The preset design is intentionally pragmatic for a live demonstration. It gives broad coverage without overwhelming the page.\n",
    "\n",
    "Important limitations to state in the dissertation/demo:\n",
    "\n",
    "- Wisconsin tabular presets are example morphology profiles and do not constitute clinical advice for any real patient.\n",
    "- BreaKHis image presets are real dataset images, but they are selected for demonstration coverage.\n",
    "- Fusion presets join independent datasets and therefore are not real multimodal clinical records.\n",
    "- The fusion probability is a simple late-fusion average, chosen for transparency in the demo rather than as a claim of clinical optimality.\n",
    "\n",
    "These notes should stay aligned with the app disclaimer and the dissertation framing around exploratory synthetic multimodal work.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13636229",
   "metadata": {},
   "source": [
    "## Outputs\n",
    "\n",
    "- `outputs/reports/demo_presets.json`: app-facing manifest consumed by `/demo-cases` and dataset image routes.\n",
    "- `outputs/reports/demo_preset_evidence.csv`: audit table with feature ranges, percentiles, image selection evidence, and fusion probability construction.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (dissertation_dl)",
   "language": "python",
   "name": "dissertation_dl"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
