{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d541deca",
   "metadata": {},
   "source": [
    "# Scope and Research Questions\n",
    "\n",
    "## Objective\n",
    "When I first scoped this project I was aiming for a proper matched multimodal study. The more closely I looked at the data I actually had, the less comfortable I was with making that claim. Wisconsin and BreaKHis are independent datasets, so any pairing between them has to be treated as synthetic and exploratory rather than as real patient-level multimodal evidence. This notebook sets that boundary before the modelling starts.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "codex_research_commentary": true
   },
   "source": [
    "## Notebook Purpose\n",
    "\n",
    "This opening notebook defines the study before any modelling begins. Its role is to make the dissertation boundary explicit: the Wisconsin tabular branch is reused as a published comparator, the BreaKHis branch is rebuilt as the main image contribution, and fusion is treated as synthetic exploratory work rather than a clinical multimodal claim.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "codex_research_commentary": true
   },
   "source": [
    "## Why This Matters\n",
    "\n",
    "A research project needs clear scope before experiments are run. Without this framing, later fusion results could be over-interpreted. This notebook therefore records the research question, the branch responsibilities, and the criteria that later notebooks must satisfy.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "codex_research_commentary": true
   },
   "source": [
    "## Setup And Project Framing\n",
    "\n",
    "This setup cell establishes reproducibility and resolves the project paths used across the notebook series. It also records the core project note as a small table so the methodological boundary is visible in the notebook output, not only in prose.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e52c3758",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-04-19T20:10:46.067866Z",
     "iopub.status.busy": "2026-04-19T20:10:46.067463Z",
     "iopub.status.idle": "2026-04-19T20:10:46.769991Z",
     "shell.execute_reply": "2026-04-19T20:10:46.769593Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Project root: /Users/sergeysotskiy/Documents/UNI/year 3/Dissertation/dissertation_project\n",
      "Outputs: /Users/sergeysotskiy/Documents/UNI/year 3/Dissertation/dissertation_project/outputs\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>item</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>core_question</td>\n",
       "      <td>Can synthetic label-aligned pairing of indepen...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>wisconsin_branch</td>\n",
       "      <td>Wisconsin tabular branch is reused exactly as ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>main_image_branch</td>\n",
       "      <td>BreaKHis is rebuilt with patient-level splitti...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>exploratory_branch</td>\n",
       "      <td>Synthetic pairing is used only for methodologi...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 item                                              value\n",
       "0       core_question  Can synthetic label-aligned pairing of indepen...\n",
       "1       wisconsin_branch  Wisconsin tabular branch is reused exactly as ...\n",
       "2   main_image_branch  BreaKHis is rebuilt with patient-level splitti...\n",
       "3  exploratory_branch  Synthetic pairing is used only for methodologi..."
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import json\n",
    "import random\n",
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "SEED = 42\n",
    "random.seed(SEED)\n",
    "np.random.seed(SEED)\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "\n",
    "CWD = Path.cwd().resolve()\n",
    "if (CWD / 'src').exists() and (CWD / 'data').exists():\n",
    "    PROJECT_ROOT = CWD\n",
    "elif (CWD.parent / 'src').exists() and (CWD.parent / 'data').exists():\n",
    "    PROJECT_ROOT = CWD.parent\n",
    "elif (CWD.parent.parent / 'src').exists() and (CWD.parent.parent / 'data').exists():\n",
    "    PROJECT_ROOT = CWD.parent.parent\n",
    "else:\n",
    "    raise RuntimeError(f'Could not resolve dissertation_project root from {CWD}')\n",
    "\n",
    "REPO_ROOT = PROJECT_ROOT.parent\n",
    "OUTPUTS = PROJECT_ROOT / 'outputs'\n",
    "FIGURES = OUTPUTS / 'figures'\n",
    "METRICS = OUTPUTS / 'metrics'\n",
    "REPORTS = OUTPUTS / 'reports'\n",
    "MODELS = PROJECT_ROOT / 'models'\n",
    "DATA_ROOT = PROJECT_ROOT / 'data' / 'dataset_cancer_v1' / 'dataset_cancer_v1'\n",
    "WISCONSIN_ROOT = PROJECT_ROOT / 'notebook_Wisconsin'\n",
    "\n",
    "for path in [FIGURES, METRICS, REPORTS]:\n",
    "    path.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "if str(PROJECT_ROOT) not in sys.path:\n",
    "    sys.path.append(str(PROJECT_ROOT))\n",
    "\n",
    "print('Project root:', PROJECT_ROOT)\n",
    "print('Outputs:', OUTPUTS)\n",
    "\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "project_note = {\n",
    "    'core_question': 'Can synthetic label-aligned pairing of independent breast cancer datasets act as a useful exploratory testbed for fusion under data scarcity?',\n",
    "    'wisconsin_branch': 'Wisconsin tabular branch is reused exactly as published and is not edited.',\n",
    "    'main_image_branch': 'BreaKHis is rebuilt with patient-level splitting to remove leakage.',\n",
    "    'exploratory_branch': 'Synthetic pairing is used only for methodological exploration with random-pairing controls.',\n",
    "}\n",
    "pd.DataFrame(project_note.items(), columns=['item', 'value'])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0bc73d9",
   "metadata": {},
   "source": [
    "## Research Questions\n",
    "\n",
    "These are the questions the rest of the notebooks try to answer.\n",
    "\n",
    "1. How does BreaKHis behave when the split is corrected from image level to patient level?\n",
    "2. What is the strongest image-only baseline I can train locally on the corrected split?\n",
    "3. What does the existing Wisconsin branch contribute when it is reused unchanged as the tabular model?\n",
    "4. Do synthetic fusion experiments appear to improve performance, and if so, are those gains robust under control pairings?\n",
    "5. Which findings can I present as main results, and which need to stay clearly exploratory?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "codex_research_commentary": true
   },
   "source": [
    "## Research Plan Table\n",
    "\n",
    "This cell converts the dissertation workflow into staged success criteria. The table is important because it links each later notebook to a concrete purpose: dataset understanding, leakage audit, image modelling, fusion analysis, and final synthesis.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "39975b5f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-04-19T20:10:46.771722Z",
     "iopub.status.busy": "2026-04-19T20:10:46.771573Z",
     "iopub.status.idle": "2026-04-19T20:10:46.776491Z",
     "shell.execute_reply": "2026-04-19T20:10:46.776027Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>stage</th>\n",
       "      <th>success_criterion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dataset understanding</td>\n",
       "      <td>Clear cohort summaries, patient counts, and vi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Leakage audit</td>\n",
       "      <td>Patient overlap is demonstrated for the image-...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Image model development</td>\n",
       "      <td>At least one corrected baseline is trained wit...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Fusion analysis</td>\n",
       "      <td>Same-label and random-pairing controls are bot...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Dissertation synthesis</td>\n",
       "      <td>Plots, tables, and caveats are saved for direc...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     stage                                  success_criterion\n",
       "0    Dataset understanding  Clear cohort summaries, patient counts, and vi...\n",
       "1            Leakage audit  Patient overlap is demonstrated for the image-...\n",
       "2  Image model development  At least one corrected baseline is trained wit...\n",
       "3          Fusion analysis  Same-label and random-pairing controls are bot...\n",
       "4   Dissertation synthesis  Plots, tables, and caveats are saved for direc..."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "research_plan = pd.DataFrame(\n",
    "    [\n",
    "        {'stage': 'Dataset understanding', 'success_criterion': 'Clear cohort summaries, patient counts, and visible class/magnification structure.'},\n",
    "        {'stage': 'Leakage audit', 'success_criterion': 'Patient overlap is demonstrated for the image-level split and eliminated in the corrected patient-level split.'},\n",
    "        {'stage': 'Image model development', 'success_criterion': 'At least one corrected baseline is trained with visible curves and saved artifacts.'},\n",
    "        {'stage': 'Fusion analysis', 'success_criterion': 'Same-label and random-pairing controls are both evaluated across repeated seeds.'},\n",
    "        {'stage': 'Dissertation synthesis', 'success_criterion': 'Plots, tables, and caveats are saved for direct inclusion in the written report.'},\n",
    "    ]\n",
    ")\n",
    "research_plan\n"
   ]
  },
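  {
   "cell_type": "markdown",
   "id": "b7a4e1c9",
   "metadata": {},
   "source": [
    "## Illustrative Leakage Check\n",
    "\n",
    "The leakage-audit criterion in the table above can be illustrated with a minimal sketch. The patient IDs and the `patient_overlap` helper in the next cell are hypothetical placeholders, not values or functions taken from the later notebooks. The point is only that an image-level split can place the same patient on both sides, while a patient-level split should leave the overlap empty.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2d85f30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: hypothetical patient IDs, not read from BreaKHis.\n",
    "def patient_overlap(train_ids, test_ids):\n",
    "    # Return the sorted patient IDs that appear in both splits.\n",
    "    return sorted(set(train_ids) & set(test_ids))\n",
    "\n",
    "# Image-level split: images from one patient can land on both sides.\n",
    "image_level_train = ['P01', 'P01', 'P02', 'P03']\n",
    "image_level_test = ['P01', 'P04']\n",
    "\n",
    "# Patient-level split: every patient is assigned to exactly one side.\n",
    "patient_level_train = ['P01', 'P01', 'P02', 'P03']\n",
    "patient_level_test = ['P04', 'P04']\n",
    "\n",
    "print('Image-level overlap:', patient_overlap(image_level_train, image_level_test))\n",
    "print('Patient-level overlap:', patient_overlap(patient_level_train, patient_level_test))\n"
   ]
  },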
  {
   "cell_type": "markdown",
   "id": "e8cd20e6",
   "metadata": {},
   "source": [
    "## Working Assumptions\n",
    "\n",
    "I am keeping the Wisconsin notebook and its artifacts untouched because that part of the project is already published work. The corrected BreaKHis split is where the main new experimentation happens. Synthetic pairing is useful as an exploratory framework, but it is not evidence of real multimodal clinical utility. All important figures still need to exist both inline and in `outputs` so the written dissertation can reuse them easily.\n"
   ]
  },
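  {
   "cell_type": "markdown",
   "id": "d9e06a47",
   "metadata": {},
   "source": [
    "## Illustrative Pairing Sketch\n",
    "\n",
    "To make the difference between same-label pairing and the random-pairing control concrete, the next cell sketches both on toy labels. The `pair_indices` helper, the toy labels, and the seeds are assumptions for illustration and may differ from how the fusion notebooks actually build pairs. Neither variant links real patients, which is why both stay exploratory.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f18b3c52",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "# Illustrative sketch only: toy labels, not the Wisconsin or BreaKHis data.\n",
    "def pair_indices(image_labels, tabular_labels, same_label, seed):\n",
    "    # For each image, pick one tabular row index. With same_label=True the\n",
    "    # choice is restricted to rows sharing the image label; otherwise any\n",
    "    # row may be drawn, which serves as the random-pairing control.\n",
    "    rng = random.Random(seed)\n",
    "    all_rows = list(range(len(tabular_labels)))\n",
    "    pairs = []\n",
    "    for image_idx, label in enumerate(image_labels):\n",
    "        if same_label:\n",
    "            pool = [i for i in all_rows if tabular_labels[i] == label]\n",
    "        else:\n",
    "            pool = all_rows\n",
    "        pairs.append((image_idx, rng.choice(pool)))\n",
    "    return pairs\n",
    "\n",
    "image_labels = ['benign', 'malignant', 'malignant']\n",
    "tabular_labels = ['malignant', 'benign', 'benign', 'malignant']\n",
    "\n",
    "# Repeat over several seeds so no conclusion rests on a single draw.\n",
    "for seed in (0, 1, 2):\n",
    "    aligned = pair_indices(image_labels, tabular_labels, same_label=True, seed=seed)\n",
    "    control = pair_indices(image_labels, tabular_labels, same_label=False, seed=seed)\n",
    "    print(seed, aligned, control)\n"
   ]
  },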
  {
   "cell_type": "markdown",
   "metadata": {
    "codex_research_commentary": true
   },
   "source": [
    "## How This Notebook Supports The Dissertation\n",
    "\n",
    "The output from this notebook acts as the contract for the rest of the workflow. Any later result should be interpreted through this frame: unimodal results are the main evidence, while synthetic fusion is an exploratory method demonstration.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (dissertation_dl)",
   "language": "python",
   "name": "dissertation_dl"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}