
μ²Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation

1School of Electronic Engineering and Computer Science,

Queen Mary University of London, London, UK

2School of Engineering, College of Engineering and Physical Sciences,

University of Birmingham, Birmingham, UK

3Guangdong University of Technology, Guangdong, China

4Meta Inc. US

5Nuffield Department of Clinical Neurosciences,

University of Oxford, Oxford, UK

6William Harvey Research Institute, NIHR Barts Biomedical Research Centre,

Queen Mary University London, London, UK


MICCAI 2025
See you in Daejeon, Korea 🇰🇷 from September 23-27, 2025
Describe this scan's findings.
Thinking
The clinical significance of unclear intestinal structures and thickened intestinal walls in the lower right abdomen following surgery must be interpreted in the context of potential postoperative complications. These findings raise concern for inflammation, infection, or ischemia. Adhesions from prior surgery can lead to anatomical distortion, making bowel loops difficult to delineate. The presence of both air and fluid within the bowel, along with perienteric air, increases suspicion for an anastomotic leak or a contained perforation. Striated and patchy densities in the surrounding region may represent inflammatory stranding or localized fluid collections. While postoperative edema may explain some of these findings, the combination of wall thickening, air, and fluid is more indicative of an infectious or perforative process rather than simple post-surgical healing. No abnormalities are seen in other organs, reinforcing a focus on bowel integrity. The patchy and striated shadows observed around the intestines further support the presence of localized inflammation or infection. Striated opacities typically reflect edema or inflammatory infiltration of perienteric fat, commonly seen in conditions such as diverticulitis, appendicitis, or postoperative abscesses. Patchy gas collections adjacent to thickened bowel walls may indicate a minor perforation or infection with gas-forming organisms such as *E. coli* or *Clostridium*. The lack of widespread peritoneal involvement or significant lymphadenopathy makes diffuse peritonitis or metastatic disease less likely. The thickened ileocecal wall and fluid-gas levels in the intestine suggest a focal process such as a localized infection, enterocolitis, or postoperative complication like an anastomotic leak or adhesive ileus with secondary inflammation. Localized air accumulation in the lower right abdominal wall post-surgery is best interpreted in the context of thickened intestinal walls and perienteric air. 
The ileocecal wall thickening supports a localized inflammatory process. Generalized peritoneal free air is absent, and other organs appear normal, which excludes visceral perforation as a primary cause. This pattern is most consistent with a localized infection or abscess involving gas-forming pathogens, potentially due to surgical contamination or anastomotic leakage. The air appears to be confined, indicating a contained infection rather than systemic sepsis or diffuse peritonitis. The presence of both air and liquid within the intestines postoperatively raises the possibility of impaired motility (ileus) or mechanical obstruction. However, the radiographic findings of patchy perienteric air and wall thickening suggest that localized inflammation or leakage may be contributing to bowel dysfunction. While ileus alone may present with dilated loops of bowel, the focal nature of these findings supports a localized process such as minor anastomotic dehiscence leading to inflammation and secondary ileus. Obstruction is less likely without a clear transition point or marked dilation. The smooth liver surface and uniform parenchymal density indicate no structural abnormalities, such as cirrhosis, focal masses, or metastatic lesions. The absence of heterogeneous attenuation changes supports normal hepatic parenchyma without evidence of infection, hemorrhage, or tumor infiltration. The gallbladder's normal size, shape, and the lack of bile duct dilation suggest no involvement of the biliary system. This excludes biliary obstruction, cholecystitis, or bile leak as sources of the observed findings, which are instead confined to the intestinal structures in the lower right abdomen. The lack of abnormalities in the spleen, pancreas, and kidneys indicates that the disease process is localized rather than systemic. Conditions such as vasculitis, widespread infection, or metastatic disease typically involve multiple organs, but none of these are present here. 
This supports a diagnosis limited to the surgical site, such as infection, localized inflammation, or mechanical complications. The absence of enlarged retroperitoneal lymph nodes further narrows the differential to a localized process. Systemic etiologies such as lymphoma, tuberculosis, or sarcoidosis are less likely. However, localized inflammatory conditions like Crohn’s disease or postoperative infection remain possible. Clinical correlation with symptoms and lab values is essential. Perienteric air and wall thickening are not typical of simple postoperative changes and are more indicative of a visceral leak. Anastomotic dehiscence is the most plausible explanation for these findings, especially when combined with subcutaneous emphysema. Adhesions or abscess formation are less likely without signs of obstruction or fluid collections with air-fluid levels. The fistula in the left lower abdominal wall is likely **colocutaneous**, given its anatomical location and the presence of fluid accumulation and a drainage tube. Possible origins include chronic colonic inflammation (such as from Crohn’s disease or diverticulitis) or surgical complications such as colostomy breakdown or wound infection. There is no evidence of bladder involvement, ruling out vesicocutaneous fistula. The bladder appears normal in contour and density, indicating that the urinary system is not involved in the current pathology. This supports the conclusion that the fluid accumulation and fistula are not related to urinary leakage or obstruction. The drainage tube and fistula are likely interrelated. The tube may have been placed to manage fluid from the fistula, or the fistula may have developed in response to the tube’s presence, possibly due to irritation or infection. Their coexistence in the same anatomical region indicates a clinical or pathological connection. The fluid accumulation in the abdominopelvic cavity is likely connected to the fistula. 
If the fistula communicates with the bowel or another hollow viscus, enteric contents may be leaking into the peritoneal space, leading to inflammation and fluid exudation. The drainage tube may be attempting to manage this leakage, but ongoing communication through the fistula may sustain the fluid collection. These findings raise concern for potential complications such as infection, inflammation, or ongoing leakage. The presence of a fistula increases the risk of sepsis and skin breakdown. While imaging does not directly show signs of acute infection (e.g., rim-enhancing fluid or purulent collections), the clinical context and presence of drainage and fluid accumulation suggest an underlying complicated process. Bowel obstruction is less likely unless there is involvement of adjacent loops, which is not explicitly noted. Collectively, these findings suggest a complex postoperative scenario that may involve anastomotic leakage or localized infection. Management should include close monitoring, possible infectious disease consultation, surgical review, and potential interventional radiology involvement to assess and manage the fluid and fistula.
CT imaging reveals a distended stomach with 12 mm, heterogeneously enhancing mucosal thickening in the body and antrum; mild ileocecal wall thickening without abnormal enhancement; a normally proportioned liver containing multiple non-enhancing low-density cysts (largest 14 × 20 mm in segment VIII), an 8 × 16 mm patchy low-density lesion in segment IV showing slight progressive enhancement, and a tiny calcified focus in segment VI; non-dilated intra- and extra-hepatic bile ducts; unremarkable gallbladder, spleen, pancreas and retroperitoneal nodes; and normal-shaped kidneys with clear perirenal fat and multiple non-enhancing cysts (largest 13 × 17 mm in the right mid-pole) without pelvicalyceal or ureteric dilatation.
What does the thickened gastric mucosa (12mm) with uneven and obvious enhancement in the stomach suggest?
Thinking
The gastric mucosa is thickened to 12 mm, exceeding normal physiological limits (typically under 5-10 mm). The appearance shows uneven and obvious enhancement, suggesting an infiltrative process disrupting the gastric wall's vascular architecture. Heterogeneous enhancement is often associated with neoplastic or inflammatory conditions. Malignancy (e.g., gastric adenocarcinoma) is a key concern due to the irregular enhancement pattern, which reflects disorganized tumor vascularity and necrosis. Lymphoma or severe gastritis could also present similarly, but the lack of systemic findings (e.g., lymphadenopathy) and the focal nature favor a primary malignant process. The absence of other acute findings (e.g., ulcerations) further raises suspicion for malignancy.
The thickened gastric mucosa with uneven, obvious enhancement raises concern for gastric malignancy, particularly adenocarcinoma, due to the irregular vascular pattern and significant wall thickening. Differential diagnoses include lymphoma or severe inflammatory gastritis, but malignancy is the most critical consideration.
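What abnormalities do the kidneys show, particularly in terms of position, shape, and cystic lesions?
Thinking
In assessing the kidneys, I first look at their position and note that the left kidney is "slightly low-lying", indicating a positional abnormality. Next, "both kidneys are normal in shape", so there is no structural distortion or malformation. As for cystic lesions, there are "multiple low-density cystic lesions in both kidneys", the largest measuring about 13 mm × 17 mm in the mid-pole of the right kidney. These lesions show no abnormal enhancement, suggesting benign cystic change. In addition, the renal pelves, calyces, and ureters are not dilated on either side, which rules out obstructive disease.
The kidneys are abnormal in position, with the left kidney slightly low-lying. Both kidneys are normal in shape. Multiple low-density cystic lesions are seen in both kidneys, the largest in the mid-pole of the right kidney, measuring about 13 mm × 17 mm; none shows abnormal enhancement. There is no dilatation of the renal pelves, calyces, or ureters.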

Overview

Automated radiology report generation (RRG) is a promising application of AI that aims to assist radiologists by producing detailed textual reports from medical images like CT scans. However, this task faces two major hurdles:

  1. Complex Information Extraction: Efficiently extracting all relevant diagnostic information from large, variable-resolution imaging data is computationally intensive and prone to information loss.
  2. Subjective Evaluation: Traditional text-generation metrics often fail to capture the clinical accuracy and semantic meaning that are crucial for medical reports.

To address these challenges, we introduce μ²LLM, a multi-scale multimodal large language model. At its core is the novel μ² Tokenizer, an intermediate layer that intelligently fuses visual features from CT scans with textual information. The model is further refined using Direct Preference Optimization (DPO), guided by the specialized medical report evaluation metric, GREEN, to ensure the generated reports align with expert standards.

Our experimental results on four large-scale CT datasets show that μ²LLM outperforms existing methods, highlighting its potential for generating high-quality radiology reports even with limited training data.


Key Contributions

  1. μ²LLM Framework: We propose a novel multi-modal large language model (MLLM) designed to efficiently preserve critical details from medical imaging by integrating guided questions.
  2. μ² Tokenizer Layer: The core of our framework is the μ² Tokenizer, an intermediate layer that uses multi-level attention and multi-scale aggregation to refine and fuse visual and text embeddings, maximizing semantic correspondence while maintaining computational efficiency.
  3. Enhanced Training with DPO: We employ Direct Preference Optimization (DPO) to align our model's outputs with expert-validated clinical accuracy. The preference data is curated using the GREEN score, a robust LLM-based metric for evaluating the clinical accuracy of radiology reports.
  4. State-of-the-Art Performance: Despite its smaller parameter size (1B), our model consistently outperforms larger baseline models (4B to 14B) across multiple datasets, demonstrating the effectiveness of our approach.
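
As a rough illustration of contribution 3, the sketch below shows one way GREEN scores could be turned into preference pairs, and how the standard DPO objective scores a single pair. The function names, the `min_gap` threshold, and `beta=0.1` are assumptions made for illustration; they are not taken from our training configuration.

```python
import math

def build_preference_pair(candidates, green_scores, min_gap=0.05):
    """Pick the highest- and lowest-scoring candidate reports as (chosen, rejected).

    Returns None when the GREEN gap is below `min_gap` (a hypothetical threshold),
    since near-ties carry little preference signal.
    """
    ranked = sorted(zip(green_scores, candidates), key=lambda pair: pair[0])
    (low, rejected), (high, chosen) = ranked[0], ranked[-1]
    if high - low < min_gap:
        return None
    return chosen, rejected

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy and under the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both reports the margin is zero and the loss equals log 2; it falls below that as the policy shifts probability toward the report with the higher GREEN score.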



Model Architecture

The μ²LLM framework integrates a 3D Vision Transformer (ViT3D) as the image encoder and a large language model (LLM) for report generation. The key innovation, the μ² Tokenizer, acts as a bridge between them.

Overall Pipeline (μ²LLM)

Fig. 1: Overview of the proposed μ²LLM model, which is centered on the μ² Tokenizer layer, for high-quality RRG.

  1. Input: A 3D CT scan and a related textual question (e.g., "Can you provide a diagnosis based on the findings in the abdomen?").
  2. Image Encoder (ViT3D): The CT scan is split into multiple frames to avoid information loss from downsampling. The ViT3D processes these frames to extract a sequence of visual tokens.
  3. μ² Tokenizer: This layer takes the visual tokens from the encoder and the tokenized question. It uses multi-scale attention mechanisms to fuse the two modalities, producing a compact and information-rich set of embeddings.
  4. LLM: The final image embeddings are passed to the LLM along with the original question to generate the detailed radiology report.
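
To give a feel for why a compact tokenizer is useful between the encoder and the LLM, the following back-of-the-envelope sketch counts the visual tokens produced when a volume is split into frames and each frame is cut into non-overlapping 3D patches. The frame depth, patch size, and volume shape are hypothetical values chosen for illustration, not our actual configuration.

```python
import math

def visual_token_count(volume_shape, frame_depth, patch):
    """Return (n_frames, tokens_per_frame, total_tokens).

    volume_shape: (depth, height, width) of the CT volume in voxels.
    frame_depth:  number of slices per frame; the last frame is assumed padded.
    patch:        (pd, ph, pw) size of one non-overlapping 3D patch.
    """
    depth, height, width = volume_shape
    pd, ph, pw = patch
    n_frames = math.ceil(depth / frame_depth)
    tokens_per_frame = (frame_depth // pd) * (height // ph) * (width // pw)
    return n_frames, tokens_per_frame, n_frames * tokens_per_frame

# Hypothetical example: a 70-slice 256x256 scan, 32-slice frames, 4x16x16 patches.
frames, per_frame, total = visual_token_count((70, 256, 256), 32, (4, 16, 16))
```

Under these assumed sizes the encoder would emit several thousand visual tokens per scan, which is the sequence the μ² Tokenizer condenses before it reaches the LLM.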

The μ² Tokenizer Module

The μ² Tokenizer is built upon the Linear Video Tokenizer (LinVT) and introduces three key improvements:

  • Relative Positional Encoding (RPE): Instead of absolute positions, we use relative positional encodings within the attention mechanism. This allows the model to better capture local relationships between different areas of the 3D scan, which is crucial for identifying anatomical patterns.
  • Differentiable Token Selection (DTS): Traditional "hard" top-k token selection can lead to information loss and slow optimization. We replace it with a fully differentiable "soft" selection, which computes a weighted sum of all tokens. This preserves more information and allows gradients to flow back to all visual tokens, improving training stability.
  • Dynamic Multi-scale Pooling (DMTP): Rather than using fixed pooling kernel sizes, our dynamic approach allows the network to learn how to weight and select the most appropriate pooling strategy based on the input, making the feature extraction process more adaptive and effective.
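
The contrast between hard top-k and soft selection, and the idea of gating several pooling scales, can be sketched in plain Python as follows. This is a simplified illustration under assumptions: tokens are lists of floats, relevance scores and gate logits are passed in rather than learned, and the gated concatenation in `dynamic_multiscale_pool` is one plausible reading of the idea rather than the exact formulation used in the μ² Tokenizer.

```python
import math

def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def hard_top_k(tokens, scores, k):
    """Baseline: keep the k highest-scoring tokens (original order), drop the rest.
    Dropped tokens receive no gradient in a real network."""
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]

def soft_select(tokens, score_rows, temperature=1.0):
    """Soft selection: each output token is a softmax-weighted sum over ALL inputs.
    score_rows[j][i] is the relevance of input token i for output slot j."""
    dim = len(tokens[0])
    selected = []
    for row in score_rows:
        weights = softmax(row, temperature)
        selected.append([sum(w * tok[c] for w, tok in zip(weights, tokens))
                         for c in range(dim)])
    return selected

def avg_pool(tokens, kernel):
    """Average non-overlapping windows of `kernel` tokens along the sequence."""
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), kernel):
        window = tokens[start:start + kernel]
        pooled.append([sum(t[c] for t in window) / len(window) for c in range(dim)])
    return pooled

def dynamic_multiscale_pool(tokens, kernels, gate_logits):
    """Pool at several scales and weight each scale by a softmax gate.
    In a trained model the gate logits would be predicted from the input."""
    gates = softmax(gate_logits)
    output = []
    for gate, kernel in zip(gates, kernels):
        output.extend([gate * v for v in tok] for tok in avg_pool(tokens, kernel))
    return output
```

Because `soft_select` mixes every input token into every output slot, lowering `temperature` moves it toward hard selection while keeping the operation differentiable.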

Prompt Engineering



In our workflow, we applied three prompt-engineering techniques—CT Report Rewriting, CT Report Reasoning Synthesis, and CT Report Translation—with particular emphasis on CT Report Reasoning Synthesis. The concrete process and involved prompts are listed in Appendix A.

Our CT Report Reasoning Synthesis pipeline converts each CT scan and its free-text radiology report into a rich supervisory package for multimodal learning by sequentially prompting a single large language model in five roles:

First, the question-generation stage reads the full report (findings and impression) and asks the LLM to propose a diverse collection of natural-language questions that a radiologist, trainee or downstream AI system might reasonably ask. Prompt constraints force coverage across lesion attributes, anatomical localisation, diagnostic certainty, and suggested follow-up, giving each study a rich inquiry space.

Second, each question is paired with the original report and resubmitted to the LLM under a “think-step-by-step” instruction. The model must reason explicitly, citing exact report fragments or well-established imaging priors, before providing a concise answer. The resulting tuples (question, answer, and raw reasoning) capture both knowledge and justification in a single pass.

Third, an automatic quality gate re-examines every tuple. A second LLM pass checks factual consistency between answer and report, heuristics reject non-English or vacuous chains-of-thought, and domain-specific rules eliminate pathophysiologic contradictions (for example, claiming a pneumothorax is “improved” when it is first detected). Only tuples that survive all three filters remain.

Fourth, accepted reasoning traces are refined: the LLM compresses them into short, evidence-linked paragraphs whose citations reference specific report lines. Redundancy is pruned, hedging language is toned down and, where appropriate, probabilistic qualifiers are inserted to reflect clinical uncertainty in a calibrated fashion.

Finally, the pipeline fuses all refined traces into a single, structured “report-thinking” narrative. The LLM merges overlapping rationale, orders arguments anatomically and separates them into Findings Rationale, Impression Rationale and Follow-up Rationale sections. The finished datapoint therefore contains a CT volume, its VQA pairs (with answers) and a coherent explanation grounding every key statement, enabling scalable training of multimodal models that can answer questions and justify their answers with radiologic evidence.
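
The five stages above can be sketched as a single function that calls one LLM under different role prompts. This is a minimal sketch under assumptions: `llm` is a hypothetical prompt-in, text-out callable, the prompt strings are placeholders rather than the prompts from Appendix A, the `ANSWER:` marker and the three-word minimum are illustrative conventions, and the domain-specific contradiction rules of the quality gate are omitted.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

def synthesize_reasoning(report: str, llm: LLM, max_questions: int = 5) -> Dict[str, object]:
    """Run the five stages in sequence for one report (illustrative prompts only)."""
    # Stage 1: question generation over the full report.
    raw = llm(f"ROLE: question-generator\nREPORT:\n{report}\n"
              f"List up to {max_questions} diverse questions, one per line.")
    questions = [q.strip() for q in raw.splitlines() if q.strip()][:max_questions]

    accepted: List[Dict[str, str]] = []
    for question in questions:
        # Stage 2: step-by-step answering grounded in the report.
        reply = llm(f"ROLE: answerer\nREPORT:\n{report}\nQUESTION: {question}\n"
                    "Think step by step, then write 'ANSWER:' and a concise answer.")
        reasoning, marker, answer = reply.partition("ANSWER:")
        reasoning, answer = reasoning.strip(), answer.strip()

        # Stage 3a: cheap heuristic gate for malformed or vacuous reasoning.
        if not marker or not answer or len(reasoning.split()) < 3:
            continue
        # Stage 3b: LLM consistency check of the answer against the report.
        verdict = llm(f"ROLE: verifier\nREPORT:\n{report}\nANSWER: {answer}\n"
                      "Reply with exactly CONSISTENT or INCONSISTENT.")
        if verdict.strip().upper() != "CONSISTENT":
            continue

        # Stage 4: compress the surviving reasoning trace.
        refined = llm(f"ROLE: refiner\nREPORT:\n{report}\nREASONING:\n{reasoning}\n"
                      "Rewrite as a short, evidence-linked paragraph.").strip()
        accepted.append({"question": question, "answer": answer, "reasoning": refined})

    # Stage 5: fuse all refined traces into one structured narrative.
    narrative = ""
    if accepted:
        joined = "\n\n".join(item["reasoning"] for item in accepted)
        narrative = llm(f"ROLE: fuser\nTRACES:\n{joined}\n"
                        "Merge into Findings, Impression and Follow-up Rationale sections.").strip()
    return {"vqa": accepted, "report_thinking": narrative}
```

Running the cheap heuristic check before the verifier call avoids spending an LLM request on tuples that would be rejected anyway.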

Results

Our model was evaluated on several benchmark datasets against various high-performing LLMs. We used both traditional metrics (ROUGE, METEOR, BERTScore) and the advanced LLM-based GREEN score, which measures clinical accuracy.
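
For readers unfamiliar with the overlap-based metrics, the sketch below computes a bare-bones ROUGE-1 F1 from unigram counts. It assumes lowercased whitespace tokenization and omits the stemming and tokenization rules of standard ROUGE packages, so its values will not match the scores reported here exactly; GREEN, being LLM-based, cannot be reduced to a formula like this.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated report and a reference report."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A report that shares most of its words with the reference scores highly even when it gets a finding wrong (for example, "no pleural effusion" versus "large pleural effusion"), which is the motivation for also reporting a clinically oriented metric.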

Performance Comparison

μ²LLM achieves state-of-the-art results across all datasets, significantly outperforming larger models such as LaMed-Llama-2-7B and RadFM-14B. DPO fine-tuning guided by GREEN scores further boosted performance, with a notable 20% improvement in the GREEN score on average.

Table 1: Performance Comparison Across Different Datasets

Dataset        Model                ROUGE-1  GREEN  METEOR  BERTScore
Abdomen Atlas  LaMed-Phi-3-4B       0.136    0.011  0.058   0.807
               LaMed-Llama-2-7B     0.139    0.009  0.060   0.810
               RadFM-14B            0.037    0.000  0.013   0.794
               RadGPT-N             0.247    0.112
               μ²LLM-1B (SFT)       0.529    0.281  0.295   0.891
               μ²LLM-1B (SFT&DPO)   0.567    0.346  0.319   0.895
CT-Rate        LaMed-Phi-3-4B       0.130    0.002  0.050   0.814
               LaMed-Llama-2-7B     0.103    0.001  0.048   0.815
               RadFM-14B            0.054    0.014  0.017   0.812
               CT-CHAT-8B           0.294    0.113  0.221   0.815
               μ²LLM-1B (SFT)       0.517    0.384  0.330   0.879
               μ²LLM-1B (SFT&DPO)   0.539    0.429  0.359   0.890
AMOS-MM        LaMed-Phi-3-4B       0.126    0.009  0.047   0.821
               LaMed-Llama-2-7B     0.163    0.009  0.065   0.823
               RadFM-14B            0.046    0.001  0.015   0.812
               μ²LLM-1B (SFT)       0.421    0.339  0.249   0.881
               μ²LLM-1B (SFT&DPO)   0.459    0.400  0.876   0.881

Ablation Study

Ablation experiments confirmed that each component of the μ² Tokenizer improves the GREEN score over the baseline. Differentiable Token Selection (DTS) provided the largest single-component gain, raising the GREEN score from 0.204 to 0.299 (+0.095).

Table 2: Ablation Study on μ² Tokenizer Components

Model                BLEU   ROUGE-1  GREEN  METEOR  BERTScore
Baseline             0.190  0.405    0.204  0.210   0.864
+RPE                 0.281  0.421    0.277  0.236   0.880
+DTS                 0.271  0.411    0.299  0.240   0.888
+DMTP                0.254  0.401    0.233  0.220   0.874
μ²LLM-1B (SFT)       0.279  0.421    0.339  0.249   0.881
μ²LLM-1B (SFT&DPO)   0.336  0.459    0.400  0.876   0.881

BibTeX

        
@misc{li2025mu2tokenizerdifferentiablemultiscalemultimodal,
  title={${\mu}^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation},
  author={Siyou Li and Pengyao Qin and Huanan Wu and Dong Nie and Arun J. Thirunavukarasu and Juntao Yu and Le Zhang},
  year={2025},
  eprint={2507.00316},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.00316},
}