What Is DDSM? The Mammography Database Explained

DDSM stands for the Digital Database for Screening Mammography, a publicly available collection of 2,620 mammogram studies used to develop and test computer software that helps detect breast cancer. Originally funded by the U.S. Department of Defense Breast Cancer Research Program, it has become one of the most widely referenced datasets in breast cancer imaging research.

What the Database Contains

The DDSM includes mammogram images across three categories: normal, benign, and malignant. Each case comes with verified pathology information, meaning the diagnosis was confirmed through biopsy or other clinical follow-up rather than relying solely on a radiologist’s visual interpretation. This verification is what makes the dataset useful for training algorithms: the software can learn from images where the correct answer is already known.

The original images were scanned from film mammograms using laser scanners, producing 12-bit images with 4,096 possible gray levels per pixel. Spatial resolution varied depending on the scanner used, with roughly 30% of images at 0.085 mm resolution and the remaining 70% at 0.150 mm. The files were stored in a lossless JPEG format, which compressed the images without discarding any of the original pixel data.
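The bit-depth figure is simple arithmetic: 12 bits encode 2^12 = 4,096 distinct values, from 0 to 4,095. The short Python sketch below shows that calculation, along with how a raw 12-bit value might be rescaled to the 0 to 1 range that many image pipelines expect. It also converts a pixel count to millimeters, on the assumption that the resolution figures above describe the size of one pixel. The function names and sample inputs are illustrative and are not part of DDSM.

```python
BIT_DEPTH = 12
GRAY_LEVELS = 2 ** BIT_DEPTH   # 4096 distinct intensity values
MAX_VALUE = GRAY_LEVELS - 1    # largest raw value, 4095


def normalize_pixel(raw_value: int) -> float:
    """Map a raw 12-bit intensity (0..4095) to the range 0.0..1.0."""
    if not 0 <= raw_value <= MAX_VALUE:
        raise ValueError(f"expected a value in 0..{MAX_VALUE}, got {raw_value}")
    return raw_value / MAX_VALUE


def pixels_to_mm(pixel_count: int, pixel_size_mm: float) -> float:
    """Physical length covered by a run of pixels.

    Assumes pixel_size_mm is the edge length of one pixel,
    for example 0.085 or 0.150.
    """
    return pixel_count * pixel_size_mm


print(GRAY_LEVELS)
print(normalize_pixel(2048))
print(round(pixels_to_mm(100, 0.085), 3), "mm")
print(round(pixels_to_mm(100, 0.150), 3), "mm")
```

Under that assumption, the same 100-pixel feature covers nearly twice the physical distance in a 0.150 mm image as in a 0.085 mm image, which is why the scanner used matters when measuring lesions or resizing images.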

Why It Matters for Breast Cancer AI

The DDSM gave researchers a common benchmark. Before standardized datasets like this existed, teams developing computer-aided detection (CAD) systems had no consistent way to compare their results against one another. With a shared set of labeled images, different approaches could be tested on the same data and ranked by accuracy.

The dataset has been used in hundreds of studies. Object detection systems trained on DDSM data have reported accuracy rates around 85% for identifying suspicious masses. More recent deep learning approaches that combine neural networks with image generation techniques have reported accuracy of 89% on DDSM images. One system using the curated version of the dataset reported a 99% success rate in detecting and classifying breast masses. These figures come from different tasks and methodologies, so they are not directly comparable, and performance varies significantly from one study to the next.

These CAD systems are not designed to replace radiologists. They function as a second opinion, flagging areas in a mammogram that warrant closer inspection and helping clinicians catch abnormalities they might otherwise miss.

The Curated Version: CBIS-DDSM

The original DDSM, while groundbreaking, had practical limitations. Its images were stored in an older lossless JPEG format that is difficult to open with modern software tools. The annotations marking the locations of masses and calcifications also needed updating to meet current standards in computer vision research.

To address this, researchers created the CBIS-DDSM (Curated Breast Imaging Subset of DDSM). This updated version includes images that have been decompressed and converted to DICOM format, refreshed segmentation outlines and bounding boxes drawn by trained mammographers, and pathologic diagnoses formatted to match modern machine learning workflows. It contains 753 calcification cases and 891 mass cases, each with benign or malignant labels. Its layout follows the conventions of modern computer vision datasets, which makes it easier to use with current deep learning tools.
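With a labeled collection like this, a common first step is to count how many cases fall under each abnormality type and label. The sketch below does that using only the Python standard library. The column names and the three sample rows are invented for illustration; the actual CBIS-DDSM metadata files may use different headers and label values, so check them before adapting this.

```python
import csv
import io
from collections import Counter

# Hypothetical metadata: these column names and rows are made up
# for this example and are not copied from CBIS-DDSM.
SAMPLE_CSV = """case_id,abnormality_type,pathology
case_001,mass,MALIGNANT
case_002,calcification,BENIGN
case_003,mass,BENIGN
"""


def count_labels(csv_text: str) -> Counter:
    """Count cases per (abnormality type, pathology label) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        (row["abnormality_type"], row["pathology"]) for row in reader
    )


counts = count_labels(SAMPLE_CSV)
for (kind, label), n in sorted(counts.items()):
    print(f"{kind:14s} {label:10s} {n}")
```

A tally like this makes class imbalance visible early, before any model is trained.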

How to Access the Data

The CBIS-DDSM is hosted by The Cancer Imaging Archive (TCIA), a public repository of de-identified medical images maintained at the University of Arkansas for Medical Sciences. The full download is approximately 163.5 GB and requires a dedicated tool called the NBIA Data Retriever. Access is free, though the size of the dataset means you’ll need substantial storage space and a reliable internet connection.
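Before starting a download of this size, it can help to confirm that the target drive has room and to estimate how long the transfer will take. The following is a minimal Python sketch under stated assumptions: the 20% headroom and the 100 Mbit/s connection speed are illustrative choices, a gigabyte is taken as 10^9 bytes, and the helper names are hypothetical. It does not interact with TCIA or the NBIA Data Retriever.

```python
import shutil

DATASET_SIZE_GB = 163.5   # approximate size quoted above
HEADROOM = 1.2            # assumption: keep 20% extra space free


def has_room(path: str, required_gb: float, headroom: float = HEADROOM) -> bool:
    """Return True if the drive holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    required_bytes = required_gb * headroom * 1e9
    return free_bytes >= required_bytes


def download_hours(size_gb: float, speed_mbit_per_s: float) -> float:
    """Idealized transfer time in hours, ignoring protocol overhead."""
    size_megabits = size_gb * 1000 * 8
    return size_megabits / speed_mbit_per_s / 3600


print("Enough space here:", has_room(".", DATASET_SIZE_GB))
print("Hours at 100 Mbit/s:", round(download_hours(DATASET_SIZE_GB, 100), 1))
```

The time estimate is a lower bound; real transfers are usually slower than the nominal line speed.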

The original DDSM has largely been superseded by the curated version for active research, but both remain available. Researchers working on breast cancer detection, image classification, or medical AI benchmarking continue to use CBIS-DDSM alongside newer datasets like INbreast and mini-MIAS, often testing their models across multiple collections to demonstrate generalizability.