What Is Padding in CNN: Valid vs. Same Explained

Padding in a CNN (convolutional neural network) is the practice of adding extra pixels around the border of an input image or feature map before applying a convolutional filter. Without padding, every convolutional layer shrinks the output dimensions, which means you lose spatial information and limit how many layers you can stack. Padding solves this by controlling the output size and preserving data at the edges of an image.

Why Convolution Shrinks Your Output

To understand why padding exists, you need to see the problem it fixes. When a convolutional filter slides across an image, it can only be centered on pixels that have enough surrounding neighbors to fill the filter window. A 3×3 filter needs one pixel of context on every side. A 5×5 filter needs two pixels of context on every side. This means the filter can never be centered on the very edge pixels of the original image.

If you run a 3×3 filter across a 6×6 image with no padding, the output shrinks to 4×4. Run another 3×3 convolution, and you’re down to 2×2. After just a few layers, your feature maps have collapsed to almost nothing. You’ve also systematically underrepresented the edge and corner pixels, since the filter passes over them far fewer times than it passes over pixels near the center. In tasks like object detection or segmentation, where information at the borders matters, this is a real problem.
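
The shrinkage is easy to verify with a minimal NumPy convolution (a naive stride-1 sketch for illustration, not an optimized implementation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive stride-1 convolution with no padding ("valid")."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + k, j:j + k] * kernel).sum()
    return out

img = np.random.rand(6, 6)
f = np.ones((3, 3))
once = conv2d_valid(img, f)    # shape (4, 4)
twice = conv2d_valid(once, f)  # shape (2, 2)
```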

How Padding Works

Padding adds a border of extra pixels around the input before the filter starts sliding. The most common approach fills that border with zeros, which is why you’ll often see it called zero padding. If you add one pixel of padding on each side, a 6×6 input becomes an 8×8 grid before convolution. A 3×3 filter then produces a 6×6 output, matching the original input size exactly.
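
A quick sketch with NumPy’s np.pad shows the arithmetic:

```python
import numpy as np

x = np.ones((6, 6))                 # original 6x6 input
padded = np.pad(x, 1)               # one pixel of zeros on every side -> 8x8
out_size = padded.shape[0] - 3 + 1  # 3x3 valid convolution over the padded grid
print(padded.shape, out_size)       # (8, 8) 6
```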

The general formula for the output dimension of a convolutional layer is:

Output size = (Input size − Filter size + 2 × Padding) / Stride + 1

Stride is how many pixels the filter moves at each step (usually 1). With this formula, you can calculate exactly how much padding you need to achieve a target output size.
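
As a sketch, the formula translates directly into a one-line helper (integer division here assumes the sizes divide evenly):

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output dimension of a convolution: (n - k + 2p) / s + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

conv_output_size(6, 3)              # 4  (valid, matches the example above)
conv_output_size(6, 3, padding=1)   # 6  (same)
conv_output_size(28, 5, padding=2)  # 28 (same for a 5x5 filter)
```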

Valid Padding vs. Same Padding

Most deep learning frameworks give you two standard options, often labeled “valid” and “same.”

  • Valid padding means no padding at all. The filter only visits positions where it fits entirely within the input. The output is always smaller than the input, shrinking by (filter size − 1) pixels in each dimension. This is the default in some frameworks and works fine when you don’t mind the size reduction.
  • Same padding adds just enough zeros around the border so the output has the same height and width as the input (assuming a stride of 1). For a 3×3 filter, that means one pixel of padding on each side. For a 5×5 filter, two pixels. This is by far the most commonly used option in modern architectures because it lets you control the spatial dimensions explicitly through pooling layers rather than losing size at every convolution.

Some frameworks also support “full” padding, which pads by (filter size − 1) pixels on each side so the filter produces an output at every position where it overlaps the input at all. The output is actually larger than the input in this case (by filter size − 1 in each dimension), but it’s rarely used in standard classification or detection networks.

Why Zero Padding Is the Default

Filling the border with zeros is simple and effective. Zeros don’t contribute any signal when multiplied with the filter weights, so they act as neutral filler. The network learns to handle these padded regions naturally during training. The edge neurons may produce slightly different activation patterns than center neurons, but in practice this has minimal impact on performance.

Other padding strategies exist. Reflection padding mirrors the pixels at the edge of the image, so the border contains a flipped copy of the nearby content. Replication padding copies the nearest edge pixel outward. These alternatives show up in specific applications like style transfer or image generation, where zero-filled borders can introduce visible artifacts. For standard classification tasks, zero padding works well and adds no computational overhead.
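
NumPy’s np.pad supports all three strategies, which makes the difference easy to see on a tiny example:

```python
import numpy as np

a = np.array([1, 2, 3])
zero    = np.pad(a, 2)                  # zeros:       [0 0 1 2 3 0 0]
reflect = np.pad(a, 2, mode="reflect")  # reflection:  [3 2 1 2 3 2 1]
edge    = np.pad(a, 2, mode="edge")     # replication: [1 1 1 2 3 3 3]
```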

Padding in Practice

Nearly every modern CNN architecture relies on same padding throughout its convolutional layers. VGG and ResNet use 3×3 filters with one pixel of padding as their fundamental building block, and newer designs such as EfficientNet keep the same-padding convention in their convolutions. This design choice keeps the spatial resolution constant through each convolutional layer and delegates all downsampling to pooling layers or strided convolutions, giving the architect precise control over where and how the feature maps shrink.

When you’re building your own network, a useful rule of thumb: set padding to (filter size − 1) / 2 for any odd-sized filter with stride 1. That gives you same padding. For a 3×3 filter, padding is 1. For a 5×5 filter, padding is 2. For a 7×7 filter, padding is 3. Even-sized filters technically need asymmetric padding (different amounts on each side), which is one reason odd-sized filters are overwhelmingly preferred in practice.
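
The rule of thumb is a one-liner (a hedged sketch, not framework code); the assertion documents why even filters fall outside it:

```python
def same_padding(filter_size):
    """Per-side padding that preserves size for an odd filter at stride 1."""
    assert filter_size % 2 == 1, "even filters need asymmetric padding"
    return (filter_size - 1) // 2

[same_padding(k) for k in (3, 5, 7)]  # [1, 2, 3]
```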

In frameworks like PyTorch, you set padding as a numeric parameter when defining a convolutional layer. In TensorFlow and Keras, you typically just pass padding="same" or padding="valid" as a string and the framework calculates the pixel count for you.
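
As an illustrative sketch (assuming PyTorch is installed; the string form padding="same" requires PyTorch 1.9 or later):

```python
import torch
import torch.nn as nn

# Explicit pixel count: padding=1 gives "same" for a 3x3 filter at stride 1
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# String form, mirroring the Keras convention
conv_str = nn.Conv2d(3, 16, kernel_size=3, padding="same")

x = torch.zeros(1, 3, 32, 32)
print(tuple(conv(x).shape))  # spatial size preserved: (1, 16, 32, 32)
```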

How Padding Affects What the Network Learns

Padding isn’t just a bookkeeping trick for managing dimensions. It has a real effect on what the network can learn. Without padding, edge pixels contribute to far fewer output neurons than center pixels. A pixel in the corner of a 28×28 image with a 3×3 filter contributes to exactly one output neuron, while a center pixel contributes to nine. This creates an implicit bias where the network pays less attention to the borders of the image.

With same padding, every original pixel contributes to far more output neurons than it would under valid padding: a corner pixel feeds four output neurons with a 3×3 filter instead of just one, while a center pixel still feeds nine. The counts aren’t perfectly equal (only full padding achieves that), but no spatial position is systematically ignored. Corner information still has zeros as neighbors, so the learned features at the edges may differ slightly from those in the interior. For tasks where objects can appear anywhere in the frame, this more uniform treatment of the input space helps the network generalize.
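
A small counting sketch makes the contribution gap concrete (window_count is a hypothetical helper, not library code); note that a corner pixel under same padding reaches four output neurons rather than the full nine, so contributions become far more uniform, though not perfectly equal:

```python
def window_count(i, n, k, pad):
    """Per-axis number of stride-1 k-wide windows covering input position i
    on an axis of length n padded by `pad` pixels per side."""
    i += pad                        # index inside the padded axis
    first = max(0, i - k + 1)       # earliest window start that covers i
    last = min(i, n + 2 * pad - k)  # latest window start that covers i
    return last - first + 1

n, k = 28, 3
window_count(0, n, k, 0) ** 2   # 1: corner pixel, valid padding
window_count(0, n, k, 1) ** 2   # 4: corner pixel, same padding
window_count(14, n, k, 1) ** 2  # 9: center pixel, same padding
```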

In very deep networks with dozens or hundreds of layers, the cumulative size reduction from valid padding would be catastrophic. A 224×224 input image passed through 50 layers of 3×3 convolutions with no padding would shrink to 124×124, losing roughly 70% of its spatial area. Same padding eliminates this problem entirely, making it practical to build networks with hundreds of convolutional layers while maintaining spatial resolution until you intentionally reduce it.
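
The arithmetic is worth checking with a trivial sketch:

```python
size = 224
for _ in range(50):     # fifty 3x3 valid convolutions in sequence
    size = size - 3 + 1  # each layer trims one pixel per side

print(size)                    # 124
print(1 - (size / 224) ** 2)   # ~0.69 of the spatial area is gone
```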