
Strategies for Docker image optimization

Background

Docker, an enterprise container platform, is a developer favorite due to its flexibility and ease of use. It makes it easy to create, deploy, and run applications inside containers. With containers, you can gather an application and its core dependencies into a single package, turn it into a Docker image, and replicate it. Docker images are built from Dockerfiles, where you define what the image should look like, including the base operating system and the commands to run.

However, large Docker images lengthen the time it takes to build and share images between clusters and cloud providers. When creating applications, it’s therefore worth optimizing Docker Images and Dockerfiles to help teams share smaller images, improve performance, and debug problems. A lot of verified images available on Docker Hub are already optimized, so it is always a good idea to use ready-made images wherever possible. If you still need to create an image of your own, you should consider several ways of optimizing it for production.

Task description

As part of a larger project, we were asked to propose ways to optimize Docker images to improve performance. There are several strategies to decrease the size of Docker images for production. In this research project, we explored different options to find those that yield the biggest performance boost with the least effort.

Our approach

By optimization of Docker images, we mean two general strategies:

  • reducing the time of image building to speed up the CI/CD flow;
  • reducing the image size to speed up the image pull operations and cut costs of storing build artifacts.

Therefore, we proceeded along these two directions, trying to improve overall performance. But first, we need some tools to measure how effective our process is and to find bottlenecks.

Inspection techniques

Docker image size inspection

You can review your Docker image creation history layer by layer and see the size of each layer. This will allow you to focus on the most significant parts to achieve the biggest reduction in size.

Command:

$ docker image history img_name

Example output:

IMAGE          CREATED             CREATED BY                                      SIZE
b91d4548528d   34 seconds ago      /bin/sh -c apt-get install -y python3 python…   140MB
f5b439869a1b   2 minutes ago       /bin/sh -c apt-get install -y wget              7.42MB
9667e45447f6   About an hour ago   /bin/sh -c apt-get update                       27.1MB
a2a15febcdf3   3 weeks ago         /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B
<missing>      3 weeks ago         /bin/sh -c mkdir -p /run/systemd && echo 'do…   7B
<missing>      3 weeks ago         /bin/sh -c set -xe && echo '#!/bin/sh' > /…     745B
<missing>      3 weeks ago         /bin/sh -c [ -z "$(apt-get indextargets)" ]     987kB
<missing>      3 weeks ago         /bin/sh -c #(nop) ADD file:c477cb0e95c56b51e…   63.2MB

Docker build time inspection

When it comes to measuring the timings of Dockerfile steps, the most expensive steps are COPY/ADD and RUN. The duration of COPY and ADD commands cannot be reviewed directly (unless you are going to manually start and stop timers), but it corresponds to the layer size, so just check the layer size using docker history and try to optimize it.

As for RUN, it is possible to slightly modify the command inside to include a call to the `time` command, which outputs how long it took:

RUN time apt-get update

But this requires many changes in the Dockerfile and looks clumsy, especially for commands combined with &&.

Fortunately, there’s a way to do that with a simple external tool called gnomon.

Install Node.js with npm and do the following:

sudo npm i -g gnomon

docker build . | gnomon

The output will show you how long each step took:

0.0001s Step 34/52 : FROM node:10.16.3-jessie as node_build
0.0000s  ---> 6d56aa91a3db
0.1997s Step 35/52 : WORKDIR /tmp/
0.1999s  ---> Running in 4ed6107e5f41

Clean build vs repetitive builds

One of the most interesting pieces of information you can gather is how your build process performs when you run it for the first time and when you run it several times in a row with minimal changes to source code or with no changes at all.

In an ideal world, subsequent builds should be blazingly fast and use as many cached layers as possible. When no changes were introduced at all, it is better to avoid running docker build entirely; this can be achieved with external build tools that support up-to-date checks, such as Gradle. And for small changes, the amount of additional data produced should ideally be proportionally small.

This is not always possible, or it might require too much effort, so you should decide how important it is for you: which changes you expect to happen often and what is going to stay unchanged, what the overhead of each build is, and whether that overhead is acceptable.
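As a quick way to quantify this, you can time a clean build against a cached rebuild back to back (a minimal sketch; the my_app:test tag is illustrative):

time docker build --no-cache -t my_app:test .   # clean build, ignoring the layer cache
time docker build -t my_app:test .              # repeated build, should mostly hit the cache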

And now let’s think of the ways to reduce build time and storage overheads.

Reducing the image size

Base image with a smaller footprint

It is always wise to choose a lightweight alternative for a base image. In many cases, one can already be found among existing images:

FROM ubuntu:14.04 # 188 MB

FROM ubuntu:18.04 # 64.2MB

There are even more lightweight alternatives for Ubuntu, for example Alpine Linux:

FROM alpine:3 # 5.58MB

However, you need to check whether you depend on Ubuntu-specific packages or on the libc implementation (Alpine Linux uses musl instead of glibc). See the comparison table.
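Note that switching to Alpine also changes the package manager from apt to apk, and package names may differ; a minimal sketch (python3 is used here only as an example):

FROM alpine:3
RUN apk add --no-cache python3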

Cleanup commands

Another useful strategy to reduce the size of the image is to add cleanup commands to the same RUN instruction that installs packages. For example, the commands below clean up the temporary apt files left after the package installation:

RUN apt-get install -y \
    unzip \
    wget && \
    rm -rf /var/lib/apt/lists/* && \
    apt-get purge --auto-remove && \
    apt-get clean

If your toolkit does not provide tools for cleaning up, you can use the rm command to manually remove obsolete files.

RUN wget www.some.file.xz && unzip www.some.file.xz && rm www.some.file.xz

Cleanup commands need to appear in the RUN instruction that creates temporary files/garbage. Each RUN command creates a new layer in the filesystem, so subsequent cleanups do not affect previous layers.
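For illustration, compare splitting a download and its cleanup across two RUN instructions with combining them into one (the file name is a placeholder); only the second variant keeps the archive out of the image:

# the archive is baked into the first layer; the rm in a later layer only hides it
RUN wget www.some.file.xz
RUN rm www.some.file.xz

# download and cleanup share one layer, so the archive never reaches the image
RUN wget www.some.file.xz && rm www.some.file.xz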

Static builds of libraries

It is well known that static builds usually reduce both time and space, so it is useful to look for a static build of the C libraries you rely on.

Static build:

RUN wget -q https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz && \  
tar xf ffmpeg-git-amd64-static.tar.xz && \  
mv ./ffmpeg-git-20190902-amd64-static/ffmpeg /usr/bin/ffmpeg && \  
rm -rfd ./ffmpeg-git-20190902-amd64-static && \  
rm -f ./ffmpeg-git-amd64-static.tar.xz # 74.9MB

Dynamic build:

RUN apt-get install -y ffmpeg # 270MB

Only necessary dependencies

The package manager usually comes up with recommended settings and dependencies that can be tempting to accept. However, many of these dependencies are redundant, making the image unnecessarily heavy. It is a good practice to use the --no-install-recommends flag for the apt-get install command to avoid installing “recommended” but unnecessary dependencies. If you do need some of the recommended packages, you can always install them explicitly.

RUN apt-get install -y python3-dev # 144MB

RUN apt-get install --no-install-recommends -y python3-dev # 138MB

No pip caching

As a rule, a cache directory speeds up repeated installations by caching commonly used files. In a Docker image, however, we usually install all requirements once, which makes the cache directory redundant. To avoid creating it, use the --no-cache-dir flag for the pip install command, reducing the size of the resulting image.

RUN pip3 install flask # 4.55MB

RUN pip3 install --no-cache-dir flask # 3.84MB

Multi-stage builds

A multi-stage build is a new feature requiring Docker 17.05 or higher. With multi-stage builds, you can use multiple FROM statements in your Dockerfile. Each FROM instruction can use a different base, and each begins a new stage of the build. You can selectively copy artifacts from one stage to another, leaving behind everything you don’t want in the final image.

FROM ubuntu:18.04 AS builder  
RUN apt-get update  
RUN apt-get install -y wget unzip  
RUN wget -q https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz && \ 
tar xf ffmpeg-git-amd64-static.tar.xz && \  
mv ./ffmpeg-git-20190902-amd64-static/ffmpeg /usr/bin/ffmpeg && \  
rm -rfd ./ffmpeg-git-20190902-amd64-static && \ 
rm -f ./ffmpeg-git-amd64-static.tar.xz  
  

FROM ubuntu:18.04  
COPY --from=builder /usr/bin/ffmpeg /usr/bin/ffmpeg   
# The builder image itself will not affect the final image size;
# the final image grows only by the size of the /usr/bin/ffmpeg file

Intermediate images cleanup

Although the builder stage image does not affect the final image size, it still consumes disk space on your build agent machine. The most straightforward way to reclaim it is to call:

docker image prune

But this will also remove all other dangling images, which might still be needed for other purposes. So here is a safer approach to removing intermediate images: add a label to all the intermediate images and then prune only the images with that label.

FROM ubuntu:18.04 AS builder  
LABEL my_project_builder=true

docker image prune --filter label=my_project_builder=true
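On a CI agent, the confirmation prompt can be skipped with the -f/--force flag:

docker image prune -f --filter label=my_project_builder=true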

Use incremental layers

Multi-stage builds are a powerful instrument, but the final stage always involves COPY commands from intermediate stages to the final image, and those file sets can be quite big. If you have a huge project, you might want to avoid creating a full-sized layer on every build and instead take the previous image and append only the few files that have changed.

Unfortunately, the COPY command always creates a layer of the same size as the copied file set, even if almost all the files match the previous image. Thus, the way to implement incremental layers is to introduce one more intermediate stage based on the previous image. To build the diff layer, rsync can be used.

FROM my_project:latest AS diff_stage  
LABEL my_project_builder=true  
RUN cp -r /opt/my_project /opt/my_project_base

COPY --from=builder /opt/my_project /opt/my_project   
RUN patch.sh /opt/my_project /opt/my_project_base /opt/my_project_diff

FROM my_project:latest  
# append only the files that changed since the previous image
COPY --from=diff_stage /opt/my_project_diff /opt/my_project

Where patch.sh is the following:

#!/bin/bash
# Usage: patch.sh <new_dir> <base_dir> <diff_dir>
# Copies every file that differs between <new_dir> and <base_dir> into <diff_dir>

rm -rf $3
mkdir -p $3
pushd $1

IFS='
'
for file in `rsync -rn --out-format="%f" ./ $2`; do
    [ -d "$file" ] || cp --parents -t $3 "$file"
done

popd

For the first build, you will have to initialize the my_project:latest image by tagging the base image with the corresponding target tag:

docker tag ubuntu:18.04 my_project:latest

Do this again every time you want to reset the layers and start incrementing from scratch. This is important if you are not going to store old builds forever, because hundreds of patch layers might consume more space than ten full images.
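Between resets, each successful build should itself be tagged as my_project:latest so that the next build increments on top of it (a minimal sketch):

docker build -t my_project:latest .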

Also, in the snippets above we assumed that rsync is already included in the builder's base image, to avoid spending extra time installing it on every build; if it is not, it can be baked in once, as sketched below. The next section presents several more ways to save build time.
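A minimal sketch of baking rsync into a base image (assuming an Ubuntu base; adapt to your actual base):

FROM ubuntu:18.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends rsync && \
    rm -rf /var/lib/apt/lists/*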

Reducing the build time

Common base image

The most obvious way to reduce build time is to extract common packages and commands from several projects into a common base image. For example, we can use the same image for all projects that are based on Ubuntu/Python3 and depend on the unzip and wget packages.

A common base image:

FROM ubuntu:18.04
RUN apt-get update && \
    apt-get install -y python3-pip python3-dev python3-setuptools unzip wget

A specific image:

FROM your-docker-base
RUN wget www.some.file
CMD ["python3", "your_app.py"]
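The common base image has to be built and tagged once so that project Dockerfiles can reference it in their FROM line (image and file names here are illustrative):

docker build -t your-docker-base -f Dockerfile.base .
docker build -t your-app .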

.dockerignore

To prevent copying unnecessary files from the host, you can use a .dockerignore file that lists local files and directories that should never end up in the image, such as .git, .idea, local virtualenvs, etc.
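A minimal .dockerignore might look like this (entries are illustrative; adjust them to your project):

.git
.idea
venv/
__pycache__/
*.pyc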

Smarter layer caching

Docker caches filesystem layers, and in most cases each line in the Dockerfile produces a new layer. Since some layers are more likely to change than others, it is useful to reorder the commands by their probability of change, in ascending order. This technique saves time by rebuilding only the layers that have actually changed, so that source files, which change most often, are copied as late as possible.

Unordered command sequence:

FROM ubuntu:18.04
RUN apt-get update
COPY your_source_files /opt/project/your_source_files
RUN apt-get install -y --no-install-recommends python3

Ordered command sequence:

FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install -y --no-install-recommends python3
COPY your_source_files /opt/project/your_source_files

Dependencies caching

Sometimes, one of the most time-consuming steps for big projects is dependency downloading. It has to be performed at least once, but subsequent builds should use a cache. Layer caching can help here: simply separate the dependency download step from the actual build:

COPY project/package.json ./package.json

RUN npm i
COPY project/ ./

RUN npm run build

However, full dependency resolution will still happen as soon as you bump any version in the manifest. So, if slow resolution is a problem for you, here is one more approach.

Most dependency resolution systems, like NPM, pip, and Maven, support a local cache to speed up subsequent resolution. In a previous section, we described how to keep the pip cache out of the final image. But combined with the incremental layers approach, it is possible to keep the cache inside an intermediate image. Set up an image with rsync, add a label like `stage=deps`, and prevent that intermediate image from being removed by the cleanup:

docker images --filter label=my_project_builder=true --filter label=stage=deps --filter dangling=true --format "{{.ID}}" | xargs -i docker tag {} my_project/deps

Then let the builder stage use the my_project/deps image as its base, perform the build, and copy the compiled files to the final image.
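A rough sketch of how the stages could be wired together under this approach (stage names, paths, and the NPM commands are illustrative assumptions; my_project/deps is initialized once by tagging a suitable base image, just like my_project:latest above):

FROM my_project/deps AS deps            # previously tagged image carrying a warm npm cache
LABEL my_project_builder=true stage=deps
COPY project/package.json /opt/project/package.json
RUN cd /opt/project && npm i            # resolution reuses the local npm cache

FROM deps AS builder
COPY project/ /opt/project/
RUN cd /opt/project && npm run build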

Value added

Applying these optimization strategies allowed us to reduce the Docker image size by over 50%, significantly increasing the speed of image building and sharing.

Feel free to share your best practices of writing better Dockerfiles in the comments below.