Drupal quest for unused files

Submitted by admin on Sat, 02/18/2017 - 19:55

You have most likely been there too. The website is nothing out of the ordinary, quite a small one, but the space it occupies on the hard drive is out of all proportion to its apparent size, which inflates the backup costs the customer pays. The most obvious reason for the discrepancy is unused files, we thought, and so we started looking for them. In this article, I tell the story of the quest for unused files on a Drupal-powered website (the approach is not limited to Drupal, for that matter).

First off, how do the culprits, the unused files, appear? No mystery here, really. For example, you may have had a news section on the website. Later on, you decided to restructure the site and deleted the news section, but the pictures that made the news all the merrier did not go away with it. The same applies to product catalogues, blog posts and so on. In other words, no site is immune to the plague of unused files.

Below you will find a sequence of simple actions that lets you say a final goodbye to unused files. ATTENTION! Back up before doing anything to the code or the files!

How do you define unused files?

  • A file is considered to be unused if it is not mentioned in the DB in any way, and
  • there is no link or reference to the file in the code (theme, CSS, JavaScript).
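
Both conditions can be tried by hand on a single file before automating anything. The sketch below is only an illustration: the is_unused function and its arguments are made up for this example, and it relies on nothing more than a fixed-string grep over the dump and over a code directory.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Usage: is_unused NAME DUMP_FILE CODE_DIR
# Succeeds (exit 0) only when NAME is found neither in the DB dump
# nor in any file under CODE_DIR.
is_unused() {
  local name=$1 dump=$2 code_dir=$3
  # Condition 1: the name must not appear anywhere in the DB dump.
  if grep -qF -- "$name" "$dump"; then
    return 1
  fi
  # Condition 2: the name must not appear in the code (theme, CSS, JS).
  if grep -rqF -- "$name" "$code_dir"; then
    return 1
  fi
  return 0
}
```

For instance, is_unused banner.jpg dumpwebsite.sql path/to/theme && echo "banner.jpg looks unused" would report a file that fails both searches (banner.jpg and path/to/theme are placeholder names).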

Make a DB dump, name it dumpwebsite.sql (the name the script expects) and put it in the site’s root folder. Create a .sh file there (name it dfindfiles.sh, for example) and put the following code into that file:

#!/bin/bash
START=./sites/default/files
CURDIR=`pwd`
IG_STYLES=./styles/*
IG_JS=./js/*
IG_CSS=./css/*

dbdump=`pwd`/dumpwebsite.sql
usedfile=`pwd`/output_used.txt
notusedfile=`pwd`/output_notused.txt
notusedfile_check=`pwd`/output_notused_check.txt

cd ${START}
echo "Step 1. Checking for used and unused files to database..."
echo "$(date)"
for file in `find . ! -path "$IG_JS" ! -path "$IG_CSS" ! -path "$IG_STYLES" -type f -print | cut -c 3- | sed 's/ /#}/g'`
do
  # Restore the spaces that were masked as "#}" for the for loop
  file2=`echo "$file" | sed 's/#}/ /g'`
  file3=`basename "$file2"`
  # -F: treat the file name as a literal string, not a regular expression
  result=`grep -cF -- "$file3" "$dbdump"`
  if [ "$result" = 0 ]; then
    echo "$file2" >> "$notusedfile"
  else
    echo "$file2" >> "$usedfile"
  fi
done
cd ${CURDIR}

echo "Step 2. Checking files from list not used files..."
echo "$(date)"
# Read the list line by line so that paths containing spaces stay intact
while IFS= read -r p; do
  grep -rnwF --include=*.{module,inc,php,js,css,html,htm,xml} -e "$p" "${CURDIR}" > /dev/null || echo "$p" >> "$notusedfile_check"
done < "$notusedfile"

echo "Files checking done."
echo "Check the following text-file for results:"
echo "$notusedfile_check"
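
A note for reruns: the script appends to its output files with >>, so lists left over from an earlier run would get mixed into the new results. Here is a minimal sketch of clearing them first, assuming the output file names used in the script; the reset_outputs helper is made up for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: remove result lists left over from an earlier run.
# DIR is the directory the script was run from (the site root in this article).
reset_outputs() {
  local dir=${1:-.}
  rm -f -- "$dir/output_used.txt" \
           "$dir/output_notused.txt" \
           "$dir/output_notused_check.txt"
}
```

Calling reset_outputs . in the site root before each run keeps the three lists from accumulating duplicates; the DB dump is left alone.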

The script

Following the definition of unused files, the script does two things:

  • searches for any mentions of the file in the DB;
  • searches for links and references to the file in the source code.
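
#!/bin/bash

Setting the interpreter. Bash is named here rather than plain sh because the --include pattern used in step 2 relies on brace expansion, which not every /bin/sh supports.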

START=./sites/default/files

Setting the start directory for scanning. This is where the site’s files go. By default, the path is sites/default/files. If in doubt, browse to your Drupal control panel and check what you have in Configuration – File system, Default file system path field.

CURDIR=`pwd`

Remembering the current directory (the site’s root, where the script and the dump live) so that the script can return to it later.

IG_STYLES=./styles/*

Ignoring the styles directory, where Drupal keeps the image derivatives it generates; they are recreated on demand.

IG_JS=./js/*

Ignoring the js directory, where Drupal keeps the aggregated JavaScript it generates.

IG_CSS=./css/*

Ignoring the css directory, where Drupal keeps the aggregated CSS it generates.
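
The three IG_ variables are patterns that are later handed to find’s -path test. To see what they filter out, here is a small sketch on a throwaway directory tree; every file and directory name in it is made up, only the find expression mirrors the script.

```shell
#!/bin/bash
# Build a throwaway tree that mimics sites/default/files (names are made up).
demo=$(mktemp -d)
mkdir -p "$demo/styles/thumbnail/public" "$demo/js" "$demo/css" "$demo/news"
touch "$demo/styles/thumbnail/public/pic.jpg" "$demo/js/js_abc.js" \
      "$demo/css/css_abc.css" "$demo/news/pic.jpg" "$demo/logo.png"

# Same exclusions as in the script: "*" in a -path pattern also matches "/",
# so everything below each ignored directory is skipped.
kept=$(cd "$demo" && find . ! -path "./js/*" ! -path "./css/*" \
         ! -path "./styles/*" -type f -print | cut -c 3- | sort)
echo "$kept"
rm -rf "$demo"
```

Run as is, the sketch prints only logo.png and news/pic.jpg; the generated files under styles, js and css never reach the loop.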

dbdump=`pwd`/dumpwebsite.sql

Specifying the DB dump.

usedfile=`pwd`/output_used.txt

Specifying the file that contains the list of used files.

notusedfile=`pwd`/output_notused.txt

Specifying the file that contains the list of files not found in the DB.

notusedfile_check=`pwd`/output_notused_check.txt

Specifying the file that lists the files found neither in the DB nor in the code. These are the ones you can delete.

cd ${START}

Changing to the start directory, where the scan begins.

echo "Step 1. Checking for used and unused files to database..."

Announcing commencement of the first step.

echo "$(date)"

Printing the time at which the first step started.

for file in `find . ! -path "$IG_JS" ! -path "$IG_CSS" ! -path "$IG_STYLES" -type f -print | cut -c 3- | sed 's/ /#}/g'`
do
  file2=`echo "$file" | sed 's/#}/ /g'`
  file3=`basename "$file2"`
  result=`grep -cF -- "$file3" "$dbdump"`
  if [ "$result" = 0 ]; then
    echo "$file2" >> "$notusedfile"
  else
    echo "$file2" >> "$usedfile"
  fi
done

A loop that looks for files matching the first part of the definition. The find command skips the ignored directories, and sed replaces spaces in file names with “#}” so that the for loop does not split such names into separate words. Inside the loop, each name gets its spaces back and its base name is searched for in the DB dump. If the dump contains the name, the file’s path goes to output_used.txt; if not, it goes to output_notused.txt.
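
The “#}” placeholder takes care of spaces, but a for loop over command output can still trip over other unusual characters in file names. As an alternative sketch (not the loop used in this article), the same classification can be done with NUL-delimited names. It assumes bash and a find that supports -print0 (GNU and BSD find both do); the classify_files name and its arguments are invented for the example.

```shell
#!/bin/bash
# Alternative sketch: sort files into "used" and "not used" lists depending on
# whether their base name occurs in a DB dump. Names are passed NUL-delimited,
# so spaces and other odd characters survive intact.
# Usage: classify_files START_DIR DUMP_FILE USED_LIST NOTUSED_LIST
classify_files() {
  local start=$1 dump=$2 used=$3 notused=$4 file name
  : > "$used"                          # start with empty result lists
  : > "$notused"
  (cd "$start" && find . ! -path "./js/*" ! -path "./css/*" \
       ! -path "./styles/*" -type f -print0) |
  while IFS= read -r -d '' file; do
    file=${file#./}                    # drop the leading "./"
    name=${file##*/}                   # base name, like `basename`
    if grep -qF -- "$name" "$dump"; then
      printf '%s\n' "$file" >> "$used"
    else
      printf '%s\n' "$file" >> "$notused"
    fi
  done
}
```

A call such as classify_files sites/default/files dumpwebsite.sql output_used.txt output_notused.txt, run from the site’s root, would produce the same two lists as step 1.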

cd ${CURDIR}

Changing back to the site’s root directory.

echo "Step 2. Checking files from list not used files..."

Announcing commencement of the second step.

echo "$(date)"

Printing the time at which the second step started.

while IFS= read -r p; do
  grep -rnwF --include=*.{module,inc,php,js,css,html,htm,xml} -e "$p" "${CURDIR}" > /dev/null || echo "$p" >> "$notusedfile_check"
done < "$notusedfile"

This loop takes each path listed in output_notused.txt and searches the site’s code for it. If a reference is found, the grep output is discarded to /dev/null and the file is left alone; otherwise the path gets listed in output_notused_check.txt.
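
The --include option in this loop leans on brace expansion, which is a bash feature. A shell without it passes the braces to grep literally, so no file name matches, grep finds nothing, and every path ends up in the list of files to delete. The short sketch below, assuming bash, shows what the shell actually hands to grep (the extension list is shortened for readability).

```shell
#!/bin/bash
# In bash, one braced word expands into several separate arguments
# before grep ever sees them.
args=( --include=*.{module,inc,php} )
printf '%s\n' "${#args[@]} arguments:" "${args[@]}"
```

In bash this reports three arguments, one --include option per extension.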

echo "Files checking done."
echo "Check the following text-file for results:"

Telling the world the search is over.

echo "$notusedfile_check"

Printing the path to the final results file. output_notused_check.txt now contains the list of files that you can delete. Copy it to the directory where scanning began (sites/default/files in our example), change to that directory and run:

tr '\n' '\0' < output_notused_check.txt | xargs -0 rm -f --
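
Deleting outright is hard to undo. A more cautious variant, sketched here as an assumption rather than as part of the original procedure, moves the listed files into a holding directory and keeps their relative paths, so anything that turns out to be needed can be put back. The quarantine_files name, its arguments and the holding directory are all hypothetical.

```shell
#!/bin/bash
# Hypothetical alternative to deleting: move listed files into a holding
# directory, keeping their relative paths, so they can be restored.
# Usage: quarantine_files LIST_FILE FILES_DIR HOLD_DIR
quarantine_files() {
  local list=$1 files_dir=$2 hold_dir=$3 path
  while IFS= read -r path; do
    [ -f "$files_dir/$path" ] || continue        # skip entries that are gone
    mkdir -p "$hold_dir/$(dirname -- "$path")"
    mv -- "$files_dir/$path" "$hold_dir/$path"
  done < "$list"
}
```

For example, quarantine_files output_notused_check.txt sites/default/files ../unused_files_hold, run from the site’s root, would park the files outside the web root; once the site has been checked, the holding directory can be removed.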

The xargs command above deletes the files listed in output_notused_check.txt. Once done, check that all the images on the website load properly and nothing is missing. Next, go to the site’s root directory and delete dumpwebsite.sql, output_used.txt, output_notused.txt and output_notused_check.txt, along with the copy of output_notused_check.txt in the files directory. May your quest for unused files be a successful one!
