Find one file of each kind (on Linux)

Suppose you have a large number of files, but most of them are identical. For example, the files’ contents are: A A A B B C C C C.
You’d like to pick one file of each kind and, for example, compare them.

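mkdir cmp
cd cmp/
# Hash every copy, then append each "hash  path" line to a file named after the hash.
find .. -type f -name stupid-page.html -exec md5sum {} + | sort | awk '{print > $1}'
# Open the first file listed for each hash in diffuse (assumes paths contain no whitespace).
head -q -n 1 [0-9a-f]* | awk '{print $2}' | xargs diffuse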

Or, if you just need the list, replace the last command with:

head -q -n 1 [0-9a-f]* | awk '{print $2}'
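
If you don't need the per-hash list files at all, the same idea fits in a single pipeline. This is only a sketch: the function name one_of_each and its two arguments are made up for illustration, and it assumes GNU md5sum and paths without newlines. Instead of writing files, awk keeps the first line seen for each hash and strips the hash prefix, so paths with spaces survive.

```shell
# one_of_each DIR NAME (hypothetical helper):
# print one path per distinct content among the files called NAME under DIR.
one_of_each() {
    find "$1" -type f -name "$2" -exec md5sum {} + |
        sort |
        # Keep only the first line for each hash, then drop the "hash  " prefix.
        awk '!seen[$1]++ { sub(/^[0-9a-f]+ [ *]/, ""); print }'
}
```

For the example above you would call it as: one_of_each .. stupid-page.html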

Note that the list files stay behind in the cmp directory. Each list is named after the MD5 hash of the content shared by the files listed in it.
Like this:

> ls -1
09b37d3089b1c1837e4741973df1e67e
4d701e2420bf49c85dd21c9b1dbb10e1
6135d23fcb0113ab9a2f574d7f0bf703
> cat 09b37d3089b1c1837e4741973df1e67e
09b37d3089b1c1837e4741973df1e67e  ../some-folder/stupid-page.html
09b37d3089b1c1837e4741973df1e67e  ../some-other-folder/stupid-page.html
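
Since each list file holds one line per copy, the leftover lists also tell you how common each kind is. A small sketch, assuming the list files look like the ones shown above; the function name summarize_groups is hypothetical:

```shell
# summarize_groups DIR (hypothetical helper):
# for each hash-named list file in DIR, print "<number of copies> <first path>".
summarize_groups() {
    for list in "$1"/[0-9a-f]*; do
        [ -f "$list" ] || continue
        count=$(wc -l < "$list")
        # Strip the "hash  " prefix from the first line to get a representative path.
        first=$(head -n 1 "$list" | sed 's/^[0-9a-f]* [ *]//')
        printf '%s %s\n' "$count" "$first"
    done
}
```

From inside cmp, run it as: summarize_groups .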
