
README:
Description: Find duplicated files/directories
Author: Catalin(ux) M. BOIE
Start date: 2012-04-09

Plan:
- compute SHA1 on files/dirs lazily: compare sizes first, and compute the checksum only when sizes match (see the sketch after this list)
- sort the file and dir tables
- check directories first
- check files, hiding those already covered by a duplicated directory reported above
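
A minimal sketch of the lazy-checksum idea, with made-up names (the real structures live in main.c):

#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

struct file_node {
    off_t size;
    bool sha1_valid;             /* checksum computed on demand */
    unsigned char sha1[20];
    /* ... name, dev, ino, level ... */
};

/* Assumed helper that reads the file and fills in f->sha1. */
extern void compute_sha1(struct file_node *f);

static bool files_may_be_dups(struct file_node *a, struct file_node *b)
{
    if (a->size != b->size)
        return false;            /* the cheap test, done first */

    if (!a->sha1_valid) { compute_sha1(a); a->sha1_valid = true; }
    if (!b->sha1_valid) { compute_sha1(b); b->sha1_valid = true; }

    return memcmp(a->sha1, b->sha1, sizeof a->sha1) == 0;
}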


DIR
  subdir1
    subsubdir1
    subsubdir2
  file1

DIR
  subdir2


DIR->subdirs = subdir1
subdir1->next = subdir2

subdir1->subdirs = subsubdir1
subsubdir1->next = subsubdir2
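
In C, the pointers above map onto node types roughly like this (field names taken from the pseudocode below; everything else is guessed):

#include <sys/types.h>

struct dir_node {
    char             *name;
    dev_t            dev;
    ino_t            ino;
    int              level;
    struct dir_node  *parent;
    struct dir_node  *subdirs;       /* first child dir */
    struct dir_node  *next_sibling;  /* next dir under the same parent */
    struct file_node *files;         /* files directly inside this dir */
};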


== Pseudocode ==
main.c: for every directory passed as a parameter:
  call nftw with callback 'callback':
    ignore everything that is not a regular file or a dir
    if we have already seen that inode, skip it
    if it is a dir, call dir_add:
      alloc a dir node q and fill in name, dev, ino, level
      if it is a level 0 dir (passed as a parameter):
        add it to the dir_info array
      else:
        find the parent dir and set q->parent to it
        q->next_sibling = parent->subdirs
        parent->subdirs = q
    else, call file_add:
      alloc a file node q
      set size, name, dev, ino, level and init the SHAs
      find the parent and add q to parent->files
      also set q->parent
      finally, add q to a hash keyed by size (file_info), sorted by size
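
A skeleton of that walk, using the standard nftw(3); seen_inode, dir_add and file_add here are stand-ins for the real functions:

#define _XOPEN_SOURCE 500   /* for nftw */
#include <ftw.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Stand-ins for the functions described above. */
extern int  seen_inode(dev_t dev, ino_t ino);  /* remembers (dev, ino) pairs */
extern void dir_add(const char *path, const struct stat *sb, int level);
extern void file_add(const char *path, const struct stat *sb, int level);

static int callback(const char *path, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf)
{
    /* ignore everything that is not a regular file or a dir */
    if (typeflag != FTW_F && typeflag != FTW_D)
        return 0;

    /* skip inodes we have already seen (hard links) */
    if (seen_inode(sb->st_dev, sb->st_ino))
        return 0;

    if (typeflag == FTW_D)
        dir_add(path, sb, ftwbuf->level);
    else
        file_add(path, sb, ftwbuf->level);

    return 0;
}

/* main.c would then do something like: nftw(argv[i], callback, 32, FTW_PHYS) */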
call file_find_dups:
  for every bucket of file_info that has at least one item:
    if the item has no next, it cannot have a dup,
    so mark it with the no_dup_possible flag
  for every item in the hash:
    group the items by size and call compare_file_range
    compare_file_range will fill in item->dups
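
One way to implement that grouping, assuming the files end up in a single list sorted by size (which is also the simpler plan mentioned below); compare_file_range stays the real worker:

#include <sys/types.h>

struct file_node {
    off_t size;
    int no_dup_possible;
    struct file_node *next;   /* list ordered by size */
    struct file_node *dups;   /* filled in by compare_file_range */
    /* ... */
};

extern void compare_file_range(struct file_node *first, struct file_node *last);

static void file_find_dups(struct file_node *head)
{
    struct file_node *p = head;

    while (p) {
        struct file_node *q = p;

        /* extend the run of files sharing p's size */
        while (q->next && q->next->size == p->size)
            q = q->next;

        if (p == q)
            p->no_dup_possible = 1;   /* unique size, no dup possible */
        else
            compare_file_range(p, q); /* fills in the ->dups chains */

        p = q->next;
    }
}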
call dir_find_dups:
  for every dir passed as a parameter (dir_info):
    call dir_build_hash
  allocate an array that will keep all dirs that may have matches
  for every possible dir, call dir_find_dups_populate_list
  sort the dirs by hash
  find the dirs with the same hash
  call dir_process_range on each first..last range with the same hash:
    link all dirs under the lowest-level one
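
The sort-and-scan over the candidate array could look like this (assuming each dir carries a 20-byte hash; dir_process_range is the real worker):

#include <stdlib.h>
#include <string.h>

struct dir_node {
    unsigned char hash[20];   /* built by dir_build_hash */
    /* ... */
};

extern void dir_process_range(struct dir_node **first, struct dir_node **last);

static int cmp_dir_hash(const void *a, const void *b)
{
    const struct dir_node *x = *(const struct dir_node * const *)a;
    const struct dir_node *y = *(const struct dir_node * const *)b;

    return memcmp(x->hash, y->hash, sizeof x->hash);
}

static void dir_find_same_hash(struct dir_node **dirs, size_t n)
{
    size_t i = 0, j;

    qsort(dirs, n, sizeof *dirs, cmp_dir_hash);

    while (i < n) {
        j = i;
        while (j + 1 < n &&
               memcmp(dirs[j + 1]->hash, dirs[i]->hash,
                      sizeof dirs[i]->hash) == 0)
            j++;

        if (j > i)   /* at least two dirs share this hash */
            dir_process_range(&dirs[i], &dirs[j]);

        i = j + 1;
    }
}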
call dump_duplicates:
  if the no_dup_possible flag is set, skip
  if do_not_dump is set, skip
  if it is alone in the chain, no dup is possible, skip
  for every dir with the same hash:
    if 'left' is 1, skip it, because it was already dumped
    if do_not_dump is set, skip
    mark the dir as 'left', so it will not appear in a 'right' position
    mark the main dir as do_not_dump, because we already dumped it
    mark the current dir as do_not_dump, because we already dumped it
    dump
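
The flag juggling above boils down to printing every duplicate pair exactly once; a sketch with guessed field names:

#include <stdio.h>

struct dir_node {
    char *name;
    int no_dup_possible, do_not_dump, left;
    struct dir_node *dups;   /* chain of dirs with the same hash */
};

static void dump_duplicates(struct dir_node **dirs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct dir_node *d = dirs[i];

        if (d->no_dup_possible || d->do_not_dump || !d->dups)
            continue;   /* nothing to report for this dir */

        for (struct dir_node *e = d->dups; e; e = e->dups) {
            if (e->left || e->do_not_dump)
                continue;   /* already shown on the left side */

            printf("%s == %s\n", d->name, e->name);
            d->left = 1;         /* d must not appear on the right later */
            e->do_not_dump = 1;  /* e was already reported */
        }

        d->do_not_dump = 1;
    }
}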

Damn complicated.
Let's try a simpler approach.
Let's build a single linked list of files, ordered by size. The hash was too complicated and saved nothing.
Maybe it saved some time when adding files.
Build the dirs list.
Keep in mind: mark the dirs that contain files that cannot have duplicates (unique size).
Don't forget to sort the files inside a dir before building the hash.
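
Keeping that single list ordered is just a sorted insert into a singly linked list; a sketch:

#include <sys/types.h>

struct file_node {
    off_t size;
    struct file_node *next;
    /* ... */
};

/* Insert q into the size-ordered list, no extra hashing needed. */
static void file_insert_sorted(struct file_node **head, struct file_node *q)
{
    struct file_node **pp = head;

    /* advance until the next node is at least as big as q */
    while (*pp && (*pp)->size < q->size)
        pp = &(*pp)->next;

    q->next = *pp;
    *pp = q;
}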
