
Description: Find duplicated files/directories
Author: Catalin(ux) M. BOIE
Start date: 2012-04-09

Plan:
- compute SHA1 on files/dirs lazily (compare sizes first; compute the checksum only when sizes match).
- sort files and dir tables
- check directories first
- check files, hiding all siblings already reported above
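
A minimal sketch of the "size first, checksum later" idea from the plan.
It is not taken from store.c: the names lazy_file, lazy_sum and maybe_dup
are hypothetical, and a 64-bit FNV-1a sum stands in for SHA1 only to keep
the example free of external libraries.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical file record; field names are illustrative only. */
struct lazy_file {
	const char *path;
	long long size;
	int have_sum;	/* 0 until the checksum has been computed */
	uint64_t sum;	/* FNV-1a, a stand-in for the SHA1 in the plan */
};

/* Compute the checksum on first use only. Returns 0 on success, -1 on error. */
int lazy_sum(struct lazy_file *f)
{
	unsigned char buf[4096];
	size_t n, i;
	FILE *fp;

	if (f->have_sum)
		return 0;

	fp = fopen(f->path, "rb");
	if (!fp)
		return -1;

	f->sum = 14695981039346656037ULL;	/* FNV-1a offset basis */
	while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
		for (i = 0; i < n; i++) {
			f->sum ^= buf[i];
			f->sum *= 1099511628211ULL;	/* FNV-1a prime */
		}
	}
	fclose(fp);
	f->have_sum = 1;

	return 0;
}

/* Returns 1 if a and b look like duplicates, 0 if not, -1 on read error. */
int maybe_dup(struct lazy_file *a, struct lazy_file *b)
{
	if (a->size != b->size)
		return 0;	/* cheap test first: no file is read */

	if (lazy_sum(a) != 0 || lazy_sum(b) != 0)
		return -1;

	return a->sum == b->sum;
}
```

With this shape, a file whose size is unique is never opened at all. A
real tool would probably confirm a checksum match with a stronger hash or
a byte-by-byte comparison.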


DIR
	subdir1
		subsubdir1
		subsubdir2
		file1

DIR
	subdir2


DIR->subdirs = subdir1
subdir1->next = subdir2

subdir1->subdirs = subsubdir1
subsubdir1->next = subsubdir2
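
The links above can be sketched in C roughly as follows. Only the
'subdirs' and 'next' pointers come from these notes; the struct layout and
the helpers dir_link and dir_count_below are assumptions made for
illustration.

```c
#include <stddef.h>

/* Hypothetical node: first child in 'subdirs', next sibling in 'next'. */
struct dir {
	const char *name;
	struct dir *subdirs;
	struct dir *next;
};

/* Prepend child to the parent's list of subdirectories. */
void dir_link(struct dir *parent, struct dir *child)
{
	child->next = parent->subdirs;
	parent->subdirs = child;
}

/* Count all directories below d (children, grandchildren, ...). */
size_t dir_count_below(const struct dir *d)
{
	const struct dir *c;
	size_t n = 0;

	for (c = d->subdirs; c; c = c->next)
		n += 1 + dir_count_below(c);

	return n;
}
```

Because dir_link prepends, linking subdir2 first and subdir1 second gives
exactly the layout drawn above: DIR->subdirs is subdir1 and subdir1->next
is subdir2.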


== Pseudocode ==
main.c: for every directory passed as a parameter:
	call nftw with callback 'callback':
		ignore anything that is not a file or a dir
		if we have already seen that inode, skip it
		if it is a dir, call dir_add:
			alloc a dir node and fill name, dev, ino, level
			if it is a level 0 dir (passed as a parameter), add it to
				dir_info array
			else
				find parent dir and set ->parent to it
				->next_sibling = parent->subdirs
				parent->subdirs = q
		else, call file_add:
			alloc a file node q
			set size, name, dev, ino, level and init SHAs
			find parent and add q to parent->files
			also set the parent
			then also add q to a hash by size (file_info), sorted by size
	call file_find_dups
		for every bucket of file_info that has at least one item
			if an item has no next, it cannot have a dup,
				so we mark it with the no_dup_possible flag
			for every item in hash:
				we group by size and we call compare_file_range
					compare_file_range will fill item->dups
	call dir_find_dups
		for every dir passed as a parameter (dir_info):
			call dir_build_hash
			we allocate an array that will keep all dirs that may have matches
			for every possible dir we call dir_find_dups_populate_list
			sort dirs by hash
			find same hash dirs
				call dir_process_range on first..last with same hash
					link all dirs under the lowest level one
	call dump_duplicates
		if flag no_dup_possible is set, skip
		if do_not_dump is set, skip
		if it is alone in the chain, no dup is possible, skip
		for every same hash dir:
			if left is 1, we skip it because it was already dumped
			if do_not_dump is set, skip
			mark dir as left, to not appear in a 'right' position
			mark main dir as 'do_not_dump', because we already dumped it
			mark current dir as 'do_not_dump' because we already dumped it
			dump
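
The "first..last" range idea used by file_find_dups and dir_find_dups can
be sketched as below, assuming the items are already sorted by size in a
plain array. The struct sized_item and the function mark_size_ranges are
hypothetical; only the no_dup_possible flag name comes from the pseudocode.

```c
#include <stddef.h>

/* Hypothetical item; v[] is assumed to be sorted by size. */
struct sized_item {
	long long size;
	int no_dup_possible;
};

/*
 * Walk the sorted array once. An item whose size is unique is flagged
 * as no_dup_possible; each run of two or more equal sizes is a range
 * that a real tool would hand to something like compare_file_range.
 * Returns the number of such ranges.
 */
size_t mark_size_ranges(struct sized_item *v, size_t n)
{
	size_t i = 0, ranges = 0;

	while (i < n) {
		size_t j = i + 1;

		while (j < n && v[j].size == v[i].size)
			j++;

		if (j - i == 1)
			v[i].no_dup_possible = 1;
		else
			ranges++;	/* candidates are v[i] .. v[j - 1] */

		i = j;
	}

	return ranges;
}
```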

Damn complicated.
Let's try a simple approach.
Let's build a singly linked list of files, ordered by size. The hash was too complicated and saved nothing.
Maybe it saved some time when adding files.
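
A minimal sketch of such a list, assuming a hypothetical file_node struct
and a hypothetical file_list_insert helper (neither name is from the
sources). Each insert walks the list, so adding n files costs O(n^2) in
the worst case, which is probably the time the hash was saving.

```c
#include <stddef.h>

/* Hypothetical file node, kept in a list ordered by ascending size. */
struct file_node {
	const char *name;
	long long size;
	struct file_node *next;
};

/* Insert q so that the list stays ordered by size. */
void file_list_insert(struct file_node **head, struct file_node *q)
{
	struct file_node **pp = head;

	/* Stop at the first node that is not smaller than q. */
	while (*pp && (*pp)->size < q->size)
		pp = &(*pp)->next;

	q->next = *pp;
	*pp = q;
}
```

Files of equal size end up next to each other, so duplicate candidates
can be found with a single pass over the list.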
Build the dirs list
	Keep in mind to mark dirs that contain files that cannot have duplicates (unique size).
	Don't forget to sort the files inside a dir before building the hash.
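
Why the sort matters: if a dir hash is built from its children in listing
order, two dirs with the same content but a different order get different
hashes. A sketch under stated assumptions: the children are represented
here by 64-bit checksums, the combining function is FNV-1a instead of
SHA1, and dir_hash and cmp_u64 are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 64-bit checksums. */
int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *) a;
	uint64_t y = *(const uint64_t *) b;

	return (x > y) - (x < y);
}

/*
 * Combine the child checksums of one dir into a single value.
 * The array is sorted in place first, so the result does not depend
 * on the order in which the children were found.
 */
uint64_t dir_hash(uint64_t *sums, size_t n)
{
	uint64_t h = 14695981039346656037ULL;	/* FNV-1a offset basis */
	size_t i;
	int b;

	qsort(sums, n, sizeof(sums[0]), cmp_u64);

	for (i = 0; i < n; i++) {
		for (b = 0; b < 8; b++) {
			h ^= (sums[i] >> (8 * b)) & 0xff;
			h *= 1099511628211ULL;	/* FNV-1a prime */
		}
	}

	return h;
}
```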


