Description: Find duplicated files/directories
Author: Catalin(ux) M. BOIE
Start date: 2012-04-09

Plan:
- compute SHA-1 on files/dirs lazily (compare sizes first, and only then the checksums).
- sort files and dir tables
- check directories first
- check files, hiding all siblings already reported above
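
The lazy checksum idea from the plan can be sketched as follows. This is only
an illustration, not the code from store.c: the struct entry type, its fields
and the function names are hypothetical, the data lives in memory instead of
on disk, and FNV-1a is used as a stand-in for SHA-1 to keep the example
self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for a file entry. */
struct entry {
	const unsigned char *data;
	size_t size;
	uint64_t sum;           /* cached checksum, valid only if sum_valid */
	bool sum_valid;
	unsigned int sum_calls; /* how many times the checksum was computed */
};

/* FNV-1a: a placeholder for SHA-1, NOT what the plan asks for. */
static uint64_t fnv1a(const unsigned char *p, size_t n)
{
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Compute the checksum on first use and cache it afterwards. */
static uint64_t entry_sum(struct entry *e)
{
	assert(e != NULL);
	if (!e->sum_valid) {
		e->sum = fnv1a(e->data, e->size);
		e->sum_valid = true;
		e->sum_calls++;
	}
	return e->sum;
}

/* Cheap size check first; checksums are touched only if sizes match. */
static bool entry_maybe_dup(struct entry *a, struct entry *b)
{
	if (a->size != b->size)
		return false;
	return entry_sum(a) == entry_sum(b);
}
```

With this shape, two entries of different sizes never pay for a checksum, and
an entry compared many times is hashed only once.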


DIR
	subdir1
		subsubdir1
		subsubdir2
		file1

DIR
	subdir2


DIR->subdirs = subdir1
subdir1->next = subdir2

subdir1->subdirs = subsubdir1
subsubdir1->next = subsubdir2
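
The pointer layout above could be represented roughly like this. It is a
sketch under assumptions: struct dir, its field set and the helper names are
hypothetical and only mirror the ->subdirs and ->next links shown above. New
children are linked at the head of the list, so they are added in reverse
order to obtain the layout from the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical directory node: first child plus next sibling. */
struct dir {
	const char *name;
	struct dir *parent;
	struct dir *subdirs; /* first subdirectory */
	struct dir *next;    /* next sibling under the same parent */
};

/* Link child at the head of parent's subdirectory list. */
static void dir_link_child(struct dir *parent, struct dir *child)
{
	assert(parent != NULL && child != NULL);
	child->parent = parent;
	child->next = parent->subdirs;
	parent->subdirs = child;
}

/* Count the direct subdirectories of d by walking the sibling chain. */
static size_t dir_count_subdirs(const struct dir *d)
{
	size_t n = 0;

	for (const struct dir *s = d->subdirs; s != NULL; s = s->next)
		n++;
	return n;
}
```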


== Pseudocode ==
main.c: for every directory passed as a parameter:
	call nftw with callback 'callback':
		ignore entries that are neither files nor dirs
		if we have already seen that inode, skip it
		if it is a dir, call dir_add:
			alloc a dir node q and fill name, dev, ino, level
			if it is a level 0 dir (passed as a parameter), add it to
				the dir_info array
			else
				find the parent dir and set q->parent to it
				q->next_sibling = parent->subdirs
				parent->subdirs = q
		else, call file_add:
			alloc a file node q
			set size, name, dev, ino, level and init SHAs
			find parent and add q to parent->files
			also set q->parent
			then add q to a hash by size (file_info), kept sorted by size
	call file_find_dups
		for every bucket of file_info that has at least one item:
			if the item has no next, it cannot have a dup,
				so we mark it with the no_dup_possible flag
			for every item in the hash:
				group items by size and call compare_file_range
					compare_file_range fills item->dups
	call dir_find_dups
		for every dir passed as a parameter (dir_info):
			call dir_build_hash
			allocate an array that keeps all dirs that may have matches
			for every possible dir, call dir_find_dups_populate_list
			sort dirs by hash
			find dirs with the same hash
				call dir_process_range on first..last with the same hash
					link all dirs under the lowest level one
	call dump_duplicates
		if the no_dup_possible flag is set, skip
		if do_not_dump is set, skip
		if it is alone in the chain, no dup is possible, skip
		for every dir with the same hash:
			if left is 1, skip it because it was already dumped
			if do_not_dump is set, skip
			mark the dir as left, so it does not appear in a 'right' position
			mark the main dir as 'do_not_dump', because we already dumped it
			mark the current dir as 'do_not_dump', because we already dumped it
			dump
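
The "already seen that inode" check in the walk callback could look like the
sketch below. It assumes an entry is identified by its (st_dev, st_ino) pair
from struct stat, which is how the same file reached twice (for example via a
hard link or overlapping arguments) can be recognized. The struct and function
names are hypothetical, and a linear scan is used only to keep the example
short; a real implementation would likely want a hash or a tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical key: device and inode number identify an entry. */
struct seen_key {
	dev_t dev;
	ino_t ino;
};

/* Hypothetical growable set of keys; zero-initialize before use. */
struct seen_set {
	struct seen_key *items;
	size_t used;
	size_t alloc;
};

/*
 * Return true if (dev, ino) was recorded before. Otherwise record it
 * and return false. On allocation failure the pair is not recorded
 * and false is returned.
 */
static bool seen_before(struct seen_set *s, dev_t dev, ino_t ino)
{
	assert(s != NULL);
	for (size_t i = 0; i < s->used; i++)
		if (s->items[i].dev == dev && s->items[i].ino == ino)
			return true;

	if (s->used == s->alloc) {
		size_t n = s->alloc ? s->alloc * 2 : 16;
		struct seen_key *p = realloc(s->items, n * sizeof(*p));

		if (p == NULL)
			return false;
		s->items = p;
		s->alloc = n;
	}
	s->items[s->used].dev = dev;
	s->items[s->used].ino = ino;
	s->used++;
	return false;
}
```

Inside the callback, the call would be something like
seen_before(&set, sb->st_dev, sb->st_ino), returning early when it is true.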

Damn complicated.
Let's try a simpler approach.
Let's build a singly linked list of files, ordered by size. The hash was too complicated and saved nothing.
Maybe it saved some time when adding files.
Build the dirs list
	Keep in mind to mark dirs that contain files that cannot have duplicates (unique size).
	Don't forget to sort the files inside a dir before building the hash.
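
A minimal sketch of the simpler approach: one singly linked list kept ordered
by size, followed by a pass that flags files whose size is unique. The struct
file_node type and the function names are hypothetical; only the
no_dup_possible flag name is taken from the notes above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical file node for the size-ordered list. */
struct file_node {
	const char *name;
	long long size;
	bool no_dup_possible; /* true if no other file has this size */
	struct file_node *next;
};

/* Insert q so that the list stays ordered by ascending size. */
static void file_insert_sorted(struct file_node **head, struct file_node *q)
{
	assert(head != NULL && q != NULL);
	while (*head != NULL && (*head)->size < q->size)
		head = &(*head)->next;
	q->next = *head;
	*head = q;
}

/*
 * In a size-ordered list, a file can have a duplicate only if one of
 * its neighbours has the same size. Flag every file where neither does.
 */
static void file_mark_unique(struct file_node *head)
{
	struct file_node *prev = NULL;

	for (struct file_node *p = head; p != NULL; prev = p, p = p->next) {
		bool same_prev = prev != NULL && prev->size == p->size;
		bool same_next = p->next != NULL && p->next->size == p->size;

		p->no_dup_possible = !same_prev && !same_next;
	}
}
```

Insertion is linear in the list length, which is the time cost the note above
suspects the hash may have saved. A dir can then be marked as soon as one of
its files carries the no_dup_possible flag.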


