For now, I'm satisfying myself with a simple byte for byte comparison to test for equality. I know that's too naive, but I'll fix that later. Much later. The next exercise is to build a "Purge Toy".
...
This is where false negatives and false positives are going to come back to bite me, maybe. I'll let you know how it works out.
I'm going to try this one more time. After that I give up.
A CRC value calculated on a file is a unique fingerprint of that file. Calculate the CRC for every file that you process. It alone may be very safely used as a fingerprint value.
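To make "calculate the CRC for every file" concrete, here is a minimal sketch in Python. It assumes the 32-bit CRC from the standard library's zlib module; the function name file_crc32 and the chunk size are just my choices for illustration, not anything from your code.

```python
import zlib

def file_crc32(path, chunk_size=65536):
    """Return the CRC-32 of the file at `path` as an unsigned 32-bit integer."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running CRC back in so the file is processed in pieces
            # without being loaded into memory all at once.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

Reading in chunks means large files cost no more memory than small ones.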
Assuming that you use a good 32-bit CRC calculation procedure, the values are spread across a space of 2 raised to the 32nd power (about 4.3 billion) possible values.
Once you calculate the CRC value for a given file, you no longer need to worry about comparing its contents byte by byte with something else.
The important point here is that once the CRC has been calculated and saved for every file that you encounter, it substitutes fully for byte by byte comparisons of that file with other files.
Here is an example of the use of the calculated CRC values.
The file paths and their calculated CRC values can be placed in a map. The map's key is the CRC value, so looking up a CRC value that is in the map returns the file path. As you process each new file, calculate its CRC and use it to look up into the map. If you find a valid entry, in other words there is already a file path in the map under that CRC value, then you have found a duplicate file.
(File path means the full file name, including the directory. I mean the unique file name of that file in the file system.)
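The map lookup just described can be sketched roughly like this in Python. It assumes zlib's 32-bit CRC and a plain dict as the map; crc32_of_file and find_duplicates are hypothetical names I made up for the example. Note that it treats files with equal CRC values as duplicates, exactly as in the procedure above, and does no byte by byte check.

```python
import zlib

def crc32_of_file(path):
    """CRC-32 of a file's contents, read in 64 KiB chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

def find_duplicates(paths):
    """Return (new_path, earlier_path) pairs whose CRC values match."""
    seen = {}        # CRC value -> first file path seen with that CRC
    duplicates = []
    for path in paths:
        crc = crc32_of_file(path)
        if crc in seen:
            # An entry already exists under this CRC: report a duplicate.
            duplicates.append((path, seen[crc]))
        else:
            seen[crc] = path
    return duplicates
```

Pass it the full file paths in whatever order you walk the directories; the first file seen with a given CRC is kept as the "original" and every later match is reported against it.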
In the procedure I am recommending, there are no false positives, and no byte by byte comparison is needed.
The valuable property of the calculated CRC value of every file is that it may be used to look up files that have exactly the same contents, with an exceptionally high degree of confidence.
I think if you don't understand what I am saying, you are missing an opportunity to do this "right".
PS: there are quite a few free and open source duplicate-purging file utility programs available that essentially find all files that are the same in the same or side by side directories.