Author Topic: Learning Java in Retirement  (Read 497 times)

The Gorn

  • Your agonizer, please. And be sure to keep the batteries charged!
  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 14182
  • Gornix user
    • View Profile
Re: Learning Java in Retirement
« Reply #45 on: November 11, 2011, 11:59:22 am »
Quote from Walter Mitty:
> For now, I'm satisfying myself with a simple byte-for-byte comparison to test for equality. I know that's too naive, but I'll fix that later. Much later. The next exercise is to build a "Purge Toy".
> ...
> This is where false negatives and false positives are going to come back to bite me, maybe. I'll let you know how it works out.

I'm going to try this one more time. After that I give up.

A CRC value calculated on a file is, for practical purposes, a unique fingerprint of that file.

Calculate the CRC for every file that you process. It alone may be very safely used as a fingerprint value. 

Assuming that you use a good CRC calculation procedure, you will get uniqueness in a 2 raised to the 32nd power space.

Once you calculate the CRC value for a given file, you no longer need to worry about comparing its contents by byte with something else.

The important point here is once the CRC has been calculated for every file that you encounter, and you save it, it substitutes fully for byte by byte comparisons of that file with other files.
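
As a rough sketch of the calculation step, assuming Java 7 or later and the standard java.util.zip.CRC32 class. The class name FileCrc and the method name crcOf are made up for illustration and do not come from any code in this thread. The file is read in chunks, so even a large picture never has to fit in memory all at once.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

class FileCrc {

    // CRC-32 of everything readable from the stream,
    // returned as a non-negative value in a long.
    static long crcOf(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        int count;

        while ((count = in.read(buffer)) != -1) {
            crc.update(buffer, 0, count);
        }
        return crc.getValue();
    }

    // Convenience overload: open the file, compute, and close it again.
    static long crcOf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return crcOf(in);
        }
    }
}
```

The value returned by crcOf can be saved alongside the file path and used directly as a map key.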

Here is an example of the use of the calculated CRC values.

The file paths and their calculated CRC values can be placed in a map, keyed by the CRC value. Looking up a CRC value that is in the map returns the file path. As you process each new file, calculate its CRC and use it to look up into the map. If you find an entry (in other words, there is already a file in the map with a matching CRC value), then you have found a duplicate file.

(File path means the full file name, including the directory. I mean the unique file name of that file in the file system.)
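
The map lookup described above might look something like this. It is only a sketch: CrcDuplicateFinder, crcOf and findSuspectedDuplicates are hypothetical names, and it assumes Java 8 or later for Map.putIfAbsent. The matches are called "suspected" because two different files can, rarely, share a 32-bit CRC; whether that deserves a confirming comparison is a judgment call.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

class CrcDuplicateFinder {

    // CRC-32 of a file's contents, read in chunks.
    static long crcOf(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int count;
            while ((count = in.read(buffer)) != -1) {
                crc.update(buffer, 0, count);
            }
        }
        return crc.getValue();
    }

    // Each result is a pair: { path already in the map, newly seen path }.
    static List<Path[]> findSuspectedDuplicates(List<Path> files) throws IOException {
        Map<Long, Path> firstSeen = new HashMap<>();
        List<Path[]> suspects = new ArrayList<>();

        for (Path file : files) {
            // putIfAbsent returns the earlier path if this CRC is already a key.
            Path earlier = firstSeen.putIfAbsent(crcOf(file), file);
            if (earlier != null) {
                suspects.add(new Path[] { earlier, file });
            }
        }
        return suspects;
    }
}
```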

In the procedure I am recommending, there are, for practical purposes, no false positives, and there is no byte-by-byte comparison.

The valuable property of the calculated CRC value of every file is that it may be used to look up files that have exactly the same contents. (With an exceptionally high degree of confidence.)

I think if you don't understand what I am saying, you are missing an opportunity to do this "right".

PS: there are quite a few free and open source duplicate-purging file utility programs available that essentially find all files that are the same in the same or side by side directories.
« Last Edit: November 11, 2011, 12:29:10 pm by The Gorn »
Gornix is protected by the GPL. *

* Gorn Public License. Duplication by inferior sentient species prohibited.


Walter Mitty

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 1025
    • View Profile
Re: Learning Java in Retirement
« Reply #46 on: November 11, 2011, 02:20:59 pm »
Quote from The Gorn:
> Assuming that you use a good CRC calculation procedure, you will get uniqueness in a 2 raised to the 32nd power space.

The odds are in your favor, by something like 4 billion to one.  I'm sure it's good advice.
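
The 4 billion to one figure is for a single pair of files. Across a whole collection, the usual birthday-problem approximation puts the chance that at least one pair of different files shares a CRC at about 1 - exp(-n(n-1)/2^33) for n files. Here is a small sketch of that arithmetic, assuming CRC values behave like evenly distributed random 32-bit numbers (an idealization); the class and method names are made up.

```java
class CrcCollisionOdds {

    // Approximate chance that at least two of fileCount distinct files
    // end up with the same 32-bit CRC (birthday-problem estimate).
    static double chanceOfAnyCollision(long fileCount) {
        double pairs = (double) fileCount * (fileCount - 1) / 2.0;
        double space = 4294967296.0;          // 2 to the 32nd power

        // 1 - exp(-x), computed with expm1 to stay accurate for tiny x.
        return -Math.expm1(-pairs / space);
    }

    public static void main(String[] args) {
        long[] counts = { 2, 1000, 10000, 100000 };
        for (long n : counts) {
            System.out.println(n + " files: " + chanceOfAnyCollision(n));
        }
    }
}
```

By this estimate the chance is around 1% at ten thousand files and better than even at a hundred thousand.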

I'd just like to try it the other way, see what happens.  I haven't been able to do things any way I like since the last time I did hobby programming.  And that was too many years ago.

If I get into trouble,  I'll backtrack,  pick up a good CRC polynomial,  translate it into code, and try again.  What I won't do is plug in somebody else's code that I don't understand.  And it's going to take me some effort to understand a CRC polynomial.
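
For what it's worth, the polynomial translates into surprisingly little code. Below is a sketch of the plain bit-at-a-time form of the common CRC-32 (the one used by zip files and by java.util.zip.CRC32), written with the reflected polynomial constant 0xEDB88320. The class name BitwiseCrc32 is invented here, and this is the slow, readable version rather than the table-driven one most libraries use.

```java
class BitwiseCrc32 {

    // Reflected (bit-reversed) form of the CRC-32 polynomial 0x04C11DB7.
    private static final int POLYNOMIAL = 0xEDB88320;

    static long crc32(byte[] data) {
        int crc = 0xFFFFFFFF;                     // register starts as all ones

        for (byte b : data) {
            crc ^= (b & 0xFF);                    // fold the next byte into the low bits

            for (int bit = 0; bit < 8; bit++) {
                if ((crc & 1) != 0) {
                    crc = (crc >>> 1) ^ POLYNOMIAL;   // low bit set: shift and "subtract"
                } else {
                    crc >>>= 1;                       // low bit clear: just shift
                }
            }
        }

        // Final inversion, then widen to a long so the result is non-negative.
        return (~crc) & 0xFFFFFFFFL;
    }
}
```

For the ASCII text 123456789 this yields 0xCBF43926, the customary check value for this CRC, and the same result java.util.zip.CRC32 gives.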




Walter Mitty

Re: Learning Java in Retirement
« Reply #47 on: November 12, 2011, 07:31:20 am »
BTW, my first exposure to CRC was about 35 years ago, when I was learning about DECnet.  DEC's data link layer, DDCMP, had some kind of CRC appended on the end of packets.  IIRC, it was a 16 bit CRC, not a 32 bit one.  DDCMP was going for error detect and retransmit, not error correction.

I'm not sure, but I think one of DEC's disk controllers stashed a CRC32 on the end of every sector, and they used that to correct a misread on a sector, most of the time.

I decided against CRC in this case because I didn't want to read the files when building  the map.  And, to my knowledge, you have to read the file to compute a CRC.  Now that I think of it, it would be really esoteric for Windows to capture a CRC32 anyways.

Down the road, way down the road, I'm going to want a surrogate for file contents.  When I get there, CRC will be at the top of the list.  In the meantime, taking offense at my "philistine" approach sounds like a religion to me.    :laugh:



The Gorn

Re: Learning Java in Retirement
« Reply #48 on: November 12, 2011, 12:17:37 pm »
Quote from Walter Mitty:
> Down the road, way down the road, I'm going to want a surrogate for file contents.  When I get there, CRC will be at the top of the list.  In the meantime, taking offense at my "philistine" approach sounds like a religion to me.    :laugh:

Just keep this in mind the next time someone in your presence doesn't design a database according to Hoyle.  :P

Ok, I get it.

Just one thing. You know that file lengths are so non-unique in your sample that, as you have acknowledged, you will have collisions that you don't want (false positives).

So you wind up having to compare by contents of the file also.

So what's the difference? It sounds like you have to open and read the file anyway with your method.

The bits of your algorithm you have revealed sound painful to code and test. To me, anyway.
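
The algorithm under discussion has not been posted, so the following is only a guess at its general shape: group files by length, which needs no reading at all, and then compare contents only within a group that shares a length. All names (SizeThenContents, groupBySize, sameContents) are hypothetical, and Java 8 or later is assumed.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SizeThenContents {

    // Step 1: bucket the files by length. No file is opened here.
    static Map<Long, List<Path>> groupBySize(List<Path> files) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path file : files) {
            bySize.computeIfAbsent(Files.size(file), k -> new ArrayList<>()).add(file);
        }
        return bySize;
    }

    // Step 2: byte-for-byte comparison, only needed for files in the same bucket.
    static boolean sameContents(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream x = new BufferedInputStream(Files.newInputStream(a));
             InputStream y = new BufferedInputStream(Files.newInputStream(b))) {
            int fromX;
            while ((fromX = x.read()) != -1) {
                if (fromX != y.read()) {
                    return false;
                }
            }
            return y.read() == -1;   // both streams must end together
        }
    }
}
```

Files whose length no other file shares are never opened, which is where this differs from computing a CRC for everything.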


Walter Mitty

Re: Learning Java in Retirement
« Reply #49 on: November 12, 2011, 04:09:30 pm »
Quote from The Gorn:
> Quote from Walter Mitty:
> > Down the road, way down the road, I'm going to want a surrogate for file contents.  When I get there, CRC will be at the top of the list.  In the meantime, taking offense at my "philistine" approach sounds like a religion to me.    :laugh:
>
> Just keep this in mind the next time someone in your presence doesn't design a database according to Hoyle.  :P
>
> Ok, I get it.

Trust me, during my seventeen years in consulting/contracting  I learned to be very, very gentle with people whose idea about database design ran the opposite of my own.  And some of the worst people I ever ran into were the ones who were never, ever wrong about anything.  Those people alienate more people than they convince.  I claim not to have been one of them, although I probably was once or twice.

Two things about design issues.  First, for any significant design case, there will be more than one satisfactory solution.  And arguing about which one is better, if they are both satisfactory, is often counterproductive.

Second, no one is a total fool.  For any design decision, no matter how stupid it looks, there was some set of circumstances that made it seem like a good idea at the time.

Walter Mitty

Re: Learning Java in Retirement
« Reply #50 on: November 12, 2011, 04:27:15 pm »
Well, I got the prototype of the Purge Toy up and running today.  Gawd, some of the programming errors I made make me look like a complete neophyte.  Oh well, I am sort of a neophyte when it comes to Java.  It's going to take me some time to tame this language to the point where I feel comfortable.

Anyway, I used a picture CD with 95 pictures on it, and a test case with copies of all 95 files.  The folder structure was rich.  It found all 95 duplicates in a little over 10 seconds.  If this scales up linearly, I should be able to process my entire body of photographs in less than an hour.

I expect the map building phase to scale up on the order of n*log(n).  I don't know what Java's defaults are for hashing tables,  but I guess I'm gonna find out.  I expect the map using phase to scale linearly.  But there could be something obvious I'm missing.  Time will tell.
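
On the question of defaults: java.util.HashMap starts with 16 buckets and a load factor of 0.75, and doubles its table whenever it fills past that fraction. Insertions take constant expected time, so building the map should come out closer to linear than n*log(n); a sorted TreeMap is the one that costs log(n) per insertion. If the number of files is roughly known in advance, the resizing can be avoided, as in this sketch (MapSizing and presized are made-up names).

```java
import java.util.HashMap;
import java.util.Map;

class MapSizing {

    // A HashMap big enough to hold expectedEntries without ever resizing,
    // given the default load factor of 0.75.
    static <K, V> Map<K, V> presized(int expectedEntries) {
        int capacity = (int) Math.ceil(expectedEntries / 0.75);
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        Map<Long, String> pathsByKey = presized(20000);

        pathsByKey.put(123456L, "C:/photos/example.jpg");   // illustrative entry only
        System.out.println(pathsByKey.size());
    }
}
```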

The next tests are going to involve files that should not be deleted,  and files where the names and/or the folder structure have been scrambled.



Walter Mitty

Re: Learning Java in Retirement
« Reply #51 on: November 15, 2011, 02:25:25 pm »
I'm ready to put the purge toy project to bed for the time being.

On the upside.  There are no known bugs.  It runs fast enough to suit me.  It's written in clear simple code that I'm going to be able to understand months from now.

On the downside.  There could always be unknown bugs. 

I might end up regretting my whole approach.  If so, I'll "think of it as learning experience".  (ha ha)

It's coded antithetically to the religion of OOP.  In particular, when I have common variables that I want to access from all over the place, I just declare them "public static"  and to hell with it.  No OOP devotees are ever going to look at this code anyway.

I've decided that I don't want to really delete the pictures to be purged.  I want to move them to the Windows recycle bin instead.

I'd like to compile it for version 1.4 of the JVM.  But generics are not supported in 1.4.
I'll come up with a workaround.
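
One possible workaround, sketched under the assumption that generics are only being used for the collections: fall back to raw collection types and cast on the way out, which is how such code was written before Java 5. PreGenericsMap and its methods are illustrative names. Keys are kept as Strings here to sidestep autoboxing, which 1.4 also lacks. A current compiler will warn about the raw types but still accept the code.

```java
import java.util.HashMap;
import java.util.Map;

class PreGenericsMap {

    // Turn a numeric key into a String so no boxing is needed.
    static String keyFor(long value) {
        return Long.toHexString(value);
    }

    // Raw (non-generic) map: the compiler sees keys and values as plain Objects.
    static Map buildIndex(String[] keys, String[] paths) {
        Map index = new HashMap();

        for (int i = 0; i < keys.length; i++) {
            index.put(keys[i], paths[i]);
        }
        return index;
    }

    // The cast does by hand what generics would otherwise check at compile time.
    static String lookup(Map index, String key) {
        return (String) index.get(key);
    }
}
```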

Walter Mitty

Re: Learning Java in Retirement
« Reply #52 on: November 16, 2011, 08:43:33 am »
In my last post, I made some disparaging remarks about people who turn OOP into a religion.  In this post, I want to restore the balance.

I honestly believe that object oriented thinking represents a fundamental step forward in the art of formal thinking about reality.  Likewise, I honestly think that OOP represents a fundamental step forward in the art of programming.  It's a step I didn't take 25 years ago.  And it's one of the reasons I want to learn Java.

To some extent, my attitude was poisoned by people who evangelized OOP in revolutionary terms.  You know the spiel,  "forget everything you ever learned about programming"  or "throw all  of your existing code away".  These are revolutionists, and revolutionists bring with them the threat of chaos.

But, in addition to revolutionists, there are coopters.  If revolutionists threaten chaos, coopters threaten mediocrity.  I confess to being in danger of being a coopter regarding OOP myself.  I have a real resistance to learning how to analyze a problem statement into objects.  That resistance translates into some real resistance in learning how to design Java classes.


In my own defense, I'm going to say that there's a certain method to my madness (no pun intended.)

Walter Mitty

Re: Learning Java in Retirement
« Reply #53 on: November 16, 2011, 08:54:02 am »
Cont'd.

I want to learn Java one step at a time.  The first thing I want to learn is how to use C syntax with proficiency.  Regardless of how much I like Pascal syntax,  C syntax is the mainstream.  There are a couple of ways in which I will continue to be idiosyncratic.

The first is the use of whitespace.  I'm following the indentation rules that I've seen in Java code snippets online.  But apart from that, I'm using whitespace in source code the way I have for decades, that is liberally.  Judging from online examples, Java programmers use whitespace like they were preserving the Brazilian rainforest or something.

I like to use whitespace in addition to comments to help me tell the story the program is trying to tell.  Sometimes a blank line in the middle of a statement sequence tells me more than an extensive comment would. 

I'm writing for the reader.  The first reader is me, months later.  The second reader is the compiler.  The third reader is some other programmer, someone I may never meet.  All three are important.

The second way I depart from the online examples is variable names.  I'll use brief variable names,  like "i" and "j" for loop control, and such.  But the big variables get big names.   I learned this while working with databases, and I have no apologies to offer.  If it takes too much typing, that can be cured by an appropriate text editor.


Walter Mitty

Re: Learning Java in Retirement
« Reply #54 on: November 16, 2011, 08:55:35 am »
Cont'd.

Apart from my idiosyncrasies, I'd like to learn how to manipulate Java as if it were a pretty standard structured language.  Then, and only then,  I want to learn effective class design.

end of 3 post sequence.



David Randolph

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 2501
    • View Profile
Re: Learning Java in Retirement
« Reply #55 on: December 28, 2011, 10:36:44 am »
I have my doubts about computing CRCs in Java. Most fast methods require clear management of the computation word size. I was going to post a C++ method, but found this statement: "Sun JDK 1.6 contains sun.misc.CRC16", which is a much better way to address the problem.
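
On the word-size point: Java has no unsigned integer types, but keeping the register in an int, shifting with >>> and masking the result is usually enough. Here is a sketch of a bit-at-a-time CRC-16 in the widely used reflected form (polynomial 0xA001, initial value 0, often called CRC-16/ARC). Crc16Arc is an invented name, and no claim is made that this matches what sun.misc.CRC16 computes; classes under sun.misc are JDK internals rather than public API. For 32-bit CRCs the public java.util.zip.CRC32 class is available.

```java
class Crc16Arc {

    // Reflected form of the CRC-16 polynomial 0x8005.
    private static final int POLYNOMIAL = 0xA001;

    static int crc16(byte[] data) {
        int crc = 0x0000;                         // 16-bit register held in a 32-bit int

        for (byte b : data) {
            crc ^= (b & 0xFF);                    // mask so a negative byte cannot sign-extend

            for (int bit = 0; bit < 8; bit++) {
                if ((crc & 1) != 0) {
                    crc = (crc >>> 1) ^ POLYNOMIAL;
                } else {
                    crc >>>= 1;                   // unsigned shift: zeros come in at the top
                }
            }
        }
        return crc & 0xFFFF;                      // keep only the low 16 bits
    }
}
```

The check value for the ASCII text 123456789 under these parameters is 0xBB3D.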

