Author Topic: Comparing Picture Contents  (Read 2175 times)
GIMPel
« on: February 01, 2010, 01:09:21 am »

Hello,

is there already a tool that can compare the contents of pictures?
I just want to know if some files are equal in the picture contents,
no further information is needed.
And for me it's at the moment enough if it only works on jpeg-files.

F-Spot uses md5sum, but seems to look at the whole file,
not at the contents of the picture.
That is fine for comparisons where files with the same picture contents but different metadata
are meant to be seen as different pictures.

But I have some JPEG files (roughly 2000) which seem to have the same picture contents,
although their md5sums differ. I want them to be detected as equal. (The differing metadata seems to be the result of handling the pictures first with digiKam and later with F-Spot; in F-Spot I only handle the metadata externally, so it is not written into the file, but digiKam may have done it differently.)

So I need to compare the pure image data.

Any idea which tool can help here?


P.S.: How do you handle the metadata... do you let F-Spot write it into the JPEG header or not?
  At the moment I am thinking about changing the external handling. It does not seem possible to have *both* activated in F-Spot (external for speed, and internal for safety, in case the F-Spot database gets lost).

EDIT: some sentences/clarifications added.
« Last Edit: February 01, 2010, 01:13:33 am by GIMPel »

monoceros84
« Reply #1 on: February 01, 2010, 09:25:54 am »

Have a look at digiKam. Not 100% sure right now but I think I read that they have something like this.
http://www.kipi-plugins.org/drupal/node/12
digiKam can call it. Or any other KDE graphics tool.

Google can also do this in its internet Image Search. Maybe Picasa has this already.

EDIT:
From the changelog of digiKam 0.10:
Quote
AlbumGUI : New fuzzy Search tools based on sketch drawing template. Fuzzy searches backend use an Haar wavelet interface. You simply draw a rough sketch of what you want to find and digiKam displays for you a thumbnail view of the best matches.
[...]
AlbumGUI : New Search tools to find similar images against a reference image.
AlbumGUI : New Search tools to find duplicates images around whole collections.

Cheers,
Mathias

Visit this site about my photography, my experiences in Norway and my blog:
http://www.gedankenquirl.de (German language)

GIMPel
« Reply #2 on: February 01, 2010, 09:47:14 am »

Have a look at digiKam. Not 100% sure right now but I think I read that they have something like this.
http://www.kipi-plugins.org/drupal/node/12
digiKam can call it. Or any other KDE graphics tool.
[...]

Hmhhhh, don't know if it is what I'm looking for...

"The Kipi Find Duplicate Images plugin is a tool for find duplicate photograph on your image collections.

The Kipi plugin “Find Duplicate Images” currently has no documentation."

Hmhhh, no documentation... and "find duplicate photograph" may also be based on
bytewise (or hash-based) file comparison?

But thanks for the hint, maybe it really looks at the contents of a picture.
I hope this can also be called as a command line tool.
I don't want to use digiKam again.
A while ago I switched to F-Spot, coming from Digikam, and now changing back would not be what I'm looking for. (Not that I'm completely happy with gnome-stuff.)



Quote
[...]
Google can also do this in its internet Image Search. Maybe Picasa has this already.

EDIT:
From the changelog of digiKam 0.10:
Quote
AlbumGUI : New fuzzy Search tools based on sketch drawing template. Fuzzy searches backend use an Haar wavelet interface. You simply draw a rough sketch of what you want to find and digiKam displays for you a thumbnail view of the best matches.
[...]
AlbumGUI : New Search tools to find similar images against a reference image.
AlbumGUI : New Search tools to find duplicates images around whole collections.

Hmhhh. The question is how they define "duplicate" ;-)

nachbarnebenan
« Reply #3 on: February 01, 2010, 09:48:30 am »

You can try ImgSeek from http://www.imgseek.net, which is the origin of this Kipi technology. It is good, but it can take a lot of time and doesn't support raw files, so if a developed image and its raw file become separated (by renaming etc.) it won't bring them back together.

GIMPel
« Reply #4 on: February 01, 2010, 11:39:35 am »

You can try ImgSeek from http://www.imgseek.net, which is the origin of this Kipi technology. It is good, but it can take a lot of time and doesn't support raw files, so if a developed image and its raw file become separated (by renaming etc.) it won't bring them back together.


Hmhh, looks good.

It is not a command-line tool, but it is standalone.
So I may try it on my current F-Spot tree and the old stuff in the digiKam tree.

Good hint, thanks!

GIMPel
« Reply #5 on: February 01, 2010, 10:19:30 pm »

You can try ImgSeek from http://www.imgseek.net, which is the origin of this Kipi technology. It is good, but it can take a lot of time and doesn't support raw files, so if a developed image and its raw file become separated (by renaming etc.) it won't bring them back together.

OK, I have just played around with it for some minutes.
Nice stuff, even if not perfect.

But I have not tested it in detail, just played around with it.

It looks for similar files.
Similarity seems to be color mood, not shape of objects.

But for my needs at the moment it is not the ideal fit,
because I am looking for exact matches.
So I will write my own solution.

I'm not sure if I should write a complete-match tool,
or rather something like a hash-on-picture-contents
tool (something like an md5-of-picture-data). The first
one is what I'm looking for, but the second way would
be more general/flexible and also could solve the problem.
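
The hash-on-picture-contents variant boils down to grouping files by a digest. A minimal sketch in Python rather than C, purely as an illustration: `group_duplicates` and `file_digest` are made-up names, and the default digest here is just md5 over the whole file, standing in for whatever content-only digest gets used.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path):
    """Placeholder digest: MD5 of the whole file, metadata included."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def group_duplicates(paths, digest=file_digest):
    """Map each digest to the files sharing it; keep only groups of two or more."""
    groups = defaultdict(list)
    for path in paths:
        groups[digest(path)].append(str(path))
    return {d: sorted(files) for d, files in groups.items() if len(files) > 1}
```

Passing a different `digest` function (one that ignores the metadata) would turn the same grouping into a picture-contents comparison.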


BTW: for other purposes imgSeek might be very good, for example
if you are looking for a picture with a similar mood that can be added
to a list of pictures for an exhibition. Some more testing of imgSeek
might make sense then.

Rolf
« Reply #6 on: February 01, 2010, 10:22:20 pm »

If you are looking for exact copies then an MD5 on the image data would be right. But every new save of a JPEG (at least) changes this data a bit. So you won't find a copy that has been saved again.

GIMPel
« Reply #7 on: February 01, 2010, 11:34:49 pm »

Hey Rolf,

If you are looking for exact copies then an MD5 on the image data would be right. But every new save of a JPEG (at least) changes this data a bit. So you won't find a copy that has been saved again.

Well, I mean md5 on the pure picture-data without jpeg-metadata (comments/EXIF).
If I only read the data to calculate the md5sum, there will be no change.
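
To illustrate what md5 on the pure picture data could look like, here is a rough Python sketch (not the tool discussed in this thread). It assumes that all metadata sits in APPn and COM segments before the first SOS marker, skips those segments, and hashes everything else. The function name `jpeg_content_md5` is hypothetical.

```python
import hashlib
import struct


def jpeg_content_md5(data: bytes) -> str:
    """MD5 over a JPEG byte stream with APPn and COM segments skipped."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    md5 = hashlib.md5()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:
            # Fill byte in front of a marker: step over it.
            pos += 1
            continue
        if marker == 0xDA:
            # SOS: from here on it is scan data up to the end of the file.
            md5.update(data[pos:])
            return md5.hexdigest()
        # Segment length is big-endian and includes the two length bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            md5.update(data[pos:pos + 2 + length])
        pos += 2 + length
    raise ValueError("no SOS marker found")
```

Two files that differ only in their EXIF or comment segments then get the same digest, for example via `jpeg_content_md5(open("photo.jpg", "rb").read())`. A file that was decoded and re-encoded will still differ, as Rolf pointed out.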

The problem I have is this: I'm not sure which of the picture files in the old digiKam directory are already included in the F-Spot directory (and database).

If the program only changes the jpeg-comments section
and does not touch the picture-data part of the jpeg-file,
then the pic will stay the same.
I hope that digiKam and F-Spot did just that
when changing the header. If they read in the whole
picture, changed the comments and wrote it back afterwards, then you are right and I am
in a blind alley.
In that case I may have no way to identify the identical pictures.

But hopefully both programs really only touch the jpeg-header!

BTW: Not every operation on JPEG picture data is lossy;
some operations can also be handled losslessly!
AFAIK rotating by a multiple of 90 degrees can be done
losslessly, and maybe some other operations too (AFAIK cropping can be done losslessly
under some constraints on the dimensions, but I'm not sure here).

Tools like exifautotran, which uses jpegtran, can do lossless operations on
jpeg-files (but also provide lossy operations).

When I import pictures into a program like F-Spot or digiKam,
the copy should only be done at the filesystem level,
so opening and closing the file with JPEG-reading programs
should not be necessary.
But when something is changed in the picture header, then - if done in the
wrong way - it also changes the picture data.

Changing the picture data itself should *never* occur in such a case of import.

But who can be sure here?

I can only hope ;-)

Otherwise I have to look through thousands of picture files.
Yesterday I compared some tens or some hundreds until I gave up
and decided on a software solution.

(But it was interesting: I found some old photographs from my pre-DSLR era
which were really good! Comparatively worse camera technology, but good shots.
So it was not pointless to look at so many pictures before I decided to look for a program that helps here.)


EDIT: changed one-way road -> blind alley ;-)

Rolf
« Reply #8 on: February 02, 2010, 12:57:35 am »

One idea for speeding up the process - just compare a little part of the image. Always the same position, of course. ;-) Much less data and the chance of 1000 identical pixels in the same position of two images is quite small.

monoceros84
« Reply #9 on: February 02, 2010, 08:04:15 am »

I still don't get the point. If you want to write that stuff for fun: fine, this is perfectly ok.
But if you just want it to be done I can't see a disadvantage of temporarily using digiKam and the included functions. digiKam won't change your picture directories so it will still work with f-spot afterwards.
Just include the old and the new directory as albums into digiKam, run the duplicate checker and never touch digiKam again if you want to.
However, I am still not sure if this included checker compares only the image data or also the headers.

GIMPel
« Reply #10 on: February 02, 2010, 11:36:03 pm »

One idea for speeding up the process - just compare a little part of the image. Always the same position, of course. ;-) Much less data and the chance of 1000 identical pixels in the same position of two images is quite small.

Wow!

Very good idea! :-)

I think I have to look at a certain number of bytes from the end...
...because the header will be at the beginning of the file.

But at the moment I'm not sure whether the picture data is really the last thing in a JPEG file.

What about other metadata, such as embedded thumbnails?

Will all this stuff come before the picture data begins?

So, I'm not quite sure at the moment.

But if all the additional stuff is at the beginning of the file,
looking from the end of the file backwards might be the fastest thing, yes!
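
As a rough sketch of the compare-from-the-end idea in Python: in a typical JPEG the EXIF block and any embedded thumbnail are stored in APPn segments ahead of the compressed scan data, so the tail of the file is usually picture data followed by the end-of-image marker. That is an assumption, not a guarantee; some files carry extra data after that marker. The names `tail_bytes` and `tails_match` and the 4096-byte window are arbitrary choices for illustration.

```python
import os


def tail_bytes(path, count=4096):
    """Return the last `count` bytes of a file (the whole file if it is shorter)."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        handle.seek(max(0, size - count))
        return handle.read()


def tails_match(path_a, path_b, count=4096):
    """Cheap pre-check: do both files end with the same `count` bytes?"""
    return tail_bytes(path_a, count) == tail_bytes(path_b, count)
```

A matching tail is only a hint that two files hold the same picture; it is cheap enough to use as a first filter before a full comparison of the candidates.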

GIMPel
« Reply #11 on: February 02, 2010, 11:41:28 pm »

I still don't get the point. If you want to write that stuff for fun: fine, this is perfectly ok.
But if you just want it to be done I can't see a disadvantage of temporarily using digiKam and the included functions. digiKam won't change your picture directories so it will still work with f-spot afterwards.
Just include the old and the new directory as albums into digiKam, run the duplicate checker and never touch digiKam again if you want to.

In which directory will digiKam remove the duplicates/triplicates/quadruplicates......
the superfluous files?

I may have a lot of work afterwards...

...it looks more and more as if I have to do it on my own.

BTW: the mess has already increased, because F-Spot
has used a different include-directory than I had before...
... maybe the imports in 2010 were the first imports since I updated
my system... now I have two different Directories used by F-Spot...

... any idea on solving this?
(should start a new thread on this topic...?!)


Quote
However, I am still not sure if this included checker compares only the image data or also the headers.

So we are now where we started ;-)

monoceros84
« Reply #12 on: February 03, 2010, 08:38:59 am »

In which directory will digiKam remove the duplicates/triplicates/quadruplicates......
the superfluous files?

You will probably have to set this. Or you will be asked each time. Anything else would not make any sense.
Sorry, but I still have had no time to test this. The last few days were just 'coming home, going to sleep, eating breakfast and leaving again'...

BTW: the mess has already increased, because F-Spot
has used a different include-directory than I had before...
... maybe the imports in 2010 were the first imports since I updated
my system... now I have two different Directories used by F-Spot...

That's why I don't like F-Spot. I would not even consider using software that changes my manually arranged directory tree. I want my affairs in at least basic order even if the program stops working someday.

... any idea on solving this?
(should start a new thread on this topic...?!)

Please do so. It's a different topic.

Quote
However, I am still not sure if this included checker compares only the image data or also the headers.

So we are now where we started ;-)

Hehe, yes. I promise to try it when I have some hours at home.

GIMPel
« Reply #13 on: February 03, 2010, 01:48:38 pm »

In which directory will digiKam remove the duplicates/triplicates/quadruplicates......
the superfluous files?

You will probably have to set this. Or you will be asked each time. Anything else would not make any sense.
Sorry, but I still have had no time to test this. The last few days were just 'coming home, going to sleep, eating breakfast and leaving again'...


Well, it's best to write my own solution.

I already have some C code that I can rely on
for the jpeg-lib part. Today I looked for a library
that calculates the md5 for me.

I have found the openssl lib, but my first trials gave me
crashes, so I will do it later, when I have more time to do it properly.
Or maybe I will find a better, smaller lib for it.

If it takes more time to learn other programs to solve my task than it takes to program it myself, it's clear what I prefer.
I also have more control over the results, and if nothing works as it should, I know whom to blame for it.




Quote
BTW: the mess has already increased, because F-Spot
has used a different include-directory than I had before...
... maybe the imports in 2010 were the first imports since I updated
my system... now I have two different Directories used by F-Spot...

That's why I don't like F-Spot. I would not even consider using software that changes my manually arranged directory tree. I want my affairs in at least basic order even if the program stops working someday.

Ah, today I used strace to look at F-Spot in detail,
and at why it eats up so much system time.

It does hundreds of completely senseless calls....

It's bloaty bullshit.
Maybe it relies too much on Mono or gimp-libs,
maybe it's bullshit on its own, but when I see
things like a stat syscall on the same file over and over again,
or calls to a gettime function over and over again,
or reads of files that never should be looked at, then it is complete
bloat and bullshit.

If digiKam had not crashed so often for a while,
I would not have changed back to F-Spot.

It seems that if my annoyances persist, I should write my own
picture-managing program.

Look here:

some minutes of stracing f-spot:
===============================
oliver@siouxsie:~$ grep -c clock_gettime  f-spot.STRACE
1438208
===============================

The function clock_gettime() was called about 1.4 million times!
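
For a breakdown by syscall rather than a single grep count, a small Python sketch can tally the names in the strace log. The regular expression assumes the usual strace line layout (optional pid prefix, optional timestamp, then `name(`); `count_syscalls` is a made-up helper, and lines for resumed calls are not counted, so the numbers can differ slightly from `grep -c`.

```python
import re
from collections import Counter

# Optional "[pid N] " or "N " prefix, optional timestamp, then the syscall name.
SYSCALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?:[\d:.]+\s+)?([a-z_][a-z0-9_]*)\("
)


def count_syscalls(lines):
    """Count how often each syscall name starts a line of strace output."""
    counts = Counter()
    for line in lines:
        match = SYSCALL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Something like `count_syscalls(open("f-spot.STRACE")).most_common(10)` would then list the ten most frequent calls.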

THIS ABSOLUTELY IS BLOAT AND BULLSHIT!

If it has a database, why does it not use it?

If it has already read information about the files,
why does it not use it?
If there was no new import, it is not necessary to look up the information over and over again.

Bad design. Bloat. Bullshit.

Linux is becoming more and more a free version of M$.
They copy all the worst things.
Mono is such a copy of nonsense.
It is intended to make programming easy, but it switches off the brain of the developer.
...like Java, heheh.

OK, I better stop now.




Quote
Quote

So we are now where we started ;-)

Hehe, yes. I promise to try it when I have some hours at home.

OK, no further action needed on this topic,
I will program my own solution.

But thank you and the others for your effort.


GIMPel

GIMPel
« Reply #14 on: February 03, 2010, 09:35:31 pm »

[...]

OK, no further action needed on this topic,
I will program my own solution.

But thank you and the others for your effort.


GIMPel

OK, program runs as expected.  :-)

Just some code cleaning now and I have what I wanted. :-)

EDIT: at the moment 218 lines of C; I can throw out some stuff, but will add some comments... faster than hoping other programs MAY help me somehow... heheh.

Itbftr :-P