Finding Duplicates

I really, really want to find duplicate records/assets by some useful identifying information aside from the less-than-helpful options Cumulus Desktop Client currently provides. I've never had good luck with File Data Size or Asset Creation Date. The first option in the Find Duplicates dialog box arguably holds the most unrealized promise but, like many features, was myopic in conception and implementation: "Compare Names Using Field... [drop-down list of Record Name or Asset Name]".

Give the option of expanding that to include other fields, or add an entirely new option to pick a field to compare. The prime candidate would be "XMP Original Document ID"... or any of the XMP fields Cumulus captures from the XMP Media Management Schema, for that matter... oh heck, IPTC Date Created too (as it also captures the time).

As I write this I'm resorting to several hacks to determine the duplicate assets one of my photographers placed into a cataloging hot folder. The assets are automatically renamed so I can't rely on the Record Name or Asset Name options.

10 thoughts on “Finding Duplicates”

  1. I consider myself a master black belt in de-duplicating large collections. That’s because I accidentally double and triple loaded thousands of assets during my first DAM implementation. I have two primary methods that work well and maybe you can adapt them to your situation. I was on Windows and used Artesia, so the tools will be different.

    First method was to clean up assets outside the DAM. I used a cool utility (free) called DoubleKiller. I usually compare filename, file size and checksum. Checksum is the key because it limits the odds of two different files having the same checksum to 1:20,000,000 or something like that. There are a few other Windows utilities that can do this and I am sure Mac versions exist (careful, some of these utility sites are scams).

    The second method is similar but uses Artesia’s feature to calculate checksum on ingest and then store it as file level metadata. It is a little slower than the first method but it lets you see all metadata on the assets. The metadata values usually give hints on why the duplicates exist in the first place. That’s how I found out it was me (Imported by=you).

    The Artesia method requires the use of their Expert Search function that is only open to admins. The same queries could be easily run against any DAM database by someone skilled in SQL. I used to take those database reports and work them in Excel using filters and sorts.

    File checksum is the key to any reliable asset cleanup. Without that value, I think all other metadata based cleanup techniques are risky.
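A minimal sketch of the outside-the-DAM checksum comparison described above, in Python rather than DoubleKiller. The function names `file_md5` and `find_duplicate_files` are made up for illustration; the sketch groups files by size first so that only files that could possibly be identical get hashed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Group files under root by (size, MD5); return groups with 2+ members."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a byte-identical twin
        for path in paths:
            groups[(size, file_md5(path))].append(path)

    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicate_files` on a folder returns a list of groups, each a list of paths with identical content. Filenames are deliberately ignored here, since renamed copies are exactly the case a name comparison misses.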

    • Hey John!

      Thanks for commenting. I agree that MD5 Checksums are the best way. I hate mucking about with files in the OS once they’ve been cataloged… which was why I was trying to do this only inside of it. Granted that metadata alone can be risky, but if there had already been metadata write-backs to the assets (which in my case there were) the Checksums wouldn’t match because the XMP packets would have been modified.

      That’s why I really wanted to compare against the Original Document ID field in the XMP. I knew that in this pool of assets, that would be a sure-fire way to identify duplicate DNGs.

      Cheers!

      • Andrew,

        Your write-backs make me sad :-( Actually, I agree with the technique but maybe not the implementation. I would have the original checksum saved as metadata as version one and then write-back in version two or create a relationship. I think the asset state at import should always be recorded and saved as history. Interestingly, a lot of consumer photo software re-writes file attributes at import.

        Great blog post.
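The Original Document ID comparison discussed in this thread can be approximated outside Cumulus. The sketch below is an illustration only, not something Cumulus provides: it assumes the XMP packet is embedded in the file as uncompressed text (typical for DNG, TIFF and JPEG) and simply scans the bytes for `xmpMM:OriginalDocumentID`, in either attribute or element form. The helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches both common XMP serializations:
#   attribute form: xmpMM:OriginalDocumentID="..."
#   element form:   <xmpMM:OriginalDocumentID>...</xmpMM:OriginalDocumentID>
_ORIGINAL_ID = re.compile(
    rb'xmpMM:OriginalDocumentID\s*(?:=\s*["\']([^"\']+)["\']|>\s*([^<\s]+)\s*<)'
)


def original_document_id(data):
    """Return the first OriginalDocumentID found in raw file bytes, or None."""
    match = _ORIGINAL_ID.search(data)
    if match is None:
        return None
    value = match.group(1) or match.group(2)
    return value.decode("utf-8", errors="replace")


def group_by_original_id(paths):
    """Map each OriginalDocumentID shared by 2+ files to the files carrying it."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as handle:
            doc_id = original_document_id(handle.read())  # reads the whole file
        if doc_id is not None:
            groups[doc_id].append(path)
    return {doc_id: found for doc_id, found in groups.items() if len(found) > 1}
```

Unlike a checksum, this value should survive metadata write-backs, which is the point made above. Files without the field are skipped rather than treated as matches.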

  2. Hi Andrew,

    This is actually one of those situations where there’s a really nice, but undocumented Cumulus feature that could help.

    If you add the Record Field “Asset Content Identifier” to your Catalog (and of course update the Records in the Catalog), Cumulus will automatically fill that field with an MD5 checksum of the Asset.

    You could use that field for your search to find duplicates. I guess you would have to create some script or EJaP to accomplish exactly what you’re looking for, but at least it’s a possible way.

    The other very cool thing about this hidden feature is that it can be used to prevent duplicates from being cataloged in the first place.

    If the above field is present, and indexed for searching/sorting, Cumulus will use it in the ordinary Asset Handling Set duplicate control, provided you do one more thing: add the following line to the client.xml:
    1

    If you also add the following line to that same file, Cumulus will also include “Asset Name” in the duplicate search:
    1

    With these two lines present, Cumulus will treat files as duplicates only when both the Asset Name and the MD5 checksum match.

    I know, you have to do it on every single client, e.g. Cumulus Native Client, InDesign Companion, RoboFlow etc. Not ideal, but doable.

    (By the way, if you use a Central Asset Location, you also need to add the field/fields “Original Asset Content Identifier” and “Original Asset Name” to your Catalog.)

    Yeah, I can guess what you’re probably thinking. This sort of functionality should be documented and it should be configurable by other means than manually editing a configuration file. I have told Canto that, so hopefully it will be more visible in a future release.

    • Thank you for your comment, Johan!

      I have a love/hate relationship with the “Asset Content Identifier” field. We used to have it in our catalog, but it was (in part) causing tremendous slowdown while cataloging… the Cumulus tool got bogged down computing the MD5s on our assets. I say in part, because there were a number of factors at play… that field being one of them.

      Could you repost the tags to client.xml without the opening and closing markers? WordPress didn’t like them and only put the value of “1” instead.

      Cheers!
      –Andrew

      • The two tags to enter in the client.xml are (opening and closing markers omitted):

        ns:duplicatecontrolusecontent 1 /ns:duplicatecontrolusecontent

        ns:duplicatecontrolusecontentassetname 1 /ns:duplicatecontrolusecontentassetname
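Since the setting has to be present on every client, a small check can help confirm which client.xml files already carry it. This is a rough sketch under assumptions: it treats the tags as ordinary elements with the `ns:` prefix and the angle brackets restored, and it matches case-insensitively only because the thread spells the tag names in two different casings. `setting_enabled` is a made-up helper, not part of Cumulus.

```python
import re


def setting_enabled(client_xml_text, tag_name):
    """Return True if <ns:tag_name>1</ns:tag_name> appears in the given text."""
    pattern = r"<ns:{0}>\s*1\s*</ns:{0}>".format(re.escape(tag_name))
    return re.search(pattern, client_xml_text, flags=re.IGNORECASE) is not None
```

For example, read a client's client.xml into a string and call `setting_enabled(text, "DuplicateControlUseContent")`, then repeat with `"DuplicateControlUseContentAssetName"`.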

          • hi andrew

            i was just going to say the same thing that johan said

            (ive also omitted the less-than and greater-than signs in the ns: tags)

            1)
            you can in fact use the md5 of the asset content (a completely different method of checking
            for duplicates, due to the nature of data that is used)

            2)
            first you need to add the “asset content identifier” field
            to your catalog and switch on indexing for sort/search.

            3)
            then update all records to fill this field (or you only use the field for newly added assets).

            4)
            if you are using a central asset location you should also add the field “original asset content identifier” and index it.

            5)
            this field is used when checking for duplicates while the asset in the central asset location may have changed
            and the “asset content identifier” got updated accordingly.

            6)
            you also need to change the client.xml of the cataloging software to use the field
            instead of the usual method for checking duplicate assets:
            ns:DuplicateControlUseContent 1 /ns:DuplicateControlUseContent

            please be aware:
            !! There is an important change with version 8.6.1 !!
            Because of a bugfix the client no longer writes into or reads from the original client.xml
            in the conf folder of the client installation.
            a writable copy of the client.xml will be used which is stored in the user's application data folder.
            On Mac OS X this is:
            /Users/USER/Library/Application Support/Canto/Cumulus Client/conf/client.xml

            On Windows it can differ from system to system. On Windows 7 it is:
            C:\Users\USER\AppData\Roaming\Canto\Cumulus Client\conf\client.xml

            7)
            this performs a search for the “original asset content identifier” if present
            or the “asset content identifier” if present
            and assumes all matching records represent the asset being imported.

            8)
            use the option “add only” in the asset handling set options to avoid updating the
            matching record which would then change the asset reference if the asset is not in the central asset location.

            9)
            an additional client.xml option allows you to also include the “asset name” in the search:
            ns:DuplicateControlUseContentAssetName 1 /ns:DuplicateControlUseContentAssetName

            then only records that match the “asset name” in addition to the content are assumed to reference the asset being imported.

            10)
            note:
            we had one customer who had the following use case:
            he burns cd-rom series of assets and includes an overlapping set of assets on each cd-rom but the actual assets vary from cd-rom to cd-rom.
            he also used a central asset location and wanted to have each asset imported only once regardless of the location on the cd-rom and the asset name.

            cheers
            ~ kyle
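The lookup described in steps 7 and 9 can be modeled roughly as below. This is a reading of the comment, not Cumulus's actual implementation: records are plain dictionaries, the key names are invented stand-ins for the catalog fields, and "if present" is interpreted per record, which is an assumption.

```python
def find_matching_records(records, content_md5, asset_name=None, use_asset_name=False):
    """Return the records assumed to represent the asset being imported.

    records: iterable of dicts standing in for catalog records (hypothetical keys).
    use_asset_name: models the DuplicateControlUseContentAssetName option.
    """
    matches = []
    for record in records:
        # Prefer the "original" identifier when the record carries one (step 7).
        recorded_md5 = (record.get("original_asset_content_identifier")
                        or record.get("asset_content_identifier"))
        if recorded_md5 != content_md5:
            continue
        if use_asset_name:
            # With the extra option, the name must match as well (step 9).
            recorded_name = record.get("original_asset_name") or record.get("asset_name")
            if recorded_name != asset_name:
                continue
        matches.append(record)
    return matches
```

With `use_asset_name=False` a renamed copy of the same file still counts as a duplicate; with `use_asset_name=True` it does not, which is the trade-off behind the cd-rom use case in step 10.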

  3. As always it is important to really define the term duplicate first, because as you mentioned you are not (necessarily) looking for a duplicate of the file (as XMP/IPTC information inside it might have changed) but really for a duplicate of the image stored in a file. While a checksum is the best way to find duplicate files, there are several possible issues with finding duplicate images.
    We (Modula4) offer a tool called ImageSearch that actually helps you find similar images to a given source image. Since you can adjust the sensitivity of the algorithm you could also use it to find duplicate images; that would work for exact duplicates as well as for slightly changed variations of an image. So for what you are looking for ImageSearch might actually give you the best results even though it has not been built for this, but maybe if there is interest we could look into adding some de-duplication functionality.
