
retaining folder names as 'tags' for duplicate documents


I have a large PDF collection - and several applications that claim to help me 'manage' my PDFs.

Some are focused on being note-taking tools; some see judicious selection from your PDF collection as research fodder for the writing process (Scrivener, Selenium); others are interested in organizing and mobilizing reference metadata for citation while writing papers (Reference Manager, RefWorks, EndNote); others have only added file management to robust citation management (Sente, EndNote) or vice-versa (Zotero).

And then there's Papers. Hands down the best document + metadata search, retrieval & article management app for the Mac.

But it doesn't 'do' reference citation during the writing process. Instead, it hands off its metadata on your papers to EndNote, Bookends, Sente, (insert Reference Management Software title here) to do the inline (or offline) citation work.

And it doesn't scale gracefully to include metadata for non-article references (i.e. it has one type of entity that it tracks metadata for - journal articles. If you have book chapters, conference talks, or course lectures in your collection, good luck). Other citation management software dynamically changes the metadata fields available based on the document type of your citations.

Of course, many crazy document types = a big headache for proper formatting of your citations - you'd have to have a formatting template for each document type for each citation style, ready and waiting in the wings. And it seems like every field and/or journal has its own 'proper' citation format these days... So Papers lets you get your hate on with your reference management software - smart move? Maybe..

But this post is about the problem of harvesting and organizing documents outside of these PDF management apps, & how to capture that organizational work back into your collection when you get rid of duplicate PDFs of the same document.

My scenario:

When writing a literature review on clustering methods for visualizing text (pre-Mac!), I manually and individually collected about 300 articles over the course of about 5 weeks.

As I found a good document, I downloaded the PDF and I simultaneously downloaded an RIS file for each citation - and imported that info into a Reference Manager database.

I used a custom output template to create a string that I used to rename each file - manually (e.g. Authors. PubDate. ArticleTitle.pdf)
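The renaming step can be sketched in a few lines of Python. This is only an illustration of the 'Authors. PubDate. ArticleTitle.pdf' pattern, not the Reference Manager output template I actually used; the function name `make_filename` and the choice of which characters to strip are assumptions.

```python
import re

def make_filename(authors, pub_date, title, ext=".pdf"):
    """Build 'Authors. PubDate. ArticleTitle.pdf' from citation metadata."""
    def clean(text):
        # Drop characters that are not allowed in file names on Windows
        # (which also covers the ':' and '/' that cause trouble on a Mac).
        text = re.sub(r'[\\/:*?"<>|]', "", text)
        # Collapse whitespace and drop a trailing period so the
        # '. ' separator is not doubled.
        return re.sub(r"\s+", " ", text).strip().rstrip(".")
    parts = [clean(authors), clean(pub_date), clean(title)]
    return ". ".join(parts) + ext

print(make_filename("Smith, J", "2005", "Clustering text: a survey"))
```

Note that periods inside a field (author initials, say) are indistinguishable from the separators, which is one reason matching on these names later gets messy.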

A renamed file was an implicit indication that I had its metadata in hand; so the file was then moved from the 'downloads\get metadata!' folder to the 'in RefMan' folder - manually.

Problem with this process is that hand-curating your collection - getting a file, getting its metadata, (correcting metadata), doing some principled renaming of files, metadata-based relocating of files - takes sooooo much time.

At one point in the writing process, I was advised to focus on only those articles that featured visualizations of the junk they were discussing.

So I manually went through and skimmed each file looking for 'pictures.' (It was 4th grade Language Arts field trip to the library all over again.)

I copied each article that contained some sort of picture into a subfolder called 'ContainVis'.

This content-based classification created a duplicate file, but I wasn't worried about that then.

The problem was, I had begun reading and annotating the larger collection of files (i.e. 'in RefMan'), and this included adding highlights and other text annotations within the PDF files, using Adobe Acrobat.

So now I have two copies of a single article, which may or may not have different annotation contents.

Later, when trying to 'mac up', I copied the original folder documents into my Papers database, *and* into my Sente database. (At that point the question of which app would 'win' my PDF management business was still an open one.)

I'm also pretty sure that at a separate manic point of disk-cleaning, I copied all the documents in the 'in RefMan' folder into eArticles.

So that leaves oh, 2 plus-or-minus 2 copies of the same files that may-or-may-not have different annotation metadata attached.

And, oh yeah, the PDF management software packages both change my manual file naming schema (which separates data elements with periods) to using the same data elements separated by commas.

So you can't just match on file names, because they may be different (though mostly the same).
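One way around the period-versus-comma difference is to compare a normalized form of each name instead of the name itself. A minimal sketch, assuming every file still ends in '.pdf' and that punctuation and letter case are the only things the apps changed; `match_key` is a made-up helper name.

```python
import re
from pathlib import PurePosixPath

def match_key(filename):
    """Reduce a file name to lowercase words so separator style is ignored."""
    # .stem drops only the final extension ('.pdf'), leaving the
    # period- or comma-separated fields intact.
    stem = PurePosixPath(filename).stem.lower()
    # Treat every run of punctuation, underscores, or whitespace as one space.
    return " ".join(re.sub(r"[\W_]+", " ", stem).split())

a = match_key("Smith. 2005. Clustering Text.pdf")
b = match_key("Smith, 2005, Clustering Text.pdf")
print(a == b)
```

This only catches 'mostly the same' names; if an app truncated a long title or reordered the fields, the keys will still differ.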

You can't just go with the most recent last modified date, because non-content changes may have been saved after annotations were made on an original file.

You can't compare hashes either, because a copy that has been annotated or re-saved no longer matches the original byte for byte.
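Hashes can still do a first pass, though: copies whose hashes do match are byte-identical, so one of them can go without any risk of losing annotations, and only the mismatched pairs need a closer look. A rough sketch using Python's standard hashlib; the function names are mine.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_bytes(path_a, path_b):
    """True only if the two files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)
```

A mismatch says nothing about why the files differ; it could be annotations or just a re-save, which is exactly the problem.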

The best I have been able to do is to compare filenames, and create subsets of matched files based on their location; zero in on the files in the locations I want to delete while retaining my content-based categorization; use file label colors as temporary flags to color both keeper and copy files; select labeled files in the keeper folder; tag them.

I still haven't figured out how to compare the annotations inside documents - though I have found SDKs and scripts which I could use...
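The same idea as a script, roughly: group paths by normalized name, pick the copy in the keeper folder, and turn the folder names of the other copies into tags. Everything here is a sketch under assumptions: `plan_tags` and `match_key` are made-up names, the paths are just strings, and actually applying the tags in Finder, Papers or Sente is left out.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

def match_key(path):
    """Lowercase words of the file name, ignoring punctuation style."""
    stem = PurePosixPath(path).stem.lower()
    return " ".join(re.sub(r"[\W_]+", " ", stem).split())

def plan_tags(paths, keeper_folder):
    """Map each keeper file to the folder names of its duplicates."""
    groups = defaultdict(list)
    for p in paths:
        groups[match_key(p)].append(PurePosixPath(p))
    plan = {}
    for copies in groups.values():
        keepers = [c for c in copies if c.parent.name == keeper_folder]
        if not keepers:
            continue  # no copy in the keeper folder; leave this group alone
        # Folder names of the other copies become the tags to retain.
        tags = sorted({c.parent.name for c in copies} - {keeper_folder})
        plan[str(keepers[0])] = tags
    return plan

paths = [
    "in RefMan/Smith. 2005. Clustering.pdf",
    "in RefMan/ContainVis/Smith. 2005. Clustering.pdf",
    "eArticles/Smith, 2005, Clustering.pdf",
]
print(plan_tags(paths, "in RefMan"))
```

The plan only says which copy to keep and which folder names to carry over as tags; it deletes nothing, and it does not tell you which copy holds the annotations you care about.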
