Bulk harvesting of newspaper articles from Trove on MacOS 10.9 or 10.10 using Retailer – Instructions

The following is a cut-and-paste from my blog entry on this subject 🙂

Conal Tuohy (@conal_tuohy) presented a session at THATCamp Canberra 2014 on Retailer, an interface tool he’s developing to give the National Library of Australia's Trove service an interface that complies with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

The aim of the session was to get attendees to install Retailer on their laptops and then perform some searches.
It turned out that installing Retailer on the Mac laptops present wasn’t quite as straightforward as might have been hoped (the Linux-heads present had no such problems).

During the session, we worked out a procedure that does work for users of MacOS 10.9 (Mavericks) and MacOS 10.10 (Yosemite). This procedure is explained, step-by-step, below. Please read through these instructions in their entirety before you try to install Retailer on your Mac, so that you don’t make incorrect assumptions about the following steps 🙂 Please note that I’m going to make the following assumptions:

  1. You haven’t moved your Downloads location from the default (i.e. the Downloads folder in your home directory).
  2. You know how to open the Applications folder to see the complete list of your installed applications.
  3. You’ve applied for, and received, a Trove API key. You’re not going to get far without one.
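
If you'd like to check the first assumption (and your OS version) from the Terminal, here is a minimal sketch. The function name is_supported_macos is my own invention for illustration and is not part of Retailer or MacOS; sw_vers is the standard MacOS command for reporting the OS version.

```shell
# Hypothetical helper (not part of Retailer): succeeds only for 10.9.x or 10.10.x.
is_supported_macos() {
  case "$1" in
    10.9|10.9.*|10.10|10.10.*) return 0 ;;
    *) return 1 ;;
  esac
}

# sw_vers ships with MacOS; the check is skipped on other systems.
if command -v sw_vers >/dev/null 2>&1; then
  version="$(sw_vers -productVersion)"
  if is_supported_macos "$version"; then
    echo "MacOS $version: covered by these instructions"
  else
    echo "MacOS $version: not covered by these instructions"
  fi
fi

# Assumption 1: downloads land in the Downloads folder in your home directory.
[ -d "$HOME/Downloads" ] || echo "No Downloads folder found in your home directory"
```

Note that the version is matched as text rather than compared as a number, so 10.10 is never confused with 10.1.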

The installation instructions

  1. Start by reading Con’s blog post introducing Retailer. You may not understand all of it, and it’s very Debian-centric, but read it anyway, so you understand what Retailer is and how it works, and why you need to download various pieces of software.
  2. Download the Java Development Kit (JDK) installer. Yes, you want the JDK (which installs a full Java compiler & tools), not the Java Runtime Environment (JRE, which is just a plugin for your web browsers). You also need to ensure that you download Java 8 update 25 or newer; earlier versions of the installer weren’t aware of MacOS 10.10 (the latest, greatest), treated it as 10.1 (ye olde ancient version from the early 2000s), and would refuse to install because they thought your OS was too old. You can download the installer from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html You need to click on the radio button that says “Accept Licence Agreement”; then you can download jdk-8u25-macosx-x64.dmg. Let it put it into your default Downloads location. Do not install it at this time.
  3. Download Apache Tomcat v. 8 from https://tomcat.apache.org/download-80.cgi Look in the section labelled “Binary Distributions”. The first sub-section is labelled “Core”. You should download the tar.gz version. Do not unpack the compressed file at this time.
  4. Download jOAI from http://www.dlese.org/dds/services/joai_software.jsp Click on the “Download from SourceForge” link, and let it put the download in your default downloads folder. Do not unpack the download at this time.
  5. Download Retailer from https://github.com/Conal-Tuohy/Retailer/releases You should click on the green button with the down arrow and “retailer.war” on it. Let it put it in your default download location. Do not do anything with this file at this time.
  6. OK, at this point your default downloads location should contain (please note the version numbers in the following were current at time of writing, your mileage may vary):
    1. apache-tomcat-8.0.14.tar.gz
    2. jdk-8u25-macosx-x64.dmg
    3. joai_v3.1.1.3.zip
    4. retailer.war
  7. Now it’s time to visit our friend the command-line. Open up a Terminal window (the Terminal is in the “Utilities” folder inside your “Applications” folder). Do not close this Terminal window until you’ve finished working through these instructions; you’re going to be making a great deal of use of it.
  8. You need to decide where you want to put the apache-tomcat installation. I recommend the /Users/Shared folder. Type
    cd /Users/Shared
    into the terminal window, and hit return.
  9. Now type the following three lines into the Terminal, hitting the return key after you’ve typed each line. The first line unpacks the tomcat server, the second line copies retailer.war to where it needs to be, and the third line extracts oai.war from the archive and puts it where it needs to be.
    tar -xvf ~/Downloads/apache-tomcat-8.0.14.tar.gz --gunzip
    cp ~/Downloads/retailer.war apache-tomcat-8.0.14/webapps/
    unzip -j ~/Downloads/joai_v3.1.1.3.zip joai_v3.1.1.3/oai.war -d apache-tomcat-8.0.14/webapps
    
  10. OK, now you should install the Java Development Kit. Double click on the jdk-8u25-macosx-x64.dmg file to open the disc image, then run the enclosed installer. Once the installation has completed, eject the disc image.
  11. Go back to the Terminal. Type
    java -version
    If all has gone well, you should see something like:

    java version "1.8.0_25"
    Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
    Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
  12. Now type
    ./apache-tomcat-8.0.14/bin/startup.sh
    If all goes well, you should see something like:

    Using CATALINA_BASE: /Users/Shared/apache-tomcat-8.0.14
    Using CATALINA_HOME: /Users/Shared/apache-tomcat-8.0.14
    Using CATALINA_TMPDIR: /Users/Shared/apache-tomcat-8.0.14/temp
    Using JRE_HOME: /Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home
    Using CLASSPATH: /Users/Shared/apache-tomcat-8.0.14/bin/bootstrap.jar:/Users/Shared/apache-tomcat-8.0.14/bin/tomcat-juli.jar
    Tomcat started.
  13. Start your web browser of choice, and point it at:
    http://localhost:8080
    If all goes well, you should see a web page for Apache Tomcat.
  14. When Tomcat started up, it should have unpacked the two .war files into separate directories for you. You need to edit Retailer’s configuration file. Go back to your Terminal window, and type

    open -a TextEdit apache-tomcat-8.0.14/webapps/retailer/WEB-INF/web.xml
    to open the file in TextEdit. Replace the text “INSERT TROVE API KEY HERE” with your Trove API key.
    Now you need to add an additional parameter, to tell Retailer that you’re going to use it to perform Trove searches. Add the following lines just before the <servlet> line:

    <context-param>
    <param-name>xslt</param-name>
    <param-value>trove.xsl</param-value>
    </context-param>

    Save the file and exit TextEdit.

  15. Go back to the Terminal window, and type
    cp apache-tomcat-8.0.14/webapps/retailer/WEB-INF/web.xml /Users/Shared/retailer-config-backup.xml
    This will make a backup of your configuration file outside the Retailer web app; I’ve had my web.xml “restored” to the default a couple of times through no action of my own, so having a backup on hand has been useful.
  16. Point your web browser at:
    http://localhost:8080/oai/admin/harvester.do
    and click on “Add new harvest”.
  17. Fill in the settings as per Con’s blog post. For your first harvest, I suggest you use “search: international cometary explorer”; this doesn’t match too many items (most are in The Canberra Times, post 1954). Note the section “Save files from this harvest:”.
    The default harvest location is
    /Users/Shared/apache-tomcat-8.0.14/webapps/oai/WEB-INF/harvested_records
    You’ll probably want to put these somewhere else, so select “at a location I specify…” and type in a folder path (e.g. /Users/Shared/harvested_records/ICE ). Click on the “save” button.
  18. Click on “All” under “Manual Harvest”. You’ll be asked if you want to replace the results of a previous harvest. Since you haven’t harvested before, your answer should be “OK” (in future, you’ll be better off clicking on the “New” button to add any new results to your pre-existing harvest).
  19. Wait. Depending upon your search parameters, your harvest may take some time. You can keep an eye on it by clicking on “View harvest history and progress” and then occasionally refreshing the page.
  20. Your harvested records will be stored in the location you specified.
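
If you'd rather watch the harvest from the Terminal than keep refreshing the history page, counting the files in your harvest folder gives a rough progress indicator. This is only a sketch: count_harvested is a hypothetical helper name of mine, the folder is the example path from step 17, and I'm assuming every file jOAI writes there is a harvested record (it skips the .DS_Store files that Finder likes to create).

```shell
# Hypothetical helper: count the regular files under a harvest folder,
# ignoring Finder's .DS_Store files.
count_harvested() {
  find "$1" -type f ! -name '.DS_Store' | wc -l | tr -d ' '
}

# The example folder from step 17; change it to wherever you saved your harvest.
harvest_dir="/Users/Shared/harvested_records/ICE"

if [ -d "$harvest_dir" ]; then
  echo "$(count_harvested "$harvest_dir") files harvested so far"
else
  echo "$harvest_dir does not exist yet"
fi
```

Run it a few times while the harvest is in progress; the number should keep climbing until the harvest finishes.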

Please note that unless you specifically turn it off, the Tomcat server will continue running until your computer is shut down or rebooted; even if you log out and log in as a different user, the Tomcat server will continue running. You can turn it off by typing
./apache-tomcat-8.0.14/bin/shutdown.sh
into the Terminal window (while you’re still in the /Users/Shared folder).
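
To confirm whether Tomcat is still running, you can ask curl (which comes with MacOS) whether anything answers on port 8080. This is a rough sketch only: tomcat_is_up is a hypothetical helper name of mine, and any web server listening on that port would count as "up", not just Tomcat.

```shell
# Hypothetical helper: succeeds if something answers HTTP at the given URL.
tomcat_is_up() {
  curl --silent --output /dev/null --max-time 5 "$1"
}

if tomcat_is_up "http://localhost:8080"; then
  echo "Something is still answering on port 8080"
else
  echo "Nothing is answering on port 8080"
fi
```

Tomcat can take a few seconds to stop, so if it still appears to be running straight after you've asked it to shut down, wait a moment and try again.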
