How to use the 'tools' app
Tom has built an app containing a number of useful tools for Cline Center employees. The tools in this app are listed below. The most recent version of the app can currently be found on nexus-5-prod.clinecenter.illinois.edu. To run it, go to /home/cline/apps/dedup and run the command 'sh ./tools.sh'. This will bring up all of the options listed below, along with directions for running each one. Please note that this app is for Cline Center staff only and is not intended for outside use.
1 - Compare AIDs in Mongo with Solr: Compare the AIDs found in Mongo with those found in Solr to see which AIDs are missing and where. It will also catch duplicate AIDs in Mongo.
2 - Compare Article Archives: Compare two article archives to ensure the same AIDs exist in both.
3 - Compare Mongo Collections: Ensure that all articles in one Mongo collection are also in another, optionally restricted to a given source name.
4 - Compute Named Entities for a Document: This tool will compute named entities for the text of a document. The AID must be supplied by the user.
5 - Deduplicate Within a Document Set: Given a set of AIDs provided in a CSV file, find any documents within the set that are duplicates of others. Information about each duplicate is printed at the console, and a deduplicated list is produced in another CSV file. The shingle size (number of characters used to compute each shingle), the resolution (number of shingles to keep), and the match threshold can all be configured, as can the input and output files.
6 - Dump Article HTML: Dump the HTML for a set of articles specified in a CSV file into a directory specified by the user.
7 - Dump Article Text: Dump the text for a set of articles specified in a CSV file into a directory specified by the user.
8 - Dump Coded AIDs: Dump all AIDs associated with coded events; these are confirmed relevant articles.
9 - Dump Events and Text: Dump a CSV representation of all event data associated with a set of queue IDs. The queue IDs must be provided in a newline-separated file. The output includes the annotated text excerpt and the locations of the annotations for each event.
10 - Find Whitelisted AIDs: Produce a list of the AIDs from a file that are in the whitelist.
11 - Generate Test Train: Generate train and test CSVs (90% train, 10% test) from the provided input dataset.
12 - Inject new records from a CSV file: This tool reads a CSV file and injects the resulting records directly into the datastore, bypassing the staging mechanism and processing pipeline. This can be used to guarantee that a record lands in the datastore. The CSV must include an AID and an offset, and the AID must NOT be a duplicate. For existing records, staging should be used instead.
13 - Restore Solr Index from MongoDB: Restore a Solr index from a MongoDB datastore.
14 - Show All Document Data: This tool, given an AID, will show all the metadata for the associated document.
15 - Show Document Text: This tool, given an AID, will show the article text only.
16 - Show Geocoding for Document: Given a document ID (or comma-separated IDs), show the geocoding parse tree (used to compute support) and the geocoding for each cluster.
17 - Stage Changes: This tool reads a CSV file and stages the documents found there for reprocessing. If a CSV row contains an AID, the associated document will be updated. Computationally produced metadata is always stripped and regenerated; this includes sentiment scores, extracted data, people, places, geolocations, and so on. These documents, however, are NOT deduplicated, as that is assumed to have been done already. Rows for new content to be added to the repository will NOT include an AID field; in these cases, the publication date, title, and content must be included.
18 - Validate Quiver Datastores: Ensure Solr and MongoDB are in sync in the default Quiver.
19 - Validate Similarity Detection: Search for similar documents in the corpus. A 10-day window (publication date) around each document is searched for duplicates.
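The similarity machinery behind options 5 and 19 is driven by three parameters: shingle size, resolution, and match threshold. As a rough illustration of how those parameters interact (this is NOT the app's actual implementation; all function names and default values below are assumptions), a character-shingle comparison might look like:

```python
# Hypothetical sketch of character-shingle similarity. Names and defaults
# are illustrative assumptions, not the tool's real code.

def shingle_sketch(text, shingle_size=5, resolution=64):
    """Hash every overlapping window of `shingle_size` characters and keep
    the `resolution` smallest hashes (a MinHash-style sketch of the text)."""
    hashes = {hash(text[i:i + shingle_size])
              for i in range(max(1, len(text) - shingle_size + 1))}
    return set(sorted(hashes)[:resolution])

def similarity(sketch_a, sketch_b):
    """Jaccard similarity of two sketches: shared hashes / total hashes."""
    if not sketch_a or not sketch_b:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a | sketch_b)

def is_duplicate(text_a, text_b, match_threshold=0.9):
    """Two texts are treated as duplicates when their sketch
    similarity meets the configured match threshold."""
    a = shingle_sketch(text_a)
    b = shingle_sketch(text_b)
    return similarity(a, b) >= match_threshold
```

A larger shingle size makes matches stricter (longer character runs must coincide), a larger resolution makes the comparison more precise at the cost of memory, and lowering the match threshold flags more near-duplicates.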