Outline of Development Items (Click on the 'Asana' items to see the task in Asana)
1. Improve the shingler system based on the de-duplication work that Tom and Loretta did for the project.
2. Identify a solution to store duplicates without taking up so much disk space. (Asana)
3. Lodestar bug fixes (Asana) (Already addressed)
4. Clean up the Gradle builds (Asana) (Already addressed)
5. Addressing GNI Features and Bugs
Fix issues with the GNI
Fix the identified issues with cleaning the text of content (Asana)
Better organize and present Bulk LexisNexis data (Asana).
Remove source_name “LexisNexis” (Asana)
Resolve issues with the processing of new BLN data
Resolve issues with empty values
Add new features to GNI
Add a feature to add new BLN sources (Asana)
Add word count (or single word token counts) variables for Content and Title (Asana)
Document-term matrices as an output? (Asana)
Add text complexity variables? If so, what? (Asana)
Update Moral Foundation sentiment using Diesner’s ‘Expanded Moral Foundations’ dictionary (Asana): https://dl.acm.org/doi/abs/10.1145/3465336.3475112
Add extracted quote and speaker variables (assuming we use Stanford CoreNLP?). Nice to have but not required. (Asana)
Web
Explore adding new sources to the web crawler. If we can add them, then add them. (Asana)
Add a new variable (or maybe use the ‘publisher’ field) to denote the source of web content instead of simply using the URL field. (Asana)
Improve the performance of the SPEED Classifier (Not required; we probably need to focus on this in the next dev cycle) (Asana)
Add new content: (Asana)
Add URLs for BulkLexisNexis_v2 sources (for users who do not have access to the content) (Asana)
Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data
7. Transition from Illinois NER to Stanford CoreNLP or some other alternative (e.g., spaCy). (Asana)
8. Improve the Geocoder based on findings from the Event Detection work that Buddy and Loretta are doing (Asana)
Update the Geonames database that the geocoder uses (done)
Implement pattern-finding approaches to identify geolocations
Improve the list of potential geolocations that the geocoder provides for a location entity that has been identified in a text
Create a new field with the top choice (based on score) for a predicted location.
Assess how well the geocoder performs using the Universal Gold Standard event data and compared with other available off-the-shelf geocoders
Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data
9. Archer Updates
Migrate Archer UI to new Vaadin (Asana)
Archer UI Updates
Feature Requests
Required
Not required but would be nice to have (Asana)
Bug fixes
Archer API Features (Asana) (Already addressed)
Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data
10. MongoDB Items
11. Storm Cluster items
12. Retire ‘production’ VM (Asana)
Resolve CCP portal. We need to do one of the following:
Figure out where to store Presidential Campaigns Data
Migrate the MySQL database that is on this VM
Who it impacts: Cyber-infrastructure, CCP, Scott Althaus, Analytics and Data Management
Expected Deployment Date: May 2023
VMs that will be retired and what they will be replaced by
Systems that will be impacted/updated: