6.0 Dev Cycle Outline

Outline of Development Items (Click on the 'Asana' items to see the task in Asana)


1. Improve the shingler system based on the de-duplication work that Tom and Loretta did for the project.

2. Identify a solution to store duplicates without taking up so much disk space. (Asana)

3. Lodestar bug fixes (Asana) (This has already been addressed)

4. Clean up the Gradle builds (Asana) (This has already been addressed)

5. Addressing GNI Features and Bugs

Fix issues with the GNI

Fix issues identified to clean text of content (Asana)

Better organize and present Bulk LexisNexis data (Asana).

Remove source_name “LexisNexis” (Asana)

Resolve issues with the processing of new BLN data

Resolve issues with empty values

Add new features to GNI

Add a feature to add new BLN sources (Asana)

Add word count (or single word token counts) variables for Content and Title (Asana)

Document-term matrices as an output? (Asana)

Add text complexity variables?  If so, what? (Asana)

Update Moral Foundation sentiment using Diesner’s ‘Expanded Moral Foundations’ dictionary (Asana): https://dl.acm.org/doi/abs/10.1145/3465336.3475112

Extracted Quotes and speaker variables (assuming we use Stanford CoreNLP?) - Nice to have but not required. (Asana)

Web

Explore adding new sources to the web crawler.  If we can add them, then add them. (Asana)

Add a new variable (or maybe use the ‘publisher’ field) to denote the source of web content instead of using simply the URL field. (Asana)

Improve the performance of the SPEED Classifier (not required; we will probably need to focus on this in the next dev cycle). (Asana)

Add new content:  (Asana)

Add URLs for BulkLexisNexis_v2 sources (for users who do not have access to the content) (Asana)

Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data

7. Transition from Illinois NER to Stanford CoreNLP or some other alternative (e.g., spaCy). (Asana)

8. Improve Geocoder based on findings from Event Detection work that Buddy and Loretta are working on (Asana)

Update the Geonames database that the geocoder uses (done)

Implement pattern-finding approaches to identify geolocations

Improve the list of potential geolocations that the geocoder provides for a location entity that has been identified in a text

Create a new field with the top choice (based on score) for a predicted location.

Assess how well the geocoder performs using the Universal Gold Standard event data and against other available off-the-shelf geocoders

Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data

9. Archer Updates

Migrate Archer UI to new Vaadin (Asana)

Archer UI Updates

Feature Requests

Required

Not required but would be nice to have (Asana)

Bug fixes

Archer API Features (Asana) (This has already been addressed)

Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data

10. MongoDB Items

11. Storm Cluster items

12. Retire ‘production’ VM (Asana)

Resolve CCP portal.  We need to do one of the following:

Figure out where to store Presidential Campaigns Data

Migrate the MySQL database that is on this VM

Who it impacts: Cyber-infrastructure, CCP, Scott Althaus, Analytics and Data Management


Expected Deployment Date: May 2023


VMs that will be retired and what they will be replaced by

  1. Staging-5-prod → Staging-6-prod
  2. Solr-5-prod → Solr-6-prod 
  3. Nexus-5-prod → Nexus-6-prod
  4. Mysql-5-prod → Mysql-6-prod 
  5. Mongo-5-prod → Mongo-6-prod 
  6. Gis-5-prod → Gis-6-prod
  7. Duplicates-5-prod → Duplicates-6-prod
  8. Production → TBD (whether other VMs will need to be created)


Systems that will be impacted/updated:

  1. Scout
  2. Lodestar (again, despite the recent update)
  3. Archer
  4. Archer API
  5. Storm