6.0 Dev Cycle Outline

Outline of Development Items (Click on the 'Asana' items to see the task in Asana)

1. Improve the shingler system based on the de-duplication work that Tom and Loretta did for the project.

Research whether more duplicates can be identified without compromising processing efficiency. If so, we will try to implement the solution. (Asana)
Develop a stand-alone article de-duplication tool that Cline Center staff can use on a specified set of articles (Asana)
Who it impacts: Analyst Ops, Cyber-infrastructure, Analytics and Data Management
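
For reference, the general idea behind shingle-based duplicate detection can be sketched as follows. This is a minimal illustration, not the existing shingler: the function names, the 3-word shingle size, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words yield one short shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, k=3, threshold=0.8):
    """True when shingle overlap meets the threshold (an illustrative value)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

A stand-alone tool could apply a comparison like this to every pair in a specified article set, although pairwise comparison grows quadratically, so a large set would likely need hashing or indexing on top of it.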

2. Identify a way to store duplicates that takes up less disk space. (Asana)

Articles that are exact duplicates will have their content stripped.
Content from articles that are ‘near-duplicates’ will be retained and stored in the MongoDB dedicated to duplicate articles.
Feature request: Is there a way we can make this information useful? (Maybe count the number of ‘duplicates’ and ‘near-duplicates’ that are associated with an article.)
Who it impacts: Cyber-infrastructure

3. Lodestar bug fixes (Asana) (Already addressed)

Fix login issue (or lack of login ability) from GNC 5.1.1 (Asana)
Fix the stats reporting issue from GNC 5.1.1 (Asana)
Schedule an early deployment
Who it impacts: Cyber-infrastructure and Analyst Ops

4. Clean up the Gradle builds (Asana) (This has already been addressed)

Who it impacts: Cyber-infrastructure

5. Addressing GNI Features and Bugs

Fix issues with the GNI

Fix the issues identified with cleaning the text of content (Asana)

Better organize and present Bulk LexisNexis data (Asana).

Remove source_name “LexisNexis” (Asana)

Resolve issues with the processing of new BLN data

Resolve issues with empty values

Empty web content (Asana)
Url_host blank
Missing URL values for web (Asana)
Missing title values

Add new features to GNI

Add a feature to add new BLN sources (Asana)

Add word count (or single word token counts) variables for Content and Title (Asana)

Document-term matrices as an output? (Asana)

Question: Do we want the DTM to be a single token or multiple tokens?
User-defined? With limits?
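
To make the single-token versus multi-token question concrete, here is a minimal sketch of a document-term matrix whose terms are user-defined n-grams with an upper limit. The function names and the MAX_NGRAM limit of 3 are assumptions for illustration only, not a proposed GNI interface.

```python
import re
from collections import Counter

MAX_NGRAM = 3  # hypothetical cap on user-defined n-gram length


def ngrams(tokens, n):
    """Return the n-grams of a token list, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, min_n=1, max_n=1):
    """Return (vocab, matrix): sorted terms and one row of counts per document."""
    if not 1 <= min_n <= max_n <= MAX_NGRAM:
        raise ValueError("n-gram range must satisfy 1 <= min_n <= max_n <= MAX_NGRAM")
    counts = []
    for doc in docs:
        tokens = re.findall(r"\w+", doc.lower())
        row = Counter()
        for n in range(min_n, max_n + 1):
            row.update(ngrams(tokens, n))
        counts.append(row)
    vocab = sorted(set().union(*counts))
    matrix = [[row[term] for term in vocab] for row in counts]
    return vocab, matrix
```

With max_n=1 the terms are single tokens; raising max_n adds multi-token terms, which grows the vocabulary quickly and is the main reason a limit may be worth having.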

Add text complexity variables? If so, what? (Asana)

Lexical Richness (aka Type Token Ratio): example: https://pypi.org/project/lexicalrichness/
Hapax Richness (# Words Occurring Once / # of Total Words)
Flesch reading ease score: 206.835 - 1.015 × (total words ÷ total sentences) - 84.6 × (total syllables ÷ total words)
Flesch-Kincaid grade level: 0.39 × (total words ÷ total sentences) + 11.8 × (total syllables ÷ total words) - 15.59
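
The four measures above can be sketched in a few lines. This is a rough illustration only: the syllable counter is a simple vowel-group heuristic (an assumption, not a dictionary-based count), sentences are split on ., !, and ?, and the function and key names are hypothetical.

```python
import re
from collections import Counter


def count_syllables(word):
    """Heuristic: count vowel groups, drop a trailing silent 'e', minimum of 1."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def text_complexity(text):
    """Compute the text complexity measures listed above for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    freq = Counter(words)
    total = len(words)
    words_per_sentence = total / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / total
    return {
        # Unique words / total words
        "type_token_ratio": len(freq) / total,
        # Words occurring once / total words
        "hapax_richness": sum(1 for c in freq.values() if c == 1) / total,
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word - 15.59,
    }
```

Because both Flesch scores depend on the syllable count, a production version would probably want a better syllable source than this heuristic; very short texts can also produce scores outside the usual ranges.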

Update Moral Foundation sentiment using Diesner’s ‘Expanded Moral Foundations’ dictionary (Asana): https://dl.acm.org/doi/abs/10.1145/3465336.3475112

Extracted Quotes and speaker variables (assuming we use Stanford CoreNLP?) - Nice to have but not required. (Asana)

Maybe save the speaker of the quote.
Save the parse tree and store it in a separate MongoDB.

Web

Explore adding new sources to the web crawler. If we can add them, then add them. (Asana)

Closed captions from news sources?
Pull articles from fact-checking sites

Add a new variable (or maybe use the ‘publisher’ field) to denote the source of web content instead of using simply the URL field. (Asana)

Improve the performance of the SPEED Classifier. (Not required; we probably need to focus on this in the next dev cycle.) (Asana)

Add new content: (Asana)

Eastview FBIS
Newsreel data

Add URLs for BulkLexisNexis_v2 sources (for users who do not have access to the content) (Asana)

Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data

7. Transition from Illinois NER to Stanford CoreNLP or some other alternative (e.g., spaCy). (Asana)

Question: Before we transition, what kind of license does CoreNLP have? Can we commercialize it? Answer: The license is the GNU General Public License v3, so it is OK to proceed.
Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data

8. Improve Geocoder based on findings from Event Detection work that Buddy and Loretta are working on (Asana)

Update the Geonames database that the geocoder uses (done)

Implement pattern-finding approaches to identify geolocations

Improve the list of potential geolocations that the geocoder provides for a location entity that has been identified in a text

Remove low-probability locations using a standard-deviation cutoff.

Create a new field with the top choice (based on score) for a predicted location.

Assess how well the geocoder performs using the Universal Gold Standard event data and compared with other available off-the-shelf geocoders

Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data
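
As one possible reading of the standard-deviation filter and the top-choice field in item 8, here is a minimal sketch. It assumes each candidate location is a record with a numeric score; the field names (candidates, score, top_choice) and the one-standard-deviation cutoff are hypothetical and not taken from the geocoder.

```python
from statistics import mean, pstdev


def filter_candidates(candidates, num_std=1.0):
    """Drop candidates scoring more than num_std standard deviations below the mean."""
    if not candidates:
        return []
    scores = [c["score"] for c in candidates]
    cutoff = mean(scores) - num_std * pstdev(scores)
    return [c for c in candidates if c["score"] >= cutoff]


def annotate_entity(entity, num_std=1.0):
    """Return a copy of the entity with filtered candidates and a top_choice field."""
    kept = filter_candidates(entity["candidates"], num_std)
    top = max(kept, key=lambda c: c["score"], default=None)
    return {**entity, "candidates": kept, "top_choice": top}
```

For example, candidates scored 0.9, 0.8, and 0.1 have a mean of 0.6 and a population standard deviation of about 0.36, so the 0.1 candidate falls below the cutoff and the 0.9 candidate becomes the top choice. Whether the cutoff should be one standard deviation or some other value is an open tuning question.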

9. Archer Updates

Migrate Archer UI to new Vaadin (Asana)

Archer UI Updates

Feature Requests

Required

Incorporate Google Analytics (Asana)

Not required but would be nice to have (Asana)

Implementing Shibboleth authentication during login?
Edit Snapshot names
Ability to plot average sentiment over time
Ability to plot multiple items on a timeline
Ability to show Solr’s relevance score in the dashboard and enable a user to download it and sort with it
Ability to view, plot, and download word counts for Content and Title
Use a different delimiter for delimiting values within the Dashboard

Bug fixes

Can only download the results from the first query I download data from (Asana)
Add content_length and title_length to items that users can view in the dashboard, download, and plot. Can we plot these two measures as well? (Asana)

Archer API Features (Asana) (Already addressed)

Term Queries
Stats Queries
Debug Query Feature
GroupBy Query

Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data

10. MongoDB Items

Add authentication for MongoDB backend access (Asana)
Evaluate Mongo Read performance and try to improve on it (Asana)
Who it impacts: Cyber-infrastructure

11. Storm Cluster items

Move the Storm UI and command-line tools off the cluster machines and onto dedicated virtual machines. (Nice to have if possible) (Asana)
Upgrade the Apache Storm System to 2.2.0 (Asana)
Who it impacts: Cyber-infrastructure

12. Retire ‘production’ VM (Asana)

Resolve CCP portal. We need to do one of the following:

Migrate to new VM
Hand off the portal to CCP

Figure out where to store Presidential Campaigns Data

Migrate the MySQL database that is on this VM

Who it impacts: Cyber-infrastructure, CCP, Scott Althaus, Analytics and Data Management

Expected Deployment Date: May 2023

VMs that will be retired and what they will be replaced by

Staging-5-prod → Staging-6-prod
Solr-5-prod → Solr-6-prod
Nexus-5-prod → Nexus-6-prod
Mysql-5-prod → Mysql-6-prod
Mongo-5-prod → Mongo-6-prod
Gis-5-prod → Gis-6-prod
Duplicates-5-prod → Duplicates-6-prod
Production → TBD if other VMs will need to be created

Systems that will be impacted/updated:

Scout
Lodestar (again despite the recent update)
Archer
Archer API
Storm