6.0 Dev Cycle Outline
Outline of Development Items (Click on the 'Asana' items to see the task in Asana)
1. Improve the shingler system based on the de-duplication work that Tom and Loretta did for the project.
- Research whether more duplicates can be identified without compromising processing efficiency. If a solution exists, we will try to implement it. (Asana)
- Develop a stand-alone article de-duplication tool that Cline Center staff can use on a specified set of articles (Asana)
- Who it impacts: Analyst Ops, Cyber-infrastructure, Analytics and Data Management
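As a starting point for the stand-alone de-duplication tool, the general idea behind shingle-based comparison can be sketched as below. This is a minimal illustration of word shingling with Jaccard similarity, not the existing shingler; the function names, the shingle size `k`, and the `threshold` value are all assumptions chosen for the example.

```python
import re
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles in a text (k is a hypothetical default)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        # Texts shorter than k words yield a single shingle, or none if empty.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty shingle sets are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str, k: int = 3, threshold: float = 0.8) -> bool:
    """Flag a pair as near-duplicates when shingle overlap reaches the threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Comparing every pair this way is quadratic in the number of articles, so a production version would likely need hashing or indexing on top of it; that trade-off is the efficiency question raised in the research item.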
2. Identify a solution to store duplicates without taking up so much disk space. (Asana)
- Articles that are exact duplicates will have their content stripped.
- Content from articles that are ‘near-duplicates’ is retained and stored in the MongoDB dedicated for duplicate articles.
- Feature request: Is there a way we can make this information useful? (Maybe count the number of ‘duplicates’ and ‘near-duplicates’ that are associated with an article.)
- Who it impacts: Cyber-infrastructure
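One possible way to strip exact duplicates while still counting them is sketched below. It is only an illustration under assumptions: the record fields (`id`, `content`, `duplicate_of`), the whitespace normalization, and the use of SHA-256 are hypothetical choices, and no MongoDB calls are shown.

```python
import hashlib
from collections import defaultdict


def content_hash(text: str) -> str:
    """Hash the content after collapsing whitespace (an assumed normalization)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def strip_exact_duplicates(articles):
    """Keep content for the first copy of each article and strip it from exact repeats.

    `articles` is a list of dicts with hypothetical keys 'id' and 'content'.
    Returns (records, counts), where counts maps an original article id to the
    number of exact duplicates found for it.
    """
    first_seen = {}            # content hash -> id of the first article seen
    counts = defaultdict(int)  # original id -> number of exact duplicates
    records = []
    for art in articles:
        h = content_hash(art["content"])
        if h in first_seen:
            original_id = first_seen[h]
            counts[original_id] += 1
            records.append({"id": art["id"], "content": None, "duplicate_of": original_id})
        else:
            first_seen[h] = art["id"]
            records.append({"id": art["id"], "content": art["content"], "duplicate_of": None})
    return records, dict(counts)
```

The per-article counts returned here are one candidate answer to the feature request above; near-duplicate counts would need a similarity measure rather than a hash.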
3. Lodestar bug fixes (Asana) (Already addressed)
- Fix login issue (or lack of login ability) from GNC 5.1.1 (Asana)
- Fix the stats reporting issue from GNC 5.1.1 (Asana)
- Schedule an early deployment
- Who it impacts: Cyber-infrastructure and Analyst Ops
4. Clean up the Gradle builds (Asana) (This has already been addressed)
- Who it impacts: Cyber-infrastructure
5. Addressing GNI Features and Bugs
Fix issues with the GNI
Fix issues identified to clean text of content (Asana)
Better organize and present Bulk LexisNexis data (Asana).
Remove source_name “LexisNexis” (Asana)
Resolve issues with the processing of new BLN data
Resolve issues with empty values
Add new features to GNI
Add a feature to add new BLN sources (Asana)
Add word count (or single word token counts) variables for Content and Title (Asana)
Document-term matrices as an output? (Asana)
- Question: Do we want the DTM to be a single token or multiple tokens?
- User-defined? With limits?
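To make the single-token versus multi-token question concrete, here is a small sketch of word counts and a document-term matrix with a user-defined n-gram range. The tokenizer, the function names, and the `MAX_NGRAM` cap are assumptions for illustration; the actual limits are still an open question.

```python
import re
from collections import Counter

MAX_NGRAM = 3  # hypothetical upper limit on user-defined n-gram size


def tokenize(text):
    """Lowercase and split into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_count(text):
    """Number of single-word tokens in a text."""
    return len(tokenize(text))


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, min_n=1, max_n=1):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    if not 1 <= min_n <= max_n <= MAX_NGRAM:
        raise ValueError(f"n-gram range must satisfy 1 <= min_n <= max_n <= {MAX_NGRAM}")
    counts = []
    for doc in docs:
        tokens = tokenize(doc)
        doc_counts = Counter()
        for n in range(min_n, max_n + 1):
            doc_counts.update(ngrams(tokens, n))
        counts.append(doc_counts)
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix
```

With the defaults this produces a single-token matrix; passing `min_n=1, max_n=2` would add bigram columns, which grows the vocabulary quickly and is one reason a cap may be needed.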
Add text complexity variables? If so, what? (Asana)
- Lexical Richness (aka Type-Token Ratio): example: https://pypi.org/project/lexicalrichness/
- Hapax Richness (# Words Occurring Once / # of Total Words)
- Flesch reading ease score: 206.835 - 1.015 × (total words ÷ total sentences) - 84.6 × (total syllables ÷ total words)
- Flesch-Kincaid grade level: 0.39 × (total words ÷ total sentences) + 11.8 × (total syllables ÷ total words) - 15.59
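The four candidate measures above could be computed roughly as follows. This is a sketch under assumptions: the sentence splitter and word pattern are simplistic, and `count_syllables` is a crude vowel-group heuristic introduced only for illustration, so the Flesch scores it yields are approximate.

```python
import re
from collections import Counter


def count_syllables(word):
    """Rough heuristic (assumption): count vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def text_complexity(text):
    """Return type-token ratio, hapax richness, and both Flesch measures for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    freq = Counter(w.lower() for w in words)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        # unique words / total words
        "type_token_ratio": len(freq) / len(words),
        # words occurring once / total words
        "hapax_richness": sum(1 for c in freq.values() if c == 1) / len(words),
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }
```

Type-token ratio and hapax richness both fall as texts get longer, so values for long and short articles are not directly comparable without some length normalization.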
Update Moral Foundation sentiment using Diesner’s ‘Expanded Moral Foundations’ dictionary (Asana): https://dl.acm.org/doi/abs/10.1145/3465336.3475112
Extracted Quotes and speaker variables (assuming we use Stanford CoreNLP?) - Nice to have but not required. (Asana)
- Maybe save the speaker of the quote.
- Save the parse tree and store it in a separate mongo.
Web
Explore adding new sources to the web crawler. If we can add them, then add them. (Asana)
- Closed captions from news sources?
- Pull articles from fact-checking sites
Add a new variable (or maybe use the ‘publisher’ field) to denote the source of web content instead of using simply the URL field. (Asana)
Improve the performance of the SPEED Classifier (Not required. Probably need to focus on this in the next dev cycle) (Asana)
Add new content: (Asana)
- Eastview FBIS
- Newsreel data
Add URLs for BulkLexisNexis_v2 sources (for users who do not have access to the content) (Asana)
Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data
7. Transition from Illinois NER over to Stanford CoreNLP or some other alternative (e.g., spaCy). (Asana)
- Question: Before we transition, what kind of license does CoreNLP have? Can we commercialize it? Answer: The license is the GNU General Public License v3, so it is OK to proceed.
- Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data
8. Improve Geocoder based on findings from Event Detection work that Buddy and Loretta are working on (Asana)
Update the Geonames database that the geocoder uses (done)
Implement pattern-finding approaches to identify geolocations
Improve the list of potential geolocations that the geocoder provides for a location entity that has been identified in a text
- Remove low probability locations using a standard deviation.
Create a new field with the top choice (based on score) for a predicted location.
Assess how well the geocoder performs using the Universal Gold Standard event data and against other available off-the-shelf geocoders
Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data
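The standard-deviation filter and the top-choice field from the geocoder item could work along these lines. The exact rule has not been decided, so this sketch assumes one plausible reading: drop candidates scoring below the mean minus `k` standard deviations. The function names, the `(name, score)` tuple shape, and the default `k` are hypothetical.

```python
from statistics import mean, pstdev


def filter_candidates(candidates, k=1.0):
    """Keep candidates scoring at least (mean - k * std), sorted best first.

    `candidates` is a list of (location_name, score) tuples (assumed shape).
    """
    if not candidates:
        return []
    scores = [score for _, score in candidates]
    cutoff = mean(scores) - k * pstdev(scores)
    kept = [c for c in candidates if c[1] >= cutoff]
    return sorted(kept, key=lambda c: c[1], reverse=True)


def top_choice(candidates):
    """Return the name of the highest-scoring candidate, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]
```

For example, with scores of 0.9, 0.8, and 0.1 and `k=1.0`, the 0.1 candidate falls below the cutoff and is removed. A smaller `k` prunes more aggressively, so the value would need tuning against the gold standard data.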
9. Archer Updates
Migrate Archer UI to new Vaadin (Asana)
Archer UI Updates
Feature Requests
Required
- Incorporate Google Analytics (Asana)
Not required but would be nice to have (Asana)
- Implementing Shibboleth authentication during login?
- Edit Snapshot names
- Ability to plot average sentiment over time
- Ability to plot multiple items on a timeline
- Ability to show Solr’s relevance score in the dashboard and enable a user to download it and sort with it
- Ability to view, plot, and download word counts for Content and Title
- Use a different delimiter for delimiting values within the Dashboard
Bug fixes
- Can only download the results from the first query I download data from (Asana)
- Add content_length and title_length to items that users can view in the dashboard, download, and plot. Can we plot these two measures as well? (Asana)
Archer API Features (Asana) (Already addressed)
- Term Queries
- Stats Queries
- Debug Query Feature
- GroupBy Query
Who it impacts: Cyber-infrastructure, Analyst Ops, Analytics and Data Management, External Engagement, Users of Cline Center GNC data
10. MongoDB Items
- Add authentication for MongoDB backend access (Asana)
- Evaluate Mongo Read performance and try to improve on it (Asana)
- Who it impacts: Cyber-infrastructure
11. Storm Cluster items
- Move Storm UI and command-line off of the cluster machines and move them to dedicated virtual machines. (Nice to have if possible) (Asana)
- Upgrade the Apache Storm System to 2.2.0 (Asana)
- Who it impacts: Cyber-infrastructure
12. Retire ‘production’ VM (Asana)
Resolve CCP portal. We need to do one of the following:
- Migrate to new VM
- Hand off the portal to CCP
Figure out where to store Presidential Campaigns Data
Migrate the MySQL database that is on this VM
Who it impacts: Cyber-infrastructure, CCP, Scott Althaus, Analytics and Data Management
Expected Deployment Date: May 2023
VMs that will be retired and what they will be replaced by
- Staging-5-prod → Staging-6-prod
- Solr-5-prod → Solr-6-prod
- Nexus-5-prod → Nexus-6-prod
- Mysql-5-prod → Mysql-6-prod
- Mongo-5-prod → Mongo-6-prod
- Gis-5-prod → Gis-6-prod
- Duplicates-5-prod → Duplicates-6-prod
- Production → TBD if other VMs will need to be created
Systems that will be impacted/updated:
- Scout
- Lodestar (again despite the recent update)
- Archer
- Archer API
- Storm