3. Data Ingestion from the Web and Social Media
3.1 Real-Time Twitter Streams and News Feeds
Keyword Filtering: Monitoring terms like “#TruthByHanna,” “#Expose,” “#BreakingNews” or custom user-defined phrases.
Categorization: Tweets and articles are tagged with geographical, political, or thematic data to support comedic or investigative angles.
Publication to Message Bus: The aggregator packages each item (tweet, article, or headline) into a JSON structure containing the text content and metadata (timestamps, authors, sentiment indicators), then posts it to the relevant topics (e.g., truth_events).
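As a rough illustration of the filtering and publication steps, the sketch below matches items against a keyword list and packages the matches as JSON. The function names (matched_keywords, package_item, publish), the field layout of the message, and the in-memory dictionary standing in for the message bus are all assumptions made for this example; they are not the aggregator's actual schema or client.

```python
import json
from datetime import datetime, timezone

# Watch list taken from the examples in the text; user-defined phrases
# would be appended to it.
KEYWORDS = ["#TruthByHanna", "#Expose", "#BreakingNews"]


def matched_keywords(text, keywords=KEYWORDS):
    """Return the watched terms that occur in text, ignoring case.

    This is plain substring matching, so "#Expose" also matches "#Exposed".
    """
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def package_item(text, author, kind="tweet", tags=None, sentiment=None):
    """Build the JSON message for one item, or return None if nothing matches.

    The field names here are hypothetical, chosen to mirror the metadata
    listed in the text (timestamps, authors, sentiment indicators, tags).
    """
    hits = matched_keywords(text)
    if not hits:
        return None
    message = {
        "kind": kind,
        "text": text,
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "sentiment": sentiment,
            "tags": tags or {},
            "matched_keywords": hits,
        },
    }
    return json.dumps(message, ensure_ascii=False)


def publish(topic, payload, bus):
    """Stand-in for a message-bus client: append payload to an in-memory topic."""
    bus.setdefault(topic, []).append(payload)


if __name__ == "__main__":
    bus = {}
    payload = package_item(
        "Big story #Expose today",
        author="example_user",
        tags={"theme": "politics"},
    )
    if payload is not None:
        publish("truth_events", payload, bus)
    print(bus)
```

In a real deployment, publish would be replaced by the producer call of whichever message bus is in use; the dictionary only keeps the example self-contained.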
3.2 Global Web Crawling
Crawling Depth: Configurable to avoid infinite loops or irrelevant pages. The aggregator can do shallow crawls for trending headlines or deep crawls for in-depth investigative contexts.
Data Cleansing: HTML tags are stripped, the text is normalized and tokenized, and duplicates and purely promotional content are removed.
Context Embedding: Using a small transformer or other embedding model, each snippet is vectorized for semantic search. Hanna AI Workers can then quickly retrieve relevant context when debating or exposing hidden truths in real time.
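The crawl-depth and cleansing steps above could be sketched roughly as follows, using only the Python standard library. The names clean_html, tokenize, and crawl are hypothetical, and the fetch and links callables are supplied by the caller, so the example runs against in-memory pages rather than the live web. Promotional-content filtering and the embedding step are left out because they depend on models and rules the text does not specify.

```python
import hashlib
import re
from collections import deque
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_html(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def crawl(start, fetch, links, max_depth=1):
    """Breadth-first crawl that follows at most max_depth links from start.

    fetch(url) returns HTML and links(url) returns outgoing URLs; both are
    assumed to be provided by the caller. The set of seen URLs prevents
    infinite loops, and a hash of the token sequence drops pages whose
    normalized text duplicates one already collected.
    """
    seen_urls = {start}
    seen_hashes = set()
    queue = deque([(start, 0)])
    results = []
    while queue:
        url, depth = queue.popleft()
        text = clean_html(fetch(url))
        normalized = " ".join(tokenize(text))
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            results.append((url, text))
        if depth < max_depth:
            for nxt in links(url):
                if nxt not in seen_urls:
                    seen_urls.add(nxt)
                    queue.append((nxt, depth + 1))
    return results
```

A shallow crawl for trending headlines would call crawl with a small max_depth such as 0 or 1, while a deeper investigative crawl would raise it; the cleansing logic is the same in both cases.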