3. Data Ingestion from the Web and Social Media

3.1 Real-Time Twitter Streams and News Feeds

  • Keyword Filtering: Monitoring terms like “#TruthByHanna,” “#Expose,” “#BreakingNews” or custom user-defined phrases.

  • Categorization: Tweets and articles are tagged with geographical, political, or thematic metadata so they can later be selected for a comedic or investigative angle.

  • Publication to Message Bus: The aggregator packages each item (tweet, article, headline) into a JSON structure containing the text content and its metadata (timestamps, authors, sentiment indicators), then posts it to the relevant topics (e.g., truth_events).
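
The filter, package, and publish steps above can be sketched roughly as follows. This is a minimal illustration, not the aggregator's actual implementation: the JSON field names, the InMemoryBus class, and the ingest helper are all hypothetical, and the in-memory bus merely stands in for whatever message bus client is really used. Only the hashtags and the truth_events topic name come from the text.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

# Watched terms from the text above, lowercased for case-insensitive matching.
KEYWORDS = ["#truthbyhanna", "#expose", "#breakingnews"]


def matched_keywords(text, keywords=KEYWORDS):
    """Return the watched terms that occur in text (simple substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


def build_message(text, author, kind, tags=None, sentiment=None):
    """Package one item as a JSON string. All field names are assumptions."""
    payload = {
        "kind": kind,  # e.g. "tweet", "article", "headline"
        "text": text,
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "sentiment": sentiment,
            "tags": tags or {},  # geographical / political / thematic labels
            "matched_keywords": matched_keywords(text),
        },
    }
    return json.dumps(payload, ensure_ascii=False)


class InMemoryBus:
    """Hypothetical stand-in for a message bus client; keeps messages per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)


def ingest(bus, text, author, kind="tweet", topic="truth_events", **extra):
    """Publish the item only if it matches at least one watched term."""
    if not matched_keywords(text):
        return False
    bus.publish(topic, build_message(text, author, kind, **extra))
    return True
```

Substring matching is the simplest possible filter and will also match longer hashtags that begin with a watched term (for example, "#exposed" matches "#expose"); a production filter would likely tokenize hashtags first.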

3.2 Global Web Crawling

  • Crawling Depth: Configurable to avoid infinite loops or irrelevant pages. The aggregator can do shallow crawls for trending headlines or deep crawls for in-depth investigative contexts.

  • Data Cleansing: HTML tags are stripped, the text is normalized and tokenized, and duplicates or purely promotional content are removed.

  • Context Embedding: Each snippet is vectorized for semantic search using a small transformer or other embedding model, so Hanna AI Workers can quickly retrieve relevant context when debating or exposing hidden truths in real time.
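
As a rough sketch of how the three crawling stages above could fit together, the following uses only the Python standard library. Everything here is an assumption made for illustration: the function names, the caller-supplied fetch(url) callback, and especially embed, which is a toy hashed bag-of-words vector standing in for the transformer or embedding model the text mentions. Promotional-content filtering is not shown.

```python
import hashlib
import math
import re
from collections import deque
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_html(html):
    """Strip tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def crawl(start_url, fetch, max_depth):
    """Breadth-first crawl; fetch(url) -> (html, links) is supplied by the caller.

    The visited set prevents loops and max_depth bounds how deep the crawl goes.
    Returns {url: cleaned_text}, dropping pages whose tokens duplicate an earlier page.
    """
    visited = {start_url}
    seen_hashes = set()
    pages = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html, links = fetch(url)
        text = clean_html(html)
        digest = hashlib.sha256(" ".join(tokenize(text)).encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            pages[url] = text
        if depth < max_depth:
            for link in links:
                if link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
    return pages


def embed(text, dim=64):
    """Toy hashed bag-of-words vector, L2-normalized. NOT a transformer model."""
    vec = [0.0] * dim
    for tok in tokenize(text):
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def search(query, pages, top_k=3):
    """Rank pages by cosine similarity to the query (vectors are unit length)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(t))), url) for url, t in pages.items()]
    return sorted(scored, reverse=True)[:top_k]
```

A shallow crawl for trending headlines would call crawl with a small max_depth such as 1, while a deep investigative crawl would raise it. Swapping embed for a real embedding model leaves crawl and search unchanged, provided the model returns unit-length vectors (or search is adjusted to normalize them).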
