3. Data Ingestion from the Web and Social Media

3.1 Real-Time Twitter Streams and News Feeds

Keyword Filtering: The aggregator monitors terms such as “#TruthByHanna”, “#Expose”, and “#BreakingNews”, as well as custom user-defined phrases.
Categorization: Tweets and articles are tagged with geographical, political, or thematic data so they can later be used for comedic or investigative angles.
Publication to Message Bus: The aggregator packages each item (tweet, article, headline) into a JSON structure containing the text content and its metadata (timestamps, authors, sentiment indicators), then posts it to the relevant topics (e.g., truth_events).
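
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the aggregator's actual code: the function names, the payload field names, and the in-memory `bus` dictionary standing in for a real message-bus client are all assumptions made for the example. Only the sample hashtags and the truth_events topic come from the text.

```python
import json
from datetime import datetime, timezone

# Watch list; the hashtags are the examples named in the text, lowercased.
KEYWORDS = ["#truthbyhanna", "#expose", "#breakingnews"]

def matches_keywords(text, keywords=KEYWORDS):
    """Return the watched keywords found in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]

def build_message(text, author, kind, tags=None, sentiment=None):
    """Package one item as a JSON string, or return None if no keyword matches."""
    hits = matches_keywords(text)
    if not hits:
        return None
    payload = {
        "kind": kind,  # e.g. "tweet", "article", "headline"
        "text": text,
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "sentiment": sentiment,      # hypothetical field, filled upstream
            "matched_keywords": hits,
            "tags": tags or {},          # geographical, political, thematic
        },
    }
    return json.dumps(payload)

def publish(topic, message, bus):
    """Stand-in for a message-bus client: append to an in-memory dict of lists."""
    bus.setdefault(topic, []).append(message)
```

A caller would build a message, skip it when `build_message` returns `None`, and otherwise call `publish("truth_events", message, bus)`. In a real deployment `publish` would wrap whatever bus client the system uses.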

Crawling Depth: The crawl depth is configurable to avoid infinite loops and irrelevant pages. The aggregator can run shallow crawls for trending headlines or deep crawls for in-depth investigative context.
Data Cleansing: HTML tags are stripped, the text is normalized and tokenized, and duplicates and purely promotional content are removed.
Context Embedding: Each snippet is vectorized with a small transformer or other embedding model to support semantic search, so that Hanna AI Workers can quickly retrieve relevant context in real time when debating or exposing hidden truths.
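
As an illustration of the crawling-depth and cleansing steps, the sketch below runs a breadth-first crawl with a depth limit, strips markup with the standard-library `html.parser`, and drops pages whose cleaned text has already been seen. It makes several assumptions: `fetch` is a hypothetical callable that returns a page's HTML (or `None`), tokenization is plain whitespace splitting, and duplicate detection is an exact hash match. Promotional-content filtering and the embedding step are left out because the text does not say how they are done.

```python
import hashlib
import re
from collections import deque
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes and href links while ignoring the markup itself."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Note: this also keeps script/style text; a fuller cleaner would skip it.
        self.parts.append(data)

def clean_html(html):
    """Strip tags, collapse whitespace, and lowercase; return (text, links)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip().lower()
    return text, parser.links

def crawl(start_url, fetch, max_depth):
    """Breadth-first crawl up to max_depth, skipping revisits and duplicate text."""
    seen_urls = {start_url}   # prevents infinite loops on cyclic links
    seen_hashes = set()       # prevents storing the same text twice
    results = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        text, links = clean_html(html)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if text and digest not in seen_hashes:
            seen_hashes.add(digest)
            results.append({"url": url, "tokens": text.split()})
        if depth < max_depth:  # only follow links while under the limit
            for link in links:
                if link not in seen_urls:
                    seen_urls.add(link)
                    queue.append((link, depth + 1))
    return results
```

With `max_depth=0` only the start page is processed, which corresponds to a shallow crawl; larger values follow links further out. Each entry in the returned list could then be passed to an embedding model for the context-embedding step.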
