🐾 claw-stack

Last updated: March 2026 · Source: info-pipeline

Live Intelligence Feed

A Python pipeline that pulls AI and tech content from 7 platforms, applies keyword filtering and relevance scoring, deduplicates results, and produces a unified JSON/Markdown report for downstream agent consumption.


What

info-pipeline aggregates AI and tech content from 7 heterogeneous platforms into a single scored, deduplicated feed. Each platform has a dedicated collector that normalizes its output to a common schema. A filter/scorer stage applies global keyword matching and relevance scoring (0–100) before the report is written. The output is machine-readable JSON plus a human-readable Markdown report.

Why

Staying current with AI developments requires monitoring many heterogeneous sources simultaneously: GitHub for new repositories, Hacker News and Reddit for community discussion, YouTube for research explanations, Product Hunt for new tools, Twitter/X for early signals, and Chinese platforms for developments that English-language sources miss. Doing this manually across 7 platforms is time-consuming and inconsistent.

Existing aggregators (RSS readers, Feedly, etc.) produce human-readable feeds but not machine-readable structured data. An AI research agent needs a scored, deduplicated, normalized feed it can query and reason about, not a list of links to read. This pipeline provides that feed in a format designed for agent consumption.

Architecture

The pipeline has four stages:

```text
[Collectors - run in parallel]
  → GitHub Trending, Hacker News, Reddit, YouTube,
    Product Hunt, X/Twitter, Chinese platforms (via MCP)
  → each normalizes to the unified schema

[Filter / Scorer]
  → keyword matching against global keyword list
  → relevance score 0–100 (keyword density + platform signal)
  → deduplication by URL, then title similarity

[Report Writer]
  → unified JSON (all items)
  → Markdown report (top N items)
```

Data sources:

| Platform | Language | Notes |
|---|---|---|
| GitHub Trending | EN | Topic-filtered repos by stars/forks in the last N days |
| Hacker News | EN | Top Stories, filtered by minimum score |
| Reddit | EN | Multiple AI subreddits; no API key required |
| YouTube | EN | Configured channel playlists via Data API v3 |
| Product Hunt | EN | Daily new products via GraphQL API |
| X / Twitter | EN | Keyword search; requires Basic API tier (low-volume) |
| Chinese platforms | ZH | Zhihu, 36kr, Juejin, Sspai, InfoQ, Bilibili via trends-hub MCP |

Code structure:

| Component | Responsibility |
|---|---|
| `collectors/base.py` | `BaseCollector` with common fetch, retry, and rate-limit logic |
| `collectors/*.py` | One file per platform; each extends `BaseCollector` |
| `collectors/__init__.py` | `ALL_COLLECTORS` registry; registers all enabled collectors |
| `filters/scorer.py` | Keyword filtering, relevance scoring (0–100), deduplication |
| `main.py` | Entry point; orchestrates collector runs and report generation |

Every item across all sources shares the same JSON schema:

```json
{
  "title": "Article / project title",
  "url": "https://...",
  "source": "github",
  "score": 85,
  "published_at": "2026-02-18T10:00:00Z",
  "summary": "Short description or excerpt",
  "tags": ["llm", "open-source"]
}
```

Key Design Decisions

Unified output schema: normalize at the source

Every collector's job is to normalize its platform's output to the common schema before it leaves the collector. The scorer and report writer deal only with the common schema and have no platform-specific logic. This makes adding a new source a matter of implementing one new class with a single fetch() method.
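As a hypothetical sketch of that pattern (the class and field names below are illustrative, not the repo's actual API), a new source only has to subclass the base collector and normalize inside `fetch()`:

```python
class BaseCollector:
    """Minimal stand-in for the shared base class."""
    name = "base"

    def fetch(self):
        """Return a list of items in the unified schema."""
        raise NotImplementedError

class LobstersCollector(BaseCollector):
    """Hypothetical new source: maps raw API dicts to the common schema."""
    name = "lobsters"

    def fetch(self):
        # Stand-in for a real HTTP call returning platform-specific JSON.
        raw = [{"headline": "New LLM runtime", "link": "https://example.com/a",
                "blurb": "A fast inference runtime", "labels": ["llm"]}]
        # Normalize to the common schema before anything leaves the collector.
        return [{"title": r["headline"], "url": r["link"],
                 "source": self.name, "score": 0, "published_at": "",
                 "summary": r["blurb"], "tags": r["labels"]} for r in raw]
```

Because the normalization happens here, nothing downstream ever sees the platform's raw field names.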

Config-driven, not code-driven

Keywords, platform parameters, and source selection all live in a YAML config file. Tuning the feed doesn't require code changes. This matters when iterating quickly on which topics to follow or adjusting platform-specific thresholds like minimum HN score.
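A config along these lines would support that workflow; the keys below are an illustrative shape only, not the pipeline's actual file:

```yaml
# Illustrative structure; real key names may differ.
keywords: [llm, agents, inference, rag]
sources:
  hackernews:
    enabled: true
    min_score: 50        # platform-specific threshold
  github:
    enabled: true
    days: 7
  twitter:
    enabled: false       # disable without touching code
report:
  top_n: 25
```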

Graceful degradation on missing API keys

A collector with no configured API key is skipped with a warning rather than treated as an error. The pipeline still produces output from the sources that are configured, so it can run partially (e.g. GitHub + Hacker News only) without requiring credentials for every source.
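The skip-with-warning behavior might look like the following sketch; the function and environment-variable names are made up for illustration and sources absent from the map (like Hacker News) need no key:

```python
import logging

# Hypothetical mapping of sources to the credential each one needs.
REQUIRED_KEYS = {"youtube": "YOUTUBE_API_KEY", "twitter": "TWITTER_BEARER_TOKEN"}

def runnable_sources(enabled, env):
    """Return the sources whose credentials exist; warn and skip the rest."""
    ready = []
    for name in enabled:
        var = REQUIRED_KEYS.get(name)   # None -> no key needed (e.g. HN, Reddit)
        if var and not env.get(var):
            logging.warning("skipping %s: %s not set", name, var)
            continue
        ready.append(name)
    return ready
```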

MCP for Chinese platforms: avoid direct scraping

Chinese tech platforms (Zhihu, 36kr, Juejin) have complex anti-bot measures and no official English-language APIs. The trends-hub MCP service handles these sources separately; the pipeline calls it as a tool rather than implementing platform-specific scrapers that would need constant maintenance.

How to Build Your Own

1. Define the unified schema first

Before writing any collector, define the output schema all collectors must produce. The minimum useful fields are: title, url, source, score, published_at, summary. Adding a field later requires updating every collector.
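One way to pin the schema down, assuming a dataclass encoding (the pipeline itself may use plain dicts), is a single type every collector must return:

```python
from dataclasses import dataclass, field

# Illustrative encoding of the unified schema from the section above.
@dataclass
class FeedItem:
    title: str
    url: str
    source: str
    score: int = 0
    published_at: str = ""       # ISO 8601 timestamp
    summary: str = ""
    tags: list = field(default_factory=list)
```

Defining this before any collector exists makes the "update every collector" cost of a later field change explicit: the type is the contract.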

2. Implement BaseCollector with retry and rate-limit logic

Rate limiting and retry-on-error are needed by every platform collector. Put them in the base class. Each concrete collector only needs to implement how to fetch its platform's data and how to map it to the common schema, not how to handle HTTP errors or rate limits.
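A minimal sketch of that division of labor, with invented method names and backoff numbers:

```python
import time

class BaseCollector:
    """Shared retry logic; subclasses only implement fetch_raw()."""
    max_retries = 3
    backoff = 1.0            # seconds; doubled after each failed attempt

    def fetch_raw(self):
        """Platform-specific fetch, implemented by each concrete collector."""
        raise NotImplementedError

    def run(self):
        delay = self.backoff
        for attempt in range(self.max_retries):
            try:
                return self.fetch_raw()
            except ConnectionError:
                if attempt == self.max_retries - 1:
                    raise            # out of retries: surface the error
                time.sleep(delay)    # simple exponential backoff
                delay *= 2
```

Real collectors would also respect per-platform rate limits (e.g. sleeping between paginated requests), but the point is the same: transient-failure handling lives in one place.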

3. Keep scoring simple: keyword density + platform weight

A score of 0–100 based on keyword match count (normalized by text length) plus a platform-specific engagement signal (GitHub stars, HN score, Reddit upvotes) is sufficient. Don't add LLM-based scoring; it adds latency and cost without proportionate quality gains for most use cases.
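A toy variant of such a scorer might look like this; the 0.7/0.3 weights, the saturation point, and the engagement cap are invented for illustration, not the pipeline's actual formula:

```python
def relevance_score(text, keywords, engagement, engagement_cap=500):
    """Toy 0-100 score: saturating keyword hits plus capped engagement."""
    words = [w.strip(".,:;!?") for w in text.lower().split()]
    hits = sum(1 for w in words if w in keywords)
    if hits == 0:
        return 0                     # no keyword match -> filtered out
    kw_part = min(hits / 3, 1.0)     # 3+ hits saturate the keyword term
    signal = min(engagement, engagement_cap) / engagement_cap
    return round(100 * (0.7 * kw_part + 0.3 * signal))
```

The entire scorer is a few arithmetic operations per item: fast, deterministic, and easy to tune by editing the keyword list.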

4. Deduplicate by URL first, then by title similarity

The same story often appears across multiple platforms. URL deduplication catches exact duplicates. Title similarity (simple word overlap is sufficient) catches cases where the same article is linked with slightly different URLs or titles across platforms.
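The two-pass dedup can be sketched as follows; the 0.6 overlap threshold is an arbitrary illustration:

```python
def dedupe(items, overlap_threshold=0.6):
    """Drop exact URL duplicates, then near-duplicate titles (word overlap)."""
    kept, seen_urls, seen_titles = [], set(), []
    for item in items:
        if item["url"] in seen_urls:
            continue                 # pass 1: exact URL match
        words = {w.strip(".,:;!?") for w in item["title"].lower().split()}
        # Jaccard overlap of title word sets against everything kept so far.
        if any(len(words & prev) / max(len(words | prev), 1) >= overlap_threshold
               for prev in seen_titles):
            continue                 # pass 2: title similarity
        seen_urls.add(item["url"])
        seen_titles.append(words)
        kept.append(item)
    return kept
```

Word-set overlap is crude but cheap, and it catches the common case of the same article posted to two platforms with minor title edits.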

5. Twitter API rate limits are a real constraint

The Basic Twitter/X API tier has strict monthly request limits. Keep max_results low (10–15 per run) and run the collector less frequently than others. Build the collector to fail gracefully and not block the entire pipeline run if it hits a rate limit.

Frequently Asked Questions

Can I run it with only some sources enabled?

Yes. Sources without configured API keys are skipped with a warning rather than failing the whole run. You can also run specific collectors directly. Reddit and Hacker News require no API keys and work out of the box.

How does the relevance score work?

The scorer checks how many of the configured global keywords appear in the title and summary. Items that match no keywords are filtered out; matches boost the score up to 100. Original platform engagement metrics (stars, HN score, upvotes) also factor in.

What is the trends-hub MCP for Chinese platforms?

The Chinese platform collector calls MCP tools from the trends-hub service (a separate open-source project) to fetch Zhihu trending, 36kr, Juejin, and others. That service must be running locally for the Chinese platform source to work.

Authors: Qiushi Wu & Orange 🍊