Last updated: March 2026 · Source: info-pipeline
# Live Intelligence Feed
A Python pipeline that pulls AI and tech content from 7 platforms, applies keyword filtering and relevance scoring, deduplicates results, and produces a unified JSON/Markdown report for downstream agent consumption.
## What

info-pipeline aggregates AI and tech content from 7 heterogeneous platforms into a single scored, deduplicated feed. Each platform has a dedicated collector that normalizes its output to a common schema. A filter/scorer stage applies global keyword matching and relevance scoring (0–100) before the report is written. The output is machine-readable JSON plus a human-readable Markdown report.
## Why
Staying current with AI developments requires monitoring many heterogeneous sources simultaneously: GitHub for new repositories, Hacker News and Reddit for community discussion, YouTube for research explanations, Product Hunt for new tools, Twitter/X for early signals, and Chinese platforms for developments that English-language sources miss. Doing this manually across 7 platforms is time-consuming and inconsistent.
Existing aggregators (RSS readers, Feedly, etc.) produce human-readable feeds but not machine-readable structured data. An AI research agent needs a scored, deduplicated, normalized feed it can query and reason about β not a list of links to read. This pipeline provides that feed in a format designed for agent consumption.
## Architecture
The pipeline has three stages:

```
[Collectors: run in parallel]
    → GitHub Trending, Hacker News, Reddit, YouTube,
      Product Hunt, X/Twitter, Chinese platforms (via MCP)
    → each normalizes to unified schema
[Filter / Scorer]
    → keyword matching against global keyword list
    → relevance score 0–100 (keyword density + platform signal)
    → deduplication by URL, then title similarity
[Report Writer]
    → unified JSON (all items)
    → Markdown report (top N items)
```

Data sources:
| Platform | Language | Notes |
|---|---|---|
| GitHub Trending | EN | Topic-filtered repos by stars/forks in the last N days |
| Hacker News | EN | Top Stories, filtered by minimum score |
| Reddit | EN | Multiple AI subreddits; no API key required |
| YouTube | EN | Configured channel playlists via Data API v3 |
| Product Hunt | EN | Daily new products via GraphQL API |
| X / Twitter | EN | Keyword search; requires Basic API tier (low-volume) |
| Chinese platforms | ZH | Zhihu, 36kr, Juejin, Sspai, InfoQ, Bilibili via trends-hub MCP |
Code structure:
| Component | Responsibility |
|---|---|
| collectors/base.py | BaseCollector with common fetch, retry, and rate-limit logic |
| collectors/*.py | One file per platform; each extends BaseCollector |
| collectors/__init__.py | ALL_COLLECTORS registry; registers all enabled collectors |
| filters/scorer.py | Keyword filtering, relevance scoring (0–100), deduplication |
| main.py | Entry point; orchestrates collector runs and report generation |
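The base-class-plus-registry layout above might be sketched as follows. This is a hedged illustration, not the project's actual code: the class names, the empty `fetch()` body, and the registry being a plain list are all assumptions.

```python
# collectors/base.py -- hypothetical sketch of the collector contract
from abc import ABC, abstractmethod


class BaseCollector(ABC):
    """Shared fetch/retry/rate-limit logic would live here (omitted)."""

    source = "base"  # overridden by each platform collector

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Return items already normalized to the unified schema."""


# collectors/__init__.py -- registry of enabled collectors
class GitHubTrendingCollector(BaseCollector):
    source = "github"

    def fetch(self) -> list[dict]:
        return []  # a real implementation would query GitHub trending


ALL_COLLECTORS: list[type[BaseCollector]] = [GitHubTrendingCollector]
```

With this shape, `main.py` can iterate over `ALL_COLLECTORS` without knowing anything platform-specific, and adding a source means appending one class to the list.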
Every item across all sources shares the same JSON schema:
```json
{
  "title": "Article / project title",
  "url": "https://...",
  "source": "github",
  "score": 85,
  "published_at": "2026-02-18T10:00:00Z",
  "summary": "Short description or excerpt",
  "tags": ["llm", "open-source"]
}
```

## Key Design Decisions
### Unified output schema: normalize at the source
Every collector's job is to normalize its platform's output to the common schema before it leaves the collector. The scorer and report writer deal only with the common schema; they have no platform-specific logic. This makes adding a new source a matter of implementing one new class with a single fetch() method.
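The unified schema could be pinned down as a dataclass so a collector fails fast on a missing field. A minimal sketch; the class name `FeedItem` is an assumption, and the fields mirror the JSON example above:

```python
from dataclasses import asdict, dataclass, field


@dataclass
class FeedItem:
    """Unified schema every collector must emit (hypothetical name)."""
    title: str
    url: str
    source: str            # e.g. "github", "hn", "reddit"
    score: int             # relevance score, 0-100
    published_at: str      # ISO 8601 UTC timestamp
    summary: str
    tags: list[str] = field(default_factory=list)


# Constructing an item matching the JSON example above
item = FeedItem(
    title="Article / project title",
    url="https://example.com",
    source="github",
    score=85,
    published_at="2026-02-18T10:00:00Z",
    summary="Short description or excerpt",
    tags=["llm", "open-source"],
)
```

`asdict(item)` then yields exactly the JSON-ready dict the report writer needs.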
### Config-driven, not code-driven
Keywords, platform parameters, and source selection all live in a YAML config file. Tuning the feed doesn't require code changes. This matters when iterating quickly on which topics to follow or adjusting platform-specific thresholds like minimum HN score.
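A config of this kind might look like the sketch below. The file name, key names, and values are all assumptions for illustration, not the project's actual config:

```yaml
# config.yaml -- hypothetical layout
keywords:
  - llm
  - agents
  - open-source
sources:
  github:
    enabled: true
    days: 7           # trending window
  hackernews:
    enabled: true
    min_score: 100    # minimum HN points
  twitter:
    enabled: false    # requires Basic API tier
    max_results: 10
report:
  top_n: 30           # items in the Markdown report
```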
### Graceful degradation on missing API keys

A collector with no configured API key is skipped with a warning, not an error. The pipeline still produces output from the sources that are configured. This means the pipeline can run partially (e.g. GitHub + HN only) without requiring all seven API credentials to be present.
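The skip-with-warning behavior can be sketched like this. The function name, the env-var-keyed mapping, and the empty-string convention for keyless sources are assumptions:

```python
import logging
import os

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")


def run_collectors(collectors: dict[str, object]) -> list[dict]:
    """Run each collector; skip any whose API key env var is unset.

    `collectors` maps an env-var name (or "" for keyless sources)
    to a collector object exposing fetch() -> list[dict].
    """
    items: list[dict] = []
    for env_var, collector in collectors.items():
        if env_var and not os.environ.get(env_var):
            # Skipped with a warning, not an error -- the run continues
            log.warning("skipping %s: %s not set", collector, env_var)
            continue
        items.extend(collector.fetch())
    return items
```

A run with only some keys present simply yields a shorter feed, matching the partial-run behavior described above.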
### MCP for Chinese platforms: avoid direct scraping
Chinese tech platforms (Zhihu, 36kr, Juejin) have complex anti-bot measures and no official English-language APIs. The trends-hub MCP service handles these sources separately; the pipeline calls it as a tool rather than implementing platform-specific scrapers that would need constant maintenance.
## How to Build Your Own

### 1. Define the unified schema first
Before writing any collector, define the output schema all collectors must produce. The minimum useful fields are: title, url, source, score, published_at, summary. Adding a field later requires updating every collector.
### 2. Implement BaseCollector with retry and rate-limit logic

Rate limiting and retry-on-error are needed by every platform collector. Put them in the base class. Each concrete collector only needs to implement how to fetch its platform's data and how to map it to the common schema, not how to handle HTTP errors or rate limits.
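The shared retry logic might look like the sketch below: exponential backoff with a capped number of attempts. Function and parameter names are assumptions; real code would also honor `Retry-After` headers and per-platform rate limits.

```python
import time


def fetch_with_retry(fetch_fn, max_retries: int = 3, base_delay: float = 1.0):
    """Call fetch_fn, retrying with exponential backoff on failure.

    A minimal sketch of logic a BaseCollector could share with all
    concrete collectors.
    """
    for attempt in range(max_retries):
        try:
            return fetch_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

A concrete collector then wraps its platform call in `fetch_with_retry` and focuses solely on mapping the response to the common schema.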
### 3. Keep scoring simple: keyword density + platform weight

A score of 0–100 based on keyword match count (normalized by text length) plus a platform-specific engagement signal (GitHub stars, HN score, Reddit upvotes) is sufficient. Don't add LLM-based scoring; it adds latency and cost without proportionate quality gains for most use cases.
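A scoring function in this spirit could look like the sketch below. The 70/30 split between the keyword and engagement components, and the engagement cap, are assumptions, not the pipeline's actual weights:

```python
def relevance_score(text: str, keywords: list[str],
                    engagement: float, max_engagement: float = 1000.0) -> int:
    """Score 0-100 from keyword density plus an engagement signal.

    A hypothetical weighting: 70% keyword density (capped), 30%
    normalized platform engagement (stars / points / upvotes).
    """
    words = text.lower().split()
    if not words:
        return 0
    keyword_set = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in keyword_set)
    keyword_part = min(hits / len(words) * 10, 1.0)       # density, capped at 1
    engagement_part = min(engagement / max_engagement, 1.0)
    return round(100 * (0.7 * keyword_part + 0.3 * engagement_part))
```

Normalizing by text length keeps keyword-stuffed long summaries from dominating short, dense titles.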
### 4. Deduplicate by URL first, then by title similarity
The same story often appears across multiple platforms. URL deduplication catches exact duplicates. Title similarity (simple word overlap is sufficient) catches cases where the same article is linked with slightly different URLs or titles across platforms.
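The two-pass dedup described above can be sketched with word-set (Jaccard) overlap for title similarity. The function name and the 0.7 threshold are assumptions:

```python
def dedupe(items: list[dict], overlap_threshold: float = 0.7) -> list[dict]:
    """Drop exact-URL duplicates, then near-duplicate titles.

    Title similarity is plain word-set overlap, as suggested above;
    first occurrence wins.
    """
    seen_urls: set[str] = set()
    kept: list[dict] = []
    for item in items:
        if item["url"] in seen_urls:
            continue  # exact URL duplicate
        words = set(item["title"].lower().split())
        duplicate = False
        for prev in kept:
            prev_words = set(prev["title"].lower().split())
            union = words | prev_words
            if union and len(words & prev_words) / len(union) >= overlap_threshold:
                duplicate = True  # same story under a different URL/title
                break
        if not duplicate:
            seen_urls.add(item["url"])
            kept.append(item)
    return kept
```

The inner loop is O(n²) over kept items, which is acceptable for a feed of a few hundred items per run.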
### 5. Twitter API rate limits are a real constraint
The Basic Twitter/X API tier has strict monthly request limits. Keep max_results low (10β15 per run) and run the collector less frequently than others. Build the collector to fail gracefully and not block the entire pipeline run if it hits a rate limit.
## Frequently Asked Questions

### Can I run it with only some sources enabled?
Yes. Sources without configured API keys are skipped with a warning rather than failing the whole run. You can also run specific collectors directly. Reddit and Hacker News require no API keys and work out of the box.
### How does the relevance score work?
The scorer checks how many of the configured global keywords appear in the title and summary. Items that match no keywords are filtered out; matches boost the score up to 100. Original platform engagement metrics (stars, HN score, upvotes) also factor in.
### What is the trends-hub MCP for Chinese platforms?
The Chinese platform collector calls MCP tools from the trends-hub service (a separate open-source project) to fetch Zhihu trending, 36kr, Juejin, and others. That service must be running locally for the Chinese platform source to work.
Authors: Qiushi Wu & Orange