Cover image generated by Nano Banana Pro

Motivation

Reading trending threads on Hacker News is one of my favorite ways to discover interesting stories and read (mostly) thought-provoking discussions. Since reading all the top stories would be very time-consuming, I use web apps such as Gemini, Google AI Studio, and Claude to have an LLM agent automatically fetch web pages and summarize their content for me. I then quickly browse the summaries and decide which threads I want to read in full. This approach has been quite effective for me.

However, these AI apps are usually not very transparent about their use of tools. Sometimes they do not actually fetch the web page linked from a Hacker News story but instead infer the content from the Hacker News discussions. Sometimes they flat-out hallucinate without actually fetching anything from either Hacker News or the linked resource. Additionally, copying and pasting links to the top Hacker News stories is repetitive and should be easy to automate.

For these reasons, I decided to build my own little Hacker News Reader app that automatically fetches the Top and Best stories on Hacker News, retrieves the linked pages, summarizes the linked content and the Hacker News discussions, and presents the results as static HTML files through a deterministic workflow. It has been very helpful for me to see the big picture — to get a broader sense of which topics are being widely discussed, rather than semi-randomly picking a few threads to read.

Since the codebase for the app will not be ready to open-source in the near future (I want to review it to avoid leaking any sensitive information), I will instead write a blog post documenting the development of this app. Hopefully, someone will find it useful.

Please note that this blog post describes the latest version of the app as of early February 2026. This app may have been updated since then. I’ll try to link to any subsequent blog posts — or the eventual open-source repository — here for your reference.

Architecture Overview

flowchart TB
  START((Orchestrator Start))
  INGEST[HN Story + Comment Ingestion]
  FETCH[Web Page Fetching]
  PARSE[AI Web Page Cleanup]
  SUMM[AI Summarization]
  REPORT[Static HTML Reports]
  PUBLIC[Public HTML Files]
  VIS[NiceGUI App]
  DB[(SQLite Database)]

  subgraph PIPELINE["Orchestrator Pipeline"]
    INGEST
    FETCH
    PARSE
    SUMM
    REPORT
  end

  START --> INGEST --> FETCH --> PARSE --> SUMM --> REPORT

  PARSE <-. read/write .-> DB
  SUMM <-. read/write .-> DB
  DB -. read .-> REPORT
  FETCH <-. read/write .-> DB
  INGEST -. write .-> DB
  DB -. read .-> VIS
  REPORT -. write .-> PUBLIC


  RUST_NOTE["Green = Rust components"]:::rust_legend
  PY_NOTE["Blue = Python components"]:::python_legend

  class INGEST rust;
  class FETCH rust;
  class PARSE python;
  class SUMM python;
  class REPORT python;
  class VIS python;
  class DB db;
  class PUBLIC frontend;

  classDef python fill:#e3f2fd,stroke:#1565c0,stroke-width:1px,color:#0b2b55;
  classDef db fill:#fff3e0,stroke:#ef6c00,stroke-width:1px,color:#4a2b00;
  classDef frontend fill:#f3f3f3,stroke:#666,stroke-width:1px,color:#1f1f1f;
  classDef rust fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px,color:#0f3d1f;
  classDef rust_legend fill:transparent,stroke:transparent,color:#2e7d32,stroke-width:0px;
  classDef python_legend fill:transparent,stroke:transparent,color:#1565c0,stroke-width:0px;

There are four main components in this system:

  1. Hacker News Ingestor: A multi-threaded Rust program that fetches Top and Best stories, the content of those stories, and the content of the comments on those stories from the Hacker News API.
  2. Web page fetcher and parser: A hybrid program consisting of a Rust submodule that fetches the pages at the URLs associated with the Hacker News stories, and a Python submodule that uses LLMs to extract the core content from the fetched results. The extraction phase removes noise that may distract the summarizer in the next step.
  3. Summarizer: A Python program that uses LLMs to summarize the content of the stories and the comments on those stories.
  4. Report generator: A Python program that generates a report for each ingested Top or Best story list, plus an index of the reports. Each story entry in a report includes a summary of the linked content and a summary of the Hacker News discussion about the story.

The entire workflow is orchestrated by a Python script, and each component is executed in order. The data is stored in a local SQLite database. Each job is scoped to a snapshot of the “top” and “best” story lists. This allows us to resume an interrupted job by skipping stories and comments that have already been processed by each component.
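
To make the orchestration concrete, here is a rough sketch of what the driver script could look like; the module, function, and table names below are illustrative rather than the actual API of the codebase:

```python
# Illustrative orchestrator sketch; module, function, and table names are hypothetical.
import sqlite3

from hnreader import ingest, fetch_pages, parse_pages, summarize, build_report

DB_PATH = "hnreader.sqlite3"


def run_pipeline(list_name: str) -> None:
    """Run every stage in order; each stage skips work already present in the database."""
    summary = ingest.ingest_once(DB_PATH, list_name=list_name)  # Rust ingestor via PyO3

    conn = sqlite3.connect(DB_PATH)
    story_ids = [
        row[0]
        for row in conn.execute(
            "SELECT story_id FROM snapshot_stories WHERE snapshot_time = ?",
            (summary.snapshot_time,),
        )
    ]

    fetch_pages.fetch(DB_PATH, story_ids)               # Rust fetcher via PyO3
    parse_pages.extract_main_content(conn, story_ids)   # LLM-based page cleanup
    summarize.summarize_stories(conn, story_ids)        # LLM summaries
    build_report.render_snapshot(conn, summary.snapshot_time, out_dir="reports")
    conn.close()


if __name__ == "__main__":
    for list_name in ("top", "best"):
        run_pipeline(list_name)
```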

The generated reports are self-contained HTML files. The index is also a self-contained HTML file. The reports and the index are available at https://hnreader.ceshine.net/ via Cloudflare Pages. This is a completely static website intended to minimize hosting costs and complexity.

Screenshot of hnreader.ceshine.net

In addition to the static reports, I have developed an interactive NiceGUI app, Visualizer, for internal testing purposes. The app exposes internal states stored in the local SQLite database. Because it exposes fetched web pages, I am not making the app publicly available.

We’ll cover each component in more detail in the next few sections.

(This system currently processes only stories with a valid URL. Stories without links, such as Ask HN and Who’s Hiring, are skipped for now.)

Hacker News Ingestion

flowchart TD
    A[Python app calls ingest_once]:::python --> B[PyO3 bridge into Rust and build config]:::python
    B --> D[Run Rust ingestion pipeline]:::rust

    D --> E[HN API client]:::rust
    D -->|spawns writer task| F[SQLite writer]:::rust

    E -->|spawn workers with concurrency limit| G[Fetch story lists]:::rust
    G --> H[Fetch story and comment items]:::rust

    H -->|item rows to persist| F
    G -->|list snapshots to persist| F

    F --> I[Upsert snapshots, stories, comments]:::rust
    I --> J[Return ingestion summary]:::rust
    J --> K[Python receives counts]:::python
    
    classDef rust fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px,color:#0b2b55;
    classDef python fill:#e3f2fd,stroke:#1565c0,stroke-width:1px,color:#0b2b55;

The Hacker News ingestion pipeline is implemented in Rust and is usually triggered from Python via the PyO3 bridge. The bridge builds the ingestion configuration that includes:

  • Database path
  • Story list selection (top or best)
  • Limits for stories, comments, depth, and concurrency
  • Rate limit settings
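
From Python, a call into the bridge might look roughly like this (the module name, signature, and default values are assumptions for illustration):

```python
# Hypothetical signature of the PyO3-exposed ingestion entry point.
from hnreader import ingest  # compiled Rust extension module (name assumed)

summary = ingest.ingest_once(
    db_path="hnreader.sqlite3",
    list_name="top",            # story list selection: "top" or "best"
    max_stories=30,             # limit on stories per snapshot
    max_comments_per_node=10,   # limit on child comments per story/comment node
    max_depth=3,                # maximum comment-tree depth to traverse
    concurrency=8,              # number of concurrent fetch workers
    min_request_delay_ms=200,   # rate-limit setting between API requests
)
print(summary.snapshot_time, summary.story_count)
```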

The Rust pipeline spawns multiple asynchronous workers that fetch stories and comments concurrently. Rate limiting is enforced via a mutex to ensure a minimum delay between API requests.
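
The actual limiter lives in Rust, but the same idea can be sketched in Python with asyncio: a shared lock guards the timestamp of the last request, and each worker waits out the remaining delay before firing its call.

```python
import asyncio
import time


class MinDelayLimiter:
    """Ensure a minimum delay between requests across all concurrent workers."""

    def __init__(self, min_delay: float) -> None:
        self._min_delay = min_delay
        self._last_request = 0.0
        self._lock = asyncio.Lock()  # plays the role of the Rust mutex

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            sleep_for = self._min_delay - (now - self._last_request)
            if sleep_for > 0:
                await asyncio.sleep(sleep_for)
            self._last_request = time.monotonic()
```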

The pipeline first fetches the list of stories, processes the stories one by one, and collects their comments one level at a time (from the direct descendants to the deepest comment level defined in the configuration). Collected data is written to the database sequentially at the end of each step to avoid race conditions.
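
Conceptually, the comment collection is a breadth-first traversal bounded by the configured depth and per-node child limit; a simplified Python sketch (with a hypothetical fetch_item helper standing in for the API client) might look like this:

```python
# Simplified level-by-level comment collection; fetch_item is a hypothetical helper
# that retrieves a single item from the Hacker News API.
def collect_comments(story: dict, max_depth: int, max_children: int) -> list[dict]:
    collected = []
    frontier = story.get("kids", [])[:max_children]  # direct descendants first
    depth = 0
    while frontier and depth < max_depth:
        next_frontier = []
        for comment_id in frontier:
            comment = fetch_item(comment_id)
            collected.append(comment)
            next_frontier.extend(comment.get("kids", [])[:max_children])
        frontier = next_frontier
        depth += 1
    return collected
```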

Once the list of stories is persisted to the database as a snapshot, the ingestion process can be resumed, and all previously ingested stories and comments associated with this list will be overwritten by the upsert operations. This could cause time discrepancy issues if the process is resumed long after the snapshot was taken. Ideally, the entire ingestion pipeline should be treated as an atomic operation to avoid such issues. However, since retry attempts usually occur shortly after a failure, this is considered a reasonable trade-off between simplicity and correctness.

The Python caller of the ingestion pipeline receives a simple ingestion summary containing the snapshot timestamp, the number of stories in the list, and the number of items ingested or skipped for each item type.

Web Page Fetching and Parsing

flowchart TD
    START[Start: Story IDs and Model ID]:::python
    META[Fetch Story Metadata]:::python
    FILTER{Filter URLs:<br>Non-Empty URLs<br>External URLs}:::python
    BRIDGE[PyO3 Bridge to Rust]:::python
    RUST_PREP[Prepare Fetch Set:<br>DB and Schema Ready<br>Skipped vs To_Fetch]:::rust
    MCP_SPAWN[[Playwright MCP Server]]:::rust
    RUST_FETCH[Fetch Markdown via Tool:<br>Concurrent Buffer Unordered]:::rust
    TOOL_OK{Tool Result Has Text}:::rust
    UPSERT_RAW[Upsert Raw Markdown:<br>Fetch Timestamp Stored]:::rust
    FETCH_STATUS[Record Fetch Status:<br>Success Skipped or Error]:::rust
    MCP_SHUTDOWN[Best Effort Shutdown]:::rust
    PAGE_LIST{Build Fetched Page List}:::python
    WEBPARSER[Run Web Parser Agent:<br>Line Numbers Added]:::python
    PARSE_STATUS{Parse Status}:::python
    CONDENSE[Extract Content Ranges:<br>Condensed Markdown]:::python
    PARSE_ERROR[Log Error and Skip Update]:::python
    UPDATE_PAGES[Update Fetched Pages:<br>Condensed Markdown and Flags]:::python
    RETURN_FLAGS[Return Story ID to Success Flag]:::python

    START --> META --> FILTER
    FILTER --> BRIDGE --> RUST_PREP --> RUST_FETCH --> TOOL_OK
    RUST_PREP -.->|Start MCP Service| MCP_SPAWN
    RUST_FETCH -.->|Tool Call| MCP_SPAWN
    MCP_SPAWN -.->|Response| TOOL_OK
    TOOL_OK -->|yes| UPSERT_RAW --> FETCH_STATUS
    TOOL_OK -->|no| FETCH_STATUS
    FETCH_STATUS --> MCP_SHUTDOWN --> PAGE_LIST --> WEBPARSER --> PARSE_STATUS
    MCP_SPAWN -.->|Stop MCP Service| MCP_SHUTDOWN
    PARSE_STATUS -->|success| CONDENSE --> UPDATE_PAGES --> RETURN_FLAGS
    PARSE_STATUS -->|blocked/anomalous| UPDATE_PAGES
    PARSE_STATUS -->|failed/crashed| PARSE_ERROR --> RETURN_FLAGS

    classDef rust fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px,color:#0b2b55;
    classDef python fill:#e3f2fd,stroke:#1565c0,stroke-width:1px,color:#0b2b55;

This component consists of two submodules: the Rust web page fetcher and the Python fetch-result parser. The two are connected by a Python orchestrator.

Web Page Fetcher

I chose to implement the web page fetcher in Rust because of its superior concurrency control and high-performance multi-process management capabilities.

The fetcher currently runs the Playwright MCP server I developed to fetch web pages. This MCP server has been used in my other projects and my daily LLM chat workflows, so incorporating it into this project creates minimal development friction (and it’s free!). I may add support for commercial scraping APIs such as Jina’s Reader API and the Firecrawl API in the future for pages and sites that do not play well with Playwright.

The MCP server supports concurrency, so the Rust fetcher can directly initiate multiple tool calls against a single MCP process, making the fetching process incredibly fast.

We use the fetch_markdown tool from the MCP server to collect a Markdown version of each fetched web page, which is more compact and readable than raw HTML.
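
The Rust fetcher drives these tool calls as a bounded-concurrency stream (buffer_unordered). The same idea, sketched with asyncio and a hypothetical call_tool coroutine that wraps the MCP client:

```python
import asyncio

MAX_CONCURRENCY = 5


async def fetch_all(call_tool, urls: list[str]) -> dict[str, str | None]:
    """Fetch Markdown for every URL with a bounded number of in-flight tool calls.

    `call_tool` is a hypothetical coroutine wrapping the MCP client's tool invocation.
    """
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def fetch_one(url: str) -> tuple[str, str | None]:
        async with semaphore:
            result = await call_tool("fetch_markdown", {"url": url})
            return url, result  # None when the tool returned no text

    pairs = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(pairs)
```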

The entire fetcher module is only 150 lines of code.

Fetch Result Parser

Fetched web pages often contain a lot of irrelevant content, such as navigation headers, footers, advertisements, privacy notifications, and other elements. To prevent this content from distracting the summarizer, I added a fetch-result parser that utilizes LLMs to extract the main content from fetched web pages.

To avoid hallucinations, instead of asking the LLM to repeat the main content, we attach a line number to each line of the fetched content and ask the LLM to return the line ranges that contain the main content. Some lines may still be identified incorrectly, but the extracted content is always verbatim from the fetched page, so nothing can be fabricated.
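
A minimal sketch of the numbering step (the exact prefix format used in the real prompt may differ):

```python
def add_line_numbers(markdown: str) -> str:
    """Prefix each line of the fetched Markdown with its 1-based line number."""
    return "\n".join(
        f"{line_no}: {line}"
        for line_no, line in enumerate(markdown.splitlines(), start=1)
    )
```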

The LLMs are also tasked with determining two flags based on the input content: ‘blocked’ and ‘anomalous’. The ‘blocked’ flag indicates cases where the fetch attempt is blocked by the website (usually resulting in an ‘access denied’ message or a CAPTCHA request). The ‘anomalous’ flag indicates cases where the fetched result does not seem relevant to the story’s title (usually caused by page rendering issues).

Finally, we ask the LLMs to provide the reasoning for their decision. This helps us debug the results and, hopefully, improve the accuracy of the output.

Below is the part of the system prompt for the LLM that specifies the output format:

Output format

- Return a single JSON object matching this schema exactly:
  - content_ranges: list of [start, end]
  - blocked: boolean
  - anomalous: boolean
  - reasoning: string
- Do not add any other keys.

You can find the complete system prompt here.

If either of the two flags is raised in the LLM output, we will record the issue for the target URL in the database. Otherwise, we reconstruct the main content from the original fetch result and the LLM output and persist it to the database.
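
Assuming the schema above, the validation and reconstruction step could be sketched as follows; Pydantic is used here purely for illustration, and the actual code may validate the output differently:

```python
from pydantic import BaseModel


class ParseResult(BaseModel):
    content_ranges: list[tuple[int, int]]  # inclusive [start, end] line numbers
    blocked: bool
    anomalous: bool
    reasoning: str


def condense(markdown: str, result: ParseResult) -> str | None:
    """Rebuild the main content verbatim from the original fetch result."""
    if result.blocked or result.anomalous:
        return None  # record the issue instead of persisting condensed content
    lines = markdown.splitlines()
    kept = []
    for start, end in result.content_ranges:
        kept.extend(lines[start - 1:end])  # line numbers are 1-based
    return "\n".join(kept)
```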

Web Page Content and Discussion Summarizer

flowchart TD
    START[Start: Story IDs and Model ID]:::python
    INPUTS[Gather Inputs:<br>Story Metadata<br>Thread Comments<br>Optional Page Content]:::python
    FLATTEN[Flatten Thread:<br>Preorder Comment List]:::python
    CLEAN[Clean and Truncate Comment Text]:::python
    PAYLOAD[Build Thread Payload:<br>Story Header and Comments]:::python
    PAGE_CHECK{Page Content Available}:::python
    HEADER_PAGE[Prompt Header:<br>source_mode: page]:::python
    HEADER_HN[Prompt Header:<br>source_mode: hn_only]:::python
    PROMPT[Assemble LLM Prompt:<br>Header Comment Lines and Page - if available]:::python
    AI_AGENT[[External AI Agent]]:::python
    FORMAT[Format Summary Markdown and Persist to DB]:::python

    START --> INPUTS --> FLATTEN --> CLEAN --> PAYLOAD --> PAGE_CHECK
    PAGE_CHECK -->|yes| HEADER_PAGE --> PROMPT
    PAGE_CHECK -->|no| HEADER_HN --> PROMPT
    PROMPT -->|Send Prompt| AI_AGENT
    AI_AGENT -->|Receive Summary| FORMAT

    classDef python fill:#e3f2fd,stroke:#1565c0,stroke-width:1px,color:#0b2b55;

This is the last component in the system that writes to the database. The main task of this summarizer component is to synthesize a prompt that instructs the LLM to generate two summaries: one for the content of the linked web page and one for the Hacker News discussion thread.

Because the fetcher/parser component may fail to collect the content of the linked web page, we have a fallback that infers the post’s content from the discussion. We mark the normal mode with a source_mode: page line and the fallback mode with a source_mode: hn_only line. The system prompt contains instructions to handle both modes appropriately.

The ingested Hacker News comments are traversed in pre-order to mimic the presentation on the Hacker News website. Each comment is placed in the prompt as a single-line JSON object (one object per line).

Because the page content is optional, the prompt lists the comments first; it then appends the page content (if present) at the end.
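
Here is a rough sketch of the prompt assembly under these rules; the data model and field names are simplified for illustration:

```python
import json


def flatten_preorder(comment_tree: list[dict]) -> list[dict]:
    """Flatten nested comments in pre-order, mirroring the on-site presentation."""
    flat = []
    for comment in comment_tree:
        flat.append(comment)
        flat.extend(flatten_preorder(comment.get("children", [])))
    return flat


def build_prompt(story: dict, comment_tree: list[dict], page_content: str | None) -> str:
    source_mode = "page" if page_content else "hn_only"
    header = "\n".join([
        f"Title: {story['title']}",
        f"URL: {story['url']}",
        f"Descendants: {story['descendants']}",
        f"source_mode: {source_mode}",
    ])
    comment_lines = "\n".join(
        json.dumps(c, ensure_ascii=False) for c in flatten_preorder(comment_tree)
    )
    parts = [header, comment_lines]
    if page_content:
        parts.extend(["---", page_content])
    return "\n\n".join(parts)
```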

This is the input format included in the system prompt:

Title: <Story Title>
URL: <URL linked by the Story>
Descendants: <Number of comments under this story>
source_mode: [page | hn_only]

{"id": 9224, "parent_id": 8863, "depth": 0, "rank_in_parent": 1, "by": "user1", "time": 1175816820, "dead": false, "deleted": false, "text": "This is a top-level comment containing the full text..."}
{"id": 9272, "parent_id": 9224, "depth": 1, "rank_in_parent": 1, "by": "user2", "time": 1175822880, "dead": false, "deleted": false, "excerpt": "This is a nested reply containing only an excerpt..."}
... (one JSON object per line)

--- 

[The Page Content in Markdown when `source_mode` is `page`]

You can find the complete system prompt here.

Similar to the AI agent for the fetch result parser, the summarizer agent is required to produce structured output containing the two summaries we requested. The received output is briefly validated and then persisted to the database.

Report Generator

flowchart TD
    START[Start: Report Request]:::python
    SCOPE{Report Scope}:::python
    SNAPSHOT_HTML[Build Snapshot Report with Jinja2]:::python
    NAV_PATCH[Patch Navigation Links]:::python
    WRITE_REPORT[Write Report File]:::python
    UPDATE_NEIGHBORS[Update Neighbor Reports]:::python
    INDEX_PAGE[Regenerate Index Page]:::python

    SNAPSHOTS[Fetch Snapshot List]:::python
    GROUP_LISTS[Group by List Name]:::python
    LOOP_REPORTS[Iterate Snapshots<br>Chronological Order]:::python
    PATCH_BATCH[Patch Prev and Next Links]:::python
    WRITE_BATCH[Write Report Files]:::python
    INDEX_ALL[Generate Index Page]:::python

    FETCH_DATA[Fetch Snapshot Data]:::python

    START --> SCOPE
    SCOPE -->|single| FETCH_DATA
    FETCH_DATA --> SNAPSHOT_HTML
    SNAPSHOT_HTML -->|single report| NAV_PATCH --> WRITE_REPORT --> UPDATE_NEIGHBORS --> INDEX_PAGE

    SCOPE -->|batch| SNAPSHOTS --> GROUP_LISTS --> LOOP_REPORTS --> FETCH_DATA
    SNAPSHOT_HTML -->|batch reports| PATCH_BATCH --> WRITE_BATCH --> INDEX_ALL

    classDef python fill:#e3f2fd,stroke:#1565c0,stroke-width:1px,color:#0b2b55;

This is the component I relied most heavily on AI to build, because I am not familiar with JavaScript. I intentionally avoided any complex JavaScript frameworks to minimize project complexity and reduce maintenance and hosting costs for the reports.

The Python report generation code relies on the Jinja2 templating engine. Each snapshot has its own self-contained HTML report. Generating the HTML report files is straightforward: we read the summary data from the database and let Jinja2 populate the templates. We optionally perform some post-processing on the summary data to inject links to the referenced Hacker News comments.
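
The rendering step itself is plain Jinja2 usage; a trimmed-down sketch (the template name and context fields are illustrative):

```python
from pathlib import Path

from jinja2 import Environment, FileSystemLoader, select_autoescape

env = Environment(
    loader=FileSystemLoader("templates"),
    autoescape=select_autoescape(["html"]),
)


def render_snapshot_report(snapshot: dict, stories: list[dict], out_dir: str) -> Path:
    """Render one self-contained HTML report for a single snapshot."""
    template = env.get_template("snapshot_report.html.j2")  # hypothetical template name
    html = template.render(snapshot=snapshot, stories=stories)
    out_path = Path(out_dir) / f"{snapshot['list_name']}-{snapshot['taken_at']}.html"
    out_path.write_text(html, encoding="utf-8")
    return out_path
```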

The tricky part comes when building the index page and the navigation links within each report page. We have two generation modes: “single” and “batch”. The former generates a report for a single snapshot, while the latter generates reports for all snapshots in the database. The “single” mode requires finding the previous and next snapshots for the target snapshot. The navigation links in the reports for those neighboring snapshots may need to be updated. The “batch” mode is much more straightforward, as we can collect the list of snapshots in chronological order and process them one by one.
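
Finding those neighbors in “single” mode boils down to locating the target snapshot in a chronologically sorted list; roughly:

```python
def find_neighbors(snapshots: list[dict], target_id: int) -> tuple[dict | None, dict | None]:
    """Return the (previous, next) snapshots around the target, if any.

    `snapshots` are assumed to belong to the same list ("top" or "best") and to
    carry a sortable `taken_at` timestamp; field names are illustrative.
    """
    ordered = sorted(snapshots, key=lambda s: s["taken_at"])
    index = next(i for i, s in enumerate(ordered) if s["id"] == target_id)
    prev_snapshot = ordered[index - 1] if index > 0 else None
    next_snapshot = ordered[index + 1] if index + 1 < len(ordered) else None
    return prev_snapshot, next_snapshot
```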

I use Pico CSS for the UI and Alpine.js for dynamic behavior. The pages are uploaded and deployed to Cloudflare Pages via a single command: npx wrangler pages deploy reports --commit-dirty=true.

You can access the reports at https://hnreader.ceshine.net.

Screenshot of Cloudflare Pages Dashboard

NiceGUI-based Visualizer

This is an internal application for validating intermediate database states and for my own use (e.g., generating a copy-ready prompt for each story so I can explore interesting stories further in a separate LLM thread). Below are some screenshots demonstrating the application’s capabilities.

Front Page

Snapshot Browser

Page Content Viewer

Thread Debugger

Future Improvements

A more intelligent comment ingestion strategy:

Currently, the ingestion process is mainly controlled by two parameters: one sets the maximum depth of the comment tree to traverse; the other sets the maximum number of child comments to collect for each story or comment node. This may not be the best strategy for obtaining the most relevant comments. For example, we may want to go deeper under the first few top-level comments of a story.

A thorough analysis of common patterns in the Hacker News discussion tree is needed to devise such a strategy.

A more robust page fetcher:

The MCP server currently in use does not handle CAPTCHAs for fetch requests the way it does for search requests. Improving the MCP server would help increase the number of pages that can be fetched. Alternatively, adding a second fetcher layer that relies on a commercial scraping API would also do the trick, albeit at some cost.

Support concurrent AI agent requests:

Currently, the fetch-result parser and the summarizer agents each handle one request at a time. This worked well during initial development with local LLMs. With commercial LLM APIs, though, this leaves significant performance on the table because requests could run concurrently, at the cost of added codebase complexity.

Provide search functionality for the summary reports:

We could probably use Algolia’s free tier to provide simple search functionality for the static website. We could also build a simple search index for story titles and implement a self-contained HTML page to provide a more integrated experience.

Improve the system prompt of the summarizer agent:

The current system prompt has rigid output requirements for all story types. We may want to give the agent more leeway to decide what content to include in the summary. Providing more examples of story types could also help.

AI Use Disclosure

I relied heavily on the Codex CLI (GPT-5.2 Codex) to create Mermaid diagrams consistent with the codebase. I also used AI tools to revise the rest of the post, primarily for grammar and word choice. However, I wrote most of the content myself; it was not generated using prompts.

I developed the Hacker News reader mainly with assistance from the Gemini CLI (3 Pro Preview and 3 Flash Preview), occasionally using Codex CLI and OpenCode.

Acknowledgments

This article helped me set up Mermaid diagram rendering on this website: Getting Mermaid Diagrams Working in Hugo