Aditya's Blog

Polling a public Telegram channel, no account

I wanted to monitor a public Telegram channel for new posts. No account, no API key, no MTProto session, no bot in the channel.

A note before the technique: this is an experiment, not a scraping operation. Polling cadence in minutes, sequential requests, low volume. Public channels only. Anything private needs the official auth path. Don't be the reason this stops working for everyone.

The obvious route is https://t.me/s/<channel>, Telegram's web preview that serves the message HTML. Works for most channels. Some channels turn it off; you get the "View in Telegram" join card and nothing else. RSS bridges and t.me/s/ scrapers both die there.

The leak is in per-post URLs. Even with the web preview disabled, https://t.me/<channel>/<post_id> still serves OpenGraph meta tags so other sites can unfurl Telegram links.

<meta property="og:title"       content="Author Name in Channel Title">
<meta property="og:description" content="...actual post body text...">
<meta property="og:image"       content="https://cdn1.telesco.pe/file/...">

That's the entire payload. Author, text, image. Parse it with one regex per tag.

import html
import re
import urllib.request

def parse(channel, pid):
    req = urllib.request.Request(
        f"https://t.me/{channel}/{pid}",
        headers={"User-Agent": "Mozilla/5.0"},
    )
    page = urllib.request.urlopen(req, timeout=10).read().decode("utf-8", "replace")

    def og(prop):
        # Attribute values are HTML-escaped (&quot;, &amp;, ...), so unescape them.
        m = re.search(rf'og:{prop}"\s+content="([^"]*)"', page)
        return html.unescape(m.group(1)) if m else None

    title = og("title")
    if not title or " in " not in title:
        return None  # deleted, doesn't exist, or rate-limited
    return {
        "author": title.split(" in ")[0],
        "text":   og("description") or "",
        "image":  og("image"),
    }

A few things that bit me:

Telegram serves the channel landing page (HTTP 200) when the post doesn't exist or you're rate-limited, so the status code tells you nothing. Distinguish real posts by the <Author> in <Channel> shape of og:title.

Post IDs are sparse. Replies, deletes, and admin actions consume IDs without producing visible posts. A 5-min poll on an active channel scans ~50 IDs to surface ~20 real posts. Walk forward from your last seen ID; tolerate a configurable gap before declaring the end of the new range.
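A minimal sketch of that walk-forward scan, not the exact code I run. The names scan_new_posts, fetch, and max_gap are made up for illustration; fetch is assumed to be any callable that returns None for a missing ID (for example, a wrapper around the parser above).

```python
def scan_new_posts(fetch, last_seen, max_gap=20):
    """Walk post IDs upward from last_seen + 1.

    fetch(pid) is assumed to return a post, or None when the ID is empty.
    Stops once max_gap consecutive IDs have produced nothing.
    """
    found = []
    misses = 0
    pid = last_seen + 1
    while misses < max_gap:
        post = fetch(pid)
        if post is None:
            misses += 1
        else:
            misses = 0  # a real post resets the gap counter
            found.append((pid, post))
        pid += 1
    return found
```

The next poll starts from the highest ID in the result, or from the old last_seen if the result is empty. Set max_gap too low and a run of deleted posts hides everything after it; set it too high and every poll burns requests on IDs that don't exist yet.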

og:image is time-signed. The cdn1.telesco.pe/file/... URL expires in minutes. If you want the image, download it now, not later. Store the local path.
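Something like the following is enough to grab the image while the URL is still valid. It is a sketch under assumptions: save_image, dest_dir, and the opener parameter are hypothetical names, and the .jpg extension is a guess, not something the response guarantees.

```python
import os
import urllib.request

def save_image(url, dest_dir, pid, opener=urllib.request.urlopen):
    """Download url right away and return the local path.

    opener is injectable so the function can be exercised without network.
    The .jpg extension is an assumption; check Content-Type if it matters.
    """
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, f"{pid}.jpg")
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener(req, timeout=10) as resp, open(path, "wb") as out:
        out.write(resp.read())
    return path
```

Call it in the same pass that parsed the post, and store the returned path instead of the og:image URL.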

Sequential, polite delay. ~300 ms between requests is fine. Parallel fetchers above ~4 workers trip the throttle and you get landing pages back instead of real posts. Silent data loss.
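One way to keep the spacing honest is a small pacer called before every request. This is an illustrative sketch; the Pacer class and its parameters are my own naming, and the 0.3 s default just mirrors the ~300 ms figure above.

```python
import time

class Pacer:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=0.3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous wait(), None before the first

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self.clock() - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()
```

Call pacer.wait() right before each fetch. Because it measures from the previous call, slow responses don't get an extra delay stacked on top.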

Long posts are probably truncated. og:description is meant for link unfurls, not full message text. Empirically it cuts at the first paragraph break, so sometimes the first line of a long essay is all you get. Channels with mostly short posts (alerts, status updates) lose nothing. Channels with long-form posts will give you partial bodies, and there's no obvious way to know from the response that you got less than the whole thing. If you need the full text, append ?embed=1&mode=tme to the post URL (https://t.me/<channel>/<post_id>?embed=1&mode=tme). That returns the rendered widget HTML with the complete message inside <div class="tgme_widget_message_text">. Costs you a second request per post.
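A rough sketch of pulling the full text out of the embed response with the standard-library HTML parser. The helper names (embed_url, MessageText, extract_full_text) are hypothetical, and the only thing assumed about the widget markup is the tgme_widget_message_text div mentioned above; treating <br> as a newline is my own choice.

```python
from html.parser import HTMLParser

def embed_url(channel, pid):
    return f"https://t.me/{channel}/{pid}?embed=1&mode=tme"

class MessageText(HTMLParser):
    """Collect the text inside the first tgme_widget_message_text div."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # > 0 while inside the target div
        self.done = False   # True once the target div has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "div":
                self.depth += 1          # nested div, track it
            elif tag == "br":
                self.parts.append("\n")  # keep paragraph breaks
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "tgme_widget_message_text" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def extract_full_text(widget_html):
    parser = MessageText()
    parser.feed(widget_html)
    parser.close()
    text = "".join(parser.parts).strip()
    return text or None  # None when the div is missing or empty
```

Fetch embed_url(channel, pid) with the same headers as the main request and pass the decoded body to extract_full_text. A None result is worth treating the same as a landing page: retry later instead of storing an empty post.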