Battle against search and AI bots on your ngrok endpoints
Here’s a simple scenario: You have an ngrok endpoint available on the public internet. No IP restrictions, no authorization blockade from basic to OAuth, no nothing.
You might think your ngrok endpoint is functionally “dark” from the public internet. If no one knows the domain name, and you’re using the ngrok endpoint only for internal use, surely no one or automated bot would be accessing (and potentially using) the data you make available there. Right?
Some of our users have found the opposite is true. ngrok endpoints that they thought were hidden are actively crawled and indexed by search engine bots and AI bots looking for more data to refine their models against. If you load up your Traffic Inspector, you might find the same hints of bot traffic. These intrepid developers have been looking for ways to prevent this unwanted access, and now you might be, too. Luckily, you have plenty of options.
For now, I can help you with three options, all leveraging the Traffic Policy module.
How and why are bots accessing ngrok endpoints?
First, let’s discuss these bots to give you some context about your targets and why you might want to block them. That will even influence which option you take.
This shouldn’t come as much of a surprise: Search engine providers, AI companies, content scrapers, and many other entities use bots to automatically crawl as much of the public internet as possible to hoover up relevant data. Bots usually spend their time crawling public websites on “standard” domains, like our very own ngrok blog, then use hyperlinks to extend what they “know” about the public internet.
But if these bots are supposedly relying only on public hyperlinks to expand their “spider” web of data, how are they discovering and eventually crawling these “private” ngrok endpoints? For over a decade, there have been long discussions about whether and how companies like Google might sniff Gmail for fresh URLs or automatically fire off GET requests to any URL you enter into your browser’s search bar. Companies like OpenAI also aggressively crawl as much of the internet as possible, but aren’t particularly forthright about how their bots operate.
Despite all the uncertainty, we can say definitively that ngrok does not:
- Automatically block any traffic on your endpoint that you haven’t explicitly configured.
- Deploy a robots.txt file on your behalf, which tells crawlers and bots which URLs they’re allowed to access.
If you’re seeing bot traffic on your ngrok endpoint and you’d rather not, you have the power to battle against them.
Option 1: Block all bots with a robots.txt file
First, if you don’t feel like adding a robots.txt file to the app or service running behind your ngrok endpoint, you can ask ngrok to serve one on your behalf. The policy below matches any traffic arriving at {YOUR-NGROK-ENDPOINT}/robots.txt and serves a custom plaintext response.
If you haven’t configured a policy using the Traffic Policy module before, check out our docs for examples based on how you deployed ngrok. These policies are compatible with the agent CLI, various SDKs, and the ngrok Operator.
---
inbound:
  - name: "Add robots.txt to block all bots"
    expressions:
      - "req.url.contains('/robots.txt')"
    actions:
      - type: "custom-response"
        config:
          content: "User-agent: *\r\nDisallow: /"
          headers:
            content-type: "text/plain"
          status_code: 200
With this policy, any respectable bot traffic looking to crawl your endpoint will first see that you’ve disallowed it on all paths with the following two lines:
User-agent: *
Disallow: /
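If you want to see how a well-behaved crawler would read that response, here is a minimal Python sketch using the standard library's robots.txt parser. It only illustrates how the two-line body is interpreted; the `is_allowed` helper, the `example.ngrok.app` domain, and the sample crawler names are placeholders for illustration, not anything ngrok provides.

```python
from urllib.robotparser import RobotFileParser

# The same body the policy above serves; "\r\n" separates the two lines.
ROBOTS_TXT = "User-agent: *\r\nDisallow: /"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.ngrok.app is a placeholder domain, not a real endpoint.
for agent in ("GPTBot", "Googlebot", "SomeOtherCrawler"):
    allowed = is_allowed(ROBOTS_TXT, agent, "https://example.ngrok.app/any/path")
    print(f"{agent}: allowed={allowed}")
```

Every agent is refused on every path, which is exactly what the wildcard rule asks for. Keep in mind that this only models crawlers that choose to honor robots.txt.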
You could also make your robots.txt more specific. For example, if you want to block only the OpenAI crawlers ChatGPT-User and GPTBot:
---
inbound:
  - name: "Add `robots.txt` to deny specific bots and crawlers"
    expressions:
      - "req.url.contains('/robots.txt')"
    actions:
      - type: "custom-response"
        config:
          content: "User-agent: ChatGPT-User\r\nDisallow: /\r\nUser-agent: GPTBot\r\nDisallow: /"
          headers:
            content-type: "text/plain"
          status_code: 200
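Hand-writing that `content` string gets fiddly as the list of bots grows. As a rough sketch, the following Python builds the same string from a list of crawler names and checks it with the standard library's parser. The `build_robots_txt` helper is a hypothetical name made up for this example, and `example.ngrok.app` is a placeholder domain.

```python
import json
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents: list[str]) -> str:
    """Build a robots.txt body that disallows every path for each listed agent."""
    lines = []
    for agent in blocked_agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
    return "\r\n".join(lines)


content = build_robots_txt(["ChatGPT-User", "GPTBot"])

# json.dumps escapes the line breaks as \r\n inside double quotes, which
# should paste cleanly into the double-quoted `content` field of the policy.
print(json.dumps(content))

# Sanity check: the named bots are refused, everyone else is still allowed.
parser = RobotFileParser()
parser.parse(content.splitlines())
url = "https://example.ngrok.app/any/path"  # placeholder domain
print("GPTBot allowed:", parser.can_fetch("GPTBot", url))
print("Googlebot allowed:", parser.can_fetch("Googlebot", url))
```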
Now, the reality is that your robots.txt is not completely bot-proof. Even Google’s crawling and indexing documentation states clearly that it’s not a mechanism for keeping a web page from being accessed or even indexed in Google searches. The same probably applies to AI crawlers like OpenAI’s. For that, you need to take your bot-blocking to another level.
Option 2: Block the user agents of specific bots
Next, you can instruct ngrok to filter traffic based on the user agent of incoming requests, disallowing certain bot traffic while keeping “human” traffic around.
---
inbound:
  - name: "Block specific bots by user agent"
    expressions:
      - "req.user_agent.matches('(?i).*(chatgpt-user|gptbot)/\\\\d+.*')"
    actions:
      - type: "deny"
        config:
          status_code: 404
This rule matches all requests with a user agent containing chatgpt-user or gptbot followed by a slash and a version number (for example, GPTBot/1.0), ignoring case, and serves a 404 response, preventing them from accessing your ngrok endpoint and crawling or using your data for any reason. The bonus here is that your legitimate human users can still access the endpoint with user agents you approve of, like curl requests or a browser.
You can extend the list of user agents to match against like so: (chatgpt-user|gptbot|anthropic|claude|any|other|user-agent|goes|here).
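Before deploying a pattern like this, it can help to try it against a few user agent strings locally. The sketch below uses Python's `re` module as a stand-in for the expression engine ngrok runs, so treat it as an approximation rather than a guarantee of identical behavior. The sample user agent strings and the `is_blocked` helper are illustrative assumptions.

```python
import re

# The regex from the policy, after the YAML and expression escaping is undone.
BOT_PATTERN = re.compile(r"(?i).*(chatgpt-user|gptbot)/\d+.*")


def is_blocked(user_agent: str) -> bool:
    """Return True if the user agent would match the deny rule's regex."""
    return BOT_PATTERN.search(user_agent) is not None


# Illustrative user agent strings; real crawlers may send different values.
samples = [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ChatGPT-User/1.0",
    "curl/8.4.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]
for ua in samples:
    print(f"blocked={is_blocked(ua)}  {ua}")

# The pattern requires "/<digits>" directly after the name, so a shortened
# name does not catch a longer token such as the hypothetical "ClaudeBot/1.0".
short_name = re.compile(r"(?i).*(chatgpt-user|gptbot|claude)/\d+.*")
full_name = re.compile(r"(?i).*(chatgpt-user|gptbot|claudebot)/\d+.*")
print("claude matches ClaudeBot/1.0:", bool(short_name.search("ClaudeBot/1.0")))
print("claudebot matches ClaudeBot/1.0:", bool(full_name.search("ClaudeBot/1.0")))
```

The last two lines are worth a second look when you extend the list: each alternative needs to be the full token that appears immediately before the slash and version number, or the rule will quietly miss it.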
Option 3: Block all traffic except specific IPs
Finally, we come to the most restrictive Traffic Policy action: restrict IPs or IP ranges with CIDRs.
---
inbound:
  - name: "Restrict IPs by CIDR"
    actions:
      - type: "restrict-ips"
        config:
          enforce: true
          allow:
            - "192.0.2.0/32"
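If CIDR notation isn't second nature, a quick local check shows which addresses an allow list actually covers. This is a small Python sketch using the standard `ipaddress` module; it only models the address arithmetic, not ngrok's own enforcement, and `is_ip_allowed` is a hypothetical helper named for this example.

```python
import ipaddress

# The allow list from the policy above: a /32 covers exactly one address.
ALLOWED_CIDRS = ["192.0.2.0/32"]


def is_ip_allowed(client_ip: str, cidrs: list[str] = ALLOWED_CIDRS) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


print(is_ip_allowed("192.0.2.0"))   # the single address in the /32
print(is_ip_allowed("192.0.2.1"))   # a neighbor, outside the /32

# Widening the range to a /24 would admit the whole 192.0.2.x block.
print(ipaddress.ip_network("192.0.2.0/24").num_addresses)
print(is_ip_allowed("192.0.2.1", ["192.0.2.0/24"]))
```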
Now, when the pesky bots (or any unwanted traffic, for that matter) hit your ngrok endpoint from an address outside the allowed range, ngrok rejects the request with an error response instead of forwarding it to your service.
Which bot-blocking method should you pick?
If you want the endpoint to be fully accessible to the public internet, but just want to turn away bots with minimal impact, the first option will prevent future bots from constantly pinging your service and showing up in your Traffic Inspector or other logs.
If you want the endpoint to be fully accessible to the public internet, but want to ensure specific bots can’t ignore your robots.txt and crawl your content anyway, the second option will do the trick.
If you want the endpoint to be accessible only to specific people or networks, and block bots alongside the rest of the public internet, you should implement IP restrictions, at the very least, with the third option.
The great thing about how ngrok handles Traffic Policy actions is that if you want the utmost guarantee that your apps, services, and APIs are available only to you and trusted peers, you can implement all three in sequence to:
- Inform bots to not crawl or index any path on the domain name associated with your ngrok endpoint.
- Actively block bots that attempt to crawl your endpoint despite your ngrok-supplied robots.txt “warning.”
- Deny any traffic from an unknown source.
What’s next?
Blocking bots is just one example of the powerful and flexible ways you can configure Traffic Policy actions, whether you’re using ngrok as a development tool for webhook testing or in production as an API gateway or for Kubernetes ingress. Check out some of our other resources to learn more:
- Traffic Policy documentation
- Our “gallery” of example policies
- ngrok blog: Traffic Policy Engine - What are CEL variables?
- Add Auth0-based JWT authentication
Want to share your thoughts on bot-blocking and beyond with ngrok? Send us your thoughts in our Community Repo—the home for all discussions, bug reports, and feedback about your mTLS and API gateway experience.
Finally, a big kudos to Justin, our Senior Technical Support Engineer, for bravely fording into the bot-battle to create the first solution based around robots.txt. 🦾