Your headers looked just like Chrome's. The first 200 test requests worked fine. But when you tried to scrape more, you got a 403 Forbidden error. The site's anti-bot system still caught you. Why? Your Python requests library has a different TLS fingerprint than Chrome.
Your headers said you were Chrome, but the underlying HTTPS connection gave you away as a Python script. Sending too many requests triggered this check, and the mismatch got you blocked.
Headers are more than just data. They're like fingerprints. Anti-bot systems use them to spot scrapers. Get one thing wrong, and you'll be blocked and waste time.
This guide shows you which headers matter, how anti-bot systems work, and why headers alone are often not enough. We'll show you header setups that work for different sites and when you might need to use other tools.
For those new to the library, install it with pip install requests to follow along.
Key Takeaways
- Headers are Fingerprints: Anti-bot systems check more than just your User-Agent. They look at header order, letter casing, and something called a TLS fingerprint.
- TLS Fingerprinting is Key: The requests library has a TLS fingerprint that's different from a real browser, which gets you blocked when you send too many requests.
- Volume Triggers Detection: Scraping a few pages might work fine. But when you scrape a lot, sites run deeper checks and block you.
- requests Has Its Limits: requests is good for simple sites, but it's not enough for tougher sites like e-commerce or job boards.
- Production Scraping is Complex: To scrape tough sites, you need to handle header rotation, TLS fingerprints, proxies, and sessions.
- Managed Solutions Simplify Scraping: Tools like ScrapFly take care of all the hard parts for you, so you can just get the data you need.
What Are Headers in Python Requests?
In HTTP, headers are key-value pairs that are sent with every request and response to tell the server what to do. They are a key part of how clients and servers talk to each other. For instance, headers can tell the server about the type of device sending the request, or whether the client wants a JSON response.
Each request starts a conversation between the client (like a browser or your script) and the server, with headers acting as instructions. The most common headers include:
- Content-Type: Shows the media type (e.g., application/json), helping the server understand the format of data you're sending.
- Authorization: Used for sending login details or API tokens to access protected pages.
- User-Agent: Identifies your client application, which helps servers tell real users apart from bots.
- Accept: Tells the server what content types (e.g., JSON, XML) your client can handle, so the server can send back a format you can understand.
- Cookie: Sends stored cookies to keep you logged in or remember your session.
- Cache-Control: Controls caching behavior, like how long to store a copy of a page.
Headers can be easily managed using Python’s requests library. This lets you get headers from a response or set custom headers to customize each request.
Example: Getting Headers with Python Requests
In Python, you can get headers from a response using response.headers.
import requests
response = requests.get('https://httpbin.dev')
print(response.headers)
{
"Access-Control-Allow-Credentials": "true",
"Access-Control-Allow-Origin": "*",
"Content-Security-Policy": "frame-ancestors 'self' *.httpbin.dev; font-src 'self' *.httpbin.dev; default-src 'self' *.httpbin.dev; img-src 'self' *.httpbin.dev https://cdn.scrapfly.io; media-src 'self' *.httpbin.dev; script-src 'self' 'unsafe-inline' 'unsafe-eval' *.httpbin.dev; style-src 'self' 'unsafe-inline' *.httpbin.dev https://unpkg.com; frame-src 'self' *.httpbin.dev; worker-src 'self' *.httpbin.dev; connect-src 'self' *.httpbin.dev",
"Content-Type": "text/html; charset=utf-8",
"Date": "Fri, 25 Oct 2024 14:14:02 GMT",
"Permissions-Policy": "fullscreen=(self), autoplay=*, geolocation=(), camera=()",
"Referrer-Policy": "strict-origin-when-cross-origin",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
"X-Content-Type-Options": "nosniff",
"X-Xss-Protection": "1; mode=block",
"Transfer-Encoding": "chunked"
}
The output shows the headers the server sends back, including the media type (Content-Type), security policies (Content-Security-Policy), and allowed origins (Access-Control-Allow-Origin).
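Since response.headers behaves like a dictionary with case-insensitive keys, you can also read individual values directly:
import requests

response = requests.get('https://httpbin.dev')

# Header lookups are case-insensitive
print(response.headers.get('Content-Type'))   # "text/html; charset=utf-8"
print(response.headers.get('content-type'))   # same value, any casing works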
Example: Setting Custom Headers
Custom headers, like adding a User-Agent for device emulation, can make requests appear more authentic:
import requests

headers = {'User-Agent': 'my-app/0.0.1'}
response = requests.get('https://httpbin.dev/headers', headers=headers)
print(response.json())
{
"headers": {
"Accept": ["*/*"],
"Accept-Encoding": ["gzip, deflate"],
"Host": ["httpbin.dev"],
"User-Agent": ["my-app/0.0.1"],
"X-Forwarded-For": ["45.242.24.152"],
"X-Forwarded-Host": ["httpbin.dev"],
"X-Forwarded-Port": ["443"],
"X-Forwarded-Proto": ["https"],
"X-Forwarded-Server": ["traefik-2kvlz"],
"X-Real-Ip": ["45.242.24.152"]
}}
This setup helps your request appear more like it came from a browser, which can prevent you from being blocked.
Why Headers Fail on Real Sites
Even if you copy a browser's headers perfectly, your requests can still get blocked. This is because modern anti-bot systems check for other patterns to tell if you're a real user or a scraper.
The Header Fingerprinting Problem
Anti-bot systems create a "fingerprint" of each request by looking at a mix of things. One of the most common is TLS fingerprinting, which identifies the tool used to send the request.
The Python requests library has a unique TLS fingerprint that is different from any major web browser. When you send a request with headers that say "I'm Chrome," but your TLS handshake says "I'm Python," the mismatch gives you away. For more on this, see our guide to TLS fingerprinting.
How Cloudflare, Datadome, and PerimeterX Detect Scrapers
Big anti-bot services like Cloudflare, Datadome, and PerimeterX use a few tricks to find and block bots:
- Header Order: They expect headers to be in the same order as a real browser.
- Missing Browser Headers: They check for headers like Sec-Fetch-*, which browsers send but scraping scripts often don't.
- TLS Fingerprint Mismatch: As mentioned, the TLS signature is checked against the User-Agent header.
- Behavior: They track how fast you send requests and how you handle cookies to decide if you're a human.
The Volume Threshold Problem
When you're just testing with a few requests, your scraper might work just fine. Most sites don't run deep checks on every single request. However, once you send a lot of requests, you cross a line that triggers more checks.
At that point, your "perfect" headers are checked more closely, the TLS fingerprint is reviewed, and the mismatch gets you blocked. To scrape a lot of pages, you need to match both the headers and the TLS fingerprint, which means using better tools.
For example, a requests call and a real Chrome request might both work in testing. But only the Chrome request will work when you scrape at a large scale because its TLS fingerprint matches its headers.
This is where a service like ScrapFly becomes very helpful. It uses real browsers with matching TLS signatures, allowing you to scrape as much as you want without worrying about being detected.
Are Headers Case-Sensitive?
A common question is whether header names are case-sensitive.
According to the HTTP rules, header names are not case-sensitive. This means Content-Type, content-type, and CONTENT-TYPE are all the same. However, it's a good idea to stick to the standard capitalization (like Content-Type).
Why Case Sensitivity Matters for Bot Detection
When servers inspect requests, small details like unusual letter casing in header names can give you away. Real browsers use consistent casing, and while requests handles this for you, some anti-bot systems may still treat odd casing as a clue in their bot detection. It's one of many small "tells" that can get you flagged as a bot.
In practice, the requests library handles header-name casing for you. Header values, however, can be case-sensitive (an API token in an Authorization header, for example, must be sent exactly as issued), so make sure those are correct.
Example of Case-Insensitive Headers
In requests, you can set headers in any case, and it will work correctly:
import requests
# Setting 'content-type' in lowercase
headers = {'content-type': 'application/json'}
response = requests.post('https://httpbin.dev/api', headers=headers)
print(response.request.headers)
{
"Content-Type": "application/json",
"User-Agent": "python-requests/2.28.1",
"Accept-Encoding": "gzip, deflate",
"Accept": "*/*",
"Connection": "keep-alive"}
As shown above, requests automatically converted content-type to the standard Content-Type. This demonstrates that Python’s requests library will normalize header names for you, maintaining compatibility with web servers regardless of the case used in the original code.
Does Header Order Matter?
In most standard API interactions, the order of the headers you send with Python requests does not affect functionality, because the HTTP specification does not require headers to appear in any particular order. However, when dealing with advanced anti-bot and anti-scraping systems, header order can play an unexpectedly significant role in determining whether a request is accepted or blocked.
Why Header Order Matters for Bot Detection
Anti-bot systems like Cloudflare, DataDome, and PerimeterX check the exact order of your headers. Browsers send headers in a consistent order. Tools like requests often use a different order.
This difference is a big red flag. Trying to set the header order yourself is a pain and breaks easily. Anti-bot companies always change their rules, so you would need to figure out the new header order every time your scraper breaks.
This annoying work is why many developers use a service like ScrapFly, which automatically handles header order for you.
Example: Browser Headers vs. Python Requests Headers
A browser might send headers in this order:
{
"User-Agent": "Mozilla/5.0...",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://example.com",
"Connection": "keep-alive"
}
With Python’s requests library, headers might look slightly different:
import requests
headers = {
"Accept": "application/json",
"User-Agent": "my-scraper/1.0",
"Connection": "keep-alive",
"Referer": "https://httpbin.dev"
}
response = requests.get("https://httpbin.dev/headers", headers=headers)
print(response.json()) # This output may vary based on server handling
{
"headers": {
"Accept": "application/json",
"User-Agent": "my-scraper/1.0",
"Connection": "keep-alive",
"Referer": "https://httpbin.dev"
}
}
This slight difference in header ordering can hint to anti-bot systems that the request might be automated, especially if combined with other signals, such as the User-Agent format or missing headers.
By analyzing this order, advanced detection systems can identify patterns often associated with automated scripts or bots. When a request does not match the usual order, the server may assume it’s coming from a bot, potentially resulting in blocked requests or captcha challenges.
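If you want to experiment with header order yourself, one rough approach is to replace a requests Session's default headers with your own ordered set, so only those headers are sent, in the order you define them. This is only a sketch: requests does not guarantee the on-the-wire order, and anti-bot rules shift constantly, so treat it as best-effort rather than a dependable fix.
import requests
from collections import OrderedDict

session = requests.Session()
# Replace the default header set entirely so only these headers are sent,
# roughly in the order a browser would send them
session.headers = OrderedDict([
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9"),
    ("Accept-Language", "en-US,en;q=0.5"),
    ("Accept-Encoding", "gzip, deflate, br"),
    ("Referer", "https://example.com"),
    ("Connection", "keep-alive"),
])

response = session.get("https://httpbin.dev/headers")
print(response.json())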
Standard Headers in Python Requests
To make your requests look like they come from a browser, it's helpful to know which headers are standard.
Key Standard Headers
- User-Agent: Identifies your browser and OS.
- Accept: Tells the server what content types you can handle.
- Accept-Language: Your preferred language (e.g., en-US).
- Accept-Encoding: Compression methods you accept (e.g., gzip).
- Referer: The URL of the page you came from.
- Connection: Usually set to keep-alive.
Verifying Browser Headers
To make sure your headers look real:
- Browser Developer Tools: Use the Network tab in your browser's developer tools to see the headers for any request. You can copy these for your scraper.
- Proxy Tools: Tools like Fiddler let you see and edit HTTP headers.
Example: Mimicking Headers in Python
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,...',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://httpbin.dev',
'Connection': 'keep-alive'
}
response = requests.get('https://httpbin.dev', headers=headers)
print(response.status_code)
Using headers like these makes your request look more like it came from a real user.
Importance of the User-Agent String
The User-Agent is very important. It tells the server what browser, OS, and device you are using so it can send back content that works for you.
You can learn more in our guide, How to Effectively Use User Agents for Web Scraping, which covers what the User-Agent header is, how to use it in web scraping, and how to generate and rotate user agents to avoid blocking.
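As a quick illustration of the rotation idea, a minimal sketch might pick a random User-Agent from a small pool for each request (the strings below are placeholder examples, not guaranteed to be current):
import random
import requests

# A small pool of example User-Agent strings; in real use keep a larger, up-to-date list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

headers = {"User-Agent": random.choice(USER_AGENTS)}
response = requests.get("https://httpbin.dev/headers", headers=headers)
print(response.json())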
Where Headers Actually Work
Even though anti-bot systems are getting smarter, requests and a simple set of headers can still work in some cases: on sites without strong bot detection, or when you don't send too many requests.
Scenarios Where Requests + Headers Succeed
- Simple News or Blog Sites: Many simple sites don't have strong anti-bot tools. A basic User-Agent and Accept header is often enough.
- Internal APIs: Private APIs often just need a simple Authorization header with a key and don't use advanced bot detection (see the example below).
- Small Personal Projects: If you're just scraping a few pages for fun, you probably won't get noticed. It's okay if you get blocked sometimes.
These simple cases work because they don't trigger the deep checks that happen on big, protected sites. But as soon as you try to scrape e-commerce sites or job boards, headers are not enough.
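For the internal-API case, the request usually just needs the token in an Authorization header. A minimal sketch, using a hypothetical endpoint and placeholder token:
import requests

# Hypothetical endpoint and token, shown for illustration only
API_URL = "https://httpbin.dev/headers"
API_TOKEN = "YOUR_API_TOKEN"

headers = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Accept": "application/json",
}
response = requests.get(API_URL, headers=headers)
print(response.status_code)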
Headers for POST Requests
For POST requests, headers are very important because they tell the server about the data you are sending.
Key Headers for POST Requests
- Content-Type: Indicates the data format, such as application/json for JSON data, application/x-www-form-urlencoded for form submissions, or multipart/form-data for files. Setting this correctly ensures the server parses your data as expected.
- User-Agent: Identifies the client application, which helps with API access and rate limit policies.
- Authorization: Needed for secure endpoints to authenticate requests, often using tokens or credentials.
- Accept: Specifies the desired response format (e.g., application/json), aiding in consistent data handling and error processing.
Example Usage of Headers for POST Requests
To send JSON data, set the Content-Type to application/json:
import requests
headers = {
'Content-Type': 'application/json',
'User-Agent': 'my-app/0.0.1'
}
data = '{"key": "value"}'
response = requests.post('https://httpbin.dev/api', headers=headers, json=data)
print(response.status_code)
print(response.json())
This helps the server process your data correctly.
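The other Content-Type values mentioned above work the same way; in fact, requests sets them for you when you pass data= with a dict (form-encoded) or files= (multipart). A rough sketch against the same example endpoint:
import requests

# Form submission: requests sends application/x-www-form-urlencoded automatically
form_response = requests.post('https://httpbin.dev/api', data={'key': 'value'})

# File upload: requests builds the multipart/form-data body and boundary for you
files = {'report': ('report.txt', b'example file contents')}
file_response = requests.post('https://httpbin.dev/api', files=files)

print(form_response.status_code, file_response.status_code)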
Browser-Specific Headers
Some sites check for headers that only real browsers send. Adding these can make your scraper look more human.
Common Browser-Specific Headers
- DNT (Do Not Track): Tells the server you don't want to be tracked (1).
- Sec-Fetch-Site: Shows where the request is coming from (same-origin, cross-site, none).
- Sec-Fetch-Mode: Shows the type of request (navigate for loading a page).
- Sec-Fetch-Dest: Shows what kind of content is expected (document, image).
Example of Browser-Specific Headers in Python:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
'DNT': '1', # Respects user preference for tracking
'Sec-Fetch-Site': 'none', # Represents a top-level navigation action
'Sec-Fetch-Mode': 'navigate', # Simulates a full-page load
'Sec-Fetch-Dest': 'document' # Indicates the content type expected
}
response = requests.get('https://httpbin.dev', headers=headers)
print(response.status_code)
200
print(response.headers)
{
"Access-Control-Allow-Credentials": "true",
"Access-Control-Allow-Origin": "*",
"Content-Security-Policy": "frame-ancestors 'self' *.httpbin.dev; font-src 'self' *.httpbin.dev; default-src 'self' *.httpbin.dev; img-src 'self' *.httpbin.dev https://cdn.scrapfly.io; media-src 'self' *.httpbin.dev; script-src 'self' 'unsafe-inline' 'unsafe-eval' *.httpbin.dev; style-src 'self' 'unsafe-inline' *.httpbin.dev https://unpkg.com; frame-src 'self' *.httpbin.dev; worker-src 'self' *.httpbin.dev; connect-src 'self' *.httpbin.dev",
"Content-Type": "text/html; charset=utf-8",
"Date": "Sun, 27 Oct 2024 11:48:47 GMT",
"Permissions-Policy": "fullscreen=(self), autoplay=*, geolocation=(), camera=()",
"Referrer-Policy": "strict-origin-when-cross-origin",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
"X-Content-Type-Options": "nosniff",
"X-Xss-Protection": "1; mode=block",
"Transfer-Encoding": "chunked"
}
By including these headers, you can make your request appear closer to those typically sent by browsers, reducing the likelihood of being flagged as a bot or encountering access restrictions.
Why Use Browser-Specific Headers?
- Avoid Bot Detection: They make your requests look like real user traffic.
- Better Compatibility: Some sites give different content to bots.
- More Successful Requests: They can reduce your chances of being blocked.
Production Scraping Complexity
The truth is that for production web scraping, managing headers is just one part of the problem. To scrape tough sites at a large scale, you need to build and manage a lot of complex code, which is more than just rotating headers.
What You'd Need to Build Yourself
- Header Rotation: You need a large list of real browser headers. You have to rotate them in a smart way, making sure a Chrome User-Agent is sent with other Chrome headers. See our guide to User-Agent rotation.
- TLS Fingerprint Management: The requests library's TLS fingerprint is a big giveaway. You'd need to use other tools like httpx or curl_cffi to copy a browser's TLS signature (see the sketch after this list). This makes things more complex and can be slower.
- Proxy Coordination: Your headers should be matched with your proxies. For example, a German language header should be sent from a German proxy. Matching them is key to not getting blocked. Learn more in our guide to proxy rotation.
- Session Management: You need to manage cookies to keep a session for logins or rotate them when scraping multiple pages at once.
- Monitoring and Adapting: Anti-bot rules change all the time. You would need to build a system to track how often you get blocked, then figure out the new rules and change your code.
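For the TLS part specifically, one common do-it-yourself route is curl_cffi's browser impersonation. A minimal sketch, assuming a recent curl_cffi version where the "chrome" impersonation target is available:
from curl_cffi import requests as curl_requests

# Impersonate Chrome's TLS fingerprint so the handshake matches Chrome-like headers
response = curl_requests.get(
    "https://httpbin.dev/headers",
    impersonate="chrome",  # exact target names depend on your curl_cffi version
)
print(response.status_code)
print(response.json())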
Building and managing all this is a lot of work. This is why many teams use a service to handle the hard parts. With ScrapFly, our team handles all of this so you can focus on getting data, not on getting blocked.
Power-up with ScrapFly
The last section showed how hard it is to scrape at a large scale. The endless cycle of building, fixing, and re-building is why many teams switch to ScrapFly.
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass - scrape web pages without blocking!
- Rotating residential proxies - prevent IP address and geographic blocks.
- JavaScript rendering - scrape dynamic web pages through cloud browsers.
- Full browser automation - control browsers to scroll, input and click on objects.
- Format conversion - scrape as HTML, JSON, Text, or Markdown.
- Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.
DIY Approach:
# DIY: header rotation, proxy management, TLS fingerprinting, failure handling...
def get_headers():
# code to get a random, real header
return headers
def get_proxy():
# code to get a rotating proxy
return proxy
def make_request_with_curl_cffi(url):
# complex setup with curl_cffi to copy a browser
...
try:
# manage headers, proxies, and TLS
make_request_with_curl_cffi("https://example.com")
except:
# handle blocks, retries, etc.
...
ScrapFly API:
from scrapfly import ScrapflyClient, ScrapeConfig
scrapfly = ScrapflyClient(key="YOUR_API_KEY")
result = scrapfly.scrape(ScrapeConfig(
url="https://example.com",
asp=True, # turn on anti-scraping protection
))
It's much simpler and more powerful.
FAQs
Why do my headers work in testing but fail in production?
This is usually because you're sending too many requests. In testing, your low number of requests doesn't trigger extra anti-bot checks. In production, more requests trigger TLS fingerprinting. Even with perfect headers, a different TLS fingerprint will get you blocked. ScrapFly solves this by using real browser TLS profiles.
How often do I need to update my header patterns?
For sites with strong protection like Cloudflare or Datadome, you may need to update your headers every few weeks or even days. These systems change their rules all the time. This is a lot of work to maintain, and ScrapFly handles it for you.
When should I use browser automation instead of requests?
Use browser automation tools like Playwright or Selenium when you need to load JavaScript, handle clicks, or when you're blocked by TLS fingerprinting. While these tools solve the TLS problem, they are slower and use more computer resources. ScrapFly offers both simple requests and full browser automation.
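If you do switch to browser automation, a minimal Playwright sketch looks roughly like this (assuming playwright is installed along with its browser binaries via playwright install):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # A real browser brings a matching TLS fingerprint and executes JavaScript
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://httpbin.dev")
    html = page.content()
    browser.close()

print(html[:200])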
Summary
Headers are a basic part of web scraping, but they are also the main way anti-bot systems find and block scrapers. As we've seen, just copying a browser's headers is not enough. Things like TLS mismatches and wrong header order make scraping with requests a pain.
For simple sites, requests and a few headers can work. But for tough sites, you need a lot more. This means rotating headers and proxies, managing TLS fingerprints, and always updating your code.
You have two choices:
- Option A: Build and manage all this complex stuff yourself.
- Option B: Use ScrapFly and just focus on getting the data you need.
The choice depends on how you want to spend your time.