Web Scraping in Python: A Practical Guide for Developers
Use Python web scraping to collect publicly available web data and convert it into clean JSON, CSV, or database records. Start with requests and BeautifulSoup for normal HTML pages. Use Playwright only when the page needs JavaScript to load the data.
Here is the smallest useful version:
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
title = soup.select_one("h1")
print(response.status_code)
print(title.get_text(strip=True) if title else "No title found")
Output:
200
Example Domain
That is the basic idea. You fetch the page, parse the HTML, select what you need, and save the result.
What Is Web Scraping in Python?
Web scraping means collecting data from web pages with code. You send a request to a URL. Then you extract useful parts from the response.
Simple enough.
You may scrape product prices, article titles, job listings, public tables, or page metadata. Python is handy here because its scraping tools are easy to read and quick to test.
A basic scraping workflow looks like this:
| Step | What you do | Common tool |
|---|---|---|
| Fetch | Download the page | requests |
| Parse | Read the HTML | BeautifulSoup or lxml |
| Select | Find specific fields | CSS selectors or XPath |
| Clean | Remove noise and bad values | Python functions |
| Export | Save the data | JSON, CSV, or a database |
Let’s go through each part.
Managing HTTP Requests
HTTP requests are where scraping starts. An HTTP request is the message your code sends to a server when it asks for a page.
In Python, requests is the clean option most of the time. It can send GET requests, pass headers, set timeouts, and return the page response.
Do not skip the timeout.
Without a timeout, your script can hang for a long time if the server stops responding. The official Requests docs also note that requests do not time out unless you set a timeout explicitly.
Bad way first:
import requests
response = requests.get("https://example.com")
print(response.status_code)
Output:
200
This works. But it can hang.
Cleaner version:
import requests
url = "https://example.com"
headers = {
"User-Agent": "CodeItBroResearchBot/1.0 [email protected]"
}
response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)
print(response.text[:80])
Output:
200
<!doctype html>
<html>
<head>
This version is safer. You set a timeout. You also identify the script with a clear User-Agent.
For larger scraping projects, you may also need caching, regional testing, and request routing. Some teams use proxies for that kind of setup, but proxies do not replace permission checks, delays, and clean request handling.
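Caching is the easiest of those to start with. Here is a minimal sketch of an in-memory cache; fetch_cached is a hypothetical helper for illustration, not part of requests:
import requests

session = requests.Session()  # Reuses connections across requests
cache = {}

def fetch_cached(url):
    # Return the stored body if this URL was already downloaded in this run.
    if url in cache:
        return cache[url]
    response = session.get(url, timeout=10)
    response.raise_for_status()
    cache[url] = response.text
    return cache[url]

print(len(fetch_cached("https://example.com")))
print(len(fetch_cached("https://example.com")))  # Second call reads from the cache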
Check Status Codes First
A scraper should not trust every response. A server may return an error page, a redirect, or a rate-limit message.
Check first.
| Status code | Meaning | What you should do |
|---|---|---|
| 200 | OK | Parse the page |
| 301 / 302 | Redirect | Check the final URL |
| 403 | Forbidden | Stop and review access |
| 404 | Not found | Skip or log it |
| 429 | Too many requests | Slow down |
| 500 | Server error | Retry later |
For example:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://example.com/missing-page", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.get_text(strip=True)[:40])
Output:
Example Domain
This can mislead you. You may parse an error page as real data.
Cleaner version:
import requests
url = "https://example.com"
response = requests.get(url, timeout=10)
if response.status_code == 200:
    print("Ready to parse")
else:
    print(f"Skip this URL: {response.status_code}")
Output:
Ready to parse
Be strict here. Bad input creates bad data.
Using Headers Responsibly
Headers tell the server more about your request. The User-Agent header identifies the client making the request.
Some developers use fake browser headers to slip past basic checks. That is a bad habit if the goal is to ignore site rules.
Do not treat headers as a trick.
Use headers to make your crawler clear and predictable. For larger crawlers, add contact information so site owners can reach you.
For example:
import requests
headers = {
"User-Agent": "MyResearchCrawler/1.0 [email protected]"
}
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.status_code)
Output:
200
This is plain and fair. Most of the time, that is what you want.
How to Parse Raw HTML
Once you fetch a page, you need to parse the HTML. Parsing means turning raw markup into a structure your code can search.
BeautifulSoup is the friendly option. lxml is the faster option.
| Library | Best for | Main strength |
|---|---|---|
| BeautifulSoup | Beginners and medium tasks | Easy HTML navigation |
| lxml | Larger scraping jobs | Speed and XPath support |
| Playwright | JavaScript-rendered pages | Real browser automation |
Let’s start with BeautifulSoup.
from bs4 import BeautifulSoup
html = """
<html>
<body>
<h1>Python Web Scraping</h1>
<p class="price">$19.99</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
title = soup.select_one("h1")
price = soup.select_one(".price")
print(title.get_text(strip=True))
print(price.get_text(strip=True))
Output:
Python Web Scraping
$19.99
CSS selectors are neat. They let you target elements with familiar patterns like h1, .price, or #main.
If the source HTML looks messy, paste it into an HTML Formatter. It makes the structure easier to read before you write selectors.
BeautifulSoup vs lxml
BeautifulSoup is easy. lxml is fast.
That is the main difference.
Use BeautifulSoup when you want clear code. Use lxml when you process many pages or need XPath, a query syntax for selecting nodes inside HTML or XML.
For example:
from bs4 import BeautifulSoup
html = "<html><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(html, "lxml")
print(soup.select_one("h1").get_text(strip=True))
Output:
Hello
This keeps BeautifulSoup’s simple API while using lxml as the parser.
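If you want XPath itself, you can parse with lxml directly. A minimal sketch:
from lxml import html as lxml_html

page = lxml_html.fromstring("<html><body><h1>Hello</h1><p class='price'>$19.99</p></body></html>")

# XPath expressions select nodes; text() returns their text content.
print(page.xpath("//h1/text()")[0])
print(page.xpath("//p[@class='price']/text()")[0])
Output:
Hello
$19.99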
Adapting to Modern Web Pages
Many pages are not plain HTML anymore. React, Vue, Angular, and similar frameworks can load the real data after the page opens.
This causes confusion.
You may see data in your browser, but not in response.text. That means JavaScript probably loaded it later.
Before using a browser automation tool, check the Network tab. Look for Fetch or XHR requests. Many pages load public JSON from an endpoint.
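If you spot one, requests can usually call it directly, which is much lighter than automating a browser. A minimal sketch, assuming a hypothetical /api/products endpoint; adjust the URL to whatever the site actually exposes:
import requests

url = "https://example.com/api/products"  # Hypothetical endpoint seen in the Network tab
response = requests.get(url, timeout=10)

if response.status_code == 200:
    data = response.json()  # Parses the JSON body into Python lists and dicts
    print(type(data))
else:
    print(f"Endpoint not available: {response.status_code}")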
If you find JSON, format it first. CodeItBro’s JSON Formatter helps you inspect API responses before you write extraction logic.
If you are reading minified JavaScript while debugging a page, CodeItBro’s JavaScript Formatter can make the script easier to scan.
Headless Browser Automation
Sometimes, plain requests are not enough. The page may need scrolling, clicking, waiting, or JavaScript execution.
Use Playwright then.
Playwright can run Chromium, WebKit, and Firefox. It can automate browser actions through Python.
For example:
from playwright.sync_api import sync_playwright
url = "https://example.com"
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    title = page.locator("h1").inner_text()
    print(title)
    browser.close()
Output:
Example Domain
This is powerful. It is also heavier.
Use Playwright only when static HTML or a public API endpoint does not work.
Rate Limiting and Retries
Rate limiting means slowing your scraper down. Retries mean trying again after a temporary failure.
You need both.
Without delays, your scraper can trigger 429 Too Many Requests. It can also put unfair load on the target site.
Bad way first:
import requests
urls = [
"https://example.com",
"https://example.com",
"https://example.com"
]
for url in urls:
    response = requests.get(url, timeout=10)
    print(response.status_code)
Output:
200
200
200
This looks fine on three URLs. At scale, it becomes noisy.
Cleaner version:
import time
import requests
urls = [
"https://example.com",
"https://example.com",
"https://example.com"
]
for url in urls:
    response = requests.get(url, timeout=10)
    print(response.status_code)
    time.sleep(2)
Output:
200
200
200
Small delay. Better manners.
For temporary failures, use exponential backoff, a retry pattern where you wait longer after each failed attempt.
import time
import requests
def fetch_with_retries(url, max_retries=3):
    delay = 2
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        if response.status_code in [429, 500, 502, 503, 504]:
            time.sleep(delay)
            delay *= 2
            continue
        response.raise_for_status()
    raise Exception(f"Failed to fetch {url}")
html = fetch_with_retries("https://example.com")
print(html[:15])
Output:
<!doctype html>
This is safer. Your script does not keep hammering the same URL.
Proxies and Request Distribution
Some scraping projects use proxies for regional testing, network routing, or large-scale collection. A proxy is a server that sends requests on your behalf.
Use them carefully.
| Question | Why it matters |
|---|---|
| Does the site offer an API? | An API is usually cleaner. |
| Can you reduce request volume? | Less traffic means fewer problems. |
| Can you cache responses? | You avoid repeated downloads. |
| Does the site allow this? | Rules matter. |
| Are you collecting public data? | Private data is risky. |
In advanced cases, teams may compare datacenter, ISP, mobile, and residential proxies for regional testing or distributed collection.
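If you do route traffic through a proxy, this is the general shape in requests. A minimal sketch with a placeholder proxy address, not a real endpoint:
import requests

# Placeholder address; substitute a proxy you actually control or rent.
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)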
But do not use proxies to ignore access controls.
That creates technical, legal, and reputation risk.
Use Case: E-commerce Data Scraping
E-commerce scraping is common. You may collect public prices, availability, product titles, or category data.
Keep it clean.
| Field | Example | Why you collect it |
|---|---|---|
| Product name | Wireless Mouse | Identify the item |
| Price | $24.99 | Track changes |
| Availability | In stock | Monitor supply |
| URL | /product/mouse | Trace the source |
Here is a local example. It does not hit a real shop, so you can run it safely:
from bs4 import BeautifulSoup
html = """
<div class="product-card">
<a href="/mouse">Wireless Mouse</a>
<span class="price">$24.99</span>
</div>
<div class="product-card">
<a href="/keyboard">Mechanical Keyboard</a>
<span class="price">$79.99</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
products = []
for card in soup.select(".product-card"):
    name = card.select_one("a")
    price = card.select_one(".price")
    products.append({
        "name": name.get_text(strip=True),
        "url": name.get("href"),
        "price": price.get_text(strip=True)
    })
print(products)
Output:
[{'name': 'Wireless Mouse', 'url': '/mouse', 'price': '$24.99'}, {'name': 'Mechanical Keyboard', 'url': '/keyboard', 'price': '$79.99'}]
This is the pattern. Select cards. Extract fields. Store dictionaries.
Product Catalog Extraction
Product catalog scraping usually means more than one page. You may need categories, pagination, filters, and detail pages.
Build checks early.
Common catalog tasks include:
- Finding category URLs
- Collecting product links
- Handling pagination (see the sketch after this list)
- Scraping names, SKUs, prices, images, and stock status
- Removing duplicates
- Logging failed URLs
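Pagination is usually the first of these you hit. Here is a minimal sketch, assuming a hypothetical catalog that exposes numbered pages through a page query parameter; real sites vary (next links, infinite scroll, cursors):
import time
import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/catalog?page={}"  # Hypothetical URL pattern
product_links = set()  # A set drops duplicate links automatically

for page in range(1, 4):
    response = requests.get(base_url.format(page), timeout=10)
    if response.status_code != 200:
        print(f"Failed page {page}: {response.status_code}")  # Log failed URLs
        continue
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select(".product-card a"):
        product_links.add(link.get("href"))
    time.sleep(2)  # Small delay between pages

print(len(product_links))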
Website layouts change often. A class name that works today may fail next month.
For example:
def validate_product(product):
    if not product.get("name"):
        return False
    if not product.get("price"):
        return False
    return True
product = {
"name": "Wireless Mouse",
"price": "$24.99"
}
print(validate_product(product))
Output:
True
Simple checks save time. They catch broken selectors before bad data reaches your files.
Export Scraped Data as JSON
JSON is great for nested data. Product records, API responses, and metadata fit it well.
Use JSON first when the data has a structure.
For example:
import json
products = [
{
"name": "Wireless Mouse",
"price": "$24.99",
"available": True
},
{
"name": "Mechanical Keyboard",
"price": "$79.99",
"available": False
}
]
json_output = json.dumps(products, indent=2)
print(json_output)
Output:
[
{
"name": "Wireless Mouse",
"price": "$24.99",
"available": true
},
{
"name": "Mechanical Keyboard",
"price": "$79.99",
"available": false
}
]
If your scraper returns Python-style data while debugging, CodeItBro’s Python Dict to JSON Converter can help you convert values like True, False, and None into valid JSON.
You can also paste the final output into CodeItBro’s Online JSON Editor to inspect and validate it.
Export Scraped Data as CSV
CSV is better for flat tables. It works well with Excel, Google Sheets, and many database import tools.
But it does not love nested data.
For example:
import csv
import io
products = [
{
"name": "Wireless Mouse",
"price": "$24.99",
"available": True
},
{
"name": "Mechanical Keyboard",
"price": "$79.99",
"available": False
}
]
file = io.StringIO()
writer = csv.DictWriter(file, fieldnames=["name", "price", "available"])
writer.writeheader()
writer.writerows(products)
print(file.getvalue())
Output:
name,price,available
Wireless Mouse,$24.99,True
Mechanical Keyboard,$79.99,False
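When you write to a real file instead of io.StringIO, pass newline="" so the csv module controls line endings and Windows does not insert blank rows. A minimal sketch reusing the same products list:
import csv

products = [
    {"name": "Wireless Mouse", "price": "$24.99", "available": True},
    {"name": "Mechanical Keyboard", "price": "$79.99", "available": False}
]

# newline="" lets csv handle line endings; encoding="utf-8" keeps exports consistent.
with open("products.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "price", "available"])
    writer.writeheader()
    writer.writerows(products)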
If you already have scraped CSV data, CodeItBro’s CSV to JSON Converter can help you turn it into structured JSON for APIs or testing.
Build a Small Scraping Pipeline
A scraper gets messy fast. Split it into small functions.
This keeps it readable.
| Function | Job |
|---|---|
| fetch_html() | Downloads the page |
| parse_products() | Extracts fields |
| validate_product() | Rejects bad records |
| main() | Runs the workflow |
For example:
from bs4 import BeautifulSoup
def parse_products(html):
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(".product-card"):
        name = item.select_one(".name")
        price = item.select_one(".price")
        products.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None
        })
    return products

def validate_product(product):
    return bool(product["name"] and product["price"])

def main():
    html = """
    <div class="product-card">
        <span class="name">Wireless Mouse</span>
        <span class="price">$24.99</span>
    </div>
    """
    products = parse_products(html)
    clean_products = [p for p in products if validate_product(p)]
    print(clean_products)

main()
Output:
[{'name': 'Wireless Mouse', 'price': '$24.99'}]
Once you have stable sample output, CodeItBro’s JSON to JSON Schema Converter can help you create a schema for validation before you store records.
Ethical and Legal Best Practices
Scraping has rules. Some are technical. Some are legal.
Take them seriously.
| Practice | Why it matters |
|---|---|
| Check the site terms | You need to know what is allowed. |
| Read robots.txt | It tells crawlers which paths are allowed. |
| Prefer official APIs | They are usually cleaner and safer. |
| Avoid private data | Privacy risk is real. |
| Use delays | You reduce server load. |
| Cache responses | You avoid repeat requests. |
Google says robots.txt is mainly used to manage crawler traffic. It also notes that robots.txt does not enforce crawler behavior, so responsible crawlers choose to obey it.
If you manage your own site, you can test your rules with CodeItBro’s Robots.txt Tester.
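Python's standard library can also read a robots.txt file for you. A minimal sketch with urllib.robotparser:
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# can_fetch() reports whether this user agent may crawl the given URL.
print(parser.can_fetch("MyResearchCrawler", "https://example.com/"))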
Do not scrape private or login-protected data without permission.
If your project touches personal, financial, medical, or copyrighted data, get legal advice first.
Common Mistakes to Avoid
| Mistake | What happens | Better choice |
|---|---|---|
| No timeout | Your script hangs | Set timeout=10 |
| No status check | You parse error pages | Check status_code |
| Fragile selectors | Data breaks silently | Validate records |
| Too many requests | You hit rate limits | Add delays |
| Dirty exports | Reports become wrong | Clean JSON or CSV |
Final Checks Before You Use Scraped Data
Run a final check before you use scraped data in an app, report, or dashboard. Small errors spread quickly.
Check these:
- Required fields are present.
- Prices use one format.
- URLs are absolute or handled correctly (see the sketch after this list).
- JSON is valid.
- Duplicates are removed.
- Failed pages are logged.
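For the URL and duplicate checks, urllib.parse.urljoin and a set cover most cases. A minimal sketch:
from urllib.parse import urljoin

base_url = "https://example.com/catalog"
relative_urls = ["/mouse", "/keyboard", "/mouse"]

# urljoin resolves each path against the page it came from; the set removes the duplicate.
absolute_urls = sorted({urljoin(base_url, path) for path in relative_urls})
print(absolute_urls)
Output:
['https://example.com/keyboard', 'https://example.com/mouse']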
If your scraper collects title tags or descriptions from many pages, CodeItBro’s Bulk Meta Tag Extractor can help you compare page metadata during SEO or competitor research.
Conclusion
Use Python web scraping to collect public web data and turn it into clean JSON, CSV, or database records. Start with requests and BeautifulSoup for normal HTML pages. Use Playwright only when the page needs JavaScript to load the data.
The smallest useful version still looks like this:
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
title = soup.select_one("h1")
print(response.status_code)
print(title.get_text(strip=True) if title else "No title found")
Output:
200
Example Domain
That is the whole workflow in compressed form. Fetch the page, parse the HTML, select what you need, validate the data, and export it cleanly.
Thanks for reading. Happy coding!


