Web Scraping in Python: A Practical Guide for Developers

Himanshu Tyagi
Last updated on May 2, 2026


Use Python web scraping to collect publicly available web data and convert it into clean JSON, CSV, or database records. Start with requests and BeautifulSoup for normal HTML pages. Use Playwright only when the page needs JavaScript to load the data.

Here is the smallest useful version.

For example:

code
import requests
from bs4 import BeautifulSoup

url = "https://example.com"

response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

title = soup.select_one("h1")

print(response.status_code)
print(title.get_text(strip=True) if title else "No title found")

Output:

code
200
Example Domain

That is the basic idea. You fetch the page, parse the HTML, select what you need, and save the result.

What Is Web Scraping in Python?

Web scraping means collecting data from web pages with code. You send a request to a URL. Then you extract useful parts from the response.

Simple enough.

You may scrape product prices, article titles, job listings, public tables, or page metadata. Python is handy here because its scraping tools are easy to read and quick to test.

A basic scraping workflow looks like this:

| Step | What you do | Common tool |
| --- | --- | --- |
| Fetch | Download the page | requests |
| Parse | Read the HTML | BeautifulSoup or lxml |
| Select | Find specific fields | CSS selectors or XPath |
| Clean | Remove noise and bad values | Python functions |
| Export | Save the data | JSON, CSV, or a database |
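The later steps, Clean and Export, look like this in miniature. The HTML here is a local sample so the sketch runs offline, and the selector and file name are placeholders:

```python
import csv
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item">  Wireless Mouse </li>
  <li class="item"></li>
  <li class="item">Mechanical Keyboard</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Select: grab the text of every matching element
raw = [li.get_text(strip=True) for li in soup.select(".item")]

# Clean: drop empty values left by blank elements
items = [value for value in raw if value]

# Export: write a one-column CSV
with open("items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name"])
    writer.writerows([item] for item in items)

print(items)  # ['Wireless Mouse', 'Mechanical Keyboard']
```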

Let’s go through each part.

Managing HTTP Requests

HTTP requests are where scraping starts. An HTTP request is the message your code sends to a server when it asks for a page.

In Python, requests is the clean option most of the time. It can send GET requests, pass headers, set timeouts, and return the page response.

Do not skip the timeout.

Without a timeout, your script can hang for a long time if the server stops responding. The official Requests docs also note that requests do not time out unless you set a timeout explicitly.

Bad way first:

code
import requests

response = requests.get("https://example.com")

print(response.status_code)

Output:

code
200

This works. But it can hang.

Cleaner version:

For example:

code
import requests

url = "https://example.com"

headers = {
    "User-Agent": "CodeItBroResearchBot/1.0 [email protected]"
}

response = requests.get(url, headers=headers, timeout=10)

print(response.status_code)
print(response.text[:80])

Output:

code
200
<!doctype html>
<html>
<head>

This version is safer. You set a timeout. You also identify the script with a clear User-Agent.

For larger scraping projects, you may also need caching, regional testing, and request routing. Some teams use proxies for that kind of setup, but proxies do not replace permission checks, delays, and clean request handling.

Check Status Codes First

A scraper should not trust every response. A server may return an error page, a redirect, or a rate-limit message.

Check first.

| Status code | Meaning | What you should do |
| --- | --- | --- |
| 200 | OK | Parse the page |
| 301 / 302 | Redirect | Check the final URL |
| 403 | Forbidden | Stop and review access |
| 404 | Not found | Skip or log it |
| 429 | Too many requests | Slow down |
| 500 | Server error | Retry later |

For example:

code
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/missing-page", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.get_text(strip=True)[:40])

Output:

code
Example Domain

This can mislead you. You may parse an error page as real data.

Cleaner version:

code
import requests

url = "https://example.com"

response = requests.get(url, timeout=10)

if response.status_code == 200:
    print("Ready to parse")
else:
    print(f"Skip this URL: {response.status_code}")

Output:

code
Ready to parse

Be strict here. Bad input creates bad data.
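If you would rather fail loudly than branch on the code yourself, requests also provides raise_for_status(), which raises an HTTPError for any 4xx or 5xx response:

```python
import requests

response = requests.get("https://example.com", timeout=10)

try:
    response.raise_for_status()  # raises HTTPError on any 4xx or 5xx
except requests.exceptions.HTTPError as err:
    print(f"Skip this URL: {err}")
else:
    print("Ready to parse")
```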

Using Headers Responsibly

Headers tell the server more about your request. The User-Agent header identifies the client making the request.

Some developers use fake browser headers to slip past basic checks. That is a bad habit if the goal is to ignore site rules.

Do not treat headers as a trick.

Use headers to make your crawler clear and predictable. For larger crawlers, add contact information so site owners can reach you.

For example:

code
import requests

headers = {
    "User-Agent": "MyResearchCrawler/1.0 [email protected]"
}

response = requests.get("https://example.com", headers=headers, timeout=10)

print(response.status_code)

Output:

code
200

This is plain and fair. Most of the time, that is what you want.
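If you send many requests, a requests.Session lets you set the User-Agent once and reuses the underlying connection. The bot name and email are placeholders:

```python
import requests

session = requests.Session()
session.headers.update({
    # Placeholder identity; use your own project name and contact
    "User-Agent": "MyResearchCrawler/1.0 [email protected]"
})

# Every request made through the session now carries that header
response = session.get("https://example.com", timeout=10)

print(response.request.headers["User-Agent"])
```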

How to Parse Raw HTML

Once you fetch a page, you need to parse the HTML. Parsing means turning raw markup into a structure your code can search.

BeautifulSoup is the friendly option. lxml is the faster option.

| Library | Best for | Main strength |
| --- | --- | --- |
| BeautifulSoup | Beginners and medium tasks | Easy HTML navigation |
| lxml | Larger scraping jobs | Speed and XPath support |
| Playwright | JavaScript-rendered pages | Real browser automation |

Let’s start with BeautifulSoup.

code
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Python Web Scraping</h1>
    <p class="price">$19.99</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.select_one("h1")
price = soup.select_one(".price")

print(title.get_text(strip=True))
print(price.get_text(strip=True))

Output:

code
Python Web Scraping
$19.99

CSS selectors are neat. They let you target elements with familiar patterns like h1, .price, or #main.

If the source HTML looks messy, paste it into an HTML Formatter. It makes the structure easier to read before you write selectors.

BeautifulSoup vs lxml

BeautifulSoup is easy. lxml is fast.

That is the main difference.

Use BeautifulSoup when you want clear code. Use lxml when you process many pages or need XPath, a query syntax for selecting nodes inside HTML or XML.

For example:

code
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1></body></html>"

soup = BeautifulSoup(html, "lxml")

print(soup.select_one("h1").get_text(strip=True))

Output:

code
Hello

This keeps BeautifulSoup’s simple API while using lxml as the parser.
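For comparison, here is the same kind of lookup using lxml directly with XPath instead of CSS selectors:

```python
from lxml import html as lxml_html

doc = lxml_html.fromstring(
    "<html><body><h1>Hello</h1><p class='price'>$19.99</p></body></html>"
)

# XPath addresses nodes by path and attribute instead of CSS patterns
title = doc.xpath("//h1/text()")[0]
price = doc.xpath("//p[@class='price']/text()")[0]

print(title)  # Hello
print(price)  # $19.99
```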

Adapting to Modern Web Pages

Many pages are not plain HTML anymore. React, Vue, Angular, and similar frameworks can load the real data after the page opens.

This causes confusion.

You may see data in your browser, but not in response.text. That means JavaScript probably loaded it later.

Before using a browser automation tool, check the Network tab. Look for Fetch or XHR requests. Many pages load public JSON from an endpoint.
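If you do find such an endpoint, you can often skip HTML parsing entirely and call it with requests. The endpoint URL below is a placeholder for whatever you discover in the Network tab:

```python
import requests

# Placeholder: replace with the Fetch/XHR URL you found in the Network tab
api_url = "https://example.com/api/products"

response = requests.get(api_url, timeout=10)

try:
    data = response.json()  # parses the JSON body into Python objects
except ValueError:
    print("Response body was not JSON")
else:
    print(type(data))
```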

If you find JSON, format it first. CodeItBro’s JSON Formatter helps you inspect API responses before you write extraction logic.

If you are reading minified JavaScript while debugging a page, CodeItBro’s JavaScript Formatter can make the script easier to scan.

Headless Browser Automation

Sometimes, plain requests are not enough. The page may need scrolling, clicking, waiting, or JavaScript execution.

Use Playwright then.

Playwright can run Chromium, WebKit, and Firefox. It can automate browser actions through Python.

For example:

code
from playwright.sync_api import sync_playwright

url = "https://example.com"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")

    title = page.locator("h1").inner_text()

    print(title)

    browser.close()

Output:

code
Example Domain

This is powerful. It is also heavier.

Use Playwright only when static HTML or a public API endpoint does not work.

Rate Limiting and Retries

Rate limiting means slowing your scraper down. Retries mean trying again after a temporary failure.

You need both.

Without delays, your scraper can trigger 429 Too Many Requests. It can also put unfair load on the target site.

Bad way first:

code
import requests

urls = [
    "https://example.com",
    "https://example.com",
    "https://example.com"
]

for url in urls:
    response = requests.get(url, timeout=10)
    print(response.status_code)

Output:

code
200
200
200

This looks fine on three URLs. At scale, it becomes noisy.

Cleaner version:

code
import time
import requests

urls = [
    "https://example.com",
    "https://example.com",
    "https://example.com"
]

for url in urls:
    response = requests.get(url, timeout=10)
    print(response.status_code)
    time.sleep(2)

Output:

code
200
200
200

Small delay. Better manners.

For temporary failures, use exponential backoff, a retry pattern where you wait longer after each failed attempt.

code
import time
import requests

def fetch_with_retries(url, max_retries=3):
    delay = 2

    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)

        if response.status_code == 200:
            return response.text

        if response.status_code in [429, 500, 502, 503, 504]:
            time.sleep(delay)
            delay *= 2
            continue

        response.raise_for_status()

    raise RuntimeError(f"Failed to fetch {url} after {max_retries} attempts")

html = fetch_with_retries("https://example.com")

print(html[:15])

Output:

code
<!doctype html>

This is safer. Your script does not keep hammering the same URL.
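requests can also do the retrying for you via urllib3's Retry class mounted on a Session; backoff_factor produces exponentially growing waits between attempts:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                    # give up after 3 retries
    backoff_factor=2,                           # exponentially growing waits
    status_forcelist=[429, 500, 502, 503, 504], # retry only these codes
)

adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount("https://", adapter)
session.mount("http://", adapter)

response = session.get("https://example.com", timeout=10)
print(response.status_code)
```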

Proxies and Request Distribution

Some scraping projects use proxies for regional testing, network routing, or large-scale collection. A proxy is a server that sends requests on your behalf.

Use them carefully.

| Question | Why it matters |
| --- | --- |
| Does the site offer an API? | An API is usually cleaner. |
| Can you reduce request volume? | Less traffic means fewer problems. |
| Can you cache responses? | You avoid repeated downloads. |
| Does the site allow this? | Rules matter. |
| Are you collecting public data? | Private data is risky. |

In advanced cases, teams may compare datacenter, ISP, mobile, and residential proxies for regional testing or distributed collection.
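When a proxy is genuinely needed, requests accepts a proxies mapping. The proxy address here is a placeholder for your own provider's endpoint:

```python
import requests

# Placeholder proxy endpoint; substitute your provider's address
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

try:
    # timeout still matters; a dead proxy should fail fast, not hang
    response = requests.get("https://example.com", proxies=proxies, timeout=10)
    print(response.status_code)
except requests.exceptions.RequestException as err:
    print(f"Proxy request failed: {err}")
```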

But do not use proxies to ignore access controls.

That creates technical, legal, and reputation risk.

Use Case: E-commerce Data Scraping

E-commerce scraping is common. You may collect public prices, availability, product titles, or category data.

Keep it clean.

| Field | Example | Why you collect it |
| --- | --- | --- |
| Product name | Wireless Mouse | Identify the item |
| Price | $24.99 | Track changes |
| Availability | In stock | Monitor supply |
| URL | /product/mouse | Trace the source |

Here is a local example. It does not hit a real shop, so you can run it safely.

For example:

code
from bs4 import BeautifulSoup

html = """
<div class="product-card">
  <a href="/mouse">Wireless Mouse</a>
  <span class="price">$24.99</span>
</div>
<div class="product-card">
  <a href="/keyboard">Mechanical Keyboard</a>
  <span class="price">$79.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

products = []

for card in soup.select(".product-card"):
    name = card.select_one("a")
    price = card.select_one(".price")

    products.append({
        "name": name.get_text(strip=True),
        "url": name.get("href"),
        "price": price.get_text(strip=True)
    })

print(products)

Output:

code
[{'name': 'Wireless Mouse', 'url': '/mouse', 'price': '$24.99'}, {'name': 'Mechanical Keyboard', 'url': '/keyboard', 'price': '$79.99'}]

This is the pattern. Select cards. Extract fields. Store dictionaries.
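Prices scraped as strings like "$24.99" usually need converting before you can compare or chart them. A small helper with Decimal avoids float rounding errors; the stripped symbols are an assumption about the source format:

```python
from decimal import Decimal

def parse_price(text):
    """Turn a display price like '$24.99' into a Decimal for math."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)

print(parse_price("$24.99"))     # 24.99
print(parse_price("$1,299.00"))  # 1299.00
```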

Product Catalog Extraction

Product catalog scraping usually means more than one page. You may need categories, pagination, filters, and detail pages.

Build checks early.

Common catalog tasks include:

  • Finding category URLs
  • Collecting product links
  • Handling pagination
  • Scraping names, SKUs, prices, images, and stock status
  • Removing duplicates
  • Logging failed URLs
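The deduplication step, for example, can be a small order-preserving filter keyed on the product URL; the products list here is sample data:

```python
products = [
    {"name": "Wireless Mouse", "url": "/mouse"},
    {"name": "Wireless Mouse", "url": "/mouse"},   # duplicate listing
    {"name": "Mechanical Keyboard", "url": "/keyboard"},
]

# Keep the first record for each URL, preserving order
seen = set()
unique = []
for product in products:
    if product["url"] not in seen:
        seen.add(product["url"])
        unique.append(product)

print(len(unique))  # 2
```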

Website layouts change often. A class name that works today may fail next month.

For example:

code
def validate_product(product):
    if not product.get("name"):
        return False

    if not product.get("price"):
        return False

    return True

product = {
    "name": "Wireless Mouse",
    "price": "$24.99"
}

print(validate_product(product))

Output:

code
True

Simple checks save time. They catch broken selectors before bad data reaches your files.

Export Scraped Data as JSON

JSON is great for nested data. Product records, API responses, and metadata fit it well.

Reach for JSON first when the data is nested or structured.

For example:

code
import json

products = [
    {
        "name": "Wireless Mouse",
        "price": "$24.99",
        "available": True
    },
    {
        "name": "Mechanical Keyboard",
        "price": "$79.99",
        "available": False
    }
]

json_output = json.dumps(products, indent=2)

print(json_output)

Output:

code
[
  {
    "name": "Wireless Mouse",
    "price": "$24.99",
    "available": true
  },
  {
    "name": "Mechanical Keyboard",
    "price": "$79.99",
    "available": false
  }
]

If your scraper returns Python-style data while debugging, CodeItBro’s Python Dict to JSON Converter can help you convert values like True, False, and None into valid JSON.

You can also paste the final output into CodeItBro’s Online JSON Editor to inspect and validate it.

Export Scraped Data as CSV

CSV is better for flat tables. It works well with Excel, Google Sheets, and many database import tools.

But it does not love nested data.

For example:

code
import csv
import io

products = [
    {
        "name": "Wireless Mouse",
        "price": "$24.99",
        "available": True
    },
    {
        "name": "Mechanical Keyboard",
        "price": "$79.99",
        "available": False
    }
]

file = io.StringIO()
writer = csv.DictWriter(file, fieldnames=["name", "price", "available"])

writer.writeheader()
writer.writerows(products)

print(file.getvalue())

Output:

code
name,price,available
Wireless Mouse,$24.99,True
Mechanical Keyboard,$79.99,False

If you already have scraped CSV data, CodeItBro’s CSV to JSON Converter can help you turn it into structured JSON for APIs or testing.
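Writing the same rows to a real file works the same way. The newline="" argument matters on Windows, where the csv module otherwise doubles line endings:

```python
import csv

products = [
    {"name": "Wireless Mouse", "price": "$24.99", "available": True},
    {"name": "Mechanical Keyboard", "price": "$79.99", "available": False},
]

# newline="" stops the csv module from doubling line endings on Windows
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "available"])
    writer.writeheader()
    writer.writerows(products)
```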

Build a Small Scraping Pipeline

A scraper gets messy fast. Split it into small functions.

This keeps it readable.

| Function | Job |
| --- | --- |
| fetch_html() | Downloads the page |
| parse_products() | Extracts fields |
| validate_product() | Rejects bad records |
| main() | Runs the workflow |

For example:

code
from bs4 import BeautifulSoup

def parse_products(html):
    soup = BeautifulSoup(html, "html.parser")
    products = []

    for item in soup.select(".product-card"):
        name = item.select_one(".name")
        price = item.select_one(".price")

        products.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None
        })

    return products

def validate_product(product):
    return bool(product["name"] and product["price"])

def main():
    html = """
    <div class="product-card">
      <span class="name">Wireless Mouse</span>
      <span class="price">$24.99</span>
    </div>
    """

    products = parse_products(html)
    clean_products = [p for p in products if validate_product(p)]

    print(clean_products)

main()

Output:

code
[{'name': 'Wireless Mouse', 'price': '$24.99'}]

Once you have stable sample output, CodeItBro’s JSON to JSON Schema Converter can help you create a schema for validation before you store records.

Legal and Ethical Best Practices

Scraping has rules. Some are technical. Some are legal.

Take them seriously.

| Practice | Why it matters |
| --- | --- |
| Check the site terms | You need to know what is allowed. |
| Read robots.txt | It tells crawlers which paths are allowed. |
| Prefer official APIs | They are usually cleaner and safer. |
| Avoid private data | Privacy risk is real. |
| Use delays | You reduce server load. |
| Cache responses | You avoid repeat requests. |

Google says robots.txt is mainly used to manage crawler traffic. It also notes that robots.txt does not enforce crawler behavior, so responsible crawlers choose to obey it.

If you manage your own site, you can test your rules with CodeItBro’s Robots.txt Tester.
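Python's standard library can check robots.txt rules for you before a crawl. This sketch parses rules inline so it runs offline; in a real crawler you would point it at the live file with set_url() and read():

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In a real crawler: rp.set_url("https://example.com/robots.txt"); rp.read()
# Parsing rules inline keeps the example offline
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyResearchCrawler", "https://example.com/"))           # True
print(rp.can_fetch("MyResearchCrawler", "https://example.com/private/x"))  # False
```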

Do not scrape private or login-protected data without permission.

If your project touches personal, financial, medical, or copyrighted data, get legal advice first.

Common Mistakes to Avoid

| Mistake | What happens | Better choice |
| --- | --- | --- |
| No timeout | Your script hangs | Set timeout=10 |
| No status check | You parse error pages | Check status_code |
| Fragile selectors | Data breaks silently | Validate records |
| Too many requests | You hit rate limits | Add delays |
| Dirty exports | Reports become wrong | Clean JSON or CSV |

Final Checks Before You Use Scraped Data

Run a final check before you use scraped data in an app, report, or dashboard. Small errors spread quickly.

Check these:

  • Required fields are present.
  • Prices use one format.
  • URLs are absolute or handled correctly.
  • JSON is valid.
  • Duplicates are removed.
  • Failed pages are logged.
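The URL check from that list is often just a urljoin call: relative paths scraped from href attributes become absolute against the page they came from. The base URL here is a placeholder:

```python
from urllib.parse import urljoin

base = "https://example.com/category/mice"  # placeholder page URL

print(urljoin(base, "/product/mouse"))        # https://example.com/product/mouse
print(urljoin(base, "accessories"))           # https://example.com/category/accessories
print(urljoin(base, "https://other.example/")) # already absolute, unchanged
```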

If your scraper collects title tags or descriptions from many pages, CodeItBro’s Bulk Meta Tag Extractor can help you compare page metadata during SEO or competitor research.

Conclusion

Use Python web scraping to collect public web data and turn it into clean JSON, CSV, or database records. Start with requests and BeautifulSoup for normal HTML pages. Use Playwright only when the page needs JavaScript to load the data.

The smallest useful version still looks like this:

For example:

code
import requests
from bs4 import BeautifulSoup

url = "https://example.com"

response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

title = soup.select_one("h1")

print(response.status_code)
print(title.get_text(strip=True) if title else "No title found")

Output:

code
200
Example Domain

That is the whole workflow in compressed form. Fetch the page, parse the HTML, select what you need, validate the data, and export it cleanly.

Thanks for reading. Happy coding!

