Python web scraping for beginners: pull data from websites into a spreadsheet
web scraping is using code to automatically extract data from a website — the same data you would manually copy and paste, but done in seconds and at scale.
practical examples of what solopreneurs scrape:
– competitor pricing pages (to track price changes over time)
– product listings from marketplaces (to monitor inventory or pricing)
– review data from sites that do not offer exports
– public government data that is only available in HTML tables
– news headlines for topic monitoring
this guide uses Python, Google Colab (free, no installation), and two libraries: requests (fetches the page) and BeautifulSoup (parses the HTML).
before you scrape: check if an API exists
many sites offer an API — a structured data feed designed for programmatic access. API data is cleaner, more reliable, and comes with explicit terms of use, so you are not guessing about what is allowed. always check for an API before scraping.
where to look:
– docs.site.com or developer.site.com
– Google search: “[site name] API”
– the site footer: “Developers” or “API” link
if an API exists, use it instead of scraping. if not, proceed.
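if the site documents an API, pulling data usually takes a couple of lines with requests. a minimal sketch; the endpoint and parameters below are made-up placeholders, so use whatever the site's API documentation specifies:
import requests

# hypothetical endpoint for illustration only
response = requests.get("https://api.example.com/v1/products", params={"page": 1})
data = response.json()  # APIs typically return JSON, which is already structured
print(data)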
also check the site’s robots.txt (yoursite.com/robots.txt). this file tells you which parts of the site the owner asks automated tools not to access. follow it: ignoring robots.txt creates legal risk and is a quick way to get your IP blocked.
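you can also check a specific URL against robots.txt programmatically with Python's built-in robotparser (the URLs below are placeholders):
from urllib.robotparser import RobotFileParser

# placeholder URLs; point these at the site you actually want to scrape
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://example.com/some-page"))  # True means the page is not disallowed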
setting up the environment (no installation)
open Google Colab (colab.research.google.com). create a new notebook.
install the required libraries (pre-installed on most Colab environments, but run this to be safe):
!pip install requests beautifulsoup4 -q
import them:
import requests
from bs4 import BeautifulSoup
import pandas as pd
scraping your first page: fetching and parsing HTML
we will scrape the Wikipedia table of the world’s most populous cities as a practice example (publicly available and openly licensed).
# step 1: fetch the page
url = "https://en.wikipedia.org/wiki/List_of_largest_cities"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
# check it worked
print(response.status_code) # should print 200
requests.get() downloads the page HTML. the User-Agent header makes the request look like a browser — some sites block requests without it.
# step 2: parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# find the first table with the "wikitable" class
table = soup.find('table', {'class': 'wikitable'})
# extract the rows
rows = table.find_all('tr')
print(f"Found {len(rows)} rows")
BeautifulSoup converts the raw HTML into a searchable structure. find() finds the first matching element. find_all() returns all matches.
# step 3: extract data from the rows
data = []
for row in rows[1:]: # skip header row
    cells = row.find_all(['td', 'th'])
    if cells:
        row_data = [cell.get_text(strip=True) for cell in cells]
        data.append(row_data)
# step 4: load into pandas
df = pd.DataFrame(data)
print(df.head())
get_text(strip=True) extracts the text content of each cell, removing HTML tags and extra whitespace.
# step 5: save to CSV
df.to_csv('largest_cities.csv', index=False)
print("saved.")
you now have a CSV of the table from the Wikipedia page. this process works for any HTML table on any page.
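a shortcut worth knowing: pandas can often parse well-formed HTML tables in a single call, skipping the manual row loop. a minimal sketch, reusing the response object from step 1 and assuming lxml is installed (it is preinstalled on Colab):
from io import StringIO

# one DataFrame per table found on the page
tables = pd.read_html(StringIO(response.text))
print(len(tables))
print(tables[0].head())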
handling common scraping scenarios
multiple pages (pagination)
many sites spread data across pages: page 1, page 2, page 3.
all_data = []
for page_num in range(1, 11): # scrape pages 1-10
    url = f"https://example.com/listings?page={page_num}"
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.text, 'html.parser')
    # extract your data here
    items = soup.find_all('div', class_='listing-item')
    for item in items:
        title = item.find('h2').get_text(strip=True)
        price = item.find('span', class_='price').get_text(strip=True)
        all_data.append({'title': title, 'price': price})
df = pd.DataFrame(all_data)
df.to_csv('listings.csv', index=False)
finding the right HTML elements to target
the critical skill in scraping is finding the HTML element that contains the data you want.
in any browser: right-click the text you want to scrape → Inspect (or Inspect Element). the browser’s developer tools show you the HTML structure.
look for the tag (div, span, table, p), the class name (class="price"), or the ID (id="main-content").
common patterns:
– soup.find('h1') — the first h1 heading
– soup.find('span', class_='price') — a span with class “price”
– soup.find(id='results-table') — the element with id “results-table” (IDs are unique per page)
– soup.find_all('li', class_='product-name') — all list items with class “product-name”
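putting it together: once Inspect shows you the structure, the usual pattern is find_all() over the repeating container, then find() for each field inside it. a sketch against a hypothetical page where each result sits in an article tag with class "story":
# hypothetical structure: each result is an <article class="story"> with an <h2> and a link
for story in soup.find_all('article', class_='story'):
    title = story.find('h2').get_text(strip=True)
    link = story.find('a')['href']  # attributes like href are read with [], not get_text()
    print(title, link)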
adding politeness (sleep between requests)
scraping too fast gets your IP blocked. add a pause between requests:
import time
for page_num in range(1, 50):
    # scrape the page...
    time.sleep(2) # wait 2 seconds before the next request
2 seconds between requests is a reasonable default for most sites. for slower or fragile sites, increase to 5 seconds.
where BeautifulSoup stops working
BeautifulSoup only works on static HTML — pages where the content is in the initial HTML response.
many modern sites use JavaScript to load content dynamically. if you fetch the page and the data you need is not in the response HTML, the site is almost certainly rendering it with JavaScript.
for JavaScript-rendered pages, use Playwright or Selenium — these control a real browser programmatically. they are more complex to set up, but they can scrape content that requests never sees.
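a minimal Playwright sketch, assuming you have installed it first (pip install playwright, then playwright install chromium). note that inside a notebook like Colab you generally need Playwright's async API; the sync API below is the plain-script version:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

# launch a headless browser, let the page's JavaScript run, then parse as usual
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    html = page.content()  # the HTML after JavaScript has executed
    browser.close()

soup = BeautifulSoup(html, 'html.parser')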
for most simple scraping tasks (tables, directory listings, article content), BeautifulSoup and requests are sufficient.
ethical and legal considerations
- follow robots.txt restrictions
- do not scrape at a rate that burdens the server
- do not scrape personal data (names, emails, phone numbers) without clear legal basis
- check the site’s terms of service — some explicitly prohibit scraping
- if you are scraping to republish data, verify the license and attribution requirements
for practice: wikipedia.org and data.gov both allow scraping. they have well-structured HTML and no robots.txt restrictions on general access.
for finding datasets that already exist without scraping: best free datasets for research 2026.
for using scraped data in analysis: Python pandas tutorial for beginners.