ChatGPT said:
This is a common issue when using pandas.read_html() on Wikipedia.
Even though you can open the page in a browser, Wikipedia rejects requests that lack a browser-like User-Agent header. When you pass a URL directly, read_html() fetches it with a simple urllib request under the hood, and that bare request is often rejected with HTTP 403 Forbidden.
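For context, here is a minimal sketch of the failing pattern; I'm assuming your code passed the URL straight to read_html(), along these lines:

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

# With no browser-like User-Agent, Wikipedia typically answers 403,
# which surfaces as urllib.error.HTTPError: HTTP Error 403: Forbidden.
dfs = pd.read_html(url)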
✅ Why your code failed
- Browser requests include headers like User-Agent, cookies, etc. → Wikipedia allows them.
- pandas.read_html()'s default request looks like an automated bot (see the sketch below) → Wikipedia blocks it with 403 Forbidden.
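To see the difference for yourself, a quick sketch: urllib's default opener (what read_html() relies on) announces itself as a Python client rather than a browser:

import urllib.request

# The stock opener sends "Python-urllib/3.x" as its User-Agent,
# which Wikipedia's servers treat as an automated client.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]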
✅ How to fix it
You need to supply a proper User-Agent header by downloading the page yourself (using requests) and then passing the HTML to read_html().
✔ Working code
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.raise_for_status()  # ensure no HTTP errors

dfs = pd.read_html(response.text)
print(len(dfs))
dfs[0].head()
This works because:
- requests.get() sends a normal browser-like User-Agent.
- Wikipedia returns the page.
- pd.read_html() parses tables from the HTML content.
You should see the first table as expected.
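One small caveat, assuming you are on a recent pandas (2.1 or later): passing a literal HTML string to read_html() emits a FutureWarning there, and the suggested fix is to wrap the text in StringIO. Building on the response object above:

from io import StringIO

# Newer pandas versions want file-like input instead of a raw
# HTML string; StringIO makes the string file-like.
dfs = pd.read_html(StringIO(response.text))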
You said:
ChatGPT said:
You do not need to modify anything.
"Mozilla/5.0" is a generic, browser-like User-Agent and works fine for Wikipedia.
Why it works
Most modern browsers (including Chrome, Edge, Firefox) send a User-Agent string that begins with:
Mozilla/5.0
For example, your actual Chrome User-Agent probably looks like:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
But Wikipedia does not require the full long string; it mainly rejects requests whose User-Agent identifies a generic script (such as Python-urllib), so any reasonable browser-like value passes.
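As an aside (not something you need for this fix): Wikimedia's User-Agent policy actually prefers that automated clients identify themselves with a descriptive string and contact details rather than posing as a browser. A sketch, with a placeholder tool name and contact addresses:

import requests

# Hypothetical descriptive User-Agent: tool name/version plus
# contact info, per Wikimedia's User-Agent policy.
headers = {
    "User-Agent": "MyTableScraper/0.1 (https://example.org; me@example.org)"
}
response = requests.get(
    "https://en.wikipedia.org/wiki/List_of_countries_by_population",
    headers=headers,
)
response.raise_for_status()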
✔ Conclusion
- headers = {"User-Agent": "Mozilla/5.0"} is enough.
- No need to pretend to be Chrome specifically.
- The code will work exactly the same.