Rework notebooks to use the static self-hosted fake job board #350
Open
martin-martin wants to merge 4 commits into master
indeed.com has tightened its bot protection against web scraping, so the requests to the site that are described in this course now return 403 Forbidden status codes. I tried to get around this with fake headers (something that would be explainable in an intro course), but no luck: the 403 prevails. I previously [reworked the written tutorial](https://realpython.com/beautiful-soup-web-scraper-python/#step-1-inspect-your-data-source) to use a self-hosted [fake job board](https://realpython.github.io/fake-jobs/) that I set up just for the tutorial. As a quick fix for the video course, I added an explanatory lesson and reworked the Jupyter notebooks. The information and processes that I explain in the rest of the course are still valid and remain a good introduction to scraping a static website.
gahjelle
approved these changes
Feb 6, 2023
Contributor
gahjelle
left a comment
@martin-martin Great job updating this! I agree with you in removing the output from the notebooks!
I found one tiny bug (title -> title_element) that's noted as a line comment.
Otherwise, this looks good to me!
We could potentially ask @KateFinegan to have a quick LE glance on the changes as well.
build-a-web-scraper/03_parse.ipynb
Outdated
    "source": [
    "link_text = title_link.text\n",
    "link_text"
    "title = title.text\n",
Contributor
`title` is not defined at this point. Should we refer to `title_element` instead?
Suggested change

    - "title = title.text\n",
    + "title = title_element.text\n",
Co-authored-by: gahjelle <geirarne@gmail.com>
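For anyone reading along, here is a minimal sketch of why the rename matters. It is not the notebook's actual cell: the markup is a made-up job card, and it uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without extra installs, whereas the notebooks use Beautiful Soup. The point is only that the element has to be bound to `title_element` before its text is read.

```python
import xml.etree.ElementTree as ET

# Hypothetical job card markup, not copied from the fake job board.
CARD_HTML = """
<div class="card-content">
  <h2 class="title">Senior Python Developer</h2>
  <h3 class="company">Example Corp</h3>
</div>
"""

card = ET.fromstring(CARD_HTML)

# Bind the element to its own name first.
title_element = card.find("h2[@class='title']")

# Then read the text from that name. Writing `title = title.text` here
# would raise a NameError, because `title` has not been assigned yet.
title = title_element.text.strip()
print(title)
```

With Beautiful Soup the lookup call would differ, but the assignment order is the same: find the element, then take its `.text`.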