Lecture 13
Dr. Mine Çetinkaya-Rundel
Duke University
STA 199 - Spring 2023
October 12, 2022
ae-12: clone the project from GitHub, render your document, update your name, and commit and push.

An increasing amount of data is available on the web.
These data are often provided in an unstructured format: you can always copy and paste, but that is time-consuming and prone to errors
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
Two different scenarios:
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files
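The screen-scraping scenario can be sketched with rvest; the HTML string below is a made-up stand-in for a real page fetched by URL:

```r
# A minimal screen-scraping sketch using rvest
# (the HTML snippet is invented for illustration)
library(rvest)

html <- '<html><body>
  <h1 id="title">Star Wars films</h1>
  <p class="lead">A long time ago in a galaxy far, far away...</p>
</body></html>'

page  <- read_html(html)                                # parse the HTML
title <- page |> html_element("#title") |> html_text()  # select by CSS id
lead  <- page |> html_element(".lead")  |> html_text()  # select by CSS class
```

In a real analysis you would pass a URL to `read_html()` instead of a string, but the parsing and selecting steps are the same.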
Key rvest functions:

- read_html() - read HTML data from a URL or character string (actually from the xml2 package, but most often used along with other rvest functions)
- html_element() / html_elements() - select specified element(s) from an HTML document
- html_table() - parse an HTML table into a data frame
- html_text() - extract text from an element
- html_text2() - extract text from an element and lightly format it to match how the text looks in the browser
- html_name() - extract elements' names
- html_attr() / html_attrs() - extract a single attribute or all attributes

ae-12 (repo name will be suffixed with your GitHub name).

When working in a Quarto document, your analysis is re-run each time you render
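A small sketch exercising several of these functions on an inline HTML snippet (the snippet and its values are invented for illustration):

```r
library(rvest)

snippet <- '<div>
  <a href="https://www.r-project.org">R</a>
  <table>
    <tr><th>name</th><th>height</th></tr>
    <tr><td>Luke</td><td>172</td></tr>
    <tr><td>Leia</td><td>150</td></tr>
  </table>
</div>'

doc  <- read_html(snippet)
link <- doc |> html_element("a") |> html_attr("href")   # one attribute value
tbl  <- doc |> html_element("table") |> html_table()    # tibble, 2 rows
tags <- doc |> html_elements("div > *") |> html_name()  # element names
```

Note that `html_table()` uses the `<th>` row for column names and parses the heights as numbers for you.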
If you web scrape in a Quarto document, you'd be re-scraping the data each time you render, which is undesirable (and not nice to the website)!
An alternative workflow: scrape the data in a separate R script, save the results to a file (e.g. CSV), and read that saved file in your Quarto document.

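That workflow can be sketched as: scrape and save in a standalone script, then read the cached file when rendering. The file path below uses a temporary directory for illustration; in practice you would save somewhere in your project, such as a data/ folder:

```r
# --- scrape.R: run occasionally by hand, NOT on every render ---
library(rvest)
library(readr)

# invented HTML standing in for a real page fetched by URL
html <- '<table>
  <tr><th>name</th><th>height</th></tr>
  <tr><td>Luke</td><td>172</td></tr>
  <tr><td>Leia</td><td>150</td></tr>
</table>'

scraped <- read_html(html) |> html_element("table") |> html_table()
out <- file.path(tempdir(), "scraped.csv")  # in practice: "data/scraped.csv"
write_csv(scraped, out)

# --- in the Quarto document: read the saved file, don't re-scrape ---
cached <- read_csv(out, show_col_types = FALSE)
```

This way the website is hit only when you deliberately re-run the scraping script, and rendering the document stays fast and reproducible.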
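In the Web API scenario, the structured response is typically JSON rather than HTML. A minimal sketch of parsing such a response with the jsonlite package; the JSON string is a made-up stand-in for a real API response body:

```r
# Parsing a (made-up) JSON API response with jsonlite
library(jsonlite)

json <- '{"count": 2, "results": [
  {"name": "Luke Skywalker", "height": 172},
  {"name": "Leia Organa",    "height": 150}
]}'

parsed  <- fromJSON(json)   # simplifies nested JSON into R structures
results <- parsed$results   # the array of records becomes a data frame
```

Because `fromJSON()` simplifies arrays of records into data frames, API results often drop straight into a tidyverse pipeline with no HTML parsing at all.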