Using (and learning) Python to scrape and manipulate text

To get a sense of much of the work involved in DH, as well as the underlying logic of the nuts and bolts, we’re going to learn a little Python. Python is a comparatively easy-to-use, object-oriented, high-level programming language. As such, it’s easier to parse and read than some other languages.

But, as Michelle Moravec has suggested, “invest your time in learning methods not tools.” By this, she doesn’t mean that you shouldn’t learn tools, but rather that you should focus on methodology, with the actual tool used being secondary. In other words, even for the exercise we’re doing now, the point is not to learn Python for the sake of knowing Python; it’s to get our feet wet in programming.

But it’s also to learn some methodology for a skill that might come in very handy sometime: how to scrape data from the web. Think of all those digitized files or websites out there that we can only get data from by downloading the pages. That’s fine for a few dozen or even a few score pages, but what about hundreds? Thousands? That’s where it makes sense to find some way to automate the ingestion of documents, get rid of all that pesky HTML and other formatting they needed to get all gussied up for the web, and end up with nice clean files from which we can extract data for analysis. That’s what we’ll be doing with Python.
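To make that a little more concrete, here is a rough sketch of the "strip the HTML, keep the text" step. It is not the code from the lessons, just one possible approach, and it assumes Python 3 and only the standard library. The names `TextExtractor`, `strip_html`, and `fetch`, and the sample page, are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect the text of a page, ignoring tags, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []      # pieces of text found between tags
        self.skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_html(html):
    """Return the text of an HTML string with whitespace collapsed.

    Chunks are joined with spaces, which is crude: punctuation that
    directly follows a tag will pick up a stray space before it.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def fetch(url):
    """Download one page as text (assumes the page is UTF-8 encoded)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# A made-up page, so the example runs without touching the network.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Trial of John Smith</h1>"
    "<p>He was <b>found</b> guilty.</p></body></html>"
)
print(strip_html(sample))
```

To handle hundreds of pages, you would call `fetch` on each address in a list, pass the result to `strip_html`, and write each cleaned text out to its own file.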

Accordingly, here are the lessons we’ll be doing, from the Programming Historian 2. First, scroll down to Python Programming Basics. For actually writing and running your Python programs, the editors suggest either using the command line or installing Komodo Edit. If either of these works for you, excellent. For the Mac, I actually prefer TextWrangler, and Windows folk may like Notepad++ (both of these are free), as Komodo Edit is a lot more complex than we need for this exercise. Another cool option, if you don’t want to install anything, is repl.it, which runs free, online programming environments, including Python, that operate on files on your computer. Another thing to remember: the filename extension (that is, the suffix) for Python program files is “.py”. Whatever text editor you use, make sure that you save your files as .py files rather than as .txt files, which tends to be the default in text editors (except, of course, in those lessons in which you are instructed to save a .txt or HTML file, etc.).
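If you want to check that your editor and your Python installation are talking to each other before starting the lessons, a tiny test program like the one below will do. This is just a sketch, assuming Python 3; the filename `hello.py` and the function name `greeting` are my own inventions, not something from the lessons.

```python
# Save this in a file with the .py extension, for example hello.py.

def greeting(name):
    """Build a short greeting for the given name."""
    return "Hello, " + name + "!"


print(greeting("world"))
```

If you are using the command line, typing `python hello.py` from the folder where you saved the file should run it (on some systems the command is `python3` instead).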

Then, the lessons to complete:

I know, this seems like a lot. But trust me, some of this is review, and most of these lessons go by quickly (several of them take under 10 minutes each). Plus, we’ll be getting started on them in class Wednesday night; I’m working on arranging a couple of extra laptops for us to use. I guarantee that we’ll all be able to work through them within a week, with a minimum of pulling of hair and gnashing of teeth.