This is from a presentation Austin gave at BUDSC18. If you need additional information in the meantime, contact us.
In this interactive session, I will guide the audience through methods for preparing a large, uniform data set from the writings of Philip K. Dick, suitable for discovering how language around gender is used within the corpus by algorithmic methods. I will be working with a variety of media for source material (scanned pages of manuscripts, printed books, PDFs, web-based text, etc.) and will use the Python programming language along with various libraries to extract text, format the data, and retain the metadata required to reference the original sources. I will document my work in a way that allows future researchers to read and understand the nuances of my methodology in enough detail to recreate my dataset from the original sources (whether or not they appear in the same medium) using an entirely different programming language or suite of tools.
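As an illustration of what "retaining metadata to reference the original sources" can look like, here is a minimal sketch using only the standard library. It is not the code from this repository: the `TextRecord` class, its field names, and the sample values are all hypothetical placeholders.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TextRecord:
    """One extracted passage plus the metadata needed to find it again.

    All field names are hypothetical; adapt them to your own sources.
    """
    work_title: str  # title of the source work
    medium: str      # e.g. "scanned page", "PDF", "web text"
    locator: str     # where the passage sits in the source, e.g. "page 12"
    text: str        # the extracted text itself


def to_json_line(record):
    """Serialise a record as one line of JSON (a tool-neutral format)."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def from_json_line(line):
    """Rebuild a record from a line produced by to_json_line."""
    return TextRecord(**json.loads(line))


# Placeholder values only, not real data from the corpus.
record = TextRecord(
    work_title="Example Novel",
    medium="scanned page",
    locator="page 12",
    text="Placeholder sentence.",
)
line = to_json_line(record)
```

Storing one JSON object per line keeps the text and its source reference together, and the format can be read back from any language, which fits the goal of letting others recreate the dataset with different tools.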
I hope to make this easier, perhaps even with an executable, but I wanted to get the code into everyone's hands before spending too much time on that. For now, here is the command-line way to do it:
- Download this repository
- Open a terminal and navigate to this directory
- Install pipenv. Sadly, you may have to read the docs for your installation; I'll try to streamline this later: https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv
- Install the spaCy English model (https://spacy.io/models/):
pipenv run python -m spacy download en
- Install the VADER lexicon for NLTK (sentiment analysis):
pipenv run python -m nltk.downloader vader_lexicon
- Run: pipenv run python server.py
- Open it in your browser: http://0.0.0.0:5000/
Libraries used:
- OpenCV (installed via https://pypi.org/project/opencv-python/)
- Markovify
- Pytesseract
- PyPDF2
- Pillow
- Flask
- spaCy
- NLTK
- Roman (for Roman numerals)
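To confirm that the environment has the libraries named above, a small check like the following can help. This is a sketch and not part of the repository. The import names are the usual ones for these packages (Pillow imports as `PIL`, opencv-python as `cv2`), and the `missing_modules` helper and `REQUIRED_MODULES` list are hypothetical names introduced here.

```python
from importlib.util import find_spec

# Import names assumed for the libraries listed above.
# Note that Pillow is imported as PIL and opencv-python as cv2.
REQUIRED_MODULES = [
    "cv2",
    "markovify",
    "pytesseract",
    "PyPDF2",
    "PIL",
    "flask",
    "spacy",
    "nltk",
    "roman",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

Saved under a name of your choosing (for example `check_deps.py`, a hypothetical file name), it can be run inside the project environment with `pipenv run python check_deps.py`.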
I'm still working on this, so I'm just calling it copyrighted until I work out the details. If you want to use this in your work, just contact me and we'll work it out. Also, I should have this updated in a week or so.