From 11a82525fd77a522f686948cfc7bc560a757f22b Mon Sep 17 00:00:00 2001 From: Xevion Date: Fri, 7 Aug 2020 18:30:54 -0500 Subject: [PATCH] update README with information on quote processing, basic installation/dependencies --- README.md | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 110 insertions(+) diff --git a/README.md b/README.md index 8530321..3fd2f60 100644 --- a/README.md +++ b/README.md @@ -19,10 +19,120 @@ A Vue.js and Flask Web Application designed to provide a quick way to search for - Instant Search provided Algolia - Sleek, responsive design that is easy on the eyes +## Quote Data + +### Credit + +Credit to [officequotes.net/](https://www.officequotes.net/) for providing all quote data. + +Credit to [imdb.com](https://www.imdb.com/title/tt0386676/) for episode descriptions. + +### Processing + +Quotes are scraped directly from the website as of this moment. + +This repository will hold the current pre-processed raw quote data, but the application has the ability to fetch and parse +HTML pages directly as needed. + +``` +python server/cli.py fetch + --season SEASON Fetches all episodes from a specific season. + --episode EPISODE Fetches a specific episode. Requires SEASON to be specified. + --all Fetches data for every episode from every season. + --skip SEASON:EPISODE When specified, it will skip a given episode. +``` + +The data has to be parsed, but due to high irregularity (at least too much for me to handle), the files will have to be +inspected and manually processed. + +```python server/cli.py preprocess + --season SEASON Pre-processes all episodes from a specific season. + --episode EPISODE Pre-processes a specific episode. Requires SEASON to be specified. + --all Pre-processes all episodes from every season. + --overwrite DANGER: Will overwrite files. May result in manually processed files to be lost forever. +``` + +From then on, once all files have been pre-processed, you will have to begin the long, annoying process of editing them into my custom format. + +These raw pre-processed files are located in `'./server/data/processed/` + +Each section (barring the first) is pre-pended by a .hyphen. + +``` +CharacterName: Text that character says. +OtherCharacter: More text that other character says.. +- +ThirdCharacter: Text that character says in a second scene/section. +-!1 +Fourth Character With Spaces In Name: Text that fourth character says in a deleted scene. +Fifth-Character: Which deleted scene? Deleted scene number one. +``` + +Deleted scenes are marked by a initial exclamation mark, and then a number of digits marking which deleted scene they are a part of. + +Please note that extra text like 'Deleted Scenes 3' might appear before a hyphen - this is expected and is helpful when deciding +which scene goes with which Deleted Scene ID. If you don't know, do what I did - go look at the web page it's based on. +Otherwise, I read the quotes and figure out based on context. + +This concept is rather loose, slow, and dumb, it simply allows me to mark what deleted scenes go together while working +with a incredibly inconsistent, human curated data format. + +To ease text processing, I did come up with RegEx expressions for search and replacement: + +``` +^([\w\s]+\-*[\w\s]*):\s+ +$1| +``` + +From then on, the process becomes much simpler, 95% of the work needed to process quotes is already done. + +Now that quotes are in a consistent (although custom) format, they need to be processed into individual episodes. In reality, +they are just the JSON format of the previous stage. + +``` +python server/cli.py process + --season SEASON Processes all episodes from a specific season. + --epsiode EPISODE Processes a specific episode. Requires SEASON to be specified. + --all Processes all episodes from all seasons. +``` + +Now that they're all in individual files, the final commands can be ran to compile them into one file, a static +'database' or something. Technically, they could be kept scattered, but I decided to make it simpler with just 1 big file. + +This also is where Algolia comes in. + +``` +python server/cli.py build [algolia|final] +``` + +Each command is ran with no special arguments (as of now), generating a `algolia.json` or `data.json` in the `./server/data/` folder. + +This `data.json` file is loaded by the Flask server and the `algolia.json` can be uploaded to your primary index. + ## Setup This project was built on Python 3.7 and Node v12.18.3 / npm 6.14.6. +### Installation + +To install all Node/NPM dependencies, run + +``` +npm install +``` + +To install Python's dependencies, run + +``` +pip install -r ./requirements.txt +``` + +I recommend that you use a virtualenv in order to keep dependencies separate from other projects, as I do. +Personally, I use PyCharm Professional to maintain virtualenvs, just because it's easy to start, use, update and maintain +them. + +### Running + - Vue.js can be ran via `npm run serve`. - Run this in `./client/`. - Flask can be ran via `flask run`.