The Wikipedia Goggle: a search engine for Wikipedia written in modern C++

aapeliv/goggle

Latest commit: 95a4545 · May 3, 2022

The Wikipedia Goggle logo

The Wikipedia Goggle

The Wikipedia Goggle is a search engine for the English Wikipedia, using a trigram index and a ranking algorithm similar to Google's original PageRank, implemented in modern C++.
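To illustrate what a trigram index does, here is a minimal in-memory sketch. It is not the project's actual implementation: the names `ToLower`, `Trigrams` and `TrigramIndex` are hypothetical, the lowercasing is ASCII-only, and a real index over a full Wikipedia dump would need an on-disk layout and a ranking step on top.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: lowercase a string (ASCII only).
std::string ToLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Returns the distinct 3-character substrings of `text`, lowercased.
std::set<std::string> Trigrams(const std::string& text) {
  std::set<std::string> out;
  const std::string lower = ToLower(text);
  for (std::size_t i = 0; i + 3 <= lower.size(); ++i) {
    out.insert(lower.substr(i, 3));
  }
  return out;
}

// Hypothetical index: maps each trigram to the sorted ids of the
// documents that contain it.
class TrigramIndex {
 public:
  void Add(int doc_id, const std::string& text) {
    for (const auto& gram : Trigrams(text)) postings_[gram].insert(doc_id);
  }

  // Returns ids of documents that contain every trigram of `query`.
  // Queries shorter than three characters have no trigrams and match nothing.
  std::vector<int> Candidates(const std::string& query) const {
    const std::set<std::string> grams = Trigrams(query);
    std::vector<int> result;
    bool first = true;
    for (const auto& gram : grams) {
      const auto it = postings_.find(gram);
      if (it == postings_.end()) return {};  // trigram never seen
      if (first) {
        result.assign(it->second.begin(), it->second.end());
        first = false;
      } else {
        // Both ranges are sorted, so a linear intersection is enough.
        std::vector<int> next;
        std::set_intersection(result.begin(), result.end(),
                              it->second.begin(), it->second.end(),
                              std::back_inserter(next));
        result = std::move(next);
      }
    }
    return result;
  }

 private:
  std::unordered_map<std::string, std::set<int>> postings_;
};
```

The candidate set is a superset of the true matches: a document can contain every trigram of the query without containing the query itself, so a real engine would still verify and rank the candidates.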


An animation of the system working


Getting started

  1. Install Bazel from https://bazel.build/install (you can do brew install bazel if you have Homebrew)

  2. Clone the git repo

git clone https://github.com/aapeliv/goggle.git
  3. Download files from the Wikipedia data dump. For example, to get a sample dump from April 1st 2022, go to https://dumps.wikimedia.org/enwiki/20220401/, download a partial dump and extract the index file.
cd data

# download the data dump
wget https://dumps.wikimedia.org/enwiki/20220401/enwiki-20220401-pages-articles-multistream1.xml-p1p41242.bz2

# download the data index file
wget https://dumps.wikimedia.org/enwiki/20220401/enwiki-20220401-pages-articles-multistream-index1.txt-p1p41242.bz2

# extract the index
bunzip2 enwiki-20220401-pages-articles-multistream-index1.txt-p1p41242.bz2
  4. Run the tests
bazel test //...
  5. Start the indexer and backend with
bazel run //src:goggle
  6. Once the backend comes up with the message `Serving on 8080.`, you can test it with a query such as
curl "http://localhost:8080/query?q=finland"

Running the whole thing self-contained

  1. Build an optimized binary
bazel build --config=optz //src:goggle
  2. Build an optimized frontend
cd frontend/
npm run build
  3. Get a TLS certificate and key and place them in the working directory

  4. Download the full Wikipedia dump and index

  5. Run the full thing

./bazel-bin/src/goggle \
  --db_dir=prod_db/ \
  --dump_file path/to/articles-multistream.xml.bz2 \
  --index_file path/to/articles-multistream-index.txt \
  --enable_tls \
  --server_cert path/to/cert.pem \
  --server_key path/to/key.pem \
  --frontend_server_dir frontend/build/