How to build your own NLP Dataset?

Agenda

In this article, we will build our own Wikipedia dataset.
We will first look for a website that includes a list of keywords related to a given topic. We will then extract these keywords and use them to query the Wikipedia API. The result will be a list of Wikipedia pages related to the topic we chose.

Building a list of keywords from a given topic

Scrape a web page

I found a website that includes a pretty long list of words related to Machine Learning, and I chose to use it as an example. Here is what this website looks like:

Quick view of the page I will use to extract keywords

Extract the useful information

Now the question is how to get a clean list of keywords from this page. Let’s have a quick look at the source code (Ctrl+U on Windows).

Chunk of HTML to understand the structure of the code

Once the keywords are extracted from the page, the resulting list looks like this:
['a/b testing',
'accuracy',
'action',
'activation function',
'active learning',
'validation set',
...
'vanishing gradient problem',
'wasserstein loss',
'weight',
'weighted alternating least squares (wals)',
'wide model',
'width']
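
The extraction step described above can be sketched as follows. This is only a minimal illustration using Python's standard `html.parser` module, not the code originally used for this article. The markup in `SAMPLE_HTML`, the tag `h2`, and the class name `glossary-term` are assumptions made up for the example; on a real page you would replace them with whatever you see in the page source.

```python
from html.parser import HTMLParser


class KeywordParser(HTMLParser):
    """Collect the text of every <tag class="css_class"> element.

    The default tag and class name are hypothetical: adapt them to the
    structure you find in the source of the page you are scraping.
    """

    def __init__(self, tag="h2", css_class="glossary-term"):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.keywords = []
        self._inside = False  # True while we are inside a matching element
        self._buffer = []     # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._inside and tag == self.tag:
            # Collapse whitespace and lowercase the keyword
            text = " ".join("".join(self._buffer).split()).lower()
            if text:
                self.keywords.append(text)
            self._inside = False


def extract_keywords(html):
    """Return the list of keywords found in an HTML string."""
    parser = KeywordParser()
    parser.feed(html)
    parser.close()
    return parser.keywords


# Made-up sample standing in for the real page source
SAMPLE_HTML = """
<h2 class="glossary-term" id="a-b-testing">A/B testing</h2>
<p>Some definition.</p>
<h2 class="glossary-term" id="accuracy">accuracy</h2>
<h2 class="other">Not a keyword</h2>
"""

print(extract_keywords(SAMPLE_HTML))
```

To run this on a live page, you would first download its HTML (for example with `urllib.request.urlopen`) and pass the decoded text to `extract_keywords`.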

Query Wikipedia

Now that you have a list of keywords, you just need to query the Wikipedia API for each of them, which is pretty straightforward.
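
A query loop of this kind could look like the sketch below. It is not the original code of this article: it calls the public MediaWiki Action API directly with the standard library and stores each result as a dictionary with `title` and `summary` keys. The function names, the `User-Agent` string, and the `delay` parameter are assumptions chosen for the example. Note that `redirects=1` only follows redirects; it does not search for the closest page, so some keywords may come back with no result.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_url(keyword):
    """Build a query URL asking for the plain-text intro of one page."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "extracts",
        "exintro": "1",      # only the introduction section
        "explaintext": "1",  # plain text instead of HTML
        "redirects": "1",    # follow redirects to the target page
        "titles": keyword,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_response(payload):
    """Return {'title', 'summary'} for the page in the response, or None."""
    pages = payload.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing") or pages[0].get("invalid"):
        return None
    page = pages[0]
    return {"title": page["title"], "summary": page.get("extract", "")}


def fetch_page(keyword):
    """Query Wikipedia for one keyword (requires network access)."""
    request = urllib.request.Request(
        build_url(keyword),
        # Placeholder User-Agent: replace with your own project and contact
        headers={"User-Agent": "nlp-dataset-tutorial/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_response(json.load(response))


def build_dataset(keywords, fetch=fetch_page, delay=0.0):
    """Fetch every keyword and keep only the pages that were found."""
    wiki_data = []
    for keyword in keywords:
        page = fetch(keyword)
        if page is not None:
            wiki_data.append(page)
        time.sleep(delay)  # optional pause between requests
    return wiki_data
```

Calling `build_dataset(keywords)` with the keyword list from the previous step returns a list that can be indexed in the same way as `wiki_data` in this article. Passing a custom `fetch` function makes it possible to try the loop without network access.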

It took me around 1 second per query, so be ready to wait a few minutes!
>>> wiki_data[0]["title"]
'A/B testing'
>>> wiki_data[0]["summary"]
'A/B testing (also known as bucket testing or split-run testing) is a user experience research methodology. A/B tests consist of a randomized experiment with two variants, A and B. It includes application of statistical hypothesis testing or "two-sample hypothesis testing" as used in the field of statistics. A/B testing is a way to compare two versions of a single variable, typically by testing a subject\'s response to variant A against variant B, and determining which of the two variants is more effective.\n\n'
>>> wiki_data[1]["title"]
'Accuracy and precision'
...

What’s next?

Now that you have built your own small dataset, you can start analyzing it! I will write tutorials about different machine learning techniques applied to this dataset, so check them out if you’re interested.
