The Wikipedia Miner Toolkit is for developers and researchers who want to closely examine the raw structure of Wikipedia. If you are looking for a pre-built Wikipedia-based ontology, then something like DBpedia or Freebase will probably be more relevant to you. However, if you want to make use of the structure and content of Wikipedia, then this toolkit will make your work a lot easier.
What this does
- Summarizes Wikipedia's link structure, category structure, page types, etc.
Perl scripts are provided to extract these statistics and summaries from the static XML dumps. These scripts scale linearly with the size of the dump and can split the data where necessary to work within memory constraints.
- Models Wikipedia as easy-to-understand Java classes
such as Article, Category, Anchor, etc. See the Javadoc for details.
- Communicates with a MySQL database
The summarized data is stored persistently. You can access it immediately, without waiting for anything to load.
- Caches summaries to memory if required.
Sometimes you would rather spend time pre-loading the summaries into memory, so that you avoid the overhead of constantly querying the database.
- Provides flexible searching, via anchors, titles and redirects
as they occur or via stemming, case-folding, etc. You can also add your own search methods, and prepare the data so that they can be used efficiently.
- Measures how Wikipedia's concepts relate to each other.
The toolkit includes proven semantic relatedness measures that are both accurate and cheap to compute.
- Detects Wikipedia topics when they are mentioned in documents.
This includes machine-learned approaches for disambiguating ambiguous terms, and identifying the topics that are most likely to be of interest to the reader.
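To make the relatedness item above more concrete, here is a minimal sketch of one link-based measure, modelled on the Normalized Google Distance applied to incoming links. It is an illustration under stated assumptions, not the toolkit's own code: the class name LinkRelatedness, the method signature, and the use of plain sets of article IDs are all hypothetical, and the measures shipped with the toolkit may differ in detail.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the Wikipedia Miner API. */
final class LinkRelatedness {

    /**
     * Relatedness of two articles, computed from the sets of article IDs
     * that link to each of them. Returns a value in [0, 1]; 0 means the
     * articles share no incoming links.
     */
    static double relatedness(Set<Integer> linksInA, Set<Integer> linksInB, int totalArticles) {
        if (linksInA.isEmpty() || linksInB.isEmpty()) {
            return 0.0;
        }
        double larger = Math.max(linksInA.size(), linksInB.size());
        double smaller = Math.min(linksInA.size(), linksInB.size());
        if (totalArticles <= smaller) {
            throw new IllegalArgumentException("totalArticles must exceed the link set sizes");
        }

        // Articles that link to both A and B.
        Set<Integer> shared = new HashSet<>(linksInA);
        shared.retainAll(linksInB);
        if (shared.isEmpty()) {
            return 0.0;
        }

        // Normalized distance: small when most incoming links are shared.
        double distance = (Math.log(larger) - Math.log(shared.size()))
                / (Math.log(totalArticles) - Math.log(smaller));

        // Turn the distance into a similarity and clamp it to [0, 1].
        return Math.max(0.0, Math.min(1.0, 1.0 - distance));
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> b = new HashSet<>(Arrays.asList(3, 4, 5, 6));
        System.out.println(relatedness(a, b, 1000));
    }
}
```

A measure of this kind is cheap because it only needs the link sets, which are exactly the sort of summary that can be read from the database or cached in memory; no article text has to be parsed.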
What this doesn't do
- Live updates
This toolkit requires you to download and preprocess your own static dump of Wikipedia. Because of that, you will always be a few days behind.
- Languages other than English
Most of this work (wikification, semantic relatedness, etc.) is language independent, but it has not been tested on anything other than the en (English) and simple (Simple English) versions of Wikipedia.
- Parsing of MediaWiki syntax
This toolkit includes code to strip out MediaWiki syntax to obtain plain text or HTML versions of Wikipedia's content, but this is far from comprehensive. Other (probably better) parsers are available here.
- Anything with templates, info-boxes or revision history.
We simply haven't had a chance to look into this.
Remember, Wikipedia Miner is entirely open source, and is free to evolve as you see fit.
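To illustrate why stripping MediaWiki syntax (mentioned in the list above) is easy to start and hard to finish, here is a small regex-based sketch. It is not the toolkit's stripping code: the class name MarkupStripper and its strip method are hypothetical, and the sketch handles only internal links and bold/italic quote runs. Templates, tables, references, and image links with several pipes are left untouched or handled poorly.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the Wikipedia Miner API. */
final class MarkupStripper {

    // [[target|label]] becomes label
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[[^\\[\\]|]*\\|([^\\[\\]]*)\\]\\]");

    // [[target]] becomes target
    private static final Pattern PLAIN_LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]*)\\]\\]");

    // Runs of two to five apostrophes mark italic and bold text
    private static final Pattern QUOTES = Pattern.compile("'{2,5}");

    /** Returns the text with simple links and bold/italic markers removed. */
    static String strip(String wikiText) {
        String s = PIPED_LINK.matcher(wikiText).replaceAll("$1");
        s = PLAIN_LINK.matcher(s).replaceAll("$1");
        return QUOTES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("'''Kiwi''' are [[bird|birds]] from [[New Zealand]]."));
    }
}
```

Anything beyond these simple cases, such as nested templates, quickly outgrows regular expressions, which is why a dedicated parser is the better choice for comprehensive work.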
Requirements
To run Wikipedia Miner, you will need lots of hard-drive space and around 3 GB of memory. On top of that, the toolkit requires:
- Write access to a MySQL server
- Java (1.5 or above)
- MySQL Connector/J
A Java API for connecting to MySQL databases.
- Trove
A Java API for efficient sets, hashtables, etc.
- Weka
An open-source Java-based workbench of machine learning algorithms for data mining tasks.
If you only need Wikipedia's structure rather than its full textual content, then you can save a lot of time by using one of our pre-summarized dumps (available here). Otherwise, you will need:
- An XML dump of Wikipedia
Download one of the pages-articles.xml.bz2 files from here
- Perl
- Parse::MediaWikiDump
Perl tools for processing MediaWiki dump files.
If you want to host your own Wikipedia Miner web services, then you will also need:
Licence
The Wikipedia Miner toolkit is open-source software, distributed under the terms of the GNU General Public License.