Running a MediaWiki in Your Own Copy > Restore the Whole MediaWiki Backup

Submitted by ruo on Sat, 03/11/2017 - 03:39

This blog post shows how to download the latest Wikipedia pages-and-articles dump and import it into MySQL.

Requirements:

The Wikipedia pages-articles dump is 13.7 GB today, and after import the data grows to about 77 GB. Make sure you have about 150 GB of disk space while importing the data, and at least 100 GB for running the site. On most cloud servers, such as Azure, AWS EC2, and Linode, you can resize the server as needed.
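Before starting, it is worth checking the free space on the relevant volumes. A minimal check, assuming MySQL's default data directory of /var/lib/mysql (yours may differ):

df -h .                 # space where the dump is downloaded and extracted
df -h /var/lib/mysql    # space where the imported database will live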

 

Workflow:

Download and extract.

> Download the latest dump of English Wikipedia. For pages and articles only, download

enwiki-latest-pages-articles-multistream.xml.bz2

from https://dumps.wikimedia.org/
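For example, assuming the dump site's usual enwiki/latest/ path layout, the download looks like this:

# -c resumes a partial download, useful for a 13.7 GB file
wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2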

 

> Extract the bz2 file. At the end it will be a roughly 60 GB file:

enwiki-latest-pages-articles-multistream.xml
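A minimal extraction command; note that bunzip2 deletes the .bz2 file after extracting unless you pass -k to keep it:

bunzip2 -k enwiki-latest-pages-articles-multistream.xml.bz2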

 

Use MWDumper to convert XML to SQL

> Build MWDumper from source

The prebuilt mwdumper.jar (as of 2017/3/11) still has a known bug, "Need updated Xerces library to fix intermittent UTF-8 breakage", which makes it fail with errors on the dump.

So we need to build an up-to-date MWDumper from source, with a small patch.

 

If you run into other issues while using MWDumper, try searching https://www.mediawiki.org/wiki/Manual_talk:MWDumper

git clone https://phabricator.wikimedia.org/diffusion/MWDU/mwdumper.git
cd mwdumper

Modify the file

src/org/mediawiki/importer/XMLDumpReader.java

Add

import java.io.*;
import org.xml.sax.InputSource;

at the top of the file, and change the readDump() function as follows:

        public void readDump() throws IOException {
                try {
                        SAXParserFactory factory = SAXParserFactory.newInstance();
                        SAXParser parser = factory.newSAXParser();
                        // Wrap the raw input stream in an explicit UTF-8 reader so the
                        // SAX parser cannot mis-detect the encoding mid-stream.
                        Reader reader = new InputStreamReader(input, "UTF-8");
                        InputSource is = new InputSource(reader);
                        is.setEncoding("UTF-8");
                        parser.parse(is, this);
                } catch (ParserConfigurationException e) {
                        throw (IOException)new IOException(e.getMessage()).initCause(e);
                } catch (SAXException e) {
                        throw (IOException)new IOException(e.getMessage()).initCause(e);
                }
                writer.close();
        }

After the file is modified, compile and package the jar:

mvn compile
mvn package

 

It should generate

target/mwdumper-<version>.jar

When I did this, it was mwdumper-1.25.jar.

> Use MWDumper to convert the XML to SQL

Put the built jar (renamed to mwdumper.jar here for convenience) and the XML into the same folder, and execute:

java -jar mwdumper.jar enwiki-latest-pages-articles-multistream.xml --format=sql:1.5 > enwiki-latest-pages-articles-multistream.sql

Wait for it to finish, and you will get the final SQL file, which can be imported into the database with the mysql command.

MySQL preparation

> Use MariaDB; otherwise you need to force MySQL to use utf8mb4

MariaDB is also the MySQL flavor that Wikimedia itself runs, so I recommend using it.

If you use MySQL Community Server, you may hit this bug:

ERROR 1366 (HY000) at line 16989: Incorrect string value: '\xF0\x9D\x95\xB9\xF0\x9D...' for column 'rev_user_text' at row 875

This is a UTF-8 character-set problem: MySQL's legacy utf8 charset only stores characters up to 3 bytes, while the dump contains 4-byte ones. Either set up the whole server to use utf8mb4 instead of utf8, or create the database with a binary character set:

create database wiki default character set binary;
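A minimal sketch of forcing utf8mb4 server-wide, assuming a Debian/Ubuntu-style config layout (the config path and service name vary by distribution):

# write a small override file instead of editing the main my.cnf
cat <<'EOF' | sudo tee /etc/mysql/conf.d/utf8mb4.cnf
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
EOF
sudo systemctl restart mysql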

 

> Prepare the database tables

Before importing, we need to prepare the database schema first. My solution is to create the database "wiki" and run the MediaWiki install process against it; that leaves me with a database containing every table I need. A sketch of the database creation follows.
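A minimal sketch of that setup, run before the MediaWiki installer; the user name and password here are placeholders of my own, not anything MediaWiki requires:

mysql -u root -p <<'EOF'
-- add DEFAULT CHARACTER SET binary here if you hit the utf8 bug above
CREATE DATABASE wiki;
CREATE USER 'wikiuser'@'localhost' IDENTIFIED BY 'change-me';
GRANT ALL PRIVILEGES ON wiki.* TO 'wikiuser'@'localhost';
EOF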

Start Import

The import process is simple. Don't forget to use the screen command if you are connected to the remote database server over SSH, because the import will take a few hours (see the sketch below).

mysql --force wiki < enwiki-latest-pages-articles-multistream.sql
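A minimal sketch of running the import inside screen, so a dropped SSH connection does not kill it:

screen -S wiki-import    # start a named session
mysql --force wiki < enwiki-latest-pages-articles-multistream.sql
# detach with Ctrl-a d; reattach later with:
screen -r wiki-import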

 

I use --force on the command to skip errors. If you see an important error (not a small one like a duplicate page title), it is better to empty the page, revision, and text tables (see the sketch below), fix the error, and then start the import again.
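A minimal sketch of emptying those tables, assuming the default empty table prefix; page, revision, and text are the tables MWDumper's SQL output writes to:

mysql wiki -e "TRUNCATE TABLE page; TRUNCATE TABLE revision; TRUNCATE TABLE text;"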

 

At the end, you will have a database of about 77 GB, and you can browse and search the wiki on your own site.

