Converting an online forum to a pdf collection.

Internet forums have a habit of being run by people who, very understandably, become interested in other things in life and, over the years, running the site can slip from being foremost in their minds. The key forum members can fall out and start banning each other, which doesn't help. Likewise, once-keen moderators can become inactive. I belonged to an old car forum which was administered by someone who simply never used it. If the site went down, someone had to ring her and ask if she could do a reboot. Needless to say, something eventually went wrong with the server hardware and, since there was no backup, everything was lost. The decades of accumulated expertise of its members were simply gone in a flash. Whilst that was about 15 years ago, with a forum run by old car fuddy-duddies that had only been running for a few years, a more recent and spectacular disaster was the Nekochan forum for 3D graphics wizards, which ran for about 15 years. Anyone trying to run a Silicon Graphics computer simply had to use this site to keep their system running. A couple of years ago (2018), the site administrator decided to pull the plug on the server, saying it was too difficult to make the forum GDPR-compliant, or something. The reasons don't matter, but everything was gone (except for bits saved by the users) and it never came back. 

It is interesting to look at discussions about the possibility of users backing up forums on phpBB bulletin boards, and it's quite staggering that the software experts have very little idea what it is like for a forum member to suddenly lose contact with all the expertise that was previously available to them. You can find posts saying things like "...just because the server goes down, it does not mean that the information is lost..."!! Well, yep, of course it doesn't mean the information is lost, but then... err... where is it?? The site admins aren't contactable, the members don't have each other's e-mail addresses because they were using PMs instead... so all you can do is keep trying in vain to reconnect to a website that never comes back. 

Hence, I have wondered if it is possible for users to back up phpBB forums themselves, and there are some interesting sites about this on the web and also some people who offer to do it as a service. It seems to require at least some knowledge of how php, scripting and databases work, etc, so that you can set up a local working copy of the forum on your own computer. Anyhow, right now I have absolutely no interest at all in learning that sort of thing. So is there a way that a numpty can make a safe copy of the publicly-available parts of an online forum? Who knows, it could well be all that's left of it in the future.

I had a pea-brained bumble around the forum I was interested in and found that each topic had a unique topic number which can be called up in this manner: 

http://www.volvo300mania.com/uk/forum/viewtopic.php?f=54&t=17749

where this particular discussion is referred to by an f-number and a t- or topic-number. I guessed that f is the internal forum number for the subject area, and more dunder-headed bungling around the forum suggested that this is true. So if we want to automatically call up each topic, do we also need its f-number? Thankfully the answer seems to be 'no' because if we miss out the "f=54&" and call up: 

http://www.volvo300mania.com/uk/forum/viewtopic.php?t=17749

we get exactly the same discussion. That's good, less brain-ache. So it looks like we could generate every possible value of t, up to the highest topic number in the forum, to see all of the topics ever discussed.

Ok, so there's a plan growing in my head to write a script that calls up every topic and grabs a copy. Do we want the copy to be in the form of html and picture files, stylesheets, smilies, etc?? Not really. What about something a bit more rock-solid like pdf?? Nutter! It would be mad to use pdf for something that changes all the time like a forum. True, but the old ones don't change that much because everyone is using bookface instead. How are you ever going to organise and search through thousands of pdf files to find the relevant ones? Well, it seems likely that there must be some pretty sophisticated software for text-searching pdf collections, as it's such a stable, standard and popular format. I'm not sure about this, but we'll leave that one for now and check it later. 

So, we'll go with pdf as the format of choice, partly because we can make a printer-friendly version of any topic on the forum by clicking on the spanner icon and selecting 'Print view'. But there are thousands of topics, so is there a way we can do that programmatically? Yo, it seems that indeed we can, by adding "&view=print" after the topic number: 

http://www.volvo300mania.com/uk/forum/viewtopic.php?t=17749&view=print 

This gives us a cut-down view of the topic with a white background and it still includes the pictures, which is great for pdf. However, one problem remains for large topics, which is that the server only dredges up 15 posts at a time and there doesn't seem to be a way for users to change the number of posts displayed at any one time. Again, I could be wrong on that, but for now let's see how we can call up the successive pages. Choosing a topic spread over several pages, it became pretty clear that we can select the page we want to view by specifying the starting post number of that page. In the example below we can call up page 2 as follows: 

http://www.volvo300mania.com/uk/forum/viewtopic.php?t=16827&start=15

OK, but you said there were 15 posts per page, so shouldn't page 2 start at post number 16, since posts 1 to 15 will be on page 1? Er, yep, did cross my mind too, but it turns out that the numbering is zero-based, i.e. the first post on page 1 is actually post number zero. You can check it yourself - set "&start=1" and you get the second post in that thread, rather than the first! Anyhow, it looks like we could step through all the pages of any particular topic by incrementing the "start" number in steps of 15. Yep, sounds good. Nice one, but there's a hitch, because if you go beyond the last page of a topic the server simply dredges up the last page again and again. We can show that for the above topic, which is spread over three pages. The last page therefore begins at post number 30, but we can increase the start number to 45, 60, 75, etc, and keep on generating exactly the same page. So if we do this with a script or something, how is it ever going to know that we have reached the last page if we don't get an error or anything when we have exceeded the page limit?? Good point. Well, we could look at the text within each page and see if it is the same as the text in the previous page and, if it is, stop reading more pages. Seems really dumb to be doing all this, but it might work??
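
Just to make the page sequence concrete, the three pages of the topic above would be fetched with:

http://www.volvo300mania.com/uk/forum/viewtopic.php?t=16827&start=0
http://www.volvo300mania.com/uk/forum/viewtopic.php?t=16827&start=15
http://www.volvo300mania.com/uk/forum/viewtopic.php?t=16827&start=30

and anything from start=45 upwards just serves up that last page again.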

So we could make a script that steps the t number up to the highest topic number in the forum and, for each value of t, increases the start number in steps of 15 till the text stops changing. "Nice one, cheers. What are you having? Piña colada! Corr, you must be sophisticated." (A. Sayle, "Ullo John...", 1982). At each step of the t number and the start number we grab a copy of the page which the server generates. Then, once we have read all the pages for that topic, we join them together. 

Alright then, so how do we do all this? Anyone remember ZX81 BASIC? Well, about the closest you can get to it is little Lua, a Brazilian scripting language with a reputation for being small and very quick. Its LuaSocket library has a module called socket.http that allows it to read web pages, and there is a program we can get called wkhtmltopdf which converts web pages to pdf. We run that for each page that we visit on the forum and use another program called pdfunite to join the pdf files together in the right order for each topic. We run these external programs with protected calls (pcall) so that if any step gives an error, the whole web-scraping process does not stop while you are asleep. A bit of housekeeping to delete unwanted files, etc, and we end up with one pdf file for each thread in the forum. This is the actual script, called e.g. scrapy.lua, and it's set up for linux, but I'm sure it could be changed to work on Windows, e.g. mv becomes move, rm becomes del, etc (there's a sketch of those tweaks just after the script), and both wkhtmltopdf and pdfunite are available for Windows. Alternatively, you could use a virtual linux shell like msys, msys2, wsl, etc?? 

http=require'socket.http'             -- the http module from the LuaSocket library
--iterator1 = io.lines("uniq.txt")    -- alternative: read topic numbers from a file instead
start_topic=0
end_topic=20000                       -- roughly the highest topic number on the forum
urlbit1="http://www.volvo300mania.com/fr/forum/viewtopic.php?t="   -- change this for your own forum
urlbit2="&start="
urlbit3="&view=print"
for topic=start_topic,end_topic do
--for topic in iterator1 do
    old_url_text_length=0
    saved_pages=""
    for start_page=0, 15000, 15 do    -- step through the pages of the topic, 15 posts at a time
        url=urlbit1..topic..urlbit2..start_page..urlbit3
        url_text, statusCode, headers, statusText = http.request(url)
        if statusCode ~=200 then
            print("Topic "..topic.." does not exist.")
            break
        end
        worked_OK, url_text_length=pcall(string.len,url_text)
        if not worked_OK then
            print("Error getting length of text for topic: "..topic)
            break
        elseif start_page==0 then
            print("New topic: "..topic)
        elseif url_text_length==old_url_text_length then
            -- same length as the previous page, so assume we have run off the end of the topic
            print("Reached end of this topic.")
            break
        end
        old_url_text_length=url_text_length
        newfile=topic.."_"..start_page..".pdf"
        wkhtmltopdf_string="wkhtmltopdf ".."\""..url.."\" "..newfile
        --print(wkhtmltopdf_string)
        if pcall(os.execute, wkhtmltopdf_string) then
            print("Saved: "..newfile)
            saved_pages=saved_pages..newfile.." "
        else
            print("Error converting topic: "..topic.." to pdf.")
        end
    end
    combined_file=topic..".pdf"
    _, count = string.gsub(saved_pages, ".pdf ", ".pdf ")   -- gsub's second return value counts the saved pages
    if count>0 then
        print("Saved pages: "..saved_pages)
    end
    if count>1 then
        pcall(os.execute, "pdfunite "..saved_pages..combined_file)   -- join the pages in order
        print("Combined as: "..combined_file)
        pcall(os.execute, "rm "..saved_pages)                        -- tidy up the single-page files
        print("Removed originals: "..saved_pages)
    elseif count==1 then
        pcall(os.execute, "mv "..saved_pages..combined_file)         -- only one page, so just rename it
        print("Renamed original: "..saved_pages.." to: "..combined_file)
    end
end
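
For anyone wanting to try it on Windows, I'd guess (untested) that only the two tidy-up commands need swapping, assuming wkhtmltopdf and pdfunite are installed and on the PATH - something like:

pcall(os.execute, "del "..saved_pages)                      -- was: rm (delete the single-page files)
pcall(os.execute, "move "..saved_pages..combined_file)      -- was: mv (rename a lone page file)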

Now, I said something about detecting the end of a topic by generating each successive page till the text stops changing. Well, that was what my original scripts did, but there was a slight complication because some pages contain images that have been uploaded to the forum itself, rather than to an external site, e.g. Photobucket, etc. Each time you visit one of these pages, the forum software updates the number of times the image has been viewed. Therefore if you keep reloading the page, the number of views shown in that page will keep incrementing and the text within that page will keep changing. Hence, it turned out that comparing the actual text within a page with that of the previous page is not a good way of detecting the end of the thread. Brain-ache, but I think we can solve it practically by looking at the length of the text within a page and comparing that with the number of characters in the previous page (the view counter still ticks up, but going from, say, 57 to 58 views doesn't change the number of characters, so the length almost always stays put). That's not an absolutely fail-safe way of doing it, because two successive pages could still be very different and have exactly the same text-length, but it's unlikely and this is probably the best I can do, given that I don't want to spend weeks on this or learn php, etc. So we use Lua's string.len() function to get the text-length of each page in a topic and, when that stops changing from one page to the next, we stop looking for further pages in that topic. I found out the hard way that if your script keeps reloading the same page thousands of times, you will be wasting tons of bandwidth and the server may start to throttle your downloads. Hence, it is best to test out this sort of thing late at night when networks are quiet. 

There's a link to the lua script here and a prettified version here. There's an annotated picture of the script below showing the bits that would need changing for any particular forum and some of the points discussed above. 

Just install Lua and type: 

lua scrapy.lua 

or something like:

lua scrapy.lua > scrapy_results.log & 

on linux to save the results in a log file. The topics on the forum will then be saved as separate pdf files in the form topic_number.pdf in your working folder. 

Since there are about 20,000 topics in the forum that I wanted to back up, netiquette suggests that we should do this in blocks of only a few thousand topics at a time, when the server and network are relatively lightly used, e.g. probably after midnight. 

Anyhow, after a few nights of downloading, the resulting pdf files look reasonably OK. 



Manuals

Some of the files which can be downloaded from the forum are scanned-in manuals which are already in pdf format. However, they are not searchable, so they need to be OCR'd. In addition, they are password-protected and the password therefore has to be removed so that the OCR can work. This can be done using a suitable online service. The unlocked file can be OCR'd by installing ocrmypdf and running it as follows: 

ocrmypdf --force-ocr input-unlocked.pdf output.pdf

Since the pdf file contains some text already (but it is of no use), we need to use the --force-ocr option. The OCR'd file is over twice the size of the original, so we need to resample the images with ps2pdf.

ps2pdf -dPDFSETTINGS=/screen input.pdf output.pdf
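
If there are several manuals to do, the two steps could be batched with a few lines of Lua in the same style as the scraper. This is only a rough sketch - the file names below are made up, and it assumes ocrmypdf and ps2pdf are installed:

-- Rough sketch: OCR each unlocked manual and then shrink the result with ps2pdf.
manuals={"manual1-unlocked.pdf", "manual2-unlocked.pdf"}   -- made-up names, list the real ones here
for _, f in ipairs(manuals) do
    ocr_file=f:gsub("%.pdf$", "-ocr.pdf")
    small_file=f:gsub("%.pdf$", "-ocr-small.pdf")
    pcall(os.execute, "ocrmypdf --force-ocr "..f.." "..ocr_file)
    pcall(os.execute, "ps2pdf -dPDFSETTINGS=/screen "..ocr_file.." "..small_file)
    print("Processed: "..f)
end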

I guess one needs to think about copyright when distributing manufacturers' manuals, but they are already on the web, the products are all long, long obsolete and there is no profit in any of this. Indeed, we maintain enthusiasm for the brand, m'lud! 

Topic 16367

I'm really laying on the BS now to make this sound like Rosalind Franklin's photograph 51, which it isn't, but this particular topic had a problem because the very well-meaning and brilliant person who uploaded it used links to large-format photographs that weren't optimised for the web. The result was that when the pages were converted to pdf, the pictures were not scaled properly and were very much bigger than the printed pages, so that the viewer could only see about a quarter of each image. Setting the page size to A1 seemed to solve the problem but the text then became absolutely minuscule. What was needed was some way of downloading all the images and resizing them. Hence, for this topic it was necessary to download all the html, css, images, etc, with wget so that the pictures could be resized locally. The topic was spread over 26 pages, so it's best to make a script (based on the previous one) to step through these pages and download everything shown in them. 

The wget options are as follows:

wget -E -H -k -r -l 1 -w 2 --random-wait --tries=1 http://www.example.com

-E saves everything with an html extension, which helps a lot later on (!),
-H makes it span hosts, e.g. visit the websites hosting the images,
-k converts links to pictures, etc, so that they point to the downloaded copies,
-r recursive magic - follows and downloads all the links in the page,
-l 1 makes the recursion only go one step away from the page to save time, diskspace, etc,
-w 2 makes it wait an average of 2 seconds between each download for server etiquette,
--random-wait varies the actual wait time so the server doesn't think we're a bot,
--tries=1 stops it spending ages repeatedly trying to download broken links.
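
Applied to one page of this particular topic, the whole command comes out something like this (it is exactly what the script below builds for each page):

wget -E -H -k -r -l 1 -w 2 --random-wait --tries=1 "http://www.volvo300mania.com/uk/forum/viewtopic.php?t=16367&start=0&view=print"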

I'm not saying this is the absolute best combination of options, but wget is pretty flexible, so you can cook up your own recipe and it will usually work. The following lua script runs wget on each of the 26 pages of this thread and saves all the html and pictures in a multitude of local folders.

http=require'socket.http'
--iterator1 = io.lines("uniq.txt")
start_topic=16367
end_topic=16367
urlbit1="http://www.volvo300mania.com/uk/forum/viewtopic.php?t="
urlbit2="&start="
urlbit3="&view=print"
for topic=start_topic,end_topic do
--for topic in iterator1 do
        old_url_text_length=0
        saved_pages=""
        for start_page=0, 500, 15 do
            url=urlbit1..topic..urlbit2..start_page..urlbit3
            url_text, statusCode, headers, statusText = http.request(url)
            if statusCode ~=200 then
               print("Topic "..topic.." does not exist.")
               break
            end
            worked_OK, url_text_length=pcall(string.len,url_text)
            if not worked_OK then
               print("Error getting length of text for topic: "..topic)
               break
            elseif start_page==0 then
               print("New topic: "..topic)
            elseif url_text_length==old_url_text_length then
               print("Reached end of this topic.")
               break
            end
            old_url_text_length=url_text_length
            newfile=topic.."_"..start_page
            wget_string="wget -E -H -k -r -l 1 -w 2 --random-wait --tries=1 ".."\""..url.."\""
            --print(wget_string)
            if pcall(os.execute, wget_string) then
                print("Saved: "..newfile)
            else
                print("Error saving: "..newfile)
            end
        end
end

The above file is here. The majority of folders which were downloaded were just junk, but the images of interest were present in two folders. You can use a file manager to find the images (e.g. search for *.jpg), select the ones we want to resize (i.e. the big ones) and make a list of them along with their path names. We can then use another free software tool called imagemagick to shrink the pictures so that they fit in A4-size pages. The relevant command is mogrify (to shrink and overwrite the original) and we need to edit our list of filenames with a spreadsheet or something so that it looks like this:

...
mogrify -resize 30% "/home/jon/bobby_charlton/farm5.staticflickr.com/4281/34830732684_dcf0cf2ce3_k.jpg"
mogrify -resize 30% "/home/jon/bobby_charlton/farm5.staticflickr.com/4277/34830723894_4f3ddbc198_k.jpg"
mogrify -resize 30% "/home/jon/bobby_charlton/farm5.staticflickr.com/4288/34830728834_0b3168b9b6_k.jpg"
mogrify -resize 30% "/home/jon/bobby_charlton/farm5.staticflickr.com/4277/34830418794_c2254ae261_k.jpg"
...
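
Alternatively, if you'd rather skip the spreadsheet editing, a short Lua sketch can do the same job - this assumes the paths of the over-sized images have been saved one per line in a text file called, say, big_images.txt:

-- Rough sketch: shrink every image listed (one path per line) in big_images.txt.
for path in io.lines("big_images.txt") do
    cmd="mogrify -resize 30% \""..path.."\""
    print(cmd)
    pcall(os.execute, cmd)
end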

The speech marks around the filenames are there in case they contain spaces and other oddities that would otherwise confuse imagemagick or the shell. Once the images are resized, we can run another lua script which regenerates each page of the forum thread using the shrunken images instead. As before, each page is converted to pdf by wkhtmltopdf and the pdf files are joined together by pdfunite.

urlbit1="/home/jon/bobby_charlton/www.volvo300mania.com/uk/forum/viewtopic.php?t=16367&start="
urlbit2="&view=print.html"
saved_pages=""
for start_page=0, 375, 15 do
    url=urlbit1..start_page..urlbit2
    newfile="16367_"..start_page..".pdf"
    wkhtmltopdf_string="wkhtmltopdf ".."\""..url.."\" "..newfile
    print(wkhtmltopdf_string)
    os.execute(wkhtmltopdf_string)
    saved_pages=saved_pages..newfile.." "
end
combined_file="16367.pdf"
os.execute("pdfunite "..saved_pages..combined_file)

This script is here and it seems to work OK, generating a 933-page pdf file in which the pictures all seem to fit within the A4 pages. However, the file is 200 MB and therefore really too big to make available to other forum users as part of a, sort of, community backup. Thankfully, we can make a smaller version by further down-sampling the pictures within it using the following ghostscript command:

ps2pdf -dPDFSETTINGS=/screen 16367.pdf 16367_screen.pdf

The resulting file is only 40 MB and looks reasonably OK as long as you don't zoom in too much on the pictures. The text is quite small compared to the pictures, but you can make a version with bigger text by using a browser extension like PrintWhatYouLike (PWYL) and printing each page of the topic to a pdf file. This is just about OK for a 26-page topic, but not great for anything much bigger than that. You can (on linux) list the resulting pdf files in a handy way for more scripting with the command:

echo `ls -1v *.pdf`

and join them together with:

pdfunite 1.pdf 2.pdf 3.pdf ... 23.pdf 24.pdf 25.pdf 26.pdf 16367_PWYL.pdf
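
With 26 files that's a fair bit of typing, so here is a little sketch (run it in a folder containing only those page pdfs) that builds the same command from the version-sorted listing:

-- Rough sketch: join the version-sorted page pdfs into one file.
handle=io.popen("ls -1v *.pdf")
list=handle:read("*a"):gsub("\n", " ")    -- "1.pdf 2.pdf ... 26.pdf "
handle:close()
os.execute("pdfunite "..list.."16367_PWYL.pdf")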

To shrink down the combined file, this sort of thing did a reasonable job as a mobile-friendly version:

ps2pdf -dDownsampleColorImages=true -dColorImageResolution=48 16367_PWYL.pdf 16367_PWYL_screen.pdf



Managing the pdf collection

Coolio, so we have almost 20,000 pdf files in our collection and we need some way of managing them so that we can search for files containing information on specific subjects and ideally rank them in some way to select the best ones first. A few google searches suggest, as I suspected, that there are quite a few tools for doing this, including Adobe's own. My choice is pretty arbitrary, but the one I have installed is recoll, which has a chequerboard icon with one odd square out, suggestive of finding a needle in a haystack, and you can get it from a funny-sounding website here whose name means good accounting, or something. 

When we start up recoll it, like similar programs, needs to make an index of the files it has to search through; this takes some time, but makes subsequent searches virtually instantaneous. Recoll will by default try to index all the files on your computer, which might be desirable, but in my case I will just index the folder containing the pdfs. So we go to Preferences -> Index configuration -> Global parameters -> Top directories and set this to the folder containing the pdf files. Under the Files tab you can then start the indexing process, which takes a few minutes. We can then search our pdf collection for files containing certain words, or do Boolean and proximity searches for more useful results. Searching for the word "cats" definitely seems to bring up a very cat-rich result as the top hit, so it looks promising. 
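
(For what it's worth, I believe the same thing can be set by editing the topdirs line in the ~/.recoll/recoll.conf file directly, though I haven't double-checked the exact syntax and the path below is just an example:)

topdirs = /home/jon/forum_pdfs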


 
Note to self: I found the best way to view the results with qpdfview is to add the following line to the end of the .recoll/mimeview file: 

application/pdf = qpdfview --unique --search %s %f#%p

so that the pdf files of successive hits can be opened as separate tabs in the right order. 


Also tried DocFetcher, which works great too and is (unlike recoll) free for Windows.

One more complication

As a result of downloading the files, it became apparent that some of the pictures were missing. A closer look showed that it was only those which had been uploaded to the forum and which, as opposed to being attached 'inline', were simply, err, attached. The result is that they appear with a wide grey outline around them saying "ATTACHMENTS" at the top. Should these be called outlined pictures? Dunno? Perhaps, so we'll call them just that. Anyway, the problem is that these outlined pictures do not appear in the print view and are therefore not included in the downloaded pdf version of the forum thread. A few web searches suggested that it is a known effect, but it was not obvious how to solve it, certainly not for users, anyway. A quick PM to the forum administrator to ask if this could be changed somehow sat in my outbox for 9 months (but it has now been received). In the meantime I decided to simply re-scrape the whole forum, download the affected pages (i.e. blocks of 15 posts) in the normal (non-print) view and append them to each of their respective pdf files from the previous scrape. To find specifically those pages containing outlined images, we do a case-sensitive search of the page text looking for the word "Attachments", which is only present in the affected pages; it could, of course, also crop up in the user-text, but that would just mean saving the odd extra page, which does not matter. Here is the script which downloads the extra pages as pdfs and gives them the suffix "_extra_pictures.pdf". When more than one page in a thread contains outlined pictures, these pages are joined together. 

http=require'socket.http'
start_topic=15000
end_topic=18000
urlbit1="http://www.volvo300mania.com/uk/forum/viewtopic.php?t="
urlbit2="&start="
for topic=start_topic,end_topic do
        old_url_text_length=0
        saved_pages=""
        for start_page=0, 15000, 15 do
            url=urlbit1..topic..urlbit2..start_page
            url_text, statusCode, headers, statusText = http.request(url)
            if statusCode ~=200 then
               print("Topic "..topic.." does not exist.")
               break
            end
            worked_OK, url_text_length=pcall(string.len,url_text)
            if not worked_OK then
               print("Error getting length of text for topic: "..topic)
               break
            elseif start_page==0 then
               print("New topic: "..topic)
            elseif url_text_length==old_url_text_length then
               print("Reached end of this topic.")
               break
            end
            old_url_text_length=url_text_length
            newfile=topic.."_"..start_page..".pdf"
            if string.find(url_text,"Attachments") then
               print("Outlined attachment present, will save pages as screen view.")
               wkhtmltopdf_string="wkhtmltopdf ".."\""..url.."\" "..newfile
               --print(wkhtmltopdf_string)
               if pcall(os.execute, wkhtmltopdf_string) then
                   print("Saved: "..newfile)
                   saved_pages=saved_pages..newfile.." "
               else
                   print("Error converting topic: "..topic.." to pdf.")
               end
            end
        end
        if saved_pages~="" then
            print("Saved pages: "..saved_pages)
            combined_file=topic.."_extra_pictures.pdf"
            _, count = string.gsub(saved_pages, ".pdf ", ".pdf ")
            if count>1 then
              pcall(os.execute, "pdfunite "..saved_pages..combined_file)
              print("Combined as: "..combined_file)
              pcall(os.execute, "rm "..saved_pages)
              print("Removed originals: "..saved_pages)
            elseif count==1 then
              pcall(os.execute, "mv "..saved_pages..combined_file)
              print("Renamed original: "..saved_pages.." to: "..combined_file)
            end
        end
end

The script is here and here. This resulted in about 500 pdf files being downloaded, each of which had to be tagged onto the end of the main pdf file for its respective topic. This was done by using a file manager to list the new files, a spreadsheet to make up a pdfunite command file, as follows, and then the multi-rename tool of the file manager to rename the newly-generated files back to their original names, for simplicity. 

pdfunite 173.pdf appendix.pdf 173_extra_pictures.pdf 173_new.pdf
pdfunite 2546.pdf appendix.pdf 2546_extra_pictures.pdf 2546_new.pdf
pdfunite 2748.pdf appendix.pdf 2748_extra_pictures.pdf 2748_new.pdf
pdfunite 4980.pdf appendix.pdf 4980_extra_pictures.pdf 4980_new.pdf

Note that a banner page (appendix.pdf) is added between the main thread and the pages containing the additional pictures to give some explanation, hopefully. The additional pages are not searchable - for some reason the text (white on a grey background) has been stored as pictures, and I did not fancy OCR'ing it as it would have taken yet more time and would only duplicate existing text. 
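
For the record, the spreadsheet step could probably be scripted too - something along these lines (untested, assuming all the topic pdfs, the _extra_pictures files and appendix.pdf sit in the same folder):

-- Rough sketch: append each *_extra_pictures.pdf to its main topic pdf,
-- with appendix.pdf as the banner page in between.
handle=io.popen("ls *_extra_pictures.pdf")
for extra in handle:lines() do
    topic=string.match(extra, "^(%d+)_extra_pictures%.pdf$")
    if topic then
        pcall(os.execute, "pdfunite "..topic..".pdf appendix.pdf "..extra.." "..topic.."_new.pdf")
        print("Made: "..topic.."_new.pdf")
    end
end
handle:close()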

Distribution

OK, so I'm going to see if it's practicable to distribute this by cloud storage. The zipped-up files come to just over 6 GB, which is too big for many free cloud storage sites, let alone ones that allow file sharing, and it's also a bit too big for standard DVDs but should be OK for the 8.5 GB dual-layer ones. Hence, the plan is to try my current host, which is ADrive, and if it's all too slow I will send out a few memory sticks or dual-layer DVDs to interested forum members. 

In the end I went for Google Drive, which is free up to 15 GB, and the final thing is here.

Allow an hour or two for it to download, about an hour or two to unzip it and another hour to index it. 
