Convert online forum to pdf collection.
Internet forums have a habit of being run by people who, very understandably, become interested in other things in life and, over the years, running the site can slip from being foremost in their minds. The key forum members can fall out and start banning each other, which doesn't help. Likewise, once very keen moderators can also become inactive. I belonged to an old car forum which was administered by someone who simply never used it. If the site went down, someone had to ring her and ask if she could do a reboot. Needless to say, something went wrong with the server hardware and, since there was no backup, everything was lost. The decades of accumulated expertise of its members were simply gone in a flash. Whilst that was about 15 years ago with a forum run by old car fuddy-duddies that had only been running for a few years, a more recent and spectacular disaster was the Nekochan forum for 3D graphics wizards, which ran for about 15 years. Anyone trying to run a Silicon Graphics computer simply had to use this site to keep their system running. A couple of years ago (2018), the site administrator decided to pull the plug on the server, saying it was too difficult to make the forum GDPR-compliant, or something. The reasons don't matter but everything was gone (except for bits saved by the users) and it never came back.
It is interesting to look at discussions about the possibility of users backing up phpBB bulletin boards and it's quite staggering that the software experts have very little idea what it is like for a forum member to suddenly lose contact with all the expertise that was previously available to them. You can find posts saying things like "...just because the server goes down, it does not mean that the information is lost..."!! Well, yep, of course it doesn't mean the information is lost, but then... err... where is it?? The site admins aren't contactable, the members don't have each other's e-mail addresses because they were using PMs instead... so all you can do is keep trying in vain to reconnect to a website that never comes back.
Hence, I have wondered if it is possible for users to back up phpBB forums themselves and there are some interesting sites about this on the web, and also some people who offer to do it as a service. It seems to require at least some knowledge of how php, scripting and databases work, etc, so that you can set up a local working copy of the forum on your own computer. Anyhow, right now I have absolutely no interest at all in learning that sort of thing. So is there a way that a numpty can make a safe copy of the publicly-available parts of an online forum? Who knows, it could well be all that's left of it in the future.
I had a pea-brained bumble around the forum I was interested in and found that each topic had a unique topic number which can be called up in this manner:
http://www.volvo300mania.com/uk/forum/viewtopic.php?f=54&t=17749
where this particular discussion is referred to by an f-number and a t- or topic-number. I guessed that f is the internal forum number for the subject area and more dunder-headed bungling around the forum suggested that this is true. So if we want to automatically call up each topic, do we also need its f-number? Thankfully the answer seems to be 'no' because if we miss out the "f=54&" and call up:
http://www.volvo300mania.com/uk/forum/viewtopic.php?t=17749
we get exactly the same discussion. That's good, less brain-ache. So it looks like we could generate every possible value of t, up to the total number of topics in the forum, to see all of the topics ever discussed.
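As a quick sanity check, you can try this idea from a script. Here's a minimal sketch using the LuaSocket library (the same socket.http that the full script further down relies on), just fetching the example topic above and reporting whether anything came back:

-- minimal sketch: fetch one topic by its t-number and see whether anything comes back
local http = require("socket.http")
local url = "http://www.volvo300mania.com/uk/forum/viewtopic.php?t=17749"
local body, status = http.request(url)
if status == 200 then
  print("Topic found, page is "..#body.." characters long.")
else
  print("No joy, status: "..tostring(status))
end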
OK, so there's a plan growing in my head to write a script that calls up every topic and grabs a copy. Do we want the copy to be in the form of html and picture files, stylesheets, smilies, etc?? Not really. What about something a bit more rock-solid like pdf?? Nutter! It would be mad to use pdf for something that changes all the time like a forum. True, but the old ones don't change that much because everyone is using bookface instead. How are you ever going to organise and search through thousands of pdf files to find the relevant ones? Well, it would seem likely that there must be some pretty sophisticated software for text-searching pdf collections as it's such a stable, standard and popular format. I'm not sure about this, but we'll leave that one for now and check it later.
So, we'll go with pdf as the format of choice, partly because we can make a printer-friendly version of any topic on the forum by clicking on the spanner icon and selecting 'Print view'. But there are thousands of topics, so is there a way we can do that programmatically? Yo, it seems that indeed we can by adding "&view=print" after the topic number:
http://www.volvo300mania.com/uk/forum/viewtopic.php?t=17749&view=print
This gives us a cut-down view of the topic with a white background and it still includes the pictures, which is great for pdf. However, one problem remains for large topics, which is that the server only dredges up 15 posts at a time and there doesn't seem to be a way for users to change the number of posts displayed at any one time. Again, I could be wrong on that, but for now let's see how we can call up the successive pages. Choosing a topic spread over several pages, it became pretty clear that we can select the page we want to view by specifying the starting post number of that page. In the example below we can call up page 2 as follows:
http://www.volvo300mania.com/uk/forum/viewtopic.php?t=16827&start=15
OK, but you said there were 15 posts per page so shouldn't page 2 start at post number 16 since posts 1 to 15 will be on page 1? Er, yep, did cross my mind, too, but it turns out that the numbering is zero-based, i.e. the first post on page 1 is actually post number zero. You can check it yourself - set "&start=1" and you get the second post in that thread, rather than the first! Anyhow, it looks like we could step through all the pages of any particular topic by incrementing the "start" number in steps of 15. Yep, sounds good. Nice one, but there's a hitch because if you go beyond the last page of a topic, the server simply dredges up the last page again and again. We can show that for the above topic which is spread over three pages. The last page therefore begins at post number 30 but we can increase the start number to 45, 60, 75, etc, and we keep on generating exactly the same page. So if we do this with a script or something, how is it ever going to know that we have reached the last page if we don't get an error or anything when we have exceeded the page limit?? Good point. Well, we could look at the text within each page and see if it is the same as the text in the previous page and, if it is, we stop reading more pages. Seems really dumb to be doing all this, but it might work??
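Just to illustrate the page-stepping idea before the full script, here's a rough sketch, again using LuaSocket, with the three-page example topic above. Note that the full script below ends up comparing text lengths rather than the raw text, for reasons explained further down:

-- rough sketch: step through the pages of topic 16827 in jumps of 15 and
-- stop when the server starts sending back the same page again
local http = require("socket.http")
local previous
for start = 0, 150, 15 do
  local url = "http://www.volvo300mania.com/uk/forum/viewtopic.php?t=16827&start="..start.."&view=print"
  local body, status = http.request(url)
  if status ~= 200 or body == previous then break end
  print("Page starting at post "..start.." is "..#body.." characters.")
  previous = body
end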
So we could make a script that increases the t number up to the total number of topics in the forum and for each value of t we increase the start number in steps of 15 till the text stops changing. "Nice one, cheers. What are you having? Piña colada! Corr, you must be sophisticated." (A. Sayle, "Ullo John...", 1982). At each step on the t number and the start number, we grab a copy of the page which the server generates. Then, once we have read all the pages for that topic, we join them together.
Alright, so how do we do all this then? Anyone remember ZX81 BASIC? Well, about the closest you can get to it is little Lua, a Brazilian scripting language which is apparently a good deal quicker than most of its competitors. It has a library called socket.http that allows it to read web pages, and there is a program we can get called wkhtmltopdf which converts web pages to pdf. We run that for each page that we visit on the forum and use another program called pdfunite to join the pdf files together in the right order for each topic. We run these external programs with protected calls (pcall) so that if any step gives an error, the whole web-scraping process does not stop while you are asleep. A bit of housekeeping to delete unwanted files, etc, and we end up with one pdf file for each thread in the forum. This is the actual script, called e.g. scrapy.lua, and it's set up for linux, but I'm sure it could be changed to work on Windows, e.g. mv becomes move, rm becomes del, etc, and both wkhtmltopdf and pdfunite are available for Windows. Alternatively, you could use a linux-like shell environment such as msys, msys2, wsl, etc.
-- scrapy.lua: save every topic of a phpBB forum as one pdf file per topic.
-- Needs the luasocket library plus the wkhtmltopdf and pdfunite programs.
http = require("socket.http")

-- the bits that would need changing for any particular forum:
start_topic = 0
end_topic = 20000
urlbit1 = "http://www.volvo300mania.com/fr/forum/viewtopic.php?t="
urlbit2 = "&start="
urlbit3 = "&view=print"

--iterator1 = io.lines("uniq.txt")  -- alternative: read topic numbers from a file
for topic = start_topic, end_topic do
--for topic in iterator1 do
  old_url_text_length = 0
  saved_pages = ""
  for start_page = 0, 15000, 15 do  -- 15 posts per page, zero-based start numbers
    url = urlbit1..topic..urlbit2..start_page..urlbit3
    url_text, statusCode, headers, statusText = http.request(url)
    if statusCode ~= 200 then
      print("Topic "..topic.." does not exist.")
      break
    end
    -- compare this page's text length with the previous page's to spot the last page
    worked_OK, url_text_length = pcall(string.len, url_text)
    if not worked_OK then print("Error getting length of text for topic: "..topic) break
    elseif start_page == 0 then print("New topic: "..topic)
    elseif url_text_length == old_url_text_length then
      print("Reached end of this topic.")
      break
    end
    old_url_text_length = url_text_length
    -- convert this page of the topic into its own pdf file
    newfile = topic.."_"..start_page..".pdf"
    wkhtmltopdf_string = "wkhtmltopdf ".."\""..url.."\" "..newfile
    --print(wkhtmltopdf_string)
    if pcall(os.execute, wkhtmltopdf_string)
    then
      print("Saved: "..newfile)
      saved_pages = saved_pages..newfile.." "
    else
      print("Error converting topic: "..topic.." to pdf.")
    end
  end
  -- join the per-page pdfs into a single pdf for this topic and tidy up
  combined_file = topic..".pdf"
  _, count = string.gsub(saved_pages, ".pdf ", ".pdf ")  -- count how many pages were saved
  if count > 0 then
    print("Saved pages: "..saved_pages)
  end
  if count > 1 then
    pcall(os.execute, "pdfunite "..saved_pages..combined_file)
    print("Combined as: "..combined_file)
    pcall(os.execute, "rm "..saved_pages)
    print("Removed originals: "..saved_pages)
  elseif count == 1 then
    pcall(os.execute, "mv "..saved_pages..combined_file)
    print("Renamed original: "..saved_pages.." to: "..combined_file)
  end
end
Now, I said something about detecting the end of a topic by generating each successive page till the text stops changing; well, that was what my original scripts did, but there was a slight complication because some pages contain images that have been uploaded to the forum itself, rather than to an external site, e.g. Photobucket. Each time you visit one of these pages, the forum software updates the number of times the image has been viewed. Therefore if you keep reloading the page, the number of views shown in that page will keep incrementing and the text within that page will keep changing. Hence, it turned out that comparing the actual text within a page with that of the previous page is not a good way of detecting the end of the thread. Brain-ache, but I think we can solve it practically by looking at the length of the text within a page and comparing that with the number of characters in the previous page. That's not an absolutely fail-safe way of doing it, because two successive pages could still be very different and have exactly the same text-length, but it's unlikely and this is probably the best I can do, given that I don't want to spend weeks on this or learn php, etc. So we use Lua's string.len() function to get the text-length of each page in a topic and when that stops changing from one page to the next, we stop looking for further pages in that topic. I found out the hard way that if your script keeps reloading the same page thousands of times, you will be wasting tons of bandwidth and the server may start to throttle your downloads. Hence, it is best to test out this sort of thing late at night when networks are quiet.
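For what it's worth, if you did want to compare the actual text, one rough idea (not what the script above does, and the wording of the counter is an assumption that would need checking against the real pages) would be to strip the view counters out before comparing:

-- sketch only, not used in the script above: strip the view counters before comparing pages
-- (the "Viewed %d+ times" pattern is a guess and would need checking on the real forum)
function strip_counters(page_text)
  return (string.gsub(page_text, "Viewed %d+ times", ""))
end
-- then stop when strip_counters(url_text) equals the stripped text of the previous page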
There's a link to the lua script here and a prettified version here. There's an annotated picture of the script below showing the bits that would need changing for any particular forum and some of the points discussed above.
Just install Lua (plus the luasocket library, and the wkhtmltopdf and pdfunite programs) and type:
lua scrapy.lua
or something like:
lua scrapy.lua > scrapy_results.log &
on linux to save the results in a log file. The topics on the forum will then be saved as separate pdf files in the form topic_number.pdf in your working folder.
Since there are about 20,000 topics in the forum that I wanted to backup, netiquette suggests that we should do this in blocks of only a few thousand topics at a time when the server and network are relatively lightly used, e.g. probably after midnight.
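If you want to be even gentler on the server, LuaSocket also provides socket.sleep(), so a pause could be dropped into the page loop; the script above doesn't do this, it's just a suggestion:

local socket = require("socket")
-- e.g. at the end of the start_page loop, after each page has been converted:
socket.sleep(2)  -- wait a couple of seconds before asking for the next page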
Anyhow, after a few nights of downloading, the resulting pdf files look reasonably OK.
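And on the earlier question of searching them: assuming the pdfs all end up in one folder, a tool like pdfgrep (packaged for most linux distributions) will search the whole lot in one go, with something like:

pdfgrep -i "head gasket" *.pdf

which should print every matching line together with the pdf it came from, ignoring case (the keyword is just an example, obviously).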
Manuals
Topic 16367
I also wanted to keep the original html and pictures for this thread (topic 16367), and wget is the tool for that. I'm not saying the options below are the absolute best combination, but wget is pretty flexible so you can cook up your own and it will usually work (-E adds .html extensions to the saved pages, -H lets it fetch pictures hosted on other sites, -k converts the links so the pages work locally, -r -l 1 recurses just one level deep, -w 2 with --random-wait pauses politely between requests, and --tries=1 stops it retrying dead links). The following lua script runs wget on each of the 26 pages of this thread and saves all the html and pictures in a multitude of local folders.
-- variant of scrapy.lua: instead of converting to pdf, save each page of a topic
-- as html plus its pictures using wget. Needs the luasocket library and wget.
http = require("socket.http")

--iterator1 = io.lines("uniq.txt")  -- alternative: read topic numbers from a file
start_topic = 16367
end_topic = 16367
urlbit1 = "http://www.volvo300mania.com/uk/forum/viewtopic.php?t="
urlbit2 = "&start="
urlbit3 = "&view=print"

for topic = start_topic, end_topic do
--for topic in iterator1 do
  old_url_text_length = 0
  saved_pages = ""
  for start_page = 0, 500, 15 do
    url = urlbit1..topic..urlbit2..start_page..urlbit3
    url_text, statusCode, headers, statusText = http.request(url)
    if statusCode ~= 200 then
      print("Topic "..topic.." does not exist.")
      break
    end
    -- stop when the text length stops changing, i.e. we have reached the last page
    worked_OK, url_text_length = pcall(string.len, url_text)
    if not worked_OK then print("Error getting length of text for topic: "..topic) break
    elseif start_page == 0 then print("New topic: "..topic)
    elseif url_text_length == old_url_text_length then
      print("Reached end of this topic.")
      break
    end
    old_url_text_length = url_text_length
    newfile = topic.."_"..start_page
    -- mirror this page and its pictures (one level deep) into local folders
    wget_string = "wget -E -H -k -r -l 1 -w 2 --random-wait --tries=1 ".."\""..url.."\""
    --print(wget_string)
    if pcall(os.execute, wget_string)
    then
      print("Saved: "..newfile)
    else
      print("Error saving: "..newfile)
    end
  end
end