SplitBlogBook: VBA code to Generate Contents Internal Links List for HTML blog book and/or Split it into smaller HTML files

Last updated on 9 Sep. 2023 

9 Sep. 2023 Update: I rarely use the software covered in this post now. To see the current software that I use for creating Blogger blogbooks, please visit: Short User Guide to creating Blogger Blogbooks from Backup/Export File using ExportFileFilterAndGenBook and other VBA projects' macros/code (free and open source), https://ravisiyermisc.blogspot.com/2023/09/short-user-guide-to-creating-blogger.html .

end-Update 9 Sep. 2023

Note: At a convenient time, I intend to rename the SplitBlogBook program/macro to something like GenContentLinksListAndOrSplitBlogBook as that will better describe its current functionality.

The VBA program BlogExportFileToBook described in my post: BlogExportFileToBook: VBA code to produce HTML blog book from Blogger blog backup/export XML file, https://ravisiyermisc.blogspot.com/2023/07/BlogExportFileToBook-vba-code.html , produces a single HTML blog book for a blog having all its posts, pages and comments. In some cases, the produced blog book is way too large (e.g. 32 MB UTF-16 LE encoding file having around 16 million UNICODE characters) for Chrome or Edge browsers to load fully on my 4 GB RAM Windows 10 PC. Then I need to split this large HTML blog book file into smaller pieces, say of around 5 MB size each. That need resulted in the SplitBlogBook program described in this post.

Here's the folder share for this macro, containing the associated files: SplitBlogBook, https://drive.google.com/drive/folders/1C6bhnwyTlOJp34BFxekWrdk5-YkKyyy5?usp=drive_link . To save space, large input file(s) have not been put in the Google Drive share folder.

1st Aug 2023 Update Start

Main changes in this version:

  • Added parameters GenerateContentsLinks (boolean) and the associated ContentsLinksInsertMarker to the main Sub, SplitBlogBook.
  • Modified the prompt dialog to allow the user to choose to generate contents links either with a split or without one (SplitSize 0).
  • Improved message text in the Debug.Print log as well as in user-interface MsgBox(es).
  • Cleaned up code, including deletion of commented-out old function/sub versions.

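As an illustration of how these parameters might fit together, here is a hypothetical sketch. Apart from GenerateContentsLinks and ContentsLinksInsertMarker, the parameter names below are my assumptions for illustration, not the shared code file's actual signature:

```vba
' Hypothetical sketch of the updated main Sub's shape; see the shared
' code file for the real signature and implementation.
Sub SplitBlogBook(InputFilePath As String, SplitSize As Long, _
                  EndOfPostMarker As String, _
                  GenerateContentsLinks As Boolean, _
                  ContentsLinksInsertMarker As String)
    If GenerateContentsLinks And SplitSize = 0 Then
        ' Generate and insert the contents links list only; no split.
    ElseIf GenerateContentsLinks Then
        ' Split the file and insert contents links into each split file
        ' at the position of ContentsLinksInsertMarker.
    Else
        ' Split only.
    End If
End Sub
```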

The sub-folder has the 4 Spiritual blog (ravisiyer.blogspot.com) split html files with contents internal links, as well as the TNS blog file with contents internal links but without being split. This brings this work even closer to the release version, where the output split file links can be published on the blog's download blogbook page.

The latest version code is in the file: 20230731Macro-SplitBlogBook-Code.txt, https://drive.google.com/file/d/1VF8TMBf-MUuO_Lf0AROACwSOaQO3PXYe/view?usp=drive_link

1st Aug 2023 Update End

30 Jul 2023  & 31 Jul 2023 Update Start

Added generation and insertion of list of contents internal links (like TOC but without page numbers) in split html files. Run data is shared in the sub-folder: 20230730Run,  https://drive.google.com/drive/folders/1f4PafwweHDb5a1R_uYSDkaYHaFrgWBeM?usp=drive_link .

The sub-folder has the 4 Spiritual blog (ravisiyer.blogspot.com) split html files with contents internal links. This brings this work very close to the release version, where the output split file links can be published on the blog's download blogbook page.

The latest version code is in the file: 20230730v2Macro-SplitBlogBook-Code.txt, https://drive.google.com/file/d/1Ni8KDrRhGgtC2YSYBzZ9mxuHe82mhQvz/view?usp=drive_link . It has the current version as well as an older version of the main TOC function, GenerateContentsLinks (older version: OldGenerateContentsLinks). In the older version I was parsing the HTML h1 tags myself. Later I came to know how to use the HTMLDocument and associated classes/objects from a Microsoft library in VBA. That was a much better solution, and so the current version uses HTMLDocument and related classes/objects.

Another important sub added is GenerateAndInsertContentsLinksIntoFile, which invokes the above function and then inserts the TOC into the split file.
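A minimal sketch of the HTMLDocument-based approach is given below. This is illustrative code, not the post's actual GenerateContentsLinks: late binding via CreateObject("htmlfile") is used here so no library reference is needed, and the assumption that each h1 carries an id to link to is mine:

```vba
' Illustrative sketch: collect h1 headings from an HTML string and build
' a list of internal links (a TOC without page numbers) from them.
Function SketchContentsLinks(sHtml As String) As String
    Dim oDoc As Object, oH1 As Object, sLinks As String
    Set oDoc = CreateObject("htmlfile")   ' late-bound HTMLDocument
    oDoc.body.innerHTML = sHtml
    For Each oH1 In oDoc.getElementsByTagName("h1")
        If oH1.ID <> "" Then              ' assumes each h1 has an id
            sLinks = sLinks & "<a href=""#" & oH1.ID & """>" & _
                     oH1.innerText & "</a><br/>" & vbCrLf
        End If
    Next oH1
    SketchContentsLinks = sLinks
End Function
```

Compared with hand-parsing h1 tags as in the older function version, the HTMLDocument object handles attribute quoting, whitespace and nested tags for free, which is why it is the much better solution.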

30 Jul 2023  & 31 Jul 2023 Update End

29 Jul 2023 section continues ...

The latest version macro code is copy-pasted into 20230729Macro-SplitBlogBook-Code.txt, https://drive.google.com/file/d/1m3ucTxpPlO9AjBlzA7uDfFIbq4gLJTWw/view?usp=drive_link . Note that I ran the macro in the Microsoft Word 2007 program and document. This code reads all the data of the file in one function call (ReadAll of the TextStream class/object) and works on my 4 GB RAM PC with the 32 MB UTF-16 LE encoding file having around 16 million UNICODE characters mentioned above. But I don't know at what size beyond around 16 million UNICODE characters it will stop working. See the Notes section towards the bottom of this post to read about an earlier code version of SplitBlogBook which read and wrote chunks of data in a loop, but which had some limitations.
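As a rough sketch of the ReadAll approach (illustrative, not the post's exact code), the whole UTF-16 LE file can be read in one call like this:

```vba
' Illustrative sketch: read an entire UTF-16 LE file in one ReadAll call.
Function SketchReadAllUnicodeFile(sFilePath As String) As String
    Dim oFSO As Object, oTS As Object
    Set oFSO = CreateObject("Scripting.FileSystemObject")
    ' 1 = ForReading; -1 = TristateTrue, i.e. open as Unicode (UTF-16 LE)
    Set oTS = oFSO.OpenTextFile(sFilePath, 1, False, -1)
    SketchReadAllUnicodeFile = oTS.ReadAll
    oTS.Close
End Function
```

The returned String then holds all the file's characters in memory, so splitting becomes simple string arithmetic rather than chunked I/O.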

The Notes section also talks about the possibility of using csplit or similar programs to do the same task as SplitBlogBook.

The key subs of the VBA code are listed below; the code also has some support subs/functions which are not listed here:

  1. SplitBlogBook - the main function/sub.
  2. DriverDefaultSplitBlogBook - invokes SplitBlogBook with default input data
  3. DriverPromptInputSplitBlogBook - invokes SplitBlogBook with prompt dialogs for user to choose input data
  4. DriverTestSplitBlogBook - invokes SplitBlogBook with test input data

For testing, I used the single-file full blog book files produced by the BlogExportFileToBook VBA program/macro for two of my Blogger blogs. Details are given below. [Google Drive Preview renders the output HTML file as text with HTML tags shown and so is not useful. Download the output HTML files to your PC using the download button at the top right of the preview page and then open them in a browser.]

Note that if the output files are converted to UTF-8, their size will roughly halve and they will still display Devanagari characters correctly in browsers (like Chrome and Edge).
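For such a conversion, one option from within VBA itself is ADODB.Stream. This is an illustrative sketch only, not part of the SplitBlogBook code; note that SaveToFile written this way includes a UTF-8 BOM at the start of the file:

```vba
' Illustrative sketch: write a VBA (UTF-16) string out as a UTF-8 file.
Sub SketchSaveAsUtf8(sText As String, sOutPath As String)
    Dim oStream As Object
    Set oStream = CreateObject("ADODB.Stream")
    oStream.Type = 2                  ' adTypeText
    oStream.Charset = "UTF-8"
    oStream.Open
    oStream.WriteText sText
    oStream.SaveToFile sOutPath, 2    ' adSaveCreateOverWrite
    oStream.Close
End Sub
```

The size roughly halves because the HTML markup and English text are ASCII, which UTF-8 stores in one byte per character versus two in UTF-16; Devanagari characters take three bytes in UTF-8 but form a small part of the files.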

Run 1) Input file: 20230623-ravisiyer-blog-BlogBook.html, 32783 KB (in UTF-16 LE) having 16784687 UNICODE characters.

Sub invoked: DriverDefaultSplitBlogBook (uses the default SplitSize of 5242880 UNICODE UTF-16 characters, i.e. 5 M characters, which takes roughly 10 MB (megabytes) on disk)

SplitSize specified is: 5242880 characters

Output files (4): 

Note: To view output files, download them from Google Drive (GD) share link to PC and open that in browser.

  1. 20230623-ravisiyer-blog-BlogBook-Split1.html, 10,245 KB having 5245114 UNICODE characters, GD share link:  https://drive.google.com/file/d/1qLCaAEQZ_zpjLpd0i_jcswk_W0oXTCm1/view?usp=drive_link
  2. 20230623-ravisiyer-blog-BlogBook-Split2.html, 10,244 KB having 5244746 UNICODE characters, GD share link:  https://drive.google.com/file/d/14Gpw8QpJIjguIgwauMgvf97KKS06K_i4/view?usp=drive_link
  3. 20230623-ravisiyer-blog-BlogBook-Split3.html, 10,969 KB having 5615744 UNICODE characters, GD share link:  https://drive.google.com/file/d/1iqOI227uQaBlhHIDeoIv6RljxN6udk8T/view?usp=drive_link
  4. 20230623-ravisiyer-blog-BlogBook-Split4.html, 1,327 KB having 679266 UNICODE characters, GD share link:  https://drive.google.com/file/d/1RDcVpGZD4UVmM4uia8zWZUGrOULjHxze/view?usp=drive_link

A previous run produced split files similar to the above and of roughly similar sizes (they were slightly smaller, by around 250 KB each). I noted the time Chrome took to fully load the previous run's files. For Split1 to Split3, the time was roughly between 10 and 30 seconds. Split4 got fully loaded in less than 6 seconds. Details of both runs are provided in Run1Info-DebugPrintLog.txt, GD share link: https://drive.google.com/file/d/1Be8D8vxNBrYxeEOy_N1R0V3VM9V37Hnk/view?usp=drive_link.

So after this split into 4 files, the Spiritual blogbook parts are fully loaded quite quickly by Chrome on my 4 GB RAM Windows 10 PC. That is a good workaround for the problem of Chrome (or Edge) not being able to fully load the full blogbook file of 32 MB (with around 16 million UNICODE characters).

Run 2) Input file: 20230523-tns-blog-BlogBook.html, 943 KB (in UTF-16 LE) having 482543 UNICODE characters.

Sub invoked: DriverPromptInputSplitBlogBook 

This run was a test case run, as the input file is a small one (less than 1 MB) and is quite quickly loaded in Chrome. I wanted to know if this SplitBlogBook program/macro can be used in a more general way than it was originally designed for. Specifically, if I wanted smaller chunks of, say, 100 K characters, could the program deliver? The program seems to deliver!

SplitSize specified is: 100000 characters (around 100 K UNICODE UTF-16 characters, which take roughly 200 KB (kilobytes) of disk storage)

Output files (5): 

  1. 20230523-tns-blog-BlogBook-Split1.html, 219 KB having 111925 UNICODE characters
  2. 20230523-tns-blog-BlogBook-Split2.html, 223 KB having 113940 UNICODE characters
  3. 20230523-tns-blog-BlogBook-Split3.html, 218 KB having 111343 UNICODE characters
  4. 20230523-tns-blog-BlogBook-Split4.html, 225 KB having 114892 UNICODE characters
  5. 20230523-tns-blog-BlogBook-Split5.html, 60 KB having 30687 UNICODE characters

Details of the run are provided in Run2Info-DebugPrintLog.txt, GD share link:  https://drive.google.com/file/d/1KPTeHKkOK4YA46so3sGZqtvJ2pVaXgmQ/view?usp=drive_link .

Run 3) This is a more extreme generalized-usage as well as boundary-conditions test. The primary intent here was to use a small input file with a limited number of dummy "posts" and a very small SplitSize value, with a clear idea of the expected output split files, both in the number of files and in the content of each file. For this purpose, an input file of around 10 dummy post entries, each with an end-of-post marker and each 100 characters long (except the first, which is shorter), was created - test.html, GD share link: https://drive.google.com/file/d/1Q3iDSqUVtwSC3yzL_vc8ZyZyHPVQxD3L/view?usp=drive_link . This test with a SplitSize of 50 (UNICODE characters) was successful in creating 11 output files with the expected data - the 11th having only the ending HTML tags with an empty body.

The next step was to explore using this program to split XML files having post (and page and comment) entries, like the Blogger backup/export file. The end-of-entry marker, "</entry>", is different from the end-of-post marker in the above HTML blog book case, "End of Post==========================<br/><br/>". So I made the end-of-post marker an argument to the main function. Now one can call the main macro SplitBlogBook with a .xml input file and "</entry>" as the end-of-post marker. I did not have time for extensive tests but just wanted to know if the program would work for a small test xml file which I created - test.xml, GD share link: https://drive.google.com/file/d/1Zv-XMtheZVdVNt-1r7wXR9h0blQfYhiw/view?usp=drive_link . This test.xml has 10 entries of 100 characters each, enclosed in <entry> and </entry> tags, with the whole list of entries enclosed in <feed> and </feed> tags. For this small test, the program worked as expected, producing 11 output files when SplitSize was specified as 50.
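The marker-based split logic can be sketched roughly as follows (illustrative only, not the post's exact code; the writing of HTML head/tail tags to each split file is omitted here):

```vba
' Illustrative sketch: after ReadAll, cut the data at the first
' end-of-entry/end-of-post marker at or after each SplitSize boundary.
Sub SketchSplitOnMarker(sData As String, lSplitSize As Long, sMarker As String)
    Dim lStart As Long, lPos As Long, nFile As Long, sChunk As String
    lStart = 1
    Do While lStart <= Len(sData)
        nFile = nFile + 1
        lPos = InStr(lStart + lSplitSize - 1, sData, sMarker)
        If lPos = 0 Then
            sChunk = Mid(sData, lStart)            ' final chunk
            lStart = Len(sData) + 1
        Else
            lPos = lPos + Len(sMarker)             ' keep the marker
            sChunk = Mid(sData, lStart, lPos - lStart)
            lStart = lPos
        End If
        Debug.Print "Split " & nFile & ": " & Len(sChunk) & " characters"
        ' ... write sChunk to output split file number nFile ...
    Loop
End Sub
```

Because the marker is a parameter, the same loop works for the HTML blog book case, the XML "</entry>" case, and the plain text case described above.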

I then felt I should try it out with the more general text file case. I did a similar test with a small test.txt file, GD share link: https://drive.google.com/file/d/1xTK33dvVcxYa5lsb23EepRer2X0-wTm-/view?usp=drive_link . The SplitBlogBook program worked as expected, producing 10 split text files for SplitSize = 50.

Sub invoked: DriverTestSplitBlogBook 

Details of the run are provided in Run3Info-DebugPrintLog.txt, GD share link: https://drive.google.com/file/d/19F4v0E4VGqag-6QXQvly0uOhq_CenWCS/view?usp=drive_link .

Notes
1. The first versions of the main Sub/macro code would read an argument-specified amount of data (SplitSize) from the input file and process it, including writing it to the current output split file. It would then read additional amounts of data, of a size specified by another argument (SplitOverflowReadSize), in a loop until the end-of-post marker was encountered, writing all such read data up to the marker to the current output split file. On encountering the marker, after writing the required data to the output file, the code would loop back to the initial SplitSize-reading part: it would write to the next split file the data after the marker that had already been read in the previous SplitOverflowReadSize-related read, and then read the next SplitSize amount of data from the input file.

But that code had limitations. 20230728v1Macro-SplitBlogBook-Code.txt, https://drive.google.com/file/d/174LcdDav4eu5GOscXad3sARWck14FtKQ/view?usp=drive_link , documents those limitations as comments in the code, and I have reproduced those comments below:

'Rough solution that works acceptably, as far as I could make out from my tests, when SplitSize is large like 5 MB,
' and SplitOverflowReadSize is at least around 1 percent of SplitSize like 50 KB in this case.
' It is not precise as it does not take care of conditions like end marker being split across reads. If the sizes
' are big like mentioned above it will typically miss out only on a few posts whose end marker is split across reads.
' So for my Spiritual blog (around 16 MB UNICODE characters), I get 4 split files with first three being only slightly
' more than 5 MB. So this rough solution satisfies my key needs. However, it would be great if I could take care
' of such conditions without too much effort. So I plan to spend some limited time on exploring solutions that take
' care of such conditions too.
-----

The alternative was to use the ReadAll function of the TextStream object to read all the data in one go. Then the above-mentioned problem of the end marker being split across reads would not arise. The question was whether ReadAll would be able to read all of the 16 million odd UNICODE characters (32 MB disk file size) in one go on my 4 GB RAM PC. It was a pleasant surprise for me to see that it was able to do so. The above-mentioned code (text) file has the initial test code I wrote to check out ReadAll for this data of 16 million odd characters. Subsequently, my later code versions used ReadAll.

2. I did consider the possibility of using csplit or similar programs to do the same task as SplitBlogBook, somewhere down the line after I had started coding SplitBlogBook. I could not find a simple standalone csplit program that I could download and use on my Windows PC. It was available as part of a set of utilities, but I did not want to download and install the whole set at that time. I also looked for a csplit-like website and tried out one, but it did not seem to meet my needs. I plan to explore the csplit possibility in more depth later on. As of now, especially with the ReadAll function working for my biggest full blog book (16 M characters with 32 MB file size), I find coding SplitBlogBook to be quite easy and so want to continue with this approach. However, at some later point I want to explore searching through the input data for some pattern and selecting posts based on that pattern. That's when csplit, which supports Regular Expression search, may be very useful, if I can figure out how to use it for my needs. Unix spec for csplit: https://pubs.opengroup.org/onlinepubs/7908799/xcu/csplit.html .
