SplitBlogBook: VBA code to Generate Contents Internal Links List for HTML blog book and/or Split it into smaller HTML files
Last updated on 9 Sep. 2023
9 Sep. 2023 Update: I rarely use the software covered in this post now. To see the current software for creating Blogger blogbooks that I use, please visit: Short User Guide to creating Blogger Blogbooks from Backup/Export File using ExportFileFilterAndGenBook and another VBA projects' macros/code (free and open source), https://ravisiyermisc.blogspot.com/2023/09/short-user-guide-to-creating-blogger.html .
end-Update 9 Sep. 2023
Note: At a convenient time, I intend to rename the SplitBlogBook program/macro to something like GenContentLinksListAndOrSplitBlogBook as that will better describe its current functionality.
The VBA program BlogExportFileToBook described in my post: BlogExportFileToBook: VBA code to produce HTML blog book from Blogger blog backup/export XML file, https://ravisiyermisc.blogspot.com/2023/07/BlogExportFileToBook-vba-code.html , produces a single HTML blog book for a blog having all its posts, pages and comments. In some cases, the produced blog book is way too large (e.g. 32 MB UTF-16 LE encoding file having around 16 million UNICODE characters) for Chrome or Edge browsers to load fully on my 4 GB RAM Windows 10 PC. Then I need to split this large HTML blog book file into smaller pieces, say of around 5 MB size each. That need resulted in the SplitBlogBook program described in this post.
Here's the folder share for this macro having the associated files: SplitBlogBook, https://drive.google.com/drive/folders/1C6bhnwyTlOJp34BFxekWrdk5-YkKyyy5?usp=drive_link . To save space, large input file(s) have not been put in the Google Drive share directory.
1st Aug 2023 Update Start
Main changes in this version:
- Added the parameters GenerateContentsLinks (Boolean) and the associated ContentsLinksInsertMarker to the main Sub, SplitBlogBook.
- Modified the prompt dialog to allow the user to choose Generate Contents Links either with split or without split (SplitSize 0).
- Improved message text in the Debug.Print log as well as in the user interface MsgBox(es).
- Cleaned up code including deletion of commented out old function/sub versions.
The latest version code is in the file: 20230731Macro-SplitBlogBook-Code.txt. https://drive.google.com/file/d/1VF8TMBf-MUuO_Lf0AROACwSOaQO3PXYe/view?usp=drive_link
1st Aug 2023 Update End
30 Jul 2023 & 31 Jul 2023 Update Start
Added generation and insertion of list of contents internal links (like TOC but without page numbers) in split html files. Run data is shared in the sub-folder: 20230730Run, https://drive.google.com/drive/folders/1f4PafwweHDb5a1R_uYSDkaYHaFrgWBeM?usp=drive_link .
The sub-folder has the 4 split HTML files of the Spiritual blog (ravisiyer.blogspot.com) with Contents internal links. This brings the work very close to a release version, where links to the output split files can be published on the blog's blogbook download page.
The latest version code is in the file: 20230730v2Macro-SplitBlogBook-Code.txt, https://drive.google.com/file/d/1Ni8KDrRhGgtC2YSYBzZ9mxuHe82mhQvz/view?usp=drive_link . It has the current version as well as an older version of the main TOC function: GenerateContentsLinks (older version: OldGenerateContentsLinks). In the older version I parsed the HTML h1 tags myself. Later I learned how to use HTMLDocument and associated classes/objects of a Microsoft library from VBA. That was a much better solution, so the current version uses HTMLDocument and related classes/objects.
Another important sub added is GenerateAndInsertContentsLinksIntoFile, which invokes the above function and then inserts the TOC into the split file.
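To illustrate the idea of a contents internal links list built from h1 headings, here is a minimal sketch in Python (not the VBA macro code, which uses HTMLDocument). It assumes each post title is an h1 element carrying an id attribute; that assumption, the marker text "<!--ContentsLinks-->" and the names H1Collector, generate_contents_links and insert_after_marker are hypothetical and are not taken from the macro.

```python
from html import escape
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collects (id, text) for every <h1> element that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.headings = []        # list of (id, heading text) in document order
        self._current_id = None   # id of the h1 being read, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            # Assumption: headings without an id cannot be link targets, so skip them.
            self._current_id = dict(attrs).get("id")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._current_id is not None:
            title = "".join(self._text_parts).strip()
            self.headings.append((self._current_id, title))
            self._current_id = None


def generate_contents_links(html_text):
    """Return an HTML list of internal links, one per h1 heading with an id."""
    parser = H1Collector()
    parser.feed(html_text)
    parser.close()
    items = [
        f'<li><a href="#{escape(hid, quote=True)}">{escape(title)}</a></li>'
        for hid, title in parser.headings
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


def insert_after_marker(html_text, marker, links_html):
    """Insert links_html right after the first occurrence of marker (hypothetical behaviour)."""
    pos = html_text.find(marker)
    if pos < 0:
        return html_text  # marker not found: leave the text unchanged
    cut = pos + len(marker)
    return html_text[:cut] + links_html + html_text[cut:]
```

Using a parser rather than hand-written string searches mirrors the move described above from the older self-parsing version to the HTMLDocument-based version.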
30 Jul 2023 & 31 Jul 2023 Update End
29 Jul 2023 section continues ...
The latest version macro code is copy-pasted into 20230729Macro-SplitBlogBook-Code.txt, https://drive.google.com/file/d/1m3ucTxpPlO9AjBlzA7uDfFIbq4gLJTWw/view?usp=drive_link . Note that I used the macro in a Microsoft Word 2007 document. This code reads all the data of the file in one function call (ReadAll of TextStream class/object) and works on my 4 GB RAM PC with the 32 MB UTF-16 LE encoding file having around 16 million UNICODE characters mentioned above. But I have not tested larger files, so I don't know at what size beyond around 16 million UNICODE characters it will stop working. See the Notes section towards the bottom of this post to read about an earlier code version of SplitBlogBook which read and wrote chunks of data in a loop, but which has some limitations.
The Notes section also discusses the possibility of using csplit or similar programs to do the same task as SplitBlogBook.
The key subs of the VBA code are listed below, along with some support subs/functions (not listed below):
- SplitBlogBook - the main function/sub.
- DriverDefaultSplitBlogBook - invokes SplitBlogBook with default input data
- DriverPromptInputSplitBlogBook - invokes SplitBlogBook with prompt dialogs for user to choose input data
- DriverTestSplitBlogBook - invokes SplitBlogBook with test input data
For testing, I used the single file full blog book files produced by the BlogExportFileToBook VBA program/macro for two of my Blogger blogs. Details are given below. [Google Drive Preview renders the output HTML file as text with HTML tags shown and so is not useful. Download the output HTML files to your PC using the download button at the top right of the preview page and then open them in a browser.]
Note that if the output files are converted to UTF-8, their size will reduce by roughly half (as the content is mostly ASCII characters, which take 1 byte each in UTF-8 against 2 bytes in UTF-16), and they will still display Devanagari characters correctly in browsers (like Chrome and Edge).
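As a rough illustration of the UTF-8 conversion point, here is a minimal Python sketch (not part of the VBA macro). It assumes the input file starts with a UTF-16 byte order mark; the function names utf16_to_utf8 and encoded_sizes are hypothetical. If the HTML declares a charset, that declaration would also have to match the new encoding.

```python
def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file (with BOM, an assumption) as UTF-8."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


def encoded_sizes(text):
    """Return (UTF-16 LE byte count, UTF-8 byte count) for text, without any BOM."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-8"))


# ASCII markup and English text halve in size ...
print(encoded_sizes("<p>Hello</p>"))  # (24, 12)

# ... but Devanagari characters take 3 bytes each in UTF-8 against 2 in UTF-16.
namaste = "\u0928\u092e\u0938\u094d\u0924\u0947"  # the word "namaste" in Devanagari
print(encoded_sizes(namaste))  # (12, 18)
```

So the overall saving depends on how much of a file is ASCII; a file that is mostly HTML markup and English text would shrink to close to half its UTF-16 size.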
Run(s)
Run1) Input file: 20230623-ravisiyer-blog-BlogBook.html, 32783 KB (in UTF-16 LE) having 16784687 UNICODE characters.
Sub invoked: DriverDefaultSplitBlogBook (uses the default SplitSize of 5242880 UNICODE UTF-16 characters, i.e. 5 * 1024 * 1024 characters, which takes roughly 10 MB (megabytes) on disk)
SplitSize specified is: 5242880 characters
Output files (4):
Note: To view the output files, download them from the Google Drive (GD) share links to your PC and open them in a browser.
- 20230623-ravisiyer-blog-BlogBook-Split1.html, 10,245 KB having 5245114 UNICODE characters, GD share link: https://drive.google.com/file/d/1qLCaAEQZ_zpjLpd0i_jcswk_W0oXTCm1/view?usp=drive_link
- 20230623-ravisiyer-blog-BlogBook-Split2.html, 10,244 KB having 5244746 UNICODE characters, GD share link: https://drive.google.com/file/d/14Gpw8QpJIjguIgwauMgvf97KKS06K_i4/view?usp=drive_link
- 20230623-ravisiyer-blog-BlogBook-Split3.html, 10,969 KB having 5615744 UNICODE characters, GD share link: https://drive.google.com/file/d/1iqOI227uQaBlhHIDeoIv6RljxN6udk8T/view?usp=drive_link
- 20230623-ravisiyer-blog-BlogBook-Split4.html, 1,327 KB having 679266 UNICODE characters, GD share link: https://drive.google.com/file/d/1RDcVpGZD4UVmM4uia8zWZUGrOULjHxze/view?usp=drive_link
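The character counts and KB sizes listed above can be cross-checked with simple arithmetic: 2 bytes per UTF-16 character, divided by 1024. Here is a small Python sketch of that check. The 2-byte byte order mark and the rounding up to whole KB are assumptions about how the files were written and how their sizes were reported.

```python
import math

BYTES_PER_CHAR = 2   # each UTF-16 character (code unit) takes 2 bytes
BOM_BYTES = 2        # assumption: each file starts with a UTF-16 byte order mark


def utf16_size_kb(char_count):
    """Estimated file size in KB (1 KB = 1024 bytes), rounded up to a whole KB."""
    return math.ceil((char_count * BYTES_PER_CHAR + BOM_BYTES) / 1024)


# (character count, listed size in KB) for the four Run1 output files
run1_outputs = {
    "Split1": (5245114, 10245),
    "Split2": (5244746, 10244),
    "Split3": (5615744, 10969),
    "Split4": (679266, 1327),
}

for name, (chars, listed_kb) in run1_outputs.items():
    print(name, utf16_size_kb(chars), listed_kb)

# The default SplitSize is 5 * 1024 * 1024 characters, so about 10 MB per split file.
print(5 * 1024 * 1024)  # 5242880
```

Under these assumptions the estimate matches the listed size for each of the four files.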
A previous run produced split files similar to the above, of roughly the same sizes (they were smaller by around 250 KB each). I noted the time Chrome took to fully load the previous run's files. For Split1 to Split3, the time was roughly between 10 seconds and 30 seconds. Split4 got fully loaded in less than 6 seconds. Details of both runs are provided in Run1Info-DebugPrintLog.txt, GD share link: https://drive.google.com/file/d/1Be8D8vxNBrYxeEOy_N1R0V3VM9V37Hnk/view?usp=drive_link .
So after this split into 4 files, the Spiritual blogbook parts load fully and quite quickly in Chrome on my 4 GB RAM Windows 10 PC. That is a good workaround for the problem of Chrome (or Edge) not being able to fully load the full blogbook file of 32 MB (with around 16 million UNICODE characters).
Run2) Input file: 20230523-tns-blog-BlogBook.html, 943 KB (in UTF-16 LE) having 482543 UNICODE characters.
Sub invoked: DriverPromptInputSplitBlogBook
This run was a test case run as the input file is a small one (less than 1 MB) and is quite quickly loaded in Chrome. I wanted to know if this SplitBlogBook program/macro can be used in a more general way than what it was originally designed for. Specifically, if I wanted smaller chunks of say 100 K characters, could the program deliver? The program seems to deliver!
SplitSize specified is: 100000 characters (around 100 K UNICODE UTF-16 characters which roughly takes 200 KB (kilobytes) in disk storage)
Output files (5):
- 20230523-tns-blog-BlogBook-Split1.html, 219 KB having 111925 UNICODE characters
- 20230523-tns-blog-BlogBook-Split2.html, 223 KB having 113940 UNICODE characters
- 20230523-tns-blog-BlogBook-Split3.html, 218 KB having 111343 UNICODE characters
- 20230523-tns-blog-BlogBook-Split4.html, 225 KB having 114892 UNICODE characters
- 20230523-tns-blog-BlogBook-Split5.html, 60 KB having 30687 UNICODE characters
Details of the run are provided in Run2Info-DebugPrintLog.txt, GD share link: https://drive.google.com/file/d/1KPTeHKkOK4YA46so3sGZqtvJ2pVaXgmQ/view?usp=drive_link .
Run3) This is a more extreme test of generalized usage as well as of boundary conditions. The primary intent here was to use a small input file with a limited number of dummy "posts" and a very small SplitSize, with a clear idea of the expected output split files, both in number of files and in content of each file. For this purpose, I created an input file of around 10 dummy post entries, each with an end of post marker and each being 100 characters long except the first, which is shorter: test.html, GD share link: https://drive.google.com/file/d/1Q3iDSqUVtwSC3yzL_vc8ZyZyHPVQxD3L/view?usp=drive_link . This test, with SplitSize of 50 (UNICODE characters), successfully created 11 output files with the expected data, the 11th having the end HTML tags alone with an empty body.
Next step was to explore this program being used to split XML files having post (and page and comment) entries like the Blogger backup/export file. The end of entry marker, "</entry>", is different from end of post marker in above HTML blog book case, "End of Post==========================<br/><br/>". So I made end of post marker an argument to the main function. Now one could call the main macro SplitBlogBook with a .xml input file and "</entry>" as the end of post marker. I did not have time for extensive tests but just wanted to know if the program would work for a small test xml file which I created - test.xml, GD share link: https://drive.google.com/file/d/1Zv-XMtheZVdVNt-1r7wXR9h0blQfYhiw/view?usp=drive_link . This test.xml has 10 entries of 100 characters each enclosed with <entry> and </entry> tags, and the whole list of entries being enclosed in <feed> and </feed> tags. For this small test, the program worked as expected, producing 11 output files when SplitSize was specified as 50.
I then felt I should try it out with the more general text file case. I did a similar test with a small test.txt file, GD share link: https://drive.google.com/file/d/1xTK33dvVcxYa5lsb23EepRer2X0-wTm-/view?usp=drive_link . SplitBlogBook program worked as expected producing 10 split text files for SplitSize = 50.
Sub invoked: DriverTestSplitBlogBook
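The splitting rule exercised by these boundary tests can be sketched as follows in Python (not the VBA macro code). It rests on an assumption inferred from the run data, not confirmed from the macro: each piece ends at the first end marker that finishes at or after SplitSize characters from the start of the piece, and whatever is left after the last marker forms a final piece. The function name split_at_marker, the text marker "==END==" and the test data are hypothetical. The data is only shaped like the description above and is not the content of the actual test files. The sketch does raw splitting only and adds no HTML start or end tags to the pieces.

```python
def split_at_marker(text, split_size, end_marker):
    """Split text into pieces that each end on an end_marker boundary.

    A piece is closed at the first end_marker that finishes at or after
    split_size characters from the piece start. split_size of 0 means no split.
    """
    if split_size <= 0:
        return [text]
    pieces = []
    start = 0
    while start < len(text):
        # Start searching slightly early so a marker that ends exactly at
        # split_size characters is still found.
        search_from = max(start, start + split_size - len(end_marker))
        pos = text.find(end_marker, search_from)
        if pos < 0:
            pieces.append(text[start:])  # no more markers: the rest is the last piece
            break
        end = pos + len(end_marker)
        pieces.append(text[start:end])
        start = end
    return pieces


# Hypothetical XML-like data: 10 entries of 100 characters each inside <feed> tags.
entry = "<entry>" + "x" * 85 + "</entry>"
xml_text = "<feed>" + entry * 10 + "</feed>"
xml_pieces = split_at_marker(xml_text, 50, "</entry>")
print(len(xml_pieces))  # 11 (the last piece holds only "</feed>")

# Hypothetical plain-text data: 10 records of 100 characters, nothing after the last marker.
record = "y" * 93 + "==END=="
txt_pieces = split_at_marker(record * 10, 50, "==END==")
print(len(txt_pieces))  # 10
```

Passing the end marker as an argument is what lets the same routine handle the HTML, XML and plain-text cases.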