User:Ivan Shmakov/NAAF-based Netnews Archival/Narchive
narchive [-v|--verbose] [-j|--bzip2|-z|--gzip] [-o|--output=FILE]... [-l|--log=FILE] [--max-group-articles=N] [--max-articles=N] [--max-groups=N] [--max-output-size=N] [--] [LISTFILE]...
Single newsgroup case
Perhaps the simplest use case for Narchive is retrieving articles from a single newsgroup. Virtually any use of Narchive requires a group list, which for this case could be as simple as:
Assuming that the group list above is available from a file named listfile, the following command will fetch at most 16 articles from there to a NAAF file named fetched.naaf:

$ narchive -o fetched.naaf --max-articles=16 \
      -- listfile
I: news.aioe.org: Connected to server
I: comp.lang.perl.misc: Selected group (28834 to 29374)
I: comp.lang.perl.misc: Fetched 16 articles
nntp://news.aioe.org/comp.lang.perl.misc 28839-28854
I: news.aioe.org: Fetched 16 articles
$
The output produced by the command above contains informative messages (marked with I:) and the new group list, which now includes 28839–28854 as the range of articles already downloaded (and saved to fetched.naaf). The new group list can be saved either by redirecting the command’s standard output to a file (using the shell’s > ⟨new file⟩ syntax), or by using the --log= option, like:
$ narchive -o fetched.naaf.1 --max-articles=16 \
      --log=listfile.1 \
      -- listfile
I: news.aioe.org: Connected to server
I: comp.lang.perl.misc: Selected group (28834 to 29374)
I: comp.lang.perl.misc: Fetched 16 articles
I: news.aioe.org: Fetched 16 articles
$
The listfile.1 file is now expected to contain the resulting group list, like:
Note that it is not possible to use a single file for both the original and the resulting group lists, as the file specified will generally be truncated (that is, emptied) before Narchive has had a chance to read it.
Now that we have retrieved 16 articles, we can retrieve some more by feeding the new group list to Narchive. Note that we also use a new name for the output NAAF file (-o), as the previous one would otherwise be overwritten.
$ narchive -o fetched.naaf.2 --max-articles=16 \
      --log=listfile.2 \
      -- listfile.1
I: news.aioe.org: Connected to server
I: comp.lang.perl.misc: Selected group (28834 to 29374)
I: comp.lang.perl.misc: Fetched 16 articles
I: news.aioe.org: Fetched 16 articles
$
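The chaining of sessions shown above (each session reads the group list logged by the previous one, and writes to a fresh NAAF file) could be scripted along the following lines. This is only a sketch: the helper name session_argv is hypothetical, and the file naming merely follows the examples above. It builds the argument vectors without running anything.

```python
def session_argv(n, max_articles=16):
    """Build the narchive argument vector for the n-th session (n >= 1)."""
    # The first session reads the original list; each later one reads
    # the group list logged by the session before it.
    source = "listfile" if n == 1 else f"listfile.{n - 1}"
    return [
        "narchive",
        "-o", f"fetched.naaf.{n}",        # a fresh NAAF file per session
        f"--max-articles={max_articles}",
        f"--log=listfile.{n}",            # never the same file as the input
        "--", source,
    ]


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(" ".join(session_argv(n)))
```

Each vector could then be passed to subprocess.run(); keeping the input and --log= names distinct avoids the truncation problem noted earlier.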
listfile.2 is expected to contain the third revision of the group list, like:
Archiving a newsgroup from multiple servers
Narchive allows for a single newsgroup to be fetched from more than one server. If that’s the case, the resulting NAAF file will contain a single Article record for all the copies of any given article, as available from the servers listed, containing all variants of its header (which are expected to differ in at least the Path: field values).
The servers for any given newsgroup will be considered in the order they appear in the group list. For instance, for the following group list, the news:alt.answers group will first be looked up at nntp://news.example.org/, while news:news.misc will first be looked up at nntp://news.example.com/ (and news:rec.humor.oracle will only be looked up at nntp://usenet.example.net/).
nntp://news.example.org/alt.answers
nntp://usenet.example.net/rec.humor.oracle
nntp://news.example.com/news.misc
nntp://news.example.org/news.misc
nntp://news.example.com/alt.answers
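The ordering rule can be illustrated with a short sketch that derives, for each newsgroup, the servers in the order they would be considered. It assumes that every group list line starts with an nntp:// URI whose path is the group name (as in the examples here); the function name servers_by_group is hypothetical and is not part of Narchive.

```python
from urllib.parse import urlsplit

GROUP_LIST = """\
nntp://news.example.org/alt.answers
nntp://usenet.example.net/rec.humor.oracle
nntp://news.example.com/news.misc
nntp://news.example.org/news.misc
nntp://news.example.com/alt.answers
"""


def servers_by_group(text):
    """Map each group to its servers, in group list order."""
    order = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        url = urlsplit(fields[0])         # anything after the URI is ignored here
        group = url.path.lstrip("/")
        hosts = order.setdefault(group, [])
        if url.hostname not in hosts:
            hosts.append(url.hostname)
    return order


if __name__ == "__main__":
    for group, hosts in servers_by_group(GROUP_LIST).items():
        print(group, "->", ", ".join(hosts))
```

For the list above, the first server reported for news.misc is news.example.com, and rec.humor.oracle has usenet.example.net as its only server.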
Before the articles are retrieved, the so-called overview file (as available via the NNTP XOVER command) is obtained for the group from each of the servers listed.
For a given server, the newsgroup’s articles will be downloaded in the order they appear in the overview, skipping any articles retrieved previously in this or an earlier Narchive session. Whenever an article is fetched from a server, its Message-Id: is looked up on all the other servers.
At this point, it is possible to check that the headers of the copies of the article thus retrieved match the respective overview records (primarily the Message-Id: header fields, on which Narchive itself relies), and also that the respective bodies match octet-wise.
The article is then stored (along with all the variants of its header), and all the respective (group, article number) pairs (as per the Xref: overview data) are marked as retrieved in the working copy of the group list, so as to avoid fetching a cross-posted article more than once in the session.
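The bookkeeping described above can be sketched as follows. This is an illustration rather than Narchive’s actual code: it assumes the usual Xref: layout (a server name followed by group:number locations), keeps the retrieved article numbers in plain sets, and uses made-up article numbers. All function names are hypothetical.

```python
def parse_xref(field):
    """Return (group, number) pairs from an overview Xref: value."""
    if field.lower().startswith("xref:"):
        field = field.split(":", 1)[1]
    parts = field.split()
    # parts[0] is the server name; the rest are group:number locations.
    pairs = []
    for location in parts[1:]:
        group, _, number = location.rpartition(":")
        pairs.append((group, int(number)))
    return pairs


def mark_retrieved(retrieved, server, xref_field):
    """Record every location of a stored article as retrieved."""
    for group, number in parse_xref(xref_field):
        retrieved.setdefault((server, group), set()).add(number)


def is_retrieved(retrieved, server, group, number):
    """Tell whether the article at this location should be skipped."""
    return number in retrieved.get((server, group), set())


if __name__ == "__main__":
    retrieved = {}
    # Illustrative overview data for an article cross-posted to two groups.
    xref = "Xref: news.example.org alt.answers:120 news.misc:457"
    mark_retrieved(retrieved, "news.example.org", xref)
    print(is_retrieved(retrieved, "news.example.org", "news.misc", 457))
    print(is_retrieved(retrieved, "news.example.org", "news.misc", 458))
```

Once the article has been stored via alt.answers, the later pass over news.misc on the same server finds number 457 already marked and skips it.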
Setting the limits
There are several options to limit the amount of data archived, applying either to the archive files produced, or to the interaction with the servers.
The purpose of the --max-output-size= option is twofold: first, it allows one to impose a soft limit on the amount of data saved to the filesystem; second, it provides a way to split the archives produced into sections of suitable size.
With the --max-output-size= option given, Narchive will use the first output (-o) file given until the limit specified (in bytes) is reached, whereupon it will switch to the next output file, and so on, until the list is exhausted, or until there are no more groups to consider.
Note, however, that in order not to separate the Group record (and the associated XOVER data) from the articles retrieved for the group, Narchive will defer the switch to another file until the next group or server. Therefore, the size of the archive files produced may end up being much greater than specified.
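To see why the limit is a soft one, consider the following simplified model of the deferred switch. It is a sketch under stated assumptions, not Narchive’s implementation: each group is treated as a single chunk of a known size, switches between servers are ignored, and the sizes and file names in the example are made up. The function name assign_outputs is hypothetical.

```python
def assign_outputs(group_sizes, outputs, limit):
    """Distribute groups over output files, switching only between groups.

    group_sizes: list of (group, size in bytes), in archival order.
    Returns (layout, sizes): the groups and the total bytes per output.
    """
    layout = {name: [] for name in outputs}
    sizes = {name: 0 for name in outputs}
    current = 0
    for group, size in group_sizes:
        # The limit is only checked at a group boundary, so a file may
        # already be well over it by the time the switch happens.
        if sizes[outputs[current]] >= limit:
            current += 1
            if current == len(outputs):
                break                     # output list exhausted
        layout[outputs[current]].append(group)
        sizes[outputs[current]] += size
    return layout, sizes


if __name__ == "__main__":
    groups = [("alt.answers", 700), ("news.misc", 600),
              ("rec.humor.oracle", 200)]
    layout, sizes = assign_outputs(groups, ["out.1", "out.2"], limit=1000)
    print(layout)
    print(sizes)
```

With a limit of 1000, the first file ends up holding 1300 bytes: the second group was started while the file was still under the limit, and the switch to out.2 happened only when the third group began.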
The --max-articles= and --max-group-articles= options set the overall and per-group limits, respectively, on the number of articles to fetch within the session.
The latter option may be used together with --max-output-size= to “force” Narchive to follow the output (section) size limit more closely. However, please note that irrespective of either of these limits, the XOVER data is always saved in full for the groups archived.
Once either of the overall limits (--max-output-size=, multiplied by the number of outputs, or --max-articles=) is reached, Narchive will cease to archive groups, and will simply output all the pending group list entries.
Note that the pending entries are still parsed, merged (if there is more than one for a given (server, group) pair), and serialized anew. This can be used to merge several group lists into one without contacting any of the servers listed, like:
$ narchive -o /dev/full --max-articles=0 \
      --log=listfile.new \
      -- listfile.1 listfile.2 listfile.3
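What merging entries for the same (server, group) pair amounts to can be sketched as a union of article ranges. The code below is an assumption about the general idea, not a description of Narchive’s internals: ranges are modelled as (first, last) tuples, adjacent ranges are assumed to be coalesced, and the second range in the example is hypothetical. The function names are hypothetical as well.

```python
def merge_ranges(ranges):
    """Union of inclusive (first, last) ranges, coalescing neighbours."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def merge_lists(*lists):
    """Merge group lists given as {(server, group): [ranges]} mappings."""
    combined = {}
    for entries in lists:
        for key, ranges in entries.items():
            combined.setdefault(key, []).extend(ranges)
    return {key: merge_ranges(ranges) for key, ranges in combined.items()}


if __name__ == "__main__":
    key = ("news.aioe.org", "comp.lang.perl.misc")
    first_list = {key: [(28839, 28854)]}
    second_list = {key: [(28855, 28870)]}   # hypothetical later session
    print(merge_lists(first_list, second_list))
```

Here the two adjacent ranges collapse into the single range 28839 to 28870 for the one (server, group) pair.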
Narchive is written by Ivan Shmakov.
Narchive is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This documentation is a free collaborative project going on at Wikibooks, and is available under the Creative Commons Attribution/Share-Alike License (CC BY-SA) version 3.0.