User:Ivan Shmakov/NAAF-based Netnews Archival/Narchive

Narchive connects to NNTP servers and fetches the netnews articles specified by the user. The article contents, as well as some of the NNTP metadata, are stored in one or more NAAF files.

Synopsis

$ narchive [-v | --verbose] [-j | --bzip2 | -z | --gzip]
      [-l | --log=FILE]     [-o | --output=FILE]...
      [--max-articles=N]    [--max-group-articles=N]
      [--max-groups=N]
      [--max-output-size=N]
      [--] [LISTFILE]...

Basic usage

Single newsgroup case

Perhaps the simplest use case for Narchive is retrieving articles from a single newsgroup. Virtually any use of Narchive requires a group list, which for this case could be as simple as:

nntp://news.aioe.org/comp.lang.perl.misc

Here, we specified that we’re interested in the news:comp.lang.perl.misc newsgroup, as available from the Aioe NNTP server, nntp://news.aioe.org/.

Assuming that the group list above is available from a file named listfile, the following command will fetch at most 16 articles from that group into a NAAF file named fetched.naaf:

$ narchive -o fetched.naaf --max-articles=16 \
      -- listfile 
I: news.aioe.org: Connected to server
I: comp.lang.perl.misc: Selected group (28834 to 29374)
I: comp.lang.perl.misc: Fetched 16 articles
nntp://news.aioe.org/comp.lang.perl.misc 28839-28854
I: news.aioe.org: Fetched 16 articles
$ 

The output produced by the command above contains informative messages (marked with I:) and the new group list, which now includes 28839–28854 as the range of articles already downloaded (and saved to fetched.naaf). The new group list can be saved either by redirecting the command’s standard output to a file (using the shell’s >⟨new file⟩ syntax), or by using the --log= option, like:

$ narchive -o fetched.naaf.1 --max-articles=16 \
      --log=listfile.1 \
      -- listfile 
I: news.aioe.org: Connected to server
I: comp.lang.perl.misc: Selected group (28834 to 29374)
I: comp.lang.perl.misc: Fetched 16 articles
I: news.aioe.org: Fetched 16 articles
$ 

The listfile.1 file is now expected to contain the resulting group list, like:

nntp://news.aioe.org/comp.lang.perl.misc 28839-28854
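
For readers who want to post-process group lists with their own scripts, an entry of the shape shown above can be split into its parts roughly as follows. This is only a sketch, not Narchive’s own parser: the function name parse_entry is hypothetical, and the handling of several range tokens or of a single bare article number is an assumption, since only the single LOW-HIGH form appears in this document.

```python
from urllib.parse import urlsplit


def parse_entry(line):
    """Split one group list line into (server, group, ranges)."""
    url, *tokens = line.split()
    parts = urlsplit(url)
    if parts.scheme != "nntp":
        raise ValueError(f"not an nntp: URL: {url!r}")
    group = parts.path.lstrip("/")
    ranges = []
    for token in tokens:
        # "LOW-HIGH" is the form shown in this document; treating a bare
        # number as a one-article range is an assumption of this sketch.
        low, _, high = token.partition("-")
        ranges.append((int(low), int(high or low)))
    return parts.hostname, group, ranges


print(parse_entry("nntp://news.aioe.org/comp.lang.perl.misc 28839-28854"))
```

An entry without a range, such as the one in the original listfile, simply yields an empty list of ranges.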

Note that it is not possible to use a single file for both the original and resulting group lists, as the file specified will generally be truncated (that is, emptied) before Narchive has a chance to read it.

Now that we have retrieved 16 articles, we can retrieve some more by feeding the new group list to Narchive. Note that we also use a new name for the output NAAF file (-o), as the previous one would otherwise be overwritten.

$ narchive -o fetched.naaf.2 --max-articles=16 \
      --log=listfile.2 \
      -- listfile.1 
I: news.aioe.org: Connected to server
I: comp.lang.perl.misc: Selected group (28834 to 29374)
I: comp.lang.perl.misc: Fetched 16 articles
I: news.aioe.org: Fetched 16 articles
$ 

Now listfile.2 is expected to contain the third revision of the group list, like:

nntp://news.aioe.org/comp.lang.perl.misc 28839-28870

Archiving a newsgroup from multiple servers

Narchive allows for a single newsgroup to be fetched from more than one server. If that’s the case, the resulting NAAF file will contain a single Article record for all the copies of any given article available from the servers listed. That record contains all variants of the article’s header (which are expected to differ in at least the Xref: and Path: field values).

The servers for any given newsgroup will be considered in the order they appear in the group list. For instance, for the following group list, the news:alt.answers group will first be looked up at nntp://news.example.org/, while news:news.misc will first be looked up at nntp://news.example.com/ (and news:rec.humor.oracle will only be looked up at nntp://usenet.example.net/).

nntp://news.example.org/alt.answers
nntp://usenet.example.net/rec.humor.oracle
nntp://news.example.com/news.misc
nntp://news.example.org/news.misc
nntp://news.example.com/alt.answers
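
The ordering rule can be illustrated with a short sketch that groups the entries above by newsgroup while preserving the order in which the servers appear. The function name servers_by_group is hypothetical, and the code only models the rule as described here; it is not taken from Narchive itself.

```python
from urllib.parse import urlsplit

GROUP_LIST = """\
nntp://news.example.org/alt.answers
nntp://usenet.example.net/rec.humor.oracle
nntp://news.example.com/news.misc
nntp://news.example.org/news.misc
nntp://news.example.com/alt.answers
"""


def servers_by_group(text):
    """Map each newsgroup to its servers, in group list order."""
    order = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = urlsplit(line.split()[0])
        group = parts.path.lstrip("/")
        order.setdefault(group, []).append(parts.hostname)
    return order


for group, servers in servers_by_group(GROUP_LIST).items():
    print(group, "->", ", ".join(servers))
```

The first server printed for each group is the one that would be consulted first under this rule.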

Before the articles are retrieved, the so-called overview file (as available via the NNTP XOVER command) is obtained for the group from each of the servers listed.

For a given server, the newsgroup’s articles will be downloaded in the order they appear in the overview, skipping any articles retrieved previously in this or an earlier Narchive session. Whenever an article is fetched from a server, its Message-Id: is looked up on all the other servers.

It’s possible at this point to check if the headers for the copies of the article thus retrieved match the respective overview records (primarily the Xref: and Message-Id: header fields, on which Narchive itself relies), and also that the respective bodies match octet-wise.

The article is then stored (along with all the variants of its header), and all the respective (group, article number) pairs (as per the Xref: overview data) are marked as retrieved in the working copy of the group list, so as to avoid fetching a cross-posted article more than once in the session.
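
The bookkeeping for cross-posted articles can be sketched as follows. The Xref: field value consists of a server name followed by group:number locations; everything else here (the function names, the in-memory set standing in for the working copy of the group list, and the article numbers) is hypothetical and chosen for illustration only.

```python
def xref_locations(xref_value):
    """Parse an Xref field value into (group, article number) pairs."""
    value = xref_value.strip()
    if value.lower().startswith("xref:"):
        value = value[len("xref:"):]
    # The first token is the server name; the rest are group:number pairs.
    _server, *locations = value.split()
    pairs = []
    for location in locations:
        group, _, number = location.rpartition(":")
        pairs.append((group, int(number)))
    return pairs


# Stand-in for the working copy of the group list (an assumption).
retrieved = set()


def mark_retrieved(xref_value):
    """Record every location named by the Xref data as retrieved."""
    retrieved.update(xref_locations(xref_value))


def already_retrieved(group, number):
    return (group, number) in retrieved


# Hypothetical article cross-posted to two groups.
mark_retrieved("Xref: news.example.org alt.answers:1201 news.misc:874")
print(already_retrieved("news.misc", 874))
print(already_retrieved("news.misc", 875))
```

With such a record in place, reaching news.misc article 874 later in the session would not trigger a second download.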

Setting the limits

There are several options to limit the amount of data archived, applying either to the archive files produced, or to the interaction with the servers.

The purpose of the --max-output-size= option is twofold: first, it allows one to impose a soft limit on the amount of data saved to the filesystem; second, it provides a way to split the archives produced into sections of suitable size.

With the --max-output-size= option given, Narchive will use the first --output= (or -o) file given until the limit specified (in bytes) is reached, whereupon it will switch to the next output file, and so on, until the list is exhausted, or until there are no more groups to consider.

Note, however, that in order not to separate the Group record (and the associated XOVER data) from the articles retrieved for the group, Narchive will defer the switch to another file until the next group or server. Therefore, the size of the archive files produced may end up being much greater than specified.
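
To see why the files may overshoot the limit, consider a toy model of the switching rule. It is a simplification under stated assumptions, not Narchive’s actual logic: the size check happens only between groups, the sizes of Group records and XOVER data are ignored, and the function name, file names and article sizes are all hypothetical.

```python
def assign_outputs(groups, outputs, max_output_size):
    """Return {output: bytes written}, switching files only between groups."""
    written = {}
    remaining = list(outputs)
    current = remaining.pop(0)
    written[current] = 0
    for _name, article_sizes in groups:
        # The switch is deferred to a group boundary, so the check is
        # made before a group starts, never in the middle of one.
        if written[current] >= max_output_size:
            if not remaining:
                break  # the list of outputs is exhausted
            current = remaining.pop(0)
            written[current] = 0
        written[current] += sum(article_sizes)
    return written


groups = [
    ("alt.answers", [400, 700]),  # hypothetical article sizes, in bytes
    ("news.misc", [300, 300, 300]),
    ("rec.humor.oracle", [500]),
]
print(assign_outputs(groups, ["a.naaf", "b.naaf"], max_output_size=1000))
```

In this model both files end up larger than the 1000-byte limit, because each was still below the limit when its last group began.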

The --max-articles= and --max-group-articles= options set the overall and per-group limits on the number of articles to fetch within the session.

The latter option may be used together with --max-output-size= to “force” Narchive to follow the output (section) size limit more closely. However, please note that irrespective of either of these limits, the XOVER data is always saved in full for the groups archived.

Once either of the overall limits (--max-output-size=, multiplied by the number of outputs, or --max-articles=) is reached, Narchive will cease to archive groups, and will simply output all the pending group list entries.

Note that the pending entries are still parsed, merged (if there is more than one for a given (server, group) pair), and serialized anew. This could be used to merge several group lists into one without contacting any of the servers listed, like:

$ narchive -o /dev/full --max-articles=0 \
      --log=listfile.new \
      -- listfile.1 listfile.2 listfile.3 
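
The effect of such a merge can be approximated with the sketch below, which combines the entries for each URL and coalesces overlapping or adjacent article ranges. It only illustrates the idea: the function names are hypothetical, the input lines are made up for the example, and writing several ranges as space-separated LOW-HIGH tokens is an assumption about the serialization.

```python
def coalesce(ranges):
    """Merge overlapping or adjacent (low, high) article ranges."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


def merge_lists(lines):
    """Merge group list lines that share the same server and group."""
    entries = {}
    for line in lines:
        if not line.strip():
            continue
        url, *tokens = line.split()
        ranges = entries.setdefault(url, [])
        for token in tokens:
            low, _, high = token.partition("-")
            ranges.append((int(low), int(high or low)))
    return [
        " ".join([url] + [f"{low}-{high}" for low, high in coalesce(ranges)])
        for url, ranges in entries.items()
    ]


# Hypothetical entries, as if taken from two separate group lists.
lines = [
    "nntp://news.aioe.org/comp.lang.perl.misc 28839-28854",
    "nntp://news.aioe.org/comp.lang.perl.misc 28855-28870",
]
print("\n".join(merge_lists(lines)))
# nntp://news.aioe.org/comp.lang.perl.misc 28839-28870
```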

Author

Narchive is written by Ivan Shmakov.

Narchive is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This documentation is a free collaborative project going on at Wikibooks, and is available under the Creative Commons Attribution/Share-Alike License (CC BY-SA) version 3.0.