Visual Basic for Applications/Pseudo Random Repeated Substrings
Jump to navigation Jump to search
This page describes some matters that apply to the Rnd() function of VBA. In particular it illustrates that repeated substrings can result when the Randomize() function is wrongly positioned inside the same loop as Rnd(), instead of before it.
The VBA Rnd() Function
- The Rnd() function is pseudo random, as opposed to true random . True randomness is found rarely, one notable example being the sequence of data that can be derived from white noise. White noise, like the radio noise from the sun or perhaps the unintentional noise in a radio or other electronic device, has a fairly uniform distribution of frequencies, and can be exploited to produce random distributions of data; also known as linear probability distributions because their frequency distributions are straight lines parallel to the horizontal axis.
- Pseudo randomness can be obtained with a feedback algorithm, where a sequence of output values of a function is fed back and assists in the making of the next part of an output stream. These are referred to as pseudo random number generators (PRNG). Such a process, although complex, is nonetheless determinate, being based entirely on its starting value. Such generators, depending on their design can produce long sequences of values, all unique, before the entire stream eventually repeats itself.
- A PRNG output stream will always repeat itself if a long enough set of values is generated. The Rnd function in VBA can generate a sequence of up to 16,777,216 numbers before any one number is repeated, at which time the entire sequence itself is repeated. This is adequate in most cases. The Rnd() function has been described by Microsoft as belonging to the set of PRNGs called Linear Congruential Generators (LCG), though it is unclear as to whether or not the algorithm has since been modified.
- The Rnd function is not suitable for large tables or for cryptographic use, and VBA itself is quite insecure in its own right. For given starting values the generator always will produce the same sequence. Clearly, if any part of the stream is known, this allows other values in the sequence to be predicted, and this state of affairs is insecure for cryptographic use. Perhaps surprisingly, modeling methods that make much use of random values need even longer unique sequences than that produced by Rnd().
- The exact coding of the Microsoft Rnd() function is not available, and their descriptive material for it is quite sketchy. A recent attempt by me to implement the assumed algorithm in VBA code failed because of overflow, so those who intend to study such generators in VBA need to use another algorithm. Perhaps study of the Wichmann-Hill (1982) CLCG algorithm, that can be implemented in VBA would be a better choice. A VBA implementation, (by others), of the Wichmann-Hill (1982) algorithm is provided in another page of this series, along with some simpler generator examples.
Worst Case for Rnd() Substrings?
- A well designed PRNG stream consists of unique numbers, but this applies only to the designer's unfiltered set of numbers in the range from zero to unity, [0,1]. As soon as we start to take some values from the stream, and ignore others, say to make custom outputs, the new steams will take on different characteristics. The combination of cherry-picking elements from the natural sequence and the mapping of a large set to a very small set takes its toll. When observing the new set, the count of characters to the cycle repeat point shortens, and the number of repeated substrings increases throughout the set.
- The code listing below allows checking of a Rnd() stream for substrings, using preset filter settings, eg; capitals, lower case, integers, etc., and in addition, includes a similar generator based on a hash for those who would like to compare it.
- The repeat substring procedure is quite slow, depending as it does on the location of repeats. The worst case is for no repeats found where the number of cycles becomes maximum at (0.5*n)^2, the square of half the number of characters in the test string. Of course the smallest number of cycles is just one when a simple string is repeated, eg; abcabc. Clearly, an increase of string length by a factor of ten could increase the run time by a factor of one hundred. (1000 characters in about 2 seconds, 2000 in 4secs, 10000 in 200secs, is, so far, the best timing!).
- Coding layout can affect the length of repeated substrings too. The reader might compare the effect of placing the Randomize function outside the random number loop, then inside the loop, while making an output of only capitals. (See code in MakeLongRndStr). In the past the repeat strings have worsened considerably when placed within. The code as listed here to test Rnd() with 1000 samples of capitals, no use of DoEvents, and Randomize wrongly placed inside the loop, will return a repeat substring of up to four hundred characters for this author. Increasing the code lines in the loop, affecting the time (?) for each iteration of the loop to run, also affects the length of any substrings.
Option Explicit Sub TestRndForRepeats() 'run this to make a pseudo random string 'and to test it for longest repeated substring Dim strRnd As String, sOut As String Dim nOut As Long, nLen As Long strRnd = MakeLongRndStr(1000) MsgBox strRnd,, "Long string..." sOut = LongestRepeatSubstring(strRnd, nOut) MsgBox "Repeated substring found is : " & _ vbCrLf & sOut & vbCrLf & _ "Number of these found : " & nOut & vbCrLf & _ "Length of each : " & Len(sOut),, "Repeat substring..." End Sub Sub TestHashForRepeats() 'run this to make a long hash-based output 'and to test it for longest repeated substring Dim sOut As String, sHash As String, nOut As Long sHash = LongHash("String to hash", 1000) MsgBox "The following sha256-based hash has " & _ Len(sHash) & " characters." & _ vbCrLf & vbCrLf & sHash,, "Long hash..." sOut = LongestRepeatSubstring(sHash, nOut) MsgBox "Repeated substring found is : " & _ vbCrLf & sOut & vbCrLf & _ "Number of these found : " & nOut & vbCrLf & _ "Length of each : " & Len(sOut),, "Repeat substring..." End Sub Function MakeLongRndStr(nNumChr As Long) As String 'Makes a long capital letters string using rnd VBA function Dim n As Long, sChr As String, nAsc As Long Dim nSamp As Long, sRec As String '======================================================================== ' Notes and Conclusions: ' The VBA function rnd is UNSUITED to generation of long random strings. ' Both length and number of repeats increases rapidly near 256 charas. ' Reasonable results can be obtained by keeping below 128 characters. ' For longer strings, consider hash-based methods of generation. '======================================================================== 'Randomize 'right place Do Until n >= nNumChr 'DoEvents Randomize 'wrong place nSamp = Int((122 - 48 + 1) * Rnd + 48) 'range includes all charas sChr = Chr(nSamp) 'cherry-picks 10, 26, 36, 52, or 62 from a set of 75 Select Case nSamp 'chara filter Case 65 To 90 'upper case letters sRec = sRec & sChr Case 48 To 57 'integers 'sRec = sRec & sChr Case 97 To 122 'lower case letters 'sRec = sRec & sChr Case Else 'disregard End Select n = Len(sRec) Loop 'MsgBox sAccum MakeLongRndStr = Left$(sRec, nNumChr) End Function Function LongHash(sIn As String, nReq As Long, Optional sSeed As String = "") As String 'makes a long sha256 hash - length specified by user 'Parameters: sIn; the string to hash 'nReq; the length of output needed 'sSeed; optional added string modifier Dim n As Long, m As Long, c As Long, nAsc As Integer, sChr As String Dim sF As String, sHash As String, sRec As String, sAccum As String Do Until m >= nReq DoEvents n = n + 1 'increment 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 'you set your own cycle increment here sF = sIn & sSeed & sAccum & (7 * n * m / 3) 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 'get a single hash of sF sHash = HashSHA256(sF) 'filter output for chara type For c = 1 To Len(sHash) sChr = Mid$(sHash, c, 1) nAsc = Asc(sChr) 'cherry-picks 10, 26, 36 ,52, or 62 from a set of 64 Select Case nAsc 'chara filter Case 65 To 90 'upper case letters sRec = sRec & sChr Case 48 To 57 'integers 'sRec = sRec & sChr Case 97 To 122 'lower case letters 'sRec = sRec & sChr Case Else 'disregard End Select Next c 'accumulate sAccum = sAccum & sRec m = Len(sAccum) sRec = "" 'delete line at your peril! Loop LongHash = Left$(sAccum, nReq) End Function Function HashSHA256(sIn As String) As String 'Set a reference to mscorlib 4.0 64-bit 'HASHES sIn string using SHA2-256 algorithm 'NOTE 'total 88 output text charas of base 64 'Standard empty string input gives : 47DEQpj8HBSa+/...etc, Dim oT As Object, oSHA256 As Object Dim TextToHash() As Byte, bytes() As Byte Set oT = CreateObject("System.Text.UTF8Encoding") Set oSHA256 = CreateObject("System.Security.Cryptography.SHA256Managed") TextToHash = oT.GetBytes_4(sIn) bytes = oSHA256.ComputeHash_2((TextToHash)) HashSHA256 = ConvB64(bytes) Set oT = Nothing Set oSHA256 = Nothing End Function Function ConvB64(vIn As Variant) As Variant 'used to produce a base-64 output 'Set a reference to mscorlib 4.0 64-bit Dim oD As Object Set oD = CreateObject("MSXML2.DOMDocument") With oD .LoadXML "<root />" .DocumentElement.DataType = "bin.base64" .DocumentElement.nodeTypedValue = vIn End With ConvB64 = Replace(oD.DocumentElement.Text, vbLf, "") Set oD = Nothing End Function Function LongestRepeatSubstring(sIn As String, Optional nSS As Long) As String 'finds longest repeated non-overlapping substring and number of repeats 'greatest number cycles = (0.5*n)^2 for when "none found", eg; abcdef (9) 'shortest number cycles = 1 for one simple duplicated string; eg abcabc Dim s1 As String, s2 As String, X As Long Dim sPrev As String, nPrev As Long, nLPrev As Long Dim nL As Long, nTrial As Long, nPos As Long, vAr As Variant nL = Len(sIn) For nTrial = Int(nL / 2) To 1 Step -1 DoEvents For nPos = 1 To (nL - (2 * nTrial) + 1) X = 0 s1 = Mid(sIn, nPos, nTrial) s2 = Right(sIn, (nL - nPos - nTrial + 1)) vAr = Split(s2, s1) X = UBound(vAr) - LBound(vAr) If X > 0 Then If nPrev < X Then sPrev = s1 nPrev = X End If End If Next nPos If nPrev <> 0 Then LongestRepeatSubstring = sPrev nSS = nPrev Exit Function End If Next nTrial End Function