Streaming regex parsing with version 0.3.6 #1071

chrisduerr · 2023-08-26T03:10:55Z

chrisduerr
Aug 26, 2023

I've just tried upgrading regex-automata from 0.1 to 0.3 and have run into two major issues for my streaming regex parser.

The first issue is somewhat minor, but I like the new start_state_reverse and start_state_forward methods, over a universal start state. However they do require an Input. Is it fine to just pass &[] as input for a streaming engine? It seems like it doesn't make a difference at least in my case.

The second issue leaves me puzzled as to how the API is supposed to work. Previously I'd switch states using next_state_unchecked, then check if the state is a match or dead state. However this doesn't work anymore.

The following log should explain things:

INPUT REGEX: "Ala.*123"
REVERSE: false, ANCHORED: false, EARLIEST: true
START STATE: StateID(384)
PROCESSED 'A' (NEW STATE: StateID(864))
PROCESSED 'l' (NEW STATE: StateID(896))
PROCESSED 'a' (NEW STATE: StateID(928))
PROCESSED 'c' (NEW STATE: StateID(928))
PROCESSED 'r' (NEW STATE: StateID(928))
PROCESSED 'i' (NEW STATE: StateID(928))
PROCESSED 't' (NEW STATE: StateID(928))
PROCESSED 't' (NEW STATE: StateID(928))
PROCESSED 'y' (NEW STATE: StateID(928))
PROCESSED '1' (NEW STATE: StateID(960))
PROCESSED '2' (NEW STATE: StateID(1280))
PROCESSED '3' (NEW STATE: StateID(1312))
PROCESSED ' ' (NEW STATE: StateID(64))
MATCH STATE (StateID(64))

I'm using the simplest form of DFA here (forward, not anchored, return earliest match). I'd expect this to return as soon as 3 is hit and it did so previously. But it does not transition into the match state (64) until a whitespace is fed in.

Now if earliest was false, I'd understand this, since it needs to check if it might be Alacritty123123 for example. But with earliest set I'd expect it to return the… earliest match. Surely that's how this is supposed to work and this is equivalent to longest_match(false) in 0.1?

Answered by BurntSushi

BurntSushi · 2023-08-26T03:20:20Z

BurntSushi
Aug 26, 2023
Maintainer

The first issue is somewhat minor, but I like the new start_state_reverse and start_state_forward methods, over a universal start state. However they do require an Input. Is it fine to just pass &[] as input for a streaming engine? It seems like it doesn't make a difference at least in my case.

I believe this is answered in the discussion for this PR: #1031. The TL;DR is that the discussion should give you a path forward with the current API, but it's non-obvious. It should be possible to compute the start state without providing a full Input. Instead, there will be a new type that mostly mirrors Input, but instead of accepting a full &[u8] haystack, it will accept a single (optional) look-behind byte.

As for your other question, it's not particularly clear what's going on because I don't see a haystack and I don't see any code. My best guess is that perhaps you aren't using the end-of-input (EOI) transition? One change from regex-automata 0.1 is that matches are now delayed by one byte to account for look-around assertions (such as $ and \b). Those weren't supported in regex-automata 0.1. See: https://docs.rs/regex-automata/latest/regex_automata/dfa/trait.Automaton.html#tymethod.next_eoi_state

chrisduerr · 2023-08-26T03:44:31Z

chrisduerr
Aug 26, 2023
Author

As for your other question, it's not particularly clear what's going on because I don't see a haystack and I don't see any code. My best guess is that perhaps you aren't using the end-of-input (EOI) transition? One change from regex-automata 0.1 is that matches are now delayed by one byte to account for look-around assertions (such as $ and \b). Those weren't supported in regex-automata 0.1. See: https://docs.rs/regex-automata/latest/regex_automata/dfa/trait.Automaton.html#tymethod.next_eoi_state

Is there an option to get back the old behavior if I don't care about this?

That said, I don't see why I'd need to process another byte, since I haven't reached the end of my haystack (and with streaming searches I won't ever reach it in a lot of cases). Considering the input, it should always be clear that the end state is reached, even without processing an additional byte, right?

I'd assume with the current version of regex-automata, this just means that I need to always cut off one character from my matches because it needs to "overshoot" every match to handle the thing I don't actually care about?

1 reply

BurntSushi Aug 26, 2023
Maintainer

Is there an option to get back the old behavior if I don't care about this?

Nope. It's not feasible to provide this.

I don't see why I'd need to process another byte

The match delay is built into the automaton at determinization time. It's not optional. Basically, instead of marking a DFA state as a match state when it contains an NFA match state, it delays that to the next DFA state.

since I haven't reached the end of my haystack (and with streaming searches I won't ever reach it in a lot of cases)

If it's a stream, then you just have to wait until one more byte is processed until it can be known that a match exists.

Considering the input, it should always be clear that the end state is reached, even without processing an additional byte, right?

You can see that, but that only works when you take the specific regex you're using into account. Since your regex has no look-around assertions, you can see that a match is detectable at offset i, but the DFA itself still needs to wait until i+1.

It's not feasible to provide an option to tweak this. If I did, then you'd have to write different search routines depending on whether the original regex contains specific types of look-around assertions in the suffix of the regex. It's theoretically possible, but there is absolutely no way I'm maintaining something like that.

I'd assume with the current version of regex-automata, this just means that I need to always cut off one character from my matches because it needs to "overshoot" every match to handle the thing I don't actually care about?

One byte, but yes. That's how the search routines in regex-automata that use DFAs work. You can see an example in the docs for Automaton::is_special_state. Notice that when a match state is reported, the offset returned is i. If the match weren't delayed, that would have to be i+1 instead.
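
To make the one-byte delay concrete, here is a minimal sketch using a toy, hand-written automaton for the literal `ab`. It does not use regex-automata at all; the `State` enum, `next` and `find_match_end` are hypothetical names invented for illustration. The only point is the shape of the loop: the match state is entered one transition after the last byte of the match, so the reported end offset is `i` rather than `i + 1`, and a final end-of-input transition is needed to catch a match that ends exactly at the end of the haystack.

```rust
/// States of a toy, hand-written DFA for the literal `ab` (unanchored search).
/// This is an illustration only, not the regex-automata representation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Start,
    SawA,
    /// `ab` has been consumed, but the state is not flagged as a match yet.
    Pending,
    /// Flagged match state, entered one transition after `Pending`.
    Match,
}

/// Transition function. `None` stands in for the end-of-input transition.
fn next(state: State, byte: Option<u8>) -> State {
    match (state, byte) {
        // The delayed match: any further transition (byte or EOI) flags it.
        (State::Pending, _) | (State::Match, _) => State::Match,
        // End of input without a pending match changes nothing.
        (_, None) => state,
        (State::SawA, Some(b'b')) => State::Pending,
        (_, Some(b'a')) => State::SawA,
        (_, Some(_)) => State::Start,
    }
}

/// Returns the end offset (exclusive) of the earliest match of `ab`.
fn find_match_end(haystack: &[u8]) -> Option<usize> {
    let mut state = State::Start;
    for (i, &byte) in haystack.iter().enumerate() {
        state = next(state, Some(byte));
        if state == State::Match {
            // The match ended before the byte at `i`, so the end is `i`,
            // not `i + 1`.
            return Some(i);
        }
    }
    // Without this step, a match at the very end would never be reported.
    if next(state, None) == State::Match {
        return Some(haystack.len());
    }
    None
}
```

With this toy automaton, `find_match_end(b"xxab yy")` only sees the match state after the space at index 4 has been fed in and reports 4, while `find_match_end(b"xxab")` relies entirely on the end-of-input transition. A streaming caller would keep `state` between chunks and apply the end-of-input step only when the stream actually ends.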

chrisduerr · 2023-08-26T04:16:40Z

chrisduerr
Aug 26, 2023
Author

One byte, but yes. That's how the search routines in regex-automata that use DFAs work.

Thanks for pointing that out, my current solution works by always parsing one character more but that might have some sneaky error cases my tests don't cover.

I've gotten this to pass all my tests by handling EOI and cutting off the last char (for now), but I noticed that by constructing Input for an empty haystack, the start_state_reverse is identical to start_state_forward. I was hoping that I could just use a single DFA, but I had to use the thompson::Config::reverse(true) at build time to get the reverse search working. Is this expected?

9 replies

chrisduerr Aug 26, 2023
Author

regex-automata 0.1 only had a DFA::start_state method because it is the same for both forward and reverse DFAs.

Ah I think I got it now. So you need to both construct a reverse DFA and call the reverse start method. In my case since I will never feed in any lookahead bytes and always start at the beginning of the input I suppose the universal start should work just fine, but I still need to construct both normal and reverse DFAs.

BurntSushi Aug 26, 2023
Maintainer

No. It's just how the start state is computed. A reverse DFA is completely different from a forward DFA. All of the concatenations are reversed (and the look-around assertions are inverted).

If you don't have any look-around assertions, then I believe it's correct to say that start_state_forward and start_state_reverse will return the same results for both forward and reverse DFAs. Without look-around assertions, it should be true that all starting configurations collapse into a single DFA state and so it doesn't matter what you do at that point. But if you have look-around assertions, then a forward search starting at i (inclusive) has to read i-1 for the look-behind byte, whereas a reverse search starting at i (exclusive) has to read i for the look-behind byte. See:

regex/regex-automata/src/util/start.rs

Lines 76 to 94 in 81e328a

    
    pub(crate) fn fwd(&self, input: &Input) -> Start {
        match input
            .start()
            .checked_sub(1)
            .and_then(|i| input.haystack().get(i))
        {
            None => Start::Text,
            Some(&byte) => self.get(byte),
        }
    }

    /// Return the reverse starting configuration for the given `input`.
    #[cfg_attr(feature = "perf-inline", inline(always))]
    pub(crate) fn rev(&self, input: &Input) -> Start {
        match input.haystack().get(input.end()) {
            None => Start::Text,
            Some(&byte) => self.get(byte),
        }
    }

I think the thing I'm confused by here is you saying that you needed a second DFA in regex-automata 0.3, and I took that to mean that you didn't need a second DFA in regex-automata 0.1. That should not be true. If you need a second DFA in 0.3, then you should have also needed one in 0.1. And if you didn't need one in 0.1, then you shouldn't need one in 0.3 either.
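
The look-behind rule described above can be restated as a standalone sketch that depends on nothing from regex-automata. The function names `forward_look_behind` and `reverse_look_behind` are hypothetical; they only mirror the index arithmetic of the quoted `fwd` and `rev` methods, with `None` playing the role of `Start::Text`.

```rust
/// Look-behind byte for a forward search starting at `start` (inclusive):
/// the byte just before the search window, if there is one.
fn forward_look_behind(haystack: &[u8], start: usize) -> Option<u8> {
    start
        .checked_sub(1)
        .and_then(|i| haystack.get(i))
        .copied()
}

/// Look-behind byte for a reverse search whose window ends at `end`
/// (exclusive): the byte just past the search window, if there is one.
fn reverse_look_behind(haystack: &[u8], end: usize) -> Option<u8> {
    haystack.get(end).copied()
}
```

For an empty haystack both functions return `None`, which corresponds to the `Start::Text` configuration in both directions. Whether that is an acceptable stand-in for a streaming search depends on the regex containing no look-around assertions, as discussed above.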

BurntSushi Aug 26, 2023
Maintainer

Ah I think I got it now. So you need to both construct a reverse DFA and call the reverse start method. In my case since I will never feed in any lookahead bytes and always start at the beginning of the input I suppose the universal start should work just fine, but I still need to construct both normal and reverse DFAs.

Plausibly. But if your search is anchored, you know that if a match occurs, its starting position must be at the point where you began the search. In which case, you don't need a second DFA. See for example how a DFA regex handles this:

regex/regex-automata/src/dfa/regex.rs

Lines 494 to 504 in 81e328a

    
    // We can also skip the reverse search if we know our search was
    // anchored. This occurs either when the input config is anchored or
    // when we know the regex itself is anchored. In this case, we know the
    // start of the match, if one is found, must be the start of the
    // search.
    if self.is_anchored(input) {
        return Ok(Some(Match::new(
            end.pattern(),
            input.start()..end.offset(),
        )));
    }
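
The shortcut in the snippet above can be sketched in isolation. This is a hedged illustration and not the regex-automata implementation: `resolve_span` and its `reverse_scan` parameter are hypothetical, with the closure standing in for a scan with the reverse DFA that returns the start offset of the match.

```rust
use std::ops::Range;

/// Resolve the full match span once a forward scan has found `match_end`.
///
/// For an anchored search the match must begin at `search_start`, so the
/// reverse scan is never invoked. Otherwise `reverse_scan` is called with
/// the end offset and must return the start offset of the match.
fn resolve_span(
    anchored: bool,
    search_start: usize,
    match_end: usize,
    reverse_scan: impl FnOnce(usize) -> usize,
) -> Range<usize> {
    if anchored {
        search_start..match_end
    } else {
        reverse_scan(match_end)..match_end
    }
}
```

An unanchored streaming search that needs match start offsets therefore still has to pay for the second, reverse automaton, whereas an anchored one can skip it.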

chrisduerr Aug 26, 2023
Author

I think the thing I'm confused by here is you saying that you needed a second DFA in regex-automata 0.3, and I took that to mean that you didn't need a second DFA in regex-automata 0.1.

That is incorrect. In regex-automata 0.1 I actually had 4 different DFAs, one for each combination of forward/reverse and anchored/unanchored. I was able to get rid of the anchored/unanchored difference, but am still using forward/reverse ones.

Plausibly. But if your search is anchored, you know that if a match occurs, its starting position must be at the point where you began the search. In which case, you don't need a second DFA. See for example how a DFA regex handles this:

My initial search is always unanchored and forward/reverse is always based on search direction. The second DFA is used to determine the start.

I suppose this means that I should not be using the universal start, but instead use the forward/reverse start functions? Though for a streaming search I don't see how this would make any difference considering I always start at the start of my input.

BurntSushi Aug 26, 2023
Maintainer

You can use universal start if you know your regexes don't contain any look-around assertions.

Did you read the discussion in the PR I linked initially? I feel like that should clarify things here...
