Best Practices: Reasonable lengths in unique ids #518
Comments
I must admit this one made me laugh - no matter how many rules the spec imposes, data producers will always find new ways to create a mess. Famous last words: "36 characters ought to be enough for anyone."
I'm not certain which problem we are trying to solve here. The zip will compress large and repeated IDs nicely, and trip counts are never so large that storing the data, even in memory, becomes a problem. Also, an (arbitrary) limit could be taken as an excuse for some re-users to justify not being able to consume feeds with a few IDs larger than this limit. I've encountered re-users that cannot ingest GTFS with IDs longer than 80 or 255 characters, for example, even if only a few IDs are above that threshold. In summary, I'm rather against this; for me it is rather useless, somewhat arbitrary, and open to misinterpretation.
@laurentg in this case the feed of "last week" was 74MB compressed, and this week 890MB compressed. For compression itself to work properly, some things must be guaranteed first, for example that the data in the files is sorted. But this is not only about compression: processing and running matching still requires these idiotically long strings to be stored in memory, unless the implementation throws them overboard anyway and creates hashes. With respect to your other comment, that it is never too big to store in memory: 372,996 trips multiplied by 100 bytes is indeed "only" about 37 MB. But it could also have been just 2.2 MB. The example you give, that there exist GTFS IDs with lengths of 80 to 255 characters, already shows we need a best practice. Nobody in their right mind has more than 10^80 stops in their network.
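A rough back-of-the-envelope sketch of those figures, assuming roughly one byte per ASCII character of trip_id:

```python
# Rough memory footprint of storing every trip_id as a plain ASCII string
trips = 372_996
print(f"{trips * 100 / 1e6:.1f} MB")  # ~100-character IDs -> ~37.3 MB
print(f"{trips * 6 / 1e6:.1f} MB")    # ~6-character IDs   -> ~2.2 MB
```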
Describe the problem
Today, I was confronted with a GTFS feed that used 100+ character IDs for individual trip_ids. You may guess what the size of the feed looked like. Their old feeds used only 8 digits for each trip_id.
Use cases
Efficient resource usage.
Proposed solution
I want to propose a 36-byte soft limit as a best practice for any identifier used in GTFS. A UUID would fit, and I would say even a NeTEx ServiceJourney or ScheduledStopPoint identifier would fit as a whole. If a value exceeds 36 bytes, a warning can and should be presented, as sketched below.
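A minimal sketch of what such a soft-limit check might look like; the file name, column name, and the check_id_lengths helper are hypothetical illustrations, not taken from any existing validator:

```python
import csv

MAX_ID_BYTES = 36  # proposed soft limit; a canonical UUID string is exactly 36 characters


def check_id_lengths(path, id_column):
    """Collect identifier values whose UTF-8 encoding exceeds the soft limit."""
    warnings = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            value = row.get(id_column, "")
            if len(value.encode("utf-8")) > MAX_ID_BYTES:
                warnings.append((id_column, value))
    return warnings


# Example: flag over-long trip_ids in a feed's trips.txt
for column, value in check_id_lengths("trips.txt", "trip_id"):
    print(f"WARNING: {column} '{value}' exceeds {MAX_ID_BYTES} bytes")
```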