
Best Practices: Reasonable lengths in unique ids #518

Open
skinkie opened this issue Nov 14, 2024 · 3 comments

Comments

@skinkie
Contributor

skinkie commented Nov 14, 2024

Describe the problem

Today, I was confronted with a GTFS feed that used IDs of more than 100 characters for individual trip_ids. You can guess what that did to the size of the feed. Their old feeds used only 8 digits per trip_id.

Use cases

Efficient resource usage.

Proposed solution

I want to propose a 36-byte soft limit as a best practice for any identifier used in GTFS. A UUID would fit; I would say even a NeTEx ServiceJourney or ScheduledStopPoint identifier would fit as a whole. If a value exceeds 36 bytes, a warning can and should be presented.
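A validator could implement the proposed soft limit along these lines. This is a hypothetical sketch, not part of any existing GTFS validator; the function name, the in-memory trips.txt fragment, and the choice to measure length in UTF-8 bytes (so that the limit matches the 36-byte wording rather than a character count) are all assumptions for illustration.

```python
# Hypothetical best-practice check: warn when a GTFS identifier column
# contains values longer than a 36-byte soft limit (enough for a UUID).
import csv
import io

SOFT_LIMIT_BYTES = 36  # soft limit proposed in this issue


def check_id_lengths(rows, id_column):
    """Yield (line_number, value) for IDs exceeding the soft limit.

    Lengths are measured in UTF-8 bytes, matching the byte-based limit.
    Line numbers are 1-based file lines; line 1 is the CSV header.
    """
    for line_no, row in enumerate(rows, start=2):
        value = row.get(id_column, "")
        if len(value.encode("utf-8")) > SOFT_LIMIT_BYTES:
            yield line_no, value


# Example usage with a small in-memory trips.txt fragment:
trips_txt = "trip_id,route_id\n" + "x" * 100 + ",r1\nshort_id,r2\n"
reader = csv.DictReader(io.StringIO(trips_txt))
warnings = list(check_id_lengths(reader, "trip_id"))
# warnings now holds one entry, for the 100-character ID on line 2
```

A check like this would only warn, never reject: the thread below argues that a hard limit invites consumers to refuse otherwise valid feeds.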

@leonardehrenfried
Contributor

I must admit this one made me laugh - no matter how many rules the spec imposes, data producers will always find new ways to create a mess.

Famous last words: "36 characters ought to be enough for anyone."

@laurentg

I'm not certain which problem we are trying to solve here. Zip compresses large and repeated IDs nicely, and trip counts are never so huge that the data cannot be stored, even in memory. Also, an (arbitrary) limit could be taken as an excuse by some re-users to justify not being able to consume feeds with a few IDs larger than that limit. I've encountered re-users that cannot ingest GTFS with IDs longer than 80 or 255 characters, for example, even if only a few IDs exceed that threshold.

In summary, I'm rather against this; to me it is fairly useless, somewhat arbitrary, and open to misinterpretation.

@skinkie
Contributor Author

skinkie commented Nov 14, 2024

@laurentg in this case, the feed from "last week" was 74 MB compressed, and this week it is 890 MB compressed. For compression itself to work properly, some things must be guaranteed first, for example that the data in the files is sorted. But this is not only about compression: processing and matching still require these absurdly long strings to be stored in memory, unless the implementation throws them overboard anyhow and replaces them with hashes.

With respect to your other comment, that the data is never too big to store in memory: 372,996 trips multiplied by 100 bytes is indeed "only" 37 MB. But it could also have been just 2.2 MB.
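The arithmetic above can be sketched as follows. This counts raw ID bytes only; the real in-memory cost is higher due to per-string overhead, and the exact size of the "short" case depends on the ID length and encoding assumed (8 bytes per ID is used here, matching the 8-digit IDs mentioned earlier in the issue).

```python
# Back-of-envelope cost of trip_id values alone, ignoring per-object
# overhead (which makes long strings even more expensive in practice).
trips = 372_996  # trip count from the comment above

long_id_bytes = trips * 100  # ~100-byte IDs observed in the new feed
short_id_bytes = trips * 8   # the producer's old 8-digit IDs

print(f"100-byte IDs: {long_id_bytes / 1e6:.1f} MB")  # ≈ 37.3 MB
print(f"8-byte IDs:   {short_id_bytes / 1e6:.1f} MB")  # ≈ 3.0 MB
```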

The example you give, that there exist GTFS IDs with a length of 80 to 255 characters, already shows that we need a best practice. Nobody in their right mind has more than 10^80 stops in their network.
